Practical Biostatistics in Translational Healthcare

Allen M. Khakshooy
Rappaport Faculty of Medicine
Technion-Israel Institute of Technology
Haifa, Israel

Francesco Chiappelli
UCLA School of Dentistry
Los Angeles, CA, USA
This Springer imprint is published by the registered company Springer-Verlag GmbH, DE, part of Springer Nature.
The registered company address is: Heidelberger Platz 3, 14197 Berlin, Germany
To master and apprentice,
“Credette Cimabue ne la pittura tener lo campo, e ora ha
Giotto il grido, … ”
(Dante Alighieri, Divine Comedy, Purgatory XI 94,95)
love throughout my entire life and my siblings, Arash and Angela, whom I can always rely on for inspiration and wisdom. I thank Moses Farzan and Saman
Simino for their continued support and friendship. Lastly, I extend my deep-
est gratitude and appreciation to Margaret Moore, Rekha Udaiyar, and the
rest of the wonderful team at Springer for this opportunity and help through-
out the process.
There is little that I can add to Dr. Khakshooy’s excellent preface, except to
thank him for the kind words, most of which I may not—in truth—deserve.
This work is primarily his, and for me it has been a fulfilling delight to mentor and guide a junior colleague of as much value as Dr. Khakshooy in his initial steps of publishing.
I join him in thanking Ms. Balenton, who will soon enter the nursing profession. Her indefatigable attention to detail and dedication to the research endeavors, and her superb and untiring help and assistance in the editorial process, have proffered incalculable value to our work.
I also join in thanking most warmly Ms. Margaret Moore, Editor, Clinical
Medicine; Ms. Rekha Udaiyar, Project Coordinator; and their superb team at
Springer for their guidance, encouragement, and patience in this endeavor.
I express my gratitude to the Division of Oral Biology and Medicine of the
School of Dentistry at UCLA, where I have been given the opportunity to
develop my work in this cutting-edge area of research and practice in health-
care, and to the Department of the Health Sciences at CSUN, where both Dr.
Khakshooy and I have taught biostatistics for several years. I express my
gratitude to the Fulbright Program, of which I am a proud alumnus, having
been sent as a Fulbright Specialist to Brazil where I also taught biostatistics.
In closing, I dedicate this work, as all of my endeavors, to Olivia, who
somehow always knows how to get the best out of me, to Fredi and Aymerica,
without whom none of this would have been possible, and as in all, to only
and most humbly serve and honor.
“… la gloria di Colui che tutto move
per l’universo penetra e risplende
in una parte più e meno altrove …”
(Dante Alighieri, 1265–1321; La Divina Commedia, Paradiso, I 1-3)
List of Videos
Chapter 3
Video 1: Variables. Reprint courtesy of International Business Machines
Corporation, © International Business Machines Corporation
Chapter 4
Video 2: Frequency tables. Reprint courtesy of International Business
Machines Corporation, © International Business Machines Corporation
Video 3: Graphing. Reprint courtesy of International Business Machines
Corporation, © International Business Machines Corporation
Chapter 6
Video 4: One-sample t-test. Reprint Courtesy of International Business
Machines Corporation, © International Business Machines Corporation
Video 5: Independent sample t-test. Reprint Courtesy of International
Business Machines Corporation, © International Business Machines
Corporation
Video 6: Dependent-sample t test. Reprint Courtesy of International
Business Machines Corporation, © International Business Machines
Corporation
Video 7: ANOVA. Reprint Courtesy of International Business Machines
Corporation, © International Business Machines Corporation
Video 8: Correlation. Reprint Courtesy of International Business Machines
Corporation, © International Business Machines Corporation
Video 9: Regression. Reprint Courtesy of International Business Machines
Corporation, © International Business Machines Corporation
Chapter 7
Video 10: Wilcoxon rank-sum. Reprint courtesy of International Business
Machines Corporation, © International Business Machines Corporation
Video 11: Wilcoxon signed-rank. Reprint courtesy of International
Business Machines Corporation, © International Business Machines
Corporation
Video 12: Mann–Whitney U. Reprint courtesy of International Business
Machines Corporation, © International Business Machines Corporation
Video 13: Kruskal–Wallis H. Reprint courtesy of International Business
Machines Corporation, © International Business Machines Corporation
Contents
1.1 Core Concepts
1.2 The Scientific Method
1.2.1 The Research Process
1.2.2 Biostatistics Today
1.2.3 Self-Study: Practice Problems
[Fig. 1.2 The research process: Research Question → Study Hypothesis → Study Design → Methodology → Data Analysis → Conclusion]

…that have largely influenced today's traditional Western thought. Of his many contributions, the techniques of inductive and deductive reasoning have played a large role in our scientific method today. We will return to this dichotomy of scientific reasoning later, but it must be noted that there currently exist many more influences on the evolution of our scientific method as we know it. On the same note, the scientific method still today is impartial to influences.
[Fig. 1.3 Methodology, study design, and data analysis are the foundations of the research process]

Finally, the scientific method is a method of investigating phenomena based on observations from the world around us, in which specific principles of reasoning are used in order to test hypotheses, create knowledge, and ultimately become one step closer in obtaining the truth. We must understand that there is no universal scientific method; rather there are fundamental concepts and principles that make this method of inquiry scientific. Moreover, the scientific method is ever-changing and ever-growing, such that the method itself is under its own scrutiny.

1.2.1 The Research Process

The research process can be argued to be the same as or a synonym for the scientific method. Though skeptics in this differentiation exist, for simplicity and practicality's sake, we will distinguish the scientific method and the research process as the latter representing the actual application of the former.

The research process is a process that uses the scientific method to establish, confirm, and/or reaffirm certain pieces of knowledge supported by strong evidence or, as we may call it, proof. We use the research process to create theories, find solutions to problems, and even find problems to solutions we already have. In addition, the overarching goal of the research process is also an attempt to find some sort of truth. However abstract this may seem, we can actualize its meaning by making the goal of the research process to be the culmination of an inference consensus, or an ability to make a generalization of the whole based on its representative parts. Though the specific steps may differ based on their source, this book will take the steps of the research process as depicted in Fig. 1.2, along with a brief description of each provided in the following section.

Lastly, the conceptualization of the research process as a whole can be interpreted to be a three-legged stool (Fig. 1.3) that sits on methodology, study design, and data analysis. This metaphoric description is crucial to the understanding of the research process such that each individual leg is equally valuable and important to the function of the stool. Just as the function of a stool is for one to
sit, so too is the function of the research process: for one to gain otherwise unattainable knowledge. Hence, the integrity of the stool as a whole is placed in question should any single leg deteriorate.

1.2.1.1 Hypothesis-Driven Research

So, how does one begin the research process? The research process begins with nothing other than a question. The research question, simply put, is a question of interest to the investigator that serves as the driver of the entire process. The great value placed on this concept is an attempt to prove that the answer to this question is not only one that is interesting enough to warrant the need of a process but more importantly that the answer to it is both meaningful and useful. To take it many steps further, obtaining the answer to a research question could potentially prevent mass casualties in the world and help end world hunger.

Of course, this book is not a how-to manual on world peace. Rather, the premise that we are attempting to drive home is that not only can the successful answering of the research question be worthwhile but that we may very well not always be successful in obtaining an answer. Thus, research questions are chosen based on certain criteria easily remembered by the acronym FINER. We say that a research question must be: feasible, interesting, novel, ethical, and relevant. Though there can be a never-ending list of categories of research questions (Table 1.1), below we provide a few types of research questions that are relevant to our specific interest in this book.¹

A hypothesis, commonly referred to as an educated guess, is seen as both a starting point and guiding tool of the research process. But, was it not mentioned earlier that it is the research question that is the starting point? Indeed! Here is where the intimate relationship between the research question and the study hypothesis is made clear. The study hypothesis is nothing more than the research question stated positively (i.e., the research question is changed from question format to statement format). The disparate forms of hypotheses are further discussed in Hypothesis Testing in Chap. 5.

The study design serves as the infrastructure or the system we create that aids in answering the research question. The design of any research process is, obviously, dependent on both the peripheral and inherent details of the research question, like the specific population, disease, and therapy that is being studied.

The methodology of the research process is concerned with the process of measuring and collecting the necessary information (which we call data, discussed further in Chap. 3) regarding the specific population of interest depicted in the research question. As further elaborated in Chap. 3, because it is seemingly impossible to comprehensively study an entire population, we obtain data from a sample that is representative of the entire population and can survive this comparison.

Data analysis is the statistical techniques and reasoning tools utilized in the examination of the collected information, i.e., data. Some have regarded this section as the results of the study, in which the evidence obtained is used in hopes of proving or disproving the conjectured hypotheses.

Table 1.1 Types of research questions
Descriptive—attempts to simply describe that which is occurring or that which exists
Relational—seeks to establish, or to test the establishment of, a specific relationship or association among variables within groups
Causal—developed to establish a direct cause-and-effect relationship either by means of a comparison or by means of a prediction
PICO(TS)—describes specific criteria of research as they refer to the patient(s), the interventions, and its comparators that are under consideration for a given sought outcome, under a specified timeline and in the context of a specific clinical setting

¹ Note the acronym stands originally for population, intervention, comparator, outcome, timeline, and setting; the latter two are parenthetic such that they are not always used or available to use; in any case they can be described as PICO, PICOT, or PICOS research questions.

Lastly, the conclusion is the researcher's attempt to answer the research question relative to the results that were obtained. It is at this point that our initial concepts of inference consensus and truth determination converge. Though the
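The PICO(TS) entry in Table 1.1 lends itself to a small structured sketch. The class, field names, and example values below are our own illustration of how such a question can be assembled, not something prescribed by the text; the optional `timeline` and `setting` fields mirror the parenthetic (TS) in the acronym.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PicoQuestion:
    """A PICO(TS) research question; timeline and setting are optional,
    so the same record covers PICO, PICOT, and PICOS variants."""
    population: str
    intervention: str
    comparator: str
    outcome: str
    timeline: Optional[str] = None
    setting: Optional[str] = None

    def describe(self) -> str:
        # Render the structured fields back into a readable question.
        parts = [f"In {self.population}, does {self.intervention} "
                 f"compared with {self.comparator} improve {self.outcome}"]
        if self.timeline:
            parts.append(f"over {self.timeline}")
        if self.setting:
            parts.append(f"in {self.setting}")
        return " ".join(parts) + "?"

# Invented example values, for illustration only:
q = PicoQuestion(
    population="adults with type 2 diabetes",
    intervention="drug A",
    comparator="standard care",
    outcome="HbA1c control",
    timeline="12 months",
)
print(q.describe())
```

Leaving `setting` unset here yields a PICOT question; filling all six fields would yield the full PICOTS form.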
analysis of the data is meant to provide some sort of concrete evidence to influence the decision-making process on behalf of the postulates, it is unfortunately not that forthright. Statistical analysis allows us to put limits on our uncertainty regarding the issue at hand, but what it does not clearly allow is the absolute proof of anything. Thus, when arriving at the conclusion of a study, the results are unable to provide an absolute truth statement when all is considered. Rather, its application is more practical in disqualifying substantiated claims or premises.

Similar to the fundamental principle in the US Justice System of "innocent until proven guilty," so too exists a principle that is central to the scientific method and the research process in regard to the treatment of hypotheses within a research study. We commonly retain a basic hypothesis of the research (namely, the null hypothesis discussed in Chap. 5) such that we cannot adequately prove its absolute truth for obvious reasons. Instead, what we are capable of is proving its absolute falsehood. Subsequently, the pragmatism that is intrinsic to our conclusion is the ability to make an inference. Upon evaluation of the results, an inference is made onto the population based on the information gleaned from the sample.

A quick glance at the crude descriptions of each step of the research process shows the impact of the research question along the way. Then, after equating the research question with the study hypothesis, it can now be understood why the research process is referred to as hypothesis-driven research (Fig. 1.4). It is the study hypothesis that is the driver of all three legs of the stool (methodology, study design, and data analysis), which culminate into the making of a potential inference.

[Fig. 1.4 Hypothesis-driven research: Research Question + Study Hypothesis → Data analysis → ∴ Inference consensus]

1.2.1.2 Errors in Research

Statistics in translational healthcare pervades the scientific literature: its aim is to improve the reliability and validity of the findings from translational research. As we progress toward a more technologically advanced future with greater accessibility, it seems as though we are constantly bombarded with so-called proven research findings, medical breakthroughs, and secretive therapies on a daily basis. It even goes as far as having distinct research findings that directly contradict one another! Recently, we have witnessed a prevalence in the retraction of research papers that, just a few years earlier, were highly regarded as pivotal to modern-day science. Though the majority of retracted papers stem from ethical concerns, there are papers that have so-called "fudged" the numbers or simply have improperly handled the statistics. These mishandlings also stretch beyond just the statistics, which we categorize as errors. Naturally, the susceptibility of the research (and the researcher) to error is inevitable. The effect of errors is most felt during the determination of results, or more specifically when establishing statistical significance. Discussed in more depth in
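The idea that statistical analysis "puts limits on our uncertainty" rather than proving anything can be made concrete with a confidence interval around a sample mean. The sketch below is illustrative only: the blood-pressure readings are invented, and it uses the normal-approximation critical value 1.96 rather than the exact t distribution developed later in the book.

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% interval for the population mean.

    The interval bounds our uncertainty about the population mean;
    it does not 'prove' any single value is the truth."""
    n = len(sample)
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    return (m - z * se, m + z * se)

# Hypothetical systolic blood pressure readings (mmHg) from one sample:
readings = [118, 124, 121, 130, 115, 127, 122, 119, 125, 123]
low, high = mean_confidence_interval(readings)
print(f"95% CI for the population mean: ({low:.1f}, {high:.1f})")
```

Any value inside the interval remains plausible for the population mean; values far outside it are, in the language of the text, "disqualified" rather than disproven with certainty.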
Chap. 5, the establishment of statistical significance (or lack thereof) is an imperative and necessary step in the substantiation of our results (i.e., when moving from data analysis to conclusion). This lends a hand to the importance placed on inherent errors and introduced biases that are, unfortunately, contained in much published research today.

Just as the research process is a three-legged stool, so too is the establishment of statistical significance (Fig. 1.5). The process of obtaining statistical significance sits on three forms of error: systematic errors, errors of judgment (i.e., fallacies), and random errors. We do not have the full capability of understanding the intricacies of each error just yet, but for the moment, it is worth attempting to briefly describe each one.

Systematic errors are just as they sound—errors in the system we have chosen to use in our research. What systems are we referring to? That would be the study design. Erroneously choosing one design over another can lead to the collapse of our ultimate goal of attaining statistical significance. Luckily, systematic errors are one of the few errors that have the ability of being avoided. We can avoid systematic errors by simply selecting the best possible study design. There are many factors that lead to the appropriate selection of a study design, like the type of research question, the nature of the data we are working with, and the goal of our study, to list a few. But more importantly, the risk of running a systematic error (choosing a poor study design) is that it will always produce wrong results of the study.

The second type of error is errors of judgment, or fallacies. To elaborate, these are errors that are grounded in biases and/or false reasoning (i.e., a fallacy), in which the improper use of logic or rationale leads to errors in scientific reasoning. It can be argued that these are the most dangerous errors in research as they are subjective to the researcher(s). In Table 1.2, we provide a list of the various types of fallacies.

Table 1.2 A description of several common types of fallacies or biases that may occur in scientific reasoning related to research

Errors of judgment/fallacies
Hindsight bias ("knew-it-all-along" effect): The foretelling of results on the basis of the previously known outcomes and observations; subsequently testing a hypothesis to confirm the prediction to be correct. For example, taking it for granted that the Sun will not rise tomorrow
Recomposing-the-whole bias (fallacy of composition): The bias of inferring a certain truth about the whole, simply because it is true of its parts. For example, since atoms are not alive, then nothing made up of atoms is alive
Ecological inference bias (ecological fallacy): The act of interpreting statistical data (i.e., making statistical inferences) where deductions about the nature of individuals are made based on the groups to which they belong. For example, America is regarded as the most obese country in the world today; therefore my American cousin who I've never met must be obese!

[Fig. 1.5 The three-legged stool of statistical significance: systematic errors, errors of judgment, random errors]
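The ecological fallacy in Table 1.2 can also be shown numerically. In the invented data below, the relationship between x and y within each group is perfectly negative, yet pooling the two groups produces a strongly positive correlation, so a group-level inference about individuals would point in exactly the wrong direction.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Two invented groups: within each, y falls as x rises (r = -1),
# but group B sits higher than group A on both axes.
group_a = ([1, 2, 3], [10, 9, 8])
group_b = ([11, 12, 13], [20, 19, 18])

pooled_x = group_a[0] + group_b[0]
pooled_y = group_a[1] + group_b[1]

print("within group A:", pearson_r(*group_a))   # -1.0
print("within group B:", pearson_r(*group_b))   # -1.0
print("pooled groups:", round(pearson_r(pooled_x, pooled_y), 2))
```

The pooled correlation is driven entirely by the difference between the two groups, which is why statistics computed on groups cannot safely be read back onto the individuals inside them.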
The third type of errors in research is random errors, which can arguably be the most common of the bunch. These are errors that are essentially beyond control—meaning that no matter what, this type of error cannot be avoided or prevented in its entirety. Better yet, we can be certain of its occurrence simply because we (the researcher, study subjects, etc.) are all human and error is embedded in our nature.

Still, this should not be as alarming as its doomsday description makes it to be. Why? Because statistics are here to save the day! One of the primary functions of the statistical tools and techniques later described in this book is to decrease or fractionate random error, thereby minimizing its potentially detrimental effects on our results. On the other hand, the presence of error in our study can also serve a positive purpose insofar as it takes into consideration the individual differences of the study subjects. Truthfully, there can be an entire field within statistics dedicated to the process of and value behind the minimization of error. For now, we can assure that its presence will be felt in the majority of the sections that follow in this book.

1.2.2 Biostatistics Today

…those of statistics itself. Moreover, it is the overarching theme and ultimate purpose behind the utilization of these techniques that makes it specific to biostatistics.

The study of biostatistics is not limited to any one field, like biology. One of the great virtues of this study is that it involves a multidisciplinary collaboration between the wealth of today's studies that have to do with human life. To name just a few, these disciplines range from psychology and sociology to public health and epidemiology and even to medicine and dentistry.

Thus, the utilization of biostatistics today is the application and development of statistical theory to real-world issues particular to life as we know it. Additionally, the aim we are working toward is solving some of these problems, in hopes of improving life as a whole. So, we can see how it would not be uncommon to hear the biomedical sciences as being the broad discipline subjective to biostatistics. But since the nature of this book is pragmatism, we will simplify its comprehensive discipline from the biomedical sciences to the health sciences. Hence, taken together, biostatistics lies at the heart of the research process in the health sciences.

[Fig. 1.6 Translational healthcare: T1, bench to bedside (clinical guidelines); T2, result translation (healthy decision-making habits)]
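The claim above that statistical techniques "decrease or fractionate random error" can be illustrated by simulation: averaging repeated noisy measurements shrinks the spread of the result, roughly in proportion to the square root of the number of measurements. The true value, noise level, and sample sizes below are invented purely for illustration.

```python
import random
import statistics

random.seed(7)

TRUE_VALUE = 120.0   # hypothetical true quantity being measured
NOISE_SD = 8.0       # spread of the random measurement error

def mean_of_measurements(n):
    """Average n noisy measurements of TRUE_VALUE."""
    return statistics.mean(random.gauss(TRUE_VALUE, NOISE_SD) for _ in range(n))

# Repeat each experiment many times and look at the spread of the averages:
# the spread (the standard error) shrinks roughly like NOISE_SD / sqrt(n).
for n in (1, 16, 64):
    means = [mean_of_measurements(n) for _ in range(500)]
    print(f"n = {n:3d}  spread of the average ≈ {statistics.stdev(means):.2f}")
```

No amount of averaging removes random error entirely, which matches the text's point that it "cannot be avoided or prevented in its entirety"; it can only be fractionated.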
answered. Moreover, we can now perceive the value of the hopeful answer that we obtain from the health science-based research question. The significance of this answer is that it is the best possible answer to a problem that seeks the betterment of both the healthcare field and its constituents.²

Conclusively, a primary aim of this book is to provide an understanding of the basic principles that underlie research design, data analysis, and the interpretation of results in order to enable the reader to carry out a wide range of statistical analyses. The emphasis is firmly on the practical aspects and applications of the methodology, design, and analysis of research in the science behind translational healthcare.

1.2.2.2 Research in Translational Healthcare

A biostatistics course is essential, if not mandatory, to a student in the health sciences. This is mainly for the acquisition of basic and scientific statistical knowledge that pertains to the specific area that is being studied within the health sciences. But as we progress from today's students to tomorrow's professionals, the great value of biostatistics arises within the field of translational healthcare.

As this is being written, the fate of US healthcare, for better or worse, is uncertain, but what is certain is the direction that the field is moving toward as a whole: placing focus on the individual patient. The field of translational healthcare is one which takes a patient-centered approach that translates health information gained from a research setting to the individual patient and, if effective, translates it to benefit all patients. Furthermore, this is the crude conceptualization of the two primary constructs of the science of translation (or translational medicine, as it was first introduced)—namely, translational research and translational effectiveness.

In theory, translational research refers to the first construct (T1) and translational effectiveness to the second construct (T2), and this book has been divided accordingly (Fig. 1.6). The first half of this book is responsible for expounding on the fundamentals of

² For example, just a few years ago citizens of the United …
On the other hand, the succeeding half is responsible for the introduction of the second construct of translational science, namely, translational effectiveness. This is referred to as "result translation," in which the results that are gathered from clinical studies are translated or transferred to everyday clinical practices and healthy decision-making habits. Although we have bisected the two based on their distinct purposes, methods, and results, both enterprises coalesce to the ultimate goal of new and improved means of individualized patient-centered care.

In brief, the most timely and critical role of biostatistics in contemporary healthcare research appears to be in the context of:

(a) Establishing and evaluating the best available evidence, in order to ensure evidence-based interventions
(b) Distinguishing between comparative effectiveness analysis, which is designed to compare quantified measures of quality of life and related variables among several interventions, and comparative effectiveness research, which aims at comparing several interventions in terms of relative differences in cost- and benefit-effectiveness and in reduced risk, in order to ensure effectiveness-focused interventions
(c) Characterizing novel biostatistical toolkits that permit the assessment, analysis, and inferences on individual, rather than group, data to ensure the optimization of patient-centered interventions

1.2.3 Self-Study: Practice Problems

1. How does the process of using the scientific method begin?
2. List and provide a brief description of the steps of the research process.
3. What are the legs that represent the stool that is the research process? What happens if one of the legs is compromised?
4. What is the difference between the research question and the study hypothesis?
5. True or False: The best type of research study is one that can conclude the absolute truth.
6. What are the legs that represent the stool that is statistical significance? What happens if one of the legs is compromised?
7. Which of the three most common types of errors are avoidable? Which are unavoidable?
8. You have just finished taking your first biostatistics exam and are unsure how well you performed. Later that week, you receive your results and see that you received an A—and exclaim: "I knew I was going to ace that!" Which of the biases was taken advantage of in this scenario?
9. True or False: All forms of error introduced during the research process negatively impact the study as a whole.
10. Translational healthcare is comprised of two enterprises. What are these two enterprises and what does each represent?

(See back of book for answers to Chapter Practice Problems)
2 Study Design

Contents
2.1 Core Concepts
2.2 Conceptual Introduction
2.3 Diagnostic Studies
2.3.1 Reliability and Validity
2.3.2 Specificity and Sensitivity
2.4 Prognostic Studies
2.4.1 Observational Design
2.4.2 Experimental Design
2.5 Self-Study: Practice Problems
2.1 Core Concepts

Nicole Balenton

The composition of the basic principles that act as the foundation of the research process is conceptualized as a three-legged stool. This chapter highlights the first of the three legs of the stool—namely, study design—that acts as the blueprint for researchers to collect, measure, and analyze the data of their health topic of interest. The study design hinges on the research topic of choice.

As the backbone of any successful scientific research, the study design is the researcher's strategy for choosing the various components of a study deemed necessary to integrate in a coherent manner in order to answer the research question. The design chosen affects both the results and the manner in which one analyzes the findings. Obtaining valid and reliable results ensures that the researchers are able to effectively address the health research problem and apply the findings to those most in need.

The success of any scientific research endeavor is established by the structure of the study design, offering direction and systematization to the research that assists in ultimately understanding the health phenomenon. There are a variety of study design classifications; this chapter primarily focuses on the two main types: diagnostic studies and prognostic studies. We further explore their respective subcategories and their relation to scientific research in translational healthcare.

2.2 Conceptual Introduction

As one of the three fundamental pillars of the research process, the design of a research study is essentially the plan that is used and the system
employed by the researcher. The specific organization is subjective to the particulars of the object of the study. The "particulars" we mention are pieces of information that refer to things (i.e., variables, discussed further in Chap. 3) such as the population and the outcome(s) of interest that are being studied. We can conceptualize the study design to be the infrastructure or the organization of the study that serves the ultimate goal of the research that is being done.

Let's say, for example, that you and your significant other decide to build your dream home together, and we will largely assume that you have both also agreed on the final plan of the whole house, i.e., where each room will be and their individual uses. But as any contractor will tell you, the foundation of the house is of utmost importance because it sets the precedent for the building as a whole. Sure, it is pertinent to have a plan of what each room should be, but through the creation of the foundation is where the proper piping, plumbing, and electrical groundwork are set for the ultimate design of each room. This is exactly the purpose and relationship between a study and its design.

As researchers, we must be fully cognizant of the intricate details of our study in order to create a study design that can facilitate the goal of our research. Luckily, there is no creation necessary on our part, as there are a multitude of study designs we can select for the specific components of our study. Moreover, correctly selecting the infrastructure at the outset is an early step in preventing faulty outcomes during the research process (i.e., preventing systematic errors).

Now, if your dream home consists of a two-story building, would you build a foundation, create a blueprint, and buy the accessories necessary to build a one-story house? Or if the zoning regulations of the property prohibited the building of a multistory house, could you still move forward with the successful materialization of the dream? Of course not; these would be disastrous! Similarly, we would not dare to, on purpose, select a study design that is specific to one method of research when our own research is concerned with another.

This chapter discusses distinct and comprehensive study designs relative to scientific research. More specifically, the designs and their subcategories are tailored toward the infrastructure that is necessary for research in translational healthcare. The summarizing schematic of the disparate study designs is shown in Fig. 2.1.
[Fig. 2.1 schematic: Design → Diagnostic Studies; Prognostic Studies → Observational (cohort study, case-control, cross-sectional, naturalistic) and Experimental (clinical trials, research synthesis)]

Fig. 2.1 The various study types including observational studies, experimental studies, naturalistic studies (Naturalistic study, often referred to as qualitative, participant observation, or field research design, is a type of study design that seeks to investigate personal experiences within the context of social environments and phenomena. Here, the researcher observes and records some behavior or phenomenon (usually longitudinally) in a natural setting; see Chiappelli 2014), and research syntheses (Research synthesis—a type of study design that utilizes a PICOTS research question (see Chap. 1, Sect. 1.2.1.1), in which relevant research literature (the sample) is gathered, analyzed, and synthesized into a generalization regarding the evidence—this is the fundamental design behind evidence-based healthcare; see Chiappelli 2014)
2.3 Diagnostic Studies
Funny, indeed, but these must have been serious questions asked or thought of by the earliest physicians. Luckily, there are no more urine tastings attended by physicians. Today, there are a multitude of diagnostic tests that without a doubt are better than a gulp of urine. When we say better, we are referring not just to the particular method of diagnosis but also to a systematic improvement in diagnosis. This betterment encompasses the concepts of the reliability and validity of the test.

A new diagnostic test is subject to the criteria of reliability and validity in order for the test to be rendered as the new gold standard. Moreover, an unreliable or invalid test will provide little or, even worse, detrimental information in research and clinical decision-making. We must evaluate a novel diagnostic test for its accuracy, which is dependent on how exact the test can be in discriminating between those with the disease and those without. Hence, a diagnostic study design is employed to test the accuracy of a novel diagnostic test.

The accuracy of a diagnostic test is determined through the extent of how reliable and valid the measurements of the test are. The reliability of a diagnostic test refers to how replicable and consistent the results are in different periods of time, in which we are essentially asking: "Does the test produce the same results if the same patient were to return tomorrow? The week after? Next year? (assuming all other factors are held the same)." Lastly, the validity of a diagnostic test refers to whether the instrument measures precisely what it was meant to, which must also be the same condition that the current gold standard measures. The actual methods of determining reliability and validity are discussed in the next chapter.

2.3.2 Specificity and Sensitivity

When speaking of the accuracy of a diagnostic test, measures of reliability and validity are not the only concepts we utilize. As mentioned above, the accuracy of a diagnostic test aims to determine how precisely a test can discriminate between the patients who truly have the condition from the patients who are truly free of the condition. We can further interpret this definition into the ability of a new diagnostic test to accurately determine the presence and absence of a disease. This latter, and more simplified, definition gives rise to two concepts that a diagnostic test generates—namely, sensitivity and specificity.

The sensitivity of a new diagnostic test refers to how effective the new test is at identifying the presence of a condition. The identification of a condition in an individual that truly has the condition is referred to as a true positive. It is clear to see the difficulty in obtaining this true measure; due to this overt stringency, there may exist individuals that are truly positive for the condition, but the test has failed to accurately identify them. This subclass of individuals (those rendered as negative but in actuality have the disease) is referred to as false negatives.

On the other hand, the specificity of a new diagnostic test refers to how good the new test is at identifying the absence of a condition. The identification of a lack of condition in an individual that truly does not have the condition is referred to as a true negative. Subsequently, the leniency of this measure may include many subjects who, in actuality, truly do not have the disease, but the test has essentially "missed" them. This subclass of individuals (those rendered as positive but in actuality do not have the disease) is referred to as false positives. Table 2.1 shows all possible permutations along with the calculations of sensitivity and specificity. Moreover, we provide a brief description of predictive values and their calculations, but further elaboration is saved for a more epidemiological context.²

2.4 Prognostic Studies

Back at the doctor's office, after receiving the diagnosis, we are hopeful that the timeline of the given condition is short and, of course, the condition is curable. This is essentially referred to as the prognosis—namely, the probable course and outcome of the identified condition. Though the results, unfortunately, may not always be short and/or curable, the knowledge of this prognosis can empower both the physician and patient to be

² See Katz (2001).
Table 2.1 2 × 2 contingency table accompanied with measures of validity and predictive value formulas

                          Disease              No disease
Positive test result      True positive (A)    False positive (B)    A + B
Negative test result      False negative (C)   True negative (D)     C + D
                          A + C                B + D

Sensitivity (SE) = A / (A + C)        Predictive value positive (PVP) = A / (A + B)
Specificity (SP) = D / (B + D)        Predictive value negative (PVN) = D / (C + D)
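As a quick numerical check of the formulas in Table 2.1, the four measures can be computed from the cell counts A–D; this is a minimal Python sketch, and the counts used here are hypothetical, not taken from the text:

```python
# Sensitivity, specificity, and predictive values from a 2x2 table,
# following the layout of Table 2.1:
# A = true positives, B = false positives,
# C = false negatives, D = true negatives.

def diagnostic_metrics(a, b, c, d):
    return {
        "sensitivity": a / (a + c),   # SE  = A / (A + C)
        "specificity": d / (b + d),   # SP  = D / (B + D)
        "pv_positive": a / (a + b),   # PVP = A / (A + B)
        "pv_negative": d / (c + d),   # PVN = D / (C + D)
    }

# Hypothetical counts: 90 true positives, 10 false positives,
# 30 false negatives, 70 true negatives.
metrics = diagnostic_metrics(a=90, b=10, c=30, d=70)
print(metrics["sensitivity"])  # 90 / 120 = 0.75
print(metrics["specificity"])  # 70 / 80  = 0.875
```

Note that sensitivity uses the disease column (A + C) while the positive predictive value uses the positive-result row (A + B); mixing the two denominators is a common error.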
proactive (i.e., patient is under the supervision of a medical professional, patient is more careful from now on, etc.). Although time is not exclusive to a prognosis, it is essential in both this medical aspect and the research characteristic we are to discuss.

A prognostic study is one which examines specific predictive variables or risk factors and then assesses their influence on the outcome of the disease. Subsequently, the performance of a research study is designed as such with the intent of following the course of a given disease or condition of interest through a period of time. The most effective method of this type of study is a comparison of various factors among individuals with relatively similar characteristics, divisible by the presence or absence of disease. This is the typical treatment–control relationship, in which the control is used as a "standard" that allots this comparison. Moreover, we can thus say that a prognostic study is designed to monitor the management of subjects or patients in the treatment and control groups. But we must note that they cannot always be so simply divided. We elaborate on this and the two major classifications of prognostic studies, observational and experimental, below.

2.4.1 Observational Design

There are numerous qualifications that determine whether a study is said to have an observational design. One of the most important is when there are no manipulations or external influences from the researcher onto the subjects that are being studied. The manipulations or external influences that stem from the researcher can be seen as investigator-mediated exposures. Thus, an observationally designed study is employed such that the researchers merely observe the subjects in order to examine potential associations between risk factors and outcomes, but they do nothing to affect or regulate the participants.

What is also of critical importance to an observational design is time. Under the umbrella of observational design, there exist three different studies, each with disparate methods, purposes, and gained knowledge potentiality. The subclasses beneath this category each distinctly have a relationship with time, so it is not surprising to hear this design being referred to as longitudinal. This will be explained in further detail below.

2.4.1.1 Cohort Studies

Colloquially, a cohort is defined as a group consisting of individuals that share common attributes or characteristics in a set period of time. Subsequently, a cohort study is a study that chronologically observes individuals (initially disease-free) that have been naturally exposed to potential risk factors. This goal of observation is pertinent in determining whether or not the patients develop a specific disease or condition (or outcome).

We may quickly jump to the conclusion that if disease X was observed to have developed from risk factor Y, then the observed risk factors obviously caused the disease—seems logical, right? Unfortunately, we are unable to use any observational study design to procure causal relationships between variables, i.e., a cause–effect relationship. What is allotted is the establishment of an associative relationship, namely, that "There seems to be a weak/moderate/strong association between Disease X and risk factor Y." Surely the former causal relationship
established can be proved erroneous by the simple consideration of those who were exposed to risk factor Y but did not develop disease X. Consequently, we note that the exposures in observational designs may be necessary for disease acquisition, but not sufficient.

Thus, we say that study subjects are divided into cohorts, exposed and unexposed, and then observed throughout time to determine the outcome of their exposure (disease and lack of disease). This determination of the development of disease is referred to as incidence and is one of the features of a cohort study. Incidence refers to the number of individuals that develop a certain condition or disease relative to all individuals at risk of developing the disease, during a set period of time. Though mention of incidence rates slowly begins to carry over to the study of epidemiology, the formula for its calculation is provided below:

Incidence = (number of new cases of the disease during a set period of time) / (number of individuals at risk of the disease during that period)

Assuming that the disease of interest is rare and that the subjects are representative of their overall populations, then we are also able to approximate the relative risk, also read as risk ratio, as the ratio of the incidence of those exposed relative to the incidence of those not exposed (Fig. 2.2).

In the discussion of cohort studies, there must be a moment for and an emphasis placed on time. Cohort studies may be subdivided by time into three main categories: prospective, retrospective, and nested (or mixed). A prospective study is a study that essentially begins today and the study subjects (i.e., cohorts) are observed into the future. A retrospective study is one that begins at a certain period of time in the past and observes study subjects into the present. A nested study is a combination or mixture of the temporal attributes of retrospective and prospective designs—namely, a study begins at some point in the past and follows subjects into the present and further on into the future. Below we provide an example of a nested cohort study which, by association, will describe the time components of the former two studies as well. Figure 2.3 also provides a pictorial elaboration.

For example, you have just realized that a number of people in your extended family have recently been admitted to a hospital for E. coli food poisoning. After much thought, you realize that this must have something to do with the recent Thanksgiving potluck—you suspect your
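The incidence and relative risk calculations described above can be sketched in a few lines of Python; the cohort counts below are hypothetical and chosen only to make the arithmetic easy to follow:

```python
# Incidence and relative risk (risk ratio) for a two-cohort study:
# each cohort's incidence is its new cases divided by its number at risk,
# and the relative risk is the ratio of the two incidences.

def incidence(new_cases, at_risk):
    # Incidence = new cases during the period / individuals at risk
    return new_cases / at_risk

# Hypothetical cohorts followed over the same period:
# 40 of 200 exposed develop the disease; 10 of 400 unexposed do.
risk_exposed = incidence(40, 200)     # 0.20
risk_unexposed = incidence(10, 400)   # 0.025
relative_risk = risk_exposed / risk_unexposed
print(relative_risk)  # 8.0 -> the exposed cohort carries 8 times the risk
```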
Fig. 2.2 Risk calculations (relative risk = incidence among the exposed / incidence among the unexposed)

[Fig. 2.3 schematic: COHORT GROUP traced from Exposure to Disease on three timelines — RETROSPECTIVE, PROSPECTIVE, and NESTED]

Fig. 2.3 Pictorial elaboration of the three fundamental types of cohort studies
[Fig. 2.4 schematic: the Population at Risk divides into those who Ate Tuna Fish Casserole (Exposed Group) and those who Did Not Eat Tuna Fish Casserole (Unexposed Group)]

Fig. 2.4 Cohort study design tree for tuna fish casserole example
Aunt's tuna fish casserole. Hence, you employ a nested cohort design, in which (through extensive investigation) you divide the family members in attendance into those who ate the tuna fish casserole (exposed) and those who did not or primarily ate other dishes (unexposed). Then, you observe the groups starting from Thanksgiving until the present moment (retrospective component) noting signs and symptoms while also keeping in close contact with your family members for the next month (prospective) to see if they develop signs and symptoms of food poisoning (Fig. 2.4).

In conclusion, it is simple to see the utility of cohort studies in investigative contexts. Indeed, there are both strengths and limitations inherent to this type of study. The strengths include the establishment of incidence rates, the possibility to study multiple outcomes from a single exposure, and even the ability to investigate rare exposures. On the other hand, the limitations are equally serious, namely, that cohort studies are expensive, time-consuming, prone to biases, and subject to loss to follow-up. Of course, if time and money are not of grave concern (i.e., large funding), then the strengths drastically outweigh the weaknesses, supporting others' claim that a cohort study is the most powerful of observational study designs.

2.4.1.2 Case-Control Studies

Also under observational study designs falls the case-control study, which is a study whose research focuses on specific diseases exclusive to the past. Just as we emphasized the importance of time in the previous section, the retrospective time component is particular to a case-control study. Moreover, this type of study is concerned with determining the potential occurrence of events that lead to the manifestation of a certain disease in the patients that are being studied (i.e., observed).

This method compares two groups of individuals: those with the presence of the disease of interest and those with the absence of the disease. We refer to the former group as the "cases" (i.e., presence of disease) and the latter group as the "controls" (i.e., absence of disease). Although we will expound on the importance of control groups later on in experimental design (Sect. 2.3.2), the control group is what largely facilitates the comparison of the two groups; it may ultimately assist in determining what happened differently in the case group which may shed light on the progression of disease.

Subsequently, a case-control study begins with the identification of the disease of interest. Then, two related groups are divided by disease state, where one group suffers from the disease and the other does not. Next is the introduction of the retrospective time component—namely, both groups are essentially "followed" back in time through some method of investigation (i.e., questionnaire, survey, etc.) to determine their exposure to particular risk factors of interest (Table 2.2). Surely, we can notice that it is not the actual participants that are being "followed" back in time; rather, it is the data being collected that is from the past.
Table 2.2 At the beginning of the study, exposure status is unknown; thus we classify subjects into cases or controls

                                Outcome
                         Cases (disease)   Controls (no disease)
Exposure   Exposed            (A)               (B)
           Unexposed          (C)               (D)
                             A + C             B + D

We may ponder on the utility of this specific design. Case-control studies are of most value when studying rare diseases. Additionally, a case-control study provides an estimate of the strength of an association between particular exposures and the presence or absence of the disease. We commonly refer to these exposures as predictors, such that the prediction of the existence of an association with the disease can provide researchers with an odds ratio (OR). An odds ratio essentially measures the odds of exposure for the cases compared to the odds of exposure for the controls. We can organize the odds of exposure for both groups in a simple table (Table 2.2) to aid the calculation of the odds ratio in the formula provided below:

Odds ratio (OR) = (A / C) / (B / D) = (A × D) / (B × C)
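The odds ratio calculation can be sketched directly from the Table 2.2 layout; this short Python example uses hypothetical cell counts for illustration:

```python
# Odds ratio for a case-control study laid out as in Table 2.2:
# A = exposed cases, B = exposed controls,
# C = unexposed cases, D = unexposed controls.

def odds_ratio(a, b, c, d):
    # Odds of exposure among cases (A/C) divided by odds of exposure
    # among controls (B/D), which simplifies to the cross-product AD/BC.
    return (a * d) / (b * c)

# Hypothetical counts: 30 exposed cases, 10 exposed controls,
# 20 unexposed cases, 40 unexposed controls.
print(odds_ratio(a=30, b=10, c=20, d=40))  # (30*40)/(10*20) = 6.0
```

An odds ratio above 1 suggests the exposure is more common among cases than controls; here the cases have six times the odds of exposure.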
Other strengths of this type of study include that it is relatively inexpensive, there is no "waiting period" for disease exposure, and multiple exposures can be taken under consideration. But along with these strengths comes a serious limitation in that this study design is quite susceptible to bias, more so than other study designs. Of the multiple biases, we briefly consider the recall bias, for example. Recall bias considers the flaws of human memory, in which subjects asked to recall certain instances may provide erroneous responses that lead to erroneous results of the study.

2.4.1.3 Cross-Sectional Studies

Lastly, a cross-sectional study is an observational design whose research focuses on specific diseases as they relate to the present. Indeed, it is a study done at a specific and singular cross-section of time—now. Certainly, the importance of the time aspect cannot be stressed enough. It relates to both the convenience and advantage that are overtly subjective to a cross-sectional study.

Say you are on your way to your favorite biostatistics class and decide to randomly walk into another class, interrupt the lecture, and ask, "Show of hands, how many of you rode a bicycle to school today?" You count the hands, politely thank the aggravated professor, and outrun campus security to safety in your next class. Well done Bueller, you have just successfully employed a cross-sectional study on transportation methods to school! But what can we do with this information?

A cross-sectional study provides information on the prevalence of a condition. Prevalence is referred to as the number of individuals that currently have a specific condition or disease of interest. Returning to our example, perhaps you record that only 3 of the 30 students in the classroom raised their hand when you asked the question. Thus, you can report that the prevalence of bicycling to school as an alternative method of transportation is 10% in the class you surveyed. Hence, we see that prevalence is calculated as the ratio of the number of people who have a given condition or characteristic (i.e., bicycling to school) at a given time over all of the people that were studied (the entire classroom) (Fig. 2.5).

Now, we do not support the irritation of classrooms nor do we intend to mock the utilization of cross-sectional studies with the oversimplification of the above scenario. In fact, the basic nature behind its presentation aims at exalting its usefulness! Two of the great advantages primarily exclusive to a cross-sectional study are that it is usually swift and inexpensive—two factors crucial to any scientist. The value of the information gained relative to the investment made is tremendous.
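The classroom prevalence calculation is a single ratio; a minimal Python sketch, using the 3-of-30 counts:

```python
# Prevalence = people with the condition or characteristic at a given time
#              / all people studied at that time.

def prevalence(with_condition, total_studied):
    return with_condition / total_studied

# Classroom example: 3 of the 30 students surveyed bicycle to school.
print(prevalence(3, 30))  # 0.1, i.e., a prevalence of 10% in the class
```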
dered as consequential to the treatment.

Another important quality of the control group is that it consists of a group of individuals that have, at best, similar characteristics and qualities,

ple is a Latin square design, the purpose of which is, along with that of all other block methods, to reduce the variation among individuals within the groups in hope of further reducing random error (see Hinkelmann and Kempthorne 2008).
with the acronym for randomized controlled tri-

4 Clinical equipoise—as first coined by Benjamin Freedman—is an ethical principle relating to the researcher and their honest anticipation of the experimental treatment having some benefit to the patient, at least equal to no treatment. This essentially returns to the fundamental healthcare maxim of primum non nocere, Latin for "First, do no harm." In context, randomization may not always be ethical (and hence permitted) on, say, terminally ill cancer patients that are recruited for experimentation of a novel treatment intervention.

5 The Arrangement of Field Experiments, 1926, and The Design of Experiments, 1935.

in Chapter 1 of the Book of Daniel in Ketuvim ("Writings") of the Bible. In 605 BCE, the kingdom of Babylon fell into the hands of the fierce military leader Nebuchadnezzar. King Nebuchadnezzar enforced a strict diet of only meat and wine in his kingdom. The Israelites that inhabited his palace felt doomed as they were not permitted to consume food that was not subject to their divine dietary law of Kashrut (Kosher). Among those living in his palace, an Israelite named Daniel, in fear of retribution, suggested a "trial" where he and his Israelite friends would consume a diet of only vegetables for 10 days. Lo and behold, after 10 days, Daniel and his gang presented as much healthier to the King than did his meat-eating counterparts. Shortly after, the King's dietary commandment was no longer obligatory.
[Figure: study classification decision tree — "Treatment on humans?" YES leads to 2. Experimental, where "Randomization?" and "Control?" split 2A. Experimental and 2B. Quasi-Experimental; NO leads to 3. Observational, where "Time?" splits 3A. Cohort, 3B. Case-Control, and 3C. Cross-sectional]
trials must also abide by the rigors of ethical and moral principles that are overseen by disparate government agencies (see Sect. 2.3.2 and Footnote 4 on clinical equipoise).

The ultimate goal of clinical trials is the betterment of public health. Whether that is in terms of acquiring new medical knowledge or discovering the best medical instruments, the end result ultimately returns back to the patient. Indeed, clinical trials are central to translational healthcare, particularly in the T2 block—translational effectiveness—such that the result translation is the transmission of knowledge gained in clinical studies (i.e., the studies of clinical trials) to the establishment of clinical guidelines for the entire healthcare community and its constituents (Fig. 2.9), as it should. And, why not? Don't we all want to receive the best of the best when it comes to our health, no less the health of our parents and children?

2.5 Self-Study: Practice Problems

1. For each of the studies below, identify whether it is an observational study or an experimental study:
(a) Scientists wish to determine if trace amounts of lead in their city's water affect the cognitive development of young children.
(b) A researcher is interested in determining whether there is a relationship between years of education and annual income.
(c) A study on healthy eating habits measures the type of food participants purchase at the grocery store.
(d) A neuroscientist electrically stimulates different parts of a human brain in order to determine the function of those specific regions.
(e) In order to determine the effectiveness of an antidepressant, a psychiatrist randomly assigns geriatric patients to two groups—one group takes the new drug, while the other takes sugar pills (i.e., placebo).
(f) The administration of a medical school preparation course creates three different courses for students preparing for the Medical College Admission Test (MCAT)—a 3-month intensive course, a 4.5-month medium course, and a 6-month easy course. After all courses are complete, the administrators compare exam scores to determine which course was most effective.
2. True or false: Sensitivity establishes how good a measuring device is at detecting the absence of a specific disease.
3. A local dental office receives a promotional caries detection kit. The kit contains a paste that you apply to the tooth and whose color turns red if there is active cavity-generating plaque. You compare this supposed caries detection kit with traditional X-rays (i.e., the gold standard). The use of the kit provides you with the following data in 100 of the patients (80 of whom have cavities by X-rays):

                        Cavities   No cavities
Positive for caries        70           5
Negative for caries        10          15

(a) Calculate the sensitivity and specificity.
(b) Calculate the prevalence of patients with caries.
4. In an outbreak of Campylobacter jejuni at a college cafeteria, the primary suspect is the weekly mysterious meat dish. The campus health office reports that out of the 500 students that ate at the cafeteria that day, 150 students ate the mysterious meat dish, in which 47 of those who ate the meat dish developed gastroenteritis.
(a) Calculate the incidence of developing gastroenteritis from the consumption of the mysterious meat dish.
(b) Does this measure consider individuals who may have had gastroenteritis before the outbreak? Explain.
(c) What type of observational study was done that determined the primary suspect and provided the incidence rate?
5. Scientists studying the effects of breastfeeding on infections in babies closely watched a sample of mothers during the first 3 years of their newborn's life. The researchers witnessed that newborns that were breastfed for a minimum of 3.5 months had significantly fewer infectious diseases than those who were not breastfed at all.
(a) What type of study design is being taken advantage of here?
(b) Is this a prospective, retrospective, or nested study?
(c) Can it be accurately concluded that breastfeeding causes fewer infectious diseases in newborn babies? Explain.
6. An investigator is interested in conducting a case-control study of childhood leukemia and exposure to environmental toxins in utero. How should the investigator choose cases and controls? How should the investigator define exposure and outcome?
7. Determine whether each of the following statements is true or false:
(a) A cross-sectional study yields information on prevalence.
(b) A case-control study produces data that can compute odds ratios.
(c) A cohort study establishes what happens to a group of patients with respect to time.
8. A sample of women ranging from 25 to 35 years old was recruited for a study on the effects of alcohol consumption on hormone levels. All of the participants were given a 90-day regimen to consume either a certain amount of alcohol or a placebo drink based
on the specific day. The daily drink allocation was random for each participant. The outcome was measured by the difference in hormone levels on the days of alcohol consumption compared to the days of placebo.
(a) Was this a run-in or crossover trial? Explain.
(b) What is significant about random allocation of drinks?
(c) Could the participants have been blinded to their specific treatment? Explain.
9. The geriatric department at the local community hospital was interested in studying the effects of aspirin in the prevention of cardiovascular disease in the elderly. Approximately 1266 geriatric patients were randomly assigned to either a treatment group or a control group. The treatment group took 500 mg of aspirin daily, while the control was given an identical-looking sugar pill. Participants were monitored every 3 months for 5 years. The reports that were collected every 3 months were assessed by an independent, third-party medical group.
(a) What role did the sugar pill play in the study?
(b) Was this a single-blind, double-blind, or triple-blind study? Justify your answer.
(c) What type of study design was utilized here? Be as specific as possible.
10. What qualifications must a measuring tool meet in order to be considered the gold standard? Also explain how a measuring tool can potentially lose its gold standard "seal" (i.e., the tool is no longer considered the gold standard).

(See back of book for answers to Chapter Practice Problems)
3 Methodology

Contents
3.1 Core Concepts
3.2 Conceptual Introduction
3.3 Sample vs. Population
3.3.1 Sampling Methods
3.4 Measurement
3.4.1 Instrument Validity
3.4.2 Instrument Reliability
3.5 Data Acquisition
3.5.1 On Data: Quantitative vs. Qualitative
3.5.2 Variables
3.6 Self-Study: Practice Problems
Recommended Reading
related, principles of the physical sciences such as Newton's Laws of Motion and Thermodynamics. But in life, we often witness that it is not the path of least resistance that yields the most rewarding ends. Indeed, it is usually overcoming the arduous path that bears the greatest returns.

We all may have specified paths, but so too does the first section of this book on Translational Research in Translational Healthcare. We can argue whether our approach is the path of least resistance; certainly we may hope it is not, so as to maximize the reward in its culmination. Regardless, we shall not make the mistake of losing sight of the goal of our path, namely, a practical and comprehensive understanding of the research process. As the second leg of our stool (Fig. 3.1), the appreciation of research methodology is our next quest as we continue on our path.

At a first glance, we may perceive research methodology to be synonymous with research methods—but this is not entirely true. The methods we utilize in research may refer to the specific tools, techniques, and/or procedures that are undertaken. On the other hand, the methodology refers to the comprehensive study of (-logia) the basic principles that guide our processes in research. The research methodology fundamentally asks: How?—that is, how is the research done? How did the researchers obtain their information? On the same note, it also further begs the question of Why?—Why did the researchers use this technique, this tool, or this protocol over the others?

Therefore, the principal domains of research methodology refer to the science of measurement and the process of obtaining and allocating the sample. These two domains ensure the qualification of numerous criteria that are pertinent to the research process, but most importantly they ensure that the study has gathered the appropriate information necessary for the third leg of our stool, namely, data analysis (Fig. 3.1).

[Fig. 3.1: three-legged stool — Study Design, Methodology, Data Analysis]

Fig. 3.1 Methodology is the science of measurement and the process of obtaining and allocating the sample

3.3 Sample vs. Population

As many struggling students and professional shoppers will tell you, the best time to go grocery shopping is on a Sunday. Why you might ask? Well, because of all of the free samples of food, of course! The psychology behind grocery stores and supermarkets providing free samples of their products to their guests is both simple and complex. Showcasing featured products and providing an, often frustratingly, small sample ultimately translates to the purchasing of more goods. But most important, and most apparent, is that a free sample provides the shopper with an understanding of the product as a whole before they commit to purchasing.

Let us say, for example, that you and your mother are shopping at your favorite grocery store. While you were preoccupied in the school supplies aisle, your mother was in the frozen food section and managed to grab an extra sample of their Sunday-featured pizza for you (Fig. 3.2). You scarf down the frustratingly small, yet delicious, sample, and then your mother inquires: "Do you like it? Should we buy this pizza for the house?"—to which you respond positively. Fine, this seems like a normal occurrence when buying
the patient's blood is measured for abnormally high levels of low-density lipoproteins (LDL)

study becomes much more vulnerable to error and bias (see Chap. 5 for more).
For example, in 1936, The Literary Digest conducted an opinion poll to predict the results of

Esbensen (2015) and Corbin and Strauss (1998).
3 See Sect. 2.3.2, Experimental Design.
Random sampling is perhaps the most advantageous sampling technique, as it allots the collection of a representative sample and thus enables the researcher to draw conclusions (inferences/generalizations) about the population from its daughter sample. We shall soon see how other effective sampling techniques, discussed hereafter, strive to include some randomness in the method of collection. Such techniques, then, may be categorized under random sampling as well. Lastly, a strategy for randomness can be achieved by the utilization of a random number table (Appendix B) or random number generator applications.

Systematic sampling is a method of sampling that follows an arbitrary system set by the researcher in selecting individuals at random from a population. This method is easiest and best accomplished when a list of potential participants is readily available. Consider a hospital manager that desires to evaluate the health of her own hospital staff to support an optimal working environment. Instead of wasting the immense time and resources to evaluate each staff member (we'll assume 1000 people), she enumerates a list of her staff and arbitrarily decides that her lucky number 4 will choose fate. Thus, as she goes through the list, each staff member enumerated by the number 4 is selected to undergo a health evaluation (i.e., 4th, 14th, 24th, 34th, etc.).

Again, we emphasize the randomness of this method in order to secure a high degree of representativeness of the sample from the population. Notice that the hospital manager arbitrarily chose her systematic selection based on her lucky number, but that is not to say that all selection processes are the same. It is further emphasized that regardless of the particular processes used (again, arbitrary), it should be both systematic and random.

Stratified sampling is a method that essentially involves a three-step process, whereby (1) the population of interest is divided into groups (or strata) based off of certain qualities of interest, such as age or sex, (2) individuals are then selected at random under each characterization, and finally (3) the results of each stratum (sg.) are combined to give the results for the total sample. This method warrants the representativeness principle of samples, such that its purpose is to collect a random sample relative to each characteristic. Moreover, the samples with certain fixed proportions amalgamate to a single representative sample.

For instance, in determining the eating habits of the national population, characteristics such as age, sex, and socioeconomic status are critical factors that would be necessary to be reflected in a random sample so as to render the sample as representative (Fig. 3.6). Well, one might ask, would
X
SE
AGE
City 3 City 4
those qualities not be reflected under the utiliza- group (i.e., cluster) patients by hospital and then
tion of a simple random technique? Unfortunately, randomly sample each cluster Fig. 3.7. This
not—a simple random sample is likely not able makes the information that is to be gleaned much
to represent the qualities of particular interest to more manageable.
the population. On the other hand, the division of Although both stratified and cluster samplings
the population into strata based on age, sex, and take advantage of group organization, it is impor-
socioeconomic status ensures that the random tant to note a stark difference between the two
sample obtained within each stratum is reflective (strata vs. clusters). In the former sampling method,
of the entire population. It also goes without men- individuals are stratified by specific characteris-
tioning that this sampling technique makes use tics of study interest, such as race and ethnicity.
of two principles that warrant representativeness, Conversely, the latter method clusters individuals
namely, randomness and stratification. by their natural groupings, such as university, city,
Cluster sampling is a sampling technique or hospital. Alternatively, the apparent similarity
where individuals from the population are orga- between the two techniques cannot be denied. The
nized by their natural factions (clusters) and then importance of this similarity, in terms of grouping,
randomly sampled from each thereafter. This lends a hand to the importance of randomness in
method of sampling is particularly useful when obtaining a representative sample, along with the
the population of interest is extensively distrib- advantages of orderly information.
uted and otherwise impractical to gather from
all of its elements. For example, researchers
interested in hospital-acquired infections (HAI) 3.4 Measurement
would not make very good use of their time and
resources by attempting to review the records Now that we have determined the specific sam-
from a statewide list of discharge diagnoses from pling technique, the question remains: How does
each affiliated hospital. Instead, it would be more one go about collecting the sample?—and, more
practical—in terms of time and resources—to importantly—How does one go about obtain-
34 3 Methodology
ing the necessary information from the collected from it. Similar to the criteria subject to our diag-
sample? In order to answer those questions, we nostic tests, so too are our measurement tools. As
must turn to measurement. we shall see later in Chaps. 5 and 6, the impor-
Once we have identified our population of tance of validity and reliability scale across the
interest and selected a sampling technique, the entirety of this book, and in scientific research
method we utilize to collect a sample and the per se, particularly due to vulnerability to error.
necessary information from that sample requires
a measuring tool or instrument. In research, there
essentially exist two forms of measuring instru- 3.4.1 Instrument Validity
ments: (1) researcher-completed instruments and
(2) participant-completed instruments. A valid instrument is one that really truly mea-
Researcher-completed instruments are sures that which it is intended to measure. There
instruments that are completed by researchers. are three primary means that we delineate in
Well, obviously, but more specifically, they refer order to establish the validity of an instrument:
to instruments that a researcher uses to gather (1) construct validity, (2) content validity, and (3)
information on something specific that is being criterion validity.
studied. For example, a laboratory scientist study- Construct validity refers to the establishment
ing fibroblasts (muscle cells) under different of the degree to which an instrument measures the
environmental conditions may have an observa- construct it is designed to measure. For example,
tion form she uses each day she observes the cells does a tool that is aimed at measuring the level of
under a microscope. For example, “Are the cells anxiety in an individual truly measure anxiety?
still alive?; How large have they grown?; How Or does it measure things like depression and/or
many of them are there today?; Are they growing stress that are closely related to anxiety? In this
comfortably or are they struggling?” Another form case, the construct that a certain instrument is
of researcher-completed instruments includes measuring is anxiety. Hence, in this connotation,
checklists that measure the quality of evidence in a construct refers to a theory or concept particu-
medical literature, such as the AHRQ Risk of Bias lar to the realm of the health sciences. Although
instrument, STROBE, and R-AMSTAR.4 we have provided a definition for a basic under-
On the other hand, subject-completed instru- standing, there exist many other more elaborate
ments are instruments that are administered by domains involved in the validation of an instru-
researchers onto subjects under study. You have ment’s construct, such as Messick’s Unified
definitely completed one of these, whether know- Theory of Construct Validity.5
ingly or not. These usually come in the form of Content validity refers to the extent to
surveys or questionnaires, such as aptitude tests, which the content of an instrument adequately
product quality assessments, and attitude scales addresses all of the intricate components of a
to name just a few. specific construct. We can ask: Does the content
Regardless of what specific type of instrument of the questions within the instrument align with
is being utilized, there is one thing that is true (at the construct of the instrument? With our anxi-
least in research) for all measurement tools: all ety instrument, content validity essentially vali-
measurement tools used in translational research, dates whether the subject (and by extension the
or any scientific research for that matter, must answers) of the questions are good assessments
have the two essential qualities of validity and of anxiety. In this case, the content within an
reliability. Ha! And you thought that Chap. 2 was instrument must provide seemingly logical steps
the last we heard from validity and reliability! relative to the greater construct of the instrument.
No, it was not, nor is this instance the last we hear Hence, it is not uncommon to see this specific
measure of validity referred to as logical validity.
See West et al. (2002), Vandenbroucke et al. (2014), and
4
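Looking back at the sampling designs of Sect. 3.3, the selection logic of the four techniques (simple random, systematic, stratified, and cluster) can be sketched in a few lines of code. The staff roster, the sex stratum, the ten hypothetical clinics, and all sample sizes below are invented for illustration; this is a minimal sketch under those assumptions, not a prescription:

```python
import random

random.seed(4)  # fixed seed so the sketch is reproducible

# Hypothetical roster of 1000 people, each labeled with an illustrative stratum.
population = [{"id": i, "sex": random.choice(["F", "M"])} for i in range(1, 1001)]

# Simple random sampling: every individual has an equal chance of selection.
simple = random.sample(population, k=100)

# Systematic sampling: every 10th person from the enumerated list,
# with a randomly chosen starting point (the "lucky number" scheme).
start = random.randrange(10)
systematic = population[start::10]

# Stratified sampling: divide by sex, sample at random within each stratum, combine.
strata = {"F": [p for p in population if p["sex"] == "F"],
          "M": [p for p in population if p["sex"] == "M"]}
stratified = [p for group in strata.values() for p in random.sample(group, k=50)]

# Cluster sampling: organize by natural faction (here, 10 hypothetical clinics
# of 100 people each), then sample at random within each cluster.
clusters = [population[i:i + 100] for i in range(0, 1000, 100)]
cluster_sample = [p for c in clusters for p in random.sample(c, k=10)]

print(len(simple), len(systematic), len(stratified), len(cluster_sample))
```

Seeding the generator makes the draw reproducible; a random number table (Appendix B) or any vetted random number generator application plays the same role in practice.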
Criterion validity refers to the extent to which the measures of a given instrument reflect a preestablished criterion. This method of validation can essentially assess whether the measurements made within an instrument meet the criteria relative to the specific construct being studied. Criterion validity has two distinct yet interrelated behaviors:

– Concurrent criterion validity validates the criteria of a new instrument against a preestablished and previously validated instrument, also known as the gold standard tool (see Sect. 2.2, Diagnostic Studies). This is most often used in the establishment of a new instrument.
– Predictive criterion validity refers to the degree to which the measurements of an instrument meet certain criteria, such that it can predict a corresponding outcome. For example, can the overall score from the anxiety instrument accurately predict the severity of anxiety disorder? The next anxiety attack?

Considering all of the measurement validations spoken of above, there is a single theme that is common and crucial to any form of validation. Namely, whenever we use an instrument to measure some thing, it is critical that the instrument truly measures that thing. Let that settle in for a moment. We almost never (knowingly) create an instrument that measures something other than what it was originally conceived to measure. Should that be the case though—that is, creating a measurement tool that does not accurately measure what it is intended to measure—then both the instrument and its measurements are rendered invalid; we are systematically obtaining erroneous measurements regarding an erroneous thing. Moreover, the data that are obtained and analyzed from the invalid instrument introduce a harmful blow to our study, namely, a systematic error.

3.4.2 Instrument Reliability

We often hear that while on the quest toward a healthier lifestyle, one should always use the same scale to monitor weight loss. We can readily deduce the reasoning behind that, but the science behind a consistent measurement implies the replicability of a measuring instrument.

Therefore, we say that a reliable instrument is one that produces similar results under consistent conditions. We must require this not only of weight scales but of all measuring instruments, particularly in the health sciences. Imagine the chaos a blood glucose monitor would cause if it rendered a patient diabetic one day, not the next, and so on. To prevent ensuing chaos of any sort, we elaborate on the methods of reliability verification. But before that, let us ponder the word reliable for a moment. When we adulate anything as reliable, we somehow also credit its replicability. A car, for example, is said to be reliable because we can trust to repeatedly drive the car without fearing any major complications down the road. So too goes for a measuring instrument in the health sciences.

Let us use a sphygmomanometer—used to measure blood pressure—for example. There are two ways to verify the reliability of this instrument: inter-rater and intra-rater. At the doctor's office, your physician measures your blood pressure with the sphygmomanometer and then passes the instrument to an uncanny premedical student to do the same. We hope, if the instrument is reliable, that the measurements that both the physician and the shadowing student obtain are the same. This is referred to as inter-rater reliability, such that the measurement provided the same results under consistent conditions between (inter-) two different and independent raters (i.e., physician and student). Intra-rater reliability refers to the producing of similar results under consistent conditions within (intra-) the same rater. More clearly, that is when the physician is able to replicate your blood pressure measurement multiple times with the same sphygmomanometer. The particular analytical techniques we use to justify these measurements are discussed in greater depth in Chap. 7.

Returning to our weight example above, certainly, we expect the scale, and hence its measurement, in the gym to be identical to the scale and its measurement at your house. Why? Well, because weight is just weight, i.e., the gravita-
you? But how? How did you know exactly what distance your arm needed to stretch? Or precisely which muscles to use and with how much intensity to use them?

Proprioception. This nontraditional sense, often referred to as kinesthesia, is the awareness of the space around us, our position in that space, and our movements. As children, we seem to struggle with this sense to an appreciable degree as we learn to stand upright, walk fluidly, and extend our arm to just the right distance to grab that shiny object our parents forgot to hide. As we grow older and develop further, we seem not to realize (consciously) our sense of proprioception and its importance in our daily lives. When we do, though, it is usually in the context of attributes akin to this sense like hand–eye coordination and muscle memory.

We can surmise that fundamental to the sense of proprioception is the understanding and awareness of measurement. Take basketball, for example—how was Kobe Bryant so successful in making those seemingly impossible shots? Well, the best answer is practice (practice?). But the relevant answer is the experiences that came along with his practice: the experiences of practicing the different intricacies of his body required to shoot at certain distances and at certain angles from the hoop, all of which come together as measurements necessary to make those impossible shots—a genius, undeniably.

We also consciously, actively, and purposefully utilize the science of measurement daily. Take a standard weekday morning, for example: You wake up, measure the amount of toothpaste to use, measure the amount of time needed for you to leave home, measure the weather to determine what clothes to wear, measure the amount of coffee to make, measure the best route to get to school, measure the distance needed to brake (or gas) at a yellow light, and so on and so forth. Notice that, although we speak of measuring, it is neither necessary nor required for there to be an actual scale or instrument to measure whatever it is that you want to measure.

Furthermore, when we arrive at school or work, we are measured by other people like our teachers, supervisors, counselors, and even our secret admirers. We are also measured by disparate governmental and regulatory agencies such as the Internal Revenue Service (IRS), US Census Bureau, and the Environmental Protection Agency (EPA), to name a few. We can continue these examples indefinitely—however, it is important to understand that measurement is central not only to our lives but also to our existence.

When it comes to research in the health sciences, the conceptual measurement device that is taken advantage of is statistics. Statistics is heavily relied on in order to capture the instances we observe (i.e., observations) from the natural world that are important to us and require further analysis. But what is so important about these observations that they require an entire school of thought like statistics? The necessity of a field such as statistics can be said to have its origins in variability. The legitimization of statistics was initially for its application in governmental policy that was based on the demographic and economic differences (i.e., variations) of the people.

Similarly, there exists variation in the observations we make in the health sciences relevant to research. More importantly, we want to be able to capture or record those observations because they are important. And because they are important, we want to be able to utilize and even manipulate those observations so that we can garner pertinent findings. Surely, what the majority learns from an important observation—especially in translational healthcare—is an important finding to someone, somewhere.

In science, we refer to the observations we make from within the natural world (or, in research, scores from an experiment or survey) as data. Data (datum, sg.) are essentially the product of transforming observations and measurements from the natural world into scientific information. The human intellect is what mediates this transformation or codification. Everything we observe—say, the different people sitting in the library right now—has the ability to someway, somehow be transformed into data.

There are two inherent properties critical to the observations and measurements we make; one of which we have briefly touched on already,
namely, the importance of observations. The second essential principle is quantification. Truly, every single thing that is observable in the natural world can have a numerical value assigned to our perception of it. This is quite simple for measurements that are numerical in nature, such as weight, cell-culture viability, blood pressure, etc. But the quantification of "things" that we observe that are not numerical in nature requires a few additional steps relative to data acquisition.

3.5.1 On Data: Quantitative vs. Qualitative

The previous section painted data to be exclusive only to numerical values. But this is as much true as it is false. Certainly, the study of statistics, and biostatistics for that matter, is deeply rooted in probability and mathematics, which are critical to data analysis. In fact, measurements pertinent to research are associated with numbers simply because they permit a greater variety of statistical procedures and arithmetical operations. But since its inception, data, research, and even science have all evolved in such a way that this simplistic understanding is no longer sufficient. Moreover, the particular aspects of life we scrutinize or are interested in studying have evolved as well.

Take redheads, for example—up until 2004, the biomedical community had no idea that women with natural red hair have a lower pain threshold than do women with dark hair and, therefore, require a higher dosage of anesthesia than their dark-haired counterparts. So then, how do we transform something that is non-numerical in nature into a numerical datum? How did the scientists acquire data on something they perceived to be red hair? How about pain?

To reiterate, everything can be quantified. Everything has the ability to be quantified so that the numbers we assign to anything can be used in the description, the comparison, and the prediction of information most relevant to the health sciences. Therefore, all that is left for us to learn are the intricacies of different quantification methods and how to most effectively utilize those numbers. To echo a mentor, that is precisely what statistics is: the science of making effective use of numerical data.

The issue that lies at the heart of our redhead example is this: How do we quantify something that is an inherent quality of something? The better question is this: How do we even measure red hair? Is red hair measured by the intensity of color? If so, how red is red? Or is red hair measured by a mutation in the melanocortin-1 receptor gene (MC1R)? What numerical value can be assigned to account for red hair?

This thought experiment can get a little hairy, to say the least. The point that we are attempting to drive home is that there are essentially two methods of quantification we can use to assign a numerical value to anything: measuring and counting. It is simple to see the basic nature of counting as compared to measuring. But this is not to diminish the stringent fact that measurements require some kind of instrument or tool that must abide by the rigors of validity and reliability as described above (see Sect. 3.4, Measurement).

When we do have the ability to actually measure something that is of interest to us via an instrument that produces a numerical value, then we refer to those measures as quantitative data. Intrinsic to quantitative data is the fact that our relative observations or measurements were obtained via a measuring tool or instrument. For example, observations such as height, speed, or blood pressure all use some measuring instrument—a ruler, a speedometer, or a sphygmomanometer, respectively. Quantitative data consist of numbers that represent an amount, and hence—due to the importance of those numbers in and of themselves—these types of data are often referred to as continuous data.

On the other hand, when that which is of interest has neither an inherent numerical value nor a measuring instrument that can produce a numerical value, then the method of quantification is limited only to counting, and the resultant information is rendered as qualitative data. Qualitative data are data that have been quantified based on a certain quality of something that has been observed. Data of this sort consist of words, names, or numerical codes that represent the quality of something.
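The two routes to quantification just described, measuring with an instrument versus counting occurrences of a quality, can be made concrete in a small sketch (the blood pressure readings and hair colors below are invented for illustration):

```python
from collections import Counter
from statistics import mean

# Quantitative (continuous) data: systolic blood pressure readings in mmHg,
# produced by a measuring instrument (a sphygmomanometer). Values are invented.
systolic = [118.5, 122.0, 131.2, 109.8, 125.4]
print(mean(systolic))       # arithmetic on measured amounts is meaningful

# Qualitative (categorical) data: hair color has no measuring instrument,
# so quantification is limited to counting occurrences of each category.
hair_color = ["red", "dark", "dark", "red", "dark", "blonde"]
print(Counter(hair_color))  # category counts, e.g. dark: 3, red: 2, blonde: 1
```

A mean is meaningful for the measured amounts; for the hair colors, quantification stops at the category counts.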
For example, observations such as hair color, socioeconomic status, or pain intensity do not have measurement tools per se but are perceivable qualities that are useful in the health sciences. The best we can do—in terms of quantification—with these qualities is to simply categorize them for what they are and count the number of their occurrences. Thus, it is not uncommon to hear qualitative data referred to as categorical data.

3.5.2 Variables

According to the ancient Greek philosopher Heraclitus, the only thing that is constant in life is change itself. Indeed, that is what makes us humans and the world we live in so unique. No cell, no human, no tree, and no planet are constant. In research, we must account for this differentiation by organizing our data by variables. A variable is a characteristic or property of interest that can take on different values. Similar to the different types of data, there are different types of variables that have within them distinct levels of measurement (see Video 1).

3.5.2.1 Quantitative
At the heart of quantitative data lie two characteristic variables that are respective of what it means for data to be rendered as such. A continuous variable is a variable that consists of numerical values that have no restrictions. Amounts such as body temperature, standardized test scores, and cholesterol levels are all examples of continuous variables. It is noteworthy to mention that the lack of restrictions mentioned is theoretical in essence. For example, measuring a patient's body temperature in °F might be 100.2861…, and it continues ad infinitum. We recognize this theoretical behavior of the numbers we work with by the label we give them (i.e., continuous), but for practical reasons similar examples of numbers with decimals are rounded to the nearest hundredth. A discrete variable, conversely, consists of numerical values that do have restrictions or are isolated. Whole numbers such as household size, number of medications taken per day, and the population size of US college students are all examples of discrete variables. Discrete variables are often referred to as semi-continuous or scalar, as those values can include enumeration (like household size), which (many have argued) are neither wholly quantitative nor wholly qualitative.

Considering variables that exist in the realm of quantitative data, there are also distinct scales of measurement that coincide accordingly. Interval measures are measurements that are separated by equal intervals and do not have a true zero. For example, measuring a patient's body temperature using a thermometer produces readings of temperature along a range of −40°F to 120°F, with ticks separated at intervals of 1°F. This level of measurement does not have a true zero for two reasons: (1) a reading of 0°F does not mean that there is no temperature to read (i.e., it can get colder than 0°F) and (2) a reading of 50°F is not twice the amount of 25°F worth of temperature.6

6 In reality, temperature is a subjective and humanistic per-

On the other hand, ratio measures are measurements that do have true zeros. For example, measuring someone's height using a meter stick produces readings of height along a range of 0–1 m, with ticks reflecting its ratio distance from the point of origin (0 m). In this case, a height can be considered as a ratio measurement because a reading of 0 m essentially means that there is nothing to measure and a height of 1.50 m is twice the amount of 0.75 m worth of length.

3.5.2.2 Qualitative
All qualitative variables can be denoted as categorical variables. A categorical variable is a variable that organizes qualitative data into categories. Well, that was obvious—but we cannot stress enough the importance of being able to distinguish
between measures and counts, quantities and qualities. Moreover, categorical variables also have distinct scales of measurement.

At the most basic level are nominal measurements. A nominal measure is a measure of classification where observations are organized by either class, category, or name. For example, religious affiliation is a nominal measure where participants can identify with Christianity, Islam, Judaism, or other religions. A simple way to remember nominal measures is by looking at the etymology of the word nominal—namely, nominalis, Latin for "name."

Another level of measurement relative to qualitative data is dichotomous measurements. A dichotomous measure is a measure that can take on one of two values. For example, a questionnaire that asks whether food was consumed prior to a medical exam can be answered as either "yes" or "no," which can be coded as 1 and 2, respectively. Or, as with our redhead example, women with red hair can be labeled as 0 and women with dark hair as 1.

We must mention that there are measurement levels that can be identified with both quantitative and qualitative data. An ordinal measure is a measurement made that is reflective of relative standing or order (think: ordinal). Quantitative data that utilize ordinal measurements can be variables such as class standing, where—relative to course grade—students are ranked from lowest to highest grade point average (Fig. 3.9). Conversely, qualitative data that utilize ordinal measures are variables similar to grade point averages, where students are ranked from lowest to highest class standing based upon a letter grade (e.g., A, B, C, and D) (Fig. 3.9). If you notice here, the data mentioned are neither wholly quantitative nor wholly qualitative. Furthermore, we can presume ordinal measures to be exclusive to ranked data—whether quantitative or qualitative. The intricacies of ranked data and their analyses are discussed further in Chap. 7.

The importance of the type of data collected or utilized cannot be stressed enough. Not only does the type of data set the precedent for the specific type of research being done, but it also determines the appropriate statistical analysis techniques that are permitted to be employed. We shall see why this is so in Chaps. 4–6. For now, let us consider all of the concepts discussed and their culmination into research methodology.

Fig. 3.9 Both quantitative (GPA) and qualitative (letter grade) data can utilize ordinal measures (A: 3.67–4.00, B: 2.67–3.33, C: 1.67–2.33, D: 0.67–1.33, F: 0)

3.6 Self-Study: Practice Problems

1. What fundamental question does the methodology portion of a research endeavor ask? What aspects of the study does the answering of this question provide information to?
2. The following list is a mixture of samples and populations. Identify and match the samples to their parent population:
(a) US college students
(b) Stars in the Milky Way Galaxy
(c) Republican Presidents
(d) Female business majors
(e) Female entrepreneurs
(f) Republican congressmen
(g) Arizona college students
(h) Stars in the universe
3. Why do we more frequently measure samples instead of entire populations? Can entire populations be measured?
4. What qualities are fundamental for a good sample? Why is this important to a research study?
5. An investigator interested in studying breastfeeding behavior in her county is in need of
data. Due to her busy schedule and fast-approaching deadline, she takes advantage of the pediatric hospital across the street from her laboratory. She stands outside of the main clinic and surveys pregnant women as they walk in. Is this a form of random sampling? Explain.
6. Researchers from the city's Department of Public Health are conducting a study on vaccination efficacy in their state. After a month of collecting data, the researchers compile 4366 observations. They categorize their data by socioeconomic status and then randomly select 50 observations from each category. What type of random sampling was utilized here?
7. A diabetic patient visits his local physician's office for an annual checkup. For the past 4 months, the patient's handheld glucose meter (which measures blood by pricking his finger) has been reporting 108 mg/dL every day before breakfast—the patient is excited for his physician to see this. At the office, the physician takes a full blood sample and runs it through complex machinery to find his blood glucose levels. The physician returns, slightly disappointed, and reports the patient's blood glucose level to be 120 mg/dL. Assuming the instrument in the physician's office is the gold standard for measuring blood glucose levels, what might be wrong with the patient's blood glucose measuring instrument?
8. Is it more important for an instrument to be reliable or valid? Explain.
9. For each variable listed below, determine whether it is quantitative data or qualitative data:
(a) Socioeconomic status (low, middle, high)
(b) Grade point average (GPA)
(c) Annual income ($)
(d) Graduate schools in the United States
(e) Number of patients in the ER waiting room
(f) Biological sex (male, female)
10. For each of the variables above, determine the type of variable and specific measure if applicable.
(See back of book for answers to Chapter Practice Problems)

Recommended Reading

Corbin JM, Strauss AL. Basics of qualitative research: techniques and procedures for developing grounded theory. 2nd ed. Los Angeles: Sage; 1998.
Kung J, Chiappelli F, Cajulis OO, Avezova R, Kossan G, Chew L, Maida CA. From systematic reviews to clinical recommendations for evidence-based health care: validation of revised assessment of multiple systematic reviews (R-AMSTAR) for grading of clinical relevance. Open Dent J. 2010;4:84–91. https://doi.org/10.2174/1874210601004020084.
Messick S. Validity of psychological assessment: validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am Psychol. 1995;50(9):741–9. https://doi.org/10.1037/0003-066X.50.9.741.
Vandenbroucke JP, von Elm E, Altman DG, et al. Strengthening the reporting of observational studies in epidemiology (STROBE): explanation and elaboration. Int J Surg. 2014;12(12):1500–24.
Wagner C, Esbensen KH. Theory of sampling: four critical success factors before analysis. J AOAC Int. 2015;98(2):275–81. https://doi.org/10.5740/jaoacint.14-236.
West S, King V, Carey TS, et al. Systems to rate the strength of scientific evidence: summary. In: AHRQ evidence report summaries. Rockville, MD: Agency for Healthcare Research and Quality (US); 2002. 47:1998–2005. https://www.ncbi.nlm.nih.gov/books/NBK11930/
4 Descriptive Statistics

Contents
4.1 Core Concepts
4.2 Conceptual Introduction
4.3 Tables and Graphs
4.4 Descriptive Measures
4.4.1 Measures of Central Tendency
4.4.2 Measures of Variability
4.5 Distributions
4.6 Probability
4.6.1 Rules of Probability
4.6.2 Bayesian vs. Frequentist Approach
4.6.3 Z-Transformation
4.7 Self-Study: Practice Problems
…ance) describe the distribution of the data by providing an understanding of the dispersion of the data. These descriptive measures are techniques that aid in the organization and assist in the effective summarization of the data.

We mention distribution often when discussing the measures of central tendency and variability. This chapter helps us understand how the shape of the distribution tells us, as researchers, more about the data. Distributions such as the normal and skewed distributions are discussed further, along with their corresponding characteristics. You will learn that the most central distribution among those listed is the normal distribution, also called the Gaussian or "bell-shaped" curve.

Culminating the chapter is the introduction of probability, which is the likelihood or chance of a particular event occurring. To determine the likelihood that a particular event will occur, we turn to a list of formulae that are essentially the rules of probability. Finally, tying together the topics of distributions and probabilities is the z-transformation.

…tence of uncertainty and the understanding that knowledge is ever-growing.

At the turn of the twentieth century, scientists scrambled to explain and pictorialize a model of our atoms on a quantum level. In 1913, Ernest Rutherford and Niels Bohr introduced the Rutherford-Bohr model, which correlated the behavior of our atoms to that of our solar system. Just as the planets orbit our Sun via gravitational attraction, the electrons orbit the protons via electrostatic attraction (Fig. 4.1). This theory entails that the electrons "orbiting" the protons follow a spherical path that is both continuous and identifiable—similar to our solar system. But this—however nice it may seem—was found not to be the case.

About a decade later, a student of Bohr's, Werner Heisenberg, proposed that we cannot be certain of the exact location of an orbiting electron. Instead, we can only discern the likelihood (probability) of an electron's position relative to the proton it circles (Fig. 4.2). This later became known as the Heisenberg uncertainty principle—a seminal piece of work central to his Nobel Prize of 1932 and, more importantly, to our understanding of quantum mechanics today.

Although theoretical physics is beyond the scope of (nor required for) this book, there is a lesson to be learned. We can argue that this branch of physics attempts to do just what we hoped for: widening certainty and narrowing uncertainty. At the heart of theoretical physics must lie some mathematical model that is taken advantage of in order to explain, rationalize, and predict these naturally occurring (and uncertain) phenomena. What might that be?

You guessed it, statistics! Statistical mechanics is critical for any physicist; it provides tools such as probability theory to study the behavior of uncertainty in mechanical systems. This is just one example of the long reach of statistics' utility, and it is why statistics is often set apart from the other investigative sciences. We can further our understanding of the role of statistics to be the effective use of numerical data in the framework of uncertainty. In other words, statistics must deal with uncertainty because the data we obtain are from the same world that is so inherently uncertain and, therefore, must contain a degree of uncertainty within them as well.

Furthermore, in statistics, uncertainty encompasses more than a seeming lack of knowledge. In fact, the root of uncertainty in statistics is a topic we have recently become quite familiar with—namely, variability. Whether it is referred to as variability, variation, or just individual differences, the application of a tool like statistics is useful for the systematic organization, analysis, and interpretation of data in spite of uncertainty.

Our interest in uncertainty, then, is compounded when we begin to discuss biostatistics. Indeed, more fearful than an unfortunate diagnosis is an uncertain prognosis. To the best of our ability, we wish to minimize any uncertainty, particularly when it comes to patient health and healthcare research. The ideal option is to know everything about everything—to know the whole truth (whatever that means!). The second best possible option is to understand the fundamental concepts behind ways to handle variability and, its corollary, uncertainty. In accomplishing this, we start with an appreciation of the most basic concepts underlying statistical thought.
…varying and uncertain world. Indeed, more damaging than the effects of uncertainty on our study (or on anything, for that matter) is ignorance of that uncertainty. Descriptive statistics highlights the uncertainty inherent in our data in the form of variability. Descriptive statistics bears the fruits of statistical tools required to neatly describe the observations we make. The first statistical tool we mention utilizes tabulation, a method of systematically arranging or organizing data in tabular form (a table). Consolidating data into tables is one of the most practical ways of organization, especially when it comes to large sets of data.

Assume you have just collected data on the systolic blood pressure of 50 of your college peers at random, shown in Table 4.1. Although the data are compiled together, the lack of organization in the wide array of data is quite overwhelming. Moreover, one would rarely present data to a superior (i.e., principal investigator or research teacher) in this manner. Even if the data were presented in this unorganized fashion, what type of beneficial information could we glean from them? Nothing much. In fact, the extensive detail highlights more unimportant information than important. The least one can do in this instance is to order the array of data numerically from lowest to highest (Table 4.2). Yet still, there is little added utility to this method other than its aesthetic pleasure to the eye. A more effective use of tabulation is organization by frequency.

A frequency table is a method of tabulation that organizes data by reflecting the frequency of each observation's occurrence relative to the whole set. The compact and coherent aspect of a frequency table facilitates the understanding of the distribution of a specific set of data—for this reason, a frequency table is often referred to as a frequency distribution. The presentation of data in this manner is allotted for both quantitative and qualitative data, although the latter form of data has a few restrictions described in additional detail below. Table 4.3 shows a frequency table of the data from Table 4.1.

1. The first column is labeled as our variable of interest, "systolic BP," and was configured by first identifying the extreme values (smallest and largest) from the data set and then enumerating the values in between the extremes in numerical order.
2. The second column, labeled "f," represents the frequency of occurrence within each of those classes of systolic BP from the data set. This can be obtained via a simple tally or count of each class's occurrence through the raw data.
3. Next, the sum of each class's frequency should be equal to the total number of observations in the data set. Thus, when a frequency table organizes data by classes of single values as such, it is viewed as a frequency table for ungrouped data.

Now, the differences in organization and utility comparing Table 4.3 to Table 4.1 come to light. Not only are the data more organized, but there is also a
Table 4.1 Randomly collected systolic blood pressures of 50 college students

Systolic blood pressure
103  98 105 107  96
116  97 118 114 116
122 126 111 114 106
113  98  94 124 122
 98  96 132 125  90
115 114 132 133 107
136 132 140 104  99
 94 134 110 137 105
137  95  98 119 130
126  98 140  96 103

Table 4.2 Randomly collected systolic blood pressures of 50 college students ordered/sorted from lowest to highest

Systolic blood pressure
 90  94  94  95  96
 96  96  97  98  98
 98  98  98  99 103
103 104 105 105 106
107 107 110 111 113
114 114 114 115 116
116 118 119 122 122
124 125 126 126 130
132 132 132 133 134
136 137 137 140 140
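The tally behind an ungrouped frequency table can be sketched in a few lines of Python (a sketch using only the standard library; the list reproduces the 50 readings of Table 4.1):

```python
from collections import Counter

readings = [103, 98, 105, 107, 96, 116, 97, 118, 114, 116,
            122, 126, 111, 114, 106, 113, 98, 94, 124, 122,
            98, 96, 132, 125, 90, 115, 114, 132, 133, 107,
            136, 132, 140, 104, 99, 94, 134, 110, 137, 105,
            137, 95, 98, 119, 130, 126, 98, 140, 96, 103]

# Frequency table for ungrouped data: one class per distinct value,
# listed in numerical order from the smallest to the largest extreme.
freq = Counter(readings)
for value in sorted(freq):
    print(value, freq[value])

# The frequencies must sum to the total number of observations (50).
print("Total =", sum(freq.values()))
```

Sorting the distinct values reproduces the first column of Table 4.3, and the counts reproduce its "f" column.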
4.3 Tables and Graphs 47
greater deal of important information to be gleaned from this type of systematic organization. For instance, we can better identify the highest, lowest, and most common frequencies of blood pressure, which become important findings in the context of cardiovascular disease. It becomes even more helpful if you are tasked with comparing your data set to a colleague's. Hence, we have a much better understanding and a more organized report (to present to your superior) regarding the systolic blood pressure of your college peers when using a frequency table.

The significance of efficiently describing statistics increases manyfold when the amount of data increases. What if the size of the data to be collected is larger than 50—say, 500 observations? Surely the convenience of tabulation would lose its pragmatic nature if we applied the same organization technique, in terms of classification, from above. That is, if there are more than 10–15 different possible values to be classified singularly, then we must take advantage of a more refined method of tabulation by grouping the classes into intervals.

A frequency table for grouped data organizes observations by interval classification; this differs from a frequency table for ungrouped data, which organizes observations by classes of single values. This refinement is reflected by yet another level of organization, conveyed by the grouping of data into intervals. Table 4.4 depicts a frequency table of weights from 500 college students. Note how the weights in the table are not singly defined but are classified by intervals of 10; yet the table still provides us with useful and important information regarding the frequency of each category's occurrence in an organized manner. Table 4.5 presents a few simple rules to follow for constructing frequency tables for grouped data.

Table 4.3 Frequency table of the systolic blood pressures from Table 4.1

Systolic BP   f
 90   1
 94   2
 95   1
 96   3
 97   1
 98   5
 99   1
103   2
104   1
105   2
106   1
107   2
110   1
111   1
113   1
114   3
115   1
116   2
118   1
119   1
122   2
124   1
125   1
126   2
130   1
132   3
133   1
134   1
136   1
137   2
140   2
Total = 50

Table 4.4 Frequency table for the weight of 500 college students in lbs

Weight in lbs   f
100–109   59
110–119   54
120–129   48
130–139   50
140–149   37
150–159   49
160–169   51
170–179   52
180–189   51
190–199   45
200–209    4
Total = 500

Table 4.5 Rules for constructing tables for grouped data

Four rules for constructing tables
1. Observations must fall in one and only one interval. Groups cannot overlap.
2. Groups must be the same width. Equal-sized intervals.
3. List all groups even if the frequency of occurrence is zero. All groups should be ordered from lowest to highest.
4. If groups have zeros or low frequencies, then widen the interval. The intervals should not be too narrow.
There is still more advantage to be taken from a frequency table by performing a few more calculations. Table 4.6 is an extension of the earlier example of college students' weights from Table 4.4. Here we see the addition of four new columns—namely: frequency percent, cumulative frequency, cumulative frequency percent, and interval midpoint.

• Frequency percent (f%) represents the frequency of each class (or interval) relative to the total frequency of the whole set, expressed as a percentage.
  – The sum of each category's frequency percent should ideally equal 100% but can realistically be anywhere between 99 and 101%, as there may be errors in rounding.
• Cumulative frequency (cf) represents the total number of occurrences in each class, including the sum of the occurrences from the classes before it.
• Cumulative frequency percent (cf%) represents the cumulative frequency of each class relative to the total frequency of the whole set, expressed as a percentage.
  – This calculation is particularly useful in describing the relative position of a particular class of observations within the whole data set, often viewed as percentiles.
Table 4.6 Complete frequency table of 500 college students and their recorded weight in pounds (lbs)

Weight in lbs   f     f%                       cf    cf%
100–109         59    59/500 × 100 = 11.8%     59    59/500 × 100 = 11.8%
110–119         54    54/500 × 100 = 10.8%    113   113/500 × 100 = 22.6%
120–129         48    48/500 × 100 = 9.6%     161   161/500 × 100 = 32.2%
130–139         50    50/500 × 100 = 10%      211   211/500 × 100 = 42.2%
140–149         37    37/500 × 100 = 7.4%     248   248/500 × 100 = 49.6%
150–159         49    49/500 × 100 = 9.8%     297   297/500 × 100 = 59.4%
160–169         51    51/500 × 100 = 10.2%    348   348/500 × 100 = 69.6%
170–179         52    52/500 × 100 = 10.4%    400   400/500 × 100 = 80%
180–189         51    51/500 × 100 = 10.2%    451   451/500 × 100 = 90.2%
190–199         45    45/500 × 100 = 9%       496   496/500 × 100 = 99.2%
200–209          4     4/500 × 100 = 0.8%     500   500/500 × 100 = 100%
Total = 500           ≈ 100%
Note that there are no calculated totals in cumulative frequency
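The f%, cf, and cf% columns of Table 4.6 can be derived mechanically (a sketch in Python, using the interval frequencies from Table 4.4):

```python
intervals = ["100-109", "110-119", "120-129", "130-139", "140-149", "150-159",
             "160-169", "170-179", "180-189", "190-199", "200-209"]
f = [59, 54, 48, 50, 37, 49, 51, 52, 51, 45, 4]

n = sum(f)                      # total number of observations (500)
cf = 0
for interval, freq in zip(intervals, f):
    cf += freq                  # running (cumulative) total
    f_pct = 100 * freq / n      # frequency percent
    cf_pct = 100 * cf / n       # cumulative frequency percent
    print(f"{interval}  f={freq}  f%={f_pct:.1f}  cf={cf}  cf%={cf_pct:.1f}")

# Built-in accuracy check: the final cf must equal n, and the final cf% 100%.
assert cf == n
```

The closing assertion is exactly the accuracy check described in the text: the last interval's cf and cf% must land on the total number of observations and 100%.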
• Midpoint refers to the "middle" value between the endpoints of the specific class interval.
  – This is limited to frequency tables for grouped data, where the middle point of the interval can be found by simply averaging the lower and upper points of the specific interval. The importance and usage of this become apparent in graphing (see next page).

Likewise, there are no totals to calculate for either the cumulative frequency or the cumulative frequency percent columns. Instead, we can gauge the accuracy of our calculations by confirming that the values in the final interval of the table for cf and cf% are equal to the total number of observations and 100%, respectively.

The importance of the type of data we are working with should not be forgotten, as that specific criterion sets the precedent for what we can and
cannot do with the data. We are guilty of being somewhat biased, as all of the calculations mentioned above can be taken advantage of when using tabulation to describe quantitative data. However, when working with qualitative data, we can utilize only a few of the characteristic traits of tabulation mentioned above. This is simply due to the nature of qualitative data; the information to be obtained from categories, counts, and/or names is limited when we attempt to apply the same mathematical procedures as we did toward quantitative data.

Hence, when it comes to creating tables for qualitative data, we do not create intervals, calculate cumulative frequency, or calculate the midpoint of the interval. Conversely, and depending on the specific type of qualitative variable, we can still calculate frequency, frequency percent, and cumulative frequency percent. The contingency on the specific type of data is mentioned particularly for the calculation of cumulative frequency percent. When working with qualitative data, it is customary to calculate the cumulative frequency percent only when the data to be tabulated are ordinally measured (i.e., an ordinal variable). This is primarily due to the fact that percentiles are meaningless if there is no order to the data. Tables 4.8, 4.9, and 4.10 are three examples of frequency tables for different measures of qualitative data (see Video 2).

Table 4.8 Tabulates a dichotomous variable

Do you brush your teeth before you go to bed?   f
Yes      59
No      145
Total   204

Table 4.9 Tabulates an ordinal variable

Class standing   f
Freshman    116
Sophomore   102
Junior      153
Senior      129
Total       500

Table 4.10 Tabulates a nominal variable

Race                                         f
American Indian/Alaskan Native               9
Asian                                       11
Black or African American                   13
Native Hawaiian or other Pacific Islander    7
White                                       24
Total                                       64

Let us be the first to congratulate you on passing a milestone on your journey toward being an efficient and conscientious researcher. Congratulations! Seriously, although tabulation may seem relatively simple, the importance of presenting and reporting data in an organized manner through tabulation is a critical first step in the effectiveness of any scientific study. As should be the case in any scientific undertaking, we must always be concise, direct, and precise in presenting data obtained from a variable and uncertain world. Moreover, we have also opened the door to yet another tool we can use in descriptive statistics—namely, graphs.

Graphs represent yet another statistical tool available for the clear and concise description of data. Similar to the construction of frequency tables, graphs provide the means of organizing and consolidating the inevitable variability within data in a visual manner. Think of all of the instances (i.e., advertisements, class presentations) where graphs played a vital role in visually imparting a piece of information or underlying message to the viewer as intended by the presenter. Although numerous forms of graphing can be utilized within descriptive statistics, below we outline a few of the most common.

The most common forms of graphs used to statistically describe data are histograms and bar charts. Up until this moment, it is possible that many of us believed that a histogram and a bar chart (or bar graph) were synonymous with each other. Unfortunately, this is a grave misconception. In fact, the primary difference between the two is critical in understanding the data at hand. As was the case before, the nature of the data that we work with (i.e., quantitative vs. qualitative) sets the precedent as to which type of graphing we can utilize.

Both a histogram and a bar chart can be referred to as "bar-type graphs" that utilize Cartesian coordinates and bars to summarize and organize data. A histogram is a type of bar graph used for quantitative data, where the lack of gaps
between the bars highlights the continuity of the data (hence, continuous data/variables). On the other hand, a bar chart is a type of bar graph used for qualitative data, where the bars are separated by gaps to highlight the discontinuity of the data. The primary difference between these bar-type graphs is that the bars in a histogram "touch" one another, whereas the bars in a bar chart do not touch one another and instead are separated by gaps.

The easiest way to construct either of these graphs is by transforming the already organized data provided by a frequency table. In both bar-type graphs, the x-axis represents the intervals or classes of the variable, and the y-axis represents the frequency. Figure 4.3 is a histogram created from the quantitative data presented in Table 4.4; Fig. 4.4 is a bar chart created from the qualitative data presented in Table 4.10.

Satisfy yourself that the graphing mechanism is more useful than the simple presentation of raw data at concisely, directly, and precisely describing the data. Table 4.11 provides a step-by-step protocol for constructing either of the two graphs.

Graphically speaking, there is one other graph that we can construct as yet another statistical
[Fig. 4.3 Histogram of the number of students (y-axis) in each weight interval, 100–109 through 200–209 lbs (x-axis), from the data in Table 4.4]

[Fig. 4.4 Bar chart of the race frequencies from Table 4.10: American Indian/Alaskan Native, Asian, Black or African American, Native Hawaiian or Other Pacific Islander, White]
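The touching-versus-gapped distinction can be illustrated even without a plotting library (a loose, purely illustrative sketch; the `draw` helper is hypothetical, the '#' runs stand in for bars, and counts are scaled down for display):

```python
def draw(labels, freqs, gapped):
    """Render horizontal bars; gapped=True separates bars by blank lines."""
    lines = []
    for label, f in zip(labels, freqs):
        lines.append(f"{label:>10} | " + "#" * (f // 5))
        if gapped:
            lines.append("")   # the gap that marks discontinuous, qualitative data
    return "\n".join(lines)

# Histogram-like rendering (quantitative intervals, no gaps between bars):
print(draw(["100-109", "110-119", "120-129"], [59, 54, 48], gapped=False))
print()
# Bar-chart-like rendering (qualitative categories, separated by gaps):
print(draw(["Asian", "White"], [11, 24], gapped=True))
```

The point is only the layout: adjacent rows mimic a histogram's touching bars, while the inserted blank rows mimic a bar chart's gaps.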
[Fig. 4.5 Frequency polygon drawn over the histogram of the weight intervals 100–109 through 200–209 lbs]
tool used in descriptive statistics. This additional graph is a modification of the traditional histogram. Furthermore, this variation is critical when we begin to introduce distributions in the following section. A frequency polygon is a refined histogram that has a line graph added within it. Just as a histogram is limited to use in a quantitative context, so too, by extension, is a frequency polygon. But it would be repetitive to outline the protocol for constructing a frequency polygon, since we can surmise that the first step in its construction is a histogram. So then, how is a frequency polygon constructed?

A histogram is transformed into a frequency polygon by simply connecting the peak of each bar within the histogram by a line. Recall from the section on frequency tables, specifically for quantitative (grouped) data, that we assigned a column for the calculation of each interval's midpoint. Thus, at the top of each bar within the histogram, a dot is placed at the middle (to represent the midpoint), its x- and y-coordinates are labeled, and then a line is drawn connecting each point. The frequency polygon depicted in Fig. 4.5 is a transformation of the histogram shown in Fig. 4.3. We can grasp the reason why it is referred to as a frequency polygon once the bars are removed and the area under the line is shaded (Fig. 4.6).

It is worth mentioning (again) that you need not initially create a histogram to transform into a frequency polygon. Indeed, a frequency polygon can be composed directly from a frequency table, provided that the frequency table has a column for the calculated midpoint. This also omits the necessity of drawing and then erasing the bars of the histogram, as long as the x- and y-coordinates of the dots on the line are labeled appropriately (x-coordinate, interval's midpoint; y-coordinate, frequency). We briefly return to the importance of frequency polygons in Sect. 4.4 (see Video 3).
[Fig. 4.6 Frequency polygon with the histogram bars removed: number of students (y-axis) vs. weight in pounds (x-axis)]
4.4 Descriptive Measures

Now that we are equipped with a few of the statistical tools necessary for descriptive statistics, we must also begin a discussion of the mathematical techniques we can utilize to describe our data and the inevitable variability they are fortified with. The techniques we are about to describe not only aid in the organization of our data but—even more—assist in the effective summarization of our data in a direct, precise, and concise manner.

Consider the summary of a book you read on the internet the night before a book report you have been procrastinating on is due. The purpose of the summary is to give you an overarching understanding of the book, its characters, and (hopefully) the overall message your teacher intended for you to learn. In essence, that is precisely what the summarization techniques relative to statistical data intend to do as well. As we shall see, the processes contained within descriptive statistics go beyond tabulation and graphing; we learn that data can be described by averages and variability.

4.4.1 Measures of Central Tendency

What do we think of when we think of the word "average"? Well, for starters, we usually think of the most common or frequently occurring event. We also tend not only to use the word average colloquially but to read and see it being used on a daily basis. We often even refer to the average of something without actually mentioning the word itself. Though we may not consciously realize its usage, it is one of the most efficient ways to describe whatever it is that requires describing.

For example, when returning from a vacation, our friends and family usually ask: "How was the weather?" Now, we don't usually go hunting for a 10-day weather report that outlines the temperature of each day in order to provide an answer to the simple question. Instead, we often respond with a general description of how the overall temperature was, or we provide a single temperature that is about the same as the distribution of temperatures during our stay. In reality, all we are doing is providing an average—whether precise or not—of whatever it is we are attempting to describe or summarize.

As useful as averages are in our daily life, they are even more useful when it comes to describing data. In statistics, the techniques we use to describe data using averages are referred to as measures of central tendency. Measures of central tendency are techniques used to describe how the center of the distribution of data tends to behave. That is, we use these specific measures to help us summarize the average behavior of our data, which happens to lie in the center of the distribution.

The plural word "measures" implies that there is more than just one calculation of the average. Any layperson may be familiar with how to mathematically calculate the average of a set of numbers. But there is more than just this single calculation of the average. The measures of central tendency include the mean, median, and mode—the calculations and meaning of each are provided below, along with a comprehensive example utilizing all measures of central tendency within the single data set shown in Fig. 4.7.

• Mean—synonymous with the arithmetic mean, refers to a form of average calculation that is the sum of the scores from a data set divided by the total number of scores in that data set.

  Population mean: μ = ΣXi / N        Sample mean: x̄ = ΣXi / n

• Median—refers to a form of average calculation that is represented by the middle number, given that the data are organized in numerical order (think: meedle number).
  – Contingency: if the number of data points is odd, then count off from both ends toward the center, and you will arrive at the median. If the number of data points is even, locate the two middle numbers, and calculate the arithmetic mean of the two to arrive at the median.
• Mode—refers to a form of average calculation that is represented by the most frequently occurring number within a data set.
  – Contingency: there can exist many values within a data set that represent the mode, given that the frequency of their occurrences is identical and no other observation occurs more frequently.

Data:    14  9  1  18  4  8  8  20  16  6
Ordered:  1  4  6  8  8  9  14  16  18  20
Mean = 104/10 = 10.4    Median = (8 + 9)/2 = 17/2 = 8.5    Mode = 8

Fig. 4.7 Measures of central tendency calculated for the example numbers above. Notice that the first step is to order the data from lowest to highest value. See the contingency for calculating the median for an odd number of data points.

If the description of the central nature of these measures relative to the distribution of data is not yet clear, then there is no need for panic; the following section should make more sense. Additionally, we must pause, yet again, for the ultimate deciding factor as to the usability of these measures—namely, the nature of the data. Let us begin with the easiest one first, namely, quantitative data. All measures of central tendency are permissible when describing quantitative data. On the other hand, when it comes to describing qualitative data, only the mode and (seldom) the median are permissible.

For starters, it should be evident why a mean calculation is never permitted when working with qualitative data. Why? Consider the nominal variable of gender. Say your biostatistics class is composed of 16 females and 14 males. Okay, great—so what is the average gender in your class? We will wait… there is no valid answer. Let us go one step back: what are 16 females plus 14 males equal to? 30… what? People? No, that cannot be the summative gender. Even two steps back: what is one female plus one male equal to? Two femalemales? Silly, but no, that cannot be it either. The point that we are attempting to get at is this: due to the qualitative nature of the data, meaningful (even simple) mathematical calculations are not always appropriate. As mentioned in the previous chapter, the best we can do with qualitative or categorical data is simply to count them or record their frequencies.
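The three measures can be checked against the worked example of Fig. 4.7 with Python's standard library (a sketch):

```python
import statistics

data = [14, 9, 1, 18, 4, 8, 8, 20, 16, 6]  # the ten values from Fig. 4.7

mean = statistics.mean(data)      # sum of scores / number of scores
median = statistics.median(data)  # middle value once the data are ordered
mode = statistics.mode(data)      # most frequently occurring value

print(mean, median, mode)  # 10.4 8.5 8
```

Because there is an even number of observations (10), `statistics.median` applies the same contingency as the text: it averages the two middle values, (8 + 9)/2 = 8.5.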
Thus, the measures of central tendency we are left with are the median and mode. As discussed above, the median is the middle number in the data set, given that the data are listed in numerical order (i.e., smallest to largest). This is problematic when it comes to qualitative data because not all qualitative data (like gender or ethnicity) have the potential to be ordered. Hence, the calculation of the median is only appropriate when we have an ordinal measure of qualitative data5 (see Chap. 3, Sect. 3.4.2.2). Lastly, with a sigh of relief, we can say that the calculation of the mode is permissible for all qualitative data, as it is simply the description of the most frequently occurring observation. Table 4.12 provides a quick tool that delineates which measures of central tendency are appropriate relative to the nature of the data at hand.

Table 4.12 Measures of central tendency checklist

Measures of central tendency   Quantitative   Qualitative
Mean                           ✓              ×
Median                         ✓              ✓*
Mode                           ✓              ✓*
* indicates contingencies

5 This is contingent on there being an odd number of observations. If there are an even number of observations and the middle two observations are not identical, then the median cannot be calculated (see the median definition and its contingencies above).

4.4.2 Measures of Variability

As outlined in the introduction of this chapter, variability is a chief marker of the uncertainty contained within the data itself and its source, namely, the world. But it is better to be ignorant of the uncertainty in the world than to be knowledgeable of uncertainty without a way of measuring it. Luckily, we are both aware of the existence of uncertainty and have the ability to measure it in the form of variability. Moreover, contained within its measurement is the implication of a clear and concise description of the variability contained within our data. The ability to understand and summarize the variation among data provides important and meaningful information regarding many aspects of the data.

In brief, the measures of variability provide an understanding of the overall distribution and dispersion of quantitative data. It is true that the amount by which each value contained within a data set differs or varies from the others provides an understanding of how spread out the data are. Take a moment to reflect on the words we use, such as distribution, dispersion, and spread; they are not only essentially synonymous with one another but also fundamentally contain within them the idea of variation. Below we provide the details of four distinct, yet interrelated, measures of variability that are critical to descriptive statistics.

We begin with the simplest measure of variability: range. The range is the distance between the highest and lowest values in the data. Of course, there must be numerical order to the data before we can calculate the range—signifying that the data must be quantitative in nature. However simple the calculation of range may be, its ability to provide meaningful and useful information regarding the distribution of data is limited. This is primarily due to outliers, which are observations within a data set that significantly differ from the other observations.

Outliers can be the consequence of numerous factors, such as erroneous measurements or observations obtained from unrepresentative samples; they are typically found among the lower and/or upper distribution extremes. Regardless of the causes, outliers pose a threat to the calculation of range, as they provide a deceiving description of the variability contained within our data. Thus, in order to prevent deception, we introduce a calculation of range that is much less vulnerable to the potentially damaging effects of outliers, which is also the second measure of variability explored here.

Interquartile range (IQR) refers to the range of the middle of our data, given that the
data have been divided into quarters. Assuming the data are in numerical order, the quarters are split by Q1, Q2, and Q3 (Fig. 4.8). The second quartile (Q2) is the same as the median of the data set. Once Q2 is determined, we can visualize the first and third quartiles (Q1 and Q3, respectively) to be the "medians" of the first and third quarters, or the numbers to the left and right of the real median (Q2). After isolating the quarters, all that is left is determining the distance (range) that is between (inter) the quarters or quartiles ("IQR"). Hence, we can remember the formula to be:

Interquartile range (IQR): IQR = Q3 − Q1

Figure 4.9 provides a brief guideline along with an example for the calculation of IQR.

[Fig. 4.8 Illustration of the interquartile range (IQR): Q1, Q2, and Q3 dividing the ordered data]

Next, we discuss the two measures that lie at the heart of variability: standard deviation and variance. These are the most common and useful measures of variability used within scientific research. They not only adequately describe the variability contained within quantitative data but also provide credence to many of our statistical analyses, their interpretations, and their applica-
Fig. 4.9 Step-by-step
procedure on how to
calculate the Interquartile Range (IQR)
interquartile range (IQR) IQR = Q3 – Q1
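Before moving on, the quartile-splitting procedure described above can be sketched in code. This example is not from the book; it uses the "medians of the halves" convention (Q2 excluded from both halves), and the data set is hypothetical. Note that several quartile conventions exist, so other software may give slightly different values.

```python
# Sketch (not from the book): quartiles via the "medians of the halves"
# convention described above; other quartile conventions exist.

def median(values):
    """Median of an already-sorted list."""
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2

def iqr(data):
    """Return (Q1, Q3, IQR), excluding the median (Q2) from both halves."""
    s = sorted(data)           # the data must be in numerical order
    n = len(s)
    lower = s[: n // 2]        # observations to the left of Q2
    upper = s[(n + 1) // 2 :]  # observations to the right of Q2
    q1, q3 = median(lower), median(upper)
    return q1, q3, q3 - q1

# Hypothetical data set, purely for illustration:
print(iqr([1, 3, 5, 7, 9, 11, 13]))  # (3, 11, 8)
```

Because the IQR discards the extremes entirely, an outlier in either tail changes the range but leaves Q1, Q3, and the IQR untouched.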
4.4 Descriptive Measures 57
Fig. 4.13 Steps for obtaining the standard deviation using the table method

formulae) is the subtraction of each observation point from the mean. Next, the value of that difference is squared (²), and after doing this for each observation, the products are summed together (∑). This method is referred to as the sum of squares (SS). The final step in obtaining the standard deviation—of either the population or the sample—after obtaining the sum of squares is division by their respective denominators.⁷ Figure 4.13 shows these steps in an easy-to-follow table method.

⁷ Understanding the difference in denominators for population (N) and sample (n − 1) is important for inferential statistics, expanded on in Chaps. 5 and 6.
4.5 Distributions

Fig. 4.17 Bell-shaped frequency polygon (x-axis: Hemoglobin Levels (Hb), 3.0–14.0; y-axis: Number of Patients)
Fig. 4.18 [uncaptioned panel: the hemoglobin frequency polygon smoothed into a bell-shaped curve]
witness our original crooked distribution naturally attain that smooth bell-shaped curve (Fig. 4.18). This bell-shaped curve is what we refer to as a normal distribution.

A normal distribution is a naturally occurring phenomenon that depicts the distribution of data obtained from the world as a bell-shaped curve, given a sufficient number of collected observations. There are multiple qualities of a distribution that render it a normal distribution—but that is not to say that there is anything normal about it, per se.

The normal distribution, also referred to as the Gaussian distribution, was first introduced by the highly influential German mathematician Johann Carl Friedrich Gauss in the late eighteenth century. Gauss described the qualities of this theoretical distribution as a bell-shaped curve that is symmetrical at the center, with both of its tail ends stretching to infinity, never touching the x-axis. Moreover, the center of a normal distribution is where the mean, median, and mode are all located (i.e., mean = median = mode)⁸—hence, measures of central⁹ tendency. Additionally, the unit of measurement for the spread of the distribution on the x-axis is standard deviation—hence, measures of variability or dispersion. It is also critical to note (or reiterate) that the normal distribution is based on quantitative data and continuous variables.

There are still other characteristics of a normal distribution that are important to understand. Recall from a few paragraphs above that the frequency polygon from the sample data did not necessarily smooth out until we increased the size of our data. By continuously increasing the size of our sample (n), we began describing more the population of our observations (N), rather than the sample. Thus, we say that normal distributions are primarily observed when describing the parameters of a population.¹⁰ There are an infinite number of normal distributions that, in theory, can occur depending on the specific population we are to describe. For this reason, a short-hand method of labeling any normal distribution relative to its specific parameters is N(μ, σ). Also, the total area under a normal distribution is equal to one, the reasoning of which is explained further below in Sect. 4.5.3. Table 4.13 summarizes the important qualities of a normal distribution.

Not only are there different types of normal distributions, but there are also different types of

⁸ Depending on the size of the data, these measures need not be exactly equal to one another, only relatively close, in order for a normal distribution to be observed.
⁹ Notice, now, how the measures of central tendency are essentially a description of the distribution of data. The calculated mean tends to fall in the center, the median—by definition—is in the middle, and the mode is the observation that occurs most frequently, which is the highest bar in a frequency polygon and later the peak in the normal distribution.
¹⁰ This is not to make normal distributions exclusive to populations. Sample data may very well be normally distributed as well, the reasoning for which we save for the next chapter under the central limit theorem.
of distribution with two modes is referred to as a bimodal distribution, while similar distributions with more than two modes are referred to as polymodal distributions (Fig. 4.23).

Fig. 4.22 Bimodal distribution [two peaks, each labeled Mode]
Fig. 4.23 Polymodal distribution

4.6 Probability

One of the most popular techniques used to tackle uncertainty is probability theory. Probability refers to the likelihood or chance of a specific event occurring. We—probably—did not need to provide a definition of probability; chance, odds, likelihood, and possibility are all synonymous with probability and its inherent concept. Whether it is determining the outfit of the day based on the chance of rainy weather or the likelihood of a specific treatment being effective for a specific patient, the theory of probability is used across the spectrum of our daily activities.

Moreover, probability joins the constant battle in widening certainty and narrowing uncertainty. The fact that we are able to quantify probability and its associated bearing on uncertainty is, in essence, a function of descriptive statistics.

Although there is an entire branch of mathematics devoted to the theory of probability, below we provide the fundamental axioms and rules of probability relative to statistics in the health sciences. Following tradition, we begin with the flipping of a coin to introduce the ultimate concept of probability.

Assuming we have a fair coin, how is the likelihood of flipping the coin and getting a head expressed mathematically? The most basic probability formula is in the form of a fraction or proportion. Figure 4.24 shows a simple fraction where the numerator represents the number of times our event of interest (head) occurs from the set (the coin) and the denominator represents the total number of occurrences of all possible events within the entire set (head and tail). Thus, the probability of flipping a coin and getting a head is ½ or 0.50. Multiplication of this proportion by 100 results in a percentage—what we commonly call a 50% chance.

Along with this basic introduction comes the rudimentary understanding of the fundamental premise—or axiom (Andrey Kolmogorov 1965), as it were—of probability theory. The probability of any event occurring is a nonnegative real number. From just this first axiom, we can deduce another important concept of probability. The probability of any event occurring, written as P(E), is bound by zero and one—meaning that the likelihood of the event of interest (E) occurring ranges from 0% (absolutely not happening) to 100% (absolutely happening). Recall that probability is essentially a fraction that ranges from 0 to 1, in which its transformation to a percentage requires multiplication
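Returning to the coin example, a small illustrative simulation (not from the book) contrasts the theoretical fraction with an observed relative frequency; both stay bounded by 0 and 1, per the first axiom:

```python
import random

# The basic fraction for a fair coin: one favorable outcome (head) out of
# two possible outcomes (head, tail).
p_head = 1 / 2
print(p_head * 100)  # 50.0, i.e., "a 50% chance"

# Illustrative simulation: over many flips the observed relative frequency
# settles near the theoretical probability, and any probability estimate
# stays bounded by 0 and 1, per the first axiom.
random.seed(1)
flips = [random.choice("HT") for _ in range(100_000)]
estimate = flips.count("H") / len(flips)
assert 0 <= estimate <= 1
print(round(estimate, 2))  # close to 0.50
```

The simulated relative frequency foreshadows the frequentist view of probability discussed later in this chapter.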
peanut-center)? Now we do have an instance where the events are not mutually exclusive; there is a singular occasion where both events are able to occur simultaneously (i.e., a yellow chocolate candy with a peanut-center). For this, we must introduce a refinement to the addition rule.

Addition Rule for Non-Mutually Exclusive Events:
P(A or B) = P(A) + P(B) − P(A and B)

This refinement to the addition rule considers events that are not mutually exclusive, in which the original addition rule is joined with the multiplication rule. Notice that we are still utilizing the addition rule—meaning we are still interested in determining the probability of any single event occurring among several other events. This modification simply considers the fact that the events have the ability to occur together (P(A and B)) and are not mutually exclusive.

Thus, returning to the above example, the probability of obtaining either a piece of candy that is yellow or a piece of candy that has a peanut-center, P(yellow or peanut-center), must have the probability of obtaining a yellow chocolate candy with a peanut-center, P(yellow and peanut-center), removed (subtracted) from the occurrence of each singular event: P(yellow or peanut-center) = P(yellow) + P(peanut) − P(yellow and peanut-center) = (5/10) + (5/10) − (3/10) = 7/10, or 70%. Figure 4.25 shows a series of Venn diagrams that depicts the conceptual nature of mutual exclusivity relative to the addition rule in this example.

Did we miss something? As we were familiarizing with the multiplication rule, we did not pose the question of what to do if our events were not independent of each other. What if the probability of event A occurring was dependent on the probability of event B occurring? Events are considered to be dependent when the likelihood of one event's occurrence affects the likelihood of the other event's occurrence. In order to calculate events that share this relationship, we must turn to a third rule of probability.

Conditional Rule:
P(B|A) = P(A and B) / P(A)

The conditional rule is the probability of a second event occurring (B) given the probability of the first event occurring (A). This can also be considered a refinement to the multiplication rule when the occurrences of the events of interest are dependent on each other. The vertical bar (|) in the equation represents the dependence of the two events A and B, in which it is read as "given." For example, what is the probability of examining a color-blind patient, given that the patient is male? We can express this problem as P(CB|M), where the probability of a color-blind male, P(CB and M), is 8% and the probability of the next patient being a male, P(M), is 35%. Thus, by division, the probability of examining a color-blind patient, given that the patient is a male, is approximately 23%.
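Both rules can be checked numerically. The sketch below simply plugs in the candy and color-blindness figures quoted above; note that 0.08/0.35 works out to roughly 23%:

```python
# Plugging the quoted figures into the two rules (values from the examples above).

# Addition rule for non-mutually exclusive events:
#   P(A or B) = P(A) + P(B) - P(A and B)
p_yellow, p_peanut, p_both = 5 / 10, 5 / 10, 3 / 10
p_yellow_or_peanut = p_yellow + p_peanut - p_both
print(p_yellow_or_peanut)  # 0.7, i.e., 70%

# Conditional rule:
#   P(B | A) = P(A and B) / P(A)
p_cb_and_m = 0.08  # P(color-blind and male)
p_m = 0.35         # P(male)
p_cb_given_m = p_cb_and_m / p_m
print(round(p_cb_given_m, 3))  # 0.229, i.e., about 23%
```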
4.6.2 Bayesian vs. Frequentist Approach

The conditional probability above was first described as Bayes' theorem, where the probability of an event takes into account prior knowledge of similar events. Named after Rev. Thomas Bayes, the theory takes subjective beliefs and experiences into mathematical consideration when calculating the probability of future outcomes. Another way to put it is that the theory provides a way to obtain new probabilities based on new information. For example, having knowledge that demographics are associated with one's overall health status allows health practitioners to better assess the likelihood of their patient being at risk for certain cardiovascular diseases relative to their socioeconomic status. This type of statistics takes a probabilistic approach to the uncertainty in our world, in which probabilities are never stagnant upon the availability of new knowledge. The purveyors of this line of probability theory are commonly referred to as Bayesians.

On the other hand, a more traditional view of statistics takes a regularity or frequentist approach to probability and the uncertainty of our world. The frequentist approach relies on hard data, per se; there is no mathematical consideration of subjective experiences and new knowledge in determining the likelihood of future outcomes. Instead, only the frequency of an event's occurrence relative to the rate of its occurrence in a large number of trials is taken into consideration. For example, the fact that a coin was flipped ten times and only the head was observed does not make the probability of obtaining a head on the 11th time 100%; the probability of obtaining a head on the next flip still remains 50%. This classical interpretation of probability theory is the most commonly utilized by statisticians and experimental scientists, in which they and other purveyors of this line of thought are referred to as frequentists.¹²

4.6.3 Z-Transformation

One of the fundamental qualities of a normal curve (or normal distribution) that was discussed was that the total area under the curve is equal to one. In Sect. 4.5, Distributions, our attempt at smoothing out the curve that blanketed the bars of the histogram was accomplished by the introduction of more bars, i.e., an increase in observations or sample size. Interestingly enough, it is this action that ultimately leads to the measurement of the area under the curve—a concept with origins in calculus. The area under a curve is divided into numerous rectangular strips (i.e., bars), the area of each individual strip is measured, and then the summation of those individual areas produces the area of the whole—a process commonly known as integration.

The ability to isolate certain areas under a curve is critical in descriptive statistics. Is knowledge of calculus required to do this? Luckily not. What we do require is an understanding of the standard normal curve and the process of the z-transformation. The standard normal curve has all of the qualities of a normal distribution described in Table 4.13, along with three additional qualities discussed next.

The most important and differentiating quality between the standard normal curve and other normal curves is that the mean (μ) is equal to 0 and the standard deviation (σ) is equal to 1. Thus, the center of the graph of a standard normal curve is at zero, and distances from the mean (along the x-axis) are measured in standard deviations of length one (Fig. 4.26). With this notion intact, the second quality of the standard normal curve exemplified is the 68-95-99.7 rule.

As shown in Fig. 4.27, the 68-95-99.7 rule states that approximately 68% of observations fall within one standard deviation to the left and to the right of the mean (μ ± 1σ), approximately 95% of observations fall within two standard deviations to the left and to the right of the mean (μ ± 2σ), and approximately 99.7% of observations fall within three standard deviations to the left and to the right of the mean (μ ± 3σ).

The third quality of a standard normal curve is its usage in the z-transformation process, in

¹² See Perkins and Wang (2004), Raue et al. (2013), and Sanogo et al. (2014).
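The 68-95-99.7 rule can be checked without a printed table: Python's `math.erf` gives the standard normal cumulative area via Φ(z) = (1 + erf(z/√2))/2. This sketch is illustrative only:

```python
from math import erf, sqrt

# The area under the standard normal curve to the left of z, computed with
# the error function instead of a printed table:
#   Phi(z) = (1 + erf(z / sqrt(2))) / 2
def phi(z):
    return (1 + erf(z / sqrt(2))) / 2

# Checking the 68-95-99.7 rule: area within 1, 2, and 3 standard deviations.
for k in (1, 2, 3):
    print(k, round(phi(k) - phi(-k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```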
(75) via the z-score formula (Fig. 4.29). Notice the similarities in Figs. 4.28 and 4.29. Also notice that the z-score (−0.75) that corresponds to the original score (75) has a negative sign, indicating that it falls below the mean (0) of the standard normal curve, which also corresponds to below the mean of the original normal distribution (78).

To determine the proportion of students that scored below a 75, we must find (quantify) the area of the shaded region. Since the tools of calculus will not be utilized here, this can be accomplished by using a standard normal table (Appendix B), which contains the calculated areas of the regions that fall to the left of, or below, any specified z-score. By pinpointing the proportion on the table based on the z-score, we find that approximately 22.66% of students scored 75 or lower. In order to obtain the proportion of students that scored to the right of, or above, a 75, all that is necessary is to subtract the proportion that scored less than 75 from the total proportion of students (i.e., total area under the curve = 1): 1 − 0.2266 = 0.7734, or 77.34%. Lastly, in order to determine the proportion of students that scored between 75 and 85, the proportion that scored less than 75 is subtracted from the proportion that scored below an original score of 85 (0.9599, at z = +1.75): 0.9599 − 0.2266 = 0.7333, or 73.33%. Figure 4.30 provides a stepwise procedure and helpful strategies for solving z-score-related questions. Figure 4.31 is a map that can guide from any starting point to any destination throughout the z-transformation process.

Fig. 4.31 Z-score map: Original Score → Z-transformation → Z-Score → Standard Normal Table → Probability/Area

It may be beneficial, at this point, to return to the overarching theme of this chapter—namely, descriptive statistics. The purpose of being able to effectively utilize data that have been properly consolidated is not limited to the process of z-transformation. This section has been appropriately placed at the end as it ties together individual sections, such as distributions and probabilities, into a much larger concept.
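As a quick numerical check of the worked example: using the z-scores that match the quoted table areas (−0.75 for a score of 75 and +1.75 for a score of 85, with a class mean of 78), the same three proportions fall out of the normal cumulative area. The `phi` helper here stands in for the standard normal table:

```python
from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return (1 + erf(z / sqrt(2))) / 2

# z = -0.75 and z = +1.75 are the z-scores consistent with the quoted table
# areas (0.2266 for a score of 75 and 0.9599 for a score of 85, mean = 78).
below_75 = phi(-0.75)
print(round(below_75, 4))              # 0.2266 -> ~22.66% scored 75 or lower
print(round(1 - below_75, 4))          # 0.7734 -> ~77.34% scored above 75
print(round(phi(1.75) - below_75, 4))  # 0.7333 -> ~73.33% scored between 75 and 85
```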
Think of the process of z-transformation as the tare function on a weight scale. With the ability to essentially zero-out any normal distribution—by transforming it into a standard normal curve—we are able to precisely quantify certain locations and certain distances of interest within the distribution of observations. We can utilize distributions to calculate probabilities and attempt to predict certain outcomes. We are able to learn things like the area between two certain points or, conversely, the points within which a certain area of interest exists. By arming ourselves with these disparate tools and techniques, we have begun the effective utilization of numerical data through description and, certainly, have begun narrowing the scope of uncertainty.

4.7 Self-Study: Practice Problems

1. The following are data collected on the length of hospital stay (in days) from a sample of 25 patients:
   6, 11, 4, 8, 14, 30, 1, 3, 7, 11, 4, 9, 5, 22, 25, 17, 2, 5, 19, 13, 21, 26, 26, 20, 29
   (a) Should the data be organized into intervals? Explain.
   (b) Create a frequency table that describes only the frequency of the observations.
2. A group of college students were interested in understanding the degree of satisfaction their peers had with the campus health office. They gathered 50 responses to their one-question survey, in which 7 said they were very unsatisfied, 9 were unsatisfied, 19 were neither satisfied nor unsatisfied, 11 were satisfied, and 4 were very satisfied. Create a frequency distribution that describes the frequency and cumulative frequency of the responses.
3. The following is a frequency table that organizes the weights (kg) of a group of 60 newborn babies. Complete the table by filling in the boxes labeled with the "?".

   Frequency table—weights of newborn babies (kg)
   Interval (kg)   f    f%       cf   cf%
   0.00–0.99       6    10%      6    10%
   1.00–1.99       ?    20%      18   ?
   2.00–2.99       19   ?        37   61.67%
   3.00–3.99       14   23.34%   51   ?
   4.00–4.99       6    10%      ?    95%
   5.00–5.99       ?    ?        60   ?
   Total           ?    100%     ?    ?

4. Create the appropriate graph for the distribution of data from questions 2 and 3 above.
5. The insulin levels (pmol/L) of a sample of 15 diabetic patients were collected an hour after consumption of breakfast and are provided below. Please identify the mean, median, and mode:
   356, 422, 297, 102, 334, 378, 181, 389, 366, 230, 120, 378, 256, 302, 120
6. A scientist measures the rate of replication (s⁻¹) for a sample of bacteria colonies from ten Petri dishes. Determine the appropriate standard deviation and variance of the sample (hint: use the table method from Fig. 4.13):
   2.33, 2.02, 1.99, 1.53, 0.99, 1.26, 1.18, 3.50, 0.22, 2.62
7. Determine the range and interquartile range from the data set above. Which of the measures (including those in question 6) are better descriptions of dispersion? Explain.
8. A local urgent care clinic reviews the recorded patient illnesses that were treated in the previous month from 450 patients—275 males and 175 females. Their reports found the following number of diagnoses: 101 common colds, 274 bodily injuries, 76 urinary tract infections (UTIs), 62 ear infections, and 100 unexplained pains. Approximately 106 of the bodily injuries were male patients and 55 of the UTI cases were female patients.
   (a) What is the probability of randomly selecting a diagnosis of the common cold and an ear infection?
   (b) What is the probability of randomly selecting a diagnosis of unexplained pain or a bodily injury?
   (c) What is the probability of randomly selecting a male patient or a bodily injury case?
9. After a family banquet, 75% of family members were exposed to the peanut butter cheesecake, out of which 35% developed acute inflammation. It was also found that 5% of the remaining family members who were not exposed to the cheesecake also reported acute inflammation.
   (a) What is the probability of a family member showing signs of acute inflammation?
   (b) Given those who reported acute inflammation, what are the chances of them actually being exposed to the peanut butter cheesecake?
   (c) Given those who reported acute inflammation, what are the chances that they were not exposed to the peanut butter cheesecake?
10. Scores on a spirometry test are used to determine lung function based on the volume of air that is inspired and expired. The scores approximate a normal distribution in the population with a mean of 5.05 liters (L) and a standard deviation of 0.78 (L). For each problem, use the z-score formula and the standard normal probability table to determine:
    (a) What proportion of scores fall above 6.13?
    (b) What proportion of scores fall below 5.44?
    (c) What proportion of scores fall below 4.20?
    (d) What proportion of scores fall between 5.44 and 6.13?
    (e) Which score marks the lower 10% of the population?
    (f) Which score marks the upper 60% of the population?
    (g) Which scores represent the middle 95% of the population?

(See back of book for answers to Chapter Practice Problems)
5 Inferential Statistics I

Contents
5.1 Core Concepts
5.2 Conceptual Introduction
5.3 Principles of Inference and Analysis
5.3.1 Sampling Distribution
5.3.2 Assumptions of Parametric Statistics
5.3.3 Hypotheses
5.4 Significance
5.4.1 Level of Significance
5.4.2 P-Value
5.4.3 Decision-Making
5.5 Estimation
5.6 Hypothesis Testing
5.7 Study Validity
5.7.1 Internal Validity
5.7.2 External Validity
5.8 Self-Study: Practice Problems
of a decision. The decision-making process during hypothesis testing has potentially two forms of error that may occur, namely, Type I and Type II errors. This chapter goes into more detail about what constitutes these types of errors, the elements of a power analysis, and how they establish the power of a study.

Estimation is also related to inferential statistics, as its tools are used to precisely and accurately estimate/predict the actual population. Used in conjunction with hypothesis testing are the tools of estimation (e.g., confidence interval and level of confidence), which increase the robustness of the study. At the heart of inferential statistics is hypothesis testing, which is used as the chief method to determine the validity of a hypothesis. Through the six steps of hypothesis testing, researchers can determine the validity of a hypothesis by assessing the evidence. This basic protocol is the foundation that will be used in all statistical tests mentioned in the next chapter.

Overall, the quality of a research study is scrutinized by validity, whether it be internal or external. Researchers look at the soundness of the entire study, including the study design, methodology, and data analysis, and how the findings truly represent the phenomenon being measured. A research study that is valid is solid because it is well-designed and the findings are appropriate to generalize or infer to the population of interest.

5.2 Conceptual Introduction

One of the earliest survival mechanisms developed by Kingdom Animalia was the ability to learn and adapt. Take the poison dart frog species pictured in Fig. 5.1, for example. The frog's brilliantly colored body warns (or reminds) predators of the slow and painful death caused by feeding on the venomous species. But there was a time when the luminous color actually seduced predators with the possibility of the delicious meal that awaits. This temptation was swiftly suppressed after experiencing the death of similarly situated predators consuming the colorful prey, or the predator itself falling ill for a period of time. Predators quickly learned that the brilliant colors of prey meant a dooming venomous death. Even other prey adapted this antipredator technique and defense mechanism of warning coloration—a concept referred to as aposematism.

Fig. 5.1 Dart frog (NightLife Exhibit: Color of Life—Cali. Academy of Sciences 2015)

In order to ensure genetic success, there soon was a certain mutual understanding developed within the arenas of the wild. Predators understood the consequence of feeding on prey with seductive neon-like colors. Prey understood that warning coloration is a powerful piece of artillery to add to their arsenal in a world full of predators. Thus, this mutual understanding established among the earliest of predators and prey became a type of generalization to add to the repertoire of survival skills for future generations. This generalization in its simplest form equated brilliant colors with poison—an association that still to this day is taken advantage of by both predator and prey.

As rightful heirs of Kingdom Animalia, we too adapt to our surroundings for survival based on certain generalizations. As children, we learn to never take candy from strangers. As students, we learn to always strive for the best grades. As scientists, we learn that all questions are worth asking. The words italicized are commonly referred to as absolutes or universals, but was it not already established that nothing is truly absolute? That absolutes necessitate all-knowing truth? Indeed, that notion still remains pertinent. In fact, it is not the case that all brilliantly colorful animals in the wild are poisonous. The California mountain kingsnake (Fig. 5.2) takes advantage of coloration by using its red, black, and yellow stripes to "warn" predators to stay away—but this intelligent snake is neither venomous nor harmful. Similarly, strangers posing as preschool teachers seem to be exempt when offering candy to children.
5.3 Principles of Inference and Analysis 73
POPULATION
SAMPLE
The list of exemptions to absolutes or univer- Fig. 5.3 The population–sample interaction
sals can continue ad infinitum. But once an
exemption is exposed, they are no longer consid- often referred to as and interchangeable with
ered absolutely true. Rather, we accept certain parametric statistics—that is, making inferences
things to be generally true—such statements are about the parameters (population) that are based
true most of the time. But what virtue does a truth on and go beyond the statistics (sample). The
statement hold if it is not always true? We digress. core principles of statistical analysis underlying
Yet, the general truth still contains a certain inferential statistics are discussed next and are
stronghold on our knowledge. Generalizations considered for the remainder of this book’s first
are essentially made based on the frequency of half on translational research.
our observations and the probability of making
similar or contradictory observations. Our ability
to make generalizations can serve as useful heu- 5.3 Principles of Inference
ristics or harmful stereotypes. The science of and Analysis
making accurate and precise generalizations that
are based on, and go beyond, actual observations Inferential statistics is the second branch of the
is referred to as inferential statistics.Whether fundamental concept underlying statistical
for statistics in general or for biostatistics in thought—the first of which was descriptive sta-
translational healthcare specifically, inferential tistics, as outlined in Chap. 4. Along with this
statistics is used to make inferences about the concept enters the third leg of the stool represent-
population based on observations collected from ing the foundation of the research process,
samples.1 This is the essence of the population– namely, data analysis (Fig. 5.4). Data analysis
sample interaction depicted in Fig. 5.3 and previ- refers to the statistical techniques that analyses
ously discussed in Chap. 3. Briefly, the mere fact both quantitative and qualitative data in order to
that it is neither feasible nor practical to collect render information not immediately apparent
observations from an entire population proves the from mere raw data. Indeed, we learned a few of
utility of inferential statistics. Instead, we collect the core concepts of data analysis under the
a representative sample of observations in order framework of descriptive statistics. So why not
to infer critical pieces of information regarding introduce data analysis in the previous chapter?
the population. Because a population is charac- To be clear, all data analyses take place within
terized by parameters, inferential statistics is the frameworks of descriptive and/or inferential
statistics. But inferential statistics is set apart
Notice that inference, generalization, and conclusion can
1 from descriptive statistics because the latter
all be considered synonymous. reaches a halt after describing and organizing the
74 5 Inferential Statistics I
Ana
Study Design
Data Analysis
Inference
Consensus
X5
Moreover, it is referred to as the standard
error because it considers the large probability
X4 X6 that the random samples being considered may
not actually be precise and accurate representa-
X3 tions of the population. This measure also exem-
X7
plifies the unavoidable random error prone to a
X2 X8 research study—a concept distinct from system-
atic error.2 Although random error is unavoid-
X1 X9
able, the amount of error introduced has the
µx = µ ability to and should be reduced by increasing the
sample size (n). Mathematically speaking, an
Fig. 5.6 Sampling distribution of the mean increase in the value of the denominator ( n )
results in a smaller value of the entire fraction of
population, ultimately permitting the inference SEM. Also, conceptually speaking, obtaining a
consensus. sufficiently large sample size translates to a
By now, we should be quite familiar with the higher chance of having a more accurate repre-
ins and outs of distributions. Distributions are sentation of the population.
able to be effectively described by two different Now, with all of these factors considered, we
yet related measures of central tendency and vari- are one step closer to being able to make more
ability—specifically, the mean and standard devi- accurate and precise inferences about a certain
ation. The distribution of sample means itself has population based on the sampling distribution.
both a mean and a standard deviation relative to Having discussed the various intricacies of a sam-
the population from which the samples were pling distribution, we move toward a concept that
obtained. ties the above concepts together, allows the deter-
mination of the shape of the sampling distribu-
• Mean ( m x ) of the sampling distribution of tion, and also happens to be fundamental to the
the mean represents the mean of all of the usability of inferential statistics. The concepts in
sample means, which is ultimately equal to this section, both above and below, will continue
the population mean (μ). to reappear in some form during the next chapter.
mx = m The central limit theorem states that a sam-
• Standard error of the mean (SEM) is pling distribution with a sufficiently large sample
essentially the standard deviation of the sam- size will approximate a normal distribution,
pling distribution of the mean, which is equal regardless of the shape of the population distribu-
to the population standard deviation (σ) divided tion. The lack of regard to the shape of the popu-
by the square root of the sample size (n). lation is supported by obtaining a sufficiently
large sample size, which happens to depend on
sx = s the shape of the population distribution. To elab-
n orate, if the population is normally distributed
Indeed, this is why the importance of under- (and known), then even a small sample size will
standing the fundamental concepts of means, be sufficient to render a normal distribution of
standard deviations, and distributions was sample means. But if the population is not nor-
stressed in Chap. 4. The SEM is a special form of mally distributed (or is unknown), then we can
variability used to measure the dispersion or use a generally accepted rule that a minimum
spread of the data. This is due to the fact that the sample size of 30 will suffice for a good approxi-
data contained within the sampling distribution mation of the population being normally
of the mean are no longer composed of many distributed.
single observations but rather are composed of
numerous random samples. See Chap. 1, Sect. 1.2.1.2).
2
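The two properties above—the mean of the sample means converging to µ and the SEM equaling σ/√n—can be checked with a short simulation. The sketch below is not from the text: the exponential population, the seed, and all numbers are illustrative choices; it draws many random samples of size n = 30 from a skewed population and inspects the resulting sampling distribution of the mean.

```python
import random
import statistics

random.seed(42)

# A decidedly non-normal "population": right-skewed exponential draws.
# For expovariate(1.0), the true mean and standard deviation are both 1.0.

def sample_mean(n):
    """Draw one random sample of size n and return its mean."""
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

# Build the sampling distribution of the mean for n = 30 (the rule of thumb).
n = 30
sample_means = [sample_mean(n) for _ in range(5000)]

# Its mean approximates mu, and its spread approximates SEM = sigma / sqrt(n).
mean_of_means = statistics.mean(sample_means)
observed_sem = statistics.stdev(sample_means)
theoretical_sem = 1.0 / n ** 0.5

print(round(mean_of_means, 2))  # close to the population mean of 1.0
print(round(observed_sem, 2))   # close to 1 / sqrt(30)
```

Even though the underlying population is skewed, the 5,000 sample means cluster symmetrically around µ with a spread near σ/√n, which is the central limit theorem at work.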
³ Statistical inferences with qualitative data can only be nonparametric inferences. See Chap. 7 for more.
⁴ A large amount of standard error dictates heterogeneity…
…effect on future academic performance. It would also be just as valid to hypothesize that psychological trauma during childhood has no effect on future academic performance. The former is chosen only when there is a hunch or prior evidence that suggests its validity. Regardless of how the hypothesis is formulated, it is the relationship between psychological trauma during childhood and future academic performance that will be tested, determined, and—if applicable—inferred. Moreover, these statistical hypotheses—hypotheses that claim relationships among certain variables—contend the existence of something unique underlying the population of interest, which promotes further investigation.

The example of the hypotheses above is also an example of the two main types of statistical hypotheses used within the world of research. A null hypothesis, symbolized as H0 and read as “H naught,” is a hypothesis that claims that there is no relationship between the variables being considered. The null hypothesis can be formulated in many ways, in which it most often claims no effect, no difference, no association, etc. The second type of hypothesis is referred to as the alternative hypothesis, H1, which claims that there is a relationship between the variables being considered. The alternative hypothesis is essentially the opposite of the null hypothesis. However, it is the null hypothesis that is most commonly asserted, considered, and tested. There are many reasons to favor the null hypothesis that will be discussed throughout the remainder of this chapter. Perhaps one of the most basic explanations involves removing any notions of bias and other errors from the study.

The determination of whether the hypotheses are true or not is equivalent to answering the research question, which takes place at the conclusion of the study. Notice that data analysis is the final step before the conclusion of the research process, in which the outcome of analyzing the data promotes the decision that is to be made regarding the hypotheses. Thus, after successfully determining which hypothesis was “correct” and which was not, we are able to take the information contained within the hypothesis and translate it onto the population. Though our ultimate purpose in this chapter may be to understand hypothesis testing, we must first understand the intricate concepts that are inherent to testing a hypothesis. In the next section, we discuss the main concepts behind hypothesis decision-making.

5.4 Significance

The data analysis section of a research study should provide the evidence or proof necessary to effectively determine the validity of the hypothesis that started the investigation in the first place. The statistical reasoning tools and techniques utilized within the framework of data analysis are mathematical in nature (Chap. 6). However, the decision that is made regarding the hypothesis is not mathematical—we simply decide whether to accept or reject H0. We will see that our decision is based on evidence provided by data analysis, which renders the findings as either significant or insignificant. Therefore, there must exist some tool that can be utilized to directly translate the results from data analysis and guide our decision-making process. These tools are used within the context of significance testing.

5.4.1 Level of Significance

The first tool of significance testing is the level of significance (α), also called “alpha” or “alpha level,” which refers to the threshold that the observed outcome—resulting from the null hypothesis—must reach in order to be considered a rare outcome (Fig. 5.8).

The level of significance is an arbitrary value that is determined at the discretion of the investigator, at the onset of a research study, and relative to the particulars of the study. Case in point, common practice is to set the level of significance at 0.05 or 5%.

The reason the level of significance is set prior to the actual analysis of data is—yet again—to prevent any introduction of bias. The level of significance also describes the actual area of its distribution. This means that if our level of significance is 5%, then the shaded areas in the
tails of the distribution in Fig. 5.9 should be equal to 0.05. In the same breath, this measure also considers the amount of error we permit into our study, which will be expanded on a little later in this chapter.

[Fig. 5.9 Normal distribution with the rare outcomes in the shaded tails (0.025 in each tail)]

Rare outcomes are, obviously, opposed to common outcomes. Figure 5.9 shows a normal distribution that delineates this difference—and this difference makes sense. Understanding the normal distribution is understanding that the observations in the middle have the highest chance of occurring, signified by the large area (common), whereas the observations contained in the tails of the distribution have the lowest chance of occurring, signified by their very small areas (rare). Thus, in consideration of the entire distribution, if the investigator desires a level of significance of, say, 0.10, then they are essentially setting aside 10% of the potential observations as observations that are most different (uncommon/rare) from what is hypothesized in the original null hypothesis.

The observed outcome we refer to in the definition above is, mathematically, the actual test statistic that is obtained from data analysis. Another way to visualize the process of analyzing our data is noticing that the claim of the null hypothesis is being quantified based on the collected data, whereby a test statistic can be obtained. The test statistic (i.e., the observed outcome) is essentially the proof or evidence that will be used against the null hypothesis.⁵

So, then, how do we use the level of significance and the test statistic in order to ultimately make a decision regarding the null hypothesis? A simple answer is that the test statistic is visualized within the context of the level of significance in order to render the observed outcomes as either a rare or common occurrence. But, in reality, we are unable to simply compare the two, as there are numerous different test statistics and only one level of significance. Thus, there must be some measure that standardizes all test statistics and can be comparable to the level of significance.

5.4.2 P-Value

The most common statistical measure used in significance testing is the p-value. The p-value is the probability of observing similar or more extreme occurrences of the actual observed outcome, given that the null hypothesis is true. Every parametric test statistic has an associated p-value that is comparable to the level of significance. The p-value essentially considers the evidence that goes against the hypothesis to be attributable to error and suggests that the outcome that was observed may have occurred just by chance. In this context, the null hypothesis is given the benefit of the doubt; its claim of no difference is considered to be probably true from the start of the study.⁶

For just a moment, conceptualize the p-value as simply being the probability of generally observing a specific outcome. Let us assume that the outcome we are interested in observing is a winning lottery ticket. You are already aware of the slim chances of winning—about 1 in 175,000,000. But because you are a competent biostatistician, you know that the chances are even more slim if you do not purchase a ticket at all. So you purchase a ticket. If you do not win the lottery, then are you any different from the vast majority of other players? Is it uncommon for you to have lost? Is there anything significant about your particular situation? No, the chances favor your loss (i.e., the probability of losing or the p-value was very high).

On the other hand, if the winning numbers came to you in a dream and you ended up winning the lottery over and over again, then you are different from the vast majority of other players, it is uncommon or rare for you to have won multiple times, and there is something significant about your situation. Why? Well, because the p-value, i.e., the probability of winning the lottery, was about 1/175,000,000, and YOU were that one! And not just once—you were that one each and every time!

Notice that in the above example, the null hypothesis essentially stated that there was no difference between you and the rest of the population in winning the lottery. It is not as if we delineated you from the start by claiming that you were an outcast that supposedly had revelations in your dreams. No, we gave you the benefit of being no different from the rest of the population. It was only when you won multiple lotteries consecutively with exceptionally low chances that your situation became a statistically significant situation relative to the rest of the population. Furthermore, it could not have simply been due to chance that the observed outcome (i.e., you winning the lottery) occurred multiple times. Thus, it is only when there is an exceptionally low probability of observing similar or more extreme outcomes than the observed outcome that evidence against the statement of no difference (H0) is substantiated. This signifies that the specific outcome that was observed did not occur by chance alone—something special happened here. The question, then, that remains is: What constitutes an exceptionally low probability? Better yet, at what level do we delineate the difference between an outcome that, if observed, occurred by chance alone and one that did not?

The level of significance, of course! Therefore, if the p-value is less than the level of significance (α), then the observed outcome is statistically significant. On the other hand, if the p-value is greater than the level of significance (α), then the

⁵ We present this information only for clarification purposes; test statistics and the actual formulae of the tests utilized in analyzing data are discussed at great length in the following chapter. For now, just understand that the outcome of data analysis is a test statistic that is judged as either common or rare, in order to make a decision about the null hypothesis.
⁶ Think: “innocent until proven guilty.”
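The tail areas described above can be made concrete with Python's standard library. This is only a sketch for the two-tailed case on a standard normal distribution; the α value is the conventional 0.05 from the text, and everything else is an illustrative assumption.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal reference distribution

alpha = 0.05  # level of significance chosen before data analysis

# A two-tailed test splits the rejection region evenly, leaving
# alpha / 2 = 0.025 in each tail (the shaded areas of Fig. 5.9).
lower_cut = z.inv_cdf(alpha / 2)      # left-tail boundary, about -1.96
upper_cut = z.inv_cdf(1 - alpha / 2)  # right-tail boundary, about +1.96

# Adding the two tail areas recovers the full level of significance.
tail_area = z.cdf(lower_cut) + (1 - z.cdf(upper_cut))
print(round(lower_cut, 2), round(upper_cut, 2), round(tail_area, 2))
```

Choosing α = 0.10 instead would simply push the two cutoffs inward, setting aside 10% of the potential observations as rare.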
observed outcome is not statistically significant (Table 5.1).

Table 5.1 P-values and the level of significance
P-value < Level of significance (α) ⇨ Statistically significant ⇨ Reject H0
P-value > Level of significance (α) ⇨ Not statistically significant ⇨ Retain H0

The p-value and the level of significance share important similarities and even more important differences. Notice that both measures are innately probabilistic, depend on the null hypothesis, and are characterized with the observation of rare outcomes—all utilized to guide the decision-making process with statistical significance. Still, it is their slight differences that make them so important to scientific research. The level of significance is an arbitrary number determined at the onset of the study and at the discretion of the investigator. The p-value, on the other hand, comes into play after data analysis; the p-value is determined relative to the specific data and test statistic used. Lastly, it is important to note that p-values are most commonly obtained from statistical software applications and can also be roughly measured through different standardized tables—both of which are described in the next chapter. Table 5.2 compares and contrasts alpha and p-value.

Table 5.2 Significance measures
Alpha (α) | p-value
Probability | Probability
Dependent on H0 | Dependent on H0
Used in hypothesis testing | Used in hypothesis testing
Determined before analysis | Determined after analysis
Dependent on investigator | Dependent on data

5.4.3 Decision-Making

In statistical analysis, the decision-making process hinges on the presence and/or absence of statistical significance. Significance testing guides the assessment of the evidence provided by the data analysis, in which the probability of the outcome’s occurrence is taken into consideration. We ask questions like: “Could this outcome have occurred by chance alone?”; “Might this outcome have been due to sampling errors?” We scrutinize our findings simply because the decision that we make is ultimately translated—better yet, inferred—onto the population from which the sample was drawn. Take a moment to consider the gravity behind generalizations that have the potential of influencing the health and overall well-being of a population’s constituents.

Therefore, to be able to make accurate and precise generalizations, we must be able to take the results of our significance testing and effectively interpret a decision regarding the hypothesis—a process that is critical when testing a hypothesis. To be clear, because both the level of significance and the p-value address the null hypothesis, the decision made is in regard to the null hypothesis. Of course, we can imply the meaning of this to be the opposite decision made regarding the alternative hypothesis. That said, decisions that are made regarding the null hypothesis are considered strong decisions, due to the support of significance testing. Conversely, decisions made regarding the alternative hypothesis are considered weak, due to the lack of support from significance testing.⁷

To restate, a statistically significant outcome signifies that the evidence substantiated against the null hypothesis cannot be ignored; something special is being observed here. If this is the case, then the decision is to reject H0. This claim of no difference is rejected because the evidence has provided proof that, in fact, there is a difference between the variables in consideration. Therefore, this decision strongly implies that H0 is probably false (and that H1 is probably true).

The same is true for the converse—a statistically insignificant outcome or an outcome that is not statistically significant indicates that there is no solid evidence substantiated against H0. If this is the case, then we fail to reject H0; the decision

⁷ See Chiappelli (2014).
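The rule of Table 5.1 translates directly into code. In the sketch below, the z statistic of 2.3 is a hypothetical value, and a standard normal distribution stands in for whichever test-statistic distribution actually applies; both are assumptions for illustration only.

```python
from statistics import NormalDist

z_dist = NormalDist()

def two_tailed_p_value(z_statistic):
    """Probability, under H0, of an outcome at least as extreme as observed."""
    return 2 * (1 - z_dist.cdf(abs(z_statistic)))

def decide(p_value, alpha=0.05):
    """Table 5.1 as a rule: compare the p-value with the level of significance."""
    if p_value < alpha:
        return "statistically significant -> reject H0"
    return "not statistically significant -> retain H0"

p = two_tailed_p_value(2.3)  # hypothetical test statistic from data analysis
print(round(p, 4), decide(p))
```

A statistic of z = 2.3 yields a p-value of roughly 0.02, below α = 0.05, so the rule rejects H0; a p-value of 0.06 against the same α would instead retain it.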
is to retain H0.⁸ Instead, the claim that there is no difference between the variables in consideration is retained until further evidence can prove otherwise. Retaining or failing to reject H0 does not necessarily mean that its claim is true, per se—this decision only weakly implies that H0 might be true (and that H1 might be false) (Fig. 5.10).

[Fig. 5.10 Decisions regarding H0: strong—H1 is probably true; weak—H1 might be false]

Realize the immense pressure of statistical significance during the decision-making process and on the research study as a whole. Unfortunately, the scientific community has become comfortable with associating insignificant results with insignificant studies. Could it be that just because the significance testing rendered the findings stimulated by H0 as insignificant that the information provided by the whole study is of no value? Of particular concern is the p-value. Consider a p = 0.06, for example, which would result in a statistically insignificant outcome and a decision to retain H0. Did you waste all the time, money, and resources that were invested into your study just because the p-value was one-hundredth of a decimal off? The answer to both questions posed is negative. This raises a discussion regarding the overreliance on p-values and publication bias that are prevalent in the current research community.⁹

5.4.3.1 Errors in Decision-Making

Earlier in this section, the importance of sound decision-making was discussed in the context of inferential statistics. Indeed, should all elements of a research study be done correctly and properly, then there is no reason why the conclusion ought not be inferred onto the population and the findings disseminated throughout the scientific community. However, it is not always the case that the decisions we make are correct decisions—after all, we are but only human. That is not to say that it always is an error of judgment; rather it can also be due to spurious data.

The two forms of errors that may occur in decision-making during hypothesis testing are:

• Type I error—rejecting a true H0
–– Researcher incorrectly rejected the null hypothesis, rendering it as being probably false when, in reality, its claim is probably true.
–– The decision should have been to retain H0.
–– The study concludes by allotting the inference of the existence of a difference between the variables in consideration by H0, when there most probably was no real difference after all.
• Type II error—retaining a false H0
–– Researcher incorrectly retained (or failed to reject) the null hypothesis, presuming the claim as being probably true when, in actuality, its claim is probably false.

⁸ Earlier we mentioned that this case would lead to accepting H0. But “accepting” is a bit too strong of a word to use in a scientific context—instead we are better off deciding to retain H0 or that we fail to reject H0.
⁹ See Wasserstein and Lazar (2016).
–– The decision should have been to reject H0.
–– Researchers risk generalizing their observation of no difference, instead of realizing that there actually is a difference between the variables in consideration by H0.

Perhaps you are wondering, as biostatisticians now, what the chances are of making these types of errors. It should not be a surprise to learn that the probability of making a Type I error is nothing other than alpha (α), also known as the level of significance. The definition of a significance level has contained in it the assumption that the null hypothesis is true. Therefore, by making a decision you are essentially running the risk that the level established is also the probability that your decision is incorrect. This further stresses the importance of the level of significance being at the discretion of the investigator. By increasing or decreasing the level of significance, your chances of an incorrect and a correct decision fluctuate accordingly.

On the other hand, the probability of making a Type II error is referred to as beta (β), which is usually equal to 0.20 or 20%. The conventional value of beta and its relevance to the power of statistical tests will be expounded on in the next section. For now, Table 5.3 shows an organization of the decision-making process.

5.4.3.2 Power Analysis

In scientific research studies, power analyses are strategies conducted to establish the power of a study. The techniques examine the relationship between a series of elements relative to the specific statistical tests that are used in data analysis. Indeed, we may have already heard the term power being used to describe certain qualities of a research study. At face value, or colloquially, power may seem to refer to how well the study is able to do something in a certain way, or generally as the strength or robustness of a study. However, at the most, these may qualify as just loose definitions of the word and the strategies used relative to research. More directly, the power of a study refers to the test’s ability to detect an effect size, should there be one to be found. The effect size, α, β, and n are the four elements necessary in power determination and analysis, discussed next.

5.4.3.3 Elements of Power Analysis

During the discussion of statistical hypotheses, it was mentioned that a null hypothesis may also be a claim of no effect. In statistical research, the hypotheses we establish provide the variables that are to be observed in relation with one another during data analysis. In other words, testing a hypothesis usually entails comparing a hypothesized population mean against a true population mean, in which the presence or absence of an effect serves as the relationship. Thus, the size of the effect or the effect size (ES) refers to the extent to which our results are meaningful, where the effect is the difference between the compared means. In order for the difference to be meaningful, then, the observed outcome must be statistically significant.

Therefore, we witness a direct relationship between the size of an effect and statistical significance, such that the larger the effect size, the more statistically significant the findings. In terms of power and power analysis, there must be some notion of what ES might be detected from a preliminary or pilot study (see Sect. 5.5). Let us use an example for clarification with the null hypothesis below; feel free to replace the phrase no difference with no effect.

H0: There is no difference between the effectiveness of nicotine gum and the effectiveness of e-cigarettes in smoking cessation.

According to the decision-making process, we know that if the data analysis provides a statistically significant difference between the effectiveness of the two treatments, then the decision is to reject H0. By rejecting H0, we are rejecting the claim of no difference or no effect. In other words, we are saying that there is, indeed, an effect! Let

Table 5.3 Decision-making process
Decision | H0 true | H0 false
Reject H0 | Type I error (α) | Correct decision
Fail to reject H0 | Correct decision | Type II error (β)
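The α and β probabilities organized in Table 5.3 can be approximated by simulation. The sketch below is illustrative only: it assumes a one-sample two-tailed z-test on normal data with a made-up true effect of 0.5. Rejections under a true H0 count as Type I errors; failures to reject under a false H0 count as Type II errors.

```python
import random
from statistics import NormalDist

random.seed(7)
z = NormalDist()
alpha = 0.05
crit = z.inv_cdf(1 - alpha / 2)  # two-tailed cutoff, about 1.96

def rejects_h0(sample, mu0, sigma=1.0):
    """One-sample z-test: True when H0 (mean == mu0) is rejected."""
    n = len(sample)
    z_stat = (sum(sample) / n - mu0) / (sigma / n ** 0.5)
    return abs(z_stat) > crit

trials, n = 4000, 30

# H0 true (the mean really is 0): every rejection is a Type I error.
type_1 = sum(rejects_h0([random.gauss(0, 1) for _ in range(n)], 0)
             for _ in range(trials)) / trials

# H0 false (the mean is actually 0.5): every retention is a Type II error.
type_2 = sum(not rejects_h0([random.gauss(0.5, 1) for _ in range(n)], 0)
             for _ in range(trials)) / trials

print(round(type_1, 3))  # close to alpha = 0.05
print(round(type_2, 3))  # beta; the power of this test is 1 - beta
```

Raising or lowering α in the sketch moves the Type I error rate with it, and shrinking the assumed effect or the sample size inflates β, mirroring the trade-offs discussed in this section.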
that simmer for a moment. Again, if H0 claims no effect when comparing the two treatments (nicotine v. e-cigarette) and our decision is to reject the null hypothesis (due to a statistically significant outcome), then we are in actuality saying that there is an effect (difference) between the two treatment groups.

On the other hand, if our data had presented a lack of statistical significance, then we would retain H0 and conclude that, indeed, there is no difference (no effect) between the two treatments relative to smoking cessation. Moreover, when the means of the two variables were compared, the size of their difference (effect) was not appreciably large enough to provide a statistically significant outcome. In terms of size, this could very well mean that the difference of the means compared was quite small or even nonexistent—in other words, more or less, close to the value of zero.

This lends a hand to the importance of the level of significance (α) as an element in establishing the power of a study and, more specifically, conducting a power analysis of a specific statistical test. By adjusting our level of significance, we essentially affect the chances of making both an incorrect and a correct decision regarding the null hypothesis. For example, a study with an α = 0.10 denotes that there is a 10% likelihood of the decision being a Type I error, along with a 90% likelihood (1 − α) of it being a correct decision. By decreasing our alpha, we increase our chances of making a correct decision, while lowering the chances of an incorrect decision. More so, in terms of effect size, a statistically significant outcome will render a sizeable effect, should there be one to be found. This also settles our worry of making an erroneous decision when there is a lack of statistical significance.

Now, we can further our definition of power to be the probability of rejecting H0 when it is actually false. Notice that rejecting a false H0 is a correct decision. This definition of power bears a stark resemblance to the one provided in the opening of the section, both being equally valid. Also, realize that this definition of power essentially represents the opposite of making a Type II error (β). We can further view power as being the strength behind the investigator’s accuracy regarding the observed differences—that is, making a correct decision. Thus, in order to calculate the power of a study, we simply take the complement of β, shown below.

Power = 1 − β

Unlike α, the size of β is neither arbitrarily set prior to data analysis, nor will it be known after a decision is made regarding H0. In reality, the level of β is not so important as a measure by itself; rather it is important in terms of power. It is the role that β plays along with the other elements of a power analysis that makes its understanding even more imperative.

The last element critical to the power of a study and the power analysis of a statistical test is the sample size (n). The importance of collecting a sufficiently large sample size is not limited to power and the elements of a power analysis either. Notice the emphasis on sufficiently large sample size. A good study is not one that has an extravagantly large sample size. Instead, a good study is one that has a sample size that is large enough (i.e., sufficient) to attain statistical significance, should it exist. Each type of statistical test used in data analysis has a specific sample size that is appropriate for the study. The formulas used to determine an appropriate sample size relative to power and the specific statistical tests are discussed in the next chapter.

Conclusively, in order to establish the power of a study, there must be some consideration of these four elements (i.e., α, β, ES, and n) and their interrelated relationship in power analyses relative to statistical tests. It will be convenient to know that establishing any three of these elements will subsequently determine the fourth, in addition to the overall power of the study. This implies that there are actually four distinct power analyses that can be employed during any study and relative to any statistical test, depending on which three out of the four elements are selected. Those notwithstanding, we recommend the most practical approach, particularly to research in the health sciences, to be the establishment of the ES, alpha (= 0.05), and
beta (= 0.20) in order to determine an appropriate sample size, as referred to above by the formulae mentioned in the next chapter.

5.5 Estimation

The ability to make estimates regarding the parameters of a population is pertinent to inferential statistics. Some even argue the utility of estimation over hypothesis testing, as the latter only determines the presence or absence of an effect.¹⁰ Nevertheless, estimation can be used in conjunction with hypothesis testing in order to increase the robustness of our study. Particular to research in the health sciences, the techniques of estimation are used in order to estimate the true parameter of a population—a measure often unknown. The topic and importance of estimation return to that of a population, in general.

Take the population of human beings on planet Earth. As this is being written, there are approximately 7.5 billion human beings on planet Earth. So, does that mean that there are exactly 7.5 billion? Of course not—this, and most other parametric measures, is at best simply an estimation. Yet, estimation is a superbly useful technique that is fundamentally adopted from descriptive statistics. Notice that the measure of the population of human beings is simply an average measure of human beings that live on planet Earth (i.e., µ = 7,500,000,000). This is an example of a point estimate that uses a single value to represent the true (and often unknown) population parameter—in this case the population mean.

Undoubtedly, there could be at any instant more than or less than 7.5 billion human beings on Earth. In terms of estimation, the more or less aspect encapsulates the potential error in measurement represented by the population standard deviation (σ). Statisticians and researchers within the health sciences in particular are much fonder of this more or less consideration than they are of just the point estimate alone. That is why it is not uncommon to see estimated measures written in shorthand as the mean ± SD.¹¹

A range of values that goes beyond just the point estimate, such as mean ± SD, makes us feel a bit more confident in our estimation of the population mean. We are more scientifically poised that somewhere contained within that range is the precise and accurate population mean we are interested in. This form of estimation is referred to as a confidence interval (CI), which provides a range of values containing the true population parameter with a specific degree of certainty.

Now, we may be more accurate in describing the population of human beings on Earth when presented as a confidence interval. For example, we can be 95% confident that the true population mean of the number of human beings living on planet Earth falls between 7,440,000,000 and 7,560,000,000. If we consider the fact that our original point estimate of 7,500,000,000 is the hypothesized population mean of the number of human beings on planet Earth, then it would be more logical to claim that the true population mean is probably either a little more or a little less, as captured by the confidence interval.

A confidence interval must also take into consideration a specific degree of certainty if it is to provide accurate information. This considers, among others, the random error that may have been introduced during the measurement process. But the error we refer to is not simply a single standard deviation above and below the mean (i.e., mean ± SD). Instead, the more or less aspect is taken into consideration by a measure known as the margin of error. The margin of error is the product of the standard error of the mean (SEM) and a specified critical value that is relative to the statistical analysis technique used.¹²

A confidence interval implies that there are two products that are obtained from its calculation. The interval is represented by a lower and an upper limit that are referred to as the confidence limits. The limits signify the extension of the

¹¹ Notice the similarity of this concept with those contained within the sampling distribution of the mean.
¹² We briefly expand on critical values below but provide a …
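Putting the pieces together—point estimate, SEM, and margin of error—a confidence interval can be sketched as follows. The sample values are invented, and a z-based critical value of about 1.96 is used purely for simplicity; for a sample this small, a t-based critical value of the kind discussed in the next chapter would ordinarily be the better choice.

```python
from statistics import NormalDist, mean, stdev

sample = [7.1, 6.8, 7.4, 7.0, 6.9, 7.3, 7.2, 6.7, 7.5, 7.1]  # made-up data
n = len(sample)

x_bar = mean(sample)                 # point estimate of the population mean
sem = stdev(sample) / n ** 0.5       # standard error of the mean
crit = NormalDist().inv_cdf(0.975)   # critical value for 95% confidence
margin = crit * sem                  # margin of error = critical value x SEM

lower, upper = x_bar - margin, x_bar + margin
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

The interval is reported by its confidence limits in brackets, exactly as described above: the sum gives the upper limit and the difference gives the lower limit.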
original point estimate on either side (below and above), where the sum is the upper limit and the difference is the lower limit. Confidence intervals are reported by their limits, in which they are usually written within brackets and separated by a comma (Fig. 5.11). Thus, a standard formula we can use for the construction of a confidence interval is:

CI: Mean ± (Critical Value)(SEM)
CI: [lower (−) limit, upper (+) limit]

Fig. 5.11 Confidence intervals

Notice that the formula above utilizes a sample mean and not a population mean. Indeed, the confidence interval for the true population mean is based on a hypothesized population mean that is obtained from a sample. We consider and can better understand confidence intervals in the context of the sampling distribution of the mean described earlier.¹³ Hence, a confidence interval is most often constructed around a single sample mean obtained from a sampling distribution, which is hypothesized to be the population mean.

We may be wondering what a critical value is and how to obtain it. In brief, a critical value quantifies the threshold that is determined by the …

…the true population mean, if multiple confidence intervals were constructed around sample means that were obtained from a sampling distribution (Fig. 5.12).

Moreover, the level of confidence (i.e., degree of certainty, confidence percent, etc.) is established by the level of significance (α), namely, by taking its complement.

CI%: 1 − α

Notice the importance of the level of significance in differentiating true confidence intervals from false confidence intervals, illustrated in Fig. 5.12. At an α = 0.05, 95% of the confidence intervals are true because they contain the true population mean, while 5% are false because they do not. In the equation for computing a confidence interval, the level of confidence is considered by the critical value that is particular to the specific statistical tests utilized, as discussed further in the next chapter.

Take a moment to consider the effects on the width or range of a confidence interval when the level of confidence and the sample size change. By increasing the level of confidence, we have essentially chosen a smaller α (e.g., α = 0.01 means a 99% CI) which, we will see, results in a
level of significance and is relative to the specific larger critical value. Holding the SEM constant,
statistical technique used—but this is not of chief this widens the confidence interval making our
concern right now and will be discussed in depth estimation less precise. On the other hand, by
in the next chapter. However, the question as to increasing the sample size, we make the fraction
why we use critical value in terms of confidence of SEM smaller. Now, holding the critical value
intervals is of concern and will be discussed now. constant, this narrows the confidence interval
Recall from the definition of a confidence making our estimation more precise.14 Of course,
interval that there is a specific degree of certainty there are numerous combinations of different
in the estimation of a true population parameter, manipulations one can do within a confidence
in which the example above represented as a 95% interval. The point we are attempting to impart is
confidence. The degree of certainty is character- that a confidence interval is most practical and
ized by the percentage of confidence that is beneficial when it has the ability to provide the
referred to as the level of confidence. The level of precise and accurate estimation possible of the
confidence represents the probability that a suc- specific population parameter of interest.
cession of confidence intervals will include the
true parameter. In the example above, the level of
confidence contains the likelihood of obtaining Imagine looking for a needle in a haystack. Would it not
14
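To make the arithmetic concrete, here is a minimal Python sketch (the variable names and illustrative numbers are ours, not the book's) that builds a confidence interval as mean ± critical value × SEM and shows how the width reacts to the level of confidence and to the sample size:

```python
from statistics import NormalDist

def confidence_interval(mean, sd, n, confidence=0.95):
    """CI = mean +/- (critical value) * SEM, using the normal (z) critical value."""
    sem = sd / n ** 0.5                            # standard error of the mean
    alpha = 1 - confidence                         # level of significance
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    margin = z_crit * sem                          # margin of error
    return mean - margin, mean + margin            # [lower (-) limit, upper (+) limit]

# Illustrative (hypothetical) sample: mean 100, SD 15, n = 36
lo95, hi95 = confidence_interval(100, 15, 36, 0.95)
lo99, hi99 = confidence_interval(100, 15, 36, 0.99)
lo95_big, hi95_big = confidence_interval(100, 15, 144, 0.95)

# Raising the confidence level widens the interval; raising n narrows it.
assert (hi99 - lo99) > (hi95 - lo95)
assert (hi95_big - lo95_big) < (hi95 - lo95)
```

Note how the two assertions mirror the trade-off described above: more confidence costs precision, while more data buys it back.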
Fig. 5.12 Confidence intervals (x̄ ± margin of error) constructed around multiple sample means (x̄2, x̄3, …, x̄8) drawn from a sampling distribution
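The idea behind Fig. 5.12 can be checked by simulation. The sketch below (our own illustration, with hypothetical population parameters) draws many random samples, builds a 95% confidence interval around each sample mean, and counts how often the interval captures the true population mean; in the long run, roughly 95% of the intervals are "true":

```python
import random
from statistics import NormalDist, mean

random.seed(7)
MU, SIGMA, N = 100.0, 15.0, 36          # true population parameters (hypothetical)
z_crit = NormalDist().inv_cdf(0.975)    # ~1.96 for a 95% CI

hits = 0
trials = 2000
for _ in range(trials):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = mean(sample)
    margin = z_crit * SIGMA / N ** 0.5  # margin of error (sigma known)
    if x_bar - margin <= MU <= x_bar + margin:
        hits += 1                       # a "true" confidence interval

coverage = hits / trials
print(coverage)                         # close to 0.95
```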
Investigation within all factions of the healthcare field is heavily reliant on hypothesis testing. Any clinician worth her salt is continuously generating and testing hypotheses throughout the process of clinical care. Prior to the patient meeting the clinician, the clinician has already begun to judge the patient and their purported symptoms on information that has already come before them. During the patient–clinician interaction, the clinician generates a number of differential diagnoses and estimates multiple probable prognoses; she makes predictions regarding a specific drug's course of action or the effects of exposure to certain external stimuli, all of which are fundamentally hypotheses. If necessary, she turns to the best available evidence contained within the biomedical literature, in which her ability to interpret those findings relies on the comprehension of the intricacies of hypothesis testing and the research process in general.

As discussed in further depth in Chap. 1, inherent to hypothesis testing is the generation of hypotheses, which are ultimately dependent on imagination, curiosity, and even—to a certain degree—biases. Take a moment to consider this fact. Even an educated guess must arise from something that was initially just believed to be true. Yet, it is these biases that require the method of hypothesis testing; convictions, educated guesses, assumptions, and the like must be held to a higher standard in order to quell (or demarcate) the biases. By doing this, we move one step closer to the truth and thus prevent fallacious inferences. We can visualize the crux of this concept as the ratio of signal to noise or, statistically speaking, effect to error. Our interest is in amplifying the signal (large effect size) and reducing the noise (error fractionation) around it (Fig. 5.13).

Fig. 5.13 Signal (effect)-to-noise (error) ratio

Hypothesis testing essentially boils down to determining the validity of a hypothesis by assessing the evidence that it implicates. A testable hypothesis must be statistical in nature; only then can its claim be analyzed by statistical tests. Statistical hypotheses are tested relative to the interactions between a set of variables from data obtained from the random sample(s). The statistical tests we use to assess the evidence provided by the sample data are the statistical analysis techniques discussed in Chap. 7. Upon data analysis, the observed outcome is then determined to be either a common occurrence (attributable to chance) or a rare occurrence (something special). If the analysis of data renders the evidence inconsistent with the hypothesized claim, then we are forced to invalidate the hypothesis. The same is true for the converse—evidence that is seemingly consistent with the association claimed by the hypothesis allots confirming or retaining the original hypothesis (see Sect. 5.3.3).

Regardless of the outcome observed, should the criteria of parametric statistics be satisfied, then we are able to generalize the findings onto the population whence the sample came. However, what may be even more important than the inference consensus produced by hypothesis testing is establishing concrete evidence of causality. That is, does the specific exposure cause the disease due to the effect observed? Or does the specific intervention cause treatment due to the effect observed?

Unfortunately, establishing causation requires much higher levels of evidence that go beyond collecting sample data. To guard against the errors and fallacies commonly made in research, the relationships tested in Chap. 7 will be at best associative. We establish whether or not an association or relationship exists among the variables under consideration, although even statistically significant associations made between variables may be confounded.15

As we will see in the next chapter, the majority of statistical hypothesis tests entail the comparison of means. We shall see how to transform the research question (i.e., study hypothesis) into a statistical hypothesis with averages that are obtained from sample data. Indeed, it is the sample means that we hypothesize to be the true
descriptions of the parameters. Therefore, in order to make parametric inferences, a number of things that we will be working with and must consider relative to hypothesis testing are necessary: (1) quantitative data, (2) random samples, (3) a sampling distribution (as reference frame), and (4) the assumptions of parametric statistics.

The basic protocol for testing a hypothesis, with brief descriptions of each step, is outlined below; this basic format will be utilized in virtually all statistical tests and further expounded on throughout the next chapter.

Six Steps of Hypothesis Testing
1. Research Question—state the research problem of interest in terms of a question.
2. Hypotheses—the null and alternative hypotheses are stated in a statistical manner.
3. Decision Rule—a preset rule is established in order to guide the decision-making process after the data are analyzed.
4. Calculation—the data are analyzed, and the appropriate test statistic is calculated. Statistical significance may also be established here and used as proof below.
5. Decision—a decision regarding only the null hypothesis is made as guided by the rule above and supported with significance testing as proof from the analysis.
6. Conclusion—the research question is answered based on the findings and the decision that was made. Assuming the criteria of parametric statistics are satisfied, the findings may be generalized onto the population relative to the sample. Confidence intervals may also be placed and interpreted here.

5.7 Study Validity

As we wrap up this chapter, there must be a brief discussion regarding the validity of the research study. Until this moment, there have been numerous occasions in the preceding chapters that talked about validity—ranging from topics of design to methodology. But as we approach the conclusion of the research process that takes place within the confines of a scientific study, we must take a step back and look at the validity of the entire study. This type of validity essentially scrutinizes the quality of the research study and the evidence produced. Thus, by looking at the entire study, we are including the study design, the methodology, and the data analysis and inferences and are then faced with two questions:

1. Is the study tight enough, such that the findings it produces are able to be replicated?
2. Is the study of sufficiently broad implications, such that the findings it produces are able to be generalized?

We discuss these questions and their relevance toward study validity next.

5.7.1 Internal Validity

The first question posed above has to do with the concept of the internal validity of a research study. Internal validity refers to the validity of the assertions presented relative to the effects between the variables being tested. However, there is a plethora of definitions that can be provided to determine a study's internal validity. By looking at the phrase itself in the context of a research study, we can surmise a comprehensive understanding. We can wonder: "How valid is the study internally?" Consider all of the aspects within a study that constitute its internal anatomy—namely, but not limited to, the study design, methodology, and data analysis.

For just a moment, stretch the word validity to mean accuracy, and then we can ask: "Were the steps within the study accurately done? Or did it contain certain factors that made the study more vulnerable to systematic errors?" More generally: "Was the study done well enough?" These questions are, in essence, what the word tight in the question above is trying to capture. Recall that systematic errors have to do with the specific system that was utilized, i.e., the study design. Yet, this type of validity has to do with the study as a whole, simply because the study design sets the precedent for the remainder of the research study.
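The six-step protocol described earlier can be organized programmatically. The sketch below is our own illustration with hypothetical numbers, using a one-sample z test as the calculation step (the z test itself is developed in the next chapter):

```python
from statistics import NormalDist

# Step 1 - Research question: "Does the sample differ from the claimed population mean?"
# Step 2 - Hypotheses: H0: mu = mu_hyp   vs.   H1: mu != mu_hyp (nondirectional)
mu_hyp, sigma = 100.0, 15.0            # hypothesized population mean and SD (hypothetical)
x_bar, n = 104.0, 36                   # observed sample mean and size (hypothetical)

# Step 3 - Decision rule: reject H0 if |z| exceeds the two-tailed critical value at alpha
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96

# Step 4 - Calculation: z = (x_bar - mu_hyp) / (sigma / sqrt(n))
z = (x_bar - mu_hyp) / (sigma / n ** 0.5)

# Step 5 - Decision: regarding the null hypothesis only
decision = "reject H0" if abs(z) > z_crit else "retain H0"

# Step 6 - Conclusion: answer the research question in plain language
print(f"z = {z:.2f}, critical value = {z_crit:.2f} -> {decision}")
```

With these particular numbers the observed z of 1.60 does not exceed the critical value, so the null hypothesis is retained.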
Table 5.4 Threats to internal validity
• History—refers to certain events that may have occurred during the time of the study that may have affected the outcome of the study. The events may have occurred in personal and/or professional aspects relative to the investigators and/or the study participants
• Maturation—refers to certain changes that may have occurred to the study participants throughout the time of the study. These changes may be due to growth in age, experience, fatigue, hunger, etc.
• Testing—refers to changes in the performance of study participants or investigators upon consecutive measurements. This may be due to memory of earlier responses, practice, or desensitization
• Instrumentation—refers to changes in the calibration of a measuring device or the people that use the devices, which result in erroneous measurements
• Selection—refers to the process of assigning study participants (or even other units) to different treatment or control groups. This can also be seen as selection bias
• Mortality—refers to the demise of study participants during the course of the study. This may be particular to studies of comparison, such that the death or attrition of a study participant no longer facilitates the comparison

The virtue of having a study that is internally valid implies that any other investigator with the same research question and interest in the same variables is able to precisely replicate or reproduce the same findings as you did. It is for these critical components that internal validity is often referred to as the sine qua non (i.e., absolutely necessary condition) for research to be rendered meaningful. In order to establish the internal validity of a study, we must be aware of the major threats to validity. Although there can be an endless number of factors that have the ability to jeopardize internal validity, Table 5.4 provides a brief description of a few of the most important.

5.7.2 External Validity

The second question posed has to do with the concept of external validity. External validity refers to the ability to generalize the findings of a study onto the population from which the sample was taken. This clearly lies at the core of inferential statistics, in which the most fundamental question we can begin to ask is: "Is the sample being studied representative of its parent population?" This is, of course and as mentioned earlier, one of the primary qualifications that facilitates sound inferences. We, yet again, suggest a closer look at the phrase for further clarification and ask: "How valid is the study externally?"; "Can the information learned go beyond just the sample?"

Indeed, "external" refers to the inference consensus. This further begs the question of the ability of the findings to go beyond the internal components of the study—namely, the study design, methodology, and data analysis—as well. It should be evident that a necessary condition for the external validity of a study is in fact the establishment of an internally valid study. Hence, should the internal validity of a study be jeopardized, then any attempt at generalizing the findings becomes impermissible—nonetheless, extraneous. Although there also exist threats to external validity, those will not be spoken of as they go beyond the scope of practicality, particularly due to the lack of discussion regarding statistical analysis techniques.16

16 See Campbell (1957).

5.8 Self-Study: Practice Problems

1. What important role (if any) does the sample–population interaction play in inferential statistics?
2. The following calculations are concerned with the standard error of the mean (σx̄):
(a) If σ = 100 and n = 25, what is σx̄?
(b) If σx̄ = 2.50 and σ = 25, what is the sample size (n)?
(c) If n = 35 and σx̄ = 2.82, what is σ²?
3. What strategy can be used to decrease the amount of random error introduced into a statistical test? Provide a mathematical proof of your answer.
4. James is interested in comparing the rates of bullying among ten schools in Los Angeles
County. After he obtains a needs assessment for each school, James determines that due to the differences between the schools he must craft a specific questionnaire for each school in order to collect data on the rates of bullying.
(a) Which of the assumptions of parametric statistics, if any, are being violated here?
(b) Can an accurate inference consensus still be made? Explain.
5. True or False: The alpha level represents the probability that the obtained results were due to chance alone.
6. For each pair, indicate which of the p-values describes the rarer result:
(a) p = 0.04 or p = 0.02
(b) p > 0.05 or p < 0.05
(c) p < 0.001 or p < 0.01
(d) p < 0.05 or p < 0.01
(e) p < 0.15 or p < 0.20
7. What are the four elements necessary for the establishment of the power of a study?
8. True or False: As power increases, the probability of making a Type II error increases.
9. Before entrance into a clinical trial, participants had their average CD4 T cells measured. A review of the existing literature determined the average count in healthy adults to be about 975 cells per cubic milliliter of blood. Researchers were confused when they obtained a 95% confidence interval of 1101–1278 from the participants. Determine which of the following statements are true or false regarding the confidence interval:
(a) The interval of 1101–1278 contains all possible values of the true population mean for all patients that enter the trial.
(b) The interval of 1101–1278 estimates the true population mean roughly 95% of the time.
(c) The true population mean is absolutely between 1101 and 1278.
(d) About 5% of participants did not score between 1101 and 1278 and 95% did.
(e) There is a certain degree of confidence that the true population mean lies between 1101 and 1278.
10. If beta is the probability of making a Type II error, then which of the following describes the power of the hypothesis test (i.e., 1 − β)?
(a) Probability of rejecting a true H0
(b) Probability of failing to reject a true H0
(c) Probability of rejecting H0 when H1 is true
(d) Probability of failing to reject any H0
(See back of book for answers to Chapter Practice Problems)
6 Inferential Statistics II

Contents
6.1 Core Concepts 91
6.2 Conceptual Introduction 92
6.3 Details of Statistical Tests 93
6.3.1 Critical Values 93
6.3.2 Directional vs. Nondirectional Tests 95
6.4 Two-Group Comparisons 96
6.4.1 z Test 96
6.4.2 t Test Family 99
6.5 Multiple Group Comparison 106
6.5.1 ANOVA 107
6.6 Continuous Data Analysis 111
6.6.1 Associations 112
6.6.2 Predictions 116
6.7 Self-Study: Practice Problems 120
two-tailed tests such as the z test, one-sample t test, independent-sample t test, dependent-sample t test, and analysis of variance (ANOVA). Selection of statistical tests is dependent on the individual characteristics and availability of the data. For each test, we will consider what data and design are appropriate, its corresponding formula, and a step-by-step procedure for hypothesis testing.

6.2 Conceptual Introduction

An unspoken understanding behind the ultimate goal of healthcare is promoting and prolonging our species—fancy for keeping us from dying. Just as the effective usability of penicillin was paramount to the overall well-being of humans, so too is the ability to efficiently and effectively analyze data. A bit extreme? Consider our ability to compare the effectiveness of two cancer treatments, to predict the most prevalent strains of influenza during the next flu season, or to model patient behavior and inclinations. Every pioneer and frontier established within healthcare began its infancy with a phase of data analysis. Today we may think of lavish machinery or never-ending mathematical computations when thinking of analyzing data. But that is not necessarily the case. The earliest of analytics could have simply been comparing observations between two modalities. Joe the Caveman observed that a circular-shaped stone could better serve as a wheel than a rectangular-shaped stone. Surely, it is the ability to analyze, or more generally our cognitive abilities, that makes us humans second to none—we are able to look deeper or beyond that which is readily apparent. Joe the Caveman did not simply acknowledge the existence of two differently shaped stones …

… taught us how to appropriately examine and organize data to obtain useful information that was not readily available from the raw data. But pertinent to inferential statistics is understanding the more sophisticated techniques that transform our data. We must go one step beyond simple organization if we desire to provide evidence that will guide our decision-making processes and, hopefully, make generalizations toward the greater good.

In the previous chapter, we discussed the principles and philosophies behind inferential statistics, i.e., making certain inferences, generalizations, or conclusions regarding a population based on the sample. However, that is not to say that our principles of descriptive statistics can be neglected. The three assumptions necessary for parametric statistical inferences (i.e., normality, independence of measurement, and homogeneity of variance) are deeply grounded in descriptive statistical theory.

All of these principles and philosophies must be at the forefront of our minds during this chapter, as they are the underlying logic of the techniques of data analysis. We shall see how data analysis takes advantage of the vehicle that is hypothesis testing, in the form of statistical tests that ultimately allot the inference consensus. This is experimental science in its crudest form. This chapter further solidifies the third leg of the stool (Fig. 6.1)—data analysis—in which the techniques discussed will be limited to quantitative data and parametric statistics. The next chapter will explore the analogous fundamental concepts and techniques of nonparametric statistics, which will be the final chapter in the translational research enterprise.

Fig. 6.1 The research process stool: study design, methodology, and data analysis

6.3 Details of Statistical Tests

Statistical tests are the techniques of data analysis that are used during the examination of hypotheses. These statistical tests mathematically analyze the data obtained from the sampling distribution and essentially quantify the outcome espoused by the hypotheses. It is the compounding of the principles of significance testing onto statistical tests that allows for the testing of a hypothesis. Now we will be able to make sense of test statistics and make decisions regarding the tenability of certain hypotheses. The majority of the statistical tests we explore are inherently similar in terms of mean comparison, effect attainment, and error consideration. This similarity can be noted through their formulae, which innately represent the signal (effect)-to-noise (error) ratios (Fig. 6.2). But before we can begin our discussion of the different statistical tests, we must consider a few important details below.

Fig. 6.2 Error fractionation: the signal (effect)-to-noise (error) ratio

6.3.1 Critical Values

One of the most valuable abilities provided by the techniques of data analysis is the ability to transform data. However, the transformation of data does not necessarily mean that we are tampering with evidence. Rather, a smooth transformation via the formulae sheds light on possible relationships between variables that may be of interest to an investigator. This process takes the hypothesized sampling distribution and transforms it into a standardized sampling distribution specific to the formula (i.e., statistical test) that is used.

For example, the first statistical technique we discuss in this chapter is referred to as the z test. If we are interested in using a z test to test a population mean (i.e., testing whether there is a difference between the sample mean and the population mean), then we must convert the data from the sampling distribution of the mean (x̄) to the sampling distribution of z. Now we have a distribution of z-values that represent the means of numerous random samples, instead of the actual sample means (see Sect. 5.2.1 on the sampling distribution for clarification).

We will further elaborate on what the sampling distribution of z is in the section devoted to the z test, but notice that the only major difference between the hypothesized sampling distribution of the mean and its standardized counterpart, the hypothesized sampling distribution of z, is the scaling and units of the graph (Fig. 6.3). Notice that the transformation being done is very similar to the z-transformation process discussed in Chap. 4! Here, the mean of the sampling distribution (i.e., the mean of sample means) is now represented as a z-score (0), and the other sample means are observations as z-scores on the standard normal curve. Thus, just as we are able to transform data onto a particular sampling distribution, we can transfer the relevant concepts of hypothesis testing as well.

Recall that in hypothesis testing, the decision-making process regarding the null hypothesis was dependent on the observed outcome being either a common observation or a rare observation. This is, briefly, the process of establishing statistical significance, in which an arbitrarily chosen level of significance (α) determines the threshold(s) between common outcomes and rare outcomes (Fig. 6.4).

Therefore, we can also apply this concept to other hypothesized sampling distributions, such as the sampling distribution of z. Here is where we get to the meat of it—say we are interested in identifying the exact locations of the thresholds that represent the level of significance on either tail end (Fig. 6.5). Would that not be useful? Of course! By identifying or quantifying …
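The z-transformation of sample means described above can be sketched numerically (our own illustration, with hypothetical parameters): converting each sample mean via z = (x̄ − μ)/(σ/√n) recenters the sampling distribution at 0 on the standard normal scale.

```python
import random
from statistics import mean, stdev

random.seed(1)
MU, SIGMA, N = 50.0, 10.0, 25     # hypothesized population parameters (hypothetical)
sem = SIGMA / N ** 0.5            # standard error of the mean

# Draw many sample means (a sampling distribution of the mean) ...
sample_means = [mean(random.gauss(MU, SIGMA) for _ in range(N)) for _ in range(5000)]

# ... and convert each one to a z-score (the sampling distribution of z)
z_scores = [(x_bar - MU) / sem for x_bar in sample_means]

# The transformed distribution is centered near 0 with SD near 1
print(round(mean(z_scores), 2), round(stdev(z_scores), 2))
```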
Fig. 6.3 Side-by-side comparison of the sampling distribution of the mean and sampling distribution of z. Notice how
the standard normal curve on the right has a center of 0 and a narrower distribution
Fig. 6.6 The thresholds (i.e., critical values, −1.96 and +1.96) and the areas of the tails (0.025 each, at α = 0.05) are labeled

Fig. 6.7 Sampling distribution of the average BMI of middle school children (x̄) and the national average BMI (μ)

6.3.2 Directional vs. Nondirectional Tests
In the previous chapter, we discussed at length the central role that the null hypothesis plays in statistical significance and decision-making within hypothesis testing. We learned that the null hypothesis is a statement of no difference or no effect. For example, say we are interested in comparing the average BMI of middle school children to the national average BMI. The null hypothesis would then state: there is no difference between the average BMI of middle school children and the national average BMI. If the analysis substantiates plausible evidence against this claim, then our decision would be to reject H0 and state that there is indeed a statistically significant difference (or effect).

But that is all it says. It simply establishes that a difference exists—it does not tell us whether the average BMI of middle school children is less than the national average or, conversely, greater than the national average. In statistical terms, this means that the observed outcome could have been observed in either of the tails (rejection areas) of the corresponding sampling distribution (Fig. 6.7). This type of statistical test, namely, one that tests a hypothesis with areas of rejection on both tails of the sampling distribution, is referred to as a nondirectional or two-tailed test. Moreover, along with two-tailed tests come two critical values that must equally share the area espoused by the level of significance.

But what if we are interested in direction? Rather than detecting just the existence of a difference or an effect, we want to identify the specific type of difference or effect. Statistically, we might be interested in proving an alternative hypothesis that claims: the average BMI of middle school children is less than the national average BMI. Here, we are only interested in rejecting H0 if the observed outcome occurs in the lower rejection area (Fig. 6.8). On the other hand, another alternative hypothesis might claim: the average BMI of middle school children is greater than the national average BMI. Here, we are only interested in rejecting the null hypothesis if the observed outcome occurs in the upper rejection area.

Fig. 6.8 Observed outcomes occur in the lower rejection area
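The difference between the two kinds of test shows up directly in the critical values. A quick sketch of our own, using the standard normal distribution at the same α:

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist()

# Nondirectional (two-tailed): alpha is split equally between both tails
two_tailed = z.inv_cdf(1 - alpha / 2)    # ~1.96; reject beyond -1.96 or +1.96

# Directional (one-tailed): all of alpha sits in a single tail
upper_tail = z.inv_cdf(1 - alpha)        # ~1.645; reject above +1.645
lower_tail = z.inv_cdf(alpha)            # ~-1.645; reject below -1.645

print(round(two_tailed, 3), round(upper_tail, 3), round(lower_tail, 3))
```

Because the one-tailed test concentrates all of α in one tail, its threshold is closer to the center on that side, at the price of being blind to effects in the other direction.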
hypothesized population mean and test whether the claimed population mean is true—that is, whether it is actually representative of the entire population. If it is representative, then the sample mean should not be significantly different from the population mean. So, the research question might ask: "Is there a statistically significant difference between x̄ and μ?" In this case, the null hypothesis will state: "There is no statistically significant difference between x̄ and μ."

When we test the hypothesis, we essentially examine the degree to which the sample mean deviates from the population mean. In order to make this comparison, the sampling distribution of x̄ is converted to the sampling distribution of z. The sampling distribution of z is a distribution of z-values that represent the means of numerous random samples of a given size for a specific population. This hypothesized sampling distribution is then centered around the given hypothesized population mean—recall that the mean of the sampling distribution is equal to the population mean, which we initially hypothesize to be the case, until testing proves otherwise. Hence, to test a hypothesis of this sort and determine how different these two measures really are, we can use the z test formula, along with its confidence interval formula below:

z = (x̄ − μhyp) / (σ/√n)

CI: x̄ ± (zcrit)(σ/√n)

By just looking at the formula, we can outline a few important factors for a z test. Most importantly, notice the specific measures required to successfully complete the calculation: x̄, μhyp, σ, and n—these will be important in determining the appropriate statistical test, particularly when there are a variety to choose from. Also notice that the numerator is essentially the effect (the difference) and the denominator is the standard error; this is akin to the signal-to-noise ratio discussed earlier. Below we provide a step-by-step procedure for hypothesis testing using the z test, along with commentary under each step. After that, we demonstrate the process with an example for further clarification.
(b) The "mathematical proof" is either of the three possibilities mentioned in Step 3 and satisfied by the z test statistic.
6. Conclusion: Based on the results, there seems (to be a/to be no) statistically significant difference between x̄ and μhyp. We are (1 − α)% confident that the true population mean lies between (insert lower limit) and (insert upper limit).
(a) Depending on the decision made in Step 5, the research question is answered.
1. If the decision was to reject H0, then there seems to be a statistically significant difference.
2. If the decision was to retain H0, then there seems to be no statistically significant difference.
(b) The interpretation of the confidence interval follows the conclusion. The degree of confidence depends on the level of significance chosen earlier. This can also be used as proof that your conclusion is correct:
1. If the actual numerical value of μhyp falls within the lower and upper limits of the confidence interval, then the decision that should have been made was to retain H0.
2. If the actual numerical value of μhyp does not fall within the lower and upper limits of the confidence interval, then the decision that should have been made was to reject H0.
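The correspondence between the confidence interval and the test decision can be verified directly: for a two-tailed z test, |z| exceeds the critical value exactly when μhyp falls outside the (1 − α)% interval. A small sketch of our own, with hypothetical numbers:

```python
from statistics import NormalDist

def z_decision(x_bar, mu_hyp, sigma, n, alpha=0.05):
    """Return (decision from the z statistic, decision implied by the CI)."""
    sem = sigma / n ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z = (x_bar - mu_hyp) / sem
    by_statistic = "reject H0" if abs(z) > z_crit else "retain H0"
    lower, upper = x_bar - z_crit * sem, x_bar + z_crit * sem
    by_interval = "retain H0" if lower <= mu_hyp <= upper else "reject H0"
    return by_statistic, by_interval

# The two routes always agree (hypothetical examples):
assert z_decision(104, 100, 15, 36) == ("retain H0", "retain H0")
assert z_decision(108, 100, 15, 36) == ("reject H0", "reject H0")
```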
(At α = 0.01, two-tailed: 0.005 in each tail; critical values −2.58 and +2.58)

6.4.2 t Test Family
6.4.2.1 One-Sample t Test

This type of t test shares many similarities with the z test discussed above, yet the few differentiating factors are what make the two so distinct. The one-sample t test is a statistical test that compares a sample mean to a hypothesized population mean, in order to determine if there is a statistically significant difference between the two groups (Fig. 6.12). Like the z test, the t test examines the location of the sample mean relative to the hypothesized population mean to determine if the claimed parameter is actually representative of all the samples (i.e., the entire population). Moreover, in order to test a hypothesis for statistical significance using a t test, the data must be transformed into the sampling distribution of t.
4. Calculation:

t = (x̄ − μhyp) / (s/√n)    df = n − 1

CI: x̄ ± (tcrit)(s/√n)

(a) The calculation of the t test statistic is done here either by hand using the formula or via statistical software (see Video 4).
(b) The p-value can be roughly measured using the t table in Appendix C or measured more exactly via statistical software.
(c) It is also possible to calculate the confidence interval here.
5. Decision: Reject/Retain H0 because (insert corresponding mathematical proof)
(a) Same commentary as z test
6. Conclusion: Based on the results, there seems (to be a/to be no) statistically significant difference between x̄ and μhyp. We are (1 − α)% confident that the true population mean lies between (insert lower limit) and (insert upper limit).
(a) Same commentary as z test.
(b) The confidence interval is interpreted here.
Example
A national study determined that a healthy adult human requires an average of 100 min of exercise per week. At a local community event, it is gathered from Community Alpha that 35 adults exercise approximately 103 min per week (min/wk) with a standard deviation of 30 min/wk. At a significance level of 0.01, we test the null hypothesis that there is no statistically significant difference between these estimates.

1. Is there a statistically significant difference between the recommended average of 100 min/wk and the 103 min/wk sample obtained from Community Alpha?
2. H0, μ = 100; H1, μ ≠ 100.
3. At α = 0.01 and df = 35 − 1 = 34: if p ≤ α, then reject H0; if p > α, then retain H0.

t = (103 − 100) / (30/√35) = 0.592    CI: 103 ± (2.750)(30/√35) = [89.05, 116.95]

4. p > 0.01; therefore, retain H0.
5. Based on the results, there seems to be no statistically significant difference between the recommended average minutes of exercise per week of 100 and Community Alpha's 103 min/wk. We are 99% confident that the true population average of minutes of exercise per week for healthy adults falls in the interval between 89.05 and 116.95.
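For readers working in Python rather than the statistical software shown in the videos, the Community Alpha arithmetic can be reproduced from the summary statistics alone (2.750 is the two-tailed critical t used in the text for α = 0.01, df = 34); a minimal sketch:

```python
import math

# Summary statistics from the Community Alpha example
x_bar, mu_hyp = 103.0, 100.0   # sample mean and hypothesized mean (min/wk)
s, n = 30.0, 35                # sample standard deviation and sample size
t_crit = 2.750                 # two-tailed critical value from the text's table

se = s / math.sqrt(n)          # estimated standard error of the mean
t = (x_bar - mu_hyp) / se      # one-sample t statistic

# (1 - alpha)% confidence interval around the sample mean
lower = x_bar - t_crit * se
upper = x_bar + t_crit * se

print(round(t, 3), (round(lower, 2), round(upper, 2)))  # → 0.592 (89.05, 116.95)
```

Since |t| = 0.592 is well inside ±2.750, H0 is retained, matching the conclusion above; note also that μhyp = 100 falls inside the interval.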
6.4.2.2 Independent Sample t Test
The independent sample t test is a statistical test that examines the difference between two hypothesized populations through a comparison of two sample means. Critical to this t test is that the two groups that are being compared must be two samples that are independent of each other. Examples of two independent samples that are most often used in this context are treatment and control groups within an experimental design (Fig. 6.14). The group averages of the treatment group (x̄1) and the control group (x̄2) can be compared because the observations in each sample are based on different participants. Therefore, the test statistic is:

t = [(x̄1 − x̄2) − (μ1 − μ2)] / s(x̄1−x̄2)    df = n1 + n2 − 2

where

s(x̄1−x̄2) = √(sp²/n1 + sp²/n2)    and    sp² = [s1²(n1 − 1) + s2²(n2 − 1)] / (n1 + n2 − 2)

[Fig. 6.15: Sampling distribution of the mean difference (μ1 − μ2) is noticeably flatter and fatter in shape]
Steps for Independent Sample t Test
1. Research Question: Is there a statistically significant mean difference between x̄1 and x̄2?
(a) Same commentary as other statistical tests
2. Hypotheses: H0, μ1 = μ2; H1, μ1 ≠ μ2
(a) These are new hypotheses specific for this statistical test. Notice that they still capture the essence of both the null and alternative hypotheses.
(b) Another equally valid way of writing the hypotheses can be:
1. H0, μ1 − μ2 = 0; H1, μ1 − μ2 ≠ 0.
3. Decision Rule
(a) This can be written with either the p-value or t critical value (see decision rule commentary for one-sample t test).
4. Calculation: (see formulae above)
(a) The calculation of the t test statistic is done here either by hand using the formula or via statistical software (see Video 5).
(b) The p-value can be roughly measured using the t table in Appendix C or more exactly measured via statistical software.
(c) It is also possible to calculate the confidence interval here.
5. Decision: Reject/Retain H0 because (insert corresponding mathematical proof)
(a) Same commentary as other statistical tests
6. Conclusion: Based on the results, there seems (to be a/to be no) statistically significant difference between x̄1 and x̄2. We are (1 − α)% confident that the true population mean difference lies between (insert lower limit) and (insert upper limit).
(a) Same commentary as other statistical tests
(b) Notice that here the confidence interval represents the true mean difference, as espoused by the associated hypotheses.
Example
A supervisor is interested in evaluating whether two different health education seminars offered by her company increase the health knowledge of participants. A total of 150 participants are randomly assigned to either Seminar A or Seminar B for a total of 75 participants in each group. After a 3-day-long session, the sample mean health-knowledge score (x̄1) for participants in Seminar A was 88, and the sample mean (x̄2) for participants of Seminar B was 85. Assuming the estimated standard error to be 2.21, test the null hypothesis at a level of significance of 0.05.

1. Is there a statistically significant difference between the average health-knowledge score of participants in Seminar A compared to Seminar B?
2. H0, μ1 − μ2 = 0; H1, μ1 − μ2 ≠ 0.
3. At α = 0.05 and df = 75 + 75 − 2 = 148: if p ≤ α, then reject H0; if p > α, then retain H0.

t = [(88 − 85) − (0)] / 2.21 = 1.36    CI: (88 − 85) ± (1.980)(2.21) = [−1.38, 7.38]

4. Retain H0 because p > 0.05.
5. Based on the results, there seems to be no statistically significant difference between the average health-knowledge score of participants that take either Seminar A or Seminar B. We are 95% confident that the true population mean difference in health-knowledge scores falls in the interval between −1.38 and 7.38.
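The seminar example uses only the two sample means, the given standard error of the difference, and the critical t from the table (1.980 at df = 148), so the calculation step can be sketched in a few lines of Python:

```python
# Summary statistics from the seminar example
x1, x2 = 88.0, 85.0     # sample mean scores for Seminars A and B
se_diff = 2.21          # estimated standard error of the mean difference (given)
t_crit = 1.980          # two-tailed critical value (alpha = 0.05, df = 148)

# (observed mean difference - hypothesized difference of 0) / standard error
t = ((x1 - x2) - 0) / se_diff

# Confidence interval for the true mean difference
lower = (x1 - x2) - t_crit * se_diff
upper = (x1 - x2) + t_crit * se_diff

print(round(t, 2), (round(lower, 2), round(upper, 2)))  # → 1.36 (-1.38, 7.38)
```

Note that the interval covers 0, the value claimed by H0, which is the confidence-interval counterpart of the retain decision.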
[Fig. 6.17: Twin Group 1 and Twin Group 2, Pairs 1–5, matched across the two samples]
Fig. 6.17 Twins matched individually from one sample to the other

[Fig. 6.18: observations x̄1 measured Before and After]
Fig. 6.18 Individual observations measured before and after the intervention

specific statistical test: (1) paired measures or (2) repeated measures.2
Paired measures are composed of two samples that have been matched individually from one sample to the other (Fig. 6.17). Paired observations are most often used in twin studies, crossover trials, or any other related characteristics that allot the matching of two different individuals. Notice that although there are two sets of samples, we are able to use them here because there is a certain study characteristic that makes the individuals dependent on each other.
Repeated measures are composed of a single sample where the subjects have been measured twice and matched together. This technique is most often used to test the effects of an intervention or treatment on a single group—the participants are measured once before the intervention and once after the intervention. After measurement, the individual observations from the before group are matched to individual observations from the after group dependent on each single individual (Fig. 6.18).
Regardless of the type of measure, a dependent t test examines differences within groups, rather than differences between groups—an important concept that will be further expounded on under the next statistical test. Briefly, and more clearly, we are not comparing groups with each other; instead we are comparing individuals that are in a group with each other. Figure 6.19 provides a pictorial illustration of the difference in the comparisons.
One virtue of examining individual differences is the ability to control for variations that would otherwise result in random error, distorting a signal we may be seeking. This is an important characteristic that increases the likelihood of detecting an effect—should there be one to be found—by decreasing the thwarting influence of individual variability. This unique t test is distinct from all of the other statistical tests we have learned because it does not look at differences in group averages; rather it examines differences in individual scores that have been matched3 together. Let us turn to the formula for further clarification on how this is done exactly:

t = (D̄ − μD) / (sD/√n)    df = n − 1

CI: D̄ ± (tcrit)(sD/√n)

We are introduced to a few new symbols here. Let us first break down D̄. Without the average (i.e., the straight bar atop), D represents the difference between matched scores. That is, D = X1 − X2, where the subscripts represent the matched pairs or the matched repetitions. Regardless of the type of measure, after each D has been determined for every match, the sum is computed and then divided by the number of matches4 (n). This provides us with D̄, which is the average of the differences, better defined as the mean difference.

3 Whether paired measures or repeated measures, individ-
[Fig. 6.19: observations X1–X5 compared BETWEEN groups versus WITHIN a group]
Fig. 6.19 Examine the difference in comparison between and within groups
(c) May also calculate confidence interval here.
5. Decision: Reject/Retain H0 because (insert corresponding mathematical proof)
(a) Same commentary as other statistical tests
6. Conclusion: Based on the results, there seems (to be a/to be no) statistically significant mean difference between (group one description) and (group two description). We are (1 − α)% confident that the true population mean difference lies between (insert lower limit) and (insert upper limit).
(a) Same commentary as independent sample t test
Camper | Weight before | Weight after | Difference (D = WBefore − WAfter)
1 | 198 | 193 | 5
2 | 205 | 200 | 5
3 | 220 | 211 | 9
4 | 213 | 215 | −2
5 | 239 | 233 | 6
6 | 305 | 309 | −4
7 | 276 | 270 | −4
8 | 254 | 255 | −1
9 | 281 | 274 | 7
10 | 230 | 226 | 4
∑D = 25, sD = 4.79, D̄ = 2.5

t = 2.5 / (4.79/√10) = 1.650    CI: 2.5 ± (2.262)(4.79/√10) = [−0.93, 5.93]

4. Retain H0 because −2.262 < 1.650 < +2.262.
5. Based on the results, there seems to be no statistically significant mean difference between the average weight before and after the weight loss intervention. We are 95% confident that the true population mean difference of average weight loss lies in the interval between −0.93 and 5.93.
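The paired computation can be verified directly from the table's difference column; a minimal Python sketch using only the standard library:

```python
import math
import statistics

# Difference scores (D = weight before - weight after) from the camper table
D = [5, 5, 9, -2, 6, -4, -4, -1, 7, 4]

n = len(D)
d_bar = sum(D) / n            # mean difference
s_d = statistics.stdev(D)     # sample standard deviation of the differences
se = s_d / math.sqrt(n)       # estimated standard error of the mean difference

t = d_bar / se                # paired t statistic (mu_D = 0 under H0)
t_crit = 2.262                # two-tailed critical value (alpha = 0.05, df = 9)

lower = d_bar - t_crit * se
upper = d_bar + t_crit * se

print(round(d_bar, 2), round(s_d, 2), round(t, 2), (round(lower, 2), round(upper, 2)))
# t lies between -2.262 and +2.262, so H0 is retained
```

This reproduces D̄ = 2.5 and sD = 4.79 from the table and confirms the retain decision.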
6.5 Multiple Group Comparison

What if we are interested in comparing more than just two groups? As is often the case with translational research, we may be interested in comparing the effectiveness of multiple treatment modalities in order to determine the single treatment that is most effective. Well, we might argue, why not conduct a series of t tests?
Unfortunately, conducting a series of t tests will increase the chances of rejecting a true null hypothesis (i.e., making a Type I error). This is due to the fact that each t test that is conducted claims the null hypothesis to be true at a probability equal to the level of significance. Thus, if multiple t tests are being done on the same null hypothesis, then we increase the chance of observing an effect when in reality there is no effect to be found—said another way, multiple t tests make it easier and easier to attain statistical significance. Therefore, we must discuss another statistical technique that allows the parametric comparison of multiple groups but guards against this inaccuracy and an ultimately fallacious inference.

6.5.1 ANOVA

While Gosset argued that a comparison of distributions should be done through means, Sir Ronald Aylmer Fisher contended that it is the variances of distributions that need be compared for the existence of an effect. Without even introducing the reasoning, we can surmise the plausibility of using variances rather than means to facilitate a comparison. Our extensive discussion on descriptive statistics (Chap. 4) proved to us that measures of central tendency and variability were simply two different, yet related and equally important, perspectives of describing a distribution. Subsequently, testing a hypothesis with a statistical test rooted in variability should not be alarming; neither should the fact that this technique is utilized within the context of testing for population means.
Fisher went on to develop the F-statistic and the corresponding statistical analysis technique referred to as ANOVA (analysis of variance) or, as used in hypothesis testing, the F test. Indeed, it is the variance of the different groups that must be analyzed in order to test a null hypothesis for multiple parameters. This technique allots the study of multiple independent groups by analyzing the variability between the groups and comparing it to the variability within the groups. In essence, the basic conceptual format of the F-statistic is:

F = variability between groups / variability within groups

In data analysis, the variability mentioned in both the numerator and denominator above is represented as the mean square (MS), which is simply the calculation of the population variance (σ²) for both measures. Take an experimental design that seeks the effectiveness between three treatment modalities, in which subjects are randomly assigned to three different groups of equal size (Fig. 6.20).
The MS between the groups estimates the variation of the overall results obtained from the subjects in the three different groups, receiving three different treatments—i.e., the variability that is between the (different) groups. The MS within the groups estimates the variation among the individual results obtained from subjects that are in the same group, receiving the same treatment—i.e., the variability within the (same) groups. Figure 6.21 shows this in a pictorial manner.
Thus, the actual equation for an F-statistic is:

F = MSBetween / MSWithin

The equations for obtaining the mean squares will be provided at the end of the section. Conceptually, the MS between groups is a reflection of the treatment effect—if there is one to be found—along with a touch of random error due to the perspective of variability. On the other hand, the MS within groups is a representation of only random or residual error because it examines the individual differences that are contained within the groups. It is for this reason that some
[Fig. 6.21: Groups 1–3 with observations X1–X5, indicating BETWEEN-group and WITHIN-group variability]
Fig. 6.21 ANOVA analyzes the variability between and within the groups
refer to MS within as MS error. It should then not be a surprise to recognize the signal-to-noise ratio when these measures are combined in a fraction:

F = (treatment effect + error) / residual error = signal / noise

Thus, when using the F test for a null hypothesis, a false H0 will support rejection because the ratio reflects the treatment effect above the amount of error contained in the study, similar to the figure above. This will appear numerically as an F-statistic that is much greater than 1, depending on the effect size. On the other hand, a true H0 will adopt a ratio that has only relatively equal amounts of variability between the groups (numerator) to variability within the groups (denominator), such that the value of the fraction is somewhere close to 1.
Ponder the meaning of variability in terms of an experiment. Take an experiment that is interested in testing the effectiveness of three different strains of medical marijuana on appetite stimulation in cancer patients. The null hypothesis would state that there is no statistically significant difference between the effectiveness of the three different treatment modalities (i.e., strains of marijuana) in the degree to which they stimulate a hunger response. If this claim is true, then the lack of difference in hunger stimulation between all three groups should be supported by the lack of treatment effect in the F ratio. The only appreciable difference, then, that might be observed is attributable to inevitable random error, which would reflect as a small F-statistic.
Conversely, if there actually exists a statistically significant difference in the degree to which each distinct strain stimulates a hunger response, then the effectiveness (i.e., difference in treatments) would present itself in the form of a large effect size and, hence, a large F-statistic. Thus, the evidence renders the null hypothesis as false, showing that, in fact, there is a statistically significant difference in the distinct strains of marijuana. This difference is such that one strain's effectiveness on hunger stimulation was resilient enough to rise up, regardless of the error around it—like the rose that grew from the concrete.
The formula (and sub-formulae) for these statistical analyses are referred to as a one-way ANOVA, which analyzes the influence of one
F = MSBetween / MSWithin

MSBetween = SSBetween / dfBetween    MSWithin = SSWithin / dfWithin

• k = number of groups
• N = number of total observations

Do not be overwhelmed by the formulae. First notice that the formula for calculating both mean squares is simply their sum of squares (SS) divided by their respective degrees of freedom (df). Recall from descriptive statistics that the underlying concept of the sum of squares is simply the calculation of variability (for SS calculation, see Appendix E). After obtaining those measures, the ratio of mean squares will produce the F-statistic we require. Results of an F test are most commonly illustrated as an ANOVA table, which organizes the measures for the calculation of an F-statistic; Step 4 of the protocol for hypothesis testing we provide below illustrates this table. In what follows shortly, the protocols to calculate the F ratio from data and for testing a null hypothesis are shown, along with an illustration that utilizes both protocols together in an example.
(a) The calculation of the F test statistic is done here either by hand using the formula or via statistical software (see Video 7).
(b) The p-value can be roughly measured using the F table in Appendix D or measured more exactly via statistical software.
(c) No confidence interval calculation.
5. Decision: Reject/retain H0 because (insert corresponding mathematical proof)
(a) Same commentary as other statistical tests
6. Conclusion: Based on the results, there seems (to be a/to be no) statistically significant difference in (dependent variable) among (independent variable).
(a) Same commentary as other statistical tests.
(b) There is no confidence interval interpretation.
(c) A post hoc analysis is presented here.
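The mean squares and the F ratio can be computed directly from raw data. Below is a minimal Python sketch with three hypothetical groups of equal size (the scores are invented for illustration only):

```python
# Hypothetical scores for three treatment groups (invented for illustration)
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]

k = len(groups)                        # number of groups
N = sum(len(g) for g in groups)        # number of total observations
grand_mean = sum(sum(g) for g in groups) / N

# SS between: group size times squared deviation of each group mean from the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# SS within: squared deviations of each score from its own group mean
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

ms_between = ss_between / (k - 1)      # MS_Between = SS_Between / df_Between
ms_within = ss_within / (N - k)        # MS_Within  = SS_Within  / df_Within
F = ms_between / ms_within

print(round(F, 2))  # prints 19.0
```

Here the group means (5, 7, 10) differ far more than the scores within any one group, so the F ratio lands well above 1, the "reject" pattern described above.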
6.6 Continuous Data Analysis

The majority of the statistical analysis techniques that have been mentioned are chiefly concerned with examining the relationship between a set of groups. The distinct groups were primarily composed of different subjects (except for repeated-measure designs) that were compared to observe an effect on a single outcome. In the case of ANOVA, we examined factors within groups, but this was simply to account for individual differences in terms of error; it was not, however, examined for a relationship within a singular group (Fig. 6.23).
Fig. 6.23 Comparing the difference in analysis within groups and within a singular group

In this final section of inferential statistics, we will discuss the statistical analysis techniques that allot the comparison of variables within a single group of subjects. Indeed, we will notice that these analysis techniques are not limited to inferential analyses. In fact, both techniques are grounded in descriptive analytical theory. The variables in consideration must be continuous in nature (i.e., quantitative data) to permit the ultimate parametric inference. The nonparametric counterpart for the analysis of categorical variables is examined further in Chap. 7.

6.6.1 Associations

During the introductions to inferential statistics, it was briefly mentioned that hypothesis testing is a method primarily useful during the examination of effect sizes. Observing a large effect size among comparable groups provides the initial foundation required for the establishment of causality relative to the variables in question. However, the difficulty of actually materializing the causation was also considered. Our reluctance toward this leads to our use of hypothesis testing for establishing associations in order to prevent fallacious inferences.
Still, we distinguish those associations from these associations (i.e., the ones we discuss in this section) on two grounds:

1. Those associations described the relationship between groups.
2. Those associations set the precedent for potential causality, contingent on the availability of higher levels and quality of evidence.

6.6.1.1 Correlation
In statistics, particularly descriptive statistics, an association is represented as a correlation, which graphically illustrates the relationship between two continuous variables on a scatterplot. The numerical description of this relationship is denoted by the Pearson correlation coefficient (r). Developed by, and named after, one of the giants of contemporary statistics, Karl Pearson, the correlation coefficient (r) is a measure of the strength and direction that describes the association between two continuous variables. As implied above, critical to the calculation of the correlation coefficient is the presence of two continuous variables that are collected from and describe subjects within a single group.
To elaborate on this topic, we can ask a few questions:

1. How is the correlation coefficient calculated?
2. What does the correlation coefficient represent, graphically?
3. How is the strength and direction of the correlation coefficient interpreted once it is obtained?

We will answer these questions in terms of Pearson's first endeavor, which entailed determining the resemblance (i.e., association) in heights of fathers and sons within pairs of family members. Realize that a pair involves a single subject within the group in this context.
The two continuous variables collected from subjects within a single group for Pearson were father's height and son's height in centimeters (cm) (Table 6.1). After measuring the subjects, the data points from each variable are plotted as an (x, y) coordinate. The x-axis will represent all of the heights from the fathers, and the y-axis will represent all of the heights from the sons.
Table 6.1 Father's height and son's height in centimeters on a traditional xy plot

Pairs | Father's height (cm) (x) | Son's height (cm) (y)
1 | 210.00 | 205.00
2 | 239.00 | 230.00
3 | 219.00 | 199.00
4 | 222.00 | 220.00
5 | 250.00 | 249.00
6 | 216.00 | 218.00
7 | 208.00 | 221.00
8 | 199.00 | 202.00
9 | 226.00 | 220.00
10 | 197.00 | 199.00

• x—continuous variable 1
• y—continuous variable 2
• i—measure from each subject (or pair)
• X̄—mean of continuous variable 1
• Ȳ—mean of continuous variable 2

[Fig. 6.24: Father–Son Height Scatterplot; x-axis: Father's Height (X), y-axis: Son's Height (Y)]

Graphically, the correlation coefficient represents the degree of linearity (i.e., straight line8) between the scatter of the data points from the two variables. Confirm that Fig. 6.24 above is, indeed, a scatterplot, in which we obtained r = +0.858. So, how does this numerical value translate information regarding the association between the heights of fathers and sons? To answer this—synonymous with the third question posed earlier—we must turn to the strength and direction of the correlation coefficient.

8 …the regression discussed in the next section.
Lucky for us, the computing of r today is most often done (easily) by any good statistical software program. The formula is provided here for continuity purposes; however, its calculation by hand will not be further discussed.
9 The correlation coefficient is unit-less because when the terms are placed in the actual formula, the units in the numerator and denominator cancel each other out.
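Although the text leaves the hand calculation to statistical software, the ten pairs in Table 6.1 are small enough to reproduce r with the definitional formula; a minimal Python sketch:

```python
import math

# Father's (x) and son's (y) heights in cm, from Table 6.1
x = [210, 239, 219, 222, 250, 216, 208, 199, 226, 197]
y = [205, 230, 199, 220, 249, 218, 221, 202, 220, 199]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Pearson r: sum of cross-products over the root of the product of sums of squares
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)

print(round(r, 3))  # matches the r = +0.858 reported for Fig. 6.24
```

Note that the deviation products in the numerator carry cm × cm while the denominator carries the same units, which is why r comes out unit-less, as footnote 9 observes.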
we can imply that r-values around 0.50 describe moderate associations, in which relatively close values above and below can be seen as moderately weak associations or moderately strong associations, respectively.
The direction of the correlation coefficient represents the behavior of the association between the two variables, which is denoted by the sign of the correlation coefficient (i.e., +/−). Conceptually, this is opposite to the strength of a correlation coefficient. Indeed, now we are concerned with the sign and not with the actual numerical value. Therefore, there must be only two possible scenarios:

• Positive associations—signified by a positive (+) r-value, depict the relationship between two variables as being directly proportional with one another—i.e., the variables behave in a similar manner:
– As the value of variable 1 (x) increases, the value of variable 2 (y) increases.
OR10
– As the value of variable 1 (x) decreases, the value of variable 2 (y) decreases.
• Negative associations—signified by a negative (−) r-value, depict the relationship between two variables as being inversely proportional with one another—i.e., the variables behave in a completely opposite manner:
– As the value of variable 1 (x) increases, the value of variable 2 (y) decreases.
OR
– As the value of variable 1 (x) decreases, the value of variable 2 (y) increases.

Now we are able to accurately interpret the correlation coefficient from the father–son example above. The r of +0.858 insinuates that there seems to be a strong positive association between the heights of fathers and the heights of their sons. Note that we incorporated both the strength and direction of the association in a single description of the relationship between the variables. Also, with these understandings, we are able to (roughly) imply the strength and direction of associations without the actual need of a correlation coefficient by simply viewing the data's scatterplot (Fig. 6.25)11 (see Video 8).
Observe in Fig. 6.25 that the strength of an association is dependent upon the distances between the cluster of points on the graph. The closer the dots are scattered together and approximate a straight line (linearity), the stronger the association between the variables. Similarly, if we imagine a straight line that follows the trend of the points, the

10 Do not let the word "positive" in positive association lead you to believe that this relationship is exclusive to variables that increase together. A positive relationship may also be used to describe two variables that decrease together. What is more important to understand is that they behave in a similar manner.
11 Although this may be a useful heuristic, having the actual correlation coefficient provides a much more accurate and precise description.
[Fig. 6.26: number line of all possible r-values from −1.00 through 0 to +1.00, with reference points at −0.7, −0.4, +0.4, and +0.7 and values near 0 labeled weak]
slope of that line reflects the direction of the association. A slope on the graph that extends from the lower left up to the upper right represents a positive association, whereas a slope that extends from the upper left down to the lower right represents a negative association. Figure 6.26 illustrates a number line that considers all possible values of r in relation to both strength and direction.
The final point we wish to make is regarding the order of the variables. Similar to the contention in this section's introduction, we reiterate the fact that we are unable to imply causal relationships from correlation coefficients—no matter how strong the association may be. In previous associations that set the precedent for potential causality, in which an effect was sought, our variables were labeled as either independent or dependent. Here, though, that classification is not present, further suggesting that the order in which the variables are created or analyzed (i.e., either x or y) is not of importance.
For example, in the father–son example, the variables could have been constructed on the opposite axes. Father's heights could have been the y variable and son's heights could have been the x variable. Nonetheless, the identical value for r would still have been obtained. Thus, we are unable to claim that it is the father's height that affects the son's height or vice versa—further proving the absence of an effect and the strict prohibition of causality.
The majority of our current discussion has had more to do with descriptive statistics, rather than inferential statistics. Certainly, correlation can be a powerful tool in the description of relationships among continuous variables. Still, as appropriately placed in this chapter, the Pearson correlation coefficient is inherently a parametric measure that can be utilized to take useful associations obtained from a sample and infer them as conclusions onto the population. Like the other statistical analysis techniques, the correlation coefficient (r) is a statistic (i.e., sample measure), in which its analogous parameter is ρ, the Greek letter "rho." The population correlation coefficient is the hypothesized measure being tested. Thus, we similarly utilize the sample correlation coefficient (r) in hypothesis testing to draw conclusions regarding such associations in the underlying parent population.
Unlike our other statistical tests, the test used for hypothesis testing regarding an association in the population is not a variation of the formula for the Pearson correlation coefficient provided above. Instead, it is a variation of a t ratio which is, in fact, a t test for a single population correlation coefficient ρ. Nonetheless, the inferences we make must consider the sampling distribution of r in order to hypothesize the correlation coefficient parameter ρhyp. This permits the testing of the null hypothesis that claims a lack of association between the two continuous variables (i.e., ρ = 0). Also, we dare not forget that this test must similarly fulfill and satisfy the three assumptions of parametric statistics.
The actual protocol for hypothesis tests of significance for correlation coefficients is much more extensive than our other parametric tests, particularly in the absence of a statistical software program. Indeed, today correlation coefficients and their tests of significance are easily obtained by a few strokes of a keyboard. Hence, we refrain from providing the full protocol with the refined t test, as it is beyond the scope of this book. However, in the name of knowledge, the formula for this newly refined t test is provided below, along with additional resources for the curious mind.12

12 See Furr (n.d.).
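For the curious mind, the conventional form of that t ratio (the standard test of H0: ρ = 0, stated here as an assumption since the book's own display is given later) is t = r√(n − 2) / √(1 − r²) with df = n − 2. Applied to the father–son data as a sketch:

```python
import math

# Sample correlation and sample size from the father-son example
r, n = 0.858, 10

# Conventional t ratio for testing H0: rho = 0
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
df = n - 2

print(round(t, 2), df)  # t ≈ 4.72 with df = 8
```

With df = 8, this t lies well beyond the two-tailed 0.05 critical value of 2.306, consistent with the strong association described above.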
or the exact rates at which the disease is developed? Conversely, how come we fail in accurately predicting the development of these diseases in individuals that are not heavy smokers? The point we are attempting to make can be better imparted by rephrasing into a much more general question. Why are we unable to accurately predict future outcomes? Answer: Uncertainty.
Recall that statistics perceives the innate uncertainty contained within our observations as variability (see Chap. 4). So, just as before, we acknowledge our continued inclination of minimizing the contained error in the overall size of the effect. Thus, in the context of regression, this interest is primarily important for making more accurate and precise predictions. Furthermore, due to the inherent associative linearity, scatterplots are taken advantage of not only to describe regressions but also as a tool to measure and elucidate variability.

6.6.2.1 Simple Linear Regression
The linear relationship between the dependent and independent variables, along with the contained error, is determined by the linear regression line (also referred to as least-squares regression line). This line represents how the dependent variable (y) regresses on the independent variable (x), which is done by fitting the best-fit line between the plots of dots (Fig. 6.27).13 The most basic regression analysis is referred to as a simple linear regression, where a line is graphed on a scatterplot of a dependent variable and a single independent variable.

Sample: Y = bX + a + e    Population: Y = βX + α + ε

• Y = dependent variable, continuous
• X = independent variable, continuous
• b/β = slope
• a/α = y-intercept
• e/ε = residual error

Notice that the regression line equation innately represents a basic mathematical concept referred to as the slope–intercept formula (y = mx + b), but with the addition of the residual error (e/ε), also known as the residuals. The residual error, similar to random error, is the variability within our results that is unexplainable or has been unaccounted for.14 Mathematically, the error is represented by the scattering of the dots—i.e., the vertical distances between the data points relative to the regression line (Fig. 6.28). The addition of the residual error is the elucidation of uncertainty that is inherently contained within our data and, henceforth, the predictions

13 Recall from correlations that this linear line was simply imagined—however, here, we actually graph this line.
14 This add-on also makes the equation akin to the generalizability (G) theory, briefly described in Chap. 3. The G theory is a statistical framework that aims to fractionate the error in the measurements we make, which ultimately allows our findings to be closer and closer to the true value: X = T + ε.
[Figure (cf. Fig. 6.28): scatterplot; horizontal axis Father's Height (X), 190–260]
we wish to make. The residual error is not shown in the regression equation for descriptive purposes; rather it is more often shown during inferential analyses.
Both a and α are mathematically the y-intercept of their respective regression equations. They essentially describe the expected value of y when x is 0. The calculation of this term (equation provided below) is useful in completing the line equation. However, in most of the research in the health sciences, the interpretation of this measure is not primarily meaningful.
In the regression equations, the most important term is the slope of the line, referred to as the regression coefficient (b). This measure represents the effect of the independent variable on the dependent variable. That is, for every associated change in the independent variable, it considers the expected change in the dependent variable (Y).15 Below we provide the formula for calculating both the regression coefficients b and a, along with an example to tie all of the aforementioned together:

b = r(sy/sx)    a = Ȳ − bX̄

• r = Pearson correlation coefficient
• sx = sample standard deviation of the independent variable (Xs)
• sy = sample standard deviation of the dependent variable (Ys)
• Ȳ = the average of the dependent variable
• X̄ = the average of the independent variable

Example
We are interested in determining whether years of heavy alcohol consumption (X) are able to predict life expectancy (Y). The correlation coefficient of −0.78 describes a strong negative association between the two continuous variables. Simple linear regression analysis finds the regression coefficients b = −1.39 and a = 76.78, which culminate into the line equation: y = −1.39X + 76.78. Moe, a 15-year alcoholic, has consented to be the study subject. Using the equation y = −1.39(15) + 76.78, we find y = 55.93, which is interpreted as Moe's predicted life expectancy to be 55.93 years.
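The two formulas and the worked example can be sketched in Python; this is a minimal illustration that reuses the published coefficients (b = −1.39, a = 76.78), since the raw data are not given in the text:

```python
from statistics import mean, stdev

def fit_simple_linear(xs, ys, r):
    """Coefficients via the text's formulas: b = r*(s_y/s_x), a = ybar - b*xbar."""
    b = r * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return b, a

def predict(b, a, x):
    """Simple linear regression prediction: y-hat = b*x + a."""
    return b * x + a

# Worked example: Moe, with 15 years of heavy drinking
moe_life_expectancy = predict(-1.39, 76.78, 15)   # about 55.93 years
```

The fitting function mirrors the formulas exactly; any dataset with a known r can be passed through it to recover b and a.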
[…] is in terms of life expectancy. Surely, it is not the single or even most effective predictor. Moreover, there can be many different factors, whether good or bad, that if considered can provide a much better prediction of Moe's life expectancy. Factors such as physical activity, genetic disposition, or smoking behavior are also able to give us information regarding life expectancy. In terms of regression, the different factors are represented as additional predictors, which take on the form of independent variables.

6.6.2.2 Multiple Linear Regression
Regression analyses for predictions with multiple independent variables (i.e., multiple predictors) are conducted with the technique referred to as a multiple linear regression.

Sample: Y = b1X1 + b2X2 + … + biXi + a + e
Population: Y = β1X1 + β2X2 + … + βiXi + α + ε

As briefly hinted above, there is an important utility of multiple linear regression over simple linear regression relative to prediction-making. The more predictors (X's, independent variables) we use to predict a given outcome, the more error we are able to account for. The more explained error, the smaller the remaining (unexplained or unaccounted) error tends to be, i.e., residual error (ε). In essence, this is the process of error fractionation, whereby the error—originally unexplainable—is divided into components (i.e., fractions), which ultimately reveal sources of error that were otherwise unaccountable. The knowledge of this error, in turn, ultimately facilitates a more accurate and precise prediction. Therefore, we can deduce that accurate and precise predictions entail a minimal amount of predictive error. The reduction of predictive errors is one of the primary advantages of fitting a regression line to a scatterplot of data.
The estimation of predictive errors becomes particularly important when we begin to conjecture on health-related predictions useful to public health. In this case, regression analyses are considered in terms of inferential statistics. Indeed, predictive inferences are able to provide substantial information regarding health-related outcomes. More importantly, we might want to consider how good or bad a regression line is, particularly when we are interested in making a predictive inference. The quality of a regression line depends on the amount of error it is able to explain and its ultimate predictive error.
There is one last measure relevant to prediction-making in both forms of regression analyses that we have yet to mention and that is critical for the consideration of predictive error. The coefficient of determination (R2) indicates the proportion of total variability in one variable that is predictable from its relationship with another variable. The coefficient of determination entails the predictive accuracy of an inference consensus. Some refer to the coefficient of determination as shared variance, explained variability, or predictive error.
The measure of predictive error is able to estimate the quality of a regression line—or, better stated, the goodness of fit of the regression line. We can determine whether the data provide a good or bad regression line by converting the decimal provided by R2 to a percentage. Hence, the larger the percentage, the better the fit of the line, and the more accurate the prediction. Moreover, the complement of this estimate, namely, 1 − R2, is just as useful in determining the unpredictive error, which in turn provides an idea of the inaccuracy of our prediction.
In simple linear regression analyses, the R2 is (simply) determined by squaring the Pearson correlation coefficient. Returning to the previous example, squaring the correlation coefficient of −0.78 gives an R2 equal to about 0.608. As a percent—how it is most often presented—this means that about 60.8% of the total variability present in life expectancy is predictable from the variability in years of alcohol consumption. This may not seem so fortunate for our friend Moe. However, if we consider the unpredictive error in this estimation, 1 − 0.608 = 0.392, or 39.2% of the variability still remains a mystery. Moreover, if we consider the numerous other factors relevant for life expectancy, then—well, let's just say—Moe is still in the game.
The addition of other predictors is exclusive to a multiple linear regression analysis, in which we are able to consider a larger degree of predictive error and ultimately provide a more accurate prediction of Moe's life expectancy. As mentioned, the method of determining R2 in this case is not as straightforward as it was for the simple linear regression. We cannot just square the correlation coefficient in this instance because a multiple linear regression contains more than two variables and does not have a correlation coefficient. In order to obtain the R2 for a multiple linear regression, we must utilize a method of ANOVA, which provides an F-statistic that serves as the coefficient of determination. Furthermore, the F-statistic relative to regression analyses may be used as an F test for hypothesis testing.
To make parametric inferences by way of hypothesis testing, it becomes more about how an outcome can be predicted on the basis of certain predictors. Now we become interested in identifying the statistically significant predictors (independent variables) of a given outcome or response (dependent variable). The refined F-statistic used in conjunction with regression analyses can provide predictive inferences and can be tested for statistical significance—presuming they satisfy the assumptions of parametric statistics. But unique to regression analyses is the satisfaction of a fourth assumption, namely, homoscedasticity.
Homoscedasticity can be viewed as analogous to homogeneity of variances, but for the variation of the vertical distances (see Fig. 6.28). This assumption is proffered by the predictive error that results from the vertical distances of the data points on a scatterplot.
Like correlations, we omit the extensive discussion regarding the relative hypotheses and significance tests relating to both simple and multiple linear regressions, as they are beyond the scope of this book. Their calculations are just as easily obtained using a variety of statistical software programs (see Video 9). Both simple and multiple linear regression analyses use distinct yet familiar test statistics to consider the null hypothesis pending further substantiated evidence. Simple linear regression modifies the t test, which tests the null hypothesis that x has no effect on y, written as H0: β = 0. On the other hand, multiple linear regression modifies the F test due to the addition of more independent variables. Like parametric ANOVA, the null hypothesis in this type of analysis sets all of the regression coefficients equal to zero, in which the number of β's depends on the number of independent variables present (i.e., H0: β1 = β2 = β3 = … = βp = 0, where p is the number of independent variables). Here, the null hypothesis claims that the regression model in consideration does not fit in with the population.

6.7 Self-Study: Practice Problems

1. Determine the critical values based on the specific two-tailed statistical tests for each of the following:
(a) A one-sample t test with an α = 0.05 and n = 21
(b) Two independent sample t tests with an α = 0.01 and a df = 45
(c) A z test with an α = 0.05
(d) A z test with an α = 0.01
(e) An F test with dfbetween = 7 and dfwithin = 32
2. The Board of Education has just reported that college students study an average of 15 h a week with a standard deviation of 6.7. You want to test whether the study hours of students at your college are a representation of the entire nation. A sample of collected data provided the following:
8.5, 19, 9, 15, 6, 1, 6, 10, 7, 6, 28, 35, 8, 20, 6.5, 11, 2, 8, 5, 9
(a) Use a one-sample z test to test the hypothesis at α = 0.01 using the six steps.
(b) Calculate and interpret a 99% confidence interval.
3. Determine the appropriate t test for each of the following scenarios:
(a) Your boss wants you to determine if there is a difference in effectivity between different cognitive therapies. College students are randomly assigned to receive either behavioral or cognitive therapy.
After 20 therapeutic sessions, each student earns a score on a mental health questionnaire.
(b) One hundred pharmacy students attend a seminar on novel therapeutic treatments. Students are tested once before the seminar and once after the seminar in order to gauge the effectiveness of the seminar.
(c) According to the US Department of Health, the average 16-year-old male can do 23 pushups. A physical education instructor wants to find if 30 randomly selected lazy 16-year-olds are meeting this recommended standard.
(d) The Centers for Disease Control and Prevention recommend the ideal daily dietary fiber consumption for adults. You decide to see if your group of friends meet this criterion or not.
4. The incubation period for the Zika virus is between 2 and 17 days. A recently discovered strain of Zika virus has an average incubation time of 6 days. A Zika outbreak in the population has a group of epidemiologists curious as to whether the new epidemic is really the recently discovered Zika strain. A random sample of Zika patients from the recent outbreak in the population revealed the following incubation times in days:
8, 2, 3, 11, 7, 8, 2, 5
(a) Test the hypothesis at α = 0.05 using the six steps and p-value for proof.
(b) Calculate and interpret a 95% confidence interval for the true population mean incubation time.
(c) Would you say the Zika virus in the population is the hypothetical strain? Explain and provide proof.
(d) Without doing any further hypothesis testing, would the decision regarding the null hypothesis change at α = 0.01? Provide proof.
5. A group of psychologists is interested in determining whether there is a statistically significant effect on IQ in children who were breastfed compared to children who were not. The researchers recruited ten pairs of siblings, in which one sibling was breastfed and the other bottle-fed. The following are the scores collected on a standardized IQ measuring tool.
6.
Pair of siblings   Breastfed sibling IQ   Bottle-fed sibling IQ
Pair 1             119                    115
Pair 2             96                     97
Pair 3             102                    105
Pair 4             111                    110
Pair 5             79                     83
Pair 6             88                     90
Pair 7             87                     84
Pair 8             99                     99
Pair 9             126                    121
Pair 10            106                    101
(a) Make a decision regarding the null hypothesis using the six steps.
(b) Which type of statistical measure did the researchers take advantage of?
(c) What might be a few sources of error that the researchers might want to consider?
7. A group of 40 middle-aged men are recruited to a study in order to determine which home remedy is best suitable for decreasing the severity of the seasonal flu. The men are categorized into four groups with different interventions: orange juice, chicken soup, green tea, and salt water. The men are reported to drink a cup of their specific intervention each day for 1 week and told to report the severity of their condition on a ten-point scale.
OJ   C. Soup   G. Tea   S. Water
5    8         3        1
7    7         2        7
3    4         5        2
3    7         5        4
6    5         3        1
5    9         3        4
8    9         2        4
7    7         1        2
6    6         4        2
5    6         4        1
(a) Using the six steps of hypothesis testing, test the null hypothesis at a significance level of 0.01. (Hint: use Appendix E for SS calculation.)
(b) Based on your conclusion, can you determine which group is most effective? Explain.
8. Answer the following questions based on the ANOVA table below:
Source    Sum of squares (SS)   Degrees of freedom (df)   Mean square (MS)   F
Between   ?                     4                         37.5               ?
Within    5250                  245                       ?
Total     ?                     249                       X                  X
(a) Calculate the missing values in the table above.
(b) How many groups were studied?
(c) Assuming the groups were equal in size, what is the sample size of each group?
(d) Determine the p-value for the outcome. If this study could be done again, what might you suggest?
9. A scatterplot describing the relationship between cholesterol levels and caloric intake renders a Pearson correlation coefficient of r = +0.582.
(a) Describe the strength and direction of r.
(b) Provide a verbal interpretation of the correlation between cholesterol levels and caloric intake.
(c) Can it be said that lower caloric intake causes lower cholesterol levels? Explain.
10. In regard to the r = +0.582 that describes the association between cholesterol levels and caloric intake from above, answer the following questions:
(a) What percent of the variability in cholesterol levels is predictable from its association with caloric intake?
(b) What percent of the variability in cholesterol levels is unpredictable from its association with caloric intake?
(c) Is the R2 able to predict the percent of people with high cholesterol levels? Explain.
11. The multiple linear regression line below estimates the effects the number of daily cigarettes, daily alcoholic beverages, and age have on monthly urinary output (L).
y = 0.466 − 0.181(xCIG) − 0.299(xALC) + 0.333(xAGE)
(a) Identify which of the variables are predictors and which are responses.
(b) A patient has volunteered his time to the research study and is interested to know whether his behavior can predict his urinary output. You learn that he has four cigarettes a day and a single alcoholic beverage with dinner and is 56 years of age. Calculate his predicted urinary output.
(c) Assuming control for the other variables, interpret the regression coefficient of the variable for daily cigarettes.
(d) With a coefficient of determination (R2) of 0.388, what would you say regarding the predictive accuracy of the potential inference consensus that may result from this study?
(See back of book for answers to Chapter Practice Problems.)
Nonparametric Statistics
7
Contents
7.1 Core Concepts 123
7.2 Conceptual Introduction 124
7.2.1 What Is Nonparametric Statistics? 124
7.2.2 When Must We Use the Nonparametric Paradigm? 125
7.2.3 Why Should We Run Nonparametric Inferences? 125
7.3 Nonparametric Comparisons of Two Groups 126
7.3.1 Wilcoxon Rank-Sum 126
7.3.2 Wilcoxon Signed-Rank 127
7.3.3 Mann–Whitney U 127
7.4 Nonparametric Comparisons of More than Two Groups 128
7.4.1 Kruskal–Wallis for One-Way ANOVA 128
7.4.2 Friedman for Factorial ANOVA 129
7.4.3 Geisser–Greenhouse Correction for Heterogeneous Variances 129
7.5 Categorical Data Analysis 129
7.5.1 The Chi-Square (χ2) Tests, Including Small and Matched Designs 130
7.5.2 Time Series Analysis with χ2: Kaplan–Meier Survival and Cox Test 133
7.5.3 Association and Prediction: Logistic Regression 134
7.6 Self-Study: Practice Problems 136
Recommended Reading 137
regarding the population cannot be extrapolated; therefore, no assumptions or generalizations can be made regarding the distribution of the population. This can explain why it is sometimes referred to as a distribution-free method. Used as a simple preliminary test of statistical significance, nonparametric statistics is a method commonly used to model and analyze ordinal or nominal data with small sample sizes. Since nonparametric statistics relies on fewer assumptions, these methods are more robust. The simplicity and robustness of nonparametric statistics leaves less room for improper use and misunderstanding.
This chapter focuses on the nonparametric comparison of two groups and more than two groups, with an emphasis on tests for unifactorial designs and multifactorial designs, as well as a brief discussion on categorical data analysis. Though there are a number of frequently used tests, this book highlights a handful of these mathematical procedures for statistical hypothesis testing. By the end of the chapter, we learn that the objective of all statistical analyses is to reveal the underlying systematic variations in a set of data from experimental manipulation or observed measured variables.

7.2 Conceptual Introduction

7.2.1 What Is Nonparametric Statistics?

Nonparametric statistics are statistics that are, as the name indicates, distinct from parametric statistics. The former pertain to situations where the raw data are categorical in nature, rather than continuous (see Chaps. 5 and 6).
In addition, nonparametric statistics also pertain to research models where a set of parametric assumptions—e.g., homogeneity of variance, normal distribution of measurements of the outcome variable, and independence of outcome—are not verified, and the investigator is thus unable to use the sample data to make inferences about the population. When one or more of these assumptions are not verified, it follows that we cannot and must not use parametric statistics and attempt to infer properties of the population from the sample. Simply stated, it is in those cases that we must rely on nonparametric statistics.
Another important characteristic of nonparametric statistics is that they never require the assumption that the structure of a research model is fixed. In research designs where the structure of the investigational model changes and evolves during the study—as is often the case in adaptive clinical trials or summative evaluation protocols, e.g., through alterations in sample size resulting from adaptive changes, mortality, dropout, or related situations—parametric constraints become limiting. Consequently, often albeit not always, nonparametric tools of comparison or of prediction become the statistical approach of choice.
To be clear, in such cases of flexible research models, frequentist statistical inference, such as that advocated by probabilistic statisticians with a Fisherian formation, will prove less useful, less reliable, less manageable, and overall less appropriate than a Bayesian approach. Of course, Bayesian statistics can entertain statistical tests that are traditionally considered to be parametric, such as the t test, ANOVA, regression, and the like, as well as the nonparametric tests described in this chapter. But Bayesian statistics proffers the unequalled advantage of progression toward the absolute, the true characterization of the population, by sequentially adding current new observations to established priors. From that perspective, Bayesian inferences are undoubtedly preferred to probabilistic statistics, which are static in time and can only be repeated in time-dependent repeated measures designs, in any and all adaptive trials and related flexible research situations.
To be clear, whether the fundamental research question deals with a comparison between groups, or with a set of inferences to be drawn about future outcomes of the dependent variable, which relates to predictions, the rationale of the process of statistical analysis remains the same: to compare the obtained p with the preset level of significance, α (see Sect. 5.3 on significance). These considerations, while they pertain to the analysis of both categorical and continuous data, are particularly relevant to continuous data analysis, because these data permit us to draw conclusions—that is to say, statistical inferences—about the population.
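The decision logic in this passage (verify the parametric assumptions; fall back on a nonparametric test when they fail) can be sketched with SciPy. The data, the 0.05 screening threshold, and the choice of Shapiro–Wilk and Levene checks are illustrative assumptions, not prescriptions from the text:

```python
from scipy import stats

group1 = [3.1, 4.2, 2.8, 5.0, 3.7, 4.4]
group2 = [6.0, 7.1, 5.5, 6.8, 7.4, 6.2]   # illustrative data, not from the text

# Normality of each group's measurements (Shapiro-Wilk)
_, p_norm1 = stats.shapiro(group1)
_, p_norm2 = stats.shapiro(group2)

# Homogeneity of variance across the groups (Levene)
_, p_var = stats.levene(group1, group2)

# If any assumption is rejected at the screening threshold, fall back
# on a nonparametric comparison of the ranks (Mann-Whitney U)
parametric_ok = min(p_norm1, p_norm2, p_var) > 0.05
if parametric_ok:
    _, p = stats.ttest_ind(group1, group2)
else:
    _, p = stats.mannwhitneyu(group1, group2, alternative="two-sided")
```

In either branch the final step is identical: the obtained p is compared with the preset α, exactly as the passage states.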
Categorical data do not, ever, in any circumstance allow statistical inferences about the population. The only exception to this fundamental rule of biostatistics pertains to situations where the counts obtained in categorical data—e.g., white blood cell count in millions—are, by convention, taken as continuous. In those circumstances, these data must satisfy the parametric assumptions.
In brief, continuous data have the advantage over categorical data of allowing extrapolations to the population. That is to say, observations made on a discrete sample can—provided that the parametric assumptions are satisfied—be used to describe certain properties of the population.
It should be self-evident that this can only be the case when and if the population is taken by convention to be a fixed entity, with a known mean, μ, and standard deviation, σ, of which our sample is in fact representative. This convention takes us into the camp of Fisherian probabilistic statistics, which essentially states that there is out there a "beast" that we can refer to as the "population," which we can study and comprehend. The alternative Bayesian position states that the parameters of the population may in fact never be known and are progressively approximated by each iteration of integrating the priors to newer observations.
The probabilistic perspective predominates current trends of research in the health sciences and is therefore the one commonly found in published reports. This view proposes that the population is characterized by parameters, such as its mean, μ, and its standard deviation, σ. In this paradigm, statistical analyses are based on sample statistics, which are used to characterize the population—that is, to make inferences about the parameters of the population—hence the terminology "parametric." Nonparametric statistics are simply situations where inferences about the parameters of the population cannot be drawn for the reasons outlined above.

7.2.2 When Must We Use the Nonparametric Paradigm?

[…] Nonparametric statistics pertain to a domain of statistical inference where the investigator is bound by the fact that either the data themselves are not continuous—i.e., categorical in nature—or that certain fundamental assumptions (homogeneity of variance, normality, and independence) are violated by the data. Consequently, the conclusions based on the statistical analyses, the statistical inferences, cannot be extended and extrapolated to characterizing the population. Rather, they must be restricted to explaining the sample alone. Nonparametric statistics can never allow a descriptive, summative, or even formative generalization of the inferences to the population.
In brief, nonparametric statistics are:
• Always required when the data under analysis are not continuous measurements obtained from interval scales but rather categorical counts derived from simple enumeration of quantities in certain categories defined by nominal or ordinal variables—it follows that a simple transition from a parametric to a nonparametric consideration of the data rests on the translation of continuous measurements to a categorization of the numbers in the form of ranking.
• Not based on parameterized families of probability distributions.
• Inclusive, like their parametric counterpart, of both descriptive and inferential statistics, yet requiring and making no assumption about the probability distributions of the variables being assessed (e.g., normality, homogeneity of variance, independence of measurement).

7.2.3 Why Should We Run Nonparametric Inferences?

We have noted above that the primary reasons for using nonparametric inference can be summarized as:
[…]
Group 1: 3.5 ± 2.70
Group 2: 6.5 ± 2.1

Now, based on this very simple example, we have the exceedingly strong impulse to do either of two things—or both:
• Compare the extent of overlap of the ranks by means of the W statistics.
• Compare the means of the ranks by a standard t statistic.
Both are permitted, and no assumptions are required for option 2 because we already have given up, as it were, the option of parametric inferences due to the fact that we are not dealing with the raw data any more, but in fact with the relative ranks of the original data (see Video 10).
The null hypothesis of the Wilcoxon rank sum test is usually taken as equal medians. The alternative hypothesis is stated as the "true location shift is not equal to 0." That's another way of saying "the distribution of one population is shifted to the left or right of the other," which implies different medians.
In comparing the ranks of the data in two groups, either of two approaches can be followed: either we compare the overlap of the ranks by means of the Mann–Whitney test, or we compare the means of the ranks by the Wilcoxon test, which is essentially a t test approach on the ranks.

7.3.2 Wilcoxon Signed-Rank

The Wilcoxon signed-rank test is a nonparametric statistical hypothesis test used when comparing two matched samples or two repeated measurements on a single sample to assess whether their population mean ranks differ. In other words, the signed-rank test is a paired difference test, which is used as the nonparametric alternative, or equivalent, to the paired Student t test for matched pairs (see Video 11).
Generally, the test renders a W statistic: W+ and W− as the sums of the positive and negative ranks, respectively. If the two medians are statistically not different, then the sums of the ranks should also be nearly equal. If the difference between the sums of the ranks is too great, then the null hypothesis that the population means are statistically homogeneous must be rejected.
The original Wilcoxon signed-rank test may use a different, albeit equivalent, statistic. Denoted by Siegel as the T statistic, it is the smaller of the two sums of ranks of given sign. Low values of T are required for significance. T is generally easier to calculate than W. One important caveat is that when the difference between the groups is zero, the observations are discarded. This is of particular concern if the samples are taken from a discrete distribution, although in that case the Pratt modification can be run to render the test more robust.
A second important aspect of this test pertains to the power analysis. To compute an effect size for the signed-rank test, one must use the rank correlation. If the test statistic W is reported, which is most often the case because the Wilcoxon signed-rank test generally relies on the W statistics—simply the sum of the signed ranks—then the rank correlation, r, is equal to the test statistic, W, divided by the total rank sum, S, or r = W/S. But if the test statistic T is reported, then the equivalent way to compute the rank correlation requires the difference in proportion between the two rank sums (see Appendix F for T critical values).

7.3.3 Mann–Whitney U

The Mann–Whitney U test (aka the Wilcoxon two-sample test and Mann–Whitney–Wilcoxon test) examines whether the sums of the rankings for two groups are different from an expected number. The sum of one ranking is given as an integer value in the third box. If the sum is different from the expectation, this means that one of the two groups has a tendency toward the lower numbered ranks, while the other group has a tendency toward the higher numbered ranks. The probability value presented is one-sided ("tailed"). Use this probability value if you are only interested in whether one of the two samples tends to cluster in a certain direction (see Appendix G for U critical values).
The paired Wilcoxon test ranks the absolute values of the differences between the paired data in sample 1 and sample 2 and calculates a statistic on the number of negative and positive differences. The unpaired Wilcoxon test combines and
The unpaired Wilcoxon test combines and ranks the data from sample 1 and sample 2 and calculates a statistic on the difference between the sum of the ranks of sample 1 and sample 2. By contrast, the Mann–Whitney U test compares the relative overlap of the ranks in groups 1 and 2.
But the question remains as to what to do if we have more than two groups to compare, and we have violated the assumptions for parametric statistics. We still shall use the ranks of the data, rather than the raw data (see Video 12).
The Mann–Whitney U renders a U statistic that is computed as follows:

U = n1n2 + n1(n1 + 1)/2 − R1

n1 and n2 correspond, respectively, to the sample sizes of groups 1 and 2
R1 corresponds to the sum of the ranks for group 1

7.4 Nonparametric Comparisons of More than Two Groups

Nonparametric ANOVA equivalents include the Kruskal–Wallis test for unifactorial designs and the Friedman test for multifactorial designs.

• Kruskal–Wallis one-way analysis of variance by ranks tests whether >2 independent samples are drawn from the same distribution.
• Friedman two-way analysis of variance by ranks tests whether k treatments in randomized block designs have identical effects.

Consideration is also given to the correction of sphericity, viz., the assumption of homogeneity of variance, by means of the Geisser–Greenhouse correction.

7.4.1 Kruskal–Wallis for One-Way ANOVA

Nonparametric tests for comparisons of more than two groups utilize, as was the case for two-group comparison, the ranking of the data, rather than the raw data themselves. That is to say, when these assumptions are violated, then you must resort to a nonparametric form of analysis, which as we saw earlier must rest on the ranks of the data rather than on the means.

• The Kruskal–Wallis test on ranks provides you with such a tool, when we are dealing with one-way designs.
• The Friedman test on ranks provides you with a nonparametric comparison approach in the case of a two-way design.

In either case, if significance is found, then Wilcoxon rank sum post hoc tests are done, with the Bonferroni correction, as above. It is very straightforward, once you have grasped the flow of things.
The Kruskal–Wallis test by ranks, Kruskal–Wallis H test (named after William Kruskal and W. Allen Wallis), or one-way ANOVA on ranks is a nonparametric method for testing whether samples originate from the same distribution. It is used for comparing two or more independent samples of equal or different sample sizes. It extends the Mann–Whitney U test when there are more than two groups. The parametric equivalent of the Kruskal–Wallis test is the one-way analysis of variance (ANOVA). The distribution of the Kruskal–Wallis test statistic approximates a χ2 distribution, with k − 1 degrees of freedom (see Appendix H for χ2 critical values). A significant Kruskal–Wallis test indicates that at least one sample stochastically dominates one other sample. The test does not identify where this stochastic dominance occurs or for how many pairs of groups stochastic dominance obtains. Post hoc tests serve to make those distinctions (see Video 13).
Since it is a nonparametric method, the Kruskal–Wallis test does not assume a normal distribution of the residuals, unlike the analogous one-way analysis of variance. If the researcher can make the less stringent assumptions of an identically shaped and scaled distribution for all groups, except for any difference in medians, then the null hypothesis is that the medians of all groups are equal, and the alternative hypothesis is that at least one population median of one group is different from the population median of at least one other group.
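As a numerical check of the U formula above, consider the two small hypothetical groups below; the hand computation is compared against SciPy's `mannwhitneyu`, which reports the complementary count U′ = n1n2 − U, so the two conventions must sum to n1n2.

```python
from scipy import stats

# Hypothetical scores for two independent groups (illustrative only).
g1 = [3, 4, 2, 6, 2, 5]
g2 = [9, 7, 5, 10, 6, 8]

# Rank all n1 + n2 observations together (midranks for ties), then apply
# the textbook form U = n1*n2 + n1*(n1 + 1)/2 - R1,
# where R1 is the sum of the ranks assigned to group 1.
pooled = sorted(g1 + g2)

def midrank(v):
    first = pooled.index(v) + 1         # 1-based rank of first occurrence
    last = first + pooled.count(v) - 1  # rank of last occurrence
    return (first + last) / 2

n1, n2 = len(g1), len(g2)
r1 = sum(midrank(v) for v in g1)
u = n1 * n2 + n1 * (n1 + 1) / 2 - r1
print(u)  # 34.0

# SciPy reports the complementary statistic U' = n1*n2 - U,
# so the two conventions should sum to n1*n2.
u_prime = stats.mannwhitneyu(g1, g2, alternative="two-sided").statistic
```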
7.5 Categorical Data Analysis
• Nonparametric (or distribution-free) inferential statistical methods, used to analyze similarities or association, include but are not limited to:
–– Anderson–Darling test: tests whether a sample is drawn from a given distribution
–– Statistical bootstrap method: estimates the accuracy/sampling distribution of a statistic
–– Cohen's kappa: measures inter-rater agreement for categorical items
–– Kendall's tau: measures statistical dependence between two variables
–– Kendall's W: a measure between 0 and 1 of inter-rater agreement
–– Kolmogorov–Smirnov test: tests whether a sample is drawn from a given distribution or whether two samples are drawn from the same distribution
–– Kuiper's test: tests whether a sample is drawn from a given distribution, sensitive to cyclic variations such as day of the week
–– Logrank test: compares survival distributions of two right-skewed, censored samples
–– Pitman's permutation test: a statistical significance test that yields exact p-values by examining all possible rearrangements of labels
–– Rank products: detect differentially expressed genes in replicated microarray experiments
–– Spearman's rank correlation coefficient: measures statistical dependence between two variables using a monotonic function
–– Wald–Wolfowitz runs test: tests whether the elements of a sequence are mutually independent/random

Detailed consideration is given to the principal ones below.

7.5.1 The Chi-Square (χ2) Tests, Including Small and Matched Designs

In order to conduct an analysis of frequencies, the data are organized by constructing a frequency table. The frequency table lists the observations contingent upon the nominal variables used. Frequency tables should only list one observation (one count) per individual; but complex studies (i.e., often badly designed studies) often list more complex and misleading frequency tables, the discussion and analysis of which are beyond the scope of our present examination.
Chi-square (note: χ2 test, whose outcome is checked on the appropriate table of the χ2 distribution) is the appropriate test for comparing and for testing associations of frequencies and proportions. This test can be used equally well for two or more than two groups. That is to say that, while χ2 can answer such questions as "is there a difference in the frequencies among the groups" (test of equality of proportions among groups), it can also test whether or not there is an association among the groups (test of association among groups).
Since χ2 is a relatively easy test to compute and to interpret, it is often abused. There are a few special cases, which deserve discussion, because failure to rectify the test in certain situations makes a Type I error more likely. Appropriate use of χ2 includes a preliminary characterization of the sample used in a study, or the analysis of such designs as diagnostic tests, where the outcomes refer to counts of patients who are true positives, true negatives, false positives, or false negatives.
The χ2 test computes the extent of deviation of the observed ("O") cases from frequencies attributable to chance (expected frequencies; "E"). In brief, the χ2 test is a computation that is based on the frequency table of the observed cases (O) and the extent of deviation of the observed cases from the expected frequencies (E) contingent upon (i.e., dictated by) the nominal variables used. The frequency table so constructed is referred to as a contingency table.
For example, if we are counting men and women who are either old or young, we can tally each individual we count in one of four cells: men-young, men-old, women-young, and women-old. The totals of our tallies in each cell represent the observed frequencies, and the cells themselves represent the levels of the nominal variables our analysis is contingent upon (i.e., the "categories").
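The tallying just described can be sketched directly in code; the roster of individuals below is invented purely to illustrate filling the four cells.

```python
from collections import Counter

# Hypothetical roster: one (sex, age-group) observation per individual.
people = [("man", "young"), ("woman", "old"), ("man", "old"),
          ("woman", "young"), ("woman", "old"), ("man", "young"),
          ("man", "young"), ("woman", "old")]

# Tally each individual into exactly one of the four cells
# of the 2x2 contingency table.
cells = Counter(people)
table = [[cells[("man", "young")], cells[("man", "old")]],
         [cells[("woman", "young")], cells[("woman", "old")]]]
print(table)  # [[3, 1], [1, 3]]
```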
χ2 = Σ (O − E)2 / E

The test achieves that by adding (hence the symbol Σ) the ratios of each of the differences between observed and expected frequencies, squared and then divided by the expected frequencies.
Each difference (O − E) is squared because otherwise the simple sum of these differences would add up to 0. Also note that this test tells us nothing about the spread (dispersion) of the frequencies within each category. However, it is a fact that as long as the E values are at least 5, they turn out to be (quasi-)normally distributed, with a variance equal to the frequency itself. Therefore, the variance in each cell could be rendered as the expected frequency, E. That means that:

• It is fair game to divide the squared difference between the O and E values by E:

(O − E)2 / E

The degrees of freedom follow from the design:

df = (a − 1)(b − 1)(c − 1) … (p − 1)

It is also important to note that the χ2 distribution is a distribution of square values, whose mean equals its degrees of freedom (and whose variance is twice the degrees of freedom). Thus, we only need to know the degrees of freedom to characterize the χ2 distribution. The definition of this distribution therefore is quite simple: for any quantity that has a standard normal distribution, its square has a χ2 distribution.
It should be evident that this distribution can only have positive values. That is to say, the χ2 distribution is positively skewed: as the design increases, and therefore the degrees of freedom increase, the distribution increasingly tends to become normal.
The positive skew of the χ2 distribution also implies that inferences can only be and are always one-tail. Practice using the χ2 distribution table, and find, for example, the critical value of χ2, at α = 0.05, for df = 1; for df = 5; for df = 7; etc.

Observed values:
            A      B      Sum
           60     50      110
           40    150      190
Sum       100    200      300

Expected values:
           A                         B
           100 × 110/300 = 36.7      200 × 110/300 = 73.3
           100 × 190/300 = 63.3      200 × 190/300 = 126.7
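The expected values in the table above can be verified programmatically; this sketch uses SciPy's `chi2_contingency` on the same observed counts, with the continuity correction turned off so the result matches the plain formula.

```python
from scipy import stats

# Observed 2x2 table from the worked example above.
observed = [[60, 50], [40, 150]]

# Each expected count is (row total x column total) / grand total,
# e.g., E for the top-left cell = 110 * 100 / 300 = 36.7.
chi2, p, df, expected = stats.chi2_contingency(observed, correction=False)
print(df)        # (2 - 1) * (2 - 1) = 1
print(expected)  # [[36.67, 73.33], [63.33, 126.67]], up to rounding
```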
Now, we obtain each individual spread of the O's (observed values) to the respective E's (expected values) by subtracting O − E. We square them, lest the sum add up to 0, and divide each by the respective E. We add up these individual ratios to obtain the overall spread from O to E in the overall design:

χ2 = Σ (O − E)2 / E

The final step is to determine whether the observed χ2 value (χ2obs) is larger than the critical value given in the table for the corresponding degrees of freedom (χ2crit). If χ2obs > χ2crit, then the test is significant, and your statistical software would compute a p-value (the probability of your finding that outcome by chance alone) that would be smaller than the α level set (often by convention 5%).
There is a shortcut to this computation that can be used in instances of a 2×2 design, such as in diagnostic tests. The shortcut is as follows (using the numbers above):

χ2obs = 300 × (60 × 150 − 50 × 40)2 / [(60 + 40)(50 + 150)(60 + 50)(40 + 150)]
= 14,700,000,000 / 418,000,000 ≈ 35.17

As stated above, χ2 values are always positive, and the test is always one-tail. The greater the value of χ2obs, the greater the deviation of the observed values from the values expected based on chance alone and the greater the probability that this deviation is statistically significant (Appendix H). That is to say, and as noted above, χ2 is a test of association and of comparison between observed and expected values. Whereas χ2 is most often used as a test of association, relationship, or dependency, it also serves to test the equality in proportions among groups (see Video 15).
Despite the fact that the χ2 test can answer such questions as whether there is a difference in the frequencies among the groups (test of equality of proportions), and whether there is an association among the groups (test of association), it is, nevertheless, a test with limited statistical stringency. The weak nature of the χ2 test lies inherently in the fact that it relies not on measurements performed on the subjects, which could then be used to extrapolate the behavior and characteristics of the population, but rather on the actual quantity, or number, of subjects.
Therefore, the χ2 test:

• Must not be overused, just because it is simple to perform.
• Assumes no ordering among the categories under study; in the case of ordinal data (e.g., stage of disease), that information will be lost in the process of analysis.
• Becomes inaccurate when the frequency in any one cell is small (<5). The Yates' correction for continuity of χ2 must be done in that instance.

The Yates' correction for continuity must be applied when dealing with small designs, when E is anticipated (or computed) to be <5 (generally, most statisticians agree with this "threshold"). This correction involves subtracting 0.5 from the difference between the O and E frequencies before squaring in the regular formula. In the shortcut formula, the correction entails subtracting one half from each of the O − E differences in the numerator before squaring. The correction decreases the difference between each pair of observed and expected frequencies by 0.5. If each observed frequency is so close to the expected frequency that the correction reverses the algebraic sign of the difference, then the agreement is as good as possible, and the null hypothesis is accepted.
That is to say, the Yates' correction makes the final computed value of χ2 smaller, which protects from a Type I error, rejecting H0 when it is true. By the same token, the Yates' correction increases the risk of a Type II error, not rejecting the null hypothesis when it is false.
preted via its transformation to an exponent (e.g., b = 0.520; e^b = 1.68; meaning in this particular case that this variable increased hazard by 68%).
By contrast, the approach of survival analysis entails the following: From a set of observed survival times, we can estimate the proportion of the sample who would survive a given length of time, thus generating a "life table" and a "survival Kaplan–Meier curve." This estimate requires considering time in many small intervals. For example, the probability of surviving 2 days is the product of the probability of surviving the first day times the probability of surviving the second day, which itself is called the "conditional" probability as it is conditional upon the probability of surviving the first day. So, for say, 100 days, the total probability becomes P1 × P2 × … × P100. In actual terms, P100 is calculated as the proportion of the sample surviving at day 100. On days when nobody dies, the probability of surviving is 1, and it is therefore only necessary to calculate these probabilities on days that somebody dies. The data are plotted as a "step function," where it is incorrect to join the proportions surviving by sloping lines.
The data are best analyzed by the logrank test, a nonparametric test that tests the null hypothesis that the groups being compared are samples of the same population with regard to survival. It acts a bit like a χ2 test in that it compares observed with expected numbers of events: indeed, it uses the χ2 distribution with (k groups − 1) degrees of freedom (Appendix H). A significant outcome would suggest that the groups do not come from the same population. When groups are stratified (e.g., age range), then a logrank test could be used to determine whether there were significant differences between the stratified groups.
Whereas the logrank test serves to compare the survival experience of two or more groups, it cannot be used to explore the effects of several variables on survival. For that purpose, the Cox proportional hazard regression analysis should be used.
Let us recall that time series and survival analyses allow us to look at data where we make many repeated measurements on the same individual over time. Thus, they are also called "repeated measures" analyses. One advantage to these analyses is the fact that by producing "blocks," represented by the same individual within whom measurements are obtained, cross-individual differences are eliminated and the design is made stronger; but, because each value will be correlated with the preceding and the following measurements on the same individual, data points are not to be considered fully "independent" and problems arise.
In the instance of a few measurements (e.g., pre/post), the more appropriate term (and analysis) is "repeated measure." The statistical approach of choice is a within-group ANOVA design, as was noted above. Analyses can often be simplified by analyzing in fact the post-/pre-difference.
Thus, these regression coefficients have common usage in the derivation of a prognostic index for each individual variable, as well as overall. In this analysis, the outcome measure is the cumulative hazard of dying at time t, h(t):

h(t) = h0(t) × e^(b1 × X1 + … + bn × Xn)

7.5.3 Association and Prediction: Logistic Regression

It is not uncommon that two variables have some degree of relationship, or association, in a given data set. The correlation coefficient, r, is a measure of the relationship between variable X and variable Y. Actually, r gives an indication of the direction of the relationship (positive or negative) and of the strength of the relationship (from −1 to +1, 0 being no correlation whatsoever). But never can r imply a cause–effect relationship.
The correlation coefficient, r, or the Pearson coefficient, is the measure of relationship between two independent continuous variables. The value indicates the degree of covariance, that is to say, of shared variability. The square of r, the coefficient of determination, provides an indication of how tight the relationship is (see Chap. 6).
The relationship between ordinal variables is given by the Spearman rank or Spearman rho correlation coefficient.
The Spearman rho (ρ) is utilized, as is the Kendall's tau (τ), to compute the association between the ranks (nonparametric), rather than the actual raw data, of two distributions.
Factor analysis is the set of statistical methods used to determine, for example, which items on a scale clump (or cluster, hence the term for the related statistical approach of "cluster analysis") together into some sort of a significant (or at best, highly related) factor or construct. Factor analysis is the means by which the investigator can group items, data, or trends of results (e.g., gels showing this or that particular band) into a coherent group that shares fundamental similarities. A factor analysis is based on the notion that each measurement is associated with a certain degree of variability. When the variability about two sets of measurements overlaps to a significant extent, then it is fair to assume that both measurements are essentially the same or at least measure the same "thing," the same factor. By contrast, if two sets of measurements do not overlap at all, then it seems fair to state that they are totally and absolutely unrelated; hence the fundamental principle behind factor analysis (and cluster analysis). An exploratory factor analysis will establish, by calculating the correlation coefficients across the data for the expression of each gene, whether or not some of the genes group together in some meaningful way. If the investigator has a good idea of what the principal factors are within each family of genes, and how they will be ordered (a priori model), the data and the analysis will be organized accordingly. The factor analysis is said to be a confirmatory statistical analysis in this instance.
That is to say, the linear multiple regression test rests on the verification of the assumptions of independence, normality, and homogeneity of variance. In addition, something analogous to the assumption of homogeneity of variances must be verified, which refers to the homogeneity of the variation of the Y's across the range of the tested X's: that assumption is called homoscedasticity.
When even one of these assumptions is violated, or when the outcome variable Y is not a continuous variable (e.g., disease present: yes, no), then log-transforming corrections of the outcome variable, Y, must be actualized. Thus, we might have the following equation, for example:

Diseased state = b0 + age + smoking + alcohol + treatment + error

We then must "translate" the dependent variable, Y, into a continuous variable look-alike, and we do so by means of the logistic function, log(p / (1 − p)), hence the term logistic regression. The equation now becomes

log(disease / (1 − disease)) = b0 + age + smoking + alcohol + treatment + error

Multiple linear regression is a parametric test, which requires satisfying four assumptions: normality, independence, homogeneity of variance, and homoscedasticity, lest a logistic regression be necessary. The statistical quality of the regression can be verified by multiple means: for example, ANOVA can test its significance, CIs can examine the standardized regression coefficients (the β weights), and R and R2 can establish the linearity of the relationship. The homoscedasticity assumption refers to the fact that the variance around the regression line is the same for all values of the predictor variable (X).
If the assumptions noted above hold, then the residuals should be normally distributed with a mean of 0, and a plot of the residuals against each X should be evenly scattered. Statistical software packages often will produce these graphs with the initial regression command, followed by a plot command. Abnormal plots of the residuals will occur as a consequence of the assumptions not being met. Therefore, while you rarely read about this stage of analysis in papers, it is always a good idea to check the plot of the residuals before going any further in a regression analysis. Abnormal plots of residuals could show, for example, that:

(a) The variability of the residuals could increase as the values of X increase.
(b) There is a curved relationship between the residuals and the X values, indicating a nonlinear relation.
Logistic regression is a statistical regression model that uses the logit of a number p between 0 and 1 for prediction models of binary dependent variables (see Video 16).
In conclusion to this chapter, the objective of all statistical analysis is to reveal underlying systematic variations in a set of data, either as a result of some experimental manipulation or from the effect of other observed measured variables. The basis of all statistical tests is an assessment of the probability that given observations and occurrences happen by chance, or not. The probability, p, is computed by the statistical analytical tests on the basis of the data and is compared to a set value, the α level, set by the investigator, which establishes the point beyond which outcomes cannot be attributed to chance.
Most research aims at either comparing two or more groups or at predicting the outcome variable based on the independent and control variables. The research question sets this up at the onset of the research process. Certain research designs favor a comparison approach (e.g., cross-sectional), and others lead to a prediction type analysis (e.g., cohort studies). Experimental studies (and clinical trials) can go either way (or both ways), depending on how the research question is stated. Systematic evaluation of the statistical analysis (SESTA), in any event, plays a central and a perichoretic role, one could say, in evidence-based research and in evidence-based clinical decision-making.

7.6 Self-Study: Practice Problems

1. What conditions call for the usage of nonparametric statistics?
2. True or False: The Bayesian approach to statistics facilitates the usage of both parametric and nonparametric tests, unlike the frequentist approach.
3. Match the following nonparametric tests with their analogous parametric tests.
Kruskal–Wallis H
Wilcoxon rank sum
Wilcoxon signed-rank
Spearman rho
Mann–Whitney U
Friedman

Dependent sample t
One-way ANOVA
Two independent sample t
One-sample t
Pearson r
Two-way ANOVA
4. Is there such a thing as a nonparametric inference? If so, how does it differ from a parametric inference?
5. An immunologist is interested in comparing the effects of peanut butter exposure to dust particle exposure on the number of specific T cells in two groups of postmortem patients that died from a fatal asthma attack. Below are the T cell levels for each group. Use the Wilcoxon rank sum method to rank the distribution of T cells among the groups, along with the average and standard deviation of the ranks.
11.02, 5.98, 101.26, 18.09, 8.01, 45.93, 6.77
0.07, 32.33, 95.12, 11.02, 2.44, 300.65, 750.81
6. Assuming the two groups from question #5 (above) are independent, run the appropriate nonparametric test to determine whether the sums of the rankings for the two groups are different.
7. What is sphericity in the context of nonparametric testing?
8. Which type of error is a chi-square test vulnerable to if used improperly? Why might correcting this be considered a double-edged sword?
9. Randomly selected patients at a hospital were asked whether they prefer to receive their diagnosis from a physician over a nurse. The patients were divided by age group, and the results were as follows:
Age group   Favor   Oppose   Total
Child         191      303     494
Adult         405      222     627
Total         596      525    1121

Using the χ2 test, determine whether there is an association between age group and attitude toward receiving the diagnosis from a physician over a nurse.
10. What type of study design would call for a logistic regression as one of its statistical tests? Why?

Recommended Reading

Bagdonavicius V, Kruopis J, Nikulin MS. Non-parametric tests for complete data. London/Hoboken: ISTE/Wiley; 2011.
Bonferroni CE. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del Real Istituto Superiore di Scienze Economiche e Commerciali di Firenze. 1936;8:3–62.
Chiappelli F. Fundamentals of evidence-based health care and translational science. Heidelberg: Springer-Verlag; 2014.
Conover WJ. Practical nonparametric statistics. New York: Wiley; 1960.
Corder GW, Foreman DI. Nonparametric statistics: a step-by-step approach. New York: Wiley; 2014.
Dunn OJ. Multiple comparisons using rank sums. Technometrics. 1964;6:241–52.
Fisher RA. Statistical methods for research workers. Edinburgh: Oliver and Boyd; 1925.
Friedman M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc. 1937;32:675–701.
Friedman M. A correction: the use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc. 1939;34:109.
Friedman M. A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat. 1940;11:86–92.
Hollander M, Wolfe DA, Chicken E. Nonparametric statistical methods. New York: Wiley; 2014.
Hosmer DW, Lemeshow S. Applied logistic regression. 2nd ed. New York: Wiley; 2000.
Kruskal WH, Wallis WA. Use of ranks in one-criterion variance analysis. J Am Stat Assoc. 1952a;47:583–621.
Kruskal WH, Wallis WA. Errata to Use of ranks in one-criterion variance analysis. J Am Stat Assoc. 1952b;48:907–11.
Mauchly JW. Significance test for sphericity of a normal n-variate distribution. Ann Math Stat. 1940;11:204–9.
Pratt J. Remarks on zeros and ties in the Wilcoxon signed rank procedures. J Am Stat Assoc. 1959;54:655–67.
Wasserman L. All of nonparametric statistics. New York: Springer; 2007.
Wilcoxon F. Individual comparisons by ranking methods. Biom Bull. 1945;1:80–3.
Part II
Biostatistics for Translational Effectiveness
8 Individual Patient Data
Contents
8.1 Core Concepts 141
8.2 Conceptual, Historical, and Philosophical Background 142
8.2.1 Aggregate Data vs. Individual Patient Data 142
8.2.2 Stakeholders 143
8.2.3 Stakeholder Mapping 144
8.3 Patient-Centered Outcomes 145
8.3.1 Primary Provider Theory 145
8.3.2 Individual Patient Outcomes Research 147
8.3.3 Individual Patient Reviews 148
8.4 Patient-Centered Inferences 149
8.4.1 Individual Patient Data Analysis 149
8.4.2 Individual Patient Data Meta-Analysis 149
8.4.3 Individual Patient Data Evaluation 151
8.5 Implications and Relevance for Sustained Evolution of Translational
Research 153
8.5.1 The Logic Model 153
8.5.2 Repeated Measure Models 153
8.5.3 Comparative Individual Patient Effectiveness Research (CIPER) 154
8.6 Self-Study: Practice Problems 155
selection bias may arise. Regardless, IPD supports the active involvement of investigators, improved data quality, and a more powerful analysis. Its role in evidence-based healthcare is critical because of its time-to-event analysis in evaluating prognostic studies. The principal models of evaluation are discussed, along with how they affect the field of translational effectiveness, better known as Comparative Individual Patient Effectiveness Research (CIPER).
This chapter focuses on the transition of patient-centered outcomes research (PCOR) to patient-centered outcomes evaluation (PCOE). As noted in Chiappelli (2014) and discussed further in this chapter, there is a need for an established protocol for IPD meta-analysis to be validated and widely recognized. Altogether, we learn how the fundamental elements that drive the comparative effectiveness research (CER) paradigm are integrated within the construct of IPD analysis and inferences.

8.2 Conceptual, Historical, and Philosophical Background

8.2.1 Aggregate Data vs. Individual Patient Data

It is a fair assumption that patients enroll in randomized controlled trials because they fulfill inclusion criteria, which are based on strictly defined diagnostic criteria of the disease under study. However, the majority of the patients have symptoms that do not fit exactly in the diagnostic criteria formulated by the researchers. Randomized clinical trials are performed on homogeneous patient groups that are artificially constructed by inclusion and exclusion criteria, which can include:

• Disease severity or comorbidity
• Nature of healthcare facilities
• Intervention given
• Clinical endpoint or outcome (death, disease, disability)
• Expected treatment benefit

Groups of patients may seem homogeneous. But in actuality, they can vary largely on individual characteristics. Practice guidelines and recommendations, which often are created from research conducted with specific patient groups, are de facto based on aggregate patient data and may not pertain with the same efficacy and effectiveness to the individual patient, because of their individual physiological uniqueness, individual needs and preferences, individual pathological specificities, and individual psycho-emotional status. As an alternative to the traditional aggregate data meta-analytical protocols, current trends concertedly lead toward the development and characterization of individual patient data meta-analysis. To be clear, we must emphasize that:

• Individual patient data meta-analysis can involve the central collection, validation, and reanalysis of data from clinical trials worldwide that pertain to a common research question and data obtained from those responsible for the original trials.
• The statistical implementation of an individual patient data meta-analysis should preserve the clustering of patients within studies.
• It is a misconception to assume that one can simply analyze individual participant data as if they all came from a single study. Clusters must be established, preserved, and retained throughout the analysis in either the two-step or the one-step approach briefly outlined above for the random model inference, as recommended by Simmonds and collaborators (2005).

In a typical two-step approach, individual patient data are first analyzed in each separate study independently by using a statistical method appropriate for the type of data being analyzed. This may generate a typical aggregate data analysis within each study. In a second step, these data are combined and synthesized in a suitable random model for meta-analysis of aggregate data.
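The two-step approach can be sketched as follows; the two trials and their measurements are invented, and a simple inverse-variance (fixed-effect) pooling stands in for the fuller random model the text calls for.

```python
# Two-step sketch: step 1 analyzes each study's individual patient data
# separately; step 2 pools the per-study estimates. Data are hypothetical.
studies = {
    "trial_A": {"treated": [5.1, 6.0, 5.7], "control": [4.2, 4.8, 4.5]},
    "trial_B": {"treated": [6.2, 5.9, 6.5, 6.1], "control": [5.0, 5.4, 5.2, 4.9]},
}

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Step 1: per-study mean difference and its (approximate) variance,
# so the clustering of patients within each study is preserved.
effects = {}
for name, arms in studies.items():
    t, c = arms["treated"], arms["control"]
    diff = mean(t) - mean(c)
    se2 = var(t) / len(t) + var(c) / len(c)
    effects[name] = (diff, se2)

# Step 2: inverse-variance weighted pooled estimate across studies.
weights = {name: 1 / se2 for name, (_, se2) in effects.items()}
pooled = sum(weights[n] * effects[n][0] for n in effects) / sum(weights.values())
print(round(pooled, 3))
```

A random-effects model would add a between-study variance component to each weight; the two-step structure itself is unchanged.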
simultaneously into a generalized model that simply accounts for the clustering of participants within studies.
As we discuss in greater depth below in this chapter, to conduct individual patient data meta-analysis has distinct advantages, but inherent difficulties as well, including access to data from unpublished trials, inconsistent or incompatible data across trials, inadequate or limited information provided in the published reports, longer follow-up time, more participants, more complex outcomes, and overall lower cost-effectiveness than aggregate data meta-analysis.
In summary, and as stressed in Chiappelli (2014, 2016), individual patient data analysis and meta-analysis may be more reliable, because they more directly target the individual patient (i.e., patient-centered outcomes research) than aggregate data analyses and meta-analyses. But undoubtedly, individual patient data are more complex, expensive, and arduous to interpret than aggregate data meta-analysis. Currently, the PRISMA statement is the standard for investigators when reporting their aggregate data meta-analysis findings. PRISMA also provides a benchmark by which aggregate data meta-analyses may be appraised. It does little or nothing, however, in its present version, to provide useful guidance for the critical evaluation of individual patient data.

8.2.2 Stakeholders

As we discussed elsewhere (Chiappelli 2014), the term “stakeholder” was originally meant to define “those groups without whose support the organization would cease to exist.” The concept and the role of stakeholders have evolved and gained wide acceptance in the context of healthcare and biostatistics. Today, the term specifically refers to the group of individuals and constituencies that contribute, either voluntarily or involuntarily, to the patient’s recovery, well-being, and, more generally, quality of life.
Stakeholders, as the constituencies of individuals who have interests in, and receive concrete benefits from, assisting the patient, form the structure of the socio-environmental reality of the patient. Therefore, any consideration of individual patient assessments and analyses must take into account stakeholders’ attitudes, opinions, knowledge gaps, and interests.
Stakeholder engagement improves the relevance of research, increases its transparency, and accelerates its adoption into practice. Stakeholder-engaged research is overwhelmingly useful to comparative effectiveness research (CER) and patient-centered outcomes research (PCOR). There are several advantageous key points of running stakeholder-centered endeavors in evidence-based healthcare, including:

• To shape the entity’s projects at an early stage, to improve the quality of the project and ensure its success
• To help win more resources and ensure funding support of the project to its successful completion
• To ensure that all participants fully understand what the process and potential benefits of the project are
• To anticipate what people’s reaction to the entity may be and build into the plan the actions that will win people’s support

In brief, the purpose of stakeholder-engaged research is to widen participation in the shared governance and utilization of the extracted data and of the best available evidence among all clinicians, patients, and insurers. This process contributes to aligning the interests among the groups of stakeholders in the context of patient-centered care. The engagement on the part of stakeholders is critical to the success of the contemporary healthcare model.
In patient-centered care, not all stakeholders are equal, perform the same roles, or have the same degree of involvement. Different stakeholders contribute to different extents, and, as recommended by the Accountability Stakeholder Engagement Technical Committee (208), research focus must be deployed to develop and validate novel tools to establish the nature, level (or quantity), and quality of stakeholder engagement.
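Such validated tools are still emerging. As a purely illustrative sketch of how a stakeholder's level of engagement might be operationalized, consider a simple influence-interest grid: the names, 0-10 ratings, threshold, and quadrant labels below are invented for this example (the labels follow a conventional Mendelow-style grid, not a scheme from this text).

```python
# Purely illustrative sketch: stakeholder names, ratings, the 0-10 scale,
# and the quadrant labels are invented (conventional Mendelow-style grid).

def map_stakeholder(influence, interest, threshold=5):
    """Place one stakeholder in a quadrant of an influence-interest grid."""
    if influence >= threshold and interest >= threshold:
        return "manage closely (key stakeholder)"
    if influence >= threshold:
        return "keep satisfied"
    if interest >= threshold:
        return "keep informed"
    return "monitor"

# Hypothetical 0-10 ratings elicited during a stakeholder analysis.
ratings = {
    "patients": (4, 9),
    "providers": (8, 8),
    "payers": (9, 3),
    "community groups": (2, 2),
}

for name, (influence, interest) in ratings.items():
    print(f"{name}: {map_stakeholder(influence, interest)}")
```

A real engagement instrument would, of course, need validation of its ratings and thresholds; the sketch only shows the mapping step.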
Certain lines of investigation have already been drawn and include:

• To establish the necessary commitment to stakeholder engagement
• To ensure that stakeholders’ involvement is fully integrated in strategy and operations
• To define the purpose, scope, and stakeholders of the engagement
• To characterize and define what a quality stakeholder engagement process looks like

8.2.3 Stakeholder Mapping

A well-constructed stakeholder analysis includes a “stakeholder map,” which is derived from the identification of the needed stakeholders, in terms of the stakeholder’s perceived and real power, influence, hierarchies of values, and interest, in a manner similar to Fletcher and collaborators’ Key Performance Areas (2003). The stakeholder analysis proceeds along four principal steps:

1. Identify who the stakeholders are or should be
2. Prioritize, map, and classify the stakeholders on the basis of interest, relative influence, and likelihood of involvement
3. Understand the needs, wants, priorities, and opinions of the stakeholders
4. Educate the stakeholders to keep them informed about, in touch with, and advocating in favor of the project as it evolves

By this systematic and validated approach, and as we noted previously (Chiappelli 2014), fundamental principles of stakeholders can be identified, such as, but not limited to:

• The interests of all stakeholders, who may affect or be affected by the project
• Potential issues that could disrupt the project
• Key people for information distribution during the execution phase
• Relevant groups that should participate in different stages of the project
• Communication planning and stakeholder management strategies
• Approaches to reduce potential negative impacts and manage negative stakeholders

In a related context, the 6Ps framework of stakeholders identifies key groups to consider for engagement, as follows:

1. Patients and the public, the consumers of patient-centered healthcare
2. Providers, including clinicians and organizations that provide care to patients and populations
3. Purchasers (e.g., employers) who underwrite the costs of healthcare
4. Payers and insurers who pay and reimburse medical care
5. Governmental policy makers and advocates in the nongovernmental sector, product makers, and manufacturers
6. Researchers, including writers of research dissemination reports

Outcomes of the formative and summative evaluation (see Sect. 9.4) of stakeholder protocols may result in a reassessment of their relative ranking and position in the project along the following broad system:

• Primary stakeholders, those individuals ultimately affected, either positively or negatively, by the project’s outcomes (e.g., patients)
• Secondary stakeholders, those individuals who are the intermediaries, the people indirectly affected by the project’s outcomes (e.g., caregivers, family members)
• Key stakeholders, those individuals, who may or may not be primary stakeholders as well, who have a significant influence on the outcome and/or running of the project

Taken together, stakeholder analysis is a critical sine qua non for stakeholder identification and for analyzing the range of interests and needs among primary and secondary stakeholders. The stakeholder analysis process can be seen in
terms of five generally sequential, yet independent but integrated, stages of activity:

• Defining: Stakeholders are defined and identified in relation to a specific issue: stakeholder identification operates in respect to a particular specified issue.
• Long Listing: With respect to the specified issue, a “long list” of key, primary, and secondary stakeholders is drawn that indicates groupings (e.g., public, private, and community) and subgroupings (i.e., gender, ethnicity, age).
• Mapping: Analysis of the long list along selected criteria (i.e., interest, social influence, political role) to allow systematic exploitation of positive attributes and identification of gaps or needed bridge-building among stakeholders.
• Visualizing: Drawing an influence–interest–capacity matrix is essential at this stage.
• Verification: Validity of the analysis is established by assessing and verifying stakeholders’ availability and commitment. This step may require additional informants and information sources.
• Mobilizing: Strategies for sustaining effective participation of the stakeholders, tailored to the different groups and subgroups of identified stakeholders, and including empowerment interventions for high-stake stakeholders with little power or influence.
• Evaluation: Reassess to ensure maximizing the roles and contribution of all stakeholders.

In a patient-centered healthcare modality, stakeholder engagement strategies must be responsive to the values and interests of patients, patient advocates, and the public. The process ought to include:

• Evidence prioritization—establishing a vision and mission for research, identifying topics, setting priorities, and refining key working questions (i.e., formulation of CI).
• Evidence generation—obtaining and refining the bibliome.
• Evidence synthesis—systematic review of research (continued exploration of engagement in the conduct and assessment of reviews is needed) (i.e., research synthesis).
• Evidence integration—to integrate clinical, behavioral, economic, and systems evidence in decision analysis, simulation modeling, cost-effectiveness analysis, and related protocols (i.e., translational inference).
• Evidence dissemination—active distribution of the outcomes of the research process described above to the five strata of stakeholders.
• Evidence utilization—formative and summative evaluation, adoption, and implementation of the findings in policies and revised clinical practice guidelines for practical use in specific clinical and world settings (i.e., translational effectiveness).
• Evidence feedback—stakeholders offer feedback regarding their participation, including on mechanisms for engagement, intensity of engagement, and support throughout the process, as well as the nature and use of uncovered evidence.

8.3 Patient-Centered Outcomes

8.3.1 Primary Provider Theory

By “individual patient data” (IPD), we mean the availability of raw data for each study participant in each included trial. That is distinct from aggregate data (summary data for the comparison groups in each study), which have been the focus of the preceding chapters—mainly because aggregate data are still the focus of healthcare research and biostatistics.
Strictly speaking, in the context of patient-centered evidence-based healthcare, it is impossible to concede that aggregate mean data are—as a rule—representative of any one patient in the group. In point of fact, aggregate data are rather meaningless and useless in the context of patient-centered research outcomes. Consequently, a timely and critical approach to collecting and looking at data is specifically and uniquely directed to each individual
patient: patient-centered measures of care and individual patient data analysis.
The core of patient-centered care is patient satisfaction in clinical outcome. Thence emerged Aragon’s Primary Provider Theory, a generalizable theory holding that patient-centeredness is a latent trait/ability of healthcare providers that influences their care behavior and related patient outcomes. Based on these principles, research can be crafted to test directly the robustness of the theory’s inferences across patients and healthcare settings, including hospitals, medical practices, and emergency departments, and across healthcare providers, including physicians and allied health practitioners, nurses, nurse practitioners, dentists, physician assistants, and others. The Primary Provider Theory is grounded on eight fundamental principles:

1. Clinical competency is one of the necessary conditions of desired outcomes.
2. Desired outcomes depend on the transmission of care, which is based on clinical knowledge, effective communication, and interaction with patients.
3. Patient-centeredness describes an underlying quality of the provider’s interaction with and transmission of care to the patients.
4. Providing patient-centered transmission of care influences the outcomes of the treatment and the satisfaction of the patients.
5. Providers are uniquely responsible for the patient-centered quality of the transmission of care and clinical knowledge to their patients.
6. Providers who are both clinically competent and patient-centered generally achieve desired clinical outcomes and compliance.
7. Patients and families value patient-centered care because the patient-centered encounter is more important than any financial objectives.
8. Patients are the best judges of patient-centeredness.

Patient satisfaction, in its assessment and analysis, is intertwined with patient-centered outcomes, such as:

• Expectations of provider value
• Descriptors of the dynamic process in which patient satisfaction occurs and converges from provider power and patient expectations

Therefore, patient satisfaction can be conceptualized as the result of an underlying network, a meta-construct of interrelated satisfaction constructs, including satisfaction of the patient with the primary provider and the care received, with the waiting for the provider and the bedside manner of the provider, and with the provider’s assisting office and clinical staff. Taken together, these elements define what the primary provider offers to the individual patient in terms of the greatest clinical utility.
The Primary Provider Theory generates the patient-centered measure of quality of service and offers an alternative paradigm for the measurement and realization of patient satisfaction by informing the patient-centered physician directly about how to improve practice culture, continuing medical education, quality of care improvement, outcome measurement, satisfaction survey construction, and the like.
The Primary Provider Theory is related somewhat to the trialectical relationship among the clinical provider, the patient, and the patient-centered best available evidence, which we described at length elsewhere (Chiappelli 2014). In brief, the paradigm is an adaptation of the person–environment fit model to evidence-based healthcare.
Additional patient-centered measurements in healthcare include quality indicators generated by the Agency for Healthcare Research and Quality (AHRQ) and distributed as free software by AHRQ for that purpose. These tools can serve hospitals to help identify quality of care events that might need further improvement, greater safety, and more extensive evaluation. They generally include:
areas, which may have been avoided through access to high-quality outpatient care
• Inpatient Quality Indicators that reflect quality of care inside hospitals, as well as across geographic areas, including inpatient mortality for medical conditions and surgical procedures

These indicators provide a set of measures that offer a novel and unbiased perspective on hospital quality of care using hospital administrative data. They reflect specifically quality of care inside hospitals and include inpatient mortality for certain procedures and medical conditions. In addition, AHRQ has also developed:

• Patient Safety Indicators that reflect quality of care inside hospitals, as well as geographic areas, and focus on potentially avoidable complications and iatrogenic events
• Pediatric Quality Indicators that use indicators from the other three modules, with adaptations for use among children and neonates, to reflect quality of care inside hospitals, as well as geographic areas, and identify potentially avoidable hospitalizations

Taken together, the AHRQ quality indicators serve to help hospitals and clinical practices in the community:

• Identify potential problem areas that might need further study and provide the opportunity to assess quality of care inside the hospital using administrative data found in the typical discharge record.
• Include mortality indicators for conditions or procedures for which mortality can vary from hospital to hospital.
• Include utilization indicators for procedures for which utilization varies across hospitals or geographic areas.
• Include volume indicators for procedures for which outcomes may be related to the volume of those procedures performed.

New research must now integrate and utilize these AHRQ quality indicators and the novel assessment tool designed and validated to measure the trialectical relationship among the clinical provider, the patient, and the patient-centered best available evidence, to test and verify the Primary Provider Theory.

8.3.2 Individual Patient Outcomes Research

Methodologically speaking, individual patient outcomes research, or patient-centered outcomes research (PCOR), protocols should:

• Specify the outcomes and patient characteristics to be analyzed.
  – Establish, before embarking on data collection, what data are actually available.
  – Determine, when deciding what variables to measure, what analyses are planned and what data will be needed to do them; minimize the potential for redundant or useless data gathering.
• Consider the individual data items in terms of which further or constituent variables are necessary.
  – Redefine outcome variables as necessary for consistency and completeness of analysis.
• Provide protocol and data format instructions for standardization among experimenters.
  – Streamline paper and digital data acquisition formats.
• Collect and analyze data at the level of the individual participant to enable translation between different staging, grading, ranking, or other scoring systems.
  – Pool homogeneous data, whenever possible, from studies whose combination would not otherwise be possible because of differences between the data collection tools.

The aims of the operations of individual patient data verification are:

1. To increase the probability that the data supplied are accurate
2. To confirm that trials are appropriately randomized
3. To ensure, wherever appropriate, that the data are current
Furthermore, to ensure efficient data verification, a practical protocol was outlined and recommended in Chiappelli (2014).
Collecting PCOR data that include the time interval between the randomization and the event of interest enables time-to-event analyses, including reverse Kaplan-Meier survival, to be conducted.
For outcomes such as survival, where events can continue to take place over time, PCOR meta-analyses can provide an important opportunity to examine the effects of interventions over a prolonged period. They can also provide an opportunity for researchers to provide more up-to-date data for relevant outcomes such as mortality than they have published for their study.
In brief, PCOR data are useful in that they may be the most practical way to carry out analyses to investigate whether any observed effect of an intervention is consistent across well-defined types of participants. By means of PCOR data, the investigator can:

• Obtain a straightforward categorization of individuals for subgroup analysis, stratified by study and defined by single or multiple factors.
• Produce more complex and precise analyses, such as multilevel modeling, to explore associations between intervention effects and patient characteristics.
• Conduct in-depth exploration of patient characteristics, irrespective of the intervention.
• Consequentially yield more accurate inferences.

8.3.3 Individual Patient Reviews

Reviews of PCOR data should, as we already emphasized in Chiappelli (2014), be considered in circumstances where the published information does not permit a good quality review or where particular types of analyses are required that are not feasible using standard approaches. There are situations where the PCOR approach will not be feasible, because data have been destroyed or lost or, despite every effort, researchers do not wish to collaborate. There may also be circumstances where it may not be necessary, for example, if all the required data are readily available in a suitable format within publications.
Researchers naturally require safeguards on the use of their study data and wish to ensure that it will be stored securely and used appropriately. For this reason, a signed confidentiality agreement is often used as a “contract” between the original investigators and the PCOR review team. The details of such agreements will vary, but most will state that data will be held securely, be accessed only by authorized members of the project team, and will not be copied or distributed elsewhere. It is also good practice to request that individual participants are de-identified in supplied data, such that individuals are identified only by a study identifier code and not by name. This seems to be an increasing requirement for obtaining PCOR from some countries where data protection legislation requires that a participant cannot be identified from the data supplied. Data sent by email should be encrypted wherever possible.
The general approach to PCOR review is similar to any other systematic review. The methods used should differ substantially only in the data collection, checking, and analysis stages. Just as for any Cochrane review, a detailed protocol should be prepared, setting out the objective for the review, the specific questions to be addressed, study inclusion and exclusion criteria, the reasons why PCOR are sought, the methods to be used, and the analyses that are planned. Similarly, the methods used to identify and screen studies for eligibility should be the same irrespective of whether PCOR will be sought, although the close involvement of the original researchers in the project might make it easier to find other studies done by them or known to them. The project should culminate in the preparation and dissemination of a structured report. A PCOR review might also include a meeting at which results are presented and discussed with the collaborating researchers.
In brief, and as we stated in Chiappelli (2014), PCOR review is a specific type of systematic
review of prognostic studies. To allow this type of analysis, one needs to know the time that each individual spends “event-free.” This is usually collected as the date of randomization, the event status (i.e., whether the event was observed or not), and the date of last evaluation for the event. Sometimes, it will be collected as the interval in days between randomization and the most recent evaluation for the event. Time-to-event analyses are performed for each trial to calculate hazard ratios, which are then pooled in the meta-analysis.
From an analysis standpoint, most individual patient data meta-analyses to date have used a two-stage approach to analysis:

1. In the first stage, each individual study is analyzed in the same way, as set out in the meta-analysis protocol or analysis plan.
2. In the second stage, the results, or summary statistics, of each of these individual study analyses are combined to provide a pooled estimate of effect in the same way as for a conventional meta-analysis in systematic reviews.

More complex approaches using multilevel modeling have been described for binary data, continuous data, ordinal data, and time-to-event data, but, currently, their application is less common. When there is no heterogeneity between trials, a stratified log-rank two-stage approach for time-to-event data may be best avoided for estimating larger intervention effects.
In brief, individual patient data meta-analysis involves the central collection, validation, and reanalysis of “raw” data from all clinical trials worldwide that have addressed a common research question, with data obtained from those responsible for the original trials.
As we already emphasized in Chiappelli (2014), despite the many advantages of individual patient data meta-analysis in assessing a plethora of prognostic outcomes in evidence-based healthcare, there is considerable scope for enhancing the methods of analysis and presentation of this analysis.
Timely and concerted research must address several aspects of individual patient data meta-analysis, including:

• Improving design to secure a more comprehensive investigation of the influence of patient-level covariates and confounders on the heterogeneity of treatment effects, both within and between trials: that is, to separate within-trial and across-trials treatment-covariate interactions.
• Better characterizing the impact of heterogeneity or the use of random effects.
• More stringent consideration of statistical implementation, particularly because the analysis must preserve the clustering of patients within studies: it would be quite inappropriate to simply analyze individual participant data as if they all came from a single study. Clusters must be retained during analysis through the two-step or one-step approach outlined above: in the two-step approach, the data for each study are first aggregated, such as into a mean treatment effect estimate and its standard error, and then synthesized in the second step by means of the suitable inference model for meta-analysis. Alternatively, the individual participant data from all studies can be modeled simultaneously in a one-step process while accounting for the clustering of participants within studies. Either model provides a PCOR meta-analysis that yields the very estimate of the single-patient treatment effect under study.

In closing, it is important to reiterate the observation made in Chiappelli (2014) that a formal protocol for individual patient data meta-analysis must be established, validated, and widely recognized, such that a new revision of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist could include it, perhaps along the essential criteria we noted elsewhere (Chiappelli 2014, 2016).
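The second stage of the two-stage approach described above can be illustrated with a short sketch. The numbers are purely hypothetical: the per-trial log hazard ratios and standard errors stand in for stage-1 results already computed from the raw patient data of each trial, so that the clustering of patients within trials is preserved.

```python
# Purely illustrative sketch of stage 2 of a two-stage IPD meta-analysis:
# per-trial log hazard ratios (stage-1 summaries, invented here) are pooled
# by inverse-variance weighting, with and without between-trial heterogeneity.
import math

def pool_fixed(estimates, ses):
    """Inverse-variance (fixed-effect) pooling of per-trial estimates."""
    w = [1.0 / se ** 2 for se in ses]
    pooled = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)
    return pooled, math.sqrt(1.0 / sum(w))

def pool_random(estimates, ses):
    """DerSimonian-Laird random-effects pooling, allowing between-trial
    heterogeneity (tau^2, method-of-moments estimate)."""
    w = [1.0 / se ** 2 for se in ses]
    fixed, _ = pool_fixed(estimates, ses)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(estimates) - 1)) / c)
    w_star = [1.0 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * y for wi, y in zip(w_star, estimates)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star))

# Hypothetical stage-1 summaries from three trials (log hazard ratio, SE).
log_hrs = [-0.50, 0.10, -0.60]
ses = [0.12, 0.15, 0.20]

log_hr_fixed, _ = pool_fixed(log_hrs, ses)
log_hr_random, _ = pool_random(log_hrs, ses)
print(f"pooled hazard ratio, fixed effect:   {math.exp(log_hr_fixed):.3f}")
print(f"pooled hazard ratio, random effects: {math.exp(log_hr_random):.3f}")
```

The one-step alternative would instead fit a single multilevel model to all patient records at once, with a trial-level clustering term; that model is not sketched here.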
8.4 Patient-Centered Inferences
8.4.3 Individual Patient Data Evaluation

We could conceive the scientific endeavor as being a four-step process, which can be succinctly outlined as the development of a new model, research question, and hypothesis; systematic research designed to test the model by proving or disproving the hypothesis; application of the findings in real-life settings; and evaluation of the implications of the outcomes for improving the model and generating novel hypotheses. That is to say, the phase of research and development both initiates and initiates anew, in a dynamic process, which is akin to the progression on a spiral, rather than on a circle: walking along a circular path leads us back to where we started from; progressing along a spiral leads us to ever newer, better, greater, and more fascinating discoveries than we could imagine at the onset.
But, in order for the scientific endeavor to retain its pragmatic systematic nature, the research and development step must engender a phase during which findings are applied, disseminated, and generalized to environments beyond the variables considered in the research study. The extent to which findings can be validated beyond the study’s constraints is, as noted in previous chapters (see Chap. 3), what is termed external validity. The process of establishing external validity is akin to the process of evaluating the implications and applications of the research outcomes to the real-world situations they were designed to address, with the ultimate purpose both of improving the original theoretical model and of generating novel hypotheses.
That is to say that, in brief, yes, we pursue evidence-based healthcare; yes, we determine that research synthesis is the appropriate research protocol to obtain the best available evidence; and yes, we determine the fundamental elements that drive the utilization of the best available evidence in specific clinical settings, which we pragmatically defined as translational effectiveness. But the question remains as to how to evaluate the outcomes of evidence-based healthcare. This is also to say, of course, that evaluation cannot act in a void. It must be grounded—as research is for sure—on a theoretical model. Evaluation must optimally always be theory-based.
This chapter examines current trends in the science of evaluation. This chapter also proposes the next necessary step in the field: from patient-centered outcomes research (PCOR) to patient-centered outcomes evaluation (PCOE).
The core concepts discussed in this chapter pertain to evaluation science. The principal models of evaluation are discussed as they pertain to translational effectiveness. The ultimate goal of a previous chapter (see Chap. 9) was to describe the process of evaluation. Now, it becomes straightforward to expand this paradigm and include it in the topic of the present chapter: that is, to transit from patient-centered outcomes research (PCOR) to patient-centered outcomes evaluation (PCOE).
Evaluation is critical to understanding how participatory processes work and how they can be structured to maximize the benefits of stakeholder and decision-maker collaboration. Mixed model analysis allows us to investigate factors whose levels can be controlled by the researcher (fixed) as well as factors whose levels are beyond the researcher’s control (random).
Mixed model analysis is preferred in PCOE. It usually adopts a frequentist inferential interpretation, although the Bayesian approach to inference is becoming increasingly integrated in mixed model analysis. Mixed models of evaluation imply a participatory process. Stakeholders must be engaged early in the process to articulate the goals for the project and the participatory process to achieve those goals. The assumptions underlying the goals and the process form the basis for the evaluation questions. The stakeholders are also involved in the evaluation methodology, data collection procedures, and the interpretation of the results. Mixed models are preferred and superior to other models of evaluation in the context of patient-centered care because they provide a systematic way to explore, explain, and verify evaluation results, by proffering opportunities for
evaluators to examine and peruse systematically data collection and analysis strategies for prompt incorporation of a large number of evaluation questions (i.e., “nodes”) into the study design. In brief, the mixed method evaluation model yields a novel and creative framework for the design and implementation of rigorous, meaningful evaluations of participatory approaches that benefit all stakeholders, from the patient to the clinician, from the user to the decision-makers, which the random model also proffers, but with greater caveats.
We recall the emphasis we have made to distinguish between the evaluation of outcomes (i.e., outcome monitoring evaluation: have proposed targets been achieved?) and the evaluation of impact. The latter, in brief, pertains to the systematic assessment of the changes (e.g., improvement vs. deterioration of quality of life)—intended as well as unintended side effects—attributed to a particular intervention, program, or policy. In an impact evaluation program, the intended impact corresponds to the program goal and is generally obtained as a comparison of outcomes among the participants who comply with the intervention (see Chap. 9 on evaluation) in the treatment group to outcomes in the control subjects. Here, we must distinguish between:

• Treatment-on-the-treated (TOT) analyses
• Intention-to-treat (ITT) analyses, which typically yield a lower-bound estimate of impact but are more relevant than TOT in evaluating the impact of optional programs, such as patient-centered care

In this case, it is clear that impact evaluation protocols follow primarily the logic model of evaluation (vide infra), in which outputs refer to the totality of longer-term consequences associated with the intervention, program, or policy under study on quality of life, satisfaction, and related patient-centered outcomes. It is also clear that impact evaluation implies a “counter-factual” analysis that compares actual outcomes and findings to results that could have emerged in the absence of the intervention under study. In broad lines, we could say that, whereas outcome evaluation “observes” outcomes, impact evaluation seeks to establish a cause-and-effect relationship in that it aims at testing the hypothesis that the recorded changes in outcome are directly attributable to the program, intervention, or policy being evaluated.
Impact evaluation—that is to say, in broad lines, PCOE—serves to inform the stakeholders about which program works, which policy is failing, and in which contextual environment a given intervention is successful or not; that is to say, in what specific clinical setting will translational effectiveness be optimal, why, at what cost (financial, risk-wise, and otherwise), etc. Impact evaluation is timely and critical to the pursuit of systematic reviews in patient-centered care. Single difference estimators are designed to compare mean outcomes at end line, based on the assumption that intervention and control groups have homogeneous values at baseline. Double (or multiple) difference estimators analyze the difference in the change, delta, in outcome from baseline over time for the intervention and control groups at each time point following implementation of the intervention.
From the methodological standpoint, impact evaluation is complex primarily because it involves a comparison between the intervention under study and an approximated reference situation deprived of said intervention. This is the key challenge to impact evaluation: the reference group cannot be directly observed; it can only be inferred; and, for all intents and purposes, it remains merely hypothetical. Consequently, impact evaluation relies upon an uncontrolled quasi-experimental counter-factual design, which can yield either prospective (ex ante) or retrospective (ex post) time-dependent comparisons.

• Prospective impact evaluation begins during the design phase of the intervention and requires the collection of baseline data for time series comparative analyses, with midline and end-line data collected from the intervention and control groups (i.e., double and multiple difference estimation based on the
delta’s). Subjects in the intervention group are critical to establishing and enhancing perfor-
referred to as the “beneficiaries,” and subjects mance and outcomes.
in the control group are the “non-beneficiaries” Logic models describe the concepts that need
(of the intervention). Selection and allocation to be considered at each separate step and in so
principles and issues, including clustering doing inextricably link the problem (situation) to
effects, discussed in previous chapters apply the intervention (our inputs and outputs), to the
to impact evaluation studies to the same extent impact (outcome). The application and imple-
as noted for research investigations. mentation of the logic model in the planning
• Retrospective impact evaluation pertains to phase allows precise communication about the
the implementation phase of interventions or purposes of a project, the components of a proj-
programs. These modes of evaluation utilize ect, and the sequence of activities and the
end-stage survey data (i.e., single difference expected accomplishments. The logic model
estimation), as well as questionnaires and entails six fundamental steps:
assessments as close to baseline as possible, to
ensure comparability of intervention and com- 1 . Situation and Priorities
parison groups. 2. Inputs (what we invest)
3. Outputs
Threats to the internal and external validity of 4. Activities (the actual tasks we do)
impact evaluation are related to the threats of 5. Participation (who we serve; customers and
internal and external validity of research designs, stakeholders)
as discussed in preceding chapters. They were 6. Outcomes/Impacts:
also described at length in Chiappelli (2014). (a) Short-Term (learning: awareness, knowl-
edge, skills, motivations)
(b) Medium-Term (action: behavior, practice,
8.5 Implications and Relevance decisions, policies)
for Sustained Evolution
(c) Long-Term (consequences: social, eco-
of Translational Research nomic, environmental, etc.)
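The single and double difference estimators described earlier in this section lend themselves to a brief numerical sketch. The following Python fragment is illustrative only: the function names and the quality-of-life scores are hypothetical, not taken from the text.

```python
# Sketch of the single- and double- (difference-in-differences) estimators
# for impact evaluation. All scores below are hypothetical.

def single_difference(treat_end, control_end):
    """End-line contrast of mean outcomes (assumes comparable baselines)."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(treat_end) - mean(control_end)

def double_difference(treat_base, treat_end, control_base, control_end):
    """Difference in the change (delta) from baseline: treatment vs. control."""
    mean = lambda xs: sum(xs) / len(xs)
    delta_treatment = mean(treat_end) - mean(treat_base)
    delta_control = mean(control_end) - mean(control_base)
    return delta_treatment - delta_control

# Hypothetical quality-of-life scores (beneficiaries vs. non-beneficiaries)
treat_base, treat_end = [50, 52, 48], [60, 63, 57]
ctrl_base, ctrl_end = [53, 51, 52], [55, 53, 54]

print(single_difference(treat_end, ctrl_end))                         # 6.0
print(double_difference(treat_base, treat_end, ctrl_base, ctrl_end))  # 8.0
```

Because the control group in this toy example starts two points above the treatment group at baseline, the end-line-only (single difference) estimate understates the effect that the double difference recovers by also subtracting the baseline imbalance.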
…analysis, pooling, and plotting of independent patient data in meta-analysis. These are large data sets, often as complex as what is today referred to as "big data," the analysis of which in translational science is still in its infancy.

8.6 Self-Study: Practice Problems

1. Why might the analysis of individual patient data be more advantageous than the analysis of aggregate data?
2. What are current complications within the healthcare field that have impeded the utilization of individual patient data?
3. Who may be considered a stakeholder and why do they play an important role in healthcare?
4. Can inferences be made from individual patient data? Explain.
5. What are the differences between inferences made from individual patient data compared to aggregate group data?
6. Describe the relationship shared between individual patient data and patient-centered outcomes in translational healthcare.
7. Which of the following is the most appropriate study design to utilize in patient-centered outcome research (PCOR) for obtaining the best available evidence?
(a) Diagnostic Study
(b) Prognostic Study
(c) Naturalistic Study
(d) Research Synthesis Study
8. Based on the answer above, what is the most appropriate format of the relevant research question?
9. After the analysis of individual patient data in PCOR, what is the next necessary step in this dynamic process and why is it important?
10. In the evaluation of patient-centered programs and research, a novel repeated measure model is proposed. How is this different than the traditional repeated measure model and what is its advantage?
9 Evaluation

Contents
9.1 Core Concepts 157
9.2 Conceptual, Historical, and Philosophical Background 158
9.2.1 Conceptual Definition 158
9.2.2 Historical and Philosophical Models 158
9.2.3 Strengths and Deficiencies 160
9.3 Qualitative vs. Quantitative Evaluation 162
9.3.1 Quantifiable Facts Are the Basis of the Health Sciences 162
9.3.2 Qualitative Evaluation 162
9.3.3 Qualitative vs. Quantitative Evaluation 163
9.4 Formative vs. Summative Evaluations 163
9.4.1 Methodology and Data Analysis 163
9.4.2 Formative and Summative Evaluation 163
9.4.3 Comparative Inferences 164
9.5 Implications and Relevance for Sustained Evolution of Translational
Research 164
9.5.1 Participatory Action Research and Evaluation 164
9.5.2 Sustainable Communities: Stakeholder Engagement 165
9.5.3 Ethical Recommendations 165
9.6 Self-Study: Practice Problems 165
Recommended Reading 166
thing works as opposed to how it works. Modern evaluation is relevant for program-related decision-making. It draws evaluative conclusions about merit, worth, and quality designed to improve a particular health-related program or policy. The four major evaluation strategies (i.e., scientific-experimental model, management-oriented model, qualitative-anthropological model, and participant-oriented model) are discussed further in this chapter.

We examine and compare two sets of evaluation, specifically qualitative versus quantitative and formative versus summative. In brief, quantitative evaluation—the basis of research in the health sciences—and qualitative evaluation are complementary in that they are equally essential to scientific inquiry, yielding data that neither approach would produce on its own. On the other hand, formative and summative evaluation yield significant estimates of the program's benefits, costs, and liabilities over a period of time.

As mentioned, evaluation is stakeholder driven, and this chapter concludes with participatory action research and evaluation's (PARE) contribution in raising stakeholder engagement, awareness, and health literacy, as well as several ethical conducts to take into consideration for translational healthcare. Broadly, PARE is an approach that seeks to understand the reality of what communities experience, directed to social change and improved effectiveness of translational healthcare.

9.2 Conceptual, Historical, and Philosophical Background

9.2.1 Conceptual Definition

Evaluation can be conceived as a systematic longtime process aimed at the determination of the worth (i.e., effectiveness and efficacy) and significance, strengths and weaknesses, and validity and intrinsic biases and fallacies of certain studies, investigations, programs, and policies that pertain to society's well-being, including education and healthcare. In fact, the breadth of evaluation endeavors is even broader than that and often spans a wide range of human enterprises, from the arts to criminal justice and from for-profit business to nonprofit organizations. Evaluation rests on a well-characterized set of criteria, protocols, and standards to ensure its reliability and validity. It examines not only a program or project in its entirety but also examines its individual components, from the statement of realistic and feasible aims to the conceptualization of the background facts and data, the statement of expectations and alternative inferences, and the detailed methodology and interpretation of the outcomes. Ultimately, evaluations confront the very decision-making process that arises from the completed project.

Overall, the purpose of evaluation is to ascertain the degree of achievement or value vis-à-vis the objectives and results of any such action as it is in process and as it has been completed. Evaluation is, one could say, the lightning rod that helps policy makers, decision-makers, and actors in a certain field remain focused to gain insight into planned or existing programs and to identify required initiatives for new and improved directions.

9.2.2 Historical and Philosophical Models

The origins of evaluation can be traced to antiquity. But the conceptualization of modern-day uses, protocols, and implications of evaluation as a scientific discipline for societal decision-making and policies is more recent. Several discrete periods of the evolution of modern evaluation as we know it today can be identified.

The foundations of contemporary evaluation theory and practice (Figure 9.1) were established as a modern scientific pursuit by William Farish in the early 1790s. In his role as the Proctor of Examinations at Cambridge, Farish examined the qualitative and subjective scoring of examinations and consequentially the potential bias that was introduced in the ranking of students. He developed a process by which correct answers and incorrect answers could be scored numerically.
Criterion-referenced testing was refined to yield a valid and reliable measure of group performance based on established criteria, as well as, and as importantly, a measure of achievement of each individual subject. By the 1970s and 1990s, criterion-referenced testing became a timely and critical complement to norm-referenced testing, which is designed to distinguish differences from an established normative value. In that regard, it was the precursor of today's individual patient measurement, analysis, and inference (cf., Chap. 11).

Evaluation today is considered a field of academic inquiry in its own right. It encompasses six distinct sub-domains:

(a) Objectives-oriented
(b) Management-oriented
(c) Consumer-oriented
(d) Expertise-oriented
(e) Adversary-oriented
(f) Participant-oriented

Academic journals and higher education graduate degrees in concert have contributed to the professionalization of contemporary evaluation. This movement was coordinated by some of the top universities (e.g., University of Illinois, Stanford University, Boston College, University of California Los Angeles, University of Minnesota, and Western Michigan University), and, while it struggled under the Reagan administration, when funding was dramatically cut, it recovered in the Clinton years, when much of the funding for research and academic development was reinstated. It fell into disarray again during the Great Recession of the Bush administration but rebounded when the economy stabilized during the Obama years. Many academicians fear that funding for evaluation may be curtailed once more in the current political climate.

9.2.3 Strengths and Deficiencies

Evaluation is a methodological area of research that is closely related to but clearly distinct from other traditional modes of inquiry. It utilizes many of the same methodologies and data analytical paradigms used in research in general (cf., Chaps. 1–7), but it does so for a different purpose or mission. Therefore, evaluation requires an additional set of special skills: management ability, political dexterity, sensitivity to multiple stakeholders, and other specific attributes.

Evaluation has a distinct mission or purpose, compared to research per se, in that it pertains to the systematic assessment of the worth or merit of the findings produced by research. It follows that evaluation has a central role in the interpretative processing of research findings and related feedback functions. Figure 9.2 below compares research and evaluation.

That is to say, evaluation is conceptualized as the systematic acquisition and assessment of information, including the generation of the resulting feedback to the appropriate stakeholders, viz., sponsors, donors, client groups, administrators, staff, and other relevant constituencies. It produces outcomes that are intended to influence decision-making and policy formulation through the provision of empirically driven feedback.

Nonetheless, that is not always the case, and this potential ambivalence can be a weakness of the process of evaluation, in part attributable to the heterogeneity of evaluation strategies. Four major groups of evaluation strategies can be identified:

• The scientific-experimental model of evaluation rests on the fundamental values and methods that are well grounded and generally accepted across the health, life, and social sciences. They include the unbiased pursuit of impartiality, accuracy, objectivity, reliability and replicability, and validity. The scientific-experimental model of evaluation relies on experimental and quasi-experimental designs, as well as some observational designs (i.e., cohort), and focuses on questions that pertain to comparative effectiveness and efficacy research and analysis for practice (i.e., CEERAP), comparative effectiveness research (i.e., CER), and comparative effectiveness analysis (i.e., CEA).
• The management-oriented model of evaluation examines comprehensiveness in evaluation and inserts evaluation as a most valued
9.2 Conceptual, Historical, and Philosophical Background 161
Recommendations Recommendations
based on research based on questions
component of a larger framework, which usually comprises business, organizational, governmental, or occupational activities (e.g., PERT, the program evaluation and review technique; CPM, the critical path method; logic-based framework).
• The qualitative-anthropological model of evaluation emphasizes the relevance of naturalistic observation, the essential nature of the phenomenological quality of the evaluation context, and the value of subjective human interpretation in the evaluation process.
• The participant-oriented (or client-centered or consumer-oriented) model of evaluation focuses on the critical nature of the evaluation participants, clients, and users of the program or technology under examination.

A second level of heterogeneity emerges from the fact that each type of evaluation can be either qualitative or quantitative in nature. Formative evaluation can provide stepwise estimates, based on qualitative observations and on quantitative data, at given time points during the process. Summative evaluation examines the effects or outcomes of the object under evaluation at the completion of the process; it can summarize either qualitatively or quantitatively presented findings to highlight a given program's strengths and weaknesses and successes and failures and to scrutinize recommendations for future improvements. Summative evaluation uses equally qualitative and quantitative data to establish whether the outcome observed at the completion of the program under examination can in fact be said to have been caused by—to be a direct impact of—the program under evaluation, or to be random coincidence. The major strength of formative and summative evaluations together is that they yield timely and critical estimates of the relative benefits, overall costs, and liabilities of the program under examination over time. This chapter examines these issues in greater detail.
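The distinction drawn above—formative, stepwise estimates at interim time points versus a summative end-of-program contrast—can be sketched as follows. The criterion value, scores, and function names are hypothetical illustrations, not data from the text.

```python
# Sketch contrasting formative (stepwise, interim) and summative
# (end-of-program) estimates. All numbers here are hypothetical.

def formative_checks(timepoint_means, criterion):
    """At each monitoring time point, flag whether the program meets a criterion."""
    return [m >= criterion for m in timepoint_means]

def summative_estimate(program_final, comparison_final):
    """End-of-program contrast between program and comparison group means."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(program_final) - mean(comparison_final)

# Formative: interim program means at three time points, checked against a
# (hypothetical) criterion of 60 -- i.e., criterion-based suggestions.
print(formative_checks([58, 63, 71], criterion=60))    # [False, True, True]

# Summative: final outcome contrast between program and comparison groups.
print(summative_estimate([72, 70, 74], [61, 63, 59]))  # 11.0
```

The formative check feeds back into the program while it is still running; the summative contrast is computed only once the program is complete.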
The appropriate rigor necessary in all sciences includes the stringent criteria that govern qualitative evaluation. Indeed, qualitative methods of inquiry and of evaluation range across many different academic disciplines. Qualitative research is a broad methodological approach that encompasses many research methods that may vary substantially across disciplinary specialties.

To quantify and to analyze qualitative information, it might be necessary to proceed along the following four principal steps:

• Categorization and sorting of the information on the basis of certain criteria of hierarchy for thematic analyses
• Recognition of recurrence of the themes under study

9.4 Formative vs. Summative Evaluations

… assessment procedures and qualitative feedbacks (rather than quantitative scores) aimed at modifying and improving a given set of activities, monitoring outcomes, and establishing accountability. Formative evaluation may seek:

Moreover, it is noteworthy that whereas summative evaluation yields information that can yield either norm-based or criterion-based conclusions, formative evaluation can only produce, by design, criterion-based suggestions.
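The first two steps listed above—categorization under hierarchical criteria, then counting the recurrence of themes—can be sketched with Python's standard library. The theme keywords and interview excerpts below are hypothetical illustrations, not data from the text.

```python
# Sketch of quantifying qualitative information: categorize text excerpts
# under themes, then count theme recurrence across mentions.
from collections import Counter

THEMES = {  # hierarchy criterion: keyword -> theme label (hypothetical)
    "wait": "access to care",
    "cost": "affordability",
    "nurse": "staff interaction",
    "explained": "staff interaction",
}

excerpts = [  # hypothetical interview excerpts
    "the wait was long but the nurse explained everything",
    "cost of the visit was a concern",
    "the nurse was helpful despite the wait",
]

# Step 1 (categorization/sorting): match each excerpt against the keywords.
# Step 2 (recurrence): tally how often each theme recurs across mentions.
recurrence = Counter(
    theme
    for text in excerpts
    for keyword, theme in THEMES.items()
    if keyword in text
)

print(recurrence.most_common())
# [('staff interaction', 3), ('access to care', 2), ('affordability', 1)]
```

The resulting counts are exactly the kind of quantitative data that the subsequent thematic analysis can test statistically.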
… social change and improved effectiveness. Broadly speaking, PARE draws on a wide range of influences and key initiatives such as the Participatory Research Network (1979), which was created to foster an interdisciplinary development drawing its theoretical strength from adult education, sociology, political economy, community psychology, community development, feminist studies, critical psychology, organizational development, and the like. Today, the PARE movement has evolved strategies to democratize and disseminate knowledge—such as in the context of translational healthcare, knowledge, and dissemination of the best evidence base, BEB—thus contributing to the development of better-informed communities founded on sustainable livelihoods, education, public health, and productive civic engagement.

In brief, it is safe to say that the contemporary conceptualization of participatory action research and evaluation (PARE) reflects a fragile but growing intertwined unity between reality and perceptions based on ethnic, cultural, and popular traditions, as well as a range of ideologies and a variety of socio-politico-organizational contexts that together impact the well-being of individuals and of communities. PARE is still, relatively speaking, in its infancy, particularly in the context of translational healthcare. Nonetheless, PARE is recognized by most as the avenue of the future for the purpose of engaging stakeholders and increasing their health literacy with BEB, the product of comparative effectiveness research.

9.5.2 Sustainable Communities: Stakeholder Engagement

PARE proffers an important contribution to intervention and self-transformation within groups and communities, particularly, as noted, in the context of raising awareness, engagement, and health literacy among the stakeholders in translational healthcare. It contributes to increased factual knowledge, understanding, discernment, and informed problem-solving and participation in decision-making for treatment intervention. It favors, in other words, active involvement by patients, caregivers, and all stakeholders in patient-centered, effectiveness-focused, and evidence-based healthcare.

9.5.3 Ethical Recommendations

Norms of ethical conduct to guide the relationship between investigators and participants are sine qua non of effective and efficacious PARE paradigms. Informed consent; stringent adherence to HIPAA regulations; full disclosure of potential physiological, psychological, and sociological outcomes of interventions; and unbiased consideration of benefits, risks, and costs are essential to ensure that evaluation protocols in translational healthcare, and in particular PARE, focus on patient welfare, privacy, confidentiality, equal treatment and equipoise, and appropriate inclusion free of conflicts of interest.

Furthermore, research and evaluation collaborators must protect themselves and each other against potential risks, by mitigating the potential negative consequences of their collaborative work and pursuing the welfare of the patients first, and of all parties of stakeholders concerned. Commitment to ethics must not exclude concerns for social justice and welfare, such as critical struggles of certain patient groups (e.g., the disabled) in existing social structures and their struggle against the policies and interests of individuals, groups, and institutions.

In conclusion, norms of ethical conduct in healthcare are not fixed and immutable. Rather, they must be revised and updated as society changes and evolves. The science of evaluation in general and PARE in particular play a central role in this process of modernization, we might say, of ethical norms for translational healthcare.

9.6 Self-Study: Practice Problems

1. What is evaluation in translational healthcare?
2. Which of the legs from the traditional three-legged stool of the research process is evaluation most like?
3. What is the purpose of formative evaluation? Summative evaluation?
4. True or False: Formative evaluation takes advantage of both quantitative and qualitative data to establish the effect of an outcome.
5. What type of study design and statistical test might be used in formative evaluation of a health intervention program?
6. Explain the relationship between qualitative and quantitative methods in evaluation.
7. Which type of evaluation method generates, at best, hypothetical statements as opposed to conclusive data?
8. Is it possible to quantify information obtained from qualitative evaluation? If so, how?
9. An investigator at your school's local research laboratory claims that she only works with quantitative data because it is better than qualitative data. Based on your knowledge, what would you tell her?
10. What is PARE and why is it important in translational healthcare?

Recommended Reading

Bloom BS, Hasting T, Madaus G. Handbook of formative and summative evaluation of student learning. New York: McGraw-Hill; 1971.
Bogdan R, Taylor S. Looking at the bright side: a positive approach to qualitative policy and evaluation research. Qual Sociol. 1997;13:193–2.
Chiappelli F. Fundamentals of evidence-based health care and translational science. Heidelberg: Springer; 2014.
Cochrane A. Effectiveness and efficiency: random reflections on health services. London: Nuffield Provincial Hospital Trust; 1972.
Donner A. A Bayesian approach to the interpretation of sub-group results in clinical trials. J Chronic Dis. 1992;34:429–35.
Donner A, Birkett N, Buck C. Randomisation by cluster: sample size requirements and analysis. Am J Epidemiol. 1981;114:906–14.
Dowie J. "Evidence-based," "cost-effective" and "preference-driven" medicine: decision analysis based medical decision making is the pre-requisite. J Health Serv Res Policy. 1996;1:104–13.
Gaventa J, Tandon R. Globalizing citizens: new dynamics of inclusion and exclusion. London: Zed; 2010.
Gray JAM, Haynes RB, Sackett DL, Cook DJ, Guyatt GH. Transferring evidence from health care research into medical practice. 3. Developing evidence-based clinical policy. Evid Based Med. 1997;2:36–9.
Gubrium JF, Holstein JA. The new language of qualitative method. New York: Oxford University Press; 2000.
Ham C, Hunter DJ, Robinson R. Evidence-based policymaking—research must inform health policy as well as medical care. BMJ. 1995;310:71–2.
Liddle J, Williamson M, Irwig L. Method for evaluating research and guidelines evidence. Sydney: NSW Health Department; 1999.
Madaus GF, Stufflebeam DL, Kellaghan T. Evaluation models: viewpoints on educational and human services evaluation. 2nd ed. Hingham: Kluwer Academic; 2000.
McIntyre A. Participatory action research. Thousand Oaks: Sage; 2009.
Muir Gray JA. Evidence-based health care: how to make health policy and management decisions. London: Churchill Livingstone; 1997.
Patton MQ. Utilization-focused evaluation. 3rd ed. London: Sage; 1996.
Racino J. Policy, program evaluation and research in disability: community support for all. London: Haworth Press; 1999.
Royse D, Thyer BA, Padgett DK, Logan TK. Program evaluation: an introduction. 4th ed. Belmont: Brooks-Cole; 2006.
Scriven M. The methodology of evaluation. In: Stake RE, editor. Curriculum evaluation. Chicago: Rand McNally; 1967.
Stufflebeam DL. The CIPP model for program evaluation. In: Madaus GF, Scriven M, Stufflebeam DL, editors. Evaluation models: viewpoints on educational and human services evaluation. Boston: Kluwer Nijhof; 1993.
10 New Frontiers in Comparative Effectiveness Research

Contents
10.1 Core Concepts 167
10.2 Conceptual Background 168
10.2.1 Introduction 168
10.2.2 Comparative Effectiveness Research in the Next Decades 170
10.2.3 Implications and Relevance for Sustained Evolution of Translational
Research and Translational Effectiveness 180
10.2.4 Self-Study: Practice Problems 182
Recommended Reading 183
to the patient. We look at the emerging inquisitive and inferential models, as well as the future of healthcare that is telehealth.

10.2 Conceptual Background

10.2.1 Introduction

In this book, we have endeavored to discuss certain of the fundamental concepts of biostatistics that appear most pertinent in our current times and that can be foreseen to be most relevant in the next decade. Biostatistics is the application of statistics to the wide range of topics in the psychobiological sciences in health and disease. Therefore, it encompasses research, clinical designs, and methodologies, in addition to the collection, organization, and analysis of data in psychobiology, as well as inferences about the implications of these findings for the health sciences in general and healthcare in particular.

Current trends in medicine, dentistry, nursing, and clinical psychology encourage new research in effectiveness-focused, patient-centered, and evidence-based clinical decision-making and practice. This perspective, which is barely a few decades old at best, still challenges the community of fundamental researchers and clinical providers to develop and validate new and improved tools for gathering, analyzing, and interpreting data aimed at improving patient care.

Therefore, this book examined the field of biostatistics from two primary viewpoints. Firstly, it was important to proffer a novel and clear discussion of the most common statistical concepts and tests that have been used in modern psychobiology research and treatment evaluation ever since the emergence of current frequentist biostatistics, as described by Pearson, Spearman, Fisher, Gossett, and several others. Secondly, it was timely and critical to contrast these views with Bayesian statistics, which is fast gaining greater acceptance than the frequentist models in today's psychobiological research and clinical domains. Moreover, it was unquestionably necessary to incorporate this discussion in the context of translational healthcare: the crossroad, as it were, of translational research, grounded in the molecular characterization of biological specimens, and of translational effectiveness, the concerted operationalization of effectiveness-focused, patient-centered, and evidence-based care.

Once all the best evidence is assessed, treatment is categorized as:

• Likely to be beneficial
• Likely to be harmful
• Evidence did not support either benefit or harm

A 2007 analysis of 1016 systematic reviews from all 50 Cochrane Collaboration review groups found that 44% of the reviews concluded that the intervention was likely to be beneficial, 7% concluded that the intervention was likely to be harmful, and 49% concluded that evidence did not support either benefit or harm. Ninety-six percent recommended further research. A 2001 review of 160 Cochrane systematic reviews (excluding complementary treatments) in the 1998 database revealed that, according to two standardized readers:

• 41.3% concluded positive or possibly positive effect
• 20% concluded evidence of no effect
• 8.1% concluded net harmful effects
• 21.3% of the reviews concluded insufficient evidence

A review of 145 alternative medicine Cochrane reviews using the 2004 database revealed that 38.4% concluded positive effect or possibly positive (12.4%) effect, 4.8% concluded no effect, 0.69% concluded harmful effect, and 56.6% concluded insufficient evidence.

10.2.1.1 Translational Effectiveness
It behooves us to focus and define a bit more clearly the breadth, constraints, limitations, and fallacies of translational effectiveness at this point, not because the science of translational research is fully circumscribed by our current knowledge but because translational effectiveness is relatively new—or at least newer than translational research, less clearly understood than translational research
to neophytes in health sciences research, and, by all accounts, the future of healthcare. In its broadest form, translational effectiveness is the application of the scientific method into healthcare decision-making and practice. Paraphrasing from the Agency for Healthcare Research and Quality (AHRQ), translational effectiveness entails the utilization and dissemination across all stakeholders of the best evidence base derived from comparative effectiveness research (CER) and systematic reviews for patient-centered care.

Translational effectiveness relies extensively on the biostatistical principles outlined in the chapters of this book. Translational effectiveness also empowers the development of novel and concerted biostatistical models, which borrow equally from the frequentist and the Bayesian viewpoints, to tackle new challenges in biostatistical inference. These emerging inquisitive and inferential models include but are not limited to second- and third-generation instruments to assess the quality of the evidence, individual patient research outcomes and analysis, individual patient data meta-analysis, stakeholder engagement quantification and analysis, and local, national, and international dissemination by such means as telehealth.

Whereas the term "translational effectiveness" was coined relatively recently, certain of its elements are rather well-rooted in the conceptualization of healthcare in the Western and the Eastern cultures. Its origin can be traced back to ancient dogmas of philosophy. The "art" of treating ailments and of bringing the patient back to health, be it in the context of medicine, dentistry, or clinical psychology, is in effect the concerted approach to making critical informed decisions about individual patients (i.e., patient-centered), to ensure the best possible intervention (evidence-based), that will yield optimal benefit to the patient (effectiveness-focused).

Today, and in the decades ahead, translational effectiveness must continue to emphasize reexamining, revisiting, reviewing, and revising clinical practice guidelines by means of a systematic and peer-reviewed process to confirm the strength and validity, as well as the limitations and caveats…

… demands and in fact implies that healthcare education must undergo a stringent formative and summative evaluation process designed to anchor clinical decisions, guidelines, and policies to the fundamental principles of effectiveness, patient-centeredness, and evidence-based care.

10.2.1.2 Fundamentals of Comparative Effectiveness Research and Remaining Open Questions
To be clear, comparative effectiveness research is the systematic process by which a qualitative and a quantitative consensus of the best available evidence is obtained by a process of critical summative evaluation of the entire body of the pertinent available published research literature and a cogent interpretative synthesis of the findings thereof. It is obtained by means of a hypothesis-driven research design known as research synthesis, which yields the consensus of the best available evidence in response to a population, intervention, comparator, outcome, timeline, setting (PICOTS) question typically initiated at the initial patient-clinician encounter.

The systematic review, the scientific report of the process of comparative effectiveness research, describes the methodology employed for obtaining, quantifying, analyzing, and reporting the consensus of the best evidence base for the patient's clinical treatment. It is not unusual that the systematic review's arduous biostatistics-laden report needs to be translated into a language that is clinician-friendly: that is, to rewrite the core of the systematic review process and analytical inferences in a form that emphasizes the utility (cf., utility-based clinical decision models, such as the Markov decision tree) and logic of the derived consensus in the evidence-based clinical decision process (cf., logic-based models of clinical decision-making). These translations of systematic reviews into clinician-friendly summaries are often referred to as critical reviews, although they may be found under other rubrics. There is little consensus among specialists in the field about either how these translational summaries must be obtained—particularly with respect of
of new and established clinical methods, materi- the biostatistics reported in the original systematic
als, and interventions. Translational effectiveness reviews—or about how to name them, how to
170 10 New Frontiers in Comparative Effectiveness Research
report them in the scientific clinical literature, or how to disseminate them, for that matter, to all clinicians who might need the best evidence base for patient-centered treatment intervention.

That particular problem has a further dimension that is as important. Namely, the very principle of translational effectiveness, as noted above, requires not only the pursuit of the best evidence base with a focus on effectiveness; it also requires a patient-centered approach to clinical decision and intervention. Patient-centeredness implies patient participation in all phases of clinical decisions, and patient participation demands patient education. That very point opens a Pandora’s box of several complex issues, not the least of which entails health literacy: how do we best assess health literacy across socio-economic, ethno-cultural, and linguistic barriers, how do we raise the health literacy of our patients, how do we ensure that they retain the information provided—and if this information is about the best evidence base, as we presume here that it would largely be—then how do we translate systematic reviews and critical reviews into lay language summaries while preserving the stringency of the statement and its scientific foundations? Last but certainly not least, we recognize that patients often involve caregivers, family members, religious advisors, friends, and other stakeholders, to various extents, as guides and sounding boards in their decision-making. Therefore, it is timely and critical to develop new and improved means of characterizing the nature and commitment of stakeholders, their level of engagement and persistence of engagement, as well as similar health literacy issues as those noted above. The “umbrella” problem may perhaps be stated as follows: how do we disseminate the consensus of the best evidence base for effectiveness among all interested parties to ensure patient-centered care?

10.2.2 Comparative Effectiveness Research in the Next Decades

10.2.2.1 Methodological Issues
There is a fundamental difference between “what” we do and “how” we do it. There is a fundamental difference between what car we drive—Bentley, Lexus vs. FIAT, Ford—and how well the car runs. A Ferrari whose engine is not running well will be a much worse means of transportation than even the oldest Chevrolet with a tuned-up engine. In other words, it is not so much the type of car that will reliably allow us a safe trip, as its mechanical quality. To exactly the same extent, it is not so much the type of research and clinical study that will contribute to the best evidence base, as it is the quality and stringency of the research methodology (viz., sampling protocol, validity, and reliability of measurements) and of the biostatistical analysis and inferences.

The type of research study refers to the research design: to say it in the jargon of comparative effectiveness research and systematic reviews, the level of the evidence—namely, clinical trials, cohort observational study, etc. The point here is that the level of the evidence—the type of design—is quite a different and distinct concept from the quality of the evidence, the stringency of the research methodology and data analysis. It is unfortunate that, in the past, the field has used the two conceptual frameworks interchangeably, using the words “quality” and “level” of the evidence to describe the same thing. In recent years, the distinction has been made increasingly, and concerted effort must be sustained to distinguish the level of evidence from the quality of the evidence in the next decades.

Levels of evidence are determined based on the research design: from meta-analyses and systematic reviews of triple-blind randomized clinical trials, with concealment of allocation and no attrition at the top end, to observational design, bench and animal research, and published opinions and editorials. At each level, levels of evidence consider factors such as internal vs. external validity, statistical significance vs. clinical relevance, intention to treat (ITT) analyses if applicable, number needed to treat (NNT) and disease severity-corrected NNT values, and prevented (or preventable) fraction (PF), and related information that arises from the performance of the research design.

10.2 Conceptual Background 171

Case in point, in healthcare research, the goal is to be able to make some general conclusions applicable to a wider set of patients, and, in general, there is little interest in the particular group of patients under study. The process that permits us to make such general statements, grounded on a systematic analysis of the information (which in research we call data), is the process of inference. In providing healthcare, the goal is to make specific conclusions based on a given patient under study by the process of clinical diagnosis.

There is a fundamental difference between statistical significance (see Chap. 5, Sect. 5.3 on significance), which can be said to be based on and derived from group data and which serves to draw conclusions about the population, and clinical significance, which, while it may be derived from observations obtained on a group of patients, seeks to draw conclusions beneficial for each individual patient. Whereas statistical significance rests on the notion of the sample size needed to attain statistical significance, clinical relevance rests upon the concept of the minimum number of patients needed to treat to obtain the desired outcome or to avoid the undesired side effect.

The concept of number needed to treat (NNT) is central to the process of comparative effectiveness research and translational effectiveness, simply because a critical determinant of the decision-making process for clinical intervention rests upon defining the minimal number of patients that must be treated to prevent—biostatistically speaking, of course—one additional bad outcome or to attain the benefit sought. The computation of NNT serves as a guide in the clinician’s decision-making process with respect to whether a given intervention ought to be applied, and of how few patients need to be treated to prevent (i.e., risks), or to obtain (i.e., benefits), a given event. Intuitively, one of the important uses of NNT is to provide a quantitative guide for the assessment of cost-benefit analysis.

Research data are often expressed as a ratio of the measured outcome, that is to say, the “event” divided by the “nonevent”; the nonevent corresponds to the absence of the event or to the event whose magnitude falls below the measurable capability of the instrument used, including background noise (i.e., random error). In the case of research synthesis, the presentation of the published data as the ratio of nonevents to events is the odds ratio (OR). For example, in the case of oral cancer and using smoking as the intervening factor, data may show that the event rate for oral carcinoma in smokers is 1%. Data may also show that its nonevent rate, that is, the rate for oral carcinoma for nonsmokers, is 99%: the odds of smoking as an intervening factor for the disease in question will be computed as the ratio 99:1.

Odds ratios are most common in primary clinical research, including observational designs (e.g., case-control and cohort studies). Data obtained from a variety of clinical investigations can be transformed into ORs to produce results in the form of expected event rates (i.e., patient expected event rate, PEER). When PEER is combined with the estimation of risk (OR), that is, the probability of a given situation (e.g., oral carcinoma) to occur (i.e., in smokers) or not to occur (i.e., in nonsmokers), then NNT is computed as:

NNT = (1 − (PEER × [1 − OR])) / ((1 − PEER) × PEER × (1 − OR))

A treatment intervention produces a sizeable event in the experimental group (i.e., experimental event rate, EER), and a control event rate (CER) is obtained in the control arm of the study, where the placebo intervention was administered. The following two-by-two table can be constructed:

             Control    Experimental
Event        (A)        (B)
Nonevent     (C)        (D)

Control event rate (CER) = A / (A + C)
Experimental event rate (EER) = B / (B + D)
Relative risk reduction (RRR) = (CER − EER) / CER
Absolute risk reduction (ARR) = CER − EER
Number needed to treat (NNT) = 1 / ARR (NNT is always rounded up.)
The 95% confidence interval of the NNT is obtained as the reciprocal of the 95% confidence interval of the ARR:

CI95 NNT = 1 / CI95 ARR
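The event-rate quantities defined above translate directly into code. A minimal Python sketch of the computations (the function names and the worked example counts are illustrative assumptions, not from the text):

```python
from math import ceil

def risk_metrics(a, b, c, d):
    """Event-rate metrics from the two-by-two table above:
    a, b = events in the control and experimental arms;
    c, d = nonevents in the control and experimental arms."""
    cer = a / (a + c)            # control event rate
    eer = b / (b + d)            # experimental event rate
    rrr = (cer - eer) / cer      # relative risk reduction
    arr = cer - eer              # absolute risk reduction
    nnt = ceil(1 / arr)          # number needed to treat, always rounded up
    return cer, eer, rrr, arr, nnt

def nnt_from_peer_or(peer, odds_ratio):
    """NNT from the patient expected event rate (PEER) and the odds
    ratio, per the formula in the text; rounded up as usual."""
    num = 1 - peer * (1 - odds_ratio)
    den = (1 - peer) * peer * (1 - odds_ratio)
    return ceil(num / den)

def nnt_ci95(arr_low, arr_high):
    """95% CI of the NNT as the reciprocal of the 95% CI of the ARR."""
    return ceil(1 / arr_high), ceil(1 / arr_low)

# Hypothetical trial: 20/100 events in the control arm, 10/100 in the
# experimental arm, so CER = 0.20, EER = 0.10, RRR = 0.50, ARR = 0.10.
print(risk_metrics(20, 10, 80, 90))
```

With these assumed counts the ARR is 0.10, so ten patients must be treated to prevent one additional bad outcome; a wide ARR confidence interval of, say, 0.05 to 0.15 inverts to an NNT interval of 7 to 20 patients.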
vention. Dramatic results in uncontrolled trials might also be regarded as this type of evidence.
• Level III: Opinions of respected authorities, based on clinical experience, descriptive studies, or reports of expert committees.

In addition, the same US Preventive Services Task Force qualifies the level of evidence as:

• Level A: Good scientific evidence suggests that the benefits of the clinical service substantially outweigh the potential risks. Clinicians should discuss the service with eligible patients.
• Level B: At least fair scientific evidence suggests that the benefits of the clinical service outweigh the potential risks. Clinicians should discuss the service with eligible patients.
• Level C: At least fair scientific evidence suggests that there are benefits provided by the clinical service, but the balance between benefits and risks is too close for making general recommendations. Clinicians need not offer it unless there are individual considerations.
• Level D: At least fair scientific evidence suggests that the risks of the clinical service outweigh potential benefits. Clinicians should not routinely offer the service to asymptomatic patients.
• Level F: Scientific evidence is lacking, of poor quality, or conflicting, such that the risk versus benefit balance cannot be assessed. Clinicians should help patients understand the uncertainty surrounding the clinical service.

A system was developed by the GRADE (short for the Grading of Recommendations Assessment, Development, and Evaluation) working group to take into account more dimensions than just the quality of medical research. It requires users of GRADE to use these criteria to develop a tool to assess the quality (read: level) of evidence. The GRADE checklist evaluates the impact of certain factors, which research methodologists would call intervening or confounding variables, on the confidence in the results—that is, the stringency of the findings. A grading system has been developed for that purpose and is widely used in the field. However, research methodologists consider it fallacious because it has not been validated psychometrically for validity and reliability. The GRADE evaluation system produces a numerical value, which purports to quantify the confidence that the observed effect is close to the true effect, but that value is completely and absolutely devoid of statistical grounds and foundation. The confidence value generated by GRADE is purely judgmental, and therefore biased, and is not derived from the traditional statistically based computation of the confidence interval. Moreover, the GRADE working group defines “quality of evidence” (read: level of evidence) and “strength of recommendations” (read: confidence in the clinical outcomes) as two interdependent yet distinct concepts; but in actuality, these two concepts are commonly—and erroneously—used interchangeably and confused with each other.

GRADE goes a step further and proposes the following inference:

• High-quality evidence: The authors are very confident that the estimate that is presented lies very close to the true value. One could interpret it as: there is very low probability of further research completely changing the presented conclusions.
• Moderate-quality evidence: The authors are confident that the presented estimate lies close to the true value, but it is also possible that it may be substantially different. One could also interpret it as: further research may completely change the conclusions.
• Low-quality evidence: The authors are not confident in the effect estimate, and the true value may be substantially different. One could interpret it as: further research is likely to change the presented conclusions completely.
• Very low-quality evidence: The authors do not have any confidence in the estimate, and it is likely that the true value is substantially different from it. One could interpret it as: new research will most probably change the presented conclusions completely.
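Checklist-style quality judgments like these become amenable to psychometric scrutiny once items are scored on a rating scale rather than ticked yes/no: item ratings can be tallied into a total score whose internal consistency can actually be tested. A minimal Python sketch (the ratings matrix and function names are hypothetical illustrations, not part of any published instrument):

```python
from statistics import pvariance

def total_scores(ratings):
    """One total quality score per appraised study; `ratings` holds one
    row of item-level ratings (e.g., 0-4 points) per study."""
    return [sum(row) for row in ratings]

def cronbach_alpha(ratings):
    """Internal consistency (Cronbach's alpha) of the items across the
    appraised studies."""
    k = len(ratings[0])                  # number of items
    columns = list(zip(*ratings))        # item-wise view of the matrix
    item_variance = sum(pvariance(col) for col in columns)
    total_variance = pvariance(total_scores(ratings))
    return (k / (k - 1)) * (1 - item_variance / total_variance)

# Hypothetical item ratings for four studies on a three-item instrument.
ratings = [[4, 4, 3], [2, 2, 2], [3, 3, 4], [1, 1, 1]]
print(total_scores(ratings), round(cronbach_alpha(ratings), 2))
```

With these made-up ratings the total scores are 11, 6, 10, and 3, and alpha comes out near 0.95, i.e., the items rate the studies consistently; a checklist of yes/no ticks would not support this kind of reliability analysis.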
The Appraisal of Guidelines for Research and Evaluation Enterprise (AGREE) working group has also produced an instrument, which is designed to evaluate the process of practice guideline development and the quality (read: level) of reporting. The original AGREE instrument has recently been updated and methodologically refined, but the principal deficiencies and fallacies noted for the GRADE above remain in the AGREE-II assessment tool of practice guidelines. Some efforts have been made to validate this instrument psychometrically, such that claims are common that “the AGREE-II is both valid and reliable,” but exception can be taken with that assertion from a research methodology standpoint. Nonetheless, the AGREE-II is a considerable improvement over the GRADE checklist, if anything for its greater breadth and depth. AGREE-II consists of 23 items organized into six domains of evidence quality (read: evidence level).

The quality of the evidence is often evaluated as the risk of bias, which both the Cochrane group and AHRQ independently conceptualized. The best evidence base must be derived from studies with low risk of bias. The proposition has been brought forward that clinical trials always have, by definition, lower risk of bias than observational studies, although this thesis has been proven by research methodologists to be a fallacy. The risk of bias assessment tool consists of:

• Risk of bias: a judgment made based on the chance that bias in included studies has influenced the estimate of effect.
• Imprecision: a judgment made based on the chance that the observed estimate of effect could change completely.
• Indirectness: a judgment made based on the differences in characteristics of how the study was conducted and how the results are actually going to be applied.
• Inconsistency: a judgment made based on the variability of results across the included studies.
• Publication bias: a judgment made based on the question whether all the research evidence has been taken into account.

In addition, three domains modulate these assessments:

• Large effect: This is when methodologically strong studies show that the observed effect is so large that the probability of it changing completely is less likely.
• Plausible confounding would change the effect: This is when, despite the presence of a possible confounding factor which is expected to reduce the observed effect, the effect estimate still shows a significant effect.
• Dose response gradient: This is when the intervention used becomes more effective with increasing dose. This suggests that a further increase will likely bring about more effect.

To be sure, the field is endowed with many more examples of instruments designed to grade the level of the evidence (e.g., AGREE) and the quality of the evidence (e.g., AMSTAR, QUOROM, PRISMA). However, generally speaking, most if not all of these were originally conceptualized as checklists, which limits their psychometric validation. Concerted efforts have been deployed to restructure some of these instruments so as to generate a rating scale, rather than a yes/no answer. These revisions and expansions enrich the original instruments by:

1. Generating a total final score of evidence quality. Based on this score, the acceptable sampling statistical reasoning can be applied such that only the highest scoring literature can be included in the process of generating the consensus of the best evidence base.
2. The semi-continuous scores thus obtained permit psychometric analysis of test reliability (i.e., test-retest, inter-rater, internal consistency, coefficient of agreement) and validity (i.e., criterion, content, construct).

In this very fashion, the stringency of the assessment of the quality of the evidence is improved when using the expanded version of GRADE (Ex-GRADE), the revised version of
AMSTAR (rAMSTAR), or the extended version of the risk of bias instrument. Consequently, concerted effort in the field is directed at significantly improving the reliability and the validity of the assessment of the quality of evidence simply by revising and expanding existing instruments or by developing tools for that purpose anew (Wong).

Another aspect of the methodology of systematic reviews that deserves consideration for improvement, to increase the stringency of research synthesis, is the process of sampling. Sampling is a fundamental consideration in biostatistics, as we have noted in a preceding chapter (see Chap. 3, Sect. 3.2.1 on sampling methods). In brief, we stated that sampling can be defined as a sequential collection of random variables, both independent and identically distributed, or at least having potentially the same probability distribution as the others, with all mutually independent. To test how realistic these assumptions of random sampling actually are on a given data set, autocorrelation statistics, which detect the presence of periodic non-randomness, can be computed, lag plots drawn, or a turning point test performed. The generalized assumption of exchangeable randomness is, however, as we emphasized above, most often sufficient and more easily met.

The same consideration about random sampling, which applies to experimental design, is also pertinent to the research synthesis design that is used in comparative effectiveness research: that is, to the process by which the available literature pertinent to the PICOTS question is identified and accessed. To identify gaps in knowledge, the PICOTS is refined by means of an analytical framework to generate specific key questions that address certain intervening/confounding variables. Knowledge gaps are commonly derived from GRADE or related assessments, based on the criteria of:

(a) Insufficient or imprecise information
(b) Biased information
(c) Inconsistency or unknown consistency
(d) Not the right information (wrong population or wrong outcome)

Consensus is then sought to prioritize the knowledge gaps by ranking and by Likert scale. If the number of identified knowledge gaps is large, then multiple rounds of prioritization (i.e., >2) and ranking will be run, to ensure replicable cross-validation. Additional domains ought to include plausible confounding that decreases the observed effect and large magnitudes of effect. The transparency of data sharing is necessary to ensure that the product of the systematic reviews proposed here is useful to a broad range of potential audiences. Deficiencies in the strength of the evidence grade can of course impact both the systematic reviews sub-aim and the gaps in knowledge sub-aim.

Therefore, the purpose of the analytical framework is to crystallize criteria of effectiveness that are as sharp as practically feasible based on the PICOTS and to prioritize the research gaps thus identified. At this stage, more often than not, engagement, participation, and involvement of the stakeholders in formulating PICOTS, finalizing the analytical framework, and stating the relevant key questions can be assessed by the psychometrically validated participatory evaluation measurement instrument (PEMI) or other stakeholder engagement scales.

The sample of primary research is obtained from MEDLINE, PsycINFO, EMBASE, PsycARTICLES, Scopus, CINAHL, AMED, or another database. The sample for existing systematic reviews usually comes from MEDLINE, the Cochrane Library, Bandolier, or any other library of systematic reviews and meta-analyses.

The MEDLINE search strategy is developed and validated using PubMed medical subject headings (MeSH) and keywords taken from the PICOTS statement and related key questions. The strategy is then replicated with the other electronic databases. Translators are used as needed, unless the search is limited to the English language only. The clinicaltrials.gov registration database is routinely reviewed to identify trials completed 3 or more years earlier that prespecified our outcomes of interest but did not publish all of the outcomes. The original authors can be contacted as needed.
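The randomness checks named earlier in this section (autocorrelation statistics and the turning point test) are simple to compute on a series of observations. A minimal Python sketch with illustrative function names; the turning point statistic uses the standard null moments E[T] = 2(n − 2)/3 and Var[T] = (16n − 29)/90:

```python
from math import sqrt

def lag1_autocorrelation(x):
    """Lag-1 serial correlation; values near 0 are consistent with
    independent (random) sampling, values near +/-1 are not."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

def turning_point_z(x):
    """Turning point test: z-score of the count of local peaks and
    troughs against the count expected under randomness."""
    n = len(x)
    t = sum(1 for i in range(1, n - 1)
            if (x[i - 1] < x[i] > x[i + 1]) or (x[i - 1] > x[i] < x[i + 1]))
    expected = 2 * (n - 2) / 3
    variance = (16 * n - 29) / 90
    return (t - expected) / sqrt(variance)

# A trending series has too few turning points (large negative z);
# a strictly alternating series has too many (large positive z).
trend = list(range(20))
alternating = [0, 1] * 10
print(turning_point_z(trend), turning_point_z(alternating))
```

A |z| well beyond about 2 flags the kind of periodic or trending non-randomness the text warns about; the lag plot mentioned alongside is simply a scatter of x[i] against x[i + 1].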
To ensure inclusion of individual reports, two trained investigators independently screen titles and abstracts of the list of references for pertinence to the stated PICOTS statement and identified key questions. A second round of review by two additional independent reviewers examines the full-text article. Differences regarding article inclusion are resolved through consensus. Systematic review software (DistillerSR, 2010; Evidence Partners) can serve to manage the screening process and information extraction on measures of intervention fidelity. Funnel plots serve to estimate publication bias.

As stringent and rigorous as the sampling process is, which requires searching multiple appropriate databases and eliminating duplicates and reports that only approximate the PICOTS question, it generates a bibliome—that is, the collection of published papers that most adheres to the stated PICOTS statement and identified key questions—that can suffer from selection bias and accentuate publication bias, including:

• Language bias
• Study design bias
• Time of publication bias
• Investigator bias (e.g., the same group of investigators publishing multiple reports pertinent to PICOTS and thus being included in the bibliome)

More often than not, the funnel plot analysis proves too soft to alert the investigators adequately of emerging publication bias. Concerted methodological efforts must be deployed in the coming decade to characterize new and improved means of obtaining the bibliome and ensuring that it is free of bias. Novel biostatistical approaches must be developed to test for publication bias in a manner that is both more reliable and more stringent than the present funnel plot analysis.

10.2.2.2 New Frontiers in Dissemination
Patient-centered care also implies that novel telehealth information and communication technologies must be developed and standardized across healthcare specialties to distribute the best evidence base to clinicians, patients, caregivers, and other stakeholders in real time.

The healthcare provider (i.e., dentist, physician, nurse) can perform telecare in the form of teleconsultation and telediagnosis by using these electronic applications. That is to say, telecare enables patients who live in rural areas or far away from healthcare services to receive the best available treatment and care in a cost-effective modality. It ensures that healthcare providers connect with and treat patients in need with the proper communication technology in place.

With improving technology, telecare has much potential as a healthcare service, particularly in situations such as complex dental interventions (e.g., 1-day crowns, immediate loading or delayed loading dental implants, mini-implants, inlays and onlays, etc.). By substantially reducing the cost of healthcare delivery and increasing instant access to providers without the need to travel, telecare technologies improve the quality of dental care, and of healthcare in general, given to patients in inaccessible communities, and raise patient and healthcare provider satisfaction.

Implementation of telecare communication technologies for mentally handicapped patients, elderly and disabled patients, and other special populations is particularly important, because telecare can be optimized by using an electronic application across the five domains listed above. The same benefits of telecare can also be obtained with other difficult patient groups, such as patients with high levels of dental anxiety and dental phobia, and homeless and destitute patients who live in poverty-stricken environments and have access—at best—to dilapidated healthcare structures with intrinsic limits on patient access to clinical services. In these extreme situations, telecare can vastly improve the well-being of dental patients in need of simple restorative dentistry or more complex and involved endodontic, periodontal, or prosthodontic treatment intervention.

Patients who are afflicted with serious infectious diseases, such as HIV/AIDS, Ebola, Zika, and other communicable diseases, are
oftentimes quarantined due to the infectious nature of the disease. These patients can be diagnosed and treated for dental problems and oral pathologies by means of telecare, even if dentists, physicians, and nurses are ordered to stay a safe distance away from those infected to prevent transmission of the virus from other vectors, while providing diagnoses and treatment assistance via electronic devices. That is in part the reason why teleconsultation, a low-cost and low-bandwidth exchange of information between health specialists and patients when specialists are not available, is among the most common types of telehealth service in developing countries.

Telehealth has shown great promise across a variety of health problems, and telecare is increasingly benefiting dental patients as well. But this will be obtained only if concerted research is sustained in this field, which must include the development of faster and more user-friendly technologies. Improved telecare technologies require, particularly in the field of dentistry, seamless interconnectedness among clinical professionals and direct access to patients in critical need.

In dentistry, and in other domains of healthcare, the need for cutting-edge, reliable, fast, and hack-free telecare is unquestionable. When implemented effectively, telecare will greatly increase the treatment and care for dental patients.

One aspect of telecare that is fast emerging with increasing relevance to situations of complex dental interventions, or to some of the more difficult patient populations briefly outlined above, is that it must ensure individualized, patient-centered care. Consequently, one important development in translational effectiveness that must go forth hand in hand with new developments in telecare requires the validation of new research tools and protocols to analyze and interpret individual patient data.

The term individual patient data refers to the availability of raw data for each study participant in each included trial, as opposed to aggregate data (summary data for the comparison groups in each study). Reviews using individual patient data require collaboration of the investigators who conducted the original trials, who must provide the necessary data. From a methodological standpoint, the domain of individual patient data gathering, analysis, and inference needs to specify the particulars of the individual patient data outcomes under study—viz., individual patient data outcomes research. This requires a cogent characterization of the variables to measure, the analyses to plan, and the type of data (i.e., qualitative vs. quantitative; categorical vs. continuous) to gather. Thence will derive the type of analyses—usually longitudinal repeated measures analyses—that will be most appropriate and informative.

In brief, three principal programs of telecare ought to be developed in the decade to come:

• Electronic Data Methods Forum: A program that is presently in its second phase. It has established preliminary interconnections and communications to a variety of electronic data infrastructures and is now in the process of expanding the breadth and depth of these interactions. To achieve its goal, the Electronic Data Methods Forum conducts comparative effectiveness research on a wide spectrum of patient-centered research outcomes, including quality of life assessment and targeted improvement, and fosters the new and improved utilization of a wide spectrum of health information technologies to support routine clinical care.
• Bringing evidence to stakeholders for translation to primary care: This is a concerted effort, initiated and supported by AHRQ-generated intramural and extramural funding programs, to ensure and expand dissemination of programs and the best evidence base, evidence-based revisions of clinical practice guidelines, and reports and information about professional and patient–stakeholder networking, to patients and providers in primary care settings in the United States and worldwide.
• Disseminating Patient-Centered Outcomes Research to Improve Healthcare Delivery Systems: A concerted effort to utilize existing networks of providers and other key stakeholders to disseminate, translate, and implement delivery system evidence.
one thing emerges as certain is that healthcare will be under siege by a vast spectrum of infectious diseases in the decades to come. This alarming situation is by no means blunted by the recent exacerbation of climate change, which brings along more cataclysmic hurricanes and flooding. Standing water slowly receding in warm tropical climates, such as Texas, Florida, South Asia, the Caribbean, and Central America, to cite only a few of the more recent flooding events, is a breeding ground for waterborne parasitic and infectious diseases, for mosquitoes that breed in standing water and carry viral infections, and for a vast array of non-hygienic conditions that impose a serious load on the immune system even of healthy young individuals, thus undermining their health.

Taking together current epidemiological trends with the new frontier of translational healthcare that we have outlined in the preceding chapters and in our preceding work, it becomes self-evident that the new serious threats to population health brought about by the emergence of new infectious threats and the re-emergence of older ones call for a worldwide concerted endeavor of comparative effectiveness research for infectious diseases (CERID) to establish and disseminate evidence-based best clinical practices for this specific type of health threat in the next decades.

One primary concern of CERID must also include the alarming trend of antimicrobial resistance, that is, the progressively weaker ability of commonly available antibiotics to counter infectious diseases. Antimicrobial resistance is on the rise, with millions of deaths every year. The World Health Organization (WHO) reported in 2014 that this serious threat is no longer a prediction for the future; it is happening right now in every region of the world and has the potential to affect anyone, of any age, in any country. Antibiotic resistance—when bacteria change so that antibiotics no longer work in people who need them to treat infections—is now a major threat to public health.

In other words, the world population is seriously at risk both of a sharp increase in causative agents of infectious diseases and of a progressive
dulling of the efficacy of the pharmaceutical interventions at our disposal to blunt the growth and proliferation of said agents. The purpose and call of CERID is to develop new and improved effectiveness-focused, patient-centered, and evidence-based countermeasures targeted against infectious diseases along these two converging fronts.

10.2.2.6 Creating and Disseminating New Knowledge in CERID

To be clear, incontrovertible evidence points to human activity as one major cause of the progressive warming of the planet's temperature, greenhouse gases, pollution, and other contributors to climate change. Together, these factors contribute to warming ocean waters, which then feed into larger, more menacing, forcefully destructive, and more frequent hurricanes and typhoons. This knowledge is now widespread, and only a handful of deplorably denying politicians do not accept this cumulative evidence and obstruct local, national, and international action to counter and to reverse these natural ecological trends. This is the realm of politics and social history.

Nonetheless, throughout history, politics and social history have played a timely and critical role in population health and epidemiology, from the scourges of antiquity to the Black Death that spread through Europe consequent, some say, to the Crusades and other internecine wars within Europe in the Middle Ages (e.g., the sanguine conflicts between the Guelfs and the Ghibellines), the Spanish flu following WWI, the testing of penicillin on diseased soldiers during WWII in the first clinical trial of its kind, and so on. The world population finds itself at a different juncture presently: one in which political systems across the planet must work jointly and constructively to block and reverse the fast-rising temperature of the planet, lest storms increase in frequency and strength, bringing with them disastrous floods and life-threatening waterborne infectious diseases.

As if this were not a sufficiently ominous threat, antibiotic resistance is a growing problem among humans, domesticated animals, and wildlife alike in terrestrial, aerial, and aquatic environments. This is due, in part at least, to the fact that farm animals, which constitute a large proportion of the human diet, are themselves fed antibiotics to ensure their health status, continued growth, and maximal weight until slaughter. These antibiotics, and their by-products, contaminate the meat products that enter the food chain, which we feed our developing children and youngsters. It is not surprising that they develop resistance to the antibiotics and antibiotic by-products found in animal meats. Similar health-endangering situations can be traced to the traces of by-products of fungicides and insecticides still found in vegetables and fruits even after exhaustive washing of the crops. When ingested, these by-products of fungicides and insecticides can contribute to an override of cellular immune surveillance events, which together signify increased vulnerability to microbial assault. Last, but not least, is the pollution of the water we drink—pollution by heavy metal products of refining industries, pollution by fungicide and insecticide washes, etc.—which progressively contributes to organ weakness and eventual failure (e.g., kidney, liver) and to altered physiological homeostasis.

Taken together, the knowledge and the evidence about the potential causes of our decreased ability to combat infectious diseases are widely known and accessible to all, particularly those living at or close to "hot spots," such as urban centers. The spread and contamination of the environment constitute a growing and serious public health problem, which physiologists might describe as a type II allostatic load² on the immune system, with its consequential irreparable fall to a state of immune compromise and immune deficiency.

There have been increasing public calls for global collective action to address the threat, including a proposal for an international treaty on antimicrobial resistance. Further detail and attention are still needed to recognize and measure trends in resistance at the national and the international level. Global tracking of infectious diseases may be a worthwhile endeavor, though expensive and complex to develop, validate, and implement. A pluripotent national, politics-free system of this nature could be designed in increasing stages of complexity, starting from the system that is operative presently and which provides real-time news information and images about weather patterns and cataclysmic storm destruction worldwide via satellites equipped with the appropriate software.

Based on that model, we might now conceptualize second-generation satellite software that will integrate population health data and evidence-based healthcare information into a worldwide health information technology network, a global telecare system, as it were. Consonant with the issues discussed in the previous paragraphs, we argue that concerted effort should focus initially on the establishment of a CERID/CIPER-focused dimension of global telecare, that is to say, a focus on comparative individual patient effectiveness research targeted on infectious diseases.

10.2.3 Implications and Relevance for Sustained Evolution of Translational Research and Translational Effectiveness

10.2.3.1 Toward Bayesian Biostatistics in Translational Research

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability of a hypothesis as more evidence or information becomes available and is added onto the prior. Bayesian inference is an important technique in biostatistics, and it will grow enormously in relevance in the next decades as the lines of research we have outlined in this chapter continue to expand.

The Bayesian approach to biostatistical inference is sometimes called biostatistical updating, because the inference is updated every time new information is added to previously gathered data, that is, the prior.

² Chiappelli F, Cajulis OS. Psychobiologic views on stress-related oral ulcers. Quintessence Int. 2004;35:223–7. PMID: 15119681.
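A minimal numerical sketch of this updating process, on an invented three-point grid of candidate response rates: each round of data turns the current prior into a posterior, and the Kullback–Leibler divergence between posterior and prior quantifies how far the accumulated evidence has moved our beliefs.

```python
import math

# A minimal sketch of Bayesian updating on a discrete grid of candidate
# hypotheses, followed by the Kullback-Leibler divergence between the
# posterior and the prior. The three candidate response rates are invented.

thetas = [0.2, 0.5, 0.8]        # candidate "success" probabilities
prior = [1 / 3, 1 / 3, 1 / 3]   # uniform prior belief

def update(belief, successes, failures):
    """Bayes' theorem: posterior is proportional to prior times likelihood."""
    unnorm = [b * th ** successes * (1 - th) ** failures
              for b, th in zip(belief, thetas)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats; 0 means identical."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Observe 7 successes and 3 failures, then fold in 2 further successes:
posterior1 = update(prior, 7, 3)
posterior2 = update(posterior1, 2, 0)  # yesterday's posterior is today's prior

divergence = kl(posterior2, prior)
print("posterior after 7/3:", [round(p, 3) for p in posterior1])
print("posterior after 9/3:", [round(p, 3) for p in posterior2])
print("D(posterior || prior) =", round(divergence, 3))
```

The second call to `update` illustrates the defining feature of the approach: the posterior from one round of evidence becomes the prior for the next.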
It is also referred to as Bayesian probability, as an alternative to frequentist probability-based inference. The aim of the Bayesian approach is not to estimate the proximity of sample observations to the population; rather, it is grounded in the principle that we do not, and cannot, know the population, and hence any attempt at estimating the probability that a sample belongs, or not, to the population is futile. Rather, Bayesian inference considers that all sample observations are valid information about the population, even those that are unexpected and thus might be considered erroneous. No observation is erroneous from the Bayesian perspective, and all observations act, as it were, as independent pieces of a puzzle. As new observations are obtained and added to the prior, in a manner similar to adding a new piece to the puzzle, a composite of the population emerges exactly as the composite image of the puzzle emerges.

To the same extent as there is relative entropy—disorder—among the pieces of the puzzle we have mixed before starting to compose it, so it is for the possible data and observations we may collect and add unto the prior in the pursuit of defining the population. In our puzzle example, we might say that the pieces are well mixed when there is a considerable degree of disorder among them. Scientists might call this disorder entropy. In the case of Bayesian inference about the population, we will call the elements that constitute our observations and our data the Bayesian factors. Bayesian factors are in a state of disorder, like the mixed pieces of the puzzle, and we can call this disorder entropy. In Bayesian statistics jargon, we call this state of entropy the Kullback–Leibler divergence.

In other words, building upon the priors with current observations is a process that depends in large part upon the relative entropy of the Bayesian factors, that is, their Kullback–Leibler divergence. The probability of a Bayesian inference depends upon the relative size of the Kullback–Leibler divergence of its factors.

In general, a Kullback–Leibler divergence of 0 indicates that we can expect similar, if not the same, behavior among the distributions of different Bayesian factors, whereas a larger Kullback–Leibler divergence (e.g., 1) indicates that the distributions behave in a dramatically different manner. In the Bayesian context, the Kullback–Leibler divergence analysis pertains to the behaviors of the prior distribution and the observed posterior distribution.

In Bayesian hierarchical modeling, multiple levels are proffered in a hierarchical structure that estimates the parameters of the posterior distribution using Bayesian inference. They are then integrated in a manner not dissimilar to the integration of the pieces of a puzzle. In this manner, relevant information regarding decision-making and updating beliefs cannot be ignored, because Bayesian hierarchical modeling has the potential to overrule classical methods in applications such as clinical decision-making based on the integration of continuous updates of the best evidence base through comparative effectiveness research findings reported in systematic reviews. The hierarchical form of Bayesian analysis and organization provides a promising new dimension for the analysis and evaluation of multiparameter clinical decision-making, elaborated to integrate stakeholders' views, patients' needs and wants, clinicians' expertise, and the evidence-based, effectiveness-focused, and patient-centered consensus of the best evidence base. In that light, Bayesian biostatistics is the only viable strategy in translational healthcare for the twenty-first century.

10.2.3.2 Biostatistics and Meta-Analysis in Systematic Reviews: Toward Individual Patient Data Meta-Analysis

Meta-analysis is the core of the quantitative biostatistical consensus of the best available evidence in comparative effectiveness research. Broadly speaking, it involves pooling quantitative evidence from related homogeneous studies—as determined by the funnel plot and the Cochran Q and/or I² statistics—to estimate the effect of an intervention, along with a confidence interval.

Traditionally, meta-analyses synthesize group data obtained from multiple studies. By contrast, individual participant-level data meta-analysis (IPD MA) utilizes the prespecified variables for each individual participant from multiple applicable studies and synthesizes those data across all studies to assess the impact of a clinical intervention in a more granular fashion.
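The pooling step and the heterogeneity checks mentioned here can be sketched in a few lines; the per-study effect sizes and sampling variances below are invented, and the fixed-effect inverse-variance model is used purely as the simplest illustration.

```python
import math

# A sketch of an aggregate-data meta-analysis with heterogeneity statistics,
# assuming invented per-study effect sizes and sampling variances.
effects = [0.42, 0.31, 0.55, 0.38]
variances = [0.02, 0.05, 0.04, 0.03]

# Fixed-effect (inverse-variance) pooled estimate.
weights = [1 / v for v in variances]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

# Cochran's Q: weighted squared deviations of study effects from the pool.
q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
df = len(effects) - 1

# I^2: share of total variation attributable to between-study heterogeneity.
i2 = max(0.0, (q - df) / q) if q > 0 else 0.0

# 95% confidence interval for the pooled effect.
se = math.sqrt(1 / sum(weights))
ci = (pooled - 1.96 * se, pooled + 1.96 * se)

print(f"pooled effect = {pooled:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"Cochran Q = {q:.3f} on {df} df, I^2 = {100 * i2:.1f}%")
```

When Q does not exceed its degrees of freedom, I² is truncated to 0%, which is the numerical counterpart of judging the studies homogeneous enough to pool.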
IPD MA, which is the preferred biostatistical approach for assessing quantitative consensus in the individual patient data outcomes research model discussed in a previous chapter (see Chap. 9), has several important potential advantages, including the ability to:

1. Standardize the analysis across studies.
2. Include more up-to-date information than was available at the time of each original trial's publication.
3. Incorporate results for previously missing or poorly reported patient-centered outcomes.
4. Help personalize clinical decisions by assessing differential treatment effects for specific subgroups. IPD MA can also allow for better ascertainment of the optimal dose, timing, and delivery method of a specific intervention that might have been previously tested in multiple, nonuniform ways.

One important new frontier in biostatistics will be to develop new and improved protocols to perform IPD MA. At this point, the protocol to develop, test, and validate such a novel and complex way of obtaining individual patient data consensus is thought to require a three-pronged participatory structure that might include:

1. The investigators who have performed trials meeting the inclusion criteria for a given IPD MA
2. A representative group of stakeholders (including select trial investigators, patient representatives, and biostatisticians) whose role is to collate and evaluate the protocols as they are being proposed
3. An IPD MA Research Center, a group of researchers with established expertise in the underlying methods and conduct of high-quality, rigorous IPD MA

These three entities all actively contribute to and participate throughout the development and conduct of an IPD MA, but each plays a different role.

10.2.3.3 Big Data Paradigm in Translational Healthcare

Big data refers to data sets that are so large and complex that traditional biostatistical application software is inadequate. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, and information privacy. In the context of the topics discussed in this book, and specifically in this chapter, big data can apply to the bibliome, as well as to individual patient data sets, individual patient data meta-analyses, and other collections of information that become part of the consensus of the best evidence base, stakeholder engagement, and the like.

The domain of big data extends beyond the simple observation that the data set under study is large. It extends to the use of predictive analytics, user behavior analytics, and a range of alternative advanced data analytic methods. Traditional relational database management systems and desktop biostatistics and visualization packages have difficulty handling big data, and the big data sets that are projected to become a common occurrence in translational research and translational effectiveness urgently demand the research, development, and quality-control evaluation of novel biostatistical software packages for the purpose. It is possible and even probable that these new approaches to biostatistics will increasingly rely on the Bayesian paradigm and the Kullback–Leibler divergence analysis we outlined above.

Big data analysis has been criticized as being relatively shallow at this point in its infancy. The same could be said of traditional inferential tests as biostatistics was becoming established as the modern science that it is today, back in the 1920s. As comparative effectiveness research continues to grow along the dimensions we have outlined here, its reliance on big data analysis will grow in parallel. That process is bound to drive big data analysis to grow in biostatistical stringency.

10.2.4 Self-Study: Practice Problems

1. Describe the two enterprises of translational healthcare. What relationship do they share?
2. What measuring instruments are needed in translational effectiveness? Why are they necessarily important?
3. What is meant by the best available evidence? How is it most commonly obtained?
4. Describe the process of comparative effectiveness research (CER). How is this different from comparative individual patient effectiveness research (CIPER)?
5. What is the difference between a systematic review and a meta-analysis?
6. What is meant by the level of evidence as compared to the quality of evidence? Then, name at least one instrument that measures each.
7. From the studies below, rank the evidence obtained from each from highest to lowest:
(a) Randomized, triple-blinded, placebo-controlled clinical trial
(b) Mixed model cohort study
(c) Systematic review research synthesis
(d) Cross-sectional study
8. What is a bibliome and what is it analogous to in traditional biostatistics?
9. What is the difference between a frequentist approach and a Bayesian approach to biostatistical inference? Explain why the future of biostatistics in translational research is headed toward the latter approach.
10. Where is the dissemination of information and communication within translational healthcare headed?

Recommended Reading

Baez J, Fritz T. A Bayesian characterization of relative entropy. Theory Appl Categ. 2014;29:421–56.
Bauer JB, Spackman SS, Chiappelli F. Evidence-based research and practice in the big data era (Chapter 17). In: Chiappelli F, editor. Comparative effectiveness research (CER): new methods, challenges and health implications. Hauppauge: NovaScience; 2015.
Bernardo J, Smith AFM. Bayesian theory. Hoboken: Wiley; 1994.
Chiappelli F. Fundamentals of evidence-based health care and translational science. New York: Springer; 2014.
Chiappelli F. Methods, fallacies and implications of comparative effectiveness research (CER) for healthcare in the 21st century (Chapter 1). In: Chiappelli F, editor. Comparative effectiveness research (CER): new methods, challenges and health implications. Hauppauge: NovaScience; 2016.
Cochrane AL. Effectiveness and efficiency: random reflections on health services. Nuffield Provincial Hospitals Trust; 1972.
El Dib RP, Atallah AN, Andriolo RB. Mapping the Cochrane evidence for decision making in health care. J Eval Clin Pract. 2007;13:689–92. PMID 17683315.
Ezzo J, Bausell B, Moerman DE, Berman B, Hadhazy V. Reviewing the reviews. How strong is the evidence? How clear are the conclusions? Int J Technol Assess Health Care. 2001;17(4):457–66. PMID 11758290.
Feinstein AR. Clinical judgement. Baltimore: Williams & Wilkins; 1967.
Gelman A, Carlin J, Stern H, Rubin D. Bayesian data analysis. London: Chapman & Hall; 1995.
Laxminarayan R, Duse A, Wattal C, Zaidi AK, Wertheim HF, Sumpradit N, Vlieghe E, Hara GL, Gould IM, Goossens H, Greko C, So AD, Bigdeli M, Tomson G, Woodhouse W, Ombaka E, Peralta AQ, Qamar FN, Mir F, Kariuki S, Bhutta ZA, Coates A, Bergstrom R, Wright GD, Brown ED, Cars O. Antibiotic resistance: the need for global solutions. Lancet Infect Dis. 2013;13(12):1057–98.
Leskovec J, Rajaraman A, Ullman JD. Mining of massive datasets. Cambridge: Cambridge University Press; 2014.
Murdoch TB, Detsky AS. The inevitable application of big data to health care. JAMA. 2013;309:1351–2.
Renganathan V. Overview of frequentist and Bayesian approach to survival analysis. Appl Med Informatics. 2016;38(1):25–38.
Vallverdu J. Bayesians versus frequentists: a philosophical debate on statistical reasoning. New York: Springer; 2016.
Appendices

Reprinted/adapted by permission from Springer Nature (Cumulative Distribution Function for the Standard Normal Random Variable by Stephen Kokoska, Christopher Nevison 1989)

Appendix D: Critical Values of F

Appendix G: Critical Values for Mann-Whitney U

Appendix H: Critical Values for the Chi-Square Distribution
Answers to End of Chapter Practice Problems
4. [Bar chart: frequency distribution (0 to 20) of responses to "Campus Health Office Satisfaction" across the categories Very Unsatisfied, Unsatisfied, Neither, Satisfied, and Very Satisfied.]
U1 = n1n2 + n1(n1 + 1)/2 − R1 = (7)(7) + 7(7 + 1)/2 − 48 = 29

U2 = n1n2 + n2(n2 + 1)/2 − R2 = (7)(7) + 7(7 + 1)/2 − 56 = 21

U = 21

At α = 0.05 and Ucrit = 8, UOBS > Ucrit, so we retain H0.

7. In the Geisser-Greenhouse correction, sphericity refers to an assumption that is analogous to the homogeneity of variances.
8. This type of test is highly vulnerable to a Type I error. It is a double-edged sword because decreasing its vulnerability to a Type I error, in turn, increases the probability of making a Type II error.
9.
Children who favor: Expected f = (596 × 494)/1121 = 262.64
Children who oppose: Expected f = (525 × 494)/1121 = 231.36
Adults who favor: Expected f = (596 × 627)/1121 = 333.36
Adults who oppose: Expected f = (525 × 627)/1121 = 293.64

χ² = (191 − 262.64)²/262.64 + (303 − 231.36)²/231.36 + (405 − 333.36)²/333.36 + (222 − 293.64)²/293.64 = 74.6, df = 1

At α = 0.05, reject H0 because p < 0.001. Based on these results, there seems to be a statistically significant difference within the specific sample that was measured in regard to age group and preference toward the type of medical practitioner that provides the diagnosis.

10. A logistic regression would be most commonly used in an observational, case-control study. This type of statistical technique would be useful for this specific study design because it can provide an odds ratio.

Chapter 8

1. The analysis of individual patient data has the ability to provide more reliable and more accurate information regarding the specific patient. Aggregate data analysis provides a generalized inference consensus regarding similar patients; however, it does not necessarily entail that it is applicable to each similar patient. The individual differences that individual patient data take into consideration make it much more advantageous than aggregate data from the perspective of an individual patient.
2. Complications include lack of access to unpublished trials, inconsistent data across trials, limited information in published reports, longer follow-up time, more complex outcomes, and higher monetary cost.
3. Whether primary, secondary, or key, stakeholders may include patients, patients' family members, caregivers, governmental figures, etc. Additionally, all those who fit the following description: "those groups without whose support, the organization would cease to exist." Stakeholder engagement in healthcare improves the relevance of research, increases transparency, and accelerates its adoption into practice.
4. Yes, with tools such as individual patient meta-analysis, information learned from the patient can be inferred onto that specific patient.
5. Individual patient inferences provide inferences about the specific patient, whereas aggregate patient data can only provide inferences regarding the general population of similar patients.
6. Analysis of individual patient data should ultimately culminate in patient-centered outcomes.
7. D.
8. PICOTS—population, intervention, comparator, outcome, timeline, and setting.
9. After individual patient data analysis in PCOR, the next necessary step must be patient-centered outcome effectiveness (PCOE). The evaluation of the outcomes of evidence-based healthcare is critical for understanding how the particular processes work and how the benefits can be maximized for stakeholders and the like.
10. The traditional model of repeated measures calls for a pre-post approach—where the effectiveness of an intervention is measured by comparing the posttest results to the pretest results. In the new model proposed, there is a post-then(before)-pre approach utilized. The advantage here is the control for response shift bias.

Chapter 9

1. In translational healthcare, evaluation refers to the systematic approach of determining the effectiveness and efficacy of research studies, investigations, and programs relative to the patients, stakeholders, and the like. Moreover, its overall purpose is to ascertain a certain degree of value or worth based on the objectives and results throughout the entire study.
2. Evaluation is a systematic acquisition and management of information, which includes the generation of feedback to specific stakeholders. Therefore, evaluation is most like (or includes) methodology and data analysis.
3. The purpose of formative evaluation is to measure the effectiveness of a program or research study as it is taking place in order to determine what can be improved within the course of the study. On the other hand, summative evaluation is concerned with the overall assessment of the study after it has been completed.
4. False—formative assessments are dependent on qualitative feedback.
5. Observational study design, with chi-square statistical tests.
6. Both quantitative and qualitative methods are essential and necessary in evaluation within translational healthcare. The methods complement each other in their own specific ways to provide an ultimately more robust result on the effectiveness and efficacy of that which they are concerned with.
7. Qualitative evaluation methods.
8. In order to quantify qualitative methods and information utilized in evaluation, one must (1) categorize and sort the relevant information on the basis of certain criteria; (2) recognize a recurrence of the themes under study; (3) conduct continuous, semicontinuous, or dichotomous assessment of recurrence; and (4) conduct statistical analysis of recurrence of the themes via traditional statistical techniques.
9. It is a fallacy to think that one is better than the other. Both quantitative and qualitative methods are distinctly beneficial in their own right—and, when utilized together, are able to complement each other to provide an all-encompassing basis of knowledge.
10. Participatory Action Research and Evaluation (PARE) refers to a formative and
summative approach utilized within community health action. It is a crucial concept within translational healthcare as it seeks to increase benefit effectiveness by understanding the experiences through the perspective of the actual patient or affected groups.

Chapter 10

1. Translational research (T1) and translational effectiveness (T2)—T1 refers to the biostatistical applications and methods used on information obtained from the patient in order to obtain new information that directly benefits the patient in return (i.e., bench to bedside); T2 refers to the results gathered from clinical studies that are translated or transferred to everyday clinical practices and healthy decision-making habits (i.e., result translation). T2 relies heavily on the biostatistical principles and concepts of T1, while T1 relies heavily on the development of novel and concerted biostatistical models.
2. Translational effectiveness is in need of measurement tools that have the ability to assess the quality of the evidence, individual patient research outcomes and analysis, individual patient data meta-analysis, stakeholder engagement quantification and analysis, and all-encompassing dissemination. The importance of this lies at the core of translational healthcare, whereby only the best possible intervention and the most optimal benefit are provided to the patient.
3. The best available evidence refers to the highest level and quality of evidence that currently exists. It is most commonly obtained by a process of critical evaluation of the entire body of available published research literature, along with a clear interpretative synthesis of the relevant findings.
4. Comparative effectiveness research is the systematic process by which quantitative and qualitative consensuses of the best available evidence are obtained. This differs from comparative individual patient effectiveness research (CIPER) simply because the former relies on aggregate patient data, whereas the latter focuses on individual patient data and meta-analysis.
5. A systematic review is a scientific report that describes the methodology employed for obtaining, quantifying, analyzing, and reporting the consensus of the best evidence base for a specific clinical treatment. A meta-analysis is the biostatistical technique utilized within systematic reviews to analyze quantitative evidence from related, homogeneous studies in order to estimate the effect of the specific clinical intervention.
6. The level of evidence represents evidence that is obtained from a particular research study design, which can be measured by the AGREE instrument. The quality of evidence refers to the stringency of the research methodology and data analysis, which can be measured by the PRISMA instrument.
7. Systematic review research synthesis > randomized, triple-blinded, placebo-controlled clinical trial > mixed model cohort study > cross-sectional study.
8. The bibliome refers to a collection of published papers obtained through a literature review that most closely answers the PICOTS research question. The bibliome is most analogous to a sample (i.e., sample–population interaction) from traditional biostatistics.
9. A Bayesian approach to biostatistical inference refers to the continuous updating of previous inferences as new information and data become available. This is in contrast to the frequentist approach, which does not permit the updating of new knowledge. In the frequentist approach to biostatistical inference, the development of new knowledge that is significantly unlike that which is commonly known is rendered as extreme. This "extreme," or statistically significantly different, observation is exiled from the population, even if the sample from which it is obtained is a good representation. As learned from individual patient data analysis, the stark individual differences between patients and their conditions exemplify the
extent to which we are unable to grasp the of technology. Therefore, the most effective
true population (or even if one exists). dissemination and communication of infor-
Thus, the future of biostatistics must accept mation to all stakeholders (and the like) on
this dynamic challenge for the betterment the best available evidence must be through
of healthcare and its constituents. some technologically advance system, such as
10. Translational healthcare, just like the rest tele-health.
of the world, must move with the direction
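The continuous updating described in answer 9 can be sketched with a minimal, purely illustrative example: a Beta–Binomial conjugate model, in which the posterior obtained from one batch of data becomes the prior for the next. The prior and the counts below are hypothetical, not taken from the text.

```python
# Minimal sketch of Bayesian updating (answer 9), assuming a Beta-Binomial
# conjugate model; all priors and counts here are illustrative.

def update_beta(alpha, beta, successes, failures):
    """Conjugate update of a Beta(alpha, beta) prior with binomial data."""
    return alpha + successes, beta + failures

# Start from a noninformative Beta(1, 1) prior on a treatment response rate.
alpha, beta = 1.0, 1.0

# Two successive "studies" arrive; each posterior feeds the next update,
# so new evidence is incorporated rather than discarded as "extreme".
for successes, failures in [(7, 3), (12, 8)]:
    alpha, beta = update_beta(alpha, beta, successes, failures)

posterior_mean = alpha / (alpha + beta)  # (1 + 7 + 12) / (2 + 10 + 20)
print(round(posterior_mean, 4))  # 0.625
```

By contrast, a frequentist significance test would compare each new observation against a fixed null model and flag a sufficiently discrepant result as "extreme" rather than folding it into an updated estimate, which is the distinction the answer draws.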