
Practical Biostatistics in Translational Healthcare

Allen M. Khakshooy · Francesco Chiappelli

Allen M. Khakshooy
Rappaport Faculty of Medicine
Technion-Israel Institute of Technology
Haifa, Israel

Francesco Chiappelli
UCLA School of Dentistry
Los Angeles, CA, USA

ISBN 978-3-662-57435-5    ISBN 978-3-662-57437-9 (eBook)
https://doi.org/10.1007/978-3-662-57437-9

Library of Congress Control Number: 2018943260

© Springer-Verlag GmbH Germany, part of Springer Nature 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or
part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way,
and transmission or information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in
this book are believed to be true and accurate at the date of publication. Neither the publisher nor
the authors or the editors give a warranty, express or implied, with respect to the material
contained herein or for any errors or omissions that may have been made. The publisher remains
neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by the registered company Springer-Verlag GmbH, DE part
of Springer Nature.
The registered company address is: Heidelberger Platz 3, 14197 Berlin, Germany
To master and apprentice,
“Credette Cimabue ne la pittura tener lo campo, e ora ha
Giotto il grido, … ”
(Dante Alighieri, Divine Comedy, Purgatorio XI, 94-95)
Foreword

It is a great pleasure for me to write the Foreword of this interesting book on biostatistics and translational healthcare. Biostatistics is a rapidly evolving discipline. Large databases that collect massive information at the individual level, coupled with high computing power, have enormously increased the potential of this discipline, to the point that we can realistically speak of evidence-based, patient-centered healthcare. However, the foundations of the discipline have not changed over time and rest on arguments that are inherently philosophical and logical. They also require deep subject-matter knowledge. All these issues are well addressed in the book, which constitutes a very good introduction to the principles of biostatistics.
The book describes in detail the research process underlying the construction of what can be considered scientific knowledge. The research process is described as a three-legged stool: study design, methodology, and data analysis. Each leg is investigated with profound competence. Carefully designed examples introduce students to the core concepts and, at the same time, equip them with a critical view. I particularly appreciate the description of the sources of error hindering any statistical conclusion. Random errors are by far the most studied and widely understood. Nowadays, systematic errors, due for instance to self-selection, missing information, informative drop-out, and so on, have also received much attention in the statistical literature, thanks to sensitivity analysis and related methods. Left behind, in my opinion, is what Chap. 1 of the book describes as the most dangerous source of mistakes in research: errors of judgment. Improper use of logic and information induces fallacies in the interpretation of evidence and the formulation of conclusions. These types of errors are now even more common, given the inrush of so-called big data, i.e., data that are collected from different sources with different purposes. Errors of judgment are subtle and subjective and cannot be easily transferred into a statistical model. For this reason, they are often overlooked.
Starting from the first leg, study design, the book describes the different studies that are relevant in biostatistics, ranging from diagnostic studies to research synthesis. It then focuses on the major distinction underlying prognostic research, namely, that between observational and experimental studies. Experimental studies with random assignment of subjects to treatment are the ideal framework for research in biostatistics. However, ethical reasons together with practical obstacles make such studies infeasible in many contexts. Moreover, complete adherence to the study protocol is rare, and elements of observational studies are therefore introduced even into a well-designed experimental study. It is therefore crucial to understand the potential sources of bias due to the absence of randomization. As the recent reprint, with comments, of the seminal paper by Cochran (1972), "Observational Studies," attests, the issue is of crucial importance in all applied research areas, with biostatistics being a notable example.
The book then carries on by addressing methodology. Emphasis is placed on sampling procedures as well as on data acquisition through valid and reliable instruments. Coming to data analysis, descriptive statistics and inferential methods are presented, with an eye to the process that transfers research results into new methods for diagnosis, therapy, and prevention, that is, the object of translational healthcare. I particularly appreciate the emphasis on the questions that a researcher must address to ascertain the internal and external validity of a study, which constitute the replicability of the findings and therefore their accreditation in the scientific community.
The final part of the book is dedicated to the consolidation of statistical knowledge and its capacity to "make a difference" in society. The first concept addresses the issue of comparability of research results across studies. Subtle unmeasured differences may hamper the crude comparison of findings from different studies and call for sophisticated statistical methods. Once again, the book stresses that any choice of data analysis should be accompanied by a critical view and a deep understanding of the subject matter under investigation. The last two chapters then present a detailed account of strategies for evaluating the impact of a biostatistics research program on society, which is the goal of modern scientific research.

Perugia, Italy Elena Stanghellini


February 26, 2018
Preface

It almost never happened. My first introduction to Dr. Chiappelli resulted from a slight mishap in the administration's assignment of teaching assistants to lecturers. I received an email from him requesting that we meet before we began our working relationship for the upcoming semester. As it was my first semester as a biostatistics teaching assistant, I was eager to showcase my passion for statistics within the health sciences. But before we even had the chance to meet, the error in our assignment came to light—I received a follow-up email from him bidding me farewell, wishing me luck with whomever I would be assigned to, and leaving me with this: "Teaching is a very rewarding experience—hard work, but rewarding!"
This is but one of the many great pieces of advice that I would later receive from Dr. Chiappelli. By some miraculous turn of events, I would not only remain assigned to his lecture that semester but also for the semesters of the next year and a half. During our first meeting, we quickly discovered our joint passion for biostatistics, research, and healthcare. As an ambitious premedical student, I was amazed by Dr. Chiappelli's pioneering work in convergent and translational science within modern healthcare. When I heard of his prolific laboratory and research group, I knew that my once-in-a-lifetime opportunity was sitting across from me.
Fast-forward to the present—almost 3 years later—and I have had the opportunity to publish in prestigious biomedical journals, be promoted to biostatistics lecturer, hold a senior laboratory position at UCLA, and am now on track toward my dream: receiving a medical degree. None of this would have happened without the wisdom, guidance, and kindheartedness of Dr. Chiappelli. From biostatistics and research to the laboratory and medicine, the great deal of knowledge and experience I have gained from him will certainly serve me in ways currently unfathomable.

And now our first book together! So, second to The Omnipotent, I thank you, Dr. Chiappelli, for this opportunity—I will cherish this book and the invaluable lessons you have taught me (and continue to teach me) for a lifetime. Your passion for teaching, advancing knowledge, and healthcare has had a profound effect on me. I hope to one day pay forward your lessons to students of my own—and to think… it almost never happened.
I would like to thank our editor, Nicole Balenton, for her hard work and dedication to making this book perfect. Nicole is a brilliant young mind and former student who continues to surprise us with her excellence and ingenuity. I express my appreciation to my parents, who have given me the utmost love throughout my entire life, and to my siblings, Arash and Angela, whom I can always rely on for inspiration and wisdom. I thank Moses Farzan and Saman Simino for their continued support and friendship. Lastly, I extend my deepest gratitude and appreciation to Margaret Moore, Rekha Udaiyar, and the rest of the wonderful team at Springer for this opportunity and their help throughout the process.

Haifa, Israel Allen M. Khakshooy


January 2018
Acknowledgments

There is little that I can add to Dr. Khakshooy's excellent preface, except to thank him for the kind words, most of which I may not—in truth—deserve. This work is primarily his, and for me it has been a fulfilling delight to mentor and guide a junior colleague as valuable as Dr. Khakshooy in his initial steps of publishing.

I join him in thanking Ms. Balenton, who will soon enter the nursing profession. Her indefatigable attention to detail, her dedication to our research endeavors, and her superb and untiring help in the editorial process have brought incalculable value to our work.

I also join in thanking most warmly Ms. Margaret Moore, Editor, Clinical Medicine; Ms. Rekha Udaiyar, Project Coordinator; and their superb team at Springer for their guidance, encouragement, and patience in this endeavor.
I express my gratitude to the Division of Oral Biology and Medicine of the
School of Dentistry at UCLA, where I have been given the opportunity to
develop my work in this cutting-edge area of research and practice in health-
care, and to the Department of the Health Sciences at CSUN, where both Dr.
Khakshooy and I have taught biostatistics for several years. I express my
gratitude to the Fulbright Program, of which I am a proud alumnus, having
been sent as a Fulbright Specialist to Brazil where I also taught biostatistics.
In closing, I dedicate this work, as all of my endeavors, to Olivia, who
somehow always knows how to get the best out of me, to Fredi and Aymerica,
without whom none of this would have been possible, and as in all, to only
and most humbly serve and honor.
“… la gloria di Colui che tutto move
per l’universo penetra e risplende
in una parte più e meno altrove …”
(Dante Alighieri, 1265–1321; La Divina Commedia, Paradiso, I 1-3)

Los Angeles, CA, USA Francesco Chiappelli

Contents

Part I  Fundamental Biostatistics for Translational Research

1 Introduction to Biostatistics. . . . . . . . . . . . . . . . . . . . . . . . . . . .    3


1.1 Core Concepts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   3
1.2 The Scientific Method. . . . . . . . . . . . . . . . . . . . . . . . . . . . .   4
1.2.1 The Research Process . . . . . . . . . . . . . . . . . . . . . . .   5
1.2.2 Biostatistics Today. . . . . . . . . . . . . . . . . . . . . . . . . .   9
1.2.3 Self-Study: Practice Problems. . . . . . . . . . . . . . . . .  11
2 Study Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   13
2.1 Core Concepts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  13
2.2 Diagnostic Studies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  15
2.2.1 Reliability and Validity . . . . . . . . . . . . . . . . . . . . . .  15
2.2.2 Specificity and Sensitivity. . . . . . . . . . . . . . . . . . . .  16
2.3 Prognostic Studies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  16
2.3.1 Observational Design. . . . . . . . . . . . . . . . . . . . . . . .  17
2.3.2 Experimental Design. . . . . . . . . . . . . . . . . . . . . . . .  21
2.4 Self-Study: Practice Problems. . . . . . . . . . . . . . . . . . . . . . .  24
3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   27
3.1 Core Concepts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  27
3.2 Sample vs. Population. . . . . . . . . . . . . . . . . . . . . . . . . . . . .  28
3.2.1 Sampling Methods. . . . . . . . . . . . . . . . . . . . . . . . . .  30
3.3 Measurement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  33
3.3.1 Instrument Validity. . . . . . . . . . . . . . . . . . . . . . . . . .  34
3.3.2 Instrument Reliability . . . . . . . . . . . . . . . . . . . . . . .  35
3.4 Data Acquisition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  36
3.4.1 On Data: Quantitative vs. Qualitative . . . . . . . . . . .  38
3.4.2 Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  39
3.5 Self-Study: Practice Problems. . . . . . . . . . . . . . . . . . . . . . .  40
Recommended Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   41
4 Descriptive Statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   43
4.1 Core Concepts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  43
4.2 Tables and Graphs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  45
4.3 Descriptive Measures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  53
4.3.1 Measures of Central Tendency. . . . . . . . . . . . . . . . .  53
4.3.2 Measures of Variability . . . . . . . . . . . . . . . . . . . . . .  55


4.4 Distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  59


4.5 Probability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  63
4.5.1 Rules of Probability. . . . . . . . . . . . . . . . . . . . . . . . .  64
4.5.2 Bayesian vs. Frequentist Approach. . . . . . . . . . . . .  66
4.5.3 Z-Transformation. . . . . . . . . . . . . . . . . . . . . . . . . . .  66
4.6 Self-Study: Practice Problems. . . . . . . . . . . . . . . . . . . . . . .  69
5 Inferential Statistics I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   71
5.1 Core Concepts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  71
5.2 Principles of Inference and Analysis. . . . . . . . . . . . . . . . . .  73
5.2.1 Sampling Distribution. . . . . . . . . . . . . . . . . . . . . . .  74
5.2.2 Assumptions of Parametric Statistics. . . . . . . . . . . .  76
5.2.3 Hypotheses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  76
5.3 Significance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  77
5.3.1 Level of Significance. . . . . . . . . . . . . . . . . . . . . . . .  77
5.3.2 P-Value. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  79
5.3.3 Decision-Making. . . . . . . . . . . . . . . . . . . . . . . . . . .  80
5.4 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  84
5.5 Hypothesis Testing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  86
5.6 Study Validity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  88
5.6.1 Internal Validity. . . . . . . . . . . . . . . . . . . . . . . . . . . .  88
5.6.2 External Validity . . . . . . . . . . . . . . . . . . . . . . . . . . .  89
5.7 Self-Study: Practice Problems. . . . . . . . . . . . . . . . . . . . . . .  89
6 Inferential Statistics II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   91
6.1 Core Concepts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  91
6.2 Details of Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . .  93
6.2.1 Critical Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  93
6.2.2 Directional vs. Nondirectional Tests. . . . . . . . . . . .  95
6.3 Two-Group Comparisons. . . . . . . . . . . . . . . . . . . . . . . . . . .  96
6.3.1 z Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  96
6.3.2 t Test Family. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  99
6.4 Multiple Group Comparison . . . . . . . . . . . . . . . . . . . . . . . . 106
6.4.1 ANOVA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.5 Continuous Data Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.5.1 Associations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.5.2 Predictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.6 Self-Study: Practice Problem. . . . . . . . . . . . . . . . . . . . . . . . 120
7 Nonparametric Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   123
7.1 Core Concepts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.2 Conceptual Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.2.1 What Is Nonparametric Statistics?. . . . . . . . . . . . . . 124
7.2.2 When Must We Use the Nonparametric Paradigm?. 125
7.2.3 Why Should We Run Nonparametric Inferences?. . 125
7.3 Nonparametric Comparisons of Two Groups . . . . . . . . . . . 126
7.3.1 Wilcoxon Rank-Sum. . . . . . . . . . . . . . . . . . . . . . . . 126
7.3.2 Wilcoxon Signed-Rank . . . . . . . . . . . . . . . . . . . . . . 127
7.3.3 Mann–Whitney U. . . . . . . . . . . . . . . . . . . . . . . . . . . 127

7.4 Nonparametric Comparisons of More than Two Groups. . . 128


7.4.1 Kruskal–Wallis for One-Way ANOVA . . . . . . . . . . 128
7.4.2 Friedman for Factorial ANOVA. . . . . . . . . . . . . . . . 129
7.4.3 Geisser–Greenhouse Correction for Heterogeneous
Variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.5 Categorical Data Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.5.1 The Chi-Square (χ2) Tests, Including Small
and Matched Designs. . . . . . . . . . . . . . . . . . . . . . . . 130
7.5.2 Time Series Analysis with χ2: Kaplan–Meier
Survival and Cox Test . . . . . . . . . . . . . . . . . . . . . . . 133
7.5.3 Association and Prediction:
Logistic Regression. . . . . . . . . . . . . . . . . . . . . . . . . 134
7.6 Self-Study: Practice Problems. . . . . . . . . . . . . . . . . . . . . . . 136
Recommended Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  137

Part II  Biostatistics for Translational Effectiveness

8 Individual Patient Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   141


8.1 Core Concepts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.2 Conceptual, Historical, and Philosophical Background . . . 142
8.2.1 Aggregate Data vs. Individual Patient Data. . . . . . . 142
8.2.2 Stakeholders. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.2.3 Stakeholder Mapping. . . . . . . . . . . . . . . . . . . . . . . . 144
8.3 Patient-Centered Outcomes. . . . . . . . . . . . . . . . . . . . . . . . . 145
8.3.1 Primary Provider Theory. . . . . . . . . . . . . . . . . . . . . 145
8.3.2 Individual Patient Outcomes Research . . . . . . . . . . 147
8.3.3 Individual Patient Reviews. . . . . . . . . . . . . . . . . . . 148
8.4 Patient-Centered Inferences. . . . . . . . . . . . . . . . . . . . . . . . . 149
8.4.1 Individual Patient Data Analysis. . . . . . . . . . . . . . . 149
8.4.2 Individual Patient Data Meta-Analysis . . . . . . . . . . 149
8.4.3 Individual Patient Data Evaluation . . . . . . . . . . . . . 151
8.5 Implications and Relevance for Sustained Evolution
of Translational Research. . . . . . . . . . . . . . . . . . . . . . . . . . . 153
8.5.1 The Logic Model. . . . . . . . . . . . . . . . . . . . . . . . . . . 153
8.5.2 Repeated Measure Models. . . . . . . . . . . . . . . . . . . . 153
8.5.3 Comparative Individual Patient Effectiveness
Research (CIPER). . . . . . . . . . . . . . . . . . . . . . . . . . 154
8.6 Self-Study: Practice Problems. . . . . . . . . . . . . . . . . . . . . . . 155
9 Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   157
9.1 Core Concepts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
9.2 Conceptual, Historical, and Philosophical Background . . . 158
9.2.1 Conceptual Definition . . . . . . . . . . . . . . . . . . . . . . . 158
9.2.2 Historical and Philosophical Models. . . . . . . . . . . . 158
9.2.3 Strengths and Deficiencies. . . . . . . . . . . . . . . . . . . . 160
9.3 Qualitative vs. Quantitative Evaluation. . . . . . . . . . . . . . . . 162
9.3.1 Quantifiable Facts Are the Basis of the 
Health Sciences. . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

9.3.2 Qualitative Evaluation. . . . . . . . . . . . . . . . . . . . . . . 162


9.3.3 Qualitative vs. Quantitative Evaluation. . . . . . . . . . 163
9.4 Formative vs. Summative Evaluations. . . . . . . . . . . . . . . . . 163
9.4.1 Methodology and Data Analysis. . . . . . . . . . . . . . . 163
9.4.2 Formative and Summative Evaluation. . . . . . . . . . . 163
9.4.3 Comparative Inferences. . . . . . . . . . . . . . . . . . . . . . 164
9.5 Implications and Relevance for Sustained Evolution
of Translational Research . . . . . . . . . . . . . . . . . . . . . . . . . . 164
9.5.1 Participatory Action Research and Evaluation. . . . . 164
9.5.2 Sustainable Communities: Stakeholder
Engagement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
9.5.3 Ethical Recommendations. . . . . . . . . . . . . . . . . . . . 165
9.6 Self-Study: Practice Problems. . . . . . . . . . . . . . . . . . . . . . . 165
Recommended Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  166
10 New Frontiers in Comparative Effectiveness Research. . . . .   167
10.1 Core Concepts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
10.2 Conceptual Background. . . . . . . . . . . . . . . . . . . . . . . . . . 168
10.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
10.2.2 Comparative Effectiveness Research in the 
Next Decades. . . . . . . . . . . . . . . . . . . . . . . . . . . 170
10.2.3 Implications and Relevance for Sustained
Evolution of Translational Research and 
Translational Effectiveness. . . . . . . . . . . . . . . . . 180
10.2.4 Self-Study: Practice Problems. . . . . . . . . . . . . . 182
Recommended Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  183
Appendices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   185
Appendix A: Random Number Table. . . . . . . . . . . . . . . . . . . . . 186
Appendix B: Standard Normal Distribution (z) . . . . . . . . . . . . . 188
Appendix C: Critical t Values. . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Appendix D: Critical Values of F. . . . . . . . . . . . . . . . . . . . . . . . 191
Appendix E: Sum of Squares (SS) Stepwise Calculation
Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Appendix F: Critical Values for Wilcoxon T . . . . . . . . . . . . . . . 195
Appendix G: Critical Values for Mann-Whitney U . . . . . . . . . . 197
Appendix H: Critical Values for the Chi-Square
Distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  201
Answers to Chapter Practice Problems . . . . . . . . . . . . . . . . . . .  203
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  215
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  221
List of Videos

Chapter 3
Video 1: Variables. Reprint courtesy of International Business Machines
Corporation, © International Business Machines Corporation

Chapter 4
Video 2: Frequency tables. Reprint courtesy of International Business
Machines Corporation, © International Business Machines Corporation
Video 3: Graphing. Reprint courtesy of International Business Machines
Corporation, © International Business Machines Corporation

Chapter 6
Video 4: One-sample t-test. Reprint courtesy of International Business
Machines Corporation, © International Business Machines Corporation
Video 5: Independent-sample t-test. Reprint courtesy of International
Business Machines Corporation, © International Business Machines
Corporation
Video 6: Dependent-sample t-test. Reprint courtesy of International
Business Machines Corporation, © International Business Machines
Corporation
Video 7: ANOVA. Reprint courtesy of International Business Machines
Corporation, © International Business Machines Corporation
Video 8: Correlation. Reprint courtesy of International Business Machines
Corporation, © International Business Machines Corporation
Video 9: Regression. Reprint courtesy of International Business Machines
Corporation, © International Business Machines Corporation

Chapter 7
Video 10: Wilcoxon rank-sum. Reprint courtesy of International Business
Machines Corporation, © International Business Machines Corporation
Video 11: Wilcoxon signed-rank. Reprint courtesy of International
Business Machines Corporation, © International Business Machines
Corporation
Video 12: Mann–Whitney U. Reprint courtesy of International Business
Machines Corporation, © International Business Machines Corporation
Video 13: Kruskal–Wallis H. Reprint courtesy of International Business
Machines Corporation, © International Business Machines Corporation


Video 14: Friedman. Reprint courtesy of International Business Machines


Corporation, © International Business Machines Corporation
Video 15: Chi-square. Reprint courtesy of International Business Machines
Corporation, © International Business Machines Corporation
Video 16: Logistic regression. Reprint courtesy of International Business
Machines Corporation, © International Business Machines Corporation
Part I  Fundamental Biostatistics for Translational Research
1  Introduction to Biostatistics

Contents
1.1 Core Concepts  3
1.2 The Scientific Method  4
1.2.1 The Research Process  5
1.2.2 Biostatistics Today  9
1.2.3 Self-Study: Practice Problems  11

1.1 Core Concepts

Nicole Balenton

By the term "biostatistics," we mean the application of the field of probability and statistics to a wide range of topics that pertain to the biological sciences. We focus our discussion on the practical applications of fundamental biostatistics in the domain of healthcare, including experimental and clinical medicine, dentistry, and nursing.

As a branch of science, biostatistics encompasses the design of experiments, the monitoring of methodologies for appropriate sampling and accurate measurements, and the cogent analysis and inference of the findings obtained. These concerted activities are driven by the search for data-based answers to specific research questions. That is to say, biostatistics is the primary driver of the hypothesis-driven process by which research evidence is obtained, evaluated, and integrated into the growing knowledge base of psychobiological processes in health and disease.

One strength of biostatistics lies in the unbiased nature of its inferences, which are based on the stringent laws of probability and bound by a rigid adherence to the requirements of randomness. Nonetheless, errors do occur in biostatistics, and the second area of strength of the field is its full awareness of these limitations. There are three types of errors possible in biostatistics: systematic errors, viz., mistakes in planning and conducting the research protocol and in analyzing its outcomes; random errors, viz., mistakes that are consequential to situations and properties that occur randomly and are not under the control of the investigator (i.e., chance); and errors of judgment (i.e., fallacies), viz., making errors of interpretation rather than errors of fact.

This chapter discusses these fundamental concepts and introduces the timely and critical role of biostatistics in modern contemporary research in evidence-based, effectiveness-focused, and patient-centered healthcare. Emphasis is placed on the fact that there are, today, two principal approaches for looking at effectiveness: comparative effectiveness analysis, viz., comparing quantified measures of quality of life and related variables among several interventions, and comparative effectiveness research, viz., comparing several interventions in terms of relative differences in cost- and benefit-effectiveness and in reduced risk. This chapter introduces these core concepts, which are explored in greater depth throughout this book.
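
To make the distinction between the two approaches concrete, here is a minimal sketch in Python (our own illustration with invented numbers; the intervention names, scores, and costs are hypothetical, not data from the book). It first compares two interventions on a quality-of-life measure alone, in the spirit of comparative effectiveness analysis, and then on relative cost per unit of benefit, in the spirit of comparative effectiveness research:

# Hypothetical summary data for two interventions (invented for illustration).
interventions = {
    "intervention A": {"mean_qol": 7.2, "cost_usd": 1200.0},  # quality-of-life score (0-10 scale)
    "intervention B": {"mean_qol": 6.8, "cost_usd": 450.0},
}

# Comparative effectiveness *analysis*: compare the quantified
# quality-of-life measures among the interventions.
best_qol = max(interventions, key=lambda name: interventions[name]["mean_qol"])
print("Higher quality of life:", best_qol)

# Comparative effectiveness *research*: compare relative differences in
# cost- and benefit-effectiveness (here, dollars per quality-of-life point).
for name, d in interventions.items():
    print(f"{name}: ${d['cost_usd'] / d['mean_qol']:.0f} per quality-of-life point")

The point of the toy output is that the two questions can disagree: the intervention with the higher quality-of-life score is not necessarily the one with the better cost-benefit profile.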

1.2 The Scientific Method

Ask any lay person: "What is the scientific method?" and you will probably hear a response along the lines of "a method used in science." Be that as it may, it can be said with a degree of certainty that it is a method that almost every living human being has utilized—a bold statement, indeed, but the more we scrutinize its plausibility, the more we can consider its possibility.

A simple Internet search of "the scientific method" will produce millions upon millions of results that can provide anyone with an in-depth understanding of the scientific method. But that wouldn't be without hours of reading, critical analysis, surely a migraine or two, and of course an innate passion to learn. Now, what if there was a single word that could describe the scientific method as simply and accurately as possible—wouldn't that be of interest?

Why? This word, in and of itself, characterizes the curiosity, ingenuity, and advancement that is so particular to human nature. Asking questions like Why? How? What? and even When? and Where? is arguably the fundamental basis of this method we use in science. Granted, a small lie may have been told regarding just a single word. Rather, it can be said that there are a few single words that can just as simply and accurately achieve that which is attempted to be imparted. So, let us refrain from claiming that there is a single word or many single words that can reflect what the scientific method is. Instead, it will be argued that it is the act of questioning, examination, or inquiry that lies at the heart of the scientific method.

Believe it or not, the scientific method was the scientific method before it was called the scientific method. As funny as that may sound, this meaning goes back to what was mentioned earlier in this chapter on the universal usage of the scientific method. Earlier, this may not have entirely resonated, but now, with the "single word" description, that claim seems more conceivable. To stretch the argument further, a visit must be paid to the philosophers of old.

Fig. 1.1  A bust of Socrates in the Louvre (Gaba 2005)

Socrates (Fig. 1.1), regarded as the street philosopher of Athens, was infamous for soliciting the seemingly never-ending spiral of questions to those who passed by. Just as today, the people of ancient Greece considered it childish and aggravating when a man, uncanny to say the least, approached and perpetually probed them with odd and even embarrassing questions. Plato, his student, later dubbed this seemingly eerie behavior of Socrates elenchus, today denoted as the Socratic method of inquiry. This method is one where, through a series of questions and investigations, one could attain the answers to a problem or, more philosophically, the truth of the matter.

Though the truth may seem out of scope or even unrelated to this subject matter, we shall see throughout this book that understanding and attempting to attain the truth may not seem so far-fetched after all.

Fig. 1.2  Steps of the research process: research question → study hypothesis → study design → methodology → data analysis → conclusion

Another large instrumental factor in our current scientific method was a grand-student of Socrates: Aristotle, a naturalist who explored the realms of logic, reasoning, and rationality that have largely influenced today's traditional Western thought. Of his many contributions, the techniques of inductive and deductive reasoning have played a large role in our scientific method today. We will return to this dichotomy of scientific reasoning later, but it must be noted that there currently exist many more influences on the evolution of our scientific method as we know it. On the same note, the scientific method today remains open to influences.

Finally, the scientific method is a method of investigating phenomena based on observations from the world around us, in which specific principles of reasoning are used in order to test hypotheses, create knowledge, and ultimately come one step closer to obtaining the truth. We must understand that there is no universal scientific method; rather, there are fundamental concepts and principles that make this method of inquiry scientific. Moreover, the scientific method is ever-changing and ever-growing, such that the method itself is under its own scrutiny.

1.2.1 The Research Process

The research process can be argued to be the same as, or a synonym for, the scientific method. Though skeptics of this differentiation exist, for simplicity and practicality's sake, we will distinguish the scientific method and the research process, with the latter representing the actual application of the former.

The research process is a process that uses the scientific method to establish, confirm, and/or reaffirm certain pieces of knowledge supported by strong evidence or, as we may call it, proof. We use the research process to create theories, find solutions to problems, and even find problems to solutions we already have. In addition, the overarching goal of the research process is also an attempt to find some sort of truth. However abstract this may seem, we can actualize its meaning by making the goal of the research process the culmination in an inference consensus, or an ability to make a generalization about the whole based on its representative parts. Though the specific steps may differ based on their source, this book will take the steps of the research process to be those depicted in Fig. 1.2; a brief description of each is provided in the following section.

Fig. 1.3  Methodology, study design, and data analysis are the foundations of the research process

Lastly, the research process as a whole can be conceptualized as a three-legged stool (Fig. 1.3) that sits on methodology, study design, and data analysis. This metaphoric description is crucial to the understanding of the research process: each individual leg is equally valuable and important to the function of the stool. Just as the function of a stool is for one to sit, so too is the function of the research process: for one to gain otherwise unattainable knowledge. Hence, the integrity of the stool as a whole is placed in question should any single leg deteriorate.

1.2.1.1 Hypothesis-Driven Research
So, how does one begin the research process? The research process begins with nothing other than a question. The research question, simply put, is a question of interest to the investigator that serves as the driver of the entire process. The great value placed on this concept reflects that the answer to the question must not only be interesting enough to warrant the need for a process but, more importantly, be both meaningful and useful. To take it many steps further, obtaining the answer to a research question could potentially prevent mass casualties in the world and help end world hunger.

Of course, this book is not a how-to manual on world peace. Rather, the premise that we are attempting to drive home is that not only can the successful answering of the research question be worthwhile but that we may very well not always be successful in obtaining an answer. Thus, research questions are chosen based on certain criteria easily remembered by the acronym FINER. We say that a research question must be: feasible, interesting, novel, ethical, and relevant. Though there can be a never-ending list of categories of research questions, in Table 1.1 we provide a few types of research questions that are relevant to our specific interest in this book¹; a short sketch after the table makes the PICO(TS) entry concrete.

A hypothesis, commonly referred to as an educated guess, is seen as both a starting point and a guiding tool of the research process. But was it not mentioned earlier that it is the research question that is the starting point? Indeed! Here is where the intimate relationship between the research question and the study hypothesis is made clear. The study hypothesis is nothing more than the research question stated positively (i.e., the research question is changed from question format to statement format). For instance, the research question "Does drug X lower blood pressure?" becomes the study hypothesis "Drug X lowers blood pressure." The disparate forms of hypotheses are further discussed under Hypothesis Testing in Chap. 5.

The study design serves as the infrastructure, or the system we create, that aids in answering the research question. The design of any research process is, obviously, dependent on both the peripheral and inherent details of the research question, like the specific population, disease, and therapy being studied.

The methodology of the research process is concerned with the process of measuring and collecting the necessary information (which we call data, discussed further in Chap. 3) regarding the specific population of interest depicted in the research question. As further elaborated in Chap. 3, because it is seemingly impossible to comprehensively study an entire population, we obtain data from a sample that serves as a representative of the entire population.

Data analysis comprises the statistical techniques and reasoning tools utilized in the examination of the collected information, i.e., the data. Some have regarded this section as the results of the study, in which the evidence obtained is used in hopes of proving or disproving the conjectured hypotheses.

Lastly, the conclusion is the researcher's attempt to answer the research question relative to the results that were obtained. It is at this point that our initial concepts of inference consensus and truth determination converge.

Table 1.1  Types of research questions
Descriptive—attempts to simply describe that which is occurring or that which exists
Relational—seeks to establish, or to test the establishment of, a specific relationship or association among variables within groups
Causal—developed to establish a direct cause-and-effect relationship, either by means of a comparison or by means of a prediction
PICO(TS)—describes specific criteria of research as they refer to the patient(s), the intervention, and its comparators that are under consideration for a given sought outcome, under a specified timeline and in the context of a specific clinical setting

¹ Note the acronym stands originally for population, intervention, comparator, outcome, timeline, and setting; the latter two are parenthetic such that they are not always used or available to use; in any case, research questions can be described as PICO, PICOT, or PICOS.
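
For readers who like to see structure made explicit, the PICO(TS) criteria in Table 1.1 lend themselves naturally to a small data structure. The following is a minimal sketch in Python; the example question, field names, and values are our own hypothetical choices, not part of the book:

from dataclasses import dataclass
from typing import Optional

@dataclass
class PicotsQuestion:
    """Elements of a PICO(TS) research question; timeline and setting are
    optional, so the same structure covers PICO, PICOT, PICOS, and PICOTS."""
    population: str                  # P: the patient(s) or population of interest
    intervention: str                # I: the intervention under consideration
    comparator: str                  # C: what the intervention is compared against
    outcome: str                     # O: the sought outcome
    timeline: Optional[str] = None   # T: the specified timeline, if any
    setting: Optional[str] = None    # S: the specific clinical setting, if any

    def label(self) -> str:
        # Build the acronym from the elements that are actually specified.
        return "PICO" + ("T" if self.timeline else "") + ("S" if self.setting else "")

# A hypothetical example question (ours, not the book's):
question = PicotsQuestion(
    population="adults aged 40-65 with stage 1 hypertension",
    intervention="drug X, 10 mg daily",
    comparator="placebo",
    outcome="reduction in systolic blood pressure",
    timeline="12 weeks",
)
print(question.label())  # prints "PICOT"

The optional fields mirror the footnote: depending on which elements are specified or available, the same question may be framed as PICO, PICOT, PICOS, or the full PICOTS.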
Fig. 1.4  The study hypothesis is the driving force of the research process, hence hypothesis-driven research (research question + study hypothesis → study design, methodology, data analysis ∴ inference consensus)

Though the analysis of the data is meant to provide some sort of concrete evidence to influence the decision-making process on behalf of the postulates, it is unfortunately not that forthright. Statistical analysis allows us to put limits on our uncertainty regarding the issue at hand, but what it does not clearly allow is the absolute proof of anything. Thus, when arriving at the conclusion of a study, the results are unable to provide an absolute truth statement when all is considered. Rather, its application is more practical in disqualifying substantiated claims or premises.

Similar to the fundamental principle in the US Justice System of "innocent until proven guilty," so too exists a principle that is central to the scientific method and the research process in regard to the treatment of hypotheses within a research study. We commonly retain a basic hypothesis of the research (namely, the null hypothesis discussed in Chap. 5) such that we cannot adequately prove its absolute truth for obvious reasons. Instead, what we are capable of is proving its absolute falsehood. Subsequently, the pragmatism that is intrinsic to our conclusion is the ability to make an inference. Upon evaluation of the results, an inference is made onto the population based on the information gleaned from the sample.
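
The leap from sample to population can be illustrated in a few lines of code. In this sketch (our own hypothetical illustration, not the book's), a large list of simulated blood-pressure readings stands in for a population whose mean we pretend not to know, and a random sample of 200 provides the estimate, which is precisely the generalization an inference makes:

import random

random.seed(42)  # fixed seed so the illustration is reproducible

# A simulated "population": systolic blood pressures of 100,000 people.
population = [random.gauss(120, 15) for _ in range(100_000)]
population_mean = sum(population) / len(population)

# In practice only a sample can be measured; here, 200 people drawn at random.
sample = random.sample(population, 200)
sample_mean = sum(sample) / len(sample)

print(f"population mean (normally unknown): {population_mean:.1f}")
print(f"sample mean (our estimate of it):   {sample_mean:.1f}")
# The inference: the sample mean stands in for the population mean, within
# limits of uncertainty that the tools of Chap. 5 make precise.
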
A quick glance at the crude descriptions of each step of the research process shows the impact of the research question along the way. Then, after equating the research question with the study hypothesis, it can now be understood why the research process is referred to as hypothesis-driven research (Fig. 1.4). It is the study hypothesis that is the driver of all three legs of the stool (methodology, study design, and data analysis), which culminate in the making of a potential inference.

1.2.1.2 Errors in Research
Statistics in translational healthcare pervades the scientific literature; its aim is to improve the reliability and validity of the findings from translational research. As we progress toward a more technologically advanced future with greater accessibility, it seems as though we are constantly bombarded with so-called proven research findings, medical breakthroughs, and secretive therapies on a daily basis. It even goes as far as having distinct research findings that directly contradict one another! Recently, we have witnessed a prevalence in the retraction of research papers that, just a few years earlier, were highly regarded as pivotal to modern-day science. Though the majority of retracted papers stem from ethical concerns, there are papers that have so-called "fudged" the numbers or simply have improperly handled the statistics. These mishandlings also stretch beyond just the statistics, and we categorize them as errors.

Naturally, the susceptibility of the research (and the researcher) to error is inevitable. The effect of errors is most felt during the determination of results, or more specifically when establishing statistical significance. Discussed in more depth in Chap. 5, the establishment of statistical significance (or lack thereof) is an imperative and necessary step in the substantiation of our results (i.e., when moving from data analysis to conclusion). This lends a hand to the importance placed on the inherent errors and introduced biases that are, unfortunately, contained in much published research today.

Just as the research process is a three-legged stool, so too is the establishment of statistical significance (Fig. 1.5). The process of obtaining statistical significance sits on three forms of error: systematic errors, errors of judgment (i.e., fallacies), and random errors. We do not yet have the full capability of understanding the intricacies of each error, but for the moment it is worth attempting to briefly describe each one.

Systematic errors are just as they sound—errors in the system we have chosen to use in our research. What system are we referring to? That would be the study design. Erroneously choosing one design over another can lead to the collapse of our ultimate goal of attaining statistical significance. Luckily, systematic errors are among the few errors that can be avoided. We can avoid systematic errors by simply selecting the best possible study design. There are many factors that lead to the appropriate selection of a study design, like the type of research question, the nature of the data we are working with, and the goal of our study, to list a few. More importantly, the risk of committing a systematic error (choosing a poor study design) is that it will always produce wrong study results.

The second type of error is the error of judgment, or fallacy. To elaborate, these are errors that are grounded in biases and/or false reasoning (i.e., a fallacy), in which the improper use of logic or rationale leads to errors in scientific reasoning. It can be argued that these are the most dangerous errors in research, as they are subjective to the researcher(s). In Table 1.2, we provide a list of the various types of fallacies.

Fig. 1.5  The three basic types of error that mediate statistical significance: systematic errors, random errors, and errors of judgment

Table 1.2  A description of several common types of fallacies or biases that may occur in scientific reasoning related to research

Errors of judgment/fallacies:
Hindsight bias ("knew-it-all-along" effect): The foretelling of results on the basis of previously known outcomes and observations, subsequently testing a hypothesis to confirm the prediction to be correct. For example, taking it for granted that the Sun will not rise tomorrow.
Recomposing-the-whole bias (fallacy of composition): The bias of inferring a certain truth about the whole simply because it is true of its parts. For example: since atoms are not alive, then nothing made up of atoms is alive.
Ecological inference bias (ecological fallacy): The act of interpreting statistical data (i.e., making statistical inferences) such that deductions about the nature of individuals are made based on the groups to which they belong. For example: America is regarded as the most obese country in the world today; therefore my American cousin, whom I've never met, must be obese!
Fallacia ad hominem (shooting the messenger): The fallacy of blaming a poor outcome on the fault of others. For example: "It's not my fault the results were bad, it's the fault of the engineer of the statistical software!"
Fallacia ad populum et ad verecundiam ("everybody does it!"): The fallacy of common practice or of authoritative knowledge. For example: "I just did it the way everybody else does it!" or "This is how my Principal Investigator does it!"
Fallacia ad ignorantiam et non sequitur ("Just because!"): The fallacy of common practice without any certain proof that what is done is appropriate. For example: "I did it this way because I don't know of a better way, that's just how I learned to do it!"

The third type of error in research is the random error, which is arguably the most common of the bunch. These are errors that are essentially beyond control—meaning that, no matter what, this type of error cannot be avoided or prevented entirely. Better yet, we can be certain of its occurrence simply because we (the researcher, study subjects, etc.) are all human, and error is embedded in our nature.

Yet this should not be as alarming as its doomsday description makes it out to be. Why? Because statistics are here to save the day! One of the primary functions of the statistical tools and techniques later described in this book is to decrease or fractionate random error, thereby minimizing its potentially detrimental effects on our results. On the other hand, the presence of error in our study can also serve a positive purpose insofar as it takes into consideration the individual differences of the study subjects. Truthfully, there could be an entire field within statistics dedicated to the process of, and the value behind, the minimization of error. For now, we can assure you that its presence will be felt in the majority of the sections that follow in this book.
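
The contrast between random and systematic errors can be made tangible with a short simulation (our own illustration; the "true value" and the bias are invented). Averaging more measurements shrinks random error, while a fixed bias, the signature of a systematic error such as a poorly chosen design or a miscalibrated instrument, survives any sample size:

import random

random.seed(1)  # fixed seed so the illustration is reproducible

TRUE_VALUE = 120.0  # the quantity being measured (hypothetical)
BIAS = 5.0          # a systematic error, e.g., a miscalibrated instrument

def average_of_measurements(n: int, bias: float = 0.0) -> float:
    """Average n measurements contaminated by random noise and a fixed bias."""
    readings = [TRUE_VALUE + bias + random.gauss(0, 10) for _ in range(n)]
    return sum(readings) / n

for n in (10, 100, 10_000):
    no_bias = average_of_measurements(n)
    with_bias = average_of_measurements(n, bias=BIAS)
    print(f"n = {n:>6}: no bias -> {no_bias:7.2f}   with bias -> {with_bias:7.2f}")

# As n grows, the unbiased average converges on 120 (random error shrinks),
# while the biased average settles near 125: a larger sample cannot rescue
# a study from a systematic error.

This is the sense in which statistical tools can fractionate random error but cannot rescue a study from a systematic one.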

1.2.2 Biostatistics Today

Biostatistics—the word itself may seem intimidating at first. Should one want to impress their friends and family, mentioning that you are studying biostatistics is an easy way to accomplish that. But however intimidating the word may seem, the actual study of biostatistics should not be feared. Moreover, the roots of the word hint at its actual concept and study: bio and statistics. Hence, a layperson may perceive the study of biostatistics to mean the statistics in biology or life statistics—a weak interpretation of the word.

Although we may use biostatistics in the field of biology, the more representative meaning that we will side with is the latter—namely, the statistics behind human life. Further, biostatistics is a specific branch of statistics that utilizes information that is particular to living organisms. But it must be made clear that the fundamental tools and concepts of biostatistics are no different than those of statistics itself. Rather, it is the overarching theme and ultimate purpose behind the utilization of these techniques that make them specific to biostatistics.

The study of biostatistics is not limited to any one field, like biology. One of the great virtues of this study is that it involves multidisciplinary collaboration among the wealth of today's studies that have to do with human life. To name just a few, these disciplines range from psychology and sociology to public health and epidemiology and even to medicine and dentistry.

Thus, the utilization of biostatistics today is the application and development of statistical theory to real-world issues particular to life as we know it. Additionally, the aim we are working toward is solving some of these problems, in hopes of improving life as a whole. So, we can see how it would not be uncommon to hear the biomedical sciences described as the broad discipline to which biostatistics belongs. But since the nature of this book is pragmatism, we will narrow its comprehensive discipline from the biomedical sciences to the health sciences. Hence, taken together, biostatistics lies at the heart of the research process in the health sciences.

1.2.2.1 Relevance for the Health Sciences
The health sciences are composed of a large variety of applied scientific fields that pertain to the usage of science, mathematics, technology, and engineering in the delivery of healthcare and its constituents. The health sciences cover a wide variety of disciplines that are not solely limited to conventional Western medicine; rather, they stretch to both traditional and alternative medical modalities. That being said, it is not so much the actual practices of these medical modalities that are of concern in this book; rather, it is the methods of utilizing the information collected from these practices that is of chief concern.

When we bring biostatistics into the conversation, we see that its introduction to the health sciences serves the purpose of our research. Just as we spoke of the importance of the research question, it is the health science-based research question that requires biostatistical theory to be

Fig. 1.6  Translational healthcare model: bench → (T1) → bedside; clinical studies → (T2) → clinical guidelines and healthy decision-making habits

answered. Moreover, we can now perceive the value of the hopeful answer that we obtain from the health science-based research question. The significance of this answer is that it is the best possible answer to a problem that seeks the betterment of both the healthcare field and its constituents.²

² For example, just a few years ago citizens of the United States questioned the lack of universal healthcare in their country. This was deemed a problem for the overall well-being of the United States and its constituents, which was supported by epidemiological evidence among others (i.e., mortality rates, prevalence of preventable diseases, etc.). Moreover, the evidence proved that there was much need for an affordable and accessible healthcare plan that would solve the problems that resulted from a lack of universal healthcare in the United States. Hence, in 2010, the US Congress passed the Affordable Care Act, which was aimed at settling this real-world problem for the overall well-being of the healthcare field (i.e., legislative policy) and its constituents (i.e., US citizens).

Conclusively, a primary aim of this book is to provide an understanding of the basic principles that underlie research design, data analysis, and the interpretation of results in order to enable the reader to carry out a wide range of statistical analyses. The emphasis is firmly on the practical aspects and applications of the methodology, design, and analysis of research in the science behind translational healthcare.

1.2.2.2 Research in Translational Healthcare
A biostatistics course is essential, if not mandatory, to a student in the health sciences. This is mainly for the acquisition of basic and scientific statistical knowledge that pertains to the specific area that is being studied within the health sciences. But as we progress from today's students to tomorrow's professionals, the great value of biostatistics arises within the field of translational healthcare.

As this is being written, the fate of US healthcare, for better or worse, is uncertain, but what is certain is the direction that the field is moving toward as a whole: placing focus on the individual patient. The field of translational healthcare is one which takes a patient-centered approach that translates health information gained from a research setting to the individual patient and, if effective, translates it to benefit all patients. Furthermore, this is the crude conceptualization of the two primary constructs of the science of translation (or translational medicine, as it was first introduced)—namely, translational research and translational effectiveness.

In theory, translational research refers to the first construct (T1) and translational effectiveness to the second construct (T2), and this book has been divided accordingly (Fig. 1.6). The first half of this book is responsible for expounding on the fundamentals of translational research and its practical application in healthcare, such that the methods to be discussed aid in the translation of information from “bench to bedside.” This is essentially the process of going from the patient to the laboratory bench and back to the patient. Namely, new knowledge of disease mechanisms gained from the laboratory bench is transferred to the development of new methods for diagnosis, therapy, and prevention that directly benefit the patient.

On the other hand, the succeeding half is responsible for the introduction of the second construct of translational science, namely, translational effectiveness. This is referred to as “result translation,” in which the results that are gathered from clinical studies are translated or transferred to everyday clinical practices and healthy decision-making habits. Although we have bisected the two based on their distinct purposes, methods, and results, both enterprises coalesce toward the ultimate goal of new and improved means of individualized patient-centered care.

In brief, the most timely and critical role of biostatistics in contemporary healthcare research appears to be in the context of:

(a) Establishing and evaluating the best available evidence, in order to ensure evidence-based interventions
(b) Distinguishing between comparative effectiveness analysis, which is designed to compare quantified measures of quality of life and related variables among several interventions, and comparative effectiveness research, which aims at comparing several interventions in terms of relative differences in cost- and benefit-effectiveness and in reduced risk, in order to ensure effectiveness-focused interventions
(c) Characterizing novel biostatistical toolkits that permit assessment, analysis, and inferences on individual, rather than group, data to ensure the optimization of patient-centered interventions

1.2.3 Self-Study: Practice Problems

1. How does the process of using the scientific method begin?
2. List and provide a brief description of the steps of the research process.
3. What are the legs that represent the stool that is the research process? What happens if one of the legs is compromised?
4. What is the difference between the research question and the study hypothesis?
5. True or False: The best type of research study is one that can conclude the absolute truth.
6. What are the legs that represent the stool that is statistical significance? What happens if one of the legs is compromised?
7. Which of the three most common types of errors are avoidable? Which are unavoidable?
8. You have just finished taking your first biostatistics exam and are unsure how well you performed. Later that week, you receive your results and see that you received an A—and exclaim: “I knew I was going to ace that!” Which of the biases was taken advantage of in this scenario?
9. True or False: All forms of error introduced during the research process negatively impact the study as a whole.
10. Translational healthcare comprises two enterprises. What are these two enterprises and what does each represent?

(See back of book for answers to Chapter Practice Problems)
2  Study Design

Contents
2.1 Core Concepts  13
2.2 Conceptual Introduction  13
2.3 Diagnostic Studies  15
2.3.1 Reliability and Validity  15
2.3.2 Specificity and Sensitivity  16
2.4 Prognostic Studies  16
2.4.1 Observational Design  17
2.4.2 Experimental Design  21
2.5 Self-Study: Practice Problems  24

2.1 Core Concepts

Nicole Balenton

The composition of the basic principles that act as the foundation of the research process is conceptualized as a three-legged stool. This chapter highlights the first of the three legs of the stool—namely, study design—which acts as the blueprint for researchers to collect, measure, and analyze the data of their health topic of interest. The study design hinges on the research topic of choice.

As the backbone of any successful scientific research, the study design is the researcher's strategy in choosing the various components of a study deemed necessary to integrate in a coherent manner in order to answer the research question. The design chosen affects both the results and the manner in which one analyzes the findings. Obtaining valid and reliable results ensures that the researchers are able to effectively address the health research problem and apply the findings to those most in need.

The success of any scientific research endeavor is established by the structure of the study design, offering direction and systematization to the research that assists in ultimately understanding the health phenomenon. There are a variety of study design classifications; this chapter primarily focuses on the two main types: diagnostic studies and prognostic studies. We further explore their respective subcategories and their relation to scientific research in translational healthcare.

2.2 Conceptual Introduction

As one of the three fundamental pillars of the research process, the design of a research study is essentially the plan that is used and the system


employed by the researcher. The specific organization depends on the particulars of the object of the study. The “particulars” we mention are pieces of information that refer to things (i.e., variables, discussed further in Chap. 3) such as the population and the outcome(s) of interest that are being studied. We can conceptualize the study design as the infrastructure or the organization of the study that serves the ultimate goal of the research that is being done.

Let's say, for example, that you and your significant other decide to build your dream home together, and we will largely assume that you have both also agreed on the final plan of the whole house, i.e., where each room will be and their individual uses. But as any contractor will tell you, the foundation of the house is of utmost importance because it sets the precedent for the building as a whole. Sure, it is pertinent to have a plan of what each room should be, but it is through the creation of the foundation that the proper piping, plumbing, and electrical groundwork are set for the ultimate design of each room. This is exactly the purpose of, and relationship between, a study and its design.

As researchers, we must be fully cognizant of the intricate details of our study in order to create a study design that can facilitate the goal of our research. Luckily, there is no creation necessary on our part, as there are a multitude of study designs we can select for the specific components of our study. Moreover, correctly selecting the infrastructure at the outset is an early step in preventing faulty outcomes during the research process (i.e., preventing systematic errors).

Now, if your dream home consists of a two-story building, would you build a foundation, create a blueprint, and buy the accessories necessary to build a one-story house? Or if the zoning regulations of the property prohibited the building of a multistory house, could you still move forward with the successful materialization of the dream? Of course not; these would be disastrous! Similarly, we would not dare to, on purpose, select a study design that is specific to one method of research when our own research is concerned with another.

This chapter discusses distinct and comprehensive study designs relative to scientific research. More specifically, the designs and their subcategories are tailored toward the infrastructure that is necessary for research in translational healthcare. A summarizing schematic of the disparate study designs is shown in Fig. 2.1.

[Fig. 2.1 Study design schematic: Design branches into Diagnostic Studies and Prognostic Studies; Prognostic Studies branch into Observational Studies (Cohort, Case-Control, Cross-Sectional), Experimental Studies (Clinical Trials), Naturalistic Studies, and Research Synthesis]

Fig. 2.1 The various study types including observational studies, experimental studies, naturalistic studies (Naturalistic study, often referred to as qualitative, participant observation, or field research design, is a type of study design that seeks to investigate personal experiences within the context of social environments and phenomena. Here, the researcher observes and records some behavior or phenomenon (usually longitudinally) in a natural setting (see Chiappelli 2014)), and research syntheses (Research synthesis—a type of study design that utilizes a PICOTS research question (see Chap. 1, Sect. 1.2.1.1), in which relevant research literature (the sample) is gathered, analyzed, and synthesized into a generalization regarding the evidence—this is the fundamental design behind evidence-based healthcare (see Chiappelli 2014))

2.3 Diagnostic Studies

At the doctor's office, we often dread the words that might follow “your diagnosis is ….” This is usually because the doctor is about to tell you the reasoning behind the experiences (i.e., symptoms) that initially led you to her office. Here, the doctor is essentially providing the identification of the condition or disease after having analyzed the observed symptoms—simply stated, the doctor is diagnosing the medical issue based on the symptoms. By association, we trust that the years of rigorous schooling and hopeful experiences have adequately equipped the doctor with the ability to accurately diagnose the problem based on observations.

As researchers, we are not as concerned with the actual evaluation of the physician's decision as we are with the tools the physician uses in providing the diagnosis. It may be argued that it is the physician's competency that is of primary importance in diagnosis, such that it is the cognitive biases and knowledge deficits of the physician that may lead to diagnostic errors¹. But we will leave these more complex issues to the scrutiny and pleasure of the licensing boards that oversee their constituent physicians.

¹ See Norman et al. (2017).

What we are presently concerned with are the man-made machines physicians use in diagnoses. We note that the utilization of a machine governed by logical mechanisms that produces (hopefully) quantifiable results is not excluded from the scrutiny of the scientific method. Moreover, this interest is compounded when the results of this mechanism influence decisions that are consequential to human health—something that is definitely of concern to any researcher in the health sciences. Therefore, we refer to this machine as a diagnostic tool (or a diagnostic test): an instrument designed to provide a diagnosis about a particular condition (or provide information that leads to a diagnosis).

Thus, we say that a diagnostic study refers to the protocol that is used to compare and contrast a new diagnostic test with an established standard test, where both aim at serving the same purpose (i.e., both diagnose the same condition). We often refer to the established standard test as the “gold standard” or the criterion test. This is essentially the diagnostic test that is currently being used in the healthcare field and the results of which are widely accepted. But, just as with the physician, can we simply trust that the results of the instrument are correct? No, and that is where the researchers come in.

We can imagine the great deal of hardship that comes from both the delivery and reception of an incorrect diagnosis. Luckily, with physicians, we have the luxury of consulting with other physicians if we are dissatisfied with our doctor's opinion. But if it is the actual diagnostic test that is faulty, then every physician that uses the instrument (assuming it is the gold standard) will provide the same diagnosis. Thus, as researchers, our chief concern is the diagnosis of the actual test, one step removed from that of the physician's opinion.

Additionally, interlaced in our healthcare field are the medical professionals (and even entrepreneurs) that are constantly developing novel diagnostic tests, claiming their effectiveness over a current test. But we must not be so skeptical of new diagnostic tests that we immediately disqualify them as entrepreneurial efforts with ulterior motives. In actuality, the incentivizing of novel diagnostic tests is argued to be more beneficial to public health as a whole than to the monetary benefit of the individual entrepreneur. Thus, it is up to us—the researchers—to do our part in promoting the overall well-being of the public.

2.3.1 Reliability and Validity

Novel diagnostic tests may very well be better at diagnosing a specific condition than the test that is currently being used. For example, diabetes mellitus, a group of metabolic diseases most notably characterized by high blood glucose, was once diagnosed by the actual tasting of one's urine. Indeed, that is how it got its name, diabetes mellitus, which roughly translates to sweet urine. Some regard this as one of the earliest diagnostic tests—but how sweet must the urine be in order for one to ascertain the presence or absence of the disease? Must it be super sweet like honey or just a little sweet like agave?

Funny, indeed, but these must have been serious questions asked or thought of by the earliest physicians. Luckily, there are no more urine tastings attended by physicians. Today, there are a multitude of diagnostic tests that, without a doubt, are better than a gulp of urine. When we say better, we are referring not just to the particular method of diagnosis but also to a systematic improvement in diagnosis. This betterment encompasses the concepts of the reliability and validity of the test.

A new diagnostic test is subject to the criteria of reliability and validity in order for the test to be rendered the new gold standard. Moreover, an unreliable or invalid test will provide little or, even worse, detrimental information in research and clinical decision-making. We must evaluate a novel diagnostic test for its accuracy, which is dependent on how exact the test can be in discriminating between those with the disease and those without. Hence, a diagnostic study design is employed to test the accuracy of a novel diagnostic test.

The accuracy of a diagnostic test is determined through the extent of how reliable and valid the measurements of the test are. The reliability of a diagnostic test refers to how replicable and consistent the results are over different periods of time, in which we are essentially asking: “Does the test produce the same results if the same patient were to return tomorrow? The week after? Next year? (assuming all other factors are held the same).” Lastly, the validity of a diagnostic test refers to whether the instrument measures precisely what it was meant to, which must also be the same condition that the current gold standard measures. The actual methods of determining reliability and validity are discussed in the next chapter.

2.3.2 Specificity and Sensitivity

When speaking of the accuracy of a diagnostic test, measures of reliability and validity are not the only concepts we utilize. As mentioned above, the accuracy of a diagnostic test aims to determine how precisely a test can discriminate between the patients who truly have the condition and the patients who are truly free of the condition. We can further interpret this definition as the ability of a new diagnostic test to accurately determine the presence and absence of a disease. This latter, and more simplified, definition gives rise to two concepts that a diagnostic test generates—namely, sensitivity and specificity.

The sensitivity of a new diagnostic test refers to how effective the new test is at identifying the presence of a condition. The identification of a condition in an individual that truly has the condition is referred to as a true positive. It is clear to see the difficulty in obtaining this true measure; due to this overt stringency, there may exist individuals that are truly positive for the condition, but the test has failed to accurately identify them. This subclass of individuals (those rendered as negative but in actuality have the disease) is referred to as false negatives.

On the other hand, the specificity of a new diagnostic test refers to how good the new test is at identifying the absence of a condition. The identification of a lack of condition in an individual that truly does not have the condition is referred to as a true negative. Subsequently, the leniency of this measure means the test may flag many subjects who, in actuality, truly do not have the disease—the test has essentially “missed” on them. This subclass of individuals (those rendered as positive but in actuality do not have the disease) is referred to as false positives. Table 2.1 shows all possible permutations along with the calculations of sensitivity and specificity. Moreover, we provide a brief description of predictive values and their calculations, but further elaboration is saved for a more epidemiological context².

² See Katz (2001).

Table 2.1 2 × 2 contingency table accompanied with measures of validity and predictive value formulas

                       Disease               No disease            Total
Positive test result   True positives (A)    False positives (B)   A + B
Negative test result   False negatives (C)   True negatives (D)    C + D
Total                  A + C                 B + D

Sensitivity (SE) = A / (A + C)          Predictive value positive (PVP) = A / (A + B)
Specificity (SP) = D / (B + D)          Predictive value negative (PVN) = D / (C + D)
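For readers who like to compute alongside the text, here is a minimal sketch in Python of the four formulas above. The function name and the illustrative counts are our own choices for demonstration; they are not data from the text.

```python
def diagnostic_measures(a, b, c, d):
    """Validity and predictive values from the cells of a 2 x 2 table.

    a: true positives    b: false positives
    c: false negatives   d: true negatives
    """
    return {
        "sensitivity": a / (a + c),  # SE: diseased patients correctly flagged
        "specificity": d / (b + d),  # SP: disease-free patients correctly cleared
        "pvp": a / (a + b),          # predictive value positive
        "pvn": d / (c + d),          # predictive value negative
    }

# Hypothetical counts, for illustration only
print(diagnostic_measures(a=90, b=10, c=5, d=95))
# sensitivity ~ 0.947, specificity ~ 0.905, pvp = 0.90, pvn = 0.95
```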

2.4 Prognostic Studies

Back at the doctor's office, after receiving the diagnosis, we are hopeful that the timeline of the given condition is short and, of course, that the condition is curable. This is essentially referred to as the prognosis—namely, the probable course and outcome of the identified condition. Though the results, unfortunately, may not always be short and/or curable, the knowledge of this prognosis can empower both the physician and patient to be proactive (i.e., the patient is under the supervision of a medical professional, the patient is more careful from now on, etc.). Although time is not exclusive to a prognosis, it is essential both in this medical aspect and in the research characteristic we are about to discuss.

A prognostic study is one which examines specific predictive variables or risk factors and then assesses their influence on the outcome of the disease. Subsequently, such a research study is designed with the intent of following the course of a given disease or condition of interest through a period of time. The most effective method for this type of study is a comparison of various factors among individuals with relatively similar characteristics, divisible by the presence or absence of disease. This is the typical treatment–control relationship, in which the control is used as a “standard” that allots this comparison. Moreover, we can thus say that a prognostic study is designed to monitor the management of subjects or patients in the treatment and control groups. But we must note that they cannot always be so simply divided. We elaborate on this and the two major classifications of prognostic studies, observational and experimental, below.

2.4.1 Observational Design

There are numerous qualifications that determine whether a study is said to have an observational design. One of the most important is that there are no manipulations or external influences from the researcher onto the subjects that are being studied. The manipulations or external influences that stem from the researcher can be seen as investigator-mediated exposures. Thus, an observationally designed study is employed such that the researchers merely observe the subjects in order to examine potential associations between risk factors and outcomes, but they do nothing to affect or regulate the participants.

What is also of critical importance to an observational design is time. Under the umbrella of observational design, there exist three different studies, each with disparate methods, purposes, and gained knowledge potentiality. The subclasses beneath this category each distinctly have a relationship with time, so it is not surprising to hear this design being referred to as longitudinal. This will be explained in further detail below.

2.4.1.1 Cohort Studies
Colloquially, a cohort is defined as a group consisting of individuals that share common attributes or characteristics in a set period of time. Subsequently, a cohort study is a study that chronologically observes individuals (initially disease-free) that have been naturally exposed to potential risk factors. This goal of observation is pertinent in determining whether or not the patients develop a specific disease or condition (or outcome).

We may quickly jump to the conclusion that if disease X was observed to have developed from risk factor Y, then the observed risk factors obviously caused the disease—seems logical, right? Unfortunately, we are unable to use any observational study design to procure causal relationships between variables, i.e., a cause–effect relationship. What is allotted is the establishment of an associative relationship, namely, that “There seems to be a weak/moderate/strong association between disease X and risk factor Y.” Surely the former causal relationship

established can be proved erroneous by the simple consideration of those who were exposed to risk factor Y but did not develop disease X. Consequently, we note that the exposures in observational designs may be necessary for disease acquisition, but not sufficient.

Thus, we say that study subjects are divided into cohorts, exposed and unexposed, and then observed throughout time to determine the outcome of their exposure (disease or lack of disease). This determination of the development of disease is referred to as incidence and is one of the features of a cohort study. Incidence refers to the number of individuals that develop a certain condition or disease relative to all individuals at risk of developing the disease, during a set period of time. Though mention of incidence rates slowly begins to carry over to the study of epidemiology, the formula for its calculation is provided below:

Incidence = (new cases of disease in a population during a period of time) / (population at risk of the disease during that period of time)

Assuming that the disease of interest is rare and that the subjects are representative of their overall populations, we are also able to approximate the relative risk, also read as risk ratio, as the ratio of the incidence among those exposed relative to the incidence among those not exposed (Fig. 2.2).

[Fig. 2.2 Risk calculations]
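A minimal sketch of both calculations in Python follows; the function names and the cohort counts are hypothetical illustrations, not data from the text.

```python
def incidence(new_cases, population_at_risk):
    """Incidence: new cases during a period / population at risk in that period."""
    return new_cases / population_at_risk

def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk ratio: incidence among the exposed / incidence among the unexposed."""
    return (incidence(exposed_cases, exposed_total)
            / incidence(unexposed_cases, unexposed_total))

# Hypothetical cohort: 150 exposed diners, 30 fall ill; 350 unexposed, 14 fall ill
print(incidence(30, 150))               # 0.20
print(relative_risk(30, 150, 14, 350))  # 0.20 / 0.04 = 5.0
```

A relative risk of 5.0 would be read as: the exposed group developed the disease at five times the rate of the unexposed group.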
In any discussion of cohort studies, particular emphasis must be placed on time. Cohort studies may be subdivided by time into three main categories: prospective, retrospective, and nested (or mixed). A prospective study is a study that essentially begins today, and the study subjects (i.e., cohorts) are observed into the future. A retrospective study is one that begins at a certain period of time in the past and observes study subjects into the present. A nested study is a combination or mixture of the temporal attributes of retrospective and prospective designs—namely, a study begins at some point in the past and follows subjects into the present and further on into the future. Below we provide an example of a nested cohort study which, by association, will describe the time components of the former two studies as well. Figure 2.3 also provides a pictorial elaboration.

[Fig. 2.3 Pictorial elaboration of the three fundamental types of cohort studies along a past–present–future timeline: retrospective (exposure and disease both located in the past), prospective (exposure assessed in the present, disease observed in the future), and nested (exposure located in the past, disease followed into the future)]

For example, you have just realized that a number of people in your extended family have recently been admitted to a hospital for E. coli food poisoning. After much thought, you realize that this must have something to do with the recent Thanksgiving potluck—you suspect your Aunt's tuna fish casserole. Hence, you employ a nested cohort design, in which (through extensive investigation) you divide the family members in attendance into those who ate the tuna fish casserole (exposed) and those who did not or primarily ate other dishes (unexposed). Then, you observe the groups starting from Thanksgiving until the present moment (retrospective component), noting signs and symptoms, while also keeping in close contact with your family members for the next month (prospective component) to see if they develop signs and symptoms of food poisoning (Fig. 2.4).

[Fig. 2.4 Cohort study design tree for the tuna fish casserole example: the population at risk divides into those who ate the tuna fish casserole (exposed group) and those who did not (unexposed group); each group then divides into those who developed symptoms (disease) and those with no symptoms (no disease)]

In conclusion, it is simple to see the utility of cohort studies in investigative contexts. Indeed, there are both strengths and limitations inherent to this type of study. The strengths include the establishment of incidence rates, the possibility of studying multiple outcomes from a single exposure, and even the ability to investigate rare exposures. On the other hand, the limitations are equally weighty, namely, that cohort studies are expensive, time-consuming, prone to biases, and subject to participants lost to follow-up. Of course, if time and money are not of grave concern (i.e., large funding), then the strengths drastically outweigh the weaknesses, supporting others' claim that a cohort study is the most powerful of observational study designs.

2.4.1.2 Case-Control Studies
Also under observational study designs falls the case-control study, which is a study whose research focuses on specific diseases exclusive to the past. Just as we emphasized the importance of time in the previous section, the retrospective time component is particular to a case-control study. Moreover, this type of study is concerned with determining the potential occurrence of events that lead to the manifestation of a certain disease in the patients that are being studied (i.e., observed).

This method compares two groups of individuals: those with the presence of the disease of interest and those with the absence of the disease. We refer to the former group as the “cases” (i.e., presence of disease) and the latter group as the “controls” (i.e., absence of disease). Although we will expound on the importance of control groups later on in experimental design (Sect. 2.4.2), the control group is what largely facilitates the comparison of the two groups; it may ultimately assist in determining what happened differently in the case group, which may shed light on the progression of disease.

Subsequently, a case-control study begins with the identification of the disease of interest. Then, two related groups are divided by disease state, where one group suffers from the disease and the other does not. Next is the introduction of the retrospective time component—namely, both groups are essentially “followed” back in time through some method of investigation (i.e., questionnaire, survey, etc.) to determine their exposure to particular risk factors of interest (Table 2.2). Surely, we can notice that it is not the actual participants that are being “followed” back in time; rather, it is the data being collected that is from the past.

Table 2.2 At the beginning of the study, exposure status is unknown; thus we classify subjects into cases or controls

                        Outcome
                        Cases (disease)   Controls (no disease)
Exposure   Exposed      (A)               (B)
           Unexposed    (C)               (D)
Total                   A + C             B + D

We may ponder on the utility of this specific design. Case-control studies are of most value when studying rare diseases. Additionally, a case-control study provides an estimate of the strength of an association between particular exposures and the presence or absence of the disease. We commonly refer to these exposures as predictors, such that the prediction of the existence of an association with the disease can provide researchers with an odds ratio (OR). An odds ratio essentially measures the odds of exposure for the cases compared to the odds of exposure for the controls. We can organize the odds of exposure for both groups in a simple table (Table 2.2) to aid the calculation of the odds ratio in the formula provided below:

Odds ratio = (odds of exposure for cases) / (odds of exposure for controls) = (A/C) / (B/D) = AD / BC
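As a quick computational sketch (again with a hypothetical table of counts, not data from the text), the odds ratio reduces to a single cross-product calculation:

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a case-control table: (A/C) / (B/D) = AD / BC.

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    """
    return (a * d) / (b * c)

# Hypothetical counts, for illustration only
print(odds_ratio(a=40, b=20, c=60, d=80))  # (40*80)/(20*60) = 2.67 (approx.)
```

An odds ratio above 1 suggests the exposure was more common among cases than controls; an odds ratio of 1 suggests no association.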

Other strengths of this type of study include that it is relatively inexpensive, there is no “waiting period” for disease exposure, and multiple exposures can be taken under consideration. But along with these strengths comes a serious limitation, in that this study design is quite susceptible to bias, more so than other study designs. Of the multiple biases, we briefly consider recall bias, for example. Recall bias considers the flaws of human memory, in which subjects asked to recall certain instances may provide erroneous responses that lead to erroneous results of the study.

2.4.1.3 Cross-Sectional Studies
Lastly, a cross-sectional study is an observational design whose research focuses on specific diseases as they relate to the present. Indeed, it is a study done at a specific and singular cross-section of time—now. Certainly, the importance of the time aspect cannot be stressed enough. It relates to both the convenience and the advantage that are overtly subjective to a cross-sectional study.

Say you are on your way to your favorite biostatistics class and decide to randomly walk into another class, interrupt the lecture, and ask, “Show of hands, how many of you rode a bicycle to school today?” You count the hands, politely thank the aggravated professor, and outrun campus security to safety in your next class. Well done Bueller, you have just successfully employed a cross-sectional study on transportation methods to school! But what can we do with this information?

A cross-sectional study provides information on the prevalence of a condition. Prevalence refers to the number of individuals that currently have a specific condition or disease of interest. Returning to our example, perhaps you record that only 3 of the 30 students in the classroom raised their hand when you asked the question. Thus, you can report that the prevalence of bicycling to school as an alternative method of transportation is 10% in the class you surveyed. Hence, we see that prevalence is calculated as the ratio of the number of people who have a given condition or characteristic (i.e., bicycling to school) at a given time over all of the people that were studied (the entire classroom) (Fig. 2.5).

[Fig. 2.5 Prevalence formula: Prevalence = (number of individuals with the condition at a given time) / (total number of individuals studied at that time)]

Now, we do not support the irritation of classrooms, nor do we intend to mock the utilization of cross-sectional studies with the oversimplification of the above scenario. In fact, the basic nature behind its presentation aims at exalting its usefulness! Two of the great advantages primarily exclusive to a cross-sectional study are that it is usually swift and inexpensive—two factors crucial to any scientist. The value of the information gained relative to the investment made is tremendous.
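Before moving on, here is the prevalence computation as a short Python sketch, using the classroom numbers from the example above (the function name is ours, chosen for illustration):

```python
def prevalence(with_condition, total_studied):
    """Point prevalence: people with the condition now / all people studied."""
    return with_condition / total_studied

# The classroom survey: 3 of 30 students rode a bicycle to school today
print(prevalence(3, 30))  # 0.10, i.e., a prevalence of 10%
```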

2.4.2 Experimental Design

The most apparent difference in design between an observational study and an experimental study is a concept that we stressed earlier in the previous sections—investigator-mediated manipulations. True to its name, an experimentally designed study is one in which the researcher experiments with the effects of different external stimuli or manipulations on the group of subjects under study. In the health sciences, the manipulation most often represents a treatment that is being tested among individuals that suffer from the relevant disease.

Experimental designs are most notably characterized by the presence of (at least) two groups, an experimental group and a control group. An experimental group, often referred to as the treatment group, is the group that receives the experimental drug, treatment, or intervention. On the other hand, the control group, often referred to as the placebo group, is the group that does not receive the treatment. But why study a group that does not receive the treatment?

Equally important as the experimental group, the control group warrants the comparison between the groups, in which any differences observed in the experimental group that are not observed in the control group may be considered attributable to the introduction of the manipulation in question. Note, we further stress the importance of the possible consideration of the effect from the treatment. Any differences observed between the two groups (not limited to just the experimental group) require further scrutiny and statistical analysis in order to be rendered as consequential to the treatment.

Another important quality of the control group is that it consists of a group of individuals that have, at best, similar characteristics and qualities, in terms of demographics and disease state, as the experimental group. This not only facilitates the comparison discussed above but also fosters the protection of our study from systematic and random errors. The concept that underlies the compatibility and congruous nature of the experimental and control groups is randomization.

Randomization (i.e., random assignment, random allocation) truly lies at the heart of an experimental study, alongside being the defining characteristic of the study. Randomization refers to the process by which participants in a research study are randomly assigned to either the experimental or the control group. The importance of this process is threefold. Firstly, allocating participants at random ensures that each individual has an equal chance of being in either group. Secondly, randomization produces a high probability that the two groups are essentially similar and hence bridges the possibility of comparison. Lastly, randomly assigning participants necessitates that the choice of treatment is independent of the subjects.

Although it may not be apparent now, randomization of participants is yet another crucial step toward minimizing the error that is introduced to the study. Furthermore, there exist methods relative to experimental designs that aid in the reduction of the bias or error that is particular to randomization. These include simple, cluster, and wedged randomization. Additionally, there are “block” methods, also referred to as the blocking principle, that organize groups of subjects in “blocks” or strata based on specific commonalities.³ A simple randomization scheme is sketched below.

³ An example of a design that utilizes the blocking principle is a Latin square design, the purpose of which is, along with that of all other block methods, to reduce the variation among individuals within the groups in hope of further reducing random error (see Hinkelmann and Kempthorne 2008).

Experimental studies may be regarded as one of the fundamental research archetypes, in which the purpose is to obtain an impartial insight into the effect of a specific treatment. Nevertheless, the importance of randomization cannot be stressed enough.

Should the randomization of subjects into both groups be unsuccessful, then we can no longer call our design experimental per se; rather, we call it a quasi-experimental design. In truth, should any of the conditions necessary for an experimental study design not be met, then the study is also rendered a quasi-experiment, but it is randomization that is most often unsuccessful.⁴

⁴ Clinical equipoise—as first coined by Benjamin Freedman—is an ethical principle relating to the researcher and their honest anticipation of the experimental treatment having some benefit to the patient, at least equal to no treatment. This essentially returns to the fundamental healthcare maxim of primum non nocere, Latin for “First, do no harm.” In context, randomization may not always be ethical (and hence permitted) on, say, terminally ill cancer patients that are recruited for experimentation of a novel treatment intervention.

The final principle pertinent to the successful design of experiments is the ability to replicate. We may perceive replication to refer solely to statistical replication, as in the ability to replicate similar statistical findings. But by association, it also refers to the ability to replicate both the measurements of the study (i.e., data and data-collecting instruments) and the entire study as a whole. Replication is yet another important factor that aids the reduction of both random and systematic errors, along with increasing crucial innate components of a study such as reliability and validity (see Chap. 3, Sect. 3.4.1).

When speaking of experimental designs, there is but one name that immediately comes to mind, Sir Ronald Aylmer Fisher (Fig. 2.6). Fisher is regarded as one of the founding fathers of statistics and is often known as the father of biostatistics. Of his numerous contributions to the field of biostatistics, the first we recognize him for is here under experimental designs. In his pioneering books⁵, Fisher outlines the four principles that are necessary to experimental designs, namely, comparisons, randomization, blocking, and replicability. Although we briefly introduced each topic above prior to his introduction, this will not be the last we hear from Sir Fisher.

[Fig. 2.6 Biologist and statistician Ronald Fisher (Beveridge 1957)]

⁵ The Arrangement of Field Experiments, 1926, and The Design of Experiments, 1935.

2.4.2.1 Clinical Trials
We often hear experimental studies referred to with the acronym for randomized controlled trials (RCT—experimental study) as a prefix, or even as simply “randomized trials.” Unfortunately, these are inaccurate synonyms that are inappropriately used to generalize this specific type of study design. In actuality, the synonyms above refer to a specific experimental design—clinical trials—whose distinction we set out to establish in what follows.

A clinical trial⁶ is a planned experiment that is aimed at evaluating the effectiveness of different treatments, interventions, and/or protocols on

human beings, referred to as the subjects or participants. It is a truism that clinical trials are experimental studies, but it is not the case that all experimental studies are clinical trials. Of the numerous and distinct “flavors” of clinical trials, below we present the four that are not only most common but also most relevant to translational healthcare.

⁶ The earliest known account of clinical trials can be found in Chapter 1 of the Book of Daniel in Ketuvim (“Writings”) of the Bible. In 605 BCE, the kingdom of Babylon fell into the hands of the fierce military leader Nebuchadnezzar. King Nebuchadnezzar enforced a strict diet of only meat and wine in his kingdom. The Israelites that inhabited his palace felt doomed, as they were not permitted to consume foods that were not subject to their divine dietary law of Kashrut (Kosher). Among those living in his palace, an Israelite named Daniel, in fear of retribution, suggested a “trial” where he and his Israelite friends would consume a diet of only vegetables for 10 days. Lo and behold, after 10 days, Daniel and his friends presented as much healthier to the King than did their meat-eating counterparts. Shortly after, the King's dietary commandment was no longer obligatory.

• Controlled Trials—a form of clinical trial where a specific novel treatment is compared to a control treatment or a placebo. (This form of a clinical trial is also referred to as a comparative trial.)
• Randomized Trials—a form of clinical trial where subjects that have initially been recruited to participate are randomized to treatment options (i.e., randomly allocated/assigned to either a treatment group or a control group).
• Run-In Trials—a form of clinical trial where all recruited subjects are initially (as they run in) placed on a placebo. Only after are subjects randomly assigned to either a treatment or a control group. The advantage of this specific method of study is chiefly its statistical utility (in terms of power and external validity, discussed in Chap. 3), but it is also advantageous in increasing the chances of subjects' completion of the study (Fig. 2.7).
• Crossover Trials—a form of clinical trial where participants, categorized in either the treatment or control group, each cross over or switch group classifications at some preestablished point in time (Fig. 2.8), meaning that those initially taking the said treatment will now be taking the placebo, whereas those initially taking the placebo will now be placed under the said treatment. Prior to the utility of computer applications, this method was widely utilized, but it is less common in practice today.⁷

[Fig. 2.7 Illustration of run-in trials]
[Fig. 2.8 Illustration of crossover trials]

⁷ See Chiappelli (2014).

It is here that we must note that clinical trials (or even all experimental studies for that matter) are not limited to simply a single treatment group. In fact, there can be multiple treatment groups that are studied under the same context for their effectiveness relative to a certain condition. Furthermore, there are measures that can be expended alongside clinical trials that further aid the possibility of observing a treatment effect or lack thereof. These include, for example, single-blinded and double-blinded clinical trials, where the former method blinds subjects to which group they are in, whereas the latter blinds both the participants and the researchers in contact with the participants as to the group classification.⁸

⁸ See Chiappelli (2014) and Hinkelmann and Kempthorne (2008).

Over the years, clinical trials have become the gold standard for establishing evidence of causal associations in medical research. Clinically speaking, there is an array of disparate treatments, interventions, and/or protocols that can be tested for effectiveness; these include novel drug therapies, medical devices, behavioral or cognitive therapy, and even diet and exercise. But before any novel treatment is made available, its associated clinical trial must go through a variety of phases (as set by the National Institutes of Health [NIH]⁹) in order for it to be deemed safe and effective for public use.

⁹ See NIH (2017).

[Fig. 2.9 Study design tree. 1. Does the researcher intervene? Yes → 2. Experimental; No → 3. Observational. Experimental: randomized? Yes → 2A. experimental; No → 2B. quasi-experimental. Experimental with a control? Yes → RCT-experimental; No → quasi-experimental (2B). Treatment on humans? Yes → clinical studies; No → animal studies. Observational: exposure followed to disease → 3A. cohort (time: past → retrospective; present → prospective); disease traced back to exposure → 3B. case-control; exposure and disease assessed together → 3C. cross-sectional]

Due to the potential of their impact, clinical trials are held most stringently to the rules and criteria of experimental studies. Moreover, because of their experimentation on actual human beings, clinical trials must also abide by the rigors of ethical and moral principles that are overseen by disparate government agencies (see Sect. 2.4.2 and Footnote 4 on clinical equipoise).

The ultimate goal of clinical trials is the betterment of public health. Whether that is in terms of acquiring new medical knowledge or discovering the best medical instruments, the end result ultimately returns to the patient. Indeed, clinical trials are central to translational healthcare, particularly in the T2 block—translational effectiveness—such that result translation is the transmission of knowledge gained in clinical studies (i.e., the studies of clinical trials) to the establishment of clinical guidelines for the entire healthcare community and its constituents (Fig. 2.9), as it should be. And why not? Don't we all want to receive the best of the best when it comes to our health, no less the health of our parents and children?

2.5 Self-Study: Practice Problems

1. For each of the studies below, identify whether it is an observational study or an experimental study:
(a) Scientists wish to determine if trace amounts of lead in their city's water affect the cognitive development of young children.

(b) A researcher is interested in determining whether there is a relationship between years of education and annual income.
(c) A study on healthy eating habits measures the type of food participants purchase at the grocery store.
(d) A neuroscientist electrically stimulates different parts of a human brain in order to determine the function of those specific regions.
(e) In order to determine the effectiveness of an antidepressant, a psychiatrist randomly assigns geriatric patients to two groups—one group takes the new drug, while the other takes sugar pills (i.e., placebo).
(f) The administration of a medical school preparation course creates three different courses for students preparing for the Medical College Admission Test (MCAT)—a 3-month intensive course, a 4.5-month medium course, and a 6-month easy course. After all courses are complete, the administrators compare exam scores to determine which course was most effective.
2. True or false: Sensitivity establishes how good a measuring device is at detecting the absence of a specific disease.
3. A local dental office receives a promotional caries detection kit. The kit contains a paste that you apply to the tooth and whose color turns red if there is active cavity-generating plaque. You compare this supposed caries detection kit with traditional X-rays (i.e., the gold standard). The use of the kit provides you with the following data in 100 of the patients (80 of whom have cavities by X-rays):

                       Cavities   No cavities
Positive for caries        70           5
Negative for caries        10          15

(a) Calculate the sensitivity and specificity.
(b) Calculate the prevalence of patients with caries.
4. In an outbreak of Campylobacter jejuni at a college cafeteria, the primary suspect is the weekly mysterious meat dish. The campus health office reports that out of the 500 students that ate at the cafeteria that day, 150 students ate the mysterious meat dish, in which 47 of those who ate the meat dish developed gastroenteritis.
(a) Calculate the incidence of developing gastroenteritis from the consumption of the mysterious meat dish.
(b) Does this measure consider individuals who may have had gastroenteritis before the outbreak? Explain.
(c) What type of observational study was done that determined the primary suspect and provided the incidence rate?
5. Scientists studying the effects of breastfeeding on infections in babies closely watched a sample of mothers during the first 3 years of their newborn's life. The researchers witnessed that newborns that were breastfed for a minimum of 3.5 months had significantly fewer infectious diseases than those who were not breastfed at all.
(a) What type of study design is being taken advantage of here?
(b) Is this a prospective, retrospective, or nested study?
(c) Can it be accurately concluded that breastfeeding causes fewer infectious diseases in newborn babies? Explain.
6. An investigator is interested in conducting a case-control study of childhood leukemia and exposure to environmental toxins in utero. How should the investigator choose cases and controls? How should the investigator define exposure and outcome?
7. Determine whether each of the following statements is true or false:
(a) A cross-sectional study yields information on prevalence.
(b) A case-control study produces data that can compute odds ratios.
(c) A cohort study establishes what happens to a group of patients with respect to time.
8. A sample of women ranging from 25 to 35 years old was recruited for a study on the effects of alcohol consumption on hormone levels. All of the participants were given a 90-day regimen to consume either a certain amount of alcohol or a placebo drink based

on the specific day. The daily drink allocation was random for each participant. The outcome was measured by the difference in hormone levels on the days of alcohol consumption compared to the days of placebo.
(a) Was this a run-in or crossover trial? Explain.
(b) What is significant about the random allocation of drinks?
(c) Could the participants have been blinded to their specific treatment? Explain.
9. The geriatric department at the local community hospital was interested in studying the effects of aspirin in the prevention of cardiovascular disease in the elderly. Approximately 1266 geriatric patients were randomly assigned to either a treatment group or a control group. The treatment group took 500 mg of aspirin daily, while the control group was given an identical-looking sugar pill. Participants were monitored every 3 months for 5 years. The reports that were collected every 3 months were assessed by an independent, third-party medical group.
(a) What role did the sugar pill play in the study?
(b) Was this a single-blind, double-blind, or triple-blind study? Justify your answer.
(c) What type of study design was utilized here? Be as specific as possible.
10. What qualifications must a measuring tool meet in order to be considered the gold standard? Also explain how a measuring tool can potentially lose its gold standard “seal” (i.e., the tool is no longer considered the gold standard).

(See back of book for answers to Chapter Practice Problems)
3  Methodology

Contents
3.1 Core Concepts  27
3.2 Conceptual Introduction  27
3.3 Sample vs. Population  28
3.3.1 Sampling Methods   30
3.4 Measurement  33
3.4.1 Instrument Validity   34
3.4.2 Instrument Reliability   35
3.5 Data Acquisition  36
3.5.1 On Data: Quantitative vs. Qualitative   38
3.5.2 Variables   39
3.6 Self-Study: Practice Problems  40
Recommended Reading  41

3.1 Core Concepts

Nicole Balenton

In the next leg of our “three-legged” stool, we learn that the research methodology is much more than just a simple set of methods—it is the science of measurement and the process of obtaining and allocating the sample. Methodology also critically evaluates the overall validity and reliability of scientific research. We ask questions like how the researchers obtained their information and why they used a particular technique, tool, or protocol. In any particular piece of scientific research, the methodology describes the actions taken and the rationale behind certain techniques in researching the health phenomenon.

This chapter begins by establishing the relationship between a sample and a population. A sample must be of reasonable size and an accurate representation of the entire population to control against vulnerability to both random and systematic errors. The discussion of various sampling techniques becomes of particular interest when recruiting study participants.

Once we have established our sampling technique of choice, we then focus our attention on the subject of measurements and how researchers collect the necessary information. Researchers collect the observations from the selected sample via researcher-completed or participant-completed instruments, both of which must be valid and reliable. We examine the role of statistics and the importance of quantitative and qualitative data, their respective variables, and the distinct scales of measurement required for research in the health sciences.

Electronic supplementary material: The online version of this chapter (https://doi.org/10.1007/978-3-662-57437-9_3) contains supplementary material, which is available to authorized users.

3.2 Conceptual Introduction

“The path of least resistance” is a common saying in the physical sciences used to describe the relative motion or movement of material objects. This heuristic is influenced by distinct, yet related, principles of the physical sciences such as Newton's Laws of Motion and Thermodynamics. But in life, we often witness that it is not the path of least resistance that yields the most rewarding ends. Indeed, it is usually overcoming the arduous path that bears the greatest returns.

We all may have specified paths, but so too does the first section of this book on Translational Research in Translational Healthcare. We can argue whether our approach is the path of least resistance; certainly we may hope it is not, so as to maximize the reward in its culmination. Regardless, we shall not make the mistake of losing sight of the goal of our path, namely, a practical and comprehensive understanding of the research process. As the second leg of our stool (Fig. 3.1), the appreciation of research methodology is our next quest as we continue on our path.

At first glance, we may perceive research methodology to be synonymous with research methods—but this is not entirely true. The methods we utilize in research may refer to the specific tools, techniques, and/or procedures that are undertaken. On the other hand, the methodology refers to the comprehensive study of (-logia) the basic principles that guide our processes in research. The research methodology fundamentally asks: How?—that is, how is the research done? How did the researchers obtain their information? On the same note, it also further begs the question of Why?—Why did the researchers use this technique, this tool, or this protocol over the others? Therefore, the principal domains of research methodology refer to the science of measurement and the process of obtaining and allocating the sample. These two domains ensure the qualification of numerous criteria that are pertinent to the research process, but most importantly they ensure that the study has gathered the appropriate information necessary for the third leg of our stool, namely, data analysis (Fig. 3.1).

[Fig. 3.1 The research process as a three-legged stool resting on study design, methodology, and data analysis. Methodology is the science of measurement and the process of obtaining and allocating the sample]

3.3 Sample vs. Population

As many struggling students and professional shoppers will tell you, the best time to go grocery shopping is on a Sunday. Why, you might ask? Well, because of all of the free samples of food, of course! The psychology behind grocery stores and supermarkets providing free samples of their products to their guests is both simple and complex. Showcasing featured products and providing an, often frustratingly, small sample ultimately translates to the purchasing of more goods. But most important, and most apparent, is that a free sample provides the shopper with an understanding of the product as a whole before they commit to purchasing.

Let us say, for example, that you and your mother are shopping at your favorite grocery store. While you were preoccupied in the school supplies aisle, your mother was in the frozen food section and managed to grab an extra sample of their Sunday-featured pizza for you (Fig. 3.2). You scarf down the frustratingly small, yet delicious, sample, and then your mother inquires whether you like it and whether you should buy this pizza for the house, to which you respond positively.

[Fig. 3.2 Pizza sample]

Fine, this seems like a normal enough occurrence when buying a product. But, you might be wondering, what in the world do free pizza samples have to do with statistics?
logue exemplifies a fundamental concept of sta-
tistics, namely, inferential statistics (Chaps. 5 and
6). We shall return why this is so in a moment,
but first let us see what was done here. With just
a small sample of pizza given to you by your
mother, you are able to ascertain an understand-
ing of the whole pie without even having actually
seen or tasted the entire pizza.
In statistics, a population refers to any com-
plete collection of observations or potential
observations. On the other hand, a sample refers
to any smaller collection of actual observations
drawn from a population. Now the correlation
between our pizza example and statistics is clear:
the frustratingly small piece of pizza given to you
by your mother represents the sample, whereas
the whole pie of pizza represents the population. Fig. 3.3  A sample that enables the researcher to make
It is also important to point out that a popula- inferences or generalizations about the population
30 3 Methodology

Fig. 3.4  A sample must be representative of an entire population

The population–sample interaction, as mentioned above, is a vital component and resource relative to statistics in research and, more specifically, translational healthcare. Certainly, then, we can say that in biostatistics our goal is to make some general statements about a wider set of subjects, and thus we use information from a sample of individuals to make some inference about the wider population of like individuals (Fig. 3.4).

If our aim is to create a health education seminar addressing the knowledge, attitudes, and beliefs about the eating habits of Hispanic youth in the United States, would you be able to obtain this information (i.e., measure) from all Hispanic youth that live in the United States? Or if we wanted to gauge the average grade point average (GPA) of college students in your state, would you be able to reach (i.e., measure) all college students that live in your state? Of course not! That type of information is neither feasible nor practical in terms of time, money, and resources.

Though this concept may seem novel, it is actually a common practice that is not limited to statistics. For example, when a patient visits their primary care physician for their annual health assessment, how is the physician able to diagnose the patient’s, say, high cholesterol levels? Common practice for high cholesterol diagnoses usually consists of a lipoprotein panel, where the patient’s blood is measured for abnormally high levels of low-density lipoproteins (LDL) (i.e., “bad” cholesterol). But does the physician measure all of the patient’s blood? Absolutely not! That would require draining the patient of all of their blood, which would lead to immediate death! Rather, the physician takes an appropriate sample of the patient’s blood and, thus, is allotted to make a general statement about the whole of the patient’s blood (i.e., population).

Lastly, it is important to realize that sample and population classifications depend on the researcher’s perspective. For example, the students in your biostatistics class can be a sample of the population of all students that attend your college. At the same time, the students in your college may be a sample of all college students in your state. Even more, the college students in your state may represent a sample of all college students in your country! It is simple to see how this can continue ad infinitum, but a critical takeaway and inherent quality of this concept is that a sample is rendered a good sample when it is representative of the whole population. More on that in what follows.

3.3.1 Sampling Methods

There may be some curiosity as to the logic of the different adjectives we have used to describe samples. When we say a good sample or an appropriate sample, we are essentially referring to the representativeness of a sample. Two principal properties required of any sample are that it be of reasonable size and that it be an accurate representation of the parent population. It is true that any sample from a population can be of theoretical interest, but a small, nonrepresentative sample is unable to truly reveal much about the population from which it was taken.1 Let us return to our favorite pizza example for elaboration.

1 Unless many assumptions are made, in which case the study becomes much more vulnerable to error and bias (see Chap. 5 for more).

On your way home from a long day of grocery shopping with your mother, all you can think about is the delicious pizza sample you had. The moment you arrive at home, you run
to the kitchen with the groceries and begin to prepare the pizza. As you rip open the box, you realize that this is not the same pizza you had at the store—that was a thin-crust pizza with marinara sauce, mozzarella cheese, and a mushroom topping. The pizza in front of you, on the other hand, has three different cheeses and a variety of other vegetables. Have you just been deceived by your own mother? Surely not. What has happened, though, is that the pizza sample you had in the store was not representative of the entire pizza. Undoubtedly, if your mother had been able to grab many other samples of the pizza—say, from the center—you would have realized that it was not simply a mushroom pizza with mozzarella but a vegetable pizza with different types of cheese (Fig. 3.5).

Fig. 3.5  For true inferences to be made, a sample must be well chosen and relatively representative of its parent population

Now we can see how an inadequate sample (small and not representative) can lead to an erroneous generalization of the whole. We may simply think that the size of the sample is directly proportional to its representation of the population (i.e., the larger the sample, the more representative of the population). Although this may be a useful heuristic, it is not an absolute—a large sample does not necessarily make for a representative sample.

For example, in 1936, The Literary Digest conducted an opinion poll to predict the results of the upcoming US presidential election, between first-term President Franklin D. Roosevelt and Governor Alfred Landon of Kansas. The trusted and influential magazine sampled approximately 2.4 million residents via landline telephone and predicted a landslide win for Governor Landon. Unfortunately, this was not the case, and FDR went on to become not only a second-term president but a record-setting four-term president!

But what did the magazine company do wrong? Were 2.4 million observations insufficient to qualify as an adequate sample? Of course not; the issue was not the size of the sample but rather the method of collecting the information—namely, surveyed participants were telephone subscribers. The problem was that the voting preferences accumulated by the magazine were not representative of the voting preferences of all US citizens (i.e., not all voters in the United States are telephone subscribers of the magazine), which ultimately led to an erroneous generalization. The question that remains is how, then, are we able to ensure that a representative sample is selected? There are a variety of methods used in sample selection that, to a certain degree, can ensure a sample that is relatively representative of its parent population. We can further categorize these methods as probability sampling.2

Simple random sampling is a method of selecting a sample from a population without a predetermined selection process or study interest in mind. Inherent to this sampling technique is the concept of randomness, as in the application of selecting individuals without pattern, predictability, or predetermined order. This is not to be confused with randomization or random allocation, spoken of before,3 though it is impossible to say that they are not similar. Further, central to general randomness is the requirement that each individual has an equal chance of being selected, and with good purpose! This signifies that our measuring units are independent of each other—another important topic discussed further in Chap. 2.

2 For non-probability sampling methods, see Wagner and Esbensen (2015) and Corbin and Strauss (1998).
3 See Sect. 2.3.2, Experimental Design.
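To make this concrete, here is a minimal sketch of simple random sampling in Python. It is illustrative only; the population of 1000 enumerated individuals and the sample size of 50 are hypothetical choices, not values from the text.

```python
# A minimal sketch of simple random sampling (illustrative only: the
# population of 1000 enumerated individuals and the sample size of 50
# are hypothetical assumptions).
import random

random.seed(42)  # fixed seed so the illustration is reproducible

population = list(range(1, 1001))       # 1000 enumerated individuals
sample = random.sample(population, 50)  # each individual has an equal
                                        # chance of being selected

print(sorted(sample)[:10])  # first few selected identifiers
```

Note that `random.sample` selects without replacement, so no individual can be drawn twice—consistent with the independence of measuring units described above.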
Random sampling is perhaps the most advantageous sampling technique, as it allots the collection of a representative sample and thus enables the researcher to draw conclusions (inferences/generalizations) about the population from its daughter sample. We shall soon see how other effective sampling techniques, discussed hereafter, strive to include some randomness in the method of collection. Such techniques, then, may be categorized under random sampling as well. Lastly, a strategy for randomness can be achieved by the utilization of a random number table (Appendix B) or random number generator applications.

Systematic sampling is a method of sampling that follows an arbitrary system set by the researcher in selecting individuals at random from a population. This method is easiest and best accomplished when a list of potential participants is readily available. Consider a hospital manager who desires to evaluate the health of her own hospital staff to support an optimal working environment. Instead of wasting the immense time and resources to evaluate each staff member (we’ll assume 1000 people), she enumerates a list of her staff and arbitrarily decides that her lucky number 4 will choose fate. Thus, as she goes through the list, each staff member enumerated by the number 4 is selected to undergo a health evaluation (i.e., 4th, 14th, 24th, 34th, etc.).

Again, we emphasize the randomness of this method in order to secure a high degree of representativeness of the sample from the population. Notice that the hospital manager arbitrarily chose her systematic selection based on her lucky number, but that is not to say that all selection processes are the same. It is further emphasized that, regardless of the particular process used (again, arbitrary), it should be both systematic and random.

Stratified sampling is a method that essentially involves a three-step process, whereby (1) the population of interest is divided into groups (or strata) based on certain qualities of interest, such as age or sex, (2) individuals are then selected at random within each stratum (sg.), and finally (3) the results of each stratum are combined to give the results for the total sample. This method warrants the representativeness principle of samples, such that its purpose is to collect a random sample relative to each characteristic. Moreover, the samples with certain fixed proportions amalgamate into a single representative sample.

For instance, in determining the eating habits of the national population, characteristics such as age, sex, and socioeconomic status are critical factors that would need to be reflected in a random sample so as to render the sample representative (Fig. 3.6).

Fig. 3.6  The population is divided into separate groups, strata, and a random sample is drawn from each group (i.e., age, sex, and socioeconomic status)
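A short sketch may help contrast the two methods just described. The staff list mirrors the hospital manager's example; the age strata, group sizes, and per-stratum sample sizes are hypothetical assumptions for illustration.

```python
# A sketch contrasting systematic and stratified sampling (hypothetical
# data; the "every staff member numbered ...4" rule mirrors the hospital
# manager's example above).
import random

random.seed(7)

# Systematic: from an enumerated list of 1000 staff, select the
# 4th, 14th, 24th, ... member.
staff = list(range(1, 1001))
systematic_sample = staff[3::10]  # 100 staff members selected

# Stratified: draw at random within each stratum, then combine.
strata = {
    "age 18-30": [f"younger_{i}" for i in range(200)],
    "age 31-50": [f"middle_{i}" for i in range(300)],
    "age 51+":   [f"older_{i}" for i in range(100)],
}
stratified_sample = []
for name, members in strata.items():
    stratified_sample += random.sample(members, 10)  # 10 per stratum

print(len(systematic_sample), len(stratified_sample))  # 100 30
```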
Well, one might ask, would those qualities not be reflected under the utilization of a simple random technique? Unfortunately not—a simple random sample is likely not able to represent the qualities of particular interest to the population. On the other hand, the division of the population into strata based on age, sex, and socioeconomic status ensures that the random sample obtained within each stratum is reflective of the entire population. It also goes without mentioning that this sampling technique makes use of two principles that warrant representativeness, namely, randomness and stratification.

Cluster sampling is a sampling technique where individuals from the population are organized by their natural factions (clusters) and then randomly sampled from each thereafter. This method of sampling is particularly useful when the population of interest is extensively distributed and otherwise impractical to gather from all of its elements. For example, researchers interested in hospital-acquired infections (HAI) would not make very good use of their time and resources by attempting to review the records from a statewide list of discharge diagnoses from each affiliated hospital. Instead, it would be more practical—in terms of time and resources—to group (i.e., cluster) patients by hospital and then randomly sample each cluster (Fig. 3.7). This makes the information that is to be gleaned much more manageable.

Fig. 3.7  Each cluster is randomly sampled

Although both stratified and cluster sampling take advantage of group organization, it is important to note a stark difference between the two (strata vs. clusters). In the former sampling method, individuals are stratified by specific characteristics of study interest, such as race and ethnicity. Conversely, the latter method clusters individuals by their natural groupings, such as university, city, or hospital. Nevertheless, the apparent similarity between the two techniques cannot be denied. The importance of this similarity, in terms of grouping, lends a hand to the importance of randomness in obtaining a representative sample, along with the advantages of orderly information.
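The sketch below illustrates the clustering idea under stated assumptions: the four hospitals, the 50 patients per hospital, and the within-cluster sample size are all hypothetical.

```python
# A sketch of cluster sampling (hypothetical data): individuals are first
# organized by their natural groupings (hospitals), and a random sample
# is then drawn from each cluster, as in Fig. 3.7.
import random

random.seed(3)

clusters = {f"hospital_{h}": [f"patient_{h}_{p}" for p in range(50)]
            for h in range(1, 5)}

cluster_sample = {hospital: random.sample(patients, 10)
                  for hospital, patients in clusters.items()}

for hospital, patients in cluster_sample.items():
    print(hospital, patients[:3], "...")
```

Note the design choice this encodes: the grouping variable (hospital) is a natural faction, not a characteristic of study interest, which is precisely the strata-versus-clusters distinction drawn above.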
3.4 Measurement

Now that we have determined the specific sampling technique, the question remains: How does one go about collecting the sample?—and, more importantly—How does one go about obtaining the necessary information from the collected sample? In order to answer those questions, we must turn to measurement.

Once we have identified our population of interest and selected a sampling technique, the method we utilize to collect a sample and the necessary information from that sample requires a measuring tool or instrument. In research, there essentially exist two forms of measuring instruments: (1) researcher-completed instruments and (2) participant-completed instruments.

Researcher-completed instruments are instruments that are completed by researchers. Well, obviously—but more specifically, they refer to instruments that a researcher uses to gather information on something specific that is being studied. For example, a laboratory scientist studying fibroblasts (connective tissue cells) under different environmental conditions may have an observation form she uses each day she observes the cells under a microscope—for example: “Are the cells still alive? How large have they grown? How many of them are there today? Are they growing comfortably or are they struggling?” Another form of researcher-completed instruments includes checklists that measure the quality of evidence in medical literature, such as the AHRQ Risk of Bias instrument, STROBE, and R-AMSTAR.4

On the other hand, participant-completed instruments are instruments that are administered by researchers to the subjects under study. You have definitely completed one of these, whether knowingly or not. These usually come in the form of surveys or questionnaires, such as aptitude tests, product quality assessments, and attitude scales, to name just a few.

Regardless of what specific type of instrument is being utilized, there is one thing that is true (at least in research) for all measurement tools: all measurement tools used in translational research, or any scientific research for that matter, must have the two essential qualities of validity and reliability. Ha! And you thought that Chap. 2 was the last we heard of validity and reliability! No, it was not, nor is this instance the last we hear of them. Similar to the criteria to which our diagnostic tests are subject, so too are our measurement tools held to criteria. As we shall see later in Chaps. 5 and 6, the importance of validity and reliability scales across the entirety of this book, and across scientific research per se, particularly due to vulnerability to error.

3.4.1 Instrument Validity

A valid instrument is one that truly measures that which it is intended to measure. There are three primary means that we delineate in order to establish the validity of an instrument: (1) construct validity, (2) content validity, and (3) criterion validity.

Construct validity refers to the establishment of the degree to which an instrument measures the construct it is designed to measure. For example, does a tool that is aimed at measuring the level of anxiety in an individual truly measure anxiety? Or does it measure things like depression and/or stress that are closely related to anxiety? In this case, the construct that the instrument is measuring is anxiety. Hence, in this connotation, a construct refers to a theory or concept particular to the realm of the health sciences. Although we have provided a definition for a basic understanding, there exist many other more elaborate domains involved in the validation of an instrument’s construct, such as Messick’s Unified Theory of Construct Validity.5

Content validity refers to the extent to which the content of an instrument adequately addresses all of the intricate components of a specific construct. We can ask: Does the content of the questions within the instrument align with the construct of the instrument? With our anxiety instrument, content validity essentially validates whether the subject (and by extension the answers) of the questions are good assessments of anxiety. In this case, the content within an instrument must provide seemingly logical steps relative to the greater construct of the instrument. Hence, it is not uncommon to see this specific measure of validity referred to as logical validity.

4 See West et al. (2002), Vandenbroucke et al. (2014), and Kung et al. (2010).
5 See Messick (1995).
Criterion validity refers to the extent to which the measures of a given instrument reflect a preestablished criterion. This method of validation can essentially assess whether the measurements made within an instrument meet the criteria relative to the specific construct being studied. Criterion validity has two distinct yet interrelated behaviors:

– Concurrent Criterion Validity—validates the criteria of a new instrument against a preestablished and previously validated instrument, also known as the gold standard tool (see Sect. 2.2, Diagnostic Studies). This is most often used in the establishment of a new instrument.
– Predictive Criterion Validity—refers to the degree to which the measurements of an instrument meet certain criteria, such that it can predict a corresponding outcome. For example, can the overall score from the anxiety instrument accurately predict the severity of anxiety disorder? The next anxiety attack?

Considering all of the measurement validations spoken of above, there is a single theme that is common and crucial to any form of validation. Namely, whenever we use an instrument to measure some thing, it is critical that the instrument truly measures that thing. Let that settle in for a moment. We almost never (knowingly) create an instrument that measures something other than what it was originally conceived to measure. Should that be the case, though—that is, creating a measurement tool that does not accurately measure what it is intended to measure—then both the instrument and its measurements are rendered invalid; we are systematically obtaining erroneous measurements regarding an erroneous thing. Moreover, the data that are obtained and analyzed from the invalid instrument introduce a harmful blow to our study, namely, a systematic error.

3.4.2 Instrument Reliability

We often hear that while on the quest toward a healthier lifestyle, one should always use the same scale to monitor weight loss. We can readily deduce the reasoning behind that, but the science behind a consistent measurement implies the replicability of a measuring instrument.

Therefore, we say that a reliable instrument is one that produces similar results under consistent conditions. We must require this not only of scales of weight but of all measuring instruments, particularly in the health sciences. Imagine the chaos a blood glucose monitor would cause if it rendered a patient diabetic one day, not the next, and so on. To prevent ensuing chaos of any sort, we elaborate on the methods of reliability verification. But before that, let us ponder the word reliable for a moment. When we adulate anything as reliable, we somehow also credit its replicability. A car, for example, is said to be reliable because we can trust to repeatedly drive the car without fearing any major complications down the road. So too goes for a measuring instrument in the health sciences.

Let us use a sphygmomanometer—used to measure blood pressure—for example. There are two ways to verify the reliability of this instrument: inter-rater and intra-rater. At the doctor’s office, your physician measures your blood pressure with the sphygmomanometer and then passes the instrument to a shadowing premedical student to do the same. We hope, if the instrument is reliable, that the measurements that both the physician and the shadowing student obtain are the same. This is referred to as inter-rater reliability, such that the measurement provided the same results under consistent conditions between (inter-) two different and independent raters (i.e., physician and student). Intra-rater reliability refers to the producing of similar results under consistent conditions within (intra-) the same rater. More clearly, that is when the physician is able to replicate your blood pressure measurement multiple times with the same sphygmomanometer. The particular analytical techniques we use to justify these measurements are discussed in greater depth in Chap. 7.
Returning to our weight example above, certainly we expect the scale, and hence its measurement, in the gym to be identical to the scale and its measurement at your house. Why? Well, because weight is just weight, i.e., the gravitational force that pulls all things to the earth. So is the reasoning behind using the same scale due to a change in gravitational force from one scale to another? Surely not. Instead, it is an attempt to prevent error in our readings—errors that are inevitably inherent to different scales of measurement. For example, the scale at the gym reads you at 180 pounds, whereas the scale at home produces a reading of 182 pounds. After a week of intensive exercise, you may have truly lost 2 pounds, but if we err to assume weight readings to be constant between all measurements, then the scale at home will cause somewhat of a tantrum. On the other hand, if we consistently rely on a single measurement tool, then regardless of which scale you select, it will be able to replicate your original weight measurement while reflecting any net loss or gain.

This brings us to the discussion of generalizability (G) theory, a statistical framework utilized to determine the reliability of measurements. G theory can be pictured to be a brainchild of classical test theory, such that both view any method of measurement (X) as being composed of the true value (T) and the error of measurement (ε):

X = T + ε

G theory was originally developed to incorporate the relative influence of individual measurements within an instrument. Moreover, it also addresses the issue of consistency during performance assessments over time and during different occasions. G theory is advantageous for measurements made in healthcare due to its capacity to characterize and quantify the specific sources of error in clinical measurements. Additionally, its objective nature facilitates the effectiveness of future instruments, so that their measurements are less prone to error and hence closer to the true value (T). We expand further on G theory and its implications in Chaps. 5 and 6; for now, rG is the generalizability coefficient that can be utilized to determine the reliability of measurements under specific conditions.

In summary, measurements made by instruments and tools in healthcare must measure that which they are intended to measure, and those measurements must be reproducible—these qualities are instrument validity and reliability, respectively. Figure 3.8 contrasts the differences between the two. Furthermore, it is ideal to have an instrument that is both valid and reliable, but we cannot be idealists, especially in the healthcare field. We often must sacrifice one for the other. But, in absolute terms, it is impossible to argue that one is objectively more important than the other. Rather, the distinct, yet related, measures of validity and reliability may be better one way for a certain measurement and the other way for a different measurement.

Fig. 3.8  Illustration comparing the relationship between reliability and validity
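Before moving on, a minimal simulation can make the X = T + ε model above tangible. The true value T and the error spread of 1.5 pounds below are assumed values chosen only for illustration.

```python
# A minimal simulation of the measurement model X = T + e described
# above: repeated readings (X) scatter around the true value (T) with
# random error (e). The error spread of 1.5 pounds is an assumption.
import random

random.seed(1)

T = 180.0  # the true weight, in pounds
readings = [T + random.gauss(0, 1.5) for _ in range(1000)]

mean_reading = sum(readings) / len(readings)
print(f"true value: {T}, mean of 1000 readings: {mean_reading:.2f}")
# Random error tends to cancel out across repeated measurements;
# a systematic error (e.g., a miscalibrated scale) would not.
```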
3.5 Data Acquisition

The introduction to this chapter mentioned that one of the principal domains of methodology was the science of measurement. But up until this point, the basics of measurement have yet to be mentioned. Indeed, measurement pervades every aspect of our lives and is present at every instant of our lives. Every instant? Yes, let us experiment for proof. Take a moment to locate an object on your table—a pen, phone, water bottle, etc.—and then close your eyes and reach for the object. You grabbed it effortlessly, didn’t you? But how? How did you know exactly what distance your arm needed to stretch? Or precisely which muscles to use and with how much intensity to use them?

Proprioception. This nontraditional sense, often referred to as kinesthesia, is the awareness of the space around us, our position in that space, and our movements. As children, we seem to struggle with this sense to an appreciable degree as we learn to stand upright, walk fluidly, and extend our arm to just the right distance to grab that shiny object our parents forgot to hide. As we grow older and develop further, we seem not to realize (consciously) our sense of proprioception and its importance in our daily lives. When we do, though, it is usually in the context of attributes akin to this sense like hand–eye coordination and muscle memory.

We can surmise that fundamental to the sense of proprioception is the understanding and awareness of measurement. Take basketball, for example—how was Kobe Bryant so successful in making those seemingly impossible shots? Well, the best answer is practice (practice?). But the relevant answer is the experiences that came along with his practice: the experiences of practicing the different intricacies of his body required to shoot at certain distances and at certain angles from the hoop, all of which come together as measurements necessary to make those impossible shots—a genius, undeniably.

We also consciously, actively, and purposefully utilize the science of measurement daily. Take a standard weekday morning, for example: You wake up, measure the amount of toothpaste to use, measure the amount of time needed for you to leave home, measure the weather to determine what clothes to wear, measure the amount of coffee to make, measure the best route to get to school, measure the distance needed to brake (or gas) at a yellow light, and so on and so forth. Notice that, although we speak of measuring, it is neither necessary nor required for there to be an actual scale or instrument to measure whatever it is that you want to measure.

Furthermore, when we arrive at school or work, we are measured by other people like our teachers, supervisors, counselors, and even our secret admirers. We are also measured by disparate governmental and regulatory agencies such as the Internal Revenue Service (IRS), the US Census Bureau, and the Environmental Protection Agency (EPA), to name a few. We can continue these examples indefinitely—however, it is important to understand that measurement is central not only to our lives but also to our existence.

When it comes to research in the health sciences, the conceptual measurement device that is taken advantage of is statistics. Statistics is heavily relied on in order to capture the instances we observe (i.e., observations) from the natural world that are important to us and require further analysis. But what is so important about these observations that it requires an entire college of thought like statistics? The necessity of a field such as statistics can be said to have its origins in variability. The legitimization of statistics was initially for its application in governmental policy that was based on the demographic and economic differences (i.e., variations) of the people.

Similarly, there exists variation in the observations we make in the health sciences relevant to research. More importantly, we want to be able to capture or record those observations because they are important. And because they are important, we want to be able to utilize and even manipulate those observations so that we can garner pertinent findings. Surely, what is learned from an important observation—especially in translational healthcare—is an important finding to someone, somewhere.

In science, we refer to the observations we make from within the natural world (or, in research, scores from an experiment or survey) as data. Data (datum, sg.) are essentially the product of transforming observations and measurements from the natural world into scientific information. The human intellect is what mediates this transformation or codification. Everything we observe—say, the different people sitting in the library right now—has the ability to someway, somehow be transformed into data.

There are two inherent properties critical to the observations and measurements we make, one of which we have briefly touched on already,
namely, the importance of observations. The second essential principle is quantification. Truly, every single thing that is observable in the natural world can have a numerical value assigned to our perception of it. This is quite simple for measurements that are numerical in nature such as weight, cell-culture viability, blood pressure, etc. But the quantification of “things” that we observe that are not numerical in nature requires a few additional steps relative to data acquisition.

3.5.1 On Data: Quantitative vs. Qualitative

The previous section painted data to be exclusive only to numerical values. But this is as much true as it is false. Certainly, the study of statistics, and biostatistics for that matter, is deeply rooted in probability and mathematics, which are critical to data analysis. In fact, measurements pertinent to research are associated with numbers simply because numbers permit a greater variety of statistical procedures and arithmetical operations. But since its inception, data, research, and even science have all evolved in such a way that this simplistic understanding is no longer sufficient. Moreover, the particular aspects of life we scrutinize or are interested in studying have evolved as well.

Take redheads, for example—up until 2004, the biomedical community had no idea that women with natural red hair have a lower pain threshold than do women with dark hair and, therefore, require a higher dosage of anesthesia than their dark-haired counterparts. So then, how do we transform something that is non-numerical in nature into a numerical datum? How did the scientists acquire data on something they perceived to be red hair? How about pain?

To reiterate, everything can be quantified. Everything has the ability to be quantified so that the numbers we assign to anything can be used in the description, the comparison, and the prediction of information most relevant to the health sciences. Therefore, all that is left for us to learn are the intricacies of different quantification methods and how to most effectively utilize those numbers. To echo a mentor, that is precisely what statistics is: the science of making effective use of numerical data.

The issue that lies at the heart of our redhead example is this: How do we quantify something that is an inherent quality of something? The better question is this: How do we even measure red hair? Is red hair measured by the intensity of color? If so, how red is red? Or is red hair measured by a mutation in the melanocortin-1 receptor gene (MC1R)? What numerical value can be assigned to account for red hair?

This thought experiment can get a little hairy, to say the least. The point that we are attempting to drive home is that there are essentially two methods of quantification we can use to assign a numerical value to anything: measuring and counting. It is simple to see the basic nature of counting as compared to measuring. But this is not to diminish the stringent fact that measurements require some kind of instrument or tool that must abide by the rigors of validity and reliability as described above (see Sect. 3.4, Measurement).

When we do have the ability to actually measure something that is of interest to us via an instrument that produces a numerical value, then we refer to those measures as quantitative data. Intrinsic to quantitative data is the fact that our relative observations or measurements were obtained via a measuring tool or instrument.

For example, observations such as height, speed, or blood pressure all use some measuring instrument—a ruler, a speedometer, or a sphygmomanometer, respectively. Quantitative data consist of numbers that represent an amount, and hence—due to the importance of those numbers in and of themselves—these types of data are often referred to as continuous data.

On the other hand, when that which is of interest has neither an inherent numerical value nor a measuring instrument that can produce a numerical value, then the method of quantification is limited only to counting, and the resultant information is rendered as qualitative data. Qualitative data are data that have been quantified based on a certain quality of something that has been observed. Data of this sort consist of words, names, or numerical codes that represent the quality of something.
For example, observations such as hair color, socioeconomic status, or pain intensity do not have measurement tools per se but are perceivable qualities that are useful in the health sciences. The best we can do—in terms of quantification—with these qualities is to simply categorize them for what they are and count the number of their occurrences. Thus, it is not uncommon to hear qualitative data referred to as categorical data.

3.5.2 Variables

According to the ancient Greek philosopher Heraclitus, the only thing that is constant in life is change itself. Indeed, that is what makes us humans and the world we live in so unique. No cell, no human, no tree, and no planet are constant. In research, we must account for this differentiation by organizing our data by variables. A variable is a characteristic or property of interest that can take on different values. Similar to the different types of data, there are different types of variables that have within them distinct levels of measurement (see Video 1).

3.5.2.1 Quantitative

At the heart of quantitative data lie two characteristic variables that are respective of what it means for data to be rendered as such. A continuous variable is a variable that consists of numerical values that have no restrictions. Amounts such as body temperature, standardized test scores, and cholesterol levels are all examples of continuous variables. It is noteworthy to mention that the lack of restrictions mentioned is theoretical in essence. For example, a measurement of a patient’s body temperature in °F might be 100.2861…, continuing ad infinitum. We recognize this theoretical behavior of the numbers we work with by the label we give them (i.e., continuous), but for practical reasons similar examples of numbers with decimals are rounded to the nearest hundredths place (100.29 °F).

The second overarching quantitative variable is essentially the opposite of a continuous variable. A discrete variable is a variable that consists of numerical values that do have restrictions or are isolated. Whole numbers such as household size, number of medications taken per day, and the population size of US college students are all examples of discrete variables. Discrete variables are often referred to as semi-continuous or scalar, as those values can include enumeration (like household size), which (many have argued) is neither wholly quantitative nor wholly qualitative.

Considering variables that exist in the realm of quantitative data, there are also distinct scales of measurement that coincide accordingly. Interval measures are measurements that are separated by equal intervals and do not have a true zero. For example, measuring a patient’s body temperature using a thermometer produces readings of temperature along a range of −40 °F to 120 °F, with ticks separated at intervals of 1 °F. This level of measurement does not have a true zero for two reasons: (1) a reading of 0 °F does not mean that there is no temperature to read (i.e., it can get colder than 0 °F), and (2) a reading of 50 °F is not twice the amount of 25 °F worth of temperature.6

On the other hand, ratio measures are measurements that do have true zeros. For example, measuring someone’s height using a meter stick produces readings of height along a range of 0–1 m, with ticks reflecting their ratio distance from the point of origin (0 m). In this case, height can be considered a ratio measurement because a reading of 0 m essentially means that there is nothing to measure, and a height of 1.50 m is twice the amount of 0.75 m worth of length.

6 In reality, temperature is a subjective and humanistic perception of hot or cold, an expression of thermal motion at the subatomic level. Thus, there is no absolute scale that theoretically measures temperature; rather, there are different subjective scales used in its measurement such as Fahrenheit, Celsius, and Kelvin.
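A quick sketch shows why “twice as much” is meaningful on a ratio scale but not on an interval scale. The values below are hypothetical.

```python
# A sketch of why ratios are meaningful on a ratio scale (height, true
# zero) but not on an interval scale (degrees Fahrenheit, no true zero).
def f_to_c(f):
    return (f - 32) * 5 / 9  # same temperatures on a different interval scale

# Ratio scale: the ratio survives a change of units (meters to centimeters).
print(1.50 / 0.75, 150 / 75)  # 2.0 and 2.0 -> "twice as tall" holds

# Interval scale: the ratio does not survive a change of scale (F to C).
print(50 / 25, f_to_c(50) / f_to_c(25))  # 2.0 vs about -2.57 -> "twice
                                         # as hot" does not hold
```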
3.5.2.2 Qualitative

All qualitative variables can be denoted as categorical variables. A categorical variable is a variable that organizes qualitative data into categories. Well, that was obvious—but we cannot stress enough the importance of being able to distinguish between measures and counts, quantities and qualities. Moreover, categorical variables also have distinct scales of measurement.

At the most basic level are nominal measurements. A nominal measure is a measure of classification where observations are organized by either class, category, or name. For example, religious affiliation is a nominal measure where participants can identify with Christianity, Islam, Judaism, or other religions. A simple way to remember nominal measures is by looking at the etymology of the word nominal—namely, nominalis, Latin for “name.”

Another level of measurement relative to qualitative data is dichotomous measurements. A dichotomous measure is a measure that can take on one of two values. For example, a questionnaire that asks whether food was consumed prior to a medical exam can be answered as either “yes” or “no,” which can be coded as 1 and 2, respectively. Or, as with our redhead example, women with red hair can be labeled as 0 and women with dark hair as 1.

We must mention that there are measurement levels that can be identified with both quantitative and qualitative data. An ordinal measure is a measurement made that is reflective of relative standing or order (think: ordinal). Quantitative data that utilize ordinal measurements can be variables such as class standing, where—relative to course grade—students are ranked from lowest to highest grade point average (Fig. 3.9). Conversely, qualitative data that utilize ordinal measures are variables similar to grade point averages, where students are ranked from lowest to highest class standing based upon a letter grade (e.g., A, B, C, and D) (Fig. 3.9). If you notice here, the data mentioned are neither wholly quantitative nor wholly qualitative. Furthermore, we can presume ordinal measures to be exclusive to ranked data—whether quantitative or qualitative. The intricacies of ranked data and their analyses are discussed further in Chap. 7.

Fig. 3.9  Both qualitative (GPA) and quantitative (letter grade) data can utilize ordinal measures (e.g., 3.67–4.00 = A, 2.67–3.33 = B, 1.67–2.33 = C, 0.67–1.33 = D, 0 = F)

The importance of the type of data collected or utilized cannot be stressed enough. Not only does the type of data set the precedent for the specific type of research being done, but it also determines the appropriate statistical analysis techniques that are permitted to be employed. We shall see why this is as such in Chaps. 4–6. For now, let us consider all of the concepts discussed and their culmination to research methodology.
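As a closing sketch for this section, here is how such coding might look in practice. The records and the numeric code assignments are hypothetical, following the “yes”/“no” and letter-grade examples above; only the rank order of the grade codes is meaningful.

```python
# A sketch of coding qualitative variables (hypothetical records; the
# 1/2 and 0-4 code assignments follow the examples above and are
# arbitrary apart from the rank order of the grades).
records = [
    {"religion": "Islam",        "ate_before_exam": "yes", "grade": "B"},
    {"religion": "Judaism",      "ate_before_exam": "no",  "grade": "A"},
    {"religion": "Christianity", "ate_before_exam": "yes", "grade": "C"},
]

# Dichotomous: exactly two values, coded 1 and 2 as in the questionnaire.
dichotomous = {"yes": 1, "no": 2}

# Ordinal: the codes must preserve the rank order of the categories.
ordinal = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4}

# Nominal (religion) stays as labels: any numeric codes would merely be
# arbitrary names, carrying no order or amount.
coded = [(dichotomous[r["ate_before_exam"]], ordinal[r["grade"]])
         for r in records]
print(coded)  # [(1, 3), (2, 4), (1, 2)]
```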
3.6 Self-Study: Practice Problems

1. What fundamental question does the methodology portion of a research endeavor ask? What aspects of the study does the answering of this question provide information to?
2. The following list is a mixture of samples and populations. Identify and match the samples to their parent population:
(a) US college students
(b) Stars in the Milky Way Galaxy
(c) Republican Presidents
(d) Female business majors
(e) Female entrepreneurs
(f) Republican congressmen
(g) Arizona college students
(h) Stars in the universe
3. Why do we more frequently measure samples instead of entire populations? Can entire populations be measured?
4. What qualities are fundamental for a good sample? Why is this important to a research study?
5. An investigator interested in studying breastfeeding behavior in her county is in need of data. Due to her busy schedule and fast-approaching deadline, she takes advantage of the pediatric hospital across the street from her laboratory. She stands outside of the main clinic and surveys pregnant women as they walk in. Is this a form of random sampling? Explain.
6. Researchers from the city’s Department of Public Health are conducting a study on vaccination efficacy in their state. After a month of collecting data, the researchers compile 4366 observations. They categorize their data by socioeconomic status and then randomly select 50 observations from each category. What type of random sampling was utilized here?
7. A diabetic patient visits his local physician’s office for an annual checkup. For the past 4 months, the patient’s handheld glucose meter (which measures blood by pricking his finger) has been reporting 108 mg/dL every day before breakfast—the patient is excited for his physician to see this. At the office, the physician takes a full blood sample and runs it through complex machinery to find his blood glucose levels. The physician returns, slightly disappointed, and reports the patient’s blood glucose level to be 120 mg/dL. Assuming the instrument in the physician’s office is the gold standard for measuring blood glucose levels, what might be wrong with the patient’s blood glucose measuring instrument?
8. Is it more important for an instrument to be reliable or valid? Explain.
9. For each variable listed below, determine whether it is quantitative data or qualitative data:
(a) Socioeconomic status (low, middle, high)
(b) Grade point average (GPA)
(c) Annual income ($)
(d) Graduate Schools in the United States
(e) Number of patients in the ER waiting room
(f) Biological sex (male, female)
10. For each of the variables above, determine the type of variable and specific measure if applicable.

(See back of book for answers to Chapter Practice Problems)

Recommended Reading

Corbin JM, Strauss AL. Basics of qualitative research: techniques and procedures for developing grounded theory. 2nd ed. Los Angeles: Sage; 1998.
Kung J, Chiappelli F, Cajulis OO, Avezova R, Kossan G, Chew L, Maida CA. From systematic reviews to clinical recommendations for evidence-based health care: validation of revised assessment of multiple systematic reviews (R-AMSTAR) for grading of clinical relevance. Open Dent J. 2010;4:84–91. https://doi.org/10.2174/1874210601004020084.
Messick S. Validity of psychological assessment: validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. Am Psychol. 1995;50(9):741–9. https://doi.org/10.1037/0003-066X.50.9.741.
Vandenbroucke JP, von Elm E, Altman DG, et al. Strengthening the reporting of observational studies in epidemiology (STROBE): explanation and elaboration. Int J Surg. 2014;12(12):1500–24.
Wagner C, Esbensen KH. Theory of sampling: four critical success factors before analysis. J AOAC Int. 2015;98(2):275–81. https://doi.org/10.5740/jaoacint.14-236.
West S, King V, Carey TS, et al. Systems to rate the strength of scientific evidence: summary. In: AHRQ evidence report summaries. Rockville, MD: Agency for Healthcare Research and Quality (US); 2002. 47:1998–2005. https://www.ncbi.nlm.nih.gov/books/NBK11930/
Descriptive Statistics
4

Contents
4.1 Core Concepts  43
4.2 Conceptual Introduction  44
4.3 Tables and Graphs  45
4.4 Descriptive Measures  53
4.4.1 Measures of Central Tendency  53
4.4.2 Measures of Variability  55
4.5 Distributions  59
4.6 Probability  63
4.6.1 Rules of Probability  64
4.6.2 Bayesian vs. Frequentist Approach  66
4.6.3 Z-Transformation  66
4.7 Self-Study: Practice Problems  69

4.1 Core Concepts summarizes, or presents the sample and the


observations that have been made based on the
Nicole Balenton data collected in an organized manner.
It is hard to visualize what the data could be
We finally approach the third and final leg of our showing by presenting the raw data as is. If we
“three-legged” stool, particularly data analysis. want to present our data in a more meaningful and
In the remaining chapters of the first half of the manageable form, descriptive statistics allows for
book, we will look at inferential and descriptive a simpler interpretation. We organize and consoli-
statistics, but for now, we direct our attention to date data via tables and graphs, a statistical tool,
the latter. Unlike inferential statistics, descriptive which describes data in a concise, direct, and pre-
statistics does not attempt to make inferences cise manner. In this chapter, we discuss two types
from a sample to the whole population. As its of statistics that are used to describe data: mea-
name suggests, descriptive statistics describes, sures of central tendency and measures of vari-
ability. Measures of central tendency (i.e., mean,
median, and mode) describe the distribution of the
Electronic Supplementary Material  The online version
of this chapter (https://doi.org/10.1007/978-3-662-57437-9_4). data by summarizing the behavior of the center of
contains supplementary material, which is available to the data, while measures of variability (i.e., range,
authorized users. interquartile range, standard deviation, and vari-
© Springer-Verlag GmbH Germany, part of Springer Nature 2018 43
A. M. Khakshooy, F. Chiappelli, Practical Biostatistics in Translational Healthcare,
https://doi.org/10.1007/978-3-662-57437-9_4
44 4  Descriptive Statistics

ance) describe the distribution of the data by pro- tence of uncertainty and the understanding that
viding an understanding of the dispersion of the knowledge is ever-growing.
data. These descriptive measures are techniques At the turn of the twentieth century, scientists
that aid in the organization and assist in the effec- scrambled to explain and pictorialize a model of our
tive summarization of the data. atoms on a quantum level. In 1913, Ernest
We mention a lot about distribution when dis- Rutherford and Niels Bohr introduced the
cussing the measures of central tendency and Rutherford-Bohr model which correlated the
variability. This chapter helps us understand how behavior of our atoms to that of our solar system.
the shape of the distribution tells us, as research- Just as the planets orbit our Sun via gravitational
ers, more about the data. Distributions such as attraction, the electrons orbit the protons via electro-
normal and skewed are further discussed along static attraction (Fig. 4.1). This theory entails that
with their corresponding characteristics. You will the electrons “orbiting” the protons follow a spheri-
learn that the most central distribution among cal path that is both continuous and identifiable—
those listed is the normal distribution, also called similar to our solar system. But this—however nice
Gaussian or “bell-shaped” curve. it may seem—was found not to be the case.
Culminating the chapter is the introduction of About a decade later, a student of Bohr’s,
probability, which is the likelihood or chance of a Werner Heisenberg, proposed that we cannot be
particular event occurring. To assist in the deter- certain of the exact location of an orbiting elec-
mination of the likelihood or chance that a par- tron.2 Instead, we can only discern the likelihood
ticular event is likely to occur, we turn to a list of (probability) of an electron’s position relative to
formulae which are essentially the rules of prob-
ability. Finally, tying together the topics of distri-
butions and probabilities is the z-transformation.

4.2 Conceptual Introduction

The world we inhabit is riddled with uncer-


tainty. Should anyone have the ability to isolate
the genetic sequence of our universe, there would
surely be a gene or two that encodes uncertainty.
In the most general sense, uncertainty refers to
the lack of knowledge or an imperfect under-
standing of the occurrence of future outcomes.
Throughout humanity’s history, man has rigor-
ously attempted to ascertain that which is uncer-
tain, along with a grasp of its overall impression.
Uncertainty deeply permeates the sciences as
well. If we are to be loyal to our earlier premise of
truth-seeking, then we must do more than simply
acknowledge the presence of uncertainty. We Fig. 4.1  Rutherford-Bohr model (Klute 2007)
must attempt to widen the capacity of certainty
while narrowing the uncertain. We can surmise
bility of contradiction, whereas a scientific theory implies
that, over the recent years, science has done well general agreement among scientists that have yet to
in this regard. No longer are postulates of our uncover substantiating evidence for refutation.
physical world referred to as laws, rather they are 2 
Heisenberg outlined the inverse relationship shared
referred to as theories.1 A theory implies the exis- between the position and momentum of an electron, such
that the moment one attempts to locate the precise posi-
tion of an electron, the electron has already traversed
A scientific law implies absolute truth without the possi-
1 
elsewhere.
4.3  Tables and Graphs 45

the proton it circles (Fig.  4.2). This was later we obtain are from the same world that is so
known to be the Heisenberg uncertainty princi- inherently uncertain and, therefore, must contain
ple—a seminal piece of work central to his Nobel a degree of uncertainty within them as well.
Prize of 1932 and, more importantly, to our Furthermore, in statistics, uncertainty encom-
understanding of quantum mechanics today. passes more than a seemingly lack of knowl-
Although theoretical physics is beyond the scope edge. In fact, the root of uncertainty in statistics
of (nor required for) this book, there is a lesson to is a topic we have recently become quite familiar
be learned. We can argue that this branch of phys- with—namely, variability. Whether it is referred
ics attempts to do just what we hoped for: widen- to as variability, variation, or just individual dif-
ing certainty and narrowing uncertainty. At the ferences, the application of a tool like statistics is
heart of theoretical physics must lie some math- useful for the systematic organization, analyza-
ematical model that is taken advantage of in order tion, and interpretation of data in spite of
to explain, rationalize, and predict these naturally uncertainty.
occurring (and uncertain) phenomena. What Our interest in uncertainty, then, is com-
might that be? pounded when we begin to discuss biostatistics.
You guessed it, statistics! Statistical mechan- Indeed, more fearful than an unfortunate diagno-
ics is critical for any physicist; it provides tools sis is an uncertain prognosis. To the best of our
such as probability theory to study the behavior ability, we wish to minimize any uncertainty, par-
of the uncertainty in mechanical systems. This is ticularly when it comes to patient health and
just one example of the long reach of statistic’s healthcare research. The ideal option is to know
utility, and it is why statistics is often set apart everything about everything—to know the whole
from the other investigative sciences. We can truth (whatever that means!) The second best
further our understanding of the role of statistics possible option is to understand the fundamental
to be the effective use of numerical data in the concepts behind ways to handle variability and,
framework of uncertainty. In other words, statis- its corollary, uncertainty. In accomplishing this,
tics must deal with uncertainty because the data we start with the appreciation of the most basic
concepts underlying statistical thought.

4.3 Tables and Graphs

As previously mentioned, statistics made its


official debut on the governmental scene, where
it was described as the tool used to help orga-
nize the diverse affairs of the state (hence, sta-
tistics). The central line of defense against the
uncertainty that stems from the prevalence of
variability in the natural world is one of the
most basic applications of statistics, namely,
descriptive statistics. Descriptive statistics
refers to a form of statistics that organizes and
summarizes data that are collected from the nat-
ural and uncertain world. But why and/or how
does an organized description of data help us
deal with uncertainty?
The ability to organize and consolidate our
Fig. 4.2  Schrödinger-Heisenberg atomic model (Walker data is the first, and arguably most important, step
2018) in making effective use of data obtained from a
46 4  Descriptive Statistics

varying and uncertain world. Indeed, more damaging than the effects of uncertainty on our study (or on anything, for that matter) is ignorance of that uncertainty. Descriptive statistics highlights the uncertainty inherent in our data in the form of variability. Descriptive statistics bears the fruits of the statistical tools required to neatly describe the observations we make. The first statistical tool we mention utilizes tabulation, a method of systematically arranging or organizing data in a tabular form (table). Consolidating data into tables is one of the most practical ways of organization, especially when it comes to large sets of data.

Assume you have just collected data about the systolic blood pressure of 50 of your college peers at random, shown in Table 4.1. Although the data are compiled together, the lack of organization in the wide array of data is quite overwhelming. Moreover, one would rarely present data to their superior (i.e., principal investigator or research teacher) in this manner. Even if it is presented in this unorganized fashion, what type of beneficial information can we glean from it? Nothing much. In fact, the extensive detail highlights more unimportant information than important. The least one can do in this instance is to order the array of data numerically from lowest to highest (Table 4.2). Yet still, there is little added utility to this method other than its aesthetic pleasure to the eye. A more effective use of tabulation is organization by frequency.

A frequency table is a method of tabulation that organizes data by reflecting the frequency of each observation's occurrence relative to the whole set. The compact and coherent aspect of a frequency table facilitates the understanding of the distribution of a specific set of data—for this reason, a frequency table is often referred to as a frequency distribution. The presentation of data in this manner is allotted for both quantitative and qualitative data, although the latter form of data has a few restrictions described in additional detail below. Table 4.3 shows a frequency table of the data from Table 4.1.

1. The first column is labeled as our variable of interest, "systolic BP," and was configured by first identifying the extreme values (smallest and largest) from the data set and then numerating the values in between the extremes in numerical order.
2. The second column, labeled "f," represents the frequency of occurrence within each of those classes of systolic BP from the data set. This can be obtained via a simple tally or count of each class's occurrence through the raw data.
3. Next, the sum of each class's frequency should be equal to the total number of observations in the data set. Thus, when a frequency table organizes data by classes of single values as such, it is viewed as a frequency table for ungrouped data.

Now, the differences in organization and utility between Table 4.3 and Table 4.1 come to light. Not only are the data more organized, but there is also a greater deal of important information to be gleaned from this type of systematic organization. For instance, we can better identify the highest, lowest, and most common frequencies of blood pressure, which become important findings in the context of cardiovascular disease. It becomes even more helpful if you are tasked with comparing your data set to that of a colleague. Hence, we have a much better understanding and a more organized report (to present to your superior) regarding the systolic blood pressure of your college peers when using a frequency table.

Table 4.1 Randomly collected systolic blood pressures of 50 college students
Systolic blood pressure
103  98 105 107  96
116  97 118 114 116
122 126 111 114 106
113  98  94 124 122
 98  96 132 125  90
115 114 132 133 107
136 132 140 104  99
 94 134 110 137 105
137  95  98 119 130
126  98 140  96 103

Table 4.2 Randomly collected systolic blood pressures of 50 college students ordered/sorted from lowest to highest
Systolic blood pressure
 90  94  94  95  96
 96  96  97  98  98
 98  98  98  99 103
103 104 105 105 106
107 107 110 111 113
114 114 114 115 116
116 118 119 122 122
124 125 126 126 130
132 132 132 133 134
136 137 137 140 140
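If the tallying is done in software rather than by hand, the ungrouped frequency table in Table 4.3 can be produced from the raw readings of Table 4.1 in a few lines. The following is a minimal sketch in Python, our own illustration rather than part of the original text:

```python
from collections import Counter

# Systolic BP readings from Table 4.1 (listed here sorted, as in Table 4.2)
readings = [90, 94, 94, 95, 96, 96, 96, 97, 98, 98, 98, 98, 98, 99,
            103, 103, 104, 105, 105, 106, 107, 107, 110, 111, 113,
            114, 114, 114, 115, 116, 116, 118, 119, 122, 122, 124,
            125, 126, 126, 130, 132, 132, 132, 133, 134, 136, 137,
            137, 140, 140]

# Tally the frequency of each distinct value (classes of single values)
freq = Counter(readings)

print("Systolic BP  f")
for value in sorted(freq):
    print(f"{value:<13}{freq[value]}")
print(f"Total = {sum(freq.values())}")
```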
Table 4.3 Frequency table of the systolic blood pressures from Table 4.1
Systolic BP  f
90   1
94   2
95   1
96   3
97   1
98   5
99   1
103  2
104  1
105  2
106  1
107  2
110  1
111  1
113  1
114  3
115  1
116  2
118  1
119  1
122  2
124  1
125  1
126  2
130  1
132  3
133  1
134  1
136  1
137  2
140  2
Total = 50

Table 4.4 Frequency table for the weight of 500 college students in lbs
Weight in lbs  f
100–109  59
110–119  54
120–129  48
130–139  50
140–149  37
150–159  49
160–169  51
170–179  52
180–189  51
190–199  45
200–209  4
Total = 500

Table 4.5 Rules for constructing tables for grouped data
Four rules for constructing tables
1. Observations must fall in one and only one interval. Groups cannot overlap.
2. Groups must be the same width. Equal-sized intervals.
3. List all groups even if frequency of occurrence is zero. All groups should be ordered from lowest to highest.
4. If groups have zeros or low frequencies, then widen the interval. The intervals should not be too narrow.
The significance of efficiently describing statistics increases manyfold when the amount of data increases. What if the size of the data to be collected is larger than 50, like 500 data points? Surely the convenience of tabulation would lose its pragmatic nature if we applied the same organization technique, in terms of classification, from above. That is, if there are more than 10–15 different possible values to be classified singularly, then we must take advantage of a more refined method of tabulation by grouping the classes into intervals.

A frequency table for grouped data organizes observations by interval classification; this differs from a frequency table for ungrouped data, which organizes observations by classes of single values. This refinement is reflected by yet another level of organization, conveyed by the grouping of data into intervals. Table 4.4 depicts a frequency table of weights from 500 college students. Note how the weights in the table are not singly defined but are classified by intervals of 10; yet the table still provides us with useful and important information regarding the frequency of each category's occurrence in an organized manner. Table 4.5 presents a few simple rules to follow for constructing frequency tables for grouped data.
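The same idea extends to grouped tables like Table 4.4, with one extra step: mapping each observation to its interval before counting. A minimal sketch in Python (our own illustration; since the 500 individual weights behind Table 4.4 are not listed in the text, simulated values stand in for them):

```python
import random
from collections import Counter

random.seed(1)
# Simulated stand-in for the 500 recorded weights (100-209 lbs);
# the actual values behind Table 4.4 are not printed in the text.
weights = [random.randint(100, 209) for _ in range(500)]

def interval(w, width=10, start=100):
    """Return the label of the equal-width interval containing w."""
    low = start + ((w - start) // width) * width
    return f"{low}-{low + width - 1}"

freq = Counter(interval(w) for w in weights)

for label in sorted(freq, key=lambda s: int(s.split("-")[0])):
    print(f"{label:<10}{freq[label]}")
print(f"Total = {sum(freq.values())}")
```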
There is still more advantage to be taken from a frequency table, by performing a few more calculations. Table 4.6 is an extension of the earlier example of college students' weights from Table 4.4. Here we see the addition of four new columns—namely: frequency percent, cumulative frequency, cumulative frequency percent, and interval midpoint.

• Frequency percent (f%)³ represents the frequency of each class (or interval) relative to the total frequency of the whole set, expressed as a percentage.
  – The sum of each category's frequency percent should ideally equal 100% but can realistically be anywhere between 99 and 101%, as there may be errors in rounding.
• Cumulative frequency (cf) represents the total number of occurrences in each class, including the sum of the occurrences from the classes before it.
• Cumulative frequency percent (cf%) represents the cumulative frequency of each class relative to the total frequency of the whole set, expressed as a percentage.
  – This calculation is particularly useful in describing the relative position of a particular class of observations within the whole data set, often viewed as percentiles.

Table 4.6 Complete frequency table of 500 college students and their recorded weight in pounds (lbs)
Weight in lbs | f | f% | cf | cf%
100–109 | 59 | 59/500 × 100 = 11.8% | 59 | 59/500 × 100 = 11.8%
110–119 | 54 | 54/500 × 100 = 10.8% | 113 | 113/500 × 100 = 22.6%
120–129 | 48 | 48/500 × 100 = 9.6% | 161 | 161/500 × 100 = 32.2%
130–139 | 50 | 50/500 × 100 = 10% | 211 | 211/500 × 100 = 42.2%
140–149 | 37 | 37/500 × 100 = 7.4% | 248 | 248/500 × 100 = 49.6%
150–159 | 49 | 49/500 × 100 = 9.8% | 297 | 297/500 × 100 = 59.4%
160–169 | 51 | 51/500 × 100 = 10.2% | 348 | 348/500 × 100 = 69.6%
170–179 | 52 | 52/500 × 100 = 10.4% | 400 | 400/500 × 100 = 80%
180–189 | 51 | 51/500 × 100 = 10.2% | 451 | 451/500 × 100 = 90.2%
190–199 | 45 | 45/500 × 100 = 9% | 496 | 496/500 × 100 = 99.2%
200–209 | 4 | 4/500 × 100 = 0.8% | 500 | 500/500 × 100 = 100%
Total = 500 | ≈ 100%
Note that there are no calculated totals in cumulative frequency.

³ Often referred to as relative frequency.
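Each added column in Table 4.6 is a simple running computation over the frequency column, so the whole table can be rebuilt from the interval counts. A minimal sketch in Python (ours), which also computes the interval midpoint described next in the text:

```python
# (lower bound, upper bound, frequency) for each interval of Table 4.4
intervals = [(100, 109, 59), (110, 119, 54), (120, 129, 48), (130, 139, 50),
             (140, 149, 37), (150, 159, 49), (160, 169, 51), (170, 179, 52),
             (180, 189, 51), (190, 199, 45), (200, 209, 4)]

total = sum(f for _, _, f in intervals)  # 500 observations
cf = 0
print(f"{'Weight':<9}{'f':>4}{'f%':>8}{'cf':>6}{'cf%':>8}{'midpoint':>10}")
for low, high, f in intervals:
    cf += f                     # cumulative frequency (running total)
    f_pct = 100 * f / total     # frequency percent
    cf_pct = 100 * cf / total   # cumulative frequency percent
    mid = (low + high) / 2      # interval midpoint (average of endpoints)
    print(f"{low}-{high:<4}{f:>4}{f_pct:>7.1f}%{cf:>6}{cf_pct:>7.1f}%{mid:>10}")
```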
• Midpoint refers to the "middle" value of the interval of the specific class.
  – This is limited to frequency tables for grouped data, where the middle point of the interval can be found by simply averaging the lower and upper points of the specific interval. The importance and usage of this become apparent in graphing (see next page).

Table 4.7 depicts a crisscross method that can be used to obtain the cumulative frequency from each category's frequency. Additionally, there is nothing meaningful to be gleaned from the sums of either the cumulative frequency or the cumulative frequency percentages. Instead, we can gauge the accuracy of our calculation by confirming that the values in the final interval of the table for cf and cf% are equal to the total number of observations and 100%,⁴ respectively.

Table 4.7 Crisscross method to calculate the cumulative frequency

⁴ Again, the final cumulative frequency percent should ideally be equal to 100% but can realistically be anywhere between 99% and 101%, as there may be errors in rounding.

The importance of the type of data we are working with should not be forgotten, as that specific criterion sets the precedent for what we can and

cannot do with the data. We are guilty of being somewhat biased, as all of the calculations mentioned above can be taken advantage of when using tabulation to describe quantitative data. However, when working with qualitative data, we can utilize only a few of the characteristic traits of tabulation mentioned above. This is simply due to the nature of qualitative data; the information to be obtained from categories, counts, and/or names is limited when we attempt to apply the same mathematical procedures as we did toward quantitative data.

Hence, when it comes to creating tables for qualitative data, we do not create intervals, calculate cumulative frequency, or calculate the midpoint of the interval. Conversely, and depending on the specific type of qualitative variable, we can still calculate frequency, frequency percent, and cumulative frequency percent. The contingency on the specific type of data is mentioned particularly for the calculation of cumulative frequency percent. When working with qualitative data, it is customary to only calculate the cumulative frequency percent when the data to be tabulated are ordinally measured (i.e., an ordinal variable). This is primarily due to the fact that percentiles are meaningless if there is no order to the data. Tables 4.8, 4.9, and 4.10 are three examples of frequency tables for different measures of qualitative data (see Video 2).

Table 4.8 Tabulates a dichotomous variable
Do you brush your teeth before you go to bed?  f
Yes  59
No  145
Total  204

Table 4.9 Tabulates an ordinal variable
Class standing  f
Freshman  116
Sophomore  102
Junior  153
Senior  129
Total  500

Table 4.10 Tabulates a nominal variable
Race  f
American Indian/Alaskan native  9
Asian  11
Black or African American  13
Native Hawaiian or other Pacific islander  7
White  24
Total  64

Let us be the first to congratulate you on passing a milestone on your journey toward being an efficient and conscientious researcher. Congratulations! Seriously, although tabulation may seem relatively simple, the importance of presenting and reporting data in an organized manner through tabulation is a critical first step in the effectiveness of any scientific study. As should be the case in any scientific undertaking, we must always be concise, direct, and precise in presenting data obtained from a variable and uncertain world. Moreover, we have also opened the door for yet another tool we can use in descriptive statistics—namely, graphs.

Graphs represent yet another statistical tool available for the clear and concise description of data. Similar to the construction of frequency tables, graphs provide the means of organizing and consolidating the inevitable variability within data in a visual manner. Think of all of the instances (i.e., advertisements, class presentations) where graphs played a vital role in visually imparting a piece of information or underlying message to the viewer as intended by the presenter. Although numerous forms of graphing that can be utilized within descriptive statistics exist, below we outline a few of the most common.

The most common forms of graphs used to statistically describe data are histograms and bar charts. Up until this moment, it is possible that many of us believed that a histogram and a bar chart (or bar graph) were synonymous with each other. Unfortunately, this is a grave misconception. In fact, the primary difference between the two is critical in understanding the data at hand. As was the case before, the nature of the data that we work with (i.e., quantitative vs. qualitative) sets the precedent as to which type of graphing we can utilize.

Both a histogram and a bar chart can be referred to as "bar-type graphs" that utilize Cartesian coordinates and bars to summarize and organize data. A histogram is a type of bar graph used for quantitative data, where the lack of gaps
between the bars highlights the continuity of the data (hence, continuous data/variables). On the other hand, a bar chart is a type of bar graph used for qualitative data, where the bars are separated by gaps to highlight the discontinuity of the data. The primary difference between these bar-type graphs is that the bars in a histogram "touch" one another, whereas the bars in a bar chart do not touch one another and instead are separated by gaps.

The easiest way to construct either of these graphs is by transforming the already organized data provided by a frequency table. In both bar-type graphs, the x-axis represents the intervals or classes of the variable, and the y-axis represents the frequency. Figure 4.3 is a histogram created from the quantitative data presented in Table 4.4; Fig. 4.4 is a bar chart created from the qualitative data presented in Table 4.10.

Satisfy yourself that the graphing mechanism describes data concisely, directly, and precisely, well beyond the simple presentation of raw data. Table 4.11 provides a step-by-step protocol for constructing either of the two graphs.

Graphically speaking, there is one other graph that we can construct as yet another statistical

Fig. 4.3 Histogram of the weight of college students (x-axis: weight in pounds (lbs.), in 10-lb intervals; y-axis: number of students)

Fig. 4.4 Bar chart of race (x-axis: race categories from Table 4.10; y-axis: frequency)
Table 4.11 Protocol for bar charts and histograms
Protocol for constructing a bar chart or histogram
1. Draw and label your x- and y-axis. Place intervals or classes on the x-axis and frequencies on the y-axis for bar charts and histograms.
2. Draw the bars. Draw a bar extending from the lower value of each interval to the lower value of the next interval. The height of each bar is equivalent to the frequency of its corresponding interval. Make sure the bars are the same width as each other and centered over the data they represent.
   • For bar charts: bars are separated by gaps
   • For histograms: bars should be touching one another
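In most plotting libraries, the touching-versus-gapped distinction comes down to bar width and spacing. A minimal sketch (our own illustration, assuming the matplotlib library is available) drawing Table 4.4 as a histogram and Table 4.10 as a bar chart; the abbreviated race labels are ours:

```python
import matplotlib.pyplot as plt

# Table 4.4: grouped quantitative data -> histogram (bars touch)
weight_labels = ["100-109", "110-119", "120-129", "130-139", "140-149",
                 "150-159", "160-169", "170-179", "180-189", "190-199",
                 "200-209"]
weight_freqs = [59, 54, 48, 50, 37, 49, 51, 52, 51, 45, 4]

# Table 4.10: nominal qualitative data -> bar chart (bars separated)
races = ["AI/AN", "Asian", "Black", "NH/PI", "White"]  # abbreviated labels
race_freqs = [9, 11, 13, 7, 24]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(range(len(weight_freqs)), weight_freqs, width=1.0,
        edgecolor="black")  # width=1.0 -> no gaps (histogram)
ax1.set_xticks(range(len(weight_labels)))
ax1.set_xticklabels(weight_labels, rotation=45)
ax1.set_ylabel("Number of students")

ax2.bar(range(len(race_freqs)), race_freqs, width=0.6,
        edgecolor="black")  # width < 1 -> gaps (bar chart)
ax2.set_xticks(range(len(races)))
ax2.set_xticklabels(races)
ax2.set_ylabel("Frequency")

plt.tight_layout()
plt.show()
```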

Fig. 4.5 Transformation of the histogram in Fig. 4.3 into a frequency polygon (x-axis: weight in pounds (lbs.); y-axis: number of students)

tool used in descriptive statistics. This additional graph is a modification to a traditional histogram. Furthermore, this variation is critical when we begin to introduce distributions in the following section. A frequency polygon is a refined histogram that has a line graph added within it. Just as a histogram is limited to use within a quantitative context, so too by extension is a frequency polygon. But it would be repetitive to outline the protocol for constructing a frequency polygon. We can surmise that the first step in the construction of a frequency polygon is a histogram. So then, how is a frequency polygon constructed?

A histogram is transformed into a frequency polygon by simply connecting the peak of each bar within the histogram by a line. Recall from the section on frequency tables, specifically for quantitative (grouped) data, that we assigned a column for the calculation of each interval's midpoint. Thus, at the top of each bar within the histogram, a dot is placed at the middle (to represent the midpoint), its x- and y-coordinates are labeled, and then a line is drawn connecting each point. The frequency polygon depicted in Fig. 4.5 is a transformation of the histogram shown in Fig. 4.3. We can attempt to grasp the reason why it is referred to as a frequency polygon once the bars are removed and the area under the line is shaded (Fig. 4.6).

It is worth mentioning (again) that you need not initially create a histogram to transform into a frequency polygon. Indeed, a frequency polygon can be composed directly from a frequency table, provided that the frequency table has a column for the calculated midpoint. This also omits the necessity of drawing and then erasing the bars of the histogram, as long as the x- and y-coordinates of the dots on the line are labeled appropriately (x-coordinate, interval's midpoint; y-coordinate, frequency). We briefly return to the importance behind frequency polygons in Sect. 4.4 (see Video 3).
Fig. 4.6 Frequency polygon when the bars of the histogram are removed from Fig. 4.5 (x-axis: weight in pounds (lbs.); y-axis: number of students)

4.4 Descriptive Measures

Now that we are equipped with a few of the statistical tools necessary for descriptive statistics, we must also begin a discussion on the mathematical techniques we can utilize to describe our data and the inevitable variability they are fortified with. The techniques we are to describe not only aid in the organization of our data, but—even more—they assist in the effective summarization of our data in a direct, precise, and concise manner.

Consider the summary of a book you read on the internet the night before a book report you have been procrastinating on is due. The purpose of the summary is to give you an overarching understanding of the book, its characters, and (hopefully) the overall message your teacher intended for you to learn. In essence, that is precisely what the summarization techniques relative to statistical data intend to do as well. As we shall see, the processes contained within descriptive statistics go beyond tabulation and graphing; we learn that data can be described by averages and variability.

4.4.1 Measures of Central Tendency

What do we think of when we think of the word "average"? Well, for starters, we usually think of the most common or frequently occurring event. We also tend to not only use the word average colloquially but read and see it being used on a daily basis. We often even refer to the average of something without the actual mentioning of the word itself. Though we may not consciously realize its usage, it is one of the most efficient ways to describe whatever it is that requires describing. For example, when returning from a vacation, our friends and family usually ask: "What was the weather like?" Now, we don't usually go hunting for a 10-day weather report that outlines the temperature of each day in order to provide an answer to the simple question. Instead, we often respond with a general description as to how the overall temperature was, or we provide a single temperature that is about the same as the distribution of temperatures during our stay. In reality, all we are doing is providing an average—whether precise or not—of whatever it is we are attempting to describe or summarize.

As useful as averages are to our daily life, they are even more useful when it comes to describing data. In statistics, the techniques we use to describe data using averages are referred to as measures of central tendency. Measures of central tendency are techniques used in describing how the center of the distribution of data tends to behave. That is, we use these specific measures to help us summarize the average behavior of our data, which happen to lie in the center of the distribution.

The plural word "measures" implies that there is more than just one calculation of the average. Any layperson may be familiar with how to
mathematically calculate the average of a set of numbers. But there is more than just this single calculation of the average. The measures of central tendency include the mean, median, and mode—the calculations and meaning of each are provided below, along with a comprehensive example utilizing all measures of central tendency within a single data set in Fig. 4.7.

• Mean—synonymous with the arithmetic mean, refers to a form of average calculation that is the sum of the scores from a data set divided by the total number of scores from that data set.

Population Mean: μ = ΣXᵢ / N    Sample Mean: x̄ = ΣXᵢ / n

• Median—refers to a form of average calculation that is represented by the middle number, given that the data are organized in numerical order (think: "meedle" number).
  – Contingency: if the number of data points is odd, then count off from both ends toward the center, and you will arrive at the median. If the number of data points is even, locate the two middle numbers, and calculate the arithmetic mean of the two to arrive at the median.
• Mode—refers to a form of average calculation that is represented as the most frequently occurring number within a data set.
  – Contingency: there can exist many values within a data set that represent the mode, given that the frequency of their occurrences is identical and no other observation occurs more frequently.

Fig. 4.7 Measures of central tendency calculated for an example data set. Notice that the first step is to order the data from lowest to highest value; see the contingency for calculating the median for an even number of data points.
Data: 14 9 1 18 4 8 8 20 16 6
Ordered: 1 4 6 8 8 9 14 16 18 20
Mean = 104/10 = 10.4    Median = (8 + 9)/2 = 17/2 = 8.5    Mode = 8

If the description of the central nature of these measures relative to the distribution of data is not yet clear, then there is no need for panic; the following section should make more sense. Additionally, we must pause, yet again, for the ultimate deciding factor as to the usability of these measures—namely, the nature of the data. Let us begin with the easiest one first, namely, quantitative data. All measures of central tendency are permissible when describing quantitative data. On the other hand, when it comes to describing qualitative data, only the mode and (seldom) the median are permissible.

For starters, it should be evident as to why a mean calculation is never permitted when working with qualitative data. Why? Consider the nominal variable of gender. Say your biostatistics class is composed of 16 females and 14 males. Okay, great—so what is the average gender in your class? We will wait… there is no valid answer. Let us go one step back: what are 16 females plus 14 males equal to? 30… what? People? No, that cannot be the summative gender. Even two steps back: what is one female plus one male equal to? Two femalemales? Silly, but no, that cannot be it either. The point that we are attempting to get at is this: due to the qualitative nature of the data, meaningful (even simple) mathematical calculations are not always appropriate. As mentioned in the previous chapter, the best we can do with qualitative or categorical data is simply to count them or record their frequencies.
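Both points can be checked mechanically: all three measures run on the quantitative data of Fig. 4.7, while only the mode makes sense for a nominal variable. A minimal sketch in Python (our own illustration, not the book's):

```python
import statistics

scores = [14, 9, 1, 18, 4, 8, 8, 20, 16, 6]
print(statistics.mean(scores))    # 10.4
print(statistics.median(scores))  # 8.5 (mean of the two middle values)
print(statistics.mode(scores))    # 8   (most frequently occurring value)

# For nominal data only the mode is meaningful; calling mean() on
# labels like these would simply raise a TypeError.
genders = ["female"] * 16 + ["male"] * 14
print(statistics.mode(genders))   # 'female'
```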
Thus, the measures of central tendency we are left with are the median and mode. As discussed above, a median is the middle number among the data set, given that the data are listed in numerical order (i.e., smallest to largest). This is problematic when it comes to qualitative data because not all qualitative data (like gender or ethnicity) have the potential to be ordered. Hence, the calculation of the median is only appropriate when we have an ordinal measure of qualitative data⁵ (see Chap. 3, Sect. 3.4.2.2). Lastly, with a sigh of relief, we can say that the calculation of the mode is permissible for all qualitative data, as it is simply the description of the most frequently occurring observation. Table 4.12 provides a quick tool that delineates which measures of central tendency are appropriate relative to the nature of the data at hand.

Table 4.12 Measures of central tendency checklist
Measures of central tendency | Quantitative | Qualitative
Mean | ✓ | ×
Median | ✓ | ✓*
Mode | ✓ | ✓*
* indicates contingencies

⁵ This is contingent on there being an odd number of observations. If there are an even number of observations and the middle two observations are not identical, then the median cannot be calculated (see median definition and its contingencies above).

4.4.2 Measures of Variability

As outlined in the introduction of this chapter, variability is a chief marker of the uncertainty contained within the data itself and its source, namely, the world. But it is better to be ignorant about the uncertainty in the world than to be knowledgeable of uncertainty without a way of measuring it. Luckily, we are both aware of the existence of uncertainty and have the ability to measure it in the form of variability. Moreover, contained within its measurement is the implication of a clear and concise description of the variability contained within our data. The ability to understand and summarize the variation among data provides important and meaningful information regarding many aspects of the data.

In brief, the measures of variability provide an understanding of the overall distribution and dispersion of quantitative data. It is true that the amount by which each value contained within a data set differs or varies from the others provides an understanding of how spread out the data are. Take a moment to reflect on the words we use such as distribution, dispersion, and spread; they are not only essentially synonymous with one another but also fundamentally contain within them the idea of variation. Below we provide the details of four distinct, yet interrelated, measures of variability that are critical to descriptive statistics.

We begin with the simplest measure of variability: range. The range is the distance between the highest and lowest values among the data. Of course, there must be numerical order to the data before we can calculate the range—signifying that the data must be quantitative in nature. However simple the calculation of range may be, its ability to provide meaningful and useful information regarding the distribution of data is limited. This is primarily due to outliers, which are observations within a data set that significantly differ from the other observations.

Outliers can be the consequence of numerous factors, such as erroneous measurements or observations obtained from unrepresentative samples; they are typically found among the lower and/or upper distribution extremes. Regardless of the causes, outliers pose a threat to the calculation of range, as they provide a deceiving description of the variability contained within our data. Thus, in order to prevent deception, we introduce a calculation of range that is much less vulnerable to the potentially damaging effects of outliers, which is also the second measure of variability explored here.

Interquartile range (IQR) refers to the range across the middle of our data, given that the
data have been divided into quarters. Assuming the data are in numerical order, the quarters are split by Q1, Q2, and Q3 (Fig. 4.8).

Fig. 4.8 Illustration of the interquartile range (IQR): Q1, Q2, and Q3 split the ordered data into quarters

The second quartile (Q2) is the same as the median of the data set. Once Q2 is determined, we can visualize the first and third quartiles (Q1 and Q3, respectively) to be the "medians" of the first and third quarters, or the numbers to the left and right of the real median (Q2). After isolating the quarters, all that is left is determining the distance (range) that is between (inter) the quarters or quartiles ("IQR"). Hence, we can remember the formula to be:

Interquartile Range (IQR): IQR = Q3 − Q1

Figure 4.9 provides a brief guideline along with an example for the calculation of the IQR.

Fig. 4.9 Step-by-step procedure on how to calculate the interquartile range (IQR)
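Both the range and the IQR reduce to a few lines of code. The sketch below (ours) follows the median-of-halves convention described above; note that other quartile conventions exist and can give slightly different values:

```python
import statistics

def iqr(values):
    """IQR = Q3 - Q1, with Q1/Q3 as medians of the lower/upper halves."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]               # values below the median
    upper = s[half + (n % 2):]     # values above the median (skip it if n is odd)
    return statistics.median(upper) - statistics.median(lower)

data = [1, 4, 6, 8, 8, 9, 14, 16, 18, 20]
print(max(data) - min(data))  # range = 19
print(iqr(data))              # Q1 = 6, Q3 = 16 -> IQR = 10
```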
Next, we discuss the two measures that lie at the heart of variability: standard deviation and variance. These are the most common and useful measures of variability used within scientific research. They not only adequately describe the variability contained within quantitative data but also provide credence to many of our statistical analyses, their interpretations, and their applications to the health sciences. Thus, we must be able to effectively and comprehensively understand these measures in order to suit our needs in the following chapters.

The standard deviation refers to the average distance of every datum from the center of the entire distribution. Let's look at the actual phrase in the context of the definition for clarification: the standard (average) amount by which each observation deviates (varies) from the mean. Notice the importance of the arithmetic mean here and—from our previous understanding—that it (normally) falls in the middle of the distribution. Due to this, we can say that the standard deviation is the average distance from both sides of the mean when applied to a histogram (Fig. 4.10).

Fig. 4.10 Notice how the observation points deviate from either side of the mean

It is noteworthy to expound on the relationship shared between measures of central tendency and measures of variability, specifically for the mean and the standard deviation. The standard deviation is a measure of variability that describes the distribution of our data through the perspective of dispersion. On the other hand, the mean is a measure of central tendency that describes the distribution of our data through the perspective of the center of the distribution. We can think of the standard deviation as a measure of distance and the mean as a measure of location.

The standard deviation essentially provides an answer to: how widely (or narrowly) spread are the data? If the standard deviation is relatively small, then we can assume a narrower distribution (Fig. 4.11)—i.e., the values only slightly vary or deviate from one another (and hence from the average). On the other hand, if the standard deviation is relatively large, then we can assume a wider distribution (Fig. 4.12)—i.e., the values greatly vary or deviate from one another (and hence from the average). Lastly, we delineate the fact that there are two calculations of standard deviation—one form utilized for observations obtained from a sample and the other from a population—in which the formulas and symbols that are used differ slightly.

Fig. 4.11 When the standard deviation is small, the graph depicts a much narrower distribution

With these understandings, we introduce the formulas required for the calculation of the standard deviation of both populations and samples.

Sample Standard Deviation: s = √[Σ(xᵢ − x̄)² / (n − 1)]

Population Standard Deviation: σ = √[Σ(xᵢ − μ)² / N]

According to the order of operations,⁶ notice that the first operation in the numerator (for both
formulae) is the subtraction of each observation point from the mean. Next, the value of that difference is squared (²), and after doing this for each observation, the products are summed together (Σ). This method is referred to as the sum of squares (SS). The final step in obtaining the standard deviation—of either the population or the sample—after obtaining the sum of squares is division by the respective denominator⁷ and then taking the square root. Figure 4.13 shows these steps in an easy-to-follow table method.

Fig. 4.12 Notice the wider distribution compared to Fig. 4.11 when the standard deviation is relatively large

Fig. 4.13 Steps for obtaining the standard deviation using the table method

⁶ The order in which we calculate is parentheses > exponents > multiplication > division > addition > subtraction; the acronym for which is commonly known as PEMDAS.
⁷ Understanding the difference in denominators for population (N) and sample (n − 1) is important for inferential statistics, expanded on in Chaps. 5 and 6.
Just as there are two forms of calculation for standard deviation, there are two forms of calculation for variance, for similar reasons. The variance and the standard deviation share a very similar resemblance; both are held to the highest regard in terms of describing the variability among scientific data. So, it will not be a surprise when we define variance as the average of the squared differences from the mean. Better yet, we can simply view the variance as the standard deviation squared or, conversely, the standard deviation as the square root of the variance! Both visualizations are one and the same. Accordingly, the formulae for sample variance and population variance are depicted below; the calculation of each can be performed utilizing the same table method outlined for standard deviation (see Fig. 4.13).

Sample Variance: s² = Σ(xᵢ − x̄)² / (n − 1)

Population Variance: σ² = Σ(xᵢ − μ)² / N

It is also worth reiterating the fact that the measures of variability are limited to work with quantitative variables and data only.

Before we move on to the next section, we must be attentive as to the different symbols used for populations and samples. Not only is it pertinent to adequately understand the conceptual difference between a population and a sample (discussed in Chap. 3, Sect. 3.2), but it is also important to learn the symbols and the words that refer to them. The symbols (and concepts) used to describe populations are referred to as parameters, whereas those used to describe samples are referred to as statistics. A modest way to remember the distinct symbols is to notice that parameters use Greek symbols, whereas statistics use Latin symbols (Fig. 4.14).

Fig. 4.14 Sample (statistics) and population (parameters) symbols: mean x̄ (Latin) vs. μ (Greek); standard deviation s vs. σ; variance s² vs. σ²; size n vs. N

4.5 Distributions

Throughout this chapter, the word distribution has been used regularly to describe the characteristic qualities of the data at hand. We observed that measures of central tendency described the distribution of the data by summarizing the behavior of the center of our data. We observed that measures of variability described the distribution of the data by exemplifying the manner in which the data are dispersed. Although we may have conceptually visualized the distribution of our data, the following section sets out to provide a more tangible depiction of a distribution.

The first introduction we had to a distribution was during the explanation of a frequency table or, its common alias, a frequency distribution. Notice that Table 4.4 is essentially a depiction of how the data are distributed in terms of frequency by method of tabulation. Next, we witnessed how that same frequency table was consolidated into a histogram, Fig. 4.3. We briefly spoke on the great value a visual presentation had in clearly and concisely describing the way the data were distributed. Then, we observed how the distribution of quantitative data has the ability to be further described by the construction of a frequency polygon. That said, there is a reason behind this progressive method we outlined—that being, the normal distribution. Let us look at this progression one more time and its culmination in a normal distribution in order to drive the point home.

Figure 4.15 shows a histogram of the continuous variable hemoglobin (Hb) levels obtained from a sample of 326 diabetic patients; Fig. 4.16 shows a frequency polygon of the same variable. If we were able to "smoothen" out the rough complexion of this frequency polygon, then what we would see is a nicely bell-shaped curve (Fig. 4.17). Now, we cannot simply make smooth every crooked frequency polygon that comes our way. But if we were able to, say, obtain a distribution on 3260 diabetic patients or—better yet—32,600 diabetic patients, then we would
Fig. 4.15 Histogram (x-axis: hemoglobin levels (Hb); y-axis: number of patients)

Fig. 4.16 Frequency polygon (x-axis: hemoglobin levels (Hb); y-axis: number of patients)

Fig. 4.17 Bell-shaped frequency polygon (x-axis: hemoglobin levels (Hb); y-axis: number of patients)
Fig. 4.18 Bell-shaped curve (x-axis: hemoglobin levels (Hb); y-axis: number of patients)

witness our original crooked distribution naturally attain that smooth bell-shaped curve (Fig. 4.18). This bell-shaped curve is what we refer to as a normal distribution.

A normal distribution is a naturally occurring phenomenon that depicts the distribution of data obtained from the world as a bell-shaped curve, given a sufficient number of collected observations. There are multiple qualities of a distribution that render it a normal distribution—but that is not to say that there is anything normal about it, per se.

The normal distribution, also referred to as the Gaussian distribution, was first introduced by the highly influential German mathematician Johann Carl Friedrich Gauss in the late eighteenth century. Gauss described the qualities of this theoretical distribution as a bell-shaped curve that is symmetrical at the center with both of its tail ends stretching to infinity, never touching the x-axis. Moreover, the center of a normal distribution is where the mean, median, and mode are all located (i.e., mean = median = mode)⁸—hence, measures of central⁹ tendency. Additionally, the unit of measurement for the spread of the distribution on the x-axis is the standard deviation—hence, measures of variability or dispersion. It is also critical to note (or reiterate) that the normal distribution is based on quantitative data and continuous variables.

⁸ Depending on the size of the data, these measures need not necessarily be exactly equal to one another but relatively close in order for a normal distribution to be observed.
⁹ Notice, now, how the measures of central tendency are essentially a description of the distribution of data. The calculated mean tends to fall in the center, the median—by definition—is in the middle, and the mode is the observation that occurs most frequently, which is the highest bar in a frequency polygon and later the peak in the normal distribution.

There are still other characteristics of a normal distribution that are important to understand. Recall from a few paragraphs above that the frequency polygon from the sample data did not necessarily smooth out until we increased the size of our data. By continuously increasing the size of our sample (n), we began describing the population of our observations (N) more than the sample itself. Thus, we say that normal distributions are primarily observed when describing the parameters of a population.¹⁰ There are an infinite number of normal distributions that, in theory, can occur depending on the specific population we are to describe. For this reason, a short-hand method of labeling any normal distribution relative to its specific parameters is N(μ, σ). Also, the total area under a normal distribution is equal to one; the reasoning of which is explained further below in Sect. 4.6.3. Table 4.13 summarizes the important qualities of a normal distribution.

¹⁰ This is not to make normal distributions exclusive to populations. Sample data may very well be normally distributed as well, the reasoning for which we save for the next chapter under the central limit theorem.

Not only are there different types of normal distributions, but there are also different types of
Table 4.13 Qualities of a normal distribution
 • Notation: N(μ, σ)
 • Symmetrical about the mean
 • In the bell-shaped curve, tails are approaching infinity and never touch the x-axis
 • The area under the curve is 1.00 (100%)
 • Empirical rule: "68-95-99.7 rule"
   – Approximately 68% of the sample lies ±1σ from the mean
   – Approximately 95% of the sample lies ±2σ from the mean
   – Approximately 99.7% of the sample lies ±3σ from the mean

distributions, in general. Although that list may also stretch to infinity, we will focus on non-normal distributions. Two of the most frequently observed types are skewed distributions and polymodal distributions.

Skewed distributions occur when the measures of central tendency are not equivalent or even nearly equivalent to each other, like they are for normal distributions. A positively skewed (or right-skew) distribution is observed when the mean, median, and mode have a descending numerical order to their values (Fig. 4.19). On the other hand, a negatively skewed (or left-skew) distribution is observed when the mean, median, and mode have an ascending numerical order to their values (Fig. 4.20). It is noteworthy to mention that the skewness of a distribution is determined by the tail end of each respective graph, which is why they are also referred to as right-skews or left-skews. The mean's vulnerability to outliers is the most frequent cause of skewed distributions.

Fig. 4.19 Positively skewed (right-skew) distribution (from left to right: mode, median, mean)

Fig. 4.20 Negatively skewed (left-skew) distribution (from left to right: mean, median, mode)

Due to this vulnerability and the resultant skew, we must be cautious of which measure of central tendency we use to describe the distribution of data. For example, Fig. 4.21 shows the distribution of first-year incomes for the population of last year's graduates. Would it be accurate here to use the mean to describe the average college student's income 1 year after graduation? No, the mean fails to provide an appropriate description of the average income made 1 year after graduation; instead, the median would be more appropriate.

Fig. 4.21 Using measures of central tendency to describe the distribution of college graduates' first-year income (x-axis: income in dollars ($); y-axis: number of college graduates)

There are also instances where the mode is the best measure to describe the distribution of our data. For example, a distribution of the age of patients suffering from Hodgkin's lymphoma shows two peripheral peaks, instead of a single central peak (Fig. 4.22). It would be inappropriate to use the mean to describe the distribution here as well; Hodgkin's lymphoma primarily affects the immunocompromised (i.e., children and seniors). Instead, the data show that the modes are the appropriate measure of central tendency used to describe the distribution. This type
of distribution with two modes is referred to as a bimodal distribution, while similar distributions with more than two modes are referred to as polymodal distributions (Fig. 4.23).

Fig. 4.22 Bimodal distribution (two modes)

Fig. 4.23 Polymodal distribution (more than two modes)

4.6 Probability

One of the most popular techniques used to tackle uncertainty is probability theory. Probability refers to the likelihood or chance of a specific event occurring. We—probably—did not need to provide a definition of probability; chance, odds, likelihood, and possibility are all synonymous with probability and its inherent concept. Whether it is determining the outfit of the day based on the chance of rainy weather or the likelihood of a specific treatment being effective for a specific patient, the theory of probability is used across the spectrum of our daily activities.

Moreover, probability joins the constant battle in widening certainty and narrowing uncertainty. The fact that we are able to quantify probability and its associated bearing on uncertainty is, in essence, a function of descriptive statistics. Although there is an entire branch of mathematics devoted to the theory of probability, below we provide the fundamental axioms and rules of probability relative to statistics in the health sciences. Following tradition, we begin with the flipping of a coin to introduce the ultimate concept of probability.

Assuming we have a fair coin, how is the likelihood of flipping the coin and getting a head expressed mathematically? The most basic probability formula is in the form of a fraction or proportion. Figure 4.24 shows a simple fraction where the numerator represents the number of times our event of interest (head) occurs from the set (the coin) and the denominator represents the total number of occurrences of all possible events within the entire set (head and tail). Thus, the probability of flipping a coin and getting a head is ½ or 0.50. Multiplication of this proportion by 100 results in a percentage—what we commonly call a 50% chance.

Along with this basic introduction comes the rudimentary understanding of the fundamental premise—or axiom (Andre Kolmogorov 1965), as it were—of probability theory. The probability of any event occurring is a nonnegative real number. From just this first axiom, we can deduce another important concept of probability. The probability of any event occurring, written as P(E), is bound by zero and one—meaning that the likelihood of the event of interest (E) occurring ranges from 0% (absolutely not happening) to 100% (absolutely happening). Recall that probability is essentially a fraction that ranges from 0 to 1, in which its transformation to a percentage requires multiplication
Fig. 4.24 Probability formula: P(E) = number of occurrences of the event of interest / total number of occurrences of all possible events

by 100.¹¹ Also, the probability notation P(E) is read as: "The probability of E's occurrence."

¹¹ Colloquially, we often use or hear phrases such as: "I am 150% sure!"—unfortunately, that is impossible.

4.6.1 Rules of Probability

Probability is more than just the determination of a single event. Oftentimes, we are concerned with determining the likelihood of a series of events or the likelihood of an event in consideration of the occurrence of other related events. For these more complex problems, there is a necessity for formulae that help with their determination. These formulae, which we present in what follows, are commonly referred to as the rules of probability.

Multiplication Rule: P(A and B) = P(A) × P(B)

The multiplication rule considers the likelihood or the joint probability of two independent events, A and B, occurring simultaneously. This rule essentially claims that if we wish to determine the probability of any number of events occurring together, then all that is required is to simply multiply the separate probabilities of these independent events together. Notice that in order to appropriately utilize the multiplication rule, the events in consideration must be independent of each other.

Independence describes distinct events that do not affect the likelihood of each other's occurrence. For example, when rolling a pair of dice, the probability of rolling a 3 and rolling a 4 can be calculated by using the multiplication rule. The probability of rolling a 3 on one die is about 0.1667 (P(3) = 1/6), and the probability of rolling a 4 on the other die is also about 0.1667 (P(4) = 1/6). Moreover, these events are considered independent because the likelihood of rolling a 3 on one die does not affect the likelihood of rolling a 4 on the other and vice versa. Therefore, P(3 and 4) = (1/6) × (1/6) ≈ 2.78%. A helpful guide for determining the appropriate usage of the multiplication rule is to look for the word and; for this reason the rule is also referred to as the and rule.

Addition Rule for mutually exclusive events: P(A or B) = P(A) + P(B)

The addition rule considers the likelihood of the singular occurrence of either of two (or more) mutually exclusive events, A or B. This rule essentially tells us that if we wish to determine the probability of any single event occurring among several other events, then all that is required is to simply add the separate probabilities of these mutually exclusive events. Notice that in order to appropriately utilize the addition rule, the events in consideration must be mutually exclusive.

Mutual exclusivity refers to events that are unable to occur together in a single instance. For example, a bag of chocolate candy contains four blue, one orange, two yellow, and three red pieces. The probability of randomly selecting a blue piece or a yellow piece is simply the sum of their singular occurrences (P(blue or yellow) = P(blue) + P(yellow) = (4/10) + (2/10) = 0.60 or 60%). This simple utilization of the addition rule is allotted because there is no single instance where both the color blue and the color yellow occur together at the same time in a single event (… green does not count). A helpful guide for determining the appropriate usage of the addition rule is to look for the word or; for this reason the rule is also referred to as the or rule.

But what if the events of interest are not mutually exclusive (i.e., the events do have the ability to occur together in a single instance)? If we mixed into that same bag five more pieces of colored chocolate candy (three yellow and two orange) that have a peanut-center, then what would be the probability of obtaining either a piece of candy that is yellow or a piece of candy that has a peanut-center (P(yellow or
peanut-center)? Now we do have an instance where the events are not mutually exclusive; there is a singular occasion where both events are able to occur simultaneously (i.e., a yellow chocolate candy with a peanut-center). For this, we must introduce a refinement to the addition rule.

Addition Rule for non-mutually exclusive events: P(A or B) = P(A) + P(B) − P(A and B)

This refinement to the addition rule considers events that are not mutually exclusive, in which the original addition rule is joined with the multiplication rule. Notice that we are still utilizing the addition rule—meaning we are still interested in determining the probability of any single event occurring among several other events. This modification simply considers the fact that the events have the ability to occur together (P(A and B)) and are not mutually exclusive.

Thus, returning to the above example, the bag now holds 15 pieces, of which 5 are yellow, 5 have a peanut-center, and 3 are both. The probability of obtaining either a piece of candy that is yellow or a piece of candy that has a peanut-center (P(yellow or peanut-center)) must have the probability of obtaining a yellow chocolate candy with a peanut-center (P(yellow and peanut-center)) removed (subtracted) from the sum of each singular event's occurrence [P(yellow or peanut-center) = P(yellow) + P(peanut) − P(yellow and peanut-center) = (5/15) + (5/15) − (3/15) = 7/15 or approximately 46.7%]. Figure 4.25 shows a series of Venn diagrams that depicts the conceptual nature of mutual exclusivity relative to the addition rule in this example.

Fig. 4.25 Mutual exclusivity relative to the addition rule: the overlap (yellow candies with a peanut-center) is subtracted so that it is not counted twice

Did we miss something? As we were familiarizing ourselves with the multiplication rule, we did not pose the question of what to do if our events were not independent of each other. What if the probability of event A occurring was dependent on the probability of event B occurring? Events are considered to be dependent when the likelihood of one event's occurrence affects the likelihood of the other event's occurrence. In order to calculate events that share this relationship, we must turn to a third rule of probability.

Conditional Rule: P(B|A) = P(A and B) / P(A)

The conditional rule gives the probability of a second event occurring (B) given the probability of the first event occurring (A). This can also be considered a refinement to the multiplication rule when the occurrences of the events of interest are dependent on each other. The vertical bar (|) in the equation represents the dependence of the two events A and B, in which it is read as "given." For example, what is the probability of examining a color-blind patient, given that the patient is male? We can express this problem as P(CB|M), where the probability of a color-blind male [P(CB and M)] is 8% and the probability of the next patient being a male, P(M), is 35%. Thus, by division, the probability of examining a color-blind patient, given that the patient is a male, is approximately 23%.
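All three rules reduce to arithmetic on counts and proportions. A minimal sketch in Python (ours) reproducing the worked examples above:

```python
# Multiplication rule (independent events): rolling a 3 and a 4
p_3_and_4 = (1 / 6) * (1 / 6)
print(f"{p_3_and_4:.4f}")            # ~0.0278, i.e., ~2.78%

# Addition rule, non-mutually exclusive events (15-piece candy bag):
# 5 yellow pieces, 5 peanut-center pieces, 3 pieces that are both
p_yellow, p_peanut, p_both = 5 / 15, 5 / 15, 3 / 15
print(p_yellow + p_peanut - p_both)  # 7/15 ~ 0.467

# Conditional rule: P(CB | M) = P(CB and M) / P(M)
print(0.08 / 0.35)                   # ~0.23
```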


4.6.2 Bayesian vs. Frequentist Approach

The conditional probability above was first described in Bayes' theorem, where the probability of an event takes into account prior knowledge of similar events. Named after Rev. Thomas Bayes, the theory takes subjective beliefs and experiences into mathematical consideration when calculating the probability of future outcomes. Another way to put it is that the theory provides a way to obtain new probabilities based on new information. For example, having knowledge that demographics are associated with one's overall health status allows health practitioners to better assess the likelihood of their patient being at risk for certain cardiovascular diseases relative to their socioeconomic status. This type of statistics takes a probabilistic approach to the uncertainty in our world, in which probabilities are never stagnant upon the availability of new knowledge. The purveyors of this line of probability theory are commonly referred to as Bayesians.

On the other hand, a more traditional view of statistics takes a regularity or frequentist approach to probability and the uncertainty of our world. The frequentist approach relies on hard data, per se; there is no mathematical consideration of subjective experiences and new knowledge in determining the likelihood of future outcomes. Instead, only the frequency of an event's occurrence relative to the rate of its occurrence in a large number of trials is taken into consideration. For example, the fact that a coin was flipped ten times and only the head was observed does not make the probability of obtaining a head on the 11th flip 100%; the probability of obtaining a head on the next flip still remains 50%. This classical interpretation of probability theory is the one most commonly utilized by statisticians and experimental scientists, in which they and other purveyors of this line of thought are referred to as frequentists.¹²

¹² See Perkins and Wang (2004), Raue et al. (2013), and Sanogo et al. (2014).

4.6.3 Z-Transformation

One of the fundamental qualities of a normal curve (or normal distribution) that was discussed was that the total area under the curve is equal to one. In Sect. 4.5, Distributions, our attempt at smoothening out the curve that blanketed the bars of the histogram was accomplished by the introduction of more bars, i.e., an increase in observations or sample size. Interestingly enough, it is this action that ultimately leads to the measurement of the area under the curve—a concept with origins in calculus. The area under a curve is divided into numerous rectangular strips (i.e., bars), the area of each individual strip is measured, and then the summation of those individual areas produces the area of the whole—a process commonly known as integration.

The ability to isolate certain areas under a curve is critical in descriptive statistics. Is knowledge of calculus required to do this? Luckily not. What we do require is an understanding of the standard normal curve and the process of the z-transformation. The standard normal curve has all of the qualities of a normal distribution described in Table 4.13, along with three additional qualities discussed next.

The most important and differentiating quality of a standard normal curve compared with other normal curves is that the mean (μ) is equal to 0 and the standard deviation (σ) is equal to 1. Thus, the center of the graph of a standard normal curve is at zero, and distances from the mean (along the x-axis) are measured in standard deviations of length one (Fig. 4.26). With this notion intact, the second quality of the standard normal curve exemplified is the 68-95-99.7 rule.

As shown in Fig. 4.27, the 68-95-99.7 rule states that approximately 68% of observations fall within one standard deviation to the left and to the right of the mean (μ ± 1σ), approximately 95% of observations fall within two standard deviations to the left and to the right of the mean (μ ± 2σ), and approximately 99.7% of observations fall within three standard deviations to the left and to the right of the mean (μ ± 3σ).
The third quality of a standard normal curve is its usage in the z-transformation process, in which you will find the distribution being referred to as the standard z curve or z distribution. Any normal distribution of a population—regardless of the values of the mean and standard deviation—can be transformed into a standard normal curve. This transformation is useful because it allows for the calculation of certain areas under the curve that are more specific than those available from the 68-95-99.7 rule. For example, the 68-95-99.7 rule fails in being able to quantifiably describe the area that falls within 1.34 standard deviations of the mean. Hence, if a population is able to assume a normal distribution, then the observations are able to be transformed into scores of a standard normal curve by way of the z-score formula, below.

Fig. 4.26 Standard normal curve (μ = 0; x-axis marked in standard deviations from −3.00 to 3.00)

Fig. 4.27 The 68-95-99.7 rule on a standard normal curve

z-Score Formula: z = (Xᵢ − μ) / σ

When transforming an original observation (Xᵢ) into a z-score, the z-score indicates how many standard deviations (σ) the observation is to the right or to the left of the mean (μ) of its distribution. In calculating, the units of the numerator and denominator cancel each other out. Thus, we say that a z-score is a standardized score that is unit-less. An obtained z-score with a positive sign indicates that the z-score—and by association the original observation—is located to the right of the mean; conversely, a negative z-score—and by association the original observation—is located to the left of the mean. The actual value of the z-score itself tells you the distance, in terms of standard deviations, of that specific observation from the mean.

For example, midterm scores in a biostatistics class are normally distributed with an average score of 78 (out of 100) and a standard deviation of 4. If we are interested in determining the proportion that scored below a score of 75 among the rest of the students, then we would essentially need to be able to quantify the shaded area shown in Fig. 4.28.

Fig. 4.28 Original score: the distribution of midterm scores, with the area below 75 shaded (x-axis: scores from 66 to 90; mean = 78)

So, the first step would be to transform the original normal distribution of midterm scores to a z distribution and locate the score of interest

(75) via the z-score formula (Fig. 4.29). Notice the similarities in Figs. 4.28 and 4.29. Also notice that the z-score (−0.75) that corresponds to the original score (75) has a negative sign, indicating that it falls below the mean (0) of the standard normal curve, which in turn corresponds to below the mean of the original normal distribution (78).

Fig. 4.29  Z-score (the z distribution of the midterm scores, with the area below z = −0.75, the transform of the original score 75, shaded)

To determine the proportion of students that scored below a 75, we must find (quantify) the area of the shaded region. Since the tools of calculus will not be utilized here, this can be accomplished by using a standard normal table (Appendix B), which contains the calculated areas of the regions that fall to the left of, or below, any specified z-score. By pinpointing the proportion on the table based on the z-score, we find that approximately 22.66% of students scored 75 or lower. In order to obtain the proportion of students that scored to the right of, or above, a 75, all that is necessary is to subtract the proportion that scored less than 75 from the proportion of total students (i.e., total area under the curve = 1), giving 1 − 0.2266 = 0.7734 or 77.34%. Lastly, in order to determine the proportion of students that scored between 75 and 85, the proportion of students that scored below an original score of 85 (0.9599, at z = +1.75) is reduced by the proportion that scored less than 75 (0.9599 − 0.2266 = 0.7333 or 73.33%). Figure 4.30 provides a stepwise procedure and helpful strategies for solving z-score-related questions. Figure 4.31 is a map that can guide from any starting point to any destination throughout the z-transformation process.

Fig. 4.30  Procedure for solving z-score-related questions: (1) draw the normal distribution N(μ, σ) and shade the area you need to find; (2) calculate the z-score; (3) locate the z-score on the standard normal distribution table; (4) determine the probability

Fig. 4.31  Z-score map: Original Score → z-transformation → Z-Score → Standard Normal Table → Probability/Area

It may be beneficial, at this point, to return to the overarching theme of this chapter—namely, descriptive statistics. The purpose of being able to effectively utilize data that have been properly consolidated is not limited to the process of z-transformation. This section has been appropriately placed at the end as it ties together individual sections, such as distributions and probabilities, into a much larger concept.
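The z-score map in Fig. 4.31 can be traced computationally as well. Below is a minimal sketch of the midterm example, assuming Python with the scipy library, where norm.cdf stands in for the standard normal table (our tooling assumption, not the text's):

from scipy.stats import norm

mu, sigma = 78, 4                     # midterm mean and standard deviation

z_75 = (75 - mu) / sigma              # z-transformation: z = -0.75
below_75 = norm.cdf(z_75)             # area to the left of the z-score
above_75 = 1 - below_75               # complement: total area under the curve = 1
between = norm.cdf((85 - mu) / sigma) - below_75

print(round(below_75, 4))             # ~0.2266 -> about 22.66% scored below 75
print(round(above_75, 4))             # ~0.7734 -> about 77.34% scored 75 or above
print(round(between, 4))              # ~0.7333 -> proportion scoring between 75 and 85

Here norm.cdf plays the role of the standard normal table: it returns the area to the left of a given z-score.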

Think of the process of z-transformation as the tare function on a weight scale. With the ability to essentially zero-out any normal distribution—by transforming it into a standard normal curve—we are able to precisely quantify certain locations and certain distances of interest within the distribution of observations. We can utilize distributions to calculate probabilities and attempt to predict certain outcomes. We are able to learn things like the area between two certain points or, conversely, the points within which a certain area of interest exists. By arming ourselves with these disparate tools and techniques, we have begun the effective utilization of numerical data through description and, certainly, have begun narrowing the scope of uncertainty.

4.7 Self-Study: Practice Problems

1. The following are data collected on the length of hospital stay (in days) from a sample of 25 patients:
6, 11, 4, 8, 14, 30, 1, 3, 7, 11, 4, 9, 5, 22, 25, 17, 2, 5, 19, 13, 21, 26, 26, 20, 29
(a) Should the data be organized into intervals? Explain.
(b) Create a frequency table that describes only the frequency of the observations.
2. A group of college students were interested in understanding the degree of satisfaction their peers had with the campus health office. They gathered 50 responses to their one-question survey, in which 7 said they were very unsatisfied, 9 were unsatisfied, 19 were neither satisfied nor unsatisfied, 11 were satisfied, and 4 were very satisfied. Create a frequency distribution that describes the frequency and cumulative frequency of the responses.
3. The following is a frequency table that organizes the weights (kg) of a group of 60 newborn babies. Complete the table by filling in the boxes labeled with the "?".

Frequency table—weights of newborn babies (kg)
Interval (kg)   f    f%       cf   cf%
0.00–0.99       6    10%      6    10%
1.00–1.99       ?    20%      18   ?
2.00–2.99       19   ?        37   61.67%
3.00–3.99       14   23.34%   51   ?
4.00–4.99       6    10%      ?    95%
5.00–5.99       ?    ?        60   ?
Total           ?    100%     ?    ?

4. Create the appropriate graph for the distribution of data from questions 2 and 3 above.
5. The insulin levels (pmol/L) of a sample of 15 diabetic patients were collected an hour after consumption of breakfast and are provided below. Please identify the mean, median, and mode:
356, 422, 297, 102, 334, 378, 181, 389, 366, 230, 120, 378, 256, 302, 120
6. A scientist measures the rate of replication (s−1) for a sample of bacteria colonies from ten Petri dishes. Determine the appropriate standard deviation and variance of the sample (hint: use the table method from Fig. 4.13):
2.33, 2.02, 1.99, 1.53, 0.99, 1.26, 1.18, 3.50, 0.22, 2.62
7. Determine the range and interquartile range from the data set above. Which of the measures (including those in question 6) are better descriptions of dispersion? Explain.
8. A local urgent care clinic reviews the recorded patient illnesses that were treated in the previous month from 450 patients—275 males and 175 females. Their reports found the following number of diagnoses: 101 common colds, 274 bodily injuries, 76 urinary tract infections (UTIs), 62 ear infections, and 100 unexplained pains. Approximately 106 of the bodily injuries were male patients and 55 of the UTI cases were female patients.
(a) What is the probability of randomly selecting a diagnosis of the common cold and an ear infection?
(b) What is the probability of randomly selecting a diagnosis of unexplained pain or a bodily injury?
(c) What is the probability of randomly selecting a male patient or a bodily injury case?

9. After a family banquet, 75% of family members were exposed to the peanut butter cheesecake, out of which 35% developed acute inflammation. It was also found that 5% of the remaining family members who were not exposed to the cheesecake also reported acute inflammation.
(a) What is the probability of a family member showing signs of acute inflammation?
(b) Given those who reported acute inflammation, what are the chances of them actually being exposed to the peanut butter cheesecake?
(c) Given those who reported acute inflammation, what are the chances that they were not exposed to the peanut butter cheesecake?
10. Scores on a spirometry test are used to determine lung function based on the volume of air that is inspired and expired. The scores approximate a normal distribution in the population with a mean of 5.05 liters (L) and a standard deviation of 0.78 (L). For each problem, use the z-score formula and the standard normal probability table to determine:
(a) What proportion of scores fall above 6.13?
(b) What proportion of scores fall below 5.44?
(c) What proportion of scores fall below 4.20?
(d) What proportion of scores fall between 5.44 and 6.13?
(e) Which score marks the lower 10% of the population?
(f) Which score marks the upper 60% of the population?
(g) Which scores represent the middle 95% of the population?
(See back of book for answers to Chapter Practice Problems)
5  Inferential Statistics I

Contents
5.1 Core Concepts  71
5.2 Conceptual Introduction  72
5.3 Principles of Inference and Analysis  73
5.3.1 Sampling Distribution  74
5.3.2 Assumptions of Parametric Statistics  76
5.3.3 Hypotheses  76
5.4 Significance  77
5.4.1 Level of Significance  77
5.4.2 P-Value  79
5.4.3 Decision-Making  80
5.5 Estimation  84
5.6 Hypothesis Testing  86
5.7 Study Validity  88
5.7.1 Internal Validity  88
5.7.2 External Validity  89
5.8 Self-Study: Practice Problems  89

5.1 Core Concepts

Nicole Balenton

Translational research finally reaches the discussion of the second branch of statistical analysis—inferential statistics, which are statistics used to make generalizations or inferences based on information obtained from actual observations. Unlike its counterpart, descriptive statistics, inferential statistics goes beyond the real observations and helps generalize to a particular population. This chapter discusses the principles of inference and analysis and the various techniques included.
We begin the chapter by introducing the critical notion of the sampling distribution and how it facilitates the inference consensus. Before researchers can proceed to utilize inferential statistics efficiently, the assumptions of parametric statistics must be met. Should any of these three assumptions be violated, the researcher must turn to nonparametric statistics instead, which will be further discussed in Chap. 7. Moreover, the formulation and examination of hypotheses also play a critical role in scientific research.
Tools of significance testing are used during data analysis to help determine the validity of the hypotheses, therefore guiding the interpretation

of a decision. The decision-making process during hypothesis testing has potentially two forms of error that may occur, namely, Type I and Type II errors. This chapter goes more into detail about what constitutes these types of errors, the elements of a power analysis, and how they establish the power of a study.
Estimation is also related to inferential statistics, as its techniques are used to precisely and accurately estimate/predict the actual population. Used in conjunction with hypothesis testing are the tools of estimation (e.g., confidence interval and level of confidence), which increase the robustness of the study. At the heart of inferential statistics is hypothesis testing, which is used as the chief method to determine the validity of a hypothesis. Through the six steps of hypothesis testing, researchers can determine the validity of a hypothesis by assessing the evidence. This basic protocol is the foundation that will be used in all statistical tests that will be mentioned in the next chapter.
Overall, the quality of a research study is scrutinized by validity, whether it be internal or external. Researchers look at the soundness of the entire study, including the study design, methodology, and data analysis, and how the findings truly represent the phenomenon being measured. A research study that is valid is solid because it is well-designed and the findings are appropriate to generalize or infer to the population of interest.

5.2 Conceptual Introduction

One of the earliest survival mechanisms developed by Kingdom Animalia was the ability to learn and adapt. Take the poison dart frog species pictured in Fig. 5.1, for example. The frog's brilliantly colored body warns (or reminds) predators of the slow and painful death caused by feeding on the venomous species. But there was a time where the luminous color actually seduced predators to the possibility of the delicious meal that awaits. This temptation was swiftly suppressed after experiencing the death of similarly situated predators consuming the colorful prey, or the predator itself falling ill for a period of time. Predators quickly learned that the brilliant colors of prey meant a dooming venomous death. Even other prey adapted this antipredator technique and defense mechanism of warning coloration—a concept referred to as aposematism.

Fig. 5.1  Dart frog (NightLife Exhibit: Color of Life—Cali. Academy of Sciences 2015)

In order to ensure genetic success, there soon was a certain mutual understanding developed within the arenas of the wild. Predators understood the consequence of feeding on prey with seductive neon-like colors. Prey understood that warning coloration is a powerful piece of artillery to add to their arsenal in a world full of predators. Thus, this mutual understanding established among the earliest of predators and prey became a type of generalization to add to the repertoire of survival skills for future generations. This generalization in the simplest form equated brilliant colors with poison—an association that still to this day is taken advantage of by both predator and prey.
As rightful heirs of Kingdom Animalia, we too adapt to our surroundings for survival based on certain generalizations. As children, we learn to never take candy from strangers. As students, we learn to always strive for the best grades. As scientists, we learn that all questions are worth asking. The words italicized are commonly referred to as absolutes or universals, but was it not already established that nothing is truly absolute? That absolutes necessitate all-knowing truth? Indeed, that notion still remains pertinent. In fact, it is not the case that all brilliantly colorful animals in the wild are poisonous. The California mountain kingsnake (Fig. 5.2) takes advantage of coloration by using its red, black, and yellow stripes to "warn" predators to stay away—but this intelligent snake is neither venomous nor harmful. Similarly, strangers posing as preschool teachers seem to be exempt when offering candy to children.

Fig. 5.2  California mountain kingsnake (Jahn 2017)

Fig. 5.3  The population–sample interaction

The list of exemptions to absolutes or universals can continue ad infinitum. But once an exemption is exposed, they are no longer considered absolutely true. Rather, we accept certain things to be generally true—such statements are true most of the time. But what virtue does a truth statement hold if it is not always true? We digress. Yet, the general truth still contains a certain stronghold on our knowledge. Generalizations are essentially made based on the frequency of our observations and the probability of making similar or contradictory observations. Our ability to make generalizations can serve as useful heuristics or harmful stereotypes. The science of making accurate and precise generalizations that are based on, and go beyond, actual observations is referred to as inferential statistics. Whether for statistics in general or for biostatistics in translational healthcare specifically, inferential statistics is used to make inferences about the population based on observations collected from samples.1 This is the essence of the population–sample interaction depicted in Fig. 5.3 and previously discussed in Chap. 3. Briefly, the mere fact that it is neither feasible nor practical to collect observations from an entire population proves the utility of inferential statistics. Instead, we collect a representative sample of observations in order to infer critical pieces of information regarding the population. Because a population is characterized by parameters, inferential statistics is often referred to as, and is interchangeable with, parametric statistics—that is, making inferences about the parameters (population) that are based on and go beyond the statistics (sample). The core principles of statistical analysis underlying inferential statistics are discussed next and are considered for the remainder of this book's first half on translational research.

1 Notice that inference, generalization, and conclusion can all be considered synonymous.

5.3 Principles of Inference and Analysis

Inferential statistics is the second branch of the fundamental concept underlying statistical thought—the first of which was descriptive statistics, as outlined in Chap. 4. Along with this concept enters the third leg of the stool representing the foundation of the research process, namely, data analysis (Fig. 5.4). Data analysis refers to the statistical techniques that analyze both quantitative and qualitative data in order to render information not immediately apparent from mere raw data. Indeed, we learned a few of the core concepts of data analysis under the framework of descriptive statistics. So why not introduce data analysis in the previous chapter?
To be clear, all data analyses take place within the frameworks of descriptive and/or inferential statistics. But inferential statistics is set apart from descriptive statistics because the latter reaches a halt after describing and organizing the

observations at hand (i.e., the sample). Inferential statistics provides us with the analytical tools that take the descriptions and information learned from the observations and allow them to be extrapolated onto the greater population. That is, it is post-data analysis that the study reaches its conclusion, in which the formulation and application of an inference consensus is permitted (Fig. 5.5). As we will see, it is quantitative data that have the advantage over qualitative data within the context of inferential statistics. The principles that require consideration and the criteria that are necessary to effectively utilize the techniques of inferential statistics are discussed next.

Fig. 5.4  Conceptualization of the research process as a three-legged stool (legs: study design, methodology, and data analysis)

Fig. 5.5  Research pathway: research question → study hypothesis → (study design, methodology, data analysis) → conclusion → inference consensus

5.3.1 Sampling Distribution

There are many factors that, if satisfied, contribute to the ability to infer certain conclusions about a population based on the sample. We begin with the qualities of the particular sample that facilitate the inference consensus. The single most important quality of a sample—and hence its method of collection—is that it be a random sample. As discussed in depth in Chap. 3, we strive to collect samples that are random because this seemingly non-biased approach does its best to provide information that is most representative of the sample's parent population. However, even at its best, a random sample only occasionally represents its parent population precisely and accurately.
So, we push the boundaries by collecting multiple different random samples of the same population in order to increase our chances of having captured, per se, the different qualities of the population that make it so unique. Thus, if multiple random samples are collected and their characteristics summarized using measures of central tendency and variability and subsequently graphed via a histogram, what we obtain is called the sampling distribution of the mean.
The sampling distribution of the mean is a distribution of collected sample means (x̄) from numerous random samples of a given size (n) obtained from a single population (Fig. 5.6). The sampling distribution is arguably the most important notion of inferential statistics because it facilitates the comparison of many samples to the



population, ultimately permitting the inference consensus.

Fig. 5.6  Sampling distribution of the mean (sample means x̄1 through x̄9 distributed around μx̄ = μ)

By now, we should be quite familiar with the ins and outs of distributions. Distributions are able to be effectively described by two different yet related measures of central tendency and variability—specifically, the mean and standard deviation. The distribution of sample means itself has both a mean and a standard deviation relative to the population from which the samples were obtained.

• Mean (μx̄) of the sampling distribution of the mean represents the mean of all of the sample means, which is ultimately equal to the population mean (μ).

μx̄ = μ

• Standard error of the mean (SEM) is essentially the standard deviation of the sampling distribution of the mean, which is equal to the population standard deviation (σ) divided by the square root of the sample size (n).

σx̄ = σ / √n

Indeed, this is why the importance of understanding the fundamental concepts of means, standard deviations, and distributions was stressed in Chap. 4. The SEM is a special form of variability used to measure the dispersion or spread of the data. This is due to the fact that the data contained within the sampling distribution of the mean are no longer composed of many single observations but rather are composed of numerous random samples.
Moreover, it is referred to as the standard error because it considers the large probability that the random samples being considered may not actually be precise and accurate representations of the population. This measure also exemplifies the unavoidable random error inherent to a research study—a concept distinct from systematic error.2 Although random error is unavoidable, the amount of error introduced has the ability to and should be reduced by increasing the sample size (n). Mathematically speaking, an increase in the value of the denominator (√n) results in a smaller value of the entire fraction of SEM. Also, conceptually speaking, obtaining a sufficiently large sample size translates to a higher chance of having a more accurate representation of the population.
Now, with all of these factors considered, we are one step closer to being able to make more accurate and precise inferences about a certain population based on the sampling distribution. Having discussed the various intricacies of a sampling distribution, we move toward a concept that ties the above concepts together, allows the determination of the shape of the sampling distribution, and also happens to be fundamental to the usability of inferential statistics. The concepts in this section, both above and below, will continue to reappear in some form during the next chapter.
The central limit theorem states that a sampling distribution with a sufficiently large sample size will approximate a normal distribution, regardless of the shape of the population distribution. The lack of regard to the shape of the population is supported by obtaining a sufficiently large sample size, which happens to depend on the shape of the population distribution. To elaborate, if the population is normally distributed (and known), then even a small sample size will be sufficient to render a normal distribution of sample means. But if the population is not normally distributed (or is unknown), then we can use a generally accepted rule that a minimum sample size of 30 will suffice for a good approximation of the population being normally distributed.

2 See Chap. 1, Sect. 1.2.1.2.
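The central limit theorem and the behavior of the SEM can be watched in action with a short simulation. The sketch below is a minimal illustration, assuming Python with the numpy library (our tooling choice, not the text's): it draws many random samples of size n = 30 from a decidedly non-normal (exponential) population and summarizes the resulting sampling distribution of the mean.

import numpy as np

rng = np.random.default_rng(0)
n, n_samples = 30, 10_000   # n = 30 satisfies the rule of thumb above

# Exponential population: mu = 1, sigma = 1, and decidedly not normal
samples = rng.exponential(scale=1.0, size=(n_samples, n))
sample_means = samples.mean(axis=1)   # one x-bar per random sample

print(round(sample_means.mean(), 3))        # ~1.0, i.e., the population mean mu
print(round(sample_means.std(ddof=0), 3))   # ~0.183 = sigma / sqrt(30), the SEM

A histogram of sample_means is approximately normal even though the parent population is heavily skewed, which is precisely what the central limit theorem promises.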

5.3.2 Assumptions of Parametric Statistics

As mentioned in the beginning of the chapter, there are certain criteria that need to be met in order to reap the benefits of inferential statistics. Those benefits include the ability to make conclusions or generalizations about a population based on the sample and actually utilize these inferences toward the betterment of the population in study. The criteria we are to speak of are commonly taken to be assumptions simply because—when presented with statistical inferences—it is assumed that these criteria were met or else an inference consensus would not be permissible. However, as lead investigators of a research study, we must use these as criteria or qualifications necessary to make sound parametric inferences.
The three assumptions of parametric statistics are:

• Normal Distribution—the data under consideration must be normally distributed (see Sect. 4.4 for normal distributions). This assumption is proven via the central limit theorem and also implies that the data are quantitative and not qualitative.3
• Independence of Measurement—the method by which the data (i.e., the different samples) were collected or measured must be independent of each other. This assumption is established by appropriate methodology and design, along with the inherent nature of the variables chosen.
• Homogeneity of Variance—the distribution of data under consideration must be dispersed (vary) to a relatively similar (homogeneous) degree. This assumption is also supported by the central limit theorem in terms of standard error.4

The importance of satisfying these assumptions cannot be stressed enough. The ability to effectively utilize inferential statistics—that is, to draw inferences about parameters based on statistics—rests on the manifestation of these three criteria. Should any of them be violated, then we no longer have the ability to make accurate and precise parametric inferences (i.e., generalizations about the population). Instead, we must settle for making nonparametric inferences that are inherently unable to produce accurate and precise generalizations (see Chap. 7 for more).

3 Statistical inferences with qualitative data can only be nonparametric inferences. See Chap. 7 for more.
4 A large amount of standard error dictates heterogeneity of variance, resulting in inaccurate and imprecise parametric inferences.
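In practice, the first and third assumptions are often screened numerically before a parametric test is chosen. The sketch below shows one common way of doing so; it assumes Python with the scipy library and uses entirely hypothetical data, and it is an illustration rather than a procedure mandated by the text:

import numpy as np
from scipy.stats import shapiro, levene

rng = np.random.default_rng(1)
group_a = rng.normal(50, 10, 40)   # hypothetical sample, one study arm
group_b = rng.normal(55, 10, 40)   # hypothetical sample, the other arm

# Normal distribution: Shapiro-Wilk tests for departure from normality
_, p_normal = shapiro(group_a)
print(p_normal > 0.05)   # True here -> no evidence against normality

# Homogeneity of variance: Levene's test compares the groups' variances
_, p_var = levene(group_a, group_b)
print(p_var > 0.05)      # True here -> variances are comparably dispersed

# Independence of measurement has no numerical test; it rests on the
# study's methodology and design, as noted above.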
5.3.3 Hypotheses

Another core principle relative to inferential statistics and data analysis is the formulation of hypotheses. Recall (or refresh) that Chap. 1 emphasized the importance of the study hypothesis and hypothesis-driven research relative to the research process. Simply the research question stated positively, the study hypothesis is both the starting point and guiding tool throughout the entire study—reaffirming the notion of it being the driving force behind the research process (Fig. 5.7). As we will see toward the end of this chapter, the formulation and examination of hypotheses are critical to hypothesis testing.

Fig. 5.7  Hypothesis-driven research (the hypothesis at the center of study design, methodology, and data analysis)

In terms of data analysis, a hypothesis is a conjectural statement regarding a set of variables and the relationship that is shared between them. For example, it can be hypothesized that psychological trauma during childhood has an

effect on future academic performance. It would also be just as valid to hypothesize that psychological trauma during childhood has no effect on future academic performance. The former is chosen only when there is a hunch or prior evidence that suggests its validity. Regardless of how the hypothesis is formulated, it is the relationship between psychological trauma during childhood and future academic performance that will be tested, determined, and—if applicable—inferred. Moreover, these statistical hypotheses—hypotheses that claim relationships among certain variables—contend the existence of something unique underlying the population of interest, which promotes further investigation.
The example of the hypotheses above is also an example of the two main types of statistical hypotheses used within the world of research. A null hypothesis, symbolized as H0 and read as "H naught," is a hypothesis that claims that there is no relationship between the variables being considered. The null hypothesis can be formulated in many ways, in which it most often claims no effect, no difference, no association, etc. The second type of hypothesis is referred to as the alternative hypothesis, H1, which claims that there is a relationship between the variables being considered. The alternative hypothesis is essentially the opposite of the null hypothesis. However, it is the null hypothesis that is most commonly asserted, considered, and tested. There are many reasons to favor the null hypothesis that will be discussed throughout the remainder of this chapter. Perhaps one of the most basic explanations involves removing any notions of bias and other errors from the study.
The determination of whether the hypotheses are true or not is equivalent to answering the research question, which takes place at the conclusion of the study. Notice that data analysis is the final step before the conclusion of the research process, in which the outcome of analyzing the data promotes the decision that is to be made regarding the hypotheses. Thus, after successfully determining which hypothesis was "correct" and which was not, we are able to take the information contained within the hypothesis and translate it onto the population. Though our ultimate purpose in this chapter may be to understand hypothesis testing, we must first understand the intricate concepts that are inherent to testing a hypothesis. In the next section, we discuss the main concepts behind hypothesis decision-making.

5.4 Significance

The data analysis section of a research study should provide the evidence or proof necessary to effectively determine the validity of the hypothesis that started the investigation in the first place. The statistical reasoning tools and techniques utilized within the framework of data analysis are mathematical in nature (Chap. 6). However, the decision that is made regarding the hypothesis is not mathematical—we simply decide whether to accept or reject H0. We will see that our decision is based on evidence provided by data analysis, which renders the findings as either significant or insignificant. Therefore, there must exist some tool that can be utilized to directly translate the results from data analysis and guide our decision-making process. These tools are used within the context of significance testing.

5.4.1 Level of Significance

The first tool of significance testing is the level of significance (α), also called "alpha" or "alpha level," which refers to the threshold that the observed outcome—resulting from the null hypothesis—must reach in order to be considered a rare outcome (Fig. 5.8).
The level of significance is an arbitrary value that is determined at the discretion of the investigator, at the onset of a research study, and relative to the particulars of the study. Case in point, common practice is to set the level of significance at 0.05 or 5%.
The reason the level of significance is set prior to the actual analysis of data is to—yet again—prevent any introduction of bias. The level of significance also describes the actual area of its distribution as well. This means that if our level of significance is 5%, then the shaded areas in the

Fig. 5.8  Level of significance (rare occurrences fall in the tails of the distribution; common occurrences fall in the center)

Fig. 5.9  5% significance (α = 0.05, split as 0.025 in each tail)

tails of the distribution in Fig. 5.9 should be equal to 0.05. In the same breath, this measure also considers the amount of error we permit into our study, which will be expanded on a little later in this chapter.
Rare outcomes are, obviously, opposed to common outcomes. Figure 5.9 shows a normal distribution that delineates this difference—and this difference makes sense. Understanding the normal distribution is understanding that the observations in the middle have the highest chance of occurring, signified by the large area (common), whereas the observations contained in the tails of the distribution have the lowest chance of occurring, signified by their very small areas (rare). Thus, in consideration of the entire distribution, if the investigator desires a level of significance of, say, 0.10, then they are essentially setting aside 10% of the potential observations as observations that are most different (uncommon/rare) from what is hypothesized in the original null hypothesis.
The observed outcome we refer to in the definition above is, mathematically, the actual test statistic that is obtained from data analysis. Another way to visualize the process of analyzing our data is noticing that the claim of the null hypothesis is being quantified based on the collected data, whereby a test statistic can be obtained. The test statistic (i.e., the observed outcome) is essentially the proof or evidence that will be used against the null hypothesis.5

5 We present this information only for clarification purposes; test statistics and the actual formulae of the tests utilized in analyzing data are discussed at great length in the following chapter. For now, just understand that the outcome of data analysis is a test statistic that is judged as either common or rare, in order to make a decision about the null hypothesis.
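The tail areas of Fig. 5.9 can be pinned down numerically. As a minimal sketch (assuming Python with the scipy library, a tooling choice of ours), the inverse of the standard normal cumulative distribution function returns the cutoff that marks off a chosen level of significance:

from scipy.stats import norm

alpha = 0.05
z_cut = norm.ppf(1 - alpha / 2)   # two-tailed cutoff: ~1.96

print(round(z_cut, 2))
# The two tails together enclose exactly alpha of the distribution:
print(round(norm.cdf(-z_cut) + (1 - norm.cdf(z_cut)), 4))   # 0.05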

So, then, how do we use the level of significance and the test statistic in order to ultimately make a decision regarding the null hypothesis? A simple answer is that the test statistic is visualized within the context of the level of significance in order to render the observed outcome as either a rare or common occurrence. But, in reality, we are unable to simply compare the two, as there are numerous different test statistics and only one level of significance. Thus, there must be some measure that standardizes all test statistics and can be compared to the level of significance.

5.4.2 P-Value

The most common statistical measure used in significance testing is the p-value. The p-value is the probability of observing similar or more extreme occurrences of the actual observed outcome, given that the null hypothesis is true. Every parametric test statistic has an associated p-value that is comparable to the level of significance. The p-value essentially considers the evidence that goes against the hypothesis to be attributable to error and suggests that the outcome that was observed may have occurred just by chance. In this context, the null hypothesis is given the benefit of the doubt; its claim of no difference is considered to be probably true from the start of the study.6

6 Think: "innocent until proven guilty."

For just a moment, conceptualize the p-value as simply being the probability of generally observing a specific outcome. Let us assume that the outcome we are interested in observing is a winning lottery ticket. You are already aware of the slim chances of winning—about 1 in 175,000,000. But because you are a competent biostatistician, you know that the chances are even more slim if you do not purchase a ticket at all. So you purchase a ticket. If you do not win the lottery, then are you any different from the vast majority of other players? Is it uncommon for you to have lost? Is there anything significant about your particular situation? No, the chances favor your loss (i.e., the probability of losing or the p-value was very high).
On the other hand, if the winning numbers came to you in a dream and you ended up winning the lottery over and over again, then you are different from the vast majority of other players, it is uncommon or rare for you to have won multiple times, and there is something significant about your situation. Why? Well, because the p-value, i.e., the probability of winning the lottery, was about 1/175,000,000, and YOU were that one! And not just once—you were that one each and every time!
Notice that in the above example, the null hypothesis essentially stated that there was no difference between you and the rest of the population in winning the lottery. It is not as if we delineated you from the start by claiming that you were an outcast that supposedly had revelations in your dreams. No, we gave you the benefit of being no different than the rest of the population. It was only when you won multiple lotteries consecutively with exceptionally low chances that your situation became a statistically significant situation relative to the rest of the population. Furthermore, it could not have simply been due to chance that the observed outcome (i.e., you winning the lottery) occurred multiple times. Thus, it is only when there is an exceptionally low probability of observing similar or more extreme outcomes than the observed outcome that evidence against the statement of no difference (H0) is substantiated. This signifies that the specific outcome that was observed did not occur by chance alone—something special happened here. The question, then, that remains is: What constitutes an exceptionally low probability? Better yet, at what level do we delineate the difference between an outcome that, if observed, occurred by chance alone and one that does not occur by chance alone?
The level of significance, of course! Therefore, if the p-value is less than the level of significance (α), then the observed outcome is statistically significant. On the other hand, if the p-value is greater than the level of significance (α), then the

observed outcome is not statistically significant (Table 5.1).

Table 5.1  P-values and the level of significance
P-value < level of significance (α) ⇨ statistically significant ⇨ reject H0
P-value > level of significance (α) ⇨ not statistically significant ⇨ retain H0

The p-value and the level of significance share important similarities and even more important differences. Notice that both measures are innately probabilistic, depend on the null hypothesis, and are characterized with the observation of rare outcomes—all utilized to guide the decision-making process with statistical significance. Still, it is their slight differences that make them so important to scientific research. The level of significance is an arbitrary number determined at the onset of the study and at the discretion of the investigator. The p-value, on the other hand, comes into play after data analysis; the p-value is determined relative to the specific data and test statistic used. Lastly, it is important to note that p-values are most commonly obtained from statistical software applications and can also be roughly measured through different standardized tables—both of which are described in the next chapter. Table 5.2 compares and contrasts alpha and p-value.

Table 5.2  Significance measures
Alpha (α)                        p-value
Probability                     Probability
Dependent on H0                 Dependent on H0
Used in hypothesis testing      Used in hypothesis testing
Determined before analysis      Determined after analysis
Dependent on investigator       Dependent on data

5.4.3 Decision-Making

In statistical analysis, the decision-making process hinges on the presence and/or absence of statistical significance. Significance testing guides the assessment of the evidence provided by the data analysis, in which the probability of the outcome's occurrence is taken into consideration. We ask questions like: "Could this outcome have occurred by chance alone?"; "Might this outcome have been due to sampling errors?" We scrutinize our findings simply because the decision that we make is ultimately translated—better yet—inferred onto the population from which the sample was drawn. Take a moment to consider the gravity behind generalizations that have the potential of influencing the health and overall well-being of a population's constituents.
Therefore, to be able to make accurate and precise generalizations, we must be able to take the results of our significance testing and effectively interpret a decision regarding the hypothesis—a process that is critical when testing a hypothesis. To be clear, because both the level of significance and the p-value address the null hypothesis, the decision made is in regard to the null hypothesis. Of course, we can imply the meaning of this to be the opposite decision made regarding the alternative hypothesis. That said, decisions that are made regarding the null hypothesis are considered strong decisions, due to the support of significance testing. Conversely, decisions made regarding the alternative hypothesis are considered weak, due to the lack of support from significance testing.7

7 See Chiappelli (2014).
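To make the comparison of the p-value and α concrete, consider the chapter's earlier coin example. The sketch below, a minimal illustration assuming Python with the scipy library rather than a method prescribed by the text, computes a two-tailed p-value for observing 9 heads in 10 flips of a supposedly fair coin:

from scipy.stats import binom

n, p0 = 10, 0.5   # H0: the coin is fair (no difference from chance)
heads = 9         # observed outcome

# Probability of an outcome at least this extreme under H0,
# doubled for a two-tailed test (the distribution is symmetric at p0 = 0.5)
p_value = 2 * binom.sf(heads - 1, n, p0)

alpha = 0.05
print(round(p_value, 4))   # ~0.0215
print(p_value < alpha)     # True -> statistically significant: reject H0

Had we observed, say, 6 heads, the same computation would give a p-value near 0.75, far above α, and the decision would be to retain H0.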

Fig. 5.10  Rejecting and retaining the null hypothesis (strong: a statistically significant difference → reject H0 → H0 is probably false, H1 is probably true; weak: no statistically significant difference → retain H0 → H0 might be true, H1 might be false)

is to retain H0.8 Instead, the claim that there is no difference between the variables in consideration is retained until further evidence can prove otherwise. Retaining or failing to reject H0 does not necessarily mean that its claim is true, per se—this decision only weakly implies that H0 might be true (and that H1 might be false) (Fig. 5.10).

8 Earlier we mentioned that this case would lead to accepting H0. But "accepting" is a bit too strong of a word to use in a scientific context—instead we are better off deciding to retain H0 or stating that we fail to reject H0.

Realize the immense pressure of statistical significance during the decision-making process and on the research study as a whole. Unfortunately, the scientific community has become comfortable with associating insignificant results with insignificant studies. Could it be that just because the significance testing rendered the findings stimulated by H0 as insignificant, the information provided by the whole study is of no value? Of particular concern is the p-value. Consider a p = 0.06, for example, which would result in a statistically insignificant outcome and a decision to retain H0. Did you waste all the time, money, and resources that were invested into your study just because the p-value was one-hundredth of a decimal off? The answer to both questions posed is negative. This raises a discussion regarding the overreliance on p-values and the publication bias that are prevalent in the current research community.9

9 See Wasserstein and Lazar (2016).

5.4.3.1 Errors in Decision-Making
Earlier in this section, the importance of sound decision-making was discussed in the context of inferential statistics. Indeed, should all elements of a research study be done correctly and properly, then there is no reason why the conclusion ought not be inferred onto the population and the findings disseminated throughout the scientific community. However, it is not always the case that the decisions we make are correct decisions—after all, we are but only human. That is not to say that it is always an error of judgment; rather, it can also be due to spurious data.
The two forms of errors that may occur in decision-making during hypothesis testing are:

• Type I error—rejecting a true H0
–– Researcher incorrectly rejected the null hypothesis, rendering it as being probably false when, in reality, its claim is probably true.
–– The decision should have been to retain H0.
–– The study concludes by allotting the inference of the existence of a difference between the variables in consideration by H0, when there most probably was no real difference after all.
• Type II error—retaining a false H0
–– Researcher incorrectly retained (or failed to reject) the null hypothesis, presuming the claim as being probably true when, in actuality, its claim is probably false.

–– The decision should have been to reject H0.
–– Researchers risk generalizing their observation of no difference, instead of realizing that there actually is a difference between the variables in consideration by H0.

Perhaps you are wondering, as biostatisticians now, what the chances are of making these types of errors. It should not be a surprise to learn that the probability of making a Type I error is nothing other than alpha (α), also known as the level of significance. The definition of a significance level has contained in it the assumption that the null hypothesis is true. Therefore, by making a decision you are essentially running the risk that the level established is also the probability that your decision is incorrect. This further stresses the importance of the level of significance being at the discretion of the investigator. By increasing or decreasing the level of significance, your chances of an incorrect and correct decision fluctuate accordingly.
On the other hand, the probability of making a Type II error is referred to as beta (β), which is usually equal to 0.20 or 20%. The conventional value of beta and its relevance to the power of statistical tests will be expounded on in the next section. For now, Table 5.3 shows an organization of the decision-making process.

Table 5.3  Decision-making process
                        Status of H0
Decision                True                  False
Reject H0               Type I error (α)      Correct decision
Fail to reject H0       Correct decision      Type II error (β)

5.4.3.2 Power Analysis
In scientific research studies, power analyses are strategies conducted to establish the power of a study. The techniques examine the relationship between a series of elements relative to the specific statistical tests that are used in data analysis. Indeed, we may have already heard the term power being used to describe certain qualities of a research study. At face value, or colloquially, power may seem to refer to how well the study is able to do something in a certain way or, generally, as the strength or robustness of a study. However, at the most, these may qualify as just loose definitions of the word and the strategies used relative to research. More directly, the power of a study refers to the test's ability to detect an effect size, should there be one to be found. The effect size, α, β, and n are the four elements necessary in power determination and analysis, discussed next.

5.4.3.3 Elements of Power Analysis
During the discussion of statistical hypotheses, it was mentioned that a null hypothesis may also be a claim of no effect. In statistical research, the hypotheses we establish provide the variables that are to be observed in relation with one another during data analysis. In other words, testing a hypothesis usually entails comparing a hypothesized population mean against a true population mean, in which the presence or absence of an effect serves as the relationship. Thus, the size of the effect or the effect size (ES) refers to the extent to which our results are meaningful, where the effect is the difference between the compared means. In order for the difference to be meaningful, then, the observed outcome must be statistically significant.
Therefore, we witness a direct relationship between the size of an effect and statistical significance, such that the larger the effect size, the more statistically significant the findings. In terms of power and power analysis, there must be some notion of what ES might be detected from a preliminary or pilot study (see Sect. 5.5). Let us use an example for clarification with the null hypothesis below; feel free to replace the phrase no difference with no effect.

H0: There is no difference between the effectiveness of nicotine gum and the effectiveness of e-cigarettes in smoking cessation.

According to the decision-making process, we know that if the data analysis provides a statistically significant difference between the effectiveness of the two treatments, then the decision is to reject H0. By rejecting H0, we are rejecting the claim of no difference or no effect. In other words, we are saying that there is, indeed, an effect! Let

that simmer for a moment. Again, if H0 claims no effect when comparing the two treatments (nicotine v. e-cigarette) and our decision is to reject the null hypothesis (due to a statistically significant outcome), then we are in actuality saying that there is an effect (difference) between the two treatment groups.
On the other hand, if our data had presented a lack of statistical significance, then we would retain H0 and conclude that, indeed, there is no difference (no effect) between the two treatments relative to smoking cessation. Moreover, when the means of the two variables were compared, the size of their difference (effect) was not appreciably large enough to provide a statistically significant outcome. In terms of size, this could very well mean that the difference of the means compared was quite small or even nonexistent—in other words, more or less, close to the value of zero.
This lends a hand to the importance of the level of significance (α) as an element in establishing the power of a study and, more specifically, conducting a power analysis of a specific statistical test. By adjusting our level of significance, we essentially affect the chances of making both an incorrect and correct decision regarding the null hypothesis. For example, a study with an α = 0.10 denotes that there is a 10% likelihood of the decision being a Type I error, along with a 90% likelihood (1 − α) of it being a correct decision. By decreasing our alpha, we increase our chances of making a correct decision, while lowering the chances of an incorrect decision. More so, in terms of effect size, a statistically significant outcome will render a sizeable effect, should there be one to be found. This also settles our worry of making an erroneous decision when there is a lack of statistical significance.
Now, we can further our definition of power to be the probability of rejecting H0 when it is actually false. Notice that rejecting a false H0 is a correct decision. This definition of power bears a stark resemblance to the one provided in the opening of the section, both being equally valid. Also, realize that this definition of power essentially represents the opposite of making a Type II error (β). We can further view power as being the strength behind the investigator's accuracy regarding the observed differences—that is, making a correct decision. Thus, in order to calculate the power of a study, we simply take the complement of β, shown below.

Power = 1 − β

Unlike α, the size of β is neither arbitrarily set prior to data analysis, nor will it be known after a decision is made regarding H0. In reality, the level of β is not so important as a measure by itself; rather, it is important in terms of power. It is the role that β plays along with the other elements of a power analysis that makes its understanding even more imperative.
The last element critical to the power of a study and the power analysis of a statistical test is sample size (n). The importance of collecting a sufficiently large sample size is not limited to power and the elements of a power analysis either. Notice the emphasis on sufficiently large sample size. A good study is not one that has an extravagantly large sample size. Instead, a good study is one that has a sample size that is large enough (i.e., sufficient) to attain statistical significance, should it exist. Each type of statistical test used in data analysis has a specific sample size that is appropriate for the study. The formulas used to determine an appropriate sample size relative to power and the specific statistical tests are discussed in the next chapter.
Conclusively, in order to establish the power of a study, there must be some consideration of these four elements (i.e., α, β, ES, and n) and their interrelated relationship in power analyses relative to statistical tests. It will be convenient to know that establishing any three of these elements will subsequently determine the fourth, in addition to the overall power of the study. This implies that there are actually four distinct power analyses that can be employed during any study and relative to any statistical test, depending on which three out of the four elements are selected. Those notwithstanding, we recommend the most practical approach, particularly for research in the health sciences, to be the establishment of the ES, alpha (= 0.05), and

beta (= 0.20) in order to determine an appropriate sample size, as referred to above by the formulae mentioned in the next chapter.
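That recommended configuration (ES, alpha = 0.05, beta = 0.20) can be explored by simulation. The sketch below is purely illustrative: it assumes Python with the numpy and scipy libraries, and it assumes an effect size of 0.5 standard deviations for the nicotine gum versus e-cigarette comparison; none of these choices come from the text:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha, effect, n = 0.05, 0.5, 64   # assumed ES of 0.5 SD; n = 64 per group

# Simulate 2,000 studies in which H0 is truly false (a real effect exists)
rejections = 0
for _ in range(2000):
    gum = rng.normal(0.0, 1.0, n)       # hypothetical nicotine-gum outcomes
    ecig = rng.normal(effect, 1.0, n)   # hypothetical e-cigarette outcomes
    if ttest_ind(gum, ecig).pvalue < alpha:
        rejections += 1

print(rejections / 2000)   # empirical power, roughly 0.80 (i.e., 1 - beta)

Rerunning the simulation with a smaller n shows the power falling below the conventional 0.80, which is exactly the trade-off the four elements describe.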
5.5 Estimation

The ability to make estimates regarding the parameters of a population is pertinent to inferential statistics. Some even argue the utility of estimation over hypothesis testing, as the latter only determines the presence or absence of an effect.10 Nevertheless, estimation can be used in conjunction with hypothesis testing in order to increase the robustness of our study. Particular to research in the health sciences, the techniques of estimation are used in order to estimate the true parameter of a population—a measure often unknown. The topic and importance of estimation return to that of a population, in general.

10 See Charles and Woodson (1969).

Take the population of human beings on planet Earth. As this is being written, there are approximately 7.5 billion human beings on planet Earth. So, does that mean that there are exactly 7.5 billion? Of course not—this, like most other parametric measures, is at best simply an estimation. Yet, estimation is a superbly useful technique that is fundamentally adopted from descriptive statistics. Notice that the measure of the population of human beings is simply an average measure of human beings that live on planet Earth (i.e., μ = 7,500,000,000). This is an example of a point estimate that uses a single value to represent the true (and often unknown) population parameter—in this case the population mean.
Undoubtedly, there could be at any instant more than or less than 7.5 billion human beings on Earth. In terms of estimation, the more or less aspect encapsulates the potential error in measurement represented by the population standard deviation (σ). Statisticians and researchers within the health sciences in particular are much fonder of this more or less consideration than they are of just the point estimate alone. That is why it is not uncommon to see estimated measures written in shorthand as the mean ± SD.11

11 Notice the similarity of this concept with those contained within the sampling distribution of the mean.

A range of values that goes beyond just the point estimate, such as mean ± SD, makes us feel a bit more confident in our estimation of the population mean. We are more scientifically poised that somewhere contained within that range is the precise and accurate population mean we are interested in. This form of estimation is referred to as a confidence interval (CI), which provides a range of values containing the true population parameter with a specific degree of certainty.
Now, we may be more accurate in describing the population of human beings on Earth when presented as a confidence interval. For example, we can be 95% confident that the true population mean of the number of human beings living on planet Earth falls between 7,440,000,000 and 7,560,000,000. If we consider the fact that our original point estimate of 7,500,000,000 is the hypothesized population mean of the number of human beings on planet Earth, then it would be more logical to claim that the true population mean is probably either a little more or a little less, as captured by the confidence interval.
A confidence interval must also take into consideration a specific degree of certainty if it is to provide accurate information. This considers, among others, the random error that may have been introduced during the measurement process. But the error we refer to is not simply a single standard deviation above and below the mean (i.e., mean ± SD). Instead, the more or less aspect is taken into consideration by a measure known as the margin of error. The margin of error is the product of the standard error of the mean (SEM) and a specified critical value that is relative to the statistical analysis technique used.12

12 We briefly expand on critical values below but provide a much more extensive explanation in Chap. 6.

A confidence interval implies that there are two products that are obtained from its calculation. The interval is represented by a lower and upper limit that are referred to as the confidence limits. The limits signify the extension of the
The limits signify the extension of the original point estimate on either side (below and above), where the sum is the upper limit and the difference is the lower limit. Confidence intervals are reported by their limits, which are usually written within brackets and separated by a comma (Fig. 5.11). Thus, a standard formula we can use for the construction of a confidence interval is:

CI: mean ± (critical value)(SEM)

CI: [lower (−) limit, upper (+) limit]

Fig. 5.11 Confidence intervals

Notice that the formula above utilizes a sample mean and not a population mean. Indeed, the confidence interval for the true population mean is based on a hypothesized population mean that is obtained from a sample. We consider and can better understand confidence intervals in the context of the sampling distribution of the mean described earlier.13 Hence, a confidence interval is most often constructed around a single sample mean obtained from a sampling distribution, which is hypothesized to be the population mean.

We may be wondering what a critical value is and how to obtain it. In brief, a critical value quantifies the threshold that is determined by the level of significance and is relative to the specific statistical technique used—but this is not of chief concern right now and will be discussed in depth in the next chapter. However, the question as to why we use critical values in terms of confidence intervals is of concern and will be discussed now.

Recall from the definition of a confidence interval that there is a specific degree of certainty in the estimation of a true population parameter, which the example above represented as 95% confidence. The degree of certainty is characterized by the percentage of confidence that is referred to as the level of confidence. The level of confidence represents the probability that a succession of confidence intervals will include the true parameter. In the example above, the level of confidence contains the likelihood of obtaining the true population mean, if multiple confidence intervals were constructed around sample means that were obtained from a sampling distribution (Fig. 5.12).

Moreover, the level of confidence (i.e., degree of certainty, confidence percent, etc.) is established by the level of significance (α), namely, by taking its complement:

CI%: 1 − α

Notice the importance of the level of significance in differentiating true confidence intervals from false confidence intervals, illustrated in Fig. 5.12. At an α = 0.05, 95% of the confidence intervals are true because they contain the true population mean, while 5% are false because they do not. In the equation for computing a confidence interval, the level of confidence is considered by the critical value that is particular to the specific statistical test utilized, as discussed further in the next chapter.

Take a moment to consider the effects on the width or range of a confidence interval when the level of confidence and the sample size change. By increasing the level of confidence, we have essentially chosen a smaller α (e.g., α = 0.01 means a 99% CI) which, as we will see, results in a larger critical value. Holding the SEM constant, this widens the confidence interval, making our estimation less precise. On the other hand, by increasing the sample size, we make the fraction of the SEM smaller. Now, holding the critical value constant, this narrows the confidence interval, making our estimation more precise.14 Of course, there are numerous combinations of different manipulations one can do within a confidence interval. The point we are attempting to impart is that a confidence interval is most practical and beneficial when it has the ability to provide the most precise and accurate estimation possible of the specific population parameter of interest.

13 Mean of sampling distribution = population mean, SEM, and central limit theorem.
14 Imagine looking for a needle in a haystack. Would it not be easier to look for that needle in a cup full of hay as opposed to, say, a bucket full of hay? This is what we mean by more or less precise relative to the width (size) of the confidence interval.
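As a brief practical aside (not part of the original text), the CI = mean ± (critical value)(SEM) construction above can be sketched in a few lines of Python; the sample values below are invented for illustration.

```python
# A minimal sketch of the confidence interval construction described
# above: CI = mean ± (critical value)(SEM). Sample data are invented.
import numpy as np
from scipy import stats

sample = np.array([196, 189, 201, 210, 187, 195, 204, 199, 193, 206])
alpha = 0.05                                     # level of significance
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(len(sample))  # standard error of the mean
crit = stats.norm.ppf(1 - alpha / 2)             # critical value, 1.96

margin_of_error = crit * sem                     # the margin of error
lower, upper = mean - margin_of_error, mean + margin_of_error
print(f"{1 - alpha:.0%} CI: [{lower:.2f}, {upper:.2f}]")
```

Note that the level of confidence enters only through the critical value: a smaller α yields a larger critical value and hence a wider interval, exactly as described above.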
Fig. 5.12 Series of confidence intervals: a succession of intervals (x̄ ± margin of error) constructed around sample means x̄1 through x̄8; intervals that contain the population mean μ are true confidence intervals, and those that do not are false confidence intervals

5.6 Hypothesis Testing

Alas, the moment we have all been waiting for has arrived—the testing of hypotheses. Indeed, this section was painfully prolonged with the purpose of initially comprehending the details prior to their marriage. Hypothesis testing is the chief method used in inferential statistics, while being the golden child of the scientific method. After a review of the next chapter, we will finally be able to utilize the method of hypothesis testing and rest confidently on the stool representing the research process. To prevent redundancy, we recommend a brief review of the scientific method and hypothesis-driven research from Chap. 1. Furthermore, it can be claimed with a great deal of certainty that both the generation and testing of hypotheses underlie practically all scientific research.
Investigation within all factions of the healthcare field is heavily reliant on hypothesis testing. Any clinician worth her salt is continuously generating and testing hypotheses throughout the process of clinical care. Prior to the patient meeting the clinician, the clinician has already begun to judge the patient and their purported symptoms on information that has already come before them. During the patient–clinician interaction, the clinician generates a number of differential diagnoses and estimates multiple probable prognoses; she makes predictions regarding a specific drug's course of action or the effects of exposure to certain external stimuli, all of which are fundamentally hypotheses. If necessary, she turns to the best available evidence contained within the biomedical literature, in which her ability to interpret those findings relies on the comprehension of the intricacies relative to hypothesis testing and the research process, in general.

As discussed in further depth in Chap. 1, inherent to hypothesis testing is the generation of hypotheses, which is ultimately dependent on imagination, curiosity, and even—to a certain degree—biases. Take a moment to consider this fact. Even an educated guess must arise from something that was initially just believed to be true. Yet, it is these biases that require the method of hypothesis testing; convictions, educated guesses, assumptions, and the like must be held to a higher standard in order to quell (or demarcate) the biases. By doing this, we move one step closer to the truth and thus prevent fallacious inferences. We can visualize the crux of this concept to be the ratio of signal to noise or, statistically speaking, effect to error. Our interest is in amplifying the signal (large effect size) and reducing the noise (error fractionation) around it (Fig. 5.13).

Fig. 5.13 Error fractionation: signal/noise = effect/error

Hypothesis testing essentially boils down to determining the validity of a hypothesis by assessing the evidence that it implicates. A testable hypothesis must be statistical in nature; only then can its claim be analyzed by statistical tests. Statistical hypotheses are tested relative to the interactions between a set of variables from data obtained from the random sample(s). The statistical tests we use to assess the evidence provided by the sample data are the statistical analysis techniques discussed in Chap. 7. Upon data analysis, the observed outcome is then determined to be either a common occurrence (attributable to chance) or a rare occurrence (something special). If the analysis of data renders the evidence inconsistent with the hypothesized claim, then we are forced to invalidate the hypothesis. The same is true for the converse—evidence that is seemingly consistent with the association that is claimed by the hypothesis allots confirming or retaining the original hypothesis (see Sect. 5.3.3).

Regardless of the outcome observed, should the criteria of parametric statistics be satisfied, then we are able to generalize the findings onto the population from whence the sample came. However, what may be even more important than the inference consensus produced by hypothesis testing is establishing concrete evidence of causality. That is, does the specific exposure cause the disease due to the effect observed? Or, does the specific intervention cause treatment due to the effect observed?

Unfortunately, establishing causation requires much higher levels of evidence that go beyond collecting sample data. To be safe from the errors and fallacies commonly made in research, the relationships tested in Chap. 7 will be at best associative. We establish whether or not an association or relationship exists among the variables in consideration, although even statistically significant associations made between variables may be confounded.15

As we will see in the next chapter, the majority of statistical hypothesis tests entail the comparison of means. We shall see how to transform the research question (i.e., study hypothesis) into a statistical hypothesis with averages that are obtained from sample data. Indeed, it is the sample means that we hypothesize to be the true descriptions of the parameters.

15 See Skelly et al. (2012).
Therefore, in order to make parametric inferences, a number of things that we will be working with and must consider relative to hypothesis testing are necessary: (1) quantitative data, (2) random samples, (3) a sampling distribution (as reference frame), and (4) the assumptions of parametric statistics.

The basic protocol for testing a hypothesis, with brief descriptions of each step, is outlined below; this basic format will be utilized in virtually all statistical tests and further expounded on throughout the next chapter. A brief sketch of the protocol in code follows the list.

Six Steps of Hypothesis Testing
1. Research Question—state the research problem of interest in terms of a question.
2. Hypotheses—null and alternative hypotheses are stated in a statistical manner.
3. Decision Rule—a preset rule is established in order to guide the decision-making process after the data are analyzed.
4. Calculation—data are analyzed, and the appropriate test statistic is calculated. Statistical significance may also be established here and used as proof below.
5. Decision—a decision regarding only the null hypothesis is made as guided by the rule above and supported with significance testing as proof from analysis.
6. Conclusion—the research question is answered based on the findings and the decision that was made. Assuming the criteria of parametric statistics are satisfied, findings may be generalized onto the population relative to the sample. Confidence intervals may also be placed and interpreted here.
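For readers who prefer to see the protocol end to end, here is a minimal sketch (not part of the original text) mapping the six steps onto a one-sample test in Python; the data and hypothesized mean are invented for illustration.

```python
# A minimal sketch of the six-step protocol; all values are invented.
import numpy as np
from scipy import stats

# Step 1 - Research question: do these adults differ from 100 min/wk?
mu_hyp = 100
sample = np.array([103, 95, 112, 98, 120, 89, 107, 101, 94, 110])

# Step 2 - Hypotheses: H0: mu = 100 vs. H1: mu != 100 (two-tailed)
# Step 3 - Decision rule: at alpha = 0.05, reject H0 if p <= alpha
alpha = 0.05

# Step 4 - Calculation: test statistic and p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu_hyp)

# Step 5 - Decision, guided by the preset rule
decision = "reject H0" if p_value <= alpha else "retain H0"

# Step 6 - Conclusion: answer the research question
print(f"t = {t_stat:.2f}, p = {p_value:.3f} -> {decision}")
```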
5.7 Study Validity

As we wrap up this chapter, there must be a brief discussion regarding the validity of the research study. Until this moment, there have been numerous occasions in the preceding chapters that talked about validity—ranging from topics of design to methodology. But as we approach the conclusion of the research process that takes place within the confines of a scientific study, we must take a step back and look at the validity of the entire study. This type of validity essentially scrutinizes the quality of the research study and the evidence produced. Thus, by looking at the entire study, we are including the study design, the methodology, and the data analysis and inferences, and we are then faced with two questions:

1. Is the study tight enough, such that the findings it produces are able to be replicated?
2. Is the study of sufficiently broad implications, such that the findings it produces are able to be generalized?

We discuss these questions and their relevance toward study validity next.

5.7.1 Internal Validity

The first question posed above has to do with the concept of the internal validity of a research study. Internal validity refers to the validity of the assertions presented relative to the effects between the variables being tested. However, there is a plethora of definitions that can be provided to determine a study's internal validity. By looking at the phrase itself in the context of a research study, we can surmise a comprehensive understanding. We can wonder: "How valid is the study internally?" Consider all of the aspects within a study that constitute its internal anatomy—namely, but not limited to, the study design, methodology, and data analysis.

For just a moment, stretch the word validity to mean accuracy, and then we can ask: "Were the steps within the study accurately done? Or did it contain certain factors that made the study more vulnerable to systematic errors?" More generally: "Was the study done well enough?" These are, in essence, what the word tight in the question above is trying to capture. Recall that systematic errors have to do with the specific system that was utilized, i.e., the study design. Yet, this type of validity has to do with the study as a whole, simply because the study design sets the precedent for the remainder of the research study.
The virtue of having a study that is internally valid implies that any other investigator with the same research question and interest in the same variables is able to precisely replicate or reproduce the same findings as you did. It is for these critical components that internal validity is often referred to as the sine qua non (i.e., the absolutely necessary condition) of research to be rendered meaningful. In order to establish the internal validity of a study, we must be aware of the major threats to validity. Although there can be an endless number of factors that have the ability to jeopardize internal validity, Table 5.4 illustrates a brief description of a few of the most important.

Table 5.4 Threats to internal validity
• History—refers to certain events that may have occurred during the time of study that may have affected the outcome of the study. The events may have occurred in a personal and/or professional aspect relative to the investigators and/or the study participants.
• Maturation—refers to certain changes that may have occurred to the study participants throughout the time of the study. These changes may be due to growth in age, experience, fatigue, hunger, etc.
• Testing—refers to changes in the performance of study participants or investigators upon consecutive measurements. This may be due to memory of earlier responses, practice, or desensitization.
• Instrumentation—refers to changes in the calibration of a measuring device or the people that use the devices, which results in erroneous measurements.
• Selection—refers to the process of assigning study participants (or even other units) to different treatment or control groups. This can also be seen as selection bias.
• Mortality—refers to the demise of study participants during the course of the study. This may be particular to studies of comparison, such that the death or attrition of a study participant no longer facilitates the comparison.

5.7.2 External Validity

The second question posed has to do with the concept of external validity. External validity refers to the ability to generalize the findings of a study onto the population from which the sample was taken. This clearly lies at the core of inferential statistics, in which the most fundamental question we can begin to ask is: "Is the sample being studied representative of its parent population?" This is, of course and as mentioned earlier, one of the primary qualifications that facilitates sound inferences. We, yet again, suggest a closer look at the phrase for further clarification and ask: "How valid is the study externally?"; "Can the information learned go beyond just the sample?"

Indeed, "external" refers to the inference consensus. This further begs the question of the ability of the findings to go beyond the internal components of the study—namely, the study design, methodology, and data analysis—as well. It should be evident that a necessary condition for the external validity of a study is in fact the establishment of an internally valid study. Hence, should the internal validity of a study be jeopardized, then any attempts at generalizing the findings become impermissible—nonetheless, extraneous. Although there also exist threats to external validity, those will not be spoken of, as they go beyond the scope of practicality, particularly due to the lack of discussion regarding statistical analysis techniques.16

16 See Campbell (1957).

5.8 Self-Study: Practice Problems

1. What important role (if any) does the sample–population interaction play in inferential statistics?
2. The following calculations are concerned with the standard error of the mean (sx̄):
   (a) If σ = 100 and n = 25, what is sx̄?
   (b) If sx̄ = 2.50 and σ = 25, what is the sample size (n)?
   (c) If n = 35 and sx̄ = 2.82, what is σ²?
3. What strategy can be used to decrease the amount of random error introduced into a statistical test? Provide a mathematical proof of your answer.
4. James is interested in comparing the rates of bullying among ten schools in Los Angeles County.
After he obtains a needs assessment for each school, James determines that, due to the differences between the schools, he must craft a specific questionnaire for each school in order to collect data on the rates of bullying.
   (a) Which of the assumptions of parametric statistics, if any, are being violated here?
   (b) Can an accurate inference consensus still be made? Explain.
5. True or False: The alpha level represents the probability that the obtained results were due to chance alone.
6. For each pair, indicate which of the p-values describes the rarer result:
   (a) p = 0.04 or p = 0.02
   (b) p > 0.05 or p < 0.05
   (c) p < 0.001 or p < 0.01
   (d) p < 0.05 or p < 0.01
   (e) p < 0.15 or p < 0.20
7. What are the four elements necessary for the establishment of the power of a study?
8. True or False: As power increases, the probability of making a Type II error increases.
9. Before entrance into a clinical trial, participants had their average CD4 T cells measured. A review of the existing literature determined the average count in healthy adults to be about 975 cells per cubic milliliter of blood. Researchers were confused when they obtained a 95% confidence interval of 1101–1278 from the participants. Determine which of the following statements are true or false regarding the confidence interval:
   (a) The interval of 1101–1278 contains all possible values of the true population mean for all patients that enter the trial.
   (b) The interval of 1101–1278 estimates the true population mean roughly 95% of the time.
   (c) The true population mean is absolutely between 1101 and 1278.
   (d) About 5% of participants did not score between 1101 and 1278 and 95% did.
   (e) There is a certain degree of confidence that the true population mean lies between 1101 and 1278.
10. If beta is the probability of making a Type II error, then which of the following describes the power of the hypothesis test (i.e., 1 − β)?
   (a) Probability of rejecting a true H0
   (b) Probability of failing to reject a true H0
   (c) Probability of rejecting H0, when H1 is true
   (d) Probability of failing to reject any H0

(See back of book for answers to Chapter Practice Problems)
Inferential Statistics II

6

Contents
6.1 Core Concepts  91
6.2 Conceptual Introduction  92
6.3 Details of Statistical Tests  93
6.3.1 Critical Values  93
6.3.2 Directional vs. Nondirectional Tests  95
6.4 Two-Group Comparisons  96
6.4.1 z Test  96
6.4.2 t Test Family  99
6.5 Multiple Group Comparison  106
6.5.1 ANOVA  107
6.6 Continuous Data Analysis  111
6.6.1 Associations  112
6.6.2 Predictions  116
6.7 Self-Study: Practice Problem  120

6.1 Core Concepts

Nicole Balenton

In the previous chapter, we discussed the basic principles of inferential statistics and emphasized the difference between descriptive and inferential statistics. The latter allows the researcher to make inferences about the population based on the observations collected. Through analytical tools, inferential statistics help researchers decide whether the difference between groups is statistically significant enough to support our hypothesis that the difference exists in the population.

Part two of inferential statistics dives into the core of data analysis by examining more advanced techniques—statistical tests—and how they will guide researchers through the decision-making process. In the form of statistical tests, hypothesis testing is the primary method used to take the information learned from the observations and form an inference consensus. There are many statistical tests and techniques used to help make these inferences regarding different types of data and research designs.

Electronic Supplementary Material The online version of this chapter (https://doi.org/10.1007/978-3-662-57437-9_6) contains supplementary material, which is available to authorized users.

To those who are well versed in statistical tests, note that the book will cover nondirectional, i.e., two-tailed, tests such as the z test, one-sample t test, independent sample t test, dependent sample t test, and analysis of variance (ANOVA). The selection of statistical tests is dependent on the individual characteristics and availability of the data. For each test, we will consider what data and design are appropriate, its corresponding formula, and a step-by-step procedure for hypothesis testing.

6.2 Conceptual Introduction

An unspoken understanding behind the ultimate goal of healthcare is promoting and prolonging our species—fancy for keeping us from dying. Just as the effective usability of penicillin was paramount to the overall well-being of humans, so too is the ability to efficiently and effectively analyze data. A bit extreme? Consider our ability to compare the effectiveness between two cancer treatments and predict the most prevalent strains of influenza during the next flu season, or the ability to model patient behavior and inclinations. Every pioneer and frontier established within healthcare began its infancy with a phase of data analysis. Today we may think of lavish machinery or never-ending mathematical computations when thinking of analyzing data. But that is not necessarily the case. The earliest of analytics could have simply been comparing observations between two modalities. Joe the Caveman observed that a circular-shaped stone could better serve as a wheel than a rectangular-shaped stone. Surely, it is the ability to analyze, or more generally our cognitive abilities, that makes us humans second to none—we are able to look deeper or beyond that which is readily apparent. Joe the Caveman did not simply acknowledge the existence of two differently shaped wheels. No. Instead, he went beyond the simple observation of their difference and was able to surmise or test which was more beneficial for his cause. That, in essence, is at the heart of what it means to analyze data.

Data analysis refers to the process of examining, organizing, and transforming data in order to garner valuable information, provide substantiating evidence, and draw meaningful inferences. Indeed, the techniques of descriptive statistics taught us how to appropriately examine and organize data to obtain useful information that was not readily available from the raw data. But pertinent to inferential statistics is understanding the more sophisticated techniques that transform our data. We must go one step beyond simple organization if we desire to provide evidence that will guide our decision-making processes and, hopefully, make generalizations toward the greater good.

In the previous chapter, we discussed the principles and philosophies behind inferential statistics, i.e., making certain inferences, generalizations, or conclusions regarding a population based on the sample. However, that is not to say that our principles of descriptive statistics can be neglected. The three assumptions necessary for parametric statistical inferences (i.e., normality, independence of measurement, and homogeneity of variance) are deeply grounded in descriptive statistical theory.

All of these principles and philosophies must be in the forefront of our minds during this chapter, as they are the underlying logic to the techniques of data analysis. We shall see how data analysis takes advantage of the vehicle, that is, hypothesis testing, in the form of statistical tests that ultimately allot the inference consensus. This is experimental science in its crudest form. This chapter further solidifies the third leg of the stool (Fig. 6.1), that is, the data analysis, in which the techniques discussed will be limited to quantitative data and parametric statistics.

Fig. 6.1 Three-legged stool: the research process rests on the legs of study design, methodology, and data analysis
The next chapter will explore the analogous fundamental concepts and techniques of nonparametric statistics, which will be the final chapter in the translational research enterprise.

6.3 Details of Statistical Tests

Statistical tests are the techniques of data analysis that are used during the examination of hypotheses. These statistical tests mathematically analyze the data obtained from the sampling distribution and essentially quantify the outcome espoused by the hypotheses. It is the compounding of the principles of significance testing onto statistical tests that allows for the testing of a hypothesis. Now, we will be able to make sense of the test statistics and be able to make decisions regarding the tenability of certain hypotheses. The majority of the statistical tests we explore are inherently similar in terms of mean comparison, effect attainment, and error consideration. This similarity can be noted through their formulae, which innately represent the signal (effect)-to-noise (error) ratios (Fig. 6.2). But before we can begin our discussion on the different statistical tests, we must consider a few important details below.

Fig. 6.2 Error fractionation: signal/noise = effect/error

6.3.1 Critical Values

One of the most valuable abilities provided by the techniques of data analysis is the ability to transform data. However, the transformation of data does not necessarily mean that we are tampering with evidence. Rather, a smooth transformation via the formulae sheds light on possible relationships between variables that may be of interest to an investigator. This process takes the hypothesized sampling distribution and transforms it to a standardized sampling distribution specific to the formula (i.e., statistical test) that is used.

For example, the first statistical technique we discuss in this chapter is referred to as the z test. If we are interested in using a z test to test a population mean (i.e., testing whether there is a difference between the sample mean and the population mean), then we must convert the data from the sampling distribution of the mean (x̄) to the sampling distribution of z. Now, we have a distribution of z-values that represent the means of numerous random samples, instead of the actual sample means (see Sect. 5.2.1 on the sampling distribution for clarification).

We will further elaborate on what the sampling distribution of z is under the section devoted to the z test, but notice that the only major difference between the hypothesized sampling distribution of the mean and its standardized counterpart, the hypothesized sampling distribution of z, is the scaling and units of the graph (Fig. 6.3). Notice that the transformation being done is very similar to the z-transformation process discussed in Chap. 4! Here, the mean of the sampling distribution (i.e., the mean of sample means) is now represented as a z-score (0), and the other sample means are observations as z-scores on the standard normal curve. Thus, just as we are able to transform data onto a particular sampling distribution, we can transfer the relevant concepts of hypothesis testing as well.

Recall that in hypothesis testing, the decision-making process regarding the null hypothesis was dependent on the observed outcome being either a common observation or a rare observation. This is briefly the process of establishing statistical significance, in which an arbitrarily chosen level of significance (α) determines the threshold(s) between common outcomes and rare outcomes (Fig. 6.4).

Therefore, we can also apply this concept to other hypothesized sampling distributions, such as the sampling distribution of z. Here is where we get to the meat of it—say we are interested in identifying the exact locations of the thresholds that represent the level of significance on either tail end (Fig. 6.5). Would that not be useful? Of course!
Fig. 6.3 Side-by-side comparison of the sampling distribution of the mean and the sampling distribution of z (x-axes in weight, lbs., and z-scores, respectively). Notice how the standard normal curve on the right has a center of 0 and a narrower distribution

Fig. 6.4 Level of significance: rare occurrences fall in the tails, and common occurrences fall in the center of the distribution

Fig. 6.5 The location on the x-axis represents the threshold between common and rare occurrences

By identifying or quantifying those precise locations, we can then plot any hypothesized sample mean that has been transformed into a z-value and determine which observations are rare or common.

The quantification of the precise locations on a sampling distribution that represent the level of significance is referred to as critical values. As we will see, critical values are particular to the specific statistical technique used and can play an important role in hypothesis testing. Most importantly, critical values are determined based on the specific level of significance chosen. Recall from last chapter's discussion of the level of significance that the arbitrary value chosen also represents the areas of the shaded regions located in the tail ends. Thus, at an α = 0.05, the area to the left of the lower tail and the area to the right of the upper tail should each be equal to 0.025, to give a sum of 0.05. Furthermore, using the tools of z-transformation, the specific z-scores representing those thresholds are −1.96 and +1.96 for their respective tails (Fig. 6.6).
Fig. 6.6 The thresholds (i.e., critical values ±1.96) and the tail areas (0.025 each) at α = 0.05 are labeled

Fig. 6.7 Sampling distribution of the average BMI of middle school children (x̄) and the national average (μ)

6.3.2 Directional vs. Nondirectional Tests

In the previous chapter, we discussed at length the central role that the null hypothesis plays in statistical significance and decision-making within hypothesis testing. We learned that the null hypothesis was a statement of no difference or no effect. For example, say we are interested in comparing the average BMI of middle school children to the BMI of the national average. The null hypothesis would then state: there is no difference between the average BMI of middle school children when compared to the BMI of the national average. If the analysis substantiates plausible evidence against this claim, then our decision would be to reject H0 and state that there is indeed a statistically significant difference (or effect).

But that is all it says. It simply proves that a difference exists—it does not tell us whether the average BMI of middle school children is less than the national average or, conversely, greater than the national average. In statistical terms, this means that the observed outcome could have been observed in either of the tails (rejection areas) of the corresponding sampling distribution (Fig. 6.7). This type of statistical test, namely, one that tests a hypothesis with areas of rejection on both tails of the sampling distribution, is referred to as a nondirectional or two-tailed test. Moreover, along with two-tailed tests come two critical values that must equally share the area espoused by the level of significance.

Fig. 6.8 Observed outcomes occur in the lower rejection area

But what if we are interested in direction? Rather than detecting just the existence of a difference or an effect, we want to identify the specific type of difference or effect. Statistically, we might be interested in proving an alternative hypothesis that claims: the average BMI of middle school children is less than the BMI of the national average. Here, we are only interested in rejecting H0 if the observed outcome occurs in the lower rejection area (Fig. 6.8). On the other hand, another alternative hypothesis might claim: the average BMI of middle school children is greater than the BMI of the national average.
area espoused by the level of significance. hypothesis if the observed outcome occurs in the
Fig. 6.9 Observed outcomes occur in the upper rejection area

Here, we are only interested in rejecting the null hypothesis if the observed outcome occurs in the upper rejection area (Fig. 6.9). Thus, statistical tests that examine a hypothesis with a rejection area in one of the tails of the sampling distribution are referred to as directional or one-tailed tests. Moreover, depending on whether it is a lower or upper one-tailed test, there will only be one critical value that corresponds to the area espoused by the level of significance.

Directional tests are useful when we have some idea of a potential effect as supported by a pilot test or similar tests done in the past. By focusing our attention on only one of the tails as a qualified area for rejection, we increase the sensitivity of our hypothesis test. However, this comes at certain costs, such as increasing vulnerability to bias and Type I/II errors.1 For the purposes of this book, all of the statistical tests we will be considering will be nondirectional, i.e., two-tailed tests.

1 See Banerjee et al. (2009).
tailed tests.

6.4 Two-Group Comparisons

The first set of statistical tests that will be explored offer techniques that analyze the relationship between two groups. The groups can be different combinations of samples and/or populations; the comparison will be between their arithmetic means. In terms of hypothesis testing, the statistical techniques of data analysis will shed light on whether an association or relationship exists in the form of an effect size. We ask ourselves: "Is there an effect between this mean and that mean?" Notice that we can swap the word "effect" in the above question with "difference" and still arrive at a plausible research question.

Yet, the difference we are looking for cannot be determined by simply comparing the averages at face value. For example, just because we have one x̄ = 45 and another x̄ = 56 does not necessarily imply that they are statistically different. Indeed, it is the determination of the appropriate statistical test that is able to tell whether the measures are statistically different or not. Moreover, the averages we test within a sampling distribution have their own distributions and dispersions that must also be taken into consideration. Lastly, the selection of an appropriate statistical test is dependent on the disparate characteristics and availability of the data, as we shall see next.

6.4.1 z Test

The first statistical test that will be explored is called the z test or the one-sample z test. As briefly mentioned above, the z test compares a single sample mean to a hypothesized population mean, in order to determine if there is a statistically significant difference between the two groups (Fig. 6.10).

Fig. 6.10 Population–sample interaction, where a sample is representative of the population
We essentially compare the sample mean to the hypothesized population mean and test whether the claimed population mean is true—that is, whether it is actually representative of the entire population. If it is representative, then the sample mean should not be significantly different from the population mean. So, the research question might ask: "Is there a statistically significant difference between x̄ and μ?" In this case, the null hypothesis will state: "There is no statistically significant difference between x̄ and μ."

When we test the hypothesis, we essentially examine the degree to which the sample mean deviates from the population mean. In order to do this comparison, the sampling distribution of x̄ is converted to the sampling distribution of z. The sampling distribution of z is a distribution of z-values that represent the means of numerous random samples of a given size for a specific population. This hypothesized sampling distribution is then centered around the given hypothesized population mean—recall that the mean of the sampling distribution is equal to the population mean, which we initially hypothesize to be the case, until testing proves otherwise. Hence, to test a hypothesis of this sort to determine how different these two measures really are, we can use the z test formula, along with its confidence interval formula below:

z = (x̄ − μhyp) / (σ/√n)

CI: x̄ ± (zcrit)(σ/√n)

By just looking at the formula, we can outline a few important factors for a z test. Most importantly, notice the specific measures required to successfully complete the calculation: x̄, μhyp, σ, and n—these will be important in determining the appropriate statistical test, particularly when there are a variety to choose from. Also notice that the numerator is essentially the effect (difference) and the denominator is the standard error; this is akin to the signal-to-noise ratio discussed earlier. Below we provide a step-by-step procedure for hypothesis testing using the z test, along with commentary under each step. After that, we demonstrate the process with an example for further clarification.
Steps for One-Sample z Test

1. Research Question: Is there a statistically significant difference between x̄ and μhyp?
   (a) Simple and neutral.
   (b) In a real example, the descriptions of the averages are provided and not their symbols or numerical values.
2. Hypotheses: H0: μ = μhyp; H1: μ ≠ μhyp
   (a) These are statistical hypotheses but still capture the essence of a null and alternative hypothesis.
   (b) Replace μhyp with the actual value.
   (c) These are the hypotheses for the two-tailed test.
3. Decision Rule: At α = 0.05, if z ≤ −1.96 or z ≥ +1.96, then reject H0; if −1.96 < z < +1.96, then retain H0.
   (a) This is the establishment of significance testing for a two-tailed test that will be used to guide the decision-making process in Step 5.
      1. The z test does not use the p-value but critical values.
   (b) The "z" above refers to the test statistic that will be calculated in Step 4.
   (c) The z critical values (±1.96) are dependent on the level of significance (α = 0.05) that is chosen.
      1. If we had chosen α = 0.01, then the z critical values would be ±2.58, respectively. See Fig. 6.11 for clarification.
4. Calculation: z = (x̄ − μhyp)/(σ/√n); CI: x̄ ± (zcrit)(σ/√n)
   (a) The confidence interval may be calculated here.
5. Decision: Reject/retain H0 because (insert corresponding mathematical proof).
   (a) The z test statistic obtained from Step 4 is examined within the context of the decision rule that informs the decision that should be made.
   (b) The "mathematical proof" is either of the three possibilities mentioned in Step 3 and satisfied by the z test statistic.
6. Conclusion: Based on the results, there seems (to be a/to be no) statistically significant difference between x̄ and μhyp. We are (1 − α)% confident that the true population mean lies between (insert lower limit) and (insert upper limit).
   (a) Depending on the decision made in Step 5, the research question is answered.
      1. If the decision was to reject H0, then there seems to be a statistically significant difference.
      2. If the decision was to retain H0, then there seems to be no statistically significant difference.
   (b) The interpretation of the confidence interval follows the conclusion. The degree of confidence depends on the level of significance chosen earlier. This can also be used as proof that your conclusion is correct:
      1. If the actual numerical value of μhyp falls within the lower and upper limits of the confidence interval, then the decision that should have been made was to retain H0.
      2. If the actual numerical value of μhyp does not fall within the lower and upper limits of the confidence interval, then the decision that should have been made was to reject H0.

Fig. 6.11 Step 3 for the one-sample z test at α = 0.01: critical values ±2.58, with 0.005 of the area in each tail
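Before the worked example that follows, here is a minimal sketch (not part of the original text) of Steps 3 through 5 carried out in Python; all numbers are invented for illustration.

```python
# A minimal sketch of Steps 3-5 of the one-sample z test with
# invented values (x_bar = 103, mu_hyp = 100, sigma = 15, n = 36).
import numpy as np
from scipy import stats

x_bar, mu_hyp, sigma, n, alpha = 103, 100, 15, 36, 0.05

sem = sigma / np.sqrt(n)                      # 2.5
z = (x_bar - mu_hyp) / sem                    # 1.2
z_crit = stats.norm.ppf(1 - alpha / 2)        # 1.96 at alpha = 0.05

decision = "reject H0" if abs(z) >= z_crit else "retain H0"
ci = (x_bar - z_crit * sem, x_bar + z_crit * sem)
print(f"z = {z:.2f} -> {decision}")           # z = 1.20 -> retain H0
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")  # [98.10, 107.90]
```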
Example

According to a national US study conducted in 2017, the average cholesterol level of US college students was approximately 199 mg/dL, with a standard deviation of 12 mg/dL. From 30 students from College A, you find the average cholesterol level to be about 196 mg/dL. We are curious to know if there is a statistically significant difference between the national average and the average you obtained from College A. Thus, we test the null hypothesis at a significance level of 0.05.

1. Is there a statistically significant difference between the average cholesterol level of students from College A and the average cholesterol level of college students nationwide?
2. H0: μ = 199 mg/dL; H1: μ ≠ 199 mg/dL.
3. At α = 0.05: If z ≤ −1.96 or z ≥ +1.96, then reject H0; if −1.96 < z < +1.96, then retain H0.
4. z = (196 − 199)/(12/√30) = −1.37; CI: 196 ± (1.96)(12/√30) = [191.71, 200.29].
5. Retain H0 because −1.96 < −1.37 < +1.96.
6. Based on the results, there seems to be no statistically significant difference between the average cholesterol level of students from College A and the national average cholesterol level. We are 95% confident that the true population mean lies between 191.71 and 200.29.

6.4.2 t Test Family

The next set of statistical tests we discuss is referred to as the t tests. The t test family represents three distinct t tests that can be used for comparisons between different groups via their means. The t-statistic, and hence the t test family, was developed by William S. Gosset in the early 1900s but was published under the pseudonym "Student"—hence it is still referred to as "Student's t test." Generally speaking, the t test family advocates the comparison of means when looking for effects between two distributions. The statistical analysis techniques of the t test family are comprised of the one-sample t test, the independent sample t test, and the dependent sample t test.

Fig. 6.12 The sample is compared to the population

6.4.2.1 One-Sample t Test

This type of t test shares many similarities with the z test discussed above, yet the few differentiating factors are what make the two so distinct. The one-sample t test is a statistical test that compares a sample mean to a hypothesized population mean, in order to determine if there is a statistically significant difference between the two groups (Fig. 6.12). Like the z test, the t test examines the location of the sample mean relative to the hypothesized population mean to determine if the claimed parameter is actually representative of all the samples (i.e., the entire population). Moreover, in order to test a hypothesis for statistical significance using a t test, the data must be transformed into the sampling distribution of t.
Fig. 6.13 Comparing the sampling distributions of t and z (Chiappelli n.d.): relative to the z distribution (standard normal), the t distribution is flatter and wider, approaching the normal curve as n nears 30

The sampling distribution of t refers to the distribution of t-values that represent the means of numerous random samples of a given size for a specific population, expressed as t-values.

Figure 6.13 shows a sampling distribution of t and compares it to the sampling distribution of z (i.e., the standard normal curve). Notice that the sampling distribution of t is flatter and fatter than that of z. This is primarily because of a unique calculation referred to as the degrees of freedom. The degrees of freedom (df) refer to the number of values that are free to vary when a statistic (i.e., a sample measure) is used to estimate an unknown parameter (i.e., a population measure). What unknown parameter might we be referring to? Let us look at the formulae for a one-sample t test for the answer, below:

t = (x̄ − μhyp) / (s/√n)

CI: x̄ ± (tcrit)(s/√n)

df = n − 1

Take a moment to compare this formula with the formula for a z test. Notice that almost everything is identical except for the standard error and, of course, t instead of z. The standard error for the z test utilized the population standard deviation (σ), whereas the standard error for the t test takes advantage of the sample standard deviation (s). How often do we have access to a population standard deviation of a hypothesized sampling distribution? Not very often. That is why the t test is favored and more often used than the z test. Furthermore, the fact that we have utilized the sample standard deviation to estimate the population standard deviation supports the need for a degree of freedom—answering our question in the last paragraph.

Another important differentiating factor for a t test is that now we are able to use the p-value for significance testing—something that was not permitted during the z test. But that is not to say that we cannot use t critical values for the decision rule. Critical t-values can be obtained from a t distribution table by using the specified level of significance and the degrees of freedom (Appendix C). We, yet again, call your attention to the similarity of the t ratio suggested by the formula and the signal-to-noise ratio. Below we provide a step-by-step procedure for hypothesis testing using the t test, along with commentary under each step. After that, we demonstrate an example for further clarification.
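A quick numerical sketch (not part of the original text) makes the role of the degrees of freedom visible: the t distribution's critical values shrink toward the z value of 1.96 as df grows, which is the flattening described above.

```python
# A minimal sketch of how t critical values approach the z critical
# value as the degrees of freedom increase (two-tailed, alpha = 0.05).
from scipy import stats

print(f"z critical: {stats.norm.ppf(0.975):.3f}")            # 1.960
for df in (5, 15, 30, 100):
    print(f"t critical, df = {df}: {stats.t.ppf(0.975, df):.3f}")
# df = 5 -> 2.571, df = 30 -> 2.042, df = 100 -> 1.984
```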
further clarification.

Steps for One-Sample t Test

1. Research Question: Is there a statistically significant difference between x̄ and μhyp?
   (a) Same commentary as the z test.
2. Hypotheses: H0: μ = μhyp; H1: μ ≠ μhyp
   (a) Same commentary as the z test.
3. Decision Rule: At α = 0.05 and df = n − 1: If p ≤ α, then reject H0. If p > α, then retain H0.
   (a) This is the establishment of significance testing for a two-tailed test that will be used to guide the decision-making process in Step 5.
   (b) The p-value is a result of the calculation of the test statistic done in Step 4.
   (c) Instead of the p-value, t critical values may be used for this step. The t critical values are dependent on the α and df, obtained from Appendix C.
    If p ≤ 𝛼, then reject H0. values are dependent on the 𝛼 and df
    If p >𝛼, then retain H0. obtained from Appendix C.
4. Calculation: t = (x̄ − μhyp)/(s/√n); CI: x̄ ± (tcrit)(s/√n); df = n − 1
   (a) The calculation of the t test statistic is done here either by hand using the formula or via statistical software (see Video 4).
   (b) The p-value can be roughly measured using the t table in Appendix C or measured more exactly via statistical software.
   (c) It is also possible to calculate the confidence interval here.
5. Decision: Reject/retain H0 because (insert corresponding mathematical proof).
   (a) Same commentary as the z test.
6. Conclusion: Based on the results, there seems (to be a/to be no) statistically significant difference between x̄ and μhyp. We are (1 − α)% confident that the true population mean lies between (insert lower limit) and (insert upper limit).
   (a) Same commentary as the z test.
   (b) The confidence interval is interpreted here.

Example

A national study determined that a healthy adult human requires an average of 100 min of exercise per week. At a local community event, it is gathered from Community Alpha that 35 adults exercise approximately 103 minutes per week (min/wk) with a standard deviation of 30 min/wk. At a significance level of 0.01, we test the null hypothesis that there is no statistically significant difference between these estimates.

1. Is there a statistically significant difference between the recommended average of 100 min/wk and the 103 min/wk sample obtained from Community Alpha?
2. H0: μ = 100; H1: μ ≠ 100.
3. At α = 0.01 and df = 35 − 1 = 34: If p ≤ α, then reject H0; if p > α, then retain H0.
4. t = (103 − 100)/(30/√35) = 0.592; CI: 103 ± (2.750)(30/√35) = [89.05, 116.95].
5. p > 0.01; therefore, retain H0.
6. Based on the results, there seems to be no statistically significant difference between the recommended average minutes of exercise per week of 100 and Community Alpha's 103 min/wk. We are 99% confident that the true population average of minutes of exercise per week for healthy adults falls in the interval between 89.05 and 116.95.
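The same example can be verified in Python from its summary statistics; this sketch (not part of the original text) uses scipy's t distribution directly, so its critical value (about 2.728 at df = 34) is slightly tighter than the rounded table value of 2.750 used above.

```python
# A minimal sketch reproducing the Community Alpha example from its
# summary statistics with scipy's t distribution.
import numpy as np
from scipy import stats

x_bar, mu_hyp, s, n, alpha = 103, 100, 30, 35, 0.01
df = n - 1

sem = s / np.sqrt(n)
t_stat = (x_bar - mu_hyp) / sem              # ~0.59
p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-tailed p, ~0.56
t_crit = stats.t.ppf(1 - alpha / 2, df)      # ~2.728 (table: 2.750)

ci = (x_bar - t_crit * sem, x_bar + t_crit * sem)
decision = "reject H0" if p_value <= alpha else "retain H0"
print(f"t = {t_stat:.3f}, p = {p_value:.2f} -> {decision}")
print(f"99% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")  # ~[89.17, 116.83]
```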

6.4.2.2 Independent Sample t Test

The independent sample t test is a statistical test that examines the difference between two hypothesized populations through a comparison of two sample means. Critical to this t test is that the two groups being compared must be two samples that are independent of each other. Examples of two independent samples that are most often used in this context are the treatment and control groups within an experimental design (Fig. 6.14). The group averages of the treatment group (x̄1) and the control group (x̄2) can be compared because the observations in each sample are based on different participants.
Fig. 6.14 Two independent samples drawn from the population: Sample 1 (x̄1, s1, n1) and Sample 2 (x̄2, s2, n2)

Therefore, the subjects in each independent sample (statistically) belong to two distinct, hypothesized populations (μ1 and μ2). Namely, the treatment group sample is representative of the population of subjects that take the treatment (μ1); the control group sample is representative of the population of subjects that do not take the treatment (μ2).

Similar to other sampling distributions, the difference between the group averages cannot simply be examined at face value and inferred onto the respective populations—a sampling distribution is required. The sampling distribution of the mean difference (x̄1 − x̄2) is a distribution of mean differences that represent the means of numerous random samples of given sizes for two independent populations. According to the fundamental concepts of sampling distributions, the mean of this particular sampling distribution is in reality equal to the difference between the population means (μ(x̄1−x̄2) = μ1 − μ2) (Fig. 6.15).

Fig. 6.15 The sampling distribution of the mean difference is noticeably flatter and fatter in shape

This also expands onto the standard error of the mean difference, which is better defined here as the estimated standard error (s(x̄1−x̄2)). As we will see below, the estimated standard error contains within it a new measure, sp², which is an estimate of the pooled variance. The pooled variance represents the sum of the population variances of both groups, which are assumed to be the same. Case in point, the sampling distribution of the mean difference is still inherently a t distribution. Consequently, the formulae and sub-formulae required to conduct an independent sample t test, along with the protocol to test a hypothesis, are provided below. After that, we demonstrate with an example for further clarification.

t = [(x̄1 − x̄2) − (μ1 − μ2)] / s(x̄1−x̄2), with df = n1 + n2 − 2

where

s(x̄1−x̄2) = √(sp²/n1 + sp²/n2), and sp² = [s1²(n1 − 1) + s2²(n2 − 1)] / (n1 + n2 − 2)

CI: (x̄1 − x̄2) ± (tcrit) · s(x̄1−x̄2)
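To make the sub-formulae concrete, here is a minimal sketch (not part of the original text) computing the pooled variance and the estimated standard error from invented summary statistics.

```python
# A minimal sketch of the pooled variance and estimated standard
# error defined above; the summary statistics are invented.
import numpy as np

s1, n1 = 6.0, 12   # sample 1 standard deviation and size
s2, n2 = 8.0, 14   # sample 2 standard deviation and size

sp2 = (s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2)
se_diff = np.sqrt(sp2 / n1 + sp2 / n2)
df = n1 + n2 - 2
print(f"pooled variance = {sp2:.2f}, "
      f"estimated SE = {se_diff:.2f}, df = {df}")
```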
Steps for Independent Sample t Test (b) The p-value can be roughly measured
1. Research Question: Is there a statistically using the t table in Appendix C or more
significant mean difference between x 1 and exactly measured via statistical
x 2? software.
(a) Same commentary as other statistical (c) It is also possible to calculate the con-
tests fidence interval here.
2. Hypotheses: H0, 𝜇1 = 𝜇2; H1, 𝜇1 ≠ 𝜇2 5 . Decision: Reject/Retain H0 because (insert
(a) These are new hypotheses specific for corresponding mathematical proof)
this statistical test. Notice that they (a) Same commentary as other statistical
still capture the essence of both the tests
null and alternative hypotheses. 6. Conclusion: Based on the results, there
(b) Another equally valid way of writing seems (to be a/to be no) statistically signifi-
the hypotheses can be: cant difference between x 1 and x 2. We are
1. H0, 𝜇1 – 𝜇2 = 0; H1, 𝜇1 – 𝜇2 ≠ 0. (1 – 𝛼)% confident that the true population
3. Decision Rule mean difference lies between (insert lower
(a) This can be written with either the
limit) and (insert upper limit).
p-value or t critical value (see decision (a) Same commentary as other statistical
rule commentary for one-­sample t test). tests
4. Calculation: (see formulae above) (b) Notice that here the confidence inter-
(a) The calculation of the t test statistic is val represents the true mean differ-
done here either by hand using the for- ence, as espoused by the associated
mula or via statistical software (see hypotheses.
Video 5).

Example
A supervisor is interested in evaluating whether two different health education seminars offered by her company increase the health knowledge of participants. A total of 150 participants are randomly assigned to either Seminar A or Seminar B for a total of 75 participants in each group. After a 3-day-long session, the sample mean health-knowledge score (x̄1) for participants in Seminar A was 88, and the sample mean (x̄2) for participants of Seminar B was 85. Assuming the estimated standard error to be 2.21, test the null hypothesis at a level of significance of 0.05.

1. Is there a statistically significant difference between the average health-knowledge score of participants in Seminar A compared to Seminar B?
2. H0: 𝜇1 − 𝜇2 = 0; H1: 𝜇1 − 𝜇2 ≠ 0.
3. At 𝛼 = 0.05 and df = 75 + 75 − 2 = 148: If p ≤ 𝛼, then reject H0 | If p > 𝛼, then retain H0.
4. t = [(88 − 85) − (0)*] / 2.21 = 1.36; CI: (88 − 85) ± (1.980)(2.21) = [−1.38, 7.38].
5. Retain H0 because p > 0.05.
6. Based on the results, there seems to be no statistically significant difference between the average health-knowledge score of participants that take either Seminar A or Seminar B. We are 95% confident that the true population mean difference in health-knowledge scores falls in the interval between −1.38 and 7.38.

*Note that the (0) is the hypothesized difference between the population means, which we denoted in Step 2 as the null hypothesis.
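For readers working alongside statistical software, the example can be reproduced in a few lines. The sketch below is our own illustration in Python (it assumes SciPy is installed) and takes the estimated standard error as given rather than computing it from raw scores:

from scipy import stats

# Summary statistics from the seminar example
mean_a, mean_b = 88, 85          # sample means for Seminar A and Seminar B
se_diff = 2.21                   # estimated standard error of the mean difference (given)
df = 75 + 75 - 2                 # degrees of freedom, df = n1 + n2 - 2

t_stat = ((mean_a - mean_b) - 0) / se_diff     # t ≈ 1.36
p_value = 2 * stats.t.sf(abs(t_stat), df)      # two-tailed p ≈ 0.18 > 0.05, retain H0
t_crit = stats.t.ppf(1 - 0.05 / 2, df)         # exact tcrit ≈ 1.976 at alpha = 0.05

ci = ((mean_a - mean_b) - t_crit * se_diff,
      (mean_a - mean_b) + t_crit * se_diff)    # ≈ (-1.37, 7.37)
print(t_stat, p_value, ci)

The exact critical value gives a confidence interval of roughly (−1.37, 7.37); the (−1.38, 7.38) in the worked example uses the tabled tcrit = 1.980.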

6.4.2.3 Dependent Sample t Test
The last statistical test within the t test family is called the dependent sample t test. The dependent sample t test is a statistical test that examines the mean difference of two hypothesized populations that have been reduced to a single sample. The comparison that is made is still between two groups (Fig. 6.16), but it is two groups that are related or dependent on each other. There are two general classifications that make samples dependent and usable for this specific statistical test: (1) paired measures or (2) repeated measures.

Fig. 6.16 Dependent samples

Paired measures are composed of two samples that have been matched individually from one sample to the other (Fig. 6.17). Paired observations are most often used in twin studies, crossover trials, or any other related characteristics that allot the matching of two different individuals. Notice that although there are two sets of samples, we are able to use them here because there is a certain study characteristic that makes the individuals dependent on each other.

Fig. 6.17 Twins matched individually from one sample to the other

Repeated measures are composed of a single sample where the subjects have been measured twice and matched together. This technique is most often used to test the effects of an intervention or treatment on a single group—the participants are measured once before the intervention and once after the intervention. After measurement, the individual observations from the before group are matched to individual observations from the after group dependent on each single individual (Fig. 6.18).

Fig. 6.18 Individual observations measured before and after the intervention

Regardless of the type of measure, a dependent sample t test examines the differences within groups, rather than differences between groups—an important concept that will be further expounded on under the next statistical test. Briefly, and more clearly, we are not comparing groups with each other; instead we are comparing individuals that are in a group with each other. Figure 6.19 provides a pictorial illustration of the difference in the comparisons.

One virtue of examining individual differences is the ability to control for variations that would otherwise result in random error, distorting a signal we may be seeking. This is an important characteristic that increases the likelihood of detecting an effect—should there be one to be found—by decreasing the thwarting influence of individual variability. This unique t test is distinct from all of the other statistical tests we have learned because it does not look at differences in group averages; rather it examines differences in individual scores that have been matched³ together. Let us turn to the formula for further clarification on how this is done exactly:

t = (D̄ − 𝜇D) / (sD / √n)    df = n − 1

CI: D̄ ± (tcrit)(sD / √n)

We are introduced with a few new symbols here. Let us first break down D. Without the average (i.e., the straight bar atop), D represents the difference between matched scores. That is, D = X1 − X2, where the subscripts represent the matched pairs or the matched repetitions. Regardless of the type of measure, after each D has been determined for every match, the sum is computed and then divided by the number of matches⁴ (n). This provides us with D̄, which is the average of the differences, better defined as the mean difference.

³ Whether paired measures or repeated measures, individual scores are ultimately matched together in order to appropriately utilize the formula. For this reason, the dependent sample t test is also referred to as the matched t test.
⁴ Note that the sample size (n) is not the number of total observations, rather the number of matched observations.

Fig. 6.19 Examine the difference in comparison between and within groups (between: scores compared across Group 1 and Group 2; within: individual scores X1–X5 compared inside a single group)

D = X1 − X2    D̄ = ΣD / n

Next, we look at sD̄, which is none other than the standard error of the mean difference. This, like the other statistical tests, considers the variability in our observations. The calculation is simply the sample standard deviation of the differences (D) divided by the square root of the number of matches (n), shown below:

sD̄ = sD / √n

Lastly, we examine the hypothesized population mean difference, 𝜇D. Similar to the rest of the statistical tests, 𝜇D is obtained from, and presumed to be, the mean of the corresponding sampling distribution (𝜇D̄). The sampling distribution of the mean difference, D̄, refers to the distribution of mean differences that represent the average individual differences of numerous random samples of given size for two dependent populations. Now, we can turn to the protocol for testing a hypothesis using the dependent sample t test, which precedes an example for further clarification.

Steps for Dependent Sample t Test
1. Research Question: Is there a statistically significant mean difference between (scores of the first measure) and (scores of the second measure)?
(a) Note that the description of the groups relative to the mean difference is dependent on whether it is a repeated measure or a paired measure.
2. Hypotheses: H0: 𝜇D = 0; H1: 𝜇D ≠ 0
(a) These are new hypotheses specific for this statistical test. Notice that they still capture the essence of both the null and alternative hypotheses.
3. Decision Rule:
(a) Same commentary as independent sample t test
4. Calculation: t = (D̄ − 𝜇D) / (sD / √n); CI: D̄ ± (tcrit)(sD / √n)
(a) The calculation of the t test statistic is done here either by hand using the formula or via statistical software (see Video 6).
(b) The p-value can be roughly measured using the t table in Appendix C or a more exact measure via statistical software.

(c) May also calculate the confidence interval here.
5. Decision: Reject/Retain H0 because (insert corresponding mathematical proof)
(a) Same commentary as other statistical tests
6. Conclusion: Based on the results, there seems (to be a/to be no) statistically significant mean difference between (group one description) and (group two description). We are (1 − 𝛼)% confident that the true population mean difference lies between (insert lower limit) and (insert upper limit).
(a) Same commentary as independent sample t test

Example
A company called Camp F.A.T. wishes to determine whether their weight loss intervention is effective to lower the weights of teenagers at risk of obesity. Camp counselors measured the weight of ten teenagers on their first day of camp and measured their weights again on the last day of camp, shown below. Test the null hypothesis in order to determine whether the weight loss intervention was effective.

Camper | Weight before | Weight after | Difference (D = WBefore − WAfter)
1 | 198 | 193 | 5
2 | 205 | 200 | 5
3 | 220 | 211 | 9
4 | 213 | 215 | −2
5 | 239 | 233 | 6
6 | 305 | 309 | −4
7 | 276 | 270 | 6
8 | 254 | 255 | −1
9 | 281 | 274 | 7
10 | 230 | 226 | 4
ΣD = 25, D̄ = 2.5, sD = 4.79

1. Is there a statistically significant mean difference between the weight of teenagers before they enrolled in Camp F.A.T. compared to after?
2. H0: 𝜇D = 0; H1: 𝜇D ≠ 0.
3. At 𝛼 = 0.05 and df = 10 − 1 = 9: If t ≤ −2.262 or t ≥ +2.262, then reject H0; if −2.262 < t < +2.262, then retain H0.
4. t = (2.5 − 0) / (4.79 / √10) = 1.65; CI: 2.5 ± (2.262)(4.79 / √10) = [−0.93, 5.93].
5. Retain H0 because −2.262 < 1.65 < +2.262.
6. Based on the results, there seems to be no statistically significant mean difference between the average weight before and after the weight loss intervention. We are 95% confident that the true population mean difference in weight lies in the interval between −0.93 and 5.93.
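This calculation can be checked in software as well. Below is a minimal Python sketch (assuming SciPy; variable names are ours) that runs the matched-pairs test directly on the camper data:

import numpy as np
from scipy import stats

# Camper weights on the first and last day of camp
before = np.array([198, 205, 220, 213, 239, 305, 276, 254, 281, 230])
after = np.array([193, 200, 211, 215, 233, 309, 270, 255, 274, 226])

t_stat, p_value = stats.ttest_rel(before, after)   # t ≈ 1.65, two-tailed p ≈ 0.13

# Confidence interval for the mean difference, computed by hand
d = before - after
d_bar, s_d, n = d.mean(), d.std(ddof=1), len(d)    # 2.5, 4.79, 10
t_crit = stats.t.ppf(0.975, n - 1)                 # ≈ 2.262
ci = (d_bar - t_crit * s_d / np.sqrt(n),
      d_bar + t_crit * s_d / np.sqrt(n))           # ≈ (-0.93, 5.93)

scipy.stats.ttest_rel computes exactly the D̄ / (sD/√n) ratio described above, so it again leads to retention of H0.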

6.5 Multiple Group Comparison

What if we are interested in comparing more than just two groups? As is often the case with translational research, we may be interested in comparing the effectiveness of multiple treatment modalities in order to determine the single treatment that is most effective. Well, we might argue, why not conduct a series of t tests? Unfortunately, conducting a series of t tests will increase the chances of rejecting a true null hypothesis (i.e., making a Type I error). This is due to the fact that each t test that is conducted claims the null hypothesis to be true at a

probability equal to the level of significance. Thus, if multiple t tests are being done on the same null hypothesis, then we increase the chance of observing an effect when in reality there is no effect to be found—said another way, multiple t tests make it easier and easier to attain statistical significance. Therefore, we must discuss another statistical technique that allows the parametric comparison of multiple groups but saves us from this inaccuracy and an ultimately fallacious inference.

6.5.1 ANOVA

While Gosset argued that a comparison of distributions should be done through means, Sir Ronald Aylmer Fisher contended that it is the variances of distributions that need be compared for the existence of an effect. Without even introducing the reasoning, we can surmise the plausibility of using variances rather than means to facilitate a comparison. Our extensive discussion on descriptive statistics (Chap. 4) proved to us that measures of central tendency and variability were simply two different, yet related and equally important, perspectives of describing a distribution. Subsequently, testing a hypothesis with a statistical test rooted in variability should not be alarming; neither should the fact that this technique is utilized within the context of testing for population means.

Fisher went on to develop the F-statistic and the corresponding statistical analysis technique referred to as ANOVA (analysis of variance) or, as used in hypothesis testing, the F test. Indeed, it is the variance of the different groups that must be analyzed in order to test a null hypothesis for multiple parameters. This technique allots the study of multiple independent groups by analyzing the variability between the groups and comparing it to the variability within the groups. In essence, the basic conceptual format of the F-statistic is:

F = variability between groups / variability within groups

In data analysis, the variability mentioned in both the numerator and denominator above is represented as the mean square (MS), which is simply the calculation of the population variance (σ²) for both measures. Take an experimental design that seeks the effectiveness between three treatment modalities, in which subjects are randomly assigned to three different groups of equal size (Fig. 6.20).

Fig. 6.20 Three different groups testing the effectiveness between treatments

The MS between the groups estimates the variation of the overall results obtained from the subjects in the three different groups, receiving three different treatments—i.e., the variability that is between the (different) groups. The MS within the groups estimates the variation among the individual results obtained from subjects that are in the same group, receiving the same treatment—i.e., the variability within the (same) groups. Figure 6.21 shows this in a pictorial manner. Thus, the actual equation for an F-statistic is:

F = MSBetween / MSWithin

The equations for obtaining the mean squares will be provided at the end of the section. Conceptually, the MS between groups is a reflection of the treatment effect—if there is one to be found—along with a touch of random error due to the perspective of variability. On the other hand, the MS within groups is a representation of only random or residual error because it examines the individual differences that are contained within the groups. It is for this reason that some

Fig. 6.21 ANOVA analyzes the variability between and within the groups

refer to MS within as MS error. It should then not be a surprise to realize the signal-to-noise ratio when these measures are combined in a fraction:

F = (treatment effect + error) / residual error = signal / noise

Thus, when using the F test for a null hypothesis, a false H0 will support rejection because the ratio reflects the treatment effect above the amount of error contained in the study, similar to the figure above. This will appear numerically as an F-statistic that is much greater than 1, depending on the effect size. On the other hand, a true H0 will adopt a ratio that has only relatively equal amounts of variability between the groups (numerator) to variability within the groups (denominator), such that the value of the fraction is somewhere close to 1.

Ponder on the meaning of variability in terms of an experiment. Take an experiment that is interested in testing the effectiveness of three different strains of medical marijuana on appetite stimulation in cancer patients. The null hypothesis would state that there is no statistically significant difference between the effectiveness of the three different treatment modalities (i.e., strains of marijuana) in the degree to which they stimulate a hunger response. If this claim is true, then the lack of difference in hunger stimulation between all three groups should be supported by the lack of treatment effect in the F ratio. The only appreciable difference, then, that might be observed is attributable to inevitable random error, which would reflect as a small F-statistic.

Conversely, if there actually exists a statistically significant difference in the degree to which each distinct strain stimulates a hunger response, then the effectiveness (i.e., difference in treatments) would present itself in the form of a large effect size and, hence, a large F-statistic. Thus, the evidence renders the null hypothesis as false, showing that, in fact, there is a statistically significant difference in the distinct strains of marijuana. This difference is such that one strain's effectiveness on hunger stimulation was resilient enough to rise up, regardless of the error around it—like the rose that grew from the concrete.

The formula (and sub-formulae) for these statistical analyses are referred to as a one-way ANOVA, which analyzes the influence of one

independent variable on a dependent variable.⁵ In this manner, an independent variable is the qualitative variable that categorizes the subjects by groups—in the example above, the independent variable was the strains of medical marijuana. In general, the independent variable is perceived as the investigator-manipulated treatment. A dependent variable, also referred to as the outcome variable, is the continuous (quantitative) variable that measures the effect presumed to be caused by the independent variable. The dependent variable in the example above was the degree of hunger stimulation.

There are a few qualities that make an F distribution (Fig. 6.22) distinctive among the rest of the sampling distributions. The F distribution is most often presented as a right-skew distribution with a lower bound of zero. This is due to the square function in the MS calculation that squares the values in the numerator and denominator of the F ratio, removing any negative integers and causing the positive skew. Therefore, the right tail is the only rejection area, supporting the contention made earlier regarding larger F-values. However, the F test is still considered (in fact, limited to) a nondirectional or two-tailed test. Appendix D consists of a table of critical F-values that can be used for significance testing. To find the desired F critical, simply locate the value that intersects the two degrees of freedom (between/within).

Fig. 6.22 F distribution. See Appendix D for F table

We might be wondering what happens after a statistically significant F-value is obtained. That is, if the evidence supports the rejection of H0 and suggests that there is a difference between the treatment groups, then how do we know which one is most different? With two-group comparisons, this was a fairly simple determination because there were only two groups that were being compared. When multiple groups are introduced, a rejection of H0 does not imply that all of the groups that are being tested are necessarily different. Say we are looking at childhood obesity rates (dependent variable) among different ethnicities (independent variable). It might be the case that six ethnic groups are the same and four ethnic groups are different, eight groups are the same and two groups are different, etc.

Thus, not only will our alternative hypotheses change (see below), but our analysis continues after we have obtained a statistically significant F-value—i.e., we must further analyze which group(s) are different. These further analyses are referred to as post hoc analyses (post hoc, Latin for "after this"), where multiple comparisons are made between different groups after observing a statistically significant outcome in order to isolate the group(s) that is/are most different.

There are many different methods that have been developed for post hoc analyses, such as the Scheffé test, Bonferroni method, Tukey HSD test, Dunnett test, and Newman–Keuls test. All of these methods use slightly different approaches, depend upon the design, and compare different measures of the data in order to seek out the group(s) most different.⁶ The most versatile and conservative method is Scheffé, which compares all possible combinations of pairs of means. Although this post hoc test, along with the other multiple comparisons, is able to be calculated by hand, they are just as easily determined by a few key strokes using statistical software.

All of this taken together, we can now introduce the formulae relative to the analysis of variance (ANOVA) and its usage in multiple group comparisons when testing a null hypothesis for population means (F test).

⁵ Though we will be focusing only on one-way ANOVA, there exist other types of ANOVA, such as two-way, covariate, and multivariate (see Chiappelli 2014). A special case of ANOVA occurs during a repeated measure design, in which a single group is repeatedly measured multiple times against different treatments. The utility of ANOVA for repeated measures is similar to the related discussion under dependent samples t test.
⁶ See McHugh (2011).
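Before turning to the formulae, a brief sketch of how such a post hoc comparison might look in software may be helpful. The example below uses the Tukey HSD method through the statsmodels package (one of several valid choices; the scores and group labels are invented purely for illustration):

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical outcome scores for three treatment groups
scores = np.array([4, 5, 6, 5, 9, 8, 9, 10, 2, 3, 2, 4])
groups = np.array(["T1"] * 4 + ["T2"] * 4 + ["T3"] * 4)

# All pairwise comparisons, with the familywise error rate held at alpha = 0.05
result = pairwise_tukeyhsd(endog=scores, groups=groups, alpha=0.05)
print(result)   # summary table flags which pairs of group means differ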

F = MSBetween / MSWithin

MSBetween = SSBetween / dfBetween    MSWithin = SSWithin / dfWithin

SSBetween = n Σ(x̄group − x̄grand)²    dfBetween = k − 1
SSWithin = Σ(xi − x̄group)²    dfWithin = N − k

• k = number of groups
• N = number of total observations

Do not be overwhelmed by the formulae. First notice the formula for calculating both mean squares is simply their sum of squares (SS) divided by their respective degrees of freedom (df). Recall from descriptive statistics that the underlying concept of sum of squares is simply the calculation of variability (for SS calculation, see Appendix E). After obtaining those measures, the ratio of mean squares will produce the F-statistic we require. Results of an F test are most commonly illustrated as an ANOVA table, which organizes the measures for the calculation of an F-statistic; Step 4 of the protocol for hypothesis testing we provide below illustrates this table. In what follows shortly, the protocols to calculate the F ratio from data and for testing a null hypothesis are shown, along with an illustration that utilizes both protocols together in an example.

Steps for ANOVA
1. Research Question: Is there a statistically significant difference in (dependent variable) among (independent variable)?
(a) The dependent variable is the continuous measure that is being tested for an effect from the different treatment groups characterized by the independent variable.
2. Hypotheses: H0: 𝜇1 = 𝜇2 = 𝜇3 = … 𝜇k; H1: H0 is false—at least one group is different
(a) These are new hypotheses specific for this statistical test. Notice that they still capture the essence of both the null and alternative hypotheses.
1. The "𝜇k" in the null hypothesis above signifies that the number of 𝜇's written is dependent on the number of treatment groups (k) that are being tested.
2. For the alternative hypothesis, we are unable to follow the same "≠" logic as the other statistical tests because if H0 turns out to be false, it does not necessarily mean that all groups are different; only one group needs to be different for the rejection of H0.
3. Decision Rule:
(a) This can be written with either the p-value or F critical.
4. Calculation:

Source | Sum of squares (SS) | Degrees of freedom (df) | Mean square (MS) | F
Between | SSBW | dfBW | MSBW | F-statistic
Within | SSWN | dfWN | MSWN |
Total | SSTotal | dfTotal | X | X

(a) The calculation of the F test statistic is done here either by hand using the formula or via statistical software (see Video 7).
(b) The p-value can be roughly measured using the F table in Appendix D or a more exact measure via statistical software.
(c) No confidence interval calculation.
5. Decision: Reject/retain H0 because (insert corresponding mathematical proof)
(a) Same commentary as other statistical tests
6. Conclusion: Based on the results, there seems (to be a/to be no) statistically significant difference in (dependent variable) among (independent variable).
(a) Same commentary as other statistical tests.
(b) There is no confidence interval interpretation.
(c) A post hoc analysis is presented here.

Example
We are interested in testing the effectiveness of four different pain medications on decreasing the severity of a headache. Headache relief was measured by a standardized scale that quantified the degree of pain relief. Randomly, 200 participants were assigned to 4 different groups and were requested to fill out the headache relief scale after a week of taking their assigned medication. The scales were collected and tallied. The sum of squares between was determined to be 346, and the sum of squares within was equal to 2981. Determine whether there is a difference in effectiveness between the four pain relief medications at a level of significance of 0.01.

1. Is there a statistically significant difference in the degree of headache relief between the four different pain medications?
2. H0: 𝜇1 = 𝜇2 = 𝜇3 = 𝜇4; H1: H0 is false—at least one group is different.
3. At 𝛼 = 0.01, dfBW = 4 − 1 = 3, and dfWN = 200 − 4 = 196: If p ≤ 𝛼, then reject H0 | If p > 𝛼, then retain H0.
4.
Source | Sum of squares (SS) | Degrees of freedom (df) | Mean square (MS) | F
Between | 346 | 3 | 115.33 | 7.58
Within | 2981 | 196 | 15.21 |
Total | 3327 | 199 | X | X

5. p < 0.01; therefore, reject H0.
6. Based on the results, there seems to be a statistically significant difference in the degree of headache relief between the four different pain medications.
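The arithmetic of this example is easy to verify in software. A minimal Python sketch (assuming SciPy) converts the given sums of squares into mean squares, the F-statistic, and an exact p-value:

from scipy import stats

ss_between, ss_within = 346, 2981
df_between, df_within = 4 - 1, 200 - 4        # k - 1 and N - k

ms_between = ss_between / df_between          # 115.33
ms_within = ss_within / df_within             # 15.21
f_stat = ms_between / ms_within               # 7.58

p_value = stats.f.sf(f_stat, df_between, df_within)   # ≈ 0.0001

The exact p-value is well below 0.01, which is why Step 5 rejects H0.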

6.6 Continuous Data Analysis

The majority of the statistical analysis techniques that have been mentioned are chiefly concerned with examining the relationship between a set of groups. The distinct groups were primarily composed of different subjects (except for repeated measure designs) that were compared to observe an effect on a single outcome. In the case of ANOVA, we examined factors within groups, but this was simply to account for individual differences in terms of error; it was not, however, examined for a relationship within a singular group (Fig. 6.23).

Fig. 6.23 Comparing the difference in analysis within groups and within a singular group

In this final section of inferential statistics, we will discuss the statistical analysis techniques that allot the comparison of variables within a single group of subjects. Indeed, we will notice that these analysis techniques are not limited to inferential analyses. In fact, both techniques are grounded in descriptive analytical theory. The variables in consideration must be continuous in nature (i.e., quantitative data) to permit the ultimate parametric inference. The nonparametric counterpart for the analysis of categorical variables is examined further in Chap. 7.

6.6.1 Associations

During the introductions to inferential statistics, it was briefly mentioned that hypothesis testing is a method primarily useful during the examination of effect sizes. Observing a large effect size among comparable groups provides the initial foundation required for the establishment of causality relative to the variables in question. However, the difficulty of actually materializing the causation was also considered. Our reluctance toward this leads to our use of hypothesis testing for establishing associations in order to prevent fallacious inferences. Still, we distinguish those associations from these associations (i.e., the ones we discuss in this section) on two grounds:

1. Those associations described the relationship between groups.
2. Those associations set the precedent for potential causality, contingent on the availability of higher levels and quality of evidence.

On the other hand, these associations we are to discuss describe the relationship among characteristics of a single group of individuals. Statistically, these associations describe the relationship of two variables within a group. More importantly, it is stressed that the associative relationship established here does not imply a causal relationship.

6.6.1.1 Correlation
In statistics, particularly descriptive statistics, an association is represented as a correlation, which graphically illustrates the relationship between two continuous variables on a scatterplot. The numerical description of this relationship is denoted by the Pearson correlation coefficient (r). Developed by, and named after, one of the giants of contemporary statistics, Karl Pearson, the correlation coefficient (r) is a measure of the strength and direction that describes the association between two continuous variables. As implied above, critical to the calculation of the correlation coefficient is the presence of two continuous variables that are collected from and describe subjects within a single group.

To elaborate on this topic, we can ask a few questions:

1. How is the correlation coefficient calculated?
2. What does the correlation coefficient represent, graphically?
3. How is the strength and direction of the correlation coefficient interpreted once it is obtained?

We will answer these questions in terms of Pearson's first endeavor, which entailed determining the resemblance (i.e., association) in heights of fathers and sons within pairs of family members. Realize that a pair involves a single subject within the group in this context.

The two continuous variables collected from subjects within a single group for Pearson were father's height and son's height in centimeters (cm) (Table 6.1). After measuring the subjects, the data points from each variable are plotted as an (x, y) coordinate. The x-axis will represent all of the heights from the fathers, and the y-axis will represent all of the heights from the

sons. The data point that is plotted (x, y) is dependent on each pair's score for each variable, where x1 = height of father 1, y1 = height of son 1, and so on for all pairs in the data set (Fig. 6.24).

Table 6.1 Father's height and son's height in centimeters on a traditional xy plot

Pairs | Father's height (cm) (x) | Son's height (cm) (y)
1 | 210.00 | 205.00
2 | 239.00 | 230.00
3 | 219.00 | 199.00
4 | 222.00 | 220.00
5 | 250.00 | 249.00
6 | 216.00 | 218.00
7 | 208.00 | 221.00
8 | 199.00 | 202.00
9 | 226.00 | 220.00
10 | 197.00 | 199.00

Fig. 6.24 Heights from the fathers and the sons are plotted as data points

After plotting all data points from both variables, the correlation coefficient (r) is calculated by the following formula⁷:

r = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² Σ(yi − ȳ)²]

• x—continuous variable 1
• y—continuous variable 2
• i—measure from each subject (or pair)
• x̄—mean of continuous variable 1
• ȳ—mean of continuous variable 2

⁷ The formula for its calculation can be both tedious and time-consuming but was essential for statisticians and investigators in the nineteenth and twentieth centuries. Lucky for us, the computing of r today is most often done (easily) by any good statistical software program. The formula is provided here for continuity purposes; however, its calculation by hand will not be further discussed.

Graphically, the correlation coefficient represents the degree of linearity (i.e., straight line⁸) between the scatter of the data points from the two variables. Confirm that Fig. 6.24 above is, indeed, a scatterplot, in which we obtained r = +0.858. So, how does this numerical value translate information regarding the association between the heights of fathers and sons?

⁸ An important clarification is made, particularly for regression discussed in the next section.

To answer this—synonymous with the third question posed earlier—we must turn to the actual numerical value of the correlation coefficient and its underlying concept. But before we get to the answer, understand that the correlation coefficient is a unit-less⁹ measure which ranges from +1.00 to −1.00. By having just this value, we can describe the two fundamentally important attributes of an association, namely, the strength and direction, which also happens to answer our question.

⁹ The correlation coefficient is unit-less because when the terms are placed in the actual formula, the units in the numerator and denominator cancel each other out.

The strength of the correlation coefficient represents the degree of association between the two variables, which numerically is the absolute value of the correlation coefficient (|r|). Taking the absolute value suggests that only the numerical value (i.e., not the + or − sign) is required to make this interpretation. Therefore, the strength of an association is concerned with all possible values that range from 0.00 to 1.00.

In terms of strength, then, we can imply that the lower extreme (r = 0) describes a lack of association and the upper extreme (r = 1.00) describes a perfect association. However, these extreme values are very rarely observed. Instead, we say that correlation coefficient values closer to 1.00 describe strong associations, whereas values closer to 0.00 describe weak associations.
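As footnote 7 suggests, r is almost always computed by software today. A minimal Python sketch (our own, assuming NumPy) applies the definitional formula to the data in Table 6.1 and reproduces r = +0.858:

import numpy as np

fathers = np.array([210, 239, 219, 222, 250, 216, 208, 199, 226, 197])
sons = np.array([205, 230, 199, 220, 249, 218, 221, 202, 220, 199])

dx = fathers - fathers.mean()     # deviations of x from its mean
dy = sons - sons.mean()           # deviations of y from its mean

r = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())
print(round(r, 3))                # 0.858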

Fig. 6.25 Scatterplots depicting the strength and direction of associations: strong negative association (r = −0.8), no association (r = 0), and moderate positive association (r = +0.5)

Also, we can imply that r-values around 0.50 describe moderate associations, in which relatively close values above and below can be seen as moderately weak associations or moderately strong associations, respectively.

The direction of the correlation coefficient represents the behavior of the association between the two variables, which is denoted by the sign of the correlation coefficient (i.e., +/−). Conceptually, this is opposite to the strength of a correlation coefficient. Indeed, now we are concerned with the sign and not with the actual numerical value. Therefore, there must be only two possible scenarios:

• Positive associations—signified by a positive (+) r-value, depict the relationship between two variables as being directly proportional with one another—i.e., the variables behave in a similar manner:
– As the value of variable 1 (x) increases, the value of variable 2 (y) increases.
OR¹⁰
– As the value of variable 1 (x) decreases, the value of variable 2 (y) decreases.
• Negative associations—signified by a negative (−) r-value, depict the relationship between two variables as being inversely proportional with one another—i.e., the variables behave in a completely opposite manner:
– As the value of variable 1 (x) increases, the value of variable 2 (y) decreases.
OR
– As the value of variable 1 (x) decreases, the value of variable 2 (y) increases.

¹⁰ Do not let the word "positive" in positive association lead you to believe that this relationship is exclusive to variables that increase together. A positive relationship may also be used to describe two variables that decrease together. What is more important to understand is that they behave in a similar manner.

Now we are able to accurately interpret the correlation coefficient from the father–son example above. The r of +0.858 insinuates that there seems to be a strong positive association between the heights of fathers and the heights of their sons. Note that we incorporated both the strength and direction of the association in a single description of the relationship between the variables. Also, with these understandings, we are able to (roughly) imply the strength and direction of associations without the actual need of a correlation coefficient by simply viewing the data's scatterplot (Fig. 6.25)¹¹ (see Video 8).

¹¹ Although this may be a useful heuristic, having the actual correlation coefficient provides a much more accurate and precise description.

Observe in Fig. 6.25 that the strength of an association is dependent upon the distances between the cluster of points on the graph. The closer the dots are scattered together and approximate a straight line (linearity), the stronger the association between the variables. Similarly, if we imagine a straight line that follows the trend of the points, the
they behave in a similar manner. rate and precise description.

Fig. 6.26 Strength and direction, (r) number line: across the full range from −1.00 to +1.00, values between −0.4 and +0.4 are weak, values beyond ±0.4 are medium, and values beyond ±0.7 are strong

slope of that line reflects the direction of the association. A slope on the graph that extends from the lower left up to the upper right represents a positive association, whereas a slope that extends from the upper left down to the lower right represents a negative association. Figure 6.26 illustrates a number line that considers all possible values of r in relation to both strength and direction.

The final point we wish to make is regarding the order of the variables. Similar to the contention in this section's introduction, we reiterate the fact that we are unable to imply causal relationships from correlation coefficients—no matter how strong the association may be. In previous associations that set the precedent for potential causality, in which an effect was sought, our variables were either labeled as independent or dependent. Here, though, that classification is not present, further suggesting that the order in which the variables are created or analyzed (i.e., either x or y) is not of importance.

For example, in the father–son example, the variables could have been constructed on the opposite axes. Father's heights could have been the y variable, and son's heights could have been the x variable. Nonetheless, the identical value for r would still have been obtained. Thus, we are unable to claim that it is the father's height that affects the son's height or vice versa—further proving the absence of an effect and the strict prohibition of causality.

The majority of our current discussion has had more to do with descriptive statistics, rather than inferential statistics. Certainly, correlation can be a powerful tool in the description of relationships among continuous variables. Still, as appropriately placed in this chapter, the Pearson correlation coefficient is inherently a parametric measure that can be utilized to take useful associations obtained from a sample and infer them as conclusions onto the population. Like the other statistical analysis techniques, the correlation coefficient (r) is a statistic (i.e., sample measure), in which its analogous parameter is 𝜌, the Greek letter "rho." The population correlation coefficient is the hypothesized measure being tested. Thus, we similarly utilize the sample correlation coefficient (r) in hypothesis testing to draw conclusions regarding such associations in the underlying parent population.

Unlike our other statistical tests, the test used for hypothesis testing regarding an association in the population is not a variation of the formula for the Pearson correlation coefficient provided above. Instead, it is a variation of a t ratio which is, in fact, a t test for a single population correlation coefficient 𝜌. Nonetheless, the inferences we make must consider the sampling distribution of r in order to hypothesize the correlation coefficient parameter 𝜌hyp. This permits the testing of the null hypothesis that claims a lack of association between the two continuous variables (i.e., 𝜌 = 0). Also, we dare not forget that this test must similarly fulfill and satisfy the three assumptions of parametric statistics.

The actual protocol for hypothesis tests of significance for correlation coefficients is much more extensive than our other parametric tests, particularly in the absence of a statistical software program. Indeed, today correlation coefficients and their tests of significance are easily obtained by a few strokes of a keyboard. Hence, we refrain from providing the protocol with the refined t test as they are beyond the scope of this book. However, in the name of knowledge, the formula for this newly refined t test is provided below, along with additional resources for the curious mind.¹²

¹² See Furr (n.d.).

t = (r − 𝜌hyp) / √[(1 − r²) / (n − 2)]

• r = sample correlation coefficient
• 𝜌 = population correlation coefficient
• n = sample size
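Although the full protocol is omitted here, the refined t test itself takes only a few lines of code. The sketch below (our own, assuming SciPy, and reusing the father–son data from Table 6.1) shows that the p-value reported by scipy.stats.pearsonr agrees with the hand computation when 𝜌hyp = 0:

import numpy as np
from scipy import stats

fathers = np.array([210, 239, 219, 222, 250, 216, 208, 199, 226, 197])
sons = np.array([205, 230, 199, 220, 249, 218, 221, 202, 220, 199])

r, p = stats.pearsonr(fathers, sons)              # r ≈ +0.858, two-tailed p

# Refined t test for H0: rho = 0
n = len(fathers)
t_stat = (r - 0) / np.sqrt((1 - r**2) / (n - 2))  # t ≈ 4.73 with df = n - 2
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # matches p above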
both variables. That is, one variable is not caus-
ing or effecting the other variable to behave in the
6.6.2 Predictions respective manner—the order of the variables is
not important. However, fundamental to tech-
The ability to make cogent predictions has played niques used in prediction-making is the under-
and will continue to play an important role in the standing that the associative relationship shared
domain of healthcare. Predictions we make range among the variables does set the precedent for
from patient behavior and medication response to causality, supported by previously established
disease likelihoods and future epidemics. But the causal relationships. This underlying attribute is
predictions do not entail simple guesses from necessary, such that it encourages the prediction
hunches or mystical future-telling—instead, the of an effect for future observations in a similar
predictions we make are grounded in probability. statistical manner.
More specifically, statistical predictions are Statistics, then, requires there has to be a
based on a Fisherian or a frequentist approach to method of establishing a prediction based on the
probability that considers the relative frequency institution of a linear relationship between con-
of previous observations to predict similar future tinuous variables. As a statistical analysis tech-
observations (see Chap. 4). As the frequency of nique, regression refers to the process of
these observations increases, the probability of estimating a prediction by observing the
their occurrence also increases—now enter ­relationship of a given dependent variable on the
statistics. independent variable(s). The dependent variable
The most (statistically) fundamental method is referred to as the outcome or response variable,
of predicting observations is by characterizing and the independent variable(s) represent the
the likelihood of a given outcome as a function of predictor(s). By segregating variables as either
different variables. Consider this in terms of a dependent or independent, we return to the inves-
meteorologist attempting to forecast (predict) tigation of effect sizes.
this week’s weather. There are a multitude of Still differing from correlation, regression
variables that she must consider, such as atmo- promotes the understanding that the relationship
spheric pressure, humidity, temperature, etc., as a between the variables is not mutual—in regres-
function of the weather as a whole. In order for sion, we observe previous relationships of the
this to happen, there must be some knowledge of independent variable(s) effecting the dependent
an association or relationship between the vari- variable, which sets the precedent for the predic-
ables in consideration. tions to be made of future instances. But, as
For example, to predict whether a physician’s always is the case, many previous instances of
ethnicity determines the particular patient that this relationship are not all the same. Undoubtedly,
visits the physician, there must first exist an asso- there exists some degree of variation among the
ciation between, say, members of an ethnic previous instances.
minority visiting physicians of the same (or dif- For example, if it has already been established
ferent) ethnic background. Indeed, associations that heavy smoking behavior affects the rates of
go hand in hand with predictions in such a way developing certain cancers, how come we are
that both concepts require the establishment of a unable to precisely identify the exact individuals

Fig. 6.27 Scatterplot with fitted regression line (Father-Son Height ScatterPlot; fitted line y = 0.8105x + 39.133)

For example, even though heavy smoking behavior is established to affect the rates of developing certain cancers, how come we are unable to precisely identify the exact individuals who will develop the disease, or the exact rates at which the disease is developed? Conversely, how come we fail in accurately predicting the development of these diseases in individuals that are not heavy smokers? The point we are attempting to make can be better imparted by rephrasing into a much more general question. Why are we unable to accurately predict future outcomes? Answer: Uncertainty.

Recall that statistics perceives the innate uncertainty contained within our observations as variability (see Chap. 4). So, just as before, we acknowledge our continued inclination of minimizing the contained error in the overall size of the effect. Thus, in the context of regression, this interest is primarily important for making more accurate and precise predictions. Furthermore, due to the inherent associative linearity, scatterplots are taken advantage of not only to describe regressions but also as a tool to measure and elucidate variability.

6.6.2.1 Simple Linear Regression
The linear relationship between the dependent and independent variables, along with the contained error, is determined by the linear regression line (also referred to as the least-squares regression line). This line represents how the dependent variable (y) regresses on the independent variable (x), which is done by fitting the best-fit line between the plots of dots (Fig. 6.27).¹³ The most basic regression analysis is referred to as a simple linear regression, where a line is graphed on a scatterplot of a dependent variable and a single independent variable.

Sample: Y = bX + a + e    Population: Y = βX + α + ε

• Y = dependent variable, continuous
• X = independent variable, continuous
• b/β = slope
• a/α = y-intercept
• e/ε = residual error

Notice that the regression line equation innately represents a basic mathematical concept referred to as the slope–intercept formula (y = mx + b), but with the addition of the residual error (e/ε), also known as the residuals. The residual error, similar to random error, is the variability within our results that is unexplainable or has been unaccounted for.¹⁴ Mathematically, the error is represented by the scattering of the dots—i.e., the vertical distances between the data points relative to the regression line (Fig. 6.28). The addition of the residual error is the elucidation of uncertainty that is inherently contained within our data and, henceforth, the predictions we wish to make.

¹³ Recall from correlations that this linear line was simply imagined—however, here, we actually graph this line.
¹⁴ This add-on also makes the equation akin to the generalizability (G) theory, briefly described in Chap. 3. The G theory is a statistical framework that aims to fractionate the error in the measurements we make, which ultimately allows our findings to be closer and closer to the true value: X = T + ε.

Fig. 6.28 Residual error depicted in scatterplot: the vertical distance (error) between each observed Y and the predicted Y on the regression line (y = 0.8105x + 39.133)

The residual error is not shown in the regression equation for descriptive purposes; rather it is more often shown during inferential analyses.

Both a and α are mathematically the y-intercept of their respective regression equations. They essentially describe the expected value of y when x is 0. The calculation of this term (equation provided below) is useful in completing the line equation. However, in most of the research in the health sciences, the interpretation of this measure is not primarily meaningful.

In the regression equations, the most important term is the slope of the line, referred to as the regression coefficient (b). This measure represents the effect of the independent variable on the dependent variable. That is, for every associated change in the independent variable, it considers the expected change in the dependent variable (Y).¹⁵ Below we provide the formula for calculating both the regression coefficients b and a, along with an example to tie all of the aforementioned together:

b = r (sy / sx)    a = Ȳ − bX̄

• r = Pearson correlation coefficient
• sx = sample standard deviation of the independent variable (Xs)
• sy = sample standard deviation of the dependent variable (Ys)
• X̄ = the average of the independent variable
• Ȳ = the average of the dependent variable

¹⁵ The same is true for β, the parametric regression coefficient. However, β is standardized via z-transformation in order to be able to describe the parameter; for this reason, b is denoted as the unstandardized regression coefficient and β as the standardized regression coefficient. See Chiappelli (2014) and Bewick et al. (2003).

Example
We are interested in determining whether years of heavy alcohol consumption (X) are able to predict life expectancy (Y). The correlation coefficient of −0.78 describes a strong negative association between the two continuous variables. Simple linear regression analysis finds the regression coefficients b = −1.39 and a = 76.78, which culminate into the line equation: y = −1.39X + 76.78. Moe, a 15-year alcoholic, has consented to be the study subject. Using the equation y = −1.39(15) + 76.78, we find y = 55.93, which is interpreted as Moe's predicted life expectancy to be 55.93 years.

We might be wondering how truly accurate of a prediction this is for Moe. Well, the answer to this question is discussed in depth in the section to come. For the time being, let us conceptually consider how well of a predictor alcohol consumption

is in terms of life expectancy. Surely, it is not the single or even most effective predictor. Moreover, there can be many different factors, whether good or bad, that if considered can provide a much better prediction of Moe's life expectancy. Factors such as physical activity, genetic disposition, or smoking behavior are also able to give us information regarding life expectancy. In terms of regression, the different factors are represented as additional predictors, which take on the form of independent variables.

6.6.2.2 Multiple Linear Regression
Regression analyses for predictions with multiple independent variables (i.e., multiple predictors) are conducted with the technique referred to as a multiple linear regression.

Sample: Y = b1X1 + b2X2 + … + biXi + a + e
Population: Y = β1X1 + β2X2 + … + βiXi + α + ε

As briefly hinted above, there is an important utility of multiple linear regression over simple linear regression relative to prediction-making. The more predictors (X's, independent variables) we use to predict a given outcome, the more error we are able to account for. The more explained error, the smaller the remaining (unexplained or unaccounted) error tends to be, i.e., residual error (ε). In essence, this is the process of error fractionation, whereby the error—originally unexplainable—is divided into components (i.e., fractions), which ultimately reveal sources of error that were otherwise unaccountable. The knowledge of this error, in turn, ultimately facilitates a more accurate and precise prediction. Therefore, we can deduce that accurate and precise predictions entail a minimal amount of predictive error. The reduction of predictive errors is one of the primary advantages of fitting a regression line to a scatterplot of data.

The estimation of predictive errors becomes particularly important when we begin to conjecture on health-related predictions useful to public health. In this case, regression analyses are considered in terms of inferential statistics. Indeed, predictive inferences are able to provide substantial information regarding health-related outcomes. More importantly, we might want to consider how good or bad a regression line is, particularly when we are interested in making a predictive inference. The quality of a regression line depends on the amount of error it is able to explain and its ultimate predictive error.

There is one last measure relevant to prediction-making in both forms of regression analyses that we have yet to mention and that is critical for the consideration of predictive error. The coefficient of determination (R²) indicates the proportion of total variability in one variable that is predictable from its relationship with another variable. The coefficient of determination entails the predictive accuracy of an inference consensus. Some refer to the coefficient of determination as shared variance, explained variability, or predictive error.

The measure of predictive error is able to estimate the quality of a regression line—or, better stated, the goodness of fit of the regression line. We can determine whether the data provides a good or bad regression line by converting the decimal provided by R² to a percentage. Hence, the larger the percentage, the better the fit of the line, and the more accurate the prediction. Moreover, the complement of this estimate, namely, 1 − R², is just as useful in determining the unpredictive error, which in turn provides an idea of the inaccuracy of our prediction.

In simple linear regression analyses, the R² is (simply) determined by squaring the Pearson correlation coefficient. Returning to the previous example, squaring the correlation coefficient of −0.78 gives an R² equal to about 0.608. As a percent—how it is most often presented—this means that about 60.8% of the total variability present in life expectancy is predictable from the variability in years of alcohol consumption. This may not seem so fortunate for our friend Moe. However, if we consider the unpredictive error in this estimation, 1 − 0.608 = 0.392, or 39.2% of the variability still remains a mystery. Moreover, if we consider the numerous other factors relevant for life expectancy, then—well, let us just say—Moe is still in the game.
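A minimal Python sketch (assuming SciPy) ties the regression coefficients and the coefficient of determination together; applied to the father–son data from Table 6.1, it recovers the fitted line shown in Fig. 6.27:

import numpy as np
from scipy import stats

fathers = np.array([210, 239, 219, 222, 250, 216, 208, 199, 226, 197])
sons = np.array([205, 230, 199, 220, 249, 218, 221, 202, 220, 199])

fit = stats.linregress(fathers, sons)
print(fit.slope, fit.intercept)   # b ≈ 0.8105, a ≈ 39.13 (cf. Fig. 6.27)
print(fit.rvalue**2)              # R² ≈ 0.74

# Predict a son's height for a 230 cm father (an illustrative value of ours)
y_hat = fit.slope * 230 + fit.intercept

Here fit.rvalue**2 ≈ 0.74, i.e., about 74% of the variability in son's height is predictable from father's height in this sample.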

The addition of other predictors is exclusive to a multiple linear regression analysis, in which we are able to consider a larger degree of predictive error and ultimately provide a more accurate prediction of Moe's life expectancy. As mentioned, the method of determining R² in this case is not as straightforward as it was for the simple linear regression. We cannot just square the correlation coefficient in this instance because a multiple linear regression contains more than two variables and does not have a single correlation coefficient. In order to obtain the R² for a multiple linear regression, we must utilize a method of ANOVA, which provides an F-statistic that serves as the coefficient of determination. Furthermore, the F-statistic relative to regression analyses may be used as an F test for hypothesis testing.

To make parametric inferences by way of hypothesis testing, it becomes more about how an outcome can be predicted on the basis of certain predictors. Now we become interested in identifying the statistically significant predictors (independent variables) of a given outcome or response (dependent variable). The refined F-statistic used in conjunction with regression analyses can provide predictive inferences and can be tested for statistical significance—presuming they satisfy the assumptions of parametric statistics. But unique to regression analyses is the satisfaction of a fourth assumption, namely, homoscedasticity.

Homoscedasticity can be viewed as analogous to homogeneity of variances, but for the variation of the vertical distances (see Fig. 6.28). This assumption is proffered by the predictive error that results from the vertical distances of the data points on a scatterplot.

Like correlations, we omit the extensive discussion regarding the relative hypotheses and significance tests relating to both simple and multiple linear regressions as they are beyond the scope of this book. Their calculations are just as easily obtained using a variety of statistical software programs (see Video 9). Both simple and multiple linear regression analyses use distinct yet familiar test statistics to consider the null hypothesis pending further substantiated evidence.

Simple linear regression modifies the t test, which tests the null hypothesis that X has no effect on Y, written as H0: β = 0. On the other hand, multiple linear regression modifies the F test due to the addition of more independent variables. Like parametric ANOVA, the null hypothesis in this type of analysis equates all of the regression coefficients together, in which the number of β's depends on the number of independent variables present (i.e., H0: β1 = β2 = β3 = … = βp, where p is the number of independent variables). Here, the null hypothesis claims that the regression model in consideration does not fit the population.

6.7 Self-Study: Practice Problems

1. Determine the critical values based on the specific two-tailed statistical tests for each of the following:
(a) A one-sample t test with α = 0.05 and n = 21
(b) An independent sample t test with α = 0.01 and df = 45
(c) A z test with α = 0.05
(d) A z test with α = 0.01
(e) An F test with dfbetween = 7 and dfwithin = 32
2. The Board of Education has just reported that college students study an average of 15 h a week with a standard deviation of 6.7. You want to test whether the study hours of students at your college are a representation of the entire nation. A sample of collected data provided the following: 8.5, 19, 9, 15, 6, 1, 6, 10, 7, 6, 28, 35, 8, 20, 6.5, 11, 2, 8, 5, 9
(a) Use a one-sample z test to test the hypothesis at α = 0.01 using the six steps.
(b) Calculate and interpret a 99% confidence interval.
3. Determine the appropriate t test for each of the following scenarios:
(a) Your boss wants you to determine if there is a difference in effectiveness between different cognitive therapies. College students are randomly assigned to receive either behavioral or cognitive therapy.
ing further substantiated evidence. either behavioral or cognitive therapy.
After 20 therapeutic sessions, each student earns a score on a mental health questionnaire.
(b) One hundred pharmacy students attend a seminar on novel therapeutic treatments. Students are tested once before the seminar and once after the seminar in order to gauge the effectiveness of the seminar.
(c) According to the US Department of Health, the average 16-year-old male can do 23 pushups. A physical education instructor wants to find if 30 randomly selected lazy 16-year-olds are meeting this recommended standard.
(d) The Centers for Disease Control and Prevention recommend the ideal daily dietary fiber consumption for adults. You decide to see if your group of friends meet this criterion or not.
4. The incubation period for the Zika virus is between 2 and 17 days. A recently discovered strain of Zika virus has an average incubation time of 6 days. A Zika outbreak in the population has a group of epidemiologists curious as to whether the new epidemic is really the recently discovered Zika strain. A random sample of Zika patients from the recent outbreak in the population revealed the following incubation times in days:
8, 2, 3, 11, 7, 8, 2, 5
(a) Test the hypothesis at α = 0.05 using the six steps and p-value for proof.
(b) Calculate and interpret a 95% confidence interval for the true population mean incubation time.
(c) Would you say the Zika virus in the population is the hypothetical strain? Explain and provide proof.
(d) Without doing any further hypothesis testing, would the decision regarding the null hypothesis change at α = 0.01? Provide proof.
5. A group of psychologists is interested in determining whether there is a statistically significant effect on IQ in children who were breastfed compared to children who were not. The researchers recruited ten pairs of siblings, in which one sibling was breastfed and the other bottle-fed. The following are the scores collected on a standardized IQ measuring tool.

Pair of siblings   Breastfed sibling IQ   Bottle-fed sibling IQ
Pair 1             119                    115
Pair 2              96                     97
Pair 3             102                    105
Pair 4             111                    110
Pair 5              79                     83
Pair 6              88                     90
Pair 7              87                     84
Pair 8              99                     99
Pair 9             126                    121
Pair 10            106                    101

6. Referring to the breastfeeding study above:
(a) Make a decision regarding the null hypothesis using the six steps.
(b) Which type of statistical measure did the researchers take advantage of?
(c) What might be a few sources of error that the researchers might want to consider?
7. A group of 40 middle-aged men are recruited to a study in order to determine which home remedy is best suited for decreasing the severity of the seasonal flu. The men are categorized into four groups with different interventions: orange juice, chicken soup, green tea, and salt water. The men are reported to drink a cup of their specific intervention each day for 1 week and told to report the severity of their condition on a ten-point scale.

OJ   C. Soup   G. Tea   S. Water
5    8         3        1
7    7         2        7
3    4         5        2
3    7         5        4
6    5         3        1
5    9         3        4
8    9         2        4
7    7         1        2
6    6         4        2
5    6         4        1
(a) Using the six steps of hypothesis testing, test the null hypothesis at a significance level of 0.01. (Hint: use Appendix E for SS calculation.)
(b) Based on your conclusion, can you determine which group is most effective? Explain.
8. Answer the following questions based on the ANOVA table below:

Source    Sum of squares (SS)   Degrees of freedom (df)   Mean square (MS)   F
Between   ?                     4                         37.5               ?
Within    5250                  245                       ?                  —
Total     ?                     249                       —                  —

(a) Calculate the missing values in the table above.
(b) How many groups were studied?
(c) Assuming the groups were equal in size, what is the sample size of each group?
(d) Determine the p-value for the outcome. If this study could be done again, what might you suggest?
9. A scatterplot describing the relationship between cholesterol levels and caloric intake renders a Pearson correlation coefficient of r = +0.582.
(a) Describe the strength and direction of r.
(b) Provide a verbal interpretation of the correlation between cholesterol levels and caloric intake.
(c) Can it be said that lower caloric intake causes lower cholesterol levels? Explain.
10. In regard to the r = +0.582 that describes the association between cholesterol levels and caloric intake from above, answer the following questions:
(a) What percent of the variability in cholesterol levels is predictable from its association with caloric intake?
(b) What percent of the variability in cholesterol levels is unpredictable from its association with caloric intake?
(c) Is the R² able to predict the percent of people with high cholesterol levels? Explain.
11. The multiple linear regression line below estimates the effects the number of daily cigarettes, daily alcoholic beverages, and age have on monthly urinary output (L).
ŷ = 0.466 − 0.181xCIG − 0.299xALC + 0.333xAGE
(a) Identify which of the variables are predictors and which are responses.
(b) A patient has volunteered his time to the research study and is interested to know whether his behavior can predict his urinary output. You learn that he has four cigarettes a day and a single alcoholic beverage with dinner and is 56 years of age. Calculate his predicted urinary output.
(c) Assuming control for the other variables, interpret the regression coefficient of the variable for daily cigarettes.
(d) With a coefficient of determination (R²) of 0.388, what would you say regarding the predictive accuracy of the potential inference consensus that may result from this study?
(See back of book for answers to Chapter Practice Problems.)
7  Nonparametric Statistics

Contents
7.1 Core Concepts  123
7.2 Conceptual Introduction  124
7.2.1 What Is Nonparametric Statistics?   124
7.2.2 When Must We Use the Nonparametric Paradigm?   125
7.2.3 Why Should We Run Nonparametric Inferences?   125
7.3 Nonparametric Comparisons of Two Groups  126
7.3.1 Wilcoxon Rank-Sum   126
7.3.2 Wilcoxon Signed-Rank   127
7.3.3 Mann–Whitney U   127
7.4 Nonparametric Comparisons of More than Two Groups  128
7.4.1 Kruskal–Wallis for One-Way ANOVA   128
7.4.2 Friedman for Factorial ANOVA   129
7.4.3 Geisser–Greenhouse Correction for Heterogeneous Variances   129
7.5 Categorical Data Analysis  129
7.5.1 The Chi-Square (χ2) Tests, Including Small and Matched Designs   130
7.5.2 Time Series Analysis with χ2: Kaplan–Meier Survival and Cox Test   133
7.5.3 Association and Prediction: Logistic Regression   134
7.6 Self-Study: Practice Problems  136
Recommended Reading  137

7.1 Core Concepts

Nicole Balenton

Electronic supplementary material: The online version of this chapter (https://doi.org/10.1007/978-3-662-57437-9_7) contains supplementary material, which is available to authorized users.

The three assumptions (i.e., normality, independence of measurement, and homogeneity of variance) are necessary for parametric statistics, as mentioned in the previous chapters. Nonparametric statistics is a statistical method for situations where the three assumptions are not satisfied and the data are not continuous. It uses data that are categorical and that rely not on numbers but instead on a ranking, or ordering, of slots. Nonparametric statistics differ from parametric statistics in that the model structure is determined from the data itself.
The term nonparametric does not imply that these models lack parameters, but rather that the parameters are flexible and not fixed in advance, as they are with parametric statistics. Here the inferences
regarding the population cannot be extrapolated; therefore, no assumptions or generalizations can be made regarding the distribution of the population. This can explain why it is sometimes referred to as a distribution-free method. Used as a simple preliminary test of statistical significance, nonparametric statistics is a method commonly used to model and analyze ordinal or nominal data with small sample sizes. Since nonparametric statistics relies on fewer assumptions, these methods are more robust. The simplicity and robustness of nonparametric statistics leaves less room for improper use and misunderstanding.
This chapter focuses on the nonparametric comparison of two groups and of more than two groups, with an emphasis on tests for unifactorial designs and multifactorial designs, as well as a brief discussion of categorical data analysis. Though there are a number of frequently used tests, this book highlights a handful of these mathematical procedures for statistical hypothesis testing. By the end of the chapter, we learn that the objective of all statistical analyses is to reveal the underlying systematic variations in a set of data from experimental manipulation or observed measured variables.

7.2 Conceptual Introduction

7.2.1 What Is Nonparametric Statistics?

Nonparametric statistics are statistics that are, as the name indicates, distinct from parametric statistics. The former pertain to situations where the raw data are categorical in nature, rather than continuous (see Chaps. 5 and 6).
In addition, nonparametric statistics also pertain to research models where a set of parametric assumptions—e.g., homogeneity of variance, normal distribution of measurements of the outcome variable, and independence of outcome—are not verified, and the investigator is thus unable to use the sample data to make inferences about the population. When one or more of these assumptions are not verified, it follows that we cannot and must not use parametric statistics and attempt to infer properties of the population from the sample. Simply stated, it is in those cases that we must rely on nonparametric statistics.
Another important characteristic of nonparametric statistics is that they never require the assumption that the structure of a research model is fixed. In research designs where the structure of the investigational model changes and evolves during the study—as is often the case in adaptive clinical trials or summative evaluation protocols, e.g., with alterations in sample size resulting from adaptive changes, mortality, dropout, or related situations—parametric constraints become limiting. Consequently, often albeit not always, nonparametric tools of comparison or of prediction become the statistical approach of choice.
To be clear, in such cases of flexible research models, frequentist statistical inference, such as that advocated by probabilistic statisticians with a Fisherian formation, will prove less useful, less reliable, less manageable, and overall less appropriate than a Bayesian approach. Of course, Bayesian statistics can entertain statistical tests that are traditionally considered to be parametric, such as the t test, ANOVA, regression, and the like, as well as the nonparametric tests described in this chapter. But Bayesian statistics proffers the unequalled advantage of progression toward the absolute, true characterization of the population by sequentially adding current new observations to established priors. From that perspective, in any and all adaptive trials and related flexible research situations, Bayesian inferences are undoubtedly preferred to probabilistic statistics, which are static in time and can only be repeated in time-dependent repeated measures designs.
To be clear, whether the fundamental research question deals with a comparison between groups, or with a set of inferences to be drawn about future outcomes of the dependent variable, which relates to predictions, the rationale of the process of statistical analysis remains the same: to compare the obtained p with the preset level of significance, α (see Sect. 5.3 on significance). These considerations, while they pertain to the analysis of both categorical and continuous data, are particularly relevant to continuous data analysis, because these data permit us to draw conclusions—that is to say, statistical inferences—about the population.
Categorical data do not, ever, in any circumstance allow statistical inferences about the population. The only exception to this fundamental rule of biostatistics pertains to situations where the counts obtained in categorical data—e.g., white blood cell count in millions—are, by convention, taken as continuous. In those circumstances, these data must satisfy the parametric assumptions.
In brief, continuous data have the advantage over categorical data of allowing extrapolations to the population. That is to say, observations made on a discrete sample can—provided that the parametric assumptions are satisfied—be used to describe certain properties of the population.
It should be self-evident that this can only be the case when and if the population is taken by convention to be a fixed entity, with a known mean, μ, and standard deviation, σ, of which our sample is in fact representative. This convention takes us into the camp of Fisherian probabilistic statistics, which essentially states that there is out there a "beast" that we can refer to as the "population," which we can study and comprehend. The alternative Bayesian position states that the parameters of the population may in fact never be known and are progressively approximated by each iteration of integrating the priors with newer observations.
The probabilistic perspective predominates current trends of research in the health sciences and is therefore the one commonly found in published reports. This view proposes that the population is characterized by parameters, such as its mean, μ, and its standard deviation, σ. In this paradigm, statistical analyses are based on sample statistics, which are used to characterize the population—that is, to make inferences about the parameters of the population—hence the terminology "parametric." Nonparametric statistics simply address situations where inferences about the parameters of the population cannot be drawn, for the reasons outlined above.

7.2.2 When Must We Use the Nonparametric Paradigm?

In brief, the field of nonparametric statistics is a parallel universe, as it were, to probabilistic statistical inference. Nonparametric statistics pertain to a domain of statistical inference where the investigator is bound by the fact that either the data themselves are not continuous—i.e., they are categorical in nature—or certain fundamental assumptions (homogeneity of variance, normality, and independence) are violated by the data. Consequently, the conclusions based on the statistical analyses—the statistical inferences—cannot be extended and extrapolated to characterizing the population. Rather, they must be restricted to explaining the sample alone. Nonparametric statistics can never allow a descriptive, summative, or even formative generalization of the inferences to the population.
In brief, nonparametric statistics are:

• Always required when the data under analysis are not continuous measurements obtained from interval scales but rather categorical counts derived from simple enumeration of quantities in certain categories defined by nominal or ordinal variables—it follows that a simple transition from a parametric to a nonparametric consideration of the data rests on the translation of continuous measurements to a categorization of the numbers in the form of ranking.
• Not based on parameterized families of probability distributions.
• Whereas they include, like their parametric counterpart, both descriptive and inferential statistics, nonparametric statistics require no assumption and make no assumption about the probability distributions of the variables being assessed (e.g., normality, homogeneity of variance, independence of measurement).

7.2.3 Why Should We Run Nonparametric Inferences?

We have noted above that the primary reasons for using nonparametric inference can be summarized as:

1. Violation of the assumptions fundamental and sine qua non for parametric inferences
2. Categorical rather than continuous data
3. Variable, or adaptive (or, in certain circles, "random"), rather than fixed research models
From the practical standpoint, nonparametric tests are often preferred over their parametric equivalents because of their computational ease. In other words, they can provide the investigator with a fast and simple determination of significance of a preliminary set of observations. Due both to this simplicity and to their consequentially greater robustness, nonparametric methods are seen by some statisticians as leaving less room for improper use and misunderstanding.
Whereas it is true that nonparametric statistics are less stringent than their corresponding parametric tests and have, consequentially, less statistical power, it is also true that they are simpler and faster to run, because they use a noncontinuous or a semicontinuous representation of the data. Therefore, nonparametric analyses are often used as a simple, "quick and dirty" preliminary test of statistical significance.

7.3 Nonparametric Comparisons of Two Groups

In brief, the t test is a parametric test for comparing two groups only. If the assumptions noted before are violated, then a t test cannot and should not be used. That is to say, the raw data cannot be compared directly, but rather the relative ranking among the data of the two groups must be considered. Several tests exist, among which some are listed below.

• Median test: tests whether two samples are drawn from distributions with equal medians
• Squared ranks test: tests equality of variances in two or more samples
• Tukey–Duckworth test: tests equality of two distributions by using ranks
• Siegel–Tukey test: tests for differences in scale between two groups
• Wilcoxon signed-rank test: tests whether matched pair samples are drawn from populations with different mean ranks
• Mann–Whitney U test: tests whether two independent samples were selected from populations having the same distribution
• Wilcoxon rank sum test: tests whether two samples are drawn from the same distribution, as compared to a given alternative hypothesis—conceptually related to the Mann–Whitney U test

Detailed consideration is given below to the principal ones for the nonparametric comparison of two groups, viz., the Wilcoxon family of tests, including the rank sum test, the signed-rank test, and the Mann–Whitney U test.

7.3.1 Wilcoxon Rank-Sum

For example, let us look at these two groups:

Group 1   Group 2
70        82
62        66
53        65
54        62
44

These are the raw data—now let us look at the same two groups and rank the data:

Group 1   Rank of group 1   Group 2   Rank of group 2
70        8                 82        9
62        4                 66        7
53        2                 65        6
54        3                 62        4
44        1

Note the following two important points:

• Both group 1 and group 2 have the value 62, which corresponds to the 4th rank: in both groups the value of 62 will obtain the rank of 4, but to keep the ranking consistent, the next rank assigned will have to be 6 (we skip 5).
• Just looking at the ranks now, we observe that, overall, group 2 has the highest ranks—meaning here that the highest measured values are found overall in group 2, except for one value, 62. The lowest ranks belong to group 1, except for one value, 70. So there is some degree of overlap in ranks; but in fact, we could compute the mean ± standard deviation of the ranks in each group—could we not?
Group 1: 3.6 ± 2.70
Group 2: 6.5 ± 2.1

Now, based on this very simple example, we have the exceedingly strong impulse to do either of two things—or both:

• Compare the extent of overlap of the ranks by means of the W statistic.
• Compare the means of the ranks by a standard T statistic.

Both are permitted, and no assumptions are required for option 2, because we already have given up, as it were, the option of parametric inferences, due to the fact that we are not dealing with the raw data any more, but in fact with the relative ranks of the original data (see Video 10).
The null hypothesis of the Wilcoxon rank sum test is usually taken as equal medians. The alternative hypothesis is stated as "the true location shift is not equal to 0." That is another way of saying "the distribution of one population is shifted to the left or right of the other," which implies different medians.
In comparing the ranks of the data in two groups, either of two approaches can be followed: either we compare the overlap of the ranks by means of the Mann–Whitney test, or we compare the means of the ranks by the Wilcoxon test, which is essentially a t test approach on the ranks.
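As a quick check of this example, the following minimal sketch (assuming Python with the SciPy package) ranks the pooled data and runs the rank sum test. Note that SciPy assigns tied values their average rank (4.5 for the two 62s), a convention slightly different from the simplified ranking illustrated above:

from scipy import stats

group1 = [70, 62, 53, 54, 44]
group2 = [82, 66, 65, 62]

# Ranks of the pooled data: the first five entries belong to group 1
ranks = stats.rankdata(group1 + group2)
print(ranks[:5], ranks[5:])

# Wilcoxon rank-sum test (normal approximation) of equal medians
stat, p = stats.ranksums(group1, group2)
print(stat, p)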
7.3.2 Wilcoxon Signed-Rank

The Wilcoxon signed-rank test is a nonparametric statistical hypothesis test used when comparing two matched samples, or two repeated measurements on a single sample, to assess whether their population mean ranks differ. In other words, the signed-rank test is a paired difference test, which is used as the nonparametric alternative, or equivalent, to the paired Student t test for matched pairs (see Video 11).
Generally, the test renders W statistics: W+ and W− are the sums of the positive and negative ranks, respectively. If the two medians are statistically not different, then the sums of the ranks should also be nearly equal. If the difference between the sums of the ranks is too great, then the null hypothesis that the population means are statistically homogeneous must be rejected.
The original Wilcoxon signed-rank test may use a different, albeit equivalent, statistic. Denoted by Siegel as the T statistic, it is the smaller of the two sums of ranks of given sign. Low values of T are required for significance. T is generally easier to calculate than W. One important caveat is that when the difference between the pairs is zero, the observations are discarded. This is of particular concern if the samples are taken from a discrete distribution, although in that case the Pratt modification can be run to render the test more robust.
A second important aspect of this test pertains to the power analysis. To compute an effect size for the signed-rank test, one must use the rank correlation. If the test statistic W is reported, which is most often the case because the Wilcoxon signed-rank test generally relies on the W statistic—simply the sum of the signed ranks—then the rank correlation, r, is equal to the test statistic, W, divided by the total rank sum, S, or r = W/S. But if the test statistic T is reported, then the equivalent way to compute the rank correlation requires the difference in proportion between the two rank sums (see Appendix F for T critical values).
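A minimal sketch of this test in Python (assuming SciPy; the paired scores below reuse the sibling IQ data from the Chapter 6 practice problems, purely as an illustration) would be:

from scipy import stats

breastfed = [119, 96, 102, 111, 79, 88, 87, 99, 126, 106]
bottlefed = [115, 97, 105, 110, 83, 90, 84, 99, 121, 101]

# SciPy reports Siegel's T: the smaller of the two signed-rank sums.
# Zero differences (here the 99/99 pair) are discarded by default;
# zero_method="pratt" applies the Pratt modification instead.
T, p = stats.wilcoxon(breastfed, bottlefed)
print(T, p)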
7.3.3 Mann–Whitney U

The Mann–Whitney U test (aka the Wilcoxon two-sample test and the Mann–Whitney–Wilcoxon test) examines whether the sums of the rankings for two groups are different from an expected number. If the rank sum is different from that expectation, this means that one of the two groups has a tendency toward the lower-numbered ranks, while the other group has a tendency toward the higher-numbered ranks. The probability value produced is one-sided ("tailed"); use it only if you are interested in the question of whether one of the two samples tends to cluster in a certain direction (see Appendix G for U critical values).
The paired Wilcoxon test ranks the absolute values of the differences between the paired data in sample 1 and sample 2 and calculates a statistic on the number of negative and positive differences. The unpaired Wilcoxon test combines and
ranks the data from sample 1 and sample 2 and calculates a statistic on the difference between the sum of the ranks of sample 1 and sample 2. By contrast, the Mann–Whitney U test compares the relative overlap of the ranks in groups 1 and 2.
But the question remains as to what to do if we have more than two groups to compare and we have violated the assumptions for parametric statistics. We still shall use the ranks of the data, rather than the raw data (see Video 12).
The Mann–Whitney U test renders a U statistic that is computed as follows:

U = n1n2 + n1(n1 + 1)/2 − R1

where n1 and n2 correspond, respectively, to the sample sizes of groups 1 and 2, and R1 corresponds to the sum of the ranks for group 1.
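The following minimal sketch (assuming Python with SciPy) applies this formula to the two groups from Sect. 7.3.1. Note, as an aside, that SciPy's mannwhitneyu reports the complementary statistic U1 = R1 − n1(n1 + 1)/2, where U + U1 = n1n2:

from scipy import stats

group1 = [70, 62, 53, 54, 44]
group2 = [82, 66, 65, 62]

n1, n2 = len(group1), len(group2)
ranks = stats.rankdata(group1 + group2)  # mid-ranks for ties
R1 = ranks[:n1].sum()

# U computed from the formula above
U = n1 * n2 + n1 * (n1 + 1) / 2 - R1
print(U)

# SciPy's version of the statistic, with a two-sided p-value
U1, p = stats.mannwhitneyu(group1, group2, alternative="two-sided")
print(U1, p)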
7.4 Nonparametric Comparisons of More than Two Groups

Nonparametric ANOVA equivalents include the Kruskal–Wallis test for unifactorial designs and the Friedman test for multifactorial designs.

• Kruskal–Wallis one-way analysis of variance by ranks tests whether >2 independent samples are drawn from the same distribution.
• Friedman two-way analysis of variance by ranks tests whether k treatments in randomized block designs have identical effects.

Consideration is also given to the correction for violations of sphericity, viz., the assumption of homogeneity of variance, by means of the Geisser–Greenhouse correction.

7.4.1 Kruskal–Wallis for One-Way ANOVA

Nonparametric tests for comparisons of more than two groups utilize, as was the case for two-group comparison, the ranking of the data, rather than the raw data themselves. That is to say, when these assumptions are violated, then you must resort to a nonparametric form of analysis, which, as we saw earlier, must rest on the ranks of the data rather than on the means.

• The Kruskal–Wallis test on ranks provides you with such a tool when we are dealing with one-way designs.
• The Friedman test on ranks provides you with a nonparametric comparison approach in the case of a two-way design.

In either case, if significance is found, then Wilcoxon rank sum post hoc tests are done, with the Bonferroni correction, as above. It is very straightforward, once you have grasped the flow of things.
The Kruskal–Wallis test by ranks, Kruskal–Wallis H test (named after William Kruskal and W. Allen Wallis), or one-way ANOVA on ranks is a nonparametric method for testing whether samples originate from the same distribution. It is used for comparing two or more independent samples of equal or different sample sizes. It extends the Mann–Whitney U test when there are more than two groups. The parametric equivalent of the Kruskal–Wallis test is the one-way analysis of variance (ANOVA). The distribution of the Kruskal–Wallis test statistic approximates a χ² distribution with k − 1 degrees of freedom (see Appendix H for χ² critical values). A significant Kruskal–Wallis test indicates that at least one sample stochastically dominates one other sample. The test does not identify where this stochastic dominance occurs or for how many pairs of groups stochastic dominance obtains. Post hoc tests serve to make those distinctions (see Video 13).
Since it is a nonparametric method, the Kruskal–Wallis test does not assume a normal distribution of the residuals, unlike the analogous one-way analysis of variance. If the researcher can make the less stringent assumption of an identically shaped and scaled distribution for all groups, except for any difference in medians, then the null hypothesis is that the medians of all groups are equal, and the alternative hypothesis is that at least one population median of one group is different from the population median of at least one other group.
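As an illustration, the following minimal sketch (assuming Python with SciPy) runs the Kruskal–Wallis H test on four independent groups—here, the four home-remedy groups from the Chapter 6 practice problems, reused purely as example data:

from scipy import stats

oj    = [5, 7, 3, 3, 6, 5, 8, 7, 6, 5]
soup  = [8, 7, 4, 7, 5, 9, 9, 7, 6, 6]
tea   = [3, 2, 5, 5, 3, 3, 2, 1, 4, 4]
water = [1, 7, 2, 4, 1, 4, 4, 2, 2, 1]

# H is referred to a chi-square distribution with k - 1 = 3 df
H, p = stats.kruskal(oj, soup, tea, water)
print(H, p)

A significant result would then be followed by pairwise rank sum tests with a Bonferroni-corrected α, as described above.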
7.4.2 Friedman for Factorial ANOVA

The Friedman test is a nonparametric statistical test developed by Milton Friedman. Similar to the parametric repeated measures ANOVA, it is used to detect differences in treatments across multiple test attempts. The procedure involves ranking each row (or block) together and then considering the values of the ranks by columns. The distribution of the Friedman test statistic also follows a χ² distribution. A significant test is followed by post hoc comparisons and Bonferroni corrections (see Video 14).
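A minimal sketch of this test (assuming Python with SciPy; the repeated measures below are hypothetical) would be:

from scipy import stats

# Hypothetical scores for 5 subjects (blocks), each measured under
# 3 treatments; the test ranks within each row, then compares columns
t1 = [7.0, 9.9, 8.5, 5.1, 10.3]
t2 = [5.3, 5.7, 4.7, 3.5, 7.7]
t3 = [4.9, 7.6, 5.5, 2.8, 8.4]

stat, p = stats.friedmanchisquare(t1, t2, t3)
print(stat, p)  # the statistic follows a chi-square distribution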
7.4.3 Geisser–Greenhouse Correction for Heterogeneous Variances

ANOVA is a parametric test, which can only be performed if the assumptions for parametric statistics are satisfied. If these assumptions are not satisfied, then a nonparametric approach for the comparison of more than two groups should be used. The Kruskal–Wallis test on ranks provides such a tool in one-way designs. The Friedman test on ranks provides a nonparametric comparison approach in the case of factorial designs.
In the event that only the assumption of homogeneity of variance is violated, a statistically acceptable correction may be applied, which is known as the Geisser–Greenhouse correction. It alters the stringency of the interpretation of the data as indicated. In brief, the Geisser–Greenhouse correction, also called the Greenhouse–Geisser procedure, estimates a value, epsilon (the Greenhouse–Geisser estimate, ε̂), in order to correct the degrees of freedom of the F distribution as follows:

dferror = ε̂(k − 1)(n − 1)
dftime = ε̂(k − 1)

These are the corrections to the degrees of freedom of the F distribution. In brief, the Geisser–Greenhouse correction simply increases the p-value to compensate for the fact that the test loses stringency when sphericity is violated. Sphericity is indeed an important assumption of a repeated measures ANOVA, which refers to the condition where the variances of the differences between all possible pairs of within-subject conditions (i.e., levels of the independent variable) are homogeneous—that is to say, not different, statistically speaking. Sphericity—the homogeneity of variances assumption—is violated when it is not the case that the variances of the differences between all combinations of the conditions are statistically similar, as determined by the F test (ratio of the two variances).
When sphericity is violated, the F test is significant, and the risk of a Type I error is greater, simply because the ANOVA F value may be inflated. When that is the case, a simple correction of the degrees of freedom of the ANOVA model will, to a large extent, correct for this inflation and thus diminish the risk of Type I error noted above. That correction of the degrees of freedom is the Geisser–Greenhouse correction.

7.5 Categorical Data Analysis

Several tests exist for the nonparametric analysis of categorical data, among which some are listed below.

• Cochran's Q: tests whether k treatments in randomized block designs with 0/1 outcomes have identical effects
• McNemar's test: tests whether, in 2 × 2 contingency tables with a dichotomous trait and matched pairs of subjects, row and column marginal frequencies are equal
• Kaplan–Meier: estimates the survival function from lifetime data, modeling censoring
• Chi-square and special cases (small designs, condition matching)
• Time series (survival, Cox)
• Association and prediction—logistic regression
• Nonparametric (or distribution-free) inferential statistical methods used to analyze similarities, or association, include but are not limited to:
– Anderson–Darling test: tests whether a sample is drawn from a given distribution
– Statistical bootstrap method: estimates the accuracy/sampling distribution of a statistic
– Cohen's kappa: measures inter-rater agreement for categorical items
– Kendall's tau: measures statistical dependence between two variables
– Kendall's W: a measure between 0 and 1 of inter-rater agreement
– Kolmogorov–Smirnov test: tests whether a sample is drawn from a given distribution, or whether two samples are drawn from the same distribution
– Kuiper's test: tests whether a sample is drawn from a given distribution, sensitive to cyclic variations such as day of the week
– Logrank test: compares survival distributions of two right-skewed, censored samples
– Pitman's permutation test: a statistical significance test that yields exact p-values by examining all possible rearrangements of labels
– Rank products: detect differentially expressed genes in replicated microarray experiments
– Spearman's rank correlation coefficient: measures statistical dependence between two variables using a monotonic function
– Wald–Wolfowitz runs test: tests whether the elements of a sequence are mutually independent/random

Detailed consideration is given to the principal ones below.

7.5.1 The Chi-Square (χ²) Tests, Including Small and Matched Designs

In order to conduct an analysis of frequencies, the data are organized by constructing a frequency table. The frequency table lists the observations contingent upon the nominal variables used. Frequency tables should only list one observation (one count) per individual; but complex studies (i.e., often badly designed studies) often list more complex and misleading frequency tables, the discussion and analysis of which are beyond the scope of our present examination.
Chi-square (note: the χ² test, whose outcome is checked on the appropriate table of the χ² distribution) is the appropriate test for comparing and for testing associations of frequencies and proportions. This test can be used equally well for two or for more than two groups. That is to say that, while χ² can answer such questions as "is there a difference in the frequencies among the groups" (test of equality of proportions among groups), it can also test whether or not there is an association among the groups (test of association among groups).
Since χ² is a relatively easy test to compute and to interpret, it is often abused. There are a few special cases which deserve discussion, because failure to rectify the test in certain situations makes a Type I error more likely. Appropriate use of χ² includes a preliminary characterization of the sample used in a study, or the analysis of such designs as diagnostic tests, where the outcomes refer to counts of patients who are true positives, true negatives, false positives, or false negatives.
The χ² test computes the extent of deviation of the observed ("O") cases from frequencies attributable to chance (expected frequencies, "E"). In brief, the χ² test is a computation that is based on the frequency table of the observed cases (O) and the extent of deviation of the observed cases from the expected frequencies (E) contingent upon (i.e., dictated by) the nominal variables used. The frequency table so constructed is referred to as a contingency table.
For example, if we are counting men and women who are either old or young, we can tally each individual we count in one of four cells: men-young, men-old, women-young, and women-old. The totals of our tallies in each cell represent the observed frequencies, and the cells themselves represent the levels of the nominal variables our analysis is contingent upon (i.e., the "categories").
χ² = Σ(O − E)² / E

The test achieves that by adding (hence the symbol Σ) the ratios of each of the differences between observed and expected frequencies, squared and then divided by the expected frequencies.
Each difference (O − E) is squared because otherwise the simple sum of these differences would add up to 0. Also note that this test tells us nothing about the spread (dispersion) of the frequencies within each category. However, it is a fact that, as long as the E values are at least >5, they turn out to be (quasi-)normally distributed, with a variance equal to the frequency itself. Therefore, the variance in each cell could be rendered as the expected frequency, E. That means that:

• It is fair game to divide the squared difference between the O and E values by E—that is, (O − E)²/E—to have an estimate of the spread of the observed value from the expected value in each cell.
• We must do some "fix-up" of the test when E is small (usually E < 5).

The reasoning is the same in the χ² test, except that we are not dealing with continuous assessments of x, but rather with counts within groups, or categories. Therefore, the degrees of freedom of a "one-way" χ² test is simply the number of categories minus one (rather than n − 1). In a factorial design, the degrees of freedom are obtained as the product of the number of categories along each dimension (p) of the design, minus one:

df = (a − 1)(b − 1)(c − 1)…(p − 1)

It is also important to note that the χ² distribution is a distribution of squared values, whose mean equals its degrees of freedom (and whose variance is twice the degrees of freedom). Thus, we only need to know the degrees of freedom to characterize the χ² distribution. The definition of this distribution therefore is quite simple: for any quantity that has a standard normal distribution, its square has a χ² distribution.
It should be evident that this distribution can only have positive values. That is to say, the χ² distribution is positively skewed: as the design increases, and therefore the degrees of freedom increase, the distribution increasingly tends to become normal.
The positive skew of the χ² distribution also implies that inferences can only be, and are always, one-tail. Practice using the χ² distribution table, and find, for example, the critical value of χ², at α = 0.05, for df = 1; for df = 5; for df = 7; etc.
Practically speaking, when doing a χ² test, we input the data in the software, and it computes a value, which we then need to interpret. Let us briefly examine what the computation entails.
First, we construct a contingency table by listing the observed values ("O"), as well as the expected values ("E"). We derive the expected values from "common sense," from prior observations, or from computation. For computing E values, we obtain the marginal sums for each column and row, as well as the total sum. The E value for the first column in the first row is the product of the first column total times the first row total, divided by the total sum.
For example:

Observed values:

            A     B     A + B sum
            60    50    110
            40    150   190
Total sum   100   200   300

Expected values:

            A                        B
            100 × 110/300 = 36.7     200 × 110/300 = 73.3
            100 × 190/300 = 63.3     200 × 190/300 = 126.7
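Before completing the computation, here is a minimal sketch (assuming Python with NumPy) that automates the marginal-product rule for the expected frequencies just described:

import numpy as np

observed = np.array([[60, 50],
                     [40, 150]])

row_totals = observed.sum(axis=1, keepdims=True)  # 110 and 190
col_totals = observed.sum(axis=0, keepdims=True)  # 100 and 200
grand_total = observed.sum()                      # 300

# E for each cell = (column total x row total) / grand total
expected = row_totals * col_totals / grand_total
print(expected)  # approximately [[36.7, 73.3], [63.3, 126.7]]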
Now, we obtain each individual spread of the O's (observed values) from the respective E's (expected values) by subtracting O − E. We square them, lest the sum add up to 0, and divide each by the respective E. We add up these individual ratios to obtain the overall spread from O to E in the overall design:

χ² = Σ(O − E)² / E

The final step is to determine whether the observed χ² value (χ²obs) is larger than the critical value given in the table for the corresponding degrees of freedom (χ²crit). If χ²obs > χ²crit, then the test is significant, and your statistical software would compute a p-value (the probability of your finding that outcome by chance alone) that would be smaller than the α level set (often, by convention, 5%).
There is a shortcut to this computation that can be used in instances of a 2×2 design (table), such as in diagnostic tests. The shortcut is as follows (using the numbers above):

χ²obs = 300(60 × 150 − 50 × 40)² / [(60 + 40)(50 + 150)(60 + 50)(40 + 150)]
      = 1.47 × 10¹⁰ / 4.18 × 10⁸ ≈ 35.17

As stated above, χ² values are always greater than or equal to 0, and the test is always one-tail. The greater the value of χ²obs, the greater the deviation of the observed values from the values expected based on chance alone, and the greater the probability that this deviation is statistically significant (Appendix H). That is to say, and as noted above, χ² is a test of association and of comparison between observed and expected values. Whereas χ² is most often used as a test of association, relationship, or dependency, it also serves to test the equality of proportions among groups (see Video 15).
Despite the fact that the χ² test can answer such questions as whether there is a difference in the frequencies among the groups (test of equality of proportions), and whether there is an association among the groups (test of association), it is, nevertheless, a test with limited statistical stringency. The weak nature of the χ² test lies inherently in the fact that it relies not on measurements performed on the subjects, which could then be used to extrapolate the behavior and characteristics of the population, but rather on the actual quantity, or number, of subjects.
Therefore, the χ² test:

• Must not be overused, just because it is simple to perform.
• Assumes no ordering among the categories under study; in the case of ordinal data (e.g., stage of disease), that information will be lost in the process of analysis.
• Becomes inaccurate when the frequencies in any one cell are small (<5). The Yates' correction for continuity of χ² must be done in that instance.

The Yates' correction for continuity must be applied when dealing with small designs, when E is anticipated (or computed) to be <5 (generally, most statisticians agree with this "threshold"). This correction involves subtracting 0.5 from the difference between the O and E frequencies before squaring in the regular formula. In the shortcut formula, the correction entails subtracting one half from each of the O − E differences in the numerator before squaring. The correction decreases the difference between each pair of observed and expected frequencies by 0.5. If each observed frequency is so close to the expected frequency that the correction reverses the algebraic sign of the difference, then the agreement is as good as possible, and the null hypothesis is accepted.
That is to say, the Yates' correction makes the final computed value of χ² smaller, which protects from a Type I error, rejecting H0 when it is true. By the same token, the Yates' correction increases the risk of a Type II error, not rejecting the null hypothesis when it is false.
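The whole procedure—expected frequencies, the χ² statistic with or without the Yates correction, and the Fisher's exact alternative discussed next—can be reproduced with the following minimal sketch (assuming Python with SciPy):

import numpy as np
from scipy import stats

observed = np.array([[60, 50],
                     [40, 150]])

# Without the continuity correction: reproduces the shortcut (~35.17)
chi2, p, df, expected = stats.chi2_contingency(observed, correction=False)
print(chi2, p, df)
print(expected)  # the E values computed from the marginals

# With the Yates continuity correction (SciPy's default for 2 x 2)
chi2_yates, p_yates, _, _ = stats.chi2_contingency(observed, correction=True)
print(chi2_yates, p_yates)

# Fisher's exact test, preferred for very small 2 x 2 designs
odds_ratio, p_exact = stats.fisher_exact(observed)
print(p_exact)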
As was noted above, when we are dealing with a small design (generally—again, it is a convention more than a rule—when we deal with a 2×2 design), it is feared that the χ² computation and inference will be distorted. It is imperative, therefore, that an alternate procedure, the Fisher's exact test, be used. The Fisher's test is complex to compute by hand. Suffice it to say that, for the Fisher's exact test, the exact probability of O is based upon the size of the marginal frequencies.
One way to attempt to circumvent the need for use of these corrections is collapsing the levels of contiguous nominal variables. Collapsing, however, must be done judiciously, must involve adjacent cells, and must make sense.
Special cases of the χ² analysis refer to the type of design that is being analyzed.

• When the data are matched categorical data, then the more appropriate variation of the χ² test is the McNemar χ². This test takes into account the fractionation of the random error term that is obtained through the blocking effect.
• When matching cannot be achieved, but still some random error can be fractionated by the process of stratification, then the Mantel–Haenszel χ² test should be applied.
• If the categorical data being analyzed refer to a pre–post design, where baseline values were obtained, the data can be looked at as a difference, D, from baseline, and the Cochran Q χ² test is most appropriate.

It is possible that χ² analyses may turn out not significant and that the investigator may want to calculate how many more subjects should be included in the study to attain significance. Sample size computations on frequencies are based on z values for given α and β, as well as on the frequencies observed in the pilot study. The appropriate formula generates a number that must be squared and provides the number of subjects needed per group, but not the expected frequencies, E, per group.

7.5.2 Time Series Analysis with χ²: Kaplan–Meier Survival and Cox Test

From a set of observed survival times, the proportion of the sample who would survive a given length of time is usually reported as a "survival Kaplan–Meier curve." This estimate requires considering time in many small intervals and is best analyzed by the logrank nonparametric test, which tests the null hypothesis that the groups being compared are samples of the same population with regard to survival. A significant outcome indicates that the groups actually do not come from the same population.
When one is interested in the relative survival analysis in two distinct groups (e.g., HIV+ >250 CD4 vs. HIV+ <250 CD4), then a hazard ratio is usually reported as

R = (O1/E1) / (O2/E2)

This ratio is computed over the entire period of study and may not be used to detect identical vs. different kinetics of survival between the groups.
Whereas the logrank test serves to compare the survival experience of two or more groups, it cannot be used to explore the effects of several variables on survival. For that purpose, the Cox proportional hazard regression analysis (vide infra) should be used. This analysis is "semi-parametric" because, whereas no particular type of distribution is assumed for the survival times, it is assumed that the effects of the various variables under study on survival are constant over time and additive on a given scale. The outcome of an (exquisitely complex) statistical analysis by this method produces regression coefficients, whose sign is indicative of the prognosis of survival (+ means higher hazard, thus worse prognosis; − means hazard is lower, thus greater expectation of survival). The actual value of the regression coefficient is
interpreted via its transformation to an exponent (e.g., b = 0.520; e^b = 1.68, meaning in this particular case that this variable increased the hazard by 68%, a hazard ratio of 1.68).
By contrast, the approach of survival analysis entails the following: from a set of observed survival times, we can estimate the proportion of the sample who would survive a given length of time, thus generating a "life table" and a "survival Kaplan–Meier curve." This estimate requires considering time in many small intervals. For example, the probability of surviving 2 days is the product of the probability of surviving the first day times the probability of surviving the second day, which itself is called the "conditional" probability, as it is conditional upon the probability of surviving the first day. So, for, say, 100 days, the total probability becomes P1 × P2 × … × P100. In actual terms, P100 is calculated as the proportion of the sample surviving at day 100. On days when nobody dies, the probability of surviving is 1, and it is therefore only necessary to calculate these probabilities on days that somebody dies. The data are plotted as a "step function," where it is incorrect to join the proportions surviving by sloping lines.
The data are best analyzed by the logrank test, a nonparametric test that tests the null hypothesis that the groups being compared are samples of the same population with regard to survival. It acts a bit like a χ² test in that it compares observed with expected numbers of events: indeed, it uses the χ² distribution with k groups − 1 as degrees of freedom (Appendix H). A significant outcome would suggest that the groups do not come from the same population. When groups are stratified (e.g., by age range), then a logrank test could be used to determine whether there were significant differences between the stratified groups.
Whereas the logrank test serves to compare the survival experience of two or more groups, it cannot be used to explore the effects of several variables on survival. For that purpose, the Cox proportional hazard regression analysis should be used.
Let us recall that time series and survival analyses allow us to look at data where we make many repeated measurements on the same individual over time. Thus, they are also called "repeated measures" analyses. One advantage to these analyses is the fact that, by producing "blocks," represented by the same individual within whom measurements are obtained, cross-individual differences are eliminated and the design is made stronger; but, because each value will be correlated with the preceding and the following measurements on the same individual, data points are not to be considered fully "independent," and problems arise.
In the instance of a few measurements (e.g., pre/post), the more appropriate term (and analysis) is "repeated measures." The statistical approach of choice is a within-group ANOVA design, as was noted above. Analyses can often be simplified by analyzing, in fact, the post-/pre-difference.
Thus, these regression coefficients have common usage in the derivation of a prognostic index for each individual variable, as well as overall. In this analysis, the outcome measure is the cumulative hazard of dying at time t, h(t):

h(t) = h0(t) × e^(b1X1 + … + bnXn)
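For readers who want to reproduce such analyses, here is a minimal sketch in Python, assuming the widely used third-party lifelines package (and pandas for the data frame); all data below are hypothetical:

import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

# Hypothetical survival data: time in days, event (1 = died,
# 0 = censored), and a binary group indicator
df = pd.DataFrame({
    "time":  [5, 8, 12, 20, 33, 40, 41, 52, 60, 71],
    "event": [1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    "group": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Kaplan-Meier step-function estimate of the survival curve
kmf = KaplanMeierFitter()
kmf.fit(df["time"], event_observed=df["event"])
print(kmf.survival_function_)

# Logrank test comparing the survival experience of the two groups
a, b = df[df["group"] == 0], df[df["group"] == 1]
result = logrank_test(a["time"], b["time"],
                      event_observed_A=a["event"],
                      event_observed_B=b["event"])
print(result.p_value)

# Cox proportional hazards: exp(coef) is the hazard ratio per variable
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()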
7.5.3 Association and Prediction: Logistic Regression

It is not uncommon that two variables have some degree of relationship, or association, in a given data set. The correlation coefficient, r, is a measure of the relationship between variable X and variable Y. Actually, r gives an indication of the direction of the relationship (positive or negative) and of the strength of the relationship (from −1 to +1, 0 being no correlation whatsoever). But never can r imply a cause–effect relationship.
The correlation coefficient, r, or the Pearson coefficient, is the measure of relationship between two independent continuous variables. The value indicates the degree of covariance, that is to say, of shared variability. The square of r, the coefficient of determination, provides an indication of how tight the relationship is (see Chap. 6).
The relationship between ordinal variables is given by the Spearman rank, or Spearman rho, correlation coefficient. The Spearman rho (ρ) is utilized, as is the Kendall's tau (τ), to compute the
association between the ranks (nonparametric), rather than the actual raw data, of two distributions.
Factor analysis is the set of statistical methods used to determine, for example, which items on a scale clump (or cluster, hence the term for the related statistical approach of "cluster analysis") together into some sort of a significant (or, at best, highly related) factor or construct. Factor analysis is the means by which the investigator can group items, data, or trends of results (e.g., gels showing this or that particular band) into a coherent group that shares fundamental similarities. A factor analysis is based on the notion that each measurement is associated with a certain degree of variability. When the variability about two sets of measurements overlaps to a significant extent, then it is fair to assume that both measurements are essentially the same, or at least measure the same "thing," the same factor. By contrast, if two sets of measurements do not overlap at all, then it seems fair to state that they are totally and absolutely unrelated—hence the fundamental principle behind factor analysis (and cluster analysis). An exploratory factor analysis will establish, by calculating the correlation coefficients across the data for the expression of each gene, whether or not some of the genes group together in some meaningful way. If the investigator has a good idea of what the principal factors are within each family of genes, and how they will be ordered (a priori model), the data and the analysis will be organized accordingly. The factor analysis is said to be a confirmatory statistical analysis in this instance.
That is to say, the linear multiple regression test rests on the verification of the assumptions of independence, normality, and homogeneity of variance. In addition, something analogous to the assumption of homogeneity of variances must be verified, which refers to the homogeneity of the variation of the Y's across the range of the tested X's: that assumption is called homoscedasticity.
When even one of these assumptions is violated, or when the outcome variable Y is not a continuous variable (e.g., disease present: yes, no), then log-transforming corrections of the outcome variable, Y, must be actualized. Thus, we might have the following equation, for example:

Diseased state = β0 + age + smoking + alcohol + treatment + error

We then must "translate" the dependent variable, Y, into a continuous variable look-alike, and we do so by means of the logistic function, log(p / (1 − p)), hence the term logistic regression. The equation now becomes

log(disease / (1 − disease)) = β0 + age + smoking + alcohol + treatment + error
Multiple linear regression is a parametric test, which requires satisfying four assumptions—normality, independence, homogeneity of variance, and homoscedasticity—lest a logistic regression be necessary. The statistical quality of the regression can be verified by multiple means: for example, ANOVA can test its significance, CIs can examine the standardized regression coefficients (the β weights), and R and R² can establish the linearity of the relationship. The homoscedasticity assumption refers to the fact that the variance around the regression line is the same for all values of the predictor variable (X).
If the assumptions noted above hold, then the residuals should be normally distributed with a mean of 0, and a plot of the residuals against each X should be evenly scattered. Statistical software packages often will actually produce these graphs with the initial regression command, followed by a plot command. Abnormal plots of the residuals will occur consequentially to the assumptions not being met. Therefore, while you rarely read about this stage of analysis in papers, it is always a good idea to check the plot of the residuals before going any further in a regression analysis. Abnormal plots of residuals could show, for example, that:

(a) The variability of the residuals could increase as the values of X increase.
(b) There is a curved relationship between the residuals and the X values, indicating a nonlinear relation.
Logistic regression is a statistical regression model that uses the logit of a number p between 0 and 1 for prediction models of binary dependent variables (see Video 16).
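A minimal sketch of such a model (assuming Python with NumPy, pandas, and statsmodels; the patient data below are simulated, and the variable names simply mirror the example above):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated patient data with a binary outcome
rng = np.random.default_rng(7)
n = 200
df = pd.DataFrame({
    "age": rng.uniform(30, 75, n),
    "smoking": rng.integers(0, 2, n),
    "alcohol": rng.integers(0, 2, n),
    "treatment": rng.integers(0, 2, n),
})
log_odds = (-6 + 0.08 * df["age"] + 1.0 * df["smoking"]
            + 0.5 * df["alcohol"] - 0.8 * df["treatment"])
df["disease"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Models log(p / (1 - p)) as a linear function of the predictors
model = smf.logit("disease ~ age + smoking + alcohol + treatment",
                  data=df).fit()
print(model.params)          # coefficients on the log-odds scale
print(np.exp(model.params))  # odds ratios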
In conclusion to this chapter, the objective of all statistical analysis is to reveal underlying systematic variations in a set of data, either as a result of some experimental manipulation or from the effect of other observed measured variables. The basis of all statistical tests is an assessment of the probability that given observations and occurrences happen by chance, or not. The probability, p, is computed by the statistical analytical tests on the basis of the data and is compared to a set value, the α level, set by the investigator, which establishes the point beyond which outcomes cannot be attributed to chance.
Most research aims either at comparing two or more groups or at predicting the outcome variable based on the independent and control variables. The research question sets this up at the onset of the research process. Certain research designs favor a comparison approach (e.g., cross-sectional), and others lead to a prediction-type analysis (e.g., cohort studies). Experimental studies (and clinical trials) can go either way (or both ways), depending on how the research question is stated. Systematic evaluation of the statistical analysis (SESTA), in any event, plays a central and, one could say, perichoretic role in evidence-based research and in evidence-based clinical decision-making.

7.6 Self-Study: Practice Problems

1. What conditions call for the usage of nonparametric statistics?
2. True or False: The Bayesian approach to statistics facilitates the usage of both parametric and nonparametric tests, unlike the frequentist approach.
3. Match the following nonparametric tests with their analogous parametric tests.
Kruskal–Wallis H
Wilcoxon rank sum
Wilcoxon signed-rank
Spearman rho
Mann–Whitney U
Friedman

Dependent sample t
One-way ANOVA
Two independent sample t
One-sample t
Pearson r
Two-way ANOVA
4. Is there such a thing as a nonparametric inference? If so, how does it differ from a parametric inference?
5. An immunologist is interested in comparing the effects of peanut butter exposure to dust particle exposure on the number of specific T cells in two groups of postmortem patients that died from a fatal asthma attack. Below are the T cell levels for each group. Use the Wilcoxon rank sum method to rank the distribution of T cells among the groups, along with the average and standard deviation of the ranks.
11.02, 5.98, 101.26, 18.09, 8.01, 45.93, 6.77
0.07, 32.33, 95.12, 11.02, 2.44, 300.65, 750.81
6. Assuming the two groups from question #5 (above) are independent, run the appropriate nonparametric test to determine whether the sums of the rankings for the two groups are different.
7. What is sphericity in the context of nonparametric testing?
8. Which type of error is a chi-square test vulnerable to if used improperly? Why might correcting this be considered a double-edged sword?
9. Randomly selected patients at a hospital were asked whether they prefer to receive their diagnosis from a physician over a nurse. The patients were divided by age group, and the results were as follows:
Age group   Favor   Oppose   Total
Child         191      303     494
Adult         405      222     627
Total         596      525    1121

Using the χ² test, determine whether there is a difference between the age groups in their attitude toward receiving the diagnosis from a physician over a nurse.

10. What type of study design would call for a logistic regression as one of its statistical tests? Why?
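For readers who wish to verify their hand computations for problems 5, 6, and 9, the following is a minimal sketch using Python's scipy package; the function calls are standard scipy.stats routines, but the group labels are an assumption made only for illustration.

from scipy.stats import mannwhitneyu, chi2_contingency

# Problems 5-6: rank-based comparison of the two independent T cell samples
group1 = [11.02, 5.98, 101.26, 18.09, 8.01, 45.93, 6.77]
group2 = [0.07, 32.33, 95.12, 11.02, 2.44, 300.65, 750.81]
u_stat, p_value = mannwhitneyu(group1, group2, alternative="two-sided")
print(u_stat, p_value)

# Problem 9: chi-square test of independence on the 2 x 2 contingency table
table = [[191, 303],   # child: favor, oppose
         [405, 222]]   # adult: favor, oppose
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)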
Recommended Reading

Bagdonavicius V, Kruopis J, Nikulin MS. Non-parametric tests for complete data. London/Hoboken: ISTE/Wiley; 2011.
Bonferroni CE. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del Real Istituto Superiore di Scienze Economiche e Commerciali di Firenze. 1936;8:3–62.
Chiappelli F. Fundamentals of evidence-based healthcare and translational science. Heidelberg: Springer-Verlag; 2014.
Conover WJ. Practical nonparametric statistics. New York: Wiley; 1960.
Corder GW, Foreman DI. Nonparametric statistics: a step-by-step approach. New York: Wiley; 2014.
Dunn OJ. Multiple comparisons using rank sums. Technometrics. 1964;6:241–52.
Fisher RA. Statistical methods for research workers. Edinburgh: Oliver and Boyd; 1925.
Friedman M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc. 1937;32:675–701.
Friedman M. A correction: the use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc. 1939;34:109.
Friedman M. A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat. 1940;11:86–92.
Hollander M, Wolfe DA, Chicken E. Nonparametric statistical methods. New York: Wiley; 2014.
Hosmer DW, Lemeshow S. Applied logistic regression. 2nd ed. New York: Wiley; 2000.
Kruskal WH, Wallis WA. Use of ranks in one-criterion variance analysis. J Am Stat Assoc. 1952a;47:583–621.
Kruskal WH, Wallis WA. Errata to: Use of ranks in one-criterion variance analysis. J Am Stat Assoc. 1952b;48:907–11.
Mauchly JW. Significance test for sphericity of a normal n-variate distribution. Ann Math Stat. 1940;11:204–9.
Pratt J. Remarks on zeros and ties in the Wilcoxon signed rank procedures. J Am Stat Assoc. 1959;54:655–67.
Wasserman L. All of nonparametric statistics. New York: Springer; 2007.
Wilcoxon F. Individual comparisons by ranking methods. Biom Bull. 1945;1:80–3.
Part II
Biostatistics for Translational Effectiveness

8  Individual Patient Data
Contents
8.1  Core Concepts
8.2  Conceptual, Historical, and Philosophical Background
8.2.1  Aggregate Data vs. Individual Patient Data
8.2.2  Stakeholders
8.2.3  Stakeholder Mapping
8.3  Patient-Centered Outcomes
8.3.1  Primary Provider Theory
8.3.2  Individual Patient Outcomes Research
8.3.3  Individual Patient Reviews
8.4  Patient-Centered Inferences
8.4.1  Individual Patient Data Analysis
8.4.2  Individual Patient Data Meta-Analysis
8.4.3  Individual Patient Data Evaluation
8.5  Implications and Relevance for Sustained Evolution of Translational Research
8.5.1  The Logic Model
8.5.2  Repeated Measure Models
8.5.3  Comparative Individual Patient Effectiveness Research (CIPER)
8.6  Self-Study: Practice Problems
8.1 Core Concepts

Nicole Balenton

(References specific to this chapter are listed at the end; for general references, public domains, and reports, please refer to the general reference list at the end of this book.)

At the core of patient-centered care is patient satisfaction in clinical outcome. To provide the best available, evidence-based, patient-centered care, new research must integrate and utilize quality and novel assessment tools. Considered the gold standard of systematic review, individual patient data (IPD) analysis and meta-analysis is a collaborative effort that combines multiple studies to address a particular research question. Research studies reporting individual patient data are timely and critical to collect unique and specific data for each individual patient directly.
The IPD approach is a lengthy and costly process where some patients may not provide the data to be included in an IPD analysis, and thus
selection bias may arise. Regardless, IPD supports the active involvement of investigators, improved data quality, and a more powerful analysis. Its role in evidence-based healthcare is critical because of its time-to-event analysis in evaluating prognostic studies. The principal models of evaluation are discussed, along with how they affect the field of translational effectiveness, better known as Comparative Individual Patient Effectiveness Research (CIPER).
This chapter focuses on the transition of patient-centered outcomes research (PCOR) to patient-centered outcomes evaluation (PCOE). As mentioned by observations made in Chiappelli (2014) and discussed further in this chapter, there is a need for an established protocol for IPD meta-analysis to be validated and widely recognized. Altogether, we learn how the fundamental elements that drive the comparative effectiveness research (CER) paradigm are integrated within the construct of IPD analysis and inferences.

8.2 Conceptual, Historical, and Philosophical Background

8.2.1 Aggregate Data vs. Individual Patient Data

It is a fair assumption that patients enroll in randomized controlled trials because they fulfill inclusion criteria, which are based on strictly defined diagnostic criteria of the disease under study. However, the majority of the patients have symptoms that do not fit exactly in the diagnostic criteria formulated by the researchers. Randomized clinical trials are performed on homogeneous patient groups that are artificially constructed by inclusion and exclusion criteria, which can include:

• Disease severity or comorbidity
• Nature of healthcare facilities
• Intervention given
• Clinical endpoint or outcome (death, disease, disability)
• Expected treatment benefit

Groups of patients may seem homogeneous, but in actuality they can vary largely on individual characteristics. Practice guidelines and recommendations, which often are created from research conducted with specific patient groups, are de facto based on aggregate patient data and may not pertain with the same efficacy and effectiveness to the individual patient, because of their individual physiological uniqueness, individual needs and preferences, individual pathological specificities, and individual psycho-emotional status. As an alternative to the traditional aggregate data meta-analytical protocols, current trends concertedly lead toward the development and characterization of individual patient data meta-analysis. To be clear, we must emphasize that:

• Individual patient data meta-analysis can involve the central collection, validation, and reanalysis of data from clinical trials worldwide that pertain to a common research question, with data obtained from those responsible for the original trials.
• The statistical implementation of an individual patient data meta-analysis should preserve the clustering of patients within studies.
• It is a misconception to assume that one can simply analyze individual participant data as if they all came from a single study. Clusters must be established, preserved, and retained throughout the analysis in either the two-step or the one-step approach briefly outlined below for the random model inference, as recommended by Simmonds and collaborators (2005).

In a typical two-step approach, individual patient data are first analyzed in each separate study independently by using a statistical method appropriate for the type of data being analyzed. This may generate a typical aggregate data analysis within each study. In a second step, these data are combined and synthesized in a suitable random model for meta-analysis of aggregate data.
By contrast, in the one-step approach, individual patient data across all studies are integrated
simultaneously into a generalized model that simply accounts for the clustering of participants within studies.
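The two-step logic just described can be made concrete with a small numerical sketch. The code below is not from the original text: the data, the mean-difference effect measure, and the simple inverse-variance pooling are assumptions chosen for illustration (in practice, a suitable random-effects model would be used in the second step).

import numpy as np

# Hypothetical individual patient data, kept clustered by study
studies = {
    "study_A": {"treated": [5.1, 4.8, 5.6, 5.0], "control": [4.2, 4.0, 4.5, 4.1]},
    "study_B": {"treated": [6.0, 5.5, 5.9],      "control": [5.1, 4.9, 5.3]},
}

effects, weights = [], []
for name, data in studies.items():
    t = np.array(data["treated"])
    c = np.array(data["control"])
    diff = t.mean() - c.mean()                             # step 1: per-study effect
    var = t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c)  # variance of that effect
    effects.append(diff)
    weights.append(1.0 / var)                              # inverse-variance weight

pooled = np.average(effects, weights=weights)              # step 2: pooled estimate
pooled_se = (1.0 / np.sum(weights)) ** 0.5
print(f"pooled effect = {pooled:.3f}, SE = {pooled_se:.3f}")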
As we discuss in greater depth below in this chapter, to conduct individual patient data meta-analysis has distinct advantages, but inherent difficulties as well, including access to data from unpublished trials, inconsistent or incompatible data across trials, inadequate or limited information provided in the published reports, longer follow-up time, more participants, more complex outcomes, and overall lower cost-effectiveness than aggregate data meta-analysis.
In summary, and as stressed in Chiappelli (2014, 2016), individual patient data analysis and meta-analysis may be more reliable, because they more directly target the individual patient (i.e., patient-centered outcomes research) than aggregate data analyses and meta-analyses. But undoubtedly, individual patient data are more complex, expensive, and arduous to interpret than aggregate data meta-analysis. Currently the PRISMA statement is the standard for investigators when reporting their aggregate data meta-analysis findings. PRISMA also provides a benchmark by which aggregate data meta-analyses may be appraised. It does little or nothing, however, in its present version, to provide useful guidance for the critical evaluation of individual patient data.

8.2.2 Stakeholders

As we discussed elsewhere (Chiappelli 2014), the term "stakeholder" was originally meant to define "those groups without whose support the organization would cease to exist." The concept and the role of stakeholders have evolved and gained wide acceptance in the context of healthcare and biostatistics. Today, the term specifically refers to the group of individuals and constituencies that contribute, either voluntarily or involuntarily, to the patient's recovery, well-being, and, more generally, quality of life.
Stakeholders are the constituencies of individuals who have interests in, and receive concrete benefits from, assisting the patient. They form the structure of the socio-environmental reality of the patient. Therefore, any consideration of individual patient assessments and analyses must take into account stakeholders' attitudes, opinions, knowledge gaps, and interests.
Stakeholder engagement improves the relevance of research, increases its transparency, and accelerates its adoption into practice. Stakeholder-engaged research is overwhelmingly useful to comparative effectiveness research (CER) and patient-centered outcomes research (PCOR). There are several advantageous key points of running stakeholder-centered endeavors in evidence-based healthcare, including:

• To shape the entity's projects at an early stage, to improve the quality of the project and ensure its success
• To help win more resources and ensure funding support of the project to its successful completion
• To ensure that all participants fully understand what the process and potential benefits of the project are
• To anticipate what people's reaction to the entity may be and build into the plan the actions that will win people's support

In brief, the purpose of stakeholder-engaged research is to widen the participation in shared governance and the utilization of the extracted data and of the best available evidence among all clinicians, patients, and insurers. This process contributes to aligning the interests among the groups of stakeholders in the context of patient-centered care. The engagement on the part of stakeholders is critical to the success of the contemporary healthcare model.
In patient-centered care, not all stakeholders are equal, perform the same roles, or have the same degree of involvement. Different stakeholders contribute to different extents, and, as recommended by the Accountability Stakeholder Engagement Technical Committee (2008), research focus must be deployed to develop and validate novel tools to establish the nature, level (or quantity), and quality of stakeholder engagement.
Certain lines of investigation have already been drawn and include:

• To establish the necessary commitment to stakeholder engagement
• To ensure that stakeholders' involvement is fully integrated in strategy and operations
• To define the purpose, scope, and stakeholders of the engagement
• To characterize and define what a quality stakeholder engagement process looks like

8.2.3 Stakeholder Mapping

A well-constructed stakeholder analysis includes a "stakeholder map," which is derived from the identification of the needed stakeholders, in terms of the stakeholders' perceived and real power, influence, hierarchies of values, and interest, in a manner similar to Fletcher and collaborators' Key Performance Areas (2003). The stakeholder analysis proceeds along four principal steps:

1. Identify who the stakeholders are or should be.
2. Prioritize, map, and classify the stakeholders on the basis of interest, relative influence, and likelihood of involvement.
3. Understand the needs, wants, priorities, and opinions of the stakeholders.
4. Educate the stakeholders to keep them informed about, in touch with, and advocating in favor of the project as it evolves.

By this systematic and validated approach, and as we noted previously (Chiappelli 2014), fundamental principles of stakeholders can be identified, such as, but not limited to:

• The interests of all stakeholders, who may affect or be affected by the project
• Potential issues that could disrupt the project
• Key people for information distribution during the executing phase
• Relevant groups that should participate in different stages of the project
• Communication planning and stakeholder management strategies
• Approaches to reduce potential negative impacts and manage negative stakeholders

In a related context, the 6Ps framework of stakeholders identifies key groups to consider for engagement, as follows:

1. Patients and the public, consumers of patient-centered healthcare
2. Providers, including clinicians and organizations that provide care to patients and populations
3. Purchasers (e.g., employers) who underwrite the costs of healthcare
4. Payers and insurers who pay and reimburse medical care
5. Governmental policy makers and advocates in the nongovernmental sector, product makers, and manufacturers
6. Researchers, including writers of research dissemination reports

Outcomes of formative and summative evaluation (see Sect. 9.4) of stakeholder protocols may result in a reassessment of the stakeholders' relative ranking and position in the project, along the following broad system:

• Primary stakeholders, those individuals ultimately affected, either positively or negatively, by the project's outcomes (e.g., patients)
• Secondary stakeholders, those individuals who are the intermediaries, the people indirectly affected by the project's outcomes (e.g., caregivers, family members)
• Key stakeholders, those individuals, who may or may not be primary stakeholders as well, who have a significant influence on the outcome and/or running of the project

Taken together, stakeholder analysis is a critical sine qua non for stakeholder identification and for analyzing the range of interests and needs among primary and secondary stakeholders. The stakeholder analysis process can be seen in
terms of several generally sequential, yet independent but integrated, stages of activity:

• Defining: Stakeholders are defined and identified in relation to a specific issue: stakeholder identification operates in respect to a particular specified issue.
• Long Listing: With respect to the specified issue, a "long list" of key, primary, and secondary stakeholders is drawn that indicates groupings (e.g., public, private, and community) and subgroupings (i.e., gender, ethnicity, age).
• Mapping: Analysis of the long list along selected criteria (i.e., interest, social influence, political role) to allow systematic exploitation of positive attributes and identification of gaps or needed bridge-building among stakeholders.
• Visualizing: Drawing an influence–interest–capacity matrix is essential at this stage.
• Verification: Validity of the analysis is established by assessing and verifying stakeholders' availability and commitment. This step may require additional informants and information sources.
• Mobilizing: Strategies for sustaining effective participation of the stakeholders, tailored to the different groups and subgroups of identified stakeholders, and including empowerment interventions for high-stake stakeholders with little power or influence.
• Evaluation: Reassess to ensure maximizing the roles and contribution of all stakeholders.

In a patient-centered healthcare modality, stakeholder engagement strategies must be responsive to the values and interests of patients, patient advocates, and the public. The process ought to include:

• Evidence prioritization—establishing a vision and mission for research, identifying topics, setting priorities, and refining key working questions (i.e., formulation of CI).
• Evidence generation—obtaining and refining the bibliome.
• Evidence synthesis—systematic review of research (continued exploration of engagement in the conduct and assessment of reviews is needed) (i.e., research synthesis).
• Evidence integration—to integrate clinical, behavioral, economic, and systems evidence in decision analysis, simulation modeling, cost-effectiveness analysis, and related protocols (i.e., translational inference).
• Evidence dissemination—active distribution of the outcomes of the research process described above to the five strata of stakeholders.
• Evidence utilization—formative and summative evaluation, adoption, and implementation of the findings in policies and revised clinical practice guidelines for practical use in specific clinical and real-world settings (i.e., translational effectiveness).
• Evidence feedback—stakeholders offer feedback regarding their participation, including on mechanisms for engagement, intensity of engagement, and support throughout the process, as well as the nature and use of uncovered evidence.

8.3 Patient-Centered Outcomes

8.3.1 Primary Provider Theory

By "individual patient data" (IPD), we mean the availability of raw data for each study participant in each included trial. That is distinct from aggregate data (summary data for the comparison groups in each study), which has been the focus of the preceding chapters—mainly because aggregate data are still the focus of healthcare research and biostatistics.
Strictly speaking, in the context of patient-centered evidence-based healthcare, it is impossible to concede that aggregate mean data are—as a rule—representative of any one patient in the group. In point of fact, aggregate data are rather meaningless and useless in the context of patient-centered research outcomes. Consequently, a timely and critical approach at collecting and looking at data is specifically and uniquely directed to each individual
patient: patient-centered measures of care and individual patient data analysis.
The core of patient-centered care is patient satisfaction in clinical outcome. Thence emerged Aragon's Primary Provider Theory, which is a generalizable theory holding that patient-centeredness is a latent trait/ability of healthcare providers that influences their care behavior and related patient outcomes. Based on these principles, research can be crafted to test directly the robustness of the theory's inferences across patients and across healthcare settings, including hospitals, medical practices, and emergency departments, and across healthcare providers, including physicians and allied health practitioners, nurses, nurse practitioners, dentists, physician assistants, and others. The Primary Provider Theory is grounded on eight fundamental principles:

1. Clinical competency is one of the necessary conditions of desired outcomes.
2. Desired outcomes depend on the transmission of care, which is based on clinical knowledge, effective communication, and interaction with patients.
3. Patient-centeredness describes an underlying quality of the provider's interaction with and transmission of care to the patients.
4. Providing patient-centered transmission of care influences the outcomes of the treatment and the satisfaction of the patients.
5. Providers are uniquely responsible for the patient-centered quality of the transmission of care and clinical knowledge to their patients.
6. Providers who are both clinically competent and patient-centered generally achieve desired clinical outcomes and compliance.
7. Patients and families value patient-centered care because the patient-centered encounter is more important than any financial objectives.
8. Patients are the best judges of patient-centeredness.

The Primary Provider Theory, a powerful model in this early stage of individual patient assessment and analysis, is intertwined with patient-centered outcomes, such as:

• Expectations of provider value
• Descriptors of the dynamic process in which patient satisfaction occurs and converges from provider power and patient expectations

Therefore, patient satisfaction can be conceptualized as the result of an underlying network, a meta-construct of interrelated satisfaction constructs, including satisfaction of the patient with the primary provider and the care received, with waiting for the provider and the bedside manner of the provider, and satisfaction with the provider's assisting office and clinical staff. Taken together, these elements define what the primary providers offer to the individual patients in terms of the greatest clinical utility.
The Primary Provider Theory generates the patient-centered measure of quality of service and offers an alternative paradigm for the measurement and realization of patient satisfaction by informing the patient-centered physician directly about how to improve the culture of practice, continuing medical education, quality of care improvement, outcome measurement, satisfaction survey construction, and the like.
The Primary Provider Theory is related somewhat to the trialectical relationship among the clinical provider, the patient, and the patient-centered best available evidence, which we described at length elsewhere (Chiappelli 2014). In brief, the paradigm is an adaptation of the person–environment fit model to evidence-based healthcare.
Additional measures of patient-centered measurements in healthcare include quality indicators generated by the Agency for Healthcare Research and Quality (AHRQ) and distributed as free software by AHRQ for that purpose. These tools can serve hospitals to help identify quality of care events that might need further improvement, greater safety, and more extensive evaluation. They generally include:

• Prevention Quality Indicators, to identify hospital admissions, in geographic
areas, that evidence suggests may have been avoided through access to high-quality outpatient care
• Inpatient Quality Indicators that reflect quality of care inside hospitals, as well as across geographic areas, including inpatient mortality for medical conditions and surgical procedures

These indicators provide a set of measures that offer a novel and unbiased perspective on hospital quality of care using hospital administrative data. They reflect specifically quality of care inside hospitals and include inpatient mortality for certain procedures and medical conditions. In addition, AHRQ has also developed:

• Patient Safety Indicators that reflect quality of care inside hospitals, as well as geographic areas, and focus on potentially avoidable complications and iatrogenic events
• Pediatric Quality Indicators that use indicators from the other three modules, with adaptations for use among children and neonates, to reflect quality of care inside hospitals, as well as geographic areas, and identify potentially avoidable hospitalizations

Taken together, the AHRQ quality indicators serve to help hospitals and clinical practices in the community:

• Identify potential problem areas that might need further study, and provide the opportunity to assess quality of care inside the hospital using administrative data found in the typical discharge record.
• Include mortality indicators for conditions or procedures for which mortality can vary from hospital to hospital.
• Include utilization indicators for procedures for which utilization varies across hospitals or geographic areas.
• Include volume indicators for procedures for which outcomes may be related to the volume of those procedures performed.

New research must now integrate and utilize these AHRQ quality indicators and the novel assessment tools designed and validated to measure the trialectical relationship among the clinical provider, the patient, and the patient-centered best available evidence, to test and verify the Primary Provider Theory.

8.3.2 Individual Patient Outcomes Research

Methodologically speaking, individual patient outcomes research, or patient-centered outcomes research (PCOR), protocols should:

• Specify the outcomes and patient characteristics to be analyzed.
– Establish, before embarking on data collection, what data are actually available.
– Determine, when deciding what variables to measure, what analyses are planned and what data will be needed to do them; minimize the potential for redundant or useless data gathering.
• Consider the individual data items in terms of which further or constituent variables are necessary.
– Redefine outcome variables as necessary for consistency and completeness of analysis.
• Provide protocol and data format instructions for standardization among experimenters.
– Streamline paper and digital data acquisition formats.
• Collect and analyze data at the level of the individual participant to enable translation between different staging, grading, ranking, or other scoring systems.
– Pool homogeneous data whenever possible from studies; such pooling would not otherwise be possible because of differences between the data collection tools.

The aims of the operations on individual patient data verification are:

1. To increase the probability that the data supplied are accurate
2. To confirm that trials are appropriately randomized
3. To ensure, wherever appropriate, that the data are current
measure the trialectical relationship among the are current
148 8  Individual Patient Data

Furthermore, to ensure efficient data verifica- collaborate. There may also be circumstances
tion, a practical protocol was outlined and recom- where it may not be necessary, for example, if all
mended in Chiappelli (2014). the required data are readily available in a suitable
Collecting PCOR data that include the time format within publications.
interval between the randomization and the event Researchers naturally require safeguards on
of interest enables time-to-event analyses, includ- the use of their study data and wish to ensure that
ing reverse Kaplan-Meier survival, to be it will be stored securely and used appropriately.
conducted. For this reason, a signed confidentiality agree-
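To make the time-to-event idea concrete, the following is a minimal sketch using the Python lifelines package; the durations, the event flags, and the availability of lifelines are assumptions made for illustration only.

from lifelines import KaplanMeierFitter

durations = [5, 8, 12, 12, 15, 21, 30]  # days from randomization to last evaluation
events    = [1, 1, 0, 1, 0, 1, 0]       # 1 = event observed, 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events)
print(kmf.survival_function_)       # estimated survival curve S(t)
print(kmf.median_survival_time_)    # median time-to-event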
For outcomes such as survival, where events can continue to take place over time, PCOR meta-analyses can provide an important opportunity to examine the effects of interventions over a prolonged period. They can also provide an opportunity for researchers to provide more up-to-date data for relevant outcomes, such as mortality, than they have published for their study.
In brief, PCOR data are useful in that they may be the most practical way to carry out analyses to investigate whether any observed effect of an intervention is consistent across well-defined types of participants. By means of PCOR data, the investigator can:

• Obtain a straightforward categorization of individuals for subgroup analysis, stratified by study and defined by single or multiple factors.
• Produce more complex and precise analyses, such as multilevel modeling, to explore associations between intervention effects and patient characteristics.
• Conduct in-depth exploration of patient characteristics, irrespective of the intervention.
• Consequentially yield more accurate inferences.

8.3.3 Individual Patient Reviews

Reviews of PCOR data should, as we already emphasized in Chiappelli (2014), be considered in circumstances where the published information does not permit a good quality review or where particular types of analyses are required that are not feasible using standard approaches. There are situations where the PCOR approach will not be feasible, because data have been destroyed or lost or, despite every effort, researchers do not wish to collaborate. There may also be circumstances where it may not be necessary, for example, if all the required data are readily available in a suitable format within publications.
Researchers naturally require safeguards on the use of their study data and wish to ensure that it will be stored securely and used appropriately. For this reason, a signed confidentiality agreement is often used as a "contract" between the original investigators and the PCOR review team. The details of such agreements will vary, but most will state that data will be held securely, be accessed only by authorized members of the project team, and will not be copied or distributed elsewhere. It is also good practice to request that individual participants are de-identified in supplied data, such that individuals are identified only by a study identifier code and not by name. This seems to be an increasing requirement for obtaining PCOR from some countries where data protection legislation requires that a participant cannot be identified from the data supplied. Data sent by email should be encrypted wherever possible.
The general approach to PCOR review is similar to any other systematic review. The methods used should differ substantially only in the data collection, checking, and analysis stages. Just as for any Cochrane review, a detailed protocol should be prepared, setting out the objective for the review, the specific questions to be addressed, study inclusion and exclusion criteria, the reasons why PCOR are sought, the methods to be used, and the analyses that are planned. Similarly, the methods used to identify and screen studies for eligibility should be the same irrespective of whether PCOR will be sought, although the close involvement of the original researchers in the project might make it easier to find other studies done by them or known to them. The project should culminate in the preparation and dissemination of a structured report. A PCOR review might also include a meeting at which results are presented and discussed with the collaborating researchers.
In brief, and as we stated in Chiappelli (2014), PCOR review is a specific type of systematic
review. Instead of extracting data from study publications, the original research data for each participant in an included study are sought directly from the researchers responsible for that study. These data can then be reanalyzed centrally and, if appropriate, combined in meta-analyses. Cochrane reviews can be undertaken as PCOR reviews, but PCOR reviews usually require dedicated staff and would be difficult to conduct in "free time." The approach requires particular skills and usually takes longer and costs more than a conventional systematic review relying on published or aggregate data. However, PCOR reviews offer benefits related particularly to the quality of data and the type of analyses that can be done. For this reason, they are considered to be a "gold standard" of systematic review.
Obtaining PCOR often enables inclusion of studies that could not be included in a standard systematic review because they are either unpublished or do not report sufficient information to allow them to be included in the analyses. This may help avoid many types of publication bias. However, one must ensure that, by restricting analyses to those studies that can supply PCOR, bias is not introduced through selective availability of study data.
In brief, we stressed in Chiappelli (2014) that the success and validity of the PCOR approach require that data from all or nearly all studies will be available. If unavailability is related to the study results, for example, if investigators are keen to supply data from studies with promising results but reluctant to provide data from those that were less encouraging, then ignoring the unavailable studies could bias the results of the PCOR review. If a large proportion of the data have been obtained, perhaps 90% or more of individuals randomized, we can be relatively confident of the results. However, with less information we need to be suitably circumspect in drawing conclusions. Sensitivity analysis combining the results of any unavailable studies (as extracted from publications or obtained in tabular form) and comparing these with the main PCOR results is a useful aid to interpreting the data.

8.4 Patient-Centered Inferences

8.4.1 Individual Patient Data Analysis

Several standard statistical packages, including MedCalc, can perform the necessary analyses of PCOR from the individual studies. Nevertheless, it can be unwieldy and time-consuming to have to analyze each outcome in each study one at a time. Commercially available software is not currently available that supports the direct analysis, pooling, and plotting of PCOR data.
Practically speaking, and as noted in Chiappelli (2014), PCOR data can readily be analyzed directly in RevMan, the software used for preparing and maintaining Cochrane reviews, available for free download (current version: 5.2.5; http://ims.cochrane.org/revman/download). The data are first analyzed outside of this software, and summary statistics for each study are entered into RevMan. Whereas MedCalc does do meta-analyses, for the purposes of these more sophisticated analyses, the noncommercial SAS analysis package "SCHARP" may be recommended. It analyzes each study, pools results, and outputs tabulated results and forest plots for dichotomous, continuous, and time-to-event PCOR, in a manner substantially equivalent to MedCalc. But SCHARP is developed by the meta-analysis group of the UK Medical Research Council Clinical Trials Units and is available from the Cochrane Individual Patient Data Meta-analysis Methods Group (IPD MA MG).

8.4.2 Individual Patient Data Meta-Analysis

Individual patient data meta-analysis refers to the situation where the meta-analysis is performed on research studies that report individual patient data, rather than group data. One of the main reasons that individual patient data meta-analysis is so important in evidence-based healthcare is that time-to-event analysis of survival is vital in evaluating prognostic studies. To allow this type of analysis, one needs to know the time that each individual spends "event-free." This is usually collected as the date of randomization, the event status (i.e., whether the event was observed or not), and the date of last evaluation for the event. Sometimes, it will be collected as the interval in days between randomization and the most recent evaluation for the event. Time-to-event analyses are performed for each trial to calculate hazard ratios, which are then pooled in the meta-analysis.
From an analysis standpoint, most individual patient data meta-analyses to date have used a two-stage approach to analysis:

1. In the first stage, each individual study is analyzed in the same way, as set out in the meta-analysis protocol or analysis plan.
2. In the second step, the results, or summary statistics, of each of these individual study analyses are combined to provide a pooled estimate of effect in the same way as for a conventional meta-analysis in systematic reviews.

More complex approaches using multilevel modeling have been described for binary data, continuous data, ordinal data, and time-to-event data, but, currently, their application is less common. When there is no heterogeneity between trials, a stratified log-rank two-stage approach for time-to-event data may be best avoided for estimating larger intervention effects.
In brief, individual patient data meta-analysis involves the central collection, validation, and reanalysis of "raw" data from all clinical trials worldwide that have addressed a common research question, with data obtained from those responsible for the original trials.
As we already emphasized in Chiappelli (2014), despite the many advantages of individual patient data meta-analysis in assessing a plethora of prognostic outcomes in evidence-based healthcare, there is considerable scope for enhancing the methods of analysis and presentation of this analysis. Timely and concerted research must address several aspects of individual patient data meta-analysis, including:

• Improving design, to secure a more comprehensive investigation of the influence of patient-level covariates and confounders on the heterogeneity of treatment effects, both within and between trials: that is, to separate within-trial and across-trials treatment–covariate interactions.
• Better characterizing the impact of heterogeneity or the use of random effects.
• More stringent consideration of statistical implementation, particularly given that the analysis must preserve the clustering of patients within studies: it would be quite inappropriate to simply analyze individual participant data as if they all came from a single study. Clusters must be retained during analysis through the two-step or one-step approach outlined above, for the purpose of aggregating the data for each study (such as a mean treatment effect estimate and its standard error) and then synthesizing them in the second step by means of the suitable inference model for meta-analysis. Alternatively, the individual participant data from all studies can be modeled simultaneously in a one-step process while accounting for the clustering of participants within studies. Either model provides a PCOR meta-analysis that yields the very estimate of the single patient treatment effect under study.

In closing, it is important to reiterate the observation made in Chiappelli (2014) that a formal protocol for individual patient data meta-analysis must be established, validated, and widely recognized, such that a new revision of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist could include it, perhaps along the essential criteria we noted elsewhere (Chiappelli 2014, 2016).
8.4.3 Individual Patient Data Evaluation

We could conceive the scientific endeavor as being a four-step process, which can be succinctly outlined as the development of a new model, research question, and hypothesis; systematic research designed to test the model by proving or disproving the hypothesis; application of the findings in real-life settings; and evaluation of the implications of the outcomes, to improve the model and to generate novel hypotheses. That is to say, the phase of research and development both initiates and initiates anew, in a dynamic process, which is akin to the progression on a spiral, rather than on a circle: walking along a circular path leads us back to where we started from; progressing along a spiral leads us to ever newer, better, greater, and more fascinating discoveries than we could imagine at the onset.
But, in order for the scientific endeavor to retain its pragmatic systematic nature, the research and development step must engender a phase during which findings are applied, disseminated, and generalized to environments beyond the variables considered in the research study. The extent to which findings can be validated beyond the study's constraints is, as noted in previous chapters (see Chap. 3), what is termed external validity. The process of establishing external validity is akin to the process of evaluating the implications and applications of the research outcomes to the real-world situations they were designed to address, with the ultimate purpose both of improving the original theoretical model and of generating novel hypotheses.
That is to say that, in brief, yes, we pursue evidence-based healthcare; and yes, we determine that research synthesis is the appropriate research protocol to obtain the best available evidence; and yes, we determine the fundamental elements that drive the utilization of the best available evidence in specific clinical settings, which we pragmatically defined as translational effectiveness. But the question remains as to how to evaluate the outcomes of evidence-based healthcare. This is also to say, of course, that evaluation cannot act in a void. It must be grounded—as research is for sure—on a theoretical model. Evaluation must optimally always be theory-based.
This chapter examines current trends in the science of evaluation. This chapter also proposes the next necessary step in the field: from patient-centered outcomes research (PCOR) to patient-centered outcomes evaluation (PCOE).
The core concepts discussed in this chapter pertain to evaluation science. The principal models of evaluation are discussed as they pertain to translational effectiveness. The ultimate goal of the following chapter (see Chap. 9) is to describe the process of evaluation. Now, it becomes straightforward to expand and include this paradigm into the topic of this present chapter: that is, to transit from patient-centered outcomes research (PCOR) to patient-centered outcomes evaluation (PCOE).
Evaluation is critical to understanding how participatory processes work and how they can be structured to maximize the benefits of stakeholder and decision-maker collaboration. Mixed model analysis allows us to investigate factors whose levels can be controlled by the researcher (fixed), as well as factors whose levels are beyond the researcher's control (random).
Mixed model analysis is preferred in PCOE. It usually adopts a frequentist inferential interpretation, although the Bayesian approach to inference is becoming increasingly integrated in mixed model analysis. Mixed models of evaluation imply a participatory process. Stakeholders must be engaged early in the process to articulate the goals for the project and the participatory process to achieve those goals. The assumptions underlying the goals and the process form the basis for the evaluation questions. The stakeholders are also involved in the evaluation methodology, data collection procedures, and the interpretation of the results. Mixed models are preferred and superior to other models of evaluation in the context of patient-centered care because they engage a systematic way to explore, explain, and verify evaluation results, by proffering opportunities for
evaluators to examine and peruse systematically data collection and analysis strategies, for prompt incorporation of a large number of evaluation questions (i.e., "nodes") into the study design. In brief, the mixed method evaluation model yields a novel and creative framework for the design and implementation of rigorous, meaningful evaluations of participatory approaches that benefit all stakeholders, from the patient to the clinician, from the user to the decision-makers, which the random model also proffers, but with greater caveats.
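As a concrete illustration of a fixed factor and a random factor in one analysis, the following is a minimal sketch using the Python statsmodels package; the outcome, treatment, and clinic variables are hypothetical names chosen only for demonstration.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "outcome":   [6.1, 5.8, 7.0, 6.6, 5.2, 5.0, 6.4, 6.0],
    "treatment": [0, 0, 1, 1, 0, 0, 1, 1],                   # fixed factor (controlled)
    "clinic":    ["A", "A", "A", "A", "B", "B", "B", "B"],   # random factor (not controlled)
})

# Fixed effect for treatment, random intercept for clinic
model = smf.mixedlm("outcome ~ treatment", df, groups=df["clinic"])
result = model.fit()
print(result.summary())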
We recall the emphasis we have made to distinguish between the evaluation of outcomes (i.e., outcome monitoring evaluation: have proposed targets been achieved?) and the evaluation of impact. The latter, in brief, pertains to the systematic assessment of the changes (e.g., improvement vs. deterioration of quality of life)—intended as well as unintended side effects—attributed to a particular intervention, program, or policy. In an impact evaluation program, the intended impact corresponds to the program goal and is generally obtained as a comparison of outcomes among the participants who comply with the intervention (see Chap. 9 on evaluation) in the treatment group to outcomes in the control subjects. Here, we must distinguish between:

• Treatment-on-the-treated (TOT) analyses
• Intention-to-treat (ITT) analyses, which typically yield a lower-bound estimate of impact but are more relevant than TOT in evaluating the impact of optional programs, such as patient-centered care

In this case, it is clear that impact evaluation protocols follow primarily the logic model of evaluation (vide infra), in which outputs refer to the totality of longer-term consequences associated with the intervention, program, or policy under study on quality of life, satisfaction, and related patient-centered outcomes. It is also clear that impact evaluation implies a "counter-factual" analysis that compares actual outcomes and findings to results that could have emerged in the absence of the intervention under study. In broad lines, we could say that, whereas outcome evaluation "observes" outcomes, impact evaluation seeks to establish a cause-and-effect relationship, in that it aims at testing the hypothesis that the recorded changes in outcome are directly attributable to the program, intervention, or policy being evaluated.
Impact evaluation—that is to say, in broad lines, PCOE—serves to inform the stakeholders about what program works, which policy is failing, and in which contextual environment a given intervention is successful or not; that is to say, in what specific clinical setting will translational effectiveness be optimal, why, at what cost (financial, risk-wise, and otherwise), etc. Impact evaluation is timely and critical to the pursuit of systematic reviews in patient-centered care. Single difference estimators are designed to compare mean outcomes at end line, based on the assumption that intervention and control groups have homogeneous values at baseline. Double (or multiple) difference estimators analyze the difference in the change, delta, in outcome from baseline over time for the intervention and control groups at each time point following implementation of the intervention.
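The double-difference logic can be illustrated with a short worked sketch; the group means below are hypothetical numbers chosen only to show the arithmetic.

# Hypothetical mean outcomes at baseline and end line
pre_treat, post_treat = 50.0, 65.0   # intervention group ("beneficiaries")
pre_ctrl,  post_ctrl  = 51.0, 58.0   # control group ("non-beneficiaries")

delta_treat = post_treat - pre_treat  # change in the intervention group: 15.0
delta_ctrl  = post_ctrl - pre_ctrl    # change in the control group: 7.0

# Double-difference (difference-in-differences) impact estimate
did = delta_treat - delta_ctrl
print(did)  # 15.0 - 7.0 = 8.0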
From the methodological standpoint, impact evaluation is complex primarily because it involves a comparison between the intervention under study and an approximated reference situation deprived of said intervention. This is the key challenge to impact evaluation: that the reference group cannot be directly observed, that it can only be inferred, and that, for all intents and purposes, it remains merely hypothetical. Consequently, impact evaluation relies upon an uncontrolled quasi-experimental counter-factual design, which can yield either prospective (ex ante) or retrospective (ex post) time-dependent comparisons.

• Prospective impact evaluation begins during the design phase of the intervention and requires the collection of baseline data for time series comparative analyses, with midline and end-line data collected from the intervention and control groups (i.e., double and multiple difference estimation based on the
deltas). Subjects in the intervention group are referred to as the "beneficiaries," and subjects in the control group are the "non-beneficiaries" (of the intervention). Selection and allocation principles and issues, including clustering effects, discussed in previous chapters apply to impact evaluation studies to the same extent as noted for research investigations.
• Retrospective impact evaluation pertains to the implementation phase of interventions or programs. These modes of evaluation utilize end-stage survey data (i.e., single difference estimation), as well as questionnaires and assessments as close to baseline as possible, to ensure comparability of intervention and comparison groups.

Threats to the internal and external validity of impact evaluation are related to the threats to internal and external validity of research designs, as discussed in preceding chapters. They were also described at length in Chiappelli (2014).

8.5 Implications and Relevance for Sustained Evolution of Translational Research

8.5.1 The Logic Model

The logic model (W.K. Kellogg Foundation, 2004) serves to examine and to describe in detail the effectiveness of certain programs. The model establishes logical linkages among resources, activities, outputs, audiences, and short-, intermediate-, and long-term outcomes related to the specific research or implementation question at hand. The model also leads to the clarification of the needed critical measures of performance. Logic models are narrative or graphical depictions of the processes under study. They depict the real-life situation and establish the underlying assumptions upon which an activity is expected to lead to a specific result. They vividly illustrate the underlying sequence of cause-and-effect relationships, and they communicate the path toward a desired result. They identify the underlying trends and connectivity among variables that are critical to establishing and enhancing performance and outcomes.
Logic models describe the concepts that need to be considered at each separate step and, in so doing, inextricably link the problem (situation) to the intervention (our inputs and outputs) and to the impact (outcome). The application and implementation of the logic model in the planning phase allows precise communication about the purposes of a project, the components of a project, and the sequence of activities and the expected accomplishments. The logic model entails six fundamental steps:

1. Situation and Priorities
2. Inputs (what we invest)
3. Outputs
4. Activities (the actual tasks we do)
5. Participation (who we serve; customers and stakeholders)
6. Outcomes/Impacts:
(a) Short-Term (learning: awareness, knowledge, skills, motivations)
(b) Medium-Term (action: behavior, practice, decisions, policies)
(c) Long-Term (consequences: social, economic, environmental, etc.)

It is important to note that the disadvantages of the logic model manifest in that it can be too complex and interactive to be effectively displayed in the simplistic logic layout; it can be challenging to depict programs in enough depth without detrimentally simplifying their relationships and implications; and it can be too rigid in structure and inhibit innovation and adaptive refinement.

8.5.2 Repeated Measure Models

Traditional repeated measure models, such as the pre–post approach, have a high risk of generating a response shift bias. Response shift occurs when a participant uses a different frame of understanding about an item between the pre and post periods, which generates a serious problem in the estimation of change between the two time points. Response shift may be due to learning, remembering the item and subsequently processing it cognitively during the two time points, new and improved understanding, or other events indirectly related to the participant. Be that as it may, when participants respond differently to the same item on two separate occasions (i.e., response shift), it generally reflects the fact that they are responding based on two different frames of reference.
The Post-before-Pre, or Post-then-Pre, models are designed to counter the response shift bias of the pre–post design. The retrospective Post-then-Pre design enables before and after information to be collected and analyzed at the same time.
Typically, the Post-before-Pre model permits the simultaneous administration of two evaluative protocols following the intervention, each designed to assess the status either before (i.e., perceived) or following the intervention (i.e., objective). This approach avoids shift bias. But it may also be fraught with inaccuracies of measurement consequential to estimated (perceived) values of pre. Therefore, it is unclear how accurate and truly useful the Post-then-Pre model could be to an emerging field of patient-centered outcomes evaluation. What is clear is that PCOE is concerned with:

• The everyday practice of making insightful recommendations for stakeholders and improving the program developed for each individual patient
• Appraising all stages of the project to determine the merit and worth of the program and how it can be improved stepwise (i.e., formative evaluation) and overall (i.e., summative evaluation)
• Providing, by means of the logic model roadmap (inputs, outputs, and short-, medium-, and long-term outcomes), a timely and critical guide for program adaptations and refinement during all stages of the intervention, as well as forming recommendations for the fate of the project, including suggestions for project continuation, elimination, or modification, to ensure maximum benefit to each patient

8.5.3 Comparative Individual Patient Effectiveness Research (CIPER)

In closing and in summary, we stated in this chapter that the term individual patient data refers to the availability of raw data for each study participant in each included trial, as opposed to aggregate data (summary data for the comparison groups in each study). Reviews using individual patient data require the collaboration of the investigators who conducted the original trials, who must provide the necessary data. From a methodological standpoint, the domain of individual patient data gathering, analysis, and inference needs to specify the specifics of the individual patient data outcomes under study—viz., individual patient data outcomes research. This requires a cogent research question.
That is, altogether, relatively simple for a research methodologist and biostatistician involved in the type of study outlined here. What becomes several orders of magnitude more complex is the performance of a research synthesis design, described above, for the purpose of a PICOTS-driven systematic review and meta-analysis of individual patient data. In that case, the comparative effectiveness research paradigm is integrated within the construct of individual patient data analysis and inference, thus generating a novel and fast emerging sub-domain of the field of translational effectiveness, which has been termed Comparative Individual Patient Effectiveness Research (CIPER).
CIPER is designed to compare the effectiveness outcomes of research obtained from independent patient data analyses and inferences. The protocol follows that outlined above for CER. But the problem arises at the level of analysis of the quantitative consensus. Indeed, and as discussed in greater depth elsewhere, whereas many standard statistical packages exist to perform the necessary analyses of individual patient data from individual studies, meta-analyses of such data sets are unwieldy and time-consuming because commercially available software is not currently available that supports the direct
analysis, pooling, and plotting of independent patient data in meta-analysis. These are large data sets, often as complex as what is today referred to as "big data," the analysis of which in translational science is still in its infancy.

8.6 Self-Study: Practice Problems

1. Why might the analysis of individual patient data be more advantageous than the analysis of aggregate data?
2. What are current complications within the healthcare field that have impeded the utilization of individual patient data?
3. Who may be considered a stakeholder and why do they play an important role in healthcare?
4. Can inferences be made from individual patient data? Explain.
5. What are the differences between inferences made from individual patient data compared to aggregate group data?
6. Describe the relationship shared between individual patient data and patient-centered outcomes in translational healthcare.
7. Which of the following is the most appropriate study design to utilize in patient-centered outcome research (PCOR) for obtaining the best available evidence?
(a) Diagnostic Study
(b) Prognostic Study
(c) Naturalistic Study
(d) Research Synthesis Study
8. Based on the answer above, what is the most appropriate format of the relevant research question?
9. After the analysis of individual patient data in PCOR, what is the next necessary step in this dynamic process and why is it important?
10. In the evaluation of patient-centered programs and research, a novel repeated measure model is proposed. How is this different than the traditional repeated measure model and what is its advantage?
9 Evaluation

Contents
9.1 Core Concepts
9.2 Conceptual, Historical, and Philosophical Background
9.2.1 Conceptual Definition
9.2.2 Historical and Philosophical Models
9.2.3 Strengths and Deficiencies
9.3 Qualitative vs. Quantitative Evaluation
9.3.1 Quantifiable Facts Are the Basis of the Health Sciences
9.3.2 Qualitative Evaluation
9.3.3 Qualitative vs. Quantitative Evaluation
9.4 Formative vs. Summative Evaluations
9.4.1 Methodology and Data Analysis
9.4.2 Formative and Summative Evaluation
9.4.3 Comparative Inferences
9.5 Implications and Relevance for Sustained Evolution of Translational Research
9.5.1 Participatory Action Research and Evaluation
9.5.2 Sustainable Communities: Stakeholder Engagement
9.5.3 Ethical Recommendations
9.6 Self-Study: Practice Problems
Recommended Reading
9.1 Core Concepts

Nicole Balenton

A renowned evaluator, Daniel L. Stufflebeam, once noted that "the purpose of evaluation is to improve, not prove" (Stufflebeam 1993). We begin the second half of the book with evaluation, a methodical determination of a health program or policy's worth and significance using criteria governed by a set of standards to ensure its validity and reliability. As mentioned before in Chaps. 2 and 3, the validity and reliability of a program or policy is of utmost importance in both translational research and effectiveness in healthcare. This chapter introduces important philosophical models in evaluation associated with William Farish, Joseph Rice, and Frederick Taylor, all of whom have made their contribution to the evolution of modern evaluation.

In comparison to research, evaluation is stakeholder focused and questions how well something works as opposed to how it works. Modern evaluation is relevant for program-related decision-making. It draws evaluative conclusions about merit, worth, and quality designed to improve a particular health-related program or policy. The four major evaluation strategies (i.e., the scientific-experimental model, the management-oriented model, the qualitative-anthropological model, and the participant-oriented model) are discussed further in this chapter.

We examine and compare two sets of evaluation, specifically qualitative versus quantitative and formative versus summative. In brief, quantitative evaluation—the basis of research in the health sciences—and qualitative evaluation are complementary in that they are equally essential to scientific inquiry, yielding data that neither approach would produce on its own. On the other hand, formative and summative evaluation yield significant estimates of the program's benefits, costs, and liabilities over a period of time.

As mentioned, evaluation is stakeholder driven, and this chapter concludes with participatory action research and evaluation's (PARE) contribution to raising stakeholder engagement, awareness, and health literacy, as well as several ethical considerations to take into account for translational healthcare. Broadly, PARE is an approach that seeks to understand the reality of what communities experience, directed to social change and improved effectiveness of translational healthcare.

9.2 Conceptual, Historical, and Philosophical Background

9.2.1 Conceptual Definition

Evaluation can be conceived as a systematic longtime process aimed at the determination of the worth (i.e., effectiveness and efficacy) and significance, strengths and weaknesses, and validity and intrinsic biases and fallacies of certain studies, investigations, programs, and policies that pertain to society's well-being, including education and healthcare. In fact, the breadth of evaluation endeavors is even broader than that and often spans a wide range of human enterprises from the arts to criminal justice and from for-profit business to nonprofit organizations.

Evaluation rests on a well-characterized set of criteria, protocols, and standards to ensure its reliability and validity. It examines not only a program or project in its entirety but also examines its individual components, from the statement of realistic and feasible aims to the conceptualization of the background facts and data, the statement of expectations and alternative inferences, and the detailed methodology and interpretation of the outcomes. Ultimately, evaluations confront the very decision-making process that arises from the completed project.

Overall, the purpose of evaluation is to ascertain the degree of achievement or value vis-à-vis the objectives and results of any such action as it is in process and as it has been completed. Evaluation is, one could say, the lightning rod that helps policy makers, decision-makers, and actors in a certain field remain focused to gain insight into planned or existing programs and to identify required initiatives for new and improved directions.

9.2.2 Historical and Philosophical Models

The origins of evaluation can be traced to antiquity. But the conceptualization of modern-day uses, protocols, and implications of evaluation as a scientific discipline for societal decision-making and policies is more recent. Several discrete periods of the evolution of modern evaluation as we know it today can be identified.

The foundations of contemporary evaluation theory and practice were established as a modern scientific pursuit by William Farish (Fig. 9.1) in the early 1790s. In his role as the Proctor of Examinations at Cambridge, Farish examined the qualitative and subjective scoring of examinations and consequentially the potential bias that was introduced in the ranking of students. He developed a process by which correct answers
and incorrect answers could be scored numerically, and a quantitative assessment of each individual student obtained by processing the numerical scores. The quantitative mark permitted objective ranking of the examinees; the averaging and aggregating of scores led to an estimated performance rating of the group.

Fig. 9.1 William Farish, chemist, c 1915 (Dawe 1915)

Farish's creative and scientifically sound approach became the formal approach to quantitating evaluation in the US educational system and the US Army by 1915. It laid the foundations for psychometric theory and measurement, the novel branch of scientific inquiry dedicated to the validation of tools of measurement.

Joseph Rice expanded on Farish's model for the formal evaluation of the study and learning of spelling by school children across many US cities and states. His work was an important contribution to the formal establishment of evaluation in the US educational system in the late 1890s.

This emergence of the field of evaluation was followed by a period of roughly three decades during which the efficiency of testing and measuring was refined. The ground work for this second moment in the evolution of modern evaluation theory and practice was spearheaded by Frederick W. Taylor, considered by most as the "father of scientific management" of scientific data, which he spelled out in his 1911 book The Principles of Scientific Management. The two fundamental Taylor principles (i.e., Taylorism) can be summarized as follows:

• The scientific management of data, which is based on observation, measurement, analysis, and efficiency, is the point of conjunction of maximal effectiveness for maximal efficacy (i.e., efficiency). It is the very foundation of the field of evaluation and of scientific inferences from qualitative and quantitative assessments.
• The scientific method requires that each objective be defined in terms which clarify the outcome under study (e.g., the kind of behavior or knowledge that a course should help the students to develop). Evaluation follows by means of internal cross-comparisons of the outcome data with the stated objectives.

The prosperity that followed WWII led to a rapid adoption and expansion of this new conceptualization of evaluation, which was soon used and integrated in taxonomies (cf., Bloom taxonomy of learning or cognitive domains, 1956), the hierarchical relationship among these objectives, objectives-based testing, and many other aspects of educational psychology, which contributed to the establishment of modern theories of learning, testing, and cognitive psychology.

By 1959, the US Congress enacted the National Defense Education Act (NDEA) that launched new projects, updated curriculum development aspirations, and novel pedagogical and didactic programs and evaluation modalities in mathematics, technology, engineering, and sciences, as well as foreign languages. The 1960s and early 1970s saw the formal establishment of formative and summative evaluation to evaluate new curricula. The Elementary and Secondary Education Act (ESEA) of 1965, sponsored by Senator Robert Kennedy, established programs evaluation as a sine qua non in education and in scholarly and professional programs across the board.
Criterion-referenced testing was refined to yield a valid and reliable measure of group performance based on established criteria, as well as, and as importantly, a measure of achievement of each individual subject. By the 1970s and 1990s, criterion-referenced testing became a timely and critical complement to norm-referenced testing, which is designed to distinguish differences from an established normative value. In that regard, it was the precursor of today's individual patient measurement, analysis, and inference (cf., Chap. 11).

Evaluation today is considered a field of academic inquiry in its own right. It encompasses six distinct sub-domains:

(a) Objectives-oriented
(b) Management-oriented
(c) Consumer-oriented
(d) Expertise-oriented
(e) Adversary-oriented
(f) Participant-oriented

Academic journals and higher education graduate degrees in concert have contributed to the professionalization of contemporary evaluation. This movement was coordinated by some of the top universities (e.g., University of Illinois, Stanford University, Boston College, University of California Los Angeles, University of Minnesota, and Western Michigan University), and, while it struggled under the Reagan administration, when funding was dramatically cut, it recovered in the Clinton years, when much of the funding for research and academic development was reinstated. It fell again into disarray during the Great Recession of the Bush administration but rebounded once more when the economy stabilized during the Obama years. Many academicians fear that funding for evaluation may be curtailed once more in the current political climate.

9.2.3 Strengths and Deficiencies

Evaluation is a methodological area of research that is closely related to but clearly distinct from other traditional modes of inquiry. It utilizes many of the same methodologies and data analytical paradigms used in research in general (cf., Chaps. 1–7), but it does so for a different purpose or mission. Therefore, evaluation requires an additional set of special skills: management ability, political dexterity, sensitivity to multiple stakeholders, and other specific attributes.

Evaluation has a distinct mission or purpose, compared to research per se, in that it pertains to the systematic assessment of the worth or merit of the findings produced by research. It follows that evaluation has a central role in the interpretative processing of research findings and related feedback functions. Figure 9.2 below compares research and evaluation.

That is to say, evaluation is conceptualized as the systematic acquisition and assessment of information, including the generation of the resulting feedback to the appropriate stakeholders, viz., sponsors, donors, client groups, administrators, staff, and other relevant constituencies. It produces outcomes that are intended to influence decision-making and policy formulation through the provision of empirically-driven feedback.

Nonetheless, that is not always the case, and this potential ambivalence can be a weakness of the process of evaluation, in part attributable to the heterogeneity of evaluation strategies. Four major groups of evaluation strategies can be identified:

• The scientific-experimental model of evaluation rests on the fundamental values and methods that are well grounded and generally accepted across the health, life, and social sciences. They include the unbiased pursuit of impartiality, accuracy, objectivity, reliability and replicability, and validity. The scientific-experimental model of evaluation relies on experimental and quasi-experimental designs, as well as some observational designs (i.e., cohort), and focuses on questions that pertain to comparative effectiveness and efficacy research and analysis for practice (i.e., CEERAP), comparative effectiveness research (i.e., CER), and comparative effectiveness analysis (i.e., CEA).
• The management-oriented model of evaluation examines comprehensiveness in evaluation and inserts evaluation as a most valued
component of a larger framework, which usually comprises business, organizational, governmental, or occupational activities (e.g., PERT, the program evaluation and review technique; CPM, the critical path method; the logic-based framework).
• The qualitative-anthropological model of evaluation emphasizes the relevance of naturalistic observation, the essential nature of the phenomenological quality of the evaluation context, and the value of subjective human interpretation in the evaluation process.
• The participant-oriented (or client-centered or consumer-oriented) model of evaluation focuses on the critical nature of the evaluation participants, clients, and users of the program or technology under examination.

Fig. 9.2 Illustration of the differences and similarities between research and evaluation. Adapted from American Evaluation Association Daily Tips blog, J.L., February 26, 2010, from http://aea365.org/blog/john-lavelle-on-describing-evaluation/. n.d. The paired panels read:
• Research: seek to generate new knowledge / Evaluation: information for decision-making
• Research: researcher-focused / Evaluation: stakeholder-focused
• Research: hypotheses / Evaluation: questions
• Shared by both: methods and analysis
• Research: recommendations based on research / Evaluation: recommendations based on questions
• Research: publish results / Evaluation: report to stakeholders

A second level of heterogeneity emerges from the fact that each type of evaluation can be either qualitative or quantitative in nature. Formative evaluation can provide stepwise estimates based on qualitative observations and on quantitative data at given time points, which examine the effects or outcomes of the object under evaluation at the completion of the process, and can also summarize either qualitatively or quantitatively presented findings to highlight a given program's strengths and weaknesses and successes and failures and to scrutinize recommendations for future improvements. Summative evaluation uses equally qualitative and quantitative data to establish whether the outcome that results at the completion of the program under examination can in fact be said to have been caused by, to be a direct impact of, a cause-effect factor of the program under evaluation, or random coincidence. The major strength of formative and summative evaluations together is that they yield timely and critical estimates of the relative benefits, overall costs, and liabilities of the program under examination over time. This chapter examines these issues in greater detail.
9.3 Qualitative vs. Quantitative Evaluation

9.3.1 Quantifiable Facts Are the Basis of the Health Sciences

In mathematics and empirical science, quantification (or quantitation) is the act of counting and measuring that maps human sense observations and experiences into members of some set of numbers. Quantification in this sense is fundamental to the scientific method and is an integral part of every aspect of the health sciences, from diagnosis to prognosis, from efficacy to effectiveness, from patient-centered to stakeholder-directed, and from based on the evidence to evidence-based.

The foundation of quantification is measurement, which was discussed earlier in this book (cf., Chap. 3), and quantification yields three types of assessments:

• Noncontinuous: dichotomous counts or enumeration of the events under study
• Continuous: assessment by means of a graded instrument (e.g., an interval scale)
• Semicontinuous measurement

Qualitative statements, such as clinical notes, can be quantified as well. Indices may be developed to search and count common clinical themes in the text, to estimate grammaticalization of morphemes, and to measure phonological shortness and dependence on the surrounding description of physiological settings or pathological conditions.

9.3.2 Qualitative Evaluation

The appropriate rigor necessary in all sciences includes the stringent criteria that govern qualitative evaluation. Indeed, qualitative methods of inquiry and of evaluation range across many different academic disciplines.

Qualitative research is a broad methodological approach that encompasses many research methods that may vary substantially across disciplinary specialties. Broadly speaking, qualitative methods examine the why and how of decision-making, not just the what, where, when, or who. It follows that qualitative approaches for research and evaluation, such as the case study or the case-control study design, generate new and improved understanding of a phenomenon that comes from exploring the totality of the situation (e.g., phenomenology, symbolic interactionism) but fail to provide continuous, semicontinuous, or dichotomous assessments of any of the phenomenological events observed.

From a conventional frequentist Fisherian statistical viewpoint, qualitative methods produce information only on the particular cases studied (e.g., ethnographies) and generate hypothetical statements rather than conclusive data. Quantitative methods are then needed to provide empirical support for the thusly generated hypotheses.

Qualitative data may arise from a variety of sources, which include, but are not limited to:

• Grounded theory and construct conceptualization
• Construct-derived interviews (structured, semi-structured, or unstructured)
• (Meta-)cognitive, psychosocial, and psycho-cognitive conceptualizations
• Narrative (e.g., clinical notes) and stakeholder and focus group town halls
• Storytelling and anecdotal evidence
• Transcripts, normative patterns, photographs, videos, and other media "literature"
• Participant ethnographical research and field notes
• Policy studies and health services reports
• Action research and actor-network participation

To quantify and to analyze qualitative information, it might be necessary to proceed along the following four principal steps (a minimal computational sketch follows the list):

• Categorization and sorting of the information on the basis of certain criteria of hierarchy for thematic analyses
• Recognition of recurrence of the themes under study
• Continuous, semicontinuous, or dichotomous assessment of recurrence
• Statistical analysis of recurrence of the themes¹ under study by means of the statistical approaches described in the preceding chapters (cf., Chaps. 4–7)

¹ Common Qualitative Data Analysis software packages include MAXQDA, QDA MINER, ATLAS.ti, NVivo, Dedoose for mixed methods, and others.
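As a concrete, if deliberately simplified, illustration of these four steps, the sketch below (in Python) categorizes a handful of clinical notes against predefined themes, recognizes and counts theme recurrence, and produces the dichotomous per-note indicators that can then be passed to the statistical tests of Chaps. 4–7. The notes, themes, and keywords are hypothetical assumptions, not a validated coding scheme.

from collections import Counter

# Hypothetical clinical notes: the raw qualitative material
notes = [
    "patient reports persistent pain and poor sleep",
    "pain improved after treatment; mood stable",
    "no pain reported; sleep remains poor",
]

# Step 1: categorization criteria for the thematic analysis
themes = {
    "pain": ["pain"],
    "sleep disturbance": ["sleep", "insomnia"],
    "mood": ["mood", "depressed", "anxious"],
}

recurrence = Counter()   # Step 2: recognition of theme recurrence
indicators = []          # Step 3: dichotomous (present/absent) assessment
for note in notes:
    row = {theme: int(any(k in note for k in keywords))
           for theme, keywords in themes.items()}
    indicators.append(row)
    recurrence.update(t for t, present in row.items() if present)

# Step 4: the indicator matrix now feeds the comparative or predictive
# analyses described in the preceding chapters (e.g., a chi-square test
# of theme frequencies across patient groups)
for theme, count in recurrence.items():
    print(f"{theme}: present in {count} of {len(notes)} notes")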
9.3.3 Qualitative vs. Quantitative Evaluation

It is a fallacy to argue that qualitative research and evaluation is somehow not as good as, "weaker" than, or softer than its quantitative counterpart. Qualitative and quantitative approaches complement each other and are equally necessary and critical to scientific inquiry. Each has specialized modalities of design and methodology, which derive from their domain of focus. They, in concert, yield data, which must be tested for parametric assumptions and analyzed either in a comparative or in a predictive mode, as described in the earlier chapters of this book.

In other words, regardless of whether we are dealing with an experimental study whose outcome is measured by a validated scale, or with the participant observation of some theoretical construct (social role valorization theories, for instance), scientific inquiry must always remain a rigorous hypothesis-driven process that consists of a carefully crafted research design, validated methodology, and stringent analysis. Both qualitative and quantitative research and evaluation are grounded in this predicate. In brief, a given research question may best be addressed either by quantitative or by qualitative scientific inquiry: the type of research question dictates that.

It may also be the case that the research question requires certain aspects to be examined by means of a more flexible qualitative process, whereas others require a more rigidly structured quantitative process. In that case, we speak of a mixed model research or evaluation.

9.4 Formative vs. Summative Evaluations

9.4.1 Methodology and Data Analysis

There are several different approaches in evaluation, the principal of which may be outlined as:

• Basic/generic/pragmatic qualitative and quantitative research and evaluation (i.e., mixed method) that uses an eclectic approach to address the research question.
• Ethnographic research and evaluation that studies, in the context of the health sciences, the impact of cultures and ethnic groups, social norms, and societal frameworks on particular diseases (e.g., the role of the common practice of "touching" the body of a diseased tribe leader in the spread of Ebola in western Africa in the early 2010s).
• Grounded theory research and evaluation that is an inductive type of research and evaluation that is "grounded" in the very observations from which it was developed.
• Phenomenology research and evaluation that describes the subjective perception, as opposed to the objective reality, of an event, which it treats as a "phenomenon."

9.4.2 Formative and Summative Evaluation

Michael Scriven coined the terms formative and summative evaluation in 1967 (Scriven 1967). Scriven argued that formative evaluation gathers information to assess the effectiveness of a program as it runs—in the case he discussed, this pertained to a curriculum, to guide school system choices as to which curriculum to adopt and how to improve it; but in the case of our discussion, it may very well pertain to a healthcare intervention aimed at improving the condition and quality of life of a certain patient group.

Formative assessments, including diagnostic testing, include a range of formal and informal
assessment procedures and qualitative feedback (rather than quantitative scores) aimed at modifying and improving a given set of activities, monitoring outcomes, and establishing accountability. Formative evaluation may seek:

• To provide feedback for clinical providers (i.e., medical doctors, dentists, nurses, pharmacists, and allied clinical personnel) to modify and improve subsequent patient-centered activities and experiences
• To identify and remediate group or individual deficiencies
• To increase patients' self-efficacy, determination, and motivation for healing
• To improve stakeholders' awareness and engagement
• To favor dissemination (e.g., via telehealth) of the activities that have been deemed to have satisfactory efficacy and effectiveness

In brief, formative evaluation in the context of the health sciences is a constructive process directed at promoting improved translational effectiveness.

Summative evaluation has the same approach and beneficial intent as formative evaluation, but it focuses on the overall assessment of a program after its completion. It is designed to summarize the program outcomes. Whereas there ought to be multiple formative evaluations during the course of a given intervention, there is only one terminal summative evaluation.

9.4.3 Comparative Inferences

Selected properties of formative and summative evaluation protocols may be summarized as follows:

• Timing: formative evaluation is repeated several times throughout the program; summative evaluation is carried out once, only at the end of the program.
• Purpose: formative evaluation serves to improve the program as it runs and to revise the materials of the program; summative evaluation serves to finalize the materials and set a policy for future programs of similar intents.

Moreover, it is noteworthy that whereas summative evaluation yields information that can yield either norm-based or criterion-based conclusions, formative evaluation can only produce, by design, criterion-based suggestions.

9.5 Implications and Relevance for Sustained Evolution of Translational Research

9.5.1 Participatory Action Research and Evaluation

Participatory action research and evaluation (PARE) is an approach to formative and summative evaluation in communities—for example, in patient groups. PARE is designed to focus on participation and action. It seeks to understand the world, the reality these patients experience, seeking to ameliorate it and to increase benefit effectiveness. It emphasizes collective inquiry and experimentation grounded in experience and social history.

Within the PARE process, communities of inquiry and action evolve and address questions and issues that are significant for those who participate either as subjects (i.e., patients) or as other stakeholders (e.g., allied clinical staff). The ultimate purpose of PARE is to contrast outcomes and reproducibility of findings among an array of concerted interventions. In that sense, PARE is equivalent, in intent at least—though not in protocol—to CERE (comparative effectiveness research and evaluation).

In terms of protocol, PARE integrates three fundamental aspects, which are usually not found in a traditional comparative effectiveness research (CER) paradigm:

1. Participation (life in society and democracy)
2. Action (engagement with experience and history)
3. Research and evaluation (soundness in thought and the growth of knowledge)

In brief, PARE is not a monolithic body of ideas and methods but rather a fluid, pluralistic orientation to knowledge creation directed to
social change and improved effectiveness. Broadly speaking, PARE draws on a wide range of influences and key initiatives such as the Participatory Research Network (1979), which was created to foster an interdisciplinary development drawing its theoretical strength from adult education, sociology, political economy, community psychology, community development, feminist studies, critical psychology, organizational development, and the like. Today, the PARE movement has evolved strategies to democratize and disseminate knowledge—such as, in the context of translational healthcare, knowledge and dissemination of the best evidence base, BEB—thusly contributing to the development of better informed communities founded on sustainable livelihoods, education, public health, and productive civic engagement.

In brief, it is safe to say that the modern contemporary conceptualization of participatory action research and evaluation (PARE) reflects a fragile but growing intertwined unity between reality and perceptions based on ethnic, cultural, and popular traditions, as well as a range of ideologies and a variety of socio-politico-organizational contexts that together impact the well-being of individuals and of communities. PARE is still, relatively speaking, in its infancy, particularly in the context of translational healthcare. Nonetheless, PARE is recognized by most as the avenue of the future for the purpose of engaging stakeholders and increasing their health literacy with BEB, the product of comparative effectiveness research.

9.5.2 Sustainable Communities: Stakeholder Engagement

PARE proffers an important contribution to intervention and self-transformation within groups and communities, particularly, as noted, in the context of raising awareness, engagement, and health literacy among the stakeholders in translational healthcare. It contributes to increased factual knowledge, understanding, discernment, and informed problem-solving and participation in decision-making for treatment intervention. It favors, in other words, active involvement by patients, caregivers, and all stakeholders in patient-centered, effectiveness-focused, and evidence-based healthcare.

9.5.3 Ethical Recommendations

Norms of ethical conduct to guide the relationship between investigators and participants are the sine qua non of effective and efficacious PARE paradigms. Informed consent; stringent adherence to HIPAA regulations; full disclosure of potential physiological, psychological, and sociological outcomes of interventions; and unbiased consideration of benefits, risks, and costs are essential to ensure that evaluation protocols in translational healthcare, and in particular PARE, focus on patient welfare, privacy, confidentiality, equal treatment and equipoise, and appropriate inclusion free of conflicts of interest.

Furthermore, research and evaluation collaborators must protect themselves and each other against potential risks by mitigating the potential negative consequences of their collaborative work and pursuing the welfare of the patients first, and of all parties of stakeholders concerned. Commitment to ethics must not exclude concerns for social justice and welfare, such as the critical struggles of certain patient groups (e.g., the disabled) in existing social structures and their struggle against the policies and interests of individuals, groups, and institutions.

In conclusion, norms of ethical conduct in healthcare are not fixed and immutable. Rather, they must be revised and updated as society changes and evolves. The science of evaluation in general and PARE in particular play a central role in this process of modernization, we might say, of ethical norms for translational healthcare.

9.6 Self-Study: Practice Problems

1. What is evaluation in translational healthcare?
2. Which of the legs from the traditional three-legged stool of the research process is evaluation most like?
3. What is the purpose of formative evaluation? Summative evaluation?
4. True or False: Formative evaluation takes advantage of both quantitative and qualitative data to establish the effect of an outcome.
5. What type of study design and statistical test might be used in formative evaluation of a health intervention program?
6. Explain the relationship between qualitative and quantitative methods in evaluation.
7. Which type of evaluation method generates, at best, hypothetical statements as opposed to conclusive data?
8. Is it possible to quantify information obtained from qualitative evaluation? If so, how?
9. An investigator at your school's local research laboratory claims that she only works with quantitative data because it is better than qualitative data. Based on your knowledge, what would you tell her?
10. What is PARE and why is it important in translational healthcare?

Recommended Reading

Bloom BS, Hasting T, Madaus G. Handbook of formative and summative evaluation of student learning. New York: McGraw-Hill; 1971.
Bogdan R, Taylor S. Looking at the bright side: a positive approach to qualitative policy and evaluation research. Qual Sociol. 1997;13:193–2.
Chiappelli F. Fundamentals of evidence-based health care and translational science. Heidelberg: Springer; 2014.
Cochrane A. Effectiveness and efficiency. Random reflections on health service. London: Nuffield Provincial Hospital Trust; 1972.
Donner A. A Bayesian approach to the interpretation of sub-group results in clinical trials. J Chronic Dis. 1992;34:429–35.
Donner A, Birkett N, Buck C. Randomisation by cluster: sample size requirements and analysis. Am J Epidemiol. 1991;114:906–14.
Dowie J. "Evidence-based," "cost-effective" and "preference-driven" medicine: decision analysis based medical decision making is the pre-requisite. J Health Serv Res Policy. 1996;1:104–13.
Gaventa J, Tandon R. Globalizing citizens: new dynamics of inclusion and exclusion. London: Zed; 2010.
Gray JAM, Haynes RB, Sackett DL, Cook DJ, Guyatt GH. Transferring evidence from health care research into medical practice. 3. Developing evidence-based clinical policy. Evid Based Med. 1997;2:36–9.
Gubrium JF, Holstein JA. The new language of qualitative method. New York: Oxford University Press; 2000.
Ham C, Hunter DJ, Robinson R. Evidence-based policymaking—research must inform health policy as well as medical care. BMJ. 1995;310:71–2.
Liddle J, Williamson M, Irwig L. Method for evaluating research and guidelines evidence. Sydney: NSW Health Department; 1999.
Madaus GF, Stufflebeam DL, Kellaghan T. Evaluation models: viewpoints on educational and human services evaluation. 2nd ed. Hingham: Kluwer Academic; 2000.
McIntyre A. Participatory action research. Thousand Oaks: Sage; 2009.
Muir Gray JA. Evidence-based health care: how to make health policy and management decisions. London: Churchill Livingstone; 1997.
Patton MQ. Utilization-focused evaluation. 3rd ed. London: Sage; 1996.
Racino J. Policy, program evaluation and research in disability: community support for all. London: Haworth Press; 1999.
Royse D, Thyer BA, Padgett DK, Logan TK. Program evaluation: an introduction. 4th ed. Belmont: Brooks-Cole; 2006.
Scriven M. The methodology of evaluation. In: Stake RE, editor. Curriculum evaluation. Chicago: Rand McNally; 1967.
Stufflebeam DL. The CIPP model for program evaluation. In: Madaus GF, Scriven M, Stufflebeam DL, editors. Evaluation models: viewpoints on educational and human services evaluation. Boston: Kluwer Nijhof; 1993.
10 New Frontiers in Comparative Effectiveness Research

Contents
10.1 Core Concepts
10.2 Conceptual Background
10.2.1 Introduction
10.2.2 Comparative Effectiveness Research in the Next Decades
10.2.3 Implications and Relevance for Sustained Evolution of Translational Research and Translational Effectiveness
10.2.4 Self-Study: Practice Problems
Recommended Reading
10.1 Core Concepts

Nicole Balenton

This final chapter wraps up translational research and effectiveness altogether by covering the fundamental principles of effectiveness, patient-centeredness, and evidence-based care. We learned that translational research, the focus of the first half of the book, is the application of the scientific method to healthcare decision-making and practice. Through biostatistical applications, new information directly benefits the patient in return, hence the term "bench-to-bedside." Translational effectiveness, on the other hand, relies heavily on biostatistical principles and concepts, whereby clinical studies are translated into everyday clinical practices, hence "result translation."

Comparative effectiveness research is the systematic process by which quantitative and qualitative consensuses of the best available evidence are obtained through a critical summative evaluation process and interpretive synthesis. In comparative effectiveness research, a systematic review is a scientific report that describes the methodology employed for obtaining, quantifying, analyzing, and reporting the consensus of the best evidence base for a patient's clinical treatment. The principle of translational effectiveness is essential for the best evidence base for patient-centered treatment intervention. It requires a heavy focus on effectiveness, as well as a patient-centered approach to clinical decision and intervention.

In regards to biostatistical inference, there is a substantial focus on the dynamic challenge that is the Bayesian approach, or biostatistical updating, to continuously update previous inferences as new data and information are obtained. This chapter covers comparative effectiveness research in the next decade, where researchers and clinicians seek to provide the best possible intervention
to the patient. We look at the emerging inquisitive and inferential models, as well as the future of healthcare that is telehealth.

10.2 Conceptual Background

10.2.1 Introduction

In this book, we have endeavored to discuss certain of the fundamental concepts of biostatistics that appear most pertinent in our current times and that can be foreseen to be most relevant in the next decade. Biostatistics is the application of statistics to the wide range of topics in the psychobiological sciences in health and disease. Therefore, it encompasses research, clinical designs, and methodologies, in addition to the collection, organization, and analysis of data in psychobiology, as well as inferences about the implications of these findings for the health sciences in general and healthcare in particular.

Current trends in medicine, dentistry, nursing, and clinical psychology encourage new research in effectiveness-focused, patient-centered, and evidence-based clinical decision-making and practice. This perspective, which is barely a few decades old at best, still challenges the community of fundamental researchers and clinical providers to develop and validate new and improved tools for gathering, analyzing, and interpreting data aimed at improving patient care.

Therefore, this book examined the field of biostatistics from two primary viewpoints. Firstly, it was important to proffer a novel and clear discussion of the most common statistical concepts and tests that have been used in modern psychobiology research and treatment evaluation ever since the emergence of current frequentist biostatistics, as described by Pearson, Spearman, Fisher, Gossett, and several others. Secondly, it was timely and critical to contrast these views with Bayesian statistics, which is fast gaining greater acceptance than the frequentist models in today's psychobiological research and clinical domains. Moreover, it was unquestionably necessary to incorporate this discussion in the context of translational healthcare: the crossroad, as it were, of translational research, grounded in the molecular characterization of biological specimens, and of translational effectiveness, the concerted operationalization of effectiveness-focused, patient-centered, and evidence-based care.

Once all the best evidence is assessed, treatment is categorized as:

• Likely to be beneficial
• Likely to be harmful
• Evidence did not support either benefit or harm

A 2007 analysis of 1016 systematic reviews from all 50 Cochrane Collaboration review groups found that 44% of the reviews concluded that the intervention was likely to be beneficial, 7% concluded that the intervention was likely to be harmful, and 49% concluded that evidence did not support either benefit or harm. Ninety-six percent recommended further research. A 2001 review of 160 Cochrane systematic reviews (excluding complementary treatments) in the 1998 database revealed that, according to two standardized readers:

• 41.3% concluded positive or possibly positive effect
• 20% concluded evidence of no effect
• 8.1% concluded net harmful effects
• 21.3% of the reviews concluded insufficient evidence

A review of 145 alternative medicine Cochrane reviews using the 2004 database revealed that 38.4% concluded positive effect or possibly positive (12.4%) effect, 4.8% concluded no effect, 0.69% concluded harmful effect, and 56.6% concluded insufficient evidence.

10.2.1.1 Translational Effectiveness
It behooves us to focus and define a bit more clearly the breadth, constraints, limitations, and fallacies of translational effectiveness at this point, not because the science of translational research is fully circumscribed by our current knowledge but because translational effectiveness is relatively new—or at least newer than translational research, less clearly understood than translational research
to neophytes in health sciences research, and, by all accounts, the future of healthcare. In its broadest form, translational effectiveness is the application of the scientific method to healthcare decision-making and practice. Paraphrasing from the Agency for Healthcare Research and Quality (AHRQ), translational effectiveness entails the utilization and dissemination across all stakeholders of the best evidence base derived from comparative effectiveness research (CER) and systematic reviews for patient-centered care.

Translational effectiveness relies extensively on the biostatistical principles outlined in the chapters of this book. Translational effectiveness also empowers the development of novel and concerted biostatistical models, which borrow equally from the frequentist and the Bayesian viewpoints, to tackle new challenges in biostatistical inference. These emerging inquisitive and inferential models include, but are not limited to, second- and third-generation instruments to assess the quality of the evidence, individual patient research outcomes and analysis, individual patient data meta-analysis, stakeholder engagement quantification and analysis, and local, national, and international dissemination by such means as telehealth.

Whereas the term "translational effectiveness" was coined relatively recently, certain of its elements are rather well-rooted in the conceptualization of healthcare in the Western and the Eastern cultures. Its origin can be traced back to ancient dogmas of philosophy. The "art" of treating ailments and bringing the patient back to health, be it in the context of medicine, dentistry, or clinical psychology, is in effect the concerted approach to making critical informed decisions about individual patients (i.e., patient-centered), to ensure the best possible intervention (evidence-based), that will yield optimal benefit to the patient (effectiveness-focused).

Today, and in the decades ahead, translational effectiveness must continue to emphasize reexamining, revisiting, reviewing, and revising clinical practice guidelines by means of a systematic and peer-reviewed process to confirm the strength and validity, as well as the limitations and caveats, of new and established clinical methods, materials, and interventions. Translational effectiveness demands and in fact implies that healthcare education must undergo a stringent formative and summative evaluation process designed to anchor clinical decisions, guidelines, and policies to the fundamental principles of effectiveness, patient-centeredness, and evidence-based care.

10.2.1.2 Fundamentals of Comparative Effectiveness Research and Remaining Open Questions
To be clear, comparative effectiveness research is the systematic process by which a qualitative and a quantitative consensus of the best available evidence is obtained by a process of critical summative evaluation of the entire body of the pertinent available published research literature and a cogent interpretative synthesis of the findings thereof. It is obtained by means of a hypothesis-driven research design known as research synthesis, which yields the consensus of the best available evidence in response to a population, intervention, comparator, outcome, timeline, setting (PICOTS) question typically initiated at the initial patient-clinician encounter.

The systematic review, the scientific report of the process of comparative effectiveness research, describes the methodology employed for obtaining, quantifying, analyzing, and reporting the consensus of the best evidence base for the patient's clinical treatment. It is not unusual that the systematic review's arduous biostatistics-laden report needs to be translated into a language that is clinician-friendly: that is, to rewrite the core of the systematic review process and analytical inferences in a form that emphasizes the utility (cf., utility-based clinical decision models, such as the Markov decision tree) and logic of the derived consensus in the evidence-based clinical decision process (cf., logic-based models of clinical decision-making). These translations of systematic reviews into clinician-friendly summaries are often referred to as critical reviews, although they may be found under other rubrics. There is little consensus among specialists in the field about either how these translational summaries must be obtained—particularly with respect to the biostatistics reported in the original systematic reviews—or about how to name them, how to
report them in the scientific clinical literature, or how to disseminate them, for that matter, to all clinicians who might need the best evidence base for patient-centered treatment intervention.

That particular problem has a further dimension that is as important. Namely, the very principle of translational effectiveness, as noted above, requires not only the pursuit of the best evidence base with a focus on effectiveness; it also requires a patient-centered approach to clinical decision and intervention. Patient-centeredness implies patient participation in all phases of clinical decisions, and patient participation demands patient education. That very point opens a Pandora's box of several complex issues, not the least of which entails health literacy: how do we best assess health literacy across socio-economic, ethnocultural, and linguistic barriers, how do we raise the health literacy of our patients, how do we ensure that they retain the information provided—and if this information is about the best evidence base, as we presume here that it would largely be—and then how do we translate systematic reviews and critical reviews into lay language summaries while preserving the stringency of the statement and its scientific foundations. Last but certainly not least, we recognize that patients often involve caregivers, family members, religious advisors, friends, and other stakeholders, to various extents, as guides and sounding boards in their decision-making. Therefore, it is timely and critical to develop new and improved means of characterizing the nature and commitment of stakeholders, their level of engagement and persistence of engagement, as well as similar health literacy issues as those noted above. The "umbrella" problem may perhaps be stated as follows: how do we disseminate the consensus of the best evidence base for effectiveness among all interested parties to ensure patient-centered care.

10.2.2 Comparative Effectiveness Research in the Next Decades

10.2.2.1 Methodological Issues
There is a fundamental difference between "what" we do and "how" we do it. There is a fundamental difference between what car we drive—Bentley, Lexus vs. FIAT, Ford—and how well the car runs. A Ferrari whose engine is not running well will be a much worse means of transportation than even the oldest Chevrolet with a tuned-up engine. In other words, it is not so much the type of car that will reliably allow us a safe trip, as its mechanical quality. To exactly the same extent, it is not so much the type of research and clinical study that will contribute to the best evidence base, as it is the quality and stringency of the research methodology (viz., sampling protocol, validity, and reliability of measurements) and of the biostatistical analysis and inferences.

The type of research study refers to the research design: to say it in the jargon of comparative effectiveness research and systematic reviews, the level of the evidence—namely, clinical trials, cohort observational study, etc. The point here is that the level of the evidence—the type of design—is quite a different and distinct concept from the quality of the evidence, the stringency of the research methodology and data analysis. It is unfortunate that, in the past, the field has used the two conceptual frameworks interchangeably, using the words "quality" and "level" of the evidence to describe the same thing. In recent years, the distinction has been made increasingly, and concerted effort must be sustained to distinguish the level of evidence from the quality of the evidence in the next decades.

Levels of evidence are determined based on the research design: from meta-analyses and systematic reviews of triple-blind randomized clinical trials, with concealment of allocation and no attrition, at the top end, to observational design, bench and animal research, and published opinions and editorials. At each level, levels of evidence consider factors such as internal vs. external validity, statistical significance vs. clinical relevance, intention to treat (ITT) analyses if applicable, number needed to treat (NNT) and disease severity-corrected NNT values, and prevented (or preventable) fraction (PF), and related information that arises from the performance of the research design.

Case in point, in healthcare research, the goal is to be able to make some general conclusions applicable to a wider set of patients, and, in general, there is little interest for the particular group of patients under study.
The process that permits us to make such general statements, grounded on a systematic analysis of the information (which in research we call data), is the process of inference. In providing healthcare, the goal is to make specific conclusions based on a given patient under study by the process of clinical diagnosis.

There is a fundamental difference between statistical significance (see Chap. 5, Sect. 5.3 on significance), which can be said to be based on and derived from group data and which serves to draw conclusions about the population, and clinical significance, which, while it may be derived from observations obtained on a group of patients, seeks to draw conclusions beneficial for each individual patient. Whereas statistical significance rests on the notion of the sample size needed to attain statistical significance, clinical relevance rests upon the concept of the minimum number of patients needed to treat to obtain the desired outcome or to avoid the undesired side effect.

The concept of number needed to treat (NNT) is central to the process of comparative effectiveness research and translational effectiveness, simply because a critical determinant of the decision-making process for clinical intervention rests upon defining the minimal number of patients that must be treated to prevent—biostatistically speaking, of course—one additional bad outcome or to attain the benefit sought.

The computation of NNT serves as a guide in the clinician's decision-making process with respect to whether a given intervention ought to be applied, and of how few patients need to be treated to prevent (i.e., risks), or to obtain (i.e., benefits), a given event. Intuitively, one of the important uses of NNT is to provide a quantitative guide for the assessment of cost-benefit analysis.

Research data are often expressed as a ratio of the measured outcome, that is to say, the "event" divided by the "nonevent"; the nonevent corresponds to the absence of the event or to the event whose magnitude falls below the measurable capability of the instrument used, including background noise (i.e., random error). In the case of research synthesis, the presentation of the published data as the ratio of nonevents to events is the odds ratio (OR). For example, in the case of oral cancer and using smoking as the intervening factor, data may show that the event rate for oral carcinoma in smokers is 1%. Data may also show that its nonevent rate, that is, the rate for oral carcinoma for nonsmokers, is 99%: the odds of smoking as an intervening factor for the disease in question will be computed as the ratio 99:1.

Odds ratios are most common in primary clinical research, including observational designs (e.g., case-control and cohort studies). Data obtained from a variety of clinical investigations can be transformed into ORs to produce results in the form of expected event rates (i.e., patient expected event rate, PEER). When PEER is combined with the estimation of risk (OR), that is, the probability of a given situation (e.g., oral carcinoma) to occur (i.e., in smokers) or not to occur (i.e., in nonsmokers), then NNT is computed as:

NNT = [1 − (PEER × [1 − OR])] / [(1 − PEER) × PEER × (1 − OR)]

A treatment intervention produces a sizeable event in the experimental group (i.e., experimental event rate, EER), and a control event rate (CER) is obtained in the control arm of the study, where the placebo intervention was administered. The following two-by-two table can be constructed:

            Control    Experimental
Event         (A)          (B)
Nonevent      (C)          (D)

Control event rate (CER) = A / (A + C)
Experimental event rate (EER) = B / (B + D)
Relative risk reduction (RRR) = (CER − EER) / CER
Absolute risk reduction (ARR) = CER − EER
Number needed to treat (NNT)¹ = 1 / ARR

¹ NNT is always rounded up.
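The definitions above translate directly into a few lines of code. The sketch below, in Python, implements both the two-by-two table quantities (CER, EER, RRR, ARR, NNT) and the PEER- and OR-based NNT formula; the counts and the PEER and OR values are hypothetical examples, not data from any actual study.

import math

def rates_from_table(a, b, c, d):
    # a, b = events and c, d = nonevents in the control and
    # experimental arms of the two-by-two table above
    cer = a / (a + c)  # control event rate
    eer = b / (b + d)  # experimental event rate
    return cer, eer

def nnt_from_rates(cer, eer):
    rrr = (cer - eer) / cer   # relative risk reduction
    arr = cer - eer           # absolute risk reduction
    nnt = math.ceil(1 / arr)  # NNT is always rounded up
    return rrr, arr, nnt

def nnt_from_peer_or(peer, odds_ratio):
    # NNT from the patient expected event rate combined with an odds ratio
    numerator = 1 - (peer * (1 - odds_ratio))
    denominator = (1 - peer) * peer * (1 - odds_ratio)
    return numerator / denominator

# Hypothetical trial: 30/100 control events vs. 18/100 experimental events
cer, eer = rates_from_table(a=30, b=18, c=70, d=82)
rrr, arr, nnt = nnt_from_rates(cer, eer)
print(f"CER={cer:.2f}  EER={eer:.2f}  RRR={rrr:.2f}  ARR={arr:.2f}  NNT={nnt}")

# Hypothetical research synthesis input: PEER = 0.30, OR = 0.50
print(f"NNT from PEER and OR: {nnt_from_peer_or(0.30, 0.50):.1f}")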
The corresponding 95% confidence intervals are computed as:

CI95 NNT = 1 / CI95 ARR

CI95 ARR = ARR ± 1.96 × √[ (CER × [1 − CER]) / n_control + (EER × [1 − EER]) / n_experimental ]
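As a worked illustration of these interval formulas, the short sketch below (in Python) computes the 95% confidence interval of the ARR and derives the corresponding NNT bounds as its reciprocals. The rates and sample sizes are hypothetical; note that the reciprocal conversion is only meaningful when the ARR interval excludes zero, since otherwise the NNT interval is unbounded.

from math import sqrt, ceil

def ci95_arr(cer, eer, n_control, n_experimental):
    # 95% CI of the absolute risk reduction, per the formula above
    arr = cer - eer
    se = sqrt(cer * (1 - cer) / n_control
              + eer * (1 - eer) / n_experimental)
    return arr - 1.96 * se, arr + 1.96 * se

lo, hi = ci95_arr(cer=0.30, eer=0.18, n_control=100, n_experimental=100)
print(f"ARR 95% CI: ({lo:.3f}, {hi:.3f})")

# NNT bounds are the (rounded-up) reciprocals of the ARR bounds
if lo > 0:
    print(f"NNT 95% CI: ({ceil(1 / hi)}, {ceil(1 / lo)})")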
Since NNT can be derived as the inverse of the absolute risk reduction (ARR), it can as well, by converse reasoning, be derived from the inverse of the absolute benefit increase (ABI). In this case, clinical evidence is rated and ranked according to the relative proximal relevance of the information provided to the patients' needs, risks, or benefits. Of course, the value of NNT, which is not a statistic of the patient sample but simply a quantitative estimate derived from data transformation, is a function of several variables, among which is the severity of disease. That is to say, with a separate sample of patients, who might be, say, twice as severely diseased, the NNT for risk, as well as for benefit, ought to be corrected by a factor of two.

It should be clear that the concept of NNT is as weak and clumsy as its interpretation. It serves clinical relevance to a limited extent by providing a rough approximation of the number of patients one needs to treat, theoretically, before observing unwanted harm or sought-for benefits. It is, really, a biostatistically meaningless number, which carries no stringency whatsoever for obtaining statistical significance. It is a mere attempt, and a weak attempt at that, to stretch quantifiable measurements reported in the literature to provide information that could serve the immediate needs of patients and clinical decision-making.

From this viewpoint, the highest ranked evidence for therapeutic interventions emerges from systematic reviews with quantitative (i.e., meta-analysis) or qualitative consensus or both. Randomized, triple-blind, placebo-controlled trials with allocation concealment and complete follow-up involving a homogeneous patient population and medical condition are thought to yield excellent evidence, which ranks immediately second after the systematic reviews. The evidence from cohort studies ranks third, that obtained from cross-sectional studies ranks fourth, and so on.

As useful as this ranking system might appear, it is fraught with misconceptions and fallacies. To cite only two: firstly, as discussed in previous chapters (see Chap. 2 on study design), the diversity of the various types of systematic reviews, clinical trials, cohort observational studies, and other designs renders this classification misleading at best and useless at worst, when one considers that each design subtype was crafted purposefully to satisfy certain practical conditions of sampling, measurement, follow-up, and the like, which together produce a certain type of evidence. Concerted efforts have been addressed to this very point, as is discussed below. Secondly, this ranking system simply lists different research designs but provides no information about how correctly these designs were conducted: case in point, a systematic review that is not conducted as per accepted standards of research methodology will yield evidence that, while ranked at the top of the level of evidence, is useless in the context of translational effectiveness—if not frankly dangerous in terms of patient safety—and must not be used.

The US Preventive Services Task Force (USPSTF) has proposed the following categorization of the level of the evidence:

• Level I: Evidence obtained from at least one properly designed randomized controlled trial.
• Level II-1: Evidence obtained from well-designed controlled trials without randomization.
• Level II-2: Evidence obtained from well-designed cohort or case-control analytic studies, preferably from more than one center or research group.
• Level II-3: Evidence obtained from multiple time series designs with or without the intervention. Dramatic results in uncontrolled trials might also be regarded as this type of evidence.
10.2  Conceptual Background 173

vention. Dramatic results in uncontrolled tri- findings. A grading system has been developed for
als might also be regarded as this type of that purpose and is widely used in the field.
evidence. However, research methodologists consider it fal-
• Level III: Opinions of respected authorities, lacious because it has not been validated psycho-
based on clinical experience, descriptive stud- metrically for validity and reliability. The GRADE
ies, or reports of expert committees. evaluation system produces a numerical value,
which purports to quantify the confidence in the
In addition, the same US Preventive Services observed effect as being close to what the true
Task Force qualifies the level of evidence as: effect is but that is completely and absolutely
devoid of statistical grounds and foundation. The
• Level A: Good scientific evidence suggests confidence value generated by GRADE is purely
that the benefits of the clinical service sub- judgmental and therefore biased and is not derived
stantially outweigh the potential risks. from the traditional statistically based computa-
Clinicians should discuss the service with eli- tion of the confidence interval. Moreover, the
gible patients. GRADE working group defines “quality of evi-
• Level B: At least fair scientific evidence sug- dence” (read: level of evidence) and “strength of
gests that the benefits of the clinical service recommendations” (read: confidence in the clini-
outweigh the potential risks. Clinicians should cal outcomes) as two interdependent yet distinct
discuss the service with eligible patients. concepts; but in actuality, these two concepts are
• Level C: At least fair scientific evidence sug- commonly—and erroneously—used interchange-
gests that there are benefits provided by the ably and confused with each other.
clinical service, but the balance between ben- The GRADE goes a step further and proposes
efits and risks is too close for making general the following inference:
recommendations. Clinicians need not offer it
unless there are individual considerations. • High-quality evidence: The authors are very
• Level D: At least fair scientific evidence sug- confident that the estimate that is presented
gests that the risks of the clinical service out- lies very close to the true value. One could
weigh potential benefits. Clinicians should not interpret it as: there is very low probability of
routinely offer the service to asymptomatic further research completely changing the pre-
patients. sented conclusions.
• Level F: Scientific evidence is lacking, of poor • Moderate-quality evidence: The authors are
quality, or conflicting, such that the risk versus confident that the presented estimate lies close
benefit balance cannot be assessed. Clinicians to the true value, but it is also possible that it
should help patients understand the uncer- may be substantially different. One could also
tainty surrounding the clinical service. interpret it as: further research may com-
pletely change the conclusions.
A system was developed by the GRADE (short • Low-quality evidence: The authors are not con-
for the Grading of Recommendations Assessment, fident in the effect estimate and the true value
Development, and Evaluation) working group to may be substantially different. One could inter-
take into account more dimensions than just the pret it as: further research is likely to change
quality of medical research. It requires users of the presented conclusions completely.
GRADE to use these criteria to develop a tool to • Very low-quality evidence: The authors do not
assess the quality (read: level) of evidence. The have any confidence in the estimate, and it is
GRADE checklist evaluates the impact of certain likely that the true value is substantially differ-
factors, which research methodologists would call ent from it. One could interpret it as: new
intervening or confounding variables, on the confi- research will most probably change the pre-
dence in the results—that is, the stringency of the sented conclusions completely.
The Appraisal of Guidelines for Research and Evaluation Enterprise (AGREE) working group has also produced an instrument, which is designed to evaluate the process of practice guideline development and the quality (read: level) of reporting. The original AGREE instrument has recently been updated and methodologically refined, but the principal deficiencies and fallacies noted for the GRADE above remain in the AGREE-II assessment tool of practice guidelines. Some efforts have been made to validate this instrument psychometrically, such that claims are common that "the AGREE-II is both valid and reliable," but exception can be taken with that assertion from a research methodology standpoint. Nonetheless, the AGREE-II is a considerable improvement over the GRADE checklist, if anything for its greater breadth and depth. AGREE-II consists of 23 items organized into six domains of evidence quality (read: evidence level).

The quality of the evidence is often evaluated as the risk of bias, which both the Cochrane group and AHRQ independently conceptualized. The best evidence base must be derived from studies with low risk of bias. The proposition has been brought forward that clinical trials always have, by definition, lower risk of bias than observational studies, although this thesis has been proven by research methodologists to be a fallacy. The risk of bias assessment tool consists of:

• Risk of bias: a judgment made based on the chance that bias in the included studies has influenced the estimate of effect.
• Imprecision: a judgment made based on the chance that the observed estimate of effect could change completely.
• Indirectness: a judgment made based on the differences in characteristics of how the study was conducted and how the results are actually going to be applied.
• Inconsistency: a judgment made based on the variability of results across the included studies.
• Publication bias: a judgment made based on the question whether all the research evidence has been taken into account.

In addition, three domains modulate these assessments:

• Large effect: this is when methodologically strong studies show that the observed effect is so large that the probability of it changing completely is less likely.
• Plausible confounding would change the effect: this is when, despite the presence of a possible confounding factor that is expected to reduce the observed effect, the effect estimate still shows a significant effect.
• Dose response gradient: this is when the intervention used becomes more effective with increasing dose. This suggests that a further increase will likely bring about more effect.

To be sure, the field is endowed with many more examples of instruments designed to grade the level of the evidence (e.g., AGREE) and the quality of the evidence (e.g., AMSTAR, QUOROM, PRISMA). However, generally speaking, most if not all of these were originally conceptualized as checklists, which limits their psychometric validation. Concerted efforts have been deployed to restructure some of these instruments such as to generate a rating scale, rather than a yes/no answer. These revisions and expansions enrich the original instruments by:

1. Generating a total final score of evidence quality. Based on this score, acceptable sampling statistical reasoning can be applied such that only the highest scoring literature is included in the process of generating the consensus of the best evidence base.
2. The semi-continuous scores thus obtained permit psychometric analysis of test reliability (i.e., test-retest, inter-rater, internal consistency, coefficient of agreement) and validity (i.e., criterion, content, construct).

In this very fashion, the stringency of the assessment of the quality of the evidence is improved when using the expanded version of GRADE (Ex-GRADE), the revised version of AMSTAR (rAMSTAR), or the existing version of the risk of bias instrument. Consequently, concerted effort in the field is directed at significantly improving the reliability and the validity of the assessment of the quality of evidence simply by revising and expanding existing instruments or by developing tools for that purpose anew (Wong).

Another aspect of the methodology of systematic reviews that deserves consideration for improvement, to increase the stringency of research synthesis, is the process of sampling. Sampling is a fundamental consideration in biostatistics, as we have noted in a preceding chapter (see Chap. 3, Sect. 3.2.1 on sampling methods). In brief, we stated that sampling can be defined as a sequential collection of random variables, both independent and identically distributed, or at least having potentially the same probability distribution as the others, and all mutually independent. To test how realistic these assumptions of random sampling actually are on a given data set, autocorrelation statistics, which detect the presence of periodic non-randomness, can be computed, lag plots can be drawn, or a turning point test can be performed. The generalized assumption of exchangeable randomness is, however, as we emphasized above, most often sufficient and more easily met.
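By way of illustration, the sketch below, a minimal plain-Python example on hypothetical data rather than a validated procedure, computes two of the randomness checks just mentioned: a lag-1 autocorrelation coefficient and the turning point test statistic, which under random sampling is approximately standard normal.

```python
import math

def lag1_autocorr(x):
    """Lag-1 autocorrelation; values near 0 are consistent with randomness."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[i] - m) * (x[i + 1] - m) for i in range(n - 1))
    den = sum((xi - m) ** 2 for xi in x)
    return num / den

def turning_point_z(x):
    """Count local peaks/troughs; under randomness, E[T] = 2(n-2)/3 and
    Var[T] = (16n-29)/90, so the z statistic is roughly N(0, 1)."""
    n = len(x)
    t = sum(1 for i in range(1, n - 1)
            if (x[i - 1] < x[i] > x[i + 1]) or (x[i - 1] > x[i] < x[i + 1]))
    return (t - 2 * (n - 2) / 3) / math.sqrt((16 * n - 29) / 90)

data = [2.1, 3.4, 2.8, 3.9, 3.1, 4.2, 3.7, 4.8, 4.1, 5.0]   # hypothetical sample
print(lag1_autocorr(data), turning_point_z(data))
```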
The same consideration about random sampling, which applies to experimental design, is also pertinent to the research synthesis design that is used in comparative effectiveness research—that is to say, to the process by which the available literature pertinent to the PICOTS question is identified and accessed. To identify gaps in knowledge, the PICOTS is refined by means of an analytical framework to generate specific key questions that address certain intervening/confounding variables. Knowledge gaps are commonly derived from GRADE or related assessments, based on the criteria of:

(a) Insufficient or imprecise information
(b) Biased information
(c) Inconsistency or unknown consistency
(d) Not the right information (wrong population or wrong outcome)

Consensus is then sought to prioritize the knowledge gaps by ranking and by Likert scale. If the number of identified knowledge gaps is large, then multiple rounds of prioritization (i.e., >2) and ranking will be run, to ensure replicable cross-validation. Additional domains ought to include plausible confounding that decreases the observed effect and large magnitudes of effect. The transparency of data sharing is necessary to ensure that the product of the systematic reviews proposed here is useful to a broad range of potential audiences. Deficiencies in the strength of the evidence grade can of course impact both the systematic reviews sub-aim and the gaps in knowledge sub-aim.

Therefore, the purpose of the analytical framework is to crystallize criteria of effectiveness as sharp as practically feasible based on the PICOTS and to prioritize the research gaps thus identified. At this stage, more often than not, the engagement, participation, and involvement of the stakeholders in formulating PICOTS, finalizing the analytical framework, and stating the relevant key questions can be assessed by the psychometrically validated participatory evaluation measurement instrument (PEMI) or other stakeholder engagement scales.

The sample of primary research is obtained from MEDLINE, PsycINFO, EMBASE, PsycARTICLES, Scopus, CINAHL, AMED, or another database. The sample of existing systematic reviews usually comes from MEDLINE, the Cochrane Library, Bandolier, or any other library of systematic reviews and meta-analyses.

The MEDLINE search strategy is developed and validated using PubMed medical subject headings (MeSH) and keywords taken from the PICOTS statement and related key questions. The strategy is then replicated with the other electronic databases. Translators are used as needed, unless the search is limited to the English language only. The clinicaltrials.gov registration database is routinely reviewed to identify trials completed 3 or more years earlier that prespecified our outcomes of interest but did not publish all of the outcomes. The original authors can be contacted as needed.
To ensure the inclusion of individual reports, two trained investigators independently screen the titles and abstracts of the list of references for pertinence to the stated PICOTS statement and identified key questions. A second round of review by two additional independent reviewers examines the full-text articles. Differences regarding article inclusion are resolved through consensus. Systematic review software (DistillerSR, 2010; Evidence Partners) can serve to manage the screening process and the extraction of information on measures of intervention fidelity. Funnel plots serve to estimate publication bias.

As stringent and rigorous as the sampling process is—requiring the searching of multiple appropriate databases and the elimination of duplicates and of reports that only approximate the PICOTS question—the bibliome it generates (that is, the collection of published papers that most closely adheres to the stated PICOTS statement and identified key questions) can still suffer from selection effects and accentuate publication bias, including:

• Language bias
• Study design bias
• Time of publication bias
• Investigator bias (e.g., the same group of investigators publishing multiple reports pertinent to PICOTS and thus being included in the bibliome)

More often than not, the funnel plot analysis proves too soft to alert the investigators adequately of emerging publication bias. Concerted methodological efforts must be deployed in the coming decade to characterize new and improved means of obtaining the bibliome and of ensuring that it is free of bias. Novel biostatistical approaches must be developed to test for publication bias in a manner that is both more reliable and more stringent than the present funnel plot analysis.
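For illustration, a minimal funnel plot can be sketched as below (assuming matplotlib is available; the study effects and standard errors are hypothetical). Marked asymmetry of the scatter around the pooled center is the visual cue, admittedly a soft one as noted above, for possible publication bias.

```python
import matplotlib.pyplot as plt

effects = [0.42, 0.35, 0.55, 0.30, 0.61, 0.48, 0.71]   # hypothetical per-study effects
ses     = [0.05, 0.08, 0.12, 0.15, 0.18, 0.22, 0.30]   # their standard errors

plt.scatter(effects, ses)
plt.gca().invert_yaxis()                                  # most precise studies on top
plt.axvline(sum(effects) / len(effects), linestyle="--")  # rough pooled center
plt.xlabel("Effect estimate")
plt.ylabel("Standard error")
plt.title("Funnel plot: asymmetry suggests possible publication bias")
plt.show()
```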
10.2.2.2 New Frontiers in Dissemination

Patient-centered care also implies that novel telehealth information and communication technologies must be developed and standardized across healthcare specialties to distribute the best evidence base to clinicians, patients, caregivers, and other stakeholders in real time.

The healthcare provider (i.e., dentist, physician, nurse) can perform telecare in the form of teleconsultation and telediagnosis by using these electronic applications. That is to say, telecare enables patients who live in rural areas or far away from healthcare services to receive the best available treatment and care in a cost-effective modality. It ensures that healthcare providers connect with and treat patients in need with the proper communication technology in place.

With improving technology, telecare has much potential as a healthcare service, particularly in situations such as complex dental interventions (e.g., 1-day crowns, immediate loading or delayed loading dental implants, mini-implants, inlays and onlays, etc.). By substantially reducing the cost of healthcare delivery and increasing instant access to providers without the need to travel, telecare technologies improve the quality of dental care, and of healthcare in general, given to patients in inaccessible communities, and raise patient and healthcare provider satisfaction.

Implementation of telecare communication technologies for mentally handicapped patients, elderly and disabled patients, and other special populations is particularly important, because telecare can be optimized by using an electronic application across the five domains listed above. The same benefits of telecare can also be obtained with other difficult patient groups, such as patients with high levels of dental anxiety and dental phobia, and homeless and destitute patients who live in poverty-stricken environments and have access—at best—to dilapidated healthcare structures with intrinsic limits of patient access to clinical services. In these extreme situations, telecare can vastly improve the well-being of dental patients in need of simple restorative dentistry or of more complex and involved endodontic, periodontal, or prosthodontic treatment intervention.

Patients who are afflicted with serious infectious diseases, such as HIV/AIDS, Ebola, Zika, and other communicable diseases, are oftentimes quarantined due to the infectious nature of the disease. These patients can be diagnosed and treated for dental problems and oral pathologies by means of telecare, even when dentists, physicians, and nurses are ordered to stay a safe distance away from those infected: providers can prevent transmission of the virus while providing diagnoses and treatment assistance via electronic devices. That is in part the reason why teleconsultation, a low-cost and low-bandwidth exchange of information between health specialists and patients when specialists are not available, is among the most common types of telehealth service in developing countries.

Telehealth has shown great promise across a variety of health problems, and telecare is increasingly benefiting dental patients as well. But this will be obtained only if concerted research is sustained in this field, which must include the development of faster and more user-friendly technologies. Improved telecare technologies require, particularly in the field of dentistry, seamless interconnectedness among clinical professionals and direct access to patients in critical need.

In dentistry, and in other domains of healthcare, the need for cutting-edge, reliable, fast, and hack-free telecare is unquestionable. When implemented effectively, telecare will greatly increase the treatment and care available to dental patients.

One aspect of telecare that is fast emerging with increasing relevance to situations of complex dental interventions, or to some of the more difficult patient populations briefly outlined above, is that it must ensure individualized, patient-centered care. Consequently, one important development in translational effectiveness that must go forth hand in hand with new developments in telecare is the validation of new research tools and protocols to analyze and interpret individual patient data.

The term individual patient data refers to the availability of raw data for each study participant in each included trial, as opposed to aggregate data (summary data for the comparison groups in each study). Reviews using individual patient data require the collaboration of the investigators who conducted the original trials, who must provide the necessary data. From a methodological standpoint, the domain of individual patient data gathering, analysis, and inference needs to specify the specifics of the individual patient data outcomes under study—viz., individual patient data outcomes research. This requires a cogent characterization of the variables to measure, the analyses to plan, and the type of data (i.e., qualitative vs. quantitative; categorical vs. continuous) to gather. Thence will derive the type of analyses—usually longitudinal repeated measures type analyses—that will be most appropriate and informative.

In brief, three principal programs of telecare ought to be developed in the decade to come:

• Electronic Data Methods Forum: a program that is presently in its second phase. It has established preliminary interconnections and communications to a variety of electronic data infrastructures and is now in the process of expanding the breadth and depth of these interactions. To achieve its goal, the Electronic Data Methods Forum conducts comparative effectiveness research on a wide spectrum of patient-centered research outcomes, including quality of life assessment and targeted improvement, and fosters the new and improved utilization of a wide spectrum of health information technologies to support routine clinical care.
• Bringing evidence to stakeholders for translation to primary care: a concerted effort, initiated and supported by AHRQ-generated intramural and extramural funding programs, to ensure and expand the dissemination of programs and of the best evidence base, evidence-based revisions of clinical practice guidelines, and reports and information about professional and patient–stakeholder networking, to patients and providers in primary care settings in the United States and worldwide.
• Disseminating Patient-Centered Outcomes Research to Improve Healthcare Delivery Systems: a concerted effort to utilize existing networks of providers and other key stakeholders to disseminate, translate, and implement delivery system evidence.
10.2.2.3 Translational Healthcare Challenges: Toward CIPER

In the context of translational healthcare, patient-centered, effectiveness-focused, and evidence-based paradigms of clinical research and care must strive to incorporate the best evidence base as the basis of effectiveness-focused intervention that arises from a process of comparative individual patient effectiveness research (CIPER).

In brief, CIPER is a novel and fast emerging sub-domain of the field of translational effectiveness, which integrates the comparative effectiveness research paradigm with the construct of individual patient data analysis and inference. The purpose of CIPER is to compare the effectiveness outcomes obtained from independent individual patient data analyses and inferences. As discussed previously in Chap. 8, CIPER helps to determine which analyses are most appropriate and informative through the characterization of the measured variables, the analyses to plan, and the type of data to gather (see Chap. 8, Sect. 8.5.3 for a review of comparative individual patient effectiveness research).

Practically speaking, individual patient data can rarely be analyzed directly in RevMan, the Review Manager (RevMan) software used for preparing and maintaining meta-analyses in Cochrane reviews (current version: 5.2.5; ims.cochrane.org/revman/download). The data need to be analyzed first outside of this software, and summary statistics for each study may then be entered into RevMan. The SAS package "SCHARP" or the MedCalc statistical software for biomedical research can perform the analysis in each study—not yet fixed or random model meta-analyses—by pooling results and tabulating time-to-event individual patient data.
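The pooling of per-study summary statistics can be illustrated generically. The sketch below is not RevMan, SCHARP, or MedCalc code; it is a minimal fixed-effect, inverse-variance pooling of hypothetical per-study effect estimates of the kind that, once computed outside RevMan, could be entered into it.

```python
import math

effects = [0.25, 0.40, 0.10, 0.33]   # hypothetical per-study effect estimates
ses     = [0.10, 0.15, 0.12, 0.20]   # their standard errors

weights = [1 / se ** 2 for se in ses]   # inverse-variance weights
pooled  = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
se_pool = math.sqrt(1 / sum(weights))
ci95    = (pooled - 1.96 * se_pool, pooled + 1.96 * se_pool)
print(pooled, ci95)
```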
As CIPER continues to evolve, driven by the needs of clinical situations such as telecare in dentistry, biostatistics adequate and pertinent to the urgent needs of this domain of translational effectiveness will also surely evolve in parallel.

10.2.2.4 Current Challenges of Infectious Diseases to Healthcare

The acronym CERID stands for Comparative Effectiveness Research for Infectious Diseases. In brief, it consists of integrating the current conceptualization and practical protocols of comparative effectiveness research in the context of the timely and critical set of concerns brought about by the sharp increase of infectious diseases today—a rise in prevalence that most expect to be sustained at extraordinarily high levels in the next decades. Challenges to the innate and antigen-dependent immune surveillance mechanisms that are consequential to infectious agents, and that threaten human beings throughout the lifespan, among all genders, and across socioeconomic and educational levels, include, and are not limited to: viral outbreaks, such as Ebola, Zika, HIV, and the like; bacterial and parasitic outbreaks, such as the plague, cholera, malaria, and the like; and, perhaps most dramatic of all, a growing, widespread, and generalized state of antimicrobial resistance, which manifests as the ability of a microbe to resist the effects of antibiotics and other medications previously used to treat it.

In brief, of all forms of infectious diseases that threaten each individual's health locally and globally, antimicrobial resistance is feared by many as the potentially most dangerous condition for quality of life, and for life itself, across nations.

10.2.2.5 The Urgency of CERID Protocols

In the current decades, translational healthcare must weigh epidemiological trends in order not only to follow prevalence and relative prevalence trends of diseases but—and as importantly, if not more importantly—to predict prognostic inference trends. Case in point, recent reports have indicated new and alarming outbreaks of certain viral diseases and other contagious illnesses, from cholera outbreaks to Zika and Ebola infection, as well as exacerbations of infectious diseases thought, until recently, to be kept under control, such as tuberculosis, measles, malaria, and others.
Indeed, if one thing emerges as certain, it is that healthcare will be under siege by a vast spectrum of infectious diseases in the decades to come. This alarming situation is by no means blunted by the recent exacerbation of climate change, which brings along more cataclysmic hurricanes and flooding. Standing waters slowly receding in warm tropical climates—such as Texas, Florida, South Asia, the Caribbean, and Central America, to cite only a few of the more recent flooding events—are breeding grounds for waterborne parasitic and infectious diseases, for waterborne mosquitoes that are carriers of viral infections, and for a vast array of non-hygienic conditions that impose a serious load on the immune system even of healthy young individuals, thus undermining their health.

Taking together current epidemiological trends with the new frontier of translational healthcare that we have outlined in the preceding chapters and in our preceding work, it becomes self-evident that the new serious threats to population health brought about by the emergence of new infectious threats and the re-emergence of older ones call for a worldwide concerted endeavor of comparative effectiveness research for infectious diseases (CERID) to establish and disseminate evidence-based, best clinical practices for this specific type of health threat in the next decades.

One primary concern of CERID must also include the alarming trend of antimicrobial resistance, that is, the progressively weaker ability of commonly available antibiotics to counter infectious diseases. Antimicrobial resistance is on the rise, with millions of deaths every year. The World Health Organization (WHO) reported in 2014 that this serious threat is no longer a prediction for the future; it is happening right now in every region of the world and has the potential to affect anyone, of any age, in any country. Antibiotic resistance—when bacteria change so that antibiotics no longer work in the people who need them to treat infections—is now a major threat to public health.

In other words, the world population is seriously at risk both of a sharp increase in the causative agents of infectious diseases and of a progressive dulling of the efficacy of the pharmaceutical interventions at our disposal to blunt the growth and proliferation of said agents. The purpose and call of CERID is to develop new and improved effectiveness-focused, patient-centered, and evidence-based countermeasures targeted against infectious diseases along these two converging fronts.

10.2.2.6 Creating and Disseminating New Knowledge in CERID

To be clear, incontrovertible evidence points to human activity as one major cause of the progressive warming of the planet's temperature, greenhouse gases, pollution, and other contributors to climate change. Together, these factors contribute to warming ocean waters, which then feed into larger, more menacing, forcefully destructive, and more frequent hurricanes and typhoons. This knowledge is now widespread, and only a handful of deplorably denying politicians do not accept this cumulative evidence and obstruct local, national, and international action to counter and to reverse these ecological trends. This is the realm of politics and social history.

Nonetheless, throughout history, politics and social history have played a timely and critical role in population health and epidemiology: from the scourges of antiquity to the Black Death that spread through Europe consequential, some say, to the Crusades and other internecine wars within Europe in the Middle Ages (e.g., the sanguine conflicts between the Guelfs and the Ghibellines), to the Spanish flu following WWI, to the testing of penicillin on diseased soldiers following WWII in the first clinical trial of its kind, and so on. The world population finds itself at a different juncture presently: one in which political systems across the planet must work jointly and constructively to block and reverse the fast-rising temperature of the planet, lest storms increase in frequency and strength, bringing with them disastrous floods and life-threatening waterborne infectious diseases.

As if this were not a sufficiently ominous threat, antibiotic resistance is a growing problem among humans, domesticated animals, and
wildlife alike in terrestrial, aerial, or aquatic environments. This is due, in part at least, to the fact that farm animals, which constitute a large proportion of the human diet, are themselves fed antibiotics to ensure their health status, continued growth, and maximal weight until slaughter. These antibiotics, and their by-products, contaminate the meat products that enter the food chain, which we feed our developing children and youngsters. It is not surprising that they develop resistance to the antibiotics and antibiotic by-products found in animal meats. Similar health-endangering situations can be traced to the residues of fungicides and insecticides still found in vegetables and fruits even following exhaustive washing of the crops. When ingested, these by-products of fungicides and insecticides can contribute to an override of cellular immune surveillance events, which together signify increased vulnerability to microbial assault. Last, but not least, is the pollution of the water we drink—pollution by the heavy metal products of refining industries, pollution by fungicide and insecticide washes, etc.—which progressively contributes to organ weakness and eventually failure (e.g., kidney, liver) and to altered physiological homeostasis.

Taken together, the knowledge and the evidence about the potential causes of our decreased ability to combat infectious diseases are widely known and accessible to all, particularly those living at or close to "hot spots," such as urban centers. The spread and contamination of the environment constitute a growing and serious public health problem, which physiologists might describe as a type II allostatic load² on the immune system, with its consequential irreparable fall to a state of immune compromise and immune deficiency.

² Chiappelli F, Cajulis OS. Psychobiologic views on stress-related oral ulcers. Quintessence Int. 2004;35:223–7. PMID: 15119681.

There have been increasing public calls for global collective action to address the threat, including a proposal for an international treaty on antimicrobial resistance. Further detail and attention are still needed to recognize and measure trends in resistance at the national and the international level. Global tracking of infectious diseases may be a worthwhile endeavor, though expensive and complex to develop, validate, and implement. A pluripotent national, politics-free system of this nature could be designed in increasing stages of complexity, starting from the system that is operative presently, which provides real-time news information and images about weather patterns and storm destruction cataclysms worldwide via satellites equipped with the appropriate software.

Based on that model, we might now conceptualize a second-generation satellite software that will integrate population health data and evidence-based healthcare information into a worldwide health information technology network—a global telecare system, as it were. Consonant with the issues discussed in the previous paragraphs, we argue that concerted effort should focus initially on the establishment of a CERID/CIPER-focused dimension of global telecare, that is to say, a focus on comparative individual patient effectiveness research targeted on infectious diseases.

10.2.3 Implications and Relevance for Sustained Evolution of Translational Research and Translational Effectiveness

10.2.3.1 Toward Bayesian Biostatistics in Translational Research

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available and is added onto the prior. Bayesian inference is an important technique in biostatistics, which will grow enormously in relevance in the next decades as the lines of research we have outlined in this chapter continue to expand.

The Bayesian approach to biostatistical inference is sometimes called biostatistical updating, because the inference is updated every time new information is added to the previously gathered data, that is, the prior.
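One minimal way to picture this updating, as a sketch of our own on hypothetical numbers rather than a prescribed protocol, is conjugate Beta-Binomial updating of an event rate, in which yesterday's posterior becomes today's prior:

```python
# Beta(1, 1) is a uniform prior on the event rate
alpha, beta = 1.0, 1.0

batches = [(7, 13), (12, 18), (9, 11)]   # hypothetical (events, nonevents) batches
for events, nonevents in batches:
    alpha += events        # posterior parameters after each batch become
    beta += nonevents      # the prior parameters for the next batch
    print(f"posterior mean = {alpha / (alpha + beta):.3f}")
```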
It is also referred to as Bayesian probability, as an alternative to frequentist probability-based inferences. The aim of the Bayes approach is not to estimate the proximity of sample observations to the population; rather, it is grounded in the principle that we do not, and cannot, know the population, and hence any attempt at estimating the probability that a sample belongs, or not, to the population is futile. Rather, Bayesian inference considers that all sample observations are valid information about the population—even if unexpected, and thus potentially considered erroneous. No observation is erroneous from the Bayesian perspective, and all observations act, as it were, as independent pieces of a puzzle. As new observations are obtained and added to the prior, in a manner similar to adding a new piece to the puzzle, a composite of the population emerges exactly as the composite image of the puzzle emerges.

To the same extent as there is a relative entropy—disorder—among the pieces of the puzzle we have mixed before starting to compose it, so it is for the possible data and observations we may collect and add unto the prior in the pursuit of defining the population. In our puzzle example, we might say that the pieces are well mixed when there is a considerable degree of disorder among them. Scientists might call this disorder entropy. In the case of Bayesian inference about the population, we will call the elements that constitute our observations and our data the Bayesian factors. Bayesian factors are in a state of disorder, like the mixed pieces of the puzzle, and we can call this disorder entropy. In Bayesian statistics jargon, we call this state of entropy the Kullback–Leibler divergence.

In other words, building upon the priors with current observations is a process that depends in large part upon the relative entropy of the Bayesian factors, that is, their Kullback–Leibler divergence. The probability of a Bayesian inference depends upon the relative size of the Kullback–Leibler divergence of its respective factors.

In general, a Kullback–Leibler divergence of 0 indicates that we can expect similar, if not the same, behavior among the distributions of different Bayesian factors, whereas progressively larger divergence values indicate that the distributions behave in increasingly different manners. In the Bayesian context, the Kullback–Leibler divergence analysis pertains to the behaviors of the distributions of the prior distribution and the observed posterior.
tute our observations and our data the Bayesian Analysis in Systematic
factors. Bayesian factors are in a state of disor- Reviews: Toward Individual
ders, like the mixed pieces of the puzzle, and we Patient Data Meta-Analysis
can call this disorder entropy. In Bayesian statis- Meta-analysis is the core of quantitative biosta-
tics jargon, we call this state of entropy the tistical consensus of the best available evidence
Kullback–Leibler divergence. in comparative effectiveness research. Broadly
In other words, building upon the priors with speaking, it involves pooling quantitative evi-
current observations is a process that depends in dence from related homogeneous studies—as
large part upon the relative entropy of the Bayesian determined by the funnel plot and the Cochran Q
factors, that is, their Kullback–Leibler divergence. and/or I2 statistics—to estimate the effect of an
The probability of a Bayesian inference depends intervention, along with a confidence interval.
upon the relative size of the Kullback–Leibler Traditionally, meta-analyses synthesize group
divergence of its relative factors. data information obtained from multiple studies.
In general, a Kullback–Leibler divergence of 0 By contrast, individual participant-level data
indicates that we can expect similar, if not the meta-analysis (IPD MA) utilizes the prespecified
same, behavior among the distributions of different variables for each individual participant from mul-
Bayesian factors. But, a Kullback–Leibler diver- tiple applicable studies and synthesizes those data
gence of 1 indicates that the distributions behave in across all studies to assess the impact of a clinical
a dramatically different manner. In the Bayesian intervention in a more granular fashion. IPD MA,
IPD MA, which is the preferred biostatistical test to assess quantitative consensus in the individual patient data outcomes research model discussed in a previous chapter (see Chap. 9), has several important potential advantages, including the ability to:

1. Standardize the analysis across studies.
2. Include more up-to-date information than was available at the time of each original trial's publication.
3. Incorporate results for previously missing or poorly reported patient-centered outcomes.
4. Help personalize clinical decisions by assessing differential treatment effects for specific subgroups.

IPD MA can also allow for better ascertainment of the optimal dose, timing, and delivery method of a specific intervention that might have been previously tested in multiple, nonuniform ways.

One important new frontier in biostatistics will be to develop new and improved protocols to perform IPD MA. At this point, the protocol to develop, test, and validate such a novel and complex way of obtaining individual patient data consensus is thought to require a three-pronged participatory structure that might include:

1. The investigators who have performed trials meeting the inclusion criteria for a given IPD MA
2. A representative group of stakeholders (including select trial investigators, patient representatives, and biostatisticians) whose role is to collate and evaluate the protocols as they are being proposed
3. An IPD MA Research Center, a group of researchers with established expertise in the underlying methods and conduct of high-quality, rigorous IPD MA

These three entities all actively contribute to and participate throughout the development and conduct of an IPD MA, but each plays a different role.

10.2.3.3 Big Data Paradigm in Translational Healthcare

Big data refers to data sets that are so large and complex that traditional biostatistical application software is inadequate. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, and information privacy. In the context of the topics discussed in this book, and specifically in this chapter, big data can apply to the bibliome, as well as to individual patient data sets, individual patient data meta-analyses, and other collections of information that become part of the consensus of the best evidence base, stakeholder engagement, and the like.

The domain of big data extends beyond the simple observation that the data set under study is large. It extends to the use of predictive analytics, user behavior analytics, and a range of alternative advanced data analytic methods. Traditional relational database management systems and desktop biostatistics and visualization packages have difficulty handling big data, and the big data sets that are projected to become a common occurrence in translational research and translational effectiveness urgently demand the research, development, and quality control evaluation of novel biostatistical software packages for the purpose. It is possible and even probable that these new approaches to biostatistics will increasingly rely on the Bayesian paradigm and the Kullback–Leibler divergence analysis we outlined above.

Big data analysis has been criticized as being relatively shallow at this point in its infancy. The same could be said of traditional inferential tests when biostatistics was first becoming established as the modern science that it is today, back in the 1920s. As comparative effectiveness research continues to grow along the dimensions we have outlined here, its reliance on big data analysis will grow in parallel. That process is bound to drive big data analysis to grow in biostatistical stringency.
10.2.4 Self-Study: Practice Problems

1. Describe the two enterprises of translational healthcare. What relationship do they share?
2. What measuring instruments are needed in translational effectiveness? Why are they necessarily important?
3. What is meant by the best available evidence? How is it most commonly obtained?
4. Describe the process of comparative effectiveness research (CER). How is this different than comparative individual patient effectiveness research (CIPER)?
5. What is the difference between a systematic review and a meta-analysis?
6. What is meant by the level of evidence as compared to the quality of evidence? Then, name at least one instrument that measures each.
7. From the studies below, rank the evidence obtained from each from highest to lowest:
(a) Randomized, triple-blinded, placebo-controlled clinical trial
(b) Mixed model cohort study
(c) Systematic review research synthesis
(d) Cross-sectional study
8. What is a bibliome and what is it analogous to in traditional biostatistics?
9. What is the difference between a frequentist approach and a Bayesian approach to biostatistical inference? Explain why the future of biostatistics in translational research is headed toward the latter approach.
10. Where is the dissemination of information and communication within translational healthcare headed?

Recommended Reading

Baez J, Fritz T. A Bayesian characterization of relative entropy. Theory Appl Categ. 2014;29:421–56.
Bauer JB, Spackman SS, Chiappelli F. Evidence-based research and practice in the big data era (Chapter 17). In: Chiappelli F, editor. Comparative effectiveness research (CER): new methods, challenges and health implications. Hauppauge: NovaScience; 2015.
Bernardo J, Smith AFM. Bayesian theory. Wiley; 1994.
Chiappelli F. Fundamentals of evidence-based health care and translational science. New York: Springer; 2014.
Chiappelli F. Methods, fallacies and implications of comparative effectiveness research (CER) for healthcare in the 21st century (Chapter 1). In: Chiappelli F, editor. Comparative effectiveness research (CER): new methods, challenges and health implications. Hauppauge: NovaScience; 2016.
Cochrane AL. Effectiveness and efficiency: random reflections on health services. Nuffield Provincial Hospitals Trust; 1972.
El Dib RP, Atallah AN, Andriolo RB. Mapping the Cochrane evidence for decision making in health care. J Eval Clin Pract. 2007;13:689–92. PMID 17683315.
Ezzo J, Bausell B, Moerman DE, Berman B, Hadhazy V. Reviewing the reviews. How strong is the evidence? How clear are the conclusions? Int J Technol Assess Health Care. 2001;17(4):457–66. PMID 11758290.
Feinstein AR. Clinical judgement. Baltimore: Williams & Wilkins; 1967.
Gelman A, Carlin J, Stern H, Rubin D. Bayesian data analysis. London: Chapman & Hall; 1995.
Laxminarayan R, Duse A, Wattal C, Zaidi AK, Wertheim HF, Sumpradit N, Vlieghe E, Hara GL, Gould IM, Goossens H, Greko C, So AD, Bigdeli M, Tomson G, Woodhouse W, Ombaka E, Peralta AQ, Qamar FN, Mir F, Kariuki S, Bhutta ZA, Coates A, Bergstrom R, Wright GD, Brown ED, Cars O. Antibiotic resistance—the need for global solutions. Lancet Infect Dis. 2013;13(12):1057–98.
Leskovec J, Rajaraman A, Ullman JD. Mining of massive datasets. Cambridge: Cambridge University Press; 2014.
Murdoch TB, Detsky AS. The inevitable application of big data to health care. JAMA. 2013;309:1351–2.
Renganathan V. Overview of frequentist and Bayesian approach to survival analysis. Appl Med Informatics. 2016;38(1):25–38.
Vallverdu J. Bayesians versus frequentists: a philosophical debate on statistical reasoning. New York: Springer; 2016.
Appendices


Appendix A: Random Number Table



Created and assembled by Nicole Balenton



Appendix B: Standard Normal Distribution (z)



Reprinted/adapted by permission from Springer Nature (Cumulative Distribution Function for the
Standard Normal Random Variable by Stephen Kokoska, Christopher Nevison 1989)

Appendix C: Critical t Values

Reprinted/adapted by permission from Springer Nature (Cumulative Distribution Function for the
Standard Normal Random Variable by Stephen Kokoska, Christopher Nevison 1989)
Appendix D: Critical Values of F

Reprinted/adapted by permission from Springer Nature (Cumulative Distribution Function for the Standard Normal Random Variable by Stephen Kokoska, Christopher Nevison 1989)



Appendix E: Sum of Squares (SS) Stepwise Calculation Method

$$SS_{\mathrm{Between}} = \sum \frac{T^{2}}{n} - \frac{G^{2}}{N}$$

$$SS_{\mathrm{Within}} = \sum X_i^{2} - \sum \frac{T^{2}}{n}$$

Xi = data point
T = group total
G = grand total
n = size of each group
N = total sample size

1. Sum the data points (Xi) in each group to obtain the group totals (T); i.e., T = ∑Xi
2. Square the total (T) for each group to obtain T²
3. Divide each of the squared group totals by its respective sample size to obtain T²/n
4. Sum every group's T²/n to obtain ∑(T²/n)
5. Sum all group totals (T) to obtain the grand total (G); i.e., G = ∑T
6. Square the grand total (G) to obtain G²
7. Sum the sample size of each group (n) to obtain the total sample size (N)
8. Divide the squared grand total (G²) by the total sample size (N) to obtain G²/N
9. Subtract G²/N from ∑(T²/n) (step 4) to obtain SS_Between
10. Square each data point (Xi) and sum all squared data points to obtain ∑Xi²
11. Subtract ∑(T²/n) (from step 4) from ∑Xi² to obtain SS_Within

Adapted from Dr. Lawrence Chu's Lectures on Biostatistics at the California State University, Northridge, Department of Health Sciences
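The stepwise method above can be condensed into a short function. The sketch below (on hypothetical data, for a one-way ANOVA with k groups) reproduces steps 1 through 11:

```python
def sum_of_squares(groups):
    """Return (SS_Between, SS_Within) for a list of groups of data points."""
    T = [sum(g) for g in groups]      # step 1: group totals
    n = [len(g) for g in groups]      # group sizes
    G = sum(T)                        # step 5: grand total
    N = sum(n)                        # step 7: total sample size
    sum_t2_over_n = sum(t ** 2 / m for t, m in zip(T, n))   # steps 2-4
    ss_between = sum_t2_over_n - G ** 2 / N                 # steps 6, 8, 9
    sum_xi2 = sum(x ** 2 for g in groups for x in g)        # step 10
    ss_within = sum_xi2 - sum_t2_over_n                     # step 11
    return ss_between, ss_within

groups = [[4, 5, 6], [7, 8, 9], [1, 2, 3]]   # hypothetical example
print(sum_of_squares(groups))                 # (54.0, 6.0)
```

For the example groups, SS_Between = 54 and SS_Within = 6, which together equal the total sum of squares (60).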

Appendix F: Critical Values for Wilcoxon T



Reprinted/adapted by permission from Springer Nature (Cumulative Distribution Function for the
Standard Normal Random Variable by Stephen Kokoska, Christopher Nevison 1989)

Appendix G: Critical Values for Mann-Whitney U



Reprinted/adapted by permission from Springer Nature (Cumulative Distribution Function for the
Standard Normal Random Variable by Stephen Kokoska, Christopher Nevison 1989)

Appendix H: Critical Values for the Chi-Square Distribution



Reprinted/adapted by permission from Springer Nature (Cumulative Distribution Function for the
Standard Normal Random Variable by Stephen Kokoska, Christopher Nevison 1989)
Answers to End of Chapter Practice Problems

Chapter 1

1. The utilization of the scientific method begins by observing a phenomenon in the natural world and asking the question: "Why?"
2. Step 1: Research Question—a F.I.N.E.R question about an observation made, in which the answering of it would be useful and meaningful. Step 2: Study Hypothesis—an educated guess that states the research question in a positive manner. Step 3: Study Design—the infrastructure or system we create to aid in answering the research question. Step 4: Methodology—the process of measuring and collecting the necessary information (i.e., data). Step 5: Data Analysis—the statistical reasoning tools and techniques utilized in the examination of the data. Step 6: Conclusion—answering the research question relative to the results that were obtained.
3. Study design, methodology, and data analysis; compromising any single leg of the stool affects the integrity of the research in its entirety and the potential inference that it may provide.
4. The only difference between the research question and the study hypothesis is that the latter states the former in the positive.
5. False—the establishment of absolute truth is nearly impossible.
6. Systematic errors, random errors, and errors of judgment/fallacies; compromising any single leg of this stool affects the statistical significance of the findings.
7. Systematic errors are avoidable, whereas random errors are unavoidable.
8. Error of judgment/fallacy—specifically, hindsight bias.
9. False—the presence of some degree of random error may describe the individual differences of the subjects being studied (i.e., not all humans are identical).
10. The first enterprise is translational research, which refers to the biostatistical applications and methods used on information obtained from the patient in order to obtain new information that directly benefits the patient in return (i.e., bench to bedside). The second enterprise is translational effectiveness, which refers to the results gathered from clinical studies being translated or transferred to everyday clinical practices and healthy decision-making habits (i.e., result translation).


Chapter 2

1.
(a) Observational
(b) Experimental
(c) Observational
(d) Experimental
(e) Experimental
(f) Experimental
2. False—sensitivity of a new diagnostic test refers to how effective the new test is at identifying the presence of a disease or condition.
3.
(a) Sensitivity, 70/(70 + 10) = 0.875; specificity, 15/(15 + 5) = 0.75
(b) Prevalence (proportion of patients with caries): (70 + 10)/(70 + 10 + 15 + 5) = 0.80 or 80%
4.
(a) 47/150 = 0.3133 or 31.33%
(b) No, incidence only observes new cases in the population
(c) Cohort study
5.
(a) Observational, cohort study.
(b) Prospective study.
(c) No, causality cannot be established from observational studies.
6. Cases, babies in utero growing in settings that contain environmental toxins; controls, babies in utero growing in settings that do not contain environmental toxins; exposure, environmental toxins; outcome, childhood leukemia
7.
(a) True
(b) False—odds ratio, not risks
(c) True
8.
(a) Crossover trial—women randomly switched between drinking a measure of alcohol on 1 day and a measure of placebo on the other day.
(b) Random allocation signifies randomization, which is critical in experimental designs in order to reduce bias and vulnerability to random error in the research study.
(c) Participants cannot be blinded to the specific intervention because they are actually tasting the drink and are able to differentiate between alcohol and placebo.
9.
(a) The sugar pill represented the placebo or control of the experiment.
(b) Double blind—patients were unaware of the intervention, and the researchers did not have access to the medical reports because they were judged by an independent, third-party medical group.
(c) Randomized, double-blinded, placebo-controlled clinical trial.
10. In order for a measuring tool that measures X to be considered as the gold standard, its sensitivity and specificity must be better than those of other instruments that measure X. Additionally, the gold standard must be universally accepted across the majority of the healthcare field. Thus, a measuring device can lose its recognition as the gold standard if a new device enters the field and can more accurately and precisely measure X than the gold standard.

Chapter 3

1. The fundamental question that is asked of the methodology portion of research is: "How?"; the answering of this question provides us with information as to how the research was done with respect to the data collection techniques.
2. Sample and population match:
• Arizona college students → US college students
• Stars in the Milky Way Galaxy → Stars in the universe
• Republican Congressmen → Republican Presidents
• Female business majors → Female entrepreneurs

3. Samples are measured instead of entire populations because measuring the population is not feasible in terms of the amount of time, money, and resources required to access every single member of a population. That said, it is possible to measure entire populations given that the population size is relatively small (e.g., all US Presidents).
4. The two most important qualities for a sample to be rendered a good sample are that it be of reasonable size and representative of its parent population. This is important to a research study mainly because it facilitates the potential of making an accurate generalization regarding the population.
5. This is not a form of random sampling; rather, it is a form of convenience sampling. Unless this is the only fetal clinic in her entire county, it will not provide a representative sample of all of the women and their breastfeeding behavior in the researcher's county.
6. Stratified random sampling.
7. The patients' instrument may be reliable, but it is not valid.
8. In healthcare, we strive to have an instrument that is both as valid and as reliable as possible. Thus, there is no real, definitive answer to whether we would choose one over the other. You might argue that, based on the specific instrument and the thing that it is measuring, one may be more important than the other.
9.
(a) Qualitative
(b) Quantitative
(c) Quantitative
(d) Qualitative
(e) Quantitative
(f) Qualitative
10.
(a) Categorical variable, ordinal measure
(b) Continuous variable
(c) Continuous variable
(d) Categorical variable, nominal measure
(e) Semicontinuous variable (counts)
(f) Categorical variable, dichotomous

Chapter 4

1.
(a) Yes, due to the large number of observations and low number of repetitive observations, it is best to organize into grouped data to maximize descriptive efficiency.
(b)
Frequency table: length of hospital stay
Interval   f
0–6        8
7–13       5
14–20      4
21–27      6
28–34      2
Total      25

2.
Category           f    f%     cf   cf%
Very unsatisfied   7    14%    7    14%
Unsatisfied        9    18%    16   32%
Neither            19   38%    35   70%
Satisfied          11   22%    46   92%
Very satisfied     4    8%     50   100%
Total              50   100%

3.
Frequency table: weights of newborn babies (kg)
Interval (kg)   f    f%       cf     cf%
0.00–0.99       6    10%      6      10%
1.00–1.99       12   20%      18     30%
2.00–2.99       19   31.67%   37     61.67%
3.00–3.99       14   23.34%   51     85%
4.00–4.99       6    10%      57     95%
5.00–5.99       3    5%       60     100%
Total           60   100%     N/Aᵃ   N/Aᵃ
ᵃ Neither cumulative frequency (cf) nor cumulative frequency percent (cf%) has a meaningful total value
206 Answers to End of Chapter Practice Problems

4.

Campus Health Office Satisfaction

20

15
Frequency

10

0
Very Unsatisfied Unsatisfied Neither Satisfied Very Satisfied
Campus Health Office Satisfaction

Campus Health Office Satisfaction

20

15
Frequency

10

0
Very Unsatisfied Unsatisfied Neither Satisfied Very Satisfied
Campus Health Office Satisfaction
Answers to End of Chapter Practice Problems 207

5. Mean,  282.07; median, 302; mode, 120 and P ( AI and Unexp )


378 P ( AI| Unexp ) =
P ( AI )
6.
(c) ( 0.05)( 0.25)
=
0.3875
( x2 - x )
2
xi - x = 0.0323 or 3.23%
xi
2.33 0.57 0.325
2.02 0.26 0.068
10.
1.99 0.23 0.053 6.13 - 5.05
1.53 −0.23 0.053
(a) z = = 1.38 → 0.9162; since
0.78
0.99 −0.77 0.593
1.26 −0.50 0.25 looking for above: 1–0.9162 = 0.0838
1.18 −0.58 0.336 5.44 - 5.05
3.50 1.74 3.028 (b) z = = 0.50 → 0.6915 or
0.78
0.22 −1.54 2.372
7.818 69.15%
s= = 0.932
10 - 1
4.20 - 5.05
2.62 0.86 0.74 s2 = (0.932)2 = 0.868 (c) z = = -1.09 → 0.1379 or
∑ = 7.818 0.78
x = 1.76
n = 10 13.79%
(d) 0.9162–0.6915 = 0.2247 or 22.47%
x - 5.05
7. Range  =  3.50–0.22  =  3.28; IQR  =  2.33– (e) 0.1000 ← -1.28 = , x = 4.05
1.18  =  1.15—Out of all of the measures of 0.78
x - 5.05
dispersion, standard deviation and variance (f) 1–0.6000 = 0.4000 ← -0.25 = ,
are the best at providing an accurate descrip- x = 4.86 0.78
tion regarding the spread or distribution of
x - 5.05
the data; IQR would be the next best mea- (g) Lower score: 0.025 ← -1.96 =
sure, and lastly range because of its vulnera- , x = 3.52 0.78
bility to outliers.
8. x - 5.05
1.96 =
(a) P (CC and EI)  =  (101/613)  ×  (62/613)  Upper score: 0.975 ← , x = 6.58
0.78
= 0.01666 or 1.67%
(b) P (Unexp. or BI) = (100/613) + (274/61
3) = 374/316 or 61.01% Chapter 5
(c) P (BI or Male) = (274/613) + (275/450) 
− (55/106) = 0.54 or 54% 1. The sample population interaction plays a
critical role in inferential statistics because
9. information learned from a representative
(a) P ([Exp. and AI] or [Unexp. and sample can potentially serve as an inference
AI])  =  (0.35)(0.75)  +  (0.5) consensus onto the entire population.
(0.25) = 0.3875 or 38.75% (note: this is 2.
the same thing as finding P (AI)) (a) s x = 25
P ( AI and exp ) (
b) n = 100
P ( AI| exp ) = (c) σ2 = 278.33
P ( AI )
(b) ( 0.35)( 0.75) 3. Increase in sample size—by increasing the
=
0.3875 denominator of s x , the ratio decreases
= 0.6774 or 67.744%
208 Answers to End of Chapter Practice Problems

4. 1. Research Question: Is there a statistically


(a) Independence of measurement—mea- significant difference between the average
suring tools (i.e., questionnaires) are hours of studying done by your college com-
being created relative to (i.e., dependent pared to the national average hours of
on) the observations. studying?
(b) No, if any single one of the assumptions 2. Hypotheses: H0: 𝜇 = 15; H1: 𝜇 ≠ 15
of parametric statistics are violated, then 3. Decision Rule: At 𝛼  =  0.01, If
any parametric inference consensus is z ≤ −2.58 or z ≥ +2.58, then Reject H0|If—
strictly prohibited. 2.58 < z < +2.58, then Retain H0
11 - 15
5. False 4. Calculation: z = = -2.67
6. 6.7 / 20
(a) p = 0.02 5 . Decision: Reject H0 because −2.67 < −2.58
(b) p < 0.05 6. Conclusion: Based on the results, there seems
(c) p < 0.001 to be a statistically significant difference
(d) p < 0.01 between the average hours of studying done
(e) p < 0.15 by your college compared to national average
studying hours.
7. Alpha (𝛼), beta (𝛽), effect size (ES), and
sample size (n) (b) CI : 11 ± ( 2.58 ) æç 6.7 ö = [7.13, 14.87]
÷
è 20 ø
8. False—as power increases, the probability of
making a Type II error decreases Interpretation: We are 99% confident that
9. the true mean of average hours of study-
(a) True ing falls between 7.13 and 14.87.
(b) False—not a representation of crude

probabilities of occurrence 3.
(c) False—not an absolute measure (a) Two-independent sample t
(d) False—not a measure of specific partici- (b) Dependent sample t
pants than a description of population (c) One sample t
(e) True (d) One sample t

10. C—assuming the alternative hypothesis (H1) 4.


states the opposite of the null hypothesis (a)
(H0), then we can assume the null hypothesis
to be false. 1. Research Question: Is there a statistically
significant difference between the average
incubation days of the Zika outbreak com-
Chapter 6 pared to the incubation days caused by a
recently discovered strain of Zika?
1. 2. Hypotheses: H0: 𝜇 = 6; H1: 𝜇 ≠ 6
(a) tcrit = ± 2.086 3. Decision Rule: At 𝛼 = 0.05 & df = 7, If p ≤ 𝛼,
(b) tcrit = ± 2.074 then Reject H0|If p > 𝛼, then Retain H0
(c) zcrit = ± 1.96
5.75 - 6
(d) zcrit = ± 2.58 4. Calculation: t = = -0.216
(e) Fcrit = + 2.32 3.28 / 8
5. Decision: Retain H0 because p > 0.05
2. 6. Conclusion: Based on the results, there seems
(a) to be no statistically significant difference
Answers to End of Chapter Practice Problems 209

between the average incubation days of the 8


Zika outbreak compared to the incubation D= = 0.8 sD = 3.33
10
days of caused by a recently discovered strain
of Zika. 0.8 - 0
t= = 0.76
3.33 / 10
(b) CI : 5.75 ± ( 2.365 ) æç 3.28 ö÷ = [3.01,
è 8ø
8.49] 5. Decision: Retain H0 because p > 0.05
6. Conclusion: Based on the results, there seems
We are 95% confident that the true popu- to be no statistically significant mean differ-
lation mean of incubation days caused by ence between the IQ’s of those who were
the recently discovered strain of Zika lies breastfed compared to the IQ’s of their bottle-­
between 3.01 and 8.49. fed siblings.
(c) It is probable that the Zika outbreak under (b) Matched measures.
study is in fact the recently discovered (c) A major source of error may result from
strain of Zika. By failing to reject H0 (i.e., poor matching due to individual differ-
retaining H0), it is being stated that the ences of siblings, such as age, gender,
evidence shows there to be no difference socioeconomic status, etc.
between the strain that caused the out- 6.
break and the recently discovered strain.
(d) No, the decision would remain the same. (a)
At an 𝛼 = 0.01 & df = 7, the p-value is still
greater than 0.05, nonetheless 0.01. 1. Research Question: Is there a statistically
significant difference in the severity of the
5. seasonal flu between four home remedies?
(a) 2. Hypotheses: H0: 𝜇1 = 𝜇2 = 𝜇3 = 𝜇4 H1: H0 is
false; at least one group is different
1. Research Question: Is there a statistically 3. Decision Rule: At 𝛼 = 0.01, dfBW = 4–1 = 3,
significant mean difference between the IQ’s and dfWN = 40–4 = 36: If p ≤ 𝛼, then Reject
of those who were breastfed and the IQ’s of H0|If p > 𝛼, then Retain H0
their bottle-fed siblings? 4. Calculation:
2. Hypotheses: H0: 𝜇D = 0; H1: 𝜇D ≠ 0
3. Decision Rule: At 𝛼 = 0.05 & df = 7 Using method in Appendix E
If p ≤ 𝛼, then Reject H0|If p > 𝛼, then Retain H0 OJ C. Soup G. Tea S. Water
4. Calculation: 5 8 3 1
7 7 2 7
Breastfed sibling IQ Bottle-fed sibling IQ D = XBr−XBO 3 4 5 2
119 115 4 3 7 5 4
96 97 −1 6 5 3 1
102 105 −3 5 9 3 4
111 110 1 8 9 2 4
79 83 −4 7 7 1 2
88 90 −2 6 6 4 2
87 84 3 5 6 4 1
99 99 0 T = 55 T = 68 T = 32 T = 28
126 121 5 T2 = 3025 T2 = 4624 T2 = 1024 T2 = 784
106 101 5 T2/n = 302.5 T2/n = 462.4 T2/n = 102.4 T2/n = 78.4
∑D = 8
210 Answers to End of Chapter Practice Problems

G = 183, G2 = 33,489, N = 40, ∑T2/n = 945.7 ( b) 1−R2 = 1–0.3505 = 64.95%


Sum of Mean (c) No, R2 is only able to predict the percent
squares Degrees of square of explained and/or unexplained variabil-
Source (SS) freedom (df) (MS) F ity—not percent of people.
Between 108.47 3 36.16 13.39 10.
Within 97.3 36 2.70
(a) Predictor variables: daily cigarettes, daily
Total 205.77 39 X X
alcoholic beverages, age. Response vari-
able: urinary output
5. Reject H0 because p < 0.01 (b) y = 0.466−0.181 (4) – 0.299 (1) + 0.333
6 . Based on the results, there seems to be a sta- (56)  =  18.09—According to the regres-
tistically significant difference in the severity sion line equation, the participant has
of the seasonal flu between the four home about 18.09  L of urinary output per
remedies. month.
(c) Controlling for the other variables, on
(b) Based on these calculations, we are unable to average, a single increase of cigarettes
determine exactly which of the home reme- smoked per day is predicted to result in a
dies are most effective in decreasing the loss of 0.181  L of urinary output per
severity of the seasonal flu. In order to do month.
that, we must conduct a post hoc analysis. (d) With this coefficient of determination, we
can conclude that we have a relatively low
7. predictive accuracy as there exists a much
(a) SSBW  =  150, SSTOTAL  =  5400, higher degree of unpredictive error (1−R2)
MSWN = 21.43, F = 1.75 relative to the amount of error we are able
(b) 5 to predict (R2).
(c) 50
(d) p > 0.05—The relatively small F statistics
suggests a small effect size, which is par- Chapter 7
ticularly due to the large amount of error
denoted by the large MSWN. Thus, a 1. The violation of any single or more assump-
decrease or control of individual differ- tions of parametric statistics, qualitative (cat-
ences might be a good suggestion, should egorical) data, and adaptive research
the study be done again. models.
8. 2. True
(a) Moderate strength, positive association. 3. Matching
(b) Based on the Pearson correlation coeffi- • Kruskall Wallis H → One-way ANOVA
cient, r  =  +  0.592, there seems to be a • Wilcoxon Rank-Sum → One sample t
moderately positive association between • Wilcoxon Signed-Rank  →  Dependent
cholesterol levels and caloric intake. Sample t
(c) No, the establishment of causality is not • Spearman Rho → Pearson r
permitted with associations, regardless of • Mann-Whitney U  →  Two independent
the strength or direction of the associa- sample t
tion. The best that can be said is the exis- • Friedman → Two-way ANOVA
tence of an associative relationship. 4. Yes, nonparametric inferences exist and differ
9. from their parametric counterparts. A non-
(a) By calculated R2  =  0.3505, we can say parametric inference can only be generalized
that approximately 35% of the variability onto the specific sample from which it was
in cholesterol levels is predictable from taken from and not their parent population. On
its association with caloric intake. the other hand, a parametric inference can be
Answers to End of Chapter Practice Problems 211

generalized onto the parent population, the 5.


sample from which it was obtained included.
Rank 1 2 3 4 5 6 6 8 9 10 11 12 13 14
Group 1 5.98 6.77 8.01 11.02 18.09 45.93 101.26
Group 2 0.07 2.44 11.02 32.33 95.12 300.65 750.81

Group 1: x = 6.86, s  =  3.29 Group 2: x =


8.00, s = 5.16. Adults who oppose:
525 x 627
6. Using Mann-Whitney U Test, we determine: Expected f = = 293.64
1121
R1 = 48,   n1 = 7  
R2 = 56,   n2 = 7
(191 - 262.64 ) ( 303 - 231.36 )
2 2

c2 = +
n1 ( n1 + 1) 262.64 231.36
U1 = n1n2 + - R1 ( 405 - 333. 36 ) +(
2
222 - 293 .64 )
2
2 + = 74.6 df = 1
7 ( 7 + 1) 333.36 293.64
= ( 7 )( 7 ) + - 48 = 29
2 At 𝛼 = 0.05, Reject H0 because p < 0.001. Based
on these results, there seems to be a statistically
U = 21 significant difference within the specific sample
that was measured in regard to age group and
n2 ( n2 + 1) preference toward the type of medical practitio-
U 2 = n1n2 + - R2 ner that provides the diagnosis.
2
7 ( 7 + 1)
= ( 7 )( 7 ) + - 56 = 21 10. A logistic regression would be most com-
2
monly used in an observational, case-control
At 𝛼  =  0.05 and Ucrit  =  8, UOBS  >  Ucrit, so we study. This type of statistical technique
Retain H0 would be useful for this specific study design
because it can provide an odds ratio.
7. In the Geisser-Greenhouse correction, sphe-
ricity refers to an assumption that is analogous
to the homogeneity of variances. Chapter 8
8. This type of test is highly vulnerable to a Type
I Error. It is a double-edged sword because 1. The analysis of individual patient data has the
decreasing its vulnerability to a Type I Error, ability to provide more reliable and more accu-
in turn, increases the probability of making a rate information regarding the specific patient.
Type II error. Aggregate data analysis provides a general-
9. ized inference consensus regarding similar
patients; however it does not necessarily entail
Children who favor: that it is applicable to each similar patient. The
individual differences that individual patient
596 x 494 data takes into consideration make it much
Expected f = = 262.64
1121 more advantageous than that of aggregate data
in the perspective of an individual patient.
Children who oppose:
2. Complications include lack of access to
525 x 494 unpublished trials, inconsistent data across
Expected f = = 231.36
1121 trials, limited information in published
reports, longer follow-up time, more com-
Adults who favor:
plex outcomes, and higher monetary cost.
596 x 627
Expected f = = 333.36
1121
212 Answers to End of Chapter Practice Problems

3. Whether primary, secondary, or key, stake- to the patients, stakeholders, and the like.
holders may include patients, patient’s Moreover, its overall purpose is to ascertain
family members, caregivers, governmental a certain degree of value or worth based on
figures, etc. Additionally, all those who fit the objectives and results throughout the
the following description: “those groups entire study.
without whose support, the organization 2. Evaluation is a systematic acquisition and
would cease to exist.” Stakeholder engage- management of information, which includes
ment in healthcare improves the relevance of the generation of feedback to specific stake-
research, increases transparency, and accel- holders. Therefore, evaluation is most like
erates its adoption into practice. (or includes) methodology and data
4. Yes, with tools such as individual patient analysis.
meta-analysis, information learned from the 3. The purpose of formative evaluation is to
patient can be inferred onto that specific measure the effectiveness of a program or
patient. research study as it is taking place in order to
5. Individual patient inferences provide infer- determine what can be improved within the
ences about the specific patient, whereas course of the study. On the other hand, sum-
aggregate patient data can only provide mative evaluation is concerned with the
inferences regarding the general population overall assessment of the study after it has
of similar patients. been completed.
6. Analysis of individual patient data should 4. False—formative assessments are dependent
ultimately culminate to patient-centered on qualitative feedback.
outcomes. 5. Observational study design, with chi-square
7. D. statistical tests.
8. PICOTS—population, intervention, compar- 6. Both quantitative and qualitative methods
ator, outcome, timeline, and setting. are essential and necessary in evaluation
9. After individual patient data analysis in within translational healthcare. The methods
PCOR, the next necessary step must be complement each other in their own specific
patient-centered outcome effectiveness ways to provide an ultimately more robust
(PCOE). The evaluation of the outcomes of result on the effectiveness and efficacy of
evidence-based healthcare is critical for that which they are concerned with.
understanding how the particular processes 7. Qualitative evaluation methods.
work and how the benefits can be maximized 8. In order to quantify qualitative methods and
for stakeholders and the like. information utilized in evaluation, one must
10. The traditional model of repeated measures (1) categorize and sort the relevant informa-
calls for a pre-post approach—where the tion on the basis of certain criteria; (2) recog-
effectiveness of an intervention is measured nize a recurrence of the themes under study;
by comparing the posttest results to the pre- (3) conduct continuous, semicontinuous, or
test results. In the new model proposed, there dichotomous assessment of recurrence; and
is a post-then(before)-pre approach utilized. (4) conduct statistical analysis of recurrence
The advantage here is the control for of the themes via traditional statistical
response shift bias. techniques.
9. It is a fallacy to think that one is better than
the other. Both quantitative and qualitative
Chapter 9 methods are distinctly beneficial in their own
right—and, when utilized together, are able
1. In translational healthcare, evaluation refers to complement each other to provide an all-­
to the systematic approach of determining encompassing basis of knowledge.
the effectiveness and efficacy of research 10. Participatory Action Research an Evaluation
studies, investigations, and programs relative (PARE)—refers to a formative and summa-
Answers to End of Chapter Practice Problems 213

tive approach utilized within community relies on aggregate patient data, whereas the
health action. It is a crucial concept within latter focuses on individual patient data and
translational healthcare as it seeks to increase meta-analysis.
benefit effectiveness by understanding the 5. A systematic review is a scientific report
experiences through the perspective of the that describes the methodology employed
actual patient or affected groups. for obtaining, quantifying, analyzing, and
reporting the consensus of the best evi-
dence base for a specific clinical treatment.
Chapter 10 A meta-­analysis is the biostatistical tech-
nique utilized within systematic reviews to
1. Translational research (T1) and transla- analyze quantitative evidence from related,
tional effectiveness (T2)—T1 refers to the homogenous studies in order to estimate
biostatistical applications and methods used the effect of the specific clinical
on information obtained from the patient in intervention.
order to obtain new information that directly 6. The level of evidence represents evidence
benefits the patient in return (i.e., bench to that is obtained from a particular research
bedside); T2 refers to the results gathered study design, which can be measured by the
from clinical studies that are translated or AGREE instrument. The quality of evidence
transferred to everyday clinical practices refers to the stringency of the research meth-
and healthy decision-making habits (i.e., odology and data analysis, which can be
result translation). T2 relies heavily on the measured by the PRISMA instrument.
biostatistical principles and concepts of 7. Systematic review research synthesis > ran-
T1, while T1 relies heavily on the develop- domized, triple-blinded, placebo-controlled
ment of novel and concerted biostatistical clinical trial > mixed model cohort study >
models. cross-sectional study.
2. Translational effectiveness is in need of mea- 8. The bibliome refers to a collection of pub-
surement tools that have the ability to assess lished papers obtained through a litera-
the quality of the evidence, individual patient ture review that most closely answers the
research outcomes and analysis, individual PICOTS research question. The bibliome
patient data meta-analysis, stakeholder is most analogous to a sample (i.e., sam-
engagement quantification and analysis, and ple–population interaction) from traditional
all-encompassing dissemination. The impor- biostatistics.
tance of this lies at the core of translational 9. A Bayesian approach to biostatical infer-
healthcare, whereby only the best possible ence refers to the continuous updating of
intervention and the most optimal benefit are previous inferences as new information and
provided to the patient. data becomes possible. This is in contrast
3. The best available evidence refers to the to the frequentist approach as it does not
highest level and quality of evidence that permit the updating of new knowledge. In
currently exists. It is most commonly the frequentist approach to biostatistical
obtained by a process of critical evaluation inference, the development of new knowl-
of the entire body of available published edge that is significantly unlike that which
research literature, along with a clear inter- is commonly known is rendered as extreme.
pretative synthesis of the relevant findings. This “extreme” or statistically significant
4. Comparative effectiveness research is the different observation is exiled from the
systematic process by which quantitative and population, even if the sample from which
qualitative consensuses of the best available it is obtained is a good representation. As
evidence are obtained. This differs from learned from individual patient data analy-
comparative individual patient effectiveness sis, the stark individual differences between
research (CIPER) simply because the former patients and their conditions exemplify the
214 Answers to End of Chapter Practice Problems

extent to which we are unable to grasp the of technology. Therefore, the most effective
true population (or even if one exists). dissemination and communication of infor-
Thus, the future of biostatistics must accept mation to all stakeholders (and the like) on
this dynamic challenge for the betterment the best available evidence must be through
of healthcare and its constituents. some technologically advance system, such as

10. Translational healthcare, just like the rest tele-health.
of the world, must move with the direction
Bibliography

AdvaMedDX, 2–24. A policy primer in diagnostics: June Bernardo J, Smith AFM.  Bayesian theory. Hoboken:
2011. 2011. https://dx.advamed.org/sites/dx.advamed. Wiley; 1994.
org/files/resource/advameddx-policy-primer-on-diag- Beveridge WIB. Biologist and statistician Ronald Fisher
nostics-june-2011.pdf. [Online image]. 1957. https://www.flickr.com/photos/
Agency for Healthcare Research and Quality. Logic mod- internetarchivebookimages/20150531109/.
els: the foundation to implement, study, and refine Bewick V, Cheek L, Ball J. Statistics review 7: correlation
patient-centered medical home models. March 2013a. and regression. Crit Care. 2003;7(6):451–9. Print.
AHRQ Publication No. 13–0029-EF. Bhatt A.  Evolution of clinical research: a history
Agency for Healthcare Research and Quality. Mixed before and beyond James Lind. Perspect Clin Res.
methods: integrating quantitative and qualitative 2010;1(1):6–10.
data collection and analysis while studying patient-­ Biddix P.  Instrument, validity, reliability. 2009. https://
centered medical home models. March 2013b. AHRQ researchrundowns.com/quantitative-methods/instru-
Publication No. 13–0028-EF. ment-validity-reliability/. Accessed July 2017.
Altman DG. Practical statistics for medical research. Boca Black K. Business Statistics for Contemporary Decision
Raton, FL: CRC; 1990. Making (4th edn, Wiley student edition for India).
Andersen H, Hepburn B. In: Zalta EN, editor. Scientific New Delhi: Wiley; 2004. ISBN 978-81-265-0809-9.
Method. The Stanford Encyclopedia of Philosophy Bloom BS, Hasting T, Madaus G.  Handbook of forma-
(Summer 2016 Edition); 2016. https://plato.stanford. tive and summative evaluation of student learning.
edu/entries/scientific-method/#SciMetSciEduSeeSci. New York: McGraw-Hill; 1971.
Baez J, Fritz T.  A Bayesian characterization of relative Bogdan R, Taylor S. Looking at the bright side: a positive
entropy. Theor Appl Categories. 2014;29:421–56. approach to qualitative policy and evaluation research.
Bagdonavicius V, Kruopis J, Nikulin MS. Non-parametric Qual Sociol. 1997;13:193–2.
tests for complete data. London & Hoboken: ISTE & Bohr N.  LXXIII.  On the constitution of atoms and
Wiley; 2011. molecules. Lond Edinb Dubl Phil Mag J  Sci.
Balescu R.  Equilibrium and non-equilibrium statis- 1913;26(155):857–75.
tical mechanics. Hoboken: Wiley; 1975. ISBN: Bonferroni CE. Teoria statistica delle classi e calcolo delle
9780471046004. probabilità. Pubblicazioni del Real Istituto Superiore
Banerjee A, Chitnis UB, Jadhav SL, Bhawalkar JS, di Scienze Economiche e Commerciali di Firenze.
Chaudhury S.  Hypothesis testing, type I and type II 1936;8:3–62.
errors. Ind Psychiatry J.  2009;18(2):127–31. https:// Born M, Heisenberg W.  The quantum theory of mol-
doi.org/10.4103/0972-6748.62274. ecules. Ann Phys. 1924;74(9):1–31.
Barkhordarian A, et al. Assessment of risk of bias in trans- Campbell DT.  Factors relevant to the validity of
lational science. J  Transl Med. 2013;11:184. http:// experiments in social settings. Psychol Bull.
www.translational-medicine.com/content/11/1/184. 1957;54:297–312.
Bauer JB, Spackman SS, Chiappelli F.  Evidence-based Chiang C, Zelen M.  What is biostatistics? Biometrics.
research and practice in the big data era. In: Chiappelli 1985;41(3):771–5. https://doi.org/10.2307/2531297.
F, editor. Comparative effectiveness research (CER): Chiappelli F. Fundamentals of evidence-based health care
new methods, challenges and health implications. and translational science. New York: Springer; 2014.
Hauppauge, NY: NovaScience; 2015. Chapter 17. Chiappelli F.  Methods, fallacies and implications of
Bem DJ.  Writing the Empirical Journal Article. comparative effectiveness research (CER) for health-
Psychology Writing Center. University of Washington; care in the 21st century. In: Chiappelli F, editor.
Denscombe, Martyn. In: The Good Research Guide: Comparative effectiveness research (CER): new meth-
for small-scale social research projects. 5th ed. ods, challenges and health implications. Hauppauge,
Buckingham: Open University Press; 2014. NY. Chapter 1.: NovaScience; 2016.

© Springer-Verlag GmbH Germany, part of Springer Nature 2018


A. M. Khakshooy, F. Chiappelli, Practical Biostatistics in Translational Healthcare, 215
https://doi.org/10.1007/978-3-662-57437-9
216 Bibliography

Chiappelli F.  Comparing two groups: T tests family How clear are the conclusions? Int J Technol Assess
(13,14,15) [PowerPoint slides]; n.d. Health Care. 2001;17(4):457–66.
Chiappelli F, Cajulis OS. Psychobiologic views on stress-­ Fain J.  Is there a difference between evaluation and
related oral ulcers. Quintessence Int. 2004;35:223–7. research. Diabetes Educ. 2005;31:150–5.
PMID: 15119681. Fan J, Han F, Liu H. Challenges of big data analysis. Natl
Chu VW. Assessing proprioception in children: a review. Sci Rev. 2014;1(2):293–314. https://doi.org/10.1093/
J Mot Behav. 2017;49:458–66. https://doi.org/10.1080 nsr/nwt032.
/00222895.2016.1241744. Feinstein AR.  Clinical judgement. Philadelphia, PA:
Cochrane AL. Effectiveness and efficiency: random reflec- Williams & Wilkins; 1967.
tions on health services. London: Nuffield Provincial Fisher RA.  Statistical methods for research workers.
Hospitals Trust; 1972. Edinburgh: Oliver and Boyd; 1925.
Collier R. Legumes, lemons and streptomycin: a short history Fisher RA. The design of experiments. New York: Hafner;
of the clinical trial. Can Med Assoc J. 2009;180(1):23– 1949.
4. https://doi.org/10.1503/cmaj.081879. Fisher RA.  Contributions to mathematical statistics.
Colosi L, Dunifon R.  What’s the difference? “Post then New York: Wiley; 1950.
Pre & Pre then Post”. Cornell Cooperative Extension, Fletcher A, Guthrie J, Steane P, Roos G, Pike S. Mapping
2006. stakeholder perceptions for a third sector organization.
Conover WJ.  Practical nonparametric statistics. J Intellect Capital. 2003;4(4):505–27.
New York: Wiley. ISBN 0–471–16851-3.; 1960. Freedman B. Equipoise and the ethics of clinical research.
Corbin JM, Strauss AL.  Basics of qualitative research: N Engl J Med. 1987;317:141–5.
techniques and procedures for developing grounded Freedman D, Pisani R, Purves R.  Statistics. 2nd ed.
theory. 2nd ed. Thousand Oaks, CA: SAGE; 1998. New York: W.W. Norton; 1991.
Corder GW, Foreman DI. Nonparametric statistics: a step-­ Friedman M. The use of ranks to avoid the assumption of
by-­step approach. Hoboken: Wiley. p. 2014. normality implicit in the analysis of variance. J  Am
Cram F. Method or methodology, what’s the difference?— Statist Assoc. 1937;32:675–701.
Whānau Ora. 2013. http://whanauoraresearch.co.nz/ Friedman M. A correction: the use of ranks to avoid the
news/method-or-methodology-whats-the-difference/. assumption of normality implicit in the analysis of
Creswell J. Research design: qualitative, quantitative, and variance. J Am Statist Assoc. 1939;34:109.
mixed methods approaches second editions. Thousand Friedman M.  A comparison of alternative tests of sig-
Oaks, CA: Sage; 2002. nificance for the problem of m rankings. Ann Math
Daly LE, Bourke GJ, McGilvray J.  Interpretation and Statist. 1940;11:86–92.
uses of medical statistics. 4th ed. Scarborough, ON: Friedman C.  The frequency interpretation in probabil-
Blackwell Scientific; 1991. ity. Adv Appl Math. 1999;23(3):234–54. https://doi.
Data, Trends and Maps. 2017, April 10. https://www. org/10.1006/aama.1999.0653.
cdc.gov/obesity/data/databases.html. Accessed 11 Jul Furr RM.  Testing the statistical significance of a corre-
2017. lation. Winston-Salem, NC: Wake Forrest University;
Dawe H.  William Farish, Chemist, c 1815. [Online n.d.
Image]. 1815. https://commons.wikimedia.org/wiki/ Gaba E.  A bust of Socrates in the Louvre [Online
File:William_Farish.jpg. Image]. 2005. https://commons.wikimedia.org/wiki/
Donner A.  A bayesian approach to the interpretation of File:Socrates_Louvre.jpg#file.
sub-group results in clinical trials. J  Chronic Dis. Gaventa J, Tandon R. Globalizing citizens: new dynamics
1992;34:429–35. of inclusion and exclusion. London: Zed; 2010.
Donner A, Birkett N, Buck C.  Randomisation by clus- Gelman A, Carlin J, Stern H, Rubin D.  Bayesian data
ter: sample size requirements and analysis. Am analysis. London: Chapman & Hall; 1995.
J Epidemiol. 1991;114:906–14. Gerstman BB.  Basic biostatistics: statistics for public
Dowie J.  “Evidence-based,” “cost-effective” and health practice. Sudbury, MA: Jones and Bartlett;
“preference-­driven” medicine: decision analysis based 2008.
medical decision making is the pre-requisite. J Health Gibbs JW. Elementary principles in statistical mechanics.
Services Res Policy. 1996;1:104–13. New York: Charles Scribner’s Sons; 1902.
Dudovskiy J.  Cluster sampling. 2016. http://research- Golafshani N.  Understanding reliability and validity in
methodology.net/sampling-in-primary-data-collection/ qualitative research. Qual Rep. 2003;8(4):597–606.
cluster-sampling/. Accessed Jul 2017. http://nsuworks.nova.edu/tqr/vol8/iss4/6.
Dunn OJ.  Multiple comparisons using rank sums. Gray JAM, Haynes RB, Sackett DL, Cook DJ, Guyatt
Technometrics. 1964;6:241–52. GH. Transferring evidence from health care research
El Dib RP, Atallah AN, Andriolo RB.  Mapping the into medical practice. 3. Developing evidence-based
Cochrane evidence for decision making in health care. clinical policy. Evid Based Med. 1997;2:36–9.
J Eval Clin Pract. 2007;13:689–92. PMID 17683315. Gubrium JF, Holstein JA. The new language of qualitative
Ezzo J, Bausell B, Moerman DE, Berman B, Hadhazy method. New York: Oxford University Press; 2000.
V. Reviewing the reviews. How strong is the evidence?
Bibliography 217

Haahr M. Randomness and Integrity Services Ltd. 2010. Kung J, Chiappelli F, Cajulis OO, Avezova R, Kossan G,
https://www.random.org/. Chew L, Maida CA. From systematic reviews to clini-
Ham C, Hunter DJ, Robinson R. Evidence based policy- cal recommendations for evidence-based health care:
making—research must inform health policy as well validation of revised assessment of multiple system-
as medical care. Br Med J. 1995;310:71–2. atic reviews (R-AMSTAR) for grading of clinical rel-
Haynes SN, Richard DCS, Kubany ES.  Content valid- evance. Open Dent J. 2010;4:84–91. https://doi.org/10
ity in psychological assessment: a functional .2174/1874210601004020084.
approach to concepts and methods. Psychol Assess. Lamb T. The retrospective pretest: an imperfect but useful
1995;7(3):238–47. tool. Eval Exchange. 2005;8:18.
Healy E, Jordan S, Budd P, Suffolk R, Rees J, Jackson Laxminarayan R, Duse A, Wattal C, Zaidi AK, Wertheim HF,
I. Functional variation of MC1R alleles from red-haired Sumpradit N, Vlieghe E, Hara GL, Gould IM, Goossens
individuals. Hum Mol Genet. 2001;10(21):2397–402. H, Greko C, So AD, Bigdeli M, Tomson G, Woodhouse
Hinkelmann K, Kempthorne O.  Introduction to experi- W, Ombaka E, Peralta AQ, Qamar FN, Mir F, Kariuki S,
mental design. In: Design and analysis of experi- Bhutta ZA, Coates A, Bergstrom R, Wright GD, Brown
ments, vol. 1. 2nd ed. Hoboken, NJ: Wiley; 2008. ED, Cars O.  Antibiotic resistance-­the need for global
Hollander M, Wolfe DA, Chicken E. Nonparametric sta- solutions. Lancet Infect Dis. 2013;13(12):1057–98.
tistical methods. Hoboken, NJ: Wiley; 2014. Leskovec J, Rajaraman A, Jeffrey D, Ullman JD. Mining
Hosmer DW, Lemeshow S.  Applied logistic regression. of massive datasets. Cambridge: Cambridge University
2nd ed. New York: Wiley; 2000. Press; 2014.
Hulley SB, Cummings SR, Browner WS, Grady DG, Liddle J, Williamson M, Irwig L.  Method for evaluat-
Newman TB, Hearst N.  Designing clinical research. ing research and guidelines evidence. Sydney: NSW
2nd ed. Philadelphia: Wolters Kluwer/Lippincott Health Department; 1999.
Williams & Wilkins; 2001. Liem EB, Lin C, Suleman M, Doufas AG, Gregg RG,
Hund L, Bedrick EJ, Pagano M. Choosing a cluster sam- Veauthier JM, Sessler DI.  Anesthetic require-
pling design for lot quality assurance sampling sur- ment is increased in redheads. Anesthesiology.
veys. PLoS One. 2015;10(6):e0129564. https://doi. 2004;101(2):279–83.
org/10.1371/journal.pone.0129564. W.K.  Kellogg Foundation. Logic model development
Jahn D. Coast Mountain Kingsnake (Lampropeltis zonata guide. Battle Creek, MI: W.K.  Kellogg Foundation;
multifasciata) [photograph]. Santa Cruz: Flickr; 2017. 2004.
Jenkins J, Hubbard S. History of clinical trials. Semin Oncol Lunenburg F. Writing a successful thesis or dissertation:
Nurs. 1991;7(4):228–34. https://doi.org/10.1016/0749- tips and strategies for students in the social and behav-
2081(91)90060-3. ISSN 0749-­2081. http://www.scien- ioral sciences. Thousand Oaks, CA: Corwin Press;
cedirect.com/science/article/pii/0749208191900603. 2008.
Kahn CH. Plato and the Socratic dialogue: the philosophi- Madaus GF, Stufflebeam DL, Kellaghan T.  Evaluation
cal use of a literary form. Cambridge: Cambridge models: viewpoints on educational and human ser-
University Press; 1998. p. xvii. vices evaluation. 2nd ed. Hingham, MA: Kluwer
Kallet RH. How to write the methods section of a research Academic; 2000.
paper. Respir Care. 2004;49:1229–32. Mak K, Kum CK.  How to appraise a prognostic study.
Katz DL. Clinical epidemiology & evidence-based medi- World J  Surg. 2005;29:567. https://doi.org/10.1007/
cine: fundamental principles of clinical reasoning & s00268-005-7914-x.
research. Thousand Oaks, CA: Sage; 2001. Mauchly JW. Significance test for sphericity of a normal
Klute R. Stylised atom with three Bohr model orbits and n-variate distribution. Ann Math Stat. 1940;11:204–9.
stylised nucleus [Stylised atom. Blue dots are elec- McHugh ML.  Multiple comparison analysis testing
trons, red dots are protons and black dots are neu- in ANOVA.  Biochem Med (Zagreb). 2011;21(3):
trons]. 2007. https://commons.wikimedia.org/wiki/ 203–9.
File:Stylised_atom_with_three_Bohr_model_orbits_ McIntyre A.  Participatory action research. Thousand
and_stylised_nucleus.svg. Oaks, CA: Sage; 2009.
Kolmogorov AN.  Foundations of the theory of prob- McMurray F. Preface to an autonomous discipline of edu-
ability. 2nd ed. New  York: Chelsea; 1956. ISBN cation. Educ Theory. 1955;5(3):129–40. https://doi.
0-8284-0023-7. org/10.1111/j.1741-5446.1955.tb01131.x. Accessed 3
Kolmogorov AN.  The theory of probability. In: Mar 2017.
Alexandrov AD, Kolmogorov AN, Lavrent’ev MA, Messick S.  Validity of psychological assessment: vali-
editors. Mathematics, its content, methods, and mean- dation of inferences from persons’ responses and
ing, vol. 2. Cambridge, MA: MIT Press; 1965. performances as scientific inquiry into score mean-
Kruskal W, Wallis A. Use of ranks in one-criterion vari- ing. Am Psychol. 1995;50(9):741–9. https://doi.
ance analysis. J Am Stat Assoc. 1952a;47:583–621. org/10.1037/0003-066X.50.9.741.
Kruskal WH, Wallis WA.  Errata to use of ranks in Millar A, Simeone RS, Carnevale JT. Logic models: a sys-
one-criterion variance analysis. J  Am Stat Assoc. tems tool for performance management. Eval Program
1952b;48:907–11. Plann. 2001;24:73–81.
218 Bibliography

Muir Gray JA. Evidence-based health care: how to make Royse D, Thyer BA, Padgett DK, Logan TK.  Program
health policy and management decisions. London: evaluation: an introduction. 4th ed. Belmont, CA:
Churchill Livingstone; 1997. Brooks-Cole; 2006.
Murdoch TB, Detsky AS.  The inevitable application of Sadler GR, Lee H-C, Seung-Hwan Lim R, Fullerton
big data to health care. JAMA. 2013;309:1351–2. J.  Recruiting hard-to-reach United States population
Nails D. In: Zalta EN, editor. Socrates: Socrates’s strange- sub-groups via adaptations of snowball sampling strat-
ness. The Stanford encyclopedia of philosophy (Spring egy. Nurs Health Sci. 2010;12(3):369–74. https://doi.
2014 Edition); 2017. org/10.1111/j.1442-2018.2010.00541.x.
National Institutes of Health. The Basics. US Sanogo M, Abatih E, Saegerman C.  Bayesian ver-
Department of Health and Human Services; sus frequentist methods for estimating true preva-
2017. p.  20. www.nih.gov/health-information/ lence of disease and diagnostic test performance.
nih-clinical-research-trials-you/basics. Vet J.  2014;202(2):204–7. https://doi.org/10.1016/j.
National Institutes of Health (NIH). NIH clinical research tvjl.2014.08.002.
trials and you: the basics. 2017. https://www.nih.gov/ Scriven M.  The methodology of evaluation. In: Stake
health-information/nih-clinical-research-trials-you/ RE, editor. Curriculum evaluation. Chicago: Rand
basics. Accessed 17 Aug 2017. McNally; 1967.
NightLife Exhibit: Color of Life—Cali. Academy of Scriven M.  The theory behind practical evaluation.
Sciences [Photograph]. Pacific Tradewinds Hostel, Evaluation. 1996;2:393–404.
San Franecisco, 2015. Selvin HC. Durkheim’s suicide and problems of empirical
Norman GR, Monteiro SD, Sherbino J, Ilgen JS, Schmidt research. Am J Sociol. 1958;63(6):607–19. https://doi.
HG, Mamede S. The causes of errors in clinical rea- org/10.1086/222356.
soning: cognitive biases, knowledge deficits, and dual Shaw L, Chalmers T. Ethics in cooperative clinical trials.
process thinking. Acad Med. 2017;92:23–30. Ann N Y Acad Sci. 1970;169:487–95.
Patton MQ.  Utilization-focused evaluation. 3rd ed. Shuttleworth M. Construct validity. 2009. Explorable.com:
Thousand Oaks, CA: Sage; 1996. https://explorable.com/construct-validity. Accessed 18
Pedhazur EJ, Schmelkin LP.  Measurement, design, and Jul 2017.
analysis: an integrated approach. Hove: Psychology Simmonds MC, Higgins JPT, Stewart LA, Tierney JF,
Press; 2013. Clarke MJ, Thompson SG. Meta-analysis of individ-
Perkins J, Wang D.  A comparison of Bayesian ual patient data from randomized trials: a review of
and frequentist statistics as applied in a simple methods used in practice. Clin Trials. 2005;2:209–17.
repeated measures example. J  Modern Appl Statist Skelly AC, Dettori JR, Brodt ED.  Assessing bias: the
Methods. 2004;3(1):24. https://doi.org/10.22237/ importance of considering confounding. Evid Based
jmasm/1083371040. http://digitalcommons.wayne. Spine Care J.  2012;3(1):9–12. https://doi.org/10.105
edu/jmasm/vol3/iss1/24. 5/s-0031-1298595.
Pinsker J.  The psychology behind Costco’s free sam- Steup M.  In: Zalta EN, editor. Epistemology. Stanford
ples. 2014. https://www.theatlantic.com/business/ Encyclopedia of Philosophy. Stanford, CA:
archive/2014/10/the-psychology-behind-costcos-free- Metaphysics Research Lab, Stanford University; 2005.
samples/380969/. Accessed July 2017. Stufflebeam DL.  The CIPP model for program evaluation.
Pocock SJ. Clinical trials: a practical approach. New York: In: Madaus GF, Scriven M, Stufflebeam DL, editors.
Wiley; 2004. Evaluation models: viewpoints on educational and human
Pratt J. Remarks on zeros and ties in the Wilcoxon signed services evaluation. Boston: Kluwer Nijhof; 1993.
rank procedures. J Am Stat Assoc. 1959;54:655–67. Suresh K, Thomas SV, Suresh G.  Design, data analysis
Racino J. Policy, program evaluation and research in dis- and sampling techniques for clinical research. Ann
ability: community support for all. London: Haworth Indian Acad Neurol. 2011;14(4):287–90. https://doi.
Press; 1999. org/10.4103/0972-2327.91951.
Raue A, Kreutz C, Theis FJ, Timmer J.  Joining Tolman RC.  The principles of statistical mechanics.
forces of Bayesian and frequentist methodol- Mineola, NY: Dover; 1938. ISBN: 9780486638966.
ogy: a study for inference in the presence of non-­ Translational Science Spectrum. 2016. https://ncats.nih.
identifiability. Philos Trans A Math Phys Eng Sci. gov/translation/spectrum. Accessed 11 Jul 2017.
2013;371(1984):20110544. https://doi.org/10.1098/ Trochim WM. Levels of measurement. 2006. http://www.
rsta.2011.0544. socialresearchmethods.net/kb/measlevl.php. Accessed
Rees JL, Flanagan N.  Pigmentation, melanocortins and July 2017.
red hair. QJM. 1999;92:125–31. Vallverdu J. Bayesians versus frequentists a philosophical
Renganathan V.  Overview of frequentist and Bayesian debate on statistical reasoning. New  York: Springer;
approach to survival analysis. Appl Med Inform. 2016.
2016;38(1):25–38. Vandenbroucke JP, von Elm E, Altman DG, et  al.
Robinson WS.  Ecological correlations and the behavior Strengthening the reporting of observational studies
of individuals. Am Sociol Rev. 1950;15(3):351–7. in epidemiology (STROBE): explanation and elabora-
JSTOR 2087176. tion. Int J Surg. 2014;12(12):1500–24.
Bibliography 219

Vogt DS, King DW, King LA.  Focus groups in psycho- Wanjek C.  Oops! 5 Retracted Science Studies of
logical assessment: enhancing content validity by 2012. 2012. http://www.livescience.com/25750-
consulting members of the target population. Psychol science-journal-retractions.html. Accessed 11 Jul 2017.
Assess. 2004;16(3):231–43. Wasserman L.  All of nonparametric statistics. Berlin:
von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche Springer; 2007.
PC, Vandenbroucke JP, STROBE Initiative. The Wasserstein RL, Lazar NA. The ASA’s statement on p-values:
Strengthening the Reporting of Observational Studies context, process, and purpose. Am Stat. 2016;70(2):129–
in Epidemiology (STROBE) statement: guidelines 33. https://doi.org/10.1080/00031305.2016.1154108.
for reporting observational studies. J Clin Epidemiol. West S, King V, Carey TS, et al. Systems to rate the strength of
2008;61(4):344–9. PMID: 18313558. scientific evidence: summary. In: AHRQ Evidence Report
Wagensberg J.  On the existence and uniqueness of the Summaries. Rockville (MD): Agency for Healthcare
scientific method. Biol Theory. 2014;9(3):331–46. Research and Quality (US); 1998–2005, 2002. p.  47.
https://doi.org/10.1007/s13752-014-0166-y. https://www.ncbi.nlm.nih.gov/books/NBK11930/.
Wagner C, Esbensen KH.  Theory of sampling: four Wilcoxon F. Individual comparisons by ranking methods.
critical success factors before analysis. J  AOAC Biom Bull. 1945;1:80–3.
Int. 2015;98(2):275–81. https://doi.org/10.5740/ Witte RS, Witte JS.  Statistics. 10th ed. Hoboken, NJ:
jaoacint.14-236. Wiley; 2014.
Walker JS.  Figure  8: Electron cloud model for the 1s Woodson CE.  Parameter estimation vs. hypothesis
orbital. [Digital image]. 2018. http://thebiologyprimer. testing. Philos Sci. 1969;36(2):203–4. https://doi.
com/atoms-and-molecules/. org/10.1086/288247.
Index

A C
Absolute benefit increase (ABI), 172 Case-control study, 19, 20
Absolute risk reduction (ARR), 172 Categorical data, 39, 54, 124, 125, 129–136, 208
Addition rule, 64, 65 Categorical variable, 39, 40, 112, 203
Agency for Healthcare Research and Quality (AHRQ), Central limit theorem, 61, 75, 76, 85
146, 147, 169, 177 Chi-Square (χ2) tests, 130–133, 198–199
AGREE instrument, 174 Clinical significance, 171
Alpha level, see Level of significance Clinical trials
Alternative hypothesis, 77, 80, 95, 97, 110, 126–128, 206 controlled trials, 23
Analysis of variance (ANOVA), 120, 129, 134, 135 crossover trials, 23
dependent variable, 109 definition, 22
F distribution, 109 public health, 24
F-statistic, 107 randomized trials, 23
independent variable, 109 run-in trials, 23
mean squares, 107 single-blinded and double-blinded clinical trials, 23
population means, 107 study design tree, 24
post hoc analyses, 109 run-in trials, 23
steps for, 110, 111 Cluster sampling, 33
three different groups testing, 107 Cochran’s Q test, 129, 133
variability between and within groups, 108 Coefficient of determination (R2), 119, 120, 122,
Anderson–Darling test, 130 134, 208
Antimicrobial resistance, 178–180 Cohen’s kappa, 130
Aposematism, 72 Cohort study
Appraisal of Guidelines for Research and Evaluation cause–effect relationship, 17
Enterprise (AGREE), 174 definition, 17
Aragon’s Primary Provider Theory, 146 exposed and unexposed, 18
for tuna fish casserole, 19
incidence, 18
B limitations, 19
Bar chart, 50–52 nested study, 18, 19
Basic/generic/pragmatic qualitative and quantitative prospective study, 18
research and evaluation, 163 relative risk, 18
Bayesian approach, 124, 136, 151, 167, 180, 183, 211 strengths, 19
Bayesian biostatistics in translational research, 180, 181 types of, 18
Bayesian hierarchical modeling, 181 Comparative effectiveness analysis (CEA), 160
Bayesian statistics, 66, 124, 168, 169, 181 Comparative effectiveness and efficacy research and
Bell-shaped curve, 44, 59, 61 analysis for practice (CEERAP), 160
Bell-shaped frequency polygon, 60 Comparative effectiveness research and evaluation
Bibliome, 145, 176, 182, 183, 211 (CERE), 164
Big data, 155, 182 Comparative effectiveness research (CER), 11, 142, 143,
Big data analysis, 182 160, 164
Bimodal distribution, 63 CERID
Blocking principle, 21 creating and disseminating new knowledge,
Bonferroni correction, 128, 129 179, 180

© Springer-Verlag GmbH Germany, part of Springer Nature 2018 221


A. M. Khakshooy, F. Chiappelli, Practical Biostatistics in Translational Healthcare,
https://doi.org/10.1007/978-3-662-57437-9
222 Index

Comparative effectiveness research (CER) (cont.) Continuous assessment, 131, 162


to healthcare, 178 Control event rate (CER), 171
urgency of, 178, 179 Controlled trials, 23, 172
translational methodological issues, 172 Correlation coefficient (r), 115, 134
AGREE-II, 174 direction, 114, 115
bibliome, 176 father’s height and son’s height, 113
clinical significance, 171 formula, 113
cochrane group and AHRQ, 174 hypothesis tests of significance, 115
dose response gradient, 174 sample, 115
GRADE, 173, 175 strength, 113–115
large effect, 174 t ratio, 115
levels of evidence, 170 t test, 115
MEDLINE search strategy, 175 Cox proportional hazard regression analysis, 133, 134
NNT, 171, 172 Criterion validity, 34, 35
PICOTS, 175, 176 Criterion-referenced testing, 159
random sampling, 175 Critical value, 93–95, 188–193
research design, 170 Crossover trials, 23, 104
risk of bias assessment tool, 174 Cross-sectional study, 20
sampling, 175 Cumulative frequency (cf), 48–50, 69
semester-continuous scores, 174
statistical significance, 171
total final score of evidence quality, 174 D
two-by-two table, 171 Data, definition, 37
US Preventive Services Task Force, 173 Data analysis
USPSTF, 172 assumptions of parametric statistics, 76
new Frontiers in dissemination, 176, 177 definition, 73, 92
patient-centeredness, 170 hypotheses, 76, 77
research synthesis, 169 research pathway, 74
systematic reviews, 169 sampling distribution, 74, 75
translational effectiveness, 168, 169, 180–182 Decision-making process
translational research errors in, 81, 82
Bayesian biostatistics in translational research, null hypothesis, 93
180, 181 power analysis, 82, 83
individual patient data meta analysis, 181, 182 rejecting and retaining null hypothesis, 81
translational healthcare challenges, CIPER, 178 statistical significance, 80, 81
Comparative Effectiveness Research for Infectious Degrees of freedom (df), 100, 109, 110, 128, 129, 131,
Diseases (CERID) 132, 134
creating and disseminating new knowledge, 179, 180 Dependent samples t test, 102, 104–106
to healthcare, 178 Dependent variable, 109–111, 116–118, 120, 124, 135
urgency of, 178, 179 Dependent, rules of probability, 65
Comparative Individual Patient Effectiveness Research Descriptive statistics
(CIPER), 142, 154, 155, 178 cumulative frequency, 69
Comparative inferences, 164 definition, 45
Comparative trial, see Controlled trials distribution, 59–63
Conceptual definition, 158 frequency distribution, 69
Concurrent criterion validity, 35 graphs
Conditional rule, 65 bar chart, 51
Confidence interval (CI), 72, 84–86, 88, 90, 97, 98, 101, frequency polygon, 52
103, 106, 111, 120, 140, 173, 181 histogram, 50–52
Confidence limits, 84 step-by-step protocol, 51
Construct validity, 34 Heisenberg uncertainty principle, 45
Content validity, 34 measures of central tendency, 43, 53–55
Contingency table, 54, 57, 129–131 measures of variability, 44, 55–59
Continuous data analysis probability, 63
associations Bayesian vs. frequentist approach, 66
correlation, 112–115 definition, 63
predictions, 116–120 formula, 63, 64
within group and singular group, 112 rules of, 64–65
Continuous variable, 39, 59, 61, 112, 115, 116, 118, 134, Z-transformation, 66–69
135, 203 raw data, 43
Index 223

Rutherford-Bohr model, 44 sustainable communities, stakeholder


standard deviation, 69 engagement, 165
tabulation Expanded version of GRADE (Ex-GRADE), 174
cumulative frequency, 48–50 Experimental design
cumulative frequency percent (cf%), 48 clinical trials, 22
cumulative frequency, crisscross method for, 49 controlled trials, 23
dichotomous variable question, 50 crossover trials, 23
frequency percent (f%), 48 definition, 22
frequency table, 46–48 public health, 24
midpoint, 49 randomized trials, 23
nominal variable, 50 run-in trials, 23
ordinal variable, 50 single-blinded and double-blinded
randomly collected systolic blood pressures, clinical trials, 23
46, 47 study design tree, 24
rules for constructing tables for grouped data, 47 control group, 21
Diagnostic studies experimental group, 21
definition, 15 possible consideration, 21
reliability and validity, 15, 16 quasi-experimental design, 22
reliability and validity specificity and randomization, 21
sensitivity, 16 statistical replication, 22
sensitivity, 16 Experimental event rate (EER), 171
specificity, 16 Experimental studies, 14, 21–23, 136
validity and predictive value formulas, 17 External validity, 89, 151
Dichotomous measure, 40
Directional/one-tailed tests, 96
Directional/two-tailed test, 95
F
Directional vs. nondirectional tests, 95, 96
Factor analysis, 135
Discrete variable, 39
Fallacies, 3, 8, 87, 158, 168, 172, 174, 201
Disseminating Patient-Centered Outcomes Research to
Feasible, interesting, novel, ethical, and relevant
Improve Healthcare Delivery Systems, 177
(FINER), 6
Distribution-free method, 124
F distribution, 109, 129
Field research design, 14
Fisher’s test, 133
E
Fisherian formation, 124
Educated guess, 6, 87
Fisherian probabilistic statistics, 125
Effect size (ES), 82, 83
Formative evaluations, 163, 164
Electronic Data Methods Forum, 177
Frequency distribution, 46, 59, 69
Elementary and Secondary Education Act (ESEA), 159
Frequency percent (f%), 48, 50
Error fractionation, 87, 93, 119
Frequency polygon, 52, 53, 60
Errors in research
Frequency table, 46–48
errors of judgmentorfallacies, 8
Frequentist approach, 66, 168, 169
random errors, 9
Frequentist statistical inference, 124
systematic errors, 8
Friedman test, 128, 129
types of errors, 8
F-statistic, 107, 108, 110, 120
Errors of judgment, 3, 8, 201
F test, 107–111, 120, 129
Estimated standard error, 102, 103
Fudged, 7
Ethnographic research and evaluation, 163
Evaluation
comparative inferences, 164
conceptual definition, 158 G
formative vs. summative evaluations, 163, 164 Gaussian/bell-shaped curve, 44
historical and philosophical models, 158–160 Gaussian distribution, 61
program-related decision-making, 158 Geisser–Greenhouse Correction for Heterogeneous
qualitative, 158, 162, 163 Variances, 129
quantification, 162 Generalizability (G) theory, 36, 117
quantitative, 158, 163 Grading of Recommendations Assessment,
strengths and deficiencies, 160, 161 Development, and Evaluation
translational research, sustained evolution of (GRADE), 173–175
ethical recommendations, 165 Greenhouse–Geisser Estimate Epsilon, 129
PARE, 164, 165 Grounded theory research and evaluation, 163
224 Index

H
Health information technologies, 177, 180
Health sciences, 9, 10
Heisenberg uncertainty principle, 45
HIPAA regulations, 165
Histogram, 50–53, 57, 59, 60, 66, 74
Historical and philosophical models, 158–160
Homogeneity of Variance, 76, 92, 120, 123–125, 128, 129, 135, 209
Homoscedasticity, 120, 135
Hypothesis-driven research
  conclusion, 6, 7
  data analysis, 6
  educated guess, 6
  methodology, 6
  research questions, types of, 6
  study design, 6
  study hypothesis, 6, 7
Hypothesis testing, 6, 86–88, 91
I
Impact evaluation program, 152
Incidence, definition of, 18
Independence, rules of probability, 64
Independence of measurement, 76
Independent samples t test, 101–103
Independent variable, 109–111, 116–120, 129
Individual participant-level data meta-analysis (IPD MA), 149, 181, 182
Individual patient data (IPD), 177
  patient-centered inferences
    individual patient data analysis, 149
    individual patient data evaluation, 151–153
    individual patient data meta-analysis, 149, 150
  patient-centered outcomes
    individual patient outcome research, 147, 148
    individual patient review, 148, 149
    primary provider theory, 145–147
  stakeholder mapping, 143–145
  translational research, sustained evolution of
    CIPER, 154, 155
    logic model, 153
    repeated measure model, 153, 154
  vs. aggregate data, 142, 143
Individual patient data analysis, 149
Individual patient data meta-analysis, 149, 150, 181, 182
Individual patient evaluation, 151–153
Individual patient outcome research, 147, 148
Inferential statistics
  ANOVA, 107, 111
    F distribution, 109
    F-statistic, 107
    independent variable, 109
    mean squares, 107
    population means, 107
    post hoc analyses, 109
    steps for, 110, 111
    three different groups testing, 107
    variability between and within the groups, 108
  aposematism, 72
  California mountain kingsnake, 72
  continuous data analysis (see Continuous data analysis)
  dart frog, 72
  data analysis
    assumptions of parametric statistics, 76
    definition, 73, 92
    hypotheses, 76, 77
    research pathway, 74
    sampling distribution, 74, 75
  definition, 73
  estimation, 72, 84, 85
  hypothesis testing, 72, 86–88
  population–sample interaction, 73
  research process, 73
  significance, 80
    decision-making process (see Decision-making process)
    level of significance, 77–79
    p-value, 79, 80
  statistical tests
    critical value, 93–95
    directional vs. nondirectional tests, 95, 96
    error fractionation, 93
  study validity, 88
    external validity, 89
    internal validity, 88, 89
  three-legged stool, 92
  two-group comparisons, 99
    t test (see t test family)
    z test, 96–99
  within groups and within a singular group, 112
Inpatient quality indicators, 147
Instrument validity, 34, 35
Intention to treat (ITT) analyses, 152, 170
Internal validity, 88, 89
Interquartile range (IQR), 55, 56
Inter-rater reliability, 35
Interval measures, 39
K
Kaplan–Meier survival and Cox test, 133–134
Kendall's tau (T), 130, 134
Kendall's W, 130
Kinesthesia, 37
Kolmogorov–Smirnov test, 130
Kruskal–Wallis for One-Way ANOVA, 128
Kruskal–Wallis H test, 128
Kuiper's test, 130
Kullback–Leibler divergence, 181, 182
L
Least-squares regression line, see Linear regression line
Level of confidence, 72, 85
Level of significance, 77–79, 94
Levels of evidence, 87, 170
Likert scale, 175
Linear regression line, 117, 122
Logic model, 153
Logistic regression, 134–136
Logrank nonparametric test, 133
Logrank test, 130, 134
M
Management-oriented model, 158–160
Mann–Whitney U test, 126–128, 194–197
Mantel–Haenszel χ2 test, 133
Margin of error, 84
McNemar test, 129, 133
Mean, 54, 57
Mean difference, 102–106, 206
Mean square (MS), 107, 110
Measures of central tendency, 43, 53–55, 62
Measures of variability, 44
  IQR, 56
  mean, 57
  outliers, 55
  parameters, 59
  quantitative data, 56
  range, 55
  standard deviation, 57–59
  statistics, 59
  sum of squares (SS), 58
  variance, 59
MedCalc, 149, 178
Median, 43, 54–56, 62, 69, 126–128, 205
MEDLINE search strategy, 175
Messick's Unified Theory of Construct Validity, 34
Midpoint, 49, 50, 52
Mixed model analysis, 151
Mode, 43, 54, 55, 61, 62, 69
Multiple linear regression, 119, 120, 135
Multiplication rule, 64, 65
Mutual exclusivity, 64
N
National Defense Education Act (NDEA), 159
Naturalistic studies, 14
Negatively skewed (left-skew) distribution, 62
Nested study, 18, 19
Nominal measure, 40, 203
Noncontinuous assessment, 162
Nonparametric statistics
  Bayesian statistics, 124
  categorical data analysis, 125, 129, 130
    association and prediction, logistic regression, 134–136
    time series analysis with χ2, 133, 134
  categorical data analysis, Chi-Square tests, 130–133
  characteristic, 124
  comparing two groups, 126
    Mann–Whitney U test, 127, 128
    Wilcoxon Rank-Sum, 126, 127
    Wilcoxon signed-rank test, 127
  continuous data, 125
  Fisherian probabilistic statistics, 125
  frequentist statistical inference, 124
  Friedman test, 129
  Geisser–Greenhouse correction for heterogeneous variances, 129
  Kruskal–Wallis for One-Way ANOVA, 128
  population, 125
  quick and dirty preliminary test, 126
Normal distribution, 61, 62, 76
Null hypothesis, 7, 77–83, 88, 93, 95, 97, 99, 101, 103, 104, 106–110, 115, 120–122, 127, 128, 132–134, 206
Number needed to treat (NNT), 170–172
O
Observational design
  case-control study, 19, 20
  cohort study
    cause–effect relationship, 17
    definition, 17
    exposed and unexposed, 18
    incidence, 18
    limitations, 19
    nested study, 18, 19
    prospective study, 18
    relative risk, 18
    strengths, 19
    for tuna fish casserole, 19
    types of, 18
  cross-sectional studies, 20
  longitudinal, 17
  risk factors and outcomes, 17
Observational studies, 14, 172, 174, 202
Odds ratio (OR), 20, 171
One-sample t test, 99–101
One-Way ANOVA, Kruskal–Wallis for, 128
Ordinal measure, 40, 55, 203
Original score, 67–69
Outcome variable, see Dependent variables
P
Paired measures, 104
Parameters, 59, 61, 73, 76, 84, 88, 107, 123, 125, 181
Participant observation, 14
Participant-oriented model, 161
Participatory action research and evaluation (PARE), 158, 164, 165
Participatory evaluation measurement instrument (PEMI), 175
Patient expected event rate (PEER), 171
Patient safety indicators, 147
Patient-centered inferences
  individual patient data analysis, 149
  individual patient data evaluation, 151–153
  individual patient data meta-analysis, 149, 150
Patient-centered outcomes
  individual patient outcome research, 147, 148
  individual patient review, 148, 149
  primary provider theory, 145–147
Patient-centered outcomes evaluation (PCOE), 142, 151, 152, 154
Patient-centered outcomes research (PCOR), 142, 143, 147–151
Patient-centeredness, 146, 167, 169, 170
Pediatric quality indicators, 147
Phenomenology research and evaluation, 163
Pitman's permutation test, 130
Point estimate, 84, 85
Polymodal distributions, 62, 63
Pooled variance, 102
Population, 125
Population, intervention, comparator, outcome, timeline, setting (PICOTS) question, 14, 169, 175, 176
Positively skewed (or right-skew) distribution, 62
Post hoc analyses, 109
Post-before-Pre model, 154
Post-then-Pre model, 154
Power analysis, 82, 83
Predictive criterion validity, 35
Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist, 143, 150
Prevalence formulae, 21
Prevalence, definition of, 20
Prevented/preventable fraction (PF), 170
Prevention quality indicators, 146
Primary provider theory, 145–147
Probability
  Bayesian vs. frequentist approach, 66
  definition, 63
  formula, 63, 64
  rules of, 64, 65
Prognosis, definition of, 16
Prognostic study
  definition, 17
  typical treatment–control relationship, 17
Prospective impact evaluation, 152
Prospective study, 18, 202
Psychometric theory and measurement, 159
p-value, 79, 80
Q
Qualitative data, 28, 38–41, 46, 50, 51, 54, 55, 73, 74, 76, 162, 163, 166
Qualitative evaluation, 162, 163
Qualitative-anthropological model, 161
Quantification, 162
Quantitation, 162
Quantitative evaluation, 163
Quasi-experimental design, 22, 152, 160
R
Random errors, 3, 8, 9, 21, 75, 84, 89, 104, 107, 108, 117, 133, 201, 202
Random sampling, 32, 175
Randomization, 21, 22, 31, 148, 150, 172, 202
Randomized clinical trials, 142, 170
Randomized controlled trials (RCT), 22, 142
Randomized trials, 22, 23
Rank products, 130
Ratio measures, 39
Regression, definition, 116
Regression coefficient (b), 118, 120, 122, 133–135
Relative risk, 18
Reliable instrument, 35, 36
Repeated measure model, 104, 134, 153, 154
Research design, 124, 170
Research methodology, 28, 30–39
  data acquisition, 36–39
    kinesthesia, 37
    quantitative vs. qualitative, 38, 39
    variables (see Variables)
  data analysis, 28
  measurement, 33
    instrument validity, 34, 35
    reliable instrument, 35, 36
    researcher-completed instruments, 34
    subject-completed instruments, 34
  principal domains of, 28
  sample, 27
  sample vs. population, 28–30
  sampling technique, 27, 30–33
Research process, 13
  definition, 5
  errors in research, 7–9
  hypothesis-driven research, 6, 7
  inferential statistics, 73
  methodology, study design and data analysis, 5
  steps of, 5
Research synthesis, 145, 151, 154, 155, 169, 171, 175, 211
Researcher-completed instruments, 34
Residual error, 117, 118
Response shift, 153, 154, 210
Result translation, 11, 24, 166, 201, 211
Retrospective impact evaluation, 153
Retrospective time component, 19
Review Manager (RevMan) software, 149, 178
Revised version of AMSTAR (rAMSTAR), 174–175
Risk ratio, 18
Ronald Fisher (biologist and statistician), 22
Run-in trials, 23
Rutherford–Bohr model, 44
S
Sampling, 27, 175
Sampling distribution, 74, 75, 97, 100
Sampling distribution of mean, 74, 75
Sampling distribution of mean difference, 102, 105
SCHARP, 149, 178
Scheffé test, 109
Schrödinger–Heisenberg atomic model, 45
Scientific management of data, 159
Scientific method, 6–9, 159
  biostatistics today
    health sciences, 9, 10
    translational healthcare, 10, 11
  definition, 5
  inductive and deductive reasoning, 5
  investigating phenomena, 5
  research process
    definition, 5
    errors in research, 7–9
    hypothesis-driven research, 6, 7
    methodology, study design and data analysis, 5
    steps of, 5
  Socratic method, 4
Scientific-experimental model, 160
Semicontinuous, 39
Semicontinuous measurement, 162
Semi-parametric, 133
Siegel–Tukey test, 126
Simple linear regression, 117–119
Simple random sampling, 31
Small and matched designs, 130–133
Socratic method, 4
Spearman rank, 130, 134
Spearman's rank correlation coefficient, 130
Sphericity, 129
Squared ranks test, 126
Stakeholder analysis process, 144
Stakeholder mapping, 144, 145
Stakeholders, 143
Standard deviation, 57–59, 69
Standard error of the mean (SEM), 75, 84, 85
Standard error of the mean difference, 102, 105
Standard normal curve, 67
Standard normal distribution (z), 186–187
Statistical bootstrap method, 130
Statistical hypotheses, 77, 82, 87
Statistical replication, 22
Statistical significance, 171
Statistical tests
  critical value, 93–95
  directional vs. nondirectional tests, 95, 96
  error fractionation, 93
Stratified sampling, 32
Student's t test, see t test family
Study designs, 15, 17
  core concepts, 13, 14
  diagnostic studies (see Diagnostic studies)
  experimental studies, 14
  naturalistic studies, 14
  observational studies, 14
  prognostic study, 17, 21
    definition, 17
    experimental study (see Experimental design)
    observational design (see Observational design)
    typical treatment–control relationship, 17
Subject-completed instruments, 34
Sum of squares (SS), 58, 110
Summative evaluations, 163, 164
Survival Kaplan–Meier curve, 133, 134
Sustainable Communities, Stakeholder Engagement, 165
Systematic errors, 8, 14
Systematic evaluation of the statistical analysis (SESTA), 136
Systematic reviews, 150, 152, 168–170, 172, 175, 181–182, 211
Systematic sampling, 32
T
Telecare, 176, 177
Teleconsultation, 176, 177
Telehealth, 177
Time series analysis with χ2, 133, 134
Translational effectiveness, 10, 167–170
Translational healthcare, 10, 11, 13, 28, 168, 178, 182
Translational research, 10, 28, 71, 73
Translation to primary care, 177
Treatment-on-the-treated (TOT) analyses, 152
t test family, 115
  dependent samples t test, 102, 104–106
  independent samples t test, 101–103
  one-sample t test, 99–101
Tukey–Duckworth test, 126
U
UK Medical Research Council Clinical Trials Units, 149
US Preventive Services Task Force (USPSTF), 173
V
Variables
  qualitative data, 39, 40
  quantitative data, 39
W
Wald–Wolfowitz test, 130
Wilcoxon Rank-Sum test, 126, 127
Wilcoxon signed-rank test, 126, 127
Y
Yates' correction for continuity, 132
Z
z test, 96–99
  calculation, 97
  conclusion, 98
  confidence interval formula, 97
  decision rule, 97
  hypotheses, 97
  population–sample interaction, 96
  research question, 97
  sampling distribution of z, 97
Z-score, 67, 68, 70, 93, 94
Z-transformation, 66–69
