
Applied Healthcare Analytics

August 2023

Contents

1 Health Care Data
  1.1 Claims data
  1.2 Medical records
  1.3 Health surveys
  1.4 Public health surveillances

2 Preliminaries
  2.1 Variables
  2.2 Summarising data
    2.2.1 Categorical data
    2.2.2 Numerical data
  2.3 Samples and populations
  2.4 Random variables and distributions
  2.5 Joint distributions
  2.6 Families of distributions
    2.6.1 Bernoulli trials and binomial distribution
    2.6.2 Poisson distribution
    2.6.3 Exponential distribution
    2.6.4 Uniform distribution
    2.6.5 Gamma distribution
    2.6.6 Normal distribution
    2.6.7 Transformations
  2.7 Estimation
  2.8 Hypothesis testing

3 Regression
  3.1 Scatter plot
  3.2 Correlation
  3.3 Simple linear regression
  3.4 Multiple linear regression
  3.5 Interpretation of results
  3.6 Regression diagnostics
  3.7 The selection of variables
    3.7.1 Forward selection
    3.7.2 Backward selection
    3.7.3 Stepwise selection
    3.7.4 Block inclusion
    3.7.5 Cautions about stepwise procedures

4 Logistic Regression
  4.1 Logistic regression
  4.2 Evaluation of the model
    4.2.1 Overall assessment of the model
    4.2.2 Residuals
    4.2.3 Goodness-of-fit of the model
    4.2.4 Unusual observations
  4.3 Predictive accuracy and discrimination

5 Regression Models for Count Outcome
  5.1 Poisson regression
  5.2 Negative Binomial regression
  5.3 Hurdle regression
  5.4 Zero-inflated regression
  5.5 Generalised linear model

6 Survival Analysis
  6.1 Survival time
  6.2 Survival, hazard and cumulative hazard
  6.3 Kaplan-Meier product limit estimator
  6.4 Comparing survival curves
  6.5 Cox proportional hazards model
  6.6 Time varying coefficients
  6.7 Time dependent covariates
  6.8 Assessing model fit

7 Classification and Clustering
  7.1 Bayes rule
  7.2 Multivariate normal distribution and Mahalanobis distance
  7.3 Linear discriminant analysis
  7.4 Quadratic discriminant analysis
  7.5 Connection to logistic regression
  7.6 Cluster analysis

8 Dimension Reduction and Regularisation
  8.1 Principal component analysis
  8.2 Ridge regression
  8.3 Least absolute shrinkage and selection operator (LASSO)

9 Causation vs. Correlation
  9.1 Counterfactuals and potential outcomes
  9.2 Confounders
  9.3 Estimating the causal effect
  9.4 Matching
  9.5 Stratification and regression adjustment
  9.6 Inverse probability weighting

10 Simulation Methods
  10.1 The Bootstrap
    10.1.1 Estimating standard error and confidence interval of the mean
    10.1.2 Estimating standard error and confidence interval of an odds-ratio
  10.2 Cross-validation
    10.2.1 Prediction using a logistic model
    10.2.2 Choosing the shrinkage parameter λ in LASSO
  10.3 Comparison of alternate approaches
    10.3.1 Experiment 1
    10.3.2 Experiment 2
    10.3.3 Experiment 3
1 Health Care Data

Healthcare analytics encompasses the methods of collecting, analysing and linking data about people and their health. Health care data can come from a variety of sources. Broadly, we distinguish between
administrative and non-administrative data. Administrative data are information collected, processed, and
stored in automated information systems. They consist of demographic, diagnostic and clinical information
routinely collected about patients. The major sources of administrative data are those relating to use of
hospital services; and health insurance claims, including both private and public payers. They are a part
of any health care system and normally large in size. Therefore they are a relatively cheap and important
source of data for health care research. However, since these data are not originally collected for research,
care must be exercised when they are used for health care research. Non-administrative data include those
collected through registries and surveys. They consist of information about disease, risk factors, and exposure
in a targeted population. These data are a useful source to determine the natural history of diseases; to
understand variations in treatment and outcomes; to examine the quality of care; and to assess effectiveness
of policies and programs. Yet, they also have their shortcomings.
Each data type provides some evidence for answering a range of questions, but none is capable of providing all the evidence. For example, randomised controlled trials (RCTs) are experiments designed to test hypotheses that can ultimately be applied to real-world care. However, they are normally conducted under strict conditions, with detailed protocols and inclusion and exclusion criteria. They are often very expensive to carry out and are small in size. These conditions limit their generalisability. In contrast, data sources such as claims data, health records, public health surveillances and surveys collect comprehensive information on a large number of patients. They also evaluate care as it is actually given, rather than as determined by a protocol. As a result, they are more representative of real-world outcomes. However, they lack the designs
and controls to address potential biases. It is important for us to understand the strengths and limitations of
each data source, and the complementary roles they play in evaluating outcomes or decision making. Below
we briefly summarise some key characteristics of the different sources of data.

1.1 Claims data

Health care administrative data, also known as administrative claims data, are derived from claims for
reimbursement for routine health care services. They are relatively inexpensive to ascertain and, in general,
readily available in electronic format. In many countries, the format of the data is standardised and contains
a listing of diagnoses, procedures, prescribed medications, and providers used, in addition to financial measures
such as billed amounts, reimbursed amounts, and patient cost sharing. Nearly all claims databases also
use harmonised disease and billing codes, making it easier to compare and link databases across different
geographical regions.


There are a number of advantages to using claims data as a tool for healthcare research. The sample sizes of claims data are often large, which allows results to be obtained with sufficient accuracy. This is especially useful in studying rare medical conditions. Claims databases often cover a relatively diverse group of individuals receiving care in various settings across large geographic regions. For this reason, studies using claims data are more easily generalisable to practices and outcomes in the population. Once established, claims databases normally do not have a defined end date, and hence they can be used to study the temporal evolution of utilisation patterns, outcomes, and costs of care. Information in claims data comes directly from medical providers and pharmacists, and is considered to be more accurate than patient self-reports, which may suffer from biases. For instance, individuals may be reluctant to disclose information, such as drug usage, that is considered socially undesirable, or they may enter erroneous responses due to an inability to recall past events.
Because claims data exist primarily for billing and reimbursement purposes, there are a number of
limitations when they are used for healthcare research. First, claims data lack the clinical details of the
conditions and outcomes documented in the medical records. For instance, claims for a treatment for
depression will not contain information of the severity of depression. This makes it difficult to use the data to
answer questions about the quality of care. Another limitation of claims data is the prevalence of incomplete
data. This may arise when a medical condition cannot be associated with any of the standardised codes.
Another form of incomplete data is the lapses of information between claims, such that a patient’s condition is
followed and updated only when a claim arises. Claims data also suffer from the possibility of miscoding, which may be the result of inadequate documentation or misinterpretation of medical conditions. A third limitation
is selectivity. Even though claims databases normally cover a diverse population, they are often criticised
for only covering those who are insured, who may have different risk profiles than those without insurance.
Claims data also only cover services and conditions that are covered by insurance; they will (likely) not contain
information about non-billable services such as immunisations, blood tests, or health screening. Finally, since
claims data are organised around a billing event rather than a particular diagnosis, it is challenging to use
claims data to evaluate outcomes and costs related to a particular medical diagnosis.

1.2 Medical records

Medical records or electronic health records (EHR) are electronic records of clinical and administrative
encounters between a medical provider and a patient. They contain information on health conditions,
interventions, and outcomes at the individual patient level over time. Clinical notes are also captured in
EHRs. They include discharge summaries, treatment plans, and progress notes, which can contain information
about patients that is useful for research purposes. These databases provide a low-cost means of accessing rich
longitudinal data on large populations for healthcare research. Unlike claims data, EHRs include all patients
who received care regardless of insurance status. The large patient samples from EHRs enable researchers to
evaluate multiple risk factors and/or outcomes simultaneously, to test associations in subpopulations, and to
study rare outcomes. Data in EHRs also have the potential to be used in studies of stigmatised conditions,
such as mental health outcomes or HIV, where patient recruitment can be challenging.
Data included in EHRs are intended for clinical and administrative use. These data can be used effectively
for research purposes, but doing so requires some caution. One caution when working with EHR data is
selectivity. EHR data are collected as a result of clinical encounters. The frequency and duration of data
collection is determined by a patient’s health condition, by how and when they seek care, and the treatment
plans determined by the medical provider. Therefore, they may capture a disproportionate number of cases
that are more serious and need more tests or treatment encounters, which may bias estimates of associations
between exposures and outcomes. Since EHRs are created at the point of care, which is dependent on
treatment plans stipulated by a provider, data in EHRs also reflect variations in practice style, knowledge
and skill of the providers who create them. Whereas disease status is often well documented in EHRs, disease aetiology, including causes of disease such as social, behavioural and environmental factors, is often not well documented.


1.3 Health surveys

Health surveys are one of the most common methods used for gathering information on health about
a population. Health surveys generally include measures of risk factors, health behaviours, health care
utilisation, and non-health determinants or correlates of health such as socio-economic status. The range of
measures that can be included is wide and varies by survey. Surveys can be cross-sectional or longitudinal
(sometimes also called panel). Both the cross-sectional and the longitudinal surveys are examples of
observational studies. This means that researchers record information about the participants without
manipulating the study environment. For example, a survey might ask participants questions on their health
behaviour, such as tobacco use, alcohol use, diet, and physical exercise, but it would not influence participants
to modify their behaviour.
A cross-sectional survey collects data about a population of interest at a single point in time. Many
national population-based surveys are cross-sectional. A cross-sectional survey allows comparisons and
information to be drawn at one time point. For example, we may be able to compare the cholesterol levels
between participants who exercise daily and those who don’t. An advantage of using data from a cross-
sectional survey is that it allows researchers to compare many different variables at the same time. For
example, we might look at how age, gender, levels of exercise, etc., are related to cholesterol levels, with
little or no additional cost. However, since a cross-sectional survey is conducted at a single time point, its results may not be representative of other points in time. Furthermore, cause-and-effect relationships are difficult to establish using cross-sectional data. For example, we would not be able to know for sure whether those who engage in daily exercise had low cholesterol levels before taking up their activities, or whether their activities helped to reduce cholesterol levels that previously were high.
In longitudinal surveys, researchers conduct several observations of the same subjects over a period
of time, sometimes lasting many years or even decades. Since longitudinal surveys collect information
over time, they can establish sequences of events. As a result, when linked with appropriate data, such as
individual demographic and behaviour information, longitudinal data allow researchers to evaluate cause-
specific effects on the outcome of interest. For example, we might identify at the beginning of a longitudinal
survey participants who exercise daily, and follow them over time to look at the change in cholesterol levels,
and compare to those who did not engage in daily exercise. Therefore, a longitudinal study is more likely to
suggest cause-and-effect relationships than a cross-sectional study by virtue of its scope. In the following, we
briefly review a few well known large surveys.
The Demographic and Health Surveys (DHS), funded by the USAID, is a collection of more than 260
surveys in over 90 countries. Since its inception in 1984, the DHS has collected information on fertility, family
planning, maternal and child health, gender, HIV/AIDS, malaria, and nutrition. DHS surveys used nationally
representative samples of women of childbearing age and, more recently, samples of males. The surveys
are organised and administered by individual governments, and the survey questions are adapted to each
country’s needs. Most of these surveys are cross-sectional surveys of nationally representative samples. The
surveys are useful in providing a wide variety of health indicators and health services indicators.
The National Health Interview Survey (NHIS) is an annual survey carried out to collect information on
a random sample from the population in the United States. NHIS uses a fresh cross-sectional sample every
year. The NHIS questions cover a wide variety of topics, including medical conditions, health insurance,
doctor visits, and health behaviours. The results have been used to track health status, health care access, and
progress toward achieving national health objectives.
The National Health and Nutrition Examination Survey (NHANES) is designed to assess the health and
nutritional status of adults and children in the United States. The NHANES has been conducted periodically
between 1970 and 1994 and continuously since 1999. Like NHIS, the survey is cross-sectional and collects
information on a different sample every year. The survey is unique in that it combines interviews, physical
examinations and blood tests. This has allowed researchers to use NHANES data to study the relationship
between diet, nutrition, and health. NHANES monitors the prevalence of chronic conditions and risk factors
and produces national reference data on height, weight, and nutrient levels in the blood.
The Malawi Diffusion and Ideational Change Project (MDICP) is a series of longitudinal surveys carried


out in the rural areas in three districts of Malawi. The initial sample was not designed to represent the national
population of rural Malawi, but the sample characteristics closely matched those of the rural sample in the
1996 MDHS. Approximately 25 percent of the households were randomly drawn from 120 villages in the
three districts. The MDICP is a longitudinal study carried out in four phases: 1998, 2001, 2004 and 2006,
respectively. The sample is made up of married women and their husbands in the selected households.

1.4 Public health surveillances

Public health surveillance systems are designed to collect, analyse, interpret, and disseminate data regarding health-related events over time. Public health surveillance systems can be active or passive. Active surveillance involves public health authorities seeking out specific conditions in specific areas. For example, the World Health Organization has developed international health regulations that require reporting of communicable diseases such as cholera, plague, and yellow fever. This type of surveillance is resource intensive and normally lasts only for the duration in which the diseases are prevalent. Passive surveillance is the most common type of surveillance,
requiring minimal resources since cases of disease are not sought out by public health authorities. A common
type of health surveillance is sentinel surveillance, which collects data from a limited number of reporting
sites. This approach is often used in low income countries, where, due to logistic or budget constraints, it is
not feasible to include all reporting sites. Selection of sentinel sites is often based on practical considerations,
such as sites where the disease is considered of particular public health importance or where there is a larger
catchment of cases. However, this typically means that not all individuals with disease have a chance of being
included in the surveillance system. As a result, care and caution must be exercised when generalising the
results. A type of sentinel surveillance system is the antenatal clinic (ANC) survey of pregnant women. These have
been instrumental in providing disease prevalence estimates, such as HIV, in many developing countries with
generalised epidemics. Pregnant women who present at the sentinel sites for their prenatal visit are eligible
for participation in an ANC survey. For each woman selected, limited demographic information is obtained.
HIV testing is carried out on the blood left over from the syphilis test.
In contrast, registries, or population-based surveillance systems, capture reports from every appropriate health facility, with the goal of identifying all cases of a defined disease in a specific geographic
area. Population-based surveillance can either represent the whole country (national) or a sub-national
population area. Since the population is defined, these surveillance sites can produce rates of disease (for
example, incidence and mortality rates), which allows for comparison of rates of disease across different
population-based surveillance sites. Population-based surveillance is more costly than sentinel surveillance,
but produces more generalisable data. An example of a population-based surveillance system is the Surveillance,
Epidemiology, and End Results (SEER) registry. SEER collects information on all diagnosed cancer cases in 19
geographic areas in the US, covering 31 percent of the nation’s population. The objective of the SEER registry
is to track the burden of cancer incidence and mortality in the population. It routinely collects data on patient
demographics, primary tumour site, tumour characteristics and stage at diagnosis, first course of therapy, and
follow-up for survival status.
The Atomic Survivor Registry is a registry of all 105,444 atomic bomb survivors from Nagasaki and
Hiroshima. The registry is maintained by Radiation Effects Research Foundation (RERF), Hiroshima and
Nagasaki, Japan. The database captures information on demographics, as well as the radiation dose received,
distance from the hypocenter, incidence of different types of cancers and respiratory diseases since 1958,
obtained from the tumour registries from Hiroshima City, Hiroshima Prefecture, Nagasaki City, Nagasaki
Prefecture and, the Hiroshima and Nagasaki Tissue Registries.

2 Preliminaries

Healthcare analytics is about solving practical health care problems by collecting, using and interpreting data.
Many of the analytical tools in healthcare analytics have their origins in statistical science. In this chapter, we
review some basic statistical concepts and methods that are relevant to this course.
Healthcare data may be available in different forms. Sometimes they have already been summarised, such
as government statistics on mortality rates, or they may be available at the individual level, such as detailed
patient records. Each dataset consists of observations. On each observation, there may be information on
a number of characteristics that are collected, for example, the age, gender, health behaviour and clinical
outcome. Each characteristic is called a variable. A set of data, then, refers to the collection of information on
all observations.

Example 2.1
In order to increase hospital efficiency, governments have made considerable efforts to reduce in-hospital length of
stay (LOS). In Catalonia, Spain, where there is a national health system, the authorities negotiated with hospitals to
prepay an agreed LOS. The aim of this practice is to encourage reduction in inappropriate hospitalisation, and hence
LOS, since a profit can be made by a hospital when the LOS is lower than that for which they are paid. Table 2.1 shows
a set of data collected to evaluate inappropriate hospitalisation in Barcelona, Spain. The dataset is based on medical
records of 1,383 admissions at a teaching hospital in 1988 and 1990.(a) Since the researchers simply collected the data
to study inappropriate hospitalisation, but did not make any attempts to influence practices at the hospital, this is an
example of an observational study. Furthermore, even though the data were collected over two different time points,
since there is no indication that the admissions over the two time points were linked, the data would still be considered
cross-sectional.

Table 2.1: Inappropriate hospitalisation in Hospital Universitari del Mar

Admission LOS(b) Inap(c) Gender(d) Ward(e) Year(f) Age(g)


1 15 0 2 2 88 55
2 42 20 2 1 88 73
3 8 6 1 1 88 74
4 9 6 1 2 88 78
5 7 0 1 2 88 57
6 10 2 2 2 88 47
7 8 6 1 2 88 70
8 8 0 2 2 90 40
... ... ... ... ... ... ...
1379 11 9 2 1 90 45
1380 18 11 2 1 90 85
1381 23 23 1 2 90 70
1382 3 0 2 2 90 35
1383 2 0 2 2 90 57

Notes: (a) Alonso J., Muñoz A., Antó J.M. Using length of stay and inactive days in the hospital to assess appropriateness of utilisation in Barcelona, Spain. J Epidemiology Community Health. 1996 Apr;50(2):196-201. (b) Length of stay in days. (c) Number of days of inappropriate hospitalisation. (d) 1 = male, 2 = female. (e) 1 = medical, 2 = surgical, 3 = others. (f) Year of hospital admission. (g) Age of patient at admission.

2.1 Variables

The emphasis throughout this set of notes is on approaches to derive information on a particular topic related
to healthcare. In Example 2.1, the researchers collected a number of characteristics on each admission, for example, LOS, Inap and Ward. Each characteristic is called a variable. When working
with a set of data, the lower case letter n is often used to represent the number of observations. In Example 2.1,
the number of observations is n = 1383. For convenience, variables are often represented using capital letters
such as X, Y, Z, etc. If X is used to denote a particular variable, then a set of n observations of X is written as X_1, ..., X_n. Sometimes, in lieu of using X, Y, Z, etc., to represent multiple variables in a dataset, the matrix notation, represented as a boldfaced X, can be used. For example, suppose there are p variables in the dataset; then they may be denoted by X = (X_1, X_2, ..., X_p). The value of the 3rd variable in the 1st observation will then be written as X_13 and that of the 2nd variable of the 4th observation as X_42, etc.
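As a small aside (not part of the original notes), this indexing can be mimicked in R, with rows as observations and columns as variables; the matrix X below is simulated purely to illustrate the notation.

    # Hypothetical data matrix: n = 1383 observations (rows), p = 4 variables (columns)
    X <- matrix(rnorm(1383 * 4), nrow = 1383, ncol = 4)
    X[1, 3]   # value of the 3rd variable for the 1st observation (X_13 in the text)
    X[4, 2]   # value of the 2nd variable for the 4th observation (X_42 in the text)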
A variable can either be quantitative or qualitative. A quantitative variable is a variable that is numeric.
A quantitative variable can be discrete or continuous. Continuous variables, such as blood pressure, age,
can in theory take any value within a given range. Discrete variables, on the other hand, can assume only
a countable number of values. Examples of discrete variables are: number of children in a family, number
of hospitals in a city. The distinction between a continuous and a discrete variable is not always clear, however,
because all continuous values are rounded off to some extent in practice. For example, age is normally rounded
to whole numbers.
Another name for qualitative variable is categorical variable. Categorical variables can be either nominal
(unordered) or ordinal (ordered). An example of nominal variables is admission ward. For nominal variables,
no ordering is implied between different values of the variable, so for example, there is no natural ordering
between “medical" and “surgical". In analyses, values of a variable are often coded numerically, e.g., 1=
medical, 2= surgical, etc.. If a nominal variable has been coded numerically, we must bear in mind that the
numeric values are simply indicators and do not possess the usual properties associated with numbers.
An ordinal variable admits a set of ordered values. However, the distance between values is meaningless.
For example, income may be classified as: “Low income", “Lower middle income", “Upper middle income"
and “High income". These values have an implied order. We can assign numeric values to an ordinal scale,
e.g., 1 for “Low income", 2 for “Lower middle income", etc. The implied order is 1 < 2 < 3, etc. However, the difference between 1 and 2 compares low income to lower middle income, which cannot be taken to be the same as the difference between 2 and 3, the difference between lower middle income and upper middle income, even though 2 − 1 = 1 = 3 − 2.
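A brief R sketch of the two kinds of categorical variable; the small vectors below are made up for illustration and are not part of the LOS dataset.

    # Nominal variable: the numeric codes are only labels (1 = medical, 2 = surgical, 3 = others)
    ward <- factor(c(2, 1, 1, 2, 3), levels = 1:3,
                   labels = c("medical", "surgical", "others"))
    table(ward)

    # Ordinal variable: the levels have an implied order, but the distances between
    # them are not meaningful, so arithmetic on the codes is inappropriate
    income <- factor(c("Low income", "High income", "Upper middle income"),
                     levels = c("Low income", "Lower middle income",
                                "Upper middle income", "High income"),
                     ordered = TRUE)
    income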


Regardless of whether a nominal or ordinal variable has been recoded as numeric or not, arithmetic
operations, e.g., addition, subtraction, multiplication, division, etc., are inappropriate.
In healthcare research, there is another distinction of variables that is important. The distinction is that
of predictors and outcome variables. In Example 2.1, the researchers were interested in whether the type of
admissions had any influence on inappropriate hospitalisation. Hence, admission ward would be a predictor
and the number of days of inappropriate hospitalisation would be an outcome variable. An outcome variable
is usually denoted using the symbol Y whereas a predictor is typically represented using the symbol X . If
there are multiple predictors, then the bold-faced convention X will be used. In that case, X = (X_1, X_2, ..., X_p)
represents p predictors. In Example 2.1, in addition to ward, there are other variables, such as age, gender,
year of admission that the researchers wanted to study in relation to inappropriate hospitalisation, hence
the predictors are X_1 = ward, X_2 = age, X_3 = gender, X_4 = year of admission, and X = (X_1, X_2, X_3, X_4).
Predictors are also sometimes called covariates, independent variables, input variables or features. An
outcome variable is also referred to as response, dependent variable, output variable or target variable.
Throughout this set of notes, these terms will be used interchangeably.
In a particular study, outcome and predictor variables are created to answer the questions in the study.
The choice of outcome or predictor variables depends on the context and the data available. Sometimes,
there may be multiple outcome variables or a variable that is defined as outcome may be used as a predictor
to answer a different question. In Example 2.1, the researchers considered a number of outcomes as measure
of inappropriate hospitalisation. One measure is the number of days of inappropriate hospitalisation during
the LOS. Data on this outcome is directly given in the variable Inap. Another measure they used is whether
there is any day of inappropriate hospitalisation. This outcome requires creating a new variable by classifying
admissions into those with 0 days and those with ≥ 1 days of inappropriate hospitalisation during the LOS.
It is reasonable to conjecture that the number of inappropriate hospitalisation days would be related to LOS,
so patients with longer LOS also would likely have a higher number of inappropriate hospitalisation days.
Hence, a third (indirect) measure of inappropriate hospitalisation would be LOS. Alternatively, the data might
be used to study how much LOS is influenced by inappropriate hospitalisation. To answer this question, Inap
becomes a predictor and LOS the outcome.

2.2 Summarising data

A key step in any analysis is summarising data. The principal objective is to display relevant and useful
information about the data without losing any features of importance. As such, it is important that the
summaries given are accurate and we fully understand what they are, their strengths and weaknesses.
Summaries that are constructed based on this principle allow us to learn about the trends and patterns in the data. They allow us to gain insight into the structure of the data and, in turn, guide analysis. They also
provide means for validating assumptions that are used in an analysis.
There are many ways of summarising data; which method to use depends on the type and amount of
data available, and the purpose of the investigation. Summaries can be broadly classified into one of three
types: tabular, graphical and numerical. The appropriate summaries to use depend on whether the data is
categorical (qualitative) or numerical (quantitative).

2.2.1 Categorical data

When data are categorical, the values recorded on a group of individuals (or items) can be summarised as the
proportions (or percentages) of the total falling into each category or level. A proportion and a percentage
are different ways of expressing the same information. The proportion is calculated as the number of
observations in a category divided by the total number of observations in the data. The percentage is obtained
by multiplying the proportion by 100. Rates can also be used. For example, in a study of access to healthcare in a country, we might express the number of healthcare providers per 1000 individuals. Rates are also sometimes used to express the number of events over time, for example, the number of doctor’s visits per year of follow-up.


In Example 2.1, there are three categorical variables, namely Gender, Ward and Year. Out of the 1383
admissions, there are 670 males (Gender = 1), which gives a proportion and percentage as, respectively,

670/1383 ≈ 0.485 and (670/1383) × 100 percent ≈ 48.5 percent.

The proportion and percentage of the different values of Gender, Ward and Year are given in Table 2.2. The
table gives the frequency distribution of the data. A frequency distribution simply gives the number of
observations that fall within each of a number of categories or intervals of a variable. Looking at the variable
Ward, for instance, the table tells us quickly that the highest number of admissions are for surgical wards,
followed by medical, and there are very few admissions to other wards. The table also shows a new variable
called Inap (0 vs 1+), which is created by classifying admissions into those with Inap = 0 and those with Inap ≥ 1. This example shows how a categorical variable can also be created from quantitative data by grouping values.

Table 2.2: Summary of categorical variables in LOS data

Variable         Levels   n      %
Gender           1        670    48.5
                 2        713    51.5
                 all      1383   100.0
Ward             1        595    43.0
                 2        742    53.6
                 3        46     3.3
                 all      1383   100.0
Year             88       750    54.2
                 90       633    45.8
                 all      1383   100.0
Inap (0 vs 1+)   0        763    55.2
                 1        620    44.8
                 all      1383   100.0
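As a minimal sketch, suppose the data of Example 2.1 are held in an R data frame called hosp with columns Gender, Ward, Year, Inap and LOS (the object and column names are assumptions, and the values below are simulated since the original dataset is not reproduced here). Frequency distributions like those in Table 2.2 can then be obtained as follows.

    # Simulated stand-in for the admissions data; only the structure matters here
    set.seed(1)
    n <- 1383
    hosp <- data.frame(
      Gender = sample(1:2, n, replace = TRUE),
      Ward   = sample(1:3, n, replace = TRUE, prob = c(0.43, 0.54, 0.03)),
      Year   = sample(c(88, 90), n, replace = TRUE),
      Inap   = rpois(n, 2),
      LOS    = rpois(n, 10)
    )

    # Counts and percentages for a categorical variable
    tab <- table(hosp$Ward)
    round(100 * prop.table(tab), 1)

    # Derived categorical variable: Inap (0 vs 1+), created by grouping a numeric variable
    hosp$Inap01 <- ifelse(hosp$Inap >= 1, "1+", "0")
    table(hosp$Inap01)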

Information in a categorical variable can also be displayed using graphical summaries. A commonly used
graphical summary is a pie chart. In a pie chart, the size of each “slice" of the pie is proportional to the
frequency of the corresponding category and the sum of all the slices give the pie. Fig. 2.1 (a) shows a pie
chart of the distribution of admissions by ward using data in Example 2.1.
Alternatively, the same data can be summarised in a bar graph. In a bar graph, the height of each bar
is proportional to the frequency of the corresponding category (Fig. 2.1 (b)). The bars in a bar graph can appear horizontally or vertically.

[Figure 2.1: Pie chart (a) and bar graph (b) of admission by ward]
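Both displays can be produced with base R graphics; the ward counts below are taken directly from Table 2.2.

    # Ward counts from Table 2.2 (1 = medical, 2 = surgical, 3 = others)
    ward_counts <- c(Medical = 595, Surgical = 742, Others = 46)

    pie(ward_counts, main = "Admissions by ward")      # pie chart, as in Fig. 2.1 (a)
    barplot(ward_counts, ylab = "Frequency",
            main = "Admissions by ward")               # bar graph, as in Fig. 2.1 (b)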

2.2.2 Numerical data

There are several ways of summarising numerical data. If a numerical variable is to be displayed graphically,
then a histogram may be used. Fig. 2.2 shows the frequency distribution of LOS using data in Example 2.1.
The histogram is made up of a number of bins or intervals. The height of each bin is proportional to the
frequency of observations with values of the variable of interest that falls within the interval. The width of
an interval is called the bin size or bin width. Bin widths should be chosen so that we obtain a good idea
of the distribution of the data. The appropriate size of the bin widths depends on the sample size n and the “spread" of the data (see discussion below). Programs such as R(h) give a number of choices of bin widths. In
this example, the bins are defined by 0-5, 5-10, 10-15, etc., hence the bin size is 5. From Fig. 2.2, we observe
that the interval with the highest frequency is the bin 0-5, followed by 5-10. Very few admissions have LOS
above 50 days, and none had LOS above 150 days. This example shows how a histogram can effectively
summarise a large amount of numeric data. A histogram is not suitable for categorical (qualitative) variables.
For categorical variables, a bar graph or a pie chart should be used instead.
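A sketch of the corresponding R command, reusing the simulated hosp data frame from the earlier sketch in place of the real LOS values; the breaks argument sets the bin edges, and hence a bin width of 5.

    # Histogram of LOS with bin edges at 0, 5, 10, ... (bin width 5)
    hist(hosp$LOS, breaks = seq(0, max(hosp$LOS) + 5, by = 5),
         xlab = "LOS", ylab = "Frequency", main = "Histogram of LOS")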
We can also create histograms for subsets of the data. For example, we might be interested in comparing
LOS between the different types of ward. The histograms for these subsets are given in Figs. 2.3 (a)-(c),
respectively. The histograms are clearly quite different in shape. They show that, for admissions to medical wards, the highest frequency of LOS is between 5-10 days, followed closely by 0-5 and 15-20 days. In contrast,
for surgical wards, LOS are concentrated between 0-5, then 5-10 days. For the remaining wards, Fig. 2.3 (c),
most LOS are between 5-10, and moreover, LOS is no more than 30 days. In order to quantify these differences
we turn to numerical summaries. Two commonly used numerical summaries are the sample mean and the
sample median, or simply, mean and median, if it is understood that they are calculated using a sample
of values. These are described below. For n observations of X: X_1, ..., X_n, the sample mean is simply the arithmetic mean of the n observations, X̄ = Σ_i X_i / n; the sample median is the “middle" observation when
the data are ordered from lowest to highest. If n is odd, then the sample median is simply the middle value.
If n is even, then the sample median is the average of the two middle values.

Example 2.2
Consider the following data: 1,1,4,2,5,2,2,3,3,4.
The sample mean is:

X̄ = Σ_i X_i / n = (1 + 1 + 4 + ... + 3 + 4)/10 = 2.7.

For the sample median, the data in ascending order are: 1, 1, 2, 2, 2, 3, 3, 4, 4, 5.

Since n = 10, which is even, the sample median is the average of the two middle numbers:

Sample median = (2 + 3)/2 = 2.5.

[Figure 2.2: Histogram of LOS]

[Figure 2.3: Histogram of LOS by ward of admission: (a) Medical, (b) Surgical, (c) Others]

(h) R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Example 2.3 Consider the following dataset of n = 11 observations:


1,1,2,2,2,3,3,4,4,5,50.
Notice that this dataset is similar to that in Example 2.2 except the extra observation with the value 50.
The sample mean becomes X̄ = 77/11 = 7 and the sample median is 3, which is the middle number of the (ordered) observations: 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 50.
In this example, the sample median has not changed much from that in Example 2.2, whereas the sample mean has changed a lot with the addition of a single value of 50 to the data. Here, the value of 50 is extremely large compared to the rest of the data. Because the sample mean is calculated by adding together all of the values and then averaging these, a single extreme value can be quite influential. In contrast, the median uses only the middle value,
and hence is not affected by the extreme value.
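A short R check of Examples 2.2 and 2.3, showing how a single extreme value moves the mean but not the median.

    x <- c(1, 1, 4, 2, 5, 2, 2, 3, 3, 4)   # data of Example 2.2
    mean(x)       # 2.7
    median(x)     # 2.5

    y <- c(x, 50)                          # add the extreme value of Example 2.3
    mean(y)       # 7
    median(y)     # 3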

In Example 2.3, the 11th (outlying) observation has a relatively large influence on the sample mean. The
sample mean of this set of data does not represent the pattern of most of the data. Outlying or extreme values
like the 11th observation are often associated with frequency distributions that are skew or asymmetric. A
skewed distribution is a distribution in which the values taper off unevenly in one direction. The converse type
of distribution is called symmetric, where the values are mirror images on either side of a “center". Irrespective
of the shape of a distribution, the median, by definition, is the center, because it is the value that is larger
(smaller) than 50 percent of the data. For a skewed distribution, the skewness can be in either direction;
values above the median can be more spread out than those below or vice-versa. To differentiate between
these two directions of skewing, skewed distributions are sometimes named according to the direction in
which the tail of the distribution (or slower tapering or more spread out side) lies.
The histograms in Fig. 2.4 show the frequency distributions for four different sets of data. Fig. 2.4 (a) is
an example of a symmetric distribution. In Fig. 2.4 (a), we could imagine a “center" at a value around 10, and
on either side of this imaginary center, the histogram values are mirror images of each other. The frequency
distributions in Figs. 2.4 (b) and (c) correspond to data that come from asymmetric or skewed distributions.
For example, the skewed distribution shown in Fig. 2.4 (b) might be termed negative, left or downward
skew because this is the direction of the tail of the distribution. Similarly the distribution shown in Fig. 2.4
(c) tails off to the right and would be called either positive, right or upward skew. When the distribution of a
set of data is skewed, the mean will be farther away from the center of the distribution than the median. The
greater the degree of skewness, the larger the difference between the mean and the median. We notice that
for Fig. 2.4(a)-(c), there is only one “peak" or “hump" or mode. A frequency distribution that displays one
peak is called a unimodal distribution, as opposed to those with more than one peak [Fig. 2.4(d)], which
are called multi-modal. For a unimodal distribution that is symmetric, sample mean ≈ sample median.
In any particular situation, the measure of location to use is the one that best describes the data. Since every value in the data is used in the calculation of the mean, the mean, when it can validly be used, is a better summary of the data than the median. However, for the same reason, we observed in Example 2.3 that the mean is highly influenced by extreme values. Hence, when the data are highly skewed, or contain extreme values even if the distribution is symmetric, the median is probably a better choice.
For numeric data there is another summary that must be considered. To illustrate this point, consider
Example 2.4.


[Figure 2.4: Four different frequency distributions, panels (a)-(d); each panel shows a histogram of frequency against x]

Example 2.4
Consider the following two sets of data:
Dataset 1: −9,2,3,3,3,4,15
Dataset 2: 3,3,3,3,3,3,3
Clearly these two datasets are very different. However, for both datasets, sample mean = sample median = 3. Using
these summaries does not allow us to distinguish between the two datasets. Therefore, while measures of location are
useful, we need other types of summary to describe a set of quantitative data.

A second way of describing quantitative data is spread. Spread measures how the values of a variable
change over a set of n observations. The simplest measure of spread is the sample range, which is simply the difference between the lowest and the highest value. The sample range is always a non-negative value. A large value of the range is indicative of a large spread.
Another measure of spread is the sample variance (s²); s² measures the average squared distance between observations and the sample mean: s² = Σ_i (X_i − X̄)² / (n − 1). If we take the (positive) square root of the sample variance, we obtain the sample standard deviation (s = √s²). Sample variance and sample standard deviation are
exchangeable measures of spread, in the sense that they give the same information about the spread of a
variable, for a given set of data. However, the standard deviation is often preferred since the variance is in terms of squared values. For any dataset, s and s² are non-negative numbers. Sometimes, the alternative formula s² = Σ_i (X_i − X̄)² / n, with n instead of n − 1 in the denominator, is used. The two formulae give similar results unless n is very small. The higher the value of s or s², the higher the spread in the dataset, i.e., the


values of the dataset are more different.


A third measure of spread is the interquartile range (IQR), which is defined as the range in which the
middle 50 percent of the values lie. To calculate this requires the use of percentiles. A percentile is the
ranking of a value in the dataset as compared to other values. The q-percentile of a dataset is the value that is
bigger than q percent of the values in the dataset. For example, if a female patient has systolic blood pressure
of 109 and her value is at the 25-th percentile, then 25 percent of similar patients have systolic blood pressure
below 109 and 75 percent have values above 109. It is important to distinguish between percent
and percentile. A percent always has a value between 0 and 100, but a percentile’s value can be anything,
depending on the context. The median is by definition the 50-th percentile. The IQR is defined as: IQR=
75-th percentile - 25-th percentile. The 25-th percentile is also called the lower quartile or 1-st quartile
(Q1) and the 75-th percentile is also called the upper quartile or 3-rd quartile (Q3). The calculations of the
lower and upper quartiles are similar to that of the median. Like the other measures of spread, IQR is also
non-negative. A larger value of IQR is indicative of higher spread. When IQR= 0, it means all observations
between the first and third quartiles have the same value. However, it says nothing about the observations
below the first quartile and those above the third quartile (cf. the sample range).
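These measures are available directly in R; the sketch below reuses the data of Example 2.2. Note that var() and sd() use the n − 1 denominator.

    x <- c(1, 1, 4, 2, 5, 2, 2, 3, 3, 4)
    diff(range(x))                # sample range (highest minus lowest value)
    var(x)                        # sample variance, with n - 1 in the denominator
    sd(x)                         # sample standard deviation
    quantile(x, c(0.25, 0.75))    # lower (Q1) and upper (Q3) quartiles
    IQR(x)                        # interquartile range, Q3 - Q1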
Table 2.3 gives these numerical summaries on LOS using data from Example 2.1. We observe that,
whether we take the dataset as a whole or grouped by ward, the mean is quite a bit larger than the
corresponding median, reflecting the fact that the distribution of LOS is right skewed, see Figs. 2.2 and 2.3.
Furthermore, whether mean or median is used, it shows the average LOS is highest for ward = 1, followed by
ward = 2, then ward = 3. The spread of LOS follows a broadly similar order: by the range and the IQR, ward = 1 has the highest spread, although its standard deviation is slightly smaller than that of ward = 2. The large spread of LOS for ward = 1 means that some admissions in medical wards were associated with very long LOS.

Table 2.3: Numerical summaries of LOS, grouped by Ward

Ward mean s median min max Q1 Q3 IQR


1 13.02 11.24 10.0 1 107 6 17 11
2 9.19 11.53 5.5 1 85 2 11 9
3 5.80 6.21 2.0 1 22 2 9 7
Combined 10.73 11.45 7.0 1 107 3 14 11
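Grouped summaries such as those in Table 2.3 can be obtained with tapply(); the sketch below again uses the simulated hosp data frame introduced earlier, so the numbers will not match the table.

    # Mean, median and IQR of LOS within each ward
    tapply(hosp$LOS, hosp$Ward, mean)
    tapply(hosp$LOS, hosp$Ward, median)
    tapply(hosp$LOS, hosp$Ward, IQR)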

As in the case of measures of location, we might ask the same question as to when and how to choose
between different measures of spread. Similar to the median, both the range and the interquartile range are
relatively insensitive to changes in the data. In contrast, a single extreme value can substantially change the variance (and the
standard deviation). Hence, when they are valid, the variance and the standard deviation are better summaries
as they use more information from the data. However, in situations when the sample mean might not be an
appropriate measure of location, as discussed previously, then it follows that the variance and the standard
deviation, which are calculated from distances around the sample mean, will not be a useful summary of the
spread of the values.
We summarise the different measures for numerical data in Table 2.4.

Table 2.4: Summary of measures of location and spread

Data                                              Location        Spread
Quantitative data from symmetric distributions    Sample mean     Sample variance; Range
Quantitative data from asymmetric distributions   Sample median   IQR

2.3 Samples and populations

In the study of Example 2.1, the researchers used a set of data to infer about inappropriate hospitalisation in
Barcelona. The researchers made the implicit assumption that the dataset is representative of admissions at
the teaching hospital, and more generally, of those at other hospitals in Barcelona. In any healthcare study,
the data come from a sample of a population, defined as all units of interest.(i) Henceforth, we will use the
terms sample and data interchangeably. In Example 2.1, the population of interest is all hospital admissions
in Barcelona. Earlier, we talked about the wish to apply what we learned in the data to the general context.
By that, we mean to use information from a sample, i.e., the data, to say something about the population, i.e.,
the general context.
A sample is usually selected from the population using random sampling, and it is considered
representative of the population. Although random sampling is generally the preferred method of selecting
observations from a population, sometimes, due to budgetary or other considerations, non-random sampling
is carried out. In some instances, non-random methods yield better results; in other instances, they fare worse.
There are methods that allow data from non-random samples to be used for drawing proper conclusions about
a population. These topics are beyond the scope of this set of notes, henceforth, we assume samples are
representative of the population, unless otherwise specified.
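A toy sketch of drawing a simple random sample in R; the population here is just a vector of hypothetical unit identifiers.

    population <- 1:10000                          # identifiers of all units of interest
    sampled    <- sample(population, size = 100)   # simple random sample of n = 100 units, without replacement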

2.4 Random variables and distributions

Example 2.1 gives a set of data on patients who were admitted and treated at the same hospital. Yet, we
notice that the outcome, Inap, of these admissions can be quite different from one patient to another. If
another patient is admitted, we can imagine it would be hard to predict precisely the value of Inap for that
patient. We call the outcome variable a random variable and we refer to the data as a set of observations
of a random variable. There are many other examples of random variables. In the same dataset, we notice
patient characteristics also vary, some are males and others females, and their ages are different. Hence, the
variables in Example 2.1 are all random variables.
Random variables can generally be classified into one of two types: discrete and continuous random
variables. When the set of possible values a random variable can take is countable, then the variable is
a discrete random variable. A continuous random variable, in contrast, can take any of the uncountably
many possible values in an interval. In Example 2.1, Gender, Ward and Year are examples of discrete
random variables, whereas Age is a continuous random variable; LOS and Inap can be considered either as continuous variables that have been rounded to the nearest day, or as discrete.
In Section 2.2, a frequency distribution was used to show the frequencies of the values of a variable in a set of data. If data are observations of a random variable in a sample from a population, then the “frequency"
distribution of the random variable in the population is called a probability distribution. There are two main
types of distributions: discrete distribution and continuous distribution, for each of the two main types of
random variables. If X represents a random variable, and it is discrete, then the probability P(X = a), gives
the frequency or proportion of times X = a appears in the population. The probability distribution of X is
given by P(X = a) for all possible values of a that X can assume, and it is called a probability distribution
function (often abbreviated as pdf or PDF). For a continuous random variable X , there are uncountably many
values that fall within a range. Hence, unlike a discrete random variable, it would not be possible to list the
probabilities for all possible values of X . This problem is solved by using a probability density function (also
abbreviated as pdf or PDF(j)). A probability density function is sometimes simply referred to as a probability
density or a density.
i In general, there are two kinds of population – finite and infinite. However, as long as (1) the sample size, n, and the population
size, N, are such that n/N < 0.05 (i.e., no more than 5% of the population is sampled), or (2) the population size N is so big that it can
be considered infinite, or (3) sampling is with replacement, i.e., a unit that has been sampled is returned to the population before
the next unit is sampled, then there is no difference in the statistical analyses between a finite and an infinite population. Henceforth,
we assume one or more of these conditions are satisfied so we do not distinguish between the two types of population.
j Notice that pdf may also mean “probability distribution function", as for a discrete random variable. Hence, we have to rely on
the context to determine the meaning of pdf.


To motivate the concept of a probability density function (PDF), we turn to the frequency distribution.
Fig. 2.2 shows the frequency distribution of LOS, which gives the frequencies of the values of LOS in the
sample. Since LOS is defined as a continuous random variable, it can take any value in an interval. The PDF
is a function that represents the “frequencies" for all possible values of LOS, not just those observed in the
sample. This situation is illustrated in Fig. 2.5. Fig. 2.5 (a) reproduces Fig. 2.2 for the sample, Fig 2.5 (b)
shows a hypothetical PDF for LOS. The PDF is a smooth curve that extends over the interval for all possible
values of LOS. If X is a continuous random variable, we often write the PDF of X by f (x), where f and x
are both lower case letters. In Fig. 2.5 (b), the PDF f (x) is the black curve, and f (x) means the value of the
black curve at X = x. The choice of the letter f is not strict; we could also use other letters, such as g, h, etc.
Following the idea that the size of each bar in a histogram represents frequencies, for a PDF, the area under
the curve over any interval represents the proportion (or probability) of values of X within that interval. For
each PDF, there is a related function called the cumulative distribution function (CDF), F (x) ≡ P(X ≤ x),
which gives the proportion of values up to the value x, Fig. 2.5 (c).k
Earlier, we defined numerical summaries such as the sample mean, sample median, sample variance, etc.
for a set of data. The counterparts of these summaries also exist for a population. For a random variable X , E(X ),
Var(X ), SD(X ) represent respectively, the population mean (also called the expectation or expected value),
variance and standard deviation of X . There are no special symbols for median and the quartiles, though they
are defined similarly as their sample counterparts.

Figure 2.5: Histogram of LOS data in Example 2.1 and a hypothetical density

[(a) Frequency histogram of LOS in the sample; (b) a hypothetical probability density function (PDF) for LOS; (c) the corresponding cumulative distribution function (CDF), which rises from 0 to 1.]

2.5 Joint distributions

In Example 2.1, in addition to LOS and Inap, there are four other variables containing other information
about the admissions. These variables may provide valuable information for understanding factors that might
influence outcome. For example, from Table 2.2, we observe that 763 of the admissions did not have any day of
inappropriate hospitalisation during LOS while the remaining 620 had at least one day of hospitalisation that
is classified as inappropriate. We also observe there were 750 admissions in 1988 and the remaining in 1990.
k It is also possible to define a CDF for a discrete random variable. However, in that case, since all probabilities can be represented
by the PDF, P(X = a), a CDF is less useful.


Using these data, we might be interested in the following question: “Was there any change in the pattern of
inappropriate hospitalisation between 1988 and 1990?" In this question, we are interested in the two random
variables: X =Year and Y =Inap (0 vs 1+). To answer such a question, we need to study both random variables
together. This type of analysis, where we examine the relationship between two or more random variables, is
called a multivariate analysis (“multi" = more than one, “variate" = variable). In the current question, there
are two random variables, so the analysis is called a bivariate analysis (“bi" = two, “variate" = variable). A
bivariate analysis is an example of a multivariate analysis.
A way to summarise data from two discrete variables is a contingency table of frequencies, as follows:

Table 2.5: Contingency table of Ward vs. Inap (0 vs. 1+) in Example 2.1

Y X (Ward)
Inap(0 vs. 1+) 1 2 3 Total
0 276 454 33 763
1+ 319 288 13 620
Total 595 742 46 1383

If we divide all the numbers in Table 2.5 by n = 1383, we obtain a contingency table of sample
proportions. Table 2.6 shows that for medical wards (ward =1), there is a higher proportion of inappropriate
hospitalisation when compared to surgical and other types of wards.

Table 2.6: Table 2.5 expressed in proportions

Y X (Ward)
Inap(0 vs. 1+) 1 2 3 Total
0 0.2 0.33 0.02 0.55
1+ 0.23 0.21 0.01 0.45
Total 0.43 0.54 0.03 1

We can extend the concept of a contingency table to study the relationships between discrete random
variables in the population. To do that, we need a joint probability distribution or joint distribution. A
joint distribution between two variables X and Y is also called a bivariate probability distribution or simply
bivariate distribution. A bivariate distribution is a special case of a multivariate distribution, which is used
to describe the joint behaviour of two or more random variables. The concept of a PDF for a single discrete
random variable carries over to the case of two or more random variables. For two discrete random variables
X and Y , the joint probability distribution function (joint PDF) (sometimes also called a joint distribution
function, or simply distribution function if it is understood that we are referring to more than one random
variable), P(X = a, Y = b), tells us the proportion of times (X , Y ) = (a, b) occurs in the population.
Earlier, we noticed the proportion of inappropriate hospitalisation is higher among medical wards than
other wards. If we want to study similar questions in the population, we need to calculate the conditional
probability, P(Y = 1+ | X = 1). The notation

P(Y = b | X = a) = P(Y = b, X = a) / P(X = a),

read “the conditional probability of Y = b given X = a", means the proportion of Y = b among those with X = a.
The probability P(X = a) is also called a marginal probability.
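As a small illustration (a minimal Python sketch, not part of the original notes; the counts are those of Table 2.5), the sample versions of these marginal and conditional probabilities can be computed directly from the contingency table:

    import numpy as np

    # Rows: Inap = 0, Inap = 1+; columns: Ward = 1, 2, 3 (counts from Table 2.5)
    counts = np.array([[276, 454, 33],
                       [319, 288, 13]])
    n = counts.sum()                                      # 1383 admissions

    p_ward = counts.sum(axis=0) / n                       # marginal proportions, cf. Table 2.6
    p_inap1_given_ward = counts[1] / counts.sum(axis=0)   # P(Y = 1+ | X = a) for each ward

    print(p_ward.round(2))               # about [0.43 0.54 0.03]
    print(p_inap1_given_ward.round(2))   # about [0.54 0.39 0.28]; highest for medical wards (Ward = 1)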
Multivariate analysis is not restricted to discrete random variables. In Fig. 2.3, we analysed the
(frequency) distribution of LOS (a continuous random variable) by wards of admission (a discrete random
variable) using a set of data. In Chapter 2.1, we conjectured that the number of inappropriate hospitalisation
days would be related to LOS. If we are interested in studying the relationship between two random variables, X


and Y , both continuous, then we use a continuous bivariate distribution. We learned from Chapter 2.4 that
density functions are used to describe continuous distributions. For a continuous bivariate distribution, we
use a bivariate probability density function (PDF) f (x, y), which is an example of a joint density function.
The properties of a continuous bivariate random variable (X , Y ) and its joint PDF f (x, y) are analogous to
their univariate counterparts, so that the “volume" under the joint PDF over a region gives the probability
that (X , Y ) falls in that region. Related to the joint PDF is a joint CDF F (x, y) ≡ P(X ≤ x, Y ≤ y). We may
be interested in questions such as: for admissions with LOS longer than 10 days, what is the probability
that there is more than a day of inappropriate hospitalisation? Such questions require us to find conditional
probabilities, except that they are calculated using density functions instead of probability distribution functions.
A primary goal of the study in Example 2.1 was to measure hospital efficiencies in Barcelona, Spain. In
healthcare economics, a common measure of efficiencies is mean LOS (or suitably transformed value of LOS,
see Chapter 2.7). If our interest is in the entire population, then the expectation, E(X ), is the appropriate
summary to use. However, if we are interested in subsets of the population, for example, the mean LOS among
those in medical wards, or the mean LOS for those with more than 5 days of inappropriate hospitalisation,
then we need to calculate conditional expectations, E(Y |X = a) or E(Y |X ≥ a).

2.6 Families of distributions

There are a number of probability distributions that are useful for studying problems in healthcare research.
In this section, we will briefly describe these distributions.

2.6.1 Bernoulli trials and binomial distribution

In Example 2.1, one of the outcome measures is whether there is any day of inappropriate hospitalisation
during LOS. Hence, for each admission, the outcome is either 0 (no days of inappropriate hospitalisation) or 1
(1+ days of inappropriate hospitalisation). We might ask what would be an appropriate distribution
of this random variable in the population. In this problem, the outcome X of each admission has two possible
values, 0 or 1. Let p = P(X = 1), 0 < p < 1, so that P(X = 0) = 1 − p. The distribution of X
is then called a Bernoulli distribution;l X is called a Bernoulli random variable, p the success probability, and 1
represents a “success". For the problem at hand, the outcomes of a sequence of admissions would look like 1,1,0,0
or 1,0,1,1, etc. Since the value of X is unknown until we observe it, it is sometimes also called a
Bernoulli trial. When analysing data like these, it is often convenient to assume that p is the
same for all admissions and that the values for different admissions are independent of one another. A sequence of
independent Bernoulli trials with a constant success probability forms a (homogeneous) Bernoulli process or
a Bernoulli sequence.
In practice, the individual outcomes of each admission are not as insightful as the overall outcome among
all the admissions, when our interest is measuring a hospital’s overall efficiency. In other words, if there are n
admissions, we would be interested in X , the total number of admissions with inappropriate hospitalisations.
Hence X is defined as the number of successes in a sequence of n independent Bernoulli trials each with
probability of success p. The possible values of X are 0, 1, ..., n and hence X is a discrete random variable. The
probabilities of different values of X form a Binomial distribution with parameters n and p. Sometimes we
write X ∼ Bin(n, p) for short. If X ∼ Bin(n, p), the pdf of X is given by the following simple formula,

P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k},  k = 0, 1, ..., n,

where

\binom{n}{k} = n! / ((n − k)! k!).
The parameters n and p are very useful, they allow us to use the Binomial distribution to model different
situations by changing the values of n and p to suit the context that we are interested in. For example, when
l After James Bernoulli, 1654-1705.


we are interested in the number of heads, X , in three tosses of a fair coin, then X ∼ Bin(n = 3, p = 0.5) can be
used. On the other hand, if X is the number of patients out of 4 who recover after being given a treatment
with recovery probability 0.7, then X ∼ Bin(n = 4, p = 0.7). Parameters also permit a simple way to describe a particular situation, so for
example, if we are told the number of recoveries from a disease follow a Bin(n = 4, p = 0.7) distribution, we
can picture the situation based on our understanding of the Binomial distribution. Since the parameters n, p
define many different Binomial distributions, we call Bin(n, p) a family of Binomial distributions. Many useful
probability distributions also have parameters so they can be used to solve problems in different situations.
Expectation and variance
If X ∼ Bin(n, p) then
E(X ) = np, Var(X ) = np(1 − p).
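As a quick numerical check (a Python sketch using scipy, not part of the original notes), we can evaluate the Bin(n = 4, p = 0.7) pdf for the patient-recovery example above and verify the mean and variance formulas:

    from scipy.stats import binom

    n, p = 4, 0.7                      # 4 patients, recovery probability 0.7 (illustrative values)

    for k in range(n + 1):             # P(X = k) for k = 0, 1, ..., 4
        print(k, round(binom.pmf(k, n, p), 4))

    print(binom.mean(n, p), n * p)             # both 2.8
    print(binom.var(n, p), n * p * (1 - p))    # both 0.84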

Shape
The shape of a Binomial distribution depends upon the values of n and p (Fig 2.6). For small n, the
distribution is almost symmetrical for values of p close to 0.5, but highly skewed for values of p close to 0
or 1. As n increases, the distribution becomes more and more symmetrical, for almost all values of p except
when p is very close to 0 or 1. The binomial distribution is always unimodal.

Figure 2.6: PDFs for Bin(n, p) for various values n and p

[Three bar charts of P(X = k): for n = 10, p = 0.5 the PDF is roughly symmetric about 5; for n = 10, p = 0.9 it is strongly skewed towards large values of k; for n = 100, p = 0.9 it is again close to symmetric, centred near 90.]

2.6.2 Poisson distribution

The Bernoulli and binomial distributions are useful for situations where an event either occurs (success) or does
not occur (failure) in each of a predetermined number, n, of trials. Sometimes, however, an event is not tied to a
number of defined trials; rather, it may occur at any time or at any place. For example, the next admission to a
medical ward, or the place of the next disease outbreak.
[Timeline diagram: events (e.g., hospital admissions) occurring at scattered points along a time axis starting at 0.]

In those cases, a Poisson processm may be suitable. The Poisson process can be considered a continuous-
time (or space) analogue of the Bernoulli process. In this section, we will discuss the Poisson process in a time
context; its application in a space context is identical and is omitted.
m After Siméon-Denis Poisson (1781-1840).


A Poisson process is a model for the occurrence of events, such that


i. No two events occur at the same time
ii. The rate of occurrence of events is constant over time
iii. Events occur independently of each other

We summarise below the characteristics of a Poisson distribution.


PDF

If X has a Poisson distribution with average rate λ > 0, we write X ∼ Poisson(λ), and the PDF of X is

P(X = k) = (λ^k / k!) e^{−λ},  for k = 0, 1, 2, ...

Mean and variance

E(X ) = Var(X ) = λ

Notice for a Poisson distribution, the mean has the same value as the variance. This situation is illustrated
in Fig. 2.7, which shows the PDFs of two Poisson distributions. For λ = 1, the probabilities are concentrated
at a few values near the mean of 1, and it is very unlikely to observe very high values, contributing to a low
variance. For λ = 5, in contrast, the probabilities are spread over a much wider range around the mean of 5;
in this case, there is reasonable probability of values far away from the mean, leading to a high variance.
Sum of independent Poisson random variables

If X and Y are independent, and X ∼ Poisson(λ), Y ∼ Poisson(µ), then

X + Y ∼ Poisson(λ + µ)

Figure 2.7: PDFs for Poisson(λ)

[Two bar charts of P(X = k) for k = 0, ..., 14: for λ = 1 the probabilities are concentrated on a few values near 1; for λ = 5 they are spread over a much wider range around 5.]

Change of time frame

Let X ∼ Poisson(λ) represent the number of events in one unit of time. If we are interested in Y , the
number of events in t units of time, then Y ∼ Poisson(tλ). This result makes sense since in a Poisson process,


the rate of occurrence of events is constant over time, hence a time frame of t units should have a rate that is
t times the rate λ for 1 unit of time.
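To make the change of time frame concrete (a Python sketch, not part of the original notes; the rate of 2 admissions per day is purely illustrative), suppose admissions arrive as a Poisson process with λ = 2 per day; the number of admissions over t = 7 days is then Poisson(tλ = 14):

    from scipy.stats import poisson

    lam, t = 2.0, 7.0                      # illustrative rate: 2 admissions per day, over 7 days

    print(poisson.pmf(3, lam))             # P(3 admissions in one day), X ~ Poisson(2)
    print(poisson.pmf(14, t * lam))        # P(14 admissions in a week), Y ~ Poisson(14)

    # Mean equals variance for a Poisson distribution
    print(poisson.mean(t * lam), poisson.var(t * lam))   # both 14.0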

2.6.3 Exponential distribution

A Poisson(λ) distribution is used to describe the random number of events over a fixed unit of time. In a
Poisson process, events occur over any arbitrary time interval. There are two main questions we might
be interested in: (1) the number of events in the interval and (2) the time of occurrence of events. In
Chapter 2.6.2, we studied (1). Here, we consider (2). Let T denote the time to occurrence of an event (e.g.,
hospital admission, disease outbreak). Since T is a time, it must be non-negative, and since it is not possible
to predict exactly how long we have to wait for the event to occur, we may assume T ∈ (0, ∞).
Consequently, T is a continuous random variable.
The distribution of a continuous random variable is described by its probability density function (PDF)
and its cumulative distribution function (CDF). It can be shown that if events occur at a constant rate λ > 0
per unit time, i.e., the number of events per unit time follows a Poisson(λ) distribution, then the time between
events, T , follows an exponential distribution. We write T ∼ Exp(λ), where λ > 0 is the parameter. We summarise the main characteristics of
an exponential distribution below.
PDF and CDF

f (t) = λe^{−λt} for t > 0, and 0 for t ≤ 0;    F (t) = 1 − e^{−λt} for t > 0, and 0 for t ≤ 0.

A graph of the PDF of Exp(λ) for two different values of λ is given in Fig. 2.8.
Mean and variance

E(T ) = 1/λ,    Var(T ) = 1/λ^2

[Two exponential density curves f (t), starting at height λ at t = 0 and decaying towards 0, shown for λ = 0.5 and λ = 0.25.]
Figure 2.8: PDFs for two exponential distributions

Memoryless property

The Exponential distribution is famous for its property of being “memoryless"; by that we mean P(T >
t + s|T > s) = P(T > t) so that the probability of waiting for t units of time beyond s is independent of the
value of s.
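The memoryless property is easy to verify numerically (a Python sketch, not part of the original notes; the rate λ = 0.5 and the times s and t are arbitrary illustrative values):

    from scipy.stats import expon

    lam, s, t = 0.5, 3.0, 2.0           # illustrative values
    scale = 1.0 / lam                   # scipy parameterises the exponential by scale = 1/lambda

    # P(T > t + s | T > s) versus P(T > t); sf is the survival function P(T > x)
    lhs = expon.sf(t + s, scale=scale) / expon.sf(s, scale=scale)
    rhs = expon.sf(t, scale=scale)
    print(lhs, rhs)                     # both equal exp(-lam * t), about 0.3679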


2.6.4 Uniform distribution

X is said to follow a Uniform(a, b) distribution if X is equally likely to fall anywhere in the interval [a, b],
a < b.
PDF and CDF

f (x) = 1/(b − a) if a ≤ x ≤ b, and 0 otherwise;
F (x) = 0 if x < a,  (x − a)/(b − a) if a ≤ x ≤ b,  and 1 if x > b.

Mean and Variance

E(X ) = (a + b)/2,    Var(X ) = (b − a)^2 / 12

2.6.5 Gamma distribution

The Gamma distribution has two parameters, k and λ, where k > 0 and λ > 0. We write X ∼ Gamma(k, λ).
PDF and CDF

f (x) = λ^k x^{k−1} e^{−λx} / Γ(k) if x > 0, and 0 otherwise.

Here, Γ(k), called the Gamma function, is simply a constant that ensures f (x) integrates to 1, i.e.,
∫_0^∞ f (x) dx = 1.
There is no closed form for the CDF of the Gamma distribution. If X ∼ Gamma(k, λ), then F (x) can only
be calculated by computer.
Mean and variance

E(X ) = k/λ,    Var(X ) = k/λ^2

Relationship to the exponential distribution


If X_1, ..., X_k ∼ Exp(λ) and are independent, then X_1 + X_2 + ... + X_k ∼ Gamma(k, λ).
Relationship to the Poisson distribution
Recall that the waiting time between events in a Poisson process with rate λ has the Exp(λ) distribution.
That is, if X_i = {waiting time between event i − 1 and event i}, then X_i ∼ Exp(λ).
Now the waiting time to the k-th event is X_1 + X_2 + ... + X_k, the sum of k independent Exp(λ) random
variables.
Thus the time waited until the k-th event in a Poisson process with rate λ has the Gamma(k, λ)
distribution.
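This relationship can be checked by simulation (a Python sketch, not part of the original notes; the values k = 3 and λ = 0.5 are illustrative):

    import numpy as np
    from scipy.stats import gamma

    rng = np.random.default_rng(1)
    k, lam = 3, 0.5                                    # illustrative parameters

    # Sum of k independent Exp(lam) waiting times, repeated many times
    waits = rng.exponential(scale=1 / lam, size=(100_000, k)).sum(axis=1)

    print(waits.mean(), k / lam)                       # both close to 6  (E(X) = k/lam)
    print(waits.var(), k / lam**2)                     # both close to 12 (Var(X) = k/lam^2)
    print(gamma.cdf(6.0, a=k, scale=1 / lam))          # Gamma(k, lam) CDF computed numerically
    print((waits <= 6.0).mean())                       # empirical proportion, close to the CDF value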


2.6.6 Normal distribution

The best known and most important distribution is the Normal distribution. The normal distribution is
sometimes called a Gaussian distribution.n The normal distribution is important for two reasons.
First, the frequency distributions of many variables naturally follow the normal distribution, e.g., adult height,
adult blood pressure and many other biological quantities. Second, the normal distribution is central to our
ability to use data in a sample to draw conclusions about unknown quantities in a population, a topic that we
will study later in this set of notes.
Fig. 2.9(a) is a histogram of the frequency distribution of systolic blood pressure for a sample of n = 19921
adults.o The numerical summaries of this dataset are given in Table 2.7.

Table 2.7: Numerical summaries of systolic blood pressure in Adults

mean   s   median   Q1   Q3   IQR
122   15   121   113   131   18

The frequency distribution looks symmetric. It peaks at around 122. Notice also the similarity between
the mean, the median and the mode (from Fig. 2.9 (a)). Fig. 2.9(b) shows a normal distribution. The normal
distribution probability density function (PDF) is the curve shown in Fig. 2.9(b). The PDF of the normal
distribution is “bell-shaped", hence a normal PDF is sometimes called a bell-curve.

Figure 2.9: Normal distribution as a model for systolic blood pressure

[(a) Frequency histogram of systolic blood pressure (mm Hg), peaking around 122; (b) normal probability density function for systolic blood pressure, centred at 122 mm Hg.]

n After Carl Friedrich Gauss, 1777-1855.
o Mean Systolic and Diastolic Blood Pressure in Adults Aged 18 and Over in the United States, 2001-2008, J. D. Wright et al.,
National Health Statistics Reports, March 2011.


Figure 2.10: Effects of changing the mean and the standard deviation of a normal distribution
[(a) A normal distribution with its mean marked; (b) the same curve with a smaller mean, i.e., the whole distribution shifted; (c) a normal distribution with a smaller standard deviation, which is narrower and more peaked.]

The aim of a theoretical distribution such as the normal distribution is to create a model for the entire
population from which the sample is drawn (in fact, all theoretical distributions share this aim). The normal
distribution is by definition symmetric, with most values towards the centre and with values tailing off evenly
in either direction. Because of this symmetry, the mean (and median) always lies at the centre of the distribution
(where the peak is). The mean of a normal distribution is usually denoted by the Greek letter µ (pronounced
“mu"). The standard deviation of the distribution tells us how spread out the values are around the mean.
Fig. 2.9 (b) is drawn using information from the data, with the centre of the distribution at 122, so µ = 122 mm
Hg, and a standard deviation of 15 mm Hg. As seen in Fig. 2.9 (b), the normal density is tallest at the mean
and the curve drops off symmetrically on either side of the mean. All normal distributions have PDFs that are
of the same shape (i.e., bell-shaped), unimodal (has one hump or mode) and symmetric about the mean, as in
Fig. 2.9 (b). A normal distribution with mean µ and standard deviation σ (Greek letter σ, pronounced
“sigma"), is usually written as N (µ, σ2 ) (alternatively, it may be written as N (µ, σ)) and a variable X that
follows a normal distribution is sometimes written as X ∼ N (µ, σ2 ). For example, if we let X represent the
systolic blood pressure of a person in a population, then X ∼ N (122, 152 ) means that the person comes from
a population with mean systolic blood pressure = 122 mm Hg and standard deviation = 15 mm Hg.
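Using the model X ∼ N(122, 15^2) for systolic blood pressure, probabilities of practical interest can be read off the normal CDF (a Python sketch, not part of the original notes; the 140 mm Hg threshold is an illustrative cut-off):

    from scipy.stats import norm

    mu, sigma = 122, 15                         # model from the blood pressure example

    print(norm.cdf(140, loc=mu, scale=sigma))   # P(X <= 140), about 0.885
    print(norm.sf(140, loc=mu, scale=sigma))    # P(X > 140), about 0.115
    # Proportion within one standard deviation of the mean, about 0.683
    print(norm.cdf(mu + sigma, mu, sigma) - norm.cdf(mu - sigma, mu, sigma))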
Fig. 2.10 shows three normal distributions and the effects of changes in the mean and the standard
deviation. The shapes of the normal distributions in Figs. 2.10 (a) and (b) are identical, except there is a shift
in the mean. Fig. 2.10 (c) shows a normal distribution with a smaller standard deviation. The distribution
is narrower and the peak at the center is more pronounced, when compared to Figs. 2.10 (a) and (b). A
narrower and sharper pdf tells us that, for the population in Fig. 2.10 (c), it is more likely to find values near
the mean and less likely to find values far away from the mean, in either direction. We can understand why
a small standard deviation leads to a sharper pdf by considering the following. In Section 2.2, we learned
that the standard deviation (or the variance) measures the average distance of observations in a dataset from
their mean. Therefore, a smaller standard deviation tells us the average distance of values from the mean is
smaller, which suggests that the probability of seeing values far away from the mean is small, as reflected in
a pdf that drops off quickly on either side of the mean.

[(a) Frequency histogram of LOS, which is strongly right-skewed; (b) frequency histogram of log(LOS), which is much more symmetric.]
Figure 2.11: Effect of transformation of LOS data

2.6.7 Transformations

Many statistical methods require data assumptions to hold, and the most common assumption is the data
follow a normal distribution. If the data do not follow a normal distribution, it is possible to consider
transformations that can “normalise" the data. The transformations used should not change the relative
ordering of the values but alter the distance between successively ordered values to change the overall shape
of the distribution.


The most common transformation is the (natural)q logarithmic transformation. We illustrate the concept
using data from Example 2.1. For each observation in the dataset, the LOS is transformed to log(LOS) and
the results are given in Fig. 2.11. Fig. 2.11 (a) reproduces the frequency distribution for LOS as originally
measured, which has substantial positive skewness. The frequency distribution following the transformation is
given in Fig. 2.11 (b); it shows most of the skewness has been removed and the distribution looks closer to that
of a normal distribution. In this example, all LOS values are positive. In datasets where some observations take
the value zero, a small constant c could be added to each observation so that the smallest of the untransformed
observations would be 1 (and its transformed value would be 0). We notice that in the data from Example 2.1,
763 out of the 1383 observations have a value of 0 for Inap (Table 2.2). In cases where there are a lot of
zeros, a different approach needs to be taken to handle the skewness of the distribution. This approach will
be discussed in detail in a later chapter.
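A log transformation of this kind is a one-line operation in practice (a Python sketch, not part of the original notes; the LOS and Inap values shown are illustrative, not the study data):

    import numpy as np

    los = np.array([15, 42, 8, 3, 120, 27, 5, 60])   # illustrative LOS values (days)
    log_los = np.log(los)                            # natural logarithm, as used in these notes

    # For a variable containing zeros, one option mentioned above is to add a constant c
    # so that the smallest untransformed value becomes 1 before taking logs.
    inap = np.array([0, 0, 2, 7, 0, 1, 0, 30])       # illustrative values with many zeros
    c = 1 - inap.min()                               # here c = 1
    log_inap = np.log(inap + c)
    print(log_los.round(2), log_inap.round(2))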

2.7 Estimation

A fundamental problem in statistical analysis is estimation of certain characteristics of a random variable
X in the population. In most healthcare problems, the interest is often in estimating a population mean,
median or quantile of X . For example, X might be BMI and we might be interested in estimating the mean,
median or the upper 25 percentile of X in the population. These population quantities are often referred to as
population parameters. Intuitively, when estimating a population parameter, its sample counterpart should
be used; for example, the sample mean would be used to estimate the population mean. It turns out, however,
a more powerful method, called maximum likelihood estimation (MLE), is preferred. The MLE requires us
to assume X follows a particular family of distributions. Under that assumption, it then chooses the member
in the family that most likely produced the observed data.
We illustrate the likelihood method using the LOS data. Figure 2.12(a) shows the histogram of the
LOS data. We superimpose on the histogram the probability density function of an Exp(λ) distribution in
Figure 2.12(b). From Figure 2.12(b), we argue that an Exp(λ) distribution may be a reasonable model for
the LOS data.
Figure 2.12: Histogram of LOS data modelled using an exponential distribution

[(a) Histogram of LOS shown as proportions; (b) the same histogram with an exponential density f (t) = λe^{−λt} superimposed.]

However, we do not know what value of λ gives an exponential distribution that best describes LOS. We
can use the likelihood method to find the solution. To implement it, we first introduce the concept of a likelihood
(function). Define t_i as an observed LOS; then we write L(t_i |λ) as the likelihood that t_i is observed given the
Exp(λ) distribution under consideration.
q All logarithms in this set of notes are natural logarithms, unless otherwise specified.


The observed LOS are


15, 42, 8, ...
Suppose we wish to model the data using an Exp(λ = 0.1) distribution. The density of an Exp(λ = 0.1)
distribution is shown in Fig. 2.13. The likelihoods of the observations under this model are given by

L(15|λ = 0.1) = 0.1e^{−0.1×15} = f (15)
L(42|λ = 0.1) = 0.1e^{−0.1×42} = f (42)
L(8|λ = 0.1) = 0.1e^{−0.1×8} = f (8)
...,

and the likelihood for the entire set of data is

L(15|λ = 0.1) × L(42|λ = 0.1) × L(8|λ = 0.1) × ... = ∏_{i=1}^n L(t_i |λ = 0.1) = ∏_{i=1}^n f (t_i)

Figure 2.13: LOS data modelled using an Exp(λ = 0.1) distribution
[Plot of the density f (t) = 0.1e^{−0.1t} against LOS, with the observed values t = 8, 15 and 42 marked on the horizontal axis.]

Other exponential distributions with different values of λ can also be considered. The exponential
distribution that best describes the data is the one that maximises the likelihood, among all values of λ.
We can implement the procedure of finding the maximum by writing the likelihood as

L(λ) = ∏_{i=1}^n f (t_i |λ) = ∏_{i=1}^n λe^{−λ t_i}

Taking logarithms gives the log-likelihood

ℓ(λ) = Σ_{i=1}^n log(λe^{−λ t_i}) = n log λ − λ Σ_{i=1}^n t_i.

The MLE can be obtained using one of L(λ) or ℓ(λ), by calculus or numerical methods.
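A minimal numerical sketch of this maximisation (in Python, not part of the original notes; the LOS values are illustrative) shows that the numerical maximiser of ℓ(λ) agrees with the calculus solution, λ̂ = n / Σ t_i = 1/t̄:

    import numpy as np
    from scipy.optimize import minimize_scalar

    t = np.array([15.0, 42.0, 8.0, 3.0, 27.0, 5.0, 12.0])   # illustrative LOS values

    def neg_log_lik(lam):
        # negative of l(lam) = n log(lam) - lam * sum(t_i)
        return -(len(t) * np.log(lam) - lam * t.sum())

    res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10), method="bounded")
    print(res.x)              # numerical MLE of lambda
    print(1 / t.mean())       # closed-form MLE, n / sum(t_i); the two agree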
When carrying out estimation, we must acknowledge the sample is only a subset of the population, and
there will be a difference between the sample estimate and the population parameter. The difference between
the estimate derived from a sample and the parameter is called sampling error. A common measure of
sampling error is the standard error (SE) of the estimate. In many problems, the magnitude of the SE is a
function of two quantities, Var(X ) and n, the sample size. In particular,


1. SE increases with Var(X )
2. SE decreases as n increases; for example, for the sample mean, SE = √(Var(X )/n) = SD(X )/√n

A type of estimate that incorporates the uncertainty due to sampling is called confidence interval estimate
or simply interval estimate. An interval estimate has the form

(sample estimate − a, sample estimate + a),

for some a > 0. The quantity a is called a margin of error. The margin of error reflects the uncertainty in our
estimate due to sampling. We predict, with a certain level of confidence, that the parameter lies within the interval
(sample estimate ± a). Obviously, the wider the interval, the more certain we are that the parameter lies
within the interval. Hence, in an interval estimate, we adjust the interval width to reflect our confidence
level that the parameter would lie within the interval. If we wish to be 99% certain that the parameter lies in
our interval estimate, then we need to make the interval wider than if we are satisfied with 90% certainty,
etc. In practice, a balance needs to be made between the width of the interval and the level of confidence;
an interval that is too wide is not informative and a confidence level that is too low does not carry credibility.
It is customary, in most applications, that a 95% level of confidence is used. Under the assumption that the
sample is representative of the population and the sample size is reasonably large, a remarkable result called
the Central Limit Theorem (CLT) tells us that in most problems, a 95% confidence interval estimate is of the
form
sample estimate ± 1.96SE(sample estimate).r
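For the common case of estimating a population mean, this interval takes the familiar form x̄ ± 1.96 s/√n (a Python sketch, not part of the original notes; the numbers reuse the blood pressure summaries from Table 2.7 purely as an illustration):

    import numpy as np

    xbar, s, n = 122, 15, 19921          # mean, SD and sample size from Table 2.7

    se = s / np.sqrt(n)                  # standard error of the sample mean
    margin = 1.96 * se                   # margin of error at the 95% confidence level
    print(se, (xbar - margin, xbar + margin))   # roughly 122 +/- 0.21 mm Hg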

2.8 Hypothesis testing

For the study described in Example 2.1, in addition to the data in Table 2.1, the researchers also recorded
the numbers of admissions to the hospital that were deemed inappropriate. They found the proportion of
inappropriate admissions was 12.1% in 1988 and 19.3% in 1990. Some questions they had were whether the
rates of inappropriate admissions in the two periods were different overall, and also stratified (analysed in
groups defined by different categories of a variable) by gender, age and hospital ward. Such questions can
be answered by setting up appropriate hypotheses and using data to test the hypotheses. For example, to
determine whether there is a difference in the rates between 1988 and 1990, we can set up a null hypothesis
that there is no difference in the rates, p1 − p2 = 0, where p1 , p2 are the rates in 1988 and 1990, respectively.
The alternative hypothesis would be that p1 − p2 ̸= 0. The null hypothesis is often written as H0 and the
alternative is written as H1 or Ha.
Intuitively, we can assume p1 − p2 = 0 and then determine whether the observed data look plausible
under that assumption. If the answer is “Yes", then there is no reason to question that the rates are the same.
On the contrary, if the data are unlikely to have been observed under p1 − p2 = 0, then we may cast doubt on
H0 and may want to conclude that the rates are different. Hypothesis testing methods are all based on this
simple intuition.
In hypothesis testing situations, the hypotheses are set up so that H0 usually represents some norm or
standards, whereas H1 is the one that we want to prove. These hypotheses are not symmetrical in the sense
that H1 is not accepted unless there is strong evidence against H0 . For example, we assume there is no
difference in rates between 1988 and 1990 unless there is overwhelming evidence to show that it is not true.
Therefore, we take a conservative approach of not discounting the norm or standards (H0 ) unless we are fairly
certain that it is not tenable.
The analyses carried out by the researchers were given in Table 2 in their paper, which is reproduced
below in Table 2.8.

r In practice, 2 is sometimes used in place of 1.96.


Table 2.8: Proportions of admissions not fulfilling appropriateness. Hospital Universitari del Mar, Barcelona,
Spain, 1988 (n = 743∗ ) and 1990 (n = 633)

1988 total (%)   1990 total (%)   p-value†


Total 743 (12.1) 633 (19.3) < 0.001
Gender:
Male 344 (11.9) 321 (21.2)
Female 399 (12.2) 312 (17.3) < 0.001
Age (year):
14-39 213 (7.5) 158 (11.4)
40-49 86 (18.6) 73 (19.2)
50-59 114 (14.9) 98 (23.5)
60-69 151 (12.6) 136 (22.8)
70+ 179 (12.3) 168 (21.4) < 0.001
Hospital ward:
Medicine 336 (11.6) 257 (18.3)
Surgery 391 (13.0) 346 (19.9)
Others 16 (0) 30 (20.0) < 0.001

∗ Seven admissions with missing information
† p-value based on the exact Mantel-Haenszel test for several two-by-two tables

The researchers found that, had the rates been identical between 1988 and 1990, the probability of
observing a difference at least as extreme as 12.1% in 1988 versus 19.3% in 1990 would be 0.0007. The value 0.0007 is called a
p-value. A p-value gives the probability of observing outcomes at least as unusual as the given data, if H0 is
true. A p-value is commonly used as evidence against H0 . The lower the p-value, the stronger the evidence
against H0 . In this example, we would be inclined to reject H0 : p1 − p2 = 0 because, if we assume H0 is true,
the chance of observing data such as those in the study is only 0.0007 (0.07%), a very small number.
When we reject H0 , it is important to realise that, even though the data provided strong evidence against
H0 , there is still a 0.07% chance that the data could have occurred if H0 is correct. In other words, there
is a 0.07% chance that we could have wrongly rejected H0 . The probability that H0 is wrongly rejected is
called a type I (1) error (sometimes denoted by the Greek letter α, pronounced “alpha"). Another type of
error is the type II (2) error (sometimes denoted by the Greek letter β, pronounced “beta"), which is the
probability of failing to reject H0 when H0 is in fact false.
In our example, we decided to reject H0 because the p-value, and hence the chance of making a type I error
in rejecting H0 , is only 0.0007, a small value. However, there is no clear definition of what is considered “small" for a p-value. To
resolve this, a procedure called a significance test is sometimes adopted. A significance test is a rule for
deciding whether the data are “likely" or “unlikely" to have occurred by chance if H0 is true. There are a
number of definitions of what constitutes “likely" and “unlikely". A commonly used definition is to say that
the data are “unlikely" if there is no more than 5% chance that it would occur if the null hypothesis is true.
When the data appear with no more than 5% chance assuming H0 is true, we call the data significant at the
5% level. A set of data significant at the 5% level provides us with evidence to reject the null hypothesis. This
test is called a 5% significance test. In a 5% significance test, the chance of a type I error is no more than 5%. If
a p-value has been calculated, then a 5% significance test can be easily conducted by comparing the p-value


to 0.05, such that:


Reject H0 if p-value < 0.05
Do not reject H0 if p-value ≥ 0.05

The tests carried out by the researchers in Table 2.8 are called exact tests. An exact test does not require
distributional assumptions. However, in practice, it is not always possible to find conditions under which exact tests
can be carried out. In most cases, some assumptions are necessary to carry out hypothesis tests. The most common
assumptions are that the sample sizes are large and/or the data follow a normal distribution. These tests are carried
out by calculating test statistics, which can be converted into p-values.
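As a rough check of the overall comparison (a Python sketch, not part of the original notes; it uses a large-sample chi-squared test rather than the exact Mantel-Haenszel test used in the paper, and the counts of inappropriate admissions are reconstructed approximately from the reported percentages):

    import numpy as np
    from scipy.stats import chi2_contingency

    n1988, n1990 = 743, 633
    inap1988 = round(0.121 * n1988)          # about 90 inappropriate admissions in 1988
    inap1990 = round(0.193 * n1990)          # about 122 inappropriate admissions in 1990

    table = np.array([[inap1988, n1988 - inap1988],
                      [inap1990, n1990 - inap1990]])

    chi2, p, dof, expected = chi2_contingency(table)
    print(chi2, p)      # p-value well below 0.05, so H0 (equal rates) is rejected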

3 Regression

The purpose of statistical evaluation of health data is often to describe relationships between two variables or
among several variables. For example, one of the goals in the study of Example 2.1 was to determine whether
there had been a change in inappropriate health utilisation, but also whether inappropriate utilisation was
influenced by length of stay, as well as factors such as age and type of hospital ward. The variable to be
explained, inappropriate hospitalisation, is the outcome variable. The variables LOS, age, hospital ward are
the independent variables. The statistical methods designed to study these problems are called regression
analyses.
Regression analysis achieves three goals: (1) Description: Relationship between the independent
variables and the outcome can be summarised in a functional form called a regression model; (2) Prediction:
The values of the outcome variable can be predicted using observed values of the independent variables; and
(3) Explanation: Factors that influence the outcome can be identified.
For the types of data found in health care research, there are four common types of regression models:
linear regression, logistic regression, Cox regression and regression models for count outcomes.
These are chosen depending on the type of outcome variable that we are dealing with. Linear regression
is used for problems with a continuous outcome variable, for example, to study the
relationship between blood sugar level and age, race, gender, etc. Linear regression can be simple or
multiple, depending on the number of independent variables. Logistic regression is for a binary outcome.
In the study of Example 2.1, one measure of outcome was whether there has been any or no inappropriate
hospitalisation during LOS. In that case, each admission has a binary outcome, either 0 or 1+ days of
inappropriate hospitalisation. Cox regression is a special type of regression analysis that is applied to survival
or “time to event" data, for example, if the outcome is the time of relapse of cancer and some patients remained
disease free at the time of analysis. Many health care data involve count outcomes, for example, number
of hospital visits or insurance claims. In those cases, regression models that are specially designed for count
data should be used. In this chapter, we will focus on linear regression models. Logistic regression and other
types of specialised regressions will be discussed later on in this set of notes. For all regression analyses,
multiple independent variables are allowed and there is no restriction on the type of variables.
We use a dataset on life expectancies in 170 countries (Table A.1)a . The goal is to understand the
determinants of life expectancy. The level (and variability) of life expectancy has important
implications for individual and aggregate human behaviour; it affects fertility behaviour, economic growth,
human capital investment, intergenerational transfers, and incentives for pension benefit claims. From a
policy making view point, it has implications for public finance. For example, there has been a long debate
on the impact of increasing longevity on public funding of education and economic growth.
The conventional wisdom is that population life expectancy is a function of environmental measures
a Source: World Health Organisation.


(e.g., wealth, education), lifestyle measures (e.g., diet), and health care measures (e.g., medical expenditures).
However, the appropriate econometric methodology for disentangling these effects and its meaning for the
relative importance of the estimated effects remains unclear.
Table A.1 shows that for each country, there are seven variables recorded (excluding “Code" and
“Country.name", which are simply identifiers). For the following variables: Lexp (life expectancy), Health.exp
(Health expenditure), Literacy (Literacy percentage), Physicians (number of physicians per 1,000), GDP
(Nominal GDP) values will be log-transformed based on the discussion in Chapter 2.6.7.

3.1 Scatter plot

When we are trying to study the relationship between two or more variables, it is important to gain an initial
impression of the relationship. Fig. 3.1 shows a scatter plot of Health.exp against Lexp among the countries
in the dataset. In the scatter plot, each symbol represents a pair of values for a country, one value for Health.exp
and one for Lexp.b We observe from the scatter plot that countries with higher value for Health.exp tend to
have higher Lexp. Thus, the scatter plot suggests that the two variables may be positively associated or
positively related. In some cases, when a scatter plot shows that higher values of a variable tend to be paired
with lower values of another variable, then we call the two variables negatively associated or negatively
related. When we study associations or relationships between variables, we need to be careful about what the
terms mean. When two variables are related or associated, it does not mean that one causes or leads to
the other; it just means the two variables tend to vary in some systematic pattern.
We observe from Fig. 3.1 that the symbols seem to scatter along a straight-line trend; such a
relationship is called a linear relationship and the two variables are said to be linearly related. In some
cases, a scatter plot shows symbols that seem to follow a curve; in such cases, the variables are non-linearly
related, see Fig 3.2(a)-(b). A scatter plot is useful for giving an overall impression of the kind of relationship
between the variables, e.g., linear or non-linear. A scatter plot also allows us to identify unusual
observations. For example, in Fig 3.2(c), the scatter plot quickly shows that there is an unusual observation (the
+-mark) in the bottom left-hand corner when it is compared to the trend of the other observations: the trend shows
that higher values of one variable tend to be associated with lower values of the other variable, whereas for that
observation, the values of both variables are low. Unusual observations like this are called outliers. Normally,
outliers must be treated with special care. The topic of outliers is beyond the scope here; we will not pursue
it further.

3.2 Correlation

In Chapter 3.1, we showed that a simple graphical summary of the association of two variables is a scatter
plot. Here, we study a simple numerical summary of the (linear) association between two continuous random
variables. The measure is called the correlation coefficientc , Corr(X , Y ), between X and Y . The correlation
coefficient has the following characteristics:

1. It measures the strength of the association

2. It measures the direction of the association

3. It is not affected by the scale of measurements of X and Y

Based on a sample of n pairs of (X , Y ), we find a sample version of Corr(X , Y ), called a sample correlation
coefficient, which is often denoted by r. To understand the mechanics of r (and therefore that of Corr(X , Y )),
we have redrawn Fig. 3.1 as Fig. 3.3.

b there are 162 points, since 8 countries did not have data on one of the two variables
c The correlation coefficient we discuss here is called the Pearson (product moment) correlation coefficient, after Karl Pearson, 1857-1936

Figure 3.1: Scatter plot of Health.exp and Lexp
[Scatter plot of Y = life expectancy (Lexp, log scale) against X = log(Health.exp), with each country plotted using its three-letter country code; the points show an increasing, roughly linear trend.]

In Fig. 3.3, we use +-marks instead of country codes to represent the data in each country. Each country has a pair of values on (X , Y ), representing the country's Health.exp
and Lexp, respectively. In this set of data, the sample means of X and Y are, respectively, 5.91 and 4.28; these
are represented by the vertical and horizontal lines on Fig. 3.3. Countries for which X and Y are on the same side
of 5.91 and 4.28 (i.e., both larger or both smaller) are shown as green +-marks, while countries for which X and Y
are on opposite sides of 5.91 and 4.28 (i.e., one larger and the other smaller) are shown as red +-marks.
Countries with X , Y on the same side of their respective means suggest the two variables have a “positive"
relationship, while those with X , Y on opposite sides suggest a “negative" relationship. We observe in Fig. 3.3
that there are many more green +-marks than red +-marks. Hence, among these countries, Lexp tends to move
in tandem with Health.exp: countries with larger Health.exp tend to have higher Lexp.
The sample correlation, r, summarises this information in a single numerical value, such
that:

1. Strength: the magnitude of r measures the strength of the association. If |r| ≈ 1, the association is
strong; if |r| ≈ 0, there is no linear association

2. Direction: the sign of r measures the direction of the association. If r > 0, large X tends to be associated
with large Y ; if r < 0, large X tends to be associated with small Y

3. Invariance to scale change: r is not affected by the units or scale of measurement of X and Y ; moreover, for any pair of X , Y , r satisfies −1 ≤ r ≤ 1

Based on the data, the sample correlation is r = 0.81 > 0; hence it supports the impression from Fig. 3.3 that
Health.exp and Lexp move in tandem. In addition, since |r| is close to 1, the relationship is strong.
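Computing r from data is a one-liner in practice (a Python sketch, not part of the original notes; the (X , Y ) values are made-up illustrative pairs, not the WHO data):

    import numpy as np
    from scipy.stats import pearsonr

    # Illustrative (log health expenditure, log life expectancy) pairs
    x = np.array([3.2, 4.1, 5.0, 5.8, 6.5, 7.3, 8.1, 9.0])
    y = np.array([4.05, 4.12, 4.18, 4.25, 4.28, 4.33, 4.38, 4.43])

    r, p_value = pearsonr(x, y)
    print(r)                         # close to 1: strong positive linear association
    print(np.corrcoef(x, y)[0, 1])   # same value via the correlation matrix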


[(a) points scattered along a straight-line trend; (b) points following a curve; (c) a decreasing trend with one outlier (+-mark) in the bottom left-hand corner.]

Figure 3.2: Scatter plots showing linearly related variables, non-linearly related variables, and outliers

The characteristics of r are further illustrated in Fig. 3.4, which shows scatter plots for four hypothetical
datasets. In Fig 3.4(A), the numbers of red +-marks and green +-marks are approximately equal, which means
it is just as likely that X and Y are both relatively large (or small) as it is that one is relatively large and the other
small. So X and Y are not related, and this shows up as r ≈ 0. In Fig 3.4(B), there are few red marks but many
green marks, which means that very often X and Y are both relatively large (or small). The association between X
and Y is strong and consistent with r ≈ 1. Similarly, in Fig 3.4(C), there are many more green marks than
red marks, but the points do not follow a straight-line trend as closely as in Fig 3.4(B), so r is smaller
for the data in Fig 3.4(C). In Fig 3.4(D), there are many more red marks than green marks, so r is negative.
When r is used as a numerical summary of the association between two variables, it must be
remembered that it is only suitable for measuring linear associations. If two variables are non-linearly related,
or if outliers are present, calculating r from the data without further investigation would give misleading
results. Sometimes, even if the association is truly linear, using a non-random sample from the population
can also distort r. These situations are illustrated in Fig. 3.5. Fig. 3.5(A) shows a dataset with a linear
relationship between the variables; r in this case is an appropriate measure of linear association. Figs. 3.5(B)-
(D) all show some violation of the conditions for using r. In Fig. 3.5(B), the relationship is non-linear; in
Fig. 3.5(C), there is an outlier (red open circle); in Fig. 3.5(D), non-random sampling has been carried out
such that some observations (represented as grey dots) have not been collected; calculating r based on the sampled
observations (represented as black dots) gives a distorted impression of the true linear association.

3.3 Simple linear regression

In a correlation analysis, our interest is in the strength and direction of the relationship between two variables
without making any assumptions about one variable being the independent variable and the other the outcome
variable. If in addition to summarising the linear association between Health.exp and Lexp, we wish to study
the way Lexp changes as a function of Health.exp, then a regression analysis can be carried out. A scatter
plot of Health.exp and Lexp (Fig. 3.6) shows there seems to be a linear relationship between Health.exp and
Lexp.
A regression analysis aims to investigate whether a linear relationship holds between Health.exp and
Lexp. Specifically, we postulate a linear relationship between Health.exp and Lexp, represented by a straight
line, as follows:
Lexp = a + b(Health.exp), (3.1)

where a is the intercept and b is the slope of the line. In (3.1), Health.exp is the independent variable and
Lexp is the outcome variable. Equation (3.1) is an example of a simple linear regression because there is only
one independent variable and we wish to study a linear relationship between the independent variable and


Figure 3.3: Scatter plot of Health.exp and Lexp relative to their means

[Scatter plot of Y = Lexp against X = log(Health.exp), with vertical and horizontal reference lines drawn at the sample means 5.91 and 4.28, dividing the countries into quadrants.]

the outcome. If we denote the independent variable and outcome by X and Y , respectively, then the line can
be rewritten as:
Y = a + bX . (3.2)
The intercept a can be interpreted as the value of Y when X = 0. However, often a is not of interest or may
even be meaningless, e.g., no country has Health.exp (X ) of zero. The value of b represents the change in Y
for every unit difference in the value of X . In a linear regression, b captures the linear relationship between
X and Y . If b = 0, then X and Y are not linearly related; if b > 0, X and Y have a positive linear relationship
and; if b < 0, X and Y are negatively linearly related.
In reality, two variables will never follow perfectly a relationship like (3.1). The scatter plot in Fig. 3.6
illustrates this fact. The data do not fall on the straight line. In fact, there is no straight line relationship that
fits all the data. Hence, we need to take into account an “error" or “deviation" that will occur when a function
is used to explain the relationship between the variables. This can be achieved by replacing (3.2) with

Y = a + bX + e, (3.3)

where e is a random error or deviation of individual Y values from the straight line relationship a + bX . We
call (3.3) a linear regression model, and a, b the regression coefficients.
We interpret the regression model as follows:
(1) a + bX is the average value of Y for observations with a particular value of X

(2) Each observation Y differs from the average by a random error e

(3) For each known value of X , the values of Y ∼ N (a + bX , σ2 ). Since a + bX is the average of Y for a
particular X , it can be written as
E(Y |X ) = a + bX .


Figure 3.4: Scatter plots and r showing different types of associations

[Four scatter plots of Y against X: (A) r = −0.063, no apparent association; (B) r = 0.935, strong positive linear association; (C) r = 0.652, moderate positive association; (D) r = −0.439, negative association.]

E(Y |X ) is called a conditional expectation or conditional mean (cf., Chapter 2.5) of Y given X . A
conditional expectation has the same meaning as an expectation except that the former is based on a
“subset" of the population defined by the condition. For example, in the context of the data here, E(Y |X )
means the average Lexp for countries with Health.exp at level X . In a regression analysis, we assume
there are known values of X at X 1 , ..., X n and we investigate how E(Y |X ) changes over these values,
which is captured by the regression model

To investigate model (3.3), we need to find the unknowns a and b using the data. There are two main
methods for finding a and b: ordinary least squares (OLS) and maximum likelihood estimation (MLE). The
two methods differ in their assumptions about the random error e. In OLS, the random error e is just assumed
to follow a distribution with mean zero that does not depend on X . In MLE, e is assumed to follow a standard
normal distribution, N (0, 1). Therefore, comparing the two methods, MLE uses a stronger set of assumptions
about e.
Observe in Fig. 3.6, that a line is postulated for model (3.3). Obviously, other lines can be postulated by
changing the values of a and/or b. OLS finds the best fitting line by using the pair (a, b) that corresponds to
the line “closest" to the observed data. The situation is illustrated in Fig. 3.7 (a), which shows the distances
between the observed data and the postulated line as blue dotted lines on the figure. For each observation, a
measure of its distance to the postulated line is the squared (vertical) difference. The overall distance between the postulated
line and all the observations is the sum of squared differences over the dataset. Of course, there are many lines


Figure 3.5: Violations of the conditions of r

A B

● ●
● ● ● ●
● ●
● ● ●
● ● ●
● ●
● ●
● ●

C D

● ●

● ● ●
● ● ● ●

● ● ●
● ● ● ●

● ● ●

● ●

to consider, but it turns out that finding the best fitting line in terms of minimising the total sum of squared
differences is a simple computational task via Calculus. Fig. 3.7 (b) shows the line that minimises the sum of
the squared differences between model and data, hence it is called the “least squares" line, or more generally
fitted regression line.
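The least squares solution has a simple closed form, b̂ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)^2 and â = ȳ − b̂x̄, which the sketch below implements directly (Python, not part of the original notes; the data are the same illustrative pairs used earlier, not the WHO data):

    import numpy as np

    # Illustrative (X, Y) pairs; in the application X = log(Health.exp), Y = log(Lexp)
    x = np.array([3.2, 4.1, 5.0, 5.8, 6.5, 7.3, 8.1, 9.0])
    y = np.array([4.05, 4.12, 4.18, 4.25, 4.28, 4.33, 4.38, 4.43])

    b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a_hat = y.mean() - b_hat * x.mean()
    print(a_hat, b_hat)

    # np.polyfit minimises the same sum of squared differences and gives the same line
    print(np.polyfit(x, y, 1))       # returns [slope, intercept]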
Maximum likelihood estimation (MLE) tries to determine the line that is most plausible or most likely
to have generated the observed outcome values. MLE tries different pairs of (a, b) until it finds the line that
maximises the likelihood of observing the outcome values in the data. Since MLE maximises a likelihood, it
requires that we know more about the data than OLS. In particular, MLE requires a stronger assumption about the
distribution of e. For linear regressions with a continuous outcome, the distribution of e is usually assumed to
be normal with mean zero and constant variance, N (0, σ2 ). This assumption also allows inferences such as confidence
interval estimation and hypothesis tests to be carried out. Under a normality assumption for the distribution of e,
the best fitting lines found by OLS and MLE are identical. In fact, OLS and MLE always give the same results for
regression analyses with normally distributed e’s, even for cases with more than one independent variable. In cases
where the distribution of e is non-normal, however, the methods would not give the same solution, and MLE
is often the preferred method. For the data in the current application, the best fitting regression line found
by OLS and MLE is shown in Fig. 3.7 (b). The line is given by Ŷ = 3.96 + 0.055X , where the “ ˆ " symbol is a
convention for writing an estimate. Observe that since the estimated coefficient b̂ = 0.055 is positive,
it shows a positive relationship between Health.exp and Lexp: a higher level of health spending is associated
with a longer life expectancy. As discussed earlier, since no country has zero Health.exp, the intercept 3.96 has no


Figure 3.6: Postulated straight line relationship between Health.exp and Lexp

[Scatter plot of Y = Lexp against X = log(Health.exp) with a postulated straight line drawn through the points.]

physical meaning. For this reason, the values of the independent variable are sometimes adjusted by subtracting
the sample mean from each value, i.e., Z = X − X̄ , followed by a regression Y = a + bZ + e. Under that
formulation, Z = 0 has a physical meaning, namely that X is at its mean value, and consequently, â represents the
estimate of Y at Z = 0, i.e., at X = X̄ .
Once the linear regression line is fitted, it can be used to estimate the life expectancy of a country with health
expenditure that lies within the observed range (3 to 9.2, on the logarithm scale). When carrying out a prediction,
the country need not have health expenditure that is the same as any of the observed values, as long as it
falls within that range. Mathematically, it is possible to estimate the life expectancy of a country with health
expenditure outside the range of values observed in the study. However, such an extrapolation is generally
not recommended.
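For example, using the fitted line Ŷ = 3.96 + 0.055X, a country with log health expenditure X = 7 (within the observed range) would have a predicted log life expectancy of about 4.35, i.e., roughly 77 years (a Python sketch, not part of the original notes; X = 7 is an arbitrary value chosen for illustration):

    import numpy as np

    a_hat, b_hat = 3.96, 0.055           # fitted coefficients from the text

    x_new = 7.0                          # log(Health.exp) of a hypothetical country, inside (3, 9.2)
    y_hat = a_hat + b_hat * x_new        # predicted log(Lexp)
    print(y_hat, np.exp(y_hat))          # about 4.35 on the log scale, i.e., roughly 77 years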

3.4 Multiple linear regression

Earlier, we talked about description and prediction being two of the goals of a regression analysis. Very often,
the contribution of a single independent variable does not alone suffice to describe or predict the outcome
variable. This problem can be overcome by performing a multiple (linear) regression to study the joint
association between multiple variables and the outcome variable. In a multiple regression model, the outcome
variable is described as a linear function of the independent variables X j , as follows:

Y = a + b1 X 1 + b2 X 2 + ... + b p X p + e. (3.4)

The model permits the computation of a regression coefficient b j for each independent variable X j . In
Chapter 3.3, we observed how a simple linear regression model can be used to predict future values of Lexp,
and explain the relationship between Health.exp and Lexp. The same dataset also contains information on
other variables. It is then natural to ask if any or all of the other variables, for example, socio-economic


Figure 3.7: Plot showing (a) distances between data and postulated line and (b) best fitting line

[Two scatter plots of Y = life expectancy (Lexp) against X = log(Health.exp): panel (a) shows the distances between the data points and a postulated line; panel (b) shows the best fitting (least squares) line.]

measures such as GDP, Literacy; lifestyle measures such as Daily.caloric (intake), and health care measures
such as Physicians also help to explain Lexp.
Another reason a multiple regression would be useful, even if we were only interested in the association
between a particular independent variable and the outcome, relates to the third goal of a regression analysis,
that of explanation. In Chapter 3.2 and 3.3, we found that there is a positive association between Health.exp
and Lexp. Naturally, we might inquire to what extent a longer life expectancy is due to expenditure on
healthcare. The way an independent variable affects or influences an outcome variable is called a causal
relationship. The answer to this seemingly innocuous question is not straightforward. To understand why,
look at Fig. 3.8, which shows scatter plots between GDP and Lexp and between GDP and Health.exp. The
scatter plots show a strong positive relationship between GDP and Lexp, and between GDP and Health.exp.
In fact, the sample correlations, r, for these two pairs of variables are 0.82 and 0.97, respectively. Hence, while
the results in Chapter 3.3 show that countries with higher Health.exp are more likely to have higher Lexp, the
relationship between health spending and life expectancy may be more complex. It is possible that health
spending is not the only reason for longer life expectancy. For example, countries with higher Health.exp
are more likely to be those with higher GDP, whose residents have healthier diets, are better educated on
how to take care of themselves, are less likely to engage in dangerous work, etc., which in turn contributes to
higher Lexp. In this context, GDP is a potential confounder. A confounder is a variable that
influences not only the outcome but also other independent variables. The presence of confounders
can distort the apparent effect of the other independent variables. Fig. 3.9 illustrates this situation. In the figure,
U is a confounder that influences both X , the independent variable of interest and Y , the outcome. Under
this situation, it would be difficult to identify the influence of X on Y using a simple regression of Y on X that
omits U. Isolating the effect of X on Y, net of the influence of U, requires us to compare the effects
of different levels of X on Y , when U is held at a constant level. For example, we need to compare Lexp
between two countries with different Health.exp but with the same GDP. Confounders are not a problem if
data are collected through randomised studies, because by randomisation, data are balanced on all known and


Figure 3.8: Scatter plots between (a) GDP and Lexp and (b) GDP and Health.exp

[Panel (a): scatter plot of Y = life expectancy (Lexp) against X = GDP; panel (b): scatter plot of Y = log(Health.exp) against X = GDP, with sample correlation corXY = 0.97.]

Figure 3.9: Diagram showing the relationship between a confounder U, an independent variable X and an
outcome Y

[Diagram: U has arrows pointing to both X and Y; X has an arrow pointing to Y.]

unknown confounders. In such cases, analysis of the causal relationship between X and Y can be carried out
without considering any confounder U. In practice, however, data are often collected through observational
studies where it is impossible for U to be balanced. A multiple regression is a way to mitigate the influences of
potential confounders. In a multiple regression analysis with X and U as independent variables, the regression
coefficient for X represents the amount of its effect on the outcome, after U has been taken into account. In
this way, multiple regression analysis permits the study of multiple independent variables at the same time,
with adjustment of their regression coefficients for possible confounding effects between variables. However,
if there are confounders that are not known to the researchers, or known but cannot be measured, then
a multiple regression model without these confounders would not allow a causal relationship to be drawn
between X and Y. Confounding is a reason why we must exercise caution when making statements about
causal relationships from observational data, and why we always have to question whether there may be other
explanations for any association identified between an independent variable and the outcome variable. We will provide
a detailed treatment of confounding and causal relationship in a later chapter in this set of notes.


3.5 Interpretation of results

Table 3.1 provides the results from fitting different multiple linear regression models to the life expectancy
data. The outcome variable is Lexp. In the dataset, GDP is measured both as a continuous variable and as
a categorical variable, IncomeGroup. For a categorical variable, we need to choose a reference level. In
general, a categorical variable with k different levels is recoded into k − 1 binary 0-1 variables, where
each binary variable gives a comparison between a particular level and the reference level. For IncomeGroup,
“High Income" is chosen as the reference level. As discussed in Chapter 3.3, to give the model intercept an
interpretation, we adjust the continuous independent variables by their sample means. For Daily.caloric, we simply
define Z = X − X̄ and use Z instead of X in the model; for the remaining variables, since they have been
log-transformed, we first find the mean of the untransformed values and then adjust by subtracting the log of that
mean, to obtain Z = log(X) − log(X̄).
Three separate regression models were carried out. Model (1) uses a simple linear regression between
Health.exp and Lexp. Model (2) is a multiple linear regression model that uses all the independent variables
except IncomeGroup. Model (3) is a multiple linear regression model based on all variables except GDP.
Hence, compared to model (1), models (2) and (3) adjust for other independent variables. Models (2) and (3)
also examine the difference between using a continuous variable and a categorical variable to represent GDP.
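
In R, the three models can be fitted along the following lines. This is a sketch only: the data frame lifeexp and the column names are assumptions, and the continuous variables are assumed to have already been mean adjusted as described above.

    # Sketch: the three regression models of Table 3.1 (hypothetical names)
    m1 <- lm(Lexp ~ Health.exp, data = lifeexp)                                      # model (1)
    m2 <- lm(Lexp ~ Health.exp + Literacy + Daily.caloric + Physicians + GDP,
             data = lifeexp)                                                         # model (2)
    m3 <- lm(Lexp ~ Health.exp + Literacy + Daily.caloric + Physicians + IncomeGroup,
             data = lifeexp)                                                         # model (3)
    summary(m2)     # coefficient estimates, standard errors and R-squared

When IncomeGroup is stored as a factor, R automatically recodes it into k − 1 binary indicator variables, using the first factor level as the reference; the reference may need to be set explicitly, e.g. with relevel(), so that “High income" is the baseline.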
Table 3.1 shows the results of fitting the three regression models. As discussed above, IncomeGroup
has been recoded into binary variables, using “High income" as the reference level. The second column shows
results of model (1). The intercept represents the estimated mean Lexp at the reference level of all independent
variables. Since all continuous independent variables have been mean adjusted, the reference level is at
Z_i = 0 or, equivalently, X_i = X̄_i (each X_i having been adjusted by subtracting its sample mean). Therefore, the intercept
of 4.345 is the estimated mean Lexp for countries at the mean health expenditure (approximately 1200). Recall that Lexp has been log-transformed, so this figure is in log-years; taking the anti-log gives approximately 77.1 years. The
coefficient estimate for Health.exp is 0.055. Notice this coefficient is exactly the same as that in the model in
Chapter 3.3. This is because the only difference between the two models is that Health.exp has been shifted here
by a constant, which does not change its relationship with Lexp. Since Health.exp has
been mean adjusted, a value of Health.exp = 0 (i.e., log(health expenditure) − log(mean health expenditure) = 0)
(one unit above the mean, in log-scale) to be 4.345 + 0.055 × 1 = 4.4 in log-years. Upon taking anti-log, we
obtain exp(4.4) ≈ 81.5 years.
Our estimate of 81.5 years is the anti-log of the estimated mean Lexp (the mean log life expectancy), which is not
the same as the mean life expectancy. It is natural to inquire whether the model can be used to estimate the
latter. Unfortunately, the answer is “No"; taking anti-logs (or any other transformation) will not give us the answer
to our question. This is because the mean of any two numbers a and b, (a + b)/2, is not the same as the anti-log of
(log(a) + log(b))/2.d However, recall from Chapter 2.2 that the mean is not a useful summary for data from a
skewed distribution. Hence, we argue that since the distribution of life expectancy is skewed, it would not
be informative to use the mean life expectancy anyway. Consequently, it is sufficient to continue using mean Lexp
as a summary measure, whether it is expressed in log-years or in years.
The standard errors (SEs) of the coefficient estimates are given in parentheses under the respective
coefficient estimates. Recall from Chapter 2.7 that SE measures uncertainty in an estimate, in this case,
estimates of the regression coefficients. This uncertainty is sometimes reported as 95% confidence intervals.
Under a regression model where the errors are assumed to be normally distributed, and the sample size n is
large enough, the confidence intervals are approximately estimate ± 1.96 SE(estimate). A 95% confidence
d The anti-log is exp[(log(a) + log(b))/2] = exp[log(ab)/2] = (ab)^{1/2} ≠ (a + b)/2.


Table 3.1: Multiple linear regression of cross-country life expectancy study

Dependent variable: Lexp
                                      (1)           (2)           (3)
Constant                              4.345∗∗∗      4.324∗∗∗      4.328∗∗∗
                                      (0.006)       (0.007)       (0.010)
Health.exp                            0.055∗∗∗      0.025∗∗       0.023∗∗∗
                                      (0.003)       (0.011)       (0.008)
Literacy                                            0.087∗∗∗      0.086∗∗∗
                                                    (0.023)       (0.025)
Daily.caloric                                       0.00003∗∗     0.00003∗∗
                                                    (0.00001)     (0.00001)
IncomeGroup = Low income                                          −0.014
                                                                  (0.032)
IncomeGroup = Lower middle income                                 −0.007
                                                                  (0.024)
IncomeGroup = Upper middle income                                 −0.009
                                                                  (0.016)
Physicians                                          0.015∗∗       0.015∗∗
                                                    (0.006)       (0.006)
GDP                                                 0.001
                                                    (0.013)
R2                                    0.666         0.77          0.771
Observations                          162           156           156
Log Likelihood                        220.335       248.789       249.076
Akaike Inf. Crit.                     −436.669      −485.578      −482.152

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01


interval for the increase in mean Lexp for every unit increase in Health.exp is 0.055 ± 1.96 × 0.003, or 0.049
to 0.061.
Model (2) includes, in addition to Health.exp, four other independent variables; the results in the third
column show the estimated coefficients for these independent variables. All coefficients are positive. Since all
the independent variables in this model are continuous, this means higher values of an independent variable
are associated with higher Lexp; for example, countries with higher Literacy tend to have higher Lexp.
The estimated coefficient for Health.exp is now adjusted for all the other independent variables in the model.
The adjusted coefficient of 0.025 means that for countries with the same Literacy, Daily.caloric, Physicians
and GDP, each unit increase of Health.exp is still associated with a 0.025 increase in Lexp, although this
figure is attenuated compared with the value of 0.055 in model (1). The interpretation of an adjusted
association is not limited to Health.exp. In fact, all the estimated coefficients in the models are adjusted
coefficients. For example, the coefficient for Literacy of 0.087 is the change in mean Lexp for a unit change in
Literacy, holding all the other independent variables fixed.
Model (3) is identical to model (2) except GDP has been replaced by a categorical variable IncomeGroup,
using “High income" as the reference level. IncomeGroup has been recoded into three binary variables. The
coefficient estimates for all three binary variables are negative. For example, the value of −0.014 means that
compared to “High income", a “Low income" country corresponds to an estimated 0.014 drop in mean Lexp.
Once again, these coefficients are adjusted for other independent variables being held at a fixed value. For
this reason, we notice the estimated coefficient for Literacy is now 0.086, compared to that of 0.087 in model
(2), since GDP has been replaced by IncomeGroup.
In all three models, the coefficient estimate for each independent variable is the same within each
subgroup defined by the other independent variables. We will explore the option of allowing different effects
of an independent variable on the outcome variable in different subgroups in a later section in this chapter.
A multiple linear regression model postulates Y = a + b1 X 1 + b2 X 2 + ... + b p X p + e for p independent
variables, where the coefficient b j represents the relationship between X j and Y . If b j = 0, then X j and Y
have no relationship. To determine whether a particular independent variable X j has a relationship with Y
using the data, we evaluate whether its estimated coefficient shows significant departure from zero using a
statistical test. In model (1), the coefficient estimate for Health.exp is 0.055 with an SE of 0.003. A statistical
test takes the estimate, accounts for the uncertainty given by the SE, and compares it to 0. For Health.exp,
the test shows a p-value of < 0.01, which means that an estimate at least as extreme as 0.055 would be observed
with no more than a 1 percent chance if there were no relationship between Health.exp and Lexp. The result supports a relationship between
Health.exp and Lexp. Tests for all other estimated coefficients in models (2) and (3) are carried out using the
same idea.
In models (1)-(3), we postulated three linear regression models to study the relationship between the
independent variables and the outcome. Naturally, we would like to determine how well the independent
variables (and the models) explain the outcome. One way is to examine how well the model Y = a + b1 X 1 +
b2 X 2 + ... + b p X p predicts the data. If we fit this model to the data, we obtain a fitted model Ŷ = â + b̂1 X 1 +
b̂2 X 2 + ... + b̂ p X p . An overall measure of using the model to predict the data is

L1 = Σ_i (â + b̂1 X1i + ... + b̂p Xpi − Yi)².

Notice that each term in L1 measures the difference between the prediction and the observed data, hence,
L1 can be interpreted as the loss of using the model to predict the outcome. Obviously, the smaller the loss,
the better. To assess the model’s loss, we need a benchmark for comparison. One candidate benchmark is a
model with no independent variables, i.e., Y = a. This benchmark model is equivalent to setting b1 = b2 =
... = b p = 0. Using the benchmark model, the loss is:

L0 = Σ_i (â + 0·X1i + ... + 0·Xpi − Yi)² = Σ_i (â − Yi)².

Comparing L1 to L0 tells us how well our model explains the outcome, against the benchmark model. Formally,


we can compute
(L0 − L1)/L0 = 1 − L1/L0 = R2.

The quantity R2 (pronounced “R-square") is called the coefficient of determination. It measures the reduction
in “loss" between the benchmark (not using the independent variables) and the model with X1, ..., Xp.
Recall that whether OLS or MLE is used, we aim to find the best fitting line by minimising the distance between
the line and the outcomes (see, e.g., Fig. 3.7); hence the loss is simply the amount of deviation or error between
the fitted regression line and the data. That is why R2 is also called a measure of the Goodness-of-fit of a
regression model.e The value of R2 is between 0 and 1. A large value of R2 indicates the model explains Y , or
it provides a “good fit".f
The coefficient of determination, R2 can also be interpreted as a type of correlation coefficient, that
summarises the relationship between multiple independent variables and an outcome variable. Therefore, R2
is sometimes called a multiple correlation coefficient. For a simple linear regression model, with only one
independent variable, it can be shown that R2 = r 2 , where r is the sample correlation coefficient described in
Chapter 3.2.
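
For a fitted linear model, R2 can be recovered directly from the two losses. The R sketch below assumes the hypothetical objects fit and lifeexp from the earlier sketches.

    # Sketch: R-squared as the reduction in loss relative to the benchmark
    L1 <- sum(residuals(fit)^2)                        # loss using the fitted model
    L0 <- sum((lifeexp$Lexp - mean(lifeexp$Lexp))^2)   # loss using the benchmark Y = a (a-hat = mean of Y)
    1 - L1 / L0                                        # matches summary(fit)$r.squared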
At the bottom of Table 3.1, the R2 for the three models are all relatively high. In addition, R2 for models
(2) and (3) are both around 0.77, and higher than that of model (1), 0.66. That means comparatively, models
(2) and (3) explain more of the variation in Lexp than model (1). Another way to evaluate a model is to
calculate the likelihood of observing the data given the model. For models (1) to (3), the log-likelihood
values are given below the values of R2 . Both models (2) and (3) have a higher log-likelihood than model
(1). Compared to models (2) and (3), model (1) is a sub-model, since models (2) and (3) both contain other
independent variables, in addition to Health.exp. We can compare between a model and any of its submodels
using log-likelihood values. However, likelihood values cannot be used directly to compare models when neither
is a submodel of the other. For example, model (2) is not a submodel of model (3), because model (3)
does not contain GDP; nor is model (3) a submodel of model (2), because model (2) does not include IncomeGroup.
Both R2 and the log-likelihood value are non-decreasing when more independent variables are added to an
existing model. However, if the addition of an independent variable only leads to incremental improvement in
explaining the outcome, but at the expense of a more complicated model, then the increase in R2 or likelihood
value may not be justified.
One method that allows comparison without requiring one to be a submodel of another, and also imposes
a penalty for an unnecessarily complex model with too many independent variables is a scaled version of the
negative log-likelihood called the Akaike Information Criterion (AIC). An alternative measure to the AIC is
called the Bayesian Information Criterion (BIC). The AIC or BIC for a model is usually written in the form
[-2logL + kp], where L is the likelihood value, p is the number of variables in the model, and k is 2 for AIC
and log(n) for BIC. Hence the only difference between AIC and BIC is in the penalising value k; BIC penalises
heavier and therefore, for large samples, BIC tends to choose models with a smaller number of variables than
AIC. The AIC values for the three models are shown at the bottom of Table 3.1. A small value of AIC indicates
a preferred model. Comparing the three models, the AIC values for models (2) and (3) are much smaller
than that of model (1), and hence, models (2) and (3) are preferred over model (1). There is practically no
difference between models (2) and (3).
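
In R, these criteria are available directly for fitted model objects; the sketch below uses the hypothetical fits m1, m2 and m3 from the earlier sketch.

    # Sketch: comparing models by log-likelihood, AIC and BIC
    logLik(m1); logLik(m2); logLik(m3)
    AIC(m1, m2, m3)    # penalty k = 2 per variable; smaller values are preferred
    BIC(m1, m2, m3)    # penalty k = log(n); tends to favour smaller models in large samples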

e R2 can also be interpreted as follows. In the data, we observed that Lexp varies among countries; how much of this variation can a model
(the independent variables) explain? We can compare the variation in Lexp that is explained by the model to
the total variation in Lexp under the benchmark model (of not using any of the independent variables), i.e.,
R2 = (variation in outcome explained by the model) / (variation in outcome under the benchmark).

f Even though there is no universal benchmark for what value of R2 constitutes a good fit, a commonly accepted guideline is: <0.1
(poor), 0.1−0.3 (weak), 0.3−0.5 (moderate), 0.5−0.7 (good), >0.7 (very good); Cohen, 1988, Statistical Power Analysis for the
Behavioral Sciences (2nd ed.), Hillsdale, NJ: Lawrence Erlbaum Associates.


3.6 Regression diagnostics

For each of models (1)-(3), we have made a number of assumptions in the model. In particular, we assumed
that (1) the deviations or errors of the outcome are randomly distributed around the regression model; (2)
the distribution of the deviation is not a function of the independent variables. After a regression model has
been fitted, we can use the data to examine whether these assumptions are approximately valid by using the
residuals, defined by ê_i = Y_i − Ŷ_i, i = 1, ..., n. If the model assumptions are not violated, then the residuals
êi s should resemble a set of random observations with mean zero, and constant variance. If we plot the
residuals against X i s (if there is a single independent variable), or against Yi s or Ŷi s (if there are multiple
independent variables), ideally, the plot should look like the top left hand panel of Fig. 3.10, where it can
be seen that the residuals show no relationship to the X -values. On the other hand, the remaining plots in
Fig. 3.10 reveal different degrees of assumption violation. In particular, the top right hand
panel suggests the distribution of the error e is skewed; the residual plot shows a consistently higher spread
below zero than above zero over the observed range of X. The bottom left hand panel indicates that the
relationship between X and Y is non-linear; there is a clear non-linear trend in the residual plot. The bottom
right hand panel suggests the error distribution has non-constant variance across different values of X ; the
spread of the residuals changes over the observed range of X . When a residual plot suggests violations of any
of the assumptions in a regression model, then the data should be re-examined or a different model should
be considered before proceeding.
Another way to check the assumptions of a regression model is to use a quantile-
quantile (QQ) plot, where the empirical quantiles of the residuals are plotted against the theoretical quantiles
expected under a normal model. A QQ-plot goes beyond a residual plot in the sense that it examines whether
the normality assumption of error distribution is violated. When the normality assumption holds, the QQ-plot
shows a linear trend.
Fig. 3.11 shows the residual plot and QQ-plot of the residuals for model (3). The residual plot shows
that the residuals resemble a plot of random values, except for a few values that are slightly skewed on the
low side of the plot. The median of these residuals is 0.006 with Q1= −0.021, Q3= 0.029, so any violation
of the assumptions is mild.
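
These diagnostics are straightforward to produce in R; the sketch below uses the hypothetical fitted object m3 from the earlier sketches.

    # Sketch: residual plot and QQ-plot for the fitted model m3
    res <- residuals(m3)
    plot(fitted(m3), res, xlab = "Predicted outcome", ylab = "Residuals")
    abline(h = 0, lty = 2)           # horizontal reference line at zero
    qqnorm(res); qqline(res)         # QQ-plot against theoretical normal quantiles
    summary(res)                     # median, Q1 and Q3 of the residuals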

3.7 The selection of variables

In Chapter 3.5, the relationship between Health.exp and mean Lexp was assumed to be the same in all
subgroups. This means the effect of spending another 1000 USD per capita on health care is the same whether
the country is a “Low income" or “High income" country, and also the same regardless of the level of daily
dietary intake. If we believe the effect of health expenditure may not be the same across different subgroups,
then we need to turn to the concept of effect modification. Unlike confounding, effect modification occurs
when an independent variable X has different effects on the outcome Y among different subgroups defined
by another variable U. In that setting, U is called an effect modifier. For example, if we believe the
effect of health expenditure on life expectancy is different among countries of different income levels, then
GDP or IncomeGroup would be an effect modifier of the relationship between Health.exp and Lexp. Effect
modification is sometimes also called interactions and it can be entered into a regression model by defining
an interaction term. For example, suppose U, X are both independent variables and we believe U is an effect
modifier, then we can set up a regression model as follows: Y = a + b1 × X + b2 × U + b3 × X × U + e.
In this regression model, the coefficient b3 allows the possibility that U “modifies" the effect of X on Y . If
b3 = 0, then there is no effect modification; otherwise, the effect of X on Y when U = 0 is b1 X , but it
is b1 X + b3 X = (b1 + b3)X , when U = 1, so the effect of X depends on the value of U in the regression
model. Effect modification is not limited to a single effect modifier. In a multiple regression model, there
can be multiple effect modifiers that define different sub-groups, for instance, “Low income" countries with
Daily.caloric < 2000.
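
In R, an effect modifier is specified through an interaction term in the model formula; the sketch below again uses the hypothetical data frame lifeexp and column names.

    # Sketch: IncomeGroup as a possible effect modifier of Health.exp
    m_int <- lm(Lexp ~ Health.exp * IncomeGroup + Literacy + Daily.caloric + Physicians,
                data = lifeexp)
    summary(m_int)   # the Health.exp:IncomeGroup coefficients are the effect-modification (b3) terms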
In Chapter 3.6, the regression diagnostics showed that, even though there is no clear violation of the
assumptions of model (3), the model can possibly be improved. We discussed the possibility of adding effect


Figure 3.10: Examples of residual plots

[Four plots of residuals against X: “Random" (no assumption violation), “Skewed distribution", “Non-linear", and “Non-constant variance".]

Figure 3.11: Residual plot (a) and QQ plot (b) for model (3) in cross-country life expectancy data

[Panel (a): residuals plotted against the predicted outcome; panel (b): QQ plot of the sample quantiles of the residuals against theoretical normal quantiles.]


modification in the model. Another option is to consider a non-linear regression model. An example of a non-
linear multiple regression model is a polynomial regression: Y = a + f1(X1) + f2(X2) + ... + e, where f1, f2, ...
are k-degree polynomials, e.g., f1(X1) = b11 X1 + b12 X1^2 + ... + b1k X1^k. Under the same assumptions as a multiple
linear regression, this non-linear regression model can still be fitted using least squares or maximum likelihood.
Sometimes, it is also possible to improve an existing model by including other independent variables that have
not been considered, and functions of these variables.
Whether a regression requires the inclusion of effect modifiers, non-linear functions of the existing
independent variables, or additional independent variables, healthcare research often involves a very large
number of potential factors. The goal of statistical analysis is to find out which of these factors
truly have an effect on the outcome variable.
One way to carry out a multiple regression is to include all potentially relevant independent variables
in the model. The problem with this method is that the number of observations that can practically be
made is often less than the model requires. In general, the number of observations should be at least 20
times greater than the number of variables under study. A model with many independent variables, most
of which make unimportant contributions to explaining the outcome, would be very difficult to interpret.
Furthermore, irrelevant independent variables may appear in the model by chance and a model with many
irrelevant independent variables may not be reproducible when it is applied to future data. We will discuss
how these problems can be circumvented.
The basic principle of statistical model building is to minimise the number of variables until the most parsimonious
model that explains the outcome is found. There are often variables that should be included in the model
in any case – for example, if policy makers are interested in the effects of health expenditure and number
of physicians on life expectancy – then these variables should be included whether or not they are found to
contribute to the explanatory power of the model. A model should also include all known and important
potential confounders. For the other independent variables, selection is performed to include only those
that contribute meaningfully to the explanatory power of the model. There are various methods of selecting
variables. Whichever method is used, selection is normally carried out with the aid of (1) tests of significance of
the individual estimated coefficients, or (2) tests of the improvement of a criterion such as R2 or
AIC.

3.7.1 Forward selection

Forward selection starts with a model with no independent variables. Variables are then added iteratively,
starting with the one with the highest correlation with the outcome, until there are no variables left that make
any appreciable contribution to the outcome.

3.7.2 Backward selection

Backward selection starts with a model that contains all potentially relevant independent variables. The
independent variable that contributes the least is then removed from the model. This procedure is iterated
until no independent variables are left that can be removed without markedly worsening the prediction of the
outcome.

3.7.3 Stepwise selection

Stepwise selection combines certain aspects of forward and backward selection. Like forward selection, it
begins with a model with no independent variables, adds the single independent variable that makes the
greatest contribution toward explaining the outcome variable, and then iterates the process. At every step
of the iteration process, all the independent variables currently in the model are checked to determine whether
each remains relevant. If a variable that was relevant becomes irrelevant after the inclusion of a new variable,
the now-irrelevant variable is removed.
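
The three procedures in Chapters 3.7.1-3.7.3 can all be carried out in R with the step() function, which selects variables by AIC. The sketch below uses the hypothetical data frame lifeexp from the earlier sketches.

    # Sketch: forward, backward and stepwise selection by AIC
    full <- lm(Lexp ~ Health.exp + Literacy + Daily.caloric + Physicians + IncomeGroup,
               data = lifeexp)
    null <- lm(Lexp ~ 1, data = lifeexp)

    step(null, scope = formula(full), direction = "forward")   # forward selection
    step(full, direction = "backward")                         # backward selection
    step(null, scope = formula(full), direction = "both")      # stepwise selection

Note that step() selects on AIC rather than individual significance tests, corresponding to criterion (2) above.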


3.7.4 Block inclusion

There are also variables that must be considered as a group. In those cases they must be included or excluded
from the model as a group. In this way, one can combine the forced inclusion of some variables with the
selective inclusion of further independent variables that turn out to be relevant to the explanation of variation
in the dependent variable.

3.7.5 Cautions about stepwise procedures

Even though stepwise procedures are well established in solving regression problems with many independent
variables, there are also many critics who advise against their routine use. First, there is no guarantee that
the subsets obtained from stepwise procedures will contain the “best" subset, as there may be others that
are equally good or better. Second, in many modern-day studies, where it is not uncommon to find more
independent variables than observations (p > n), backward elimination is not a feasible procedure. In
addition, for large p, the number of possible iterations in these procedures may become impractically large.
Third, the multiple tests carried out between iterations may present a problem. Fourth, the coefficient estimates
may be biased. In Chapter 8, we will discuss alternative solutions to reducing the dimension of a model.

4 Logistic Regression

There are many instances in healthcare research in which the outcome of interest is dichotomous (binary)
rather than continuous. For example, the risk of myocardial infarction or stroke with high blood pressure; the
influence of body mass index (BMI) on health outcomes such as diabetes. In these examples, the outcome of
interest (e.g., myocardial infarction, stroke, or diabetes) is dichotomous (yes/no).
One of the outcomes for the LOS data in Example 2.1 is whether the stay contains at least one day of
inappropriate hospitalisation. When the outcome is binary and the independent variable is categorical, then
a simple way to obtain an analysis of the association is to use a contingency table (c.f., Chapter 2.5). For
example, we can express the relationship between Year of admission and the outcome Inap, as follows:
A simple way to express a binary outcome is risk, which is simply the probability of an event happening.
For example, the risk of Inap = 1+ in 1988 is 310/750 ≈ 0.41; similarly, for 1990 the risk is 310/633 ≈ 0.49. To
analyse the relationship between Year and Inap, we can calculate the risk ratio or relative risk, both of which
are often written as RR. An RR > 1 suggests increased risk, and RR < 1 suggests reduced risk. For each categorical
independent variable, one level is typically designated as the reference level and comparisons are made to that
level. For example, referring to Table 4.1, 1988 could be used as the reference level if we wish to determine
whether there has been a change in the risk of Inap from 1988 to 1990. To compare the two years, RR is calculated.
In this example, RR = (310/633)/(310/750) ≈ 1.2. Since RR > 1, the risk of Inap = 1+ has increased from 1988 to 1990.
An alternative to risk is odds. The odds is a ratio of the probability of an event occurring to the probability
of the event not occurring. The odds has a range from 0 to ∞, whereas the risk has a range between 0 and 1.
When the risk of an event is 0.5, the odds is equal to 1. Risks ranging from 0.5 to 1 correspond to odds of 1 to
∞, and probabilities ranging from 0 to 0.5 correspond to odds of 0 to 1. The relationship between the odds
of an event and the risk is given by the formula: Odds= Risk/(1-Risk); or Risk = Odds/(1+Odds). Table 4.2
illustrates this relationship. For increasing odds, the risk increases up to a limit of 1. When the event is rare,
i.e., risk is low, the odds and the risk are very similar, but the two are drastically different for non-rare events,
i.e., risk is moderate to high.

Table 4.1: Contingency table of Year vs. Inap.0.vs.1+

                       X (Year)
Y: Inap (0 vs. 1+)     88      90      Total
0                      440     323     763
1+                     310     310     620
Total                  750     633     1383


Table 4.2: Relationship between Risk and Odds

Risk Odds
1/1000 (0.1%) 1/999
1/100 (1%) 1/99
1/10 (10%) 1/9
1/4 (25%) 1/3
1/2 (50%) 1/1=1
3/4 (75%) 3/1=3
9/10 (90%) 9/1=9
99/100 (99%) 99/1=99
999/1000 (99.9%) 999/1=999

The analysis using contingency tables, RR and OR can be easily extended to situations where the
independent variable is categorical with more than two levels. In that case, a reference level can be defined
and RR or OR can be calculated relative to that reference level.
From this example, the odds of Inap = 1+ is 310/440 ≈ 0.70 in 1988 and 310/323 ≈ 0.96 in 1990. In this case,
since Inap = 1+ is not a rare event, the odds are very different from their corresponding risks.
A ratio of two odds is called an odds ratio (OR). An odds ratio greater than 1 indicates an increased
risk for the outcome, whereas an odds ratio less than 1 indicates a decreased risk for the outcome. Calculating
the OR between 1988 and 1990, we obtain OR = (310/323)/(310/440) ≈ 1.36, which shows an increased risk of Inap from
1988 to 1990.
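
The calculations above are easy to reproduce in R directly from the counts in Table 4.1; this is just an arithmetic sketch.

    # Sketch: risk, odds, RR and OR from the 2x2 table (Table 4.1)
    risk_88 <- 310 / 750               # risk of Inap = 1+ in 1988, ~0.41
    risk_90 <- 310 / 633               # risk of Inap = 1+ in 1990, ~0.49
    RR <- risk_90 / risk_88            # relative risk, ~1.2

    odds_88 <- risk_88 / (1 - risk_88) # = 310/440, ~0.70
    odds_90 <- risk_90 / (1 - risk_90) # = 310/323, ~0.96
    OR <- odds_90 / odds_88            # odds ratio, ~1.36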

4.1 Logistic regression

In a typical study, there is more than one independent variable to consider. Even if all of the variables are
categorical, a contingency table will no longer be an efficient way of visualising and analysing the relationship
between the independent variables and the binary outcome. In cases where some of the independent variables
are continuous, then contingency tables cannot even be created unless each continuous independent variable
can be suitably recoded as a categorical variable. The method to be considered to model the influence of a
number of independent variables on a binary outcome is called a logistic regression.
Before we discuss logistic regression, we first discuss why the linear regression model in Chapter 3 would
not be suitable for studying data with a binary outcome. Fig. 4.1 shows the scatter plot between X = LOS
and Y = Inap and a simple linear regression model fitted to the data. There are two reasons why the fitted
straight line would not be useful in explaining the outcome of interest, Inap, which is either 0 (no inappropriate
hospitalisation) or 1 (1+ days of inappropriate hospitalisation). First, for any value of X in the range of the
observed independent variable, the regression line predicts a value of Y that can lie anywhere on
the line. However, the outcome is always a binary 0-1 value, hence the prediction of any value other than 0
and 1 is meaningless. Second, and related to the first, the straight line extends beyond 1 (the
larger of the two possible values) for some X values within the range of the observed data, which also does not
make sense (for some datasets the line can also extend below 0).
Rather than fitting a line directly to the binary outcome, a logistic regression model instead uses a
transformation called a logit or log-odds.
The reason why linear regression fails to produce a satisfactory model for a binary outcome, Y , is Y has
only two distinct values, 0 or 1, whereas an independent variable can be binary, categorical or continuous, and
a model such as a+ b1 X 1 + b2 X 2 +...+ b p X p can give predicted values different from 0 and 1. One way to resolve
this conundrum is to model the probability of the outcome instead: P(Y = 1) = a + b1 X 1 + b2 X 2 + ... + b p X p .
While the left hand side of this model allows a continuum of values, it still does not entirely solve the problem.


Figure 4.1: Fitted simple linear regression model Y = a + bX between LOS and Inap

[Scatter plot of Y = inappropriate hospitalisation (Inap, 0 or 1) against X = length of stay (LOS), with the fitted straight line overlaid.]

This is because a probability is bounded by (0,1), but the right hand side of the model can produce predicted
values that are < 0 or > 1. Our solution is to take a logit transform of the probability, given by logit =
log(P(Y = 1)/(1 − P(Y = 1))), and model it as a function of the independent variables, as follows:

log[ P(Y = 1) / (1 − P(Y = 1)) ] = a + b1 X1 + b2 X2 + ... + bp Xp. (4.1)

We can check that the left hand side of (4.1) can take any value over (−∞, ∞), so the right hand side can never
produce a prediction that falls outside the admissible range. Notice that P(Y = 1) is equivalent to
the risk and P(Y = 1)/[1 − P(Y = 1)] is the odds discussed in Chapter 4.1. Therefore, the logit is also called
the log-odds. This is one reason why odds, instead of risk, is commonly used in problems involving binary
outcomes.
Model (4.1) is called a logistic regression model. Since the logistic regression model tries to model the
log-odds of the binary outcome, rather than the outcome itself, the “best fitting" model is no longer measured
by minimising the distance of the model to the 0-1 outcome values, and the least squares approach does not
apply. The approach used to fit (4.1) is maximum likelihood estimation (MLE). To see why MLE works, notice
that even though (4.1) is in terms of log-odds, it can be easily shown that (4.1) can be back-transformed to
obtain a model for P(Y = 1) as follows:

P(Y = 1) = exp(a + b1 X1 + b2 X2 + ... + bp Xp) / [1 + exp(a + b1 X1 + b2 X2 + ... + bp Xp)]. (4.2)

Therefore, (4.1) postulates a form (4.2) for P(Y = 1) and MLE seeks to find estimates of the most suitable
values of a, b1 , ..., b p that most likely produce the observed 0-1 outcome values in the data. Under this set-up,
we therefore, assume the binary outcome variable is a Bernoulli random variable with probability of “success"
given by (4.2) (cf., Chapter 2.6).
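
In R, a logistic regression is fitted by maximum likelihood with glm() and the binomial family. The sketch below assumes a hypothetical data frame los_data containing the LOS study variables.

    # Sketch: logistic regression of Inap on the centred log length of stay
    los_data$logLOS <- log(los_data$LOS / 10)     # X = log(LOS) - log(10)
    fit1 <- glm(Inap ~ logLOS, data = los_data, family = binomial)
    summary(fit1)        # should give approximately logit(Inap) = 0.183 + 1.157 logLOS
    exp(coef(fit1))      # exponentiated coefficients: baseline odds and OR per unit of logLOS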


We illustrate the logistic model using a single independent variable LOS and outcome Inap. The
independent variable LOS is highly skewed, so, as we did earlier in Chapter 3 and following the original research
of Alonso et al.,a we adjust LOS by taking its log value and subtracting log(10) (an adjustment of 10 days);
hence, we define X = logLOS = log(LOS) − log(10) = log(LOS/10). The model is log(P(Y = 1)/(1 − P(Y = 1))) = a + bX. The value of the coefficient b determines the direction of the relationship between X and the
logit of Y and therefore P(Y = 1). When b is greater than zero, larger (or smaller) X values are associated with
larger (or smaller) logits of Y . Conversely, if b is less than zero, larger (or smaller) X values are associated
with smaller (or larger) logits of Y . Using the data, the fitted model is logit(Inap) = 0.183 + 1.157logLOS.
The estimated intercept is 0.183, and the coefficient estimate for b is 1.157. The intercept indicates that the
logit of having inappropriate hospitalisation during the stay is 0.183 for an admission with logLOS = 0, which is
equivalent to LOS = 10 days. The slope b indicates that for every 1 unit increase in logLOS, the logit of having
inappropriate hospitalisation goes up by 1.157.
If we exponentiate the log-odds, we obtain the odds. Hence,

Odds = exp(a + b × X). (4.3)

If we set X = 0 (log(LOS/10)=0 or LOS= 10 days), then exp(a) = 1.2 gives the odds if LOS= 10 days. If we set
X = 1 , then exp(a + b × 1) = 3.82 gives the odds at logLOS =1 or log(LOS/10)=1 or LOS≈ exp(1) = 2.7 × 10
days. From these, we can obtain the odds ratio (OR) for every unit increase in logLOS to be

OR = (Odds at logLOS = 1)/(Odds at logLOS = 0) = exp(a)exp(b)/exp(a) = exp(b) = 3.18. (4.4)

The OR can be interpreted as the change in odds. Hence, for an additional unit of logLOS, the odds
increases by a factor of 3.18. In fact, in a logistic regression model, exponentiating the coefficient for any
independent variable gives the OR for any unit increase in the independent variable. For example, the OR
between logLOS= 2 and logLOS=3 is also exp(b) = 3.18.
Recall the relationship between OR and RR (relative risk). For rare events, OR ≈ RR, hence in such cases,
exp(b) from a logistic regression can be interpreted as a measure of RR.
The logistic regression model can be used to estimate the probability that an admission will have the
outcome. For example, the logistic regression model for inappropriate hospitalisation is:

P(Y = 1) = exp(a + bX) / [1 + exp(a + bX)]. (4.5)

In this equation, P(Y = 1) represents the predicted probability (risk) of inappropriate hospitalisation. Thus,
the predicted probability of inappropriate hospitalisation at LOS = 10 days (log(LOS/10)=0 or X = 0) is
exp(a)/(1 + exp(a)) = 0.55. The predicted probability of inappropriate hospitalisation for log(LOS/10) = 1
or LOS =2.7× 10 days is exp(a + b × 1)/(1 + exp(a + b × 1)) = 0.79, an increase from that of LOS = 10 days.
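
These predicted probabilities can be obtained from the fitted object with predict(); the sketch below continues from the hypothetical fit1 above.

    # Sketch: predicted log-odds and probabilities at logLOS = 0 and 1
    new_los <- data.frame(logLOS = c(0, 1))               # LOS = 10 days and LOS ~ 27 days
    predict(fit1, newdata = new_los, type = "link")       # predicted logits
    predict(fit1, newdata = new_los, type = "response")   # predicted probabilities, ~0.55 and ~0.79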
There are other variables as well in the length of stay data. We next included all the remaining variables
as independent variables in a logistic regression model. Three of the independent variables are categorical:
Year, Ward, and Gender. For Ward, there are three levels, so two binary dummy variables are created
with the medical ward used as the reference level: ward = 2 (equal to 1 if the ward is surgical and 0 otherwise) and
ward = 3 (equal to 1 if the ward is of another type and 0 otherwise). The remaining variable, Age, is continuous; it is
adjusted by subtracting 54 (years) from every value. The results are given under model (2) in Table 4.3. For comparison, we also included results
with logLOS as the only independent variable (model (1)).
As in the case of a multiple linear regression, the coefficients for model (2) are all interpreted as
adjusted coefficients. For model (2), the intercept of −0.084 is interpreted as the log odds for inappropriate
hospitalisation for a patient at the reference level, i.e., age = 54 years, Gender =1 (male), Year= 1988, Ward
=1, and LOS= 10 days. The coefficient for ward = 2 (surgical) is −0.008, which is the log OR between ward=2
a
Alonso J., Muñoz A., Antó J.M. Using length of stay and inactive days in the hospital to assess appropriateness of utilisation in
Barcelona, Spain. J Epidemiology Community Health. 1996 Apr;50(2):196-201
b
Length of stay in days


Table 4.3: Logistic regression models of Inap as outcome

Dependent variable: Inap
                        (1)           (2)           (3)
Intercept               0.183∗∗∗      −0.084        −0.123
                        (0.065)       (0.123)       (0.085)
logLOS                  1.157∗∗∗      1.186∗∗∗      1.228∗∗∗
                        (0.074)       (0.082)       (0.077)
ward=2 (Surgical)                     −0.008
                                      (0.134)
ward=3 (others)                       −0.388
                                      (0.403)
Year=1990                             0.709∗∗∗      0.715∗∗∗
                                      (0.129)       (0.128)
Sex=2 (Female)                        −0.117
                                      (0.126)
Age                                   0.005
                                      (0.003)
Observations            1,383         1,383         1,383
Log Likelihood          −784.498      −766.656      −768.392
Akaike Inf. Crit.       1,572.997     1,547.311     1,542.783

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01


and the reference, which is medical ward. If we exponentiate −0.008, it gives us the OR between ward=2 and
medical ward, which is exp(−0.008) = 0.99, suggesting that there is practically no difference between the
two types of wards. Similarly, the coefficient for year=1990 is 0.709, hence the OR between 1990 and 1988 is
exp(0.709) = 2.03, so the odds of inappropriate hospitalisation in 1990 are about twice as high as in 1988. Comparing
models (1) and (2), there is practically no difference between the unadjusted and the adjusted coefficients for
logLOS. As in the case of a linear regression model, for each coefficient, a statistical test can be carried out to
determine whether there is evidence that it is non-zero. For model (1), the unadjusted coefficient for logLOS
is significant (p < 0.01) suggesting a relationship between logLOS and the outcome Inap. For model (2),
two coefficients, year and logLOS are significant. Based on the results of model (2), we created model (3),
that includes only the independent variables with significant coefficients. As in model (2), the coefficients in
model (3) are adjusted coefficients. For example, the intercept is −0.123, which represents the log
odds of inappropriate hospitalisation for those with logLOS = 0 (equivalently LOS = 10 days) and
Year = 1988. If we write P(Y = 1|LOS = 10 days, Year =1988) as the conditional probability of inappropriate
hospitalisation given LOS = 10 days, in Year 1988, then

log[ P(Y = 1 | LOS = 10 days, Year = 1988) / (1 − P(Y = 1 | LOS = 10 days, Year = 1988)) ] = −0.123,
P(Y = 1 | LOS = 10 days, Year = 1988) = exp(−0.123) / (1 + exp(−0.123)) = 0.469.

Using the same method, we calculated the conditional probabilities for a few selected values of logLOS
between 1988 and 1990 and recorded them in Table 4.4. Notice for 1988, the conditional probabilities do
not change linearly as logLOS goes up. For example, between logLOS = 1 and 0, the risk difference (RD) is
0.751 − 0.469 = 0.282 whereas between 2 and 1, it is 0.912 − 0.751 = 0.161. The relative risk (RR) can also
be found using Table 4.4, by simply taking ratios of the probabilities. For example, for 1988, the RR between
logLOS=1 and logLOS=0 is 0.751/0.469 = 1.6 and between logLOS=2 and logLOS=1 is 0.912/0.751 = 1.21.
Again, RR is not linear in the value of logLOS. Moreover, we can find the RR between logLOS=1 and logLOS=0
to be 0.861/0.644 = 1.34 in 1990, which is not the same as that in 1988. Hence, not only RR is not linear,
it is also dependent on other independent variables in the model. Hence we cannot simply use the estimated
coefficients to derive any general statements about RD or RR. In contrast, log OR is given by the coefficients
and not only is that remains constant for each unit change in the independent variable, it also does not depend
on the values of the other independent variables. For this reason, in a logistic regression analysis, the OR is
the most useful measure of the association between the outcome and the independent variables.

Table 4.4: Conditional probabilities between 1988 and 1990 in logit(Inap) = a + b1 logLOS + b2 Year

                        1988                                    1990
logLOS    logit = −0.123 + 1.228 × logLOS      logit = −0.123 + 1.228 × logLOS + 0.715
0                   0.469∗                                0.644∗
1                   0.751                                 0.861
2                   0.912                                 0.955
3                   0.972                                 0.986

Note: ∗ probability = exp(logit)/(1 + exp(logit))

4.2 Evaluation of the model

How effective are the models in Table 4.3 in explaining the relationship between the outcome and the
independent variable(s)? This question can be addressed by examining the goodness-of-fit of


the model. There are several ways to examine the goodness-of-fit. First, an overall assessment of the model
(the relationship between all of the independent variables, taken together, and the outcome variable) is carried out. Second, the significance
of each of the independent variables needs to be assessed. Third, the predictive accuracy or discriminating
ability of the model needs to be evaluated, and finally, the model needs to be validated.

4.2.1 Overall assessment of the model

The overall assessment of a model determines the strength of the relationship between all of the independent
variables, taken together, and the outcome variable. A way to answer this question is to compare the fit
of the model with all the independent variables to that of a model with only the intercept (also called the null
model). A null model is a good baseline comparison because it contains no independent variables. Under
the null model, all observations would receive the same probability of the outcome. Using a model with
independent variables, outcome probabilities will depend on the values of the independent variables. The
overall assessment is therefore to evaluate the improvement using outcome probabilities from the model with
independent variables over those using the null model. This situation is illustrated in Table 4.5, which shows
the predicted probabilities of Inap = 1+ (Y = 1) between the null model and model (3), on the first 10
admissions. Column 2 of the table shows 0.448 as the predicted probability of Y = 1 using the null model;
the actual outcomes are in Column 3. Using model (3), additional information based on logLOS and Year
is incorporated into the predicted probability, given in column 6 of the table. The predicted probabilities
in column 6 are all different because the admission characteristics can be different. The actual outcomes are
shown again in column 7 for ease of comparison.

Table 4.5: Predicted probabilities between null model and model (3)

                 Null Model                          Model (3)
Admission    P(Y = 1)    Inap      logLOS    Year    P(Y = 1)    Inap

1 0.448 0 0.405 88 0.593 0
2 0.448 1 1.435 88 0.838 1
3 0.448 1 -0.223 88 0.402 1
4 0.448 1 -0.105 88 0.437 1
5 0.448 0 -0.357 88 0.363 0
6 0.448 1 0.000 88 0.469 1
7 0.448 1 -0.223 88 0.402 1
8 0.448 0 -0.223 90 0.579 0
9 0.448 1 0.742 90 0.818 1
10 0.448 0 -0.223 90 0.579 0

The evaluation is based on how likely the data would be observed based on the predicted probabilities
using the null model (column 2) and those using model (3) (column 6). When applied to all n = 1383
admissions in the dataset, we obtain the likelihoods of observing the data using the null model and using
model (3). A test called the likelihood ratio test computes
LR = −2 log [ (likelihood of the data using the null model) / (likelihood of the data using model (3)) ]
   = −2 [log-likelihood of the data using the null model − log-likelihood of the data using model (3)]

and determines whether LR is significantly larger than expected to justify using model (3). Using the data,
the log-likelihoods are −951.126 and −768.392, respectively for the null model and model (3), which gives
a LR of −2[−951.126 − (−768.392)] = 365.468. Notice that the log-likelihood for model (3) is bigger than
the null model. This is always true since model (3) uses more admission information to form the predicted
probabilities and hence it cannot be worse than the null model. The question to ask is how much better
is model (3) compared to the null model in explaining the data. A large value of the LR indicates a good


model compared to the null model. In practice, the test gives a p−value to determine whether a model gives
significantly better fit to the data than the null model. Based on the log-likelihood values above, a likelihood
ratio test gives a p-value of < 0.001, so we conclude model (3) significantly improves upon the null model.
The method of a likelihood ratio test in fact can be used to compare any two models such that one is a
submodel of the other, for example between model (2) and (3) since model (3) is a submodel of model (2).
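
In R, the likelihood ratio test is available through anova() applied to nested glm fits; the sketch below uses the hypothetical los_data and a fit of model (3).

    # Sketch: likelihood ratio test of model (3) against the null model
    fit0 <- glm(Inap ~ 1, data = los_data, family = binomial)                        # null model
    fit3 <- glm(Inap ~ logLOS + factor(year), data = los_data, family = binomial)    # model (3)
    logLik(fit0); logLik(fit3)                  # the two log-likelihoods
    anova(fit0, fit3, test = "Chisq")           # LR statistic and p-value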
In Chapter 3.5, R2 was used as a measure of goodness-of-fit of a linear regression model. For a logistic
regression model, R2 , which is based on squared error deviation between the model and the outcome data
is not appropriate. Instead, goodness-of-fit is measured by the likelihood that a model could have produced
the outcome data. There are three likelihood-based measures, the Akaike Information Criterion (AIC), the
Bayesian Information Criterion (BIC) and the Deviance.
In Chapter 3.5, we introduced the AIC and BIC. To define deviance, we introduce a saturated model as
a model that fits a separate model to each observation in the dataset. In the context of a logistic regression
with a binary outcome Y , at the i-th observation, the saturated model estimates the probability by

p̂_i = Y_i / 1 = Y_i,
where Y_i is the observed outcome and there is only n = 1 observation at that point. By construction, a
saturated model gives perfect fit to the data but it is a useless model because it has too many parameters (as
many parameters as the number of observations). Nevertheless it can be used as a benchmark to evaluate a
proposed model. The Likelihood function for the saturated model has a value of 1 and the log-likelihood has
a value of 0. Deviance is defined as twice the difference between the log-likelihoods of a saturated model and the
proposed model:

Deviance = 2 log[ (Likelihood of saturated model) / (Likelihood of proposed model) ] ≡ −2 log(Likelihood of proposed model).

The deviance is related to the AIC and BIC. All three measures can be written as [-2logL + kp], where
L is the likelihood value, p is the number of variables in the model, and k is a penalising parameter. The
only difference between these measures lies in the value of k they use; 2 for AIC, log(n) for BIC, and from
the derivations above, 0 for Deviance. BIC places a high penalty for increasing the number of variables, AIC
places a moderate penalty and Deviance does not penalise at all. Hence, we expect, for any given model, BIC
favours a model with the fewest number of independent variables, AIC is moderate, and Deviance values “fit"
over simplicity. All three can be used to evaluate a model on its own, or for comparing competing models.
Whichever one of these measures is used, the goal is to find the model that minimises the value of the measure.
We illustrate the idea using the AIC. Table 4.3 shows the AIC values for models (1)-(3), which shows that model
(2) is better than model (1), but model (3), a simplified version of model (2) is the best.

4.2.2 Residuals

There are many types of residuals for a logistic regression analysis. They all reflect the differences between
fitted and observed values, and are the basis of varieties of diagnostic methods.
In a logistic regression, the outcome Y ∼ Bernoulli(p) and the estimate at each observation is p̂ = P̂(Y = 1). From Chapter 3.6, we can define residuals as

êi = Yi − p̂i .

However, since for a Bernoulli variable, Y , its variance is given by p(1 − p), the residuals êi in this context
do not have the same variance across the data. These residuals are therefore not useful. To remedy the
non-constant variance, we define the Pearson residual by

ẽ_i = ê_i / √( p̂_i (1 − p̂_i) ).


A third type of residual is the deviance residual. The deviance residual of an observation is defined in terms of
the difference between the log-likelihoods of the saturated model and the proposed model at that observation:
d̂_i = sign(y_i − p̂_i) √( 2 log[ Likelihood of the saturated model(y_i) / Likelihood of the proposed model(y_i) ] ).

Deviance residuals are often part of the standard output in analyses of logistic regression models.
There is another type of residuals that is useful for logistic regression analysis; we defer the discussion
to Chapter 4.2.4.
Since the outcome Y of a logistic regression is binary and is often assumed to follow a Bernoulli
distribution, none of these residuals follows a normal distribution even when the model is correctly specified.c
Therefore, in the context of a logistic regression, residuals should not be used in a QQ-plot, as was done for the linear regression
models discussed in Chapter 3.
Pearson (or Deviance) residuals can be plotted against the covariates or the predicted values to detect
lack of fit of a model. If a model is correctly specified, the residuals should have no relationship with any
of the covariates or predicted values. Consequently, these plots should exhibit no trends. We illustrate using
the LOS data assuming that model (3) is used to fit the data. Fig.4.2 shows the plot of the Pearson residuals
against each of the covariates and the predicted values. Since year is a factor with only two levels, 1988 and
1990, the residuals are represented by two boxplots. On the other two plots are superimposed LOESS curves.
It is not obvious by examining the LOESS curves that there are trends in these plots.

Figure 4.2: Pearson residual plots for model (3) using LOS data

[Three panels of Pearson residuals for model (3), plotted against los, factor(year) (boxplots for 1988 and 1990) and the linear predictor, with LOESS curves superimposed on the numeric plots; a few observations with large residuals are labelled.]

To further our investigation, a test is computed for each numeric covariate (in this case, only LOS). The
test is a t-test for adding a quadratic term LOS² to the model; the corresponding test statistic
is 31.60 with a p-value < 0.001, giving strong evidence that a quadratic term in LOS should be used. There is
no such test for a factor (year in this case) since squaring a factor gives the same value. Based on the test
results, another model, denoted model (4), is fitted and the residuals are again plotted in Fig. 4.3. The
results now show no evidence of any trend (p = 0.11 for LOS²).
For comparison, we show in Table 4.6, the results of fitting model (4) along with model (3). We observe
the AIC for model (4) is lower. The coefficient for LOS2 in model (4) is highly significant. Henceforth, we use
model (4) for the LOS data.
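In R, the check for a quadratic term might look like the sketch below; m3 is the fitted model (3) from the earlier sketch, and the Wald z-test reported by summary() plays the role of the test for the quadratic term described above.

    m4 <- update(m3, . ~ . + I(los^2))        # model (4): add a quadratic term in los
    summary(m4)$coefficients["I(los^2)", ]    # estimate, SE and test for the quadratic term
    AIC(m3, m4)                               # the model with the smaller AIC is preferred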

c
The exception is when the data can be "binned" into groups of observations with the same covariate values, so that a logistic regression can be carried out on grouped observations.


Figure 4.3: Pearson residual plots for model (4) using LOS data

[Figure: Pearson residuals for model (4) plotted against los, I(los * los), factor(year) and the linear predictor; a few extreme observations are labelled.]

4.2.3 Goodness-of-fit of the model

In a linear regression model, R² provides an overall measure of how well the model explains the observed outcomes. For a logistic
regression model, a counterpart to such an overall assessment is the Hosmer-Lemeshow test. The Hosmer-Lemeshow test is used to
examine whether the observed proportions of events are similar to the predicted probabilities of events using
the model. The Hosmer-Lemeshow test is performed by dividing the predicted probabilities into Q subgroups,
normally deciles (10 groups based on percentile ranks) and then computing a measure that compares the
predicted to the observed frequencies:
H = Σq∈Q [ (Oq0 − Eq0)²/Eq0 + (Oq1 − Eq1)²/Eq1 ],

where Oq j and Eq j denote the observed and expected number of outcomes (Y = j, j = 0, 1) for the q-th
subgroup. The expected number of Y = 1 outcomes for the q-th subgroup can be easily calculated by summing the
predicted probabilities of Y = 1 for all observations in that subgroup, and the expected number of Y = 0 outcomes
is simply the total number of observations in that subgroup minus the expected number of Y = 1 outcomes. We illustrate
using model (4) in the hospitalisation dataset. We subdivide the data into deciles according to the predicted
probabilities of Y = 1, and tabulate the observed and expected number of outcomes in each decile in
Table 4.7. Notice that for each row, the total number of observed outcomes equals the total expected number
of outcomes, as they should.
A large value of H indicates large differences between the observed and the expected number of
outcomes, by subgroups, and hence the model is inadequate to explain the data. The Hosmer-Lemeshow
test takes into consideration the number of subgroups to determine whether the size of H is sufficiently large
to reject the null hypothesis that the model provides reasonable prediction. A rejection of the null hypothesis
calls for a re-examination of the model or the possible inclusion of independent variables that have not been
considered. The test, however, is sensitive to the number of subgroups and the sample size. The test is not

57
Chapter 4. Logistic Regression

Table 4.6: Comparison between model (4) and model (3)

                              Dependent variable: Inap
                                 (3)           (4)
  Intercept                   −0.123         0.101
                              (0.085)       (0.093)
  logLOS                       1.228***      1.028***
                              (0.077)       (0.082)
  logLOS²                                   −0.377***
                                            (0.068)
  Year = 1990                  0.715***      0.712***
                              (0.128)       (0.129)
  Observations                 1,383         1,383
  Log Likelihood             −768.392      −752.570
  Akaike Inf. Crit.          1,542.783     1,513.139
  Note: *p<0.1; **p<0.05; ***p<0.01

suitable for very large sample sizes as it has the tendency to reject reasonable models when the sample size
is large. For the hospitalisation data, when the data are divided into deciles, Table 4.7 shows that the expected
numbers track the general trend of the observed outcomes. The Hosmer-Lemeshow test gives an H value
of 8.7694 with a p-value of 0.36 for the differences seen in Table 4.7, suggesting that the model provides an
adequate fit to the data.
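One possible implementation of the Hosmer-Lemeshow test in R is the hoslem.test() function in the ResourceSelection package; the sketch below is illustrative and assumes the fitted model (4) is stored in the object m4 from the earlier sketches.

    # Sketch, assuming the ResourceSelection package is installed.
    library(ResourceSelection)
    hl <- hoslem.test(m4$y, fitted(m4), g = 10)   # 10 subgroups (deciles)
    hl                                            # H statistic and p-value
    cbind(hl$observed, hl$expected)               # observed vs expected counts by decile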
In addition to an overall assessment of the model, statistical tests of significance can be applied to each
independent variable. For each coefficient, the null hypothesis that the coefficient is zero is tested against the
alternative that the coefficient is not zero. For example, in Table 4.3, model (2) shows that among all the
independent variables in the model, only the coefficients for logLOS and Year are significant. Individual tests
are useful for assessing the contribution of each independent variable in a given model. The individual tests in
model (2) pave the way for the reduced model (3). In contrast, Table 4.6 shows that all coefficients in model (4)
are necessary.

4.2.4 Unusual observations

An outlier is an observation with a response value that is unusual conditional on covariate patterns. For
example, patients with a long LOS are very likely to have experienced one or more episodes of inappropriate
hospitalisation. If a patient with a long LOS did not experience any episode of inappropriate
hospitalisation, then that observation is an outlier.
Leverage measures how far an observation is from the typical covariate patterns of the other
observations. Observations that are far from the average covariate pattern are considered to have high
leverage. For example, suppose most of the surgical ward patients tend to have a long LOS. Then a surgical
patient with a short LOS is a high leverage point. Since leverage concerns only the covariate
patterns of the observations, it does not depend on the outcome of interest.
In a set of data with p covariates X 1 , ..., X p , we can use matrix and vector notations X and xi to represent,
respectively, X 1 , ..., X p for the whole dataset and the i-th observation. The leverage of the i-th observation is
given by
hi = xi (XᵀX)⁻¹ xiᵀ.


Table 4.7: Observed and expected number of outcomes using model (4) by deciles

                               Y = 0                  Y = 1
  Predicted probability   Observed  Expected    Observed  Expected
  [0.0138, 0.0737]           210      208.3          9       10.6
  (0.0737, 0.139]             69       63.6          5       10.3
  (0.139, 0.274]              94       99.1         34       28.8
  (0.274, 0.422]              95      100.8         68       62.1
  (0.422, 0.497]              59       59.1         55       54.8
  (0.497, 0.585]              56       59.3         77       73.6
  (0.585, 0.637]              75       66.9         99      107.0
  (0.637, 0.674]              39       35.2         65       68.7
  (0.674, 0.742]              39       43.6        107      102.3
  (0.742, 0.82]               27       26.5        101      101.4

The leverage hi is sometimes called a hat-value. For any set of data, irrespective of the covariates, hat-values
are always bounded between 1/n and 1, and their sum is always equal to the number of coefficients in
the model including the intercept, i.e., p + 1, so that their average is (p + 1)/n. An observation with a value of hi much
higher than the average is considered a high leverage point.
Following the fit of a model to the data, outliers can be identified by Studentised residuals, which are
the residuals êi scaled by the leverage and by the standard deviation of the residuals, SD(ê)−i, obtained from fitting the
model with the i-th observation removed^d

êi / ( SD(ê)−i √(1 − hi) ).

Studentised residuals follow a t-distribution. In most studies n is rather large, in which case the t-distribution
behaves very similarly to a standard Normal distribution. Since a Normal variable rarely exceeds 3 in absolute
value, an observation with a Studentised residual of more than 3 in absolute value is often
considered an outlier.
Figure 4.4 illustrates an outlier and a point with high leverage.

Figure 4.4: Outlier (left panel) and high leverage point (right panel)

[Figure: left panel, a scatter plot of Y against X containing an outlier; right panel, a scatter plot containing a high leverage point.]

Outliers and high leverage points are unusual but they are not of particular concern unless they have
an undue influence on the model and hence on the inferences drawn from it. An influential point
d
It makes sense to remove the i-th observation since we are evaluating the distance of the observation from the rest of the data


is an observation that is both an outlier and a high leverage point. Removal of an influential point causes
a substantial change in the coefficient estimates. Figure 4.5 illustrates that an outlier with low leverage, or
a high leverage point that is not an outlier, does not lead to a substantial change in the model; hence inference
is unlikely to change much with or without these observations. In contrast, removal (or inclusion) of an
influential point causes a big shift in the model and consequently has a big influence on the conclusions of
a study. An influential point can arise for a multitude of reasons; it may be a recording error,
or it may be a truly unusual case. Irrespective of the reason, when influential points are found, they need to be
carefully examined.

Figure 4.5: Comparison of an outlier and high leverage point but not influential to an influential point

[Figure: three scatter plots of Y against X comparing the fit for the original data with the fit after removing the unusual observation: an outlier with low leverage (small influence), a high leverage point that is not an outlier (small influence), and an outlier with high leverage (large influence).]

A plot of hat values against the Studentised residuals can help to detect influential observations. The
plot for the LOS data using model (4) is shown in Fig. 4.6. For model (4), n = 1383 and p = 3, so the average hat
value is (p + 1)/n = 4/1383 = 0.00289. There is no evidence that any observation has both a hat value much bigger
than 0.00289 and a large Studentised residual.
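Hat values and Studentised residuals can be extracted directly from a fitted model object in R; the sketch below is illustrative (rstudent() returns an approximation to the leave-one-out Studentised residual for GLMs).

    h  <- hatvalues(m4)                  # leverages (hat values)
    sr <- rstudent(m4)                   # (approximate) Studentised residuals

    n <- nrow(model.frame(m4))
    avg_hat <- length(coef(m4)) / n      # (p + 1)/n, the average hat value

    plot(h, sr, xlab = "Hat value", ylab = "Studentised residual")
    abline(v = 2 * avg_hat, lty = 2)     # a common rule of thumb for "high" leverage
    abline(h = c(-3, 3), lty = 2)        # flag potential outliers
    which(h > 2 * avg_hat & abs(sr) > 3) # observations that are both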

4.3 Predictive accuracy and discrimination

The outcome of interest in a logistic regression problem is binary (Y = 0 or 1). In contrast, a logistic regression
model gives as its output an estimated log-odds or probability of Y = 1. If the interest is to predict the outcome,
then the output from a logistic regression model needs to be translated into a binary predicted outcome. If
we use the predicted probability of a logistic regression model, then we need to specify a threshold above
which we call the prediction Y = 1 and below which Y = 0. A natural threshold would be 0.5, such that if
the predicted probability from the logistic regression model is above 0.5, the outcome is predicted to be 1,
otherwise, 0. The classification table (Table 4.8) shows the results of using model (4) in the hospitalisation
data.

Table 4.8: Classification table of observed and predicted outcome using model (4) based on a threshold of 0.5

Predicted
Observed 0 1
0 527 236
1 171 449


Figure 4.6: Plot of hat values against Studentised residuals for the LOS data

[Figure: scatter plot of Studentised residuals (−2 to 3) against hat values (0.00 to 0.04) for model (4).]

The table shows that out of all the n = 763 observed Y = 0, 527 are correctly classified as Y = 0 using
model (4). Similarly, out of n = 620 observed Y = 1, 449 have been correctly classified as Y = 1 using model
(4). For any logistic regression model and any particular threshold, we can form similar classification tables
that look like Table 4.9, where a, b, c and d are the numbers of observations in the corresponding cells.

Table 4.9: Typical classification table of observed and predicted outcome

Predicted
Observed 0 1
0 a b
1 c d

If the logistic regression model has a good fit, we expect to see many counts in the a and d cells, and few
in the b and c cells. We define the following measures of predictive accuracy for a logistic regression model. The
first is called sensitivity, the probability that the model prediction is 1 when the observed outcome is
1. According to Table 4.9, sensitivity = d/(c + d) = 449/(171 + 449) = 0.72. The second is called specificity,
the probability that the model prediction is 0 when the outcome is 0. Using the table, specificity
= a/(a + b) = 527/(236 + 527) = 0.69. Higher sensitivity and specificity indicate a better fit of the model.
Very often, a 1 is called a positive outcome and a 0 a negative outcome; accordingly, sensitivity is also called the true
positive rate (TPR) and (1 − specificity) is referred to as the false positive rate (FPR). Ideally a high TPR and
a low FPR are desired, but often it is not possible to achieve both. The balance between the TPR and FPR depends on
the study outcome. For example, if it is important for the model to identify as many positive cases (Y = 1) as
possible, then a high TPR is desired, which may come at the expense of a higher FPR. In contrast, if the cost of a wrongly
classified false positive is high, then we wish to maintain a low FPR. Given any model, the threshold can be
adjusted away from 0.5 to reach either of these goals. A low threshold would give a positive prediction for most
outcomes and hence would increase the TPR; however, this would invariably lead to a high FPR. In contrast,
setting a high threshold would give a negative prediction for most outcomes and hence would lower the FPR,
at the expense of a lower TPR.
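A short R sketch of how a classification table, sensitivity and specificity might be computed at a chosen threshold is given below; object and variable names are assumptions carried over from the earlier sketches.

    p_hat <- fitted(m4)                      # predicted probabilities of Y = 1
    pred  <- ifelse(p_hat > 0.5, 1, 0)       # predicted outcome at threshold 0.5
    tab   <- table(Observed = hosp$inap_bin, Predicted = pred)
    tab                                      # the classification table

    sensitivity <- tab["1", "1"] / sum(tab["1", ])   # d/(c + d)
    specificity <- tab["0", "0"] / sum(tab["0", ])   # a/(a + b)
    c(sensitivity = sensitivity, specificity = specificity)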
Extending the above two-by-two idea (Table 4.9), rather than selecting a single threshold, the full range
of threshold values from 0 to 1 can be examined. For each possible threshold value, a two-by-two table can
be formed. Plotting the pairs of sensitivity and one minus specificity on a scatter plot provides a Receiver
Operating Characteristic (ROC) curve. The area under this curve (AUC) provides an overall measure of the


model fit. If the AUC =0.5, then the model is no better than guessing. In general, the AUC varies from 0.5 (no
predictive ability) to 1.0 (perfect predictive ability). Larger AUC indicates better predictability of the model.
Points above the diagonal dividing the ROC space represent good classification results (better than random),
while points below represent poor results (worse than random). Using model (4), we plotted the ROC curve
in Fig. 4.7. The model gives an AUC = 0.784, which shows the model is reasonably good. On the figure, we
mark three threshold values with the corresponding specificity and sensitivity in parentheses. For example, if
we set a low threshold of 0.25, then the FPR is 1 − 0.448 = 0.552 and the TPR is 0.945; if we set a high threshold of
0.75, then many of the outcomes would be classified as negative, so the FPR is much lower, 1 − 0.965 = 0.035,
but the TPR drops to 0.163.
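The ROC curve and AUC can be obtained with, for example, the pROC package; the sketch below is illustrative, and the coords() helper reports the specificity and sensitivity at chosen thresholds.

    # Sketch, assuming the pROC package is installed.
    library(pROC)
    roc_m4 <- roc(response = hosp$inap_bin, predictor = fitted(m4))
    auc(roc_m4)                                              # area under the ROC curve
    plot(roc_m4)                                             # sensitivity against specificity
    coords(roc_m4, x = c(0.25, 0.50, 0.75), input = "threshold")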

Figure 4.7: ROC curve for predicting Inap using model (4)
[Figure: ROC curve (sensitivity against specificity) for model (4); AUC = 0.784. Marked thresholds: 0.250 (specificity 0.448, sensitivity 0.945), 0.500 (0.691, 0.724), 0.750 (0.965, 0.163).]

5
Regression Models for Count Outcome

In Chapter 4, we studied inappropriate hospitalisation during LOS using a dichotomised outcome Inap,
coded 0 for no days and 1 for one or more days of inappropriate hospitalisation. The original data
recorded inappropriate hospitalisation (Inap) days as a count. Count data consist of non-negative integers
that represent the number of times an event is observed. A count outcome is the result of a cumulative process
in which events are summed to produce the observed data. The process can occur over time, e.g., number of
inappropriate hospitalisation days or number of visits to a hospital, or over space, e.g., number of cancers in
an area. Count data have unique properties that lead to a number of analytic challenges including: (1) a large
and perhaps disproportionate number of zero values; (2) a relatively high frequency of small integer values;
and (3) variance that is not independent of the mean. Due to these properties, Ordinary least squares (OLS)
regression is often inappropriate because count data violate the underlying assumptions of OLS regression:
normality and constant variance.

5.1 Poisson regression

As the benchmark model for count data, the Poisson distribution models the probability of the number of
occurrences, Y, of an event within a given time interval or space. The assumption is that the rate
(mean number of occurrences), µ ≡ E(Y), is constant for any fixed unit of time or space. Poisson
regression is a model relating µ to some independent variables. For example, we might be interested in
studying how the number of inappropriate hospitalisation days is related to the age of a patient. A Poisson regression
shares many similarities with OLS; however, there are some key differences. Since counts are non-negative
integers, the mean of Y must also be non-negative; hence, we cannot specify a model
such as (3.4), since it allows the possibility of negative values for the mean function. To resolve this problem,
in a Poisson regression with p independent variables X1, ..., Xp, we assume

log(µ) = g(X 1 , ..., X p ),

where g is an arbitrary function of X 1 , ..., X p . For ease of exposition and discussion, in the following, we
consider the simple case that g is a linear model of the independent variables,

log(µ) = a + b1 X 1 + b2 X 2 + ... + b p X p , (5.1)

so that

µ = exp(a + b1 X 1 + b2 X 2 + ... + b p X p ) = exp(a) × exp(b1 X 1 ) × exp(b2 X 2 ) × ... × exp(b p X p ), (5.2)

is always non-negative. As illustrated in (5.2), a linear model for log(µ) leads to a multiplicative model for
µ. The inclusion of each independent variable Xj multiplies the rate µ by a factor exp(bj Xj). In
addition, each unit change in Xj corresponds to a relative effect of exp(bj) on µ. If bj is positive, then
exp(bj) > 1 and µ increases with a unit increase in Xj; if bj is negative, then exp(bj) < 1 and µ
decreases with increasing Xj. Since exp(bj) has a multiplicative effect on µ, it has the same interpretation as
a relative risk or odds ratio (cf. Chapter 4). Hence in a Poisson regression, Y is assumed to follow a Poisson
distribution with mean µ given by (5.2). As in an OLS regression, the observed outcome Y may differ from µ
by a random error, but the randomness is governed by the Poisson distribution rather than the normal distribution.
In a Poisson distribution, Y is assumed to be the total occurrences of an event ascertained over an interval
of time or space. Therefore, the longer the interval, the larger Y is likely to be. In a Poisson
regression, the interval over which the count outcome is measured is called the exposure. Observations with
longer exposure have a greater opportunity to accumulate more events. We illustrate this fact using the
hospitalisation data in Chapter 2. Table 5.1 reproduces the data for the first eight admissions. The count
outcome is Inap, the number of inappropriate hospitalisation days. The exposure is LOS since the outcome
is ascertained during LOS. It would be reasonable to assume that an admission with a longer LOS would have
more opportunity for a higher value of Inap. In fact, since Inap and LOS are both counts that measure days,
it is mathematically impossible for Inap to be higher than LOS. Among the first eight admissions, admission 2 has
a LOS of 42 days; consequently, it is not surprising that its Inap value of 20 is higher than that of
admission 3, which has a LOS of 8 days and a corresponding Inap of 6 days.

Table 5.1: “Exposure" in inappropriate hospitalisation data

  Admission   LOS   Inap   Gender   Ward   Year   Age
      1        15     0       2       2     88     55
      2        42    20       2       1     88     73
      3         8     6       1       1     88     74
      4         9     6       1       2     88     78
      5         7     0       1       2     88     57
      6        10     2       2       2     88     47
      7         8     6       1       2     88     70
      8         8     0       2       2     90     40

When the exposure is identical or similar for all observations in a count data analysis, then simply using
the number of events without considering exposure does not affect the results of the analysis. However,
when the exposure differs considerably across observations, then analysing only the number of events may
misrepresent the data. Observations with greater exposure may have more events recorded than observations
with lesser exposure simply because they have a greater opportunity to observe the event, and not because
they are at higher risk for the event. To ensure all observations are treated equally, instead of modelling the
number of event occurrences, we can model the average number of events per unit interval (time or space).
This requires including in the regression model a value for the count outcome as well as an associated exposure
value. In the hospitalisation data, we can adjust the number of inappropriate hospitalisation days by the LOS
and model the relationship between the rate of inappropriate hospitalisation and the independent variables.
Formally, if Y represents the number of events with mean µ and t is the LOS for an admission, then we can
define λ = µ/t as the expected events per day. Modelling λ rather than µ leads to a more equitable comparison
of the risk of the event of interest across the independent variables. With this change, our model becomes:

log(λ) = log(µ/t) = a + b1 X 1 + b2 X 2 + ... + b p X p , (5.3)

so that

log(µ) = a + b1 X 1 + b2 X 2 + ... + b p X p + log(t), (5.4)

Thus, the only difference between the model accounting for exposure and model (5.1), which does not account
for exposure, is the extra term, log(t), in (5.4). This term is referred to as an offset because it is a known


measurement with no associated coefficient. By including the exposure as an offset, the coefficients in this
equation are interpretable as the effects of the independent variables on the number of events per unit time
rather than on the number of events. This is an important difference and one that may lead to completely
different conclusions about the relationship between the outcome and the independent variables. Simply
carrying out a Poisson regression without considering exposure may produce misleading conclusions.
When using an offset, the assumption is made that doubling the exposure will lead to a doubling of the
expected count outcome. If this assumption is not appropriate, controlling for the exposure as a covariate
instead of an offset may be more appropriate. In the context of the hospitalisation data, the difference between
these two approaches can be expressed in the following two models. Using an offset, we consider

log(µ) = a + b1 X 1 + b2 X 2 + ... + b p X p + 1 × logLOS, (5.5)

whereas using logLOS as a covariate leads to

log(µ) = a + b1 X 1 + b2 X 2 + ... + b p X p + b p+1 × logLOS. (5.6)

Hence the difference lies in the coefficient for the exposure logLOS. Model (5.5) says that controlling for all the
other independent variables, a doubling of LOS leads to a doubling of the expected value of Inap. In contrast
model (5.6) allows the possibility that b p+1 is not 1. If b p+1 > 1, then a higher LOS will have a higher rate of
inappropriate hospitalisation (for example, if longer LOS is a direct result of inappropriate hospitalisation).
If b p+1 < 1, a longer LOS will have fewer inappropriate hospitalisation per day of LOS (perhaps those with
longer LOS are the truly sick patients).
We applied Poisson regression modelling to the hospitalisation data. We considered three different
models. In the first model, we ignored exposure, i.e., logLOS, and only used the remaining independent
variables. In the second model, the exposure logLOS is entered as an offset. In the last model, logLOS is
entered as a covariate in the model. The continuous variables are all adjusted as described in Chapter 4.1.
The results are given in Table 5.2. It is quite clear from the table that the results for model (1) (no exposure
considered) are quite different from those of models (2) and (3). This relates to the point we made earlier that
when exposures are very different between observations, not taking account of exposure may give results that
are quite different from those where exposure is considered. In the current dataset, exposure varies widely
between observations, with a range of 1 to 107 days. For studies where the exposures are similar between
observations, the difference between (5.4) and (5.1) lies only in log(t), which is approximately constant
between observations and hence, ignoring it would not significantly alter the results.
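The three models can be fitted with R's glm() function, as in the hedged sketch below; the data frame hosp and the variable names are assumptions, and the adjustment of the continuous variables described in the text is omitted for brevity.

    hosp$ward   <- factor(hosp$ward)
    hosp$year   <- factor(hosp$year)
    hosp$gender <- factor(hosp$gender)

    # Model (1): exposure ignored
    p1 <- glm(Inap ~ ward + year + gender + age, family = poisson, data = hosp)
    # Model (2): exposure entered as an offset (coefficient fixed at 1)
    p2 <- update(p1, . ~ . + offset(log(LOS)))
    # Model (3): exposure entered as a covariate (coefficient estimated)
    p3 <- update(p1, . ~ . + log(LOS))

    AIC(p1, p2, p3)                 # smaller is better
    anova(p2, p3, test = "Chisq")   # likelihood ratio test: offset vs covariate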
Let us start by interpreting the results using model (1). The fitted model is:

log(µ) = 1.511 − 0.496 × Ward 2 − 1.286 × Ward 3 − 0.257 × Year − 0.218 × Gender + 0.017 × Age.

We interpret exp(1.511) = 4.53 as the expected number of inappropriate hospitalisation days for admissions
with the following characteristics: surgical ward in 1988, male, aged 54. We can interpret exp(1.511 −
0.496 + 0.281) = 3.65 as the expected number of inappropriate hospitalisation days for an admission with the
following characteristics: medical ward in 1988, female, aged 54. As discussed earlier, the coefficients for the
independent variables in a Poisson regression model, when exponentiated, can be interpreted as multipliers of
the event rate. For example, the coefficient estimate for Gender is −0.218, which implies that, relative to males,
the rate of inappropriate hospitalisation for females is lower by a factor of exp(−0.218) = 0.80.
The total LOS for admissions in the surgical ward in 1988 who are male and aged 54 in this
study is 56 days. Hence the expected value of 4.53 days is based on an LOS of 56 days. This figure cannot
be used directly to make statements about the population, since the LOS of this group of admissions in the
population will not be 56 days. Dividing, 4.53/56 = 0.08 gives the rate per LOS day. Then
for an admission of the same type that requires a LOS of 25 days, the expected number of inappropriate
hospitalisation days is 4.53/56 × 25 = 2.02 days. Furthermore, in the study, the total number of observed
inappropriate hospitalisation days among admissions in the surgical ward in 1988 who are male and aged 54 is 14
days, which gives an observed rate per LOS day of 14/56 = 0.25. Using the observed rate, the expected
number of inappropriate hospitalisation days for an LOS of 25 days is 0.25 × 25 = 6.25 days, which is very


Table 5.2: Poisson regression models for inappropriate hospitalisation data. Model (1) no exposure
considered. Model (2) Exposure entered as offset. Model (3) Exposure entered as a covariate. Values in
parentheses are standard errors.

                            Dependent variable: Inap
                           (1)           (2)           (3)
  Intercept              1.511***      1.233***      1.026***
                        (0.028)       (0.026)       (0.030)
  ward=2 (Surgical)     −0.496***     −0.324***     −0.333***
                        (0.032)       (0.031)       (0.031)
  ward=3 (others)       −1.286***     −0.595***     −0.444***
                        (0.142)       (0.142)       (0.143)
  Year=1990             −0.257***     −0.020         0.033
                        (0.031)       (0.031)       (0.031)
  Gender=2 (Female)     −0.218***     −0.083***     −0.036
                        (0.030)       (0.031)       (0.031)
  Age                    0.017***      0.006***      0.004***
                        (0.001)       (0.001)       (0.001)
  logLOS                                             1.299***
                                                    (0.019)
  Observations           1,383         1,383         1,383
  Log Likelihood        −6,256.562    −3,810.471    −3,683.960
  Akaike Inf. Crit.     12,525.120     7,632.941     7,381.920
  Note: *p<0.1; **p<0.05; ***p<0.01


different from the value of 2.02 days predicted by the model. These issues highlight the difficulties and risks
with count regressions that do not account for exposure.
Both models (2) and (3) adjust for exposure. The coefficients for all independent variables other than
logLOS are very similar between models (2) and (3). In model (3), the coefficient estimate for logLOS is 1.299
and a test shows that there is significant evidence (p < 0.01) that the coefficient is different from 0. We can
compare the three models in terms of model fit, using either the likelihood values or the AIC; in the
current context, models (1) and (2) are both submodels of model (3). On either measure, models (2) and (3)
clearly improve upon model (1). Between models
(2) and (3), the log-likelihood values are −3810.471 and −3683.960, respectively. Based on these values, a
likelihood ratio test (Chapter 4.2.1) gives a p-value of < 0.001. Using AIC, model (3) shows a smaller value
than model (2). Hence, on both measures, model (3) is preferred to model (2).
Model (2) is

log(λ) = log( µ / (LOS/10) ) = 1.233 − 0.324 × Ward 2 − 0.595 × Ward 3 − 0.020 × Year − 0.083 × Gender + 0.006 × Age.

Exponentiating, we obtain

λ = µ / (LOS/10) = exp(1.233 − 0.324 × Ward 2 − 0.595 × Ward 3 − 0.020 × Year − 0.083 × Gender + 0.006 × Age).     (5.7)

We can directly apply (5.7) to make inferences for the population. For example, exp(1.233) = 3.43 is the
inappropriate hospitalisation rate per 10 LOS days for medical wards in 1988, male, aged 54. So if the LOS
is 25 days, the expected number of inappropriate hospitalisation days is 3.43 × 25/10 = 8.6 days, which is
much closer to the value of 6.25 days calculated using the observed rate than the prediction given by model (1). By
this measure, model (2) provides a more credible model than model (1).
Model (3) is

log(µ) = 1.026 − 0.333 × Ward 2 − 0.444 × Ward 3 + 0.033 × Year − 0.036 × Gender + 0.004 × Age + 1.299 × logLOS.

Exponentiating, we obtain

µ = exp(1.026 − 0.333 × Ward 2 − 0.444 × Ward 3 + 0.033 × Year − 0.036 × Gender + 0.004 × Age + 1.299 × logLOS).     (5.8)

Comparing (5.7) to (5.8), we observe that LOS does not appear on the right hand side of (5.7), which means
that for model (2), µ/(LOS/10) = λ is constant with respect to LOS. In contrast, LOS appears in (5.8), which means
model (3) allows µ to change with LOS. This is the key difference between models (2) and (3). Using (5.8),
we can also easily make inferences for the population. For comparison, exp(1.026 + 1.299 × log(2.5)) = 9.2
days is the expected number of inappropriate hospitalisation days for medical wards in 1988, male, aged 54, with
LOS = 25, which is slightly higher than that given by model (2). In model (3), the coefficient for logLOS is
1.299 > 1, which means inappropriate hospitalisation goes up with LOS. Since a LOS of 25 days is higher
than the mean LOS of 10 days in the data, model (3) gives a higher prediction than model (2), which assumes
a constant inappropriate hospitalisation rate over time.
As with all statistical estimates, uncertainties due to sampling need to be taken into consideration
when making inferences. Most often, confidence intervals are used for reporting relative rates and
predictions. For example, using model (3), the coefficient for Age is 0.004. This means that for every one-year
increase in age, the expected number of inappropriate hospitalisation days goes up by a factor of exp(0.004) =
1.004. The standard error (SE) of the estimate is 0.001. Based on this, a 95% confidence interval is
(exp(0.004 − 1.96 × 0.001), exp(0.004 + 1.96 × 0.001)) = (1.002, 1.006), so we predict a relative rate of


at least 1.002, but possibly as much as 1.006, with 95% confidence. In contrast, for Gender=2,
the coefficient estimate is −0.036 with an SE of 0.031, which translates into a 95% confidence interval of
exp(−0.036 ± 1.96 × 0.031) = (0.91, 1.03). Since the interval straddles 1, we cannot say with
95% confidence that the relative rate is different from 1, i.e., that females have a different rate from males. This
result is consistent with the non-significant test result (p > 0.1). If we wish to find confidence intervals for
the expected count estimates, we can use the same method as we did for relative rates. Earlier, we gave an
estimate of exp(1.026 + 1.299 × log(2.5)) = 9.2 days for the expected number of inappropriate hospitalisation
days for medical wards in 1988, male, aged 54, with LOS = 25; a 95% confidence interval would be of the form

exp( [1.026 + 1.299 × log(2.5)] ± 1.96 SE[1.026 + 1.299 × log(2.5)] ) = exp(2.21 ± 1.96 × 0.0263) = (8.7, 9.7).     (5.9)

The confidence interval in (5.9) requires the SE of a linear combination of the coefficient estimates, which can be generated using most statistical programs, such as R.
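In R, for example, such an interval might be obtained from the standard error of the linear predictor via predict() with se.fit = TRUE; the covariate values below are purely illustrative stand-ins for the admission profile discussed in the text.

    newdat <- data.frame(ward   = factor(1,  levels = levels(hosp$ward)),
                         year   = factor(88, levels = levels(hosp$year)),
                         gender = factor(1,  levels = levels(hosp$gender)),
                         age = 54, LOS = 25)
    lp <- predict(p3, newdata = newdat, type = "link", se.fit = TRUE)
    exp(lp$fit)                                  # point estimate of the expected count
    exp(lp$fit + c(-1.96, 1.96) * lp$se.fit)     # 95% confidence interval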

5.2 Negative Binomial regression

One property of a Poisson distribution is that its mean is equal to its variance.
Often in count data analysis, the data exhibit over-dispersion, a situation in which the variance
exceeds the mean. We illustrate over-dispersion using the count outcome in the hospitalisation
data. Based on the data, we can easily find the sample mean of Inap to be 3.24 days. Fig. 5.1 shows the
histogram of Inap using the n = 1383 observations and on the same figure, we superimposed a histogram
of the same number of observations based on a Poisson distribution, with mean = 3.24 (same as the sample
mean of Inap). The figure shows (1) the histogram for the Inap data extends over a much wider range than the
histogram for the Poisson distribution, and (2) the number of zeros for the Inap data is much higher than those
of the Poisson distribution. The former illustrates over-dispersion while the latter is called zero-inflation. In
this section, we will study over-dispersion. Zero-inflation will be considered in a later section.
The only difference between a Poisson regression and a negative Binomial regression is in the
distribution of the outcome. In a Poisson regression, the outcome is assumed to follow a Poisson distribution
with mean µ that depends on the independent variables. In a negative Binomial regression, the outcome is
assumed to follow a negative Binomial distribution with mean µ that depends on the independent variables.
In a Poisson distribution, the mean is the same as the variance; in contrast, a negative Binomial distribution
with mean µ allows a variance of µ + µ²/θ, where θ is called a shape parameter. When 1/θ = 0 (equivalently, θ = ∞),
the negative Binomial distribution becomes a Poisson distribution; otherwise, they are different. In this
sense, a negative Binomial distribution is a more flexible distribution since it includes as a special case the
Poisson distribution. In practice, θ is not known but can be estimated from the data, along with µ.
To accommodate over-dispersion, a negative Binomial regression model can be considered. A negative
Binomial regression approaches a Poisson regression as over-dispersion declines; the main difference between a
Poisson regression and a negative Binomial regression lies in their variances. A consequence of over-dispersion
in Poisson regression is underestimation of the standard errors; the standard errors from a negative Binomial
regression tend to be larger and more accurate. The fact that the negative Binomial regression contains the Poisson regression
as a special case makes it possible to carry out a model comparison between them. Empirically, negative Binomial regression
gives more accurate estimates than Poisson regression in most cases.
The question of exposure considered in Chapter 5.1 also applies to negative Binomial regressions.
Basically the same general rules about exposure in a Poisson regression carry over to a negative Binomial
regression. To illustrate our point, we fitted the hospitalisation data using two negative Binomial regressions,
one with logLOS as offset and another using logLOS as a covariate. The results are given in Table 5.3. In the
table, we also included model (3) from Table 5.2 for comparison. We can compare between these models in
terms of model fit. In this table, model (4) is a submodel of model (5); however, neither model (4) nor model (5)
is a submodel of model (3), nor vice versa. We can compare models (4) and (5) using either the likelihood
values or the AIC. Between models (4) and (5), the log-likelihood values are −2426.646 and −2393.516,
respectively. Based on these values, a likelihood ratio test gives a p-value of < 0.001. Using AIC, model (5)


Figure 5.1: Histogram of Inap data vs a Poisson distribution with the same mean

[Figure: overlaid histograms (frequency against Y, 0 to 80) of the observed Inap data and of a Poisson sample with the same mean.]

shows a smaller value than model (4). Hence, on both measures, model (5) is preferred to model (4). We can
compare the merits of a negative Binomial regression to a Poisson regression. Since neither model (5) nor
model (3) is a submodel of the other, we used the AIC to compare them. The AIC value of 4801.032 for model
(5) is clearly much smaller than 7381.920 for model (3). This result shows the negative Binomial regression
provides a much better fit for this set of data than a Poisson regression. In fact, the bottom of Table 5.3 shows
estimated values of θ of 0.548 and 0.586, based on models (4) and (5), respectively. Both estimates are far from
the Poisson special case (1/θ = 0, i.e., θ = ∞), confirming substantial over-dispersion and a departure from the
Poisson regression model.
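Negative Binomial regressions of this kind can be fitted with glm.nb() from the MASS package; the sketch below is illustrative and reuses the objects from the earlier Poisson sketch.

    # Sketch, assuming the MASS package is available.
    library(MASS)
    # Model (4): exposure as an offset
    nb4 <- glm.nb(Inap ~ ward + year + gender + age + offset(log(LOS)), data = hosp)
    # Model (5): exposure as a covariate
    nb5 <- glm.nb(Inap ~ ward + year + gender + age + log(LOS), data = hosp)

    c(theta4 = nb4$theta, theta5 = nb5$theta)   # estimated shape parameters
    AIC(nb4, nb5, p3)                           # compare with the Poisson model (3)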

5.3 Hurdle regression

In many healthcare count data, there is an excess number of zero counts such that the data do not follow a
Poisson distribution. For example, if we are interested in the demand for medical care in a community, then
most individuals who are not sick will have a zero count. For those who are sick, the number of treatment
visits will be determined by the condition and treatment regime. In this example, it is assumed that the
initial event (whether someone gets sick) and later events (how many times a sick person needs treatment)
are generated by different processes. The hurdle model is designed to deal with count data generated from
such systematically different processes. A hurdle model analysis consists of two parts. The first part models
the probability of a count of zero for an observation. This can be carried out using a logistic regression. If
the observation has a positive value, the second part models the positive count using Poisson regression or
negative Binomial regression. The difference between Poisson hurdle regression and negative Binomial hurdle
regression is the same as that between Poisson regression and negative Binomial regression, which means that
negative Binomial hurdle regression can better handle over-dispersion of count data.
Fig. 5.2 shows that for the hospitalisation data, n = 763 out of the n = 1383 admissions had zero days


Table 5.3: negative Binomial regression models for inappropriate hospitalisation data. Model (4) negative
Binomial regression with exposure as offset. Model (5) negative Binomial regression with exposure as a
covariate. Model (3) Poisson regression with exposure as a covariate. Values in parentheses are standard
errors.

                              Dependent variable: Inap
                         negative Binomial            Poisson
                           (4)           (5)            (3)
  Intercept              1.085***      0.992***       1.026***
                        (0.084)       (0.085)        (0.030)
  ward=2 (Surgical)     −0.445***     −0.371***      −0.333***
                        (0.092)       (0.092)        (0.031)
  ward=3 (others)       −0.802***     −0.641**       −0.444***
                        (0.299)       (0.316)        (0.143)
  Year=1990              0.021         0.136          0.033
                        (0.089)       (0.090)        (0.031)
  Gender=2 (Female)     −0.102        −0.039         −0.036
                        (0.089)       (0.090)        (0.031)
  Age                    0.008***      0.003          0.004***
                        (0.002)       (0.002)        (0.001)
  logLOS                               1.456***       1.299***
                                      (0.056)        (0.019)
  Observations           1,383         1,383          1,383
  Log Likelihood        −2,426.646    −2,393.516     −3,683.960
  θ                      0.548*** (0.036)   0.586*** (0.039)
  Akaike Inf. Crit.      4,865.292     4,801.032      7,381.920
  Note: *p<0.1; **p<0.05; ***p<0.01


Figure 5.2: Hurdle model of Inap data with n = 763 structural zeros and n = 1383 − 763 = 620 truncated
count (Poisson or negative Binomial) observations

[Figure: histogram of Inap (days, 0 to 80); the zero counts form the structural zeros and the positive counts follow a truncated Poisson or truncated negative Binomial distribution.]

of inappropriate hospitalisation. We observed in Chapter 5.2 that there is evidence that the data, when
all the zero counts are included, do not fit a Poisson distribution. A hurdle model assumes that all zero
data come from one "structural" source. The positive (i.e., non-zero) data follow either a Poisson or
negative Binomial distribution truncated at 0. In the current context, the hurdle model assumes that there
are two subgroups of admissions: one group of n = 763 (possibly due to their characteristics or risk factors)
is not at risk of inappropriate hospitalisation and will not have any positive counts of inappropriate
hospitalisation days. The other group of n = 1383 − 763 = 620 patients will have some positive (non-zero)
days of inappropriate hospitalisation. Hence the zero observations can come from only one structural source,
the not-at-risk admissions. In contrast, a patient who falls in the other group has no chance of zero
inappropriate hospitalisation days.
We applied the hurdle model to the hospitalisation data. As discussed earlier, the hurdle model carries
out estimation in two parts. In the first part, a logistic regression is carried out to model the logit of the
non-zero outcome (Inap = 1+). We used all the independent variables in the dataset, i.e., Ward, Gender, Age,
Year and logLOS, in this part. This part uses all the n = 1383 observations. The second part is a Poisson regression
using the positive counts only, i.e., n = 620 observations. In this part, we assumed data follow a Poisson
distribution. We used logLOS as offset in this Poisson regression. Notice that the independent variables used
in the two parts need not be the same. Table 5.4 gives the results of the hurdle Poisson regression model. The
first column of the results gives the coefficient estimates for the logistic model; the second column gives the
coefficient estimates for the Poisson regression model. Notice that the results for the logistic part are identical
to those of model (2) of Table 4.3, as they should be, since both use the same data and the same independent variables
in a logistic regression model. The Poisson regression (hurdle part) shows all coefficients to be significant. We
can compare this Poisson hurdle model with a standard Poisson regression that does not consider the "excess"
zeros. From Table 5.2, the Poisson regression with exposure as offset (model (2)) has an AIC = 7632.941, which is considerably
higher than the AIC = 4978.498 for the Poisson hurdle model, hence we can conclude that the Poisson hurdle


model is superior to a Poisson regression model that does not take the excess zeros into account. Notice
that the Poisson regression model is not a submodel of the Poisson hurdle model. As such, the models cannot
be compared using the likelihood values or via a likelihood ratio test.^a

Table 5.4: Hurdle model for inappropriate hospitalisation data. Positive counts are modelled using Poisson
regression with exposure as offset. Values in parentheses are standard errors.

                        Dependent variable: Inap
                       Logistic part    Hurdle part
  Intercept              −0.084           1.722***
                         (0.123)         (0.026)
  ward=2 (Surgical)      −0.008          −0.308***
                         (0.134)         (0.032)
  ward=3 (others)        −0.388          −0.463***
                         (0.403)         (0.150)
  Year=1990               0.709***       −0.214***
                         (0.129)         (0.032)
  Gender=2 (Female)      −0.117           0.022
                         (0.126)         (0.032)
  Age                     0.005           0.004***
                         (0.003)         (0.001)
  logLOS                  1.186***
                         (0.082)
  Observations            1,383           1,383
  Log Likelihood         −766.656        −2,474.544
  Akaike Inf. Crit.       1,547.311
  Note: *p<0.1; **p<0.05; ***p<0.01

We repeated the hurdle regression of Table 5.4 using a few other formulations. Table 5.5 shows the
results. In Table 5.5, model (6) is simply the hurdle part of Table 5.4. Model (7) also uses a Poisson regression
but includes exposure as a covariate. Models (8) and (9) consider a negative Binomial distribution for the
positive count data, with the former using an offset for exposure and the latter a covariate for exposure.
All models used the same logistic part as Table 5.4, so the results are not repeated. Table 5.5 shows that the
coefficients for all independent variables are very similar between the four models. Both models (7) and (9)
show a significant coefficient for the exposure logLOS. A likelihood ratio test between models (6) and (7)
gives a p-value of < 0.001. The AIC value of 4967.892 for model (7) is also smaller than that for model (6)
(AIC = 4978.498). Hence model (7) is the preferred model. Between models (8) and (9), the likelihood ratio
test p-value is 0.002; furthermore, the AIC value of model (9) is smaller, suggesting model (9) is better than
model (8). To compare the Poisson and negative Binomial formulations, we use the AIC. Model (9) has
a
Sometimes a test called the Vuong test is used. However, it has been found to be inappropriate for testing excess zeros and
produce misleading conclusions, see for example:
Desmarais Bruce A., Harden Jeffrey J. (2013) Testing for Zero Inflation in Count Models: Bias Correction for the Vuong Test. Stata
Journal, 13, 4, 810-835
Wilson P. (2015) The misuse of the Vuong test for non-nested models to test for zero-inflation. Economics Letters, 127, 51-53


a smaller AIC value than model (7), so model (9) is considered to have a better fit.
Using model (9), we notice that its model coefficients are quite different from those of the original Poisson
model (model (3)). In particular, the coefficient for the intercept in model (9) is 1.67, which is considerably
bigger than that of 1.026 in model (3). Based on model (9), the mean for the outcome is exp(1.67) = 5.32
days, compared to exp(1.026) = 2.79 days for model (3). The interpretations of these two means are different.
For model (9), the mean of 5.32 days refers to the mean of the data truncated at zero, whereas for model (3),
the mean of 2.79 days includes the 763 structural zeros. Practically, it makes sense to restrict the calculation of
mean days to only those who experienced inappropriate hospitalisation. It is possible to use results from the
hurdle model to work out the mean including structural zeros;^b this works out to be 2.69 days. The coefficient
for Year in model (9) is significant, compared to the non-significant results in Table 5.3.
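Hurdle models of this type can be fitted with the hurdle() function in the pscl package; in the sketch below (illustrative names, reusing the earlier data frame), the part of the formula before the bar specifies the truncated count model and the part after it the zero (logistic) model.

    # Sketch, assuming the pscl package is installed.
    library(pscl)
    h9 <- hurdle(Inap ~ ward + year + gender + age + log(LOS) |
                   ward + year + gender + age + log(LOS),
                 data = hosp, dist = "negbin")   # cf. model (9)
    summary(h9)                                  # count-part and zero-part coefficients
    AIC(h9)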

5.4 Zero-inflated regression

Figure 5.3: Zero inflated model of Inap data with a mixture of structural zeros and Poisson or negative Binomial
count observations
[Figure: histogram of Inap (days, 0 to 80); the zero counts are a mixture of structural and random zeros, and the remaining counts follow a Poisson or negative Binomial distribution.]

A zero-inflated regression model also assumes a high frequency of zero counts may be due to different
underlying processes. Consider the case where the outcome is the number of cigarettes smoked during a
specified time interval. Some have zero counts because they are non-smokers, while others are smokers, and
b
Assuming a hurdle model with a Poisson regression for the positive counts, the mean of Inap (days) from a hurdle model is given by

E(Inap days) = P(Inap = 1+) × E(Inap days | Inap days > 0) = [ P(Inap = 1+) / P_Pois(Inap = 1+) ] × E_Pois(Inap days, Inap days > 0),

where P(Inap = 1+) in the numerator is obtained from the Binomial (logistic) part and the quantities marked "Pois" from the Poisson part.

Cameron A.C. and Trivedi P.K. (2013). Regression Analysis of Count Data. Cambridge University Press, Cambridge


Table 5.5: Hurdle models for inappropriate hospitalisation data. Model (6) Poisson regression with exposure
as offset. Model (7) Poisson regression with exposure as a covariate. Model (8) negative Binomial regression
with exposure as offset. Model (9) negative Binomial regression with exposure as a covariate. Values in parentheses are standard
errors.

                                 Dependent variable: Inap
                              Poisson                  negative Binomial
                          (6)          (7)           (8)           (9)
  Intercept             1.722***     1.666***      1.721***      1.670***
                       (0.026)      (0.032)       (0.052)       (0.056)
  ward=2 (Surgical)    −0.308***    −0.312***     −0.373***     −0.379***
                       (0.032)      (0.032)       (0.058)       (0.059)
  ward=3 (others)      −0.463***    −0.446***     −0.513**      −0.510**
                       (0.150)      (0.151)       (0.223)       (0.225)
  Year=1990            −0.214***    −0.200***     −0.224***     −0.205***
                       (0.032)      (0.032)       (0.057)       (0.058)
  Gender=2 (Female)     0.022        0.030         0.004         0.014
                       (0.032)      (0.032)       (0.057)       (0.058)
  Age                   0.004***     0.004***      0.003**       0.002
                       (0.001)      (0.001)       (0.002)       (0.002)
  logLOS                             1.065***                    1.110***
                                    (0.022)                     (0.043)
  Observations          1,383        1,383         1,383         1,383
  Log Likelihood       −2,477.249   −2,469.946    −2,271.768    −2,265.584
  AIC                   4978.498     4967.892      4569.537      4561.168
  Note: *p<0.1; **p<0.05; ***p<0.01


among them, some will have a zero count because they did not smoke during the time interval. Still other
smokers may have smoked, thus having a count of one or more. In this example, non-smokers are certain to
have a count of zero. They are similar to the structural zero group in a hurdle model. Smokers are assumed
to have counts from a Poisson or negative Binomial distribution.
Both zero-inflated and hurdle models allow a high frequency of zeros in the observed data but they
differ in the way the zeros are interpreted and analysed. A zero-inflated model assumes that the zeros are
of two different types: “structural" (those that cannot be positive) and “random" (those that are zero due
to sampling but could have been non-zero). In contrast, a hurdle model assumes all zeros are structural.
We illustrate a zero-inflated model using the hospitalisation data. Fig. 5.3 shows a zero-inflated model with
the zero observations split into structural zeros (grey portion) and random zeros (red portion).
The model assumes that the red zeros, along with the non-zero observations, all come from the same Poisson
or negative Binomial distribution. In this sense, the red zeros are simply zeros that occurred by chance, hence
"random".
For a hurdle model, the zeros are structural, so that the analysis breaks up into two parts: logistic and
hurdle. In contrast, the zeros in a zero-inflated model are a "mixture" of structural and random zeros, the latter
being observations from the same distribution as the non-zero data. As a result, the analysis cannot be broken
up into two parts. In a zero-inflated model, all the data, the zeros and the non-zeros, are analysed together in
one step, using the method of maximum likelihood. We also used four models here, except that we replaced
the hurdle part with a zero-inflation part. As in the case of a hurdle model, the independent variables for the
model of the structural zeros do not need to be identical to those used for the random zeros and positive
counts.
The results of the analyses are given in Table 5.6. The models give similar results in the coefficient
estimates. Model (10) is a submodel of model (11), and model (12) is a submodel of model (13). The
likelihood ratio test between models (10) and (11) shows model (11) to be significantly better (p < 0.001),
and that between models (12) and (13) shows model (13) to be better (p = 0.002). We can also use the AIC to
compare the four models. The AIC values for the negative Binomial formulations are much smaller than
those for the Poisson formulations. We conclude that, for this set of data, the negative Binomial is more suitable for handling the over-
dispersion in the data, over and above the excess zeros. In fact, the AIC value of 4536.698 for model (13) is the
smallest among all the 12 models considered. Hence, by that measure, model (13) is the preferred model for
this set of data.
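Zero-inflated models can be fitted in much the same way with zeroinfl() from the pscl package; again, a hedged sketch with illustrative names.

    # Sketch, assuming the pscl package is installed.
    library(pscl)
    z13 <- zeroinfl(Inap ~ ward + year + gender + age + log(LOS) |
                      ward + year + gender + age + log(LOS),
                    data = hosp, dist = "negbin")   # cf. model (13)
    summary(z13)
    AIC(h9, z13)    # compare the hurdle and zero-inflated formulations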

5.5 Generalised linear model

The regressions considered in this chapter, along with those in Chapter 3-4 fall within a larger class of models
called generalised linear models (GLMs). A GLM is formulated by specifying:

(1) A distribution of the outcome Y . The choice of the distribution of Y can be from a large family of
distributions that includes the normal, binomial, Poisson, and negative Binomial distributions, among
others.
(2) A function of the independent variables X 1 , ..., X p that represents their relationship with Y . For example,
a linear combination of a + b1 X 1 + ... + b p X p
(3) A link function, g, that defines the mean µ (or expected value) of the outcome Y in terms of the function
in (2).

For example, in Chapter 3, we defined a multiple linear regression


E(Y ) = a + b1 X 1 + ... + b p X p . (5.10)
In (5.10), we assume the distribution of Y to be a normal distribution, where the mean µ of Y is µ = a+ b1 X 1 +
... + b p X p , so that g is an identity function or “identity link". In Chapter 4, we studied logistic regressions of
the form
logitP(Y = 1) = a + b1 X 1 + ... + b p X p . (5.11)


Table 5.6: Zero inflated models for inappropriate hospitalisation data. Model (10) Poisson regression with
exposure as offset. Model (11) Poisson regression with exposure as a covariate. Model (12) negative Binomial
regression with exposure as offset. Model (13) negative Binomial regression with exposure as a covariate. Values in parentheses
are standard errors.

                                 Dependent variable: Inap
                              Poisson                  negative Binomial
                         (10)         (11)          (12)          (13)
  Intercept             1.722***     1.659***      1.721***      1.657***
                       (0.026)      (0.032)       (0.051)       (0.056)
  ward=2 (Surgical)    −0.314***    −0.318***     −0.382***     −0.389***
                       (0.032)      (0.032)       (0.056)       (0.058)
  ward=3 (others)      −0.487***    −0.469***     −0.551**      −0.553**
                       (0.150)      (0.150)       (0.216)       (0.219)
  Year=1990            −0.215***    −0.199***     −0.226***     −0.200***
                       (0.032)      (0.032)       (0.056)       (0.057)
  Gender=2 (Female)     0.021        0.031         0.004         0.017
                       (0.032)      (0.032)       (0.056)       (0.057)
  Age                   0.004***     0.004***      0.004**       0.003*
                       (0.001)      (0.001)       (0.002)       (0.002)
  logLOS                             1.073***                    1.134***
                                    (0.021)                     (0.043)
  Observations          1,383        1,383         1,383         1,383
  Log Likelihood       −2,462.134   −2,456.340    −2,258.563    −2,253.349
  AIC                   4950.267     4940.679      4545.127      4536.698
  Note: *p<0.1; **p<0.05; ***p<0.01


When Y is a binary outcome, P(Y = 1) = µ = E(Y) and hence in (5.11) we assume the distribution of Y to be
a binomial distribution, where the mean µ of Y satisfies logit(µ) = a + b1X1 + ... + bpXp, so that g is a logit function
or “logit link". In this chapter, we considered a number of regression models for counts. For example, in a
Poisson regression, we define
log(µ) = a + b1 X 1 + ... + b p X p , (5.12)
where µ is the expected value of Y and Y follows a Poisson distribution. In this case, g is a log function or
“log link". The remaining models in this chapter can also be formulated as a GLM. GLMs are now standard
analysis methods in all statistical packages, and can be implemented by simple calls to routines.
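In R, for instance, the GLMs above are all fitted with the same glm() call, with the family argument supplying the outcome distribution and the link function; a minimal sketch with illustrative names:

    fit_normal   <- glm(y ~ x1 + x2, family = gaussian(link = "identity"), data = dat)  # linear regression
    fit_logistic <- glm(ybin ~ x1 + x2, family = binomial(link = "logit"), data = dat)  # logistic regression
    fit_poisson  <- glm(ycount ~ x1 + x2 + offset(log(t)),                              # Poisson regression
                        family = poisson(link = "log"), data = dat)
    # Negative Binomial GLMs require an extension, e.g. MASS::glm.nb().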

6
Survival Analysis

In healthcare research, very often an outcome of interest is the time to an event. For example, time to death
or recurrence of a tumour, or discharge from hospital. The distinguishing feature of time to event data is that
at the end of a study, the event will probably not have occurred for all observations. Observations that have
not experienced the event are said to be censored, indicating that the observation period was cut off before the
event occurred. For censored observations, it is not possible to determine whether they would have
experienced the event had more observation time been given.
Censoring may also occur in other ways. Subjects may be lost to follow up during a study. In some studies,
a competing risk may remove a subject from further observation. For example, a patient being followed for
death from cancer may die from heart failure.
In most time to event data, the event of interest is an adverse event, e.g., death, progression of disease,
and analysis of such data is generally called a survival analysis, referring to the survival from the event. In
most survival analyses, subjects are recruited over a period and followed up to a fixed date beyond the end of
recruitment. Thus subjects that are recruited later during a study will be observed for a shorter period than
those recruited first. An important assumption, therefore, is that the survival probability (prognosis) stays the
same throughout the study. We also assume that subjects lost to follow-up have the same prognosis as the
others in the study.
This chapter examines the common statistical techniques employed to analyse survival data. Due to the
presence of censoring, the data are not amenable to the usual method of analysis.

6.1 Survival time

The primary variable in survival analysis is survival time. The term “survival time" is used loosely for the time
period from a starting time point to the occurrence of a certain event. Examples of survival time are: the time
to death following heart transplant, the duration of disability compensation or other insurance claims, and
the time to progress of disease from diagnosis of cancer.
Survival time is a non-negative random variable measuring the time interval from an origin to the
occurrence of a given event. Due to two special features, survival times cannot be handled using standard
statistical methods. First, the exact survival time may be longer than the duration of the study time (or
observation time) and is therefore unknown. For example, in a longitudinal study of a heart transplant
procedure, many patients may still be alive at the end of the follow-up period and, therefore, the exact survival
time is unknown. For someone who is on long term disability benefits, the total duration and therefore cost
to the insurance company or the government may not be exactly known. In many public health studies,
participants may leave the study before it ends and therefore become lost to follow-up. They may die of a


cause unrelated to the disease under study, they may move to a different location, or they may simply refuse to
continue their participation. These incomplete observations are censored. An observation is right censored if the event has not happened by the last time data are collected on the observation. In contrast, when an event-free observation "enters" midway through a study, so that its time origin is not observed, the observation is left censored. Yet others are interval censored, in which the event is known or estimated to occur between two observed times, for example, the development of cancer between two check-ups. These types of data are illustrated in Fig. 6.1, which shows 5 cases in a study. The first case enters the study at the beginning and dies during the study. The survival time in this case is known exactly and is equal to the time of death minus the study start time. The second case enters the study and survives to the end of the study but subsequently dies sometime following the end of the study. This person's survival time is right censored because the survival time is only known to be beyond the study end date. The third case enters the study at the start and survives up to a certain time during the study but is then lost to follow-up. Since it is impossible to ascertain the status of the patient at the point of loss to follow-up, the case is only known to have survived at least as long as the last follow-up and is, hence, right censored. The fourth case enters following the start of the study. Even though the case dies during the study, the total survival time cannot be ascertained because the "start" time (time origin) of this case is unknown. This case is left censored. The last case enters the study at the start and is known to have died between two known times within the study. Since the exact time of death is unknown, the case is interval censored. Most survival analysis methods
focus on right censoring since it occurs far more frequently than other types of censoring.
The second feature is that survival times often follow a skewed distribution. Although transformations
can be applied to make the distribution more symmetrical and “normal-like", it is more desirable to use a
model that applies to the original, untransformed data. Some of the distributions known to be appropriate
for survival times include the exponential, Weibull, lognormal, etc.

Figure 6.1: Time to event in a follow-up study

[Schematic of five cases between "Study starts" and "Study ends": case 1 dies during the study (exact survival time known); cases 2 and 3 are right censored (alive at study end, or lost to follow-up); case 4 is left censored; case 5 is interval censored. Legend: X = died, O = alive, U = loss to follow-up; known and unknown observation periods are distinguished along the time axis.]


6.2 Survival, hazard and cumulative hazard

The distribution of survival times is characterized by three functions: (a) the survival function, (b) the
probability density function, and (c) the hazard function.
Let T denote the survival time. We may assume for all practical purposes that T ∈ (0, ∞). The survival
function, denoted by S(t), is defined as S(t) = P(T > t) = 1 − F (t), where F (t) is the cumulative distribution
function of T . The survival function has the following properties: S(t) is monotonically non-increasing in t,
with S(0) = 1 and S(∞) = 0. The graph of S(t) as a function of t is known as the survival curve. Since the
survival function is non-increasing in t, the survival curve starts at 1 when t = 0 and drops as t increases.
In Chapter 2, we defined probability density function for a random variable. In the context of survival
analysis, the probability density function f (t) can be interpreted as the unconditional failure rate at time T = t,
for any t > 0. Suppose we are interested in survival following heart transplant. Then, we can define T as
the time of death (failure) following transplant. At time t = 0, a patient has just received the transplant. The
probability density function f (t) tells us the death rate at some time t > 0 following the transplant. A function
intimately related to the probability density function is the hazard function, h(t). The hazard function, h(t),
gives the conditional failure rate, given an individual has survived up to time T = t. It is defined as

h(t) = f(t) / S(t)     ("failure at t" / "survive up to t").     (6.1)

The hazard function is also known as the instantaneous failure rate or age-specific mortality rate. It is a
measure of the mortality risk as a function of the age of the individual, in the sense that the quantity is the expected proportion of individuals aged t who will fail in a short interval following T = t. Just like a probability
density, the hazard is not a probability since its value can be greater than one. Associated with each hazard
function h(t), we can define a cumulative hazard H(t) which can be interpreted as the “sum" of the hazards
up to time t, whereas h(t) is the hazard at t. The relation between h(t) and H(t) is similar to that between a
density function, f (t), and the cumulative distribution F (t).
These functions (the survival, density, hazard and cumulative hazard functions) are mathematically equivalent: if one of them is known, the others can be derived. For practical purposes, the survival function is
most useful because it directly gives the median survival time, a commonly used landmark in survival analysis,
and other summary statistics.
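As a small numerical check of these relationships, the sketch below uses the exponential distribution (introduced in Chapter 2) with rate 0.1; the rate is arbitrary and chosen only for illustration. In particular, for a continuous survival time, H(t) = −log S(t), which is one way each function can be derived from another.

# Survival, density, hazard and cumulative hazard for an exponential survival time
import numpy as np

lam = 0.1
t = np.linspace(0.01, 30, 300)

S = np.exp(-lam * t)          # survival function S(t) = P(T > t)
f = lam * np.exp(-lam * t)    # probability density function f(t)
h = f / S                     # hazard h(t) = f(t) / S(t); constant for the exponential
H = -np.log(S)                # cumulative hazard H(t); equals lam * t here

print(np.allclose(h, lam), np.allclose(H, lam * t))   # both print True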

6.3 Kaplan-Meier product limit estimator

The survival function can be estimated using the Kaplan-Meier (KM) product limit estimator. The KM
estimator is an example of a non-parametric method, i.e., a method that requires no assumptions about the
distribution of the data. In a set of survival data, let t1 < t2 < ... < tm denote all the distinct failure times observed among the individuals that failed; these failure times form intervals [t1, t2), [t2, t3), etc. The KM method
assumes the failure occurs at the beginning of the interval. It estimates the probability of surviving longer
than a given time t i , i.e., S(t i ).
The estimate is the product of a series of estimated conditional probabilities: the probabilities of surviving from one interval to the next are multiplied together to give the cumulative survival probability. More formally, the probability of being alive at time ti, S(ti), is calculated from S(ti−1), the probability of being alive at ti−1. Let ni be the number of patients alive (at risk) just before ti, and di the number of failures at ti. The probability of surviving longer than ti is estimated as

P(T > ti) = S(ti) = S(ti−1)(1 − di/ni) = (1 − di/ni) × (1 − di−1/ni−1) × ... × (1 − d1/n1) × S(0),     (6.2)

where at time 0, S(0) = 1. The value of S(t) is constant between times of events, and therefore the estimated
probability is a step function that changes value only at the time of each event. This estimator allows each
patient to contribute information to the calculations for as long as they are known to be event-free at the start


of a time interval. Were every individual to experience the event (i.e., no censoring), this estimator would
simply reduce to the number of individuals event-free at time t divided by the number of people
who entered the study.
The KM estimates are limited to the time interval in which the observations fall. If the largest observation
is uncensored, the estimate at that time is always zero. If the largest observation is censored, the estimate can
never equal zero and is undefined beyond the largest observation, unless an additional assumption is imposed.
In addition, if the estimated survival curve never drops to 50% or below (which can happen when fewer than 50% of the observations are uncensored and the largest observation is censored), the median survival time cannot be estimated.
An important assumption of the Kaplan-Meier method is that the probability of a censored observation
is independent of the actual survival time. In other words, those who are censored would have the same
chance of failure as those under study. If, on the contrary, patients dropped out because they became ill due
to the disease or intolerance to treatment, then censoring might be related to failure and the KM estimate would
produce biased results.
We illustrate the KM estimator using a dataset of n = 137 from a randomised trial of two treatment
regimens for lung cancer. The outcome of interest is survival time. Patients were randomised to receive either
a standard treatment (trt = 1) or a test treatment (trt = 2). In addition to these two variables, the dataset also contains information on the following variables: celltype (1 = squamous, 2 = small cell, 3 = adeno, 4 = large),
status (censoring status), karno (Karnofsky performance score, a measure of functional status on a scale of
0-100, with 0 = dead and 100 = good), diagtime (months from diagnosis to randomisation), age (in years)
and prior (prior therapy 0=no, 1=yes). There are n = 69 patients on standard treatment and n = 68 on test
treatment. A summary of the data is given in Table 6.1.

Table 6.1: Lung cancer trial data

trt=1 (n=69) trt=2 (n=68)


karno
Mean (SD) 59.20 (18.74) 57.93 (21.40)
Range 20 - 90 10 - 99
diagtime
Mean (SD) 8.65 (8.76) 8.90 (12.27)
Range 1 - 58 1 - 87
age
Mean (SD) 57.51 (10.81) 59.12 (10.28)
Range 34 - 81 35 - 81
celltype
squamous 15 (21.7%) 20 (29.4%)
smallcell 30 (43.5%) 18 (26.5%)
adeno 9 (13.0%) 18 (26.5%)
large 15 (21.7%) 12 (17.6%)
prior
0 48 (69.6%) 49 (72.1%)
1 21 (30.4%) 19 (27.9%)

In Table 6.2, we use the data from the standard treatment group (trt =1) to demonstrate how the KM
method estimates the survival function. For the trt = 1 group, there are n = 69 patients at the beginning of the study. The first death occurred at time T = 3, which gives S(3) = S(0)(1 − 1/69) = 0.986. The second death occurred at time T = 4, which gives S(4) = S(3)(1 − 1/68) = 0.986(1 − 1/68) = 0.971. At time T = 8, there were 66 patients at risk and 2 patients died at that time, which gives S(8) = S(7)(1 − 2/66) = 0.957(1 − 2/66) = 0.928. The last patient died at 553 days following treatment; since this observation is not censored, the survival function S(t) drops to zero at T = 553.
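The arithmetic behind Table 6.2 can be sketched in a few lines of Python; the function below is a bare-bones illustration of the product-limit formula (6.2), and the small dataset at the end is made up for demonstration rather than taken from the trial.

# Bare-bones Kaplan-Meier product-limit estimate, following equation (6.2)
import pandas as pd

def kaplan_meier(times, events):
    """Return the KM estimate at each distinct failure time."""
    data = pd.DataFrame({"time": times, "event": events}).sort_values("time")
    n_at_risk = len(data)
    surv = 1.0
    rows = []
    for t, grp in data.groupby("time", sort=True):
        d = int(grp["event"].sum())              # failures d_i at time t_i
        if d > 0:
            surv *= 1 - d / n_at_risk            # S(t_i) = S(t_{i-1}) * (1 - d_i / n_i)
            rows.append({"time": t, "n_at_risk": n_at_risk, "n_failed": d, "KM": round(surv, 3)})
        n_at_risk -= len(grp)                    # all subjects observed at t leave the risk set
    return pd.DataFrame(rows)

# Illustrative data: event = 1 means death, 0 means censored
times  = [3, 4, 7, 8, 8, 10, 10, 11, 12, 12, 13, 15]
events = [1, 1, 1, 1, 1, 1,  1,  1,  1,  1,  1,  0]
print(kaplan_meier(times, events))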
The survival curves S(t) for both treatment groups are shown in Fig. 6.2. The + marks on the curves
show the censored observations. Both groups of patients show a steep decline in S(t) in the early part of the


Table 6.2: KM-estimate of S(t) in lung cancer trial data (trt=1)

time no. at risk no. failed KM estimate


3 69 1 0.986
4 68 1 0.971
7 67 1 0.957
8 66 2 0.928
10 64 2 0.899
11 62 1 0.884
12 61 2 0.855
13 59 1 0.841
..
.
314 5 1 0.069
384 4 1 0.052
392 3 1 0.035
411 2 1 0.017
553 1 1 0

study, suggesting the poor prognosis of the disease despite treatment. The curves in the early part of the trial
are almost indistinguishable, meaning that there is little difference between the two treatments early on. The
points at which these curves cross 50% give the median survival for the patients, i.e., the time by which only 50% of the patients remain alive. The median survival is estimated at 103 days for trt = 1 and is even shorter, at only 53 days, for trt = 2. At around 200 days, the curves cross and the curve for trt = 2 flattens and stays above that for trt = 1, due to a few long-term survivors on trt = 2; the last patient on trt = 2 died at 999 days, when
the curve falls to zero.

6.4 Comparing survival curves

In Fig. 6.2, we computed the survival curves for trt = 1 and trt = 2. Based on the curves, we could compare the proportions surviving at any specific time. The weakness of this approach is that it does not provide a comparison of the total survival experience of the two groups, but rather gives a comparison at some arbitrary time points. As we discussed above, the difference in survival is negligible early in the trial, but later on the curves separate, and at some times trt = 1 is worse while at other times it is better. There are statistical tests designed for comparing survival curves that take the whole follow-up period into account. The most popular of these is the (Mantel-Haenszel) logrank test. The test is an example of
a nonparametric method in that it does not require any assumptions about the shape of the survival curve or
the distribution of survival times. The logrank test is based on the same assumptions as the KM estimate, the
most important of which is that censoring is unrelated to prognosis. The logrank test is used to test the null
hypothesis that there is no difference between the groups in the probability of failure (e.g., death) at any time
point. The analysis is based on the times of failures. At each time a failure occurs, it calculates the observed
number of deaths in each group and the number expected if there were in reality no difference between the
groups. If the difference between observed and expected is large, then the test rejects the null hypothesis
of no difference between groups. The test only depends on the ranks of the survival time but not the actual
survival time of each observation, from which the name of the test is derived.
In a logrank test, the null hypothesis of no difference between the groups in the probability of failure at
all time points can be rephrased as a risk ratio of 1 at all time points. The test is suitable when the survival curves of the groups do not cross and the risk ratios are constant over time. This is equivalent to a condition
called proportional hazards. In Fig. 6.1, we observed that the curves between groups cross at least twice
and the constant risk ratio assumption is violated. Under such a scenario, other tests are more suitable. An
alternative test is the Gehan-Wilcoxon test, which also uses just the ranks of the observations and is therefore also non-parametric.


Figure 6.2: Kaplan Meier curves for S(t) in lung cancer trial data, stratified by trt

[Kaplan-Meier step curves for trt = 1 (standard) and trt = 2 (test): survival probability against time (days), with + marks showing censored observations.]

For the data in Fig. 6.2, the log-rank test gives a p-value of 0.9 and the Gehan-Wilcoxon test gives a p-value of 0.4; both are highly non-significant. Hence, we conclude that there is no evidence of a difference in survival experience between the two treatments.
Both tests can be extended to testing differences in survival curves among multiple groups.
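In practice, these tests are available in standard software. A hedged sketch with the Python lifelines package is shown below; the small data frame is invented for illustration, and in an actual analysis the columns would hold the trial's time, status and trt values.

# Two-group log-rank test with lifelines
import pandas as pd
from lifelines.statistics import logrank_test

df = pd.DataFrame({
    "time":   [3, 4, 7, 8, 10, 12, 15, 20, 21, 25, 30, 31],
    "status": [1, 1, 1, 1, 1,  0,  1,  1,  0,  1,  1,  0],   # 1 = died, 0 = censored
    "trt":    [1, 1, 1, 1, 1,  1,  2,  2,  2,  2,  2,  2],
})

g1, g2 = df[df["trt"] == 1], df[df["trt"] == 2]
result = logrank_test(g1["time"], g2["time"],
                      event_observed_A=g1["status"],
                      event_observed_B=g2["status"])
print(result.p_value)    # the test uses the whole follow-up period, not one time point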
While a statistical test provides a p-value for the differences between the groups, it offers no estimate of
the actual difference, for example the risk ratio, between groups. In other words, a test provides statistical
but not clinical evidence. A statistical model overcomes the shortfalls of a test. Furthermore, a statistical
model allows survival to be assessed with respect to several independent variables simultaneously. Therefore,
statistical models are important and frequently used tools which, when constructed appropriately, offer
valuable insight into the survival process.

6.5 Cox proportional hazards model

The most powerful method to show cause and effect of an independent variable (factor) on an outcome is
through a randomised study. In a randomised study, subjects are recruited and randomly assigned to receive
different levels of the factor and then followed for the outcome of interest. By the simple act of randomisation,
all known and unknown confounders are balanced so any difference in outcomes can be attributed to the
factor. For example, if we examine the data in Table 6.1, we notice that all independent variables a are comparable between the two treatment groups and hence can be ignored if our interest is only in the impact of the treatment on the outcome. In such a case, a test such as the log-rank test or the Gehan-Wilcoxon test is sufficient to establish the difference, if any, between the two treatment arms.

a With the exception of perhaps celltype, due to the large number of levels and the relatively small sample size of this trial. For trials with large sample sizes, the independent variables are expected to be balanced.

Most healthcare research data, however, come from observational studies. Sometimes, even if the data are from a randomised study, there may be questions on how various factors influence the outcome. For example, whether survival in lung cancer patients differs between cell types. In practice, analyses of survival data are carried out using multivariate (multiple) regression. Among the different multivariate regressions for
analysing survival data, the Cox Proportional Hazards (PH) model is the most popular. In earlier chapters,
when the relationship between multiple independent variables, X 1 , ..., X p , and an outcome is of interest, we
expressed a function of the outcome, e.g., mean, logit, log mean, in terms of a linear function in the form
a + b1 X 1 + ... + b p X p .
When analysing survival data, the function of the outcome to consider is the hazard function. There are two
special features of a hazard function for survival data that must be considered. First, a hazard function is non-negative at any time t. Second, the hazard is a function of the time t. Both features must be taken into account when formulating the regression model. To handle non-negativity, we can assume the log of the hazard is a linear function of the independent variables, as we did for a Poisson regression:

log[h(t)] = a + b1X1 + ... + bpXp,   or equivalently   h(t) = exp(a) exp(b1X1 + ... + bpXp).
To accommodate the time-dependent nature of the hazard function, we make a a function of t, so that
h(t) = exp(a(t)) exp(b1 X 1 + ... + b p X p ). (6.3)
Recall that in a regression model, the intercept can be interpreted as (the mean of) the outcome in the reference
or “baseline" group. Hence, in (6.3), we can interpret exp(a(t)) as the hazard for a baseline group at time t.
Writing h0 (t) = exp(a(t)), the Cox PH model is
h(t) = h0 (t) exp(b1 X 1 + ... + b p X p ). (6.4)
The impact of the different independent variables X 1 , ..., X p is measured by the size of the respective
coefficients b1 , ..., b p . Since b1 X 1 , etc., appear in the exponents, their effects on h(t) are multiplicative at
any point in time. This provides us with the key assumption of the PH model: the hazard of the event in
any group is a constant multiple of the hazard in any other group. This assumption implies that the hazard curves for the groups should be proportional and cannot cross, hence the name proportional hazards. The quantities
exp(b j ) are called hazard ratios. A value of b j greater than zero, or equivalently a hazard ratio greater than
one, indicates that as the value of X j increases, the hazard increases and thus the length of survival decreases.
The Cox PH does not require any assumption about the baseline hazard h0 (t). The only assumption is the
multiplicative part of (6.4), hence the Cox PH model is considered a semi-parametric model.
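A Cox PH model of the form (6.4) can be fitted with a single call in most packages. The sketch below uses the Python lifelines package on a small invented data frame; the column names mirror the trial variables but the numbers are placeholders, so the output is not the analysis reported in Table 6.3.

# Fitting a Cox proportional hazards model with lifelines
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "time":   [72, 411, 228, 126, 118, 10, 82, 110, 314, 100],
    "status": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],     # 1 = died, 0 = censored
    "trt":    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "karno":  [60, 70, 60, 60, 70, 20, 40, 80, 50, 70],
    "age":    [69, 64, 38, 63, 65, 49, 69, 68, 43, 70],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="status")
cph.print_summary()      # coefficients b_j, hazard ratios exp(b_j) and confidence intervals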
We applied the Cox PH model to the lung cancer trial dataset. For celltype, the reference level is
“squamous". The results are given in Table 6.3. The table shows the coefficient for each variable. For example,
the coefficient for celltype= “small cell" is 0.86. A test of the null hypothesis that the coefficient is zero gives
a p-value of 0.0017, so the null hypothesis can be rejected. The coefficient corresponds to a hazard ratio
of exp(0.86) = 2.37. We can also calculate a 95% confidence interval for this hazard ratio, in the form of
exp(coefficient ± 1.96SE(coefficient)), which gives (1.38,4.06). Since the lower limit of the 95% confidence
interval is above 1, we conclude there is sufficient evidence that small cell celltype brings a higher hazard of
mortality, compared to the reference celltype = “squamous". This results is assumed to hold at all time points,
under the proportional hazards assumption. For a continuous variable, the coefficient corresponds to a unit
increase in the value of the variable. For example, Karnovsky performance score (karno) is a continuous
variable, the coefficient of −0.03 corresponds to every unit increase in the score. Since the coefficient has
a negative value, hence it corresponds to a reduction of hazard ratio by a factor of exp(−0.03) = 0.97 for
every unit increase in the score. This result applies to every time point, and every unit increase. To put
another way, the reduction in hazard ratio between a karno= 100 and karno= 90 is assumed to be the same
as that between karno= 30 and karno=20, at all time points. If this assumption is in question, then a more
complicated function of karno needs to be considered for the model.
Based on the results in Table 6.3, only three coefficients are significant: celltype= small, celltype= adeno,
and karno. We redrew the KM curves, stratifying for celltype in Fig. 6.3. The figure shows the clear separation
of the curves. The curves for celltype = adeno and small are similar, both are very different from that of
the reference celltype squamous. These results are consistent with those in Table 6.3, which shows both
coefficients for celltype = adeno and small to be significant. Between celltype = squamous and large, the
curves cross twice and only diverge later during follow-up, when there are few patients remaining. The behaviour of these curves is also consistent with the non-significant coefficient for celltype = large.


Table 6.3: Cox regression model, n = 137, number of events = 128.

coefficient HR = exp(coefficient) 95% CI p-value


trt 0.295 1.343 [0.894, 2.017] 0.16
celltype = smallcell 0.862 2.367 [1.380, 4.060] 0.0017
celltype = adeno 1.196 3.307 [1.834, 5.965] < 0.0001
celltype = large 0.401 1.494 [0.858, 2.600] 0.16
karno -0.033 0.968 [0.957, 0.978] < 0.0001
diagtime 0.00008 1.000 [0.982, 1.018] 0.99
age -0.009 0.991 [0.973, 1.010] 0.35
prior 0.072 1.074 [0.681, 1.694] 0.76

Figure 6.3: Kaplan Meier curves for S(t) in lung cancer trial data, stratified by celltype

[Kaplan-Meier step curves for the four cell types (squamous, small cell, adeno, large): survival probability against time (days), with + marks showing censored observations.]

6.6 Time varying coefficients

In Table 6.3, we found that karno score is an important predictor. We grouped patients into three groups
according to karno score: < 50, 50-70 and 70+ and compared their survival experience, stratified by celltype,
which is another important predictor according to our earlier analysis. The results are shown in Fig. 6.4.
The plots in Fig. 6.4 show that, irrespective of celltype, a higher karno score is predictive of better survival. However, both plots show that the effect is not constant over time. Early on, a low karno score has a large negative effect: the risk of dying for a patient with karno < 50 is much higher than for someone with a karno score of 50-70 or 70+. Irrespective of celltype, survival for patients with karno = 70+ is better than for those with karno = 50-70. However, these differences shrink, so that by 350 days for the squamous/large celltypes, and 200 days for smallcell/adeno, they are not much different from zero. One explanation is that this disease can cause very acute impairment to the functional performance of the patients, so any measure that is more than a few months old is no longer relevant. The Cox PH model (6.4) requires a constant hazard ratio over time
for all independent variables. In this situation, where the hazard ratios between subgroups of patients are
not constant over time, the Cox PH model (6.4) needs to be modified. Assuming the independent variable in


Figure 6.4: Kaplan Meier curves for S(t) in lung cancer trial data, stratified by celltype and karno

[Two panels of Kaplan-Meier step curves, (a) celltype = squamous/large and (b) celltype = smallcell/adeno, each stratified by karno < 50, karno = 50-70 and karno = 70+: survival probability against time (days), with + marks showing censored observations.]

question is X p , the modification required to the Cox PH model is

h(t) = h0 (t) exp(b1 X 1 + ...b p−1 X p−1 + b p (t)X p ), (6.5)

where the difference between (6.4) and (6.5) is that b p (t) is not a constant; rather it is allowed to depend on
t. We call b p (t) a time varying coefficient.
A simple way to incorporate a time varying coefficient is to use intervals of time. Consider a patient with follow-up from time 0 to death at 411 days, and assume that we wish to have a time varying coefficient for karno. One way to do this is to break the follow-up time into 3 time intervals, 0-90, 90-180 and 180+, with one
row of data for each interval. The data might look like the following for the first few patients in the dataset:

Table 6.4: Stepwise time varying coefficient

Original data:
patient  time  status  karno
1        72    1       60
2        411   1       70
3        228   1       60
4        ···

Modified data:
patient  start  time  status  interval  karno
1        0      72    1       0-90      60
2        0      90    0       0-90      70
2        90     180   0       90-180    70
2        180    411   1       180+      70
3        0      90    0       0-90      60
3        90     180   0       90-180    60
3        180    228   1       180+      60

Table 6.4 shows the first patient died at 72 days, hence his data only contributes to the first interval (0-90
days) in the modified data. The second patient survived to 411 days and then died. Hence the contribution
of this patient is to all three intervals. In the interval 0-90, the patient is still alive, hence status =0. In the
second interval 90-180, the patient remains alive and hence status=0. In period 180+, the patient eventually
dies at 411 days, hence the survival time for the patient in that interval is 180 to 411 days and status=1 at
411. Similarly the third patient also contributes information to all three intervals. Hence the modified data


is an expanded version of the original data. We repeated the analysis using the modified data. The results
are given in Table 6.5. Notice that in the table, the analysis is based on n = 225 rows rather than the original n = 137, due to the contribution of some observations to multiple intervals as described above. The total number of events remains the same, since the number of deaths cannot change simply by dividing the observations into different intervals. There are now three coefficients for karno, one for each interval. The results show that the coefficient is only significant in the first interval, which is consistent with what we observed in Fig. 6.4. The coefficient for karno in the first interval corresponds to a hazard ratio of 0.95 for each unit increase in karno score, a somewhat stronger effect than the hazard ratio of 0.97 in Table 6.3. This result is not surprising, since in the original analysis the karno effect is assumed to be constant over time, and hence its impact during the earlier times is diluted by the lack of impact later on. We can compare the model here with that in Table 6.3 by the Akaike
Information Criterion (AIC). The current model has an AIC value of 948.6 which is smaller than the value of
964.8 for the earlier model. Hence the current model is preferred.
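The expansion of the data into intervals, as in Table 6.4, can be done with a short helper function. The sketch below is a plain-pandas illustration (not the code used for the analysis in the text), assuming columns named patient, time, status and karno.

# Splitting each patient's follow-up into the intervals 0-90, 90-180 and 180+
import pandas as pd

def split_into_intervals(df, cuts=(90, 180)):
    rows = []
    for _, r in df.iterrows():
        start = 0
        for cut in list(cuts) + [float("inf")]:
            stop = min(r["time"], cut)
            status_here = int(r["status"] == 1 and r["time"] <= cut)   # event only in the final row
            rows.append({"patient": r["patient"], "start": start, "stop": stop,
                         "status": status_here, "karno": r["karno"]})
            if r["time"] <= cut:
                break                      # follow-up ends within this interval
            start = cut
    return pd.DataFrame(rows)

original = pd.DataFrame({"patient": [1, 2, 3], "time": [72, 411, 228],
                         "status": [1, 1, 1], "karno": [60, 70, 60]})
print(split_into_intervals(original))      # reproduces the layout of Table 6.4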

Table 6.5: Cox regression model, n = 225, number of events = 128.

coefficient HR = exp(coefficient) 95% CI p-value


trt 0.11 1.12 [0.74, 1.69] 0.60
celltype = smallcell 0.96 2.61 [1.49, 4.57] 0.00076
celltype = adeno 1.14 3.14 [1.72, 5.73] 0.00019
celltype = large 0.35 1.41 [0.81, 2.47] 0.22
diagtime -0.002 1.00 [0.98, 1.02] 0.82
age -0.01 0.99 [0.97, 1.01] 0.24
prior 0.08 1.09 [0.69, 1.72] 0.72
karno: interval = 0-90 -0.05 0.95 [0.94, 0.97] < 0.0001
karno: interval = 90-180 0.007 1.01 [0.98, 1.03] 0.60
karno: interval = 180+ 0.0009 1.00 [0.97, 1.03] 0.95

Using a step function for a time varying coefficient requires us to split the follow-up time into arbitrary
intervals. It also requires us to modify the data as we did in Table 6.4. An alternative to a step function is
to assume a continuous function. A particularly simple continuous function for a time varying coefficient is
b p (t) = b p0 + b p1 log(t). The reason for using log(t) instead of t is due to the highly skewed nature of survival
data. For example, in the current dataset, the median survival for all patients is only 80 days and yet, the
longest surviving patient lived to 999 days. When we applied the above continuous function to model (6.5),
we obtain
h(t) = h0 (t) exp(b1 X 1 + ...b p−1 X p−1 + b p0 X p + b p1 log(t)X p ). (6.6)

In model (6.6), it is important to note that b p1 log(t)X p is not an interaction effect or effect modifier described
in Chapter 3.7. Here, t is a time that is updated at every event time (whether a failure or a censoring) in the dataset. To put it another way, t does not refer to the last follow-up time for a patient; rather, it is a time that is updated at each event time. The results for the re-analysis using this formulation of a continuous time varying coefficient are given in Table 6.6. In this analysis, the sample size is exactly the same as in the original
data, consistent with the point made above that this approach does not require modifying the data. There are now two coefficients for karno. The first coefficient can be interpreted as the "average" effect of karno on the hazard; it translates to a hazard ratio of 0.92, i.e., a reduction in hazard, for every unit increase in karno. This result is to be expected from our discussions. The second coefficient is for log(t)×karno, which is the time varying part of the coefficient. This coefficient is positive and implies an increase over time, which reflects the diminishing (positive) effect of karno over time. Put together, the time varying coefficient for karno is −0.08 + 0.01 log(t). Since the coefficient for log(t)×karno is significant, it suggests a rejection of the proportional hazards assumption for karno. The AIC value of this model is 956, which is in between
the stepwise model and the original model. Here, we have used the function bp(t) = bp0 + bp1 log(t) as an illustration of this technique. In practice, other functions that better fit the data can be used.


Table 6.6: Cox regression model, n = 137, number of events = 128.

coefficient HR = exp(coefficient) 95% CI p-value


trt 0.163 1.178 [0.779, 1.780] 0.44
celltype = smallcell 0.927 2.527 [1.460, 4.375] 0.00093
celltype = adeno 1.175 3.239 [1.788, 5.869] 0.00011
celltype = large 0.362 1.437 [0.825, 2.504] 0.20
diagtime -0.0009 0.999 [0.982, 1.017] 0.92
age -0.009 0.991 [0.973, 1.009] 0.31
prior 0.047 1.048 [0.665, 1.651] 0.84
karno -0.083 0.920 [0.889, 0.952] < 0.0001
log(t)×karno 0.014 1.014 [1.005, 1.023] 0.0018

6.7 Time dependent covariates

In the standard Cox PH model (6.4), we assume independent variables (covariates) are measured at baseline
and they stay constant over the course of the follow-up. But there are many applications for which a covariate
changes over the duration of the study period. Independent variables with values that may change during
follow-up are called time dependent covariatesb . Suppose an independent variable X p is time dependent,
then it means its value is a function of t. In the context of a Cox PH model, we write

h(t) = h0 (t) exp(b1 X 1 + ...b p−1 X p−1 + b p X p (t)). (6.7)

The difference between a time dependent covariate and a time varying coefficient is subtle but important. In model (6.6), the time varying coefficient acts upon Xp, which stays constant over time, while the coefficient (and hence the hazard ratio) changes over time. In contrast, for a time dependent covariate, the coefficient bp (since it is not a function of t) stays constant but the value of the covariate Xp(t) changes with time.
We illustrate the concept of a time dependent covariate using data from the Stanford Heart
Transplantation Program. The program began in October 1967 and ended in February 1980. Patients were
admitted to the program after review by a committee, and then they waited for donor hearts to become
available. While waiting, some died or were transferred out of the program, but most received a transplant. We
use data up to 1974. In all, there are n = 103 patients. Table 6.7 shows the data for the first 10 patients. The
dependent variable is survival time (futime) since enrolment (accept.dt). The relevant independent variables
are surgery, age, transplant, and year (of enrolment which can be calculated from accept.dt).

Table 6.7: Stanford heart transplant data

Patient accept.dt tx.date fu.date fustat surgery age futime wait.time transplant
1 1967-11-15 1968-01-03 1 0 30.8 49 0
2 1968-01-02 1968-01-07 1 0 51.8 5 0
3 1968-01-06 1968-01-06 1968-01-21 1 0 54.3 15 0 1
4 1968-03-28 1968-05-02 1968-05-05 1 0 40.3 38 35 1
5 1968-05-10 1968-05-27 1 0 20.8 17 0
6 1968-06-13 1968-06-15 1 0 54.6 2 0
7 1968-07-12 1968-08-31 1970-05-17 1 0 50.9 674 50 1
8 1968-08-01 1968-09-09 1 0 45.3 39 0
9 1968-08-09 1968-11-01 1 0 47.2 84 0
10 1968-08-11 1968-08-22 1968-10-07 1 0 42.5 57 11 1

The key variable is transplant (1= received a transplant or 0 = did not receive a transplant). The first
two patients died (fustat = 1) before receiving a transplant (transplant = 0). Patient 3 received a transplant
at time 0 but died 15 days following the transplant. Patient 4 had to wait 35 days to receive a transplant but died on day 38. In this study, transplant is a time dependent covariate. The main question in the study is whether transplant influenced the survival of these patients.

b Sometimes also called time-varying covariates.
A naïve analysis using the original data, treating transplant as a fixed covariate shows a dramatic survival
benefit for patients receiving transplant (Fig. 6.5). The associated log-rank test gives a p-value of < 0.001.
However, this analysis is flawed. Patients who eventually received a transplant in the study had to live long
enough to receive the transplant. In other words, patients who received a transplant could not die while they
were waiting for the transplant. This fact makes them immortal during the waiting time. Ignoring this fact
leads to bias, which sometimes is referred to as immortal time bias.

Figure 6.5: Kaplan Meier curves for S(t) in Stanford heart transplant data, stratified by transplant

[Kaplan-Meier step curves for transplant = 0 and transplant = 1 (naïve analysis): survival probability against time (days), with + marks showing censored observations; the transplant = 1 curve lies well above the transplant = 0 curve.]

To avoid immortal time bias, we need to adjust the data. Patients who eventually received a transplant are treated as two cases: before and after the transplant. They begin the study in the transplant = 0 group, until the time of transplant, when they move to the transplant = 1 group. In this sense, transplant status
is time dependent. We illustrate this procedure in Table 6.8 using the first few patients in the data. For each
patient, we used the original data to create two times, a start time that denotes the time survival is counted
and a stop time, when the patient died or was censored. We also created year to denote the year of enrolment
since the beginning of the program. Patient 4 received a transplant during the study and hence, there are
two entries. The patient waited until day 35 to receive a transplant, hence, from day 0 to 35, the patient
belonged to transplant = 0. In addition, that patient was alive during that time and hence fustat = 0 during
that time. On day 35, the patient received a transplant and switched over to transplant = 1 group. The patient
eventually died on day 38 and fustat = 1. So the patient contributed to two different follow-up periods, first
(0,35) with fustat=0, transplant = 0 and second (35,38) with fustat = 1, transplant =1. Similarly, patient 7
received a transplant on day 50, so there are two entries for that patient: (0,50) with fustat=0, transplant =
0 and (50,674) with fustat = 1, transplant =1.
Using the modified data, we redrew the KM curves in Fig. 6.6. The figure looks very different from
Fig. 6.5. In particular, the “benefit" of transplant has all but disappeared and a test now shows no difference
between transplant = 0 and 1 (p > 0.5).
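Start-stop data of the form in Table 6.8 can be fitted directly in standard software. The sketch below is a hedged illustration with the lifelines CoxTimeVaryingFitter, using only the first few modified records and a reduced set of covariates; it is not the model reported later in Table 6.9, and with so few rows the estimates are not meaningful.

# Cox model for start-stop (counting process) data with lifelines
import pandas as pd
from lifelines import CoxTimeVaryingFitter

df = pd.DataFrame({
    "patient":    [1, 2, 3, 4, 4, 5, 6, 7, 7, 8],
    "start":      [0, 0, 0, 0, 35, 0, 0, 0, 50, 0],
    "stop":       [49, 5, 15, 35, 38, 17, 2, 50, 674, 39],
    "fustat":     [1, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    "transplant": [0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    "age":        [-17.2, 3.8, 6.3, -7.7, -7.7, -27.2, 6.6, 2.9, 2.9, -2.7],
})

ctv = CoxTimeVaryingFitter()
ctv.fit(df, id_col="patient", event_col="fustat",
        start_col="start", stop_col="stop")
ctv.print_summary()     # this only demonstrates the call on start-stop data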
To complete the analysis, we fitted a Cox PH model including all the independent variables. The results
for a Cox PH model using only transplant (model (1)) and the full model including all the variables (model


Table 6.8: Stanford heart transplant data modified for time dependent covariate

patient start stop fustat transplant age year surgery


1 0 49 1 0 -17.2 0.12 0
2 0 5 1 0 3.8 0.25 0
3 0 15 1 1 6.3 0.27 0
4 0 35 0 0 -7.7 0.49 0
4 35 38 1 1 -7.7 0.49 0
5 0 17 1 0 -27.2 0.61 0
6 0 2 1 0 6.6 0.70 0
7 0 50 0 0 2.9 0.78 0
7 50 674 1 1 2.9 0.78 0
8 0 39 1 0 -2.7 0.84 0

(2)) are given in Table 6.9. The full model shows that only year and age are significant. Since model (1)
is a sub-model of (2), we can compare them using a likelihood ratio test, which gives a p-value of 0.002,
suggesting model (2) is better in explaining the data. Model (3) involves a non-linear function of year in the
form of a cubic function. We defer the discussion of this model to Chapter 6.8.

Table 6.9: Cox PH models for Stanford heart transplant data

Dependent variable: Survival

                        (1)         (2)         (3)
transplant              0.127       −0.010      −0.043
                        (0.301)     (0.314)     (0.315)
year                                −0.146∗∗    −2.361∗∗∗
                                    (0.070)     (0.781)
year2                                           0.797∗∗∗
                                                (0.292)
year3                                           −0.081∗∗
                                                (0.032)
age                                 0.027∗∗     0.031∗∗
                                    (0.014)     (0.015)
surgery                             −0.637∗     −0.666∗
                                    (0.367)     (0.383)
Observations            172         172         172
Log Likelihood          −298.031    −290.566    −286.805

Note: ∗ p<0.1; ∗∗ p<0.05; ∗∗∗ p<0.01


Figure 6.6: Kaplan Meier curves for S(t) in Stanford heart transplant data, stratified by transplant, time
dependent covariate approach

[Kaplan-Meier step curves for transplant = 0 and transplant = 1 using the time dependent covariate (start-stop) data: survival probability against time (days), with + marks showing censored observations; the two curves are close together.]

6.8 Assessing model fit

Similar to the models considered in earlier chapters, when using a Cox PH model to analyse survival data, it is
important to evaluate how well the model represents the data. A model is adequate if it explains the data reasonably well. This aspect of a model is known as goodness-of-fit. For example, if a subset of the patients have
a poor prognosis, then the model should predict this subset to have that outcome. In practice, the issues in
choosing the most appropriate type of model and the most appropriate variables are closely related, and the
adequacy of a model may be assessed in several ways.
Every statistical model requires assumptions and the adequacy of a model in explaining the data depends
critically on the validity of the assumptions for the data under study. The Cox PH model, despite making
no assumptions about the baseline hazard, is no exception. Diagnostic methods are useful in all types of
regression models to investigate the validity of those assumptions and identify ways in which they might be
violated. Residuals play an important role in regression diagnostics. In survival analysis, there are several
kinds of residuals. We discuss two kinds that are commonly used.
The first kind of residuals are called Martingale residuals.c For each observation i, let di = 1 if the observation is a failure and di = 0 if it is censored, and let ti be its last observation time (whether a failure or a censoring time). The martingale residual is

ei = di − Hi(ti),     (6.8)

where Hi(ti) is the model-based estimate of the cumulative hazard for observation i up to time ti.

This residual represents the discrepancy between the observed value of a subject’s survival indicator and its
expected value, “summed" over the time for which that patient was at risk. Positive values mean that the
patient failed sooner than expected (according to the model); negative values mean that the patient survived
longer than expected (or was censored). Martingale residuals are very useful and can be used for many of the
usual purposes that we use residuals for in other models (identifying outliers, choosing a functional form for the covariate, etc.). If a trend in the plot is apparent, then it should be investigated. Plots of the martingale residuals against celltype and karno for the lung cancer trial data are shown in Fig. 6.7. Fig. 6.7 (a) shows one observation with a particularly large martingale residual of −6.72. This residual corresponds to patient 44, who received trt = 1, had smallcell type and karno = 40, and died 392 days after treatment. This is an unusual observation because, based on our previous analysis of celltype and karno score, the patient had a poor prognosis. Yet the patient lived for 392 days, well beyond the median survival of 103 days found in Fig. 6.2. Fig. 6.7 (b) shows no evidence of any trends, meaning there is no evidence that the relationship between survival and karno has been modelled incorrectly (e.g., non-linear instead of a linear relationship). The same outlying patient in Fig. 6.7 (a) also appears in (b) at the bottom of the figure.

c It takes its name from "la grande martingale", the strategy for 50-50 bets in which one doubles the bet after each loss. Since one is surely going to eventually win with 50-50 bets, the first win will recover all previous losses.

Figure 6.7: Martingale residual plots against (a) celltype and (b) karno for lung cancer study data

[Scatter plots of martingale residuals against (a) celltype (squamous, smallcell, adeno, large) and (b) karno (25-100), with points marked as Censored or Died; one point with a large negative residual stands out at the bottom of both panels.]

A primary weakness of the martingale residual is its asymmetry (its upper bound is 1, but it has no
lower bound). A technique for creating symmetric, normalized residuals that is widely used in generalized
linear modeling is to construct a deviance residual. The idea behind the deviance residual is to examine
the difference between the log-likelihood for observation i under the given model and the maximum possible
log-likelihood for that observation, in the spirit of carrying out a likelihood ratio test for the observation, cf. Chapter 4.2.1. In Fig. 6.8, we redrew Fig. 6.7 but now using deviance residuals. The plots show that the deviance residuals are much more symmetric about zero than the martingale residuals. Neither plot shows any particular trends. The outlying observation in Fig. 6.7 has a deviance residual of −3.1.
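In practice, these residuals are obtained from the fitted model object. The sketch below uses lifelines' compute_residuals on the same kind of small invented data frame used in the earlier Cox sketch; it is not the lung cancer trial data.

# Martingale and deviance residuals from a fitted lifelines Cox model
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "time":   [72, 411, 228, 126, 118, 10, 82, 110, 314, 100],
    "status": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "karno":  [60, 70, 60, 60, 70, 20, 40, 80, 50, 70],
    "age":    [69, 64, 38, 63, 65, 49, 69, 68, 43, 70],
})
cph = CoxPHFitter().fit(df, duration_col="time", event_col="status")

martingale = cph.compute_residuals(df, kind="martingale")   # one value per observation
deviance   = cph.compute_residuals(df, kind="deviance")     # symmetrised version
print(martingale.head())
print(deviance.head())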
One of the main uses of deviance (or martingale) residuals is identifying outliers. We printed the data
with the 5 largest deviance residuals in absolute value in Table 6.10. There are no extreme outliers; the largest residuals are about 3.07 and 2.5 SDs away from zero. Comparing the deviance and martingale residuals shows the extent of the skewness of the martingale residuals; the martingale residual is bounded above by 1, and patients 85, 77 and 15 all have martingale residuals near 1. At the other end, patients 44 and 21 had much larger martingale residuals in absolute value on the negative side. The table shows clearly why these patients had the highest absolute deviance residuals: patients 85, 77 and 15 all had the favourable squamous celltype, yet
they died 1, 1 and 11 days following treatment. In contrast, patients 44 and 21 had the unfavourable small
cell tumours and poor karno scores, yet they went on to live for 392 and 123 days, respectively, way beyond
the median survival.


Figure 6.8: Deviance residual plots against (a) celltype and (b) karno for lung cancer study data

[Scatter plots of deviance residuals against (a) celltype and (b) karno, with points marked as Censored or Died; the residuals are roughly symmetric about zero.]

Table 6.10: Five observations with largest deviance residuals in absolute value in lung cancer trial data

patient trt celltype time status karno diagtime age prior deviance martingale
44 1 smallcell 392 1 40 4 68 0 -3.06 -6.71
85 2 squamous 1 1 50 7 35 0 2.79 0.99
77 2 squamous 1 1 20 21 65 1 2.49 0.98
21 1 smallcell 123 0 40 3 55 0 -2.28 -2.59
15 1 squamous 11 1 70 11 48 1 2.23 0.97

Residuals are also helpful for assessing the form of the relationship between an independent variable and the (log) hazard (e.g., linear, quadratic or other non-linear functions) and for checking whether additional variables should be added to a model. Using model (2) in Table 6.9, we plotted the deviance residuals against year in Fig. 6.9 (a). If the relationship between the variable and the outcome is correctly specified, we expect the plot of the deviance residuals against the variable to display a random pattern with no particular functional trend. Fig. 6.9 (a) shows a loess (locally weighted smoothing) curve superimposed on the plot. The loess curve is produced by taking average values of nearby points as the value of the variable changes. The loess curve shows a trend of somewhat higher values of the residuals up to year = 2. Based on this, we considered a cubic function of year; hence, instead of bpXp, we used bp1Xp + bp2Xp² + bp3Xp³. We refitted the model to give model
(3) in Table 6.9 and the deviance residuals against year are plotted in Fig. 6.9 (b), which shows most of the
trend has disappeared. Obviously, functions other than a cubic function can be considered. Since model (2)
is a sub-model of model (3), a likelihood ratio test can be employed for comparing them. The test gives a
p-value of 0.023, suggesting some improvements in using a cubic form.
A third type of residuals useful for diagnosis are the Schoenfeld residuals.d Schoenfeld residuals are
useful for detecting violations of the proportional hazards assumption. For each subject that failed in the study,
a Schoenfeld residual can be defined for each covariate by taking the difference between the covariate value of the failing subject and the weighted average of the covariate among those still at risk at the subject's failure time. The residuals are plotted against each failure time. Violation of the proportional hazards assumption is indicated by a consistently high or low trend over an interval of time, since this suggests that the hazard at that time is higher (or lower) for the subject than predicted by the model. Since the procedure requires plotting the residuals against time, it cannot be used for a model with time-varying coefficients.

d Schoenfeld D. (1980) Chi-square goodness of fit tests for the proportional hazards model. Biometrika 67:145–153.

Figure 6.9: Deviance residual plots against year for Stanford heart transplant data, (a) model (2) and (b) model (3)

[Scatter plots of deviance residuals against year, each with a loess curve superimposed.]
We illustrate this procedure using model (3) in the Stanford transplant data. From model (3), there are
six covariates. For each covariate, a set of Schoenfeld residuals are calculated, one for each subject that failed
and plotted against the failure time of that particular subject. The plots are shown in Fig. 6.10. None of the
plots show any clear evidence of sustained periods of residuals above or below zero.

A formal test by Grambsch and Therneau e can be carried out to supplement the plots. The test is analogous to plotting a function of time (t or log(t)) against the Schoenfeld residuals and comparing the slope of the regression line to zero. Applying the test shows no evidence of proportional hazards violations in any of the covariates (Table 6.11).

Table 6.11: Schoenfeld residuals for Stanford transplant data with time dependent covariates

Covariate p-value
transplant 0.64
year 0.54
year2 0.45
year3 0.42
age 0.55
surgery 0.88
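In software, a version of this test is available directly; for example, the Python lifelines package exposes it as proportional_hazard_test. The sketch below is an illustration on a small invented data frame (not the transplant data), and the exact output format may differ across package versions.

# Grambsch-Therneau style proportional hazards check in lifelines
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import proportional_hazard_test

df = pd.DataFrame({
    "time":   [72, 411, 228, 126, 118, 10, 82, 110, 314, 100],
    "status": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "karno":  [60, 70, 60, 60, 70, 20, 40, 80, 50, 70],
    "age":    [69, 64, 38, 63, 65, 49, 69, 68, 43, 70],
})
cph = CoxPHFitter().fit(df, duration_col="time", event_col="status")

# Regress the (scaled) Schoenfeld residuals on log(t) and test for a zero slope
results = proportional_hazard_test(cph, df, time_transform="log")
results.print_summary()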

e Grambsch PM, Therneau TM (1994) Proportional hazards tests and diagnostics based on weighted residuals. Biometrika 81(3):515–526.
Figure 6.10: Schoenfeld residual plots against time for Stanford heart transplant data using model (3)

[Six panels of Schoenfeld residuals plotted against time, one for each covariate in model (3): transplant, year, f2(year), f3(year), age and surgery; the vertical axes show Beta(t) for each covariate.]

7 Classification and Clustering

Classification methods in healthcare studies are important when researchers are interested in grouping subjects
into different classes according to specific characteristics. For example, a patient can be classified as having
“benign" or “malignant" tumour based on the disease pattern. This is a typical case of binary classification
where only two possible classes are considered. It is also possible to classify based on multiple classes, such as "benign growth", "stage 1-2 tumours" or "stage 3-4 tumours". We illustrate using data from a breast cancer
study based on n = 569 women.a Of these women, 357 had benign growth while the remaining 212 had
a malignant tumour. Along with diagnoses (B= benign, M = malignant) are information from 30 variables.
Our goal is to use the data to form a model that can be used for classifying patients in the future.
For each patient, 10 characteristics are computed for each cell nucleus: (a) radius (mean of distances
from center to points on the perimeter), (b) texture (standard deviation of gray-scale values), (c) perimeter, (d) area, (e) smoothness (local variation in radius lengths), (f) compactness (perimeter² / area − 1), (g)
concavity (severity of concave portions of the contour), (h) concave points (number of concave portions of
the contour), (i) symmetry, and (j) fractal dimension. The mean, standard error, and worst or largest (mean
of the three largest values) of these characteristics were computed for each image, resulting in 30 variables.
We illustrate the basic concept of classification using the data. Suppose we wish to use the mean radius
(radius_mean) of the cells to classify cases into benign and malignant. Fig. 7.1 (a) shows histograms of
radius_mean, stratified by diagnosis. Based on the histograms, we may suggest using a decision boundary formed by the vertical line at radius_mean = 14, which defines a decision rule: a case is classified as benign if radius_mean is no more than 14, and malignant otherwise. Using this decision rule, we observe that some of the benign cases will be wrongly classified as malignant (false positives), and some malignant cases will be misclassified as benign (false negatives). If we change the decision rule to use a threshold higher than 14, then more cases will be classified as benign. As a result, there will be fewer cases classified as positive and the number of false positives will drop. However, this will lead to a rise in the number of cases mistakenly classified as benign (negatives), i.e., a rise in false negatives. In fact, since the histograms overlap, no decision rule can eliminate false positives and false negatives completely. Whichever
decision rule is used, there is a trade-off between false positives and false negatives.
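The trade-off can be made concrete with a short simulation; the class means and spreads below are invented numbers, used only to mimic the overlap between the two histograms.

# False positives vs. false negatives for a single-variable threshold rule
import numpy as np

rng = np.random.default_rng(0)
benign    = rng.normal(12.0, 1.8, size=300)    # hypothetical radius_mean values, class B
malignant = rng.normal(17.5, 3.2, size=200)    # hypothetical radius_mean values, class M

for threshold in (13, 14, 15, 16):
    false_pos = np.mean(benign > threshold)       # benign wrongly called malignant
    false_neg = np.mean(malignant <= threshold)   # malignant wrongly called benign
    print(threshold, round(false_pos, 3), round(false_neg, 3))
# Raising the threshold lowers the false positive rate but raises the false negative rate.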
We can generalise the idea of classification using a single variable to using more than one variable. Fig. 7.1
(b) shows a scatterplot of two variables: radius_mean and texture_mean, stratified by diagnosis. We can also
draw a decision boundary, a line that defines a decision rule such that a case is classified as benign if it falls below and to the left of the boundary, and malignant otherwise. We observe that this decision rule also leads to some false negatives and false positives. When we have two or more variables, the decision boundary is not restricted to a straight line; we could also define a curve as the decision boundary.
a W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, 1905, 861-870, San Jose, CA.

Figure 7.1: Breast cancer data in n = 569 women stratified by diagnosis (B = benign, M = malignant): (a) histogram of radius_mean with a decision boundary at radius_mean = 14, and (b) scatterplot of radius_mean vs. texture_mean with two different decision boundaries.

[Panel (a): density against radius_mean; panel (b): texture_mean against radius_mean, with B and M cases marked.]

7.1 Bayes rule

In classification problems, the Bayes rule plays an important role. Suppose we have two distinct classes that
we call G = 1 and G = 2. For example, G = 1 represents benign growth and G = 2 malignant tumours.
Consider a number of p relevant characteristics for each observation in the data. These characteristics or
measurements, for example, may be on some physical characteristics such as height or weight, or on some
clinical or biological features, such as cell radius_mean, texture_mean, etc.. We use X to denote the (vector) of
measurement(s) made on a given observation under study. The prior probability for class k is defined as the
probability that we assign to class k before observing X. The posterior probability for class k is defined as the
probability that we calculate for class k after observing X. If we denote the prior and posterior probabilities
for class k as P(G = k) ≡ qk and P(G = k|X), respectively, then Bayes Theorem gives a simple relationship
between P(G = k) and P(G = k|X):

P(G = k|X) = P(G = k, X) / P(X) = P(X|G = k)P(G = k) / Σj P(X|G = j)P(G = j) = fk(X)qk / Σj fj(X)qj,     (7.1)

where f k stands for the PDF (probability distribution or density function) for class k. The posterior probability
can be interpreted as a conditional probability of class k given the data X.
The Bayes rule says that if an observation X is given, under 0-1 loss,b the optimal decision is to classify
the observation into the class with maximum posterior probability. This rule is sometimes called maximum a posteriori (MAP). Notice that when we compare two classes, say k′ and k″, we compare

fk′(X)qk′ / Σj fj(X)qj     vs.     fk″(X)qk″ / Σj fj(X)qj,     (7.2)

where the denominators are identical. Therefore, to implement the MAP rule, we can simply use the numerators in (7.2) and compare, across the different classes,

maxk fk(X)qk.     (7.3)

b In a 0-1 loss, if an observation belonging to class k is classified correctly as class k, then the loss is 0; otherwise, the loss is 1.

When we have a set of data, such as the breast cancer data, we can develop a classification rule as follows.
The prior probabilities can be estimated by simply counting the percentage of data coming from each class.
The MAP rule can be implemented if we know the PDFs fk for each class k. This, in turn, requires us to make some assumptions about the distribution of X in each class k.
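A minimal sketch of this recipe for a single measurement is given below, assuming a normal density within each class; the priors, means and standard deviations are invented numbers and would in practice be estimated from the training data.

# MAP classification (7.3) with normal class-conditional densities
from scipy.stats import norm

priors = {"B": 0.63, "M": 0.37}                  # q_k, e.g. the class proportions
params = {"B": (12.0, 1.8), "M": (17.5, 3.2)}    # (mean, sd) of the measurement in each class

def map_classify(x):
    # choose the class k that maximises f_k(x) * q_k
    scores = {k: norm.pdf(x, loc=m, scale=s) * priors[k]
              for k, (m, s) in params.items()}
    return max(scores, key=scores.get)

print(map_classify(13.0))   # -> "B": the benign density dominates here
print(map_classify(18.0))   # -> "M": the malignant density dominates here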

7.2 Multivariate normal distribution and Mahalanobis distance

An important distribution for continuous data is the normal distribution. The PDF for a normal distribution
for a variable X with mean µ and standard deviation σ is given by

f(x) = (1/(σ√(2π))) exp(−0.5(x − µ)²/σ²) = (1/(σ√(2π))) exp(−0.5z²),

where z = (x − µ)/σ is the z-score, a standardised version of x. The appearance of z² in the PDF of a normal
distribution is appealing. It means that (x −µ)2 , the squared Euclidean distance,c a natural distance measure,
scaled by σ2 , is the distance measure used by normally distributed data.
In Fig. 7.1(a), we illustrated using a single variable X = radius_mean to classify cases into two classes. Assume for the moment that the prior probabilities of the two classes are equal (q₁ = q₂ = 0.5), let the means of the two classes be µ₁ and µ₂, and suppose the standard deviations are the same. Then, according to MAP, an observation X will be classified to G = 1 if
\[
0.5 f_1(X) > 0.5 f_2(X) \;\Leftrightarrow\; \exp\left[-(X-\mu_1)^2/2\sigma^2\right] > \exp\left[-(X-\mu_2)^2/2\sigma^2\right] \;\Leftrightarrow\; (X-\mu_1)^2 < (X-\mu_2)^2,
\]
and to G = 2 otherwise. To put it another way, X will be assigned to the class with a mean closest to X. This idea is behind the decision boundary in Fig. 7.1(a). Suppose a case has a radius_mean < 14; then it is closer to the mean of the benign growths and we classify this case as benign. In contrast, a case with a radius_mean > 14 would be classified as malignant, since it would then be closer to the mean of the malignant tumours.
This idea can be generalised to situations where we have p > 1 characteristics (variables). We can define
a multivariate normal distribution. For a vector of variables, X, with mean µ and a variance-covariance
matrix Σ, the PDF can be written as

\[
f(x) = \frac{1}{|\Sigma|^{1/2}\sqrt{(2\pi)^p}}\exp\left[-0.5\,(x-\mu)^T\Sigma^{-1}(x-\mu)\right] = \frac{1}{|\Sigma|^{1/2}\sqrt{(2\pi)^p}}\exp\left[-0.5\,\Delta^2\right],
\]
where A^T means the transpose of a vector A and ∆ = √((x − µ)^T Σ⁻¹ (x − µ)) is called the Mahalanobis distance. If p = 1, the Mahalanobis distance is simply the absolute value of the z-score.


The variance-covariance matrix Σ contains information about the variances of all the variables
represented by X, as well as the correlations among the variables. For example, if X = (X 1 , X 2 , X 3 ), then
Σ contains the variances σ_j², as well as the correlations ρ_{jj′}, j = 1, 2, 3, j′ ≠ j.

The introduction of Σ⁻¹ in the Mahalanobis distance gives it an important advantage over the Euclidean distance. We continue using X = (X₁, X₂, X₃) as an example. To classify an observation x = (x₁, x₂, x₃) using Euclidean distance, we find √((x₁ − µ₁)² + (x₂ − µ₂)² + (x₃ − µ₃)²), which assumes equal weights on (x_j − µ_j)², j = 1, 2, 3. Suppose X₁, X₂ are highly correlated and X₃ is almost uncorrelated with either X₁ or X₂; then the Mahalanobis distance downweights the “double counting” of the distances (x₁ − µ₁)² and (x₂ − µ₂)² by the inverse of their correlation; the result is a more accurate measure of the distance of x to µ.
^c The Euclidean distance between two points a and b is the straight-line distance between the points.


If the variables in X were uncorrelated in each group and were scaled so that they had unit variances,
then Σ would be the identity matrix and the Mahalanobis distance would correspond to using the (squared)
Euclidean distance. It can be seen that the presence of the inverse of Σ is to allow for the different scales
on which the variables are measured and for non-zero correlations between the variables. In this sense, the
Mahalanobis distance is achieving the best of both worlds; it is based on a natural distance measure, the
Euclidean distance, and it allows adjustments based on the actual data situation.
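A small numerical sketch of the two distances, using a made-up mean vector and variance-covariance matrix (the numbers are purely illustrative):

```python
import numpy as np

# Illustrative (made-up) mean and variance-covariance matrix for p = 2 variables
mu = np.array([12.0, 18.0])
Sigma = np.array([[3.0, 0.5],
                  [0.5, 15.0]])
x = np.array([15.0, 20.0])

d = x - mu
euclid = np.sqrt(d @ d)                        # ordinary Euclidean distance
mahal = np.sqrt(d @ np.linalg.inv(Sigma) @ d)  # Mahalanobis distance, scaled by Sigma^-1
print(f"Euclidean: {euclid:.2f}, Mahalanobis: {mahal:.2f}")

# With Sigma equal to the identity matrix the two distances coincide
print(np.isclose(euclid, np.sqrt(d @ np.linalg.inv(np.eye(2)) @ d)))   # True
```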
In Fig. 7.2, we plotted a bivariate normal (a multivariate normal with p = 2) PDF with mean µ = (0, 0)
and common unit variance 1 for both dimensions and a correlation of 0.5. Fig. 7.2 (a) shows the perspective
plot and (b) the contour plot of the PDF. The “centroid" of the contour plot corresponds to the peak of the
PDF; each contour line traces out all points (X 1 , X 2 ) with the same value on the PDF. The value of the PDF
is lower for contours further away from the centroid; the outermost contour in (b) corresponds to a value of
0.02. In any multivariate normal distribution, the mean µ determines the location of the centroid, whereas
the variance-covariance Σ determines the shape and orientation.
    
Figure 7.2: PDF plots for a N(µ = (0, 0)^T, Σ = [1, 0.5; 0.5, 1]) distribution (a) Perspective and (b) Contour


We used the breast cancer data and fitted separate bivariate normal distributions based on radius_mean
and texture_mean, for diagnosis = benign (B) or malignant (M). The contour plots of these normal
distributions, superimposed on the original data, are shown in Fig. 7.3. The fitted distributions are
\[
N\!\left(\mu = \begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix}, \Sigma = \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}\right) = N\!\left(\mu = \begin{pmatrix}12.2\\ 17.9\end{pmatrix}, \Sigma = \begin{pmatrix}3.2 & -0.26\\ -0.26 & 15.9\end{pmatrix}\right) \text{ for diagnosis = B, and}
\]
\[
N\!\left(\mu = \begin{pmatrix}17.5\\ 21.6\end{pmatrix}, \Sigma = \begin{pmatrix}10.2 & 1.28\\ 1.28 & 14.2\end{pmatrix}\right) \text{ for diagnosis = M.}
\]
For both fitted distributions, Σ̂ shows that the correlation between X₁ and X₂ is very small in either group (the correlations are −0.26/√(3.2 × 15.9) = −0.036 and 1.28/√(10.2 × 14.2) = 0.11, respectively). Hence we see the contour ellipses are aligned with the axes. If the correlation had been stronger, the orientation would have been at an angle, similar to Fig. 7.2(b). We added a hypothetical patient with (X₁, X₂) = (17, 13) to the figure, shown in blue. The Euclidean distances of this point to the means of the two fitted normal distributions are 6.9 and 8.6, respectively, for diagnosis = B and M, which indicates the new case is closer to benign than malignant. In contrast, the Mahalanobis distances are 8.7 and 5.2, respectively, suggesting a malignant diagnosis. Notice that the fitted variance of X₁ is only 3.2 for benign growths, and hence the new case with X₁ = 17 is very far from the benign mean of 12.2. The Mahalanobis distance takes this into account, and hence it gives a different diagnosis for this new case than that using Euclidean distance.
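A sketch of this comparison for the hypothetical patient, using the rounded estimates printed above (so the values may differ slightly in the last decimal from those quoted, which were obtained from the unrounded fits); the squared Mahalanobis distance ∆² is reported:

```python
import numpy as np

# Fitted means and covariance matrices as quoted above (rounded)
mu_B, Sigma_B = np.array([12.2, 17.9]), np.array([[3.2, -0.26], [-0.26, 15.9]])
mu_M, Sigma_M = np.array([17.5, 21.6]), np.array([[10.2, 1.28], [1.28, 14.2]])

x_new = np.array([17.0, 13.0])   # the hypothetical patient

for label, mu, Sigma in [("B", mu_B, Sigma_B), ("M", mu_M, Sigma_M)]:
    d = x_new - mu
    euclid = np.sqrt(d @ d)
    mahal_sq = d @ np.linalg.inv(Sigma) @ d        # squared Mahalanobis distance
    print(f"{label}: Euclidean {euclid:.1f}, squared Mahalanobis {mahal_sq:.1f}")
```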


Figure 7.3: Contour plots of fitted bivariate normal distributions of (X 1 , X 2 ) = (radius_mean, texture_mean)
based on the breast cancer data


7.3 Linear discriminant analysis

One of the most important classification methods is linear discriminant analysis (LDA). Under LDA we assume that the density of X in each class k follows a multivariate normal distribution. In addition, we assume the variance-covariance matrix Σ is identical for all classes. Since Σ determines the shape of the normal PDF, under LDA the PDFs for different classes have the same shape but are shifted versions of each other (different means µ). Example PDFs satisfying the LDA assumption are shown in Fig. 7.4.
Suppose we have two classes G1 and G2; then, using the Bayes rule, the MAP criterion classifies an observation with X = x to G1 if f₁(x)q₁ > f₂(x)q₂, and to G2 otherwise. It follows that we classify to G1 if

\[
\begin{aligned}
\frac{1}{|\Sigma|^{1/2}\sqrt{(2\pi)^p}}\exp\left[-0.5\,(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\right] q_1 &> \frac{1}{|\Sigma|^{1/2}\sqrt{(2\pi)^p}}\exp\left[-0.5\,(x-\mu_2)^T\Sigma^{-1}(x-\mu_2)\right] q_2 \\
-0.5\,(x-\mu_1)^T\Sigma^{-1}(x-\mu_1) + \log(q_1) &> -0.5\,(x-\mu_2)^T\Sigma^{-1}(x-\mu_2) + \log(q_2) \\
x^T\Sigma^{-1}\mu_1 - 0.5\,\mu_1^T\Sigma^{-1}\mu_1 - 0.5\,x^T\Sigma^{-1}x + \log(q_1) &> x^T\Sigma^{-1}\mu_2 - 0.5\,\mu_2^T\Sigma^{-1}\mu_2 - 0.5\,x^T\Sigma^{-1}x + \log(q_2) \\
x^T\Sigma^{-1}\mu_1 - 0.5\,\mu_1^T\Sigma^{-1}\mu_1 + \log(q_1) &> x^T\Sigma^{-1}\mu_2 - 0.5\,\mu_2^T\Sigma^{-1}\mu_2 + \log(q_2). \tag{7.4}
\end{aligned}
\]

If we replace the inequality in (7.4) with an equality, it defines a decision boundary for classifying an observation with value x. Since the decision boundary is a linear function of x (the quadratic term x^TΣ⁻¹x on both sides cancels out), we get the name linear discriminant analysis. The same idea generalises to cases where there are more than 2 classes. In such cases, we find the class k that maximises f_k(x)q_k, which is equivalent to finding the class k that maximises the linear discriminant function x^TΣ⁻¹µ_k − 0.5µ_k^TΣ⁻¹µ_k + log(q_k). Notice that maximising the linear discriminant function is equivalent to maximising −0.5(x − µ_k)^TΣ⁻¹(x − µ_k) + log(q_k). If all classes
have the same prior probabilities, i.e., qk = q, this is the same as minimising the squared Mahalanobis distance
(x − µk ) T Σ−1 (x − µk ). If classes are not equally probable, then there is an additional adjustment due to qk .
Hence, LDA classifies an observation to class k if the observation is closest to class k, in the Mahalanobis
distance sense.
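A minimal sketch of LDA on the two variables used here, via scikit-learn's LinearDiscriminantAnalysis, which estimates the class priors and the pooled variance-covariance matrix from the data as described above; the loader and column indices are assumptions about how the data are stored:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

data = load_breast_cancer()
X = data.data[:, :2]          # 'mean radius' and 'mean texture'
y = data.target               # 0 = malignant, 1 = benign

lda = LinearDiscriminantAnalysis().fit(X, y)   # priors default to the class proportions

# Coefficients of the linear decision boundary, and a prediction for a new case
print(lda.coef_, lda.intercept_)
print(lda.predict([[17.0, 13.0]]), lda.predict_proba([[17.0, 13.0]]))
```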


Figure 7.4: Contour plots of fitted bivariate normal distributions of (X 1 , X 2 ) with identical Σs


7.4 Quadratic discriminant analysis

A key assumption of LDA is that the variance-covariance matrices for all classes are identical and equal to Σ. This assumption can be relaxed to allow each class k to have its own variance-covariance matrix Σ_k. Under MAP, assuming that there are two classes that we wish to classify between, we classify an observation x to G1 if

\[
\begin{aligned}
\frac{1}{|\Sigma_1|^{1/2}\sqrt{(2\pi)^p}}\exp\left[-0.5\,(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\right] q_1 &> \frac{1}{|\Sigma_2|^{1/2}\sqrt{(2\pi)^p}}\exp\left[-0.5\,(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right] q_2 \\
-0.5\left[\log|\Sigma_1| + (x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\right] + \log(q_1) &> -0.5\left[\log|\Sigma_2| + (x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right] + \log(q_2). \tag{7.5}
\end{aligned}
\]

Unlike (7.4), (7.5) does not simplify further due to the fact that Σ₁ ≠ Σ₂. The discriminant function for class k can be rewritten as −0.5 log|Σ_k| + x^TΣ_k⁻¹µ_k − 0.5µ_k^TΣ_k⁻¹µ_k − 0.5x^TΣ_k⁻¹x + log(q_k). Since this function is
quadratic in x , it is called a quadratic discriminant function. We can define a decision boundary by setting
the inequality to equality in (7.5). Using this decision boundary gives us a quadratic discriminant analysis
(QDA). QDA also applies to cases where there are more than 2 classes; we choose the class k that maximises
the quadratic discriminant function among all classes.
Both LDA and QDA can be easily implemented once we know q_k and f_k. We illustrate the procedure using the breast cancer data. Assume we are interested in classifying the cases into two classes, G1 = benign (B) and G2 = malignant (M), using only two characteristics, X = (radius_mean, texture_mean). We use the data
and the method of maximum likelihood to estimate qk and f k . Among the n = 569 cases, there are n1 = 357
benign growths and n2 = 212 malignant tumours, hence we estimate q1 , q2 as

q̂1 = 357/569 and q̂2 = 212/569,

respectively, where the ˆ symbol denotes an estimate. For each class we need to fit the data to a multivariate
normal distribution. We estimate the means by the sample means of (radius_mean,texture_mean), in each of
the two classes to give
µ̂1 = (12.2, 17.9) T and µ̂2 = (17.5, 21.6) T ,
for the benign cases and the malignant tumours, respectively. Using LDA, we assume a common variance-


covariance Σ for both classes, which can be estimated by


\[
\hat{\Sigma} = \frac{n_1}{n}\sum_{i\in\text{class }1}\frac{(x_i-\hat{\mu}_1)(x_i-\hat{\mu}_1)^T}{n_1} + \frac{n_2}{n}\sum_{i\in\text{class }2}\frac{(x_i-\hat{\mu}_2)(x_i-\hat{\mu}_2)^T}{n_2} = \begin{pmatrix}5.8 & 0.31\\ 0.31 & 15.3\end{pmatrix}. \tag{7.6}
\]
(7.6) is simply a weighted average of the within-group variance-covariance matrices, where the weights are the sample proportions in the data. The idea of a weighted average is to give more weight to the class with more observations. Combining all the estimates, for LDA, we have the decision boundary −0.91 radius_mean − 0.22 texture_mean + 18.34 = 0. An observation with (radius_mean, texture_mean) such that −0.91 radius_mean − 0.22 texture_mean + 18.34 > 0 will be classified as benign, otherwise malignant.
For QDA, the individual Σ_k is simply estimated by \(\sum_{i\in\text{class }k}(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^T/n_k\),^d giving
\[
\hat{\Sigma}_1 = \begin{pmatrix}3.2 & -0.26\\ -0.26 & 15.9\end{pmatrix} \quad\text{and}\quad \hat{\Sigma}_2 = \begin{pmatrix}10.2 & 1.28\\ 1.28 & 14.2\end{pmatrix}
\]
for benign and malignant cases, respectively. From these, we obtain the decision boundary of the QDA as −0.11 radius_mean² − 0.0071 (radius_mean)(texture_mean) + 0.0041 texture_mean² + 2.41 radius_mean − 0.19 texture_mean − 5.23 = 0. Any observation with (radius_mean, texture_mean) such that −0.11 radius_mean² − 0.0071 (radius_mean)(texture_mean) + 0.0041 texture_mean² + 2.41 radius_mean − 0.19 texture_mean − 5.23 > 0 will be classified as benign, otherwise malignant. These decision boundaries are plotted in Fig. 7.5. For this example, since the coefficients for the quadratic terms are very small when compared to the linear terms, the QDA boundary looks almost linear and very similar to that of the LDA. This is not the case in general.
Comparing the two discriminant analyses, QDA allows each class to have a different variance-covariance matrix Σ_k and is therefore more flexible; it tends to fit the data better than LDA. However, as the number of classes increases, the number of Σ_k that need to be estimated by QDA also increases. When the sample size is not large, estimating a large number of Σ_k may lead to highly unstable results. Hence, the use of LDA and QDA must be considered together with the number of observations and classes.
We can use the LDA and QDA discriminant functions to assess the classification accuracies of these
analyses. Table 7.1 shows confusion tables, which are basically contingency tables, for using the LDA and
QDA functions for classification. For example, Table 7.1a shows that out of the 357 actual benign cases, 345 are correctly classified as benign and 12 are mistakenly classified as malignant. These give a true negative (TN) rate or specificity of 345/357 = 0.966 and a false positive (FP) rate of 12/357 = 0.034. Similarly, among the 212 actual malignant cases, 159 are correctly classified as malignant and the remaining 53 are incorrectly classified as benign, giving a true positive (TP) rate or sensitivity of 159/212 = 0.75 and a false negative (FN) rate of 53/212 = 0.25. Combining these figures gives an overall accuracy of (TP + TN)/n = (159 + 345)/569 = 0.885. Repeating this for the QDA, we obtain a specificity of 341/357 = 0.955, a sensitivity of 164/212 = 0.773 and an overall accuracy of (164 + 341)/569 = 0.888. In this example, since the decision boundaries of the two analyses are quite similar, it is not surprising that their specificity, sensitivity and overall accuracies are similar. The above rates are calculated based on the data that were used to develop the decision boundaries. In practice, what we wish to do is to use a boundary to classify data that have not been seen. We defer the discussion of such situations to a later chapter.
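A sketch of how such confusion tables and the derived rates might be computed with scikit-learn (LDA and QDA fitted on the same two variables); the counts need not match Table 7.1 digit for digit, since covariance estimation details differ between implementations:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.metrics import confusion_matrix

data = load_breast_cancer()
X = data.data[:, :2]                  # mean radius, mean texture
y = data.target                       # 0 = malignant (positive), 1 = benign (negative)

for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("QDA", QuadraticDiscriminantAnalysis())]:
    pred = model.fit(X, y).predict(X)
    # Rows = actual, columns = predicted; classes ordered (malignant, benign)
    cm = confusion_matrix(y, pred, labels=[0, 1])
    tp, fn = cm[0, 0], cm[0, 1]       # malignant cases
    fp, tn = cm[1, 0], cm[1, 1]       # benign cases
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / cm.sum()
    print(f"{name}: sensitivity {sensitivity:.3f}, "
          f"specificity {specificity:.3f}, accuracy {accuracy:.3f}")
```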

7.5 Connection to logistic regression

Consider the case where there are two classes, G =1 and G =2; then under the model of LDA, we can compute
the log-odds of the posterior probabilities, given X
\[
\log\frac{P(G = 1\mid X)}{P(G = 2\mid X)} = \log\frac{P(G = 1\mid X)}{1 - P(G = 1\mid X)} = \log\frac{q_1}{q_2} - 0.5\left[(X-\mu_1)^T\Sigma^{-1}(X-\mu_1) - (X-\mu_2)^T\Sigma^{-1}(X-\mu_2)\right] = a + bX. \tag{7.7}
\]
^d Sometimes n_k is replaced by n_k − K, where K is the total number of classes.


Figure 7.5: Decision boundaries on breast cancer data using radius_mean and texture_mean (a) LDA and (b)
QDA


Table 7.1: Confusion tables using (a) LDA and (b) QDA on breast cancer data

(a) LDA

              Predicted
    Actual    B      M
    B         345    12
    M         53     159

(b) QDA

              Predicted
    Actual    B      M
    B         341    16
    M         48     164

(7.7) shows that the LDA model expresses the log-odds of being in G = 1 as a linear function of x. Hence LDA is similar to the linear logistic model (4.1). The difference between linear logistic regression and LDA is that the linear logistic model only specifies the conditional distribution P(G = k|x); no assumption is made about the distribution of x, while the LDA model uses the distribution of x and q_k in the modelling. Since the LDA
requires x to follow a multivariate normal distribution in each class, all the variables represented by x must be
continuous. The logistic regression does not make any distributional assumptions about x and as such x can
be a mixture of continuous, categorical and binary variables. This makes the logistic regression very flexible
for many different data situations. When all components of x are continuous so that both LDA and logistic
regression apply, the latter is more robust to outliers since it uses fewer assumptions. On the other hand, if the
data are or can be suitably transformed to multivariate normal in each class, LDA tends to be more efficient
by using more information about the data.
A limitation of the logistic regression is that it assumes there are two classes whereas a linear discriminant
analysis allows any number of distinct classes. Another advantage of LDA is that observations without class
labels can be used under the model of LDA.

Table 7.2: Confusion table using logistic regression on breast cancer data

              Predicted
    Actual    B      M
    B         334    23
    M         39     173


In practice, in situations where both logistic regression and LDA apply, they often give similar results. For comparison, we applied a logistic regression on the breast cancer data. The fitted model is
\[
\log\frac{P(G = 1\mid x)}{1 - P(G = 1\mid x)} = -19.85 + 1.057\,\text{radius\_mean} + 0.218\,\text{texture\_mean}.
\]
We used this model and a cut-off of > 0.5 to define a case that belongs to G = 1 (benign). The classification results are given in Table 7.2, from which we obtain a specificity of 334/357 = 0.936, a sensitivity of 173/212 = 0.816 and an overall accuracy of (173 + 334)/569 = 0.891.
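A sketch of the corresponding logistic-regression classifier with a 0.5 cut-off, using scikit-learn (with the penalty switched off so that the fit is plain maximum likelihood); the coefficient signs and values depend on which class is coded as G = 1, so they need not match the equation above exactly:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

data = load_breast_cancer()
X = data.data[:, :2]                       # mean radius, mean texture
y = (data.target == 1).astype(int)         # 1 = benign, 0 = malignant

# penalty=None requests unregularised maximum likelihood (older versions: penalty='none')
model = LogisticRegression(penalty=None, max_iter=5000).fit(X, y)
print(model.intercept_, model.coef_)

# Classify as benign when the estimated P(benign) exceeds 0.5
p_benign = model.predict_proba(X)[:, 1]
pred = (p_benign > 0.5).astype(int)
print(confusion_matrix(y, pred, labels=[1, 0]))   # rows/cols ordered benign, malignant
```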

7.6 Cluster analysis

In classification problems, we have a set of observations where on each unit, we have information on both
the outcome Y and a set of independent variables X. In some situations, we only know X and we wish to find
subgroups of observations within the data set. For example, among normal growths, some may eventually turn
malignant, and among malignant tumours, some may be more aggressive such that even following treatment,
they are more likely to recur. The type of analyses that we use is called cluster analysis.
The basic idea of cluster analysis is to place observations in groups such that observations in the same
group are similar and observations in different groups are dissimilar. There are many clustering algorithms but
here, we will focus on K-means clustering, one of the simplest and most commonly used clustering methods for splitting a dataset into a set of K groups. K-means clustering defines clusters so that the total intra-cluster variation (also known as the total within-cluster variation) is minimised. There are several K-means algorithms available. The standard algorithm, called the Hartigan-Wong algorithm, uses squared Euclidean distances to define variations.^e
The Hartigan-Wong algorithm randomly assigns K initial centers (K specified by the user), either by
randomly choosing points in the space defined by all p variables, or by sampling K points of all available
observations to serve as initial centers. Since the algorithm uses Euclidean distance, it needs to scale all
p variables to have the same mean and variance. Then it assigns each observation to the nearest center.
Next, it calculates the new center for each cluster as the centroid mean of the clustering variables for each
cluster’s new set of observations. K-means re-iterates this process, assigning observations to the nearest center
(some observations will change cluster). This process repeats until a new iteration no longer re-assigns any
observations to a new cluster. At this point, the algorithm is considered to have converged, and the final
cluster assignments constitute the clustering solution. The algorithm can be summarized as follows:

1. Select K, the number of clusters


2. Scale the data
3. Initialize K cluster centers randomly. This can be done, for example, by using the values of K randomly
chosen observations from the data
4. Define the similarity of two observations by their Euclidean distance; the similarity between an observation and a cluster is determined by the distance between the observation and the cluster center. Allocate each observation to the cluster with the nearest center.
5. Recalculate the cluster centers with the current cluster memberships. Each cluster center is a vector of length p containing the means of all variables for the observations in the cluster.
6. Iterate steps 4-5 until convergence, i.e., until changes between steps fall below a predefined tolerance or the maximum number of iterations is reached (a code sketch of these steps follows below).
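A minimal sketch of these steps with scikit-learn's KMeans (which by default uses a variant of Lloyd's algorithm rather than Hartigan-Wong, so results can differ slightly from R's kmeans); the diagnosis labels are deliberately not used:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

data = load_breast_cancer()
X = data.data[:, :2]                      # mean radius, mean texture (diagnosis ignored)

# Step 2: scale the variables so Euclidean distance treats them equally
X_scaled = StandardScaler().fit_transform(X)

# Steps 3-6: initialise K centers, assign, recompute, iterate until convergence
kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X_scaled)

print(kmeans.cluster_centers_)            # cluster means on the standardised scale
print(kmeans.labels_[:10])                # cluster membership of the first 10 observations
print(kmeans.inertia_)                    # within-cluster sum of squares W
```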

K-means clustering requires all variables to be continuous. There are other methods that do not require all variables to be continuous, but they need different assumptions. K-means clustering also requires specification of the number of clusters, K, though K can be chosen empirically from the data (see Fig. 7.7 and the discussion below).
^e Hartigan, J.A. and Wong, M.A. (1979). A K-Means Clustering Algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28: 100-108.


The choice of clustering variables is also of particular importance. Generally, cluster analysis methods
require the assumption that the variables chosen to determine clusters are a comprehensive representation of
the underlying characteristics that distinguish the groups. Many techniques exist for validating results from cluster analysis; some of these techniques will be discussed later in this set of notes.
We illustrate the method using the breast cancer data. We continue to use X =
(radius_mean,texture_mean) and we discard the diagnosis information on the data. We considered
two different values of K = 2 and 3. The results of applying K-means clustering are given in Table 7.3
and Fig. 7.6. Table 7.3 shows the cluster means of the clusters following application of the algorithms. For
example, Table 7.3a shows that the mean of X for the first cluster is (17.32,22.99), and that for the second
cluster is (12.26,17.13). These means are superimposed on Fig. 7.6 (a).

Table 7.3: Cluster means using K-means algorithm on breast cancer data: (a) K = 2 and (b) K = 3

(a) K = 2

    Cluster    radius_mean    texture_mean
    1          17.32          22.99
    2          12.26          17.13

(b) K = 3

    Cluster    radius_mean    texture_mean
    1          19.48          21.79
    2          13.06          23.26
    3          12.35          16.14

Figure 7.6: K-means clustering using radius_mean and texture_mean (a) K = 2 and (b) K = 3


In clustering, since we do not have information on the outcome, an obvious question we wish to ask is
whether K = 2 or 3, or any other value of K, is the best choice for grouping the data. Recall that the basic idea behind K-means clustering is to define clusters such that the within-cluster variation, W (also called the intra-cluster variation or within-cluster sum of squares), is minimised. This is done by first finding the sum of squared Euclidean distances between all observations and their cluster mean, then summing over all clusters. This within-cluster sum of squares measures the similarity of observations within their cluster.^f
We wish to maximise similarity within cluster, and hence, make the dis-similarity or the within-cluster
^f Within-cluster variation can be defined as
\[
W = \sum_{k=1}^{K}\sum_{i\in k} \|X_i - \bar{X}^{(k)}\|_2^2,
\]
where \(\bar{X}^{(k)}\) is the k-th cluster center and \(\|\cdot\|_2^2\) is the squared Euclidean distance.


sum of squares as small as possible. In Fig. 7.7, we plotted the within-cluster sum of squares against K for K = 1 to 10. We observe that the within-cluster sum of squares is non-increasing as a function of K. This is to be expected since, by dividing the data into more groups, we can make the data within each group more similar. However, the number of clusters is similar to the number of independent variables in a regression: when K becomes too large, the results become difficult to interpret and also unstable. The key is to seek a balance between the value of K and the amount of similarity between observations in each cluster. Normally, we find the “optimal” K by locating a “bend” in the plot. In this dataset, the optimal number of clusters seems to be K = 3 or 4.
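A sketch of how a curve like Fig. 7.7 can be produced: fit K-means for K = 1, ..., 10 and record the within-cluster sum of squares (called inertia_ in scikit-learn) each time:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = StandardScaler().fit_transform(load_breast_cancer().data[:, :2])

ks = range(1, 11)
wss = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_ for k in ks]

plt.plot(ks, wss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Within-cluster sum of squares W")
plt.show()   # look for the 'bend' in the curve to choose K
```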

Figure 7.7: Plot of within-cluster variation vs. K in K-means clustering using radius_mean and texture_mean



The breast cancer data was later updated, so that among the patients with malignant tumours, n = 139
were treated and their follow-up information was obtained. Among the patients who were treated, n = 35
recurred. We plotted their data as +, along with those of the K = 3 clusters found earlier, in Fig. 7.8. It is striking that most of these patients fall within cluster 1. Furthermore, for the few that are outside cluster 1, their values on (radius_mean, texture_mean) are not far from cluster 1.
The within cluster variation measures how tightly grouped the clusters are. The between-cluster
variation, B (also called inter-cluster variation or between-cluster sum of squares)g on the other hand,
measures distance between clusters. Ideally we wish the clusters to be as far apart as possible. Unfortunately,
B always increases as K increases, and so by itself, it cannot be used as a measure to choose the number of
clusters. However, W and B can be combined into a single quantity called the CH Index.h The CH index can
be plotted against K. The optimal K is located at the value where the plot reaches a maximum. We illustrate using the breast cancer data in Fig. 7.9. Based on the CH index, the optimal number of clusters is 2. The discrepancy in conclusion between the previous method and the CH index is not uncommon in real
^g The between-cluster variation is defined as
\[
B = \sum_{k=1}^{K} n_k \|\bar{X}^{(k)} - \bar{X}\|_2^2,
\]
where X̄ is the mean of all observations and n_k is the number of observations in the k-th cluster.
^h The CH index is proposed in Calinski, T. and Harabasz, J. (1974). Communications in Statistics, 1:3, 1-27, and is given by
\[
\text{CH} = \frac{B/(K-1)}{W/(n-K)},
\]
where n is the total number of observations in the data.


Figure 7.8: n = 35 patients with recurrent breast cancer superimposed on K = 3 clusters


data analyses. In any situation, different methods offer different perspectives on the same set of data, and their conclusions are not always expected to be identical. Together they allow the user to have a
more complete picture of the data, which ultimately allows more informed inferences to be drawn.
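As a sketch, scikit-learn exposes the CH index directly as calinski_harabasz_score; the score requires K ≥ 2, since B is undefined for a single cluster:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = StandardScaler().fit_transform(load_breast_cancer().data[:, :2])

for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print(k, round(calinski_harabasz_score(X, labels), 1))

# The K with the largest CH value is taken as the suggested number of clusters
```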

Figure 7.9: Plot of CH Index vs. K



8 Dimension Reduction and Regularisation

In Chapter 7, we studied classification and clustering problems in the breast cancer data using two variables, x = (radius_mean, texture_mean). In the data, there are altogether 30 variables, and the logical question is whether adding all or some of the remaining variables would improve the results. We repeated LDA and QDA using all 30 variables in x. Unlike in Chapter 7, where we used two similarly scaled variables, the entire dataset of 30 variables contains some variables that differ in scale by orders of magnitude. For example, the range
for area_worst is (185.2,4254) whereas for smoothness_se, it is (0.001713,0.03113). Hence, the variables
are first scaled to have zero mean and unit variances. The LDA and QDA decision boundaries are now linear
and quadratic functions of all 30 (scaled) variables. The results of using these new decision boundaries are
summarised in Table 8.1, from which we can calculate, for the LDA, a specificity of 355/357 = 0.995, a sensitivity of 194/212 = 0.915 and an overall accuracy of (194 + 355)/569 = 0.965. For QDA, we obtain a specificity of 352/357 = 0.986, a sensitivity of 202/212 = 0.952 and an overall accuracy of (202 + 352)/569 = 0.974. These results are superior to those of
Table 7.1, where discrimination was carried out using only radius_mean and texture_mean. The results bode
well for using a higher number of variables for discrimination in this set of data.

Table 8.1: Confusion tables using (a) LDA and (b) QDA on breast cancer data using all 30 variables for
discrimination

(a) LDA

              Predicted
    Actual    B      M
    B         355    2
    M         18     194

(b) QDA

              Predicted
    Actual    B      M
    B         352    5
    M         10     202

In Fig. 8.1, we show scatterplots of the data, stratified by diagnosis (B = benign, M = malignant), based
on a selection of the variables used for discrimination. Fig. 8.1 (a) shows, as shown in Chapter 7, that there
is quite clear separation of diagnosis based on radius_mean and texture_mean. However, the same cannot
be said for fractal_dimension_mean and compactness_se; Fig. 8.1 (b) shows that the B and M cases are not
well separated based on these two variables. Fig. 8.1 (c) shows that texture_worst vs. perimeter_mean also
have very good discriminating power. However, comparing Fig. 8.1 (c) and Fig. 8.1 (a), we observe that the
two figures are very similar. These plots raise some questions: (1) are some variables less important and therefore able to be discounted for classification/clustering purposes? (2) are some variables so similar to each other that when one is included, the others add little value to the discriminating power? (3) as a corollary of (1) and (2), can a subset of the variables capture most of the discriminating power of the full set of variables, and if so, how do we find that subset?


Figure 8.1: Breast cancer data in n = 569 women stratified by diagnosis (B = benign, M = malignant) (a)
scatterplot of radius_mean vs. texture_mean; (b) scatterplot of fractal_dimension_mean vs. compactness_se;
(c) scatterplot of texture_worst vs. perimeter_mean


8.1 Principal component analysis

In both classification and clustering problems, we assume for each observation, X represents p characteristics
(variables) that will be used for the procedure. In both LDA and K-means clustering, for example, we assign
an observation to class k if
 
\[
X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix} \tag{8.1}
\]

is closest to the centroid of class k, where closeness is measured by a scaled Euclidean (Mahalanobis) distance
in LDA and K-means clustering. The distance is calculated in p dimensions. The goal of dimension reduction
is to use distances in a dimension lower than p, while retaining the performance of the procedure. Principal
component analysis (PCA) does this by transforming the data into fewer dimensions, each acting as a
summary of the p variables.
If each component of X represents a particular characteristic of each observation, an intuitively simple
way to summarise the p variables is to use a linear combination:

\[
\phi_{11} X_1 + \phi_{12} X_2 + \cdots + \phi_{1(p-1)} X_{p-1} + \phi_{1p} X_p. \tag{8.2}
\]

The values (φ11 , ..., φ1p ) are called loadings in the linear combination. A linear combination like (8.2) is
called a principal component. Ideally, we wish this principal component to capture as much information
as possible about the p variables. At the same time, we cannot expect it to capture all the information in
p variables. Following this idea, we can find another linear combination, with a different set of loadings
(φ21 , ..., φ2p ) to form another principal component. We wish this second principal component to capture
information on the p variables that has not been captured by the first principal component, i.e., we wish
the second principal component to be uncorrelated with the first one. We carry on this process to find more
principal components, until we have captured most of the information about the p variables. In fact, in any
problem, if there are p original variables, we can find p principal componentsa that contain exactly the same
information contained in the p variables.
The first principal component is a linear combination of the p variables that captures the maximum
possible information about the variables in the data. Drawing from our discussion in linear regression models,
^a We assume the number of observations n is larger than p; in general, the number of principal components is min(n − 1, p).


and the form of (8.2), the first principal component is simply a “line” that is closest to the data, i.e., it minimises the sum of the squared Euclidean distances between the line and all the data points. Recall that in regression models, we use percent variation explained as a measure of how much the data is explained by a regression line. Using the same concept, the first principal component is the linear combination that explains the most
variation about the data, among all linear combinations. The second principal component is also a linear
combination of the p variables which captures part of the variations in the data set “missed" by the first
principal component. In other words, the correlation between first and second component should be zero.
The goal of a PCA is to explain most of the variability in the data with a smaller number of variables than
the original data set. For a large data set with p variables, we could examine pairwise plots of each variable
against every other variable, like Fig. 8.1. However, even for a moderate value of p, such as the case in the breast cancer dataset, where p = 30, there will be 30(30 − 1)/2 = 435 plots, which is clearly infeasible. Using
a PCA, we replace the original variables with principal components, so each principal component acts as a
new variable. Our hope is we only need a few of these new variables to capture most of the information in
the data. For instance, if we found that the first two principal components are sufficient to capture most of
the information, we can use these two new variables to represent the data. We can plot and visualise the
data based on these new variables in a two-dimensional figure, like those in Chapter 7. We can also carry out
classification and clustering based on these new variables.
We illustrate PCA using the breast cancer data. Since the dataset contains p = 30 different variables, in all
we can find 30 principal components that capture all the information in the original variables. For illustration,
we show the loadings for the first three principal components in Table 8.2.

Table 8.2: Loadings for first three principal components in breast cancer data

PC1 PC2 PC3


radius_mean -0.22 0.23 -0.01
texture_mean -0.10 0.06 0.06
perimeter_mean -0.23 0.22 -0.01
area_mean -0.22 0.23 0.03
smoothness_mean -0.14 -0.19 -0.10
compactness_mean -0.24 -0.15 -0.07
concavity_mean -0.26 -0.06 0.00
concave.points_mean -0.26 0.03 -0.03
symmetry_mean -0.14 -0.19 -0.04
fractal_dimension_mean -0.06 -0.37 -0.02
radius_se -0.21 0.11 0.27
texture_se -0.02 -0.09 0.37
perimeter_se -0.21 0.09 0.27
area_se -0.20 0.15 0.22
smoothness_se -0.01 -0.20 0.31
compactness_se -0.17 -0.23 0.15
concavity_se -0.15 -0.20 0.18
concave.points_se -0.18 -0.13 0.22
symmetry_se -0.04 -0.18 0.29
fractal_dimension_se -0.10 -0.28 0.21
radius_worst -0.23 0.22 -0.05
texture_worst -0.10 0.05 -0.04
perimeter_worst -0.24 0.20 -0.05
area_worst -0.22 0.22 -0.01
smoothness_worst -0.13 -0.17 -0.26
compactness_worst -0.21 -0.14 -0.24
concavity_worst -0.23 -0.10 -0.17
concave.points_worst -0.25 0.01 -0.17


symmetry_worst -0.12 -0.14 -0.27


fractal_dimension_worst -0.13 -0.28 -0.23

The magnitude of the loading for a particular variable shows how much it contributes to the principal
component. For example, for the first principal component (PC1), the contributions of compactness_mean
and concavity_mean are high, whereas smoothness_se and texture_se are both low. This means PC1 tells us
more about compactness_mean and concavity_mean rather than smoothness_se or texture_se. In contrast,
the loadings for smoothness_se and texture_se are highest among all loadings for PC3, which means PC3 tells
us more about these two variables. These three principal components explain, respectively, about 44%, 19%
and 9% of the variation in the data. As discussed earlier, PC1 always explains the highest possible variation
about the data, followed by PC2, and so on. We now imagine the principal components as new variables. For each
observation, we can find the values of these new variables by applying expressions like (8.2). In Fig. 8.2, we
plotted the data using PC1 and PC2. The plot shows a clear separation of the benign and malignant cases
based on PC1 and PC2. A comparison of Fig. 8.2 to Fig. 8.1 shows that using PC1 and PC2 is superior to using
any of the original variables.

Figure 8.2: Scatterplot of PC1 and PC2 stratified by diagnosis (B = benign, M = malignant)


The goal of a PCA is to significantly reduce the number of variables. Hence, a natural question to ask is
how to decide the smallest number of principal components that explains most of the data. The most common
technique for determining how many principal components to retain is by using a scree plot, which plots the
proportion of variation explained by each principal component. Fig. 8.3 (a) shows the scree plot for the breast
cancer data. The scree plot shows the variation explained for PC1 is the highest, followed by PC2; the variation
explained always drops for a higher principal component. To determine the number of principal components
to retain, we look for the point where the drop on the plot begins to plateau. In this dataset, the point is
around 7. Notice that PC6 or PC7 explains no more than 5% each of the variation in the data. An alternative
method is to use a cumulative variance (scree) plot, which shows the total variation explained by including
a subset of the principal components. Fig. 8.3 (b) shows that the first seven principal components already
explain almost 90% of the variation given in the original 30 variables. This means that by using PC1-PC7
instead of the 30 variables, we have already captured almost 90% of the information in the original data.
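A sketch of the computations behind Table 8.2 and Fig. 8.3: standardise the 30 variables, fit a PCA, then inspect the loadings and the (cumulative) proportion of variance explained. The signs of the loadings are arbitrary, so they may be flipped relative to Table 8.2:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)       # all 30 variables, standardised

pca = PCA().fit(X)

# Loadings: rows of components_ give (phi_11, ..., phi_1p), etc. (signs arbitrary)
print(pca.components_[:3].round(2))

# Scree plot quantities: proportion and cumulative proportion of variance explained
prop = pca.explained_variance_ratio_
print(prop[:3].round(2))                             # roughly 0.44, 0.19, 0.09
print(np.cumsum(prop)[:7].round(2))                  # first seven PCs explain ~90%

# Scores: the new variables (PC1, PC2, ...) for each observation
scores = pca.transform(X)
print(scores[:2, :2])
```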


Figure 8.3: PCA of breast cancer data (a) scree plot (b) cumulative variance (scree) plot


We end by demonstrating how the principal components can be used in place of the original variables
in classification and clustering problems. Based on our previous analyses, we use the first seven principal
components to carry out classification in the breast cancer data. The LDA and QDA decision boundaries
are now linear and quadratic functions of PC1-PC7. The results of using these new decision boundaries are
summarised in Table 8.3. For the LDA, we obtain a specificity of 357/357 = 1, a sensitivity of 188/212 = 0.887 and an overall accuracy of (188 + 357)/569 = 0.958. For QDA, the results give a specificity of 352/357 = 0.986, a sensitivity of 198/212 = 0.934 and an overall accuracy of (198 + 352)/569 = 0.967. These results are comparable to those of Table 8.1,
where discrimination was carried out using all 30 original variables. The results show the effectiveness of using
a few principal components to capture the information in the original variables. We can also use principal
components to carry out logistic regression and cluster analysis.

Table 8.3: Confusion tables using (a) LDA and (b) QDA on breast cancer data using only PC1-PC7 for
discrimination

(a) LDA

              Predicted
    Actual    B      M
    B         357    0
    M         24     188

(b) QDA

              Predicted
    Actual    B      M
    B         352    5
    M         14     198

8.2 Ridge regression

In many areas of healthcare research, for example, genomics, fMRI data analysis, electronic health records
analytics and image analysis, high dimensional data with a large number of variables are often used to
help us better understand a particular issue. In high dimensional data studies, it is not unusual to see the number of variables, p, greatly exceed the number of observations, n. Furthermore, some of the variables may be highly correlated with one another. When either of these conditions is present, the data is said to exhibit multicollinearity. Multicollinearity occurs when p > n because numerically, the data become linearly dependent, whereas when some variables are highly correlated, the data behave like they are linearly


dependent. In either case, the model coefficients may become unstable, in the sense that a small change
in the data can lead to large changes in the model coefficients. The result is predictions that are highly
unreliable. We illustrate this situation using the breast cancer data (n = 569). We carried out logistic
regression using diagnosis (M vs. B) as the outcome and we used all p = 30 variables. In this study, p is smaller
than n but the correlations between some of the variables are very high, for example, between radius_mean
and perimeter_mean, and radius_mean and area_mean, they are 0.998 and 0.987, respectively. The logistic
regression results are given in Table 8.4, column (a). We can observe that all coefficients are extremely high,
even though the variables have all been scaled to have mean zero and unit variance. We then removed the
first observation from the data and repeated the analysis. The results are given under column (b) in Table 8.4,
which show that the coefficients have changed to nonsensical magnitudes.
Ridge regression is a method designed to handle data when serious multicollinearity is suspected.
Ridge regression mitigates multicollinearity by shrinking the influence of some variables in a model. This
process is sometimes called regularisation.b The concept of regularisation is to modify the method to bring
reasonable solutions to unstable problems. In Chapter 3.7.1-3.7.4, the selection methods are also examples
of regularisation.

Table 8.4: Logistic regression results for breast cancer data using p=30 original variables (a) full data (n =
569) (b) first observation removed (n = 568)

Dependent variable:
diagnosis
(a) (b)
Constant 253,916 119,601,255,635,090
radius_mean 8,552,881 −1,708,668,000,072,304
texture_mean 842,067 199,411,984,135,942
perimeter_mean 35,796,845 458,490,102,161,683
area_mean −45,790,269 789,332,171,458,818
smoothness_mean −2,144,100 1,203,376,580,870
compactness_mean −339,500 89,351,023,075,350
concavity_mean 83,032 −294,233,213,785,529
concave.points_mean −665,733 7,278,092,298,960
symmetry_mean 1,109,889 14,723,125,501,110
fractal_dimension_mean −298,858 −17,091,415,037,245
radius_se 9,230,274 54,694,723,980,476
texture_se 3,513,102 299,558,145,403,725
perimeter_se 3,438,589 −276,434,198,385,473
area_se −29,084,419 −305,040,743,657,955
smoothness_se 2,249,396 122,766,915,618,377
compactness_se −3,175,247 −106,445,072,301,594
concavity_se 4,614,370 4,360,757,216,504
concave.points_se −7,773,633 −161,588,882,029,986
symmetry_se 2,389,064 135,818,937,728,046
fractal_dimension_se 4,001,120 17,552,110,545,607
radius_worst −29,628,794 −78,883,407,843,200
texture_worst −3,584,767 −321,268,864,130,750
perimeter_worst −11,889,226 673,080,021,987,347
area_worst 50,959,829 421,117,625,995,940
smoothness_worst −493,436 −33,254,576,910,705
compactness_worst 1,413,874 −55,590,267,282,196

^b Bickel, P.J., Li, B., Tsybakov, A.B. et al. Regularization in statistics. Test 15, 271-344 (2006).


concavity_worst −6,316,972 210,962,588,687,504


concave.points_worst 9,408,268 251,308,666,299,682
symmetry_worst −1,530,342 −34,310,039,909,200
fractal_dimension_worst −667,962 −61,217,189,317,208

Recall from Chapter 3, in a linear regression model with outcome Y and p independent variables X =
(X 1 , ..., X p ), ordinary least squares (OLS) finds the line that minimises the sum of squared Euclidean distance
between the line and the data. The difference between the data and the line is a measure of error or loss. For
OLS, the loss is a squared error loss and we aim to minimise the total loss over all data:
\[
\text{LOSS} = \sum_i \left[y_i - (a + b_1 x_{1i} + \cdots + b_p x_{pi})\right]^2. \tag{8.3}
\]

In (8.3) there are no restrictions on the values of b = (b₁, ..., b_p); when there is serious multicollinearity, b_j x_{ji} + b_{j′} x_{j′i} can have infinitely many solutions in b_j, b_{j′}, causing the phenomenon that we saw in Table 8.4. A ridge regression (RiR) uses the same concept as the OLS except that a penalty is added to the total sum of squares term to give:
\[
\text{LOSS}_{\text{RiR}} = \underbrace{\sum_i \left[y_i - (a + b_1 x_{1i} + \cdots + b_p x_{pi})\right]^2}_{\text{OLS}} + \underbrace{\lambda(b_1^2 + \cdots + b_p^2)}_{\text{penalty}}, \tag{8.4}
\]
where λ is a positive fixed constant. (8.4) is equivalent to minimisation of \(\sum_i [y_i - (a + b_1 x_{1i} + \cdots + b_p x_{pi})]^2\) subject to a constraint (b₁² + ... + b_p²) < c_λ, for some positive constant c_λ. In other words, we can still imagine
that we are finding the best fitting line by minimising the sum of squared errors, but with a constraint. Since
λ, b12 , ...b2p are all > 0, large values of b j s in the penalty term will increase the value of LOSSRiR . Hence, in a
ridge regression, the penalty term is used to discourage large values of b j s by shrinking some of b j s. Because
the purpose of the penalty term is to shrink some of the b j s, λ is sometimes called a shrinkage parameter or
tuning parameter. The value of λ is predefined by the user. Obviously, the larger the value of λ, the higher
the penalty and will lead to more shrinkage. Notice that in a ridge regression, the intercept a is not penalised,
since multicollinearity has nothing to do with the intercept.
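A small numerical sketch of (8.4) on made-up, nearly collinear data, using the closed-form ridge solution b = (XᵀX + λI)⁻¹Xᵀy for centred data (so the unpenalised intercept drops out); with λ = 0 the coefficients are unstable, while a modest λ stabilises them:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.001, size=n)      # nearly collinear with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)   # made-up outcome

X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)                         # centre, so the intercept is not penalised
y = y - y.mean()

for lam in [0.0, 1.0, 10.0]:
    # Closed-form ridge estimate; lam = 0 reduces to ordinary least squares
    b = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    print(lam, b.round(2))
```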
Ridge regression is not restricted to only linear regression; it applies also to other types of regressions,
such as the logistic regression. In linear regression, the term [ yi − (a + b1 x 1i + ... + b p x pi )]2 gives the squared
error loss of using the line a + b1 x 1i + ... + b p x pi to estimate the observation yi . A logistic regression also tries
to use a + b1 x 1i + ... + b p x pi to estimate yi by minimising
\[
\text{LOSS} = \sum_i \text{loss}\left[y_i, (a + b_1 x_{1i} + \cdots + b_p x_{pi})\right], \tag{8.5}
\]

but the loss function is not squared. There are two reasons why the loss function for a logistic regression is
not a squared function: (1) In a logistic regression, (a + b1 x 1i + ... + b p x pi ) gives an estimate of P( yi = 1) and
not yi directly and (2) A logistic regression estimates P( yi = 1), whose error in predicting yi = 1 or 0 should
not be measured by the Euclidean distance P( yi = 1) − yi ; e.g., if yi = 1 (malignant tumour), then the errors
of two estimates P( yi = 1) = 0.5 and P( yi = 1) = 0.01 using squared error (Euclidean distance) would be
(1 − 0.5)2 = 0.25 and (1 − 0.01)2 = 0.98 but the severity of a missed diagnosis of P( yi = 1) = 0.01 might be
a lot higher than 0.98/0.25 ≈ 4 times. A ridge regression for logistic regression follows the same idea as that
for a linear regression; a penalty is added to the loss function
\[
\text{LOSS}_{\text{RiR}} = \sum_i \text{loss}\left[y_i, (a + b_1 x_{1i} + \cdots + b_p x_{pi})\right] + \underbrace{\lambda(b_1^2 + \cdots + b_p^2)}_{\text{penalty}}. \tag{8.6}
\]
Once again, λ serves as a shrinkage parameter so that the larger the value of λ, the higher the shrinkage.
We repeated the analysis using the full set of p = 30 variables, using ridge regression. To carry out
ridge regression, all independent variables must be standardised to have zero mean and unit variance. This


is to ensure that all variables are on the same scale so no variable dominates in the shrinkage process. As
discussed earlier, ridge regression requires the specification of the shrinkage parameter λ. We illustrate using
three different values of λ = 0.01, 1 and 10 and applied ridge logistic regression (8.6). The results are given
in Table 8.5, which shows that even a very small shrinkage parameter (λ = 0.01) manages to reduce the
magnitudes of the model coefficients dramatically. As expected, the model coefficients become smaller as the value of λ increases, since a larger λ places a heavier penalty on large model coefficients.
Since all three choices of λ seem to be able to carry out shrinkage, this leads to the question of what is the
appropriate value of λ to use for a particular problem. To answer this question, we turn to a quantity called
variance inflation factor (VIF).

Table 8.5: Ridge logistic regression results for breast cancer data using three different values of shrinkage
parameter λ = 0.01, 1 and 10

λ
0.01 1 10
(Intercept) -0.5 -0.61 -0.54
radius_mean 0.42 0.12 0.028
texture_mean 0.46 0.08 0.016
perimeter_mean 0.4 0.12 0.028
area_mean 0.41 0.11 0.026
smoothness_mean 0.16 0.05 0.013
compactness_mean -0.095 0.071 0.021
concavity_mean 0.47 0.094 0.025
concave.points_mean 0.55 0.12 0.029
symmetry_mean 0.044 0.042 0.012
fractal_dimension_mean -0.29 -0.028 -0.0018
radius_se 0.65 0.08 0.021
texture_se -0.077 -0.0058 -0.00065
perimeter_se 0.45 0.074 0.02
area_se 0.49 0.074 0.02
smoothness_se 0.094 -0.014 -0.003
compactness_se -0.38 0.014 0.0094
concavity_se -0.043 0.0094 0.008
concave.points_se 0.17 0.045 0.014
symmetry_se -0.19 -0.014 -0.0011
fractal_dimension_se -0.34 -0.019 0.0014
radius_worst 0.63 0.13 0.029
texture_worst 0.72 0.093 0.018
perimeter_worst 0.57 0.12 0.03
area_worst 0.58 0.11 0.028
smoothness_worst 0.51 0.077 0.016
compactness_worst 0.11 0.084 0.022
concavity_worst 0.51 0.097 0.025
concave.points_worst 0.61 0.13 0.03
symmetry_worst 0.53 0.076 0.016
fractal_dimension_worst 0.19 0.043 0.012

Earlier, we observed that the model coefficients change a lot when the data is only slightly different. This
means that the standard errors, and hence the variances, of the coefficients are inflated when multicollinearity
exists. The VIF for a coefficient b j , VIF j is defined as the factor by which the variance is inflated.
For a linear model with p variables, yi = a + b1 x 1i + ...b j x ji + ... + b p x pi + ei , it can be shown that the


variance of b_j is given by:
\[
\text{Var}(b_j) = \frac{\text{Var}(e_i)}{\sum_i (x_{ji} - \bar{x}_j)^2} \times \frac{1}{1 - R^2_{x_j|x_{-j}}}, \tag{8.7}
\]
where R²_{x_j|x_{−j}} is the multiple correlation coefficient obtained by regressing x_j on the remaining variables x_{−j} = (x₁, ..., x_{j−1}, x_{j+1}, ..., x_p). Recall the definition of the multiple correlation coefficient from Chapter 3. It measures the association between x_j and the remaining variables and is hence a measure of multicollinearity. The greater the linear dependence between a variable and the other variables, the larger the value of R²_{x_j|x_{−j}}. Furthermore, as the above formula suggests, the larger the R²_{x_j|x_{−j}} value, the larger the value of Var(b_j). In the limit, if a variable is nearly or perfectly related to another variable in the data, Var(b_j) would become very large. This is the phenomenon we observed in Table 8.4, due to the extremely high correlation between radius_mean and other variables.
Notice that in (8.7), the variance of the model coefficient b_j does not involve the outcome y_i. This means variance inflation has nothing to do with the outcome. Rather, variance is inflated due to the relationships among the independent variables. If none of the independent variables are related, i.e., they are all independent of each other, then Var(b_j) = Var(e_i)/∑_i(x_{ji} − x̄_j)²; hence the variance inflation factor for b_j is simply
\[
\text{VIF}_j = \frac{1}{1 - R^2_{x_j|x_{-j}}}. \tag{8.8}
\]

Each model coefficient has its own VIF and hence, for a dataset with p variables, there will be p VIFs. In
Fig. 8.4, we plotted the maximum and minimum VIFs for all the p = 30 model coefficients in the breast cancer
data as a function of the shrinkage parameter λ. A value of λ = 0 corresponds to no shrinkage. We can
observe from the figure that when λ = 0, some of the variances are inflated by a few thousand times. As the
value of λ increases, the maximum VIF drops. It is also worthwhile to note that while in Table 8.5, it seems
that λ = 0.01, 1, or 10 all managed to bring down the magnitudes of the coefficients, Fig. 8.4 shows that
even for λ = 2, the maximum VIF is still very large. Even though there is no fixed benchmark on how small the VIF should be, a rough guide is for VIF < 10 to be considered acceptable.^c Based on that, for the breast cancer data, we used a λ value of 10.
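A sketch of computing the VIFs in (8.8) directly, by regressing each standardised variable on the others with ordinary least squares (this gives the λ = 0 values only; the ridge-adjusted VIFs plotted in Fig. 8.4 require the ridge form of the coefficient variance):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X = StandardScaler().fit_transform(load_breast_cancer().data)   # 30 standardised variables
p = X.shape[1]

vifs = []
for j in range(p):
    others = np.delete(X, j, axis=1)
    # R^2 from regressing x_j on the remaining variables
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    vifs.append(1.0 / (1.0 - r2))

print(round(min(vifs), 1), round(max(vifs), 1))   # min and max VIF at lambda = 0
```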

Figure 8.4: Variance inflation factor (VIF) as a function of the shrinkage parameter λ in breast cancer data


^c Hair, J. F. Jr., Anderson, R. E., Tatham, R. L. & Black, W. C. (1995). Multivariate Data Analysis (3rd ed). New York: Macmillan.


So far, we have viewed regularisation using ridge regression as a means to manage multicollinearity. In
practical situations, we often have two aims in modelling:

(1) To construct a good predictor of the outcome of interest

(2) To give interpretations of the factors that may explain the outcome

Under the first aim, the values of coefficients in the model are relatively unimportant. On the contrary, if
the aim is interpretation, the coefficients are crucial for identifying the important covariates and to what
extent they are related to the outcome. Regularisation is important for both aims. The different methods of
regularisation differ in their ability in answering the two aims. For example, in selection methods, some of
the covariates are removed through the selection process, therefore the final model contains only a subset
of the covariates that explain the outcome. In contrast, in ridge regression, coefficients are never shrunk to
exactly zero so the full set of covariates remains. If the purpose of the study is to identify from a large pool
of covariates a few that explain the outcome, then selection methods or others (see Chapter8.3) may be more
suited. Nevertheless, even in that case, it is still possible to use ridge regression and focus on the significant
covariates in the final model. If the aim is prediction, then often a loss function is defined to measure prediction
accuracy; methods are then “tuned" to make the loss as small as possible.

8.3 Least absolute shrinkage and selection operator (LASSO)

A ridge regression carries out shrinkage on some of the model coefficients but the model coefficients never
become zero. That means for a problem with p variables, ridge regression will retain all p variables in the final
model. In modern-day healthcare data analysis, it is not uncommon to see problems with p in the thousands, which makes the model challenging to interpret. LASSO (Least Absolute Shrinkage
and Selection Operator) performs dimension reduction by shrinking coefficients of some variables to zero.
Hence LASSO carries out variable selection and regularisation in one procedure.
In PCA, dimension reduction is carried out by converting the original variables into principal components
and then retaining the few that capture most of the information in the data. Principal components can be
loosely interpreted as summaries of the original variables. More precise meanings to a principal components
can sometimes be determined by examining the loadings of the variables in a principal component. However,
when the number of variables becomes large, interpretation of a large number of loadings can become difficult.
LASSO carries out “dimension reduction” by focusing our attention on the variables with non-zero coefficients. Compared to PCA, which transforms the original variables into principal components, all the variables
retained in LASSO are the same as the original variables, and consequently, their meanings are unaltered,
and interpretation is easier.
Ridge regression penalises a model with p variables by using the term λ∑_j b_j². LASSO is very similar to ridge regression except that the penalising term is λ∑_j |b_j|, so in general it is
\[
\text{LOSS}_{\text{LASSO}} = \sum_i \text{loss}\left[y_i, (a + b_1 x_{1i} + \cdots + b_p x_{pi})\right] + \underbrace{\lambda(|b_1| + \cdots + |b_p|)}_{\text{penalty}}, \tag{8.9}
\]
which is equivalent to minimisation of ∑_i loss[y_i, (a + b₁x_{1i} + ... + b_p x_{pi})] subject to a constraint (|b₁| + ... + |b_p|) < c_λ, for some positive constant c_λ. By the seemingly small but important change from penalising the squared values to the absolute values of the b_j s, some coefficients in a LASSO are shrunk to zero. As a result, following LASSO, we end up with a model with p′ ≤ p instead of p variables; if p′ is much smaller than p, interpretation of the model may be greatly improved. In linear regression settings, let ∑_j |b̂_j| denote the total absolute size of the OLS coefficient estimates. Values of c_λ < ∑_j |b̂_j| cause shrinkage of some variables towards zero.
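A sketch of an L1-penalised (LASSO) logistic regression in scikit-learn, where C acts as an inverse penalty strength (so it plays the role of 1/λ up to scikit-learn's internal scaling); the number of coefficients shrunk exactly to zero shows the variable selection at work:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)     # 30 standardised variables
y = data.target

# L1 penalty requires a solver that supports it, e.g. 'liblinear' or 'saga'
lasso = LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=10000).fit(X, y)

coef = lasso.coef_.ravel()
print("variables retained (p'):", int(np.sum(coef != 0)))
print("variables removed:", [data.feature_names[j] for j in np.where(coef == 0)[0]])
```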

We repeated the analysis using the full set of p = 30 variables, using LASSO. LASSO also requires that all independent variables be scaled to zero mean and unit variance, as well as the specification of the shrinkage
parameter λ. Recall that as the value of λ increases, the amount of penalisation increases and the number


of variables retained in the model, p′ will drop. A smaller value of p′ is desirable for interpretation but may
affect predictability. Therefore, we wish to choose a λ (and therefore p′ ) that balances these two aims. In
Chapter 4, we defined the deviance as twice the difference between the log-likelihoods of a saturated model and a
proposed model, where a saturated model fits a separate parameter for each observation, viz.,

2[logLiksaturated model − logLikproposed model ].

A saturated model, though useless in practice, provides a benchmark as the model with the “best” prediction.
Since the negative log-likelihood is technically a loss function, the deviance can be interpreted as the additional loss
incurred by a proposed model compared to the saturated model.
In theory, the model with the least predictive ability is a null model, i.e., a model with p′ = 0. We can
define a deviance ratio as
\[
1 - \frac{\text{Deviance using } p' \text{ coefficients}}{\text{Deviance using a null model}},
\]
which can be interpreted as the recovery of loss (from 0 to 100%) from fitting a LASSO with p′ coefficients
compared to a null model. As p′ increases (λ decreases), we expect the deviance ratio to rise, but at the
expense of interpretability. Therefore, we aim to increase p′ until the rise in the deviance ratio slows. We illustrate
this procedure using the breast cancer data. We consider a sequence of λ values from 0.4 to 0.00001, and
for each value of λ we fitted a LASSO regression. Fig. 8.5 shows the effects of using different values of λ
on the deviance ratio and p′. We can observe that for larger values of λ, p′ is smaller, pointing to a model
with fewer variables retained. At the same time, the deviance ratio is smaller, and hence the model
fits less well than models with larger values of p′. The goal is to find a value of λ that captures most of the deviance,
i.e., a deviance ratio close enough to the value given by the full model. The figure shows that when λ = 0.2
only p′ = 3 variables are retained, but the corresponding deviance ratio is very small. For λ = 0.001, p′ = 26
and the deviance ratio is not that different from that for p′ = 27. In Table 8.6, we printed the last few values
of λ used; we observe that the deviance ratio hardly changes while p′ settles at 27. For completeness, in
Table 8.7, we also printed the LASSO regression coefficients using λ = 0.000038. We observe that, consistent
with Table 8.6, three of the variables (perimeter_mean, area_mean and concave.points_worst) have coefficients
equal to 0 and hence are selected out of the model.
In this example, we have focused on tuning λ using the deviance ratio. Metrics other than the deviance
ratio can also be used. Earlier, in Chapter 8.2, we noted that prediction is often one of the aims in practice. If
that aim is of primary interest, then λ can be tuned to minimise some measure of prediction loss. We explore
this direction in Chapter 10.
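As a rough illustration of the λ versus deviance-ratio trade-off, the following snippet continues the simulated example above (Xs, y and LogisticRegression are as defined in the previous sketch). For a binary outcome the saturated log-likelihood is zero, so the deviance is simply −2 times the log-likelihood.

```python
# Continues the previous sketch: Xs, y and LogisticRegression are already defined.
import numpy as np
from sklearn.metrics import log_loss

n = len(y)
null_dev = 2 * n * log_loss(y, np.full(n, y.mean()))          # deviance of the null model
for C in [0.01, 0.03, 0.1, 0.3, 1.0]:                          # smaller C corresponds to larger lambda
    fit = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(Xs, y)
    dev = 2 * n * log_loss(y, fit.predict_proba(Xs)[:, 1])     # deviance of the fitted LASSO
    p_prime = int(np.sum(fit.coef_ != 0))                      # number of variables retained
    print(f"C={C:5.2f}  p'={p_prime:2d}  deviance ratio={1 - dev / null_dev:.3f}")
```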

Table 8.6: Deviance, λ and p′ in LASSO

λ Deviance ratio p′
0.000081 0.955 25
0.000074 0.956 26
0.000067 0.956 26
0.000061 0.957 26
0.000056 0.957 26
0.000051 0.958 27
0.000050 0.958 27
0.000046 0.958 27
0.000042 0.958 27
0.000038 0.959 27


Figure 8.5: Deviance ratio and p′ as functions of the shrinkage parameter λ in breast cancer data
[Figure: the deviance ratio (vertical axis, 0 to 1) plotted against λ on a log scale (horizontal axis, from 0.4 down to 0.00005), with the corresponding number of retained variables p′ (0, 3, 4, 9, 15, 25, 27) indicated along the top.]


Table 8.7: Model coefficients in LASSO

radius_mean -12.466
texture_mean 0.261
perimeter_mean 0.000
area_mean 0.000
smoothness_mean 2.138
compactness_mean -10.449
concavity_mean 9.218
concave.points_mean 4.925
symmetry_mean -1.339
fractal_dimension_mean 1.174
radius_se 4.736
texture_se -1.681
perimeter_se -3.250
area_se 6.057
smoothness_se 0.903
compactness_se 4.251
concavity_se -6.075
concave.points_se 6.847
symmetry_se -1.569
fractal_dimension_se -9.332
radius_worst 3.634
texture_worst 4.166
perimeter_worst 1.663
area_worst 21.004
smoothness_worst -0.730
compactness_worst -1.809
concavity_worst 2.361
concave.points_worst 0.000
symmetry_worst 3.088
fractal_dimension_worst 6.145

9
Causation vs. Correlation

Health care research often attempts to determine the predictors of certain outcomes. Researchers conducting
studies in medicine and epidemiology investigate questions like “Which factors lead to a certain disease?”,
“Does smoking cause lung cancer?” or “Could improving population health lead to a happier and more
productive population?”. These questions are causal in nature. They all focus on how changes in an input X,
such as smoking, affect an output Y, such as a cancer diagnosis. In order to answer these questions, we must
have a clear definition of what a cause is and of how effects are measured. The topic under study is generally
referred to as causal inference. A causal relationship, if it exists, has a temporal flavour. If we believe X
causes Y, then we assume that at some point in time X happens and, because X happened, Y appears. In earlier
chapters, we studied many problems in which we looked at associations between independent variables X and
an outcome variable Y. Clearly, not every association is temporally directed. Conversely, not every temporally
directed association implies a causal relationship; it might be due to measurement error, shared prior factors
or chance.
Sometimes, the temporal direction can be assessed with subject-matter knowledge, e.g., the installation of a nuclear
power plant may affect health but not vice versa. Alternatively, it can be established through study design.
Here, the temporal order is guaranteed by a condition in an experiment that is manipulated before
an outcome is measured. For example, in a randomised controlled trial, patients are randomised to different
treatments and then followed up for the outcomes. However, such experiments are often too expensive,
too lengthy, or simply not feasible; e.g., it would not be ethical to randomise individuals to take up smoking,
or to move to an area with a nuclear power plant. Very often, studies are observational. An observational
study may be prospective, e.g., a group of healthy individuals are assessed for their smoking behaviour and
followed up for incidence of lung cancer over time; or it may be retrospective, as in a case-control study,
e.g., smoking histories among a group of lung cancer patients are assessed as risk factors and compared to those of
healthy individuals. The focus of this chapter is estimating causal effects in observational studies.

9.1 Counterfactuals and potential outcomes

In causal inference, Y is the outcome and X is referred to as the exposure or treatment. For simplicity, let us
assume that X is binary so X = 1 for exposed and 0 for unexposed. For an individual i, let Yi (1) be the outcome
if the individual is exposed and Yi (0) if the individual is unexposed. To define a causal effect (sometimes
also called treatment effect) in individual i, we need to compare Yi (1) to Yi (0). However, this comparison
is not possible because the same individual can either be exposed or not exposed but cannot simultaneously
be both. If individual i is exposed, then Yi (0) is unobservable; conversely, if individual i is unexposed, then
Yi (1) is not observable. The treatment that individual i did not receive is called counterfactual treatment.
Likewise, the outcome under this treatment is referred to as counterfactual or potential outcome. A causal


effect can be defined in terms of a difference:

Yi (1) − Yi (0) (9.1)

or a ratio
\[
\frac{Y_i(1)}{Y_i(0)}.
\]
Since a causal effect involves two quantities, at least one of which is counterfactual and therefore
unobservable, the question is how we can estimate a causal effect. A solution is to define the problem in
terms of the means of the potential outcomes, or the average treatment effect (ATE),

E[Y (1)] − E[Y (0)] (9.2)

where E[Y (1)] and E[Y (0)] represent, respectively, the mean outcome if the population were treated and the mean
outcome if the population were untreated. Unlike (9.1), (9.2) is estimable since it only requires us to obtain
outcome information from two sub-populations, the treated and the untreated. However, for this strategy to work, we
need the sub-populations who are treated and untreated to be comparable, except for the treatments they receive. This requires

Y (1), Y (0) ⊥ X , (9.3)

where the symbol A ⊥ B is used to denote that A and B are independent. Condition (9.3) means that, for example,
sicker patients are not more likely to get treatment, or women who have a family history of breast cancer
are not more likely to go for cancer screening. In randomised studies, condition (9.3) is satisfied because
treatment assignment is randomised by a process, e.g., a coin flip or a random number generator, that is
completely independent of the outcome. However, as argued earlier, very often, randomised studies are not
feasible. When analysing causal relationships using data from non-randomised, observational studies, (9.3)
is no longer guaranteed.
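A toy simulation (entirely hypothetical numbers) makes the point concrete: when X is assigned independently of the potential outcomes, the difference between the observed group means recovers the ATE.

```python
# Toy simulation of potential outcomes under randomisation; purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
y0 = rng.normal(0.0, 1.0, n)          # potential outcome if unexposed
y1 = y0 + 2.0                          # potential outcome if exposed; true ATE = 2
x = rng.integers(0, 2, n)              # randomised exposure, independent of (y0, y1)
y_obs = np.where(x == 1, y1, y0)       # only one potential outcome is ever observed

ate_hat = y_obs[x == 1].mean() - y_obs[x == 0].mean()
print(round(ate_hat, 3))               # close to the true ATE of 2
```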

9.2 Confounders

The biggest obstacle to drawing causal conclusions about an exposure X and an outcome Y using observational
data is confounding. Confounding occurs when there is a variable Z that causes Y and Z is also associated
with (but not as a result of, see also Chapter 3.4) X. In this case, Z induces an association between X and
Y, whether or not there is a real causal effect of X on Y (Fig. 9.1). Fig. 9.1 shows two examples of a confounder
Z. In Fig. 9.1 (a), Z directly affects X and Y; in Fig. 9.1 (b), even though Z does not affect Y directly, it
does so through an intermediate variable A. Confounders may falsely demonstrate an apparent association
between the treatment and outcome when no real association between them exists. For example, if we wish
to determine the effects of more frequent screening for breast cancer, and women who have a family history
of cancer are more likely to choose more frequent screening, then the observation that more
cancers are picked up under more frequent screening may simply be an artefact of screening women who are
more likely to have cancer, rather than of the program being more effective. Conversely, confounders may
mask an actual association. Suppose we wish to compare the effectiveness of two treatments; if patients
with more severe conditions are more likely to receive treatment A than treatment B, then the effectiveness of
A may not be fully reflected. Severity of the condition is a confounder because it is likely to be associated with
the outcome, and the treatment choice is indicated by condition severity. This type of confounding is called
“confounding by indication”.
For Z to be a confounder, it has to be associated with both X and Y. In a randomised study, if we let Z be
the outcome of the flip of a coin that determines whether a patient is treated or not, then Z is
related to X. However, since the outcome of the coin flip (Z) should not determine Y, Z does not satisfy the
criteria of a confounder. Another example of a situation where Z is not a confounder: if people with a
certain genetic mutation (Z) are more likely to develop cancer (Y), but information about the mutation is
unknown and does not affect the treatment decision (X), then Z is related to Y but not to X. In the screening
example above, however, family history (Z) is related to X, the frequency of screening, and family history (Z) is
related to Y, the discovery of cancer. Hence, family history is a confounder.


Figure 9.1: Graph showing the relationship between confounder Z, exposure X and outcome Y
[Figure: two causal diagrams. In (a), Z points to both X and Y, with a questioned arrow from X to Y. In (b), Z points to X and to an intermediate variable A, which points to Y, again with a questioned arrow from X to Y.]

Confounding is unavoidable and must be accounted for when observational data are used to
study causal relationships. Due to confounding, condition (9.3) no longer holds. In principle, if we are
able to identify and observe all confounders Z, we can still draw valid causal inferences by controlling for
(sometimes also called conditioning on) the confounders. For example, suppose family history is the only
known confounder for breast cancer;ᵃ then we can compare the effects of screening frequencies in the sub-
population of women with a family history of breast cancer. Since in this sub-population all women have
the same value of Z, treatment X is independent of the potential outcomes within it. We then repeat the same for the sub-population
of women with no family history of breast cancer and combine the results from the two analyses to form an
overall estimate of the causal effect. Hence, controlling for (conditioning on) Z gives conditional independence
between X and the potential outcomes,
Y (1), Y (0) ⊥ X |Z. (9.4)

In general, if we can identify the set of all confounders, Z, such that the conditional independence criterion
(9.4) is satisfied, then causal effects of X on Y can be estimated. The description of conditioning on Z above
shows that we are practically carrying out analyses at each level or stratum of Z. In this sense, conditioning
is a stratified analysis.

9.3 Estimating the causal effect

As an illustrative case, we performed a subgroup analysis of the International Stroke Trial (IST).ᵇ The IST
trial included 19435 stroke patients across 36 countries, but for our purposes we use only data from patients
treated on the Aspirin arm. The outcome of interest is death or disability at 6 months following treatment
(a binary outcome). We are interested in estimating the effect of the delay to Aspirin administration on the
outcome. This delay varied from 1 to 48 hours and likely depends on a number of factors (e.g., stroke subtype,
consciousness, function deficits, etc.). We use a 24 hour delay as a reference cutoff, so we define DELAY (X)
as a binary exposure variable with X = 1 if Aspirin administration is delayed for more than 24 hours and 0
otherwise. We considered the following relevant variables that might impact the causal effect of X on
Y (DEATH/DISABLE): AGE (age), SEX (sex), SBP (systolic blood pressure), RCONSC (conscious state), RCT
(CT-scan), RVISINF (visible infarct at CT-scan), STYPE (stroke subtype), RATRIAL (atrial fibrillation), RASP3
(Aspirin intake within the previous 3 days), and DEF (the number of assessable function deficits, 0-8). We
excluded all patients who were unconscious (RCONSC = U). In total, there are n = 8992 patients with
complete information on all variables. Of these patients, 5543 either were dead or disabled at 6 months
(Y = 1) and 3449 were alive and not disabled; X = 1 in 3039 patients and X = 0 in the remaining 5953
patients. To estimate the causal effect of X on Y , we consider 3 methods.

9.4 Matching

Recall that to estimate the causal effects of X on Y, we need (9.3):

Y (1), Y (0) ⊥ X ,
ᵃ This is hypothetical; in reality, breast cancer has been found to be related to genetics, family history, among other factors.
ᵇ Sandercock et al. (2011) The International Stroke Trial database. Trials 12:101.


which means the exposed (X = 1) and unexposed (X = 0) sub-populations must be comparable. This
condition is violated when, for example, sicker patients are more (or less) likely to be in the exposed group.
Hence, to restore this condition, we can attempt to adjust the data so that the exposed and unexposed groups
are comparable in their risk factors (confounders). This method of adjustment is called matching. Matching
aims to adjust the data by creating comparable distributions of the confounders between the different levels
of the exposure X. To assess the causal effect of X on the outcome Y, each subject with X = 1 is matched to
a subject with X = 0 who has the same values of all the confounders. We illustrate this idea assuming there
is only one confounder: RASP3 (Aspirin within the previous 3 days). We found the following in the data (Table 9.1). There are 719 patients
in the X = 1 group with RASP3 = Y; we need to match these patients to 719 randomly chosen patients with RASP3 = Y
in the X = 0 group. Similarly, we need to match the 2320 patients with RASP3 = N in the X = 1 group to 2320
randomly chosen patients with RASP3 = N in the X = 0 group. In matching, we must start with the group with
fewer patients.

X (DELAY)
RASP3 0 1
N 4753 2320
Y 1200 719

Table 9.1: Contingency table between RASP3 and X (DELAY) in IST data

In reality, there is more than one confounder that should be matched on: AGE, SEX, RCONSC, etc.
Furthermore, the confounders may be a mix of categorical or binary and continuous variables. For each
individual with X = 1, it can be difficult or impossible to find an individual with X = 0 having exactly the same
values for AGE, SEX, RCONSC, etc. In such cases, we can replace exact matching with approximate matching
for some variables. For example, we can match exactly on the categorical variables SEX and RCONSC but
allow AGE to deviate slightly, up to a specified range.
The kind of matching we just described can be very inefficient in practice. We would like to match each
exposed observation with an unexposed observation that matches exactly on each of the confounders. As the
number of confounders increases, or the ratio of the number of unexposed to exposed observations decreases,
it becomes less and less likely that an exact match will be found for each exposed observation. Furthermore,
when some of the confounders are continuous, they must be coarsened into arbitrarily defined groups before
matching can be carried out. A solution to these issues is called propensity score matching.
In propensity score matching, we need to define a propensity score for each observation. The propensity
score is defined as the conditional probability of being exposed given the confounders, Pr(X = 1|Confounders).
We can calculate a propensity score for each observation in the data. Once we have a propensity score for
each observation, we can match exposed subjects with unexposed subjects with the same (or very similar)
propensity scores, instead of the original confounders. In this example, since the exposure X is binary, we can
use a logistic regression to estimate the propensity score. Briefly, let Z1 , ...Z p be the confounders, i.e., AGE,
SEX, RCONSC, etc.., then we can consider the following model for the logit of the propensity score:

logitP(X = 1|Confounders) = a + b1 Z1 + ... + b p Z p ,

and fit the model to the data. We can use the fitted model to obtain the propensity score estimates for each
individual in X = 1 to X = 0.
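As a sketch, and assuming the IST variables sit in a pandas data frame called ist with the column names used in the text and a binary exposure column X, the propensity model could be fitted with statsmodels as follows (this is illustrative code, not the authors' code).

```python
# Hedged sketch: estimate propensity scores with a logistic regression.
# `ist` is an assumed pandas DataFrame holding the IST variables and binary exposure X.
import statsmodels.formula.api as smf

ps_model = smf.logit(
    "X ~ AGE + C(SEX) + C(RASP3) + C(RCONSC) + C(RCT) + C(RATRIAL) "
    "+ C(RVISINF) + C(STYPE) + DEF",
    data=ist,
).fit()

ist["pscore"] = ps_model.predict(ist)   # Pr(X = 1 | confounders) for every patient
```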
If matching is carried out using continuous confounders, or the propensity score, it would be difficult
to find two observations, one exposed and one unexposed, that match exactly. To implement matching,
a tolerance (often called a caliper) is decided. A match is found if the “distance” between an exposed and an
unexposed observation falls below the tolerance. If the original confounders are used, distance is defined by
the Mahalanobis distance; if the propensity score is used, the distance is simply the absolute difference between
the propensity scores. It is possible that no matches can be found for a particular exposed observation if none
of the distances are below the tolerance; and there may be situations where there are multiple matches. If no
matches are found, the exposed observation needs to be discarded or else the tolerance raised. If there are
multiple matches, then there is the choice of using the unexposed observation with the shortest “distance” or
using all of the matches. Observations are not recycled, so that an unexposed observation that is matched to


an exposed observation cannot be re-used for matching with another exposed observation. We applied these
methods to the IST data. In Table 9.2, column (a), we show the distribution of the confounders between the
exposed (X = 1) and unexposed (X = 0) groups as originally recorded. For each confounder Z_j, j = 1, ..., p, we calculated
the standardised mean difference (SMD) between the exposed and unexposed groups,
\[
\mathrm{SMD} = \frac{\bar Z_j(X=1) - \bar Z_j(X=0)}{\sqrt{\big(s_j^2(X=1) + s_j^2(X=0)\big)/2}}.
\]
A value of SMD of 0.1 − 0.2 shows some evidence of imbalance, and a value of > 0.2 is evidence of serious
imbalance in the distribution of the confounder between the exposed and unexposed groups. As observed in
column (a), there are signs of serious imbalance in RCT and RVISINF, and moderate imbalance in RCONSC,
STYPE and DEF. We carried out both methods of matching, one using the original confounders and another
using the propensity score. The results are shown in columns (b) and (c), respectively. For both types of
matching, we considered only 1:1 matching; since the exposed group has only 3039 observations, there are
3039 unexposed matches. These figures are indicated at the top of columns (b) and (c). Looking at the SMDs
for both types of matching, they are much smaller than those in column (a), indicating that both types of
matching are successful in obtaining balance in all the confounders between the exposed and unexposed groups.
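A bare-bones version of 1:1 greedy nearest-neighbour matching on the propensity score, followed by an SMD check, might look like the sketch below. It continues the hypothetical ist data frame with its pscore column; the caliper of 0.01 is illustrative, and dedicated matching packages offer more refined algorithms than this simple greedy search.

```python
# Hedged sketch: greedy 1:1 caliper matching on the propensity score, without replacement.
import numpy as np

def match_on_pscore(df, caliper=0.01):
    """Match each exposed (X = 1) patient to the nearest unexposed (X = 0) patient
    on the propensity score; matched controls are not re-used."""
    exposed = df.index[df["X"] == 1]
    pool = df.loc[df["X"] == 0, "pscore"].copy()      # available unexposed patients
    pairs = []
    for i in exposed:
        if pool.empty:
            break
        dist = (pool - df.at[i, "pscore"]).abs()
        j = dist.idxmin()
        if dist.at[j] <= caliper:                      # accept only matches within the caliper
            pairs.append((i, j))
            pool = pool.drop(j)                        # no recycling of controls
    return pairs

def smd(df, var):
    """Standardised mean difference of a numeric variable between X = 1 and X = 0."""
    a, b = df.loc[df["X"] == 1, var], df.loc[df["X"] == 0, var]
    return (a.mean() - b.mean()) / np.sqrt((a.var() + b.var()) / 2)

pairs = match_on_pscore(ist)
matched = ist.loc[[i for pair in pairs for i in pair]]
print(len(pairs), smd(ist, "AGE"), smd(matched, "AGE"))   # balance before vs. after matching
```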
We then carried out tests for the outcome based on the matched data. Since each exposed observation
is matched to an unexposed observation in the matched data, we can use a simple paired t-test. For the
matched data in column (b), the mean difference in Y between the unexposed and exposed is 0.017 with a
95% confidence interval of (−0.005, 0.04), and a test statistic of t = 1.53, with a p-value of 0.13. For the
matched data in column (c), the mean difference in Y between the unexposed and exposed is 0.003 with a
95% confidence interval of (−0.02, 0.03), and a test statistic of t = 0.24, with a p-value of 0.8. Both show no
causal effect of delay in treatment on the outcome. As a comparison, we also performed an analysis based
on the original, unmatched data. For the unmatched data, all 8992 observations can be used. Since the data
are not matched, the analysis can be carried out using a two-sample t-test. The mean difference is now −0.03
with a 95% confidence interval of (−0.05, −0.009), a test statistic of t = −2.8 and a p-value of 0.005, highly
significant.

Table 9.2: Table of confounders: (a) Unmatched and (b) Matched by Mahalanobis Distance using original confounders (c) Matched by propensity score




(a) (b) (c)


X 0 1 SMD 0 1 SMD 0 1 SMD
n 5953 3039 3039 3039 3039 3039
SEX = M (%) 3164 (53.1) 1597 (52.6) 0.012 1596 (52.5) 1597 (52.6) 0.001 1567 (51.6) 1597 (52.6) 0.020
AGE (mean (SD)) 71.83 (11.56) 71.76 (11.74) 0.006 71.79 (11.24) 71.76 (11.74) 0.003 71.51 (11.25) 71.76 (11.74) 0.022
RASP3 = Y (%) 1200 (20.2) 719 (23.7) 0.085 719 (23.7) 719 (23.7) <0.001 718 (23.6) 719 (23.7) 0.001
RCONSC = F (%) 4519 (75.9) 2474 (81.4) 0.134 2474 (81.4) 2474 (81.4) <0.001 2447 (80.5) 2474 (81.4) 0.023
RCT = Y (%) 3648 (61.3) 2319 (76.3) 0.329 2319 (76.3) 2319 (76.3) <0.001 2356 (77.5) 2319 (76.3) 0.029
RATRIAL = Y (%) 1097 (18.4) 463 (15.2) 0.085 463 (15.2) 463 (15.2) <0.001 464 (15.3) 463 (15.2) 0.001
RVISINF = Y (%) 1538 (25.8) 1336 (44.0) 0.387 1336 (44.0) 1336 (44.0) <0.001 1324 (43.6) 1336 (44.0) 0.008
STYPE (%) 0.173 0.097 0.030
LACS 1365 (22.9) 826 (27.2) 823 (27.1) 826 (27.2) 826 (27.2) 826 (27.2)
OTH 11 (0.2) 10 (0.3) 0 (0.0) 10 (0.3) 6 (0.2) 10 (0.3)
PACS 2391 (40.2) 1230 (40.5) 1264 (41.6) 1230 (40.5) 1237 (40.7) 1230 (40.5)
POCS 632 (10.6) 379 (12.5) 331 (10.9) 379 (12.5) 391 (12.9) 379 (12.5)
TACS 1554 (26.1) 594 (19.5) 621 (20.4) 594 (19.5) 579 (19.1) 594 (19.5)
DEF (mean (SD)) 3.31 (1.27) 3.17 (1.27) 0.117 3.16 (1.24) 3.17 (1.27) 0.001 3.13 (1.24) 3.17 (1.27) 0.026

9.5 Stratification and regression adjustment

Earlier, we discussed that causal effects of X on Y can be studied using observational data if we can find
confounders Z such that, conditional on Z, the potential outcomes and X are independent, see (9.4):

Y (1), Y (0) ⊥ X |Z.

The idea of conditioning is that, if Z affects Y but Z is held at a fixed level, then the comparison made between the
exposed and unexposed groups becomes free of the influence of Z, since every subject has the same level of
Z. For example, if Z = sex influences the outcome, but we make a comparison between X = 1 and X = 0 among
males only, then the results are no longer affected by the sex distributions in the X = 1 and X = 0 groups.
This concept gives rise to the method of stratification or stratified analysis.
We illustrate the method using the IST data. To begin, we assume for the moment that there is only one
confounder Z = SEX, which is a categorical variable with two levels F and M. We stratify the data into two
strata, F and M. There are 4231 patients who are F and the remaining 4761 belong to M. For each of these two
strata, we can then make a comparison of Y between X = 1 and X = 0. In the F stratum, the mean difference
of Y between X = 1 and X = 0 is d F = −0.0192 with a standard error SE F = 0.0151; in the M stratum, the
corresponding figures are d_M = −0.0422 with a standard error SE_M = 0.0153. We can then form a weighted
mean difference and a pooled standard error:
\[
d = d_F \times \frac{4231}{8992} + d_M \times \frac{4761}{8992} = -0.0314; \qquad
\mathrm{SE} = \sqrt{\mathrm{SE}_F^2 \times \Big(\frac{4231}{8992}\Big)^2 + \mathrm{SE}_M^2 \times \Big(\frac{4761}{8992}\Big)^2} = 0.0108,
\]

from which we can obtain an approximate 95% confidence interval, stratified for SEX, of

d ± 1.96SE = −0.0314 ± 1.96 × 0.0108 ≈ (−0.053, −0.010).

We can use the same method if there are two confounders, say, SEX (F vs. M) and RASP3 (Y vs. N). To control
for both confounders, we need to divide the data into 2 × 2 = 4 strata: (SEX, RASP3) = (F,Y), (F,N), (M,Y),
(M,N) and then repeat the analysis that we carried out earlier for SEX only.
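The stratified calculation is straightforward to code. The sketch below (hypothetical ist data frame with outcome Y, exposure X and stratifying variable SEX) computes the stratum-specific mean differences and combines them with weights proportional to the stratum sizes, as in the formula above.

```python
# Hedged sketch of a stratified (by SEX) estimate of the effect of X on Y.
import numpy as np

est, var, n_total = 0.0, 0.0, len(ist)
for sex, stratum in ist.groupby("SEX"):
    y1 = stratum.loc[stratum["X"] == 1, "Y"]
    y0 = stratum.loc[stratum["X"] == 0, "Y"]
    d = y1.mean() - y0.mean()                          # stratum-specific mean difference
    se = np.sqrt(y1.var() / len(y1) + y0.var() / len(y0))
    w = len(stratum) / n_total                         # weight by stratum size
    est += w * d
    var += (w * se) ** 2
print(est, est - 1.96 * np.sqrt(var), est + 1.96 * np.sqrt(var))   # estimate and approx. 95% CI
```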
The method of stratification works well if there are a few confounders and all are categorical with very
few levels. However, when the number of confounders increases, the factorial combinations of strata that need
to be formed would become unmanageably large very quickly. This means that there will be few observations
in each stratum. Furthermore, in many studies, such as the IST data, the confounders are a mix of categorical
and continuous variables. To create strata using the continuous confounders, they must be coarsened to a few
categories. A common approach to resolving this problem is to use regression adjustment.
To control for confounding using regression adjustment, we simply include the confounders, as well as
the exposure, as independent variables in a regression model for the outcome. In that case, the coefficient for the exposure variable is interpreted
as the effect of the exposure on the outcome when the confounders are simultaneously held (controlled)
at fixed levels. We illustrate this idea using the IST data. For the IST data, since the outcome Y is binary
(Y = 1 for dead/disabled at 6 months and 0 otherwise), we use a logistic regression. We let Z_1, ..., Z_p be the
confounders AGE, SEX, etc., and X be the exposure (DELAY). The logistic regression model is
\[
\mathrm{logit}\, P(Y = 1) = a + b_1 Z_1 + \dots + b_p Z_p + cX.
\]

The results are given in Table 9.3, column (b). In this model, we are interested only in the coefficient of X. The
model coefficient estimate is 0.027, which means DELAY = 1 (delay of more than 24 hours in Aspirin administration)
gives a log odds ratio of 0.027, or OR = exp(0.027) = 1.027, compared to DELAY = 0 (delay ≤ 24
hours), with 95% confidence interval exp(0.027 ± 1.96 × 0.054) = (0.92, 1.14). Furthermore, this result is
not significant, indicating not enough evidence of a causal effect of DELAY on the outcome. For comparison,
we also included an unstratified analysis, i.e., a regression model using only X as the independent variable.
The results are given in column (a), which shows a significant negative coefficient for X, which would suggest that those
with delay actually fared better. Comparing the two models, the stratified model (b) provides a
significantly better fit (p < 0.001 by a likelihood ratio test), consistent with its much smaller AIC.
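A sketch of the regression adjustment (again assuming the hypothetical ist data frame) using statsmodels is given below; only the coefficient of X is of interest.

```python
# Hedged sketch: logistic regression of the outcome on the exposure plus confounders.
import numpy as np
import statsmodels.formula.api as smf

adj = smf.logit(
    "Y ~ C(SEX) + AGE + C(RASP3) + C(RCONSC) + C(RCT) + C(RATRIAL) "
    "+ C(RVISINF) + C(STYPE) + DEF + X",
    data=ist,
).fit()

c = adj.params["X"]                       # adjusted log-odds ratio for DELAY
lo, hi = adj.conf_int().loc["X"]
print(np.exp([c, lo, hi]))                # OR and its 95% confidence interval
```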


Table 9.3: Unstratified (a) and stratified analysis (b) of IST data

Dependent variable:
Y (DEAD/DISABLE)

(a) (b)
Constant 0.518∗∗∗ −2.890∗∗∗
(0.027) (0.218)
SEX=M −0.336∗∗∗
(0.051)

AGE 0.052∗∗∗
(0.002)

RASP3=Y 0.139∗∗
(0.061)

RCONSC=F −1.409∗∗∗
(0.083)

RCT=Y −0.618∗∗∗
(0.061)

RATRIAL=Y 0.236∗∗∗
(0.074)

RVISINF=Y 0.373∗∗∗
(0.062)

STYPE=OTH 0.183
(0.490)

STYPE=PACS −0.015
(0.061)

STYPE=POCS −0.413∗∗∗
(0.087)

STYPE=TACS 0.482∗∗∗
(0.095)

DEF 0.396∗∗∗
(0.025)

X DELAY −0.128∗∗∗ 0.027


(0.046) (0.054)

Observations 8,992 8,992


Log Likelihood −5,982.763 −4,774.112
Akaike Inf. Crit. 11,969.530 9,576.225

Note: ∗ p<0.1; ∗∗ p<0.05; ∗∗∗ p<0.01


9.6 Inverse probability weighting

As defined earlier, an estimated propensity score reflects the probability of exposure (treatment assignment)
conditional on a patient’s measured characteristics (potential confounders). Table 9.4 shows the propensity
scores for the first 10 patients in the IST data, along with their characteristics. For example, the first patient
is a female (F), aged 64, received no Aspirin within 3 days prior to entry to the study, etc.; based on these
characteristics, her estimated propensity of X = 1 is 0.226 and of X = 0 is 1 − 0.226 = 0.774. She actually
received X = 0. Similarly, the 9-th patient is F, aged 72, with no Aspirin given within 3 days prior to entry to the
study, etc.; her estimated propensity of X = 1 is 0.365 and she actually received X = 1.

SEX  AGE  RASP3  RCONSC  RCT  RATRIAL  RVISINF  STYPE  DEF  propensity score  1 − propensity score  X
F 64 N F Y Y N TACS 4 0.226 0.774 0
M 82 N F Y N N PACS 1 0.344 0.656 0
M 54 N F Y N N POCS 2 0.325 0.675 0
M 81 N F Y Y N PACS 4 0.286 0.714 0
M 71 N F Y Y Y LACS 2 0.483 0.517 0
F 64 N D Y N Y PACS 3 0.411 0.589 1
F 81 Y D N N N TACS 4 0.197 0.803 0
F 81 N F N Y N PACS 4 0.219 0.781 0
F 72 N F Y N N LACS 3 0.365 0.635 1
M 80 N F Y N N PACS 4 0.325 0.675 0

Table 9.4: Patient characteristics and propensity score estimates

The IST data has n = 8992 patients, with 5953 unexposed (X = 0) patients and 3039 exposed (X = 1)
patients, which translates into about 5953/8992 ≈ 2/3 unexposed and 1/3 exposed patients. If the study had been
randomised, each patient would have had the same chance of being in the unexposed or exposed group,
given by the overall ratio of unexposed to exposed patients in the data. In other words, all patients would have
about a 2/3 chance of being in the unexposed group and a 1/3 chance of being in the exposed group. However, Table 9.4
shows this is not the case in the IST data. For example, the 1-st patient has a propensity of 0.774 of being in the
unexposed group, higher than 2/3, but the 5-th patient only has a propensity of 0.517 of being in the unexposed group,
compared to 2/3. Similarly, the 9-th patient has a propensity of 0.365 of being exposed, instead of 1/3.
We just showed that the IST data cannot be used as is, because patients with characteristics like those of the 1-st
patient are over-represented, while patients with characteristics like those of the 5-th patient are under-represented, in
the unexposed (X = 0) group; similarly, in the exposed (X = 1) group, patients with characteristics like those of the
9-th patient are over-represented. We need to adjust the data so that patient characteristics are properly
represented in both the unexposed and exposed groups. The adjustment requires us to down-weight those
characteristics that are over-represented and increase the weights of the under-represented ones. Since a
propensity (of being unexposed or exposed) that is too high (too low) gives over- (under-) representation,
weights should be created by taking the inverse of the propensity scores.
The inverse probability (propensity) weighting (IPW) method assigns weights to patients based on
the inverse of their probability of being in their group (either unexposed or exposed), as estimated by the
propensity score. IPW results in a pseudo-sample in which patients with a high propensity have a smaller
weight and patients with a low propensity have a larger weight, and thus the distribution of measured
patient characteristics is properly represented according to the sizes of the treatment groups. If all possible
confounders of a study are identified and the propensity scores are correctly specified, IPW allows estimation
of the causal effects of X on Y, because the data are re-weighted to assess the effects of X as if they were
collected in a randomised study.
When there are only two treatment groups, unexposed (X = 0) and exposed (X = 1), the weight for each
patient is calculated by inverting the probability of receiving the treatment the patient did in fact receive. For


patients in the exposed group, the weight is calculated as the inverse of the propensity score (of X = 1),
whereas for patients in the unexposed group, the weight is calculated as the inverse of 1 minus the propensity
score (i.e., the probability of unexposed). Thus, the weight w i for the i-th patient with propensity score pi is
calculated as:
w i = 1/pi if the patient is exposed

and
w i = 1/(1 − pi ) if the patient is unexposed
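In code, the weights follow directly from the propensity scores. The sketch below continues the hypothetical ist data frame with its pscore column, and the truncation step anticipates the caution given at the end of this section (the 99th-percentile cap is purely illustrative).

```python
import numpy as np

# Inverse probability weights: 1/p for the exposed, 1/(1-p) for the unexposed.
ist["ipw"] = np.where(ist["X"] == 1, 1 / ist["pscore"], 1 / (1 - ist["pscore"]))

# Optional truncation to guard against extreme weights; the 99th-percentile cap
# used here is purely illustrative.
cap = ist["ipw"].quantile(0.99)
ist["ipw_trunc"] = ist["ipw"].clip(upper=cap)
```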

Once calculated, the weights determine the extent to which each patient contributes to the pseudo-sample.
For example, the 9-th patient is in the exposed group with p_i = 0.365, so the weight is 1/0.365 = 2.74,
which represents 2.74 units in the pseudo-sample. On the other hand, the 1-st patient is in the unexposed
group with p_i = 0.226, so the weight is 1/(1 − 0.226) = 1.29, which represents 1.29 units in the pseudo-sample.
We do this for all the patients in the study. After the pseudo-sample has been created, the balance of patient
characteristics should be compared across the different groups using standardised differences, as was done
for the matching methods. We use the IST data to illustrate this procedure. In Fig. 9.2, we show a plot of the
SMD of all the confounders for the original, unweighted data and for the data weighted using propensity scores. As
observed in the plot, before weighting, a few of the confounders have SMD larger than 0.2 (see also Table 9.2,
column (a)); following weighting, all the confounders have SMD close to 0.

Figure 9.2: Plot of SMD for all confounders before and after weighting using propensity scores
[Figure: dot plot of the standardised mean difference (horizontal axis, roughly 0 to 0.4) for each confounder (SEX=M, AGE, RASP3=Y, RCONSC=F, RCT=Y, RATRIAL=Y, RVISINF=Y, STYPE, DEF), comparing the unweighted sample with the propensity-score-weighted sample.]

Following this, the outcomes in the exposed and unexposed groups are ready for comparative
assessment. In the pseudo-sample, the weights give pseudo “sample sizes” of the exposed and unexposed
groups of, respectively, $n'_1 = \sum_{i \in X=1} w_i = 8955.696$ and $n'_0 = \sum_{i \in X=0} w_i = 9002.683$. We then estimate the
causal effect by taking the weighted mean difference
\[
\frac{\sum_{i \in X=1} w_i Y_i}{n'_1} - \frac{\sum_{i \in X=0} w_i Y_i}{n'_0} = 0.6140489 - 0.613926 = 0.00012.
\]
We can also find the standard error (SE) of the causal effect estimate by
\[
\sqrt{\frac{\sum_{i \in X=1} w_i^2 s_1^2}{{n'_1}^2} + \frac{\sum_{i \in X=0} w_i^2 s_0^2}{{n'_0}^2}} = 0.01135,
\]


where s_1, s_0 are the standard deviations of the Y s in the exposed (X = 1) and unexposed (X = 0) groups,
respectively. Notice that the expression for the SE is simply a weighted version of the usual SE for the comparison of
two sample means. Using this SE gives an approximate 95% confidence interval of 0.00012±1.96×0.01135 ≈
(−0.022, 0.022).
Since the outcome Y is binary (Y = 1 for dead/disabled at 6 months and 0 otherwise), we can also
express the causal effect using an odds ratio (OR), similar to Chapter 9.5. We use a logistic regression. Since IPW
adjusts for the confounders AGE, SEX, etc., the logistic regression no longer needs to include the confounders.
The logistic regression model is
\[
\mathrm{logit}\, P(Y = 1) = a + bX.
\]
However, since each observation is now weighted, we need to fit a weighted version of the logistic regression,
using the IPW weights we calculated earlier. We give the results in Table 9.5, column (b). Notice that the
coefficient for X is now 0.0002, a non-significant result. For comparison, we reproduced column (a) of Table 9.3,
with no re-weighting, to show the effects of re-weighting using the IPW weights. The unadjusted analysis
shows a significant coefficient for X, without properly taking the confounders into consideration. For the
IPW analysis, we can calculate the OR estimate to be exp(0.0002) = 1.0002. This is the OR of Y = 1 between
the exposed (X = 1) and unexposed (X = 0) groups. Since the estimate is almost 1, it indicates no difference
between the groups. We can also find a 95% confidence interval for the OR. However, in this case, we cannot
directly use the SE of the model coefficient of X from the logistic regression output. This is because the
logistic regression is fitted using weights, and hence the results behave as if we had $n'_1 = 8955.696$ and
$n'_0 = 9002.683$ observations. The SE of the model coefficient has to be adjusted. We use a so-called robust
sandwich estimator to estimate the SE, which gives 0.0183. Applying this estimate gives an approximate
95% confidence interval for the odds ratio of exp(0.0002 ± 1.96 × 0.0183) = (0.96, 1.037).
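One way to reproduce this kind of analysis is sketched below, assuming a reasonably recent version of statsmodels and the hypothetical ist data frame with its ipw column; freq_weights multiplies each observation's log-likelihood contribution by its weight, and cov_type="HC0" requests a sandwich-type covariance.

```python
# Hedged sketch: IPW-weighted logistic regression with a robust (sandwich) SE.
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

ipw_fit = smf.glm(
    "Y ~ X", data=ist,
    family=sm.families.Binomial(),
    freq_weights=ist["ipw"],          # each patient contributes w_i "units"
).fit(cov_type="HC0")                 # sandwich estimator for the standard errors

b, se = ipw_fit.params["X"], ipw_fit.bse["X"]
print(np.exp(b), np.exp([b - 1.96 * se, b + 1.96 * se]))   # OR and approx. 95% CI
```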
A word of caution when carrying out IPW: because the weights are calculated using the inverse
of the propensity scores, when a propensity score is very small the resulting weight can be
extremely large. This situation can arise if an observation is very unlikely to be in the exposed (or unexposed)
group but did appear in that group. In such a situation, an option is to limit the weights to certain
upper bounds, so that no observation has a disproportionately large influence on the results.

Table 9.5: Logistic regression of IST data (a) no re-weighting (b) re-weighting using IPW weights

Dependent variable:
Y (DEAD/DISABLE)
(a) (b)
Constant 0.518∗∗∗ −0.488∗∗∗
(0.027) (0.008)

X −0.128∗∗∗ 0.0002
(0.046) (0.012)

Observations 8,992 8,992


Log Likelihood −5,982.763 −11,560.580
Akaike Inf. Crit. 11,969.530 23,125.160

Note: ∗ p<0.1; ∗∗ p<0.05; ∗∗∗ p<0.01

10
Simulation Methods

Computer simulation is a powerful tool that is often used to guide statistical practice. Simulations give us a
way to understand complex problems when their behaviour is too complicated to predict theoretically. For
example, we may be able to work out mathematically how an estimate behaves in idealised situations (e.g.,
normally distributed data, a large sample size, and independent observations). These idealised assumptions are
almost never precisely met in practice. The advent of computers allows us to simulate these non-ideal practical
situations and carry out calculations without the need for complicated mathematical derivations.
In computer simulation studies, pseudo-datasets are generated under controlled conditions. Statistical
procedures of interest are then applied to these pseudo-datasets and the results recorded and analysed. Here,
we illustrate three different ways simulations are used.

10.1 The Bootstrap

In healthcare research we study a sample of individuals to learn about a target population. Estimates of
interest, such as a mean or a difference in proportions, are calculated, usually accompanied by a confidence
interval derived from the standard error. A confidence interval is a measure of the uncertainty in our
estimate. It gives a plausible range for the population quantity of interest that would lead to samples like
the data we are studying. To obtain a confidence interval, we need to make assumptions about the data.
The usual assumptions are that the sample is large, that the observations are independent of each other,
or that the observations follow a normal distribution. These assumptions are often made in good faith but are seldom
verifiable. For example, for certain population quantities, a sample size of n = 30 is large enough, but for
other quantities even n = 300 may not be sufficient. For data where such assumptions may not be met or cannot
be verified, the bootstrap provides an alternative way to estimate standard errors and confidence intervals
without relying on strong assumptions. The bootstrap uses the observed data to create multiple datasets
from the observed data without needing to make strong assumptions. The bootstrap is based on the following
simple principle: since the observed sample is representative of the target population, a set of randomly
chosen observations from the observed sample will be equally representative of the target population. We can
generate a sample of the same size as the original data set by randomly choosing observations one at a time,
with replacement. Each observation has an equal chance of being chosen each time, so some observations will be selected more
than once and some will not be selected at all.

10.1.1 Estimating standard error and confidence interval of the mean

To illustrate the principles of the bootstrap, we will start with a simple example using the LOS data from
Example 2.1. In the data, there are n = 46 patients in ward =3 between 1988 and 1990. Suppose we are


interested in using these data to estimate the mean LOS for all ward = 3 admissions in the city during the same period.
Then normally we would use the sample mean of the observed data, which is given in Table 10.1, first row.
The data show there are 3 admissions with LOS = 1 day, 24 with LOS = 2 days, 3 with LOS = 3 days, and so on,
up to 2 admissions with LOS = 22 days. Based on the data, we can easily obtain the sample mean as
\[
\bar X = \frac{1}{46}(3 \times 1 + 24 \times 2 + 3 \times 3 + \dots + 2 \times 22) = 5.804.
\]
We can also calculate the standard error of the mean using
\[
\mathrm{SE}(\bar X) = \frac{s}{\sqrt{46}} = \frac{1}{\sqrt{46}}\sqrt{\frac{3(1-\bar X)^2 + 24(2-\bar X)^2 + \dots + 2(22-\bar X)^2}{46-1}} = 0.916.
\]
Based on these numbers, we can, assuming the sample size n is large, arrive at an approximate 95% confidence
interval of (5.804 ± 1.96 × 0.916) = (4.01, 7.6).

Number of admissions
LOS (days) 1 2 3 4 7 8 9 10 12 14 16 19 21 22
original 3 24 3 1 1 2 2 1 2 2 1 1 1 2
bootstrap 1 4 28 3 0 1 0 0 2 0 4 0 2 1 1
bootstrap 2 2 23 2 1 0 3 2 0 3 2 0 2 1 5
bootstrap 3 1 26 3 0 0 1 2 0 2 3 2 2 1 3
bootstrap 4 4 25 3 1 0 2 3 2 2 1 1 1 1 0
bootstrap 5 5 26 1 0 1 2 3 2 1 4 0 1 0 0
bootstrap 6 4 24 5 1 0 2 1 1 3 1 1 1 1 1
bootstrap 7 3 22 3 0 1 3 0 3 1 1 0 4 4 1
bootstrap 8 1 17 4 2 2 3 4 0 2 2 1 3 2 3
bootstrap 9 3 28 2 0 0 3 2 1 2 0 1 2 2 0
bootstrap 10 3 31 0 0 1 1 0 0 4 2 2 0 1 1

Table 10.1: Original data and 10 bootstrap samples of LOS for ward=3

We can use bootstrap samples to approximate the standard error of the mean. Table 10.1 shows 10
bootstrap samples from the observed data. Each bootstrap sample is obtained by sampling with replacement
from the observed data, each with the same sample size as the observed sample, n = 46. For example,
we can see that the first bootstrap sample has 4 observations with LOS = 1, compared to 3 observations
with LOS = 1 in the original data. This is a result of sampling with replacement: at least one of the 3
observations with LOS = 1 has been sampled more than once. Similarly, there are now 28 observations with LOS = 2,
compared to 24 in the original data; and there is no observation with LOS = 4, compared to 1 observation in
the original data, etc. Using this bootstrap sample, we can calculate a bootstrap sample mean:
\[
\bar X^* = \frac{1}{46}(4 \times 1 + 28 \times 2 + 3 \times 3 + \dots + 1 \times 22) = 5.065.
\]
For the other 9 bootstrap samples, we can also calculate bootstrap sample means; they are 6.130, 7.087,
6.783, 4.457, 6.804, 6.370, 4.630, 5.783, 6.891. Using these bootstrap sample means, we can estimate SE(X̄)
by taking the standard deviation of the 10 bootstrap sample means, as follows. The average of the 10 bootstrap
sample means is $\frac{1}{10}(5.065 + 6.130 + \dots + 6.891) = 6$. The standard deviation of the bootstrap sample means
is
\[
\sqrt{\frac{(5.065-6)^2 + (6.130-6)^2 + \dots + (6.891-6)^2}{10-1}} = 0.976.
\]
This figure is quite similar to the earlier estimate of SE(X̄) = 0.916. However, the estimate of 0.916 requires
us to know that SE(X̄) = s/√n. The bootstrap estimate of 0.976 does not require that knowledge; we only
need to understand that the standard error of an estimate measures the variation of the estimate due to random
sampling.
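The whole procedure takes only a few lines of code. The sketch below reconstructs the 46 LOS values from the counts in Table 10.1 and bootstraps the mean; the number of bootstrap samples and the random seed are arbitrary.

```python
# Bootstrap of the mean LOS; the data are reconstructed from the counts in Table 10.1.
import numpy as np

los_values = [1, 2, 3, 4, 7, 8, 9, 10, 12, 14, 16, 19, 21, 22]
counts     = [3, 24, 3, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2]
los = np.repeat(los_values, counts)                    # n = 46 observations

rng = np.random.default_rng(0)
boot_means = np.array([rng.choice(los, size=len(los), replace=True).mean()
                       for _ in range(1000)])

print(los.mean())                                      # sample mean, about 5.80
print(boot_means.std(ddof=1))                          # bootstrap estimate of SE(X-bar)
print(np.percentile(boot_means, [2.5, 97.5]))          # percentile 95% confidence interval
```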


In practice, for any particular problem, we would draw many more than 10 bootstrap samples. We
repeated the process and drew 1000 bootstrap samples, and for each we calculated the bootstrap sample
mean. We plotted these sample means as a histogram in Fig. 10.1. The histogram shows a unimodal frequency
distribution of the 1000 bootstrap sample means. The mean of these 1000 bootstrap sample means is 5.83,
which is very close to X̄ from the original data. Recall that the idea of the bootstrap is to
treat the original data as if they were the target population; the histogram therefore shows the distribution of the
means of “samples” drawn from this hypothetical population. This histogram is called the bootstrap distribution of X̄. We can
approximate a 95% confidence interval for X̄ by using the 2.5 and 97.5 percentiles of the bootstrap
distribution, which are (4.11, 7.72). We can compare this confidence interval to the one we obtained earlier,
which is (4.01, 7.6). The earlier confidence interval requires the assumption that n is large, so that the Central
Limit Theorem (CLT) can be used to construct a confidence interval of the form X̄ ± 1.96 SE(X̄). In contrast,
using the bootstrap distribution, there is no need to assume that n is sufficiently large for the CLT to apply.

Figure 10.1: Histogram of 1000 bootstrap sample means of LOS data
[Figure: histogram of the bootstrap sample means (roughly 4.1 to 7.4), with the 2.5 and 97.5 percentiles marked.]

10.1.2 Estimating standard error and confidence interval of an odds-ratio

In Chapter 9.6, we analysed the causal treatment effects problem using inverse probability weighting (IPW) to
adjust for potential confounding. We measured causal treatment effects using odds ratio via a weighted logistic
regression. Due to weighting, the standard error had to be estimated using a robust sandwich estimator. Here,
we show how bootstrapping can be used instead.
We drew 5000 bootstrap samples from the original sample of 8992 observations. In each bootstrap
sample, the propensity score was estimated by regressing X (exposure) on the confounders using a logistic
regression model. The propensity scores were used to form weights. The outcome Y (death/disable at 6
months) was regressed on X using a weighted logistic model fit to each bootstrap sample. Finally, the standard
deviation of the estimated log-odds ratio (OR) (i.e., the estimated regression coefficient for X ) across the
5000 bootstrap samples was used as the bootstrap estimate of the standard error of the estimated regression
coefficient obtained in the original sample. Doing this gives an estimated standard error of 0.0162. We can
construct a 95% confidence interval as exp[0.0002 ± 1.96 × 0.0162], where 0.0002 denotes the estimated
effect in the original sample from Table 9.5, column (b), which gives (0.97,1.03).
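A compact sketch of this bootstrap (hypothetical ist data frame, and a deliberately small number of resamples for illustration) is given below; each resample re-estimates the propensity scores, the weights and the weighted logistic model.

```python
# Hedged sketch: bootstrap the SE of the IPW log-odds ratio.
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

def ipw_logor(df):
    """Re-estimate propensity scores, weights and the weighted logistic model on df."""
    ps = smf.logit("X ~ AGE + C(SEX) + C(RASP3) + C(RCONSC) + C(RCT) + C(RATRIAL) "
                   "+ C(RVISINF) + C(STYPE) + DEF", data=df).fit(disp=0).predict(df)
    w = np.where(df["X"] == 1, 1 / ps, 1 / (1 - ps))
    fit = smf.glm("Y ~ X", data=df, family=sm.families.Binomial(), freq_weights=w).fit()
    return fit.params["X"]

boot = []
for b in range(200):                                              # small B for illustration only
    resample = ist.sample(n=len(ist), replace=True, random_state=b)  # resample rows with replacement
    boot.append(ipw_logor(resample))
print(np.std(boot, ddof=1))          # bootstrap SE of the estimated log-odds ratio
```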
Similar to Chapter 10.1.1, another method of constructing a confidence interval using the bootstrap is to


use the bootstrap distribution directly. Fig. 10.2 shows the bootstrap distribution of the 5000 bootstrap ORs.
We use the lower and upper 2.5 percentiles as our lower and upper 95% confidence limits, which gives (0.96,
1.03). This confidence interval is almost identical to the one obtained in the last paragraph. In this problem,
the original sample size of n = 8992 is very large, so it is not surprising that the confidence interval
obtained here is similar to the earlier one, which relies on the CLT and in turn requires a large sample.
We observe that by using the bootstrap, we can avoid the robust sandwich estimator. The two results using
the bootstrap are similar to those in Chapter 9.6.

Figure 10.2: Histogram of 5000 bootstrap odds-ratios (OR) in IST data
[Figure: histogram of the bootstrap ORs, with the 2.5 percentile (0.96) and 97.5 percentile (1.03) marked.]

10.2 Cross-validation

The assessment of a model can be overly optimistic if the data used to fit the model are also used in the
assessment of the model. This is especially relevant since almost all models we study and develop are for
the purpose of application in the target population from which the data are drawn. For example, an important
goal in analysing the LOS data in Chapter 4 is to come up with a model that helps to predict incidence
of inappropriate hospitalisation for future admissions. This means that, as part of the process of finding a
model, and for evaluating how well the model generalises, we need to test the model on a set of data that is
independent of the data used for fitting the model. Very often, however, we do not have an independent set
of data. In those situations, cross-validation (CV) is a class of general methods that can be used.
In CV, the data are divided into two subsamples, a calibration or training sample and a validation or test
sample. The training sample is used to estimate the model; the test sample is used to estimate the expected
performance. The simplest example of CV is to split the data randomly into two halves, and use each half in
turn as the training sample and the other as the test sample. This is called a two-fold CV. Another option is
to remove one observation from the data at a time and use the remaining n − 1 observations as the training
sample. The process is then repeated for all n observations in the data, and the CV criterion is the average,
over these repetitions, of the estimated expected discrepancies. This gives rise to an n-fold CV, which is also
called leave-one-out CV. In between these two extremes is K-fold CV, where the data are randomly
divided into K disjoint sets of (approximately) equal size n/K. For example, in a five-fold CV, the data are
partitioned into a training sample (80% of the data) and a test sample (20% of the data). The procedure is
performed five times, using a different test partition (20% of the data) each time, and the performance is averaged

over the five test partitions. This approach necessitates heavy computation as the number of folds of the CV
increases; K is often chosen to be equal to 5 or 10.
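A minimal K-fold loop can be written with scikit-learn; the sketch below uses simulated data and illustrative names, and simply averages the test-fold accuracy over the folds.

```python
# A minimal K-fold cross-validation loop on simulated data; names are illustrative.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])       # fit on training folds
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))  # evaluate on held-out fold
print(np.mean(scores))                                                  # averaged over the 5 folds
```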

10.2.1 Prediction using a logistic model

In Chapter 4, we used the LOS data and fitted a logistic regression model for predicting Inap (inappropriate
hospitalisation). We found (Table 4.3) that using logLOS and Year as independent variables, and a threshold
of 0.5 to classify inappropriate hospitalisation (Inap), the model gave a true positive rate (TPR) of 405/(405 +
215) = 0.653 and specificity (true negative rate) of 567/(567 + 196) = 0.743 when it is applied to predict the data
that were used for fitting the model. The question is how the model would behave if it were applied to future
data. To answer this question, we split the data into two sub-samples, a training sample and a test sample.
The test sample is used to mimic the situation when the model is applied to future data. We first illustrate the
idea by randomly splitting the data into two equal halves, and we fix the independent variables as logLOS and
Year. The training sample has n = 691 observations while the test sample has n = 692 observations (since
the sample size of the original data is odd, the two halves are not exactly the same size). Based
on the training data, we obtained a fitted model of logit P(Inap = 1) = −0.185 + 1.255 logLOS + 0.799 Year, so
that P(Inap = 1) = exp[η]/(1 + exp[η]) with η = −0.185 + 1.255 logLOS + 0.799 Year. We then applied this model to predict Inap in
the test sample, using 0.5 as a threshold to define Inap. The results for the first 10 patients in the test sample
are given in Table 10.2.

LOS  Year  Predicted probability of Inap = 1  Predicted outcome  Actual outcome
15 88 0.580 1 0
8 88 0.386 0 1
9 88 0.421 0 1
10 88 0.454 0 1
8 90 0.583 1 0
8 90 0.583 1 0
14 90 0.738 1 0
10 90 0.649 1 1
10 90 0.649 1 1
9 90 0.618 1 0

Table 10.2: Predicted probability, predicted outcome (Inap=1 or 0) and actual outcome of first 10 patients in
the test sample

In Table 10.3, we illustrate the difference between applying the model to the training sample and to the test
sample. Table 10.3 (a) shows the cross-tabulation of predicted vs. actual outcome using the training data. The
corresponding overall accuracy is (300 + 199)/(300 + 199 + 91 + 101) = 0.72. Table 10.3 (b) shows the results when the model is
used to predict the test sample; the overall accuracy is (267 + 206)/(267 + 206 + 105 + 114) = 0.68.

Table 10.3: Prediction of Inap in the LOS data using: (a) Training sample and (b) Test sample

(a) Training (b) Test


Predicted Predicted
Actual 0 1 Actual 0 1
0 300 91 0 267 105
1 101 199 1 114 206

In practice, we would be using a K-fold CV, where K is usually 5 or 10. We carried out a 10-fold CV using the
LOS data. Our interest is in the predictive performance of model (3) in Table 4.3, i.e., with independent
variables logLOS and Year. In each fold, 90% of the data are retained as a training sample to fit the model, and

then the model is applied to predict the remaining 10% of the data, the test data. This process is repeated
for the remaining 9 folds, and the results are averaged over all 10 folds. Recall from Chapter 4 that the ROC
(Receiver Operating Characteristic) curve and the area under the ROC curve (AUC) are commonly used to
assess the performance of binary response models such as logistic models. We applied these metrics to evaluate
performance. The results are given in Fig. 10.3. In Fig. 10.3 (a), each fold produces an ROC curve, and the
average of these curves is also shown. The mean area under the ROC curves (AUC) is 0.772. For comparison, we
repeated a 10-fold CV using model (2) in Table 4.3, which uses, in addition to logLOS and Year, three additional
independent variables, Sex, Age, and Ward. This model would theoretically give a better fit to the original data
than model (3), because compared to model (3) it uses additional independent variables. However, we showed in
Chapter 4, by a likelihood ratio test, that model (2) did not give a significantly better fit to the original data than
model (3). Here, we evaluate model (2) again using test data. The results are given in Fig. 10.3 (b). The mean AUC
for model (2) is 0.765, which is lower than the mean AUC for model (3). This is also an example of how CV can be
used for model selection. Between models (2) and (3), model (2) is more complex, with three additional independent
variables. The additional complexity allows model (2) to give a better, though not statistically significantly better,
fit to the original data. However, the additional complexity becomes a disadvantage when the model is used to predict
test data. While we do not suggest that a more complex model is always undesirable for predicting test (future) data,
the example does highlight the trade-off between the complexity and the generalisability of a model. Whether we
choose a more complex or a simpler model, the choice needs to be balanced against the sample size, the
number of variables and the purpose it serves.

Figure 10.3: 10-fold cross-validation ROC curves for LOS data (a) using model (3) in Table 4.3 and (b) using model (2) in Table 4.3
[Figure: panels (a) and (b) each show the ROC curve (sensitivity against specificity) for each fold together with the mean ROC curve; mean AUC = 0.772 in panel (a) and 0.765 in panel (b).]

10.2.2 Choosing the shrinkage parameter λ in LASSO

In Chapter 8.3 we considered LASSO for dimension reduction in problems where we wish to use p independent
variables X_1, ..., X_p to model Y. For a sample of n observations, the LASSO minimises
\[
\sum_i \mathrm{loss}\big[y_i,\,(a + b_1 x_{1i} + \dots + b_p x_{pi})\big] + \underbrace{\lambda(|b_1| + \dots + |b_p|)}_{\text{penalty}}
\]


where λ > 0 acts as a shrinkage parameter. A large value of λ leads to greater dimension reduction, in the sense
that more of the independent variables will be removed from the model. Choosing an appropriate value of λ
is important. Since changing the value of λ leads to different independent variables being used in the model, this
can also be considered a model selection problem; see, for example, Chapter 3.7. What is considered a good
choice of λ depends on our goal, for example, whether we are interested in prediction or in identifying the right
model for interpretation. Here, we will focus on the problem of prediction accuracy. In this kind of problem,
we are primarily interested in using the model to predict future outcomes, and the prediction accuracy can be
summarised as the average loss, $E\{\sum_i \mathrm{loss}[y_i,\,(a + b_1 x_{1i} + \dots + b_p x_{pi})]\}$, for any future set of n observations,
where E denotes expectation (mean).
To find λ, we need to evaluate $E\big\{\sum_i \mathrm{loss}[\,y_i,\ (a + b_1 x_{1i} + \dots + b_p x_{pi})\,]\big\}$, which in turn requires “future”
observations that are yet to be observed. To solve this problem using the data at hand, we can use a K-fold cross-validation
and split the data into training and test samples. For each fold, the training sample is used to
fit the model, giving estimates â, b̂_1, ..., b̂_p; some of the b̂s will be zero as a result of the LASSO dimension
reduction. The fitted model is then applied to the test sample to evaluate the loss $\sum_{i\in\mathrm{test}} \mathrm{loss}[\,y_i,\ (\hat{a} + \hat{b}_1 x_{1i} + \dots + \hat{b}_p x_{pi})\,]$.
The losses over the K folds are then averaged to form an estimate of the average loss.
We used a 10-fold cross-validation on the training data. For each value of λ, a different model
was derived and the average loss over the 10 folds recorded. The results are plotted in Fig. 10.4. Since the
outcome here is binary (benign vs. malignant growth), we used the deviance rather than squared error loss. In
Fig. 10.4, the estimated deviance, along with −1 SE and +1 SE limits, is plotted
against the value of λ. We observe that the estimated deviance reaches its minimum at λ = 0.0023. Using
this value of λ, the model retains 18 of the 30 original variables.^a An alternative is to choose a larger
value of λ, as long as its deviance is no more than 1 SE above the deviance at the minimum, λ = 0.0023. Doing this gives
λ = 0.0064. Since this λ value is larger, it leads to more shrinkage: using λ = 0.0064, only 10 of the 30
variables are retained.
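
The sketch below illustrates this workflow in Python (scikit-learn); it is not the code used to produce Fig. 10.4. The λ grid, the glmnet-style mapping between λ and scikit-learn's C parameter, and the use of the built-in breast cancer data set (which has 30 predictors but need not be the same version of the data as in the text) are all assumptions made for illustration.

```python
# Sketch of choosing the LASSO shrinkage parameter by 10-fold CV and the 1-SE rule.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)        # LASSO needs comparable scales; scaled once for simplicity

lambdas = np.logspace(-4, 0, 30)             # candidate lambda values (assumed grid)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

mean_dev, se_dev = [], []
for lam in lambdas:
    fold_dev = []
    for tr, te in cv.split(X, y):
        # assumed glmnet-style scaling: C is roughly 1 / (n * lambda)
        fit = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=1.0 / (lam * len(tr)), max_iter=1000)
        fit.fit(X[tr], y[tr])
        p = fit.predict_proba(X[te])[:, 1]
        fold_dev.append(2 * log_loss(y[te], p))   # binomial deviance per observation
    mean_dev.append(np.mean(fold_dev))
    se_dev.append(np.std(fold_dev, ddof=1) / np.sqrt(cv.get_n_splits()))

mean_dev, se_dev = np.array(mean_dev), np.array(se_dev)
i_min = int(mean_dev.argmin())               # lambda minimising the CV deviance
within_1se = mean_dev <= mean_dev[i_min] + se_dev[i_min]
lam_1se = lambdas[within_1se].max()          # 1-SE rule: largest lambda within 1 SE of the minimum
print(f"lambda_min = {lambdas[i_min]:.4f}, lambda_1se = {lam_1se:.4f}")
```

The exact values of λ_min and λ_1SE will depend on the random splits, as noted in the footnote below.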

Figure 10.4: Choosing λ in LASSO for breast cancer data

[Cross-validated deviance (loss), with −1 SE and +1 SE limits, plotted against λ on a log scale; the deviance is minimised at λ = 0.0023, and the 1-SE choice is λ = 0.0064.]

^a Since cross-validation randomly splits the data into K folds, the optimal value of λ, and hence the number of variables retained, is sensitive to the randomness of the splits.


10.3 Comparison of alternate approaches

In Chapters 7.3 and 7.5, we discussed the connections between linear discriminant analysis (LDA) and logistic
regression in classification problems. Here, we illustrate how simulations can be used to compare their
behaviours under various scenarios.
The simulation study consists of three experiments, under the following settings. We assume that the data
consist of two classes, A and B. Each observation is represented by two independent variables (X_1, X_2) drawn from
bivariate normal distributions (µ_A, Σ) and (µ_B, Σ) in classes A and B, respectively. Without loss of generality,
we set µ_A = (0, 0)^T, while µ_B is set in the direction of the first principal component of Σ, i.e., the direction with the greatest
variance. Throughout the three experiments, we used

$$\Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}.$$

The separation of the two classes can be measured by the Mahalanobis distance $\Delta = \sqrt{(\mu_A - \mu_B)^T \Sigma^{-1} (\mu_A - \mu_B)}$.
Fig. 10.5 shows two situations with hypothetical samples of n = 100 observations each from the two classes, with ∆ = 1 and ∆ = 2.5. The goal
of a discrimination procedure, e.g., LDA or logistic regression, is to use data on (X_1, X_2) to predict the class
membership of each observation.
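
A brief sketch of this data-generating step is given below (Python/NumPy). The function name and the seed are illustrative; the key points are the shared covariance Σ, the placement of µ_B along the first principal component of Σ, and the scaling that fixes the Mahalanobis distance at a chosen ∆.

```python
import numpy as np

Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

def make_classes(delta, n, rng):
    """Generate n observations from each of classes A (coded 0) and B (coded 1)."""
    # first principal component of Sigma: eigenvector with the largest eigenvalue
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    v = eigvecs[:, np.argmax(eigvals)]
    # scale v so the Mahalanobis distance between the two class means equals delta
    mu_B = (delta / np.sqrt(v @ np.linalg.inv(Sigma) @ v)) * v
    X_A = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
    X_B = rng.multivariate_normal(mu_B, Sigma, size=n)
    X = np.vstack([X_A, X_B])
    y = np.repeat([0, 1], n)
    return X, y

rng = np.random.default_rng(1)                   # seed chosen arbitrarily
X, y = make_classes(delta=2.5, n=100, rng=rng)   # a sample like Fig. 10.5 (b)
```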

Figure 10.5: Two classes A and B each of n = 100 from bivariate normal distributions separated by
Mahalanobis distance (a) 1 and (b) 2.5

[Panels (a) and (b): scatter plots of X_2 against X_1 for hypothetical samples from classes A and B, with ∆ = 1 in (a) and ∆ = 2.5 in (b).]

10.3.1 Experiment 1

In the first experiment, we investigate the effect of the separation between the two classes on the
performance of the two methods. We vary the Mahalanobis distance over ∆ = 0.5, 1, 2.5 and 5, and use
500 simulation runs. In each simulation run, we generated n = 200 observations from each of the two
classes. To evaluate the methods, we used a 5-fold CV to split the data into training and
test samples, and evaluated performance in two ways. First, we fixed a threshold of 0.5, classifying
a test observation into class B if its predicted probability exceeds 0.5 and into class A otherwise, and we recorded the overall accuracy =
(TP + TN)/(TP + TN + FP + FN) for each method, where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives. Second, we calculated the area under the
ROC curve (AUC) for each method. The results are given in Fig. 10.6, which shows grouped boxplots
(see Fig. A.1 for an explanation) of the results; each boxplot
shows the spread of results over the 500 simulation runs. Overall, there are no clear differences between the two
methods. The plots do show that as the separation of the two classes increases, with the Mahalanobis distance going from 0.5 to 5, the performance of both methods improves. At ∆ = 0.5, the
median overall accuracy is only around 0.6, but at ∆ = 5 the medians are nearly 1. An interesting observation
from the boxplots is that at ∆ = 0.5 the boxes are very tall for both methods compared with those at larger
values of ∆. These results show that when the class separation is small, the performance of either method
can vary considerably from one simulation run to another; in contrast, when the class separation is large, the
performance is very similar across simulation runs.
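
For concreteness, a single simulation run of this experiment could look like the sketch below; it is illustrative rather than the authors' code, and it reuses make_classes() from the earlier sketch. Experiment 1 averages 500 such runs for each value of ∆.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, roc_auc_score

def one_run(delta, n, rng):
    """One simulation run: 5-fold CV accuracy and AUC for LDA and logistic regression."""
    X, y = make_classes(delta, n, rng)       # data-generating function from the earlier sketch
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    results = {}
    for name, model in [("LDA", LinearDiscriminantAnalysis()),
                        ("Logistic", LogisticRegression(max_iter=1000))]:
        acc, auc = [], []
        for tr, te in cv.split(X, y):
            fit = model.fit(X[tr], y[tr])
            p = fit.predict_proba(X[te])[:, 1]
            acc.append(accuracy_score(y[te], (p > 0.5).astype(int)))  # 0.5 threshold
            auc.append(roc_auc_score(y[te], p))
        results[name] = {"accuracy": np.mean(acc), "AUC": np.mean(auc)}
    return results

print(one_run(delta=2.5, n=200, rng=np.random.default_rng(2)))
```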

Figure 10.6: Experiment 1: effect of class separation on performance (a) overall accuracy and (b) AUC

[Grouped boxplots of (a) overall accuracy and (b) AUC for LDA and logistic regression over the 500 simulation runs, plotted against Mahalanobis distance 0.5, 1, 2.5 and 5.]

10.3.2 Experiment 2

In the second experiment, we examine the effect of sample size on the performance of the two methods.
We fixed the Mahalanobis distance at ∆ = 2 and used 500 simulation runs. In each simulation run, we
generated n = 50, 100, 150 or 200 observations from each of the two classes, and again used a 5-fold CV to
evaluate the performance of the methods. The results are given in Fig. 10.7. Once again, the figure shows that
across all scenarios studied the two methods perform similarly. Sample size does not have a marked impact
on either overall accuracy or AUC; across all sample sizes examined, the median overall accuracy is just below
0.85 and the median AUC is between 0.9 and 0.95. Notice, however, that the boxplots are much taller for
sample size n = 50, suggesting that at this sample size there is more variability in performance.

10.3.3 Experiment 3

In the final experiment, we study the effect of class size discrepancy on the performance of the two
methods. In classification problems involving two classes, the class sizes are rarely identical. The class
with more observations is called the majority class and the one with fewer observations is called the minority
class. Many practical problems involve a high imbalance in the sizes of the two classes.
For example, in a population of patients affected by COVID, most would not need to be admitted to the ICU. In this
experiment, we fixed the Mahalanobis distance at ∆ = 2 and used 500 simulation runs. In each simulation
run, we generated n = 200 observations from the majority class (class A) and n = 40, 80 or 120 observations
from the minority class (class B). We continued to use a 5-fold CV to evaluate the performance of the methods.


Figure 10.7: Experiment 2: effect of sample size on performance (a) overall accuracy and (b) AUC

[Grouped boxplots of (a) overall accuracy and (b) AUC for LDA and logistic regression over the 500 simulation runs, plotted against sample size n = 50, 100, 150 and 200.]

The results are given in Fig. 10.8. There are no clear differences between the two methods.
Furthermore, when performance is measured using AUC, it is not affected by the size of the minority class.
In contrast, when overall accuracy is the metric, both methods actually appear to perform better
when the imbalance is greater (a smaller minority class). This result seems counter-intuitive,
as we might have expected the methods to perform better when both classes have large sample
sizes. To investigate the source of this result, we printed a confusion table for one of the simulation runs
using LDA (Table 10.4). Table 10.4 (a) shows the prediction results when the minority class (B) has n = 40
observations in the entire sample. Since the table is based on the test sample, the total sample size is only
48, with 43 in class A and only 5 in class B. The true negative rate (TNR) is 41/43 but the true positive rate
(TPR) is only 3/5. Nevertheless, the overall accuracy is (41 + 3)/(41 + 3 + 2 + 2) ≈ 0.92. In contrast, when the minority class (B)
has n = 120 observations in the entire sample (Table 10.4 (b)), the true negative rate (TNR) is 40/44 and
the true positive rate (TPR) is 16/20. These seemingly more favourable numbers give an overall accuracy of only
(40 + 16)/(40 + 4 + 4 + 16) ≈ 0.88, which is lower than in (a). Notice that in (a), even though LDA did poorly on the
minority class, this hardly affects the overall result because the majority class dominates. It tells us that, when
there is great imbalance, a method does not need to do well on the minority class if overall accuracy is
the metric used to measure performance. This, of course, ignores the loss incurred by misclassifying the minority class.
The issue of imbalance is beyond the scope of this course; the purpose here is to illustrate how
simulations can aid the evaluation and comparison of statistical methods.

Table 10.4: Confusion tables using LDA (a) Minority class n = 40 and (b) Minority class n = 120

(a) (b)
Predicted Predicted
Actual A B Actual A B
A 41 2 A 40 4
B 2 3 B 4 16
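
The small helper below simply re-computes the rates and overall accuracy from the counts in Table 10.4, confirming the arithmetic above; the function name is illustrative.

```python
# Re-computing TNR, TPR and overall accuracy from the confusion tables above.
def summarise(tn, fp, fn, tp):
    tnr = tn / (tn + fp)                     # true negative rate (class A recovered)
    tpr = tp / (tp + fn)                     # true positive rate (class B recovered)
    acc = (tp + tn) / (tp + tn + fp + fn)    # overall accuracy
    return round(tnr, 2), round(tpr, 2), round(acc, 2)

# (a) minority class n = 40: TNR = 41/43, TPR = 3/5, accuracy approximately 0.92
print(summarise(tn=41, fp=2, fn=2, tp=3))
# (b) minority class n = 120: TNR = 40/44, TPR = 16/20, accuracy approximately 0.88
print(summarise(tn=40, fp=4, fn=4, tp=16))
```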


Figure 10.8: Experiment 3: effect of class size imbalance on performance (a) overall accuracy and (b) AUC

[Grouped boxplots of (a) overall accuracy and (b) AUC for LDA and logistic regression over the 500 simulation runs, plotted against minority class size 40, 80 and 120.]

Appendices

Table A.1: Data for the multiple linear regression of the cross-country life expectancy study

Code Country Lexp^a Health.exp^b Literacy^c Daily.caloric^d IncomeGroup Physicians^e GDP^f


AFG Afghanistan 64.49 49.84 38.17 2090 Low income 0.28 493.75
AGO Angola 60.78 87.62 71.16 2473 Lower middle income 0.21 3289.65
ALB Albania 78.46 274.91 97.55 3193 Upper middle income 1.22 5284.38
ARE United Arab Emirates 77.81 1817.35 92.99 3280 High income 2.53 43839.36
ARG Argentina 76.52 1127.91 98.09 3229 Upper middle income 3.99 11633.50
ARM Armenia 74.94 422.28 99.77 2928 Upper middle income 4.40 4220.49
ATG Antigua and Barbuda 76.89 875.17 98.95 2417 High income 2.96 16672.74
AUS Australia 82.75 5425.34 99.00 3276 High income 3.68 57354.96
AUT Austria 81.69 5326.44 98.00 3768 High income 5.17 51478.29
AZE Azerbaijan 72.86 165.77 99.81 3118 Upper middle income 3.45 4739.84
BEL Belgium 81.60 4912.70 99.00 3733 High income 3.07 47583.07
BEN Benin 61.47 30.94 38.45 2619 Lower middle income 0.08 1240.83
BFA Burkina Faso 61.17 40.25 37.75 2720 Low income 0.08 813.10
BGD Bangladesh 72.32 41.91 61.49 2450 Lower middle income 0.58 1698.35
BGR Bulgaria 74.96 689.91 98.39 2829 Upper middle income 4.03 9427.73
BHS Bahamas, The 73.75 2013.38 95.60 2670 High income 2.01 33767.50
BIH Bosnia and Herzegovina 77.26 539.55 98.49 3154 Upper middle income 2.16 6072.18
BLR Belarus 74.18 356.25 99.72 3250 Upper middle income 5.19 6330.08
BLZ Belize 74.50 285.99 82.78 2751 Upper middle income 1.12 4884.73
BMU Bermuda 81.65 98.00 2671 High income 113021.42
BOL Bolivia 71.24 223.60 95.14 2256 Lower middle income 1.59 3548.59
BRA Brazil 75.67 848.39 92.59 3263 Upper middle income 2.16 9001.23
BRB Barbados 79.08 1164.54 99.70 2937 High income 2.48 17745.19
BRN Brunei Darussalam 75.72 763.15 96.66 2985 High income 1.61 31628.33
BWA Botswana 69.28 482.96 88.22 2326 Upper middle income 0.53 8279.60
CAF Central African Republic 52.80 53.66 36.75 1879 Low income 0.07 475.95
CAN Canada 81.95 4994.90 99.00 3494 High income 2.61 46313.17
CHE Switzerland 83.75 9870.66 99.00 3391 High income 4.30 82818.11
CHL Chile 80.04 1455.61 96.63 2979 High income 2.59 15924.79
CHN China 76.70 501.06 96.36 3108 Upper middle income 1.98 9976.68
CIV Cote d’Ivoire 57.42 71.88 43.27 2799 Lower middle income 0.23 2314.05
CMR Cameroon 58.92 54.14 74.99 2671 Lower middle income 1534.49

^a Life expectancy in years. ^b Healthcare expenditure in US$ per capita. ^c Literacy rate. ^d Daily caloric intake. ^e Number of physicians per 1,000. ^f GDP per capita in US$.
COG Congo, Rep. 64.29 47.52 79.31 2208 Lower middle income 2577.70
COL Colombia 77.11 513.16 94.58 2804 Upper middle income 2.18 6716.91
CRI Costa Rica 80.09 909.67 97.65 2848 Upper middle income 2.89 12112.13
CUB Cuba 78.73 986.94 99.71 3409 Upper middle income 8.42 8821.82
CYP Cyprus 80.83 1954.41 99.06 2649 High income 1.95 28689.71
CZE Czech Republic 79.03 1765.59 99.00 3256 High income 4.12 23415.84
DEU Germany 80.89 5472.20 99.00 3499 High income 4.25 47810.51
DJI Djibouti 66.58 70.86 67.90 2607 Lower middle income 0.22 3141.89
DMA Dominica 490.82 94.00 2931 Upper middle income 1.12 7693.88
DNK Denmark 80.95 6216.77 99.00 3367 High income 4.01 61598.54
DOM Dominican Republic 73.89 461.54 92.47 2614 Upper middle income 1.56 8050.63
DZA Algeria 76.69 255.87 79.61 3296 Lower middle income 1.72 4153.73
ECU Ecuador 76.80 516.25 94.52 2344 Upper middle income 2.04 6295.94
EGY Egypt, Arab Rep. 71.83 125.55 75.84 3522 Lower middle income 0.45 2537.13
ESP Spain 83.43 2736.32 98.11 3174 High income 3.87 30389.36
EST Estonia 78.24 1552.97 99.82 3253 High income 4.48 23170.71
ETH Ethiopia 66.24 24.23 49.03 2131 Low income 0.08 771.52
FIN Finland 81.73 4515.68 100.00 3368 High income 3.81 50030.88
FJI Fiji 67.34 214.55 93.70 2943 Upper middle income 0.86 6317.49
FRA France 82.72 4690.07 99.00 3482 High income 3.27 41631.09
GAB Gabon 66.19 218.37 83.24 2830 Upper middle income 0.68 7956.63
GBR United Kingdom 81.26 4315.43 99.00 3424 High income 2.81 43043.23
GEO Georgia 73.60 312.75 99.76 2905 Upper middle income 7.12 4722.79
GHA Ghana 63.78 77.91 76.58 3016 Lower middle income 0.14 2202.31
GIN Guinea 61.19 38.32 30.47 2566 Low income 0.08 878.60
GMB Gambia, The 61.73 22.16 55.57 2628 Low income 0.10 732.72
GNB Guinea-Bissau 58.00 53.29 59.77 2292 Low income 0.13 778.35



GRC Greece 81.79 1566.90 95.29 3400 High income 5.48 20324.30
GRD Grenada 72.38 474.53 96.00 2447 Upper middle income 1.41 10485.91
GTM Guatemala 74.06 259.62 79.07 2419 Upper middle income 0.35 4472.89
GUY Guyana 69.77 295.56 87.54 2764 Upper middle income 0.80 6145.84
HKG Hong Kong SAR, China 84.93 93.50 3290 High income 48543.40
HND Honduras 75.09 176.25 88.42 2641 Lower middle income 0.31 2505.78
HRV Croatia 78.07 1014.22 99.27 3059 High income 3.00 15014.09
HTI Haiti 63.66 64.25 60.69 2091 Low income 0.23 1435.35
HUN Hungary 76.07 1081.80 99.38 3037 High income 3.41 16410.19
IDN Indonesia 71.51 111.68 95.44 2777 Upper middle income 0.43 3893.85
IND India 69.42 72.83 72.23 2459 Lower middle income 0.86 2005.86
IRL Ireland 82.26 5489.07 99.00 3600 High income 3.31 78621.23
IRN Iran, Islamic Rep. 76.48 484.29 87.17 3094 Upper middle income 1.58 5550.06

IRQ Iraq 70.45 239.41 79.72 2545 Upper middle income 0.71 5834.17
ISL Iceland 82.86 6530.93 99.00 3380 High income 4.08 72968.70

ISR Israel 82.80 3323.65 97.10 3610 High income 4.62 41719.73
ITA Italy 83.35 2989.00 99.02 3579 High income 3.98 34615.76
JAM Jamaica 74.37 320.98 88.50 2746 Upper middle income 1.31 5354.24
JOR Jordan 74.41 330.14 98.01 3100 Upper middle income 2.32 4312.18
JPN Japan 84.21 4266.59 99.00 2726 High income 2.41 39159.42
KAZ Kazakhstan 73.15 275.85 99.79 3264 Upper middle income 3.98 9812.60
KEN Kenya 66.34 88.39 78.02 2206 Lower middle income 0.16 1707.99
KGZ Kyrgyz Republic 71.40 85.74 99.50 2817 Lower middle income 2.21 1308.14
KHM Cambodia 69.57 90.56 78.35 2477 Lower middle income 0.19 1512.13
KNA St. Kitts and Nevis 992.59 97.80 2492 High income 2.68 19276.52
KOR Korea, Rep. 82.63 2542.82 97.97 3334 High income 2.36 33422.94
KWT Kuwait 75.40 1711.22 96.12 3501 High income 2.65 33994.38
LAO Lao PDR 67.61 57.11 79.87 2451 Lower middle income 0.37 2542.49
LBN Lebanon 78.88 686.47 94.05 3066 Upper middle income 2.10 8024.80
LBR Liberia 63.73 45.42 47.60 2204 Low income 0.04 677.32
LCA St. Lucia 76.06 464.71 90.10 2595 Upper middle income 0.64 11357.89
LKA Sri Lanka 76.81 157.47 92.61 2539 Lower middle income 1.00 4080.57
LSO Lesotho 53.70 124.79 79.36 2529 Lower middle income 1221.88
LTU Lithuania 75.68 1249.25 99.82 3417 High income 6.35 19176.18
LUX Luxembourg 82.30 6227.08 100.00 3539 High income 3.01 116654.26
LVA Latvia 74.78 1101.49 99.89 3174 High income 3.19 17858.28
MAR Morocco 76.45 174.78 71.71 3403 Lower middle income 0.73 3222.20
MDA Moldova 71.81 212.97 99.24 2714 Lower middle income 3.21 4233.74
MDG Madagascar 66.68 22.05 64.66 2052 Low income 0.18 527.50
MDV Maldives 78.63 973.54 99.32 2732 Upper middle income 4.56 10276.93
MEX Mexico 74.99 519.61 94.55 3072 Upper middle income 2.38 9686.51
MKD North Macedonia 75.69 399.10 97.84 2949 Upper middle income 2.87 6088.97
MLI Mali 58.89 34.95 33.07 2890 Low income 0.13 894.80
MLT Malta 82.45 2753.51 94.07 3378 High income 2.86 30437.22
MMR Myanmar 66.87 59.21 93.09 2571 Lower middle income 0.68 1418.18
MNE Montenegro 76.77 731.48 98.72 3491 Upper middle income 2.76 8846.06
MNG Mongolia 69.69 155.09 98.37 2510 Lower middle income 2.86 4134.99
MOZ Mozambique 60.16 40.26 58.84 2283 Low income 0.08 503.32
MRT Mauritania 64.70 54.49 52.12 2876 Lower middle income 0.19 1600.88
MUS Mauritius 74.42 653.35 90.62 3065 High income 2.53 11208.34
MWI Malawi 63.80 35.50 65.96 2367 Low income 0.04 381.26
MYS Malaysia 76.00 427.22 94.64 2916 Upper middle income 1.54 11377.45
NAM Namibia 63.37 471.49 90.82 2171 Upper middle income 0.42 5495.43
NCL New Caledonia 77.15 96.94 2853 High income
NER Niger 62.02 30.36 19.10 2547 Low income 0.04 572.43
NGA Nigeria 54.33 83.75 59.57 2700 Lower middle income 0.38 2027.78
NIC Nicaragua 74.28 173.77 82.47 2638 Lower middle income 0.98 2020.55
NLD Netherlands 81.81 5306.53 99.00 3228 High income 3.61 53044.53
NOR Norway 82.76 8239.10 100.00 3485 High income 2.92 81734.47
NPL Nepal 70.48 57.85 64.66 2673 Lower middle income 0.75 1038.65
NZL New Zealand 81.86 4037.46 99.00 3137 High income 3.59 42949.93
OMN Oman 77.63 678.23 93.97 3143 High income 2.00 16521.18
PAK Pakistan 67.11 42.87 56.44 2440 Lower middle income 0.98 1482.31
PAN Panama 78.33 1131.66 95.04 2733 High income 1.57 15592.57
PER Peru 76.52 369.08 94.37 2700 Upper middle income 1.30 6941.24
PHL Philippines 71.09 136.54 96.62 2570 Lower middle income 0.60 3252.09
POL Poland 77.60 978.74 99.79 3451 High income 2.38 15468.48
PRK Korea, Dem. People’s Rep. 72.09 100.00 2094 Low income 3.68
PRT Portugal 81.32 2215.17 95.43 3477 High income 5.12 23562.55
PRY Paraguay 74.13 400.39 95.54 2589 Upper middle income 1.35 5805.68
PYF French Polynesia 77.46 98.00 2927 High income
ROU Romania 75.36 687.25 98.76 3358 High income 2.98 12399.89
RUS Russian Federation 72.66 609.01 99.72 3361 Upper middle income 4.01 11370.81
RWA Rwanda 68.70 58.31 71.24 2228 Low income 0.13 783.29
SAU Saudi Arabia 75.00 1484.59 94.84 3255 High income 2.61 23338.96
SDN Sudan 65.09 60.17 58.60 2336 Low income 0.26 623.87
SEN Senegal 67.67 58.90 55.62 2456 Lower middle income 0.07 1465.59
SLB Solomon Islands 72.83 94.96 84.10 2391 Lower middle income 0.19 2441.52
SLE Sierra Leone 54.31 85.78 48.43 2404 Low income 533.99
SLV El Salvador 73.10 288.52 87.65 2577 Lower middle income 1.57 4067.66
SRB Serbia 75.89 617.09 98.00 2728 Upper middle income 3.11 7252.40
STP Sao Tome and Principe 70.17 125.40 91.75 2400 Lower middle income 0.05 1953.59



SUR Suriname 71.57 474.13 95.54 2753 Upper middle income 1.21 6015.16
SVK Slovak Republic 77.27 1299.91 99.60 2944 High income 3.42 19406.35
SVN Slovenia 81.38 2169.58 99.71 3168 High income 3.09 26115.91
SWE Sweden 82.56 5981.71 99.00 3179 High income 3.98 54589.06
SWZ Eswatini 59.40 271.14 87.47 2329 Lower middle income 0.33 4106.20
TCD Chad 53.98 29.24 40.02 2110 Low income 0.04 726.15
TGO Togo 60.76 41.84 66.54 2454 Low income 0.08 679.97
THA Thailand 76.93 275.92 93.98 2784 Upper middle income 0.80 7295.48
TJK Tajikistan 70.88 59.84 99.78 2201 Low income 2.10 826.62
TKM Turkmenistan 68.07 460.18 99.69 2840 Upper middle income 2.22 6966.64
TLS Timor-Leste 69.26 93.69 64.07 2131 Lower middle income 0.72 1230.23
TTO Trinidad and Tobago 73.38 1123.42 98.97 3052 High income 4.17 17129.91
TUN Tunisia 76.50 251.55 81.05 3349 Lower middle income 1.30 3438.79

TUR Turkey 77.44 389.87 95.69 3706 Upper middle income 1.85 9455.59
TZA Tanzania 65.02 36.82 80.36 2208 Lower middle income 0.01 1060.99

UGA Uganda 62.97 43.14 73.81 2130 Low income 0.17 770.45
UKR Ukraine 71.58 228.39 99.76 3138 Lower middle income 2.99 3096.82
URY Uruguay 77.77 1590.05 98.44 3050 High income 5.08 17277.97
USA United States 78.54 10623.85 99.00 3682 High income 2.61 62996.47
UZB Uzbekistan 71.57 82.27 100.00 2760 Lower middle income 2.37 1529.08
VCT St. Vincent and the Grenadines 72.42 329.26 95.63 2968 Upper middle income 7361.40
VEN Venezuela, RB 72.13 256.95 95.40 2631 Upper middle income 16054.49
VNM Vietnam 75.32 151.69 94.51 2745 Lower middle income 0.83 2566.60
VUT Vanuatu 70.32 105.39 85.06 2836 Lower middle income 0.17 3125.26
WSM Samoa 73.19 227.33 99.02 2960 Upper middle income 0.34 4188.53
YEM Yemen, Rep. 66.10 69.96 2223 Low income 0.53 824.12
ZAF South Africa 63.86 525.96 94.60 3022 Upper middle income 0.91 6374.03
ZMB Zambia 63.51 75.99 85.12 1930 Lower middle income 1.19 1516.39
ZWE Zimbabwe 61.20 140.32 86.87 2110 Lower middle income 0.21 1683.74

Figure A.1: Explanation of a boxplot

[Annotated boxplot: the box spans the lower quartile Q1 to the upper quartile Q3 (the interquartile range), with the median marked inside; the whiskers extend to the largest value within 1.5 times the interquartile range above Q3 and the smallest value within 1.5 times the interquartile range below Q1; values more than 1.5 times (and less than 3 times) the interquartile range beyond either end of the box are plotted as individual points.]

Index

Administrative data, 1 Continuous random variable, 14


Age specific mortality rate, 80 Continuous variable, 6
Akaike Information Criterion (AIC), 43 Correlation
Alternative hypothesis, 27 Pearson, 31
Arithmetic mean, 9 Sample, 31
Average treatment effects (ATE), 122 Correlation coefficient
Bar graph, 8 Multiple, 43
Bayes rule, 97 Counterfactual, 121
Bayes Theorem, 97 Covariate, 7
Bayesian Information Criterion (BIC), 43 Cox proportional hazards model, 84
Bell-curve, 22 Cox regression, 30
Bernoulli distribution, 17 Cross-validation (CV), 135
Bernoulli process, 17 Cumulative distribution function (CDF), 15
Bernoulli sequence, 17 Cumulative variance (scree) plot, 111
Bernoulli trial, 17 Data, 1
Bin size, 8 Decision boundary, 96, 100
Bin width, 8 Decision rule, 96
Binomial distribution, 17 Dependent variable, 7
Bivariate analysis, 16 Deviance, 55
Bivariate distribution, 16 Deviance ratio, 118
Continuous, 17 Deviance residual, 92
Joint pdf, continuous, 17 Deviance residuals, 56
Bootstrap, 132 Discrete random variable, 14
Calibration sample, 135 Discrete variable, 6
Caliper, 124 Distribution
Categorical variable, 6 Asymmetric, 11
Causal effect, 121 Continuous, 14
Causal inference, 38, 121 Discrete, 14
Censored, 78, 79 Mode, 11
Censoring, 78 Multimodal, 11
Central Limit Theorem, 27 Skewed, 11
CH index, 106 Symmetric, 11
Claims data, 1 Unimodal, 11
Classification: Accuracy, 102 Distribution function
Cluster analysis, 104 Joint, 16
Competing risk, 78 Effect modification, 44
Conditional independence, 123 Effect modifier, 44
Conditional probability, 16 Electronic health records (EHR), 2
Conditioning, 123 Estimation, 25
Confidence interval (CI) estimate, 27 Euclidean distance, 98
Confidence level, 27 Expectation, 15
Confounder, 38, 122 Conditional, 17, 35
Confounding, 122 Expected value, 15
Confusion table, 102 Exponential distribution, 20
Contingency table, 16, 48 Exposure, 64, 121


False positive rate, 61 Margin of error, 27


Feature, 7 Marginal probability, 16
Frequency distribution, 8 Martingale residuals, 91
Gamma distribution, 21 Matrix, 6
Gamma function, 21 Maximum a posteriori (MAP), 97
Gaussian distribution, 22 Maximum Likelihood Estimation, 35
Gehan-Wilcoxon test, 82 Maximum likelihood estimation, 25
Generalised linear model (GLM), 75 Mean, 15
Goodness-of-fit, 43 Conditional, 35
Graphical summary, 7 Medical records, 2
Hat-value, 59 Multicollinearity, 112
Hazard function, 80 Multiple Regression, 37
Health records, 1 Multiple regression, 30
Histogram, 8 Multivariate analysis, 16
Hosmer-Lemeshow test, 57 Multivariate distribution, 16
Hurdle model, 69 Multivariate normal distribution, 98
Hypothesis, 27 Negative Binomial distribution, 68
Immortal time bias, 89 Negative Binomial regression, 68
Independent variable, 7 Nominal variable, 6
Influential point, 59 Non-administrative data, 1
Input variable, 7 Non-linear models, 46
Instantaneous failure rate, 80 Non-random Sampling, 14
Interaction, 44 Normal distribution, 22, 98
Interquartile range (IQR), 13 Null hypothesis, 27
Interval censored, 79 Null model, 54
Inverse probability (propensity) weighting (IPW), 129 Numerical summary, 7
Joint distribution, 16 Observation, 5
K-fold cross-validation, 135 Observational study, 3
K-means clustering, 104 Odds, 48
Kaplan Meier (KM), 80 Odds ratio, 49, 64
LASSO, 117 Offset, 64
Least Squares, 35 Ordinal variable, 6
Leave-one-out cross-validation, 135 Outcome, 7
Left censored, 79 Outlier, 31, 58
Leverage, 58 Output variable, 7
Likelihood ratio (LR) test, 54 Over dispersion, 68
Linear discriminant analysis (LDA), 100 p-value, 28
Linear discriminant function, 100 Parameters, 17
Linear regression, 30, 33 Pearson residuals, 55
Model, 34 Percentile, 13
Coefficient of determination, 43 25-th, 13
Coefficients, 34 75-th, 13
independent variable, 33 Pie chart, 8
Outcome variable, 33 Poisson, 63
Residuals, 44 Poisson process, 18, 20
LOESS (locally weighted smoothing), 93 Polynomial regression, 46
Log-odds, 49, 50 Population, 14
Logistic regression, 30, 49 Finite, 14
Logit, 49, 50 Infinite, 14
Loglikelihood, 26 Population parameter, 25
Logrank test, 82 Population surveillance, 4
Loss, 42, 114 Posterior probability, 97
Mahalanobis distance, 98, 124 Potential outcome, 121


Predictor, 7 Positive association, 31


Principal component, 109 Schoenfeld residuals, 93
Loadings, 109 Scree plot, 111
Principal component analysis, 109 Sensitivity, 61
Prior probability, 97 Sentinel surveillance, 4
Probability density function (pdf), 14 Shape parameter, 68
Probability distribution, 14 Shrinkage parameter, 114
Probability distribution function (pdf), 14 Significance test, 28
Propensity score matching, 124 Simple linear regression, 30, 33
Proportional hazards, 82, 84 Specificity, 61
Public health surveillance, 1, 4 Spread, 12
Quadratic discriminant analysis (QDA), 101 Squared error loss, 114
Quadratic discriminant function, 101 Standard error (SE), 26
Qualitative variable, 6 Strata, 127
Quantile-quantile (QQ) plot, 44 Stratification, 127
Quantitative variable, 6 Stratified, 27, 123
Quartile Stratified analysis, 127
1-st, 13 Stratum, 123
3-rd, 13 Studentised residuals, 59
Lower, 13 Survey, 1
Upper, 13 Longitudinal, 3
Cross-sectional, 3
Random sampling, 14
Panel, 3
Random variable, 14
Survival curve, 80
Randomised controlled trials, 1
Survival function, 80
Receiver Operating Characteristic (ROC), 61
Tabular summary, 7
Registry, 4
Target variable, 7
Regression, 30
Test
Regression adjustment, 127
Exact, 29
Regression line
Test sample, 135
Fitted, 36
Test statistic, 29
Regression model, 30
Time dependent covariate, 88
Regularisation, 113
Time varying coefficient, 86
Relative risk (RR), 48, 64
Training sample, 135
Response, 7
Treatment, 121
Ridge regression, 113 Treatment effect, 121
Right censored, 79 True positive rate, 61
Risk, 48 Tuning parameter, 114
Risk difference (RD), 53 Type I error, 28
Risk ratio, 48 Type II error, 28
Robust sandwich estimator, 131 Uniform distribution, 21
Sample, 14 Unimodal distribution, 18
Sample mean, 9 Validation sample, 135
Sample median, 9 Variable, 5, 6
Sample range, 12 Variance inflation factor (VIF), 115
Sample standard deviation, 12 Variance-covariance matrix, 98
Sample variance, 12 Zero inflation, 68
Sampling error, 26 Zero-inflated regression, 73
Sampling with replacement, 14
Saturated model, 55
Scatter plot, 31
Linear, 31
Negative association, 31
Non-linear, 31

