Statistical Prediction and Machine Learning
One of the distinct features of this book is its comprehensive coverage of topics in statistical
learning and medical applications. It summarizes the authors' teaching, research, and consulting
experience in data analytics. The illustrative examples and accompanying materials heavily
emphasize understanding of data analysis, producing accurate interpretations, and discovering
hidden assumptions associated with various methods.
Key Features:
• Unifies conventional model-based framework and contemporary data-driven methods into a
single overarching umbrella over data science.
• Includes real-life medical applications in hypertension, stroke, diabetes, thrombolysis,
and aspirin efficacy.
• Integrates statistical theory with machine learning algorithms.
• Includes potential methodological developments in data science.
John Tuhao Chen is a professor of Statistics at Bowling Green State University. He completed his
postdoctoral training at McMaster University (Canada) after earning a PhD degree in statistics at
the University of Sydney (Australia). John has published research papers in statistics journals such
as Biometrika as well as in medicine journals such as the Annals of Neurology.
Lincy Y. Chen is a data scientist at JP Morgan Chase & Co. She graduated from Cornell Uni-
versity, winning the Edward M. Snyder Prize in Statistics. Lincy has published papers regarding
refinements of machine learning methods.
Clement Lee is a data scientist in a private firm in New York. He earned a Master’s degree in
applied mathematics from New York University, after graduating from Princeton University in
computer science. Clement enjoys spending time with his beloved wife Belinda and their son
Pascal.
Reasonable efforts have been made to publish reliable data and information, but the author and pub-
lisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged, please write and let us know
so we may rectify it in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information stor-
age or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.
com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. For works that are not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.
DOI: 10.1201/9780429318689
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
To the memory of my parents
HongBiao Chen and WanJuan Lin
–JTC
Contents
Preface xi
List of Tables xv
2 Fundamental Instruments 23
2.1 Data identification . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.1 Data types . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.2 Pooling data, Simpson’s paradox, and solution . . . . 28
2.2 Basic concepts of trees . . . . . . . . . . . . . . . . . . . . . 30
2.3 Sensitivity, specificity, and ROC curves . . . . . . . . . . . . 34
2.4 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.1 LOOCV and Jackknife . . . . . . . . . . . . . . . . . . 38
2.4.2 LOOCV for linear regressions . . . . . . . . . . . . . 41
2.4.3 K-fold cross-validation and SAS examples . . . . . . . 44
2.5 Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.5.1 Non-parametric bootstrapping . . . . . . . . . . . . . 47
2.5.2 Parametric bootstrapping . . . . . . . . . . . . . . . . 50
Bibliography 287
Index 297
Preface
Big data has penetrated every corner of our lives. Its omnipresence and
the demands of the market necessitate that we understand it well. To this
end, new books and literature that explain techniques in data-oriented prediction
and machine learning are required. Although there are excellent textbooks
focusing on data analysis with conventional statistical approaches, as well as
outstanding textbooks addressing machine learning methods for data-oriented
approaches, to date none has merged the two comprehensively. Over time,
these two approaches have led to two primary camps in data science, one
focused on data-oriented analysis and the other on model-based analysis.
Students, data analysts, and junior researchers are often confused about which
camp they fall under, especially as the two data analytic camps often
seem contradictory to each other. There is much debate on the right
camp to select in the broader realm of data science. Written by an experienced
statistician and two data scientists, this book unifies the two frameworks into
a single overarching umbrella over data science.
Starting from the background of a basic undergraduate statistics course,
the conventional model-based inference framework finds its foundations
in data analytics. It consists of an underlying model for the data,
hypothesis testing or confidence estimation on unknown model parameters,
measuring the variation behind the data, and prediction of an unknown quantity
related to the inference problem. Under this style of thinking, the underlying
model serves as the hub of data analysis. An implausible model assumption may
thus produce a correct answer to the wrong problem, which can often lead to
misleading prediction results.
When addressing practical problems such as high dimensional inference,
machine learning often relies on computer intensive algorithms. Many of the
underlying thought processes and methodologies have been well-developed but
are still fundamentally based in the conventional data analysis framework. One
of the major challenges underpinning modern machine learning stems from the
gap between the conventional model-based inference and data-driven learning
algorithms. The knowledge gap hinders practitioners (especially students, re-
searchers, data analysts, or consultants) from truly mastering and correctly
applying machine learning skills in data science.
This book is designed to bridge the gap between conventional statistics
and machine learning. It provides an accessible approach for readers with
a basic statistical background to develop a mastery of machine learning. We
start with elucidating examples in Chapter 1 and introduce fundamental instruments in Chapter 2.
List of Figures
11.1 Mean monthly income between male and female with sprr . . 251
11.2 Overall mean monthly income vs a value with sprtt . . . . . . 254
11.3 Log-likelihood ratio vs effective sizes . . . . . . . . . . . . . . 255
11.4 Codes for two-stage sequential Student-t test . . . . . . . . . 258
11.5 Sample size calculation for stock data-2 . . . . . . . . . . . . 259
11.6 Sample size calculation for stock data-3 . . . . . . . . . . . . 261
11.7 Simultaneous lower bands for thrombolysis effects . . . . . . . 271
List of Tables
1
Two Cultures in Data Science
which the model underlying the data is explicitly or implicitly assumed. It also
includes non-parametric statistical analysis in which the inference is motivated
by and grounded on a set of general population homogeneity assumptions (such
as a common continuous cumulative distribution function). The model-based approach has
been well documented in conventional statistical analyses, for instance, [9],
[10], [43], and [91], among others. In this approach, we assume that the data
set is generated from a population with unknown parameters:
y = f(x|η) + ε,
TABLE 1.1
Lung cancer data structure
Item Patient#1 Patient#2 Patient#3 Patient#4
ID N01101 N01102 N01103 N01104
Gender M F M F
Age 30 43 71 63
Smoking Y N Y Y
Diabetes N Y Y N
Hypertension Y N Y Y
Lung Cancer Y N N Y
hypertension. Given the data types shown in Table 1.1, the usual normality
assumption fails, so it is not appropriate to fit the data with a linear
regression model. Instead, since the data are case-control data, a logistic
regression model is more appropriate for analyzing the odds ratio of
lung cancer associated with the population defined by the strata of
combined risk factors. The case-control feature of the data determines the
analytical approach to the unknown function f(.) and the prediction outcome
on the severity of risk factors associated with the disease.
The discrete feature of the response variable in Example 1.1 determines
the logistic function for the underlying model f(.) because the outcome of
developing lung cancer is either "yes" or "no". The next example takes the
approach of simple linear regression since the response variable Y, the insurance
premium, is continuous. It establishes the connection between insurance premium
and driving experience, and demonstrates that even within the method of simple
linear regression, model-based prediction differs greatly from data-driven
prediction in the learning process toward the underlying model f(x|η).
Example 1.2 Assume that the insurance premium linearly decreases as the
driving year increases, more specifically, we assume the model behind the data
as
y = α + βx + ε,
where y is the insurance premium, x is the driving experience in years, α is the
intercept for the mean premium of a new driver who has no driving experience,
and β is the slope for the amount of decrease in monthly insurance premium
for the increase of each driving year. The error term ε is the random variation
attributable to other factors such as age, gender, income, marital status, etc.
The learning process toward f (.) is tantamount to the estimation of model
parameters α and β.
As usual, assume that ε follows a normal model with mean zero and an unknown standard
deviation σ. In regression analysis, we estimate the values of the parameters
α and β, and use the estimated model
ŷ(x) = α̂ + β̂x
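In code, the estimation step can be sketched as follows. This is a minimal illustration in Python (the book's own examples use R and SAS), with a hypothetical dataset simulated from the assumed linear model; the generating values α = 600 and β = −20 are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 6-month premium y versus driving years x,
# simulated from the assumed model y = alpha + beta*x + eps.
x = rng.uniform(0, 15, size=50)
y = 600 - 20 * x + rng.normal(0, 20, size=50)

# Least-squares estimates of alpha and beta.
X = np.column_stack([np.ones_like(x), x])
alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# The fitted model is y_hat(x) = alpha_hat + beta_hat * x.
print(alpha_hat, beta_hat)
```

Up to sampling noise, the estimates recover the generating intercept and slope.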
y = α + βx + ε.
FIGURE 1.1
Premium-time regression analysis of the insurance data
f (x) = α + βx.
Second, the random fluctuation term ε follows a normal model. If any one
of these assumptions fails, the corresponding data analysis becomes invalid
and the conclusion would be misleading. In the application of real-life data
analyses, the plausibility of the model assumptions usually comes from the
information of the data in the specific field of investigation. However, in
situations where informative knowledge is not available to support the assumption
of a specific form of f(x|η), carelessly applying a linear model (or a generalized
linear model) to a set of data may result in misleading conclusions, even though
the fitted model may appear statistically significant.
In the next section, we follow up on the analysis of the insurance data in
Example 1.2 with data-driven analyses, illustrating that model-based analyses,
and especially the simple linear structure of f(.) in Example 1.2, may actually
be wrong.
Example 1.3 The premium-year plot in Figure 1.2 indicates that the insur-
ance premium is not a linear function of driving year.
FIGURE 1.2
Premium-year plot of the insurance data
As shown in Figure 1.2, there is actually no linear pattern in the relationship
between the 6-month insurance premium and the driving year. What lies
behind the data is more likely a piece-wise linear function. Figure 1.2 shows
that during the first 3 years, the decrease in the insurance premium for each
year of driving is much larger than the corresponding change for customers
who have driven 3 to 10 years. After 10 years, there is essentially no gain in
insurance premium for any additional year of driving. This is closer to
realistic practice, in which new drivers are charged higher rates
for the first few years (the first stage). In the second stage (3 to 10 years),
although the rate still decreases as the driving year increases after the initial
stage (0-3 years), the slope is relatively more stable compared with the changes
in the first stage. After 10 years of driving, drivers essentially get a flat
rate that does not depend on additional years of driving.
In what follows, we shall examine the statistical significance of the three
phases separately.
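The three-phase analysis can be sketched by fitting a separate least-squares line to each driving-year segment. This Python illustration simulates data mimicking the pattern of Figure 1.2; the split points 3 and 10 years come from the discussion above, while the generating coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def ols(x, y):
    # least-squares line fit; returns (intercept, slope)
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Simulated premiums following a three-phase piece-wise pattern.
x = rng.uniform(0, 20, size=300)
y = np.where(x < 3, 600 - 20 * x,
             np.where(x < 10, 400 - 5 * x, 210.0)) + rng.normal(0, 10, size=300)

# Fit each phase separately, mirroring the three regressions in the text.
for lo, hi in [(0, 3), (3, 10), (10, 20)]:
    m = (x >= lo) & (x < hi)
    a, b = ols(x[m], y[m])
    print(f"{lo}-{hi} years: intercept={a:.1f}, slope={b:.2f}")
```

The fitted slope for the last segment is close to zero, matching the flat-rate phase.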
Figure 1.4 provides the outcome of linear regression analysis for the effect
of driving years on the 6-month insurance premium. For drivers with less
FIGURE 1.3
Plausibility of piece-wise function for insurance premium
than three years’ experience, the long-term average rate is $601.34 (when the
driving experience x = 0). And for each additional year of driving, the 6-month
premium, on average, drops $21.62 within the first three years of driving. The
two coefficients are significantly different from zero because the p-values are
much less than 0.05.
The regression result in Figure 1.5 indicates that people with three to
ten years of driving experience are charged a basic rate of, on average, $399.36.
Such a rate decreases by $4.85, on average, for each additional year of driving.
Compared with drivers in the first three years of driving, this group of drivers
pays a lower starting premium ($601.34 versus $399.36), but the rate of change
for each additional year of driving is much less ($21.62 versus $4.85). All the
estimated parameters are statistically significant.
For the third consumer group, shown in Figure 1.6, people with more than
10 years of driving experience basically pay an average flat rate of $211.01
(statistically significant), in which the impact of an additional year of driving
on the 6-month premium is not statistically significant (the corresponding
p-value is 0.169).
The above example indicates that a linear model as in Figure 1.1 can
be completely different from the true model behind the data even though
the p-value of the model is statistically significant. Starting with a correct
model assumption is critical in data analytics. If the initial model assumption
FIGURE 1.4
Premium-time regression for inexperienced drivers
FIGURE 1.5
Premium-time regression for drivers with 3-10 years of experience.
Definition 1.1 EPE: Let Y be the response observation and Ŷ be the
prediction of the response based on a set of data; the expected squared
prediction error (EPE) is defined as

EPE = E[(Y − Ŷ)²].
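The EPE can be approximated by Monte Carlo simulation: average the squared prediction errors over many draws. The following Python sketch uses a hypothetical model, Y = 2X + ε with ε ~ N(0, 1), chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative (hypothetical) model: Y = 2X + eps, eps ~ N(0, 1).
n = 100_000
x = rng.normal(0, 1, n)
y = 2 * x + rng.normal(0, 1, n)

# EPE of the true regression function E(Y|X) = 2x is Var(eps) = 1;
# a systematically biased predictor pays an extra squared-bias term.
epe_true = np.mean((y - 2 * x) ** 2)
epe_biased = np.mean((y - (2 * x + 0.5)) ** 2)   # constant bias of 0.5

print(epe_true, epe_biased)
```

The biased predictor's estimated EPE exceeds the unbiased one by roughly the squared bias, 0.25.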
FIGURE 1.6
Premium-time regression for drivers with more than 10 years experience
It should be noted that when we apply the mean (squared) prediction error
as a criterion to evaluate the trained function ĝ(x), the optimal solution ĝ(x)
takes the form E(Y |X), as shown in the following theorem.
Theorem 1.1 Let Δ be the set of permissible functions g(x) for the model
Y = g(x) + ε. Then the EPE is minimized over Δ by the conditional expectation

ĝ(x) = E(Y | X = x).
Theorem 1.1 indicates that the model minimizing the expected prediction
error is the conditional expected value of the response given the features asso-
ciated to the response of interest. Details on the proof of this theorem can be
found in [56], [119], or [120]. We will also discuss this result in Section 1.4.1
when we discuss the outcome evaluation in model-based inference.
> library(MASS)        # provides fitdistr()
> set.seed(127)
> y <- rnorm(15, 0, 7)
> hist(y)
> fit1 <- fitdistr(y, "normal")
> ks.test(y, "pexp", fit1$estimate)
data:  y
D = 0.54262, p-value = 0.0001152
alternative hypothesis: two-sided
> y <- y + 7
> fit4 <- fitdistr(y, "exponential")
> ks.test(y, "pexp", fit4$estimate)
data:  y
D = 0.28369, p-value = 0.1463
alternative hypothesis: two-sided
FIGURE 1.7
Small dataset misleads the underlying model
the data-driven culture camp uses testing errors, such as false positive rate,
false negative rate, or expected mean prediction error computed with the test-
ing data. For example, in a prediction problem, generally, the method that
predicts more closely to the true value should be selected.
As discussed above, when the dataset does not contain adequate information
for the selection of an inference culture camp, the evaluation criterion and
accuracy measurement essentially dictate the selection of prediction methods
between the model-based camp and the data-driven camp. However, each camp
has its own well-developed criteria on measuring closeness for various infer-
ence problems. In what follows, we shall discuss the evaluation criteria and
optimizing strategies on the evaluation of inference performance for each of the
two analytic culture camps in data science. Specific concepts and terminolo-
gies pertaining to a specific algorithm will be addressed in the corresponding
chapter when the topic is discussed.
Thus, the search for the underlying function f (x) can be achieved by mini-
mizing the EPE point-wise.
Now, consider

E[(Y − c)² | X = x]
= E[(Y − E(Y|X = x) + E(Y|X = x) − c)² | X = x]
= E{[Y − E(Y|X = x)]² | X = x} + [E(Y|X = x) − c]²,

where the cross term vanishes because E[Y − E(Y|X = x) | X = x] = 0.
Thus, when the value c takes E(Y|X = x), the conditional expected value
reaches its smallest possible value, and the solution is

ĝ(x) = E(Y | X = x).
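A quick numeric check of this optimality in Python: for a constant prediction c, the sample analogue of E[(Y − c)²] is minimized at the sample mean rather than at the median or any other candidate. The exponential distribution below is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)

# Draw Y from a skewed distribution (an arbitrary illustrative choice)
# and compare the sample analogue of E[(Y - c)^2] over candidate c's.
y = rng.exponential(scale=2.0, size=200_000)   # E(Y) = 2

candidates = [0.5, 1.0, np.mean(y), 3.0, np.median(y)]
losses = [np.mean((y - c) ** 2) for c in candidates]

# The sample mean minimizes the squared loss, echoing c = E(Y|X).
best = candidates[int(np.argmin(losses))]
print(best)
```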
Example 1.4 Assume that we have a set of data following N(θ, σ²) with
unknown mean θ and standard deviation σ, and we are interested in testing
θ = 0 versus θ ≠ 0. Notice that the model assumption here is that the data
follow a normal model with unknown mean and standard deviation, and we
want to make a prediction on the asserted θ value.
Under the above setting, the null space is θ ∈ Θ0 = {θ : θ = 0}, and the
alternative space is θ ∈ Θ1 = {θ : θ ≠ 0}.
We usually use the Student-t statistic as the test statistic

T(n−1) = (X̄ − 0) / (s/√n),
sup{θ∈Θ0} β(θ) = α.
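A minimal Python sketch of the test in Example 1.4: compute the Student-t statistic and compare it with the two-sided critical value. The sample here is simulated with a nonzero true mean (a hypothetical setting), so the test should reject H0; the critical value 2.045 for 29 degrees of freedom is taken from a standard t table.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated sample; the true mean is 2, so H0: theta = 0 is false here.
x = rng.normal(loc=2.0, scale=2.0, size=30)

n = x.size
t_stat = (x.mean() - 0) / (x.std(ddof=1) / np.sqrt(n))

# Two-sided test at alpha = 0.05: reject when |T| exceeds the critical
# value of the t distribution with n - 1 = 29 degrees of freedom,
# which is about 2.045 (from a standard t table).
reject = abs(t_stat) > 2.045
print(t_stat, reject)
```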
where F is the activation function that allows the network to produce nonlin-
ear behaviors.
We usually provide input into the neural network by activating a set of
nodes with specific values and read output from any subset of nodes similarly.
These networks are typically organized in layers of neurons, which indicate the
depth of each node. In this way, layers are typically fully connected, meaning
that all nodes in one layer are connected to all nodes in the next layer. This
allows a computationally efficient model of weights as a matrix M , taking
input vector V to output vector MV. Feed-forward networks, for example,
are constructed in this way; their structure is often considered fixed
and serves to produce a final classification. Key limitations of fully connected
layers prevent them from being suitable as the sole structure of larger
networks. Because of the fully connected nature of the layers, they require
an immense amount of memory: such a layer between two sets of just 10,000
nodes would require 100 million parameters, while modern networks often
have a total of 10 million parameters. This extreme capacity, besides being
inefficient, can also be problematic for training in general. There is no sense of
locality in such a layer, as every node is treated individually. This makes
it difficult, if not impossible, to train higher-level features that should
be treated equally across all areas of the input (which is of particular interest
in problems like image classification). Even with these limitations, however,
fully connected layers remain critical for the task they perform.
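The matrix view of a fully connected layer can be written in a few lines of Python; the sizes, random weights, and ReLU activation below are illustrative assumptions, not the book's specification.

```python
import numpy as np

rng = np.random.default_rng(5)

# A fully connected layer: every input feeds every output, so the
# weights form a matrix M and the layer computes F(M @ v).
n_in, n_out = 4, 3
M = rng.normal(size=(n_out, n_in))   # n_in * n_out parameters
v = rng.normal(size=n_in)

relu = lambda z: np.maximum(z, 0.0)  # an illustrative activation F
out = relu(M @ v)

# The parameter count n_in * n_out is why a layer between two sets of
# 10,000 nodes would need 100 million weights.
print(out.shape, M.size)
```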
Besides the feed-forward network, the other key component of neural net-
works includes back propagation, which is an algorithm to let errors accu-
mulated from the output layer of the network propagate backward through
the network, training it in the process. For example, if the network's
output is O but the correct response is C, we can calculate the error, say
the squared difference (O − C)². From the error,
it is possible to figure out the influence each weight had on the error by taking
the partial derivative of the cost function with respect to the weight.
Utilizing this partial derivative, each weight can be modified, and the
procedure repeated for the preceding layer.
criteria are the Probability of Type-I error and the Probability of Type-II error.
A test procedure with “best” performance will be the one that reaches the
highest permissible power with rejection region R at a pre-specified signifi-
cance level α.
Under this setting, the issue of finding an optimal inference procedure
becomes that of finding the most powerful test for a given significance level.
There are rich references at various levels in the literature in this regard,
for instance Lehmann and Casella [80], Casella and Berger [16], and Lehmann
and Romano [81], to list just a few. We will briefly discuss basic results here
to facilitate understanding of the strategy for constructing optimal
learning procedures for the model-based culture camp. More systematic details
can be found in the classical statistics literature, such as [16] or [81].
For a set of data (random sample) X = (x1 , ..., xn )T , assume that we are
interested in determining whether the underlying model is f (x|θ1 ) or f (x|θ0 ).
The optimal learning strategy toward a most powerful (MP) test for a given
significance level α is given by the Neyman-Pearson fundamental lemma.
x          x1    x2    x3    x4    x5    x6    x7    x8
f(x|H0)   0.02  0.02  0.02  0.01  0.05  0.01  0.41  0.46
f(x|H1)   0.22  0.02  0.12  0.04  0.18  0.02  0.06  0.34
Solution: Notice that in this case, the likelihood ratio for each symptom,
λ = f(x|H1)/f(x|H0), takes the following values:

x          x1    x2    x3    x4    x5    x6    x7     x8
f(x|H0)   0.02  0.02  0.02  0.01  0.05  0.01  0.41   0.46
f(x|H1)   0.22  0.02  0.12  0.04  0.18  0.02  0.06   0.34
λ         11    1     6     4     3.6   2     0.146  0.739
Certainly, when the blood glucose level test is available, the laboratory test
result is more accurate in detecting diabetes as a follow-up diagnosis.
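The Neyman-Pearson construction can be sketched in Python for the table above: rank the outcomes by the likelihood ratio and grow the rejection region while its probability under H0 stays within the significance level. The level α = 0.1 is an assumed value, and this deterministic sketch ignores the randomization that an exact MP test uses when the size cannot be hit exactly.

```python
# Rank outcomes by the likelihood ratio lambda = f(x|H1)/f(x|H0) and
# grow the rejection region while its size under H0 stays within alpha.
f0 = {"x1": 0.02, "x2": 0.02, "x3": 0.02, "x4": 0.01,
      "x5": 0.05, "x6": 0.01, "x7": 0.41, "x8": 0.46}
f1 = {"x1": 0.22, "x2": 0.02, "x3": 0.12, "x4": 0.04,
      "x5": 0.18, "x6": 0.02, "x7": 0.06, "x8": 0.34}

alpha = 0.1   # assumed significance level for this sketch
ranked = sorted(f0, key=lambda x: f1[x] / f0[x], reverse=True)

rejection, size = [], 0.0
for x in ranked:
    if size + f0[x] <= alpha + 1e-12:   # keep the size within alpha
        rejection.append(x)
        size += f0[x]

power = sum(f1[x] for x in rejection)
print(sorted(rejection), size, power)
```

Here the most indicative symptoms (largest λ) enter the rejection region first, giving size 0.10 and power 0.56.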
From the perspective of modern deep learning, the primary use of the
convolution technique is confined to the range of the convolution kernel g(.)
over (0, s), as follows:

(f ∗ g)(t) = ∫₀ˢ f(r) g(t − r) dr.
Pe ∗ E = 0·a + 1·b + 0·c + 1·d − 4·e + 1·f + 0·g + 1·h + 0·i
       = (b + d + f + h) − 4e.
Accordingly, this creates a new matrix, with each element representing the
convolution kernel applied at that point. As shown above, the convolution Pe ∗
E will have the strongest activation where there is a strong difference between
the pixel e and its neighbors (b, d, f, h), thus performing a basic localized form
of edge detection.
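The localized edge detection described above can be reproduced with a small, hand-rolled convolution in Python; the 4×4 test image is hypothetical.

```python
import numpy as np

# The 3x3 kernel from the text: -4 at the centre pixel e and +1 at its
# four neighbours b, d, f, h, so each output is (b + d + f + h) - 4e.
kernel = np.array([[0, 1, 0],
                   [1, -4, 1],
                   [0, 1, 0]])

# A hypothetical 4x4 "image": a dark region beside a bright region.
image = np.array([[0, 0, 10, 10],
                  [0, 0, 10, 10],
                  [0, 0, 10, 10],
                  [0, 0, 10, 10]])

def convolve2d(img, k):
    # naive valid-mode 2-D convolution (no kernel flip; the kernel is
    # symmetric, so convolution and correlation coincide here)
    h, w = img.shape[0] - 2, img.shape[1] - 2
    out = np.zeros((h, w), dtype=int)
    for i in range(h):
        for j in range(w):
            out[i, j] = int(np.sum(img[i:i+3, j:j+3] * k))
    return out

# Activations are nonzero only along the vertical edge.
print(convolve2d(image, kernel))
```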
Summary
The impact of the books in statistical learning ([56], [72], [119], and [120])
and deep learning ([5], [51]) has penetrated into various applied fields including
medical research ([1], [96], [97], [98], [111], and [116], among others). Methods
in statistical learning essentially challenge classical statistical theory and
methodologies with seemingly discernible boundaries. In this chapter, we
follow the idea of two cultures in statistical modeling by Breiman, Diaconis,
and Efron ([9], [10], [11], [13], and [44]) to analyze differences and intrinsic
connections between the two culture camps in data science.
Based on background training, knowledge, belief, and experience, the correct
way to do data analytics is debatable because a uniformly correct decision
does not exist. Instead of sailing in one direction on methodologies in data
analytics, this chapter goes through the two directions, from the evaluation
criteria to the optimization processes. We use a numerical example on
insurance premiums to elucidate that blindly performing either approach in
data science may result in misleading conclusions. The model-based inference
camp demands plausibility of the model assumptions behind a set of data (see,
for example, [22], [25], [28], [40], [42], [43], [52], [53], and [91], among others).
On the other hand, the data-driven inference camp necessitates large sample
sizes ([44], [51], [54], [79], among others).
There is a dilemma in the selection between the two data inference culture
camps. Seeking to thoroughly clarify the differences partially motivated the
compilation of this volume. The choice of the analytic culture camp should be
grounded on the feature information of the data. For instance, although big
data or high-dimensional data are often regarded by some as the motivation
for data-driven technologies, the problem with high dimension is actually due
to the fact that the sample covariance matrix fails to be positive definite with
probability one when n < p (see, for instance, Xie and Chen (1988 [126], 1990
[127])). Large dimension by itself is not an issue. For regression analysis in the
model-based camp, most of the theorems start with k dimensions where k is
any positive integer.
Once the framework of analytics is settled, the key component is the selection
of the evaluation criteria and optimization procedures. We use the UMP
(uniformly most powerful) test as an illustrating example for the model-based
culture camp, and the MEPE (mean expected prediction error) for the data-driven
culture camp, in the selection of suitable analytical approaches. The chapter
concludes with a discussion of the principle of optimization strategies for the
two analytical culture camps in data science. The remaining chapters basically
follow the theme and road map set in this chapter to delineate the
two inference camps in data science.
2
Fundamental Instruments
the analysis of the risk of a disease with a set of case-control data, where
the number of cases and the number of controls are determined before the
data collection stage. Regardless of the sample size and learning methods, a
case-control dataset does not provide any information for the prediction of the
prevalence of the disease. Similarly, in a cohort study that contains a prefixed
number of patients exposed (and nonexposed) to a potential risk factor, it is
methodologically fallacious to use such data to predict the prevalence of risk
exposure, P(E). This is because the ratio of risk exposure is fixed in cohort
data before sampling, by the same reasoning as for the prediction of the disease
rate P(D) with case-control data. Identifying the dataset correctly helps us
avoid drawing misleading conclusions in statistical learning.
Example 2.2 Consider the sample disease rate, P̂(case), where the measurement
is

m(X) = (number of cases) / (total sample size).

Since the number of cases is fixed by design in a case-control study, the
prediction of the disease rate, which is the sample proportion m(X), is not
invariant between case-control data and cohort data.
OR̂ = ad/(bc).
Interpretation of the sample odds ratio: If the sample odds ratio is around
1, exposure to the risk factor does not affect the odds of getting the disease;
if the odds ratio is larger (smaller) than 1, exposure to the risk factor increases
(decreases) the odds of getting the disease. Due to the randomness behind the
data, we usually claim significance of the odds ratio at the 0.05 significance
level when the 95% confidence interval of the odds ratio lies completely within
the interval (0, 1) or (1, ∞).
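A Python sketch of the odds ratio computation and its significance check: the confidence interval is built on the log odds ratio scale (Woolf's normal approximation) and exponentiated, so significance corresponds to the interval excluding 1. The 2×2 counts below are hypothetical.

```python
import math

# Hypothetical 2x2 counts: a, b = cases/controls among the exposed,
# c, d = cases/controls among the non-exposed.
a, b, c, d = 30, 70, 15, 85

or_hat = (a * d) / (b * c)

# Approximate 95% CI via the log odds ratio (Woolf's method):
# log(OR) is roughly normal with SE = sqrt(1/a + 1/b + 1/c + 1/d).
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo = math.exp(math.log(or_hat) - 1.96 * se)
hi = math.exp(math.log(or_hat) + 1.96 * se)

# Significant at the 0.05 level when the interval excludes 1.
significant = lo > 1 or hi < 1
print(or_hat, (lo, hi), significant)
```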
The following theorem shows that the sample odds ratio is invariant between
case-control data and cohort data. Thus, while the sample disease prevalence
rate depends on the data collection method, the sample odds ratio does not:
as long as we use the sample odds ratio to measure the association between
the disease and risk factors, its value is the same under either design.
Theorem 2.1 The sample odds ratio is invariant between case-control data
and cohort data.
Thus, the estimated odds ratio for the case-control data reads

OR̂(case-control) = (a/c) / (b/d) = ad/(bc).    (2.1)

For cohort data, we have the sample odds of disease in the exposed group,

P̂(D|E) / P̂(Dᶜ|E) = (a/m1) / (b/m1) = a/b,

and the sample odds of disease in the nonexposed group,

P̂(D|Eᶜ) / P̂(Dᶜ|Eᶜ) = (c/m2) / (d/m2) = c/d.

Thus, the estimated odds ratio for the cohort data reads

OR̂(cohort) = (a/b) / (c/d) = ad/(bc).    (2.2)
Comparing equations (2.1) and (2.2) yields the conclusion of the theorem.
Although the estimate of the disease prevalence is not invariant between
cohort and case-control studies, Theorem 2.1 shows that the sample odds
ratio is invariant between case-control study and cohort study. In fact, the
TABLE 2.1
Pooling data and Simpson’s paradox
favoring not favoring
Urban Not stressed 48 12
Stressed 96 94
Rural Not stressed 55 135
Stressed 7 53
where the dataset has the following setting for the k data sources, with
i = 1, ..., k.
With the adjustment of confounding factors in each data stratum (data
source in Table 2.1), the adjusted odds ratio reads
OR(CMH) = (48×94/250 + 55×53/250) / (12×96/250 + 135×7/250) = 7427/2097 = 3.5417,
which is consistent with the conclusion on the impact of odds ratio in each
TABLE 2.2
OR and CMH for Simpson's paradox

              Case   Control   Total
Exposure       ai      bi       n1i
Nonexposure    ci      di       n2i
Total          m1i     m2i      Ni
data stratum, namely that people with stress tend to be against the new health
policy while those without stress are in favor of the new policy.
Certainly, the hypothetical dataset is constructed in a way that amplifies
the confounding effect. However, it points out the fact that pooling datasets
(with a data frame similar to Table 2.2) without carefully considering
potential confounding factors may completely alter the learning outcome, and
consequently result in misleading conclusions. Confounding effects in Simpson's
paradox necessitate adjusting methods such as the CMH weighting approach.
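The adjusted odds ratio can be reproduced directly from the stratum counts in Table 2.1 with the Cochran-Mantel-Haenszel weighting; in this Python sketch each stratum is laid out as (a, b, c, d) following the cells of Table 2.2.

```python
# CMH-weighted odds ratio for the stratified data in Table 2.1; each
# stratum is (a, b, c, d) following the cell layout of Table 2.2.
strata = [
    (48, 12, 96, 94),    # urban stratum
    (55, 135, 7, 53),    # rural stratum
]

# CMH estimate: sum of a*d/N over strata divided by sum of b*c/N.
num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
or_cmh = num / den

print(round(or_cmh, 4))   # → 3.5417
```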
FIGURE 2.1
Regression tree of sale volume on price and advertising cost
As shown in Figure 2.1, since the first split of the regression tree is on whether
the selling price is more than $2000, we essentially consider the selling price
first. Products with a selling price of more than $2000 are placed in one
category (branch), while the advertising input is considered as the second
criterion (sub-branch). If the first split were on advertising cost, we would
correspondingly consider the advertising cost first.
Interpretation: For the branch where the selling prices of the product are
more than $2000, the advertising input will be considered. Assume that the
split point for the advertising input is $10,000. We essentially consider two
subbranches for products in the first branch: one subbranch has selling price
more than $2000 and advertising input more than $10,000; while another
subbranch is for products with selling price more than $2000 and advertising
input no more than $10,000. Since products in each subbranch share the same
impact from the two factors (selling price and advertising input), the sale
volumes for products in each subbranch are averaged for the predicted sale
volume in the terminal node, (40K, 30K).
For products in the branch where selling prices are no more than $2000,
similar consideration leads to the following two sub-branches. The first sub-
branch consists of products with selling prices no more than $2000 and ad-
vertising input more than $15,000, while the second subbranch consists of
products with selling prices no more than $2000 and advertising input no
more than $15,000, correspondingly. The sale volumes of products in each
subbranch are then averaged to obtain the predicted sale volumes, 50K and
10K, respectively.
In the description of the construction of a regression tree in Figure 2.1, as
mentioned before, one of the distinct features is the order in which the
explanatory variables are considered in the prediction process. Another key
issue is the determination of the split point for the explanatory variable under
consideration. It is relevant at this point to introduce two basic definitions
in the construction of a regression tree.
Definition 2.6 Feature Space: The set of all possible input combinations
of explanatory variables in statistical prediction.
For instance, with two positive-valued predictors X1 and X2, the feature space is S = {(X1, X2) ∈ R+ × R+}.
The concept of feature space is closely related to the concept of feature space
partition in the theory of decision trees.
Definition 2.7 Partition of the feature space: Let Θ be the feature space
of a set of predictors. Denote {Θ1 , ..., Θk } the set of mutually exclusive subsets
of the feature space Θ, such that
Θ_i ∩ Θ_j = ∅ for i ≠ j, and ⋃_{i=1}^{k} Θ_i = Θ.
For instance, for any two positive values s1 > 0 and s2 > 0, let R11 = {(X1, X2) : X1 ≤ s1, X2 ≤ s2}, R12 = {(X1, X2) : X1 ≤ s1, X2 > s2}, R21 = {(X1, X2) : X1 > s1, X2 ≤ s2}, and R22 = {(X1, X2) : X1 > s1, X2 > s2}; then {R11, R12, R21, R22} is a partition of the feature space S.
With a partition of the feature space, under the assumption that subjects
in the same set of a partition share the same expected response, we have

y_t = Σ_{i,j} c_ij I_{R_ij}(x_t),    (2.3)

where the indicator function is

I_{R_ij}(x_t) = 1 if x_t ∈ R_ij, and 0 otherwise.
In this setting, the estimate of cij (which is the same as the expected response
within the partition Rij ) with minimum mean squared prediction error reads
c*_ij = ŷ_ij = (1/n_ij) Σ_{t: x_t ∈ R_ij} y_t,

where n_ij is the number of observations falling in R_ij.
Example 2.5 Consider a set of data where the sample mean responses within
the four sets of the partition are 15, 20, 30, and 40. Similar to the tree demon-
strated in Figure 2.1, the tree prediction model (2.3) becomes

y_t = 15 I_{R11}(x_t) + 20 I_{R12}(x_t) + 30 I_{R21}(x_t) + 40 I_{R22}(x_t).

The prediction depends on the set R_ij that the value x_t falls into. If x_t ∈ R11, we have
I_{R11}(x_t) = 1. Since {R11, R12, R21, R22} forms a partition of the feature space,
the observation x_t does not belong to any one of the sets R12, R21, or
R22, so I_{R_ij}(x_t) = 0 whenever (i, j) ≠ (1, 1). Thus y_t = 15.
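The piecewise-constant prediction in model (2.3) can be sketched in a few lines of Python. The numbers are illustrative: the selling-price split follows Figure 2.1, the advertising split of $10,000 follows the interpretation above, and the four region means 15, 20, 30, and 40 follow Example 2.5.

```python
# Sketch of the piecewise-constant tree prediction in model (2.3).
# Split points and region means are taken from the running example;
# they are illustrative values, not estimated from data.

S1 = 2000    # split point on selling price X1
S2 = 10000   # split point on advertising input X2

# Region means c_ij: averaged responses within each partition set R_ij
REGION_MEAN = {
    (1, 1): 15,  # X1 <= S1 and X2 <= S2
    (1, 2): 20,  # X1 <= S1 and X2 >  S2
    (2, 1): 30,  # X1 >  S1 and X2 <= S2
    (2, 2): 40,  # X1 >  S1 and X2 >  S2
}

def predict(x1, x2):
    """Return c_ij for the unique partition set R_ij containing (x1, x2)."""
    i = 1 if x1 <= S1 else 2
    j = 1 if x2 <= S2 else 2
    return REGION_MEAN[(i, j)]

print(predict(1500, 8000))    # (x1, x2) falls in R11, so the prediction is 15
print(predict(2500, 12000))   # falls in R22, so the prediction is 40
```

Because the sets R_ij form a partition, every input lands in exactly one region and receives that region's averaged response.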
Different from the linear regression method, where the random error fol-
lows a statistical model, regression trees do not need any model assumption.
On the other hand, the difficulty in implementing a regression tree
shifts to selecting the partition of the feature space that minimizes
the mean squared prediction error. For instance, in the selling price and ad-
vertising input example, the construction of a regression tree depends on the
selection of the values s and t, as well as the order of the two explanatory variables,
X1 and X2. When the number of features (predictors) increases, the corre-
sponding volume of computation increases dramatically. This necessitates
the use of computing software for the construction of regression trees.
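Selecting the split point that minimizes the squared prediction error can be illustrated with a brute-force search over one feature; the price and volume numbers below are made up for illustration, and the candidate split points are simply the observed feature values.

```python
def best_split(x, y):
    """Exhaustive search for the one-feature split point minimizing the
    total within-branch sum of squared errors around the branch means."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best_s, best_err = None, float("inf")
    for s in sorted(set(x))[:-1]:      # candidate split points
        left = [yi for xi, yi in zip(x, y) if xi <= s]
        right = [yi for xi, yi in zip(x, y) if xi > s]
        total = sse(left) + sse(right)
        if total < best_err:
            best_s, best_err = s, total
    return best_s, best_err

# Hypothetical selling prices (in $) and sale volumes (in K)
price = [1000, 1500, 1800, 2500, 3000]
volume = [10, 12, 11, 40, 42]
s, err = best_split(price, volume)
print(s, err)   # the split separates the two volume clusters
```

With several predictors, this search is repeated over every feature at every node, which is why the computational burden grows so quickly with the number of predictors.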
As an introduction to the basic concepts of statistical learning,
this section discusses the concept and interpretation of the decision tree. More
details on this topic, especially the uniformly minimum variance unbiased
estimator (UMVUE) for the homogeneity index in each terminal node, will be
delineated in Chapter 9.
TABLE 2.3
Four possible outcomes in a binary classification
Feature          Predicted negative     Predicted positive     Total
Real negative    true negative          false positive         n_(real negative)
Real positive    false negative         true positive          n_(real positive)
Total            m_(predicted negative) m_(predicted positive) N
To quantify the discriminating ability of a diagnostic test, two concepts frequently used in the literature are
sensitivity and specificity.
of each patient, x. Thus, for each value c, based on the trained classifier T,
we can compute the estimated sensitivity by dividing the number of correctly
diagnosed cases by the total number of true cases in the sample. Similarly, the estimated
specificity can be obtained for each value of c. Thus, in a trained model (binary
classifier), a pair of values (sensitivity, specificity) can be computed for each
diagnostic threshold c.
Considering all permissible values of the diagnostic threshold c, c ∈ A,
gets a set of pairs

{(sensitivity(c), specificity(c)) : c ∈ A}.

Plotting this set of data using the pairs (sensitivity(c), 1 − specificity(c)) for
all c ∈ A yields a curve, which is the ROC curve.
FIGURE 2.2
Example of a sample ROC curve
The ROC curve is basically the graphic plot of two parameters, sensitivity
versus (1 − specificity), as the value c travels in the domain of the
diagnostic threshold. It evaluates the discriminating ability of the binary
classifier. When a set of testing data is available, we may plug the data into
the estimated model (2.5) to obtain a set of sample sensitivities and sample
specificities associated with different cut-off (threshold) values of the classifier.
The plot of the sample (estimated) sensitivity versus the false positive rate (1 −
sample specificity) forms an estimated ROC curve, as shown in Figure 2.2.
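An estimated ROC curve can be traced by sweeping the threshold c across the observed classifier scores; the scores and disease labels below are hypothetical.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs over a grid of
    thresholds c. labels: 1 = real positive (case), 0 = real negative.
    A subject is predicted positive when its score exceeds c."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = []
    for c in [float("-inf")] + sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s > c and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s > c and l == 0)
        pts.append((fp / neg, tp / pos))   # (false positive rate, sensitivity)
    return pts

# Hypothetical classifier scores and true disease status
scores = [0.2, 0.4, 0.55, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]
for fpr, sens in roc_points(scores, labels):
    print(fpr, sens)
```

Connecting the printed pairs, from (1, 1) at the lowest threshold down to (0, 0) at the highest, gives a staircase version of the curve in Figure 2.2.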
2.4 Cross-Validation
After data identification, statistical learning of an unknown underlying model
usually involves three portions: model training, model validation, and model
testing. Correspondingly, the dataset is ideally split into three mutually exclu-
sive subsets: a training set, a validation set, and a test set. The three portions
are briefly described below.
The model training portion fits a candidate model with the training data to
estimate the model parameters. The selection of the candidate model is usually
based on data analytic knowledge; for instance, we may assume that the mean
response is a linear function of the predictors. In this step, assumptions on
the data and background information about the data play a critical role in the
selection of an appropriate model. In machine learning processes, the
training dataset is usually used to fit a specific candidate model. However, there may be more
than one candidate model, such as different orders in polynomial regression,
where we may fit a linear function or a quadratic function as candidate mod-
els. In polynomial regression, the degree of the polynomial model is usually
validated with the validation data.
The model validation part involves parameter estimation on the basis of
an accuracy measurement in conjunction with the validation data, such as
the selection of the model coefficients corresponding to the smallest MSPE (mean
squared prediction error). As for artificial neural networks, the number of
hidden units in each layer is a hyper-parameter to
be determined with the validation data. In general, fine-tuning the trained
model necessitates the validation process.
The model testing portion usually includes the evaluation of the final model
with the testing data that was set aside to independently assess the perfor-
mance of the final model. It is inappropriate to estimate the predictive model
and calculate the validation criteria on the same data to justify the final model because, over-
all, the validation data is just one random sample representing the population.
Especially when it comes to prediction, the first step is to build or estimate a
model, which is then used to predict the unknown response.
One of the critical steps in the prediction process is to build a model that
fits the data well. Usually we spend 50% of the data on training the model,
25% on validation, and 25% on testing. However, when
the sample size is not large enough for splitting into a training set and a
validation set, an efficient alternative is to combine the training and validation
parts within the training data by the method of cross-validation.
Generally, cross-validation is a data implementation process that uses nu-
merical computation to replace thorny theoretical analysis. It evaluates the
trained model over multiple rounds with different partitions of the training data,
then takes the average of the corresponding evaluated model accuracies (such
as the MSE) to give an estimate of the model's predictive performance.
The validation process uses k different folds of the data, which consequently
generates k accuracy measurements. Taking the average of the k accuracy
measurements leads to the overall accuracy level of the model.
In practice, we usually use LOOCV (leave-one-out cross-validation), 5-fold
CV, and 10-fold CV.
The LOOCV accuracy averages the n single-observation mean squared errors,

CVA(n) = (1/n) Σ_{i=1}^{n} MSE_i,

while the ordinary mean squared error, without any data manipulation, is

MSE = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)².

In terms of the drop-one predictions ŷ*_i, the LOOCV accuracy reads

CVA(n) = (1/n) Σ_{i=1}^{n} (ŷ*_i − y_i)²,
with i = 1, ..., n, and the jackknife estimate is the average of the n drop-one
sample estimates,

θ̂_Jack = (1/n) Σ_{i=1}^{n} θ̂_(i).
Example 2.7 We shall use a toy dataset, {1, 5, 9}, to illustrate the cross-
validation accuracy (CVA) for the prediction of the population mean using
LOOCV, along with the jackknife estimation.
Case-1 Since the overall sample mean is 5, we have the mean squared error without
any data manipulation,
MSE = (1/3)[(5 − 1)² + (5 − 5)² + (9 − 5)²] = 32/3.
Case-2 Since there are three observations in the dataset, we have the three drop-
one sample estimates of the population mean,

θ̂_(1) = (5 + 9)/2 = 7
θ̂_(2) = (1 + 9)/2 = 5
θ̂_(3) = (1 + 5)/2 = 3,

which serve as the LOOCV predictions ŷ*_1, ŷ*_2, and ŷ*_3, respectively. The cross-validation
accuracy is

CVA(3) = (1/3)[(7 − 1)² + (5 − 5)² + (3 − 9)²] = 72/3 = 24.
Theorem 2.2 The jackknife procedure does not change the estimate of the
population mean, but each drop-one estimate deviates from the sample
mean by only 1/(n − 1) of the deviation of the corresponding observation.

Proof: Since the drop-one estimate of the mean is θ̂_(i) = (n x̄ − x_i)/(n − 1),

θ̂_Jack = (1/n) Σ_{i=1}^{n} θ̂_(i) = (1/n) Σ_{i=1}^{n} (n x̄ − x_i)/(n − 1) = [(n − 1)/(n − 1)] x̄ = x̄.
|θ̂_(i) − θ̂_Jack| = |(n x̄ − x_i)/(n − 1) − x̄|
= |(1/(n − 1))(n x̄ − x_i − (n − 1) x̄)|
= |(1/(n − 1))(x̄ − x_i)|.
Theorem 2.3 For any asymptotically unbiased estimator T(X) with

E(T(X)) = τ(θ) + O(1/n),

the jackknife version of the estimator improves the convergence rate of the bias
from O(1/n) to O(1/n²).
Define X(−i) the jackknife duplicate (the sub-sample of (n − 1) observations
excluding xi ). The jackknife version of T (X) is
T_Jack(X) = n T(X) − ((n − 1)/n) Σ_{i=1}^{n} T(X_(−i)),

so that

E(T_Jack(X)) = n E[T(X)] − ((n − 1)/n) Σ_{i=1}^{n} E[T(X_(−i))]
= n[τ(θ) + O(1/n)] − ((n − 1)/n) Σ_{i=1}^{n} [τ(θ) + O(1/(n − 1))]
= (n − n + 1)τ(θ) + n O(1/n) − (n − 1) O(1/(n − 1)).

Writing the O(·) terms as series with coefficients a_1, a_2, ...,

n O(1/n) − (n − 1) O(1/(n − 1)) = a_1 + a_2/n + ... − a_1 − a_2/(n − 1) − ...
= −a_2/(n(n − 1)) + ...
= O(1/n²).
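Theorem 2.3 can be illustrated numerically with the plug-in variance (1/n) Σ (x_i − x̄)², whose bias is −σ²/n = O(1/n). For this particular quadratic statistic the jackknife correction is exact: it recovers the unbiased sample variance. The data values below are arbitrary.

```python
# Jackknife bias reduction: T_Jack(X) = n*T(X) - ((n-1)/n) * sum_i T(X_(-i)).

def plugin_var(xs):
    """Biased plug-in variance with divisor n."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def jackknife(stat, xs):
    """Jackknife version of an estimator, as in Theorem 2.3."""
    n = len(xs)
    t_drop = [stat(xs[:i] + xs[i + 1:]) for i in range(n)]
    return n * stat(xs) - (n - 1) / n * sum(t_drop)

xs = [1.0, 2.0, 3.0, 7.0, 11.0]
n = len(xs)
m = sum(xs) / n
s2 = sum((x - m) ** 2 for x in xs) / (n - 1)   # unbiased sample variance

print(plugin_var(xs))             # biased downward by the factor (n-1)/n
print(jackknife(plugin_var, xs))  # matches the unbiased sample variance
```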
CVA(n) = (1/n) Σ_{i=1}^{n} [(y_i − ŷ_i)/(1 − h_i)]²,    (2.6)

where ŷ_i = x_i'β̂ is the estimated response for the ith observation, and h_i is the
leverage of the ith observation,

h_i = x_i'(X'X)^{−1} x_i.
Proof: Denote X_(−i) and y_(−i), respectively, the data without the ith observa-
tion (y_i and x_i); β̂_(−i) is the vector of the least squares estimates using X_(−i) and
y_(−i), and ŷ*_i is the estimated response using LOOCV with the ith observation
dropped.
Note that under this setting,

CVA(n) = (1/n) Σ_{i=1}^{n} (y_i − ŷ*_i)²,  with  ŷ*_i = x_i'β̂_(−i).
Under the assumption that n − 1 > p (see Xie and Chen (1988) [126]), the
sample covariance matrix is positive definite with probability 1, so both (X'X)^{−1}
and (X'_(−i)X_(−i))^{−1} exist. Multiplying by (X'X)^{−1} on both sides of (2.7) yields

[X'_(−i)X_(−i)][X'_(−i)X_(−i)]^{−1} = (X'_(−i)X_(−i))(X'X)^{−1} + x_i x_i'(X'X)^{−1}.    (2.9)

Multiplying by [X'_(−i)X_(−i)]^{−1} from the left-hand side of (2.9) gets

[X'_(−i)X_(−i)]^{−1} = (X'X)^{−1} + [X'_(−i)X_(−i)]^{−1} x_i x_i'(X'X)^{−1}.    (2.10)

Post-multiplying by x_i gives

[X'_(−i)X_(−i)]^{−1} x_i = (X'X)^{−1} x_i + [X'_(−i)X_(−i)]^{−1} x_i x_i'(X'X)^{−1} x_i,    (2.11)

which becomes

[X'_(−i)X_(−i)]^{−1} x_i (1 − h_i) = (X'X)^{−1} x_i.    (2.12)

By (2.12), we have

(X'_(−i)X_(−i))^{−1} x_i = (1/(1 − h_i)) (X'X)^{−1} x_i.    (2.13)
Now, by the least squares estimates of the regression coefficients corresponding to X
and X_(−i),

X'Xβ̂ = X'y,
(X'_(−i)X_(−i))β̂_(−i) = X'_(−i)y_(−i).

Considering (2.7) together with X'y = X'_(−i)y_(−i) + x_i y_i, we have

{I_k + (X'_(−i)X_(−i))^{−1} x_i x_i'}β̂ = β̂_(−i) + (X'_(−i)X_(−i))^{−1} x_i (x_i'β̂ + ê_i),

where ê_i = y_i − ŷ_i is the residual of the ith observation with β estimated by the
complete data. Thus
and by (2.14),

CVA(n) = (1/n) Σ_{i=1}^{n} d̂²_(i) = (1/n) Σ_{i=1}^{n} [(y_i − ŷ_i)/(1 − h_i)]²,

where d̂_(i) = y_i − ŷ*_i and ŷ*_i is the estimated response using LOOCV with the ith
observation dropped.
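The shortcut formula (2.6) can be checked against explicit refitting. The sketch below uses simple linear regression with one predictor and made-up data, where the leverage reduces to h_i = 1/n + (x_i − x̄)²/S_xx.

```python
# Verify (2.6): LOOCV by explicit refitting equals the leverage shortcut.

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.7]
n = len(x)

def fit(xs, ys):
    """Least squares intercept and slope for y = a + b*x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Explicit LOOCV: refit n times, each time predicting the held-out point
cva_explicit = 0.0
for i in range(n):
    a, b = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    cva_explicit += (y[i] - (a + b * x[i])) ** 2
cva_explicit /= n

# Shortcut (2.6): one full-data fit plus the leverages h_i
a, b = fit(x, y)
mx = sum(x) / n
sxx = sum((xi - mx) ** 2 for xi in x)
cva_shortcut = sum(
    ((yi - (a + b * xi)) / (1 - (1 / n + (xi - mx) ** 2 / sxx))) ** 2
    for xi, yi in zip(x, y)
) / n

print(cva_explicit, cva_shortcut)   # identical up to rounding
```

The identity holds exactly for linear least squares, so the shortcut requires only one fit instead of n.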
data Fresh;
set WORK.'Fresh_multiple regression data'n;
x1=log('sale price'n);
x2=log('competitor price'n);
x3=log('advertising cost'n);
x4='sale price'n;
x5='competitor price'n;
x6='advertising cost'n;
run;
proc glmselect;
model 'market demand'n= x1-
x6/selection=forward(stop=CV) details=steps
cvMethod=split(117);
run;
FIGURE 2.3
SAS codes and LOOCV output for market demand predicted by sale price,
competing price, and advertising input
The trained model is evaluated on the held-out fold, with the MSPE as
the model accuracy measurement.
3. The procedure is repeated k times, with each fold serving once as the validation
set.
CVA(k) = (1/k) Σ_{i=1}^{k} MSPE_i.
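The k-fold procedure above can be sketched generically; the "trained model" below is simply the training-set mean, a minimal hypothetical stand-in, and the fold assignment is a simple deterministic split.

```python
def k_fold_cva(data, k):
    """Hold each of k folds out once, train on the rest, and average
    the k mean squared prediction errors MSPE_1, ..., MSPE_k."""
    folds = [data[i::k] for i in range(k)]     # deterministic fold split
    mspes = []
    for i in range(k):
        train = [y for j, f in enumerate(folds) if j != i for y in f]
        held_out = folds[i]
        pred = sum(train) / len(train)         # the "trained model"
        mspes.append(sum((y - pred) ** 2 for y in held_out) / len(held_out))
    return sum(mspes) / k

data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
print(k_fold_cva(data, 5))            # 5-fold CV
print(k_fold_cva(data, len(data)))    # k = n reduces to LOOCV
```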
data Fresh;
set WORK.'Fresh_multiple regression data'n;
x1=log('sale price'n);
x2=log('competitor price'n);
x3=log('advertising cost'n);
x4='sale price'n;
x5='competitor price'n;
x6='advertising cost'n;
run;
proc glmselect;
model 'market demand'n= x1-
x6/selection=forward(stop=CV) details=steps
cvMethod=split(5);
run;
FIGURE 2.4
SAS codes and output for 5-fold CV on market demand data
Example 2.8 To seek the relationship between market demand and sale price,
advertising cost, and the competitor's price, using a random sample of the
previous months' records as the input data, one of the difficulties is to select
the best model among all the possible variables: the log scale of the sale price, the
log scale of the competitor's price, the log scale of the advertising cost, as well as
the original three variables recorded in the dataset. We use LOOCV and
5-fold cross-validation to select variables for the regression model.
Examining the SAS outputs in Figure 2.3 and Figure 2.4, we can see that
the cross-validation accuracies are consistent, as well as the selected model of
the data.
As shown in Figure 2.3, after adding the intercept, the CV PRESS de-
creases by the largest amount (105.72) upon adding x1, the log scale of the sale price.
The second most important variable in terms of decreasing the CV PRESS (cross-
validation Prediction Residual Error Sum of Squares) is x3, the log scale of the
advertising cost, by the amount of 88.94, which is followed by 85.38 upon adding
the log scale of the competitor's price into the model. The last variable added
to the regression model is x6, the advertising cost, with the CV PRESS value
at 85.17. The selection process stops at step 6, where adding the competitor's
price into the model would increase the CV PRESS of the model from 85.17 to 86.37.
The selected model is

Demand = α + β1 x1 + β2 x3 + β3 x2 + β4 x6 + ε.
Similar conclusions hold when we use the 5-fold cross-validation in Figure
2.4. This example also shows, via cross-validation, that the log-scale transformation
of the variables may fit the data better. We will address
the issue of linear regression versus non-linear regression in Chapter 5 and
Chapter 6.
2.5 Bootstrapping
Bootstrapping is one of the efficient methods of intrinsic data manipulation. It
essentially recovers the population features by repeatedly sampling the original
data. Based on the model assumptions in data analytics, there are two types
of bootstrapping methods that we shall discuss as a fundamental topic in this
section: the nonparametric method and the parametric method. The nonparamet-
ric method treats the original sample as the population and re-samples the
original sample to gain intrinsic data information, such as the variance or prob-
ability percentiles of the underlying population. Parametric bootstrapping
starts with an assumption for the model behind the original sample, uses the
original sample to estimate the model parameters, and repeatedly samples from
the fitted model with the estimated parameters for predictions.
Bootstrapping 47
Var(X̄) = σ²/n,

and the sample variance is

S² = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)².
Denote X̄j∗ the mean of the jth bootstrapping sample, we have the following
results.
Theorem 2.5 When exhausting all possible n^n re-samples, the mean of the
bootstrapping sample means equals the mean of the original sample, and the sam-
ple variance of the bootstrapping sample means equals the sample variance
of the original sample multiplied by the constant

c = n^n(n − 1) / ((n^n − 1)n²).

That is,

(1/n^n) Σ_{j=1}^{n^n} X̄*_j = X̄,    (2.15)

(1/(n^n − 1)) Σ_{j=1}^{n^n} (X̄*_j − X̄)² = cS².    (2.16)
Proof: Notice that when we take the average over all possible outcomes of the n^n
selections with replacement, each element of the original sample E is equally
likely to be selected. Hence, by regrouping the total entries in the double
summation over i and j, each summation of n items equals Σ_{i=1}^{n} x_i,
This completes the proof of equation (2.15). As for (2.16), with a similar
rationale of regrouping the re-sampled observations into the n^n items of the
original sample, the re-sampled data variance reads

(1/(n^n − 1)) Σ_{j=1}^{n^n} (X̄*_j − X̄)²
= (1/(n^n − 1)) Σ_{j=1}^{n^n} (1/n²)[ Σ_{i=1}^{n} (x*_ij − X̄) ]²
= (1/(n^n − 1)) Σ_{j=1}^{n^n} (1/n²)[ Σ_{i=1}^{n} (x*_ij − X̄)² + Σ_{i≠k} (x*_ij − X̄)(x*_kj − X̄) ]
= (1/(n^n − 1)) Σ_{j=1}^{n^n} (1/n²)[ Σ_{i=1}^{n} (x*_ij − X̄)² ]
= (1/(n^n − 1)) (1/n²) Σ_{j=1}^{n^n} Σ_{i=1}^{n} (x*_ij − X̄)²
= (1/(n^n − 1)) (1/n²) (n − 1) n^n S²,

where the cross-product terms vanish because they sum to zero over all n^n
equally likely re-samples. This completes the proof of (2.16).
Theorem 2.5 establishes the connection between the mean and sample
variance of the bootstrapping sample means under the setting of equally likely
selection. The following theorem establishes the connection between the average of the n^n
bootstrapping sample variances and the sample variance of the original sample.

Theorem 2.6 Let S*²_j be the sample variance of the jth bootstrapping sample
{x*_1j, ..., x*_nj}. We have

(1/n^n) Σ_{j=1}^{n^n} S*²_j = ((n − 1)/n) S².    (2.17)
[Spreadsheet output: the 27 re-sample means have grand mean 4 and grand variance 1.615385; the mean squared deviation of the original sample is 4.666667, which equals the sample variance 7 after multiplying by n/(n − 1); and the grand variance reaches 7 after multiplying by (n^n − 1)n²/(n^n(n − 1)).]
FIGURE 2.5
Small sample nonparametric bootstrap
Example 2.9 Consider the original data set E = {2, 3, 7}. As shown in Fig-
ure 2.5, the original sample has sample mean 4 and sample variance 7. We
have 3³ = 27 different sample values in the re-sampling with replacement
method. Taking the average of all 27 re-sample means, we get exactly the
same value as the sample mean of the original sample.

As shown in Figure 2.5, when we take the sample variance of the boot-
strapping sample means, the grand variance is 1.615385, which recovers the
original sample variance 7 after multiplying by

(n^n − 1)n² / (n^n(n − 1)) = (26 × 9)/(27 × 2).
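Theorem 2.5 and Theorem 2.6 can be verified for E = {2, 3, 7} by enumerating all 27 re-samples, reproducing the numbers shown in Figure 2.5:

```python
# Exhaustive bootstrap over all n^n = 27 re-samples of E = {2, 3, 7}.
from itertools import product

E = [2, 3, 7]
n = len(E)
xbar = sum(E) / n                                  # 4
s2 = sum((x - xbar) ** 2 for x in E) / (n - 1)     # 7

samples = list(product(E, repeat=n))               # the 27 re-samples
means = [sum(s) / n for s in samples]

grand_mean = sum(means) / len(means)               # equals xbar, by (2.15)
grand_var = sum((m - grand_mean) ** 2 for m in means) / (len(means) - 1)
c = n ** n * (n - 1) / ((n ** n - 1) * n ** 2)     # constant in (2.16)

# Theorem 2.6: the average re-sample variance equals ((n-1)/n) * S^2
var_star = [sum((x - sum(s) / n) ** 2 for x in s) / (n - 1) for s in samples]
avg_var_star = sum(var_star) / len(var_star)

print(grand_mean)     # 4.0
print(grand_var)      # 1.615385..., equal to c * S^2
print(avg_var_star)   # 4.666667..., equal to (2/3) * 7
```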
In the parametric method, the underlying model assumed for the original sample can be used to add information
to the re-sampled data.
We will use an example to demonstrate the difference between parametric
bootstrapping and nonparametric bootstrapping.
> set.seed(10)
> x<-rnorm(20, 2, 0.8)
> x
[1] 2.0149969 1.8525980 0.9029356 1.5206658 2.2356361 2.3118354
1.0335391 1.7090592 0.6986619 1.7948173 2.8814236
[12] 2.6046252 1.8094132 2.7899558 2.5931121 2.0714778 1.2360449
1.8438797 2.7404170 2.3863828
>
> x.mean <- mean(x)
> x.std <-sd(x)
>
> x.mean
[1] 1.951574
> x.std
[1] 0.6399275
Nonparametric bootstrapping
> vec<-rep(0, 10000)
> for (i in (1:10000)){
+ y<-sample(x, 20, replace=TRUE)
+ vec[i]<-mean(y)
+ }
> mean(vec)
[1] 1.951677
> sd(vec)
[1] 0.1416576
> sqrt(20)*sd(vec)
[1] 0.6335121
FIGURE 2.6
Comparing nonparametric and parametric bootstrapping
deviation 0.6399. As shown in Figure 2.6, the mean value of the re-sampled
data is now 1.9499, with standard deviation 0.6332. These results are very close
to the outcomes using the nonparametric method.
Certainly, in Example 2.9 and Example 2.10, the estimated quantity is the
unknown mean, and its variability is estimated by the sample standard devi-
ation. In such cases, the bootstrapping method does not add much information
beyond the original sample. In the following example,
we explore a situation where the original data sample does not provide a direct
estimate of the standard deviation of the parameter estimator. In this
case, bootstrapping becomes an effective way to find the standard deviation,
and hence the confidence interval.
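A minimal sketch of this idea uses the sample median, whose standard error has no simple closed form; the data here are simulated, so the numbers are purely illustrative.

```python
# Nonparametric bootstrap standard error and percentile interval for the
# sample median; the sample itself is simulated for illustration.
import random

random.seed(7)
sample = [random.gauss(2.0, 0.8) for _ in range(50)]

def median(xs):
    s = sorted(xs)
    m = len(s)
    return s[m // 2] if m % 2 else (s[m // 2 - 1] + s[m // 2]) / 2

boot = []
for _ in range(2000):
    resample = [random.choice(sample) for _ in range(len(sample))]
    boot.append(median(resample))
boot.sort()

mean_boot = sum(boot) / len(boot)
se = (sum((b - mean_boot) ** 2 for b in boot) / (len(boot) - 1)) ** 0.5
ci = (boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))])

print(se)   # bootstrap standard error of the median
print(ci)   # 95% percentile confidence interval
```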
Since the effects of Losartan and Valsartan are correlated, but the impact of Biso-
prolol is not correlated with Losartan or Valsartan, we have
> library(boot)
> bp <- read.table("D:/chapter2/bp.txt", header=TRUE)
> head(bp)
Losartan Valsartan Bisoprolol
1 -1.5453679 -1.1720908 -0.37967880
2 2.0408970 1.4170437 2.27528290
3 -0.1547983 -0.1231551 2.01227522
4 0.8056872 0.4575227 -0.03546461
5 -1.0266717 -0.2783768 -0.86678710
6 2.1577671 1.1213169 0.68332724
FIGURE 2.7
Optimal treatment regime for blood pressure instability
As depicted in Figure 2.7, with the original sample of blood pressure fluc-
tuations of 200 patients in the clinical trial, some patients responded positively
to medications of Beta-blockers, some to Angiotensin II receptor blockers,
and some to both. With the original sample, plugging the moment
estimators of the variances and covariance into (2.20) gets

α̂ = 0.297,
(1/(1000 − 1)) Σ_{r=1}^{1000} (α̂_r − ᾱ)² = 0.01686434,

which leads to a 95% confidence interval (0.264, 0.330) for the unknown optimal
proportion α. This means that the optimal treatment regime lies in the
range from 26.4% to 33% for the best blood pressure stability of hypertension
patients.
Summary
This chapter focuses on basic concepts and methods that facilitate follow-
up discussions on statistical prediction and machine learning. It starts with
a discussion on different types of data, which is the first step in the learning
process. Different types of data require different measures of homogeneity
and different learning procedures, a point often overlooked in machine learning
practice. If the method is not right, the trained model could be fatally
misleading, even if it reaches a small testing error on one testing dataset. We
used case-control data, cohort data, and cross-sectional data in this chapter
to elucidate the discernible methods and outcomes corresponding to each type of data.
The second key component that we concentrate on in this chapter is the decision
tree, a concept that we will frequently use and intertwine with other topics in
the rest of the book. We introduce the mathematical definition and practical
interpretation of a decision tree, which departs from the conventional inference
approach of the model-based culture of data science. More insightful
issues in this regard will be delineated in Chapter 9.
Similar to the fundamental concepts of the probability of type-I error and
the probability of type-II error in hypothesis testing, other frequently used
terms in the data-oriented camp are sensitivity and speci-
ficity, with the two measurements plotted as the ROC curve. We enhance
the definitions with introductory examples in this chapter and will further explore
the sensitivity-specificity trade-off in Chapter 3.
Cross-validation and bootstrapping are two data-based computer-intensive
methods in the data-oriented culture camp. We introduce them in this chapter
On the other hand, if we assert that everyone has the disease (and do not claim
anybody as being healthy),
58 Sensitivity and Specificity Trade-off
TABLE 3.1
Type-I and Type-II errors in hypothesis testing
Rejecting null hypothesis Not rejecting null hypothesis
Null true Type-I error correct decision
Alternative true power Type-II error
It should be noted that (3.3) is only one of the selection approaches for the
diagnostic criterion c. It uses the convenient concept of the distance between
two points in the xy-plane. This selection is not necessarily optimal in practice.
For example, when controlling the false negative error
is more important, such as misdiagnosing and missing treatment of a life-
threatening disease versus the error of asking the patient to
take a second confirmatory screening test, the selection standard in (3.3) will
be misleading, because it does not take the weights of diagnostic priority into
consideration. On the other hand, when controlling the false positive error is more
critical, such as the error of incorrectly pushing a healthy person into an
operation room versus the error of asking the patient to take a preventive
medicine, the selection standard in (3.3) is also misleading, because it treats
both errors at the same level of importance (equal weights).
Controlling the false positive error and the false negative error is not a new para-
dox in data science. In hypothesis testing, we are often confronted with the
dilemma of controlling the probability of Type-I error (incorrectly rejecting
the true null hypothesis) and the probability of Type-II error (incorrectly failing to
reject the null hypothesis when the alternative is true), as shown in Table 3.1.
According to well-documented statistics literature, the solution to the
dilemma on the control of the Type-I and Type-II errors is to select one
statement as the null hypothesis, control the probability of making the Type-
I error in the selection of rejection areas, and find the most powerful test to
minimize the chance of making the Type-II error.
In what follows in this chapter, we will reformulate the control of the false
positive error and the control of the false negative error into the framework of
hypothesis testing, and consequently state similar results on the determination
of the diagnostic threshold: keeping control of the false positive rate while
minimizing the false negative rate. The idea is similar
to the concept of the uniformly most powerful test in hypothesis testing.
Definition 3.1 Uniformly Most Efficient Lower Variable: Assume that lower
values of the diagnostic measure are associated with the disease. Let K be a
set of diagnostic measures that satisfy
P (D ≥ c|healthy) = 1 − α,
When high values of the diagnostic measure D are associated with the
disease, such as escalated systolic blood pressure or escalated cholesterol level
for heart attacks, Definition 3.1 is equivalent to the following.
Definition 3.2 Uniformly Most Efficient Upper Variable: Assume that high
values of the diagnostic measure are associated with the disease. Let K be a
set of diagnostic measures that satisfy
P (D < c|healthy) = 1 − α,
Definition 3.1 and Definition 3.2 are essentially the same except for the di-
rection of the diagnostic measure toward the disease. Notice that with the
two evaluation criteria (false positive and false negative errors, or type-I and
Most sensitive diagnostic variable 61
Solution Set the level α = 0.05 so that the specificity is at the 0.95 level. We have

Specificity = P({X : (X̄ − 130)/(3/√20) < z_α} | healthy) = 0.95,

and

Sensitivity(μ) = P((X̄ − 130)/(3/√20) > z_α | μ)
= P((X̄ − μ)/(3/√20) > z_α − (μ − 130)/(3/√20))
= P(Z > z_α − (μ − 130)/(3/√20)).
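The sensitivity function in the last display can be evaluated directly with the standard normal CDF; the grid of μ values below is chosen only for illustration.

```python
# Sensitivity(mu) = P(Z > z_0.05 - (mu - 130)/(3/sqrt(20))).
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def sensitivity(mu, mu0=130.0, sigma=3.0, n=20, z_alpha=1.645):
    return 1 - phi(z_alpha - (mu - mu0) / (sigma / math.sqrt(n)))

for mu in [130, 131, 132, 133]:
    print(mu, round(sensitivity(mu), 4))   # at mu = 130 this is about 0.05
```

At the boundary μ = 130 the sensitivity equals the level α = 0.05, and it increases toward 1 as μ moves above 130, tracing the curve in Figure 3.1.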
FIGURE 3.1
Sensitivity in patients with LDL-C level more than 130 mg/dL
Theorem 3.1 Assume that the underlying model (pdf or pmf) of a set of data
is f1(x) for the case population and f0(x) for the control population. The most efficient diagnostic measure
is

D* = f1(x)/f0(x),

with

P(D* < c*|healthy) = 1 − α

when a small diagnostic measure D is associated with the disease, and
sigma <- 3
n <- 20
mu0 <- 130
# Assumed definitions (not shown in the original excerpt):
mu.o <- seq(130, 135, by = 0.1)      # grid of mean LDL-C values
POWER <- matrix(0, length(mu.o), 2)  # holds (mu, sensitivity) pairs
for (i in 1:length(mu.o)){
  mu <- mu.o[i]
  result <- 1 - pnorm(qnorm(0.95) - (mu - mu0)/(sigma/sqrt(n)))
  POWER[i,] <- c(mu, result)
}
png(file="~/desktop/saving_plot2.png",
    width=500, height=400)
plot(POWER[,1], POWER[,2], type="l",
     xlab="LDL-C mean", ylab="Sensitivity")  # assumed plotting call
dev.off()
FIGURE 3.2
Code for sensitivity in patients with LDL-C level more than 130 mg/dL
Note that the above theorem provides an approach to find the most efficient di-
agnostic predictor when the likelihood function of the disease and the healthy
population can be plausibly assumed.
Proof Assume that large values of the diagnostic measure D are associated
with the disease. For any diagnostic measure D with false positive rate

P(D > c|healthy) = α,

denote

A = {D > c},  A* = {D* > c*}.

Pointwise, we have

(I_A − I_{A*})[D*(X) − c*] ≤ 0,

which is equivalent to

(I_A − I_{A*})[f1(X) − c* f0(X)] ≤ 0,

hence

∫ (I_A − I_{A*})[f1(X) − c* f0(X)] dX ≤ 0,

where f_i(X) = Π_{j=1}^{n} f(x_j|θ_i) for i = 0, 1. Now

∫ (I_A − I_{A*}) f1(X) dX ≤ c* ∫ (I_A − I_{A*}) f0(X) dX,

and since both predictors share the same specificity 1 − α,

P(D* > c*|healthy) = P(D > c|healthy) = α,

so the right-hand side integral equals zero. We have

∫ (I_A − I_{A*}) f1(X) dX ≤ 0,

that is, the sensitivity of D never exceeds the sensitivity of D*.
Example 3.2 Assume that the LDL-cholesterol levels follow a normal model
with Xi ∼ N (130, 2) for healthy subjects and Xi ∼ N (150, 2) for patients with
coronary heart diseases. If the specificity is set to 0.95, we want to find the
most efficient diagnostic predictor.
The likelihoods of the two populations are

f(X|healthy) = (1/(√(2π)σ))^n exp{−(1/(2σ²)) Σ_{i=1}^{n} (x_i − 130)²}

and

f(X|case) = (1/(√(2π)σ))^n exp{−(1/(2σ²)) Σ_{i=1}^{n} (x_i − 150)²}.

By Theorem 3.1, the most efficient diagnostic measure reads

D(X) = f(X|case)/f(X|healthy).
Now, if all the patients are healthy, the sample mean statistic of their LDL-
cholesterol levels follows N(130, 2/√n). After standardizing the sample statistic,
the decision rule takes the form

{D(X) > c} = {X : (X̄ − 130)/(2/√n) > k}

for some constant k determined by c. Notice that the evaluation criterion requires that the specificity be 0.95, so

{D* > c*} = {X : (X̄ − 130)/(2/√n) > 1.645}.
Solution: Notice that in this case, the ratio of chances for each symptom,
D(X) = f(x|case)/f(x|healthy), takes the following values.

In this case, a higher ratio of chance indicates that the individual is more
likely to have the disease. Based on Theorem 3.1 and according to the ranking
of the diagnostic measure for each symptom, we arrange the symptoms by
the likelihood of diabetes versus no diabetes. To satisfy the evaluation criterion of
controlling the rate of misdiagnosis at the 5% level, namely
Certainly, when the blood glucose level test is available, the laboratory test
result is more accurate for detecting diabetes as a follow-up diagnosis. How-
ever, as illustrated by Theorem 3.1, this example shows that the selection of
diagnostic predictors is possible without the use of a continuous likelihood
function.

It should be noted that Example 3.3 and Example 1.5 are very similar, in
that the control of the type-I error in hypothesis testing plays the
same role as the control of the false negative error in the sensitivity-specificity
analysis.
D(X) = sup_{Θ0} L(θ|X) / sup_{Θ} L(θ|X).
On the other hand, for the sensitivity across all LDL-cholesterol levels more
than 130, we need

P(D > c|case) = P(X̄ > c*|case) = P(Z > (c* − μ)/(1/√n) | μ > 130).

For any two LDL-cholesterol levels λ1 and λ2, if λ1 ≤ 130 and λ2 > 130, by
the derivation in Example 3.2, the most efficient diagnostic predictor is

{D > c} = {X̄ > λ1 + 1.645(1/√n)}

for any values λ1 and λ2. Thus, the difficulty becomes finding the optimal value
c* in the supremum. For notational convenience, denote μ the LDL-cholesterol
level. We need to find

sup_{μ ≤ 130} P(Z > (c* − μ)/(1/√n) | healthy).

Notice that

P(Z > (c* − μ)/(1/√n) | control) = P(Z > (c* − 130 + 130 − μ)/(1/√n))
≤ P(Z > (c* − 130)/(1/√n))   (since μ ≤ 130)
= 0.05.

So, setting (c* − 130)/(1/√n) = 1.645 gets c* = 130 + 1.645(1/√n). This leads to the most
efficient predictor, D,

{D > c} = {x : X̄ > 130 + 1.645/√n}.
We have discussed two approaches, the likelihood ratio measurement and
Theorem 3.1, for the construction of the uniformly most efficient diagnostic
predictor. Notice that when the likelihoods of the case and control populations
are each confined to one value, the most efficient diagnostic predictor reads

L(x, θ0) / max(L(x, θ0), L(x, θ1)) < λ.

On the other hand, the most efficient diagnostic predictor according to The-
orem 3.1 is

L(x, θ1) > c* L(x, θ0).

Under this setting, we can deduce that the two most efficient diagnostic
predictors are identical after a few steps of simple algebra.
As shown in the previous examples, one discernible feature in the optimiza-
tion process is the reduction of the data information from the n observations
of the original data to one diagnostic predictor. In other words, the opti-
mizing process with the evaluation standard becomes a process that reduces
the dimension of the data toward a diagnostic predictor. Given this new per-
spective, when a sufficient statistic with lower dimension is available in the
likelihood ratio expression, Theorem 3.1 can be simplified as follows to reduce
the diagnostic predictor into a lower dimensional function.
Theorem 3.2 Assume that the underlying model (pdf or pmf ) of a set of data
X is f (x|θ) ∈ {f (x|case), f (x|healthy)}. Denote T (X) a sufficient statistic for
θ, and gi (t), i = 0, 1 the pmf (or pdf ) of T corresponding to healthy and case
populations, respectively. Then, the most efficient diagnostic measure becomes
D(t) = g1(t)/g0(t),

with specificity

P(D < c|healthy) = 1 − α.
Implementing the above theorem relies on the availability of a sufficient
statistic that may reduce the dimension of the data while maintaining suf-
ficient likelihood information. Since identifying a sufficient statistic is a key
in the implementation of the above theorem, it is relevant to mention the
factorization theorem, which involves the dimensional reduction process while
preserving data sufficiency.
Theorem 3.3 Factorization theorem: Let f (x|θ) denote the underpinning
model (pmf or pdf ) of a sample x. A statistic T (x) is sufficient for the un-
known parameter θ if and only if it satisfies the following condition. There
exist functions g(t|θ) and h(x) such that, for all sample points and all permis-
sible values of the parameter θ, the joint density can be decomposed into the
product of information about the unknown parameter and information on the
rest of the sample.
f (x|θ) = g(T (x)|θ)h(x).
The proof and discussions on the Factorization Theorem can be found in
Lehmann and Romano [81] or Casella and Berger [16]. Since this book focuses
more on statistical prediction and machine learning, we elect not to pursue
the theory of sufficient statistics further.
As pointed out in Example 3.4, we are often confronted with situations
where the diseased and healthy populations refer to a range (instead
of a value) of the observations. Under this scenario, the optimizing process
discussed above cannot be directly applied. We shall now discuss the concept
of the uniformly most efficient diagnostic predictor for a range of the patient
healthy readings. With this objective in mind, we need the following concept.
Definition 3.4 Monotone Likelihood Ratio: Assume that we have a set of
data X that follows a family of underlying models (pmfs or pdfs) character-
ized by an unknown parameter θ ∈ Θ. Let T be a sufficient statistic of θ with
the likelihood function g(t|θ) ∈ {g(t|θ) : θ ∈ Θ}. The monotone likelihood ratio
property refers to the following property of the model of T : For any two points
in the parameter space, θ2 > θ1 , the likelihood ratio
λ(t) = g(t|θ2 )/g(t|θ1 )
is a monotone function of t in the domain {t : g(t|θ1 ) > 0 or g(t|θ2 ) >
0, θi ∈ Θ}.
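The MLR property can be inspected numerically for a concrete family. A minimal sketch (our illustration, using the normal family N(θ, 1) as the assumed model) checks that the likelihood ratio g(t|θ2)/g(t|θ1) is increasing in t whenever θ2 > θ1:

```python
import math

# Our illustration: check the MLR property for the normal family N(theta, 1).
# For theta2 > theta1, g(t|theta2)/g(t|theta1) should be increasing in t.
def normal_pdf(t, theta):
    return math.exp(-(t - theta) ** 2 / 2) / math.sqrt(2 * math.pi)

theta1, theta2 = 0.0, 1.5            # any pair with theta2 > theta1 (assumed)
ts = [i / 10 for i in range(-50, 51)]
ratios = [normal_pdf(t, theta2) / normal_pdf(t, theta1) for t in ts]

is_monotone = all(a < b for a, b in zip(ratios, ratios[1:]))
```

For the normal family the ratio is exp((θ2 − θ1)t − (θ2² − θ1²)/2), which is strictly increasing in t, and the numerical check agrees.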
70 Sensitivity and Specificity Trade-off
Theorem 3.5 Assume that the underlying model of a set of data X is gov-
erned by a function characterized by patient feature θ ∈ R. Consider a classi-
fication problem formulated as healthy θ ≤ θ0 and case θ > θ0 . Suppose that
T is a sufficient statistic for θ, and the family of pmfs or pdfs of T has the
MLR (Monotone Likelihood Ratio) property. If a large value of D is associated
with the disease, the most efficient diagnostic predictor is
{D > c} = {T > t0 },
where D is the diagnostic measure and c is the diagnostic threshold. The value
t0 is determined according to the following condition,
Specificity = P(D ≤ c | healthy) = P(T ≤ t0 | healthy) = 1 − α.
Proof: Let β(θ) = Pθ(T > t0) be the sensitivity of the diagnostic predictor. Fix
any value of the parameter θ′ > θ0, and consider a simple prediction problem
on θ = θ0 versus θ = θ′. Since the underlying model (the family of pmfs or
pdfs) of T is assumed to have the MLR property, by Theorem 3.4, β(θ) is a
non-decreasing function of θ, so we have
i) supθ≤θ0 β(θ) = β(θ0 ) = α, hence the specificity of the diagnostic predictor
is 1 − α.
ii) If we define
$$k^* = \inf_{t \in \mathcal{T}} \frac{g(t|\theta')}{g(t|\theta_0)},$$
where $\mathcal{T} = \{t : t > t_0 \text{ and either } g(t|\theta') > 0 \text{ or } g(t|\theta_0) > 0\}$, it follows that
$$T > t_0 \iff \frac{g(t|\theta')}{g(t|\theta_0)} > k^*.$$
By Theorem 3.1, Parts (i) and (ii) imply that β(θ′) ≥ β∗(θ′), where β∗(θ)
is the sensitivity of any other diagnostic predictor with specificity at the 1 − α level.
Since θ′ is arbitrary, the diagnostic predictor is the most efficient diagnostic
predictor with specificity at the level 1 − α.
By an analogous argument, the following theorem can be derived.
Theorem 3.6 Assume that the underlying model of a set of data X is gov-
erned by a function characterized by patient feature θ ∈ R. Consider a classi-
fication problem formulated as healthy θ ≤ θ0 and case θ > θ0 . Suppose that
T is a sufficient statistic for θ, and the family of pmfs or pdfs of T has the
MLR (Monotone Likelihood Ratio) property. If a small value of D is associated
with the disease, the most efficient diagnostic predictor is
{D < c} = {T < t0 },
where D is the diagnostic measure, c is the diagnostic threshold, and t0 is the
value that satisfies
Specificity = P(D ≥ c | healthy) = P(T ≥ t0 | healthy) = 1 − α.
Solution: Since the normal model has the monotone likelihood ratio property,
and high readings of LDL-Cholesterol level are associated with coronary heart
disease, by the adapted Karlin-Rubin theorem (Theorem 3.4), the most efficient
diagnostic predictor satisfies the condition
$$\{D > c\} = \Big\{\frac{\bar{X} - 130}{1/\sqrt{n}} > 1.645\Big\}.$$
This is equivalent to
$$\bar{X} > 130 + \frac{1.645}{\sqrt{n}};$$
the most efficient diagnostic measure is the sample mean LDL-Cholesterol
level, and the diagnostic outcome is positive when the sample mean level
reaches the corresponding threshold.
10:1 to 20:1, a blood test report with either a too-high (exceeding 20) or a too-low
(below 10) BUN to creatinine ratio is an indication of kidney failure.
Following the concept of optimization that we discussed in Section 3.1,
for convenience, we continue with the idea of UMEP (uniformly most efficient
predictor), and treat that as an example to illustrate the principle of restricted
optimization when the underlying model is assumed.
As discussed in Theorem 3.4, the adapted Karlin-Rubin theorem ensures
the existence of an optimal solution for one-ended diagnostic measures
when the underlying model has the MLR property. However, the story is different
for the two-ended extreme problem. As shown in the following example,
although the adapted Karlin-Rubin theorem is convenient in the derivation of
optimal diagnostic predictor for one-ended extremes on the diagnostic mea-
sures, when we consider two-ended diagnostic predictors, the global optimal
solution does not exist.
Example 3.6 Let X1, ..., Xn be a set of blood test readings on the BUN to
creatinine ratios. Assume that Xi ∼ N(θ, σ²) with known variance σ² = 1.
Consider testing the prediction of healthy θ = θ0 versus kidney disease θ ≠ θ0
for a given constant θ0. Given a pre-fixed level of specificity 1 − α, we are
interested in identifying the most efficient diagnostic predictor that satisfies
the specificity condition (3.4).
The optimal solution for this problem does not exist. To see this point,
consider another parameter value θ1 < θ0 (for instance, two different values of
BUN to creatinine ratios); by the adapted Karlin-Rubin theorem, the uniformly
most efficient predictor reads:
Claiming kidney disease when
$$\bar{X} < \theta_0 - \frac{\sigma z_\alpha}{\sqrt{n}}.$$
This predictor has the highest sensitivity at the BUN to creatinine ratio θ1
among all predictors satisfying equation (3.4). We may call this Predictor-1.
By the uniqueness of the UMEP, if a UMEP exists for this problem, it
must almost surely be Predictor-1.
Now consider a different predictor, Predictor-2, which claims disease when
$$\bar{X} > \theta_0 + \frac{\sigma z_\alpha}{\sqrt{n}},$$
or
Sensitivity(φ, θ) > α, and Specificity ≥ 1 − α.
As usual, φ(X) is the prediction function. When the diagnosed outcome is
positive, φ(X) = 1, otherwise φ(X) = 0.
From Definition 3.5, any UMEP with specificity 1 − α is a UMEDP
(uniformly most efficient and decent predictor). On the other hand, with the ad-
ditional restriction of being a decent predictor, for prediction problems where
Two-ended diagnostic measures 75
FIGURE 3.3
Sensitivity functions β1(θ), β2(θ), and β3(θ) (vertical axis: power, from 0.0 to 1.0).
UMEP does not exist, UMEDP may exist. We illustrate this point with the
following example.
As shown in Definition 3.5, the concept of a decent predictor is a natural
requirement for a diagnostic predictor. Namely, the probability of correctly
detecting a diseased subject should be at least as large as the probability of
incorrectly diagnosing a healthy subject. With such a restriction, we are able to
search for a UMEP with specificity 1 − α among the class of decent predictors.
Such a restricted optimization procedure results in the construction of the
following UMEDP.
Consider the sensitivity function β(θ) of a diagnostic predictor Predictor-3
which diagnoses a subject as having the disease when
$$\frac{|\bar{X} - \theta_0|}{\sigma/\sqrt{n}} > z_{\alpha/2}.$$
Figure 3.3 shows the three sensitivity curves corresponding to the three pre-
dictors discussed above. The dotted curve is for β3(θ), the black curve is for
β2(θ), and the gray curve is for β1(θ). As shown in the diagram, the dotted
76 Sensitivity and Specificity Trade-off
curve, although not as powerful as the gray or the black curve at some points,
is able to achieve its own local optimal sensitivity when the diagnostic measure
gets either larger or smaller. However, for the other two curves, the sensitivity
drops below the false positive rate α when the diagnostic
measure gets larger for β1(θ) (or smaller for β2(θ)). Thus, they are not decent
predictors.
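The three sensitivity curves can be computed directly from the normal model. The sketch below is ours (it assumes θ0 = 0, σ = 1, n = 25, and α = 0.05 for illustration); it evaluates β1, β2, and β3 on a grid and confirms that β1(θ) falls far below α away from θ0, while β3(θ) never does:

```python
import math

# Our illustration of the curves behind Figure 3.3, with assumed values
# theta0 = 0, sigma = 1, n = 25, alpha = 0.05 (z_alpha = 1.645, z_{alpha/2} = 1.96).
def Phi(z):                          # standard normal cdf
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

theta0, sigma, n, alpha = 0.0, 1.0, 25, 0.05
z_a, z_a2 = 1.645, 1.96
se = sigma / math.sqrt(n)

def beta1(theta):   # Predictor-1: claim disease when Xbar < theta0 - z_a*se
    return Phi((theta0 - z_a * se - theta) / se)

def beta2(theta):   # Predictor-2: claim disease when Xbar > theta0 + z_a*se
    return 1 - Phi((theta0 + z_a * se - theta) / se)

def beta3(theta):   # Predictor-3: claim disease when |Xbar - theta0|/se > z_{a/2}
    return Phi((theta0 - z_a2 * se - theta) / se) \
        + 1 - Phi((theta0 + z_a2 * se - theta) / se)

thetas = [theta0 + k * se for k in range(-8, 9)]
b1_min = min(beta1(t) for t in thetas)   # drops far below alpha: not decent
b3_min = min(beta3(t) for t in thetas)   # stays at or above ~alpha: decent
```

On this grid, the minimum of β1 is essentially zero (far below α), while the minimum of β3 is attained at θ0 and equals roughly α, in line with the decency requirement.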
We will prove that the dotted sensitivity curve is indeed the restricted
optimal curve in Example 3.7 after a discussion on the following theorem,
which sets the connection between the global optimal solution (a UMEP level
1 − α predictor) and a restricted optimal solution. The latter is the UMEDP
level 1−α predictor for this type of diagnostic measures. Theorem 3.7 shows a
way to identifying a UMEDP by considering UMEP in the boundary between
the diseased and the healthy populations.
βφ (θ) = α, (3.5)
where θ ∈ ω, and ω is the set of the diagnostic boundary between diseased and
healthy subjects.
For example, when the range of healthy BUN to creatinine ratios is from 10:1
to 20:1, the diagnostic boundary is formed by the two values {10, 20}.
Proof The class of predictors satisfying (3.5) contains the set of decent
predictors, hence the sensitivity of the UMEP φ0 is at least as high as the
sensitivity of any decent predictor with the same specificity. On the other hand,
the UMEP predictor φ0 is itself decent. This is because it is uniformly at least
as sensitive as the predictor φ(x) ≡ α, a special predictor that
claims a disease case by flipping a biased coin: when the outcome is a head
after flipping the coin with
$$P(\text{Head}) = \alpha,$$
φ(x) claims disease.
Theorem 3.8 Restricted optimization for UMEDP Assume that the under-
lying model for a set of data X can be expressed explicitly as an exponential
family, with the joint density characterized by a parameter θ:
for the prediction problem with two-ended diagnostic measures. If there exist
two constants λ1 and λ2 such that
$$D = \{T > \lambda_1\} \cup \{T < \lambda_2\}.$$
Proof Consider the area A = {f (x) > λg(x)}, where f (x) is the likeli-
hood function for the diseased population, and g(x) is the one for the healthy
population. For any predictor with specificity 1 − α, we have
implies
$$\int_B f(x)\,dx - \int_A f(x)\,dx \le \lambda\Big(\int_B g(x)\,dx - \int_A g(x)\,dx\Big) \le 0,$$
which means that the predictor corresponding to the indicator function of the
region A is a uniformly most sensitive predictor.
Now
$$\frac{f(x)}{g(x)} = \frac{c(\theta)e^{\theta T}h(x)}{c(\theta_0)e^{\theta_0 T}h(x)} = d(\theta, \theta_0)\,e^{(\theta - \theta_0)T}.$$
Notice that
we have
$$P\big(\{T(x) > \lambda^{***}\} \cup \{T(x) < \lambda^{**}\} \mid \theta_0\big) = \alpha.$$
In conjunction with Theorem 3.7, the condition on (3.5) is satisfied, thus, the
diagnostic predictor $\{T(x) > \lambda^{***}\} \cup \{T(x) < \lambda^{**}\}$ is a UMEDP.
78 Sensitivity and Specificity Trade-off
Theorem 3.9 Assume that the underlying model of a set of data X can be
expressed explicitly as an exponential family with the joint likelihood charac-
terized by a parameter θ as in (3.6),
$$\phi(X) \equiv \alpha,$$
this prediction function also defines a decent predictor: claiming disease with
probability α regardless of the sample. Now, we have:
$$E_{\theta_0}(T(X)) = -\frac{c'(\theta_0)}{c(\theta_0)}.$$
Also, since the first equality is implied by the condition of decent predictors, we
have
$$E_{\theta_0}\big(\phi(X)T(X)\big) - \alpha E_{\theta_0}\big(T(X)\big) = 0.$$
Proof: Denote by g(t|θ) the underlying model of the data; by (3.8), we have
By Theorem 3.9, the UMEDP takes the form of claiming disease when the
diagnostic measure T is either too large or too small; denote
$$\phi = \begin{cases} 1, & T > \lambda_1 \text{ or } T < \lambda_2 \\ 0, & \text{otherwise.} \end{cases}$$
We have
$$\int_{-\infty}^{\lambda_2} (t - r)\,g(t|\theta_0)\,dt + \int_{\lambda_1}^{+\infty} (t - r)\,g(t|\theta_0)\,dt = 0.$$
Substituting y = 2r − t, we have y − r = 2r − t − r = r − t and dy = −dt.
Therefore,
$$\int_{-\infty}^{\lambda_2} (t - r)\,g(t|\theta_0)\,dt = \int_{-\infty}^{2r - \lambda_1} (y - r)\,g(y|\theta_0)\,dy,$$
$$P(Z > \lambda_1) + P(Z < \lambda_2) = \alpha = P(Z > \lambda_1) + P(Z < -\lambda_1) = 2P(Z > \lambda_1),$$
so
$$\lambda_1 = z_{\alpha/2}.$$
This example offers a theoretical justification for the existence of the UMEDP
(the dotted curve) in Figure 3.3. Notice that Example 3.7 is
grounded on the assumption that the stability of the diagnostic measure (σ)
is a given value. In practice, the measurement stability score is an unknown
value. As a follow-up discussion, we shall describe a method pertaining to
the prediction of measurement stability scores. In particular, we discuss an
example predicting the risk level of an investment portfolio.
$$\phi = \begin{cases} 1, & T > \lambda_1 \text{ or } T < \lambda_2 \\ 0, & \text{otherwise.} \end{cases}$$
and
Pσ0 (λ2 < T < λ1 ) = 1 − α.
Now, let $Y = \frac{T}{\sigma_0^2}$; the distribution of Y follows the $\chi^2_n$ under the assumption
that the investment risk is σ0.
$$\phi = \begin{cases} 1, & Y > d_2 \text{ or } Y < d_1 \\ 0, & \text{otherwise.} \end{cases}$$
By Theorem 3.9,
$$E_{\sigma_0}(\phi Y) = \int_{\{y < d_1\} \cup \{y > d_2\}} y\, f_Y(y)\,dy = \alpha E_{\sigma_0}(Y) = n\alpha,$$
and
$$E_{\sigma_0}(\phi) = \alpha.$$
Thus, the two conditions for the UMEDP are:
$$\int_{d_1}^{d_2} \frac{1}{2^{n/2}\,\Gamma(\frac{n}{2})}\, y^{\frac{n}{2}}\, e^{-\frac{y}{2}}\,dy = n(1-\alpha), \tag{3.10}$$
$$\int_{d_1}^{d_2} \frac{1}{2^{n/2}\,\Gamma(\frac{n}{2})}\, y^{\frac{n}{2}-1}\, e^{-\frac{y}{2}}\,dy = 1-\alpha. \tag{3.11}$$
By (3.10),
$$\int_{d_1}^{d_2} \frac{-2}{2^{n/2}\,\Gamma(\frac{n}{2})}\, y^{\frac{n}{2}}\, de^{-\frac{y}{2}} = n(1-\alpha),$$
which is equivalent to
$$\frac{-2}{2^{n/2}\,\Gamma(\frac{n}{2})}\, y^{\frac{n}{2}}\, e^{-\frac{y}{2}}\Big|_{d_1}^{d_2} - \int_{d_1}^{d_2} \frac{-2}{2^{n/2}\,\Gamma(\frac{n}{2})}\Big(\frac{n}{2}\Big)\, y^{\frac{n}{2}-1}\, e^{-\frac{y}{2}}\,dy = n(1-\alpha).$$
By (3.11),
$$\frac{-2}{2^{n/2}\,\Gamma(\frac{n}{2})}\, y^{\frac{n}{2}}\, e^{-\frac{y}{2}}\Big|_{d_1}^{d_2} + n(1-\alpha) = n(1-\alpha).$$
Thus, the UMEDP claims an abnormal risk level on the region
$$\Big\{X : \frac{\sum_i X_i^2}{\sigma_0^2} < d_1\Big\} \cup \Big\{X : \frac{\sum_i X_i^2}{\sigma_0^2} > d_2\Big\},$$
and the cutoffs satisfy
$$d_1^{n/2}\, e^{-d_1/2} = d_2^{n/2}\, e^{-d_2/2}.$$
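The cutoffs (d1, d2) have no closed form, but conditions (3.10) and (3.11), which after integration by parts imply d1^{n/2} e^{−d1/2} = d2^{n/2} e^{−d2/2}, can be solved numerically. The sketch below is ours (assumed values: n = 10 degrees of freedom and α = 0.05); it nests two bisections, one for the boundary equality and one for the coverage, using only the standard library:

```python
import math

# Our numerical sketch (assumed: n = 10 degrees of freedom, alpha = 0.05).
# Solve the coverage condition (3.11) together with the boundary equality
# d1^{n/2} e^{-d1/2} = d2^{n/2} e^{-d2/2} implied by (3.10).
n, alpha = 10, 0.05

def chi2_pdf(y, k):
    return y ** (k / 2 - 1) * math.exp(-y / 2) / (2 ** (k / 2) * math.gamma(k / 2))

def integral(f, a, b, m=2000):       # composite Simpson rule (m even)
    step = (b - a) / m
    s = f(a) + f(b)
    for i in range(1, m):
        s += (4 if i % 2 else 2) * f(a + i * step)
    return s * step / 3

def h(d):                            # boundary function d^{n/2} e^{-d/2}
    return d ** (n / 2) * math.exp(-d / 2)

def matching_d2(d1):                 # d2 > n with h(d2) = h(d1); h decreases there
    lo, hi = float(n), 200.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if h(mid) > h(d1):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def coverage(d1):                    # P(d1 < Y < d2) for Y ~ chi-square(n)
    return integral(lambda y: chi2_pdf(y, n), d1, matching_d2(d1))

lo, hi = 1e-6, n - 1e-6              # coverage is decreasing in d1 on (0, n)
for _ in range(80):
    mid = (lo + hi) / 2
    if coverage(mid) > 1 - alpha:
        lo = mid
    else:
        hi = mid
d1 = (lo + hi) / 2
d2 = matching_d2(d1)
```

Both conditions can then be checked: the interval (d1, d2) carries probability 1 − α under χ²_n, and the first-moment condition (3.10) holds automatically because the boundary term vanishes.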
This section focuses on the method of restricted optimization (the most ef-
ficient and decent diagnostic predictor) when the global optimal solution does
not exist. It shows that by adding an additional condition (a decent predictor
in the way that the probability of correct diagnosis exceeds the probability of
false positive), we can restrict the optimization domain on predictors satisfy-
ing certain conditions (controlling the specificity at level 1 − α), and find a
restricted optimal solution (uniformly most efficient decent diagnostic predic-
tor, UMEDP). The next section will follow up with optimization in the case
where nuisance parameters exist.
mean, while the population variation cannot be plausibly assumed. The pop-
ulation variation serves as a nuisance parameter for the prediction problem
in this case. The example shows how a confounding nuisance factor alters the
existence of the global optimal solution (UMEP).
$$\Big\{X : \frac{\bar{X} - 130}{\sigma_0/\sqrt{n}} > z_\alpha\Big\}.$$
However, the actual standard deviation may not be σ0 .
If the true but unknown standard deviation doubles the assumed value, say,
σ = 2σ0, the false positive rate of the UMEP with specificity 1 − α becomes
$$P_{\mu_0}\Big(\Big\{X : \frac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}} > z_\alpha\Big\}\Big) = P_{\mu_0}\Big(\frac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}} > z_\alpha\Big) = P_{\mu_0}\Big(\frac{\bar{X} - \mu_0}{2\sigma_0/\sqrt{n}} > \frac{z_\alpha}{2}\Big) = P\Big(Z > \frac{z_\alpha}{2}\Big) > \alpha.$$
In this case, since the false positive rate is larger than α, the
corresponding specificity becomes less than 1 − α; hence the UMEP with an incor-
rectly assumed standard deviation is not a diagnostic predictor at the nominal
specificity level 1 − α for the prediction problem.
This example shows that the UMEP does not exist in the process of pre-
dicting μ, when σ is unknown. In this case, the unknown standard deviation
σ serves as a nuisance parameter. We shall now introduce a theorem that can
be viewed as an example of restricted optimization under the presence of nui-
sance parameters. The method is similar to the discussion on the example of
two-ended diagnostic measures in the preceding section.
Theorem 3.11 Assume that the underlying model of the data set X follows
an exponential family
$$f(X) = h(x)\exp\Big\{\theta U(x) + \sum_{i=1}^{k} v_i T_i(x) + c(\theta, v)\Big\},$$
Proof: For any fixed T = (T1(x), ..., Tk(x)), Pθ(x) has the monotone like-
lihood ratio property in terms of U. By the adapted Karlin-Rubin theorem,
the UMEP reads
$$\phi^*(U|\mathbf{T}) = \begin{cases} 1, & U > \xi_\alpha(\mathbf{T}) \\ \gamma_\alpha(\mathbf{T}), & U = \xi_\alpha(\mathbf{T}) \\ 0, & U < \xi_\alpha(\mathbf{T}) \end{cases}$$
where ξα(T) and γα(T) are determined by
is a UMEDP.
Notice that
$$\phi^{**}(U(\mathbf{T})|\mathbf{T}) = \begin{cases} 1, & W > \xi_\alpha(\mathbf{T}) \\ \gamma_\alpha(\mathbf{T}), & W = \xi_\alpha(\mathbf{T}) \\ 0, & W < \xi_\alpha(\mathbf{T}) \end{cases}$$
satisfies
Eθ0 (φ∗∗ |T) = Pθ0 (W > ξα (T)|T) + γα (T)Pθ0 (W = ξα (T)|T).
Now, since variables W and T are independent at the boundary θ0 , there exist
constants γα and Cα that are not functions of T, so that
Eθ0 (φ∗∗∗ ) = α
for any vector of nuisance components v. This shows that the diagnostic pre-
dictor φ∗∗∗ is an optimal solution restricted to the set of decent predictors.
We can now use Theorem 3.11 to find the uniformly most efficient decent
diagnostic predictor for the LDL-Cholesterol prediction in Example 3.5, when
the standard deviation is allowed to change within its permissible domain.
the same object, different measurement units (or scales) often make the data
look different. One frequently asked question is the consistency of the predic-
tion results when different scales or units are used in measuring the subjects
in an experiment. It is necessary to have a prediction method that remains
consistent for various measurements on the same object. For example, con-
sider the comparison of body heights between college students and elementary
school students. If a prediction method claims significant mean difference with
heights measured in cm, we would expect a similar claim when the same ob-
jects are measured by m, because measuring with the scale of cm or m should
not alter the fact that on average, college students are taller than elementary
school students.
2 Group multiplication obeys the associative law. Namely for any three ele-
ments in G, (ab)c = a(bc).
3 There is an element e in G, called the identity, such that ae = ea = a for all a ∈ G.
4 For each element a in G, there exists an a⁻¹ ∈ G (its inverse in G), such
that aa⁻¹ = a⁻¹a = e.
In the definition above, both the inverse a−1 of any element a ∈ G and the
identity element e ∈ G can be shown to be unique.
When we consider transformations on a data set, it is helpful to utilize the
concept of a transformation group.
Example 3.12 The following are two groups of data transformations for two
common distribution families.
1 G = {X, n − X} for data following the binomial family Bin(n, p).
which is the same as the measurements that use cm and boost every subject
by 10 cm:
$$\frac{172 - 164}{\sqrt{\dfrac{24 \times (50)^2 + 24 \times (40)^2}{48}}}.$$
$$t(X) = \frac{\bar{X} - E(X_1)}{s_X/\sqrt{n}}.$$
Now the location and scale transformed data become
$$\mathbf{Y} = a\mathbf{1} + b\mathbf{X},$$
with the expected value $E(\mathbf{Y}) = a\mathbf{1} + bE(\mathbf{X})$. The variance of the transformed
data reads
$$s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \frac{1}{n-1}\sum_{i=1}^{n}(a + bX_i - a - b\bar{X})^2 = \frac{b^2}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 = b^2 s_X^2.$$
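Because s_Y = |b| s_X and the hypothesized mean transforms to a + bμ, the one-sample t statistic is unchanged by a change of units. A quick sketch (ours; hypothetical height data in cm with an assumed affine recalibration a = 5.0, b = 0.01) verifies the invariance:

```python
import math
import random

# Our illustration: the one-sample t statistic is invariant under the
# location-scale change Y_i = a + b*X_i (b > 0), because s_Y = b*s_X and
# the hypothesized mean transforms to a + b*mu.
# Assumed: 30 hypothetical heights (cm); recalibration a = 5.0, b = 0.01.
random.seed(1)
x = [random.gauss(170, 6) for _ in range(30)]
mu_x = 170.0
a, b = 5.0, 0.01

def t_stat(data, mu):
    m = len(data)
    mean = sum(data) / m
    s2 = sum((v - mean) ** 2 for v in data) / (m - 1)
    return (mean - mu) / math.sqrt(s2 / m)

y = [a + b * v for v in x]           # transformed measurements
t_x = t_stat(x, mu_x)
t_y = t_stat(y, a + b * mu_x)        # identical up to rounding
```

The two t values agree to floating-point precision, which is exactly the consistency-across-units requirement discussed above.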
Definition 3.11 Invariant prediction problem: Consider a set of data with an as-
sumed underlying model governed by a parameter θ. The prediction problem
for the healthy population θ ∈ Ω0 versus the diseased population θ ∈ Ω1 is in-
variant under transformation g if the correspondingly transformed parameter
stays within the original space for healthy subjects and the original space for the
diseased patients, respectively.
Briefly speaking, an invariant prediction problem requires the invariance of the
healthy range and the diseased range of the diagnostic measure after data trans-
formation. Mathematically, as clearly described in [81] under the setting of
hypothesis testing: Let ḡ be the corresponding transformation of parameters.
The testing problem is invariant if ḡ preserves both Ω0 and Ω1 , ḡ(Ω0 ) = Ω0 ,
ḡ(Ω1 ) = Ω1 . In the following discussions involving invariant diagnostic predic-
tors in this book, we confine the discussion to invariant prediction problems.
With the setting above, the prediction question now becomes to find the
restricted optimal solution for invariant predictors (which is the same as the
uniformly most efficient and invariant predictor).
We start with the location transformation for the discussion. Let X =
(X1 , ...Xn )T be an observation from a population with model f (x1 , ...xn ).
Assume that we are interested in predicting the Healthy population defined
as
fθ (x1 , ...xn ) = f0 (x1 − θ, ..., xn − θ)
versus the Diseased population defined as
$$\bar{g}(\theta) = \theta + c, \quad \theta \in \Theta_0$$
$$\Rightarrow f_\theta(X) = f_0(x_1 + \theta, ..., x_n + \theta), \quad \theta \in \mathbb{R}.$$
$$\bar{g}(\theta) \in \bar{g}(\Theta_0)$$
$$\Rightarrow f_\theta(X) = f_0(x_1 + \theta + c, ..., x_n + \theta + c) = f_0(x_1 + c^*, ..., x_n + c^*), \quad c^* \in \mathbb{R}.$$
We have
$$f_S(t_1, ..., t_{n-1}) = \int_{\mathbb{R}} f_T(t_1, ..., t_n)\,dt_n = \int_{\mathbb{R}} f_X(t_1 + t_n, ..., t_{n-1} + t_n, t_n)\,dt_n.$$
Let $t_n - x_n = u \Rightarrow t_n = x_n + u$; we have
$$f_S(t_1, ..., t_{n-1}) = \int_{\mathbb{R}} f_X(x_1 + u, ..., x_{n-1} + u, x_n + u)\,du.$$
$$\phi(X) = \begin{cases} 1 & \text{if } \lambda > c \\ 0 & \text{otherwise} \end{cases}$$
$$\lambda(X) > c \iff \Big(\frac{\sigma_0}{\sigma_1}\Big)^{n}\, \frac{\int_{\mathbb{R}} \exp\Big(-\frac{\sum_{i=1}^{n}(x_i + u - \theta)^2}{2\sigma_1^2}\Big)\,du}{\int_{\mathbb{R}} \exp\Big(-\frac{\sum_{i=1}^{n}(x_i + u - \theta)^2}{2\sigma_0^2}\Big)\,du} > c$$
$$\iff \frac{\int_{\mathbb{R}} \exp\Big(-\frac{\sum x_i^2 + 2(u-\theta)\sum x_i + n(u-\theta)^2}{2\sigma_1^2}\Big)\,du}{\int_{\mathbb{R}} \exp\Big(-\frac{\sum x_i^2 + 2(u-\theta)\sum x_i + n(u-\theta)^2}{2\sigma_0^2}\Big)\,du} > c^*,$$
where $c^* = \big(\frac{\sigma_1}{\sigma_0}\big)^{n} c$. Thus,
$$\lambda(X) > c \iff \frac{\exp\Big(-\frac{\sum x_i^2 - n\bar{x}^2}{2\sigma_1^2}\Big)\int_{\mathbb{R}} \exp\Big(-\frac{n\big((u-\theta)^2 + 2\bar{x}(u-\theta) + \bar{x}^2\big)}{2\sigma_1^2}\Big)\,du}{\exp\Big(-\frac{\sum x_i^2 - n\bar{x}^2}{2\sigma_0^2}\Big)\int_{\mathbb{R}} \exp\Big(-\frac{n\big((u-\theta)^2 + 2\bar{x}(u-\theta) + \bar{x}^2\big)}{2\sigma_0^2}\Big)\,du} > c^*$$
$$\iff \exp\Big(-\frac{1}{2}\Big(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_0^2}\Big)\Big(\sum x_i^2 - n\bar{x}^2\Big)\Big)\, \frac{\int_{\mathbb{R}} \exp\Big(-\frac{n(u - \theta + \bar{x})^2}{2\sigma_1^2}\Big)\,du}{\int_{\mathbb{R}} \exp\Big(-\frac{n(u - \theta + \bar{x})^2}{2\sigma_0^2}\Big)\,du} > c^*$$
$$\iff -\frac{1}{2}\Big(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_0^2}\Big)\Big(\sum x_i^2 - n\bar{x}^2\Big) > c^{**}$$
$$\iff \sum_{i=1}^{n}(x_i - \bar{x})^2 > c^{***}.$$
Notice that
$$\sigma_0 < \sigma_1 \iff \frac{1}{\sigma_0^2} > \frac{1}{\sigma_1^2} \iff -\frac{1}{2}\Big(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_0^2}\Big) > 0.$$
Efficient and invariant diagnostic predictors 95
In the case where the UMEP does not exist, we expanded the concept of UMEP
into UMEDP (uniformly most efficient and decent diagnostic predictor). The
latter augments UMEP with a new concept, decent diagnostic predictors, a le-
gitimate requirement that the rate of correct classification should not be lower
than the rate of misclassification. Theory, practical procedures, and examples
are discussed following the definition of UMEDP.
Data transformation is very common in data science, especially in the pro-
cess of measurement unification for pooled datasets from multiple sources.
Consistent interpretation of insightful information related to sensitivity and
specificity necessitates a discussion of the invariance property of the UMEP
for transformed data. We conclude this chapter with a discussion of theory
and procedures regarding invariant UMEP predictors for linear functions in
data transformation.
4
Bias and Variation Trade-off
This chapter deals with theoretical and fundamental issues of bias versus vari-
ation in data science. There are different views on the content and definition
of data science. Some claim that data science is discernible from statistics
due to its unique handling of big data and its reliance on modern computer tech-
niques. References in this regard can be found in Bell et al. (2009) [5] and
Dhar (2013) [43], among others. On the other hand, the current literature also
includes claims that statistics itself is data science (see, for example, Breiman
(1998) [11] and Wu (1986) [125], to list just a few). Although various definitions
have their own rationales under different scenarios, in our view, the
essential process of data science is to make inference, to predict (or forecast)
the unknown using the known (observable data). Thus, without confining our-
selves into either direction, from the viewpoint of data analytic technologies,
algorithms, and methodological development, we go with the belief that data
science includes the model-based camp (mainly statistics) and the data-driven
camp (mainly computing techniques, machine learning, and deep learning al-
gorithms), as elucidated in Chapter One. In the process of predicting the
unknown, one frequently asked issue focuses on the dilemma regarding the
bias and variation of the predicted outcome: gaining lower bias at the cost of
high variation or trading prediction bias for lower prediction variation.
FIGURE 4.1
Statistics versus Data Science
Obviously, the above expression suggests that the EPE can be decomposed into
two portions. One accounts for the error between the predicted underlying
model and the true model, which is reducible when we have large enough
training data in conjunction with legitimate features for prediction in the
data. The other portion accounts for the error due to the randomness of the data,
var(ε), which comes from the intrinsic fluctuation of the data that we cannot
influence in prediction, and is irreducible.
To further examine the evaluation of the expected prediction error, con-
sider a scenario in which we only have finite distinguishable features and
responses (yj , Xj ) for j = 1, ..., p.
Now, when we have a set of testing data (yi , Xi ) with i = 1, ..., m, and
fˆ(X), a model learned from the training data (yi , Xi ) for i = m + 1, ..., n, the
sample expected prediction error (of the testing data) reads
$$\frac{1}{m}\sum_{k=1}^{m}\big[y_k - \hat{f}(X_k)\big]^2 = \sum_{j=1}^{p} \frac{freq\big((y_j, X_j)\big)}{m}\big[y_j - \hat{f}(X_j)\big]^2,$$
where
$$\lim_{m\to\infty} \frac{freq\big((y_j, X_j)\big)}{m} = P\big[(Y, X) = (y_j, X_j)\big].$$
Thus,
$$\frac{1}{m}\sum_{k=1}^{m}\big[y_k - \hat{f}(X_k)\big]^2 \to E\big[(Y - \hat{Y})^2\big],$$
the sample prediction error approaches the expected prediction error when
the sample size of the testing data is large enough. This shows that although
we need to reserve a good portion of data to train the model, we also need to
keep a good size of data for the testing data. If the size of the testing data is
too small, over-estimating or under-estimating the expected prediction error
may lead to a misleading conclusion on the performance of the trained model.
Assume that the testing set contains large enough observations. When the
solution to the global optimization is available, the reducible error in (4.1)
is minimized and the expected prediction error cannot be further reduced.
However, when the global optimization solution does not exist, the analytic
process stops. In what follows, we shall discuss details of the fundamental
concepts and terminologies used in statistical prediction and machine learning,
when the global optimization needs to be confined with restrictions.
We will start with examples of restricted optimization when the underlying
model is assumed (model-based inference), which is followed by a discussion on
the impact of nuisance parameters. After that, we will discuss restricted opti-
mizations for fundamental estimation issues in data transformation, including
the invariant property and location-scale transformations. For transformed
data, a representative topic on restricted optimization is the minimization of
the variance confined to unbiased estimators in model-based inference.
Certainly, it is impossible to exhaust all restricted optimization methods in
one chapter. To cover key issues in restricted optimization, we merely focus on
underpinning ideas and representative principles. Discussions in this chapter
may help clarify the premises of algorithms in data science (to avoid the
abuse of data analytic procedures), enhancing understanding of prediction
procedures, and facilitating interpretations of analytical outcomes.
Materials in the rest of this chapter also underpin rationales and theory
behind common statistical decisions. For example, the Student-t test can ac-
tually be viewed as an optimal test in terms of maximizing the power of the
test with restriction to unbiased tests, when nuisance parameters are involved.
Most of the results presented in this chapter are synthesized from theorems in
classical textbooks such as Lehmann and Romano [81], Lehmann and Casella
[80], Casella and Berger [16], Shao [110], and Hastie et al (2009) [56], among
others. The revisit in this chapter sheds new light on classical results for re-
stricted optimization in data science.
Minimum variance unbiased estimators 101
Proof We start with the necessity part of the theorem. Assume that δ is
the UMVUE of its expectation and denote
Eθ (δ) = g(θ).
For any U ∈ H and θ ∈ Ω, and for an arbitrary real value λ, denote
$$\delta' = \delta + \lambda U;$$
obviously, δ′ is another unbiased estimator of g(θ). Consider
$$Var_\theta(\delta + \lambda U) \ge Var_\theta(\delta)$$
for all λ. Expanding the left-hand side of the above inequality, we have
$$\lambda^2 Var_\theta(U) + 2\lambda\, Cov_\theta(\delta, U) \ge 0$$
for all λ. The left-hand side is a quadratic function of the real value λ with two roots, λ1 = 0
and
$$\lambda_2 = -2\,Cov_\theta(\delta, U)/Var_\theta(U).$$
It will therefore take negative values unless
$$Cov_\theta(\delta, U) = 0,$$
which implies (4.2).
As for the sufficiency part of the theorem, suppose (4.2) is valid for all
U ∈ H. To show that δ is the UMVUE of its expectation, let δ′ be another unbi-
ased estimator of Eθ(δ). If Varθ(δ′) = ∞, there is nothing to prove, since its
variance is larger than that of δ. So, we can assume Varθ(δ′) < ∞. In this
case,
$$\delta' - \delta \in H$$
because δ and δ′ are both unbiased estimators of E(δ). Furthermore,
$$E_\theta[\delta(\delta' - \delta)] = 0,$$
hence
$$E_\theta(\delta'\delta) = E_\theta(\delta^2).$$
Now, using the fact that δ and δ′ have the same expectation, we have
$$Var_\theta(\delta) = Cov_\theta(\delta, \delta') \le \sqrt{Var_\theta(\delta)\,Var_\theta(\delta')},$$
thus,
$$Var_\theta(\delta) \le Var_\theta(\delta').$$
The beauty of the above theorem is that it links the property of a statistic with
the restricted optimal solution in terms of minimizing the variance (or the
risk under the squared loss function). For instance, suppose the distribution family is
complete, which means that for any statistic U, Eθ(U) = 0 implies that U = 0
almost surely. The above theorem then points out an important fact in mathemat-
ical statistics: any unbiased and complete estimator is the UMVUE of its
own expectation. We shall illustrate this point with the following example.
Example 4.1 Let X be a random variable with E(X 2 ) < ∞. For a set of
normal data X1 , ..., Xn that have underlying model N (0, σ 2 ), find the UMVUE
of the population standard deviation σ.
Solution: First, notice that $S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$, the sample stan-
dard deviation, is not unbiased for $\sigma = \sqrt{Var(X)}$. To see this point, considering
the normality assumption, we have
$$Y = \frac{n-1}{\sigma^2}\, S^2 \sim \chi^2_{n-1},$$
and
$$E(S) = E\Big(\frac{\sigma}{\sqrt{n-1}}\sqrt{Y}\Big) = \frac{\sigma}{\sqrt{n-1}}\int \sqrt{y}\, f_Y(y)\,dy = \sqrt{\frac{2}{n-1}}\,\frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})}\,\sigma.$$
Now,
$$E(S) \le \sqrt{E(S^2)} = \sigma,$$
since
$$E(S^2) = E\Big(\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2\Big) = \sigma^2.$$
For the model of the data in this example, $S^2 = \sum_{i=1}^{n} X_i^2$ is a complete
statistic for σ, and $\frac{S^2}{\sigma^2} \sim \chi^2_n$; thus,
$$E\Big(\frac{S^r}{\sigma^r}\Big) = \frac{2^{r/2}\,\Gamma(\frac{n+r}{2})}{\Gamma(\frac{n}{2})}$$
for any positive integer r. Therefore, by Theorem 4.1, the UMVUE of σ reads
$$\hat{\sigma} = \frac{\Gamma(\frac{n}{2})}{\sqrt{2}\,\Gamma(\frac{n+1}{2})}\, S.$$
This example shows that the restricted optimal solution for the estimation of
the standard deviation that governs the model behind a set of data is not the
sample standard deviation. Although the sample variance is an unbiased esti-
mator of the population variance, the best estimator of the population standard
deviation is actually the sample standard deviation multiplied by a non-unit
constant.
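The correction constant can be verified by simulation. The sketch below is ours (assumed values n = 5 and σ = 2): it averages σ̂ = Γ(n/2)/(√2 Γ((n+1)/2)) · S over repeated N(0, σ²) samples, with S² = Σ Xi² as in the example, and recovers σ.

```python
import math
import random

# Our Monte Carlo check (assumed values n = 5, sigma = 2): with
# S = sqrt(sum X_i^2) for N(0, sigma^2) data, the rescaled statistic
# sigma_hat = Gamma(n/2) / (sqrt(2) * Gamma((n+1)/2)) * S averages to sigma.
random.seed(3)
n, sigma = 5, 2.0
const = math.gamma(n / 2) / (math.sqrt(2) * math.gamma((n + 1) / 2))

reps = 40000
total = 0.0
for _ in range(reps):
    s = math.sqrt(sum(random.gauss(0, sigma) ** 2 for _ in range(n)))
    total += const * s
mean_sigma_hat = total / reps        # close to sigma = 2.0
```

Dropping the constant (i.e., averaging S itself) produces a visibly biased estimate, which is the point of the example.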
Now, if
$$E[T(X)\delta(X)] = 2ap^3[T(2) - T(1)] + 2ap^2[T(1) - T(2)] + 2ap[T(-1) - T(1)] = 0,$$
and
$$E[T(X)] = 2pq\,T(-1) + q^3\,T(0) + pq^2\,T(1) + p^2 q\,T(2) + p^3\,T(3) = pq,$$
we have
$$T(3) = T(0) = 0, \quad \text{and} \quad b = \frac{1}{3}.$$
In this case, the UMVUE for pq exists.
The above theorem obtains the restricted optimization by means of an
unbiased estimator of zero, although the optimization process is hidden in the
proof of the theorem. In what follows, we shall discuss a well-known theo-
rem that directly approaches the minimal value of the target function (the
variance) to find the UMVUE.
Note: The above inequality (4.3) points out the lowest possible value for the vari-
ance of an unbiased estimator. If an unbiased estimator (the restriction) has
variance equal to the right-hand side of (4.3), it is the UMVUE of θ. The
validity of the theorem can be shown as follows. More thorough discussions
can be found in [16] and [80], among others.
Proof: First, applying the derivative to the unbiased restriction with the
use of the Leibniz condition, we have
$$\frac{d}{d\theta} E_\theta W(X) = \int_{\mathcal{X}} W(X)\,\frac{\partial}{\partial\theta} f(X|\theta)\,dX = E_\theta\Big[W(X)\,\frac{\partial}{\partial\theta}\log f(X|\theta)\Big],$$
where the last equality follows by multiplying $\frac{f(X|\theta)}{f(X|\theta)}$ in the integrand. Since
$E_\theta\big[\frac{\partial}{\partial\theta}\log f(X|\theta)\big] = 0$, the right-hand side equals $Cov_\theta\big(W(X), \frac{\partial}{\partial\theta}\log f(X|\theta)\big)$.
In conjunction with the Cauchy-Schwarz inequality and
$$Var_\theta\Big[\frac{\partial}{\partial\theta}\log f(X|\theta)\Big] = E_\theta\Big[\Big(\frac{\partial}{\partial\theta}\log f(X|\theta)\Big)^2\Big],$$
we have:
$$Var_\theta(W(X)) \ge \frac{\big(\frac{d}{d\theta}E_\theta(W(X))\big)^2}{E_\theta\big[\big(\frac{\partial}{\partial\theta}\log f(X|\theta)\big)^2\big]}.$$
The following example shows how to use the Cramer-Rao lower bound to
find the UMVUE (minimizing variance with restriction to unbiased estima-
tors).
Example 4.3 Assume that the underlying model of a set of data $X_1, \ldots, X_n$ is the exponential model with unknown parameter $\lambda$, $\exp(\lambda)$; find the UMVUE of $\lambda$.

Here
$$E(X_1) = \lambda, \quad Var(X_1) = \lambda^2,$$
and
$$E(\bar{X}) = \lambda,$$
where $\bar{X}$ is the sample mean. Also,
$$\log f(X_1) = -\log\lambda - \frac{X_1}{\lambda},$$
and
$$\frac{\partial}{\partial\lambda} \log f(X_1) = -\frac{1}{\lambda} + \frac{X_1}{\lambda^2}.$$
Thus,
$$E\!\left[\left(\frac{\partial}{\partial\lambda}\log f(X_1)\right)^2\right] = E\!\left[\left(\frac{X_1}{\lambda^2} - \frac{1}{\lambda}\right)^2\right] = \frac{1}{\lambda^2} - \frac{2E(X_1)}{\lambda^3} + \frac{E(X_1^2)}{\lambda^4} = \frac{1}{\lambda^2}.$$
In this case, the Cramer-Rao lower bound is $\lambda^2/n$, which is exactly $Var(\bar{X})$; since $\bar{X}$ is unbiased and attains the bound, the UMVUE of $\lambda$ is $\bar{X}$.
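The Fisher information computation in Example 4.3 can be checked numerically by integrating the squared score against the exponential density; the value of $\lambda$ below is an arbitrary choice.

```python
from math import exp

# Numerical check of the Fisher information in Example 4.3:
# for f(x|lam) = (1/lam) * exp(-x/lam),
# E[(d/d lam of log f(X))^2] should equal 1/lam^2.
lam = 2.5

def integrand(x: float) -> float:
    score = -1 / lam + x / lam ** 2   # derivative of log f(x|lam) in lam
    return score ** 2 * (1 / lam) * exp(-x / lam)

# midpoint Riemann sum over (0, 60); exp(-24) makes the tail negligible
h = 1e-3
numeric = sum(integrand(h * (k + 0.5)) * h for k in range(int(60.0 / h)))
assert abs(numeric - 1 / lam ** 2) < 1e-6
print("Fisher information matches 1/lam^2")
```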
Following the Cramer-Rao lower bound on the variance of unbiased estimators, another technique in the process of restricted optimization is to approach the optimal solution by conditioning on a sufficient statistic.
Proof Obviously,
$$E\!\left[\Big(\sum_{i=1}^n X_i\Big)^4\right] = \sum_i\sum_j\sum_k\sum_l E(X_iX_jX_kX_l).$$
Grouping the quadruples $(i, j, k, l)$ by the number of distinct indices they involve, and noting that for Bernoulli data $E(X_i^4) = p$, $E[(X_iX_j)^2] = p^2$ for $i \neq j$, and similarly for the three- and four-index terms,
$$E\!\left[\Big(\sum_{i=1}^n X_i\Big)^4\right] = np + 14\binom{n}{2}p^2 + 36\binom{n}{3}p^3 + 24\binom{n}{4}p^4.$$
By independence,
$$E(X_1X_2X_3X_4) = \prod_{i=1}^{4} E(X_i) = p^4,$$
and conditioning on the sufficient statistic $T$,
$$\phi(T) = E(X_1X_2X_3X_4 \mid T).$$
Evaluated at $T = t$,
$$\phi(t) = E(X_1X_2X_3X_4 \mid T = t) = \frac{P\big(X_1 = X_2 = X_3 = X_4 = 1,\ \sum_{i=5}^{n} X_i = t - 4\big)}{P(T = t)} = \frac{p^4\binom{n-4}{t-4}p^{t-4}(1-p)^{n-t}}{\binom{n}{t}p^t(1-p)^{n-t}} = \frac{\binom{n-4}{t-4}}{\binom{n}{t}}.$$
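The conditional-expectation calculation above can be confirmed by brute force: given $T = t$, every binary outcome with sum $t$ is equally likely, so the conditional mean of $X_1X_2X_3X_4$ is a simple counting ratio. The sketch below enumerates all outcomes for an arbitrary small $n$.

```python
from itertools import product
from math import comb

# Check that E[X1*X2*X3*X4 | T = t] = C(n-4, t-4) / C(n, t) for i.i.d.
# Bernoulli data, by enumerating all binary vectors of length n. Given
# T = t, all vectors with sum t are equally likely, so the conditional
# mean is the fraction of such vectors whose first four entries are 1.
n = 7
for t in range(4, n + 1):
    vectors = [x for x in product((0, 1), repeat=n) if sum(x) == t]
    cond_mean = sum(x[0] * x[1] * x[2] * x[3] for x in vectors) / len(vectors)
    assert abs(cond_mean - comb(n - 4, t - 4) / comb(n, t)) < 1e-12
print("conditional expectation matches the binomial-ratio formula")
```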
Poisson distribution with parameter $n\theta$. For the UMVUE, we want $\delta(T)$ to be unbiased, which means that
$$E[\delta(T)] = \sum_{t=0}^{\infty} \delta(t)\frac{(n\theta)^t}{t!}e^{-n\theta} = e^{-10\theta},$$
thus
$$\sum_{t=0}^{\infty} \delta(t)\frac{(n\theta)^t}{t!} = e^{(n-10)\theta} = \sum_{t=0}^{\infty} \frac{(n-10)^t\theta^t}{t!}.$$
Matching the coefficients of $\theta^t$ on both sides gives the UMVUE $\delta(t) = \left(\frac{n-10}{n}\right)^t$.
$$\theta' = \bar{g}(\theta).$$
3. For each $g$ and $\bar{g}$, the corresponding change of the value of the parameter for estimation: $g^*: h(\theta') = h(\bar{g}(\theta)) = g^*(h(\theta))$.
Further, assume that the loss function satisfies
$$L(\bar{g}\theta, g^*d) = L(\theta, d)$$
for any parameter $\theta$ and estimate $d$. Under this setting, the problem of estimating $h(\theta)$ with loss function $L$ is invariant under $g$, and an estimator $\delta$ is called equivariant if
$$\delta(gX) = g^*\delta(X).$$
Proof By definition, $R(\bar{g}\theta, \delta) = E_{\bar{g}\theta} L[\bar{g}\theta, \delta(X)]$. Since $gX$ under $P_\theta$ has the same distribution as $X$ under $P_{\bar{g}\theta}$, the right side is equal to
$$E_\theta L[\bar{g}\theta, \delta(gX)] = E_\theta L[\bar{g}\theta, g^*\delta(X)] = E_\theta L[\theta, \delta(X)] = R(\theta, \delta),$$
where the second equality uses the equivariance of $\delta$ and the third uses the invariance of the loss function.
We shall now apply the above theoretical concepts to two frequently used groups in data transformations. One is the location transformation and the other is the scale transformation. We start with the location transformation first.
Example 4.6 A family of densities $f(x|\theta)$ with parameter $\theta$ and loss function $L(\theta, \delta)$ is a location invariant model if $f(x'|\theta') = f(x|\theta)$, where
$$x' = x + a, \quad \theta' = \theta + a.$$
The above setting lays the foundation for the discussion of restricted optimization with the target of minimizing the risk of the estimator under the restriction to equivariant estimators, for location transformations of the original data. First, we shall identify the set of equivariant estimators for location transformations.
Theorem 4.6 Assume that the underlying model for a set of data X takes the
format in (4.5), and δ is equivariant for estimating ξ with loss function (4.4).
Then, the bias, risk, and variance of δ are all constant, being independent of
ξ.
Theorem 4.8 Assume that the underlying model for a set of data $X$ is (4.5). Let $Y_i = X_i - X_n$ for $i = 1, \ldots, n-1$ and denote $y = (Y_1, \ldots, Y_{n-1})$. Suppose the loss function is given by (4.4), and there is an equivariant estimator $\delta_0$ of $\xi$ with finite risk. Also assume that for each $y$, there exists a number $v(y) = v^*(y)$ which minimizes
$$E_0\{\rho[\delta_0(X) - v(y)] \mid y\}. \qquad (4.9)$$
Then a location equivariant estimator with minimum risk exists, and is given by
$$\delta^*(X) = \delta_0(X) - v^*(y).$$
The integral is minimized by minimizing the integrand, and hence (4.9) for
each y. Since δ0 has finite risk E0 {ρ[δ0 (X)]|y} < ∞, the minimization of (4.9)
is meaningful. The result now follows from the assumption of the theorem.
The following examples illustrate the construction of an MRE estimator
for location transformations.
Example 4.8 Assume that the underlying model of a set of data $X_1, \ldots, X_n$ is uniform $(\xi - 1/2, \xi + 1/2)$. Suppose that the loss function is
$$L(\xi, d) = (d - \xi)^2.$$
We may use the approach of the Pitman estimator to find the MRE estimator for $\xi$ under location transformations. Consider
$$f(x_1, \ldots, x_n) = 1 \text{ for all } x_i \in (\xi - \tfrac{1}{2}, \xi + \tfrac{1}{2}); \quad 0 \text{ otherwise.}$$
Notice that
$$x_i \in (\xi - \tfrac{1}{2}, \xi + \tfrac{1}{2}) \text{ for all } i = 1, \ldots, n \;\Rightarrow\; \xi \in (X_{(n)} - \tfrac{1}{2}, X_{(1)} + \tfrac{1}{2}).$$
We have the denominator of the Pitman estimator:
$$\int f(x_1 - \xi, \ldots, x_n - \xi)\, d\xi = \int_{X_{(n)} - 1/2}^{X_{(1)} + 1/2} 1\, d\xi = X_{(1)} - X_{(n)} + 1.$$
The numerator is obtained in the same way, with $\xi$ integrated over the same interval; the ratio of the two integrals yields the Pitman estimator, the midrange $(X_{(1)} + X_{(n)})/2$.
The next example discusses restricted optimization in the case where the
loss function is the absolute error, and the minimization of the risk is restricted
to an equivariant estimator for location transformations.
Example 4.9 Let $X_1, \ldots, X_n$ be i.i.d. according to the exponential distribution $E(\theta, 1)$; find the MRE of $\theta$ under the absolute error loss $L = |d - \theta|$.
we need to minimize the risk with respect to $v$. Since we are not dealing with a squared loss function, the usual optimal solution $v^* = E(\delta_0|Y)$ is not applicable here.

Consider the smallest order statistic $\delta_0 = X_{(1)}$, which under $\theta = 0$ has density $ne^{-ny}$ for $y > 0$, and denote the risk as a function of $v$ by $g(v)$. Thus,
$$g(v) = E_0(|X_{(1)} - v|) = \int_0^{+\infty} |y - v|\,ne^{-ny}\,dy = \int_0^{v} (v - y)ne^{-ny}\,dy + \int_v^{+\infty} (y - v)ne^{-ny}\,dy.$$
Setting $g'(v) = 0$ yields that the optimal solution $v^*$ is the median of $X_{(1)}$,
$$v^* = \frac{\log 2}{n}.$$
Therefore the MRE for location transformations with the absolute error loss function reads
$$\delta^* = X_{(1)} - \frac{\log 2}{n}.$$
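The optimality of the shift $\log 2 / n$ can be confirmed by a direct numerical search; the sketch below approximates $g(v) = E_0|X_{(1)} - v|$ by a Riemann sum and minimizes it over a grid (the value of $n$ and the grid are arbitrary choices).

```python
from math import exp, log

# Check that v* = log(2)/n minimizes g(v) = E_0 |X_(1) - v|, where X_(1)
# has density n*exp(-n*y) on (0, inf).  The value of n is arbitrary.
n = 4

def g(v: float, h: float = 1e-3, upper: float = 5.0) -> float:
    # midpoint Riemann sum for the expected absolute error at shift v
    return sum(abs(h * (k + 0.5) - v) * n * exp(-n * h * (k + 0.5)) * h
               for k in range(int(upper / h)))

# grid search over candidate shifts v in (0, 0.5)
v_best = min((i / 1000 for i in range(1, 500)), key=g)
assert abs(v_best - log(2) / n) < 2e-3
print("grid minimizer agrees with log(2)/n")
```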
$$\delta^*(X) = \frac{\delta_0(X)}{w^*(X)}. \qquad (4.15)$$
Proof The proof of the above theorem is similar to that of the Pitman estimator for location transformations.

We selected the method of the Pitman estimator to illustrate the method of restricted optimization in data analysis, which involves using transformed data for estimation or prediction of an unknown parameter characterizing assumed models. Certainly, there are many interesting results in estimation theory that we are unable to exhaust in this book. Interested readers can find further discussions of these materials in the books [81], [80], and [16], among others.
Linear regression is arguably one of the most commonly used and abused statistical tools in data science. Its versatility and intuitiveness fit a broad range of applications, built on simple linear relationships such as "when the price increases, the return per item increases" or "the insurance premium decreases as the time spent driving increases"; such relationships can arise in almost any setting. Despite this, its ease of use tends to backfire when amateur data analysts mindlessly default to reading the data into software (such as R, Python, or Excel) to obtain a fitted line without checking the validity conditions of linear regression. They tend to overlook the rationale behind the methodology. As a result, the inferred conclusion sometimes amounts to an unreliable statistical prediction.
Traditional textbooks on this topic usually begin with introductory exam-
ples, followed by a least squares estimate, inference, and discussion. However,
in this chapter, we take a different route. We focus on the validity conditions
and precautions with linear regression models in practice, to prevent misuse
of the technique from the get-go.
120 Linear Prediction
The first precaution concerns whether an intrinsic relationship exists between the input and the output. For instance, the body weight of cows has no connection with the body weight of rhinos, as illustrated in Figure 5.1. If there is no intuitive relationship between the input and output variables, it would not make sense to set up a linear model between them, even though the relationship may accidentally appear to be linearly correlated in one dataset due to randomness.
Certainly, in data-oriented analysis, we use exploratory data analysis (such
as plotting the data) to seek or approach the true model behind the data.
Such a practice should be confined to cases where the data truly represent the
variable and no sample selection bias exists. Blindly applying linear regression
without proper justification may result in misleading conclusions.
FIGURE 5.1
Non-intrinsic linear relationship between the weights of cows and rhinos
The second precaution focuses on the sample size and the underlying model of the data. Although the least squares estimate of model parameters does not require any distributional assumption on the data, testing the significance and validity of the fitted model depends on hidden assumptions; for example, that the error term of the data follows a normal model with constant variance. Such a model assumption is critical in the validity analysis of the fitted model. Especially since any software can produce a fitted line out of an input and an output variable, whether that relationship is statistically significant is questionable for many data analyses.
If the data contains too much noise, the effect of the noise overwhelms the
effect of the input feature, and the fitted linear model is rendered insignificant.
In this case, it is necessary to test whether the noise is too large to claim the
existence of a linear relationship. The instruments to perform such tests in-
clude the t-test for linear coefficient significance and the F-test for the validity
of the model. Note that these tests are built upon the normality assumption
with constant variances.
$$y_j = \alpha + \beta x_j + \epsilon_j$$
Pitfalls in linear regressions 121
FIGURE 5.2
Bounded driving years vs insurance premium
The third precaution concerns the range of the input features being used in prediction. Recall that the fitted model is built upon training data that is confined within a range of values of the input predictor. When predicting with the fitted model, if the value of the input variable lies beyond the range of the training data, the predicted result may not be meaningful. For instance, in the prediction of insurance premiums in Figure 5.2, it does not make sense to extend the fitted line to a driving experience of −1 years and claim that the premium is $78.2 per month for a driver one year before the beginning of his or her driving experience, because the intrinsic relationship between the input and the output cannot be plausibly extended to that range.
The fourth precaution is the association effect versus the causation effect. It is often confusing and inaccurate to claim that a linear relationship has a causation effect. Linear regression is simply a fit between two columns of data points; while the input has an effect on the output, the output may as well influence the input in the linear regression model. In fact, what we can claim in a linear regression analysis is essentially an association effect, not a causation effect.
FIGURE 5.3
Car age vs selling price with outliers
With the outlier included, the fitted line indicates an average decrease of $1,022 per year of car age and a lower estimated price of $15,813 for a brand-new car. As shown in the figure, the occurrence of the outlier (6.5, 245) flattens the fitted line and, in turn, misrepresents the data pattern for the bulk of the
Model training and prediction 123
data. Obviously, a value of $24,500 for the selling price of a 6.5-year-old car is an extreme case. In this example, it is more likely that the recorded selling price of 245 ($24,500) is actually 45 ($4,500), with the leading 2 potentially introduced as a typo during the data entry process.
TABLE 5.1
Intrinsic linear relationships
Input variable Output variable
Sales this month Sales next month
Blood alcohol content Measure of body coordination
Price of a product Amount of that product sold
In linear regression, the relationship between the input variable and the
output variable is usually measured by the sample correlation coefficient, r.
If the absolute value of r is close to 1, the linear pattern between the input
variable and the output variable is strong. On the other hand, if the value |r| is
small, there is essentially not much correlation between the two variables. Un-
der the normality assumption, variables with zero correlation are statistically
independent. When the value of Y increases as x increases, the correlation is
positive. If the value of Y decreases as the value of x increases, the correlation
coefficient is negative. It can be proved that |r| ≤ 1.
$$r_{xy} = \frac{\sum_i x_iy_i - \frac{1}{n}\sum_i x_i \sum_i y_i}{\sqrt{\sum_i x_i^2 - \frac{1}{n}\left(\sum_i x_i\right)^2}\sqrt{\sum_i y_i^2 - \frac{1}{n}\left(\sum_i y_i\right)^2}}. \qquad (5.1)$$
Equivalently, the sample correlation coefficient is expressed as
$$r_{xy} = \frac{\frac{1}{n-1}\left(\sum_i x_iy_i - \frac{1}{n}\sum_i x_i \sum_i y_i\right)}{S_xS_y} = \frac{\widehat{COV}(X, Y)}{S_xS_y}, \qquad (5.2)$$
where $\widehat{COV}(X, Y)$ is the sample covariance between $X$ and $Y$, and $S_x$ and $S_y$ are the sample standard deviations of $X$ and $Y$, respectively.
Equation (5.1) and Equation (5.2) are algebraically equivalent. The difference is that Equation (5.2) uses the sample covariance and sample standard deviations that are conventionally used in data analysis. Equation (5.2) also highlights the meaning of the correlation coefficient: it measures the joint variation of $X$ and $Y$, adjusted by the individual variations of $X$ and $Y$.
The following example demonstrates the computation of the sample cor-
relation coefficient r on the basis of a set of bivariate data (X, Y ).
Example 5.1 Assume that we have the following data on the input variable
X and output variable Y .
X: 3 6 12 18 24
Y: 60 95 140 170 185
We can compute the components in the formula (5.1) as follows.
$$\sum_i x_i = 63, \quad \sum_i x_i^2 = 1089, \quad \sum_i y_i = 650, \quad \sum_i y_i^2 = 95350, \quad \sum_i x_iy_i = 9930, \quad n = 5.$$
This results in a sample correlation coefficient of 0.972, indicating a strong
linear pattern between the two variables.
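As a quick check, formula (5.1) can be applied to the data of Example 5.1 in a few lines of Python:

```python
from math import sqrt

# Sample correlation coefficient via formula (5.1), using the data
# from Example 5.1.
x = [3, 6, 12, 18, 24]
y = [60, 95, 140, 170, 185]
n = len(x)

sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
sxx = sum(a * a for a in x) - sum(x) ** 2 / n
syy = sum(b * b for b in y) - sum(y) ** 2 / n
r = sxy / sqrt(sxx * syy)
print(round(r, 3))  # → 0.972
```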
Figure 5.4 illustrates different patterns of plots between the input and out-
put variables with their corresponding sample correlation coefficients. When
the linear pattern is indiscernible, the sample correlation coefficient is low, at
a range from 0 to 0.3. As the linear pattern becomes significant, the abso-
lute value of the corresponding sample correlation coefficient increases. When
variable Y increases as X increases, the correlation coefficient is positive; oth-
erwise, it is negative. Figure 5.4 shows that the correlation of sample data
is intuitive to understand. The plots of the data cloud fit well with the val-
ues of the sample correlation coefficients. The relationship between the input
and output variables is easily obtained from a quick glance at the plot of the
training data.
FIGURE 5.4
Data cloud and sample correlation coefficients
Since
$$E\big((Y - c(X))^2 \mid X\big) = E\big((Y - E(Y|X))^2 \mid X\big) + \big(E(Y|X) - c(X)\big)^2,$$
the conditional expected value is minimized when $c(X) = E(Y|X)$. For the linear model
$$y = \alpha + \beta X + \epsilon, \quad E(\epsilon) = 0, \quad var(\epsilon) = \sigma^2,$$
this leads to the underlying function $c(X) = E(Y|X) = \alpha + \beta X$. Therefore, when the unknown underlying function $c(X)$ takes on the value $\alpha + \beta X$, the conditional expected value reaches its minimum, which consequently minimizes the expected prediction errors.
Notice that the above derivation does not require the distribution of the er-
ror term. When we search for the optimal solution for the underlying function
behind the data, we do not need the distribution pattern of the error term.
However, under the assumption that the underlying error term follows a nor-
mal model, as discussed in most introductory statistics textbooks, p-values are
available to determine whether there is significant data evidence to support
the validity of the linear model. In the following sections, we shall address
prediction methods using simple linear models with and without normality
assumptions in data science, before discussing the implications of multiple
linear regression at the end of this chapter.
Step 2. Using the least squares estimation with training data to build a trained model without normality assumptions

Given a set of data consisting of a predictor $X$ and a response $Y$, to build a simple linear regression model, we need to estimate the coefficients $\alpha$ and $\beta$ in the model
$$Y = \alpha + \beta X + \epsilon,$$
where $\epsilon$ is the random term of the data. If $\epsilon$ follows a normal model, we have existing estimation and testing procedures for the significance of the unknown parameters $\alpha$ and $\beta$.

Consider the target function of minimizing the sum of squared residual errors for a set of training data,
$$L(\alpha, \beta) = \sum_{i=1}^n (Y_i - \alpha - \beta X_i)^2. \qquad (5.3)$$
Minimizing $L(\alpha, \beta)$ yields
$$\hat{\beta} = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}, \quad \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X},$$
where
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i, \quad \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i.$$
The above derivation can be alternatively obtained via the following formulation. Consider a case where the target function for optimization is the sample mean squared error ($\widehat{MSE}$). On the training set, assume that we have $n$ observation pairs on $(x, y)$, where $x$ is the input variable and $y$ is the output variable. Denote the vectors $X = (x_1, \ldots, x_n)'$ and $Y = (y_1, \ldots, y_n)'$. Our goal is to minimize
$$\widehat{MSE}_{training} = \frac{1}{n}(Y - \hat{\alpha} - X\hat{\beta})'(Y - \hat{\alpha} - X\hat{\beta}).$$
After some standard operations as documented in conventional statistics textbooks, the optimal values can be achieved by letting
$$\hat{\beta} = (X'X)^{-1}X'Y, \qquad (5.4)$$
and
$$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}, \qquad (5.5)$$
where $\bar{X}$ and $\bar{Y}$ are the sample means of the predictor and the response variable, respectively.
Note that the above derivations do not require any normality assumption on the distribution of the error term $\epsilon$. We shall use an example to illustrate the above discussion.
Example 5.2 Horizon Properties specializes in custom home re-sales in
Phoenix, Arizona. A random sample of 200 records from the custom-home-
sale database provides the following information on the size (in hundreds of
square feet, rounded to the nearest hundred) and price (in thousands of dollars,
rounded to the nearest thousand) of houses in the market.
Using Equations (5.4) and (5.5), we get an estimated model,
y = −110 + 15.89X + error
This is represented in Figure 5.5.
The fitted line can be interpreted as follows. When the house size is 2,000 square feet, the long-run average price in the area is $207,800, because the predicted value is $y = -110 + 15.89 \times 20 = 207.8$ (thousand dollars). Also, each 100-square-foot increase in the size of a house will increase the long-run average resale value of the house by $15,890.

Notice that in the interpretation of the regression model, the intercept −110 cannot be interpreted directly. Clearly, −110 is a nonsensical value, because the value of a house surely cannot be negative. Moreover, it corresponds to an x-value of 0, implying a house of zero square feet, which is also a nonsensical input. This example indicates that a linear regression is only intended to be interpreted within the range of reasonable inputs.
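The closed-form least squares formulas are easy to apply directly. The sketch below fits a simple linear model to a small hypothetical data set (not the Horizon Properties sample, whose 200 records are not listed here) and then reproduces the prediction for a 2,000-square-foot house from the fitted coefficients reported in Example 5.2.

```python
# Simple least squares fit via the closed-form slope/intercept formulas.
# The (size, price) pairs below are hypothetical illustration data.
x = [14, 17, 20, 23, 26, 30]         # size, hundreds of square feet
y = [115, 160, 210, 250, 300, 370]   # price, thousands of dollars

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
beta = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / \
       sum((a - x_bar) ** 2 for a in x)
alpha = y_bar - beta * x_bar
print(f"fitted line: y = {alpha:.1f} + {beta:.2f} x")

# Prediction at x = 20 from the fitted model reported in Example 5.2:
print(round(-110 + 15.89 * 20, 1))  # → 207.8
```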
FIGURE 5.5
Horizon Properties, data summary with a linear model
1. Sample R²

The first and most convenient way is to examine the sample $R^2$ value in a regression analysis. In simple linear regression, the sample $R^2$ is the square of the sample correlation coefficient between the input and output variables. In the multiple linear regression case, we have the corresponding multiple correlation matrix among the variables, which essentially serves the same purpose. As $R^2$ is a good indicator of goodness of fit, we may also examine the changes in $R^2$ when running variable selection algorithms in linear models.

For simple linear regression, the sample $R^2$ is formally defined as
$$R^2 = \left[\frac{1}{n-1}\sum_i \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)\right]^2,$$
where $\bar{x}$ and $\bar{y}$ are the average values of $x$ and $y$ in the training data, respectively, and $s_x$ and $s_y$ are the sample standard deviations of the training data.
Taking a closer look at the $y$ terms, we can see that
$$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2,$$
and the sample test mean squared error from the testing data,
$$\widehat{EP}_R = \frac{1}{m}\sum_{i\in P} (Y_i - g(X_i))^2. \qquad (5.7)$$
$$\bar{X}_n \to_P \mu$$
as $n \to \infty$. The weak law of large numbers states that as the number of observations increases, the sample average will be close (in probability) to the expected value. In this setting, $\mu$ is $EP_R$, while the sample mean $\bar{X}$ is the sample expected prediction error $\widehat{EP}_R$. In fact, the relationship between the sample mean and the population mean has a stronger statement:
$$P\Big(\lim_{n\to\infty} \bar{X}_n = \mu\Big) = 1, \quad \text{that is,} \quad \bar{X}_n \to_{a.s.} \mu$$
as $n \to \infty$.
When the sample size is large enough, the average of the observations converges almost surely to the expected value.
Now, if we let
$$Z_i = (Y_i - \hat{\alpha} - X_i\hat{\beta})^2,$$
we can apply the strong law of large numbers to the sequence $\{Z_i\}$ and obtain
$$\frac{1}{m}\sum_{i=1}^m (Y_i - \hat{\alpha} - X_i\hat{\beta})^2 = \frac{1}{m}\sum_{i=1}^m Z_i \to_{a.s.} E(Z) = E[(Y - \hat{\alpha} - X\hat{\beta})^2].$$
Therefore, when the sample size is sufficiently large, the test MSE almost surely equals the expected prediction error. Stated more directly by combining the training set with the testing set, when the sample sizes in the training set and the test set are large enough, the linear model estimated from the least squares criterion almost surely has the smallest test MSE. Therefore, it is necessary to have large sample sizes in both the training set and the testing set when we cannot plausibly assume that the error term follows a normal model with a common standard deviation.
Example 5.3 We use this example to show that for simple linear regression,
parameters estimated from the training data do not guarantee the variance
decomposition principle.
where ȳ and z̄ are sample means of the responses in the training set and testing
set, respectively. ŷi and ẑi are the predicted responses corresponding to the
predictor in the training set and testing set, respectively. The response in the
training set is denoted as yi and the response in the testing set is denoted as
zi .
First, we consider the validity of (5.8). Notice that when we train the linear model with a set of training data, we have
$$\hat{\beta} = \frac{\sum_{j\in T}(x_j - \bar{x})(y_j - \bar{y})}{\sum_{j\in T}(x_j - \bar{x})^2}$$
and
$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x},$$
where $\bar{y}$ and $\bar{x}$ are the sample mean of the responses and the sample mean of the predictor in the training data set. Now,
$$\sum_{i\in T}(y_i - \bar{y})^2 = \sum_{i\in T}(y_i - \hat{y}_i)^2 + \sum_{i\in T}(\hat{y}_i - \bar{y})^2 + \sum_{i\in T}2(\hat{y}_i - \bar{y})(y_i - \hat{y}_i).$$
Notice that the last term in the expansion has the following property in the training data set. Since $\hat{y}_i - \bar{y} = \hat{\beta}(x_i - \bar{x})$,
$$\sum_{i\in T}2(\hat{y}_i - \bar{y})(y_i - \hat{y}_i) = 2\hat{\beta}\sum_{i\in T}(x_i - \bar{x})(y_i - \hat{y}_i) = 2\hat{\beta}\sum_{i\in T}(x_i - \bar{x})\big(y_i - \bar{y} - \hat{\beta}(x_i - \bar{x})\big)$$
$$= 2\hat{\beta}\left[\sum_{i\in T}(x_i - \bar{x})(y_i - \bar{y}) - \frac{\sum_{j\in T}(x_j - \bar{x})(y_j - \bar{y})}{\sum_{j\in T}(x_j - \bar{x})^2}\sum_{i\in T}(x_i - \bar{x})^2\right] = 2\hat{\beta}\,(0) = 0.$$
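The vanishing cross term means SST = SSE + SSR holds exactly on the training set; a minimal numerical sketch with made-up data:

```python
# Numerical check of the variance decomposition on a training set:
# SST = SSE + SSR, i.e. the cross term vanishes for a least squares fit.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
beta = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / \
       sum((a - x_bar) ** 2 for a in x)
alpha = y_bar - beta * x_bar
y_hat = [alpha + beta * a for a in x]

sst = sum((b - y_bar) ** 2 for b in y)
sse = sum((b - h) ** 2 for b, h in zip(y, y_hat))
ssr = sum((h - y_bar) ** 2 for h in y_hat)
assert abs(sst - (sse + ssr)) < 1e-9
print("SST equals SSE + SSR on the training data")
```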
Regression Statistics
Multiple R 0.9888279
R Square 0.9777807
Adjusted R Square 0.9755588
Standard Error 5.25746
Observations 12
ANOVA
df SS MS F Significance F
Regression 1 12163.62781 12163.63 440.0593 1.34528E-09
Residual 10 276.4088543 27.64089
Total 11 12440.03667
Regression Statistics
Multiple R 0.288086
R Square 0.082993
Adjusted R Square -0.00037
Standard Error 51.87716
Observations 13
ANOVA
df SS MS F Significance F
Regression 1 2679.272 2679.272 0.995553 0.33983196
Residual 11 29603.64 2691.24
Total 12 32282.91
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 158.1272 50.65999 3.121344 0.009727 46.6253475 269.62911
Years of car -10.2281 10.25097 -0.99777 0.339832 -32.790379 12.33408
FIGURE 5.6
Regression outputs on selling prices with outliers
When examining the second part of Figure 5.6, the story deviates from the first portion of the output. With the presence of outliers, the data variation becomes too large and the fitted line essentially becomes insignificant. The sample R² = 0.083 indicates that there is essentially no linear pattern behind the data.

Notice that the fitted line is moved away from the original one due to the occurrence of outliers. The estimated standard error for the regression coefficient is 10.25, which is almost the same as the absolute value of the regression coefficient (−10.23). Since the data variation is almost as large as the magnitude of the estimated model coefficient, the t-statistic is −0.998, with a p-value of 0.3398, suggesting that, with the inclusion of outliers, the data variation has increased to a level that overwhelms the significance of the estimated slope. Thus, we fail to reject the null hypothesis that β = 0, which is equivalent to stating that there is no statistical evidence to claim a relationship between the age of a car and its selling price. Under the normality assumptions, the numerical output agrees with the observation of the large variation of the data in the second part of Figure 5.3.
Multiple linear regression 135
$$P(|t_{n-2}| < t_\alpha) = 1 - \alpha.$$
The sample standard deviation of $\hat{y}$, $s_{\hat{y}}$, reads
$$s_{\hat{y}} = s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}},$$
where $s$ is the sample standard error of the residuals, $x_0$ is the value of the explanatory variable $X$ for which we predict the response, and
$$SS_{xx} = \sum_i (x_i - \bar{x})^2$$
is the sum of squares of the explanatory variable $X$ from the training data.
$$Y = \alpha + \beta_1X_1 + \cdots + \beta_kX_k + \epsilon.$$
Besides inheriting common features of simple linear regression, with more than one predictor in the model, multiple linear regression possesses other discernible properties, as discussed below.
ID: 1 2 3 4 5 6 7 8 9 10 11 12
Y: 85 93 79 98 83 66 53 68 72 81 74 87
X1 : 1.7 1.9 1.6 2.0 1.8 1.1 0.9 1.0 1.3 1.5 1.3 1.9
X2 : 23 19 21 22 28 36 34 29 21 32 19 41
X3 : 5.1 4.3 6.2 3.2 3.7 7.2 8.1 7.6 5.3 5.2 6.2 4.1
Call:
lm(formula = concentration ~ dosage + age + stress, data = mydata2)
Residuals:
Min 1Q Median 3Q Max
-5.2335 -2.2658 0.1126 2.2270 5.5940
Coefficients:
Estimate Std. Error t value Pr(>|t|)
FIGURE 5.7
Regression of concentration on dosage, age, and stress level
Figure 5.7 contains the outcomes of a data analysis using multiple linear regression. Among the three predictors, the data only contains significant evidence to claim that the percentage of concentration is linearly affected by the dosage of the medicine. The corresponding p-value is 0.0162, obtained from the t-test under the normality assumption of the error term. For each 10% increase in dosage, the corresponding percentage of concentration increases 2.78%, on average, with a 95% confidence interval from 0.67% to 4.88%.
FIGURE 5.8
Dosage-concentration level plot
examine all the candidate models one-by-one to select the best one. In the
sequel, we will discuss multiple correlation matrices among the response and
all the predictors, followed by a discussion on the AIC criterion for variable
selection in multiple linear regression.
FIGURE 5.9
Correlations on concentration, dosage, age, and stress
Since we are dealing with the response (concentration level), dosage, age,
and stress level, the correlation among these variables is a matrix. As shown
in Figure 5.9, the concentration level has strong positive correlation with the
dosage. Patients taking higher dosage of the medicine tend to have higher level
of concentration, on average. The sample correlation coefficient is 0.96 with
strong data evidence (extremely small p-value) to reject the null hypothesis
H10 in favor of the alternative hypothesis H11 , where
H10 : ρ = 0 versus ρconcentration−dosage = 0,
ρ is the correlation coefficient between the patient concentration percentage
and the corresponding dosage.
Although the above analysis coincides with the analysis result in Figure 5.7, further examining the correlation between the concentration level and the stress level reveals that dosage is not the only valuable predictor in our model; stress level is also strongly and negatively associated with the concentration level. Patients with higher stress levels tend to have lower concentration percentages.
$$\hat{\rho}_1 = -0.91, \quad \hat{\rho}_2 = -0.94,$$
where $\rho_1$ is the correlation coefficient between the patient's concentration level and the corresponding stress level, and $\rho_2$ is the correlation coefficient between the dosage and the stress level. The result of the correlation analysis shows that
both of the above-mentioned correlation coefficients are significantly different
from zero. Since the dosage level of a patient is significantly associated with
the stress level, it is naive to assume that the two covariates are independent.
On the other hand, examining the sample correlation of age with the con-
centration level, dosage, and stress level, we found that the sample correlation
coefficients are -0.34, -0.24, and 0.23, respectively. The corresponding p-values
are larger than 0.05, indicating that there is no data evidence to claim signif-
icant correlation between age and any one of the three variables.
When we fit a multiple linear model without the variable dosage, as shown in Figure 5.10, the stress level feature has a much smaller p-value (9.98e-05) than that of dosage in the full model (0.0162), implying there is stronger statistical evidence to claim a linear relationship between the concentration level and the stress level.
Figure 5.11 shows the plot of the data between the stress level and the
concentration level. It is observable that as the stress level increases, the level
of concentration decreases. The linear pattern of the data is reflected by the
fitted line.
Although the software produces a regression line as shown in Figure 5.12,
the variation of the data invalidates the fitted line. It indirectly supports the
claim that age is not a significant factor attributable to patient concentration
levels.
FIGURE 5.10
Drug concentration level associated with age and stress
the focus of multiple linear regression turns to identifying "proper" variables for the training model. That is, we want to pick out the input variables that are going to produce a good model, without bad qualities like noise or over-fitting. This necessitates a discussion of the AIC (Akaike information criterion) and BIC (Bayesian information criterion) for model selection.
When we use data to construct an estimate $\hat{f}(x)$ for the unknown underlying model $f(x)$ via parameter estimation, we need to consider the information loss from using $\hat{f}$ to replace the true model $f(x)$. This is usually measured by the relative entropy of $\hat{f}$ to $f$, the Kullback-Leibler divergence:
$$D(f\|\hat{f})_{KL} = \int \log\frac{f(x)}{\hat{f}(x)}\, dF(x),$$
where $F(x)$ is the cumulative distribution function associated with $f(x)$. It can be shown that the estimated model minimizing the Kullback-Leibler divergence $D(f\|\hat{f})$ can be obtained via the AIC,
$$AIC = 2k - 2\log\hat{L},$$
where $k$ is the number of estimated parameters in the model, and $\hat{L}$ is the maximum value of the likelihood function of the model. Specifically, in multiple linear regression with $p$ predictors, the AIC reads
FIGURE 5.11
Drug concentration level associated with stress alone
$$AIC = n\log\frac{RSS}{n} + n\log(2\pi) + n + 2(p + 1), \qquad (5.10)$$
where $RSS$ is the residual sum of squares of the fitted line. Another commonly used criterion is the BIC (Bayesian information criterion), which reads
$$BIC = n\log\frac{RSS}{n} + n\log(2\pi) + n + \log(n)(p + 1). \qquad (5.11)$$
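Once a model's residual sum of squares is known, (5.10) and (5.11) are straightforward to compute; a small sketch with an illustrative RSS value:

```python
from math import log, pi

def aic(rss: float, n: int, p: int) -> float:
    """AIC for a linear model with p predictors, as in (5.10)."""
    return n * log(rss / n) + n * log(2 * pi) + n + 2 * (p + 1)

def bic(rss: float, n: int, p: int) -> float:
    """BIC for a linear model with p predictors, as in (5.11)."""
    return n * log(rss / n) + n * log(2 * pi) + n + log(n) * (p + 1)

# Hypothetical example: n = 12 observations, RSS = 122.35, p = 2 predictors.
print(round(aic(122.35, 12, 2), 2))  # → 67.92
print(round(bic(122.35, 12, 2), 2))  # → 69.37
```

Note that software may report AIC up to an additive constant; R's `stepAIC`, for instance, drops the $n\log(2\pi) + n$ terms, so its numbers differ from (5.10) by a constant that does not affect the ranking of models.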
With the AIC criterion in (5.10) and the BIC criterion in (5.11), one can select the model that minimizes the estimated information loss as the best model for the data. However, we can't just test all possible models (that is, every combination of variables). Given p possible input variables, each variable can either be included in the model or not. With each of the p variables branching two ways, we actually have 2^p possible models. This value quickly becomes intractable. For example, with 10 variables, we have 2^10 = 1024 possible models, and with 100 variables, we have 2^100 ≈ 1.27 × 10^30 models. Surely, testing each model for viability is not the right way forward. We shall discuss three commonly used variable selection approaches below.
1. Forward Selection
One approach to solve the variable selection problem is known as forward
selection. This algorithm starts by examining the list of possible explana-
tory variables and computing a simple regression for each one. The estimated
AIC value (or residual sum of squares, or partial F-statistic) is used for the
judgment of including an explanatory variable in the final model. Then, we
continually add in the next best explanatory variable until no improvement is
detected and a final model is achieved.
Computer scientists may recognize this as a greedy algorithm. By taking
the maximum improvement at each step, we hope to quickly converge on the
“best” model.
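The greedy loop can be sketched as follows. The helper functions and the toy data are illustrative assumptions (not the book's example), with the AIC of (5.10) used as the selection score:

```python
import math

def solve(A, b):
    # Gauss-Jordan solve of A x = b for a small linear system
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c] / M[c][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def rss(cols, y):
    # residual sum of squares of an OLS fit (with intercept) on the given columns
    n = len(y)
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    return sum((y[i] - sum(beta[j] * X[i][j] for j in range(p))) ** 2 for i in range(n))

def aic(n, p, r):
    return n * math.log(2 * math.pi) + n * math.log(r / n) + n + 2 * (p + 1)

def forward_select(cols, y):
    # greedily add the variable giving the largest AIC improvement; stop otherwise
    n, chosen, remaining = len(y), [], list(range(len(cols)))
    best = aic(n, 0, rss([], y))                 # intercept-only model
    while remaining:
        score, j = min((aic(n, len(chosen) + 1,
                            rss([cols[k] for k in chosen] + [cols[j]], y)), j)
                       for j in remaining)
        if score >= best:                        # no improvement: stop
            break
        best = score
        chosen.append(j)
        remaining.remove(j)
    return chosen

# toy data: y is driven by x1, weakly by x3, and essentially not by x2
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 9]
x3 = [1, -1, 1, -1, 1, -1, 1, -1]
eps = [0.05, -0.02, 0.03, 0.01, -0.04, 0.02, -0.01, 0.03]
y = [2 * a + 0.3 * c + e for a, c, e in zip(x1, x3, eps)]
print(forward_select([x1, x2, x3], y))
```

On this toy data, the dominant predictor x1 is picked first, exactly the greedy behavior described above.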
2. Backward Selection
If we can add variables, we can also subtract them. The corresponding backward
selection algorithm works by running a single regression on all candidate
explanatory variables, then repeatedly removing the variable whose deletion
most improves the selection criterion, until no removal yields further improvement.
3. Mixed Selection
A blending of the two preceding algorithms is commonly referred to as
mixed selection. Multiple formulations of this algorithm exist; we use the
following as a representative case. We apply forward selection to add a variable,
then run backward selection to re-test and prune variables, repeating this pair
of steps until a satisfactory solution is achieved.
Step: AIC=33.86
concentration ~ dosage + age
Df Sum of Sq RSS AIC
<none> 122.35 33.864
- age 1 23.56 145.92 33.978
+ stress 1 1.84 120.52 35.683
- dosage 1 1390.52 1512.87 62.042
FIGURE 5.13
AIC criterion and information loss in multiple regression
Example 5.6 Refer to Example 5.5, use the AIC criterion to find the best
model (the model with the smallest information loss) for the data.
As shown in Figure 5.13, for the mixed selection approach, the starting
model (with all the covariates, dosage, age, and stress level) has AIC 35.68.
> library(MASS)
> fit<-lm(concentration~dosage+age+stress, data=mydata)
> step<-stepAIC(fit, direction="backward")
FIGURE 5.14
R Codes for confidence prediction
regression then works by taking each variable along a continuum and assuming
that full interpolation is possible.
However, not all data comes cleanly in this form. Some data is categor-
ical, meaning that it belongs to certain categories which may have no numerical
relation to each other. For example, the color of a car may be important in de-
termining car insurance rates, as may the nationality of a car's make in that same
model (physicists may complain that color is defined on a wavelength, but
this ordering of colors is arbitrary and serves no purpose in this regression).
Other data is simply discrete, meaning that it will either never come in a con-
tinuous form (e.g., star ratings on individual Yelp reviews) or, more strongly,
is perhaps not comprehensible in a continuous form (e.g., rankings of the
best video games: a 1.5th place cannot exist). When dealing with well-ordered
discrete data, like star ratings, it may make sense to treat the inputs as con-
tinuous anyway, but only occurring along a fixed set of values.
However, when dealing with categorical data, we will use dummy variables
to separate variables. “Dummy variables” are indicator variables taking on
the boolean (true/false) value of 0 or 1 based on whether the condition is
present. For example, when classifying car insurance rates using colors, we
may create dummy variables for each separate color: 1red , 1blue , etc. This
notation (1condition ) is used to refer to the dummy variable that is 1 when
“condition” is true, and 0 otherwise. Of course, we expect only one dummy
variable to take on the value of 1 out of the created set. In modern machine
learning and computer science, dummy variables are also often referred to as
one-hot encoding because of this property.
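A minimal sketch of the encoding (the car-color feature below is a made-up illustration):

```python
colors = ["red", "blue", "red", "green"]          # hypothetical categorical feature
levels = sorted(set(colors))                      # ["blue", "green", "red"]

# each observation becomes an indicator vector with a single 1 ("one-hot")
one_hot = [[1 if c == level else 0 for level in levels] for c in colors]
print(one_hot)
```

In a regression with an intercept, one level is usually dropped as the baseline, since the full set of indicators always sums to 1 and would be collinear with the intercept.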
where X is the design matrix. In other terms, the leverage scores are the diagonal
elements of the projection matrix.
If x_i is far away from the average x̄, h_ii will be large, meaning that the
corresponding observation y_i has a large impact on the fitted model. Since the
MLE of the linear regression coefficients is

β̂ = (X^T X)^{-1} X^T y,

and

ŷ = X β̂ = X (X^T X)^{-1} X^T y,

for the ith observation we have

dŷ_i / dy_i = ( X (X^T X)^{-1} X^T )_ii = h_ii,

the degree to which the ith response influences the ith predicted value via
the observation matrix X.
Since the projection matrix is symmetric and idempotent, we have

h_ii = Σ_{t=1}^{n} h_it h_ti = h_ii^2 + Σ_{t≠i} h_it^2 ≥ h_ii^2,   (5.14)

which implies

0 ≤ h_ii ≤ 1.

Moreover, we have

Σ_{i=1}^{n} h_ii = Trace( X (X^T X)^{-1} X^T )
              = Trace( (X^T X)^{-1} X^T X )
              = Trace( I_p )
              = p.
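These properties are easy to verify numerically. The sketch below computes the h_ii for a small made-up design matrix (one observation placed far from the rest) and checks that the leverages lie in [0, 1], sum to p, and are largest for the extreme observation:

```python
def invert(A):
    # Gauss-Jordan inversion of a small matrix
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def leverages(X):
    # diagonal of the projection (hat) matrix X (X^T X)^{-1} X^T
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    inv = invert(XtX)
    return [sum(X[i][a] * inv[a][b] * X[i][b] for a in range(p) for b in range(p))
            for i in range(n)]

X = [[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0], [1, 10.0]]   # last point far from the rest
h = leverages(X)
print(h)
```

The extreme observation carries the lion's share of the leverage, matching the intuition that points far from x̄ pull the fitted line toward themselves.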
We can take this concept a little further with Studentized residuals to test
whether an observation lies too far from the bulk of the data.
Given a residual d_i = y_i − ŷ_i, the variance of d_i can be expressed as

Var(d_i) = σ^2 (1 − h_ii).
Example 5.7 For simplicity, assume that a training data set (toy example)
contains
Y 12 -20 14.3 15.3 15.4
X1 2.1 2.2 2.3 2.4 2.5
X2 1.2 1.6 1.7 1.9 1.8
Find and interpret the Studentized residual for the second observation in the
training set.
Solution: As shown in Figure 5.15, the leverage statistic for the second obser-
vation is
h22 = 0.187,
indicating that under the context of the observation matrix X, each unit
change of the second response contributes 18.7% to the change of the predicted
value of y2 .
The absolute value of the residual of the second response is 17.452. With
the residual standard error σ̂ = 17.12, we have
Studentized t = 17.452 / ( 17.12 √(1 − 0.187) ) = 1.13.
Thus, at 0.05 significance level, since t0.975,3 = 3.18 we do not have data
evidence to claim that the second observation is significantly beyond the bulk
of the data, even though the second response looks well beyond the rest of the
response. This is partly due to the relatively small number of observations in
the regression.
Call:
lm(formula = y ~ x[, 1] + x[, 2])
Residuals:
1 2 3 4 5
5.717 -17.452 10.030 9.428 -7.723
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -183.85 172.93 -1.063 0.399
x[, 1] 120.35 113.04 1.065 0.399
x[, 2] -52.16 66.15 -0.789 0.513
FIGURE 5.15
Leverage statistic and outliers
among non-statisticians as a matter of coding: supply an input and obtain an
output, without regard for the legitimate assumptions and conditions of the
regression method. This chapter addresses problems that are commonly
abused or overlooked in regression analysis.
Starting with the precaution that using linear regression without proper justi-
fication may result in misleading conclusions, we discuss the connection be-
tween the known and the unknown in identifying the model governing the
data. Linear regression is the first step to bridging the known (predictors)
and the unknown (response variables) with interpretable functions. In fact,
it is the optimal solution when the conditional expectation is indeed a linear
combination of the predictors.
We also discuss technical details on training models and testing the trained
model with and without the normality assumption, a commonly overlooked issue
in data science.
Analyzing confounding effects and AIC-BIC criteria for the selection of
predictors in a multiple linear model, we use examples to delineate the in-
terpretation, validation, and confidence prediction on the linear model. A specific

β̂ = (X^T X)^{-1} X^T y.
β̂ = argmin_β Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} x_ij β_j )^2,   (6.1)

we have

Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_ij )^2 + λ Σ_{j=1}^{p} β_j^2 = RSS + λ Σ_{j=1}^{p} β_j^2.
E(β̂_ridge) ≠ β.
β̂_ridge = (X^T X + λI)^{-1} X^T y.

Denote by β̂_ls the parameter estimate obtained from the least squares estimation
(6.1); we have

E(β̂_ls) = β.

Let R = X^T X. The expectation of the parameter estimator for the ridge
regression reads

E(β̂_ridge) = (R + λI)^{-1} R β.

Thus, E(β̂_ridge) = β if λ = 0.
Denote

g(β_1) = Σ_{i=1}^{n} (y_i − x_i β_1)^2 + λ β_1^2.

Minimizing g(β_1) gives

β̂_1^ridge = Σ_i x_i y_i / ( Σ_i x_i^2 + λ ).

The value of β̂_1^ridge is controlled by the value of λ. As λ increases, β̂_1^ridge
shrinks. When λ approaches ∞, β̂_1^ridge approaches zero. When λ takes the
value zero, β̂_1^ridge becomes the regular MLE of β_1,

β̂_1^MLE = Σ_i x_i y_i / Σ_i x_i^2.
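The shrinkage behavior can be seen directly in a one-predictor sketch; the data below are made up for illustration.

```python
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]          # roughly y = 2x

def ridge_beta(x, y, lam):
    # closed-form one-dimensional ridge estimate: sum(x*y) / (sum(x^2) + lambda)
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

print(ridge_beta(x, y, 0.0))      # the ordinary MLE
print(ridge_beta(x, y, 10.0))     # shrunk toward zero
```

As λ grows, the estimate decays monotonically toward zero, and at λ = 0 it coincides with the MLE.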
Σ_{j=1}^{p} |β_j| ≤ t.   (6.4)

β̂_lasso = argmin_β Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} x_ij β_j )^2   subject to Σ_{j=1}^{p} |β_j| ≤ t.   (6.5)

Both optimization problems (6.5) and (6.2) impose constraints to prevent over-
fitting the training data while minimizing the training MSE,

MSE = Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} x_ij β_j )^2.
y = Xβ + ε = β_1 + β_2 + ε.

For the usual multiple linear regression, the MLE is

β̂ = (X^T X)^{-1} X^T y = y.

Under the LASSO constraint

|β_1| + |β_2| ≤ t,

the penalized objective separates coordinate-wise as

f(β_1, β_2) = Σ_{j=1}^{2} ( y_j^2 − 2 y_j β_j + β_j^2 + λ|β_j| ).   (6.6)

For the step-wise model selection, the constraint takes the form

Σ_{j=1}^{p} I(β_j ≠ 0) ≤ s,
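Because the objective in (6.6) separates across coordinates, each β_j can be minimized on its own, and the minimizer is the familiar soft-thresholding rule; the closed form below follows from checking the two sign cases of the derivative.

```python
def soft_threshold(yj, lam):
    # minimizes (yj - b)^2 + lam * |b| over b: equals sign(yj) * max(|yj| - lam/2, 0)
    shrunk = abs(yj) - lam / 2.0
    return (1.0 if yj > 0 else -1.0) * max(shrunk, 0.0)

print(soft_threshold(3.0, 2.0))    # large coefficient: shrunk by lam/2
print(soft_threshold(0.3, 2.0))    # small coefficient: set exactly to zero
```

This exact-zero behavior is what lets the LASSO remove predictors entirely, whereas the ridge penalty only shrinks them.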
Thus, the method optimizing the MSE in the step-wise model selection for
linear model regularization is

min_β Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_ij )^2   subject to Σ_{j=1}^{p} I(β_j ≠ 0) ≤ s.   (6.7)

β̂_ridge = argmin_β Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} x_ij β_j )^2   subject to Σ_{j=1}^{p} β_j^2 ≤ t.

β̂_lasso = argmin_β Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} x_ij β_j )^2   subject to Σ_{j=1}^{p} |β_j| ≤ t.
The above three optimizations focus on the MSE as the target function.
However, the model selection approach directly regularizes the model by re-
moving redundant regression coefficients and keeping the key predictors. The
method of ridge regression shrinks the associated parameter values toward
zero, while the LASSO method controls the range in which each predictor
contributes to the response variable.
For the least squares estimate

β̂ = (X^T X)^{-1} X^T y,

when p > n the inverse matrix (X^T X)^{-1} does not exist, which invalidates
the least squares estimate of the regression parameters. The randomness of the
sample covariance matrix also makes it impossible to resolve the problem with
a generalized inverse matrix.
When n < p, the usual least squares estimate is not unique, but this can be
fixed by adding a constraint, such as in the ridge regression. However, when
n << p, even with adjusted methods such as ridge regression or LASSO,
adding predictors (noisy features) may deteriorate the model.
The following theorem shows that with an appropriately selected constant
λ, the least square estimate of the ridge regression always exists.
Since X^T X is symmetric and non-negative definite, it can be decomposed as

X^T X = A diag(λ_1, ..., λ_p) A^T,   with λ_i ≥ 0,

for an orthogonal matrix A. It follows that

X^T X + λI = A diag(λ_1 + λ, ..., λ_p + λ) A^T.

Thus, for any λ > 0,

Det(X^T X + λI) ≠ 0.
Z_m = Σ_{j=1}^{p} φ_jm X_j,   for m = 1, ..., M,

and M < p. We may then use the new set of predictors Z_1, ..., Z_M to avoid
the issue of high dimensionality.
Note that any linear combination of the new predictors is still a linear
combination of the original ones:

Σ_{m=1}^{M} θ_m z_im = Σ_{m=1}^{M} θ_m Σ_{j=1}^{p} φ_jm x_ij
                   = Σ_{j=1}^{p} ( Σ_{m=1}^{M} θ_m φ_jm ) x_ij
                   = Σ_{j=1}^{p} β_j x_ij,

with β_j = Σ_{m=1}^{M} θ_m φ_jm.
Example 6.3 Consider the situation in which we have four observations for
a regression of five predictors (n < p):
Z1 = X1 + X2 ; Z2 = X3 + X4 ; Z3 = X5 ,
the coefficient relationship between the response and the original predictors
Y = β0 + β1 X1 + .... + β5 X5
becomes
Y = θ0 + θ1 Z1 + θ2 Z2 + θ3 Z3 ,
for the transformed data. In this way, when we estimate θ_i, i = 1, 2, 3,
in a setting where the number of predictors is now less than n, we can
indirectly estimate β_i, i = 1, ..., 5, using the corresponding linear
transformations.
As shown in the above example, the key is to find the optimal transforma-
tion of Zm , m = 1, ...., M that reduces the high dimensionality issue in linear
regression. To this end, we discuss two approaches, the principal component
transformation and the partial least squares regression.
with

λ_1 ≥ ... ≥ λ_p ≥ 0

denoting the non-negative eigenvalues (also known as principal values) of the
non-negative matrix X^T X, and the columns v_j of V denoting the corre-
sponding orthonormal eigenvectors.
Under this setting, X v_j and v_j denote, respectively, the j-th principal com-
ponent and the j-th principal direction (or PCA loading) corresponding to the
j-th largest principal value λ_j, for each j ∈ {1, ..., p}.
The following is a simple example to review the concept of principal com-
ponent.
Example 6.4 Let

X = [ 2  1 −3 ; 1 −2  6 ],

find the principal components of X.
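The principal values in Example 6.4 can be checked numerically. The sketch below uses the fact that the nonzero eigenvalues of X^T X coincide with those of the smaller 2 × 2 matrix X X^T:

```python
X = [[2, 1, -3], [1, -2, 6]]

# Gram matrix G = X X^T (2 x 2)
G = [[sum(X[i][k] * X[j][k] for k in range(3)) for j in range(2)] for i in range(2)]

# eigenvalues of a 2 x 2 matrix from its characteristic polynomial
tr = G[0][0] + G[1][1]
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
disc = (tr * tr - 4 * det) ** 0.5
lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2

# the principal values are 50 and 5, so the singular values are sqrt(50) and sqrt(5)
print(lam1, lam2)
```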
and
C0 (X) + C1 (X) + .... + CK (X) = 1.
Combining the piecewise-constant functions and the polynomial functions gives
the basis functions.
Similar to the above example, in general, the points where the coefficients
change are called knots for the basis functions. The concepts of knot and spline
are defined as follows.
One technique in fitting a smooth curve to a non-linear relationship be-
tween the predictor X and the response Y is the polynomial spline.
Consider a set of data (X_i, Y_i), i = 1, ..., n, in the range X_i ∈ [a, b]. For a
set of points ξ_j ∈ [a, b], j = 1, ..., m, the input range [a, b] can be partitioned
into

[a, b] = ∪_{j=1}^{m+1} [ξ_{j−1}, ξ_j],   with ξ_0 = a and ξ_{m+1} = b.

In this setting, ξ_j, j = 1, ..., m, are called the knots in the interval [a, b].
degree-3 polynomial, with continuity in the first and the second derivatives at
each knot.
4m + 4 − 3m = m + 4.
A natural cubic spline is a cubic spline that fits a linear function in each of the
boundary intervals [a, ξ_1] and [ξ_m, b]. As such, the total number of unknown
parameters reduces to m + 4 − 4 = m.
Example 6.5 Consider the case where m = 2, namely, we have the partition
of the x-range in the following way,
β11 + β21 ξ1 + β31 ξ12 + β41 ξ13 = β12 + β22 ξ1 + β32 ξ12 + β42 ξ13
β21 + 2β31 ξ1 + 3β41 ξ12 = β22 + 2β32 ξ1 + 3β42 ξ12
2β31 + 6β41 ξ1 = 2β32 + 6β42 ξ1
β12 + β22 ξ2 + β32 ξ22 + β42 ξ23 = β13 + β23 ξ2 + β33 ξ22 + β43 ξ23
β22 + 2β32 ξ2 + 3β42 ξ22 = β23 + 2β33 ξ2 + 3β43 ξ22
2β32 + 6β42 ξ2 = 2β33 + 6β43 ξ2
is called a natural cubic spline. A natural cubic spline fits a straight line
beyond the range of the knots [ξ_1, ξ_m] (at the beginning and the end of the
spline), that is,

β_31 = β_41 = 0,   β_{3,m+1} = β_{4,m+1} = 0.

This drops the number of free coefficients to m + 4 − 4 = m.
In conjunction with the smoothness conditions, the total number of free coef-
ficients drops to 4 × 3 − 2 × 3 − 4 = 2, which is the total number of knots in
this example.
Given a set of data, (x1 , y1 ), ..., (xn , yn ), when accounting for the closeness
and smoothness of the regression spline, f (x), the penalized residual sum of
squares
RSS(f, λ) = Σ_{i=1}^{n} ( y_i − f(x_i) )^2 + λ ∫ ( f''(t) )^2 dt.
The first term in RSS measures the closeness of the cubic spline f (x) and the
observed data. The second term penalizes curvature in the function, and λ
establishes the trade-off between the two, a fixed smoothing parameter usually
determined by cross-validation.
When λ = 0, the penalty disregards curvature, and the prediction f(x)
can be chosen to interpolate the data, making RSS(f, λ) = 0. When λ = ∞,
no curvature is tolerated: f''(x) = 0 forces a linear function f(x) = a + bx,
and RSS(f, λ) is minimized by the least squares linear fit.
The following theorem shows that the smoothest function interpolating a
set of data is the natural cubic spline.
Theorem 6.3 Given a set of data (x_i, y_i), i = 1, ..., n, x_i ∈ (a, b), among all
functions that interpolate all the data points, the natural cubic spline is the
smoothest curve connecting the points, when smoothness is measured by

μ(f) = ∫_a^b ( f''(x) )^2 dx.
Proof: Let g(x) be a natural cubic spline in (a, b) with knots xi , i = 1, ..., n.
Assume that G is the set of permissible functions that interpolate all the given
data points. For any f(x) ∈ G, we want to show that the smoothness measures
satisfy μ(f) ≥ μ(g).
Consider
t(x) = g(x) − f (x),
we have
f (x) = g(x) − t(x).
Taking the second derivative on both sides of the equation gives

f''(x) = g''(x) − t''(x).
Now,

∫_a^b g''(x) t''(x) dx = ∫_a^b g''(x) d(t'(x))
  = t'(x) g''(x) |_a^b − ∫_a^b t'(x) dg''(x)
  = t'(b) g''(b) − t'(a) g''(a) − ∫_a^b t'(x) g'''(x) dx.

Since g(x) is a natural cubic spline, g''(a) = g''(b) = 0, and g'''(x) equals a
constant c_i on each interval (x_i, x_{i+1}), so that

∫_a^b t'(x) g'''(x) dx = Σ_{i=0}^{n} ∫_{x_i}^{x_{i+1}} g'''(x) t'(x) dx
  = Σ_{i=0}^{n} c_i ∫_{x_i}^{x_{i+1}} t'(x) dx
  = Σ_{i=0}^{n} c_i ( t(x_{i+1}) − t(x_i) )
  = 0,

since both f(x) and g(x) interpolate (x_i, y_i) for i = 1, ..., n, giving
t(x_i) = 0 at every knot. We therefore have

μ(f) = μ(g) + μ(t),

which implies that

μ(g) ≤ μ(f).
where
R(g) = l11 p1 + l21 (1 − p1 ) + l12 (1 − p2 ) + l22 p2 .
We shall first define a loss function (hence the risk function) associated
with the general classification problem. With the proper definition, we may
discuss the optimal solution for minimizing the classification risk, followed by
the underlying assumptions upon which the classification algorithms are built.
Bayesian classification will be the first topic in the list. After addressing the
method of Bayesian classification, we shall also discuss the method of logistic
regression, which uses odds ratios to predict the likelihood of the occurrence
for a dichotomous outcome.
The classification problem in this chapter can be broadly viewed as an
optimization problem on estimation or prediction for the true but unknown
category of a given observation. In this regard, the inference approach es-
sentially finds the minimum risk prediction (MRP) based on the classification
criterion. We will also discuss scenarios where the loss function is changed from
the conventional 0-1 loss to any loss function according to different practical
situations.
L(g, ĝ) = 1 if ĝ ≠ g, and 0 otherwise.

Risk(Ĝ(X)) = Σ_{k=1}^{K} L[G_k, Ĝ(X)] Pr(G_k | X).
Under this setting, the minimum risk estimator of the unknown category
becomes

Ĝ(x) = argmin_{g∈G} Σ_{k=1}^{K} L(G_k, g) Pr(G_k | X = x),   (7.2)
since

Σ_{k=1}^{K} Pr(G_k | X = x) = 1.
This implies that the minimum risk classifier is the one that is associated with
the largest posterior probability.
where g(θ) is the prior distribution and f (θ|x) is the posterior distribution.
When the parameter θ takes two values Y = 0 and Y = 1, the following the-
orem converts the comparison on posterior probabilities to the corresponding
likelihood functions.
Theorem 7.1 For the likelihood functions f(x|Y = 0) and f(x|Y = 1), we
have

log [ P(Y = 1|X) / P(Y = 0|X) ] = log [ f(x|Y = 1) / f(x|Y = 0) ] + log [ P(Y = 1) / P(Y = 0) ].
Proof: To understand the above identity, consider any real set A ⊂ Rk where
k is the dimension of the features X. We have
P(Y = 1 | X ∈ A) = P(X ∈ A ∩ {Y = 1}) / P(X ∈ A)
                 = P(X ∈ A | Y = 1) P(Y = 1) / P(X ∈ A).

Writing the same expression for Y = 0, taking the ratio, and letting the set A
shrink to the point x, we obtain

log [ P(Y = 1|X = x) / P(Y = 0|X = x) ] = log [ f(x|Y = 1) / f(x|Y = 0) ] + log [ P(Y = 1) / P(Y = 0) ].
This proves Theorem 7.1, which converts the comparison on posterior
probabilities to the ratio of likelihood functions and prior probabilities as-
sociated with the two classes. For illustrative purposes, we now consider a
simple example for the case when the observation only has one dimension,
k = 1, given the above setting.
Example 7.1 Assume that f (x|Y = 1) and f (x|Y = 0) are two normal
densities with means μ1 , μ0 and common σ 2 , under the condition of equal
priors, we have
log [ f(x|Y = 1) / f(x|Y = 0) ]
  = log ( exp{ −[ (x − μ_1)^2 − (x − μ_0)^2 ] / (2σ^2) } )
  = x (μ_1 − μ_0) / σ^2 − (μ_1^2 − μ_0^2) / (2σ^2).

Thus,

log [ P(Y = 1|X = x) / P(Y = 0|X = x) ] = a x + b
for some constants a and b.
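A sketch of Example 7.1's linear log-odds, with illustrative parameter values (μ_0 = 0, μ_1 = 2, σ^2 = 1, and equal priors assumed):

```python
mu0, mu1, sigma2 = 0.0, 2.0, 1.0

# constants from the example: a = (mu1 - mu0)/sigma^2, b = -(mu1^2 - mu0^2)/(2 sigma^2)
a = (mu1 - mu0) / sigma2
b = -(mu1 ** 2 - mu0 ** 2) / (2 * sigma2)

def log_odds(x):
    return a * x + b

# with equal priors, the boundary log_odds(x) = 0 falls at the midpoint of the means
print(log_odds(1.0))
```

Points to the right of the midpoint have positive log-odds and are assigned to class Y = 1, and conversely on the left.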
Notice that

Pr(G = k | X = x) = f_k(x) π_k / Σ_{j=1}^{K} f_j(x) π_j.

When there are K possible outcomes, the minimum risk classifier assigns
an observation to the class for which δ_k is the largest, where

δ_k(x) = x^T Σ^{-1} μ_k − (1/2) μ_k^T Σ^{-1} μ_k + log π_k.
We describe two classification criteria δ and δk above. The difference be-
tween δk and δ can be summarized as follows.
δ : classification rule for two equal prior classes with one input variable
following normal likelihood with equal variances.
In the discussion above, we assume that the standard deviations are identi-
cal for different categories. Such an assumption is not always plausible. When
the two standard deviations are not the same, the linear discriminant func-
tion becomes invalid, and the corresponding minimum risk classifier is actually
the quadratic discriminant function for normal data. In this setting, the com-
parison of the posterior probabilities motivates us to find the classification
criterion δ(x).
δ(x) = log [ P(Y = 1|X = x) / P(Y = 0|X = x) ].
If δ(x) is positive, we classify the data to Y = 1; otherwise, the classifica-
tion result is Y = 0.
We shall use an example to explain the above setting. Consider the clas-
sification of two population means when the data follow two normal models
with two different standard deviations.
Notice that in this case, for i = 1, 2, we have the models

f(x | μ_i, σ_i) = ( 1 / ( (2π)^{1/2} σ_i ) ) exp( −(x − μ_i)^2 / (2σ_i^2) ).
When we have more than one predictor (where X is a vector) with unequal
covariance matrices from a multivariate normal model,

δ_k = log [ P(Y = 1|X = x) / P(Y = 0|X = x) ]

takes the following form,

δ_k(x) = −(1/2) (x − μ_k)^T Σ_k^{-1} (x − μ_k) − (1/2) log|Σ_k| + log π_k
       = −(1/2) x^T Σ_k^{-1} x + x^T Σ_k^{-1} μ_k − (1/2) μ_k^T Σ_k^{-1} μ_k − (1/2) log|Σ_k| + log π_k.
To further clarify the above discussion on the linear discriminant func-
tion and quadratic discriminant function regarding p input variables for an
output of k possible outcomes, we consider the following example of the pre-
diction/classification of diabetes patients.
Example 7.2 Consider the diagnosis of diabetes patients with four input
features: systolic blood pressure, fasting blood glucose level, BMI, and smok-
ing (nicotine intake). The diagnosis outputs include three possible categories:
healthy, pre-diabetes, and diabetes. For a new patient, we want to use fea-
tures of patient information (input variables) to diagnose the clinical outcome
(designate the patient into the right category).
> d1 <- t(x) %*% sigma %*% m1 - 0.5 * t(m1) %*% sigma %*% m1 + log(0.5)
> d1
         [,1]
[1,] 19299.35
> d2 <- t(x) %*% sigma %*% m2 - 0.5 * t(m2) %*% sigma %*% m2 + log(0.3)
> d2
         [,1]
[1,] 19304.4
> d3 <- t(x) %*% sigma %*% m3 - 0.5 * t(m3) %*% sigma %*% m3 + log(0.2)
> d3
         [,1]
[1,] 16648.32

FIGURE 7.1
R-codes for computation on diabetes classification risk.
Under the normality assumption, for each class, the likelihood function
takes the form:

f(x) = ( 1 / ( (2π)^{p/2} |Σ|^{1/2} ) ) exp{ −(1/2) (x − μ)^T Σ^{-1} (x − μ) }.
Further, assume that the mean vector for the healthy category is μ_1, for pre-
diabetes μ_2, and for diabetes μ_3, as follows:

μ_1 = (120, 85, 20, 0.04)^T,   μ_2 = (138, 110, 26, 1.2)^T,   μ_3 = (150, 160, 31, 1.7)^T.

We also assume that the correlation matrix of the four features is the same
for healthy, pre-diabetes, and diabetes patients, with

Σ^{-1} = [ 1 0.5 0 0 ; 0.5 1 0 0 ; 0 0 1 0 ; 0 0 0 1 ].
Assume that the disease distribution of the population reads 50% healthy, 30%
pre-diabetes, and 20% diabetes. For a patient with xT = (125, 100, 30, 1.1),
we can then use the linear discriminant classification criterion to diagnose
whether he is healthy, pre-diabetic, or has diabetes.
Since the correlation matrix is the same across the three categories, we have,
for k = 1, 2, 3,

δ_k = x^T Σ^{-1} μ_k − (1/2) μ_k^T Σ^{-1} μ_k + log π_k.
Thus, for any x we have the three classification criteria as follows.
δ_1 = x^T Σ^{-1} μ_1 − (1/2) μ_1^T Σ^{-1} μ_1 + log(0.5) = 19299.35,

δ_2 = x^T Σ^{-1} μ_2 − (1/2) μ_2^T Σ^{-1} μ_2 + log(0.3) = 19304.4,

δ_3 = x^T Σ^{-1} μ_3 − (1/2) μ_3^T Σ^{-1} μ_3 + log(0.2) = 16648.32.
Since the second class has the highest δ value, the new patient (who has the
four features (125, 100, 30, 1.1) representing systolic blood pressure, fasting
glucose level, BMI, and nicotine level) is classified into the pre-diabetes cat-
egory. This classification minimizes the possible misclassification risk for the
given data.
The R-codes for the computation in the above example can be found in
Figure 7.1.
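The same computation can also be reproduced in a few lines; the sketch below uses Python rather than the book's R, with the numbers taken from Example 7.2.

```python
import math

Sinv = [[1, 0.5, 0, 0], [0.5, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
mus = [[120, 85, 20, 0.04], [138, 110, 26, 1.2], [150, 160, 31, 1.7]]
priors = [0.5, 0.3, 0.2]
x = [125, 100, 30, 1.1]

def delta(x, mu, prior):
    # linear discriminant: x^T S^-1 mu - 0.5 mu^T S^-1 mu + log(prior)
    Sm = [sum(Sinv[i][j] * mu[j] for j in range(4)) for i in range(4)]
    return (sum(x[i] * Sm[i] for i in range(4))
            - 0.5 * sum(mu[i] * Sm[i] for i in range(4))
            + math.log(prior))

deltas = [delta(x, m, p) for m, p in zip(mus, priors)]
print([round(d, 2) for d in deltas])   # the largest value picks the class
```

The argmax here is the second class, reproducing the pre-diabetes diagnosis obtained in the example.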
We shall use another example to illustrate the quadratic discrimination
classification.
Σ_disease = [ 0.09 0.45 0 ; 0.45 25 0 ; 0 0 5.29 ],   Σ_healthy = [ 0.04 0 0 ; 0 9.61 0 ; 0 0 4 ].

In this setting,

Σ_disease^{-1} = [ 12.21 −0.2198 0 ; −0.2198 0.04396 0 ; 0 0 0.189 ],

Σ_healthy^{-1} = [ 25 0 0 ; 0 0.104 0 ; 0 0 0.25 ].
tension, diabetes, etc. The potential risk factors can be denoted as X1 , ...,
Xk . The following equation captures a linear relationship between the logit
function of the probability of lung cancer and its predictors,
log [ p / (1 − p) ] = α + β_1 X_1 + ... + β_k X_k.
For convenience, the data frame for a logistic regression model can be
briefly outlined as follows.
Example 7.4
Example 7.5 Assume that a fitted model for the relationship between the
occurrence of lung cancer, X1 (dusty work conditions), and X2 (smoking) is
quantified as
log [ p / (1 − p) ] = −0.4 + 0.3 X_1 + 1.6 X_2,
where
p = P (lung cancer|X1 , X2 ).
The estimated model coefficient 1.6 is usually interpreted as: Controlling for
work environment conditions, the odds ratio of lung cancer is e1.6 = 4.953 for
smoking patients. In other words, smoking increases the odds of lung cancer
almost five-fold when controlling for other risk factors.
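The interpretation of the coefficient can be verified directly from the fitted equation; a small sketch:

```python
import math

def log_odds(x1, x2):
    # fitted model from the example: log(p/(1-p)) = -0.4 + 0.3*X1 + 1.6*X2
    return -0.4 + 0.3 * x1 + 1.6 * x2

# odds ratio for smoking (X2: 1 vs 0) at a fixed work condition X1
ratio = math.exp(log_odds(1, 1)) / math.exp(log_odds(1, 0))
print(round(ratio, 3))   # e^1.6
```

Because the model is linear on the log-odds scale, the ratio is e^1.6 = 4.953 regardless of the value held fixed for X_1.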
FIGURE 7.2
Prostate cancer and feature selection in logistic regression
The selection of the factors significantly associated with the response vari-
able is critical in classification when using the logistic regression model. In-
cluding insignificant factors in the final model for classification may result in
misleading conclusions. One common approach to remedy this is to use the
AIC selection criterion. As shown in Figure 7.2, the occurrence of prostate
cancer relative to acid, stage, Xray, grade, and age is investigated. When all
General Loss Functions 179
the predictors are in the model, the AIC is 60.13. With the removal of each
predictor, the AIC changes. For instance, when the feature “grade” is re-
moved, the AIC for the new model is 59.097, which is lower than the AIC of
the complete model. Since removing grade results in the lowest AIC among all
the candidate models, for the first step, grade is removed and the AIC for the
updated model is calculated to be 59.1. The process continues until it finds
Xray, stage, and acid to be the model with the lowest AIC. At this point,
removing any one of the predictors increases the AIC for the updated model,
so the process stops.
Similar to the linear discriminant function and the quadratic discrimi-
nant function that we discussed in the preceding subsection, a key step in
the construction of the classification criterion using logistic regression is the
estimation and cross validation of the model coefficients using training data.
R_1(G) = E_X{ 0 · P(T = g_1|X) + (2/6) P(T = g_2|X) + (3/6) P(T = g_3|X) }
       = E_X{ t(X) − (1/6) P(T = g_1|X) },

where

t(X) = (1/6) P(T = g_1|X) + (2/6) P(T = g_2|X) + (3/6) P(T = g_3|X).

Similarly, for j = 2, 3, the risk associated with G = g_j reads

R_j(G) = E_X{ t(X) − (j/6) P(T = g_j|X) }.

Thus, one of the optimal solutions for the MRE of the unknown category
is

Ĝ = argmax_j { (j/6) P(T = g_j|X) }.
This implies that, unlike the optimal classification criterion for 0-1 loss
function, the optimal solution for non 0-1 loss is to classify the observation to
a category that has the highest posterior probability when weighted by the
associated loss.
The interpretation of the general principle is similar to the 0-1 loss scenario,
in the sense of maximizing the posterior probability, with the weights adjusted
according to the non 0-1 loss function.
R(G) = E_X [ Σ_{k=1}^{K} L(G_k, g) Pr(G_k | X = x) ],   (7.4)

Ĝ(x) = argmin_{g∈G} Σ_{k=1}^{K} L(G_k, g) Pr(G_k | X = x).   (7.5)
Clearly, Equation (7.5) is a sufficient but not necessary condition for the
optimal risk formulation in Equation (7.4). The condition that Equation (7.5)
is valid for all x guarantees the validity of Equation (7.4). Yet, the validity
of Equation (7.4) does not imply Equation (7.5). For instance, X > 0 implies
that E(X) ≥ 0, however, E(X) ≥ 0 does not imply X > 0.
Therefore, the conditional optimization (since Equation (7.5) conditioned
on the given data x) on the inferred risk is different from the universal opti-
mization in (7.4), which does not depend on observation X. Notice that the
loss function L(Gk , g) is not a function of the data x. Equation (7.4) can be
simplified as follows:
R(g) = E_X [ Σ_{k=1}^{K} L(G_k, g) Pr(G_k | X = x) ]
     = ∫ [ Σ_{k=1}^{K} L(G_k, g) Pr(G_k | X = x) ] dF(x)
     = Σ_{k=1}^{K} L(G_k, g) ∫ Pr(G_k | X = x) dF(x)   (7.6)
     = Σ_{k=1}^{K} L(G_k, g) Pr(G_k).
When the loss function is the 0-1 loss, the above equation (7.6) becomes

R(g) = 1 − Pr(G_true = g).

The MRE classification is thus the one that has the largest chance of being the
true but unknown category under the 0-1 loss function.
and find the procedure that produces the highest power (lowest probability
of type-II error), in classification, all four indexes are interpreted and selected
according to the most appropriate situation. For instance, when the true but
unknown category Gk is a serious disease, the cost of mis-classifying a diseased
patient as a healthy person may result in life-threatening situations. Hence,
controlling for the probability of true positive (or false negative) may be of
primary concern. On the other hand, when making the decision to operate
or not in an emergency room, a false positive error may result in sending a
healthy person to an operation room. In that situation, controlling the false
positive (hence true negative) may be of higher weight when examining the
results of the confusion matrix.
FIGURE 7.3
Prostate cancer, ROC curve, and treatment regimen
One classical approach in logistic regression for balancing the trade-off
between the probabilities of false positive and false negative is to construct an
ROC (receiver operating characteristic) curve. The ROC curve usually plots
the estimated sensitivity versus 1 − specificity (that is, the true positive rate
versus the false positive rate). The optimal point (optimal cut-off threshold)
is the one corresponding to the point on the ROC curve that has the shortest
distance
methods can be found in [15], [29], [30], and [93], among others. After describ-
ing the method of logistic regression, we focus on the method of Bayesian
discriminant analysis under the normality assumption on the joint distribu-
tion of errors. The Bayesian discriminant analysis involves two discriminant
functions. We address the difference between the linear discriminant function
and the quadratic discriminant function in classification. Such a description
sheds new light on the hidden conditions governing the application of discrimi-
nant analysis.
After elucidating methods related to the zero-one loss function, we extend
the discussion to applications with general loss functions, which highlights the
conditions to verify before performing data analysis. We address the difference
between local and universal optimization, and conclude the chapter with a
discussion on the selection of the dose level for optimal ROC classifiers. The
ROC criteria essentially cast a new light on the optimization measure, moving
beyond the mean prediction errors discussed in preceding chapters.
8
Support Vectors and Duality Theorem
Example 8.1 Consider a set of data for advertising costs (in thousand dol-
lars for TV advertising and Internet advertising) of 20 companies in a soft-
drink industry, along with their annual revenue status (positive or negative)
in the past year.
Positive:(20, 54); (30, 42); (28, 63); (42, 29); (38, 35); (31, 44); (29,
52); (62, 18); (32, 49); (53, 41)
Negative:(20, 18); (24, 22); (25, 13); (22, 16); (24, 18); (21, 24); (18,
32); (22, 16); (12, 28); (33, 12)
FIGURE 8.1
The hyperplane classifier separates companies with positive and negative rev-
enues associated with advertising input of 20 soft-drink companies.
a1 X1 + b1 X2 + c1 = 0,
a2 X1 + b2 X2 + c2 = 0.
Intuitively, the best classifier is the one that keeps the equal distance between
the positive and the negative lines. We call it the maximal margin classifier,
a3 X1 + b3 X2 + c3 = 0.
With this setting, all the data points (Xi1 , Xi2 , yi ) satisfy
8.1.1 Hyperplane
In a p-dimensional space, a hyperplane is a flat affine subspace of dimension
p−1. For instance, the line x+y = 2 in R2 for the xy-plane, where (x, y) ∈ R2 .
Alternatively, the plane x + 3y + 2z = 10 in R3 for (x, y, z) ∈ R3 . Or generally
β0 + β1 X1 + β2 X2 + ... + βp Xp = 0,
d = |v · n|
  = |(x1 − x0 , y1 − y0 , z1 − z0 ) · n|
  = |A(x1 − x0 ) + B(y1 − y0 ) + C(z1 − z0 )| / √(A² + B² + C²),

which, since the point (x0 , y0 , z0 ) lies on the plane so that Ax0 + By0 + Cz0 = −D, is

d = |Ax1 + By1 + Cz1 + D| / √(A² + B² + C²).

The above expression can be further simplified as

d = |β0 + β1 x1 + β2 y1 + β3 z1 |,

when the coefficients are normalized so that

β1² + β2² + β3² = 1.
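As a quick numerical check, the distance formula can be coded directly. The minimal Python sketch below (Python is used here purely for illustration) computes the distance from a point to a plane Ax + By + Cz + D = 0, using the plane x + 3y + 2z = 10 mentioned above:

```python
import math

def plane_distance(a, b, c, d, point):
    """Distance from point = (x1, y1, z1) to the plane a*x + b*y + c*z + d = 0."""
    x1, y1, z1 = point
    return abs(a * x1 + b * y1 + c * z1 + d) / math.sqrt(a**2 + b**2 + c**2)

# The plane x + 3y + 2z = 10, written as x + 3y + 2z - 10 = 0,
# and the origin as the test point:
d0 = plane_distance(1, 3, 2, -10, (0, 0, 0))
print(d0)  # 10 / sqrt(14), approximately 2.6726
```

Dividing the coefficients by √(A² + B² + C²) in advance yields the normalized β form, under which the distance is simply the absolute value of the linear expression.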
This directly leads to the maximal margin classifier: among all separating
hyperplanes, find the one that maximizes the distance to the two different
types of data points. In general, normalizing the coefficients so that

β1² + β2² + ... + βp² = 1,

and using the distance formula

d = |y · n| / ||n||,

we have the following definition for the concept of maximal margin classifier.
Definition 8.1 The maximal margin classifier is a hyperplane that keeps the
largest possible distance from the two classes of data. Namely, it is the hyper-
plane

β0 + β1 xi1 + β2 xi2 + ... + βp xip = 0

that satisfies the following conditions:

maximize M over β0 , β1 , ..., βp , M

subject to β1² + ... + βp² = 1,
yi (β0 + β1 xi1 + ... + βp xip ) ≥ M, for all i = 1, ..., n.
The maximal margin classifier for a set of completely separable data can
be found by iteratively substituting the current line with an updated line until
no further improvement is available. For convenience, we consider the simplest
case, where p = 1. Starting from any point, we can intuitively move the cutoff
point x until the two groups are distinct; the maximal margin classifier
is then the middle value between the boundary x value for the positive outcomes
and the boundary x value for the negative outcomes. Algebraically, to maximize the term
D(β, β0 ) = Σ_{i∈M} yi (xiT β + β0 ),

we use the gradients

∂D(β, β0 )/∂β = Σ_{i∈M} yi xi ,

∂D(β, β0 )/∂β0 = Σ_{i∈M} yi .

The final solution of the maximal margin classifier can be found by recur-
sively updating

(β, β0 ) ← (β, β0 ) + ρ (yi xi , yi ),

where

ρ = (xi − xi0 )/xi0 .
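The recursive update above is, in spirit, the classical perceptron rule. The following Python sketch (a simplified illustration with a constant step size ρ, rather than the data-dependent ρ given above, run on a separable subset of the Example 8.1 data) applies the update (β, β0) ← (β, β0) + ρ(yi xi, yi) on each misclassified point:

```python
def perceptron(points, labels, rho=1.0, max_epochs=100000):
    """Train a linear classifier sign(beta0 + beta1*x1 + beta2*x2) with the
    update (beta, beta0) <- (beta, beta0) + rho*(y_i*x_i, y_i) on mistakes."""
    beta = [0.0, 0.0]
    beta0 = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (beta0 + beta[0] * x1 + beta[1] * x2) <= 0:  # misclassified
                beta[0] += rho * y * x1
                beta[1] += rho * y * x2
                beta0 += rho * y
                mistakes += 1
        if mistakes == 0:  # converged: every point is on its correct side
            break
    return beta, beta0

# A separable subset of Example 8.1 (positive vs. negative revenue):
pts = [(20, 54), (30, 42), (28, 63), (20, 18), (24, 22), (25, 13)]
ys = [1, 1, 1, -1, -1, -1]
beta, beta0 = perceptron(pts, ys)
```

For separable data, the perceptron convergence theorem guarantees that the loop terminates with every point strictly on the correct side, although the resulting hyperplane is generally not the maximal margin one.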
The following example shows how to use R to find the maximal margin
classifier for a set of separable data. The problem is about the prediction of
post-thrombotic syndrome related to the percentage of thrombolysis and the
time after the first minor stroke symptoms of stroke patients. More background
information on thrombolysis and post-thrombotic syndrome can be found in,
for example, Chen and Comerota (2012 [24]).
x=matrix(c(20, 30, 28, 42, 38, 20, 24, 25, 22, 24,
           54, 42, 63, 29, 35, 18, 22, 13, 16, 18), 10, 2)
y=rep(c(-1, 1), c(5, 5))   # labels for the two groups of five points
par(mar=c(1, 1, 1, 1))
plot(x, col=y+3, pch=19)
dat=data.frame(x, y=as.factor(y))
library(e1071)
svmfit=svm(y~., data=dat, kernel="linear", cost=10, scale=FALSE)
dev.new(width=5, height=4)
plot(svmfit, dat)

FIGURE 8.2
The hyperplane classifier predicts patients with positive post-thrombotic syndrome based on the percentage of thrombolysis and the time after the first onset of minor stroke symptoms.

As shown in Figure 8.2, the three support vectors are (42, 29) and (30, 42) for patients with positive post-thrombotic syndrome, and (24, 22) for negative post-thrombotic syndrome. The line separating the two areas in the diagram is the maximal margin classifier. The margin for positive post-thrombotic syndrome is the line determined by the two support vectors (42, 29) and (30, 42), while the margin for negative post-thrombotic syndrome is the line with the same slope passing through the support vector (24, 22).
FIGURE 8.3
When data are not linearly separable between the two classes, the maximal margin classifier does not work.

In this case, we may either change the linear classifier to a non-linear classifier, or
consider an optimization in which some vectors are allowed to be misclassified.
The vectors on the margin and in the area of misclassification are the support
vectors in the classification process. This naturally extends the maximal margin
classifier to the method of the support vector classifier.
Definition 8.2 For datasets that are not completely separable, the classifier
that optimizes the following target function under the conditions for a given cost
C is defined as a support vector classifier:

maximize M over β0 , β1 , ..., βp , ε1 , ..., εn , M

subject to β1² + ... + βp² = 1,
yi (β0 + β1 xi1 + ... + βp xip ) ≥ M (1 − εi ),
εi ≥ 0, Σ_{i=1}^{n} εi ≤ C.
In the scenario where the feature vector X has a large dimension, the optimization
process in the above definition can be simplified by the duality theorem in
linear programming. The primal problem

maximize Σ_{j=1}^{n} cj xj
subject to Σ_{j=1}^{n} aij xj ≤ bi , i = 1, 2, ..., m,
xj ≥ 0, j = 1, 2, ..., n,

has a dual:

minimize Σ_{i=1}^{m} bi yi
subject to Σ_{i=1}^{m} aij yi ≥ cj , j = 1, 2, ..., n,
yi ≥ 0, i = 1, 2, ..., m.
Notice that the original problem optimizes over n variables while the dual
problem optimizes over m variables. When the dimension is large, such as
in facial recognition in AI, passing to the dual can change the dimension of the
optimization target dramatically.
We use the following example to illustrate the duality theorem.
Example 8.3 Consider the question to maximize the profit with fixed re-
sources,
maximize c1 x1 + c2 x2 + c3 x3
subject to a11 x1 + a12 x2 + a13 x3 ≤ b1
a21 x1 + a22 x2 + a23 x3 ≤ b2
x1 , x2 , x3 ≥ 0,
where cj = profit per unit of product j produced;
bi = units of raw material i on hand;
aij = units of raw material i required to produce 1 unit of product j.
The problem can be viewed from the angle of cost as follows. If we save
one unit of product j, then we free up:
a1j units of raw material 1 and
a2j units of raw material 2.
Selling these unused raw materials at the price of y1 and y2 dollars/unit,
respectively, yields a1j y1 + a2j y2 dollars, which is the corresponding cost.
Assume that we are only interested in whether this cost exceeds the lost profit on
each product j:

a1j y1 + a2j y2 ≥ cj , j = 1, 2, 3.

Producing as much product as possible to gain maximum profit is then equivalent to
efficiently minimizing the cost under the corresponding input constraints:
minimize b1 y1 + b2 y2
subject to a11 y1 + a21 y2 ≥ c1
a12 y1 + a22 y2 ≥ c2
a13 y1 + a23 y2 ≥ c3
y1 , y2 ≥ 0.
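The primal-dual relationship can be made concrete with a small Python sketch. The coefficients below are hypothetical (two raw materials, three products; the numbers are made up for illustration), and the check is the weak duality inequality: any feasible production plan x earns no more than the cost b′y of any feasible price vector y.

```python
# Hypothetical LP data: profits c, resources b, requirements a[i][j].
c = [3.0, 2.0, 4.0]            # profit per unit of product j
b = [10.0, 8.0]                # units of raw material i on hand
a = [[1.0, 2.0, 1.0],          # material 1 needed per unit of product j
     [2.0, 1.0, 3.0]]          # material 2 needed per unit of product j

def primal_feasible(x):
    return all(xj >= 0 for xj in x) and all(
        sum(a[i][j] * x[j] for j in range(3)) <= b[i] for i in range(2))

def dual_feasible(y):
    return all(yi >= 0 for yi in y) and all(
        sum(a[i][j] * y[i] for i in range(2)) >= c[j] for j in range(3))

def profit(x):
    return sum(cj * xj for cj, xj in zip(c, x))

def cost(y):
    return sum(bi * yi for bi, yi in zip(b, y))

# Weak duality: any feasible profit <= any feasible cost.
x = [2.0, 1.0, 1.0]            # a feasible plan: uses (5, 8) <= (10, 8)
y = [1.0, 1.5]                 # feasible prices: A'y = (4, 3.5, 5.5) >= c
assert primal_feasible(x) and dual_feasible(y)
assert profit(x) <= cost(y)    # 12 <= 22
```

At the respective optima the two objective values coincide (strong duality); the feasible pair above only brackets the common optimal value.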
With the duality theorem, the support vector classifier can be formulated
as follows.
min over β, β0 of { (1/2) ||β||² + C Σ_{i=1}^{N} ξi }

subject to ξi ≥ 0, yi (xiT β + β0 ) ≥ 1 − ξi ∀i,

with the Lagrange (primal) function

LP = (1/2) ||β||² + C Σ_{i=1}^{N} ξi − Σ_{i=1}^{N} αi [yi (xiT β + β0 ) − (1 − ξi )] − Σ_{i=1}^{N} μi ξi ,

whose stationarity conditions give

β = Σ_{i=1}^{N} αi yi xi ,

0 = Σ_{i=1}^{N} αi yi ,

αi = C − μi , ∀i.
In this setting, observations that lie directly on the margin, or on the
wrong side of the margin for their class, are defined as support vectors. These
observations affect the construction of the support vector classifier, while the
rest of the data do not contribute to the classifier.
In the above formulation, there is a principle on the trade-off between bias
and variance in the support vector machine. When the value of C (the budget in
Definition 8.2) is large, there is a high tolerance for observations being on the
wrong side of the margin, and therefore the margin will consequently be large.
This yields more support vectors, which leads to lower variance and higher bias.
On the other hand, when the value of C decreases, the tolerance for observations
being on the wrong side of the margin decreases, and consequently the margin
shrinks. This results in fewer observations violating the margin and consequently
fewer support vectors. With fewer support vectors, the support vector classifier
has higher variance and lower bias.
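The trade-off can be seen numerically. The Python sketch below uses hypothetical two-cluster data and a plain subgradient descent on the penalized objective ½||β||² + c Σᵢ max(0, 1 − yᵢ(xᵢᵀβ + β₀)); note that the penalty weight c here plays the inverse role of the budget C above, so a small penalty tolerates many margin violations and yields many support vectors.

```python
def fit_soft_margin(points, labels, penalty, lr=0.005, epochs=20000):
    """Subgradient descent on 0.5*||w||^2 + penalty * sum of hinge losses."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        gw = [w[0], w[1]]  # gradient of the 0.5*||w||^2 term
        gb = 0.0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:  # hinge is active
                gw[0] -= penalty * y * x1
                gw[1] -= penalty * y * x2
                gb -= penalty * y
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        b -= lr * gb
    return w, b

def support_vectors(points, labels, w, b, tol=1e-6):
    """Observations on the margin or on the wrong side of it."""
    return [p for p, y in zip(points, labels)
            if y * (w[0] * p[0] + w[1] * p[1] + b) <= 1 + tol]

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (2.5, 2.0),   # class -1 (last overlaps)
       (2, 2), (3, 2), (2, 3), (3, 3), (0.5, 1.0)]   # class +1 (last overlaps)
ys = [-1, -1, -1, -1, -1, 1, 1, 1, 1, 1]
w_lo, b_lo = fit_soft_margin(pts, ys, penalty=0.001)  # weak penalty: wide margin
w_hi, b_hi = fit_soft_margin(pts, ys, penalty=5.0)    # strong penalty: narrow margin
n_lo = len(support_vectors(pts, ys, w_lo, b_lo))
n_hi = len(support_vectors(pts, ys, w_hi, b_hi))
```

With the weak penalty, every observation ends up inside the margin (all points become support vectors); with the strong penalty, only the points near the boundary remain support vectors.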
We now apply the above techniques to analyze the example in Figure 8.3.
Example 8.4 Since the two data groups are not linearly separable, there is
no solution for the setting of maximal margin classifier. We seek the support
vector classifier with the cost being set to 10.
min over β, β0 of (1/2) ||β||² + C Σ_{i=1}^{N} ξi

subject to ξi ≥ 0, yi (xiT β + β0 ) ≥ 1 − ξi ∀i.
The 7 cross points in Figure 8.4 are the support vectors.
library(e1071)
set.seed(1)
x=matrix(rnorm(20*2), ncol=2)
y=c(rep(-1, 10), rep(1, 10))
x[y==1,]=x[y==1,]+1
plot(x, col=(3-y))
dat=data.frame(x=x, y=as.factor(y))
svmfit=svm(y~., data=dat, kernel="linear", cost=10,
scale=FALSE)
plot(svmfit, dat)
FIGURE 8.4
When data are not completely separable between the two classes, the support
vector classifier works with a cost.
maximize M over β0 , β11 , β12 , ..., βp1 , βp2 , ε1 , ..., εn , M

subject to yi (β0 + Σ_{j=1}^{p} βj1 xij + Σ_{j=1}^{p} βj2 xij² ) ≥ M (1 − εi ),

εi ≥ 0, Σ_{i=1}^{n} εi ≤ C, Σ_{j=1}^{p} Σ_{k=1}^{2} βjk² = 1.
where U = {1 ≤ i1 < ... < ir ≤ n, 1 ≤ j1 < ... < ju ≤ m}, for any two sets
of events A1 , ..., An and B1 , ..., Bm in an arbitrary probability space (Ω, F, P).
The use of the duality theorem for optimization solutions in prediction essentially
stems from the same root as optimization in linear programming. We
briefly introduce the roadmap of the perturbation method in this section. More
details on the use of the perturbation method with applications can be found
in the book by Chen (2014) [22].
For any two sets of events {Ai , i = 1, ..., n} and {Bj , j = 1, ..., m}, let
v1 and v2 be the number of occurrences of the two event sets, respectively.
Denote pij = P (v1 = i, v2 = j). For any integers 1 ≤ t ≤ n and 1 ≤ k ≤ m,
consider a set of consistent bivariate Bonferroni summations Sij , i = 1, ..., t,
j = 1, ..., k.
Similar to the optimization process in the construction of support vector
machine (SVM), an optimal upper bound for P (v1 ≥ 1, v2 ≥ 1) is defined by
the maximum value of the following linear programming problem:
Now for any two sets of events characterized by p in any probability space,

P(v1 ≥ 1, v2 ≥ 1) = Σ_{i=1}^{n} Σ_{j=1}^{m} pij ≤ U*,

because of (8.1). Thus the feasible optimal solution in (8.1) leads to an upper
bound on the probability of the joint event

P(v1 ≥ 1, v2 ≥ 1).
b = (1, t(1, 1), t(1, 2), t(2, 1), t(2, 2))′ x, (8.3)

where 1 is the vector of length 2^{n+m} with all elements equal to 1, and t(i, j)
is the vector specified below for i, j = 1, 2; stacking these rows gives
a (5 × 2^{n+m}) matrix R with structure not affected by the values of the Sij 's. We
have

b = Rx. (8.5)
where the row vector g_{kt} is the vector of coefficients specified in (8.6). Combine
the row vectors g_{k,t} into a matrix G (with the first row being 1′) for the
quantities S11 , S12 , S21 , S22 .
Therefore, putting b = (1, S11 , S12 , S21 , S22 )′ and w = (n + 1)(m + 1), there
exists a 5 × w matrix G so that

b = Gq, (8.7)

where the first row of the matrix G is 1′, the first (m + 1)th column of G
is (1, 0, 0, 0, 0)′, and the structure of G is not affected by the values of the
bivariate Bonferroni summations Si,j 's.
Denote the vector c: c = (c(i, j)), with the first element of c corresponding
to the index {i = 0 or j = 0}, and the rest of the elements of c formed by
ranking over i ≥ 1 in increasing order for each fixed j, then over j ≥ 1 in
increasing order. Also assign

c(i, j) = 0 for i = 0 or j = 0, and c(i, j) = 1 otherwise.
The joint probability of at least one occurrence in both event sets can be
expressed as
P (v1 ≥ 1, v2 ≥ 1) = c p.
With the setting above, we have the following theorems.
Theorem 8.1 For the matrix G and vectors c and b as specified in the above setting,
denote the vector w = (w0 , w1 , w2 , w3 , w4 )′, which may depend on m, n, but
not on the values of S11 , S12 , S21 , S22 under consideration. w′b is an upper
bound for P(v1 ≥ 1, v2 ≥ 1) for all probability spaces if and only if w′G ≥ c′
(each element of the vector w′G is not less than the corresponding element in
the vector c′).
The following theorem explores the existence of the feasible optimal solu-
tion from the angle of probability theory, instead of linear programming.
Theorem 8.2 If the Bonferroni summations S11 , S12 , S21 , and S22 are con-
sistent, the linear programming upper bound for P (v1 ≥ 1, v2 ≥ 1) always
exists.
Consider again a vector

w = (w0 , w1 , w2 , w3 , w4 )′,

which may depend on m, n, but not on the values of S11 , S12 , S21 , S22 .
Theorem 8.3 The value w′b is a lower bound for P(v1 ≥ 1, v2 ≥ 1) for all
probability spaces if and only if w′A ≤ c′ (each element of the vector w′A is
not greater than the corresponding element in the coefficient vector c′).
Theorem 8.4 For a set of consistent Bonferroni summations S11 , S12 , S21 ,
and S22 , the linear programming lower bound for P (v1 ≥ 1, v2 ≥ 1) always
exists.
Now that the existence is proved above, the following describes a perturbation
method to find the bivariate optimal lower bound. If the condition
in Theorem 8.3 is satisfied, the optimal lower bound is found. However, it is
not always true that xB = B⁻¹b ≥ 0. When xB = B⁻¹b ≥ 0 fails, we need
to find an alternative approach to reach the linear programming bound. This
is technically more involved in linear programming. In the following, we provide
details on the alternative approach (an iteration algorithm) and show
that the algorithm can theoretically reach the existence condition after finitely many
iterations.
c1 (ε) = ε,
ci (ε) = ε^{k(i)} for some k(i) > 0 depending on i > 1, (8.9)
ci (ε) ≠ cj (ε) for i ≠ j.

For example, the function of ε can take any one of the following forms,

ci (ε) = ε^i for i ≠ 1, c1 (ε) = ε,

or

cB (ε)′ B⁻¹ ak ≥ ck (ε), k ≠ 1; and cB (ε)′ B⁻¹ a1 = c1 (ε), (8.10)
D⁻¹ = B⁻¹ − ( y1jk sr / yrjk , ..., y5jk sr / yrjk )′ + ( 0, ..., 0, sr / yrjk , 0, ..., 0 )′,

where sr / yrjk appears in the r-th position.
Let n be the total number of iterations before reaching the optimal solution.
If θ(ε) < 0 and condition (8.10) persists for each basis matrix B in the
iteration process, then n < ∞.
Proof: Put zrb = sr b. We have

D⁻¹ b = B⁻¹ b − ( y1jk zrb / yrjk , ..., y5jk zrb / yrjk )′ + ( 0, ..., 0, zrb / yrjk , 0, ..., 0 )′,

where zrb / yrjk appears in the r-th position.
Thus,
This means that after a finite number of iterations (with the first column of
B never being changed), there exists a matrix B and an associated θ() such
that for all 0 < < 0 (B).
Summarizing the above discussion, we have the following conclusions.
Chen (2014)[22] also shows the following results to ensure the smoothing
operation in the iterative process.
Theorem 8.5 The condition θ(ε) < 0 holds in each iteration, which means that degeneracy
does not occur in the iteration process. In the iteration process, when
we sequentially reach a matrix B such that B⁻¹b ≥ 0, an optimal solution is
found and the process is stopped. If the associated B⁻¹b ≥ 0 fails, condition (8.10)
persists for every B used in each iteration, with a corresponding ε0(B) > 0.
TABLE 9.1
Systolic blood pressure dataset
Age Gender Systolic BP
20 M 112
17 M 102
19 F 138
15 F 142
40 M 164
53 M 158
51 F 153
42 F 167
Using linear regression, the fitted model based on the data in Table
9.1 reads

Systolic BP = 106.5 + 1.106 ∗ age + ε.
The above linear regression line is easy to interpret. It implies that, on average,
the SBP increases 1.106 units per year of age increase. This reflects the possible
effect of bulging veins (when people get older, the valves in the veins may wear
out and result in improper blood flow in the extremities back to the heart).
However, it is not necessary that the change of SBP is based on per unit
I(X, Ri ) = 1 if X ∈ Ri , and I(X, Ri ) = 0 if X ∉ Ri .
(1/n) Σ_{i=1}^{n} (yi − ŷi )²
is the homogeneity measure for the closeness between the observed and the
predicted outcomes. However, when the response is categorical, the value of
(white − green)2 has no meaning for comparisons. Even when we label the
categories with numbers such as 1 for red, 2 for green and 3 for white, the
outcome
(1 − 2)2 < (1 − 3)2
does not represent the color differences among red, green, and white. In this
case, we need to use other homogeneity measures such as the Gini index or entropy
described in Section 9.2.
The selection of homogeneity measure is dictated by the type of response
variable Y . In what follows in this chapter, we shall redirect our discussion
to regression trees and classification trees, separately. The chapter will be
concluded by a discussion on range regression, an extension of binary splitting
in regression trees into multiple splitting.
Example 9.2 Consider the following simple data set for the construction of
a regression tree.
In the first step, possible splitting points for the binary splitting of x are the midpoints

(1.5 + 2.6)/2 = 2.05, (2.6 + 5.1)/2 = 3.85, (5.1 + 9.2)/2 = 7.15,

and the corresponding RSS reads

s2:  2.05  3.85
RSS: 40.5  0.5

Thus, the second binary split is s2 = 3.85.
Using R packages, the constructed tree can be seen in Figure 9.1.
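The split search itself is mechanical: evaluate the RSS of every midpoint candidate and keep the smallest. The Python sketch below illustrates this; only the x values 1.5, 2.6, 5.1, 9.2 come from the example above, while the y responses are hypothetical, since the book's data table is not reproduced here.

```python
def rss(values):
    """Residual sum of squares around the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_binary_split(x, y):
    """Try midpoints of consecutive sorted x values; return (split, total RSS)."""
    pairs = sorted(zip(x, y))
    candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2 for i in range(len(pairs) - 1)]
    best = None
    for s in candidates:
        left = [yi for xi, yi in pairs if xi < s]
        right = [yi for xi, yi in pairs if xi >= s]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (s, total)
    return best

x = [1.5, 2.6, 5.1, 9.2]     # x values from Example 9.2
y = [2.0, 3.0, 11.0, 12.0]   # hypothetical responses, for illustration only
split, total = best_binary_split(x, y)
print(split)  # 3.85, the same candidate selected in the example
```

With these hypothetical responses the split at 3.85 wins because it places the two low responses and the two high responses in separate, nearly constant groups.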
It should be noted that for each partition on the feature space, when the
distribution family of the sample mean of y is a complete distribution family,
the UMVUE (uniformly minimum variance unbiased estimator) of the mean
response at each terminal node of the regression tree is
ĉi = ȳi = (1/|Ri |) Σ_{xj ∈ Ri} yj ,
which is the average of all responses that share the corresponding features
x ∈ Ri .
By the consistency of UMVUE and the law of large numbers, when the
number of observations in each terminal node is large enough, the sample
mean can always approximate the population mean. However, in practice, we
do not always have a large sample size in each terminal node in regression
trees. It should be noted that with limited sample sizes, the sample mean is
not always the UMVUE of the mean parameter, as shown in the following
example.
FIGURE 9.1
Hypothetical data for regression tree construction
μ̂ = θ̂/2 = ((n + 1)/(2n)) X(n) ,

which is not the sample mean X̄.
where ci ∈ {1, ..., k} is the proxy for the common value of observations x with
features in the partition Ri .
Ej = 1 − max_{k∈{1,...,K}} (p̂jk ).
Example 9.4 When SBP (systolic blood pressure) is used to diagnose DVT
(deep vein thrombosis), assume that features of 20 patients under the study
can be described according to the following table.
Out of the total of 20 patients, when 120 is selected as the threshold in the
binary splitting of SBP, we have patients with SBP < 120 as the group j = 1,
and patients with SBP ≥ 120 as the group j = 2. Out of the 12 patients with
SBP < 120, there are 5 with DVT (k = 1) and 7 without DVT (k = 2). Thus
we have, for j = 1,
p̂11 = 5/12, p̂12 = 7/12,

and

E1 = 1 − max(p̂11 , p̂12 ) = 1 − max(5/12, 7/12) = 5/12.

So, Region-1 is labeled as no-DVT with misclassification rate 5/12.
As for Region-2, j = 2, we have

p̂21 = 6/8, p̂22 = 2/8,

and

E2 = 1 − max(6/8, 2/8) = 2/8.

So, Region-2 is for patients with DVT and the misclassification rate is 2/8.
Therefore, the overall misclassification rate for the binary split at SBP = 120
reads

E = (5/12) ∗ (12/20) + (2/8) ∗ (8/20) = 7/20.
In fact, when we use the majority vote at each terminal node to determine
the common class for the node, the misclassification rate is simply the sum of
the misclassified proportions in the two regions of the binary splitting.
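This bookkeeping generalizes directly: given a table of class counts per region, the weighted misclassification rate under majority vote is a short computation. A minimal Python sketch reproducing the 7/20 above:

```python
def misclassification_rate(region_counts):
    """region_counts[j][k] = number of observations of class k in region j.
    Majority vote labels each region; everything else is misclassified."""
    total = sum(sum(region) for region in region_counts)
    wrong = sum(sum(region) - max(region) for region in region_counts)
    return wrong / total

# SBP < 120: 5 with DVT, 7 without; SBP >= 120: 6 with DVT, 2 without.
E = misclassification_rate([[5, 7], [6, 2]])
print(E)  # 0.35, i.e. 7/20
```

The function works for any number of regions and classes, since under majority vote the misclassified count in a region is simply its size minus its largest class count.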
Ĝ = Σ_{k=1}^{K} p̂jk (1 − p̂jk ).
For the nj observations in the terminal node, denote Xjk = 1 with probability
pjk in the jth region for the observations that belong to Category k. We have
G(j) = Σ_{k=1}^{K} pjk (1 − pjk ) = Σ_{k=1}^{K} Var(Xjk ).
Thus, for each splitting in the region j, the Gini index G(j) is essentially the
sum of variance of each category in the region. In general, the Gini index for
a value in a binary splitting reads
Gini(s) = (n1 /n) G(1) + (n2 /n) G(2),
where n1 denotes the number of observations satisfying Xj < s, and n2 denotes
the number of observations satisfying Xj ≥ s.
The Gini index takes on a small value if all of the class proportions p̂jk are
close to zero or one. It is a measure of node purity: a small value
indicates that a node contains predominantly observations from a single class.
Example 9.5 Using the SBP-DVT classification in Example 9.4, the Gini
index in splitting SBP = 120 reads
When j = 1,

Gini(1) = (5/12) ∗ (1 − 5/12) + (7/12) ∗ (1 − 7/12) = 2 ∗ (5/12) ∗ (7/12);

when j = 2,

Gini(2) = (6/8) ∗ (1 − 6/8) + (2/8) ∗ (1 − 2/8) = 2 ∗ (6/8) ∗ (2/8),

and the Gini index at SBP = 120 reads

Gini(120) = 2 ∗ (5/12) ∗ (7/12) ∗ (12/20) + 2 ∗ (6/8) ∗ (2/8) ∗ (8/20) = 53/120.
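The same per-region bookkeeping gives the Gini index of a split. A minimal Python sketch reproducing 53/120:

```python
def gini(region):
    """Gini impurity of one region from its class counts."""
    n = sum(region)
    return sum((c / n) * (1 - c / n) for c in region)

def split_gini(region_counts):
    """Size-weighted Gini impurity over the regions of a split."""
    total = sum(sum(region) for region in region_counts)
    return sum(sum(region) / total * gini(region) for region in region_counts)

# Same SBP-DVT counts as in Example 9.5:
g = split_gini([[5, 7], [6, 2]])
print(g)  # approximately 0.4417, i.e. 53/120
```

For two classes, gini reduces to 2p(1 − p) with p the proportion of either class, matching the 2 ∗ (5/12) ∗ (7/12) form used above.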
9.2.3 Entropy
For the previous measures of homogeneity in classification, the misclassifi-
cation rate uses the percentage of node homogeneity. A small value of the
misclassification rate indicates the predominant observations from a single
class. The Gini index, on the other hand, focuses on the node purity via the
variance of each class in the region. A small value of Gini index implies more
observations in the node are from a single class. Besides these two measures
of homogeneity, another commonly applied homogeneity measure for binary
response is the measure of entropy.
Entropy is a concept used in information theory, where a higher probability
of the occurrence of an event implies less information obtained when the event
occurs; and a lower probability implies more information obtained when the
event occurs. It is a useful concept in coding and decoding a signal process.
The form −plog(p) stems from the fact that the function
f (p) = −c ∗ loga (p)
for constants a and c is the only function satisfying the following three condi-
tions in information theory.
1 f (x) is a monotonically decreasing function of x;
2 f (x) is continuous in x;
3 f (p1 × p2 ) = f (p1 ) + f (p2 ), which implies that information of two inde-
pendent events is the sum of the individual information of each event.
On the basis of the above definition, when I(x) = −log(P(X = x)) is used
for the self-information of the element x, the entropy of a discrete random
variable X with probability mass function pX is defined as the expected value
of the information belonging to each basic element,

Entropy(X) = E(I(X)) = Σx (−log P(X = x)) P(X = x).
For each binary splitting, denote p̂jk the proportion of training observa-
tions in the jth region that are from the kth class, k ∈ {1, 2, ..., K}, we have
the entropy,
D̂ = −Σ_{k=1}^{K} p̂jk log[p̂jk ].
Since each proportion p̂jk is between 0 and 1, the entropy in a binary split
is non-negative. It takes a value near zero if the proportions are all near
zero or near one.
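A minimal Python sketch of the entropy measure (using the natural logarithm; the convention 0 · log 0 = 0 is handled explicitly):

```python
import math

def entropy(proportions):
    """Entropy -sum(p * log p) of a class-proportion vector, with 0*log(0) = 0."""
    return sum(-p * math.log(p) for p in proportions if p > 0)

# A pure node has entropy 0; a 50/50 node attains the maximum log(2).
print(entropy([1.0, 0.0]))   # 0.0
print(entropy([0.5, 0.5]))   # log(2), approximately 0.6931
```

Any base of logarithm may be used (base 2 gives bits); changing the base only rescales the measure by a constant, which does not affect which split is selected.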
It should be noted that the three different measures of node homogeneity
assume different meanings when splitting in the construction of the classifica-
tion tree, which should be integrated into the interpretation.
We use the following example to comprehensively explain the three mea-
sures of node homogeneity in the construction of classification trees.
Alternatively, for x < 10, j = 1, k ∈ {A, B}, p1A = 3/4 and p1B = 1/4, so the overall misclassification rate is

(1/4) ∗ (400/800) + (1/4) ∗ (400/800) = 1/4.
At x = 20, we have:
when x < 20, the corresponding outcomes include A: 400 and B: 200; the majority vote classifies the node as Category-A.
when x ≥ 20, the corresponding outcomes include A: none and B: 200; the majority vote classifies the node as Category-B.
The overall misclassification rate for splitting over x = 20 reads
(200 (misclassification of B as A) + 0 (misclassification of A as B))/800 = 0.25.
Alternatively, for x < 20, j = 1, k ∈ {A, B}, p1A = 4/6 and p1B = 2/6,

E1 = 1 − max(2/3, 1/3) = 1/3;

for x ≥ 20, j = 2, k ∈ {A, B}, p2A = 0 and p2B = 200/200 = 1,

E2 = 1 − max(0, 1) = 0,

and the overall misclassification rate becomes

(3/4) ∗ (1/3) + (1/4) ∗ 0 = 1/4.
There is a tie in the homogeneity measure with misclassification errors at 10
and 20. In this case, the tie can be broken by flipping a fair coin.
when x < 20, the corresponding outcomes include A - 400 and B -200, the
majority vote classifies the node as Category-A. p̂1A = 400/600.
when x ≥ 20, the corresponding outcomes include A - none and B -200,
the majority vote classifies the node as Category-B. p̂2B = 200/200 = 1.
For x < 20, j = 1, the Gini index is

G(1) = (2/3) ∗ (1 − 2/3) + (1/3) ∗ (1 − 1/3) = 4/9.

For x ≥ 20, j = 2, the Gini index is

G(2) = (0/200) ∗ (1 − 0/200) + (200/200) ∗ (1 − 200/200) = 0.

The overall Gini index at splitting x = 20 reads

Overall Gini = (4/9) ∗ (600/800) + 0 ∗ (200/800) = 1/3.
Since 3/8 > 1/3, and a lower overall Gini index indicates a higher node purity, the
splitting point at x = 20 is selected.
For each region j, it should be noted that the estimate in the above equation
is not the uniformly minimum variance unbiased estimator (UMVUE). To see
this point, notice the following for the sample proportion p̂ with n observations in the
node.
Proof: Consider the random variable T = Σ_{i=1}^{nj} Xijk in each node. Since
Xijk is a Bernoulli random variable with probability pjk , T follows a binomial
model, which is a complete distribution family. Furthermore, T is also a sufficient
statistic for pjk , and T /nj is the UMVUE of pjk . What
we need to find is the UMVUE of pjk².
For any two random variables in the node, notice that
Ĝ(j) = Σ_{k=1}^{K} ( Σ_{i=1}^{nj} Xijk / nj ) ( 1 − ( Σ_{i=1}^{nj} Xijk − 1 ) / (nj − 1) ).
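The finite-sample correction can be checked numerically. The Python sketch below compares the plug-in Gini estimate with the unbiased version Ĝ(j), using as an illustration the 5-vs-7 node from Example 9.4:

```python
def gini_plugin(counts):
    """Plug-in Gini estimate from class counts in one node."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

def gini_unbiased(counts):
    """UMVUE-style estimate: replaces p^2 by its unbiased estimator
    (T/n) * ((T-1)/(n-1)) for a binomial count T out of n."""
    n = sum(counts)
    return sum((c / n) * (1 - (c - 1) / (n - 1)) for c in counts)

counts = [5, 7]                # DVT vs. no-DVT in the SBP < 120 node
print(gini_plugin(counts))     # 70/144, approximately 0.4861
print(gini_unbiased(counts))   # 35/66, approximately 0.5303
```

The corrected estimate is larger than the plug-in one, since (T − 1)/(n − 1) < T/n whenever T < n; the two estimates agree in the large-sample limit.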
fˆ(X) = Σ_{i=1}^{k} ci I(X, Ri ),
where ci is the common mean outcome for x with features in the partition Ri .
FIGURE 9.2
Post-thrombotic syndrome and linear regression (CEAP score versus residual thrombus)

The diagram indicates that two random sources affect the outcome variable. It can be formulated as
Yi = α + βXi + εi + ηi ,

where Y denotes the CEAP score and X the residual thrombus. εi and ηi denote
the two different sources of random fluctuation for patient i. It should be
noted that εi is for the normal fluctuation of the patients around the
mean CEAP scores, while ηi represents all the random sources associated
with each range of residual thrombus. Obviously, the diagram shows that
the change of the outcome variable (CEAP score) cannot be adequately
explained by residual thrombus alone.
Figure 9.2 indicates that at the same level of residual thrombus, some
patients have higher CEAP scores while others have relatively lower CEAP
scores. The trend between the two variables of interest (CEAP scores and
residual thrombus) is not observable.
FIGURE 9.3
Post-thrombotic syndrome and range regression (mean CEAP score versus the middle range of female residual thrombus)

To focus on subject variability for patients with similar amounts of clot
lysed, Figure 9.3 shows that, through range regression, stratifying patients
with similar amounts of clot lysed into one group identifies a measure that
bundles subject variability within each stratum. The linear pattern emerges.
Similar to a regression tree, a range regression can be interpreted as follows,
fˆ(X) = Σ_{i=1}^{k} ci I(X, Ri ).
Y = −0.646 + 5.09X + ε,
for the female patients in the data set. Since the sample mean is an asymptot-
ically unbiased estimator of the population mean, range regression essentially
models the conditional expected value of CEAP scores for each fixed range of
residual thrombus as a linear trend of the residual thrombus.
Here the effect of the random source η is accounted for by ranging the residual thrombus
and averaging out the output variable within each range of residual thrombus.
As shown in Figure 9.3, the method of range regression successfully reveals
the association between the mean response of CEAP scores and the 10%
ranges of residual thrombus.
At this point, we shall show a theoretical result associated with an asymp-
totic distribution governing the method of range regressions. It can also be
analogically applied to analyze the error term in each terminal node for re-
gression trees.
For any random vector (xi yi ) , i = 0, 1, . . . , n, consider the scenario
where xi falls into one of the k discernible categories T1 , . . . , Tk measured by
values s1 , . . . , sk . For each xi , there exists a set Tj such that xi ∈ Tj , where
Tj is represented by sj ,
sj = (1/#(Tj )) Σ_{xi ∈ Tj} xi ,
Zj = α1 + β1 Sj + εj , j = 1, ..., k,
Proof: Consider fj (t) the moment generating function of the centralized re-
sponses Yj1 − μj , . . . , Yjnj − μj . Notice that
Since the first two moments exist, using the Taylor expansion, we have

fj (t) = 1 − (1/2) σj² t² + O(t²). (9.2)
2
The above equation implies that the moment generating function of the
standardized range mean variable
nj
(Yjk − μj )
Ȳj = √ ,
nj × σ j
k=1
reads
1 2 nj
t
= fj √ , by independence
nj × σ j
/ $ % $ %0nj
1 2 t2 t2
= 1 − σj +O
2 nj σj2 nj σj2
/ $ %0 nj
1 t2 t2
= 1− +O
2 nj nj σj2
/ $ %0,nj
1 t2 t2
= 1− + nj O .
nj 2 nj σj2
Now

O( t²/(nj σj²) ) = fj⁽³⁾(θ) ( t³/(6 nj^{3/2} σj³) ), θ ∈ [0, εj ].

Notice that fj⁽³⁾(θ) is bounded since it is a continuous function on a closed
interval. We have, for each t,

nj O( t²/(nj σj²) ) = fj⁽³⁾(θ) × t³/(6 √nj σj³) −→ 0 as nj −→ ∞.
Therefore,

fnj (t) −→ e^{−t²/2} ,

since

lim_{n→∞} (1 − a/n)^n = e^{−a} ,

and

Zj = (σj /√nj ) Ȳj + μj .
Now, denote z ∈ Rk ; the k-dimensional vector z can be expressed as

z = μ + ε, μ = (μ1 , . . . , μk )′ , ε ∼ Nk (0, Σ),

where Σ is a k × k matrix with σ1²/n1 , . . . , σk²/nk on the diagonal and 0 off
the diagonal.
If xji = sj , i = 1, ..., nj , and suppose there is a linear relationship between
μj and sj , say, μj = a + bsj , then
Zj = a + bSj + εj , j = 1, . . . , k,
sj −→ E(Xji ) = ηj as nj → ∞,
since each range has one representative value xi for i = 1, ..., k. Now notice
that

Σ_{j=1}^{mi} (yij − ȳ) = mi (ȳi − ȳ),

we have

R̂1² = [ Σ_{i=1}^{k} (xi − x̄) Σ_{j=1}^{mi} (yij − ȳ) ]² / ( [ Σ_{i=1}^{k} mi (xi − x̄)² ] [ Σ_{i=1}^{k} Σ_{j=1}^{mi} (yij − ȳ)² ] )

= [ Σ_{i=1}^{k} (xi − x̄) mi (ȳi − ȳ) ]² / ( [ Σ_{i=1}^{k} mi (xi − x̄)² ] [ Σ_{i=1}^{k} Σ_{j=1}^{mi} (yij − ȳ)² ] ),

where

ȳi = (1/mi ) Σ_{j=1}^{mi} yij .
Σ_{i=1}^{k} Σ_{j=1}^{m} (yij − ȳ)² = Σ_{i=1}^{k} Σ_{j=1}^{m} (yij − ȳi + ȳi − ȳ)²

= Σ_{i=1}^{k} [ Σ_{j=1}^{m} (yij − ȳi )² + m (ȳi − ȳ)² ]

= Σ_{i=1}^{k} Σ_{j=1}^{m} (yij − ȳi )² + m Σ_{i=1}^{k} (ȳi − ȳ)²

> m Σ_{i=1}^{k} (ȳi − ȳ)².
Now consider the transformed data (xi , ȳi ), where i = 1, . . . , k and ȳi is
defined as the same as before. Define the overall average across the k ranges,
x̄* = (1/k) Σ_{i=1}^{k} xi ,

ȳ* = (1/k) Σ_{i=1}^{k} ȳi = (1/k) Σ_{i=1}^{k} [ (1/mi ) Σ_{j=1}^{mi} yij ].
The sample correlation coefficient based on the transformed data can be calculated as

R̂2² = [ Σ_{i=1}^{k} (xi − x̄*)(ȳi − ȳ*) ]² / ( [ Σ_{i=1}^{k} (xi − x̄*)² ] [ Σ_{i=1}^{k} (ȳi − ȳ*)² ] ). (9.4)
Notice that the data are balanced, that is, mi = m, for all i = 1, . . . , k,
then
x̄ = (1/ Σ_{i=1}^{k} mi ) Σ_{i=1}^{k} Σ_{j=1}^{mi} xij

= (1/ Σ_{i=1}^{k} mi ) Σ_{i=1}^{k} [mi xi ]

= m Σ_{i=1}^{k} xi / (km)

= x̄*,
and similarly,
ȳ = (1/ Σ_{i=1}^{k} mi ) Σ_{i=1}^{k} Σ_{j=1}^{mi} yij

= (1/(mk)) Σ_{i=1}^{k} [mi ȳi ]

= (1/k) Σ_{i=1}^{k} ȳi

= ȳ*.
Notice that the only difference between the above equation and Equation (9.4)
is the term Σ_{i=1}^{k} (ȳi − ȳ*)² in Equation (9.4). Since m ≥ 1 and

Σ_{i=1}^{k} Σ_{j=1}^{m} (yij − ȳ)² ≥ Σ_{i=1}^{k} (ȳi − ȳ*)²,

we have

R̂2² ≥ R̂1².
This completes the proof of the theorem.
variance unbiased estimator of the population Gini index. Although, when the
sample size is large, the conventional plug-in estimate converges
to the true Gini index by the consistency property, we do not always
have an infinite amount of data for slowly converging predictions. This necessitates
the use of the best estimate of the Gini index when measuring the homogeneity of
observations in each terminal node. The interpretation of entropy in the construction
of classification trees is also addressed with examples in this chapter.
Further information on trees and bagging can be found in papers such as [6],
[7], [130], among others.
Range regression uses parallel splitting on ranges to replace binary splitting
in regression trees. The asymptotic distribution of the range regression model
bridges the data-driven camp with the model-based camp. It unifies the two
data science cultures via distribution convergences of response variable sample
mean.
10
Unsupervised Learning and Optimization
The previous chapters discuss data analytic issues on input features (such as
predictors) relating to output features (such as the response variable), where
each observation has a response. For instance, in the analysis of clinical fac-
tors related to systolic blood pressure, the response variable is the reading
of patients’ systolic blood pressure; in the classification of up or down mar-
ket trend in the coming time period, the response variable is either “bull
market” or “bear market”. The model learned from the training data has
a response variable intended to “supervise” the learning process by using a
MSE criterion or the total probability of correct classification. However, in
some data analytic problems, the response to “supervise” the learning process
might not even exist. For example, in business analysis, clustering of consumer
preferences helps structure the design of marketing strategies of advertising
campaigns. In clinical trials and epidemiology, grouping patient symptoms
helps diagnosis and prevention in public interventions. In geology, grouping
on element characteristics of rock samples helps identify the main characteristics
of the environment in which they were found. The common theme among the
above-mentioned applications is the lack of a response variable, due to the absence
of knowledge in the experiment stage. In this chapter, we will focus on two
main methods: K-means clustering and the method of principal component
analysis. To briefly summarize, K-means clustering and principal component
analysis are two optimization approaches in grouping a set of data.
In other words, a partition of a set A divides A into finitely many mutually
exclusive subsets.
$$\hat{C}_i = \operatorname*{argmin}_{C_1,\ldots,C_K}\Big\{\sum_{k=1}^{K} W(C_k)\Big\}, \tag{10.1}$$

$$W(C_k) = \frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij}-x_{i'j})^2. \tag{10.2}$$
Example 10.2 Assume that the four points (11, 12, 13), (21, 22, 23),
(31, 32, 33), and (41, 42, 43) are to be grouped into two clusters. Since p = 3,
for C_1 = {1, 2} and C_2 = {3, 4}, find the corresponding risk for optimization.
$$\begin{aligned}
\sum_{k=1}^{K}&\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij}-x_{i'j})^2\\
&=\frac{1}{2}\big[(11-21)^2+(12-22)^2+(13-23)^2+(21-11)^2+(22-12)^2+(23-13)^2\big]\\
&\quad+\frac{1}{2}\big[(31-41)^2+(32-42)^2+(33-43)^2+(41-31)^2+(42-32)^2+(43-33)^2\big].
\end{aligned}$$
For the partition C1 = {1}, C2 = {2, 3, 4}, the risk for optimization is
$$\begin{aligned}
\frac{1}{3}\big[&(21-31)^2+(22-32)^2+(23-33)^2+(21-41)^2+(22-42)^2+(23-43)^2\\
&+(31-21)^2+(32-22)^2+(33-23)^2+(41-21)^2+(42-22)^2+(43-23)^2\\
&+(41-31)^2+(42-32)^2+(43-33)^2\big].
\end{aligned}$$
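The two risks above can be verified numerically. The following sketch is not from the book; it is a minimal plain-Python illustration of the within-cluster variation in (10.2), with the four points indexed from 0.

```python
def within_risk(points, clusters):
    """Total within-cluster variation: for each cluster C_k, the sum of
    squared distances over all ordered pairs (i, i') in C_k, divided by |C_k|."""
    p = len(points[0])
    total = 0.0
    for c in clusters:
        pair_sum = sum((points[i][j] - points[ip][j]) ** 2
                       for i in c for ip in c for j in range(p))
        total += pair_sum / len(c)
    return total

pts = [(11, 12, 13), (21, 22, 23), (31, 32, 33), (41, 42, 43)]
print(within_risk(pts, [[0, 1], [2, 3]]))   # partition C1={1,2}, C2={3,4}: 600.0
print(within_risk(pts, [[0], [1, 2, 3]]))   # partition C1={1}, C2={2,3,4}: 1200.0
```

The first partition yields the smaller risk, so it is the better of the two groupings.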
Theorem 10.1 The mean of all point-wise squared distances within a cluster
is twice the sum of all squared distances of the points to the centroid of the
cluster.
Proof: Notice that the mean of all point-wise squared distances within a
cluster reads

$$\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij}-x_{i'j})^2,$$
$$\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij}-x_{i'j})^2 = 2\sum_{j=1}^{p}\sum_{i\in C_k}(x_{ij}-\bar{x}_{kj})^2. \tag{10.4}$$
$$\begin{aligned}
\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij}-x_{i'j})^2
&=\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij}-\bar{x}_{kj}+\bar{x}_{kj}-x_{i'j})^2\\
&=\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}\big[(x_{ij}-\bar{x}_{kj})^2-2(x_{ij}-\bar{x}_{kj})(x_{i'j}-\bar{x}_{kj})+(x_{i'j}-\bar{x}_{kj})^2\big]\\
&=\frac{|C_k|}{|C_k|}\sum_{i\in C_k}\sum_{j=1}^{p}(x_{ij}-\bar{x}_{kj})^2+\frac{|C_k|}{|C_k|}\sum_{i'\in C_k}\sum_{j=1}^{p}(x_{i'j}-\bar{x}_{kj})^2\\
&\qquad-\frac{2}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij}-\bar{x}_{kj})(x_{i'j}-\bar{x}_{kj})\\
&=2\sum_{i\in C_k}\sum_{j=1}^{p}(x_{ij}-\bar{x}_{kj})^2,
\end{aligned}$$

since

$$\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij}-\bar{x}_{kj})(x_{i'j}-\bar{x}_{kj})=\sum_{j=1}^{p}\Big(\sum_{i\in C_k}(x_{ij}-\bar{x}_{kj})\Big)\Big(\sum_{i'\in C_k}(x_{i'j}-\bar{x}_{kj})\Big)=0,$$

and

$$\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij}-\bar{x}_{kj})^2=\frac{1}{|C_k|}\sum_{j=1}^{p}\sum_{i\in C_k}\sum_{i'\in C_k}(x_{ij}-\bar{x}_{kj})^2=\frac{|C_k|}{|C_k|}\sum_{i\in C_k}\sum_{j=1}^{p}(x_{ij}-\bar{x}_{kj})^2.$$

Similarly,

$$\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{i'j}-\bar{x}_{kj})^2=\frac{|C_k|}{|C_k|}\sum_{i'\in C_k}\sum_{j=1}^{p}(x_{i'j}-\bar{x}_{kj})^2.$$
K-means Clustering 233
The following example illustrates the main idea in the proof of Theorem 10.1.
Example 10.3 Assume that the four points for clustering are (11, 12, 13),
(21, 22, 23), (31, 32, 33), and (41, 42, 43) and the partition is C1 = {1, 2}, C2 =
{3, 4}. For the cluster C1 , we have
$$\begin{aligned}
\frac{1}{|C_1|}\sum_{i,i'\in C_1}\sum_{j=1}^{p}(x_{ij}-x_{i'j})^2
&=\frac{1}{2}\big[(11-21)^2+(12-22)^2+(13-23)^2+(21-11)^2+(22-12)^2+(23-13)^2\big]\\
&=\frac{1}{2}\big[(11-16+16-21)^2+(12-17+17-22)^2+(13-18+18-23)^2\\
&\qquad+(21-16+16-11)^2+(22-17+17-12)^2+(23-18+18-13)^2\big]\\
&=(11-16)^2+(16-21)^2+(12-17)^2+(17-22)^2+(13-18)^2+(18-23)^2\\
&\qquad+(11-16)^2+(16-21)^2+(12-17)^2+(17-22)^2+(13-18)^2+(18-23)^2\\
&=2\big[(11-16)^2+(16-21)^2+(12-17)^2+(17-22)^2+(13-18)^2+(18-23)^2\big]\\
&=2\big[(11-16)^2+(12-17)^2+(13-18)^2+(21-16)^2+(22-17)^2+(23-18)^2\big]\\
&=2\sum_{i\in C_1}\sum_{j=1}^{p}(x_{ij}-\bar{x}_{1j})^2.
\end{aligned}$$
Theorem 10.2 Denote f(Δ) the total within-cluster variation of a partition Δ,
and

$$\hat{\Delta}=\operatorname*{argmin}_{\Delta}f(\Delta).$$

Following the K-means clustering algorithm, the minimum risk estimator Δ̂
exists and can be reached after finitely many iterations of the algorithm.
Proof: At the first step, if f (Δ1 ) is the smallest among all possible partitions,
by Theorem 10.1, we have
$$\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij}-x_{i'j})^2=2\sum_{j=1}^{p}\sum_{i\in C_k}(x_{ij}-\bar{x}_{kj})^2.$$
This means that the smallest possible risk is achieved at f(Δ_1); no point
can be switched to a different cluster to form a different partition and gain
any improvement in the overall within-cluster variation. Under this scenario,
the MRE (minimum risk estimator) is achieved.
At any step t, t ≥ 1, if f (Δt ) is not the smallest value, the right-hand side of
Theorem 10.1 can be improved by rearranging the points around the centroid.
According to the K-means Algorithm, grouping each point to its closest cen-
troid yields a new partition Δt+1 . By Theorem 10.1, the total within-cluster
variation for Δt+1 is strictly less than the one for Δt ,
$$f(\Delta_1)>f(\Delta_2)>\cdots>f(\Delta_m)>f(\Delta_{m+1}). \tag{10.7}$$
Since the distances among the finitely many points are fixed, as the number
of iterations m increases, the total within-cluster variation approaches the
minimum value,
$$f(\Delta^{*})=\sum_{k=1}^{K}\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij}-x_{i'j})^2,$$
Example 10.4 Assume that the four points are (11, 12), (20, 21), (40, 41),
and (51, 52), K = 2, and the initial partition Δ_1 consists of C_1 = {(20, 21)}
and C_2 = {(11, 12), (40, 41), (51, 52)}. We have

$$\begin{aligned}
f(\Delta_1)&=\frac{1}{3}\big[(11-51)^2+(12-52)^2+(11-40)^2+(12-41)^2\\
&\qquad+(40-11)^2+(41-12)^2+(40-51)^2+(41-52)^2\\
&\qquad+(51-11)^2+(52-12)^2+(51-40)^2+(52-41)^2\big]\\
&=2\big[(11-34)^2+(12-35)^2+(40-34)^2+(41-35)^2+(51-34)^2+(52-35)^2\big]\\
&=3416.
\end{aligned}$$
The point (11, 12) is closer to the centroid of the first cluster, (20, 21), than
to the centroid of the second cluster, (34, 35). By rearranging the partition as
Δ_2 with C_1 = {(11, 12), (20, 21)} and C_2 = {(40, 41), (51, 52)}, we have
$$f(\Delta_2)=\frac{2}{2}\big[(11-20)^2+(12-21)^2+(51-40)^2+(52-41)^2\big]=404.$$
Obviously,
$$f(\Delta_1)>f(\Delta_2).$$
No 1 2 3 4 5 6 7
x1 1 10 1.2 2 12 9 1.4
x2 2 8 1.3 2.4 7 9 2
Assume that the initial random assignment sets the partition {C1 , C2 } as
Find the optimal clusters that minimize the total within-cluster variation.
$$\bar{x}_{21}=10.5,\qquad \bar{x}_{22}=\frac{1}{3}(7+9+2)=6.$$
Thus, we have the distances of the observations to the two centroids.
No 1 2 3 4 5 6 7
dcentroid−1 3.38 7.49 3.57 2.31 8.70 7.44 3.02
dcentroid−2 10.31 2.06 10.42 9.23 1.80 3.35 9.94
updated cluster 1 2 1 1 2 2 1
No 1 2 3 4 5 6 7
dcentroid−1∗ 0.41 10.53 0.66 0.77 11.75 10.38 0.08
dcentroid−2∗ 11.10 0.33 11.33 10.04 1.94 1.67 10.76
updated cluster 1 2 1 1 2 2 1
Since the observations in the updated partition are identical to the observa-
tions in the partition before checking the updated point-centroid distances, no
rearrangement is needed and the optimal clusters are {1, 3, 4, 7} and {2, 5, 6}.
The plots of the observations and the two steps in the clustering algorithm
can be found in Figure 10.1.
K-means Clustering 237
FIGURE 10.1
Numerical illustration of K-means clustering algorithm
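The two-step clustering illustrated in Figure 10.1 can be sketched in code. The Python below is a minimal illustration of the K-means algorithm, not the book's implementation; the initial partition {1, 2, 3} versus {4, 5, 6, 7} is a hypothetical starting point (the book's random initial assignment is not reproduced here), and the algorithm still converges to the optimal clusters {1, 3, 4, 7} and {2, 5, 6}.

```python
import math

def kmeans(points, clusters, max_iter=100):
    """K-means (Lloyd's) algorithm: recompute centroids, reassign each
    observation to its nearest centroid, and stop when nothing changes."""
    p = len(points[0])
    for _ in range(max_iter):
        centroids = [tuple(sum(points[i][j] for i in c) / len(c) for j in range(p))
                     for c in clusters]
        updated = [[] for _ in clusters]
        for i, x in enumerate(points):
            dists = [math.dist(x, cen) for cen in centroids]
            updated[dists.index(min(dists))].append(i)
        if updated == clusters:      # no rearrangement needed: optimum reached
            return clusters
        clusters = updated
    return clusters

obs = [(1, 2), (10, 8), (1.2, 1.3), (2, 2.4), (12, 7), (9, 9), (1.4, 2)]
# hypothetical initial partition (the book's random start is not shown here)
final = kmeans(obs, [[0, 1, 2], [3, 4, 5, 6]])
print(final)   # [[0, 2, 3, 6], [1, 4, 5]] -> observations {1,3,4,7} and {2,5,6}
```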
$$\hat{C}_i=\operatorname*{argmin}_{C_1,\ldots,C_K}\Big\{\sum_{k=1}^{K}\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}w_j(x_{ij}-x_{i'j})^2\Big\}, \tag{10.8}$$
where $|C_k|$ denotes the number of observations in the set $C_k$. Notice that
the clustering problem in (10.8) differs from the squared Euclidean clustering
in (10.1) by inserting weights $w_j$, for $j = 1, \ldots, p$, where

$$0<w_j<1,\qquad \sum_{j=1}^{p}w_j=1.$$
Since the K-means clustering algorithm in the preceding section is derived un-
der the assumption that the "closeness" is measured by the squared Euclidean
distance, it is inappropriate to apply the algorithm carelessly without checking
the plausibility of this assumption.
Similar to the previous subsection, we should explore the possibility of
transferring the point-wise measurement to point-centroid measurement for
the weighted squared Euclidean distance. In this regard, we have the following
theorem.
Theorem 10.3
$$\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}w_j(x_{ij}-x_{i'j})^2=2\sum_{j=1}^{p}\sum_{i\in C_k}w_j(x_{ij}-\bar{x}_{kj})^2. \tag{10.9}$$
Proof: Similar to the proof of Theorem 10.1, the key steps in the proof of
Theorem 10.3 begin with the decomposition of the left-hand side as follows,
by noticing that the weighting on the components does not affect the operation
on the summation of the centroid in each cluster.
$$\begin{aligned}
\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}w_j(x_{ij}-x_{i'j})^2
&=\frac{1}{|C_k|}\sum_{j=1}^{p}w_j\sum_{i,i'\in C_k}(x_{ij}-\bar{x}_{kj}+\bar{x}_{kj}-x_{i'j})^2\\
&=\frac{1}{|C_k|}\sum_{j=1}^{p}w_j\sum_{i,i'\in C_k}\big[(x_{ij}-\bar{x}_{kj})^2-2(x_{ij}-\bar{x}_{kj})(x_{i'j}-\bar{x}_{kj})+(x_{i'j}-\bar{x}_{kj})^2\big]\\
&=\frac{|C_k|}{|C_k|}\sum_{j=1}^{p}w_j\sum_{i\in C_k}(x_{ij}-\bar{x}_{kj})^2+\frac{|C_k|}{|C_k|}\sum_{j=1}^{p}w_j\sum_{i'\in C_k}(x_{i'j}-\bar{x}_{kj})^2\\
&\qquad-\frac{2}{|C_k|}\sum_{j=1}^{p}w_j\sum_{i,i'\in C_k}(x_{ij}-\bar{x}_{kj})(x_{i'j}-\bar{x}_{kj})\\
&=2\sum_{j=1}^{p}w_j\sum_{i\in C_k}(x_{ij}-\bar{x}_{kj})^2\\
&=2\sum_{i\in C_k}\sum_{j=1}^{p}w_j(x_{ij}-\bar{x}_{kj})^2.
\end{aligned}$$
With Theorem 10.3, the point-wise cluster variation with weighted squared
Euclidean distances can be converted to the cluster variation from the points
to the centroid of the cluster. In fact, as shown in the above proof, as long as
the measurement is linear in the componentwise terms indexed by j, results
similar to Theorem 10.3 can be proved in the same way.
The validation of Theorem 10.3, in conjunction with Theorem 10.2, con-
sequently leads to the following non-Euclidean clustering algorithm.
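Before applying the algorithm, the converting identity (10.9) can be checked numerically. The sketch below is a plain-Python illustration, not from the book; the weight vector is a hypothetical choice satisfying 0 < w_j < 1 and Σ w_j = 1.

```python
def weighted_pairwise(points, cluster, w):
    """Left-hand side of (10.9): weighted pairwise variation within one cluster."""
    p = len(w)
    s = sum(w[j] * (points[i][j] - points[ip][j]) ** 2
            for i in cluster for ip in cluster for j in range(p))
    return s / len(cluster)

def weighted_to_centroid(points, cluster, w):
    """Right-hand side of (10.9): twice the weighted variation around the centroid."""
    p = len(w)
    cent = [sum(points[i][j] for i in cluster) / len(cluster) for j in range(p)]
    return 2 * sum(w[j] * (points[i][j] - cent[j]) ** 2
                   for i in cluster for j in range(p))

pts = [(11, 12, 13), (21, 22, 23), (31, 32, 33), (41, 42, 43)]
w = [0.5, 0.3, 0.2]   # hypothetical weights summing to 1
for cluster in ([0, 1], [1, 2, 3]):
    assert abs(weighted_pairwise(pts, cluster, w)
               - weighted_to_centroid(pts, cluster, w)) < 1e-9
```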
xT = (x1 , x2 , x3 , . . . , x9 , x10 ),
where
x1 : systolic blood pressure
x2 : total blood cholesterol level
x3 : dusty working environment
x4 : residential location (city, rural area)
x5 : transportation (car, train, bus, walk)
x6 : career type
x7 : annual income
x8 : medical insurance
x9 : heart attack/stroke history
x10 : financial investment
We want to reduce the dimension of 10 features into several representative
variables to improve the efficiency of data analysis.
Solution: Assume that the eigenvectors corresponding to the largest two
eigenvalues are,
Example 10.7 Suppose the random variables X_1, X_2 and X_3 have the covariance matrix

$$\Sigma=\begin{pmatrix}21&32&17\\32&54&18\\17&18&35\end{pmatrix}.$$
Find the eigenvalues and eigenvectors of the population covariance matrix.
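A sketch of the computation with NumPy (the book's sessions use R; this is an illustrative alternative, not the book's code):

```python
import numpy as np

# population covariance matrix from Example 10.7
Sigma = np.array([[21.0, 32.0, 17.0],
                  [32.0, 54.0, 18.0],
                  [17.0, 18.0, 35.0]])

# eigh is the appropriate routine for symmetric matrices;
# reorder so the eigenvalues are descending
values, vectors = np.linalg.eigh(Sigma)
order = np.argsort(values)[::-1]
values, vectors = values[order], vectors[:, order]

# the j-th population principal component is the linear
# combination of X with coefficient vector vectors[:, j]
```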
Evidently the above example shows how to get the principal components
from the covariance matrix. In practice, the population covariance matrix is
unknown. When the population covariance matrix is estimated by the sample
covariance matrix, the corresponding (sample) principal components are the
linear combination of the data with the corresponding eigenvectors of the
sample covariance matrix. We use the following example to illustrate this
point.
The principal components obtained in this way are actually the sample
principal components (depending on the data), not the population principal
components (which do not depend on the data by definition).
Another common point of confusion is whether to perform principal com-
ponent analysis on standardized or non-standardized data. One common prac-
tice is to standardize the data (shifting to the center by subtracting the sample
mean, and dividing by the sample standard deviation). The advantage of stan-
dardization is that it makes the principal components invariant under location
and scale transformations. However, it should be noted that performing prin-
cipal component analysis on standardized data is tantamount to performing
principal component analysis on the sample correlation matrix.
Theorem 10.4 Consider a set of data xij , i = 1, ..., k and j = 1, ..., n for n
observations of k features of a population. Denote
$$\bar{x}_i=\frac{1}{n}\sum_{j}x_{ij},$$

and

$$s_{ij}=\frac{1}{n-1}\sum_{k=1}^{n}(x_{ik}-\bar{x}_i)(x_{jk}-\bar{x}_j).$$

Then

$$S_z=R_x. \tag{10.10}$$

That is, the sample covariance matrix of the standardized data is the correlation
matrix of the original data.
Proof: Denote the data matrix of the standardized data by Z. Notice that
for the standardized data, the sample mean of each component reads
$$\bar{z}=\frac{1}{n}(\mathbf{1}'Z)'=\frac{1}{n}Z'\mathbf{1}=\mathbf{0},$$

since, for each component i,

$$\frac{1}{n}\sum_{j=1}^{n}\frac{x_{ji}-\bar{x}_i}{\sqrt{s_{ii}}}=\frac{1}{n\sqrt{s_{ii}}}\Big(\sum_{j=1}^{n}x_{ji}-n\bar{x}_i\Big)=0.$$
Now, by the definition, the sample covariance matrix of the standardized data
is
$$S_z=\Big(\frac{1}{n-1}\sum_{k=1}^{n}(z_{ki}-\bar{z}_i)(z_{kj}-\bar{z}_j)\Big)=\frac{1}{n-1}Z'Z.$$
$$\frac{1}{n-1}Z'Z=\Big(\frac{1}{n-1}\cdot\frac{(n-1)s_{ij}}{\sqrt{s_{ii}}\sqrt{s_{jj}}}\Big)=R_x.$$
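Theorem 10.4 can also be checked numerically. The following NumPy sketch (not from the book) uses simulated, hypothetical data; the sample covariance of the standardized data reproduces the sample correlation of the original data.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3))
X[:, 1] += 0.5 * X[:, 0]           # induce some correlation between columns

# standardize: center by the sample mean, scale by the sample standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

S_z = np.cov(Z, rowvar=False)       # sample covariance of the standardized data
R_x = np.corrcoef(X, rowvar=False)  # sample correlation of the original data
assert np.allclose(S_z, R_x)        # S_z = R_x, as in (10.10)
```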
stock1 = (6, 8, 9, 3, 8, 7, 9, 8, 2)
> j=cor(data)
> eigen(h)
eigen() decomposition
$`values`
[1] 82.9223567 6.3569839 0.4706594
$vectors
[,1] [,2] [,3]
[1,] 0.04416809 0.99848413 0.03284249
[2,] -0.99888288 0.04469073 -0.01535322
[3,] -0.01679770 -0.03212768 0.99934261
> eigen(j)
eigen() decomposition
$`values`
[1] 1.3345071 0.8794526 0.7860403
$vectors
[,1] [,2] [,3]
[1,] 0.5150182 -0.8562234 -0.04046871
[2,] -0.6093253 -0.3324862 -0.71984420
[3,] -0.6028922 -0.3953915 0.69295498
FIGURE 10.2
Sample covariance matrix vs sample correlation matrix
11
Simultaneous Learning and Multiplicity
This chapter discusses a common situation in data science when two or more
populations are involved in the learning process. When we have only one
data population for prediction, the error rates (false positive, false negative)
are clearly defined. However, when two or more sets of data are involved in
the analysis process, since each path of the analysis generates error rates,
controlling inference error in one population (one statement) does not control
the overall error rates. In fact, the overall error rates accumulate as the number
of data paths increases. This chapter thus discusses methods applicable to
adjust the multiplicity in multi-path statistical learning, the scenario where
two or more sources of datasets are involved in the prediction process.
One special case of multiple-path learning is the case where the number
of observations increases (instead of being fixed) in the learning process. To
this end, we discuss the method of sequential analysis, where the two error
rates (false positive and false negative) are combined to determine the required
sample size, and the data arrive in different phases.
Another approach to handling multi-resources learning simultaneously is
the methodology of multiple comparisons. We shall focus on recent devel-
opments on simultaneous confidence segments for dose-response curves, and
weighted hypotheses with high dimensional data in this chapter.
Materials in this chapter essentially synthesize some recent publications in
simultaneous inference, including Ma et al. (2023) [82], Yu et al. (2022) [128],
Chen (2016) [23], and Kerns and Chen (2017) [76].
248 Simultaneous Learning and Multiplicity
Reject the null hypothesis (or equivalently, accept the alternative hy-
pothesis) if the data satisfy (Λn |μ = μ0 ) > A: clear data evidence for
H1 .
Reject the alternative hypothesis (or equivalently, accept the null hy-
pothesis) if the data satisfy (Λn |μ = μ1 ) < B: clear data evidence for
H0 .
Continue sampling (without making any conclusion on accepting or re-
jecting the null hypothesis) if the data satisfy A ≥ (Λn |μ = μ0 ) and
(Λn |μ = μ1 ) ≥ B.
Solution: If μ = 6, the data contain higher likelihood for the alternative hy-
pothesis. If μ = 2, on the other hand, the data contain higher likelihood to
support the null hypothesis. Thus, for the SPRT, denote f(x|μ) the density of
the underlying model, and

$$R_n=\frac{\prod_{i=1}^{n}f(x_i\mid\mu=6)}{\prod_{i=1}^{n}f(x_i\mid\mu=2)},$$
Sequential Data 249
$$\begin{aligned}
R_n&=\exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\big[(x_i-\mu_1)^2-(x_i-\mu_0)^2\big]\Big\}\\
&=\exp\Big\{\frac{1}{2\sigma^2}\big[2n\bar{X}(\mu_1-\mu_0)+n(\mu_0^2-\mu_1^2)\big]\Big\}.
\end{aligned}$$
Thus,
Rn > A
is equivalent to
X̄ > c
for some constant c. The test statistic is Λn = X̄, and the rejection condition
is tantamount to

$$\frac{1}{\sigma}(\bar{X}-\mu_0)\sqrt{n}>Z_{1-\alpha}.$$
σ
Similarly, the Wald’s sequential probability ratio test for rejecting the alter-
native hypothesis, in this simple setting, becomes
$$\frac{1}{\sigma}(\bar{X}-\mu_1)\sqrt{n}<Z_{\beta}.$$
σ
Summarizing the above discussion on the Wald’s sequential probability
ratio test, we have the following decision criteria for the inference problem
discussed in Example 11.1
Reject the null hypothesis (or equivalently, accept the alternative hy-
pothesis) if the data satisfy the condition

$$\frac{1}{\sigma}(\bar{X}-\mu_0)\sqrt{n}>Z_{1-\alpha};$$

Reject the alternative hypothesis (or equivalently, accept the null hy-
pothesis) if the data satisfy the condition

$$\frac{1}{\sigma}(\bar{X}-\mu_1)\sqrt{n}<Z_{\beta}.$$
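The decision rules above can be sketched in code. The Python below is a minimal illustration (not the book's implementation) of Wald's SPRT for a normal mean with known σ, using the classical thresholds log A = log((1 − β)/α) and log B = log(β/(1 − α)); the observations in the usage line are hypothetical.

```python
import math

def sprt_normal_mean(xs, mu0, mu1, sigma, alpha=0.05, beta=0.05):
    """Wald's SPRT for H0: mu = mu0 versus H1: mu = mu1 with known sigma."""
    log_a = math.log((1 - beta) / alpha)    # 2.9444 when alpha = beta = 0.05
    log_b = math.log(beta / (1 - alpha))    # -2.9444 when alpha = beta = 0.05
    # log R_n = sum_i [(x_i - mu0)^2 - (x_i - mu1)^2] / (2 sigma^2)
    llr = sum(((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2) for x in xs)
    if llr > log_a:
        return "reject H0"
    if llr < log_b:
        return "reject H1"
    return "continue sampling"

# hypothetical observations clearly favoring mu = 6 over mu = 2
print(sprt_normal_mean([5.8, 6.2, 5.9, 6.1], mu0=2, mu1=6, sigma=1))  # reject H0
```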
The above discussion is valid when the data follow a normal model with a
given standard deviation. However, in practice, the standard deviation is un-
known. This leads to the sprtt package in R for Wald's sequential probability
ratio test without knowing the population standard deviation.
Another practical issue related to the analysis of sequential data is the
statement claiming non-difference between the hypothesized mean and the
true but unknown population mean in the composite hypotheses μ = μ0 ver-
sus μ ≠ μ0 . Since the data vary at a certain level, it makes more sense to
claim closeness within a certain range instead of vaguely claiming an inequal-
ity between the unknown and the hypothesized value. This is because an
inequality could mean that the unknown value is very close to the hypothe-
sized value or very far away from it. Toward this end, a practical concept to
resolve the problem is the effect size.
$$\begin{aligned}
P(\text{Type-II error})&=\sup_{\frac{|\mu-\mu_0|}{\sigma}>d_0}P\Big(\text{accepting }H_0\,\Big|\,\frac{|\mu-\mu_0|}{\sigma}>d_0\Big)\\
&=P\Big(\text{accepting }H_0\,\Big|\,\frac{|\mu-\mu_0|}{\sigma}=d_0\Big)\\
&=P(\text{accepting }H_0\mid d=d_0).
\end{aligned}$$
Example 11.2 For convenience, consider the income data in the R package
sprtt. The dataset contains 120 observations on monthly income for 60 males
and 60 females. For illustration purposes, we will use the income data to examine
the impact of the effect size on the inference outcome. We first compare the
mean monthly income level of males with that of females under different effect
sizes, then use the data to test the sequential information with the SPRT on a
specific value for the population mean of the monthly income, and conclude
the example with a discussion regarding the impact of the alternative likelihood
on effective size.
FIGURE 11.1
Mean monthly income between males and females with sprtt
As shown in the output in Figure 11.1, when the effect size is set to 0.2,
the null and alternative hypotheses become
Namely to test whether the mean monthly income difference is more than 20%
of the data variation. The SPRT thresholds are
since the probability of type-I error and the probability of type-II error are
set to 0.05 in the coding. The log-likelihood of the null hypothesis is

$$\sum_{i}\log f(x_i\mid\mu_{male}=\mu_{female})=1.42137;$$
Since the value of the log-likelihood ratio is within the two thresholds,
there is no data evidence to reject the null hypothesis or the alternative hy-
pothesis. According to Wald's SPRT, the inference outcome is "to continue
sampling". This means that the existing sample size is not large enough to
detect a mean monthly income difference within 20% of the data variation.
In fact, the two sample means of monthly income are very close to each other:
the mean monthly income of males is $3072.09, and that of females is $3080.72.
It should be noted that the conclusion "to continue sampling" is for the
pre-specified effect size of 0.2. When the effect size changes, the inference
conclusion changes. For instance, with the same dataset, when the effect
size is changed to 0.8, the corresponding log-likelihood of the alternative space
becomes

$$\sup\sum_{i}\log f(x_i:|\mu_{male}-\mu_{female}|>0.8\sigma)=-8.09254.$$
Since the value of the log-likelihood ratio is less than -2.9444, according to
Wald's SPRT, the conclusion is to accept the null hypothesis, which means
that the mean monthly incomes of males and females have no significant
difference beyond 80% of the data variation. In this case, the data contain
evidence to support the claim that the mean monthly income difference
between males and females is not beyond 80% of the data variation. This
echoes the earlier discussion on the relationship between the effect size and
the inference conclusion.
Namely the hypothesized mean of the monthly income is set to $3100, and we
want to see whether there is any data evidence supporting the claim that the
difference is more than 50% of the data variation.
As shown in Figure 11.2, the log-likelihood of the null hypothesis is
-0.41656, and the one for the alternative hypothesis reads -13.20851. This
makes the log-likelihood ratio -12.79195, which is less than the lower threshold
-2.9444. According to Wald's SPRT, the optimal decision is to accept the null
hypothesis. As a matter of fact, the sample mean, $3076.4, is indeed well
within the range of 50% of the data variation, which supports Wald's SPRT
in accepting the null hypothesis.
For comparison purposes, when the hypothesized mean is set to $1200, as
shown in the second half of Figure 11.2, the value of the hypothesized mean
$1200 is far below the sample mean. The corresponding log-likelihood value
for the null hypothesis is -169.7469, while the one for the alternative hypothesis
is -119.9379; the log-likelihood ratio reads 49.8090, which is larger than the
upper threshold of Wald's SPRT, 2.9444. Thus the inference conclusion is
to accept the alternative hypothesis. In fact, the sample mean monthly income
FIGURE 11.2
Overall mean monthly income vs a value with sprtt
of the data is $3076.4, which is beyond the hypothesized mean ($1200) by more
than 50% of the data variation.
.!"
.!"
. !+,-.,"
!,-"% . ! ' .
. #$"
#$. * ,
&
! .((.)
)"
! "
FIGURE 11.3
Log-likelihood ratio vs effective sizes
H0 : μ = μ0 versus H1 : μ = μ0 + δ,
for any given value δ > 0. It is well known that when the population standard
deviation is given as σ, the probability of the type-II error becomes
$$\begin{aligned}
P(\text{accepting }H_0\mid H_1\text{ true})&=P\Big(\frac{\bar{X}-\mu_0}{\sigma/\sqrt{k}}<Z_{1-\alpha}\,\Big|\,\mu_0+\delta\Big)\\
&=P\Big(\frac{\bar{X}-\mu_1}{\sigma/\sqrt{k}}<Z_{1-\alpha}-\frac{\delta}{\sigma/\sqrt{k}}\,\Big|\,H_1\Big)\\
&=P\Big(Z<Z_{1-\alpha}-\frac{\delta}{\sigma/\sqrt{k}}\Big);
\end{aligned}$$
also, for given probabilities of type-I and type-II errors, the minimal sample
size for one-stage sampling is
$$k\geq\frac{(Z_{1-\alpha}+Z_{1-\beta})^2\sigma^2}{\delta^2}.$$
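This sample-size formula is straightforward to compute. The minimal Python sketch below is an illustration (not from the book), using only the standard library; the numeric inputs in the usage line are hypothetical.

```python
import math
from statistics import NormalDist

def one_stage_sample_size(alpha, beta, sigma, delta):
    """Minimal k satisfying k >= (Z_{1-alpha} + Z_{1-beta})^2 sigma^2 / delta^2."""
    z = NormalDist().inv_cdf
    return math.ceil((z(1 - alpha) + z(1 - beta)) ** 2 * sigma ** 2 / delta ** 2)

# e.g. alpha = beta = 0.05, sigma = 1, delta = 0.5 requires k = 44
print(one_stage_sample_size(0.05, 0.05, 1.0, 0.5))
```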
The above analysis assumes that the population standard deviation is known;
however, in practice, the population standard deviation is unknown. When
we replace the population standard deviation with the sample standard
deviation, the above discussion becomes invalid. This is because when σ is
replaced by the sample standard deviation s, the latter is a random variable
depending on (being a function of) the sample size k. Under this scenario,
the one-stage approach is unable to resolve the difficulty of sample size
determination. Toward this end, we need the following theorem.
Consider the testing problem H0 : μ = μ0 versus H1 : μ = μ0 + δ, δ > 0 for
a normal population with unknown mean μ and unknown standard deviation
σ. In a sequential sampling, assume that the first n0 observations are available,
X1 , ..., Xn0 . Denote α and β the required probabilities of type-I and type-II
errors. Let s0 be the sample standard deviation of the first sample (the first
n0 observations).
Then, for any total sample size n satisfying

$$n\geq\frac{(t_{n_0-1,1-\alpha}+t_{n_0-1,1-\beta})^2 s_0^2}{\delta^2},$$

the test with rejection region

$$R=\Big\{\frac{\bar{X}_n-\mu_0}{s_0/\sqrt{n}}>t_{n_0-1,1-\alpha}\Big\}$$

has both probabilities of type-I error and type-II error controlled at the α and β
levels, respectively.
Proof: . It suffices to prove the following two conditions for the theorem.
The first one is for the probability of type-I error in the updated sample.
$$\begin{aligned}
P(\text{incorrectly rejecting }H_0)&=P\Big(\frac{\bar{X}_n-\mu_0}{s_0/\sqrt{n}}>t_{n_0-1,1-\alpha}\,\Big|\,\mu_0\Big)\\
&=P(t_{n_0-1}>t_{n_0-1,1-\alpha})\\
&=\alpha.
\end{aligned}$$
As for the second condition on the control of the type-II error, notice that
$$\begin{aligned}
P(\text{incorrectly accepting }H_0)
&=P\Big(\frac{\bar{X}_n-\mu_0}{s_0/\sqrt{n}}<t_{n_0-1,1-\alpha}\,\Big|\,\mu_0+\delta\Big)\\
&=P\Big(\frac{\bar{X}_n-\mu_1}{s_0/\sqrt{n}}<t_{n_0-1,1-\alpha}-\frac{\delta}{s_0/\sqrt{n}}\,\Big|\,\mu_1\Big)\\
&=P\Big(T_{n_0-1}<t_{n_0-1,1-\alpha}-\frac{\delta}{s_0}\sqrt{n}\Big)\\
&\leq P\Bigg(T_{n_0-1}<t_{n_0-1,1-\alpha}-\frac{\delta}{s_0}\sqrt{\frac{(t_{n_0-1,1-\alpha}+t_{n_0-1,1-\beta})^2 s_0^2}{\delta^2}}\Bigg)\\
&=P\big(T_{n_0-1}<t_{n_0-1,1-\alpha}-(t_{n_0-1,1-\alpha}+t_{n_0-1,1-\beta})\big)\\
&=P(T_{n_0-1}<-t_{n_0-1,1-\beta})\\
&=\beta.
\end{aligned}$$
The following example shows how to use Theorem 11.1 for data analysis.
Example 11.3 The following data set contains one-week trading prices of a
stock at the NYSE: {14.98, 15.09, 15.12, 15.15, 15.22, 14.98, 14.96}. Does the
dataset contain enough information to test whether the mean price is $15.00 or
$15.05 at the 0.05 significance level with 95% power? If not, how many
additional observations do we need in sequential sampling?
$t0
[1] 1.888474
$p0
[1] 0.05394006
$t1
[1] 0.5665422
$p1
[1] 0.7042146
[[5]]
[1] "Continue sampling"
FIGURE 11.4
Codes for two-stage sequential Student-t test
with unknown variance and the requirement that the accuracy of the estima-
tion is controlled at a pre-specified level. In fact, the idea in Theorem 11.1 can
be further extended to the two-stage confidence estimation method as follows.
Definition 11.2 For a set of data X, the terminology estimation error refers
to the following. If a 100(1 − α)% confidence interval of an unknown parameter
θ is

$$\hat{\theta}(X)-e(X)\leq\theta\leq\hat{\theta}(X)+e(X),$$

the statistic e(X) is the estimation error.
TwoStageTestSize <- function(x, mu0, mu1, alpha, beta) {
  # two-stage sample size from Theorem 11.1:
  # n >= (t_{n0-1,1-alpha} + t_{n0-1,1-beta})^2 * s^2 / delta^2
  n0 <- length(x)
  s <- sd(x)
  delta <- mu1 - mu0
  t1 <- qt(1 - alpha, n0 - 1)
  t2 <- qt(1 - beta, n0 - 1)
  z <- delta**2/(t1+t2)**2
  n <- ceiling(s**2/(z))
  return(n)
}
TwoStageTestSize(stockdata,15,15.05,0.05,0.05)
[1] 61
FIGURE 11.5
Sample size calculation for stock data-2
and 1 − α the confidence level. Denote n0 the sample size of the first n0
observations. Let s0 be the sample standard deviation of the first sample (the
first n0 observations).
Proof: Notice that the sample size in the second stage satisfies
$$n>\Big(\frac{t_{n_0-1,1-\alpha}\,s_0}{E}\Big)^2,$$
we have
$$\begin{aligned}
P(\bar{X}_n-E\leq\mu\leq\bar{X}_n+E)&=P(|\bar{X}_n-\mu|\leq E)\\
&=P\Big(\frac{|\bar{X}_n-\mu|}{s_0}\sqrt{n}\leq\frac{E}{s_0}\sqrt{n}\Big)\\
&\geq P\Bigg(\frac{|\bar{X}_n-\mu|}{s_0}\sqrt{n}\leq\frac{E}{s_0}\sqrt{\Big(\frac{t_{n_0-1,1-\alpha}\,s_0}{E}\Big)^2}\Bigg)\\
&=P\Big(\frac{|\bar{X}_n-\mu|}{s_0}\sqrt{n}\leq t_{n_0-1,1-\alpha}\Big)\\
&=1-\alpha.
\end{aligned}$$
We use the following examples to illustrate the use of the above theorem.
Example 11.4 Apply the two-stage confidence procedure for the following
questions.
1. Use the following data to estimate the mean stock trading price with
95% confidence level. If we need to have the accuracy at 0.05, how many
observations do we need? 14.98, 15.09, 15.12, 15.15, 15.22, 14.98, 14.96
2. If we use the previous data to be the initial data and want to have a
confidence estimate with accuracy at 0.06, how many additional observations
do we need?
Solution. Figure 11.6 contains R codes and outputs for Example 11.4. In the
seven-day stock exchange data, the sample mean is $15.07 with a sample stan-
dard deviation of $0.10. At the 95% confidence level, the mean stock exchange
price is estimated in the range from $14.98 to $15.16. The estimation error
is $0.09, which is beyond the 5-cent accuracy requirement. Using the two-stage
confidence approach as in Theorem 11.2, a total of 24 observations (17 additional
observations) is needed to keep the confidence level at 0.95 and the accuracy
level at 5 cents.
TwoStageConfidenceSize(stockdata,0.05,0.05)
$SampleMean
[1] 15.07143
$SampleSD
[1] 0.1000714
$t0
[1] 2.446912
$CI
[1] 14.97888 15.16398
$error
[1] 0.09255061
$z
[1] 0.0004175451
$n
[1] 24
FIGURE 11.6
Sample size calculation for stock data-3
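The total sample size n = 24 reported in Figure 11.6 can be reproduced from Theorem 11.2. The Python sketch below is illustrative only; it reuses the first-stage summaries printed in the R output, and assumes E = 0.05 is the accuracy passed in the call TwoStageConfidenceSize(stockdata, 0.05, 0.05).

```python
import math

# first-stage summaries from the stock data (Figure 11.6)
n0 = 7
s0 = 0.1000714          # sample standard deviation of the first 7 prices
t_quantile = 2.446912   # t quantile with 6 degrees of freedom from the R output
E = 0.05                # required estimation error (accuracy), in dollars

# Theorem 11.2: the total sample size must satisfy n > (t * s0 / E)^2
n = math.ceil((t_quantile * s0 / E) ** 2)
print(n, n - n0)        # 24 total observations, 17 additional
```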
set $C_{it}(y)=\{\theta:\hat{P}_i(y\mid\theta)\geq\alpha/t\}$ for any integer t = 1, ..., k. (Here, we use the
notation in Casella and Berger, 2002 [16], page 463). Since this confidence set
Simultaneous Learning in Dose-response Studies 263
is actually inverted from the corresponding rejection region via the individual
p-value, we name it inverted confidence set.
For any value μ∗i ≤ μi0 , the p-value for μi = μ∗i versus Hi1 : μi > μi0 is
$$\hat{P}_i(y\mid\mu_i^{*})=P\big(T_v>(\bar{X}_i-\mu_i^{*})/(\hat{\sigma}_i/\sqrt{n_i})\big);$$
and the p-value for Hi0 : μi ≤ μi0 versus Hi1 : μi > μi0 is
Now, notice that for any value μi , P̂i (y|μi ) has the following property:
$$P\big(y:\hat{P}_i(y\mid\mu_i)\geq\alpha\big)=P\big((\bar{X}_i-\mu_i)\sqrt{n_i}/\hat{\sigma}_i\leq t_{1-\alpha}\big)=1-\alpha.$$
Another concept we need in this section is the Directed Confidence Set (Hsu
and Berger, 1999 [68]): A confidence set for θ, C(y), is said to be directed
toward a subset Θ∗ of the parameter space Θ, if for every sample point y,
either Θ∗ ⊂ C(y) or C(y) ⊂ Θ∗ .
264 Simultaneous Learning and Multiplicity
Θ∗ = {μ : μi > μi0 }.
This is because either the confidence set C(y) contains the alternative space
Θ*, when

$$\mu_{i0}>\bar{X}_i-t_{1-\alpha}(\hat{\sigma}_i/\sqrt{n_i}),$$

or it is contained in the alternative space, when

$$\mu_{i0}<\bar{X}_i-t_{1-\alpha}(\hat{\sigma}_i/\sqrt{n_i}).$$
For a given sample, the confidence set is a subset in the parameter space.
For a given parameter, the confidence set is an event in the sample space (see,
for example, Berger and Casella, 2002, p 463[16]). The concept of directed
confidence set, in conjunction with the inverted confidence set, leads to the
following result.
For multiple testing problem of Hi0 : θi ∈ Θi versus Hi1 : θi ∈ Θci , assume
that for any nested rejection region and any permissible integers i and t, there
exists an inverted confidence set Cit (y) that is directed towards Θci . When
screening down from the largest to the smallest ordered p-value, let m be the
index that satisfies the following two criteria,
i) P̂(m) ≥ α/(k − m + 1); and
ii) for any index i: m < i ≤ k, P̂(i) < α/(k − i + 1). For notational
convenience, denote
C0k (y) = Θ
when m = 0 where all p-values are smaller than the corresponding cutoff
values, and
Θc(k+1) = Θ
when m = k. Under this setting, the simultaneous confidence set keeps the
confidence level at the nominal level.
Theorem 11.3
$$P\big(\theta\in\Theta_{(k+1)}^{c}\cap\Theta_{(k)}^{c}\cap\cdots\cap\Theta_{(m+1)}^{c}\cap C_{(m)}^{\,k-m+1}(y)\big)\geq 1-\alpha. \tag{11.1}$$
and apply the procedure to construct simultaneous confidence sets for the
Aspirin example.
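The screening step that locates the index m can be sketched directly from criteria i) and ii). The Python below is a literal transcription (not the book's code) under the convention that the ordered p-values are screened from the largest downward; the p-values in the usage lines are hypothetical.

```python
def stepup_index(pvalues, alpha):
    """Return the index m of criteria i) and ii): with p-values ordered from
    the largest, P(1) >= ... >= P(k), m satisfies P(m) >= alpha/(k - m + 1)
    while every later (smaller) ordered p-value falls below its own cutoff."""
    ps = sorted(pvalues, reverse=True)
    k = len(ps)
    for m in range(k, 0, -1):
        if ps[m - 1] >= alpha / (k - m + 1) and all(
                ps[i - 1] < alpha / (k - i + 1) for i in range(m + 1, k + 1)):
            return m
    return 0   # m = 0: all ordered p-values fall below their cutoffs

# one clearly non-significant test and three small p-values give m = 1,
# so the retained confidence set is C at level 1 - alpha/(k - m + 1) = 1 - alpha/4
print(stepup_index([0.40, 0.012, 0.008, 0.001], alpha=0.05))  # 1
```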
We now apply the algorithm developed above to analyze the Aspirin effi-
cacy example in Example 11.5. Notice that the alternative parameter space
is of the form θi = ηi − η0 ≤ 5, and the associated confidence region corre-
sponding to p̂i ≥ α/t is of the form W ≤ c1−α , where W is the Wilcoxon
statistic. Thus the inverted confidence set is of the form ηi − η0 ≥ U , where
U is a lower confidence bound for the median difference derived from W , and
the condition of directed confidence interval is satisfied.
By the step-up confidence set algorithm, we have the following analytical
results when α = 0.05.
Thus, the 95% simultaneous confidence set consists of the following compo-
nents:
η4 − η0 − 5 ∈ (0, ∞)
η3 − η0 − 5 ∈ (0, ∞)
η2 − η0 − 5 ∈ (0, ∞)
η1 − η0 − 5 ∈ (−0.002, ∞).
able to claim that the Aspirin treatments significantly reduce the risk of heart
attack in the clinical trial. The improvement of the cardiovascular safety score
is at least five units for the first three treatments, with the following statements
holding simultaneously.
1 The daily treatment of Aspirin at 250 mg is statistically significant in
reducing the risk of heart attack. It increases the cardiovascular safety
score by at least 5 units.
2 The daily treatment of Aspirin at 300mg in conjunction with supplement
treatment of Heparin is statistically significant in reducing the risk of heart
attack. It increases the cardiovascular safety score by at least 5 units.
3 The daily treatment of Aspirin at 325 mg with Heparin is statistically sig-
nificant in reducing the risk of heart attack. It increases the cardiovascular
safety score by at least 5 units.
4 The daily treatment of Aspirin at 200 mg alone improves the median
cardiovascular score by 4.998 units. This is because the 1 − α/4 = 98.75%
confidence interval for the median difference is

$$\eta_1-\eta_0-5\in(-0.002,\infty).$$
The learning result of the new step-up confidence method fits well with clinical
expectations. Chen (2016)[23] contains more technical discussions on motiva-
tion, theory, and applications of this method.
Denote the set Li (x) = (ŷi∗ (x), ∞). The set Li (x) possesses the following
statistical properties.
First, the one-sided confidence region is directed toward the pre-specified
efficacy region (δ(x), ∞). For any i = 1, 2, . . . , k, if the boundary ŷi (x) < δ(x)
at a point x = x0 , then ŷi∗ (x) < δ(x). Therefore, the set Li is directed toward
the set Δ(x) = (δ(x), ∞), that is, either Li (x) ⊆ Δ(x) or Δ(x) ⊆ Li (x).
Second, the one-sided confidence region Li (x) reaches the nominal confi-
dence coverage. For any i = 1, 2, . . . , k, if the boundary ŷi (x) is a 100(1 − α)%
lower confidence bound for the response curve yi (x), then ŷi∗ (x) is also a
100(1 − α)% lower confidence bound for yi (x).
We need the following notations for the key theorem in this subsection.
a 100(1−α)% lower confidence set for the logistic regression line yi (x). Assume
that Li (x) is directed toward the set of alternative space Δ(x). Also, let
$$\hat{y}_{k+1}^{*}(x)=\min_{i}\hat{y}_i^{*}(x),$$
which is a 100(1 − α)% lower confidence bound for the lower boundary across
all the response curves, $\min_i y_i(x)$.
Let D be the smallest integer i such that ŷi∗ (x) > δ(x) if such i(1 ≤ i ≤ k)
exists; Otherwise, let D = k + 1.
is a 100(1 − α)% upper simultaneous confidence band for the parameter func-
tion yi (x) that is directed toward the set of parameter
respectively. Let
$$\hat{y}_{k+1}^{U}(x)=\max_{i}\hat{y}_i(x).$$
$$Y^{W}(x)=\begin{cases}
(\Theta_1^{c}\cap\Lambda_1^{c})\cap(\Theta_2^{c}\cap\Lambda_2^{c})\cap\cdots\cap(\Theta_{M-1}^{c}\cap\Lambda_{M-1}^{c})\cap L_S(x), & \text{if }S<T,\\[4pt]
(\Theta_1^{c}\cap\eta_1^{c})\cap(\Theta_2^{c}\cap\eta_2^{c})\cap\cdots\cap(\Theta_{M-1}^{c}\cap\eta_{M-1}^{c})\cap U_T(x), & \text{if }S>T.
\end{cases} \tag{11.3}$$
With the above setting, we have the following result: Y^W(x) is a set of
two-sided 100(1 − α)% confidence bounds for θ(x). The proof of this result
can be found in Kerns and Chen (2017) [76]. Notice that both L_S(x) ∩ Λ^c_S
when S < T and U_T(x) ∩ Θ^c_T when S > T are 100(1 − α)% confidence
bounds for θ(x).
Upon the establishment of the above theorem and algorithm, we may now
illustrate their applications using the following example.
TABLE 11.1
Success rates of thrombolytic therapy
Lysis time 0.25 1.25 2.5 8 24 50
Log-dose -1.386 0.223 0.916 2.08 3.18 3.91
Group 1 16/78 23/78 48/78 56/78 68/78 78/78
Group 2 10/78 13/78 37/78 48/78 54/78 65/78
Group 3 2/78 11/78 22/78 25/78 35/78 44/78
and the third group consists of patients whose ages were over 55 with (acute
or chronic) limb ischaemia. The dose variable in this example is the time of
therapy and the outcome is the success rate of the procedure (more than 80%
lysis). For illustration purposes, a summary of the data is given in
Table 11.1.
Based on the information in Table 11.1, the ML estimates from the logistic
fit and the Fisher information matrix for each group are found and displayed
in the following output.
TABLE 11.2
Parameter estimation on dose-response curves

Group 1   β̂ = [−.8699, .2654]    F = [162.12, 328.68; 328.68, 1103.38]
Group 2   β̂ = [−1.1374, .3179]   F = [132.81, 284.09; 284.09, 944.85]
Group 3   β̂ = [−1.3244, .3363]   F = [84.33, 193.69; 193.69, 646.02]
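Under the grouped-binomial logistic model, estimates of this kind come from Newton-Raphson iterations whose curvature matrix is the Fisher information X′WX with W = diag(n p (1 − p)). The following is a minimal sketch of our own (it is not the authors' code, and its estimates need not reproduce Table 11.2 exactly, since the book's parameterization is not fully specified); the counts are taken from Group 1 of Table 11.1:

```python
import numpy as np

# Grouped dose-response data patterned after Table 11.1, Group 1:
# x = log-dose (log lysis time), y = successes out of n = 78 per dose.
x = np.array([-1.386, 0.223, 0.916, 2.08, 3.18, 3.91])
y = np.array([16.0, 23.0, 48.0, 56.0, 68.0, 78.0])
n = np.full(6, 78.0)

X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]
beta = np.zeros(2)

# Newton-Raphson for the grouped binomial logistic log-likelihood:
# each update solves (X' W X) step = X'(y - n p), W = diag(n p (1 - p)).
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    grad = X.T @ (y - n * p)               # score vector
    w = n * p * (1.0 - p)
    fisher = X.T @ (w[:, None] * X)        # Fisher information X'WX
    step = np.linalg.solve(fisher, grad)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print("beta_hat:", beta)
print("Fisher information:\n", fisher)
```

At convergence the inverse of the Fisher information matrix estimates the covariance of β̂, which is what the simultaneous bands in Figure 11.7 are built from.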
FIGURE 11.7
Simultaneous lower bands for thrombolysis effects (success rates versus
log(lysis dosage)).
Example 11.9 The inference question in this example is the following. When
Weighted Simultaneous Confidence Regions 273
and denote the corresponding p-value by P̂i (y | θi∗). Let C^t_i(y) be an inverted
confidence set

C^t_i(y) = {θ : P̂i (y | θ) ≥ c_t α/w_t}

for any t = 1, 2, ..., k.
For a multiple testing problem of Hi0 : θi ∈ Θi versus Hi1 : θi ∈ Θ^c_i, i =
1, 2, ..., k, assume that there exists an inverted confidence set C^t_i(y) that is
directed toward Θ^c_i for all permissible integers i and t. Then, for the index m
in the step-down weighted procedure, denote Θ^c_{(0)} = Θ for notational conve-
nience; Θ^c_{(i)} is the alternative space associated with S_{(i)}, and C^{k−m+1}_{(m)}(y) is
the inverted and directed confidence set that is associated with Θ^c_{(m)}.
Step 2: If S_(2) ≥ α/w_{k−1}, then conclude θ_(2) ∈ {θ : P̂∗_2 ≥ c∗_(2) α/w_{k−1}},
stop; else conclude θ_(2) ∈ Θ^c_(2), and go to Step 3.
..
.
Step i: If S_(i) ≥ α/w_{k−i+1}, then conclude θ_(i) ∈ {θ : P̂∗_i ≥ c∗_(i) α/w_{k−i+1}},
stop; else conclude θ_(i) ∈ Θ^c_(i), and go to Step i + 1.
..
.
Step k: If S_(k) ≥ α/w_1, then conclude θ_(k) ∈ {θ : P̂∗_k ≥ α}, stop; else conclude
θ_(k) ∈ Θ^c_(k), stop.
S_t = P̂_t / c_t

for t = 1, 2, ..., k.
S_(t) = P̂∗_t / c∗_t,

where P̂∗_t and c∗_t are the corresponding p-value and hypothesis weight. The
ordered S-value is compared with the corresponding significance level
α/w_{k−t+1}, where

w_{k−t+1} = Σ_{s=t}^{k} c∗_s.
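The stepwise rule above can be sketched in code. The following is our own illustrative implementation (the function name and the demo p-values and weights are hypothetical, not the book's data): compute S_t = p_t/c_t, screen from the smallest ordered S-value upward, and stop at the first step whose S-value meets or exceeds α divided by the cumulative weight of the hypotheses not yet screened:

```python
import numpy as np

def weighted_step_down(pvals, weights, alpha=0.05):
    """Sketch of the weighted step-down screening rule: S_t = p_t / c_t,
    screened from the smallest ordered S-value upward; at step t the
    cutoff is alpha / w, where w sums the weights not yet screened."""
    p = np.asarray(pvals, dtype=float)
    c = np.asarray(weights, dtype=float)
    s = p / c
    order = np.argsort(s)              # indices of S_(1), S_(2), ...
    rejected = []
    for t, idx in enumerate(order):
        w = c[order[t:]].sum()         # cumulative weight w_{k-t+1}
        if s[idx] < alpha / w:
            rejected.append(int(idx))  # conclude theta_(t) in Theta^c_(t)
        else:
            break                      # stop; invert a confidence set here
    return rejected

# Hypothetical p-values and weights (not the book's data):
rej = weighted_step_down([0.0001, 0.04, 0.003, 0.2], [4.0, 1.0, 3.0, 2.0])
print(rej)  # indices rejected before the procedure stops
```

Heavily weighted hypotheses get smaller S-values and are screened earlier, which is how the weighting encodes prior importance in the procedure.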
C^{k−q+1}_{(q)}(y) = { y : d_q ∈ ( −∞, X̄_{1(q)} − X̄_{2(q)} + t_{v,u} σ̂_{(q)} √(1/n_{1(q)} + 1/n_{2(q)}) ) },

where

u = 1 − c∗_q α / w_{k−q+1},

and X̄_{1(q)} and X̄_{2(q)} are the sample means corresponding to the q-th ordered S-
value. The confidence interval C^{k−q+1}_{(q)}(y) is applied at step q (if the procedure
stops at step q).
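The upper endpoint of this inverted set is a one-sided two-sample t bound at the adjusted quantile level u. A small numeric sketch (sample means, pooled standard deviation, degrees of freedom, and weights below are all invented for illustration):

```python
import math
from scipy.stats import t as t_dist

def step_q_upper_bound(xbar1, xbar2, sigma_hat, n1, n2, v, alpha, c_q, w_q):
    """Upper endpoint of the inverted one-sided set C_(q)(y):
    d_q in (-inf, ub), with t-quantile level u = 1 - c_q * alpha / w_q."""
    u = 1.0 - c_q * alpha / w_q
    return (xbar1 - xbar2
            + t_dist.ppf(u, v) * sigma_hat * math.sqrt(1.0 / n1 + 1.0 / n2))

# Hypothetical inputs:
ub = step_q_upper_bound(xbar1=1.2, xbar2=1.0, sigma_hat=0.8,
                        n1=30, n2=30, v=58, alpha=0.05, c_q=35.0, w_q=60.0)
print(round(ub, 3))
```

A larger weight c_q lowers the quantile level u, so more heavily weighted hypotheses receive tighter (smaller) upper bounds at the stopping step.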
More specifically, suppose that k = 4. Consequently,

P̂∗_1 = .01, P̂∗_2 = .1, P̂∗_3 = .03, P̂∗_4 = .08,

and the corresponding weights are c∗_1 = 35, c∗_2 = 10, c∗_3 = 40, c∗_4 = 15.

Step 2: S_(2) = .00286 > .00083 = α/w_3. Assert that θ_(2) ∈ {P̂∗_2 ≥ c∗_(2) α/w_3} =
C^3_(2)(y), where c∗_(2) α/w_3 = .02905. Stop.
Here, since the ordered S_(1) corresponds to S_3, θ_(1) ∈ Θ^c_(1) implies that
θ_3 ∈ Θ^c_3, so
Θ^c_(1) = (−∞, 3).
If, from a set of data, we have that

X̄_11 − X̄_21 + t_{v, 1 − c∗_(2) α/w_3} σ̂_1 √(1/n_11 + 1/n_21) = 0.6,
then the algorithm claims, at the 95% confidence level, that Treatment-3 has a
significant treatment effect compared with the placebo group, and that the dif-
ference between the drug and placebo for Treatment-1 is less than 0.6 units; no
conclusion is reached on Treatment-2 or Treatment-4 with the current weighting.
This is partly because the weights on Treatment-2 and Treatment-4 (10 and 15, re-
spectively) are much lower than the weights for Treatment-1 and Treatment-3
(35 and 40, respectively).
With the above setting, we can now use the weighted confidence set method
to analyze the question posed in Example 11.9 as follows.
where B = 100 and i = 1, 2, ..., 3170. More discussion on permutation tests can
be found in Huang et al (2006) [69] and Kaizar et al (2011) [74]. In light of Jauf-
fret et al (2007) [73] regarding the significance of correlation between moesin
and BRCA1 associated breast cancer, we assigned weights correspondingly to
analyze the gene expression data. The reanalysis consists of two parts. In the
first part, we use the published p-values in Storey and Tibshirani (2003, [114])
in conjunction with the new weighted confidence theorem. The first analysis
identified several additional significant genes, such as APEX nuclease (clone
417124), apoptosis-related protein 15 (clone 137836), apoptosis inhibitor 1
(clone 34852) and two ERCC-related genes (clone 323390 and clone 52666),
in addition to the 160 genes identified using the q values in Storey and Tib-
shirani (2003).
For example, screening up from the smallest S-value to the largest, the
weighted p-value of APEX nuclease is 2.78 × 10−9, which is smaller than
the corresponding significance level 3.30 × 10−9. We thus concluded that the
confidence set for the mean difference of the two expression levels
is Θ^c_i = (0, +∞). The expression of this gene is suppressed in BRCA1-mutation-
positive tumors, which results in a decrease in its DNA-repair-mediating
function (APEX nuclease).
Another example is apoptosis inhibitor 1, whose weighted p-value is
1.70 × 10−8. We recognize this gene as significant after comparing it with
the corresponding significance level 2.54 × 10−8, and conclude that the mean
difference is within (0, +∞). The expression of this gene, involved in suppress-
ing apoptosis, is also decreased in BRCA1-mutation-positive tumors.
Furthermore, the new procedure also found a gene associated with
apoptosis-related protein 15; the weighted p-value is 1.48 × 10−10, which is less
than the corresponding significance level 1.55 × 10−10. This indicates the sig-
nificance of apoptosis-related protein 15 after multiplicity adjustments. The
procedure stops at the 166th step with an inverted confidence interval calcu-
lated from the associated p-value.
Notice that the data analysis in the first part is based on per-
muted p-values. In the second portion of the reanalysis, we assume normality
for the data and recalculate the p-values using the two-sample t-test.
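The per-gene recalculation can be sketched as follows; the expression values, group sizes, and group labels below are hypothetical, standing in for one gene's two tumor groups:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
# Hypothetical expression values for one gene in the two tumor groups
mutation_pos = rng.normal(loc=2.0, scale=1.0, size=7)  # BRCA1-mutation-positive
sporadic = rng.normal(loc=3.0, scale=1.0, size=8)      # sporadic tumors

# One p-value per gene under the normality assumption; these p-values
# feed the weighted procedure in place of the permutation p-values.
t_stat, p_val = ttest_ind(sporadic, mutation_pos, equal_var=True)
print(t_stat, p_val)
```

Repeating this over all genes yields the p-value vector that the weighted step-down procedure then screens.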
Specifically, the GATA-3 gene is highly correlated with estrogen receptor,
which leads to a suppressed expression of this gene in BRCA 1-mutation-
positive breast cancer (Eeckhoute et al. 2007, [46]). After assigning proper
weights to the genes according to biological literature, a total of 163 significant
genes were identified using the proposed weighted step-down confidence set
procedure, which includes GATA-binding protein 3, moesin, among others. For
278 Simultaneous Learning and Multiplicity
S_(i) = P̂∗_i / c∗_i,
where P̂i∗ and c∗i are, respectively, the p-value and weight corresponding to
the ordered S-value S(i) with the corresponding parameter θi∗ , denoted as θ(i)
in the sequel.
Also, denote by l_i, i = 1, ..., k, the dimension of the parameter space corre-
sponding to S_(i). For instance, if S_4 ranks second among all the S-values,
i.e., it is S_(2), then by this notation n_4 = l_2. For notational convenience,
define the cumulative weight
w_i = Σ_{j=k−i+1}^{k} c∗_j,

or equivalently,

w_{k−i+1} = Σ_{j=i}^{k} c∗_j.
Example 11.12 For the datasets on red wine and white wine described above,
we are interested in learning wine features that significantly affect the wine
quality index in tasting.
TABLE 11.3
Analytic results of Category I for red wine.
unweighted inference
Quality S-value Weight α∗ Conclusion
Score = 7 4.69E-5 1 0.0125 Claim H31
Score ≥ 8 0.02164 1 0.01667 Fail to reject H40
Score = 6 0.02463 1 0.025 No conclusion
Score = 5 0.03974 1 0.05 No conclusion
weighted inference
Quality S-value Weight α∗ Conclusion
Score = 7 4.69E-5 1 0.000485 Claim H31
Score ≥ 8 2.164E-4 100 0.00049 Claim H41
Score = 6 0.02463 1 0.025 Claim H21
Score = 5 0.03974 1 0.05 Claim H11
With the above setting, we run the proposed procedure on all three cate-
gories with different types of wines to investigate the impact of different groups
of physicochemical variables on wine quality. Since the distribution of the data
set is skewed, we use the Wilcoxon rank-sum test to compute the individual
p-value for each test. As shown in Theorem 11.6, the procedure is valid for
any test statistic suitable for the distribution of the data. The vector-wise
p-values, S-values, and adjusted significance levels α∗ are reported in Table 11.3.
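A per-comparison p-value of this kind can be computed with the Mann-Whitney U form of the Wilcoxon rank-sum test; the data below are hypothetical skewed draws standing in for one quality-group comparison (n = 199 matches the Score = 7 group discussed below; the baseline group size is invented):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(3)
# Hypothetical skewed acidity-related measurements for two quality groups;
# lognormal draws mimic the skewness noted in the text.
score7 = rng.lognormal(mean=0.4, sigma=0.5, size=199)
baseline = rng.lognormal(mean=0.0, sigma=0.5, size=150)

# Wilcoxon rank-sum test (Mann-Whitney U form): no normality assumption
stat, p = mannwhitneyu(score7, baseline, alternative="two-sided")
print(p)
```

As the text notes, any test statistic suited to the data's distribution could be substituted here; the weighted procedure only consumes the resulting p-values.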
Without weighting adjustment for red wine, the results summarized in Ta-
ble 11.3 show the impact of acidity on the taste of red wine. With equal
weights, the algorithm stops at the second step, concluding that acidity is sig-
nificant in differentiating moderately high-quality wine (Score = 7) from the
baseline, but it is not significant for the rest. However, this inference
may be confounded by the unbalanced sample sizes between the group with
score 7 (n = 199) and the group with scores of 8 or higher (n = 18). Moreover,
such a conclusion is not coherent in asserting the impact of
acidity changes on taste quality: it makes more sense to claim a sig-
nificant impact for wines with higher scores before claiming one
for wines with lower scores. Since we are interested in seeking factors that are
important for high-quality wine, we consider a weight of 100 for the group in
which the wine taste quality score is 8 or higher. With the weight vector
w2, we are able to reach a conclusion that is consistent and coherent through-
out the comparison groups. The new weighting scheme leads to the conclusion
that acidity is significant in distinguishing red wine taste quality.
Now, consider the white wine Category I. As with the analysis for red
wine, the impacts of Categories II and III are not statistically significant, so we
report only the analysis of Category I here.
Comparing the results from the unweighted and weighted procedures in Ta-
ble 11.4, we can see that different weights may lead to different testing orders.
For illustration purposes, we chose a more stringent significance level,
α = 0.005.
TABLE 11.4
Analytic results of Category I for white wine.
unweighted inference
Quality S-value Weight α∗ Conclusion
Score = 7 2.24E-4 1 0.00125 Claim H31
Score = 6 1.60E-3 1 0.00167 Claim H21
Score ≥ 8 2.84E-3 1 0.0025 Fail to reject H40
Score = 5 7.59E-3 1 0.005 No conclusion
weighted inference
Quality S-value Weight α∗ Conclusion
Score ≥ 8 2.84E-5 100 4.85E-5 Claim H41
Score = 7 2.24E-4 1 0.00167 Claim H31
Score = 6 1.60E-3 1 0.0025 Claim H21
Score = 5 7.59E-3 1 0.005 Fail to reject H10
The equal weighting scheme results in an incoherent conclusion, in which a
relatively lower-quality wine (Score = 7) is significant but a higher-quality
(Score ≥ 8) wine is not. Comparatively, the weighted procedure
in Table 11.4 leads to a more persuasive and interpretable assertion, that
acidity is a significant factor in distinguishing white wine quality, because we
have sufficient evidence to reject the null hypothesis.
Practically, the impact of acidity on wine taste is complex. However, indus-
trial evidence shows that, within a reasonable range, a relatively higher acidity
level makes both red and white wine taste fresher than wines with low acidity
levels (see, for example, Plane et al. 1980 [94]). Moreover, sweetness might
balance the taste of acidity, and consequently the combined taste may be
richer. This agrees with the inference conclusion derived from the
weighted confidence procedure, as stated in Table 11.3.
Inference in the above-mentioned settings calls for further development
of weighted inference for underlying parameter arrays. While we normally use
multiple testing procedures for comparisons of more than one popula-
tion, it should be noted that the method of simultaneous confidence sets plays
a more effective role in estimating the unknown sets of parameters. There
are distinct differences between the confidence set method and multiple
testing procedures (such as the closed testing principle). The closed testing
principle essentially draws conclusions on multiple hypotheses by controlling
the family-wise error rate. In the literature, various multiple testing proce-
dures have been proposed with different criteria for controlling the error rate, such
as control of the false discovery rate; see, for example, Hochberg and Tamhane
(1987) [58]. The simultaneous confidence set, on the other hand, provides an
overall estimation for all the parameters under investigation. Usually, a mul-
discussed in the last section of the book shed light on statistical learning
for data from multiple sources.
Bibliography
[7] P. Bühlmann and B. Yu. Boosting with the L2 loss: regression and
classification. Journal of the American Statistical Association, 98:324–
339, 2003.
[30] J. T. Chen, S. Iyengar, and D. Brent. Constraint estimation for the pop-
ulation attributable risk. Journal of Applied Probability and Statistics,
2:251–265, 2007.
[35] J. T. Chen, E. Walsh, and A. Comerota et al. A new multiple test ap-
proach for nursing care administration of deep vein thrombosis patients.
Istanbul University Journal of the School of Business Administration,
1:22–34, 2011.
[36] L. Chen and J. Chen. Refined machine learning approaches for mask
policy analysis. In E. Cetin and H. Ozen, editors, Health-care Policy,
Innovation and Digitalization, chapter 10, pages 197–211. Springer
Nature, Singapore, 2024.
[40] C. Daniel and F. Wood. Fitting Equations to Data. Wiley, New York,
1971.
[54] M. H. Hansen and B. Yu. Model selection and the principle of mini-
mum description length. Journal of the American Statistical Associa-
tion, 96:746–774, 2001.
[74] E. Kaizar, Y. Li, and J. Hsu. Permutation Multiple Tests of Binary Fea-
tures Do Not Uniformly Control Error Rates. Journal of the American
Statistical Association, 106:1067–1074, 2011.
[102] E. Seneta. On the history of the strong law of large numbers and Boole’s
inequality. Historia Mathematica, 19(1):24–39, 1992.
[107] E. Seneta and J. T. Chen. On explicit and Fréchet optimal lower bounds.
Journal of Applied Probability, 39:81–90, 2002.
[109] E. Seneta and J. T. Chen. Simple stepwise tests of hypotheses and mul-
tiple comparisons. International Statistical Review, 73(1):21–34, 2005.
[116] J. Tank, G. Georgiadis, and J. Bair et al. Does the use of ethyl chlo-
ride improve patient-reported pain scores with negative pressure wound
therapy dressing changes? A prospective, randomized controlled trial.
Journal of Trauma and Acute Care Surgery, 6:1061–1066, 2021.
[128] Y. Yu, J.T. Chen, and B. Yeh. Weighted step-down confidence proce-
dures with applications to gene expression data. Communications in
Statistics – Theory and Methods, 51:2343–2356, 2022.
[129] Y. Zhang, J. T. Chen, S. Wang, C. Andrews, and A. Levy. How do
consumers use nutrition labels on food products in the United States?
Topics in Clinical Nutrition, 32:161–171, 2017.
Index

minimum risk invariant estimator, 110
misclassification rate, 210, 212, 214, 217
model-based, 1–3, 5, 10, 11, 13–15, 18, 19, 22, 54, 67, 82, 97, 110
natural cubic spline, 162–164
support vector machine, 55, 98, 185, 193, 195, 196, 202
thrombolysis, 189, 219, 266, 269, 270
trading price, 257, 260
two stage sampling, 258
two-stage estimation, 255
two-stage sequential procedure, 248, 258, 284