
Statistical Prediction

and Machine Learning


Written by an experienced statistics educator and two data scientists, this book unifies conventional statistical thinking and the contemporary machine learning framework into a single overarching umbrella over data science. The book is designed to bridge the knowledge gap between conventional statistics and machine learning. It provides an accessible approach for readers with a basic statistics background to develop a mastery of machine learning. The book starts with elucidating examples in Chapter 1 and fundamentals on refined optimization in Chapter 2, which are followed by common supervised learning methods such as regressions, classification, support vector machines, tree algorithms, and range regressions. After discussing these supervised learning methods, it includes a chapter on unsupervised learning and a chapter on statistical learning with data arriving sequentially or simultaneously from multiple sources.

One of the distinct features of this book is the comprehensive coverage of topics in statistical learning and medical applications. It summarizes the authors’ teaching, research, and consulting experience using data analytics. The illustrating examples and accompanying materials heavily emphasize understanding of data analysis, producing accurate interpretations, and discovering hidden assumptions associated with various methods.

Key Features:
• Unifies conventional model-based framework and contemporary data-driven methods into a
single overarching umbrella over data science.
• Includes real-life medical applications in hypertension, stroke, diabetes, thrombolysis, and
aspirin efficacy.
• Integrates statistical theory with machine learning algorithms.
• Includes potential methodological developments in data science.

John Tuhao Chen is a professor of Statistics at Bowling Green State University. He completed his
postdoctoral training at McMaster University (Canada) after earning a PhD degree in statistics at
the University of Sydney (Australia). John has published research papers in statistics journals such
as Biometrika as well as in medicine journals such as the Annals of Neurology.

Lincy Y. Chen is a data scientist at JP Morgan Chase & Co. She graduated from Cornell Uni-
versity, winning the Edward M. Snyder Prize in Statistics. Lincy has published papers regarding
refinements of machine learning methods.

Clement Lee is a data scientist in a private firm in New York. He earned a Master’s degree in
applied mathematics from New York University, after graduating from Princeton University in
computer science. Clement enjoys spending time with his beloved wife Belinda and their son
Pascal.
Statistical Prediction
and Machine Learning

John Tuhao Chen


Lincy Y. Chen
Clement Lee
Designed cover image: © John Tuhao Chen
First edition published 2024
by CRC Press
2385 NW Executive Center Drive, Suite 320, Boca Raton FL 33431

and by CRC Press


4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2024 Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and pub-
lisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information stor-
age or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.

ISBN: 978-0-367-33227-3 (hbk)


ISBN: 978-1-032-81272-4 (pbk)
ISBN: 978-0-429-31868-9 (ebk)

DOI: 10.1201/9780429318689

Typeset in Nimbus Roman


by KnowledgeWorks Global Ltd.

Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
To the memory of my parents
HongBiao Chen and WanJuan Lin
–JTC
Contents

Preface xi

List of Figures xiii

List of Tables xv

1 Two Cultures in Data Science 1


1.1 Model-based culture . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Data-driven culture . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Intrinsics between the two culture camps . . . . . . . . . . . 11
1.3.1 Small sample inference necessitates model assumptions 11
1.3.2 Prediction accuracy demands large sample . . . . . . . 13
1.3.3 Which camp to go? . . . . . . . . . . . . . . . . . . . . 13
1.4 Learning outcome evaluation . . . . . . . . . . . . . . . . . . 14
1.4.1 Error rates in model-based culture camp . . . . . . . . 14
1.4.2 Cost functions in neural networks . . . . . . . . . . . . 16
1.5 Learning process optimization . . . . . . . . . . . . . . . . . 18
1.5.1 Model-based camp . . . . . . . . . . . . . . . . . . . . 18
1.5.2 Data-Driven camp . . . . . . . . . . . . . . . . . . . . 20

2 Fundamental Instruments 23
2.1 Data identification . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.1 Data types . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.2 Pooling data, Simpson’s paradox, and solution . . . . 28
2.2 Basic concepts of trees . . . . . . . . . . . . . . . . . . . . . 30
2.3 Sensitivity, specificity, and ROC curves . . . . . . . . . . . . 34
2.4 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.1 LOOCV and Jackknife . . . . . . . . . . . . . . . . . . 38
2.4.2 LOOCV for linear regressions . . . . . . . . . . . . . 41
2.4.3 K-fold cross-validation and SAS examples . . . . . . . 44
2.5 Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.5.1 Non-parametric bootstrapping . . . . . . . . . . . . . 47
2.5.2 Parametric bootstrapping . . . . . . . . . . . . . . . . 50


3 Sensitivity and Specificity Trade-off 57


3.1 Dilemma on false positive and false negative errors . . . . . . 57
3.2 Most sensitive diagnostic variable . . . . . . . . . . . . . . . 60
3.3 Two-ended diagnostic measures . . . . . . . . . . . . . . . . 72
3.4 UMEDP with confounding factors . . . . . . . . . . . . . . . 82
3.5 Efficient and invariant diagnostic predictors . . . . . . . . . . 86
3.5.1 Invariant principle in data transformation . . . . . . . 87
3.5.2 Invariance and efficiency . . . . . . . . . . . . . . . . . 91

4 Bias and Variation Trade-off 97


4.1 Reducible and Irreducible Errors in Prediction . . . . . . . . 97
4.2 Minimum variance unbiased estimators . . . . . . . . . . . . 101
4.3 Minimum risk estimators for transformed data . . . . . . . . 110

5 Linear Prediction 119


5.1 Pitfalls in linear regressions . . . . . . . . . . . . . . . . . . . 119
5.2 Model training and prediction . . . . . . . . . . . . . . . . . 123
5.2.1 Building models with training data . . . . . . . . . . . 125
5.2.2 Evaluating trained models without normality . . . . . 128
5.2.3 Model significance with normal data . . . . . . . . . . 133
5.2.4 Confidence prediction with trained models . . . . . . . 135
5.3 Multiple linear regression . . . . . . . . . . . . . . . . . . . . 135
5.3.1 Confounding effects . . . . . . . . . . . . . . . . . . . 138
5.3.2 Information loss and model selection . . . . . . . . . . 139
5.4 Categorical predictors . . . . . . . . . . . . . . . . . . . . . . 144
5.5 Outliers and leverage statistics . . . . . . . . . . . . . . . . . 146

6 Nonlinear Prediction 151


6.1 Restricted optimization and shrinkage . . . . . . . . . . . . . 151
6.1.1 Ridge regression . . . . . . . . . . . . . . . . . . . . . 152
6.1.2 LASSO regression . . . . . . . . . . . . . . . . . . . . 154
6.2 Model Selection and Regularization . . . . . . . . . . . . . . 155
6.3 High Dimensional Data . . . . . . . . . . . . . . . . . . . . . 156
6.3.1 Curse of Dimensionality . . . . . . . . . . . . . . . . . 157
6.3.2 Dimension Reduction by Transformation . . . . . . . . 157
6.4 Polynomial spline regression . . . . . . . . . . . . . . . . . . 160

7 Minimum Risk Classification 167


7.1 Zero-one Loss Classification . . . . . . . . . . . . . . . . . . . 168
7.1.1 Bayesian Discriminant Functions . . . . . . . . . . . . 169
7.1.2 Logistic regression classification . . . . . . . . . . . . . 176
7.2 General Loss Functions . . . . . . . . . . . . . . . . . . . . . 179
7.3 Local and Universal Optimizations . . . . . . . . . . . . . . . 180
7.4 Optimal ROC Classifiers . . . . . . . . . . . . . . . . . . . . 181

8 Support Vectors and Duality Theorem 185


8.1 Maximal Margin Classifier . . . . . . . . . . . . . . . . . . . 185
8.1.1 Hyperplane . . . . . . . . . . . . . . . . . . . . . . . . 187
8.1.2 Definition of maximal margin classifier . . . . . . . . . 188
8.2 Support Vector Classifiers . . . . . . . . . . . . . . . . . . . . 190
8.3 Support Vector Machine . . . . . . . . . . . . . . . . . . . . 195
8.4 Duality Theorem with Perturbation . . . . . . . . . . . . . . 196

9 Decision Trees and Range Regressions 205


9.1 Regression Trees and UMVUE . . . . . . . . . . . . . . . . . 207
9.2 Classification Tree . . . . . . . . . . . . . . . . . . . . . . . . 210
9.2.1 Misclassification Rate . . . . . . . . . . . . . . . . . . 211
9.2.2 Gini Index . . . . . . . . . . . . . . . . . . . . . . . . 212
9.2.3 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . 213
9.2.4 UMVUE for homogeneity in classification trees . . . . 217
9.3 Extending regression trees to range regression . . . . . . . . 219

10 Unsupervised Learning and Optimization 229


10.1 K-means Clustering . . . . . . . . . . . . . . . . . . . . . . . 229
10.1.1 Clustering with Squared Euclidean Distance . . . . . . 230
10.1.2 Non-Euclidean Clustering . . . . . . . . . . . . . . . . 238
10.2 Principal Component Analysis . . . . . . . . . . . . . . . . . 240
10.2.1 Population Principal Components . . . . . . . . . . . 240
10.2.2 Sample principal components . . . . . . . . . . . . . . 242

11 Simultaneous Learning and Multiplicity 247


11.1 Sequential Data . . . . . . . . . . . . . . . . . . . . . . . . . 247
11.1.1 Wald’s sequential likelihood ratio test (SPRT) . . . . 248
11.1.2 Two-stage Estimation for Sequential Data . . . . . . . 255
11.2 Simultaneous Learning in Dose-response Studies . . . . . . . 260
11.2.1 Confidence Procedures for Aspirin Efficacy . . . . . . 261
11.2.2 Confidence Bands on Thrombolysis Effects . . . . . . 266
11.3 Weighted Simultaneous Confidence Regions . . . . . . . . . . 271
11.3.1 Weighted Hypotheses and Breast Cancer Study . . . . 272
11.3.2 Confidence Sets for Weighted Parameter Arrays . . . . 278

Bibliography 287

Index 297
Preface

Big data has penetrated every corner of our lives. Its omnipresence and the demands of the market necessitate that we understand it well. To this
end, new books and literature that explain techniques in data-oriented predic-
tion and machine learning are required. Although there are excellent textbooks
focusing on data analysis with conventional statistical approaches, as well as
outstanding textbooks addressing machine learning methods for data-oriented
approaches, to this date, nothing has merged the two comprehensively. These
two approaches over time have led to two primary camps in data science, one
focused on data-oriented analysis and another on model-based analysis. Stu-
dents, data analysts, and junior researchers are often confused about which
camp they may fall under, especially as the two data analytic camps are of-
ten seemingly contradictory to each other. There is much debate on the right
camp to select in the broader realm of data science. Written by an experienced
statistician and two data scientists, this book unifies the two frameworks into
a single overarching umbrella on data science.
Starting from a background in a basic undergraduate college statistics
course, the conventional model-based inference framework finds its founda-
tions in data analytics. It consists of an underlying model for the data, hy-
pothesis testing or confidence estimation on unknown model parameters, mea-
suring variations behind the data, and predicting an unknown quantity related
to the inference problem. Under this style of thinking, the underlying model
serves as the hub in data analysis. An implausible model assumption may thus
result in a correct answer towards the wrong problem, which can often lead to
misleading prediction results.
When addressing practical problems such as high dimensional inference,
machine learning often relies on computer intensive algorithms. Many of the
underlying thought processes and methodologies have been well-developed but
are still fundamentally based in the conventional data analysis framework. One
of the major challenges underpinning modern machine learning stems from the
gap between the conventional model-based inference and data-driven learning
algorithms. The knowledge gap hinders practitioners (especially students, re-
searchers, data analysts, or consultants) from truly mastering and correctly
applying machine learning skills in data science.
This book is designed to bridge the gap between conventional statistics
and machine learning. It provides an accessible approach for readers with
a basic statistical background to develop a mastery of machine learning. We
start with elucidating examples in Chapter 1 and introducing fundamentals on refined optimization in Chapter 2, which are followed by common supervised


learning methods such as regressions, classification, support vector machines,
tree algorithms, and range regressions. After discussing these supervised learning methods, we include a chapter on unsupervised learning as well as a chapter on statistical learning with data arriving sequentially or simultaneously from
One of the distinct features of this book is the comprehensive coverage of
the topics in statistical learning. This book summarizes the authors’ teaching,
research, and consulting experience in which they used data analytics. The
illustrating examples and accompanying materials heavily emphasize under-
standing on the two camps described above, producing accurate interpreta-
tions, and discovering hidden assumptions associated with various data anal-
ysis methods. It is designed to guide students toward effectively applying
statistical learning methods.
This book is addressed to practitioners in data science, but it is also suit-
able for upper-level undergraduate students and entry-level graduate students
who are interested in obtaining a more thorough comprehension of machine
learning. The potential audience extends to data scientists who are inter-
ested in more insightful interpretations of raw outputs generated from machine
learning. The materials of the book originate from the first author’s lecture
notes of a one-semester machine learning course taught at the University of
California Berkeley.
We are grateful to a number of people who have encouraged and con-
tributed to the writing of this book. Thanks to Professor Deb Nolan at UC
Berkeley, whose conversations and opinions partially motivated the writing
plan. The writing of the book also benefitted from suggestions, questions, and
clarifications of the students of Berkeley Stat 154 (Fall 2018) class, especially
Haotian Fu who developed an R package for range regression. We also thank
the following PhD and Master students at BGSU for their assistance and
contributions, Gul Bulbul, Asmita Ghoshal, Chao Gu, Yiheng Liu, Rachana
Mahajan, Corey Thrush, and Peiyao Wang.
We owe a great deal to our families for their great support, especially to
Binglin for her numerous suggestions; Belinda for being the supreme leader of
everything related to the book writing; Vincent and Janet for helpful discus-
sions; and Patrick for his tireless reading and editing.
The reviewers’ comments and suggestions have also significantly improved
the original writing plan of this book. Their contributions are gratefully ac-
knowledged. Of course, we are solely responsible for unavoidable typos and er-
rors. Finally, we would like to thank this book’s acquisitions editor, Mr. David
Grubbs, and editorial assistant, Mr. Curtis Hill, at Chapman and Hall/CRC,
for their helpful efforts, kindness, and patience (especially allowing us to con-
tinuously postpone the final manuscript submission from one weekend to an-
other weekend) during this project.
List of Figures

1.1 Insurance Premium-Driving Year regression . . . . . . . . . . 4


1.2 Insurance Premium-Driving Year plot . . . . . . . . . . . . . 6
1.3 Piece-wise regularization of insurance policy . . . . . . . . . . 7
1.4 First stage premium-year regression . . . . . . . . . . . . . . 8
1.5 Second stage premium-year regression . . . . . . . . . . . . . 9
1.6 Third stage premium-year regression . . . . . . . . . . . . . . 10
1.7 Small sample model misleading . . . . . . . . . . . . . . . . . 12

2.1 Regression tree of sale volume on price and advertising cost . 31


2.2 Sample ROC curve . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3 SAS codes and LOOCV on marketing analysis . . . . . . . . 44
2.4 SAS codes and 5-fold CV . . . . . . . . . . . . . . . . . . . . 45
2.5 Small sample nonparametric bootstrap . . . . . . . . . . . . . 49
2.6 Comparing nonparametric and parametric bootstrapping . . . 51
2.7 Optimal treatment regime for blood pressure instability . . . 53

3.1 Sensitivity and LDL-Cholesterol . . . . . . . . . . . . . . . . 62


3.2 LDL-Cholesterol sensitivity R-codes . . . . . . . . . . . . . . 63
3.3 Sensitivity and Bun-to-Creatinine-ratios . . . . . . . . . . . . 75

4.1 Statistics versus Data Science . . . . . . . . . . . . . . . . . . 99

5.1 Nonintrinsic linear relation . . . . . . . . . . . . . . . . . . . 120


5.2 Bounded driving years vs insurance premium . . . . . . . . . 121
5.3 Car age vs selling price with outliers . . . . . . . . . . . . . . 122
5.4 Data cloud and sample correlation coefficients . . . . . . . . . 125
5.5 Horizon Properties, data summary with a linear model . . . . 128
5.6 Regression outputs on selling prices with outliers . . . . . . . 134
5.7 Regression of concentration on dosage, age, and stress level . 136
5.8 Dosage-concentration level plot . . . . . . . . . . . . . . . . . 137
5.9 Correlations on concentration, dosage, age, and stress . . . . 138
5.10 Drug concentration level associated with age and stress . . . 140
5.11 Drug concentration level associated with stress alone . . . . . 141
5.12 Drug concentration level associated with age alone . . . . . . 142
5.13 AIC criterion and information loss in multiple regression . . . 143
5.14 R Codes for confidence prediction . . . . . . . . . . . . . . . . 145
5.15 Leverage statistic and outliers . . . . . . . . . . . . . . . . . . 149


7.1 R-codes for computation on diabetes classification risk. . . . . 173


7.2 Prostate cancer and feature selection in logistic regression . . 178
7.3 Prostate cancer, ROC curve, and treatment regimen . . . . . 182

8.1 Advertising cost and annual revenue . . . . . . . . . . . . . . 186


8.2 Post-thrombotic syndrome vs thrombolytic percentage and
stroke time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
8.3 Not linearly separable data . . . . . . . . . . . . . . . . . . . 191
8.4 When data are not completely separable between the two
classes, support vector classifier works with a cost . . . . . . . 194

9.1 Hypothetical data for regression tree construction . . . . . . . 209


9.2 Post-thrombotic syndrome and Linear regression . . . . . . . 220
9.3 Post-thrombotic syndrome and Range regression . . . . . . . 221

10.1 Numerical illustration of K-means clustering algorithm . . . . 237


10.2 Sample covariance matrix vs sample correlation matrix . . . . 245

11.1 Mean monthly income between male and female with sprr . . 251
11.2 Overall mean monthly income vs a value with sprtt . . . . . . 254
11.3 Log-likelihood ratio vs effective sizes . . . . . . . . . . . . . . 255
11.4 Codes for two-stage sequential Student-t test . . . . . . . . . 258
11.5 Sample size calculation for stock data-2 . . . . . . . . . . . . 259
11.6 Sample size calculation for stock data-3 . . . . . . . . . . . . 261
11.7 Simultaneous lower bands for thrombolysis effects . . . . . . . 271
List of Tables

1.1 Lung cancer data structure . . . . . . . . . . . . . . . . . . . 3

2.1 Pooling data and Simpson’s paradox . . . . . . . . . . . . . . 29


2.2 OR and CMH for Simpson’s paradox . . . . . . . . . . . . . . 30
2.3 Four possible outcomes in a binary classification . . . . . . . 35

3.1 Type-I and Type-II errors in hypothesis testing . . . . . . . . 59

5.1 Intrinsic linear relationships . . . . . . . . . . . . . . . . . . . 123

9.1 Systolic blood pressure dataset . . . . . . . . . . . . . . . . . 205

11.1 Success rates of thrombolytic therapy . . . . . . . . . . . . . 269


11.2 Parameter estimation on dose-response curves . . . . . . . . . 270
11.3 Analytic results of Category I for red wine. . . . . . . . . . . 282
11.4 Analytic results of Category I for white wine. . . . . . . . . . 283

1 Two Cultures in Data Science

Data is the starting point in model prediction as well as in machine learning, as addressed in the celebrated books on statistical learning [56], [72]. With a set of data, data scientists predict future responses with a certain level of scientific reliability and confidence in the accuracy of prediction. In the process of data analysis, for example, one way is to use a linear regression model in conjunction with a normality assumption on the distribution of the data. Another way is to use tree regressions or neural networks without any model assumption. As elucidated by Breiman in [13] and [11], one culture of prediction starts with an assumed relationship that underpins the response and the explanatory variables, in conjunction with a random model that governs the distribution of the data fluctuation. Data scientists then follow up with inference issues such as coefficient estimation, hypothesis testing, and prediction. This is the model-based prediction approach. Another culture of prediction starts with the data without making any model assumption. It uses training data to build a model, and makes predictions based on the trained model. This is commonly referred to as the data-driven approach. In practice, data scientists, shaped by their beliefs and operating mechanisms in data analytics, intentionally (or unintentionally) fall into either the model-based culture camp or the data-driven culture camp in prediction. In this chapter, we discuss intrinsic connections, compare evaluation criteria, and address optimality issues related to the two cultures.
Generally, when the sample size is small or moderate, the model-based approach is able to incorporate additional model information on the data to alleviate difficulties caused by insufficient sample sizes. On the other hand, when the sample size is reasonably large, the data-driven approach has the advantage of avoiding implausible model assumptions and digging out the knowledge hidden behind the data.

1.1 Model-based culture


The culture camp of model-based inference mainly consists of statisticians
who start data analysis with the assumption on the model governing the
random mechanism of the data. This includes the model-based inference in which the model underlying the data is explicitly or implicitly assumed. It also
includes non-parametric statistical analysis in which the inference is motivated
by and grounded on a set of general population homogeneity (such as common
continuous cumulative distribution functions). The model-based approach has
been well documented in conventional statistical analyses, for instance, [9],
[10], [43], and [91], among others. In this approach, we assume that the data
set is generated from a population with unknown parameters:

y = f(x|η) + ε,

where y is the response, f(x|η) is a specific function with unknown parameters η, and the random fluctuation is denoted by ε. The question of interest is
“how to use a set of data to estimate or perform hypothesis testing on the
unknown parameters η for prediction?”. In practice, the format of f (x|η) is
usually dictated by the nature of the problem, such as the linear models for
continuous responses, logistic regression models for binary responses, or log
linear models for skewed responses, to list just a few. To further illustrate this
point, we discuss the following example in statistical prediction.
Suppose that we observe a quantitative response Y and p different predictors
X1, ..., Xp, with the model,

Y = f(X1, ..., Xp) + ε,    (1.1)


where f(.) is an unknown function that contains systematic information encoded in the predictors for the response variable, and ε is the random error term.
To seek an estimate f̂(.) of the unknown function f(.), we split the original data into a training set (usually 75% of the original data) to make inference on f(.), with the aim of minimizing the mean prediction error. In this regard, statistical learning refers to a set of approaches for making inference about the unknown function f(.) in (1.1).
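To make the splitting idea concrete, the following minimal R sketch (with simulated stand-in data, since the original data set is not shown here) separates a data frame into a 75% training portion and a 25% testing portion, fits a candidate model on the training portion, and estimates the mean prediction error on the testing portion.

set.seed(1)
# Simulated stand-in data: y = f(x) + error, with f(.) unknown to the analyst
dat <- data.frame(x = runif(200, 0, 10))
dat$y <- 3 + 2 * dat$x + rnorm(200, sd = 2)

n <- nrow(dat)
train_id <- sample(seq_len(n), size = floor(0.75 * n))  # 75% for training
train <- dat[train_id, ]    # used to construct the estimate of f(.)
test  <- dat[-train_id, ]   # held out to estimate the mean prediction error

fhat <- lm(y ~ x, data = train)          # one candidate estimate of f(.)
mean((test$y - predict(fhat, test))^2)   # estimated mean prediction error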
The following two examples explain the selection of f (.) in model-based
inference.

Example 1.1 Consider the hypothetical data on lung cancer patients in a


clinical trial listed in Table 1.1. To analyze possible associations between tentative risk factors and lung cancer, a logistic regression model

log(p/(1 − p)) = α + xᵀβ,

points to the underlying connection

P(Y = 1) = exp(α + xᵀβ) / (1 + exp(α + xᵀβ)).
In Example 1.1, the response variable is the positive clinical outcome in lung
cancer, while the predictors may include gender, age, smoking, diabetes, and
Model-based culture 3

TABLE 1.1
Lung cancer data structure
Item Patient#1 Patient#2 Patient#3 Patient#4
ID N01101 N01102 N01103 N01104
Gender M F M F
Age 30 43 71 63
Smoking Y N Y Y
Diabetes N Y Y N
Hypertension Y N Y Y
Lung Cancer Y N N Y

hypertension. As shown by the data types in Table 1.1, the usual normality assumption fails; thus it is not appropriate to use a linear regression model to fit the data. Instead, since the data are case-control data, a logistic regression model is more appropriate for analyzing the odds ratio of the lung cancer disease rate across the strata defined by combinations of the risk factors. The case-control feature of the data determines the analytical approach to the unknown function f(.) and the prediction outcome on the severity of risk factors associated with the disease.
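As an illustration, a logistic regression of this form can be fitted in R with glm(); the small data frame below is hypothetical and only mimics the layout of Table 1.1.

# Hypothetical case-control records coded from the format of Table 1.1
lung <- data.frame(
  cancer   = c(1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1),
  gender   = factor(c("M","F","M","F","M","M","F","F","M","F","M","F")),
  age      = c(30, 43, 71, 63, 58, 49, 66, 52, 45, 68, 39, 57),
  smoking  = c(1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1),
  diabetes = c(0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1)
)
# log(p/(1 - p)) = alpha + x'beta: binomial family with the logit link
fit <- glm(cancer ~ gender + age + smoking + diabetes,
           family = binomial, data = lung)
exp(coef(fit))   # estimated odds ratios associated with each risk factor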
The discrete feature of the response variable in Example 1.1 determines
the logistic function for the underlying model f (.) because the outcome of
developing lung cancer is either “yes” or “no”. The next example takes the
approach of simple linear regression since the response variable Y , insurance
premium, is continuous. It sets the connection between insurance premium
and driving experience, and demonstrates that within the method of simple
linear regression, model-based prediction differs greatly from data-driven
prediction in the learning process toward the underlying model f (x|η).
Example 1.2 Assume that the insurance premium linearly decreases as the
driving year increases, more specifically, we assume the model behind the data
as
y = α + βx + ε,
where y is the insurance premium, x is the driving experience in years, α is the
intercept for the mean premium of a new driver who has no driving experience,
and β is the slope for the amount of decrease in monthly insurance premium
for the increase of each driving year. The error term ε is the random variation
attributable to other factors such as age, gender, income, marital status, etc.
The learning process toward f (.) is tantamount to the estimation of model
parameters α and β.
As usual, assume that ε follows a normal model with an unknown standard
deviation σ. In regression analysis, we estimate the values of the parameters
α and β, and use the estimated model
ŷ(x) = α̂ + β̂x

for prediction in Example 1.2.


Now, consider the relationship between 6-month insurance premium and
driving experience for the data set insurance.txt. If we assume that the insur-
ance premium (y) decreases as the driving experience (x) increases, and fit a
linear regression model, we essentially assume that

y = α + βx + ε.

FIGURE 1.1
Premium-time regression analysis of the insurance data

As shown in Figure 1.1, the estimated intercept in the linear regression


model is α̂ = 544.62, which means that the long-term average of 6-month
insurance premium for a new driver is $544.62 (since driving experience =
0). The estimated regression coefficient is β̂ = 23.89, which means that for
each additional year of driving, the 6-month insurance premium decreases, on
average, $23.89. Both p-values for the intercept and regression coefficient are
less than 0.0001, suggesting that the model is statistically significant.
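For readers who wish to reproduce an analysis of this kind, a minimal R sketch is given below; the column names years and premium are assumptions about the layout of insurance.txt, which is not displayed here.

# Assumed columns: years (driving experience) and premium (6-month premium)
ins <- read.table("insurance.txt", header = TRUE)
fit <- lm(premium ~ years, data = ins)
summary(fit)   # intercept estimates alpha, slope estimates beta, with p-values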
The validity of the above analysis is basically grounded on the following two
assumptions. First, the relationship between the 6-month insurance premium
and the driving experience is a linear function,

f(x) = α + βx.

Second, the random fluctuation term ε follows a normal model. If any one
of these assumptions fails, the corresponding data analysis becomes invalid
and the conclusion would be misleading. In the application of real-life data
analyses, the plausibility of the model assumptions usually comes from the
information of the data in the specific field of investigation. However, in situ-
ations where informative knowledge is not available for the assumption of the
specific form of f (x|η), carelessly applying a linear model (or a generalized
linear model) to a set of data may result in misleading conclusions, albeit the
model may be statistically significant.
In the next section, we follow up on the analysis of the insurance data in Example 1.2 to illustrate that model-based analyses, especially the simple linear structure of f(.) assumed there, may actually be wrong when examined through data-driven analyses.

1.2 Data-driven culture


The data-driven culture camp is basically grounded on the belief that the data
contains all information needed for prediction, without any additional model
assumption. Representative approaches in this culture camp include unsupervised machine learning, decision trees, classification, range regression, neural
As discussed in the previous section, when we make statistical inference using the model-based approach, the plausibility of the model assumptions is critical. Invalidity of one of the model assumptions may consequently ruin the whole data analysis, and the analysis becomes a “correct answer to a wrong problem,” as commented by Tukey in 1962 (see, for instance, [11], [12], and [118]). When the underlying model of the data cannot be plausibly and legitimately assumed, one way to approach the correct model f(.), when the data set is large enough, is the data-driven approach. In the sequel, we also use the terms “data-based approach” and “data-oriented approach” to refer to the same approach of seeking the underlying model behind the data. The data-driven approach usually starts with a plot of the data (here, the insurance premium data), and uses the pattern of the data to study the plausibility of the model assumptions.
We shall use the insurance premium example, Example 1.2, that we dis-
cussed with the model-based approach in the preceding section.

Example 1.3 The premium-year plot in Figure 1.2 indicates that the insur-
ance premium is not a linear function of driving year.
FIGURE 1.2
Premium-year plot of the insurance data

As shown in Figure 1.2, there is actually no linear pattern for the relation-
ship between the 6-month insurance premium and the driving year. What is
behind the data is more likely a piece-wise linear function. Figure 1.2 depicts
that during the first 3 years, the decrease of the insurance premium for each
year of driving is much larger than the corresponding change for customers
who have driven 3 to 10 years. After 10 years, essentially there is no gain in
insurance premium for any additional year of driving. This is closer to realistic practice, in the sense that new drivers are charged higher rates for the first few years (the first stage). In the second stage (3 to 10 years), although the rate still decreases as the driving year increases beyond the initial stage (0-3 years), the slope is relatively more stable compared with the changes in the first stage. After 10 years of driving, drivers essentially get a flat rate that is unrelated to any additional year of driving.
In what follows, we shall examine the statistical significance of the three
phases separately.
Figure 1.4 provides the outcome of linear regression analysis for the effect
of driving years on the 6-month insurance premium. For drivers with less
FIGURE 1.3
Plausibility of piece-wise function for insurance premium

than three years’ experience, the long-term average rate is $601.34 (when the
driving experience x = 0). And for each additional year of driving, the 6-month
premium, on average, drops $21.62 within the first three years of driving. The
two coefficients are significantly different from zero because the p-values are
much less than 0.05.
The regression result in Figure 1.5 indicates that people with three to ten years of driving experience are charged a basic rate of, on average, $399.36.
Such a rate decreases by $4.85, on average, for each additional year of driving.
Compared with drivers in the first three years of driving, this group of drivers
pays a lower starting premium ($601.34 versus $399.36), but the rate of change
for each additional year of driving is much less ($21.62 versus $4.85). All the
estimated parameters are statistically significant.
For the third consumer group, as shown in Figure 1.6, people with more than 10 years of driving experience basically pay an average flat rate of $211.01 (statistically significant), for which the impact of an additional year of driving on the 6-month premium is not statistically significant (the corresponding p-value is 0.169).
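A sketch of this stage-wise analysis in R simply fits a separate simple linear regression on each driving-experience segment; it reuses the hypothetical ins data frame and column names from the earlier sketch, with cut points 3 and 10 read off Figure 1.2.

# Piece-wise analysis: one regression per stage of driving experience
stage1 <- lm(premium ~ years, data = subset(ins, years <  3))
stage2 <- lm(premium ~ years, data = subset(ins, years >= 3 & years <= 10))
stage3 <- lm(premium ~ years, data = subset(ins, years > 10))
lapply(list(stage1, stage2, stage3), summary)  # compare intercepts and slopes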
The above example indicates that the linear model as in Figure 1.1 can
be completely different from the true model behind the data even though the p-values of the model are statistically significant. Starting with a correct
model assumption is critical in data analytics. If the initial model assumption
FIGURE 1.4
Premium-time regression for inexperienced drivers

is wrong, we may end up answering a question completely different from the


goal of the data analysis.
In general, the data-driven approach starts with
y = g(x) + ε,

where g(x) is the function governing the response y, and ε is the random
variation of the data. In data-driven inference, we do not assume any specific
function of g(x). Instead, we plot the data and use the pattern shown in the
diagram to learn the shape of g(x), denoted as ĝ(x).
Specifically, in the analysis of the insurance premium data, instead of start-
ing with the conventional linear regression model, after plotting the data in
Figure 1.2, we are able to regularize the piece-wise function which reflects
the general premium-decision policy of the company. This special function
g(.) actually governs the general relationship between insurance premium and
driving years.
The data-driven approach usually separates the original data into two portions.
One is the training set to learn about ĝ(x), and another is the testing set to
FIGURE 1.5
Premium-time regression for drivers with 3-10 years of experience.

evaluate the trained outcome. Theoretically, as pointed out in Breiman and


Friedman [14], or Stone [113], one of the general principles or criteria for the
selection of the underlying function behind the data is the EPE (expected
prediction error, or expected squared prediction error).

Definition 1.1 EPE: Let Y be the response observation and Ŷ be the pre-
diction of the response based on a set of data, the expected squared prediction
error (EPE) is defined as

EPE = E[(Y − Ŷ)²].

Note that in the above definition, if Ŷ is a predicted value of Y based on a set


of training data, and {(xi , yi ), i = 1, ..., k} is a set of test data, the estimate
of the EPE is
ÊPE = (1/k) Σ_{i=1}^k (yi − ŷi)²,

where for any i = 1, ..., k, the predicted response is the value of the corresponding trained model ĝi,

ŷi = ĝi(x1, ..., xk).

FIGURE 1.6
Premium-time regression for drivers with more than 10 years of experience
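In R, this test-set estimate of the EPE is a one-line computation; the function below is a generic sketch, to be supplied with the test responses and the predictions of any trained model ĝ.

# Estimated EPE: average squared prediction error over the test set
epe_hat <- function(y, yhat) mean((y - yhat)^2)
# e.g., for a model ghat trained on the training set:
# epe_hat(test$y, predict(ghat, newdata = test))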

It should be noted that when we apply the mean (squared) prediction error
as a criterion to evaluate the trained function ĝ(x), the optimal solution ĝ(x)
takes the form E(Y |X), as shown in the following theorem.

Theorem 1.1 Let Δ be a set of permissible functions g(x) for the model Y = g(x) + ε. We have

argmin_{g∈Δ} E_{Y|X}[|Y − g(x)|² | X = x] = E(Y | X = x).

Theorem 1.1 indicates that the model minimizing the expected prediction
error is the conditional expected value of the response given the features associated with the response of interest. Details on the proof of this theorem can be
found in [56], [119], or [120]. We will also discuss this result in Section 1.4.1
when we discuss the outcome evaluation in model-based inference.
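A quick Monte Carlo check of Theorem 1.1 can be carried out in R: fixing X = x and scanning over constant predictions c, the empirical squared-error risk is minimized near E(Y | X = x). The linear model below is chosen only for the simulation.

set.seed(2)
x <- 1.5
y <- 2 + 3 * x + rnorm(1e5)             # so E(Y | X = x) = 2 + 3 * 1.5 = 6.5
cs <- seq(5, 8, by = 0.05)
risk <- sapply(cs, function(c) mean((y - c)^2))
cs[which.min(risk)]                     # close to 6.5, as the theorem predicts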

1.3 Intrinsics between the two culture camps


It should be noted that the two culture camps of data science have their own advantages and disadvantages. For instance, the model-based camp may incorporate implausible assumptions into the model, while the data-driven approach
may over-fit the model with random features contained in the data. Both ways
of data analysis result in errors in prediction. This necessitates a discussion
on the criterion measuring the fitness of the learned model to the data.
From the insurance premium examples discussed in the preceding sec-
tions, it seems that the data-driven approach leads to a closer description of
the unknown underlying relationship between the response and the features
(predictors). It avoids implausibly making model assumptions before exam-
ining the data. However, this is true only when the size of the data is large
enough to unveil the underlying relationship, and the random features con-
tained in the data are not over-fitted into the model. If the sample size is not
large enough, the plot of the data may show patterns merely pertaining to
the training data. This may mislead the prediction in the form of the true
underlying relationship f (x). In this case, the additional information on the
model helps to navigate toward a more legitimate conclusion.

1.3.1 Small sample inference necessitates model assumptions
Consider a set of simulated data from a normal model with small sample
size and large standard deviation. Since the sample size is small, if we fit
the underlying distribution using the simulated data, we may arrive at a completely different model. As shown in Figure 1.7, although the dataset was originally generated from N(0, 7), since the sample size is only 15, the p-value of 0.0001152 rejects the hypothesis that the data come from a normal model. Instead, the dataset fits well with an exponential model (p = 0.1463) using the one-sample Kolmogorov-Smirnov test.
This example indicates that when the sample size is not large enough, there is a risk of obtaining a misleading result from the data-driven approach. Under this scenario, additional information, including appropriate model assumptions such as assuming that the underlying model is skew normal, or a regular normal model N(μ, σ²) with unknown parameters μ and σ, may
help to regularize the estimation of the underlying model toward the right
direction.
It is also related at this point to note that the non-parametric statistical
method actually pertains to the model-based culture camp. Regarding the
difference between model-based approach and data-driven approach in data
science, one of the confusing issues is the methodology named “distribution-
free” or “model-free” statistical methods. In fact, the “distribution-free” in-

> library(MASS)   # provides fitdistr()
> set.seed(127)
> y <- rnorm(15, 0, 7)
> hist(y)
> fit1 <- fitdistr(y, "normal")
> ks.test(y, "pexp", fit1$estimate)

One-sample Kolmogorov-Smirnov test

data: y
D = 0.54262, p-value = 0.0001152
alternative hypothesis: two-sided

> y <- y + 7
> fit4 <- fitdistr(y, "exponential")
> ks.test(y, "pexp", fit4$estimate)

One-sample Kolmogorov-Smirnov test

data: y
D = 0.28369, p-value = 0.1463
alternative hypothesis: two-sided

FIGURE 1.7
Small dataset misleads the underlying model

ference approach, as named in non-parametric statistics essentially goes with


assumptions on the underlying model for the data. Thus, it is actually a
model-driven inference method. For instance, the distribution-free Wilcoxon
signed rank statistic is based on the assumption that the underlying model
can be characterized by the median of the population, and the underlying
distribution is symmetric around the population median. As delineated by
Hollander and Wolfe [59] as well as Sprent and Smeeton [112], the terminol-
ogy of “distribution-free” refers to the condition that the model assumption
does not involve an explicit form of the underlying model. It still contains
assumptions on the general shape for inference, focusing on a class of model
features. This is unlike the data-driven techniques, in which there is no assumption at all on the underlying model. Thus, distribution-free non-parametric statistics is classified as a model-oriented inference approach, not a data-oriented one.

1.3.2 Prediction accuracy demands large sample


One of the drawbacks of the model-based approach is that almost all model assumptions are oversimplified for real-life data due to the complexity of real-life situations. When a data set with a reasonable sample size is available, it is more legitimate to dig out the pattern (model information) directly from the data. The advantage of the data-driven approach, when available, is to avoid blind model assumptions on f(.); thus it is more reliable in prediction. Studies of data-oriented approaches constitute the major part of data science. Certainly, this way of establishing a model is essentially different from classical statistics, in which we start with a set of random samples from a population before making predictions. Since the methodology is only valid when the sample size is large enough, it should be noted that practices of data-driven approaches with small samples are usually misleading and risky.
It is relevant at this point to emphasize that the data-driven (or data-oriented) approach is grounded upon asymptotic properties of functions of the available data. The accuracy and reliability of the data-oriented approach heavily rely on the available data. That is the well-known “data-hungry” drawback of data-driven model analysis.

1.3.3 Which camp to go?


Theoretically, it seems that the sample size is a key component in the selec-
tion between model-based analytics and data-oriented analytics. However, for
a set of data, there is no unified way to determine how large is large enough
for the sample size for the selection of inference approaches. Similarly, with-
out checking the background information of the data, it is risky to assert
appropriateness on the model-based approach or the data-oriented approach,
. In practice, some rules of thumb are commonly applicable. For instance, the
background information of the data may permit a plausible model assump-
tion for a function of some existing models (even though it is approximate),
This includes a linear combination of mix normal models for the height of a
population, or a combination of mix exponential models for time effects. See
for example [84], [42], and [40]. On the other hand, if the information on the
distribution of the data is completely unknown, data-driven methods such as
the spline methods or decision trees are more appropriate for the analysis (see,
for example, [13], [120], and [121], to list just a few).
In the case where no distinguishable features of the data set are available
to justify whether we should go with the model-based camp or data-driven
camp, the performance of the approach becomes the follow-up criterion for
consideration. Inference methods with better performance should be selected.
For instance, the model-based camp evaluates the performance of an inference approach by the expected prediction error, by controlling the type-I error with high power in hypothesis testing of the associated model parameters, or by a high confidence level and low estimation error in estimation. On the other hand,
the data-driven culture camp uses testing errors, such as false positive rate,
false negative rate, or expected mean prediction error computed with the test-
ing data. For example, in a prediction problem, generally, the method that
predicts more closely to the true value should be selected.
As discussed above, when the dataset does not contain adequate information on the selection of an inference culture camp, the evaluation criterion and accuracy measurement essentially dictate the selection of prediction methods between the model-based camp and the data-driven camp. However, each camp
has its own well-developed criteria on measuring closeness for various infer-
ence problems. In what follows, we shall discuss the evaluation criteria and
optimizing strategies on the evaluation of inference performance for each of the
two analytic culture camps in data science. Specific concepts and terminolo-
gies pertaining to a specific algorithm will be addressed in the corresponding
chapter when the topic is discussed.

1.4 Learning outcome evaluation


This section contains discussions on evaluation criteria for the model-based analytic approach and the data-driven approach. We shall start with model-based approaches. With the rapid development of statistical theory today, it is
impossible to exhaust all methods of model evaluation in one section. As
an introduction, we will use hypothesis testing as an illustrating example to
elucidate the evaluation criteria in model-based inference. Hypothesis testing
is a common approach covered in almost every statistics textbook.
Following the discussion on model-based approach, we will discuss logistic
regression built from a set of training data in the next subsection to discuss
evaluation criteria for data-oriented analytic approaches.

1.4.1 Error rates in model-based culture camp


One of the common evaluation criteria is the expected prediction error (EPE)
which directly measures the squared error between the observed and the pre-
dicted values.

EPE(f) = E[Y − f(X)]²
        = ∫ [y − f(x)]² dF(x, y).    (1.2)

Since X is the predictor, by conditioning on X, we have

EPE(f) = E_X{E_{Y|X}([Y − f(X)]² | X)}.



Thus, the search for the underlying function f (x) can be achieved by mini-
mizing the EPE point-wise.

f(x) = argmin_c E_{Y|X}([Y − c]² | X = x).

Now, consider

E_{Y|X}([Y − c]² | X = x)
= E_{Y|X}[(Y − E(Y|X = x) + E(Y|X = x) − c)² | X = x]
= E_{Y|X}{[Y − E(Y|X = x)]² | X = x} + E_{Y|X}{[E(Y|X = x) − c]² | X = x},

since the cross term vanishes: E_{Y|X}{[Y − E(Y|X = x)][E(Y|X = x) − c] | X = x} = 0.

Thus, when the value c takes E(Y |X = x), the conditional expected value
reaches its smallest possible value. And the solution is

f(x) = E(Y | X = x).

To illustrate how the model-based approach evaluates the performance of


an inference procedure, we discuss a simple example in hypothesis testing
on the inference for a normal population mean. In this case, the evaluation
criterion is the power and significance level of the test.

Example 1.4 Assume that we have a set of data following N(θ, σ²) with unknown mean θ and standard deviation σ, and we are interested in testing θ = 0 versus θ ≠ 0. Notice that the model assumption here is that the data follow a normal model with unknown mean and standard deviation, and we want to make a prediction on the asserted θ value.

Under the above setting, the null space is θ ∈ Θ0 = {θ : θ = 0}, and the alternative space is θ ∈ Θ1 = {θ : θ ≠ 0}.
We usually use the Student-t statistic as the test statistic
T_{n−1} = (X̄ − 0)/(s/√n),

where X̄ is the sample mean and s is the sample standard deviation, with

s² = (1/(n − 1)) Σ_i (Xi − X̄)².
To evaluate the performance of the test statistic, conventionally we use
the following two criteria,

Definition 1.2 P(Type-I error): The conditional probability of incorrectly


rejecting the null hypothesis.

Definition 1.3 P(Type-II error): The conditional probability of incorrectly


not-rejecting the null hypothesis.

Usually, the probability of Type-II error is reported by its counterpart, the


power of a test.

Definition 1.4 Power: Power = 1 − P(Type-II error), which is the conditional probability of correctly rejecting the null hypothesis, namely, the probability of making a correct decision by rejecting the false null hypothesis.

Thus a way to evaluate a test statistic is to examine its probability of


type-I error and the power of the test. We usually fix the probability of type-
I error at level α, and use the concept of the power function to evaluate a
test with a rejection region R for a given significance level. In the case where
the alternative space contains more than one element, we use the concept
of a power function to measure the probability of Type-II error when the
population mean travels in the alternative space. When the null space contains
more than one element, we use the largest possible probability of Type-I error among the probabilities associated with all parameters in the null space. This can
be conveniently defined by the concept of a power function.

Definition 1.5 Power function: The power function corresponding to a hy-


pothesis test with rejection region R is the function defined on the domain of
the parameter θ, which takes the form β(θ) = P_θ(X ∈ R), where θ is the
parameter of primary interest in the hypothesis.

With the above definition, conventionally as defined in the literature, a test with power function β(θ) is a size α test if

sup_{θ∈Θ0} β(θ) = α.

A test with power function β(θ) is a level α test if sup_{θ∈Θ0} β(θ) ≤ α.
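For the one-sample t test of Example 1.4, the power function can be evaluated numerically in base R with power.t.test(); the sample size n = 20 below is an arbitrary choice for illustration.

# Power of the level-0.05 two-sided one-sample t test over a grid of alternatives
theta <- seq(0, 2, by = 0.25)           # alternatives measured in units of sigma
beta  <- sapply(theta, function(d)
  power.t.test(n = 20, delta = d, sd = 1, sig.level = 0.05,
               type = "one.sample", alternative = "two.sided")$power)
cbind(theta, power = round(beta, 3))    # beta(0) recovers the size alpha = 0.05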


Obviously, the above analysis is grounded on the assumption of the underlying model. If the model cannot be plausibly assumed, there is no test statistic. Under this scenario, if the sample size is large enough, the data-oriented approach can be applied. As an introduction to the evaluation criteria for the data-oriented analytic approach, we discuss the evaluation criterion of a data-driven method in the next subsection.

1.4.2 Cost functions in neural networks


When the model of the underlying data cannot be plausibly assumed, an alternative is to skip the model assumptions and directly make inference without any assumption on the underlying data model. This data-oriented approach can be applied especially when the data set is large. For instance, one may partition the data into a training set (usually 75 percent of the data) and a testing set (usually 25 percent of the data), train the model using the training data, and then evaluate the trained model with the testing data.

Similar to the preceding subsection, which used hypothesis testing as an illustrating example, in this subsection we use neural networks as an example to illustrate the evaluation of data-oriented inference. As suggested by its name, the fundamental principle of neural networks is inspired by the biology of the human brain. Neural networks constitute a substantial part of artificial intelligence; similarly, genetic algorithms are built upon evolution, and particle swarm optimization is based on animal social behaviors. Generally speaking,
a neural network is a directed graph, with neurons as nodes and weights as
edges. Every neuron activates, or outputs, with a strength that is a function
of the element-wise multiplication of the inputs with the edge weights.
Mathematically, a neural network is a composite function of the input information, as defined in Goodfellow (2016) [51] and Lee (2017) [79]. In other words, define w_ij as the weight value between nodes i and j, n_i as the activation of node i, and N_j as the list of node indexes that are connected to j; then node j will activate with strength

n_j = F( Σ_{i∈N_j} w_ij n_i ),

where F is the activation function that allows the network to produce nonlin-
ear behaviors.
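A single activation of this form is easy to compute directly; the following R sketch uses the logistic (sigmoid) function as one common choice of F, with hypothetical weights and input activations.

sigmoid <- function(z) 1 / (1 + exp(-z))   # one common activation function F

n_in <- c(0.2, -0.5, 0.9)        # activations n_i of the nodes feeding into j
w    <- c(0.7,  0.1, -0.3)       # hypothetical edge weights w_ij
n_j  <- sigmoid(sum(w * n_in))   # n_j = F(sum_i w_ij * n_i)
n_j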
We usually provide input into the neural network by activating a set of
nodes with specific values and read output from any subset of nodes similarly.
These networks are typically organized in layers of neurons, which indicate the
depth of each node. In this way, layers are typically fully connected, meaning
that all nodes in one layer are connected to all nodes in the next layer. This
allows a computationally efficient model of the weights as a matrix M, taking an input vector V to the output vector MV. Feed-forward networks, for example, are constructed in this way; their structure is often considered fixed and serves to provide a final classification. Key limitations of fully connected layers prevent them from being suitable for use as the sole structure of larger networks. Because of the fully connected nature of the layers, they require an immense amount of memory. Such a layer between two sets of just 10,000 nodes would require 100 million parameters, while modern networks often have a total of 10 million parameters. This extreme capacity, besides being inefficient, can also be problematic for training in general. There is no sense of locality in such a layer, as every node is treated individually. This means that it is nearly impossible to train higher-level features that should be treated equally across all areas of the input (which is of particular interest to problems like image classification). However, even with these limitations, fully connected layers remain critical for the tasks they perform.
Besides the feed-forward network, another key component of neural networks is back propagation, an algorithm that lets errors accumulated at the output layer of the network propagate backward through the network, training it in the process. For example, if the network's
output is O, but the correct response would be C, we can calculate the error
E = O − C. From this, we need a cost function that determines how errors


are judged, a typical example may be the L2 loss

Cost(O − C) = Σ_{i=0}^n ||O_i − C_i||².

However, since we know that



O = F( Σ_{i=0}^n w_i a_i ),

it is possible to figure out the influence each weight had on the error by taking
the partial derivative of the cost function with respect to the weight.
Utilizing this partial derivative, each weight can be modified, and the error can be propagated further back to the preceding layer.
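The following R sketch carries out one such gradient step for a single sigmoid neuron under the L2 cost; the inputs, weights, target, and learning rate are all hypothetical, and a full back-propagation pass would repeat this layer by layer.

sigmoid <- function(z) 1 / (1 + exp(-z))
a  <- c(0.2, -0.5, 0.9)     # inputs to the neuron
w  <- c(0.7,  0.1, -0.3)    # current weights
C  <- 1                     # correct (target) response
lr <- 0.5                   # learning rate

O    <- sigmoid(sum(w * a))            # network output
# d Cost / d w_i = 2 (O - C) * O (1 - O) * a_i, using sigmoid'(z) = O(1 - O)
grad <- 2 * (O - C) * O * (1 - O) * a
w    <- w - lr * grad                  # weight update from the partial derivatives
w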

1.5 Learning process optimization


Following the introduction of the evaluation criteria for the two cultures of
data analysis, we shall introduce general principles of the optimization
process for improving inference performance, which consists, for instance, of
the probabilities of type-I and type-II errors in hypothesis testing for the
model-based camp, or testing error rates for the data-driven camp.
For the model-based culture camp, we will still use hypothesis testing
as an introductory example to illustrate general strategies in controlling the
Type-I error and optimizing the power, which is equivalent to lowering the
probability of Type-II errors. Similar discussions can be applied to estima-
tion, predictions, and classifications. For the data-oriented camp, we will use
convolutions and deep-learning as introductory examples to discuss the opti-
mization process. More details on optimizing strategies pertaining to specific
topics will be addressed later in the corresponding chapters.
The process of statistical learning involves correctly identifying the
appropriate data science camp so as to learn from the data with optimal
learning outcomes according to the associated evaluation criteria.

1.5.1 Model-based camp


In this subsection, we use hypothesis testing as an example to illustrate the
optimization process for the model-based culture camp. The optimization
criteria discussed in the previous section serve as standards to seek (or
approximate) the testing procedure with optimal performance. For example, when
the underlying model is assumed for a hypothesis testing problem, the two
evaluation criteria are the probability of a Type-I error and the probability
of a Type-II error. A test procedure with the "best" performance will be the
one that reaches the highest permissible power with rejection region R at a
pre-specified significance level α.
Under this setting, the issue of finding an optimal inference procedure
becomes finding the most powerful test for a given significance level. There
are rich references at various levels in the literature in this regard, for
instance Lehmann and Casella [80], Casella and Berger [16], as well as Lehmann
and Romano [81], to list just a few. We will briefly discuss basic results here
to facilitate the understanding of the strategy in the construction of optimal
learning procedures for the model-based culture camp. More systematic details
can be found in the classical statistics literature such as [16] or [81].
For a set of data (random sample) X = (x1 , ..., xn )T , assume that we are
interested in determining whether the underlying model is f (x|θ1 ) or f (x|θ0 ).
The optimal learning strategy toward a most powerful (MP) test for a given
significance level α is the Neyman-Pearson fundamental lemma.

Theorem 1.2 Neyman-Pearson Lemma: Assume that the underlying model
(pdf or pmf) of a set of data X is f(x|θ) ∈ {f(x|θ0), f(x|θ1)}; namely, there
are only two candidate models, one for the null hypothesis and the other for
the alternative hypothesis. Consider testing H0: θ = θ0 vs H1: θ = θ1. The
level-α most powerful test is the one with the rejection region R satisfying
$$ X \in R \;\; \text{if} \;\; f(X|\theta_1) > k f(X|\theta_0); $$
$$ X \in R^c \;\; \text{if} \;\; f(X|\theta_1) < k f(X|\theta_0), $$
for some constant k ≥ 0 so that $P_{\theta_0}(X \in R) = \alpha$.

In what follows, we shall use an example to illustrate the application of the
Neyman-Pearson Lemma toward optimal learning procedures in the model-based
culture camp.

Example 1.5 Let X be a random variable associated with diabetes symptoms.
Assume that, from the historical data of patient records, the likelihood of
each symptom under H0 (blood glucose level ≤ 100) versus H1 (blood glucose
level > 100) for patients in a specific hospital is given in the following table.

x         x1    x2    x3    x4    x5    x6    x7    x8
f(x|H0)   0.02  0.02  0.02  0.01  0.05  0.01  0.41  0.46
f(x|H1)   0.22  0.02  0.12  0.04  0.18  0.02  0.06  0.34

Notation: x1 -numbness; x2 - weight loss; x3 -swollen gums; x4 - slow healing;


x5 -increased appetite; x6 -blurred vision; x7 -energy loss; x8 -frequent thirst.
Based on the symptoms of the patient, we are interested in diagnosing
whether a newly admitted patient has diabetes, with the restriction that the
chance of incorrectly diagnosing a healthy patient as having diabetes is 5%.
Solution: Notice that in this case, the likelihood ratio for each symptom,
$$ \lambda = \frac{f(x|H_1)}{f(x|H_0)}, $$
takes the following values:

x         x1    x2    x3    x4    x5    x6    x7     x8
f(x|H0)   0.02  0.02  0.02  0.01  0.05  0.01  0.41   0.46
f(x|H1)   0.22  0.02  0.12  0.04  0.18  0.02  0.06   0.34
λ         11    1     6     4     3.6   2     0.146  0.739

Thus, on the basis of the Neyman-Pearson lemma, we rank the symptoms by
the likelihood ratio of diabetes versus non-diabetes. To satisfy the evaluation
criterion of controlling the rate of mis-diagnosis at the 5% level, namely
P(X ∈ R|H0) = 0.05, the rejection region consists of the symptoms with the
highest likelihood ratios whose null probabilities sum to 5%:
R = {numbness, swollen gums, slow healing}.
The chance of correctly diagnosing a diabetes patient, which is the power of
this MP test at size 0.05, reads
$$ Power = P(X \in R | H_1) = 0.22 + 0.12 + 0.04 = 38\%. $$

Certainly, when the blood glucose level test is available, the laboratory test
result is more accurate in detecting diabetes as a follow-up diagnosis.
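The construction in Example 1.5 is easy to automate. The following short
Python sketch (the dictionaries simply re-enter the likelihood table above)
ranks the symptoms by likelihood ratio and greedily accumulates them into the
rejection region until the size α = 0.05 is exhausted.

f0 = {'x1': .02, 'x2': .02, 'x3': .02, 'x4': .01, 'x5': .05,
      'x6': .01, 'x7': .41, 'x8': .46}   # f(x | H0)
f1 = {'x1': .22, 'x2': .02, 'x3': .12, 'x4': .04, 'x5': .18,
      'x6': .02, 'x7': .06, 'x8': .34}   # f(x | H1)

# Rank outcomes by likelihood ratio lambda = f1/f0, largest first
ranked = sorted(f0, key=lambda x: f1[x] / f0[x], reverse=True)

alpha, size, R = 0.05, 0.0, []
for x in ranked:                 # greedily build the rejection region
    if size + f0[x] <= alpha + 1e-12:   # tolerance for float rounding
        R.append(x)
        size += f0[x]

power = sum(f1[x] for x in R)
print(R, size, power)            # ['x1', 'x3', 'x4'], 0.05, 0.38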

The above example makes inference on an unknown parameter under assumed
models. The intuition is essentially that of the likelihood ratio test, which
optimizes the evaluation criterion by maximizing the power of the test while
controlling the probability of incorrectly rejecting the null hypothesis. The
rationale of the Neyman-Pearson lemma is grounded in the intuition that we
reject the null hypothesis if the parameter in the alternative space is more
likely to have generated the data.

1.5.2 Data-Driven camp


A convolutional neural network is the product of chaining together
convolutions, which perform efficient feature extraction, with the standard
feed-forward neural network structure. It is shown in [78] that the same
back-propagation methods used to train other networks can also be applied to
convolution layers, allowing convolutional neural networks to learn their own
feature extractors. This allows a convolutional neural network to determine
what kind of high-level feature extraction is necessary for the specific
problem. More importantly, it allows networks to automatically chain
convolution layers, so that the initial information can pass through multiple
layers of feature extraction, all determined automatically from the training
data. Using a convolution kernel to pre-process the image proves to be
critical to the performance of modern deep learning methods, as a small kernel
can operate over a large image in parallel (see, for example, [79]).
In mathematics, a convolution is essentially an outcome of interpreting one
function in the context of the other via the following formulation,
$$ (f * g)(t) = \int_{-\infty}^{\infty} f(r) g(t - r) \, dr. $$
From the perspective of modern deep learning, the primary use of the
convolution technique is confined to the range of the convolution kernel
$g(\cdot)$ over $(0, s)$, as follows:
$$ (f * g)(t) = \int_{0}^{s} f(r) g(t - r) \, dr. $$

For example, consider the basic edge-detecting matrix,
$$ E = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}. $$
This convolution will perform the element-wise matrix multiplication of the
kernel $E$ with the immediate neighbors of each pixel, and then aggregate the
elements by summation. That is, if the pixel values around a specific pixel $e$
are
$$ P_e = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix}, $$
the convolution at that pixel will be
$$ P_e * E = 0a + 1b + 0c + 1d - 4e + 1f + 0g + 1h + 0i = (b + d + f + h) - 4e. $$

Accordingly, this creates a new matrix, with each element representing the
convolution kernel applied at that point. As shown above, the convolution Pe ∗
E will have the strongest activation where there is a strong difference between
the pixel e and its neighbors (b, d, f, h), thus performing a basic localized form
of edge detection.
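A minimal numerical sketch of this localized edge detection (Python with NumPy
and SciPy; the 5 × 5 test image is a hypothetical example constructed for
illustration):

import numpy as np
from scipy.signal import convolve2d

# The basic edge-detecting kernel E from the text
E = np.array([[0,  1, 0],
              [1, -4, 1],
              [0,  1, 0]])

# A hypothetical 5x5 grayscale "image" with a bright vertical stripe
img = np.array([[0, 0, 9, 0, 0],
                [0, 0, 9, 0, 0],
                [0, 0, 9, 0, 0],
                [0, 0, 9, 0, 0],
                [0, 0, 9, 0, 0]], dtype=float)

# Apply the kernel at every pixel; mode='same' keeps the output 5x5.
# E is symmetric, so convolution and correlation coincide here.
edges = convolve2d(img, E, mode='same')
print(edges)   # activation is strongest along the stripe borders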

Summary
The impact of the books in statistical learning ([56], [72], [119], and [120])
and deep learning ([5], [51]) has penetrated various applied fields, including
medical research ([1], [96], [97], [98], [111], and [116], among others).
Methods in statistical learning essentially challenge classical statistical
theory and methodologies, with seemingly discernible boundaries between the
two. In this chapter, we follow the idea of two cultures in statistical
modeling by Breiman, Diaconis, and Efron ([9], [10], [11], [13], and [44]) to
analyze differences and intrinsic connections between the two culture camps in
data science.
Because data analysts differ in background training, knowledge, beliefs, and
experience, the correct way of performing data analytics is debatable: a
uniformly correct decision does not exist. Instead of sailing in one direction
on methodologies in data analytics, this chapter goes through the two
directions from the evaluation criteria to the optimization processes. We use
a numerical example on insurance premiums to elucidate that blindly performing
either approach in data science may result in misleading conclusions. The
model-based inference camp demands plausibility of the model assumptions
behind a set of data (see, for example, [22], [25], [28], [40], [42], [43],
[52], [53], and [91], among others). On the other hand, the data-driven
inference camp necessitates large sample sizes ([44], [51], [54], [79], among
others).
There is a dilemma in the selection between the two data inference culture
camps, and seeking to thoroughly clarify their differences partially motivates
the compilation of this volume. The choice of the analytic culture camp should
be grounded in the feature information of the data. For instance, although big
data or data with high dimension is often regarded by some as the motivation
for data-driven technologies, the problem with high dimension is actually due
to the fact that the sample covariance matrix is not positive definite with
probability one when n < p (see, for instance, Xie and Chen (1988 [126], 1990
[127])). Large dimension by itself is not an issue: for regression analysis in
the model-based camp, most of the theorems start with k dimensions where k is
any positive integer.
Once the framework of analytics is settled, the key component is the selection
of the evaluation criteria and optimization procedures. We use the UMP
(uniformly most powerful) test as an illustrating example for the model-based
culture camp, and the MEPE (mean expected prediction error) for the
data-driven culture camp, for the selection of suitable analytical approaches.
The chapter concludes with a discussion on the principle of optimization
strategies for the two analytical culture camps in data science. The remaining
chapters basically follow the theme and road map set in this chapter to
delineate the two inference camps in data science.
2
Fundamental Instruments

This chapter discusses basic elements of statistical prediction and machine


learning. The learning process usually consists of three fundamental parts:
identifying data, building or training models, and evaluating models. To see
why data identification is one of the critical steps in the learning process,
consider the following simple example.
Example 2.1 Assume that we have 15,000 breast cancer patients and 15,000
healthy participants in a case-control dataset. If we do not pay attention to the
way in which the dataset was collected, it is easy to incorrectly claim that the
disease rate is
P (D) = 15000/(15000 + 15000) = 50%.
In fact, some software even generates such a disease rate automatically. Notice
that this is misleading because the numbers of cases and controls are prefixed
before data collection in case-control data. The pre-determined 50% comes from
the design stage of data collection; it has nothing to do with the disease risk
P(D), regardless of the sample size, learning methods, or testing methods.
Correctly identifying the features of the dataset helps us to select the
appropriate approach and avoid common misleading errors in statistical
learning.
Since the primary resource in the learning process is data, in this chapter
we start with a discussion on the identification of different types of data.
Certainly, an integral part of data science is computation. However, a
computer is, overall, just a machine that runs code to carry out complicated
computational tasks. If the learning method does not match well with the input
data, the corresponding output mechanically produced by the machine can lead
in a completely wrong direction. For instance, in the above breast cancer
example, if a set of case-control data is mistreated as a set of cohort data,
the learning result will be completely fallacious. The first issue in
statistical learning is thus understanding the background information,
features, and characteristics of the input data so that the corresponding
learning outcomes can be properly formulated.
Besides data identification and conventional learning methods such as re-
gression, estimation and testing, other essential instruments in the implemen-
tation of machine learning include regression trees and classification trees. We
shall also introduce the concept of decision trees as the second theme in this
chapter to facilitate related discussions in follow-up chapters.


With a set of well-identified data, appropriately formulated prediction
parameters (targets of learning outcomes), and implementation instruments such
as computing algorithms for regression trees, the third theme in this chapter
focuses on essential features of model evaluation, which include sensitivity
and specificity, ROC curves, cross-validation, and bootstrapping. ROC curves
plot sensitivity against specificity over different classification thresholds.
Cross-validation is a critical approach for fine-tuning models learned from
training data, while bootstrapping is a data-driven approach that captures
unknown features by re-sampling algorithms.

2.1 Data identification


The type of data dictates the learning tools and directions in prediction. We
usually assume that the data constitute a random sample that fairly represents
the population of interest. However, not all data are random samples, as often
seemingly assumed. Note that the type of data critically affects the learning
method and the interpretation of the learning results. For instance,
statisticians have developed various analytic tools according to different
sampling methods in experimental designs. On the other hand, real-life
scenarios often do not permit the collection of data by random experiments,
such as performing drug efficacy and toxicity experiments directly on human
beings.
To see the point mentioned above, consider the analysis of the relationship
between smoking and lung cancer. It is ethically inappropriate to conduct an
experiment that exposes young children to cigarettes in order to test whether
smoking or nicotine intake stimulates the development of lung cancer in
children. In fact, most of our data (especially big data) are observational
data. Different from experimental data, observational data mainly come in
three forms, according to the approach to sample collection: case-control
data, cohort data, and cross-sectional data. In the remainder of this section,
we will use databases in medical studies to illustrate the importance of
selecting appropriate learning approaches and prediction procedures according
to the features of the data.

2.1.1 Data types


Definition 2.1 Case-control data: Case-control data are collected retro-
spectively according to the health status of patients. The case group contains
retrospective features of diseased subjects, while the control group contains ret-
rospective medical records of healthy subjects. The goal of a case-control study
is usually to investigate potential risk factors associated with the disease.

Case-control design is an effective way to collect data retrospectively and
avoid conducting experiments on recruited subjects. It is used for the
prediction of the odds ratio of a disease risk under two risk factors. It
normally starts with a prefixed number of cases (disease population) and
controls (non-disease population), and retrospectively traces the
characteristics, features, and risk status of the patients to learn (or
predict) the unknown risk of a disease. Case-control studies often appear in
clinical trials or medical investigations of risk factors associated with
diseases such as stroke, breast cancer, or lung cancer, to list just a few.
More information in this regard can be found in [1], [18], [111], and [116],
among others. To control possible confounding effects, some clinical trials
use matched case-control data where subjects are matched on demographic
factors such as age, gender, or race.
Definition 2.2 Cohort data: Cohort data are data collected prospectively
over a period of time. A cohort study usually recruits two groups of
participants: one is exposed to the risk factor of interest, and the other is
not. The recruited participants are observed over a period of time for records
on disease development or occurrence of disease symptoms.
A cohort study is more reliable for the causal relationship between the
exposed and the diseased populations because it observes the development of
the disease over time. However, it usually takes a long time and requires
large sample sizes to control possible confounding factors and sampling bias.
More examples of medical studies related to prospective studies can be found
in, for example, [97], [98], and [99].
Definition 2.3 Cross-sectional data: Different from case-control data
(where the total numbers of cases and controls are fixed before sampling) and
cohort data (where the numbers of exposed and non-exposed are fixed prior to
the beginning of the study), cross-sectional data only fix the total number of
subjects in the study, at a single time point or a fixed location. Subjects
involved in cross-sectional data are assumed to form a random sample that
fairly represents the surveyed population.
For example, in a medical survey of 200 diabetes patients, the population
is diabetes patients. If 30 out of 200 patients have hypertension in the survey
outcome, the estimated prevalence for hypertension is 15%. If 40 out of 200
patients have a smoking history, the estimated smoking rate is 20%. This is
because in cross-sectional data, we do not pre-fix the number of smoking or
hypertensive patients.
In terms of sampling cost, collecting cross-sectional data may be less
expensive since it only involves surveys. On the other hand, since the numbers
of diseased and healthy subjects are unknown, the data may become useless when
the number of cases (or controls) is too small to learn anything about the
feature of interest.
The method of data collection also dictates the corresponding analytical tools
in statistical learning and prediction. As the primary source of information,
data critically influence the learning outcome. To see this point, consider
the analysis of the risk of a disease with a set of case-control data where
the number of cases and the number of controls are determined before the data
collection stage. Regardless of the sample size and learning methods, the
case-control dataset does not lead to any information for the prediction of
the prevalence of the disease. Similarly, in a cohort study that contains
prefixed numbers of patients exposed (and non-exposed) to a potential risk
factor, it is methodologically fallacious to use such data to predict the
prevalence of risk exposure, P(E). This is because the ratio of risk exposure
is given in cohort data before sampling, by the same reasoning as for the
prediction of the disease rate P(D) with case-control data. Identifying the
dataset correctly helps us avoid making misleading conclusions in statistical
learning.

Definition 2.4 Invariant measurement: Let m(X|D) be a measurement


based on data X obtained by the data collection method D. For two different
data collection methods D1 and D2 , if m(X|D1 ) = m(X|D2 ), the data mea-
surement m(X) is called invariant for the two data collection methods D1 and
D2 .

Example 2.2 Consider the sample disease rate, P̂(case), with the measurement
$$ m(X) = \frac{\text{number of cases}}{\text{total sample size}}. $$
Since
$$ \hat P(\text{disease} \mid \text{case-control data}) \neq \hat P(\text{disease} \mid \text{cohort data}), $$
the prediction of the disease rate, which is the sample proportion m(X), is
not invariant between case-control data and cohort data.

Definition 2.5 Sample odds ratio: Consider a data set in a 2 × 2
contingency table:

             Disease   Healthy   Total
Exposed         a         b       m1
Nonexposed      c         d       m2
Total           n1        n2      n

The sample odds ratio of disease for exposed patients is defined as
$$ \widehat{OR} = \frac{ad}{bc}. $$
Interpretation of the sample odds ratio: If the sample odds ratio is around 1,
exposure to the risk factor does not affect the odds of getting the disease;
if the odds ratio is larger (smaller) than 1, exposure to the risk factor
increases (decreases) the odds of getting the disease. Due to the randomness
behind the data, we usually claim significance of the odds ratio at the 0.05
significance level when the 95% confidence interval of the odds ratio is
located entirely within (0, 1) or (1, ∞), that is, when the interval excludes
1.

The following theorem shows that the sample odds ratio is invariant between
case-control data and cohort data. In contrast to the sample disease
prevalence rate, which depends on the data collection method, the sample odds
ratio does not. This means that as long as we use the sample odds ratio to
measure the association between the disease and risk factors, the value of the
sample odds ratio is invariant.

Theorem 2.1 The sample odds ratio is invariant between case-control data
and cohort data.

Proof: Denote by P(D|E) the probability of disease in the exposure group,
P(D|E^c) the probability of disease in the control group, P(E|D) the
probability of exposure in the disease group, and P(E|D^c) the probability of
exposure in the non-disease group.
For case-control data, we have the sample odds of exposure in the case group,
$$ \frac{\hat P(E|D)}{\hat P(E^c|D)} = \frac{a/n_1}{c/n_1} = \frac{a}{c}, $$
and the sample odds of exposure in the control group,
$$ \frac{\hat P(E|H)}{\hat P(E^c|H)} = \frac{b/n_2}{d/n_2} = \frac{b}{d}. $$
Thus, the estimated odds ratio for the case-control data reads
$$ \widehat{OR}_{case\text{-}control} = \frac{a/c}{b/d} = \frac{ad}{bc}. \qquad (2.1) $$
For cohort data, we have the sample odds of disease in the exposed group,
$$ \frac{\hat P(D|E)}{\hat P(D^c|E)} = \frac{a}{m_1} \Big/ \frac{b}{m_1} = \frac{a}{b}, $$
and the sample odds of disease in the non-exposed group,
$$ \frac{\hat P(D|E^c)}{\hat P(D^c|E^c)} = \frac{c}{m_2} \Big/ \frac{d}{m_2} = \frac{c}{d}. $$
Thus, the estimated odds ratio for the cohort data reads
$$ \widehat{OR}_{cohort} = \frac{a/b}{c/d} = \frac{ad}{bc}. \qquad (2.2) $$
Comparing equations (2.1) and (2.2) yields the conclusion of the theorem.
Although the estimate of the disease prevalence is not invariant between
cohort and case-control studies, Theorem 2.1 shows that the sample odds ratio
is invariant between a case-control study and a cohort study. In fact, the
population odds ratio is also invariant, by a conditional probability
argument:
$$ \text{Population odds ratio} = \frac{ \dfrac{P(D|E)}{1 - P(D|E)} }{ \dfrac{P(D|E^c)}{1 - P(D|E^c)} } = \frac{ \dfrac{P(E|D)}{1 - P(E|D)} }{ \dfrac{P(E|D^c)}{1 - P(E|D^c)} }. $$

Thus, for the evaluation of disease risk related to an exposure, it is
legitimate to use the odds ratio for case-control and cohort studies, instead
of directly using the estimate of disease prevalence.
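A quick numerical check of Theorem 2.1 (Python; the 2 × 2 counts a, b, c, d
are hypothetical): conditioning on disease status, as in a case-control
design, and conditioning on exposure status, as in a cohort design, give the
same sample odds ratio ad/(bc).

a, b, c, d = 30, 70, 10, 90   # hypothetical 2x2 counts

# Case-control view: odds of exposure among cases vs. controls
or_case_control = (a / c) / (b / d)

# Cohort view: odds of disease among exposed vs. non-exposed
or_cohort = (a / b) / (c / d)

print(or_case_control, or_cohort, (a * d) / (b * c))   # all three are equal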
For cross-sectional data, the learning method for the odds ratio is different
from that for case-control data and cohort data, since the marginal sums are
random variables. We may use the conditional noncentral hypergeometric model
to make inference on the odds ratio of disease between the exposed and
unexposed populations.
There are also other types of data (such as time series data, survival data,
longitudinal data, etc) that necessitate specific analytical/learning approaches
for the characteristics and features of the data. To avoid misleading prediction
outcomes in learning the model behind the data, it is critical to correctly
identify the type of data and appropriately select the corresponding analytical
approach.

2.1.2 Pooling data, Simpson’s paradox, and solution


It is very common in data science to pool several sets of data together in the
process of data analytics. One of the common mistakes, often overlooked in
practice, is ignoring the impact of confounding factors that may alter the
prediction outcome.
In statistical analysis, when dividing data from a population into
subpopulations, the resulting reversal of associations is called Simpson's
paradox. Similar effects also occur in data pooling. To further illustrate
this point, we examine the following numerical example.

Example 2.3 The following is a set of hypothetical data summarizing a survey
(cross-sectional data) regarding residents' opinions on a new health policy.
We have three features (variables) in the dataset:
Resident community: urban or rural;
Opinion: favoring or against;
Mental stress status: stressed or not stressed.
For the odds ratio measuring the relationship between being stressed and
favoring the newly proposed health policy, we examine three datasets to see
how the odds ratios change between the individual and collapsed datasets.

The odds ratio for the urban population reads
$$ OR_{urban} = \frac{48 \times 94}{12 \times 96} = 3.9167, $$
TABLE 2.1
Pooling data and Simpson's paradox

                      favoring   not favoring
Urban  Not stressed      48          12
       Stressed          96          94
Rural  Not stressed      55         135
       Stressed           7          53

and for the rural population,
$$ OR_{rural} = \frac{55 \times 53}{135 \times 7} = 3.0847. $$
Both odds ratios, for the urban and the rural populations, are larger than 1,
indicating that the stress level influences the opinion on supporting the new
health policy. However, when we collapse the data into one table, we have

               Favoring   Not favoring
Not stressed     103          147
Stressed         103          147

Thus, the odds ratio calculated from the collapsed data is 1, which implies
that being stressed does not have any impact on the opinion of the new health
policy.
Now we have two contradictory statements based on the odds ratio for the
impact of being stressed on policy opinions: one claims that being mentally
stressed affects residents' opinions on the public health policy, and the
other claims that it does not.
In fact, it is inappropriate to pool the two datasets in Table 2.1 together
because one is for the urban population while the other is for the rural
population. The confounding effect of community location alters the
relationship between the stressed population and opinion on the public health
policy. In this situation, it is more appropriate to use the
Cochran-Mantel-Haenszel (CMH) approach for the odds ratio when combining data
information from two different sources with confounding effects:
$$ OR_{CMH} = \Big\{ \sum_i \frac{a_i d_i}{N_i} \Big\} \Big/ \Big\{ \sum_i \frac{b_i c_i}{N_i} \Big\}, $$

where each of the k data sources (strata), i = 1, ..., k, is laid out as in
Table 2.2.

TABLE 2.2
OR and CMH for Simpson's paradox (stratum i)

              Case     Control   Total
Exposure      a_i      b_i       n_{1i}
Nonexposure   c_i      d_i       n_{2i}
Total         m_{1i}   m_{2i}    N_i

With the adjustment for the confounding factor in each data stratum (the two
data sources in Table 2.1), the adjusted odds ratio reads
$$ OR_{CMH} = \Big( \frac{48 \times 94}{250} + \frac{55 \times 53}{250} \Big) \Big/ \Big( \frac{12 \times 96}{250} + \frac{135 \times 7}{250} \Big) = \frac{7427}{2097} = 3.5417, $$
which is consistent with the conclusion from the odds ratios in each data
stratum: people with stress tend to be against the new health policy, while
those without stress are in favor of it.
Certainly, the hypothetical dataset is constructed in a way that amplifies the
confounding effect. However, it points out the fact that pooling datasets
(with a data frame similar to Table 2.2) without carefully considering
potential confounding factors may completely alter the learning outcome and,
consequently, result in misleading conclusions. Confounding effects in
Simpson's paradox necessitate adjustment methods such as the CMH weighting
approach.
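The calculations of Example 2.3 can be reproduced in a few lines of Python
(the counts are taken from Table 2.1; the stratum layout is as in Table 2.2):

# Each stratum: (a, b, c, d) = (not stressed favoring, not stressed against,
#                               stressed favoring, stressed against)
strata = {'urban': (48, 12, 96, 94), 'rural': (55, 135, 7, 53)}

for name, (a, b, c, d) in strata.items():
    print(name, 'OR =', (a * d) / (b * c))      # 3.9167 and 3.0847

# The pooled (collapsed) table ignores the confounder and gives OR = 1
A = sum(s[0] for s in strata.values()); B = sum(s[1] for s in strata.values())
C = sum(s[2] for s in strata.values()); D = sum(s[3] for s in strata.values())
print('pooled OR =', (A * D) / (B * C))

# Cochran-Mantel-Haenszel adjusted odds ratio
num = sum(a * d / (a + b + c + d) for (a, b, c, d) in strata.values())
den = sum(b * c / (a + b + c + d) for (a, b, c, d) in strata.values())
print('CMH OR =', num / den)                    # 7427/2097 = 3.5417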

2.2 Basic concepts of trees


After identifying the nature of the data, the next step is to find an
appropriate learning method to bridge the data and the unknown features for
prediction. This process involves functions of the data (statistics) in
conjunction with measurements of data variation (probability). Since
probability and statistics are very well developed and documented in the
literature, in this section we focus on the discussion of basic concepts and
fundamentals of a relatively new topic, the decision tree.
Different from conventional methods, the use of decision trees partly marks a
distinct feature of machine learning. Decision trees usually include
regression trees and classification trees. When we have a set of explanatory
variables to predict a dependent variable, the conventional approach in
statistics is, for example, the linear regression
$$ Y = a + b_1 X_1 + \ldots + b_k X_k + \epsilon, $$
where ε is a random variable corresponding to the distribution of Y. In this
setting, the variables X1, ..., Xk are assumed to affect the response variable
Y jointly through
$$ E(Y|x) = a + b_1 X_1 + \ldots + b_k X_k, $$
where the vector x = (X1, ..., Xk)^t. However, in practice, there is a high
possibility that one of the explanatory variables is more prominent in
determining the value of Y, as shown in the following example.
Example 2.4 Consider a scenario in marketing analysis where Y is the sale


volume, X1 is the advertising input, and X2 is the selling price. In a regression
model, we are unable to claim whether advertising input should be considered
before the selling price, or vice versa. This problem can be resolved by a re-
gression tree.

FIGURE 2.1
Regression tree of sale volume on price and advertising cost

As shown in Figure 2.1, since the first split of the regression tree is on
whether the selling price is more than $2000, we essentially consider the
selling price first. Products with a selling price of more than $2000 will be
considered in one category (branch), while the advertising input will be
considered as the second criterion (sub-branch). If the first split were on
advertising cost, we would correspondingly consider the advertising cost
first.
Interpretation: For the branch where the selling prices of the product are
more than $2000, the advertising input will be considered. Assume that the
split point for the advertising input is $10,000. We essentially consider two
subbranches for products in the first branch: one subbranch has selling price
more than $2000 and advertising input more than $10,000; while another
subbranch is for products with selling price more than $2000 and advertising
input no more than $10,000. Since products in each subbranch share the same
impact from the two factors (selling price and advertising input), the sale
volumes for products in each subbranch are averaged for the predicted sale
volume in the terminal node, (40K, 30K).
For products in the branch where selling prices are no more than $2000,
similar considerations lead to the following two sub-branches. The first
sub-branch consists of products with selling prices no more than $2000 and
advertising input more than $15,000, while the second sub-branch consists of
products with selling prices no more than $2000 and advertising input no more
than $15,000, correspondingly. The sale volumes of products in each sub-branch
are averaged for the predicted sale volumes, 50K and 10K, respectively.
In the description of the construction of a regression tree in Figure 2.1, as
mentioned before, one of the distinct features is the order in which the
explanatory variables are considered in the prediction process. Another key
issue is the determination of the split point for the explanatory variable
under consideration. It is relevant at this point to introduce two basic
definitions in the construction of a regression tree.

Definition 2.6 Feature Space: The set of all possible input combinations
of explanatory variables in statistical prediction.

For example, when we consider X1 selling price and X2 advertising input


of the company, the feature space is

S = {(X1 , X2 ) ∈ R+ × R+ }.

The concept of feature space is closely related to the concept of feature space
partition in the theory of decision trees.

Definition 2.7 Partition of the feature space: Let Θ be the feature space
of a set of predictors. Denote by {Θ1, ..., Θk} a set of mutually exclusive
subsets of the feature space Θ, such that
$$ \Theta_i \cap \Theta_j = \emptyset, \; i \neq j, \quad \text{and} \quad \bigcup_{i=1}^{k} \Theta_i = \Theta. $$
The set {Θ1, ..., Θk} is called a partition of the feature space Θ.

For instance, for any two positive values s1 > 0 and s2 > 0, let

R11 = {X1 ≤ s1 , X2 ≤ s2 } R12 = {X1 ≤ s1 , X2 > s2 }

R21 = {X1 > s1 , X2 ≤ s2 } R22 = {X1 > s1 , X2 > s2 },


The set {R11 , R12 , R21 , R22 } is a partition of the feature space [0, ∞) × [0, ∞).

With a partition of the feature space under the assumption that subjects
in the same set of a partition share the same expected response, we have

yt = cij IRij (xt ), (2.3)
xt ∈Rij

where the indicator function

1 if xt ∈ Rij
IRij (xt ) =
0 otherwise.

In this setting, the estimate of cij (which is the same as the expected response
within the partition Rij ) with minimum mean squared prediction error reads
1 

cij = ŷij = yt ,
nij
t:xt ∈Rij

where nij denotes the sample size in the set Rij .


The following example explains the notation of the indicator function in
equation (2.3).

Example 2.5 Consider a set of data where the sample mean responses within
the four sets of the partition are 15, 20, 30, and 40. Similar to the tree
demonstrated in Figure 2.1, the tree prediction model (2.3) becomes
$$ \hat y(x_t) = 15 I_{R_{11}}(x_t) + 20 I_{R_{12}}(x_t) + 30 I_{R_{21}}(x_t) + 40 I_{R_{22}}(x_t). $$
The prediction depends on the set $R_{ij}$ that the value $x_t$ falls into. If
$x_t \in R_{11}$, we have $I_{R_{11}}(x_t) = 1$. Since
$\{R_{11}, R_{12}, R_{21}, R_{22}\}$ forms a partition of the feature space,
the observation $x_t$ does not belong to any of the sets $R_{12}$, $R_{21}$,
or $R_{22}$; that is, $x_t \notin R_{ij}$ when $i \neq 1$ or $j \neq 1$. Thus
$$ I_{R_{12}}(x_t) = I_{R_{21}}(x_t) = I_{R_{22}}(x_t) = 0, $$
and equation (2.3) leads to
$$ \hat y(x_t) = 15. $$

Example 2.6 The process of constructing a regression tree is equivalent to an
optimization process that minimizes the prediction error (E) over all possible
partitions of the feature space,
$$ \begin{aligned} E &= \sum_{i:(x_{1i},x_{2i}) \in R_{11}} (y_i - \hat y_1^*)^2 + \sum_{i:(x_{1i},x_{2i}) \in R_{12}} (y_i - \hat y_2^*)^2 \\ &\quad + \sum_{i:(x_{1i},x_{2i}) \in R_{21}} (y_i - \hat y_3^*)^2 + \sum_{i:(x_{1i},x_{2i}) \in R_{22}} (y_i - \hat y_4^*)^2 \\ &= \sum_{uv} \; \sum_{i:(x_{1i},x_{2i}) \in R_{uv}} (y_i - \hat y_{(uv)})^2 \end{aligned} \qquad (2.4) $$
where $R_{uv}$ is the set of products in the subbranch $(uv)$, and
$\hat y_i^*$, $i = 1, 2, 3, 4$, is the average sale volume for products in the
corresponding subbranch $R_{uv}$, for $(uv) \in \{(11), (12), (21), (22)\}$,
respectively. In Figure 2.1, the set $\{R_{11}, R_{12}, R_{21}, R_{22}\}$
serves as a partition of the feature space $[0, \infty) \times [0, \infty)$.
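To illustrate the optimization in (2.4) on a single predictor, the following
sketch (Python/NumPy; the price and sale-volume numbers are hypothetical)
searches for the split point that minimizes the total within-branch squared
error when each branch is predicted by its own mean.

import numpy as np

def best_split(x, y):
    """Find the split point on one predictor minimizing the total SSE
    when each side of the split is predicted by its own mean."""
    best = (None, np.inf)
    for s in np.unique(x)[:-1]:           # candidate split points
        left, right = y[x <= s], y[x > s]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best[1]:
            best = (s, sse)
    return best

# Hypothetical data: sale volume versus selling price
price = np.array([1200, 1500, 1800, 2100, 2500, 3000])
volume = np.array([12, 11, 10, 45, 40, 42], dtype=float)
print(best_split(price, volume))          # best split near price = 1800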

Different from the linear regression method, where the random error ε follows
a statistical model, tree regressions do not need any model assumption. On the
other hand, the difficulty in implementing a regression tree switches to the
selection of the partition of the feature space that minimizes the mean
squared prediction error. For instance, in the selling price and advertising
input example, the construction of a regression tree depends on the selection
of the split values s1 and s2, as well as the order of the two explanatory
variables X1 and X2. When the number of features (predictors) increases, the
corresponding volume of computation increases dramatically. This necessitates
the use of computing software for the construction of regression trees.
As a fundamental introduction to the basic concept of statistical learning,
this section discusses the concept and interpretation of the decision tree. More
details on this topic, especially the uniformly minimum variance unbiased
estimator (UMVUE) for the homogeneity index in each terminal node, will be
delineated in Chapter 9.

2.3 Sensitivity, specificity, and ROC curves


Given a set of training data, we can train a binary classifier (a decision
rule) for the diagnosis of disease or non-disease using features of the
patients (a set of explanatory variables). It is relevant at this point to
mention the four possible outcomes corresponding to a decision rule for binary
classification with a diagnostic threshold. If the prediction is negative and
the real feature is negative, there is no error in the prediction: the outcome
is a true negative. On the other hand, if the real feature is positive, there
is an error of falsely claiming a negative (a false negative). Similarly, if
the prediction is positive while the real outcome is actually negative, there
is a misclassification error of incorrectly claiming a positive (a false
positive); if the real feature is indeed positive, the prediction makes a true
positive assertion.
The four possible outcomes in binary classification can be summarized
in Table 2.3. For example, in the process of classifying a disease, there is a
possibility of misclassifying a diseased patient as healthy (false negative), or
misclassifying a healthy patient as having a disease (false positive). These
two errors are conceptually similar to the type-I errors and type-II errors in
hypothesis testing. Along the same line on the evaluation of discriminating

TABLE 2.3
Four possible outcomes in a binary classification

Feature         predicted negative        predicted positive        Total
Real negative   true negative             false positive            n_{real negative}
Real positive   false negative            true positive             n_{real positive}
Total           m_{predicted negative}    m_{predicted positive}    N

ability for a diagnostic test, two concepts frequently used in the literature are
sensitivity and specificity.

Definition 2.8 Sensitivity: Let T be a binary classifier trained with the
training data. The probability that T correctly diagnoses a diseased patient
as having the disease is defined as the sensitivity of the classifier T. If a
larger value of T indicates a higher likelihood of the disease, then for a
threshold value c,
$$ Sensitivity = P(T \text{ diagnoses a case} \mid \text{case}) = P(T > c \mid \text{case}). $$
Clearly, sensitivity is the probability of a correct diagnosis of the disease,
conditioned on the diseased population. It is essentially a conditional
probability depending on the given threshold c.

Definition 2.9 Specificity: Let T be a binary classifier trained with the


training data. The specificity of the classifier T is defined as the probabil-
ity that T correctly claims a healthy patient as healthy. For a threshold value
d,

Specif icity = P (T claims a healthy outcome|healthy)


= P (T < d|healthy).

Clearly, the specificity is a probability of correct diagnosis conditioned on


the healthy population. As the value d changes, the associated conditional
probability changes.
For instance, when we use the logistic regression model to seek the odds of
getting infected with a disease given a set of features x, assume
$$ \log\Big( \frac{p}{1-p} \Big) = \alpha + \beta^t x, $$
where the disease rate p = P(Y = 1|x), and α and β are model parameters.
After training the model, we obtain the estimated values of α and β, denoted
as α̂ and β̂, and
$$ \hat P(Y = 1 \mid x) = \frac{e^{\hat\alpha + \hat\beta^t x}}{1 + e^{\hat\alpha + \hat\beta^t x}}. \qquad (2.5) $$
The above equation predicts the chance of getting a disease for the specific
features of each patient, x. Thus, for each value c, based on the trained
classifier T, we can compute the estimated sensitivity by dividing the number
of correct diagnoses by the total number of cases in the sample. Similarly,
the estimated specificity can be obtained for each value of c. Thus, for a
trained model (binary classifier), a pair of values (sensitivity, specificity)
can be computed for each diagnostic threshold c.
Considering all permissible values of the diagnostic threshold c, c ∈ A, gives
a set of pairs
$$ \{ (sensitivity_c, \; specificity_c), \; c \in A \}. $$
Plotting this set of data using the pairs (sensitivity(c), 1 − specificity(c))
for all c ∈ A yields a curve, which is the ROC curve.

Definition 2.10 ROC curve: The plot of (sensitivity, 1-specificity) across


all permissible values of the diagnostic threshold c is called the receiver oper-
ating characteristic curve, or the ROC curve.

FIGURE 2.2
Example of a sample ROC curve

The ROC curve is basically the graphical plot of two parameters, sensitivity
versus (1 − specificity), as the diagnostic threshold c travels over its
domain. It evaluates the discriminating ability of the binary classifier. When
a set of testing data is available, we may plug the data into the estimated
model (2.5) to obtain a set of sample sensitivities and sample specificities
associated with different cut-off (threshold) values of the classifier. The
plot of the sample (estimated) sensitivities versus the false positive rates
(1 − sample specificity) forms an estimated ROC curve, as shown in Figure 2.2.
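A minimal sketch of this threshold sweep (Python/NumPy; the predicted scores
and true labels are hypothetical stand-ins for the output of a fitted
classifier such as (2.5)) computes the pairs (1 − specificity, sensitivity)
that trace out an estimated ROC curve.

import numpy as np

# Hypothetical predicted disease probabilities and true disease labels
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.15, 0.1])
labels = np.array([1,   1,   0,   1,   1,    0,   1,   0,   0,    0])

roc = []
for c in np.unique(scores)[::-1]:            # sweep the threshold c
    pred = scores >= c                        # classify as "case" if T >= c
    sens = (pred & (labels == 1)).sum() / (labels == 1).sum()
    spec = (~pred & (labels == 0)).sum() / (labels == 0).sum()
    roc.append((1 - spec, sens))              # (false positive rate, sensitivity)

for fpr, tpr in roc:
    print(f'FPR = {fpr:.2f}, sensitivity = {tpr:.2f}')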
2.4 Cross-Validation
After data identification, statistical learning of an unknown underlying model
usually involves three portions: model training, model validation, and model
testing. Correspondingly, the dataset is ideally split into three mutually
exclusive subsets: a training set, a validation set, and a test set. The three
portions are briefly described below.
The model training portion fits a candidate model with training data to
estimate the model parameters. The selection of the candidate model usually
is based on data analytic knowledge, for instance, we assume that the mean
response is a linear function of the predictors. In this step, assumptions on
the data and background information about the data play a critical role in the
selection of the appropriate model. In machine learning processes, the
training dataset is usually used to fit a specific candidate model. However,
there may be more than one candidate model, such as different orders in
polynomial regression, where we may fit a linear function or a quadratic
function as candidate models. In polynomial regression, the degree of the
polynomial model is usually validated with the validation data.
The model validation part involves parameter estimation on the basis of
an accuracy measurement in conjunction with the validation data, such as
the selection of model coefficients corresponding to the smallest MSPE (mean
squared prediction error). As for artificial neural networks, the number of
hidden units in each layer in artificial neural networks is a hyper-parameter to
be determined in the validation data. In general, the fine-tune of the trained
model necessitates the validation process.
The model testing portion usually includes the evaluation of the final model
with the testing data that was set aside to independently assess the
performance of the final model. It is inappropriate to use the same data both
to estimate the predictive model and to calculate the validation criteria that
justify the final model because, after all, the validation data are just one
random sample representing the population. Especially when it comes to
prediction, the first step is to build or estimate a model, which is then used
to predict the unknown response.
One of the critical steps in the prediction process is to build a model that
fits well with the data. Usually, we spend 50% of the data on training the
model, 25% on validation, and 25% on testing. However, when the sample size is
not large enough for splitting into a training set and a validation set, an
efficient alternative is to combine the training and validating parts within
the training data by the method of cross-validation.
Generally, cross-validation is a data implementation process that uses
numerical computation to replace thorny theoretical analysis. It evaluates the
trained model over multiple rounds using different partitions of the training
data, then takes the average of the corresponding evaluated model accuracies
(such as the MSE) to give an estimate of the model's predictive performance.
The cross-validation method usually provides a more accurate evaluation of how
well the model fits the data.

Definition 2.11 Cross-validation Procedure: Cross-validation divides the
training data into k equal-size folds, trains the model using k − 1 folds of
the data, and evaluates the trained model on the remaining fold for the model
accuracy.

The validation process uses the k different folds of the data in turn, which
consequently generates k accuracy measurements. Taking the average of the k
accuracy measurements leads to the overall accuracy level of the model.
In practice, we usually use LOOCV (leave-one-out cross-validation), 5-fold
CV, and 10-fold CV.

2.4.1 LOOCV and Jackknife


The LOOCV is actually n-fold cross-validation when the training data contain
n observations. For each data point in LOOCV, the remaining n − 1 observations
are used to train the model, and the held-out data point is used to calculate
the model accuracy. The average of the n model accuracy measurements is then
used as the index of the LOOCV model accuracy.
For example, when we use the MSE as a measurement of model accuracy between
the trained model and the data, the cross-validation accuracy reads
$$ CVA(n) = \frac{1}{n} \sum_{i=1}^{n} MSE_i. $$
Consider the overall model MSE,
$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat y_i - y_i)^2. $$
In LOOCV, the cross-validation model accuracy becomes
$$ CVA(n) = \frac{1}{n} \sum_{i=1}^{n} (\hat y_i^* - y_i)^2, $$
where $\hat y_i^*$ is the predicted value of $y_i$ with predictor $x_i$,
corresponding to the model trained by the data
$(x_1, y_1), \ldots, (x_{i-1}, y_{i-1}), (x_{i+1}, y_{i+1}), \ldots, (x_n, y_n)$.
It should be noted that the drop-one-observation feature of LOOCV shares the
same sub-sampling principle as the jackknife replication procedure, where, for
a random sample $x_1, \ldots, x_n$, one defines an estimator based on a
drop-one-observation sample,
$$ \hat\theta_{(i)} = f_{n-1}(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n), $$
with $i = 1, \ldots, n$, and the jackknife estimate as the average of the n
drop-one sample estimates,
$$ \hat\theta_{Jack} = \frac{1}{n} \sum_{i=1}^{n} \hat\theta_{(i)}. $$

Example 2.7 We shall use a toy dataset, {1, 5, 9}, to illustrate the
cross-validation accuracy (CVA) for the prediction of the population mean
using LOOCV, along with the jackknife estimation.

Case-1 Since the overall sample mean is 5, we have the mean squared error
without any data manipulation,
$$ MSE = \frac{1}{3} [(5-1)^2 + (5-5)^2 + (9-5)^2] = \frac{32}{3}. $$

Case-2 Since there are three observations in the dataset, we have the three
drop-one sample means,
$$ \hat\theta_{(1)} = \frac{1}{2}(5 + 9) = 7, \quad \hat\theta_{(2)} = \frac{1}{2}(1 + 9) = 5, \quad \hat\theta_{(3)} = \frac{1}{2}(1 + 5) = 3, $$
which serve as the predicted values $\hat y_1^*$, $\hat y_2^*$, and
$\hat y_3^*$, respectively. The cross-validation accuracy is
$$ CVA(3) = \frac{1}{3} [(7-1)^2 + (5-5)^2 + (3-9)^2] = \frac{72}{3} = 24. $$

Case-3 The jackknife estimator of the population mean,
$$ \hat\theta_{Jack} = \frac{1}{3}(7 + 5 + 3) = 5, $$
coincides with the sample mean in Case-1 above.
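The three cases of Example 2.7 can be verified in a few lines of Python:

import numpy as np

x = np.array([1.0, 5.0, 9.0])
n = len(x)

mse = ((x - x.mean())**2).mean()                      # Case-1: 32/3

drop_one = np.array([np.delete(x, i).mean() for i in range(n)])
cva = ((drop_one - x)**2).mean()                      # Case-2: 24

jack = drop_one.mean()                                # Case-3: 5
print(mse, drop_one, cva, jack)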

Theorem 2.2 The jackknife procedure does not change the estimate of the
population mean, but it reduces the variation of the data around the sample
mean by the factor $\frac{1}{n-1}$.

Proof: Notice that
$$ \hat\theta_{Jack} = \frac{1}{n} \sum_{i=1}^{n} \hat\theta_{(i)} = \frac{1}{n} \sum_{i=1}^{n} \frac{n\bar x - x_i}{n-1} = \frac{n-1}{n-1} \bar x = \bar x, $$
and
$$ |\hat\theta_{(i)} - \hat\theta_{Jack}| = \Big| \frac{n\bar x - x_i}{n-1} - \frac{1}{n} \sum_{i=1}^{n} x_i \Big| = \Big| \frac{1}{n-1} (n\bar x - x_i - (n-1)\bar x) \Big| = \Big| \frac{1}{n-1} (\bar x - x_i) \Big|. $$
This completes the proof of Theorem 2.2.

Theorem 2.3 For any asymptotically unbiased estimator, the jackknife version
of the estimator improves the convergence rate of the bias from $O(\frac{1}{n})$
to $O(\frac{1}{n^2})$.

Proof: To see this, notice that if T(X) is an asymptotically unbiased
estimator of a function of the parameter τ(θ) with convergence rate
$O(\frac{1}{n})$, we have
$$ E(T(X)) = \tau(\theta) + O\Big(\frac{1}{n}\Big). $$
Define $X_{(-i)}$ as the jackknife replicate (the sub-sample of $n - 1$
observations excluding $x_i$). The jackknife version of T(X) is
$$ T_{Jack}(X) = n T(X) - \frac{n-1}{n} \sum_{i=1}^{n} T(X_{(-i)}), $$
and the asymptotic performance of the estimator $T_{Jack}$ reads
$$ \begin{aligned} E(T_{Jack}(X)) &= n E[T(X)] - \frac{n-1}{n} \sum_{i=1}^{n} E[T(X_{(-i)})] \\ &= n \Big[ \tau(\theta) + O\Big(\frac{1}{n}\Big) \Big] - \frac{n-1}{n} \sum_{i=1}^{n} \Big[ \tau(\theta) + O\Big(\frac{1}{n-1}\Big) \Big] \\ &= (n - n + 1)\tau(\theta) + n O\Big(\frac{1}{n}\Big) - (n-1) O\Big(\frac{1}{n-1}\Big). \end{aligned} $$
Now, writing
$$ n O\Big(\frac{1}{n}\Big) = a_1 + \frac{a_2}{n} + \ldots \quad \text{and} \quad (n-1) O\Big(\frac{1}{n-1}\Big) = a_1 + \frac{a_2}{n-1} + \ldots, $$
we have
$$ n O\Big(\frac{1}{n}\Big) - (n-1) O\Big(\frac{1}{n-1}\Big) = a_2 \Big( \frac{1}{n} - \frac{1}{n-1} \Big) + \ldots = -\frac{a_2}{n(n-1)} + \ldots = O\Big(\frac{1}{n^2}\Big). $$
This completes the proof of Theorem 2.3.

2.4.2 LOOCV for linear regressions


Since LOOCV is often used in linear regression, in this subsection we discuss
a theoretical simplification and an example of LOOCV in multiple linear
regression. Consider the cross-validation of a linear model with a predictor
row vector $x_i$ and a response $y_i$, $i = 1, \ldots, n$,
$$ E(Y|x_i) = x_i \beta, $$
where β is the column vector of parameters. Denoting the response vector
$y = (y_1, \ldots, y_n)'$ and the corresponding predictor matrix X, we have
the model
$$ y = X\beta + \epsilon. $$
In general, when the sample size n is large, the regular LOOCV method is
computationally expensive. However, for the linear model, the cross-validation
accuracy (CVA) under LOOCV can be simplified as in the following theorem.
Theorem 2.4 When LOOCV is applied to linear regression, the cross-validation
accuracy (CVA) can be expressed explicitly as
$$ CVA(n) = \frac{1}{n} \sum_{i=1}^{n} \Big( \frac{y_i - \hat y_i}{1 - h_i} \Big)^2, \qquad (2.6) $$
where $\hat y_i = x_i \hat\beta$ is the estimated response for the ith
observation, and $h_i$ is the leverage of the ith observation,
$$ h_i = x_i (X'X)^{-1} x_i'. $$
Proof: Denote by $X_{(-i)}$ and $y_{(-i)}$, respectively, the data without the
ith observation $y_i$ and $x_i$; $\hat\beta_{(-i)}$ is the vector of least
squares estimates using $X_{(-i)}$ and $y_{(-i)}$, and $\hat y_i^*$ is the
estimated response using LOOCV with the ith observation dropped.
Note that under this setting,
$$ CVA(n) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat y_i^*)^2, \qquad \hat y_i^* = x_i \hat\beta_{(-i)}. $$
Recall the following relationship in matrix algebra:
$$ X'X = X_{(-i)}'X_{(-i)} + x_i' x_i. \qquad (2.7) $$
Under the assumption that $n - 1 > p$ (see Xie and Chen (1988) [126]), the
sample covariance matrix is positive definite with probability 1, so both
$(X'X)^{-1}$ and $(X_{(-i)}'X_{(-i)})^{-1}$ exist. Multiplying both sides of
(2.7) by $(X'X)^{-1}$ yields
$$ I_p = [X_{(-i)}'X_{(-i)}](X'X)^{-1} + x_i' x_i (X'X)^{-1}. \qquad (2.8) $$
Notice that we also have
$$ I_p = (X_{(-i)}'X_{(-i)})(X_{(-i)}'X_{(-i)})^{-1}, $$
which, in conjunction with (2.8), leads to
$$ [X_{(-i)}'X_{(-i)}][X_{(-i)}'X_{(-i)}]^{-1} = (X_{(-i)}'X_{(-i)})(X'X)^{-1} + x_i' x_i (X'X)^{-1}. \qquad (2.9) $$
Multiplying (2.9) from the left by $[X_{(-i)}'X_{(-i)}]^{-1}$ gets
$$ [X_{(-i)}'X_{(-i)}]^{-1} = (X'X)^{-1} + (X_{(-i)}'X_{(-i)})^{-1} x_i' x_i (X'X)^{-1}. \qquad (2.10) $$
Now, multiplying (2.10) from the right by $x_i'$ results in
$$ [X_{(-i)}'X_{(-i)}]^{-1} x_i' = (X'X)^{-1} x_i' + (X_{(-i)}'X_{(-i)})^{-1} x_i' x_i (X'X)^{-1} x_i', \qquad (2.11) $$
which becomes
$$ (X'X)^{-1} x_i' = (1 - h_i)[X_{(-i)}'X_{(-i)}]^{-1} x_i', \qquad (2.12) $$
since the leverage of the ith observation is defined as
$h_i = x_i (X'X)^{-1} x_i'$.

By (2.12), we have
$$ (X_{(-i)}'X_{(-i)})^{-1} x_i' = \frac{1}{1 - h_i} (X'X)^{-1} x_i'. \qquad (2.13) $$
Now, by the LSE estimates of the regression coefficients corresponding to X
and $X_{(-i)}$,
$$ X'X \hat\beta = X'y, \qquad (X_{(-i)}'X_{(-i)}) \hat\beta_{(-i)} = X_{(-i)}'y_{(-i)}. $$
Considering (2.7) and $X'y = X_{(-i)}'y_{(-i)} + x_i' y_i$, we have
$$ (X_{(-i)}'X_{(-i)} + x_i' x_i) \hat\beta = X_{(-i)}'y_{(-i)} + x_i' y_i, $$
where $y_i$ is the response corresponding to the ith observation. Now,
applying $(X_{(-i)}'X_{(-i)})^{-1}$ from the left on both sides of the
equality gets
$$ \{ I_p + (X_{(-i)}'X_{(-i)})^{-1} x_i' x_i \} \hat\beta = \hat\beta_{(-i)} + (X_{(-i)}'X_{(-i)})^{-1} x_i' (x_i \hat\beta + \hat e_i), $$
where $\hat e_i = y_i - \hat y_i$ is the residual of the ith observation with
β estimated from the complete data. Thus
$$ \hat\beta = \hat\beta_{(-i)} + (X_{(-i)}'X_{(-i)})^{-1} x_i' \hat e_i. $$
By (2.13), the above equation can be simplified as
$$ \hat\beta = \hat\beta_{(-i)} + (X'X)^{-1} x_i' \frac{\hat e_i}{1 - h_i}, $$
and
$$ x_i \hat\beta = x_i \hat\beta_{(-i)} + h_i \frac{\hat e_i}{1 - h_i}. \qquad (2.14) $$
Let $\hat d_{(i)}$ be the residual associated with the "leave the ith
observation out" data, i.e.,
$$ \hat d_{(i)} = y_i - x_i \hat\beta_{(-i)}. $$
By (2.14),
$$ \hat d_{(i)} = y_i - x_i \hat\beta_{(-i)} = y_i - x_i \hat\beta + h_i \frac{\hat e_i}{1 - h_i} = \hat e_i + \Big( \frac{\hat e_i}{1 - h_i} \Big) h_i = \frac{\hat e_i}{1 - h_i}, $$
which, in conjunction with
$$ CVA(n) = \frac{1}{n} \sum_{i=1}^{n} \hat d_{(i)}^2 = \frac{1}{n} \sum_{i=1}^{n} \Big[ \frac{y_i - \hat y_i}{1 - h_i} \Big]^2, $$
concludes the proof of (2.6).


In some occasions (such as SAS outputs), the statistic CV A(n) is alterna-
tively denoted as PRESS (prediction residual error sum of squares),

n n
yi − ŷi 2
P RESS = (yi − ŷi∗ )2 = [ ] ,
i=1 i=1
1 − hi

where ŷi∗
is the estimated response using LOOCV with the ith observation
dropped.
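A numerical check of Theorem 2.4 (Python/NumPy; the simulated design matrix
and coefficients are hypothetical): the leverage-based shortcut (2.6) agrees
with the brute-force LOOCV that refits the regression n times.

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Shortcut (2.6): residuals and leverages from one full-data fit
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)  # leverages h_i
cva_shortcut = np.mean((e / (1 - h))**2)

# Brute force: refit n times, dropping one observation each time
press = 0.0
for i in range(n):
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    bi = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
    press += (y[i] - X[i] @ bi)**2
print(cva_shortcut, press / n)   # the two values agree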
data Fresh;
set WORK.'Fresh_multiple regression data'n;
x1=log('sale price'n);
x2=log('competitor price'n);
x3=log('advertising cost'n);
x4='sale price'n;
x5='competitor price'n;
x6='advertising cost'n;
run;
proc glmselect;
model 'market demand'n = x1-x6 / selection=forward(stop=CV)
    details=steps cvMethod=split(117);
run;

FIGURE 2.3
SAS codes and LOOCV output for market demand predicted by sale price,
competing price, and advertising input

2.4.3 K-fold cross-validation and SAS examples


Although the LOOCV method has many desirable properties in the data learning
process, one of its drawbacks is that it is computationally expensive, because
the method essentially requires fitting the model n times to generate CVA(n).
Another issue with LOOCV is that the training sets for the fitted models are
highly correlated, since only one observation is removed each time. An
extension of LOOCV is K-fold cross-validation.

Definition 2.12 K-fold cross-validation: Different from LOOCV, the k-fold
cross-validation approach contains the following steps (a code sketch follows
this definition):
1. Randomly divide the original data set into k groups, or k folds, of
approximately equal size.
2. After dividing the data, treat one of the folds as the validation set and
the remaining k − 1 folds as the training data to train the model.
The trained model is evaluated on the held-out fold, with the MSPE as the
model accuracy measurement.

3. The procedure is repeated k times, with each fold serving once as the
validation set.

4. The average of the k values MSPE_i associated with the k folds,
i = 1, ..., k, is the overall model accuracy measurement:
$$ CVA(k) = \frac{1}{k} \sum_{i=1}^{k} MSPE_i. $$
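A generic sketch of Steps 1-4 (Python/NumPy; the fit/predict functions shown,
here ordinary least squares, stand in for any trained model, and the simulated
data are hypothetical):

import numpy as np

def k_fold_cva(X, y, k, fit, predict):
    """Steps 1-4 of Definition 2.12: average MSPE over k held-out folds."""
    n = len(y)
    idx = np.random.permutation(n)            # step 1: random fold assignment
    folds = np.array_split(idx, k)
    mspes = []
    for i in range(k):                        # steps 2-3: hold out each fold
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        mspes.append(np.mean((y[test] - predict(model, X[test]))**2))
    return np.mean(mspes)                     # step 4: overall accuracy

# Example with ordinary least squares as the candidate model
fit = lambda X, y: np.linalg.solve(X.T @ X, X.T @ y)
predict = lambda beta, X: X @ beta
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([2.0, 3.0]) + rng.normal(size=100)
print(k_fold_cva(X, y, k=5, fit=fit, predict=predict))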

data Fresh;
set WORK.'Fresh_multiple regression data'n;
x1=log('sale price'n);
x2=log('competitor price'n);
x3=log('advertising cost'n);
x4='sale price'n;
x5='competitor price'n;
x6='advertising cost'n;
run;
proc glmselect;
model 'market demand'n = x1-x6 / selection=forward(stop=CV)
    details=steps cvMethod=split(5);
run;

FIGURE 2.4
SAS codes and output for 5-fold CV on market demand data

In what follows, we shall use an example to show the method of
cross-validation with SAS in model selection.

Example 2.8 To seek the relationship between market demand and sale price,
advertising cost, and the competitor's price using a random sample of the
previous months' records as the input data, one of the difficulties is to
select the best model among all the possible variables: the log scale of the
sale price, the log scale of the competitor's price, the log scale of the
advertising cost, as well as the original three variables recorded in the
dataset. We use LOOCV and 5-fold cross-validation to select the variables for
the regression model.

Examining the SAS outputs in Figure 2.3 and Figure 2.4, we can see that the
cross-validation accuracies, as well as the selected model of the data, are
consistent.
As shown in Figure 2.3, after adding the intercept, the CV PRESS decreases the
most (to 105.72) when x1, the log scale of the sale price, is added. The
second most important variable in terms of decreasing the CV PRESS
(cross-validation prediction residual error sum of squares) is x3, the log
scale of the advertising cost, bringing it to 88.94, followed by 85.38 after
adding the log scale of the competitor's price into the model. The last
variable added to the regression model is x6, the advertising cost, with a CV
PRESS value of 85.17. The selection process stops at step 6, where adding the
competitor's price into the model would increase the CV prediction residual
error sum of squares from 85.17 to 86.37. The selected model is
$$ Demand = \alpha + \beta_1 x_1 + \beta_2 x_3 + \beta_3 x_2 + \beta_4 x_6 + \epsilon. $$
Similar conclusions follow when we use the 5-fold cross-validation in Figure
2.4. This example also shows, via cross-validation techniques, that a
log-scale transformation of a variable may fit the data better. We will
address the issue of linear regression versus non-linear regression in Chapter
5 and Chapter 6.

2.5 Bootstrapping
Bootstrapping is one of the efficient methods of intrinsic data manipulation.
It essentially recovers population features by repeatedly sampling the
original data. Based on the model assumptions in data analytics, there are two
types of bootstrapping methods, which we discuss as a fundamental topic in
this section: the nonparametric method and the parametric method. The
nonparametric method treats the original sample as the population and
re-samples the original sample to gain intrinsic data information such as the
variance or probability percentiles of the underlying population. Parametric
bootstrapping starts with an assumption about the model behind the original
sample, uses the original sample to estimate the model parameters, and
repeatedly samples the population with the estimated parameters for
predictions.
Bootstrapping 47

2.5.1 Non-parametric bootstrapping

Consider a set of data E = {x_1, ..., x_n}. When we randomly select n observations (drawing one each time from E with replacement), there are n^n possible different selections. For simplicity, assume that the statistic of interest is the sample mean

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i,$$

and the variance of the sample mean

$$Var(\bar{X}) = \sigma^2/n,$$

where the estimate of the population variance σ² is the sample variance

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2.$$

Denote by X̄*_j the mean of the jth bootstrapping sample; we have the following results.

Theorem 2.5 When exhausting all possible n^n re-samples, the mean of the bootstrapping sample means equals the mean of the original sample. The sample variance of the bootstrapping sample means equals the sample variance of the original sample multiplied by a constant,

$$c = \frac{n^n(n-1)}{(n^n-1)n^2},$$

and

$$\frac{1}{n^n}\sum_{j=1}^{n^n}\bar{X}^*_j = \bar{X}, \qquad (2.15)$$

$$\frac{1}{n^n-1}\sum_{j=1}^{n^n}(\bar{X}^*_j - \bar{X})^2 = cS^2. \qquad (2.16)$$

Proof: Notice that when we take the average over all possible outcomes of the n^n selections with replacement, each element in the original sample E is equally likely to be selected. Hence, by regrouping the total entries in the double summation over i and j so that each summation of n items equals $\sum_{i=1}^{n} x_i$, which produces n^n terms of X̄ in the summation, we have

$$\frac{1}{n^n}\sum_{j=1}^{n^n}\bar{X}^*_j = \frac{1}{n^n}\sum_{j=1}^{n^n}\frac{1}{n}\sum_{i=1}^{n}X^*_{ij} = \frac{1}{n^n}\frac{1}{n}\sum_{j=1}^{n^n}\sum_{i=1}^{n}X^*_{ij} = \frac{1}{n^n}\,n^n\bar{X} = \bar{X}.$$

This completes the proof of equation (2.15). As for (2.16), with a similar rationale in regrouping the re-sampling observations into the n^n items of the original sample, the re-sampling data variance reads

$$\begin{aligned}
\frac{1}{n^n-1}\sum_{j=1}^{n^n}(\bar{X}^*_j-\bar{X})^2
&= \frac{1}{n^n-1}\sum_{j=1}^{n^n}\frac{1}{n^2}\Big[\sum_{i=1}^{n}(x^*_{ij}-\bar{X})\Big]^2\\
&= \frac{1}{n^n-1}\sum_{j=1}^{n^n}\frac{1}{n^2}\Big[\sum_{i=1}^{n}(x^*_{ij}-\bar{X})^2+\sum_{i\neq k}(x^*_{ij}-\bar{X})(x^*_{kj}-\bar{X})\Big]\\
&= \frac{1}{n^n-1}\sum_{j=1}^{n^n}\frac{1}{n^2}\sum_{i=1}^{n}(x^*_{ij}-\bar{X})^2\\
&= \frac{1}{n^n-1}\frac{1}{n^2}\sum_{j=1}^{n^n}\sum_{i=1}^{n}(x^*_{ij}-\bar{X})^2\\
&= \frac{1}{n^n-1}\frac{1}{n^2}(n-1)n^nS^2 = cS^2,
\end{aligned}$$

where the cross-product terms vanish because, summing over all n^n equally likely re-samples, the deviations for i ≠ k pair every value of E with every other value the same number of times and average to zero. This completes the proof of (2.16).
Theorem 2.5 establishes the connection between the mean and sample variance of the bootstrapping sample means under the setting of equally likely selection. The following theorem establishes the connection between the average of the n^n bootstrapping sample variances and the sample variance of the original sample.

Theorem 2.6 Let $S_j^{2*}$ be the sample variance of the bootstrapping sample $\{x^*_{1j}, ..., x^*_{nj}\}$. We have

$$\frac{1}{n^n}\sum_{j=1}^{n^n}S_j^{2*} = \frac{n-1}{n}S^2. \qquad (2.17)$$

Original data: 2 3 7 (mean 4, std 2.645751311, variance 7)

Re-sampling information
(each row: the three re-sampled values, followed by mean, std, variance)
2 2 2 2 0 0
2 2 3 2.333333 0.577350269 0.333333
2 2 7 3.666667 2.886751346 8.333333
2 3 2 2.333333 0.577350269 0.333333
2 3 3 2.666667 0.577350269 0.333333
2 3 7 4 2.645751311 7
2 7 2 3.666667 2.886751346 8.333333
2 7 3 4 2.645751311 7
2 7 7 5.333333 2.886751346 8.333333
3 2 2 2.333333 0.577350269 0.333333
3 2 3 2.666667 0.577350269 0.333333
3 2 7 4 2.645751311 7
3 3 2 2.666667 0.577350269 0.333333
3 3 3 3 0 0
3 3 7 4.333333 2.309401077 5.333333
3 7 2 4 2.645751311 7
3 7 3 4.333333 2.309401077 5.333333
3 7 7 5.666667 2.309401077 5.333333
7 2 2 3.666667 2.886751346 8.333333
7 2 3 4 2.645751311 7
7 2 7 5.333333 2.886751346 8.333333
7 3 2 4 2.645751311 7
7 3 3 4.333333 2.309401077 5.333333
7 3 7 5.666667 2.309401077 5.333333
7 7 2 5.333333 2.886751346 8.333333
7 7 3 5.666667 2.309401077 5.333333
7 7 7 7 0 0

Grand mean of the 27 re-sampling means: 4
Grand variance of the 27 re-sampling means: 1.615385
Average of the 27 re-sampling variances: 4.666667

Average variance × n/(n − 1) = 4.666667 × 3/2 = 7 (the original sample variance)
Grand variance × [(n^n − 1)n²]/[n^n(n − 1)] = 1.615385 × (26 × 9)/(27 × 2) = 7

FIGURE 2.5
Small sample nonparametric bootstrap

Proof: The left-hand side of (2.17) reads

$$\frac{1}{n^n}\sum_{j=1}^{n^n}S_j^{2*} = \frac{1}{n^n}\sum_{j=1}^{n^n}\frac{1}{n-1}\sum_{i=1}^{n}(x^*_{ij}-\bar{X}^*_j)^2 = \frac{1}{n^n}\sum_{j=1}^{n^n}\frac{1}{n-1}\sum_{i=1}^{n}(x^*_{ij}-\bar{X}+\bar{X}-\bar{X}^*_j)^2.$$

Decomposing the terms above and using $\sum_{i=1}^{n}(x^*_{ij}-\bar{X}) = n(\bar{X}^*_j-\bar{X})$ yields

$$\begin{aligned}
\frac{1}{n^n}\sum_{j=1}^{n^n}S_j^{2*}
&= \frac{1}{n^n}\sum_{j=1}^{n^n}\frac{1}{n-1}\Big[\sum_{i=1}^{n}(x^*_{ij}-\bar{X})^2 - 2\sum_{i=1}^{n}(x^*_{ij}-\bar{X})(\bar{X}^*_j-\bar{X}) + \sum_{i=1}^{n}(\bar{X}-\bar{X}^*_j)^2\Big]\\
&= \frac{1}{n^n}\sum_{j=1}^{n^n}\frac{1}{n-1}\sum_{i=1}^{n}(x^*_{ij}-\bar{X})^2 - \frac{1}{n^n}\sum_{j=1}^{n^n}\frac{n}{n-1}(\bar{X}-\bar{X}^*_j)^2\\
&= S^2 - \frac{n}{n-1}\cdot\frac{n^n-1}{n^n}\,cS^2\\
&= S^2 - \frac{1}{n}S^2 = \frac{n-1}{n}S^2,
\end{aligned}$$
by (2.16). This completes the proof of Theorem 2.6.

The following numerical example illustrates the above two theorems on nonparametric bootstrapping methods when n = 3.

Example 2.9 Consider the original data set E = {2, 3, 7}, as shown in Figure 2.5; the original sample has sample mean 4 and sample variance 7. There are 3³ = 27 different samples in the re-sampling-with-replacement method. Taking the average of all 27 re-sampling means, we get exactly the same value as the sample mean of the original sample.

As shown in Figure 2.5, when we take the sample variance of the bootstrapping sample means, the grand variance is 1.615385, which recovers the original sample variance 7 after multiplying by

$$\frac{(n^n-1)n^2}{n^n(n-1)} = \frac{26 \times 9}{27 \times 2}.$$

As for Theorem 2.6, the mean (average) of the bootstrapping sample variances is 4.666667, which recovers the original sample variance 7 after multiplying by

$$\frac{n}{n-1} = \frac{3}{2}.$$
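The identities in Theorem 2.5 and Theorem 2.6 can be verified numerically; the following R sketch enumerates all 27 re-samples for the data of Example 2.9.

# Exhaustive nonparametric bootstrap for E = {2, 3, 7} (Example 2.9)
x <- c(2, 3, 7)
n <- length(x)
grid <- expand.grid(rep(list(x), n))   # all n^n = 27 ordered re-samples
means <- rowMeans(grid)
vars  <- apply(grid, 1, var)

mean(means)                                        # 4: equation (2.15)
var(means) * ((n^n - 1) * n^2) / (n^n * (n - 1))   # 7: equation (2.16)
mean(vars) * n / (n - 1)                           # 7: equation (2.17)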

2.5.2 Parametric bootstrapping

When there is an assumption on the underlying model for the original sample, the additional model condition should be taken into consideration in the generation of the re-sampling process. In this case, the model parameters are estimated with the original sample. Once the parameters are estimated, the underlying model for the original sample can be used to add information toward the re-sampling data.

We will use an example to demonstrate the difference between parametric bootstrapping and non-parametric bootstrapping.

> set.seed(10)
> x<-rnorm(20, 2, 0.8)
> x
[1] 2.0149969 1.8525980 0.9029356 1.5206658 2.2356361 2.3118354
1.0335391 1.7090592 0.6986619 1.7948173 2.8814236
[12] 2.6046252 1.8094132 2.7899558 2.5931121 2.0714778 1.2360449
1.8438797 2.7404170 2.3863828
>
> x.mean <- mean(x)
> x.std <-sd(x)
>
> x.mean
[1] 1.951574
> x.std
[1] 0.6399275

Nonparametric bootstrapping
> vec<-rep(0, 10000)
> for (i in (1:10000)){
+ y<-sample(x, 20, replace=TRUE)
+ vec[i]<-mean(y)
+ }
> mean(vec)
[1] 1.951677
> sd(vec)
[1] 0.1416576
> sqrt(20)*sd(vec)
[1] 0.6335121

Parametric bootstrapping with normal model assumption


> for (i in (1:10000)){
+ y<-rnorm(20, x.mean,x.std)
+ vec[i]<-mean(y)
+ }
> mean(vec)
[1] 1.949943
> sd(vec)
[1] 0.1415905
> sqrt(20)*sd(vec)
[1] 0.633212

FIGURE 2.6
Comparing nonparametric and parametric bootstrapping

Example 2.10 Assume that the original sample contains 20 observations generated from a normal model with mean 2 and standard deviation 0.8. With the original data, we can run non-parametric bootstrapping to estimate the unknown mean and the unknown standard deviation. As shown in Figure 2.6, the bootstrapping estimate of the mean is 1.9517 and that of the standard deviation is 0.6335. The non-parametric bootstrapping results are very close to the sample mean and sample standard deviation of the original sample, 1.952 and 0.6399, respectively.

When parametric bootstrapping is used, the re-sampling data are generated from a normal model with mean 1.952 and standard deviation 0.6399. The re-sampling data are no longer drawn from the original sample; instead, they are drawn from this fitted normal model. As shown in Figure 2.6, the mean value of the re-sampling data is now 1.9499 with standard deviation 0.6332. These results are very close to the outcomes using the non-parametric method.

Certainly, in Example 2.9 and Example 2.10, the quantity being estimated is the unknown mean, and its variability is measured by the sample standard deviation. In such cases, the bootstrapping method does not add much information beyond the original sample. In the following example, we explore a situation where the original data sample does not provide a direct estimate of the standard deviation of the parameter estimator. In this case, bootstrapping becomes an effective way to find the standard deviation, and hence the confidence interval.

Example 2.11 Blood pressure instability is a critical issue in the analysis of treatment regimes for hypertension patients. As an example, consider the treatment regime with three medications, Losartan, Valsartan, and Bisoprolol, each contributing to the drug efficacy. The measurement for blood pressure fluctuation becomes

$$Z = \alpha L + \alpha V + (1-2\alpha)B + c, \qquad (2.18)$$

where c is the constant of the patient's baseline blood pressure. Both Losartan and Valsartan target the Angiotensin II receptor blockers, while Bisoprolol targets Beta-blockers. The question is to find the proper weight α, the right proportion for the Angiotensin II receptor blockers, in the treatment regime so that the blood pressure variation reaches the minimum possible level. Here, the effects of Losartan and Valsartan are correlated since they target the same receptor blockers, but the impact of Bisoprolol is not correlated with Losartan or Valsartan because it targets different receptor blockers.

Solution: According to (2.18), the blood pressure variability reads

$$Var(Z) = Var(\alpha L + \alpha V + (1-2\alpha)B + c) = \alpha^2 Var(L) + \alpha^2 Var(V) + (1-2\alpha)^2 Var(B) + 2\alpha^2 Cov(L,V),$$

since the effects of Losartan and Valsartan are correlated, but the impact of Bisoprolol is not correlated with Losartan or Valsartan; that is,

$$Cov(L,V) \neq 0, \qquad Cov(L,B) = Cov(V,B) = 0.$$

Thus, the blood pressure fluctuation is a function of the medication proportion α in the construction of the treatment regime, when the drug effect variations Var(L), Var(V), Var(B) and the correlation Cov(L, V) are pre-determined in the stage of drug development (prior to the study of the treatment regime). This leads to the conclusion that, in this study of the treatment regime, α is the only factor attributing to the concern of blood pressure instability:

$$f(\alpha) = \alpha^2 Var(L) + \alpha^2 Var(V) + (1-2\alpha)^2 Var(B) + 2\alpha^2 Cov(L,V). \qquad (2.19)$$


Taking the derivative of f(α) in (2.19) with respect to α and setting it to zero yields, after some algebra,

$$\hat{\alpha} = \frac{2Var(B)}{Var(L) + Var(V) + 4Var(B) + 2Cov(L,V)}. \qquad (2.20)$$
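For completeness, the omitted algebra behind (2.20) is a single differentiation:

$$f'(\alpha) = 2\alpha Var(L) + 2\alpha Var(V) - 4(1-2\alpha)Var(B) + 4\alpha Cov(L,V) = 0,$$

so that

$$\alpha\big(2Var(L) + 2Var(V) + 8Var(B) + 4Cov(L,V)\big) = 4Var(B),$$

which gives (2.20) after dividing both sides by 2.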
With a set of samples on the fluctuation of blood pressure records from clinical trials, we can use the plug-in moment estimation method to estimate the treatment regime proportion α̂; however, an estimate of the standard deviation of α̂ is not directly available. Under this scenario, the bootstrapping method provides a convenient and efficient alternative.

> library(boot)
> bp <- read.table("D:/chapter2/bp.txt", header=TRUE)
> head(bp)
Losartan Valsartan Bisoprolol
1 -1.5453679 -1.1720908 -0.37967880
2 2.0408970 1.4170437 2.27528290
3 -0.1547983 -0.1231551 2.01227522
4 0.8056872 0.4575227 -0.03546461
5 -1.0266717 -0.2783768 -0.86678710
6 2.1577671 1.1213169 0.68332724

> alpha.fn <-function(data, index){


+ X1 <-data$Losartan[index]
+ X2 <- data$Valsartan[index]
+ X3 <- data$Bisoprolol[index]
+ (2*var(X3))/(var(X1)+var(X2)+4*var(X3)+2*cov(X1,X2))
+ }

> (alp <- alpha.fn(bp, 1:200))


[1] 0.2970361
> boot1<-boot(bp, alpha.fn, R=1000)
> alpha.CI.upper <- alp + 1.96*apply(boot1$t,2,sd)[1]
> alpha.CI.lower <- alp - 1.96*apply(boot1$t,2,sd)[1]
> alpha.CI.lower
[1] 0.263982
> alpha.CI.upper
[1] 0.3300902

FIGURE 2.7
Optimal treatment regime for blood pressure instability

As depicted in Figure 2.7, with the original sample of blood pressure fluc-
tuation of 200 patients in the clinical trial, some patients responded positively
to medications on Beta-blockers, some on Angiotensin II receptor blockers,
and some on both of them. With the original sample, plugging-in the moment
estimators of the variations and covariation in (2.20) gets

α̂ = 0.297,

which means that if 29.7% of the hypertension medication targets Angiotensin II receptor blockers and 70.3% targets Beta-blockers, the drug treatment regime reaches its optimal level in keeping the blood pressure stable for the patients.

However, the data alone are unable to estimate the standard deviation of α̂ for the 95% confidence range on the optimal treatment regime. Figure 2.7 shows that with the bootstrapping method and 1000 re-sampling data sets, we have the bootstrap mean

$$\bar{\alpha} = \frac{1}{1000}\sum_{r=1}^{1000}\hat{\alpha}_r = 0.2970361,$$
and the bootstrap standard deviation

$$\sqrt{\frac{1}{1000-1}\sum_{r=1}^{1000}(\hat{\alpha}_r - \bar{\alpha})^2} = 0.01686434,$$

which leads to a 95% confidence interval (0.264, 0.330) for the unknown optimal proportion α. This means that the optimal treatment regime lies in the range from 26.4% to 33% for the best blood pressure stability of hypertension patients.
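As a side note, the boot package can also produce the interval directly from the re-sampling distribution; the percentile interval is one hedge against non-normality of α̂ (the exact numbers will vary with the random seed).

# Percentile bootstrap interval from the boot1 object of Figure 2.7
library(boot)
boot.ci(boot1, type = "perc")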

Summary
This chapter focuses on basic concepts and methods that facilitate follow-
up discussions on statistical prediction and machine learning. It starts with
a discussion on different types of data, which is the first step in the learning
process. Different types of data require different methods of measure for homo-
geneity and learning procedures. This is often overlooked in machine learning
in practice. If the method is not right, the trained model could be fatally
misleading, even if it reaches a small testing error in one testing dataset. We
used case-control data, cohort data, and cross-sectional data in this chapter
to elucidate discernible methods and outcomes corresponding to the data.
The second key component that we concentrate on in this chapter is the decision tree, a concept that we will frequently use and intertwine with other topics in
the rest of the book. We introduce the mathematical definition and practical
interpretation of a decision tree, which changes the conventional inference
approach on the culture camp of model-based data science. More insightful
issues in this regard will be delineated in Chapter 9.
Similar to the fundamental concepts on the probability of type-I error and
the probability of type-II error in hypothesis testing, another frequently used
terminology in the data-oriented camp is the concepts of sensitivity and speci-
ficity, with the plot of the two measurements by the ROC curve. We enhance
the definition with introductory examples in this chapter and will explore
further on this topic regarding the trade-off between sensitivity-specificity in
Chapter 3.
Cross-validation and bootstrapping are two data-based computer-intensive
methods in the data-oriented culture camp. We introduce them in this chapter

with a theoretical discussion on LOOCV for linear models and an illustrating example on k-fold cross-validation in SAS. It should be mentioned that
ing example on k-fold cross-validation in SAS. It should be mentioned that
these two methods are frequently intertwined with other data science meth-
ods in the rest of the book, such as linear prediction (Chapter 4), non-linear
prediction (Chapter 5), support vector machine (Chapter 8), and range re-
gression (Chapter 9). More intrinsic discussion on bootstrapping methods can
be found in Efron (1979) [47], Efron and Tibshirani (1993) [48], and Davison and Hinkley (2006) [41], among others.
3
Sensitivity and Specificity Trade-off

For problems involving classification or disease prediction, we are often tasked


with making a decision for a binary response. For example, diagnosing healthy
or diseased subjects, detecting male or female persons in facial recognition, etc.
Making decisions for binary outcomes always involves two errors: the error of
incorrectly diagnosing a disease, P (claiming disease |healthy), and the error
of missing a disease, P (claiming health |case).
If we assert every subject as healthy, we will never mistakenly diagnose a healthy subject as diseased,

$$P(\text{claiming disease}\,|\,healthy) = 0,$$

but we will surely miss every case in the population,

$$P(\text{claiming health}\,|\,case) = 1.$$

On the other hand, if we assert everyone as having the disease (do not claim anybody as being healthy),

$$P(\text{claiming health}\,|\,case) = 0;$$

consequently, we will misdiagnose every healthy subject in the study,

$$P(\text{claiming disease}\,|\,healthy) = 1.$$

This behavior is known as the trade-off between sensitivity and specificity.


In hypothesis testing, this statistical concept is closely related to the trade-
off between the probability of making type-I and the probability of making
type-II error. In the following section, we will discuss the trade-off in detail
and address new methods resolving the dilemma regarding the sensitivity-
specificity trade-off.

3.1 Dilemma on false positive and false negative errors


As discussed in Chapter 2, when we predict a binary outcome (such as diseased
vs healthy) with diagnostic measure D, assuming that a low value of D is


associated with the disease under investigation, the concepts of sensitivity and specificity are defined as

$$Sensitivity = P(D < c\,|\,case), \qquad Specificity = P(D \ge c\,|\,control), \qquad (3.1)$$
where c is the diagnostic criterion. The diagnostic measure D may be derived
from a logistic regression model, or from clinical results. For example, when
we use complete blood count (CBC) to diagnose leukemia, CBC measures the
number of red blood cells, white blood cells, and platelets in a patient’s blood.
It also provides information on the amount of hemoglobin (oxygen carriers)
and hematocrit (proportion of red blood cells in the blood). Although a low
value of CBC is linked to leukemia, CBC changes alongside demographic fea-
tures such as age, gender, race, BMI, and comorbidity. CBC can even change
from time to time (longitudinal effect) within the same individual. Thus, it is
unrealistic to pinpoint a cut-off value of the diagnostic threshold c in diagnos-
ing leukemia. For a specific population, the threshold c needs to be calculated
from the training data in disease prediction.
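In practice, sensitivity and specificity at a candidate threshold c are estimated by their empirical counterparts on the training data. A minimal R sketch, assuming a numeric score D and 0/1 case labels y, with low D indicating disease as in (3.1):

# Empirical sensitivity and specificity at threshold c (low D = disease)
sens <- function(D, y, c) mean(D[y == 1] <  c)   # P(D < c | case)
spec <- function(D, y, c) mean(D[y == 0] >= c)   # P(D >= c | control)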
On one hand, if c is set too high, more subjects will have their CBC measure D satisfying the condition D < c. This may result in more subjects being incorrectly classified into the diseased group; in particular, misdiagnosing healthy subjects as having leukemia incurs the false positive error. On the other hand, if the threshold c is set too low, fewer people will have their CBC test result satisfying the condition D < c, and more people will have their CBC test result satisfying D ≥ c. This leads to more people being incorrectly classified as healthy, causing the false negative error.

Theoretically, one may always set the criterion c to be higher than the highest possible CBC value to completely avoid the false negative error, since such c values result in the claim that everyone is positive. However, this leads to the case where sensitivity = 1 and specificity = 0, as pointed out in (3.1). Alternatively, when c is set to any value below the lowest possible CBC value, such a c value leads to the assertion that everyone is healthy, resulting in sensitivity = 0 and specificity = 1 by (3.1). In general, for the scenario of predicting leukemia using CBC, within the permissible range of the threshold c defined in (3.1), a higher value of c increases the sensitivity but decreases the specificity, and a lower value of c increases the specificity but decreases the sensitivity. As such, there is a dilemma in the control of sensitivity and specificity in the determination of the threshold value c.
In logistic regression analysis with a binary response variable, the threshold c is, in general, selected in the following way. First, all permissible values of c are used to construct an estimated ROC curve by plotting the pairs of points

$$(1 - specificity(c),\; sensitivity(c)), \qquad c \in A,$$

where A is the set of all permissible c values. The area under the ROC curve is typically used to select the prediction model, and the value of c is adjusted to produce a sensitivity and specificity with the closest distance to the optimal point in the ROC plot, where sensitivity = specificity = 1. Namely,

TABLE 3.1
Type-I and Type-II errors in hypothesis testing

                   | Rejecting null hypothesis | Not rejecting null hypothesis
Null true          | Type-I error              | correct decision
Alternative true   | power                     | Type-II error

$$c = \mathrm{ArgInf}_{c \in A}\,\|d(c)\|, \qquad (3.2)$$

where

$$\|d(c)\| = \sqrt{(1-specificity(c))^2 + (1-sensitivity(c))^2} = \sqrt{(\text{false positive error}(c))^2 + (\text{false negative error}(c))^2}. \qquad (3.3)$$

It should be noted that (3.3) is only one of the selection approaches for the diagnostic criterion c. It uses the convenient concept of the distance between two points in the xy-plane. This selection is not necessarily the optimal choice in practice. For example, when controlling the false negative error is more important, such as misdiagnosing and missing treatments of a life-threatening disease versus the error of asking the patient to take a second confirmatory screening test, the selection standard in (3.3) can be misleading, because it does not take the weights on diagnostic priority into consideration. On the other hand, when controlling the false positive error is more critical, such as the error of incorrectly pushing a healthy person into an operation room versus the error of asking the patient to take a preventive medicine, the selection standard in (3.3) is also misleading, because it treats both errors at the same level of importance (equal weights).
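The rule in (3.2)-(3.3) is easy to implement once sensitivity and specificity have been evaluated on a grid of candidate thresholds, and a weighted variant addresses the priority issue just discussed. A sketch in R; the weight w is an illustrative addition, with w = 1 recovering (3.3):

# Closest-to-(0,1) threshold selection, with an optional weight w on the
# false negative error (w = 1 gives the unweighted rule in (3.3))
closest_threshold <- function(cand, sens, spec, w = 1) {
  d2 <- (1 - spec)^2 + w * (1 - sens)^2
  cand[which.min(d2)]
}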
Controlling false positive error and false negative error is not a new para-
dox to data science. In hypothesis testing, we are often confronted with the
dilemma of controlling the probability of Type-I error (incorrectly rejecting
the null hypothesis) and the probability of Type-II error (incorrectly rejecting
the true alternative hypothesis) as shown in Table 3.1.
According to well-documented statistics literature, the solution to the
dilemma on the control of the Type-I and Type-II errors is to select one
statement as the null hypothesis, control the probability of making the Type-
I error in the selection of rejection areas, and find the most powerful test to
minimize the chance of making the Type-II error.
In what follows in this chapter, we will reformulate the control of the false positive error and the control of the false negative error into the framework of hypothesis testing, and consequently state similar results in the determination of the diagnostic threshold: keeping control of the false positive rate while minimizing the false negative rate. The idea is similar to the concept of the uniformly most powerful test in hypothesis testing.

3.2 Most sensitive diagnostic variable


We shall introduce a new concept in the evaluation of a diagnostic procedure,
the Uniformly Most Efficient predictor, which is equivalent to a global opti-
mization solution when the target function aims to maximize the sensitivity
of the diagnostic measure given a pre-specified level of specificity. Notice that,
as defined in (3.1), both sensitivity and specificity depend on the diagnostic
measure D and diagnostic threshold c.

Definition 3.1 Uniformly Most Efficient Lower Variable: Assume that lower
values of the diagnostic measure are associated with the disease. Let K be a
set of diagnostic measures that satisfy

P (D ≥ c|healthy) = 1 − α,

for a pre-specified level of specificity 1 − α. The most efficient lower variable


refers to the diagnostic measure D∗ that satisfies the following conditions.

I) D∗ ∈ K, which means P (D∗ ≥ c∗ |healthy) = 1 − α.


II) P (D∗ < c∗ |case) ≥ P (D < c|case) for any D ∈ K, which means that D∗
is the one with the highest sensitivity among all the diagnostic measures
that have specificity 1 − α.

When high values of the diagnostic measure D are associated with the
disease, such as escalated systolic blood pressure or escalated cholesterol level
for heart attacks, Definition 3.1 is equivalent to the following.

Definition 3.2 Uniformly Most Efficient Upper Variable: Assume that high
values of the diagnostic measure are associated with the disease. Let K be a
set of diagnostic measures that satisfy

P (D < c|healthy) = 1 − α,

for a pre-specified level of specificity 1 − α. The most efficient upper variable


refers to the diagnostic measure D∗ that satisfies the following conditions.

1 D∗ ∈ K, which means P (D∗ < c∗ |healthy) = 1 − α.


2 P (D∗ ≥ c∗ |case) ≥ P (D ≥ c|case) for any D ∈ K, which means that D∗
is the one with the highest sensitivity among all the diagnostic measures
that have specificity 1 − α.

Definition 3.1 and Definition 3.2 are essentially the same except for the direction of the diagnostic measure toward the disease. Notice that with the two evaluation criteria (false positive and false negative errors, or type-I and type-II errors), the traditional approach is to fix one evaluation criterion (significance level) and optimize the second one (power). For the definition of uniformly most efficient measures, we essentially fix the probability of the false positive error (by fixing the specificity) and find the largest possible sensitivity. In what follows, we shall use a simulation example to obtain a better understanding of Definition 3.1 and Definition 3.2.

Example 3.1 Low-density lipoprotein cholesterol (LDL-C) level is a risk factor for coronary heart disease. Assume that the LDL-C level of the healthy population follows a normal model with mean 130 mg/dL and a standard deviation of 3 mg/dL. Also assume that LDL-C levels of patients with coronary heart disease follow a normal model with mean μ greater than 130 mg/dL and the same variation (standard deviation). If we have the blood test results of 20 subjects with similar LDL-C levels, we want to identify the diagnostic measure for coronary heart disease patients so that the specificity is kept at the 95% level.

Solution Set the level α = 0.05 so that the specificity is at the 0.95 level. We have

$$Specificity = P\Big(\Big\{X : \frac{\bar{X}-130}{3/\sqrt{20}} < z_\alpha\Big\}\,\Big|\,healthy\Big) = 0.95,$$

where X is the random sample, X̄ is the sample mean, and z_α is defined by P(Z > z_α) = α, in which Z follows the standard normal distribution.

Thus, for the diagnostic predictor D = X̄, the threshold for 0.95 specificity reads

$$c = 130 + 1.645 \times 3/\sqrt{20}.$$
Under this setting, the sensitivity becomes

$$Sensitivity(\mu) = P\Big(\frac{\bar{X}-130}{3/\sqrt{20}} > z_\alpha\,\Big|\,\mu\Big) = P\Big(\frac{\bar{X}-\mu}{3/\sqrt{20}} > z_\alpha - \frac{\mu-130}{3/\sqrt{20}}\Big) = P\Big(Z > z_\alpha - \frac{\mu-130}{3/\sqrt{20}}\Big).$$

Notice that in this example, the sensitivity is a monotonically increasing function of the mean LDL-C level μ.

We may evaluate the sensitivity over a grid of LDL-C levels to examine it as a function of μ. As shown in Figure 3.1, when the patient LDL-cholesterol level increases, the sensitivity associated with the diagnostic measure D increases. The corresponding R code is included in Figure 3.2.
Example 3.1 shows the approach to find the diagnostic measure so that
the specificity can satisfy a given level. However, it does not prove whether
the diagnostic measure is a uniformly most efficient predictor. The following
optimal criterion ensures that the diagnostic measure D in Example 3.1 is

FIGURE 3.1
Sensitivity in patients with LDL-C level more than 130 mg/dL

a most efficient predictor. In general, when the underlying models can be


plausibly assumed, we use the likelihood ratio criterion to seek the uniformly
most efficient diagnostic predictor.

Theorem 3.1 Assume that the underlying model (pdf or pmf) of a set of data is f1(x) for case and f0(x) for control. The most efficient diagnostic measure is

$$D^* = \frac{f_1(x)}{f_0(x)},$$

claiming disease when D* > c*, where the threshold c* is chosen so that the specificity satisfies

$$P(D^* \le c^*\,|\,healthy) = 1-\alpha.$$

When small values of the raw diagnostic measure D are associated with the disease, the region {D* > c*} reduces to one of the form {D < c}; when large values of D are associated with the disease, it reduces to one of the form {D > c}.



sigma <- 3
n <- 20
mu0 <- 130
mu.o <- seq(128, 133, 0.03)
POWER <- matrix(0, length(mu.o), 2)

for (i in 1:length(mu.o)) {
  mu <- mu.o[i]
  result <- 1 - pnorm(qnorm(0.95) - (mu - 130)/(3/sqrt(20)))
  POWER[i, ] <- c(mu, result)
}

png(file = "~/desktop/saving_plot2.png", width = 500, height = 400)
plot(POWER, xlim = c(128, 133), xlab = "True LDL value",
     ylab = "Sensitivity", type = "l", col = "green", lty = 1, lwd = 2)
abline(v = 130, col = "red")
dev.off()   # close the png device so the plot file is written

FIGURE 3.2
Code for sensitivity in patients with LDL-C level more than 130 mg/dL

Note that the above theorem provides an approach to find the most efficient di-
agnostic predictor when the likelihood function of the disease and the healthy
population can be plausibly assumed.
Proof Assume that large values of the diagnostic measure D are associated with the disease. For any diagnostic measure D with specificity

$$P(D \le c\,|\,healthy) = 1-\alpha,$$

denote

$$A = \{D > c\}, \qquad A^* = \{D^* > c^*\};$$

we have

$$(I_A - I_{A^*})[D^*(X) - c^*] \le 0,$$

which is equivalent to

$$(I_A - I_{A^*})[f_1(X) - c^* f_0(X)] \le 0,$$

hence

$$\int (I_A - I_{A^*})[f_1(X) - c^* f_0(X)]\,dX \le 0,$$

where $f_i(X) = \prod_{j=1}^{n} f(x_j|\theta_i)$ for i = 0, 1. Now

$$\int (I_A - I_{A^*})f_1(X)\,dX \le c^* \int (I_A - I_{A^*})f_0(X)\,dX.$$

The right-hand side equals zero, since both predictors have specificity 1 − α, so that

$$P(D > c\,|\,healthy) = P(D^* > c^*\,|\,healthy) = \alpha.$$

We have

$$\int (I_A - I_{A^*})f_1(X)\,dX \le 0.$$

This implies that

$$\int I_A f_1(X)\,dX \le \int I_{A^*} f_1(X)\,dX, \qquad \text{that is,} \qquad P(A\,|\,case) \le P(A^*\,|\,case).$$

This proves that D* is the most efficient diagnostic measure and c* is the most efficient diagnostic threshold. When small values of the diagnostic measure D are associated with the disease, a similar argument completes the proof of the theorem.
In the following examples, we present two scenarios (one for continuous
model and one for discrete model) to illustrate the application of Theorem 3.1
in the construction of the most efficient diagnostic predictor.

Example 3.2 Assume that the LDL-cholesterol levels follow a normal model
with Xi ∼ N (130, 2) for healthy subjects and Xi ∼ N (150, 2) for patients with
coronary heart diseases. If the specificity is set to 0.95, we want to find the
most efficient diagnostic predictor.

Solution: In this case, the two possible models are

$$f(X|healthy) = \Big(\frac{1}{\sqrt{2\pi}\sigma}\Big)^n \exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-130)^2\Big\}$$

and

$$f(X|case) = \Big(\frac{1}{\sqrt{2\pi}\sigma}\Big)^n \exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-150)^2\Big\}.$$

By Theorem 3.1, the most efficient diagnostic measure reads

$$D(X) = \frac{f(X|case)}{f(X|healthy)}.$$

Since the escalated LDL-Cholesterol level is associated with coronary heart disease, the most efficient diagnostic predictor is

$$\{D(X) > c\} = \Big\{X : \sum_{i=1}^{n}(x_i-150)^2 < \sum_{i=1}^{n}(x_i-130)^2 + k^*_\alpha\Big\},$$

for a constant k*_α that depends on the specificity level 1 − α, and

$$\{D(X) > c\} = \{X : \bar{X} > k^{**}_\alpha\}.$$

Now, if all the patients are healthy, the sample mean statistic of their LDL-Cholesterol levels follows $N(130, \frac{2}{\sqrt{n}})$. Standardizing the sample statistic gets

$$\{D(X) > c\} = \Big\{X : \frac{\bar{X}-130}{2/\sqrt{n}} > k^{***}_{0.05}\Big\}.$$

Notice that the evaluation criterion requires that the specificity is 0.95,

$$P(D(X) < c\,|\,healthy) = 0.95;$$

thus, we have $k^{***}_{0.05} = 1.645$. The most efficient diagnostic predictor, for any sample size n, reads

$$\{D^* > c^*\} = \Big\{X : \frac{\bar{X}-130}{2/\sqrt{n}} > 1.645\Big\}.$$
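A quick numerical check of this predictor in R, with an illustrative sample size n = 25 (not specified in the example):

# Threshold and sensitivity for Example 3.2 (sigma = 2, alpha = 0.05)
n <- 25
c_star <- 130 + qnorm(0.95) * 2 / sqrt(n)     # diagnostic threshold on X-bar
1 - pnorm((c_star - 150) / (2 / sqrt(n)))     # sensitivity at the case mean 150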

In the next example, we shall discuss finding the most efficient diagnostic predictor in a discrete model for the diagnosis of Type 1 diabetes.

Example 3.3 Let X be a random variable associated with diabetes symp-


toms, including numbness, weight loss, swollen gums, slow healing, increased
appetite, blurred vision, energy loss, and frequent thirst. Assume that from
historical data of patient records, the chance of each symptom for healthy pa-
tients (blood glucose level ≤ 100 ) versus diabetes patients (blood glucose level
> 100) in a local hospital are given in the following table.

x       | numbness | weight loss | swollen gums | slow healing | increased appetite | blurred vision | energy loss | frequent thirst
f(x|H)  | .02      | .02         | .02          | .01          | .05                | .01            | .41         | .46
f(x|D)  | .22      | .02         | .12          | .04          | .18                | .02            | .06         | .34

Based on the symptoms of a newly admitted patient, the hospital is inter-


ested in diagnosing whether the patient has diabetes with a requirement that
the specificity should be at least 95%.

Solution: Notice that in this case, the ratio of chances for each symptom, $D(X) = \frac{f(x|case)}{f(x|healthy)}$, takes the following values:

x       | numbness | weight loss | swollen gums | slow healing | increased appetite | blurred vision | energy loss | frequent thirst
f(x|H)  | 0.02     | 0.02        | 0.02         | 0.01         | 0.05               | 0.01           | 0.41        | 0.46
f(x|D)  | 0.22     | 0.02        | 0.12         | 0.04         | 0.18               | 0.02           | 0.06        | 0.34
λ       | 11       | 1           | 6            | 4            | 3.6                | 2              | 0.146       | 0.739

In this case, a higher ratio of chances indicates that the individual is more likely to have the disease. Based on Theorem 3.1, and according to the ranking of the diagnostic measure for each symptom, we arrange the symptoms by the likelihood ratio of diabetes versus healthy. To satisfy the evaluation criterion of controlling the rate of misdiagnosis at the 5% level, namely

$$Specificity = P(D < c\,|\,healthy) = 0.95,$$

the symptoms whose likelihood ratios rank highest, jointly carrying 5% of the healthy probability mass, are

$$\{D^* > c^*\} = \{numbness,\; swollen\ gums,\; slow\ healing\}.$$

The chance of correctly diagnosing a diabetes patient, which is the sensitivity of this diagnostic predictor, reads

$$Sensitivity = P(D^* > c^*\,|\,case) = 0.22 + 0.12 + 0.04 = 38\%.$$

Certainly, when the blood glucose level test is available, the laboratory test
result is more accurate in detecting diabetes as a follow-up diagnosis. How-
ever, as illustrated by Theorem 3.1, this example shows that the selection of
diagnostic predictors is possible without the use of the continuous likelihood
function.
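The construction in this example amounts to ranking symptoms by the likelihood ratio and spending at most 5% of the healthy probability mass; a short R sketch, with the probabilities transcribed from the table above:

# Discrete likelihood-ratio predictor for Example 3.3
fH <- c(numbness=.02, weight.loss=.02, swollen.gums=.02, slow.healing=.01,
        appetite=.05, blurred.vision=.01, energy.loss=.41, thirst=.46)
fD <- c(.22, .02, .12, .04, .18, .02, .06, .34)
lambda <- fD / fH                         # likelihood ratios
ord <- order(lambda, decreasing = TRUE)   # rank symptoms by lambda
sel <- ord[cumsum(fH[ord]) <= 0.05]       # keep top symptoms within 5% FPR
names(fH)[sel]                            # numbness, swollen.gums, slow.healing
sum(fD[sel])                              # sensitivity = 0.38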
It should be noted that Example 3.3 and Example 1.5 are very similar, in that the control of the type-I error in hypothesis testing plays the same role as the control of the false positive error in the sensitivity-specificity analysis.

In the above two examples, predictions are made on an unknown parameter for assumed models. The intuition behind them relies on the likelihood ratio, which optimizes the evaluation criterion by maximizing the sensitivity of the diagnostic predictor while controlling the specificity. The rationale of Theorem 3.1 is grounded on the intuition that we diagnose the patient as sick when the likelihood of sickness is high compared with the likelihood of being healthy. This idea consequently leads to the likelihood ratio measurement in the construction of the most efficient diagnostic predictor.

Definition 3.3 Likelihood Ratio measurement: Assume that the patient observations follow the model l1(θ, X) for the disease population and l2(θ, X) for the healthy population. The likelihood ratio measurement is defined as

$$D(X) = \frac{\sup_{\Theta_0} L(\theta|X)}{\sup_{\Theta} L(\theta|X)},$$

where Θ0 is the parameter set for the likelihood of the healthy population, Θ1 that of the disease population, and Θ = Θ0 ∪ Θ1 is the whole parameter space.

Heuristic rationale of the likelihood ratio measurement for diagnostic prediction: When the underlying model of the data is assumed to represent the patient features, if the likelihood ratio is small, the largest possible likelihood that the patient belongs to the healthy population is less than the largest possible likelihood that the patient is sick. Thus, we diagnose the patient as having the disease. The key idea in this model-based sensitivity-specificity analysis is the assumption of the underlying model. If the assumption is invalidated, the diagnostic predictor consequently becomes invalid in the search for the most efficient one.
Example 3.2 defines the healthy population’s LDL-Cholesterol level equal
to 130mg/dL and sick population at 150mg/dL. However, nobody in prac-
tice has an LDL-Cholesterol level exactly equal to those two specific numbers,
although the definition is mathematically sound. With the assumed under-
lying model, we shall consider scenarios where the definition of sickness is
extended to LDL-Cholesterol being more than 130 mg/dL, and the defini-
tion of a healthy population is extended to the corresponding level being less
than 130mg/dL. We use the following example to show how to seek the most
efficient diagnostic predictor with the likelihood ratio measurement.

Example 3.4 Let X1, ..., Xn be a random sample of blood test LDL-cholesterol readings. Assume that the readings follow a population distribution N(μ, 1) with an unknown common mean LDL-Cholesterol level μ. We are interested in finding the most efficient diagnostic predictor to diagnose whether the population is sick (μ more than 130) or healthy (μ less than 130) with the specificity at the 95% level.

Solution: Since the specificity is set at 95%, we have

$$\sup_{\mu \le 130} P(D > c\,|\,healthy) = 0.05.$$

On the other hand, for the sensitivity across all LDL-Cholesterol levels more than 130, we need

$$P(D > c\,|\,case) = P(\bar{X} > c^*\,|\,case) = P\Big(Z > \frac{c^*-\mu}{1/\sqrt{n}}\,\Big|\,\mu > 130\Big).$$

For any two LDL-Cholesterol levels λ1 and λ2, if λ1 ≤ 130 and λ2 > 130, by the derivation in Example 3.2, the most efficient diagnostic predictor is

$$\{D > c\} = \Big\{\bar{X} > \lambda_1 + 1.645 \times \frac{1}{\sqrt{n}}\Big\}$$

for any values λ1 and λ2. Thus, the difficulty becomes finding the optimal value c in the supremum. For notational convenience, denote μ = LDL-cholesterol level. We need to find

$$\sup_{\mu \le 130} P\Big(Z > \frac{c^*-\mu}{1/\sqrt{n}}\,\Big|\,healthy\Big).$$

Notice that

$$P\Big(Z > \frac{c^*-\mu}{1/\sqrt{n}}\,\Big|\,control\Big) = P\Big(Z > \frac{c^*-130+130-\mu}{1/\sqrt{n}}\Big) \le P\Big(Z > \frac{c^*-130}{1/\sqrt{n}}\,\Big|\,\mu \le 130\Big) = P\Big(Z > \frac{c^*-130}{1/\sqrt{n}}\Big) = 0.05.$$

So, setting $\frac{c^*-130}{1/\sqrt{n}} = 1.645$ gets $c^* = 130 + 1.645\frac{1}{\sqrt{n}}$. This leads to the most efficient predictor, D,

$$\{D > c\} = \{x : \bar{X} > 130 + 1.645/\sqrt{n}\}.$$
We have discussed two approaches, the likelihood ratio measurement and Theorem 3.1, for the construction of the uniformly most efficient diagnostic predictor. Notice that when the likelihoods of the case and control populations are each confined to one value, the most efficient diagnostic predictor reads

$$\frac{L(x,\theta_0)}{\max(L(x,\theta_0), L(x,\theta_1))} < \lambda.$$

On the other hand, the most efficient diagnostic predictor according to Theorem 3.1 is

$$L(x,\theta_1) > c^* L(x,\theta_0).$$

Under this setting, we can clearly deduce that the two most efficient diagnostic predictors are identical, after a few steps of simple algebra.
As shown in the previous examples, one discernible feature in the optimiza-
tion process is the reduction of the data information from the n observations
of the original data to one diagnostic predictor. In other words, the opti-
mizing process with the evaluation standard becomes a process that reduces
the dimension of the data toward a diagnostic predictor. Given this new per-
spective, when a sufficient statistic with lower dimension is available in the
likelihood ratio expression, Theorem 3.1 can be simplified as follows to reduce
the diagnostic predictor into a lower dimensional function.

Theorem 3.2 Assume that the underlying model (pdf or pmf) of a set of data X is f(x|θ) ∈ {f(x|case), f(x|healthy)}. Denote by T(X) a sufficient statistic for θ, and by g_i(t), i = 0, 1, the pmf (or pdf) of T corresponding to the healthy and case populations, respectively. Then the most efficient diagnostic measure becomes

$$D(t) = \frac{g_1(t)}{g_0(t)},$$

claiming disease when D > c, with the specificity

$$P(D \le c\,|\,healthy) = 1-\alpha.$$
Implementing the above theorem relies on the availability of a sufficient
statistic that may reduce the dimension of the data while maintaining suf-
ficient likelihood information. Since identifying a sufficient statistic is a key
in the implementation of the above theorem, it is relevant to mention the
factorization theorem, which involves the dimensional reduction process while
preserving data sufficiency.
Theorem 3.3 Factorization theorem: Let f (x|θ) denote the underpinning
model (pmf or pdf ) of a sample x. A statistic T (x) is sufficient for the un-
known parameter θ if and only if it satisfies the following condition. There
exist functions g(t|θ) and h(x) such that, for all sample points and all permis-
sible values of the parameter θ, the joint density can be decomposed into the
product of information about the unknown parameter and information on the
rest of the sample.
f (x|θ) = g(T (x)|θ)h(x).
The proof and discussions on the Factorization Theorem can be found in
Lehmann and Romano [81] or Casella and Berger [16]. Since this book focuses
more on statistical prediction and machine learning, we elect not to pursue
the theory of sufficient statistics in this book.
As pointed out in Example 3.4, we are often confronted with situations
where the disease and healthy populations are referred to a range (instead
of a value) of the observations. Under this scenario, the optimizing process
discussed above cannot be directly applied. We shall now discuss the concept
of the uniformly most efficient diagnostic predictor for a range of the patient
healthy readings. With this objective in mind, we need the following concept.
Definition 3.4 Monotone Likelihood Ratio: Assume that we have a set of data X that follows a family of underlying models (pmfs or pdfs) characterized by an unknown parameter θ ∈ Θ. Let T be a sufficient statistic for θ with the likelihood function g(t|θ) ∈ {g(t|θ) : θ ∈ Θ}. The monotone likelihood ratio property refers to the following property of the model of T: for any two points in the parameter space, θ2 > θ1, the likelihood ratio

$$\lambda(t) = g(t|\theta_2)/g(t|\theta_1)$$

is a monotone function of t in the domain {t : g(t|θ1) > 0 or g(t|θ2) > 0}.
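As a standard illustration (a fact not spelled out in the text), the normal location family satisfies this property: for T ~ N(θ, σ²) and θ2 > θ1,

$$\lambda(t) = \frac{g(t|\theta_2)}{g(t|\theta_1)} = \exp\Big\{\frac{(\theta_2-\theta_1)t}{\sigma^2} - \frac{\theta_2^2-\theta_1^2}{2\sigma^2}\Big\},$$

which is increasing in t, so the MLR property holds.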

With the monotone likelihood ratio property, the process of optimizing sensitivity and specificity for a diagnostic predictor can be formulated as follows. A similar version of this result in hypothesis testing is the Karlin-Rubin theorem. For this reason, the following theorem is sometimes referred to as the adapted Karlin-Rubin theorem in sensitivity-specificity analysis.
Theorem 3.4 Assume that the underlying model of the observations can be characterized by a density function f(x|θ) with a parameter θ. If f(x|θ) has the monotone (increasing) likelihood ratio property, the corresponding sensitivity is a nondecreasing function of the parameter θ under the setting

$$\{D > c\} = \{T > t\},$$

and

$$Sensitivity(\theta_1) \ge Sensitivity(\theta_2) \quad\text{when } \theta_1 > \theta_2.$$
Proof: To show that the sensitivity is a non-decreasing function of the parameter, we give the proof for continuous distributions for convenience; the proof is also applicable to discrete MLR families.

For θ1 > θ2, define

$$F(t|\theta) = 1 - P_\theta(T > t\,|\,case).$$

It suffices to show F(t|θ1) ≤ F(t|θ2), where F is the distribution function of T with parameter θ. Now

$$\frac{d}{dt}[F(t|\theta_1) - F(t|\theta_2)] = f(t|\theta_1) - f(t|\theta_2) = f(t|\theta_2)\Big[\frac{f(t|\theta_1)}{f(t|\theta_2)} - 1\Big].$$

Because f(t|θ) has the MLR property, the ratio on the right-hand side is increasing in t, and the derivative can only change sign from negative to positive. This indicates that any interior extreme is a minimum, and the maximum of the function

$$g(t) = F(t|\theta_1) - F(t|\theta_2)$$

is attained at +∞ or −∞. Since

$$g(-\infty) = F(-\infty|\theta_1) - F(-\infty|\theta_2) = 0, \qquad g(\infty) = F(\infty|\theta_1) - F(\infty|\theta_2) = 0,$$

we have g(t) ≤ 0 and

$$F(t|\theta_1) \le F(t|\theta_2),$$

which is tantamount to

$$Sensitivity(\theta_1) = 1 - F(t|\theta_1) \ge 1 - F(t|\theta_2) = Sensitivity(\theta_2).$$
With Theorem 3.4, we have the following theorem (adapted Karlin-Rubin
theorem) for the construction of the most efficient diagnostic predictor.

Theorem 3.5 Assume that the underlying model of a set of data X is governed by a function characterized by a patient feature θ ∈ R. Consider a classification problem formulated as healthy θ ≤ θ0 and case θ > θ0. Suppose that T is a sufficient statistic for θ, and the family of pmfs or pdfs of T has the MLR (Monotone Likelihood Ratio) property. If large values of D are associated with the disease, the most efficient diagnostic predictor is

$$\{D > c\} = \{T > t_0\},$$

where D is the diagnostic measure and c is the diagnostic threshold. The value t0 is determined according to the condition

$$Specificity = P(D \le c\,|\,healthy) = P(T \le t_0\,|\,healthy) = 1-\alpha.$$

Proof: Let β(θ) = P_θ(T > t0) be the sensitivity of the diagnostic predictor. Fix any value of the parameter θ′ > θ0, and consider a simple prediction problem on θ = θ0 versus θ = θ′. Since the underlying model (the family of pmfs or pdfs) of T is assumed to have the MLR property, by Theorem 3.4, β(θ) is nondecreasing in θ, so we have:

i) sup_{θ≤θ0} β(θ) = β(θ0) = α; hence the specificity of the diagnostic predictor is 1 − α.

ii) If we define

$$k^* = \inf_{t \in \mathcal{T}} \frac{g(t|\theta')}{g(t|\theta_0)},$$

where $\mathcal{T} = \{t : t > t_0$ and either $g(t|\theta') > 0$ or $g(t|\theta_0) > 0\}$, it follows that

$$T > t_0 \iff \frac{g(t|\theta')}{g(t|\theta_0)} > k^*.$$

By Theorem 3.1, parts (i) and (ii) imply that β(θ′) ≥ β*(θ′), where β*(θ) is the sensitivity of any other diagnostic predictor with specificity at the 1 − α level. Since θ′ is arbitrary, the diagnostic predictor is the most efficient diagnostic predictor with specificity at the level 1 − α.
By an analogous argument, the following theorem can be derived.
Theorem 3.6 Assume that the underlying model of a set of data X is governed by a function characterized by a patient feature θ ∈ R. Consider a classification problem formulated as healthy θ ≥ θ0 and case θ < θ0. Suppose that T is a sufficient statistic for θ, and the family of pmfs or pdfs of T has the MLR (Monotone Likelihood Ratio) property. If small values of D are associated with the disease, the most efficient diagnostic predictor is

$$\{D < c\} = \{T < t_0\},$$

where D is the diagnostic measure, c is the diagnostic threshold, and t0 is the value that satisfies

$$Specificity = P(D \ge c\,|\,healthy) = P(T \ge t_0\,|\,healthy) = 1-\alpha.$$

The following example (a follow-up discussion on Example 3.4) demonstrates an application of the adapted Karlin-Rubin theorem in the optimization process toward the uniformly most efficient diagnostic predictor for a given specificity level 1 − α.

Example 3.5 Let X1, ..., Xn be a random sample of blood test LDL-cholesterol readings from a population. Assume that the readings follow a population N(μ, 1) with an unknown common mean LDL-Cholesterol level μ. We are interested in showing that the solution in Example 3.4 is indeed the most efficient diagnostic predictor with specificity at the 95% level.

Solution: Since the normal model has the monotone likelihood ratio property, and high readings of the LDL-Cholesterol level are associated with coronary heart disease, by the adapted Karlin-Rubin theorem (Theorem 3.5), the most efficient diagnostic predictor satisfies the condition

$$\{D > c\} = \Big\{\frac{\bar{X}-130}{1/\sqrt{n}} > 1.645\Big\}.$$

This is equivalent to

$$\bar{X} > 130 + \frac{1.645}{\sqrt{n}};$$

the most efficient diagnostic measure is the sample mean LDL-Cholesterol level, and the diagnostic outcome is positive when the sample mean reaches the corresponding threshold.

3.3 Two-ended diagnostic measures


The preceding section discusses methods to find the most efficient diagnostic
predictor for one-ended extremes. When large values of the diagnostic mea-
surement are associated with the disease such as LDL-Cholesterol level for
coronary heart disease, it is the upper extreme. Alternatively, when low val-
ues of the predictor are associated with the disease, such as the RBC (red
blood cell count) for leukemia, it is the lower extreme. However, in practice,
there are many scenarios where both low and high measurements are associ-
ated with a disease. For instance, consider the reading of bun to creatinine
ratio in a blood testing report. A high bun to creatinine ratio indicates con-
ditions that lead to decreased blood flow to the kidney. On the other hand,
a low bun to creatinine ratio implies increasing creatinine blood level, which
also indicates kidney damage or kidney failure. In this case, we need to find a
two-ended diagnostic predictor that can efficiently diagnose the disease. For
example, assume that the healthy range for the bun to creatinine ratio is from

10:1 to 20:1, a blood test report with either too high (exceeding 20) or too
low (below 10) bun to creatinine ratio is an indication of kidney failure.
Following the concept of optimization that we discussed in Section 3.1,
for convenience, we continue with the idea of UMEP (uniformly most efficient
predictor), and treat that as an example to illustrate the principle of restricted
optimization when the underlying model is assumed.
As discussed in Theorem 3.4, the adapted Karlin-Rubin theorem ensures
that the existence of an optimal solution for one-ended diagnostic measures
when the underlying model has the MLR property. However, the story is dif-
ferent for the two-ended extreme problem. As shown in the following example,
although the adapted Karlin-Rubin theorem is convenient in the derivation of
optimal diagnostic predictor for one-ended extremes on the diagnostic mea-
sures, when we consider two-ended diagnostic predictors, the global optimal
solution does not exist.

Example 3.6 Let X1, ..., Xn be a set of blood test readings on the bun to creatinine ratios. Assume that Xi ∼ N(θ, σ²) with known variance σ² = 1. Consider testing the prediction of healthy θ = θ0 versus kidney disease θ ≠ θ0 for a given constant θ0. Given a pre-fixed level of specificity 1 − α, we are interested in identifying the most efficient diagnostic predictor that satisfies

$$P(\text{Claiming healthy}\,|\,healthy) \ge 1-\alpha. \qquad (3.4)$$

The optimal solution for this problem does not exist. To see this point, consider another parameter value θ1 < θ0 (for instance, two different values of bun to creatinine ratios). By the adapted Karlin-Rubin theorem, the uniformly most efficient predictor claims kidney disease when

$$\bar{X} < -\sigma z_\alpha/\sqrt{n} + \theta_0.$$

This predictor has the highest sensitivity at the bun to creatinine ratio θ1 among all predictors satisfying equation (3.4). We may call this Predictor-1. By the uniqueness of the UME predictor, if a UMEP exists for this problem, it must almost surely be Predictor-1.

Now consider a different predictor, Predictor-2, which claims disease when

$$\bar{X} > \sigma z_\alpha/\sqrt{n} + \theta_0.$$

Obviously, Predictor-2 also has the specificity level at 1 − α. We can now


compare the sensitivity of the two predictors, Predictor-1 and Predictor-2 as
follows.
Let βi (θ) be the sensitivity function of predictor i with i = 1, 2. For
any bun-to-creatinine-ratio θ2 > θ0 , denote the sensitivity associated with

Predictor-i as βi(θ), we have

$$\begin{aligned}
\beta_2(\theta_2) &= P_{\theta_2}\Big(\bar{X} > \frac{z_\alpha\sigma}{\sqrt{n}} + \theta_0\Big)\\
&= P_{\theta_2}\Big(\frac{\bar{X}-\theta_2}{\sigma/\sqrt{n}} > z_\alpha + \frac{\theta_0-\theta_2}{\sigma/\sqrt{n}}\Big)\\
&> P(Z > z_\alpha)\\
&= P(Z < -z_\alpha)\\
&> P_{\theta_2}\Big(\frac{\bar{X}-\theta_2}{\sigma/\sqrt{n}} < -z_\alpha + \frac{\theta_0-\theta_2}{\sigma/\sqrt{n}}\Big)\\
&= P_{\theta_2}\Big(\bar{X} < -\frac{\sigma z_\alpha}{\sqrt{n}} + \theta_0\Big)\\
&= \beta_1(\theta_2).
\end{aligned}$$

This shows that the sensitivity of Predictor-1 at the bun-to-creatinine ratio θ2 is lower than its counterpart. Thus, Predictor-1 is not a UMEP with specificity 1 − α. This contradicts the earlier statement that Predictor-1 is the UMEP. Therefore, the UMEP with specificity level 1 − α does not exist for two-ended diagnostic scenarios.
When the UMEP does not exist among all the predictors that have speci-
ficity 1 − α, instead of seeking the global optimal predictor, we may put a
restriction on the domain of predictors in the optimization process, and seek
for a local optimal solution within a confined domain of diagnostic measures.
For example, we may consider the search of optimal solution to a subgroup of
predictors named decent predictor defined below.

Definition 3.5 Decent predictor: When predicting disease populations with


a diagnostic measure, a decent predictor is a predictor satisfying the condition
that the probability of correctly diagnosing a sick patient is higher than the
probability of incorrectly diagnosing a healthy subject.

The idea of decent predictor is similar to the concept of unbiased test


defined for the content of hypothesis testing. A formal mathematical definition
of the unbiased test can be found in [81].
The conditions of Definition 3.5 for a decent predictor can be expressed as

$$Sensitivity(\phi, \theta) \ge \alpha \quad\text{and}\quad P(\text{false positive}) \le \alpha,$$

or, equivalently,

$$Sensitivity(\phi, \theta) \ge \alpha \quad\text{and}\quad Specificity \ge 1-\alpha.$$

As usual, φ(X) is the prediction function: when the diagnosed outcome is positive, φ(X) = 1; otherwise φ(X) = 0.
From Definition 3.5, any UMEP with specificity 1 − α is a UMEDP (uniformly most efficient and decent predictor). On the other hand, with the additional restriction of being a decent predictor, for prediction problems where

(Plot: sensitivity curves, with Power on the vertical axis versus True mean value on the horizontal axis, over the range −1.0 to 1.0.)

FIGURE 3.3
Sensitivity functions β1 (θ), β2 (θ), and β3 (θ).

UMEP does not exist, UMEDP may exist. We illustrate this point with the
following example.
As shown in Definition 3.5, the concept of a decent predictor is a natural requirement for a diagnostic predictor. Namely, the probability of correctly detecting a diseased subject should be at least as large as the probability of incorrectly diagnosing a healthy subject. With such a restriction, we are able to search for a UMEP with specificity 1 − α among the class of decent predictors. Such a restricted optimization procedure results in the construction of the following UMEDP.

Consider the sensitivity function β(θ) of a diagnostic predictor, Predictor-3, which diagnoses a subject as having the disease when

$$\frac{|\bar{X}-\theta_0|}{\sigma/\sqrt{n}} > z_{\alpha/2}.$$
Figure 3.3 shows the three sensitivity curves corresponding to the three pre-
dictors discussed above. The dotted curve is for β3 (θ), the black curve is for
β2 (θ), and β1 (θ) has the gray curve. As shown in the diagram, the dotted

curve, although not as powerful as the gray or the black curve at some points, is able to achieve its own local optimal sensitivity when the diagnostic measure gets either larger or smaller. However, for the other two curves, the sensitivity drops below the false positive rate α when the diagnostic measure gets larger for β1(θ) (or smaller for β2(θ)). Thus, they are not decent predictors; see the sketch below for the three curves.
optimal curve in Example 3.7 after a discussion on the following theorem,
which sets the connection between the global optimal solution (a UMEP level
1 − α predictor) and a restricted optimal solution. The latter is the UMEDP
level 1−α predictor for this type of diagnostic measures. Theorem 3.7 shows a
way to identifying a UMEDP by considering UMEP in the boundary between
the diseased and the healthy populations.

Theorem 3.7 For a model characterized by a parameter θ, assume that the


corresponding sensitivity function of every predictor is continuous. For level
1−α diagnostic predictors on the diseased and healthy populations, a predictor
φ is a UMEDP (uniformly most efficient and decent predictor) if it is a UMEP
among all predictors satisfying the following equation (3.5) on the diagnostic
boundary,

βφ (θ) = α, (3.5)

where θ ∈ ω, and ω is the set of the diagnostic boundary between diseased and
healthy subjects.

For example, when the range of healthy bun to creatinine ratio is from 10:1
to 20:1, the diagnostic boundary is formed by two values {10, 20}.
Proof The class of predictors satisfying (3.5) contains the set of decent predictors; hence the sensitivity of the UMEP φ0 is at least as high as the sensitivity of any decent predictor with the same specificity. On the other hand, the UMEP predictor φ0 is itself decent. This is because it is uniformly at least as sensitive as the predictor φ(x) ≡ α, a special predictor that claims a disease case by flipping a biased coin: when the outcome is a head after flipping the coin with

$$P(\text{Head}) = \alpha,$$

φ(x) claims disease.

When additional information is available, we generally expect to have a


more accurate prediction for the optimizing problem of interest. For instance,
when the underlying distribution of diagnostic measures can be assumed to
be in the exponential family, the restricted optimal solution discussed in the
illustrating UMEDP example can be explicitly constructed according to the
following theorem.

Theorem 3.8 Restricted optimization for UMEDP: Assume that the underlying model for a set of data X can be expressed explicitly as an exponential family, with the joint density characterized by a parameter θ:

$$f(x) = c(\theta)e^{\theta T(x)}h(x), \qquad c(\theta) > 0, \qquad (3.6)$$

for the prediction problem with two-ended diagnostic measures. If there exist two constants λ1 and λ2 such that

$$P(X \in \{T > \lambda_1\} \cup \{T < \lambda_2\}\,|\,healthy) = \alpha,$$

so that the specificity is 1 − α, then the UMEDP is the diagnostic predictor shaped by the region

$$D = \{T > \lambda_1\} \cup \{T < \lambda_2\}.$$

Proof: Consider the area A = {f(x) > λg(x)}, where f(x) is the likelihood function for the diseased population and g(x) is that for the healthy population. For any predictor B with specificity 1 − α, we have

$$(I_B - I_A)(f(x) - \lambda g(x)) \le 0,$$

where I_A is the indicator function of the set A. Now, since the underlying model is assumed to be in the form of (3.6), we have

$$\int (I_B - I_A)(f(x) - \lambda g(x))\,dx \le 0,$$

which implies

$$\int I_B f(x)\,dx - \int I_A f(x)\,dx \le \lambda\Big(\int I_B g(x)\,dx - \int I_A g(x)\,dx\Big) \le 0,$$

which means that the predictor corresponding to the indicator function of the region A is a uniformly most sensitive predictor.
Now
f (x) c(θ)eθT h(x)
= = d(θ, θ0 )e(θ−θ0 )T
g(x) c(θ0 )eθ0 T h(x)
Notice that

SetA ⇐⇒ {x : (θ − θ0 )T (x) > c∗∗ }


⇐⇒ {T (x) > λ∗∗∗ } {T (x) > λ∗∗ }

we have
P ({T (x) > λ∗∗∗ } {T (x) > λ∗∗ }|θ0 ) = α.
In conjunction with Theorem 3.7, the condition on (3.5) is satisfied, thus, the
diagnostic predictor {T (x) > λ∗∗∗ } {T (x) > λ∗∗ } is UMEDP.
The above theorem identifies the general shape of the restricted optimization problem for the uniformly most efficient decent predictor. However, even in this particular example, the explicit diagnostic predictor is still undetermined. In this context, we have the following theorem that provides a specific way to construct the restricted optimizing diagnostic predictor, the UMEDP.

Theorem 3.9 Assume that the underlying model of a set of data X can be
expressed explicitly as an exponential family with the joint likelihood charac-
terized by a parameter θ as in (3.6),

f (X) = c(θ)e^{θT (X)} h(X), c(θ) > 0.

For the two-ended prediction/classification problem on diseased and healthy


populations, the restricted optimizing solution (UMEDP) can be determined
via the following two conditions:

Eθ0 (φ(X)) = α,   (3.7)

Eθ0 (T (X)φ(X)) = αEθ0 (T (X)).   (3.8)

Proof: The sensitivity function is

β(θ) = ∫_{R^n} φ(X)c(θ)e^{θT (X)} h(X)dX.

Taking the derivative with respect to θ and evaluating at θ0 yields

β′(θ0 ) = ∫_{R^n} φ(X)c′(θ0 )e^{θ0 T (X)} h(X)dX + ∫_{R^n} φ(X)c(θ0 )T (X)e^{θ0 T (X)} h(X)dX
        = (c′(θ0 )/c(θ0 )) Eθ0 (φ(X)) + Eθ0 (φ(X)T (X))
        = 0.

The derivative is 0 at θ0 because the sensitivity of any decent predictor is minimized at θ0 (the diagnostic measure for the healthy population) by definition.
Since the above equation is true for any decent predictor, we consider

φ(X) ≡ α,

this prediction function also defines a decent predictor: claiming diseased with
probability α regardless of the sample. Now, we have:
Eθ0 (T (X)) = −c′(θ0 )/c(θ0 ).

Since any decent predictor at level 1 − α also satisfies the first equality (3.7), Eθ0 (φ(X)) = α, substituting back into β′(θ0 ) = 0 gives

Eθ0 (φ(X)T (X)) − αEθ0 (T (X)) = 0.
This verifies the second equality in the theorem.


When the underlying model of the data set is a symmetric distribution,
such as the normal model or Laplace model, the construction of the UMEDP
restricted optimal solution can be simplified in the following theorem.

Theorem 3.10 Following conditions of Theorem 3.9, when the underlying
model of the data is symmetric about a constant r, the restricted optimal so-
lution for the two-ended diagnostic prediction problem can be constructed by
setting λ1 = 2r−λ2 , where the values λ1 and λ2 are as defined in Theorem 3.8.

Proof: Denote by g(t|θ) the underlying model of the data. By (3.8), we have

Eθ0 (T φ) = αEθ0 (T ).   (3.9)

Since g(t|θ) is symmetric about r, we have Eθ0 (T ) = r, so (3.9) becomes Eθ0 (T φ) = αr. Since α = Eθ0 (φ), this is

Eθ0 ((T − r)φ) = 0.

By Theorem 3.9, the UMEDP takes the form of claiming disease when the diagnostic measure T is either too large or too small; denote

φ = 1 if T > λ1 or T < λ2 ; 0 otherwise.

We have

∫_{−∞}^{λ2} (t − r)g(t|θ0 )dt + ∫_{λ1}^{+∞} (t − r)g(t|θ0 )dt = 0.

By the symmetry g(t|θ0 ) = g(2r − t|θ0 ), substituting y = 2r − t in the second integral (so that t − r = r − y and dt = −dy), we have

∫_{−∞}^{λ2} (t − r)g(t|θ0 )dt − ∫_{−∞}^{2r−λ1} (y − r)g(y|θ0 )dy = 0.

Therefore

∫_{−∞}^{λ2} (t − r)g(t|θ0 )dt = ∫_{−∞}^{2r−λ1} (y − r)g(y|θ0 )dy,

and λ2 = 2r − λ1 . This concludes the proof of the theorem.

In what follows in this section, we shall provide two examples to illus-
trate how to perform the restricted optimization to construct the efficient
and decent predictor/classifier, when the underlying model can be plausibly
assumed.

Example 3.7 Assume that the distribution of the BUN-to-creatinine ratio follows a normal model with a given standard deviation σ and healthy mean ratio μ0 . In this case, with the notation defined in Theorem 3.10, λ1 = z_{α/2} .

Solution: Notice that

Eμ0 (T ) = Eμ0 ((X̄ − μ0 )/(σ/√n)) = 0,

so r = 0, which implies that

λ2 = −λ1 .

We also have

P (Z > λ1 ) + P (Z < λ2 ) = P (Z > λ1 ) + P (Z < −λ1 ) = 2P (Z > λ1 );

setting this false positive probability equal to α gives

λ1 = z_{α/2} .
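As a quick numerical check (our own illustration, not part of the original derivation), the following Python sketch, with hypothetical values α = 0.05, n = 25, σ = 1, and μ0 = 15, verifies that the cutoffs λ1 = z_{α/2} and λ2 = −λ1 give a false positive rate of exactly α at the healthy boundary and a sensitivity above α away from it, as a decent predictor requires.

import numpy as np
from scipy.stats import norm

alpha, n, sigma, mu0 = 0.05, 25, 1.0, 15.0   # hypothetical values
lam1 = norm.ppf(1 - alpha / 2)               # lambda_1 = z_{alpha/2}
lam2 = -lam1                                 # lambda_2 = 2r - lambda_1 with r = 0

def sensitivity(mu):
    # P(T > lam1 or T < lam2) when T = (Xbar - mu0)/(sigma/sqrt(n))
    shift = (mu - mu0) / (sigma / np.sqrt(n))
    return norm.sf(lam1 - shift) + norm.cdf(lam2 - shift)

print(sensitivity(mu0))        # equals alpha at the healthy boundary
print(sensitivity(mu0 + 0.5))  # exceeds alpha: the predictor is decent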
This example offers a theoretical justification for the existence of the UMEDP (the dotted curve) in Figure 3.3. Notice that Example 3.7 is
grounded on the assumption that the stability of the diagnostic measure (σ)
is a given value. In practice, the measurement stability score is an unknown
value. As a follow-up discussion, we shall describe a method pertaining to
the prediction of measurement stability scores. In particular, we discuss an
example predicting the risk level of an investment portfolio.

Example 3.8 Let X = (X1 , . . . , Xn ) be a random sample of stock returns of an investment portfolio. For convenience, assume that in a short period of time with market equilibrium, the data follow a normal population with mean zero and unknown risk index σ², where σ² reflects the measurement stability of the portfolio. We are interested in finding the restricted optimal solution in the setting for the UMEDP in predicting σ = σ0 (the investment risk is at an envisaged level) versus σ ≠ σ0 (the envisaged level σ0 either under-predicts or over-predicts the true but unknown investment risk σ). The requirement is that the predictor is a decent predictor with specificity at level 1 − α.

Solution: The likelihood

L = (1/(√(2π)σ))^n e^{−Σ_{i=1}^{n} Xi²/(2σ²)}

belongs to the exponential family, so T (X) = Σ_{i=1}^{n} Xi². By Theorem 3.8, the decent predictor is given by

φ = 1 if T > λ1 or T < λ2 ; 0 otherwise.

If the investment risk is really at the σ0 level,

Pσ0 (T < λ2 or T > λ1 ) = α,

and

Pσ0 (λ2 < T < λ1 ) = 1 − α.

Now, let Y = T /σ0²; the distribution of Y follows the χ²_n model under the assumption that the investment risk is σ0 , and the predictor becomes

φ = 1 if Y > d2 or Y < d1 ; 0 otherwise.

By Theorem 3.9,

Eσ0 (φY ) = ∫_{{y<d1 }∪{y>d2 }} y fY (y)dy = αEσ0 (Y ) = nα,

and

Eσ0 (φ) = α.

Thus, the two conditions for the UMEDP are:

∫_{d1}^{d2} (1/(2^{n/2} Γ(n/2))) y^{n/2} e^{−y/2} dy = n(1 − α),   (3.10)

∫_{d1}^{d2} (1/(2^{n/2} Γ(n/2))) y^{n/2−1} e^{−y/2} dy = 1 − α.   (3.11)
2

By (3.10), since e^{−y/2} dy = −2 d(e^{−y/2} ), we have

∫_{d1}^{d2} (−2/(2^{n/2} Γ(n/2))) y^{n/2} d(e^{−y/2} ) = n(1 − α),

which, after integration by parts, is equivalent to

(−2/(2^{n/2} Γ(n/2))) y^{n/2} e^{−y/2} |_{d1}^{d2} + n ∫_{d1}^{d2} (1/(2^{n/2} Γ(n/2))) y^{n/2−1} e^{−y/2} dy = n(1 − α).

By (3.11), the integral in the second term equals 1 − α, so

(−2/(2^{n/2} Γ(n/2))) y^{n/2} e^{−y/2} |_{d1}^{d2} + n(1 − α) = n(1 − α).

So d1 and d2 satisfy the following equation,

(−2)d2^{n/2} e^{−d2 /2} /(2^{n/2} Γ(n/2)) = (−2)d1^{n/2} e^{−d1 /2} /(2^{n/2} Γ(n/2)) ⇒ d1^{n/2} e^{−d1 /2} = d2^{n/2} e^{−d2 /2} .
Therefore, the UME decent predictor claims the diseased population on the region

{X : Σ Xi²/σ0² < d1 } ∪ {X : Σ Xi²/σ0² > d2 },

where the cutoff values d1 and d2 satisfy

P (χ²_n < d1 ) + P (χ²_n > d2 ) = α,

and

d1^{n/2} e^{−d1 /2} = d2^{n/2} e^{−d2 /2} .
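For readers who want to compute the cutoffs, the following Python sketch (our own illustration, with hypothetical α = 0.05 and n = 10) solves the two UMEDP conditions numerically; note that the solution differs from the equal-tailed chi-square cutoffs.

import numpy as np
from scipy.stats import chi2
from scipy.optimize import fsolve

alpha, n = 0.05, 10   # hypothetical values

def conditions(d):
    d1, d2 = d
    size = chi2.cdf(d1, n) + chi2.sf(d2, n) - alpha        # specificity condition
    shape = (n / 2) * np.log(d1) - d1 / 2 \
            - ((n / 2) * np.log(d2) - d2 / 2)              # d1^{n/2}e^{-d1/2} = d2^{n/2}e^{-d2/2}
    return [size, shape]

# start from the equal-tailed cutoffs as an initial guess
d1, d2 = fsolve(conditions, x0=[chi2.ppf(alpha / 2, n), chi2.ppf(1 - alpha / 2, n)])
print(d1, d2)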

This section focuses on the method of restricted optimization (the most ef-
ficient and decent diagnostic predictor) when the global optimal solution does
not exist. It shows that by adding an additional condition (a decent predictor
in the way that the probability of correct diagnosis exceeds the probability of
false positive), we can restrict the optimization domain on predictors satisfy-
ing certain conditions (controlling the specificity at level 1 − α), and find a
restricted optimal solution (uniformly most efficient decent diagnostic predic-
tor, UMEDP). The next section will follow up with optimization in the presence of nuisance parameters.

3.4 UMEDP with confounding factors


When the global optimization does not exist due to the condition that the
diagnostic measure has two-ended extremes related to the diseases, we use
restricted optimization (such as the uniformly most efficient and decent pre-
dictors, UMEDP). Yet, the two-ended diagnostic measure is not the only cause
for the nonexistence of UMEP (in the sense of achieving the highest sensitivity
for a given specificity). In this section, we shall discuss a scenario in which nui-
sance parameters or confounding factors also result in nonexistence of UMEP.
The concept of restricted optimization, such as the decent predictor defined in Definition 3.5, will be applied again to seek a local optimal solution.
In model-based analytics, when an assumed model involves various param-
eters, some of the parameters are of primary interest for data analysis, while
others are not of immediate interest (but still play critical roles in the as-
sumed model). For instance, in the optimization process with restrictions for
UMEDP, when predicting the mean vector, elements in the covariance matrix
(which measure the variation and inter-correlations among the variables) are
regarded as nuisance parameters. The restricted optimization process involv-
ing nuisance parameters is another critical issue in statistical prediction.
We start with a simple example of inference about the population mean when the population variation cannot be plausibly assumed. The pop-
ulation variation serves as a nuisance parameter for the prediction problem
in this case. The example shows how a confounding nuisance factor alters the
existence of the global optimal solution (UMEP).

Example 3.9 Consider the uniformly most efficient diagnostic predictor in
Example 3.5, where we assume that the variation of the LDL-Cholesterol read-
ing is 1. In fact, the variation of the LDL-Cholesterol reading is unknown in
practice. As discussed in Example 3.5, for a given variation index σ0 , the
UMEP with specificity 1 − α reads

{X : (X̄ − 130)/(σ0 /√n) > Zα }.
However, the actual standard deviation may not be σ0 .
If the true but unknown standard deviation doubles the assumed value, say, σ = 2σ0 , the probability of a false positive for the predictor calibrated at specificity 1 − α becomes

Pμ0 ({X : (X̄ − μ0 )/(σ0 /√n) > Zα }) = Pμ0 ((X̄ − μ0 )/(σ0 /√n) > Zα )
                                     = Pμ0 ((X̄ − μ0 )/(2σ0 /√n) > Zα /2)
                                     = P (Z > Zα /2)
                                     > α.

In this case, when the probability of a false positive is larger than α, the corresponding specificity becomes less than 1 − α; hence the UMEP with an incorrectly assumed standard deviation is not a diagnostic predictor at the nominal specificity level 1 − α for the prediction problem.
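A two-line numerical check (ours, with α = 0.05) makes the inflation concrete: the realized false positive rate is about 0.21 rather than the nominal 0.05.

from scipy.stats import norm
alpha = 0.05
z = norm.ppf(1 - alpha)    # Z_alpha
print(norm.sf(z / 2))      # P(Z > Z_alpha/2), roughly 0.21 > alpha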

This example shows that the UMEP does not exist in the process of pre-
dicting μ, when σ is unknown. In this case, the unknown standard deviation
σ serves as a nuisance parameter. We shall now introduce a theorem that can
be viewed as an example of restricted optimization under the presence of nui-
sance parameters. The method is similar to the discussion on the example of
two-ended diagnostic measures in the preceding section.

Theorem 3.11 Assume that the underlying model of the data set X follows an exponential family

f (X) = h(x) exp{θU (x) + Σ_{i=1}^{k} vi Ti (x) + c(θ, v)},

where θ is the parameter of interest and v is a vector of nuisance parameters. Consider a diagnostic prediction problem in which subjects are classified as healthy when θ < θ0 and as diseased when θ > θ0 . Further, assume that W (U, T) is a monotone function in U for each value of the vector of statistics T, T = (T1 , . . . , Tk ). If W (U, T) and T are independent on the boundary θ0 , then the diagnostic predictor

φ(W ) = 1 if W > Cα ; γα if W = Cα ; 0 if W < Cα

is the uniformly most efficient decent predictor, where the constants Cα and γα are determined such that the specificity of φ(W ) is 1 − α.

Proof: For any fixed T = (T1 (x), . . . , Tk (x)), Pθ (x) has the monotone likelihood ratio property in terms of U ; by the adapted Karlin-Rubin theorem, the UMEP reads

φ∗ (U |T) = 1 if U > ξα (T); γα (T) if U = ξα (T); 0 if U < ξα (T),

where ξα (T) and γα (T) are determined by

Eθ0 ,v (φ∗ (U |T)|T = t) = α

for all t = (t1 , . . . , tk ). Now, if W is a monotone function of U for each T (without loss of generality, assume that it is an increasing function), then

φ∗∗ (W |T) = 1 if W > ξα′ (T); γα′ (T) if W = ξα′ (T); 0 if W < ξα′ (T)

defines the same UMEP, and it satisfies

Eθ0 (φ∗∗ |T) = Pθ0 (W > ξα′ (T)|T) + γα′ (T)Pθ0 (W = ξα′ (T)|T) = α.

Now, since the variables W and T are independent at the boundary θ0 , there exist constants γα and Cα that are not functions of T, so that

Eθ0 (φ∗∗∗ (W )) = Pθ0 (W (U, T) > Cα ) + γα P (W = Cα ) = α.

Thus, the predictor

φ∗∗∗ (W ) = 1 if W > Cα ; γα if W = Cα ; 0 if W < Cα

is the UMEP among all diagnostic predictors satisfying

Eθ0 (φ∗∗∗ ) = α

for any vector of nuisance components v. This shows that the diagnostic predictor φ∗∗∗ is an optimal solution restricted to the set of decent predictors.

We can now use Theorem 3.11 to find the uniformly most efficient decent
diagnostic predictor for the LDL-Cholesterol prediction in Example 3.5, when
the standard deviation is allowed to change within its permissible domain.

Example 3.10 Restricted optimization with nuisance parameters.
Consider a set of observations X1 , . . . , Xn ∼ N (μ, σ²) where σ is unknown. Assume that the cutoff threshold for the diagnostic predictor is μ0 (such as 130 mg/dL for the LDL-cholesterol readings). The restricted optimal solution for predicting the diseased population on the basis of the diagnostic measures becomes

φ = 1 if T > tn−1 (α); 0 otherwise,

where

T = √n (X̄ − μ0 )/s,

X̄ is the sample mean reading and s is the sample standard deviation of the LDL-cholesterol readings.

Solution: By Theorem 3.11, we need to rewrite the likelihood function as follows:

Pθ (X) = (1/(√(2π)σ))^n e^{−(1/(2σ²))(Σ_{i=1}^{n} Xi² − 2μ Σ_{i=1}^{n} Xi + nμ²)} .

Let U = X̄ − μ0 and T = Σ_{i=1}^{n} (Xi − μ0 )²; for μ = μ0 , T is minimal sufficient for σ². Consider

W = √n U / [(T − nU²)/(n − 1)]^{1/2} ;

at μ = μ0 , W ∼ tn−1 , and W is ancillary for σ². Also, T ∼ σ²χ²_n at μ = μ0 ; by Basu's theorem, W and T are independent at μ = μ0 .
Now, notice that

T − nU² = Σ_{i=1}^{n} (Xi − μ0 )² − n(X̄ − μ0 )² = Σ_{i=1}^{n} (Xi − X̄)² = (n − 1)s² > 0,

so we can decompose W in the following way when U > 0:

log(W ) = log √n + log(U ) − (1/2)[log(T − nU²) − log(n − 1)].

Furthermore,

∂ log(W )/∂U = 1/U − (1/2) × (−2nU )/(T − nU²) = 1/U + nU/(T − nU²) > 0,

so W increases as U increases when U > 0 for each T . When U < 0, letting V = −U and applying the above arguments yields the conclusion that W is a monotone function of U in this example.
By Theorem 3.11, φ is the restricted optimization solution (UMEDP) for the prediction problem specified in the example.
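As a minimal sketch of the predictor in Example 3.10 (the data are simulated and the threshold μ0 = 130 is taken from the example), the following Python code computes the Student-t statistic and the diagnostic decision.

import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(0)
x = rng.normal(135, 20, size=30)    # simulated LDL-cholesterol readings
mu0, alpha = 130.0, 0.05
T = np.sqrt(len(x)) * (x.mean() - mu0) / x.std(ddof=1)
phi = int(T > t_dist.ppf(1 - alpha, df=len(x) - 1))
print(T, phi)                        # phi = 1 claims a disease case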

According to the above example, a diagnostic predictor based on the usual
Student t statistic is the best diagnostic predictor in the sense that it controls
the specificity at 1 − α level, maximizes the sensitivity, and satisfies the condi-
tion of decent predictors. In the context of hypothesis testing, this is parallel to
the uniformly most powerful unbiased test in testing the mean with standard
deviation as the nuisance parameter.
In the next section, we shall discuss another fundamental principle in re-
stricted optimization, the invariant principle. It keeps the predicted result
consistent and invariant when the same experimental subject is measured by
different scales of the diagnostic measurement.

3.5 Efficient and invariant diagnostic predictors


The source of machine learning is data, and data comes from the observations
and measurements of experimental subjects that we are interested in. With
the same object, different measurement units (or scales) often make the data
look different. One frequently asked question is the consistency of the predic-
tion results when different scales or units are used in measuring the subjects
in an experiment. It is necessary to have a prediction method that remains
consistent for various measurements on the same object. For example, con-
sider the comparison of body heights between college students and elementary
school students. If a prediction method claims significant mean difference with
heights measured in cm, we would expect a similar claim when the same ob-
jects are measured by m, because measuring with the scale of cm or m should
not alter the fact that on average, college students are taller than elementary
school students.

3.5.1 Invariant principle in data transformation


There is an excellent resource discussing this topic in the literature. We start
with a few basic concepts on model invariant as defined in [81].

Definition 3.6 Model invariant: Let g be a 1-to-1 transformation from the
sample space to itself. For a set of data X, if the transformed data g(X) follows
the same model as the model of the original data, the underlying model is called
model invariant under the data transformation g.

A mathematical definition according to [81] can be formulated as follows.


Let X be a random variable taking values in a sample space according to a
probability model from the family P = {Pθ , θ ∈ Ω}. For a one-to-one trans-
formation g from the sample space into itself. If, for each θ, the distribution of
X  = g(X) is a member of P , say Pθ , and ḡ(Ω) = Ω (as θ travels through Ω,

so does θ ), then the probability model P is invariant under transformation g.
It is relevant at this point to consider a class of permissible transformations (a special case of data transformation) defined below. Because we are considering restricted optimization in this chapter, to achieve the local optimal solution for the UMEP we shall define the domain of restriction on the set of transformations as below.

Definition 3.7 Group: A set of elements G is called a group if it satisfies the following four conditions:

1 There is an operation defined for elements in G, group multiplication. Namely, for any two elements a, b ∈ G, there exists an element c ∈ G such that c = ab, where the element c is called the product of a and b and is denoted ab.

2 Group multiplication obeys the associative law. Namely, for any three elements in G, (ab)c = a(bc).

3 There is an element e in G called the identity, such that ae = ea = a for all a ∈ G.

4 For each element a in G, there exists an a−1 ∈ G (its inverse in G), such that aa−1 = a−1 a = e.

In the definition above, both the inverse a−1 of any element a ∈ G and the
identity element e ∈ G can be shown to be unique.
When we consider transformations on a data set, it is helpful to utilize the concept of a transformation group.

Definition 3.8 Transformation group: A set of transformations {g : g ∈ G}
from the sample space S to S is called a group of transformations in S if it satisfies the following conditions.

1 Inverse transformation is self-contained in the set; namely, for every g ∈ G, there is a g′ ∈ G such that g′ (g(x)) = x for all x ∈ S.

2 Composition is self-contained in the set; namely, for every g and g′ ∈ G, there exists a g′′ ∈ G such that g′ (g(x)) = g′′ (x).
3 Identity transformation is self-contained in the set, namely e(x) defined by
e(x) = x is in set G.

Example 3.11 A class G of transformations is a transformation group if it
is closed under both composition and inversion.

Solution: It is straightforward to verify that such a class of transformations is, in fact, a group. In particular, note that the identity transformation e(x) ≡ x is a member of any transformation group G, since g ∈ G implies g −1 ∈ G and hence, by closure under composition, gg −1 ∈ G, and by definition gg −1 is the identity. Note also that the inverse (g −1 )−1 of g −1 is g, so that g −1 g is also the identity.

Example 3.12 The following are two groups of data transformations for two
common distribution families.
1 G = {X, n − X} for data following the binomial family Bin(n, p).

2 G = {X − a, a ∈ R} for data following the normal distribution family N (μ, σ²).

Following the concept of transformation group, we can now extend the
concept of model invariant on one transformation into model invariant for a
group of transformations.

Definition 3.9 Group invariant: If the underlying model of a set of data is
invariant for every element of a class of transformations C, then the model is
invariant under this set of data transformations, C.
Efficient and invariant diagnostic predictors 89

We use the following example to clarify the above concepts in a heuristic
way for data transformations. Further details with advanced mathematical
treatments can be found in [81].
One of the interesting applications of the definition above is the property that the Student-t test is invariant under any location and scale transformation. For example, if the sample mean is 1.72 m with a sample standard deviation of 0.5 m for a random sample of 25 male college students, and correspondingly 1.64 m (sample mean) with 0.4 m (sample standard deviation) for 25 female students, then, assuming that the population variations are the same for the two populations, the difference t-score is

(1.72 − 1.64)/√((24 × 0.5² + 24 × 0.4²)/48),

which is the same as when the measurements use cm and every subject is boosted by 10 cm (the common shift cancels in the difference of the sample means):

(172 − 164)/√((24 × 50² + 24 × 40²)/48).

In fact, the above numerical example is a special case of the following invariant principle.
Example 3.13 Invariant principle: The Student-t based diagnostic predictor
is invariant under location and scale transformations.
Solution: Let X = (X1 , . . . , Xn ) be the data set for analysis; the Student-t based diagnostic predictor is

t(X) = (X̄ − E(X1 ))/(sX /√n).

Now the location-and-scale transformed data become

Y = a1 + bX, b > 0,

with expected value E(Y) = a1 + bE(X). The variance of the transformed data reads

s²_Y = (1/(n − 1)) Σ_{i=1}^{n} (Yi − Ȳ )²
     = (1/(n − 1)) Σ_{i=1}^{n} (a + bXi − a − bX̄)²
     = (b²/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²
     = b² s²_X ,

so sY = bsX . Even with a location transformation, the sample standard deviation of the transformed data is the scale transformation of the original sample standard deviation. Therefore the corresponding t-score becomes

t(Y) = (a + bX̄ − a − bE(X1 ))/(bsX /√n) = (X̄ − E(X1 ))/(sX /√n) = t(X).
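A short numerical illustration (ours; the constants a and b are arbitrary) confirms the invariance: the two t-scores below agree up to floating-point error.

import numpy as np

def t_score(x, mean0):
    return (x.mean() - mean0) / (x.std(ddof=1) / np.sqrt(len(x)))

rng = np.random.default_rng(1)
x = rng.normal(1.72, 0.5, size=25)        # heights in meters
a, b = 10.0, 100.0                        # a location-scale transformation
print(t_score(x, 1.72), t_score(a + b * x, a + b * 1.72))   # identical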
It should be noted that invariance is defined with respect to the transformation group G, which can be used to characterize invariant transformations.
Definition 3.10 Maximal invariant: A function is a maximal invariant if it is invariant under the transformation group G and identical images imply that the two data points are connected by a member of G.
The concept of maximal invariant can also be formulated as in [81]: A
function M is a maximal invariant if it is invariant with respect to G, and
M (x1 ) = M (x2 ) implies x2 = g(x1 ) for some g ∈ G. The concept of maximal
invariant can be used to characterize invariant diagnostic predictors as in the
following theorem.
Theorem 3.12 Assume that M (X) is a maximal invariant with respect to a
group of transformation G. The necessary and sufficient condition for a diag-
nostic predictor φ to be invariant under G is that φ depends on observations
X only through M (X).
Proof A proof of this theorem can be found in [81], where the test statistic plays the role of the diagnostic predictor.
Note: By the theorem above, an invariant predictor φ can be expressed as a function of the maximal invariant, φ(x) = h(M (x)), where h(t) = φ(M −1 (t)) gives an explicit form of the transforming function h. The following example illustrates the concept of maximal invariant for location transformations.
Example 3.14 Let X = (x1 , . . . , xn )T , and denote the location transformation group

G = {g : g(X) = (x1 + c, . . . , xn + c)T , c ∈ R}.

Let

M (X) = (x1 − xn , . . . , xn−1 − xn )T .

Then M (X) is maximal invariant under the location transformation group G.

Solution: First, M is invariant:

M (g(X)) = (x1 + c − (xn + c), . . . , xn−1 + c − (xn + c))T = M (X).

Next, if M (X) = M (X′ ), then

xi − xn = x′i − x′n for i = 1, . . . , n − 1.

Taking gc : x′i → x′i + c with c = xn − x′n , we have X = gc (X′ ), so M (X) is a maximal invariant.
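The maximal invariance is easy to check numerically; in the following sketch (with arbitrary data), M (X) is unchanged by any location shift, and the shift c is recoverable from the last coordinates.

import numpy as np

x = np.array([1.2, 0.7, 3.1, 2.4])
M = lambda v: v[:-1] - v[-1]     # M(X) = (x1 - xn, ..., x_{n-1} - xn)
print(M(x), M(x + 5.0))          # identical vectors
print((x + 5.0)[-1] - x[-1])     # recovers the shift c = 5.0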
Since data transformation occurs frequently in statistical inference
(namely, hypothesis testing and estimation), for the remainder of this section,
we shall focus on examining the UMEIDP, uniformly most efficient invariant
diagnostic predictor. Efficiency is defined in the sense that the sensitivity is
maximized for a given specificity.

3.5.2 Invariance and efficiency


When performing data transformations before prediction analysis, the first
step should be the verification of invariance, namely the prediction problem
is invariant for the corresponding data transformations. We start with the
following definition for invariant property in prediction analysis.
Notice that this section focuses on restricted optimization. As for predic-
tion analysis, the target function for optimization is the sensitivity of the di-
agnostic predictor, while the restraint is the set of invariant predictors. Once
we transform the data, the underlying model of the transformed data may
change accordingly, alongside the parameter space that defines the domain of
the parameter in the assumed model. Toward this end, we need to set the
scene for invariant prediction problems.

Definition 3.11 Invariant prediction problem: For a set of data with an assumed underlying model governed by a parameter θ, the prediction problem for the healthy population θ ∈ Ω0 versus the diseased population θ ∈ Ω1 is invariant under a transformation g if the correspondingly transformed parameter remains within the original space for healthy subjects and within the original space for diseased patients, respectively.

Briefly speaking, the invariant prediction problem is for the invariant of the
healthy range and diseased range of the diagnostic measure after data trans-
formation. Mathematically, as clearly described in [81] under the setting of
hypothesis testing: Let ḡ be the corresponding transformation of parameters.
The testing problem is invariant if ḡ preserves both Ω0 and Ω1 , ḡ(Ω0 ) = Ω0 ,
ḡ(Ω1 ) = Ω1 . In the following discussions involving invariant diagnostic predic-
tors in this book, we confine the discussion to invariant prediction problems.

Definition 3.12 Invariant predictor: For invariant prediction problems, a predictor φ(X) is invariant with respect to a transformation g if φ(g(X)) = φ(X). Namely, the prediction outcome remains unchanged after the data transformation.

With the setting above, the prediction question now becomes finding the restricted optimal solution among invariant predictors (that is, the uniformly most efficient invariant predictor).
We start with the location transformation for the discussion. Let X =
(X1 , ...Xn )T be an observation from a population with model f (x1 , ...xn ).
Assume that we are interested in predicting the Healthy population defined
as
fθ (x1 , ...xn ) = f0 (x1 − θ, ..., xn − θ)
versus the Diseased population defined as

fθ (x1 , ...xn ) = f1 (x1 − θ, ..., xn − θ),

for θ ∈ R. The goal of finding a UMEP invariant predictor is tantamount to finding the predictor that is uniformly most sensitive within the set of invariant predictors with respect to a transformation group G.
Note: the problem of predicting the diseased or healthy population here is invariant under the group G of location transformations

g(X) = (X1 + c, . . . , Xn + c), c ∈ R.

The corresponding transformation of the model parameter is ḡ(θ) = θ + c: for θ ∈ Θ0 ,

fθ (X) = f0 (x1 − θ, . . . , xn − θ), θ ∈ R,

and

fḡ(θ) (X) = f0 (x1 − θ − c, . . . , xn − θ − c) = f0 (x1 − c∗ , . . . , xn − c∗ ), c∗ ∈ R.

Thus, ḡ(Θ0 ) = Θ0 , and similarly ḡ(Θ1 ) = Θ1 .


Now, consider a maximal invariant under G:

S(X) = (X1 − Xn , . . . , Xn−1 − Xn )T ≡ (t1 , . . . , tn−1 )T ;

the distribution of S(X) is fS (t1 , . . . , tn−1 ). Let ti = xi − xn , i = 1, . . . , n − 1, and tn = xn ; the notation can be simplified as

T = (t1 , . . . , tn )T = (S, tn )T .


We have

fT (t1 , . . . , tn ) = fX (x1 (t), . . . , xn (t))|J| = fX (t1 + tn , . . . , tn−1 + tn , tn ), |J| = 1,

so

fS (t1 , . . . , tn−1 ) = ∫_R fT (t1 , . . . , tn )dtn = ∫_R fX (t1 + tn , . . . , tn−1 + tn , tn )dtn .

Now, for the sample X, we have

fS (t1 , . . . , tn−1 ) = ∫_R fX (x1 + (tn − xn ), . . . , xn−1 + (tn − xn ), tn )dtn ;

letting tn − xn = u, so that tn = xn + u, we have

fS (t1 , . . . , tn−1 ) = ∫_R fX (x1 + u, . . . , xn−1 + u, xn + u)du.

Given a set of data X, the invariant prediction problem on the diseased or healthy population under the transformation g(Xi ) = Xi + c has the maximal invariant S. For a given S, the associated prediction problem based on S becomes

Healthy : fS (t1 , . . . , tn−1 ) = f0 (t1 , . . . , tn−1 )
Diseased : fS (t1 , . . . , tn−1 ) = f1 (t1 , . . . , tn−1 ).

Note that since the prediction problem is free of θ in this scenario, it becomes a simple prediction problem as discussed in the previous sections. By the Neyman-Pearson lemma, the most sensitive predictor (among the maximal invariant predictors) claims the diseased population when

λ = [∫_R f1 (x1 + u, . . . , xn−1 + u, xn + u)du] / [∫_R f0 (x1 + u, . . . , xn−1 + u, xn + u)du] > c,   (3.12)

where c is chosen so that P (λ > c|H0 ) = α. The predictor

φ(X) = 1 if λ > c; 0 otherwise

is a UMEP invariant predictor under the assumed model for location transformations.

Following the preceding example, we now specifically consider diagnostic predictors on population variation under location transformations.
Example 3.15 Consider the classification of the healthy population characterized as σ = σ0 versus the diseased population characterized as σ = σ1 , σ1 > σ0 . Assume that the data X ∼ N (θ, σ²) where θ is unknown. We are interested in finding the UMEP invariant predictor for location transformations, which is tantamount to finding the restricted optimal solution when the target for maximization is the sensitivity function and the restriction is the set of location invariant diagnostic predictors with specificity 1 − α.

Solution: We may use the discussion on (3.12) as follows. First, consider the model densities under the healthy and diseased populations, respectively,

f0 (x1 , . . . , xn ) = (1/(√(2π)σ0 ))^n exp(−Σ_{i=1}^{n} (xi − θ)²/(2σ0²))

and

f1 (x1 , . . . , xn ) = (1/(√(2π)σ1 ))^n exp(−Σ_{i=1}^{n} (xi − θ)²/(2σ1²)).

Now, the UMEP location invariant diagnostic predictor reads

λ(X) > c ⟺ (σ0 /σ1 )^n [∫_R exp(−Σ(xi + u − θ)²/(2σ1²))du] / [∫_R exp(−Σ(xi + u − θ)²/(2σ0²))du] > c
⟺ [∫_R exp(−(Σxi² + 2(u − θ)Σxi + n(u − θ)²)/(2σ1²))du] / [∫_R exp(−(Σxi² + 2(u − θ)Σxi + n(u − θ)²)/(2σ0²))du] > c∗ ,

where c∗ = (σ1 /σ0 )^n c. Completing the square, Σxi² + 2(u − θ)Σxi + n(u − θ)² = (Σxi² − nx̄²) + n(u − θ + x̄)², so

λ(X) > c
⟺ [exp(−(Σxi² − nx̄²)/(2σ1²)) ∫_R exp(−n(u − θ + x̄)²/(2σ1²))du] / [exp(−(Σxi² − nx̄²)/(2σ0²)) ∫_R exp(−n(u − θ + x̄)²/(2σ0²))du] > c∗
⟺ exp(−(1/2)(1/σ1² − 1/σ0²)(Σxi² − nx̄²)) > c∗∗   (the two Gaussian integrals contribute only the constant factor σ1 /σ0 )
⟺ −(1/2)(1/σ1² − 1/σ0²)(Σxi² − nx̄²) > c∗∗∗
⟺ Σ_{i=1}^{n} (xi − x̄)² > c∗∗∗∗ .

Notice that

σ0 < σ1 ⟺ 1/σ0² > 1/σ1² ⟺ −(1/2)(1/σ1² − 1/σ0²) > 0,

so dividing by this positive coefficient preserves the direction of the inequality.
So the UMEIDP, the uniformly most efficient invariant diagnostic predictor, for the healthy population versus the diseased population is the one with diagnostic region

A = {X : Σ_{i=1}^{n} (Xi − X̄)² > c}

for some c, under the group of transformations {g(x) = x + c, c ∈ R}. We shall discuss how to determine the constant c as follows. For a diagnostic predictor with specificity 1 − α, we have

Pσ0 (Σ_{i=1}^{n} (Xi − X̄)² > c) = Pσ0 ((n − 1)S²/σ0² > c∗ ) = P (χ²_{n−1} > c∗ ) = α.

So the constant corresponding to the restricted optimal solution gives the diagnostic region with c∗ = χ²_{n−1} (α), where

P (χ²_{n−1} > χ²_{n−1} (α)) = α,

and c = σ0² χ²_{n−1} (α).
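A minimal sketch of this invariant predictor (our code, with hypothetical α, σ0 , and simulated data) applies the threshold c = σ0² χ²_{n−1}(α):

import numpy as np
from scipy.stats import chi2

alpha, sigma0, n = 0.05, 1.0, 20          # hypothetical values
c = sigma0**2 * chi2.ppf(1 - alpha, df=n - 1)

rng = np.random.default_rng(2)
x = rng.normal(3.0, 1.5, size=n)          # unknown location, true sigma = 1.5
print(np.sum((x - x.mean())**2) > c)      # True claims sigma > sigma0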


This example shows that the usual χ2 model is actually the model to deter-
mine the diagnostic threshold of a UMEIDP, uniformly most efficient invariant
diagnostic predictor, with location transformations on the data. Further the-
ory and techniques on similar topics under data transformations can be found
in one of the outstanding resources [81].

SUMMARY This chapter discusses the trade-off between sensitivity and
specificity. Sensitivity is the probability that a diagnostic criterion correctly
identifies the case. It is a measurement confined to diseased population or
equivalently a conditional probability given that the patient is indeed sick.
On the other hand, specificity is the probability that a diagnostic criterion
correctly detects healthy patients. It is a measurement restricted to a healthy
population.
For a numerical or digital diagnostic standard, when the cutoff value
(threshold) for the measurement (such as body temperature) is low, the rate
of misclassification is high. For instance, if a patient is classified as having a
fever when the body temperature is more than 96 degrees Fahrenheit, more
patients will be classified as having a fever. This of course includes misclassi-
fied patients. However, when the criterion is set too high, such as 110 degrees
Fahrenheit, more patients will be classified as being healthy, resulting in another escalation of the misclassification rate. Thus, finding the optimal threshold is crit-
ical in the prediction and diagnosis process. Increasing sensitivity is usually at
the cost of lowering the specificity, and vice versa. Under this scenario, we elu-
cidate the concept of uniformly most efficient diagnostic predictor (UMEP),
together with theory and procedures searching for the UMEP.
96 Sensitivity and Specificity Trade-off

In the case where the UMEP does not exist, we expanded the concept of the UMEP into the UMEDP (uniformly most efficient and decent diagnostic predictor). The latter augments the UMEP with a new concept, decent diagnostic predictors, a legitimate requirement that the rate of correct classification should not be lower than the rate of misclassification. Theory, practical procedures, and examples are discussed following the definition of the UMEDP.
Data transformation is very common in data science, especially in the pro-
cess of measurement unification for pooled datasets from multiple resources.
Consistent interpretation of insightful information related to sensitivity and specificity necessitates a discussion on the invariant property of the UMEP predic-
tor for transformed data. We conclude this chapter with a discussion on theory
and procedures regarding invariant UMEP predictors for linear functions in
data transformation.
4
Bias and Variation Trade-off

This chapter deals with theoretical and fundamental issues of bias versus vari-
ation in data science. There are different views on the content and definition
of data science. Some claimed that data science is discernible from statistics
due to its unique handling of big data and reliance on modern computer tech-
niques. References in this regard can be found in Bell et al. (2009) [5] and Dhar (2013) [43], among others. On the other hand, the current literature also includes claims that statistics itself is data science (see, for example, Breiman (1998) [11] and Wu (1986) [125], to list just a few). Although various defi-
nitions have their own rationales under different scenarios, in our view, the
essential process of data science is to make inference, to predict (or forecast)
the unknown using the known (observable data). Thus, without confining our-
selves into either direction, from the viewpoint of data analytic technologies,
algorithms, and methodological development, we go with the belief that data
science includes the model-based camp (mainly statistics) and the data-driven
camp (mainly computing techniques, machine learning, and deep learning al-
gorithms), as elucidated in Chapter One. In the process of predicting the
unknown, one frequently asked issue focuses on the dilemma regarding the
bias and variation of the predicted outcome: gaining lower bias at the cost of
high variation or trading prediction bias for lower prediction variation.

4.1 Reducible and Irreducible Errors in Prediction


Discussions on data science often include two fundamental and integrated
parts, existing statistical methodologies (such as estimation, hypothesis test-
ing, and prediction), as well as data-driven methods (such as neural networks
and deep learning). As introduced in Chapter One, in both camps of data
science, optimization is ubiquitous in the development of data analytic pro-
cedures. Optimization strategies cover from the minimization of the residual
network errors in deep learning to the selection of the uniformly most power-
ful test in hypothesis testing for model-based inference. However, the optimal
solution does not always exist for practical problems.
When the optimal solution does not exist, we need to reformulate the an-
alytical problem and seek local optimization. This necessitates a discussion
on optimization issues within a confined domain. At this point, it is conve-
nient to unify various analytical procedures into a general and fundamental
framework, which is the concept of restricted optimization as defined below.
Definition 4.1 Restricted Optimization: Let (Y, X) be a data set with the underlying relationship

Y = h(X) + ε.

Assume that the evaluation criterion is F, the set of all permissible functions for h(·) is Γ, and Δ is a subset of Γ. The process of finding ĥ such that

ĥ = arg inf_{h∈Δ} F(h), or ĥ = arg sup_{h∈Δ} F(h),

is a restricted optimization process, in which F is the target function and the set Δ is the restriction.
Under this definition, methods in hypothesis testing, estimation, and pre-
diction can be reformulated into the framework of restricted optimization.
For instance, finding the UMPU (UMP unbiased) test is a process in which we maximize the power of the test with restriction to unbiased tests. Also,
finding the best linear prediction can be viewed as a restricted optimization in
which we minimize the expected prediction error with a restriction to linear
functions of features involved in prediction.
It should be noted that the framework of restricted optimization includes machine learning, where the duality theorem in linear programming is used to identify a support vector machine. It also includes deep learning when convolution over a specified set is applied (to list just a few).
The key component in the definition of restricted optimization is the target
function F, which determines the trained model in the data validation stage.
It also affects the selection of the evaluation criterion at the stage where we
apply the testing data for prediction errors.
It is relevant at this point to clarify the concept of the expected prediction error for a given set of data. Recall the general practice in data science (where a set
of big data is available), we usually split the data into 75:25. Namely 75% of
the original data is treated as the training data, and the remaining 25% is the
testing data. During the training process, we want to minimize the expected
prediction error. The process of minimization in this setting is an optimization
process. After training, the trained model is then evaluated by the testing data
set to examine the possibility of over-fitting (incorrectly including patterns of
random effects contaminated in the training data). In this setting, the target
function is the expected prediction error (EPE), defined in Chapter One.
Let Ŷ be the prediction of Y . One of the commonly applied criteria for
prediction is the expected prediction error (EPE),
E[(Y − Ŷ )²] = E{[f (X) + ε − fˆ(X)]²}
             = E[f (X) − fˆ(X)]² + var(ε)   (4.1)
             = reducible error + irreducible error
FIGURE 4.1
Statistics versus Data Science

Obviously, the above expression suggests that EPE can be decomposed into
two portions. One accounts for the error between the predicted underlying
model and the true model, which is reducible when we have large enough
training data in conjunction with legitimate features for prediction in the
data. Another portion accounts for the error from the randomness of the data, var(ε), which is due to the intrinsic fluctuation of the data that we cannot influence in prediction, and is irreducible.
To further examine the evaluation of the expected prediction error, consider a scenario in which we only have finitely many distinguishable features and responses (yj , Xj ) for j = 1, . . . , p. Now, when we have a set of testing data (yi , Xi ) with i = 1, . . . , m, and fˆ(X), a model learned from the training data (yi , Xi ) for i = m + 1, . . . , n, the sample expected prediction error (of the testing data) reads

(1/m) Σ_{k=1}^{m} [yk − fˆ(Xk )]² = Σ_{j=1}^{p} (freq((yj , Xj ))/m) [yj − fˆ(Xj )]²,

where freq((yj , Xj )) counts the number of occurrences of the pair (yj , Xj ) in the testing data. As the amount of test data goes to infinity, m → ∞,

lim_{m→∞} freq(yj , Xj )/m = P [(Y, X) = (yj , Xj )].
Thus,

(1/m) Σ_{k=1}^{m} [yk − fˆ(Xk )]² → E[(Y − Ŷ )²],

the sample prediction error approaches the expected prediction error when
the sample size of the testing data is large enough. This shows that although
we need to reserve a good portion of data to train the model, we also need to
keep a good size of data for the testing data. If the size of the testing data is
too small, over-estimating or under-estimating the expected prediction error
may lead to a misleading conclusion on the performance of the trained model.
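A small simulation (ours, with an artificial true model f and trained model fˆ) illustrates the convergence: the sample prediction error approaches the EPE, here the reducible error E[(f − fˆ)²] ≈ 0.0033 plus the irreducible var(ε) = 1.

import numpy as np

rng = np.random.default_rng(3)
f = lambda x: 2.0 * x        # true model
fhat = lambda x: 1.9 * x     # a trained model with some reducible error

for m in (10, 1000, 100000):
    x = rng.uniform(0, 1, size=m)
    y = f(x) + rng.normal(0, 1, size=m)    # irreducible noise, var = 1
    print(m, np.mean((y - fhat(x))**2))    # approaches about 1.0033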
Assume that the testing set contains large enough observations. When the
solution to the global optimization is available, the reducible error in (4.1)
is minimized and the expected prediction error cannot be further reduced.
However, when the global optimization solution does not exist, the analytic
process stops. In what follows, we shall discuss details of the fundamental
concepts and terminologies used in statistical prediction and machine learning,
when the global optimization needs to be confined with restrictions.
We will start with examples of restricted optimization when the underlying
model is assumed (model-based inference), which is followed by a discussion on
the impact of nuisance parameters. After that, we will discuss restricted opti-
mizations for fundamental estimation issues in data transformation, including
the invariant property and location-scale transformations. For transformed
data, a representative topic on restricted optimization is the minimization of
the variance confined to unbiased estimators in model-based inference.
Certainly, it is impossible to exhaust all restricted optimization methods in
one chapter. To cover key issues in restricted optimization, we merely focus on
underpinning ideas and representative principles. Discussions in this chapter
may help clarify the premises of algorithms in data science (to avoid the
abuse of data analytic procedures), enhancing understanding of prediction
procedures, and facilitating interpretations of analytical outcomes.
Materials in the rest of this chapter also underpin rationales and theory
behind common statistical decisions. For example, the Student-t test can ac-
tually be viewed as an optimal test in terms of maximizing the power of the
test with restriction to unbiased tests, when nuisance parameters are involved.
Most of the results presented in this chapter are synthesized from theorems in
classical textbooks such as Lehmann and Romano [81], Lehmann and Casella
[80], Casella and Berger [16], Shao [110], and Hastie et al (2009) [56], among
others. The revisit in this chapter sheds new light on classical results for re-
stricted optimization in data science.
4.2 Minimum variance unbiased estimators


The preceding section focuses on optimization on power for hypothesis testing
and risk function for estimation with restriction on data transformations. In
this section, we shall discuss another strategy of restricted optimization, in
which the target function of optimization is to minimize the variance of the
estimator with the restriction on being an unbiased estimator. We start with
the definition.
Definition 4.2 Uniformly minimum variance unbiased estimator (UMVUE): For a set of data whose underlying model is characterized by an unknown parameter, consider estimators (functions of the data) of the unknown parameter within a permissible set. If an estimator (with finite second moment) is unbiased (the long-term average of the estimator hits the parameter), and it reaches the lowest possible variance among all the unbiased estimators uniformly over all permissible values of the parameter, such an estimator is called the uniformly minimum variance unbiased estimator, or UMVUE.
As documented in many advanced statistics textbooks (such as [80], [16],
among others), the concept of UMVUE can be described mathematically as
follows. Consider the estimation of a function of parameter g(θ) based on a
set of data X = (X1 , ......, Xn ) from a family of distribution Pθ , θ ∈ Ω, where
Ω is the parameter space. An unbiased estimator δ(X) of g(θ) is UMVUE if
for every θ ∈ Ω,
V arθ (δ(X)) ≤ V arθ (δ′ (X)),

where V ar denotes the variance of a random variable, and δ′ (X) is any other unbiased estimator of g(θ).
Based on the above definition, any estimator is an unbiased estimator of its own expectation. The following theorem serves as an illustrating example of the restricted
optimization process in the construction of a UMVUE. The feature of this
restricted optimization process is that the UMVUE of its own expectation
can be identified by a condition with the unbiased estimator of zero, without
minimizing the target function directly.
Theorem 4.1 Denote the underlying model of a set of data X by Pθ , where
θ ∈ Ω is the parameter characterizing the underlying model. Assume that δ is
an estimator that has finite second moment (so that it is meaningful to discuss
its variance),
Eθ (δ 2 ) < ∞.
A necessary and sufficient condition for δ to be a UMVUE of its expectation
is that
Eθ (δU ) = 0, (4.2)
for all U ∈ H and any θ ∈ Ω, where H is the set of all unbiased estimators of zero with finite second moment.
102 Bias and Variation Trade-off

Proof We start with the necessity part of the theorem. Assume that δ is
the UMVUE of its expectation and denote
Eθ (δ) = g(θ).
For any U ∈ H and θ ∈ Ω, and for an arbitrary real value λ, denote
δ  = δ + λU,
obviously, δ  is another unbiased estimator of g(θ). Consider
V arθ (δ + λU ) ≥ V arθ (δ)
for all λ. Expanding the left-hand side of the above inequality, we have

λ²V arθ (U ) + 2λCovθ (δ, U ) ≥ 0

for all λ. The left-hand side is a quadratic function of the real value λ with the two roots λ1 = 0 and

λ2 = −2Covθ (δ, U )/V arθ (U );

it would take negative values between the roots unless
Covθ (δ, U ) = 0,
which implies (4.2).
As for the sufficiency part of the theorem, suppose (4.2) is valid for all
U ∈ H. To show that δ is UMVUE of its expectation, let δ  be another unbi-
ased estimator of Eθ (δ). If V arθ (δ  ) = ∞, there is no need to prove since its
variance is larger than that of δ. So, we can assume V arθ (δ  ) < ∞. In this
case,
δ′ − δ ∈ H

because δ′ and δ are both unbiased estimators of E(δ). Furthermore, by (4.2),

Eθ [δ(δ′ − δ)] = 0,

hence

Eθ (δδ′ ) = Eθ (δ²).

Now, using the fact that δ′ and δ have the same expectation, we have

V arθ (δ) = Covθ (δ, δ′ ) ≤ √(V arθ (δ)V arθ (δ′ )),

thus,

V arθ (δ) ≤ V arθ (δ′ ).
The beauty of the above theorem is that it links a completeness property of the underlying family with the restricted optimal solution in terms of minimizing the variance (or the risk under the square loss function). For instance, suppose the distribution family is complete, meaning that for any statistic U , Eθ (U ) = 0 for all θ implies that U = 0 almost surely. The above theorem then points out an important fact in mathematical statistics: any unbiased estimator in a complete family is the UMVUE of its own expectation. We shall illustrate this point with the following example.
Example 4.1 Let X be a random variable with E(X²) < ∞. For a set of normal data X1 , . . . , Xn with underlying model N (0, σ²), find the UMVUE of the population standard deviation σ.

Solution: First, notice that the sample standard deviation

S = √((1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²)

is not unbiased for σ = √(V ar(X)). To see this point, considering the normality assumption, we have

Y = (n − 1)S²/σ² ∼ χ²_{n−1} ,

and

E(S) = E((σ/√(n − 1)) √Y ) = (σ/√(n − 1)) ∫ √y fY (y)dy = √(2/(n − 1)) (Γ(n/2)/Γ((n − 1)/2)) σ,

which is not σ. In general, we have

E(S²) − (E(S))² = V ar(S) ≥ 0,

so

E(S) ≤ √(E(S²)) = σ,

since

E(S²) = E((1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²) = σ².

For the model of the data in this example (the mean is known to be zero), redefine S² = Σ_{i=1}^{n} Xi², which is a complete statistic for σ, and S²/σ² ∼ χ²_n ; thus,

E(S^r /σ^r ) = 2^{r/2} Γ((n + r)/2)/Γ(n/2)

for any positive integer r. Therefore, by Theorem 4.1, the UMVUE of σ reads

σ̂ = (Γ(n/2)/(√2 Γ((n + 1)/2))) S.
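A Monte Carlo check (our sketch, with hypothetical n and σ) confirms the unbiasedness of σ̂ = (Γ(n/2)/(√2 Γ((n + 1)/2))) S with S = √(Σ Xi²):

import numpy as np
from scipy.special import gammaln

n, sigma = 5, 2.0
const = np.exp(gammaln(n / 2) - gammaln((n + 1) / 2)) / np.sqrt(2)
rng = np.random.default_rng(4)
x = rng.normal(0, sigma, size=(200000, n))
S = np.sqrt((x**2).sum(axis=1))
print(np.mean(const * S))    # close to sigma = 2.0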

This example shows that the restricted optimal solution for estimating the standard deviation that governs the model behind a set of data is not the sample standard deviation itself. Although the sample variance is an unbiased estimator of the population variance, the best unbiased estimator of the population standard deviation is the sample standard deviation multiplied by a non-unit constant.
104 Bias and Variation Trade-off

We shall now discuss another example related to the application of Theorem 4.1 when the underlying model of the data is a discrete model. Another related issue in this regard is the existence of the UMVUE for a function of the unknown parameter. The following example shows how to use Theorem 4.1 to determine whether the UMVUE of a parameter exists, and how to obtain a UMVUE for discrete data.

Example 4.2 Let X take on the values −1, 0, 1, 2, 3 with probabilities P (X = −1) = 2pq and P (X = k) = p^k q^{3−k} for k = 0, 1, 2, 3, where 0 < p < 1, q = 1 − p.
(a) Verify that P (X = k) given above is a probability model.
(b) Determine whether the UMVUE of p exists, given one observation X. How about the UMVUE of pq?

Solution: As for part (a), notice that

Σ_{k=−1}^{3} P (X = k) = 2pq + q³ + pq² + p²q + p³
                       = 2pq + q²(q + p) + p²(q + p)
                       = 2pq + q² + p²
                       = (p + q)²
                       = 1.
For part (b), according to Theorem 4.1, we need to characterize the unbiased estimators δ(X) of zero, E[δ(X)] = 0, and check whether there is an unbiased estimator T (X) of p such that E[δ(X)T (X)] = 0 for all such δ(X). If such an unbiased estimator exists, it is the UMVUE; otherwise, the UMVUE does not exist. Toward this end, for any unbiased estimator δ of the value 0 in the setting of this problem, we have

E[δ(X)] = 2pq δ(−1) + q³ δ(0) + pq² δ(1) + p²q δ(2) + p³ δ(3) = 0.

This is a polynomial of p, and we want the equality to hold for every p. Setting every coefficient to 0, we have

δ(0) = 0, δ(2) = δ(1) = −2δ(−1), δ(3) = 0,

which can be written as

δ(−1) = a, δ(2) = δ(1) = −2a,

for some constant a ∈ R. Similarly, for any unbiased estimator T (X) of p with E(T (X)) = p, we have

T (0) = 0, T (3) = 1, and T (1) + 2T (−1) = 1.

Now, using δ(0) = δ(3) = 0 and δ(−1) = a, δ(1) = δ(2) = −2a,

E[T (X)δ(X)] = 2apq[T (−1) − T (1) + p(T (1) − T (2))],

which equals 0 for every a and every p ∈ (0, 1) if and only if

T (−1) = T (1) = T (2) = b

for some constant b ∈ R. Plugging this expression into E[T (X)] yields

E[T (X)] = p³ − 3bp² + 3bp = p, so that b = (1 + p)/3.

As a result, if the UMVUE exists, the estimator T (X) has to be a function of the unknown parameter p. This contradicts the definition of a statistic (which is a function of the data only, not involving the unknown parameter). Thus, the UMVUE of p does not exist for the model in this question.
However, when we consider the UMVUE of the function pq of the parameter, the unbiasedness condition

E[T (X)] = 2pqT (−1) + q³T (0) + pq²T (1) + p²qT (2) + p³T (3) = pq

gives

T (3) = T (0) = 0, and b = 1/3.

In this case, the UMVUE for pq exists.
The above theorem obtains the restricted optimization by means of an
unbiased estimator of zero, although the optimization process is hidden in the
proof of the theorem. In what follows, we shall discuss a well-known theo-
rem that directly approaches the minimal value of the target function (the
variance) to find the UMVUE.
Theorem 4.2 Cramer-Rao optimization: Assume that the underlying model for the data X = (X1 , . . . , Xn ), f (X|θ), satisfies the Leibniz condition for the interchange of limits and integration. Denote by W (X) = W (X1 , . . . , Xn ) an estimator satisfying

(d/dθ) Eθ W (X) = ∫_X (∂/∂θ)[W (X)f (X|θ)]dX

and V arθ (W (X)) < ∞. Then

V arθ (W (X)) ≥ [(d/dθ) Eθ (W (X))]² / Eθ [((∂/∂θ) log f (X|θ))²].   (4.3)

Note: The above inequality (4.3) points out the lowest possible value for the variance of an unbiased estimator. If an unbiased estimator (the restriction) has variance equal to the right-hand side of (4.3), it is the UMVUE of its expectation. The validity of the theorem can be shown as follows. More thorough discussions can be found in [16], [80], among others.
Proof First, applying the derivative to the expectation with the use of the Leibniz condition, we have

(d/dθ) Eθ W (X) = ∫_X W (X) (∂/∂θ)f (X|θ)dX
               = Eθ [W (X) ((∂/∂θ)f (X|θ))/f (X|θ)]   (by multiplying and dividing by f (X|θ) in the integrand)
               = Eθ [W (X) (∂/∂θ) log f (X|θ)].

Now, consider the basic property of a density function. We have

Eθ [(∂/∂θ) log f (X|θ)] = (d/dθ) E[1] = 0,

thus

Covθ [W (X), (∂/∂θ) log f (X|θ)] = Eθ [W (X) (∂/∂θ) log f (X|θ)] = (d/dθ) Eθ W (X).

By the Cauchy-Schwarz inequality,

[Cov(X, Y )]² ≤ V ar(X)V ar(Y ),

in conjunction with

V arθ [(∂/∂θ) log f (X|θ)] = Eθ [((∂/∂θ) log f (X|θ))²],

we have

V arθ (W (X)) ≥ [(d/dθ) Eθ (W (X))]² / Eθ [((∂/∂θ) log f (X|θ))²].
The following example shows how to use the Cramer-Rao lower bound to
find the UMVUE (minimizing variance with restriction to unbiased estima-
tors).

Example 4.3 Assume that the underlying model of a set of data X1 , . . . , Xn is the exponential model with unknown parameter λ, exp(λ). Find the UMVUE of λ.

Solution: For the exponential model, we know

E(X1 ) = λ, V ar(X1 ) = λ²,

and

E(X̄) = λ,

where X̄ is the sample mean. Also,

log f (X1 ) = − log λ − X1 /λ,

and

(∂/∂λ) log f (X1 ) = −1/λ + X1 /λ².

Thus,

E[((∂/∂λ) log f (X1 ))²] = E[(X1 /λ² − 1/λ)²]
                        = 1/λ² − 2E(X1 )/λ³ + E(X1²)/λ⁴
                        = 1/λ².

In this case, the Cramer-Rao lower bound for unbiased estimators of λ is λ²/n, which is exactly V ar(X̄). Since X̄ is unbiased for λ and attains the bound, X̄ is the UMVUE of λ.
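A quick simulation (ours, with arbitrary λ and n) shows that the variance of the sample mean indeed matches the Cramer-Rao bound λ²/n:

import numpy as np

lam, n = 3.0, 20
rng = np.random.default_rng(5)
xbar = rng.exponential(scale=lam, size=(100000, n)).mean(axis=1)
print(xbar.var(), lam**2 / n)    # both close to 0.45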
Following the Cramer-Rao lower bound on the variance of unbiased estimators, another technique in the process of restricted optimization is to approach the optimal solution by conditioning on a sufficient statistic.
108 Bias and Variation Trade-off

Theorem 4.3 Rao-Blackwell theorem: For a set of data X = (X1 , . . . , Xn ), assume that W (X) is any unbiased estimator of a function of parameters τ (θ), and let T (X) be a sufficient statistic for the unknown θ. Define φ(T ) = E(W |T ); then φ(T ) is also unbiased, with variance no larger than the variance of W for every θ.

Proof Obviously

E(φ(T )) = E[E(W |T )] = E(W ),

thus, φ(T ) is unbiased for E(W ), also:

V arθ (W ) = V arθ [E(W |T )] + Eθ [V arθ (W |T )]
           = V arθ [φ(T )] + Eθ [V arθ (W |T )]
           ≥ V arθ [φ(T )].

Because of the sufficiency of T , the conditional distribution of X given T does not depend on θ, so φ(T ) is indeed a statistic.
The Rao-Blackwell theorem indicates that if T is a complete sufficient statistic for θ, then any unbiased estimator based on T is the UMVUE of its expected value.

Example 4.4 Rao-Blackwellization: Suppose that Xi , i = 1, . . . , n, are Bernoulli(p), for n ≥ 4. Show that the product X1 X2 X3 X4 is an unbiased estimator of p⁴, and use this fact to find the UMVUE of p⁴.

Solution: It is natural to think about X̄ since it is a sufficient and complete statistic for p. However, X̄⁴ is not an unbiased estimator for p⁴. To see this point, notice that

E[(Σ_{i=1}^{n} Xi )⁴] = Σ_i Σ_j Σ_k Σ_l E(Xi Xj Xk Xl );

grouping the summands by the number of distinct indices they involve (the expectation of each product is p raised to that number of distinct indices) gives

E[(Σ_{i=1}^{n} Xi )⁴] = np + 7n(n − 1)p² + 6n(n − 1)(n − 2)p³ + n(n − 1)(n − 2)(n − 3)p⁴ ≠ n⁴p⁴.

Thus, X̄⁴ is not an unbiased estimator for p⁴. In fact, the UMVUE can be constructed by using the Rao-Blackwell approach as stated in the previous theorem.
Minimum variance unbiased estimators 109

By independence,
!
n
E(X1 X2 X3 X4 ) = Xi = p4 ,
i=1

so the estimator S = X1 X2 X3 X4 , is unbiased for p4 .


n
Now, since the sample summation T = i=1 Xi follows a binomial model,
which is a complete statistic,

φ(T ) = E(X1 X2 X3 X4 |T )

is the UMVUE of its expected value.


Notice that X1 X2 X3 X4 only takes values 0 or 1. We can find the conditional expectation as follows.
For 4 ≤ t ≤ n, we have

$$\phi(t) = E(X_1X_2X_3X_4 \mid T = t) = \frac{P(X_1 = X_2 = X_3 = X_4 = 1,\ \sum_{i=5}^n X_i = t - 4)}{P(T = t)} = \frac{p^4 \binom{n-4}{t-4} p^{t-4}(1-p)^{n-t}}{\binom{n}{t} p^t (1-p)^{n-t}} = \frac{\binom{n-4}{t-4}}{\binom{n}{t}}.$$

For T < 4, φ(T ) = 0.

Thus, denoting the sample summation as T , the UMVUE for p4 is

$$\frac{\binom{n-4}{T-4}}{\binom{n}{T}}$$

when T ≥ 4; and it is zero when T < 4.
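A small simulation can illustrate the Rao-Blackwellization numerically. The R sketch below uses hypothetical values of n, p, and the replication count, and compares the biased estimator X̄⁴ with the conditional-expectation estimator derived above.

# Compare the naive estimator Xbar^4 with the Rao-Blackwellized UMVUE of p^4;
# the settings n, p, and B are illustrative only.
set.seed(1)
n <- 10; p <- 0.4; B <- 100000
T <- rbinom(B, size = n, prob = p)        # sufficient statistic T = sum of the X's
naive <- (T / n)^4                        # biased estimator Xbar^4
umvue <- ifelse(T >= 4, choose(n - 4, T - 4) / choose(n, T), 0)
c(true = p^4, mean.naive = mean(naive), mean.umvue = mean(umvue))

The average of the UMVUE matches p⁴ up to simulation error, while the average of X̄⁴ does not.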


The above three theorems discuss three different approaches in restricted
optimization when we are using the square loss function (variance) in esti-
mation with restriction to unbiased estimators. It should be noted that the
restricted optimal solution is not always the most accurate and robust esti-
mator, as shown in the following example.

Example 4.5 Small sample estimation: Given a random sample X1 , ......Xn


from the Poisson family with unknown parameter θ, find the UMVUE for g(θ) = e−10θ .
Solution: Notice that in this setting, $T = \sum_{i=1}^n X_i$ is complete and sufficient, and follows a Poisson distribution with parameter nθ. For the UMVUE, we want δ(T ) to
be unbiased, which means that

$$E[\delta(T)] = \sum_{t=0}^{\infty} \delta(t)\,\frac{(n\theta)^t}{t!}\,e^{-n\theta} = e^{-10\theta},$$

thus

$$\sum_{t=0}^{\infty} \delta(t)\,\frac{(n\theta)^t}{t!} = e^{(n-10)\theta} = \sum_{t=0}^{\infty} \frac{(n-10)^t\,\theta^t}{t!};$$

comparing the coefficients of $\theta^t$ gives

$$\delta(T) = \Big(1 - \frac{10}{n}\Big)^T.$$
However, when n = 1, $\delta(T) = (-9)^{X_1}$: the value of the UMVUE oscillates wildly between positive and negative values. It is not even close to g(θ).
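The oscillation is easy to see numerically; the short R sketch below (with a hypothetical value of θ) evaluates the UMVUE at n = 1 for a few simulated observations.

# At n = 1, the UMVUE of exp(-10*theta) is (-9)^{X_1}; theta below is hypothetical.
set.seed(1)
theta <- 0.5
x1 <- rpois(8, lambda = theta)    # eight independent experiments, each with n = 1
data.frame(x1 = x1, umvue = (-9)^x1, target = exp(-10 * theta))

The target e^{−5} ≈ 0.0067 is never approached; the estimates jump among 1, −9, 81, and so on.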
This example shows that although the UMVUE is optimal in the sense of minimizing the variance among unbiased estimators, it may perform poorly in estimating/predicting the true value of the unknown parameter, as in the scenario discussed here. When the sample size is not large enough, insisting on the condition of unbiasedness may result in poor performance of the estimator. In the next section, we shall discuss another aspect of restricted optimization in which the minimum variance unbiased restriction is replaced by the restriction to minimum risk estimators.

4.3 Minimum risk estimators for transformed data


Another aspect of restricted optimization in model-based inference focuses on
estimation. We shall discuss the idea of minimum risk invariant estimator and
minimum risk equivariant estimator for transformed data in this subsection.
We start with three basic notations that will be used frequently in the sequel.

1 A data transformation defined on the sample space g : Ω → Ω.

2 For each data transformation, the distribution of the transformed data changes from the original $P_\theta$ to $P_{\theta'}$. Correspondingly, there is a change of the model parameter: $\bar{g} : H \to H'$, so that $\theta' = \bar{g}(\theta)$.

3 For each g and ḡ, the corresponding change of the value of the parameter for estimation: $g^* : h(\theta') = h(\bar{g}(\theta)) = g^*(h(\theta))$.

Note: If G = {g} is a group, then Ḡ = {ḡ} and G∗ = {g ∗ } are groups. Also, G, Ḡ and G∗ are isomorphic, and

$$P_\theta(g(X) \in A) = P_{\bar{g}(\theta)}(X \in A), \quad \text{or} \quad E_\theta[\Psi(gX)] = E_{\bar{g}\theta}[\Psi(X)].$$

Analogous to maximizing the power function in hypothesis testing, in statistical estimation the target for optimization is to minimize the expectation of the loss function. We shall now review the concept of an invariant loss function under data transformations.

Definition 4.3 (Invariant estimation problem) Assume that the probability


model P = {Pθ , θ ∈ Ω} is invariant under a transformation g. Denote the
associated transformation on the parameter as ḡ and the associated transfor-
mation on the estimate as g ∗ . Assume that the loss function is invariant after
the transformation, namely, L satisfies

L(ḡθ, g ∗ d) = L(θ, d),

for any parameter θ and estimate d. Further, assume that the corresponding
function of parameter to be estimated, h(θ), satisfies

h(θ1 ) = h(θ2 ) ⇒ h(ḡθ1 ) = h(ḡθ2 ).

Under this setting, the problem of estimating h(θ) with loss function L is
invariant under g.

For an invariant estimation problem, since an estimator is essentially a function of the data, we may define the concept of an equivariant estimator as follows. Further discussion on this topic can be found in [80] or [16].

Definition 4.4 (Equivariant estimator) In an invariant estimation problem,


an estimator δ(X) is said to be equivariant if the estimated value based on the
transformed data equals the value of the corresponding transformation of the
estimator based on the original data. Mathematically, it can be expressed as

δ(gX) = g ∗ δ(X).

With the above definitions on the estimation problem for transformed


data, we have the following theorem. It states that for an invariant estimation
problem, the risk of an equivariant estimator is invariant.

Theorem 4.4 If δ is an equivariant estimator in an invariant estimation


problem under a transformation g, the risk function of δ satisfies

R(ḡθ, δ) = R(θ, δ).



Proof By definition, R(ḡθ, δ) = Eḡθ L[ḡθ, δ(X)]. The right side is equal to

$$E_\theta L[\bar{g}\theta, \delta(gX)] = E_\theta L[\bar{g}\theta, g^*\delta(X)] = R(\theta, \delta),$$

since the loss function is invariant, L(ḡθ, g ∗ d) = L(θ, d).


Note: According to the above theorem, if Ḡ is transitive over the parameter
space Ω, the risk function of any equivariant estimator is a constant, being
independent of θ. This theorem lays the legitimacy for the definition of the
minimum risk equivariant estimator.

Definition 4.5 (Minimum risk equivariant estimator) In an invariant esti-


mation problem, if there exists an equivariant estimator that minimizes the
constant risk, such estimator is called the minimum risk equivariant (MRE)
estimator.

We shall now apply the above theoretical concepts to two frequently used
groups in data transformations. One is the location transformation and the other is the scale transformation. We start with the location transformation first.

Example 4.6 A family of densities f (x|θ) with parameter θ and loss function L(θ, δ) is a location invariant model if $f(x'|\theta') = f(x|\theta)$.

Notice that L(θ, δ) is a location invariant loss function if $L(\theta, \delta) = L(\theta', \delta')$, where the transformed data
$$x' = x + a$$
correspond to the transformed parameter
$$\theta' = \theta + a$$
and the estimated value
$$\delta' = \delta + a$$
for any location shift value a.
For example, an invariant loss function for location transformation reads,

L(ξ, d) = ρ(d − ξ). (4.4)

An estimating problem is location invariant if both family and loss function


are invariant.
Now, assume that the underlying model for a set of data X =
(X1 , ......, Xn ) takes the following format,

f (X − ξ) = f (x1 − ξ, ......, xn − ξ), −∞ < ξ < +∞, (4.5)

where f is known and ξ is an unknown location parameter. Suppose that for


the problem of estimating ξ with loss function L(ξ, d) as in (4.4), we have an
estimator δ(X).

The above setting lays the foundation for the discussion of restricted optimization, with the target of minimizing the risk of the estimator and the restriction to equivariant estimators, for location transformations of the original data. First, we shall identify the set of equivariant estimators for location transformations.

Theorem 4.5 If δ0 is any equivariant estimator in an invariant estimation


problem, a necessary and sufficient condition for any estimator δ to be an
equivariant estimator is

δ(X) = δ0 (X) + u(X),

where u is a function satisfying

u(X + a) = u(X) (4.6)

for all data X and real value a.

Theorem 4.6 Assume that the underlying model for a set of data X takes the
format in (4.5), and δ is equivariant for estimating ξ with loss function (4.4).
Then, the bias, risk, and variance of δ are all constant, being independent of
ξ.

Notice that a function u satisfies (4.6) if and only if it is a function of the


differences yi = xi − xn (i = 1, ......, n − 1) when n ≥ 2, and if and only if it is
a constant when n = 1. Thus, the above theorem can be expressed as follows.

Theorem 4.7 If δ0 is any equivariant estimator, a necessary and sufficient


condition for an estimator δ to be equivariant is that there exists a function v
of n − 1 arguments for which

δ(X) = δ0 (X) − v(y) (4.7)

With the above preparation, the following theorem constructs an optimal


solution on minimizing risk with restriction to equivariant estimators in data
transformation.

Theorem 4.8 Assume that the underlying model for a set of data X is (4.5).
Let Yi = Xi −Xn for i = 1, ......, n−1 and denote y = (Y1 , ......, Yn−1 ). Suppose
the loss function is given by (4.4), and there is an equivariant estimator δ0 of
ξ with finite risk. Also assume that for each y, there exists a number v(y) =
v ∗ (y) which minimizes

$$E_0\{\rho[\delta_0(X) - v(y)] \mid y\}. \qquad (4.8)$$

Then a location equivariant estimator with minimum risk exists, and is given
by
δ ∗ (X) = δ0 (X) − v ∗ (y).

Proof By Theorem 4.7, the MRE estimator is found by determining v so as to minimize
$$R_\xi(\delta) = E_\xi\{\rho[\delta_0(X) - v(y) - \xi]\}.$$
Since the risk is independent of ξ, it suffices to minimize
$$R_0(\delta) = E_0\{\rho[\delta_0(X) - v(y)]\} = \int E_0\{\rho[\delta_0(X) - v(y)] \mid y\}\, f(y)\, dy. \qquad (4.9)$$
The integral (4.9) is minimized by minimizing the integrand, that is, (4.8), for each y. Since δ0 has finite risk, $E_0\{\rho[\delta_0(X)] \mid y\} < \infty$, so the minimization of (4.8) is meaningful. The result now follows from the assumption of the theorem.
The following examples illustrate the construction of an MRE estimator
for location transformations.

Example 4.7 Pitman Estimator: Under the assumption of the preceding


theorem and loss function L(ξ, d) = (d − ξ)2 , the MRE estimator is given by
$$\delta^*(X) = \frac{\int_{-\infty}^{+\infty} u f(x_1 - u, ......, x_n - u)\, du}{\int_{-\infty}^{+\infty} f(x_1 - u, ......, x_n - u)\, du}. \qquad (4.10)$$

Proof Consider a trivial equivariant estimator for location transformations, δ0 (X) = Xn . To use the previous theorem in the construction of the MRE, we need to compute E(Xn |y).
We can now consider the change of variables yi = xi − xn (i = 1, ......, n − 1); yn = xn . The Jacobian of the transformation is 1. The joint density of the Y 's is therefore
$$p_Y(y_1, ......, y_n) = f(y_1 + y_n, ......, y_{n-1} + y_n, y_n),$$
and the conditional density of Yn given y = (y1 , ......, yn−1 ) is
$$\frac{f(y_1 + y_n, ......, y_{n-1} + y_n, y_n)}{\int f(y_1 + t, ......, y_{n-1} + t, t)\, dt}.$$
It follows that
$$E_0[X_n \mid y] = E_0[Y_n \mid y] = \frac{\int t f(y_1 + t, ......, y_{n-1} + t, t)\, dt}{\int f(y_1 + t, ......, y_{n-1} + t, t)\, dt}.$$
This can be re-expressed in terms of the x's as
$$E_0[X_n \mid y] = \frac{\int t f(x_1 - x_n + t, ......, x_{n-1} - x_n + t, t)\, dt}{\int f(x_1 - x_n + t, ......, x_{n-1} - x_n + t, t)\, dt}.$$
Finally, by making the change of variable u = xn − t, we have
$$E_0[X_n \mid y] = x_n - \frac{\int u f(x_1 - u, ......, x_n - u)\, du}{\int f(x_1 - u, ......, x_n - u)\, du}.$$
Since for the squared error loss the minimizing value in (4.8) is v*(y) = E0 [δ0 (X)|y], the MRE estimator is δ*(X) = Xn − E0 [Xn |y], which is exactly (4.10).

This completes the proof.


We shall provide an example with a specific model to illustrate the method
of restricted optimization for data with location transformations.

Example 4.8 Assume that the underlying model of a set of data X1 , ......, Xn
is uniform (ξ − 1/2, ξ + 1/2). Suppose that the loss function is

L(ξ, d) = (d − ξ)2 ,

find the MRE of location transformations for ξ.

We may use the approach of Pitman estimator to find the MRE estimator
for ξ under location transformations.
Consider the joint density
$$f(x_1, ..., x_n) = 1 \ \text{ for all } x_i \in (\xi - \tfrac{1}{2},\ \xi + \tfrac{1}{2}); \quad 0 \ \text{otherwise}.$$
Notice that
$$x_i \in (\xi - \tfrac{1}{2},\ \xi + \tfrac{1}{2}) \ \text{for all } i = 1, ..., n \ \Rightarrow\ \xi \in (X_{(n)} - \tfrac{1}{2},\ X_{(1)} + \tfrac{1}{2}).$$
We have the denominator of the Pitman estimator:
$$\int f(x_1 - \xi, ..., x_n - \xi)\, d\xi = \int_{X_{(n)} - 1/2}^{X_{(1)} + 1/2} 1\, d\xi = X_{(1)} - X_{(n)} + 1.$$

The numerator of the Pitman estimator reads,

$$\int \xi f(x_1 - \xi, ..., x_n - \xi)\, d\xi = \int_{X_{(n)} - 1/2}^{X_{(1)} + 1/2} \xi\, d\xi = \frac{1}{2}(X_{(1)} + X_{(n)})(X_{(1)} - X_{(n)} + 1).$$

Thus, by the Pitman estimator, the MRE estimator is
$$\delta^* = \frac{\frac{1}{2}(X_{(1)} + X_{(n)})(X_{(1)} - X_{(n)} + 1)}{X_{(1)} - X_{(n)} + 1} = (X_{(1)} + X_{(n)})/2,$$
the midrange of the data.
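The closed form can also be verified numerically. The R sketch below (with hypothetical data centered at ξ = 3) approximates the numerator and denominator integrals of the Pitman estimator on a fine grid and compares the ratio with the midrange.

# Grid approximation of the Pitman estimator (4.10) for uniform(xi-1/2, xi+1/2);
# the sample below is hypothetical.
set.seed(1)
x <- runif(9, min = 3 - 0.5, max = 3 + 0.5)
u <- seq(min(x) - 1, max(x) + 1, length.out = 200001)
joint <- colSums(abs(outer(x, u, "-")) <= 0.5) == length(x)  # indicator of the joint density
c(pitman = sum(u[joint]) / sum(joint), midrange = (min(x) + max(x)) / 2)

Because the grid spacing cancels in the ratio, the ratio of sums approximates the ratio of integrals directly.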

The next example discusses restricted optimization in the case where the
loss function is the absolute error, and the minimization of the risk is restricted
to an equivariant estimator for location transformations.

Example 4.9 Let X1 , ......, Xn be i.i.d. according to the exponential distribution E(θ, 1). Find the MRE of θ for the absolute error loss L = |d − θ|.

Solution: By the totality of equivariant estimators, δ = δ0 − v(y), and

$$E_\theta(|\delta - \theta|) = E_0(|\delta|) = E_0(|\delta_0 - v(y)|),$$

we need to minimize the risk with respect to v. Since we are not dealing with a squared loss function, the usual optimal solution v ∗ = E(δ0 |y) is not applicable here.
Consider the smallest order statistic δ0 = X(1) , write y1 = x(1) , and denote
$$y_i = x_{(i)} - x_{(1)}, \quad i = 2, ......, n.$$
We have the joint density of the order statistics
$$f_{X_{(1)}, ......, X_{(n)}} = n!\, e^{-\sum_{i=1}^n x_{(i)} + n\theta},$$
with |J| = 1 for this transformation. The density of the transformed variables reads,
$$f_{y_1, ......, y_n} = n!\, e^{-\sum_{i=2}^n y_i - n(y_1 - \theta)}.$$
Now y1 and (y2 , ......, yn ) are independent, and under θ = 0 the variable Y1 = X(1) has density $ne^{-ny}$ for y > 0, so
$$E_0(|X_{(1)} - v(y)| \mid y) = E_0(|Y_1 - v|).$$

The restricted optimization problem becomes to optimize the term involving v. Let
$$g(v) = E_0(|Y_1 - v|) = \int_0^{+\infty} |y - v|\, n e^{-ny}\, dy.$$
If v < 0,
$$g'(v) = \int_0^{+\infty} (-1)\, n e^{-ny}\, dy = -1 < 0,$$
so g(v) is monotone decreasing there, and
$$\min_v g(v) = \min_{v \ge 0} g(v).$$
Thus, for v ≥ 0,
$$g(v) = \int_0^{v} (v - y)\, n e^{-ny}\, dy + \int_v^{+\infty} (y - v)\, n e^{-ny}\, dy.$$
Setting g'(v) = 0 yields that the optimal solution v ∗ is the median of X(1) under θ = 0,
$$v^* = \frac{\log 2}{n}.$$
Therefore the MRE for location transformations with the absolute error loss function reads,
$$\delta^* = X_{(1)} - \frac{\log 2}{n}.$$
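A simulation can confirm that the shift log 2/n beats other natural choices of v under the absolute error loss. The R sketch below uses hypothetical values of n, θ, and the replication count.

# Risk of X(1) - v under absolute error for E(theta, 1); settings are hypothetical.
set.seed(1)
n <- 5; theta <- 1; B <- 100000
xmin <- replicate(B, min(theta + rexp(n)))    # X(1) is distributed as theta + Exp(n)
risk <- function(v) mean(abs(xmin - v - theta))
c(mre = risk(log(2) / n), mean.shift = risk(1 / n), no.shift = risk(0))

The median shift log 2/n yields the smallest estimated risk, in line with the derivation above.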

In the discussion above, we focused on MRE estimators for location trans-


formations. We shall now consider the corresponding theory for the scale trans-
formation group. Further materials in this regard can be found in [16].
Example 4.10 Assume that the underlying model of a set of data X = (X1 , ......, Xn ) takes the following form,
$$\frac{1}{\tau^n} f\Big(\frac{X}{\tau}\Big) = \frac{1}{\tau^n} f\Big(\frac{x_1}{\tau}, ......, \frac{x_n}{\tau}\Big), \quad \tau > 0, \qquad (4.11)$$
where f is known and τ is an unknown scale parameter. This model remains invariant under the transformations $X_i' = bX_i$, $\tau' = b\tau$ for any positive value b > 0.
We will consider the following loss function for discussions in the rest of this subsection,
$$L(\tau, d) = \frac{(d - \tau^r)^2}{\tau^{2r}}. \qquad (4.12)$$
By the definition alone, the set of equivariant estimators for scale transformations seems hard to characterize. However, the following theorem shows that, similar to the estimation theory for location transformations, we may construct a general expression for scale equivariant estimators, based on a given scale equivariant estimator.

Theorem 4.9 Totality of scale equivariant estimators: Assume that the underlying model of a set of data X is (4.11), and δ0 is any scale equivariant estimator of τ r . Let
$$z_i = \frac{x_i}{x_n} \ (i = 1, ......, n-1) \quad \text{and} \quad z_n = \frac{x_n}{|x_n|}, \qquad (4.13)$$
and denote z = (z1 , ......, zn ). A necessary and sufficient condition for δ to be a scale equivariant estimator is that there is a function w(z) such that
$$\delta(X) = \frac{\delta_0(X)}{w(z)}.$$
With the expression of the domain (totality of scale equivariant estimators)
in which the optimization is restricted, we can now discuss the following theo-
rem which identifies the optimal solution (minimizing the risk) with restriction
to scale equivariant estimators, the Minimum Risk Equivariant (MRE) esti-
mator.
Theorem 4.10 Pitman Estimator: Assume that the underlying model for a
set of data X takes the form in (4.11). Consider a vector z given by (4.13).
Further, suppose that the loss function is given by (4.12), and there exists a
scale equivariant estimator δ0 for the parameter of interest τ r with finite risk.
For each z, if there exists a number w(z) = w∗ (z) which minimizes
E1 {γ[δ0 (X)/w(z)]|z}, (4.14)

Then an MRE, minimum risk equivariant estimator, can be constructed by
$$\delta^*(X) = \frac{\delta_0(X)}{w^*(z)}. \qquad (4.15)$$

Specifically, the MRE estimator reads,

$$\delta^*(X) = \frac{\int_0^{+\infty} v^{n+r-1} f(vx_1, ......, vx_n)\, dv}{\int_0^{+\infty} v^{n+2r-1} f(vx_1, ......, vx_n)\, dv}. \qquad (4.16)$$

Proof The proof of the above theorem is similar to the proof of Pitman
estimator for location transformations.
We selected the method of Pitman estimator to illustrate the method of
restricted optimization in data analysis, which involves using transformed data
for estimation or prediction of an unknown parameter characterizing assumed
models. Certainly, there are many interesting results in estimation theory
that we are unable to exhaust in this book. Interested readers can find further
discussions on this part of materials in the books [81], [80], and [16], among
others.

SUMMARY Following the preceding chapter on sensitivity and specificity


trade-off, this chapter focuses on another type of trade-off, the bias and varia-
tion trade-off. For prediction processes in data science, there are always errors
or data fluctuation that make the predicted value consist of reducible errors
and irreducible errors. Reducible errors can be ameliorated by enhancing the
prediction accuracy with close approximation to the underlying rule governing
the data. On the other hand, the irreducible error is the one that can not be
improved by refining prediction models because the noise is hidden behind the
data. Usually, increasing accuracy (which is equivalent to decreasing prediction bias) is bound to increase the variation of the predicted value. Towards this end, this chapter uses the concept of the uniformly minimum variance unbiased estimator (UMVUE) as an example to delineate methods and procedures that keep the unbiasedness criterion while minimizing the variation of the estimator. Under the measurement of squared prediction error, the UMVUE is one of the restricted optimal solutions.
When the measurement criterion changes for different types of prediction problems with data transformation, the optimal estimator UMVUE needs to be extended to the MRE (minimum risk equivariant estimator). Certainly, when the risk function is the sum of squared prediction errors, the UMVUE is an MRE. However, when a small amount of bias results in a large reduction in variance, the UMVUE is no longer the best choice of MRE. The MRE framework covers more situations than the UMVUE, especially for discrete types of data in classification problems.
5
Linear Prediction

Linear regression is arguably one of the most commonly used and abused
statistical tools in data science. Its versatility and intuitiveness fit a broad range of applications, with simple linear models such as “when the price increases, the return per item increases” or “the insurance premium decreases as the time spent driving increases”. Such relationships can arise in almost any setting. Despite this, its ease of use tends to backfire when amateur data analysts mindlessly default to reading the data into software (such as R, Python, or Excel) to obtain a fitted line without checking the validity conditions of linear regression. They tend to lack consideration for the rationale behind the methodology. As a result, the inferred conclusion sometimes results in an unreliable statistical prediction.
Traditional textbooks on this topic usually begin with introductory exam-
ples, followed by a least squares estimate, inference, and discussion. However,
in this chapter, we take a different route. We focus on the validity conditions
and precautions with linear regression models in practice, to prevent misuse
of the technique from the get-go.

5.1 Pitfalls in linear regressions


The first precaution concerns the intrinsic character of the data. We need
to ensure that the data is a random sample representing an intrinsic linear
relationship. For instance, one can easily select a set of data in which students
with big shoe sizes have high SAT test scores. If we blindly fit the data into
a linear model, we may reach an unrealistic conclusion that bigger shoe sizes
predict higher SAT test scores (or the other way around). This is obviously a
misapplication of the tool.
To avoid making such a fundamental error, two basic conditions must be
satisfied before fitting the data into a regression model. The first condition is
that the data needs to represent the population of interest. In other words,
it should be a random sample from the population of interest. The second
condition is that the response variable and the predictor should have some
intrinsically linear relationship. For instance, the shoe size of an exam taker
has nothing to do with the corresponding SAT score. The body weight of cows


has no connection with the body weight of rhinos, as illustrated in Figure 5.1.
If there is no intuitive relationship between the input and output variables, it
would not make sense to set up a linear model between them, even though the
relationship may accidentally appear to be linearly correlated in one dataset
due to randomness.
Certainly, in data-oriented analysis, we use exploratory data analysis (such
as plotting the data) to seek or approach the true model behind the data.
Such a practice should be confined to cases where the data truly represent the
variable and no sample selection bias exists. Blindly applying linear regression
without proper justification may result in misleading conclusions.

FIGURE 5.1
Non-intrinsic linear relationship between weights on cow and rhino

The second precaution focuses on the sample size and the underlying model
of the data. Although the least square estimate of model parameters does not
require any distribution of the data, testing on the significance and validity
of the fitted model depends on hidden assumptions: for example, that the error term of the data follows a normal model with constant variation. Such a model assumption is critical in the validity analysis of the fitted model. Especially since
any software can produce a fitted line out of an input and an output variable,
whether that relationship is statistically significant is questionable for many
data analyses.
If the data contains too much noise, the effect of the noise overwhelms the
effect of the input feature, and the fitted linear model is rendered insignificant.
In this case, it is necessary to test whether the noise is too large to claim the
existence of a linear relationship. The instruments to perform such tests in-
clude the t-test for linear coefficient significance and the F-test for the validity
of the model. Note that these tests are built upon the normality assumption
with constant variances.
$$y_j = \alpha + \beta x_j + \epsilon_j,$$

where $\epsilon_j \sim N(0, \sigma^2)$ for all the observations j = 1, ..., n.


When the normality assumption is unfulfilled in the evaluation of model
validity, any further evaluation of the fitted model requires a large sample size
for asymptotic theory, in order to compensate for the model violation. Thus,
when the sample size is not significantly large and the underlying distribution
of the error term is not normal, it is inappropriate to use the linear regression
model.

FIGURE 5.2
Bounded driving years vs insurance premium

The third precaution verifies the range of the input features being used
in prediction. Recall that the fitted model is built upon training data that is
also confined within a range of values of the input predictor. When predicting
using the fitted model, if the value of the input variable is beyond the range
of the training data, depending on the fitted line, the predicted result may
not be meaningful. For instance, in the prediction of insurance premium in
Figure 5.2, it does not make sense to extend the linear line into the area of
driving age being -1, and claim that the premium is $78.2 per month for a
driver at one year before the beginning of one’s driving experience, because the

intrinsic relationship between the input and the output can not be plausibly
extended to that range.
The fourth precaution is the association effect versus the causation effect.
It is often confusing and inaccurate to claim that a linear relationship has a
causation effect. Linear regression is simply just the fitting of two columns of
data points; while the input has an effect on the output, the output may as
well influence the input in the linear regression model. In fact, what we can
claim in a linear regression analysis is essentially an association effect, not a
causation effect.

FIGURE 5.3
Car age vs selling price with outliers

The fifth precaution is the interpretation of the fitted model, especially


when there are outliers in the data. As shown in Figure 5.3, on average, the
expected price of a brand new car (X = 0) is around $20,574, with a decrease of around $2,338 per year of driving. However, when there are outliers as
displayed in Figure 5.3, the fitted line represents

selling price = 15,813 − 1,022 × car age,

an average decrease of $1,022 per driving year and a lower estimated price of $15,813 for a brand new car.
As shown in the figure, the occurrence of the outlier (6.5, 245) flattens the fitted line and, in turn, misrepresents the data pattern for the bulk of the

data. Obviously, a value of $24,500 for the selling price of a 6.5-year-old car is an extreme case. In this example, it is more likely that the recorded selling price of 245 ($24,500) was actually 45 ($4,500), potentially due to a stray 2 typed in front of 45 during the data entry process.

5.2 Model training and prediction


By definition, a simple linear model is represented by a single input variable and a single output variable that, in tandem, have a potentially intrinsic linear relationship. For example, we may choose to model the amount of rainfall
an area of fertile land receives alongside the crop yields it produces. This
is plausibly linear (within reasonable bounds of rainfall, of course), in con-
junction with the idea that more rain may produce a fatter yield. Table 5.1
provides some examples of potentially linear relationships. For instance, a marketing analyst could use sales this month to predict sales next month; the measurement of body coordination is proportionally influenced by the blood alcohol content; and the price of a product may linearly predict the amount of that product sold.
One of the main advantages of linear models is that they are easy to interpret, and the interpretation is immediately intuitive: a rise (or decline) in the input variable produces a proportional result in the output variable. Another feature of linear models is their applicability. Through graphs, the quality of fit in a linear relationship can be visually determined when such a relationship exists.

TABLE 5.1
Intrinsic linear relationships
Input variable Output variable
Sales this month Sales next month
Blood alcohol content Measure of body coordination
Price of a product Amount of that product sold

In linear regression, the relationship between the input variable and the
output variable is usually measured by the sample correlation coefficient, r.
If the absolute value of r is close to 1, the linear pattern between the input
variable and the output variable is strong. On the other hand, if the value |r| is
small, there is essentially not much correlation between the two variables. Un-
der the normality assumption, variables with zero correlation are statistically
independent. When the value of Y increases as x increases, the correlation is
positive. If the value of Y decreases as the value of x increases, the correlation
coefficient is negative. It can be proved that |r| ≤ 1.

$$r_{xy} = \frac{\sum (x_i y_i) - \frac{1}{n}\sum x \sum y}{\sqrt{\sum x^2 - \frac{1}{n}(\sum x)^2}\ \sqrt{\sum y^2 - \frac{1}{n}(\sum y)^2}}. \qquad (5.1)$$
Equivalently, the sample correlation coefficient is expressed as
$$r_{xy} = \frac{\frac{1}{n-1}\big[\sum (x_i y_i) - \frac{1}{n}\sum x \sum y\big]}{S_x S_y} = \frac{\widehat{COV}(X, Y)}{S_x S_y}, \qquad (5.2)$$

where $\widehat{COV}(X, Y)$ is the sample covariance between X and Y, and $S_x$ and $S_y$ are the sample standard deviations of X and Y, respectively.
Equation (5.1) and Equation (5.2) are algebraically equivalent. The difference is that Equation (5.2) uses the sample covariance and sample standard deviations that are conventionally used in data analysis. Equation (5.2) also highlights the meaning of the correlation coefficient: it measures the co-variation of X and Y, adjusted by the individual variations of X and Y.
The following example demonstrates the computation of the sample cor-
relation coefficient r on the basis of a set of bivariate data (X, Y ).

Example 5.1 Assume that we have the following data on the input variable
X and output variable Y .

X: 3 6 12 18 24
Y: 60 95 140 170 185
We can compute the components in the formula ( 5.1) as follows.

$$\sum_i x_i^2 = 1089, \quad \sum_i x_i = 63, \quad \sum_i x_i y_i = 9930, \quad \sum_i y_i^2 = 95350, \quad \sum_i y_i = 650, \quad n = 5.$$
This results in a sample correlation coefficient of 0.972, indicating a strong
linear pattern between the two variables.
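The hand computation can be reproduced directly; below is a minimal R sketch using the data of Example 5.1.

# Sample correlation by formula (5.1) and by the built-in cor().
x <- c(3, 6, 12, 18, 24)
y <- c(60, 95, 140, 170, 185)
num <- sum(x * y) - sum(x) * sum(y) / length(x)
den <- sqrt((sum(x^2) - sum(x)^2 / length(x)) * (sum(y^2) - sum(y)^2 / length(x)))
c(by.formula = num / den, built.in = cor(x, y))   # both about 0.972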

Figure 5.4 illustrates different patterns of plots between the input and out-
put variables with their corresponding sample correlation coefficients. When
the linear pattern is indiscernible, the sample correlation coefficient is low, at
a range from 0 to 0.3. As the linear pattern becomes significant, the abso-
lute value of the corresponding sample correlation coefficient increases. When
variable Y increases as X increases, the correlation coefficient is positive; oth-
erwise, it is negative. Figure 5.4 shows that the correlation of sample data
is intuitive to understand. The plots of the data cloud fit well with the val-
ues of the sample correlation coefficients. The relationship between the input
and output variables is easily obtained from a quick glance at the plot of the
training data.

FIGURE 5.4
Data cloud and sample correlation coefficients

5.2.1 Building models with training data


In this subsection, we shall discuss the optimization process for the derivation
of the model parameters in a linear model with a set of training data. Recall a
general setting in data science where we are interested in obtaining the model
behind a set of data by minimizing the expected prediction error.

Step 1. Identifying the shape of the underlying function by optimizing the


expected prediction error
When we have a set of training data consisting of a response variable y
and a set of predictors X = (X1 , ..., Xp )T , consider the simplicity of a linear
function in the optimization process for linear regression. The distance mea-
surement is in terms of the expected prediction error. This can be formulated
mathematically as follows.
We have that the underlying function c(X) can be obtained by the typical minimization of the expected prediction error,
$$\hat{c}(X) = \arg\min_c E_{Y|X}\big[(Y - c(X))^2 \mid X = x\big].$$

Since

E((Y − c(X))2 |X) = E((Y − E(Y |X))2 |X) + (E(Y |X) − c(X))2 ,

the optimal selection of the underlying function is

c(X) = E(Y |X).


Now, for the data represented by a linear model

$$y = \alpha + \beta X + \epsilon,$$

where ε is the term of random error satisfying

$$E(\epsilon) = 0, \quad var(\epsilon) = \sigma^2.$$
This leads to the underlying function

c(X) = E(Y | X) = α + βX.

Therefore, when the unknown underlying function c(X) takes on the value
α + βX, the conditional expected value reaches its minimum value, which
consequently minimizes the expected prediction errors.
Notice that the above derivation does not require the distribution of the er-
ror term. When we search for the optimal solution for the underlying function
behind the data, we do not need the distribution pattern of the error term.
However, under the assumption that the underlying error term follows a nor-
mal model, as discussed in most introductory statistics textbooks, p-values are
available to determine whether there is significant data evidence to support
the validity of the linear model. In the following sections, we shall address
prediction methods using simple linear models with and without normality
assumptions in data science, before discussing the implications of multiple
linear regression at the end of this chapter.
Step 2. Using the least squares estimation with training data to build a trained
model without normality assumptions
Given a set of data consisting of a predictor X and a response Y , to build
a simple linear regression model, we need to estimate the coefficients α and β
in the model,
$$Y = \alpha + \beta X + \epsilon,$$
where ε is the random term of the data. If ε follows a normal model, we have
existing estimation and testing procedures for the significance on the unknown
parameters α and β.
Consider the target function of minimizing the sum of residual errors for
a set of training data,


$$L(\alpha, \beta) = \sum_{i=1}^n (Y_i - \alpha - \beta X_i)^2. \qquad (5.3)$$

Taking derivatives of L(α, β) with respect to α and β for a given set of data gives the optimal solution for Equation (5.3),
$$\hat{\beta} = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}$$
and
$$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X},$$

where
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i, \qquad \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i.$$
The above derivation can be alternatively obtained via the following formulation. Consider a case where the target function for optimization is the sample mean squared error ($\widehat{MSE}$). On the training set, assume that we have n observation pairs on (x, y), where x is the input variable and y is the output variable. Denote the vectors $X = (x_1, ..., x_n)'$ and $Y = (y_1, ..., y_n)'$. Our goal is to minimize
$$\widehat{MSE}_{training} = \frac{1}{n}(Y - \hat{\alpha} - X\hat{\beta})'(Y - \hat{\alpha} - X\hat{\beta}).$$
After some standard operations as documented in conventional statistics textbooks, the optimal values can be achieved by setting
$$\hat{\beta} = (X'X)^{-1}X'Y, \qquad (5.4)$$
and
$$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}, \qquad (5.5)$$
where X̄ and Ȳ are the sample means of the predictor and the response variable, respectively. Note that for (5.4) to coincide with the slope formula above, X and Y in (5.4) should be taken as centered at their sample means.
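The closed-form solution is easy to verify against R's built-in fitter; the sketch below uses simulated (hypothetical) data.

# Least squares estimates from the closed-form expressions and from lm().
set.seed(1)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50)                     # hypothetical training data
beta.hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha.hat <- mean(y) - beta.hat * mean(x)
c(alpha = alpha.hat, beta = beta.hat)
coef(lm(y ~ x))                                # matches the closed-form values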
Note that the above derivations do not require any normality assumption on the distribution of the error term ε. We shall use an example to illustrate the above discussion.
Example 5.2 Horizon Properties specializes in custom home re-sales in
Phoenix, Arizona. A random sample of 200 records from the custom-home-
sale database provides the following information on the size (in hundreds of
square feet, rounded to the nearest hundred) and price (in thousands of dollars,
rounded to the nearest thousand) of houses in the market.
Using Equations (5.4) and (5.5), we get an estimated model,
y = −110 + 15.89X + error
This is represented in Figure 5.5.
The fitted line can be interpreted as follows. When the house size is 2,000 square feet, the long-run average price in the area is $207,800, because the predicted value is y = −110 + 15.89 × 20 = 207.8. Also, each 100-square-foot increase in the size of a house will increase the long-run average resale value of the house by $15,890.
Notice that in the interpretation of the regression model, the intercept
−110 can not be interpreted directly. Clearly, −110 is a nonsensical value
because the value of a house surely cannot be negative. However, this can only
occur for a corresponding x-value of 0, implying a house of zero square feet
exists, which is also a nonsensical input. This example indicates that a linear
regression is only intended to be interpreted within the range of reasonable
inputs.

FIGURE 5.5
Horizon Properties, data summary with a linear model

5.2.2 Evaluating trained models without normality


We discuss the sample R2 and testing MSE (mean squared error) in this
subsection, since these two evaluation criteria do not require the normality
assumption for the error term in the linear regression model.

1. Sample R2
The first and most convenient way is to examine the sample R² value in a regression analysis. In a simple linear regression, we can directly calculate the sample correlation between the input and output variables. In the multiple linear regression case, we have the corresponding multiple correlation matrix for the sample R², which essentially serves the same purpose. As it is a good determiner of fit, we may also examine the changes in R² when running variable selection algorithms in linear models.
The sample correlation R is formally defined as
$$R = \frac{1}{n-1}\sum_i \Big(\frac{x_i - \bar{x}}{s_x}\Big)\Big(\frac{y_i - \bar{y}}{s_y}\Big),$$
and in simple linear regression the sample R² is its square,

where x̄ and ȳ are the average values of x and y of the training data re-
spectively, and sx and sy are the sample standard deviations of the training
data.
Taking a closer look at the y terms, we can see that

$$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2,$$

or more succinctly, that

$$SS_{total} = SS_{residuals} + SS_{regression},$$
where SS refers to the sum of squares. That is, the squared sample correla-
tion coefficient R2 is the fraction of variability in the data explained by the
regression model. We can further describe these terms as follows:

SS_total: variation in the y values.
SS_residuals: variation in y from the mismatch between the model and the observed values.
SS_regression: variation in y due to changes in x that result in differing output from the model.

With the above notations, we have that
$$R^2 = \frac{SS_{regression}}{SS_{total}} = \frac{\text{explained variation}}{\text{total variation}}.$$
For a good model, we are able to explain almost all variations, hence the R2
will be high and approaching 1. If our R2 value is low (approaching 0), it
means that our model does not explain much variation – it is therefore not
providing much useful output information for changes in the input variable.
This is another indication that the relationship between the input and the
output variables is not sensitive enough to claim statistical significance.
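As a concrete check of this decomposition, the R sketch below (on hypothetical simulated data) computes R² from the sums of squares and compares it with the value reported by summary().

# R^2 as explained variation over total variation, on the training data.
set.seed(1)
x <- runif(50, 0, 10); y <- 3 + 2 * x + rnorm(50)   # hypothetical data
fit <- lm(y ~ x)
ss.total    <- sum((y - mean(y))^2)
ss.residual <- sum(residuals(fit)^2)
c(manual = 1 - ss.residual / ss.total, reported = summary(fit)$r.squared)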

2. Sample MSE of the testing data


After building the trained model with training data, it is imperative to
assess the validity of the model. Without the assumption of the normality
distribution, we do not have p-values to evaluate the significance of the model
parameter. Under a data scientist’s context, after obtaining the trained model
from the training data, we typically evaluate the model using the testing data
(usually 25% of the original data).
Given a testing set P, we have that
$$MSE_{test} = \frac{1}{|P|}\sum_{i \in P} (Y_i - \hat{\alpha} - X_i\hat{\beta})^2.$$

In general, theoretically we have the population expected prediction error,

$$EPR = E(Y - \hat{g}(X))^2, \qquad (5.6)$$

and the sample test mean squared error from the testing data,

$$\widehat{EPR} = \frac{1}{m}\sum_{i \in P} (Y_i - \hat{g}(X_i))^2, \qquad (5.7)$$

in which m is the size of the testing dataset, i indexes the observations in the test dataset, and the function ĝ is derived from the training data.
Notice that (5.6) is the population expected prediction error while (5.7) is the sample mean of the individual errors for a set of data. Thus, the reliability of our evaluation of the trained model depends on the sample size of the testing data set. On one hand, if m is large enough, the value of $\widehat{EPR}$ is close to the true EPR. On the other hand, if the sample size of the testing data is not large enough, we may underestimate or overestimate the expected prediction error, which consequently leads to misleading conclusions in model selection and, in turn, misleading predictions.
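A typical implementation of this evaluation is sketched below in R, with a hypothetical simulated dataset and a roughly 75/25 train/test split; the variable names are illustrative.

# Estimating the expected prediction error with a held-out testing set.
set.seed(1)
n <- 400
x <- runif(n, 0, 10); y <- 3 + 2 * x + rnorm(n, sd = 2)  # true sigma^2 = 4
idx  <- sample(n, size = 0.75 * n)                       # training indices
fit  <- lm(y ~ x, subset = idx)                          # trained model
pred <- predict(fit, newdata = data.frame(x = x[-idx]))
c(test.mse = mean((y[-idx] - pred)^2), irreducible = 4)  # test MSE near sigma^2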
To better understand how the test expected prediction error (or in this
case, the test MSE) relates to the population expected prediction error, we
utilize the laws of large numbers.

Theorem 5.1 Weak Law of Large Numbers.
For any ε > 0, we have
$$\lim_{n\to\infty} P(|\bar{X}_n - \mu| > \epsilon) = 0,$$
for μ = E(Xi ). Equivalently,
$$\bar{X}_n \to^P \mu,$$
as n → ∞.

The weak law of large numbers states that as the number of observations increases, the sample average will be close (in probability) to the expected value. In our setting, μ is the EPR while the sample mean X̄ is the sample expected prediction error $\widehat{EPR}$. In fact, the relationship between the sample mean and the population mean has a stronger statement:

Theorem 5.2 Strong Law of Large Numbers.
$$P\big(\lim_{n\to\infty} \bar{X}_n = \mu\big) = 1,$$
for μ = E(Xi ), or equivalently,
$$\bar{X}_n \to^{a.s.} \mu,$$
as n → ∞.

When the sample size is large enough, the average of the observations is
almost surely the expected value.
Now, if we let
$$Z_i = (Y_i - \hat{\alpha} - X_i\hat{\beta})^2,$$
we can apply the strong law of large numbers to the sequence {Zi }, and obtain
$$\frac{1}{m}\sum_{i=1}^m (Y_i - \hat{\alpha} - X_i\hat{\beta})^2 = \frac{1}{m}\sum_{i=1}^m Z_i \ \to^{a.s.}\ E(Z) = E[(Y - \hat{\alpha} - X\hat{\beta})^2].$$

Therefore, when the sample size is sufficiently large, the test MSE almost surely approaches the expected prediction error. Stated more directly by combining the training set with the testing set: when the sample sizes in the training set and the test set are large enough, the linear model estimated from the least-squares criterion almost surely has the smallest test MSE. Therefore, it is necessary to have large sample sizes in both the training set and the testing set when we cannot plausibly assume that the error term follows a normal model with a common standard deviation.

Example 5.3 We use this example to show that for simple linear regression,
parameters estimated from the training data do not guarantee the variance
decomposition principle.

Denote the training data by T and the testing data by P. We have

$$\sum_{i \in T} (y_i - \bar{y})^2 = \sum_{i \in T} (y_i - \hat{y}_i)^2 + \sum_{i \in T} (\hat{y}_i - \bar{y})^2, \qquad (5.8)$$

but, with probability one, we have

$$\sum_{i \in P} (z_i - \bar{z})^2 \ne \sum_{i \in P} (z_i - \hat{z}_i)^2 + \sum_{i \in P} (\hat{z}_i - \bar{z})^2, \qquad (5.9)$$

where ȳ and z̄ are sample means of the responses in the training set and testing
set, respectively. ŷi and ẑi are the predicted responses corresponding to the
predictor in the training set and testing set, respectively. The response in the
training set is denoted as yi and the response in the testing set is denoted as
zi .
First, we consider the validity of (5.8), notice that when we train the linear
model with a set of training data, we have

$$\hat{\beta} = \frac{\sum_{j \in T} (x_j - \bar{x})(y_j - \bar{y})}{\sum_{j \in T} (x_j - \bar{x})^2}$$

and
α̂ = ȳ − β̂ x̄
where ȳ and x̄ are the sample mean of the responses and the sample mean of
the predictor in the training data set. Now,

$$\sum_{i \in T} (y_i - \bar{y})^2 = \sum_{i \in T} (y_i - \hat{y}_i)^2 + \sum_{i \in T} (\hat{y}_i - \bar{y})^2 + \sum_{i \in T} 2(\hat{y}_i - \bar{y})(y_i - \hat{y}_i).$$

Notice that the last term in the expansion vanishes on the training data set:

$$\sum_{i \in T} 2(\hat{y}_i - \bar{y})(y_i - \hat{y}_i) = 2\hat{\beta}\sum_{i \in T} (x_i - \bar{x})(y_i - \hat{y}_i) = 2\hat{\beta}\sum_{i \in T} (x_i - \bar{x})\big(y_i - \bar{y} - \hat{\beta}(x_i - \bar{x})\big)$$
$$= 2\hat{\beta}\Big[\sum_{i \in T} (x_i - \bar{x})(y_i - \bar{y}) - \frac{\sum_{j \in T} (x_j - \bar{x})(y_j - \bar{y})}{\sum_{j \in T} (x_j - \bar{x})^2}\sum_{i \in T} (x_i - \bar{x})^2\Big] = 2\hat{\beta}\,(0) = 0.$$

This concludes (5.8). Now for (5.9), notice that

$$\sum_{i \in P} (z_i - \bar{z})^2 = \sum_{i \in P} (z_i - \hat{z}_i)^2 + \sum_{i \in P} (\hat{z}_i - \bar{z})^2 + \sum_{i \in P} 2(\hat{z}_i - \bar{z})(z_i - \hat{z}_i).$$

Now, the last term in the expression above becomes

$$\sum_{i \in P} 2(\hat{z}_i - \bar{z})(z_i - \hat{z}_i) = 2\sum_{i \in P} (\hat{\alpha} + \hat{\beta}x_i - \bar{z})(z_i - \hat{z}_i) = 2\sum_{i \in P} \big(\bar{y} - \hat{\beta}\bar{x} + \hat{\beta}x_i - \bar{z}\big)\big(z_i - \bar{y} - \hat{\beta}(x_i - \bar{x})\big),$$

where α̂, β̂, x̄, and ȳ are computed from the training data T. The orthogonality that annihilated the cross term over T, namely $\sum_{i \in T}(x_i - \bar{x})(y_i - \hat{y}_i) = 0$, does not carry over to the testing observations, so this cross term is a nondegenerate random quantity and

$$\sum_{i \in P} 2(\hat{z}_i - \bar{z})(z_i - \hat{z}_i) \ne 0 \quad \text{(with probability 1)}.$$

Thus, Equation (5.9) follows.


The above discussions are grounded on information without the normality
assumption for the error term in linear regression analyses. However, when
the sample size is not large enough, errors in the asymptotic method are
not negligible. This necessitates a discussion on the regression model when
normality assumptions are satisfied.

5.2.3 Model significance with normal data


Plugging any two columns of data into regression software, we can always get a fitted line. Some regression lines do indeed reveal insightful information between the input and the output variables. Others are just insignificant random effects due to variation in the data. When we have a large amount of data, the discussion on the sample R² and testing MSE in the above subsection can be used to evaluate the significance of the fitted model. However, when the data set is not very large (especially since there is no explicit criterion for how large is large enough), it is always helpful to utilize any additional distribution information to facilitate the data analytics process.
We will illustrate this aspect using a hypothetical example of predicting a
car resale price based on the age of the car.
Example 5.4 As shown in the two fitted lines regarding the selling price and
the age of the car in Figure 5.3, the top one is for normal data without outliers, while the bottom one corresponds to the occurrence of outliers in the
dataset. Although the fitted line corresponding to the dataset without outliers
effectively depicts the main pattern of the data cloud, the one with outliers
is questionable. Without further information, even with the fitted line, it is
debatable to claim the validity of the estimated model. This issue will be further
discussed with model significance under normality assumptions below.
Figure 5.6 shows regression outputs when the underlying distribution of
the data is assumed to be normal with a constant variation. As shown in the
first part of Figure 5.6, when there are no outliers, the dataset conveys strong
linear pattern with R² = 0.978, and the model is extremely significant with p-value = 1.35E−09. Considering the data variation through the standard error (1.115) of the estimated slope (−23.382), the t-statistic (which is a uniformly most powerful unbiased test) takes the value −20.978 (the estimated regression coefficient adjusted by the sample variation). Thus, the fitted line closely reveals the data pattern between the years of usage and the selling prices of the car. In other words, for the hypotheses
$$H_0: \beta = 0 \quad \text{vs} \quad H_1: \beta \ne 0,$$
β = 0 indicates that there is no linear relationship between the age of the car and the selling price, while β ≠ 0 indicates that the age of the car does affect the selling price of the car. The small p-value indicates data evidence to reject the null hypothesis in favor of the alternative hypothesis.

1. Regression output for car selling price without any outlier.

Regression Statistics
Multiple R 0.9888279
R Square 0.9777807
Adjusted R Square 0.9755588
Standard Error 5.25746
Observations 12

ANOVA
df SS MS F Significance F
Regression 1 12163.62781 12163.63 440.0593 1.34528E-09
Residual 10 276.4088543 27.64089
Total 11 12440.03667

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 205.74396 5.33816682 38.54206 3.3E-12 193.8497836 217.63814
Years of car -23.381625 1.114600159 -20.9776 1.35E-09 -25.8651087 -20.898141

2. Regression output for car selling price with outliers.

Regression Statistics
Multiple R 0.288086
R Square 0.082993
Adjusted R Square -0.00037
Standard Error 51.87716
Observations 13

ANOVA
Significance
df SS MS F F
Regression 1 2679.272 2679.272 0.995553 0.33983196
Residual 11 29603.64 2691.24
Total 12 32282.91

Standard
Coefficients Error t Stat P-value Lower 95% Upper 95%
Intercept 158.1272 50.65999 3.121344 0.009727 46.6253475 269.62911
Years of car -10.2281 10.25097 -0.99777 0.339832 -32.790379 12.33408

FIGURE 5.6
Regression outputs on selling prices with outliers

When examining the second part of Figure 5.6, the story deviates from the first portion of the output. With the presence of outliers, the data variation becomes too large and the fitted line essentially becomes insignificant. The sample R² = 0.083 indicates that there is essentially no linear pattern behind the data.
Notice that the fitted line has moved away from the original one due to the occurrence of outliers. The estimated standard error for the regression coefficient is 10.25, which is almost the same as the absolute value of the regression coefficient (−10.23). Since the data variation is almost as large as the magnitude of the estimated model coefficient, the t-statistic is −0.998, with a p-value of 0.3398, suggesting that, with the inclusion of outliers, the data variation has increased to a level that overwhelms the significance of the estimated slope. Thus, we fail to reject the null hypothesis that β = 0, which is equivalent to stating that there is no statistical evidence to claim a relationship between the age of a car and its selling price. Under the normality assumptions, the numerical output agrees with the observation of the large variation of the data in the second part of Figure 5.3.

5.2.4 Confidence prediction with trained models


Once the model is trained and validated, the next step is to predict the un-
known response based on the explanatory variable x. For instance, when we have the trained model
$$\text{selling price} = 20{,}574 - 2{,}338 \times \text{age of the car},$$
the expected selling price for a car with 6.2 years of usage can be predicted, on average, as
$$\text{selling price} = 20{,}574 - 2{,}338 \times 6.2 = \$6{,}078.40.$$
However, a point estimate alone is unstable and risky in statistical inference due to randomness in the data. To reach a (1 − α) confidence level for the predicted value, under the normality assumption, we have the confidence prediction interval,
$$\big(\hat{y} - t_\alpha s_{\hat{y}},\ \hat{y} + t_\alpha s_{\hat{y}}\big),$$
where the cut-off value tα for the Student-t model satisfies
$$P(|t_{n-2}| < t_\alpha) = 1 - \alpha.$$
The sample standard deviation of ŷ, $s_{\hat{y}}$, reads
$$s_{\hat{y}} = s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}},$$
where s is the sample standard error of the residuals, x0 is the value of the explanatory variable X for which we predict the response, and
$$SS_{xx} = \sum_i (x_i - \bar{x})^2$$
is the sum of squares of the explanatory variable X from the training data.
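The formula agrees with the interval produced by predict() in R; the sketch below fits a hypothetical car-resale-style dataset and computes the 95% prediction interval at x0 = 6.2 both ways.

# 95% confidence prediction interval by the formula above and by predict().
set.seed(1)
age   <- runif(30, 0, 8)
price <- 20574 - 2338 * age + rnorm(30, sd = 800)     # hypothetical data
fit <- lm(price ~ age)
x0  <- 6.2
s      <- summary(fit)$sigma                          # residual standard error
s.yhat <- s * sqrt(1 + 1/30 + (x0 - mean(age))^2 / sum((age - mean(age))^2))
yhat   <- predict(fit, data.frame(age = x0))
c(lower = yhat - qt(0.975, 28) * s.yhat, upper = yhat + qt(0.975, 28) * s.yhat)
predict(fit, data.frame(age = x0), interval = "prediction")   # same bounds

Here the degrees of freedom are n − 2 = 28 for a sample of size 30.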

5.3 Multiple linear regression


In practice, most linear models will require multiple inputs to capture more nuanced behavior. For instance, in the car resale price example, besides the age of the car, variables affecting the selling price include mileage, the condition of the car, location, special features, shape, and type of the car. Thus, it makes more sense to model the response with a multiple linear model:
$$Y = \alpha + \beta_1 X_1 + ... + \beta_k X_k + \epsilon.$$
Besides inheriting common features of simple linear regression, with more than one predictor in the model, multiple linear regression possesses other discernible properties, as discussed below.

Example 5.5 The following table gives the percentages of concentration of a


person, Y , in terms of a set of testing questions, as a response variable with
possible association with three predictors including the dosage of a medicine
not exceeding the maximum dosage level (X1 ), patient age (X2 ), and patient
stress level (evaluated by a psychologist), X3 .

ID: 1 2 3 4 5 6 7 8 9 10 11 12
Y: 85 93 79 98 83 66 53 68 72 81 74 87
X1 : 1.7 1.9 1.6 2.0 1.8 1.1 0.9 1.0 1.3 1.5 1.3 1.9
X2 : 23 19 21 22 28 36 34 29 21 32 19 41
X3 : 5.1 4.3 6.2 3.2 3.7 7.2 8.1 7.6 5.3 5.2 6.2 4.1

Task-1) We are interested in knowing whether all the predictors significantly


affect the patient concentration level.
Task-2) We are interested in selecting the best model to detect the relationship
between the subject concentration level and significant predictors.

Call:
lm(formula = concentration ~ dosage + age + stress, data = mydata2)

Residuals:
Min 1Q Median 3Q Max
-5.2335 -2.2658 0.1126 2.2270 5.5940

Coefficients:
Estimate Std. Error t value Pr(>|t|)

(Intercept) 46.3646 25.8677 1.792 0.1108


dosage 27.7346 9.1358 3.036 0.0162 *
age -0.2037 0.1635 -1.246 0.2481
stress -0.7614 2.1805 -0.349 0.7360
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.881 on 8 degrees of freedom

Multiple R-squared: 0.9295, Adjusted R-squared: 0.9031

F-statistic: 35.18 on 3 and 8 DF, p-value: 5.894e-05

> confint(fit, level=0.95)
                  2.5 %      97.5 %
(Intercept) -13.2864481 106.0155482
dosage        6.6673741  48.8018131
age          -0.5806767   0.1733577
stress       -5.7897912   4.2668920

FIGURE 5.7
Regression of concentration on dosage, age, and stress level

Figure 5.7 contains outcomes of data analysis using multiple linear regres-
sion. Among the three predictors, the data only contains significant evidence
to claim that the percentage of concentration is linearly affected by the dosage
of the medicine. The corresponding p-value is 0.0162, measured by the uni-
formly most powerful test under the normality assumption of the error term.
For each 10% increase in dosage, the corresponding percentage of concentra-
tion increases 2.78%, on average, with a 95% confidence interval from 0.67%
to 4.88%.

FIGURE 5.8
Dosage-concentration level plot

The overall sample R² is at the level of 0.93, suggesting a strong linear pattern in the patient concentration level attributable to the dosage. Figure 5.8 shows the plot of the data between the dosage level and the concentration
shows the plot of the data between the dosage level and the concentration
level.
Although the model between dosage and concentration level is statistically significant, we do not know whether this regression model is the best model among all the candidate models involving all three predictors. Note that the total number of candidate linear regression models reads,
   
$$\binom{p}{0} + \binom{p}{1} + \cdots + \binom{p}{p-1} + \binom{p}{p} = 2^p = 2^3 = 8.$$

When the number of predictors is large, it is practically impossible to



examine all the candidate models one-by-one to select the best one. In the
sequel, we will discuss multiple correlation matrices among the response and
all the predictors, followed by a discussion on the AIC criterion for variable
selection in multiple linear regression.

5.3.1 Confounding effects


Following the discussion on the concentration-dosage example, stress is clin-
ically a significant factor related to the concentration level of the patient.
However, such a relationship is not supported by this set of data as shown in
Figure 5.7. This may be caused by factors such as limited sample size, rela-
tively large variation in this data set, or other unknown reasons. As it turns
out, carefully examining the overall correlation indicates that the regression
model involving all three predictors may not be the best model for the data.

FIGURE 5.9
Correlations on concentration, dosage, age, and stress

Since we are dealing with the response (concentration level), dosage, age,
and stress level, the correlation among these variables is a matrix. As shown

in Figure 5.9, the concentration level has a strong positive correlation with the dosage. Patients taking a higher dosage of the medicine tend to have a higher level of concentration, on average. The sample correlation coefficient is 0.96, with strong data evidence (extremely small p-value) to reject the null hypothesis H10 in favor of the alternative hypothesis H11 , where
$$H_{10}: \rho = 0 \quad \text{versus} \quad H_{11}: \rho \ne 0,$$
and ρ is the correlation coefficient between the patient concentration percentage and the corresponding dosage.
Although the above analysis coincides with the analysis result in Fig-
ure 5.7, further examining the correlation between the concentration level
and the stress level reveals the fact that dosage is not the only valuable pre-
dictor in our model; stress level is also strongly and negatively associated
with the concentration level. Patients with a higher stress level tend to have a lower concentration percentage:
$$\hat{\rho}_1 = -0.91, \qquad \hat{\rho}_2 = -0.94,$$
where ρ1 is the correlation coefficient between the patient's concentration level and the corresponding stress level, and ρ2 is the correlation coefficient between the dosage and the stress level. The result of the correlation analysis shows that both of the above-mentioned correlation coefficients are significantly different from zero. Since the dosage level of a patient is significantly associated with the stress level, it is naive to assume that the two covariates are independent.
On the other hand, examining the sample correlation of age with the con-
centration level, dosage, and stress level, we found that the sample correlation
coefficients are -0.34, -0.24, and 0.23, respectively. The corresponding p-values
are larger than 0.05, indicating that there is no data evidence to claim signif-
icant correlation between age and any one of the three variables.
When we fit a multiple linear model without the variable dosage, as shown in Figure 5.10, the stress level feature has a much smaller p-value (9.98e-05) than the p-value of dosage on the concentration level in the full model (0.0162), implying there is stronger statistical evidence to claim a linear relationship between the concentration level and the stress level.
Figure 5.11 shows the plot of the data between the stress level and the
concentration level. It is observable that as the stress level increases, the level
of concentration decreases. The linear pattern of the data is reflected by the
fitted line.
Although the software produces a regression line as shown in Figure 5.12,
the variation of the data invalidates the fitted line. It indirectly supports the
claim that age is not a significant factor attributable to patient concentration
levels.

5.3.2 Information loss and model selection


The prior results indicate that there are $2^p$ candidate models for a response and p predictors in a multiple linear regression model. Thus, one of the focuses in



[R regression output, garbled in the source: summary of the model concentration ~ age + stress; as discussed in the text, the stress coefficient is highly significant (p = 9.98e-05).]

FIGURE 5.10
Drug concentration level associated with age and stress

multiple linear regression turns to identifying “proper” variables for the training model. That is, we want to pick out the input variables that are going to produce a good model, without bad qualities like noise or over-fitting. This necessitates a discussion of the AIC (Akaike information criterion) and BIC (Bayesian information criterion) for model selection.
When we use data to construct an estimate fˆ(x) for the unknown underly-
ing model f (x) via parameter estimation, we need to consider the information
loss from using fˆ to replace the true model f (x). This is usually measured by
the relative entropy of fˆ to f in Kullback-Leibler divergence:

D_{KL}(f || f̂) = ∫ log( f(x) / f̂(x) ) dF(x),
where F (x) is the cumulative distribution function associated with f (x). It is
proven that the estimated model minimizing the Kullback-Leibler divergence
D(f ||fˆ) can be obtained via AIC

AIC = 2k − 2 log L̂
where k is the number of estimated parameters in the model, and L̂ is the
maximum value of the likelihood function of the model. Specifically, in multiple
linear regression with p predictors, AIC reads

FIGURE 5.11
Drug concentration level associated with stress alone

AIC = n log(RSS/n) + n log(2π) + n + 2(p + 1),   (5.10)
where RSS is the residual sum of square of the fitted line.
Another commonly used criterion is the BIC (Bayesian information crite-
rion), which reads,

BIC = log(n) k − 2 log L̂,


where n is the number of data points (sample size), k is the number of
estimated parameters, and L̂ the maximum value of the likelihood function.
Correspondingly, for multiple linear regression, the BIC becomes

BIC = n log(RSS/n) + n log(2π) + n + log(n)(p + 1).   (5.11)
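For readers who wish to check (5.10) numerically, the sketch below uses simulated (hypothetical) data. Note that R's extractAIC(), which stepAIC() relies on internally, reports the AIC on the scale n log(RSS/n) + 2(p + 1), dropping the additive constants n log(2π) + n of (5.10) that do not affect model comparison; the AIC values in Figure 5.13 are on this scale.

# Sketch: comparing (5.10) with extractAIC() on simulated data.
set.seed(1)
n      <- 12
dosage <- runif(n, 0.5, 2)
age    <- sample(20:70, n, replace = TRUE)
y      <- 40 + 30 * dosage - 0.2 * age + rnorm(n, sd = 3)
fit    <- lm(y ~ dosage + age)

rss <- sum(residuals(fit)^2)
p   <- 2                          # number of predictors
n * log(rss / n) + 2 * (p + 1)    # (5.10) without the constants n*log(2*pi) + n
extractAIC(fit)                   # returns c(edf, AIC) on the same scale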

With the AIC criterion in (5.10) and the BIC criterion in (5.11), one can select
the model that minimizes the estimated information loss as the best model for
the data. However, we can’t just test all possibilities of models (that is, every
combination of variables). Given p possible input variables, each variable can
either be included into the model or not. This means that with p possibilities

FIGURE 5.12
Drug concentration level associated with age alone

of branching into two, we actually have 2^p possible models. This value quickly
becomes intractable. For example, with 10 variables, we have 2^10 = 1024
possible models, and with 100 variables, we have 2^100 = 1.26 × 10^30 models.
Surely, testing each model for viability is not the right way forward. We shall
discuss three commonly used variable selection approaches below.

1. Forward Selection
One approach to solve the variable selection problem is known as forward
selection. This algorithm starts by examining the list of possible explana-
tory variables and computing a simple regression for each one. The estimated
AIC value (or residual sum of squares, or partial F-statistic) is used for the
judgment of including an explanatory variable in the final model. Then, we
continually add in the next best explanatory variable until no improvement is
detected and a final model is achieved.
Computer scientists may recognize this as a greedy algorithm. By taking
the maximum improvement at each step, we hope to quickly converge on the
“best” model.
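As a sketch of the forward algorithm in R (assuming the concentration data frame mydata used later in Figure 5.14), base R's step() grows the model from the intercept-only fit, adding at each step the variable that improves AIC the most:

# Sketch: forward selection with step(), starting from the null model.
null_model  <- lm(concentration ~ 1, data = mydata)
forward_fit <- step(null_model,
                    scope = ~ dosage + age + stress,  # candidate predictors
                    direction = "forward")
summary(forward_fit)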

2. Backward Selection
Just as we can add variables, we can also remove them. The corresponding backward
selection algorithm works by running a single massive regression on all possible

explanatory variables. With this full-sized regression of all variables, we then


examine each one individually to determine which has the smallest estimated
AIC value, or partial F-statistic (and hence the largest p-value), and remove
this variable. We then continue this process of removing variables until an
optimal estimated AIC value or a threshold p-value is reached, leaving us
with a final model.

3. Mixed Selection
A blending of the two preceding algorithms is commonly referred to as
mixed selection. Multiple formulations of this algorithm exist. We use the
following as a case to describe the process. For instance, we can use forward
selection to add a variable, then run backward selection to retest and “prune”
variables, repeating this pair of steps until a satisfactory solution is achieved.

> step<-stepAIC(fit, direction="both")


Start: AIC=35.68
concentration ~ dosage + age + stress
Df Sum of Sq RSS AIC
- stress 1 1.837 122.35 33.864
<none> 120.52 35.683
- age 1 23.376 143.89 35.810
- dosage 1 138.837 259.35 42.879

Step: AIC=33.86
concentration ~ dosage + age
Df Sum of Sq RSS AIC
<none> 122.35 33.864
- age 1 23.56 145.92 33.978
+ stress 1 1.84 120.52 35.683
- dosage 1 1390.52 1512.87 62.042

> step<-stepAIC(fit, direction="backward")


Start: AIC=35.68

concentration ~ dosage + age + stress


Df Sum of Sq RSS AIC
- stress 1 1.837 122.35 33.864
<none> 120.52 35.683
- age 1 23.376 143.89 35.810
- dosage 1 138.837 259.35 42.879
Step: AIC=33.86
concentration ~ dosage + age
Df Sum of Sq RSS AIC
<none> 122.35 33.864
- age 1 23.56 145.92 33.978
- dosage 1 1390.52 1512.87 62.042

FIGURE 5.13
AIC criterion and information loss in multiple regression

We shall use the example of the concentration level versus predictors


(dosage, age, and stress level) to illustrate the optimizing process for vari-
able selection in multiple linear regression.

Example 5.6 Referring to Example 5.5, use the AIC criterion to find the best
model (the model with the smallest information loss) for the data.
As shown in Figure 5.13, for the mixed selection approach, the starting
model (with all the covariates, dosage, age, and stress level) has AIC 35.68.

concentration = α + β1 dosage + β2 age + β3 stress. (5.12)


The AIC with stress deleted from model (5.12) will be 33.864, with
age deleted will be 35.81, and with dosage deleted will be 42.879. Since
33.864 < 35.68, in the second step, the variable whose removal yields the lowest
AIC (stress level) is deleted, giving the new model at this step

concentration = α + β1 dosage + β2 age. (5.13)


Taking out the variable “age” from the model (5.13) raises the AIC to 33.978,
taking out “dosage” raises it to 62.042, and adding “stress” back to model (5.13)
gives 35.683. Since none of these changes results in a reduction of AIC, the best
model is model (5.13).
The above is the mixed procedure for the selection of the model that
has the smallest loss of information. Notice that at the last step, the procedure
also compares the AIC for adding candidate variables back into the model when
searching for the best model.
Unlike the mixed optimization procedure, in the backward search
procedure, after the first step of comparing the AIC changes corresponding
to the predictors, the procedure evaluates only the AIC values from removing
the two variables remaining in model (5.13), without computing the value where
the variable “stress” is added back into the model.
Thus, after comparing the AIC values, the model (5.13) is the best linear
regression model for the data.
For reference, the R-codes for computations are given in Figure 5.14. It
should be noted that in the AIC-optimized model (5.13), the adjusted R^2 is
0.9126, which is larger than 0.9031, the adjusted R^2 for the model (5.12) that
uses all the three predictors.
The optimized AIC model (5.13) can be used to predict the percentage
concentration level for a patient aged 21 and taking dosage 1.2 ml. As shown
in Figure 5.14, the concentration level is predicted as 70.28%, with the 95%
confidence prediction interval from 66.19% to 74.36%, when accounting for
the randomness from data variation.

5.4 Categorical predictors


Our previous discussion has entirely covered continuous data. We assume that
variables like “age” or “weight” can be evaluated to a decimal value. Linear

> mydata <- read.csv("/Users/jchen/Desktop/concentration.csv")


> fit<-lm(concentration~dosage+age+stress, data=mydata)
> summary(fit)
> confint(fit, level=0.95)
> fit<-lm(concentration~age+stress, data=mydata)
> summary(fit)
> confint(fit, level=0.95)

> library(MASS)
> fit<-lm(concentration~dosage+age+stress, data=mydata)
> step<-stepAIC(fit, direction="both")
> step<-stepAIC(fit, direction="backward")

> fit<-lm(concentration~dosage+age, data=mydata)


> summary(fit)
Call:
lm(formula = concentration ~ dosage + age, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-5.4020 -1.9034 0.0062 2.2340 5.5034
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.7029 6.9736 5.407 0.000429 ***
dosage 30.7230 3.0378 10.114 3.26e-06 ***
age -0.2045 0.1553 -1.317 0.220529
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.687 on 9 degrees of freedom


Multiple R-squared: 0.9285, Adjusted R-squared: 0.9126
F-statistic: 58.4 on 2 and 9 DF, p-value: 7.006e-06

> predit <- read.csv("/Users/jchen/Desktop/test.csv")


> prediction=predict(fit, newdata=predit, interval='confidence')
> prediction
fit lwr upr
1 70.27688 66.19104 74.36273

FIGURE 5.14
R Codes for confidence prediction

regression then works by taking each variable along a continuum and assuming
that full interpolation is possible.
However, not all data works cleanly in this form. Some data is categor-
ical, meaning that it belongs to certain categories which may have no real
relation to each other. For example, the color of a car may be important in
determining car insurance rates, as may the nationality of the car's make in
that same model (physicists may complain that color is defined on a wavelength,
but this ordering of colors is arbitrary and serves no purpose in this regression).
Other data is just discrete, meaning that it will either never come in a con-
tinuous form (e.g., star ratings on individual Yelp reviews) or, more strongly,
may not even be comprehensible in a continuous form (e.g., rankings of the
best video games – a 1.5th place cannot exist). When dealing with well-ordered
discrete data, like star ratings, it may make sense to treat the inputs as con-
tinuous anyway, but only occurring along a fixed set of values.
However, when dealing with categorical data, we will use dummy variables
to separate variables. “Dummy variables” are indicator variables taking on
the boolean (true/false) value of 0 or 1 based on whether the condition is

present. For example, when classifying car insurance rates using colors, we
may create dummy variables for each separate color: 1red , 1blue , etc. This
notation (1condition ) is used to refer to the dummy variable that is 1 when
“condition” is true, and 0 otherwise. Of course, we expect only one dummy
variable to take on the value of 1 out of the created set. In modern machine
learning and computer science, dummy variables are also often referred to as
one-hot encoding because of this property.
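As a brief sketch of how R builds these indicator variables (with a hypothetical color factor), model.matrix() performs the encoding automatically; by default one level is dropped as the baseline so the columns are not collinear with the intercept:

# Sketch: dummy variables (one-hot encoding) via model.matrix().
color <- factor(c("red", "blue", "red", "green"))
model.matrix(~ color)      # baseline coding: one level absorbed by the intercept
model.matrix(~ color - 1)  # full one-hot encoding: one indicator column per level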

5.5 Outliers and leverage statistics


Any model is subject to the truism “garbage in, garbage out,” and linear re-
gressions are no exception. Because the goal is to minimize the sum of squared
residuals, linear regressions can be pulled strongly toward extreme
values, called outliers, as discussed in the first section of this chapter. Fig-
ure 5.3 shows an example of the impact of outliers that pull the fitted line
away from the main body of the data and invalidate the regression output
shown in Figure 5.6.
Outliers are not just points that buck the trend – there can be practical
reasons for distinct outliers. For example, in some datasets, a lack of data is
simply represented by the maximum or minimum value possible. Regressing on
this will easily cause a model that produces nonsensical results. Alternatively,
it may represent a literal equipment malfunction when performing a reading.
As such, removing these problematic points is critical to deriving a model that
is ultimately useful. Many data analysts and modelers spend the majority of
their time cleaning and preparing data as input to their models.
In the discussion on outliers, a primary concern is whether outliers are in-
fluential. Intuitively, this refers to whether outliers are significantly “dragging”
the regression away from its ordinarily correct value.
Of course, the concept of influential point is not a binary one: we care about
the degree to which outliers are influential more than any classification of
influential/non-influential points. The study of this metric falls under leverage
statistics, which is about measuring the distance between an observation and
the bulk of the data. For the i-th datum, the leverage score is defined as
h_{ii} = [ X(X^T X)^{-1} X^T ]_{ii},

where X is the design matrix. In other terms, the scores are the diagonal
elements of the projection matrix.
If x_i is far away from the average x̄, h_{ii} will be large, meaning that the
corresponding observation y_i has a large impact on the fitted model. Since the
MLE of the linear regression coefficients is

β̂ = (X^T X)^{-1} X^T y,

and

ŷ = X β̂ = X(X^T X)^{-1} X^T y,

for the i-th observation we have

dŷ_i / dy_i = [ X(X^T X)^{-1} X^T ]_{ii} = h_{ii},

the degree to which the i-th response influences the i-th predicted value via
the observation matrix X.

Proposition 1 In a multiple regression setting, the leverage statistic satisfies
0 ≤ h_{ii} ≤ 1.

Proof: Since the projection matrix is symmetric and idempotent,

(X(X^T X)^{-1} X^T)(X(X^T X)^{-1} X^T) = X(X^T X)^{-1} X^T.

We have

h_{ii} = Σ_{t=1}^n h_{it} h_{ti} = h_{ii}^2 + Σ_{t ≠ i} h_{ti}^2 ≥ h_{ii}^2.   (5.14)

Dividing both sides of (5.14) by h_{ii} gives

h_{ii} ≤ 1.

The first portion of equation (5.14) implies that

h_{ii} ≥ 0.

Proposition 2 In a multiple regression with n observations and p unknown
parameters, the sum of the leverage statistics equals the dimension of β in
the setting

y = Xβ.
Proof: Notice that

Σ_{i=1}^n h_{ii} = Σ_{i=1}^n [ X(X^T X)^{-1} X^T ]_{ii};

we have

Σ_{i=1}^n h_{ii} = Trace(X(X^T X)^{-1} X^T)
                 = Trace((X^T X)^{-1} X^T X)
                 = Trace(I_p)
                 = p.
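Both propositions are easy to check numerically; the sketch below (simulated, hypothetical data) forms the projection matrix directly:

# Sketch: verifying 0 <= h_ii <= 1 and sum(h_ii) = p on simulated data.
set.seed(2)
X <- cbind(1, rnorm(10), rnorm(10))      # design matrix with intercept; p = 3 columns
H <- X %*% solve(t(X) %*% X) %*% t(X)    # projection ("hat") matrix
h <- diag(H)                             # leverage scores
range(h)                                 # each leverage lies in [0, 1]
sum(h)                                   # equals p = 3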

We can take this concept a little further with Studentized residuals to test
whether the distance is too far away from the bulk of the data.
Given a residual d_i = y_i − ŷ_i, the variance of d_i can be expressed as:

Var(d_i) = e_i^T Var(Y − Ŷ) e_i
         = e_i^T Var([I − X(X^T X)^{-1} X^T] Y) e_i
         = e_i^T ([I − X(X^T X)^{-1} X^T] σ^2) e_i
         = (1 − h_{ii}) σ^2,

where e_i denotes the i-th standard unit vector.

Therefore, the Studentized residual is defined as

t_i = d_i / ( σ̂ √(1 − h_{ii}) ).
This is the residual adjusted for its observation-specific variation.
We shall use an example to explain the formula discussed above.

Example 5.7 For simplicity, assume that a training data set (toy example)
contains
Y 12 -20 14.3 15.3 15.4
X1 2.1 2.2 2.3 2.4 2.5
X2 1.2 1.6 1.7 1.9 1.8
Find and interpret the Studentized residual for the second observation in the
training set.

Solution: As shown in Figure 5.15, the leverage statistic for the second obser-
vation is
h22 = 0.187,
indicating that under the context of the observation matrix X, each unit
change of the second response contributes 18.7% to the change of the predicted
value of y2 .
The absolute value of the residual of the second response is 17.452. With
the residual standard error σ̂ = 17.12, we have

t_2 = 17.452 / ( 17.12 √(1 − 0.187) ) = 1.13.
Thus, at the 0.05 significance level, since t_{0.975,3} = 3.18, we do not have data
evidence to claim that the second observation is significantly beyond the bulk
of the data, even though the second response looks well beyond the rest of the
responses. This is partly due to the relatively small number of observations in
the regression.
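For reference, the same quantities are available through built-in functions, as in the sketch below. Note that hatvalues() uses the full design matrix of the fitted model, including the intercept column, so its leverages differ from those of the hat matrix formed from the two predictor columns alone in Figure 5.15.

# Sketch: leverage and Studentized residuals via built-in functions.
y  <- c(12, -20, 14.3, 15.3, 15.4)
x1 <- c(2.1, 2.2, 2.3, 2.4, 2.5)
x2 <- c(1.2, 1.6, 1.7, 1.9, 1.8)
fit <- lm(y ~ x1 + x2)
hatvalues(fit)   # leverage scores h_ii for the intercept model
rstandard(fit)   # Studentized residuals d_i / (sigma_hat * sqrt(1 - h_ii))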

SUMMARY The method of linear regression prediction is probably one


of the most common data science techniques. It is often misused, especially

> x<-matrix(c(2.1,1.2,2.2,1.6,2.3,1.7,2.4,1.9,2.5,1.8), 5, 2, byrow=TRUE)


> c=t(x)%*%x
> d=solve(c)
> h=x%*%d%*%t(x)
>h
[,1] [,2] [,3] [,4] [,5]
[1,] 0.88237876 0.1155433 0.05645468 -0.2385496 0.1741930
[2,] 0.11554333 0.1870702 0.20082531 0.2338377 0.2090784
[3,] 0.05645468 0.2008253 0.22122392 0.2829465 0.2206971
[4,] -0.23854956 0.2338377 0.28294653 0.4731482 0.2400714
[5,] 0.17419298 0.2090784 0.22069712 0.2400714 0.2361790
> y<-matrix(c(12,-20,14.3, 15.3, 15.4), 5, 1, byrow=TRUE)
> fit<-lm(y~x[,1]+x[,2])
> summary(fit)

Call:
lm(formula = y ~ x[, 1] + x[, 2])

Residuals:
1 2 3 4 5
5.717 -17.452 10.030 9.428 -7.723

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -183.85 172.93 -1.063 0.399
x[, 1] 120.35 113.04 1.065 0.399
x[, 2] -52.16 66.15 -0.789 0.513

Residual standard error: 17.12 on 2 degrees of freedom

Multiple R-squared: 0.3801, Adjusted R-squared: -0.2398


F-statistic: 0.6132 on 2 and 2 DF, p-value: 0.6199

FIGURE 5.15
Leverage statistic and outliers

among non-statisticians, as a coding exercise that turns inputs into outputs
without regard for the assumptions and conditions that legitimize the regression
method. This chapter addresses problems that are commonly
abused or overlooked in regression analysis.
Starting with precautions that using linear regression without proper jus-
tification may result in misleading conclusions, we discuss the connection be-
tween the known and the unknown in identifying the model governing the
data. Linear regression is the first step to bridging the known (predictors)
and the unknown (response variables) with interpretable functions. In fact,
it is the optimal solution when the conditional expectation is indeed a linear
combination of the predictors.
We also discuss technical details on training models and testing the trained
model with and without normality assumption, a commonly overlooked issue
in data science.
Analyzing confounding effects and AIC-BIC criteria for the selection of
predictors in a multiple linear model, we use examples to delineate the in-
terpretation, validation, and confidence prediction on linear model. A specific

section pertaining to cross-validation on a linear model can be found in Chap-


ter 2.
Probably one of the thorniest issues in linear regression is the occurrence
of outliers that may distort the bulk data pattern in regression analysis. At
the end of the chapter, we address the issue with theoretical justification and
hypothetical data to elucidate an outlier detection method.
6
Nonlinear Prediction

The preceding chapter focuses on methods of regression for a linear function


between the predictors and the response. In reality, there are at least two issues
restricting the linearity approach. One is the scenario where the number of the
predictors is larger than the number of observations, a common issue in big
data analytics. Another restriction on the linearity approach occurs in the case
where the underlying function is actually not linear. For instance, when the
dosage goes beyond the unknown maximal dosage in clinical trials, the dose-
response curve may appear to be an umbrella (or inverted-U) rather than a
linear relationship. Also, in determining annual premiums for car insurance,
drivers with different driving experiences pay different rates (as discussed in
the insurance example in Example 1.3). When more predictors are taken into
consideration, the underlying model f (x) may take any shape, depending on
the mechanism of the relationship between the response and the predictors.
This calls for new methodologies for nonlinear predictions.
In this chapter, we shall focus on methods of shrinkage (including ridge re-
gression and LASSO), high-dimension data reduction, polynomial regression,
and regression splines.

6.1 Restricted optimization and shrinkage


Recall that in regression analysis with n observations and p predictors, the
variation of the prediction

ŷ = X β̂

depends on the variation of the parameter estimate

β̂ = (X^T X)^{-1} X^T y.

One way to improve the accuracy of prediction is to reduce the prediction


variation, which subsequently puts a restriction on the value range of the
parameter β. A properly restricted optimization procedure may produce a
more stable prediction in data analytics. Thus, in addition to the least squares
optimization in multiple linear regression,


β̂ = argmin_β Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )^2,   (6.1)

we consider the following two restricted optimizations in the non-linear pre-


diction process.

6.1.1 Ridge regression


β̂^{ridge} = argmin_β Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )^2   subject to   Σ_{j=1}^p β_j^2 ≤ t,   (6.2)

where the value t is a properly selected constant. It is usually determined by


a cross-validation process.
The optimization in (6.2) is equivalent to

β̂^{ridge} = argmin_β { Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )^2 + λ Σ_{j=1}^p β_j^2 }

for some constant λ. Denote the residual sum of squares

RSS = Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_{ij} )^2;

we have

Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_{ij} )^2 + λ Σ_{j=1}^p β_j^2 = RSS + λ Σ_{j=1}^p β_j^2.

Denote the updated residual sum of squares,

RSS*(λ) = (y − Xβ)^T (y − Xβ) + λ β^T β.

Minimizing RSS*(λ) yields the parameter estimate of ridge regression,

β̂^{ridge} = (X^T X + λI)^{-1} X^T y.   (6.3)


Although this setting has advantages in reducing the prediction error by
restricting the range of the model parameters, it should be mentioned that
the parameter estimate obtained from (6.2) is not unbiased.
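A minimal sketch of (6.3) in R, on simulated (hypothetical) data, shows how the ridge solution shrinks the least squares estimate:

# Sketch: the ridge estimate (6.3) computed directly.
set.seed(3)
n <- 20; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, -1, 0, 0, 1) + rnorm(n)
lambda <- 2
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
beta_ols   <- solve(t(X) %*% X, t(X) %*% y)
cbind(beta_ols, beta_ridge)   # ridge coefficients are pulled toward zero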

Theorem 6.1 Model coefficients estimated from the optimization process in
(6.2) are biased when λ ≠ 0:

E(β̂^{ridge}) ≠ β.

Proof: Notice that by (6.3),

β̂^{ridge} = (X^T X + λI)^{-1} X^T y.

Denote by β̂^{ls} the parameter estimate obtained from the least squares
estimation (6.1); we have

E(β̂^{ls}) = β.

Let R = X^T X. The expectation of the parameter estimator for the ridge
regression reads

E[β̂^{ridge}] = E{ (X^T X + λI)^{-1} (X^T X) [(X^T X)^{-1} X^T y] }
            = [R(I + λR^{-1})]^{-1} R E[β̂^{ls}]
            = (I + λR^{-1})^{-1} β.

Thus, E(β̂^{ridge}) ≠ β if λ ≠ 0.

Example 6.1 When p = 1 and the intercept α = 0, the optimization problem in (6.2)
becomes

β̂^{ridge} = argmin_β Σ_{i=1}^n (y_i − x_i β_1)^2 + λ β_1^2.

Denote

g(β_1) = Σ_{i=1}^n (y_i − x_i β_1)^2 + λ β_1^2,

and solving the equation

dg(β_1)/dβ_1 = 0

gets

β̂_1^{ridge} = Σ_i x_i y_i / ( Σ_i x_i^2 + λ ).
The value of β̂_1^{ridge} is controlled by the value of λ. As λ increases, β̂_1^{ridge}
shrinks. When λ approaches ∞, β̂_1^{ridge} approaches zero. When λ takes the
value zero, β̂_1^{ridge} becomes the regular MLE of β_1,

β̂_1^{MLE} = Σ_i x_i y_i / Σ_i x_i^2.
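The shrinkage described in Example 6.1 can be traced numerically, as in the sketch below (simulated data):

# Sketch: beta_1_ridge = sum(x*y) / (sum(x^2) + lambda) over a grid of lambda.
set.seed(4)
x <- rnorm(30)
y <- 1.5 * x + rnorm(30)
beta_ridge <- function(lambda) sum(x * y) / (sum(x^2) + lambda)
sapply(c(0, 1, 10, 100, 1000), beta_ridge)  # lambda = 0 gives the MLE; estimates shrink toward 0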

6.1.2 LASSO regression


Similar to the restriction Σ_{j=1}^p β_j^2 ≤ t, another approach to eliminate in-
substantial and redundant regression coefficients is to restrict the sum of the
absolute values of the regression coefficients,

Σ_{j=1}^p |β_j| ≤ t.   (6.4)

This approach is named Least Absolute Shrinkage and Selection Operator


(LASSO) regression. When the restricted condition in the optimization process
(6.2) is replaced by (6.4), we have

β̂^{lasso} = argmin_β Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )^2   subject to   Σ_{j=1}^p |β_j| ≤ t.   (6.5)

The optimization (6.5) is mathematically equivalent to

β̂^{lasso} = argmin_β { Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )^2 + λ Σ_{j=1}^p |β_j| }.

Both optimization problems (6.5) and (6.2) are conditions to prevent over-
fitting the training data while minimizing the training MSE,

MSE = Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )^2.

It should be noted that parameter estimates minimizing the training MSE


do not always minimize the testing MSE. However, when both training data
and testing data are available, by controlling the scale of estimated parame-
ters like in the ridge regression or LASSO regression, the restriction on the
variation of the estimated parameters contributes toward a relatively more
stable test MSE.

Example 6.2 Consider a case where the number of observations n = 2, the


number of predictors p = 2, and the observation matrix X is diagonal with 1’s
on the diagonal and 0’s on the off-diagonal elements. Also assume that the
intercept term is zero.
What is the estimate of β for the usual multiple linear regression? For
ridge regression? And for LASSO?

Solution: The model of interest in this example reads


y = Xβ + ε,   i.e., y_j = β_j + ε_j for j = 1, 2.
For the usual multiple linear regression, the MLE is

β̂ = (X^T X)^{-1} X^T y = y.

For the ridge regression, as discussed in (6.3),

β̂ = (X^T X + λI)^{-1} X^T y = y / (1 + λ).
The estimate will shrink to zero as λ increases.
For the Lasso regression, since the restriction is

|β1 | + |β2 | ≤ t,

the corresponding target function reads

f(β_1, β_2) = Σ_{j=1}^2 ( y_j^2 − 2 y_j β_j + β_j^2 + λ |β_j| ).   (6.6)

Thus, by considering β_j > 0 and β_j < 0 separately in (6.6), gets

β̂_j^L = y_j − λ/2,   if y_j > λ/2;
β̂_j^L = y_j + λ/2,   if y_j < −λ/2;
β̂_j^L = 0,           if |y_j| ≤ λ/2.
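The closed-form solution above is the soft-thresholding operator; a minimal sketch in R (with hypothetical responses y) reads:

# Sketch: the LASSO soft-thresholding solution of Example 6.2.
soft_threshold <- function(y, lambda) {
  sign(y) * pmax(abs(y) - lambda / 2, 0)   # sets small coefficients exactly to zero
}
y <- c(3, -0.4)                 # observed responses when X is the identity
soft_threshold(y, lambda = 2)   # returns 2 and 0: the second coefficient is dropped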

6.2 Model Selection and Regularization


As shown in Example 6.2, ridge regression shrinks redundant parameters to
a small value by introducing the parameter λ. On the other hand, LASSO
directly shrinks them to zero when the corresponding magnitude condition
(|y_j| > λ/2 in Example 6.2) is not satisfied. In
fact, this method is similar to the forward, backward, or step-wise approach
in model selection where the mean squared error is used as the target func-
tion in optimization. These approaches can be unified as practices in model
regularization.
Notice that the three model selection approaches (forward, backward, and
step-wise) can be expressed with an indicator function and a value s, as follows.


Σ_{j=1}^p I(β_j ≠ 0) ≤ s,

where the predictor x_j is dropped by directly setting β_j = 0.



Thus, the method optimizing MSE in the step-wise model selection for
linear model regularization is

min_β Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_{ij} )^2   subject to   Σ_{j=1}^p I(β_j ≠ 0) ≤ s.   (6.7)

The method of ridge regression, in this setting, becomes

β̂^{ridge} = argmin_β Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )^2   subject to   Σ_{j=1}^p β_j^2 ≤ t,

and similarly, the method of LASSO in model regularization:

β̂^{lasso} = argmin_β Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )^2   subject to   Σ_{j=1}^p |β_j| ≤ t.

The above three optimizations focus on the MSE as the target function.
However, the model selection approach directly regularizes the model by re-
moving redundant regression coefficients and keeping the key predictors. The
method of ridge regression shrinks the associated values of the parameters to-
ward zero, while the LASSO method controls the range in which the predictors
contribute to the response variable.

6.3 High Dimensional Data


Notice that the existence of (X^T X)^{-1} in the least squares solution for β̂ in the linear
regression model depends on the condition n > p, i.e., the sample size is larger
than the number of parameters in the model. Related references can be found
in, for example, Xie and Chen (1988 [126]; 1990 [127]), among others.
Another challenge in data analysis occurs when the dimensions of data are
much larger than the sample size. For instance, in micro-array gene analysis,
we are usually tasked with making inference for thousands of genes using
the information of tens or hundreds of patients. In facial recognition, the
dimension associated with pixels on the image resolution is frequently larger
than the number of training images available. In general, the issue of high
dimensionality happens for any data set in which n << p, where n is the
number of observations and p is the number of predictors.

6.3.1 Curse of Dimensionality


Recall the least squares estimate of the linear model with n observations and
p predictors,

β̂ = (X^T X)^{-1} X^T y.

When p > n, the inverse matrix (X^T X)^{-1} does not exist, which invalidates
the least squares estimate of the regression parameters. The randomness of the
sample covariance matrix makes it impossible to resolve the problem with
generalized inverse techniques.
When n < p, the usual least squares estimate is not unique, but this can be
fixed by adding a constraint, such as in the ridge regression. However, when
n << p, even with adjusted methods such as ridge regression or LASSO,
adding predictors (noisy features) may deteriorate the model.
The following theorem shows that with an appropriately selected constant
λ, the least square estimate of the ridge regression always exists.

Theorem 6.2 When λ > 0, (X T X + λI)−1 exists.

Proof: Since X^T X is non-negative definite, there exists an orthogonal matrix
A so that

X^T X = A diag(λ_1, ..., λ_p) A^T,   with λ_i ≥ 0,

and consequently

X^T X + λI = A diag(λ_1 + λ, ..., λ_p + λ) A^T.

Thus, when λ > 0,

Det(X^T X + λI) ≠ 0.

6.3.2 Dimension Reduction by Transformation


Assume that we have a set of data (y_i, X_{i1}, X_{i2}, ..., X_{ip}), i = 1, ..., n, with the
dimension p ≫ n, the sample size. If there exists a linear transformation of
the p predictors,

Z_m = Σ_{j=1}^p φ_{jm} X_j,   for m = 1, ..., M,

with M < p, we may then use the new set of predictors Z_1, ..., Z_M to avoid
the issue of high dimensionality.

The original regression problem contains p + 1 regression parameters, whereas
the transformed data contains M + 1 parameters. If the dimension M of the
transformed data is smaller than the sample size n, the regression problem is
solved, since

Σ_{m=1}^M θ_m z_{im} = Σ_{m=1}^M θ_m Σ_{j=1}^p φ_{jm} x_{ij}
                    = Σ_{j=1}^p ( Σ_{m=1}^M θ_m φ_{jm} ) x_{ij}
                    = Σ_{j=1}^p β_j x_{ij}.

Example 6.3 Consider the situation in which we have four observations for
a regression of five predictors (n < p):

(y_i, X_{i1}, X_{i2}, ..., X_{i5}),   i = 1, ..., 4.


If we consider the linear transformation,

Z1 = X1 + X2 ; Z2 = X3 + X4 ; Z3 = X5 ,
the coefficient relationship between the response and the original predictors

Y = β0 + β1 X1 + .... + β5 X5
becomes

Y = θ0 + θ1 Z1 + θ2 Z2 + θ3 Z3 ,
for the transformed data. In this way, when we estimate θ_i, i = 1, 2, 3, in a
setting where the number of parameters no longer exceeds the sample size, we
can indirectly estimate β_i, i = 1, 2, ..., 5, using the corresponding linear
transformations.
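A small sketch of this indirect estimation in R, with a hypothetical 4 × 5 input matrix, reads:

# Sketch: regression on the transformed features Z1, Z2, Z3 of Example 6.3.
set.seed(5)
X <- matrix(rnorm(20), nrow = 4, ncol = 5)  # n = 4 observations, p = 5 predictors
Z <- cbind(X[, 1] + X[, 2],                 # Z1 = X1 + X2
           X[, 3] + X[, 4],                 # Z2 = X3 + X4
           X[, 5])                          # Z3 = X5
y <- rnorm(4)
coef(lm(y ~ Z))  # 4 coefficients from 4 observations (here exactly determined)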

As shown in the above example, the key is to find the optimal transforma-
tion of Zm , m = 1, ...., M that reduces the high dimensionality issue in linear
regression. To this end, we discuss two approaches, the principal component
transformation and the partial least squares regression.

Method-1: Principal component regression This approach identifies


M new features via the M principal components of the sample covariance
matrix, then fits a least squares estimate of the response on the M features
for prediction.
The principal components: V Λ V^T gives a spectral decomposition of X^T X,
where

Λ_{p×p} = diag[λ_1, ..., λ_p] = diag[δ_1^2, ..., δ_p^2] = Δ^2,



with

λ_1 ≥ ... ≥ λ_p ≥ 0

denoting the non-negative eigenvalues (also known as principal values) of the
non-negative matrix X^T X, and the columns v_j of V denoting the corresponding
orthonormal eigenvectors.
Under this setting, Xv_j denotes the j-th principal component and v_j the j-th
principal component direction (or PCA loading), corresponding to the j-th largest
principal value λ_j, for each j ∈ {1, ..., p}.
The following is a simple example to review the concept of principal com-
ponent.

Example 6.4 Let

X = [ 2   1  −3 ]
    [ 1  −2   6 ];

find the principal components of X.

Solution: We start with the eigenvalues and eigenvectors of X^T X as follows.

X^T X = [ 5    0    0  ]
        [ 0    5  −15  ]
        [ 0  −15   45  ]

The eigenvalues of X^T X are 50, 5, and 0, with the corresponding orthonormal
eigenvectors

v_1 = (0, −1/√10, 3/√10)^T,   v_2 = (1, 0, 0)^T,   v_3 = (0, 3/√10, 1/√10)^T,

so that X^T X = V Λ V^T with V = (v_1, v_2, v_3) and Λ = diag(50, 5, 0). The
principal components are

Z_1 = X v_1 = (−√10, 2√10)^T,
Z_2 = X v_2 = (2, 1)^T,
Z_3 = X v_3 = (0, 0)^T.
Interpretation: The first principal component contains the largest variation
(information) of the original data after rotation by the orthogonal matrix.
Dimension reduction keeps the first k principal components (carrying the k
largest shares of the sample information) in the transformed data.
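The computation in Example 6.4 can be reproduced with eigen(); note that the signs of the eigenvectors returned by R are arbitrary, so the principal components agree with the hand computation up to sign:

# Sketch: principal values and components of the matrix X in Example 6.4.
X <- matrix(c(2, 1, -3,
              1, -2, 6), nrow = 2, byrow = TRUE)
e <- eigen(t(X) %*% X)
e$values           # principal values: 50, 5, 0
X %*% e$vectors    # principal components Z1, Z2, Z3 (columns), up to sign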

Method-2: Partial Least Squares Approach The previous approach
uses principal components to construct linearly transformed data, which par-
tially eases the burden in data analytics with the n < p difficulty. Alternatively,
we may use the following approach to construct a transformed data set whose
dimensionality does not exceed the sample size.
The partial least squares approach identifies M new features as follows.
After standardizing the p predictors, we find the first feature Z_1 by setting each
coefficient φ_{j1} to the slope of a simple linear regression of Y onto X_j. We then
find the second feature Z_2 by regressing each variable on Z_1, taking the
residuals, and constructing Z_2 from the orthogonalized data in the same way
Z_1 was constructed from the original data. We repeat the construction in the
same way to find all Z_m, m = 1, ..., M.
In dimension reduction, the principal component regression method (PCR)
focuses on the information/variation carried by the input data. However, the
partial least squares (PLS) regression focuses more on the correlation between
the response and the input. Notice that
β̂ = r_{xy} (s_y / s_x),

which follows from (X^T X)^{-1} X^T Y when y = β_0 + βx in a simple linear regression setting.


Therefore, when comparing the goals of PCR versus PLS, it becomes obvi-
ous that PCR retains the maximal possible amount of information contained
in the data, while PLS searches for the maximal percentage of response vari-
ation explained by the transformed data.
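A minimal sketch of the first PLS feature, following the construction described above on simulated (hypothetical) data:

# Sketch: building the first partial least squares feature Z1.
set.seed(6)
n <- 50; p <- 4
X <- scale(matrix(rnorm(n * p), n, p))   # standardize the predictors
y <- drop(X %*% c(1, 0.5, 0, 0)) + rnorm(n)
phi1 <- apply(X, 2, function(xj) coef(lm(y ~ xj))[2])  # simple-regression slopes
Z1 <- X %*% phi1                         # first PLS feature
cor(Z1, y)                               # Z1 is built to track the response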

6.4 Polynomial spline regression


The previous sections discuss the method of regression with coefficient regular-
ization or linear transformation to target the difficulties of large dimensional
data. However, when the underlying relationship between the response and
the predictors is actually not linear, linear-based regression is inappropriate
regardless of regularization techniques. As an extension to linear regression
techniques, this section addresses the method of polynomial regression and
regression splines.
Consider a degree-d polynomial regression as follows,

y_i = β_0 + β_1 x_i + β_2 x_i^2 + β_3 x_i^3 + ... + β_d x_i^d + ε_i,


for a degree-d polynomial in a single variable x. Besides the polynomial effect,
consider the piece-wise constant function such as the example in the insurance
premium data in Chapter 1.
The piece-wise constant regression can be formulated as follows.

y_i = β_0 + β_1 C_1(x_i) + β_2 C_2(x_i) + ... + β_K C_K(x_i) + ε_i,


where

C_0(X) = I(X < c_1),
C_1(X) = I(c_1 < X < c_2),
C_2(X) = I(c_2 < X < c_3),
...
C_{K−1}(X) = I(c_{K−1} < X < c_K),
C_K(X) = I(c_K ≤ X),

and

C_0(X) + C_1(X) + ... + C_K(X) = 1.

Combining the piece-wise constant functions and polynomial functions gives the
basis-function form,

y_i = β_0 + β_1 b_1(x_i) + β_2 b_2(x_i) + ... + β_K b_K(x_i) + ε_i.


For example, a piece-wise polynomial function may read

y_i = β_{01} + β_{11} x_i + β_{21} x_i^2 + β_{31} x_i^3 + ε_i   if x_i ≤ c;
y_i = β_{02} + β_{12} x_i + β_{22} x_i^2 + β_{32} x_i^3 + ε_i   if x_i > c.

Similar to the above example, in general, the points where the coefficients
change are called knots. The concepts of knot and spline are defined as follows.
One technique for fitting a smooth curve to a non-linear relationship be-
One technique in fitting a smooth curve to a non-linear relationship be-
tween the predictor X and the response Y is the polynomial spline.
Consider a set of data (X_i, Y_i), i = 1, ..., n, in the range X_i ∈ [a, b]. For a
set of points ξ_j ∈ [a, b], j = 1, ..., m, the input range [a, b] can be partitioned
into

[a, b] = ∪_{j=1}^{m+1} [ξ_{j−1}, ξ_j],

where ξ_0 = a, ξ_{m+1} = b, and

a < ξ_1 < ... < ξ_m < b.

In this setting, ξ_j, j = 1, ..., m, are called the knots in the interval [a, b].

A regression spline is a piece-wise degree-d polynomial, with continuity in the
derivatives up to degree d − 1 at each knot. Delving further, a cubic spline
with respect to knots ξ_j, j = 1, ..., m, for the function g(x), is a piece-wise
degree-3 polynomial, with continuity in the first and the second derivatives at
each knot:

g_i(x) = β_{1i} + β_{2i} x + β_{3i} x^2 + β_{4i} x^3,   x ∈ [ξ_i, ξ_{i+1}],   i = 0, 1, 2, ..., m,

with the smoothness conditions at each knot ξ_i,

g_{i−1}(ξ_i) = g_i(ξ_i),   g'_{i−1}(ξ_i) = g'_i(ξ_i),   g''_{i−1}(ξ_i) = g''_i(ξ_i).

Each piecewise function g_i(x), i = 0, 1, ..., m, contains 4 unknown
parameters, β_{1i}, β_{2i}, β_{3i}, β_{4i}. Without any modification, a cubic spline with
m knots therefore has 4m + 4 parameters to be determined from the training data.
Because each knot imposes 3 smoothness conditions, these constraints
reduce the total degrees of freedom to

4m + 4 − 3m = m + 4.

A natural cubic spline is a cubic spline that fits a linear function in each of the
boundary intervals [a, ξ_1] and [ξ_m, b]. As such, the total number of unknown
parameters reduces to m + 4 − 4 = m.

Example 6.5 Consider the case where m = 2, namely, we have the partition
of the x-range in the following way,

(−∞, ξ1 ], [ξ1 , ξ2 ], [ξ2 , ∞),

and the corresponding models

g_1(x) = β_{11} + β_{21} x + β_{31} x^2 + β_{41} x^3,   x ∈ (−∞, ξ_1],
g_2(x) = β_{12} + β_{22} x + β_{32} x^2 + β_{42} x^3,   x ∈ (ξ_1, ξ_2],
g_3(x) = β_{13} + β_{23} x + β_{33} x^2 + β_{43} x^3,   x ∈ (ξ_2, ∞).

Among the coefficients β_{ij}, i = 1, 2, 3, 4 and j = 1, 2, 3, the following condi-
tions also need to be satisfied at the two knots ξ_1 and ξ_2:

β_{11} + β_{21} ξ_1 + β_{31} ξ_1^2 + β_{41} ξ_1^3 = β_{12} + β_{22} ξ_1 + β_{32} ξ_1^2 + β_{42} ξ_1^3,
β_{21} + 2β_{31} ξ_1 + 3β_{41} ξ_1^2 = β_{22} + 2β_{32} ξ_1 + 3β_{42} ξ_1^2,
2β_{31} + 6β_{41} ξ_1 = 2β_{32} + 6β_{42} ξ_1,
β_{12} + β_{22} ξ_2 + β_{32} ξ_2^2 + β_{42} ξ_2^3 = β_{13} + β_{23} ξ_2 + β_{33} ξ_2^2 + β_{43} ξ_2^3,
β_{22} + 2β_{32} ξ_2 + 3β_{42} ξ_2^2 = β_{23} + 2β_{33} ξ_2 + 3β_{43} ξ_2^2,
2β_{32} + 6β_{42} ξ_2 = 2β_{33} + 6β_{43} ξ_2.

Thus, the number of free variables among β_{ij}, i = 1, 2, 3, 4, j = 1, 2, 3, becomes
4 × 3 − 2 × 3 = 6, which is m + 4 when m = 2.

In a regression spline, if the set of knots {ξ_i, i = 1, ..., m} is taken to be the
observed data {x_i, i = 1, ..., n}, with m = n, the cubic spline becomes a smooth
curve that passes through every point of the observations {(x_i, y_i), i = 1, ..., n}.
A cubic spline with the conditions

g^{(2)}(a) = g^{(3)}(a) = 0,   g^{(2)}(b) = g^{(3)}(b) = 0

is called a natural cubic spline. A natural cubic spline fits a straight line
beyond the range of the knots [ξ_1, ξ_m] (at the beginning and the end of the
spline):

β_{31} = β_{41} = 0   and   β_{3,m+1} = β_{4,m+1} = 0.

This drops the number of free coefficients to m + 4 − 4 = m.

Example 6.6 When m = 2, a natural cubic spline fits

g_1(x) = β_{11} + β_{21} x,   x ∈ (−∞, ξ_1],
g_2(x) = β_{12} + β_{22} x + β_{32} x^2 + β_{42} x^3,   x ∈ [ξ_1, ξ_2],
g_3(x) = β_{13} + β_{23} x,   x ∈ [ξ_2, ∞).

In conjunction with the smoothness conditions, the total number of free coef-
ficients drops to 4 × 3 − 2 × 3 − 4 = 2, which is the total number of knots in
this example.
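In practice, spline bases of this kind are available in R's splines package; a brief sketch with simulated data and arbitrarily chosen knots reads:

# Sketch: cubic spline and natural cubic spline fits via the splines package.
library(splines)
set.seed(7)
x <- sort(runif(100, 0, 10))
y <- sin(x) + rnorm(100, sd = 0.3)
fit_bs <- lm(y ~ bs(x, knots = c(3, 7)))  # cubic spline with knots at 3 and 7
fit_ns <- lm(y ~ ns(x, knots = c(3, 7)))  # natural cubic spline: linear in the tails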

Given a set of data (x_1, y_1), ..., (x_n, y_n), when accounting for the closeness
and smoothness of the regression spline f(x), the penalized residual sum of
squares reads

RSS(f, λ) = Σ_{i=1}^n ( y_i − f(x_i) )^2 + λ ∫ ( f''(t) )^2 dt.

The first term in RSS measures the closeness of the cubic spline f(x) to the
observed data. The second term penalizes curvature in the function, and λ,
a fixed smoothing parameter usually determined by cross-validation, establishes
the trade-off between the two.
When λ = 0, the curvature penalty is disregarded, and the prediction f(x)
can be found by interpolating the data to make RSS(f, λ) = 0. When λ = ∞,
no curvature is tolerated, so f''(x) = 0 and the fit is the linear function
f(x) = a + bx obtained from the least squares linear fit.
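A sketch of this trade-off with base R's smooth.spline() (simulated data; the spar argument is a convenient reparameterization of the smoothing level):

# Sketch: the penalized fit RSS(f, lambda) at different smoothing levels.
set.seed(8)
x <- sort(runif(100, 0, 10))
y <- sin(x) + rnorm(100, sd = 0.3)
fit_rough  <- smooth.spline(x, y, spar = 0.3)  # light penalty: wiggly, near-interpolating
fit_smooth <- smooth.spline(x, y, spar = 1.0)  # heavy penalty: close to the linear fit
fit_cv     <- smooth.spline(x, y)              # smoothing level chosen by cross-validation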
The following theorem shows that the smoothest function interpolating a
set of data is the natural cubic spline.

Theorem 6.3 Given a set of data (x_i, y_i), i = 1, ..., n, with x_i ∈ (a, b), among all
functions that interpolate all the data points, the natural cubic spline is the
smoothest curve connecting the points, when smoothness is measured by

μ(f) = ∫_a^b ( f''(x) )^2 dx.

Proof: Let g(x) be a natural cubic spline in (a, b) with knots x_i, i = 1, ..., n.
Assume that G is the set of permissible functions that interpolate all the given
data points. For any f(x) ∈ G, we want to show that the smoothness measures
satisfy

μ(f) ≥ μ(g).

Consider

t(x) = g(x) − f(x);

we have

f(x) = g(x) − t(x).

Taking the second derivative on both sides of the equation gets

f''(x) = g''(x) − t''(x),
and the smoothness measure of any interpolating function f(x) becomes

μ(f) = ∫_a^b ( f''(x) )^2 dx
     = ∫_a^b ( g''(x) − t''(x) )^2 dx
     = ∫_a^b ( g''(x) )^2 dx + ∫_a^b ( t''(x) )^2 dx − 2 ∫_a^b g''(x) t''(x) dx
     = μ(g) + μ(t) − 2 ∫_a^b g''(x) t''(x) dx.

Now,

∫_a^b g''(x) t''(x) dx = ∫_a^b g''(x) d(t'(x))
                      = t'(x) g''(x) |_a^b − ∫_a^b t'(x) dg''(x)
                      = t'(b) g''(b) − t'(a) g''(a) − ∫_a^b t'(x) g'''(x) dx.

Since g(x) is a natural cubic spline, we have

g''(a) = g''(b) = 0.

Denote x_0 = a and x_{n+1} = b, and notice that g'''(x) = c_i (a constant) on
each interval [x_i, x_{i+1}] for a cubic spline; we have

∫_a^b t'(x) g'''(x) dx = Σ_{i=0}^n ∫_{x_i}^{x_{i+1}} g'''(x) t'(x) dx
                      = Σ_{i=0}^n c_i ∫_{x_i}^{x_{i+1}} t'(x) dx
                      = Σ_{i=0}^n c_i ( t(x_{i+1}) − t(x_i) )
                      = Σ_{i=0}^n c_i ( −f(x_{i+1}) + g(x_{i+1}) − g(x_i) + f(x_i) )
                      = 0,

since both f(x) and g(x) interpolate (x_i, y_i) for i = 1, ..., n; we have

g(x_i) = y_i = f(x_i)   and   g(x_{i+1}) = y_{i+1} = f(x_{i+1}).

Therefore

μ(f) = μ(g) + μ(t),

which implies that

μ(g) ≤ μ(f).

SUMMARY Under the circumstances where the linear relationship between


the response and the predictors can not be plausibly assumed, one auxiliary
approach is non-linear regression. This is particularly the case when we are
confronted with high dimensional data, and the sample covariance matrix is
stochastically not positively definitive, as highlighted in Xie and Chen (1988
[126]). Given these conditions, what we discussed in the previous chapter for
linear regression is not applicable.
We discussed the method of ridge regression (targeting an additional con-
dition to restrict model coefficients with a sum of squared measurements) and
LASSO regression (targeting the least absolute shrinkage and selection op-
erator) with focus on the model conditions and interpretations of regression
outputs. Continuing on from the linear model relationship in ridge regression
and LASSO, we also describe non-linear regression methods such as poly-
nomial spline regression, as well as principal component transformation for
dimensional reduction. Further discussion can be found in papers such as [85]
and [88].
7
Minimum Risk Classification

This chapter discusses methods of classification that use observable infor-


mation to predict or estimate an unknown category behind the data. Such
classification techniques have various applications, for instance, the diagno-
sis of a disease, the prediction of up or down trends in the stock market, or
the occurrence of a criminal activity in a city, just to name a few. Similar
to hypothesis testing problems, for each decision made in classification, there
is a chance of correctly classifying the observation into (or not into) a given
category, as well as the chance of incorrectly classifying the observation into
(or not into) another category.

diagnosis\true leukemia healthy


leukemia p1 1 − p2
healthy 1 − p1 p2

where p1 is the probability of correctly classifying a leukemia patient as pos-


itive for leukemia, and p2 is the probability of correctly diagnosing a healthy
patient as being healthy.
Besides the chances of correct and incorrect classification, for each classifi-
cation criterion g in the set of all possible classification criteria G, the selection
of the classification criterion should also consider the loss associated with the
classification decision.

diagnosis\true leukemia healthy


leukemia   l_{11}   l_{12}      (7.1)
healthy    l_{21}   l_{22}
where l_{ij} is the loss (cost) of classifying a leukemia (healthy) patient as positive
for leukemia (healthy). For instance, l_{12} is the loss of incorrectly diagnosing a
healthy patient as being positive for leukemia.
Certainly, there are many ways of optimization depending on the condi-
tion associated with the practical problem. In this chapter, we focus on the
minimum risk classification. Namely, we establish the optimal classification
criterion as the one that minimizes the overall risk,

Ĉ = arg min_{g ∈ G} R(g),

where
R(g) = l_{11} p_1 + l_{21} (1 − p_1) + l_{12} (1 − p_2) + l_{22} p_2.


We shall first define a loss function (hence the risk function) associated
with the general classification problem. With the proper definition, we may
discuss the optimal solution for minimizing the classification risk, followed by
the underlying assumptions upon which the classification algorithms are built.
Bayesian classification will be the first topic in the list. After addressing the
method of Bayesian classification, we shall also discuss the method of logistic
regression, which uses odds ratios to predict the likelihood of the occurrence
for a dichotomous outcome.
The classification problem in this chapter can be broadly viewed as an
optimization problem on estimation or prediction for the true but unknown
category of a given observation. In this regard, the inference approach es-
sentially finds the minimum risk prediction (MRP) based on the classification
criterion. We will also discuss scenarios where the loss function is changed from
the conventional 0-1 loss to any loss function according to different practical
situations.

7.1 Zero-one Loss Classification


The two most common applied discriminant functions are the linear discrim-
inant analysis and quadratic discriminant analysis. However, they are valid
only under the assumptions that the loss function is 0-1 loss, and the data
follow a multivariate normal model. When covariance matrices of the mul-
tivariate models are identical among all the categories, the minimum risk
prediction (MRP) criterion results in the linear discriminant function. When
the covariance matrices are different across different likelihood functions, the
MRP criterion leads to the quadratic discriminant function for multivariate
normal data.
First, the risk function corresponding to a 0-1 loss in a classification prob-
lem can be formulated as follows.
Let G be the set of possible classes in a classification problem with size
|G| = K. Consider the 0-1 loss function penalizing prediction errors:

L(g, ĝ) = 1, if ĝ ≠ g;   0, otherwise.

The corresponding risk function for a set of observations x reads

Risk(Ĝ(X)) = Σ_{k=1}^K L[G_k, Ĝ(X)] Pr(G_k | X).

Under this setting, the minimum risk estimator of the unknown category

becomes,

Ĝ(x) = arg min_{g ∈ G} Σ_{k=1}^K L(G_k, g) Pr(G_k | X = x).   (7.2)

With the 0−1 loss function, the above equation can be simplified to

Ĝ(x) = arg min_{g ∈ G} [1 − Pr(g | X = x)],

since

Σ_{k=1}^K Pr(G_k | X = x) = 1.

Equivalently,

Ĝ(x) = G_k if Pr(G_k | X = x) = max_{g ∈ G} Pr(g | X = x).   (7.3)

This implies that the minimum risk classifier is the one that is associated with
the largest posterior probability.

Interpretation: Given a set of data X, heuristically it is natural to classify


the observation into the category that has the highest chance to be the true
category associated with the data.

7.1.1 Bayesian Discriminant Functions


To illustrate the search for the optimal solution in classification, we consider
the case where we only have two classes Y = 0 and Y = 1. On the basis of the
minimum risk criterion discussed above, we classify the observation into the
category Y = 1 if P (Y = 1|X) > P (Y = 0|X) for each X; and classify the
observation into the class Y = 0 if P (Y = 1|X) < P (Y = 0|X) for each X.
Since we typically access the likelihood function through a graphical rep-
resentation (histogram) of the data, it is relevant at this point to discuss a
relationship between likelihood functions and posterior probabilities in classi-
fication. As stated above, the classification problem is grounded on the com-
parison of the posterior probabilities.
Consider the distribution of a random variable X with a likelihood function
f (x|θ) with an unknown parameter θ. If we have some prior information on θ,
say, the distribution of θ is g(θ), then after observing x, the “adjusted belief”
on the distribution of θ becomes
f(θ | x) = f(x | θ) g(θ) / f(x),

where g(θ) is the prior distribution and f (θ|x) is the posterior distribution.
When the parameter θ takes two values Y = 0 and Y = 1, the following the-
orem converts the comparison on posterior probabilities to the corresponding
likelihood functions.

Theorem 7.1 For the likelihood functions f (x|Y = 0) and f (x|Y = 1), we
have

log( P(Y = 1 | X) / P(Y = 0 | X) ) = log( f(x | Y = 1) / f(x | Y = 0) ) + log( P(Y = 1) / P(Y = 0) ).

Proof: To understand the above identity, consider any real set A ⊂ R^k, where
k is the dimension of the features X. We have

P(Y = 1 | X ∈ A) = P(X ∈ A ∩ {Y = 1}) / P(X ∈ A)
                 = P(X ∈ A | Y = 1) P(Y = 1) / P(X ∈ A).

Now, let A = (x, x + δ]^k. Then

P(X ∈ A | Y = 1) / P(X ∈ A | Y = 0) = P(x < X ≤ x + δ | Y = 1) / P(x < X ≤ x + δ | Y = 0)
                                    = (F_1(x + δ) − F_1(x)) / (F_0(x + δ) − F_0(x))
                                    = ((F_1(x + δ) − F_1(x))/δ) / ((F_0(x + δ) − F_0(x))/δ).

Letting δ → 0 gets

log( P(Y = 1 | X = x) / P(Y = 0 | X = x) ) = log( f(x | Y = 1) / f(x | Y = 0) ) + log( P(Y = 1) / P(Y = 0) ).
This proves Theorem 7.1, which converts the comparison of posterior
probabilities to the ratio of likelihood functions and prior probabilities as-
sociated with the two classes. For illustrative purposes, we now consider a
simple example for the case when the observation only has one dimension,
k = 1, given the above setting.

Example 7.1 Assume that f(x|Y = 1) and f(x|Y = 0) are two normal
densities with means μ_1, μ_0 and common variance σ^2. Under the condition of
equal priors, we have

log( f(x | Y = 1) / f(x | Y = 0) ) = log( exp{ −(1/(2σ^2)) [ (x − μ_1)^2 − (x − μ_0)^2 ] } )
                                  = (1/σ^2) x (μ_1 − μ_0) − (1/(2σ^2)) (μ_1^2 − μ_0^2).

Thus

log( P(Y = 1 | X = x) / P(Y = 0 | X = x) ) = ax + b

for some constants a and b.

Thus, we have a classification criterion (the linear discriminant classification):
when δ(x) = ax + b > 0, classify Y = 1; when ax + b < 0, classify Y = 0.

When we have two or more predictors with a common covariance matrix
across all the categories, in which X ∈ R^p is a vector, the MRE classification
criterion becomes

g(x) = log( P(Y = 1 | X = x) / P(Y = 0 | X = x) ),

where the likelihood function for each category reads

f(x) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp{ −(1/2) (x − μ)^T Σ^{-1} (x − μ) }.

Principle of MRE classification with 0-1 loss: Given patient information,


if the posterior probability of class t is the largest among all the posterior
probabilities, we classify the patient to class t. Namely,

t̂ = arg max_{k=1,...,m} P(G = k | X = x).

Notice that

Pr(G = k | X = x) = f_k(x) π_k / Σ_{j=1}^K f_j(x) π_j.

When there are k possible outcomes, the minimum risk classifier assigns
an observation to the class for which δ_k is the largest, where

δ_k(x) = x^T Σ^{-1} μ_k − (1/2) μ_k^T Σ^{-1} μ_k + log π_k.
We describe two classification criteria δ and δk above. The difference be-
tween δk and δ can be summarized as follows.

δ: classification rule for two equal-prior classes with one input variable
following a normal likelihood with equal variances.

δ_k: classification rule for k general-prior classes with p input variables
following normal likelihoods with equal covariance matrices.

In the discussion above, we assume that the standard deviations are identi-
cal for different categories. Such an assumption is not always plausible. When
the two standard deviations are not the same, the linear discriminant func-
tion becomes invalid, and the corresponding minimum risk classifier is actually
the quadratic discriminant function for normal data. In this setting, the com-
parison of the posterior probabilities motivates us to find the classification
criterion δ(x):

δ(x) = log( P(Y = 1 | X = x) / P(Y = 0 | X = x) ).
If δ(x) is positive, we classify the data to Y = 1; otherwise, the classifica-
tion result is Y = 0.
We shall use an example to explain the above setting. Consider the clas-
sification of two population means when the data follow two normal models
with two different standard deviations.
Notice that in this case, for i = 1, 2, we have the models
f(x | μ_i, σ_i) = (1 / ((2π)^{1/2} (σ_i^2)^{1/2})) exp( −(1/(2σ_i^2)) (x − μ_i)^2 ),

which leads to the quadratic discriminant function in x as follows:

δ(x) = (1/2)( 1/σ_0^2 − 1/σ_1^2 ) x^2 + ( μ_1/σ_1^2 − μ_0/σ_0^2 ) x + c,

where the constant

c = (1/2)( μ_0^2/σ_0^2 − μ_1^2/σ_1^2 ) + log π_1 − log π_0.

When we have more than one predictor (where X is a vector) with unequal
covariance matrices from a multivariate normal model,

δ_k = log( P(Y = 1 | X = x) / P(Y = 0 | X = x) )

takes the following form:

δ_k(x) = −(1/2) (x − μ_k)^T Σ_k^{-1} (x − μ_k) − (1/2) log |Σ_k| + log π_k
       = −(1/2) x^T Σ_k^{-1} x + x^T Σ_k^{-1} μ_k − (1/2) μ_k^T Σ_k^{-1} μ_k − (1/2) log |Σ_k| + log π_k.
To further clarify the above discussion on the linear discriminant func-
tion and quadratic discriminant function regarding p input variables for an
output of k possible outcomes, we consider the following example of the pre-
diction/classification of diabetes patients.
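Before turning to the example, we note that in practice such classifiers are usually estimated from training data; a minimal sketch with MASS::lda() and MASS::qda() on simulated (hypothetical) data reads:

# Sketch: fitting linear and quadratic discriminant classifiers from data.
library(MASS)
set.seed(9)
train <- data.frame(
  group = factor(rep(c("healthy", "pre-diabetes", "diabetes"), each = 50)),
  x1 = rnorm(150, mean = rep(c(120, 138, 150), each = 50), sd = 8),
  x2 = rnorm(150, mean = rep(c(85, 110, 160), each = 50), sd = 10)
)
fit_lda <- lda(group ~ x1 + x2, data = train)  # assumes equal covariance matrices
fit_qda <- qda(group ~ x1 + x2, data = train)  # allows class-specific covariances
predict(fit_lda, newdata = data.frame(x1 = 125, x2 = 100))$class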

Example 7.2 Consider the diagnosis of diabetes patients with four input
features: systolic blood pressure, fasting blood glucose level, BMI, and smok-
ing (nicotine intake). The diagnosis outputs include three possible categories:
healthy, pre-diabetes, and diabetes. For a new patient, we want to use fea-
tures of patient information (input variables) to diagnose the clinical outcome
(designate the patient into the right category).

> x<-matrix(c(125, 100, 30, 1.1), 4, 1, byrow=TRUE)
> sigma<-matrix(c(1, 0.5, 0, 0, 0.5, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1), 4, 4, byrow=TRUE)
> sigma
     [,1] [,2] [,3] [,4]
[1,]  1.0  0.5    0    0
[2,]  0.5  1.0    0    0
[3,]  0.0  0.0    1    0
[4,]  0.0  0.0    0    1
> m1<-matrix(c(120, 85, 20, 0.04), 4, 1, byrow=TRUE)
> m2<-matrix(c(138, 110, 26, 1.2), 4, 1, byrow=TRUE)
> m3<-matrix(c(150, 160, 31, 1.7), 4, 1, byrow=TRUE)

> d1<-t(x)%*%sigma%*%m1-0.5*t(m1)%*%sigma%*%m1+log(0.5)
> d1
         [,1]
[1,] 19299.35
> d2<-t(x)%*%sigma%*%m2-0.5*t(m2)%*%sigma%*%m2+log(0.3)
> d2
        [,1]
[1,] 19304.4
> d3<-t(x)%*%sigma%*%m3-0.5*t(m3)%*%sigma%*%m3+log(0.2)
> d3
         [,1]
[1,] 16648.32

FIGURE 7.1
R-codes for computation on diabetes classification risk.

Under the normality assumption, for each class, the likelihood function
takes the form:

f(x) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp{ −(1/2) (x − μ)^T Σ^{-1} (x − μ) }.

Further, assume that the mean vector for the healthy category is μ1 , pre-
diabetes μ2 , and diabetes μ3 as follows.

μ_1 = (120, 85, 20, 0.04)^T,   μ_2 = (138, 110, 26, 1.2)^T,   μ_3 = (150, 160, 31, 1.7)^T;
we also assume that the correlation matrix of the four features is the same
for healthy, pre-diabetes, and diabetes patients,
Σ^{-1} = [ 1    0.5  0  0 ]
         [ 0.5  1    0  0 ]
         [ 0    0    1  0 ]
         [ 0    0    0  1 ].

Assume that the disease distribution of the population reads 50% healthy, 30%
pre-diabetes, and 20% diabetes. For a patient with xT = (125, 100, 30, 1.1),
we can then use the linear discriminant classification criterion to diagnose
whether he is healthy, pre-diabetic, or has diabetes.
Since the correlation matrix is the same across the three categories, we have,
for k = 1, 2, 3,

δ_k = x^T Σ^{-1} μ_k − (1/2) μ_k^T Σ^{-1} μ_k + log π_k.
Thus, for any x we have the three classification criteria as follows.

δ_1 = x^T Σ^{-1} μ_1 − (1/2) μ_1^T Σ^{-1} μ_1 + log(0.5) = 19299.35,

δ_2 = x^T Σ^{-1} μ_2 − (1/2) μ_2^T Σ^{-1} μ_2 + log(0.3) = 19304.4,

δ_3 = x^T Σ^{-1} μ_3 − (1/2) μ_3^T Σ^{-1} μ_3 + log(0.2) = 16648.32.

Since the second class has the highest δ value, the new patient (who has the
four features (125, 100, 30, 1.1) representing systolic blood pressure, fasting
glucose level, BMI, and nicotine level) is classified into the pre-diabetes cat-
egory. This classification minimizes the possible misclassification risk for the
given data.
The R-codes for the computation in the above example can be found in
Figure 7.1.
We shall use another example to illustrate the quadratic discrimination
classification.

Example 7.3 Outlined below is a hypothetical example of the application of the


quadratic discriminant function.
C-reactive protein, ESR (erythrocyte sedimentation rate), and BMI are com-
mon factors in the diagnosis of rheumatoid arthritis. Assume that the fac-
tors follow normal models with μdisease = (2.1, 35, 30), σdisease = (0.3, 5, 2.3)
for C-reactive protein, ESR, and BMI, respectively. Also, assume that corre-
spondingly, μhealthy = (0.7, 15, 17), σhealthy = (0.2, 3.1, 2), the disease rate of
rheumatoid arthritis is 30% in the population. We are interested in developing
a classification criterion to detect diseased patients.

Assume that the covariance matrices are

Σ_disease = [ 0.09  0.45  0    ]     Σ_healthy = [ 0.04  0     0 ]
            [ 0.45  25    0    ]                 [ 0     9.61  0 ]
            [ 0     0     5.29 ],                [ 0     0     4 ].

In this setting,

Σ_disease^{-1} = [ 12.21    −0.2198   0     ]
                 [ −0.2198   0.04396  0     ]
                 [ 0         0        0.189 ],

Σ_healthy^{-1} = [ 25  0      0    ]
                 [ 0   0.104  0    ]
                 [ 0   0      0.25 ].

The classification criterion for an observation $x = (x_1, x_2, x_3)^T$ reads

$$\delta_{healthy} = -\frac{1}{2}\, x^T \Sigma^{-1}_{healthy}\, x + x^T \Sigma^{-1}_{healthy}\,\mu_{healthy} - \frac{1}{2}\,\mu_{healthy}^T \Sigma^{-1}_{healthy}\,\mu_{healthy} + \log(0.7) - \frac{1}{2}\log(0.04 \times 9.61 \times 4),$$

$$\delta_{disease} = -\frac{1}{2}\, x^T \Sigma^{-1}_{disease}\, x + x^T \Sigma^{-1}_{disease}\,\mu_{disease} - \frac{1}{2}\,\mu_{disease}^T \Sigma^{-1}_{disease}\,\mu_{disease} + \log(0.3) + \frac{1}{2}\log(\det(\Sigma^{-1}_{disease})),$$

with $\mu_{healthy} = (0.7, 15, 17)^T$, $\mu_{disease} = (2.1, 35, 30)^T$, and the inverse covariance matrices as given above. The observation is classified as diseased when $\delta_{disease} > \delta_{healthy}$, and as healthy otherwise.
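As a minimal illustration, the following R sketch evaluates the two quadratic discriminant scores under the means, covariance matrices, and priors of Example 7.3; the patient vector x and the helper name qda.score are assumptions introduced here for illustration.

# Minimal R sketch of the quadratic discriminant scores in Example 7.3.
mu.d <- c(2.1, 35, 30)                  # disease means (CRP, ESR, BMI)
mu.h <- c(0.7, 15, 17)                  # healthy means
Sigma.d <- matrix(c(0.09, 0.45, 0,
                    0.45, 25,   0,
                    0,    0,    5.29), 3, 3)
Sigma.h <- diag(c(0.04, 9.61, 4))
qda.score <- function(x, mu, Sigma, prior) {
  S.inv <- solve(Sigma)                 # invert the covariance matrix
  -0.5 * t(x) %*% S.inv %*% x + t(x) %*% S.inv %*% mu -
    0.5 * t(mu) %*% S.inv %*% mu + log(prior) - 0.5 * log(det(Sigma))
}
x <- c(1.8, 28, 26)                     # hypothetical new patient
if (qda.score(x, mu.d, Sigma.d, 0.3) > qda.score(x, mu.h, Sigma.h, 0.7)) {
  "classified as diseased"
} else {
  "classified as healthy"
}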
In practice, the parameter vectors $\mu_1$, $\mu_2$, $\mu_3$, and $\Sigma$ are unknown. They need to be estimated from the training data, with cross-validation, before the classification procedure is implemented.
It should be noted that the linear (and quadratic) discriminant function
depends on the assumption that the joint distribution of the features follows
a multivariate normal model across the three classes. Such assumptions are
critical for the validity of the classification procedure. The assumption on equal
covariance matrices is also not easy to verify in practice. In what follows, we
discuss a different approach in classification involving logistic regression, which
does not require the assumption that the data is in the form of a multivariate
normal model.

7.1.2 Logistic regression classification


Consider the relationship between the occurrence of an event $Y$, say lung cancer, and potential risk factors such as age, gender, smoking status, hypertension, diabetes, etc. The potential risk factors can be denoted as $X_1, \dots, X_k$. The following equation captures a linear relationship between the logit function of the probability of lung cancer and its predictors,
function of the probability of lung cancer and its predictors,
$$\log\frac{p}{1-p} = \alpha + \beta_1 X_1 + \cdots + \beta_k X_k,$$

where $p = P(Y = 1|X)$ is the probability of the occurrence of lung cancer for patients with features $X$.
For a new patient with features $x$, if we are able to estimate the probability $p = P(Y = 1|X = x)$, then by the MRE criterion for a given data point $x$ in Equation (7.3), the patient is classified as $Y = 1$ when $P(Y = 1|X) > P(Y = 0|X)$. This is tantamount to $p = P(Y = 1|X) > 0.5$. Thus the rule claiming $Y = 1$ when the estimated probability $\hat{p} > 0.5$ is actually a minimum risk classifier for the given data.

For convenience, the data frame for a logistic regression model can be
briefly outlined as follows.

Example 7.4

ID       gender   age   smoking   diabetes   hypertension   lung cancer
N0001    M        30    Y         N          Y              Y
N0002    F        43    N         Y          N              N
N0003    M        71    Y         Y          Y              N
N0004    F        63    Y         N          Y              Y
...

Model Assumption The assumption of the logistic regression model is that


the response follows a binary model with a common chance for the occurrence
of the event given a presumed set of features.

The following example explains the interpretation of a logistic regression


model.

Example 7.5 Assume that a fitted model for the relationship between the
occurrence of lung cancer, X1 (dusty work conditions), and X2 (smoking) is
quantified as
$$\log\left(\frac{p}{1-p}\right) = -0.4 + 0.3X_1 + 1.6X_2,$$
where
$$p = P(\text{lung cancer}\,|\,X_1, X_2).$$
The estimated model coefficient 1.6 is usually interpreted as: Controlling for
work environment conditions, the odds ratio of lung cancer is e1.6 = 4.953 for

smoking patients. In other words, smoking increases the odds of lung cancer almost 5-fold when controlling for other risk factors.

Additionally, for a patient who smokes and works in a dusty environment,


the chance of getting lung cancer reads
$$P(Y = 1|X_1 = 1, X_2 = 1) = \frac{\exp(-0.4 + 0.3 + 1.6)}{1 + \exp(-0.4 + 0.3 + 1.6)} = 0.82 > 0.5.$$
Thus, the patient is classified into the group of being more likely to have lung
cancer.
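This computation can be checked with a one-line R sketch; the coefficients are taken from the fitted model above, and the 0.5 threshold follows the minimum risk rule.

p.hat <- plogis(-0.4 + 0.3*1 + 1.6*1)   # = exp(1.5)/(1 + exp(1.5)) = 0.82
p.hat > 0.5                             # TRUE: classify as Y = 1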

FIGURE 7.2
Prostate cancer and feature selection in logistic regression

The selection of the factors significantly associated with the response vari-
able is critical in classification when using the logistic regression model. In-
cluding insignificant factors in the final model for classification may result in
misleading conclusions. One common approach to remedy this is to use the
AIC selection criterion. As shown in Figure 7.2, the occurrence of prostate
cancer relative to acid, stage, Xray, grade, and age is investigated. When all

the predictors are in the model, the AIC is 60.13. With the removal of each
predictor, the AIC changes. For instance, when the feature “grade” is re-
moved, the AIC for the new model is 59.097, which is lower than the AIC of
the complete model. Since removing grade results in the lowest AIC among all
the candidate models, for the first step, grade is removed and the AIC for the
updated model is calculated to be 59.1. The process continues until it finds
Xray, stage, and acid to be the model with the lowest AIC. At this point,
removing any one of the predictors increases the AIC for the updated model,
so the process stops.
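The backward search of Figure 7.2 can be reproduced with the built-in step() function; the following is a minimal sketch, assuming a hypothetical data frame prostate with the binary response nodes and the five predictors (the variable names are assumptions here).

# Minimal R sketch of backward AIC selection; data frame and variable
# names are hypothetical stand-ins for the study in Figure 7.2.
full <- glm(nodes ~ acid + stage + Xray + grade + age,
            data = prostate, family = binomial)
step(full, direction = "backward")   # drops predictors while the AIC decreases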
Similar to the linear discriminant function and the quadratic discrimi-
nant function that we discussed in the preceding subsection, a key step in
the construction of the classification criterion using logistic regression is the
estimation and cross validation of the model coefficients using training data.

7.2 General Loss Functions


The previous section discusses classification approaches based on minimum
risk estimation, in which the risk function is based on a 0-1 loss function.
However, in practice, we often face scenarios where the loss function is not
0-1. For example, the loss (cost) of misclassifying a healthy person as a leukemia patient may be smaller than that of misclassifying a leukemia patient as healthy (missing treatment may lead to loss of life). Carelessly applying procedures designed for the 0-1 loss function may result in misleading prediction/estimation
conclusions. In this section, we shall discuss methodologies for classification
when the loss function is not 0-1. Under this setting, the risk function is
correspondingly changed.
We start with a simple example of non 0-1 loss functions.

Example 7.6 In the classification of three categories with training data $X$, consider a loss function $L(g_i, G) = i/6$ when $G(X) \neq g_i$, and $L(g_i, G) = 0$ when $G(X) = g_i$, for $i = 1, 2, 3$. Find and interpret the expression for the minimum risk estimator of the true but unknown category.

Solution: When $G(X) = g_1$, denoting the true category by $T$, we have the risk
$$R_1(G) = E_X\Big\{0 \cdot P(T = g_1|X) + \frac{2}{6} P(T = g_2|X) + \frac{3}{6} P(T = g_3|X)\Big\} = E_X\Big\{t(X) - \frac{1}{6} P(T = g_1|X)\Big\},$$

where
$$t(X) = \frac{1}{6} P(T = g_1|X) + \frac{2}{6} P(T = g_2|X) + \frac{3}{6} P(T = g_3|X).$$
Similarly, for $j = 2, 3$, the risk associated with $G = g_j$ reads
$$R_j(G) = E_X\Big\{t(X) - \frac{j}{6} P(T = g_j|X)\Big\}.$$
Thus, one of the optimal solutions for the MRE of the unknown category is
$$\hat{G} = \arg\max_{j}\Big\{\frac{j}{6} P(T = g_j|X)\Big\}.$$
This implies that, unlike the optimal classification criterion for the 0-1 loss function, the optimal solution for a non-0-1 loss is to classify the observation into the category that has the highest posterior probability when weighted by the associated loss. The interpretation of the general principle is similar to the 0-1 loss scenario, in the sense of maximizing the posterior probability, with the weight adjustment based on the non-0-1 loss function.
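A minimal R sketch of this loss-weighted rule, assuming a hypothetical vector of posterior probabilities:

posterior   <- c(0.45, 0.35, 0.20)   # hypothetical P(T = g_j | X), j = 1, 2, 3
loss.weight <- (1:3) / 6             # losses i/6 from Example 7.6
which.max(loss.weight * posterior)   # category chosen by the MRE; here, 2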

7.3 Local and Universal Optimizations


It should be noted that the classification methods discussed in the preceding sections (linear discriminant function, quadratic discriminant function, and logistic discriminant approach) are grounded on point-wise optimization for each set of given observations. In other words, the minimum risk estimation
$$\hat{G}(x) = \arg\min_{g \in \mathcal{G}} R(G), \tag{7.4}$$

with the risk function
$$R(G) = E_X\Big[\sum_{k=1}^{K} L(G_k, g)\Pr(G_k|X = x)\Big],$$

is simplified to the sufficient condition of the point-wise minimization problem, as discussed in Equation (7.2), for all observations $x \in R^m$,
$$\hat{G}(x) = \arg\min_{g \in \mathcal{G}} \sum_{k=1}^{K} L(G_k, g)\Pr(G_k|X = x). \tag{7.5}$$

Clearly, Equation (7.5) is a sufficient but not necessary condition for the
optimal risk formulation in Equation (7.4). The condition that Equation (7.5)

is valid for all x guarantees the validity of Equation (7.4). Yet, the validity
of Equation (7.4) does not imply Equation (7.5). For instance, X > 0 implies
that E(X) ≥ 0, however, E(X) ≥ 0 does not imply X > 0.
Therefore, the conditional optimization (Equation (7.5) is conditioned on the given data $x$) of the inferred risk is different from the universal optimization in (7.4), which does not depend on the observation $X$. Notice that the loss function $L(G_k, g)$ is not a function of the data $x$. Equation (7.4) can be simplified as follows:


$$\begin{aligned}
R(g) &= E_X\Big[\sum_{k=1}^{K} L(G_k, g)\Pr(G_k|X = x)\Big] \\
&= \int \Big[\sum_{k=1}^{K} L(G_k, g)\Pr(G_k|X = x)\Big]\, dF(x) \\
&= \sum_{k=1}^{K} L(G_k, g) \int \Pr(G_k|X = x)\, dF(x) \\
&= \sum_{k=1}^{K} L(G_k, g)\Pr(G_k).
\end{aligned} \tag{7.6}$$

When the loss function is the 0-1 loss, the above Equation (7.6) becomes
$$R(g) = 1 - \Pr(G_{true} = g).$$
Thus, the MRE becomes
$$\hat{G} = \arg\max_{g} \Pr(G_{true} = g).$$
Under the 0-1 loss function, the MRE classification is the category that has the largest chance of being the true but unknown category.

7.4 Optimal ROC Classifiers


In the process of training a classifier using the logistic regression model, sim-
ilar to the sensitivity and specificity problem that we discussed in Chapter
2 and Chapter 3, the decision of classification contains the following possible
outcomes:

decision \ true category     G_k                 not G_k
G_k                          sensitivity         false positive
not G_k                      false negative      specificity
Unlike the most powerful test where we fix the probability of type-I error

and find the procedure that produces the highest power (lowest probability of type-II error), in classification all four indexes are interpreted and selected according to the situation at hand. For instance, when the true but unknown category $G_k$ is a serious disease, the cost of misclassifying a diseased patient as a healthy person may result in life-threatening situations. Hence, controlling the probability of a true positive (or false negative) may be of primary concern. On the other hand, when making the decision whether to operate in an emergency room, a false positive error may result in sending a healthy person to an operating room. In that situation, controlling the false positive rate (hence the true negative rate) may carry higher weight when examining the results of the confusion matrix.


FIGURE 7.3
Prostate cancer, ROC curve, and treatment regimen

One classical approach in logistic regression for balancing the optimal point
between the probability of false positive and false negative is to construct an
ROC (receiver operating characteristic) curve. The ROC curve usually plots
the estimated sensitivity versus 1-specificity (or the sensitivity versus false
negative probability). The optimal point (optimal cut-off threshold) is the one
corresponding to the point in the ROC curve that has the shortest distance

to the gold standard (where sensitivity = 1 and specificity = 1, correctly diagnosing all the diseased patients and all the healthy patients).
For instance, Figure 7.3 shows an estimated ROC curve in the diagnosis of prostate cancer. When a patient is diagnosed with prostate cancer, an important question in deciding the treatment strategy is whether the cancer cells have spread to the neighboring lymph nodes. The question is so critical during prognosis and treatment that it is customary to operate on the patient (i.e., perform a laparotomy) and remove tissue samples for the sole purpose of examining the nodes for evidence of cancer. However, certain variables can be measured without surgery and are predictive of nodal involvement. One such variable is the pathology reading (grade) of a biopsy of the tumor obtained by needle before surgery. Figure 7.3 shows an estimated ROC curve obtained by plotting the sensitivity and 1-specificity at the different grade levels of the pathology reading. The point representing the gold standard has 100% success in detecting the spread of prostate cancer and 100% success in detecting non-cancer patients, located at (0, 1). Among all the estimated points on the ROC curve, the one with the shortest distance to the gold standard point (0, 1) is the optimal threshold point. Therefore, the grade level that corresponds to the optimal threshold point is the best cut-off value for the pathology reading in diagnosing the status of prostate cancer.
As discussed in the prostate cancer example, the optimality criterion now becomes
$$R = (1 - specificity)^2 + (1 - sensitivity)^2 = (P(\text{false positive}))^2 + (P(\text{false negative}))^2,$$

an optimization criterion for a classifier based on the training data, which is different from the local and universal optimization discussed in the previous sections of this chapter.
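A minimal R sketch of this criterion, assuming hypothetical sensitivity and specificity estimates at four candidate thresholds:

thresholds <- c(0.2, 0.4, 0.6, 0.8)
sens <- c(0.95, 0.90, 0.70, 0.40)      # hypothetical estimates
spec <- c(0.40, 0.70, 0.85, 0.95)
dist2 <- (1 - spec)^2 + (1 - sens)^2   # squared distance to the point (0, 1)
thresholds[which.min(dist2)]           # optimal cut-off; here, 0.4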

SUMMARY When predicting a response variable that is categorical (such as "yes" or "no" for a disease, or a different candidate virus in question), the prediction methods discussed in previous chapters, such as linear regression, LASSO, or non-linear regression, cannot be applied. This chapter thus focuses on classification methods and classification criteria that minimize a prediction risk.
For the classification risk function, we start with a zero-one loss function
for methods involving logistic regression, which assumes binomial responses.
The fitted logistic regression model is then used to compute the predicted
probability for classification. Further applications related to logistic regression

methods can be found in [15], [29], [30], and [93], among others. After describ-
ing the method of logistic regression, we focus on the method of Bayesian
discriminant analysis under the normality assumption on the joint distribu-
tion of errors. The Bayesian discriminant analysis involves two discriminant
functions. We address the difference between linear discriminant function and
quadratic discriminant functions in classification. Such a description sheds new light on the hidden conditions governing the application of discriminant analysis.
After elucidating methods related to the zero-one loss function, we extend the discussion to applications with general loss functions, which highlights the prerequisite conditions before performing data analysis. We address the difference between local and universal optimization, and conclude the chapter with a discussion on the selection of the cut-off level for optimal ROC classifiers. The ROC criteria essentially cast a new light on the optimization measure, going beyond the mean prediction errors discussed in preceding chapters.
8
Support Vectors and Duality Theorem

The support vector machine is a method of classification that makes no model assumptions on the underlying distribution of the data. For example, consider the classification problem with input vectors $X_1, \dots, X_n$, where $X_i = (x_{i1}, \dots, x_{ik})$, for a binary response output $Y$ ($Y$ = yes or no). The goal of the investigation is to find an effective approach to determine the value of $Y$ on the basis of the information regarding $X_1, \dots, X_n$.
When the input variable X is of high dimension, such as in the study of gene expression data or picture recognition in artificial intelligence, the method is especially useful in identifying data points near or beyond the boundary (support vectors) to construct optimal classifiers. Theoretically, the opti-
mization process in support vector machine is built upon the duality theorem
in linear programming. In this chapter, we start with basic classifiers such as
maximal margin classifier, support vector classifier, and non-linear boundary
classifier before discussing the mechanism of support vector machine. After
describing the methodology, we move deeper into the area by discussing the
mechanism in the application of duality theorem. The chapter concludes with
a perturbation method in the search for optimal solutions with duality theorem
in linear programming.

8.1 Maximal Margin Classifier


We start with the following example to illustrate the setting and introduce
the concept of maximal margin classifier. The example is about predicting the
annual revenue status of a company based on its advertising costs.

Example 8.1 Consider a set of data for advertising costs (in thousand dol-
lars for TV advertising and Internet advertising) of 20 companies in a soft-
drink industry, along with their annual revenue status (positive or negative)
in the past year.
Positive:(20, 54); (30, 42); (28, 63); (42, 29); (38, 35); (31, 44); (29,
52); (62, 18); (32, 49); (53, 41)
Negative:(20, 18); (24, 22); (25, 13); (22, 16); (24, 18); (21, 24); (18,
32); (22, 16); (12, 28); (33, 12)


If a soft-drink company plans to spend a TV advertising budget of 30K and an Internet advertising budget of 40K in the coming year, we want to predict whether the company will have positive or negative annual revenue.

FIGURE 8.1
The hyperplane classifier separates companies with positive and negative rev-
enues associated with advertising input of 20 soft-drink companies.

As shown in Figure 8.1, the information contained in the dataset can be depicted in a plot with the TV advertising cost on the x-axis and the Internet advertising cost on the y-axis, with round dots representing positive revenues and triangle dots representing negative revenues. Without any additional model information, it is clear in this example that once the input of the two advertising costs lies above the Positive line, the revenue is positive, while for those below the Neg line, the outcome is negative.
Let Y = 1 denote the outcome of positive revenue and Y = −1 denote the
outcome of negative revenue. Denote the boundary for positive revenue be

a1 X1 + b1 X2 + c1 = 0,

and the boundary for negative revenue be

a2 X1 + b2 X2 + c2 = 0.

Intuitively, the best classifier is the one that keeps equal distance from the positive and the negative lines. We call it the maximal margin classifier,

a3 X1 + b3 X2 + c3 = 0.

With this setting, all the data points (Xi1 , Xi2 , yi ) satisfy

yi (a3 Xi1 + b3 Xi2 + c3 ) > 0,



for i = 1, ..., n, because when a3 X1 + b3 X2 + c3 < 0, the corresponding yi < 0.


As for the company with $X_1 = 30$ and $X_2 = 40$, the company is expected to have positive revenue because the point (30, 40) lies on the positive side of the classifier.
In the setting of the above example, the key components are the positive line, the negative line, and the classifier. In fact, in a $p$-dimensional space, the term we frequently use is the hyperplane.

8.1.1 Hyperplane
In a p-dimensional space, a hyperplane is a flat affine subspace of dimension
p−1. For instance, the line x+y = 2 in R2 for the xy-plane, where (x, y) ∈ R2 .
Alternatively, the plane x + 3y + 2z = 10 in R3 for (x, y, z) ∈ R3 . Or generally

β0 + β1 X1 + β2 X2 + ... + βp Xp = 0,

for Xi in the Rp space.


Recall that in R3 , a plane with a normal vector (A, B, C) and a point
(x0 , y0 , z0 ) can be expressed as

A(x − x0 ) + B(y − y0 ) + C(z − z0 ) = 0,

which is the same as


Ax + By + Cz + D = 0,
where D = −Ax0 − By0 − Cz0 .
It is relevant at this point to review the concept of the unit normal vector:
$$\mathbf{n} = \frac{\mathbf{N}}{\|\mathbf{N}\|} = \frac{(A, B, C)}{\sqrt{A^2 + B^2 + C^2}}.$$
With the concept of the unit normal vector, the distance between a point $(x_1, y_1, z_1)$ and the plane characterized by $\mathbf{n}$ and $(x_0, y_0, z_0)$ becomes
$$d = |\mathbf{v}\cdot\mathbf{n}| = |(x_1 - x_0,\; y_1 - y_0,\; z_1 - z_0)\cdot\mathbf{n}| = \frac{|A(x_1 - x_0) + B(y_1 - y_0) + C(z_1 - z_0)|}{\sqrt{A^2 + B^2 + C^2}},$$
which is
$$d = \frac{|Ax_1 + By_1 + Cz_1 + D|}{\sqrt{A^2 + B^2 + C^2}}.$$
The above expression can be further simplified as
$$d = \beta_0 + \beta_1 x_1 + \beta_2 y_1 + \beta_3 z_1,$$
where the $\beta_i$, $i = 1, 2, 3$, satisfy the condition
$$\sum_{i=1}^{3}\beta_i^2 = 1.$$

This directly leads to the maximal margin classifier, found as the classifier that maximizes the distance to the two different types of data points.

8.1.2 Definition of maximal margin classifier


On the basis of the above discussion, the maximal margin classifier can be viewed as a hyperplane that keeps the largest possible distance from the data points in each of the two categories. The hyperplane condition necessitates
$$\sum_{j=1}^{p}\beta_j^2 = 1,$$
and the distance condition requires
$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M.$$
Recalling that the distance from a point to a hyperplane is
$$d = \frac{|\mathbf{y}\cdot\mathbf{n}|}{\|\mathbf{n}\|},$$
we have the following definition for the concept of the maximal margin classifier.

Definition 8.1 The maximal margin classifier is a hyperplane that keeps the
largest possible distance from the two classes of data. Namely, it is the hyper-
plane
β0 + β1 xi1 + β2 xi2 + ... + βp xip = 0
that satisfies the following conditions.

$$\begin{aligned}
&\underset{\beta_0, \beta_1, \dots, \beta_p,\, M}{\text{maximize}} \quad M \\
&\text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 = 1, \\
&\quad y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M \quad \forall\, i = 1, \dots, n.
\end{aligned}$$

The maximal margin classifier for a set of completely separable data can
be found by iteratively substituting the current line with an updated line until
no further substitution is available. For convenience, we consider the simplest
case where p = 1. Starting from any point, we can intuitively move the cutoff

point x until the two groups are distinctive. And the maximal margin classifier
is the middle value between the x value for positive outcome and the x value
for negative outcome. Algebraically, to maximize the term
$$D(\beta, \beta_0) = \sum_{i\in\mathcal{M}} y_i(x_i^T\beta + \beta_0),$$
where $\mathcal{M}$ indexes the set of misclassified points, we have
$$\frac{\partial D(\beta, \beta_0)}{\partial \beta} = \sum_{i\in\mathcal{M}} y_i x_i, \qquad \frac{\partial D(\beta, \beta_0)}{\partial \beta_0} = \sum_{i\in\mathcal{M}} y_i.$$

The final solution of the maximal margin classifier can be found by recursively updating
$$\begin{pmatrix}\beta\\ \beta_0\end{pmatrix} \leftarrow \begin{pmatrix}\beta\\ \beta_0\end{pmatrix} + \rho\begin{pmatrix}\beta\\ \beta_0\end{pmatrix}, \qquad \text{where } \rho = \frac{x_i - x_{i0}}{x_{i0}}.$$
The following example shows how to use R to find the maximal margin classifier for a set of separable data. The problem concerns the prediction of post-thrombotic syndrome based on the percentage of thrombolysis and the time after the first minor stroke symptoms of stroke patients. More background information on thrombolysis and post-thrombotic syndrome can be found in, for example, Chen and Comerota (2012 [24]).

Example 8.2 The following is a set of hypothetical data on post-thrombotic syndrome for 10 stroke patients in terms of the remaining percentage after thrombolysis $X_1$ and the time after the onset of the first minor stroke symptoms in minutes $X_2$.
Positive post-thrombotic syndrome: (20, 54); (30, 42); (28, 63); (42,
29); (38, 35)
Negative post-thrombotic syndrome: (20, 18); (24, 22); (25, 13); (22,
16); (24, 18)
We are interested in finding the support vectors, maximal margin hyper-
plane, and two margins to predict post-thrombotic syndrome with the informa-
tion on percentage remaining after thrombolysis procedures and time passed
after the onset of the first minor stroke.

As shown in Figure 8.2, the three support vectors are (42, 29), (30, 42) for
patients with positive post thrombotic syndrome, and (24, 22) for negative
post-thrombotic syndrome. The line separating the two areas in the diagram
is the maximal margin classifier, the margin for positive post-thrombotic syn-
drome is the line determined by the two support vectors (42, 29), and (30,

# Feature matrix: column 1 = remaining percentage, column 2 = time (minutes)
x = matrix(c(20, 30, 28, 42, 38, 20, 24, 25, 22, 24,
             54, 42, 63, 29, 35, 18, 22, 13, 16, 18), 10, 2)
y = rep(c(-1, 1), c(5, 5))       # class labels: -1 for the first five rows
par(mar = c(1, 1, 1, 1))
plot(x, col = y + 3, pch = 19)
dat = data.frame(x, y = as.factor(y))
library(e1071)
svmfit = svm(y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
dev.new(width = 5, height = 4)
plot(svmfit, dat)

FIGURE 8.2
The hyperplane classifier predicts patients with positive post-thrombotic syn-
drome based on the percentage of thrombolysis and the time after the first
onset of minor stroke symptoms.

42), while the margin for negative post-thrombotic syndrome is the line de-
termined by the slope of the positive post-thrombotic syndrome and passing
the support vector (24, 22).

8.2 Support Vector Classifiers


The previous section discusses the method of maximal margin classifier in
the classification of two categories based on the feature information X, under
the assumption that the data can be completely separated into two distinct
clusters. However, in practice, we are often confronted with situations where
the two groups of data are not completely separable, as shown in Figure 8.3. In

FIGURE 8.3
When data are not linearly separable between the two classes, the maximal margin classifier does not work.

this case, we may either change the linear classifier to a non-linear classifier, or consider an optimization in which some vectors are allowed to be misclassified. The vectors on the margin and in the area of misclassification are the support vectors in the classification process. This naturally extends the maximal margin classifier to the method of the support vector classifier.

Definition 8.2 For datasets that are not completely separable, the classifier
that optimizes the following target function with the conditions for a given cost
C is defined as a support vector classifier.

$$\begin{aligned}
&\underset{\beta_0, \beta_1, \dots, \beta_p,\, \epsilon_1, \dots, \epsilon_n,\, M}{\text{maximize}} \quad M \\
&\text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 = 1, \\
&\quad y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M(1 - \epsilon_i), \\
&\quad \epsilon_i \ge 0, \quad \sum_{i=1}^{n}\epsilon_i \le C.
\end{aligned}$$

In the scenario where the feature vector X has large dimension, the op-
timization process in the above definition can be simplified by the duality

theorem, which states that every problem of the form
$$\begin{aligned}
&\text{maximize} \quad \sum_{j=1}^{n} c_j x_j \\
&\text{subject to} \quad \sum_{j=1}^{n} a_{ij} x_j \le b_i, \quad i = 1, 2, \dots, m \\
&\quad x_j \ge 0, \quad j = 1, 2, \dots, n,
\end{aligned}$$
has a dual:
$$\begin{aligned}
&\text{minimize} \quad \sum_{i=1}^{m} b_i y_i \\
&\text{subject to} \quad \sum_{i=1}^{m} a_{ij} y_i \ge c_j, \quad j = 1, 2, \dots, n \\
&\quad y_i \ge 0, \quad i = 1, 2, \dots, m.
\end{aligned}$$
Notice that the original problem optimizes over the $R^p$ space while the dual problem optimizes over the $R^1$ space. When the dimension $p$ is large, such as in facial recognition in AI, the target dimension changes dramatically. We use the following example to illustrate the duality theorem.

Example 8.3 Consider the question to maximize the profit with fixed re-
sources,
maximize c1 x1 + c2 x2 + c3 x3
subject to a11 x1 + a12 x2 + a13 x3 ≤ b1
a21 x1 + a22 x2 + a23 x3 ≤ b2
x1 , x2 , x3 ≥ 0,
where $c_j$ = profit per unit of product $j$ produced, $b_i$ = units of raw material $i$ on hand, and $a_{ij}$ = units of raw material $i$ required to produce one unit of product $j$.

The problem can be viewed from the angle of cost as follows. If we save
one unit of product j, then we free up:
• $a_{1j}$ units of raw material 1, and
• $a_{2j}$ units of raw material 2.
Selling these unused raw materials at prices of $y_1$ and $y_2$ dollars per unit, respectively, yields $a_{1j}y_1 + a_{2j}y_2$ dollars, which is the corresponding cost. Assume that we are only interested in whether the cost exceeds the lost profit on each product $j$:
$$a_{1j}y_1 + a_{2j}y_2 \ge c_j, \quad j = 1, 2, 3.$$
Producing as much product as possible to gain maximal profit is the same as efficiently minimizing the cost under certain input constraints:

$$\begin{aligned}
&\text{minimize} \quad b_1 y_1 + b_2 y_2 \\
&\text{subject to} \quad a_{11}y_1 + a_{21}y_2 \ge c_1 \\
&\quad a_{12}y_1 + a_{22}y_2 \ge c_2 \\
&\quad a_{13}y_1 + a_{23}y_2 \ge c_3 \\
&\quad y_1, y_2 \ge 0.
\end{aligned}$$
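A minimal R sketch of this primal/dual pair using the lpSolve package; the numeric profits, resource limits, and requirement coefficients are hypothetical.

library(lpSolve)
A <- matrix(c(2, 1, 1,
              3, 1, 1), nrow = 2, byrow = TRUE)   # hypothetical a_ij
primal <- lp("max", c(3, 2, 4), A, rep("<=", 2), c(10, 8))
dual   <- lp("min", c(10, 8), t(A), rep(">=", 3), c(3, 2, 4))
c(primal$objval, dual$objval)    # equal at the optimum (strong duality)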
With the duality theorem, the support vector classifier can be formulated as follows:
$$\min_{\beta, \beta_0}\Big\{\frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{N}\xi_i\Big\}
\quad \text{subject to} \quad \xi_i \ge 0, \quad y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i \ \ \forall i,$$
with the Lagrange (primal) function
$$L_P = \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\big[y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)\big] - \sum_{i=1}^{N}\mu_i\xi_i.$$
Setting the derivatives of $L_P$ with respect to $\beta$, $\beta_0$, and $\xi_i$ to zero yields
$$\beta = \sum_{i=1}^{N}\alpha_i y_i x_i, \qquad 0 = \sum_{i=1}^{N}\alpha_i y_i, \qquad \alpha_i = C - \mu_i \ \ \forall i.$$
In this setting, observations that lie directly on the margin, or on the
wrong side of the margin for their class, are defined as support vectors. These
observations affect the construction of the support vector classifier, while the
rest of the data do not contribute to the classifier.
In the above formulation, there is a principle of trade-off between bias and variance in the support vector machine. When the value of C is large, there is a high tolerance for observations being on the wrong side of the margin; therefore, the margin will consequently be large. More support vectors result, which leads to lower variance and higher bias. On the other hand, when the value of C decreases, the tolerance for observations being on the wrong side of the margin decreases, and consequently the margin shrinks. This results in fewer observations violating the margin and consequently fewer support vectors. With fewer support vectors, the support vector classifier has higher variance and lower bias.
We now apply the above techniques to analyze the example in Figure 8.3.

Example 8.4 Since the two data groups are not linearly separable, there is no solution in the setting of the maximal margin classifier. We seek the support vector classifier with the cost set to 10.

$$\min_{\beta, \beta_0}\ \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{N}\xi_i
\quad \text{subject to} \quad \xi_i \ge 0, \quad y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i \ \ \forall i.$$
The 7 cross points in Figure 8.4 are the support vectors.

library(e1071)
set.seed(1)
x = matrix(rnorm(20*2), ncol = 2)     # 20 observations, 2 features
y = c(rep(-1, 10), rep(1, 10))        # two classes, not fully separable
x[y == 1, ] = x[y == 1, ] + 1         # shift the second class
plot(x, col = (3 - y))
dat = data.frame(x = x, y = as.factor(y))
svmfit = svm(y ~ ., data = dat, kernel = "linear", cost = 10,
             scale = FALSE)
plot(svmfit, dat)

FIGURE 8.4
When data are not completely separable between the two classes, the support vector classifier works with a cost.

8.3 Support Vector Machine


The previous section raises the point that when the maximal margin classifier does not exist, we have to use the support vector classifier with a given cost for misclassified observations or observations on the boundary (support vectors). Following this way of thinking, since the support vector classifier uses a linear boundary to classify the observations, it is possible that the linear classifier determined by the support vectors may not adequately capture the expected result in classification. Under this scenario, to lower the variation and bias at the cost of straightforward interpretation, we use a non-linear boundary classifier determined by the support vectors. Since the final result depends on the support vectors, the method of support vector machine works in such a way that the classification criteria (output) are directly influenced by the support vectors.
The support vector machine with a quadratic kernel function can be mathematically formulated as follows:
$$\begin{aligned}
&\underset{\beta_0, \beta_{11}, \beta_{12}, \dots, \beta_{p1}, \beta_{p2},\, \epsilon_1, \dots, \epsilon_n,\, M}{\text{maximize}} \quad M \\
&\text{subject to} \quad y_i\Big(\beta_0 + \sum_{j=1}^{p}\beta_{j1}x_{ij} + \sum_{j=1}^{p}\beta_{j2}x_{ij}^2\Big) \ge M(1 - \epsilon_i), \\
&\quad \epsilon_i \ge 0, \quad \sum_{i=1}^{n}\epsilon_i \le C, \quad \sum_{j=1}^{p}\sum_{k=1}^{2}\beta_{jk}^2 = 1.
\end{aligned}$$

Other common choices of kernel functions in support vector machines include:
• $d$th-degree polynomial: $K(x, x') = (1 + \langle x, x'\rangle)^d$,
• Radial basis: $K(x, x') = \exp(-\gamma\|x - x'\|^2)$,
• Neural network: $K(x, x') = \tanh(K_1\langle x, x'\rangle + K_2)$.
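For instance, these kernels can be requested directly in the e1071 package; the following minimal sketch reuses the data frame dat of Example 8.4, and the tuning values degree, gamma, and cost are assumptions.

library(e1071)
svm.poly   <- svm(y ~ ., data = dat, kernel = "polynomial",
                  degree = 2, cost = 10)    # quadratic boundary
svm.radial <- svm(y ~ ., data = dat, kernel = "radial",
                  gamma = 1, cost = 10)     # radial basis boundary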
It should be noted that the main feature of the support vector machine (SVM) is that the optimization requires no model assumption. For instance, it is comparable to examine an SVM linear boundary versus an SVM non-linear boundary, or SVMs with different types of kernel functions. However, for data that have legitimate model assumptions, such as the normal model for Bayesian classifiers or the binary model for logistic regressions, comparing them with SVM is like comparing apples with pears. In fact, the differences in the fundamental assumptions of the methodologies hinder the legitimacy of such comparisons.
The use of the duality theorem lets the SVM method switch the optimization process from dealing with the feature space X to the output space of the response variable y. This partially makes SVM popular in high-dimensional data analysis, such as picture recognition for artificial intelligence.

8.4 Duality Theorem with Perturbation


The critical device in the operation of support vector machine (SVM) is the
duality theorem in the optimization process. When the feasible solution can
not be explicitly obtained in linear programming, the technique can be further
advanced by a perturbation method in linear programming. In what follows
in this section, we shall discuss a perturbation method in seeking the optimal
solution for a bivariate Bonferroni lower bound with information on degree-
two Bonferroni summations S11 , S21 , S12 , and S22 , defined as


$$S_{r,u} = \begin{cases}
\sum_{U} P(A_{i_1}\cdots A_{i_r}B_{j_1}\cdots B_{j_u}) & 1 \le r \le n,\ 1 \le u \le m \\
S_u(B) & r = 0,\ 0 \le u \le m \\
S_r(A) & u = 0,\ 0 \le r \le n \\
1 & r = u = 0 \\
0 & r > n \text{ or } u > m,
\end{cases}$$

where $U = \{1 \le i_1 < \cdots < i_r \le n,\ 1 \le j_1 < \cdots < j_u \le m\}$, for any two sets
of events A1 , ..., An and B1 , ..., Bm in an arbitrary probability space (Ω, F,
P ).
The use of duality theorem for optimization solution in prediction essen-
tially stems from the same root of optimization in linear programming. We
briefly introduce the roadmap of perturbation method in this section. More
details on the use of the perturbation method with applications can be found
in the book by Chen (2014) [22].
For any two sets of events {Ai , i = 1, ..., n} and {Bj , j = 1, ..., m}, let
v1 and v2 be the number of occurrences of the two event sets, respectively.
Denote pij = P (v1 = i, v2 = j). For any integers 1 ≤ t ≤ n and 1 ≤ k ≤ m,
consider a set of consistent bivariate Bonferroni summations Sij , i = 1, ..., t,
j = 1, ..., k.
Similar to the optimization process in the construction of support vector
machine (SVM), an optimal upper bound for P (v1 ≥ 1, v2 ≥ 1) is defined by
the maximum value of the following linear programming problem:

$$\max\,(p_{11} + \cdots + p_{nm}) \tag{8.1}$$
subject to $\sum_{i=0}^{n}\sum_{j=0}^{m} p_{ij} = 1$, and
$$\begin{aligned}
p_{11} + 2p_{21} + \cdots + t\,p_{t1} + \cdots + nm\,p_{nm} &= S_{11} \\
\binom{2}{2}p_{21} + \cdots + \binom{t}{2}p_{t1} + \cdots + \binom{n}{2}m\,p_{nm} &= S_{21} \\
&\ \ \vdots \\
\binom{t}{t}\binom{k}{k}p_{tk} + \cdots + \binom{n}{t}\binom{m}{k}p_{nm} &= S_{tk}.
\end{aligned}$$

The following is the rationale of optimality behind the bounding formulation in (8.1). For any solution $p^*$ to the optimization problem (8.1), denote the sum of partial elements of $p^*$ by
$$U^* = \sum_{i=1}^{n}\sum_{j=1}^{m} p^*_{ij}.$$
Now for any two sets of events characterized by $p$ in any probability space,
$$P(v_1 \ge 1, v_2 \ge 1) = \sum_{i=1}^{n}\sum_{j=1}^{m} p_{ij} \le U^*,$$
because of (8.1). Thus the feasible optimal solution in (8.1) leads to an upper bound on the probability of the joint event
$$P(v_1 \ge 1, v_2 \ge 1).$$

Denote $T = \{(i, j) : i = 1, \dots, t,\ j = 1, \dots, k\}$. Let $F_1$ be an upper bound for any particular set of consistent Bonferroni summations $S_{ij}$, $(i, j) \in T$. If there exists a set of events $A_1^*, \dots, A_n^*, B_1^*, \dots, B_m^*$ in a probability space where
$$S_{ij}(A_1^*, \dots, A_n^*, B_1^*, \dots, B_m^*) = S_{ij}, \quad (i, j) \in T,$$
and
$$P(v_1(A^*) \ge 1, v_2(B^*) \ge 1) = F_1(S_{ij}, (i, j) \in T), \tag{8.2}$$
then $F_1$ is said to be a Fréchet optimal upper bound for $P(v_1 \ge 1, v_2 \ge 1)$. References on Fréchet optimal bounds can be found in [102], among others.
The optimality for probability bounds defined in (8.2) can be translated into the language of linear programming, in which the domain of the function class $F_1$ is limited to the $S_{ij}$, linear combinations of the $p_{ij}$. Denote
$$b = (1, t(1,1), t(1,2), t(2,1), t(2,2))'\, x, \tag{8.3}$$
where $\mathbf{1}$ is the vector of length $2^{n+m}$ with all elements equal to 1, and $t(i, j)$ is the vector specified below for $i, j = 1, 2$,
$$t'(i, j) = \Big(1,\ ij,\ \frac{ij(j-1)}{2},\ \frac{ij(i-1)}{2},\ \frac{ij(i-1)(j-1)}{4}\Big). \tag{8.4}$$
Denote the matrix
$$R = (1, t(1,1), t(1,2), t(2,1), t(2,2))',$$
a $(5 \times 2^{n+m})$ matrix with structure not affected by the values of the $S_{ij}$'s. We have
$$b = Rx. \tag{8.5}$$

Vectorization of a bivariate array. Letting $w = (n+1)(m+1)$ and putting the $w$ elements $p_{ij}$, for $i = 0, 1, \dots, n$ and $j = 0, 1, \dots, m$, into a vector of length $w$ yields
$$q = \big(p_{ij}\big)_{w \times 1},$$
where the $p_{ij}$ are arranged from $p_{00}$ to $p_{nm}$ in increasing order of $i$ for each value of $j$, and then in increasing order of $j$. Such a method of vectorization for a bivariate array can be illustrated by the following example. Letting $n = m = 2$, the vectorized outcome is the $9 \times 1$ vector
$$q = (p_{00}, p_{10}, p_{20}, p_{01}, p_{11}, p_{21}, p_{02}, p_{12}, p_{22})'.$$
Furthermore, by the expression of the bivariate Bonferroni summations (see, for example, Chen (2014) [22]),
$$S_{kt} = \sum_{i=k}^{n}\sum_{j=t}^{m}\binom{i}{k}\binom{j}{t} p_{ij} = g'_{kt}\, q, \quad k = 1, 2, \ \ t = 1, 2, \tag{8.6}$$
where the row vector $g'_{kt}$ is the vector of coefficients specified in (8.6). Combine the row vectors $g'_{kt}$ into a matrix $G$ (with the first row as $\mathbf{1}'$) for the quantities $S_{11}, S_{12}, S_{21}, S_{22}$.
Therefore, putting $b = (1, S_{11}, S_{12}, S_{21}, S_{22})'$, for $w = (n+1)(m+1)$ there exists a $5 \times w$ matrix $G$ so that
$$b = Gq, \tag{8.7}$$
where the first row of the matrix $G$ is $\mathbf{1}'$, the first $(m+1)$th column of $G$ is $(1, 0, 0, 0, 0)'$, and the structure of $G$ is not affected by the values of the bivariate Bonferroni summations $S_{i,j}$.
Denote the vector $c$: $c = (c(i, j))$, with the first element of $c$ corresponding to the index $\{i = 0 \text{ or } j = 0\}$, and the rest of the elements of $c$ formed by ranking over $i \ge 1$ in increasing order for each fixed $j$, then over $j \ge 1$ in increasing order. Also assign
$$c(i, j) = \begin{cases} 0 & \text{for } i = 0 \text{ or } j = 0 \\ 1 & \text{otherwise.} \end{cases}$$
The joint probability of at least one occurrence in both event sets can be expressed as
$$P(v_1 \ge 1, v_2 \ge 1) = c'p.$$
With the setting above, we have the following theorems.
With the setting above, we have the following theorems.
Theorem 8.1 For the matrix $G$ and vectors $c$ and $b$ as specified in the above setting, denote the vector $w = (w_0, w_1, w_2, w_3, w_4)'$, which may depend on $m$, $n$, but not on the values of $S_{11}, S_{12}, S_{21}, S_{22}$ under consideration. $w'b$ is an upper bound for $P(v_1 \ge 1, v_2 \ge 1)$ for all probability spaces if and only if $w'G \ge c'$ (each element of the vector $w'G$ is not less than the corresponding element in the vector $c'$).

The following theorem explores the existence of the feasible optimal solu-
tion from the angle of probability theory, instead of linear programming.

Theorem 8.2 If the Bonferroni summations S11 , S12 , S21 , and S22 are con-
sistent, the linear programming upper bound for P (v1 ≥ 1, v2 ≥ 1) always
exists.

Similar to the argument in the preceding section, to optimize the target function we find a vector (probability space) $p$ to minimize $\sum_{i=1}^{n} p_i$ subject to $p \ge 0$ and $Gp = b$, where
$$b = (1, S_{11}, S_{12}, S_{21}, S_{22})'.$$

Here, $G$ is a $5 \times (nm + 1)$ matrix with a typical column $a_t$, where the general form of $a_t$ is
$$a_t = \Big(1,\ ij,\ \frac{ij(j-1)}{2},\ \frac{ij(i-1)}{2},\ \frac{ij(i-1)(j-1)}{4}\Big)', \tag{8.8}$$
for integers $1 \le i \le n$ and $1 \le j \le m$.
for integers 1 ≤ i ≤ n and 1 ≤ j ≤ m.
For the matrix $G$ defined above, a vector of consistent Bonferroni summations $b$, and a vector of coefficients $c$, denote the vector $w = (w_0, w_1, w_2, w_3, w_4)'$, which may depend on $m$, $n$, but not on the values of $S_{11}, S_{12}, S_{21}, S_{22}$.

Theorem 8.3 The value $w'b$ is a lower bound for $P(v_1 \ge 1, v_2 \ge 1)$ for all probability spaces if and only if $w'A \le c'$ (each element of the vector $w'A$ is not greater than the corresponding element in the coefficient vector $c'$).

The above theorem leads to the existence of degree-two optimal probability


lower bound for the occurrence of at least one joint event, as stated below.
More details on the proofs of the theorem can be found in Chen(2014)[22].

Theorem 8.4 For a set of consistent Bonferroni summations S11 , S12 , S21 ,
and S22 , the linear programming lower bound for P (v1 ≥ 1, v2 ≥ 1) always
exists.

Now that the existence is proved above, the following describes a perturbation method to find the bivariate optimal lower bound. If the condition in Theorem 8.3 is satisfied, the optimal lower bound is found. However, it is not always true that $x_B = B^{-1}b \ge 0$. When $x_B = B^{-1}b \not\ge 0$, we need to find an alternative approach to reach the linear programming bound. This is technically more involved in linear programming. In the following, we provide details on the alternative approach (an iteration algorithm) and show that the algorithm can theoretically reach the existence condition after finitely many iterations.

When $B^{-1}b \not\ge 0$, Theorem 8.3 cannot be applied to find the linear programming lower bound. We can update the matrix $B$ by replacing one of the columns in $B$ with a column from the set of columns of $G$ to form a new $B$ matrix, and denote the updated $B$-matrix as $B_1$. If $B_1^{-1}b \ge 0$, by Theorem 8.3, we find the linear programming bound. If $B_1^{-1}b \not\ge 0$, a column in $B_1$ is replaced to form a new $B$-matrix, denoted as $B_2$. In this way, we initiate an iteration process.
The crux of the device now is in selecting the proper column from $G$ to form the optimal point. To this end, we use a perturbation device to show that once the column for removal/replacement is selected appropriately, we can achieve the solution for a linear programming lower bound.
The introduction of the $\epsilon$ below, ostensibly a perturbation device, is necessary to show that cycling does not occur in the iterative procedure aimed at
arriving at an optimal lower bound by a linear programming implementation.
For any positive value $\epsilon > 0$, define a vector $c_B(\epsilon)$ as follows. Let $\Gamma = \{c(\epsilon)\}$, where $c(\epsilon) = (c_1(\epsilon), c_2(\epsilon), \dots, c_{nm+1}(\epsilon))_{(nm+1)\times 1}$ is any vector of $\epsilon$'s satisfying
$$c_1(\epsilon) = 1, \qquad c_i(\epsilon) = \epsilon^{k(i)} \text{ for some } k(i) > 0 \text{ depending on } i > 1, \qquad c_i(\epsilon) \ne c_j(\epsilon) \text{ for } i \ne j. \tag{8.9}$$

For example, the function of $\epsilon$ can take any one of the following forms:
$$c_i(\epsilon) = \epsilon^{i}, \quad i \ne 1, \qquad c_1(\epsilon) = 1,$$
or
$$c_i(\epsilon) = \epsilon^{i+7}, \quad i \ne 1, \qquad c_1(\epsilon) = 1.$$


The function $c(\epsilon)$ will be fixed throughout the iteration process. The possibility of different explicit forms of $c(\epsilon)$ is key in showing that it is not necessary to worry about perturbation terms in $\epsilon$.
Now, we can use columns from $G$ to form $B$, denoted as $a_1, a_{t_2}, \dots, a_{t_5}$; then $c_B(\epsilon) = (1, c_{t_2}(\epsilon), \dots, c_{t_5}(\epsilon))'$, a $5 \times 1$ vector formed by selecting the corresponding $c_i(\epsilon)$'s from the $(nm+1) \times 1$ vector $c(\epsilon)$ introduced above.
With the vector $c_B(\epsilon)$ defined above, we have
$$c_B(\epsilon)'\, B^{-1} a_k \ge c_k(\epsilon), \ k \ne 1; \quad \text{and} \quad c_B(\epsilon)'\, B^{-1} a_1 = 1, \tag{8.10}$$
where $a_1, \dots, a_k$ are columns of the matrix $G$.

Assume the matrix $B$ defined above satisfies the conditions specified in (8.10) for any $0 < \epsilon \le \epsilon_0$. If the vector $x_B = B^{-1}b \ge 0$, then there exists a lower bound of $P(v_1 \ge 1, v_2 \ge 1)$ for any probability space $p$, and there exists a probability space in which this lower bound achieves equality.
Denote
$$B_1 = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 \\
0 & ab & (a+1)b & a(b+1) & (a+1)(b+1) \\
0 & \frac{ab(b-1)}{2} & \frac{(a+1)b(b-1)}{2} & \frac{ab(b+1)}{2} & \frac{(a+1)(b+1)b}{2} \\
0 & \frac{ab(a-1)}{2} & \frac{(a+1)ab}{2} & \frac{a(a-1)(b+1)}{2} & \frac{(a+1)(b+1)a}{2} \\
0 & \frac{ab(b-1)(a-1)}{4} & \frac{(a+1)ab(b-1)}{4} & \frac{ab(a-1)(b+1)}{4} & \frac{ab(a+1)(b+1)}{4}
\end{pmatrix};$$

then, as shown in Chen (2014) [22], there exists a positive value $\epsilon_1$ so that the matrix $B_1$ satisfies condition (8.10) for all $0 < \epsilon < \epsilon_1$.
Notice that the inverse matrix of $D$ can be decomposed as follows:

$$D^{-1} = B^{-1} - \frac{1}{y_{rjk}}\begin{pmatrix} y_{1jk} \\ \vdots \\ y_{5jk} \end{pmatrix} s_r' + \frac{1}{y_{rjk}}\, e_r\, s_r',$$
where $s_r'$ denotes the $r$th row of $B^{-1}$ and $e_r$ is the vector with 1 in the $r$th position and 0 elsewhere.

Let $n$ be the total number of iterations before reaching the optimal solution. If $\theta(\epsilon) < 0$ and condition (8.10) persists for each $B$ matrix in the iteration process, then $n < \infty$.
Proof: Put $z_{rb} = s_r'\, b$. We have
$$D^{-1}b = B^{-1}b - \frac{z_{rb}}{y_{rjk}}\begin{pmatrix} y_{1jk} \\ \vdots \\ y_{5jk} \end{pmatrix} + \frac{z_{rb}}{y_{rjk}}\, e_r.$$
Thus,
$$c_D(\epsilon)'\, D^{-1}b = c_B(\epsilon)'\, B^{-1}b - \theta(\epsilon)\, s_r'\, b < c_B(\epsilon)'\, B^{-1}b,$$
since $\theta(\epsilon) < 0$ and $s_r'\, b < 0$ for $x_r < 0$.


Thus f (B) = cB () B−1 b decreases strictly at the next iteration, for 0 <
 < 0 (B). The number of permissible B s is however, finite, (at most nm(nm−
1)(nm − 2)(nm − 3)), so there will be a stage in the iteration process where
no strict decrease is possible. But strict decrease is possible at the next stage
for any B satisfying (8.10) and B−1 b ≥ 0. Thus at some iteration B−1 b ≥ 0.

This means that after a finite number of iterations (with the first column of $B$ never being changed), there exists a matrix $B$ with an associated $\theta(\epsilon)$ such that $B^{-1}b \ge 0$ for all $0 < \epsilon < \epsilon_0(B)$.
Summarizing the above discussion, we have the following conclusions.

• For any $B$ satisfying (8.10) with $x_B = B^{-1}b \not\ge 0$, let $r$ represent the position of the smallest negative element of $x_B = B^{-1}b$; then $r > 1$. Furthermore, there exists a set of vectors $H(B) = \{a_{j_1}, \dots, a_{j_t}\}$, for some $t \ge 1$, in $A$, such that $s_r'\, a_{j_1} < 0, \dots, s_r'\, a_{j_t} < 0$, and $b_i \notin H(B)$, where $b_i$ is the $i$th column of $B$.

• For a matrix $B$ satisfying (8.10), if $x_B = B^{-1}b \not\ge 0$, the corresponding $\theta(\epsilon) < 0$ for all $0 < \epsilon < \epsilon_2$, with a value $\epsilon_2 = \epsilon_2(B) > 0$. Also, the vector $a_{j_k}$, which is the maximizing vector $a_p$, $p \in \{j_1, \dots, j_t\}$, associated with $\theta(\epsilon)$ for $0 < \epsilon < \epsilon_2(B)$, is uniquely determined.

Chen (2014) [22] also shows the following results to ensure the smooth operation of the iterative process.

Theorem 8.5 The condition $\theta(\epsilon) < 0$ holds in each iteration, which means that degeneracy does not occur in the iteration process. In the iteration process, when we sequentially reach a matrix $B$ such that $B^{-1}b \ge 0$, an optimal solution is found and the process is stopped. If the associated $B^{-1}b \not\ge 0$, condition (8.10) persists for every $B$ used in each iteration, with a corresponding $\epsilon_0(B) > 0$.

More details on the theory of perturbation method can be found in Chapter


4 of [22].

Summary: This chapter discusses the method of support vector machine, a classification methodology that does not rely on model assumptions for the distribution of the data. We start with the simplest case of the maximal margin classifier, where the data can be linearly separated into two distinct categories. In the scenario where some observations cannot be conveniently classified, we discuss the method of the support vector classifier, a linear classifier with permitted misclassified support vectors. When the kernel function of the classification is extended to a non-linear function with misclassified support vectors, the method of support vector machine (SVM) is discussed.
The methodology of SVM is grounded upon the duality theorem in the
process of seeking the optimal solution for a linear programming problem
[39]. In the case where explicit feasible solution is not available, we illustrate a
perturbation method to screen for the optimal solution. The overall idea works

similarly to the greedy search algorithm. We add a perturbation term in each


step for a local optimization, then find the shortest way to reach the largest
increase of the distance measure repeatedly, and prove that the iteration will
converge to the global optimal solution.
References for the optimal lower probability bound using the Bonferroni summations can be found in [17], [22], [34], [50], [62], [65], [71], [100], among others. These bounds can be improved with the duality theorem as discussed in this chapter. When high-dimensional probability bounds are of interest, the bivariate inequality can be extended to a multivariate version in linear programming. Related references include [8], [22], [32], [31], [33], [71], [104], [106], among others. Without the condition of linear combination on Bonferroni summations, the perturbation method can be expanded to improve non-linear bounds such as inequalities documented in [22], [19], [63], [64], [87], [89], [101], among others. The derivation for optimal solutions in SVM is similar to the derivation of the linear programming bounds in probability inequalities. More applications in this regard can be found in [22], [61], [90], [103], [107], and [108], among others.
9
Decision Trees and Range Regressions

Consider the prediction of an outcome (numerical or categorical variable) on


the basis of a predictor vector featured by p variables, x = (X1 , ..., Xp ) .
Instead of assuming a point-wise relationship between y and x such as in
linear regression
y = α + β  x + ,
the method of decision trees or range regression seeks the connection between
a proxy of Y and a partition of the feature space x ∈ Rp . We start with an
illustrative example.

Example 9.1 Age impact on systolic blood pressure. Assume that we


are interested in predicting the systolic blood pressure using patient informa-
tion on age and gender. Denote X1 , X2 , and y as age, gender, and systolic
blood pressure, respectively, with data information in Table 9.1 below.

TABLE 9.1
Systolic blood pressure dataset
Age Gender Systolic BP
20 M 112
17 M 102
19 F 138
15 F 142
40 M 164
53 M 158
51 F 153
42 F 167

Using linear regression, the fitted model based on the data in Table 9.1 reads
$$\text{Systolic BP} = 106.5 + 1.106 \times \text{age} + \varepsilon.$$
The above linear regression line is easy to interpret. It implies that, on average,
the SBP increases 1.106 units per year of age increase. This reflects the possible
effect of bulging veins (when people get older, the valves in the veins may wear
out and result in improper blood flow in the extremities back to the heart).
However, it is not necessary that the change of SBP is based on per unit


increase of age. According to the model, an eighty-year-old patient would have, on average, an SBP at the level of
$$106.47 + 1.106 \times 80 = 194.95,$$
which is obviously unrealistic. In fact, it would be more plausible to assume
that for a certain age range, the SBP changes to a common value. For instance,
the CDC website states the following relationship.

age range     systolic blood pressure
14-18         90-120
19-40         95-135
41-60         110-145
61-older      95-145

The above idea can be formulated into the following equation:
$$y = f(X) = \sum_{i=1}^{k} c_i\, I(X, R_i),$$
where $R_1, \dots, R_k$ are age intervals of the patient, and
$$I(X, R_i) = \begin{cases} 1 & \text{if } X \in R_i \\ 0 & \text{if } X \notin R_i. \end{cases}$$
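A minimal R sketch of this piecewise-constant prediction rule follows; the age breakpoints follow the table above, while the common SBP values $c_i$ are hypothetical.

sbp.predict <- function(age) {
  breaks <- c(14, 19, 41, 61, Inf)     # age intervals R_1, ..., R_4
  c.i    <- c(105, 115, 128, 120)      # hypothetical common values c_i
  c.i[findInterval(age, breaks)]
}
sbp.predict(c(16, 30, 50, 70))         # one prediction per age range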

Definition 9.1 Consider a prediction problem with a response variable Y


and predictor x = (X1 , ..., Xk ) ∈ S, where S is the feature space. A binary
splitting on element Xj , j ∈ {1, ..., k}, is a numerical value s that splits the
feature space S into two sets
$$R_1(j, s) = \{x\,|\,X_j < s\} \quad \text{and} \quad R_2(j, s) = \{x\,|\,X_j \ge s\},$$
so that
$$R_1(j, s) \cup R_2(j, s) = S.$$
The concept of binary splitting forms a partition of the feature space S.
With that, we can define a decision tree as follows.
Definition 9.2 Consider a prediction problem with a response variable Y and
predictor x ∈ S. Let R1 , .., Rk be a partition of the feature space generated
by recursively splitting the domain of elements of x, with optimization of the
homogeneity measure at each step of the binary split. A decision tree is a rela-
tionship that divides the feature space into groups according to a homogeneity
measure,
$$\hat{f}(X) = \sum_{i=1}^{k} c_i\, I(X, R_i),$$
where $c_i$ is the common outcome for $x$ with features in the partition $R_i$.

We introduced the concept of a homogeneity measure in the above definition. When the response variable is continuous, the usual mean squared error
$$\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
is the homogeneity measure for the closeness between the observed and the predicted outcomes. However, when the response is categorical, the value of $(\text{white} - \text{green})^2$ has no meaning for comparisons. Even when we label the categories with numbers, such as 1 for red, 2 for green, and 3 for white, the outcome
$$(1 - 2)^2 < (1 - 3)^2$$
does not represent the color differences among red, green, and white. In this case, we need to use other homogeneity measures such as the Gini index or entropy described in Section 9.2.
The selection of homogeneity measure is dictated by the type of response
variable Y . In what follows in this chapter, we shall redirect our discussion
to regression trees and classification trees, separately. The chapter will be
concluded by a discussion on range regression, an extension of binary splitting
in regression trees into multiple splitting.

9.1 Regression Trees and UMVUE


When the response variable is continuous, distance is the usual measure of homogeneity. Features with similar responses can be measured by the RSS (residual sum of squared errors),
$$RSS = \sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2,$$
where $R_1(j, s) = \{x\,|\,X_j < s\}$ and $R_2(j, s) = \{x\,|\,X_j \ge s\}$.


Notice that a regression tree is formed via recursive binary splitting of the
feature space with RSS as the homogeneity measure. We use the following hy-
pothetical example to illustrate the process in the construction of a regression
tree.

Example 9.2 Consider the following simple data set for the construction of a regression tree.

x    1.5   2.6   5.1   9.2
y    2     1     10    20

In the first step, the possible splitting points for the binary splitting of $x$ are
$$\frac{1.5 + 2.6}{2} = 2.05, \quad \frac{2.6 + 5.1}{2} = 3.85, \quad \frac{5.1 + 9.2}{2} = 7.15,$$
and the corresponding RSS values read

splitting point    2.05     3.85    7.15
RSS                181.29   50.5    48.67
Since the splitting point at 7.15 has the smallest RSS, the first binary splitting is
$$R_1 = \{x\,|\,x < 7.15\}, \qquad R_2 = \{x\,|\,x \ge 7.15\}.$$
After the first binary splitting, since $R_2$ contains only one observation, the second binary splitting occurs at a value $s_2$ that minimizes the RSS among the remaining three observations. Similar to the first split, we have

s_2    2.05   3.85
RSS    40.5   0.5

Thus, the second binary split is $s_2 = 3.85$.
Using R packages, the constructed tree can be seen in Figure 9.1.
It should be noted that for each partition of the feature space, when the distribution family of the sample mean of $y$ is a complete distribution family, the UMVUE (uniformly minimum variance unbiased estimator) of the mean response at each terminal node of the regression tree is
$$\hat{c}_i = \bar{y}_i = \frac{1}{|R_i|}\sum_{x_j \in R_i} y_j,$$
which is the average of all responses that share the corresponding features $x \in R_i$.
By the consistency of UMVUE and the law of large numbers, when the
number of observations in each terminal node is large enough, the sample
mean can always approximate the population mean. However, in practice, we
do not always have a large sample size in each terminal node in regression
trees. It should be noted that with limited sample sizes, the sample mean is
not always the UMVUE of the mean parameter, as shown in the following
example.

Example 9.3 Consider a random sample $X_1, \dots, X_n$ from a Uniform model $U(0, \theta)$. $E(X) = \frac{\theta}{2}$, so the sample mean $\bar{X}$ is an unbiased estimate of $\frac{\theta}{2}$. However, $\bar{X}$ is not complete in this case, so it is not the UMVUE of the population mean.
On the other hand, the estimator $T = X_{(n)}$, the largest order statistic, is complete and
$$E(T) = \frac{n}{n+1}\,\theta,$$

# Data from Example 9.2
x <- c(1.5, 2.6, 5.1, 9.2)
y <- c(2, 1, 10, 20)
library(rpart)
tt <- data.frame(x, y)
# minsplit = 2 allows splits on this tiny data set
rt <- rpart(y ~ x, minsplit = 2, data = tt, method = "anova")
par(xpd = NA)
plot(rt)
text(rt, use.n = FALSE, all = FALSE)

FIGURE 9.1
Hypothetical data for regression tree construction

thus the UMVUE of the population mean $\theta/2$ is
$$\hat{\mu} = \frac{\hat{\theta}}{2} = \frac{n+1}{2n}\, X_{(n)},$$
which is not the sample mean $\bar{X}$.

Thus, in the construction of regression trees for continuous responses, on


the basis of the characteristics of the response, properly adjusting the esti-
mate for the constant in the terminal node will increase the accuracy and
convergence rate for the regression tree.
Another key step in the construction of a regression tree concerns when to stop the binary splitting of the feature space. The continuation of
binary splitting will eventually break the data points into individual cells
where each terminal node contains only one observation. This is not desirable
because it over-fits the relationship between the response Y and the feature x.
To this end, the construction of a regression tree necessitates a cost-complexity
pruning step. Notice that for each value of α and a tree T0 , there exists a

subtree $T \subset T_0$ such that
$$R = \sum_{m=1}^{|T|}\sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha|T|$$

is as small as possible, where Rm is the m-th terminal node. Hence, T is the


selected tree for the data. Here, the value of α can be determined using K-fold
cross-validation.
1. Select a set of $\alpha$ values as candidates. For each $\alpha$ value, do the following.
2. For any positive integer $j \le K$, use recursive binary splitting to grow a tree on the cross-validation data, which is the training data except the $j$th fold.
3. Use cost-complexity pruning to find a sub-tree $T$ so that the term
$$R = \sum_{m=1}^{|T|}\sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha|T|$$
is the smallest possible value, where $R_m$ is the $m$th terminal node.
4. Compute the mean squared prediction error on the data in the left-out $j$th fold.
5. Average the mean squared prediction errors after running all $K$ folds in the cross-validation process.
6. After computing the average cross-validated mean squared prediction error for all candidate $\alpha$ values, pick the $\alpha$ level associated with the smallest average mean squared prediction error (a minimal R sketch of this procedure is given below).
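A minimal R sketch of this procedure using rpart, whose cptable reports the cross-validated error over a grid of $\alpha$ (cp) values; the simulated data frame tt below is hypothetical.

library(rpart)
set.seed(1)
tt <- data.frame(x = runif(100, 0, 10))
tt$y <- 10 * (tt$x > 7) + rnorm(100)          # hypothetical response
fit <- rpart(y ~ x, data = tt, method = "anova",
             control = rpart.control(xval = 10, cp = 0))
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best.cp)           # sub-tree with smallest CV error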

9.2 Classification Tree


For a dataset $(x, y)$ where $y$ is a categorical variable representing different classes, such as Red, Black, Green for colors, or Hypertension, Diabetes, Stroke, Cancer for diseases, the method of regression trees cannot be directly applied. This is because the Euclidean distance (mean squared prediction error) is no longer a valid measure of category homogeneity. For instance, when we denote Red = 1, Green = 2, and Black = 3, the error of misclassifying Red as Green (which reads, in MSE terms, $(1-2)^2 = 1$) is not necessarily less severe than the error of misclassifying Red as Black (which reads $(1-3)^2 = 4$). In the literature, homogeneity measures for classification trees include the misclassification rate, the Gini index, and entropy. Toward this end, the decision tree definition for classification problems can be adjusted as follows.

Definition 9.3 Consider a prediction problem with response variable Y ∈ {1, ..., K} and predictor x ∈ S. Let R_1, ..., R_m be a partition of the feature space generated by recursively splitting the domain of elements of x, optimizing the homogeneity measure at each step of the binary split. A classification tree is a relationship that divides the feature space into groups according to a homogeneity measure (misclassification rate, Gini index, or entropy),

f̂(X) = Σ_{i=1}^{m} c_i I(X, R_i),

where c_i ∈ {1, ..., K} is the proxy for the common value of observations x with features in the partition R_i.

9.2.1 Misclassification Rate


For classification problems, let p̂_jk denote the proportion of training observations in the j-th region that belong to the k-th class, k ∈ {1, 2, ..., K}, in each binary splitting, j = 1, 2. The misclassification rate for region j is

E_j = 1 − max_{k∈{1,...,K}} p̂_jk.

Example 9.4 When SBP (systolic blood pressure) is used to diagnose DVT (deep vein thrombosis), assume that the features of 20 patients under study can be described according to the following table.

SBP vs DVT     DVT (Yes)   DVT (No)
SBP < 120          5           7
SBP ≥ 120          6           2

Out of the total of 20 patients, when 120 is selected as the threshold in the binary splitting of SBP, patients with SBP < 120 form group j = 1, and patients with SBP ≥ 120 form group j = 2. Of the 12 patients with SBP < 120, 5 have DVT (k = 1) and 7 do not (k = 2). Thus we have, for j = 1,

p̂_11 = 5/12,   p̂_12 = 7/12,

and

E_1 = 1 − max(p̂_11, p̂_12) = 1 − max(5/12, 7/12) = 5/12.

So Region-1 is labeled as no-DVT with misclassification rate 5/12.
As for Region-2, j = 2, we have

p̂_21 = 6/8,   p̂_22 = 2/8,

and

E_2 = 1 − max(6/8, 2/8) = 2/8.

So Region-2 is labeled as DVT with misclassification rate 2/8. Therefore, the overall misclassification rate for the binary split at SBP = 120 reads

E = (5/12)(12/20) + (2/8)(8/20) = 7/20.

In fact, when we use the majority vote at each terminal node to determine the common class for the node, the overall misclassification rate is simply the total proportion of misclassified observations across the two regions of the binary splitting.
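These computations are easy to automate. A minimal R sketch (the function name is ours) that takes the table of class counts, with rows for the two regions and columns for the classes:

# Misclassification rate of a binary split: in each region the majority
# class is kept, and the remaining counts are errors.
misclass_rate <- function(counts) {
  sum(rowSums(counts) - apply(counts, 1, max)) / sum(counts)
}

counts <- rbind(c(5, 7), c(6, 2))   # the SBP/DVT table above
misclass_rate(counts)               # 7/20 = 0.35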

9.2.2 Gini Index


Instead of directly using the misclassification rate as a measure of the homogeneity in each terminal node, another measure is the Gini index, which measures the variation (or node purity) in the binary splitting during the construction of a classification tree. For each binary splitting, let p̂_jk denote the proportion of training observations in the j-th region that are from the k-th class, k ∈ {1, 2, ..., K}; the Gini index reads

Ĝ = Σ_{k=1}^{K} p̂_jk (1 − p̂_jk).

For the n_j observations in the terminal node, let X_ijk = 1 with probability p_jk in the j-th region for observation i belonging to Category k. We have

Var(X_ijk) = p_jk (1 − p_jk),   i = 1, ..., n_j,

and the population-wise Gini index for region j reads

G(j) = Σ_{k=1}^{K} p_jk (1 − p_jk) = Σ_{k=1}^{K} Var(X_ijk).

Thus, for each splitting, the Gini index G(j) of region j is essentially the sum of the variances of the categories in the region. In general, the Gini index for a value s in a binary splitting reads

Gini(s) = (n_1/n) G(1) + (n_2/n) G(2),

where n_1 denotes the number of observations satisfying X_j < s, and n_2 denotes the number of observations satisfying X_j ≥ s.
The Gini index takes on a small value if all of the class proportions p̂_jk are close to zero or one. It is a measure of node purity: a small value indicates that a node contains predominantly observations from a single class.

Example 9.5 Using the SBP-DVT classification in Example 9.4, the Gini index for the split at SBP = 120 reads as follows. When j = 1,

Gini(1) = (5/12)(1 − 5/12) + (7/12)(1 − 7/12) = 2 (5/12)(7/12);

when j = 2,

Gini(2) = (6/8)(1 − 6/8) + (2/8)(1 − 2/8) = 2 (6/8)(2/8),

and the Gini index at SBP = 120 reads

Gini(120) = 2 (5/12)(7/12)(12/20) + 2 (6/8)(2/8)(8/20) = 53/120.
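A matching R sketch (function name ours), reusing the class-count layout from the misclassification example:

# Gini index of a binary split: weighted sum of the per-region Gini values.
gini_split <- function(counts) {
  region_gini <- apply(counts, 1, function(r) { p <- r / sum(r); sum(p * (1 - p)) })
  sum(region_gini * rowSums(counts) / sum(counts))
}

gini_split(rbind(c(5, 7), c(6, 2)))   # 53/120 = 0.4417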
12 12 20 8 8 20 120

9.2.3 Entropy
Among the previous measures of homogeneity in classification, the misclassification rate measures node homogeneity through the proportion of misclassified observations in a node. A small value of the misclassification rate indicates that the node is dominated by observations from a single class. The Gini index, on the other hand, focuses on node purity via the variance of each class in the region. A small value of the Gini index implies that more observations in the node are from a single class. Besides these two measures of homogeneity, another commonly applied homogeneity measure is entropy.
Entropy is a concept used in information theory, where a higher probability of the occurrence of an event implies less information obtained when the event occurs, and a lower probability implies more information obtained when the event occurs. It is a useful concept in coding and decoding a signal process. The form −p log(p) stems from the fact that the function

f(p) = −c log_a(p)

for constants a and c is the only function satisfying the following three conditions in information theory.
1 f(x) is a monotonically decreasing function of x;
2 f(x) is continuous in x;
3 f(p_1 × p_2) = f(p_1) + f(p_2), which implies that the information of two independent events is the sum of the individual information of each event.
On the basis of the above conditions, when I(x) = −log(P(X = x)) is used for the self-information of the element x, the entropy of a discrete random variable X with probability mass function p_X is defined as the expected value of the information belonging to each basic element,

Entropy(X) = E(I(X)) = Σ_x (−log(P(X = x))) P(X = x).
x

For each binary splitting, let p̂_jk denote the proportion of training observations in the j-th region that are from the k-th class, k ∈ {1, 2, ..., K}; we have the entropy

D̂ = − Σ_{k=1}^{K} p̂_jk log(p̂_jk).

Since the class proportions are between 0 and 1, the entropy in a binary split is non-negative. It takes a value near zero if the class proportions are all near zero or near one.
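A matching R sketch (function name ours); base-10 logarithms are used here because they reproduce the numerical values in the example below:

# Entropy of a binary split: weighted sum of the per-region entropies,
# with 0 * log(0) taken as 0.
entropy_split <- function(counts) {
  region_entropy <- apply(counts, 1, function(r) {
    p <- r / sum(r); p <- p[p > 0]
    -sum(p * log10(p))
  })
  sum(region_entropy * rowSums(counts) / sum(counts))
}

entropy_split(rbind(c(5, 7), c(6, 2)))   # 0.2747 for the SBP/DVT split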
It should be noted that the three measures of node homogeneity capture different aspects of a split in the construction of the classification tree, which should be integrated into the interpretation. We use the following example to comprehensively explain the three measures of node homogeneity in the construction of classification trees.

Example 9.6 Consider a sample of 800 objects, featured by a variable x, to be classified into two categories A and B.

Number      300    100    200    100    100
Category     A      A      B      B      B
x           9.1   10.9   29.1   10.9    9.1

On the basis of the values of x, the candidate splitting points are x = 10 and x = 20.

Case-1: Using the misclassification rate as the measure of node homogeneity
At x = 10, we have
• when x < 10, the corresponding outcomes include A-300 and B-100; the majority vote classifies the node as Category-A.
• when x ≥ 10, the corresponding outcomes include A-100 and B-300; the majority vote classifies the node as Category-B.
The overall misclassification rate for splitting at x = 10 reads

[100 (misclassification of B as A) + 100 (misclassification of A as B)]/800 = 0.25.

Alternatively, for x < 10, j = 1, k ∈ {A, B}, p̂_1A = 3/4 and p̂_1B = 1/4,

E_1 = 1 − max(3/4, 1/4) = 1/4;

for x ≥ 10, j = 2, k ∈ {A, B}, p̂_2A = 1/4 and p̂_2B = 3/4,

E_2 = 1 − max(1/4, 3/4) = 1/4,

and the overall misclassification rate reads

(1/4)(400/800) + (1/4)(400/800) = 1/4.
At x = 20, we have
• when x < 20, the corresponding outcomes include A-400 and B-200; the majority vote classifies the node as Category-A.
• when x ≥ 20, the corresponding outcomes include A-none and B-200; the majority vote classifies the node as Category-B.
The overall misclassification rate for splitting at x = 20 reads

[200 (misclassification of B as A) + 0 (misclassification of A as B)]/800 = 0.25.

Alternatively, for x < 20, j = 1, k ∈ {A, B}, p̂_1A = 4/6 and p̂_1B = 2/6,

E_1 = 1 − max(2/3, 1/3) = 1/3;

for x ≥ 20, j = 2, k ∈ {A, B}, p̂_2A = 0 and p̂_2B = 200/200 = 1,

E_2 = 1 − max(0, 1) = 0,

and the overall misclassification rate becomes

(3/4)(1/3) + (1/4)(0) = 1/4.

There is a tie in the homogeneity measure with misclassification errors at 10 and 20. In this case, the tie can be broken by flipping a fair coin.

Case-2: Using the Gini index as the measure of node homogeneity
At splitting point x = 10, we have
• when x < 10, the corresponding outcomes include A-300 and B-100; the majority vote classifies the node as Category-A. p̂_1A = 300/400.
• when x ≥ 10, the corresponding outcomes include A-100 and B-300; the majority vote classifies the node as Category-B. p̂_2B = 300/400.
For x < 10, j = 1, the Gini index is

G(1) = (3/4)(1 − 3/4) + (1/4)(1 − 1/4) = 3/8.

For x ≥ 10, j = 2, the Gini index is

G(2) = (1/4)(1 − 1/4) + (3/4)(1 − 3/4) = 3/8.

The overall Gini index at splitting point x = 10 reads

Overall Gini = (3/8)(400/800) + (3/8)(400/800) = 3/8.
At splitting point x = 20, we have
• when x < 20, the corresponding outcomes include A-400 and B-200; the majority vote classifies the node as Category-A. p̂_1A = 400/600.
• when x ≥ 20, the corresponding outcomes include A-none and B-200; the majority vote classifies the node as Category-B. p̂_2B = 200/200 = 1.
For x < 20, j = 1, the Gini index is

G(1) = (2/3)(1 − 2/3) + (1/3)(1 − 1/3) = 4/9.

For x ≥ 20, j = 2, the Gini index is

G(2) = (0/200)(1 − 0/200) + (200/200)(1 − 200/200) = 0.

The overall Gini index at splitting point x = 20 reads

Overall Gini = (4/9)(600/800) + 0 (200/800) = 1/3.

Since 3/8 > 1/3 and a lower overall Gini index indicates higher node purity, the splitting point x = 20 is selected.

Case-3: Using entropy as the measure of node homogeneity
At splitting point x = 10 (with logarithms taken to base 10), we have
• when x < 10, the corresponding outcomes include A-300 and B-100; the majority vote classifies the node as Category-A. p̂_1A = 300/400.
• when x ≥ 10, the corresponding outcomes include A-100 and B-300; the majority vote classifies the node as Category-B. p̂_2B = 300/400.
For x < 10, j = 1, the entropy reads

D(1) = −(3/4) log(3/4) − (1/4) log(1/4) = 0.24422.

For x ≥ 10, j = 2, the corresponding entropy reads

D(2) = −(1/4) log(1/4) − (3/4) log(3/4) = 0.24422.

The overall entropy at splitting point x = 10 reads

Overall entropy = 0.24422 (400/800) + 0.24422 (400/800) = 0.24422.
At splitting point x = 20, we have
• when x < 20, the corresponding outcomes include A-400 and B-200; the majority vote classifies the node as Category-A. p̂_1A = 400/600.
• when x ≥ 20, the corresponding outcomes include A-none and B-200; the majority vote classifies the node as Category-B. p̂_2B = 200/200 = 1.
For x < 20, j = 1, the entropy reads

D(1) = −(2/3) log(2/3) − (1/3) log(1/3) = 0.27646.

For x ≥ 20, j = 2, the entropy becomes

D(2) = −(200/200) log(200/200) = 0.

The overall entropy at splitting point x = 20 reads

Overall entropy = 0.27646 (600/800) + 0 (200/800) = 0.20735.

Since 0.24422 > 0.20735 and lower overall entropy indicates higher node class homogeneity, the splitting point x = 20 is selected.
Note that, while the overall misclassification rates are equal for the two splitting points in this example, the Gini index and the entropy are able to distinguish between the two. It should be emphasized that the three homogeneity measures address three different aspects of the binary splitting. In terms of the majority rule, the two splitting points x = 10 and x = 20 have equal homogeneity measures. In terms of node purity and the information conveyed, x = 20 is the better choice when the Gini index or entropy is used.
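The three functions sketched earlier reproduce this comparison directly:

split10 <- rbind(c(300, 100), c(100, 300))   # x < 10 versus x >= 10
split20 <- rbind(c(400, 200), c(0, 200))     # x < 20 versus x >= 20
c(misclass_rate(split10), gini_split(split10), entropy_split(split10))
# 0.2500 0.3750 0.2442
c(misclass_rate(split20), gini_split(split20), entropy_split(split20))
# 0.2500 0.3333 0.2074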

9.2.4 UMVUE for homogeneity in classification trees


Among the three measures of homogeneity in the construction of classification trees, the misclassification rate is essentially the probability of success of a Bernoulli random variable. Thus, the UMVUE of the misclassification rate is simply the sample proportion of the observations falling in the partition R_i.
However, the UMVUE for the Gini index is a different story. Although the plug-in estimate of the Gini index gives

Ĝ = Σ_k p̂_jk (1 − p̂_jk)

for each region j, it should be noted that this estimate is not the uniformly minimum variance unbiased estimator (UMVUE). To see this point, notice that for the sample proportion p̂ with n observations in the node,

E(p̂(1 − p̂)) = E(p̂) − E(p̂²)
            = p − (Var(p̂) + (E(p̂))²)
            = p − p(1 − p)/n − p²
            = p(1 − p)(1 − 1/n)
            = p(1 − p)(n − 1)/n,

which is not unbiased for p(1 − p).
In this regard, we have the following theorem for the UMVUE of the Gini index.

Theorem 9.1 Assume that there are n_j observations in branch j, j = 1, 2, of a binary splitting. Let X_ijk, i = 1, 2, ..., n_j, denote the event that observation i is in class k, so that Σ_{i=1}^{n_j} X_ijk is the total number of observations belonging to class k in region j. The UMVUE of the Gini index in region j reads

Ĝ(j) = Σ_{k=1}^{K} (Σ_{i=1}^{n_j} X_ijk / n_j) (1 − (Σ_{i=1}^{n_j} X_ijk − 1)/(n_j − 1)).

Proof: Consider the random variable T = Σ_{i=1}^{n_j} X_ijk in each node. Since X_ijk is a Bernoulli random variable with probability p_jk, T follows a binomial model, which is a complete distribution family. Furthermore, T is also a sufficient statistic for p_jk, and T/n_j is the UMVUE of p_jk. What we need to find is the UMVUE of p²_jk.
For any two of the random variables in the node, notice that

E(X_1jk X_2jk) = p²_jk;

we have, by the Rao-Blackwell theorem, that the UMVUE of p²_jk is

E(X_1jk X_2jk | T) = P(X_1jk = 1, X_2jk = 1 | T = t)
                  = P(X_1jk = 1, X_2jk = 1 and T = t)/P(T = t)
                  = C(n_j − 2, t − 2) / C(n_j, t).

Thus, the UMVUE of p_jk(1 − p_jk) is T/n_j − C(n_j − 2, T − 2)/C(n_j, T) = (T/n_j)(1 − (T − 1)/(n_j − 1)), and the UMVUE of the Gini index in region j is

Ĝ(j) = Σ_{k=1}^{K} (Σ_{i=1}^{n_j} X_ijk / n_j) (1 − (Σ_{i=1}^{n_j} X_ijk − 1)/(n_j − 1)).
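A small numerical sketch of the correction (function names ours): for class counts t_k in one region, the plug-in and UMVUE estimates differ by the factor n/(n − 1).

gini_plugin <- function(tk) { p <- tk / sum(tk); sum(p * (1 - p)) }
gini_umvue  <- function(tk) {
  n <- sum(tk)
  sum((tk / n) * (1 - (tk - 1) / (n - 1)))
}

tk <- c(5, 7)        # Region-1 of the SBP/DVT example
gini_plugin(tk)      # 0.4861
gini_umvue(tk)       # 0.5303 = 0.4861 * 12/11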

9.3 Extending regression trees to range regression


As discussed in the previous sections, the decision tree uses the method of recursive binary splitting to create a partition of the feature space. It is convenient for interpretation and programming applications. However, there is a possibility that the optimal partition may not be obtained via binary splitting. For instance, other combinations of explanatory features may create a different partition of the feature space that provides a better prediction of the response variable. One way to explore this direction is the method of range regression, which directly improves the method of binary splitting in regression trees.

Definition 9.4 Consider a prediction problem with response variable Y and predictor x ∈ S. Let R_1, ..., R_k be a partition of the feature space with optimization of the homogeneity measure. A range regression is a relationship that divides the feature space into parallel groups with homogeneity measures,

f̂(X) = Σ_{i=1}^{k} c_i I(X, R_i),

where c_i is the common mean outcome for x with features in the partition R_i.

Notice that, instead of making recursive binary splittings, range regression uses ranges as partitions of the feature space to establish the relationship between the outcome Y and the feature variable x. When R_1, ..., R_K are selected as the ranges of the predictor, and c_i is the mean response in the corresponding range, the overall regression method of choice is range regression.
Range regression is an analytic approach that aims to separate the sources of data variation. When the original data cloud appears to contain large variations, averaging out the within-group (range) variation makes the association between the response and predictor variables observable. To visually depict this point, we use an illustrating example of the relationship between residual thrombus and post-thrombotic classification of venous diseases (see, for example, [37], [35], and [24]).
The study focuses on whether the quantity of clot lysed in female IFDVT (iliofemoral deep vein thrombosis) patients correlates with their quality of life after the operation. Equivalently, the problem amounts to evaluating the association between the CEAP score (a measurement of clinical outcomes) and residual thrombus (quantitative thrombolysis).
As shown in the plot in Figure 9.2, the conventional approach of linear regression fails to catch the association between the two variables. This is partly caused by sample fluctuations and unexpected variabilities in the patient population. The female CEAP scores vary greatly across different levels of residual thrombus.

FIGURE 9.2
Scatter plot of female CEAP score vs residual thrombus: post-thrombotic syndrome and linear regression

The diagram indicates that two random sources affect the outcome variable. This can be formulated as
Y_i = α + βX_i + ε_i + η_i,

where Y denotes the CEAP score and X the residual thrombus; ε_i and η_i denote two different sources of random fluctuation for patient i. It should be noted that ε_i is the normal fluctuation of the patients around the mean CEAP scores, whereas η_i represents all the random sources associated with each range of residual thrombus. Obviously, the diagram shows that the change of the outcome variable (CEAP score) cannot be adequately explained by residual thrombus alone.
Figure 9.2 indicates that at the same level of residual thrombus, some
patients have higher CEAP scores while others have relatively lower CEAP
scores. The trend between the two variables of interest (CEAP scores and
residual thrombus) is not observable.
To focus on subject variability for patients with similar amounts of clot lysed, Figure 9.3 shows that, through range regression, stratifying patients with similar amounts of clot lysed into one group identifies a measure that bundles subject variability within each stratum. The linear pattern emerges.

FIGURE 9.3
Plot of female mean CEAP score vs range of residual thrombus, p = 0.00726: post-thrombotic syndrome and range regression
Similar to a regression tree, a range regression can be interpreted as

f̂(X) = Σ_{i=1}^{k} c_i I(X, R_i).

When a range of 10% clot lysed is set as a criterion to stratify patients, it serves as the partition of the feature space on the ranges of clot lysed. Because patients in the same stratum have similar quantities (within 10%) of thrombus removal, range regression averages out the impact of data variability due to confounding factors associated with each range of clot lysed. This is the homogeneity class as discussed in Section 9.1. The individual variability is then bundled with that of other patients in the same percentage range to show the association between clot lysed and CEAP score.
After stratifying, the sample mean of all CEAP scores for patients at the same terminal node (which contains patients having similar amounts of clot lysed) is computed as an indicator for the clinical outcome. For example, denoting Y = mean female CEAP score, X = middle range of residual thrombus, and ε = random effect, we have the fitted regression line

Y = −0.646 + 5.09X + ε

for the female patients in the data set. Since the sample mean is an asymptotically unbiased estimator of the population mean, range regression essentially models the conditional expected value of the CEAP score for each fixed range of residual thrombus as a linear trend in the residual thrombus,

E(CEAP | residual thrombus) = α + β (range of residual thrombus) + ε,

where the effect of the random source η is absorbed by ranging the residual thrombus and averaging out the output variable within each range. As shown in Figure 9.3, the method of range regression successfully reveals the association between the mean response of CEAP scores and the 10% range of residual thrombus.
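A minimal R sketch of this stratify-and-average procedure, assuming a data frame dvt with columns thrombus (residual thrombus in [0, 1]) and ceap (CEAP score); both column names are hypothetical:

# Cut residual thrombus into 10% ranges and average CEAP within each range.
breaks <- seq(0, 1, by = 0.1)
dvt$range <- cut(dvt$thrombus, breaks = breaks, include.lowest = TRUE)
mids <- (head(breaks, -1) + tail(breaks, -1)) / 2

rng <- aggregate(ceap ~ range, data = dvt, FUN = mean)
rng$mid <- mids[as.integer(rng$range)]   # midpoint of each occupied range

fit <- lm(ceap ~ mid, data = rng)        # the range regression line
summary(fit)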
At this point, we shall show a theoretical result on the asymptotic distribution governing the method of range regression. It can also be applied analogously to analyze the error term in each terminal node of a regression tree.
For any random vector (x_i, y_i)ᵀ, i = 1, . . . , n, consider the scenario where x_i falls into one of the k discernible categories T_1, . . . , T_k measured by values s_1, . . . , s_k. For each x_i, there exists a set T_j such that x_i ∈ T_j, where T_j is represented by

s_j = Σ_{x_i ∈ T_j} x_i / #(T_j),

the average of the x_i in Category T_j. Then we have a mapping g(·) such that

g(x_i) = s_j,   i = 1, . . . , n, j = 1, . . . , k.

Let x_{j1}, . . . , x_{jn_j} be the x values corresponding to s_j, j = 1, . . . , k, and denote

z_j = Σ_{l=1}^{n_j} y_{jl} / n_j,

where y_{j1}, . . . , y_{jn_j} are the n_j response values associated with x_{j1}, . . . , x_{jn_j}.

Theorem 9.2 In the j-th range, if there is a linear relationship between X and Y,

Y = α + βX + ε,

where ε follows any distribution, then there exist α_1, β_1, and ε_1 such that

Z_j = α_1 + β_1 S_j + ε_1,   j = 1, ..., k,

and ε_1 asymptotically follows a normal model.



Proof: Let f_j(t) denote the moment generating function of the centralized responses Y_{j1} − μ_j, . . . , Y_{jn_j} − μ_j. Notice that

f_j(t) = E[e^{t(Y_{jk} − μ_j)}],   k = 1, . . . , n_j,        (9.1)

f_j(0) = E(1) = 1,
f_j′(0) = E[(Y_{jk} − μ_j) e^{0(Y_{jk} − μ_j)}] = 0,
f_j″(0) = E[(Y_{jk} − μ_j)² e^{0(Y_{jk} − μ_j)}] = σ_j².

Since the first two moments exist, using the Taylor expansion we have

f_j(t) = 1 + (1/2) σ_j² t² + O(t³).        (9.2)
2
The above expansion implies that the moment generating function of the standardized range mean variable

Ȳ_j = Σ_{k=1}^{n_j} (Y_{jk} − μ_j) / (√n_j σ_j)

reads

f_{n_j}(t) = E[e^{t Ȳ_j}]
          = E[ exp( t Σ_{k=1}^{n_j} (y_{jk} − μ_j) / (√n_j σ_j) ) ]
          = [ f_j( t / (√n_j σ_j) ) ]^{n_j}        (by independence)
          = [ 1 + (1/2) σ_j² (t²/(n_j σ_j²)) + O( (t/(√n_j σ_j))³ ) ]^{n_j}
          = [ 1 + (1/n_j)( t²/2 + n_j O( (t/(√n_j σ_j))³ ) ) ]^{n_j}.

Now, by the Lagrange form of the Taylor remainder,

O( (t/(√n_j σ_j))³ ) = f_j⁽³⁾(θ) t³ / (6 n_j^{3/2} σ_j³),   θ ∈ [0, t/(√n_j σ_j)].

Notice that f_j⁽³⁾(θ) is bounded, since it is a continuous function on a closed interval; we have, for each fixed t,

n_j O( (t/(√n_j σ_j))³ ) = f_j⁽³⁾(θ) t³ / (6 √n_j σ_j³) −→ 0   as n_j −→ ∞.

Therefore,

f_{n_j}(t) −→ e^{t²/2},

since

lim_{n→∞} (1 + a/n)ⁿ = e^a,

and e^{t²/2} is the moment generating function of N(0, 1).


Thus, the standardized range mean variable satisfies

Ȳ_j −→ N(0, 1)   as n_j → ∞,

which means that the unstandardized range mean

Z_j = Σ_{l=1}^{n_j} Y_{jl} / n_j −→ N(μ_j, σ_j/√n_j)   as n_j → ∞.

The second result follows from the fact that

Ȳ_j = (Z_j − μ_j) / (σ_j/√n_j),

and

Z_j = (σ_j/√n_j) Ȳ_j + μ_j.
Now, for z ∈ R^k, the k-dimensional vector z can be expressed as

z = μ + ε,   μ = (μ_1, . . . , μ_k)ᵀ,   ε ∼ N_k(0, Σ),

where Σ is the k × k matrix with σ_1²/n_1, . . . , σ_k²/n_k on the diagonal and 0 off the diagonal.
If x_{ji} = s_j, i = 1, ..., n_j, and there is a linear relationship between μ_j and s_j, say μ_j = a + b s_j, then

Z_j = a + b S_j + ε_j,   j = 1, . . . , k,

represents a linear relationship between s_j and z_j with normal error ε_j, j = 1, . . . , k.
If, on the other hand, x_{ji} ≠ s_j, then

s_j −→ E(X_{ji}) = η_j   as n_j → ∞.

If there is a linear relationship between μ_j and η_j, say μ_j = a + b η_j, then we have

Z_j = a + b(S_j + ζ_j) + ε_j = a + b S_j + ζ_j* + ε_j,

which again represents a linear relationship between S_j and Z_j with a normal error ε_j, j = 1, . . . , k. This concludes the proof of the theorem. More details on this theorem can be found in Kerns and Chen (2016, [75]).
The asymptotic normality of the range regression model legitimizes the application of linear regression analysis. Another significant advantage of range regression is the improvement of the sample correlation coefficient, as shown in the following theorem. This fits well with intuition: once the variation within each range is removed by taking the average, a stronger linear pattern shows up, and the sample correlation coefficient increases.

Theorem 9.3 If the X variable in each range can be represented by a single value x_i and all ranges share the same sample size, then the sample correlation of the range regression data is at least as large as the sample correlation of the original data.

Proof: Consider the original data (x_{ij}, y_{ij}), where i = 1, . . . , k, j = 1, . . . , m_i, and let x_{ij} = x_i for all j. The sample correlation coefficient based on the original data can be calculated as

R̂_1² = { Σ_{i=1}^{k} Σ_{j=1}^{m_i} (x_{ij} − x̄)(y_{ij} − ȳ) }² / ( [Σ_{i=1}^{k} Σ_{j=1}^{m_i} (x_{ij} − x̄)²][Σ_{i=1}^{k} Σ_{j=1}^{m_i} (y_{ij} − ȳ)²] )
     = { Σ_{i=1}^{k} Σ_{j=1}^{m_i} (x_i − x̄)(y_{ij} − ȳ) }² / ( [Σ_{i=1}^{k} Σ_{j=1}^{m_i} (x_i − x̄)²][Σ_{i=1}^{k} Σ_{j=1}^{m_i} (y_{ij} − ȳ)²] ),        (9.3)

since each range has the one representative value x_i, i = 1, ..., k. Now notice that

Σ_{j=1}^{m_i} (y_{ij} − ȳ) = m_i (ȳ_i − ȳ);

we have

{ Σ_{i=1}^{k} (x_i − x̄) Σ_{j=1}^{m_i} (y_{ij} − ȳ) }² / ( [Σ_{i=1}^{k} m_i (x_i − x̄)²][Σ_{i=1}^{k} Σ_{j=1}^{m_i} (y_{ij} − ȳ)²] )
= { Σ_{i=1}^{k} (x_i − x̄) m_i (ȳ_i − ȳ) }² / ( [Σ_{i=1}^{k} m_i (x_i − x̄)²][Σ_{i=1}^{k} Σ_{j=1}^{m_i} (y_{ij} − ȳ)²] ),

where

ȳ_i = (1/m_i) Σ_{j=1}^{m_i} y_{ij}.

Since each range contains the same number of observations, m_i = m, we have

Σ_{i=1}^{k} Σ_{j=1}^{m} (y_{ij} − ȳ)² = Σ_{i=1}^{k} Σ_{j=1}^{m} (y_{ij} − ȳ_i + ȳ_i − ȳ)²
 = Σ_{i=1}^{k} [ Σ_{j=1}^{m} (y_{ij} − ȳ_i)² + m (ȳ_i − ȳ)² ]
 = Σ_{i=1}^{k} Σ_{j=1}^{m} (y_{ij} − ȳ_i)² + m Σ_{i=1}^{k} (ȳ_i − ȳ)²
 ≥ m Σ_{i=1}^{k} (ȳ_i − ȳ)².

Now consider the transformed data (x_i, ȳ_i), where i = 1, . . . , k and ȳ_i is defined as before. Define the overall averages across the k ranges,

x̄* = (1/k) Σ_{i=1}^{k} x_i,
ȳ* = (1/k) Σ_{i=1}^{k} ȳ_i = (1/k) Σ_{i=1}^{k} [ (1/m_i) Σ_{j=1}^{m_i} y_{ij} ].

The sample correlation coefficient based on the transformed data can be calculated as

R̂_2² = { Σ_{i=1}^{k} (x_i − x̄*)(ȳ_i − ȳ*) }² / ( [Σ_{i=1}^{k} (x_i − x̄*)²][Σ_{i=1}^{k} (ȳ_i − ȳ*)²] ).        (9.4)
Notice that the data are balanced, that is, m_i = m for all i = 1, . . . , k; then

x̄ = (1/Σ_{i=1}^{k} m_i) Σ_{i=1}^{k} Σ_{j=1}^{m_i} x_{ij}
  = (1/Σ_{i=1}^{k} m_i) Σ_{i=1}^{k} m_i x_i
  = m Σ_{i=1}^{k} x_i / (km)
  = x̄*,

and similarly,

ȳ = (1/Σ_{i=1}^{k} m_i) Σ_{i=1}^{k} Σ_{j=1}^{m_i} y_{ij}
  = (1/(mk)) Σ_{i=1}^{k} m_i ȳ_i
  = (1/k) Σ_{i=1}^{k} ȳ_i
  = ȳ*.

Equation (9.3) can be simplified further as

R̂_1² = { Σ_{i=1}^{k} (x_i − x̄) m (ȳ_i − ȳ) }² / ( [Σ_{i=1}^{k} m (x_i − x̄)²][Σ_{i=1}^{k} Σ_{j=1}^{m} (y_{ij} − ȳ)²] )
     = { Σ_{i=1}^{k} (x_i − x̄*)(ȳ_i − ȳ*) }² / ( [Σ_{i=1}^{k} (x_i − x̄*)²][(1/m) Σ_{i=1}^{k} Σ_{j=1}^{m} (y_{ij} − ȳ*)²] ).

Notice that the only difference between the above expression and Equation (9.4) lies in the second factor of the denominator: Equation (9.4) has Σ_{i=1}^{k} (ȳ_i − ȳ*)² in place of (1/m) Σ_{i=1}^{k} Σ_{j=1}^{m} (y_{ij} − ȳ*)². From the earlier display and ȳ = ȳ*,

(1/m) Σ_{i=1}^{k} Σ_{j=1}^{m} (y_{ij} − ȳ)² ≥ Σ_{i=1}^{k} (ȳ_i − ȳ*)²,

so we have

R̂_2² ≥ R̂_1².

This completes the proof of the theorem.

SUMMARY Besides the method of support vector machines discussed in the previous chapter for predictions without specific model assumptions, this chapter focuses on the method of regression trees for continuous responses and classification trees for discrete responses. Additionally, we extend the method of regression trees to the method of range regression.
The highlight of this chapter is the analysis of UMVUE in the construction of decision trees. By providing a mathematical formulation for a decision tree, we present the best estimator for the homogeneity measure in the regression tree. We also include an example showing that the sample mean of the responses in each terminal node is not always the best estimator, when the distribution of the data is not complete.
For the Gini index in classification trees, we show that the conventional estimate of the Gini index is not the best, and derive the uniformly minimum variance unbiased estimator of the population Gini index. Although the conventional plug-in estimate converges to the true Gini index as the sample size grows, by the consistency property, we do not always have an infinite amount of data for slowly converging predictions. This necessitates the use of the best estimator of the Gini index in measuring the homogeneity of observations in each terminal node. The interpretation of entropy in the construction of classification trees is also addressed with examples in this chapter. Further information on trees and bagging can be found in papers such as [6], [7], and [130], among others.
Range regression uses parallel splitting on ranges to replace binary splitting in regression trees. The asymptotic distribution of the range regression model bridges the data-driven camp with the model-based camp. It unifies the two data science cultures via the distributional convergence of the response-variable sample mean.
10
Unsupervised Learning and Optimization

The previous chapters discuss data analytic issues relating input features (such as predictors) to output features (such as the response variable), where each observation has a response. For instance, in the analysis of clinical factors related to systolic blood pressure, the response variable is the reading of the patient's systolic blood pressure; in the classification of up or down market trends in the coming time period, the response variable is either "bull market" or "bear market". The model learned from the training data has a response variable intended to "supervise" the learning process through an MSE criterion or the total probability of correct classification. However, in some data analytic problems, a response to "supervise" the learning process might not even exist. For example, in business analysis, clustering consumer preferences helps structure the design of marketing strategies and advertising campaigns. In clinical trials and epidemiology, grouping patient symptoms helps diagnosis and prevention in public interventions. In geology, grouping rock samples by their element characteristics helps identify the main characteristics of the environment in which they were found. The common theme among the above-mentioned applications is the lack of a response variable, due to the absence of knowledge at the experiment stage. In this chapter, we will focus on two main methods: K-means clustering and principal component analysis. To briefly summarize, K-means clustering and principal component analysis are two optimization approaches to grouping a set of data.

10.1 K-means Clustering


K-means clustering is applied when n observations are to be grouped into k (k < n) clusters based on feature closeness, under the criterion of minimizing the total point-wise "distance". Depending on the definition of "distance", K-means clustering can be applied in various fields to quantify different types of closeness. There are many publications regarding the theory and applications of K-means clustering, and it is beyond the scope of this book to detail the history of K-means clustering. Early work in this field includes Forgy (1965, [49]), Hartigan and Wong (1979, [55]), and MacQueen (1967), among others.


Definition 10.1 Let A be a set of interest. If there exists a collection of sets B_1, ..., B_t satisfying

B_i ∩ B_j = ∅   for 1 ≤ i < j ≤ t

and

∪_{j=1}^{t} B_j = A,

then the collection {B_1, ..., B_t} is a partition of the set A.

In other words, a partition of the set A divides A into finitely many mutually exclusive sets.

10.1.1 Clustering with Squared Euclidean Distance

Let C_1, ..., C_K be a partition of the index set {1, ..., n}. The purpose of K-means clustering is to find the partition of the p-dimensional points x_1, ..., x_n, where x_i ∈ R^p, i = 1, ..., n, for the following optimization problem:

Ĉ_i = argmin_{C_1,...,C_K} { Σ_{k=1}^{K} W(C_k) },        (10.1)

where, denoting |C_k| the number of observations in the set C_k,

W(C_k) = (1/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} (x_{ij} − x_{i′j})².        (10.2)

Notice that K-means clustering in (10.1) is defined with the squared Euclidean distance as the measurement of closeness. This measurement can be redefined accordingly to suit different clustering problems in practice. We use the following two simple examples to clarify the terms mentioned in the definition (10.1).

Example 10.1 When n = 4 and K = 2, find all candidate partitions of the observations.

Solution: All the candidate partitions of the set {1, 2, 3, 4} read:

C_1 = {1, 2}, C_2 = {3, 4};   C_1 = {1, 3}, C_2 = {2, 4};
C_1 = {1, 4}, C_2 = {2, 3};   C_1 = {1}, C_2 = {2, 3, 4};
C_1 = {2}, C_2 = {1, 3, 4};   C_1 = {3}, C_2 = {1, 2, 4};
C_1 = {4}, C_2 = {1, 2, 3}.

Example 10.2 Assume that the four points (11, 12, 13), (21, 22, 23), (31, 32, 33), and (41, 42, 43) are to be grouped into two clusters, so p = 3. For C_1 = {1, 2} and C_2 = {3, 4}, find the corresponding risk for optimization.

Solution: According to Equation (10.1),

Σ_{k=1}^{K} (1/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} (x_{ij} − x_{i′j})²
= (1/2)[(11 − 21)² + (12 − 22)² + (13 − 23)² + (21 − 11)² + (22 − 12)² + (23 − 13)²]
+ (1/2)[(31 − 41)² + (32 − 42)² + (33 − 43)² + (41 − 31)² + (42 − 32)² + (43 − 33)²].

For the partition C_1 = {1}, C_2 = {2, 3, 4}, the risk for optimization is

(1/3)[(21 − 31)² + (22 − 32)² + (23 − 33)² + (21 − 41)² + (22 − 42)² + (23 − 43)²
+ (31 − 21)² + (32 − 22)² + (33 − 23)² + (31 − 41)² + (32 − 42)² + (33 − 43)²
+ (41 − 21)² + (42 − 22)² + (43 − 23)² + (41 − 31)² + (42 − 32)² + (43 − 33)²].

The distance for optimization in the K-means procedure is defined as the point-wise squared Euclidean distance within each cluster. The centroid of cluster k is defined as (x̄_k1, ..., x̄_kp) with

x̄_kj = (1/|C_k|) Σ_{i∈C_k} x_{ij}        (10.3)

for each component j = 1, ..., p. Namely, each component of the centroid is the average of the corresponding components of all the points in cluster C_k.
Given the definition of a centroid above, we have the following theorem transferring the sum of all point-wise distances within a cluster to the sum of all distances of points to the centroid in a cluster.

Theorem 10.1 The average (over the cluster size) of all point-wise squared distances within a cluster is twice the sum of all squared distances of the points to the centroid of the cluster.

Proof: Notice that the average of all point-wise squared distances within a cluster reads

(1/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} (x_{ij} − x_{i′j})²,

and the sum of all squared distances of the points to the centroid is

Σ_{i∈C_k} Σ_{j=1}^{p} (x_{ij} − x̄_kj)².

For Theorem 10.1, it suffices to prove that

(1/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} (x_{ij} − x_{i′j})² = 2 Σ_{i∈C_k} Σ_{j=1}^{p} (x_{ij} − x̄_kj)².        (10.4)

We start with the left-hand side of Equation (10.4):

(1/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} (x_{ij} − x_{i′j})²
= (1/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} (x_{ij} − x̄_kj + x̄_kj − x_{i′j})²
= (1/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} [(x_{ij} − x̄_kj)² − 2(x_{ij} − x̄_kj)(x_{i′j} − x̄_kj) + (x_{i′j} − x̄_kj)²]
= (|C_k|/|C_k|) Σ_{i∈C_k} Σ_{j=1}^{p} (x_{ij} − x̄_kj)² + (|C_k|/|C_k|) Σ_{i′∈C_k} Σ_{j=1}^{p} (x_{i′j} − x̄_kj)²
  − (2/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} (x_{ij} − x̄_kj)(x_{i′j} − x̄_kj)
= 2 Σ_{i∈C_k} Σ_{j=1}^{p} (x_{ij} − x̄_kj)²,

since

Σ_{i,i′∈C_k} Σ_{j=1}^{p} (x_{ij} − x̄_kj)(x_{i′j} − x̄_kj) = Σ_{j=1}^{p} [Σ_{i∈C_k} (x_{ij} − x̄_kj)][Σ_{i′∈C_k} (x_{i′j} − x̄_kj)] = 0,

and

(1/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} (x_{ij} − x̄_kj)² = (|C_k|/|C_k|) Σ_{i∈C_k} Σ_{j=1}^{p} (x_{ij} − x̄_kj)²,

and similarly

(1/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} (x_{i′j} − x̄_kj)² = (|C_k|/|C_k|) Σ_{i′∈C_k} Σ_{j=1}^{p} (x_{i′j} − x̄_kj)².

The following example illustrates the main idea in the proof of Theorem 10.1.

Example 10.3 Assume that the four points for clustering are (11, 12, 13), (21, 22, 23), (31, 32, 33), and (41, 42, 43), and the partition is C_1 = {1, 2}, C_2 = {3, 4}. The centroid of cluster C_1 is (16, 17, 18), and we have

(1/|C_1|) Σ_{i,i′∈C_1} Σ_{j=1}^{p} (x_{ij} − x_{i′j})²
= (1/2)[(11 − 21)² + (12 − 22)² + (13 − 23)² + (21 − 11)² + (22 − 12)² + (23 − 13)²]
= (1/2)[(11 − 16 + 16 − 21)² + (12 − 17 + 17 − 22)² + (13 − 18 + 18 − 23)²
   + (21 − 16 + 16 − 11)² + (22 − 17 + 17 − 12)² + (23 − 18 + 18 − 13)²]
= 2[(11 − 16)² + (12 − 17)² + (13 − 18)² + (21 − 16)² + (22 − 17)² + (23 − 18)²]
= 2 Σ_{i∈C_1} Σ_{j=1}^{p} (x_{ij} − x̄_1j)².

Theorem 10.1 establishes a connection between the sum of all point-wise squared distances within a cluster and the total distance from each point to the centroid of the cluster. With this theorem, it is intuitive to consider the following algorithm in the optimization process for K-means clustering; an R sketch of the algorithm follows the list.

K-means clustering algorithm

1 Initial assignment: randomly assign a number from 1 to K to each of the observations.
2 For each cluster, compute the cluster centroid using Equation (10.3).
3 Compute the distances of each observation to the K centroids.
4 Assign each observation to the cluster with the shortest centroid distance in the preceding step.
5 Iterate the above three steps (2-4) until the cluster assignments stop changing.
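A minimal from-scratch sketch in R (the function name is ours; for simplicity it does not handle the edge case of a cluster becoming empty):

kmeans_sketch <- function(X, K, max_iter = 100) {
  n <- nrow(X)
  cluster <- sample(1:K, n, replace = TRUE)      # step 1: random assignment
  for (iter in 1:max_iter) {
    # Step 2: centroids are the component-wise means of each cluster.
    centroids <- t(sapply(1:K, function(k)
      colMeans(X[cluster == k, , drop = FALSE])))
    # Steps 3-4: squared Euclidean distance of every point to every
    # centroid, then reassign each point to its nearest centroid.
    d2 <- sapply(1:K, function(k)
      rowSums((X - matrix(centroids[k, ], n, ncol(X), byrow = TRUE))^2))
    new_cluster <- apply(d2, 1, which.min)
    if (all(new_cluster == cluster)) break       # step 5: assignments stable
    cluster <- new_cluster
  }
  list(cluster = cluster, centroids = centroids)
}

Base R's kmeans() implements refined versions of this procedure and is preferable in practice.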

Theorem 10.2 The K-means clustering algorithm converges to the optimal partition after finitely many iterations.
Consider a partition Δ_t = {C_1, ..., C_K} and the total within-cluster variation

f(Δ_t) = Σ_{k=1}^{K} (1/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} (x_{ij} − x_{i′j})².        (10.5)

Denote

Δ̂ = argmin_Δ f(Δ).

Following the K-means clustering algorithm, the minimum risk estimator Δ̂ exists and can be achieved after finitely many iterations of the algorithm.

Proof: At the first step, if f(Δ_1) is the smallest value among all possible partitions, then by Theorem 10.1 we have

(1/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} (x_{ij} − x_{i′j})² = 2 Σ_{i∈C_k} Σ_{j=1}^{p} (x_{ij} − x̄_kj)².

This means that the smallest possible risk is achieved at f(Δ_1); no point can be switched to a different cluster to form a different partition and gain any improvement in the overall within-cluster variation. Under this scenario, the MRE (minimum risk estimator) is achieved.
At any step t, t ≥ 1, if f(Δ_t) is not the smallest value, the right-hand side of Theorem 10.1 can be improved by rearranging the points around the centroids. According to the K-means algorithm, grouping each point with its closest centroid yields a new partition Δ_{t+1}. By Theorem 10.1, the total within-cluster variation for Δ_{t+1} is strictly less than that for Δ_t:

f(Δ_{t+1}) < f(Δ_t).        (10.6)

Repeating the K-means algorithm for t = 1, 2, ..., m, we have a sequence

f(Δ_1) > f(Δ_2) > ... > f(Δ_m) > f(Δ_{m+1}).        (10.7)

Since there are only finitely many partitions of the n points, and the distances between the points are fixed, as the number of iterations m increases the total within-cluster variation reaches the minimum value

f(Δ*) = Σ_{k=1}^{K} (1/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} (x_{ij} − x_{i′j})²,

which is Equation (10.5) in Theorem 10.2.


We shall use a numerical example to illustrate (10.6) as follows.

Example 10.4 Assume that the four points are (11, 12), (20, 21), (40, 41), and (51, 52), K = 2, and

Δ_1 = {{(20, 21)}, {(11, 12), (51, 52), (40, 41)}}.

The centroid of the second cluster is (34, 35), and we have

f(Δ_1) = (1/3)[(11 − 51)² + (12 − 52)² + (11 − 40)² + (12 − 41)² + (40 − 11)² + (41 − 12)²
   + (40 − 51)² + (41 − 52)² + (51 − 11)² + (52 − 12)² + (51 − 40)² + (52 − 41)²]
 = 2[(11 − 34)² + (40 − 34)² + (51 − 34)² + (12 − 35)² + (41 − 35)² + (52 − 35)²]
 = 3416.

The point (11, 12) is closer to the centroid of the first cluster, (20, 21), than to the centroid of its own cluster. Rearranging the partition as

Δ_2 = {{(20, 21), (11, 12)}, {(51, 52), (40, 41)}},

we have

f(Δ_2) = (2/2)[(11 − 20)² + (12 − 21)² + (51 − 40)² + (52 − 41)²] = 404.

Obviously,

f(Δ_2) < f(Δ_1).

We shall discuss a numerical example to illustrate the use of the K-means clustering algorithm.

Example 10.5 Consider a set of seven observations.

No      1     2     3     4     5     6     7
x1      1    10    1.2    2    12     9    1.4
x2      2     8    1.3   2.4    7     9     2

Assume that the initial random assignment sets the partition {C_1, C_2} as

C_1 = {1, 2, 3, 4},   C_2 = {5, 6, 7}.

Find the optimal clusters that minimize the total within-cluster variation.

Solution: According to the K-means clustering algorithm, the first step is to find the centroids of the two clusters. The centroid of cluster C_1, (x̄_11, x̄_12), reads

x̄_11 = (1 + 10 + 1.2 + 2)/4 = 3.55,
x̄_12 = (2 + 8 + 1.3 + 2.4)/4 = 3.425.

Similarly, the centroid of C_2 reads

x̄_21 = (12 + 9 + 1.4)/3 = 7.467,
x̄_22 = (7 + 9 + 2)/3 = 6.

Thus we have the distances of the observations to the two centroids.

No                1      2      3      4      5      6      7
d(centroid-1)    2.92   7.91   3.17   1.86   9.18   7.80   2.58
d(centroid-2)    7.60   3.23   7.83   6.55   4.64   3.37   7.27
updated cluster   1      2      1      1      2      2      1

Now we have the updated partition {C_1*, C_2*} with

C_1* = {1, 3, 4, 7},   C_2* = {2, 5, 6}.

The updated centroids become

x̄_11* = 1.4,   x̄_12* = 1.925,   x̄_21* = 10.33,   x̄_22* = 8.

The updated point-centroid distances are

No                1      2      3      4      5      6      7
d(centroid-1*)   0.41  10.53   0.66   0.77  11.75  10.38   0.08
d(centroid-2*)  11.10   0.33  11.33  10.04   1.94   1.67  10.76
updated cluster   1      2      1      1      2      2      1

Since the cluster memberships in the updated partition are identical to those before checking the updated point-centroid distances, no rearrangement is needed, and the optimal clusters are {1, 3, 4, 7} and {2, 5, 6}.
The plots of the observations and the two steps in the clustering algorithm
can be found in Figure 10.1.

FIGURE 10.1
Numerical illustration of K-means clustering algorithm
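As a quick check, base R's kmeans() recovers the same grouping for these data (cluster labels may be permuted):

X <- cbind(x1 = c(1, 10, 1.2, 2, 12, 9, 1.4),
           x2 = c(2, 8, 1.3, 2.4, 7, 9, 2))
set.seed(1)
km <- kmeans(X, centers = 2)
km$cluster   # observations 1, 3, 4, 7 in one cluster; 2, 5, 6 in the other
km$centers   # (1.4, 1.925) and (10.33, 8)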

10.1.2 Non-Euclidean Clustering

The preceding subsection discusses the optimizing algorithm for K-means clustering when the closeness between any two observations is measured by the squared Euclidean distance. In practice, closeness is not always measured this way. For example, when clustering patients into different statuses of diabetes based on their health features, the blood glucose level is more important than the body height. When clustering people's facial expressions, an index for the eyes may carry higher weight than the shape of the hair. In such scenarios, some features in the dataset have a different degree of relevance to the clustering outcome. There are different methods for clustering with non-Euclidean distances. In this subsection, we take a quick glance at non-Euclidean clustering, and use the weighted squared Euclidean distance to extend the classical K-means clustering algorithm into non-Euclidean clustering. Publications in this regard include Amorim and Mirkin (2012, [3]) as well as Amorim and Henning (2015, [2]), among others.
Let C_1, ..., C_K be a partition of the index set {1, ..., n}. Assume that we are interested in clustering the n observations in R^p into K clusters with the weights w_1, ..., w_p for the following optimization problem:

Ĉ_i = argmin_{C_1,...,C_K} { Σ_{k=1}^{K} (1/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} w_j (x_{ij} − x_{i′j})² },        (10.8)

where |C_k| denotes the number of observations in the set C_k. Notice that the clustering problem in (10.8) differs from the squared Euclidean clustering (10.1) by inserting the weights w_j, j = 1, ..., p, where

0 < w_j < 1,   Σ_{j=1}^{p} w_j = 1.

Since the K-means clustering algorithm in the preceding section is derived under the assumption that "closeness" is measured by the squared Euclidean distance, it is inappropriate to carelessly apply the algorithm without checking the plausibility of the assumptions.
Similar to the previous subsection, we should explore the possibility of transferring the point-wise measurement to a point-centroid measurement for the weighted squared Euclidean distance. In this regard, we have the following theorem.

Theorem 10.3

(1/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} w_j (x_{ij} − x_{i′j})² = 2 Σ_{i∈C_k} Σ_{j=1}^{p} w_j (x_{ij} − x̄_kj)².        (10.9)

Proof: Similar to the proof of Theorem 10.1, the key steps begin with the decomposition of the left-hand side as follows, noticing that the weighting of the components does not affect the operation of summing around the centroid in each cluster.

(1/|C_k|) Σ_{i,i′∈C_k} Σ_{j=1}^{p} w_j (x_{ij} − x_{i′j})²
= (1/|C_k|) Σ_{j=1}^{p} w_j Σ_{i,i′∈C_k} (x_{ij} − x̄_kj + x̄_kj − x_{i′j})²
= (1/|C_k|) Σ_{j=1}^{p} w_j Σ_{i,i′∈C_k} [(x_{ij} − x̄_kj)² − 2(x_{ij} − x̄_kj)(x_{i′j} − x̄_kj) + (x_{i′j} − x̄_kj)²]
= (|C_k|/|C_k|) Σ_{j=1}^{p} w_j Σ_{i∈C_k} (x_{ij} − x̄_kj)² + (|C_k|/|C_k|) Σ_{j=1}^{p} w_j Σ_{i′∈C_k} (x_{i′j} − x̄_kj)²
  − (2/|C_k|) Σ_{j=1}^{p} w_j Σ_{i,i′∈C_k} (x_{ij} − x̄_kj)(x_{i′j} − x̄_kj)
= 2 Σ_{i∈C_k} Σ_{j=1}^{p} w_j (x_{ij} − x̄_kj)².

With Theorem 10.3, the point-wise cluster variation with weighted squared Euclidean distances can be converted to the cluster variation from the points to the centroid of the cluster. In fact, as shown in the above proof, as long as the measurement is a weighted linear combination over the components j, results similar to Theorem 10.3 can be proved in the same manner.
The validation of Theorem 10.3, in conjunction with Theorem 10.2, consequently leads to the following non-Euclidean clustering algorithm.

Weighted squared Euclidean clustering algorithm (an R sketch of the modified distance computation follows the list)

1 Initial assignment: randomly assign a number from 1 to K to each of the observations.
2 For each cluster, compute the cluster centroid using Equation (10.3).
3 Compute the weighted squared Euclidean distance of each observation to the K centroids.
4 Assign each observation to the cluster with the shortest weighted centroid distance in the preceding step.
5 Iterate the above three steps (2-4) until the cluster assignments stop changing.
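Relative to the kmeans_sketch() function given earlier, only the distance computation changes; a sketch (the function name is ours):

# Squared weighted Euclidean distance of each row of X to one centroid;
# w is the vector of feature weights w_1, ..., w_p.
weighted_dist2 <- function(X, centroid, w) {
  diff2 <- (X - matrix(centroid, nrow(X), ncol(X), byrow = TRUE))^2
  rowSums(sweep(diff2, 2, w, `*`))   # scale column j by w_j, then sum
}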

10.2 Principal Component Analysis

Principal component analysis originally stems from a statistical methodology for analyzing multivariate data proposed by Hotelling (1933, [66]) and Pearson (1901, [92]). It starts with the decomposition of the population covariance matrix of multivariate data into several key components constituted by the eigenvalues and eigenvectors of the population covariance matrix. When the population covariance matrix is unknown (which is the case in most practical settings), since the estimate of the population covariance matrix is the sample covariance matrix, the practice of principal component analysis often starts with the sample covariance matrix. Data analysts without proper statistical training usually cut corners by disregarding the population covariance matrix. This partially leads to the misconception that principal component analysis begins with the sample covariance matrix, or even the standardized data. It is conceptually important to clarify this because the eigenvalues of the sample covariance matrix are random quantities that vary with the data, while the ones from the population covariance matrix are fixed parameters containing the information of the population.

10.2.1 Population Principal Components

Definition 10.2 Let x be a p-dimensional observation from a population with covariance matrix Σ. Assume that the ordered eigenvalues of Σ are

λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0;

since the population covariance matrix is positive semi-definite, all the eigenvalues are non-negative. The first principal component is defined as

Y_1 = a_1ᵀx = a_11 X_1 + ... + a_1p X_p,

where

a_1 = argmax_{aᵀa=1} Var(aᵀx).

Similar to the definition of the first principal component in Definition 10.2, the i-th principal component can be defined through optimization as follows.

Definition 10.3 Let x be a p-dimensional observation from a population with covariance matrix Σ. Assume that the ordered eigenvalues of Σ are

λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0.

The i-th principal component is defined as

Y_i = a_iᵀx = a_i1 X_1 + ... + a_ip X_p,

where

a_i = argmax_{aᵀa=1, aᵀa_j=0, j<i} Var(aᵀx).

Thus, the i-th principal component of the data is essentially a linear combination of the data that is orthogonal to all the j-th principal components with j < i.

Example 10.6 Assume that a database contains patient information with the following features.

xT = (x1 , x2 , x3 , . . . , x9 , x10 ),
where
x1 : systolic blood pressure
x2 : total blood cholesterol level
x3 : dusty working environment
x4 : residential location (city, rural area)
x5 : transportation (car, train, bus, walk)
x6 : career type
x7 : annual income
x8 : medical insurance
x9 : heart attack/stroke history
x10 : financial investment
We want to reduce the dimension of 10 features into several representative
variables to improve the efficiency of data analysis.
Solution: Assume that the eigenvectors corresponding to the largest two
eigenvalues are,

aT1 = (0.452, 0.603, 0, 0, 0, 0.151, −0.151, 0, 0.603, −0.151),

aT2 = (0, 0, 0.745, 0.447, 0.447, 0.149, 0.149, 0, 0, 0),


then the first two components become
Y1 = 0.452x1 + 0.603x2 + 0.151x6 − 0.151x7 + 0.603x9 − 0.151x10

Y2 = 0.745x3 + 0.447x4 + 0.447x5 + 0.149x6 + 0.149x7


where the first component Y_1 measures the vascular-related health aspects of the patient, and the second component measures the environmental aspects of the patient. In this way, the study of the 10 features of the patients can be reduced to the analysis of two principal components measuring the health and environment features of the patients. The dimension of the data is thus reduced.

The above example shows the dimension-reduction aspect of principal component analysis, but it does not show how to identify and construct the principal components. The following example shows how to construct the principal components from the covariance matrix of the data.

Example 10.7 Suppose the random variables X_1, X_2, and X_3 have the covariance matrix

        | 21  32  17 |
    Σ = | 32  54  18 |
        | 17  18  35 |

Find the eigenvalues and eigenvectors of the population covariance matrix.

The eigenvalues and the corresponding eigenvectors are

λ_1 = 85.25,   v_1 = (−0.491, −0.754, −0.436)ᵀ,
λ_2 = 24.26,   v_2 = (−0.055, −0.473, 0.879)ᵀ,
λ_3 = 0.459,   v_3 = (0.869, −0.456, −0.191)ᵀ.

Thus, the principal components become

Y_1 = v_1ᵀx = −0.491X_1 − 0.754X_2 − 0.436X_3,
Y_2 = v_2ᵀx = −0.055X_1 − 0.473X_2 + 0.879X_3,
Y_3 = v_3ᵀx = 0.869X_1 − 0.456X_2 − 0.191X_3.

When we want principal components accounting for at least 65% of the variance information in the data, the first two principal components, Y_1 and Y_2, clearly suffice: they carry (85.25 + 24.26)/(85.25 + 24.26 + 0.459) ≈ 99.6% of the total variance.
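The same decomposition can be verified in R (eigenvectors are unique only up to sign):

Sigma <- matrix(c(21, 32, 17,
                  32, 54, 18,
                  17, 18, 35), nrow = 3, byrow = TRUE)
e <- eigen(Sigma)
e$values                   # 85.25, 24.26, 0.459 (approximately)
e$vectors                  # columns are v1, v2, v3
e$values / sum(e$values)   # proportion of total variance per component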

Evidently, the above example shows how to obtain the principal components from the covariance matrix. In practice, the population covariance matrix is unknown. When the population covariance matrix is estimated by the sample covariance matrix, the corresponding (sample) principal components are the linear combinations of the data with the corresponding eigenvectors of the sample covariance matrix. We use the following example to illustrate this point.

10.2.2 Sample principal components

As discussed in the preceding subsection, the core concept of a principal component depends on the decomposition of the information or variation in the covariance matrix, which is generally unknown in real-world data analysis. However, using the consistency property of the sample covariance matrix, it is efficient to perform principal component analysis using the sample covariance matrix. It should be noted that the principal components obtained in this way are actually the sample principal components (which depend on the data), not the population principal components (which, by definition, do not depend on the data).
Another commonly confusing issue is standardization versus non-standardization of the data in principal component analysis. One common practice is to standardize the data (shifting to the center by subtracting the sample mean, and dividing by the sample standard deviation). The advantage of standardization is that it makes the principal components invariant to location and scale transformations. However, it should be noted that performing principal component analysis on standardized data is tantamount to performing principal component analysis on the sample correlation matrix.

Theorem 10.4 Consider a set of data x_{ij}, i = 1, ..., k and j = 1, ..., n, for n observations of k features of a population. Denote

x̄_i = (1/n) Σ_j x_{ij}

and

s_{ij} = (1/(n − 1)) Σ_{l=1}^{n} (x_{il} − x̄_i)(x_{jl} − x̄_j).

Consider the standardized observations

z_{ij} = (x_{ij} − x̄_i)/√s_{ii},

where i = 1, ..., k, j = 1, ..., n, obtained by centering at the sample mean and dividing by the sample standard deviation. We have

S_z = R_x;        (10.10)

the sample covariance matrix of the standardized data is the correlation matrix of the original data.

Proof: Denote the data matrix of the standardized data by Z. Notice that for the standardized data, the sample mean of each component reads

z̄ = (1/n)(1ᵀZ)ᵀ = (1/n) Zᵀ1 = 0,

since

(1/n) Σ_{j=1}^{n} (x_{ij} − x̄_i)/√s_{ii} = (1/(n√s_{ii})) (Σ_{j=1}^{n} x_{ij} − n x̄_i) = 0.

Now, by definition, the sample covariance matrix of the standardized data is

S_z = ( (1/(n − 1)) Σ_{l=1}^{n} (z_{il} − z̄_i)(z_{jl} − z̄_j) ) = (1/(n − 1)) ZᵀZ.

Finally, notice that

(1/(n − 1)) ZᵀZ = ( (n − 1) s_{ij} / ((n − 1) √(s_{ii} s_{jj})) ) = ( s_{ij} / √(s_{ii} s_{jj}) ) = R_x.
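A quick numerical check of Theorem 10.4 in R:

set.seed(2)
X <- matrix(rnorm(50 * 3), ncol = 3)   # 50 observations, 3 features
Z <- scale(X)                          # center and divide by sample sd
all.equal(cov(Z), cor(X), check.attributes = FALSE)   # TRUE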

The meaning and interpretation of principal component analysis on the correlation matrix are different from those obtained from the sample covariance matrix. We use the following example to illustrate this point.

Example 10.8 Consider the following hypothetical dataset for a random sample of 9 records of three stocks, in percentages:

stock1 = (6, 8, 9, 3, 8, 7, 9, 8, 2)
stock2 = (−2, 15, 9, 3, 7, 7, −8, −14, 7)
stock3 = (2, 3, 2, 4, 3, 3, 3, 2, 2).

Find the principal components of the stock market performance.

Solution: As shown in Figure 10.2, the principal components based on the sample covariance matrix of the data frame read

Y_1 = 0.044 Stock1 − 0.999 Stock2 − 0.017 Stock3,
Y_2 = 0.999 Stock1 + 0.045 Stock2 − 0.032 Stock3,
Y_3 = 0.033 Stock1 − 0.015 Stock2 + 0.999 Stock3,

with eigenvalues λ_1 = 82.922, λ_2 = 6.357, and λ_3 = 0.471. The first principal component accounts for

82.922/(82.922 + 6.357 + 0.471) = 92.4%

of the total variation of the data. In fact, since Stock 3 (performing like a CD or bond) basically remains the same over the 9 sample years, and Stock 1 changes slightly (performing like a low-risk portfolio), the market variation is essentially reflected by Stock 2, while Stock 1 and Stock 3 carry less weight in the evaluation of the market performance.
However, with the data standardization approach, when we subtract the sample mean and divide each observation by the sample standard deviation, the principal components based on the sample correlation matrix read

Y_1* = 0.515 Stock1 − 0.609 Stock2 − 0.603 Stock3,
Y_2* = −0.856 Stock1 − 0.332 Stock2 − 0.395 Stock3,

> data <- data.frame(stock1 = c(6, 8, 9, 3, 8, 7, 9, 8, 2),
+                    stock2 = c(-2, 15, 9, 3, 7, 7, -8, -14, 7),
+                    stock3 = c(2, 3, 2, 4, 3, 3, 3, 2, 2))
> h <- cov(data)
> j <- cor(data)

> eigen(h)
eigen() decomposition
$`values`
[1] 82.9223567 6.3569839 0.4706594
$vectors
[,1] [,2] [,3]
[1,] 0.04416809 0.99848413 0.03284249
[2,] -0.99888288 0.04469073 -0.01535322
[3,] -0.01679770 -0.03212768 0.99934261

> eigen(j)
eigen() decomposition
$`values`
[1] 1.3345071 0.8794526 0.7860403
$vectors
[,1] [,2] [,3]
[1,] 0.5150182 -0.8562234 -0.04046871
[2,] -0.6093253 -0.3324862 -0.71984420
[3,] -0.6028922 -0.3953915 0.69295498

FIGURE 10.2
Sample covariance matrix vs sample correlation matrix

Y_3* = −0.04 Stock1 − 0.72 Stock2 + 0.693 Stock3,

with eigenvalues λ_1 = 1.334, λ_2 = 0.879, and λ_3 = 0.786. Thus, the first principal component of the correlation matrix carries

1.334/(1.334 + 0.879 + 0.786) = 44.5%

of the total information/variation. To capture more than 60% of the total variation, the first two principal components should be used. Obviously, the first component of the standardized data measures the contrast between Stock1 and the combination of Stock2 and Stock3; the second component of the standardized data focuses on Stock1; and the last one focuses on the contrast between Stock2 and Stock3.
It is not difficult to see that after standardization, the differences in the variation of the three stocks are reduced, due to the scaling of each variable by its sample standard deviation. With standardization, the different variations of different variables are removed or rescaled, which can lead to misleading conclusions in data analysis.
This example shows that when we rescale data through standardization,
the variations of the original variables are shrunk to one. This causes distortion
of the variation information conveyed in the original dataset.

SUMMARY Preceding chapters describe prediction problems involving a measurable response variable, a supervised measurement in the learning process. This chapter, on the other hand, discusses methods for the case where we do not have a response variable for classification or prediction, the unsupervised learning scenario. In particular, we discuss the method of K-means clustering with the squared Euclidean distance as the homogeneity measurement, and extend the method to cases where the classification criterion is a non-Euclidean distance. We also illustrate the method of principal component analysis, addressing the difference between the population principal components and the sample principal components. A recent development applying the K-means clustering method to the impact of masking policy during the Covid-19 pandemic can be found in Chen and Chen (2024, [36]), among others.
11
Simultaneous Learning and Multiplicity

This chapter discusses a common situation in data science when two or more
populations are involved in the learning process. When we have only one
data population for prediction, the error rates (false positive, false negative)
are clearly defined. However, when two or more sets of data are involved in
the analysis process, since each path of the analysis generates error rates,
controlling inference error in one population (one statement) does not control
the overall error rates. In fact, the overall error rates accumulate as the number
of data path increases. This chapter thus discusses methods applicable to
adjust the multiplicity in multi-path statistical learning, the scenario where
two or more sources of datasets are involved in the prediction process.
A special case of multi-path learning arises when the number of observations increases (instead of being fixed) during the learning process. To this end, we discuss the method of sequential analysis, where the two error rates (false positive and false negative) are combined to determine the required sample size and where data arrive in different phases.
Another approach to handling multi-resource learning simultaneously is
the methodology of multiple comparisons. We shall focus on recent devel-
opments on simultaneous confidence segments for dose-response curves, and
weighted hypotheses with high dimensional data in this chapter.
Materials in this chapter essentially synthesize some recent publications in
simultaneous inference, including Ma et al. (2023)[82], Yu et al. (2022)[128],
Chen (2016)[23], and Kerns and Chen (2017)[76].

11.1 Sequential Data


One of the common questions in data analysis is the determination of sample
size. Notice that in previous chapters, we focus on the control of type-I error
with a fixed sample size. Consider the scenario where the type-I error and type-
II error are specified, which can be treated as the false positive (incorrectly
rejecting the null hypothesis) and false negative (incorrectly rejecting the al-
ternative hypothesis) error rates. The sample size computation is available
for the inference on population mean when the population standard deviation


is given. However, in many applications, the population standard deviation


is unknown. Under this scenario, the two-stage sequential procedure can be
applied. In the literature, there are various excellent review articles on sequen-
tial analysis, such as Lai (2001, [77]). In this section, we will use examples to
describe the basic idea of prediction with sequential data.

11.1.1 Wald's sequential probability ratio test (SPRT)

We start with some of the simplest cases of the application of Wald's SPRT.
Let A and B be two constants in Wald's SPRT procedure, chosen so that the following conditions hold:
1) P(type-I error) = P(incorrectly rejecting the null hypothesis) = α, and
2) P(type-II error) = P(incorrectly rejecting the alternative hypothesis) = β.

We have the following general procedure in sequential inference under the above two criteria. For testing H0 : μ = μ0 versus H1 : μ = μ1, denote the likelihood ratio
$$R_n(\mu_0, \mu_1) = \frac{\prod_{i=1}^{n} f(x_i \mid \mu_1)}{\prod_{j=1}^{n} f(x_j \mid \mu_0)},$$
and denote the likelihood test statistic by Λn.

• Reject the null hypothesis (or equivalently, accept the alternative hypothesis) if the data satisfy (Λn | μ = μ0) > A; clear data evidence for H1.
• Reject the alternative hypothesis (or equivalently, accept the null hypothesis) if the data satisfy (Λn | μ = μ1) < B; clear data evidence for H0.
• Continue sampling (without drawing any conclusion on accepting or rejecting the null hypothesis) if the data satisfy A ≥ (Λn | μ = μ0) and (Λn | μ = μ1) ≥ B.

Example 11.1 Consider a testing problem of μ = 2 versus μ = 6 for data drawn from a population following the normal model N(μ, 4) with variance σ² = 4.

Solution: If μ = 6, the data carry higher likelihood for the alternative hypothesis. If μ = 2, on the other hand, the data carry higher likelihood supporting the null hypothesis. Thus, for the SPRT, denote by f(x|μ) the density of the underlying model, and let
$$R_n = \frac{\prod_{i=1}^{n} f(x_i \mid \mu = 6)}{\prod_{j=1}^{n} f(x_j \mid \mu = 2)};$$

the decision rule of the SPRT can be described as follows.
Since
$$f(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$
after some simple algebra, the likelihood ratio can be simplified as
$$R_n = \exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\big[(x_i - \mu_1)^2 - (x_i - \mu_0)^2\big]\Big\} = \exp\Big\{\frac{1}{2\sigma^2}\big[2n\bar{X}(\mu_1 - \mu_0) + n(\mu_0^2 - \mu_1^2)\big]\Big\}.$$
Thus, Rn > A is equivalent to X̄ > c for some constant c. The test statistic is Λn = X̄, and
$$P(\Lambda_n > A \mid \mu = \mu_0) = \alpha$$
is tantamount to
$$\frac{\sqrt{n}}{\sigma}(\bar{X} - \mu_0) > Z_{1-\alpha}.$$
Similarly, Wald's sequential probability ratio test for rejecting the alternative hypothesis, in this simple setting, becomes
$$\frac{\sqrt{n}}{\sigma}(\bar{X} - \mu_1) < Z_{\beta}.$$
Summarizing the above discussion of Wald's sequential probability ratio test, we have the following decision criteria for the inference problem in Example 11.1.

• Reject the null hypothesis (or equivalently, accept the alternative hypothesis) if the data satisfy the condition
$$\frac{\sqrt{n}}{\sigma}(\bar{X} - \mu_0) > Z_{1-\alpha};$$
• Reject the alternative hypothesis (or equivalently, accept the null hypothesis) if the data satisfy the condition
$$\frac{\sqrt{n}}{\sigma}(\bar{X} - \mu_1) < Z_{\beta};$$
• Continue sampling, without drawing any conclusion, if
$$Z_{\beta} \le \frac{\sqrt{n}}{\sigma}(\bar{X} - \mu_1) \quad \text{and} \quad \frac{\sqrt{n}}{\sigma}(\bar{X} - \mu_0) < Z_{1-\alpha}.$$
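The three-way decision rule above is straightforward to code. The sketch below (a minimal illustration, not a package function) implements it for the known-σ normal setting of Example 11.1, where μ0 = 2, μ1 = 6, and σ = 2:

# Wald's SPRT decision for H0: mu = mu0 versus H1: mu = mu1 (normal data, known sigma)
sprt_normal <- function(x, mu0, mu1, sigma, alpha = 0.05, beta = 0.05) {
  n <- length(x)
  z0 <- sqrt(n) * (mean(x) - mu0) / sigma   # standardized distance from mu0
  z1 <- sqrt(n) * (mean(x) - mu1) / sigma   # standardized distance from mu1
  if (z0 > qnorm(1 - alpha)) {
    "Reject H0 (accept H1)"
  } else if (z1 < qnorm(beta)) {
    "Reject H1 (accept H0)"
  } else {
    "Continue sampling"
  }
}

# A small sample lying close to mu0 = 2:
sprt_normal(c(2.1, 1.8, 2.5, 3.0), mu0 = 2, mu1 = 6, sigma = 2)  # "Reject H1 (accept H0)"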

The above discussion is valid when the data follow a normal model with a given standard deviation. In practice, however, the standard deviation is often unknown. This leads to the sprtt package in R, which performs Wald's sequential probability ratio test without knowledge of the population standard deviation.
Another practical issue related to the analysis of sequential data is the statement claiming no difference between the hypothesized mean and the true but unknown population mean in the composite hypothesis μ = μ0 versus μ ≠ μ0. Since the data vary at a certain level, it makes more sense to claim closeness within a certain range instead of vaguely claiming an inequality between the unknown and the hypothesized value. This is because an inequality could mean that the unknown value is very close to the hypothesized value or very far away from it. Toward this end, a practical concept for resolving the problem is the effective size.

Definition 11.1 Effective Size Assume that the hypothesized mean is μ0, and the population standard deviation is σ. For any population mean μ, the effective size measures the distance between μ and μ0 adjusted by the population standard deviation,
$$d = \frac{|\mu - \mu_0|}{\sigma}.$$
With the concept of effective size, the composite hypothesis μ = μ0 versus μ ≠ μ0 is specified, in sequential analysis, as μ = μ0 versus |μ − μ0| > dσ. The alternative hypothesis is interpreted as: the true but unknown mean value is d standard deviations away from the hypothesized mean μ0. Heuristically, when the effective size decreases, the likelihood ratio in Wald's SPRT decreases, because it is usually harder to detect the mean difference when the true value and the hypothesized mean are very close. For instance, given a specified effective size d0, since the alternative hypothesis is |μ − μ0| > d0σ, we have

$$P(\text{type-II error}) = \sup_{\frac{|\mu - \mu_0|}{\sigma} > d_0} P\Big(\text{accepting } H_0 \,\Big|\, \frac{|\mu - \mu_0|}{\sigma} > d_0\Big) = P\Big(\text{accepting } H_0 \,\Big|\, \frac{|\mu - \mu_0|}{\sigma} = d_0\Big) = P(\text{accepting } H_0 \mid d = d_0).$$

We shall use an example to discuss the R package "sprtt" for sequential data analytics using the SPRT t-test.

Example 11.2 For convenience, consider the income data in the R package sprtt. The dataset contains 120 observations of monthly income for 60 males and 60 females. For illustration purposes, we will use the income data to examine the impact of the effective size on the inference outcome. We first compare the mean monthly income level of males with that of females under different effective sizes, then use the data to run the SPRT against a

specific value for the population mean of the monthly income, and conclude
the example with a discussion regarding the impact of the alternative likelihood
on effective size.

Case 1. SPRT t-test comparing two population means

Assume that we are interested in detecting the mean monthly income difference between males and females.

[R console output omitted due to garbled encoding: the sprtt two-sample SPRT t-test of monthly income by gender, run at effective sizes d = 0.2 and d = 0.8; the key quantities are quoted in the text below.]


FIGURE 11.1
Mean monthly income between males and females with sprtt
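A call producing output of this form might look as follows. This is a hedged sketch: seq_ttest() is the sequential t-test function in the sprtt package, but the argument names and the name of the bundled income dataset (written here as income_data) are assumptions that should be checked against the package documentation.

# Hedged sketch: two-sample SPRT t-tests at effective sizes 0.2 and 0.8
library(sprtt)
seq_ttest(income ~ gender, data = income_data, d = 0.2)  # -> continue sampling
seq_ttest(income ~ gender, data = income_data, d = 0.8)  # -> accept H0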

As shown in the output Figure 11.1, when the effective size is set to 0.2,
the null and alternative hypotheses become

H0 : μmale = μf emale versus H1 : |μmale − μf emale | > 0.2σ,

namely, to test whether the mean monthly income difference is more than 20% of the data variation. The SPRT thresholds are
$$\log(B) = -\log(0.95/0.05) = -2.9444 \; (= -\log(A))$$



since the probabilities of type-I and type-II errors are both set to 0.05 in the code. The log-likelihood of the null hypothesis is
$$\sum_i \log f(x_i \mid \mu_{male} = \mu_{female}) = 1.42137;$$
the log-likelihood of the alternative hypothesis is
$$\sup \sum_i \log f\big(x_i \mid |\mu_{male} - \mu_{female}| > 0.2\sigma\big) = 0.82689.$$
Thus the log-likelihood ratio becomes
$$\text{log-likelihood ratio} = \sup \sum_i \log f\big(x_i \mid |\mu_{male} - \mu_{female}| > 0.2\sigma\big) - \sum_i \log f(x_i \mid \mu_{male} = \mu_{female}) = 0.82689 - 1.42137 = -0.59447.$$

Since the value of the log-likelihood ratio lies within the two thresholds,
$$-0.59447 \in (-2.9444, 2.9444),$$
there is no data evidence to reject either the null hypothesis or the alternative hypothesis. According to Wald's SPRT, the inference outcome is "to continue sampling". This means that the existing sample size is not large enough to detect a mean monthly income difference within 20% of the data variation. In fact, the two sample means of monthly income are very close to each other: the mean monthly income of males is $3072.09, and that of females is $3080.72.
It should be noted that the conclusion "to continue sampling" is for the pre-specified effective size of 0.2. When the effective size changes, the inference conclusion may change. For instance, with the same dataset, when the effective size is changed to 0.8, the corresponding log-likelihood of the alternative space becomes
$$\sup \sum_i \log f\big(x_i : |\mu_{male} - \mu_{female}| > 0.8\sigma\big) = -8.09254,$$
and the corresponding log-likelihood ratio becomes



$$\text{log-likelihood ratio} = \sup \sum_i \log f\big(x_i \mid |\mu_{male} - \mu_{female}| > 0.8\sigma\big) - \sum_i \log f(x_i \mid \mu_{male} = \mu_{female}) = -8.09254 - 1.42137 = -9.51391.$$

Since the value of the likelihood ratio is less than −2.9444, according to Wald's SPRT, the conclusion is to accept the null hypothesis: the mean monthly incomes of males and females show no significant difference beyond 80% of the data variation. In this case, the data contain evidence supporting the claim that the mean monthly income difference between males and females is not beyond 80% of the data variation. This echoes the earlier discussion of the relationship between the effective size and the inference conclusion.

Case 2. Wald’s SPRT t-test on a hypothesized mean


Now assume that, instead of comparing two means, we are interested in testing whether the population mean (monthly income) is within a certain range of a fixed value. For instance, assume that the hypothesis is
$$\mu = 3100 \quad \text{versus} \quad |\mu - 3100| > 0.5\sigma;$$
namely, the hypothesized mean monthly income is set to $3100, and we want to see whether there is any data evidence supporting the claim that the difference is more than 50% of the data variation.
As shown in Figure 11.2, the log-likelihood of the null hypothesis is −0.41656, and that of the alternative hypothesis reads −13.20851. This makes the log-likelihood ratio
$$-13.20851 - (-0.41656) = -12.79195,$$
which is less than the lower threshold −2.9444. According to Wald's SPRT, the optimal decision is to accept the null hypothesis. As a matter of fact, the sample mean, $3076.4, is indeed well within the range of 50% of the data variation, which supports Wald's SPRT in accepting the null hypothesis.
For comparison purposes, when the hypothesized mean is set to $1200, as shown in the second half of Figure 11.2, the hypothesized mean of $1200 is far below the sample mean. The corresponding log-likelihood value for the null hypothesis is −169.7469, while that for the alternative hypothesis is −119.9379; the log-likelihood ratio reads 49.8090, which is larger than the upper threshold of Wald's SPRT, 2.9444. Thus the inference conclusion is to accept the alternative hypothesis. In fact, the sample mean monthly income

[R console output omitted due to garbled encoding: sprtt one-sample SPRT t-tests of monthly income against the hypothesized means $3100 and $1200 at effective size d = 0.5; the key quantities are quoted in the text.]

FIGURE 11.2
Overall mean monthly income vs a value with sprtt

of the data is $3076.4, which is beyond the hypothesized mean ($1200) by more
than 50% of the data variation.

Case 3. Effective size and likelihood under alternative hypothesis


To further explore the relationship between the effective size and the log-likelihood of the alternative hypothesis, we plot the corresponding log-likelihood values of the alternative hypothesis for every 10% increase in the effective size.
The plot in Figure 11.3 clearly shows that, for this dataset, the likelihood value of the alternative space decreases as the effective size increases. This is consistent with the fact that it is harder to detect a smaller difference between the true mean and the hypothesized mean.


[R code omitted due to garbled encoding: a loop that reruns the sprtt t-test at effective sizes d = 0.1, 0.2, ..., 1.0, collects the log-likelihood of the alternative hypothesis at each d, and plots it against d; see Figure 11.3.]

FIGURE 11.3
Log-likelihood ratio vs effective sizes

11.1.2 Two-stage Estimation for Sequential Data


In this section, we shall discuss the two-stage estimation method in the anal-
ysis of sequential data. Consider the sample size of a given set of normal data
X1 , ..., Xk in the process of testing

H0 : μ = μ0 versus H1 : μ = μ0 + δ,

for any given value δ > 0. It is well known that when the population standard
deviation is given as σ, the probability of the type-II error becomes

$$P(\text{accepting } H_0 \mid H_1 \text{ true}) = P\Big(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{k}} < Z_{1-\alpha} \,\Big|\, \mu_0 + \delta\Big) = P\Big(\frac{\bar{X} - \mu_1}{\sigma/\sqrt{k}} < Z_{1-\alpha} - \frac{\delta}{\sigma/\sqrt{k}} \,\Big|\, H_1\Big) = P\Big(Z < Z_{1-\alpha} - \frac{\delta}{\sigma/\sqrt{k}}\Big);$$

also, for given probabilities of type-I and type-II errors, the minimal sample size for one-stage sampling is
$$k \ge \frac{(Z_{1-\alpha} + Z_{1-\beta})^2 \sigma^2}{\delta^2}.$$
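For instance, with α = β = 0.05 and illustrative values σ = 0.1 and δ = 0.05 (assumed here only to make the formula concrete), the minimal one-stage sample size evaluates in R as:

alpha <- 0.05; beta <- 0.05
sigma <- 0.1; delta <- 0.05   # illustrative values, not taken from the text
ceiling((qnorm(1 - alpha) + qnorm(1 - beta))^2 * sigma^2 / delta^2)   # [1] 44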
The above analysis assumes that the population standard deviation is known; in practice, however, it is unknown. When we replace the population standard deviation with the sample standard deviation, the above discussion becomes invalid. This is because when σ is replaced by the sample standard deviation s, the latter is a random variable depending on (being a function of) the sample size k. Under this scenario, the one-stage approach cannot resolve the difficulty of sample size determination. Toward this end, we need the following theorem.
Consider the testing problem H0 : μ = μ0 versus H1 : μ = μ0 + δ, δ > 0 for
a normal population with unknown mean μ and unknown standard deviation
σ. In a sequential sampling, assume that the first n0 observations are available,
X1 , ..., Xn0 . Denote α and β the required probabilities of type-I and type-II
errors. Let s0 be the sample standard deviation of the first sample (the first
n0 observations).

Theorem 11.1 If a sequential sample of n − n0 observations is available with
$$n = \max\Big\{\Big[\frac{(t_{n_0-1,1-\alpha} + t_{n_0-1,1-\beta})^2\, s_0^2}{\delta^2}\Big] + 1, \; n_0\Big\},$$
and X̄n denotes the sample mean of the updated sample, then the rejection region
$$R = \Big\{\frac{\bar{X}_n - \mu_0}{s_0/\sqrt{n}} > t_{n_0-1,1-\alpha}\Big\}$$
has both the probability of type-I error and the probability of type-II error controlled at the α and β levels, respectively.

Proof: It suffices to prove the following two conditions for the theorem. The first one is for the probability of type-I error in the updated sample:
$$P(\text{incorrectly rejecting } H_0) = P\Big(\frac{\bar{X}_n - \mu_0}{s_0/\sqrt{n}} > t_{n_0-1,1-\alpha} \,\Big|\, \mu_0\Big) = P(t_{n_0-1} > t_{n_0-1,1-\alpha}) = \alpha.$$
As for the second condition on the control of the type-II error, notice that
$$n > \frac{(t_{n_0-1,1-\alpha} + t_{n_0-1,1-\beta})^2\, s_0^2}{\delta^2};$$
we have
$$\begin{aligned}
P(\text{incorrectly accepting } H_0) &= P\Big(\frac{\bar{X}_n - \mu_0}{s_0/\sqrt{n}} < t_{n_0-1,1-\alpha} \,\Big|\, \mu_0 + \delta\Big) \\
&= P\Big(\frac{\bar{X}_n - \mu_1}{s_0/\sqrt{n}} < t_{n_0-1,1-\alpha} - \frac{\delta}{s_0/\sqrt{n}} \,\Big|\, \mu_1\Big) \\
&= P\Big(T_{n_0-1} < t_{n_0-1,1-\alpha} - \frac{\delta}{s_0}\sqrt{n}\Big) \\
&\le P\Big(T_{n_0-1} < t_{n_0-1,1-\alpha} - \frac{\delta}{s_0}\sqrt{\frac{(t_{n_0-1,1-\alpha} + t_{n_0-1,1-\beta})^2\, s_0^2}{\delta^2}}\Big) \\
&= P\big(T_{n_0-1} < t_{n_0-1,1-\alpha} - (t_{n_0-1,1-\alpha} + t_{n_0-1,1-\beta})\big) \\
&= P(T_{n_0-1} < -t_{n_0-1,1-\beta}) = \beta.
\end{aligned}$$

The following example shows how to use Theorem 11.1 for data analysis.

Example 11.3 The following data set contains one-week trading prices of a stock on the NYSE: {14.98, 15.09, 15.12, 15.15, 15.22, 14.98, 14.96}. Does the dataset contain enough information to test whether the mean price is $15.00 or $15.05 at the 0.05 significance level with 95% power? If not, how many additional observations do we need in sequential sampling?

Solution: When the null hypothesis is μ = $15, the corresponding t-value is 1.889, which leads to a p-value of 0.054. Since the p-value is larger than the nominal 0.05 level, we do not have data evidence to reject the null hypothesis. On the other hand, when the alternative hypothesis is assumed true, μ = $15.05, the corresponding t-value is 0.567, which results in a p-value of 0.704. This p-value is much larger than the nominal 0.05 level for the probability of making a type-II error; thus, there is no data evidence to reject the alternative (or equivalently, to accept the null hypothesis) at the 0.05 level.
In summary, to control both the probability of type-I error and the probability of type-II error at the 0.05 level, the information in the seven-day stock trading prices is not enough. Based on Wald's SPRT, the conclusion for this set of data is "to continue sampling", as shown in Figure 11.4.
To control the probability of type-I error and the probability of type-II error at the 0.05 level when testing whether the mean stock selling price is $0.05 away from the $15.00 level, we need 61 observations, as shown in Figure 11.5. With n0 = 7, an additional 54 stock exchange prices are needed for this question.

Theorem 11.1 uses a two-stage sequential method to solve the problem of sample size requirements for inference when the population standard deviation is unknown. A related question is to estimate the unknown mean

stockdata <- c(14.98, 15.09, 15.12, 15.15, 15.22, 14.98, 14.96)


TwoStageTest <- function(x,theta0,theta1,alpha,beta){
xbar <- mean(x)
s <- sd(x)
n0 <- length(x)
tvalue0 <- (xbar-theta0)/(s/sqrt(n0))
tvalue1 <- (xbar-theta1)/(s/sqrt(n0))

p0 <- pt(tvalue0, df=n0-1, lower.tail = FALSE)


p1 <- pt(tvalue1, df=n0-1, lower.tail = TRUE)
output <- NULL
# Decision: a small p0 rejects H0 (accept H1); a small p1 rejects H1 (accept H0)
if (p0 <= alpha) {
output <- "Accept H1"
} else if (p1 <= beta) {
output <- "Accept H0"
} else {
output <- "Continue sampling"
}
return(list(t0=tvalue0, p0=p0,t1=tvalue1,p1=p1,output))
}
TwoStageTest(stockdata, theta0=15.000,theta1=15.05,alpha=0.05,beta=0.05)

$t0
[1] 1.888474

$p0
[1] 0.05394006

$t1
[1] 0.5665422

$p1
[1] 0.7042146

[[5]]
[1] "Continue sampling"

FIGURE 11.4
Codes for two-stage sequential Student-t test

with unknown variance and the requirement that the accuracy of the estima-
tion is controlled at a pre-specified level. In fact, the idea in Theorem 11.1 can
be further extended to the two-stage confidence estimation method as follows.

Definition 11.2 For a set of data X, the terminology estimation error refers to the following. If a 100(1 − α)% confidence interval of an unknown parameter θ is
$$\hat{\theta}(X) - e(X) \le \theta \le \hat{\theta}(X) + e(X),$$
the statistic e(X) is the estimation error.

Estimation error is a measure for evaluating the prediction accuracy in confidence interval estimation. A lower estimation error implies that the estimator is close to the true value, which can be interpreted as a higher accuracy level of the estimation. In confidence interval estimation with a fixed sample size, a higher confidence level usually leads to a lower accuracy level. For a given sample size, it is impossible to control the accuracy and the confidence level at the same time. However, using the method of two-stage sampling in sequential analysis, similarly to Theorem 11.1, we can select a sample size (in the second stage) that controls both the confidence level and the accuracy.
Consider the estimation of a population mean μ with unknown standard
deviation for normal populations. Denote E the pre-specified estimation error,

stockdata <- c(14.98, 15.09, 15.12, 15.15, 15.22, 14.98, 14.96)

TwoStageTestSize <- function(x,theta0,theta1,alpha,beta){
n0 <- length(x)
s <- sd(x)
delta <- theta1-theta0
# upper-tail t quantiles; the sign cancels after squaring below
t1 <- qt(1-alpha, n0-1, lower.tail = FALSE)
t2 <- qt(1-beta, n0-1, lower.tail = FALSE)
z <- delta**2/(t1+t2)**2
n <- ceiling(s**2/(z))   # total sample size from Theorem 11.1
return(n)
}

TwoStageTestSize(stockdata,15,15.05,0.05,0.05)

[1] 61

FIGURE 11.5
Sample size calculation for stock data-2

and 1 − α the confidence level. Denote by n0 the size of the first sample, and let s0 be the sample standard deviation of the first sample (the first n0 observations).

Theorem 11.2 If there are n − n0 observations available in the second sample, where
$$n = \max\Big\{\Big[\Big(\frac{s_0\, t_{n_0-1,1-\alpha/2}}{E}\Big)^2\Big] + 1, \; n_0\Big\},$$
and X̄n denotes the sample mean of the updated sample with n observations, then the confidence interval
$$A = (\bar{X}_n - E, \; \bar{X}_n + E)$$
has confidence level 1 − α.

Proof: Notice that the sample size in the second stage satisfies
$$n > \Big(\frac{s_0\, t_{n_0-1,1-\alpha/2}}{E}\Big)^2;$$
we have
$$\begin{aligned}
P(\bar{X}_n - E \le \mu \le \bar{X}_n + E) &= P(|\bar{X}_n - \mu| \le E) = P\Big(\frac{|\bar{X}_n - \mu|}{s_0}\sqrt{n} \le \frac{E}{s_0}\sqrt{n}\Big) \\
&\ge P\Big(\frac{|\bar{X}_n - \mu|}{s_0}\sqrt{n} \le \frac{E}{s_0}\sqrt{\Big(\frac{s_0\, t_{n_0-1,1-\alpha/2}}{E}\Big)^2}\Big) \\
&= P\Big(\frac{|\bar{X}_n - \mu|}{s_0}\sqrt{n} \le t_{n_0-1,1-\alpha/2}\Big) = 1 - \alpha.
\end{aligned}$$

We use the following examples to illustrate the use of the above theorem.

Example 11.4 Apply the two-stage confidence procedure to the following questions.
1. Use the following data to estimate the mean stock trading price at the 95% confidence level: 14.98, 15.09, 15.12, 15.15, 15.22, 14.98, 14.96. If we need the accuracy to be 0.05, how many observations do we need?
2. If we use the previous data as the initial sample and want a confidence estimate with accuracy 0.06, how many additional observations do we need?

Solution. Figure 11.6 contains the R code and output for Example 11.4. In the seven-day stock exchange data, the sample mean is $15.07 with a sample standard deviation of $0.10. At the 95% confidence level, the mean stock exchange price is estimated to lie in the range from $14.98 to $15.16. The estimation error is $0.09, which exceeds the 5-cent requirement. Using the two-stage confidence approach of Theorem 11.2, a total of 24 observations (17 additional observations) is needed to keep the confidence level at 0.95 and the accuracy level at 5 cents.
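For the second question, the same function (see Figure 11.6) can be rerun with the accuracy requirement relaxed to $0.06; under the same first-stage estimates, this yields a smaller total sample size:

# Accuracy relaxed to 6 cents: total n = 17, i.e., 10 additional observations
TwoStageConfidenceSize(stockdata, 0.05, 0.06)$n   # [1] 17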

11.2 Simultaneous Learning in Dose-response Studies


When data come from two or more resources, the random error in each resource contributes to the prediction error of the overall statement combining the multiple resources. In this case, it is desirable to control the error rates simultaneously so that the overall error rate can be maintained at a nominal level. In statistics and data analysis, simultaneous inference comprises methodologies controlling errors from various resources. Early work includes conventional methods such as all pairwise multiple comparisons by John Tukey [118], multiple comparisons with a control by Dunnett [45], and multiple

stockdata <- c(14.98, 15.09, 15.12, 15.15, 15.22, 14.98, 14.96)

TwoStageConfidenceSize <- function(x, alpha,d){
n0 <- length(x)
xbar0 <- mean(x)
s0 <- sd(x)
t0 <- qt(1-(alpha/2), n0-1)            # two-sided t quantile from the first stage
CI <- xbar0+c(-1,1)*(s0/sqrt(n0))*t0   # first-stage confidence interval
error <- (CI[2]-CI[1])/2               # first-stage estimation error
z <- (t0/d)**(-2)
n <- ceiling((s0**2)/z)                # total sample size from Theorem 11.2
return(list(SampleMean=xbar0,SampleSD=s0, t0=t0, CI=CI,error=error,z=z,n=n))
}

TwoStageConfidenceSize(stockdata,0.05,0.05)

$SampleMean
[1] 15.07143

$SampleSD
[1] 0.1000714

$t0
[1] 2.446912

$CI
[1] 14.97888 15.16398

$error
[1] 0.09255061

$z
[1] 0.0004175451

$n
[1] 24

FIGURE 11.6
Sample size calculation for stock data-3

comparisons with the best by Hsu [67], as well as neighboring comparisons by Chen and Hoppe [28], to list just a few. More details in this regard can be found in the books by Hochberg and Tamhane (1987) and Hsu (1996); methods associated with probability inequalities can also be found, for example, in Chen (2014) [22]. In what follows, we shall discuss some current developments on partitioning methods for multiple comparisons, which are related to the decision trees and range regression discussed in Chapter 9. Under the conditions of directed and inverted confidence regions, the first subsection presents step-up simultaneous confidence procedures to identify the minimum effective dose of a drug. The second subsection is on simultaneous confidence bands estimating several dose-response curves simultaneously over a continuous domain of the input dosage.

11.2.1 Confidence Procedures for Aspirin Efficacy


We start with an illustrative example to introduce the setting and the infor-
mation background on step-up confidence procedures.

Example 11.5 Consider a double-blinded experiment for the efficacy of As-


pirin regarding acute myocardial infarction in men with unstable angina.
60 patients were randomly assigned to five distinctive dosages (for instance,
100mg, 200mg, 250mg, 300mg, and 325mg, daily) for a period of time, where
the 100mg Aspirin treatment serves as the baseline (active control) group.
Patients in the treatment groups of 250mg and 325mg were given Heparin
(5000-U intravenous bolus) as a supplement treatment in conjunction with
Aspirin. After diagnostic evaluations, each patient was assigned a risk score
based on the updated health information including the diagnosis of pathologic
Q-wave changes on electrocardiograms.
The outcome variable is the cardiovascular safety score ranging from 1 to 20, indicating the highest to the lowest risk of heart attack in the future. The goal of the study was to detect treatments that effectively reduce future heart-attack risks by significantly increasing cardiovascular safety scores.
In the study, a treatment is regarded as being significant if it improves the
median cardiovascular risk score by 5 units, compared with that of the active
control group.
Without a specific model assumption for the underlying distribution of the data, we use Wilcoxon's rank sum statistic to test the median risk score difference between each treatment group and the baseline group, Hi : ηi − η0 ≤ 5 versus Ki : ηi − η0 > 5 for i = 1, 2, 3, 4 (i = 0 represents the treatment of 100mg daily, the active control group). From the
data set, the pairwise Wilcoxon test statistics between the treatment and the
baseline group are, W1 = 75.5, W2 = 78, W3 = 76, W4 = 82, and the p-
values of pairwise comparisons are p̂1 = 0.0142, p̂2 = 0.0226, p̂3 = 0.0156,
and p̂4 = 0.0445. Thus, the ordered p-values corresponding to the changes
of cardiovascular scores are p̂(1) = 0.0142, p̂(2) = 0.0156, p̂(3) = 0.0226, and
p̂(4) = 0.0445.
The above methods of analysis fail to identify the efficacy of Aspirin com-
bined with Heparin, which is clinically regarded as an efficacious treatment.
Neither the Bonferroni adjustment nor Holm's procedure is able to detect the significant difference, given the fact that the smallest p-value (0.0142) is larger than the cut-off value, α/4 = 0.0125. On the other hand, the application of Hochberg's step-up procedure requires additional model assumptions (such as MTP2), which are implausible to verify for this set of data. Under this scenario, a new inference methodology is called for.
Before discussing confidence procedures, we shall introduce two basic con-
cepts in the sequel. The first one is the concept of inverted confidence set, and
the second one is the concept of directed confidence set.
Definition 11.3 Inverted Confidence Set. For θ = (θ1, ..., θk)ᵀ, let P̂i(y|θi*) be the p-value for the simple null hypothesis Hi0*: θi = θi* versus Hi1*: θi ∈ Θic, where θi* ∈ Θi. Denote by Cit(y) (or Cit for notational convenience) the set {y : θ ∈ {P̂i(y|θ) ≥ α/t}} for any integer t = 1, ..., k. (Here, we use the notation of Casella and Berger, 2002 [16], page 463.) Since this confidence set is actually inverted from the corresponding rejection region via the individual p-value, we name it the inverted confidence set.

The following example illustrates the concept of an inverted confidence set Cit(y) when the data come from a set of k normal populations.

Example 11.6 Consider a set of data from k different resources, y = (Xij), i = 1, ..., k and j = 1, ..., ni, where Xij ∼ N(μi, σi²) and σi is unknown. For a fixed index i, consider the null hypothesis Hi0: μi ≤ μi0 versus the alternative hypothesis Hi1: μi > μi0. Denote Tv the corresponding t-statistic, and t1−α the value satisfying P(Tv < t1−α) = 1 − α. Also denote the corresponding rejection region
$$R = \big\{y : (\bar{X}_i - \mu_{i0})/(\hat{\sigma}_i/\sqrt{n_i}) > t_{1-\alpha}\big\}.$$
We are interested in finding the inverted confidence set with confidence level 1 − α.

For any value μi* ≤ μi0, the p-value for μi = μi* versus Hi1: μi > μi0 is
$$\hat{P}_i(y \mid \mu_i^*) = P\big(T_v > (\bar{X}_i - \mu_i^*)/(\hat{\sigma}_i/\sqrt{n_i})\big);$$
and the p-value for Hi0: μi ≤ μi0 versus Hi1: μi > μi0 is
$$\hat{P}_i(y) = \hat{P}_i(y \mid \Theta_{i0}) = \sup_{\theta_i \in \Theta_{i0}} \hat{P}_i(y \mid \theta_i) = P\big(T_v > (\bar{X}_i - \mu_{i0})/(\hat{\sigma}_i/\sqrt{n_i})\big).$$

Now, notice that for any value μi, P̂i(y|μi) has the following property:
$$\{y : \mu_i \in \{\hat{P}_i(y \mid \mu_i) \ge \alpha\}\} = \{y : \mu_i \in \{(\bar{X}_i - \mu_i)\sqrt{n_i}/\hat{\sigma}_i \le t_{1-\alpha}\}\} = \{y : \mu_i \in \{\mu_i \ge \bar{X}_i - t_{1-\alpha}(\hat{\sigma}_i/\sqrt{n_i})\}\}.$$
Thus, the inverted confidence set for μi is
$$C(y) = \big(\bar{X}_i - t_{1-\alpha}(\hat{\sigma}_i/\sqrt{n_i}), \, +\infty\big),$$
and for each μi,
$$P\big(y : \hat{P}_i(y \mid \mu_i) \ge \alpha\big) = P\big((\bar{X}_i - \mu_i)\sqrt{n_i}/\hat{\sigma}_i \le t_{1-\alpha}\big) = 1 - \alpha.$$

Another concept we need in this section is the Directed Confidence Set (Hsu
and Berger, 1999 [68]): A confidence set for θ, C(y), is said to be directed
toward a subset Θ∗ of the parameter space Θ, if for every sample point y,
either Θ∗ ⊂ C(y) or C(y) ⊂ Θ∗ .

Example 11.7 In Example 11.6, the confidence set
$$C(y) = \{y : \mu_i \in (\bar{X}_i - t_{1-\alpha}(\hat{\sigma}_i/\sqrt{n_i}), \, +\infty)\}$$
is directed toward the alternative parameter space
$$\Theta^* = \{\mu : \mu_i > \mu_{i0}\}.$$
This is because either the confidence set C(y) contains the alternative space Θ*, when
$$\mu_{i0} > \bar{X}_i - t_{1-\alpha}(\hat{\sigma}_i/\sqrt{n_i}),$$
or it is contained in the alternative space, when
$$\mu_{i0} < \bar{X}_i - t_{1-\alpha}(\hat{\sigma}_i/\sqrt{n_i}).$$

For a given sample, the confidence set is a subset of the parameter space. For a given parameter, the confidence set is an event in the sample space (see, for example, Casella and Berger, 2002, p. 463 [16]). The concept of a directed confidence set, in conjunction with the inverted confidence set, leads to the following result.
For the multiple testing problem of Hi0: θi ∈ Θi versus Hi1: θi ∈ Θic, assume that for any nested rejection region and any permissible integers i and t, there exists an inverted confidence set Cit(y) that is directed toward Θic. When screening down from the largest to the smallest ordered p-value, let m be the index that satisfies the following two criteria:
i) P̂(m) ≥ α/(k − m + 1); and
ii) for any index i with m < i ≤ k, P̂(i) < α/(k − i + 1). For notational convenience, denote
C0k (y) = Θ
when m = 0 where all p-values are smaller than the corresponding cutoff
values, and
Θc(k+1) = Θ
when m = k. Under this setting, the simultaneous confidence set keeps the
confidence level at the nominal level.

Theorem 11.3
$$P\Big(\theta \in \Theta_{(k+1)}^c \cap \Theta_{(k)}^c \cap \cdots \cap \Theta_{(m+1)}^c \cap C_{(m)}^{k-m+1}(y)\Big) \ge 1 - \alpha. \tag{11.1}$$

The proof of the theorem can be found in Chen (2016, [23]). Denoting
$$\alpha(i) = \alpha/(k - i + 1),$$
Theorem 11.3 leads to a step-up operating algorithm for the construction of simultaneous confidence sets. We will describe the new confidence procedure

and apply the procedure to construct simultaneous confidence sets for the
Aspirin example.

Algorithm: A Step-up Confidence Procedure


Step 1 If P̂(k) > α, then claim θ(k) ∈ {P̂(k) ≥ α}, stop;
else, claim θ(k) ∈ Θc(k) , and go to Step 2.
Step 2 If P̂(k−1) > α/2, then claim θ(k−1) ∈ {P̂(k−1) ≥ α/2}, stop;
else claim θ(k−1) ∈ Θc(k−1) , go to Step 3.
..
.
Step i If P̂(i) > α(i), then claim θ(i) ∈ {P̂(i) ≥ α(i)}, stop;
else claim θ(i) ∈ Θc(i) , and go to Step i + 1.
..
.
Step k If P̂(1) > α/k, then claim θ(1) ∈ {P̂(1) ≥ α/k}, stop;
else claim θ(1) ∈ Θc(1) , stop.
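The screening logic of this algorithm is easy to express in code. The sketch below is a minimal illustration (not the authors' software): it compares the ordered p-values, from the largest down, against the cutoffs α, α/2, α/3, ..., and reports how many hypotheses are asserted to lie in their alternative spaces before the procedure stops.

# Step-up confidence screening: compare p_(k), p_(k-1), ... with alpha, alpha/2, ...
step_up_screen <- function(pvals, alpha = 0.05) {
  p_desc <- sort(pvals, decreasing = TRUE)   # p_(k), p_(k-1), ..., p_(1)
  k <- length(p_desc)
  for (i in 1:k) {
    if (p_desc[i] > alpha / i) {
      # stop: a confidence statement {P-hat >= alpha/i} is made for this parameter
      return(list(asserted_alternative = i - 1, stopped_at_step = i, cutoff = alpha / i))
    }
  }
  list(asserted_alternative = k, stopped_at_step = NA, cutoff = alpha / k)
}

# Aspirin example p-values: three alternatives asserted, procedure stops at Step 4
step_up_screen(c(0.0142, 0.0226, 0.0156, 0.0445))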

We now apply the algorithm developed above to analyze the Aspirin efficacy example in Example 11.5. Notice that the null parameter space is of the form θi = ηi − η0 ≤ 5, and the confidence region corresponding to p̂i ≥ α/t is of the form W ≤ c1−α, where W is the Wilcoxon statistic. Thus the inverted confidence set is of the form ηi − η0 ≥ U, where U is a lower confidence bound for the median difference derived from W, and the condition of a directed confidence interval is satisfied.
By the step-up confidence set algorithm, we have the following analytical
results when α = 0.05.

Algorithm for the Aspirin Example


Step 1 P̂(4) = 0.0445 < α, claim θ(4) ∈ Θc(4) , and go to Step 2.
Step 2 P̂(3) = 0.0226 < α/2, claim θ(3) ∈ Θc(3) and go to Step 3.
Step 3 P̂(2) = 0.0156 < α/3, claim θ(2) ∈ Θc(2) and go to Step 4.
Step 4 P̂(1) = 0.0142 > α/4, claim θ(1) ∈ {P̂(1) ≥ α/4}, stop.

Thus, the 95% simultaneous confidence set consists of the following compo-
nents:

η4 − η0 − 5 ∈ (0, ∞)
η3 − η0 − 5 ∈ (0, ∞)
η2 − η0 − 5 ∈ (0, ∞)
η1 − η0 − 5 ∈ (−0.002, ∞).

The above prediction outcomes can be interpreted as follows. Unlike the previous inference procedures, the step-up confidence procedure is able to claim that the Aspirin treatments significantly reduce the risk of heart attack in the clinical trial. The improvement in the cardiovascular safety score is at least five units for the first three treatments, with the following statements holding simultaneously.
1 The daily treatment of Aspirin at 250 mg is statistically significant in
reducing the risk of heart attack. It increases the cardiovascular safety
score by at least 5 units.
2 The daily treatment of Aspirin at 300mg in conjunction with supplement
treatment of Heparin is statistically significant in reducing the risk of heart
attack. It increases the cardiovascular safety score by at least 5 units.
3 The daily treatment of Aspirin at 325 mg with Heparin is statistically sig-
nificant in reducing the risk of heart attack. It increases the cardiovascular
safety score by at least 5 units.
4 The daily treatment of Aspirin at 200 mg alone improves the median cardiovascular score by at least 4.998 units. This is because the 1 − α/4 = 98.75% confidence interval for the median difference is
η1 − η0 − 5 ∈ (−0.002, ∞).

The learning result of the new step-up confidence method fits well with clinical
expectations. Chen (2016)[23] contains more technical discussions on motiva-
tion, theory, and applications of this method.

11.2.2 Confidence Bands on Thrombolysis Effects


Since a dose-response curve is usually continuous, it is desirable to have a
simultaneous confidence band instead of confidence intervals in dose-response
studies. In this section, we describe a method proposed by Kerns and Chen
(2017) [76] on the construction of simultaneous confidence bands. We present
a method to construct simultaneous confidence bands for the comparison of
multiple regression lines when individual confidence bands are available. The
theoretical result will be illustrated by an example comparing efficacy curves
of success rates of thrombolysis for three different patient groups. Technical
details including theoretical proofs and simulations can be found in the paper
by Kerns and Chen (2017) [76].
In this subsection, we shall describe a general theorem on the construction of step-wise simultaneous confidence bands, delineate the operating algorithm, and illustrate the method with an example on the impacts of thrombolysis for different age groups of patients.
Consider a logistic regression model
$$y(x_1, ..., x_m) = P(Y = 1 \mid X_i = x_i, \, i = 1, \ldots, m) = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m)]},$$

where xi, i = 1, ..., m, are continuous variables such as the time of exposure to a risk factor or the dosage of a drug.
Consider the special case where m = 1, with drug dosage as the input and k treatment groups. Let ŷi(x) denote a 100(1 − α)% lower confidence bound for the response curve yi(x), i = 1, 2, ..., k, and let δ(x) denote a threshold for the efficacy of the drug.


$$\hat{y}_i^*(x) = \begin{cases} \min(\hat{y}_i(x), \delta(x)), & \text{if there exists a value } x_0 \text{ such that } \hat{y}_i(x_0) < \delta(x_0), \\ \hat{y}_i(x), & \text{if } \hat{y}_i(x) > \delta(x) \text{ for all } x. \end{cases} \tag{11.2}$$

Denote the set Li(x) = (ŷi*(x), ∞). The set Li(x) possesses the following statistical properties.
First, the one-sided confidence region is directed toward the pre-specified efficacy region (δ(x), ∞). For any i = 1, 2, ..., k, if the boundary ŷi(x) < δ(x) at a point x = x0, then ŷi*(x) < δ(x) there. Therefore, the set Li(x) is directed toward the set Δ(x) = (δ(x), ∞); that is, either Li(x) ⊆ Δ(x) or Δ(x) ⊆ Li(x).
Second, the one-sided confidence region Li(x) attains the nominal confidence coverage. For any i = 1, 2, ..., k, if the boundary ŷi(x) is a 100(1 − α)% lower confidence bound for the response curve yi(x), then ŷi*(x) is also a 100(1 − α)% lower confidence bound for yi(x).
We need the following notations for the key theorem in this subsection.

1. Simultaneous lower confidence bands. Consider k populations of interest. For any i = 1, 2, ..., k, denote by
$$L_i(x) = (\hat{y}_i^*(x), \infty)$$
a 100(1 − α)% lower confidence set for the logistic regression line yi(x). Assume that Li(x) is directed toward the alternative space Δ(x). Also, let
$$\hat{y}_{k+1}(x) = \min_i \hat{y}_i^*(x),$$
which is a 100(1 − α)% lower confidence bound for the lower boundary across all the response curves, min_i yi(x).
Let S be the smallest integer i such that ŷi*(x) > δ(x) for all x, if such an i (1 ≤ i ≤ k) exists; otherwise, let S = k + 1.

2. Simultaneous upper confidence bands. Suppose that
$$U_i(x) = (-\infty, \hat{y}_i^{**}(x)), \quad i = 1, 2, \ldots, k,$$
is a 100(1 − α)% upper simultaneous confidence band for the parameter function yi(x) that is directed toward the parameter set
$$\Lambda_i^c(x) = (-\infty, \eta(x)),$$
respectively. Let
$$\hat{y}_{k+1}^{U}(x) = \max_i \hat{y}_i^{**}(x)$$
for the parameter array of logistic regression functions θ(x).
Let T denote the smallest integer i such that ŷi**(x) < η(x) if such an i (1 ≤ i ≤ k) exists; otherwise, let T = k + 1.

3. Simultaneous two-sided confidence bands. Let M = min(T, S), and define the set
$$YW(x) = \begin{cases} (\Theta_1^c \cap \Lambda_1^c) \cap (\Theta_2^c \cap \Lambda_2^c) \cap \cdots \cap (\Theta_{M-1}^c \cap \Lambda_{M-1}^c) \cap L_S(x), & \text{if } S < T, \\ (\Theta_1^c \cap \Lambda_1^c) \cap (\Theta_2^c \cap \Lambda_2^c) \cap \cdots \cap (\Theta_{M-1}^c \cap \Lambda_{M-1}^c) \cap U_T(x), & \text{if } S > T. \end{cases} \tag{11.3}$$
With the above setting, we have the following result.

Theorem 11.4 For all θ(x) ∈ Θ, we have the simultaneous lower confidence bands
$$P_{\theta(x)}\big(\theta(x) \in \Theta_1^c \cap \Theta_2^c \cap \cdots \cap \Theta_{S-1}^c \cap L_S(x), \; x \in (a, b)\big) \ge 1 - \alpha;$$
the simultaneous upper confidence bands
$$P_{\theta(x)}\big(\theta(x) \in \Lambda_1^c \cap \Lambda_2^c \cap \cdots \cap \Lambda_{T-1}^c \cap U_T(x), \; x \in (a, b)\big) \ge 1 - \alpha;$$
and the simultaneous two-sided confidence bands: for all θ(x) ∈ Θ,
$$P_{\theta(x)}\big(\theta(x) \in YW(x), \; x \in (a, b)\big) \ge 1 - \alpha,$$
where YW(x) is a set of two-sided 100(1 − α)% confidence bounds for θ(x).

The proof of the above result can be found in Kerns and Chen (2017)[76].
Notice that both LS (x) ∩ ΛcS when S < T and UT (x) ∩ ΘcT when S > T are
100(1 − α)% confidence bounds for θ(x).

In order to use the above theorem to construct simultaneous confidence bands, the first step is to establish an individual confidence band for each single logistic regression curve under consideration. Methods for the construction of individual confidence bands can be found in Piegorsch and Casella (1988) [93], among others. After obtaining individual confidence bands, in the second step, we can form the simultaneous confidence bands ŷi*(x), for i = 1, ..., k, by deriving confidence bands directed toward the alternative parameter space, as described above.
For ease of application in the construction of simultaneous confidence bands for k logistic regression lines, we can simplify the above theorem into

an operating algorithm. Notice that the parameter array of logistic regression lines θ(x) is now a function of the value x, and the directed-toward confidence set is now a region bounded by a function of x. On the basis of the individual confidence bands, the stepwise confidence procedure corresponding to the above theorem can be formulated as follows.

Stepwise Lower Confidence Bands Procedure

Step 1 If ŷ1∗ (x) > δ(x), x ∈ (a, b),


then assert y1 (x) > δ(x), x ∈ (a, b), and go to Step 2;
else assert y1 (x) > ŷ1∗ (x), x ∈ (a, b) and stop.
Step 2 If ŷ2∗ (x) > δ(x), x ∈ (a, b),
then assert y2 (x) > δ(x), x ∈ (a, b), and go to Step 3;
else assert y2 (x) > ŷ2∗ (x), x ∈ (a, b) and stop.
..
.
Step k If ŷk∗ (x) > δ(x), x ∈ (a, b),
then assert yk (x) > δ(x), x ∈ (a, b), and go to Step k + 1;
else assert yk (x) > ŷk∗ (x), x ∈ (a, b) and stop.

Step k + 1 Assert θ(x) > ŷk+1(x), x ∈ (a, b) and stop.
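As a sketch of how this screening might be mechanized, the following code steps through a list of lower-band functions over a grid on (a, b), asserting that each curve exceeds the threshold until one band dips below it; the function and argument names here are illustrative only, not taken from Kerns and Chen (2017).

# Hedged sketch: stepwise screening of k lower confidence band functions
stepwise_lower_bands <- function(band_fns, delta_fn, xgrid) {
  k <- length(band_fns)
  for (i in 1:k) {
    if (all(band_fns[[i]](xgrid) > delta_fn(xgrid))) {
      cat("Step", i, ": assert y_", i, "(x) > delta(x) on (a, b)\n")
    } else {
      cat("Step", i, ": assert y_", i, "(x) > its own lower band; stop\n")
      return(invisible(i))
    }
  }
  cat("Step", k + 1, ": assert the overall lower band and stop\n")
  invisible(k + 1)
}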

Upon the establishment of the above theorem and algorithm, we may now
illustrate their applications using the following example.

TABLE 11.1
Success rates of thrombolytic therapy
Lysis time 0.25 1.25 2.5 8 24 50
Log-dose -1.386 0.223 0.916 2.08 3.18 3.91
Group 1 16/78 23/78 48/78 56/78 68/78 78/78
Group 2 10/78 13/78 37/78 48/78 54/78 65/78
Group 3 2/78 11/78 22/78 25/78 35/78 44/78

Example 11.8 In a prospective randomized trial designed to investigate the


comparative results of traditional surgical revascularization with those of
catheter-directed thrombolytic therapy, the therapy is intra-arterial catheter di-
rected thrombolysis with UK (Urokinase) 250,000 unit bolus followed by 4000
units/min for 4 hours, and then 2000 units/min for as many as 36 hours.
To evaluate the age impact on the successful rate of thrombolysis response
curves, three categories of patients were studied. The first group consists of
patients whose ages were under 55 and had acute limb ischaemia (onset of
symptoms < 14 days); the second group consists of patients whose ages were
under 55 and have chronic limb ischaemia (onset of symptoms > 14 days);

and the third group consists of patients whose ages were over 55 with (acute or chronic) limb ischaemia. The dose variable in this example is the time of therapy, and the outcome is the success rate of the procedure (more than 80% lysis). For illustration purposes, summaries of the data are given in Table 11.1.

Based on the information in Table 11.1, the ML estimates from the logistic fit and the Fisher information matrix for each group were computed and are displayed in the following output.

TABLE 11.2
Parameter estimation on dose-response curves

Group 1: β̂ = (−0.8699, 0.2654), F = [ 162.12, 328.68; 328.68, 1103.38 ]
Group 2: β̂ = (−1.1374, 0.3179), F = [ 132.81, 284.09; 284.09, 944.85 ]
Group 3: β̂ = (−1.3244, 0.3363), F = [ 84.33, 193.69; 193.69, 646.02 ]
(each F is a 2 × 2 matrix, written row by row)

Table 11.2 was used to construct the individual confidence bands as in


Kerns and Chen (2017)[76].
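For readers who wish to experiment, a logistic dose-response curve can be fitted to grouped counts like those in Table 11.1 with glm() in R. The sketch below fits Group 1 only; it illustrates the generic fitting step, and its output need not reproduce the authors' Table 11.2 exactly, since the exact estimates depend on the design and weighting they used.

# Grouped binomial logistic fit for Group 1 of Table 11.1 (successes out of 78)
logdose <- c(-1.386, 0.223, 0.916, 2.08, 3.18, 3.91)
success <- c(16, 23, 48, 56, 68, 78)
fit1 <- glm(cbind(success, 78 - success) ~ logdose, family = binomial)
coef(fit1)   # ML estimates (beta0, beta1) for this group
vcov(fit1)   # estimated covariance matrix, the inverse of the observed Fisher information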
Figure 11.7 displays the estimated 95% lower confidence bands for the three groups, along with the threshold, which is chosen to be δ(x) = 1/{1 + exp[−(−1.39 + .3x)]}. The three lower confidence bands are displayed using dotted lines, in the order Group 1, Group 2, Group 3 from top to bottom. The thick solid line is the threshold. The figure indicates that the therapy is effective for patients in the first age group only.
In this example, we consider the analysis of success rates of thrombolysis for patients with DVT (deep vein thrombosis). Thrombosis occurs when
blood cells and proteins (e.g., erythrocytes, leukocytes, platelets, and fibrin)
aggregate and clot within an intact vein. Thrombolytic therapy is a treat-
ment procedure offering a compromise between surgery and anticoagulation.
It provides a potential to restore arterial flow through mildly invasive tech-
niques. Despite the theoretical advantages of the thrombolysis procedure, the
safety and efficacy of thrombolysis are often debated in the medical literature.
One of the safety concerns is that thrombolysis may restore blood flow more slowly than immediate surgical revascularization. In that case, tissue ischemia may progress to infarction before the artery has recanalized, and hence result in internal bleeding, stroke, and other complications. More information on the
efficacy of thrombolysis therapy can be found in Weaver et al. (1996, [124]),
Aziz, Chen and Comerota (2011, [4]), and Chen and Comerota (2012, [24]).

[Plot: three dotted 95% lower confidence bands (Groups 1-3, from top to bottom) and the solid threshold curve; x-axis: Log(lysis dosage), y-axis: Success rates.]

FIGURE 11.7
Simultaneous lower bands for thrombolysis effects

11.3 Weighted Simultaneous Confidence Regions


Besides simultaneous confidence sets at different dosage levels and simultaneous confidence bands for two or more continuous logistic regression curves, another critical aspect of simultaneous learning is weighted inference. As
pointed out in Ma et al. (2023 [82]), when two or more confidence regions
are involved in a study, it is necessary to have the method of weighted si-
multaneous confidence regions. For example, in deep learning, weight information is incorporated into different layers of the prediction process. Population weight is an integral part of multiple inference. For simplicity and convenience in discussion, we traditionally assume equal weights for the populations under investigation, which is not necessarily true all the time. For instance, in a system consisting of multiple subsystems, when some subsystems are relatively newer or more efficient than others, the differences in the ages of the subsystems may result in different levels of reliability performance. It is unrealistic to assume that all the subsystems carry the same weight in the evaluation of the system reliability. Similar applications also occur in various scenarios. For instance, in gene micro-array analysis, when some genes are biologically more closely related (than others) to a specific disease, we need to

assign higher weights when performing simultaneous inference on those genes;


in system reliability analysis in engineering, where the reliability of primary
subsystems is tested prior to the regular (or secondary) subsystems, we need
to put relatively higher weights on the primary subsystems. In investment
strategy analysis in actuarial science, when certain stocks have dominating
priorities in building investment portfolios, we increase weights for prioritized
stocks in portfolio analytics. The conventional way of assigning equal weights
to all involved populations is not always valid in practice.
In the rest of this section, we shall synthesize two studies of recent methodological developments on simultaneous inference for weighted hypotheses. The first is on the analysis of gene expression data in a breast cancer investigation. The second subsection is on inference for parameter arrays, a new concept expanding inference on vector-style multivariate parameters to matrix-style parameter arrays. By adding more parameter dimensions, each containing infinitely many values, the method of Ma et al. (2023) [82] essentially opens a new direction in high-dimensional simultaneous inference.

11.3.1 Weighted Hypotheses and Breast Cancer Study


We start with an application of weighted hypotheses in a breast cancer study. Instead of blindly assigning equal weights to each gene under investigation, we use additional biological information to assign different weights according to their relevance to the disease when selecting genes with significantly different expressions. The new procedure also strongly controls the family-wise error rate in multiple-path learning.
Storey and Tibshirani (2003, [114]) used the false discovery rate (FDR) to
identify genes associated with breast cancer. The breast cancer susceptibility
genes, BRCA1 and BRCA2, play a critical role in DNA repair and apoptosis.
If either one mutates, its function of inhibiting tumorigenesis would be affected, leading to an elevated risk of breast cancer. Together, BRCA1 and BRCA2
mutations account for a lifetime risk of 50 to 85 percent of breast cancer and
15 to 45 percent of ovarian cancer (Hedenfalk et al. 2001) [57]. Hedenfalk et
al. (2001) examined tissue samples from BRCA1-mutation-positive, BRCA2-
mutation-positive tumors, and some sporadic cases to detect differential gene-
expressions among these three types of breast cancer.
Based on a modified F -test, 51 of the 3,226 genes were determined to be
significant with a threshold α = .001. Storey and Tibshirani (2003) conducted
related research using a subset of genes (k = 3,170). With a cutoff q < .05
for the control of false discovery rate (FDR), they found 160 significant genes
for differential expression between BRCA1- and BRCA2- mutation-positive
tumors. The data set is available as the supplementary material in Storey and
Tibshirani (2003, [114]).

Example 11.9 The inference question in this example is the following: when additional biological information on breast cancer is incorporated into the analysis of the dataset in Storey and Tibshirani (2003), how can one obtain prediction results with strong control of the family-wise error rate, instead of the FDR?

The problem can be formulated as follows. Let θ = (θ1, θ2, ..., θk)ᵀ be a vector of parameters and H1, H2, ..., Hk be k hypotheses with null hypothesis Hi0: θi ∈ Θi versus Hi1: θi ∈ Θic. Assume that prior information is available such that the hypotheses can be weighted according to their order of importance. Denote by ci the weight for Hi, i = 1, 2, ..., k, such that larger ci's correspond to more important hypotheses. Let
$$w_t = \sum_{i \in I_t} c_i,$$
where It is an index set for the hypotheses to be tested simultaneously.


Given a set of data y, let P̂1, P̂2, ..., P̂k denote the p-values corresponding to tests φ1, φ2, ..., φk for H1, H2, ..., Hk, respectively. Different from the weighted step-down testing procedure of Holm (1979) [60] and the weighted step-up procedure of Tamhane and Liu (2008) [115], we propose a weighted step-down confidence set procedure in this subsection.
For convenience, consider a null hypothesis
$$H_{i0}^*: \theta_i \ge \theta_i^* \quad \text{versus} \quad H_{i1}^*: \theta_i < \theta_i^*,$$
and denote the corresponding p-value by P̂i(y|θi*). Let Cit(y) be an inverted confidence set
$$C_i^t(y) = \{y : \theta \in \{\hat{P}_i(y \mid \theta) \ge c_t\, \alpha / w_t\}\}$$
for any t = 1, 2, ..., k.
For a multiple testing problem of Hi0: θi ∈ Θi versus Hi1: θi ∈ Θic, i = 1, 2, ..., k, assume that there exists an inverted confidence set Cit(y) that is directed toward Θic for all permissible integers i and t. Then, for the index m in the step-down weighted procedure, denote Θc(0) = Θ for notational convenience; Θc(i) is the alternative space associated with S(i), and C(m)^{k−m+1}(y) is the inverted and directed confidence set associated with Θc(m).

Theorem 11.5 With the above settings, we have
$$P\Big(y : \theta \in \Theta_{(0)}^c \cap \Theta_{(1)}^c \cap \cdots \cap \Theta_{(m-1)}^c \cap C_{(m)}^{k-m+1}(y)\Big) \ge 1 - \alpha. \tag{11.4}$$

Theorem 11.5 gives a step-wise confidence procedure taking into consideration weights for the importance of hypotheses. It shows that once we find individual confidence sets adjusted by the Bonferroni inequality, in conjunction with the property of being directed toward the corresponding alternative space, we can derive simultaneous confidence sets with the nominal overall confidence level. The theorem can be formulated into a prediction algorithm as follows.

Algorithm: Weighted Step-down Confidence Set Procedure

Step 1: If S(1) ≥ α/wk, then conclude θ(1) ∈ {θ : P̂1* ≥ c1*α/wk}, stop; else conclude θ(1) ∈ Θc(1), and go to Step 2.
Step 2: If S(2) ≥ α/wk−1, then conclude θ(2) ∈ {θ : P̂2* ≥ c2*α/wk−1}, stop; else conclude θ(2) ∈ Θc(2), and go to Step 3.
...
Step i: If S(i) ≥ α/wk−i+1, then conclude θ(i) ∈ {θ : P̂i* ≥ ci*α/wk−i+1}, stop; else conclude θ(i) ∈ Θc(i), and go to Step i + 1.
...
Step k: If S(k) ≥ α/w1, then conclude θ(k) ∈ {θ : P̂k* ≥ α}, stop; else conclude θ(k) ∈ Θc(k), stop.

We shall now use an example to illustrate the learning algorithm discussed above. First, we discuss an example explaining the concept of the inverted confidence set Cit(y) for the difference of two normal population means. More details on the methodology can be found in Yu et al. (2022, [128]) or Casella and Berger (2002) [16].

Example 11.10 We consider a data set (Xijt), i = 1, 2 for treatment versus placebo, j = 1, 2, ..., ni for experimental subjects, and t = 1, 2, ..., k for k different treatments, where the Xijt's follow independent N(μit, σt²) with an unknown standard deviation σt. We are interested in testing whether there is a treatment effect of t units' difference between the treatment and the placebo for each Treatment-t,
$$H_{0t}: \mu_{1t} \ge \mu_{2t} + t \quad \text{versus} \quad H_{1t}: \mu_{1t} < \mu_{2t} + t,$$
for t = 1, ..., k. Depending on the importance of each hypothesis, we assign weights c1, c2, ..., ck successively and let
$$S_t = \frac{\hat{P}_t}{c_t}$$
for t = 1, 2, ..., k.

Let S(1) ≤ S(2) ≤ ... ≤ S(k) be the ordered S-values and
$$S_{(t)} = \frac{\hat{P}_t^*}{c_t^*},$$
where P̂t* and ct* are the corresponding p-value and hypothesis weight. We compare the ordered S-value with the corresponding significance level
$$\frac{\alpha}{w_{k-t+1}}, \quad \text{where} \quad w_{k-t+1} = \sum_{s=t}^{k} c_s^*.$$

Now, we can apply the proposed weighted step-down confidence set procedure under the assumption that the first non-rejected hypothesis occurs at step q (q ≤ k).
Based on the above theorem, the inverted confidence interval at step q is
$$C_{(q)}^{k-q+1}(y) = \Big\{y : d_q \in \Big(-\infty, \; \bar{X}_{1(q)} - \bar{X}_{2(q)} + t_{v,u}\, \hat{\sigma}_{(q)} \sqrt{\frac{1}{n_{1(q)}} + \frac{1}{n_{2(q)}}}\Big)\Big\},$$
where dq = μ1(q) − μ2(q),
$$u = 1 - \frac{c_q^*\, \alpha}{w_{k-q+1}},$$
and X̄1(q) and X̄2(q) are the sample means corresponding to the q-th ordered S-value. The confidence interval C(q)^{k−q+1}(y) is applied at step q (if the procedure stops at step q).
More specifically, suppose that k = 4,

P̂1 = .10, P̂2 = .03, P̂3 = .01, P̂4 = .08,

and the corresponding weights are

c1 = 35, c2 = 10, c3 = 40, c4 = 15,

respectively. Then, the S-values are
S1 = 0.00286, S2 = 0.00300, S3 = 0.00025, and S4 = 0.00533,
and the corresponding ordered S-values are
S(1) = .00025, S(2) = .00286, S(3) = .00300, and S(4) = .00533.



Consequently,
P̂1∗ = .01, P̂2∗ = .1, P̂3∗ = .03, P̂4∗ = .08,
and the corresponding weights are

c∗1 = 40, c∗2 = 35, c∗3 = 10, and c∗4 = 15,

respectively. Under this setting,

w4 = 100, w3 = 60, w2 = 25, and w1 = 15.

Setting α = .05, the weighted step-down confidence set procedure proceeds as follows.

Step 1: S(1) = .00025 < .0005 = α/w4. Assert that θ(1) ∈ Θc(1). Go to Step 2.
Step 2: S(2) = .00286 > .00083 = α/w3. Assert that θ(2) ∈ {P̂2* ≥ c2*α/w3} = C(2)³(y), where c2*α/w3 ≈ .0292. Stop.

Here, since the ordered S_{(1)} corresponds to S_3, θ_{(1)} ∈ Θ^c_{(1)} implies that

μ_{13} < μ_{23} + 3,

so Θ^c_{(1)} = (−∞, 3). If, from a set of data, we have that

X̄_{11} − X̄_{21} + t_{v, 1 − c_2^* α/w_3} σ̂_1 √(1/n_{11} + 1/n_{21}) = 0.6,

then the algorithm claims, at the 95% confidence level, that Treatment-3 has a significant treatment effect compared with the placebo group, and that the difference between the drug and placebo for Treatment-1 is less than 0.6 units; no conclusion is reached on Treatment-2 and Treatment-4 with the current weighting. This is partly because the weights on Treatment-2 and Treatment-4 (10 and 15, respectively) are much lower than the weights for Treatment-1 and Treatment-3 (35 and 40, respectively).
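The arithmetic of this four-hypothesis example can be verified with a few lines of Python; the numbers below are exactly those given in the text.

```python
import numpy as np

p = np.array([0.10, 0.03, 0.01, 0.08])   # p-values P_1, ..., P_4
c = np.array([35.0, 10.0, 40.0, 15.0])   # weights c_1, ..., c_4
alpha = 0.05

s = p / c                    # S-values: 0.00286, 0.00300, 0.00025, 0.00533
order = np.argsort(s)        # [2, 0, 1, 3]: S_(1) comes from hypothesis 3
for step, idx in enumerate(order):
    w = c[order[step:]].sum()            # w_4 = 100, then w_3 = 60, ...
    print(f"Step {step + 1}: S = {s[idx]:.5f} vs cutoff {alpha / w:.5f}")
    if s[idx] >= alpha / w:              # stops at Step 2 (.00286 > .00083)
        break
```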

With the above setting, we can now use the weighted confidence set method to analyze the question posed in Example 11.9 as follows.

Example 11.11 Applying the new confidence procedure to re-analyze the


breast cancer data.

Storey and Tibshirani (2003) calculated the p-values from the permutation test by

p_i = (1/(3170 B)) ∑_{b=1}^{B} #{ j : |t_j^{0b}| ≥ |t_i|, j = 1, ..., 3170 },

where B = 100 and i = 1, 2, ..., 3170. More discussion on permutation tests can be found in Huang et al. (2006) [69] and Kaizar et al. (2011) [74]. In light of Jauffret et al. (2007) [73] regarding the significance of the correlation between moesin and BRCA1-associated breast cancer, we assigned weights accordingly to
analyze the gene expression data. The reanalysis consists of two parts. In the
first part, we use the published p-values in Storey and Tibshirani (2003, [114])
in conjunction with the new weighted confidence theorem. The first analysis
identified several additional significant genes, such as APEX nuclease (clone
417124), apoptosis-related protein 15 (clone 137836), apoptosis inhibitor 1
(clone 34852) and two ERCC-related genes (clone 323390 and clone 52666),
in addition to the 160 genes identified using the q values in Storey and Tib-
shirani (2003).
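As a concrete illustration of the pooled permutation p-values defined above, the following sketch computes them from a genes-by-samples matrix X and 0/1 group labels; the function name and the Welch-type t-statistic are our own choices for the sketch, not necessarily the statistic used in the original study.

```python
import numpy as np

def pooled_permutation_pvalues(X, labels, B=100, seed=0):
    """p_i = sum_b #{j : |t_j^{0b}| >= |t_i|} / (m * B), pooling the
    null statistics of all m genes over B label permutations."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]

    def tstats(lab):
        g1, g2 = X[:, lab == 1], X[:, lab == 0]
        se = np.sqrt(g1.var(axis=1, ddof=1) / g1.shape[1]
                     + g2.var(axis=1, ddof=1) / g2.shape[1])
        return (g1.mean(axis=1) - g2.mean(axis=1)) / se

    t_obs = np.abs(tstats(labels))
    null_t = np.sort(np.concatenate(
        [np.abs(tstats(rng.permutation(labels))) for _ in range(B)]))
    # For each gene, count pooled null statistics at least as extreme.
    exceed = null_t.size - np.searchsorted(null_t, t_obs, side="left")
    return exceed / (m * B)
```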
For example, screening up from the smallest S-value to the largest, the weighted p-value of APEX nuclease is 2.78 × 10⁻⁹, which is smaller than the corresponding significance level 3.30 × 10⁻⁹. We thus concluded that the corresponding confidence interval for the mean difference of the two expression levels is Θ_i^c = (0, +∞). The expression of this gene is suppressed in BRCA1-mutation-positive tumors, which results in a reduced capacity for mediating DNA repair (APEX nuclease).
Another example is apoptosis inhibitor 1, whose weighted p-value is 1.70 × 10⁻⁸. We recognize this gene as significant after comparing it with the corresponding significance level 2.54 × 10⁻⁸, and conclude that the mean difference is within (0, +∞). The expression of this gene, involved in suppressing apoptosis, is also decreased in BRCA1-mutation-positive tumors.
Furthermore, the new procedure also found a gene associated with apoptosis-related protein 15, whose weighted p-value is 1.48 × 10⁻¹⁰, less than the corresponding significance level 1.55 × 10⁻¹⁰. This indicates the significance of apoptosis-related protein 15 after multiplicity adjustments. The procedure stops at the 166th step with an inverted confidence interval calculated from the associated p-value.
Notice that the data analysis in the first part is based on permuted p-values. In the second part of the reanalysis, we assume normality for the data and recalculate the p-values using two-sample t-tests.
Specifically, the GATA-3 gene is highly correlated with the estrogen receptor, which leads to a suppressed expression of this gene in BRCA1-mutation-positive breast cancer (Eeckhoute et al. 2007, [46]). After assigning proper weights to the genes according to the biological literature, a total of 163 significant genes were identified using the proposed weighted step-down confidence set procedure, including GATA-binding protein 3 and moesin, among others. For those genes identified as significant, we concluded that the mean difference of these genes is within (0, +∞). The procedure stops at the 164th gene, where the mean difference of the two expression levels is at least −0.5692 units with 95% confidence.
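For the second part, the one-sided two-sample t-test for a single gene can be carried out as below; the group arrays are synthetic placeholders, and alternative="less" encodes suppressed expression in the mutation-positive group.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x_mut = rng.normal(-0.4, 1.0, size=7)    # hypothetical BRCA1-positive values
x_other = rng.normal(0.0, 1.0, size=8)   # hypothetical comparison group

# One-sided Welch t-test: H1 says expression is lower in the mutated group.
t_stat, p_val = stats.ttest_ind(x_mut, x_other, equal_var=False,
                                alternative="less")
print(t_stat, p_val)
```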

11.3.2 Confidence Sets for Weighted Parameter Arrays


The preceding subsection discusses a weighted simultaneous confidence
method, in which multiple parameters are treated as a vector of parameters
(vectorization) with an application for the analysis of gene expression data.
In the first look, it seems that one may simply vector the parameter array
into a long vector and somehow adapt the weighted simultaneous confidence
method of Yu et al (2022) [128] for the inference of parameter arrays. Unfor-
tunately, as pointed out in Ma et al. (2023)[82] and Zhang et al (2017) [129],
the intrinsic connection within each “subsystem” invalidates this vectoriza-
tion approach in multiple learning. For example, in the study of treatment
regime, once one of the medicines in the treatment regime is ineffective, the
patient dies already, regardless of the significance of other components in the
treatment regime (Thall et al. 2007 [117], Wang et al 2012 [123], or Wang et
al. 2017 [122]). In conjunction with different weights associated with different
“subsystems”, a new methodology on weighted simultaneous confidence set
for parameter arrays is on call.
This array-type inference problem can be viewed as a multivariate ver-
sion of univariate simultaneous inference. Instead of rearranging all multiple
parameters into one parameter vector, this subsection focuses on simultane-
ous inference for a parameter array formed by multiple parameter vectors, in
which each parameter vector consists of multiple parameters. The extension
from vector-wise multiple learning to array-wise multiple learning is analogous
to the extension from univariate inference to multivariate statistical inference.
We propose a stepwise confidence region method for inference on parameter
arrays. More details on materials discussed in this section can be found in Ma
et al. (2023)[82].
Let Hi , i = 1, ..., k be k hypotheses with null hypothesis H0i : θi ∈ Θi
versus alternative hypothesis H1i : θi ∈ Θci , where θi = (ηi1 , ..., ηini ) is a
vector of ni parameters. Suppose for each hypothesis Hi , we have a test φi .
Let ci be the weight of hypothesis Hi and P̂i the corresponding p-value.
Similar to the previous subsection, define an S-value as a function of the weighted p-value as follows:

S_i = P̂_i / c_i,

and let S_{(1)} ≤ ... ≤ S_{(k)} be the ordered S-values:

S_{(i)} = P̂_i^* / c_i^*,

where P̂_i^* and c_i^* are, respectively, the p-value and weight corresponding to the ordered S-value S_{(i)}, with the corresponding parameter θ_i^* denoted as θ_{(i)} in the sequel.
Also, denote by l_i, i = 1, ..., k, the dimension of the parameter space corresponding to S_{(i)}. For instance, if S_4 ranks second among all the S-values, so that S_{(2)} = S_4, then n_4 = l_2. For notational convenience, define the cumulative weight

w_i = ∑_{j=k−i+1}^{k} c_j^*,

or equivalently,

w_{k−i+1} = ∑_{j=i}^{k} c_j^*.

In testing the multivariate parameter array θ1 , ..., θk , we focus on detecting


the significance of the vector θi (indicating the function of the ith subsystem)
in each hypothesis. For convenience, we confine our investigation to the case
where the significance of the parameter vector is claimed when at least one of
its components is significant (a scenario for the functioning of subsystems in
a parallel design).
For a multiple testing array problem H_{0i}: θ_i ∈ Θ_i versus H_{1i}: θ_i ∈ Θ_i^c, i = 1, ..., k, assume that there exist inverted confidence sets D_1^*(Y), ..., D_k^*(Y), which are directed toward the alternative spaces Θ_1^c, ..., Θ_k^c, respectively. When screening up from S_{(1)} to S_{(k)}, define m as the smallest integer i, 1 ≤ i ≤ k, such that S_{(i)} ≥ α/w_{k−i+1}. Also, let Θ_{(i)}^c be the alternative space associated with the i-th ordered weighted S-value S_{(i)}.

Theorem 11.6 With the above setting, we have P(θ ∈ L(Y, m)) ≥ 1 − α, where

L(Y, m) =
  D_{(1)}^*(Y) × R^{l_2} × ⋯ × R^{l_k},  when m = 1;
  Θ_{(1)}^c × ⋯ × Θ_{(m−1)}^c × D_{(m)}^*(Y) × R^{l_{m+1}} × ⋯ × R^{l_k},  when 2 ≤ m ≤ k − 1;
  Θ_{(1)}^c × ⋯ × Θ_{(k)}^c,  when m = k.

As discussed in Ma et al. (2023), the above theorem leads to the following


operating algorithm.
Algorithm: Weighted Confidence Procedure on Parameter Array

Step 1: If S_{(1)} < α/w_k, then assert θ_{(1)} ∈ Θ_{(1)}^c and go to Step 2; else, assert θ_{(1)} ∈ {θ : P̂_1^* ≥ c_1^* α/w_k}, and stop.

Step 2: If S_{(2)} < α/w_{k−1}, then assert θ_{(2)} ∈ Θ_{(2)}^c and go to Step 3; else, assert θ_{(2)} ∈ {θ : P̂_2^* ≥ c_2^* α/w_{k−1}}, and stop.

...

Step k: If S_{(k)} < α/w_1, then assert θ_{(k)} ∈ Θ_{(k)}^c and stop; else, assert θ_{(k)} ∈ {θ : P̂_k^* ≥ α}, and stop.
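A minimal Python sketch of the array procedure follows. How the component p-values within a subsystem are combined into a vector-level p-value is not fixed by the theorem itself, so the Bonferroni-style minimum used here (matching the "at least one significant component" convention) is our own illustrative assumption.

```python
import numpy as np

def array_procedure(component_pvalues, c, alpha=0.05):
    """component_pvalues: one sequence of p-values per hypothesis H_i;
    a subsystem counts as significant when at least one component is."""
    # Vector-level p-value via a Bonferroni-corrected minimum (assumption).
    p = np.array([min(1.0, len(pv) * float(np.min(pv)))
                  for pv in component_pvalues])
    c = np.asarray(c, float)
    s = p / c
    order = np.argsort(s)
    asserted = []
    for step, idx in enumerate(order):
        w = c[order[step:]].sum()
        if s[idx] < alpha / w:
            asserted.append(idx)   # assert theta_(step+1) in Theta^c
        else:
            break                  # stop; remaining parameters get D*(Y)
    return asserted
```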

The problem described above can be symbolically formulated as follows.


Assume that we are interested in making inference for k parameter vectors θ1 ,
..., θk simultaneously, where for the ith vector, i = 1, ..., k, the testing problem
is
H_{0i}: θ_i ∈ Θ_i versus H_{1i}: θ_i ∈ Θ_i^c,
where
Θ_i = Θ_{i1} × ⋯ × Θ_{in_i} ⊆ R^{n_i}.
It should be noted that this section is designed for simultaneous inference on parameter arrays. It is fundamentally different from the previous subsection, which focuses on multiple learning of parameter vectors; the intrinsic connection within each parameter vector necessitates a new multivariate version of the confidence region in this section. We shall now use a dataset to illustrate the use of the above simultaneous confidence method.
The dataset was originally discussed in Cortez et al. (2009) [38], with two sub-datasets containing wine information (a quality index and physicochemical variables) for red and white variants, respectively, of the Portuguese "Vinho Verde" wine. It consists of the following physicochemical variables: fixed acidity; volatile acidity; citric acid; residual sugar; chlorides; free sulfur dioxide; total sulfur dioxide; density; pH level; sulfates; and alcohol. It also contains the quality index based on sensory evaluation of the wine samples.
Considering the intertwining relationships and the chemical nature of the 11 physicochemical variables, we partition the variables into three categories. Category I is about acidity, which includes "fixed-acidity", "volatile-acidity", "citric-acid", and "pH-level". Category II focuses on sulfur dioxide, which includes "free sulfur-dioxide" and "total sulfur-dioxide". Category III contains all other chemical components that have a great impact on the density of wine; this includes "residual-sugar", "chlorides", "sulfate", "alcohol", and "density" itself.

Example 11.12 For the datasets on red wine and white wine described above,
we are interested in learning wine features that significantly affect the wine
quality index in tasting.

Next, we used wine quality as a guideline to determine the standard/baseline group and the comparison groups. The baseline (denoted as group 0) is chosen to be wine variants with quality 3 or 4 (n_red = 63, n_white = 183), while the four treatment groups are those with quality 5 (denoted as group 1, n_red = 679, n_white = 1457), 6 (denoted as group 2, n_red = 638, n_white = 2195), 7 (denoted as group 3, n_red = 199, n_white = 880), and 8 or higher (denoted as group 4, n_red = 18, n_white = 180).
Since the p-values associated with Categories II and III are larger than the cutoff values, these two categories are not statistically significant for wine quality, and we only report the analysis results for Category I here. The inference problems of interest are as follows.
Denote by M_i the median acidity level of group i, for i = 0, 1, 2, 3, 4.

H10 : M1 = M0 vs H11 : M1 > M0


H20 : M2 = M0 vs H21 : M2 > M0
H30 : M3 = M0 vs H31 : M3 > M0
H40 : M4 = M0 vs H41 : M4 > M0
For instance, the first null hypothesis H10 states that the median acidity
level of wine with quality score 5 equals the median acidity level of wine with
baseline quality, and the alternative hypothesis H11 states that the median
acidity level of wine with quality score 5 is higher than the median acidity
level of wine with baseline quality.
For illustration purposes, we consider two weight vectors for each category: the equal weight w_1 = (1, 1, 1, 1), and the weights emphasizing high-quality wine w_2 = (1, 1, 1, 100). More discussion on the wine quality information can be found in Cortez et al. (2009) [38].
The significance level in Table 11.3 is set to α = 0.05.

TABLE 11.3
Analytic results of Category I for red wine.

unweighted inference
Quality S-value Weight α∗ Conclusion
Score = 7 4.69E-5 1 0.0125 Claim H31
Score ≥ 8 0.02164 1 0.01667 Fail to reject H40
Score = 6 0.02463 1 0.025 No conclusion
Score = 5 0.03974 1 0.05 No conclusion
weighted inference
Quality S-value Weight α∗ Conclusion
Score = 7 4.69E-5 1 0.000485 Claim H31
Score ≥ 8 2.164E-4 100 0.00049 Claim H41
Score = 6 0.02463 1 0.025 Claim H21
Score = 5 0.03974 1 0.05 Claim H11

With the above setting, we run the proposed procedure on all three cate-
gories with different types of wines to investigate the impact of different groups
of physicochemical variables on wine quality. Since the distribution of the data
set is skewed, we use the Wilcoxon rank-sum test to compute the individual
p-values for each test. As shown in Theorem 11.6, the procedure is valid for
any test statistic suitable for the distribution of the data. The vector-wise p-values, S-values, and adjusted significance levels α* are reported in Table 11.3.
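A sketch of this computation for the red-wine acidity comparison is given below, using scipy's mannwhitneyu for the one-sided Wilcoxon rank-sum test. The file name winequality-red.csv, the semicolon separator, and the use of the single column "fixed acidity" as a stand-in for the Category I comparison are assumptions of the sketch.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

red = pd.read_csv("winequality-red.csv", sep=";")   # Cortez et al. (2009)
baseline = red.loc[red["quality"].isin([3, 4]), "fixed acidity"]

weights = {5: 1, 6: 1, 7: 1, 8: 100}        # the w2 weighting scheme
s_values = {}
for score, grp in [(5, [5]), (6, [6]), (7, [7]), (8, [8, 9])]:
    treat = red.loc[red["quality"].isin(grp), "fixed acidity"]
    # One-sided rank-sum test: H1 says the treatment median is higher.
    _, p = mannwhitneyu(treat, baseline, alternative="greater")
    s_values[score] = p / weights[score]

print(sorted(s_values.items(), key=lambda kv: kv[1]))  # screening order
```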
Without weighting adjustment for red wine, the results summarized in Table 11.3 show the impact of acidity on the taste of red wine. With equal weights, the algorithm stops at the second step, concluding that acidity is significant in differentiating moderately high-quality wine (Score = 7) from the baseline, but not significant for the rest. However, the inference result may be confounded by the unbalanced sample sizes between the group with score 7 (n = 199) and the group with scores 8 or higher (n = 18). Also, such a conclusion is not coherent in terms of asserting the impact of acidity changes on taste quality: it makes more sense to claim a significant impact for wines with higher scores before claiming a significant impact for wines with lower scores. Since we are interested in seeking factors that are important for high-quality wine, we consider a weight of 100 for the group in which the wine taste quality score is 8 or higher. With the weight vector w_2, we are able to reach a conclusion that is consistent and coherent throughout the comparison groups. The new weighting scheme leads to the conclusion that acidity is significant in distinguishing red wine taste quality.
Now, consider Category I for white wine. As in the analysis for red wine, the impacts of Categories II and III are not statistically significant, so we only report the analysis on Category I here.
Comparing the results from the unweighted and weighted procedures in Table 11.4, we can see that different weights may lead to a different testing order.

TABLE 11.4
Analytic results of Category I for white wine.

unweighted inference
Quality S-value Weight α∗ Conclusion
Score = 7 2.24E-4 1 0.00125 Claim H31
Score = 6 1.60E-3 1 0.00167 Claim H21
Score ≥ 8 2.84E-3 1 0.0025 Fail to reject H40
Score = 5 7.59E-3 1 0.005 No conclusion
weighted inference
Quality S-value Weight α∗ Conclusion
Score ≥ 8 2.84E-5 100 4.85E-5 Claim H41
Score = 7 2.24E-4 1 0.00167 Claim H31
Score = 6 1.60E-3 1 0.0025 Claim H21
Score = 5 7.59E-3 1 0.005 Fail to reject H10

For illustration purposes, we chose a more precise significance level of α = 0.005.
The equal weighting scheme results in an incoherent conclusion in which a relatively lower quality wine (Score = 7) is significant but a higher quality wine (Score ≥ 8) is not. Comparatively, the weighted procedure in Table 11.4 leads to a more persuasive and interpretable assertion that acidity is a significant factor in distinguishing white wine quality, because we have sufficient evidence to reject the null hypotheses.
Practically, the impact of acidity on wine taste is complex. However, industrial evidence shows that, within a reasonable range, a relatively higher acidity level makes both red and white wine taste fresher than a low acidity level (see, for example, Plane et al. 1980 [94]). Moreover, sweetness may balance the taste of acidity, and consequently the combined taste may be richer. This agrees with the inference conclusions derived from the weighted confidence procedure, as stated in Tables 11.3 and 11.4.
Inference in the above-mentioned settings calls for further developments of weighted inference for the underlying parameter arrays. While we normally use multiple testing procedures for comparisons of more than one population, it should be noted that the method of simultaneous confidence sets plays a more effective role in estimating the unknown sets of parameters. There are distinct differences between the confidence set method and multiple testing procedures (such as the closed testing principle). The closed testing principle essentially draws conclusions on multiple hypotheses by controlling the family-wise error rate. In the literature, various multiple testing procedures have been proposed with different criteria for controlling the error rate, such as the control of the false discovery rate; see, for example, Hochberg and Tamhane (1987) [58]. The simultaneous confidence set, on the other hand, provides an overall estimation for all the parameters under investigation. Usually, a multiple testing procedure is unable to provide parameter estimation for follow-up investigations. However, simultaneous confidence sets are able to strongly
control the overall error rate on testing the associated hypotheses. Compared
with test-based inference methods, the method of simultaneous confidence set
is more straightforward, interpretable, and reliable. More discussions on this
regard for clinical trial studies can be found, for example, in papers of Chen
(2008a)[20] and Chen (2008b) [21], or in the book of Chen (2014) [22].

SUMMARY Previous chapters basically deal with learning from a dataset representing one population. However, in real-life situations, we are often confronted with scenarios where the data for statistical learning and prediction come from multiple data sources or phases. This results in multiple error rates at different stages or from different data sources. Ignoring the intertwining relationships among the false positive or false negative rates may lead to seriously misleading conclusions, partly due to accumulating error rates from the different data learning resources. This chapter is thus motivated to synthesize published methodologies and algorithms for handling multiple data resources.
We start with methods in sequential analysis, where it is desirable to control both the probability of a type-I error and the probability of a type-II error. We elucidate Wald's sequential likelihood ratio test for inference decisions, and the two-stage estimation approach for additional observations in the sequential sample. Sequential analysis appears in various learning phases, such as with customer record data and medical history data, to list just a few.
Besides sequential data coming in multiple phases, we more often have data coming from multiple sources, or we are interested in learning multiple pieces of information at the same time. This directly relates to the vast literature on multiple comparisons and simultaneous inference. In this chapter, we give a glimpse into the field by introducing two key directions. The first is the prediction of dose-response curves in efficacy studies. This includes a step-up confidence procedure and the method of simultaneous confidence bands, both of which control the overall coverage probability with proper adjustments for multiplicity. Further information can be found in the papers [27], [26], and [105].
The second direction, on recent developments in simultaneous inference, focuses on testing weighted hypotheses. This challenges conventional methods that implausibly assume equal weights for all populations involved, extending the conventional setting of equal-weight inference to unequal weights according to related information on the populations. Two confidence methods are synthesized in this chapter. One is the method of weighted confidence regions, which improves the prediction of genes significantly associated with breast cancer by including biological information on gene expression data. The second approach directly extends vector-wise multiple inference to array-wise confidence procedures, as in Ma et al. (2023, [82]).
Since it is not always plausible to assume that all the hypotheses in a multiple testing problem should be treated with the same weight, the two methods discussed in the last section of the book shed light on statistical learning for data from multiple sources.
Bibliography

[1] H. Ajhajala, M. Hendricks-Jones, J. Shawver, A. Amllay, J. T. Chen,


and M. Hajjar et al. Expansion of telestroke coverage in community
hospitals: Unifying stroke care and reducing transfer rate. Annals of
Neurology, 12:1–7, 2023.

[2] R. Amorim and C. Hennig. Recovering the number of clusters in data


sets with noise features using feature rescaling factors. Information
Sciences, 324:126–145, 2015.

[3] R. Amorim and B. Mirkin. Minkowski metric, feature weighting and


anomalous cluster initializing in K-means clustering. Pattern Recogni-
tion, 45:1061–1075, 2012.

[4] F. Aziz, J. Chen, and A. Comerota. Catheter-directed thrombolysis of


iliofemoral deep vein thrombosis reduces deep vein thrombosis recur-
rence. Journal of Vascular Surgery, 53:252–254, 2011.

[5] G. Bell, T. Hey, and A. Szalay. COMPUTER SCIENCE: Beyond the


data deluge. Science, 323:1297–1298, 2009.
[6] P. Bühlmann and B. Yu. Analyzing bagging. The Annals of Statistics, 30:927–961, 2002.

[7] P. Bühlmann and B. Yu. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98:324–339, 2003.

[8] H. Block and J. T. Chen. Multivariate product-type lower bounds.


Journal of Applied Probability, 38:407–420, 2001.

[9] L. Breiman. Bagging predictors. Machine Learning Journal, 26:123–


140, 1996.

[10] L. Breiman. The heuristics of instability in model selection. Ann.


Statist., 24:2350–2381, 1996.

[11] L. Breiman. Arcing classifiers (Discussion paper). Machine Learning


Journal, 26:801–824, 1998.

[12] L. Breiman. Random forests. Machine Learning J., 45:5–32, 2001.


[13] L. Breiman. Statistical modeling: The two cultures. Statistical Science,


16(3):199–231, 2001.
[14] L. Breiman and J. Friedman. Estimating optimal transformations in
multiple regression and correlation. J Amer. Statist. Assoc., 80:580–
619, 1985.
[15] D. Brent, M. Baugher, J. Bridge, J. T. Chen, and L. Chiappetta. Age
and sex-related risk factors for adolescent suicide. Journal of the Amer-
ican Academy of Child and Adolescent Psychiatry, 38(12):1497–1505,
1999.
[16] G. Casella and R. Berger. Statistical Inference. Brooks/Cole Cengage
Learning, Belmont, CA, USA, 2002.
[17] J. T. Chen. Optimal lower bounds for bivariate probabilities. Advances
in Applied Probability, 30:476–492, 1998.
[18] J. T. Chen. Letter to the Editor: On the estimation of attributable risk
in case-control studies. Statistics in Medicine, 20:979–982, 2001.
[19] J. T. Chen. A lower bound using Hamilton-type circuit and its applica-
tions. Journal of Applied Probability, 40:1121–1132, 2003.
[20] J. T. Chen. A two-stage estimation procedure. Biometrics, 64:406–412,
2008a.
[21] J. T. Chen. Inference on the minimum effective dose using binary data.
Communication in Statistics, 38:2124–2135, 2008b.
[22] J. T. Chen. Multivariate Bonferroni-Type Inequalities, Theory and Ap-
plications. Chapman and Hall, CRC, Boca Raton, 2014.
[23] J. T. Chen. A nonparametric coherent confidence procedure. Commu-
nications in Statistics – Theory and Methods, 45:3397–3409, 2016.
[24] J. T. Chen and A. J. Comerota. Detecting the association between
residual thrombus and post-thrombotic classification of chronic venous
disease with range regression. Reviews on Recent Clinical Trials, 7:329–
334, 2012.
[25] J. T. Chen, A. K. Gupta, and C. Troskie. Distribution of stock returns
when the market is up (down). Communications in Statistics-Theory
and Methods, 32:1541–1558, 2003.
[26] J. T. Chen and F. Hoppe. Simultaneous confidence intervals and regions.
Wiley StatsRef Statistics Reference Online ISBN 9781118445112, John
Wiley and Sons:1–10, 2017.
[27] J. T. Chen and F. M. Hoppe. Simultaneous confidence intervals. The
Encyclopedia of Biostatistics, 5:4114–4116, 1998.

[28] J. T. Chen and F. M. Hoppe. A connection between successive compar-


isons and ranking procedures. Statistics and Probability Letters, 67:19–
25, 2004.

[29] J. T. Chen, F. M. Hoppe, S. Iyengar, and D. Brent. A hybrid logistic


regression model for case-control studies. Methodology and Computing
in Applied Probability, 5:419–426, 2003.

[30] J. T. Chen, S. Iyengar, and D. Brent. Constraint estimation for the pop-
ulation attributable risk. Journal of Applied Probability and Statistics,
2:251–265, 2007.

[31] J. T. Chen and E. Seneta. A note on bivariate Dawson-Sankoff-type


bounds. Statistics and Probability Letters, 24:99–104, 1995.
[32] J. T. Chen and E. Seneta. Multivariate Bonferroni-type lower bounds.
Journal of Applied Probability, 33:729–740, 1996.

[33] J. T. Chen and E. Seneta. A refinement of multivariate Bonferroni-type


inequalities. Journal of Applied Probability, 37:276–282, 2000.

[34] J. T. Chen and E. Seneta. A Fréchet-optimal strengthening of the


Dawson-Sankoff lower bound. Methodology and Computing in Applied
Probability, 8(2):255–264, 2006.

[35] J. T. Chen, E. Walsh, and A. Comerota et al. A new multiple test ap-
proach for nursing care administration of deep vein thrombosis patients.
Istanbul University Journal of the School of Business Administration,
1:22–34, 2011.
[36] L. Chen and J. Chen. Refined Machine Learning Approaches for Mask
Policy Analysis, Book chapter edited by E. Cetin and H. Ozen, Springer
Nature, Singapore . Health-care Policy, Innovation and Digitalization,
10:197–211, 2024.

[37] A. J. Comerota, N. Grewal, J. T. Martinez, and J. T. Chen et al.


Postthrombotic morbidity correlates with residual thrombus follow-
ing catheter-directed thrombolysis for iliofemoral deep vein thrombosis.
Journal of Vascular Surgery, 55:768 – 773, 2012.

[38] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Model-


ing wine preferences by data mining from physicochemical properties.
Decision Support Systems, 47:547–553, 2009.

[39] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector


Machines. Cambridge Univ. Press., 2000.

[40] C. Daniel and F. Wood. Fitting Equations to Data. Wiley, New York.,
1971.

[41] A. Davison and D. Hinkley. Bootstrap Methods and Their Applications.


Cambridge University Press, 2006.

[42] A. Dempster. Logicist statistic 1. Models and modeling. Statistical


Science, 13:248–276, 1998.

[43] V. Dhar. Data science and prediction. Communications of the ACM,


12(56):64–73, 2013.

[44] P. Diaconis and B. Efron. Computer intensive methods in statistics.


Scientific American, 248:116–130, 1983.

[45] C. Dunnett. A multiple comparison procedure for comparing several


treatments with a control. Journal of American Statistical Association,
50:1096–1121, 1955.

[46] J. Eeckhoute, E. Keeton, and M. Lupien et al. Positive cross-regulatory


loop ties GATA-3 to estrogen receptor α expression in breast cancer.
Cancer Research, 67:6477–6483, 2007.
[47] B. Efron. Bootstrap methods, another look at the Jackknife. The Annals
of Statistics, 7:1–26, 1979.
[48] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman
and Hall/CRC, 1993.
[49] E. Forgy. Cluster analysis of multivariate data: efficiency vs inter-
pretability of classifications. Biometrics, 21:768–780, 1965.
[50] J. Galambos and I. Simonelli. Characterizations of probability distribu-
tions by properties of products of random variables. Journal of Applied
Statistical Science, 13:1–10, 2004.

[51] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT press,


2016.

[52] A. Gupta and J. T. Chen. Goodness of fit test for the skew-normal


distribution. Communication in Statistics, 30(4):907–930, 2001.

[53] A. Gupta and J. T. Chen. A class of multivariate skew-normal models.


The Annals of the Institute of Statistical Mathematics, 56(2):305–315,
2004.

[54] M.H. Hansen and B. Yu. Model selection and the principle of mini-
mum description length. Journal of the American Statistical Associa-
tion, 96:746–774, 2002.

[55] J. Hartigan and M. Wong. A K-means clustering algorithm. Journal


of the Royal Statistical Society Series C: Applied Statistics, 28:100–108,
1979.

[56] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical


Learning. Springer, New York, 2009.
[57] I. Hedenfalk, D. Duggan, and Y. Chen et al. Gene-expression profiles
in hereditary breast cancer. Journal of Medicine, 344:539–548, 2001.
[58] Y. Hochberg and A. C. Tamhane. Multiple Comparison Procedures.
Wiley, New York, 1987.
[59] M. Hollander and D. Wolfe. Nonparametric Statistical Methods. John
Wiley and Sons, Inc, New York, second edition, 1999.
[60] S. Holm. A simple sequentially rejective multiple test procedure. Scan-
dinavian Journal of Statistics, 6:65–70, 1979.
[61] F. M. Hoppe. Multiple Comparisons, Selection Procedures and Applica-
tions in Biometry. Dekker, New York, 1993.
[62] F. M. Hoppe. Improving probability bounds by optimization over sub-
sets. Discrete Math., 306(5):526–530, 2006.
[63] F. M. Hoppe. The effect of redundancy on probability bounds. Discrete
Math., 309(1):123–127, 2009.
[64] F. M. Hoppe and E. Seneta. A Bonferroni-type identity and permutation
bounds. International Statistical Review, 58:253–361, 1990.
[65] F. M. Hoppe and E. Seneta. Gumbel’s identity, binomial moments, and
Bonferroni sums. International Statistical Review, 80:269–292, 2012.
[66] H. Hotelling. Analysis of a complex of statistical variables into principal
components. Journal of Educational Psychology, 24:417–441, 1933.
[67] J. C. Hsu. Multiple Comparisons: Theory and Methods. London, Chap-
man and Hall/CRC, 1996.
[68] J. C. Hsu and R. Berger. Stepwise confidence intervals without mul-
tiplicity adjustment for dose-response and toxicity studies. Journal of
American Statistical Association, 94:468 – 482, 1999.
[69] Y. Huang, H. Xu, V. Calian, and J. Hsu. To permute or not to permute.
Bioinformatics, 22:2244–2248, 2006.
[70] S. Iyengar. On a lower bound for the multivariate normal Mill’s ratio.
Annals of Probability, 4:1399–1403, 1986.
[71] S. Iyengar. Evaluation of normal probabilities of symmetric regions.
SIAM Journal on Scientific and Statistical Computing, 3:418–423, 1988.
[72] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to
Statistical Learning. Springer-Science and Business Media, LLC, New
York, 2013.

[73] E. Jauffret, F. Monville, F. Bertucci, and B. Esterni et al. Moesin Ex-


pression is A Marker of Basal Breast Carcinomas. International Journal
of Cancer, 121:1779–1785, 2007.

[74] E. Kaizar, Y. Li, and J. Hsu. Permutation Multiple Tests of Binary Fea-
tures Do Not Uniformly Control Error Rates. Journal of the American
Statistical Association, 106:1067–1074, 2011.

[75] L. Kerns and J. T. Chen. A note on range regression analysis. Journal


of Applied Probability and Statistics, 11:19–27, 2016.

[76] L. Kerns and J. T. Chen. Simultaneous confidence bands for restricted


logistic regression models. Journal of Applied Statistics, 44:2036–2051,
2017.
[77] T. Lai. Sequential analysis: Some classical problems and new challenges.
Statistica Sinica, 11:303–408, 2001.

[78] Y. LeCun, Y. Bottou, Y. Bengio, and P. Haffner. Gradient-based learn-


ing applied to document recognition. Proceedings of the IEEE, 86:2278–
2324, 1998.

[79] C. Lee. Self-Optimizing Networks. Princeton University senior thesis,


2017.

[80] E. Lehmann and G. Casella. Theory of Point Estimation. Springer-


Verlag Inc, New York, 1998.
[81] E. Lehmann and J. Romano. Testing Statistical Hypotheses. Springer-
Science and Business Media, LLC, 2005.

[82] Y. Ma, A. Yeh, and J. T. Chen. Simultaneous confidence regions and


weighted hypotheses on parameter arrays. Methodology and Computing
in Applied Probability, 25:1–18, 2023.
[83] J. MacQueen. Some methods for classification and analysis of multivari-
ate observations. Berkeley Symp. on Math. Statist. and Prob., 5:281–297,
1965.
[84] P. McCullagh and J. Nelder. Generalized Linear Models. Chapman and
Hall, London, 1989.

[85] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representa-


tions for high-dimensional data. The Annals of Statistics, 37:246–270,
2011.
[86] W. Meisel. Computer-Oriented Approaches to Pattern Recognition. Aca-
demic Press, New York., 1972.

[87] J. Mi and A. Sampson. A comparison of the Bonferroni and Scheffe


bounds. Journal Statistical Planning and Inference., 36(1):101–105,
1993.
[88] D. Michie, D. Spiegelhalter, and C. Taylor. Machine Learning, Neural
and Statistical Classification. Ellis Horwood, New York, 1994.
[89] T. F. Móri and G. J. Székely. On the Erdős-Rényi generalization of the
Borel-Cantelli lemma. Studia Sci. Math. Hungar, 18:173–182, 1983.
[90] T. F. Móri and G. J. Székely. A note on the background of several
Bonferroni-Galambos-type inequalities. Journal of Applied Probability,
22:836–843, 1985.
[91] T. T. Nguyen, J. T. Chen, A. K. Gupta, and K. T. Dinh. A proof
of the conjecture on positive skewness of generalized inverse Gaussian
distributions. Biometrika, 90:245–250, 2003.
[92] K. Pearson. On lines and planes of closest fit to systems of points in
space. Philosophical Magazine, 2:559–572, 1901.
[93] W. Piegorsch and G. Casella. Confidence bands for logistic regression with restricted predictor variables. Biometrics, 44:739–750, 1988.
[94] R. Plane, L. Mattick, and L. Weirs. An acidity index for the taste of
wines. Am J Enol Citic, 31:265–268, 1980.
[95] R. Redfern, J. Chen, and S. Sibrel. Effect of thermomechanical stimu-
lation during vaccination on anxiety, pain, and satisfaction in pediatric
patients: A randomized controlled trial. Journal of Pediatric Nursing,
38:1–7, 2018.
[96] R. Redfern, K. Fleming, R. March, N. Bobulski, M. Kuehne, J. T. Chen,
and M. Moront. Thrombelastography-directed transfusion in cardiac
surgery: Impact on postoperative outcomes. The Annals of Thoracic
Surgery, 107:1313–1318, 2019.
[97] R. Redfern, J. Micham, S. Seegert, and J. Chen. A buzzy approach
to ease the discomfort of a needle stick – A prospective, randomized
controlled trial. Pain Management Nursing, 20:164–169, 2019.
[98] R. Redfern, J. Micham, D. Sievert, and J. Chen. Effects of thermome-
chanical stimulation during intravenous catheter insertion in adults: A
prospective randomized study. Journal of Infusion Nursing, 41:294–300,
2018.
[99] R. Redfern, C. Rueta, S. O’Drobinak, J. Chen, and K. Beer. Closed
incision negative pressure therapy effects on postoperative infection and
surgical site complication after total hip and knee arthroplasty. Journal
of Arthroplasty, 32:3333–3339, 2018.

[100] E. Seneta. On a genetic inequality. Biometrics, 29:810–814, 1973.

[101] E. Seneta. An inequality from genetics. Advances in Applied Probability,


18(3):860–861, 1986.

[102] E. Seneta. On the history of the strong law of large numbers and Boole’s
inequality. Historia Mathematica, 19(1):24–39, 1992.

[103] E. Seneta. Probability inequalities and Dunnett’s test. Book chapter


edited by F. M. Hoppe, Dekker, New York. Multiple Comparisons, Se-
lection, and Applications in Biometry, 134:29–35, 1993.

[104] E. Seneta and J. T. Chen. Frechét optimality of upper bivariate


Bonferroni-type bounds. Theory of Probability and Mathematical Statis-
tics, 52:147–152, 1996.

[105] E. Seneta and J. T. Chen. A sequentially rejective test procedure. The-


ory of Stochastic Processes, 3(19)(3-4):393–402, 1997.

[106] E. Seneta and J. T. Chen. Multivariate Sobel-Uppuluri-Galambos type


bounds. Ukrainian Mathematical Journal, 52(9):1283–1293, 2000.

[107] E. Seneta and J. T. Chen. On explicit and Fréchet optimal lower bounds.
Journal of Applied Probability, 39:81–90, 2002.

[108] E. Seneta and J. T. Chen. A generator for explicit univariate lower


bounds. Statistics and Probability Letters., 75:256–266, 2005.

[109] E. Seneta and J. T. Chen. Simple stepwise tests of hypotheses and mul-
tiple comparisons. International Statistical Review, 73(1):21–34, 2005.

[110] J. Shao. Mathematical Statistics. Springer. 2nd edition., 2003.

[111] R. Shermis, R. Redfern, J. Bazydlo, G. Naimy, H. Kudrolli, and J. T.


Chen. Performance of molecular breast imaging as an adjunct diagnostic
tool. University of Toledo Journal of Medical Sciences, 6:15–19, 2019.

[112] P. Sprent and N. Smeeton. Applied Nonparametric Statistical Methods,


3rd edition. Chapman & Hall/CRC, 2001.

[113] M. Stone. Cross-validatory choice and assessment of statistical predic-


tions. J Roy. Statist. Soc. B, 36:111–147, 1974.

[114] J. Storey and R. Tibshirani. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 27:4914–4927, 2003.

[115] A. Tamhane and L. Liu. On Weighted Hochberg Procedures.


Biometrika, 95:279–294, 2008.

[116] J. Tank, G. Georgiadis, and J. Bair et al. Does the use of ethyl chlo-
ride improve patient-reported pain scores with negative pressure wound
therapy dressing changes? A prospective, randomized controlled trial.
Journal of Trauma and Acute Care Surgery, 6:1061–1066, 2021.

[117] P. Thall, C. Logothetis, L. Pagliaro, S. Wen, M. Brown, D. Williams,


and R. Millikan. Adaptive therapy for androgen independent prostate
cancer: A randomized selection trial including four regimens. Journal
of the National Cancer Institute, 99:1613–1622, 2007.
[118] J. Tukey. The future of data analysis. The Annals of Mathematical
Statistics, 33:1–67, 1962.
[119] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New
York, 1995.
[120] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[121] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia,


1990.

[122] L. Wang, Y. Lin, and J. Chen. Simultaneous inference for treatment


regimes. Communications in Statistics – Theory and Methods, 46:9679–
9690, 2017.

[123] L. Wang, A. Rotnitzky, X. Lin, R. Millikan, and P. Thall. Evaluation


of viable dynamic treatment regimes in a sequentially randomized trial
of advanced prostate cancer. Journal of the American Statistical Asso-
ciation, 46:493–508, 2012.
[124] F. Weaver, A. Comerota, M. Youngblood, and J. Froechich. Surgical
revascularization versus thrombolysis for nonembolic lower extremity
native artery occlusions: Results of a prospective randomized trial. Jour-
nal of Vascular Surgery, 24:513–523, 1996.

[125] C. Wu. Future directions of statistical research in China: A historical


perspective. Application of Statistics and Management, 1:1–7, 1986.
[126] P. Xie and T. Chen. A new proof of the positive definiteness on the
sample covariance matrix. Acta Sci. Natur. Univ. Sunyatseni, 27:113 –
114, 1988.

[127] P. Xie and T. Chen. Positive definiteness of covariance matrices of


continuous sample. Acta Sci. Natur. Univ. Sunyatseni, 29:102 – 104,
1990.

[128] Y. Yu, J.T. Chen, and B. Yeh. Weighted step-down confidence proce-
dures with applications to gene expression data. Communications in
Statistics – Theory and Methods, 51:2343–2356, 2022.
[129] Y. Zhang, J. T. Chen, S. Wang, C. Andrews, and A. Levy. How do
consumers use nutrition labels on food products in the United States?
Topics in Clinical Nutrition, 32:161–171, 2017.

[130] P. Zhao and B. Yu. On model selection consistency of Lasso. The


Journal of Machine Learning Research, 7:2541–2563, 2006.