LectureNotes PatternRecognition

References:
• Hastie, Tibshirani, and Friedman, “The Elements of Statistical Learning”, 2nd edition, Springer.
• Duda, Hart, and Stork, “Pattern Classification”, 2nd edition, Wiley.
• Bishop, “Pattern Recognition and Machine Learning”, Springer.
• Building intuition.
• Developing computer practice to build and assess models on real- and simulated-datasets.
Introduction
Statistical learning plays a key role in many areas of science, finance and industry. Here are some examples
of learning problems:
• Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The
prediction is to be based on demographic, diet and clinical measurements for that patient.
• Predict the price of a stock in 6 months from now, on the basis of company performance measures and
economic data.
• Estimate the amount of glucose in the blood of a diabetic person, from the infrared absorption spec-
trum of that person’s blood.
• Identify the risk factors for prostate cancer, based on clinical and demographic variables.
Unsupervised Learning We observe only the features and have no outcome. We need to cluster data or
organize it.
• 4601 email messages, used to predict whether each email is junk or not. The true outcome (email or spam) is available. This is also called a classification problem (as will be explained later).
• 97 men (observations): predict the log of Prostate Specific Antigen (lpsa) from a number of measurements including log cancer volume (lcavol).
• This is a regression problem (of course supervised).
• The data comes from handwritten ZIP codes on envelopes from U.S. postal mail.
• The images are 16 × 16 eight-bit grayscale maps, with each pixel ranging from 0 to 255.
• The task is to predict (classify) each image, from its 16 × 16 = 256 pixel features, as one of the digits.
• I’d like to see one of the projects study this dataset and apply a NN to it.
• The figure: an experiment of 6830 genes (rows; only 100 of them are displayed for clarity) and 64 samples (columns). The samples are 64 cancer tumors from different patients.
• Do certain genes show very high (or low) expression for certain cancer samples?
• Regression: genes and kind of cancer (sample) are categorical predictors, and gene expression is the
response.
Qualitative (or Categorical), where no measures or metrics are associated; e.g., $X \in \{\text{Diseased}, \text{Nondiseased}\}$.
• The best $f$ for one loss need not be the best for another loss! In terms of squared error loss:
$$\begin{aligned}
\operatorname{E}_{Y|X}\left[(Y - f(x))^2\right]
&= \operatorname{E}_{Y|X}\left[\left((Y - \operatorname{E}[Y|X=x]) + (\operatorname{E}[Y|X=x] - f(x))\right)^2\right]\\
&= \operatorname{E}_{Y|X}\left\{(Y - \operatorname{E}[Y|X=x])^2 + 2\,(Y - \operatorname{E}[Y|X=x])(\operatorname{E}[Y|X=x] - f(x)) + (\operatorname{E}[Y|X=x] - f(x))^2\right\}\\
&= \operatorname{E}_{Y|X}(Y - \operatorname{E}[Y|X=x])^2 + (\operatorname{E}[Y|X=x] - f(x))^2\\
&= \sigma^2_{Y|X} + (\operatorname{E}[Y|X=x] - f(x))^2,
\end{aligned}$$
so
$$f^*(x) = \arg\min_{f(X)} \operatorname{E}_{Y|X}\left(Y - f(x)\right)^2 = \operatorname{E}[Y|X=x]. \quad (2.1)$$
• So, it is impossible for any single rule (or algorithm) to be the best for all kinds of problems.
• We have to try!
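The decomposition above can be illustrated numerically: the conditional mean minimizes the squared-error risk, and any other rule pays exactly the squared deviation from it. A minimal sketch on simulated data (the sine model, the noise level, and the constant offset 0.3 are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate Y = sin(X) + noise; under squared error the best predictor is
# the conditional mean f*(x) = E[Y|X=x] = sin(x).
x = rng.uniform(-3, 3, 100_000)
y = np.sin(x) + rng.normal(0, 0.5, x.size)

mse_cond_mean = np.mean((y - np.sin(x)) ** 2)          # risk of f*(x) = E[Y|X=x]
mse_other = np.mean((y - (np.sin(x) + 0.3)) ** 2)      # any other rule does worse

# The gap is the squared deviation from the conditional mean: 0.3^2 = 0.09.
gap = mse_other - mse_cond_mean
```

Here `mse_cond_mean` is close to the irreducible risk $\sigma^2_{Y|X} = 0.25$, and the gap matches the second term of the decomposition.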
then
$$Y \sim N\left(\mu_Y, \Sigma_{YY}\right),$$
$$Y|X \sim N\left(\mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(x - \mu_X),\; \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\right).$$
Notice:
• $\operatorname{E}[Y|X]$ is a line in $p$ dimensions.
• for scalar $Y$ and $X$, $\operatorname{E}[Y|X] = \mu_Y + \frac{\sigma_{YX}}{\sigma_X^2}(x - \mu_X)$.
• the slope $\frac{\sigma_{YX}}{\sigma_X^2} = \frac{\rho\,\sigma_Y}{\sigma_X}$ of the regression line makes sense.
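The conditional-mean line can be checked against an empirical fit. A small sketch (the particular means, variances, and covariance below are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)

# A joint bivariate normal with chosen (illustrative) parameters.
mu_x, mu_y = 1.0, 2.0
var_x, var_y, cov_xy = 2.0, 3.0, 1.5

# E[Y|X=x] = mu_y + (cov_xy / var_x) * (x - mu_x): a line with slope cov_xy/var_x.
slope = cov_xy / var_x

def cond_mean(x):
    return mu_y + slope * (x - mu_x)

# Compare with an empirical least-squares line fitted to samples from the joint.
cov = np.array([[var_x, cov_xy], [cov_xy, var_y]])
xy = rng.multivariate_normal([mu_x, mu_y], cov, size=200_000)
b, a = np.polyfit(xy[:, 0], xy[:, 1], 1)   # empirical slope b and intercept a
```

The fitted slope `b` recovers $\sigma_{YX}/\sigma_X^2 = 0.75$ up to sampling noise.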
Minimize $R$ pointwise (i.e., minimize the conditional risk): at a particular $X = x$, calculate the conditional risk for each decision $g$ and choose the minimum; i.e.,
$$\widehat{G}(x) = \arg\min_{g\in\mathcal{G}} \sum_{k=1}^{K} L\left(G_k, g\right)\Pr\left(G_k|X=x\right).$$
For two classes,
$$\lambda(G_1) = L_{11}\Pr\left(G_1|X=x\right) + L_{21}\Pr\left(G_2|X=x\right),$$
$$\lambda(G_2) = L_{12}\Pr\left(G_1|X=x\right) + L_{22}\Pr\left(G_2|X=x\right),$$
$$\frac{\Pr\left(G_1|X=x\right)}{\Pr\left(G_2|X=x\right)} \;\overset{G_1}{\underset{G_2}{\gtrless}}\; \frac{L_{21}-L_{22}}{L_{12}-L_{11}},$$
$$\frac{\Pr\left(G_1|X=x\right)}{\Pr\left(G_2|X=x\right)} \;\overset{G_1}{\underset{G_2}{\gtrless}}\; \frac{L_{21}}{L_{12}}, \quad (L_{ii}=0 \text{ usually})$$
which makes a lot of sense, as we classify according to the maximum posterior (modified by the loss weights).
$$\frac{\Pr\left(X|G_1\right)\pi_1/P(X=x)}{\Pr\left(X|G_2\right)\pi_2/P(X=x)} \;\overset{G_1}{\underset{G_2}{\gtrless}}\; \frac{L_{21}}{L_{12}},$$
$$\frac{f_1(X)}{f_2(X)} \;\overset{G_1}{\underset{G_2}{\gtrless}}\; \frac{\pi_2 L_{21}}{\pi_1 L_{12}}, \quad \text{(LR)}$$
which also makes great sense. We classify as $G_1$ if its likelihood is larger; but if $G_2$ has higher prevalence (higher prior $\pi_2$) or a higher misclassification loss $L_{21}$, we raise the bar.
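The likelihood-ratio rule (LR) can be sketched for two univariate Gaussian classes; the class parameters, priors, and losses below are illustrative assumptions:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two univariate Gaussian classes (illustrative parameters).
mu1, mu2, sigma = 0.0, 2.0, 1.0
pi1, pi2 = 0.7, 0.3            # priors
L12, L21 = 1.0, 2.0            # L12: loss of missing G1; L21: loss of missing G2

def classify(x):
    # LR rule: decide G1 iff f1/f2 > (pi2 * L21) / (pi1 * L12).
    lr = normal_pdf(x, mu1, sigma) / normal_pdf(x, mu2, sigma)
    return 1 if lr > (pi2 * L21) / (pi1 * L12) else 2

decisions = [classify(x) for x in (-1.0, 0.5, 1.5, 3.0)]   # -> [1, 1, 2, 2]
```

For these parameters the decision point solves $e^{2-2x} = 6/7$, i.e., $x \approx 1.08$, so the first two test points go to $G_1$ and the last two to $G_2$.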
18 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
Usually, this rule is put in the form
$$\ln\frac{f_1(X)}{f_2(X)} \;\overset{G_1}{\underset{G_2}{\gtrless}}\; \ln\frac{\pi_2 L_{21}}{\pi_1 L_{12}}, \quad \text{(LLR)}$$
$$h(X) \;\overset{G_1}{\underset{G_2}{\gtrless}}\; th,$$
$$\eta^*(X) = \begin{cases} G_1 & h(X) > th\\ G_2 & h(X) < th \end{cases}$$
$$S^* = \{x : h(X) = th\},\qquad R_1^* = \{x : h(X) > th\},\qquad R_2^* = \{x : h(X) < th\}. \quad (2.4)$$
$$R = \operatorname{E}_X \underbrace{\sum_{k=1}^{K} L\left(G_k, g\right)\Pr\left(G_k|X\right)}_{\lambda(g)\ \text{Conditional Risk}}$$
$$\begin{aligned}
R^* &= \int_{R_1^*} \lambda(G_1)\,P(X)\,dX + \int_{R_2^*} \lambda(G_2)\,P(X)\,dX\\
&= \int_{R_1^*} L_{21}\Pr\left(G_2|X=x\right)P(X)\,dX + \int_{R_2^*} L_{12}\Pr\left(G_1|X=x\right)P(X)\,dX\\
&= L_{21}\pi_2 \underbrace{\int_{R_1^*} f\left(X|G_2\right)dX}_{\text{Error } e_{21}^* = \Pr[R_1^*|G_2]} + L_{12}\pi_1 \underbrace{\int_{R_2^*} f\left(X|G_1\right)dX}_{\text{Error } e_{12}^* = \Pr[R_2^*|G_1]}\\
&= L_{21}\pi_2 \Pr\left[x\in G_2 \text{ and decision is } G_1\right] + L_{12}\pi_1 \Pr\left[x\in G_1 \text{ and decision is } G_2\right],
\end{aligned}$$
which makes a lot of sense: each kind of error is an integration of the right class over the wrong decision region, then magnified by the loss and the prevalence of that class. Then, from (2.4),
$$x \in R_1^* \equiv th < h < \infty \;\Rightarrow\; \Pr\left[R_1^*\right] = \Pr[th < h < \infty],$$
$$x \in R_2^* \equiv -\infty < h < th \;\Rightarrow\; \Pr\left[R_2^*\right] = \Pr[-\infty < h < th],$$
$$R^* = L_{21}\pi_2 \underbrace{\int_{th}^{\infty} f_h\left(h|G_2\right)dh}_{\text{Error } e_{21}^*} + L_{12}\pi_1 \underbrace{\int_{-\infty}^{th} f_h\left(h|G_1\right)dh}_{\text{Error } e_{12}^*}.$$
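The two error integrals can be estimated by Monte Carlo as a sanity check. A sketch assuming $h(X)=X$ with $G_1 \sim N(2,1)$, $G_2 \sim N(0,1)$, equal priors, and unit losses (all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Score h(X) = X; decide G1 iff h > th (illustrative setup).
pi1 = pi2 = 0.5
L12 = L21 = 1.0
th = 1.0
n = 1_000_000

x1 = rng.normal(2.0, 1.0, n)    # class G1 samples
x2 = rng.normal(0.0, 1.0, n)    # class G2 samples

e12 = np.mean(x1 < th)          # Pr[R2* | G1]: G1 sample lands in G2's region
e21 = np.mean(x2 > th)          # Pr[R1* | G2]: G2 sample lands in G1's region
risk = L21 * pi2 * e21 + L12 * pi1 * e12
```

By symmetry both error integrals equal $1 - \Phi(1) \approx 0.1587$ here, so the estimated risk should be close to that value.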
$$f_1(x) = \frac{1}{\left((2\pi)^p|\Sigma_1|\right)^{1/2}}\, e^{-\frac{1}{2}(x-\mu_1)'\Sigma_1^{-1}(x-\mu_1)},$$
$$f_2(x) = \frac{1}{\left((2\pi)^p|\Sigma_2|\right)^{1/2}}\, e^{-\frac{1}{2}(x-\mu_2)'\Sigma_2^{-1}(x-\mu_2)}.$$
The decision boundary is
$$x'\Sigma^{-1}\left(\mu_2-\mu_1\right) - \frac{1}{2}\left(\mu_2'\Sigma^{-1}\mu_2 - \mu_1'\Sigma^{-1}\mu_1\right) = -th.$$
For $\Sigma = \sigma^2 I$:
$$\frac{1}{\sigma^2}\, x'\left(\mu_2-\mu_1\right) - \frac{1}{2\sigma^2}\left(\mu_2'\mu_2 - \mu_1'\mu_1\right) = -th,$$
$$\left(\mu_2-\mu_1\right)'x - \frac{1}{2}\left(\mu_2-\mu_1\right)'\left(\mu_2+\mu_1\right) = -\sigma^2\, th\, \frac{\left(\mu_2-\mu_1\right)'\left(\mu_2-\mu_1\right)}{\|\mu_2-\mu_1\|^2},$$
$$\left(\mu_2-\mu_1\right)'\left(x - \left[\frac{1}{2}\left(\mu_2+\mu_1\right) - \frac{\sigma^2\, th}{\|\mu_2-\mu_1\|^2}\left(\mu_2-\mu_1\right)\right]\right) = 0,$$
$$w'(x - x_0) = 0.$$
For a general common $\Sigma$:
$$x'\Sigma^{-1}\left(\mu_2-\mu_1\right) - \frac{1}{2}\left(\mu_2'\Sigma^{-1}\mu_2 - \mu_1'\Sigma^{-1}\mu_1\right) = -th,$$
$$\left[\Sigma^{-1}\left(\mu_2-\mu_1\right)\right]'x - \frac{1}{2}\left(\mu_2-\mu_1\right)'\Sigma^{-1}\left(\mu_2+\mu_1\right) = -th,$$
$$\left[\Sigma^{-1}\left(\mu_2-\mu_1\right)\right]'\left(x - \left[\frac{1}{2}\left(\mu_2+\mu_1\right) - \frac{th}{\left(\mu_2-\mu_1\right)'\Sigma^{-1}\left(\mu_2-\mu_1\right)}\left(\mu_2-\mu_1\right)\right]\right) = 0,$$
again of the form
$$w'(x - x_0) = 0.$$
$$\widehat{G}(x) = \arg\min_{g\in\mathcal{G}} \sum_{k=1}^{K} L\left(G_k, g\right)\Pr\left(G_k|X=x\right),$$
$$\lambda(G_1) = \cdots,\qquad \lambda(G_2) = \cdots,\qquad \ldots,\qquad \lambda(G_K) = \cdots.$$
• For this example, or for other difficult distributions, risks can be estimated by simulating a testing set ts:
$$\operatorname{Risk}\left(\widehat{f}\,\right) = \operatorname{E}\left(Y - \widehat{f}(X)\right)^2 \quad \text{(page 13)},$$
$$\widehat{\operatorname{Risk}}^{\,ts}\left(\widehat{f}_{tr}\right) = \frac{1}{n_{ts}}\sum_{i=1}^{n_{ts}}\left(y_i - \widehat{f}_{tr}(x_i)\right)^2.$$
Parameter Estimation:
$$\widehat{f}(x) = \widehat{\mu}_Y + \left(x - \widehat{\mu}_X\right)'\widehat{\Sigma}_{XX}^{-1}\widehat{\Sigma}_{XY},$$
$$\widehat{\mu}_Y = \bar{y} = \frac{1}{N}\mathbf{1}'y = \frac{1}{N}\sum_i y_i,$$
$$\widehat{\mu}_X = \bar{x} = \frac{1}{N}X'\mathbf{1} = \frac{1}{N}\sum_i x_i,$$
$$\left(\widehat{\Sigma}_{XY}\right)_{p\times 1} = \frac{1}{N-1}\sum_i (x_i - \bar{x})_{p\times 1}\, y_{i\,(1\times 1)} = \frac{1}{N-1}\, X_{c\,(p\times N)}'\, y_{N\times 1},$$
$$\left(\widehat{\Sigma}_{XX}\right)_{p\times p} = \frac{1}{N-1}\sum_i (x_i - \bar{x})(x_i - \bar{x})' = \frac{1}{N-1}\, X_c'X_c,$$
$$\widehat{f}(x) = \bar{y} + x_c'\left(X_c'X_c\right)^{-1}X_c'\, y. \quad (2.5)$$
For scalar $X$,
$$\widehat{f}(x) = \widehat{\mu}_Y + \frac{\widehat{\sigma}_{xy}}{\widehat{\sigma}_{xx}}\left(x - \widehat{\mu}_X\right) = \bar{y} + x_c\, \frac{\sum_i (x_i - \bar{x})\, y_i}{\sum_i (x_i - \bar{x})^2}.$$
Hint: Eq. (2.5) will be reached very differently, and interestingly, in the next chapter.
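Eq. (2.5) can be verified to coincide with ordinary least squares with an intercept. A minimal sketch on simulated data (the dimensions and coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data: p = 3 features, N = 200 observations (illustrative).
N, p = 200, 3
X = rng.normal(size=(N, p))
y = 4.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, N)

# Plug-in estimate (2.5): f_hat(x) = y_bar + x_c' (Xc'Xc)^{-1} Xc' y.
Xc = X - X.mean(axis=0)                       # centered data matrix
coef = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)

def f_hat(x):
    return y.mean() + (x - X.mean(axis=0)) @ coef

# Ordinary least squares with an explicit intercept column.
Xa = np.column_stack([np.ones(N), X])
beta_ls = np.linalg.lstsq(Xa, y, rcond=None)[0]

x0 = np.array([0.3, -0.7, 1.2])
pred_plugin = f_hat(x0)
pred_ls = beta_ls[0] + x0 @ beta_ls[1:]       # identical prediction
```

The two predictions agree exactly (up to floating-point error): centering plus an intercept reproduces the plug-in regression function.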
Get $\widehat{\mu}_1$, $\widehat{\Sigma}_1$, $\widehat{\mu}_2$, $\widehat{\Sigma}_2$ as before and plug in above.
$$\widehat{\Sigma} = \frac{1}{n_{tr1}+n_{tr2}-2}\left[\left(n_{tr1}-1\right)\widehat{\Sigma}_1 + \left(n_{tr2}-1\right)\widehat{\Sigma}_2\right].$$
(Observe that, if $\widehat{\Sigma}_1$ and $\widehat{\Sigma}_2$ are unbiased, $\widehat{\Sigma}$ is so too.)
So, simply, after training on tr we can test on a very large testing set (an MC trial) to get
$$\widehat{e}_{12}^{\,tr} = \frac{1}{n_{ts1}}\sum_{i=1}^{n_{ts1}} I_{\left(\widehat{h}_{tr}(x_i) < th\right)},$$
$$\widehat{e}_{21}^{\,tr} = \frac{1}{n_{ts2}}\sum_{i=1}^{n_{ts2}} I_{\left(\widehat{h}_{tr}(x_i) > th\right)},$$
$$\widehat{R}_{tr} = \frac{1}{2}\left(\widehat{e}_{12}^{\,tr} + \widehat{e}_{21}^{\,tr}\right).$$
• Assessment: how can we assess what we have designed in terms of some measures, e.g., Risk, Error,
etc.? This is the second part of the field.
$$Y = \operatorname{E}[Y|X] + \varepsilon = f(X) + \varepsilon,$$
where ε is a r.v. and E [ε] = 0. In linear models, it is assumed that f (X ) is linear in X , and the goal is to
estimate the coefficients in f (X ). Linear models:
• Still are a great tool for prediction and can outperform fancier ones.
• can be applied to transformed features (e.g., if we have $X = (X_1, X_2)'$, we can make up the feature vector $X = (X_1, X_2, X_1^2, X_2^2, X_1X_2)'$, and then assume $f(X)$ is linear in this new $X$).
• many other methods are generalization to linear models, including Neural Networks and even some
methods for classification.
$$f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p = \beta' X,$$
$$X = \left(1, X_1, \ldots, X_p\right)',$$
$$\beta = \left(\beta_0, \ldots, \beta_p\right)'.$$
The model still is linear in coefficients (or linear in the new features).
Typically, we have $N$ observations, each being $(x_i, y_i)$. So, we have the data matrix and the response vector:
$$X_{N\times(p+1)} = \begin{pmatrix} (1, x_1')\\ \vdots\\ (1, x_N') \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{N1} & \cdots & x_{Np} \end{pmatrix}, \qquad y_{N\times 1} = \begin{pmatrix} y_1\\ \vdots\\ y_N \end{pmatrix}.$$
3.2 Least Mean Square (LMS)
For any choice of $\beta$, we have a residual sum of squares:
$$RSS(\beta) = \sum_{i=1}^{N}\left(y_i - f(x_i)\right)^2 = \sum_i\left(y_i - \beta'x_i\right)^2 = \sum_i\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2.$$
• Setting the gradient of RSS to zero gives the normal equations; since the Hessian $2X'X$ is positive semi-definite, the solution is a minimum (not a maximum or a saddle point). The fitted values are $\widehat{y} = X(X'X)^{-1}X'y = Hy$, where $H$ is called the hat matrix (or the projection matrix). Therefore, the residual error at each observation is:
$$\widehat{\varepsilon} = \widehat{y} - y = X\left(X'X\right)^{-1}X'y - y = \left(X\left(X'X\right)^{-1}X' - I\right)y.$$
$y = X\beta + \varepsilon$:
• $X$ spans a sub-space of $\mathbb{R}^N$ (its column space).
• Then, $\widehat{e}$ must be perpendicular to the column space of $X$. This nice geometry can be translated to math as:
$$0 = X'\left(y - X\widehat{\beta}\right),$$
$$\widehat{\beta} = \left(X'X\right)^{-1}X'y.$$
• Eq. (3.2e) is a great surprise; it is the same as Eq. (2.5). LMS coincides with the best regression function, after plugging in parameter estimates, in the case of the multinormal distribution!
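The normal equations and their geometry can be checked numerically; a small sketch with a random design (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

N, p = 50, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])  # design with intercept
y = rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # solve the normal equations
resid = y - X @ beta_hat

orth = X.T @ resid                  # should vanish: residual _|_ column space of X
H = X @ np.linalg.inv(X.T @ X) @ X.T
idem = np.abs(H @ H - H).max()      # hat matrix is a projection: H @ H = H
```

Both quantities are zero up to floating-point error, confirming the orthogonality picture.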
$$err = \frac{1}{N}\sum_i\left(\widehat{y}_i - y_i\right)^2 = \frac{1}{N}\widehat{\varepsilon}'\widehat{\varepsilon} = \frac{1}{N}\|\widehat{\varepsilon}\|^2,$$
$$err_{tr} = \operatorname{E}_{x_0} err_{tr}(x_0).$$
This is conditional on the training set tr, which appears in the equation as $X$ and $y$. In simulation problems, where data comes from a known distribution, we can obtain a very accurate estimate of $err_{tr}$, using a very large ts, as:
$$err_{tr} \approx \frac{1}{n_{ts}}\sum_{i\in ts}\left(\widehat{y}_i - y_i\right)^2 \quad \text{(as in Ex. 4)}$$
We will see how to estimate this from tr. This also means that there is
$$\operatorname{Var}_{tr} err_{tr},$$
which expresses how stable the regression function is from one dataset to another. In simulation problems, we do MC simulation to estimate these quantities as well:
$$\operatorname{E}_{tr} err_{tr} \approx \frac{1}{M}\sum_{m=1}^{M} err_{tr_m},$$
$$\operatorname{Var}_{tr} err_{tr} \approx \frac{1}{M-1}\sum_{m=1}^{M}\left(err_{tr_m} - \frac{1}{M}\sum_{m=1}^{M} err_{tr_m}\right)^2.$$
“The variables are log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45).”
• E.g., svi is binary, gleason is ordered categorical; both lcavol and lcp show a strong relationship with the response lpsa, and with each other.
• “The mean prediction error on the test data is 0.521. In contrast, prediction using the mean training value
of lpsa has a test error of 1.057, which is called the base error rate. Hence the linear model reduces the base
error rate by about 50%. We will return to this example later to compare various selection and shrinkage
methods.”
$$\widehat{y}_0 = \widehat{\alpha} = \widehat{\beta}_0 = \frac{1}{N}\sum_{i=1}^{N} y_i; \quad \left(\beta_{1\sim p} = 0\right)$$
i.e., using the sample average of the training set as if you do not have any additional information from $X$.
• the meaning of $\widehat{\beta}_i$ is this: a unit increase in predictor $X_i$ results in an increase of $\widehat{\beta}_i$ in the response $Y$.
Standardize your predictors, e.g., to unit variance, so that no variable is more dominant than the others (when testing, of course use the inverse scaling). Prove that a linear mapping of the form
$$a_i X_i + b_i, \quad i = 1, 2$$
will not deform the correlation between the variables $X_1$, $X_2$. Applying a linear transformation to each variable of a data matrix amounts to moving the center of the scatter plot and then scaling, without preserving the aspect ratio.
One form of linear transformation is mapping all predictors to $[L, H]$ (usually $[-1, 1]$, as mapminmax in Matlab does (but it assumes a $p \times N$ matrix, not $N \times p$), or $[0, 1]$, as in parallel coordinates) by $X_{new} = a + bX$, with
$$b = \frac{H - L}{X_{\max} - X_{\min}}, \qquad a = \frac{LX_{\max} - HX_{\min}}{X_{\max} - X_{\min}}, \qquad \text{i.e.,}\quad -\frac{a}{b} = \frac{HX_{\min} - LX_{\max}}{H - L},$$
so that
$$X_{new} = \frac{X - \dfrac{HX_{\min} - LX_{\max}}{H - L}}{\dfrac{X_{\max} - X_{\min}}{H - L}}.$$
A linear transformation can also be done by standardizing to zero sample mean and unit sample variance:
$$X_{new} = \frac{X - \bar{X}}{\widehat{\sigma}_X}.$$
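The $[L, H]$ mapping can be sketched in a few lines; the helper name `minmax_map` and the sample matrix are illustrative:

```python
import numpy as np

def minmax_map(X, L=-1.0, H=1.0):
    """Affine map sending each column's min to L and its max to H."""
    Xmin, Xmax = X.min(axis=0), X.max(axis=0)
    b = (H - L) / (Xmax - Xmin)
    a = (L * Xmax - H * Xmin) / (Xmax - Xmin)
    return a + b * X

X = np.array([[0.0, 10.0],
              [5.0, 20.0],
              [10.0, 40.0]])
Z = minmax_map(X)          # each column now spans [-1, 1]
```

The midpoint 5 of the first column maps to 0, while 20 (one third of the way through [10, 40]) maps to $-1/3$.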
A special case of this transformation is when we ONLY scale each feature (what is usually used); i.e.,
$$S_1 = \begin{pmatrix} d_1 & & 0\\ & \ddots & \\ 0 & & d_p \end{pmatrix} = \operatorname{diag}\left(d_1, \ldots, d_p\right).$$
Therefore, a general linear transformation of the features, including shifting and scaling, is given by
$$X_{trans} = XTS = XH,$$
$$\begin{aligned}
\widehat{\beta}_{trans} &= \left[(XH)'(XH)\right]^{-1}(XH)'y\\
&= \left[H'X'XH\right]^{-1}H'X'y\\
&= H^{-1}\left(X'X\right)^{-1}H'^{-1}H'X'y\\
&= H^{-1}\left(X'X\right)^{-1}X'y,
\end{aligned}$$
$$\widehat{y}_{0,trans} = \left(x_0'H\right)\widehat{\beta}_{trans} = x_0'HH^{-1}\left(X'X\right)^{-1}X'y = \widehat{y}_0 \quad \text{(without transformation)}.$$
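This invariance of predictions under an invertible transformation $H$ can be confirmed numerically; a sketch with a random (almost surely invertible) $H$ and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(5)

N, p = 60, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # includes intercept
y = rng.normal(size=N)

# An invertible H (random perturbation of a scaled identity, illustrative).
H = rng.normal(size=(p + 1, p + 1)) + 4 * np.eye(p + 1)
XH = X @ H

beta = np.linalg.solve(X.T @ X, X.T @ y)
beta_t = np.linalg.solve(XH.T @ XH, XH.T @ y)    # = H^{-1} beta

x0 = np.concatenate([[1.0], rng.normal(size=p)])
pred, pred_t = x0 @ beta, (x0 @ H) @ beta_t      # identical predictions
```

Only the coordinates of $\widehat{\beta}$ change; the fitted subspace, and hence every prediction, stays the same.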
where we minimized it by choosing $f(X) = \operatorname{E}[Y|X=x]$. Now, for any learning function $f_{tr}(x_0) = \widehat{y}_0$, the (conditional) risk is given by (look at Sec. 3.3)
$$err_{tr}(x_0) = \sigma^2_{Y|X} + \left(\operatorname{E}\left[y_0|x_0\right] - \widehat{y}_0\right)^2,$$
$$err_{tr} = \underbrace{\operatorname{E}_{x_0}\sigma^2_{Y|X=x_0}}_{\text{minimum risk (irreducible)}} + \operatorname{E}_{x_0}\left(\widehat{y}_0 - \operatorname{E}_{y_0|x_0} y_0\right)^2,$$
$$\begin{aligned}
\operatorname{E}_{tr} err_{tr}
&= \operatorname{E}_{x_0}\sigma^2_{Y|X=x_0} + \operatorname{E}_{x_0}\operatorname{E}_{tr}\left(\widehat{y}_0 - \operatorname{E}_{y_0|x_0} y_0\right)^2\\
&= \operatorname{E}_{x_0}\sigma^2_{Y|X=x_0} + \operatorname{E}_{x_0}\operatorname{E}_{tr}\left(\left(\widehat{y}_0 - \operatorname{E}_{tr}\widehat{y}_0\right) + \left(\operatorname{E}_{tr}\widehat{y}_0 - \operatorname{E}_{y_0|x_0} y_0\right)\right)^2\\
&= \operatorname{E}_{x_0}\sigma^2_{Y|X=x_0} + \operatorname{E}_{x_0}\left[\operatorname{E}_{tr}\left(\widehat{y}_0 - \operatorname{E}_{tr}\widehat{y}_0\right)^2 + \left(\operatorname{E}_{tr}\widehat{y}_0 - \operatorname{E}_{y_0|x_0} y_0\right)^2 + \underbrace{\mathrm{cov}}_{0}\right]\\
&= \operatorname{E}_{x_0}\sigma^2_{Y|X=x_0} + \operatorname{E}_{x_0}\left[\operatorname{Var}_{tr}\left(\widehat{y}_0\right) + \operatorname{Bias}^2_{tr}\left(\widehat{y}_0\right)\right].
\end{aligned}$$
In particular for linear models, it can be shown that if the data follows exactly a full linear model, then using
the right number of features or more will result in an unbiased model.
3.5.1 Bias for Underfitting
Assume the right model is
$$\operatorname{E}\left[y|X\right] = X'\beta = \left(X_1', X_2'\right)\begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix} = X_1'\beta_1 + X_2'\beta_2,$$
and we used the reduced model (underfitting)
$$\operatorname{E}\left[y|X_1\right] = X_1'\beta_1^*.$$
Then
$$\begin{aligned}
\operatorname{E}_{tr}\widehat{y}_0 &= \operatorname{E}_{tr}\, x_{01}'\left(X_1'X_1\right)^{-1}X_1'y\\
&= x_{01}'\operatorname{E}_X\operatorname{E}_{y|X}\left(X_1'X_1\right)^{-1}X_1'y\\
&= x_{01}'\operatorname{E}_X\left(X_1'X_1\right)^{-1}X_1'\operatorname{E}_{y|X}\, y\\
&= x_{01}'\operatorname{E}_X\left[\left(X_1'X_1\right)^{-1}X_1'\left(X_1\beta_1 + X_2\beta_2\right)\right]\\
&= x_{01}'\left(\beta_1 + \operatorname{E}_X\left(X_1'X_1\right)^{-1}X_1'X_2\,\beta_2\right)\\
&= x_{01}'\beta_1 + x_{01}'\operatorname{E}_X\left(X_1'X_1\right)^{-1}X_1'X_2\,\beta_2\\
&\neq x_{01}'\beta_1 + x_{02}'\beta_2 = \operatorname{E}\left[y_0|x_0\right].
\end{aligned}$$
Therefore, underfitting introduces a bias. When using the right model, i.e., if $\beta_2 = 0$, the bias is zero.
3.5.2 Bias for Right model and Overfitting
$$\operatorname{E}\left[y|X\right] = X_1'\beta_1 = X_1'\beta_1 + \underbrace{X_2'\beta_2}_{=0}$$
$$\begin{aligned}
\operatorname{Var}_{tr}\widehat{y}_0 &= \operatorname{E}_X\operatorname{Var}_{y|X}\left[x_0'\left(X'X\right)^{-1}X'y\right] + \operatorname{Var}_X\operatorname{E}_{Y|X}\left[x_0'\left(X'X\right)^{-1}X'y\right]\\
&= \operatorname{E}_X\left[x_0'\left(X'X\right)^{-1}X'\left(\sigma^2_{Y|X} I_{N\times N}\right)X\left(X'X\right)^{-1}x_0\right] + \operatorname{Var}_X\left[x_0'\left(X'X\right)^{-1}X'X\beta\right]\\
&= \operatorname{E}_X\, \sigma^2_{Y|X}\, x_0'\left(X'X\right)^{-1}x_0 + \operatorname{Var}_X\, x_0'\beta\\
&= x_0'\operatorname{E}_X\left[\sigma^2_{Y|X}\left(X'X\right)^{-1}\right]x_0.
\end{aligned}$$
To simplify more, assume that $\operatorname{E}_X X = 0$ (so $X$ is centered, which is not a big deal); therefore
$$\operatorname{E}_X\left(\frac{1}{N-1}X'X\right) = \Sigma_X, \qquad \operatorname{E}_X\left(X'X\right)^{-1} \approx \frac{1}{N}\Sigma_X^{-1},$$
$$\operatorname{Var}_{tr}\widehat{y}_0 = \frac{\sigma^2}{N}\, x_0'\Sigma_X^{-1}x_0,$$
$$\begin{aligned}
\operatorname{E}_{x_0}\operatorname{Var}_{tr}\widehat{y}_0 &= \frac{\sigma^2}{N}\operatorname{E}_{x_0}\, x_0'\Sigma_X^{-1}x_0\\
&= \frac{\sigma^2}{N}\operatorname{E}_{x_0}\operatorname{trace}\left[x_0x_0'\Sigma_X^{-1}\right]\\
&= \frac{\sigma^2}{N}\operatorname{trace}\left[\operatorname{E}_{x_0}\left(x_0x_0'\right)\Sigma_X^{-1}\right]\\
&= \frac{\sigma^2}{N}\operatorname{trace}\left[\Sigma_X\Sigma_X^{-1}\right]\\
&= \frac{\sigma^2}{N}\operatorname{trace} I_{p\times p}\\
&= \sigma^2\,\frac{p}{N}.
\end{aligned}$$
V. Imp:
$$\overline{err} = \operatorname{E}_{tr} err_{tr} \approx \frac{1}{M}\sum_{m=1}^{M} err_{tr_m} = \operatorname{E}_{x_0}\left[\sigma^2_{Y|X=x_0} + \operatorname{E}_{tr}\left(\widehat{y}_0 - \operatorname{E}_{tr}\widehat{y}_0\right)^2 + \left(\operatorname{E}_{tr}\widehat{y}_0 - \operatorname{E}_{y_0|x_0} y_0\right)^2\right],$$
$$err = \frac{1}{n_{tr}}\sum_{i\in tr}\left(\widehat{y}_i - y_i\right)^2,$$
$$err_{tr} = \operatorname{E}_{x_0,y_0}\left[\left(\widehat{y}_0 - y_0\right)^2\right] = \operatorname{E}_{x_0}\operatorname{E}_{y_0|x_0}\left[\left(\widehat{y}_0 - y_0\right)^2\right] \approx \frac{1}{n_{ts}}\sum_{i\in ts}\left(\widehat{y}_i - y_i\right)^2$$
$$= \operatorname{E}_{x_0}\left[\sigma^2_{Y|X=x_0} + \left(\widehat{y}_0 - \operatorname{E}_{y_0|x_0} y_0\right)^2\right].$$
• But with an increasing training set size we get better performance; so it depends on the ratio $p/N$.
• $err$ must decrease with complexity, and $\operatorname{Var}_{tr} err$ is "in general" decreasing too; why? However, $err_{tr}$ "in general" has a minimum, and $\operatorname{Var}_{tr} err_{tr}$ is "in general" increasing.
• “The first is prediction accuracy: the least squares estimates often have low bias but large variance.
Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By
doing so we sacrifice a little bit of bias to reduce the variance of the predicted values, and hence may
improve the overall prediction accuracy.”
• “The second reason is interpretation. With a large number of predictors, we often would like to deter-
mine a smaller subset that exhibit the strongest effects. In order to get the “big picture”, we are willing
to sacrifice some of the small details.”
• Forward-Stagewise regression.
• Best subset regression finds, for each $k \in \{0, 1, 2, \ldots, p\}$, the subset of size $k$ that gives the smallest RSS.
• The best subset of size 2, e.g., need not include the variable that was in the best subset of size 1. Therefore, the red lower boundary is necessarily decreasing.
• The question of how to choose $k$ involves the tradeoff between bias and variance, and is usually settled by CV. However, another very important factor is time complexity. Subset selection is very time consuming compared with regularization; it can even be intractable in many real-life problems with high dimensions.
• Ridge regression
Hint: Great picture can be seen by studying the connection of these methods with each other; deferred to
the advanced course!
• “The ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizes the
inputs before solving (3.3).”
• “In addition, notice that the intercept β0 has been left out of the penalty term. Penalization of the intercept
would make the procedure depend on the origin chosen for Y ; that is, adding a constant c to each of the
targets yi would not simply result in a shift of the predictions by the same amount c.”
$$\widehat{y} = \mathbf{1}\widehat{\alpha} + X_c\widehat{\beta}^{\,ridge} \quad (3.5)$$
$$= \mathbf{1}\widehat{\alpha} + X_c\left(X_c'X_c + \lambda I\right)^{-1}X_c'\,y, \quad (3.6)$$
$$\operatorname{df}(\lambda) = \operatorname{tr}\left[X_c\left(X_c'X_c + \lambda I\right)^{-1}X_c'\right]. \quad (3.7)$$
However: we get big insight into the effect of $\lambda$ from the centered model => the df concept.
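Eq. (3.7) can be sketched directly: $\operatorname{df}(0) = p$ (ordinary least squares), and $\operatorname{df}(\lambda)$ shrinks toward 0 as $\lambda$ grows (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

N, p = 100, 6
Xc = rng.normal(size=(N, p))
Xc -= Xc.mean(axis=0)                        # centered inputs, as in (3.5)

def ridge_df(lam):
    # df(lambda) = trace[ Xc (Xc'Xc + lam I)^{-1} Xc' ]   (eq. 3.7)
    A = Xc @ np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T)
    return np.trace(A)

df0 = ridge_df(0.0)       # lambda = 0: least squares, df = p
df10 = ridge_df(10.0)
df_big = ridge_df(1e6)    # heavy shrinkage drives df toward 0
```

In the eigenvalue form, $\operatorname{df}(\lambda) = \sum_j d_j^2/(d_j^2+\lambda)$, which makes the monotone decrease obvious.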
• Notice: Nonlinear decision boundaries (surfaces) in the original feature space are linear in the expanded feature space. But how to find the right expansion/transformation? For example, in the case of $p = 2$, we can do:
$$T: \mathbb{R}^2 \mapsto \mathbb{R}^5,$$
$$X^{new} = \left(X_1^{new}, X_2^{new}, X_3^{new}, X_4^{new}, X_5^{new}\right) = T(X) = \left(X_1, X_2, X_1X_2, X_1^2, X_2^2\right),$$
$$h(X) = X_1 + X_2 + X_1X_2 + X_1^2 + X_2^2 = \sum_{i=1}^{5} X_i^{new} = h^{new}\left(X^{new}\right) = h^{new}(T(X)) = \left(h^{new}\circ T\right)(X),$$
$$h = h^{new}\circ T.$$
• First, let's revisit the best decision boundaries (Sec. 2.2.2), and see a very simple 3-class problem with unequal costs!
$$0 = e^{-\frac{1}{2}\left(x_1^2+x_2^2+a^2\right)}\left[\left(e^{ax_1} - e^{-ax_1}\right) + (L-1)\, e^{\sqrt{3}\,a x_2 - a^2}\right],$$
$$x_2 = \frac{1}{\sqrt{3}}\left(a - x_1\right) - \frac{1}{\sqrt{3}\,a}\log\frac{1-L}{e^{2ax_1}-1}.$$
@@@ This page still needs elaboration from Bishop P.180 and connection to Hastie and Tibs.
Theorem 13 General linear discriminant (score) functions, with any monotone transformation $T$:
$$\delta_i(x) = T\left(\beta_i'X\right), \quad i = 1, \cdots, K,$$
$$\widehat{G}(x) = \arg\max_i \delta_i(x),$$
where, in general, the intercept is absorbed in the feature vector $X = (1, \cdots)'$, have the following properties:
1. all decision surfaces are linear.
2. all decision regions are singly connected and convex.

Proof.
1. If a decision surface $S_{ij}$ exists, then $S_{ij} = \left\{x\,|\,\delta_i(x) = \delta_j(x)\right\}$. Then,
$$T\left(\beta_i'X\right) = T\left(\beta_j'X\right) \equiv \beta_i'X = \beta_j'X \quad \text{(monotonicity)}$$
$$\equiv \left(\beta_i' - \beta_j'\right)X = 0. \quad \text{(linear)}$$
2. Consider the decision region $R_k$, and two points $x_a, x_b \in R_k$; then consider the generic point $x_c$ on the line segment between these two points (taking $T$ as the identity, without loss of generality, by monotonicity):
$$x_c = \lambda x_a + (1-\lambda)x_b,$$
$$\begin{aligned}
\delta_k(x_c) &= \beta_k'x_c = \lambda\beta_k'x_a + (1-\lambda)\beta_k'x_b\\
&= \lambda\delta_k(x_a) + (1-\lambda)\delta_k(x_b)\\
&> \lambda\delta_i(x_a) + (1-\lambda)\delta_i(x_b) \quad \forall i \quad (x_a, x_b \in R_k)\\
&= \lambda\beta_i'x_a + (1-\lambda)\beta_i'x_b = \beta_i'x_c = \delta_i(x_c).
\end{aligned}$$
Hence, $x_c \in R_k$, which proves convexity and single-connectedness in one step and completes the proof.
$$\left[\Sigma^{-1}\left(\mu_2-\mu_1\right)\right]'\left(x - \left[\frac{1}{2}\left(\mu_2+\mu_1\right) - \frac{th}{\left(\mu_2-\mu_1\right)'\Sigma^{-1}\left(\mu_2-\mu_1\right)}\left(\mu_2-\mu_1\right)\right]\right) = 0,$$
$$w'\left(x - x_0\right) = 0.$$
• “For this figure and many similar figures in the book we compute the decision boundaries by an exhaustive contouring method. We compute the decision rule on a fine lattice of points, and then use contouring algorithms to compute the boundaries.”
• These are 3 different complexities:
a) is case 3 (QDA).
b) is case 1, less complex than LDA (how to estimate only the diagonal: trivial!)
c) is very naive.
$$\Sigma = V\Lambda V', \quad V'V = I,$$
$$= v_1v_1'\lambda_1 + v_2v_2'\lambda_2, \qquad \Sigma v_2 = v_2\lambda_2,$$
which means that the line connecting the two means (which is arbitrary) has the direction of the smallest principal component; which is not mandatory of course.
$$\Sigma = U\Lambda U' = \begin{pmatrix} u_1 & \cdots & u_p \end{pmatrix}\begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_p \end{pmatrix}\begin{pmatrix} u_1'\\ \vdots\\ u_p' \end{pmatrix} = \lambda_1 u_1u_1' + \cdots + \lambda_p u_pu_p',$$
The distance to a class centroid may be weighted by the inverse of the spread (a small spread in one direction makes a data point far from the class centroid in that direction): $\left(x-\mu\right)'u_i/\sigma_i$. So, the whole distance should be
$$\begin{aligned}
\delta^2\left(x,\mu\right) &= \sum_{i=1}^{p}\left(\left(x-\mu\right)'u_i/\sigma_i\right)^2\\
&= \sum_{i=1}^{p}\left(\left(x-\mu\right)'u_i/\sigma_i\right)\left(u_i'\left(x-\mu\right)/\sigma_i\right)\\
&= \left(x-\mu\right)'\left(\frac{1}{\sigma_1^2}u_1u_1' + \cdots + \frac{1}{\sigma_p^2}u_pu_p'\right)\left(x-\mu\right)\\
&= \left(x-\mu\right)'\Sigma^{-1}\left(x-\mu\right),
\end{aligned}$$
which is the Mahalanobis distance between $x$ and $\mu$ (the class centroid) with respect to the matrix $\Sigma$ (the covariance matrix of the data).
Then it is natural to classify based on the closest class centroid to the testing point $x$:
$$\widehat{G}(x) = \omega_k, \quad k = \arg\min_k\left(x-\mu_k\right)'\Sigma^{-1}\left(x-\mu_k\right),$$
$$\left(x-\mu_2\right)'\Sigma^{-1}\left(x-\mu_2\right) - \left(x-\mu_1\right)'\Sigma^{-1}\left(x-\mu_1\right) \;\overset{G_1}{\underset{G_2}{\gtrless}}\; th,$$
$$= w'\sum_{i=1}^{n_1}\left(x_i-\bar{x}_1\right)\left(x_i-\bar{x}_1\right)'w = w'S_1w,$$
$$w'S_1w + w'S_2w = w'S_Ww,$$
$$J(w) = \frac{w'S_Bw}{w'S_Ww},$$
$$\nabla J(w) = \frac{\left(2S_Bw\right)\left(w'S_Ww\right) - \left(w'S_Bw\right)\left(2S_Ww\right)}{\left(w'S_Ww\right)^2},$$
$$\left(S_Bw\right)\left(w'S_Ww\right) = \left(w'S_Bw\right)\left(S_Ww\right),$$
$$S_Ww \propto \left(\bar{x}_2 - \bar{x}_1\right),$$
$$w \propto S_W^{-1}\left(\bar{x}_2 - \bar{x}_1\right).$$
$$\left[\Sigma^{-1}\left(\mu_2-\mu_1\right)\right]'\left(x - x_0\right) \;\overset{G_1}{\underset{G_2}{\gtrless}}\; 0,$$
$$\left[\Sigma^{-1}\left(\mu_2-\mu_1\right)\right]'x \;\overset{G_1}{\underset{G_2}{\gtrless}}\; \text{some value, where}$$
$$\widehat{\Sigma} = \frac{1}{n_1+n_2-2}\sum_{k=1}^{2}\sum_i\left(x_i-\bar{x}_k\right)\left(x_i-\bar{x}_k\right)' = \frac{1}{n_1+n_2-2}\,S_W,$$
$$\widehat{\mu}_1 = \bar{x}_1, \qquad \widehat{\mu}_2 = \bar{x}_2.$$
This explains why LDA works even if the data is neither linearly separable nor following Normal distribution.
The LDA is always optimal in Fisher’s sense.
Let's see the same Mathematica notebook again for the connection between PCA, LDA, and FDA (Fisher (not Flexible) Discriminant Analysis).
$$S_T = \sum_i\left(x_i-\bar{x}\right)\left(x_i-\bar{x}\right)',$$
$$S_B = \sum_{k=1}^{K} n_k\left(\bar{x}_k-\bar{x}\right)\left(\bar{x}_k-\bar{x}\right)',$$
$$S_W = \sum_{k=1}^{K}\sum_{i\in k}\left(x_i-\bar{x}_k\right)\left(x_i-\bar{x}_k\right)'.$$
P (R ≤ r) = P x2 + y 2 ≤ r
πr2
=
π
= r2 .
$$\widehat{AUC} = \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} I_{h_{tr}(y_j) > h_{tr}(x_i)}, \quad (7.1)$$
$$I(a, b) = \begin{cases} 1 & a < b,\\ 1/2 & a = b,\\ 0 & a > b. \end{cases} \quad (7.2)$$
Proof.
1. $AUC = \int_0^1 TPF\,dFPF \le \int_0^1 1\,dFPF = 1$.
2. The proof is a set of straightforward calculus steps. Denote the scores with densities $f_{h_{tr}}(x|\omega_1)$, $f_{h_{tr}}(x|\omega_2)$ as $X$ and $Y$, respectively; then
$$AUC = -\int_{x=-\infty}^{\infty} TPF(x)\,dFPF(x) = \int_0^1 TPF\,dFPF.$$
3. The proof of unbiasedness is trivial, and the minimum-variance part is omitted (see Randles and Wolfe, 1979).
Example 21 A very complex Deep Neural Network (DNN) is trained on millions of images to detect whether a human appears in a photo or not. The DNN is later tested on 9 images (4 having no humans and 5 having humans) and produced the outputs $\{-2, 0, 2, 4\}$ and $\{1, 3, 5, 7, 9\}$, respectively. Estimate the AUC of this network, and how does the AUC change with the decision point of the network?

X X O X O X O O O

The answer is: 17/20 = 0.85 = 85%.
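Estimator (7.1)-(7.2) is a direct pairwise count, applied here to the scores of Example 21:

```python
# Mann-Whitney form of the AUC estimator (7.1)-(7.2): count correctly
# ranked (negative, positive) score pairs; ties count as 1/2.
def auc(neg, pos):
    total = 0.0
    for x in neg:
        for y in pos:
            total += 1.0 if y > x else (0.5 if y == x else 0.0)
    return total / (len(neg) * len(pos))

scores_no_human = [-2, 0, 2, 4]     # class X (no human in the photo)
scores_human = [1, 3, 5, 7, 9]      # class O (human present)
a = auc(scores_no_human, scores_human)   # 17 of the 20 pairs are ranked correctly
```

Counting per negative score: $-2$ and $0$ are beaten by all 5 positives, $2$ by 4 of them, and $4$ by 3, giving $5+5+4+3 = 17$ out of $4\times 5 = 20$ pairs.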
[Confusion matrix: rows are the true classes $\omega_1, \ldots, \omega_K$; columns are the predicted classes $\widehat{\omega}_1, \ldots, \widehat{\omega}_K$; each row of per-class frequencies sums to 1.]
$$1 = TPF_i + FNF_i,$$
$$PPV_j \equiv \Pr\left(\omega_j|\widehat{\omega}_j\right),$$
$$FDR_j \equiv \sum_{i\ne j}\Pr\left(\omega_i|\widehat{\omega}_j\right),$$
$$1 = PPV_j + FDR_j.$$
• 1−FNF, $1 - e_{12}$, TPF, TPR, Sensitivity, Recall, Probability of Detection/Positive alarm, Hit Rate, per-class accuracy:
$$\frac{TP}{P} = \frac{TP}{TP + FN}$$
• 1−FPF, $1 - e_{21}$, TNF, TNR, Specificity:
$$\frac{TN}{N}$$
• Positive Predictive Value (PPV), Precision (not obtainable from the ROC; it is the conditional probability that a positive decision is true):
$$\frac{TP}{TP + FP} = \frac{1}{1 + FP/TP} = \frac{1}{1 + (FP/P)/TPF}$$
• Accuracy:
$$\frac{TP + TN}{P + N}$$
• $F$-score:
$$2\,\frac{Precision \cdot Recall}{Precision + Recall}$$
• $F_\beta$-score:
$$\left(1 + \beta^2\right)\frac{Precision \cdot Recall}{\beta^2\, Precision + Recall}$$
7.4.2.1 K -Fold CV
for k = 1 : K
Train on all data except partition k;
Test on partition k;
Save the N/K predictions;
end
HW: Prove that estimating the error rate from each fold then averaging over the folds gives the same estimate
as if we pool the scores from folds and obtain one estimate from the n pooled scores.
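A quick numerical sanity check of the HW claim for the case of equal fold sizes (the random 0/1 correctness indicators are illustrative stand-ins for classifier output):

```python
import numpy as np

rng = np.random.default_rng(8)

# With equal-sized folds, averaging the per-fold error rates equals the
# error rate computed once from the n pooled predictions.
n, K = 100, 5
correct = rng.integers(0, 2, n).astype(float)   # 1 = correct prediction, 0 = error
folds = np.array_split(np.arange(n), K)          # equal partitions (n divisible by K)

per_fold = np.mean([1 - correct[f].mean() for f in folds])  # average of fold errors
pooled = 1 - correct.mean()                                  # one pooled estimate
```

With unequal fold sizes the two differ: the average of fold means is no longer the overall mean, which is the point of the HW.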
7.5.2 Pitfall of CV
Golub et al. (1999)
Bad Practice: the pitfall of reusing the testing set. E.g., selecting the best model η ∗ , then reusing the same
testing set to select the optimal threshold for classification!
When do you know that you have an over-optimistic measure of your model (or have over-trained)?
This happens, e.g., when training models on tr then testing on ts to choose the best model in terms of some loss, e.g., MSE. Then this chosen best model is used on the same test set to choose the best threshold value. In this case, the threshold value is a tuning parameter selected using ts and tested on ts as well.
“Suppose instead that we use the test-set repeatedly, choosing the model with smallest test-set error. Then
the test set error of the final chosen model will underestimate the true test error, sometimes substantially.”
(Hastie et al., 2009, P. 222)
Incrementally converting the testing set to training set Gur et al. (2004)
Neural Networks
$$\widehat{Y}_k = w_{0k}^{(2)} + \sum_{m=1}^{M} w_{mk}^{(2)} Z_m,$$
$$\sigma(\mu) = \frac{1}{1 + \exp(-\mu)}$$
or the tan-sigmoid function.
It can be proven that a 2-layer NN with sufficient $M$ can approximate any input function of finite domain with finite discontinuities. This result holds for a wide range of activation functions $\sigma^{(1)}$, excluding polynomials.
$$\begin{aligned}
\widehat{Y}_k &= \sigma^{(2)}\left(w_{0k}^{(2)} + \sum_{m=1}^{M} w_{mk}^{(2)}\,\sigma^{(1)}\left(w_{m0}^{(1)} + \sum_{i=1}^{D} w_{mi}^{(1)} X_i\right)\right)\\
&= \sigma^{(2)}\left(w_{0k}^{(2)} + \sum_{m=1}^{M} w_{mk}^{(2)}\,\sigma^{(1)}\left(w_{m0}^{(1)} + W_m X\right)\right)\\
&= \sigma^{(2)}\left(w_{0k}^{(2)} + \sum_{m=1}^{M} w_{mk}^{(2)} Z_m\right)
\end{aligned}$$
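The forward pass above can be sketched directly (random weights; $D$, $M$, $K$ are illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(9)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Forward pass of the 2-layer network: D inputs, M hidden units, K outputs.
D, M, K = 3, 4, 2
W1 = rng.normal(size=(M, D)); b1 = rng.normal(size=M)   # first-layer weights w^(1)
W2 = rng.normal(size=(K, M)); b2 = rng.normal(size=K)   # second-layer weights w^(2)

def forward(x):
    Z = sigmoid(b1 + W1 @ x)        # hidden activations Z_m
    return sigmoid(b2 + W2 @ Z)     # outputs Yhat_k

yhat = forward(rng.normal(size=D))
```

Since both layers use the logistic sigmoid, every output lies strictly in $(0, 1)$.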
$$SNR = \frac{\operatorname{Var}\left(\operatorname{E}\left[Y|X\right]\right)}{\operatorname{Var}(\varepsilon)}.$$
The testing set (never seen by the NN; this is just for understanding what's going on) has 10,000 cases.
Let's train each network until the gradient magnitude reaches its minimum (or $10^{-5}$, which is enough).
Try several $M$ (from 1 to 5); for each, do the training 10 times with different initialization vectors (see the code).
• Monotonic decrease in
training MSE.
with softmax, set net.outputs2.processFcns= as long as the classes are coded by 0 and 1.
where
the activation functions in the two layers are the hard limits given by:
$$\sigma^{(1)}(\mu) = \sigma^{(2)}(\mu) = \begin{cases} 1 & \mu > 0\\ 0 & \mu = 0\\ -1 & \mu < 0 \end{cases}$$
Unsupervised Learning
[Probability review (summary-slide residue); the recoverable rules:]
• Sum rule (marginal probability): $p(X) = \sum_Y p(X, Y)$.
• Product rule: $p(X, Y) = p(Y|X)\,p(X)$.
• Independence: $p(X, Y) = p(X)\,p(Y)$, i.e., $p(Y|X) = p(Y)$.
• For densities: $f(Y) = \int f(X, Y)\,dX$, and $f(X|Y=y) = f(X, y)/f(y)$.
• Conditional expectation (discrete case) and approximate expectation.
• Variance: $\operatorname{Var} X = \operatorname{E} X^2 - (\operatorname{E} X)^2$; covariance: $\operatorname{cov}(X, Y)$.
• Observed correlation does NOT imply causation.
Intuition:
Independent observations (realizations) of this r.v. are called independent and identically distributed (i.i.d.),
e.g., x1 ,...,xn .
An estimator is a real-valued function of the sample that tries to be “close”, in some sense, to a population quantity.
How “close”? Define a loss function, e.g., the squared error $L\left(\widehat{\mu}, \mu\right) = \left(\widehat{\mu} - \mu\right)^2$, and define the Risk to be the expected loss, $\operatorname{E}\left(\widehat{\mu} - \mu\right)^2$ (the Mean Square Error, MSE).
An important decomposition for any estimator $\widehat{\mu}$:
$$\begin{aligned}
\operatorname{E}\left(\widehat{\mu} - \mu\right)^2 &= \operatorname{E}\left(\left(\widehat{\mu} - \operatorname{E}\widehat{\mu}\right) + \left(\operatorname{E}\widehat{\mu} - \mu\right)\right)^2\\
&= \operatorname{E}\left(\widehat{\mu} - \operatorname{E}\widehat{\mu}\right)^2 + \left(\operatorname{E}\widehat{\mu} - \mu\right)^2 + 2\operatorname{E}\left[\left(\widehat{\mu} - \operatorname{E}\widehat{\mu}\right)\left(\operatorname{E}\widehat{\mu} - \mu\right)\right]\\
&= \operatorname{Var}\widehat{\mu} + \operatorname{Bias}^2\left(\widehat{\mu}\right).
\end{aligned}$$
One estimator may be better for one loss and not better for another loss.
Sample mean $\bar{X}$ as an estimator of $\mu_X$: $\widehat{\mu}_X = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
$$\operatorname{E}\bar{X} = \frac{1}{n}\sum_{i=1}^{n}\operatorname{E} x_i = \operatorname{E} X = \mu,$$
$$\operatorname{Bias}\left(\widehat{\mu}\right) = \operatorname{E}\widehat{\mu} - \mu = 0,$$
$$\operatorname{Var}\widehat{\mu} = \frac{1}{n^2}\left[\sum_i \sigma^2 + \sum_i\sum_{j\ne i}\operatorname{Cov}\left(X_i, X_j\right)\right] = \frac{1}{n}\sigma^2 \quad \text{(i.i.d. case)}.$$
This means that from sample to sample it will vary with this variance.
An estimator with zero bias is called “unbiased”. This means that on average it will be exactly as what we
want.
$$\sigma^2 = \operatorname{E}\left(X - \mu\right)^2 = \operatorname{E} X^2 - \mu^2,$$
$$\widehat{\sigma}^2 = \frac{1}{n-1}\sum_i\left(x_i - \bar{X}\right)^2 = \frac{1}{n-1}\left(\sum_i x_i^2 - n\bar{X}^2\right),$$
$$\begin{aligned}
\operatorname{E}\widehat{\sigma}^2 &= \operatorname{E}\left[\frac{1}{n-1}\left(\sum_i x_i^2 - n\bar{X}^2\right)\right]\\
&= \frac{1}{n-1}\left(n\operatorname{E} X^2 - n\operatorname{E}\bar{X}^2\right)\\
&= \frac{1}{n-1}\left(n\left(\sigma^2 + \mu^2\right) - n\left(\frac{\sigma^2}{n} + \mu^2\right)\right)\\
&= \sigma^2;
\end{aligned}$$
therefore, $\widehat{\sigma}^2$ is unbiased for $\sigma^2$.
$$\widehat{\operatorname{Cov}}\left(X, Y\right) = \frac{1}{n-1}\sum_i\left(x_i - \bar{X}\right)\left(y_i - \bar{Y}\right) = \frac{1}{n-1}\left(\sum_i x_iy_i - n\bar{X}\bar{Y}\right),$$
$$\operatorname{E}\widehat{\operatorname{Cov}}\left(X, Y\right) = \frac{1}{n-1}\left(n\operatorname{E} XY - n\operatorname{E}\bar{X}\bar{Y}\right), \quad \text{where}$$
$$\begin{aligned}
n\operatorname{E}\bar{X}\bar{Y} &= \frac{1}{n}\operatorname{E}\sum_i\sum_j x_iy_j = \frac{1}{n}\operatorname{E}\left(\sum_i x_iy_i + \sum_{i\ne j} x_iy_j\right)\\
&= \frac{1}{n}\left(n\operatorname{E} XY + n(n-1)\mu_X\mu_Y\right) = \operatorname{E} XY + (n-1)\mu_X\mu_Y.
\end{aligned}$$
Therefore,
$$\operatorname{E}\widehat{\operatorname{Cov}}\left(X, Y\right) = \frac{1}{n-1}\left[n\left(\operatorname{Cov}\left(X, Y\right) + \mu_X\mu_Y\right) - \left(\operatorname{Cov}\left(X, Y\right) + \mu_X\mu_Y\right) - (n-1)\mu_X\mu_Y\right] = \operatorname{Cov}\left(X, Y\right);$$
therefore, $\widehat{\operatorname{Cov}}\left(X, Y\right)$ is unbiased as well for $\operatorname{Cov}\left(X, Y\right)$.
Random Vectors and Multivariate Statistics
A $p$-dimensional random vector $X = \left(X_1, \ldots, X_p\right)'$ has the joint pdf
$$f_X = f_{X_1,\ldots,X_p}.$$
Mean:
$$\mu = \operatorname{E} X = \left(\operatorname{E} X_1, \ldots, \operatorname{E} X_p\right)',$$
$$\widehat{\mu} = \bar{X} = \frac{1}{n}\sum_i x_i = \frac{1}{n}\left[\begin{pmatrix} x_{11}\\ \vdots\\ x_{1p}\end{pmatrix} + \cdots + \begin{pmatrix} x_{n1}\\ \vdots\\ x_{np}\end{pmatrix}\right] = \begin{pmatrix} \frac{1}{n}\sum_i x_{i1}\\ \vdots\\ \frac{1}{n}\sum_i x_{ip}\end{pmatrix},$$
$$\widehat{\Sigma} = \frac{1}{n-1}\sum_i\left(x_i - \bar{X}\right)\left(x_i - \bar{X}\right)' = \begin{pmatrix} \frac{1}{n-1}\sum_i\left(x_{i1} - \bar{X}_1\right)^2 & \cdots & \frac{1}{n-1}\sum_i\left(x_{i1} - \bar{X}_1\right)\left(x_{ip} - \bar{X}_p\right)\\ \vdots & \ddots & \vdots\\ \frac{1}{n-1}\sum_i\left(x_{ip} - \bar{X}_p\right)\left(x_{i1} - \bar{X}_1\right) & \cdots & \frac{1}{n-1}\sum_i\left(x_{ip} - \bar{X}_p\right)^2 \end{pmatrix}.$$
$$f_X(x) = \frac{1}{\left(2\pi\sigma^2\right)^{1/2}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right).$$
Prove that:
E X = µ,
Var X = σ 2 .
( )
So, the population parameters appear explicitly in the pdf. We say, X ∼ N µ,σ 2 . The figure shows the
geometry of the normal distribution.
E X = µ,
Cov (X ) = Σ.
which is the joint pdf of $p$ independent normals (for the normal distribution, uncorrelated implies independent). This is not the case for other distributions.
$$\operatorname{E}\left(A'_{m\times p}X_{p\times 1}\right) = A'\operatorname{E} X = A'\mu,$$
$$\operatorname{Cov}\left(A'_{m\times p}X_{p\times 1}\right) = A'\operatorname{Cov}(X)A = A'\Sigma A.$$
Theorem
( ) ( )
If X ∼ N µ, Σ then A′ X ∼ N A′ µ,A′ ΣA .
$$\alpha'X = \|\alpha\|\,\|X\|\cos\angle(X, \alpha) = \|\alpha\| \times \text{Projected Length}.$$
If we need the projected length only, then project on the unit vector $\frac{\alpha}{\|\alpha\|}$:
$$\frac{\alpha'}{\|\alpha\|}X = \|X\|\cos\angle(X, \alpha).$$
Multiply this scalar by the unit vector in the direction of the projection to get the new feature $Z$ in the direction $\alpha$:
$$Z = \left(\frac{\alpha}{\|\alpha\|}\right)_{p\times 1}\left(\frac{\alpha'}{\|\alpha\|}X\right)_{1\times 1} = \frac{\alpha\alpha'}{\alpha'\alpha}X = P^{(\alpha)}_{p\times p}X_{p\times 1},$$
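The matrix $P^{(\alpha)} = \alpha\alpha'/(\alpha'\alpha)$ can be checked for the defining properties of a projection; a small sketch with an illustrative $\alpha$:

```python
import numpy as np

# P = alpha alpha' / (alpha' alpha) projects any x onto the direction of alpha.
alpha = np.array([3.0, 4.0])
P = np.outer(alpha, alpha) / (alpha @ alpha)

x = np.array([1.0, 2.0])
z = P @ x                           # component of x along alpha

symmetric = np.allclose(P, P.T)     # projection matrices are symmetric ...
idempotent = np.allclose(P @ P, P)  # ... and idempotent: projecting twice changes nothing
```

Here $z = \frac{\alpha'x}{\alpha'\alpha}\alpha = \frac{11}{25}(3, 4)' = (1.32, 1.76)'$, a scalar multiple of $\alpha$.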
$$\bar{X} = \frac{1}{n}\sum_i x_i = \frac{1}{n}\mathbf{1}'x,$$
$$\operatorname{E}\bar{X} = \frac{1}{n}\mathbf{1}'\operatorname{E} x = \frac{1}{n}\mathbf{1}'\left(\mu\mathbf{1}\right) = \mu,$$
$$\operatorname{Var}\bar{X} = \frac{1}{n}\mathbf{1}'\left(\sigma^2 I\right)\frac{1}{n}\mathbf{1} = \frac{\sigma^2}{n^2}\mathbf{1}'\mathbf{1} = \frac{\sigma^2}{n}.$$
For vectors in $p$-dimensional space, the inner product $\langle x, y\rangle$ is the dot product $x'y$:
$$\langle x, y\rangle = x'y = \begin{pmatrix} x_1 & \cdots & x_p \end{pmatrix}\begin{pmatrix} y_1\\ \vdots\\ y_p \end{pmatrix} = \sum_{i=1}^{p} x_iy_i.$$
When $x'y$ is zero we say they are orthogonal. It can be shown that, for $p \le 3$,
$$\cos\theta = \frac{x'y}{\|x\|\cdot\|y\|} = \frac{x'y}{\sqrt{x'x}\,\sqrt{y'y}}.$$
Then, we can generalize this definition in higher dimensions, and define the angle between two vectors for
p > 3.
If we need the projected length only, then project on the unit vector $\frac{X}{\|X\|}$:
$$\frac{X'}{\|X\|}Y = \|Y\|\cos\angle(Y, X).$$
Multiply this scalar by the unit vector in the direction of the projection to get the new component in the direction $X$:
$$\widehat{Y} = \left(\frac{X}{\|X\|}\right)\left(\frac{X'}{\|X\|}Y\right) = X\widehat{\beta} = \frac{XX'}{X'X}Y = P^{(X)}_{p\times p}Y_{p\times 1},$$
$$e = Y - X\widehat{\beta},$$
$$\begin{aligned}
\|e\|^2 &= \left\langle Y - X\widehat{\beta},\; Y - X\widehat{\beta}\right\rangle\\
&= \langle Y, Y\rangle - 2\beta'\begin{pmatrix} \langle X_1, Y\rangle\\ \vdots\\ \langle X_n, Y\rangle \end{pmatrix} + \beta'\begin{pmatrix} \langle X_1, X_1\rangle & \cdots & \langle X_1, X_n\rangle\\ \vdots & \ddots & \vdots\\ \langle X_n, X_1\rangle & \cdots & \langle X_n, X_n\rangle \end{pmatrix}\beta\\
&= Y'Y - 2\beta'X'Y + \beta'X'X\beta,
\end{aligned}$$
$$\langle X_i, e\rangle = 0, \quad \text{or}\quad X_i'e = 0,$$
$$\widehat{Y} = X\widehat{\beta} = X\left(X'X\right)^{-1}X'Y$$
and the projection matrix P is
( )−1
P = X X′ X X′
The very interesting thing is
$$X'X = \begin{pmatrix} X_1'\\ \vdots\\ X_n' \end{pmatrix}\begin{pmatrix} X_1 & \cdots & X_n \end{pmatrix} = \begin{pmatrix} X_1'X_1 & X_1'X_2 & \cdots & X_1'X_n\\ X_2'X_1 & X_2'X_2 & \cdots & X_2'X_n\\ \vdots & \vdots & \ddots & \vdots\\ X_n'X_1 & X_n'X_2 & \cdots & X_n'X_n \end{pmatrix};$$
if the columns $X_i$ are orthonormal, $X'X = I$ and the projection decomposes into a sum of one-dimensional projections:
$$P_i = X_iX_i', \qquad \widehat{\beta}_i = X_i'Y.$$
Ambroise, C., McLachlan, G. J., 2002. Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression data. PNAS 99 (10), 6562–6566.
Cherkassky, V. S., Mulier, F., 1998. Learning from data : concepts, theory, and methods. Wiley, New York.
Dave, S. S., Wright, G., Tan, B., Rosenwald, A., Gascoyne, R. D., Chan, W. C., Fisher, R. I., Braziel, R. M., Rimsza, L. M., Grogan, T. M., Miller, T. P.,
LeBlanc, M., Greiner, T. C., Weisenburger, D. D., Lynch, J. C., Vose, J., Armitage, J. O., Smeland, E. B., Kvaloy, S., Holte, H., Delabie, J., Connors,
J. M., Lansdorp, P. M., Ouyang, Q., Lister, T. A., Davies, A. J., Norton, A. J., Muller-Hermelink, H. K., Ott, G., Campo, E., Montserrat, E., Wilson,
W. H., Jaffe, E. S., Simon, R., Yang, L., Powell, J., Zhao, H., Goldschmidt, N., Chiorazzi, M., Staudt, L. M., 2004. Prediction of Survival in Follicular
Lymphoma Based on Molecular Features of Tumor-Infiltrating Immune cells. New England Journal of Medicine November 351 (21), 2159–2169.
Devroye, L., 1982. Any Discrimination Rule Can Have an Arbitrarily Bad Probability of Error for Finite Sample Size. Pattern Analysis and Machine
Intelligence, IEEE Transactions on PAMI-4 (2), 154–157.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield,
C. D., Lander, E. S., 1999. Molecular Classification of Cancer: Class Discovery and Class Prediction By Gene Expression Monitoring. Science
286 (5439), 531–537.
Gur, D., Wagner, R. F., Chan, H. P., 2004. On the Repeated Use of Databases for Testing Incremental Improvement of Computer-Aided Detection
schemes. Acad Radiol 11 (1), 103–105.
Hastie, T., Tibshirani, R., Friedman, J. H., 2009. The elements of statistical learning: data mining, inference, and prediction, 2nd Edition. Springer,
New York.
Lee, S., 2008. Mistakes in Validating the Accuracy of a Prediction Classifier in High-Dimensional But Small-Sample Microarray data. Statistical Meth-
ods in Medical Research 17 (6), 635–642.
URL https://doi.org/10.1177/0962280207084839
Lim, T. S., Loh, W. Y., Shih, Y. S., 2000. A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classifica-
tion algorithms. Machine Learning 40 (3), 203–228.
Randles, R. H., Wolfe, D. A., 1979. Introduction to the theory of nonparametric statistics. Wiley, New York.
Simon, R., Radmacher, M. D., Dobbin, K., McShane, L. M., 2003. Pitfalls in the Use of DNA Microarray Data for Diagnostic and Prognostic Classification. Journal of the National Cancer Institute 95 (1), 14–18.
Tibshirani, R., 2005. Immune Signatures in Follicular lymphoma. N Engl J Med 352 (14), 1496–1497.
URL https://doi.org/10.1056/NEJM200504073521422