
IT342: Pattern Recognition

Insight by Mathematics and Intuition


for Understanding
Pattern Recognition

Waleed A. Yousef, Ph.D.,

Human Computer Interaction Lab.,


Computer Science Department,
Faculty of Computers and Information,
Helwan University,
Egypt.

March 24, 2019


Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
Lectures follow (and some figures are adapted from):

• Hastie, Tibshirani, and Friedman, “The Elements of Statistical Learning”, 2nd edition, Springer.

• Duda, Hart, and Stork, “Pattern Classification”, 2nd edition, Wiley.

• Bishop, “Pattern Recognition and Machine Learning”, Springer.

i Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Course Objectives

• Developing rigorous mathematical treatment.

• Building intuition.

• Developing computer practice to build and assess models on real- and simulated-datasets.

ii Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Contents

Contents iii

1 Introduction 1

2 Introduction to Statistical Decision Theory 9


2.1 Types of Variables and Important Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 “Best” Decision? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Getting to “Learning” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.1 Regression (Refer to Theorem 2:) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4 Conclusion and Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3 Linear Models for Regression 39


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Least Mean Square (LMS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.1 LMS: Geometric Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.2 LMS: Centered Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.1 Apparent Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.2 Conditional Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.3 Unconditional Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
iii Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
3.4 Data Preprocessing and Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 Bias-Variance Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5.1 Bias for Underfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.5.2 Bias for Right model and Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.5.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.5.4 Bias-Variance: illustration, model complexity, and model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.6 Reducing Complexity by Subset Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.7 Reducing Complexity by Regularization (Shrinkage Methods) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.7.1 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4 Linear Models for Classification 78


4.1 Discriminant Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.1.1 Linear Regression of an Indicator Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.1.2 LDA: revisiting, emphasizing, and more insight . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.1.3 LDA in Extended Space vs. QDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.1.4 Regularized Discriminant Analysis (RDA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.1.5 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1.6 LDA vs. PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.1.7 Mahalanobis’ Distance Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.1.8 Fisher Discriminant Analysis (FDA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.1.9 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2 Separating Hyperplanes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.2.1 Rosenblatt’s Perceptron Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.2.2 Optimal Separating Hyperplane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

7 Model Assessment and Selection 106


7.1 “Dimensionality” from Different Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.1.1 Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.1.2 Vapnik-Chervonenkis (VC) Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.1.3 Cover’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2 Assessing Regression Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.3 Assessing Classification Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.3.1 Any Discrimination Rule Can Have an Arbitrarily Bad Probability of Error for Finite Sample Size (Devroye, 1982) . . . . . . . . . . . . . . . . 113
7.3.2 Binary Classification Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.4 Resampling and Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.4.1 p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.4.2 Cross Validation (CV) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.4.3 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.5 Discussion in Hyperspace: Pitfalls and Advices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.5.1 Good Features vs. Good Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.5.2 Pitfall of CV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
iv Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
7.5.3 Pitfall of Reusing the Testing set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.5.4 One Independent Test set is not Sufficient. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

11 Neural Networks 128


11.1 Basic and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
11.2 Connection to Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
11.3 Connection to PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
11.4 Solution vs. Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
11.5 NN for Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
11.6 NN for Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

14 Unsupervised Learning 160

A Fast Revision on Probability A-1

B Fast Revision on Statistics B-1

C Fast Revision on Geometry, Linear Algebra, and Matrix Theory C-1

D Fast Revision on Multivariate Statistics D-1

Bibliography

v Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Chapter 1

Introduction

1 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


“Learning: is the process of estimating an unknown input-output dependency or structure of a system using
a limited number of observations.” (Cherkassky and Mulier, 1998).

Statistical learning plays a key role in many areas of science, finance and industry. Here are some examples
of learning problems:

• Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The
prediction is to be based on demographic, diet and clinical measurements for that patient.

• Predict the price of a stock in 6 months from now, on the basis of company performance measures and
economic data.

• Identify the numbers in a handwritten ZIP code, from a digitized image.

• Estimate the amount of glucose in the blood of a diabetic person, from the infrared absorption spec-
trum of that person’s blood.

• Identify the risk factors for prostate cancer, based on clinical and demographic variables.

2 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Supervised Learning: we have a training set with features (predictors, or inputs) and an outcome (response or output). Build a model to predict the outcome for future data.

Unsupervised Learning: we observe only the features and have no outcome. We need to cluster the data or organize it.

3 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Example 1: Email Spam

• 4601 email messages are used to predict whether an email is junk (“spam”) or not. The true outcome (email or spam) is available for each message. This is also called a classification problem (as will be explained later).

• The rule could be:


if (%george<0.6)&(%you>1.5) then spam
else mail

• But is this the “best” rule?

• Not all errors are equal!!

4 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Example 2: Prostate Cancer

• 97 men (observations): predict the log of Prostate Specific Antigen (lpsa) from a number of measurements including log-cancer-volume (lcavol).

• This is a regression problem (of course supervised).

5 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Example 3: Handwritten Digit Recognition

• The data comes from the handwritten ZIP codes on envelopes from U.S. postal mail.

• The images are 16 × 16 eight-bit grayscale maps, with each pixel ranging from 0-255.

• The task is to predict (classify) each image, from its 16 × 16 = 256 pixel features, as one of the ten digits.

• I’d like to see one of the projects to study this dataset and apply NN to it.

6 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Example 4: DNA Expression Microarrays

• DNA is the basic material that makes up human chromosomes. On the DNA chip, and through fluoroscopy, the gene expression is measured.

• It ranges, e.g., between −6 and 6; positive values indicate higher expression (red) and negative values indicate lower expression (green).

• Thousands of genes (features) exist; this is always an ill-posed problem.

• The figure: an experiment of 6830 genes (rows) (only 100 of them are displayed for clarity) and 64 samples (columns). The samples are 64 cancer tumors from different patients.

• Which samples are most similar to each other (across genes)?

• Which genes are most similar to each other (across samples)?

• Do certain genes show very high (or low) expression for certain cancer samples?

7 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


This problem can be viewed as:

• unsupervised learning: each sample is an observation in $\mathbb{R}^{6830}$.

• Regression: genes and kind of cancer (sample) are categorical predictors, and gene expression is the
response.

8 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Chapter 2

Introduction to Statistical Decision Theory

9 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


2.1 Types of Variables and Important Notation

Quantitative, where some measure is given as a value; e.g., X = 1, 3, −2.5.

Qualitative (or Categorical), where no measures or metrics are associated; e.g., X = Diseased, Nondiseased.

Ordered Categorical; e.g., X = small, medium, .... The variable X ∈ G, a set of possible values.

10 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


X      Random variable (or vector). In general, $X = (X_1, \ldots, X_p)'$
x_i    ith observation of X; $x_i = (x_{i1}, \ldots, x_{ip})'$
Y      A quantitative response
G      A qualitative (group) response; G ∈ G
X      The data matrix: $\mathbf{X}_{N\times p} = (x_1, \ldots, x_N)'$, with (i, j) entry $x_{ij}$
x_j    All N observations of the jth feature $X_j$; i.e., the jth column of $\mathbf{X}$
Ŷ      The predicted value of Y
Ĝ      The predicted category (or class); Ĝ ∈ G
tr     Training set (dataset); $tr = \{t_i \mid t_i = (x_i, y_i),\ i = 1, \ldots, N\}$
11 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


2.2 “Best” Decision?
2.2.1 Regression

• Predictor X ∈ R^p and response Y ∈ R.

• What is the “best” prediction Ŷ = f(X)?

• “Best” must be defined in terms of some loss.

• The rule that is best under one loss will not, in general, be best under another. Here we work in terms of the squared-error loss.

12 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


We have to minimize:
$$
\begin{aligned}
\text{Risk} = EPE &= \operatorname{E}\big(Y - f(X)\big)^2
= \int\!\!\int \big(y - f(x)\big)^2 f_{XY}(x,y)\,dx\,dy
= \int\!\!\int \big(y - f(x)\big)^2 f_{Y|X}\, f_X\,dy\,dx \\
&= \int \Big[\int \big(y - f(x)\big)^2 f_{Y|X}\,dy\Big] f_X\,dx
= \operatorname{E}_{X} \underbrace{\operatorname{E}_{Y|X} \big(Y - f(X)\big)^2}_{\text{Conditional Risk}} .
\end{aligned}
$$
Minimize pointwise w.r.t. X:
$$
\begin{aligned}
\operatorname{E}_{Y|X}\big(Y-f(x)\big)^2
&= \operatorname{E}_{Y|X}\big[(Y - \operatorname{E}[Y|X=x]) + (\operatorname{E}[Y|X=x] - f(x))\big]^2 \\
&= \operatorname{E}_{Y|X}\Big[(Y - \operatorname{E}[Y|X=x])^2 + 2(Y - \operatorname{E}[Y|X=x])(\operatorname{E}[Y|X=x] - f(x)) + (\operatorname{E}[Y|X=x] - f(x))^2\Big] \\
&= \operatorname{E}_{Y|X}(Y - \operatorname{E}[Y|X=x])^2 + \big(\operatorname{E}[Y|X=x] - f(x)\big)^2
= \sigma^2_{Y|X} + \big(\operatorname{E}[Y|X=x] - f(x)\big)^2 ,
\end{aligned}
$$
$$
f^*(x) = \arg\min_{f(x)} \operatorname{E}_{Y|X}\big(Y - f(x)\big)^2 = \operatorname{E}[Y|X=x], \qquad (2.1)
$$
$$
\text{Risk}_{\min} = \operatorname{E}_{X}\, \sigma^2_{Y|X}. \qquad (2.2)
$$
13 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
Very Important:

• This is an iff rule: ANY OTHER RULE WILL BE INFERIOR TO f* (under this loss and this distribution).

• So it is impossible for any single rule (or algorithm) to be the best for all kinds of problems.

• We have to try!

14 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Example 1 (Multinormal Distribution) .

Matlab Code 2.1:


% Bivariate normal sample and the (conditional-mean) regression line
mu = 3 + zeros(1, 2);  samples = 100;
s1 = 1;  s2 = 1;  r = 0.8;
sigma = [s1^2     s1*s2*r;
         s1*s2*r  s2^2];                    % covariance matrix
X = mvnrnd(mu, sigma, samples);             % requires the Statistics Toolbox
scatter(X(:,1), X(:,2), 20, '*r');  hold on

x = 0:0.1:10;
y = mu(1) + sigma(1,2)/sigma(2,2)*(x - mu(2));   % conditional-mean line
plot(x, y, '-', 'LineWidth', 3);

set(gcf, 'Units', 'inches');  set(gcf, 'position', [1, 1, 4, 4]);

This complies with the following theorem:

15 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Theorem 2 Let the components of Z be divided into two groups composing the sub-vectors Y, X, such that Z ∼ N(µ_Z, Σ_Z),
$$
\begin{pmatrix} Y \\ X \end{pmatrix} \sim
N\!\left( \begin{pmatrix} \mu_Y \\ \mu_X \end{pmatrix},
\begin{pmatrix} \Sigma_{YY} & \Sigma_{YX} \\ \Sigma_{XY} & \Sigma_{XX} \end{pmatrix} \right);
$$
then
$$
Y \sim N(\mu_Y, \Sigma_{YY}), \qquad
Y|X \sim N\!\left( \mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(x - \mu_X),\ \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY} \right).
$$

Notice:

• E[Y|X] is a line (hyperplane) in the p dimensions of X.

• for scalar Y and X, $\operatorname{E}[Y|X] = \mu_Y + \frac{\sigma_{YX}}{\sigma_X^2}(x - \mu_X)$.

• the slope $\frac{\sigma_{YX}}{\sigma_X^2} = \frac{\rho\sigma_Y}{\sigma_X}$ of the regression line makes sense.

• proofs and details are in the appendix.
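As a quick numerical check (my own sketch, not part of the original notes; mvnrnd needs the Statistics Toolbox), the conditional mean in Theorem 2 can be compared against a brute-force sample average:

% Sketch: conditional mean/variance of Y given X for a bivariate normal.
muY = 3; muX = 3; sY = 1; sX = 1; rho = 0.8;           % illustrative values
sYX = rho*sY*sX;                                       % Cov(Y,X)
condMean = @(x) muY + sYX/sX^2 * (x - muX);            % E[Y|X=x]
condVar  = (1 - rho^2)*sY^2;                           % Var(Y|X)

Z  = mvnrnd([muY muX], [sY^2 sYX; sYX sX^2], 1e6);     % (Y,X) sample
x0 = 4;  idx = abs(Z(:,2) - x0) < 0.05;                % samples with X near x0
fprintf('theory: %.3f   empirical: %.3f\n', condMean(x0), mean(Z(idx,1)));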

16 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


2.2.2 Classification

We have K classes, and X may belong to any of them. The loss is $L = L(G, \widehat{G}(X))$, a K × K matrix; sometimes we write it $L_{kl}$, the price paid for classifying an observation belonging to class $G_k$ as $G_l$. We have to minimize the risk under this loss:
$$
R = \operatorname{E} L\big(G, \widehat{G}(X)\big)
= \operatorname{E}_X \operatorname{E}_{G|X} L\big(G, \widehat{G}(X)\big)
= \operatorname{E}_X \underbrace{\sum_{k=1}^{K} L(G_k, g)\Pr(G_k|X)}_{\lambda(g)\ \text{Conditional Risk}} . \qquad (2.3)
$$
Minimize R pointwise (i.e., minimize the conditional risk): at a particular X = x, calculate the conditional risk for each decision g and choose the minimum; i.e.,
$$
\widehat{G}(x) = \arg\min_{g \in \mathcal{G}} \sum_{k=1}^{K} L(G_k, g)\Pr(G_k|X=x).
$$

2.2.2.1 Two classes
$$
\begin{aligned}
\lambda(G_1) &= L_{11}\Pr(G_1|X=x) + L_{21}\Pr(G_2|X=x), \\
\lambda(G_2) &= L_{12}\Pr(G_1|X=x) + L_{22}\Pr(G_2|X=x).
\end{aligned}
$$
What do you expect for the best rule?


17 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
$$
\begin{aligned}
\lambda(G_2) &\underset{G_2}{\overset{G_1}{\gtrless}} \lambda(G_1) \\
L_{12}\Pr(G_1|X=x) + L_{22}\Pr(G_2|X=x) &\underset{G_2}{\overset{G_1}{\gtrless}} L_{11}\Pr(G_1|X=x) + L_{21}\Pr(G_2|X=x) \\
\frac{\Pr(G_1|X=x)}{\Pr(G_2|X=x)} &\underset{G_2}{\overset{G_1}{\gtrless}} \frac{L_{21}-L_{22}}{L_{12}-L_{11}} \\
\frac{\Pr(G_1|X=x)}{\Pr(G_2|X=x)} &\underset{G_2}{\overset{G_1}{\gtrless}} \frac{L_{21}}{L_{12}}, \qquad (L_{ii}=0\ \text{usually})
\end{aligned}
$$
which makes a lot of sense, as we classify according to the maximum posterior (modified by the loss weights).
$$
\begin{aligned}
\frac{\Pr(X|G_1)\,\pi_1/\Pr(X=x)}{\Pr(X|G_2)\,\pi_2/\Pr(X=x)} &\underset{G_2}{\overset{G_1}{\gtrless}} \frac{L_{21}}{L_{12}} \\
\frac{f_1(X)}{f_2(X)} &\underset{G_2}{\overset{G_1}{\gtrless}} \frac{\pi_2 L_{21}}{\pi_1 L_{12}}, \qquad \text{(LR)}
\end{aligned}
$$
which makes another great sense: we classify as G1 when its likelihood is sufficiently larger; but if G2 has higher prevalence (higher prior π2) or a higher misclassification cost L21, we raise the bar.
18 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
Usually, this rule is put in the form
$$
\begin{aligned}
\ln\frac{f_1(X)}{f_2(X)} &\underset{G_2}{\overset{G_1}{\gtrless}} \ln\frac{\pi_2 L_{21}}{\pi_1 L_{12}} \qquad \text{(LLR)} \\
h(X) &\underset{G_2}{\overset{G_1}{\gtrless}} th, \qquad
\eta^*(X) = \begin{cases} G_1 & h(X) > th \\ G_2 & h(X) < th \end{cases}
\end{aligned}
$$
The decision surface and the two regions of decision:
$$
S^* = \{x : h(X) = th\}, \qquad
R_1^* = \{x : h(X) > th\}, \qquad
R_2^* = \{x : h(X) < th\}. \qquad (2.4)
$$
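A minimal Matlab sketch of this likelihood-ratio rule for two univariate Gaussian classes (my own illustration; the parameters, priors, and costs are assumptions, and normpdf needs the Statistics Toolbox):

% Sketch: two-class log-likelihood-ratio rule and its decision regions (2.4).
mu1 = 0; mu2 = 2; s = 1;                 % class-conditional N(mu_i, s^2)
pi1 = 0.5; pi2 = 0.5;                    % priors
L21 = 1;  L12 = 1;                       % costs (L11 = L22 = 0)
th  = log(pi2*L21/(pi1*L12));            % threshold on h(X)

h = @(x) log(normpdf(x, mu1, s)) - log(normpdf(x, mu2, s));   % h(X) = LLR
G = 1 + (h([-1 0.5 1 3]) < th)           % 1 -> G1 (R1*: h > th), 2 -> G2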

19 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


20 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
The risk in general (for any rule η) is given by (2.3) as
$$
R = \operatorname{E}_X \underbrace{\sum_{k=1}^{K} L(G_k,g)\Pr(G_k|X)}_{\lambda(g)\ \text{Conditional Risk}},
$$
$$
\begin{aligned}
R &= \int_{R_1} \lambda(G_1) P(X)\,dX + \int_{R_2} \lambda(G_2) P(X)\,dX \\
  &= \int_{R_1} L_{21}\Pr(G_2|X=x) P(X)\,dX + \int_{R_2} L_{12}\Pr(G_1|X=x) P(X)\,dX \\
  &= L_{21}\pi_2 \underbrace{\int_{R_1^*} f(X|G_2)\,dX}_{\text{Error } e_{21}^* = \Pr[R_1^*|G_2]} + L_{12}\pi_1 \underbrace{\int_{R_2^*} f(X|G_1)\,dX}_{\text{Error } e_{12}^* = \Pr[R_2^*|G_1]} \\
  &= L_{21}\pi_2 \Pr\big[x \in G_2 \text{ and decision is } G_1\big] + L_{12}\pi_1 \Pr\big[x \in G_1 \text{ and decision is } G_2\big],
\end{aligned}
$$
which makes a lot of sense: each kind of error is an integration of the right class over the wrong decision region, then magnified by the prevalence of that class. Then, from (2.4),
$$
\begin{aligned}
x \in R_1^* &\equiv th < h < \infty \;\rightarrow\; \Pr[R_1^*] = \Pr[th < h < \infty], \\
x \in R_2^* &\equiv -\infty < h < th \;\rightarrow\; \Pr[R_2^*] = \Pr[-\infty < h < th], \\
R^* &= L_{21}\pi_2 \underbrace{\int_{th}^{\infty} f_h(h|G_2)\,dh}_{\text{Error } e_{21}^*} + L_{12}\pi_1 \underbrace{\int_{-\infty}^{th} f_h(h|G_1)\,dh}_{\text{Error } e_{12}^*}.
\end{aligned}
$$

21 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Example 3 (Multinormal Distribution).
$$
f_1(x) = \frac{1}{((2\pi)^p |\Sigma_1|)^{1/2}} e^{-\frac{1}{2}(x-\mu_1)'\Sigma_1^{-1}(x-\mu_1)}, \qquad
f_2(x) = \frac{1}{((2\pi)^p |\Sigma_2|)^{1/2}} e^{-\frac{1}{2}(x-\mu_2)'\Sigma_2^{-1}(x-\mu_2)} .
$$
Then S* is given by:
$$
\begin{aligned}
\frac{f_1(x)}{f_2(x)} &\underset{G_2}{\overset{G_1}{\gtrless}} \frac{\pi_2 L_{21}}{\pi_1 L_{12}} \\
e^{-\frac{1}{2}(x-\mu_1)'\Sigma_1^{-1}(x-\mu_1) + \frac{1}{2}(x-\mu_2)'\Sigma_2^{-1}(x-\mu_2)}
&\underset{G_2}{\overset{G_1}{\gtrless}} \frac{\pi_2 L_{21}}{\pi_1 L_{12}}\,\frac{|\Sigma_1|^{1/2}}{|\Sigma_2|^{1/2}} \\
-(x-\mu_1)'\Sigma_1^{-1}(x-\mu_1) + (x-\mu_2)'\Sigma_2^{-1}(x-\mu_2)
&\underset{G_2}{\overset{G_1}{\gtrless}} 2\ln\!\left(\frac{\pi_2 L_{21}}{\pi_1 L_{12}}\,\frac{|\Sigma_1|^{1/2}}{|\Sigma_2|^{1/2}}\right) \\
\underbrace{x'\big(\Sigma_2^{-1}-\Sigma_1^{-1}\big)x}_{\text{Quadratic Term}}
- \underbrace{2x'\big(\Sigma_2^{-1}\mu_2 - \Sigma_1^{-1}\mu_1\big)}_{\text{Linear Term}}
+ \big(\mu_2'\Sigma_2^{-1}\mu_2 - \mu_1'\Sigma_1^{-1}\mu_1\big)
&\underset{G_2}{\overset{G_1}{\gtrless}} 2\,th + \ln\frac{|\Sigma_1|}{|\Sigma_2|} .
\end{aligned}
$$

22 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


The general geometry of S ∗ will be:

Three special cases of interest are:

23 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Case 1: Σ1 = Σ2 = Σ = σ²I; then S* is given by (the quadratic term vanishes):
$$
\begin{aligned}
x'\Sigma^{-1}(\mu_2-\mu_1) - \tfrac{1}{2}\big(\mu_2'\Sigma^{-1}\mu_2 - \mu_1'\Sigma^{-1}\mu_1\big) &= -th \\
\frac{1}{\sigma^2}x'(\mu_2-\mu_1) - \frac{1}{2\sigma^2}\big(\mu_2'\mu_2 - \mu_1'\mu_1\big) &= -th \\
(\mu_2-\mu_1)'x - \tfrac{1}{2}(\mu_2-\mu_1)'(\mu_2+\mu_1) &= -\sigma^2 th\,\frac{(\mu_2-\mu_1)'(\mu_2-\mu_1)}{\lVert\mu_2-\mu_1\rVert^2} \\
(\mu_2-\mu_1)'\left(x - \Big[\tfrac{1}{2}(\mu_2+\mu_1) - \frac{\sigma^2 th}{\lVert\mu_2-\mu_1\rVert^2}(\mu_2-\mu_1)\Big]\right) &= 0 \\
w'(x - x_0) &= 0,
\end{aligned}
$$
i.e., a hyperplane with normal w = µ2 − µ1 passing through x0.
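As an illustration of Case 1 (my own sketch; the means, sigma, and the test point are assumptions), w and x0 can be computed and used to classify a point:

% Sketch: linear boundary w'(x - x0) = 0 for Case 1 (Sigma = sigma^2 I).
mu1 = [0; 0];  mu2 = [3; 1];  sigma = 1;
pi1 = 0.5; pi2 = 0.5; L21 = 1; L12 = 1;
th = log(pi2*L21/(pi1*L12));                            % LLR threshold

w  = mu2 - mu1;                                         % normal to S*
x0 = (mu1 + mu2)/2 - sigma^2*th/norm(mu2-mu1)^2*(mu2-mu1);

x = [2.5; 0.2];                                         % a point to classify
if w'*(x - x0) < 0, disp('G1'), else, disp('G2'), end   % h(x) > th side is G1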

24 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


25 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
Case 2: Σ1 = Σ2 = Σ; then S* is given by (the quadratic term again vanishes):
$$
\begin{aligned}
x'\Sigma^{-1}(\mu_2-\mu_1) - \tfrac{1}{2}\big(\mu_2'\Sigma^{-1}\mu_2 - \mu_1'\Sigma^{-1}\mu_1\big) &= -th \\
\big[\Sigma^{-1}(\mu_2-\mu_1)\big]'x - \tfrac{1}{2}(\mu_2-\mu_1)'\Sigma^{-1}(\mu_2+\mu_1)
&= -th\,\frac{(\mu_2-\mu_1)'\Sigma^{-1}(\mu_2+\mu_1)}{(\mu_2-\mu_1)'\Sigma^{-1}(\mu_2+\mu_1)} \\
\big[\Sigma^{-1}(\mu_2-\mu_1)\big]'\left(x - \Big[\tfrac{1}{2}(\mu_2+\mu_1) - \frac{th}{(\mu_2-\mu_1)'\Sigma^{-1}(\mu_2+\mu_1)}(\mu_2+\mu_1)\Big]\right) &= 0 \\
w'(x - x_0) &= 0,
\end{aligned}
$$
i.e., still a hyperplane, but with normal w = Σ⁻¹(µ2 − µ1), no longer along µ2 − µ1.

26 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


27 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
Case 3: Σ1, Σ2 arbitrary; then S* is given by the full equation
$$
\underbrace{x'\big(\Sigma_2^{-1}-\Sigma_1^{-1}\big)x}_{\text{Quadratic Term}}
- \underbrace{2x'\big(\Sigma_2^{-1}\mu_2 - \Sigma_1^{-1}\mu_1\big)}_{\text{Linear Term}}
+ \mu_2'\Sigma_2^{-1}\mu_2 - \mu_1'\Sigma_1^{-1}\mu_1
= 2\,th + \ln\frac{|\Sigma_1|}{|\Sigma_2|},
$$
i.e., a quadratic decision surface in general.
28 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


29 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
30 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
2.2.2.2 Multiclass Problem
$$
\widehat{G}(x) = \arg\min_{g\in\mathcal{G}} \sum_{k=1}^{K} L(G_k,g)\Pr(G_k|X=x);
$$
i.e., at x compute the conditional risks λ(G1), λ(G2), ..., λ(GK) and choose the class with the minimum.

31 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


2.3 Getting to “Learning”
• We assume we know the distributions—and hence the “best” rule—but not the parameters.

• “Learning” here is nothing but point estimation.

• We consider the multinormal case for simplicity.

32 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


2.3.1 Regression (Refer to Theorem 2)
$$
\begin{aligned}
f^*(x) &= \mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(x-\mu_X) && (\operatorname{E}[Y|X]) \\
       &= \mu_Y + (x-\mu_X)'\Sigma_{XX}^{-1}\Sigma_{XY} && (p\text{-dim}) \\
       &= \mu_Y + (x-\mu_X)\frac{\sigma_{XY}}{\sigma_{XX}} && (\text{scalar}) \\
\text{Risk}^* &= \operatorname{E}_X\big[\Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\big] && (\operatorname{E}_X \sigma^2_{Y|X}) \\
       &= \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY} && (p\text{-dim}) \\
       &= (1-\rho^2)\sigma_Y^2 && (\text{scalar})
\end{aligned}
$$
Estimate the parameters, then plug in to get $\widehat{f}_{tr}$ close to f*.

Example 4 (live simulation using Matlab). Notice that

• $\widehat{f}_{tr}$ is close to f*.

• The larger the sample size, the better the prediction.

• This is no longer the best rule, nor is the risk the minimum.

• New source of variability: the training set tr.

• For this example, or for other difficult distributions, risks can be estimated by simulating a testing set ts:
$$
\text{Risk}\big(\widehat{f}\big) = \operatorname{E}\big(Y - \widehat{f}(X)\big)^2 \quad \text{(page 13)}, \qquad
\widehat{\text{Risk}}_{ts}\big(\widehat{f}_{tr}\big) = \frac{1}{n_{ts}}\sum_{i=1}^{n_{ts}}\big(y_i - \widehat{f}_{tr}(x_i)\big)^2 .
$$
33 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
Parameter Estimation:
$$
\begin{aligned}
\widehat{f}(x) &= \widehat{\mu}_Y + (x-\widehat{\mu}_X)'\widehat{\Sigma}_{XX}^{-1}\widehat{\Sigma}_{XY}, \\
\widehat{\mu}_Y &= \overline{y} = \frac{1}{N}\mathbf{1}'\mathbf{y} = \frac{1}{N}\sum_i y_i, \qquad
\widehat{\mu}_X = \overline{x} = \frac{1}{N}\mathbf{X}'\mathbf{1} = \frac{1}{N}\sum_i x_i, \\
\big(\widehat{\Sigma}_{XY}\big)_{p\times 1} &= \frac{1}{N-1}\sum_i (x_i-\overline{x})_{p\times 1}\, y_i
= \frac{1}{N-1}\,\mathbf{X}_c'\,\mathbf{y}_{N\times 1}, \qquad
\big(\widehat{\Sigma}_{XX}\big)_{p\times p} = \frac{1}{N-1}\sum_i (x_i-\overline{x})(x_i-\overline{x})'
= \frac{1}{N-1}\,\mathbf{X}_c'\mathbf{X}_c, \\
\widehat{f}(x) &= \overline{y} + x_c'\big(\mathbf{X}_c'\mathbf{X}_c\big)^{-1}\mathbf{X}_c'\mathbf{y}, \qquad (2.5) \\
\widehat{f}(x) &= \widehat{\mu}_Y + (x-\widehat{\mu}_X)\frac{\widehat{\sigma}_{xy}}{\widehat{\sigma}_{xx}}
= \overline{y} + x_c\,\frac{\sum_i (x_i-\overline{x})y_i}{\sum_i (x_i-\overline{x})^2}. \qquad (\text{for scalar } X)
\end{aligned}
$$
Hint: Eq. (2.5) will be reached very differently and interestingly next Chapter.

Mathematical Expression → Matlab:

• $\mathbf{1}_{N\times 1}$ → ones([N,1])

• $\overline{x}' = \frac{1}{N}\mathbf{1}'\mathbf{X}$ → xbarp = mean(X)

• $(\mathbf{X}_c)_{N\times p} = \mathbf{X} - \mathbf{1}\overline{x}'$ (rows $(x_i - \overline{x})'$) → X - repmat(xbarp, [N,1])
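A minimal Matlab sketch of this plug-in estimator, Eq. (2.5), on simulated data (my own illustration; the distribution parameters are assumptions):

% Sketch: plug-in estimate of the best regression function, Eq. (2.5).
N = 200;
mu    = [1 2 3];                                  % (mu_Y, mu_X')
Sigma = [2 0.8 0.5; 0.8 1 0.3; 0.5 0.3 1];        % cov of (Y, X')
Z = mvnrnd(mu, Sigma, N);                         % Statistics Toolbox
y = Z(:,1);  X = Z(:,2:end);

ybar = mean(y);  xbar = mean(X);
Xc   = X - repmat(xbar, [N,1]);                   % centered data matrix
fhat = @(x) ybar + (x - xbar)*((Xc'*Xc)\(Xc'*y)); % Eq. (2.5)
fhat([2.5 3.2])                                   % prediction at a new x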

34 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


2.3.2 Classification
Quadratic Discriminant Analysis (QDA):
$$
\underbrace{x'\big(\Sigma_2^{-1}-\Sigma_1^{-1}\big)x}_{\text{Quadratic Term}}
- \underbrace{2x'\big(\Sigma_2^{-1}\mu_2 - \Sigma_1^{-1}\mu_1\big)}_{\text{Linear Term}}
+ \mu_2'\Sigma_2^{-1}\mu_2 - \mu_1'\Sigma_1^{-1}\mu_1
= 2\,th + \ln\frac{|\Sigma_1|}{|\Sigma_2|}.
$$
Get $\widehat{\mu}_1$, $\widehat{\mu}_2$, $\widehat{\Sigma}_1$, and $\widehat{\Sigma}_2$ as before and plug in above.

Linear Discriminant Analysis (LDA):
$$
-x'\Sigma^{-1}(\mu_2-\mu_1) + \tfrac{1}{2}\big(\mu_2'\Sigma^{-1}\mu_2 - \mu_1'\Sigma^{-1}\mu_1\big) = th,
$$
$$
\widehat{\Sigma} = \frac{1}{n_{tr_1}+n_{tr_2}-2}\Big[(n_{tr_1}-1)\widehat{\Sigma}_1 + (n_{tr_2}-1)\widehat{\Sigma}_2\Big].
$$
(Observe that, if $\widehat{\Sigma}_1$ and $\widehat{\Sigma}_2$ are unbiased, $\widehat{\Sigma}$ is so too.)

Example 5 (Live simulation using Mathematica) .
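Since the live simulation is not reproduced in these notes, here is a minimal Matlab sketch of the plug-in LDA rule (my own illustration; class parameters and sample sizes are assumptions):

% Sketch: plug-in LDA: estimate means and pooled covariance, then classify.
n1 = 100; n2 = 100;
X1 = mvnrnd([0 0], eye(2), n1);                   % training data from G1
X2 = mvnrnd([2 1], eye(2), n2);                   % training data from G2

mu1 = mean(X1)';  mu2 = mean(X2)';
S   = ((n1-1)*cov(X1) + (n2-1)*cov(X2))/(n1+n2-2);        % pooled Sigma-hat

th = 0;                                           % equal priors and costs
% h(x) = x' S^{-1}(mu1-mu2) - (mu1' S^{-1} mu1 - mu2' S^{-1} mu2)/2
h  = @(x) x'*(S\(mu1-mu2)) - 0.5*(mu1'*(S\mu1) - mu2'*(S\mu2));
x  = [1.5; 0.5];
if h(x) > th, disp('G1'), else, disp('G2'), end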

35 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Example 6 (simulating LLR for more understanding) .

36 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Risk Estimation
$$
\begin{aligned}
R &= L_{21}\pi_2 \underbrace{\int_{R_1} f(X|G_2)\,dX}_{\text{Error } e_{21} = \Pr[R_1|G_2]}
  + L_{12}\pi_1 \underbrace{\int_{R_2} f(X|G_1)\,dX}_{\text{Error } e_{12} = \Pr[R_2|G_1]} \\
  &= L_{21}\pi_2 \underbrace{\int_{th}^{\infty} f_h(h|G_2)\,dh}_{\text{Error } e_{21}}
  + L_{12}\pi_1 \underbrace{\int_{-\infty}^{th} f_h(h|G_1)\,dh}_{\text{Error } e_{12}}.
\end{aligned}
$$
So, simply, after training on tr we can test on a very large testing set (MC trial) to get
$$
\widehat{e}_{tr_{12}} = \frac{1}{n_{ts_1}}\sum_{i=1}^{n_{ts_1}} I_{\big(\widehat{h}_{tr}(x_i) < th\big)}, \qquad
\widehat{e}_{tr_{21}} = \frac{1}{n_{ts_2}}\sum_{i=1}^{n_{ts_2}} I_{\big(\widehat{h}_{tr}(x_i) > th\big)}.
$$
Under equal priors (0.5) and costs (1), i.e., th = 0, we have
$$
\widehat{R}_{tr} = \tfrac{1}{2}\big(\widehat{e}_{tr_{12}} + \widehat{e}_{tr_{21}}\big).
$$
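Continuing in the same spirit (my own sketch; the class distributions are assumptions), the two error components and R-hat can be estimated by Monte Carlo on a large testing set:

% Sketch: MC estimate of e12, e21, and R-hat for a trained LDA rule h_tr.
n1 = 100; n2 = 100; nts = 1e5; th = 0;
X1 = mvnrnd([0 0], eye(2), n1);  X2 = mvnrnd([2 1], eye(2), n2);   % training
mu1 = mean(X1)';  mu2 = mean(X2)';
S   = ((n1-1)*cov(X1) + (n2-1)*cov(X2))/(n1+n2-2);
w   = S\(mu1-mu2);  c = 0.5*(mu1'*(S\mu1) - mu2'*(S\mu2));         % h = x'w - c

T1 = mvnrnd([0 0], eye(2), nts);  T2 = mvnrnd([2 1], eye(2), nts); % testing
e12  = mean(T1*w - c < th);            % from G1, decided G2
e21  = mean(T2*w - c > th);            % from G2, decided G1
Rhat = 0.5*(e12 + e21)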

37 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


2.4 Conclusion and Overview
Usually:

• we do not know the best regression function!

• we do not know the distribution!

• we cannot simulate more data to estimate Risk!

The field is about answering the above questions:

• “Learning: is the process of estimating an unknown input-output dependency or structure of a system using a limited number of observations.” (Cherkassky and Mulier, 1998). This is the first part of the field, also called design.

• Assessment: how can we assess what we have designed in terms of some measures, e.g., Risk, Error, etc.? This is the second part of the field.

38 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Chapter 3

Linear Models for Regression

39 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


3.1 Introduction
We saw that the best regression function is Ŷ = E[Y|X]. In general, we can always write
$$
Y|X = \operatorname{E}[Y|X] + \varepsilon = f(X) + \varepsilon,
$$
where ε is a r.v. with E[ε] = 0. In linear models, it is assumed that f(X) is linear in X, and the goal is to estimate the coefficients of f(X). Linear models:

• were largely developed in the statistics community a long time ago;

• still are a great tool for prediction and can outperform fancier ones;

• can be applied to transformed features (e.g., if we have $X = (X_1, X_2)'$, we can make up the feature vector $X = (X_1, X_2, X_1^2, X_2^2, X_1X_2)'$ and then assume f(X) is linear in this new X);

• many other methods are generalizations of linear models, including Neural Networks and even some methods for classification.

40 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Suppose that the original feature vector is $Z = (Z_1, \ldots, Z_D)'$. In linear models we assume that
$$
f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p = \beta'X, \qquad
X = (1, X_1, \ldots, X_p)', \quad \beta = (\beta_0, \ldots, \beta_p)',
$$
where the 1 accounts for the intercept and $X_1, \ldots, X_p$ can be:

• the components of the original feature vector;

• basis functions: e.g., $X_1 = Z_1,\ X_2 = Z_2^2,\ X_3 = Z_1Z_2, \ldots$;

• transformations: e.g., $X_1 = \log Z_1,\ X_2 = \exp[Z_2]$.

The model is still linear in the coefficients (or linear in the new features).

Typically, we have N observations, each being $(x_i, y_i)$. So we have the data matrix and the response values:
$$
\mathbf{X}_{N\times(p+1)} = \begin{pmatrix} (1, x_1)' \\ \vdots \\ (1, x_N)' \end{pmatrix}
= \begin{pmatrix} 1 & x_{11} & \ldots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \ldots & x_{Np} \end{pmatrix}, \qquad
\mathbf{y}_{N\times 1} = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}.
$$
41 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
3.2 Least Mean Square (LMS)
For any choice of β, we have a residual sum of squares:
$$
RSS(\beta) = \sum_{i=1}^{N}\big(y_i - f(x_i)\big)^2 = \sum_i \big(y_i - \beta'x_i\big)^2
= \sum_i \Big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\Big)^2 .
$$
A valid choice of β is the one that minimizes RSS, which can be rewritten in vector form as
$$
RSS(\beta) = (\mathbf{y}-\mathbf{X}\beta)'(\mathbf{y}-\mathbf{X}\beta)
= \mathbf{y}'\mathbf{y} - 2\beta'\mathbf{X}'\mathbf{y} + \beta'\mathbf{X}'\mathbf{X}\beta .
$$
This is a scalar function of a vector (many variables); how do we minimize it?

Extremum: Calculus Reminder

• 1st derivative test: f'(x) = 0.

• 2nd derivative test: f''(x) ≶ 0.

• saddle points.

42 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


In general, to minimize a scalar function E of a vector W, we have to find the point w at which the gradient
$$
\nabla E(W) = \frac{\partial E(W)}{\partial W}
= \Big(\frac{\partial E(W)}{\partial W_1}, \ldots, \frac{\partial E(W)}{\partial W_p}\Big)' = \mathbf{0}
$$
vanishes. Then, this is followed by the Hessian matrix test.

Hint: prove, for any matrix A and vector α, that
$$
\nabla\,\alpha'W = \nabla\,W'\alpha = \alpha \quad \text{(linear combination)}, \qquad
\nabla\,W'AW = (A + A')W. \quad \text{(quadratic form)}
$$

43 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Then, back to minimizing RSS(β):
$$
RSS(\beta) = \mathbf{y}'\mathbf{y} - 2\beta'\mathbf{X}'\mathbf{y} + \beta'\mathbf{X}'\mathbf{X}\beta, \qquad
\nabla RSS(\beta) = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\beta \overset{\text{set}}{=} \mathbf{0}, \qquad
\widehat{\beta} = \big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{y}.
$$
For a future observation x0, the prediction ŷ0 is
$$
\widehat{y}_0 = (1, x_0')\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{y},
$$
and the prediction at the training observations is
$$
\widehat{\mathbf{y}} = \mathbf{X}\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{y} = \mathbf{H}\mathbf{y},
$$
where H is called the hat matrix (or the projection matrix). Therefore, the residual error at each observation is
$$
\widehat{\varepsilon} = \widehat{\mathbf{y}} - \mathbf{y}
= \mathbf{X}\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{y} - \mathbf{y}
= \Big(\mathbf{X}\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}' - \mathbf{I}\Big)\mathbf{y}.
$$
Matlab does all of that; look for linear models.
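A minimal Matlab sketch of the normal equations, the hat matrix, and a prediction (my own illustration; in practice one would use the backslash operator or a built-in fitting routine rather than explicit inverses):

% Sketch: least-squares fit, hat matrix, residuals, and a prediction.
N = 50; p = 3;
X1 = randn(N, p);                           % simulated features
bet = [1; 2; -1; 0.5];                      % true coefficients (illustrative)
X   = [ones(N,1) X1];                       % design matrix with intercept
y   = X*bet + 0.3*randn(N,1);

betahat = (X'*X)\(X'*y);                    % beta-hat = (X'X)^{-1} X'y
H       = X/(X'*X)*X';                      % hat matrix H = X (X'X)^{-1} X'
yhat    = H*y;                              % fitted values
ehat    = yhat - y;                         % residuals

x0    = [1, 0.2, -0.4, 1.1];                % a future observation (with the 1)
y0hat = x0*betahat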

44 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


3.2.1 LMS: Geometric Proof

y = Xβ + ε:

• Xβ can be viewed as a linear combination of vectors in the sample space R^N:
$$
\mathbf{X}_{N\times(p+1)}\beta = \big(\mathbf{x}^0\ \ldots\ \mathbf{x}^p\big)\beta = \mathbf{x}^0\beta_0 + \cdots + \mathbf{x}^p\beta_p .
$$
• The columns of X span a sub-space of R^N.

• ŷ lies in that same sub-space.

• We need to minimize (the length of) the error vector ê = y − Xβ.

• Hence ê must be perpendicular to the column space of X. This nice geometry can be translated to math as:
$$
\mathbf{0} = \mathbf{X}'\big(\mathbf{y} - \mathbf{X}\widehat{\beta}\big)
\quad\Longrightarrow\quad
\widehat{\beta} = \big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{y}.
$$

45 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


3.2.2 LMS: Centered Form
Summary first (non-centered form):
$$
y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i
= (1\ \ x_i')\begin{pmatrix}\beta_0 \\ \beta_{1\sim p}\end{pmatrix} + \varepsilon_i, \quad i = 1, \ldots, N,
\qquad
\mathbf{y} = (\mathbf{1},\ \mathbf{X}_1)\begin{pmatrix}\beta_0 \\ \beta_{1\sim p}\end{pmatrix} + \varepsilon = \mathbf{X}\beta + \varepsilon,
$$
$$
\begin{aligned}
RSS = \varepsilon'\varepsilon &= (\mathbf{y}-\mathbf{X}\beta)'(\mathbf{y}-\mathbf{X}\beta)
= \mathbf{y}'\mathbf{y} - 2\beta'\mathbf{X}'\mathbf{y} + \beta'\mathbf{X}'\mathbf{X}\beta \\
&= \mathbf{y}'\mathbf{y} - 2\big(\beta_0\ \ \beta_{1\sim p}'\big)\begin{pmatrix}\mathbf{1}'\mathbf{y} \\ \mathbf{X}_1'\mathbf{y}\end{pmatrix}
+ \big(\beta_0\ \ \beta_{1\sim p}'\big)\begin{pmatrix} N & \mathbf{1}'\mathbf{X}_1 \\ \mathbf{X}_1'\mathbf{1} & \mathbf{X}_1'\mathbf{X}_1 \end{pmatrix}\begin{pmatrix}\beta_0 \\ \beta_{1\sim p}\end{pmatrix}, \\
\nabla RSS &= -2\begin{pmatrix}\mathbf{1}'\mathbf{y} \\ \mathbf{X}_1'\mathbf{y}\end{pmatrix}
+ 2\begin{pmatrix} N & \mathbf{1}'\mathbf{X}_1 \\ \mathbf{X}_1'\mathbf{1} & \mathbf{X}_1'\mathbf{X}_1 \end{pmatrix}\begin{pmatrix}\beta_0 \\ \beta_{1\sim p}\end{pmatrix} \qquad \text{(not separable)}, \\
\begin{pmatrix}\widehat{\beta}_0 \\ \widehat{\beta}_{1\sim p}\end{pmatrix}
&= \begin{pmatrix} N & \mathbf{1}'\mathbf{X}_1 \\ \mathbf{X}_1'\mathbf{1} & \mathbf{X}_1'\mathbf{X}_1 \end{pmatrix}^{-1}\begin{pmatrix}\mathbf{1}'\mathbf{y} \\ \mathbf{X}_1'\mathbf{y}\end{pmatrix}
= \big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{y}.
\end{aligned}
$$
46 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
Geometric motivation for the centered form (p = 1 gives the picture; the algebra holds for any p):
$$
y_i = \alpha + \beta_1(x_{i1} - \overline{x}_1) + \cdots + \beta_p(x_{ip} - \overline{x}_p) + \varepsilon_i
= \alpha + (x_i - \overline{x})'\beta_{1\sim p} + \varepsilon_i,
\qquad
\alpha = \beta_0 + \beta_1\overline{x}_1 + \cdots + \beta_p\overline{x}_p = \beta_0 + \beta_{1\sim p}'\overline{x},
$$
$$
\mathbf{y} = (\mathbf{1},\ \mathbf{X}_c)\begin{pmatrix}\alpha \\ \beta_{1\sim p}\end{pmatrix} + \varepsilon,
\qquad
\mathbf{X}_c = \mathbf{X}_1 - \mathbf{1}\overline{x}' = \mathbf{X}_1 - \tfrac{1}{N}\mathbf{1}\mathbf{1}'\mathbf{X}_1
= \Big(\mathbf{I} - \tfrac{1}{N}\mathbf{J}\Big)\mathbf{X}_1 .
$$
Since $\mathbf{X}_c'\mathbf{1} = \mathbf{0}$, the normal equations decouple:
$$
\begin{aligned}
RSS &= \mathbf{y}'\mathbf{y} - 2\big(\alpha\ \ \beta_{1\sim p}'\big)\begin{pmatrix}\mathbf{1}'\mathbf{y} \\ \mathbf{X}_c'\mathbf{y}\end{pmatrix}
+ \big(\alpha\ \ \beta_{1\sim p}'\big)\begin{pmatrix} N & \mathbf{0}' \\ \mathbf{0} & \mathbf{X}_c'\mathbf{X}_c \end{pmatrix}\begin{pmatrix}\alpha \\ \beta_{1\sim p}\end{pmatrix}, \\
\nabla RSS &= -2\begin{pmatrix}\mathbf{1}'\mathbf{y} \\ \mathbf{X}_c'\mathbf{y}\end{pmatrix}
+ 2\begin{pmatrix} N & \mathbf{0}' \\ \mathbf{0} & \mathbf{X}_c'\mathbf{X}_c \end{pmatrix}\begin{pmatrix}\alpha \\ \beta_{1\sim p}\end{pmatrix}, \qquad (3.1) \\
\begin{pmatrix}\widehat{\alpha} \\ \widehat{\beta}_{1\sim p}\end{pmatrix}
&= \begin{pmatrix} N & \mathbf{0}' \\ \mathbf{0} & \mathbf{X}_c'\mathbf{X}_c \end{pmatrix}^{-1}\begin{pmatrix}\mathbf{1}'\mathbf{y} \\ \mathbf{X}_c'\mathbf{y}\end{pmatrix}
= \begin{pmatrix} \tfrac{1}{N}\mathbf{1}'\mathbf{y} \\ \big(\mathbf{X}_c'\mathbf{X}_c\big)^{-1}\mathbf{X}_c'\mathbf{y} \end{pmatrix}.
\end{aligned}
$$
47 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
Or, simply:
$$
\begin{aligned}
\mathbf{0} &= \begin{pmatrix} \widehat{\alpha}N - \mathbf{1}'\mathbf{y} \\ \mathbf{X}_c'\mathbf{X}_c\widehat{\beta}_{1\sim p} - \mathbf{X}_c'\mathbf{y} \end{pmatrix}, && (3.2\text{a}) \\
\widehat{\alpha} &= \tfrac{1}{N}\mathbf{1}'\mathbf{y} = \overline{y}, && (3.2\text{b}) \\
\widehat{\beta}_{1\sim p} &= \big(\mathbf{X}_c'\mathbf{X}_c\big)^{-1}\mathbf{X}_c'\mathbf{y}
= \left(\frac{\mathbf{X}_c'\mathbf{X}_c}{N-1}\right)^{-1}\frac{\mathbf{X}_c'\mathbf{y}}{N-1}
= \widehat{\Sigma}_{XX}^{-1}\widehat{\Sigma}_{XY}, && (3.2\text{c}) \\
\widehat{y}_0 &= \widehat{\alpha} + (x_0-\overline{x})'\widehat{\beta}_{1\sim p} && (3.2\text{d}) \\
&= \overline{y} + (x_0-\overline{x})'\widehat{\Sigma}_{XX}^{-1}\widehat{\Sigma}_{XY}, && (3.2\text{e}) \\
\widehat{\mathbf{y}} &= \mathbf{1}\widehat{\alpha} + \mathbf{X}_c\big(\mathbf{X}_c'\mathbf{X}_c\big)^{-1}\mathbf{X}_c'\mathbf{y}. && (3.2\text{f})
\end{aligned}
$$
• Eq. (3.2e) is a great surprise; it is the same as Eq. (2.5). LMS coincides with the best regression function, after plugging in the parameter estimates, in the case of a multinormal distribution!

• This is a source of misconception for some people.

48 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Example 7 (non-centered & centered):
$$
\mathbf{y} = \begin{pmatrix} 2\\3\\2\\7\\6\\8\\10\\7\\8\\12\\11\\14 \end{pmatrix}, \quad
\mathbf{X}_1 = \begin{pmatrix} 0&2\\2&6\\2&7\\2&5\\4&9\\4&8\\4&7\\6&10\\6&11\\6&9\\8&15\\8&13 \end{pmatrix}, \quad
\mathbf{X} = (\mathbf{1},\ \mathbf{X}_1) = \begin{pmatrix} 1&0&2\\1&2&6\\1&2&7\\1&2&5\\1&4&9\\1&4&8\\1&4&7\\1&6&10\\1&6&11\\1&6&9\\1&8&15\\1&8&13 \end{pmatrix},
$$
$$
\mathbf{X}'\mathbf{X} = \begin{pmatrix} 12 & 52 & 102 \\ 52 & 296 & 536 \\ 102 & 536 & 1004 \end{pmatrix}, \qquad
\big(\mathbf{X}'\mathbf{X}\big)^{-1} = \begin{pmatrix} 0.97476 & 0.2429 & -0.22871 \\ 0.2429 & 0.16207 & -0.11120 \\ -0.22871 & -0.11120 & 0.083596 \end{pmatrix},
$$
$$
\widehat{\beta} = \big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{y} = \begin{pmatrix} 5.3711 \\ 3.0123 \\ -1.2866 \end{pmatrix}, \qquad
\widehat{y}_0 = (1\ \ x_0')\,\widehat{\beta} = 5.3711 + 3.0123\,x_{01} - 1.2866\,x_{02}.
$$

49 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


$$
\overline{x}' = \tfrac{1}{N}\mathbf{1}'\mathbf{X}_1 = (4.3333\ \ 8.5),
$$
$$
\mathbf{X}_c = \mathbf{X}_1 - \mathbf{1}\overline{x}' = \begin{pmatrix}
-4.3333 & -6.5 \\ -2.3333 & -2.5 \\ -2.3333 & -1.5 \\ -2.3333 & -3.5 \\ -0.3333 & 0.5 \\ -0.3333 & -0.5 \\ -0.3333 & -1.5 \\ 1.6667 & 1.5 \\ 1.6667 & 2.5 \\ 1.6667 & 0.5 \\ 3.6667 & 6.5 \\ 3.6667 & 4.5 \end{pmatrix},
$$
$$
\widehat{\alpha} = \overline{y} = 7.5 = (1\ \ 4.3333\ \ 8.5)\begin{pmatrix} 5.3711 \\ 3.0123 \\ -1.2866 \end{pmatrix}
= \widehat{\beta}_0 + \widehat{\beta}_{1\sim p}'\,\overline{x},
$$
$$
\widehat{\beta}_{1\sim p} = \big(\mathbf{X}_c'\mathbf{X}_c\big)^{-1}\mathbf{X}_c'\mathbf{y}
= \begin{pmatrix} 70.667 & 94.0 \\ 94.0 & 137.0 \end{pmatrix}^{-1}\begin{pmatrix} 92.003 \\ 107.0 \end{pmatrix}
= \begin{pmatrix} 3.0122 \\ -1.2857 \end{pmatrix},
$$
$$
\widehat{y}_0 = \widehat{\alpha} + (x_0 - \overline{x})'\widehat{\beta}_{1\sim p}
= 7.5 + 3.0122\,(x_{01} - 4.3333) - 1.2857\,(x_{02} - 8.5).
$$
50 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
3.3 Performance
3.3.1 Apparent Performance
Also called the training (or example) error; this is what we have minimized:
$$
err = \frac{1}{N}\sum_i\big(\widehat{y}_i - y_i\big)^2 = \frac{1}{N}\widehat{\varepsilon}'\widehat{\varepsilon} = \frac{1}{N}\lVert\widehat{\varepsilon}\rVert^2 .
$$

3.3.2 Conditional Performance

This is what we really want to minimize:
$$
\begin{aligned}
err_{tr} &= \operatorname{E}_0\big(\widehat{y}_0 - y_0\big)^2 \qquad \text{(Risk, MSE, or EPE)} \\
&= \operatorname{E}_0\Big[(1, x_0')\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{y} - y_0\Big]^2
= \operatorname{E}_{x_0}\operatorname{E}_{y_0|x_0}\Big[(1, x_0')\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{y} - y_0\Big]^2
= \operatorname{E}_{x_0} err_{tr}(x_0).
\end{aligned}
$$
This is conditional on the training set tr, which appears in the equation as X and y. In simulation problems, where data comes from a known distribution, we can obtain a very accurate estimate of err_tr using a very large ts as
$$
err_{tr} \cong \frac{1}{n_{ts}}\sum_{i\in ts}\big(\widehat{y}_i - y_i\big)^2 . \qquad \text{(as in Ex. 4)}
$$

51 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


3.3.3 Unconditional Performance
$$
\overline{err} = \operatorname{E}_{tr}\, err_{tr} .
$$
We will see how to estimate this from tr. This also means that there is
$$
\operatorname{Var}_{tr}\, err_{tr},
$$
which expresses how stable the regression function is from one dataset to another. In simulation problems, we do MC simulation to estimate these quantities as well:
$$
\operatorname{E}_{tr}\, err_{tr} \cong \frac{1}{M}\sum_{m=1}^{M} err_{tr_m}, \qquad
\operatorname{Var}_{tr}\, err_{tr} \cong \frac{1}{M-1}\sum_{m=1}^{M}\Big(err_{tr_m} - \frac{1}{M}\sum_{m=1}^{M} err_{tr_m}\Big)^2 .
$$
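A Monte Carlo sketch of these two estimates (my own illustration; the data-generating model is an assumption):

% Sketch: MC estimation of E_tr[err_tr] and Var_tr[err_tr] for LMS.
M = 200; N = 30; nts = 1e4; p = 3; s = 0.5;
bet = [1; 2; -1; 0.5];
errtr = zeros(M,1);

Xts = [ones(nts,1) randn(nts,p)];                 % one large testing set
yts = Xts*bet + s*randn(nts,1);

for m = 1:M
    Xtr = [ones(N,1) randn(N,p)];                 % a new training set tr_m
    ytr = Xtr*bet + s*randn(N,1);
    bh  = (Xtr'*Xtr)\(Xtr'*ytr);                  % LMS fit on tr_m
    errtr(m) = mean((Xts*bh - yts).^2);           % err_{tr_m} on the ts
end
Eerr = mean(errtr), Verr = var(errtr)             % E_tr err_tr, Var_tr err_tr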

52 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Example 8 (prostate cancer) :

“The variables are log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45).”

53 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


• First: plot, understand your dataset, and withdraw some measures before any model building.

• E.g., svi is binary, gleason is ordered categorical; both lcavol and lcp show a strong relationship with the response lpsa, and with each other.

• Data visualization is a stand alone topic and course.

54 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


• “We fit a linear model to the log of prostate-specific antigen, lpsa, after first standardizing the predictors to
have unit variance. We randomly split the dataset into a training set of size 67 and a test set of size 30.”

• “The mean prediction error on the test data is 0.521. In contrast, prediction using the mean training value
of lpsa has a test error of 1.057, which is called the base error rate. Hence the linear model reduces the base
error rate by about 50%. We will return to this example later to compare various selection and shrinkage
methods.”

• The prediction using the mean training value is
$$
\widehat{y}_0 = \widehat{\alpha} = \widehat{\beta}_0 = \frac{1}{N}\sum_{i=1}^{N} y_i, \qquad (\beta_{1\sim p} = \mathbf{0})
$$
i.e., using the sample average of the training set as if you do not have any additional information from X.

• The meaning of $\widehat{\beta}_j$ is this: a unit increase in predictor $X_j$ results in an increase of $\widehat{\beta}_j$ in the predicted response.

55 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


3.4 Data Preprocessing and Transformation

Standardize your predictors, e.g., to unit variance, so that no variable is more dominant than the others (when testing, use of course the inverse scaling). Prove that a linear mapping of the form $a_i X_i + b_i$, $i = 1, 2$, will not deform the correlation between the variables X1, X2. Applying a linear transformation to each variable of a data matrix amounts to moving the center of the scatter plot and then scaling, without preserving the aspect ratio.

One form of linear transformation is mapping each predictor to [L, H] (usually [−1, 1] as in mapminmax in Matlab (but it assumes a p × N matrix, not N × p), or [0, 1] as in parallel coordinates) by:

56 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


$$
X_{new} = (X - X_{\min})\,\frac{H-L}{X_{\max}-X_{\min}} + L
= X\,\frac{H-L}{X_{\max}-X_{\min}} + \frac{L X_{\max} - H X_{\min}}{X_{\max}-X_{\min}} .
$$
To put it in the form $X_{new} = \frac{X-a}{b}$:
$$
b = \frac{X_{\max}-X_{\min}}{H-L}, \qquad
\frac{-a}{b} = \frac{L X_{\max} - H X_{\min}}{X_{\max}-X_{\min}}, \qquad
a = \frac{H X_{\min} - L X_{\max}}{H-L},
$$
$$
X_{new} = \frac{X - \dfrac{H X_{\min} - L X_{\max}}{H-L}}{\dfrac{X_{\max}-X_{\min}}{H-L}} .
$$
Another linear transformation is standardizing to zero sample mean and unit sample variance:
$$
X_{new} = \frac{X - \overline{X}}{\widehat{\sigma}_X}.
$$
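A short Matlab sketch of both mappings applied column-wise to an N × p data matrix (my own illustration):

% Sketch: column-wise min-max mapping to [L,H] and z-score standardization.
X = [1 10; 2 30; 4 20; 3 40];                 % N x p example data
L = -1; H = 1;  N = size(X,1);

Xmin = min(X);  Xmax = max(X);                % 1 x p row vectors
Xnew = (X - repmat(Xmin,[N 1])).*repmat((H-L)./(Xmax-Xmin),[N 1]) + L;

Xstd = (X - repmat(mean(X),[N 1]))./repmat(std(X),[N 1]);  % zero mean, unit var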

57 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Any shifting can be expressed as:
$$
\mathbf{X}^{shifted}
= \begin{pmatrix} 1 & x_{11} & \ldots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \ldots & x_{Np} \end{pmatrix}
- \begin{pmatrix} 0 & c_1 & \ldots & c_p \\ \vdots & \vdots & & \vdots \\ 0 & c_1 & \ldots & c_p \end{pmatrix}
= \mathbf{X} - \mathbf{1}\mathbf{c}' \qquad (\texttt{X - repmat(c', [N,1])})
$$
$$
= \mathbf{X} - \mathbf{X}\underbrace{\begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}}_{\mathbf{e}_1}\mathbf{c}'
= \mathbf{X}\left(\mathbf{I} - \begin{pmatrix} 0 & c_1 & \ldots & c_p \\ 0 & 0 & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & 0 \end{pmatrix}\right)
= \mathbf{X}\begin{pmatrix} 1 & -c_1 & \ldots & -c_p \\ 0 & 1 & & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \ldots & 1 \end{pmatrix}
= \mathbf{X}\mathbf{T}.
$$
A special case of shifting is centering with the sample mean:
$$
\mathbf{c}' = \Big(0,\ \tfrac{1}{N}\mathbf{1}'\mathbf{X}_1\Big). \qquad (\texttt{mean(X1)})
$$

58 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Also, any linear transformation (without shifting) can be expressed as
$$
\mathbf{X}^{scaled} = (\mathbf{1},\ \mathbf{X}_1\mathbf{S}_1)
= (\mathbf{1},\ \mathbf{X}_1)\begin{pmatrix} 1 & \mathbf{0}' \\ \mathbf{0} & \mathbf{S}_1 \end{pmatrix}
= \mathbf{X}\mathbf{S}.
$$
A special case of this transformation is when we only scale each feature (what is usually used); i.e.,
$$
\mathbf{S}_1 = \begin{pmatrix} d_1 & \ldots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \ldots & d_p \end{pmatrix} = \operatorname{diag}(d_1, \ldots, d_p).
$$
Therefore, a general linear transformation of the features, including shifting and scaling, is given by
$$
\mathbf{X}^{trans} = \mathbf{X}\mathbf{T}\mathbf{S} = \mathbf{X}\mathbf{H},
$$
where we do, first, shifting T, followed by scaling S.

59 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


How does transformation affect prediction in LM?
$$
\widehat{y}_0 = x_0'\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{y}.
$$
With transformation, we replace X by XH (and x0' by x0'H). Therefore,
$$
\begin{aligned}
\widehat{\beta}^{trans} &= \big[(\mathbf{X}\mathbf{H})'(\mathbf{X}\mathbf{H})\big]^{-1}(\mathbf{X}\mathbf{H})'\mathbf{y}
= \big[\mathbf{H}'\mathbf{X}'\mathbf{X}\mathbf{H}\big]^{-1}\mathbf{H}'\mathbf{X}'\mathbf{y}
= \mathbf{H}^{-1}\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{H}'^{-1}\mathbf{H}'\mathbf{X}'\mathbf{y}
= \mathbf{H}^{-1}\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{y}, \\
\widehat{y}_0^{trans} &= (x_0'\mathbf{H})\,\widehat{\beta}^{trans}
= x_0'\mathbf{H}\mathbf{H}^{-1}\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{y}
= \widehat{y}_0 \qquad \text{(same as without transformation).}
\end{aligned}
$$

60 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


3.5 Bias-Variance Decomposition
For any regression function f(X), its conditional risk was given by
$$
\operatorname{E}\big[Y - f(x)\big]^2 = \sigma^2_{Y|X} + \big(\operatorname{E}[Y|X=x] - f(x)\big)^2 ,
$$
which we minimized by choosing f(X) = E[Y|X = x]. Now, for any learned function $f_{tr}(x_0) = \widehat{y}_0$, the (conditional) risk is given by (look at Sec. 3.3)
$$
err_{tr}(x_0) = \sigma^2_{Y|X} + \big(\operatorname{E}[y_0|x_0] - \widehat{y}_0\big)^2, \qquad
err_{tr} = \underbrace{\operatorname{E}_{x_0}\sigma^2_{Y|X=x_0}}_{\text{minimum risk (irreducible)}} + \operatorname{E}_{x_0}\big(\widehat{y}_0 - \operatorname{E}_{y_0|x_0} y_0\big)^2 ,
$$
$$
\begin{aligned}
\operatorname{E}_{tr}\, err_{tr}
&= \operatorname{E}_{x_0}\sigma^2_{Y|X=x_0} + \operatorname{E}_{x_0}\operatorname{E}_{tr}\big(\widehat{y}_0 - \operatorname{E}_{y_0|x_0} y_0\big)^2 \\
&= \operatorname{E}_{x_0}\sigma^2_{Y|X=x_0} + \operatorname{E}_{x_0}\operatorname{E}_{tr}\Big(\big(\widehat{y}_0 - \operatorname{E}_{tr}\widehat{y}_0\big) + \big(\operatorname{E}_{tr}\widehat{y}_0 - \operatorname{E}_{y_0|x_0} y_0\big)\Big)^2 \\
&= \operatorname{E}_{x_0}\sigma^2_{Y|X=x_0}
+ \operatorname{E}_{x_0}\Big[\operatorname{E}_{tr}\big(\widehat{y}_0 - \operatorname{E}_{tr}\widehat{y}_0\big)^2 + \big(\operatorname{E}_{tr}\widehat{y}_0 - \operatorname{E}_{y_0|x_0} y_0\big)^2 + \underbrace{\text{cov}}_{0}\Big] \\
&= \operatorname{E}_{x_0}\sigma^2_{Y|X=x_0} + \operatorname{E}_{x_0}\Big[\operatorname{Var}_{tr}\,\widehat{y}_0 + \operatorname{Bias}^2_{tr}\,\widehat{y}_0\Big].
\end{aligned}
$$
In particular for linear models, it can be shown that if the data follow exactly a full linear model, then using the right number of features or more will result in an unbiased model.
61 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
3.5.1 Bias for Underfitting
Assume the right model is
$$
\operatorname{E}[y|X] = X'\beta = (X_1'\ \ X_2')\begin{pmatrix}\beta_1 \\ \beta_2\end{pmatrix} = X_1'\beta_1 + X_2'\beta_2,
$$
and we used the reduced model (underfitting) $\operatorname{E}[y|X_1] = X_1'\beta_1^*$:
$$
\begin{aligned}
\operatorname{E}_{tr}\,\widehat{y}_0
&= \operatorname{E}_{tr}\, x_{01}'\big(\mathbf{X}_1'\mathbf{X}_1\big)^{-1}\mathbf{X}_1'\mathbf{y}
= x_{01}'\operatorname{E}_{\mathbf{X}}\big(\mathbf{X}_1'\mathbf{X}_1\big)^{-1}\mathbf{X}_1'\operatorname{E}_{\mathbf{y}|\mathbf{X}}\mathbf{y} \\
&= x_{01}'\operatorname{E}_{\mathbf{X}}\Big[\big(\mathbf{X}_1'\mathbf{X}_1\big)^{-1}\mathbf{X}_1'\big(\mathbf{X}_1\beta_1 + \mathbf{X}_2\beta_2\big)\Big]
= x_{01}'\beta_1 + x_{01}'\operatorname{E}_{\mathbf{X}}\big(\mathbf{X}_1'\mathbf{X}_1\big)^{-1}\mathbf{X}_1'\mathbf{X}_2\beta_2 \\
&\neq x_{01}'\beta_1 + x_{02}'\beta_2 = \operatorname{E}[y_0|x_0].
\end{aligned}
$$
Therefore, there is a bias under underfitting. When using the right model, i.e., if β2 = 0, the bias is zero.
62 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
3.5.2 Bias for Right Model and Overfitting
Suppose the true model is
$$
\operatorname{E}[y|X] = X_1'\beta_1 = X_1'\beta_1 + X_2'\underbrace{\beta_2}_{=0},
$$
and we use the overfitting model $\operatorname{E}[y|X] = X'\beta = X_1'\beta_1^* + X_2'\beta_2^*$:
$$
\begin{aligned}
\operatorname{E}_{tr}\,\widehat{y}_0
&= \operatorname{E}_{tr}\, x_0'\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{y}
= \operatorname{E}_{\mathbf{X}}\, x_0'\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\operatorname{E}_{\mathbf{y}|\mathbf{X}}\mathbf{y}
= \operatorname{E}_{\mathbf{X}}\, x_0'\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{X}_1\beta_1 \\
&= \operatorname{E}_{\mathbf{X}}\, x_0'\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\Big(\mathbf{X}_1\beta_1 + \mathbf{X}_2\underbrace{\beta_2}_{=0}\Big)
= \operatorname{E}_{\mathbf{X}}\, x_0'\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{X}\beta \\
&= x_0'\beta = x_{01}'\beta_1 + x_{02}'\underbrace{\beta_2}_{=0} = x_{01}'\beta_1 = \operatorname{E}[y_0|x_0].
\end{aligned}
$$
Therefore the overfitted model is unbiased.


63 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
3.5.3 Variance
$$
\begin{aligned}
\operatorname{Var}_{tr}\,\widehat{y}_0
&= \operatorname{E}_{\mathbf{X}}\operatorname{Var}_{\mathbf{y}|\mathbf{X}}\, x_0'\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{y}
 + \operatorname{Var}_{\mathbf{X}}\operatorname{E}_{\mathbf{y}|\mathbf{X}}\, x_0'\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{y} \\
&= \operatorname{E}_{\mathbf{X}}\Big[x_0'\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\big(\sigma^2_{Y|X}\mathbf{I}_{N\times N}\big)\mathbf{X}\big(\mathbf{X}'\mathbf{X}\big)^{-1}x_0\Big]
 + \operatorname{Var}_{\mathbf{X}}\Big[x_0'\big(\mathbf{X}'\mathbf{X}\big)^{-1}\mathbf{X}'\mathbf{X}\beta\Big] \\
&= \operatorname{E}_{\mathbf{X}}\,\sigma^2_{Y|X}\, x_0'\big(\mathbf{X}'\mathbf{X}\big)^{-1}x_0 + \underbrace{\operatorname{Var}_{\mathbf{X}}\, x_0'\beta}_{0}
= x_0'\,\operatorname{E}_{\mathbf{X}}\Big[\sigma^2_{Y|X}\big(\mathbf{X}'\mathbf{X}\big)^{-1}\Big]\, x_0 .
\end{aligned}
$$
To simplify things, let us assume for a moment that $\sigma^2_{Y|X} = \sigma^2$:
$$
\operatorname{Var}_{tr}\,\widehat{y}_0 = \sigma^2\, x_0'\,\operatorname{E}_{\mathbf{X}}\big[(\mathbf{X}'\mathbf{X})^{-1}\big]\, x_0 .
$$
To simplify more, assume that $\operatorname{E}_X X = 0$ (so X is centered, which is not a big deal); therefore
$$
\operatorname{E}_{\mathbf{X}}\Big(\frac{1}{N-1}\mathbf{X}'\mathbf{X}\Big) = \Sigma_X .
$$
We can make the ad-hoc approximation (just to get the picture)
$$
\operatorname{E}_{\mathbf{X}}\big(\mathbf{X}'\mathbf{X}\big)^{-1} \approx \frac{1}{N}\Sigma_X^{-1} .
$$

64 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Therefore,
$$
\begin{aligned}
\operatorname{Var}_{tr}\,\widehat{y}_0 &= \frac{\sigma^2}{N}\, x_0'\Sigma_X^{-1}x_0, \\
\operatorname{E}_{x_0}\operatorname{Var}_{tr}\,\widehat{y}_0
&= \frac{\sigma^2}{N}\operatorname{E}_{x_0}\, x_0'\Sigma_X^{-1}x_0
= \frac{\sigma^2}{N}\operatorname{E}_{x_0}\operatorname{trace}\big[x_0 x_0'\Sigma_X^{-1}\big]
= \frac{\sigma^2}{N}\operatorname{trace}\big[\operatorname{E}_{x_0}(x_0 x_0')\,\Sigma_X^{-1}\big] \\
&= \frac{\sigma^2}{N}\operatorname{trace}\big[\Sigma_X\Sigma_X^{-1}\big]
= \frac{\sigma^2}{N}\operatorname{trace}\,\mathbf{I}_{p\times p}
= \sigma^2\,\frac{p}{N}.
\end{aligned}
$$
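A quick Monte Carlo check of the σ²p/N picture (my own sketch; the setup, with Σ_X = I, is an assumption):

% Sketch: checking Var_tr(y0hat) ~ sigma^2 * x0' SigmaX^{-1} x0 / N.
M = 2000; N = 50; p = 5; sigma = 1;
bet = randn(p,1);  x0 = randn(1,p);           % one fixed future point
y0hat = zeros(M,1);

for m = 1:M
    X = randn(N,p);                           % centered features, SigmaX = I
    y = X*bet + sigma*randn(N,1);
    y0hat(m) = x0*((X'*X)\(X'*y));
end
[var(y0hat)  sigma^2*(x0*x0')/N  sigma^2*p/N] % MC, pointwise, and average over x0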

65 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


3.5.4 Bias-Variance: illustration, model complexity, and model selection

V. Imp

66 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


• Performance (p. 51): apparent (training) error err, conditional error err_tr, mean error $\overline{err}$:
$$
err = \frac{1}{n_{tr}}\sum_{i\in tr}\big(\widehat{y}_i - y_i\big)^2,
$$
$$
err_{tr} = \operatorname{E}_{x_0,y_0}\big(\widehat{y}_0 - y_0\big)^2
= \operatorname{E}_{x_0}\operatorname{E}_{y_0|x_0}\big(\widehat{y}_0 - y_0\big)^2
\cong \frac{1}{n_{ts}}\sum_{i\in ts}\big(\widehat{y}_i - y_i\big)^2
= \operatorname{E}_{x_0}\Big[\sigma^2_{Y|X=x_0} + \big(\widehat{y}_0 - \operatorname{E}_{y_0|x_0}y_0\big)^2\Big],
$$
$$
\overline{err} = \operatorname{E}_{tr}\, err_{tr} \cong \frac{1}{M}\sum_{m=1}^{M} err_{tr_m}
= \operatorname{E}_{x_0}\Big[\sigma^2_{Y|X=x_0} + \operatorname{E}_{tr}\big(\widehat{y}_0 - \operatorname{E}_{tr}\widehat{y}_0\big)^2 + \big(\operatorname{E}_{tr}\widehat{y}_0 - \operatorname{E}_{y_0|x_0}y_0\big)^2\Big].
$$
• Why $\overline{err}$ and not err_tr? Learning-function stability; estimators are good for $\overline{err}$, not err_tr.

• These concepts are transferable to models other than LM.

67 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Apparent (training) error err, conditional error err_tr, and mean error $\overline{err}$: the same definitions as on the previous page accompany the illustration on this slide (the simulated example referred to later as “the Example on p. 68”).

68 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Notes:
• All methods have complexity parameter(s) to tune (p in LM). The right complexity is chosen via Cross-Validation (next chapters).

• Overfitting is increasing the model complexity to adapt too much to the training dataset. It causes the behavior of the single err_tr curve, NOT the bias-variance decomposition. Rather, the bias-variance decomposition and the behavior of $\overline{err}$ are caused as well by the same effect of increasing complexity.

• But with increasing training set size we get better performance; so it depends on the ratio p/N.

• HW: prove that $\operatorname{E}_{tr}\, err_{tr} \neq$ the error of $\operatorname{E}_{tr}\,\widehat{y}_0$.

• err must decrease with complexity, and Var_tr err is “in general” decreasing; why? However, err_tr “in general” has a minimum, and Var_tr err_tr is “in general” increasing.

69 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


3.6 Reducing Complexity by Subset Selection
There are two reasons why we are often not satisfied with the least squares estimates:

• “The first is prediction accuracy: the least squares estimates often have low bias but large variance.
Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By
doing so we sacrifice a little bit of bias to reduce the variance of the predicted values, and hence may
improve the overall prediction accuracy.”

• “The second reason is interpretation. With a large number of predictors, we often would like to deter-
mine a smaller subset that exhibit the strongest effects. In order to get the “big picture”, we are willing
to sacrifice some of the small details.”

Common methods for subset selection:

• Best subset selection.

• Forward- and Backward-Stepwise regression.

• Forward-Stagewise regression.

70 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Example 9 (Best subset selection (prostate cancer)) .

• Best subset regression finds, for each k ∈ {0, 1, 2, ..., p}, the subset of size k that gives the smallest RSS.

• The best subset of size 2, e.g., need not include the variable that was in the best subset of size 1. The red lower boundary is nevertheless necessarily decreasing.

• The question of how to choose k involves the tradeoff between bias and variance, usually done by CV.

71 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


3.7 Reducing Complexity by Regularization (Shrinkage Methods)
“By retaining a subset of the predictors and discarding the rest, subset selection produces a model that is interpretable and has possibly lower prediction error than the full model. However, because it is a discrete process (variables are either retained or discarded) it often exhibits high variance, and so does not reduce the prediction error of the full model. Shrinkage methods are more continuous, and do not suffer as much from high variability.”

However, another very important factor is time complexity. Subset selection is very time consuming relative to regularization; it can even be intractable in many real-life problems with high dimensions.

• Ridge regression

• Lasso regression (Tibshirani)

• Least Angle Regression (LAR) (Efron)

Hint: Great picture can be seen by studying the connection of these methods with each other; deferred to
the advanced course!

72 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


3.7.1 Ridge Regression
$$
\widehat{\beta}^{ridge} = \arg\min_{\beta}\Bigg\{
\underbrace{\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^2}_{RSS}
+ \underbrace{\lambda\sum_{j=1}^{p}\beta_j^2}_{\text{penalty term}}\Bigg\} \qquad (3.3)
$$
(the whole bracket is the new loss to be minimized).

• “Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the greater the amount of shrinkage. The coefficients are shrunk toward zero (and each other). The idea of penalizing by the sum-of-squares of the parameters is also used in neural networks, where it is known as weight decay (Ch. 11).”

• “The ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizes the inputs before solving (3.3).”

• “In addition, notice that the intercept β0 has been left out of the penalty term. Penalization of the intercept would make the procedure depend on the origin chosen for Y; that is, adding a constant c to each of the targets yi would not simply result in a shift of the predictions by the same amount c.”

• An equivalent way to write (3.3) is:
$$
\widehat{\beta}^{ridge} = \arg\min_{\beta}\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le t. \qquad (3.4)
$$
73 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
Now, minimize the loss in (3.3) as done in the centered LM (it could be just LM as we will see, but the centered LM gives us great insight):
$$
RSS^{ridge} = RSS + \lambda\sum_{j=1}^{p}\beta_j^2, \qquad
\nabla RSS^{ridge} = \nabla RSS + \nabla\Big(\lambda\sum_{j=1}^{p}\beta_j^2\Big).
$$
Substituting from (3.1) and setting $\nabla RSS^{ridge} = \mathbf{0}$, we get
$$
\begin{aligned}
\mathbf{0} &= -2\begin{pmatrix}\mathbf{1}'\mathbf{y} \\ \mathbf{X}_c'\mathbf{y}\end{pmatrix}
+ 2\begin{pmatrix} N & \mathbf{0}' \\ \mathbf{0} & \mathbf{X}_c'\mathbf{X}_c \end{pmatrix}\begin{pmatrix}\alpha \\ \beta_{1\sim p}\end{pmatrix}
+ 2\lambda\begin{pmatrix}0 \\ \beta_{1\sim p}\end{pmatrix}
= \begin{pmatrix} N\alpha - \mathbf{1}'\mathbf{y} \\ \big(\mathbf{X}_c'\mathbf{X}_c + \lambda\mathbf{I}\big)\beta_{1\sim p} - \mathbf{X}_c'\mathbf{y}\end{pmatrix}, \\
\widehat{\alpha} &= \tfrac{1}{N}\mathbf{1}'\mathbf{y} = \overline{y}, \qquad
\widehat{\beta}^{ridge}_{1\sim p} = \big(\mathbf{X}_c'\mathbf{X}_c + \lambda\mathbf{I}\big)^{-1}\mathbf{X}_c'\mathbf{y}.
\end{aligned}
$$
• We keep varying λ and choose the value that minimizes the total estimated error.

• Notice that, when λ = 0, this is the case of the ordinary LM obtained in (3.2).

• Notice also that we had to use the centered model to find a closed-form solution. However, in Neural Networks, we will use the same regularization formula but find β0 numerically. We will not need to center the data, but of course we need to standardize, e.g., to [−1, 1], as mentioned above.

• Matlab does ridge regression: ridge. Python (scikit-learn): Ridge.
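A minimal Matlab sketch of this closed-form ridge solution (my own illustration; in practice Matlab's ridge or scikit-learn's Ridge can be used, each with its own scaling conventions):

% Sketch: ridge regression on centered predictors for a grid of lambda.
N = 60; p = 5;
X1 = randn(N,p);  bet = [3; -2; 0; 0; 1];
y  = 2 + X1*bet + randn(N,1);

Xc    = X1 - repmat(mean(X1),[N 1]);          % centered predictors
alpha = mean(y);                              % intercept estimate = ybar

lambdas = [0 0.1 1 10 100];
B = zeros(p, numel(lambdas));
for k = 1:numel(lambdas)
    B(:,k) = (Xc'*Xc + lambdas(k)*eye(p))\(Xc'*y);   % ridge coefficients
end
disp(B)      % columns shrink toward zero as lambda grows; lambda = 0 is LMS
% prediction at x0: yhat0 = alpha + (x0 - mean(X1))*B(:,k)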

74 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Example 10 (prostate cancer using regularization).

• Imagine we plot vs. 1/λ, NOT λ. This is to have the same trend as with complexity p.

• In the book, they plot vs. df(λ), which involves understanding PCA; hence, deferred:
$$
\widehat{\mathbf{y}} = \mathbf{1}\widehat{\alpha} + \mathbf{X}_c\widehat{\beta}^{ridge}
= \mathbf{1}\widehat{\alpha} + \mathbf{X}_c\big(\mathbf{X}_c'\mathbf{X}_c + \lambda\mathbf{I}\big)^{-1}\mathbf{X}_c'\mathbf{y}, \qquad (3.5, 3.6)
$$
$$
\text{df}(\lambda) = \operatorname{tr}\Big[\mathbf{X}_c\big(\mathbf{X}_c'\mathbf{X}_c + \lambda\mathbf{I}\big)^{-1}\mathbf{X}_c'\Big]. \qquad (3.7)
$$
• After PCA, it can be shown that df(λ) decreases in λ, with df(0) = p and df(∞) = 0.

75 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Hint: We could have proceeded without the centered model as:
$$
\begin{aligned}
\mathbf{0} &= -\mathbf{X}'\mathbf{y}
+ \begin{pmatrix} N & \mathbf{1}'\mathbf{X}_1 \\ \mathbf{X}_1'\mathbf{1} & \mathbf{X}_1'\mathbf{X}_1 \end{pmatrix}\begin{pmatrix}\beta_0 \\ \beta_{1\sim p}\end{pmatrix}
+ \lambda\begin{pmatrix}0 \\ \beta_{1\sim p}\end{pmatrix}
= -\mathbf{X}'\mathbf{y}
+ \begin{pmatrix} N & \mathbf{1}'\mathbf{X}_1 \\ \mathbf{X}_1'\mathbf{1} & \mathbf{X}_1'\mathbf{X}_1 + \lambda\mathbf{I} \end{pmatrix}\begin{pmatrix}\beta_0 \\ \beta_{1\sim p}\end{pmatrix}, \\
\widehat{\beta} &= \begin{pmatrix} N & \mathbf{1}'\mathbf{X}_1 \\ \mathbf{X}_1'\mathbf{1} & \mathbf{X}_1'\mathbf{X}_1 + \lambda\mathbf{I} \end{pmatrix}^{-1}\mathbf{X}'\mathbf{y}.
\end{aligned}
$$
However: we get big insight about the effect of λ from the centered model => the df concept.

76 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Example 11 (Regularization for the Example on p. 68): The model complexity (p + 1) is set to 25.

77 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Chapter 4

Linear Models for Classification

78 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Introduction
• Recall regression, where we assumed that the regression function is linear: Y = β'X + ε. Similarly, by “linear methods of classification” we mean those methods that produce a “linear decision boundary” between pairs of classes $G_i$, $G_j$:
$$
S_{ij}(X) = \{x \mid \beta_{ij}'X = 0\}.
$$
Hint: this does not mean that there has to be a boundary between every two classes!

• The problem is reduced to estimating the βs, as it was in LM.

• Notice: nonlinear decision boundaries (surfaces) in the original feature space are linear in an expanded feature space. But how to find the right expansion/transformation? For example, in the case of p = 2, we can do:
$$
\begin{aligned}
T &: \mathbb{R}^2 \mapsto \mathbb{R}^5, \qquad
X^{new} = (X_1^{new}, X_2^{new}, X_3^{new}, X_4^{new}, X_5^{new}) = T(X) = (X_1, X_2, X_1X_2, X_1^2, X_2^2), \\
h(X) &= X_1 + X_2 + X_1X_2 + X_1^2 + X_2^2 = \sum_{i=1}^{5}X_i^{new}
= h^{new}(X^{new}) = h^{new}(T(X)) = (h^{new}\circ T)(X), \qquad h = h^{new}\circ T.
\end{aligned}
$$
• First, let's revisit the best decision boundaries (Sec. 2.2.2) and see a very simple 3-class problem with unequal costs!

79 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Example 12 (3-class Multinormal): Notebook
$$
f_i \sim N(\mu_i, \Sigma_i), \quad \Sigma_i = \mathbf{I}, \quad \pi_i = \tfrac{1}{3}, \quad p = 2, \quad i = 1, 2, 3,
$$
$$
\mu_1 = (-a, 0), \quad \mu_2 = (a, 0), \quad \mu_3 = (0, \sqrt{3}a) \quad \text{(symmetry?)},
\qquad
L_{ii} = 0, \quad L_{12} = L_{21} = L_{23} = L_{32} = 1, \quad L_{13} = L_{31} = L.
$$
$$
\widehat{G}(x) = \arg\min_{g\in\mathcal{G}}\sum_{k=1}^{K} L(G_k, g)\Pr(G_k|X=x),
$$
$$
\begin{aligned}
f(X)\,\lambda(G_1) &= L_{11}\pi_1 f_1(X) + L_{21}\pi_2 f_2(X) + L_{31}\pi_3 f_3(X), \\
f(X)\,\lambda(G_2) &= L_{12}\pi_1 f_1(X) + L_{22}\pi_2 f_2(X) + L_{32}\pi_3 f_3(X), \\
f(X)\,\lambda(G_3) &= L_{13}\pi_1 f_1(X) + L_{23}\pi_2 f_2(X) + L_{33}\pi_3 f_3(X), \qquad
f_i(X) = \frac{1}{2\pi}e^{-\frac{1}{2}(x-\mu_i)'(x-\mu_i)}.
\end{aligned}
$$
We will solve only for S12; the rest is HW.

First: consider the fully symmetric case L = 1.
Step 1: $D_{12} = \{x \mid \lambda(G_1) = \lambda(G_2)\}$:
$$
0 = \big(f_2(X) + Lf_3(X)\big) - \big(f_1(X) + f_3(X)\big) = f_2(X) - f_1(X)
= \frac{1}{2\pi}e^{-\frac{1}{2}\left((x_1-a)^2+x_2^2\right)} - \frac{1}{2\pi}e^{-\frac{1}{2}\left((x_1+a)^2+x_2^2\right)}
\;\Rightarrow\; x_1 = 0 .
$$
Step 2: $R_{12} = \{x \mid (\lambda(G_1) = \lambda(G_2))|_{D_{12}} < \lambda(G_3)|_{D_{12}}\}$:
$$
f_2(X) + f_3(X) < f_1(X) + f_2(X) \quad (|_{D_{12}})
\;\equiv\;
e^{-\frac{1}{2}\left(x_1^2+(x_2-\sqrt{3}a)^2\right)} < e^{-\frac{1}{2}\left((x_1-a)^2+x_2^2\right)} \quad (|_{x_1=0})
\;\Rightarrow\; x_2 < a/\sqrt{3} .
$$
Step 3: $S_{12} = D_{12} \cap R_{12} = \{x \mid x_1 = 0\ \text{and}\ x_2 < a/\sqrt{3}\}$.

Second, general L: if S12 exists it will be:
$$
\begin{aligned}
0 &= e^{-\frac{1}{2}\left((x_1-a)^2+x_2^2\right)} - e^{-\frac{1}{2}\left((x_1+a)^2+x_2^2\right)} + (L-1)e^{-\frac{1}{2}\left(x_1^2+(x_2-\sqrt{3}a)^2\right)} \\
0 &= e^{-\frac{1}{2}\left(x_1^2+x_2^2+a^2\right)}\Big(e^{ax_1} - e^{-ax_1} + (L-1)e^{\sqrt{3}ax_2 - a^2}\Big) \\
x_2 &= \frac{1}{\sqrt{3}}(a - x_1) - \frac{1}{\sqrt{3}a}\log\!\left(\frac{1-L}{e^{2ax_1}-1}\right).
\end{aligned}
$$
80 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
@@@ This page still needs elaboration from Bishop P.180 and connection to Hastie and Tibs.

• Linear decision boundaries can be established in two different paths:


1. Modeling “discriminant functions” (1 − λ):

1 − λ(Gi ) = δi (x) = βi′ X, i = 1, · · · ,K, X = (1, · · · )′ .


      Ĝ(x) = arg min_i λ(Gi) = arg max_i δi(x),
      Sij = { x | βi′X − βj′X = 0 } ∩ Rij.

Here come: regression of indicators, LDA, QDA, RDA, LR, etc.


2. Modeling decision boundaries hij (X ) directly as a hyperplane; i.e., a normal
vector r and a cut point x0 .
Here come: perceptron, optimal separating hyperplane, SVM, etc.

• Same concepts studied in LM apply here: bias-variance trade-off, regularization, etc.

• Which method is better? Same question and always same answer!

81 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.1 Discriminant Functions

Theorem 13 General linear discriminant (score) functions, with any monotone transformation T:

      δi(x) = T(βi′X),  i = 1, ··· , K,
      Ĝ(x) = arg max_i δi(x),

where, in general, the intercept is absorbed as the first component of the feature vector X = (1, ···)′, have the following properties:

1. all decision surfaces are linear.
2. all decision regions are singly connected and convex.

Proof.

1. If a decision surface Sij exists, then Sij = { x | δi(x) = δj(x) }. Then,

      T(βi′X) = T(βj′X)  ≡  βi′X = βj′X       (monotonicity)
                         ≡  (βi′ − βj′)X = 0.  (linear)

2. Consider the decision region Rk and the two points xa, xb ∈ Rk, then consider the generic point xc on the line segment between these two points:

      xc = λxa + (1 − λ)xb
      δk(xc) = βk′xc
             = λβk′xa + (1 − λ)βk′xb
             = λδk(xa) + (1 − λ)δk(xb)
             > λδi(xa) + (1 − λ)δi(xb)   ∀i     (xa, xb ∈ Rk)
             = λβi′xa + (1 − λ)βi′xb
             = βi′xc
             = δi(xc).

Hence, xc ∈ Rk, which proves convexity and single-connectedness in one step and completes the proof.
82 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.1.1 Linear Regression of an Indicator Matrix

First, we introduce multiple-output regression.

Theorem 14 (Multiple Output Regression) : The model Y = X′B + ε, i.e., for K outputs, k = 1, ..., K:

      Yk = β_{0k} + X1 β_{1k} + ... + Xp β_{pk} + εk
         = β_{0k} + Σ_{j=1}^{p} Xj β_{jk} + εk
         = fk(X) + εk,
      (Y1 ··· YK) = (X′B1 ··· X′BK) + (ε1 ε2 ··· εK) = X′B + (ε1 ε2 ··· εK),

and for N observations

      Y_{N×K} = X_{N×(p+1)} B_{(p+1)×K} + E_{N×K},

has an LMS solution that is equivalent to the LMS solution for each individual output problem (the solutions decouple):

      B̂k = (X′X)^{-1} X′Yk,
      B̂ = (X′X)^{-1} X′Y.

Proof.

      RSS(B) = Σ_{k=1}^{K} Σ_{i=1}^{N} ( y_{ik} − fk(xi) )²               (4.1)
             = Σ_{k=1}^{K} Ek′Ek                                          (4.2)
             = tr E′E                                                     (4.3)
             = tr[ (Y − XB)′(Y − XB) ].                                   (4.4)

But the sum (4.2) is minimized if each term Ek′Ek is minimized separately, since each term is nonnegative. Hence, the problem is reduced to K individual LMS problems. The proof is complete.

83 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Lemma 15 (Properties of Regression of Indicators) : Formalize the classification problem as a multiple-output regression problem of the indicator discriminant functions δk(x) = I_{G(x)=k}:

      δk(x) = X′Bk + εk,
      Y = X′B + E,
      B̂ = (X′X)^{-1} X′Y,
      ŷ_{1×K} = x′_{1×(p+1)} B̂_{(p+1)×K},
      Ĝ(x) = arg max_k ŷk,

where each row of the response matrix Y_{N×K} has a single 1, i.e., Y_{N×K} 1_{K×1} = 1_{N×1}. This formulation has the following properties:

1. The expectation of the response is the posterior probability.

2. For any observation xi, Σ_k ŷk(xi) = 1, as is Σ_k yk(xi).

Proof. Trivial.
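A minimal Matlab sketch of Theorem 14 and Lemma 15 in action (the simulated 3-class Gaussian data and all variable names are illustrative; mvnrnd needs the Statistics Toolbox):

% Minimal sketch: linear regression on an indicator response matrix.
K = 3; n = 50; p = 2;
mu = [0 0; 3 0; 0 3];                        % three illustrative class means
X = []; g = [];
for k = 1:K
    X = [X; mvnrnd(mu(k,:), eye(p), n)];     % Statistics Toolbox
    g = [g; k*ones(n,1)];
end
N  = size(X,1);
Xd = [ones(N,1) X];                          % absorb the intercept
Y  = zeros(N,K); Y(sub2ind([N K], (1:N)', g)) = 1;   % indicator response, one 1 per row
B  = (Xd'*Xd) \ (Xd'*Y);                     % Bhat = (X'X)^{-1} X'Y; the columns decouple
Yhat = Xd*B;
[~, ghat] = max(Yhat, [], 2);                % Ghat(x) = arg max_k yhat_k
fprintf('training error = %.3f\n', mean(ghat ~= g));
fprintf('max |row sum of Yhat - 1| = %.2e\n', max(abs(sum(Yhat,2) - 1)));   % property 2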

84 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Example 16 (Indicator regression in original and expanded space) :
(X1, X2) ↦ (X1, X2, X1X2, X1², X2²)

85 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.1.1.1 Multicollinearity Problem

Effect on regression (“support” argument); projecting on largest PC.


Effect on regression (separation argument); projecting on smallest PC (Q and O image example from Duda
and Hart.)

86 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.1.2 LDA: revisiting, emphasizing, and more insight

      [ Σ^{-1}(µ2 − µ1) ]′ · ( x − [ ½(µ2 + µ1) − ( th / ((µ2 − µ1)′Σ^{-1}(µ2 − µ1)) ) (µ2 − µ1) ] ) = 0
      w′(x − x0) = 0.

87 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.1.3 LDA in Extended Space vs. QDA

• “For this figure and many similar figures in the book we compute the decision boundaries by an exhaustive
contouring method. We compute the decision rule on a fine lattice of points, and then use contouring algorithms
to compute the boundaries.”

88 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.1.4 Regularized Discriminant Analysis (RDA)

Example 17 (RDA on Vowel dataset) .


      Σ̂k(α) = αΣ̂k + (1 − α)Σ̂,                                            (4.5)
      Σ̂(γ) = γΣ̂ + (1 − γ)σ²I.                                             (4.6)

• The optimum occurs around α = 0.9, closer to QDA.

• This

89 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Revisiting bias-variance trade-off

• Remember case 1, case 2, case 3 (P. 24).

• 3 different complexities:
a) is case 3 (QDA).
   b) is case 1, less complex than LDA (how to estimate only the diagonal; trivial!)
c) is very naive.

90 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.1.5 Principal Component Analysis (PCA)
PCA is:

• for dimension reduction and linear data summary.

• NOT a discrimination rule.

• but has strong connection to discrimination.

PCA: two different questions with the same answer:

• what is the best linear representation of data?

• what are directions of largest spread?
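Both questions above are answered by the SVD of the centered data. A minimal Matlab sketch (the correlated 2-D Gaussian data is illustrative; mvnrnd needs the Statistics Toolbox):

% Minimal sketch: PCA as the directions of largest spread / best linear summary.
n = 500;
X  = mvnrnd([0 0], [3 1.5; 1.5 1], n);
Xc = X - repmat(mean(X), n, 1);          % center
[~, S, V] = svd(Xc, 'econ');             % columns of V are the principal directions
varExplained = diag(S).^2 / sum(diag(S).^2);
disp('principal directions (columns):'); disp(V);
disp('fraction of variance explained:'); disp(varExplained');
Z1 = Xc * V(:,1);                        % scores on the first PC: the best 1-D linear summary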

91 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.1.6 LDA vs. PCA
One may conjecture that the decision boundary has the direction of the largest principal component! Sounds reasonable; let's investigate. If so, then the direction perpendicular to the decision boundary (Σ^{-1}(µ2 − µ1)) has the direction of the smallest principal component. Now

      Σ = VΛV′,   V′V = I
        = v1v1′λ1 + v2v2′λ2,
      Σv2 = v2λ2.

Our conjecture leads to

      Σ^{-1}(µ2 − µ1) = v2 c
      µ2 − µ1 = Σ v2 c
              = v2 λ2 c,

which means that the line connecting the two means (which is arbitrary) has the direction of the smallest
principal component; which is not mandatory of course.

Let’s see Mathematica Notebook:

92 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.1.7 Mahalanobis’ Distance Classifier
The covariance matrix (population or sample) is positive semidefinite (prove as homework):

      Σ = UΛU′
        = ( u1 ··· up ) diag(λ1, ..., λp) ( u1 ··· up )′
        = λ1 u1u1′ + ··· + λp upup′.

It is straightforward to show that (for full-rank Σ):

      λi = σi², the variance in the direction of ui,
      Σ^{-1} = UΛ^{-1}U′.

The distance to a class centroid may be weighted by the inverse of the spread (a small spread in one direction makes a data point far from the class centroid in that direction): (x − µ)′ui / σi. So, the whole distance should be

      δ²(x, µ) = Σ_{i=1}^{p} ( (x − µ)′ui / σi )²
               = Σ_{i=1}^{p} ( (x − µ)′ui / σi ) ( ui′(x − µ) / σi )
               = (x − µ)′ ( (1/σ1²) u1u1′ + ··· + (1/σp²) upup′ ) (x − µ)
               = (x − µ)′ Σ^{-1} (x − µ),
93 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
which is the Mahalanobis' distance between x and µ (the class centroid) w.r.t. the matrix Σ (the covariance matrix of the data).

Then it is natural to classify based on the closest class centroid to the testing point x @@@

      Ĝ(x) = ωk,   k = arg min_k (x − µk)′ Σ^{-1} (x − µk).

For K = 2 and Σ1 = Σ2 = Σ this reduces to

      (x − µ2)′Σ^{-1}(x − µ2) − (x − µ1)′Σ^{-1}(x − µ1)  ≷_{G2}^{G1}  th,

which is the same as LDA.
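A minimal Matlab sketch of the nearest-centroid rule in Mahalanobis distance (simulated two-class data with a common covariance; mvnrnd needs the Statistics Toolbox; all names are illustrative):

% Minimal sketch: Mahalanobis-distance (nearest class centroid) classifier.
n = 100; Sigma = [2 1; 1 2]; mu1 = [0 0]; mu2 = [2 2];
X1 = mvnrnd(mu1, Sigma, n); X2 = mvnrnd(mu2, Sigma, n);
X = [X1; X2]; g = [ones(n,1); 2*ones(n,1)];
m1 = mean(X1); m2 = mean(X2);
Sp = ((n-1)*cov(X1) + (n-1)*cov(X2)) / (2*n - 2);    % pooled covariance estimate
Si = inv(Sp);
D1 = X - repmat(m1, 2*n, 1); D2 = X - repmat(m2, 2*n, 1);
d1 = sum((D1*Si).*D1, 2);                            % (x - m1)' Si (x - m1), row-wise
d2 = sum((D2*Si).*D2, 2);
ghat = 1 + (d2 < d1);                                % assign to the closer centroid
fprintf('training error = %.3f\n', mean(ghat ~= g));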

94 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.1.8 Fisher Discriminant Analysis (FDA)
4.1.8.1 2-Class Problem

Projected data is y = w′x. Let's maximize

      J(w) = (∆ means)² / spread_y
           = ( ȳ2 − ȳ1 )² / ( Σ_{yi∈ω1} (yi − ȳ1)² + Σ_{yi∈ω2} (yi − ȳ2)² ),

      ȳ1 = (1/n1) Σ_{yi∈ω1} yi = (1/n1) Σ_{i=1}^{n1} w′xi = w′x̄1,

      ( ȳ2 − ȳ1 )² = ( w′(x̄2 − x̄1) )² = w′ (x̄2 − x̄1)(x̄2 − x̄1)′ w = w′ SB w,   SB = (x̄2 − x̄1)(x̄2 − x̄1)′.

Of course SB is singular with rank 1, and is called the Between-Class scatter matrix.

95 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


      Σ_{yi∈ω1} (yi − ȳ1)² = Σ_{i=1}^{n1} ( w′xi − w′x̄1 )²
                            = Σ_{i=1}^{n1} w′ (xi − x̄1)(xi − x̄1)′ w
                            = w′ [ Σ_{i=1}^{n1} (xi − x̄1)(xi − x̄1)′ ] w = w′S1w,

      w′S1w + w′S2w = w′SW w,

      J(w) = w′SB w / w′SW w,

      ∇J(w) = [ (2SB w)(w′SW w) − (w′SB w)(2SW w) ] / (w′SW w)²,

      (SB w)(w′SW w) = (w′SB w)(SW w).

SB w has the direction of (x̄2 − x̄1), and hence so does SW w. Then

      SW w ∝ (x̄2 − x̄1)
      w ∝ SW^{-1}(x̄2 − x̄1).

Then the decision rule is

      ( SW^{-1}(x̄2 − x̄1) )′ x  ≷_{G2}^{G1}  some value.
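A minimal Matlab sketch computing the Fisher direction and its criterion on simulated data (mvnrnd needs the Statistics Toolbox; all names and values are illustrative):

% Minimal sketch: 2-class Fisher direction w = SW^{-1}(xbar2 - xbar1).
n = 100; Sigma = [2 1; 1 2];
X1 = mvnrnd([0 0], Sigma, n); X2 = mvnrnd([2 0], Sigma, n);
S1 = (n-1)*cov(X1); S2 = (n-1)*cov(X2);      % class scatter matrices
SW = S1 + S2;                                % within-class scatter
w  = SW \ (mean(X2) - mean(X1))';            % Fisher direction (up to scale)
w  = w / norm(w);
y1 = X1*w; y2 = X2*w;                        % projected (1-D) data
J  = (mean(y2) - mean(y1))^2 / (sum((y1-mean(y1)).^2) + sum((y2-mean(y2)).^2));
fprintf('Fisher criterion J(w) = %.4f\n', J);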

96 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Recall LDA (with the Normal assumption and Σ1 = Σ2 = Σ):

      [ Σ^{-1}(µ2 − µ1) ]′ (x − x0)  ≷_{G2}^{G1}  0
      [ Σ^{-1}(µ2 − µ1) ]′ x  ≷_{G2}^{G1}  some value,   where

      Σ̂ = ( 1/(n1 + n2 − 2) ) Σ_{k=1}^{2} Σ_i (xi − x̄k)(xi − x̄k)′
         = ( 1/(n1 + n2 − 2) ) SW,
      µ̂1 = x̄1,   µ̂2 = x̄2.

This explains why LDA works even if the data is neither linearly separable nor following Normal distribution.
The LDA is always optimal in Fisher’s sense.

Let’s see same Mathematica Notebook again for connection between PCA, LDA, and FDA (for Fisher (not
Flexible) Discriminant Analysis).

97 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.1.8.2 K -Class Problem

FDA also maximizes:


      J(w) = w′SB w / w′SW w.

This is a well-known mathematical problem, the generalized eigenvalue problem, whose solution satisfies

      SW^{-1} SB w = λw,

so w is the first eigenvector of the matrix SW^{-1}SB.
SB is called the between-class covariance (scatter) matrix,
SW is called the within-class covariance (scatter) matrix.
In the case of a 2-class problem,

      SB = n1(x̄1 − x̄)(x̄1 − x̄)′ + n2(x̄2 − x̄)(x̄2 − x̄)′
         = n1( x̄1 − (n1x̄1 + n2x̄2)/n )( x̄1 − (n1x̄1 + n2x̄2)/n )′ + ···
         = (n1/n²)(n2x̄1 − n2x̄2)(n2x̄1 − n2x̄2)′ + ···
         = (n1n2²/n²)(x̄1 − x̄2)(x̄1 − x̄2)′ + ···
         = ( n1n2²/n² + n1²n2/n² )(x̄1 − x̄2)(x̄1 − x̄2)′
         = (n1n2/n)(x̄1 − x̄2)(x̄1 − x̄2)′,

which is different from the SB defined above. However, the maximizer is the same, of course, since the difference is just a scalar factor (n1n2/n).
98 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
More on SW and SB for the K-class problem:

      ST = Σ_i (xi − x̄)(xi − x̄)′,
      SB = Σ_{k=1}^{K} nk (x̄k − x̄)(x̄k − x̄)′,
      SW = Σ_{k=1}^{K} Σ_{i∈ωk} (xi − x̄k)(xi − x̄k)′,   with   ST = SB + SW.

99 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Connection to AUC:
Show that this projection maximizes the AUC (for normal dist).

100 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.1.9 Logistic Regression
4.1.9.1 Generalized Linear Models (GLM)

101 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.1.9.2 Fitting Logistic Regression

102 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.1.9.3 Logistic Regression vs. LDA

103 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.2 Separating Hyperplanes
4.2.1 Rosenblatt’s Perceptron Learning Algorithm

104 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


4.2.2 Optimal Separating Hyperplane

105 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Chapter 7

Model Assessment and Selection

106 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


7.1 “Dimensionality” from Different Perspectives

107 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


7.1.1 Curse of Dimensionality
Example 18 :
A point is chosen randomly (uniformly) in a disk of radius 1. What is f_X? Then

      f_XY(x, y) = { 1/π   x² + y² ≤ 1
                   { 0     otherwise,

      f_X(x) = (1/π) ∫_{−√(1−x²)}^{√(1−x²)} dy = (2/π)√(1 − x²),   −1 ≤ x ≤ 1.

What is P(R ≤ r)? In general, if X = (X1, ..., Xp) is uniformly distributed over an area A, then

      f_X(x1, ..., xp) = 1/|A|,
      P(X ∈ B) = ∫_B f_X dx = |B|/|A|,

      P(R ≤ r) = P(X² + Y² ≤ r²) = πr²/π = r².

For p-dimensional spheres, P(R ≤ r) = r^p, which means that the data will be on the surface!!

Later, we can do it differently: R = √(X² + Y²), which is a function of 2 r.v. (transformation).
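A minimal Matlab sketch of this concentration effect, using rejection sampling from the enclosing cube (feasible only for small p, since the acceptance rate collapses; the dimensions and sample size are illustrative):

% Minimal sketch: for X uniform in the unit p-ball, P(R <= r) = r^p.
for p = [2 3 5 8]
    n = 2e5;
    X = 2*rand(n, p) - 1;                  % uniform in [-1,1]^p
    R = sqrt(sum(X.^2, 2));
    R = R(R <= 1);                         % keep only the points inside the unit ball
    fprintf('p = %d  accepted %6d  P(R <= 0.9) ~ %.3f  (exact 0.9^p = %.3f)\n', ...
            p, numel(R), mean(R <= 0.9), 0.9^p);
end
% As p grows, almost all of the accepted points have R close to 1 (the surface).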

108 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


7.1.2 Vapnik-Chervonenkis (VC) Dimension

109 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


7.1.3 Cover’s Theorem

110 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


7.2 Assessing Regression Functions

111 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


7.3 Assessing Classification Functions

112 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


7.3.1 Any Discrimination Rule Can Have an Arbitrarily Bad Probability of Error for Finite
Sample Size (Devroye, 1982)

113 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


7.3.2 Binary Classification Problem
7.3.2.1 Receiver Operating Characteristics (ROC)

114 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


115 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
7.3.2.2 Area Under the Curve (AUC)

Definition 19 (AUC) is defined as

      AUC = ∫₀¹ TPF d(FPF).

Theorem 20 (Properties of the AUC) :


1. 0 ≤ AUC ≤ 1.

2. The probability that the decision variable under one class is greater than the decision variable of the other class equals the AUC.

3. The AUC Minimum Variance Unbiased Estimator (MVUE) is the Mann-Whitney U-statistic estimator (Randles and Wolfe, 1979):

      ÂUC = (1/(n1n2)) Σ_{i=1}^{n1} Σ_{j=1}^{n2} I( h_tr(xi), h_tr(yj) ),                  (7.1)

      I(a, b) = { 1     a < b,
                { 1/2   a = b,                                                             (7.2)
                { 0     a > b.

Proof.

1. AUC = ∫₀¹ TPF d(FPF) ≤ ∫₀¹ 1 d(FPF) = 1.

2. The proof is a set of straightforward calculus steps. Denote the decision variables distributed as f_{h_tr}(x|ω1) and f_{h_tr}(x|ω2) by X and Y, respectively,

116 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


to ease the notation; then

      Pr[Y > X] = ∫_{−∞}^{∞} ∫_{x′}^{∞} f_XY(x′, y) dy dx′
                = ∫_{−∞}^{∞} f_X(x) [ ∫_{x}^{∞} f_Y(y) dy ] dx
                = ∫_{x=−∞}^{x=∞} −TPF(x) d(FPF(x))
                = ∫₀¹ TPF d(FPF).

3. The proof of unbiasedness is trivial, and the minimum-variance part is omitted (see Randles and Wolfe, 1979).

Example 21 A very complex Deep Neural Network (DNN) is trained on millions of images to detect whether a human appears in a photo or not. The DNN is later tested on 9 images (4 having no humans and 5 having humans) and produced the outputs {−2, 0, 2, 4} and {1, 3, 5, 7, 9}, respectively. Estimate the AUC of this network, and how does the AUC change with the decision point of the network?
Sorted scores (X = no human, O = human):  X X O X O X O O O
The answer is: 17/20 = 0.85 = 85%.
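A minimal Matlab sketch of the Mann-Whitney estimate (7.1) applied to this example:

% Minimal sketch: Mann-Whitney AUC estimate for Example 21.
x = [-2 0 2 4];          % scores of the 4 images with no human
y = [ 1 3 5 7 9];        % scores of the 5 images with a human
n1 = numel(x); n2 = numel(y);
auc = 0;
for i = 1:n1
    for j = 1:n2
        auc = auc + (y(j) > x(i)) + 0.5*(y(j) == x(i));   % the kernel I of Eq. (7.2)
    end
end
auc = auc / (n1*n2);     % 17/20 = 0.85
fprintf('AUC = %.2f\n', auc);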

117 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


@@@ Risk is a function of PPV, which is prevalence dependent πi . Express it.
@@@ Express Risk in terms of TPF and FPF.
@@@ The designed classifier to minimize an objective function, e.g., error, will ultimately have fX (x|ωi ), i =
1, 2, which can be used to obtained other measures than that optimized, e.g., error at different cut off point!
@@@ The ultimate measure should be the objective function to be optimized, but this is in many cases
difficult.

7.3.2.3 Important Measures for Classification Problem


For short, Pr(Ĝ = ωi | G = ωj) ≡ Pr(ω̂i | ωj).
The confusion matrix is C = (c_ij), with c_ij = Σ_{n=1}^{N} I( G(xn) = ωi, Ĝ(xn) = ωj ), arranged with the true class ωi along the rows and the prediction ω̂j along the columns, so the correct decisions c_ii lie on the diagonal. Normalizing each row by its class size estimates Pr(ω̂j | ωi) and makes each row sum to 1; normalizing the j-th column instead gives PPV_j and the other column-wise measures below.

118 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


      TPF_i ≡ Pr(ω̂i | ωi),
      FNF_i ≡ Σ_{j≠i} Pr(ω̂j | ωi),
      1 = TPF_i + FNF_i,
      PPV_j ≡ Pr(ωj | ω̂j),
      FDR_j ≡ Σ_{i≠j} Pr(ωi | ω̂j),
      1 = PPV_j + FDR_j.

• 1-FNF, 1 − e2, TPF, TPR, Sensitivity, Recall, Probability of Detection/Positive alarm, Hit Rate, per-class accuracy:

      TP/P = TP/(TP + FN)

• 1-FPF, 1 − e1, TNF, TNR, Specificity:

      TN/N

• Positive Predictive Value (PPV), Precision (not obtainable from the ROC; it is the conditional probability that a positive decision is true):

      TP/(TP + FP) = 1/(1 + FP/TP) = 1/(1 + (FP/P)/TPF)

• Accuracy:

      (TP + TN)/(P + N)

119 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
• F-score:

      2 (Precision · Recall)/(Precision + Recall)

• Fβ-score:

      (1 + β²) (Precision · Recall)/(β² Precision + Recall)
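A minimal Matlab sketch computing the confusion matrix and the per-class measures above from true and predicted labels (the label vectors are illustrative):

% Minimal sketch: confusion matrix, per-class TPF (sensitivity), PPV (precision), accuracy.
g    = [1 1 1 1 2 2 2 2 3 3];               % true labels, coded 1..K
ghat = [1 1 2 1 2 2 1 2 3 2];               % predicted labels
K = 3;
C = zeros(K, K);
for n = 1:numel(g)
    C(g(n), ghat(n)) = C(g(n), ghat(n)) + 1;   % rows = true class, columns = prediction
end
TPF = diag(C) ./ sum(C, 2);                 % Pr(omega_i predicted | omega_i true)
PPV = diag(C) ./ sum(C, 1)';                % Pr(omega_j true | omega_j predicted)
acc = sum(diag(C)) / sum(C(:));             % overall accuracy
disp(C); disp([TPF PPV]); fprintf('accuracy = %.2f\n', acc);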

7.3.2.4 Multi-Class Problem

• Accuracy:

• F -score:

• Fβ -score:

120 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


7.4 Resampling and Variability
7.4.1 p-values
7.4.2 Cross Validation (CV)
@@@ About the CV error bars:
Hastie: "It is a mucky area but we do it anyway."
Tibshirani: "It is a good approximation."

7.4.2.1 K -Fold CV

121 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


7.4.2.2 Model Assessment & Selection Using K -Fold CV
      êrr_tr^{(KCV)}( f̂, λ ) = (1/N) Σ_i L( yi, f̂^{−K(i)}(xi, λ) )

KCV ALGORITHM for model m{


Divide the N-observation dataset to...
K partitions, each has N/K;

for k = 1 : K
Train on all data except partition k;
Test on partition k;
Save the N/K predictions;
end

Collect the K * N/K predictions;

Estimate your error;


}
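A minimal Matlab sketch of this algorithm, using a ridge-type linear model as the learner (the data, λ, and all names are illustrative, not from the notes):

% Minimal sketch: K-fold cross validation for one model.
N = 100; p = 5; K = 5; lambda = 1;
X = randn(N, p); beta = [2; -1; 0; 0; 1];
y = X*beta + 0.5*randn(N, 1);
fold = repmat(1:K, 1, ceil(N/K)); fold = fold(1:N);
fold = fold(randperm(N));                      % random assignment of observations to folds
yhat = zeros(N, 1);
for k = 1:K
    tr = (fold ~= k); ts = (fold == k);
    b  = (X(tr,:)'*X(tr,:) + lambda*eye(p)) \ (X(tr,:)'*y(tr));   % train on all but fold k
    yhat(ts) = X(ts,:)*b;                                         % predict the held-out fold
end
errKCV = mean((y - yhat).^2);                  % pool the N held-out predictions
fprintf('%d-fold CV estimate of MSE = %.3f\n', K, errKCV);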

HW: Prove that estimating the error rate from each fold then averaging over the folds gives the same estimate
as if we pool the scores from folds and obtain one estimate from the n pooled scores.

122 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


KCV ALGORITHM for model selection{
for m = 1 : M
Err[m] = error of model m...
estimated using KCV;
end

Find the minimum Err and its model;


}

123 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


7.4.3 Bootstrap

124 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


7.5 Discussion in Hyperspace: Pitfalls and Advices
Lee (2008)
Ambroise and McLachlan (2002)
Simon et al. (2003)

7.5.1 Good Features vs. Good Classifiers


All classifiers perform similarly (Lim et al., 2000); focus on selecting good features if you know the physics of the problem.

7.5.2 Pitfall of CV
Golub et al. (1999)

125 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


7.5.3 Pitfall of Reusing the Testing set
@@@This section is not rigorous and the idea is not consolidated yet!!
Good Practice: one use of testing (or called validation) set.

      A = E_ts L(y_ts, η_tr)                                               (7.3)

      η* = arg min_i E_ts L(y_ts, η_tr^{(i)})                              (7.4)

Bad Practice: the pitfall of reusing the testing set. E.g., selecting the best model η*, then reusing the same testing set to select the optimal threshold for classification!
When do you know that you have an over-optimistic measure A of your model (or have over-trained)?
This happens, e.g., when training models on tr, then testing on ts to choose the best model in terms of some loss, e.g., MSE. Then this chosen best model is used on the same test set to choose the best threshold value. In this case, the threshold value is a tuning parameter selected using ts and tested on ts as well.

“Suppose instead that we use the test-set repeatedly, choosing the model with smallest test-set error. Then
the test set error of the final chosen model will underestimate the true test error, sometimes substantially.”
(Hastie et al., 2009, P. 222)
Incrementally converting the testing set to training set Gur et al. (2004)

126 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


7.5.4 One Independent Test set is not Sufficient.
Dave et al. (2004) and Tibshirani (2005)

127 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Chapter 11

Neural Networks

@@@ The connection between Trees and NN:


For a root–first-level–leaves tree, you can get the same performance using an input–two-layers–output NN. A deeper tree is similar to a wider network (more neurons on the two layers). Deeper networks are therefore much more powerful than trees. (https://www.youtube.com/watch?v=2_Jv11VpOF4, Imperial College talk.)
(What about many trees, i.e., random forests?) Prove everything.

128 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


β for logistic regression can start at 0; for NN we have to start randomly.

Linear classification vs. logistic regression (complexity) Bishop page 205.

Difference between numerical solution of NN and Logistic regression.

129 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


11.1 Basic and Definitions
There is some confusion in layer counting in the liter-
ature. This network is

• three layer by counting all layers including input


and output.

• two layer by counting all but the input. (our


convention)

• single layer by counting all but both the input


and output; (only the hidden layers).

Neuron: each element in a layer.


Activation Function of layer i: σ (i)
Bias element x0 and z0 : accounts for intercept.

      Ŷk = σ^(2)( w^(2)_{0k} + Σ_{m=1}^{M} w^(2)_{mk} σ^(1)( w^(1)_{m0} + Σ_{i=1}^{D} w^(1)_{mi} Xi ) )
         = σ^(2)( w^(2)_{0k} + Σ_{m=1}^{M} w^(2)_{mk} σ^(1)( w^(1)_{m0} + Wm X ) )
         = σ^(2)( w^(2)_{0k} + Σ_{m=1}^{M} w^(2)_{mk} Zm )
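A minimal Matlab sketch of this forward pass for a single input vector (the sizes, weights, and activation choices are purely illustrative):

% Minimal sketch: forward pass of a 2-layer network (D inputs, M hidden units, K outputs).
D = 3; M = 4; K = 2;
x  = randn(D, 1);
W1 = randn(M, D); b1 = randn(M, 1);     % first-layer weights w_mi^(1) and biases w_m0^(1)
W2 = randn(K, M); b2 = randn(K, 1);     % second-layer weights w_mk^(2) and biases w_0k^(2)
sigma1 = @(a) tanh(a);                  % hidden-layer activation sigma^(1)
sigma2 = @(a) a;                        % identity output for regression (see Sec. 11.2)
Z    = sigma1(b1 + W1*x);               % hidden features Z_m
Yhat = sigma2(b2 + W2*Z);               % outputs Yhat_k
disp(Yhat');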

130 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


11.2 Connection to Linear Model
This means: a projection onto a linear subspace, W_{M×D} X, followed by nonlinear feature generation (to add flexibility), Z = σ^(1)(WX).

Usually, for regression, σ^(2) of the second layer is chosen to be the identity, σ^(2)(T) = T. Therefore

      Ŷk = w^(2)_{0k} + Σ_{m=1}^{M} w^(2)_{mk} Zm,

which is nothing but a linear model in the new feature space.

And for binary classification, σ^(2) is chosen to be the softmax function

      σ^(2)(Tk) = exp(Tk) / Σ_k exp(Tk),

which means that Ŷk = P̂r(Ck | X) = σ^(2)(Tk). This is exactly a multi-logistic regression problem in the transformed features Z.

If we start with zero weights the algorithm will never move.

131 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Cont.
The advantage of the transformation from X to Z is adding more predictive power. However, this comes at the expense of model interpretability. In linear models of regression or classification, the output is related directly to the input and every coefficient has a meaning.

Usually σ^(1) is chosen to be the sigmoid

      σ(µ) = 1 / ( 1 + exp(−µ) )

or the tan-sigmoid function.

It can be proven that a 2-layer NN with sufficiently large M can approximate any input function of finite domain with finite discontinuities. This result holds for a wide range of activation functions σ^(1), excluding polynomials.

132 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


11.3 Connection to PCA
We can design the network (in the figure) without having a response variable, but with the number of outputs K equal to the dimension p. We keep changing the weights and observe the output, then measure the error:

      E(w) = Σ_{i=1}^{n} || ŷ(xi, w) − xi ||².

With one hidden layer with a linear link function, the output is linear in the inputs, and minimizing E(w) amounts to minimizing || WX − X ||² over the implied linear map W. We can design the network without having a response to achieve minimum error.

In this case the NN will be nothing but PCA, because both use the square loss function as the minimization criterion. The network will find the largest M principal components. But of course, PCA is better since the solution is found at once using SVD rather than by training.

Adding a nonlinear hidden layer will project onto the PCA subspace also (there is a proof for that).

133 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


11.4 Solution vs. Optimization

We have to find the weights w such that the error

      E(w) = Σ_{i=1}^{n} || ŷ(xi, w) − yi ||²

is minimized. Notice that xi is the ith realization of the random vector X, so xi = (x_{i1}, ..., x_{iD})′; and:

      E(w) = Σ_{i=1}^{n} || σ^(2)( w^(2)_{0k} + Σ_{m=1}^{M} w^(2)_{mk} σ^(1)(Wm xi) ) − yi ||².

Problem 1: This function is complicated in the ws, and there is no closed-form solution for its minimum. Any 2-layer feed-forward network with M hidden units has (D + 1)M + (M + 1)K weights.
The simplest network has 1 input, 1 output, and 1 hidden unit. This gives 4 weights! The solution has to be numerical:

      E(w) = Σ_{i=1}^{n} || w4 + w3 σ(w1 + w2 xi) − yi ||²,   with ŷi = w4 + w3 σ(w1 + w2 xi).

Problem 2: It is a non-convex function; this means local minima exist; however, we seek the global one. (How do we know?)

134 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


11.5 NN for Regression
Consider training a network on a 100-observation sample from Y = sin(2πX) + ε, where X ∼ U(0, 1) and ε ∼ N(0, σ²) so that the signal-to-noise ratio (SNR) is about 4. Remember

      SNR = Var( E(Y|X) ) / Var(ε).

In this case, it will be achieved for σ ≈ 0.35.

Testing set (never seen by NN, this is just for understanding what’s going on) is 10,000.

Let’s train each network until it reaches the maximum accuracy for the gradient (or 10−5 is enough).

Try several M (from 1 to 5); for each do the training 10 times with different initialization vector (see the
code).

135 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


 Matlab Code 11.1: 
trainsize =100; testsize =10000; sig =0.3;
x = rand ( trainsize + testsize , 1) ’; fx = sin (2* pi * x ) ;
y = fx + sig * mvnrnd (0 , 1 , trainsize + testsize ) ’; % 4 to 1 SNR ; SNR = var ( E ( Y | X ) ) / var ( eps ) .
xtrain = x (1: trainsize ) ; ytrain = y (1: trainsize ) ;
xtest = x ( trainsize +1: end ) ; ytest = y ( trainsize +1: end ) ;

% % This is to fix the initialization over the simulation below.


M =10; I =10;
InitWeightMat = cell (I , M ) ;
for m=1:M, for i=1:I, net = newfit(xtrain, ytrain, m); InitWeightMat{i,m} = {net.IW, net.LW, net.b}; end; end;
% % This cell trains M nets, each for I initializations, and tests each solution on the same training set and an infinite testing set
ApparPerfMat = zeros (I , M ) ; PerfMat = zeros (I , M ) ; WeightMat = cell (I , M ) ;
for m =1: M
net = newfit ( xtrain , ytrain , m ) ;
net . divideFcn = ’ ’; net . trainParam . showWindow =0;
net . trainParam . min_grad = 1e -2; % saving some time and prohibits reaching global minima .
for i =1: I
tmp = InitWeightMat {i , m }; net . IW = tmp {1 , 1}; net . LW = tmp {1 , 2}; net . b = tmp {1 , 3};
[ net , tr ,Y ,E , Pf , Af ]= train ( net , xtrain , ytrain ) ;
ApparPerfMat (i , m ) = sum ( E .^2) / length ( xtrain ) ;
[ Y2 , Xf , Af , E2 , perf ] = sim ( net , xtest , [] , [] , ytest ) ;
PerfMat (i , m ) = perf ;
WeightMat {i , m }={ net . IW , net . LW , net . b };
end ;
end ;

% % With parameter regularization


Lamdalist = 0:.01:0.1; net = newfit(xtrain, ytrain, M); net.divideFcn = '';
ApparPerfMat = zeros(I, length(Lamdalist)); PerfMat = zeros(I, length(Lamdalist)); WeightMat = cell(I, length(Lamdalist));
net . performFcn = ’ msereg ’; net . trainParam . showWindow =0;

136 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


for m=1:length(Lamdalist) % m here is the regularization parameter.
net . performParam . ratio = Lamdalist ( m ) ;
for i =1: I
tmp = InitWeightMat {i , M };
net . IW = tmp {1 , 1}; net . LW = tmp {1 , 2}; net . b = tmp {1 , 3};
[ net , tr ,Y ,E , Pf , Af ]= train ( net , xtrain , ytrain ) ;
ApparPerfMat (i , m ) = sum ( E .^2) / length ( xtrain ) ;
[ Y2 , Xf , Af , E2 , perf ] = sim ( net , xtest , [] , [] , ytest ) ;
PerfMat (i , m ) = sum ( E2 .^2) / length ( xtest ) ;
WeightMat {i , m }={ net . IW , net . LW , net . b };
end ;
end ;
 

137 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Same 10 initializations
per m:

• Small m implies few min-


ima

• Max. Gradient is set to


10−10 ; this allows for very
large unrealistic weight
vector searching for an
asymptotic minima.

• Best solution is roughly


at (D + 1)m + (m + 1)K =
n/10. That is, 3m+1 = 10,
i.e., m = 3.

138 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Same 10 initializations
per m:

• Small m implies few min-


ima

• Max. Gradient is set


−4
to 10 , which we con-
sider as 0. This sacrifices
a little decrease in MSE
on the training set for
getting reasonably small
weight vectors.

• Best solution is roughly


at (D + 1)m + (m + 1)K =
n/10. That is, 3m+1 = 10,
i.e., m = 3.

139 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Same 10 initializations
per m:

• Small m implies few min-


ima

• Max. Gradient is set to


10−3 ; Same comment as
above.

• Best solution is roughly


at (D + 1)m + (m + 1)K =
n/10. That is, 3m+1 = 10,
i.e., m = 3.

140 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Best solution on the population and best on the training set for min. grad. 10⁻³:
• @@@ I think this was for min. grad. 10⁻² NOT 10⁻³. Therefore, the curve is not one of those of the previous subplot, where the best had a spiky overshoot. I have to rerun the Matlab code.

141 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Performance of all solu-
tions for min. grad. 10−3 :

• Monotonic decrease in
training MSE.

• There is an optimal so-


lution for minimizing the
MSE on the population.

142 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Regularization λ instead
of varying m:
Only one net with 10 neu-
rons, with same 10 initial-
izations of the case m = 10
previously for min. grad.
10−10 .

• λ ranges from 0 (the


most constrained model)
to 0.1. Of course, for case
of λ = 1 this reduces to
the full model M = 10
without using any regu-
larization.

• training stops when MSEREG starts increasing, where

      MSEREG = λ·MSE + (1 − λ) Σ_{all weights} w²

143 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Performance of the all so-
lutions using regulariza-
tion. Notice that this best
solution would be exactly
as those for the case of
m = 10 had we set λ = 1
(no regularization).

144 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Best solution on the population and best on the training set using regularization. Comparison between varying m, regularization, and Bayes.

145 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


11.6 NN for Classification
C0 ∼ N(0, 1),  C1 ∼ N(1, 1).
Training: 10 observations/class (blue).
Testing: 1000 observations/class (green).
A complicated network tries to estimate Pr(C1|x) close to 1 or 0; its decision surface is a set of complicated disconnected regions.
Sometimes the best solution is for the net with, say, m = 4; however, its MSE is very close to the one with m = 1.
The red curve is the Bayes' posterior (calculated below).
A small network looks like simple logistic regression; the decision surface at each threshold is simply a point, as it should be.
The true AUC for this problem is 0.76.
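A minimal Matlab sketch evaluating the Bayes posterior derived on the next page and checking the true AUC of 0.76 (normcdf is from the Statistics Toolbox; the plotting range is illustrative):

% Minimal sketch: Bayes posterior and true AUC for N(0,1) vs N(1,1), equal priors.
x = linspace(-4, 5, 200);
post = 1 ./ (1 + exp(0.5 - x));          % Pr(C1 | x), derived on the next page
aucTrue = normcdf(1 / sqrt(2));          % = Pr(Y > X) for Y~N(1,1), X~N(0,1): about 0.76
fprintf('true AUC = %.4f\n', aucTrue);
plot(x, post); xlabel('x'); ylabel('Pr(C_1 | x)');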

146 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


      f(C1|x) = f(x|C1) Pr(C1) / f(x) = f(x|C1) Pr(C1) / ( f(x|C1) Pr(C1) + f(x|C0) Pr(C0) )

              = 1 / ( 1 + (1/LR)·(Pr(C0)/Pr(C1)) ),   LR = f(x|C1)/f(x|C0),

              = 1 / ( 1 + (σ1/σ0) exp( ½ [ ((x−µ1)/σ1)² − ((x−µ0)/σ0)² ] ) )     (equal priors)

              = 1 / ( 1 + e^{½ − x} )                                            (µ0 = 0, µ1 = 1, σ0 = σ1 = 1).

147 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Performance (MSE):

148 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


MSE (measured on population) increases with m.

MSE (measured on training set) decreases with m.

149 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Performance (AUC):

150 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


AUC (measured on population) decreases with m.

AUC (measured on training set) increases with m.

151 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


It is not necessarily the case that the solution with the best MSE is the same as the solution with the best AUC.

However, sometimes there is a general trend.


152 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
All solutions (1 ≤ m ≤ 10, 10 different initializations per each m)

153 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Practical Issues:

 Matlab Code 11.2: 


% %%%%% Important Bug %%%%%%%
% Setting the output transfer function in newpr does not work; an example is:
load simpleclass_dataset
net = newpr ( simpleclassInputs , simpleclassTargets ,2 , { ’ tansig ’ , ’ softmax ’ }) ;
net . layers {2}. transferFcn
% will give ’ tansig ’ , which is the default ; not ’ softmax ’ as initialized .
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

testsamplesno =1000; n1 =10; n2 =10;


p =1; MDIST =2;
mu1 = zeros (1 , p ) ; mu2 = sqrt ( ( MDIST / p ) * ones (1 , p ) ) ;
sigma1 = eye ( p ) ; sigma2 = eye ( p ) ;
% sigma1 = eye ( p ) ; roh =.5; sigma2 = repmat ( roh , [ p p ]) +(1 - roh ) * eye ( p ) ;

infinitetestsamples1 = mvnrnd(mu1, sigma1, testsamplesno)'; infinitetestsamples2 = mvnrnd(mu2, sigma2, testsamplesno)';
trainers1 = mvnrnd ( mu1 , sigma1 , n1 ) ’;
trainers2 = mvnrnd ( mu2 , sigma2 , n2 ) ’;

xtrain =[ trainers1 trainers2 ]; ytrain =[ ones (1 , n1 ) , zeros (1 , n2 ) ; zeros (1 , n1 ) , ones (1 , n2 ) ];


xtest = [infinitetestsamples1 infinitetestsamples2]; ytest = [ones(1,testsamplesno), zeros(1,testsamplesno); zeros(1,testsamplesno), ones(1,testsamplesno)];
% % This is to fix the initialization over the simulation below.
M =10; I =10;
InitWeightMat = cell (I , M ) ;
for m=1:M, for i=1:I, net = newfit(xtrain, ytrain, m); InitWeightMat{i,m} = {net.IW, net.LW, net.b}; end; end;
% % This cell trains M nets, each for I initializations, and tests each solution on the same training set and an infinite testing set
ApparPerfMat = zeros (I , M ) ; PerfMat = zeros (I , M ) ; ApparAUCMat = zeros (I , m ) ; AUCMat = zeros (I , M ) ;
154 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
WeightMat = cell (I , M ) ;
for m =1: M
net = newfit ( xtrain , ytrain , m ) ;
net . divideFcn = ’ ’; net . trainParam . showWindow =0;
net . layers {2}. transferFcn = ’ tansig ’;
net . trainParam . min_grad = 1e -4; % saving some time and prohibits reaching global minima .
for i =1: I
tmp = InitWeightMat {i , m }; net . IW = tmp {1 , 1}; net . LW = tmp {1 , 2}; net . b = tmp {1 , 3};
[ net , tr ,Y ,E , Pf , Af ]= train ( net , xtrain , ytrain ) ;
ApparPerfMat (i , m ) = sum ( E (:) .^2) /(2*( n1 + n2 ) ) ;
ApparAUCMat (i , m ) = MannWhitney ( Y (1 , 1: n1 ) , Y (1 , n1 +1: n1 + n2 ) , - inf ) ;
[ Y2 , Xf , Af , E2 , perf ] = sim ( net , xtest , [] , [] , ytest ) ;
PerfMat (i , m ) = perf ;
        AUCMat(i,m) = MannWhitney(Y2(1, 1:testsamplesno), Y2(1, testsamplesno+1:2*testsamplesno), -inf);
WeightMat {i , m }={ net . IW , net . LW , net . b };
end ;
end ;
 

with softmax, set net.outputs2.processFcns= as long as the classes are coded by 0 and 1.

155 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


The XOR classification:

156 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


      y1 = σ^(1)( 0.5 + 1·x1 + 1·x2 ),
      y2 = σ^(1)( −1.5 + 1·x1 + 1·x2 ),
      z  = σ^(2)( −1 + 0.7·y1 − 0.4·y2 ),

where the activation functions in the two layers are the hard limits given by:

      σ^(1)(µ) = σ^(2)(µ) = {  1   µ > 0
                            {  0   µ = 0
                            { −1   µ < 0
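A minimal Matlab sketch evaluating this network on the four inputs, assuming the inputs are coded in {−1, +1} as the hard-limit convention suggests:

% Minimal sketch: the hard-limit network above computes XOR on {-1,+1}^2 inputs.
hardlim = @(a) sign(a);                          % 1 if a>0, 0 if a=0, -1 if a<0 (as above)
for x = [-1 -1; -1 1; 1 -1; 1 1]'
    y1 = hardlim( 0.5  + x(1) + x(2));
    y2 = hardlim(-1.5  + x(1) + x(2));
    z  = hardlim(-1 + 0.7*y1 - 0.4*y2);
    fprintf('x = (%2d,%2d)  ->  z = %2d\n', x(1), x(2), z);
end
% z = +1 exactly when x1 and x2 differ: the XOR function.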

157 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


158 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
159 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
Chapter 14

Unsupervised Learning

160 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Appendix A

Fast Revision on Probability

A-1 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


This introduction is adapted from
Bishop’s slides

Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Probability Theory

Apples and Oranges

Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Probability Theory

      Σ_i Σ_j p(xi, yj) = 1

Marginal Probability, Joint Probability, Conditional Probability

Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Probability Theory

Sum Rule

Product Rule

Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


The Rules of Probability

Sum Rule:      p(X) = Σ_Y p(X, Y)

Product Rule:  p(X, Y) = p(Y|X) p(X)

Independence:  p(X, Y) = p(X) p(Y)   ≡   p(Y|X) = p(Y)

Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Bayes' Theorem

      p(Y|X) = p(X|Y) p(Y) / p(X),        posterior ∝ likelihood × prior

Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Probability Densities

Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Continuous variables

      f(Y) = ∫ f(X, Y) dX,
      f(X | Y = y) = f(X, Y = y) / f_Y(y).

Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Transformed Densities and Expectation

      µ_X = E[X] = Σ_x p(x) x   (discrete),      E[X] = ∫ p(x) x dx   (continuous)

Conditional Expectation (discrete); Approximate Expectation.

      σ²_X = var[X] = E[ (X − E X)² ] = E[X²] − (E X)²

Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


      µ_{Y|X=x0} = E[ Y | X = x0 ],
      σ²_{Y|X=x0} = var[ Y | X = x0 ] = E[ (Y − E[Y|X=x0])² | X = x0 ],

      var[Y] = E_X var[ Y | X ] + var_X E[ Y | X ],

      ρ_{XY} = cov(X, Y) / (σ_X σ_Y),    −1 ≤ ρ_{XY} ≤ 1.

Observed correlation does NOT imply causation.

Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Prove the following:

      var(X + Y) = var(X) + var(Y) + 2 cov(X, Y),
      cov(X, Y) = 0 if X ⊥ Y,
      var(aX) = a² var(X).

Intuition:

the variance of each affects the total variance, but if one


decreases with the increase of the other this should
affect the total variance

If independent, of course there is not association.

Scaling a r.v. increases or decreases the spread


depending on whether the scale “a” is larger or smaller
than one.

Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Appendix B

Fast Revision on Statistics

B-1 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


A random variable (or vector) is denoted by upper case letters, e.g., X .

Independent observations (realizations) of this r.v. are called independent and identically distributed (i.i.d.),
e.g., x1 ,...,xn .

Estimator is a real-valued function that tries to be “close” in some sense to a population quantity.
( ) ( )2
How “close”? Define a loss function, e.g., the Mean Square Error (MSE): L µ b,µ = µb − µ .And, define the
( )2
Risk to be the Expected loss: E µb−µ .

Important decomposition for any estimator µ̂:

      E(µ̂ − µ)² = E( (µ̂ − Eµ̂) + (Eµ̂ − µ) )²
                 = E(µ̂ − Eµ̂)² + E(Eµ̂ − µ)² + 2 E[ (µ̂ − Eµ̂)(Eµ̂ − µ) ]
                 = Var(µ̂) + Bias²(µ̂)

B-2 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


We could have defined other loss functions, e.g., the absolute deviance loss:
( )
b,µ = |µ
L µ b − µ|

One estimator may be better for one loss and not better for another loss.

B-3 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Estimation of µ_X

Sample mean X̄ as an estimator of µ_X:  µ̂_X = (1/n) Σ_{i=1}^{n} xi.

      E X̄ = (1/n) Σ_{i=1}^{n} E(xi) = E X = µ,
      Bias(µ̂) = E µ̂ − µ = 0,
      Var(µ̂) = (1/n²) [ Σ_i σ² + Σ_i Σ_{j≠i} Cov(Xi, Xj) ] = (1/n) σ².

This means that from sample to sample it will vary with this variance.

An estimator with zero bias is called “unbiased”. This means that on average it will be exactly as what we
want.

B-4 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Estimation of σ²

      σ² = E(X − µ)² = E X² − µ²,

      σ̂² = (1/(n−1)) Σ_i (xi − X̄)²
          = (1/(n−1)) ( Σ_i xi² − n X̄² ),

      E σ̂² = (1/(n−1)) E[ Σ_i xi² − n X̄² ]
            = (1/(n−1)) ( n E X² − n E X̄² )
            = (1/(n−1)) ( n(σ² + µ²) − n(σ²/n + µ²) )
            = σ²;

therefore, σ̂² is unbiased for σ².

B-5 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Estimation of Cov(X, Y)

      Cov(X, Y) = E(X − µ_X)(Y − µ_Y) = E XY − µ_X µ_Y,

      Ĉov(X, Y) = (1/(n−1)) Σ_i (xi − X̄)(yi − Ȳ)
                = (1/(n−1)) ( Σ_i xi yi − n X̄ Ȳ ),

      E Ĉov(X, Y) = (1/(n−1)) ( n E(xi yi) − n E(X̄ Ȳ) ),

      n E(xi yi) − n E(X̄ Ȳ) = n( Cov(X, Y) + µ_X µ_Y ) − n (1/n²) Σ_i Σ_j E(xi yj)
                             = n( Cov(X, Y) + µ_X µ_Y ) − (1/n) E( Σ_i xi yi + Σ_{i≠j} xi yj )
                             = n( Cov(X, Y) + µ_X µ_Y ) − (1/n)( n E(xi yi) + n(n−1) E(xi yj) )
                             = n( Cov(X, Y) + µ_X µ_Y ) − ( Cov(X, Y) + µ_X µ_Y + (n−1) µ_X µ_Y )
                             = n Cov(X, Y) + n µ_X µ_Y − ( Cov(X, Y) + n µ_X µ_Y )
                             = (n − 1) Cov(X, Y).

Therefore, Ĉov(X, Y) is unbiased as well for Cov(X, Y).
B-6 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
Random Vectors and Multivariate Statistics
( )′
A p-dimensional random vector X is X = X1 ,...,Xp has joint pdf

fX = fX1 ,...,Xp

Mean:

µ = EX
( )′
= E X1 ,..., E Xp

B-7 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Covariance Matrix Σ (= Cov(X)):

      Σ = E (X − µ)(X − µ)′
        = E [ (X1 − µ1)², ..., (X1 − µ1)(Xp − µp) ; ... ; (Xp − µp)(X1 − µ1), ..., (Xp − µp)² ]
        = [ σ1², ..., σ_{1p} ; ... ; σ_{p1}, ..., σp² ]
        = [ σ1², ..., ρ_{1p}σ1σp ; ... ; ρ_{1p}σ1σp, ..., σp² ].

Prove, for any random vector X, that:

Σ is a positive semi-definite (p.s.d.) matrix; i.e., α′Σα ≥ 0 ∀α. (See next how to generate Σ.)
Σ is symmetric; i.e., σ_{ij} = Cov(Xi, Xj) = Cov(Xj, Xi) = σ_{ji}.

B-8 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Estimation:
Estimation of µ and Σ is nothing but estimation of their components, which is discussed above. In vector form, it is written as:

      µ̂ = X̄ = (1/n) Σ_i xi,   i.e., the j-th component is (1/n) Σ_i x_{ij},

      Σ̂ = (1/(n−1)) Σ_i (xi − X̄)(xi − X̄)′,   whose (j, k) entry is (1/(n−1)) Σ_i (x_{ij} − X̄_j)(x_{ik} − X̄_k).

B-9 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


We know that X is normally distributed if

      f_X(x) = (2πσ²)^{-1/2} exp( −½ ((x − µ)/σ)² ).

Prove that:

E X = µ,
Var X = σ 2 .
( )
So, the population parameters appear explicitly in the pdf. We say, X ∼ N µ,σ 2 . The figure shows the
geometry of the normal distribution.

B-10 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Multinormal Distribution:

X is said to have a multinormal distribution if


      f_X(x) = ( (2π)^p |Σ| )^{-1/2} exp( −½ (x − µ)′ Σ^{-1} (x − µ) ).

It is obvious that for p = 1, the pdf is the same as univariate normal.


Prove that:

E X = µ,
Cov (X ) = Σ.

B-11 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


You can generate a covariance matrix by choosing σi2 ,i = 1,...,p; then choose ρij , where −1 ≤ ρij ≤ 1. In the
case of p = 2, we have many cases:
(a) ρ > 0
(b) ρ = 0, and σ2 < σ1 .
(c) ρ = 0, and σ2 = σ1 .

For diagonal Σ (uncorrelated components),

      f_X(x) = ( Π_{i=1}^{p} (2πσi²)^{-1/2} ) exp( −½ Σ_{i=1}^{p} ((xi − µi)/σi)² )
             = Π_{i=1}^{p} (2πσi²)^{-1/2} exp( −½ ((xi − µi)/σi)² ),

which is the joint of p independent normals. This is not the case for other distributions.

B-12 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Transformation by Projection
For any random vector X_{p×1}, any matrix of m column vectors A_{p×m} can be seen as a set of m projections from R^p to R^m:  A′_{m×p} = ( α1′ ; ... ; αm′ ).

      E( A′_{m×p} X_{p×1} ) = A′ E X = A′µ,
      Cov( A′_{m×p} X_{p×1} ) = A′ Cov(X) A = A′ΣA.

Theorem
If X ∼ N(µ, Σ) then A′X ∼ N(A′µ, A′ΣA).

Simple example (a single projection): A′_{1×p} = α′

if α′ = (1, 0, ..., 0), we simply get the first component X1.
if α′ = (1/√2, 1/√2), we project on the π/4 direction; this will be the value of the new r.v. in the direction of α.

      α′X = ||α|| ||X|| cos(X, α) = ||α|| × Projected Length.

If we need the projected length only, then project on the unit vector α/||α||; then

      (α′/||α||) X = ||X|| cos(X, α).
B-13 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
Multiply this scalar by the unit vector in the direction of the projection to get the new feature Z in the direction α:

      Z = ( α/||α|| ) ( α′/||α|| ) X
        = ( αα′ / α′α ) X
        = P^(α)_{p×p} X_{p×1},

where we call P the projection matrix of the direction α.

B-14 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Matlab Code B.1:
mu =3+ zeros (1 , 2) ; sigma = eye (2) ; samples =100;
X = mvnrnd ( mu , sigma , samples ) ;
scatter ( X (: ,1) ,X (: ,2) ,20 , ’* r ’) ; hold on

alpha =[1/ sqrt (2) ; 1/ sqrt (2) ];


proj = X * alpha ;

Xnew = repmat ( proj , [1 2]) .* repmat ( alpha ’ , [ samples 1]) ;


scatter ( Xnew (: , 1) , Xnew (: , 2) , ’* b ’) ;

% the following should be equivalent


var ( proj )
alpha ’* cov ( X ) * alpha

B-15 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Estimation of scalar quantities can be obtained from the multivariate versions by:
Z = (x1, ..., xn)′ has mean µ1 and covariance matrix σ²I (since xi, i = 1, ..., n, are i.i.d.), where 1 = (1, ..., 1)′. Then,

      X̄ = (1/n) Σ_i xi = (1/n) 1′Z,
      E X̄ = (1/n) 1′(µ1) = µ (1/n) 1′1 = µ,
      Var X̄ = (1/n) 1′ (σ²I) (1/n) 1 = (σ²/n²) 1′1 = σ²/n.

B-16 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Appendix C

Fast Revision on Geometry, Linear Algebra,


and Matrix Theory

C-1 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Inner product and orthogonalization

For vectors in p-dimensional space, the inner product 〈x, y〉 is the dot product x′y,

      〈x, y〉 = x′y = Σ_{i=1}^{p} xi yi.

When x′y is zero we say they are orthogonal. It can be shown that, for p ≤ 3,

      cos θ = x′y / ( ||x|| · ||y|| ) = x′y / ( √(x′x) · √(y′y) ).
Then, we can generalize this definition in higher dimensions, and define the angle between two vectors for
p > 3.

Simple example (a single projection): X_{p×1} = X

if X′ = (1, 0, ..., 0), we simply get the first component Y1.
if X′ = (1/√2, 1/√2), we project on the π/4 direction; this will be the value of the new vector in the direction of X.
C-2 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
      X′Y = ||X|| ||Y|| cos(Y, X) = ||X|| × Projected Length.

If we need the projected length only, then project on the unit vector X/||X||; then

      (X′/||X||) Y = ||Y|| cos(Y, X).

Multiply this scalar by the unit vector in the direction of the projection to get the new component in the direction X:

      Ŷ = ( X/||X|| ) ( X′/||X|| ) Y
        = X β̂
        = ( XX′ / X′X ) Y
        = P^(X)_{p×p} Y_{p×1},

where we call P the projection matrix of the direction X .


For decomposition (not projection yet) on a set of vectors constituting the columns of X = (X1, ..., Xn),

      Ŷ = X1 β̂1 + ... + Xn β̂n = Xβ̂.
C-3 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
We need to minimize the remaining error

      e = Y − Xβ̂,
      ||e||² = 〈 Y − Xβ̂, Y − Xβ̂ 〉
             = 〈Y, Y〉 − 2β′ ( 〈X1, Y〉, ..., 〈Xn, Y〉 )′ + β′ [ 〈Xi, Xj〉 ]_{i,j} β
             = Y′Y − 2β′X′Y + β′X′Xβ,

      ∇||e||² = −2X′Y + 2X′Xβ = −2X′( Y − Xβ ),

which means that (at the minimum) the error is perpendicular to each component Xi; i.e.,

      〈Xi, e〉 = 0,  or  Xi′e = 0.

The solution will be

      β̂ = (X′X)^{-1} X′Y.

Therefore, the projection Ŷ is

      Ŷ = Xβ̂
        = X (X′X)^{-1} X′Y
C-4 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
and the projection matrix P is

      P = X (X′X)^{-1} X′.

The very interesting thing is

      X′X = [ Xi′Xj ]_{i,j=1,...,n};

if we choose the basis Xi orthogonal,

      X′X = diag( ||X1||², ..., ||Xn||² ),
C-5 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


 1  
∥X1 ∥2 X1′
( )  .. 
P= X1 ... Xn 

..
. 
 . 
1
∥Xn ∥2
Xn′
X1 X1′ Xn Xn′
= + ... +
X1′ X1 Xn′ Xn
X1 X1′ Xn Xn′
= + ... +
∥X1 ∥ ∥X1 ∥ ∥Xn ∥ ∥Xn ∥
= P1 + ... + Pn

The projection will be


Yb = P1 Y + ... Pn Y
The orthogonality of the basis made it possible to project on each and sum up the projections. If the set of
basis Xi span the whole space, then they are complete and Yb = Y and the error is zero; which means we can
express Y in terms of the new subspace X. And for orthonormal basis, where ∥Xi ∥ = 1

Pi = Xi Xi′ ,
βbi = Xi′ Y

C-6 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Appendix D

Fast Revision on Multivariate Statistics

D-1 Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.


Bibliography

Ambroise, C., McLachlan, G. J., 2002. Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression data. PNAS 99 (10), 6562–6566.

Cherkassky, V. S., Mulier, F., 1998. Learning from data : concepts, theory, and methods. Wiley, New York.

Dave, S. S., Wright, G., Tan, B., Rosenwald, A., Gascoyne, R. D., Chan, W. C., Fisher, R. I., Braziel, R. M., Rimsza, L. M., Grogan, T. M., Miller, T. P.,
LeBlanc, M., Greiner, T. C., Weisenburger, D. D., Lynch, J. C., Vose, J., Armitage, J. O., Smeland, E. B., Kvaloy, S., Holte, H., Delabie, J., Connors,
J. M., Lansdorp, P. M., Ouyang, Q., Lister, T. A., Davies, A. J., Norton, A. J., Muller-Hermelink, H. K., Ott, G., Campo, E., Montserrat, E., Wilson,
W. H., Jaffe, E. S., Simon, R., Yang, L., Powell, J., Zhao, H., Goldschmidt, N., Chiorazzi, M., Staudt, L. M., 2004. Prediction of Survival in Follicular
Lymphoma Based on Molecular Features of Tumor-Infiltrating Immune cells. New England Journal of Medicine November 351 (21), 2159–2169.

Devroye, L., 1982. Any Discrimination Rule Can Have an Arbitrarily Bad Probability of Error for Finite Sample Size. Pattern Analysis and Machine
Intelligence, IEEE Transactions on PAMI-4 (2), 154–157.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield,
C. D., Lander, E. S., 1999. Molecular Classification of Cancer: Class Discovery and Class Prediction By Gene Expression Monitoring. Science
286 (5439), 531–537.

Gur, D., Wagner, R. F., Chan, H. P., 2004. On the Repeated Use of Databases for Testing Incremental Improvement of Computer-Aided Detection
schemes. Acad Radiol 11 (1), 103–105.

Hastie, T., Tibshirani, R., Friedman, J. H., 2009. The elements of statistical learning: data mining, inference, and prediction, 2nd Edition. Springer,
New York.
Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
Lee, S., 2008. Mistakes in Validating the Accuracy of a Prediction Classifier in High-Dimensional But Small-Sample Microarray data. Statistical Meth-
ods in Medical Research 17 (6), 635–642.
URL https://doi.org/10.1177/0962280207084839

Lim, T. S., Loh, W. Y., Shih, Y. S., 2000. A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classifica-
tion algorithms. Machine Learning 40 (3), 203–228.

Randles, R. H., Wolfe, D. A., 1979. Introduction to the theory of nonparametric statistics. Wiley, New York.

Simon, R., Radmacher, M. D., Dobbin, K., McShane, L. M., 2003. Pitfalls in the Use of Dna Microarray Data for Diagnostic and Prognostic classifica-
tion. Journal of the National Cancer Institute 95 (1), 14–18.

Tibshirani, R., 2005. Immune Signatures in Follicular lymphoma. N Engl J Med 352 (14), 1496–1497.
URL https://doi.org/10.1056/NEJM200504073521422

Copyright © 2011, 2019 Waleed A. Yousef. All Rights Reserved.
