Intro To Data Science Lecture 5
Introduction to Data Science for Civil Engineers
Lecture 3a. Assessing Model Accuracy, Bayes Classifier, and KNN
Fall 2022

OUTLINE
Assessing Model Accuracy
Measuring the Quality of Fit
The Bias-Variance Trade-off
The Classification Setting
Bayes Classifier
K-Nearest Neighbors (KNN)

Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with Applications in R" (Springer, 1st Edition, 2013; 2nd Edition, 2021) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
The MSE discussed so far is based on the Training Data, in which the response variable 𝑌 has been observed. What is more important is how well the model works on new data, known as Test Data, in which the response variable 𝑌 is unknown.

In general, a more flexible method will fit the training data better. More flexible methods are less restrictive in the possible shape of 𝑓 as compared to less flexible methods. Less flexible methods, however, are easier to interpret. There is a trade-off between model flexibility and interpretability.
EXAMPLE 1: HIGHLY NONLINEAR FUNCTION, HIGH VARIANCE IN Y
EXAMPLE 2: LESS NONLINEAR FUNCTION, HIGH VARIANCE IN Y
Simulated data from 𝑌 = 𝑓(𝒙) + 𝜀, where 𝑣𝑎𝑟(𝜀) is the variance of the noise.
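As a rough illustration (not from the slides), the sketch below simulates data from a nonlinear 𝑓 with noise 𝜀 and compares training and test MSE for fits of increasing flexibility. Polynomial degree stands in for the smoothing splines used in the figures; the true function, noise level, and all parameter values are assumptions.

```python
# Sketch: simulate Y = f(x) + eps, then compare training vs. test MSE
# as model flexibility (polynomial degree) increases.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2.5 * x) + 0.5 * x          # assumed "true" nonlinear function

n, sigma = 50, 0.5                             # sigma**2 = var(eps), irreducible error
x_train = rng.uniform(0, 4, n)
y_train = f(x_train) + rng.normal(0, sigma, n)
x_test = rng.uniform(0, 4, 1000)
y_test = f(x_test) + rng.normal(0, sigma, 1000)

for degree in (1, 3, 10, 15):                  # low to high flexibility
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training MSE keeps dropping as the degree grows, while test MSE eventually rises and never falls below the irreducible error 𝑣𝑎𝑟(𝜀).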
LEFT panel: Black = truth; Orange = linear regression; Blue = smoothing spline; Green = smoothing spline (more flexible).
RIGHT panel: Red = test MSE; Grey = training MSE; Dashed = minimum possible test MSE (irreducible error).
The more flexible/complex a method is, the less bias it will generally have.
Can variance be reduced by increasing the size of the training data set?
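A small Monte Carlo sketch of this question (an illustration, not the slides' material): holding flexibility fixed, it estimates the variance of the fitted value at a test point across many simulated training sets of two different sizes. The true function, noise level, and degree-5 polynomial fit are assumptions.

```python
# Sketch: variance of a fitted value fhat(x0) across repeated training sets,
# for a small and a large training-set size, at fixed model flexibility.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2.5 * x)
x0 = 2.0                                       # fixed test point

for n in (30, 300):                            # small vs. large training set
    fits_at_x0 = []
    for _ in range(500):                       # 500 independent training sets
        x = rng.uniform(0, 4, n)
        y = f(x) + rng.normal(0, 0.5, n)
        coefs = np.polyfit(x, y, 5)            # flexibility held fixed (degree 5)
        fits_at_x0.append(np.polyval(coefs, x0))
    print(f"n = {n:4d}: var of fhat(x0) = {np.var(fits_at_x0):.4f}")
```

With more training data, the fitted value fluctuates less from one training set to the next, so the variance component shrinks.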
For regression problems, the MSE is used to assess the accuracy of the statistical learning method. For classification problems, the error rate may be used instead:
Error Rate = (1/n) Σᵢ 𝐼(Yᵢ ≠ Ŷᵢ)
𝐼(Yᵢ ≠ Ŷᵢ) is an indicator function, which equals 1 if Yᵢ ≠ Ŷᵢ and 0 if Yᵢ = Ŷᵢ. Thus the error rate represents the fraction of incorrect classifications, or misclassifications.

The Bayes classifier assigns each observation to the most likely class, given its predictor values. That is, it assigns an observation with predictor vector 𝒙 to the class 𝑗 for which the conditional probability 𝑃(𝑌 = 𝑗 | 𝑋 = 𝒙) is the largest.
The test error rate of the Bayes classifier is called the Bayes error rate.
In reality, it is impossible to use the Bayes classifier because the conditional probability is unknown.
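To make the two definitions concrete, here is a minimal sketch: the error rate computed with the indicator 𝐼(Yᵢ ≠ Ŷᵢ), and a toy Bayes classifier that picks the class with the largest conditional probability, assuming (unrealistically) that those probabilities are known. The labels and probability values are made up for illustration.

```python
# Sketch: misclassification rate via the indicator I(y != yhat),
# plus a toy Bayes classifier with assumed-known conditional probabilities.
import numpy as np

y_true = np.array(["orange", "blue", "blue", "orange", "blue"])
y_hat  = np.array(["orange", "orange", "blue", "orange", "blue"])
error_rate = np.mean(y_true != y_hat)          # (1/n) * sum of I(y_i != yhat_i)
print(error_rate)                              # 0.2

# Toy Bayes classifier: pick the class j maximizing P(Y = j | X = x).
classes = ["orange", "blue"]
cond_prob = {"orange": 0.7, "blue": 0.3}       # assumed P(Y = j | X = x) at some x
bayes_prediction = max(classes, key=lambda j: cond_prob[j])
print(bayes_prediction)                        # "orange"
```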
For example, if the majority of the 𝑌ᵢ's among the K nearest neighbors are orange, we predict 𝑌 (corresponding to the test observation 𝒙) to be orange; otherwise we guess blue.
With K = 3, we find the three points closest to 𝒙 and check which category most of the three points belong to.
The smaller K is, the more flexible the method will be.
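A minimal KNN sketch of the K = 3 majority vote described above; the training points, labels, and test point are hypothetical.

```python
# Sketch: classify a test point x0 by a majority vote among its
# K = 3 nearest training points (Euclidean distance).
import numpy as np
from collections import Counter

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [2.9, 3.0], [0.9, 1.3]])
y_train = np.array(["blue", "blue", "orange", "orange", "blue"])
x0 = np.array([1.1, 1.0])                      # test observation

K = 3
dists = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances to x0
nearest = np.argsort(dists)[:K]                # indices of the K closest points
vote = Counter(y_train[nearest]).most_common(1)[0][0]
print(vote)                                    # "blue": majority of the 3 neighbors
```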
Euclidean distance = √( Σᵢ (𝑥ᵢ − 𝑦ᵢ)² )
Manhattan distance = Σᵢ |𝑥ᵢ − 𝑦ᵢ|
Hamming distance = Σᵢ 𝐼(𝑥ᵢ ≠ 𝑦ᵢ), where 𝐼(𝑥ᵢ ≠ 𝑦ᵢ) = 1 if 𝑥ᵢ ≠ 𝑦ᵢ; 0 otherwise.
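The three distance measures computed for a pair of illustrative feature vectors (the numbers are made up):

```python
# Sketch: Euclidean, Manhattan, and Hamming distances between two vectors.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))      # sqrt(sum_i (x_i - y_i)^2)
manhattan = np.sum(np.abs(x - y))              # sum_i |x_i - y_i|
hamming   = np.sum(x != y)                     # sum_i I(x_i != y_i)
print(euclidean, manhattan, hamming)           # 2.236..., 3.0, 2
```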
SIMULATED DATA: K = 10
Black curve: KNN decision boundary
Purple dashed line: Bayes decision boundary

K = 1 AND K = 100
K = 1: the KNN decision boundary is too flexible (low bias but high variance).
K = 100: the KNN decision boundary is close to linear (low variance but high bias).
TRAINING VS. TEST ERROR RATES ON THE SIMULATED DATA
Training error rates keep decreasing as K decreases, or equivalently as the flexibility increases.

A FUNDAMENTAL PICTURE
In general, as model flexibility increases, training errors will always decline, but test errors will decline at first (due to the reduction in bias) and then start to increase (due to the increase in variance).
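A sketch that reproduces the shape of this picture on simulated data: training error keeps falling as K decreases (flexibility increases), while test error first falls and then rises. It uses scikit-learn's KNeighborsClassifier; the two simulated Gaussian classes and all parameter choices are assumptions.

```python
# Sketch: KNN training vs. test error rates as a function of K on simulated data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)

def simulate(n):
    # two overlapping Gaussian classes in 2-D
    X0 = rng.normal([0, 0], 1.0, size=(n, 2))
    X1 = rng.normal([1, 1], 1.0, size=(n, 2))
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

X_train, y_train = simulate(100)
X_test, y_test = simulate(1000)

for K in (1, 5, 10, 50, 100):                  # flexibility decreases as K grows
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_train, y_train)
    train_err = 1 - knn.score(X_train, y_train)
    test_err = 1 - knn.score(X_test, y_test)
    print(f"K = {K:3d}: train error {train_err:.3f}, test error {test_err:.3f}")
```

K = 1 drives the training error to (essentially) zero while the test error stays high; intermediate K typically gives the lowest test error.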