AA2 Intro ML 2024
MACHINE LEARNING:
INTRODUCTION
Reading:
• Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
• G. James, D. Witten, T. Hastie, R. Tibshirani, and J. Taylor. An Introduction to Statistical Learning with Applications in Python. Springer, July 2023.
Email spam
• Data from 4601 email messages sent to an individual (named George, at HP Labs, before 2000). Each message is classified as genuine email or spam.
• Objective: build a customized spam filter.
• Input features: relative frequencies of 57 of the most commonly
occurring words and punctuation marks in the email messages.
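• A minimal sketch of how such a filter might be fit, assuming the 57 frequency features and the spam/email labels are already available as arrays; the random arrays below are only placeholders for the real data.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: in practice X holds the 57 relative-frequency features
# for the 4601 messages and y the email/spam labels (1 = spam).
rng = np.random.default_rng(0)
X = rng.random((4601, 57))
y = rng.integers(0, 2, size=4601)

# Hold out a test set, fit a logistic-regression spam filter, and assess it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```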
Scatterplot of cancer data
Usage ∈ {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil}
Objectives
On the basis of the training data we would like to:
• Accurately predict unseen test cases.
• Understand which inputs affect the outcome, and how.
• Assess the quality of our predictions and inferences.
Philosophy
• It is important to understand the ideas behind the various techniques, in order to
know how and when to use them.
• One has to understand the simpler methods first, in order to grasp the more
sophisticated ones.
• It is important to accurately assess the performance of a method, to know how
well or how badly it is working (simpler methods often perform as well as fancier
ones!)
• This is an exciting research area, with important applications in engineering, science, industry, finance, and beyond.
• Machine learning is a fundamental ingredient in the training of a modern
engineer or data scientist.
Unsupervised learning
• No outcome variable, just a set of predictors (features) measured on a
set of samples.
• The objective is fuzzier (there is no Y!), e.g.:
• find groups of samples that behave similarly,
• find features that behave similarly,
• find (non)linear combinations of features with the most variation.
• Difficult to know how well you are doing.
• Different from supervised learning, but it can be useful as a pre-processing step for supervised learning.
• It is much more difficult to collect labeled data!
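• As a concrete illustration of the first objective (finding groups of samples that behave similarly), a minimal k-means sketch on synthetic 2-D data; the data and the choice of k = 3 are assumptions for the example.
```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: three blobs around assumed centres (illustration only).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.8, size=(100, 2)) for c in centers])

# Unsupervised: no Y is used, we only look for structure in the features.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```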
• Sales vs TV, radio, and newspaper budgets. Blue line: least squares fit (linear
regression) of sales to that variable.
• How can we predict Sales using all three budgets jointly?
Sales ≈ f (TV, radio, newspaper)
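• One way to estimate such an f jointly is multiple linear regression; a sketch below, where the file name Advertising.csv and its column names are assumptions about the data layout, not taken from the slides.
```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumed file/column layout for the Advertising data; adjust for your copy.
ads = pd.read_csv("Advertising.csv")
X = ads[["TV", "radio", "newspaper"]]
y = ads["sales"]

# Fit sales ≈ b0 + b1*TV + b2*radio + b3*newspaper by least squares.
model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_)
print("coefficients:", dict(zip(X.columns, model.coef_)))
```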
Notation
• Sales is a response or target that we wish to predict. We generically refer to the
response as Y.
• TV is a feature, or input, or predictor; we name it X1. Likewise, Radio is X2.
• We can refer to the input vector collectively as
X = (X1, X2, X3)ᵀ
• The model is written as
Y = f(X) + ε
• where ε captures measurement errors and other discrepancies.
• What is a good value for f(X) at a selected value of X, say X = 4? There can be many Y values at X = 4. A good value is
f(4) = E(Y | X = 4)
• E(Y | X = 4) means the expected value (average) of Y given X = 4.
• This ideal f(x) = E(Y | X = x) is called the regression function.
How to estimate f
• Typically, we have few if any data points with X = 4 exactly.
• So, we cannot compute E(Y | X = x)!
• Relax the definition and let
!! " "# = A%&"# ' $ ! 𝒩
% " "##
• where 𝒩(x) is some neighborhood of x.
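• A direct translation of this relaxed definition, with the k nearest training points playing the role of 𝒩(x); the 1-D simulated data and k = 30 are assumptions for illustration.
```python
import numpy as np

# Simulated 1-D training data with (assumed) truth f(x) = sin(x).
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=500)
y = np.sin(x) + rng.normal(scale=0.3, size=500)

def knn_average(x0, x, y, k=30):
    """Average the y-values of the k training points whose x is closest to x0."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

print("estimate of f(4):", knn_average(4.0, x, y))
print("true f(4):       ", np.sin(4.0))
```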
Nearest neighbor
• Nearest neighbor averaging can be pretty good for small p, i.e. p ≤ 4, and large values of N.
• Nearest neighbor methods can be lousy when p is large. Reason: the curse
of dimensionality. Nearest neighbors tend to be far away in high
dimensions.
• We need to get a reasonable fraction of the N values of yi to average to
bring the variance down, e.g. 10%.
• A 10% neighborhood in high dimensions need no longer be local, so we
lose the spirit of estimating E(Y | X = x) by local averaging.
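• The curse of dimensionality is easy to see by simulation; the sketch below (sample size and dimensions chosen arbitrarily) shows the distance from a query point to its nearest neighbour growing with p.
```python
import numpy as np

# N points uniform in the unit cube [0, 1]^p; the distance from the cube's
# centre to its nearest neighbour grows quickly with the dimension p.
rng = np.random.default_rng(3)
N = 1000
for p in (1, 2, 5, 10, 50):
    X = rng.uniform(size=(N, p))
    query = np.full(p, 0.5)
    dists = np.linalg.norm(X - query, axis=1)
    print(f"p = {p:2d}: nearest-neighbour distance = {dists.min():.3f}")
```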
Example
A linear model f̂_L(X) = β̂0 + β̂1X gives a reasonable fit here.
A quadratic model f̂_Q(X) = β̂0 + β̂1X + β̂2X² fits slightly better.
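A sketch of fitting both models by least squares with numpy.polyfit; the mildly curved data-generating model is an assumption for illustration.
```python
import numpy as np

# Simulated data with mild curvature (assumed truth, for illustration only).
rng = np.random.default_rng(4)
x = np.linspace(0, 10, 200)
y = 1.0 + 0.5 * x + 0.05 * x**2 + rng.normal(scale=0.5, size=x.size)

# Compare linear (degree 1) and quadratic (degree 2) least-squares fits.
for degree in (1, 2):
    coefs = np.polyfit(x, y, deg=degree)
    resid = y - np.polyval(coefs, x)
    print(f"degree {degree}: training MSE = {np.mean(resid**2):.3f}")
```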
Some tradeoffs
• Prediction accuracy versus interpretability.
• Linear models are easy to interpret; thin-plate splines are not.
• Good fit versus overfit or underfit.
• How do we know when the fit is just right?
• Parsimony versus black-box.
• We often prefer a simpler model involving fewer variables over a
black-box predictor involving them all.
Flexibility vs Interpretability [figure; methods shown include fuzzy models]
Example
• Black curve is the truth. Red curve on the right is MSE_Te, grey curve is MSE_Tr. Orange, blue and green curves/squares correspond to fits of different flexibility.
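• A sketch of how such curves can be produced: fit models of increasing flexibility (here polynomial degree, an assumed stand-in) and track MSE_Tr and MSE_Te separately; the simulated truth f(x) = sin(x) is also an assumption.
```python
import numpy as np

rng = np.random.default_rng(5)

def simulate(n):
    # Assumed truth f(x) = sin(x) with Gaussian noise, for illustration.
    x = rng.uniform(0, 10, size=n)
    return x, np.sin(x) + rng.normal(scale=0.4, size=n)

x_tr, y_tr = simulate(100)     # training set Tr
x_te, y_te = simulate(1000)    # test set Te

# MSE_Tr keeps falling as flexibility grows, while MSE_Te typically
# bottoms out and rises again (overfitting).
for degree in (1, 3, 5, 8, 10):
    coefs = np.polyfit(x_tr, y_tr, deg=degree)
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}: MSE_Tr = {mse_tr:.3f}, MSE_Te = {mse_te:.3f}")
```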
Another example
• Here the truth is smoother, so the smoother fit and linear model do really well.
• Here the truth is wiggly and the noise is low, so the more flexible fits do
the best.
Bias-variance tradeoff
• Suppose we have fit a model f̂(x) to some training data Tr, and let (x0, y0) be a test observation drawn from the population. If the true model is Y = f(X) + ε (with f(x) = E(Y | X = x)), then:
E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε)
• The expectation averages over the variability of y0 as well as the variability in Tr. Note that
Bias(f̂(x0)) = E[f̂(x0)] − f(x0)
• Typically, as the flexibility of f̂ increases, its variance increases and its bias decreases. So, choosing the flexibility based on the average test error amounts to a bias-variance tradeoff.
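• The decomposition can be checked numerically: repeatedly draw training sets Tr, refit f̂, and compare the two sides at a single x0. A sketch under assumed simulation settings (truth sin(x), noise sd 0.5, and a simple linear fit for f̂):
```python
import numpy as np

rng = np.random.default_rng(6)
f = np.sin            # assumed truth, for illustration
sigma = 0.5           # assumed noise standard deviation
x0 = 2.0              # the test point

# Refit a simple linear model on many independent training sets Tr and
# record its prediction f-hat(x0) each time.
preds = []
for _ in range(2000):
    x = rng.uniform(0, 10, size=50)
    y = f(x) + rng.normal(scale=sigma, size=50)
    slope, intercept = np.polyfit(x, y, deg=1)
    preds.append(intercept + slope * x0)
preds = np.array(preds)

# Fresh test responses y0 at x0, and the two sides of the decomposition.
y0 = f(x0) + rng.normal(scale=sigma, size=preds.size)
lhs = np.mean((y0 - preds) ** 2)
rhs = preds.var() + (preds.mean() - f(x0)) ** 2 + sigma**2
print(f"E[(y0 - fhat(x0))^2] ≈ {lhs:.3f}")
print(f"Var(fhat) + Bias^2 + Var(eps) ≈ {rhs:.3f}")
```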
Classification problems
• Here the response variable Y is qualitative.
• Examples: Email is one of C = {spam, ham} (ham = good email). Digit class is one of C = {0, 1, …, 9}.
• Our goals are to:
• Build a classifier C(X) that assigns a class label from C to a future
unlabeled observation X.
• Assess the uncertainty in each classification.
• Understand the roles of the different predictors among
X = (X1, X2, …, Xp).
Example
• K-nearest neighbors in two dimensions
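• A sketch of a K-nearest-neighbours classifier in two dimensions using scikit-learn; the two simulated Gaussian classes and K = 10 are choices made for illustration.
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two simulated Gaussian classes in the plane (illustration only).
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=0.0, size=(100, 2)),
               rng.normal(loc=2.0, size=(100, 2))])
y = np.repeat([0, 1], 100)

# Classify a new point by majority vote among its K = 10 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=10).fit(X, y)
print("estimated class probabilities at (1, 1):", knn.predict_proba([[1.0, 1.0]]))
```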