AA2 Intro ML 2024


MACHINE LEARNING:
INTRODUCTION
Reading:
• Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer, 2009.
• G. James, D. Witten, T. Hastie, R. Tibshirani, J. Taylor. An Introduction to Statistical Learning with
Applications in Python, Springer, July 2023.

Examples of learning problems


• Predict if a patient will have a second heart attack, based on demographic,
diet and clinical measurements for that patient.
• Predict the price of a stock in 6 months from now, based on the company
performance measures and economic data.
• Identify the numbers in a handwritten ZIP code, from a digitized image.
• Estimate the amount of glucose in the blood of a diabetic person, from the
infrared absorption spectrum of blood.
• Identify the risk factors for prostate cancer, based on clinical and
demographic variables.


Learning from data


• Supervised learning:
• Outcome measurement, usually quantitative (such as a stock price)
or categorical (such as heart attack/no heart attack), that one wishes
to predict.
• Based on a set of features (such as diet and clinical measurements).
• Training set of data contains the outcome and feature
measurements.
• Data is used to build a prediction model.
• In unsupervised learning only features are observed; no
measurements of the outcome are available.

Learning from data


Examples:
• Customize an email spam detection system.
• Identify the risk factors for prostate cancer.
• Identify the numbers in a handwritten zip code.
• Classify the pixels in a LANDSAT image, by usage.


Email spam
• Data from 4601 email messages sent to an individual (named George,
at HP Labs, before 2000). Each is classified as email or spam.
• Objective: build a customized spam filter.
• Input features: relative frequencies of 57 of the most commonly
occurring words and punctuation marks in the email messages.

(Table: average percentage of words or characters in an email message equal to the indicated word or character.)


Scatterplot of the prostate cancer data (figure)


Handwritten digit recognition


• Data: handwritten ZIP codes on envelopes. Each image is a
segment isolating a single digit.
• The images are 16×16 eight-bit grayscale maps, with each pixel
ranging in intensity from 0 to 255.
• Task: predict, from the 16 × 16 matrix of pixel intensities, the
identity of each image (0, 1, . . . , 9) quickly and accurately.
• If it is accurate enough, the resulting algorithm would be used
as part of an automatic sorting procedure for envelopes.
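A minimal sketch of such a classifier, using scikit-learn's bundled 8×8 digits as a stand-in for the 16×16 ZIP code images (which are not assumed to be available here):

```python
# A minimal nearest-neighbor digit classifier. scikit-learn's bundled
# 8x8 digits stand in for the 16x16 ZIP code images described above.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)        # X: (n_samples, 64) flattened images
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))   # accuracy on held-out images
```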

(Figure: sample digitized handwritten digits.)


Predict land usage

Usage ∈ {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil}

Learning from data


• Classification of electricity customers based on smart-metering data
• Mortality prediction for septic patients
• Increase grassland production while minimizing fertilizer inputs
• Optimization of EV charging
• Non-rigid registration of 3D ultrasound for neurosurgery
• Wildfire burnt-area classification
• Generalizing GANs for robot fault diagnosis


Classification of electricity prosumers (figures)


PSO for feature selection using SVM

(Figure: flowchart of the particle-swarm search, including a step that resets the swarm best and performs a local search.)


Estimated pasture characteristics (figures)


Optimization of EV charging (figures)


Registration of 3D ultrasound for neurosurgery (figures)


Computational vision and drones


• Wildfire burnt-area classification from UAV-based images (figures)


Generalizing GAN for robot fault diagnosis

• Generative adversarial networks (GANs) augment data for fault diagnosis of an industrial robot. (Figures.)

Supervised learning problem


• Outcome measurement Y (also called dependent variable, response,
target).
• Vector of p predictor measurements X (also called inputs, regressors,
covariates, features, independent variables).
• In the regression problem, Y is quantitative (e.g. price, blood pressure).
• In the classification problem, Y takes values in a finite, unordered set
(survived/died, digit 0-9, cancer class of tissue sample).
• We have training data (x1, y1), …, (xN, yN). These are observations
(examples, instances) of these measurements.


Objectives
On the basis of the training data we would like to:
• Accurately predict unseen test cases.
• Understand which inputs affect the outcome, and how.
• Assess the quality of our predictions and inferences.


Philosophy
• It is important to understand the ideas behind the various techniques, in order to
know how and when to use them.
• One has to understand the simpler methods first, in order to grasp the more
sophisticated ones.
• It is important to accurately assess the performance of a method, to know how
well or how badly it is working (simpler methods often perform as well as fancier
ones!)
• This is an exciting research area, having important applications in engineering,
science, industry, finance, etc.
• Machine learning is a fundamental ingredient in the training of a modern
engineer or data scientist.

Unsupervised learning
• No outcome variable, just a set of predictors (features) measured on a
set of samples.
• The objective is fuzzier (there is no Y!); for example:
• find groups of samples that behave similarly,
• find features that behave similarly,
• find (non)linear combinations of features with the most variation.
• Difficult to know how well you are doing.
• Different from supervised learning but can be useful as a pre-
processing step for supervised learning.
• It is much more difficult to collect labeled data than unlabeled data!
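To make "find groups of samples that behave similarly" concrete, here is a minimal clustering sketch; the synthetic blobs are an illustrative assumption, not course data:

```python
# Unsupervised sketch: only features X are observed (no outcome Y);
# k-means looks for groups of samples that behave similarly.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # synthetic features
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```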

Statistical Learning vs Machine Learning


• Machine learning arose as a subfield of Artificial Intelligence.
• Statistical learning arose as a subfield of Statistics.
• There is much overlap – both fields focus on supervised and unsupervised
problems:
• Machine learning has a greater emphasis on large scale applications and
prediction accuracy.
• Statistical learning emphasizes models and their interpretability, precision and
stochastic uncertainty.
• But the distinction has become more and more blurred, and there is a
great deal of “cross-fertilization”.

What is Machine Learning?

• Sales vs TV, radio, and newspaper budgets (figure). Blue line: least-squares fit
(simple linear regression) of sales on each variable separately.
• How can we predict Sales using the three budgets jointly?
Sales ≈ f(TV, radio, newspaper)

Notation
• Sales is a response or target that we wish to predict. We generically refer to the
response as Y.
• TV is a feature, or input, or predictor; we name it X1. Likewise, Radio is X2.
• We can refer to the input vector as
X = (X1, X2, X3)ᵀ
• The model is written as
Y = f(X) + ε
• where ε captures measurement errors and other discrepancies.
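A minimal sketch of fitting such a model, assuming the Advertising.csv file distributed with the ISLR book (columns TV, radio, newspaper, sales):

```python
# Sketch: approximate Sales ≈ f(TV, radio, newspaper) with a linear model.
# Assumes the Advertising.csv data file from the ISLR book is available.
import pandas as pd
from sklearn.linear_model import LinearRegression

ads = pd.read_csv("Advertising.csv")
X = ads[["TV", "radio", "newspaper"]]     # predictors X1, X2, X3
y = ads["sales"]                          # response Y

model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_)
print("coefficients:", dict(zip(X.columns, model.coef_)))
```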


What is f(X) good for?


• With a good f we can make predictions of Y at new points X = x.
• We can understand which components of X = (X1, X2, …, Xp) are
important in explaining Y, and which are irrelevant.
• Example: Seniority and Years of Education have a big impact on Income,
but Marital Status typically does not.
• Depending on the complexity of f, we may be able to understand how
each component Xj of X affects Y.


Is there an ideal f(X)?

• What is a good value for f(X) at a selected value of X, say X = 4? There can be
many Y values at X = 4. A good value is
f(4) = E(Y | X = 4)
• E(Y | X = 4) means the expected value (average) of Y given X = 4.
• This ideal f(x) = E(Y | X = x) is called the regression function.

The regression function f(x)


• Is also defined for a vector X; e.g.
f(x) = f(x1, x2, x3) = E(Y | X1 = x1, X2 = x2, X3 = x3)
• Is the ideal or optimal predictor of Y with regard to mean-squared prediction
error: f(x) = E(Y | X = x) is the function that minimizes
E[(Y − g(X))² | X = x]
over all functions g at all points X = x.
• ε = Y − f(x) is the irreducible error – i.e., even if we knew f(x), we would
still make errors in prediction, since at each X = x there is typically a
distribution of possible Y values.
• For any estimate f̂(x) of f(x):
E[(Y − f̂(X))² | X = x] = [f(x) − f̂(x)]² + Var(ε)
where [f(x) − f̂(x)]² is the reducible error and Var(ε) the irreducible error.

How to estimate f
• Typically, we have few if any data points with X = 4 exactly.
• So, we cannot compute E(Y | X = x)!
• Relax the definition and let
f̂(x) = Ave(Y | X ∈ N(x))
• where N(x) is some neighborhood of x.
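A few lines of NumPy make the relaxed definition concrete; the one-dimensional data and the choice of N(x) as the k nearest training points are illustrative assumptions:

```python
# Sketch: estimate f(x) = E(Y | X = x) by averaging Y over a neighborhood
# N(x) -- here, the k nearest training points. Data are hypothetical.
import numpy as np

def knn_average(x, X_train, y_train, k=5):
    idx = np.argsort(np.abs(X_train - x))[:k]   # indices of k nearest neighbors
    return y_train[idx].mean()                  # Ave(Y | X in N(x))

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, 200)
y_train = np.sin(X_train) + rng.normal(0, 0.3, 200)   # Y = f(X) + eps
print(knn_average(4.0, X_train, y_train))             # should be near sin(4)
```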


Nearest neighbor
• Nearest-neighbor averaging can be pretty good for small p, i.e. p ≤ 4, and
large values of N.
• Nearest neighbor methods can be lousy when p is large. Reason: the curse
of dimensionality. Nearest neighbors tend to be far away in high
dimensions.
• We need to get a reasonable fraction of the N values of yi to average to
bring the variance down, e.g. 10%.
• A 10% neighborhood in high dimensions need no longer be local, so we
lose the spirit of estimating E(Y | X = x) by local averaging.
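A quick simulation of this point, under the illustrative assumption of points drawn uniformly on the unit hypercube:

```python
# Sketch: nearest neighbors drift far away as the dimension p grows,
# for N points drawn uniformly on the unit hypercube.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
N = 1000
for p in (1, 2, 10, 100):
    X = rng.uniform(size=(N, p))
    d = cdist(X, X)                  # all pairwise Euclidean distances
    np.fill_diagonal(d, np.inf)      # ignore distance of a point to itself
    print(f"p = {p:3d}: mean nearest-neighbor distance = {d.min(axis=1).mean():.2f}")
```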


The curse of dimensionality (figure)

Parametric and structured models


• The linear model is an important example of a parametric model:
f_L(X) = β0 + β1X1 + β2X2 + … + βpXp
• A linear model is specified in terms of p + 1 parameters β0, β1, …, βp.
• We estimate the parameters by fitting the model to training data, as sketched below.
• Although it is almost never correct, a linear model often serves as a good
and interpretable approximation to the unknown true function f(X).
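A minimal sketch of the fitting step, using ordinary least squares on synthetic data (the coefficient values and noise level are illustrative assumptions):

```python
# Sketch: estimate the p + 1 parameters of the linear model by least
# squares on training data.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
beta = np.array([2.0, 1.0, -0.5, 0.25])             # true b0, b1, b2, b3
y = beta[0] + X @ beta[1:] + rng.normal(0, 0.1, N)  # Y = f_L(X) + eps

A = np.column_stack([np.ones(N), X])                # prepend intercept column
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print("estimated parameters:", beta_hat.round(2))   # close to beta
```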


Example
A linear model f̂_L(X) = β̂0 + β̂1X gives a reasonable fit here.

A quadratic model f̂_Q(X) = β̂0 + β̂1X + β̂2X² fits slightly better.

Simulated example

• Red points are simulated values of income from the model
income = f(education, seniority) + ε
• f is the blue surface.

Linear regression model


• Linear regression model fit to the simulated data:
f̂_L(education, seniority) = β̂0 + β̂1 × education + β̂2 × seniority

Spline regression model


• More flexible regression model f_S(education, seniority) fit to the simulated data.
Uses a technique called a thin-plate spline to fit a flexible surface (to be covered
later in the course).
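SciPy ships a thin-plate spline kernel in its RBF interpolator; below is a minimal sketch on synthetic stand-in data, not the book's Income data:

```python
# Sketch: a thin-plate spline surface in the spirit of
# f_S(education, seniority), via SciPy's RBFInterpolator.
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))               # (education, seniority) pairs
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 100)

# smoothing > 0 regularizes the surface; smoothing = 0 reproduces the
# training data exactly -- the overfitting case shown on the next slide.
spline = RBFInterpolator(X, y, kernel="thin_plate_spline", smoothing=1e-3)
print(spline(np.array([[0.5, 0.5]])))              # prediction at a new point
```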


Spline regression model


• Spline regression model f_S(education, seniority) fit to the simulated data, with
zero errors on the training data! Also known as overfitting.


Some tradeoffs
• Prediction accuracy versus interpretability.
• Linear models are easy to interpret; thin-plate splines are not.
• Good fit versus overfit or underfit.
• How do we know when the fit is just right?
• Parsimony versus black-box.
• We often prefer a simpler model involving fewer variables over a
black-box predictor involving them all.


Flexibility vs Interpretability
(Figure: methods arranged by flexibility and interpretability, including fuzzy models.)

Assessing model accuracy


• Suppose we fit a model f̂(x) to some training data Tr = {(xi, yi)}, i = 1, …, N,
and we wish to see how well it performs.
• We could compute the average squared prediction error over Tr:
MSE_Tr = Ave_{i∈Tr}[(yi − f̂(xi))²]
• This may be biased toward more overfit models.
• Instead, we should, if possible, compute it using fresh test data
Te = {(xi, yi)}, i = 1, …, M:
MSE_Te = Ave_{i∈Te}[(yi − f̂(xi))²]
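A small simulation of this effect, with polynomial degree standing in for flexibility (the true function and noise level are illustrative assumptions):

```python
# Sketch: MSE_Tr keeps falling as flexibility grows, while MSE_Te turns
# back up. Polynomial degree stands in for flexibility.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)
x_tr = rng.uniform(0, 3, 50);  y_tr = f(x_tr) + rng.normal(0, 0.3, 50)
x_te = rng.uniform(0, 3, 200); y_te = f(x_te) + rng.normal(0, 0.3, 200)

for degree in (1, 3, 9, 15):
    coefs = np.polyfit(x_tr, y_tr, degree)        # fit on training data only
    mse = lambda x, y: np.mean((y - np.polyval(coefs, x)) ** 2)
    print(f"degree {degree:2d}: MSE_Tr = {mse(x_tr, y_tr):.3f}, "
          f"MSE_Te = {mse(x_te, y_te):.3f}")
```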

Example

• Black curve is the truth. Red curve (right panel) is MSE_Te; grey curve is MSE_Tr.
Orange, blue, and green curves/squares correspond to fits of different flexibility.

Another example

• Here the truth is smoother, so the smoother fit and linear model do really well.


And another example

• Here the truth is wiggly and the noise is low, so the more flexible fits do
the best.

Bias-variance tradeoff
• Suppose we have fit a model f̂(x) to some training data Tr, and let (x0, y0) be a
test observation drawn from the population. If the true model is Y = f(X) + ε
(with f(x) = E(Y | X = x)), then
E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε)
• The expectation averages over the variability of y0 as well as the variability in
Tr. Note that
Bias(f̂(x0)) = E[f̂(x0)] − f(x0)
• Typically, as the flexibility of f̂ increases, its variance increases and its bias
decreases. So, choosing the flexibility based on the average test error
amounts to a bias-variance tradeoff (see the simulation sketch below).
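A Monte-Carlo sketch that checks the decomposition numerically, under an illustrative setup (known f, Gaussian noise, degree-3 polynomial fits):

```python
# Sketch: Monte-Carlo check of
#   E[(y0 - fhat(x0))^2] = Var(fhat(x0)) + Bias(fhat(x0))^2 + Var(eps),
# refitting a degree-3 polynomial to many fresh training sets.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)
x0, sigma = 1.5, 0.3
preds = []
for _ in range(500):                          # a fresh training set each time
    x = rng.uniform(0, 3, 50)
    y = f(x) + rng.normal(0, sigma, 50)
    preds.append(np.polyval(np.polyfit(x, y, 3), x0))
preds = np.array(preds)

var, bias2 = preds.var(), (preds.mean() - f(x0)) ** 2
print(f"Var = {var:.4f}, Bias^2 = {bias2:.4f}, irreducible = {sigma**2:.4f}")
print(f"sum = {var + bias2 + sigma**2:.4f}")  # approx. expected test error at x0
```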


Bias-variance tradeoff for the examples (figure)

Classification problems
• Here the response variable Y is qualitative.
• Examples: email is one of C = {spam, ham} (ham = good email); digit class is
one of C = {0, 1, …, 9}.
• Our goals are to:
• Build a classifier C(X) that assigns a class label from C to a future
unlabeled observation X.
• Assess the uncertainty in each classification.
• Understand the roles of the different predictors among
X = (X1, X2, …, Xp).


Example

• Is there an ideal C(X)? Suppose the K elements in C are numbered 1, 2, …, K. Let
pk(x) = Pr(Y = k | X = x), k = 1, 2, …, K.
• These are the conditional class probabilities at x; e.g. see the little barplot at
x = 5 (figure). Then the Bayes optimal classifier at x is
C(x) = j if pj(x) = max{p1(x), p2(x), …, pK(x)}
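A minimal sketch of this rule when the pk(x) are known exactly; the two-Gaussian setup with equal priors is an illustrative assumption, not the slide's data:

```python
# Sketch: the Bayes optimal classifier with known class probabilities --
# two Gaussian classes with equal priors.
import numpy as np
from scipy.stats import norm

def bayes_classify(x, means=(2.0, 5.0), sds=(1.0, 1.0), priors=(0.5, 0.5)):
    # p_k(x) is proportional to prior_k * density_k(x); pick the largest
    scores = [pr * norm.pdf(x, m, s) for m, s, pr in zip(means, sds, priors)]
    return int(np.argmax(scores)) + 1      # classes numbered 1, 2

print(bayes_classify(3.0))   # class 1: x is closer to the first mean
print(bayes_classify(4.8))   # class 2
```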

Estimation

• Nearest-neighbor averaging can be used as before.


• It also breaks down as the dimension grows. However, the impact on Ĉ(x)
is less than on the estimates p̂k(x), k = 1, …, K.

Classification: some details


• Typically, we measure the performance of Ĉ(x) using the
misclassification error rate:
Err_Te = Ave_{i∈Te} I[yi ≠ Ĉ(xi)]
• The Bayes classifier (using the true pk(x)) has the smallest error (in the
population).
• Techniques used for classification in this course:
• Support vector machines,
• Logistic regression, etc.


Example
• K-nearest neighbors in two dimensions
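A minimal sketch, using scikit-learn's two-moons data as a synthetic stand-in for the example in the figures:

```python
# Sketch: K-nearest neighbors in two dimensions, scored with the
# misclassification error rate above.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for K in (1, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_tr, y_tr)
    err = np.mean(knn.predict(X_te) != y_te)      # Ave(I[y_i != Chat(x_i)])
    print(f"K = {K:3d}: test error rate = {err:.3f}")
```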



Error rates of the KNN example (figure)
