Chapter 2

Supervised and unsupervised learning are the two main categories of machine learning models. Supervised learning involves predicting a response variable given predictor variables, while unsupervised learning involves understanding relationships between variables without a response. Cluster analysis is an example of an unsupervised learning technique used to group observations. Semi-supervised learning uses both labeled and unlabeled data. When selecting a machine learning method, we consider whether the response variable is quantitative or qualitative. Regression is used for quantitative responses, while classification handles qualitative responses. Model accuracy is assessed using measures like mean squared error on test data, not training data, to avoid overfitting. The variance-bias tradeoff explains how more flexible models initially improve accuracy but can later hurt it by overfitting.


Supervised vs Unsupervised Learning

Most learning models fall into two categories: supervised or unsupervised. “We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference).”
“Unsupervised learning describes the somewhat more challenging situation in which for every observation i = 1, …, n, we observe a vector of measurements xi but no associated response yi. It is not possible to fit a linear regression model, since there is no response variable to predict.” “What sort of statistical analysis is possible? We can seek to understand the relationships between the variables or between the observations. One statistical learning tool that we may use in this setting is cluster analysis, or clustering. The goal of cluster analysis is to ascertain, on the basis of x1, …, xn, whether the observations fall into relatively distinct groups.” “For instance, if there are p variables in our data set, then p(p − 1)/2 distinct scatter plots can be made, and visual inspection is simply not a viable way to identify clusters. For this reason, automated clustering methods are important.” “For instance, suppose that we have a set of n observations. For m of the observations, where m < n, we have both predictor measurements and a response measurement. For the remaining n − m observations, we have predictor measurements but no response measurement. Such a scenario can arise if the predictors can be measured relatively cheaply but the corresponding responses are much more expensive to collect. We refer to this setting as a semi-supervised learning problem.”
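As a concrete sketch of automated clustering, here is a minimal 1-D k-means implementation. The synthetic data and the choice K = 2 are illustrative assumptions, and k-means is just one of many clustering methods:

```python
import random

def kmeans_1d(xs, k=2, iters=20, seed=0):
    """Minimal 1-D k-means: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        # Assign each observation to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda c: abs(x - centers[c]))
            clusters[j].append(x)
        # Recompute each center as the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated groups of observations, with no response labels.
data = [1.0, 1.2, 0.8, 1.1, 9.0, 9.3, 8.7, 9.1]
print(kmeans_1d(data))  # centers near 1.0 and 9.0
```

The algorithm recovers the two groups automatically, which is exactly what visual inspection of many scatter plots cannot scale to do.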

Regression vs Classification

“Variables can be characterized as either quantitative or qualitative (also known as categorical).”
“We tend to refer to problems with a quantitative response as regression problems, while
those involving a qualitative response are often referred to as classification problems.
However, the distinction is not always that crisp. Least squares linear regression (Chapter 3)
is used with a quantitative response, whereas logistic regression (Chapter 4) is typically
used with a qualitative (two-class, or binary) response.”

“Some statistical methods, such as K-nearest neighbors (Chapters 2 and 4) and boosting
(Chapter 8), can be used in the case of either quantitative or qualitative responses.”

“We tend to select statistical learning methods on the basis of whether the response is
quantitative or qualitative; i.e. we might use linear regression when quantitative and logistic
regression when qualitative. However, whether the predictors are qualitative or quantitative
is generally considered less important”

Assessing Model Accuracy

“Why is it necessary to introduce so many different statistical learning approaches, rather than just a single best method? There is no free lunch in statistics: no one method dominates all others over all possible data sets. On a particular data set, one specific method may work best, but some other method may work better on a similar but different data set. Hence it is an important task to decide for any given set of data which method produces the best results. Selecting the best approach can be one of the most challenging parts of performing statistical learning in practice”

“We need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation. In the regression setting, the most commonly-used measure is the mean squared error (MSE), given by MSE = (1/n) Σi (yi − f̂(xi))², (2.5) where f̂(xi) is the prediction that f̂ gives for the ith observation. The MSE will be small if the predicted responses are very close to the true responses, and will be large if for some of the observations, the predicted and true responses differ substantially.”
“We do not really care how well the method works on the training data. Rather, we are
interested in the accuracy of the predictions that we obtain when we apply our method to
previously unseen test data.”

“Suppose that we fit our statistical learning method on our training observations {(x1, y1), (x2, y2), …, (xn, yn)}, and we obtain the estimate f̂. We can then compute f̂(x1), f̂(x2), …, f̂(xn). If these are approximately equal to y1, y2, …, yn, then the training MSE given by (2.5) is small. However, we are really not interested in whether f̂(xi) ≈ yi; instead, we want to know whether f̂(x0) is approximately equal to y0, where (x0, y0) is a previously unseen test observation not used to train the statistical learning method. We want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE. In other words, if we had a large number of test observations, we could compute Ave(y0 − f̂(x0))², (2.6) the average squared prediction error for these test observations (x0, y0). We’d like to select the model for which the average of this quantity, the test MSE, is as small as possible”

“We can then simply evaluate (2.6) on the test observations, and select the learning method
for which the test MSE is smallest. But what if no test observations are available? In that
case, one might imagine simply selecting a statistical learning method that minimizes the
training MSE (2.5). This seems like it might be a sensible approach, since the training MSE
and the test MSE appear to be closely related. Unfortunately, there is a fundamental problem
with this strategy: there is no guarantee that the method with the lowest training MSE will
also have the lowest test MSE. Roughly speaking, the problem is that many statistical
methods specifically estimate coefficients so as to minimize the training set MSE. For these
methods, the training set MSE can be quite small, but the test MSE is often much larger.”
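This gap between training and test MSE can be illustrated with a small simulation. This is not from the book: the true quadratic f, the noise level, and the polynomial degrees standing in for flexibility are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Assumed true function; observations add irreducible noise eps.
    return x ** 2

n = 30
x_train = rng.uniform(-2, 2, n)
y_train = f(x_train) + rng.normal(0, 0.5, n)
x_test = rng.uniform(-2, 2, 200)
y_test = f(x_test) + rng.normal(0, 0.5, 200)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

# Polynomial degree stands in for model flexibility.
results = {}
for deg in [1, 2, 10]:
    coefs = np.polyfit(x_train, y_train, deg)
    results[deg] = (mse(y_train, np.polyval(coefs, x_train)),   # training MSE
                    mse(y_test, np.polyval(coefs, x_test)))     # test MSE
    print(deg, results[deg])
```

The training MSE can only fall as the degree grows (the fitted polynomials are nested), while the test MSE is typically smallest near the true degree and grows again once the fit chases noise.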

“MSE as a function of flexibility, or more formally the degrees of freedom, for a number of
smoothing splines. The degrees of freedom is a quantity that summarizes the flexibility of a
curve; it is discussed more fully in Chapter 7”

“as the flexibility of the statistical learning method increases, we observe a monotone
decrease in the training MSE and a U-shape in the test MSE. This is a fundamental property
of statistical learning that holds regardless of the particular data set at hand and regardless
of the statistical method being used. As model flexibility increases, training MSE will
decrease, but the test MSE may not. When a given method yields a small training MSE but a
large test MSE, we are said to be overfitting the data. This happens because our statistical
learning procedure is working too hard to find patterns in the training data, and may be
picking up some patterns that are just caused by random chance rather than by true
properties of the unknown function f. When we overfit the training data, the test MSE will be
very large because the supposed patterns that the method found in the training data simply
don’t exist in the test data.”

“One can usually compute the training MSE with relative ease, but estimating test MSE is
considerably more difficult because usually no test data are available.”

“One important method is cross-validation (Chapter 5), which is a method for estimating test
MSE using the training data.”
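A minimal sketch of k-fold cross-validation, assuming a hypothetical fit/predict interface and a toy least-squares line fit. The point is that the test MSE is estimated using only the training data, by holding out each fold in turn:

```python
def k_fold_cv_mse(xs, ys, fit, predict, k=5):
    """Estimate test MSE from training data alone via k-fold cross-validation."""
    n = len(xs)
    fold_mses = []
    for fold in range(k):
        hold = set(range(fold, n, k))            # indices held out for validation
        tr = [i for i in range(n) if i not in hold]
        model = fit([xs[i] for i in tr], [ys[i] for i in tr])
        errs = [(ys[i] - predict(model, xs[i])) ** 2 for i in hold]
        fold_mses.append(sum(errs) / len(errs))
    return sum(fold_mses) / k

# Toy model: simple linear regression fit by least squares.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return (my - b * mx, b)

def predict_line(model, x):
    a, b = model
    return a + b * x

xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]   # noiseless line, so the CV estimate is ~0
print(k_fold_cv_mse(xs, ys, fit_line, predict_line))
```

Chapter 5 treats cross-validation properly; this sketch only shows the mechanics of the fold split and the averaging.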

The Variance-Bias Tradeoff

“The U-shape observed in the test MSE curves (Figures 2.9–2.11) turns out to be the result
of two competing properties of statistical learning methods”

“The expected test MSE for a given value x0 can be decomposed into the variance of f̂(x0), the squared bias of f̂(x0), and the variance of the error term ε. That is, E(y0 − f̂(x0))² = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε). (2.7) Here the notation E(y0 − f̂(x0))² defines the expected test MSE at x0, and refers to the average test MSE that we would obtain if we repeatedly estimated f using a large number of training sets, and tested each at x0. The overall expected test MSE can be computed by averaging E(y0 − f̂(x0))² over all possible values of x0 in the test set.”

“Equation 2.7 tells us that in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias. Note that variance is inherently a nonnegative quantity, and squared bias is also nonnegative. Hence, we see that the expected test MSE can never lie below Var(ε), the irreducible error from (2.3). What do we mean by the variance and bias of a statistical learning method? Variance refers to the amount by which f̂ would change if we estimated it using a different training data set.”
“But ideally the estimate for f should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in f̂. In general, more flexible statistical methods have higher variance.”

“Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between Y and X1, X2, …, Xp. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f.”

“Generally, more flexible methods result in less bias. As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases. Note that we observed this pattern of decreasing test MSE followed by increasing test MSE in the right-hand panels of Figures 2.9–2.11. The three plots in Figure 2.12 illustrate Equation 2.7 for the examples in Figures 2.9–2.11.”

“This is referred to as a trade-off because it is easy to obtain a method with extremely low
bias but high variance (for instance, by drawing a curve that passes through every single
training observation) or a method with very low variance but high bias (by fitting a horizontal
line to the data). The challenge lies in finding a method for which both the variance and the
squared bias are low. This trade-off is one of the most important recurring themes in this
book.”
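One way to see the trade-off concretely is to simulate many training sets and measure the variance and squared bias, at a single test point, of a flexible estimator (1-nearest-neighbor regression) versus a rigid one (a constant mean predictor). The true f, the noise level, and the two estimators here are illustrative assumptions, not the book's examples:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(x)          # assumed true function

x0, sigma = 1.5, 0.3          # test point and noise standard deviation
reps, n = 500, 40             # number of simulated training sets, and their size

preds_flex, preds_rigid = [], []
for _ in range(reps):
    x = rng.uniform(0, 3, n)
    y = f(x) + rng.normal(0, sigma, n)
    # Flexible: 1-nearest-neighbor prediction at x0 (low bias, high variance).
    preds_flex.append(y[np.argmin(np.abs(x - x0))])
    # Rigid: predict the training mean regardless of x0 (high bias, low variance).
    preds_rigid.append(y.mean())

def var_bias2(preds):
    preds = np.array(preds)
    return float(preds.var()), float((preds.mean() - f(x0)) ** 2)

v_f, b_f = var_bias2(preds_flex)
v_r, b_r = var_bias2(preds_rigid)
print("flexible (1-NN):", round(v_f, 3), round(b_f, 4))
print("rigid (mean):   ", round(v_r, 3), round(b_r, 4))
```

The flexible estimator shows higher variance across training sets, while the rigid one shows higher squared bias, which is the trade-off in miniature.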

The Classification Setting

“Suppose that we seek to estimate f on the basis of training observations {(x1, y1), …, (xn, yn)}, where now y1, …, yn are qualitative. The most common approach for quantifying the accuracy of our estimate f̂ is the training error rate, the proportion of mistakes that are made if we apply our estimate f̂ to the training observations: (1/n) Σi I(yi ≠ ŷi). (2.8)”

“Here ŷi is the predicted class label for the ith observation using f̂. And I(yi ≠ ŷi) is an indicator variable that equals 1 if yi ≠ ŷi and zero if yi = ŷi. If I(yi ≠ ŷi) = 0 then the ith observation was classified correctly by our classification method; otherwise it was misclassified. Hence Equation 2.8 computes the fraction of incorrect classifications.”

The Bayes Classifier

“It is possible to show (though the proof is outside of the scope of this book) that the test error rate given in (2.9) is minimized, on average, by a very simple classifier that assigns each observation to the most likely class, given its predictor values. In other words, we should simply assign a test observation with predictor vector x0 to the class j for which Pr(Y = j | X = x0) (2.10) is largest. Note that (2.10) is a conditional probability: it is the probability that Y = j, given the observed predictor vector x0.”

“The purple dashed line represents the points where the probability is exactly 50 %. This is called the Bayes decision boundary. The Bayes classifier’s prediction is determined by the Bayes decision boundary; an observation that falls on the orange side of the boundary will be assigned to the orange class, and similarly an observation on the blue side of the boundary will be assigned to the blue class.”

“The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate. Since the Bayes classifier will always choose the class for which (2.10) is largest, the error rate at X = x0 will be 1 − maxj Pr(Y = j | X = x0). In general, the overall Bayes error rate is given by 1 − E(maxj Pr(Y = j | X)), (2.11) where the expectation averages the probability over all possible values of X. For our simulated data, the Bayes error rate is 0.1304. It is greater than zero, because the classes overlap in the true population, so maxj Pr(Y = j | X = x0) < 1 for some values of x0. The Bayes error rate is analogous to the irreducible error, discussed earlier.”
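When the conditional probabilities are known, the expectation in (2.11) is easy to evaluate. This toy example uses a made-up discrete X with assumed class probabilities (not the book's simulated data) to apply the Bayes classifier and compute its error rate directly:

```python
# Assumed marginal distribution of X and conditional class-1 probabilities.
p_x = {0: 0.5, 1: 0.3, 2: 0.2}
p_y1_given_x = {0: 0.9, 1: 0.5, 2: 0.2}

def bayes_classify(x):
    # Assign to the most likely class given x: class 1 iff Pr(Y=1|X=x) > 0.5.
    return 1 if p_y1_given_x[x] > 0.5 else 0

# Bayes error rate: 1 - E[max_j Pr(Y = j | X)], i.e. Equation 2.11.
bayes_error = sum(p * (1 - max(p_y1_given_x[x], 1 - p_y1_given_x[x]))
                  for x, p in p_x.items())
print(round(bayes_error, 3))  # 0.24
```

The error is greater than zero exactly because the classes overlap: at x = 1 the two classes are equally likely, so even the Bayes classifier is wrong half the time there.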

K-Nearest Neighbors
“In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible. Therefore, the Bayes classifier serves as an unattainable gold standard against which to compare other methods. Many approaches attempt to estimate the conditional distribution of Y given X, and then classify a given observation to the class with the highest estimated probability. One such method is the K-nearest neighbors (KNN) classifier. Given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0, represented by N0. It then estimates the conditional probability for class j as the fraction of points in N0 whose response values equal j: Pr(Y = j | X = x0) = (1/K) Σi∈N0 I(yi = j). (2.12)”
“Finally, KNN applies Bayes rule and classifies the test observation x0 to the class with the largest probability.”
“KNN can often produce classifiers that are surprisingly close to the optimal Bayes classifier”
“The choice of K has a drastic effect on the KNN classifier obtained. Figure 2.16 displays two
KNN fits to the simulated data from Figure 2.13, using K = 1 and K = 100. When K = 1, the
decision boundary is overly flexible and finds patterns in the data that don’t correspond to
the Bayes decision boundary. This corresponds to a classifier that has low bias but very high
variance. As K grows, the method becomes less flexible and produces a decision boundary
that is close to linear. This corresponds to a low-variance but high-bias classifier”
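A minimal KNN classifier in the spirit of the description above; the toy 2-D training points and their labels are invented for illustration:

```python
from collections import Counter

def knn_classify(train, x0, k):
    """Classify x0 by majority vote among its k nearest training points."""
    # train: list of (x, y) pairs, where x is a 2-D predictor and y a class label.
    neighbors = sorted(
        train,
        key=lambda p: (p[0][0] - x0[0]) ** 2 + (p[0][1] - x0[1]) ** 2,
    )[:k]
    # The estimated Pr(Y = j | X = x0) is votes[j] / k; pick the largest.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "blue"), ((1, 2), "blue"), ((2, 1), "blue"),
         ((5, 5), "orange"), ((5, 6), "orange"), ((6, 5), "orange")]
print(knn_classify(train, (1.5, 1.5), k=3))   # "blue"
print(knn_classify(train, (5.5, 5.5), k=3))   # "orange"
```

With k = 1 the decision boundary would trace individual training points (low bias, high variance); larger k averages over more neighbors and smooths the boundary, matching the K = 1 versus K = 100 contrast in Figure 2.16.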
