Chapter 2

Supervised and unsupervised learning are the two main categories of machine learning models. Supervised learning involves predicting a response variable given predictor variables, while unsupervised learning involves understanding relationships between variables without a response. Cluster analysis is an example of an unsupervised learning technique used to group observations. Semi-supervised learning uses both labeled and unlabeled data. When selecting a machine learning method, we consider whether the response variable is quantitative or qualitative. Regression is used for quantitative responses, while classification handles qualitative responses. Model accuracy is assessed using measures like mean squared error on test data, not training data, to avoid overfitting. The variance-bias tradeoff explains how more flexible models initially improve accuracy but can later hurt it by overfitting.


Supervised vs Unsupervised Learning

Most learning models fall into two categories: supervised or unsupervised. “We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference).”
“Unsupervised learning describes the somewhat more challenging situation in which for every observation i = 1, …, n, we observe a vector of measurements xi but no associated response yi. It is not possible to fit a linear regression model, since there is no response variable to predict.” “What sort of statistical analysis is possible? We can seek to understand the relationships between the variables or between the observations. One statistical learning tool that we may use in this setting is cluster analysis, or clustering. The goal of cluster analysis is to ascertain, on the basis of x1, …, xn, whether the observations fall into relatively distinct groups.” “For instance, if there are p variables in our data set, then p(p − 1)/2 distinct scatter plots can be made, and visual inspection is simply not a viable way to identify clusters. For this reason, automated clustering methods are important.” “For instance, suppose that we have a set of n observations. For m of the observations, where m < n, we have both predictor measurements and a response measurement. For the remaining n − m observations, we have predictor measurements but no response measurement. Such a scenario can arise if the predictors can be measured relatively cheaply but the corresponding responses are much more expensive to collect. We refer to this setting as a semi-supervised learning problem.”
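As a concrete sketch of automated clustering, here is a minimal 1-D k-means implementation. The synthetic data and the choice K = 2 are illustrative assumptions, and k-means is just one of many clustering methods:

```python
import random

def kmeans_1d(xs, k=2, iters=20, seed=0):
    """Minimal 1-D k-means: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        # Assign each observation to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda c: abs(x - centers[c]))
            clusters[j].append(x)
        # Recompute each center as the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated groups of observations, with no response labels.
data = [1.0, 1.2, 0.8, 1.1, 9.0, 9.3, 8.7, 9.1]
print(kmeans_1d(data))  # centers near 1.0 and 9.0
```

The algorithm recovers the two groups automatically, which is exactly what visual inspection of many scatter plots cannot scale to do.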

Regression vs Classification

“Variables can be characterized as either quantitative or qualitative (also known as categorical).”
“We tend to refer to problems with a quantitative response as regression problems, while
those involving a qualitative response are often referred to as classification problems.
However, the distinction is not always that crisp. Least squares linear regression (Chapter 3)
is used with a quantitative response, whereas logistic regression (Chapter 4) is typically
used with a qualitative (two-class, or binary) response.”

“Some statistical methods, such as K-nearest neighbors (Chapters 2 and 4) and boosting
(Chapter 8), can be used in the case of either quantitative or qualitative responses.”

“We tend to select statistical learning methods on the basis of whether the response is
quantitative or qualitative; i.e. we might use linear regression when quantitative and logistic
regression when qualitative. However, whether the predictors are qualitative or quantitative
is generally considered less important”

Assessing Model Accuracy

“Why is it necessary to introduce so many different statistical learning approaches, rather than just a single best method? There is no free lunch in statistics: no one method dominates all others over all possible data sets. On a particular data set, one specific method may work best, but some other method may work better on a similar but different data set. Hence it is an important task to decide for any given set of data which method produces the best results. Selecting the best approach can be one of the most challenging parts of performing statistical learning in practice”

“We need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation. In the regression setting, the most commonly-used measure is the mean squared error (MSE), given by MSE = (1/n) Σi (yi − f̂(xi))², (2.5) where f̂(xi) is the prediction that f̂ gives for the ith observation. The MSE will be small if the predicted responses are very close to the true responses, and will be large if for some of the observations, the predicted and true responses differ substantially.”
“We do not really care how well the method works on the training data. Rather, we are
interested in the accuracy of the predictions that we obtain when we apply our method to
previously unseen test data.”

“Suppose that we fit our statistical learning method on our training observations {(x1, y1), (x2, y2), …, (xn, yn)}, and we obtain the estimate f̂. We can then compute f̂(x1), f̂(x2), …, f̂(xn). If these are approximately equal to y1, y2, …, yn, then the training MSE given by (2.5) is small. However, we are really not interested in whether f̂(xi) ≈ yi; instead, we want to know whether f̂(x0) is approximately equal to y0, where (x0, y0) is a previously unseen test observation not used to train the statistical learning method. We want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE. In other words, if we had a large number of test observations, we could compute Ave(y0 − f̂(x0))², (2.6) the average squared prediction error for these test observations (x0, y0). We’d like to select the model for which the average of this quantity, the test MSE, is as small as possible”

“We can then simply evaluate (2.6) on the test observations, and select the learning method
for which the test MSE is smallest. But what if no test observations are available? In that
case, one might imagine simply selecting a statistical learning method that minimizes the
training MSE (2.5). This seems like it might be a sensible approach, since the training MSE
and the test MSE appear to be closely related. Unfortunately, there is a fundamental problem
with this strategy: there is no guarantee that the method with the lowest training MSE will
also have the lowest test MSE. Roughly speaking, the problem is that many statistical
methods specifically estimate coefficients so as to minimize the training set MSE. For these
methods, the training set MSE can be quite small, but the test MSE is often much larger.”
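This gap between training and test MSE can be illustrated with a small simulation. This is not from the book: the true quadratic f, the noise level, and the polynomial degrees standing in for flexibility are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Assumed true function; observations add irreducible noise eps.
    return x ** 2

n = 30
x_train = rng.uniform(-2, 2, n)
y_train = f(x_train) + rng.normal(0, 0.5, n)
x_test = rng.uniform(-2, 2, 200)
y_test = f(x_test) + rng.normal(0, 0.5, 200)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

# Polynomial degree stands in for model flexibility.
results = {}
for deg in [1, 2, 10]:
    coefs = np.polyfit(x_train, y_train, deg)
    results[deg] = (mse(y_train, np.polyval(coefs, x_train)),   # training MSE
                    mse(y_test, np.polyval(coefs, x_test)))     # test MSE
    print(deg, results[deg])
```

The training MSE can only fall as the degree grows (the fitted polynomials are nested), while the test MSE is typically smallest near the true degree and grows again once the fit chases noise.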

“MSE as a function of flexibility, or more formally the degrees of freedom, for a number of
smoothing splines. The degrees of freedom is a quantity that summarizes the flexibility of a
curve; it is discussed more fully in Chapter 7”

“as the flexibility of the statistical learning method increases, we observe a monotone
decrease in the training MSE and a U-shape in the test MSE. This is a fundamental property
of statistical learning that holds regardless of the particular data set at hand and regardless
of the statistical method being used. As model flexibility increases, training MSE will
decrease, but the test MSE may not. When a given method yields a small training MSE but a
large test MSE, we are said to be overfitting the data. This happens because our statistical
learning procedure is working too hard to find patterns in the training data, and may be
picking up some patterns that are just caused by random chance rather than by true
properties of the unknown function f. When we overfit the training data, the test MSE will be
very large because the supposed patterns that the method found in the training data simply
don’t exist in the test data.”

“One can usually compute the training MSE with relative ease, but estimating test MSE is
considerably more difficult because usually no test data are available.”

“One important method is cross-validation (Chapter 5), which is a method for estimating test
MSE using the training data.”
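A minimal sketch of k-fold cross-validation, assuming a hypothetical fit/predict interface and a toy least-squares line fit. The point is that the test MSE is estimated using only the training data, by holding out each fold in turn:

```python
def k_fold_cv_mse(xs, ys, fit, predict, k=5):
    """Estimate test MSE from training data alone via k-fold cross-validation."""
    n = len(xs)
    fold_mses = []
    for fold in range(k):
        hold = set(range(fold, n, k))            # indices held out for validation
        tr = [i for i in range(n) if i not in hold]
        model = fit([xs[i] for i in tr], [ys[i] for i in tr])
        errs = [(ys[i] - predict(model, xs[i])) ** 2 for i in hold]
        fold_mses.append(sum(errs) / len(errs))
    return sum(fold_mses) / k

# Toy model: simple linear regression fit by least squares.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return (my - b * mx, b)

def predict_line(model, x):
    a, b = model
    return a + b * x

xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]   # noiseless line, so the CV estimate is ~0
print(k_fold_cv_mse(xs, ys, fit_line, predict_line))
```

Chapter 5 treats cross-validation properly; this sketch only shows the mechanics of the fold split and the averaging.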

The Variance-Bias Tradeoff

“The U-shape observed in the test MSE curves (Figures 2.9–2.11) turns out to be the result
of two competing properties of statistical learning methods”

“The expected test MSE for a given value x0 can be decomposed into the variance of f̂(x0), the squared bias of f̂(x0), and the variance of the error term ε. That is, E(y0 − f̂(x0))² = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε). (2.7) Here the notation E(y0 − f̂(x0))² defines the expected test MSE at x0, and refers to the average test MSE that we would obtain if we repeatedly estimated f using a large number of training sets, and tested each at x0. The overall expected test MSE can be computed by averaging E(y0 − f̂(x0))² over all possible values of x0 in the test set.”

“Equation 2.7 tells us that in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias. Note that variance is inherently a nonnegative quantity, and squared bias is also nonnegative. Hence, we see that the expected test MSE can never lie below Var(ε), the irreducible error from (2.3). What do we mean by the variance and bias of a statistical learning method? Variance refers to the amount by which f̂ would change if we estimated it using a different training data set.”
“But ideally the estimate for f should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in f̂. In general, more flexible statistical methods have higher variance.”

“Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between Y and X1, X2, …, Xp. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f.”

“Generally, more flexible methods result in less bias. As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases. Note that we observed this pattern of decreasing test MSE followed by increasing test MSE in the right-hand panels of Figures 2.9–2.11. The three plots in Figure 2.12 illustrate Equation 2.7 for the examples in Figures 2.9–2.11.”

“This is referred to as a trade-off because it is easy to obtain a method with extremely low
bias but high variance (for instance, by drawing a curve that passes through every single
training observation) or a method with very low variance but high bias (by fitting a horizontal
line to the data). The challenge lies in finding a method for which both the variance and the
squared bias are low. This trade-off is one of the most important recurring themes in this
book.”
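One way to see the trade-off concretely is to simulate many training sets and measure the variance and squared bias, at a single test point, of a flexible estimator (1-nearest-neighbor regression) versus a rigid one (a constant mean predictor). The true f, the noise level, and the two estimators here are illustrative assumptions, not the book's examples:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(x)          # assumed true function

x0, sigma = 1.5, 0.3          # test point and noise standard deviation
reps, n = 500, 40             # number of simulated training sets, and their size

preds_flex, preds_rigid = [], []
for _ in range(reps):
    x = rng.uniform(0, 3, n)
    y = f(x) + rng.normal(0, sigma, n)
    # Flexible: 1-nearest-neighbor prediction at x0 (low bias, high variance).
    preds_flex.append(y[np.argmin(np.abs(x - x0))])
    # Rigid: predict the training mean regardless of x0 (high bias, low variance).
    preds_rigid.append(y.mean())

def var_bias2(preds):
    preds = np.array(preds)
    return float(preds.var()), float((preds.mean() - f(x0)) ** 2)

v_f, b_f = var_bias2(preds_flex)
v_r, b_r = var_bias2(preds_rigid)
print("flexible (1-NN):", round(v_f, 3), round(b_f, 4))
print("rigid (mean):   ", round(v_r, 3), round(b_r, 4))
```

The flexible estimator shows higher variance across training sets, while the rigid one shows higher squared bias, which is the trade-off in miniature.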

The Classification Setting

“Suppose that we seek to estimate f on the basis of training observations {(x1, y1), …, (xn, yn)}, where now y1, …, yn are qualitative. The most common approach for quantifying the accuracy of our estimate f̂ is the training error rate, the proportion of mistakes that are made if we apply our estimate f̂ to the training observations: (1/n) Σi I(yi ≠ ŷi). (2.8)”

“Here ŷi is the predicted class label for the ith observation using f̂. And I(yi ≠ ŷi) is an indicator variable that equals 1 if yi ≠ ŷi and zero if yi = ŷi. If I(yi ≠ ŷi) = 0 then the ith observation was classified correctly by our classification method; otherwise it was misclassified. Hence Equation 2.8 computes the fraction of incorrect classifications.”

The Bayes Classifier

“It is possible to show (though the proof is outside of the scope of this book) that the test error rate given in (2.9) is minimized, on average, by a very simple classifier that assigns each observation to the most likely class, given its predictor values. In other words, we should simply assign a test observation with predictor vector x0 to the class j for which Pr(Y = j | X = x0) (2.10) is largest. Note that (2.10) is a conditional probability: it is the probability that Y = j, given the observed predictor vector x0.”

“The purple dashed line represents the points where the probability is exactly 50 %. This is called the Bayes decision boundary. The Bayes classifier’s prediction is determined by the Bayes decision boundary; an observation that falls on the orange side of the boundary will be assigned to the orange class, and similarly an observation on the blue side of the boundary will be assigned to the blue class.”

“The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate. Since the Bayes classifier will always choose the class for which (2.10) is largest, the error rate at X = x0 will be 1 − maxj Pr(Y = j | X = x0). In general, the overall Bayes error rate is given by 1 − E(maxj Pr(Y = j | X)), (2.11) where the expectation averages the probability over all possible values of X. For our simulated data, the Bayes error rate is 0.1304. It is greater than zero, because the classes overlap in the true population, so maxj Pr(Y = j | X = x0) < 1 for some values of x0. The Bayes error rate is analogous to the irreducible error, discussed earlier.”
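When the conditional probabilities are known, the expectation in (2.11) is easy to evaluate. This toy example uses a made-up discrete X with assumed class probabilities (not the book's simulated data) to apply the Bayes classifier and compute its error rate directly:

```python
# Assumed marginal distribution of X and conditional class-1 probabilities.
p_x = {0: 0.5, 1: 0.3, 2: 0.2}
p_y1_given_x = {0: 0.9, 1: 0.5, 2: 0.2}

def bayes_classify(x):
    # Assign to the most likely class given x: class 1 iff Pr(Y=1|X=x) > 0.5.
    return 1 if p_y1_given_x[x] > 0.5 else 0

# Bayes error rate: 1 - E[max_j Pr(Y = j | X)], i.e. Equation 2.11.
bayes_error = sum(p * (1 - max(p_y1_given_x[x], 1 - p_y1_given_x[x]))
                  for x, p in p_x.items())
print(round(bayes_error, 3))  # 0.24
```

The error is greater than zero exactly because the classes overlap: at x = 1 the two classes are equally likely, so even the Bayes classifier is wrong half the time there.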

K-Nearest Neighbors
“In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible. Therefore, the Bayes classifier serves as an unattainable gold standard against which to compare other methods. Many approaches attempt to estimate the conditional distribution of Y given X, and then classify a given observation to the class with the highest estimated probability. One such method is the K-nearest neighbors (KNN) classifier. Given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0, represented by N0. It then estimates the conditional probability for class j as the fraction of points in N0 whose response values equal j: Pr(Y = j | X = x0) = (1/K) Σi∈N0 I(yi = j). (2.12)”
“Finally, KNN applies Bayes rule and classifies the test observation x0 to the class with the largest probability.”
“KNN can often produce classifiers that are surprisingly close to the optimal Bayes classifier”
“The choice of K has a drastic effect on the KNN classifier obtained. Figure 2.16 displays two
KNN fits to the simulated data from Figure 2.13, using K = 1 and K = 100. When K = 1, the
decision boundary is overly flexible and finds patterns in the data that don’t correspond to
the Bayes decision boundary. This corresponds to a classifier that has low bias but very high
variance. As K grows, the method becomes less flexible and produces a decision boundary
that is close to linear. This corresponds to a low-variance but high-bias classifier”
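A minimal KNN classifier in the spirit of the description above; the toy 2-D training points and their labels are invented for illustration:

```python
from collections import Counter

def knn_classify(train, x0, k):
    """Classify x0 by majority vote among its k nearest training points."""
    # train: list of (x, y) pairs, where x is a 2-D predictor and y a class label.
    neighbors = sorted(
        train,
        key=lambda p: (p[0][0] - x0[0]) ** 2 + (p[0][1] - x0[1]) ** 2,
    )[:k]
    # The estimated Pr(Y = j | X = x0) is votes[j] / k; pick the largest.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "blue"), ((1, 2), "blue"), ((2, 1), "blue"),
         ((5, 5), "orange"), ((5, 6), "orange"), ((6, 5), "orange")]
print(knn_classify(train, (1.5, 1.5), k=3))   # "blue"
print(knn_classify(train, (5.5, 5.5), k=3))   # "orange"
```

With k = 1 the decision boundary would trace individual training points (low bias, high variance); larger k averages over more neighbors and smooths the boundary, matching the K = 1 versus K = 100 contrast in Figure 2.16.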
