Supervised and Unsupervised Learning

Given an input or feature vector x, one of the main goals of machine learning is to predict an output or response variable y. For example, x could be a digitized signature and y a binary
variable that indicates whether the signature is genuine or false. Another example is where x
represents the weight and smoking habits of an expecting mother and y the birth weight of the
baby. The data science attempt at this prediction is encoded in a mathematical function g, called the prediction function, which takes as an input x and outputs a guess g(x) for y (denoted by ŷ, for example). In a sense, g encompasses all the information about the relationship between the variables x and y, excluding the effects of chance and randomness in nature.

In regression problems, the response variable y can take any real value. In contrast, when y can only lie in a finite set, say y ∈ {0, . . . , c − 1}, then predicting y is conceptually the same as classifying the input x into one of c categories, and so prediction becomes a
classification problem.

We can measure the accuracy of a prediction ŷ with respect to a given response y by using some loss function Loss(y, ŷ). In a regression setting the usual choice is the squared-error loss (y − ŷ)². In the case of classification, the zero–one (also written 0–1) loss function Loss(y, ŷ) = 1{y ≠ ŷ} is often used, which incurs a loss of 1 whenever the predicted class ŷ is not equal to the class y. Later on in this book, we will encounter various other useful loss functions, such as the cross-entropy and hinge loss functions (see, e.g., Chapter 7).
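
As an illustration, both of these loss functions are straightforward to express in code. The following is a minimal Python/NumPy sketch; the function names are ours, chosen for this example only.

```python
import numpy as np

def squared_error_loss(y, y_hat):
    """Squared-error loss (y - y_hat)**2, the usual choice in regression."""
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    """Zero-one loss 1{y != y_hat}, commonly used in classification."""
    return np.asarray(y != y_hat, dtype=float)

print(squared_error_loss(3.0, 2.5))                              # 0.25
print(zero_one_loss(np.array([0, 1, 2]), np.array([0, 2, 2])))   # [0. 1. 0.]
```
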
The word error is often used as a measure of distance between a “true” object y and some approximation ŷ thereof. If y is real-valued, the absolute error |y − ŷ| and the squared error (y − ŷ)² are both well-established error concepts, as are the norm ∥y − ŷ∥ and squared norm ∥y − ŷ∥² for vectors. The squared error (y − ŷ)² is just one example of a loss function.

It is unlikely that any mathematical function g will be able to make accurate predictions for all possible pairs (x, y) one may encounter in
Nature. One reason for this is that, even with the same input x, the output y may be different,
depending on chance circumstances or randomness. For this reason, we adopt a probabilistic
approach and assume that each pair (x, y) is the outcome of a random pair (X, Y) that has some joint
probability density f(x, y). We then assess the predictive performance via the expected loss, usually called the risk, for g:

ℓ(g) = E Loss(Y, g(X)).     (2.1)

For example, in the classification case with the zero–one loss function the risk is equal to the probability of incorrect classification: ℓ(g) = P[Y ≠ g(X)]. In this context, the prediction function g is called a classifier.

Given the distribution of (X, Y) and any loss function, we can in principle find the best possible g∗ := argmin_g E Loss(Y, g(X)) that yields the smallest risk ℓ∗ := ℓ(g∗). We will see in Chapter 7 that in the classification case with y ∈ {0, . . . , c − 1} and ℓ(g) = P[Y ≠ g(X)], we have

g∗(x) = argmax_{y ∈ {0,...,c−1}} f(y | x),

where f(y | x) = P[Y = y | X = x] is the conditional probability of Y = y given X = x.
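
To make this concrete, the sketch below uses a small made-up example (it is not from the text): we fix a discrete joint pmf f(x, y), form the conditional probabilities f(y | x), build the optimal classifier g∗(x) = argmax_y f(y | x), and evaluate its risk P[Y ≠ g∗(X)] both exactly and by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint pmf f(x, y) for x in {0, 1, 2} and y in {0, 1};
# the whole table sums to 1 (values chosen only for illustration).
joint = np.array([[0.30, 0.05],
                  [0.10, 0.15],
                  [0.05, 0.35]])

# Conditional probabilities f(y | x) = f(x, y) / f(x).
cond = joint / joint.sum(axis=1, keepdims=True)

# Optimal classifier: g*(x) = argmax_y f(y | x).
g_star = cond.argmax(axis=1)

# Exact risk under zero-one loss: P[Y != g*(X)] = sum_x f(x) * (1 - max_y f(y | x)).
risk_exact = np.sum(joint.sum(axis=1) * (1 - cond.max(axis=1)))

# Monte Carlo check: draw (X, Y) pairs from the joint pmf and average the loss.
n = 100_000
flat = rng.choice(joint.size, size=n, p=joint.ravel())
X, Y = np.unravel_index(flat, joint.shape)
risk_mc = np.mean(Y != g_star[X])

print(g_star)               # predicted class for each x
print(risk_exact, risk_mc)  # the two risk values should be close
```
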
As already mentioned, for regression the most widely-used loss function is the squared-error loss. In this setting, the optimal prediction function g∗ is often called the regression function. The following theorem specifies its exact form.

Theorem 2.1: Optimal Prediction Function for Squared-Error Loss

For the squared-error loss Loss(y, ŷ) = (y − ŷ)², the optimal prediction function g∗ is equal to the conditional expectation of Y given X = x:

g∗(x) = E[Y | X = x].

Proof: Let g∗(x) = E[Y | X = x]. For any function g, the squared-error risk satisfies

E(Y − g(X))² = E[(Y − g∗(X) + g∗(X) − g(X))²]
             = E(Y − g∗(X))² + 2 E[(Y − g∗(X))(g∗(X) − g(X))] + E(g∗(X) − g(X))²
             ⩾ E(Y − g∗(X))² + 2 E[(Y − g∗(X))(g∗(X) − g(X))]
             = E(Y − g∗(X))² + 2 E{(g∗(X) − g(X)) E[Y − g∗(X) | X]},

where the inequality holds because E(g∗(X) − g(X))² ⩾ 0. In the last equation we used the tower property. By the definition of the conditional expectation, we have E[Y − g∗(X) | X] = 0. It follows that E(Y − g(X))² ⩾ E(Y − g∗(X))², showing that g∗ yields the smallest squared-error risk. □
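
Theorem 2.1 is also easy to check numerically. The sketch below uses a toy model of our own choosing (not from the text): X is uniform on [0, 2] and Y = sin(πX) + noise, so that g∗(x) = E[Y | X = x] = sin(πx). A Monte Carlo estimate of the squared-error risk is computed for g∗ and for a few other candidate prediction functions; g∗ should attain the smallest value, roughly Var ε = 0.09.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy model (an assumption for illustration): X ~ Uniform(0, 2),
# Y = sin(pi * X) + eps with eps ~ N(0, 0.3^2), so g*(x) = sin(pi * x).
X = rng.uniform(0, 2, size=n)
Y = np.sin(np.pi * X) + rng.normal(0, 0.3, size=n)

candidates = {
    "g*(x) = sin(pi x)  (conditional mean)": lambda x: np.sin(np.pi * x),
    "g(x) = 0":                              lambda x: np.zeros_like(x),
    "g(x) = x - 1":                          lambda x: x - 1,
    "g(x) = 0.9 sin(pi x)":                  lambda x: 0.9 * np.sin(np.pi * x),
}

# Monte Carlo estimate of the squared-error risk E (Y - g(X))^2 for each candidate.
for name, g in candidates.items():
    risk = np.mean((Y - g(X)) ** 2)
    print(f"{name:40s} risk ≈ {risk:.4f}")
```
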
One consequence of Theorem 2.1 is that, conditional on X = x, the (random) response Y can be written as

Y = g∗(x) + ε(x),     (2.2)

where ε(x) can be viewed as the random deviation of the response from its conditional mean at x. This random deviation satisfies E ε(x) = 0. Further, the conditional variance of the response Y at x can be written as Var ε(x) = v²(x) for some unknown positive function v. Note that, in general, the probability distribution of ε(x) is unspecified.

Since the optimal prediction function g∗ depends on the typically unknown joint distribution of (X, Y), it is not available in practice. Instead, all that we have available is a finite number of (usually) independent realizations from the joint density f(x, y). We denote this sample by T = {(X1, Y1), . . . , (Xn, Yn)} and call it the training set (T is a mnemonic for training) with n examples. It will be important to distinguish between a random training set T and its (deterministic) outcome {(x1, y1), . . . , (xn, yn)}. We will use the notation τ for the latter. We will also add the subscript n in τn when we wish to emphasize the size of the training set.
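
In simulation studies, such a training set is typically produced by sampling directly from an assumed model of the form (2.2). A minimal sketch, with a toy g∗ and noise distribution of our own choosing:

```python
import numpy as np

def generate_training_set(n, rng):
    """Draw n iid pairs (x_i, y_i) from a toy joint density f(x, y):
    X ~ Uniform(0, 1) and Y = g*(X) + eps, as in (2.2)."""
    x = rng.uniform(0, 1, size=n)
    g_star = 2 * x ** 2                # conditional mean E[Y | X = x] (illustrative choice)
    eps = rng.normal(0, 0.1, size=n)   # random deviation with E eps(x) = 0
    return x, g_star + eps

rng = np.random.default_rng(42)
x_train, y_train = generate_training_set(n=100, rng=rng)  # one outcome tau_n of T
print(x_train[:3], y_train[:3])
```
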