Machine Learning and Pattern Recognition Week 3 Intro - Classification
So far we have fitted scalar real-valued functions f(x) of a real-valued input vector x. We
have matched these to real-valued observations or targets y, a task often called regression.
However, the inputs and outputs of the function we wish to learn could take different types.
This note begins to look at some of the alternatives.
The most common machine learning task is probably classification. We have a dataset of
inputs and outputs, {x(n) , y(n) } as before, but the y labels now belong to a discrete set
of categories. In binary classification, y takes on one of only two values, often {0, 1} or
{−1, +1}, for example indicating whether an email is spam, or whether an image contains a
particular object of interest.
There are many different ways we could represent and learn a function that predicts discrete
labels. This note gives some of the different ways to represent the function. We will extend
these, and say a lot more about learning later in the course.
1 Just do regression?
Although the labels in classification problems are discrete, 0 and 1 are still numbers. We could
take a training set for a binary classification problem, and use it to fit a linear regression
model where all of the outputs just happen to be 0.0 and 1.0. Is that a good idea?
Given enough basis functions, and data to fit them, we can get a fit close to any function,
so we could fit a function that takes on values close to 0 and 1 for most of its inputs. What
if the labels are noisy, and we can observe both zeros and ones in the same location? Then
to minimize square error, the best function value at a location where p(y = 1) = p_1 would
minimize:
\mathbb{E}[(y - f)^2] = p_1(1 - f)^2 + (1 - p_1)(0 - f)^2 = f^2 - 2p_1 f + p_1,   (1)
which is minimized by f = p_1 (set the derivative 2f - 2p_1 to zero). So in the limit of lots of basis functions and data, a flexible
linear regression model can give the probability that a binary label y ∈ {0, 1} will be one. If
we wanted to pick the most probable class, we would return one if our function was greater
than a half, and zero otherwise.
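As a quick numerical check (a minimal sketch with made-up numbers, not from the note): the constant that best fits noisy 0/1 labels at a single location, in the least squares sense, is the sample mean, which is close to p_1.

import numpy as np
# Minimal sketch (assumed data): noisy binary labels at one input location.
rng = np.random.default_rng(0)
p1 = 0.3
yy = (rng.random(100000) < p1).astype(float)
# The constant minimizing mean square error is the sample mean, close to p1:
print(yy.mean())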
As we can fit regularized least squares regression models quickly, and with very little code,
regressing the labels can be appealing. On the other hand, for a fixed set of basis functions,
it is easy to find examples where the least squares objective does not return the most useful
function for constructing a classification rule. See if you can sketch one before we get to an
example.
Also, regardless of how we fit a linear regression model, the fitted functions will usually
extend outside f ∈ [0, 1] for some inputs, which makes it hard to take the probabilistic
interpretation seriously.
What follows is code to construct an example showing a limitation of least squares fitting.
It shows (in magenta) a least squares quadratic fit to some data, and (in black) another
quadratic curve which, when thresholded at 0.5, is a much better classifier. The data and
basis-function setup below is one illustrative choice; the effect tends to appear when many
well-classified points lie far from the decision boundary.
import numpy as np
import matplotlib.pyplot as plt
# Assumed setup (the note's original data, basis, and fitting code are not shown
# here): positive class clustered near x=3, many negatives at larger x.
np.random.seed(0)
X = np.concatenate([2.5 + np.random.rand(20), 2*np.random.rand(20),
                    4 + 6*np.random.rand(60)])[:, None]
yy = np.concatenate([np.ones(20), np.zeros(80)])
def phi_fn(X):  # quadratic basis: [1, x, x^2]
    return np.hstack([np.ones_like(X), X, X**2])
ww = np.linalg.lstsq(phi_fn(X), yy, rcond=None)[0]  # least squares fit to 0/1 labels
# Predictions
x_grid = np.arange(0, 10, 0.05)[:,None]
f_grid = np.dot(phi_fn(x_grid), ww)
# A hand-picked quadratic that, thresholded at 0.5, classifies much better
f2_grid = 1.0 - 0.5*(x_grid - 3.0)**2
# Show demo
plt.clf()
plt.plot(X[yy==1], yy[yy==1], 'r+')
plt.plot(X[yy==0], yy[yy==0], 'bo')
plt.plot(x_grid, f_grid, 'm-')
plt.plot(x_grid, f2_grid, 'k-')
plt.ylim([-0.1, 1.1])
plt.show()
Remember: we’re not saying we can’t use linear regression for classification. We could fit
the problem above by using more basis functions. It’s just that approaches designed for
classification will usually generalize better. Linear regression could still be a baseline for
comparison (although by the end of the course, you might jump straight to other baselines).
[Later we will cover logistic regression, which is one way of forcing the function that we are
fitting to lie between zero and one.]
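As a preview (using assumed notation, since this note does not define logistic regression), the usual trick is to pass a linear-in-the-parameters function through the logistic sigmoid, which maps any real value into (0, 1):

f(\mathbf{x}) = \sigma\big(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})\big),
\qquad
\sigma(a) = \frac{1}{1 + e^{-a}} \in (0, 1).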
1. We still have the problems of fitting least squares regression to binary data that were noted above. However,
other methods we will cover, such as generalizations of logistic regression, can also be used to predict these binary
target vectors.
2. In statistics it is sometimes called “dummy variable coding”, an unfortunate clash of terminology with “dummy
variable” elsewhere in mathematics meaning a “bound variable”.
[Figure: scatter plot of the data on axes std(cell radius) and std("texture").]
3. In this setting, the estimate above is probably fine: N is large, and hopefully we have a lot of examples from each
class. In other settings, probabilities estimated from counts are often forced away from zero or one with fictitious
‘counts’ α:
P(y = k) = \pi_k \approx \frac{\alpha + \sum_n \mathbb{I}(y^{(n)} = k)}{N + K\alpha},
where K is the number of classes, and α could be one, but needn’t be an integer, and is often set smaller than one.
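As a concrete sketch of that smoothed estimate (with made-up labels; the values of α and K below are arbitrary choices):

import numpy as np
# Smoothed class-prior estimates with fictitious counts alpha (illustrative values).
yy = np.array([0, 0, 1, 2, 1, 0])        # example labels for K = 3 classes
K, alpha = 3, 0.5
N = len(yy)
counts = np.bincount(yy, minlength=K)    # raw counts per class
pi = (alpha + counts) / (N + K*alpha)    # never exactly zero or one
print(pi, pi.sum())                      # estimates sum to 1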
[Figure: the same scatter plot with log-transformed features, on axes log(std(cell radius)) and log(std("texture")).]
Hopefully if we looked at more features these clouds could be separated rather more than
they are here.
Warning: you would need to think hard about the application and how the data were sampled
before possibly trusting a classifier in an application that matters. Medical applications are
particularly sensitive. One flaw with fitting a classifier as outlined above: the fraction of
patients with cancer in the training data is much higher than among the patients the classifier
would be applied to, for example in screening. As a result, this classifier would give quite high
probabilities of cancer to every test example, which would be impractical to act on and could
cause needless worry.
More subtle problems could depend on exactly how the patients in the training set were
selected within each class, and whether those choices will be reflected at test time.
For D binary features, we can model each feature independently within each class:
P(\mathbf{x} \mid y = k) = \prod_{d=1}^{D} \theta_{d,k}^{x_d} (1 - \theta_{d,k})^{1 - x_d},
where \theta_{d,k} gives the probability that x_d = 1 given that the class is k. The assumption that the
features are independent given the class is known as the Naive Bayes assumption. Not all
Bayes Classifiers are “Naive”!
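A minimal sketch of fitting and using such a model (illustrative code, not this note's; fit_naive_bayes and predict_log_posterior are made-up names, and the smoothing with α follows the fictitious-count idea from the footnote above):

import numpy as np
# Bernoulli Naive Bayes sketch: X is N x D with binary entries, yy has labels 0..K-1.
def fit_naive_bayes(X, yy, K, alpha=1.0):
    N, D = X.shape
    pi = np.array([(alpha + np.sum(yy == k)) / (N + K*alpha) for k in range(K)])
    # theta[d, k] estimates P(x_d = 1 | y = k), smoothed with fictitious counts
    theta = np.stack([(alpha + X[yy == k].sum(0)) / (np.sum(yy == k) + 2*alpha)
                      for k in range(K)], axis=1)
    return pi, theta
def predict_log_posterior(x, pi, theta):
    # log P(y = k | x) = log pi_k + sum_d log P(x_d | y = k) - log P(x)
    log_lik = x @ np.log(theta) + (1 - x) @ np.log(1 - theta)
    log_post = np.log(pi) + log_lik
    return log_post - np.logaddexp.reduce(log_post)

Working with log probabilities, as in predict_log_posterior, avoids numerical underflow when there are many features.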
[The website version of this note has a question here.]
Warning about a common confusion: Later in the course we will cover “Bayesian methods”.
These use Bayes’ rule to express beliefs about the parameters of a model given some training
data. For example, we don't really know the distribution of the features of malignant cells,
and a Bayesian method would represent our uncertainty about the parameters of that
distribution. The Bayes classifiers in this note are not “Bayesian” in that sense: they simply
apply Bayes' rule to a fitted model with fixed parameters.
4. A distribution that is Gaussian on a log scale is a log-normal distribution, which, as Wikipedia discusses, comes
up frequently in models. In detail, most positive distributions are not precisely log-Gaussian, but a log-normal is
often a more sensible starting point than a Gaussian.
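A quick illustration of why the log transform above helps (a sketch with synthetic log-normal data; the numbers are arbitrary):

import numpy as np
# Positive, right-skewed data often looks more Gaussian after a log transform.
rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10000)
print(np.mean(x), np.median(x))    # mean well above median: right-skewed
lx = np.log(x)
print(np.mean(lx), np.median(lx))  # roughly equal: approximately symmetric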
5 Comments
This document covered a couple of approaches to classification: least squares linear
regression, and Bayes classifiers. However, just as important in practice, if not more so, are the
pre-processing methods: one-hot/one-of-K encoding and log-transformations.
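For reference, one-hot / one-of-K encoding is only a couple of lines of NumPy (a sketch with made-up labels):

import numpy as np
# One-of-K encoding: integer labels 0..K-1 become K-dimensional indicator vectors.
yy = np.array([2, 0, 1, 2])
K = 3
Y = np.zeros((len(yy), K))
Y[np.arange(len(yy)), yy] = 1
print(Y)  # each row has a single 1 in the column of its class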
Bayes classifiers are a baseline worth knowing about, but we haven't found time for many
examples in this course. Naive Bayes in particular already appears in several undergraduate
Edinburgh courses, and probably in undergraduate courses at many other universities.
However, if you haven't seen Naive Bayes before, you may want to read more about it. It's a
good baseline to try, and easy to run on enormous datasets, although some practitioners
might turn straight to “logistic regression”, which we get to soon.
If you do ever use Naive Bayes, you should know that the probabilities reported by Naive
Bayes classifiers are usually poorly calibrated, caused by its strong and wrong independence
assumptions. It's possible to construct synthetic examples where Naive Bayes is either
overly confident or under-confident; keen students could try to do so (see the sketch after
this paragraph). In text classification, Naive Bayes is usually extremely overconfident: for
example, it might declare that many emails are spam with probability >99.99%, yet be wrong
on more than 0.01% of them.
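Here is one way such an example can be constructed (a sketch with made-up numbers): duplicate a single informative feature, so that Naive Bayes counts the same evidence several times.

import numpy as np
# Overconfidence from duplicated evidence. One binary feature with
# P(x=1 | y=1) = 0.8 and P(x=1 | y=0) = 0.2, equal class priors.
# Copying that feature D times multiplies in the same likelihood ratio D times.
for D in [1, 5, 20]:
    log_odds = D * (np.log(0.8) - np.log(0.2))   # evidence counted D times
    p = 1 / (1 + np.exp(-log_odds))              # reported P(y=1 | x=1,...,1)
    print(D, p)   # the correct posterior stays 0.8; the reported one tends to 1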
6 Further Reading
Neither Murphy nor Barber covers the idea of using least squares linear regression for binary
classification. They go straight to the more sophisticated logistic regression method when
modelling binary outputs. You can find discussion in Bishop (Section 4.1.3, p184), or the
classic Duda and Hart Pattern Classification book (Section 5.8 in the Duda, Hart and Stork
second edition from 2000). A book that’s free online, which dives straight into the multiple
class case with one-hot encoding, is Hastie et al.’s The Elements of Statistical Learning (Section
4.2 in both the 1st and 2nd editions). The Rasmussen and Williams book Gaussian Processes
for Machine Learning is also free online, and covers this idea in Section 6.5 (we will discuss
Gaussian Processes later in the course).
Murphy has more detail on Bayes classifiers in section 3.5, and Gaussian classifiers in section
4.2. Barber covers Naive Bayes in Chapter 10, and a classifier based on a mixture of Gaussians
(a generalization we haven’t covered yet) in 20.3.3.