Linear classifiers
1 Classification
A binary classifier is a mapping from Rd → {−1, +1}. We'll often use the letter h (for hypothesis) to stand for a classifier, so the classification process looks like:

x → h → y.

(Actually, general classifiers can have a range which is any discrete set, but we'll work with this specific case for a while.)

Real life rarely gives us vectors of real numbers; the x we really want to classify is usually something like a song, image, or person. In that case, we'll have to define a function ϕ(x), whose range is Rd, where ϕ represents features of x, like a person's height or the amount of bass in a song, and then let h : ϕ(x) → {−1, +1}. In much of the following, we'll omit explicit mention of ϕ and assume that the x(i) are in Rd, but you should always have in mind that some additional process was almost surely required to go from the actual input examples to their feature representation.
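As a toy illustration of such a feature function (not part of the original notes; the particular features and the dict representation of a "person" are invented here), a ϕ might collect a few numeric measurements into a column vector:

```python
import numpy as np

def phi(person):
    """Map a raw example (here, a dict describing a person) to a feature
    vector in R^d, represented as a d x 1 column vector. The specific
    features are made up purely for illustration."""
    return np.array([[person["height_cm"]],
                     [person["weight_kg"]]])

x = phi({"height_cm": 170.0, "weight_kg": 65.0})  # a 2 x 1 column vector
```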
In supervised learning we are given a training data set of the form
D_n = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}.
We will assume that each x(i) is a d × 1 column vector. The intended meaning of this data is
that, when given an input x(i) , the learned hypothesis should generate output y(i) .
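In numpy, one convenient convention (our choice for these sketches, not mandated by the notes) is to stack the column vectors x(i) side by side into a d × n array and the labels into a 1 × n array; the numbers below are made up:

```python
import numpy as np

# d = 2 features, n = 4 training examples; values are invented for illustration.
X = np.array([[3.0, 4.0, -1.0, 0.5],   # column i is x^(i), a d x 1 vector
              [2.0, -1.0, 2.0, 1.0]])
y = np.array([[+1, -1, +1, -1]])        # y[0, i] is the label y^(i)
```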
What makes a classifier useful? That it works well on new data; that is, that it makes good predictions on examples it hasn't seen. But we don't know exactly what data this classifier might be tested on when we use it in the real world. So, we have to assume a connection between the training data and testing data; typically, they are drawn independently from the same probability distribution.

(My favorite analogy is to problem sets. We evaluate a student's ability to generalize by putting questions on the exam that were not on the homework (training set).)

Given a training set Dn and a classifier h, we can define the training error of h to be

E_n(h) = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} 1 & \text{if } h(x^{(i)}) \neq y^{(i)} \\ 0 & \text{otherwise.} \end{cases}
For now, we will try to find a classifier with small training error (later, with some added
criteria) and hope it generalizes well to new data, and has a small test error
E(h) = \frac{1}{n'} \sum_{i=n+1}^{n+n'} \begin{cases} 1 & \text{if } h(x^{(i)}) \neq y^{(i)} \\ 0 & \text{otherwise} \end{cases}
on n′ new examples that were not used in the process of finding the classifier.
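To make these definitions concrete, here is one way to compute such an error rate in numpy, under the illustrative convention above that examples are stored as columns of a d × m array and labels as a 1 × m array; the function name and calling convention are ours:

```python
import numpy as np

def error_rate(h, X, Y):
    """Fraction of examples in (X, Y) that classifier h gets wrong.
    X: d x m array whose columns are examples; Y: 1 x m array of +/-1 labels;
    h: a function taking a d x 1 column vector to +1 or -1."""
    m = X.shape[1]
    mistakes = sum(1 for i in range(m) if h(X[:, i:i+1]) != Y[0, i])
    return mistakes / m

# The training error E_n(h) is this quantity computed on the training data;
# the test error E(h) is the same computation applied to n' fresh examples
# that were not used to find h.
```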
2 Learning algorithm
A hypothesis class H is a set (finite or infinite) of possible classifiers, each of which represents
a mapping from Rd → {−1, +1}.
A learning algorithm is a procedure that takes a data set Dn as input and returns an element h of H; it looks like:

Dn → learning algorithm (H) → h.
We will find that the choice of H can have a big impact on the test error of the h that
results from this process. One way to get an h that generalizes well is to restrict the size, or "expressiveness," of H.
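As a sketch of this interface (the finite hypothesis class and the brute-force search are invented purely for illustration, and the `learn` name is ours), a learning algorithm is just a function from a data set to a hypothesis; for instance, one could pick the element of a small finite H with the lowest training error, reusing the error_rate sketch above:

```python
def learn(X, Y, hypotheses):
    """Return the hypothesis in the (finite, illustrative) collection
    `hypotheses` with the smallest training error on the data set (X, Y)."""
    return min(hypotheses, key=lambda h: error_rate(h, X, Y))
```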
3 Linear classifiers
We’ll start with the hypothesis class of linear classifiers. They are (relatively) easy to un-
derstand, simple in a mathematical sense, powerful on their own, and the basis for many
other more sophisticated methods.
A linear classifier in d dimensions is defined by a vector of parameters θ ∈ Rd and
scalar θ0 ∈ R. So, the hypothesis class H of linear classifiers in d dimensions is the set of all
vectors in Rd+1 . We’ll assume that θ is a d × 1 column vector.
Given particular values for θ and θ0, the classifier is defined by

h(x; \theta, \theta_0) = \mathrm{sign}(\theta^T x + \theta_0) = \begin{cases} +1 & \text{if } \theta^T x + \theta_0 > 0 \\ -1 & \text{otherwise.} \end{cases}

(Let's be careful about dimensions. We have assumed that x and θ are both d × 1 column vectors. So θT x is 1 × 1, which in math (but not necessarily numpy) is the same as a scalar.)

Remember that we can think of θ, θ0 as specifying a hyperplane. It divides Rd, the space our x(i) points live in, into two half-spaces. The one that is on the same side as the normal vector is the positive half-space, and we classify all points in that space as positive. The half-space on the other side is negative and all points in it are classified as negative.
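A direct translation of this definition into numpy might look like the following (a minimal sketch; the function name and calling convention are ours):

```python
import numpy as np

def linear_classify(x, theta, theta_0):
    """h(x; theta, theta_0) = sign(theta^T x + theta_0), where x and theta are
    d x 1 column vectors and theta_0 is a scalar. Following the definition
    above, a value of exactly 0 is classified as -1."""
    # .item() turns the 1 x 1 array theta^T x into a Python scalar,
    # as the dimensions note above warns.
    return +1 if (theta.T @ x).item() + theta_0 > 0 else -1
```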
Example: Let h be the linear classifier defined by \theta = \begin{bmatrix} -1 \\ 1.5 \end{bmatrix}, \theta_0 = 3.

The diagram below shows several points classified by h. In particular, let x^{(1)} = \begin{bmatrix} 3 \\ 2 \end{bmatrix} and x^{(2)} = \begin{bmatrix} 4 \\ -1 \end{bmatrix}. Then

h(x^{(1)}; \theta, \theta_0) = \mathrm{sign}\left(\begin{bmatrix} -1 & 1.5 \end{bmatrix}\begin{bmatrix} 3 \\ 2 \end{bmatrix} + 3\right) = \mathrm{sign}(3) = +1

h(x^{(2)}; \theta, \theta_0) = \mathrm{sign}\left(\begin{bmatrix} -1 & 1.5 \end{bmatrix}\begin{bmatrix} 4 \\ -1 \end{bmatrix} + 3\right) = \mathrm{sign}(-2.5) = -1
Thus, x(1) and x(2) are given positive and negative classifications, respectively.
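Plugging the example's numbers into the linear_classify sketch above reproduces the two classifications:

```python
import numpy as np

theta = np.array([[-1.0], [1.5]])
theta_0 = 3.0
x1 = np.array([[3.0], [2.0]])
x2 = np.array([[4.0], [-1.0]])

print(linear_classify(x1, theta, theta_0))  # +1, since -1*3 + 1.5*2 + 3 = 3 > 0
print(linear_classify(x2, theta, theta_0))  # -1, since -1*4 + 1.5*(-1) + 3 = -2.5
```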
[Diagram: the points x(1) and x(2) plotted on either side of the separating hyperplane θT x + θ0 = 0, with a green normal vector pointing into the positive half-space.]
Study Question: What is the green vector normal to the hyperplane? Specify it as a
column vector.
Study Question: What change would you have to make to θ, θ0 if you wanted to
have the separating hyperplane in the same place, but to classify all the points la-
beled ’+’ in the diagram as negative and all the points labeled ’-’ in the diagram as
positive?
• Evaluate resulting h on a testing set that does not overlap the training set
Doing this multiple times controls for possible poor choices of training set or unfortunate
randomization inside the algorithm itself.
One concern is that we might need a lot of data to do this, and in many applications
data is expensive or difficult to acquire. We can re-use data with cross validation (but it’s
harder to do theoretical analysis).
CROSS-VALIDATE(D, k)
1  divide D into k chunks D1, D2, . . . , Dk (of roughly equal size)
2  for i = 1 to k
3      train hi on D \ Di (withholding chunk Di)
4      compute "test" error Ei(hi) on withheld data Di
5  return (1/k) Σ_{i=1}^{k} Ei(hi)
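A runnable version of this procedure (a sketch under the same array conventions and the error_rate helper from above; the contiguous column-slice chunking and the `train` parameter are our choices) might look like:

```python
import numpy as np

def cross_validate(X, Y, k, train):
    """Estimate the performance of a learning algorithm `train` by k-fold
    cross-validation. `train(X, Y)` returns a classifier h; the chunks
    D_1 .. D_k are contiguous column slices of roughly equal size."""
    n = X.shape[1]
    idx = np.array_split(np.arange(n), k)            # indices of chunks D_1 .. D_k
    errors = []
    for i in range(k):
        held_out = idx[i]
        kept = np.concatenate([idx[j] for j in range(k) if j != i])
        h = train(X[:, kept], Y[:, kept])             # train h_i on D \ D_i
        errors.append(error_rate(h, X[:, held_out], Y[:, held_out]))
    return sum(errors) / k                            # average "test" error
```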
It’s very important to understand that cross-validation neither delivers nor evaluates a
single particular hypothesis h. It evaluates the algorithm that produces hypotheses.