Lecture 3: Classification
3.1 Introduction
• The problem of predicting a discrete random variable Y from another random variable X is called classification, supervised learning, discrimination, pattern recognition, or machine learning. We observe i.i.d. data $(X_1, Y_1), \ldots, (X_n, Y_n)$, where
$$X_i = (X_{i1}, \ldots, X_{id})^T \in \mathcal{X} \subset \mathbb{R}^d$$
is a d-dimensional covariate.
Definition. A classification rule is a function $h : \mathcal{X} \to \{0, 1\}$. Observe X, predict Y = h(X). The classification risk (or error rate) of h is $R(h) = \Pr(h(X) \neq Y)$.
Example (Handwritten Digits). Identify handwritten digits from images. Each Y is a digit from 0 to 9.
There are 256 covariates x1 , . . . , x256 corresponding to the intensity values from the pixels of the 16 × 16
image.
Example (Linear Decision Boundary). The figure below shows 100 data points. The covariate X = (X1 , X2 )
is 2-dimensional and the outcome Y ∈ Y = {0, 1}. □ means Y = 0 and △ means Y = 1. A linear classification
rule is of the form
$$h(x) = \begin{cases} 1 & \text{if } a + b_1 x_1 + b_2 x_2 > 0 \\ 0 & \text{otherwise.} \end{cases}$$
Example (Coronary Risk-Factor Study). There are 462 males between the ages of 15 and 64 from three
rural areas in South Africa. The outcome Y is the presence (Y = 1) or absence (Y = 0) of coronary
heart disease and there are 9 covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density
lipoprotein), adiposity, famhist (family history of heart disease), typea (type-A behavior), obesity, alcohol
(current alcohol consumption), and age. The goal is to predict Y from the covariates.
The classification risk is
$$R(h) = \Pr(h(X) \neq Y),$$
and the empirical error rate (training error) of h is
$$\widehat R_n(h) = \frac{1}{n}\sum_{i=1}^n I\big(h(X_i) \neq Y_i\big).$$
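As a concrete illustration, here is a minimal Python sketch of the training error rate of a fixed rule h; the toy data and names are illustrative, not from the text.

```python
# A minimal sketch (illustrative names and toy data) of the training error
# rate R_hat_n(h) = (1/n) * sum_i I(h(X_i) != Y_i) for a given rule h.
import numpy as np

def empirical_error_rate(h, X, Y):
    """h maps an (n, d) array of covariates to an (n,) array of 0/1 labels."""
    return np.mean(h(X) != Y)

# Example: the linear rule h(x) = 1{a + b1*x1 + b2*x2 > 0} from the earlier example.
a, b = 0.0, np.array([1.0, -1.0])
h = lambda X: (a + X @ b > 0).astype(int)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = (X[:, 0] + rng.normal(scale=0.5, size=100) > X[:, 1]).astype(int)  # toy labels
print(empirical_error_rate(h, X, Y))
```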
The rule that minimizes R(h) is
$$h^*(x) = \begin{cases} 1 & \text{if } m(x) > \tfrac12 \\ 0 & \text{otherwise,} \end{cases}$$
where
$$m(x) = E(Y \mid X = x) = \Pr(Y = 1 \mid X = x)$$
denotes the regression function. The rule h∗ is called the Bayes' rule. The risk R∗ = R(h∗) of the Bayes rule is called the Bayes' risk. The set D(h) = {x : m(x) = 1/2} is called the decision boundary.
Note that
$$R(h) = \Pr(Y \neq h(X)) = \int \Pr(Y \neq h(X) \mid X = x)\, f(x)\, dx,$$
so it suffices to show that $\Pr(Y \neq h(X) \mid X = x) \ge \Pr(Y \neq h^*(X) \mid X = x)$ for every x. Now
$$\Pr(Y \neq h(X) \mid X = x) = 1 - \big[ I(h(x) = 1)\, m(x) + I(h(x) = 0)\,(1 - m(x)) \big].$$
Hence
$$\Pr(Y \neq h(X) \mid X = x) - \Pr(Y \neq h^*(X) \mid X = x) = (2m(x) - 1)\big( I(h^*(x) = 1) - I(h(x) = 1) \big).$$
When m(x) ≥ 1/2, h∗(x) = 1, so both factors are nonnegative and the difference above is nonnegative. When m(x) < 1/2, h∗(x) = 0, so both factors are nonpositive and the difference is again nonnegative.
By Bayes' theorem,
$$m(x) = \Pr(Y = 1 \mid X = x) = \frac{\Pr(x \mid Y = 1)\Pr(Y = 1)}{\Pr(x \mid Y = 1)\Pr(Y = 1) + \Pr(x \mid Y = 0)\Pr(Y = 0)} = \frac{\pi_1 p_1(x)}{\pi_1 p_1(x) + \pi_0 p_0(x)},$$
so that
$$m(x) > \frac12 \iff \frac{p_1(x)}{p_0(x)} > \frac{1 - \pi_1}{\pi_1}.$$
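The display above translates directly into code. Below is a small sketch assuming one-dimensional Gaussian class-conditional densities with known parameters; the specific densities and prior are illustrative choices, not from the text.

```python
# A sketch of the Bayes rule when pi_1, p_0, p_1 are known, using
# one-dimensional Gaussians as an illustrative choice.
import numpy as np
from scipy.stats import norm

pi1 = 0.3
p0 = norm(loc=0.0, scale=1.0).pdf     # p_0(x) = p(x | Y = 0)
p1 = norm(loc=2.0, scale=1.0).pdf     # p_1(x) = p(x | Y = 1)

def m(x):
    return pi1 * p1(x) / (pi1 * p1(x) + (1 - pi1) * p0(x))

def bayes_rule(x):
    # equivalently: predict 1 when p1(x)/p0(x) > (1 - pi1)/pi1
    return (m(x) > 0.5).astype(int)

x = np.linspace(-3, 5, 9)
print(bayes_rule(x))
```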
• If $\hat h$ is chosen from a class of classifiers $\mathcal H$ and $h_0 = \operatorname{argmin}_{h \in \mathcal H} R(h)$, we can write
$$R(\hat h) - R(h^*) = \big\{ R(\hat h) - R(h_0) \big\} + \big\{ R(h_0) - R(h^*) \big\},$$
where the first term is a kind of variance and the second term acts like a squared bias. In other words, if one chooses a bigger class $\mathcal H$, then $h_0$ is closer to $h^*$.
• A natural way to estimate the Bayes classifier is the plug-in classification rule:
$$\hat h(x) = \begin{cases} 1 & \text{if } \hat m(x) > \tfrac12 \\ 0 & \text{otherwise.} \end{cases}$$
Note that when $\hat h(x) \neq h^*(x)$, we have $|\hat m(x) - m^*(x)| \ge |m^*(x) - 1/2|$, where $m^*$ denotes the true regression function. Therefore,
$$\Pr(\hat h(X) \neq Y) - \Pr(h^*(X) \neq Y) \le 2\int |m^*(x) - \hat m(x)|\, I\big(h^*(x) \neq \hat h(x)\big)\, dP_X(x) \le 2\int |m^*(x) - \hat m(x)|\, dP_X(x) \le 2\sqrt{\int \big(\hat m(x) - m^*(x)\big)^2\, dP_X(x)}.$$
The last inequality follows from $E|Z| \le \sqrt{E(Z^2)}$ for any random variable Z.
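For intuition, here is a sketch of the plug-in rule with one possible choice of regression estimate, a Nadaraya-Watson kernel smoother; the estimator choice, kernel, and bandwidth are assumptions made only for illustration.

```python
# Plug-in rule: estimate m(x) with a kernel regression smoother, then
# threshold at 1/2.  One-dimensional covariate for simplicity.
import numpy as np

def m_hat(x, X, Y, h=0.5):
    """Nadaraya-Watson estimate of m at the evaluation points x."""
    W = np.exp(-0.5 * ((x[:, None] - X[None, :]) / h) ** 2)  # Gaussian kernel
    return (W * Y).sum(axis=1) / W.sum(axis=1)

def plug_in_rule(x, X, Y, h=0.5):
    return (m_hat(x, X, Y, h) > 0.5).astype(int)
```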
• For the density estimation approach, we estimate $\pi = \Pr(Y = 1)$, $p_1$ and $p_0$, and set
$$\hat m(x) = \widehat{\Pr}(Y = 1 \mid X = x) = \frac{\hat\pi\, \hat p_1(x)}{\hat\pi\, \hat p_1(x) + (1 - \hat\pi)\, \hat p_0(x)}$$
and
$$\hat h(x) = \begin{cases} 1 & \text{if } \hat m(x) > \tfrac12 \\ 0 & \text{otherwise.} \end{cases}$$
• (Hoeffding's inequality) If $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$, then for any $\epsilon > 0$,
$$\Pr(|\hat p - p| > \epsilon) \le 2e^{-2n\epsilon^2}, \qquad \text{where } \hat p = \frac{1}{n}\sum_{i=1}^n X_i.$$
• Let $\mathcal H = \{h_1, \ldots, h_m\}$ be a finite class of classifiers, let $\hat h = \operatorname{argmin}_{h \in \mathcal H} \widehat R_n(h)$ be the empirical risk minimizer, and set $\epsilon_n = \sqrt{\tfrac{2}{n}\log\tfrac{2m}{\alpha}}$. Then, by Hoeffding's inequality and the union bound,
$$\Pr\Big( \max_{h \in \mathcal H} \big|\widehat R_n(h) - R(h)\big| > \epsilon_n \Big) \le \alpha.$$
• Hence, with probability at least $1 - \alpha$,
$$R(\hat h) \le \widehat R_n(\hat h) + \epsilon_n \le \widehat R_n(h^*) + \epsilon_n \le R(h^*) + 2\epsilon_n.$$
• Summarizing:
$$\Pr\left( R(\hat h) > R(h^*) + 2\sqrt{\frac{2}{n}\log\frac{2m}{\alpha}} \right) \le \alpha.$$
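The quantity $\epsilon_n$ is easy to compute; a tiny sketch follows (the values of n, m, and α are arbitrary illustrations).

```python
# Computing epsilon_n = sqrt((2/n) * log(2m/alpha)) from the bound above.
import numpy as np

def eps_n(n, m, alpha):
    return np.sqrt(2.0 / n * np.log(2.0 * m / alpha))

print(eps_n(n=1000, m=50, alpha=0.05))   # allowed uniform deviation for |H| = 50
```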
• The regression approach is to estimate $m(x) = E(Y \mid X = x) = \Pr(Y = 1 \mid X = x)$. We can use the linear model
$$Y = m(x) + \epsilon = \beta_0 + \sum_{j=1}^d \beta_j X_j + \epsilon$$
or the logistic model
$$m(x) = \Pr(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \sum_j \beta_j x_j}}{1 + e^{\beta_0 + \sum_j \beta_j x_j}}.$$
• The parameters $\beta_0$ and $\beta = (\beta_1, \ldots, \beta_d)^T$ can be estimated by maximum conditional likelihood. The likelihood is
$$\mathcal L(\beta_0, \beta) = \prod_{i=1}^n \pi(x_i, \beta_0, \beta)^{Y_i}\big(1 - \pi(x_i, \beta_0, \beta)\big)^{1 - Y_i},$$
so the log-likelihood is
$$\ell(\beta_0, \beta) = \sum_{i=1}^n \Big\{ Y_i \log \pi(x_i, \beta_0, \beta) + (1 - Y_i)\log\big(1 - \pi(x_i, \beta_0, \beta)\big) \Big\} = \sum_{i=1}^n \Big\{ Y_i(\beta_0 + x_i^T\beta) - \log\big(1 + \exp(\beta_0 + x_i^T\beta)\big) \Big\}.$$
• To find the logistic regression MLE, we use the iteratively reweighted least squares algorithm.
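A minimal numpy sketch of IRLS (equivalently, Newton-Raphson) for the logistic log-likelihood above; the function name and stopping rule are illustrative, and real implementations add safeguards (for example, for separated data).

```python
import numpy as np

def logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by iteratively reweighted least squares.
    X: (n, d) covariates, y: (n,) labels in {0, 1}. Returns (beta0, beta).
    A minimal sketch; no regularization or separation safeguards."""
    n, d = X.shape
    Z = np.column_stack([np.ones(n), X])   # add intercept column
    beta = np.zeros(d + 1)
    for _ in range(n_iter):
        eta = Z @ beta
        p = 1.0 / (1.0 + np.exp(-eta))     # pi(x_i; beta0, beta)
        W = p * (1.0 - p)                  # Newton weights
        # Newton step: beta_new = beta + (Z' W Z)^{-1} Z'(y - p)
        step = np.linalg.solve(Z.T @ (W[:, None] * Z), Z.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta[0], beta[1:]
```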
• Even if the model is wrong this might work well since we only need to approximate the decision
boundary.
• Suppose that p0(x) = p(x | Y = 0) and p1(x) = p(x | Y = 1) are both multivariate Gaussians:
$$p_k(x) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}} \exp\Big\{ -\frac12 (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \Big\}, \qquad k = 0, 1.$$
Theorem 3.4. Suppose that X|Y = 0 ∼ N(µ0, Σ0) and X|Y = 1 ∼ N(µ1, Σ1). Then the Bayes rule is
$$h^*(x) = \begin{cases} 1 & \text{if } r_1^2 < r_0^2 + 2\log\dfrac{\pi_1}{1-\pi_1} + \log\dfrac{|\Sigma_0|}{|\Sigma_1|} \\ 0 & \text{otherwise,} \end{cases}$$
where $r_k^2 = (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)$ is the Mahalanobis distance for class k. Equivalently, $h^*(x) = \operatorname{argmax}_k \delta_k(x)$, where
$$\delta_k(x) = -\frac12 \log|\Sigma_k| - \frac12 (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$$
is called the Gaussian discriminant function.
• If we assume that each group is Gaussian with the same covariance matrix Σ, then
$$\log \frac{\Pr(Y=1 \mid X=x)}{\Pr(Y=0 \mid X=x)} = \log\frac{\pi_1}{\pi_0} - \frac12 (\mu_0 + \mu_1)^T \Sigma^{-1} (\mu_1 - \mu_0) + x^T \Sigma^{-1}(\mu_1 - \mu_0) \equiv \alpha_0 + \alpha^T x.$$
• Both LDA and logistic regression lead to a linear classification rule. The difference is in how we estimate the parameters.
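To make the comparison concrete, here is a sketch of the LDA estimates plugged into the linear rule α0 + αᵀx above, assuming numpy; the pooled-covariance estimator shown is one standard choice, not prescribed by the text.

```python
import numpy as np

def lda_fit(X, y):
    """LDA with a shared covariance matrix. Returns (alpha0, alpha) so that
    the classifier predicts 1 when alpha0 + alpha @ x > 0."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    pi1 = n1 / (n0 + n1)
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled covariance estimate
    S0 = np.cov(X0, rowvar=False)
    S1 = np.cov(X1, rowvar=False)
    Sigma = ((n0 - 1) * S0 + (n1 - 1) * S1) / (n0 + n1 - 2)
    Sigma_inv = np.linalg.inv(Sigma)
    alpha = Sigma_inv @ (mu1 - mu0)
    alpha0 = (np.log(pi1 / (1 - pi1))
              - 0.5 * (mu0 + mu1) @ Sigma_inv @ (mu1 - mu0))
    return alpha0, alpha

def lda_predict(X, alpha0, alpha):
    return (alpha0 + X @ alpha > 0).astype(int)
```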
• Since classification only requires knowing f (y|x), we don’t really need to estimate the whole joint
distribution.
• Logistic regression leaves the marginal distribution f (x) unspecified so it is more nonparametric than
LDA. This is an advantage of the logistic regression approach over LDA.
• We consider a class of linear classifiers called Support Vector Machines (SVM). It will be convenient to label the outcomes as −1 and +1 instead of 0 and 1. A linear classifier can then be written as h(x) = sign(H(x)), where x = (x1, ..., xd) and
$$H(x) = \beta_0 + \sum_{j=1}^d \beta_j x_j.$$
• Note that:
classifier correct ⇒ Yi H(Xi) ≥ 0
classifier incorrect ⇒ Yi H(Xi) ≤ 0.
The classification risk can therefore be written as $R(h) = \Pr(Y \neq h(X)) = E\big[L(Y H(X))\big]$, where the loss function L is the 0-1 loss: L(a) = 1 if a < 0 and L(a) = 0 if a ≥ 0.
Figure 1: The 0-1 classification loss (dashed line), hinge loss (solid line) and logistic loss (dotted line)
• The support vector machine classifier minimizes the penalized hinge-loss criterion
$$\sum_{i=1}^n \underbrace{\big[1 - Y_i H(X_i)\big]_+}_{\text{hinge loss}} + \frac{\lambda}{2}\|\beta\|^2,$$
where λ > 0.
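One simple (if crude) way to minimize this criterion is subgradient descent. The sketch below assumes numpy, labels in {-1, +1}, and a fixed step size; it is meant only to illustrate the objective, not as a practical SVM solver.

```python
import numpy as np

def svm_hinge_fit(X, y, lam=1.0, lr=1e-3, n_iter=2000):
    """Minimize sum_i [1 - y_i H(x_i)]_+ + (lam/2)||beta||^2 by subgradient
    descent, with H(x) = beta0 + x @ beta and y in {-1, +1}."""
    n, d = X.shape
    beta0, beta = 0.0, np.zeros(d)
    for _ in range(n_iter):
        margins = y * (beta0 + X @ beta)
        active = margins < 1                    # points with nonzero hinge loss
        # subgradient of the objective
        g_beta = -(y[active][:, None] * X[active]).sum(axis=0) + lam * beta
        g_beta0 = -y[active].sum()
        beta -= lr * g_beta
        beta0 -= lr * g_beta0
    return beta0, beta
```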
• Figure 1 compares the 0-1 classification loss, the hinge loss, and the logistic loss log(1 + e^{−yH(x)}).
• The hinge loss is the smallest convex function that lies above the 0-1 loss, so computation is easy, and the minimizer of E[1 − Y H(X)]₊ is the Bayes rule.
• Suppose that the data are linearly separable, that is, there exists a hyperplane that perfectly sepa-
rates the two classes.
• How can we find a separating hyperplane? Note that LDA or logistic regression may not find it.
• The particular separating hyperplane that such an algorithm converges to depends on the starting values.
• Intuitively, it seems reasonable to choose the hyperplane furthest from the data in the sense that it separates the +1s and −1s and maximizes the distance to the closest point.
• The margin is the distance from the hyperplane to the nearest point.
• Points on the boundary of the margin are called support vectors. Figure 2 shows the support vectors.
• The data can be separated by some hyperplane if and only if there exists a hyperplane $H(x) = \beta_0 + \sum_{j=1}^d \beta_j x_j$ such that
$$Y_i H(X_i) \ge 1, \quad i = 1, \ldots, n.$$
• The goal, then, is to maximize the margin subject to this condition. That is,
$$\max_{\beta_0, \beta} M \quad \text{subject to} \quad \sum_{j=1}^d \beta_j^2 = 1, \quad Y_i H(X_i) \ge M, \quad i = 1, \ldots, n.$$
• Then, for j = 1, ..., d,
$$\hat\beta_j = \sum_{i=1}^n \hat\alpha_i Y_i X_{ij},$$
where $X_{ij}$ is the value of the covariate $X_j$ for the ith data point, and $\hat\alpha = (\hat\alpha_1, \ldots, \hat\alpha_n)$ is the vector that maximizes
$$\sum_{i=1}^n \alpha_i - \frac12 \sum_{i=1}^n \sum_{k=1}^n \alpha_i \alpha_k Y_i Y_k \langle X_i, X_k \rangle$$
subject to
$$\alpha_i \ge 0 \quad \text{and} \quad \sum_i \alpha_i Y_i = 0.$$
• $\hat H$ may be written as
$$\hat H(x) = \hat\beta_0 + \sum_{i=1}^n \hat\alpha_i Y_i \langle x, X_i \rangle.$$
• If there is no perfect linear classifier, then one allows overlap between the groups by replacing the condition with
$$Y_i H(X_i) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$
The optimization problem becomes
$$\max_{\beta_0, \beta} M \quad \text{subject to} \quad \sum_{j=1}^d \beta_j^2 = 1, \quad Y_i H(X_i) \ge M(1 - \xi_i), \quad \xi_i \ge 0, \quad i = 1, \ldots, n, \quad \sum_i \xi_i \le C.$$
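In practice one would use an off-the-shelf solver. A short scikit-learn sketch with synthetic data follows; note that sklearn's C penalizes the total slack rather than bounding $\sum_i \xi_i$ directly, so larger C means fewer margin violations.

```python
# A minimal soft-margin SVM fit using scikit-learn (assumed available).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)),   # class -1
               rng.normal(+1, 1, size=(50, 2))])  # class +1
y = np.repeat([-1, 1], 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors:", len(clf.support_vectors_))
print("training error:", np.mean(clf.predict(X) != y))
```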
• The idea is to map the covariate X, which takes values in 𝒳, into a higher-dimensional space Z and apply the classifier in the bigger space Z.
• This can yield a more flexible classifier while retaining computational simplicity.
Example. The covariate is x = (x1, x2). The Yi's can be separated into two groups using an ellipse. Define a mapping ϕ by
$$z = (z_1, z_2, z_3) = \phi(x) = \big(x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\big).$$
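A quick numerical check of this map, assuming numpy: the inner product in Z reduces to the kernel K(x, x̃) = ⟨x, x̃⟩² in the original space.

```python
# Verify that <phi(x), phi(w)> = (x . w)^2 for the feature map above.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(1)
x, w = rng.normal(size=2), rng.normal(size=2)
print(np.dot(phi(x), phi(w)))   # inner product in Z
print(np.dot(x, w) ** 2)        # kernel K(x, w) = <x, w>^2 -- same value
```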
• If we significantly expand the dimension of the problem, we might increase the computational burden.
• For example, if x has dimension d = 256 and we wanted to use all fourth-order terms, then z = ϕ(x) has dimension 183,181,376.
• However, two facts keep the computation manageable:
– First, many classifiers use only the inner products between pairs of points.
– Second, the inner product in Z can be written in terms of the original covariates: $\langle \phi(x), \phi(\tilde x) \rangle = K(x, \tilde x)$ for some kernel K. In the example above, $K(x, \tilde x) = \langle x, \tilde x \rangle^2$.
• This raises an interesting question: given a function of two variables K(x, y), does there exist a function
ϕ(x) such that K(x, y) = ⟨ϕ(x), ϕ(y)⟩?
• The answer is provided by Mercer's theorem, which says, roughly, that if K is positive definite, meaning that
$$\int\!\!\int K(x, y) f(x) f(y)\, dx\, dy \ge 0$$
for all square-integrable functions f, then there exists a map ϕ such that $K(x, y) = \langle \phi(x), \phi(y) \rangle$.
• We now maximize
$$\sum_{i=1}^n \alpha_i - \frac12 \sum_{i=1}^n \sum_{k=1}^n \alpha_i \alpha_k Y_i Y_k K(X_i, X_k),$$
and the classifier becomes
$$\hat H(x) = \hat\beta_0 + \sum_{i=1}^n \hat\alpha_i Y_i K(x, X_i).$$
• If Y is not real-valued or ϵ is not Gaussian, using the basic regression model might not be appropriate.
• Generalized linear models instead assume that Y | X = x has an exponential family density of the form
$$f(y \mid x) = \exp\left\{ \frac{y\,\theta(x) - b(\theta(x))}{a(\phi)} + c(y, \phi) \right\}.$$
Here θ(·) is called the canonical parameter and ϕ is called the dispersion parameter.
• Define
$$m(x) = E(Y \mid X = x) = b'(\theta(x)), \qquad \sigma^2(x) = \operatorname{Var}(Y \mid X = x) = a(\phi)\, b''(\theta(x)),$$
and let g be a link function satisfying
$$g\big(E(Y \mid X = x)\big) = x^T \beta.$$
• The parameters β are usually estimated by maximum likelihood.
• We want to use a nonparametric regression version of the GLM.
• The local polynomial regression estimator can be obtained by solving the weighted least squares problem
$$\operatorname{argmin}_a \sum_{i=1}^n w_i \big(Y_i - P_x(X_i, a)\big)^2,$$
where $w_i = K\!\big(\tfrac{x - X_i}{h}\big)$.
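A sketch of the degree-one case (local linear regression), assuming a Gaussian kernel and numpy; names and defaults are illustrative.

```python
import numpy as np

def local_linear(x0, X, Y, h):
    """Local linear fit at x0: weighted least squares with
    P_x(u, a) = a0 + a1*(u - x0)."""
    w = np.exp(-0.5 * ((x0 - X) / h) ** 2)           # kernel weights K((x0-Xi)/h)
    Z = np.column_stack([np.ones_like(X), X - x0])   # design for a0 + a1*(u - x0)
    WZ = w[:, None] * Z
    a = np.linalg.solve(Z.T @ WZ, WZ.T @ Y)          # weighted least squares
    return a[0]                                      # m_hat(x0) = a0_hat
```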
• Suppose Y | X = x ∼ Bernoulli(m(x)) for some smooth function m(x) for which 0 ≤ m(x) ≤ 1.
• The log-likelihood is
$$\ell(m) = \sum_{i=1}^n \ell\big(Y_i, \xi(X_i)\big),$$
where $\xi(x) = \operatorname{logit}(m(x))$ and $\ell(y, \xi) = y\xi - \log(1 + e^{\xi})$.
• To estimate the regression function at x, for u near x we approximate m(u) by the local logistic function
$$m(u) \approx \frac{e^{a_0 + a_1(u - x)}}{1 + e^{a_0 + a_1(u - x)}}.$$
• Equivalently,
$$\xi(u) = \operatorname{logit}(m(u)) \approx a_0 + a_1(u - x).$$
The local log-likelihood is
$$\ell_x(a) = \sum_{i=1}^n K\Big(\frac{x - X_i}{h}\Big)\, \ell\big(Y_i, a_0 + a_1(X_i - x)\big) = \sum_{i=1}^n K\Big(\frac{x - X_i}{h}\Big)\Big\{ Y_i\big(a_0 + a_1(X_i - x)\big) - \log\big(1 + e^{a_0 + a_1(X_i - x)}\big) \Big\}.$$
• Choose $\hat a = \operatorname{argmax}_a \ell_x(a)$. Then
$$\hat m(x) = \frac{e^{\hat a_0}}{1 + e^{\hat a_0}}.$$
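A sketch of this local logistic fit at a single point, assuming numpy and scipy and a Gaussian kernel (illustrative choices, not prescribed by the text).

```python
import numpy as np
from scipy.optimize import minimize

def local_logistic(x, X, Y, h):
    """m_hat(x): maximize the kernel-weighted log-likelihood over a = (a0, a1)."""
    w = np.exp(-0.5 * ((x - X) / h) ** 2)              # kernel weights K((x-Xi)/h)

    def neg_loglik(a):
        eta = a[0] + a[1] * (X - x)
        return -np.sum(w * (Y * eta - np.logaddexp(0.0, eta)))

    a_hat = minimize(neg_loglik, np.zeros(2), method="BFGS").x
    return 1.0 / (1.0 + np.exp(-a_hat[0]))             # m_hat(x) = logistic(a0_hat)
```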
• To choose the optimal bandwidth, we can use the leave-one-out log-likelihood cross-validation score
$$\mathrm{CV} = \sum_{i=1}^n \ell\big(Y_i, \hat\xi_{-i}(X_i)\big),$$
where $\hat\xi_{-i}$ denotes the estimate computed with $(X_i, Y_i)$ left out.
• Unfortunately, we don't have a simple formula for CV as in linear regression. However, we can approximate the CV function.
• Let ℓ̇(y, ξ) and ℓ̈(y, ξ) denote the first and second derivatives of ℓ(y, ξ).
• Then
$$\mathrm{CV} \approx \ell_x(\hat a) + \sum_{i=1}^n m(X_i)\,\big(\dot\ell(Y_i, \hat a_0)\big)^2,$$
where $e_1 = (1, 0, \ldots, 0)^T$,
$$m(x) = K(0)\, e_1^T \big(X_x^T W_x V_x X_x\big)^{-1} e_1,$$
and
$$\nu = \sum_{i=1}^n m(X_i)\, E\big(-\ddot\ell(Y_i, \hat a_0)\big).$$
• For the k-nearest-neighbor classifier, an important part of the method is to choose a good value of k. We can use cross-validation.
3강: Classification 17
0.41
0.40
0.39
0.38
error
0.37
0.36
0.35
0.34
0 10 20 30 40 50
• Figure 3 shows the result of the cross-validation for the South African heart disease data.
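A sketch of this kind of cross-validation over k with scikit-learn; the toy data below merely stand in for the South African heart disease covariates and labels.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(462, 9))                                   # placeholder covariates
y = (X[:, 0] + X[:, 8] + rng.normal(size=462) > 0).astype(int)  # placeholder labels

ks = range(1, 51)
cv_error = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                X, y, cv=10).mean() for k in ks]
best_k = ks[int(np.argmin(cv_error))]
print(best_k)
```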
• We can estimate $\pi_k$ by $\hat\pi_k = \frac{1}{n}\sum_i I(Y_i = k)$.
• We can estimate $f_k$ using density estimation. For example, we could apply kernel density estimation to $D_k = \{X_i : Y_i = k\}$ to get $\hat f_k$.
• Under the naive Bayes assumption that the covariates are independent within each class,
$$\hat f_0(x_1, \ldots, x_d) = \prod_{j=1}^d \hat f_{0j}(x_j), \qquad \hat f_1(x_1, \ldots, x_d) = \prod_{j=1}^d \hat f_{1j}(x_j).$$
• The assumption that the components of X are independent is usually wrong, yet the resulting classifier might still be accurate. Here is a summary of the steps in the naive Bayes classifier (a code sketch follows the list):
1. For each group k, compute an estimate $\hat f_{kj}$ of the density $f_{kj}$ for $X_j$, using the data for which $Y_i = k$.
2. Let $\hat f_k(x) = \hat f_k(x_1, \ldots, x_d) = \prod_{j=1}^d \hat f_{kj}(x_j)$.
3. Let $\hat\pi_k = \frac{1}{n}\sum_{i=1}^n I(Y_i = k)$.
4. Define
$$\hat h(x) = \operatorname{argmax}_k\ \hat\pi_k \hat f_k(x).$$
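A sketch of these steps, using a per-coordinate kernel density estimate from scipy; the function names and structure are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def naive_bayes_fit(X, y):
    """X: (n, d) array, y: class labels. Returns (priors, per-feature KDEs)."""
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}                 # pi_k
    kdes = {k: [gaussian_kde(X[y == k, j]) for j in range(X.shape[1])]
            for k in classes}                                      # f_kj
    return priors, kdes

def naive_bayes_predict(X, priors, kdes):
    """Return argmax_k pi_k * prod_j f_kj(x_j), computed on the log scale."""
    scores = []
    for k in priors:
        logf = sum(np.log(kdes[k][j](X[:, j])) for j in range(X.shape[1]))
        scores.append(np.log(priors[k]) + logf)
    return np.array(list(priors))[np.argmax(np.vstack(scores), axis=0)]
```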
• Naive Bayes is closely related to generalized additive models. Under the naive Bayes model,
$$\log \frac{\Pr(Y=1 \mid X)}{\Pr(Y=0 \mid X)} = \log \frac{\pi f_1(X)}{(1-\pi) f_0(X)} = \log\!\left( \frac{\pi \prod_{j=1}^d f_{1j}(X_j)}{(1-\pi)\prod_{j=1}^d f_{0j}(X_j)} \right) = \log\frac{\pi}{1-\pi} + \sum_{j=1}^d \log\frac{f_{1j}(X_j)}{f_{0j}(X_j)} = \beta_0 + \sum_{j=1}^d g_j(X_j).$$
Example. Figure 4 shows an artificial data set with two covariates x1 and x2. Figure 4 (middle) shows kernel density estimators of $\hat f_1(x_1)$, $\hat f_1(x_2)$, $\hat f_0(x_1)$, $\hat f_0(x_2)$. The top left plot shows the resulting naive Bayes decision boundary. The bottom left plot shows the predictions from a GAM model. Clearly, this is similar to the naive Bayes model. The GAM model has an error rate of 0.03. In contrast, a linear model yields a classifier with an error rate of 0.78.
Figure 4: Top: artificial data; middle: kernel density estimates; bottom: naive Bayes and GAM classifiers.