Lecture 3: Classification
3.1 Introduction
• The problem of predicting a discrete random variable Y from another random variable X is called classification, supervised learning, discrimination, pattern recognition, or machine learning. We observe i.i.d. data $(X_1, Y_1), \ldots, (X_n, Y_n)$, where
$$X_i = (X_{i1}, \ldots, X_{id})^T \in \mathcal{X} \subset \mathbb{R}^d$$
is a d-dimensional covariate.
Definition. A classification rule is a function $h : \mathcal{X} \to \{0, 1\}$. Observe X, predict Y = h(X). The classification risk (or error rate) of h is $R(h) = \Pr(h(X) \neq Y)$.
Example (Handwritten Digits). Identify handwritten digits from images. Each Y is a digit from 0 to 9.
There are 256 covariates x1 , . . . , x256 corresponding to the intensity values from the pixels of the 16 × 16
image.
Example (Linear Decision Boundary). The figure below shows 100 data points. The covariate X = (X1 , X2 )
is 2-dimensional and the outcome Y ∈ Y = {0, 1}. □ means Y = 0 and △ means Y = 1. A linear classification
rule is of the form
$$h(x) = \begin{cases} 1 & \text{if } a + b_1 x_1 + b_2 x_2 > 0 \\ 0 & \text{otherwise.} \end{cases}$$
Example (Coronary Risk-Factor Study). There are 462 males between the ages of 15 and 64 from three
rural areas in South Africa. The outcome Y is the presence (Y = 1) or absence (Y = 0) of coronary
heart disease and there are 9 covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density
lipoprotein), adiposity, famhist (family history of heart disease), typea (type-A behavior), obesity, alcohol
(current alcohol consumption), and age. The goal is to predict Y from the covariates.
The classification risk is
$$R(h) = \Pr(h(X) \neq Y),$$
and the empirical error rate (training error) of h is
$$\widehat R_n(h) = \frac{1}{n}\sum_{i=1}^n I\big(h(X_i) \neq Y_i\big).$$
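As a concrete illustration, here is a minimal Python sketch of the training error rate of a fixed rule h; the toy data and names are illustrative, not from the text.

```python
# A minimal sketch (illustrative names and toy data) of the training error
# rate R_hat_n(h) = (1/n) * sum_i I(h(X_i) != Y_i) for a given rule h.
import numpy as np

def empirical_error_rate(h, X, Y):
    """h maps an (n, d) array of covariates to an (n,) array of 0/1 labels."""
    return np.mean(h(X) != Y)

# Example: the linear rule h(x) = 1{a + b1*x1 + b2*x2 > 0} from the earlier example.
a, b = 0.0, np.array([1.0, -1.0])
h = lambda X: (a + X @ b > 0).astype(int)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = (X[:, 0] + rng.normal(scale=0.5, size=100) > X[:, 1]).astype(int)  # toy labels
print(empirical_error_rate(h, X, Y))
```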
The rule that minimizes R(h) is
$$h^*(x) = \begin{cases} 1 & \text{if } m(x) > \tfrac12 \\ 0 & \text{otherwise,} \end{cases}$$
where
$$m(x) = E(Y \mid X = x) = \Pr(Y = 1 \mid X = x)$$
denotes the regression function. The rule h∗ is called the Bayes' rule. The risk R∗ = R(h∗) of the Bayes rule is called the Bayes' risk. The set D(h) = {x : m(x) = 1/2} is called the decision boundary.
Note that
$$R(h) = \Pr(Y \neq h(X)) = \int \Pr(Y \neq h(X) \mid X = x)\, f(x)\, dx,$$
so it suffices to show that $\Pr(Y \neq h(X) \mid X = x) \ge \Pr(Y \neq h^*(X) \mid X = x)$ for every x. Now
$$\Pr(Y \neq h(X) \mid X = x) = 1 - \big[ I(h(x) = 1)\, m(x) + I(h(x) = 0)\,(1 - m(x)) \big].$$
Hence
$$\Pr(Y \neq h(X) \mid X = x) - \Pr(Y \neq h^*(X) \mid X = x) = (2m(x) - 1)\big( I(h^*(x) = 1) - I(h(x) = 1) \big).$$
When m(x) ≥ 1/2, h∗(x) = 1, so both factors are nonnegative and the difference above is nonnegative. When m(x) < 1/2, h∗(x) = 0, so both factors are nonpositive and the difference is again nonnegative.
By Bayes' theorem,
$$m(x) = \Pr(Y = 1 \mid X = x) = \frac{\Pr(x \mid Y = 1)\Pr(Y = 1)}{\Pr(x \mid Y = 1)\Pr(Y = 1) + \Pr(x \mid Y = 0)\Pr(Y = 0)} = \frac{\pi_1 p_1(x)}{\pi_1 p_1(x) + \pi_0 p_0(x)},$$
so that
$$m(x) > \frac12 \iff \frac{p_1(x)}{p_0(x)} > \frac{1 - \pi_1}{\pi_1}.$$
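The display above translates directly into code. Below is a small sketch assuming one-dimensional Gaussian class-conditional densities with known parameters; the specific densities and prior are illustrative choices, not from the text.

```python
# A sketch of the Bayes rule when pi_1, p_0, p_1 are known, using
# one-dimensional Gaussians as an illustrative choice.
import numpy as np
from scipy.stats import norm

pi1 = 0.3
p0 = norm(loc=0.0, scale=1.0).pdf     # p_0(x) = p(x | Y = 0)
p1 = norm(loc=2.0, scale=1.0).pdf     # p_1(x) = p(x | Y = 1)

def m(x):
    return pi1 * p1(x) / (pi1 * p1(x) + (1 - pi1) * p0(x))

def bayes_rule(x):
    # equivalently: predict 1 when p1(x)/p0(x) > (1 - pi1)/pi1
    return (m(x) > 0.5).astype(int)

x = np.linspace(-3, 5, 9)
print(bayes_rule(x))
```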
• If $\hat h$ is chosen from a class of classifiers $\mathcal H$ and $h_0 = \operatorname{argmin}_{h \in \mathcal H} R(h)$, we can write
$$R(\hat h) - R(h^*) = \big\{ R(\hat h) - R(h_0) \big\} + \big\{ R(h_0) - R(h^*) \big\},$$
where the first term is a kind of variance and the second term acts like a squared bias. In other words, if one chooses a bigger class $\mathcal H$, then $h_0$ is closer to $h^*$.
• A natural way to estimate the Bayes classifier is the plug-in classification rule:
$$\hat h(x) = \begin{cases} 1 & \text{if } \hat m(x) > \tfrac12 \\ 0 & \text{otherwise.} \end{cases}$$
Note that when $\hat h(x) \neq h^*(x)$, we have $|\hat m(x) - m^*(x)| \ge |m^*(x) - 1/2|$, where $m^*$ denotes the true regression function. Therefore,
$$\Pr(\hat h(X) \neq Y) - \Pr(h^*(X) \neq Y) \le 2\int |m^*(x) - \hat m(x)|\, I\big(h^*(x) \neq \hat h(x)\big)\, dP_X(x) \le 2\int |m^*(x) - \hat m(x)|\, dP_X(x) \le 2\sqrt{\int \big(\hat m(x) - m^*(x)\big)^2\, dP_X(x)}.$$
The last inequality follows from $E|Z| \le \sqrt{E(Z^2)}$ for any random variable Z.
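For intuition, here is a sketch of the plug-in rule with one possible choice of regression estimate, a Nadaraya-Watson kernel smoother; the estimator choice, kernel, and bandwidth are assumptions made only for illustration.

```python
# Plug-in rule: estimate m(x) with a kernel regression smoother, then
# threshold at 1/2.  One-dimensional covariate for simplicity.
import numpy as np

def m_hat(x, X, Y, h=0.5):
    """Nadaraya-Watson estimate of m at the evaluation points x."""
    W = np.exp(-0.5 * ((x[:, None] - X[None, :]) / h) ** 2)  # Gaussian kernel
    return (W * Y).sum(axis=1) / W.sum(axis=1)

def plug_in_rule(x, X, Y, h=0.5):
    return (m_hat(x, X, Y, h) > 0.5).astype(int)
```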
• For the density estimation approach, we estimate $\pi = \Pr(Y = 1)$, $p_1$ and $p_0$, and set
$$\hat m(x) = \widehat{\Pr}(Y = 1 \mid X = x) = \frac{\hat\pi\, \hat p_1(x)}{\hat\pi\, \hat p_1(x) + (1 - \hat\pi)\, \hat p_0(x)}$$
and
$$\hat h(x) = \begin{cases} 1 & \text{if } \hat m(x) > \tfrac12 \\ 0 & \text{otherwise.} \end{cases}$$
• (Hoeffding's inequality) If $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$, then for any $\epsilon > 0$,
$$\Pr(|\hat p - p| > \epsilon) \le 2e^{-2n\epsilon^2}, \qquad \text{where } \hat p = \frac{1}{n}\sum_{i=1}^n X_i.$$
• Let $\mathcal H = \{h_1, \ldots, h_m\}$ be a finite class of classifiers, let $\hat h = \operatorname{argmin}_{h \in \mathcal H} \widehat R_n(h)$ be the empirical risk minimizer, and set $\epsilon_n = \sqrt{\tfrac{2}{n}\log\tfrac{2m}{\alpha}}$. Then, by Hoeffding's inequality and the union bound,
$$\Pr\Big( \max_{h \in \mathcal H} \big|\widehat R_n(h) - R(h)\big| > \epsilon_n \Big) \le \alpha.$$
• Hence, with probability at least $1 - \alpha$,
$$R(\hat h) \le \widehat R_n(\hat h) + \epsilon_n \le \widehat R_n(h^*) + \epsilon_n \le R(h^*) + 2\epsilon_n.$$
• Summarizing:
$$\Pr\left( R(\hat h) > R(h^*) + 2\sqrt{\frac{2}{n}\log\frac{2m}{\alpha}} \right) \le \alpha.$$
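The quantity $\epsilon_n$ is easy to compute; a tiny sketch follows (the values of n, m, and α are arbitrary illustrations).

```python
# Computing epsilon_n = sqrt((2/n) * log(2m/alpha)) from the bound above.
import numpy as np

def eps_n(n, m, alpha):
    return np.sqrt(2.0 / n * np.log(2.0 * m / alpha))

print(eps_n(n=1000, m=50, alpha=0.05))   # allowed uniform deviation for |H| = 50
```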
• The regression approach is to estimate $m(x) = E(Y \mid X = x) = \Pr(Y = 1 \mid X = x)$. We can use the linear model
$$Y = m(x) + \epsilon = \beta_0 + \sum_{j=1}^d \beta_j X_j + \epsilon$$
or the logistic model
$$m(x) = \Pr(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \sum_j \beta_j x_j}}{1 + e^{\beta_0 + \sum_j \beta_j x_j}}.$$
• The parameters $\beta_0$ and $\beta = (\beta_1, \ldots, \beta_d)^T$ can be estimated by maximum conditional likelihood. The likelihood is
$$\mathcal L(\beta_0, \beta) = \prod_{i=1}^n \pi(x_i, \beta_0, \beta)^{Y_i}\big(1 - \pi(x_i, \beta_0, \beta)\big)^{1 - Y_i},$$
so the log-likelihood is
$$\ell(\beta_0, \beta) = \sum_{i=1}^n \Big\{ Y_i \log \pi(x_i, \beta_0, \beta) + (1 - Y_i)\log\big(1 - \pi(x_i, \beta_0, \beta)\big) \Big\} = \sum_{i=1}^n \Big\{ Y_i(\beta_0 + x_i^T\beta) - \log\big(1 + \exp(\beta_0 + x_i^T\beta)\big) \Big\}.$$
• To find the logistic regression MLE, we use the iteratively reweighted least squares algorithm.
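A minimal numpy sketch of IRLS (equivalently, Newton-Raphson) for the logistic log-likelihood above; the function name and stopping rule are illustrative, and real implementations add safeguards (for example, for separated data).

```python
import numpy as np

def logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by iteratively reweighted least squares.
    X: (n, d) covariates, y: (n,) labels in {0, 1}. Returns (beta0, beta).
    A minimal sketch; no regularization or separation safeguards."""
    n, d = X.shape
    Z = np.column_stack([np.ones(n), X])   # add intercept column
    beta = np.zeros(d + 1)
    for _ in range(n_iter):
        eta = Z @ beta
        p = 1.0 / (1.0 + np.exp(-eta))     # pi(x_i; beta0, beta)
        W = p * (1.0 - p)                  # Newton weights
        # Newton step: beta_new = beta + (Z' W Z)^{-1} Z'(y - p)
        step = np.linalg.solve(Z.T @ (W[:, None] * Z), Z.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta[0], beta[1:]
```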
• Even if the model is wrong this might work well since we only need to approximate the decision
boundary.
• Suppose that p0(x) = p(x | Y = 0) and p1(x) = p(x | Y = 1) are both multivariate Gaussians:
$$p_k(x) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}} \exp\Big\{ -\frac12 (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \Big\}, \qquad k = 0, 1.$$
Theorem 3.4. Suppose that X|Y = 0 ∼ N(µ0, Σ0) and X|Y = 1 ∼ N(µ1, Σ1). Then the Bayes rule is
$$h^*(x) = \begin{cases} 1 & \text{if } r_1^2 < r_0^2 + 2\log\dfrac{\pi_1}{1-\pi_1} + \log\dfrac{|\Sigma_0|}{|\Sigma_1|} \\ 0 & \text{otherwise,} \end{cases}$$
where $r_k^2 = (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)$ is the Mahalanobis distance for class k. Equivalently, $h^*(x) = \operatorname{argmax}_k \delta_k(x)$, where
$$\delta_k(x) = -\frac12 \log|\Sigma_k| - \frac12 (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$$
is called the Gaussian discriminant function.
• If we assume that each group is Gaussian with the same covariance matrix Σ, then
$$\log \frac{\Pr(Y=1 \mid X=x)}{\Pr(Y=0 \mid X=x)} = \log\frac{\pi_1}{\pi_0} - \frac12 (\mu_0 + \mu_1)^T \Sigma^{-1} (\mu_1 - \mu_0) + x^T \Sigma^{-1}(\mu_1 - \mu_0) \equiv \alpha_0 + \alpha^T x.$$
• Both LDA and logistic regression lead to a linear classification rule. The difference is in how we estimate the parameters.
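To make the comparison concrete, here is a sketch of the LDA estimates plugged into the linear rule α0 + αᵀx above, assuming numpy; the pooled-covariance estimator shown is one standard choice, not prescribed by the text.

```python
import numpy as np

def lda_fit(X, y):
    """LDA with a shared covariance matrix. Returns (alpha0, alpha) so that
    the classifier predicts 1 when alpha0 + alpha @ x > 0."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    pi1 = n1 / (n0 + n1)
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled covariance estimate
    S0 = np.cov(X0, rowvar=False)
    S1 = np.cov(X1, rowvar=False)
    Sigma = ((n0 - 1) * S0 + (n1 - 1) * S1) / (n0 + n1 - 2)
    Sigma_inv = np.linalg.inv(Sigma)
    alpha = Sigma_inv @ (mu1 - mu0)
    alpha0 = (np.log(pi1 / (1 - pi1))
              - 0.5 * (mu0 + mu1) @ Sigma_inv @ (mu1 - mu0))
    return alpha0, alpha

def lda_predict(X, alpha0, alpha):
    return (alpha0 + X @ alpha > 0).astype(int)
```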
• Since classification only requires knowing f (y|x), we don’t really need to estimate the whole joint
distribution.
• Logistic regression leaves the marginal distribution f (x) unspecified so it is more nonparametric than
LDA. This is an advantage of the logistic regression approach over LDA.
• We consider a class of linear classifiers called Support Vector Machines (SVM). It will be convenient to label the outcomes as −1 and +1 instead of 0 and 1. A linear classifier can then be written as h(x) = sign(H(x)), where x = (x1, ..., xd) and
$$H(x) = \beta_0 + \sum_{j=1}^d \beta_j x_j.$$
• Note that:
classifier correct ⇒ Yi H(Xi) ≥ 0
classifier incorrect ⇒ Yi H(Xi) ≤ 0.
The classification risk can therefore be written as $R(h) = \Pr(Y \neq h(X)) = E\big[L(Y H(X))\big]$, where the loss function L is the 0-1 loss: L(a) = 1 if a < 0 and L(a) = 0 if a ≥ 0.
Figure 1: The 0-1 classification loss (dashed line), hinge loss (solid line) and logistic loss (dotted line)
• The support vector machine classifier minimizes the penalized hinge-loss criterion
$$\sum_{i=1}^n \underbrace{\big[1 - Y_i H(X_i)\big]_+}_{\text{hinge loss}} + \frac{\lambda}{2}\|\beta\|^2,$$
where λ > 0.
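One simple (if crude) way to minimize this criterion is subgradient descent. The sketch below assumes numpy, labels in {-1, +1}, and a fixed step size; it is meant only to illustrate the objective, not as a practical SVM solver.

```python
import numpy as np

def svm_hinge_fit(X, y, lam=1.0, lr=1e-3, n_iter=2000):
    """Minimize sum_i [1 - y_i H(x_i)]_+ + (lam/2)||beta||^2 by subgradient
    descent, with H(x) = beta0 + x @ beta and y in {-1, +1}."""
    n, d = X.shape
    beta0, beta = 0.0, np.zeros(d)
    for _ in range(n_iter):
        margins = y * (beta0 + X @ beta)
        active = margins < 1                    # points with nonzero hinge loss
        # subgradient of the objective
        g_beta = -(y[active][:, None] * X[active]).sum(axis=0) + lam * beta
        g_beta0 = -y[active].sum()
        beta -= lr * g_beta
        beta0 -= lr * g_beta0
    return beta0, beta
```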
• Figure 1 compares the 0-1 classification loss, the hinge loss, and the logistic loss log(1 + e^{−yH(x)}).
• The hinge loss is the smallest convex function that lies above the 0-1 loss, so computation is easy, and the minimizer of E[1 − Y H(X)]₊ is the Bayes rule.
• Suppose that the data are linearly separable, that is, there exists a hyperplane that perfectly sepa-
rates the two classes.
• How can we find a separating hyperplane? Note that LDA or logistic regression may not find it.
• The particular separating hyperplane that such an algorithm converges to depends on the starting values.
• Intuitively, it seems reasonable to choose the hyperplane furthest from the data in the sense that it separates the +1s and −1s and maximizes the distance to the closest point.
• The margin is the distance from the hyperplane to the nearest point.
• Points on the boundary of the margin are called support vectors. Figure 2 shows the support vectors.
• The data can be separated by some hyperplane if and only if there exists a hyperplane $H(x) = \beta_0 + \sum_{j=1}^d \beta_j x_j$ such that
$$Y_i H(X_i) \ge 1, \quad i = 1, \ldots, n.$$
• The goal, then, is to maximize the margin subject to this condition. That is,
$$\max_{\beta_0, \beta} M \quad \text{subject to} \quad \sum_{j=1}^d \beta_j^2 = 1, \quad Y_i H(X_i) \ge M, \quad i = 1, \ldots, n.$$
• Then, for j = 1, ..., d,
$$\hat\beta_j = \sum_{i=1}^n \hat\alpha_i Y_i X_{ij},$$
where $X_{ij}$ is the value of the covariate $X_j$ for the ith data point, and $\hat\alpha = (\hat\alpha_1, \ldots, \hat\alpha_n)$ is the vector that maximizes
$$\sum_{i=1}^n \alpha_i - \frac12 \sum_{i=1}^n \sum_{k=1}^n \alpha_i \alpha_k Y_i Y_k \langle X_i, X_k \rangle$$
subject to
$$\alpha_i \ge 0 \quad \text{and} \quad \sum_i \alpha_i Y_i = 0.$$
• $\hat H$ may be written as
$$\hat H(x) = \hat\beta_0 + \sum_{i=1}^n \hat\alpha_i Y_i \langle x, X_i \rangle.$$
• If there is no perfect linear classifier, then one allows overlap between the groups by replacing the condition with
$$Y_i H(X_i) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$
The optimization problem becomes
$$\max_{\beta_0, \beta} M \quad \text{subject to} \quad \sum_{j=1}^d \beta_j^2 = 1, \quad Y_i H(X_i) \ge M(1 - \xi_i), \quad \xi_i \ge 0, \quad i = 1, \ldots, n, \quad \sum_i \xi_i \le C.$$
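In practice one would use an off-the-shelf solver. A short scikit-learn sketch with synthetic data follows; note that sklearn's C penalizes the total slack rather than bounding $\sum_i \xi_i$ directly, so larger C means fewer margin violations.

```python
# A minimal soft-margin SVM fit using scikit-learn (assumed available).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)),   # class -1
               rng.normal(+1, 1, size=(50, 2))])  # class +1
y = np.repeat([-1, 1], 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors:", len(clf.support_vectors_))
print("training error:", np.mean(clf.predict(X) != y))
```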
• The idea is to map the covariate X, which takes values in 𝒳, into a higher-dimensional space Z and apply the classifier in the bigger space Z.
• This can yield a more flexible classifier while retaining computational simplicity.
Example. The covariate is x = (x1, x2). The Yi's can be separated into two groups using an ellipse. Define a mapping ϕ by
$$z = (z_1, z_2, z_3) = \phi(x) = \big(x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\big).$$
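A quick numerical check of this map, assuming numpy: the inner product in Z reduces to the kernel K(x, x̃) = ⟨x, x̃⟩² in the original space.

```python
# Verify that <phi(x), phi(w)> = (x . w)^2 for the feature map above.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(1)
x, w = rng.normal(size=2), rng.normal(size=2)
print(np.dot(phi(x), phi(w)))   # inner product in Z
print(np.dot(x, w) ** 2)        # kernel K(x, w) = <x, w>^2 -- same value
```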
• If we significantly expand the dimension of the problem, we might increase the computational burden.
• For example, if x has dimension d = 256 and we wanted to use all fourth-order terms, then z = ϕ(x) has dimension 183,181,376.
• However, two facts keep the computation manageable:
– First, many classifiers use only the inner products between pairs of points.
– Second, the inner product in Z can be written in terms of the original covariates: $\langle \phi(x), \phi(\tilde x) \rangle = K(x, \tilde x)$ for some kernel K. In the example above, $K(x, \tilde x) = \langle x, \tilde x \rangle^2$.
• This raises an interesting question: given a function of two variables K(x, y), does there exist a function
ϕ(x) such that K(x, y) = ⟨ϕ(x), ϕ(y)⟩?
• The answer is provided by Mercer's theorem, which says, roughly, that if K is positive definite, meaning that
$$\int\!\!\int K(x, y) f(x) f(y)\, dx\, dy \ge 0$$
for all square-integrable functions f, then there exists a map ϕ such that $K(x, y) = \langle \phi(x), \phi(y) \rangle$.
• We now maximize
$$\sum_{i=1}^n \alpha_i - \frac12 \sum_{i=1}^n \sum_{k=1}^n \alpha_i \alpha_k Y_i Y_k K(X_i, X_k),$$
and the classifier becomes
$$\hat H(x) = \hat\beta_0 + \sum_{i=1}^n \hat\alpha_i Y_i K(x, X_i).$$
• If Y is not real-valued or ϵ is not Gaussian, using the basic regression model might not be appropriate.
• Generalized linear models instead assume that Y | X = x has an exponential family density of the form
$$f(y \mid x) = \exp\left\{ \frac{y\,\theta(x) - b(\theta(x))}{a(\phi)} + c(y, \phi) \right\}.$$
Here θ(·) is called the canonical parameter and ϕ is called the dispersion parameter.
• Define
$$m(x) = E(Y \mid X = x) = b'(\theta(x)), \qquad \sigma^2(x) = \operatorname{Var}(Y \mid X = x) = a(\phi)\, b''(\theta(x)),$$
and let g be a link function satisfying
$$g\big(E(Y \mid X = x)\big) = x^T \beta.$$
• The parameters β are usually estimated by maximum likelihood.
• We want to use a nonparametric regression version of the GLM.
• The local polynomial regression estimator can be obtained by solving the weighted least squares problem
$$\operatorname{argmin}_a \sum_{i=1}^n w_i \big(Y_i - P_x(X_i, a)\big)^2,$$
where $w_i = K\!\big(\tfrac{x - X_i}{h}\big)$.
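A sketch of the degree-one case (local linear regression), assuming a Gaussian kernel and numpy; names and defaults are illustrative.

```python
import numpy as np

def local_linear(x0, X, Y, h):
    """Local linear fit at x0: weighted least squares with
    P_x(u, a) = a0 + a1*(u - x0)."""
    w = np.exp(-0.5 * ((x0 - X) / h) ** 2)           # kernel weights K((x0-Xi)/h)
    Z = np.column_stack([np.ones_like(X), X - x0])   # design for a0 + a1*(u - x0)
    WZ = w[:, None] * Z
    a = np.linalg.solve(Z.T @ WZ, WZ.T @ Y)          # weighted least squares
    return a[0]                                      # m_hat(x0) = a0_hat
```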
• Suppose Y | X = x ∼ Bernoulli(m(x)) for some smooth function m(x) for which 0 ≤ m(x) ≤ 1.
• The log-likelihood is
$$\ell(m) = \sum_{i=1}^n \ell\big(Y_i, \xi(X_i)\big),$$
where $\xi(x) = \operatorname{logit}(m(x))$ and $\ell(y, \xi) = y\xi - \log(1 + e^{\xi})$.
• To estimate the regression function at x, for u near x we approximate m(u) by the local logistic function
$$m(u) \approx \frac{e^{a_0 + a_1(u - x)}}{1 + e^{a_0 + a_1(u - x)}}.$$
• Equivalently,
$$\xi(u) = \operatorname{logit}(m(u)) \approx a_0 + a_1(u - x).$$
The local log-likelihood is
$$\ell_x(a) = \sum_{i=1}^n K\Big(\frac{x - X_i}{h}\Big)\, \ell\big(Y_i, a_0 + a_1(X_i - x)\big) = \sum_{i=1}^n K\Big(\frac{x - X_i}{h}\Big)\Big\{ Y_i\big(a_0 + a_1(X_i - x)\big) - \log\big(1 + e^{a_0 + a_1(X_i - x)}\big) \Big\}.$$
• Choose $\hat a = \operatorname{argmax}_a \ell_x(a)$. Then
$$\hat m(x) = \frac{e^{\hat a_0}}{1 + e^{\hat a_0}}.$$
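A sketch of this local logistic fit at a single point, assuming numpy and scipy and a Gaussian kernel (illustrative choices, not prescribed by the text).

```python
import numpy as np
from scipy.optimize import minimize

def local_logistic(x, X, Y, h):
    """m_hat(x): maximize the kernel-weighted log-likelihood over a = (a0, a1)."""
    w = np.exp(-0.5 * ((x - X) / h) ** 2)              # kernel weights K((x-Xi)/h)

    def neg_loglik(a):
        eta = a[0] + a[1] * (X - x)
        return -np.sum(w * (Y * eta - np.logaddexp(0.0, eta)))

    a_hat = minimize(neg_loglik, np.zeros(2), method="BFGS").x
    return 1.0 / (1.0 + np.exp(-a_hat[0]))             # m_hat(x) = logistic(a0_hat)
```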
• To choose the optimal bandwidth, we can use the leave-one-out log-likelihood cross-validation score
$$\mathrm{CV} = \sum_{i=1}^n \ell\big(Y_i, \hat\xi_{-i}(X_i)\big),$$
where $\hat\xi_{-i}$ denotes the estimate computed with $(X_i, Y_i)$ left out.
• Unfortunately, we don't have a simple formula for CV as in linear regression. However, we can approximate the CV function.
• Let ℓ̇(y, ξ) and ℓ̈(y, ξ) denote the first and second derivatives of ℓ(y, ξ).
• Then
$$\mathrm{CV} \approx \ell_x(\hat a) + \sum_{i=1}^n m(X_i)\,\big(\dot\ell(Y_i, \hat a_0)\big)^2,$$
where $e_1 = (1, 0, \ldots, 0)^T$,
$$m(x) = K(0)\, e_1^T \big(X_x^T W_x V_x X_x\big)^{-1} e_1,$$
and
$$\nu = \sum_{i=1}^n m(X_i)\, E\big(-\ddot\ell(Y_i, \hat a_0)\big).$$
• For the k-nearest-neighbor classifier, an important part of the method is to choose a good value of k. We can use cross-validation.
3강: Classification 17
0.41
0.40
0.39
0.38
error
0.37
0.36
0.35
0.34
0 10 20 30 40 50
• Figure 3 shows the result of the cross-validation for the South African heart disease data.
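A sketch of this kind of cross-validation over k with scikit-learn; the toy data below merely stand in for the South African heart disease covariates and labels.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(462, 9))                                   # placeholder covariates
y = (X[:, 0] + X[:, 8] + rng.normal(size=462) > 0).astype(int)  # placeholder labels

ks = range(1, 51)
cv_error = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                X, y, cv=10).mean() for k in ks]
best_k = ks[int(np.argmin(cv_error))]
print(best_k)
```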
• We can estimate $\pi_k$ by $\hat\pi_k = \frac{1}{n}\sum_i I(Y_i = k)$.
• We can estimate $f_k$ using density estimation. For example, we could apply kernel density estimation to $D_k = \{X_i : Y_i = k\}$ to get $\hat f_k$.
• Under the naive Bayes assumption that the covariates are independent within each class,
$$\hat f_0(x_1, \ldots, x_d) = \prod_{j=1}^d \hat f_{0j}(x_j), \qquad \hat f_1(x_1, \ldots, x_d) = \prod_{j=1}^d \hat f_{1j}(x_j).$$
• The assumption that the components of X are independent is usually wrong, yet the resulting classifier might still be accurate. Here is a summary of the steps in the naive Bayes classifier (a code sketch follows the list):
1. For each group k, compute an estimate $\hat f_{kj}$ of the density $f_{kj}$ for $X_j$, using the data for which $Y_i = k$.
2. Let $\hat f_k(x) = \hat f_k(x_1, \ldots, x_d) = \prod_{j=1}^d \hat f_{kj}(x_j)$.
3. Let $\hat\pi_k = \frac{1}{n}\sum_{i=1}^n I(Y_i = k)$.
4. Define
$$\hat h(x) = \operatorname{argmax}_k\ \hat\pi_k \hat f_k(x).$$
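A sketch of these steps, using a per-coordinate kernel density estimate from scipy; the function names and structure are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def naive_bayes_fit(X, y):
    """X: (n, d) array, y: class labels. Returns (priors, per-feature KDEs)."""
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}                 # pi_k
    kdes = {k: [gaussian_kde(X[y == k, j]) for j in range(X.shape[1])]
            for k in classes}                                      # f_kj
    return priors, kdes

def naive_bayes_predict(X, priors, kdes):
    """Return argmax_k pi_k * prod_j f_kj(x_j), computed on the log scale."""
    scores = []
    for k in priors:
        logf = sum(np.log(kdes[k][j](X[:, j])) for j in range(X.shape[1]))
        scores.append(np.log(priors[k]) + logf)
    return np.array(list(priors))[np.argmax(np.vstack(scores), axis=0)]
```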
• Naive Bayes is closely related to generalized additive models. Under the naive Bayes model,
$$\log \frac{\Pr(Y=1 \mid X)}{\Pr(Y=0 \mid X)} = \log \frac{\pi f_1(X)}{(1-\pi) f_0(X)} = \log\!\left( \frac{\pi \prod_{j=1}^d f_{1j}(X_j)}{(1-\pi)\prod_{j=1}^d f_{0j}(X_j)} \right) = \log\frac{\pi}{1-\pi} + \sum_{j=1}^d \log\frac{f_{1j}(X_j)}{f_{0j}(X_j)} = \beta_0 + \sum_{j=1}^d g_j(X_j).$$
Example. Figure 4 shows an artificial data set with two covariates x1 and x2. Figure 4 (middle) shows kernel density estimators of $\hat f_1(x_1)$, $\hat f_1(x_2)$, $\hat f_0(x_1)$, $\hat f_0(x_2)$. The top left plot shows the resulting naive Bayes decision boundary. The bottom left plot shows the predictions from a GAM model. Clearly, this is similar to the naive Bayes model. The GAM model has an error rate of 0.03. In contrast, a linear model yields a classifier with an error rate of 0.78.
Figure 4: Top: artificial data; middle: kernel density estimates; bottom: naive Bayes and GAM classifiers.