Linear Classification
Photo: https://ww2.mathworks.cn/help/stats/create-and-visualize-discriminant-analysis-classifier.html
Classification problem
(Figure: example classification problems, 2D case and 3D case)
Classification problem
• Goal: assign an input vector 𝐱 to one of 𝐾 discrete classes
Three representative linear classifiers
Linear discriminant for two-class classification
Photo: G. Shakhnarovich
Properties of a linear discriminant
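A brief summary of the two-class linear discriminant and its geometric properties, in standard notation (the symbols are an assumption, not necessarily the ones used on the slides):
\[
y(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + w_0, \qquad \text{assign } \mathbf{x} \text{ to class 1 if } y(\mathbf{x}) \ge 0, \text{ otherwise to class 2.}
\]
\[
\mathbf{w} \text{ is orthogonal to the decision boundary}, \qquad \text{signed distance of } \mathbf{x} \text{ to the boundary} = \frac{y(\mathbf{x})}{\lVert\mathbf{w}\rVert}, \qquad \text{distance of the boundary to the origin} = \frac{-w_0}{\lVert\mathbf{w}\rVert}.
\]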
Linear discriminant for multi-class classification
One-versus-the-rest
• One-vs-the-rest
➢ Learn 𝐾 − 1 two-class classifiers (linear discriminants)
➢ Classifier 1 is derived to separate data of class 1 from the rest
➢ Classifier 2 is derived to separate data of class 2 from the rest
➢ …
➢ Classifier 𝐾 − 1 is derived to separate data of class 𝐾 − 1 from the rest
➢ (Both schemes are illustrated in the code sketch after the one-vs-one slide below)
(Figure: 3-class classification with 2 one-vs-the-rest linear discriminants)
One-vs-one
• One-vs-one
➢ Learn 𝐾(𝐾 − 1)/2 two-class classifiers, one for each class pair
➢ For classes i and j, a binary classifier is learned to separate data of class i from those of class j
➢ Classification is done by majority vote
• The problem of ambiguous regions
(Figure: 3-class classification with 3 one-vs-one linear discriminants)
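A minimal sketch of both reduction schemes, assuming each binary classifier has already been trained and is represented by a weight vector w and bias b (the representation, function names, and toy numbers are illustrative, not taken from the slides):

import numpy as np

def one_vs_rest_predict(x, classifiers):
    # classifiers: K - 1 (w, b) pairs; classifier k claims "class k" when w @ x + b >= 0.
    positives = [k for k, (w, b) in enumerate(classifiers) if w @ x + b >= 0]
    if len(positives) == 1:
        return positives[0]
    if len(positives) == 0:
        return len(classifiers)   # no classifier fired: assign the remaining K-th class
    return None                   # several classifiers fired: ambiguous region

def one_vs_one_predict(x, pair_classifiers):
    # pair_classifiers: {(i, j): (w, b)} with i < j; w @ x + b >= 0 votes for i, else for j.
    votes = {}
    for (i, j), (w, b) in pair_classifiers.items():
        winner = i if w @ x + b >= 0 else j
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)  # majority vote (ties correspond to ambiguous regions)

# Toy usage with K = 3 classes in 2-D.
classifiers = [(np.array([1.0, 0.0]), -1.0), (np.array([-1.0, 0.0]), -1.0)]
pair_classifiers = {(0, 1): (np.array([1.0, 0.0]), 0.0),
                    (0, 2): (np.array([1.0, 0.0]), -0.5),
                    (1, 2): (np.array([-1.0, 0.0]), -0.5)}
x = np.array([1.5, 0.2])
print(one_vs_rest_predict(x, classifiers), one_vs_one_predict(x, pair_classifiers))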
K-class discriminant
• A single discriminant comprising 𝐾 linear functions 𝑦_𝑘(𝐱) = 𝐰_𝑘^𝑇𝐱 + 𝑤_𝑘0 for 𝑘 = 1, 2, … 𝐾
➢ It assigns a point 𝐱 to class 𝑘 if 𝑦_𝑘(𝐱) > 𝑦_𝑗(𝐱) for all 𝑗 ≠ 𝑘
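A property worth noting for this kind of discriminant (a standard result): every decision region is singly connected and convex. If 𝐱_A and 𝐱_B both lie in region R_k, then for any point on the segment between them,
\[
\hat{\mathbf{x}} = \lambda\mathbf{x}_A + (1 - \lambda)\mathbf{x}_B, \quad 0 \le \lambda \le 1,
\]
linearity of the 𝑦_𝑘 gives
\[
y_k(\hat{\mathbf{x}}) = \lambda y_k(\mathbf{x}_A) + (1 - \lambda) y_k(\mathbf{x}_B) > \lambda y_j(\mathbf{x}_A) + (1 - \lambda) y_j(\mathbf{x}_B) = y_j(\hat{\mathbf{x}}) \quad \text{for all } j \ne k,
\]
so every point on the segment also lies in R_k.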
Linear discriminant learning
Least squares for classification
• Some notation
➢ A data point 𝐱 ∈ ℝ^𝐷
➢ 1-of-K binary coding for the label vector of 𝐱, e.g. 𝐭 = [0,1,0,0,0]^𝑇 when 𝐱 belongs to class 2 and 𝐾 = 5
➢ The linear model for class k: weight vector 𝐰_𝑘 and bias 𝑤_𝑘0
➢ Apply the linear model for class k to a point 𝐱: 𝑦_𝑘(𝐱) = 𝐰_𝑘^𝑇𝐱 + 𝑤_𝑘0
Least squares for classification
• Sum-of-squares error: 𝐸(𝐖) = ½ Tr{(𝐗𝐖 − 𝐓)^𝑇(𝐗𝐖 − 𝐓)}, where 𝐗 stacks the data points (each augmented with a leading 1) as rows, 𝐖 collects the per-class weight vectors as columns, and 𝐓 stacks the 1-of-K label vectors as rows
• Proof sketch
➢ Tr(AB) = Tr(BA)
➢ Tr(BA) is the sum of the diagonal elements of the square matrix BA
➢ The nth diagonal element is the squared error of point 𝐱_𝑛
• Setting the derivative w.r.t. 𝐖 to 0, we obtain the closed-form solution 𝐖 = (𝐗^𝑇𝐗)⁻¹𝐗^𝑇𝐓
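A minimal sketch of least-squares classification with 1-of-K targets, using the closed-form pseudo-inverse solution above (function and variable names are illustrative):

import numpy as np

def fit_least_squares_classifier(X, T):
    # X: (N, D) data matrix, T: (N, K) 1-of-K target matrix.
    # Augment each input with a constant 1 so the bias is absorbed into W.
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    # Closed-form least-squares solution W = (X^T X)^{-1} X^T T via the pseudo-inverse.
    W = np.linalg.pinv(X_aug) @ T          # shape (D + 1, K)
    return W

def predict(W, X):
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    Y = X_aug @ W                          # one linear score per class
    return np.argmax(Y, axis=1)            # assign each point to the highest score

# Toy usage: 3 classes in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, size=(20, 2)) for m in ([0, 0], [2, 0], [0, 2])])
labels = np.repeat([0, 1, 2], 20)
T = np.eye(3)[labels]                      # 1-of-K coding
W = fit_least_squares_classifier(X, T)
print((predict(W, X) == labels).mean())    # training accuracy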
Linear discriminant with maximum separation
• Project each point onto a direction 𝐰: 𝑦 = 𝐰^𝑇𝐱
• Choose 𝐰 to maximize the separation of the projected class means, 𝑚_2 − 𝑚_1 = 𝐰^𝑇(𝐦_2 − 𝐦_1), where 𝐦_𝑘 is the mean of the points of class k
• However, the distance can be made arbitrarily large by increasing the magnitude of 𝐰, so 𝐰 is constrained to have unit norm
Linear discriminant with maximum separation
• How to prove that the optimal direction is 𝐰 ∝ 𝐦_2 − 𝐦_1?
➢ Use a Lagrange multiplier to handle the unit-norm constraint
➢ By setting the gradient of the Lagrange function w.r.t. the optimization variables to 0, we get 𝐰 ∝ 𝐦_2 − 𝐦_1
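The Lagrange-multiplier step written out (a standard derivation under the unit-norm constraint):
\[
\max_{\mathbf{w}}\ \mathbf{w}^{T}(\mathbf{m}_2 - \mathbf{m}_1) \quad \text{s.t.} \quad \mathbf{w}^{T}\mathbf{w} = 1, \qquad L(\mathbf{w}, \lambda) = \mathbf{w}^{T}(\mathbf{m}_2 - \mathbf{m}_1) + \lambda(1 - \mathbf{w}^{T}\mathbf{w}),
\]
\[
\frac{\partial L}{\partial \mathbf{w}} = (\mathbf{m}_2 - \mathbf{m}_1) - 2\lambda\mathbf{w} = 0 \;\;\Longrightarrow\;\; \mathbf{w} = \frac{1}{2\lambda}(\mathbf{m}_2 - \mathbf{m}_1) \propto \mathbf{m}_2 - \mathbf{m}_1.
\]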
Linear discriminant with maximum separation
(Figure: data projected onto the maximum-separation direction; the markers indicate 𝐦_1, 𝐦_2, and the classification threshold)
➢ Histograms of the two projected classes overlap
➢ Right plot: the projection learned by FLD (Fisher’s linear discriminant)
Fisher’s linear discriminant: 2 classes
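The quantities used in the derivation that follows, in standard form (the exact symbols are an assumption): projected class means, within-class scatter of the projected data, and the Fisher criterion.
\[
m_k = \mathbf{w}^{T}\mathbf{m}_k, \qquad s_k^{2} = \sum_{n \in \mathcal{C}_k}(\mathbf{w}^{T}\mathbf{x}_n - m_k)^{2},
\]
\[
J(\mathbf{w}) = \frac{(m_2 - m_1)^{2}}{s_1^{2} + s_2^{2}} = \frac{\mathbf{w}^{T}\mathbf{S}_B\mathbf{w}}{\mathbf{w}^{T}\mathbf{S}_W\mathbf{w}}, \qquad \mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^{T}, \qquad \mathbf{S}_W = \sum_{k=1,2}\;\sum_{n \in \mathcal{C}_k}(\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^{T}.
\]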
Fisher’s linear discriminant: 2 classes
• Setting the derivative of 𝐽(𝐰) w.r.t. 𝐰 to zero, we find that (𝐰^𝑇𝐒_B𝐰) 𝐒_W𝐰 = (𝐰^𝑇𝐒_W𝐰) 𝐒_B𝐰
• As 𝐒_B𝐰 = (𝐦_2 − 𝐦_1)(𝐦_2 − 𝐦_1)^𝑇𝐰, 𝐒_B𝐰 is always in the direction of 𝐦_2 − 𝐦_1
• We have 𝐰 ∝ 𝐒_W⁻¹(𝐦_2 − 𝐦_1)
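A minimal numerical sketch of the two-class Fisher direction 𝐰 ∝ 𝐒_W⁻¹(𝐦_2 − 𝐦_1); the function name, toy data, and normalization are illustrative choices:

import numpy as np

def fisher_direction(X1, X2):
    # X1, X2: (N1, D) and (N2, D) arrays holding the points of the two classes.
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter matrix S_W = sum of the per-class scatters.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Fisher's linear discriminant direction: w proportional to S_W^{-1} (m2 - m1).
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

# Toy usage: project both classes onto w and compare the projected means.
rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], [1.0, 2.0], size=(50, 2))
X2 = rng.normal([2, 1], [1.0, 2.0], size=(50, 2))
w = fisher_direction(X1, X2)
print(w, (X2 @ w).mean() - (X1 @ w).mean())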
Fisher’s linear discriminant: Multiple classes
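The scatter matrices that the multi-class criterion is built from, in one standard convention (several equivalent conventions exist, so the exact form is an assumption):
\[
\mathbf{m}_k = \frac{1}{N_k}\sum_{n \in \mathcal{C}_k}\mathbf{x}_n, \qquad \mathbf{m} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n,
\]
\[
\mathbf{S}_W = \sum_{k=1}^{K}\sum_{n \in \mathcal{C}_k}(\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^{T}, \qquad \mathbf{S}_B = \sum_{k=1}^{K} N_k(\mathbf{m}_k - \mathbf{m})(\mathbf{m}_k - \mathbf{m})^{T}.
\]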
Fisher’s linear discriminant: Multiple classes
• An equivalent objective: maximize 𝐰^𝑇𝐒_B𝐰 subject to 𝐰^𝑇𝐒_W𝐰 = 1
• Lagrangian function: 𝐿(𝐰, 𝜆) = 𝐰^𝑇𝐒_B𝐰 − 𝜆(𝐰^𝑇𝐒_W𝐰 − 1)
• Setting its gradient w.r.t. 𝐰 to zero, we have 𝐒_B𝐰 = 𝜆𝐒_W𝐰, i.e. 𝐒_W⁻¹𝐒_B𝐰 = 𝜆𝐰
• The optimal 𝐰 is the eigenvector of 𝐒_W⁻¹𝐒_B that corresponds to the largest eigenvalue
Probabilistic generative models: Two-class case
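For the two-class generative model, Bayes' theorem puts the posterior into the sigmoid form used below (a standard identity):
\[
p(\mathcal{C}_1 \mid \mathbf{x}) = \frac{p(\mathbf{x}\mid\mathcal{C}_1)\,p(\mathcal{C}_1)}{p(\mathbf{x}\mid\mathcal{C}_1)\,p(\mathcal{C}_1) + p(\mathbf{x}\mid\mathcal{C}_2)\,p(\mathcal{C}_2)} = \sigma(a), \qquad a = \ln\frac{p(\mathbf{x}\mid\mathcal{C}_1)\,p(\mathcal{C}_1)}{p(\mathbf{x}\mid\mathcal{C}_2)\,p(\mathcal{C}_2)}, \qquad \sigma(a) = \frac{1}{1 + e^{-a}}.
\]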
Logistic sigmoid function
• The logistic sigmoid function 𝜎(𝑎) = 1/(1 + exp(−𝑎)) maps the whole real axis into (0, 1)
➢ Symmetric property: 𝜎(−𝑎) = 1 − 𝜎(𝑎)
(Figure: plot of 𝜎(𝑎) as a function of 𝑎)
Probabilistic generative models: Multi-class case
• The posterior becomes the softmax (normalized exponential) function: 𝑝(𝒞_𝑘|𝐱) = exp(𝑎_𝑘) / Σ_𝑗 exp(𝑎_𝑗),
where 𝑎_𝑘 = ln 𝑝(𝐱|𝒞_𝑘)𝑝(𝒞_𝑘)
Continuous inputs: Two-class case
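Assuming Gaussian class-conditional densities 𝑝(𝐱|𝒞_k) = 𝒩(𝐱|𝝁_k, 𝚺) with a shared covariance matrix (the standard assumption under which the posterior becomes linear in 𝐱), the two-class posterior is
\[
p(\mathcal{C}_1\mid\mathbf{x}) = \sigma(\mathbf{w}^{T}\mathbf{x} + w_0), \qquad \mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2),
\]
\[
w_0 = -\tfrac{1}{2}\boldsymbol{\mu}_1^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}.
\]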
Continuous inputs: Multi-class case
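Under the same shared-covariance Gaussian assumption, the multi-class posteriors are softmax functions of linear quantities:
\[
p(\mathcal{C}_k\mid\mathbf{x}) = \frac{\exp(a_k(\mathbf{x}))}{\sum_j \exp(a_j(\mathbf{x}))}, \qquad a_k(\mathbf{x}) = \mathbf{w}_k^{T}\mathbf{x} + w_{k0},
\]
\[
\mathbf{w}_k = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k, \qquad w_{k0} = -\tfrac{1}{2}\boldsymbol{\mu}_k^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \ln p(\mathcal{C}_k).
\]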
Determine parameter values via maximum likelihood
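For the two-class Gaussian model above, with targets 𝑡_𝑛 ∈ {0, 1} (𝑡_𝑛 = 1 for class 1) and prior 𝑝(𝒞_1) = 𝜋, the likelihood to be maximized is (standard form):
\[
p(\mathbf{t}\mid \pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N}\bigl[\pi\,\mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_1,\boldsymbol{\Sigma})\bigr]^{t_n}\bigl[(1-\pi)\,\mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_2,\boldsymbol{\Sigma})\bigr]^{1-t_n}.
\]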
Maximum likelihood solution for 𝜋
Maximum likelihood solution for the class means
Maximum likelihood solution for the shared covariance
• The derivation: set the derivative of the log-likelihood w.r.t. each parameter to zero
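Carrying out these derivations gives the standard closed-form estimates (with 𝑁_1, 𝑁_2 the class counts and 𝑁 = 𝑁_1 + 𝑁_2):
\[
\pi = \frac{N_1}{N}, \qquad \boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{n=1}^{N} t_n\,\mathbf{x}_n, \qquad \boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{n=1}^{N}(1 - t_n)\,\mathbf{x}_n,
\]
\[
\boldsymbol{\Sigma} = \frac{N_1}{N}\mathbf{S}_1 + \frac{N_2}{N}\mathbf{S}_2, \qquad \mathbf{S}_k = \frac{1}{N_k}\sum_{n \in \mathcal{C}_k}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{T}.
\]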
Generative approach summary
• Model the class-conditional densities 𝑝(𝐱|𝒞_𝑘) and the class priors 𝑝(𝒞_𝑘), then obtain the posterior 𝑝(𝒞_𝑘|𝐱) via Bayes’ theorem
Generative vs. Discriminative models
Nonlinear basis functions for linear classification
• Nonlinear basis functions help when dealing with data that are not linearly separable in the original input space: transform each input 𝐱 with a fixed nonlinear mapping 𝝓(𝐱) and apply a linear classifier in the resulting feature space
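A small sketch of the idea: map 2-D inputs through Gaussian radial basis functions and feed the resulting features to any linear classifier (the centres, width, and toy data are illustrative choices, not values from the slides):

import numpy as np

def rbf_features(X, centres, width=1.0):
    # phi_j(x) = exp(-||x - c_j||^2 / (2 * width^2)); one feature per centre.
    # X: (N, D), centres: (M, D)  ->  returns an (N, M) feature matrix.
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * width ** 2))

# Data that is not linearly separable in input space (points inside vs. outside a circle).
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(200, 2))
t = (np.linalg.norm(X, axis=1) < 1.0).astype(float)

# After the nonlinear mapping, a linear model in feature space can separate the classes.
centres = rng.uniform(-2, 2, size=(16, 2))
Phi = rbf_features(X, centres, width=0.8)
print(Phi.shape)  # (200, 16): ready for least squares or logistic regression on Phi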
Logistic regression model
Determine parameters of logistic regression
• Maximum likelihood: write the likelihood of the training labels under the model and minimize the negative log-likelihood (the cross-entropy error)
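Written out, with 𝑦_𝑛 = 𝜎(𝐰^𝑇𝝓_𝑛) and targets 𝑡_𝑛 ∈ {0, 1}, the likelihood and the resulting cross-entropy error are (standard forms):
\[
p(\mathbf{t}\mid\mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}(1 - y_n)^{1 - t_n}, \qquad E(\mathbf{w}) = -\ln p(\mathbf{t}\mid\mathbf{w}) = -\sum_{n=1}^{N}\bigl\{t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\bigr\}.
\]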
Determine parameters of logistic regression
• The gradient of the error function w.r.t. 𝐰 is ∇𝐸(𝐰) = Σ_𝑛 (𝑦_𝑛 − 𝑡_𝑛)𝝓_𝑛
• Derivation: apply the chain rule through 𝑦_𝑛 = 𝜎(𝑎_𝑛) and 𝑎_𝑛 = 𝐰^𝑇𝝓_𝑛, using d𝜎/d𝑎 = 𝜎(1 − 𝜎)
Newton-Raphson iterative optimization
• Update rule: 𝐰_new = 𝐰_old − 𝐇⁻¹∇𝐸(𝐰)
• Gradient: ∇𝐸(𝐰) = Σ_𝑛 (𝑦_𝑛 − 𝑡_𝑛)𝝓_𝑛 = 𝚽^𝑇(𝐲 − 𝐭)
• Hessian: 𝐇 = Σ_𝑛 𝑦_𝑛(1 − 𝑦_𝑛)𝝓_𝑛𝝓_𝑛^𝑇 = 𝚽^𝑇𝐑𝚽, where 𝐑 is diagonal with 𝑅_𝑛𝑛 = 𝑦_𝑛(1 − 𝑦_𝑛)
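A compact sketch of the Newton-Raphson (IRLS) update for two-class logistic regression, using the gradient 𝚽^𝑇(𝐲 − 𝐭) and Hessian 𝚽^𝑇𝐑𝚽 from the slide above (names and the small ridge term are illustrative choices):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic_regression(Phi, t, n_iter=20, ridge=1e-6):
    # Phi: (N, M) design matrix of basis-function features, t: (N,) targets in {0, 1}.
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)                   # current predictions y_n
        grad = Phi.T @ (y - t)                 # gradient  Phi^T (y - t)
        R = y * (1.0 - y)                      # diagonal of R, R_nn = y_n (1 - y_n)
        H = Phi.T @ (Phi * R[:, None])         # Hessian   Phi^T R Phi
        H += ridge * np.eye(M)                 # tiny ridge for numerical stability
        w = w - np.linalg.solve(H, grad)       # Newton-Raphson step: w <- w - H^{-1} grad
    return w

# Toy usage: 1-D inputs with a bias feature.
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=100)
t = (x + 0.3 * rng.normal(size=100) > 0).astype(float)
Phi = np.column_stack([np.ones_like(x), x])
print(irls_logistic_regression(Phi, t))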
Multiclass logistic regression
• Model: 𝑝(𝒞_𝑘|𝝓) = 𝑦_𝑘(𝝓) = exp(𝑎_𝑘) / Σ_𝑗 exp(𝑎_𝑗), where 𝑎_𝑘 = 𝐰_𝑘^𝑇𝝓
Likelihood function of multiclass logistic regression
• 𝑝(𝐓|𝐰_1, … , 𝐰_𝐾) = Π_𝑛 Π_𝑘 𝑦_𝑛𝑘^(𝑡_𝑛𝑘), where 𝑦_𝑛𝑘 = 𝑦_𝑘(𝝓_𝑛) and 𝐓 collects the 1-of-K coded targets 𝑡_𝑛𝑘
• The negative log-likelihood gives the multiclass cross-entropy error 𝐸 = −Σ_𝑛 Σ_𝑘 𝑡_𝑛𝑘 ln 𝑦_𝑛𝑘
Newton-Raphson iterative optimization
Background
• Variable dependence: 𝐰 → 𝑎 → 𝑦 → 𝐸
• The gradient and Hessian of 𝐸 w.r.t. the weights follow from the chain rule applied along this dependence, as written out below
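The chain of derivatives along 𝐰 → 𝑎 → 𝑦 → 𝐸, written out for the softmax / cross-entropy case with 𝑎_𝑛𝑗 = 𝐰_𝑗^𝑇𝝓_𝑛 (standard results):
\[
\frac{\partial E}{\partial y_{nk}} = -\frac{t_{nk}}{y_{nk}}, \qquad \frac{\partial y_{nk}}{\partial a_{nj}} = y_{nk}(I_{kj} - y_{nj}),
\]
\[
\frac{\partial E}{\partial a_{nj}} = \sum_k \frac{\partial E}{\partial y_{nk}}\frac{\partial y_{nk}}{\partial a_{nj}} = \sum_k -t_{nk}(I_{kj} - y_{nj}) = y_{nj}\sum_k t_{nk} - t_{nj} = y_{nj} - t_{nj},
\]
using Σ_𝑘 𝑡_𝑛𝑘 = 1; finally ∂𝑎_𝑛𝑗/∂𝐰_𝑗 = 𝝓_𝑛, which gives the gradient on the next slide.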
Newton-Raphson iterative optimization
• Variable dependence: 𝐰 → 𝑎 → 𝑦 → 𝐸
• Gradient: ∇_(𝐰_𝑗)𝐸 = Σ_𝑛 (𝑦_𝑛𝑗 − 𝑡_𝑛𝑗)𝝓_𝑛
• Hessian: the (𝑗, 𝑘) block is ∇_(𝐰_𝑘)∇_(𝐰_𝑗)𝐸 = Σ_𝑛 𝑦_𝑛𝑘(𝐼_𝑘𝑗 − 𝑦_𝑛𝑗)𝝓_𝑛𝝓_𝑛^𝑇
Discriminative approach summary
Summary
• Linear discriminant
➢ Two-class discriminant
➢ K-class discriminant
➢ Fisher’s linear discriminant
• Probabilistic generative model
➢ Class-conditional probability and class prior probability
➢ ML solution
• Probabilistic discriminative model: Logistic regression
➢ Posterior probability
➢ Newton-Raphson iterative optimization
Thank You for Your Attention!