
Introduction to Machine Learning

Linear Models for Classification


林彥宇 教授
Yen-Yu Lin, Professor
國立陽明交通大學 資訊工程學系
Computer Science, National Yang Ming Chiao Tung University

Some slides are modified from Prof. Sheng-Jyh Wang and Prof. Hwang-Tzong Chen
Classification problem

• The goal in classification is to take an input vector 𝐱 and assign it to one of 𝐾 discrete classes 𝐶_𝑘, where 𝑘 = 1, 2, …, 𝐾

• The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces

Photo: https://ww2.mathworks.cn/help/stats/create-and-visualize-discriminant-analysis-classifier.html 2
Classification problem

• Linear models for classification: the decision surfaces are linear functions of the input vector 𝐱

2D case 3D case

3
Classification problem

• Linear models for classification: the decision surfaces are linear functions of the input vector 𝐱
• The decision surfaces are defined by (𝐷 − 1)-dimensional
hyperplanes within the 𝐷-dimensional input space
• Data whose classes can be separated by linear decision
surfaces are said to be linearly separable

linearly separable vs. not linearly separable 4
Classification problem

• Given a training data set comprising 𝑁 observations {𝐱_𝑛}_{𝑛=1}^𝑁 and the corresponding target labels {𝑡_𝑛}_{𝑛=1}^𝑁, the goal of classification is to predict the label 𝑡 for a new data sample 𝐱
➢ Categorical outputs, e.g., yes/no, dog/cat/other, called labels
➢ A classifier assigns each input vector to one of these labels
• Binary classification: two possible labels
• Multi-class classification: multiple possible labels
• Label representation
➢ Two classes: 𝑡 ∈ {1, 0} or 𝑡 ∈ {+1, −1}
➢ Multiple classes, e.g., 𝐾 = 5: 𝐭 = (0, 1, 0, 0, 0)^T for a sample of class 2 (1-of-K scheme)

5
Three representative linear classifiers

• Different linear models for classification


➢ Discriminant functions

➢ Generative approach using Bayes’ theorem: model 𝑝(𝐱|𝐶_𝑘) and 𝑝(𝐶_𝑘), then compute 𝑝(𝐶_𝑘|𝐱) ∝ 𝑝(𝐱|𝐶_𝑘)𝑝(𝐶_𝑘)

➢ Discriminative approach that directly models the posterior probability 𝑝(𝐶_𝑘|𝐱)

6
Linear discriminant for two-class classification

• A linear discriminant is a linear function that takes an input vector 𝐱 and assigns it to one of 𝐾 classes

• A linear discriminant function for two-class classification: 𝑦(𝐱) = 𝐰^T 𝐱 + 𝑤_0,
where 𝐰 is the weight vector and 𝑤_0 is the bias

• The decision boundary is 𝑦(𝐱) = 0, i.e., 𝐱 is classified as 𝐶_1 if 𝑦(𝐱) ≥ 0 and as 𝐶_2 otherwise

• 𝑦(𝐱) = 0 is a (𝐷 − 1)-dimensional hyperplane within the 𝐷-dimensional input space
7
Properties of a linear discriminant

• Weight vector 𝐰 is orthogonal to every vector lying within the decision boundary
➢ Consider two points 𝐱_𝐴 and 𝐱_𝐵, which lie on the decision boundary
➢ We have 𝑦(𝐱_𝐴) = 𝑦(𝐱_𝐵) = 0, leading to 𝐰^T(𝐱_𝐴 − 𝐱_𝐵) = 0

Photo: K.-R. Muller 8


Properties of a linear discriminant

• The distance from the origin to the decision boundary is −𝑤_0 / ‖𝐰‖, i.e., the bias 𝑤_0 sets the offset of the boundary from the origin

Photo: G. Shakhnarovich

9
Properties of a linear discriminant

• How to compute the distance from an arbitrary point 𝐱 to the decision boundary 𝑦(𝐱) = 0?
➢ The signed distance is 𝑦(𝐱)/‖𝐰‖: write 𝐱 = 𝐱_⊥ + 𝑟 𝐰/‖𝐰‖ with 𝐱_⊥ on the boundary; then 𝑦(𝐱) = 𝑟‖𝐰‖, so 𝑟 = 𝑦(𝐱)/‖𝐰‖

10
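A minimal NumPy sketch of the quantities above, assuming a 2-D toy discriminant with made-up values for 𝐰 and 𝑤_0 (they are illustrative, not taken from the slides):

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
w0 = 0.5                    # hypothetical bias

def y(x):
    """Linear discriminant y(x) = w^T x + w0."""
    return w @ x + w0

def signed_distance(x):
    """Signed distance from x to the hyperplane y(x) = 0."""
    return y(x) / np.linalg.norm(w)

x = np.array([1.0, 3.0])
print(y(x))                      # classify as C1 if >= 0, else C2
print(signed_distance(x))        # positive on the C1 side, negative on the C2 side
print(-w0 / np.linalg.norm(w))   # signed distance from the origin to the boundary
```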
Linear discriminant for multi-class classification

• Consider the extension of linear discriminants to 𝐾 > 2 classes

• Many classifiers cannot directly extend to multi-class classification
➢ Build a 𝐾-class discriminant by combining a number of two-class
discriminant functions
➢ One-versus-the-rest strategy
➢ One-versus-one strategy

11
One-versus-the-rest

• One-vs-the-rest
➢ Learn 𝐾 − 1 two-class classifiers (linear discriminants)
➢ Classifier 1 is derived to separate data of class 1 from the rest
➢ Classifier 2 is derived to separate data of class 2 from the rest
➢…
➢ Classifier 𝐾 − 1 is derived to separate data of class 𝐾 − 1 from
the rest

3-class classification with 2 one-vs-the-rest linear discriminants
12
One-vs-one

• One-vs-one
➢ Learn 𝐾(𝐾 − 1)/2 two-class classifiers, one for each class pair
➢ For classes i and j, a binary classifier is learned to separate data
of class i from those of class j
➢ Classification is done by majority vote
• The problem of ambiguous regions

3-class classification with 3 one-vs-one linear discriminants

13
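An illustrative sketch of the two strategies using scikit-learn's multiclass wrappers (not part of the slides; note the slides build one-vs-the-rest from 𝐾 − 1 binary classifiers, whereas sklearn's OneVsRestClassifier trains one per class). The toy 3-class data are made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
               for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 30)

ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)   # one binary classifier per class
ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)    # K(K-1)/2 binary classifiers, majority vote

x_new = np.array([[2.5, 0.2]])
print(ovr.predict(x_new), ovo.predict(x_new))
```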
K-class discriminant

• A single K-class discriminant can avoid the problem of ambiguous regions
➢ It is composed of K linear functions of the form 𝑦_𝑘(𝐱) = 𝐰_𝑘^T 𝐱 + 𝑤_𝑘0 for 𝑘 = 1, 2, …, 𝐾
➢ It assigns a point 𝐱 to class k if 𝑦_𝑘(𝐱) > 𝑦_𝑗(𝐱) for all 𝑗 ≠ 𝑘
➢ The decision boundary between class k and class j is 𝑦_𝑘(𝐱) = 𝑦_𝑗(𝐱), i.e., (𝐰_𝑘 − 𝐰_𝑗)^T 𝐱 + (𝑤_𝑘0 − 𝑤_𝑗0) = 0

14
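A minimal sketch of the single K-class discriminant: compute 𝑦_𝑘(𝐱) = 𝐰_𝑘^T 𝐱 + 𝑤_𝑘0 for every class and assign 𝐱 to the class with the largest value. The parameter values are hypothetical:

```python
import numpy as np

W = np.array([[ 1.0,  0.0],
              [-0.5,  1.0],
              [ 0.2, -1.0]])        # K x D matrix; row k is w_k
w0 = np.array([0.0, 0.3, -0.1])     # biases w_k0

def predict(x):
    scores = W @ x + w0             # y_k(x) for k = 1..K
    return int(np.argmax(scores))   # class with the largest discriminant value

print(predict(np.array([0.5, 2.0])))
```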
K-class discriminant

• An example: 3-class discriminant

• The decision regions of such a discriminant are convex
• Consider two points 𝐱_𝐴 and 𝐱_𝐵, which lie inside decision region ℛ_𝑘
• Any point 𝐱̂ that lies on the line connecting 𝐱_𝐴 and 𝐱_𝐵 can be expressed as 𝐱̂ = 𝜆𝐱_𝐴 + (1 − 𝜆)𝐱_𝐵 with 0 ≤ 𝜆 ≤ 1

• It can be proved that 𝐱̂ also lies inside ℛ_𝑘

➢ Linear function of class k: 𝑦_𝑘(𝐱) = 𝐰_𝑘^T 𝐱 + 𝑤_𝑘0
➢ Proof: 𝑦_𝑘(𝐱̂) = 𝜆𝑦_𝑘(𝐱_𝐴) + (1 − 𝜆)𝑦_𝑘(𝐱_𝐵); since 𝑦_𝑘(𝐱_𝐴) > 𝑦_𝑗(𝐱_𝐴) and 𝑦_𝑘(𝐱_𝐵) > 𝑦_𝑗(𝐱_𝐵) for all 𝑗 ≠ 𝑘, it follows that 𝑦_𝑘(𝐱̂) > 𝑦_𝑗(𝐱̂), so 𝐱̂ ∈ ℛ_𝑘

15
Linear discriminant learning

• Learning focuses on estimating a good decision boundary


• We need to optimize parameters 𝒘 and 𝑤0 of the boundary
• What does good mean here?
• Is this boundary good?

Photo: R. Zemel et al.

• We need a criterion to tell how to optimize these parameters


16
Least squares for classification

• Use least squares technique to solve a K-class discriminant

• Each class k is described by its own linear model 𝑦_𝑘(𝐱) = 𝐰_𝑘^T 𝐱 + 𝑤_𝑘0

• A point 𝐱 is assigned to class k if 𝑦_𝑘(𝐱) > 𝑦_𝑗(𝐱) for all 𝑗 ≠ 𝑘

17
Least squares for classification

• Some notations
➢ A data point 𝐱 in augmented form: 𝐱̃ = (1, 𝐱^T)^T
➢ 1-of-K binary coding for the label vector of 𝐱: 𝐭 = [0, 1, 0, 0, 0]^T
➢ The linear model for class k: 𝐰̃_𝑘 = (𝑤_𝑘0, 𝐰_𝑘^T)^T
➢ Apply the linear model for class k to a point 𝐱: 𝑦_𝑘(𝐱) = 𝐰̃_𝑘^T 𝐱̃

➢ All data points: X̃, whose nth row is 𝐱̃_𝑛^T. All data label vectors: T, whose nth row is 𝐭_𝑛^T

➢ All linear models: W̃ = [𝐰̃_1, …, 𝐰̃_𝐾], whose kth column is 𝐰̃_𝑘

➢ Apply all linear models to a point 𝐱: 𝐲(𝐱) = W̃^T 𝐱̃

18
Least squares for classification

• The squared difference between 𝐲(𝐱_𝑛) = W̃^T 𝐱̃_𝑛 and 𝐭_𝑛: ‖W̃^T 𝐱̃_𝑛 − 𝐭_𝑛‖^2

• Sum-of-squares error: 𝐸_𝐷(W̃) = ½ Tr{(X̃W̃ − T)^T (X̃W̃ − T)}

• Proof sketch
➢ Tr(AB) = Tr(BA)
➢ Tr(BA) is the sum of the diagonal elements of square matrix BA
➢ The nth diagonal element is the squared error of point 𝐱_𝑛
• Setting the derivative w.r.t. W̃ to 0, we obtain W̃ = (X̃^T X̃)^{−1} X̃^T T = X̃^† T

19
Least squares for classification

• After getting W̃, we classify a new data point 𝐱 via 𝐲(𝐱) = W̃^T 𝐱̃ and assign 𝐱 to the class with the largest output

20
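A minimal sketch of least-squares classification on a made-up 3-class data set: augment the inputs with a bias component, solve W̃ = X̃^† T with the pseudo-inverse, and classify by the largest output:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(40, 2))
               for c in ([0, 0], [3, 1], [1, 4])])
labels = np.repeat([0, 1, 2], 40)
T = np.eye(3)[labels]                                # 1-of-K target vectors, shape (N, K)

X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias column
W_tilde = np.linalg.pinv(X_tilde) @ T                # W = (X'^T X')^{-1} X'^T T

def predict(x):
    x_tilde = np.concatenate([[1.0], x])
    return int(np.argmax(W_tilde.T @ x_tilde))       # y(x) = W^T x_tilde

print(predict(np.array([2.8, 0.9])))
```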
Least squares for classification

• The least-squares solutions are sensitive to outliers

Magenta: least squares; green: logistic regression
21
Least squares for classification

• The least squares method sometimes gives poor results

Left: least squares; right: logistic regression
22
Fisher’s linear discriminant: 2 classes

• Fisher’s linear discriminant (FLD): a non-probabilistic method for dimensionality reduction

• Consider the case of two classes, and suppose we take a 𝐷-dimensional input vector 𝐱 and project it onto one dimension by 𝑦 = 𝐰^T 𝐱

• If we place a threshold on 𝑦 and classify 𝐱 as class 𝐶_1 if 𝑦 ≥ −𝑤_0, and otherwise as class 𝐶_2, we get the linear classifier discussed previously

• In general, dimensionality reduction leads to information loss, but we can select a projection maximizing data separation
23
Linear discriminant with maximum separation

• A two-class problem where there are 𝑁_1 points of class 𝐶_1 and 𝑁_2 points of class 𝐶_2
• The mean vectors of the two classes are 𝐦_1 = (1/𝑁_1) Σ_{𝑛∈𝐶_1} 𝐱_𝑛 and 𝐦_2 = (1/𝑁_2) Σ_{𝑛∈𝐶_2} 𝐱_𝑛

• An intuitive choice of 𝐰 is one that maximizes the distance between the projected mean vectors, i.e., 𝑚_2 − 𝑚_1 = 𝐰^T(𝐦_2 − 𝐦_1), where 𝑚_𝑘 = 𝐰^T 𝐦_𝑘
• However, the distance can be arbitrarily large by increasing the magnitude of 𝐰
24
Linear discriminant with maximum separation

• We constrain 𝐰 to have unit length, i.e., 𝐰^T 𝐰 = 1

• The constrained optimization problem: max_𝐰 𝐰^T(𝐦_2 − 𝐦_1) subject to 𝐰^T 𝐰 = 1

• The optimal 𝐰 in the optimization problem above: 𝐰 ∝ (𝐦_2 − 𝐦_1)

• How to prove?
➢ Use a Lagrange multiplier to solve it
➢ By setting the gradient of the Lagrange function 𝐰^T(𝐦_2 − 𝐦_1) + 𝜆(1 − 𝐰^T 𝐰) w.r.t. the optimization variables to 0, we get 𝐰 ∝ (𝐦_2 − 𝐦_1)

25
Linear discriminant with maximum separation

• Is the obtained 𝐰 good?

➢ +: 𝐦1 , +: 𝐦2 , +: threshold
➢ Histograms of the two classes overlap

26
Linear discriminant with maximum separation

• Is the obtained 𝐰 good?

➢ +: 𝐦1 , +: 𝐦2 , +: threshold
➢ Histograms of the two classes overlap
➢ Right plot: The projection learned by FLD
27
Fisher’s linear discriminant: 2 classes

• FLD seeks the projection 𝐰 that gives a large distance between the projected data means while giving a small variance within each class

• Maximize the between-class variance (𝑚_2 − 𝑚_1)^2, where 𝑚_𝑘 = 𝐰^T 𝐦_𝑘
• Minimize the within-class variance 𝑠_1^2 + 𝑠_2^2, where 𝑠_𝑘^2 = Σ_{𝑛∈𝐶_𝑘} (𝑦_𝑛 − 𝑚_𝑘)^2 and 𝑦_𝑛 = 𝐰^T 𝐱_𝑛

• The objective (Fisher criterion): 𝐽(𝐰) = (𝑚_2 − 𝑚_1)^2 / (𝑠_1^2 + 𝑠_2^2)

28
Fisher’s linear discriminant: 2 classes

• The objective (Fisher criterion): 𝐽(𝐰) = (𝐰^T S_B 𝐰) / (𝐰^T S_W 𝐰),

where S_B = (𝐦_2 − 𝐦_1)(𝐦_2 − 𝐦_1)^T is the between-class covariance matrix and

S_W = Σ_{𝑛∈𝐶_1} (𝐱_𝑛 − 𝐦_1)(𝐱_𝑛 − 𝐦_1)^T + Σ_{𝑛∈𝐶_2} (𝐱_𝑛 − 𝐦_2)(𝐱_𝑛 − 𝐦_2)^T is the within-class covariance matrix

29
Fisher’s linear discriminant: 2 classes

• Differentiate the Fisher criterion w.r.t. 𝐰

• We find that (𝐰^T S_B 𝐰) S_W 𝐰 = (𝐰^T S_W 𝐰) S_B 𝐰

• As S_B 𝐰 = (𝐦_2 − 𝐦_1)(𝐦_2 − 𝐦_1)^T 𝐰, S_B 𝐰 is in the direction of (𝐦_2 − 𝐦_1)
• We have 𝐰 ∝ S_W^{−1}(𝐦_2 − 𝐦_1)

30
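A minimal sketch of Fisher's linear discriminant for two classes, computing 𝐰 ∝ S_W^{−1}(𝐦_2 − 𝐦_1) on a made-up 2-D data set:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=[2.0, 0.5], size=(100, 2))   # class C1
X2 = rng.normal(loc=[3, 1], scale=[2.0, 0.5], size=(100, 2))   # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)        # within-class covariance
w = np.linalg.solve(S_W, m2 - m1)                              # w proportional to S_W^{-1}(m2 - m1)
w /= np.linalg.norm(w)

# Project both classes onto w; a threshold on the projected values separates them.
y1, y2 = X1 @ w, X2 @ w
print(y1.mean(), y2.mean())
```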
Fisher’s linear discriminant: 2 classes

• The optimized 𝐰 is called Fisher’s linear discriminant

• Project training data into a one-dimensional space via 𝐰


➢ Classification can be carried out by several methods
➢ Determine a threshold 𝑦0
◆Predict a point 𝐱 as 𝐶_1 if 𝑦(𝐱) ≥ 𝑦_0, and as 𝐶_2 otherwise
➢ Use the nearest-neighbor rule
◆Project all training data into the one-dimensional space via 𝐰
◆Project a testing point 𝐱 to the same space
◆Retrieve the nearest training sample of 𝐱 in the projected space
◆Predict 𝐱 as the class that the retrieved sample belongs to

31
Fisher’s linear discriminant: Multiple classes

• Fisher’s linear discriminant (FLD) for 𝐾 > 2 classes

• Assume the dimension of the input space is 𝐷, which is greater than 𝐾

• FLD introduces 𝐷′ ≥ 1 linear weight vectors 𝐰_𝑘, giving the projected features 𝑦_𝑘 = 𝐰_𝑘^T 𝐱 for 𝑘 = 1, …, 𝐷′

• Gathering the weight vectors together projects each data point 𝐱 to a 𝐷′-dimensional space: 𝐲 = 𝐖^T 𝐱,

where the weight vectors 𝐰_𝑘 are the columns of 𝐖

32
Fisher’s linear discriminant: Multiple classes

• Generalize the within-class covariance matrix to 𝐾 classes

• Recall the within-class covariance matrix when 𝐾 = 2: S_W = S_1 + S_2

• The within-class covariance matrix when 𝐾 ≥ 2: S_W = Σ_{𝑘=1}^𝐾 S_𝑘,

where S_𝑘 = Σ_{𝑛∈𝐶_𝑘} (𝐱_𝑛 − 𝐦_𝑘)(𝐱_𝑛 − 𝐦_𝑘)^T

and 𝐦_𝑘 = (1/𝑁_𝑘) Σ_{𝑛∈𝐶_𝑘} 𝐱_𝑛

33
Fisher’s linear discriminant: Multiple classes

• Recall the between-class covariance matrix when 𝐾 = 2: S_B = (𝐦_2 − 𝐦_1)(𝐦_2 − 𝐦_1)^T

• The extended between-class covariance matrix for 𝐾 > 2: S_B = Σ_{𝑘=1}^𝐾 𝑁_𝑘 (𝐦_𝑘 − 𝐦)(𝐦_𝑘 − 𝐦)^T,

where 𝐦 = (1/𝑁) Σ_{𝑛=1}^𝑁 𝐱_𝑛 is the mean of all data points

34
Fisher’s linear discriminant: Multiple classes

• Consider the case where FLD projects data to a one-dimensional space, i.e., 𝐷′ = 1

• An equivalent objective: max_𝐰 𝐰^T S_B 𝐰 subject to 𝐰^T S_W 𝐰 = 1

• Lagrangian function: 𝐿(𝐰, 𝜆) = 𝐰^T S_B 𝐰 + 𝜆(1 − 𝐰^T S_W 𝐰)

• We have S_B 𝐰 = 𝜆 S_W 𝐰, i.e., S_W^{−1} S_B 𝐰 = 𝜆𝐰
• The optimal 𝐰 is the eigenvector of S_W^{−1} S_B that corresponds to the largest eigenvalue
35
Fisher’s linear discriminant: Multiple classes

• Consider the case where FLD projects data to a multi-dimensional space, i.e., 𝐷′ > 1
• Can we directly extend the objective to learn a multi-dimensional projection? No

𝐽(𝐖) = (𝐖^T S_B 𝐖) / (𝐖^T S_W 𝐖)

is not well defined, since 𝐖^T S_B 𝐖 and 𝐖^T S_W 𝐖 are 𝐷′ × 𝐷′ matrices rather than scalars

• A choice for the objective is 𝐽(𝐖) = Tr{(𝐖^T S_W 𝐖)^{−1}(𝐖^T S_B 𝐖)}

• The columns of the optimal 𝐖 are the eigenvectors of S_W^{−1} S_B that correspond to the 𝐷′ largest eigenvalues

36
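A minimal sketch of multi-class FLD on a made-up 3-class, 3-D data set: build S_W and S_B, then take the eigenvectors of S_W^{−1} S_B with the 𝐷′ largest eigenvalues as the columns of 𝐖:

```python
import numpy as np

rng = np.random.default_rng(0)
means = [np.array([0.0, 0.0, 0.0]), np.array([3.0, 0.0, 1.0]), np.array([0.0, 3.0, -1.0])]
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(50, 3)) for m in means])
labels = np.repeat([0, 1, 2], 50)

m_all = X.mean(axis=0)
S_W = np.zeros((3, 3))
S_B = np.zeros((3, 3))
for k in range(3):
    Xk = X[labels == k]
    mk = Xk.mean(axis=0)
    S_W += (Xk - mk).T @ (Xk - mk)
    S_B += len(Xk) * np.outer(mk - m_all, mk - m_all)

eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(eigvals.real)[::-1]
D_prime = 2                                   # at most K - 1 = 2 useful directions
W = eigvecs.real[:, order[:D_prime]]          # columns are the FLD projection vectors
Z = X @ W                                     # projected data, shape (N, D')
print(W.shape, Z.shape)
```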
Fisher’s linear discriminant: Multiple classes

• About the value 𝐷′, the dimension of the projected space

• Note that the rank of the between-class covariance matrix S_B is at most 𝐾 − 1

• In other words, the dimension of the projected space found by FLD is at most 𝐾 − 1

37
Probabilistic generative models: Two-class case

• In a generative model, we model the class-conditional densities 𝑝(𝐱|𝐶_𝑘) and class priors 𝑝(𝐶_𝑘), and then use them to compute posterior probabilities 𝑝(𝐶_𝑘|𝐱) via Bayes’ theorem

• Consider two-class cases. The posterior probability for class 𝐶_1 is defined by

𝑝(𝐶_1|𝐱) = 𝑝(𝐱|𝐶_1)𝑝(𝐶_1) / [𝑝(𝐱|𝐶_1)𝑝(𝐶_1) + 𝑝(𝐱|𝐶_2)𝑝(𝐶_2)] = 𝜎(𝑎), with 𝑎 = ln [𝑝(𝐱|𝐶_1)𝑝(𝐶_1) / (𝑝(𝐱|𝐶_2)𝑝(𝐶_2))],

where 𝜎(𝑎) is the logistic sigmoid function

38
Logistic sigmoid function

• Logistic sigmoid function 𝜎(𝑎) = 1 / (1 + exp(−𝑎)) maps the whole real axis into (0, 1)
➢ Symmetric property: 𝜎(−𝑎) = 1 − 𝜎(𝑎)

• The variable 𝑎 here represents the log of the ratio of probabilities, 𝑎 = ln [𝑝(𝐶_1|𝐱) / 𝑝(𝐶_2|𝐱)], also known as the log odds

• Logistic sigmoid function (red curve): 𝜎(𝑎) plotted against 𝑎 39
Probabilistic generative models: Multi-class case

• For the case of 𝐾 > 2 classes, the posterior probability for class 𝐶_𝑘 is defined by

𝑝(𝐶_𝑘|𝐱) = 𝑝(𝐱|𝐶_𝑘)𝑝(𝐶_𝑘) / Σ_𝑗 𝑝(𝐱|𝐶_𝑗)𝑝(𝐶_𝑗) = exp(𝑎_𝑘) / Σ_𝑗 exp(𝑎_𝑗),

where 𝑎_𝑘 = ln [𝑝(𝐱|𝐶_𝑘)𝑝(𝐶_𝑘)]

• Multi-class generalization of the logistic sigmoid function, or softmax function
➢ If 𝑎_𝑘 ≫ 𝑎_𝑗 for all 𝑗 ≠ 𝑘, we have 𝑝(𝐶_𝑘|𝐱) ≈ 1 and 𝑝(𝐶_𝑗|𝐱) ≈ 0

40
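A minimal sketch of the two functions above; the max-subtraction in the softmax is a standard numerical safeguard, not something stated on the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    a = a - np.max(a)            # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

print(sigmoid(0.0))                        # 0.5
print(sigmoid(-2.0) + sigmoid(2.0))        # symmetry: sigma(-a) = 1 - sigma(a)
print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1; the largest activation dominates
```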
Continuous inputs: Two-class case

• Assume that the class-conditional densities are Gaussian and all classes share the same covariance matrix: 𝑝(𝐱|𝐶_𝑘) = 𝒩(𝐱|𝛍_𝑘, 𝚺)

• Recall 𝑝(𝐶_1|𝐱) = 𝜎(𝑎), where 𝑎 = ln [𝑝(𝐱|𝐶_1)𝑝(𝐶_1) / (𝑝(𝐱|𝐶_2)𝑝(𝐶_2))]

• We have 𝑝(𝐶_1|𝐱) = 𝜎(𝐰^T 𝐱 + 𝑤_0),
where 𝐰 = 𝚺^{−1}(𝛍_1 − 𝛍_2) and 𝑤_0 = −½ 𝛍_1^T 𝚺^{−1} 𝛍_1 + ½ 𝛍_2^T 𝚺^{−1} 𝛍_2 + ln [𝑝(𝐶_1)/𝑝(𝐶_2)]

• Note that the quadratic terms in 𝐱 are canceled


41
Class-conditional and posterior probabilities

Left: class-conditional densities, 𝑝(𝐱|𝐶_1) in red and 𝑝(𝐱|𝐶_2) in blue. Right: posterior probability 𝑝(𝐶_1|𝐱). Note that the posterior is a logistic sigmoid of a linear function of 𝐱

42
Continuous inputs: Multi-class case

• Assume that the class-conditional densities are Gaussian and all classes share the same covariance matrix: 𝑝(𝐱|𝐶_𝑘) = 𝒩(𝐱|𝛍_𝑘, 𝚺)

• Recall 𝑝(𝐶_𝑘|𝐱) = exp(𝑎_𝑘) / Σ_𝑗 exp(𝑎_𝑗), where 𝑎_𝑘 = ln [𝑝(𝐱|𝐶_𝑘)𝑝(𝐶_𝑘)]

• We have 𝑎_𝑘(𝐱) = 𝐰_𝑘^T 𝐱 + 𝑤_𝑘0,
where 𝐰_𝑘 = 𝚺^{−1} 𝛍_𝑘 and 𝑤_𝑘0 = −½ 𝛍_𝑘^T 𝚺^{−1} 𝛍_𝑘 + ln 𝑝(𝐶_𝑘)

43
Continuous inputs: Multi-class case

• If each class-conditional density has its own covariance matrix 𝚺_𝑘, it leads to a quadratic discriminant

Left: three class-conditional densities. Right: posterior probabilities.
The decision boundary between the red and green classes is linear, while those between other pairs are quadratic
44
Determine parameter values via maximum likelihood

• We specify the functional form of the class-conditional density: 𝑝(𝐱|𝐶_𝑘) = 𝒩(𝐱|𝛍_𝑘, 𝚺)

• We assume the prior class probabilities take the form 𝑝(𝐶_1) = 𝜋 and 𝑝(𝐶_2) = 1 − 𝜋
• Suppose a set of 𝑁 data points {𝐱_𝑛, 𝑡_𝑛} is provided, where 𝑡_𝑛 = 1 denotes class 𝐶_1 and 𝑡_𝑛 = 0 denotes class 𝐶_2

• Our goal is to determine the values of the parameters 𝜋, 𝛍_1, 𝛍_2, and 𝚺 to complete classification

45
Determine parameter values via maximum likelihood

• For a data point 𝐱_𝑛 from class 𝐶_1, we have 𝑡_𝑛 = 1 and 𝑝(𝐱_𝑛, 𝐶_1) = 𝑝(𝐶_1)𝑝(𝐱_𝑛|𝐶_1) = 𝜋 𝒩(𝐱_𝑛|𝛍_1, 𝚺)

• Similarly, for class 𝐶_2, we have 𝑡_𝑛 = 0 and 𝑝(𝐱_𝑛, 𝐶_2) = (1 − 𝜋) 𝒩(𝐱_𝑛|𝛍_2, 𝚺)

• Suppose data are i.i.d. The likelihood function is given by

𝑝(𝐭, 𝐗|𝜋, 𝛍_1, 𝛍_2, 𝚺) = Π_{𝑛=1}^𝑁 [𝜋 𝒩(𝐱_𝑛|𝛍_1, 𝚺)]^{𝑡_𝑛} [(1 − 𝜋) 𝒩(𝐱_𝑛|𝛍_2, 𝚺)]^{1−𝑡_𝑛},

where 𝐭 = (𝑡_1, …, 𝑡_𝑁)^T

46
Maximum likelihood solution for 𝜋

• It is convenient to maximize the log of the likelihood function


• Consider first the maximization w.r.t. 𝜋
• The terms in the log likelihood function that depend on 𝜋 are Σ_{𝑛=1}^𝑁 [𝑡_𝑛 ln 𝜋 + (1 − 𝑡_𝑛) ln(1 − 𝜋)]

• Setting the derivative w.r.t. 𝜋 to zero, we obtain 𝜋 = (1/𝑁) Σ_{𝑛=1}^𝑁 𝑡_𝑛 = 𝑁_1 / (𝑁_1 + 𝑁_2),

where 𝑁_1 is the number of data points in class 𝐶_1 and 𝑁_2 is the number of data points in class 𝐶_2
• The ML estimate for 𝜋 is simply the fraction of points in class 𝐶_1
47
Maximum likelihood solution for

• Consider the maximization w.r.t. 𝛍_1

• We pick those terms that depend on 𝛍_1 in the log likelihood function: −½ Σ_{𝑛=1}^𝑁 𝑡_𝑛 (𝐱_𝑛 − 𝛍_1)^T 𝚺^{−1} (𝐱_𝑛 − 𝛍_1) + const

• Setting the derivative w.r.t. 𝛍_1 to zero leads to 𝛍_1 = (1/𝑁_1) Σ_{𝑛=1}^𝑁 𝑡_𝑛 𝐱_𝑛

• It is simply the mean of all data points belonging to class 𝐶1

48
Maximum likelihood solution for

• Similarly, the corresponding result for 𝛍_2 is 𝛍_2 = (1/𝑁_2) Σ_{𝑛=1}^𝑁 (1 − 𝑡_𝑛) 𝐱_𝑛

• It is simply the mean of all data points belonging to class 𝐶2

49
Maximum likelihood solution for

• Finally, consider the maximum likelihood solution for the shared covariance matrix 𝚺

• Picking out the terms in the log likelihood function that depend on 𝚺, we get

−(𝑁/2) ln |𝚺| − (𝑁/2) Tr{𝚺^{−1} 𝐒}

50
Maximum likelihood solution for

where 𝐒 = (𝑁_1/𝑁) 𝐒_1 + (𝑁_2/𝑁) 𝐒_2 and 𝐒_𝑘 = (1/𝑁_𝑘) Σ_{𝑛∈𝐶_𝑘} (𝐱_𝑛 − 𝛍_𝑘)(𝐱_𝑛 − 𝛍_𝑘)^T

• Setting the derivative to zero, we have 𝚺 = 𝐒

• The derivation

51
Generative approach summary

• Class-conditional densities 𝑝(𝐱|𝐶_𝑘) = 𝒩(𝐱|𝛍_𝑘, 𝚺)

• Class prior probabilities 𝑝(𝐶_1) = 𝜋 and 𝑝(𝐶_2) = 1 − 𝜋

• Determine the parameter values of 𝜋, 𝛍_1, 𝛍_2, and 𝚺 via maximum likelihood
• Classification is carried out via 𝑝(𝐶_1|𝐱) = 𝜎(𝐰^T 𝐱 + 𝑤_0),
where 𝐰 = 𝚺^{−1}(𝛍_1 − 𝛍_2) and 𝑤_0 = −½ 𝛍_1^T 𝚺^{−1} 𝛍_1 + ½ 𝛍_2^T 𝚺^{−1} 𝛍_2 + ln [𝜋/(1 − 𝜋)]

• ML solution can be directly extended to multi-class cases

52
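A minimal sketch of the whole generative pipeline on made-up two-class data: fit 𝜋, 𝛍_1, 𝛍_2, and the shared 𝚺 with the ML estimates above, then classify through the sigmoid posterior:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=120)   # class C1
X2 = rng.multivariate_normal([2, 2], [[1.0, 0.3], [0.3, 1.0]], size=80)    # class C2

N1, N2 = len(X1), len(X2)
pi = N1 / (N1 + N2)                                  # prior p(C1)
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1) / N1
S2 = (X2 - mu2).T @ (X2 - mu2) / N2
Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)              # shared covariance

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(pi / (1 - pi)))

def posterior_C1(x):
    """p(C1 | x) = sigma(w^T x + w0)."""
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))

print(posterior_C1(np.array([0.0, 0.0])), posterior_C1(np.array([2.0, 2.0])))
```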
Generative vs. Discriminative models

• Probabilistic generative model: Indirect approach that finds the parameters of class-conditional densities and class priors, and applies Bayes’ theorem to get posterior probabilities

• Probabilistic discriminative models: Direct approach that uses a generalized linear model to represent posterior probabilities, and determines its parameters directly

• Advantages of discriminative models:
➢ Better performance in most cases, especially when the class-conditional density assumption gives a poor approximation to the true distribution
➢ Fewer parameters

53
Nonlinear basis functions for linear classification

• Nonlinear basis functions help when dealing with data that are
not linearly separable:

Figure: data points of class 𝐶_1 and class 𝐶_2; +: means of the Gaussian basis functions
54
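A minimal sketch of a Gaussian basis-function transform, 𝜙_𝑗(𝐱) = exp(−‖𝐱 − 𝛍_𝑗‖² / (2𝑠²)); the centres and width below are illustrative choices, not values from the slides:

```python
import numpy as np

centres = np.array([[0.0, 0.0], [1.0, 1.0]])   # means of the Gaussian basis functions
s = 0.5                                        # common width

def phi(x):
    """Map a 2-D input x to its Gaussian basis-function features."""
    d2 = ((x - centres) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * s ** 2))

# Data that are not linearly separable in the original space may become
# separable in the feature space spanned by phi.
print(phi(np.array([0.1, -0.2])))
print(phi(np.array([0.9, 1.1])))
```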
Logistic regression for two-class classification

• Recall the posterior probability for two-class classification: 𝑝(𝐶_1|𝐱) = 𝜎(𝑎)

• 1. Ignore the class-conditional probabilities and class priors
• 2. Apply basis functions for a nonlinear transform: 𝛟 = 𝛟(𝐱)
• 3. Assume the posterior probability can be written as a logistic sigmoid acting on a linear function of the feature vector
• We get 𝑝(𝐶_1|𝛟) = 𝑦(𝛟) = 𝜎(𝐰^T 𝛟) and 𝑝(𝐶_2|𝛟) = 1 − 𝑝(𝐶_1|𝛟),

where 𝜎(𝑎) = 1 / (1 + exp(−𝑎))
55
Logistic regression model

• This model is called logistic regression, though it is used for classification

• Suppose the dimension of 𝛟 is 𝑀. There are 𝑀 parameters to learn in logistic regression

• Cf. For the generative model, we use 2𝑀 parameters for the means of the two classes and 𝑀(𝑀 + 1)/2 parameters for the shared covariance matrix

56
Determine parameters of logistic regression

• The derivative of the logistic sigmoid function: d𝜎/d𝑎 = 𝜎(1 − 𝜎)

• Derivation: 𝜎(𝑎) = 1/(1 + e^{−𝑎}), so d𝜎/d𝑎 = e^{−𝑎}/(1 + e^{−𝑎})^2 = 𝜎(𝑎)(1 − 𝜎(𝑎))

• Given training data {𝛟_𝑛, 𝑡_𝑛}, where 𝛟_𝑛 = 𝛟(𝐱_𝑛) and 𝑡_𝑛 ∈ {0, 1}, for 𝑛 = 1, …, 𝑁, the likelihood function of logistic regression is

𝑝(𝐭|𝐰) = Π_{𝑛=1}^𝑁 𝑦_𝑛^{𝑡_𝑛} (1 − 𝑦_𝑛)^{1−𝑡_𝑛},

where 𝐭 = (𝑡_1, …, 𝑡_𝑁)^T and 𝑦_𝑛 = 𝑝(𝐶_1|𝛟_𝑛) = 𝜎(𝐰^T 𝛟_𝑛)
57
Determine parameters of logistic regression

• The negative log likelihood, called the cross-entropy error, is

𝐸(𝐰) = −ln 𝑝(𝐭|𝐰) = −Σ_{𝑛=1}^𝑁 [𝑡_𝑛 ln 𝑦_𝑛 + (1 − 𝑡_𝑛) ln(1 − 𝑦_𝑛)],

where 𝑦_𝑛 = 𝜎(𝐰^T 𝛟_𝑛)
• The gradient of the error function w.r.t. 𝐰 is ∇𝐸(𝐰) = Σ_{𝑛=1}^𝑁 (𝑦_𝑛 − 𝑡_𝑛) 𝛟_𝑛

• Derivation: apply the chain rule together with d𝜎/d𝑎 = 𝜎(1 − 𝜎)

58
Determine parameters of logistic regression

• Optimize the parameters by stochastic gradient descent

• Optimize the parameters by iterative reweighted least squares (IRLS), i.e., the Newton-Raphson iterative optimization scheme

𝐰^{new} = 𝐰^{old} − 𝐇^{−1} ∇𝐸(𝐰),

where 𝐇 is the Hessian matrix whose elements comprise the second derivatives of 𝐸(𝐰) w.r.t. 𝐰

59
Newton-Raphson iterative optimization

• Negative log likelihood function: 𝐸(𝐰) = −Σ_{𝑛=1}^𝑁 [𝑡_𝑛 ln 𝑦_𝑛 + (1 − 𝑡_𝑛) ln(1 − 𝑦_𝑛)]

• Gradient: ∇𝐸(𝐰) = Σ_{𝑛=1}^𝑁 (𝑦_𝑛 − 𝑡_𝑛) 𝛟_𝑛 = 𝚽^T(𝐲 − 𝐭)

• Hessian: 𝐇 = ∇∇𝐸(𝐰) = Σ_{𝑛=1}^𝑁 𝑦_𝑛(1 − 𝑦_𝑛) 𝛟_𝑛 𝛟_𝑛^T = 𝚽^T 𝐑 𝚽,

where 𝐑 is a diagonal matrix with 𝑅_{𝑛𝑛} = 𝑦_𝑛(1 − 𝑦_𝑛)

60
Newton-Raphson iterative optimization

• The Newton-Raphson update formula: 𝐰^{new} = 𝐰^{old} − (𝚽^T 𝐑 𝚽)^{−1} 𝚽^T(𝐲 − 𝐭)

• The Hessian matrix is positive definite

• Proof: for an arbitrary nonzero vector 𝐮, 𝐮^T 𝐇 𝐮 = (𝚽𝐮)^T 𝐑 (𝚽𝐮) = Σ_{𝑛=1}^𝑁 𝑦_𝑛(1 − 𝑦_𝑛)(𝐮^T 𝛟_𝑛)^2 > 0, since 0 < 𝑦_𝑛 < 1

61
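A minimal sketch of logistic regression trained with the IRLS update above on made-up two-class data; the small ridge added to the Hessian is only a numerical safeguard, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(60, 2))
X2 = rng.normal(loc=[2, 2], scale=1.0, size=(60, 2))
Phi = np.hstack([np.ones((120, 1)), np.vstack([X1, X2])])   # fixed basis: bias + raw inputs
t = np.concatenate([np.zeros(60), np.ones(60)])             # t = 1 for class C1 (second group here)

w = np.zeros(Phi.shape[1])
for _ in range(10):                        # a few Newton-Raphson steps usually suffice
    y = 1.0 / (1.0 + np.exp(-Phi @ w))     # y_n = sigma(w^T phi_n)
    R = np.diag(y * (1 - y))               # R_nn = y_n (1 - y_n)
    grad = Phi.T @ (y - t)                 # gradient of the cross-entropy error
    H = Phi.T @ R @ Phi + 1e-6 * np.eye(Phi.shape[1])   # Hessian (plus a tiny ridge)
    w = w - np.linalg.solve(H, grad)       # Newton-Raphson / IRLS update

print(w)   # learned parameters; sigma(w^T phi) gives p(C1 | x)
```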
Multiclass logistic regression

• Recall two-class logistic regression

𝑝(𝐶_1|𝛟) = 𝑦(𝛟) = 𝜎(𝐰^T 𝛟) and 𝑝(𝐶_2|𝛟) = 1 − 𝑝(𝐶_1|𝛟),

where 𝜎 is the logistic sigmoid function

• We deal with multiclass cases: 𝑝(𝐶_𝑘|𝛟) = 𝑦_𝑘(𝛟) = exp(𝑎_𝑘) / Σ_𝑗 exp(𝑎_𝑗),

where the activation 𝑎_𝑘 is given by 𝑎_𝑘 = 𝐰_𝑘^T 𝛟

62
Likelihood function of multiclass logistic regression

• Two-class case: Given training data {𝛟_𝑛, 𝑡_𝑛}, where 𝑡_𝑛 ∈ {0, 1}, for 𝑛 = 1, …, 𝑁, the likelihood function of two-class logistic regression is

𝑝(𝐭|𝐰) = Π_{𝑛=1}^𝑁 𝑦_𝑛^{𝑡_𝑛} (1 − 𝑦_𝑛)^{1−𝑡_𝑛},

where 𝐭 = (𝑡_1, …, 𝑡_𝑁)^T and 𝑦_𝑛 = 𝜎(𝐰^T 𝛟_𝑛)

• For a multiclass case, we use 1-of-K coding to represent each target label vector 𝐭_𝑛; letting 𝑦_{𝑛𝑘} = 𝑦_𝑘(𝛟_𝑛), we have the likelihood function

𝑝(𝐓|𝐰_1, …, 𝐰_𝐾) = Π_{𝑛=1}^𝑁 Π_{𝑘=1}^𝐾 𝑦_{𝑛𝑘}^{𝑡_{𝑛𝑘}},

where 𝐓 is the 𝑁 × 𝐾 matrix of target label vectors with elements 𝑡_{𝑛𝑘}

63
Newton-Raphson iterative optimization

• Negative log likelihood function is 𝐸(𝐰_1, …, 𝐰_𝐾) = −ln 𝑝(𝐓|𝐰_1, …, 𝐰_𝐾) = −Σ_{𝑛=1}^𝑁 Σ_{𝑘=1}^𝐾 𝑡_{𝑛𝑘} ln 𝑦_{𝑛𝑘}

• When using Newton-Raphson iterative optimization, we need the following gradient and Hessian:

∇_{𝐰_𝑗} 𝐸 = Σ_{𝑛=1}^𝑁 (𝑦_{𝑛𝑗} − 𝑡_{𝑛𝑗}) 𝛟_𝑛 and ∇_{𝐰_𝑘} ∇_{𝐰_𝑗} 𝐸 = Σ_{𝑛=1}^𝑁 𝑦_{𝑛𝑘}(𝐼_{𝑘𝑗} − 𝑦_{𝑛𝑗}) 𝛟_𝑛 𝛟_𝑛^T

64
Background

• Variable dependence: 𝐰 → 𝑎 → 𝑦 → 𝐸

• According to 𝑎_{𝑛𝑗} = 𝐰_𝑗^T 𝛟_𝑛,

we have ∂𝑎_{𝑛𝑗}/∂𝐰_𝑗 = 𝛟_𝑛

65
Background

• According to 𝑦_{𝑛𝑘} = exp(𝑎_{𝑛𝑘}) / Σ_𝑗 exp(𝑎_{𝑛𝑗}),

we have ∂𝑦_{𝑛𝑘}/∂𝑎_{𝑛𝑗} = 𝑦_{𝑛𝑘}(𝐼_{𝑘𝑗} − 𝑦_{𝑛𝑗}),
where 𝐼_{𝑘𝑗} are the elements of the identity matrix

• Proof: apply the quotient rule to 𝑦_{𝑛𝑘} and consider the cases 𝑘 = 𝑗 and 𝑘 ≠ 𝑗

66
Background

• According to 𝐸 = −Σ_{𝑛=1}^𝑁 Σ_{𝑘=1}^𝐾 𝑡_{𝑛𝑘} ln 𝑦_{𝑛𝑘},

we have ∂𝐸/∂𝑦_{𝑛𝑘} = −𝑡_{𝑛𝑘}/𝑦_{𝑛𝑘}

• Given ∂𝑦_{𝑛𝑘}/∂𝑎_{𝑛𝑗} and ∂𝑎_{𝑛𝑗}/∂𝐰_𝑗, we can compute ∇_{𝐰_𝑗} 𝐸 by the chain rule

67
Newton-Raphson iterative optimization
𝐰→𝑎→𝑦→𝐸

• Gradient: ∇_{𝐰_𝑗} 𝐸 = Σ_{𝑛=1}^𝑁 (𝑦_{𝑛𝑗} − 𝑡_{𝑛𝑗}) 𝛟_𝑛

• Hessian: ∇_{𝐰_𝑘} ∇_{𝐰_𝑗} 𝐸 = Σ_{𝑛=1}^𝑁 𝑦_{𝑛𝑘}(𝐼_{𝑘𝑗} − 𝑦_{𝑛𝑗}) 𝛟_𝑛 𝛟_𝑛^T

• With gradient and Hessian, Newton-Raphson method works

68
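A minimal sketch of multiclass logistic regression on made-up 3-class data; for brevity it uses batch gradient descent with the gradient above rather than the full Newton-Raphson update:

```python
import numpy as np

rng = np.random.default_rng(0)
means = [[0, 0], [3, 0], [0, 3]]
Phi = np.hstack([np.ones((90, 1)),
                 np.vstack([rng.normal(loc=m, scale=0.8, size=(30, 2)) for m in means])])
labels = np.repeat([0, 1, 2], 30)
T = np.eye(3)[labels]                       # 1-of-K coded targets, shape (N, K)

W = np.zeros((Phi.shape[1], 3))             # one weight vector w_k per column
lr = 0.1
for _ in range(200):
    A = Phi @ W                             # activations a_nk = w_k^T phi_n
    A -= A.max(axis=1, keepdims=True)       # subtract row max for numerical stability
    Y = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)   # softmax outputs y_nk
    grad = Phi.T @ (Y - T)                  # gradients for all classes at once
    W -= lr * grad / len(Phi)

print(np.mean(np.argmax(Phi @ W, axis=1) == labels))   # training accuracy
```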
Discriminative approach summary

• Posterior probability and logistic regression: 𝑝(𝐶_1|𝛟) = 𝑦(𝛟) = 𝜎(𝐰^T 𝛟) and 𝑝(𝐶_2|𝛟) = 1 − 𝑝(𝐶_1|𝛟)
• The negative log likelihood (cross-entropy error): 𝐸(𝐰) = −Σ_{𝑛=1}^𝑁 [𝑡_𝑛 ln 𝑦_𝑛 + (1 − 𝑡_𝑛) ln(1 − 𝑦_𝑛)]

• Newton-Raphson method for iterative optimization: 𝐰^{new} = 𝐰^{old} − (𝚽^T 𝐑 𝚽)^{−1} 𝚽^T(𝐲 − 𝐭)

• It can be extended to multiclass classification

69
Summary

• Linear discriminant
➢ Two-class discriminant
➢ K-class discriminant
➢ Fisher’s linear discriminant
• Probabilistic generative model
➢ Class-conditional probability and class prior probability
➢ ML solution
• Probabilistic discriminative model: Logistic regression
➢ Posterior probability
➢ Newton-Raphson iterative optimization

70
References

• Sections 4.1, 4.2, and 4.3 of the PRML textbook (C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006)

71
Thank You for Your Attention!

Yen-Yu Lin (林彥宇)


Email: [email protected]
URL: https://www.cs.nctu.edu.tw/members/detail/lin

72
