
Introduction to Machine Learning


Linear Models for Classification
Veena Thenkanidiyoor
National Institute of Technology
Goa

Linear Method for Classification


• The boundary that separates the regions of the classes is linear
• The separating surface is linear, i.e., a hyperplane
• A hyperplane that best fits the region of separation between
the classes
• Discriminant function: a function that indicates the boundary
between the classes
– Here the discriminant function is linear

[Figure: two 2-D scatter plots (axes x1, x2) of class regions separated by a linear boundary]

• Discriminant function in 2-dimensional space, for x_n = [x_{n1}, x_{n2}]^T:
f(x_n, w_1, w_2, w_0) = w_1 x_{n1} + w_2 x_{n2} + w_0

Linear Method for Classification


• The boundary that separates the regions of the classes is linear
• The separating surface is linear, i.e., a hyperplane
• A hyperplane that best fits the region of separation between
the classes
• Discriminant function: a function that indicates the boundary
between the classes
– Here the discriminant function is linear

[Figure: two 2-D scatter plots (axes x1, x2) of class regions separated by a linear boundary]

• Discriminant function in 2-dimensional space, for x_n = [x_{n1}, x_{n2}]^T, the boundary is:
f(x_n, w_1, w_2, w_0) = w_1 x_{n1} + w_2 x_{n2} + w_0 = 0

Linear Method for Classification


• The boundary that separates the regions of the classes is linear
• The separating surface is linear, i.e., a hyperplane
• A hyperplane that best fits the region of separation between
the classes
• Discriminant function: a function that indicates the boundary
between the classes
– Here the discriminant function is linear

[Figure: two 2-D scatter plots (axes x1, x2) of class regions separated by a linear boundary]

• Discriminant function in 2-dimensional space, for x_n = [x_{n1}, x_{n2}]^T, the boundary can be rewritten as:
x_{n2} = -(w_1/w_2) x_{n1} - w_0/w_2 = m x_{n1} + c

Linear Method for Classification


• The boundary that separates the regions of the classes is linear
• The separating surface is linear, i.e., a hyperplane
• A hyperplane that best fits the region of separation between
the classes
• Discriminant function: a function that indicates the boundary
between the classes
– Here the discriminant function is linear

[Figure: two 2-D scatter plots (axes x1, x2) of class regions separated by a linear boundary]

• Discriminant function in d-dimensional space, for x_n = [x_{n1}, x_{n2}, ..., x_{nd}]^T:
f(x_n, w) = w^T x_n + w_0 = \sum_{i=0}^{d} w_i x_i

Two Classes of Approaches for Linear Classification

1. Modeling a discriminant function:
– For each class, a linear discriminant function fi(x, wi) is
defined
– Let C1, C2, …, Ci, …, CM be the M classes
– Let fi(x, wi) be the linear discriminant function for the ith
class

Class label for x = arg max_i fi(x, wi),  i = 1, 2, ..., M

– The discriminant function of each class is defined independently of the
other classes
– Linear regression can be used to learn a linear
discriminant function
• Do the linear regression by treating the dependent variable
as an indicator variable (categorical variable)
– Logistic regression (probabilistic discriminative model)
– Fisher linear discriminant analysis

Two Classes of Approaches for Linear Classification

2. Directly learn a discriminant function (hyperplane):
– Classic method: a discriminant function between the
classes is learnt

[Figure: two 2-D scatter plots showing a separating hyperplane learnt directly between the classes]

– Perceptron (a linear discriminant function is learnt)
– Support vector machine (SVM) (a linear discriminant
function is learnt)
– Neural networks (when the discriminant function is
nonlinear)

Method of Least Squares: Classification Using Linear Regression

Linear Regression
• A linear approach to model the relationship between a
scalar response y (or dependent variable) and one
or more predictor variables, x or x (or independent
variables)
• The response is modeled as a linear function of the
input (one or more independent variables)
• The optimal coefficient vector w is given by

\hat{w} = (X^T X)^{-1} X^T y

[Figure: fitted regression line for response y versus input x]
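A minimal NumPy sketch of this closed-form fit (the data values and names below are assumptions for illustration; a pseudo-inverse is used instead of an explicit inverse for numerical robustness):

import numpy as np

# toy data: N examples with a leading column of 1s so that w[0] plays the role of w0
X = np.array([[1.0, 150.0], [1.0, 160.0], [1.0, 170.0], [1.0, 180.0]])
y = np.array([45.0, 55.0, 65.0, 80.0])

# w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.pinv(X.T @ X) @ X.T @ y
print(w_hat)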

Classification Using Linear Regression


• Given training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ R^M

– x_n is the input vector (d independent variables)
– There are M classes, represented by M indicator
variables
– y_n is the response vector (dependent variables), an M-
dimensional binary vector, i.e., exactly one of the M values is 1
• Illustration: Iris (Flower) Data – 3 classes

                     X                                        Y
Sepal_Length  Sepal_Width  Petal_Length  Petal_Width    Class1  Class2  Class3
5.1           3.5          1.4           0.2            1       0       0
4.9           3.0          1.4           0.2            1       0       0
7.0           3.2          4.7           1.4            0       1       0
6.4           3.2          4.5           1.5            0       1       0
6.3           3.3          6.0           2.5            0       0       1
5.8           2.7          5.1           1.9            0       0       1

Classification Using Linear Regression


• Given training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ R^M

– x_n is the input vector (d independent variables)
– There are M classes, represented by M indicator
variables
– y_n is the response vector (dependent variables), an M-
dimensional binary vector, i.e., exactly one of the M values is 1
– For N examples, X is the data matrix of size N x (d+1) and Y
is the response matrix of size N x M
• Linear regression on the response vectors: \hat{W} = (X^T X)^{-1} X^T Y
– \hat{W} is of size (d+1) x M
\hat{W} = [\hat{w}_1, \hat{w}_2, ..., \hat{w}_M]
– Each column of \hat{W} is the (d+1) coefficients corresponding
to a class

Classification Using Linear Regression


• For any test example x, the discriminant value for
class i is:
f_i(x, \hat{w}_i) = \hat{w}_i^T x = \sum_{j=0}^{d} \hat{w}_{ij} x_j

Class label for x = arg max_i f_i(x, \hat{w}_i),  i = 1, 2, ..., M
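A short sketch of this least-squares classifier on an indicator (one-hot) response matrix; the function and variable names are assumptions, not part of the slides:

import numpy as np

def fit_indicator_regression(X, labels, num_classes):
    # X: (N, d) inputs; labels: (N,) integer class ids in {0, ..., M-1}
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a column of 1s -> N x (d+1)
    Y = np.eye(num_classes)[labels]                 # indicator response matrix, N x M
    W = np.linalg.pinv(X1.T @ X1) @ X1.T @ Y        # (d+1) x M, one column per class
    return W

def predict(W, X):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X1 @ W, axis=1)                # class label = arg max_i f_i(x, w_i)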

Illustration of Classification using Linear Regression

• Number of training examples (N) = 20
• Dimension of a training example = 2
• Number of classes: 2
• Each output variable is a 2-dimensional
binary vector
• Classes: Child (C1), Adult (C2)

[Figure: training examples plotted as weight in kg versus height in cm for the classes Child and Adult]

Illustration of Classification using Linear Regression

• Training: \hat{W} = (X^T X)^{-1} X^T Y
• X is the data matrix of size 20 x 3
• Y is the response matrix of size 20 x 2

\hat{W} = [\hat{w}_1  \hat{w}_2] = [  2.8897   -1.8897
                                     -0.0222    0.0222
                                      0.0122   -0.0122 ]

[Figure: the two discriminant functions f_1(x, \hat{w}_1) and f_2(x, \hat{w}_2) plotted over the weight-versus-height data]

Illustration of Classification using Linear Regression

Test Example:

[Figure: test example marked on the weight-versus-height plot together with f_1(x, \hat{w}_1) and f_2(x, \hat{w}_2)]

f_1(x, \hat{w}_1) = 0.1842    f_2(x, \hat{w}_2) = 0.8158

• Class: Adult (C2)

Illustration of Classification using Linear Regression

Test Example:

[Figure: test example marked on the weight-versus-height plot together with f_1(x, \hat{w}_1) and f_2(x, \hat{w}_2)]

f_1(x, \hat{w}_1) = 0.4639    f_2(x, \hat{w}_2) = 0.5361

• Class: Adult (C2)

Classification Using Linear Regression

Classification Using Linear Regression


• The dependent variable is categorical (an indicator variable)
• There are multiple outputs (multiple dependent
variables)
• If the input x belongs to Ci, then yi is 1
• The expected output yi for such an x should be close to 1
• When linear regression is used for classification, we are
trying to predict this expected output value
• In other words, we are trying to predict the probability of the
class:
E[y_i | x] = P(x ∈ C_i | x)
• This is the ideal situation
• Linear regression gives the hope of getting this
• The notion of predicting the probability of a class is captured
more naturally by logistic regression



Fisher Discriminant Analysis



Fisher Discriminant Analysis (FDA)


• Aim: find the direction of projection along which the
separability of the classes after projection is maximum
• Training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ {+1, -1}
– d: dimension of an input example
– 2-class data:
• N+: number of training examples in the +ve class
• N-: number of training examples in the -ve class
• Task: find a single direction of projection q such that
the separability of the projected data is maximum

a_n = q^T x_n,  n = 1, 2, ..., N

• Separability of the projected data is defined using
statistical information from the projected data

Classification on Representation
from FDA
• Project the x_n (both training and test data) onto the
direction q to get a 1-dimensional representation
a_n = q^T x_n

[Block diagram: data x_n → feature extraction → FDA → 1-dimensional representation a_n → classifier (binary classifier)]

• Different ways of performing classification:
– Threshold on the projected data
– Decision based on means
• Euclidean distance between a projected test example and the
means of the two classes
• Mahalanobis distance between a projected test example and the
data of the two classes
– Build a Bayes' classifier in the 1-dimensional space
• Unimodal Gaussian
• Gaussian mixture model
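A small NumPy sketch of this pipeline, assuming the standard two-class Fisher solution q ∝ S_W^{-1}(m_+ - m_-) (the slides state the criterion only informally) and the decision-based-on-means rule; all names are illustrative:

import numpy as np

def fisher_direction(X_pos, X_neg):
    # q proportional to Sw^{-1} (m+ - m-): the usual two-class Fisher direction
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    Sw = np.cov(X_pos, rowvar=False) * (len(X_pos) - 1) \
       + np.cov(X_neg, rowvar=False) * (len(X_neg) - 1)   # within-class scatter
    q = np.linalg.solve(Sw, m_pos - m_neg)
    return q / np.linalg.norm(q)

def classify_by_means(q, X_pos, X_neg, x):
    # decision based on means: assign x to the class whose projected mean is nearer
    a = q @ x
    a_pos, a_neg = q @ X_pos.mean(axis=0), q @ X_neg.mean(axis=0)
    return +1 if abs(a - a_pos) <= abs(a - a_neg) else -1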

Illustration: FDA and Bayes' Classifier on 1-dimensional Representation from FDA

[Figures: 2-D training data projected onto the FDA direction, and a Bayes' classifier built on the resulting 1-dimensional representation]

Discriminative Learning based Methods for Classification

Discriminative Learning based Methods

• Learn the surface that best separates the regions of the
classes
• Learning a discriminant function: learn a function that
maps input data to output
• Linear discriminant function: a function that indicates
the boundary between the classes, where the boundary is linear

[Figure: 2-D data of two classes separated by a linear boundary]

Linear Discriminant Function


• Regions of two classes are separable by a linear
surface (line, plane or hyperplane)
• 2-dimensional space: the decision boundary is a
line specified by
w_1 x_1 + w_2 x_2 + w_0 = 0, i.e.,
x_2 = -(w_1/w_2) x_1 - w_0/w_2
• d-dimensional space: the decision surface is a
hyperplane specified by
w_d x_d + ... + w_2 x_2 + w_1 x_1 + w_0 = \sum_{i=0}^{d} w_i x_i = w^T \hat{x} = 0
where w = [w_0, w_1, ..., w_d]^T and \hat{x} = [1, x_1, ..., x_d]^T

[Figure: a 2-D example of the linear decision boundary]

Discriminant Function of a Hyperplane


• The discriminant function of a hyperplane:
g(x) = \sum_{i=1}^{d} w_i x_i + w_0 = w^T x + w_0
• For any point that lies on the hyperplane:
g(x) = \sum_{i=1}^{d} w_i x_i + w_0 = w^T x + w_0 = 0
• Example:
– Consider a straight line with equation x2 + x1 - 1 = 0
– The discriminant function of the straight line is g(x) = x2 + x1 - 1
– For the points (1,0) and (0,1) that lie on
this straight line, g(x) = 0
– For the point (0,0), g(x) = -1, i.e., the
value of g(x) is negative
– For the point (1,1), g(x) = +1, i.e., the
value of g(x) is positive

[Figure: the line x2 + x1 - 1 = 0 in the (x1, x2) plane with the points (0,0), (1,0), (0,1) and (1,1) marked]

Discriminant Function of a Hyperplane


[Figure: the line g(x) = x1 + x2 - 1 = 0 with its positive side (containing (1,1)) and negative side (containing (0,0))]

• A hyperplane has a positive side and a negative side


– For any point on the positive side, the value of
discriminant function, g(x), is positive
– For any point on the negative side, the value of
discriminant function, g(x), is negative

32

Perceptron Learning
• Given training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ {+1, -1}
• Goal: to estimate the parameter vector w = [w_0, w_1, ..., w_d]^T
– such that the linear function (hyperplane) is placed between
the training data of the two classes so that the training error
(classification error) is minimum

[Figure: two valid placements of the hyperplane; either w^T x_n + w_0 > 0 for class C2 and w^T x_n + w_0 < 0 for class C1, or the reverse]

Perceptron Learning
• Given training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ {+1, -1}
1. Initialize w with random values
2. Choose a training example x_n
3. Update w if x_n is misclassified:
w ← w + η x_n,  if w^T x_n + w_0 ≤ 0 and x_n belongs to the class with label +1
w ← w - η x_n,  if w^T x_n + w_0 > 0 and x_n belongs to the class with label -1
– Here 0 < η < 1 is a positive learning-rate parameter
– Increment the misclassification count by 1
4. Repeat steps 2 and 3 till all the training examples
have been presented
5. Repeat steps 2 to 4, resetting the misclassification count
to 0, till the convergence criterion is satisfied
• Convergence criterion:
– Total misclassification count is 0, OR
– Total misclassification count is minimum (falls below a
threshold)

Perceptron Learning
• Training:

[Figure: hyperplane learnt by the perceptron separating classes C1 and C2 in the (x1, x2) plane]

• Test phase:
• Classification of a test pattern x using the weights
w obtained by training the model:
– If w^T x + w_0 > 0 then x is assigned to the class with label
+1 (C2)
– If w^T x + w_0 ≤ 0 then x is assigned to the class with label
-1 (C1)
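A compact sketch of this training and test procedure (the names, the random initialization scale and the epoch cap are assumptions):

import numpy as np

def perceptron_train(X, y, eta=0.5, max_epochs=100):
    # X: (N, d) inputs, y: (N,) labels in {+1, -1}; eta is the learning rate (0 < eta < 1)
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])   # augment with 1 so w[0] plays the role of w0
    w = np.random.randn(Xa.shape[1]) * 0.01         # step 1: random initialization
    for _ in range(max_epochs):                     # step 5: repeated passes over the data
        errors = 0
        for xn, yn in zip(Xa, y):                   # steps 2-4: present every example
            if yn == +1 and w @ xn <= 0:            # misclassified positive example
                w += eta * xn
                errors += 1
            elif yn == -1 and w @ xn > 0:           # misclassified negative example
                w -= eta * xn
                errors += 1
        if errors == 0:                             # convergence: no misclassifications
            break
    return w

def perceptron_predict(w, X):
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.where(Xa @ w > 0, +1, -1)             # test rule from the slides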

Illustration of Perceptron Learning

[Figure: decision boundary obtained by perceptron learning on 2-D training data]

Optimal Separating Hyperplane

[Figure: 2-D data of two classes with candidate separating hyperplanes]

Hyperplane and Margin

[Figure: a separating hyperplane and its margin to the nearest training examples]

Hyperplane
• Equation of straight line: y = mx + c
• Consider the following equation: w2x2 + w1x1 + w0 = 0
• The above equation can be rewritten as follows:
x_2 = -(w_1/w_2) x_1 - w_0/w_2
• Let y = x_2 and x = x_1. Then the above is the equation of a
straight line in a 2-dimensional space with
m = -w_1/w_2 and c = -w_0/w_2
• The equation of a plane in a 3-dimensional space:
w_3 x_3 + w_2 x_2 + w_1 x_1 + w_0 = 0
• The equation of a hyperplane in a d-dimensional
space:
w_d x_d + ... + w_2 x_2 + w_1 x_1 + w_0 = \sum_{i=0}^{d} w_i x_i = w^T x = 0
where w = [w_0, w_1, ..., w_d]^T and x = [1, x_1, ..., x_d]^T

Distance of a Point to Hyperplane


• The discriminant function of a hyperplane:
g(x) = \sum_{i=1}^{d} w_i x_i + w_0 = w^T x + w_0
• For any point x, the (signed) distance to the hyperplane is g(x) / ||w||
• Margin of a hyperplane:
– the distance of the nearest training example to the
hyperplane

[Figure: three points x_1, x_2, x_3 at distances g(x_1)/||w||, g(x_2)/||w||, g(x_3)/||w|| from the hyperplane]
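A tiny numeric check of these formulas, reusing the line x_1 + x_2 - 1 = 0 from the earlier example (the specific test points are assumptions):

import numpy as np

# hyperplane x1 + x2 - 1 = 0: w = [1, 1], w0 = -1
w, w0 = np.array([1.0, 1.0]), -1.0

def g(x):                                   # discriminant function g(x) = w^T x + w0
    return w @ x + w0

for x in [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([1.0, 0.0])]:
    dist = g(x) / np.linalg.norm(w)         # signed distance of x to the hyperplane
    side = "positive" if g(x) > 0 else ("negative" if g(x) < 0 else "on the hyperplane")
    print(x, g(x), round(dist, 3), side)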

Support Vector Machines for Pattern Classification

Maximum Margin Hyperplane

[Figure: the maximum margin hyperplane separating 2-D data of two classes]

Maximum Margin Hyperplane for Linearly Separable Patterns

[Figure: the maximum margin hyperplane w^T x + w_0 = 0 with margin 1/||w||; the margin boundaries are w^T x + w_0 = +1 and w^T x + w_0 = -1]

Support Vector Machine (SVM) for Linearly Separable Patterns

• The optimal hyperplane specified by (w, w_0) must satisfy
the constraint:
y_n(w^T x_n + w_0) ≥ 1  for n = 1, 2, ..., N
– Here y_n ∈ {+1, -1} is the desired output (class label) of
training example x_n and N is the number of training
examples
• Optimal value of the margin of separation: 2 / ||w||
• Maximizing the margin of separation is equivalent to
minimizing the Euclidean norm of the weight vector w
• Learning problem (Lagrangian primal problem):
– Given the training examples {(x_n, y_n)}, n = 1, 2, ..., N, find
the values of w and w_0 that
minimize J(w, w_0) = (1/2) ||w||^2
such that y_n(w^T x_n + w_0) ≥ 1,  n = 1, 2, ..., N

Learning in SVM for Linearly Separable Patterns

• The Lagrangian function is:
J(w, w_0, α) = (1/2) w^T w - \sum_{n=1}^{N} α_n [ y_n(w^T x_n + w_0) - 1 ]
– where the α_n are the Lagrange coefficients
• Conditions for optimality:
∂J(w, w_0, α) / ∂w = 0
∂J(w, w_0, α) / ∂w_0 = 0
• Application of the optimality conditions gives:
w = \sum_{n=1}^{N} α_n y_n x_n
\sum_{n=1}^{N} α_n y_n = 0

Learning in SVM for Linearly Separable Patterns (contd.)

• Lagrangian dual problem:
– Find the Lagrange multipliers {α_n} that maximize the
objective function J(α)
maximize J(α) = \sum_{n=1}^{N} α_n - (1/2) \sum_{m=1}^{N} \sum_{n=1}^{N} α_m α_n y_m y_n x_m^T x_n
subject to the constraints
\sum_{n=1}^{N} y_n α_n = 0
α_n ≥ 0  for n = 1, 2, ..., N
• Only some of the optimum Lagrange multipliers take non-zero
values
• The data points associated with the non-zero optimum Lagrange
multipliers are called support vectors

Learning in SVM for Linearly Separable Patterns (contd.)

[Figure: maximum margin hyperplane w^T x + w_0 = 0 with margin 1/||w||; the support vectors lie on the margin boundaries w^T x + w_0 = +1 and w^T x + w_0 = -1]

SVM for Linearly Separable Patterns


• For the optimal Lagrange multipliers, the optimal weight
vector w is given by
\hat{w} = \sum_{n=1}^{N_s} \hat{α}_n y_n x_n
– where N_s is the number of support vectors
• The optimal hyperplane is defined in terms of the support
vectors
• For any test example x, the discriminant function of the
optimal hyperplane is given by
g(x) = \hat{w}^T x + \hat{w}_0 = \sum_{n=1}^{N_s} \hat{α}_n y_n x^T x_n + \hat{w}_0
– If g(x) ≥ 0 then x is assigned to C1
– If g(x) < 0 then x is assigned to C2
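A minimal scikit-learn sketch of a (near) hard-margin linear SVM; the toy data are assumptions, and a large C is used to approximate the hard-margin case:

import numpy as np
from sklearn.svm import SVC

# toy linearly separable data (assumed values, for illustration only)
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin

print(clf.support_vectors_)        # the x_n with non-zero alpha_n
print(clf.coef_, clf.intercept_)   # w_hat and w_hat_0 of the optimal hyperplane
print(clf.decision_function(X))    # g(x) = w^T x + w_0 for each example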

Illustration: Perceptron vs Linear SVM

[Figure: decision boundaries learnt by the perceptron and by the linear SVM on the same 2-D data]

Maximum Margin Hyperplane for Linearly Nonseparable Patterns

[Figure: maximum margin hyperplane w^T x + w_0 = 0 with margin 1/||w|| and margin boundaries w^T x + w_0 = ±1; slack values shown are ξ = 0 for points outside the margin, 0 < ξ < 1 for points inside the margin, and ξ > 1 for misclassified points]

SVM for Linearly Nonseparable Patterns


• Some data points may fall inside the region of
separation or on the wrong side of separation
• ξ_n is a measure of the deviation of x_n from the ideal
condition of pattern separability
• Optimal hyperplane specified by (w, w0) must satisfy
the constraint:
yn(wTxn+ w0 ) ≥ 1- ξn and ξn ≥ 0 for n = 1, 2, …, N
– Here yn∈{+1, -1} is the desired output (class label) of
training example xn and N is the number of training
examples

51

Learning in SVM for Linearly Nonseparable Patterns

• Learning problem (Lagrangian primal problem):
– Given the training examples {(x_n, y_n)}, n = 1, 2, ..., N, find
the values of w and w_0 that
minimize J(w, w_0, ξ) = (1/2) ||w||^2 + C \sum_{n=1}^{N} ξ_n
subject to the constraints
y_n(w^T x_n + w_0) ≥ 1 - ξ_n,  n = 1, ..., N
ξ_n ≥ 0
– where C determines the trade-off between the margin
and the error on the training data
• The Lagrangian function is:
J(w, ξ, w_0, α) = (1/2) w^T w + C \sum_{n=1}^{N} ξ_n - \sum_{n=1}^{N} α_n [ y_n(w^T x_n + w_0) - 1 + ξ_n ] - \sum_{n=1}^{N} γ_n ξ_n
– where α_n and γ_n are the Lagrange coefficients

Learning in SVM for Linearly Nonseparable Patterns (contd.)

• Conditions for optimality:
∂J(w, ξ, w_0, α) / ∂w = 0
∂J(w, ξ, w_0, α) / ∂ξ_n = 0
∂J(w, ξ, w_0, α) / ∂w_0 = 0
• Application of the optimality conditions gives:
w = \sum_{n=1}^{N} α_n y_n x_n
\sum_{n=1}^{N} α_n y_n = 0
0 ≤ α_n ≤ C

Learning in SVM for Linearly Nonseparable Patterns (contd.)

• Lagrangian dual problem:
– Find the Lagrange multipliers {α_n} that maximize the
objective function J(α)
maximize J(α) = \sum_{n=1}^{N} α_n - (1/2) \sum_{m=1}^{N} \sum_{n=1}^{N} α_m α_n y_m y_n x_m^T x_n
subject to the constraints
\sum_{n=1}^{N} y_n α_n = 0
0 ≤ α_n ≤ C  for n = 1, 2, ..., N
• The data points associated with the non-zero optimum Lagrange
multipliers are called support vectors
• For the optimal Lagrange multipliers, the optimal weight vector w is
given by
\hat{w} = \sum_{n=1}^{N_s} \hat{α}_n y_n x_n
– where N_s is the number of support vectors
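A small scikit-learn sketch of the role of C (the dataset and the C values are assumptions; smaller C tolerates more margin violations, i.e., larger slacks ξ_n):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# overlapping 2-class data so that some slack variables are non-zero
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_)} support vectors, "
          f"training accuracy={clf.score(X, y):.2f}")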

Maximum Margin Hyperplane for Linearly Nonseparable Patterns

[Figure: maximum margin hyperplane w^T x + w_0 = 0 with margin 1/||w|| and margin boundaries w^T x + w_0 = ±1; the support vectors and the slack values ξ of points inside the margin or misclassified are marked]

Pattern Classification Problems

[Figure: three 2-D datasets illustrating linearly separable classes, nonlinearly separable classes, and overlapping classes]

Support Vector Machines and Kernel Methods

Key Aspects of Kernel Methods


• Kernel methods involve
– Nonlinear transformation of data to a higher dimensional
feature space induced by a Mercer kernel
– Detection of optimal linear solutions in the kernel feature
space
• Transformation to a higher dimensional space is expected to
be helpful in conversion of nonlinear relations into linear
relations (Cover’s theorem)
– Nonlinearly separable patterns to linearly separable
patterns
• Pattern analysis methods are implemented in such a way
that the kernel feature space representation is not
explicitly required.
– They involve computation of pair-wise inner-products only
• The pair-wise inner-products are computed efficiently
directly from the original representation of data using a
kernel function (Kernel trick)

58

Illustration of Transformation

[Figure: nonlinearly separable 2-D data become linearly separable after the nonlinear mapping]

x = [x_1, x_2]^T  →  Φ(x) = [x_1^2, x_2^2, √2 x_1 x_2]^T

Maximum Margin Hyperplane in Transformed Feature Space

[Figure: maximum margin hyperplane w^T Φ(x) + w_0 = 0 in the kernel feature space, with margin 1/||w||, margin boundaries w^T Φ(x) + w_0 = ±1 and slack variables ξ for violating points]

Learning in SVM for Nonlinearly Separable Patterns

• Learning problem (Lagrangian primal problem):
– Given the training examples {(x_n, y_n)}, n = 1, 2, ..., N, find
the values of w and w_0 that
minimize J(w, w_0, ξ) = (1/2) ||w||^2 + C \sum_{n=1}^{N} ξ_n
subject to the constraints
y_n(w^T Φ(x_n) + w_0) ≥ 1 - ξ_n,  n = 1, ..., N
ξ_n ≥ 0
– where C determines the trade-off between the margin
and the error on the training data
• The Lagrangian function is:
J(w, ξ, w_0, α) = (1/2) w^T w + C \sum_{n=1}^{N} ξ_n - \sum_{n=1}^{N} α_n [ y_n(w^T Φ(x_n) + w_0) - 1 + ξ_n ] - \sum_{n=1}^{N} γ_n ξ_n
– where α_n and γ_n are the Lagrange coefficients

Learning in SVM for Nonlinearly Separable Patterns (contd.)

• Conditions for optimality:
∂J(w, ξ, w_0, α) / ∂w = 0
∂J(w, ξ, w_0, α) / ∂ξ_n = 0
∂J(w, ξ, w_0, α) / ∂w_0 = 0
• Application of the optimality conditions gives:
w = \sum_{n=1}^{N} α_n y_n Φ(x_n)
\sum_{n=1}^{N} α_n y_n = 0
0 ≤ α_n ≤ C

Learning in SVM for Nonlinearly Separable Patterns (contd.)

• Lagrangian dual problem:
– Find the Lagrange multipliers {α_n} that maximize the
objective function J(α)
maximize J(α) = \sum_{n=1}^{N} α_n - (1/2) \sum_{m=1}^{N} \sum_{n=1}^{N} α_m α_n y_m y_n Φ(x_m)^T Φ(x_n)
subject to the constraints
\sum_{n=1}^{N} y_n α_n = 0
0 ≤ α_n ≤ C  for n = 1, 2, ..., N
• The data points associated with the non-zero optimum Lagrange
multipliers are called support vectors
• For the optimum Lagrange multipliers, the optimum weight vector
w is given by
\hat{w} = \sum_{n=1}^{N_s} \hat{α}_n y_n Φ(x_n)
– where N_s is the number of support vectors

SVM for Nonlinearly Separable Patterns


• For any test example x, the discriminant function of the
optimum hyperplane is given by
g(x) = \hat{w}^T Φ(x) + \hat{w}_0 = \sum_{n=1}^{N_s} \hat{α}_n y_n Φ(x)^T Φ(x_n) + \hat{w}_0
• The SVM is implemented in such a way that the kernel feature
space representation Φ(x) of any x is not explicitly
required
• It involves computation of pair-wise inner-products only
• The pair-wise inner-products are computed efficiently
directly from the original representation of the data using a
kernel function (kernel trick)
• K(x, x_n) = Φ(x)^T Φ(x_n) is called an inner-product kernel
• Discriminant function of the optimal decision surface:
g(x) = \hat{w}^T Φ(x) + \hat{w}_0 = \sum_{n=1}^{N_s} α_n y_n K(x, x_n) + \hat{w}_0

Kernel Functions
• Kernel function: K(xm, xn) = Φ(xm)T Φ(xn)
• Kernel functions must satisfy Mercer’s theorem
• Kernel gram matrix includes the values of the kernel
function for all pairs of examples in the training set:

K = [ K(x_1, x_1)  K(x_1, x_2)  ...  K(x_1, x_N)
      K(x_2, x_1)  K(x_2, x_2)  ...  K(x_2, x_N)
      ...
      K(x_N, x_1)  K(x_N, x_2)  ...  K(x_N, x_N) ]

• The kernel gram matrix must be symmetric and
positive semi-definite, i.e., the eigenvalues should be
non-negative,
– for convergence of the iterative method used for solving
the constrained optimization problem
65

Kernel Functions (contd.)


• The commonly used kernel functions are:
Linear kernel:      K(x_m, x_n) = x_m^T x_n
Polynomial kernel:  K(x_m, x_n) = (a x_m^T x_n + b)^p
Gaussian kernel:    K(x_m, x_n) = exp( -||x_m - x_n||^2 / σ )
– where the width σ in the Gaussian kernel, and a, b and the degree
of the polynomial p in the polynomial kernel, are the kernel
parameters
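A small sketch of these kernels and of the Gram matrix (function names are illustrative; the exact normalization of the Gaussian kernel, σ versus 2σ², varies between texts):

import numpy as np

def linear_kernel(xm, xn):
    return xm @ xn

def polynomial_kernel(xm, xn, a=1.0, b=1.0, p=2):
    return (a * (xm @ xn) + b) ** p

def gaussian_kernel(xm, xn, sigma=1.0):
    # width parameter sigma as named on the slide
    return np.exp(-np.sum((xm - xn) ** 2) / sigma)

def gram_matrix(X, kernel):
    # Gram matrix K[m, n] = kernel(x_m, x_n) for all pairs of training examples
    N = len(X)
    return np.array([[kernel(X[m], X[n]) for n in range(N)] for m in range(N)])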

Inner-product Kernels

• Kernel function: K(x_m, x_n) = Φ(x_m)^T Φ(x_n)
• Polynomial kernel: K(x_m, x_n) = (x_m^T x_n + 1)^2
• For 2-dimensional patterns, x_1 = [x_11, x_12]^T and x_2 = [x_21, x_22]^T:
Φ(x_1) = [1, √2 x_11, √2 x_12, x_11^2, x_12^2, √2 x_11 x_12]^T
Φ(x_2) = [1, √2 x_21, √2 x_22, x_21^2, x_22^2, √2 x_21 x_22]^T
(x_1^T x_2 + 1)^2 = 1 + 2 x_11 x_21 + 2 x_12 x_22 + x_11^2 x_21^2 + x_12^2 x_22^2 + 2 x_11 x_12 x_21 x_22 = Φ(x_1)^T Φ(x_2)
• The dimension of the feature vector Φ(x) for the
polynomial kernel of degree p and pattern
dimension d is given by
(d + p)! / (d! p!)
• For the Gaussian kernel, the dimension of the feature vectors
can be shown to be infinite
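A quick numeric check of this identity (the example vectors are assumptions):

import numpy as np

def phi(x):
    # explicit feature map for the degree-2 polynomial kernel (x^T z + 1)^2 in 2-D
    x1, x2 = x
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2, x1**2, x2**2, np.sqrt(2)*x1*x2])

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, -1.0])

k_direct = (x1 @ x2 + 1) ** 2      # kernel trick: only inner products in the original space
k_mapped = phi(x1) @ phi(x2)       # explicit transformation, then inner product
print(k_direct, k_mapped)          # both print 4.0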

Architecture of an SVM

[Figure: SVM architecture — the input example x is compared with each support vector x_1, ..., x_Ns through the kernel values K(x, x_1), ..., K(x, x_Ns); these are weighted by α*_n y_n and summed together with the bias w_0 to produce the output
g(x) = \sum_{n=1}^{N_s} α_n y_n K(x, x_n) + w_0
If g(x) ≥ 0 then x is assigned to class 1 (C1); if g(x) < 0 then x is assigned to class 2 (C2)]

Illustration of SVM with Gaussian Kernel

[Figure: nonlinear decision boundary produced by an SVM with a Gaussian kernel on 2-D data]

Multi-class Pattern Classification using SVMs

• Multi-class pattern classification for C classes is solved
using a combination of several binary classifiers and a
decision strategy

[Block diagram: input pattern x is fed to SVM1, SVM2, ..., SVML; their outputs g_1(x), ..., g_L(x) are combined by a decision strategy to produce the class]

• Approaches to multi-class pattern classification using SVMs:
– One-against-the-rest approach
– One-against-one approach
70

One-against-the-rest Approach
[Block diagram: input pattern x is fed to SVM1, SVM2, SVM3, SVM4; their outputs g_1(x), g_2(x), g_3(x), g_4(x) go to the decision logic, which outputs the class]
• An SVM is built for each class to form a boundary between
the region of the class and the regions of the other classes
• Let C=4 be the total number of classes. Then, number of
SVMs is L = 4
• A test pattern x is classified by using winner-takes-all
strategy

71

One-against-the-rest Approach: Example

[Figure panels: Data; Class1 vs Rest; Class2 vs Rest; Class3 vs Rest; Class4 vs Rest; Multi-class decision region]

One-against-one Approach

[Block diagram: input pattern x is fed to the pairwise SVMs SVM12, SVM13, SVM14, SVM23, SVM24, SVM34; their outputs g_12(x), ..., g_34(x) go to the decision strategy, which outputs the class]

• An SVM is built for every pair of classes to form a boundary
between their regions
• Let C = 4 be the total number of classes
• The number of pairwise SVMs is L = C(C-1)/2 = 6
• The max-wins strategy is used to determine the class of a
test pattern x
– In this strategy, a majority voting scheme is used

One-against-one Approach: Example

[Figure panels: Data; Class1 vs Class2; Class1 vs Class3; Class1 vs Class4; Class2 vs Class3; Class2 vs Class4; Class3 vs Class4; Multi-class decision region]

Tools For SVMs


Sl No  Name              Default approach for multiclass classification   Reference
1      SVMTorch          One-against-the-rest                              [1]
2      LibSVM            One-against-one                                   [2]
3      SVMlight          One-against-the-rest                              [3]
4      LibLinear         One-against-the-rest                              [4]
5      Scikit-learn SVM  One-against-one                                   [5]

[1] http://bengio.abracadoudou.com/SVMTorch.html
[2] http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[3] http://svmlight.joachims.org/
[4] https://www.csie.ntu.edu.tw/~cjlin/liblinear/
[5] http://scikit-learn.org/stable/modules/svm.html#svm
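A brief scikit-learn sketch of the two approaches on the 3-class Iris data used earlier in the slides (the explicit wrapper classes are one possible way of making the strategy visible):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)   # one-against-the-rest
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)    # one-against-one

print(len(ovr.estimators_), len(ovo.estimators_))   # 3 and C(C-1)/2 = 3 SVMs for C = 3 classes
print(ovr.predict(X[:2]), ovo.predict(X[:2]))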


Logistic Regression
(A probabilistic discriminative model)

Logistic Regression
• Requirement: the discriminant function fi(x, wi)
should give the probability of class Ci

E[y_i | x] = P(x ∈ C_i | x)

• Look for some kind of transformation of the probability
and fit that
• Logit transformation: log( P(x) / (1 - P(x)) )
• 2-class classification:
– Class label: 0 or 1
– P(x) is the probability that the output is 1 given the
input (probability of success)
– 1 - P(x) is the probability that the output is 0 given the
input (probability of failure)

Logistic Regression
• Logit function: log of the odds function
• Odds function: P(x) / (1 - P(x))
– probability of success divided by the probability of failure
• Fit a linear model to the logit function:
log( P(x) / (1 - P(x)) ) = w_0 + w_1 x_1 + ... + w_d x_d = w^T \hat{x}
where w = [w_0, w_1, ..., w_d]^T and \hat{x} = [1, x_1, ..., x_d]^T
– For a 1-dimensional (d = 1) input x:
log( P(x) / (1 - P(x)) ) = w_0 + w_1 x   ⇒   P(x) / (1 - P(x)) = e^{(w_0 + w_1 x)}

Logistic Regression
• Logit function: log of the odds function
• Odds function: P(x) / (1 - P(x))
– probability of success divided by the probability of failure
• Fit a linear model to the logit function:
log( P(x) / (1 - P(x)) ) = w^T \hat{x}
where w = [w_0, w_1, ..., w_d]^T and \hat{x} = [1, x_1, ..., x_d]^T
– For a 1-dimensional (d = 1) input x:
P(x) / (1 - P(x)) = e^{(w_0 + w_1 x)}
P(x) = e^{(w_0 + w_1 x)} / (1 + e^{(w_0 + w_1 x)}) = 1 / (1 + e^{-(w_0 + w_1 x)})

Logistic Regression
P(x) = 1 / (1 + e^{-(w_0 + w_1 x)})

• This function is a sigmoidal function, specifically called the
logistic function
• Logistic function:
P(x) = 1 / (1 + e^{-β x})

[Figure: logistic functions P(x) = 1 / (1 + e^{-β x}) plotted against x for β = 0.2, 0.3 and 0.6; larger β gives a steeper curve]
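A tiny sketch of this function for the three β values shown on the slide:

import numpy as np

def logistic(x, beta):
    # logistic (sigmoid) function P(x) = 1 / (1 + exp(-beta * x))
    return 1.0 / (1.0 + np.exp(-beta * x))

for beta in (0.2, 0.3, 0.6):
    print(beta, logistic(np.array([-10.0, 0.0, 10.0]), beta))  # steeper curve for larger beta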

Logistic Regression
• Logit function: log of the odds function
• Odds function: P(x) / (1 - P(x))
– probability of success divided by the probability of failure
• Fit a linear model to the logit function: log( P(x) / (1 - P(x)) ) = w^T \hat{x}
– For a d-dimensional input x = [x_1, x_2, ..., x_d]^T:
log( P(x) / (1 - P(x)) ) = w_0 + w_1 x_1 + ... + w_d x_d = w^T \hat{x}
where w = [w_0, w_1, ..., w_d]^T and \hat{x} = [1, x_1, ..., x_d]^T
P(x) / (1 - P(x)) = e^{(w^T \hat{x})}

Logistic Regression
• Logit function: log of the odds function
• Odds function: P(x) / (1 - P(x))
– probability of success divided by the probability of failure
• Fit a linear model to the logit function: log( P(x) / (1 - P(x)) ) = w^T \hat{x}
– For a d-dimensional input x = [x_1, x_2, ..., x_d]^T:
P(x) / (1 - P(x)) = e^{(w^T \hat{x})}
where w = [w_0, w_1, ..., w_d]^T and \hat{x} = [1, x_1, ..., x_d]^T
P(x) = e^{(w^T \hat{x})} / (1 + e^{(w^T \hat{x})}) = 1 / (1 + e^{-(w^T \hat{x})})

What about the classifier learning here?
It is still a linear classifier – the boundary is a
linear surface, i.e., a hyperplane

Logistic Regression
• Logit function: log of the odds function
• Odds function: P(x) / (1 - P(x))
– probability of success divided by the probability of failure
• Fit a linear model to the logit function: log( P(x) / (1 - P(x)) ) = w^T \hat{x}
– For a d-dimensional input x = [x_1, x_2, ..., x_d]^T:
P(x) / (1 - P(x)) = e^{(w^T \hat{x})}
where w = [w_0, w_1, ..., w_d]^T and \hat{x} = [1, x_1, ..., x_d]^T
P(x) = e^{(w^T \hat{x})} / (1 + e^{(w^T \hat{x})}) = 1 / (1 + e^{-(w^T \hat{x})})

For any test example x:
If P(x) ≥ 0.5 then x is assigned the label 1
If P(x) < 0.5 then x is assigned the label 0

Estimation of Parameter in
Logistic Regression
• The criterion used to estimate the parameters is different
from that of linear regression
• Optimize the likelihood of the data
• As the goal is to model the probability of the class, we
maximize the likelihood of the data
• Maximum likelihood (ML) method of parameter
estimation
• Given training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ {1, 0}
• Data of a class is represented by the parameter vector
w = [w_0, w_1, ..., w_d]^T (parameters of the linear function)
• Unknown: w
• Likelihood of x_n: P(x_n | w) = P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)}
– P(x_n) is the probability that x_n has label 1 and 1 - P(x_n) is the
probability that x_n has label 0

Estimation of Parameter in
Logistic Regression
• The criterion used to estimate the parameters is different
from that of linear regression
• Optimize the likelihood of the data
• As the goal is to model the probability of the class, we
maximize the likelihood of the data
• Maximum likelihood (ML) method of parameter
estimation
• Given training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ {1, 0}
• Data of a class is represented by the parameter vector
w = [w_0, w_1, ..., w_d]^T (parameters of the linear function)
• Unknown: w
• Likelihood of x_n (Bernoulli distribution):
P(x_n | w) = P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)}
• Total data likelihood: P(D | w) = \prod_{n=1}^{N} P(x_n | w)

Estimation of Parameter in
Logistic Regression
• The criterion used to estimate the parameters is different
from that of linear regression
• Optimize the likelihood of the data
• As the goal is to model the probability of the class, we
maximize the likelihood of the data
• Maximum likelihood (ML) method of parameter
estimation
• Given training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ {1, 0}
• Data of a class is represented by the parameter vector
w = [w_0, w_1, ..., w_d]^T (parameters of the linear function)
• Unknown: w
• Likelihood of x_n: P(x_n | w) = P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)}
• Total data likelihood: P(D | w) = \prod_{n=1}^{N} P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)}

Estimation of Parameter in
Logistic Regression
• Total data log likelihood:
l(w) = ln P(D | w)
l(w) = \sum_{n=1}^{N} [ y_n ln(P(x_n)) + (1 - y_n) ln(1 - P(x_n)) ]
• Choose the parameters for which the total data log
likelihood is maximum:
w_ML = arg max_w l(w)
• Cost function for optimization:
l(w) = \sum_{n=1}^{N} [ y_n ln(P(x_n)) + (1 - y_n) ln(1 - P(x_n)) ]
• Condition for optimality: ∂l(w) / ∂w = 0
• Unfortunately, solving this gives no closed-form expression
for w
• Solution: gradient ascent method

Estimation of Parameter in
Logistic Regression
• Gradient ascent method
• It is an iterative
procedure
• We start with an initial
value for w
• At each iteration:
– Estimate the change in w
– The change in w (∆w) is
proportional to the slope
(gradient) of the
likelihood surface:
∆w = η ∂l(w)/∂w,  w ← w + ∆w
where 0 < η < 1 is the proportionality constant
– Then w is updated using ∆w
• This indicates that we move in the direction of positive
slope of the likelihood surface

[Figure: a likelihood surface l(w) over the weight w, showing local maxima and the global maximum]

Estimation of Parameters in Logistic Regression – Gradient Ascent Method

• Given a training dataset, the goal is to maximize the
likelihood function with respect to the parameters of the
linear function
1. Initialize w
• Evaluate the initial value of the log likelihood l(w)
2. Determine the change in w (∆w): ∆w = η ∂l(w)/∂w
3. Update w: w ← w + ∆w
4. Evaluate the log likelihood and check for convergence of
the log likelihood
• If the convergence criterion is not satisfied, repeat
steps 2 to 4
• Convergence criterion: the difference between the log
likelihoods of successive iterations falls below a
threshold (e.g., threshold = 10^-3)
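A compact sketch of these steps for the 2-class case (the names, the zero initialization and the small constant added inside the logs are assumptions; unscaled features may need a smaller step size):

import numpy as np

def fit_logistic_gradient_ascent(X, y, eta=0.01, tol=1e-3, max_iters=10000):
    # X: (N, d) inputs, y: (N,) labels in {0, 1}; eta is the step size
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])       # x_hat = [1, x1, ..., xd]
    w = np.zeros(Xa.shape[1])                           # step 1: initialize w

    def log_likelihood(w):
        p = 1.0 / (1.0 + np.exp(-Xa @ w))
        return np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

    prev = log_likelihood(w)
    for _ in range(max_iters):
        p = 1.0 / (1.0 + np.exp(-Xa @ w))
        grad = Xa.T @ (y - p)                           # gradient of l(w) for the Bernoulli likelihood
        w += eta * grad                                 # steps 2-3: delta_w = eta * gradient, then update
        cur = log_likelihood(w)
        if abs(cur - prev) < tol:                       # step 4: convergence check
            break
        prev = cur
    return w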

Illustration of Classification using Logistic Regression

• Number of training examples (N) = 20
• Dimension of a training example = 2
• The class label attribute is the 3rd dimension
• Classes:
– Child (0)
– Adult (1)

[Figure: training examples plotted as weight in kg versus height in cm]

Illustration of Classification using Logistic Regression

• Training:
w = [-378.2085, 2.2065, 1.8818]^T

[Figure: the learnt decision boundary f(x, w) = w^T \hat{x} = 0 plotted over the weight-versus-height data]

Illustration of Classification using Logistic Regression

Test Example:

[Figure: test example marked on the weight-versus-height plot with the decision boundary w^T \hat{x} = 0]

w^T \hat{x} = 47.9851,  P(x) = 1 / (1 + e^{-(w^T \hat{x})}) ≈ 1

• Class: Adult (C2)

Illustration of Classification using Logistic Regression

Test Example:

[Figure: test example marked on the weight-versus-height plot with the decision boundary w^T \hat{x} = 0]

w^T \hat{x} = 6.7771,  P(x) = 1 / (1 + e^{-(w^T \hat{x})}) ≈ 0.9

• Class: Adult (C2)

Logistic Regression
• Logistic regression is a linear classifier
• Logistic regression looks simple, but yields a very
powerful classifier
• It is used not just for building classifiers, but also
for sensitivity analysis
• Logistic regression can be used to identify how each
attribute contributes to the output
• How important is each attribute for predicting the
class label?
• Perform logistic regression and observe w
• The value of each element of w indicates how much the
corresponding attribute contributes to the output
97

Illustration of Sensitivity Analysis using Logistic Regression

• Training:
w = [-378.2085, 2.2065, 1.8818]^T = [w_0, w_1, w_2]^T

[Figure: learnt decision boundary over the weight-versus-height data]

• Both the attributes are equally
important

Summary
• The SVM is a linear two-class classifier
• An SVM constructs the maximum margin hyperplane
(optimal separating hyperplane) as a decision surface
to separate the data points of two classes
• Margin of a hyperplane: The minimum distance of
training points from the hyperplane
• SVM is a classifier using kernel methods
• Kernel methods involve the following two steps:
– Nonlinear transformation of data to a higher dimensional
feature space induced by a Mercer kernel
– Construction of optimal linear solutions in kernel feature
space
• A kernel function computes the pair-wise inner-products
directly from the original representation of data
• The SVM performs multi-class pattern classification either
by using one-against-the-rest approach or one-against-one
approach
99

Text Books
1. J. Han and M. Kamber, Data Mining: Concepts and
Techniques, Third Edition, Morgan Kaufmann Publishers,
2011.
2. S. Theodoridis and K. Koutroumbas, Pattern Recognition,
Academic Press, 2009.
3. C. M. Bishop, Pattern Recognition and Machine Learning,
Springer, 2006.
4. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall
of India, 1999.
5. Satish Kumar, Neural Networks - A Class Room Approach,
Second Edition, Tata McGraw-Hill, 2013.
6. S. Haykin, Neural Networks and Learning Machines, Prentice
Hall of India, 2010.

100
