
Introduction to Machine Learning


Linear Models for Classification
Veena Thenkanidiyoor
National Institute of Technology
Goa

Linear Method for Classification


• The boundary that separates the regions of the classes is linear
• The separating surface is linear, i.e., a hyperplane
• A hyperplane that best fits the region of separation between
the classes
• Discriminant function: a function that indicates the boundary
between the classes
– Here the discriminant function is linear

[Figure: two 2-D scatter plots (axes x1, x2) of class regions separated by a linear boundary]

• Discriminant function in 2-dimensional space, for x_n = [x_{n1}, x_{n2}]^T:
f(x_n, w_1, w_2, w_0) = w_1 x_{n1} + w_2 x_{n2} + w_0

Linear Method for Classification


• The boundary that separates the regions of the classes is linear
• The separating surface is linear, i.e., a hyperplane
• A hyperplane that best fits the region of separation between
the classes
• Discriminant function: a function that indicates the boundary
between the classes
– Here the discriminant function is linear

[Figure: two 2-D scatter plots (axes x1, x2) of class regions separated by a linear boundary]

• Discriminant function in 2-dimensional space, for x_n = [x_{n1}, x_{n2}]^T, the boundary is:
f(x_n, w_1, w_2, w_0) = w_1 x_{n1} + w_2 x_{n2} + w_0 = 0

Linear Method for Classification


• The boundary that separates the regions of the classes is linear
• The separating surface is linear, i.e., a hyperplane
• A hyperplane that best fits the region of separation between
the classes
• Discriminant function: a function that indicates the boundary
between the classes
– Here the discriminant function is linear

[Figure: two 2-D scatter plots (axes x1, x2) of class regions separated by a linear boundary]

• Discriminant function in 2-dimensional space, for x_n = [x_{n1}, x_{n2}]^T, the boundary can be rewritten as:
x_{n2} = -(w_1/w_2) x_{n1} - w_0/w_2 = m x_{n1} + c

Linear Method for Classification


• The boundary that separates the regions of the classes is linear
• The separating surface is linear, i.e., a hyperplane
• A hyperplane that best fits the region of separation between
the classes
• Discriminant function: a function that indicates the boundary
between the classes
– Here the discriminant function is linear

[Figure: two 2-D scatter plots (axes x1, x2) of class regions separated by a linear boundary]

• Discriminant function in d-dimensional space, for x_n = [x_{n1}, x_{n2}, ..., x_{nd}]^T:
f(x_n, w) = w^T x_n + w_0 = \sum_{i=0}^{d} w_i x_i

Two Classes of Approaches for Linear Classification

1. Modeling a discriminant function:
– For each class, a linear discriminant function fi(x, wi) is
defined
– Let C1, C2, …, Ci, …, CM be the M classes
– Let fi(x, wi) be the linear discriminant function for the ith
class

Class label for x = arg max_i fi(x, wi),  i = 1, 2, ..., M

– The discriminant function of each class is defined independently of the
other classes
– Linear regression can be used to learn a linear
discriminant function
• Do the linear regression by treating the dependent variable
as an indicator variable (categorical variable)
– Logistic regression (probabilistic discriminative model)
– Fisher linear discriminant analysis

Two Classes of Approaches for Linear Classification

2. Directly learn a discriminant function (hyperplane):
– Classic method: a discriminant function between the
classes is learnt

[Figure: two 2-D scatter plots showing a separating hyperplane learnt directly between the classes]

– Perceptron (a linear discriminant function is learnt)
– Support vector machine (SVM) (a linear discriminant
function is learnt)
– Neural networks (when the discriminant function is
nonlinear)

Method of Least Squares: Classification Using Linear Regression

Linear Regression
• A linear approach to model the relationship between a
scalar response y (or dependent variable) and one
or more predictor variables, x or x (or independent
variables)
• The response is modeled as a linear function of the
input (one or more independent variables)
• The optimal coefficient vector w is given by

\hat{w} = (X^T X)^{-1} X^T y

[Figure: fitted regression line for response y versus input x]
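A minimal NumPy sketch of this closed-form fit (the data values and names below are assumptions for illustration; a pseudo-inverse is used instead of an explicit inverse for numerical robustness):

import numpy as np

# toy data: N examples with a leading column of 1s so that w[0] plays the role of w0
X = np.array([[1.0, 150.0], [1.0, 160.0], [1.0, 170.0], [1.0, 180.0]])
y = np.array([45.0, 55.0, 65.0, 80.0])

# w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.pinv(X.T @ X) @ X.T @ y
print(w_hat)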

Classification Using Linear Regression


• Given training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ R^M

– x_n is the input vector (d independent variables)
– There are M classes, represented by M indicator
variables
– y_n is the response vector (dependent variables), an M-
dimensional binary vector, i.e., exactly one of the M values is 1
• Illustration: Iris (Flower) Data – 3 classes

                     X                                        Y
Sepal_Length  Sepal_Width  Petal_Length  Petal_Width    Class1  Class2  Class3
5.1           3.5          1.4           0.2            1       0       0
4.9           3.0          1.4           0.2            1       0       0
7.0           3.2          4.7           1.4            0       1       0
6.4           3.2          4.5           1.5            0       1       0
6.3           3.3          6.0           2.5            0       0       1
5.8           2.7          5.1           1.9            0       0       1

Classification Using Linear Regression


• Given training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ R^M

– x_n is the input vector (d independent variables)
– There are M classes, represented by M indicator
variables
– y_n is the response vector (dependent variables), an M-
dimensional binary vector, i.e., exactly one of the M values is 1
– For N examples, X is the data matrix of size N x (d+1) and Y
is the response matrix of size N x M
• Linear regression on the response vectors: \hat{W} = (X^T X)^{-1} X^T Y
– \hat{W} is of size (d+1) x M
\hat{W} = [\hat{w}_1, \hat{w}_2, ..., \hat{w}_M]
– Each column of \hat{W} is the (d+1) coefficients corresponding
to a class

Classification Using Linear Regression


• For any test example x, the discriminant value for
class i is:
f_i(x, \hat{w}_i) = \hat{w}_i^T x = \sum_{j=0}^{d} \hat{w}_{ij} x_j

Class label for x = arg max_i f_i(x, \hat{w}_i),  i = 1, 2, ..., M
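A short sketch of this least-squares classifier on an indicator (one-hot) response matrix; the function and variable names are assumptions, not part of the slides:

import numpy as np

def fit_indicator_regression(X, labels, num_classes):
    # X: (N, d) inputs; labels: (N,) integer class ids in {0, ..., M-1}
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a column of 1s -> N x (d+1)
    Y = np.eye(num_classes)[labels]                 # indicator response matrix, N x M
    W = np.linalg.pinv(X1.T @ X1) @ X1.T @ Y        # (d+1) x M, one column per class
    return W

def predict(W, X):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X1 @ W, axis=1)                # class label = arg max_i f_i(x, w_i)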

Illustration of Classification using Linear Regression

• Number of training examples (N) = 20
• Dimension of a training example = 2
• Number of classes: 2
• Each output variable is a 2-dimensional
binary vector
• Classes: Child (C1), Adult (C2)

[Figure: training examples plotted as weight in kg versus height in cm for the classes Child and Adult]

Illustration of Classification using Linear Regression

• Training: \hat{W} = (X^T X)^{-1} X^T Y
• X is the data matrix of size 20 x 3
• Y is the response matrix of size 20 x 2

\hat{W} = [\hat{w}_1  \hat{w}_2] = [  2.8897   -1.8897
                                     -0.0222    0.0222
                                      0.0122   -0.0122 ]

[Figure: the two discriminant functions f_1(x, \hat{w}_1) and f_2(x, \hat{w}_2) plotted over the weight-versus-height data]

Illustration of Classification using Linear Regression

Test Example:

[Figure: test example marked on the weight-versus-height plot together with f_1(x, \hat{w}_1) and f_2(x, \hat{w}_2)]

f_1(x, \hat{w}_1) = 0.1842    f_2(x, \hat{w}_2) = 0.8158

• Class: Adult (C2)

Illustration of Classification using Linear Regression

Test Example:

[Figure: test example marked on the weight-versus-height plot together with f_1(x, \hat{w}_1) and f_2(x, \hat{w}_2)]

f_1(x, \hat{w}_1) = 0.4639    f_2(x, \hat{w}_2) = 0.5361

• Class: Adult (C2)

Classification Using Linear Regression

Classification Using Linear Regression


• The dependent variable is categorical (an indicator variable)
• There are multiple outputs (multiple dependent
variables)
• If the input x belongs to Ci, then yi is 1
• The expected output yi for such an x should be close to 1
• When linear regression is used for classification, we are
trying to predict this expected output value
• In other words, we are trying to predict the probability of the
class:
E[y_i | x] = P(x ∈ C_i | x)
• This is the ideal situation
• Linear regression gives the hope of getting this
• The notion of predicting the probability of a class is captured
more naturally by logistic regression



Fisher Discriminant Analysis



Fisher Discriminant Analysis (FDA)


• Aim: find the direction of projection along which the
separability of the classes after projection is maximum
• Training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ {+1, -1}
– d: dimension of an input example
– 2-class data:
• N+: number of training examples in the +ve class
• N-: number of training examples in the -ve class
• Task: find a single direction of projection q such that
the separability of the projected data is maximum

a_n = q^T x_n,  n = 1, 2, ..., N

• Separability of the projected data is defined using
statistical information from the projected data

Classification on Representation
from FDA
• Project the x_n (both training and test data) onto the
direction q to get a 1-dimensional representation
a_n = q^T x_n

[Block diagram: data x_n → feature extraction → FDA → 1-dimensional representation a_n → classifier (binary classifier)]

• Different ways of performing classification:
– Threshold on the projected data
– Decision based on means
• Euclidean distance between a projected test example and the
means of the two classes
• Mahalanobis distance between a projected test example and the
data of the two classes
– Build a Bayes' classifier in the 1-dimensional space
• Unimodal Gaussian
• Gaussian mixture model
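A small NumPy sketch of this pipeline, assuming the standard two-class Fisher solution q ∝ S_W^{-1}(m_+ - m_-) (the slides state the criterion only informally) and the decision-based-on-means rule; all names are illustrative:

import numpy as np

def fisher_direction(X_pos, X_neg):
    # q proportional to Sw^{-1} (m+ - m-): the usual two-class Fisher direction
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    Sw = np.cov(X_pos, rowvar=False) * (len(X_pos) - 1) \
       + np.cov(X_neg, rowvar=False) * (len(X_neg) - 1)   # within-class scatter
    q = np.linalg.solve(Sw, m_pos - m_neg)
    return q / np.linalg.norm(q)

def classify_by_means(q, X_pos, X_neg, x):
    # decision based on means: assign x to the class whose projected mean is nearer
    a = q @ x
    a_pos, a_neg = q @ X_pos.mean(axis=0), q @ X_neg.mean(axis=0)
    return +1 if abs(a - a_pos) <= abs(a - a_neg) else -1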

Illustration: FDA and Bayes' Classifier on 1-dimensional Representation from FDA

[Figures: 2-D training data projected onto the FDA direction, and a Bayes' classifier built on the resulting 1-dimensional representation]

Discriminative Learning based Methods for Classification

Discriminative Learning based Methods

• Learn the surface that best separates the regions of the
classes
• Learning a discriminant function: learn a function that
maps input data to output
• Linear discriminant function: a function that indicates
the boundary between the classes, where the boundary is linear

[Figure: 2-D data of two classes separated by a linear boundary]

Linear Discriminant Function


• Regions of two classes are separable by a linear
surface (line, plane or hyperplane)
• 2-dimensional space: the decision boundary is a
line specified by
w_1 x_1 + w_2 x_2 + w_0 = 0, i.e.,
x_2 = -(w_1/w_2) x_1 - w_0/w_2
• d-dimensional space: the decision surface is a
hyperplane specified by
w_d x_d + ... + w_2 x_2 + w_1 x_1 + w_0 = \sum_{i=0}^{d} w_i x_i = w^T \hat{x} = 0
where w = [w_0, w_1, ..., w_d]^T and \hat{x} = [1, x_1, ..., x_d]^T

[Figure: a 2-D example of the linear decision boundary]

Discriminant Function of a Hyperplane


• The discriminant function of a hyperplane:
g(x) = \sum_{i=1}^{d} w_i x_i + w_0 = w^T x + w_0
• For any point that lies on the hyperplane:
g(x) = \sum_{i=1}^{d} w_i x_i + w_0 = w^T x + w_0 = 0
• Example:
– Consider a straight line with equation x2 + x1 - 1 = 0
– The discriminant function of the straight line is g(x) = x2 + x1 - 1
– For the points (1,0) and (0,1) that lie on
this straight line, g(x) = 0
– For the point (0,0), g(x) = -1, i.e., the
value of g(x) is negative
– For the point (1,1), g(x) = +1, i.e., the
value of g(x) is positive

[Figure: the line x2 + x1 - 1 = 0 in the (x1, x2) plane with the points (0,0), (1,0), (0,1) and (1,1) marked]

Discriminant Function of a Hyperplane


[Figure: the line g(x) = x1 + x2 - 1 = 0 with its positive side (containing (1,1)) and negative side (containing (0,0))]

• A hyperplane has a positive side and a negative side


– For any point on the positive side, the value of
discriminant function, g(x), is positive
– For any point on the negative side, the value of
discriminant function, g(x), is negative

32

Perceptron Learning
• Given training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ {+1, -1}
• Goal: to estimate the parameter vector w = [w_0, w_1, ..., w_d]^T
– such that the linear function (hyperplane) is placed between
the training data of the two classes so that the training error
(classification error) is minimum

[Figure: two valid placements of the hyperplane; either w^T x_n + w_0 > 0 for class C2 and w^T x_n + w_0 < 0 for class C1, or the reverse]

Perceptron Learning
• Given training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ {+1, -1}
1. Initialize w with random values
2. Choose a training example x_n
3. Update w if x_n is misclassified:
w ← w + η x_n,  if w^T x_n + w_0 ≤ 0 and x_n belongs to the class with label +1
w ← w - η x_n,  if w^T x_n + w_0 > 0 and x_n belongs to the class with label -1
– Here 0 < η < 1 is a positive learning-rate parameter
– Increment the misclassification count by 1
4. Repeat steps 2 and 3 till all the training examples
have been presented
5. Repeat steps 2 to 4, resetting the misclassification count
to 0, till the convergence criterion is satisfied
• Convergence criterion:
– Total misclassification count is 0, OR
– Total misclassification count is minimum (falls below a
threshold)

Perceptron Learning
• Training:

[Figure: hyperplane learnt by the perceptron separating classes C1 and C2 in the (x1, x2) plane]

• Test phase:
• Classification of a test pattern x using the weights
w obtained by training the model:
– If w^T x + w_0 > 0 then x is assigned to the class with label
+1 (C2)
– If w^T x + w_0 ≤ 0 then x is assigned to the class with label
-1 (C1)
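A compact sketch of this training and test procedure (the names, the random initialization scale and the epoch cap are assumptions):

import numpy as np

def perceptron_train(X, y, eta=0.5, max_epochs=100):
    # X: (N, d) inputs, y: (N,) labels in {+1, -1}; eta is the learning rate (0 < eta < 1)
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])   # augment with 1 so w[0] plays the role of w0
    w = np.random.randn(Xa.shape[1]) * 0.01         # step 1: random initialization
    for _ in range(max_epochs):                     # step 5: repeated passes over the data
        errors = 0
        for xn, yn in zip(Xa, y):                   # steps 2-4: present every example
            if yn == +1 and w @ xn <= 0:            # misclassified positive example
                w += eta * xn
                errors += 1
            elif yn == -1 and w @ xn > 0:           # misclassified negative example
                w -= eta * xn
                errors += 1
        if errors == 0:                             # convergence: no misclassifications
            break
    return w

def perceptron_predict(w, X):
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.where(Xa @ w > 0, +1, -1)             # test rule from the slides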

Illustration of Perceptron Learning

[Figure: decision boundary obtained by perceptron learning on 2-D training data]

Optimal Separating Hyperplane

[Figure: 2-D data of two classes with candidate separating hyperplanes]

Hyperplane and Margin

[Figure: a separating hyperplane and its margin to the nearest training examples]

Hyperplane
• Equation of straight line: y = mx + c
• Consider the following equation: w2x2 + w1x1 + w0 = 0
• The above equation can be rewritten as follows:
x_2 = -(w_1/w_2) x_1 - w_0/w_2
• Let y = x_2 and x = x_1. Then the above is the equation of a
straight line in a 2-dimensional space with
m = -w_1/w_2 and c = -w_0/w_2
• The equation of a plane in a 3-dimensional space:
w_3 x_3 + w_2 x_2 + w_1 x_1 + w_0 = 0
• The equation of a hyperplane in a d-dimensional
space:
w_d x_d + ... + w_2 x_2 + w_1 x_1 + w_0 = \sum_{i=0}^{d} w_i x_i = w^T x = 0
where w = [w_0, w_1, ..., w_d]^T and x = [1, x_1, ..., x_d]^T

Distance of a Point to Hyperplane


• The discriminant function of a hyperplane:
g(x) = \sum_{i=1}^{d} w_i x_i + w_0 = w^T x + w_0
• For any point x, the (signed) distance to the hyperplane is g(x) / ||w||
• Margin of a hyperplane:
– the distance of the nearest training example to the
hyperplane

[Figure: three points x_1, x_2, x_3 at distances g(x_1)/||w||, g(x_2)/||w||, g(x_3)/||w|| from the hyperplane]
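A tiny numeric check of these formulas, reusing the line x_1 + x_2 - 1 = 0 from the earlier example (the specific test points are assumptions):

import numpy as np

# hyperplane x1 + x2 - 1 = 0: w = [1, 1], w0 = -1
w, w0 = np.array([1.0, 1.0]), -1.0

def g(x):                                   # discriminant function g(x) = w^T x + w0
    return w @ x + w0

for x in [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([1.0, 0.0])]:
    dist = g(x) / np.linalg.norm(w)         # signed distance of x to the hyperplane
    side = "positive" if g(x) > 0 else ("negative" if g(x) < 0 else "on the hyperplane")
    print(x, g(x), round(dist, 3), side)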

Support Vector Machines for Pattern Classification

Maximum Margin Hyperplane

[Figure: the maximum margin hyperplane separating 2-D data of two classes]

Maximum Margin Hyperplane for Linearly Separable Patterns

[Figure: the maximum margin hyperplane w^T x + w_0 = 0 with margin 1/||w||; the margin boundaries are w^T x + w_0 = +1 and w^T x + w_0 = -1]

Support Vector Machine (SVM) for Linearly Separable Patterns

• The optimal hyperplane specified by (w, w_0) must satisfy
the constraint:
y_n(w^T x_n + w_0) ≥ 1  for n = 1, 2, ..., N
– Here y_n ∈ {+1, -1} is the desired output (class label) of
training example x_n and N is the number of training
examples
• Optimal value of the margin of separation: 2 / ||w||
• Maximizing the margin of separation is equivalent to
minimizing the Euclidean norm of the weight vector w
• Learning problem (Lagrangian primal problem):
– Given the training examples {(x_n, y_n)}, n = 1, 2, ..., N, find
the values of w and w_0 that
minimize J(w, w_0) = (1/2) ||w||^2
such that y_n(w^T x_n + w_0) ≥ 1,  n = 1, 2, ..., N

Learning in SVM for Linearly Separable Patterns

• The Lagrangian function is:
J(w, w_0, α) = (1/2) w^T w - \sum_{n=1}^{N} α_n [ y_n(w^T x_n + w_0) - 1 ]
– where the α_n are the Lagrange coefficients
• Conditions for optimality:
∂J(w, w_0, α) / ∂w = 0
∂J(w, w_0, α) / ∂w_0 = 0
• Application of the optimality conditions gives:
w = \sum_{n=1}^{N} α_n y_n x_n
\sum_{n=1}^{N} α_n y_n = 0

Learning in SVM for Linearly Separable Patterns (contd.)

• Lagrangian dual problem:
– Find the Lagrange multipliers {α_n} that maximize the
objective function J(α)
maximize J(α) = \sum_{n=1}^{N} α_n - (1/2) \sum_{m=1}^{N} \sum_{n=1}^{N} α_m α_n y_m y_n x_m^T x_n
subject to the constraints
\sum_{n=1}^{N} y_n α_n = 0
α_n ≥ 0  for n = 1, 2, ..., N
• Only some of the optimum Lagrange multipliers take non-zero
values
• The data points associated with the non-zero optimum Lagrange
multipliers are called support vectors

Learning in SVM for Linearly Separable Patterns (contd.)

[Figure: maximum margin hyperplane w^T x + w_0 = 0 with margin 1/||w||; the support vectors lie on the margin boundaries w^T x + w_0 = +1 and w^T x + w_0 = -1]

SVM for Linearly Separable Patterns


• For the optimal Lagrange multipliers, the optimal weight
vector w is given by
\hat{w} = \sum_{n=1}^{N_s} \hat{α}_n y_n x_n
– where N_s is the number of support vectors
• The optimal hyperplane is defined in terms of the support
vectors
• For any test example x, the discriminant function of the
optimal hyperplane is given by
g(x) = \hat{w}^T x + \hat{w}_0 = \sum_{n=1}^{N_s} \hat{α}_n y_n x^T x_n + \hat{w}_0
– If g(x) ≥ 0 then x is assigned to C1
– If g(x) < 0 then x is assigned to C2
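A minimal scikit-learn sketch of a (near) hard-margin linear SVM; the toy data are assumptions, and a large C is used to approximate the hard-margin case:

import numpy as np
from sklearn.svm import SVC

# toy linearly separable data (assumed values, for illustration only)
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin

print(clf.support_vectors_)        # the x_n with non-zero alpha_n
print(clf.coef_, clf.intercept_)   # w_hat and w_hat_0 of the optimal hyperplane
print(clf.decision_function(X))    # g(x) = w^T x + w_0 for each example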

Illustration: Perceptron vs Linear SVM

[Figure: decision boundaries learnt by the perceptron and by the linear SVM on the same 2-D data]

Maximum Margin Hyperplane for Linearly Nonseparable Patterns

[Figure: maximum margin hyperplane w^T x + w_0 = 0 with margin 1/||w|| and margin boundaries w^T x + w_0 = ±1; slack values shown are ξ = 0 for points outside the margin, 0 < ξ < 1 for points inside the margin, and ξ > 1 for misclassified points]

SVM for Linearly Nonseparable Patterns


• Some data points may fall inside the region of
separation or on the wrong side of separation
• ξ_n is a measure of the deviation of x_n from the ideal
condition of pattern separability
• Optimal hyperplane specified by (w, w0) must satisfy
the constraint:
yn(wTxn+ w0 ) ≥ 1- ξn and ξn ≥ 0 for n = 1, 2, …, N
– Here yn∈{+1, -1} is the desired output (class label) of
training example xn and N is the number of training
examples

51

Learning in SVM for Linearly Nonseparable Patterns

• Learning problem (Lagrangian primal problem):
– Given the training examples {(x_n, y_n)}, n = 1, 2, ..., N, find
the values of w and w_0 that
minimize J(w, w_0, ξ) = (1/2) ||w||^2 + C \sum_{n=1}^{N} ξ_n
subject to the constraints
y_n(w^T x_n + w_0) ≥ 1 - ξ_n,  n = 1, ..., N
ξ_n ≥ 0
– where C determines the trade-off between the margin
and the error on the training data
• The Lagrangian function is:
J(w, ξ, w_0, α) = (1/2) w^T w + C \sum_{n=1}^{N} ξ_n - \sum_{n=1}^{N} α_n [ y_n(w^T x_n + w_0) - 1 + ξ_n ] - \sum_{n=1}^{N} γ_n ξ_n
– where α_n and γ_n are the Lagrange coefficients

Learning in SVM for Linearly Nonseparable Patterns (contd.)

• Conditions for optimality:
∂J(w, ξ, w_0, α) / ∂w = 0
∂J(w, ξ, w_0, α) / ∂ξ_n = 0
∂J(w, ξ, w_0, α) / ∂w_0 = 0
• Application of the optimality conditions gives:
w = \sum_{n=1}^{N} α_n y_n x_n
\sum_{n=1}^{N} α_n y_n = 0
0 ≤ α_n ≤ C

Learning in SVM for Linearly Nonseparable Patterns (contd.)

• Lagrangian dual problem:
– Find the Lagrange multipliers {α_n} that maximize the
objective function J(α)
maximize J(α) = \sum_{n=1}^{N} α_n - (1/2) \sum_{m=1}^{N} \sum_{n=1}^{N} α_m α_n y_m y_n x_m^T x_n
subject to the constraints
\sum_{n=1}^{N} y_n α_n = 0
0 ≤ α_n ≤ C  for n = 1, 2, ..., N
• The data points associated with the non-zero optimum Lagrange
multipliers are called support vectors
• For the optimal Lagrange multipliers, the optimal weight vector w is
given by
\hat{w} = \sum_{n=1}^{N_s} \hat{α}_n y_n x_n
– where N_s is the number of support vectors
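A small scikit-learn sketch of the role of C (the dataset and the C values are assumptions; smaller C tolerates more margin violations, i.e., larger slacks ξ_n):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# overlapping 2-class data so that some slack variables are non-zero
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_)} support vectors, "
          f"training accuracy={clf.score(X, y):.2f}")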

Maximum Margin Hyperplane for Linearly Nonseparable Patterns

[Figure: maximum margin hyperplane w^T x + w_0 = 0 with margin 1/||w|| and margin boundaries w^T x + w_0 = ±1; the support vectors and the slack values ξ of points inside the margin or misclassified are marked]

Pattern Classification Problems

[Figure: three 2-D datasets illustrating linearly separable classes, nonlinearly separable classes, and overlapping classes]

Support Vector Machines and Kernel Methods

Key Aspects of Kernel Methods


• Kernel methods involve
– Nonlinear transformation of data to a higher dimensional
feature space induced by a Mercer kernel
– Detection of optimal linear solutions in the kernel feature
space
• Transformation to a higher dimensional space is expected to
be helpful in conversion of nonlinear relations into linear
relations (Cover’s theorem)
– Nonlinearly separable patterns to linearly separable
patterns
• Pattern analysis methods are implemented in such a way
that the kernel feature space representation is not
explicitly required.
– They involve computation of pair-wise inner-products only
• The pair-wise inner-products are computed efficiently
directly from the original representation of data using a
kernel function (Kernel trick)

58

Illustration of Transformation

[Figure: nonlinearly separable 2-D data become linearly separable after the nonlinear mapping]

x = [x_1, x_2]^T  →  Φ(x) = [x_1^2, x_2^2, √2 x_1 x_2]^T

Maximum Margin Hyperplane in Transformed Feature Space

[Figure: maximum margin hyperplane w^T Φ(x) + w_0 = 0 in the kernel feature space, with margin 1/||w||, margin boundaries w^T Φ(x) + w_0 = ±1 and slack variables ξ for violating points]

Learning in SVM for Nonlinearly Separable Patterns

• Learning problem (Lagrangian primal problem):
– Given the training examples {(x_n, y_n)}, n = 1, 2, ..., N, find
the values of w and w_0 that
minimize J(w, w_0, ξ) = (1/2) ||w||^2 + C \sum_{n=1}^{N} ξ_n
subject to the constraints
y_n(w^T Φ(x_n) + w_0) ≥ 1 - ξ_n,  n = 1, ..., N
ξ_n ≥ 0
– where C determines the trade-off between the margin
and the error on the training data
• The Lagrangian function is:
J(w, ξ, w_0, α) = (1/2) w^T w + C \sum_{n=1}^{N} ξ_n - \sum_{n=1}^{N} α_n [ y_n(w^T Φ(x_n) + w_0) - 1 + ξ_n ] - \sum_{n=1}^{N} γ_n ξ_n
– where α_n and γ_n are the Lagrange coefficients

Learning in SVM for Nonlinearly Separable Patterns (contd.)

• Conditions for optimality:
∂J(w, ξ, w_0, α) / ∂w = 0
∂J(w, ξ, w_0, α) / ∂ξ_n = 0
∂J(w, ξ, w_0, α) / ∂w_0 = 0
• Application of the optimality conditions gives:
w = \sum_{n=1}^{N} α_n y_n Φ(x_n)
\sum_{n=1}^{N} α_n y_n = 0
0 ≤ α_n ≤ C

Learning in SVM for Nonlinearly Separable Patterns (contd.)

• Lagrangian dual problem:
– Find the Lagrange multipliers {α_n} that maximize the
objective function J(α)
maximize J(α) = \sum_{n=1}^{N} α_n - (1/2) \sum_{m=1}^{N} \sum_{n=1}^{N} α_m α_n y_m y_n Φ(x_m)^T Φ(x_n)
subject to the constraints
\sum_{n=1}^{N} y_n α_n = 0
0 ≤ α_n ≤ C  for n = 1, 2, ..., N
• The data points associated with the non-zero optimum Lagrange
multipliers are called support vectors
• For the optimum Lagrange multipliers, the optimum weight vector
w is given by
\hat{w} = \sum_{n=1}^{N_s} \hat{α}_n y_n Φ(x_n)
– where N_s is the number of support vectors

SVM for Nonlinearly Separable Patterns


• For any test example x, the discriminant function of the
optimum hyperplane is given by
g(x) = \hat{w}^T Φ(x) + \hat{w}_0 = \sum_{n=1}^{N_s} \hat{α}_n y_n Φ(x)^T Φ(x_n) + \hat{w}_0
• The SVM is implemented in such a way that the kernel feature
space representation Φ(x) of any x is not explicitly
required
• It involves computation of pair-wise inner-products only
• The pair-wise inner-products are computed efficiently
directly from the original representation of the data using a
kernel function (kernel trick)
• K(x, x_n) = Φ(x)^T Φ(x_n) is called an inner-product kernel
• Discriminant function of the optimal decision surface:
g(x) = \hat{w}^T Φ(x) + \hat{w}_0 = \sum_{n=1}^{N_s} α_n y_n K(x, x_n) + \hat{w}_0

Kernel Functions
• Kernel function: K(xm, xn) = Φ(xm)T Φ(xn)
• Kernel functions must satisfy Mercer’s theorem
• Kernel gram matrix includes the values of the kernel
function for all pairs of examples in the training set:

K = [ K(x_1, x_1)  K(x_1, x_2)  ...  K(x_1, x_N)
      K(x_2, x_1)  K(x_2, x_2)  ...  K(x_2, x_N)
      ...
      K(x_N, x_1)  K(x_N, x_2)  ...  K(x_N, x_N) ]

• The kernel gram matrix must be symmetric and
positive semi-definite, i.e., the eigenvalues should be
non-negative,
– for convergence of the iterative method used for solving
the constrained optimization problem
65

Kernel Functions (contd.)


• The commonly used kernel functions are:
Linear kernel:      K(x_m, x_n) = x_m^T x_n
Polynomial kernel:  K(x_m, x_n) = (a x_m^T x_n + b)^p
Gaussian kernel:    K(x_m, x_n) = exp( -||x_m - x_n||^2 / σ )
– where the width σ in the Gaussian kernel, and a, b and the degree
of the polynomial p in the polynomial kernel, are the kernel
parameters
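A small sketch of these kernels and of the Gram matrix (function names are illustrative; the exact normalization of the Gaussian kernel, σ versus 2σ², varies between texts):

import numpy as np

def linear_kernel(xm, xn):
    return xm @ xn

def polynomial_kernel(xm, xn, a=1.0, b=1.0, p=2):
    return (a * (xm @ xn) + b) ** p

def gaussian_kernel(xm, xn, sigma=1.0):
    # width parameter sigma as named on the slide
    return np.exp(-np.sum((xm - xn) ** 2) / sigma)

def gram_matrix(X, kernel):
    # Gram matrix K[m, n] = kernel(x_m, x_n) for all pairs of training examples
    N = len(X)
    return np.array([[kernel(X[m], X[n]) for n in range(N)] for m in range(N)])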

Inner-product Kernels

• Kernel function: K(x_m, x_n) = Φ(x_m)^T Φ(x_n)
• Polynomial kernel: K(x_m, x_n) = (x_m^T x_n + 1)^2
• For 2-dimensional patterns, x_1 = [x_11, x_12]^T and x_2 = [x_21, x_22]^T:
Φ(x_1) = [1, √2 x_11, √2 x_12, x_11^2, x_12^2, √2 x_11 x_12]^T
Φ(x_2) = [1, √2 x_21, √2 x_22, x_21^2, x_22^2, √2 x_21 x_22]^T
(x_1^T x_2 + 1)^2 = 1 + 2 x_11 x_21 + 2 x_12 x_22 + x_11^2 x_21^2 + x_12^2 x_22^2 + 2 x_11 x_12 x_21 x_22 = Φ(x_1)^T Φ(x_2)
• The dimension of the feature vector Φ(x) for the
polynomial kernel of degree p and pattern
dimension d is given by
(d + p)! / (d! p!)
• For the Gaussian kernel, the dimension of the feature vectors
can be shown to be infinite
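A quick numeric check of this identity (the example vectors are assumptions):

import numpy as np

def phi(x):
    # explicit feature map for the degree-2 polynomial kernel (x^T z + 1)^2 in 2-D
    x1, x2 = x
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2, x1**2, x2**2, np.sqrt(2)*x1*x2])

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, -1.0])

k_direct = (x1 @ x2 + 1) ** 2      # kernel trick: only inner products in the original space
k_mapped = phi(x1) @ phi(x2)       # explicit transformation, then inner product
print(k_direct, k_mapped)          # both print 4.0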

Architecture of an SVM

[Figure: SVM architecture — the input example x is compared with each support vector x_1, ..., x_Ns through the kernel values K(x, x_1), ..., K(x, x_Ns); these are weighted by α*_n y_n and summed together with the bias w_0 to produce the output
g(x) = \sum_{n=1}^{N_s} α_n y_n K(x, x_n) + w_0
If g(x) ≥ 0 then x is assigned to class 1 (C1); if g(x) < 0 then x is assigned to class 2 (C2)]

Illustration of SVM with Gaussian Kernel

[Figure: nonlinear decision boundary produced by an SVM with a Gaussian kernel on 2-D data]

Multi-class Pattern Classification using SVMs

• Multi-class pattern classification for C classes is solved
using a combination of several binary classifiers and a
decision strategy

[Block diagram: input pattern x is fed to SVM1, SVM2, ..., SVML; their outputs g_1(x), ..., g_L(x) are combined by a decision strategy to produce the class]

• Approaches to multi-class pattern classification using SVMs:
– One-against-the-rest approach
– One-against-one approach
70

One-against-the-rest Approach
[Block diagram: input pattern x is fed to SVM1, SVM2, SVM3, SVM4; their outputs g_1(x), g_2(x), g_3(x), g_4(x) go to the decision logic, which outputs the class]
• An SVM is built for each class to form a boundary between
the region of the class and the regions of the other classes
• Let C=4 be the total number of classes. Then, number of
SVMs is L = 4
• A test pattern x is classified by using winner-takes-all
strategy

71

One-against-the-rest Approach: Example

[Figure panels: Data; Class1 vs Rest; Class2 vs Rest; Class3 vs Rest; Class4 vs Rest; Multi-class decision region]

One-against-one Approach

[Block diagram: input pattern x is fed to the pairwise SVMs SVM12, SVM13, SVM14, SVM23, SVM24, SVM34; their outputs g_12(x), ..., g_34(x) go to the decision strategy, which outputs the class]

• An SVM is built for every pair of classes to form a boundary
between their regions
• Let C = 4 be the total number of classes
• The number of pairwise SVMs is L = C(C-1)/2 = 6
• The max-wins strategy is used to determine the class of a
test pattern x
– In this strategy, a majority voting scheme is used

One-against-one Approach: Example

[Figure panels: Data; Class1 vs Class2; Class1 vs Class3; Class1 vs Class4; Class2 vs Class3; Class2 vs Class4; Class3 vs Class4; Multi-class decision region]

Tools For SVMs


Sl No  Name              Default approach for multiclass classification   Reference
1      SVMTorch          One-against-the-rest                              [1]
2      LibSVM            One-against-one                                   [2]
3      SVMlight          One-against-the-rest                              [3]
4      LibLinear         One-against-the-rest                              [4]
5      Scikit-learn SVM  One-against-one                                   [5]

[1] http://bengio.abracadoudou.com/SVMTorch.html
[2] http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[3] http://svmlight.joachims.org/
[4] https://www.csie.ntu.edu.tw/~cjlin/liblinear/
[5] http://scikit-learn.org/stable/modules/svm.html#svm
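A brief scikit-learn sketch of the two approaches on the 3-class Iris data used earlier in the slides (the explicit wrapper classes are one possible way of making the strategy visible):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)   # one-against-the-rest
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)    # one-against-one

print(len(ovr.estimators_), len(ovo.estimators_))   # 3 and C(C-1)/2 = 3 SVMs for C = 3 classes
print(ovr.predict(X[:2]), ovo.predict(X[:2]))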


Logistic Regression
(A probabilistic discriminative model)

Logistic Regression
• Requirement: the discriminant function fi(x, wi)
should give the probability of class Ci

E[y_i | x] = P(x ∈ C_i | x)

• Look for some kind of transformation of the probability
and fit that
• Logit transformation: log( P(x) / (1 - P(x)) )
• 2-class classification:
– Class label: 0 or 1
– P(x) is the probability that the output is 1 given the
input (probability of success)
– 1 - P(x) is the probability that the output is 0 given the
input (probability of failure)

Logistic Regression
• Logit function: log of the odds function
• Odds function: P(x) / (1 - P(x))
– probability of success divided by the probability of failure
• Fit a linear model to the logit function:
log( P(x) / (1 - P(x)) ) = w_0 + w_1 x_1 + ... + w_d x_d = w^T \hat{x}
where w = [w_0, w_1, ..., w_d]^T and \hat{x} = [1, x_1, ..., x_d]^T
– For a 1-dimensional (d = 1) input x:
log( P(x) / (1 - P(x)) ) = w_0 + w_1 x   ⇒   P(x) / (1 - P(x)) = e^{(w_0 + w_1 x)}

Logistic Regression
• Logit function: log of the odds function
• Odds function: P(x) / (1 - P(x))
– probability of success divided by the probability of failure
• Fit a linear model to the logit function:
log( P(x) / (1 - P(x)) ) = w^T \hat{x}
where w = [w_0, w_1, ..., w_d]^T and \hat{x} = [1, x_1, ..., x_d]^T
– For a 1-dimensional (d = 1) input x:
P(x) / (1 - P(x)) = e^{(w_0 + w_1 x)}
P(x) = e^{(w_0 + w_1 x)} / (1 + e^{(w_0 + w_1 x)}) = 1 / (1 + e^{-(w_0 + w_1 x)})

Logistic Regression
P(x) = 1 / (1 + e^{-(w_0 + w_1 x)})

• This function is a sigmoidal function, specifically called the
logistic function
• Logistic function:
P(x) = 1 / (1 + e^{-β x})

[Figure: logistic functions P(x) = 1 / (1 + e^{-β x}) plotted against x for β = 0.2, 0.3 and 0.6; larger β gives a steeper curve]
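A tiny sketch of this function for the three β values shown on the slide:

import numpy as np

def logistic(x, beta):
    # logistic (sigmoid) function P(x) = 1 / (1 + exp(-beta * x))
    return 1.0 / (1.0 + np.exp(-beta * x))

for beta in (0.2, 0.3, 0.6):
    print(beta, logistic(np.array([-10.0, 0.0, 10.0]), beta))  # steeper curve for larger beta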

Logistic Regression
• Logit function: log of the odds function
• Odds function: P(x) / (1 - P(x))
– probability of success divided by the probability of failure
• Fit a linear model to the logit function: log( P(x) / (1 - P(x)) ) = w^T \hat{x}
– For a d-dimensional input x = [x_1, x_2, ..., x_d]^T:
log( P(x) / (1 - P(x)) ) = w_0 + w_1 x_1 + ... + w_d x_d = w^T \hat{x}
where w = [w_0, w_1, ..., w_d]^T and \hat{x} = [1, x_1, ..., x_d]^T
P(x) / (1 - P(x)) = e^{(w^T \hat{x})}

Logistic Regression
• Logit function: log of the odds function
• Odds function: P(x) / (1 - P(x))
– probability of success divided by the probability of failure
• Fit a linear model to the logit function: log( P(x) / (1 - P(x)) ) = w^T \hat{x}
– For a d-dimensional input x = [x_1, x_2, ..., x_d]^T:
P(x) / (1 - P(x)) = e^{(w^T \hat{x})}
where w = [w_0, w_1, ..., w_d]^T and \hat{x} = [1, x_1, ..., x_d]^T
P(x) = e^{(w^T \hat{x})} / (1 + e^{(w^T \hat{x})}) = 1 / (1 + e^{-(w^T \hat{x})})

What about the classifier learning here?
It is still a linear classifier – the boundary is a
linear surface, i.e., a hyperplane

Logistic Regression
• Logit function: log of the odds function
• Odds function: P(x) / (1 - P(x))
– probability of success divided by the probability of failure
• Fit a linear model to the logit function: log( P(x) / (1 - P(x)) ) = w^T \hat{x}
– For a d-dimensional input x = [x_1, x_2, ..., x_d]^T:
P(x) / (1 - P(x)) = e^{(w^T \hat{x})}
where w = [w_0, w_1, ..., w_d]^T and \hat{x} = [1, x_1, ..., x_d]^T
P(x) = e^{(w^T \hat{x})} / (1 + e^{(w^T \hat{x})}) = 1 / (1 + e^{-(w^T \hat{x})})

For any test example x:
If P(x) ≥ 0.5 then x is assigned the label 1
If P(x) < 0.5 then x is assigned the label 0

Estimation of Parameter in
Logistic Regression
• The criterion used to estimate the parameters is different
from that of linear regression
• Optimize the likelihood of the data
• As the goal is to model the probability of the class, we
maximize the likelihood of the data
• Maximum likelihood (ML) method of parameter
estimation
• Given training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ {1, 0}
• Data of a class is represented by the parameter vector
w = [w_0, w_1, ..., w_d]^T (parameters of the linear function)
• Unknown: w
• Likelihood of x_n: P(x_n | w) = P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)}
– P(x_n) is the probability that x_n has label 1 and 1 - P(x_n) is the
probability that x_n has label 0

Estimation of Parameter in
Logistic Regression
• The criterion used to estimate the parameters is different
from that of linear regression
• Optimize the likelihood of the data
• As the goal is to model the probability of the class, we
maximize the likelihood of the data
• Maximum likelihood (ML) method of parameter
estimation
• Given training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ {1, 0}
• Data of a class is represented by the parameter vector
w = [w_0, w_1, ..., w_d]^T (parameters of the linear function)
• Unknown: w
• Likelihood of x_n (Bernoulli distribution):
P(x_n | w) = P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)}
• Total data likelihood: P(D | w) = \prod_{n=1}^{N} P(x_n | w)

Estimation of Parameter in
Logistic Regression
• The criterion used to estimate the parameters is different
from that of linear regression
• Optimize the likelihood of the data
• As the goal is to model the probability of the class, we
maximize the likelihood of the data
• Maximum likelihood (ML) method of parameter
estimation
• Given training data: D = {x_n, y_n}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ {1, 0}
• Data of a class is represented by the parameter vector
w = [w_0, w_1, ..., w_d]^T (parameters of the linear function)
• Unknown: w
• Likelihood of x_n: P(x_n | w) = P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)}
• Total data likelihood: P(D | w) = \prod_{n=1}^{N} P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)}

Estimation of Parameter in
Logistic Regression
• Total data log likelihood:
l(w) = ln P(D | w)
l(w) = \sum_{n=1}^{N} [ y_n ln(P(x_n)) + (1 - y_n) ln(1 - P(x_n)) ]
• Choose the parameters for which the total data log
likelihood is maximum:
w_ML = arg max_w l(w)
• Cost function for optimization:
l(w) = \sum_{n=1}^{N} [ y_n ln(P(x_n)) + (1 - y_n) ln(1 - P(x_n)) ]
• Condition for optimality: ∂l(w) / ∂w = 0
• Unfortunately, solving this gives no closed-form expression
for w
• Solution: gradient ascent method

Estimation of Parameter in
Logistic Regression
• Gradient ascent method
• It is an iterative
procedure
• We start with an initial
value for w
• At each iteration:
– Estimate the change in w
– The change in w (∆w) is
proportional to the slope
(gradient) of the
likelihood surface:
∆w = η ∂l(w)/∂w,  w ← w + ∆w
where 0 < η < 1 is the proportionality constant
– Then w is updated using ∆w
• This indicates that we move in the direction of positive
slope of the likelihood surface

[Figure: a likelihood surface l(w) over the weight w, showing local maxima and the global maximum]

Estimation of Parameters in Logistic Regression – Gradient Ascent Method

• Given a training dataset, the goal is to maximize the
likelihood function with respect to the parameters of the
linear function
1. Initialize w
• Evaluate the initial value of the log likelihood l(w)
2. Determine the change in w (∆w): ∆w = η ∂l(w)/∂w
3. Update w: w ← w + ∆w
4. Evaluate the log likelihood and check for convergence of
the log likelihood
• If the convergence criterion is not satisfied, repeat
steps 2 to 4
• Convergence criterion: the difference between the log
likelihoods of successive iterations falls below a
threshold (e.g., threshold = 10^-3)
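A compact sketch of these steps for the 2-class case (the names, the zero initialization and the small constant added inside the logs are assumptions; unscaled features may need a smaller step size):

import numpy as np

def fit_logistic_gradient_ascent(X, y, eta=0.01, tol=1e-3, max_iters=10000):
    # X: (N, d) inputs, y: (N,) labels in {0, 1}; eta is the step size
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])       # x_hat = [1, x1, ..., xd]
    w = np.zeros(Xa.shape[1])                           # step 1: initialize w

    def log_likelihood(w):
        p = 1.0 / (1.0 + np.exp(-Xa @ w))
        return np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

    prev = log_likelihood(w)
    for _ in range(max_iters):
        p = 1.0 / (1.0 + np.exp(-Xa @ w))
        grad = Xa.T @ (y - p)                           # gradient of l(w) for the Bernoulli likelihood
        w += eta * grad                                 # steps 2-3: delta_w = eta * gradient, then update
        cur = log_likelihood(w)
        if abs(cur - prev) < tol:                       # step 4: convergence check
            break
        prev = cur
    return w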

Illustration of Classification using Logistic Regression

• Number of training examples (N) = 20
• Dimension of a training example = 2
• The class label attribute is the 3rd dimension
• Classes:
– Child (0)
– Adult (1)

[Figure: training examples plotted as weight in kg versus height in cm]

Illustration of Classification using Logistic Regression

• Training:
w = [-378.2085, 2.2065, 1.8818]^T

[Figure: the learnt decision boundary f(x, w) = w^T \hat{x} = 0 plotted over the weight-versus-height data]

Illustration of Classification using Logistic Regression

Test Example:

[Figure: test example marked on the weight-versus-height plot with the decision boundary w^T \hat{x} = 0]

w^T \hat{x} = 47.9851,  P(x) = 1 / (1 + e^{-(w^T \hat{x})}) ≈ 1

• Class: Adult (C2)

Illustration of Classification using Logistic Regression

Test Example:

[Figure: test example marked on the weight-versus-height plot with the decision boundary w^T \hat{x} = 0]

w^T \hat{x} = 6.7771,  P(x) = 1 / (1 + e^{-(w^T \hat{x})}) ≈ 0.9

• Class: Adult (C2)

Logistic Regression
• Logistic regression is a linear classifier
• Logistic regression looks simple, but yields a very
powerful classifier
• It is used not just for building classifiers, but also
for sensitivity analysis
• Logistic regression can be used to identify how each
attribute contributes to the output
• How important is each attribute for predicting the
class label?
• Perform logistic regression and observe w
• The value of each element of w indicates how much the
corresponding attribute contributes to the output
97

Illustration of Sensitivity Analysis using Logistic Regression

• Training:
w = [-378.2085, 2.2065, 1.8818]^T = [w_0, w_1, w_2]^T

[Figure: learnt decision boundary over the weight-versus-height data]

• Both the attributes are equally
important

Summary
• The SVM is a linear two-class classifier
• An SVM constructs the maximum margin hyperplane
(optimal separating hyperplane) as a decision surface
to separate the data points of two classes
• Margin of a hyperplane: The minimum distance of
training points from the hyperplane
• SVM is a classifier using kernel methods
• Kernel methods involve the following two steps:
– Nonlinear transformation of data to a higher dimensional
feature space induced by a Mercer kernel
– Construction of optimal linear solutions in kernel feature
space
• A kernel function computes the pair-wise inner-products
directly from the original representation of data
• The SVM performs multi-class pattern classification either
by using one-against-the-rest approach or one-against-one
approach
99

Text Books
1. J. Han and M. Kamber, Data Mining: Concepts and
Techniques, Third Edition, Morgan Kaufmann Publishers,
2011.
2. S. Theodoridis and K. Koutroumbas, Pattern Recognition,
Academic Press, 2009.
3. C. M. Bishop, Pattern Recognition and Machine Learning,
Springer, 2006.
4. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall
of India, 1999.
5. Satish Kumar, Neural Networks - A Class Room Approach,
Second Edition, Tata McGraw-Hill, 2013.
6. S. Haykin, Neural Networks and Learning Machines, Prentice
Hall of India, 2010.

100
