Linear Model for Classification
• Discriminant function in 2-dimensional space, for $\mathbf{x}_n = [x_{n1}, x_{n2}]^T$:
  $f(\mathbf{x}_n, w_1, w_2, w_0) = w_1 x_{n1} + w_2 x_{n2} + w_0$
• The decision boundary in 2-dimensional space is where the discriminant function equals zero:
  $f(\mathbf{x}_n, w_1, w_2, w_0) = w_1 x_{n1} + w_2 x_{n2} + w_0 = 0$
• Rearranging the boundary equation gives the familiar straight-line form:
  $x_{n2} = -\frac{w_1}{w_2} x_{n1} - \frac{w_0}{w_2} = m\, x_{n1} + c$
• Discriminant function in d-dimensional space, for $\mathbf{x}_n = [x_{n1}, x_{n2}, \ldots, x_{nd}]^T$:
  $f(\mathbf{x}_n, \mathbf{w}) = \mathbf{w}^T \mathbf{x}_n + w_0 = \sum_{i=0}^{d} w_i x_i$  (with $x_0 = 1$)
Linear Regression
• A linear approach to model the relationship between a scalar response y (the dependent variable) and one or more predictor variables x or $\mathbf{x}$ (the independent variables)
• The response is modelled as a linear function of the input (one or more independent variables)
• The optimal coefficient vector $\mathbf{w}$ is given by the normal equations (a short sketch follows below):
  $\hat{\mathbf{w}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
[Figure: fitted line through data points in the (x, y) plane]
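A minimal NumPy sketch of this estimate, assuming a small synthetic dataset; `X` is the design matrix with a leading column of ones for the bias term:

```python
import numpy as np

# Synthetic 1-D example: y is roughly a linear function of x (assumed data).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# Design matrix with a leading column of ones for the bias term w0.
X = np.column_stack([np.ones_like(x), x])

# Normal equations: w_hat = (X^T X)^{-1} X^T y
# (np.linalg.solve is used instead of an explicit inverse for numerical stability).
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("estimated [w0, w1]:", w_hat)
```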
[Figure: training data of two classes plotted as weight (in kg) versus height (in cm)]
[Figure: the same data with the two class-wise linear functions $f_1(\mathbf{x}, \hat{\mathbf{w}}_1)$ and $f_2(\mathbf{x}, \hat{\mathbf{w}}_2)$ obtained by linear regression, plotted over weight (in kg) versus height (in cm)]
[Figure: discriminant values for two test points: one gives $f_1(\mathbf{x}, \hat{\mathbf{w}}_1) = 0.1842$ and $f_2(\mathbf{x}, \hat{\mathbf{w}}_2) = 0.8158$; the other gives $f_1(\mathbf{x}, \hat{\mathbf{w}}_1) = 0.4639$ and $f_2(\mathbf{x}, \hat{\mathbf{w}}_2) = 0.5361$]
Classification on Representation from FDA
• Project the $\mathbf{x}_n$ (both training and test data) onto the direction $\mathbf{q}$ to get a 1-dimensional representation:
  $a_n = \mathbf{q}^T \mathbf{x}_n, \quad n = 1, 2, \ldots, N$
• Pipeline: data $\mathbf{x}_n$ → feature extraction (FDA) → 1-dimensional representation $a_n$ → classifier (binary classifier), as sketched below
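A small sketch of this projection step, assuming the FDA direction `q` has already been estimated; the threshold value is a hypothetical choice for illustration:

```python
import numpy as np

def project_and_classify(X, q, threshold=0.0):
    """Project each row of X onto direction q and threshold the 1-D score."""
    a = X @ q                      # a_n = q^T x_n for every example
    return np.where(a >= threshold, 1, -1)

# Usage with made-up values: 2-D data projected onto an assumed FDA direction.
X = np.array([[1.0, 2.0], [3.0, 0.5], [-1.0, -2.0]])
q = np.array([0.8, 0.6])           # assumed direction obtained from FDA
print(project_and_classify(X, q))
```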
• Example of a linear discriminant function: $g(\mathbf{x}) = x_1 + x_2 - 1$
Perceptron Learning
• Given: training data $D = \{\mathbf{x}_n, y_n\}_{n=1}^{N}$, $\mathbf{x}_n \in \mathbb{R}^d$ and $y_n \in \{-1, +1\}$
• Goal: estimate the parameter vector w = [w0, w1, ..., wd]^T such that the linear function (hyperplane) is placed between the training data of the two classes so that the training error (classification error) is minimum
[Figure: two possible separating hyperplanes in the $(x_1, x_2)$ plane; in each case the side with $\mathbf{w}^T\mathbf{x}_n + w_0 > 0$ corresponds to one class (C2) and the side with $\mathbf{w}^T\mathbf{x}_n + w_0 < 0$ to the other (C1)]
Perceptron Learning
• Given: training data $D = \{\mathbf{x}_n, y_n\}_{n=1}^{N}$, $\mathbf{x}_n \in \mathbb{R}^d$ and $y_n \in \{-1, +1\}$
1. Initialize w with random values
2. Choose a training example $\mathbf{x}_n$
3. Update w if $\mathbf{x}_n$ is misclassified:
   $\mathbf{w} \leftarrow \mathbf{w} + \eta\,\mathbf{x}_n$, if $\mathbf{w}^T\mathbf{x}_n + w_0 \le 0$ and $\mathbf{x}_n$ belongs to the class with label +1
   $\mathbf{w} \leftarrow \mathbf{w} - \eta\,\mathbf{x}_n$, if $\mathbf{w}^T\mathbf{x}_n + w_0 > 0$ and $\mathbf{x}_n$ belongs to the class with label −1
   – Here 0 < η < 1 is a positive learning rate parameter
   – Increment the misclassification count by 1
4. Repeat steps 2 and 3 until all the training examples have been presented
5. Repeat steps 2 to 4, resetting the misclassification count to 0, until the convergence criterion is satisfied (a code sketch of the full procedure follows this list)
• Convergence criterion:
  – The total misclassification count is 0, OR
  – The total misclassification count is minimum (falls below a threshold)
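A compact NumPy sketch of this procedure, using an augmented input $\hat{\mathbf{x}} = [1, x_1, \ldots, x_d]^T$ so that $w_0$ is folded into the weight vector; the data and learning rate are assumed for illustration:

```python
import numpy as np

def perceptron_train(X, y, eta=0.5, max_epochs=100):
    """Perceptron learning on labels y in {-1, +1}; returns augmented weights."""
    Xa = np.column_stack([np.ones(len(X)), X])   # prepend 1 for the bias w0
    w = np.random.default_rng(0).normal(size=Xa.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xn, yn in zip(Xa, y):
            if yn == +1 and w @ xn <= 0:         # misclassified positive example
                w += eta * xn
                errors += 1
            elif yn == -1 and w @ xn > 0:        # misclassified negative example
                w -= eta * xn
                errors += 1
        if errors == 0:                          # convergence criterion
            break
    return w

def perceptron_predict(X, w):
    Xa = np.column_stack([np.ones(len(X)), X])
    return np.where(Xa @ w > 0, +1, -1)
```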
Perceptron Learning
• Training: [Figure: separating hyperplane learned between classes C1 and C2 in the $(x_1, x_2)$ plane]
• Test phase: classification of a test pattern x using the weights w obtained by training the model:
  – If $\mathbf{w}^T\mathbf{x} + w_0 > 0$, then x is assigned to the class with label +1 (C2)
  – If $\mathbf{w}^T\mathbf{x} + w_0 \le 0$, then x is assigned to the class with label −1 (C1)
Hyperplane
• Equation of a straight line: y = mx + c
• Consider the following equation: $w_2 x_2 + w_1 x_1 + w_0 = 0$
• The above equation can be rewritten as:
  $x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}$
• Let $y = x_2$ and $x = x_1$. Then the above is the equation of a straight line in a 2-dimensional space with
  $m = -\frac{w_1}{w_2}$ and $c = -\frac{w_0}{w_2}$
• The equation of a plane in a 3-dimensional space:
  $w_3 x_3 + w_2 x_2 + w_1 x_1 + w_0 = 0$
• The equation of a hyperplane in a d-dimensional space (a small numeric check of the slope/intercept relation follows):
  $w_d x_d + \cdots + w_2 x_2 + w_1 x_1 + w_0 = \sum_{i=0}^{d} w_i x_i = \mathbf{w}^T\mathbf{x} = 0$  (with $x_0 = 1$)
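As a quick numeric check of the slope/intercept relation; the weight values below are made up for illustration:

```python
# For w2*x2 + w1*x1 + w0 = 0 with assumed weights [w0, w1, w2] = [2.0, 1.0, -0.5]:
w0, w1, w2 = 2.0, 1.0, -0.5
m = -w1 / w2          # slope     m = -w1/w2 = 2.0
c = -w0 / w2          # intercept c = -w0/w2 = 4.0
print(f"x2 = {m} * x1 + {c}")
```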
[Figure: maximum margin hyperplane $\mathbf{w}^T\mathbf{x} + w_0 = 0$ with the margin hyperplanes $\mathbf{w}^T\mathbf{x} + w_0 = +1$ and $\mathbf{w}^T\mathbf{x} + w_0 = -1$; Margin $= \frac{1}{\|\mathbf{w}\|}$]
• The Lagrangian function is:
  $J(\mathbf{w}, w_0, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{n=1}^{N} \alpha_n \left[ y_n(\mathbf{w}^T\mathbf{x}_n + w_0) - 1 \right]$
• Setting its derivative with respect to $w_0$ to zero gives the constraint:
  $\sum_{n=1}^{N} \alpha_n y_n = 0$
[Figure: support vectors are the training points lying on the margin hyperplanes $\mathbf{w}^T\mathbf{x} + w_0 = +1$ and $\mathbf{w}^T\mathbf{x} + w_0 = -1$; Margin $= \frac{1}{\|\mathbf{w}\|}$]
[Figure: comparison of the decision boundaries obtained by the perceptron and by the linear SVM]
[Figure: soft-margin SVM with slack variables: points on or beyond their margin hyperplane have $\xi = 0$, points inside the margin have $0 < \xi < 1$, and misclassified points have $\xi > 1$; hyperplanes $\mathbf{w}^T\mathbf{x} + w_0 = +1, 0, -1$; Margin $= \frac{1}{\|\mathbf{w}\|}$]
• subject to the constraints
  $y_n(\mathbf{w}^T\mathbf{x}_n + w_0) \ge 1 - \xi_n, \quad n = 1, \ldots, N$
  $\xi_n \ge 0$
  – where C determines the tradeoff between the margin and the error on the training data (a fitting sketch follows this slide)
• The Lagrangian function is:
  $J(\mathbf{w}, \boldsymbol{\xi}, w_0, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N}\alpha_n\left[y_n(\mathbf{w}^T\mathbf{x}_n + w_0) - 1 + \xi_n\right] - \sum_{n=1}^{N}\gamma_n\xi_n$
  – where $\alpha_n$ and $\gamma_n$ are the Lagrange coefficients
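A short scikit-learn sketch of fitting this soft-margin formulation; the data are synthetic and the value C=1.0 is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Two assumed Gaussian blobs as a stand-in for real training data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], size=(50, 2)),
               rng.normal(loc=[3, 3], size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# Linear soft-margin SVM; C controls the margin / training-error tradeoff.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors per class:", clf.n_support_)
print("w =", clf.coef_, " w0 =", clf.intercept_)
```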
• Minimizing the Lagrangian leads to the dual problem with the constraints:
  $\sum_{n=1}^{N} \alpha_n y_n = 0$ and $0 \le \alpha_n \le C$
[Figure: support vectors in the soft-margin case: points on the margin hyperplanes with $\xi = 0$ and points with $\xi > 0$; hyperplanes $\mathbf{w}^T\mathbf{x} + w_0 = +1, 0, -1$; Margin $= \frac{1}{\|\mathbf{w}\|}$]
[Figure: three cases in the $(x_1, x_2)$ plane: linearly separable classes, nonlinearly separable classes, and overlapping classes]
Illustration of Transformation
[Figure: nonlinearly separable data in the $(x_1, x_2)$ plane]
• $\mathbf{x} = [x_1, x_2]^T$
• $\Phi(\mathbf{x}) = \left[\,x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\,\right]^T$
[Figure: soft-margin SVM in the transformed feature space: hyperplanes $\mathbf{w}^T\Phi(\mathbf{x}) + w_0 = +1, 0, -1$, slack variables $\xi$, Margin $= \frac{1}{\|\mathbf{w}\|}$]
• The Lagrangian function in the transformed feature space is:
  $J(\mathbf{w}, \boldsymbol{\xi}, w_0, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N}\alpha_n\left[y_n(\mathbf{w}^T\Phi(\mathbf{x}_n) + w_0) - 1 + \xi_n\right] - \sum_{n=1}^{N}\gamma_n\xi_n$
  – where $\alpha_n$ and $\gamma_n$ are the Lagrange coefficients
• The resulting dual problem has the constraints:
  $\sum_{n=1}^{N} \alpha_n y_n = 0$ and $0 \le \alpha_n \le C$
Kernel Functions
• Kernel function: $K(\mathbf{x}_m, \mathbf{x}_n) = \Phi(\mathbf{x}_m)^T \Phi(\mathbf{x}_n)$
• Kernel functions must satisfy Mercer's theorem
• The kernel Gram matrix contains the values of the kernel function for all pairs of examples in the training set:
  $\mathbf{K} = \begin{bmatrix} K(\mathbf{x}_1, \mathbf{x}_1) & K(\mathbf{x}_1, \mathbf{x}_2) & \cdots & K(\mathbf{x}_1, \mathbf{x}_N) \\ K(\mathbf{x}_2, \mathbf{x}_1) & K(\mathbf{x}_2, \mathbf{x}_2) & \cdots & K(\mathbf{x}_2, \mathbf{x}_N) \\ \vdots & \vdots & \ddots & \vdots \\ K(\mathbf{x}_N, \mathbf{x}_1) & K(\mathbf{x}_N, \mathbf{x}_2) & \cdots & K(\mathbf{x}_N, \mathbf{x}_N) \end{bmatrix}$
Inner-product Kernels
• Kernel function: $K(\mathbf{x}_m, \mathbf{x}_n) = \Phi(\mathbf{x}_m)^T \Phi(\mathbf{x}_n)$
• Polynomial kernel: $K(\mathbf{x}_m, \mathbf{x}_n) = (\mathbf{x}_m^T\mathbf{x}_n + 1)^2$
• For 2-dimensional patterns $\mathbf{x}_1 = [x_{11}, x_{12}]^T$ and $\mathbf{x}_2 = [x_{21}, x_{22}]^T$ (a numeric check follows below):
  $\Phi(\mathbf{x}_1) = \left[1,\ \sqrt{2}\,x_{11},\ \sqrt{2}\,x_{12},\ x_{11}^2,\ x_{12}^2,\ \sqrt{2}\,x_{11}x_{12}\right]^T$
  $\Phi(\mathbf{x}_2) = \left[1,\ \sqrt{2}\,x_{21},\ \sqrt{2}\,x_{22},\ x_{21}^2,\ x_{22}^2,\ \sqrt{2}\,x_{21}x_{22}\right]^T$
  $(\mathbf{x}_1^T\mathbf{x}_2 + 1)^2 = 1 + 2x_{11}x_{21} + 2x_{12}x_{22} + x_{11}^2 x_{21}^2 + x_{12}^2 x_{22}^2 + 2x_{11}x_{12}x_{21}x_{22} = \Phi(\mathbf{x}_1)^T\Phi(\mathbf{x}_2)$
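A tiny numeric check of this identity, using made-up 2-D vectors:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (x^T z + 1)^2."""
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     x1**2, x2**2, np.sqrt(2)*x1*x2])

a = np.array([1.0, 2.0])   # arbitrary example vectors
b = np.array([0.5, -1.0])

k_direct = (a @ b + 1.0) ** 2          # kernel evaluated in the input space
k_mapped = phi(a) @ phi(b)             # inner product in the feature space
print(k_direct, k_mapped)              # both values agree (0.25 here)
```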
Architecture of an SVM
[Figure: the input example $\mathbf{x}$ is passed through kernel units $K(\mathbf{x}, \mathbf{x}_1), K(\mathbf{x}, \mathbf{x}_2), \ldots, K(\mathbf{x}, \mathbf{x}_{N_s})$, one per support vector; their outputs are weighted by $\alpha_n^* y_n$ and summed with the bias $w_0$ to give the output $g(\mathbf{x})$]
• Discriminant function of the SVM:
  $g(\mathbf{x}) = \sum_{n=1}^{N_s} \alpha_n^* y_n K(\mathbf{x}, \mathbf{x}_n) + w_0$
• If $g(\mathbf{x}) \ge 0$, then x is assigned to class 1 (C1)
• If $g(\mathbf{x}) < 0$, then x is assigned to class 2 (C2), as in the sketch below
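A sketch of evaluating this discriminant from a fitted scikit-learn SVC on synthetic data; `dual_coef_` in scikit-learn already stores the products $\alpha_n^* y_n$ for the support vectors:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], size=(40, 2)), rng.normal([2, 2], size=(40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

x_test = np.array([[1.0, 1.0]])
K = rbf_kernel(x_test, clf.support_vectors_, gamma=0.5)   # K(x, x_n) for each support vector
g = K @ clf.dual_coef_.ravel() + clf.intercept_           # sum_n alpha_n* y_n K(x, x_n) + w0
print(g, clf.decision_function(x_test))                   # the two values should match
```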
One-against-the-rest Approach
[Figure: the input pattern x is given to one SVM per class (SVM1, ..., SVML), producing the discriminant values $g_1(\mathbf{x}), \ldots, g_L(\mathbf{x})$]
One-against-one Approach
• An SVM is built for every pair of classes to form a boundary between their regions
• Let C = 4 be the total number of classes
• The number of pairwise SVMs is L = C(C−1)/2 = 6
• The max-wins strategy is used to determine the class of a test pattern x
  – In this strategy, a majority voting scheme is used
[Figure: the input pattern x is given to the pairwise SVMs (SVM12, SVM13, SVM14, SVM23, SVM24, SVM34), producing $g_{12}(\mathbf{x}), \ldots, g_{34}(\mathbf{x})$; a decision strategy then outputs the class]
[Figure: pairwise decision boundaries (e.g., Class2 vs Class3, Class2 vs Class4, Class3 vs Class4) and the resulting multi-class decision region]
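A brief sketch of both multi-class strategies in scikit-learn on synthetic 3-class data; `SVC` uses the one-against-one scheme internally, while `OneVsRestClassifier` gives one-against-the-rest:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 30)

ovo = SVC(kernel="linear").fit(X, y)                       # one-against-one (pairwise)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # one-against-the-rest

x_test = np.array([[2.0, 1.0]])
print("one-vs-one prediction:  ", ovo.predict(x_test))
print("one-vs-rest prediction: ", ovr.predict(x_test))
```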
[1] http://bengio.abracadoudou.com/SVMTorch.html
[2] http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[3] http://svmlight.joachims.org/
[4] https://www.csie.ntu.edu.tw/~cjlin/liblinear/
[5] http://scikit-learn.org/stable/modules/svm.html#svm
75
Logistic Regression (A Probabilistic Discriminative Model)
Logistic Regression
• Requirement: the discriminant function $f_i(\mathbf{x}, \mathbf{w}_i)$ should give the probability of class $C_i$:
  $E[y_i \mid \mathbf{x}] = P(y_i = C_i \mid \mathbf{x})$
Logistic Regression
• Logit function: log of the odds function
• Odds function: $\dfrac{P(\mathbf{x})}{1 - P(\mathbf{x})}$
  – Probability of success divided by the probability of failure
• Fit a linear model to the logit function:
  $\log\dfrac{P(\mathbf{x})}{1 - P(\mathbf{x})} = w_0 + w_1 x_1 + \ldots + w_d x_d = \mathbf{w}^T\hat{\mathbf{x}}$
  where $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$ and $\hat{\mathbf{x}} = [1, x_1, \ldots, x_d]^T$
Logistic Regression
• For a single input variable x, the resulting probability is the sigmoid function:
  $P(x) = \dfrac{1}{1 + e^{-(w_0 + w_1 x)}}$
[Figure: sigmoid curve of P(x) versus x]
Logistic Regression
• For a d-dimensional input $\mathbf{x} = [x_1, x_2, \ldots, x_d]^T$, exponentiating the linear model for the logit gives:
  $\dfrac{P(\mathbf{x})}{1 - P(\mathbf{x})} = e^{\mathbf{w}^T\hat{\mathbf{x}}}$
• Solving for $P(\mathbf{x})$:
  $P(\mathbf{x}) = \dfrac{e^{\mathbf{w}^T\hat{\mathbf{x}}}}{1 + e^{\mathbf{w}^T\hat{\mathbf{x}}}} = \dfrac{1}{1 + e^{-\mathbf{w}^T\hat{\mathbf{x}}}}$
  where $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$ and $\hat{\mathbf{x}} = [1, x_1, \ldots, x_d]^T$
• What about the classifier learning here?
• For any test example x (see the sketch below):
  – If P(x) ≥ 0.5, then x is assigned the label 1
  – If P(x) < 0.5, then x is assigned the label 0
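A minimal sketch of this decision rule, with an assumed weight vector:

```python
import numpy as np

def predict_proba(x, w):
    """P(x) = 1 / (1 + exp(-w^T x_hat)) with x_hat = [1, x1, ..., xd]."""
    x_hat = np.concatenate(([1.0], x))
    return 1.0 / (1.0 + np.exp(-w @ x_hat))

w = np.array([-0.5, 1.2, -0.7])        # assumed [w0, w1, w2] for illustration
x = np.array([2.0, 1.0])
p = predict_proba(x, w)
label = 1 if p >= 0.5 else 0
print(f"P(x) = {p:.3f} -> label {label}")
```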
Estimation of Parameters in Logistic Regression
• The criterion used to estimate the parameters is different from that of linear regression
• Optimize the likelihood of the data
• As the goal is to model the probability of the class, we maximize the likelihood of the data
• Maximum likelihood (ML) method of parameter estimation
• Given: training data $D = \{\mathbf{x}_n, y_n\}_{n=1}^{N}$, $\mathbf{x}_n \in \mathbb{R}^d$ and $y_n \in \{1, 0\}$
• The data of a class is represented by the parameter vector w = [w0, w1, ..., wd]^T (the parameters of the linear function)
• Unknown: w
• Likelihood of $\mathbf{x}_n$ (a Bernoulli distribution):
  $P(\mathbf{x}_n \mid \mathbf{w}) = P(\mathbf{x}_n)^{y_n}\,(1 - P(\mathbf{x}_n))^{(1 - y_n)}$
• Total data likelihood:
  $P(D \mid \mathbf{w}) = \prod_{n=1}^{N} P(\mathbf{x}_n \mid \mathbf{w}) = \prod_{n=1}^{N} P(\mathbf{x}_n)^{y_n}\,(1 - P(\mathbf{x}_n))^{(1 - y_n)}$
• Total data log-likelihood:
  $l(\mathbf{w}) = \ln P(D \mid \mathbf{w}) = \sum_{n=1}^{N}\left[\,y_n \ln P(\mathbf{x}_n) + (1 - y_n)\ln(1 - P(\mathbf{x}_n))\,\right]$
Estimation of Parameters in Logistic Regression
• Gradient ascent method
• It is an iterative procedure
• We start with an initial value for w
• At each iteration:
  – Estimate the change in w
  – The change in w ($\Delta\mathbf{w}$) is proportional to the slope (gradient) of the likelihood surface:
    $\Delta\mathbf{w} = \eta\,\dfrac{\partial l(\mathbf{w})}{\partial \mathbf{w}}, \qquad \mathbf{w} \leftarrow \mathbf{w} + \Delta\mathbf{w}$
    where $0 < \eta < 1$ is a proportionality constant
  – Then w is updated using $\Delta\mathbf{w}$
• This indicates that we move in the direction of the positive slope of the likelihood surface (a code sketch follows below)
[Figure: likelihood surface l(w) versus weight w, showing a global maximum and local maxima]
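A compact NumPy sketch of this gradient-ascent update, using the standard gradient of the Bernoulli log-likelihood, $\partial l/\partial \mathbf{w} = \sum_n (y_n - P(\mathbf{x}_n))\,\hat{\mathbf{x}}_n$; the data and learning rate are assumed for illustration:

```python
import numpy as np

def fit_logistic_regression(X, y, eta=0.5, n_iters=2000):
    """Maximize the log-likelihood by gradient ascent; labels y are in {0, 1}."""
    Xa = np.column_stack([np.ones(len(X)), X])    # augmented inputs [1, x1, ..., xd]
    w = np.zeros(Xa.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-Xa @ w))         # P(x_n) for every example
        grad = Xa.T @ (y - p) / len(y)            # gradient of l(w), averaged for a stable step
        w += eta * grad                           # move up the likelihood surface
    return w

# Usage with synthetic data: the label depends on whether x1 + x2 exceeds 1.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 1.0).astype(float)
print("learned [w0, w1, w2]:", fit_logistic_regression(X, y))
```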
[Figure: training data (weight in kg versus height in cm) for two classes]
[Figure: the same data with the learned decision boundary $f(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T\hat{\mathbf{x}} = 0$]
[Figure: for one test point $\mathbf{w}^T\hat{\mathbf{x}} = 47.9851$, giving $P(\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T\hat{\mathbf{x}}}} \approx 1$; for another, $\mathbf{w}^T\hat{\mathbf{x}} = 6.7771$, giving $P(\mathbf{x}) \approx 0.9$]
Logistic Regression
• Logistic regression is a linear classifier
• Logistic regression looks simple, but it yields a very powerful classifier
• It is used not just for building classifiers, but also for sensitivity analysis
• Logistic regression can be used to identify how each attribute contributes to the output, i.e., how important each attribute is for predicting the class label
• Perform logistic regression and observe w (see the sketch below)
• The value of each element of w indicates how much the corresponding attribute contributes to the output
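A short scikit-learn sketch of this kind of sensitivity inspection on synthetic data with hypothetical attribute names; standardizing the attributes first makes the coefficient magnitudes comparable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3))                       # three assumed attributes
y = (1.5 * X[:, 0] - 0.2 * X[:, 1] > 0).astype(int) # attribute 0 matters most here

X_std = StandardScaler().fit_transform(X)
clf = LogisticRegression().fit(X_std, y)

for name, coef in zip(["attr_0", "attr_1", "attr_2"], clf.coef_[0]):
    print(f"{name}: w = {coef:+.3f}")               # larger |w| -> larger contribution
```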
Summary
• The SVM is a linear two-class classifier
• An SVM constructs the maximum margin hyperplane (optimal separating hyperplane) as a decision surface to separate the data points of two classes
• Margin of a hyperplane: the minimum distance of the training points from the hyperplane
• The SVM is a classifier based on kernel methods
• Kernel methods involve the following two steps:
  – Nonlinear transformation of the data to a higher-dimensional feature space induced by a Mercer kernel
  – Construction of optimal linear solutions in the kernel feature space
• A kernel function computes the pairwise inner products directly from the original representation of the data
• The SVM performs multi-class pattern classification using either the one-against-the-rest approach or the one-against-one approach
Text Books
1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Third Edition, Morgan Kaufmann Publishers, 2011.
2. S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 2009.
3. C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
4. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall of India, 1999.
5. Satish Kumar, Neural Networks - A Class Room Approach, Second Edition, Tata McGraw-Hill, 2013.
6. S. Haykin, Neural Networks and Learning Machines, Prentice Hall of India, 2010.