


LOGISTIC REGRESSION AND KERNEL LOGISTIC REGRESSION
A comparative study of logistic regression and kernel logistic regression for binary classification

Ezukwoke K.I.¹, Zareian S.J.²
¹,² Department of Computer Science, Machine Learning and Data Mining
{ifeanyi.ezukwoke, samaneh.zareian.jahromi}@etu.univ-st-etienne.fr
Université Jean Monnet, Saint-Étienne, France

Preprint, December 2019. DOI: 10.13140/RG.2.2.28668.28808

Abstract

Logistic regression is a linear binary classification algorithm frequently used for classification problems. In this paper we present its kernel version, which is used for classification of non-linearly separable problems. We briefly introduce the concept of multiple kernel learning and apply it to kernel logistic regression. We elaborate on the performance differences between classical and kernel logistic regression, and between each and its stochastic variant.

Keywords

Classification, logistic regression, kernel logistic regression, multi-kernel learning.

1 INTRODUCTION

Linear regression is a statistical method used for univariate and multivariate analysis. Given a set of observations {x_i, y_i}_{i=1}^n, where {x_i} are the independent variables (feature space) and y_i is the dependent (response) variable, usually continuous or discrete, linear regression models estimate the parameter β that best maps the predictors to the response variable y_i.

    y = X_1 β_1 + X_2 β_2 + · · · + X_N β_N    (1)

Using Ordinary Least Squares (OLS) we can estimate the unknown parameters of the linear regression problem [1]. It does this by minimizing the sum of squared differences between the predictors and the response variable.¹

    min_β ||Xβ − y||²    (2)

We minimize this objective function and derive the following closed-form solution.

    β = (X^T X)^{−1} X^T y    (3)

This returns a model that produces a straight line mapping the predictors to the response [1]. However, linear regression is only sufficient for explaining the relationship in observations with a continuous response. For observations with categorical variables it becomes impossible to adopt this model. Logistic regression solves this limitation of linear regression for categorical variables using maximum likelihood estimation of the log-probability function. This idea is further explained in the next sections. Our focus, however, is on its kernel version and how we explore the inner product of the independent variables to classify non-separable data.

¹ Source code for the project is available on GitHub.
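For illustration, the following is a minimal NumPy sketch of the closed-form solution in equation (3); the synthetic data and variable names are illustrative assumptions, not part of the original implementation.

    import numpy as np

    # Illustrative linear data: y = 2*x1 - 3*x2 + noise
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=100)

    # Closed-form OLS estimate of equation (3): beta = (X^T X)^{-1} X^T y
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_hat)  # approximately [2, -3]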
2 CLASSIFICATION

Classification is a supervised machine learning approach for categorizing data into a distinct number of classes, where we can assign a label to each class. Given a set of data {x^(i), y^(i)}, x is the feature space of dimension m × (n + 1) and y is the classification output, such that y ∈ {0, 1} for binary output or {1, 2, ..., n} for multiclass output. Classification algorithms are most used for spam detection, voice and image recognition, sentiment analysis, fraud detection and many more.

Logistic regression is a linear binary classification algorithm that maps a set of predictors to their corresponding categorical response variables. The algorithm is capable of classifying linearly separable datasets. However, linear logistic regression is not able to accurately classify non-linear data, therefore we use kernel logistic regression for non-linearly separable data classification.

Kernel logistic regression is similar to support vector machines in its operational output [4]. Existing papers already implement kernel logistic regression using the Newton-Raphson method [3], Sequential Minimal Optimization (SMO) [4] and the truncated Newton method [5].

In this paper, however, we solve logistic regression and kernel logistic regression using gradient descent (GD) and stochastic gradient descent (SGD) optimization techniques.

2.1 LOGISTIC REGRESSION

Logistic regression is a discriminative model since it focuses only on the posterior probability of each class Pr(Y|x; β). It is also a generalized linear model, mapping the output of linear multiple regression to the posterior probability of each class, Pr(Y|x; β) ∈ {0, 1} [2]. The probability of a data sample belonging to class 1 is given by:

    Pr(Y = 1|X = x; β) = σ(z), where z = β^T x    (4)

    P(Y = 1|X = x; β) = σ(β^T x)    (5)

where

    Pr(Y = 1|X = x; β) + Pr(Y = 0|X = x; β) = 1    (6)

    Pr(Y = 0|X = x; β) = 1 − Pr(Y = 1|X = x; β)    (7)

Hence, the probability of a data sample belonging to class 0 is given by:

    Pr(Y = 0|X = x; β) = 1 − σ(z)    (8)

σ(z) is called the logistic sigmoid function and is given by

    σ(z) = 1 / (1 + exp(−z))    (9)

The uniqueness of this function is that it maps all real numbers R into the range (0, 1). Again, we know

    log(odds(Pr(Y = 1|X = x; β))) = Pr(Y = 1|X = x; β) / Pr(Y = 0|X = x; β)
                                  = Pr(Y = 1|X = x; β) / (1 − Pr(Y = 1|X = x; β))

Assuming P(Y = 1|X = x; β) = p(x), the next most obvious idea is to let log p(x) be a linear function of x, so that changing an input variable multiplies the probability by a fixed amount. This is done by taking a log transformation of p(x). Formally, logit(p(x)) = β_0 + β^T x, making

    logit(p(x)) = log( p(x) / (1 − p(x)) ) = β_0 + β^T x    (10)
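To make equations (4) to (10) concrete, the following short sketch computes the sigmoid and the implied class probabilities; the helper names are our own, not the paper's.

    import numpy as np

    def sigmoid(z):
        # Logistic sigmoid of equation (9); maps R into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def prob_class1(X, beta, beta0=0.0):
        # Pr(Y = 1 | X = x; beta) = sigma(beta0 + x . beta), equations (5) and (10)
        return sigmoid(beta0 + X @ beta)

    X = np.array([[0.5, -1.2], [2.0, 0.3]])
    beta = np.array([1.0, -0.5])
    p1 = prob_class1(X, beta)
    print(p1)        # class-1 probabilities
    print(1.0 - p1)  # class-0 probabilities, equation (7)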
Simplifying for p(x) and 1 − p(x) we have

    p(x) / (1 − p(x)) = exp(β_0 + β^T x)    (11)

    p(x) = (1 − p(x)) exp(β_0 + β^T x)    (12)

    p(x) = exp(β_0 + β^T x) − p(x) · exp(β_0 + β^T x)    (13)

    p(x) + p(x) · exp(β_0 + β^T x) = exp(β_0 + β^T x)    (14)

    p(x)(1 + exp(β_0 + β^T x)) = exp(β_0 + β^T x)    (15)

    p(x) = exp(β_0 + β^T x) / (1 + exp(β_0 + β^T x)) = 1 / (1 + exp(−(β_0 + β^T x)))

    1 − p(x) = 1 / (1 + exp(β_0 + β^T x))    (16)

2.2 Learning Logistic regression

We assume that P(Y = 1|X = x; β) = P(x; β) for some probability function P(x; β) parameterized by β; the conditional likelihood function is then given by a Bernoulli sequence:

    Π_{i=1}^n Pr(Y = y_i|X = x_i; β) = Π_{i=1}^n p(x_i; β)^{y_i} (1 − p(x_i; β))^{(1−y_i)}

The probability of a class is p if y_i = 1, or 1 − p if y_i = 0. The likelihood is then

    L(β_0, β) = Π_{i=1}^n p(x_i)^{y_i} (1 − p(x_i))^{(1−y_i)}    (17)

Taking the log of this likelihood we have

    l(β_0, β) = Σ_{i=1}^n y_i log p(x_i) + (1 − y_i) log(1 − p(x_i))
              = Σ_{i=1}^n y_i log p(x_i) − y_i log(1 − p(x_i)) + log(1 − p(x_i))
              = Σ_{i=1}^n log(1 − p(x_i)) + y_i log( p(x_i) / (1 − p(x_i)) )    (18)

We replace log( p(x) / (1 − p(x)) ) with β_0 + x · β, as seen in equation (10), and (1 − p(x)) with 1 / (1 + exp(β_0 + x · β)). Hence,

    l(β_0, β) = Σ_{i=1}^n log( 1 / (1 + exp(β_0 + x · β)) ) + y(β_0 + x · β)
              = Σ_{i=1}^n −log(1 + exp(β_0 + x · β)) + y(β_0 + x · β)    (19)

    ∇l(β_0, β) = −Σ_{i=1}^n [ exp(β_0 + x · β) / (1 + exp(β_0 + x · β)) ] x_ij + Σ_{i=1}^n y_i x_ij

    ∇_β l = Σ_{i=1}^n (y_i − p(x_i; β_0, β)) x_ij    (20)

Since this is a transcendental equation with no closed-form solution, we apply the gradient descent optimization algorithm, taking the first-order derivative of the objective function, ∇_β l = Σ_{i=1}^n (y_i − p(x_i; β_0, β)) x_ij, as the update direction.

Algorithm 1: Logistic regression via Gradient Descent (GD)
    Input:  x ∈ X, y ∈ Y
    Output: β
    begin
        β_j ← β_0
        while not converged do
            β_{j+1} = β_j − α ∇_β l
        end
    end
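A minimal NumPy sketch of Algorithm 1, using the gradient of equation (20) with a fixed learning rate; the convergence test and the scaling by the sample size are our own choices, not prescribed by the paper.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_gd(X, y, alpha=0.1, max_iter=1000, tol=1e-6):
        # Gradient ascent on the log-likelihood l(beta0, beta) of equation (19),
        # using the gradient sum_i (y_i - p_i) x_i of equation (20).
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # column of ones for beta_0
        beta = np.zeros(Xb.shape[1])
        for _ in range(max_iter):
            p = sigmoid(Xb @ beta)
            grad = Xb.T @ (y - p)
            beta_new = beta + alpha * grad / len(y)
            if np.linalg.norm(beta_new - beta) < tol:  # simple convergence test
                return beta_new
            beta = beta_new
        return beta

Predictions then follow by thresholding sigmoid(Xb @ beta) at 0.5.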
We also solve it using a stochastic approach.

Algorithm 2: Logistic regression via Stochastic Gradient Descent (SGD)
    Input:  x ∈ X, y ∈ Y
    Output: β
    begin
        β_j ← β_0
        while not converged do
            for i ∈ randshuffle({1, . . . , N}) do
                for k ∈ {1, . . . , i} do
                    β_{j+1} = β_j − α ∇_β l_k
                end
            end
        end
    end

2.3 KERNEL LOGISTIC REGRESSION

Classical logistic regression will fail to accurately classify non-linearly separable data; therefore, we prefer to use its kernel version. It also has a direct probabilistic interpretation that makes it suited for Bayesian design [4].
The vector space can be expressed as a linear combination of the input vectors, such that

    β = Σ_{i=1}^N α_i φ(x_i)    (21)

where α ∈ R^{n×1} is the dual variable. The function φ(x_i) maps the data points from a lower dimension to a higher dimension.

    φ : x ∈ R^D → φ(x) ∈ F ⊂ R^{D′}    (22)

Let κ(x_i, x) be the kernel function resulting from the inner product of φ(x_i) and φ(x_j), such that

    κ(x_i, x) = ⟨φ(x_i), φ(x_j)⟩    (23)

From the representer theorem we know that

    F = β^T φ(x) = α ⟨φ(x_i), φ(x_j)⟩ = α κ(x_i, x_j)

We can now express p(x; β) in the subspace of the input vectors only, such that

    p(φ; α) = 1 / (1 + e^{−α_i κ(x_i, x_j)})    (24)

and

    1 − p(φ; α) = 1 / (1 + e^{α_i κ(x_i, x_j)})    (25)

The logit function is mapped into the kernel space as

    logit( p(φ; α) / (1 − p(φ; α)) ) = α κ(x_i, x)    (26)

Deriving the equation of kernel logistic regression requires the regularized logistic regression, precisely an l2-norm penalty on the log-likelihood. This is in comparison to the SVM objective function used in [3].

    L_α = Σ_{i=1}^n y_i log p(x_i) + (1 − y_i) log(1 − p(x_i)) − (λ/2) α^T κ(x_i, x) α
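The dual parameterization of equations (21) to (26) can be sketched as follows, using an RBF Gram matrix as an example kernel; the helper names and parameter values are assumptions made for illustration only.

    import numpy as np

    def rbf_kernel(X1, X2, gamma=1.0):
        # Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2); see section 2.5
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def prob_class1_dual(K, alpha):
        # Kernelized probability of equation (24): p(phi; alpha) = sigma(K @ alpha)
        return 1.0 / (1.0 + np.exp(-(K @ alpha)))

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(5, 2))
    alpha = rng.normal(size=5)
    K = rbf_kernel(X_train, X_train)
    print(prob_class1_dual(K, alpha))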
2.4 Learning kernel logistic regression

As mentioned earlier, methods for finding the maximum likelihood estimate include gradient descent (GD) and the iteratively re-weighted least squares (IRLS) method, the latter being based on the Newton-Raphson algorithm. Here we employ gradient descent and its stochastic variant.

2.4.1 Optimization problem

    L_α = Σ_{i=1}^n y_i log p(x_i) + (1 − y_i) log(1 − p(x_i)) − (λ/2) α^T κ(x_i, x) α

We can expand the objective function as follows:

    L_α = y log( p / (1 − p) ) + log(1 − p(x_i)) − (λ/2) α^T κ(x_i, x) α
        = y log( p / (1 − p) ) + log( 1 / (1 + e^{α κ(x_i, x)}) ) − (λ/2) α^T κ(x_i, x) α
        = y α κ(x_i, x) − log(1 + e^{α κ(x_i, x)}) − (λ/2) α^T κ(x_i, x) α

The first-order derivative of the log-likelihood is

    ∇_α L = y κ(x_i, x) − [ κ(x_i, x) e^{α κ(x_i, x)} / (1 + e^{α κ(x_i, x)}) ] − λ α κ(x_i, x)
          = y κ(x_i, x) − p κ(x_i, x) − λ α κ(x_i, x)

    ∇_α L = κ(x_i, x)(y − p) − λ α κ(x_i, x)

We apply the gradient descent and stochastic gradient descent algorithms.

Algorithm 3: Kernel logistic regression using Gradient Descent
    Input:  κ, y, α
    Output: α
    begin
        α_j ← α_0
        while not converged do
            α_{j+1} = α_j − lr ∇_α L
        end
    end

Algorithm 4: Kernel logistic regression using Stochastic Gradient Descent
    Input:  κ, y, α_j
    Output: α
    begin
        α_j ← α_0
        while not converged do
            for i ∈ randshuffle({1, . . . , N}) do
                for k ∈ {1, . . . , i} do
                    α_{j+1} = α_j − lr ∇_α L_k
                end
            end
        end
    end

In Algorithms 3 and 4, lr is the learning rate.

2.4.2 Prediction

Still using the representer theorem, we compute the posterior probability of a new data point:

    y = sign( 1 / (1 + exp(−α κ(x_i, x))) )    (27)

Here, the prediction depends only on α and the kernel.
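A sketch of Algorithm 3 and the prediction rule of equation (27) in NumPy, using the gradient ∇_α L = κ(y − p) − λακ derived above; the RBF kernel choice, step scaling and function names are our own assumptions, not the released code.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def rbf_kernel(X1, X2, gamma=1.0):
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def fit_klr_gd(X, y, lam=1e-5, lr=0.01, gamma=1.0, max_iter=500):
        # Gradient steps on the penalized log-likelihood L_alpha of section 2.4.1;
        # gradient in matrix form: K @ (y - p) - lam * K @ alpha
        K = rbf_kernel(X, X, gamma)
        alpha = np.zeros(X.shape[0])
        for _ in range(max_iter):
            p = sigmoid(K @ alpha)
            grad = K @ (y - p) - lam * (K @ alpha)
            alpha += lr * grad / len(y)  # ascent on L_alpha (descent on -L_alpha)
        return alpha

    def predict_klr(X_train, X_new, alpha, gamma=1.0):
        # Posterior for new points via the representer theorem, equation (27)
        K_new = rbf_kernel(X_new, X_train, gamma)
        return (sigmoid(K_new @ alpha) >= 0.5).astype(int)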
2.5 Kernels

We introduce the commonly used kernels and give a brief overview of the multiple kernels used.

• Linear kernel

    κ(x_i, x_j) = x_i x_j^T    (28)

• Polynomial kernel

    κ(x_i, x_j) = (x_i x_j^T + c)^d    (29)

  where c ≥ 0 and d is the degree of the polynomial, usually greater than 2.

• RBF (Radial Basis Function) kernel
  Sometimes referred to as the Gaussian kernel.

    κ(x_i, x_j) = exp(−γ ||x_i − x_j||²)    (30)

  where γ = 1/(2σ²).

• Sigmoid kernel

    κ(x_i, x_j) = tanh(γ x_i x_j^T + c)    (31)

  where c ≥ 0 and γ = 1/(2σ²).

• Laplace kernel

    κ(x_i, x_j) = exp(−γ ||x_i − x_j||)    (32)

  where γ = 1/(2σ²).

2.6 Multi-kernel

The reason behind the use of multiple kernels is similar to the notion of multi-classification, where cross-validation is used to select the best performing classifier [6]. By using multiple kernels, we hope to learn different similarities in the kernel space that are not easily observed when using a single kernel.
We can prove from Mercer's theorem that a kernel is Positive Semi-Definite (PSD) if u^T κ(x_i, x_j) u ≥ 0. Hence, by performing arithmetic or any mathematical operation on two or more kernel matrices, we obtain a new kernel capable of exploiting different properties or similarities of the training data.
Given a kernel κ, we prove that κ is PD if

    ⟨u, κu⟩ ≥ 0    (33)

Proposition: A symmetric function κ : χ → R is positive semi-definite if and only if ⟨u, κu⟩ ≥ 0.

Proof: Suppose that κ is a kernel which is the inner product of the mapping functions, ⟨φ(x_i), φ(x_j)⟩. κ is a kernel if its inner products are positive and the solution of κu = λu gives non-negative eigenvalues. So that,

    ⟨u, κu⟩ = Σ_{i=1}^N u_i · (κu)_i    (34)

            = Σ_{i=1}^N Σ_{j=1}^N u_i ⟨φ(x_i), φ(x_j)⟩_H u_j    (35)

where H represents the Hilbert space into which we project the kernel [7].

            = ⟨ Σ_{i=1}^N u_i φ(x_i), Σ_{j=1}^N u_j φ(x_j) ⟩_H    (36)

    ⟨u, κu⟩ = || Σ_{i=1}^N u_i φ(x_i) ||²_H ≥ 0    (37)

Therefore κ is positive definite.
Using this property of the kernel κ, we introduce the multiple kernel combinations as follows.

• LinearRBF
  Here we combine two kernels, precisely the Linear and RBF kernels, using their inner product.

    K̂_linrbf = κ(x_i, x_j) × κ(x_i, x_l)    (38)

• RBFPoly
  Here we combine the RBF and Polynomial kernels using their inner product.

    K̂_rbfpoly = κ(x_i, x_j) × κ(x_i, x_l)    (39)

• EtaKernel
  The EtaKernel is a composite combination of LinearRBF, RBFPoly and RBFCosine, and it is given by

    K̂_etarbf = K̂_linrbf × K̂_rbfpoly + K̂_rbfpoly × K̂_rbfcosine
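A hedged sketch of the kernels of section 2.5 and one combined kernel of section 2.6; the element-wise product of Gram matrices is our reading of equations (38) and (39), and such a product is again PSD by the Schur product theorem.

    import numpy as np

    def linear_kernel(X1, X2):
        return X1 @ X2.T                                    # equation (28)

    def poly_kernel(X1, X2, c=1.0, d=3):
        return (X1 @ X2.T + c) ** d                         # equation (29)

    def rbf_kernel(X1, X2, gamma=1.0):
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)                          # equation (30)

    def sigmoid_kernel(X1, X2, gamma=1.0, c=1.0):
        return np.tanh(gamma * (X1 @ X2.T) + c)             # equation (31)

    def laplace_kernel(X1, X2, gamma=1.0):
        d = np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1))
        return np.exp(-gamma * d)                           # equation (32)

    def linrbf_kernel(X1, X2, gamma=1.0):
        # Combined kernel: element-wise product of two Gram matrices,
        # our reading of equation (38)
        return linear_kernel(X1, X2) * rbf_kernel(X1, X2, gamma)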
3 EXPERIMENT

3.1 Dataset

We perform a comparison between classical logistic regression and its kernel version using benchmark datasets including moons, blobs, circles and classification. Given N, the total number of samples: Circle (N = 1000), Moon (N = 1000) and Classification (N = 1000); we split each dataset into training and test samples of 70%-30%.

3.1.1 Data description

Each dataset contains a binary class (exactly 2 groups of data), and each data sample contains exactly two feature values.

3.2 Logistic and kernel logistic regression result (Non-stochastic)

We begin by passing all data into one pipeline and run this procedure N number of times. We do this because of the non-deterministic result we get from the random initialization of β, and of α for the stochastic version. Using the configuration learning rate = 10, γ = 1 and λ = 0.00001, the algorithm returns the result in figure 1.

3.3 Logistic and kernel logistic regression result (Stochastic)

Figure 2 below shows the result of the stochastic version for logistic regression and its kernel versions. After N runs using the configuration learning rate = 10, γ = 1 and λ = 0.00001, the algorithm returns the result in figure 2.

Figure 1: Non-stochastic logistic and kernel logistic regression. 3 iterations for all except blob data (10 iterations), 0.01 learning rate.

Figure 2: Stochastic logistic and kernel logistic regression. 3 iterations for all except blob data (50 iterations), 0.01 learning rate.
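The benchmark datasets and the 70%-30% split described in section 3.1 can be reproduced along the following lines, assuming scikit-learn's dataset generators; the exact generator arguments are our guesses, not the paper's configuration.

    from sklearn.datasets import make_moons, make_circles, make_blobs, make_classification
    from sklearn.model_selection import train_test_split

    N = 1000
    datasets = {
        "moons":          make_moons(n_samples=N, noise=0.2, random_state=0),
        "circles":        make_circles(n_samples=N, noise=0.1, factor=0.5, random_state=0),
        "blobs":          make_blobs(n_samples=N, centers=2, n_features=2, random_state=0),
        "classification": make_classification(n_samples=N, n_features=2, n_redundant=0,
                                               n_informative=2, random_state=0),
    }

    splits = {}
    for name, (X, y) in datasets.items():
        # 70% training / 30% test split as in section 3.1
        splits[name] = train_test_split(X, y, test_size=0.3, random_state=0)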
3.4 Performance Analysis

We compare the performance of logistic regression and its kernel versions. We also show that although stochastic logistic regression gives us a more stable result compared to non-stochastic logistic regression, it takes a considerable amount of time to compute and is hence slower than the non-stochastic version.

3.4.1 Evaluation metric

We use the F1-score as our evaluation metric to compare the performance of classical logistic regression and its kernel versions. The F1-score is the harmonic mean of precision and recall and is given by

    F1-score = (2 × precision × recall) / (precision + recall)    (40)

where precision and recall are given by

    precision = TP / (TP + FP)    (41)

    recall = TP / (TP + FN)    (42)

TP: True Positives, TN: True Negatives, FP: False Positives, FN: False Negatives.
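For completeness, equations (40) to (42) computed with scikit-learn (a usage sketch; the label arrays are placeholders).

    from sklearn.metrics import f1_score, precision_score, recall_score

    y_true = [0, 1, 1, 0, 1, 1]          # placeholder labels
    y_pred = [0, 1, 0, 0, 1, 1]          # placeholder predictions

    p = precision_score(y_true, y_pred)  # TP / (TP + FP), equation (41)
    r = recall_score(y_true, y_pred)     # TP / (TP + FN), equation (42)
    f1 = f1_score(y_true, y_pred)        # harmonic mean of p and r, equation (40)
    print(p, r, f1, 2 * p * r / (p + r))  # the last two values agree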

Non-Stochastic F1-Score (%)

    dataset \ kernel   linear   rbf    poly   sigmoid   laplace   rbfpoly   linrbf   etakernel
    Moons                81      72     68      72        63        67        69        66
    Blobs                98      41      0      66        68         1        97         0
    Circle               50      93      0      69         1         0         0         0
    Classification       88       0     79      64         0         5         0         4

Non-Stochastic Running Time (secs)

    dataset \ kernel   linear   rbf    poly   sigmoid   laplace   rbfpoly   linrbf   etakernel
    Moons               0.003   0.02   0.04    0.006     0.02      0.6       0.02      0.37
    Blobs               0.003   0.02   0.04    0.007     0.03      0.05      0.03      0.12
    Circle              0.003   0.02   0.04    0.05      0.02      0.06      0.02      0.38
    Classification      0.003   0.08   0.04    0.005     0.12      0.17      0.08      0.81

Stochastic F1-Score (%)

    dataset \ kernel   linear   rbf    poly   sigmoid   laplace   rbfpoly   linrbf   etakernel
    Moons                86      23     68      69        87        67        69        66
    Blobs                98      75      0      66        95        88        99        84
    Circle               48      49      0      17        69         0         0         0
    Classification       89       0     78      17         0         0         0         0

Stochastic Running Time (secs)

    dataset \ kernel   linear   rbf    poly   sigmoid   laplace   rbfpoly   linrbf   etakernel
    Moons               0.62    0.05   0.07    0.02      0.09      0.21      0.08      0.27
    Blobs               0.60    0.37   0.06    0.05      0.38      0.06      0.05      0.08
    Circle              0.61    0.08   0.250   0.01      0.09      0.29      0.07      0.35
    Classification      0.63    0.25   0.04    0.01      0.28      0.13      0.15      0.23

We observe from the F1-score tables that linear logistic regression is suitable for almost all datasets except the circle data. This is because the circle data is not linearly separable; rbf kernel logistic regression is the most suitable for classifying it, with a score of 93%.
In terms of running time, it is obvious that classical (linear) logistic regression is the fastest in computation compared to its kernel versions. This is due to the time taken in computing the kernel matrix, O(m × n)^d, where d is the degree (used for rbf and its variants with d = 2, and the polynomial kernel with d ≥ 2). The sigmoid kernel still has the fastest running time of all kernels.
However, we have considered different data types, and the performance of the algorithm can be better evaluated when each dataset is considered individually with different configurations of learning rate, γ and polynomial degree.

3.5 Convergence rate

We compare the convergence rate of logistic regression with that of its kernel and stochastic (kernel) versions. We observe that the speed of convergence of stochastic logistic regression is faster than that of its gradient descent version; in other words, the stochastic gradient version of LR reaches the optimum solution faster than its gradient descent version. We can make the same argument for kernel logistic regression (stochastic) versus its non-stochastic version: the stochastic version converges faster towards zero than the gradient descent version.

4 CONCLUSION

We demonstrate the use of logistic regression, kernel logistic regression and the stochastic versions of logistic and kernel logistic regression. We conclude that kernel logistic regression is the best performing algorithm for classifying non-linearly separable data. Its classical version, however, has a faster computational time but only serves best for linear binary classification. One of the advantages of using the stochastic gradient descent version over the non-stochastic version is that it does not require a large number of iterations to converge.
We introduced the notion of multiple kernel learning and saw that it can also outperform classical logistic regression on the F1-score evaluation metric. Stochastic logistic and kernel logistic regression both behave similarly to their non-stochastic versions but can be much more stable than their non-stochastic counterparts. We noted that the convergence rate of stochastic logistic and kernel logistic regression is faster than that of the non-stochastic versions.
References

[1] Goldberger, Arthur S. Classical Linear Regression. Econometric Theory. John Wiley & Sons, pp. 158, ISBN 0-471-31101-4, 1964.

[2] Scott Menard. Applied Logistic Regression Analysis. SAGE Publications, ISBN 0-7619-2208-3, 2001.

[3] Zhu, J., Hastie, T. Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14:185-205, 2005.

[4] Keerthi, S., Duan, K., Shevade, S., and Poo, A. A Fast Dual Algorithm for Kernel Logistic Regression. International Conference on Machine Learning, 19, 2002.

[5] Maher Maalouf, Theodore B. Trafalis, Indra Adrianto. Kernel logistic regression using truncated Newton method. Computational Management Science, 8:415-428, 2009.

[6] Mehmet Gönen, Ethem Alpaydın. Multiple Kernel Learning Algorithms. Journal of Machine Learning Research, 12:2211-2268, 2011.

[7] John Shawe-Taylor, Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, ISBN 9780511809682, pp. 47-83, 2011.

