Logistic Regression
The model parameters are estimable using maximum likelihood estimation of the log probability function. This idea is explained further in the next sections. Our focus, however, is on the kernel version of logistic regression and on how we exploit the inner product of the independent variables to classify non-separable data.

Logistic regression models the posterior probability of each class, Pr(Y|x; β). It is also a generalized linear model, mapping the output of linear multiple regression to the posterior probability Pr(Y|x; β) ∈ [0, 1] of each class Y ∈ {0, 1} [2]. The probability of a data sample belonging to class 1 is given by the logistic (sigmoid) function of the linear predictor, as derived below.
Simplifying for p(x) and 1 − p(x), we have

$$\frac{p(x)}{1 - p(x)} = \exp(\beta_0 + \beta^T x) \qquad (11)$$

$$p(x) = (1 - p(x))\exp(\beta_0 + \beta^T x) \qquad (12)$$

$$p(x) = \exp(\beta_0 + \beta^T x) - p(x)\exp(\beta_0 + \beta^T x) \qquad (13)$$

$$p(x) + p(x)\exp(\beta_0 + \beta^T x) = \exp(\beta_0 + \beta^T x) \qquad (14)$$

$$p(x)\left(1 + \exp(\beta_0 + \beta^T x)\right) = \exp(\beta_0 + \beta^T x) \qquad (15)$$

$$p(x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)} = \frac{1}{1 + \exp(-(\beta_0 + \beta^T x))}$$

$$1 - p(x) = \frac{1}{1 + \exp(\beta_0 + \beta^T x)} \qquad (16)$$

2.2 Learning Logistic regression

We assume that P(Y = 1|X = x; β) = p(x; β), for some probability function p. The log-likelihood of n independent observations is

$$l(\beta_0, \beta) = \sum_{i=1}^{n} y_i \log p(x_i) + (1 - y_i)\log(1 - p(x_i)) \qquad (17)$$

$$= \sum_{i=1}^{n} \log(1 - p(x_i)) + y_i \log\frac{p(x_i)}{1 - p(x_i)} \qquad (18)$$

We replace $\log\frac{p(x)}{1 - p(x)}$ with $\beta_0 + x\cdot\beta$, as seen in equation (8), and $(1 - p(x))$ with $\frac{1}{1 + \exp(\beta_0 + x\cdot\beta)}$. Hence,

$$l(\beta_0, \beta) = \sum_{i=1}^{n} \log\frac{1}{1 + \exp(\beta_0 + x_i\cdot\beta)} + y_i(\beta_0 + x_i\cdot\beta) = \sum_{i=1}^{n} -\log(1 + \exp(\beta_0 + x_i\cdot\beta)) + y_i(\beta_0 + x_i\cdot\beta) \qquad (19)$$

The first-order derivative with respect to $\beta_j$ is

$$\nabla_{\beta_j} l(\beta_0, \beta) = -\sum_{i=1}^{n} \frac{\exp(\beta_0 + x_i\cdot\beta)}{1 + \exp(\beta_0 + x_i\cdot\beta)}\, x_{ij} + \sum_{i=1}^{n} y_i x_{ij}$$
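To make the estimation step concrete, the following is a minimal NumPy sketch of the quantities above: the posterior from equations (15)-(16), the log-likelihood (19), and its gradient, together with one full-batch ascent step. The function names, the vectorized form, and the step size are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    # p(x) = 1 / (1 + exp(-(beta0 + beta^T x))), cf. equations (15)-(16)
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta0, beta, X, y):
    # l(beta0, beta) = sum_i [ -log(1 + exp(beta0 + x_i.beta)) + y_i (beta0 + x_i.beta) ], cf. (19)
    z = beta0 + X @ beta
    return np.sum(-np.log1p(np.exp(z)) + y * z)

def gradient(beta0, beta, X, y):
    # dl/dbeta_j = sum_i (y_i - p(x_i)) x_ij ;  dl/dbeta0 = sum_i (y_i - p(x_i))
    p = sigmoid(beta0 + X @ beta)
    return np.sum(y - p), X.T @ (y - p)

def ascent_step(beta0, beta, X, y, eta=0.1):
    # One full-batch gradient-ascent step on l; the step size eta is an arbitrary choice
    g0, g = gradient(beta0, beta, X, y)
    return beta0 + eta * g0, beta + eta * g
```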
We also solve it using a stochastic approach.

Algorithm 2: Logistic regression via Stochastic Gradient Descent (SGD)
Input: x ∈ X with labels y ∈ Y
Output: β
1  begin
2    β_j ← β_0;
3    while not converged do
4      for i ∈ randshuffle({1, . . . , N}) do
5        for k ∈ {1, . . . , i} do
6          β_{j+1} = β_j − α ∇_β l_k;
7        end
8      end
9    end
10 end

2.3 Kernel Logistic Regression

We can now express p(x; β) in a subspace of the input vectors only, such that

$$p(\phi; \alpha) = \frac{1}{1 + e^{-\alpha_i \kappa(x_i, x_j)}} \qquad (24)$$

and

$$1 - p(\phi; \alpha) = \frac{1}{1 + e^{\alpha_i \kappa(x_i, x_j)}} \qquad (25)$$

The logit function is mapped into the kernel space as

$$\log\frac{p(\phi; \alpha)}{1 - p(\phi; \alpha)} = \alpha\,\kappa(x_i, x) \qquad (26)$$

Deriving the equation of kernel logistic regression requires the regularized logistic regression, precisely the log-likelihood with an ℓ2-norm penalty. This is in comparison to the SVM objective function used in [3].
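Reading α_i κ(x_i, x_j) in equation (24) as a sum over the training points (the representer-theorem expansion), the kernelized posterior can be evaluated through a Gram matrix. A minimal NumPy sketch, assuming an RBF kernel purely for illustration:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # kappa(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2); one possible kernel choice
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kernel_posterior(alpha, X_train, X_query, gamma=1.0):
    # p(phi; alpha) = 1 / (1 + exp(-sum_i alpha_i kappa(x_i, x))), cf. equation (24)
    K = rbf_kernel(X_query, X_train, gamma)   # shape (n_query, n_train)
    return 1.0 / (1.0 + np.exp(-K @ alpha))
```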
We can expand the objective function as follows:

$$L_\alpha = y\,\log\frac{p}{1-p} + \log(1 - p(x_i)) - \frac{\lambda}{2}\,\alpha^T \kappa(x_i, x)\,\alpha$$

$$= y\,\log\frac{p}{1-p} + \log\frac{1}{1 + e^{\alpha\kappa(x_i, x)}} - \frac{\lambda}{2}\,\alpha^T \kappa(x_i, x)\,\alpha$$

$$= y\,\alpha\kappa(x_i, x) - \log\left(1 + e^{\alpha\kappa(x_i, x)}\right) - \frac{\lambda}{2}\,\alpha^T \kappa(x_i, x)\,\alpha$$

The first-order derivative of the log-likelihood is

$$\nabla_\alpha L = y\,\kappa(x_i, x) - \frac{\kappa(x_i, x)\,e^{\alpha\kappa(x_i, x)}}{1 + e^{\alpha\kappa(x_i, x)}} - \lambda\alpha\kappa(x_i, x)$$

$$= y\,\kappa(x_i, x) - p\,\kappa(x_i, x) - \lambda\alpha\kappa(x_i, x)$$

$$\nabla_\alpha L = \kappa(x_i, x)(y - p) - \lambda\alpha\kappa(x_i, x)$$

We apply both the gradient descent and the stochastic gradient descent algorithms.

Algorithm 3: Kernel logistic regression using Gradient descent
Input: κ, y, α
Output: α
1  begin
2    α_j ← α_0;
3    while not converged do
4      α_{j+1} = α_j − lr ∇_α L;
5    end
6  end

In line 4 of Algorithm 3, lr is the learning rate.

Algorithm 4: Kernel logistic regression using Stochastic Gradient descent
Input: κ, y, α_j
Output: α
1  begin
2    α_j ← α_0;
3    while not converged do
4      for i ∈ randshuffle({1, . . . , N}) do
5        for k ∈ {1, . . . , i} do
6          α_{j+1} = α_j − lr ∇_α L_k;
7        end
8      end
9    end
10 end

2.4.2 Prediction

Still using the representer theorem, we compute the posterior probability of a new data point as

$$y = \mathrm{sign}\left(\frac{1}{1 + \exp(-\alpha\kappa(x_i, x))}\right) \qquad (27)$$

Here, the prediction depends only on α and the kernel.
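A compact NumPy sketch of Algorithm 3 together with the prediction rule (27), written in Gram-matrix form with the gradient ∇_α L = κ(y − p) − λκα derived above. The RBF kernel choice, the fixed iteration budget in place of a convergence test, and all parameter values are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # kappa(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), cf. equation (30)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def fit_klr_gd(X, y, lam=1e-5, lr=0.1, n_iter=500, gamma=1.0):
    # Sketch of Algorithm 3: gradient steps on the regularized objective L_alpha.
    # We ascend L_alpha (to be maximized); Algorithm 3's descent step is the same
    # update applied to -L_alpha. A fixed budget replaces the convergence test.
    K = rbf_kernel(X, X, gamma)                 # N x N Gram matrix
    alpha = np.zeros(X.shape[0])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-K @ alpha))    # posterior, cf. (24)
        grad = K @ (y - p) - lam * (K @ alpha)  # grad_alpha L = kappa (y - p) - lambda kappa alpha
        alpha += lr * grad
    return alpha

def predict_klr(alpha, X_train, X_new, gamma=1.0):
    # Posterior of a new point, thresholded at 0.5 (equivalently the sign of K @ alpha), cf. (27)
    K_new = rbf_kernel(X_new, X_train, gamma)
    p = 1.0 / (1.0 + np.exp(-K_new @ alpha))
    return (p >= 0.5).astype(int)
```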
2.5 Kernels

We introduce the commonly used kernels and give a brief overview of the multiple kernels used.

• Linear kernel

$$\kappa(x_i, x_j) = x_i x_j^T \qquad (28)$$

• Polynomial kernel

$$\kappa(x_i, x_j) = (x_i x_j^T + c)^d \qquad (29)$$

where c ≥ 0 and d is the degree of the polynomial, usually greater than 2.

• RBF (Radial Basis Function) kernel

Sometimes referred to as the Gaussian kernel.

$$\kappa(x_i, x_j) = \exp(-\gamma\,||x_i - x_j||^2) \qquad (30)$$

where γ = 1/(2σ²).

• Sigmoid kernel

$$\kappa(x_i, x_j) = \tanh(\gamma\, x_i x_j^T + c) \qquad (31)$$

where c ≥ 0 and γ = 1/(2σ²).

• Laplace kernel

$$\kappa(x_i, x_j) = \exp(-\gamma\,||x_i - x_j||) \qquad (32)$$

where γ = 1/(2σ²).

2.6 Multi-kernel

The reason behind the use of multiple kernels is similar to the notion of multi-classification, where cross-validation is used to select the best performing classifier [6]. By using multiple kernels, we hope to learn different similarities in the kernel space that are not easily observed when using a single kernel.

We can prove from Mercer's Theorem that a kernel is Positive Semi-Definite (PSD) if u^T κ(x_i, x_j) u ≥ 0. Hence, by performing arithmetic or any other mathematical operation on two or more kernel matrices, we obtain a new kernel capable of exploiting different properties or similarities of the training data.

Given a kernel κ, we prove that κ is PD if it is formed as a product of the mapping functions φ(x_i)φ(x_j): κ is a kernel if its inner products are positive and the solution of κu = λu gives non-negative eigenvalues. So that,

$$\langle u, \kappa u\rangle = \sum_{i=1}^{N} u_i\,(\kappa u)_i \qquad (34)$$

$$= \sum_{i=1}^{N}\sum_{j=1}^{N} u_i\,\langle\phi(x_i), \phi(x_j)\rangle_H\, u_j \qquad (35)$$

where H represents the Hilbert space into which we project the kernel [7].

$$= \left\langle \sum_{i=1}^{N} u_i\phi(x_i),\; \sum_{j=1}^{N} u_j\phi(x_j)\right\rangle_H \qquad (36)$$

$$\langle u, \kappa u\rangle = \left\|\sum_{i=1}^{N} u_i\phi(x_i)\right\|_H^2 \geq 0 \qquad (37)$$

Therefore κ is positive definite. Using this property of the kernel κ, we introduce the multiple-kernel combinations as follows (a short sketch of these combinations is given after this list):

• LinearRBF

Here we combine two kernels, precisely the Linear and RBF kernels, using their inner product.

$$\hat{K}_{linrbf} = \kappa(x_i, x_j) \times \kappa(x_i, x_l) \qquad (38)$$

• RBFPoly

Here we combine the RBF and Polynomial kernels using their inner product.

$$\hat{K}_{rbfpoly} = \kappa(x_i, x_j) \times \kappa(x_i, x_l) \qquad (39)$$

• EtaKernel

The EtaKernel is a composite combination of LinearRBF, RBFPoly and RBFCosine.
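Interpreting equations (38)-(39) as element-wise (Hadamard) products of the corresponding Gram matrices, the combinations can be sketched as below. The kernel parameters are illustrative assumptions, and the RBFCosine component and the exact EtaKernel composition are omitted because their definitions are not reproduced above:

```python
import numpy as np

def linear_kernel(X1, X2):
    # kappa(x_i, x_j) = x_i x_j^T, cf. (28)
    return X1 @ X2.T

def rbf_kernel(X1, X2, gamma=1.0):
    # kappa(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), cf. (30)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def poly_kernel(X1, X2, c=1.0, d=3):
    # kappa(x_i, x_j) = (x_i x_j^T + c)^d, cf. (29)
    return (X1 @ X2.T + c) ** d

def linrbf_kernel(X1, X2, gamma=1.0):
    # LinearRBF: element-wise product of the Linear and RBF Gram matrices, cf. (38)
    return linear_kernel(X1, X2) * rbf_kernel(X1, X2, gamma)

def rbfpoly_kernel(X1, X2, gamma=1.0, c=1.0, d=3):
    # RBFPoly: element-wise product of the RBF and Polynomial Gram matrices, cf. (39)
    return rbf_kernel(X1, X2, gamma) * poly_kernel(X1, X2, c, d)
```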
We perform a comparison between classical logistic regression and its kernel version using benchmark datasets, including moons, blobs, circle and classification. Given N, the total number of samples (Circle: N = 1000, Moons: N = 1000, Classification: N = 1000), we split each dataset into a 70% training and 30% test sample.

3.1.1 Data description

Each dataset is binary (exactly 2 groups of data), and each sample contains exactly two features.

3.2 Logistic and kernel logistic regression result (Non-stochastic)

We begin by passing all data into one pipeline and run this procedure N times. We do this because of the non-deterministic results we get from the random initialization of β and α (for the stochastic version). Using the configuration learning rate = 10, γ = 1 and λ = 0.00001, the algorithm returns the result shown in figure 1.

3.3 Logistic and kernel logistic regression result (Stochastic)

Figure 2 shows the result of the stochastic version of logistic regression and its kernel versions, obtained after N runs using the configuration learning rate = 10, γ = 1 and λ = 0.00001.
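For reference, the benchmark datasets and the 70%-30% split described in this section match scikit-learn's standard synthetic generators. A possible setup is sketched below, where the noise levels, cluster counts and random seed are our own assumptions and not the configuration used by the authors:

```python
from sklearn.datasets import make_moons, make_blobs, make_circles, make_classification
from sklearn.model_selection import train_test_split

N = 1000  # total number of samples per dataset, as stated above

datasets = {
    "Moons": make_moons(n_samples=N, noise=0.2, random_state=0),
    "Blobs": make_blobs(n_samples=N, centers=2, n_features=2, random_state=0),
    "Circle": make_circles(n_samples=N, noise=0.1, factor=0.5, random_state=0),
    "Classification": make_classification(n_samples=N, n_features=2, n_redundant=0,
                                          n_informative=2, random_state=0),
}

splits = {}
for name, (X, y) in datasets.items():
    # 70% training / 30% test split, as described in the text
    splits[name] = train_test_split(X, y, test_size=0.3, random_state=0)
```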
3.4 Performance Analysis

We compare the performance of logistic regression and its kernel version. We also show that although stochastic logistic regression gives a more stable result than non-stochastic logistic regression, it takes a considerable amount of time to compute and is hence slower than the non-stochastic version.

3.4.1 Evaluation metric

We use the F1-score as our evaluation metric to compare the performance of classical logistic regression and its kernel versions. The F1-score is the harmonic mean of precision and recall, and it is given by

$$F1\text{-}score = \frac{2 \times precision \times recall}{precision + recall} \qquad (40)$$

where precision and recall are given by

$$precision = \frac{TP}{TP + FP} \qquad (41)$$

$$recall = \frac{TP}{TP + FN} \qquad (42)$$

TP: True Positives, TN: True Negatives, FP: False Positives, FN: False Negatives.
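A quick worked illustration of equations (40)-(42); the confusion-matrix counts below are made-up numbers, not results from this paper:

```python
def f1_score(tp, fp, fn):
    # precision = TP / (TP + FP), recall = TP / (TP + FN), cf. (41)-(42)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 = 2 * precision * recall / (precision + recall), cf. (40)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: precision = 0.9, recall = 0.75, F1 is approximately 0.818
print(f1_score(tp=90, fp=10, fn=30))
```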
Stochastic Running Time (secs)

kernels:        linear  rbf   poly   sigmoid  laplace  rbfpoly  linrbf  etakernel
Moons           0.62    0.05  0.07   0.02     0.09     0.21     0.08    0.27
Blobs           0.60    0.37  0.06   0.05     0.38     0.06     0.05    0.08
Circle          0.61    0.08  0.250  0.01     0.09     0.29     0.07    0.35
Classification  0.63    0.25  0.04   0.01     0.28     0.13     0.15    0.23
References

[1] Goldberger, Arthur S. Classical Linear Regression. In Econometric Theory. John Wiley & Sons, p. 158. ISBN 0-471-31101-4, 1964.

[2] Scott Menard. Applied Logistic Regression Analysis. Sage Publications. ISBN 0-7619-2208-3, 2001.

[3] Zhu, J., Hastie, T. Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14:185-205, 2005.

[4] Keerthi, S., Duan, K., Shevade, S., and Poo, A. A Fast Dual Algorithm for Kernel Logistic Regression. International Conference on Machine Learning, 19, 2002.

[5] Maher Maalouf, Theodore B. Trafalis, Indra Adrianto. Kernel logistic regression using truncated Newton method. Computational Management Science, 8:415-428, 2009.

[6] Mehmet Gönen, Ethem Alpaydin. Multiple Kernel Learning Algorithms. Journal of Machine Learning Research, 12:2211-2268, 2011.

[7] John Shawe-Taylor, Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, ISBN 9780511809682, pp. 47-83, 2011.