Logistic Regression
S. Sumitra
Department of Mathematics
Indian Institute of Space Science and Technology
Binary Classification
Let {(x1, y1), (x2, y2), . . . , (xN, yN)} be the given data, where
xi ∈ R^n and yi ∈ {1, 0}.
For example, for a coin toss the sample space is Ω = {Heads, Tails}.
Random Variable
A random variable is a measurable function defined on a
probability space that maps outcomes from the sample
space to real numbers.
Formally, it can be expressed as:
X : Sample Space → R
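The definition above can be sketched in code: a minimal illustration (the Heads → 1, Tails → 0 mapping is an assumption chosen to match the coin-toss example) of a random variable as a function from the sample space to the reals.

```python
# A sketch of a random variable X : Sample Space -> R for the coin-toss
# example; the mapping Heads -> 1, Tails -> 0 is an illustrative choice.
sample_space = ["Heads", "Tails"]

def X(outcome):
    """Random variable: maps each outcome to a real number."""
    return 1.0 if outcome == "Heads" else 0.0

values = [X(w) for w in sample_space]
print(values)  # [1.0, 0.0]
```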
Types of Random Variables
A random variable is discrete if it takes values in a countable set, and continuous otherwise. For instance, recording repeated coin tosses with Heads mapped to 1 and Tails to 0 gives a discrete sequence such as {1, 0, 1, 0}.
Random Vector
The distribution of a random variable X assigns probabilities to its range: P : R(X) → [0, 1].
A random vector is a vector of random variables; its joint distribution is denoted p(x1, x2, . . . , xN).
The Bernoulli distribution with parameter ϕ: p(x) = ϕ^x (1 − ϕ)^(1−x), x ∈ {0, 1}
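A minimal sketch of the Bernoulli pmf above; the value ϕ = 0.7 is purely illustrative.

```python
# Bernoulli pmf p(x) = phi^x * (1 - phi)^(1 - x) for x in {0, 1}.
def bernoulli_pmf(x, phi):
    return phi**x * (1 - phi)**(1 - x)

phi = 0.7  # illustrative success probability
print(bernoulli_pmf(1, phi))  # 0.7
print(bernoulli_pmf(0, phi))  # 0.3 (up to floating point)
# The two probabilities sum to 1.
```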
Logistic Regression
Formulation
Output Variable
The odds in favor of an event A = P(A) / (1 − P(A))
The log-odds is the natural logarithm of the odds; it is also
known as the logit.
For logistic regression, the odds in favour of the positive class are f(x) / (1 − f(x)).
logit(f(x)) = log [ f(x) / (1 − f(x)) ], which lies in the interval (−∞, +∞)
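The range claim is easy to check numerically: a small sketch of the logit, which maps probabilities in (0, 1) onto the whole real line.

```python
import math

# logit(p) = log(p / (1 - p)); maps (0, 1) onto (-inf, +inf).
def logit(p):
    return math.log(p / (1 - p))

for p in (0.001, 0.5, 0.999):
    print(p, logit(p))
# logit(0.5) = 0; p near 0 gives large negative values, p near 1 large positive.
```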
Modeling Using Linear Regression
Consider the data
(x1, log [ f(x1) / (1 − f(x1)) ]), . . . , (xN, log [ f(xN) / (1 − f(xN)) ])
Apply linear regression concepts to model this data; that is,
a hyperplane is fitted to predict the log-odds (the logit)
of the probability that Y = 1.
log [ f(x) / (1 − f(x)) ] = w^T x, w ∈ R^(n+1)
exp(w^T x) = f(x) / (1 − f(x))
y | x is a random variable:
y | x ∼ Bernoulli(f(x)) = Bernoulli( 1 / (1 + exp(−w^T x)) )
Sigmoid Function: g(t) = 1 / (1 + e^(−t))
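A minimal implementation of the sigmoid function defined above:

```python
import math

# Sigmoid g(t) = 1 / (1 + e^(-t)); inverts the logit, so f(x) = g(w^T x).
def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```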
Decision Boundary
The decision boundary is the set where f(x) = 0.5, that is, w^T x = 0; points with w^T x ≥ 0 are assigned to the positive class.
g′(t) = d/dt [ 1 / (1 + e^(−t)) ]
= e^(−t) / (1 + e^(−t))^2
= [ 1 / (1 + e^(−t)) ] [ 1 − 1 / (1 + e^(−t)) ]
= g(t)(1 − g(t))
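The identity g′(t) = g(t)(1 − g(t)) can be verified numerically against a finite difference; the point t = 0.7 below is an arbitrary illustrative choice.

```python
import math

def g(t):
    return 1.0 / (1.0 + math.exp(-t))

# Check g'(t) = g(t)(1 - g(t)) against a central finite difference.
t, h = 0.7, 1e-6
numeric = (g(t + h) - g(t - h)) / (2 * h)
analytic = g(t) * (1 - g(t))
print(abs(numeric - analytic))  # tiny, consistent with the identity
```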
Random Sample
w := w + α∇l(w)
wj := wj + α ∂l(w)/∂wj , j = 0, 1, . . . , n
∂l(w)/∂w = ∂/∂w Σ_{i=1}^N [ y_i log f(x_i) + (1 − y_i) log(1 − f(x_i)) ]
= Σ_{i=1}^N [ y_i / f(x_i) − (1 − y_i) / (1 − f(x_i)) ] ∂f(x_i)/∂w
= Σ_{i=1}^N [ y_i / f(x_i) − (1 − y_i) / (1 − f(x_i)) ] f(x_i)(1 − f(x_i)) ∂(w^T x_i)/∂w (refer previous section)
= Σ_{i=1}^N [ y_i(1 − f(x_i)) − (1 − y_i)f(x_i) ] x_i
= Σ_{i=1}^N (y_i − f(x_i)) x_i
Algorithm 1 Updation of w using Batch Gradient Ascent
Choose an initial w and learning parameter α
while not converged do
    w := w + α Σ_{i=1}^N (y_i − f(x_i)) x_i
end while
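Algorithm 1 can be sketched directly in code. This is a minimal illustration, not a production implementation: the toy 1-D dataset, the learning rate α = 0.1, and the fixed iteration count standing in for a convergence test are all assumptions.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Batch gradient ascent for logistic regression, as in Algorithm 1.
# Each x is augmented with a leading 1 so w[0] plays the role of the bias.
def fit(X, y, alpha=0.1, iters=2000):
    n = len(X[0])
    w = [0.0] * n
    for _ in range(iters):  # fixed count stands in for "while not converged"
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            fi = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j in range(n):
                grad[j] += (yi - fi) * xi[j]  # (y_i - f(x_i)) x_i
        w = [wj + alpha * gj for wj, gj in zip(w, grad)]
    return w

# Toy 1-D data (illustrative): class 1 for positive x, class 0 for negative x.
X = [[1, -2.0], [1, -1.0], [1, -0.5], [1, 0.5], [1, 1.0], [1, 2.0]]
y = [0, 0, 0, 1, 1, 1]
w = fit(X, y)
preds = [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) >= 0.5 else 0
         for xi in X]
print(preds)  # [0, 0, 0, 1, 1, 1] on this separable toy set
```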
J(w) = − Σ_{i=1}^N [ y_i log f(x_i) + (1 − y_i) log(1 − f(x_i)) ]
= − Σ_{i=1}^N ⟨ (y_i, 1 − y_i)^T , (log f(x_i), log(1 − f(x_i)))^T ⟩
The Newton–Raphson update for finding a root of h:
y := y − h(y) / h′(y)
Applied to ∇l(w) = 0, with H the Hessian of l:
w := w − H^(−1) ∇l
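The one-dimensional update y := y − h(y)/h′(y) can be illustrated on a simple root-finding problem; h(y) = y² − 2 and the starting point are illustrative choices, not part of the logistic regression model.

```python
# Newton-Raphson iteration y := y - h(y)/h'(y); the multivariate analogue
# w := w - H^{-1} grad l is Newton's method for logistic regression.
def newton(h, h_prime, y0, iters=20):
    y = y0
    for _ in range(iters):
        y = y - h(y) / h_prime(y)
    return y

# Illustrative root-finding: h(y) = y^2 - 2, root sqrt(2).
root = newton(lambda y: y * y - 2, lambda y: 2 * y, 1.0)
print(root)  # converges to sqrt(2)
```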
F Measure = 2 × (Precision × Recall) / (Precision + Recall)
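The F measure above can be computed from confusion-matrix counts; the counts below are purely illustrative.

```python
# F measure = 2 * Precision * Recall / (Precision + Recall),
# with Precision = TP/(TP+FP) and Recall = TP/(TP+FN).
def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: precision 0.8, recall 2/3.
print(f_measure(tp=8, fp=2, fn=4))
```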
True Positive Rate (TPR) & False Positive Rate (FPR)
TPR = sensitivity = TP / (TP + FN)
FPR = 1 − specificity = FP / (TN + FP)
Receiver Operating Characteristic Curve (ROC)
The value of the area under the ROC curve lies in the
interval [0,1] and is a measure of the accuracy of the
model.
An area of 1 represents a perfect test, an area of 0.5 indicates
a model that is no better than chance, and an area below 0.5
indicates a model that performs worse than chance.
More area under the curve signifies that the model is
identifying more true positives while minimizing the number
of false positives.
[Figure: ROC curve, plotting Sensitivity against (1 − Specificity); both axes range from 0 to 1.]
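The area under an ROC curve can be approximated with the trapezoidal rule; the (FPR, TPR) points below are illustrative, not taken from any real model.

```python
# Area under the ROC curve via the trapezoidal rule over (FPR, TPR) points.
def auc(points):
    points = sorted(points)  # order by FPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Illustrative ROC points from (0, 0) to (1, 1).
roc = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]
print(auc(roc))  # well above 0.5, i.e., better than chance
```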