
Logistic Regression

S. Sumitra
Department of Mathematics
Indian Institute of Space Science and Technology

MA613 Data Mining


Introduction

Binary Classification
Let {(x1, y1), (x2, y2), . . . , (xN, yN)} be the given data, where xi ∈ R^n and yi ∈ {1, 0}.

Data    Soleus    Gastrocnemius    yi: 1/0

x1^T    x11       x12              1
x2^T    x21       x22              0
x3^T    x31       x32              0
x4^T    x41       x42              1
x5^T    x51       x52              1


Probability Theory
Experiments

Deterministic Experiment: Combining hydrogen and oxygen in the ratio 2:1 (the outcome, water, is fixed)
Random Experiment: Tossing a coin
Basic Concepts: Random Experiment

Sample Space (Ω): The set of all outcomes of a random experiment.
For example, the outcomes of tossing a coin are Ω = {Heads, Tails}.
A subset of a sample space is called an event. An event can include one or more outcomes. For example, getting Heads is an event.
Probability Function: P : A → [0, 1], where A is the
collection of events (i.e., the powerset of Ω).
Example: The probability of the event "getting Heads" can
be denoted as P(Heads).
Random Variable

Random Variable
A random variable is a measurable function defined on a
probability space that maps outcomes from the sample
space to real numbers.
Formally, it can be expressed as:

X : Sample Space → R
Types of Random Variables

Discrete Random Variable


The range is countable.
For example, let X denote a random variable that maps
outcomes of a coin toss:

X : {Heads, Tails} → {0, 1}, X (Heads) = 0, X (Tails) = 1

Thus, the probability can be expressed as


P(X = 0) = P(Heads).
Continuous Random Variable
The range is uncountable.
An example is the height of people in a country, which can
take any value within a specified range.
Question

An unbiased die is rolled. Let X = 1 if it shows an even number, and X = 0 otherwise. What is P(X = 0)?
Attribute

An attribute of a data point is considered a random variable.
For example, the attribute Height can take values such as {165, 170, 160}.
For the attribute Rain, which indicates whether it rained (1 for rain, 0 for no rain), the values could be {1, 0, 1, 0}.
Random Vector

A random vector is defined as

X = (X1, X2, . . . , Xn)^T

where each Xi is a random variable.

Each data point xi can be represented as a random vector:

xi = (A1, A2, . . . , An)^T
Probability Distribution

A probability distribution is a mathematical description of


the probabilities of events:

P : R(X ) → [0, 1]

where R(X ) is the range of the random variable X .


It can be represented using a table or an equation,
depending on the nature of the distribution (discrete or
continuous).
Discrete Probability Distribution
A discrete distribution describes the probability of
occurrence of each value of a discrete random variable.
The range of a discrete random variable X is given by:
R(X ) = {x1 , x2 , . . . , xm }
The probability mass function (PMF) is defined as:
p(xi ) = P(X = xi ) for i = 1, 2, . . . , m
x    P(x)
1    1/6
2    1/6
3    1/6
4    1/6
5    1/6
6    1/6
Common types of discrete probability distributions include:
Bernoulli distribution, Binomial distribution, Poisson
distribution
Continuous Probability Distribution

If a random variable is continuous, its probability


distribution is described by a probability density function
(PDF).
The probability that the random variable X falls within a
certain interval [a, b] is given by:
P(a ≤ X ≤ b) = ∫_a^b p(x) dx

Common types of continuous probability distributions


include:
Normal distribution
Exponential distribution
Uniform distribution
Others such as the Gamma and Beta distributions.
Figure: y axis shows density, which indicates the proportion of the
population having a particular range of height.
Univariate & Multivariate Probability Distribution

Univariate Distribution: A probability distribution that


describes the variability of a single random variable.
Multivariate Distribution: A probability distribution that
describes the variability of a random vector, which consists
of multiple random variables.
Joint Probability Distribution

The joint probability distribution of random variables


x1 , x2 , . . . , xN is denoted as:

p(x1 , x2 , . . . , xN )

If the random variables are mutually independent, the joint


probability can be expressed as:

p(x1 , x2 , . . . , xN ) = p(x1 ) · p(x2 ) · · · p(xN )


Bernoulli Distribution

The Bernoulli distribution is a discrete probability


distribution of a random variable X that takes only two
values: 1 (success) and 0 (failure).
Success: X = 1; Failure: X = 0.
The probabilities are defined as:

P(X = 1) = p(1) = ϕ, P(X = 0) = p(0) = 1 − ϕ

The probability mass function (PMF) can be expressed as:

p(x) = ϕ^x (1 − ϕ)^(1−x), x = 0, 1
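A minimal sketch of the Bernoulli PMF in Python (NumPy assumed; the value of ϕ and the sample size are illustrative):

import numpy as np

def bernoulli_pmf(x, phi):
    # PMF p(x) = phi^x * (1 - phi)^(1 - x) for x in {0, 1}
    return phi**x * (1 - phi)**(1 - x)

phi = 0.7                                    # illustrative probability of success
print(bernoulli_pmf(1, phi))                 # p(1) = phi
print(bernoulli_pmf(0, phi))                 # p(0) = 1 - phi

rng = np.random.default_rng(0)
samples = rng.binomial(n=1, p=phi, size=10)  # Bernoulli trials (Binomial with n = 1)
print(samples)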
Logistic Regression
Formulation
Output Variable

Define a random variable Y | X such that

Y = 1 if the given data point x is in the positive class, and Y = 0 otherwise.

Y | X follows a Bernoulli distribution. Therefore,

p(Y = y | x) = ϕ^y (1 − ϕ)^(1−y), y = 0, 1

where ϕ is the probability of success, that is, the probability that the given data point belongs to the positive class.
Hypothesis Function in Logistic Regression

Let f(x) be the probability that x belongs to the positive class, that is, f(x) is the probability of success. Hence, f(x) = p(y = 1 | x). Therefore,

P(Y = y | x) = p(y | x) = f(x)^y (1 − f(x))^(1−y), y = 0, 1


Bernoulli Structures

yi | xi ∼ Bernoulli(f(xi)), i = 1, 2, . . . , N


yi | xi are:
Bernoulli structures
Independent
Need not be identically distributed as f (xi ) may be different
Odds in favour of an Event

The odds in favour of an event A = P(A) / (1 − P(A))
The log-odds is the natural logarithm of the odds. It is also known as the logit.
For logistic regression:
Odds in favour of getting the positive class: f(x) / (1 − f(x))
logit(f(x)) = log( f(x) / (1 − f(x)) ), which lies in the interval (−∞, +∞)
Modeling Using Linear Regression

Consider the data

(x1, log( f(x1) / (1 − f(x1)) )), . . . , (xN, log( f(xN) / (1 − f(xN)) ))

Apply linear regression concepts to model the data, that is, a hyperplane is modeled to predict the log odds (the logit) of the probability that Y = 1:

log( f(x) / (1 − f(x)) ) = w^T x,  w ∈ R^(n+1)
exp(w^T x) = f(x) / (1 − f(x))
f(x) = (1 − f(x)) exp(w^T x)
f(x)(1 + exp(w^T x)) = exp(w^T x)
f(x) = 1 / (1 + exp(−w^T x))

Hence

f(x) = g(w^T x) = 1 / (1 + e^(−w^T x))

where

g(t) = 1 / (1 + e^(−t)), t ∈ R

is called the logistic function or sigmoid function.
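A minimal sketch of the sigmoid g(t) and the resulting hypothesis f(x) = g(w^T x) (NumPy assumed; the weight vector w and the input x below are illustrative values, with a leading 1 standing in for the bias component):

import numpy as np

def sigmoid(t):
    # Logistic (sigmoid) function g(t) = 1 / (1 + e^(-t))
    return 1.0 / (1.0 + np.exp(-t))

w = np.array([0.5, -1.2, 2.0])   # w in R^(n+1); w[0] plays the role of the bias term
x = np.array([1.0, 0.3, 0.8])    # augmented data point x = (1, x1, ..., xn)

f_x = sigmoid(w @ x)             # f(x) = P(y = 1 | x)
print(f_x)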
Logistic Regression: Output variable

y | x is a random variable:

y | x ∼ Bernoulli(f(x)) = Bernoulli( 1 / (1 + exp(−w^T x)) )

Sigmoid function: g(t) = 1 / (1 + e^(−t))
Decision Boundary

w^T x = 0 is the decision boundary for logistic regression.

When w^T x ≥ 0, x belongs to the positive class; when w^T x < 0, x is in the negative class.
Linear Algorithm
Separable Data
Derivative of logistic function

g'(t) = d/dt [ 1 / (1 + e^(−t)) ]
      = e^(−t) / (1 + e^(−t))^2
      = ( 1 / (1 + e^(−t)) ) ( 1 − 1 / (1 + e^(−t)) )
      = g(t)(1 − g(t))
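The identity g'(t) = g(t)(1 − g(t)) can also be checked numerically with a central finite difference; a small sketch (NumPy assumed, the grid of t values is illustrative):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numeric = (sigmoid(t + h) - sigmoid(t - h)) / (2 * h)   # central-difference estimate of g'(t)
analytic = sigmoid(t) * (1 - sigmoid(t))                # g(t)(1 - g(t))
print(np.max(np.abs(numeric - analytic)))               # prints a value close to zero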
Random Sample

A random sample is a subset of individuals chosen from a


larger population, where each individual has an equal
chance of being selected. This helps to ensure that the
sample is representative of the population.
Consider the study of the heights of the population of a
country. The sample space Ω is defined as the set of all
people. We choose a random sample x1 , x2 , . . . , xN from
this population.
Each sample element can be represented as xi : Ω → R,
where each xi denotes the height of the i-th individual
sampled and is expressed as a real number.
Each xi is a random variable that can take on different
values based on the heights of individuals in the
population.
Data Generation
Consider an experiment of choosing a ball randomly from an
urn three times. The urn consists of 6 blue and 4 orange balls.
Define Ai = 1 if an orange ball is chosen, and Ai = 2 if a blue
ball is chosen, for i = 1, 2, 3. Let this experiment be conducted
4 times.
First Day:
Draw Outcomes: obb, bbo, bob, obo
Corresponding Values: 122, 221, 212, 121
Result Vectors:

x1 = (1, 2, 2)^T, x2 = (2, 2, 1)^T, x3 = (2, 1, 2)^T, x4 = (1, 2, 1)^T
Second Day:
Draw Outcomes: oob, obo, ooo, boo
Corresponding Values: 112, 121, 111, 211
Result Vectors:

x1 = (1, 1, 2)^T, x2 = (1, 2, 1)^T, x3 = (1, 1, 1)^T, x4 = (2, 1, 1)^T
Here, the xi and Ai are treated as random variables.
Independent and Identically Distributed Random
Variables (iid)

A set of random variables x1, x2, . . . , xN is said to be iid if they satisfy the following:
All have the same probability distribution.
All are mutually independent.
Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution.
The parameters are chosen to maximize a likelihood function, so that the observed data are most probable under the probability distribution under consideration.
Assumption

Assume that the N training examples were generated independently. The maximum likelihood method can then be used to find the unknown parameters.
Problem
Solution

L(θ) = p(x1, x2, . . . , x10)
     = ∏_{i=1}^{10} p(xi)
     = p(x1 = 3) p(x2 = 0) p(x3 = 2) p(x4 = 1) p(x5 = 3) p(x6 = 2) p(x7 = 1) p(x8 = 0) p(x9 = 2) p(x10 = 1)

θ = arg max_θ L(θ)


Log Likelihood

Find θ that maximizes L(θ).
The logarithm function is strictly increasing on (0, ∞):
g(x) = log x, g'(x) = 1/x > 0 on (0, ∞)
Therefore arg max_θ log L(θ) = arg max_θ L(θ).
Write log L(θ) = l(θ); l(θ) is called the log likelihood.
Likelihood Function
y | x is a random variable:

y | x ∼ Bernoulli(f(x)) = Bernoulli( 1 / (1 + exp(−w^T x)) )

Given N independent random variables, the likelihood of the parameters can be written as

L(w) = p(y1 | x1, y2 | x2, . . . , yN | xN)
     = ∏_{i=1}^{N} p(yi | xi)
     = ∏_{i=1}^{N} f(xi)^yi (1 − f(xi))^(1−yi)
     = ∏_{i=1}^{N} ( 1 / (1 + exp(−w^T xi)) )^yi ( 1 − 1 / (1 + exp(−w^T xi)) )^(1−yi)
Example

L(w) = p(y1 = 1 | x1, y2 = 0 | x2, y3 = 0 | x3, y4 = 1 | x4, y5 = 1 | x5; w)
Find w that maximizes the probability of getting such a sample.
l(w)

The log likelihood l(w) is

l(w) = log L(w)
     = Σ_{i=1}^{N} ( yi log f(xi) + (1 − yi) log(1 − f(xi)) )

Find arg max_w l(w).


Parameter Estimation

To find the maximum value, we can apply the gradient ascent method. That is,

w := w + α ∇l(w)

Equating the jth component from both sides,

wj := wj + α ∂l(w)/∂wj, j = 0, 1, . . . , n

∂l(w)/∂w = ∂/∂w Σ_{i=1}^{N} ( yi log f(xi) + (1 − yi) log(1 − f(xi)) )
         = Σ_{i=1}^{N} ( yi / f(xi) − (1 − yi) / (1 − f(xi)) ) ∂f(xi)/∂w
         = Σ_{i=1}^{N} ( yi / f(xi) − (1 − yi) / (1 − f(xi)) ) f(xi)(1 − f(xi)) ∂(w^T xi)/∂w   (refer previous section)
         = Σ_{i=1}^{N} ( yi (1 − f(xi)) − (1 − yi) f(xi) ) xi
         = Σ_{i=1}^{N} ( yi − f(xi) ) xi
Algorithm 1: Updation of w using Batch Gradient Ascent
  Choose an initial w and learning parameter α
  while not converged do
      w := w + α Σ_{i=1}^{N} (yi − f(xi)) xi
  end while

Algorithm 2: Updation of w using Gradient Ascent (component-wise)
  Choose an initial w and learning parameter α
  Iterate until convergence {
      wj := wj + α Σ_{i=1}^{N} (yi − f(xi)) xij, j = 0, 1, . . . , n
  }
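A sketch of Algorithm 1 in Python (NumPy assumed; the toy data X, y, the learning rate, and the fixed iteration count used in place of an explicit convergence test are all illustrative):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def batch_gradient_ascent(X, y, alpha=0.1, n_iters=1000):
    # Maximise l(w) with the update w := w + alpha * sum_i (y_i - f(x_i)) x_i.
    # X is N x (n+1) with a leading column of ones; y holds 0/1 labels.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        f = sigmoid(X @ w)            # f(x_i) for every training point
        w += alpha * X.T @ (y - f)    # gradient of the log-likelihood
    return w

# Illustrative toy data: two features plus a bias column
X = np.array([[1.0, 0.2, 1.5],
              [1.0, 2.0, 0.3],
              [1.0, 1.8, 0.2],
              [1.0, 0.1, 1.9]])
y = np.array([1.0, 0.0, 0.0, 1.0])
w = batch_gradient_ascent(X, y)
print(sigmoid(X @ w))                 # predicted P(y = 1 | x) for each point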
Stochastic Gradient Ascent

One point at a time


Algorithm 3: Updation of w using Stochastic Gradient Ascent
  Choose an initial w and learning parameter α
  while not converged do
      for i = 1, 2, . . . , N do
          w := w + α (yi − f(xi)) xi
      end for
      Randomly shuffle the data
  end while

Algorithm 4: Updation of w using Stochastic Gradient Ascent (component-wise)
  Choose an initial w and learning parameter α
  Iterate until convergence {
      for i = 1, 2, . . . , N (randomly shuffle the data) {
          wj := wj + α (yi − f(xi)) xij, j = 0, 1, 2, . . . , n
      }
      Randomly shuffle the data
  }
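A corresponding sketch of the stochastic update (Algorithm 3); as above, NumPy is assumed and a fixed number of passes over the data replaces the explicit convergence test:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def stochastic_gradient_ascent(X, y, alpha=0.1, n_epochs=100, seed=0):
    # One point at a time: w := w + alpha * (y_i - f(x_i)) x_i
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    idx = np.arange(X.shape[0])
    for _ in range(n_epochs):
        for i in idx:
            w += alpha * (y[i] - sigmoid(X[i] @ w)) * X[i]
        rng.shuffle(idx)              # randomly shuffle the data after each pass
    return w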
Formulation: Minimization Problem

l(w) = log L(w)
     = Σ_{i=1}^{N} ( yi log f(xi) + (1 − yi) log(1 − f(xi)) )

J(w) = − Σ_{i=1}^{N} ( yi log f(xi) + (1 − yi) log(1 − f(xi)) )
     = − Σ_{i=1}^{N} ⟨ (yi, 1 − yi)^T, (log f(xi), log(1 − f(xi)))^T ⟩

w = arg min_w J(w)


Loss Function and Cost Function

A loss (error) function measures the discrepancy between the predicted output and the true output for a single training point, whereas a cost function measures it over the entire training data.
Cross Entropy Function

The cross entropy function is used to quantify the


difference between two probability distributions.
For discrete probability distributions, it takes two
distributions:
p(x) - the true distribution,
q(x) - the estimated distribution,
defined over the discrete variable X .
The cross entropy H(p, q) is given by:

H(p, q) = − Σ_x p(x) log(q(x))
Logistic Regression: Loss Function

L(yi, f(xi)) = − ( yi log f(xi) + (1 − yi) log(1 − f(xi)) )

J(w) = Σ_{i=1}^{N} L(yi, f(xi))

L is called the logistic loss function, cross-entropy loss, or log loss.
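A minimal sketch of the log loss and the cost J(w) in Python (NumPy assumed; the small epsilon clip is only there to avoid log(0) on illustrative inputs):

import numpy as np

def log_loss(y, f, eps=1e-12):
    # Cross-entropy loss for 0/1 labels y and predicted probabilities f
    f = np.clip(f, eps, 1 - eps)      # guard against log(0)
    return -(y * np.log(f) + (1 - y) * np.log(1 - f))

def cost(y, f):
    # J(w): sum of the per-point losses over the training data
    return np.sum(log_loss(y, f))

y = np.array([1.0, 0.0, 0.0, 1.0])
f = np.array([0.9, 0.2, 0.1, 0.7])    # illustrative predicted probabilities f(x_i)
print(cost(y, f))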
Newton’s Method
To find a solution of h(y) = 0, y ∈ R, using Newton’s method, the iteration is given by:

y := y − h(y) / h'(y)

For the maximum of the log-likelihood l(w), the critical points satisfy ∇l(w) = 0. By applying Newton’s method to this equation, we have:

w := w − H^(−1) ∇l(w)

Here, H is the Hessian matrix of size (n + 1) × (n + 1), defined as:

H = [Hkl], Hkl = ∂²l(w) / (∂wk ∂wl), k, l = 0, 1, 2, . . . , n
Newton’s Method

Newton’s method converges faster (in fewer iterations) than (batch) gradient descent as well as stochastic gradient descent. However, one iteration of Newton’s method is more expensive than one iteration of either gradient method, since the Hessian matrix has to be inverted. If n is not too large, Newton’s method is usually more effective overall. When Newton’s method is applied to maximize the logistic regression log-likelihood function, the algorithm is called Fisher scoring.
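A sketch of the Newton (Fisher scoring) update for logistic regression. Differentiating the gradient derived earlier gives the Hessian H = −X^T S X with S = diag(f(xi)(1 − f(xi))); that form is used below (NumPy assumed, and in practice a small regularisation term is often added if H becomes nearly singular):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_logistic(X, y, n_iters=10):
    # Newton's method: w := w - H^(-1) grad l(w)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        f = sigmoid(X @ w)
        grad = X.T @ (y - f)                  # gradient of l(w)
        S = np.diag(f * (1 - f))
        H = -X.T @ S @ X                      # Hessian of l(w)
        w -= np.linalg.solve(H, grad)         # solves H d = grad, then w := w - d
    return w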
Confusion Matrix
Performance Measures: Classification
Accuracy:
  Accuracy = (No. of correctly classified data) / (Total number of data)
           = (TP + TN) / (TP + TN + FP + FN)

Sensitivity (Recall):
  Sensitivity = (No. of correctly classified positive data) / (Total number of positive data)
              = TP / (TP + FN)

Specificity:
  Specificity = (No. of correctly classified negative data) / (Total number of negative data)
              = TN / (TN + FP)

Precision:
  Precision = TP / (TP + FP)
  A good measure to check whether false positives are high.

F Measure: Harmonic mean of precision and recall:
  F Measure = 2 × (Precision × Recall) / (Precision + Recall)
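A small sketch computing these measures from predicted 0/1 labels (NumPy assumed; the label arrays are illustrative):

import numpy as np

def confusion_counts(y_true, y_pred):
    # Return TP, TN, FP, FN for 0/1 labels
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp, tn, fp, fn

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)                  # recall / TPR
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)
print(accuracy, sensitivity, specificity, precision, f_measure)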
True Positive Rate(TPR) & False Positive Rate (FPR)

TPR = Sensitivity
FPR = 1 − Specificity = FP / (TN + FP)
Receiver Operating Characteristic Curve (ROC)

The receiver operating characteristic curve (ROC) of a


binary classifier plots (1 - specificity) (FPR) on the x-axis
and sensitivity (TPR) on the y-axis.
It is a visual tool for comparing classification models,
showing the trade-off between sensitivity and specificity.
ROC curves help in choosing the best threshold for a given
model.
The points on the curve are obtained using different threshold values for classification, as sketched in the example below.
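A sketch of how the ROC points can be generated by sweeping the classification threshold over the predicted probabilities (NumPy assumed; the scores, labels, and threshold grid are illustrative, and the area is approximated with the trapezoidal rule):

import numpy as np

def roc_points(y_true, scores, thresholds):
    # FPR/TPR pairs obtained by varying the decision threshold
    fpr, tpr = [], []
    for t in thresholds:
        y_pred = (scores >= t).astype(int)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        tpr.append(tp / (tp + fn))
        fpr.append(fp / (fp + tn))
    return np.array(fpr), np.array(tpr)

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1])   # illustrative f(x_i)
thresholds = np.linspace(1.0, 0.0, 21)                         # high to low threshold
fpr, tpr = roc_points(y_true, scores, thresholds)
auc = np.trapz(tpr, fpr)                                       # area under the ROC curve
print(auc)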
Interpretation of ROC Curve

A completely random guess will result in points along the


diagonal line from the bottom-left to the top-right corner.
Points below the diagonal are worse than random
guessing.
The accuracy of the test is higher if the ROC plot bulges
towards the upper-left corner.
Area Under the ROC Curve

The value of the area under the ROC curve lies in the
interval [0,1] and is a measure of the accuracy of the
model.
An area of 1 represents a perfect test, while an area less
than or equal to 0.5 indicates a model that is not better
than chance.
More area under the curve signifies that the model is
identifying more true positives while minimizing the number
of false positives.
Figure: ROC curves and ROC areas (CGF: 0.9403, CGPI: 0.9328, CGPII: 0.9402, CGPIII: 0.9403); x-axis: 1 − Specificity, y-axis: Sensitivity.
