Linear Models for Classification: Logistic Regression (logreg.pdf, May 4, 2010)

Classification problems involve predicting a target variable that can take on discrete, unordered values. The goal is to predict the target class given input values. There are two main approaches: discriminative models directly model the conditional probability of the target given inputs, while generative models estimate class conditional densities and apply Bayes' rule. Linear probability models and logistic regression are examples of discriminative models that use a linear function of the inputs to model the conditional probability. Logistic regression specifically uses a logistic response function to map the linear combination to a probability between 0 and 1.


Data Mining: Linear Models for Classification

Classification Problems

In a classification problem there is a target variable y that assumes values in an unordered discrete set.

The goal of a classification procedure is to predict the target value (class label) given a set of input values x = {x1, . . . , xd} measured on the same object.

An important special case is when there are only two classes, in which case we usually code them as y ∈ {0, 1}.

Examples of Classification Problems

- Credit scoring: will the applicant default or not?
- SPAM filter: is an e-mail message SPAM or not?
- Medical diagnosis: does the patient have breast cancer?
- Handwritten digit recognition.
- Music genre classification.

Classification Problems

At a particular point x the value of y is not uniquely determined. It can assume both its values with respective probabilities that depend on the location of the point x in the input space. We write

p(y = 1|x) = 1 − p(y = 0|x).

The goal of a classification procedure is to produce an estimate of p(y|x) at every input point x.

Two types of approaches to classification

- Discriminative Models (regression).
- Generative Models (density estimation).

Discriminative Models

Discriminative methods only model the conditional distribution of y given x. The probability distribution of x itself is not modeled. For the binary classification problem:

p(y = 1|x) = f(x, w)

where f(x, w) is some deterministic function of x.

Note that the linear regression model follows the same strategy.



Discriminative Models

Examples of discriminative classification methods:

- Linear probability model (this lecture)
- Logistic regression (this lecture)
- Classification Trees (Book: section 4.3)
- Feed-forward neural networks (Book: section 5.4)
- ...

Generative Models

An alternative paradigm for estimating p(y|x) is based on density estimation. Here Bayes' theorem

p(y = 1|x) = p(y = 1) p(x|y = 1) / [p(y = 1) p(x|y = 1) + p(y = 0) p(x|y = 0)]

is applied, where p(x|y) are the class conditional probability density functions and p(y) are the unconditional (prior) probabilities of each class.
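To make the generative recipe concrete, here is a minimal sketch that applies Bayes' theorem with assumed one-dimensional Gaussian class-conditional densities; the priors, densities, and test point are hypothetical, not taken from the lecture.

```python
from scipy.stats import norm

# Hypothetical priors p(y) and class-conditional densities p(x|y).
prior1, prior0 = 0.3, 0.7            # p(y = 1), p(y = 0)
dens1 = norm(loc=2.0, scale=1.0)     # p(x | y = 1)
dens0 = norm(loc=0.0, scale=1.0)     # p(x | y = 0)

def posterior_y1(x):
    """Bayes' theorem: p(y=1|x) = p(y=1)p(x|y=1) / [p(y=1)p(x|y=1) + p(y=0)p(x|y=0)]."""
    num = prior1 * dens1.pdf(x)
    return num / (num + prior0 * dens0.pdf(x))

print(posterior_y1(1.5))   # posterior probability of class 1 at the point x = 1.5
```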

Generative Models

Examples of density estimation based classification methods:

- Linear/Quadratic Discriminant Analysis (not discussed),
- Naive Bayes classifier (Book: section 5.3),
- ...

Discriminative Models: linear probability model

Consider the linear regression model

y = wᵀx + ε,   y ∈ {0, 1}

Note:

wᵀ = [w0 w1 . . . wd],   x = [1 x1 . . . xd]ᵀ,

so wᵀx = w0 + w1x1 + · · · + wdxd.

By assumption E[ε|x] = 0, so we have

E[y|x] = wᵀx

But

E[y|x] = 1 · p(y = 1|x) + 0 · p(y = 0|x) = p(y = 1|x)
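As an aside, the linear probability model can be fitted by ordinary least squares on the 0/1 targets. The sketch below uses hypothetical data (not from the lecture) and shows why the linear response is awkward for probabilities: the fitted values are not confined to the interval [0, 1].

```python
import numpy as np

# Hypothetical data: a single input x and a binary target y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Design matrix with a leading column of ones, so w = [w0, w1].
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares estimate of w in E[y|x] = w0 + w1*x.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)        # fitted [w0, w1]
print(X @ w)    # fitted values: for this data they fall below 0 and above 1 at the ends
```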

Linear response function

[Figure: the linear response function E[y|x] = wᵀx plotted against wᵀx.]

Logistic regression

Logistic response function

E[y|x] = e^{wᵀx} / (1 + e^{wᵀx})

or (divide numerator and denominator by e^{wᵀx})

E[y|x] = 1 / (1 + e^{−wᵀx}) = (1 + e^{−wᵀx})^{−1}
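A minimal numpy sketch of the logistic response function; the weight vector and input point below are hypothetical.

```python
import numpy as np

def logistic_response(w, x):
    """E[y|x] = exp(w'x) / (1 + exp(w'x)) = 1 / (1 + exp(-w'x))."""
    z = np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([-3.0, 0.5])        # hypothetical [w0, w1]
x = np.array([1.0, 4.0])         # leading 1 for the intercept term
print(logistic_response(w, x))   # always a value strictly between 0 and 1
```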

logreg.pdf — May 4, 2010 — 2


Logistic Response Function

[Figure: the logistic response function E[y|x] plotted against wᵀx; an S-shaped curve rising from 0.0 through 0.5 to 1.0.]

Linearization: the logit transformation

Write z = wᵀx:

ln[ p(y = 1|x) / (1 − p(y = 1|x)) ]
  = ln[ (1 + e^{−z})^{−1} / (1 − (1 + e^{−z})^{−1}) ]
  = ln[ 1 / ((1 + e^{−z}) − 1) ]
  = ln(1 / e^{−z}) = ln e^z = z = wᵀx

In the second step, we divided the numerator and the denominator by (1 + e^{−z})^{−1}.

The ratio p(y = 1|x)/(1 − p(y = 1|x)) is called the odds.
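The linearization is easy to verify numerically: applying the log-odds (logit) transform to the logistic response recovers z = wᵀx exactly. A small check:

```python
import numpy as np

z = np.linspace(-5.0, 5.0, 11)    # a grid of values of z = w'x
p = 1.0 / (1.0 + np.exp(-z))      # p(y = 1|x) under the logistic model
logit = np.log(p / (1.0 - p))     # log-odds
print(np.allclose(logit, z))      # True: the log-odds is linear in x
```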

Linear Separation

Assign to class 1 if p(y = 1|x) > p(y = 0|x), i.e. if

p(y = 1|x) / (1 − p(y = 1|x)) > 1

This is true if

ln[ p(y = 1|x) / (1 − p(y = 1|x)) ] > 0

So assign to class 1 if wᵀx > 0, and to class 0 otherwise.

Linear Decision Boundary

[Figure: two classes, A and B, in the (x1, x2) plane separated by the line wᵀx = w0 + w1x1 + w2x2 = 0.]
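Because the logit is monotone, thresholding the probability at 1/2 gives exactly the same assignments as thresholding wᵀx at 0. A short sketch with hypothetical weights and points:

```python
import numpy as np

w = np.array([-1.0, 2.0, -0.5])        # hypothetical [w0, w1, w2]
X = np.array([[1.0, 0.2, 0.1],         # each row is [1, x1, x2]
              [1.0, 1.5, 0.3],
              [1.0, 0.4, 2.0]])

z = X @ w
p = 1.0 / (1.0 + np.exp(-z))
print(np.array_equal(p > 0.5, z > 0))  # True: identical decision rule
print((z > 0).astype(int))             # assigned class labels
```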

Maximum Likelihood Estimation

y = 1 if heads, y = 0 if tails. Let µ = p(y = 1).

One coin flip:

p(y) = µ^y (1 − µ)^{1−y}

Note that p(1) = µ, p(0) = 1 − µ as required.

Sequence of N independent coin flips:

p(y) = p(y1, y2, . . . , yN) = ∏_{i=1}^{N} µ^{yi} (1 − µ)^{1−yi}

which defines the likelihood function when viewed as a function of µ.

Maximum Likelihood Estimation

In a sequence of 10 coin flips we observe

y = (1, 0, 1, 1, 0, 1, 1, 1, 1, 0).

The corresponding likelihood function is

p(y|µ) = µ · (1 − µ) · µ · µ · (1 − µ) · µ · µ · µ · µ · (1 − µ) = µ^7 (1 − µ)^3

The corresponding loglikelihood function is

ln p(y|µ) = ln(µ^7 (1 − µ)^3) = 7 ln µ + 3 ln(1 − µ)

Note: log(ab) = log a + log b, log(a^b) = b log a.



Computing the maximum

To determine the maximum we take the derivative and equate it to zero

d ln p(y|µ)/dµ = 7/µ − 3/(1 − µ) = 0

which yields the maximum likelihood estimate µ_ML = 0.7.

This is just the relative frequency of heads in the sample.

Note: d ln x/dx = 1/x.

Loglikelihood function for y = (1, 0, 1, 1, 0, 1, 1, 1, 1, 0)

[Figure: the loglikelihood 7 ln µ + 3 ln(1 − µ) plotted against µ on [0, 1]; the maximum is at µ = 0.7.]
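The coin-flip result is easy to confirm numerically: minimizing the negative loglikelihood −(7 ln µ + 3 ln(1 − µ)) over µ recovers the relative frequency 0.7. A sketch using scipy's bounded scalar minimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 1, 0])     # the observed coin flips

def neg_loglik(mu):
    # minus the loglikelihood: -(sum(y) ln(mu) + (N - sum(y)) ln(1 - mu))
    return -(y.sum() * np.log(mu) + (len(y) - y.sum()) * np.log(1.0 - mu))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)   # approximately 0.7, the relative frequency of heads
```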

ML estimation for logistic regression

Now the probability of success depends on xi:

µi = p(y = 1|xi) = (1 + e^{−wᵀxi})^{−1}
1 − µi = p(y = 0|xi) = (1 + e^{wᵀxi})^{−1}

We can represent its probability distribution as follows

p(yi) = µi^{yi} (1 − µi)^{1−yi},   yi ∈ {0, 1}; i = 1, . . . , N

ML estimation for logistic regression

Example

 i   xi   yi   p(yi)
 1    8    0   (1 + e^{w0+8w1})^{−1}
 2   12    0   (1 + e^{w0+12w1})^{−1}
 3   15    1   (1 + e^{−w0−15w1})^{−1}
 4   10    1   (1 + e^{−w0−10w1})^{−1}

p(y|w) = (1 + e^{w0+8w1})^{−1} × (1 + e^{w0+12w1})^{−1} × (1 + e^{−w0−15w1})^{−1} × (1 + e^{−w0−10w1})^{−1}

LR: likelihood function

Since the yi observations are independent:

p(y|w) = ∏_{i=1}^{N} p(yi) = ∏_{i=1}^{N} µi^{yi} (1 − µi)^{1−yi}

Or, taking minus the natural log:

− ln p(y|w) = − ln ∏_{i=1}^{N} µi^{yi} (1 − µi)^{1−yi}
            = − Σ_{i=1}^{N} { yi ln µi + (1 − yi) ln(1 − µi) }

This is called the cross-entropy error function. It is comparable to the sum of squared errors for regression problems.

LR: error function

Since for the logistic regression model:

µi = (1 + e^{−wᵀxi})^{−1}
1 − µi = (1 + e^{wᵀxi})^{−1}

we get

E(w) = Σ_{i=1}^{N} { yi ln(1 + e^{−wᵀxi}) + (1 − yi) ln(1 + e^{wᵀxi}) }

- Non-linear function of the parameters.
- Likelihood function globally concave.
- Numerical optimization required.
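Since numerical optimization is required, here is a minimal sketch that minimizes the cross-entropy error E(w) with a general-purpose scipy optimizer, using the four observations (xi = 8, 12, 15, 10 with yi = 0, 0, 1, 1) from the example above. This only illustrates the objective; it is not the algorithm used in the lecture.

```python
import numpy as np
from scipy.optimize import minimize

# The four observations from the example above.
x = np.array([8.0, 12.0, 15.0, 10.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])     # rows [1, xi], so w = [w0, w1]

def cross_entropy(w):
    """E(w) = sum_i { yi ln(1 + exp(-w'xi)) + (1 - yi) ln(1 + exp(w'xi)) }."""
    z = X @ w
    # np.logaddexp(0, t) computes ln(1 + exp(t)) in a numerically stable way.
    return np.sum(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))

res = minimize(cross_entropy, x0=np.zeros(2), method="BFGS")
print(res.x)                             # estimated [w0, w1]
print(1.0 / (1.0 + np.exp(-X @ res.x)))  # fitted probabilities for the four points
```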

logreg.pdf — May 4, 2010 — 4


Fitted Response Function

Substitute the maximum likelihood estimates into the response function to obtain the fitted response function

p̂(y = 1|x) = e^{w_MLᵀ x} / (1 + e^{w_MLᵀ x})

Example: Programming Assignment

Model the probability of successfully completing a programming assignment.

Explanatory variable: "programming experience".

We find w0 = −3.0597 and w1 = 0.1615, so

p̂(y = 1|xi) = e^{−3.0597 + 0.1615 xi} / (1 + e^{−3.0597 + 0.1615 xi})

For 14 months of programming experience:

p̂(y = 1|x = 14) = e^{−3.0597 + 0.1615(14)} / (1 + e^{−3.0597 + 0.1615(14)}) ≈ 0.31
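Plugging the estimates w0 = −3.0597 and w1 = 0.1615 from the slide into the fitted response function reproduces the value for 14 months of experience:

```python
import numpy as np

w0, w1 = -3.0597, 0.1615     # maximum likelihood estimates from the slide

def fitted_prob(x):
    """Fitted response p_hat(y = 1|x) = 1 / (1 + exp(-(w0 + w1*x)))."""
    return 1.0 / (1.0 + np.exp(-(w0 + w1 * x)))

print(round(fitted_prob(14), 2))   # 0.31
```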

Example: Programming Assignment

  i  month.exp  success  fitted        i  month.exp  success  fitted
  1     14         0     0.310262     16     13         0     0.276802
  2     29         0     0.835263     17      9         0     0.167100
  3      6         0     0.109996     18     32         1     0.891664
  4     25         1     0.726602     19     24         0     0.693379
  5     18         1     0.461837     20     13         1     0.276802
  6      4         0     0.082130     21     19         0     0.502134
  7     18         0     0.461837     22      4         0     0.082130
  8     12         0     0.245666     23     28         1     0.811825
  9     22         1     0.620812     24     22         1     0.620812
 10      6         0     0.109996     25      8         1     0.145815
 11     30         1     0.856299
 12     11         0     0.216980
 13     30         1     0.856299
 14      5         0     0.095154
 15     20         1     0.542404

Allocation Rule

Probability of the classes is equal when

−3.0597 + 0.1615x = 0

Solving for x we get x ≈ 18.95.

Allocation Rule:
x ≥ 19: assign to class 1
x < 19: assign to class 0
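A sketch that applies the fitted model and the x ≥ 19 allocation rule to the 25 observations in the table above; the resulting cross table and the 6/25 error rate should match the confusion matrix on the next slide.

```python
import numpy as np

# month.exp and success for the 25 observations in the table above.
months  = np.array([14, 29, 6, 25, 18, 4, 18, 12, 22, 6, 30, 11, 30, 5, 20,
                    13, 9, 32, 24, 13, 19, 4, 28, 22, 8])
success = np.array([0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
                    0, 0, 1, 0, 1, 0, 0, 1, 1, 1])

w0, w1 = -3.0597, 0.1615
fitted = 1.0 / (1.0 + np.exp(-(w0 + w1 * months)))
pred = (months >= 19).astype(int)                        # allocation rule from the slide
assert np.array_equal(pred, (fitted > 0.5).astype(int))  # same rule as thresholding at 0.5

# Cross table: rows = observed class, columns = predicted class.
conf = np.array([[np.sum((success == 0) & (pred == 0)), np.sum((success == 0) & (pred == 1))],
                 [np.sum((success == 1) & (pred == 0)), np.sum((success == 1) & (pred == 1))]])
print(conf)                      # expected: [[11, 3], [3, 8]]
print(np.mean(pred != success))  # expected error rate: 6/25 = 0.24
```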

Programming Assignment: Confusion Matrix

Cross table of observed and predicted class label:

        0    1
   0   11    3
   1    3    8

Row: observed, Column: predicted

Error rate: 6/25 = 0.24
Default: 11/25 = 0.44

Example: Conn's syndrome

Two possible causes:
a) Benign tumor (adenoma) of the adrenal cortex.
b) More diffuse affection of the adrenal glands (bilateral hyperplasia).

Pre-operative diagnosis on basis of
1. Sodium concentration (mmol/l)
2. CO2 concentration (mmol/l)



Conn's syndrome: the data

a = 1, b = 0

  i  sodium   co2  cause       i  sodium   co2  cause
  1   140.6  30.3    0        16   139.0  31.4    0
  2   143.0  27.1    0        17   144.8  33.5    0
  3   140.0  27.0    0        18   145.7  27.4    0
  4   146.0  33.0    0        19   144.0  33.0    0
  5   138.7  24.1    0        20   143.5  27.5    0
  6   143.7  28.0    0        21   140.3  23.4    1
  7   137.3  29.6    0        22   141.2  25.8    1
  8   141.0  30.0    0        23   142.0  22.0    1
  9   143.8  32.2    0        24   143.5  27.8    1
 10   144.6  29.5    0        25   139.7  28.0    1
 11   139.5  26.0    0        26   141.1  25.0    1
 12   144.0  33.7    0        27   141.0  26.0    1
 13   145.0  33.0    0        28   140.5  27.0    1
 14   140.2  29.1    0        29   140.0  26.0    1
 15   144.7  27.4    0        30   140.0  25.6    1

Conn's Syndrome: Plot of Data

[Figure: scatter plot of the data, co2 against sodium, with each observation marked a or b according to its cause.]

Maximum Likelihood Estimation

The maximum likelihood estimates are:

w0 = 36.6874320
w1 = −0.1164658
w2 = −0.7626711

Assign to group a if

36.69 − 0.12 × sodium − 0.76 × CO2 > 0

and to group b otherwise.

Conn's Syndrome: Allocation Rule

[Figure: scatter plot of co2 against sodium with the linear decision boundary 36.69 − 0.12 × sodium − 0.76 × CO2 = 0 superimposed on the data.]
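A sketch of reproducing this fit by unpenalized maximum likelihood with statsmodels (an assumed choice of library, not the software used in the lecture); the estimates should come out close to the values above.

```python
import numpy as np
import statsmodels.api as sm

# The 30 observations from the "Conn's syndrome: the data" table (cause: a = 1, b = 0).
sodium = np.array([140.6, 143.0, 140.0, 146.0, 138.7, 143.7, 137.3, 141.0, 143.8, 144.6,
                   139.5, 144.0, 145.0, 140.2, 144.7, 139.0, 144.8, 145.7, 144.0, 143.5,
                   140.3, 141.2, 142.0, 143.5, 139.7, 141.1, 141.0, 140.5, 140.0, 140.0])
co2    = np.array([30.3, 27.1, 27.0, 33.0, 24.1, 28.0, 29.6, 30.0, 32.2, 29.5,
                   26.0, 33.7, 33.0, 29.1, 27.4, 31.4, 33.5, 27.4, 33.0, 27.5,
                   23.4, 25.8, 22.0, 27.8, 28.0, 25.0, 26.0, 27.0, 26.0, 25.6])
cause  = np.array([0]*20 + [1]*10)

# Maximum likelihood fit of p(cause = 1 | sodium, co2).
X = sm.add_constant(np.column_stack([sodium, co2]))   # columns: 1, sodium, co2
fit = sm.Logit(cause, X).fit(disp=0)
print(fit.params)      # should be close to [36.69, -0.12, -0.76]

# Allocation rule from the slide: assign to group a (cause = 1) if w'x > 0.
print((X @ fit.params > 0).astype(int))
```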

Conn's Syndrome: Confusion Matrix

Cross table of observed and predicted class label:

        a    b
   a    7    3
   b    2   18

Row: observed, Column: predicted

Error rate: 5/30 = 1/6
Default: 1/3

Conn's Syndrome: Line with lower empirical error

[Figure: scatter plot of co2 against sodium with a different separating line that attains a lower empirical error rate on these data.]



Likelihood and Error Rate

Likelihood maximization is not the same as error rate minimization!

  i   yi   p̂1(yi = 1)   p̂2(yi = 1)
  1    0      0.9           0.6
  2    0      0.4           0.1
  3    1      0.6           0.9
  4    1      0.55          0.4

Which model has the lower error rate?
Which one the higher likelihood?

Quadratic Model

  Coefficient      Value
  (Intercept)   -13100.69
  sodium            177.42
  CO2                41.36
  sodium^2           -0.60
  CO2^2              -0.12
  sodium × CO2       -0.25

Cross table of observed (row) and predicted class label:

        a    b
   a    8    2
   b    2   18
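The two questions on the "Likelihood and Error Rate" slide can be answered directly by computing both quantities for the four observations:

```python
import numpy as np

y  = np.array([0, 0, 1, 1])
p1 = np.array([0.9, 0.4, 0.6, 0.55])   # model 1: p_hat_1(yi = 1)
p2 = np.array([0.6, 0.1, 0.9, 0.4])    # model 2: p_hat_2(yi = 1)

for name, p in [("model 1", p1), ("model 2", p2)]:
    errors = np.sum((p > 0.5).astype(int) != y)       # 0/1 allocation at threshold 1/2
    likelihood = np.prod(np.where(y == 1, p, 1 - p))  # product of p(yi)
    print(name, "errors:", errors, "likelihood:", round(likelihood, 4))

# model 1: 1 error,  likelihood 0.0198
# model 2: 2 errors, likelihood 0.1296
# -> model 1 has the lower error rate, but model 2 has the higher likelihood.
```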

Conn's Syndrome: Quadratic Specification

[Figure: scatter plot of co2 against sodium illustrating the quadratic specification of the model.]
