Log Reg

Regression is a statistical modeling technique used to evaluate relationships between variables, where one variable is dependent on one or more independent variables. Linear regression fits a linear equation to continuous data to minimize the sum of squared errors between observed and predicted values. Logistic regression applies a sigmoid curve to binary dependent data and uses linear regression on the logit transform of the odds to model relationships between predictors and the log odds of the dependent variable.


Regression

• A form of statistical modeling that attempts to evaluate the relationship between one variable (termed the dependent variable) and one or more other variables (termed the independent variables). It is a form of global analysis, as it produces a single equation for the relationship.
• A model for predicting one variable from another.
Linear Regression
• Regression used to fit a linear model to data where the dependent variable is continuous:

  Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \varepsilon

• Given a set of points (Xᵢ, Yᵢ), we wish to find a linear function (a line in 2 dimensions) that "goes through" these points.
• In general, the points are not exactly aligned:
  – Find the line that best fits the points.
Residual

• Error or residual:
  – Observed value − Predicted value
(Figure: scatter plot of Observed values with a fitted Linear (Observed) trend line.)
Sum-squared Error (SSE)

SSE = \sum_{y} \left(y_{observed} - y_{predicted}\right)^2

TSS = \sum_{y} \left(y_{observed} - \bar{y}_{observed}\right)^2

R^2 = 1 - \frac{SSE}{TSS}
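These three quantities translate directly into code. A minimal Python sketch (the function names are mine, not from the slides):

```python
def sse(y_obs, y_pred):
    """Sum of squared errors between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))

def tss(y_obs):
    """Total sum of squares: squared deviations from the mean."""
    mean = sum(y_obs) / len(y_obs)
    return sum((o - mean) ** 2 for o in y_obs)

def r_squared(y_obs, y_pred):
    """Coefficient of determination: R^2 = 1 - SSE/TSS."""
    return 1 - sse(y_obs, y_pred) / tss(y_obs)
```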
What is Best Fit?

• The smaller the SSE, the better the fit.
• Hence, linear regression attempts to minimize SSE (or, equivalently, to maximize R²).
• Assume 2 dimensions:

  Y = \beta_0 + \beta_1 X
Analytical Solution

0 
 y   x1

1 
 xy   x y
n

n x    x
2
2
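The closed-form solution is only a few lines of Python. A sketch (the name fit_line is my own):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1*x via the closed-form solution."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b0 = (sum_y - b1 * sum_x) / n  # equivalently: mean(y) - b1 * mean(x)
    return b0, b1
```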
Example (I)

1 
 xy   x y
n

n x    x
2
2
x y x^2 xy
1.20 4.00 1.44 4.80 7  223.61  24.10  58.00

2.30 5.60 5.29 12.88 7  95.31  24.10 2
1565.27 1397.80
3.10 7.90 9.61 24.49 
667.17  580.81
3.40 8.00 11.56 27.20 167.47
 1.94
4.00 10.10 16.00 40.40 86.36
4.60 10.40 21.16 47.84
5.50 12.00 30.25 66.00 0 
 y   x
1

n
24.10 58.00 95.31 223.61 58.00 1.94  24.10

7
Target: y=2x+1.5 
11.27
1.61
7
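Feeding this data to the fit_line sketch from the Analytical Solution slide reproduces the coefficients:

```python
xs = [1.20, 2.30, 3.10, 3.40, 4.00, 4.60, 5.50]
ys = [4.00, 5.60, 7.90, 8.00, 10.10, 10.40, 12.00]

b0, b1 = fit_line(xs, ys)
print(b0, b1)  # ~1.61 and ~1.94, close to the generating target y = 2x + 1.5
```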
Example (II)

(Figure: scatter plot of the Observed points from Example (I); x ranges from 0 to 6, y from 0 to 14.)
Example (III)

R^2 = 1 - \frac{SSE}{TSS} = 1 - \frac{0.975}{47.369} = 0.98
Logistic Regression

• Regression used to fit a curve to data in which the dependent variable is binary, or dichotomous.
• Typical application: medicine
  – We might want to predict response to treatment, where we might code survivors as 1 and those who don't survive as 0.
Example

(Figure: scatter plot of NewOut against SurvRate with a fitted line.)

Observations: for each value of SurvRate, the number of dots is the number of patients with that value of NewOut.

Regression: standard linear regression.

Problem: extending the regression line a few units left or right along the X axis produces predicted probabilities that fall outside of [0, 1].
A Better Solution

Regression curve: a sigmoid function!
(bounded by the asymptotes y = 0 and y = 1)
Odds
• Given some event with probability p of being 1, the odds of that event are given by:

  odds = p / (1 − p)

• Consider the following data:

                     Delinquent
                   Yes     No    Total
  Testosterone
    Normal         402    3614    4016
    High           101     345     446
  Total            503    3959    4462

• The odds of being delinquent if you are in the Normal group are:

  p_delinquent / (1 − p_delinquent) = (402/4016) / (1 − 402/4016) = 0.1001 / 0.8999 = 0.111
Odds Ratio
• The odds of not being delinquent in the Normal group are the reciprocal of this:
  – 0.8999 / 0.1001 = 8.99
• Now, for the High testosterone group:
  – odds(delinquent) = 101/345 = 0.293
  – odds(not delinquent) = 345/101 = 3.416
• When we go from Normal to High, the odds of being delinquent nearly triple:
  – Odds ratio: 0.293 / 0.111 = 2.64
  – The odds of delinquency are 2.64 times higher with high testosterone levels
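The same arithmetic as a minimal Python sketch, using the counts from the table above (variable names are mine):

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

p_normal = 402 / 4016           # P(delinquent | normal testosterone)
p_high = 101 / 446              # P(delinquent | high testosterone)

odds_normal = odds(p_normal)    # ~0.111
odds_high = odds(p_high)        # ~0.293 (equals 101/345)
print(odds_high / odds_normal)  # odds ratio: ~2.64
```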
Logit Transform

• The logit is the natural log of the odds:

  logit(p) = ln(odds) = ln(p / (1 − p))


Logistic Regression

• In logistic regression, we seek a model:

  logit(p) = \beta_0 + \beta_1 X

• That is, the log odds (logit) is assumed to be linearly related to the independent variable X.
• So, now we can focus on solving an ordinary (linear) regression!
Recovering Probabilities

• Solving logit(p) = β₀ + β₁X for p:

  p = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}

which gives p as a sigmoid function!
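The logit and its inverse, the sigmoid, in a minimal Python sketch:

```python
import math

def logit(p):
    """Log odds of probability p."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: recovers p from the log odds z."""
    return 1 / (1 + math.exp(-z))

assert abs(sigmoid(logit(0.31)) - 0.31) < 1e-12  # round trip
```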


Logistic Response Function

• When the response variable is binary, the shape of the response function is often sigmoidal.
Interpretation of β₁
• Let:
  – odds₁ = odds for value X (p/(1−p))
  – odds₂ = odds for value X + 1 unit
• Then:

  \frac{odds_2}{odds_1} = \frac{e^{\beta_0 + \beta_1 (X+1)}}{e^{\beta_0 + \beta_1 X}} = \frac{e^{\beta_0 + \beta_1 X} \, e^{\beta_1}}{e^{\beta_0 + \beta_1 X}} = e^{\beta_1}

• Hence, the exponent of the slope describes the proportionate rate at which the predicted odds ratio changes with each successive unit of X.
Sample Calculations
• Suppose a cancer study yields:
  – log odds = −2.6837 + 0.0812 · SurvRate
• Consider a patient with SurvRate = 40:
  – log odds = −2.6837 + 0.0812(40) = 0.5643
  – odds = e^0.5643 = 1.758
  – the patient is 1.758 times more likely to be improved than not
• Consider another patient with SurvRate = 41:
  – log odds = −2.6837 + 0.0812(41) = 0.6455
  – odds = e^0.6455 = 1.907
  – this patient's odds are 1.907/1.758 = 1.0846 times (or 8.5%) better than those of the previous patient
• Using probabilities:
  – p₄₀ = 0.6374 and p₄₁ = 0.6560
  – The improvement looks different when expressed in odds than in probabilities
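These calculations are easy to verify in Python. A sketch using the study's reported coefficients (the helper name is mine):

```python
import math

def improvement_odds(surv_rate):
    """Odds of improvement from the fitted log-odds model."""
    log_odds = -2.6837 + 0.0812 * surv_rate
    return math.exp(log_odds)

o40, o41 = improvement_odds(40), improvement_odds(41)
print(o40, o41)         # ~1.758 and ~1.907
print(o41 / o40)        # ~1.0846, i.e. exp(0.0812)
print(o40 / (1 + o40))  # p40 ~0.6374
print(o41 / (1 + o41))  # p41 ~0.6560
```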
Example 1 (I)

• A systems analyst studied the effect of computer programming experience on the ability to complete a task within a specified time.
• Twenty-five persons were selected for the study, with varying amounts of computer experience (in months).
• Results are coded in binary fashion: Y = 1 if the task was completed successfully; Y = 0 otherwise.

(Figure: scatter of outcomes against experience with a loess fit; loess is a form of local regression.)
Example 1 (II)

• Results from a standard package give:
  – β₀ = −3.0597 and β₁ = 0.1615
• Estimated logistic regression function:

  \hat{p} = \frac{1}{1 + e^{3.0597 - 0.1615 X}}

• For example, the fitted value for X = 14 is:

  \hat{p} = \frac{1}{1 + e^{3.0597 - 0.1615(14)}} = 0.31

(Estimated probability that a person with 14 months of experience will successfully complete the task)
Example 1 (III)
• We know that the probability of success increases sharply with experience:
  – Odds ratio: exp(β₁) = e^0.1615 = 1.175
  – Odds increase by 17.5% with each additional month of experience
• A unit increase of one month is quite small, and we might want to know the change in odds for a longer difference in time:
  – For c units of X: exp(cβ₁)
Example 1 (IV)

• Suppose we want to compare individuals with relatively little experience to those with extensive experience, say 10 months versus 25 months (c = 15):
  – Odds ratio: e^(15 × 0.1615) = 11.3
  – The odds of completing the task increase 11-fold!
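A short Python check of the arithmetic in Example 1, using the reported coefficients:

```python
import math

b0, b1 = -3.0597, 0.1615

# Fitted probability of completing the task at X = 14 months of experience
p14 = 1 / (1 + math.exp(-(b0 + b1 * 14)))
print(p14)                # ~0.31

print(math.exp(b1))       # odds ratio per month: ~1.175
print(math.exp(15 * b1))  # odds ratio for c = 15 months: ~11.3
```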
Example 2 (I)

• In a study of the effectiveness of coupons offering a price reduction, 1,000 homes were selected and coupons were mailed.
• Coupon price reductions: 5, 10, 15, 20, and 30 dollars.
• 200 homes were assigned at random to each coupon value.
• X: amount of price reduction
• Y: binary variable indicating whether or not the coupon was redeemed
Example 2 (II)

• Fitted response function:
  – β₀ = −2.04 and β₁ = 0.097
• Odds ratio: exp(β₁) = e^0.097 = 1.102
• The odds of a coupon being redeemed are estimated to increase by 10.2% with each $1 increase in coupon value (i.e., $1 in price reduction).
Putting it to Work

• For each value of X, you may not have a probability but rather a number of <x,y> pairs, from which you can extract frequencies and hence probabilities (a minimal aggregation sketch follows this list):
  – Raw data: <12,0>, <12,1>, <14,0>, <12,1>, <14,1>, <14,1>, <12,0>, <12,0>
  – Probability data (2nd entry is p(y=1), 3rd entry is the number of occurrences in the raw data): <12, 0.4, 5>, <14, 0.67, 3>
  – Odds ratio data…
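A minimal Python sketch of this aggregation, using the raw pairs from the slide:

```python
from collections import defaultdict

# Raw <x, y> pairs
raw = [(12, 0), (12, 1), (14, 0), (12, 1), (14, 1), (14, 1), (12, 0), (12, 0)]

counts = defaultdict(lambda: [0, 0])  # x -> [number of y=1, total occurrences]
for x, y in raw:
    counts[x][0] += y
    counts[x][1] += 1

# x -> (p(y=1), number of occurrences)
prob_data = {x: (ones / total, total) for x, (ones, total) in counts.items()}
print(prob_data)  # {12: (0.4, 5), 14: (0.667, 3)}
```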
Coronary Heart Disease (I)

             Coronary Heart Disease
Age Group        No    Yes    Total
1 (20-29)         9      1       10
2 (30-34)        13      2       15
3 (35-39)         9      3       12
4 (40-44)        10      5       15
5 (45-49)         7      6       13
6 (50-54)         3      5        8
7 (55-59)         4     13       17
8 (60-69)         2      8       10
Total            57     43      100
Coronary Heart Disease (II)

Age Group  p(CHD=1)  odds  log odds  #occ
1 0.1000 0.1111 -2.1972 10
2 0.1333 0.1538 -1.8718 15
3 0.2500 0.3333 -1.0986 12
4 0.3333 0.5000 -0.6931 15
5 0.4615 0.8571 -0.1542 13
6 0.6250 1.6667 0.5108 8
7 0.7647 3.2500 1.1787 17
8 0.8000 4.0000 1.3863 10
Coronary Heart Disease (III)

X (AG)  Y (log odds)  X^2  XY  #occ
1 -2.1972 1.0000 -2.1972 10
2 -1.8718 4.0000 -3.7436 15
3 -1.0986 9.0000 -3.2958 12
4 -0.6931 16.0000 -2.7726 15
5 -0.1542 25.0000 -0.7708 13
6 0.5108 36.0000 3.0650 8
7 1.1787 49.0000 8.2506 17
8 1.3863 64.0000 11.0904 10
Σ (weighted)  448  -37.6471  2504.0000  106.3981  100

Note: the sums reflect the number of occurrences
(e.g., Σ X = X₁·#occ(X₁) + … + X₈·#occ(X₈))
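Since the sums are already occurrence-weighted, the closed-form formulas from the Analytical Solution slide apply directly with n = 100. A Python sketch that reproduces the coefficients reported on the next slide:

```python
# Occurrence-weighted sums from the table (n counts all 100 patients)
n = 100
sum_x, sum_y = 448, -37.6471
sum_xx, sum_xy = 2504.0, 106.3981

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b0 = sum_y / n - b1 * (sum_x / n)
print(b0, b1)  # ~-2.856 and ~0.5535
```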
Coronary Heart Disease (IV)

• Results from regression:
  – β₀ = −2.856 and β₁ = 0.5535

Age Group  p(CHD=1)  est. p
1 0.1000 0.0909
2 0.1333 0.1482
3 0.2500 0.2323
4 0.3333 0.3448
5 0.4615 0.4778
6 0.6250 0.6142
7 0.7647 0.7346
8 0.8000 0.8280

SSE = 0.0028
TSS = 0.5265
R² = 0.9946
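A quick Python check of these fit statistics, using the observed proportions from the table and the inverse logit (a sketch; values copied from the slides):

```python
import math

b0, b1 = -2.856, 0.5535
obs = [0.1000, 0.1333, 0.2500, 0.3333, 0.4615, 0.6250, 0.7647, 0.8000]

# Estimated p for age groups 1..8 via the inverse logit (sigmoid)
est = [1 / (1 + math.exp(-(b0 + b1 * ag))) for ag in range(1, 9)]

sse = sum((o - e) ** 2 for o, e in zip(obs, est))
mean = sum(obs) / len(obs)
tss = sum((o - mean) ** 2 for o in obs)
print(sse, tss, 1 - sse / tss)  # ~0.0028, ~0.5265, R^2 ~0.9946
```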
Summary

• Regression is a powerful data mining technique:
  – It provides prediction
  – It offers insight into the relative power of each variable
• We have focused on the case of a single independent variable
  – What about the general case?
