Unit - II Regression-LogisticRegressionModels
In the linear regression model \(y = X\beta + \varepsilon\), there are two types of variables: explanatory variables \(X_1, X_2, \ldots, X_k\) and a study variable \(y\). These variables can be measured on a continuous scale or as indicator variables. When the explanatory variables are qualitative, their values are expressed as indicator variables, and dummy variable models are used.
When the study variable is qualitative, its values can be expressed using an indicator variable taking only two possible values, 0 and 1. In such a case, logistic regression is used. For example, \(y\) can denote outcomes like success or failure, yes or no, like or dislike, which can be coded by the two values 0 and 1.
The study variable takes two values, \(y_i = 0\) or \(1\). Assume that \(y_i\) follows a Bernoulli distribution with
\[
y_i = \begin{cases} 1 & \text{with } P(y_i = 1) = \pi_i \\ 0 & \text{with } P(y_i = 0) = 1 - \pi_i. \end{cases}
\]
Consider the model \(y_i = x_i'\beta + \varepsilon_i\). Assuming \(E(\varepsilon_i) = 0\),
\[
E(y_i) = 1 \cdot \pi_i + 0 \cdot (1 - \pi_i) = \pi_i,
\]
so that
\[
E(y_i) = x_i'\beta = \pi_i = P(y_i = 1).
\]
If \(y_i = 1\), then \(\varepsilon_i = 1 - x_i'\beta\), and if \(y_i = 0\), then \(\varepsilon_i = -x_i'\beta\).
Recall that earlier, \(\varepsilon_i\) was assumed to follow a normal distribution when \(y\) was not an indicator variable. When \(y\) is an indicator variable, \(\varepsilon_i\) takes only two values, so it cannot be assumed to follow a normal distribution.
In the usual regression model, the errors are homoskedastic, i.e., \(\operatorname{Var}(\varepsilon_i) = \sigma^2\), and so \(\operatorname{Var}(y_i) = \sigma^2\). When \(y\) is an indicator variable,
\[
\begin{aligned}
\operatorname{Var}(y_i) &= E\left[y_i - E(y_i)\right]^2 \\
&= (1 - \pi_i)^2 \pi_i + (0 - \pi_i)^2 (1 - \pi_i) \\
&= \pi_i (1 - \pi_i)\left[(1 - \pi_i) + \pi_i\right] \\
&= \pi_i (1 - \pi_i) \\
&= E(y_i)\left[1 - E(y_i)\right] \\
&= \sigma_{y_i}^2.
\end{aligned}
\]
Thus \(\operatorname{Var}(y_i)\) depends on \(\pi_i\) and is a function of the mean of \(y_i\); the errors are heteroskedastic. Moreover, since \(E(y_i) = \pi_i\) and \(\pi_i\) is a probability, \(0 \le \pi_i \le 1\), and thus there is a constraint on \(E(y_i)\), namely \(0 \le E(y_i) \le 1\). This puts a strong constraint on the choice of the linear response function: one cannot fit a model in which the predicted values lie outside the interval \([0, 1]\).
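The identity \(\operatorname{Var}(y_i) = \pi_i(1 - \pi_i)\) derived above can be checked numerically. This is a minimal sketch with a few illustrative values of \(\pi\); the function name is ours, not from the text.

```python
# Check Var(y) = E(y^2) - [E(y)]^2 = pi(1 - pi) for a Bernoulli variable.
def bernoulli_var(pi):
    e_y = 1 * pi + 0 * (1 - pi)             # E(y) = pi
    e_y2 = 1 ** 2 * pi + 0 ** 2 * (1 - pi)  # E(y^2) = pi, since y^2 = y
    return e_y2 - e_y ** 2                  # pi - pi^2 = pi(1 - pi)

for pi in (0.1, 0.5, 0.9):
    assert abs(bernoulli_var(pi) - pi * (1 - pi)) < 1e-12
```

The variance is largest at \(\pi = 1/2\) and vanishes at \(\pi = 0\) or \(1\), which is exactly why the homoskedasticity assumption fails for an indicator response.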
When \(y\) is a dichotomous variable, empirical evidence suggests that the response function \(E(y)\), defined on the whole real line and mapped into \([0, 1]\), has a sigmoid shape: a nonlinear S-shaped curve.

[Figure: two S-shaped curves of \(E(y)\) against \(x\), one increasing and one decreasing, each bounded between 0 and 1.]
The link function in a generalized linear model relates the linear predictor \(\eta_i\) to the mean response \(\mu_i\). Thus
\[
g(\mu_i) = \eta_i \quad \text{or} \quad \mu_i = g^{-1}(\eta_i).
\]
In the usual linear model based on a normally distributed study variable, the link \(g(\mu_i) = \mu_i\) is used and is called the identity link. A link function maps the range of \(\mu_i\) onto the whole real line and should provide a good fit and a meaningful interpretation. For a Bernoulli study variable, the logistic response function is used:
\[
\pi = \frac{\exp(\eta)}{1 + \exp(\eta)},
\]
or \(\pi\left[1 + \exp(\eta)\right] = \exp(\eta)\),
or \(\dfrac{\pi}{1 - \pi} = \exp(\eta)\),
or \(\eta = \ln \dfrac{\pi}{1 - \pi}\).
The transformation \(\eta = \ln[\pi/(1 - \pi)]\) is called the logit link.
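The logit transformation and its inverse can be sketched as below; the helper names and test values are ours, chosen only for illustration.

```python
import math

# eta = ln[pi / (1 - pi)] maps (0, 1) onto the whole real line (logit link);
# pi = exp(eta) / (1 + exp(eta)) maps it back (logistic function).
def logit(pi):
    return math.log(pi / (1.0 - pi))

def inv_logit(eta):
    return math.exp(eta) / (1.0 + math.exp(eta))

# round trip: the two transformations are inverses of each other
for pi in (0.05, 0.3, 0.5, 0.95):
    assert abs(inv_logit(logit(pi)) - pi) < 1e-12
```

Note the symmetry \( \pi(\eta) + \pi(-\eta) = 1 \), a property shared by the probit transformation discussed next.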
Note: Similar to the logit function, there are other functions that have the same shape as the logistic function, and \(\pi\) can also be transformed through them. Two such popular functions are the probit transformation and the complementary log-log transformation. The probit transformation is based on transforming \(\pi\) using the cumulative distribution function of the normal distribution; based on this is the probit regression model.
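As a sketch of the remark above, the probit transformation replaces the logistic function by the standard normal CDF; both are S-shaped maps from the real line into (0, 1). The code below builds the normal CDF from `math.erf`; the comparison values are illustrative.

```python
import math

def normal_cdf(z):
    # standard normal CDF via the error function: Phi(z) = (1 + erf(z/sqrt 2)) / 2
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))

# both response functions pass through 1/2 at 0 and are increasing
assert abs(normal_cdf(0.0) - 0.5) < 1e-12
assert abs(logistic(0.0) - 0.5) < 1e-12
assert normal_cdf(-2.0) < normal_cdf(0.0) < normal_cdf(2.0)
```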
Here the \(y_i\)'s are independent Bernoulli random variables with parameter \(\pi_i\), with
\[
E(y_i) = \pi_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}.
\]
The log-likelihood can be written as
\[
\ln L = \sum_{i=1}^n y_i \ln\frac{\pi_i}{1 - \pi_i} + \sum_{i=1}^n \ln(1 - \pi_i).
\]
Since
\[
\pi_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)},
\qquad
1 - \pi_i = \frac{1}{1 + \exp(x_i'\beta)},
\qquad
\frac{\pi_i}{1 - \pi_i} = \exp(x_i'\beta),
\qquad
\ln\frac{\pi_i}{1 - \pi_i} = x_i'\beta,
\]
so
\[
\ln L = \sum_{i=1}^n y_i x_i'\beta - \sum_{i=1}^n \ln\left[1 + \exp(x_i'\beta)\right].
\]
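The simplification above can be verified numerically: the Bernoulli log-likelihood written in terms of \(\pi_i\) equals the form \(\sum_i y_i x_i'\beta - \sum_i \ln[1 + \exp(x_i'\beta)]\). The data and coefficient vector below are hypothetical.

```python
import math

def log_lik_direct(y, X, beta):
    # sum_i [ y_i ln(pi_i) + (1 - y_i) ln(1 - pi_i) ]
    total = 0.0
    for yi, xi in zip(y, X):
        eta = sum(b * x for b, x in zip(beta, xi))   # x_i' beta
        pi = math.exp(eta) / (1.0 + math.exp(eta))
        total += yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total

def log_lik_simplified(y, X, beta):
    # sum_i y_i x_i' beta - sum_i ln(1 + exp(x_i' beta))
    total = 0.0
    for yi, xi in zip(y, X):
        eta = sum(b * x for b, x in zip(beta, xi))
        total += yi * eta - math.log(1.0 + math.exp(eta))
    return total

y = [1, 0, 1, 1]                                       # hypothetical responses
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]  # intercept + one covariate
beta = [0.3, 0.8]                                      # arbitrary coefficient values
assert abs(log_lik_direct(y, X, beta) - log_lik_simplified(y, X, beta)) < 1e-10
```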
Suppose repeated observations are available at each level of the \(x\)-variables. Let \(y_i\) be the number of 1's observed at the \(i\)th level and \(n_i\) be the number of trials there. Then
\[
\ln L = \sum_{i=1}^n y_i \ln\frac{\pi_i}{1 - \pi_i} + \sum_{i=1}^n n_i \ln(1 - \pi_i).
\]
If \(\Omega\) denotes the covariance matrix of the observations, then asymptotically
\[
E(\hat{\beta}) = \beta, \qquad V(\hat{\beta}) = (X'\Omega^{-1}X)^{-1}.
\]
The fitted linear predictor is \(\hat{\eta}_i = x_i'\hat{\beta}\), and the fitted value is
\[
\hat{y}_i = \hat{\pi}_i = \frac{\exp(\hat{\eta}_i)}{1 + \exp(\hat{\eta}_i)} = \frac{1}{1 + \exp(-\hat{\eta}_i)} = \frac{1}{1 + \exp(-x_i'\hat{\beta})}.
\]
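There is no closed-form expression for \(\hat{\beta}\); it is obtained iteratively. Below is a minimal Newton-Raphson sketch for the one-covariate model (intercept plus slope) on hypothetical data; it is an illustration, not a production fitter (no convergence or separation checks).

```python
import math

def fit_logistic(x, y, iters=25):
    """Newton-Raphson for the model pi = 1 / (1 + exp(-(b0 + b1 x)))."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0                    # score vector X'(y - pi)
        i00 = i01 = i11 = 0.0            # information matrix X' W X
        for xi, yi in zip(x, y):
            pi = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = pi * (1.0 - pi)          # Var(y_i) = pi_i (1 - pi_i)
            g0 += yi - pi
            g1 += (yi - pi) * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        det = i00 * i11 - i01 * i01
        b0 += (i11 * g0 - i01 * g1) / det   # beta <- beta + I^{-1} * score
        b1 += (i00 * g1 - i01 * g0) / det
    return b0, b1

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]       # hypothetical covariate values
y = [0, 0, 1, 0, 1, 1]                   # hypothetical 0/1 responses
b0, b1 = fit_logistic(x, y)
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
```

Every fitted value lies strictly inside (0, 1), as the constraint on \(E(y_i)\) requires.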
After fitting the model, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are obtained as the estimators of \(\beta_0\) and \(\beta_1\), respectively. The fitted linear predictor at \(x = x_i\) is
\[
\hat{\eta}(x_i) = \hat{\beta}_0 + \hat{\beta}_1 x_i,
\]
which is the log-odds at \(x = x_i\). The fitted value at \(x = x_i + 1\) is
\[
\hat{\eta}(x_i + 1) = \hat{\beta}_0 + \hat{\beta}_1 (x_i + 1),
\]
which is the log-odds at \(x = x_i + 1\). Thus
\[
\hat{\beta}_1 = \hat{\eta}(x_i + 1) - \hat{\eta}(x_i)
= \ln \operatorname{odds}(x_i + 1) - \ln \operatorname{odds}(x_i)
= \ln \frac{\operatorname{odds}(x_i + 1)}{\operatorname{odds}(x_i)},
\]
so
\[
\frac{\operatorname{odds}(x_i + 1)}{\operatorname{odds}(x_i)} = \exp(\hat{\beta}_1).
\]
This is termed the odds ratio: the estimated multiplicative change in the odds of success when the explanatory variable increases by one unit.
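A quick numerical check of the odds-ratio interpretation, with hypothetical values of \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(x\):

```python
import math

def odds(p):
    return p / (1.0 - p)

def pi_hat(b0, b1, x):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

b0_hat, b1_hat = -0.2, 0.7           # hypothetical fitted coefficients
x = 1.5                              # arbitrary value of the covariate

# a one-unit increase in x multiplies the odds by exp(b1_hat)
ratio_1 = odds(pi_hat(b0_hat, b1_hat, x + 1)) / odds(pi_hat(b0_hat, b1_hat, x))
assert abs(ratio_1 - math.exp(b1_hat)) < 1e-9

# an m-unit change multiplies the odds by exp(m * b1_hat)
m = 3
ratio_m = odds(pi_hat(b0_hat, b1_hat, x + m)) / odds(pi_hat(b0_hat, b1_hat, x))
assert abs(ratio_m - math.exp(m * b1_hat)) < 1e-9
```

The ratio does not depend on the chosen \(x\), which is what makes \(\exp(\hat{\beta}_1)\) a clean summary of the effect.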
When there is more than one explanatory variable in the model, the interpretation of the \(\beta_j\)'s is similar to the single-variable case: the odds ratio \(\exp(\hat{\beta}_j)\) is associated with the explanatory variable \(x_j\), keeping the other explanatory variables constant. This parallels the interpretation of \(\beta_j\) in multiple linear regression as the change in \(E(y)\) per unit change in \(x_j\), holding the other variables fixed. If there is an \(m\)-unit change in the explanatory variable, then the estimated odds ratio is \(\exp(m\hat{\beta}_j)\).
A model with as many parameters as observations, which perfectly fits the sample data, is termed a saturated model. The statistic that compares the log-likelihoods of the fitted and saturated models is called the model deviance. It is defined as
\[
\lambda(\beta) = 2 \ln L(\text{saturated model}) - 2 \ln L(\hat{\beta}).
\]
In the case of the logistic regression model, \(y_i = 0\) or \(1\) and the \(\pi_i\)'s are completely unrestricted, so the saturated model takes \(\hat{\pi}_i = y_i\) and \(\ln L(\text{saturated model}) = 0\).

Assuming that the logistic regression function is correct, the large-sample distribution of the likelihood ratio statistic \(\lambda(\beta)\) is approximately \(\chi^2(n - p)\) when \(n\) is large. A large value of \(\lambda(\beta)\) implies the model is incorrect; a small value implies that the model fits well and is nearly as good as the saturated model. Note that, generally, the fitted model has fewer parameters (\(p\)) than the saturated model, which is based on all \(n\) parameters. Thus, at the \(\alpha\%\) level of significance, the fitted model is judged inadequate when \(\lambda(\beta)\) exceeds the corresponding \(\chi^2(n - p)\) critical value.
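For 0/1 data the deviance therefore reduces to \(-2 \ln L(\hat{\beta})\), since \(\ln L(\text{saturated}) = 0\). A small sketch with hypothetical fitted probabilities:

```python
import math

def deviance(y, pi_hat):
    # -2 ln L(beta_hat) for 0/1 responses; ln L(saturated) = 0 here
    ll = sum(yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
             for yi, p in zip(y, pi_hat))
    return -2.0 * ll

y = [1, 0, 1, 1, 0]                       # hypothetical 0/1 responses
rough_fit = [0.8, 0.3, 0.6, 0.9, 0.2]     # assumed fitted probabilities
close_fit = [0.99, 0.01, 0.99, 0.99, 0.01]

# deviance is nonnegative and shrinks as the fit approaches the data
assert deviance(y, close_fit) < deviance(y, rough_fit)
```

In practice the computed deviance would be compared with the \(\chi^2(n - p)\) critical value at the chosen level of significance.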