Logistic Regression
Diploma Thesis
submitted to the
Department of Mathematics
Faculty of Sciences
University of Fribourg Switzerland
Diploma in Mathematics
by
Michael Beer
???
3 Consistency
3.1 A Different Approach
3.2 Condition on the Relationship between p and n
3.3 Reformulation, Assessment, and Comparison
4 Asymptotic Normality
5 Case Study
6 Conclusion
Acknowledgements
References
1 Introduction: A Regression Model for Dichotomous Outcome Variables
Using the data of a study on health enhancing physical activity carried out by the Swiss Federal Offices of Sports and Statistics, an estimate will be made of the effects the linguistic background of a Swiss inhabitant has on his or her daily or weekly physical activity. The results are presented along with the Mathematica¹ code used for the calculations.

¹ Mathematica is a registered trademark of Wolfram Research, Inc.
2 Logistic Regression Model
We consider a dichotomous outcome variable y ∈ {0, 1}, i.e. the variable y takes either the value 1 or the value 0, with probabilities π(x) and 1 − π(x) respectively. Here, x ∈ R^p is a vector of p exogenous variables and π : R^p → [0, 1] a real-valued function. In fact, π(x) represents the conditional probability P(y = 1 | x) of y = 1, given x.
Let r := y − π(x), which allows us to rewrite our model as
\[
y = \pi(x) + r\,,
\]
where the error r has a mean of
\[
\mathrm{E}(r) = \mathrm{E}(y) - \pi(x) = 0 \tag{2.1a}
\]
and a variance of
\[
\operatorname{Var}(r) = \operatorname{Var}(y) = \pi(x)\bigl(1 - \pi(x)\bigr)\,. \tag{2.1b}
\]
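Indeed, for a zero-one variable we have E(y²) = E(y) = π(x), hence
\[
\operatorname{Var}(y) = \mathrm{E}(y^2) - \bigl(\mathrm{E}(y)\bigr)^2 = \pi(x) - \pi(x)^2 = \pi(x)\bigl(1 - \pi(x)\bigr)\,.
\]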
For the forthcoming analysis we define the so-called logistic transformation σ_LR : R → [0, 1] by
\[
\sigma_{LR}(z) := \frac{\exp z}{1 + \exp z} = \frac{1}{1 + \exp(-z)}\,,
\]
which allows us to specify the probability function π as
\[
\pi(x) = \sigma_{LR}(x^T\beta^0)\,.
\]
[Figure 1: the logistic transformation σ_LR(z) (ranging from 0 to 1) and its first derivative σ′_LR(z) (with maximum 0.25 at z = 0), both plotted for z between −4 and 4.]
The shape of σ_LR and its first derivative σ′_LR are displayed in figure 1.
Possible motivations for this specific model shall be discussed in section 2.3.
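As an aside, figure 1 is straightforward to reproduce. The following lines are a minimal Mathematica sketch of our own (the name sigmaLR is not from the original text):

sigmaLR[z_] := Exp[z]/(1 + Exp[z])

(* upper panel of figure 1: the logistic curve *)
Plot[sigmaLR[z], {z, -4, 4}]

(* lower panel: the first derivative, which peaks at 1/4 for z = 0 *)
Plot[Evaluate[D[sigmaLR[z], z]], {z, -4, 4}]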
\[
y = \sigma_{LR}(X\beta^0) + r\,,
\]
where
\[
\mathrm{E}(r_i) = 0 \tag{2.5a}
\]
and
\[
s_i^2 := \operatorname{Var}(r_i) = \sigma_{LR}(x_i^T\beta^0)\bigl(1 - \sigma_{LR}(x_i^T\beta^0)\bigr)
= \frac{\exp x_i^T\beta^0}{1 + \exp x_i^T\beta^0} \cdot \frac{1 + \exp x_i^T\beta^0 - \exp x_i^T\beta^0}{1 + \exp x_i^T\beta^0}
= \frac{\exp x_i^T\beta^0}{\bigl(1 + \exp x_i^T\beta^0\bigr)^2}
= \sigma_{LR}'(x_i^T\beta^0)\,. \tag{2.5b}
\]
and yields the log-likelihood function
\[
\ln L(\beta) = \sum_{i=1}^n \ln \exp(y_i x_i^T\beta) - \sum_{i=1}^n \ln\bigl(1 + \exp(x_i^T\beta)\bigr)
= y^T X\beta - \sum_{i=1}^n \ln\bigl(1 + \exp(x_i^T\beta)\bigr)\,. \tag{2.6}
\]
Its gradient with respect to β is
\[
\nabla \ln L(\beta) = X^T\bigl(y - \sigma_{LR}(X\beta)\bigr)\,.
\]
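As an illustration, (2.6) and its gradient translate directly into Mathematica. This is a sketch of our own; the variable names y, X, and b anticipate those used in section 5:

sigmaLR[z_] := Exp[z]/(1 + Exp[z])

(* log-likelihood (2.6): y^T X b minus the sum of ln(1 + exp(x_i^T b)) *)
logL[b_] := y . (X . b) - Total[Log[1 + Exp[X . b]]]

(* its gradient: X^T (y - sigmaLR(X b)); sigmaLR acts componentwise on X.b *)
gradLogL[b_] := Transpose[X] . (y - sigmaLR[X . b])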
From this gradient we can derive the Hessian matrix H_ln L(β) ∈ R^{p×p} of the log-likelihood function ln L(β), given by
\[
H_{\ln L}(\beta) = -X^T D(\beta)\, X\,,
\]
where D(β) ∈ R^{n×n} denotes the diagonal matrix with entries D(β)_{ii} = σ′_LR(x_iᵀβ) as defined in (2.8). This Hessian is negative semi-definite for all β ∈ R^p.
Verification. We have
\[
u^T H_{\ln L}(\beta)\, u = -u^T X^T D(\beta) X u = -\sum_{i=1}^n (x_i^T u)^2\, \sigma_{LR}'(x_i^T\beta)\,. \tag{2.9}
\]
As the first derivative of σ_LR is always positive (see (2.3)), we can see from equation (2.9) that uᵀH_ln L(β) u ≤ 0 for all u ∈ R^p and all β ∈ R^p. □
We therefore know that, in any case, the log-likelihood function ln L(β) is concave. Consequently, any root of ∇ ln L(β) is a global maximum of ln L(β) and a maximum likelihood estimator of β⁰. Condition (2.7) is therefore not only necessary but also sufficient for β̂ to be a maximum likelihood estimator of β⁰. However, neither its existence nor its uniqueness is a priori guaranteed.

Remark. If we assumed p ≤ n and rank X = p, we would obtain that Xu = 0 is equivalent to u = 0, so that H_ln L(β) would be negative definite for all β ∈ R^p and ln L(β) thus strictly concave. Any β̂ satisfying condition (2.7) would therefore be the unique maximum likelihood estimator of β⁰.
Such a way of thinking is quite common in practice. If, in a sporting competition, for example, the probability of one team winning is about 80 percent, many people say that the odds on this team winning are four to one. However, as Christensen (1997, p. 2) mentions, odds are not infrequently confused with probabilities. In a document entitled "What Are the Odds of Dying?", the US National Safety Council (2001) states for instance: "The odds of dying from an injury in 1998 were 1 in 1,796." This number was obtained by dividing the 1998 US population (270,248,000) by the respective number of deaths (150,445). From a statistical point of view, the number 1,796 is an estimate of the reciprocal probability, whereas an estimate of the real odds on dying from an injury would rather be 150,445/(270,248,000 − 150,445), i.e. 1 to 1,795. The expression "1 in" instead of "1 to" and the large numbers involved in this study justify such an approximation procedure to a certain extent.
This latter example points out the need for a careful distinction between odds and probability. Note, however, that there is a one-to-one relationship between the two.² Given the probability π, the respective odds o are obtained by formula (2.10), while π is calculated from any given o by π = o/(1 + o). As Christensen (1997, p. 3) emphasises, examining odds therefore amounts to a rescaling of the measure of uncertainty: probabilities between 0 and 1/2 correspond to odds between 0 and 1, whereas odds between 1 and ∞ correspond to probabilities between 1/2 and 1.

The loss of symmetry inherent in the transformation of probabilities to odds can be offset by a subsequent application of the natural logarithm. Looking at the so-called log odds, we observe that they are "symmetric about zero just as probabilities are symmetric about one half" (Christensen 1997, p. 3). This is why mathematical analyses often deal with log odds instead of "simple" odds. It is worth noting that the log odds are usually allowed to take the values −∞ and +∞ in order to establish a one-to-one relationship with the respective probabilities 0 and 1, and, moreover, that the transformation of probabilities to log odds is exactly the logit transformation introduced in section 2.1.
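These conversions are easy to check numerically; a minimal Mathematica sketch of our own (the function names are not from the original text):

odds[p_] := p/(1 - p)        (* probability -> odds, as in (2.10) *)
prob[o_] := o/(1 + o)        (* odds -> probability *)
logOdds[p_] := Log[odds[p]]  (* probability -> log odds, the logit transformation *)

odds[0.8]        (* -> 4., odds of four to one *)
prob[odds[0.8]]  (* -> 0.8, the round trip *)
logOdds[0.5]     (* -> 0., log odds are symmetric about zero *)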
In order to compare the odds of two different events, it may be useful to examine not only odds but also odds ratios. If the odds of an event A are Odds(A) and the odds of an event B are Odds(B), then the odds ratio of B to A is defined as Odds(A)/Odds(B).³ In the above study on the "Odds of Dying", for instance, the odds of dying in a railway accident were 1 to 524,752, whereas the odds of an accidental death in a motor-vehicle accident were 1 to 6,211. The odds ratio of dying in a railway accident to dying in a motor-vehicle accident in the USA thus was about 84, i.e. the odds of dying in a motor-vehicle accident were about 84 times as high as those of dying in a railway crash.

² This holds under the condition that the odds are allowed to take the value +∞ when the probability π equals 1.
³ Note that this terminology is not used identically throughout the literature. While many authors just speak of the ratio of the odds of A to the odds of B, Garson (2001, "Log-linear Models, Logit, and Probit") explicitly refers to the expression "odds ratio of B to A".
\[
\operatorname{logit} \pi(x) = x^T\beta
\]
\[
\operatorname{logit} \pi(\check{x}) = \check{x}^T\beta = (x + e_i)^T\beta = x^T\beta + e_i^T\beta = x^T\beta + \beta_i
\]
it is preferable to define a set of dummy variables x₁, …, x_{m−1} as
\[
x_i := \begin{cases} 1 & \text{if } g = i, \\ 0 & \text{if } g \ne i, \end{cases}
\qquad (i = 1, \dots, m-1)
\]
and to analyse, for instance, the model
\[
\operatorname{logit} \pi = \beta_0 + \beta_1 x_1 + \cdots + \beta_{m-1} x_{m-1}\,.
\]
This procedure allows us to distinguish between the effects on y of every single factor level of g; a minimal coding sketch follows below. An application of this idea will be shown in section 5.
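For illustration, the dummy coding can be written in Mathematica as follows; this sketch and the name dummies are our own:

(* dummy variables x_1, ..., x_{m-1} for a factor g with m levels *)
dummies[g_Integer, m_Integer] := Table[Boole[g == i], {i, 1, m - 1}]

dummies[2, 4]  (* -> {0, 1, 0}: level 2 of a four-level factor *)
dummies[4, 4]  (* -> {0, 0, 0}: level m serves as the reference level *)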
The possibility of interpreting the coefficients of β as logarithms of odds ratios provides the foundation of a second important motivation of the logistic model. Both Santner and Duffy (1989, pp. 206-207) and Christensen (1997, pp. 118-120, p. 387) emphasise the difference between prospective and retrospective studies. Consider for instance an experiment in which 250 people of an arbitrary population are sampled. A binary response "diseased" (D) or "non-diseased" (Dᶜ) is observed for each person. Moreover, there is a single explanatory variable "exposed" (E) or "non-exposed" (Eᶜ) involved. This kind of study is called prospective. Let ψ_P denote the (prospective) ratio of the odds of disease for the exposed group to the odds of disease for the non-exposed group,
\[
\psi_P = \frac{P(D \mid E)}{1 - P(D \mid E)} \bigg/ \frac{P(D \mid E^c)}{1 - P(D \mid E^c)}\,.
\]
According to the nature of the study, diseased individuals may be very rare in a random sample of 250 people, so most of the collected data concern non-diseased persons. It is therefore sometimes useful to fix the sample size in the rare event category by design. In our example, one could study separate samples of 100 diseased and 150 non-diseased individuals while determining for every person whether he or she had been exposed or not. This procedure is called retrospective and leads directly to information about the probability of exposure among the diseased and among the healthy groups. We thus get the (retrospective) odds ratio
\[
\psi_R = \frac{P(E \mid D)}{1 - P(E \mid D)} \bigg/ \frac{P(E \mid D^c)}{1 - P(E \mid D^c)}\,.
\]
However, we obtain by Bayes's rule that
\[
\psi_P = \frac{P(D \mid E)}{P(D^c \mid E)} \bigg/ \frac{P(D \mid E^c)}{P(D^c \mid E^c)}
= \frac{P(E \mid D)\,P(D)}{P(E \mid D^c)\,P(D^c)} \bigg/ \frac{P(E^c \mid D)\,P(D)}{P(E^c \mid D^c)\,P(D^c)}
= \frac{P(E \mid D)}{1 - P(E \mid D)} \bigg/ \frac{P(E \mid D^c)}{1 - P(E \mid D^c)}
= \psi_R\,.
\]
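The identity ψ_P = ψ_R is easily verified numerically. A minimal Mathematica sketch of our own, with an arbitrary, hypothetical joint distribution of exposure and disease:

(* hypothetical joint probabilities of the four cells E∩D, E∩Dᶜ, Eᶜ∩D, Eᶜ∩Dᶜ *)
pED = 0.04; pEDc = 0.16; pEcD = 0.02; pEcDc = 0.78;

pDgE  = pED/(pED + pEDc);    (* P(D|E)  *)
pDgEc = pEcD/(pEcD + pEcDc); (* P(D|Eᶜ) *)
pEgD  = pED/(pED + pEcD);    (* P(E|D)  *)
pEgDc = pEDc/(pEDc + pEcDc); (* P(E|Dᶜ) *)

(pDgE/(1 - pDgE))/(pDgEc/(1 - pDgEc))  (* psiP -> 9.75 *)
(pEgD/(1 - pEgD))/(pEgDc/(1 - pEgDc))  (* psiR -> 9.75 as well *)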
2.3.3 Relationship with the Logistic Distribution
However, the logistic transformation σ_LR itself provides a very similar and analytically sometimes more convenient alternative. Viewed as a distribution function, σ_LR gives rise to the logistic distribution. Its density σ′_LR has mean zero and variance π²/3, so that it is appropriate to define the cumulative distribution function of the standardised logistic distribution with zero mean and unit variance as
\[
\Lambda(x) := \sigma_{LR}(\lambda x) = \frac{\exp \lambda x}{1 + \exp \lambda x}\,,
\]
where λ = π/√3. In the literature, for example by Cramer (1991), the function Λ is sometimes called the logit function and is therefore not to be confused with the logit transformation introduced in section 2.1, which is nearly its inverse.
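The choice λ = π/√3 can be checked symbolically; a minimal Mathematica sketch of our own:

sigmaLR[z_] := Exp[z]/(1 + Exp[z])
lambda = Pi/Sqrt[3];
density[x_] = D[sigmaLR[lambda x], x];  (* density of the standardised logistic distribution *)

Integrate[x density[x], {x, -Infinity, Infinity}]    (* -> 0: zero mean *)
Integrate[x^2 density[x], {x, -Infinity, Infinity}]  (* -> 1: unit variance *)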
Cramer (1991, section 2.3) shows in an example that "by judicious adjustment of the linear transformations of the argument x, the logit and probit probability functions can be made to coincide over a fairly wide range." Moreover, he states that "logit and probit functions which have been fitted to the same data are therefore virtually indistinguishable, and it is impossible to choose between the two on empirical grounds."

As a result, it is justifiable in most cases to assume a logistic distribution instead of a normal distribution for y_i* in examples such as those mentioned above. This, however, leads us again to the logistic regression approach.
As Cramer (1991) states, this model of human population growth was introduced independently of Verhulst's study by Raymond Pearl and Lowell Reed in an article entitled "On the rate of growth of the population of the United States since 1790 and its mathematical representation" published in 1920. Cramer continues: "The basic idea that growth is proportional both to the level already attained and to the remaining room to the saturation ceiling is simple and effective, and the logistic model is used to this day to model population growth or, in market research, to describe the diffusion or market penetration of a new product or of new technologies. For new commodities that satisfy new needs like television, compact discs or video cameras, the growth of ownership is naturally proportional both to the penetration rate already achieved and to the size of the remaining potential market, and similar arguments apply to the diffusion of new products and techniques in industry."
The application of probability models to biological experiments in the 1930s represents another foundation of the logistic regression model. However, it was primarily the probit model introduced in section 2.3.3 that found its reflection in the literature. According to Cramer, economists at that time did not seem to take the logit model seriously. Only after Henri Theil generalised the bivariate or dichotomous model to the multinomial logit model with more than two states of the dependent variable in 1969 did logistic regression gain wide acceptance. In the seventies, Daniel McFadden, winner of the 2000 Nobel Prize in economics, and his collaborators finally provided a theoretical framework for the logit model, linking it directly to the mathematical theory of economic choice (see McFadden 1974).
3 Consistency of the Maximum Likelihood Estimator
\[
E_{y,\sigma}(\beta) := 1_n^T H_\sigma(X\beta) - y^T X\beta\,, \tag{3.1}
\]
As H_{σ_LR} is a primitive of σ_LR, namely H_{σ_LR}(z) = ln(1 + exp z), we see directly by (2.6) and (3.1) that E_{y,σ_LR}(β) = −ln L(β).
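Written out for σ = σ_LR, the correspondence is immediate:
\[
E_{y,\sigma_{LR}}(\beta) = \sum_{i=1}^n \ln\bigl(1 + \exp(x_i^T\beta)\bigr) - y^T X\beta = -\ln L(\beta)\,,
\]
by (2.6).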
Theorem 1 of Mazza and Antille (1998, p. 4), based on the definition of E_{y,σ}, is therefore directly applicable to our logistic regression model. On the following pages, we shall restate this theorem and its proof in a variant slightly adjusted to our problem.⁴
Theorem 3.1. Assume M1 and M2, let β ∈ R^p, and consider the random vector y = σ_LR(Xβ⁰) + r, where r = (r₁, r₂, …, r_n)ᵀ has mutually independent entries r_i such that E(r_i) = 0 and E(r_i²) = s_i² > 0 for all i ∈ {1, …, n}.⁵ Let B(β⁰, δ) ⊂ R^p be the open ball of radius δ centered at β⁰, and let
For the proof of theorem 3.1, we need the following lemma which is in fact a
variant of Ortega and Rheinboldt’s (1970, p. 163) lemma 6.3.4 adapted for
our specific purpose.
⁴ There are two main differences. First, the assumptions on the differentiability of σ and the positivity of its derivative are omitted because both of them are automatically satisfied by σ_LR. Furthermore, the assumption of homoscedasticity of the random variables r_i is relaxed in order to allow mutually different, but positive variances.
⁵ Note that, by (2.5b), we have s_i² = σ′_LR(x_iᵀβ⁰). Following Mazza and Antille's original, the shorter notation s_i² shall be applied in this context.
Lemma 3.2. Let B = B(β⁰, δ) be an open ball in R^p with center β⁰ and radius δ > 0. Assume that G : B̄ ⊂ R^p → R^p is continuous and satisfies (β − β⁰)ᵀG(β) ≤ 0 for all β ∈ ∂B, the boundary of B. Then G has a root in B̄.
Proof (of Lemma 3.2). We consider the ball B₀ = B(0, δ) and define G₀ : B̄₀ → R^p by G₀(γ) := γ + G(γ + β⁰). Given the continuity of G, the function G₀ is also continuous. Let γ ∈ ∂B₀, i.e. ‖γ‖² = δ². Then γ + β⁰ ∈ ∂B, and thus, for any λ > 1,
\[
\gamma^T\bigl(\lambda\gamma - G_0(\gamma)\bigr)
= \gamma^T\bigl((\lambda - 1)\gamma - G(\gamma + \beta^0)\bigr)
= \underbrace{(\lambda - 1)\,\gamma^T\gamma}_{=(\lambda-1)\|\gamma\|^2 \,>\, 0} - \underbrace{\gamma^T G(\gamma + \beta^0)}_{\le\, 0 \text{ by assumption}} > 0\,. \tag{3.3}
\]
We now want to show that G₀ has a fixed point γ̂ ∈ B̄₀, i.e. G₀(γ̂) = γ̂. This result would finally mean that G(γ̂ + β⁰) = 0 and, therefore, β̂ := γ̂ + β⁰ ∈ B̄ would be a root of G.

Assume that G₀ has no fixed point in B̄₀. Then the mapping
\[
\hat{G}(\gamma) := \delta\,\frac{G_0(\gamma) - \gamma}{\|G_0(\gamma) - \gamma\|}
\]
is well-defined and continuous on B̄₀, and ‖Ĝ(γ)‖ = δ for any γ ∈ B̄₀. According to the Brouwer Fixed-Point Theorem⁶, Ĝ has a fixed point γ* in B̄₀, and ‖γ*‖ = ‖Ĝ(γ*)‖ = δ. As
\[
\gamma^* = \hat{G}(\gamma^*) = \delta \cdot \frac{G_0(\gamma^*) - \gamma^*}{\|G_0(\gamma^*) - \gamma^*\|}\,,
\]
we have
\[
G_0(\gamma^*) = \frac{\|G_0(\gamma^*) - \gamma^*\|}{\delta}\,\gamma^* + \gamma^*
= \underbrace{\Bigl(1 + \frac{\|G_0(\gamma^*) - \gamma^*\|}{\delta}\Bigr)}_{=:\lambda^* > 1}\,\gamma^*
= \lambda^*\gamma^*\,,
\]
so that γ*ᵀ(λ*γ* − G₀(γ*)) = 0, in contradiction to (3.3), since γ* ∈ ∂B₀ and λ* > 1. Hence G₀ has a fixed point in B̄₀, which completes the proof. □
Proof (of Theorem 3.1). Considering the function G(β) := −∇E_{y,σ_LR}(β), we are going to show that there exists a ball B̄(β⁰, δ) ⊂ R^p which, with probability converging to 1 as n tends to infinity, contains a root β̂ of G for an arbitrarily small δ > 0.

⁶ The Brouwer Fixed-Point Theorem reads as follows (see Ortega and Rheinboldt 1970, p. 161): every continuous mapping G : C̄ → C̄, where C̄ is a compact, convex set in R^p, has a fixed point in C̄.
Let G(β)_j denote the jth component of the vector G(β). We have
\[
G(\beta)_j = \bigl(X^T(y - \sigma_{LR}(X\beta))\bigr)_j
= \sum_{i=1}^n x_{ij}\bigl(y_i - \sigma_{LR}(x_i^T\beta)\bigr)
= \sum_{i=1}^n x_{ij}\bigl(y_i - \sigma_{LR}(x_i^T\beta^0 + x_i^T\gamma)\bigr)\,,
\]
such that, as a result of the mean value theorem, there exists some ξ_i = x_iᵀβ⁰ + α_i x_iᵀγ with α_i ∈ ]0, 1[ satisfying
\[
G(\beta)_j = \sum_{i=1}^n x_{ij}\bigl(r_i - \sigma_{LR}'(\xi_i)\, x_i^T\gamma\bigr)\,.
\]
Thanks to the previous lemma, we only need to prove that γᵀG(β) ≤ 0 for all γ = β − β⁰ with ‖γ‖ = δ. We thus consider the expression
\[
\gamma^T G(\beta) = \sum_{j=1}^p \gamma_j G(\beta)_j
= \sum_{j=1}^p \gamma_j \sum_{i=1}^n x_{ij}\bigl(r_i - \sigma_{LR}'(\xi_i)\, x_i^T\gamma\bigr)
= \sum_{i=1}^n r_i \sum_{j=1}^p \gamma_j x_{ij} - \sum_{i=1}^n \Bigl(\sum_{j=1}^p \gamma_j x_{ij}\Bigr)\,\sigma_{LR}'(\xi_i)\, x_i^T\gamma
= \underbrace{\sum_{i=1}^n r_i\, x_i^T\gamma}_{=:A_1} - \underbrace{\sum_{i=1}^n (x_i^T\gamma)^2\, \sigma_{LR}'(\xi_i)}_{=:A_2}\,.
\]
If we examine the second moment of ‖Σᵢ r_i x_i‖, we obtain
\[
\mathrm{E}\,\Bigl\|\sum_{i=1}^n r_i x_i\Bigr\|^2
= \mathrm{E}\sum_{j=1}^p \Bigl(\sum_{i=1}^n r_i x_{ij}\Bigr)^2
= \mathrm{E}\sum_{j=1}^p \Bigl(\sum_{i=1}^n r_i^2 x_{ij}^2 + \sum_{i=1}^n \sum_{\substack{k=1 \\ k \ne i}}^n r_i r_k\, x_{ij} x_{kj}\Bigr)
= \sum_{j=1}^p \Bigl(\sum_{i=1}^n \mathrm{E}(r_i^2)\, x_{ij}^2 + \sum_{i=1}^n \sum_{\substack{k=1 \\ k \ne i}}^n \mathrm{E}(r_i r_k)\, x_{ij} x_{kj}\Bigr)
\]
and, as the random variables r_i are mutually independent and their expectations are zero, we get E(r_i r_k) = E(r_i) E(r_k) = 0 and thus, as s_i² = σ′_LR(x_iᵀβ⁰) ≤ 1/4 for all i,
\[
\mathrm{E}\,\Bigl\|\sum_{i=1}^n r_i x_i\Bigr\|^2
= \sum_{i=1}^n \underbrace{\mathrm{E}(r_i^2)}_{=s_i^2} \sum_{j=1}^p x_{ij}^2
\le \frac{1}{4}\sum_{i=1}^n \sum_{j=1}^p x_{ij}^2
\overset{\text{M1}}{\le} \frac{Cnp}{4} \tag{3.5}
\]
for some positive constant C > 0. We see from (3.4) and (3.5) that E(A₁²) ≤ δ²Cnp/4. Using the Tchebychev inequality (see Feller 1968, p. 233), we have
\[
P(|A_1| \ge t) \le t^{-2}\,\mathrm{E}(A_1^2) \le t^{-2}\delta^2 Cnp/4 \quad \forall t > 0
\]
\[
\iff\quad P(|A_1| < t) \ge 1 - \underbrace{t^{-2}\delta^2 Cnp/4}_{=:\varepsilon} \quad \forall t > 0
\]
\[
\iff\quad P\Bigl(|A_1| < \frac{\delta\sqrt{Cnp}}{2\sqrt{\varepsilon}}\Bigr) \ge 1 - \varepsilon \quad \forall \varepsilon > 0\,.
\]
Defining C* := √C/2, it is obvious that
\[
P\Bigl(A_1 \le \delta C^* \sqrt{\frac{np}{\varepsilon}}\Bigr) \ge P\Bigl(|A_1| < \delta C^* \sqrt{\frac{np}{\varepsilon}}\Bigr)\,.
\]
If we set n_ε := n/ε for any given ε > 0, we obtain
\[
P\bigl(A_1 \le \delta C^* \sqrt{n_\varepsilon\, p}\bigr) \ge 1 - \varepsilon\,,
\]
so for all ε > 0, there exists a number n_ε ∈ N such that for any n ≥ n_ε and any given C* > 0, we have
\[
P\bigl(A_1 \le \delta C^* \sqrt{np}\bigr) \ge 1 - \varepsilon\,. \tag{3.6}
\]
Now, let us turn to the examination of A₂. Let Z := {ξ ∈ Rⁿ | ξ_i = x_iᵀβ⁰ + α_i x_iᵀγ, α_i ∈ ]0, 1[}.
Affirmation. For any vector ξ ∈ Z, we have
\[
\sigma_{LR}'(\xi_i) \ge a_{pn}^{\delta}(X, \beta^0) = \inf_{\zeta \in XB(\beta^0,\delta)}\, \min_{i=1,\dots,n} \sigma_{LR}'(\zeta_i) \qquad \forall\, i \in \{1, \dots, n\}\,,
\]
which is a contradiction. □
We therefore get
\[
A_2 = \sum_{i=1}^n (x_i^T\gamma)^2\, \sigma_{LR}'(\xi_i)
\ge a_{pn}^{\delta}(X, \beta^0) \sum_{i=1}^n (x_i^T\gamma)^2
= a_{pn}^{\delta}(X, \beta^0)\, \|X\gamma\|^2\,. \tag{3.7}
\]
that A₂ ≥ a_pn^δ(X, β⁰) c n δ². When combining this result with (3.6), we get that, for any ε > 0, there is a number n_ε such that
\[
P\bigl(A_1 - A_2 \le \delta C^* \sqrt{np} - a_{pn}^{\delta}(X, \beta^0)\, c\, n\, \delta^2\bigr) \ge 1 - \varepsilon\,.
\]
Remark. If we choose δ such that
\[
\delta = \delta_n := \frac{C^*}{c}\,\sqrt{\frac{p}{n}}\;\frac{1}{a_{pn}^{\delta_n}(X, \beta^0)}\,,
\]
as β̂ ∈ B̄(β⁰, δ_n). In other words, (n/p) [a_pn^{δ_n}(X, β⁰)]² ‖β⁰ − β̂‖² is bounded in probability.
As σ′_LR(z) = (exp z)/(1 + exp z)² is an even function having its maximum at z = 0 (see figure 1), we get
\[
\inf_{|z| \le \bar{D}p} \sigma_{LR}'(z) = \sigma_{LR}'(\bar{D}p) = \frac{\exp \bar{D}p}{(1 + \exp \bar{D}p)^2}\,.
\]
In other words, our requirement (3.2) on p(n) holds when the left side of this implication is true. Let us examine
\[
I = \sqrt{\frac{p}{n}}\,\frac{(1 + \exp \bar{D}p)^2}{\exp \bar{D}p}
= \sqrt{\frac{p}{n}}\,\frac{1 + \exp \bar{D}p}{\exp \bar{D}p}\,(1 + \exp \bar{D}p)
= \underbrace{\bigl(\exp(-\bar{D}p) + 1\bigr)}_{\le\, \exp(-\bar{D}) + 1 \,=:\, c_1} \sqrt{\frac{p}{n}}\,(1 + \exp \bar{D}p)
\le c_1\Bigl(\sqrt{\frac{p}{n}} + \sqrt{\frac{p}{n}}\,\exp \bar{D}p\Bigr)\,.
\]
Affirmation. Let C be a constant such that C < 1/(2D̄), and suppose that p(n) ≤ C ln n. Then lim_{n→∞} I = 0.

Verification. We have
\[
\sqrt{\frac{p}{n}}\,\exp \bar{D}p \le \sqrt{\frac{C \ln n}{n}}\,\exp(C\bar{D}\ln n) = \sqrt{\frac{C \ln n}{n}}\;n^{C\bar{D}} = \sqrt{\frac{C \ln n}{n^{1 - 2C\bar{D}}}}\,,
\]
which tends to 0 as n → ∞, since 1 − 2CD̄ > 0. As √(p/n) → 0 as well, the affirmation follows. □
We shall now summarise the results of the previous two sections in the following theorem.
Assumptions. Consider the following assumptions:

B1: There exists a positive constant D* > 0 such that ‖x_i‖² ≤ D*p for all i ∈ {1, …, n}.

B2: There exists a positive constant c > 0 such that λ* > cn for all n, where λ* denotes the smallest eigenvalue of XᵀX.

B3: There exists a positive constant D > 0 such that
\[
\sup_{p \in \mathbb{N}}\; \max_{j=1,\dots,p} \bigl(\beta_j^0\bigr)^2 < D\,.
\]

B4: For an arbitrary δ > 0 there exists a constant C < 1/(2√(D*)(√D + δ)) such that p(n) ≤ C ln n.
Theorem 3.3. Assume B1, B2, B3, and B4. Then the maximum likelihood estimator β̂ exists almost surely as n tends to infinity, and β̂ converges to the true value β⁰.

B1: There exists a positive constant M > 0 such that ‖x_i‖² ≤ M for all i ∈ {1, …, n}.
Let us now turn to the article of Gourieroux and Monfort. They make the following assumptions:

G1: The exogenous variables are uniformly bounded, i.e. there exists a positive constant M₀ such that |x_ij| ≤ M₀ for all i ∈ {1, …, n} and all j ∈ {1, …, p}.

G2: Let λ_1n and λ_pn be respectively the smallest and the largest eigenvalue of −H_ln L(β⁰) = XᵀD(β⁰)X, the diagonal matrix D(β⁰) being defined as in (2.8). There exists a constant M₁ such that λ_pn/λ_1n < M₁ for all n.
Theorem 3.4 (Gourieroux and Monfort). If G1 and G2 are satisfied, the maximum likelihood estimator β̂ of β exists almost surely as n goes to infinity, and β̂ converges almost surely to the true value β⁰ if and only if
\[
\lim_{n\to\infty} \lambda_{1n} = +\infty\,.
\]
By B1,
\[
|x_i^T\beta^0| \le \|x_i\|\,\|\beta^0\| \le \sqrt{M}\,\|\beta^0\|\,,
\]
such that s_i² ≥ σ′_LR(√M ‖β⁰‖) =: c_s > 0 for all i. Consequently, the real-valued function ‖·‖_s defined as
\[
\|w\|_s := \sqrt{w^T \Sigma^2 w} = \sqrt{\sum_{i=1}^n s_i^2 w_i^2}
\]
satisfies
\[
c_s\,\|w\|^2 \le \|w\|_s^2 \le \frac{\|w\|^2}{4}\,. \tag{3.12}
\]
Let S^{p−1} denote the unit sphere in R^p, i.e. S^{p−1} := {v ∈ R^p | ‖v‖ = 1}. By the Courant-Fischer-Weyl minimax principle⁹ we have
\[
\lambda_* = \min_{v \in S^{p-1}} \|Xv\|^2 \qquad\text{and}\qquad \lambda_{1n} = \min_{v \in S^{p-1}} \|Xv\|_s^2\,.
\]
Choose v* ∈ S^{p−1} such that λ* = ‖Xv*‖². By (3.12) we have λ* = ‖Xv*‖² ≥ 4‖Xv*‖_s² ≥ 4λ_1n. Conversely, if we choose v** ∈ S^{p−1} such that λ_1n = ‖Xv**‖_s², we get λ_1n = ‖Xv**‖_s² ≥ c_s‖Xv**‖² ≥ c_s λ*. In consequence, the inequality
\[
c_s\,\lambda_* \le \lambda_{1n} \le \frac{\lambda_*}{4} \tag{3.13}
\]
holds.
This result shows that B2 implies that λ_1n also goes to infinity as n increases. The opposite, however, is not always true: assume for instance that λ_1n = c_l ln n, where c_l > 0 is an arbitrary positive constant. By (3.13), we thus get
\[
4\,c_l \ln n \le \lambda_* \le \frac{c_l}{c_s} \ln n\,.
\]
So while lim_{n→∞} λ* = ∞, there is no positive constant c such that λ* > cn for all n.
The assumption lim_{n→∞} λ_1n = ∞ of theorem 3.4 is therefore less restrictive than B2. On the other hand, Gourieroux and Monfort (1981) need the supplementary hypothesis G2 to prove the consistency of β̂; G2 additionally ensures that lim_{n→∞} λ_1n = ∞ is not only sufficient but also necessary for the consistency of β̂.
4 Asymptotic Normality of the Maximum Likelihood Estimator
\[
\frac{e^T X^T D(\beta^0)\,X\,(\hat\beta - \beta^0)}{\|\Sigma Xe\|} \;\xrightarrow[n\to\infty]{L}\; N(0, 1)
\]
This condition can be reformulated as
\[
X^T\bigl(\sigma_{LR}\bigl(X(\beta^0 + \hat\beta - \beta^0)\bigr) - y\bigr) = 0
\]
\[
\iff\; \sum_{i=1}^n x_{ij}\Bigl(\sigma_{LR}\bigl(\underbrace{x_i^T\beta^0}_{=:\gamma_i^0} + \underbrace{x_i^T(\hat\beta - \beta^0)}_{=:\hat\varepsilon_i}\bigr) - y_i\Bigr) = 0 \quad \forall j
\]
\[
\iff\; \sum_{i=1}^n x_{ij}\bigl(\sigma_{LR}(\gamma_i^0 + \hat\varepsilon_i) - y_i\bigr) = 0 \quad \forall j\,.
\]
In application of the mean value theorem, we can write σ_LR(γ_i⁰ + ε̂_i) = σ_LR(γ_i⁰) + ε̂_i σ′_LR(ξ_i), where ξ_i = γ_i⁰ + α_i ε̂_i with α_i ∈ ]0, 1[. This gives us the equivalent condition
\[
\sum_{i=1}^n x_{ij}\Bigl(\hat\varepsilon_i\,\sigma_{LR}'(\xi_i) + \underbrace{\sigma_{LR}(\gamma_i^0) - y_i}_{=\,-r_i}\Bigr) = 0 \quad \forall j\,.
\]
Let us first show that N₂ converges to 0 in probability as n → ∞. We have
\[
\hat\varepsilon \le x_{i^*}^T(\hat\beta - \beta^0) \qquad \text{for an } i^* \in \{1, \dots, n\}\,.
\]
P
Moreover, as β̂ is consistent, kβ̂ − β 0 k −−−→ 0 from where it follows that
n→∞
P
max |ξi − γi0 | −−−→ 0 . (4.3)
i∈{1,...,n} n→∞
Therefore,
\[
|N_2| = \Biggl|\,\sum_{j=1}^p \frac{e_j \sum_{i=1}^n x_{ij}\,\hat\varepsilon_i\bigl(\sigma_{LR}'(\xi_i) - \sigma_{LR}'(\gamma_i^0)\bigr)}{\|\Sigma Xe\|}\,\Biggr|
\overset{\text{C-S}}{\le} \frac{\|e\|}{\|\Sigma Xe\|}\sqrt{\sum_{j=1}^p \Bigl(\sum_{i=1}^n x_{ij}\,\hat\varepsilon_i\bigl(\sigma_{LR}'(\xi_i) - \sigma_{LR}'(\gamma_i^0)\bigr)\Bigr)^{\!2}}
\]
\[
\le \frac{1}{\|\Sigma Xe\|}\sqrt{p\,n^2 M^2 \hat\varepsilon^2 \max_{i\in\{1,\dots,n\}}\bigl(\sigma_{LR}'(\xi_i) - \sigma_{LR}'(\gamma_i^0)\bigr)^2}
= \frac{1}{\|\Sigma Xe\|}\,\sqrt{p}\;nM\hat\varepsilon \max_{i\in\{1,\dots,n\}}\bigl|\sigma_{LR}'(\xi_i) - \sigma_{LR}'(\gamma_i^0)\bigr|
\]
with s_i = √(σ′_LR(x_iᵀβ⁰)). However, on one hand,
\[
|x_i^T\beta^0| \overset{\text{A1}}{\le} \|x_i\|\,\|\beta^0\| \le \sqrt{p}\,M\,\|\beta^0\|\,,
\]
such that s_i² ≥ σ′_LR(√p M ‖β⁰‖) =: c₀ > 0 for all i; on the other hand, ‖ΣXe‖² ≥ c₀‖Xe‖² ≥ c₀cn, and thus
\[
|N_2| \le \frac{1}{\sqrt{c_0\,c}}\,\sqrt{p}\,M\sqrt{n}\;\hat\varepsilon \max_{i\in\{1,\dots,n\}}\bigl|\sigma_{LR}'(\xi_i) - \sigma_{LR}'(\gamma_i^0)\bigr|\,. \tag{4.5}
\]
By the continuity of σ′_LR we get with (4.3) that
\[
\max_{i\in\{1,\dots,n\}}\bigl|\sigma_{LR}'(\xi_i) - \sigma_{LR}'(\gamma_i^0)\bigr| \;\xrightarrow[n\to\infty]{P}\; 0\,, \tag{4.6}
\]
Since p is fixed and thus XB(β⁰, δ) is bounded, there is a positive constant C > 0 such that a_pn^δ(X, β⁰) ≥ C. Therefore, as a conclusion of the remark at the end of section 3.1, √n ‖β̂ − β⁰‖ is bounded in probability. The same thus holds for √n ε̂, and we get with (4.5) and (4.6) that
\[
N_2 \;\xrightarrow[n\to\infty]{P}\; 0\,.
\]
Furthermore, we have
\[
N_3 = \sum_{j=1}^p \frac{e_j \sum_{i=1}^n x_{ij}\,\hat\varepsilon_i\,\sigma_{LR}'(\gamma_i^0)}{\|\Sigma Xe\|}
= \sum_{j=1}^p \frac{e_j \sum_{i=1}^n x_{ij}\,x_i^T(\hat\beta - \beta^0)\,\sigma_{LR}'(x_i^T\beta^0)}{\|\Sigma Xe\|}
= \frac{e^T X^T D(\beta^0)\,X\,(\hat\beta - \beta^0)}{\|\Sigma Xe\|}\,.
\]
we have by (4.7) that N₃ converges in law to N(0, 1) as n → ∞. For this purpose, we write −N₁ as
\[
-N_1 = \frac{\sum_{i=1}^n \rho_i}{\|\Sigma Xe\|}\,,
\]
where ρ_i := (Xe)_i r_i. We note that
\[
\sum_{i=1}^n \operatorname{Var}(\rho_i) = \sum_{i=1}^n (Xe)_i^2 \operatorname{Var}(r_i) = \sum_{i=1}^n (Xe)_i^2 s_i^2 = \sum_{i=1}^n (\Sigma Xe)_i^2 = \|\Sigma Xe\|^2\,.
\]
However, we have
the last inequality being a consequence of A1, (4.4), and the fact that s_i² = σ′_LR(x_iᵀβ⁰) ≤ 1/4. Finally, (4.10) implies (4.9), which completes the proof. □
5 Case Study: Health Enhancing Physical Activity in the Swiss Population
In order to illustrate the application of the logistic regression model, we shall now analyse the impact of the linguistic regions of Switzerland on the physical activity of their inhabitants. We divide H into three mutually exclusive sets H_G, H_F, and H_I containing the questioned inhabitants of the German, French, and Italian speaking parts of Switzerland respectively. For every individual h_i we define two indicator variables x_Fi and x_Ii by
\[
x_{Fi} := \begin{cases} 1 & \text{if } h_i \in H_F, \\ 0 & \text{otherwise}, \end{cases}
\qquad
x_{Ii} := \begin{cases} 1 & \text{if } h_i \in H_I, \\ 0 & \text{otherwise}, \end{cases}
\]
and analyse the model
\[
\operatorname{logit} \pi = \beta_0 + \beta_1 x_F + \beta_2 x_I\,.
\]
For this purpose, we are going to minimise, using Mathematica, the error function E_{y,σ_LR} defined as in (3.1) with respect to the given data set. Note that this procedure is equivalent to maximising the log-likelihood function, as has been shown in section 3.1. While the vector y contains the variables y_i defined above, the matrix X is constructed from the row vectors x_iᵀ := (1  x_Fi  x_Ii). The explicit values of y and X will not be displayed here; we assume that they are stored in the Mathematica variables y and X. The vector β = (β₀, β₁, β₂)ᵀ is represented by the variable b:
{n, p} = Dimensions[X]
{1459, 3}
In addition, the vector 1_n, the primitive function H_{σ_LR}, and the error function E_{y,σ_LR} are defined and stored in ev, H, and Err respectively:
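The corresponding listing is not reproduced in this extract; the following is a sketch of what these definitions may have looked like, consistent with (3.1) and the variables above (the exact original code is an assumption):

ev = Table[1, {n}];                     (* the vector 1_n *)
H[z_] := Log[1 + Exp[z]]                (* primitive function of sigmaLR *)
Err[b_] := ev . H[X . b] - y . (X . b)  (* error function (3.1) *)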
Having specified y and X, it is now possible to print out the precise form of E_{y,σ_LR}(β):

Err[b]

-721 b0 - 299 b1 - 261 b2 + 509 Log[1 + E^b0] + 485 Log[1 + E^(b0 + b1)] + 465 Log[1 + E^(b0 + b2)]
Finally, starting from β = 0, the function FindMinimum searches for a local minimum of E_{y,σ_LR}; since E_{y,σ_LR} = −ln L is convex, any such minimum is global.
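The original call and its numerical output are not reproduced in this extract; it presumably took a form like the following sketch (the starting values implement β = 0, and the resulting rule list would contain the maximum likelihood estimates β̂₀, β̂₁, β̂₂):

FindMinimum[Err[{b0, b1, b2}], {{b0, 0}, {b1, 0}, {b2, 0}}]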
6 Conclusion
Acknowledgements
I would like to thank Prof Dr André Antille for proposing this interesting and many-sided topic, as well as for his academic assistance during the development of this diploma thesis.
Secondly, I want to express my gratitude to Prof Dr Bernard Marti, head of
the Institute of Sports Sciences within the Swiss Federal Office of Sports in
Magglingen, and to his collaborator Dr Brian W. Martin for placing at my
disposal the complete data set of the HEPA study used and referred to in
section 5. Thanks also to PD Dr Hans Howald for arranging the necessary
contacts.
Finally, my appreciation goes to Manrico Glauser and Ralf Lutz for proofreading this text and giving me precious feedback.
References
McFadden, D. (1974). Conditional logit analysis of qualitative choice behaviour, in P. Zarembka (ed.), Frontiers in Econometrics, Academic Press, New York, pp. 105–142.

National Safety Council (2001). What Are the Odds of Dying? [online], available from: http://www.nsc.org/lrs/statinfo/odds.htm [accessed 15 October 2001].