The Logistic Function
Introduction
The logistic function, or logistic curve, is a common S-shaped curve (sigmoid curve) with equation

$$P(t) = \frac{K}{1 + e^{-(rt + C)}} \qquad (1)$$

where $P(t)$ is the population $P$ at time $t$, $K$ is the carrying capacity and the curve's maximum value, $e = 2.718281828\ldots$ is a mathematical constant and the base of the natural logarithms, $r$ is the growth parameter (steepness of the curve), $C = \ln\dfrac{P_0}{K - P_0}$ is a constant and $P_0$ is the population at $t = 0$.

Figure 1. Logistic curve: $P(t) = \dfrac{K}{1 + e^{-(rt + C)}}$, $K = 100$, $P_0 = 20$, $r = 0.138629436$.

For real values of $t$ in the range $-\infty < t < +\infty$, the curve has two asymptotes, $P(t) = K$ as $t \to +\infty$ and $P(t) = 0$ as $t \to -\infty$, and is symmetric about the curve's midpoint at $t = t_0 = -\dfrac{C}{r}$, where $P(t_0) = \tfrac{1}{2}K$.
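These relationships are easy to confirm numerically; a minimal Python sketch using the parameter values of Figure 1 (the function name is just for illustration):

```python
import math

def logistic(t, K=100.0, P0=20.0, r=0.138629436):
    """Evaluate P(t) = K / (1 + e^(-(rt + C))) with C = ln(P0 / (K - P0))."""
    C = math.log(P0 / (K - P0))
    return K / (1.0 + math.exp(-(r * t + C)))

# Midpoint t0 = -C/r; with K = 100 and P0 = 20 we have C = ln(1/4), so t0 = 10
t0 = -math.log(20.0 / 80.0) / 0.138629436
print(round(logistic(0), 6))   # 20.0  (P(0) = P0)
print(round(t0, 4))            # 10.0
print(round(logistic(t0), 6))  # 50.0  (P(t0) = K/2)
```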
There are various alternative expressions for the logistic curve and the derivation of (1) – outlined below – is
a modern interpretation of the derivation by Pierre Verhulst in the 19th century who named the curve la
courbe logistique [the logistic curve].
The properties of the logistic curve are derived and a general equation developed with a special case, the
sigmoid curve. This is followed by the derivation of the logistic distribution.
Logistic regression is discussed and the method of least squares is employed to give a solution for the
parameters of a logistic curve that is a best fit of the outcomes of binary dependent variables. Two examples
of this technique are shown.
Finally, the connection between a sport rating system (Elo 1978) and the logistic curve is shown.
References for further reading and two appendices are included.
Verhulst proposed the following (somewhat arbitrary) differential equation for the population $P(t)$ at time $t$:

$$\frac{dP}{dt} = rP\left(1 - \frac{P}{K}\right) \qquad (2)$$

where $P(t)$ is the population $P$ at time $t$, $K$ is the carrying capacity and $r$ is the growth parameter. The differential equation (2) for the population $P$ is solved by integration, where

$$\int \frac{dP}{P\left(1 - \frac{P}{K}\right)} = \int r\,dt \qquad (3)$$
In order to evaluate the left-hand side we write

$$\frac{1}{P\left(1 - \frac{P}{K}\right)} = \frac{K}{KP - P^2} = \frac{K}{P(K - P)}$$

and decomposing into partial fractions gives

$$\frac{K}{P(K - P)} = \frac{A}{P} + \frac{B}{K - P} = \frac{A(K - P) + BP}{P(K - P)}$$

from which the relation $A(K - P) + BP = K$ is obtained, which must hold for all values of $K$ and $P$. Choosing particular values for $P$ yields $A$ and $B$ as follows: (i) when $P = 0$, $AK = K$ and $A = 1$; (ii) when $P = K$, $BK = K$ and $B = 1$. Using this result gives

$$\frac{1}{P\left(1 - \frac{P}{K}\right)} = \frac{K}{P(K - P)} = \frac{1}{P} + \frac{1}{K - P}$$
and (3) becomes

$$\int \frac{1}{P}\,dP + \int \frac{1}{K - P}\,dP = \int r\,dt \qquad (4)$$
With the substitution $u = K - P$ (so that $du = -dP$) the second integral of (4) changes sign, giving

$$\int \frac{1}{P}\,dP - \int \frac{1}{u}\,du = \int r\,dt$$

and using the integral results $\int \frac{1}{x}\,dx = \ln x + C$ and $\int a\,dx = ax + C$, where the $C$'s are constants of integration, gives

$$\ln P - \ln(K - P) = rt + C \qquad (5)$$

where $\ln$ denotes the natural logarithm, $\ln x \equiv \log_e x$ and $e = 2.718281828\ldots$, and the constants of integration have been combined and added to the right-hand side.
Re-writing (5) as

$$\ln(K - P) - \ln P = -(rt + C)$$

and using the law of logarithms $\log_a \frac{M}{N} = \log_a M - \log_a N$ gives

$$\ln\frac{K - P}{P} = -(rt + C)$$

Now, raising both sides to the base $e$ and noting that $e^{\ln x} = x$ gives

$$\frac{K - P}{P} = e^{-(rt + C)} \qquad (6)$$

which can be re-arranged as

$$P = P(t) = \frac{K}{1 + e^{-(rt + C)}} \qquad (7)$$
An equation for the constant $C$ can be found when $t = 0$, in which case $P_0 = P(0) = \dfrac{K}{1 + e^{-C}}$, and a re-arrangement gives

$$e^{-C} = \frac{K - P_0}{P_0} \quad \text{and} \quad e^{C} = \frac{P_0}{K - P_0} \qquad (8)$$

Taking natural logarithms of both sides of the second member of (8) and noting $\ln e^x = x$ gives

$$C = \ln\frac{P_0}{K - P_0} \qquad (9)$$

Midpoint

The midpoint of the curve is where $P(t) = \tfrac{1}{2}K$, and this occurs when the exponent $(rt + C)$ in (7) is equal to zero, i.e. at

$$t = t_0 = -\frac{C}{r} \qquad (10)$$
Symmetry

The logistic curve is symmetric about the midpoint. This can be confirmed by writing the denominator of (7) as

$$1 + e^{-(rt + C)} = 1 + e^{-r\left(t + \frac{C}{r}\right)} = 1 + e^{-r(t - t_0)}$$

giving

$$P = P(t) = \frac{K}{1 + e^{-r(t - t_0)}} \qquad (11)$$

Now using (11) with $d = t - t_0$ a distance along the $t$-axis, the sum of the logistic function $P(d)$ and its reflection about the vertical axis through $t_0$, $P(-d)$, is

$$\frac{K}{1 + e^{-rd}} + \frac{K}{1 + e^{rd}} = \frac{K\left(1 + e^{rd}\right) + K\left(1 + e^{-rd}\right)}{\left(1 + e^{-rd}\right)\left(1 + e^{rd}\right)} = \frac{2K + K\left(e^{rd} + e^{-rd}\right)}{2 + e^{rd} + e^{-rd}} = K$$
Also, due to symmetry we may write $\dfrac{K}{1 + e^{-r(t - t_0)}} = K - \dfrac{K}{1 + e^{r(t - t_0)}}$ and (11) becomes

$$P = P(t) = K - \frac{K}{1 + e^{r(t - t_0)}} \qquad (12)$$
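The symmetry property $P(t_0 + d) + P(t_0 - d) = K$ is easy to confirm numerically; a small Python sketch using the parameter values of Figure 1:

```python
import math

K, r, t0 = 100.0, 0.138629436, 10.0

def P(t):
    # Logistic curve in midpoint form (11): P(t) = K / (1 + e^(-r(t - t0)))
    return K / (1.0 + math.exp(-r * (t - t0)))

# The curve and its reflection about t0 always sum to the carrying capacity K
for d in (0.5, 3.0, 25.0):
    print(round(P(t0 + d) + P(t0 - d), 10))   # 100.0 each time
```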
Inflexion point

The midpoint is also the inflexion point of the logistic curve, where the second derivative $\dfrac{d^2P}{dt^2} = 0$. This can be proved as follows. Writing

$$A = e^{-C} = \frac{K - P_0}{P_0} \qquad (13)$$

$$P = P(t) = \frac{K}{1 + Ae^{-rt}} \qquad (14)$$

and differentiating (14) with respect to $t$ gives

$$\frac{dP}{dt} = -K\left(1 + Ae^{-rt}\right)^{-2}\left(-Are^{-rt}\right)$$
and the second derivative is

$$\frac{d^2P}{dt^2} = 2K\left(1 + Ae^{-rt}\right)^{-3}\left(-Are^{-rt}\right)^2 - K\left(1 + Ae^{-rt}\right)^{-2}\left(Ar^2e^{-rt}\right)$$

Solving $\dfrac{d^2P}{dt^2} = 0$ gives

$$2K\left(1 + Ae^{-rt}\right)^{-3}\left(-Are^{-rt}\right)^2 - K\left(1 + Ae^{-rt}\right)^{-2}\left(Ar^2e^{-rt}\right) = 0$$
$$2\left(1 + Ae^{-rt}\right)^{-1}\left(Are^{-rt}\right)^2 - rAre^{-rt} = 0$$
$$2\left(1 + Ae^{-rt}\right)^{-1}\left(Are^{-rt}\right) - r = 0$$
$$2\left(1 + Ae^{-rt}\right)^{-1}\left(Ae^{-rt}\right) = 1$$
$$2Ae^{-rt} = 1 + Ae^{-rt}$$
$$Ae^{-rt} = 1$$
$$e^{-rt} = \frac{1}{A} = A^{-1}$$

Taking logarithms of both sides and noting that $\log_a M^p = p\log_a M$ gives

$$t = \frac{\ln A}{r} \qquad (15)$$

Substituting (15) into (14) gives

$$P\left(\frac{\ln A}{r}\right) = \frac{K}{1 + Ae^{-\ln A}} = \frac{K}{1 + Ae^{\ln A^{-1}}} = \frac{K}{1 + AA^{-1}} = \frac{K}{2} = \tfrac{1}{2}K$$

Finally, using (9), (10) and (13) we obtain the relations

$$\ln A = -C \quad \text{and} \quad t_0 = \frac{\ln A}{r} \qquad (16)$$

So $P(t_0) = \tfrac{1}{2}K$, thus proving that the inflexion point is the midpoint $\left(t_0, \tfrac{1}{2}K\right)$.
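The inflexion-point result can also be checked with a central-difference estimate of the second derivative; a small Python sketch (the parameter values are illustrative only):

```python
import math

K, r, A = 100.0, 0.138629436, 4.0   # A = (K - P0)/P0 with P0 = 20

def P(t):
    # Form (14): P(t) = K / (1 + A e^(-rt))
    return K / (1.0 + A * math.exp(-r * t))

def d2P(t, h=1e-3):
    # Central-difference estimate of the second derivative
    return (P(t + h) - 2.0 * P(t) + P(t - h)) / h**2

t0 = math.log(A) / r               # inflexion point t0 = ln(A)/r, eq (15)
print(round(P(t0), 6))             # 50.0  (the midpoint value K/2)
print(abs(d2P(t0)) < 1e-5)         # True: second derivative vanishes at t0
print(abs(d2P(t0 + 5)) > 1e-2)     # True: but not away from the midpoint
```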
In his 1838 paper Verhulst proposed the differential equation

$$\frac{dp}{dt} = mp - \varphi(p) \qquad (17)$$

that links the rate of change of population $p$ with respect to time $t$ with $mp$ and a function of $p$, $\varphi(p)$, where $m$ is a constant. He then supposes that $\varphi(p) = np^2$, where $n$ is another constant, and finds for the integral of (17)

$$t = \frac{1}{m}\left[\ln p - \ln(m - np)\right] + \text{constant}$$

On resolving (17) he gives the equation for the population $p$ as

$$p = \frac{mp'e^{mt}}{np'e^{mt} + m - np'} \qquad (18)$$
where $p'$ is the population at $t = 0$. He then states that as $t \to \infty$ the value of $p$ corresponds with $P = \dfrac{m}{n}$, which he calls la limite supérieure de la population [the upper limit of the population].
The correspondence between variables in Verhulst's 1838 paper and this paper is:

Verhulst    This paper
p           P                  population
t           t                  time
m           r                  growth parameter
n           (no equivalent)
p'          P0                 population at t = 0
P = m/n     K                  upper limit of population
Verhulst's differential equation (17) can be written as $\dfrac{dp}{dt} = mp - np^2 = mp\left(1 - \dfrac{n}{m}p\right)$, which is equivalent to our equation (2), $\dfrac{dP}{dt} = rP\left(1 - \dfrac{P}{K}\right)$, and Verhulst's logistic function (18) can be written as

$$p = \frac{p'e^{mt}}{1 + \dfrac{n}{m}\,p'\left(e^{mt} - 1\right)}$$

which is equivalent to (Bacaër 2011, eq. 6.2)

$$P = \frac{P_0e^{rt}}{1 + \dfrac{P_0}{K}\left(e^{rt} - 1\right)} = \frac{KP_0e^{rt}}{K + P_0\left(e^{rt} - 1\right)} \qquad (19)$$
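Since (18) and (19) differ only in notation ($m = r$, $n = r/K$, $p' = P_0$), the two forms can be compared numerically; a short Python sketch (the parameter values are illustrative only):

```python
import math

K, P0, r = 100.0, 20.0, 0.138629436
m, n, p0 = r, r / K, P0          # Verhulst's m, n and p' in modern terms

def P_modern(t):
    # Equation (19): P = K P0 e^(rt) / (K + P0 (e^(rt) - 1))
    return K * P0 * math.exp(r * t) / (K + P0 * (math.exp(r * t) - 1.0))

def p_verhulst(t):
    # Verhulst's (18): p = m p' e^(mt) / (n p' e^(mt) + m - n p')
    return m * p0 * math.exp(m * t) / (n * p0 * math.exp(m * t) + m - n * p0)

for t in (0.0, 5.0, 20.0):
    # The two forms agree to rounding error
    print(round(abs(P_modern(t) - p_verhulst(t)), 9))   # 0.0
```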
Using (6) the relationship $e^{rt + C} = \dfrac{P}{K - P}$ can be obtained, and taking logarithms of both sides gives

$$rt + C = \ln\frac{P}{K - P}$$

Using this equation at times $t = 0$, $t = t_1$ and $t = t_2 = 2t_1$ gives

$$C = \ln\frac{P_0}{K - P_0}, \qquad rt_1 + C = \ln\frac{P_1}{K - P_1}, \qquad 2rt_1 + C = \ln\frac{P_2}{K - P_2}$$
Raising each to the base $e$ gives

$$e^{C} = \frac{P_0}{K - P_0} \quad \text{(i)} \qquad e^{rt_1}e^{C} = \frac{P_1}{K - P_1} \quad \text{(ii)} \qquad e^{2rt_1}e^{C} = \frac{P_2}{K - P_2} \quad \text{(iii)}$$
Dividing (ii) by (i) and (iii) by (ii) gives two equations in which $e^C$ has been eliminated and the left-hand sides are identical:

$$e^{rt_1} = \frac{P_1(K - P_0)}{P_0(K - P_1)} \quad \text{(iv)} \qquad e^{rt_1} = \frac{P_2(K - P_1)}{P_1(K - P_2)} \quad \text{(v)}$$

Equating (iv) and (v) gives $\dfrac{P_1(K - P_0)}{P_0(K - P_1)} = \dfrac{P_2(K - P_1)}{P_1(K - P_2)}$, and cross-multiplying and gathering terms gives

$$K = \frac{P_1\left(P_0P_1 + P_1P_2 - 2P_0P_2\right)}{P_1^2 - P_0P_2}$$

noting the conditions $P_1^2 > P_0P_2$ and $P_0P_1 + P_1P_2 > 2P_0P_2$ for finite and positive $K$.
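This three-point estimate of the carrying capacity can be checked by sampling a known curve at equally spaced times and recovering $K$; a quick Python sketch (the parameter values are illustrative only):

```python
import math

def logistic_P(t, K, P0, r):
    # Form (19) of the logistic curve
    return K * P0 * math.exp(r * t) / (K + P0 * (math.exp(r * t) - 1.0))

def carrying_capacity(P0, P1, P2):
    # K from three observations at equally spaced times t = 0, t1, 2*t1
    return P1 * (P0 * P1 + P1 * P2 - 2.0 * P0 * P2) / (P1 * P1 - P0 * P2)

# Sample a curve with known K = 100 at t = 0, 10, 20 and recover K
K_true, P0, r = 100.0, 20.0, 0.138629436
P1 = logistic_P(10.0, K_true, P0, r)
P2 = logistic_P(20.0, K_true, P0, r)
print(round(carrying_capacity(P0, P1, P2), 6))   # 100.0
```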
Gathering the results above, the logistic curve can be expressed in the equivalent forms

$$P(t) = \frac{K}{1 + e^{-(rt + C)}} = \frac{K}{1 + e^{-r(t - t_0)}} = K - \frac{K}{1 + e^{r(t - t_0)}} = \frac{K}{1 + Ae^{-rt}} = \frac{P_0e^{rt}}{1 + \dfrac{P_0}{K}\left(e^{rt} - 1\right)} \qquad (22)$$

where $P_0 = P(0)$, $C = \ln\dfrac{P_0}{K - P_0}$, $A = \dfrac{K - P_0}{P_0}$ and $t_0 = -\dfrac{C}{r} = \dfrac{\ln A}{r}$.

Figure 2. Logistic curve: $P(t) = \dfrac{K}{1 + e^{-r(t - t_0)}}$, $K = 100$, $t_0 = 10$, $r = 0.138629436$.
A general form of the logistic curve is

$$y = \frac{A_1 - A_2}{1 + e^{k(x - x_0)}} + A_2 \qquad (23)$$

Figure 3. Logistic curve: $y = \dfrac{A_1 - A_2}{1 + e^{k(x - x_0)}} + A_2$, $A_1 = 20$, $A_2 = 120$, $k = 0.138629436$, $x_0 = 60$.

$y = A_1$ is the lower asymptote of the curve given by (23) as $x \to -\infty$ and $y = A_2$ is the upper asymptote as $x \to +\infty$. The midpoint of the curve is $\left(x_0, \tfrac{1}{2}(A_1 + A_2)\right)$: when $x = x_0$ then $e^{k(x - x_0)} = e^0 = 1$ and $y = \tfrac{1}{2}(A_1 + A_2)$.
Sigmoid Function

The sigmoid function is a special case of the logistic function. The sigmoid curve is a symmetric S-shape with a midpoint at $\left(0, \tfrac{1}{2}\right)$ and asymptotes $y = 0$ as $x \to -\infty$ and $y = 1$ as $x \to +\infty$:

$$y = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1} \qquad (24)$$

Figure 4. Sigmoid curve: $y = \dfrac{1}{1 + e^{-x}}$.

The derivative of the sigmoid function, denoted by $y' = \dfrac{dy}{dx}$, is

$$y' = \frac{d}{dx}\left(1 + e^{-x}\right)^{-1} = -\left(1 + e^{-x}\right)^{-2}\left(-e^{-x}\right) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^2} \qquad (25)$$

and from (24) $e^{-x} = \dfrac{1 - y}{y}$ and $1 + e^{-x} = \dfrac{1}{y}$, giving

$$y' = y(1 - y) \qquad (26)$$
The 2nd derivative of the sigmoid function, denoted by $y'' = \dfrac{d^2y}{dx^2}$, is

$$y'' = \frac{d}{dx}\left[e^{-x}\left(1 + e^{-x}\right)^{-2}\right] = e^{-x}\left(-2\right)\left(1 + e^{-x}\right)^{-3}\left(-e^{-x}\right) + \left(1 + e^{-x}\right)^{-2}\left(-e^{-x}\right) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^2}\left[\frac{2e^{-x}}{1 + e^{-x}} - 1\right] = y'(1 - 2y) \qquad (27)$$

As before, the inflexion point of the sigmoid curve will be at the point where $y'' = 0$, and from (27) $y'' = 0$ when $y = \tfrac{1}{2}$, which occurs at the midpoint $\left(0, \tfrac{1}{2}\right)$.
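The identity $y' = y(1 - y)$ of (26) can be verified against a finite-difference derivative; a minimal Python sketch:

```python
import math

def sigmoid(x):
    # Equation (24): y = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x, h=1e-6):
    # Central-difference derivative, for comparison with y(1 - y)
    return (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 1.5):
    y = sigmoid(x)
    print(round(abs(sigmoid_prime(x) - y * (1.0 - y)), 8))   # 0.0 at each x
print(sigmoid(0.0))   # 0.5, the midpoint value
```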
The Logistic Distribution

A probability density function $f_X(x)$ has the properties

$$1. \quad f_X(x) \ge 0 \qquad (28)$$
$$2. \quad \int_{-\infty}^{+\infty} f_X(x)\,dx = 1 \qquad (29)$$

The probability that a random variable $X$ lies between any two values $x = a$ and $x = b$ is the area under the density curve between those two values and is found by methods of integral calculus:

$$P(a < X < b) = \int_a^b f_X(x)\,dx \qquad (30)$$

The cumulative distribution function $F_X(x)$ has the properties

$$1. \quad F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(x)\,dx \qquad (31)$$
$$2. \quad \frac{d}{dx}F_X(x) = f_X(x) \qquad (32)$$
In many scientific analyses of experimental data it is assumed the data are members of a probability
distribution with a density function having a smooth bell-shaped curve with tails that approach the
asymptote fX ( x ) = 0 as x → ±∞ and a cumulative distribution function that is a symmetric S-shape with
asymptotes FX ( x ) = 0 and FX ( x ) = 1 as x → −∞ and x → +∞ respectively.
As an example, Appendix A shows the probability density curve and the cumulative distribution curve of the familiar Normal distribution with probability density function

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$

where the infinite population has mean $\mu$, variance $\sigma^2$ and standard deviation $\sigma = +\sqrt{\sigma^2}$.

Now suppose that our experimental data has a Logistic distribution and the cumulative distribution function is the logistic function with location parameter $a$ and shape parameter $b$ [see (23) with $A_1 = 0$, $A_2 = 1$, $x_0 = a$ and $k = 1/b$]:

$$F_X(x) = \frac{1}{1 + e^{-\frac{x - a}{b}}} \qquad (33)$$
The probability density function is the derivative of (33):

$$f_X(x) = \frac{d}{dx}\left(1 + e^{-\frac{x - a}{b}}\right)^{-1}$$

Let $u = -\dfrac{x - a}{b}$, so that $\dfrac{du}{dx} = -\dfrac{1}{b}$, and write $\dfrac{d}{dx}\left(1 + e^{u}\right)^{-1} = \dfrac{d}{du}\left(1 + e^{u}\right)^{-1}\dfrac{du}{dx}$. Using the rule $\dfrac{d}{dx}e^x = e^x$, the probability density function of the Logistic distribution is

$$f_X(x) = \frac{e^{-\frac{x - a}{b}}}{b\left(1 + e^{-\frac{x - a}{b}}\right)^2} \qquad (34)$$
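Relation (32) between the distribution function (33) and the density (34) can be checked numerically; a short Python sketch (the parameter values are illustrative only):

```python
import math

def F(x, a=0.0, b=1.0):
    # Cumulative distribution function (33)
    return 1.0 / (1.0 + math.exp(-(x - a) / b))

def f(x, a=0.0, b=1.0):
    # Probability density function (34)
    u = math.exp(-(x - a) / b)
    return u / (b * (1.0 + u) ** 2)

# The density should match the numerical derivative of the distribution
a, b, h = 2.0, 1.5, 1e-6
for x in (-1.0, 2.0, 5.0):
    dF = (F(x + h, a, b) - F(x - h, a, b)) / (2.0 * h)
    print(round(abs(dF - f(x, a, b)), 8))   # 0.0 at each x
```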
The mean $\mu_X$ and variance $\sigma_X^2$ are special mathematical expectations (see Appendix A):

$$\mu_X = E\{X\} = \int_{-\infty}^{+\infty} x\,f_X(x)\,dx \qquad (35)$$
$$\sigma_X^2 = E\left\{(X - \mu_X)^2\right\} = \int_{-\infty}^{+\infty} (x - \mu_X)^2 f_X(x)\,dx \qquad (36)$$

Using the rules for expectations, a more useful expression for the variance can be developed:

$$\sigma_X^2 = E\left\{(X - \mu_X)^2\right\} = E\left\{X^2 - 2X\mu_X + \mu_X^2\right\} = E\{X^2\} - 2\mu_X E\{X\} + \mu_X^2 = E\{X^2\} - \left(E\{X\}\right)^2 \qquad (37)$$
The following derivations are due to Max Hunter (2018), my mentor in all things mathematical, especially the lovely integral solutions that follow.¹
The mean

$$\mu_X = \int_{-\infty}^{+\infty} x\,f_X(x)\,dx = \int_{-\infty}^{+\infty} \frac{x\,e^{-\frac{x - a}{b}}}{b\left(1 + e^{-\frac{x - a}{b}}\right)^2}\,dx \qquad (38)$$

With the substitution $t = \dfrac{x - a}{b}$, then $x = tb + a$ and $dx = b\,dt$, and with $t = \pm\infty$ when $x = \pm\infty$, (38) becomes

$$\mu_X = \int_{-\infty}^{+\infty} \frac{(tb + a)e^{-t}}{\left(1 + e^{-t}\right)^2}\,dt = b\int_{-\infty}^{+\infty} \frac{t\,e^{-t}}{\left(1 + e^{-t}\right)^2}\,dt + a\int_{-\infty}^{+\infty} \frac{e^{-t}}{\left(1 + e^{-t}\right)^2}\,dt \qquad (39)$$

¹ Max Hunter is a retired mathematician from RMIT University, Melbourne, Australia. In an earlier version of this document I had resorted to the use of a divergent series in solving integrals for the mean and variance. Max, on reading this, sent me a note suggesting I had set back mathematics and statistics about 300 years! But very kindly attached several elegant solutions that avoided the use of said series.
First,

$$\frac{e^x}{\left(1 + e^x\right)^2} = \frac{e^x}{1 + 2e^x + e^{2x}} = \frac{e^x}{e^{2x}\left(e^{-2x} + 2e^{-x} + 1\right)} = \frac{e^{-x}}{\left(1 + e^{-x}\right)^2} \qquad (40)$$

Second,

$$\cosh x = \frac{e^x + e^{-x}}{2} \quad \text{and} \quad \cosh\left(\tfrac{1}{2}x\right) = \frac{e^{\frac{1}{2}x} + e^{-\frac{1}{2}x}}{2}$$

and with $e^x\left(1 + e^{-x}\right)^2 = e^x + 2 + e^{-x}$, then

$$\cosh^2\left(\tfrac{1}{2}x\right) = \frac{e^x + 2 + e^{-x}}{4} = \frac{e^x\left(1 + e^{-x}\right)^2}{4}$$

and

$$\frac{e^{-x}}{\left(1 + e^{-x}\right)^2} = \tfrac{1}{4}\,\mathrm{sech}^2\left(\tfrac{1}{2}x\right) \qquad (41)$$

Third, from $\dfrac{d}{dx}\tanh u = \mathrm{sech}^2 u\,\dfrac{du}{dx}$,

$$\frac{d}{dx}\left(\tanh\left(\tfrac{1}{2}x\right)\right) = \tfrac{1}{2}\,\mathrm{sech}^2\left(\tfrac{1}{2}x\right) \qquad (42)$$
Now, to evaluate (39) it is useful to note that $\displaystyle\int_{-\infty}^{+\infty} \frac{t\,e^{-t}}{\left(1 + e^{-t}\right)^2}\,dt = 0$, since by (40) the integrand $f(t) = \dfrac{t\,e^{-t}}{\left(1 + e^{-t}\right)^2}$ is an odd function of $t$, with $f(-t) = -f(t)$, and the interval of integration is symmetric. Hence the mean becomes

$$\mu_X = a\int_{-\infty}^{+\infty} \frac{e^{-t}}{\left(1 + e^{-t}\right)^2}\,dt \qquad (43)$$

Using (41) and (42),

$$\int_{-\infty}^{+\infty} \frac{e^{-t}}{\left(1 + e^{-t}\right)^2}\,dt = \int_{-\infty}^{+\infty} \tfrac{1}{4}\,\mathrm{sech}^2\left(\tfrac{1}{2}t\right)dt = \tfrac{1}{2}\int_{-\infty}^{+\infty} \frac{d}{dt}\left(\tanh\left(\tfrac{1}{2}t\right)\right)dt = \tfrac{1}{2}\Big[\tanh\left(\tfrac{1}{2}t\right)\Big]_{-\infty}^{+\infty} = \tfrac{1}{2}\left(1 - (-1)\right) = 1 \qquad (44)$$

and using this result in (43) gives the mean of the Logistic distribution as

$$\mu_X = a \qquad (45)$$
The variance

$$\sigma_X^2 = E\left\{(X - \mu_X)^2\right\} = E\{X^2\} - \left(E\{X\}\right)^2 = \int_{-\infty}^{+\infty} x^2 f_X(x)\,dx - \mu_X^2 = \int_{-\infty}^{+\infty} \frac{x^2 e^{-\frac{x - a}{b}}}{b\left(1 + e^{-\frac{x - a}{b}}\right)^2}\,dx - a^2 \qquad (46)$$

and, similarly to the derivation of the mean, the substitution $t = \dfrac{x - a}{b}$ gives $x = tb + a$ and $dx = b\,dt$, giving (46) as

$$\sigma_X^2 = \int_{-\infty}^{+\infty} \frac{(tb + a)^2 e^{-t}}{\left(1 + e^{-t}\right)^2}\,dt - a^2 = \int_{-\infty}^{+\infty} \frac{\left((tb)^2 + 2abt + a^2\right)e^{-t}}{\left(1 + e^{-t}\right)^2}\,dt - a^2$$

and

$$\sigma_X^2 = b^2\int_{-\infty}^{+\infty} \frac{t^2 e^{-t}}{\left(1 + e^{-t}\right)^2}\,dt + 2ab\int_{-\infty}^{+\infty} \frac{t\,e^{-t}}{\left(1 + e^{-t}\right)^2}\,dt + a^2\int_{-\infty}^{+\infty} \frac{e^{-t}}{\left(1 + e^{-t}\right)^2}\,dt - a^2 \qquad (47)$$
Now the second integral of (47) equals zero, since the integrand is an odd function of $t$, and, using (44), the third integral of (47) equals one, giving the variance as

$$\sigma_X^2 = b^2\int_{-\infty}^{+\infty} \frac{t^2 e^{-t}}{\left(1 + e^{-t}\right)^2}\,dt = 2b^2\int_{0}^{\infty} \frac{t^2 e^{-t}}{\left(1 + e^{-t}\right)^2}\,dt \qquad (48)$$

since the function $f(t) = \dfrac{t^2 e^{-t}}{\left(1 + e^{-t}\right)^2}$ is symmetric about the $f(t)$ axis.
Now, using the series expansion $\dfrac{1}{(1 + x)^2} = 1 - 2x + 3x^2 - 4x^3 + 5x^4 - \cdots$,

$$\frac{e^{-x}}{\left(1 + e^{-x}\right)^2} = e^{-x} - 2e^{-2x} + 3e^{-3x} - 4e^{-4x} + 5e^{-5x} - \cdots = \sum_{n=1}^{\infty} n(-1)^{n-1}e^{-nx} \qquad (49)$$

and

$$\sigma_X^2 = 2b^2\int_{0}^{\infty} t^2\sum_{n=1}^{\infty} n(-1)^{n-1}e^{-nt}\,dt = 2b^2\sum_{n=1}^{\infty} n(-1)^{n-1}\int_{0}^{\infty} t^2e^{-nt}\,dt$$

Using the standard integral result $\displaystyle\int x^2e^{ax}\,dx = \frac{e^{ax}}{a}\left(x^2 - \frac{2x}{a} + \frac{2}{a^2}\right)$ gives
$$\sigma_X^2 = 2b^2\sum_{n=1}^{\infty} n(-1)^{n-1}\left[\frac{e^{-nt}}{-n}\left(t^2 + \frac{2t}{n} + \frac{2}{n^2}\right)\right]_0^{\infty} = 2b^2\sum_{n=1}^{\infty} n(-1)^{n-1}\left[0 - \left(-\frac{2}{n^3}\right)\right] = 2b^2\sum_{n=1}^{\infty} n(-1)^{n-1}\,\frac{2}{n^3} = 4b^2\sum_{n=1}^{\infty} (-1)^{n-1}\frac{1}{n^2} \qquad (50)$$

And

$$\sum_{n=1}^{\infty} (-1)^{n-1}\frac{1}{n^2} = 1 - \frac{1}{2^2} + \frac{1}{3^2} - \frac{1}{4^2} + \frac{1}{5^2} - \cdots = \left(1 + \frac{1}{2^2} + \frac{1}{3^2} + \frac{1}{4^2} + \cdots\right) - 2\left(\frac{1}{2^2} + \frac{1}{4^2} + \frac{1}{6^2} + \cdots\right) = \sum_{n=1}^{\infty}\frac{1}{n^2} - \frac{2}{2^2}\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6} - \frac{\pi^2}{12} = \frac{\pi^2}{12} \qquad (51)$$

[Note here Euler's remarkable result $\displaystyle\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}$ (Euler 1748).]
Hence, using (51) in (50) gives the variance of the Logistic distribution as

$$\sigma_X^2 = \frac{b^2\pi^2}{3} \qquad (52)$$

The standard deviation $\sigma$ (the positive square root of the variance) and the shape parameter $b$ are related by

$$\sigma = \frac{b\pi}{\sqrt{3}}, \qquad b = \frac{\sigma\sqrt{3}}{\pi} \qquad (53)$$

If $X$ is a random variable having a Logistic distribution with parameters $\mu$ and $\sigma$, the usual statistical notation is $X \sim \mathrm{LOG}(\mu, \sigma)$ and the probability density function is

$$f_X(x : \mu, \sigma) = \frac{\pi\,e^{-\frac{\pi}{\sqrt{3}}\left(\frac{x - \mu}{\sigma}\right)}}{\sigma\sqrt{3}\left(1 + e^{-\frac{\pi}{\sqrt{3}}\left(\frac{x - \mu}{\sigma}\right)}\right)^2} = \frac{\pi}{4\sigma\sqrt{3}}\,\mathrm{sech}^2\left(\frac{\pi}{2\sqrt{3}}\,\frac{x - \mu}{\sigma}\right) \qquad (54)$$
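The results $\mu_X = a$ and $\sigma_X^2 = b^2\pi^2/3$, together with the alternating series (51), can be checked numerically; a rough Python sketch using plain trapezoidal quadrature (the parameter values are illustrative only):

```python
import math

a, b = 2.0, 1.5   # location and shape parameters of the Logistic distribution

def f(x):
    # Density (34)
    u = math.exp(-(x - a) / b)
    return u / (b * (1.0 + u) ** 2)

def integrate(g, lo=-60.0, hi=60.0, n=200000):
    # Plain trapezoidal rule; the tails decay exponentially, so a wide
    # finite interval captures essentially all of the mass
    h = (hi - lo) / n
    s = 0.5 * (g(lo) + g(hi)) + sum(g(lo + i * h) for i in range(1, n))
    return s * h

mean = integrate(lambda x: x * f(x))
var = integrate(lambda x: (x - a) ** 2 * f(x))
print(round(mean, 4))                               # 2.0, i.e. the location a
print(round(abs(var - b * b * math.pi ** 2 / 3), 4))  # 0.0

# The alternating series (51) converges to pi^2/12
alt = sum((-1) ** (n - 1) / n ** 2 for n in range(1, 200001))
print(round(abs(alt - math.pi ** 2 / 12), 6))       # 0.0
```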
Figure 5. Probability density curve $f_X(x : \mu, \sigma)$, $\mu = 5$ and $\sigma = \dfrac{2\pi}{3}$.

The cumulative distribution function of the Logistic distribution is

$$F_X(x : \mu, \sigma) = \frac{1}{1 + e^{-\frac{\pi}{\sqrt{3}}\left(\frac{x - \mu}{\sigma}\right)}} = \frac{1}{2}\left[1 + \tanh\left(\frac{\pi}{2\sqrt{3}}\,\frac{x - \mu}{\sigma}\right)\right] \qquad (55)$$

using $\tanh x = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$ and $\tfrac{1}{2}\left(1 + \tanh x\right) = \dfrac{e^x}{e^x + e^{-x}} = \dfrac{1}{1 + e^{-2x}}$.

Figure 6. Cumulative distribution curve $F_X(x : \mu, \sigma)$, $\mu = 5$ and $\sigma = \dfrac{2\pi}{3}$.
Logistic Regression
Logistic regression was developed by statistician David Cox (Cox 1958) as a means of measuring the
relationship between a binary dependent variable (yes or no, win or loss, 1 or 0, etc.) and one or more
independent variables by estimating probabilities using a logistic function. The key features of logistic
regression are (i) the conditional distribution y | x is a Bernoulli distribution2 where y | x means y given x
and (ii) the predicted values are probabilities and are therefore restricted to (0,1).
The model for logistic regression is the function

$$y = \frac{1}{1 + e^{-z}} \qquad (56)$$

which can be rearranged as follows:

$$y + ye^{-z} = 1, \qquad e^{-z} = \frac{1 - y}{y}, \qquad e^{z} = \frac{y}{1 - y}$$

Taking natural logarithms of both sides, noting that $\ln e^z = z$ and $z = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_nx_n$, gives

$$\beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_nx_n = \ln\frac{y}{1 - y} \qquad (57)$$

This is a non-linear function relating probabilities $y$ with independent variables $x$ and coefficients $\beta$. To enable the determination of coefficients of variables given probabilities it would be desirable to have a linear function relating these quantities, and this may be achieved with the aid of Taylor's theorem, which for a single variable can be expressed in the following form:

$$f(x) = f(a) + \frac{df}{dx}\bigg|_a(x - a) + \frac{d^2f}{dx^2}\bigg|_a\frac{(x - a)^2}{2!} + \frac{d^3f}{dx^3}\bigg|_a\frac{(x - a)^3}{3!} + \cdots + \frac{d^{n-1}f}{dx^{n-1}}\bigg|_a\frac{(x - a)^{n-1}}{(n - 1)!} + R_n \qquad (58)$$

where $R_n$ is the remainder after $n$ terms, $\lim_{n\to\infty} R_n = 0$ for $f(x)$ about $x = a$, and $\dfrac{df}{dx}\Big|_a$, $\dfrac{d^2f}{dx^2}\Big|_a$, etc. are the derivatives evaluated at $x = a$.
Using Taylor’s theorem on the right-hand-side of (57) and evaluating the derivatives about the point y = p
gives
2The probability distribution (in honour of the Swiss mathematician Jacob Bernoulli) of a random variable
which takes the value of 1 with probability p and the value of 0 with the probability q = 1 − p.
$$\ln\frac{y}{1 - y} = \ln\frac{p}{1 - p} + \frac{1}{p(1 - p)}(y - p) + \frac{2p - 1}{2p^2(1 - p)^2}(y - p)^2 + \frac{3p^2 - 3p + 1}{3p^3(1 - p)^3}(y - p)^3 + \cdots = \ln\frac{p}{1 - p} + \frac{1}{p(1 - p)}(y - p) + \text{higher order terms} \qquad (59)$$

If the approximation $p$ is close to $y$ then the differences $(y - p)$ will be exceedingly small and we may neglect the higher order terms in (59) and write

$$\ln\frac{y}{1 - y} \cong \ln\frac{p}{1 - p} + \frac{y - p}{p(1 - p)} \qquad (60)$$

where

$$p = \frac{1}{1 + e^{-z_0}} \qquad (61)$$

and $z_0 = \beta_0^0 + \beta_1^0x_1 + \beta_2^0x_2 + \cdots + \beta_n^0x_n$, with $\beta_0^0, \beta_1^0, \beta_2^0, \ldots$ approximate values of the coefficients $\beta_0, \beta_1, \beta_2, \ldots$. And bearing in mind (57),

$$\beta_0^0 + \beta_1^0x_1 + \beta_2^0x_2 + \cdots + \beta_n^0x_n = \ln\frac{p}{1 - p} \qquad (62)$$

Now, using (62) in (60) and adding a residual $v$ to the left-hand side to account for small random errors in the variables $x$, an equation that can be used to determine the coefficients $\beta$ in an iterative scheme is

$$v + \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_nx_n = \beta_0^0 + \beta_1^0x_1 + \beta_2^0x_2 + \cdots + \beta_n^0x_n + \frac{y - p}{p(1 - p)} \qquad (63)$$

To see how this may work, suppose as an example 20 students sitting an examination with the observed outcomes $y_1, y_2, \ldots, y_{20}$ as pass or fail, recorded as 1 or 0. These outcomes are thought to be related to a single variable $x$, the hours of study in preparation for the exam, and the logistic function is assumed to be $y = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1x)}}$. The wish is to determine the two coefficients $\beta_0$ and $\beta_1$. Now assume approximate values $\beta_0^0, \beta_1^0$ for the coefficients; following (63), the observation equation for the $k$th outcome is

$$v_k + \beta_0 + \beta_1x_k = \beta_0^0 + \beta_1^0x_k + \frac{y_k - p_k}{p_k(1 - p_k)} \qquad (64)$$
and the 20 observation equations are

$$v_1 + \beta_0 + \beta_1x_1 = \beta_0^0 + \beta_1^0x_1 + \frac{y_1 - p_1}{p_1(1 - p_1)}$$
$$v_2 + \beta_0 + \beta_1x_2 = \beta_0^0 + \beta_1^0x_2 + \frac{y_2 - p_2}{p_2(1 - p_2)}$$
$$\vdots$$
$$v_{20} + \beta_0 + \beta_1x_{20} = \beta_0^0 + \beta_1^0x_{20} + \frac{y_{20} - p_{20}}{p_{20}(1 - p_{20})}$$
These can be written in matrix form as

$$\mathbf{v} + \mathbf{B}\mathbf{x} = \mathbf{f} \qquad (65)$$

where the vector of numeric terms is

$$\mathbf{f} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_{20} \end{bmatrix}\begin{bmatrix} \beta_0^0 \\ \beta_1^0 \end{bmatrix} - \begin{bmatrix} 1/(1 - p_1) \\ 1/(1 - p_2) \\ 1/(1 - p_3) \\ \vdots \\ 1/(1 - p_{20}) \end{bmatrix} + \begin{bmatrix} y_1/\left(p_1(1 - p_1)\right) \\ y_2/\left(p_2(1 - p_2)\right) \\ y_3/\left(p_3(1 - p_3)\right) \\ \vdots \\ y_{20}/\left(p_{20}(1 - p_{20})\right) \end{bmatrix} = \mathbf{B}\mathbf{x}^0 - \mathbf{c} + \mathbf{A}\mathbf{y} = \mathbf{d} + \mathbf{A}\mathbf{y} \qquad (66)$$

Here $\mathbf{x}^0 = \begin{bmatrix} \beta_0^0 & \beta_1^0 \end{bmatrix}^T$ is the vector of approximate coefficients, $\mathbf{c}$ is the vector of numeric terms $1/(1 - p_k)$, $\mathbf{y} = \begin{bmatrix} y_1 & y_2 & \cdots & y_{20} \end{bmatrix}^T$ is the vector of measurements, and $\mathbf{A}$ is the diagonal coefficient matrix with elements $1/\left(p_k(1 - p_k)\right)$.
Propagation of variances applied to (66) gives

$$\mathbf{V}_{ff} = \mathbf{A}\mathbf{V}_{yy}\mathbf{A}^T \qquad (67)$$

where $\mathbf{V}_{yy}$ is a diagonal matrix containing the variances of the measurements $\mathbf{y}$ and $\mathbf{V}_{ff}$ is a diagonal matrix containing the variances of the numeric terms $\mathbf{f}$ in (66). The $y$'s on the right-hand side of (66) are random variables that follow a Bernoulli distribution and take the value 1 with probability $p$ and the value 0 with probability $q = 1 - p$. The variance of these variables is $p(1 - p)$ and the general form of $\mathbf{V}_{ff}$ is

$$\mathbf{V}_{ff} = \mathrm{diag}\left(\frac{1}{p_1(1 - p_1)},\ \frac{1}{p_2(1 - p_2)},\ \ldots,\ \frac{1}{p_{20}(1 - p_{20})}\right) \qquad (68)$$

Now, with the general relationship that weights are inversely proportional to variances, i.e. $\mathbf{W} = \mathbf{V}^{-1}$, the general form of the weight matrix $\mathbf{W}$ of (65) is

$$\mathbf{W} = \mathrm{diag}\left(p_1(1 - p_1),\ p_2(1 - p_2),\ \ldots,\ p_{20}(1 - p_{20})\right) \qquad (69)$$
The coefficients in the vector $\mathbf{x}$ in (65) can now be solved for using least squares⁴ with the standard result (Mikhail 1976)

$$\mathbf{x} = \left(\mathbf{B}^T\mathbf{W}\mathbf{B}\right)^{-1}\mathbf{B}^T\mathbf{W}\mathbf{f} \qquad (70)$$

and these are 'updates' of the approximate values in $\mathbf{x}^0$. This iterative process can be repeated until there is no appreciable change in the updated coefficients. In the literature associated with logistic regression this iterative least squares process of determining the coefficients is known as Iteratively Reweighted Least Squares (IRLS).
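The scheme of (63)-(70) can be sketched in a few lines of Python. This is an illustrative re-implementation (not Gordon's logistic.m), applied to the 20-student data used later in the text:

```python
import math

# The 20-student example: hours of study x and pass/fail outcome y
x = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
     2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

def irls(x, y, iters=25):
    """Iteratively Reweighted Least Squares for a one-variable logistic model.

    Each pass solves the 2x2 weighted normal equations B^T W B beta = B^T W f
    with weights w_k = p_k (1 - p_k), as in (63)-(70)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Working response f_k = z_k + (y_k - p_k) / (p_k (1 - p_k))
        f = [b0 + b1 * xi + (yi - pi) / wi
             for xi, yi, pi, wi in zip(x, y, p, w)]
        # Accumulate the normal equations for the design matrix B = [1  x]
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        s_wf = sum(wi * fi for wi, fi in zip(w, f))
        s_wxf = sum(wi * xi * fi for wi, xi, fi in zip(w, x, f))
        det = s_w * s_wxx - s_wx ** 2
        b0 = (s_wxx * s_wf - s_wx * s_wxf) / det
        b1 = (s_w * s_wxf - s_wx * s_wf) / det
    return b0, b1

b0, b1 = irls(x, y)
print(round(b0, 4), round(b1, 4))   # close to the text's -4.0776 and 1.5046
```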
⁴ Least squares is a method of estimating the parameters of overdetermined systems of equations. It is commonly used to determine the line of best fit through a number of data points, and this application is known as linear regression.
A MATLAB5 function logistic.m that solves for the parameters of a logistic regression has been developed by
Professor Geoffrey J. Gordon in the Machine Learning Department at Carnegie Mellon University, USA and
is available from his website: http://www.cs.cmu.edu/~ggordon/IRLS-example/. This function will also run
under GNU OCTAVE6. A copy of the function and an example of its use is shown in Appendix B.
The data for the example used for explanation (20 students studying for an examination) is from the
Wikipedia page Logistic Regression (https://en.wikipedia.org/wiki/Logistic_regression) and is shown in the
table below.
20 students sit an examination with the observed outcomes y1, y2 , …, y20 as pass or fail and recorded as 1 or
0. These outcomes are related to a single variable x that is hours of study in preparation for the exam and
1
the logistic function is assumed to be y = . The coefficients β0 and β1 are computed using the
−( β0 + β1x )
1+e
MATLAB function logistic.m. The input data and results are shown below
>> a
a =
1.00000 0.50000
1.00000 0.75000
1.00000 1.00000
1.00000 1.25000
1.00000 1.50000
1.00000 1.75000
1.00000 1.75000
1.00000 2.00000
1.00000 2.25000
1.00000 2.50000
1.00000 2.75000
1.00000 3.00000
1.00000 3.25000
1.00000 3.50000
1.00000 4.00000
1.00000 4.25000
1.00000 4.50000
1.00000 4.75000
1.00000 5.00000
1.00000 5.50000
>> y'
ans =
0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1
>> xhat
xhat =
  -4.0776
   1.5046
The array a is the coefficient matrix $\mathbf{B}$ and the vector y is $\mathbf{y}$ (noting that y' denotes the transpose $\mathbf{y}^T$). The iterative process has converged after 8 iterations and the coefficients are in the array xhat, where $\beta_0 = -4.0776$ and $\beta_1 = 1.5046$.

The midpoint of the fitted curve, where the probability of passing is 0.5, occurs where $\beta_0 + \beta_1x_0 = 0$, giving $x_0 = -\dfrac{\beta_0}{\beta_1} = 2.7101$ hours of study.

In this example, a student who studies 2 hours has a probability of passing the exam of 0.26:

$$P(\text{pass}) = \frac{1}{1 + e^{-(-4.0776 + 1.5046 \times 2)}} = 0.2557$$

and similarly, a student who studies 4 hours has a probability of passing the exam of 0.87:

$$P(\text{pass}) = \frac{1}{1 + e^{-(-4.0776 + 1.5046 \times 4)}} = 0.8744$$
Table 2 shows probabilities of passing the exam for several values of hours of study
Table 2.
7 The Swiss System (also known as the Swiss Ladder System) is a tournament system that allows
participants to play a limited number of rounds against opponents of similar strength. The system was
introduced in 1895 by Dr. J. Muller in a chess tournament in Zurich, hence the name ‘Swiss System’. The
principles of the system are: (i) In every round, each player is paired with an opponent with an equal score
(or as nearly equal as possible); (ii) Two players are paired at most once; (iii) After a predetermined number
of rounds the players are ranked according to a set of criteria. The leading player wins; or the ranking is the
basis of subsequent elimination series.
Qualifying Ranking

Rank  Team  Score  BHN  fBHN  Games  Points  delta
  1     3     4     7    40    4:0   52:26    +26
  2    15     3    10    36    3:1   50:33    +17
  3     7     3    10    29    3:1   48:41     +7
  4    16     3     9    34    3:1   46:31    +15
  5     4     3     6    36    3:1   41:36     +5
  6     1     3     5    40    3:1   40:33     +7
  7    14     2    11    27    2:2   34:36     –2
  8     6     2    10    33    2:2   39:47     –8
  9    17     2     7    39    2:2   37:41     –4
 10     2     2     7    35    2:2   43:32    +11
 11    18     2     7    35    2:2   44:35     +9
 12    11     2     5    30    2:2   38:34     +4
 13     8     1    11    27    1:3   35:42     –7
 14    13     1     9    27    1:3   44:40     +4
 15     5     1     8    26    1:3   35:47    –12
 16     9     1     6    32    1:3   25:44    –19
 17    12     1     6    27    1:3   19:45    –26
 18    10     0    10    23    0:4   23:50    –27
Table 4. Dove Open Doubles: Ranking after Qualifying rounds
(BHN is Buchholtz Number8, fBHN is Fine Buchholtz Number)
Principale Complémentaire
Rank Team Rank Team
1 3 1 18
2 14 2 11
3 7 3 2
4 16 4 17
4 8
6 13
=5 15 =5 9
1 5
Table 7. Dove Open Doubles: Final Ranking in Principale & Complémentaire
8 The Buchholtz system is a ranking system, first used by Bruno Buchholtz in a Swiss System chess
tournament in 1932. The principle of the system is that when two players have equal scores at the end of a
defined number of rounds a tie break is required to determine the top ranked player. The scores of both
player’s opponents (in all rounds) are added giving each their Buchholtz Number (BHN). The player having
the larger BHN is ranked higher on the assumption they have played against better performing players. The
Fine Buchholtz Number (fBHN) is the sum of the opponents' Buchholtz Numbers and is used to break ties where players' BHN are equal. In the rare case that Score, BHN and fBHN are all equal, then delta = points For − points Against is used as a tie break (see Teams 2 & 18 in Table 7).
A least squares solution for team ratings, based on the 36 Qualifying matches yielded the following values
with the highest rating team, (team 15) having a rating r15 = 100 .
Table 8. Win/Loss and rating difference for Qualifying matches of Dove Open Doubles.
Using the values in Table 8, Logistic Regression is used to compute the parameters of a curve representing
the probability of Team A winning given a certain rating difference.
The curve is assumed to have the form $y = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1x)}}$, where $y$ is the probability of winning and $x$ is the rating difference between the two teams, and the MATLAB function logistic.m is used to compute the coefficients $\beta_0$ and $\beta_1$.
The input arrays a and y are shown below together with the output vector xhat (Note that a′ denotes
transpose)
>> a'
ans =
>> y'
ans =
1 0 1 1 1 1 1 0 1 0 0 0 0 1 1 0 1 0 1 0 0 0
1 0 0 1 0 1 0 0 1 1 0 1 1 0
>>
>> xhat
xhat =
  -0.72544
   0.97094
>>
The iterative process has converged after 10 iterations and the coefficients are in the array xhat where
β0 = −0.72544 and β1 = 0.97094 .
The midpoint of the fitted curve, where the probability of winning is 0.5, occurs where $\beta_0 + \beta_1x_0 = 0$, giving $x_0 = -\dfrac{\beta_0}{\beta_1} = 0.74715$.

Elo's rating system⁹ uses a logistic function to give the probability of a win as

$$p_A = \frac{1}{1 + 10^{-\frac{r_A - r_B}{b}}} \qquad (71)$$

where $p_A$ is the probability of player A winning in a match A versus B, given the player ratings $r_A$, $r_B$, and $b$ is a shape parameter. The curve of this function has a similar form to the cumulative distribution curve of the Logistic distribution (33) with $e$ replaced by 10 as the base, where $x = r_A - r_B$ is the rating difference between players A and B, $a = 0$ and $b$ is the shape parameter.

⁹ Arpad Elo (1903–1992), the Hungarian-born US physics professor and chess-master who devised a system to rate chess players that was implemented by the United States Chess Federation (USCF) in 1960 and adopted by the World Chess Federation (FIDE) in 1970. Elo described his work in his book The Rating of Chess Players, Past & Present, published in 1978, and his system has been adapted to many sports.
Figure 7. Elo curve: $y = \dfrac{1}{1 + 10^{-\frac{x}{400}}}$. $y = p_A$ is the probability of A winning, $x = r_A - r_B$ is the rating difference and the shape parameter is $b = 400$. The three points on the curve shown thus ○ have rating differences −265, 174 and 626, which correspond with probabilities 0.179, 0.731 and 0.973 respectively.

This Elo curve has the shape parameter $b = 400$, and this value is chosen so that a player rating difference of approximately 200 corresponds to a probability of winning of approximately 0.75. With $y = p_A$ and $x = r_A - r_B$, (71) becomes

$$y = \frac{1}{1 + 10^{-\frac{x}{400}}} \qquad (72)$$

and this equation can be rearranged as $10^{-\frac{x}{400}} = \dfrac{1 - y}{y}$. Now using the rule for logarithms that if $p = \log_a N$ then $N = a^p$, the expression for the rating difference $x$ is

$$x = -400\log_{10}\left(\frac{1 - y}{y}\right) \qquad (73)$$
For example, suppose two players A and B with ratings 1862 and 1671 respectively play a match. The probability of A winning is given by (71) as

$$p_A = \frac{1}{1 + 10^{-\frac{1862 - 1671}{400}}} = \frac{1}{1 + 10^{-\frac{191}{400}}} = \frac{1}{1 + 10^{-0.4775}} = 0.750163482$$

We might express this probability of A winning as: (i) if A played B in 100 matches then A would win 75 of them (75.0163482 actually), or (ii) A has a 75% chance of winning.
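Equation (71) and the worked example translate directly into code; a minimal Python sketch (the function name is just for illustration):

```python
def elo_win_probability(rA, rB, b=400.0):
    # Equation (71): p_A = 1 / (1 + 10^(-(rA - rB)/b))
    return 1.0 / (1.0 + 10.0 ** (-(rA - rB) / b))

pA = elo_win_probability(1862, 1671)
print(round(pA, 6))   # 0.750163, as in the worked example

# The two players' win probabilities are complementary
print(round(pA + elo_win_probability(1671, 1862), 9))   # 1.0
```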
Elo's logistic function uses exponents with a base of 10. This is the base of common logarithms and the following relationships may be useful. If

$$y = \frac{1}{1 + e^{-x}} = \frac{1}{1 + 10^{-\alpha x}} \qquad (74)$$

then

$$\alpha = \log_{10}e = 0.434294481903\ldots = \frac{1}{2.302585092994\ldots} \qquad (75)$$

Alternatively, if

$$y = \frac{1}{1 + 10^{-x}} = \frac{1}{1 + e^{-\beta x}} \qquad (76)$$

then

$$\beta = \ln 10 = 2.302585092994\ldots = \frac{1}{0.434294481903\ldots}$$
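These base-conversion constants are easy to confirm; a short Python check:

```python
import math

alpha = math.log10(math.e)   # 0.4342944819...
beta = math.log(10.0)        # 2.3025850930...

# e^(-x) = 10^(-alpha x), 10^(-x) = e^(-beta x), and alpha * beta = 1
x = 1.7
print(round(abs(math.exp(-x) - 10.0 ** (-alpha * x)), 12))   # 0.0
print(round(abs(10.0 ** (-x) - math.exp(-beta * x)), 12))    # 0.0
print(round(alpha * beta, 12))                               # 1.0
```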
References
Bacaër, N, 2011, A Short History of Mathematical Population Dynamics, Chapter 6, Verhulst and the logistic
equation (1838), pp.35-39, Springer-Verlag London Limited.
http://webpages.fc.ul.pt/~mcgomes/aulas/dinpop/Mod13/Verhulst.pdf [accessed 10-May-2018]
Cox, D.R., 1958, ‘The regression analysis of binary sequences’, Journal of the Royal Statistical Society. Series
B (Methodological), Vol. 20, No. 2, pp. 215-242.
http://www.jstor.org/stable/2983890 [accessed 10-May-2018]
Cramer, J.S., 2002, The Origins of Logistic Regression, Tinbergen Institute Discussion Paper, TI 2002-119/4, Faculty of Economics and Econometrics, University of Amsterdam, and Tinbergen Institute, 14 pages, November 2002.
https://papers.tinbergen.nl/02119.pdf [accessed 15-May-2018]
Elo, A.E., 1978, The Rating of Chess Players, Past & Present, 2nd printing, April 2008, Ishi Press
International.
Euler, L., 1748, Introduction to Analysis of the Infinite (On the use of the Discovered Fractions to Sum
Infinite Series), in Pi: A Source Book, by Berggren, L., Borwein, J. and Borwein, P., 1997, Springer,
New York.
Glickman, M.E. and Jones, A., 1999, ‘Rating the chess rating system’, Chance, Vol. 12, No. 2, pp. 21-28.
http://glicko.net/research/chance.pdf [accessed 25-Aug-2018]
Hunter, M.N, 2018, ‘Some comments about The Logistic Function’, Private correspondence, 6 pages, 01-Sep-
2018.
Johnson, N.L. and Leone, F.C., 1964, Statistics and Experimental Design In Engineering and the Physical
Sciences, Vol. I, John Wiley & Sons, Inc., New York
Kreyszig, Erwin, 1970, Introductory Mathematical Statistics, John Wiley & Sons, New York.
Langville, A.N. and Meyer, C.D., 2012, Who’s #1? The Science of Rating and Ranking, Princeton University
Press, Princeton.
Mikhail, E.M., 1976, Observations and Least Squares, IEP―A Dun-Donnelley, New York
O'Connor, J.J. and Robertson, E.F., 2014, 'Pierre François Verhulst', MacTutor History of Mathematics,
http://www-history.mcs.st-andrews.ac.uk/Biographies/Verhulst.html [accessed 16-May-2018]
Verhulst, P.F., 1838, ‘Notice sur la loi que la population suit dans son accroissement’, Correspondance
Mathématique et Physique, Publiée par A. Quetelet, Vol. 4, pp. 113-121
https://books.google.com.au/books?id=8GsEAAAAYAAJ&hl=fr&pg=PA113&redir_esc=y#v=onepa
ge&q&f=false [accessed 15-May-2018]
Verhulst, P.F., 1844, ‘Recherches mathématiques sur la loi d’accroissement de la population’, Nouveaux
Mémoires de l'Académie Royale des Sciences et Belles-Lettres de Bruxelles, Vol. 18, pp. 1-38
http://www.med.mcgill.ca/epidemiology/Hanley/anniversaries/ByTopic/Verhulst1844.pdf [accessed
15-May-2018]
P(Event) = n/N   (77)
For example, if a card is drawn from a deck of playing cards, what is the probability that it is a heart? In
this case, the experiment is the drawing of the card and the possible outcomes of the experiment could be one
of 52 different cards, i.e., the sample space is the set of N = 52 possible outcomes and the event is the
subset containing n = 13 hearts. The probability of drawing a heart is
P(Heart) = n/N = 13/52 = 0.25
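As an aside, this computation is easy to check by simulation; the following is a minimal Python sketch (the function name, trial count and seed are our own illustrative choices, not from the text):

```python
import random

def estimate_heart_probability(trials, seed=0):
    """Estimate P(Heart) by simulating random draws from a 52-card deck.

    Cards are represented abstractly: indices 0-12 stand for the 13
    hearts, indices 13-51 for the other three suits.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if rng.randrange(52) < 13)
    return hits / trials

# the relative frequency approaches n/N = 13/52 = 0.25 as the number
# of simulated draws grows
print(estimate_heart_probability(100_000))
```

With 100 000 simulated draws the estimate settles within a few thousandths of 0.25, in line with the limiting-frequency definition discussed next.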
This definition of probability is a simplification of a more general concept of probability that can be
explained in the following manner (see Johnson & Leone, 1964, pp.32-3).
Suppose observations are made on a series of occasions (often termed trials) and during these
trials it is noted whether or not a certain event occurs. The event can be almost any observable
phenomenon, for example, that the height of a person walking through a doorway is greater
than 1.8 metres, that a family leaving a cinema contains three children, that a defective item is
selected from an assembly line, and so on. These trials could be conducted twice a week for a
month, three times a day for six months or every hour for every day for 10 years. In the
theoretical limit, the number of trials N would approach infinity and we could assume, at this
point, that we had noted every possible outcome. Therefore, as N → ∞ then N becomes the
number of elements in the sample space containing all possible outcomes of the trials. Now for
each trial we note whether or not a certain event occurs, so that at the end of N trials we have
noted nN events. The probability of the event (if it in fact occurs) can then be defined as

P(Event) = lim_{N→∞} nN/N
Since nN and N are both non-negative numbers and nN is not greater than N then

0 ≤ nN/N ≤ 1
Hence
0 ≤ P(Event) ≤ 1
If the event occurs at every trial then nN = N and nN/N = 1 for all N and so P(Event) = 1.
This relationship can be described as: the probability of a certain (or sure) event is equal to 1.
If the event never occurs, then nN = 0 and nN/N = 0 for all N and so P(Event) = 0. This
relationship can be described as: the probability of an impossible event is zero.
The converse of these two relationships need not hold, i.e., a probability of one need not imply
certainty since it is possible that lim_{N→∞} nN/N = 1 without nN = N for all values of N, and a
probability of zero need not imply impossibility since it is possible that lim_{N→∞} nN/N = 0 even
though the event occurs in some trials.

A random variable is a rule (or function) that assigns a real number to every outcome in the sample
space of an experiment. For example, in the experiment of tossing two coins, where the sample space
is S = { hh, ht, th, tt }, let X be the number of heads obtained; then

X(hh) = 2
X(ht) = 1
X(th) = 1
X(tt) = 0
In this example X is the random variable defined by the rule: "the number of heads obtained". The possible
values (or real numbers) that X may take are 0, 1, 2. These possible values are usually denoted by x and the
notation X = x denotes x as a possible real value of the random variable X.
Random variables may be discrete or continuous. A discrete random variable assumes each of its possible
values with a certain probability. For example, in the experiment above (the tossing of two coins), the
sample space S = { hh, ht, th, tt } has N = 4 elements and the probability that the random variable X
(the number of heads) assumes the possible values 0, 1 and 2 is given by
x          0     1     2
P(X = x)   1/4   2/4   1/4

Note that the values of x exhaust all possible cases and hence the probabilities add to 1.
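The table can be reproduced by enumerating the sample space directly; a small Python sketch (using exact arithmetic via fractions):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# enumerate S = {hh, ht, th, tt} and tabulate P(X = x) for
# X = number of heads obtained
outcomes = list(product('ht', repeat=2))
counts = Counter(o.count('h') for o in outcomes)
probs = {x: Fraction(n, len(outcomes)) for x, n in counts.items()}

assert probs[0] == Fraction(1, 4)
assert probs[1] == Fraction(2, 4)   # reduces to 1/2
assert probs[2] == Fraction(1, 4)
assert sum(probs.values()) == 1     # the probabilities add to 1
```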
A continuous random variable has a probability of zero of assuming any one of its values exactly and
consequently its probability distribution cannot be given in tabular form. The idea that the probability
of a continuous random variable assuming a particular value is zero may seem strange, but the following
example illustrates the point. Consider a random variable whose values are the heights of all people
over 21 years of age. Between any two values, say 1.75 metres and 1.85 metres, there are an infinite
number of heights, one of which is 1.80 metres. The probability of selecting a person at random who is
exactly 1.80 metres tall, rather than one of the infinitely many heights so close to 1.80 metres that
the difference cannot humanly be measured, is extremely remote, and thus we assign a probability of
zero to the event. It follows that probabilities of continuous random variables are defined by
specifying an interval within which the random variable lies, and it does not matter whether an
end-point is included in the interval or not.
It is most convenient to represent all the probabilities of a random variable X by a formula or function
denoted by fX ( x ) , g X ( x ) , hX ( x ) , etc., or by FX ( x ) , GX ( x ) , H X ( x ) , etc.
In this notation the subscript X denotes that fX ( x ) or FX ( x ) is a function of the random variable X which
takes the numerical values x within the function. Such functions are known as probability distribution
functions and they are paired; i.e., fX ( x ) pairs with FX ( x ) , g X ( x ) pairs with GX ( x ) , etc. The functions
with the lowercase letters are probability density functions and those with uppercase letters are cumulative
distribution functions.
For discrete random variables, the probability density function has the properties

1. fX(xk) = P(X = xk)

2. ∑_{k=1}^{∞} fX(xk) = 1

and the cumulative distribution function has the properties

1. FX(xk) = P(X ≤ xk)

2. FX(x) = ∑_{xk ≤ x} fX(xk)
As an example consider the probability distribution functions fX ( x ) and FX ( x ) of the sum of the numbers
when a pair of dice is tossed.
Experiment: Toss two identical dice.
Sample space: S = the 36 ordered pairs

1,1  1,2  1,3  1,4  1,5  1,6
2,1  2,2  2,3  2,4  2,5  2,6
3,1  3,2  3,3  3,4  3,5  3,6
4,1  4,2  4,3  4,4  4,5  4,6
5,1  5,2  5,3  5,4  5,5  5,6
6,1  6,2  6,3  6,4  6,5  6,6
x          2     3     4     5     6     7     8     9     10    11    12
P(X = x)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Table A1. Table of probabilities

Note that the values of x exhaust all possible cases and hence the probabilities add to 1.
These probabilities are given by the function

fX(x) = (6 − |x − 7|)/36,   x = 2, 3, 4, …, 12
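The closed form can be checked against a direct enumeration of the 36 outcomes; a Python sketch:

```python
from fractions import Fraction
from itertools import product

def f_X(x):
    """fX(x) = (6 - |x - 7|)/36 for the sum of two dice."""
    return Fraction(6 - abs(x - 7), 36)

# count how many of the 36 equally likely pairs give each sum
counts = {}
for i, j in product(range(1, 7), repeat=2):
    counts[i + j] = counts.get(i + j, 0) + 1

for x in range(2, 13):
    assert f_X(x) == Fraction(counts[x], 36)   # matches Table A1
assert sum(f_X(x) for x in range(2, 13)) == 1  # probabilities add to 1
```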
Probability distributions are often shown in graphical form. For discrete random variables, probability
distributions are generally shown as histograms consisting of a series of rectangles associated with
the values of the random variable. The width of each rectangle is one unit, the height is the
probability given by the function fX(x), and the sum of the areas of all the rectangles is 1.
Figure A1 shows the probability histogram for the random variable X, the sum of the numbers when a
pair of dice is tossed.

Figure A1. Probability histogram (probability fX(x) versus x)
Figure A2 shows the cumulative distribution function FX(x) = ∑_{xk ≤ x} fX(xk) for the random
variable X, the sum of the numbers when a pair of dice is tossed.

Figure A2. Cumulative distribution function (FX(x) versus x). The dots at the left ends of the line
segments indicate the value of FX(x) at those values of x.
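The step function in Figure A2 is simply the running sum of the Table A1 probabilities; a Python sketch of the accumulation:

```python
from fractions import Fraction
from itertools import accumulate

# fX for the sum of two dice, x = 2..12 (Table A1)
pdf = [Fraction(6 - abs(x - 7), 36) for x in range(2, 13)]

# FX(x) = sum of fX(xk) over all xk <= x
cdf = dict(zip(range(2, 13), accumulate(pdf)))

assert cdf[7] == Fraction(21, 36)   # P(X <= 7)
assert cdf[12] == 1                 # the final step reaches 1
```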
For continuous random variables, the probability distribution functions fX(x) and FX(x) are curves,
which may take various forms depending on the nature of the random variable. Probability density
functions fX(x) that are used in practice to model the behaviour of continuous random variables are
always positive and the total area under their curves, bounded by the x-axis, is equal to one. These
density functions have the following properties

1. fX(x) ≥ 0 for all x

2. ∫_{−∞}^{+∞} fX(x) dx = 1
The probability that a random variable X lies between any two values x = a and x = b is the area under
the density curve between those two values and is found by methods of integral calculus

P(a < X < b) = ∫_a^b fX(x) dx   (78)
The equations of the density functions fX ( x ) are usually complicated and areas under their curves are found
from tables. In many scientific studies, the Normal probability density function is the usual model for the
behaviour of measurements (regarded as random variables) and the probability density function is (Kreyszig,
1970, p. 107)
fX(x) = (1/(σ√(2π))) exp(−(1/2)((x − µ)/σ)²)   (79)
where µ and σ are the mean and standard deviation respectively of the infinite population of x, and
Figure A3 shows a plot of the Normal probability density curve for µ = 2.0 and σ = 2.5.
Figure A3. Normal probability density function for µ = 2.0 and σ = 2.5
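Equation (79) is easy to evaluate directly; the sketch below (Python, with an ad hoc trapezoidal sum) checks that the area under the curve for µ = 2.0, σ = 2.5 is one:

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density function of equation (79)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# trapezoidal sum over +/- 10 standard deviations; the tails beyond
# that range contribute a negligible amount
mu, sigma, h = 2.0, 2.5, 0.01
n = int(20 * sigma / h)
xs = [mu - 10 * sigma + k * h for k in range(n + 1)]
area = sum(h * 0.5 * (normal_pdf(p, mu, sigma) + normal_pdf(q, mu, sigma))
           for p, q in zip(xs, xs[1:]))

assert abs(area - 1.0) < 1e-4   # total area under the density is 1
```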
For continuous random variables X, the cumulative distribution function FX ( x ) has the following properties
1. FX(x) = P(X ≤ x) = ∫_{−∞}^{x} fX(t) dt

2. d/dx FX(x) = fX(x)
In many scientific studies, the Normal distribution is the usual model for the behaviour of measurements and
the cumulative distribution function is (Kreyszig, 1970, p. 108)
FX(x) = (1/(σ√(2π))) ∫_{−∞}^{x} exp(−(1/2)((t − µ)/σ)²) dt   (80)
The probability that X assumes any value in an interval a < X < b is
P(a < X < b) = FX(b) − FX(a) = (1/(σ√(2π))) ∫_a^b exp(−(1/2)((t − µ)/σ)²) dt   (81)
Figure A4 shows a plot of the Normal cumulative distribution curve for µ = 2.0 and σ = 2.5 .
Figure A4. Normal cumulative distribution function for µ = 2.0 and σ = 2.5
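In practice FX(x) of equation (80) is evaluated from tables or from the error function; a Python sketch using math.erf:

```python
import math

def normal_cdf(x, mu, sigma):
    """Normal cumulative distribution function, equation (80), via erf."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# P(a < X < b) = FX(b) - FX(a), equation (81); here the probability of
# falling within one standard deviation of the mean
mu, sigma = 2.0, 2.5
p = normal_cdf(mu + sigma, mu, sigma) - normal_cdf(mu - sigma, mu, sigma)

assert abs(p - 0.6827) < 1e-3   # the familiar one-sigma probability
```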
Expectations
The expectation E { X } of a random variable X is defined as the average value µX of the variable over all
possible values. It is computed by taking the sum of all possible values of X = x multiplied by its
corresponding probability. In the case of a discrete random variable the expectation is given by
E{X} = µX = ∑_{k=1}^{N} xk P(xk)   (82)
Equation (82) is a general expression from which we can obtain the usual expression for the arithmetic mean
µ = (1/N) ∑_{k=1}^{N} xk   (83)
If there are N possible values xk of the random variable X, each having equal probability
P(xk) = 1/N (which is a constant), then the expectation computed from (82) is identical to the
arithmetic mean of the N values of xk from (83).
This relationship may be extended to a more general form if we consider the expectation of a function g ( X )
of a random variable X whose probability density function is fX ( x ) . In this case
E{g(X)} = ∫_{−∞}^{+∞} g(x) fX(x) dx   (85)
Expressing (86) in matrix notation gives a general form of the expected value of a multivariate function
g ( X ) as
E{g(X)} = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} g(x) fX(x) dx   (87)
There are some rules that are useful in calculating expectations. They are given here without proof but can
be found in many statistical texts. With a and b as constants and X and Y as random variables
E{a} = a
E{aX} = a E{X}
E{aX + b} = a E{X} + b
E{g(X) ± h(X)} = E{g(X)} ± E{h(X)}
E{g(X,Y) ± h(X,Y)} = E{g(X,Y)} ± E{h(X,Y)}
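These rules can be spot-checked numerically on a small discrete distribution; a Python sketch using the two-coin variable X (number of heads) from earlier, with a and b as arbitrary illustrative constants:

```python
from fractions import Fraction

# P(X = x) for X = number of heads in two coin tosses
dist = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def E(g):
    """E{g(X)} = sum of g(xk) P(xk) over the possible values xk."""
    return sum(g(x) * p for x, p in dist.items())

a, b = 3, 5
mu = E(lambda x: x)
assert mu == 1                                 # E{X} = 1
assert E(lambda x: a * x) == a * mu            # E{aX} = aE{X}
assert E(lambda x: a * x + b) == a * mu + b    # E{aX + b} = aE{X} + b
assert E(lambda x: x**2 + x) == E(lambda x: x**2) + E(lambda x: x)
```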
The mean of a vector of random variables X = [X1  X2  ⋯  Xn]^T is

mX = [µX1  µX2  ⋯  µXn]^T = [E(X1)  E(X2)  ⋯  E(Xn)]^T = E{X}   (89)
The covariance σXY of two random variables X and Y is defined as σXY = E{(X − µX)(Y − µY)}, and
expanding gives

σXY = E{(X − µX)(Y − µY)}
    = E{XY − XµY − YµX + µXµY}
    = E{XY} − E{XµY} − E{YµX} + E{µXµY}
    = E{XY} − µY E{X} − µX E{Y} + µXµY
    = E{XY} − µYµX − µXµY + µXµY
    = E{XY} − µXµY
If the random variables X and Y are independent, the expectation of the product is equal to the product of
the expectations, i.e., E { XY } = E { X } E {Y } . Since the expected values of X and Y are the means µX
and µY then E { XY } = µX µY if X and Y are independent. Substituting this result into the expansion
above shows that the covariance σXY is zero if X and Y are independent.
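Both the identity σXY = E{XY} − µXµY and the independence remark can be checked on concrete variables; a Python sketch with X = first die and Y = sum of the two dice (a dependent pair), plus the two independent dice for comparison:

```python
from fractions import Fraction
from itertools import product

# joint distribution of X = first die and Y = sum of the two dice;
# X and Y are dependent, so the covariance should be non-zero
joint = {}
for i, j in product(range(1, 7), repeat=2):
    key = (i, i + j)
    joint[key] = joint.get(key, 0) + Fraction(1, 36)

E_X  = sum(x * p for (x, y), p in joint.items())
E_Y  = sum(y * p for (x, y), p in joint.items())
E_XY = sum(x * y * p for (x, y), p in joint.items())

cov = E_XY - E_X * E_Y          # sigma_XY = E{XY} - mu_X mu_Y
assert cov == Fraction(35, 12)  # equals the variance of the first die

# for the two dice themselves (independent), E{XY} = E{X}E{Y}, so the
# covariance is zero
E_D  = Fraction(7, 2)
E_DD = sum(i * j * Fraction(1, 36) for i, j in product(range(1, 7), repeat=2))
assert E_DD - E_D * E_D == 0
```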
For a multivariate function, the variances and covariances of the random variables X are given by the
matrix equation

VXX = E{(X − mX)(X − mX)^T}   (92)
VXX is a symmetric matrix known as the variance-covariance matrix and its general form can be seen when
(92) is expanded
VXX = E{ [X1 − µX1, X2 − µX2, …, Xn − µXn]^T [X1 − µX1, X2 − µX2, …, Xn − µXn] }
giving

VXX = | σX1²    σX1X2   ⋯   σX1Xn |
      | σX2X1   σX2²    ⋯   σX2Xn |   (93)
      |   ⋮       ⋮            ⋮   |
      | σXnX1   σXnX2   ⋯   σXn²  |
If the random vector y is a linear function of the random vector x, say y = Ax + b where A and b
are a constant matrix and vector respectively, then the mean of y is

my = E{y}
   = E{Ax + b}
   = E{Ax} + E{b}
   = A E{x} + b
   = A mx + b
and the variance-covariance matrix of y is

Vyy = E{(y − my)(y − my)^T}
    = E{(Ax + b − Amx − b)(Ax + b − Amx − b)^T}
    = E{(Ax − Amx)(Ax − Amx)^T}
    = E{(A(x − mx))(A(x − mx))^T}
    = E{A(x − mx)(x − mx)^T A^T}
    = A E{(x − mx)(x − mx)^T} A^T
    = A Vxx A^T
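The linear law my = A mx + b, Vyy = A Vxx A^T can be verified by simulation; a Python sketch (the matrices A, b, L below are illustrative choices, not values from the text):

```python
import random

# x = L u with u a pair of independent standard normals, so that
# Vxx = L L^T = [[1.0, 0.5], [0.5, 1.25]]; then y = A x + b
L = [[1.0, 0.0], [0.5, 1.0]]
A = [[2.0, -1.0], [1.0, 3.0]]
b = [4.0, -2.0]

rng = random.Random(42)
N = 200_000
ys = []
for _ in range(N):
    u0, u1 = rng.gauss(0, 1), rng.gauss(0, 1)
    x0, x1 = L[0][0] * u0 + L[0][1] * u1, L[1][0] * u0 + L[1][1] * u1
    ys.append((A[0][0] * x0 + A[0][1] * x1 + b[0],
               A[1][0] * x0 + A[1][1] * x1 + b[1]))

# sample mean and sample variance-covariance matrix of y
m = [sum(y[i] for y in ys) / N for i in range(2)]
V = [[sum((y[i] - m[i]) * (y[j] - m[j]) for y in ys) / N
      for j in range(2)] for i in range(2)]

# exact values: my = A mx + b = b (since mx = 0) and
# Vyy = A Vxx A^T = [[3.25, 0.75], [0.75, 15.25]]
V_exact = [[3.25, 0.75], [0.75, 15.25]]
for i in range(2):
    assert abs(m[i] - b[i]) < 0.05
    for j in range(2):
        assert abs(V[i][j] - V_exact[i][j]) < 0.3
```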
or: if y = Ax + b, where the random vectors y and x are linearly related, then Vyy = A Vxx A^T.

In many cases, however, y is a non-linear function of the random variables x

y = f(x)   (96)
In such cases, we can expand the function on the right-hand side of (96) using Taylor's theorem.
For a non-linear function of a single variable Taylor's theorem may be expressed in the following form
f(x) = f(a) + (df/dx)|a (x − a) + (d²f/dx²)|a (x − a)²/2! + (d³f/dx³)|a (x − a)³/3!
       + ⋯ + (d^(n−1)f/dx^(n−1))|a (x − a)^(n−1)/(n − 1)! + Rn   (97)
where Rn is the remainder after n terms and lim_{n→∞} Rn = 0 for the expansion of f(x) about x = a,
and (df/dx)|a, (d²f/dx²)|a, etc. are the derivatives of f evaluated at x = a.
For a non-linear function of two random variables, say φ = f ( x , y ) , the Taylor series expansion of the
function φ about x = a and y = b is
φ = f(a,b) + (∂f/∂x)|a,b (x − a) + (∂f/∂y)|a,b (y − b)
    + (1/2!)[(∂²f/∂x²)|a,b (x − a)² + (∂²f/∂y²)|a,b (y − b)² + 2 (∂²f/∂x∂y)|a,b (x − a)(y − b)]
    + ⋯   (98)
where f(a,b) is the function φ evaluated at x = a and y = b, and (∂f/∂x)|a,b, (∂f/∂y)|a,b,
(∂²f/∂x²)|a,b, etc. are partial derivatives of the function φ evaluated at x = a and y = b.
Extending to n random variables, we may write a Taylor series approximation of the function f ( x ) as a
matrix equation
f(x) = f(x0) + (∂f/∂x)|x0 (x − x0) + higher order terms   (99)
where f(x0) is the function evaluated at the approximate values x0 and (∂f/∂x)|x0 are the partial
derivatives evaluated at the approximations x0.
Replacing f ( x ) in (96) by its Taylor series approximation, ignoring higher order terms, gives
y = f(x) = f(x0) + (∂f/∂x)|x0 (x − x0)   (100)
The mean of y is then

my = E{y}
   = E{f(x0) + (∂f/∂x)|x0 (x − x0)}
   = E{f(x0)} + E{(∂f/∂x)|x0 (x − x0)}
   = f(x0) + (∂f/∂x)|x0 E{x − x0}
   = f(x0) + (∂f/∂x)|x0 (E{x} − E{x0})
   = f(x0) + (∂f/∂x)|x0 (mx − x0)
and

y − my = f(x0) + (∂f/∂x)|x0 (x − x0) − [f(x0) + (∂f/∂x)|x0 (mx − x0)]
       = (∂f/∂x)|x0 (x − mx)   (101)
       = Jyx (x − mx)
Jyx is the (m,n) Jacobian matrix of partial derivatives, noting that y and x are (m,1) and (n,1)
vectors respectively. The variance-covariance matrix of y is then
Vyy = E{(y − my)(y − my)^T}
    = E{(Jyx(x − mx))(Jyx(x − mx))^T}
    = E{Jyx(x − mx)(x − mx)^T Jyx^T}
    = Jyx E{(x − mx)(x − mx)^T} Jyx^T
    = Jyx Vxx Jyx^T
Thus, in a similar manner to above, we may express the Law of Propagation of Variances for non-linear
functions of random variables as

Vyy = Jyx Vxx Jyx^T   (103)

For a non-linear function z = f(x, y) of two random variables x and y, this law gives

σz² = (∂z/∂x)² σx² + (∂z/∂y)² σy² + 2 (∂z/∂x)(∂z/∂y) σxy   (104)
Equation (104) can be derived from the general matrix equation (103) in the following manner. Let
z = f(x, y) be written as y = f(x) where y = z, a (1,1) matrix, and x = [x  y]^T is a (2,1) vector.
The variance-covariance matrix of the random vector x is

Vxx = | σx²   σxy |
      | σxy   σy² |

the Jacobian is Jyx = [∂z/∂x   ∂z/∂y] and the variance-covariance matrix Vyy, which contains the
single element σz², is given by

Vyy = σz² = [∂z/∂x   ∂z/∂y] | σx²   σxy | | ∂z/∂x |
                            | σxy   σy² | | ∂z/∂y |

Expanding this matrix product gives (104).
If the random variables x and y are independent then σxy = 0 and (104) reduces to

σz² = (∂z/∂x)² σx² + (∂z/∂y)² σy²   (105)
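Equations (104) and (105) can be checked by simulation for a simple non-linear function; the sketch below uses z = xy with illustrative means and standard deviations (our own choices, not values from the text) and compares the propagated variance with a Monte Carlo estimate:

```python
import random

# z = x * y with independent x ~ N(10, 0.2^2) and y ~ N(5, 0.1^2);
# the partial derivatives dz/dx = y and dz/dy = x are evaluated at
# the means
mu_x, mu_y = 10.0, 5.0
s_x, s_y, s_xy = 0.2, 0.1, 0.0

# equation (104); with s_xy = 0 this is also equation (105)
var_z = (mu_y * s_x) ** 2 + (mu_x * s_y) ** 2 + 2 * mu_y * mu_x * s_xy
assert abs(var_z - 2.0) < 1e-12

# Monte Carlo estimate of the variance of z for comparison
rng = random.Random(1)
N = 200_000
zs = [(mu_x + rng.gauss(0, s_x)) * (mu_y + rng.gauss(0, s_y)) for _ in range(N)]
mz = sum(zs) / N
var_mc = sum((z - mz) ** 2 for z in zs) / N
assert abs(var_mc - var_z) < 0.05   # first-order propagation is accurate here
```

The agreement is good because the relative errors are small; for large errors the higher-order Taylor terms neglected in (100) would matter.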
http://www.cs.cmu.edu/~ggordon/IRLS-example/logistic.m

function [x] = logistic(a, y, w, ridge, param)
% LOGISTIC - logistic regression by iteratively reweighted least squares.
% a: (n,m) design matrix; y: (n,1) vector of 0/1 outcomes; w: optional
% (n,1) observation weights; ridge: optional ridge (regularisation)
% weight; param: optional struct of options.

% process parameters
[n, m] = size(a);
if ((nargin < 3) || isempty(w))
w = ones(n, 1);   % default: equal observation weights
end
if ((nargin < 4) || isempty(ridge))
ridge = 1e-5;     % default ridge weight (an assumed small value)
end
if (nargin < 5)
param = [];
end
if (length(ridge) == 1)
ridgemat = speye(m) * ridge;
elseif (length(ridge(:)) == m)
ridgemat = spdiags(ridge(:), 0, m, m);
else
error('ridge weight vector should be length 1 or %d', m);
end
if (~isfield(param, 'maxiter'))
param.maxiter = 200;
end
if (~isfield(param, 'verbose'))
param.verbose = 0;
end
if (~isfield(param, 'epsilon'))
param.epsilon = 1e-10;
end
if (~isfield(param, 'maxprint'))
param.maxprint = 5;
end
% do the regression
x = zeros(m,1);
oldexpy = -ones(size(y));
for iter = 1:param.maxiter
adjy = a * x;
expy = 1 ./ (1 + exp(-adjy));
deriv = expy .* (1-expy);
wadjy = w .* (deriv .* adjy + (y-expy));
weights = spdiags(deriv .* w, 0, n, n);
x = (a' * weights * a + ridgemat) \ (a' * wadjy);   % weighted least-squares solve for the next iterate
if (param.verbose)
len = min(param.maxprint, length(x));
fprintf('%3d: [',iter);
fprintf(' %g', x(1:len));
if (len < length(x))
fprintf(' ... ');
end
fprintf(' ]\n');
end
if (sum(abs(expy - oldexpy)) < n * param.epsilon)   % stop when the fitted probabilities settle
if (param.verbose)
fprintf('Converged.\n');
end
return;
end
oldexpy = expy;
end
Usage Example
http://www.cs.cmu.edu/~ggordon/IRLS-example/logistic-ex.txt
>> a = randn(500,5);
>> x = 2*randn(5,1);
>> y = (rand(500,1) < 1./(1+exp(-a*x)));
>> xhat = logistic(a, y, [], [], struct('verbose', 1))
1: [ -0.842889 -0.959492 0.843404 0.198022 0.199493 ]
2: [ -1.55055 -1.75901 1.57622 0.360507 0.398254 ]
3: [ -2.35678 -2.71685 2.4373 0.545032 0.62605 ]
4: [ -3.20879 -3.74533 3.35828 0.740927 0.854976 ]
5: [ -3.86696 -4.54162 4.0753 0.899695 1.02575 ]
6: [ -4.12265 -4.85136 4.35625 0.964904 1.09126 ]
7: [ -4.14969 -4.88416 4.38617 0.972078 1.09818 ]
8: [ -4.14995 -4.88447 4.38646 0.972149 1.09825 ]
9: [ -4.14995 -4.88447 4.38646 0.972149 1.09825 ]
10: [ -4.14995 -4.88447 4.38646 0.972149 1.09825 ]
11: [ -4.14995 -4.88447 4.38646 0.972149 1.09825 ]
Converged.
xhat =
-4.1499
-4.8845
4.3865
0.9721
1.0982
>> x
x =
-3.9412
-4.0619
3.6705
1.1123
0.9645