
The Logistic Function


R.E. Deakin
Bonbeach VIC, 3196, Australia
email: [email protected]
October 2018

Introduction
The logistic function or logistic curve is a common S-shaped curve (sigmoid curve) with equation

    P(t) = K / (1 + e^(−(rt + C)))    (1)

where P(t) is the population P at time t, K is the carrying capacity and the curve’s maximum value, e = 2.718281828… is a mathematical constant and the base of the natural logarithms, r is the growth parameter (steepness of the curve), C = ln(P0/(K − P0)) is a constant and P0 is the population at t = 0.

Figure 1. Logistic curve: P(t) = K / (1 + e^(−(rt + C))), K = 100, P0 = 20, r = 0.138629436.

For real values of t in the range −∞ < t < +∞, the curve has two asymptotes, P(t) = K as t → +∞ and P(t) = 0 as t → −∞, and is symmetric about the curve’s midpoint at t = t0 = −C/r where P(t0) = K/2.
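As a quick numerical check of (1), the following sketch (using the Figure 1 values K = 100, P0 = 20 and r = 0.138629436) evaluates the curve at t = 0 and at the midpoint t0 = −C/r:

```python
import math

def logistic(t, K=100.0, P0=20.0, r=0.138629436):
    """Logistic curve P(t) = K / (1 + e^(-(r t + C))) with C = ln(P0 / (K - P0))."""
    C = math.log(P0 / (K - P0))
    return K / (1.0 + math.exp(-(r * t + C)))

# At t = 0 the curve returns the initial population P0 = 20.
print(round(logistic(0.0), 6))        # 20.0

# Here C = ln(20/80) = -ln 4 and r = ln(4)/10, so the midpoint t0 = -C/r = 10,
# where the curve takes half the carrying capacity, K/2 = 50.
t0 = -math.log(20.0 / 80.0) / 0.138629436
print(round(t0, 4), round(logistic(t0), 4))    # 10.0 50.0
```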

There are various alternative expressions for the logistic curve and the derivation of (1) – outlined below – is
a modern interpretation of the derivation by Pierre Verhulst in the 19th century who named the curve la
courbe logistique [the logistic curve].
The properties of the logistic curve are derived and a general equation developed with a special case, the
sigmoid curve. This is followed by the derivation of the logistic distribution.
Logistic regression is discussed and the method of least squares is employed to give a solution for the
parameters of a logistic curve that is a best fit of the outcomes of binary dependent variables. Two examples
of this technique are shown.
Finally, the connection between a sport rating system (Elo 1978) and the logistic curve is shown.
References for further reading and two appendices are included.


A Brief History of the Logistic Function


The development of the logistic function is due to Pierre François Verhulst (1804–1849) and his work on
population growth in the 19th century. In an 1838 paper titled Notice sur la loi que la population suit dans son
accroissement [A Note on the Law of Population Growth], Verhulst proposed a differential equation that
modelled bounded population growth and solved this equation to obtain the population function. He then
compared actual populations with his modelled values for France (1817-1831), Belgium (1815-1833), Essex,
England (1811-1831) and Russia (1796-1827). In a subsequent 1844 paper titled Recherches mathématiques
sur la loi d’accroissement de la population [Mathematical research on the Law of Population Growth] he
named the solution to the equation he had proposed in his 1838 paper la courbe logistique [the logistic curve].
Verhulst’s original works on population modelling were criticised by others (and even by Verhulst himself in an 1847 publication) and his work was largely ignored, but in the early 20th century the logistic function was ‘rediscovered’ and Verhulst acknowledged as its inventor. Since then the logistic function (and logistic growth) has become a simple and effective model of population growth that is applied in many diverse fields of scientific study (Cramer 2002, O’Connor & Robertson 2014, Bacaër 2011).

Derivation of the Logistic Function


[The notation of Bacaër (2011) is used in this derivation.]

Verhulst proposed the following (somewhat arbitrary) differential equation for the population P ( t ) at time t

    dP/dt = rP(1 − P/K)    (2)

where P ( t ) is the population P at time t, K is the carrying capacity and r is the growth parameter.

The differential equation (2) for the population P is solved by integration where

    ∫ dP / [P(1 − P/K)] = ∫ r dt    (3)
In order to evaluate the left-hand-side we write

    1 / [P(1 − P/K)] = K / (KP − P²) = K / [P(K − P)]
and decomposing into partial fractions gives

    K / [P(K − P)] = A/P + B/(K − P) = [A(K − P) + BP] / [P(K − P)]

from which the relation A(K − P) + BP = K is obtained, which must hold for all values of P.
Choosing particular values for P yields A and B as follows: (i) when P = 0 , AK = K and A = 1 ; (ii) when
P = K , BK = K and B = 1 . Using this result gives

    1 / [P(1 − P/K)] = K / [P(K − P)] = 1/P + 1/(K − P)
and (3) becomes

    ∫ (1/P) dP + ∫ 1/(K − P) dP = ∫ r dt    (4)


Using the substitution u = K − P with du = −dP in the second integral gives

    ∫ (1/P) dP − ∫ (1/u) du = ∫ r dt

and using the integral results ∫ (1/x) dx = ln x + C and ∫ a dx = ax + C, where the C’s are constants of integration, gives

ln P − ln ( K − P ) = rt + C (5)

where ln denotes the natural logarithm and ln x ≡ loge x and e = 2.718281828 … , and the constants of
integration have been combined and added to the right-hand-side.
Re-writing (5) as

ln ( K − P ) − ln P = − ( rt + C )

and using the law of logarithms log_a(M/N) = log_a M − log_a N gives

    ln((K − P)/P) = −(rt + C)

Now, exponentiating both sides and noting that e^(ln x) = x gives

    (K − P)/P = e^(−(rt + C))    (6)
that can be re-arranged as

    P = P(t) = K / (1 + e^(−(rt + C)))    (7)

An equation for the constant C can be found when t = 0, in which case P0 = P(0) = K/(1 + e^(−C)) and a re-arrangement gives

    e^(−C) = (K − P0)/P0  and  e^C = P0/(K − P0)    (8)

Taking natural logarithms of both sides of the second member of (8) and noting ln e^x = x gives

    C = ln(P0/(K − P0))    (9)

Properties of the Logistic Curve


Asymptotes
The curve described by (7) is a symmetric S shape (see Figure 1) with two asymptotes, the upper one being
the line P ( t ) = K as t → +∞ and the lower one being the line P ( t ) = 0 as t → −∞ .

Midpoint

The midpoint of the curve is where P(t) = K/2 and this occurs when the exponent (rt + C) in (7) is equal to zero and


    t = t0 = −C/r    (10)

Thus the midpoint of the logistic curve is at (t0, K/2).

Symmetry
The logistic curve is symmetric about the midpoint.

This can be confirmed by writing the denominator of (7) as

    1 + e^(−(rt + C)) = 1 + e^(−r(t + C/r)) = 1 + e^(−r(t − t0))

giving

    P = P(t) = K / (1 + e^(−r(t − t0)))    (11)

Now using (11) with d = t − t0 a distance along the t-axis, the sum of the logistic function P(d) and its reflection about the vertical axis through t0, P(−d), is

    K/(1 + e^(−rd)) + K/(1 + e^(rd)) = [K(1 + e^(rd)) + K(1 + e^(−rd))] / [(1 + e^(−rd))(1 + e^(rd))]
                                     = [2K + K(e^(rd) + e^(−rd))] / [2 + e^(rd) + e^(−rd)]
                                     = K

Thus the logistic function is symmetric about the midpoint (t0, K/2).

Also, due to symmetry we may write K/(1 + e^(−r(t − t0))) = K − K/(1 + e^(r(t − t0))) and (11) becomes

    P = P(t) = K − K / (1 + e^(r(t − t0)))    (12)

Inflexion point

The midpoint is also the inflexion point of the logistic curve, where the second derivative d²P/dt² = 0. This can be proved as follows.

First, to simplify the analysis, write the denominator of (7) as 1 + e^(−(rt + C)) = 1 + e^(−rt)e^(−C) and, using the first member of (8), let

    A = e^(−C) = (K − P0)/P0    (13)

and (7) becomes

    P = P(t) = K / (1 + Ae^(−rt))    (14)
Differentiating (14) with respect to t gives

    dP/dt = −K(1 + Ae^(−rt))^(−2) (−Are^(−rt))


and differentiating again gives

    d²P/dt² = 2K(1 + Ae^(−rt))^(−3) (−Are^(−rt))² − K(1 + Ae^(−rt))^(−2) (Ar²e^(−rt))

Second, solving d²P/dt² = 0 gives

    2K(1 + Ae^(−rt))^(−3) (−Are^(−rt))² − K(1 + Ae^(−rt))^(−2) (Ar²e^(−rt)) = 0
    2(1 + Ae^(−rt))^(−1) (Are^(−rt))² − rAre^(−rt) = 0
    2(1 + Ae^(−rt))^(−1) (Are^(−rt)) − r = 0
    2(1 + Ae^(−rt))^(−1) (Ae^(−rt)) = 1
    2Ae^(−rt) = 1 + Ae^(−rt)
    Ae^(−rt) = 1
    e^(−rt) = 1/A = A^(−1)

Taking logarithms of both sides and noting that log_a M^p = p log_a M gives

    t = (ln A)/r    (15)
Substituting (15) into (14) gives

    P((ln A)/r) = K / (1 + Ae^(−r(ln A)/r)) = K / (1 + Ae^(−ln A)) = K / (1 + Ae^(ln A^(−1))) = K / (1 + A·A^(−1)) = K/2
Finally, using (9), (10) and (13) we obtain the relations

    ln A = −C  and  t0 = (ln A)/r    (16)

So P((ln A)/r) = P(t0) = K/2, thus proving that the inflexion point is the midpoint (t0, K/2).

Verhulst’s equation for the logistic function


In Verhulst’s 1838 paper he states the differential equation

    dp/dt = mp − ϕ(p)    (17)

that links the rate of change of population p with respect to time t with mp and a function of p, ϕ ( p ) where

m is a constant. He then supposes that ϕ ( p ) = np 2 where n is another constant and then finds for the
integral of (17) that

    t = (1/m)[ln p − ln(m − np)] + constant
On solving (17) he gives the equation for the population p as

    p = mp′e^(mt) / (np′e^(mt) + m − np′)    (18)


where p′ is the population at t = 0. He then states that as t → ∞ the value of p corresponds with P = m/n, which he calls la limite supérieure de la population [the upper limit of the population].
The correspondence between variables in Verhulst’s 1838 paper and this paper is

    Verhulst     This paper
    p            P               population
    t            t               time
    m            r               growth parameter
    n            (no equivalent)
    p′           P0              population at t = 0
    P = m/n      K               upper limit of population
dp  n 
Verhulst’s differential equation (17) can be written as = mp − np 2 = mp  1 − p  that is equivalent to
dt  m 
dP  P
= rP  1 −  that is our equation (2) and Verhulst’s logistic function (18) can be written as
dt 
 K 
p ′e mt
p = that is equivalent to (Bacaër 2011, eq. 6.2)
1 + p ′ (e mt − 1 )
n
m

P0e rt KP0e rt
P = = (19)
P0 K + P0 (e rt − 1 )
1+
K
(e rt
− 1)

[This equation can be obtained from (14) with a bit of algebra]

Verhulst’s method for estimating the parameters r and K


In Verhulst’s 1844 paper he set out a method of estimating the parameters r and K from the population
P ( t ) in three different but equally spaced years; P0 at t = 0 , P1 at t = t1 and P2 at t = t2 = 2t1 .
Verhulst’s method is as follows.

Using (6) the relationship e^(rt + C) = P/(K − P) can be obtained, and taking logarithms of both sides gives rt + C = ln(P/(K − P)). Using this equation at times t = 0, t = t1 and t = t2 = 2t1 gives

    C = ln(P0/(K − P0))
    rt1 + C = ln(P1/(K − P1))
    2rt1 + C = ln(P2/(K − P2))

and from these, the following can be obtained


    e^C = P0/(K − P0)    (i)
    e^(rt1) e^C = P1/(K − P1)    (ii)
    e^(2rt1) e^C = P2/(K − P2)    (iii)

Dividing (ii) by (i) and (iii) by (ii) gives two equations where e^C has been eliminated and the left-hand-sides are identical

    e^(rt1) = P1(K − P0) / (P0(K − P1))    (iv)
    e^(rt1) = P2(K − P1) / (P1(K − P2))    (v)

Equating (iv) and (v) gives P1(K − P0)/(P0(K − P1)) = P2(K − P1)/(P1(K − P2)), and cross-multiplying and gathering terms gives

    (P1² − P0P2)K² − (P0P1² + P2P1² − 2P0P1P2)K = 0

Dividing both sides by K and rearranging gives (Bacaër 2011, p.38)

    K = P1(P0P1 + P2P1 − 2P0P2) / (P1² − P0P2)    (20)

noting the conditions P1² > P0P2 and P0P1 + P1P2 > 2P0P2 for finite and positive K.

Once K is known, r can be obtained from (iv) as

    r = (1/t1) ln[P1(K − P0) / (P0(K − P1))] = (1/t1) ln[(1/P0 − 1/K) / (1/P1 − 1/K)]    (21)
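Equations (20) and (21) are easy to check numerically. The sketch below (a hypothetical helper, not from Verhulst’s papers) samples a known logistic curve at three equally spaced times and recovers K and r from the three populations alone:

```python
import math

def estimate_K_r(P0, P1, P2, t1):
    """Verhulst's three-point estimates (20) and (21): populations P0, P1, P2
    observed at the equally spaced times t = 0, t1 and 2*t1."""
    K = P1 * (P0 * P1 + P2 * P1 - 2.0 * P0 * P2) / (P1 * P1 - P0 * P2)
    r = (1.0 / t1) * math.log(P1 * (K - P0) / (P0 * (K - P1)))
    return K, r

# Sample a curve with known K = 100, P0 = 20, r = 0.2 at t = 0, 5 and 10 ...
K_true, P0, r_true, t1 = 100.0, 20.0, 0.2, 5.0
A = (K_true - P0) / P0
P1 = K_true / (1.0 + A * math.exp(-r_true * t1))
P2 = K_true / (1.0 + A * math.exp(-r_true * 2.0 * t1))

# ... and recover the parameters from the three samples.
K_est, r_est = estimate_K_r(P0, P1, P2, t1)
print(round(K_est, 6), round(r_est, 6))    # 100.0 0.2
```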

Logistic Function – Various forms


Various forms of the logistic function have been developed so far in this paper – see (7), (11), (12), (14) and
(19)

    P(t) = K/(1 + e^(−(rt + C))) = K/(1 + e^(−r(t − t0))) = K − K/(1 + e^(r(t − t0))) = K/(1 + Ae^(−rt)) = P0e^(rt) / (1 + (P0/K)(e^(rt) − 1))    (22)

where P0 = P(0), C = ln(P0/(K − P0)), A = (K − P0)/P0 and t0 = −C/r = (ln A)/r.

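The equivalence of the five forms in (22) can be confirmed numerically; a minimal sketch using the Figure 1 values:

```python
import math

K, P0, r = 100.0, 20.0, 0.138629436
C = math.log(P0 / (K - P0))
A = (K - P0) / P0
t0 = -C / r

forms = [
    lambda t: K / (1.0 + math.exp(-(r * t + C))),                                 # (7)
    lambda t: K / (1.0 + math.exp(-r * (t - t0))),                                # (11)
    lambda t: K - K / (1.0 + math.exp(r * (t - t0))),                             # (12)
    lambda t: K / (1.0 + A * math.exp(-r * t)),                                   # (14)
    lambda t: P0 * math.exp(r * t) / (1.0 + (P0 / K) * (math.exp(r * t) - 1.0)),  # (19)
]

# All five expressions agree at every t (to floating-point precision).
for t in (-5.0, 0.0, 7.5, 10.0, 25.0):
    values = [f(t) for f in forms]
    assert max(values) - min(values) < 1e-9, (t, values)
```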

Figure 2. Logistic curve: P(t) = K / (1 + e^(−r(t − t0))), K = 100, t0 = 10, r = 0.138629436

Logistic Function – A General Form


Considering (12), a general form for the logistic function can be given as

    y = (A1 − A2) / (1 + e^(k(x − x0))) + A2    (23)

Figure 3. Logistic curve: y = (A1 − A2)/(1 + e^(k(x − x0))) + A2, A1 = 20, A2 = 120, k = 0.138629436, x0 = 60

y = A1 is the lower asymptote of the curve given by (23) when x → −∞ and y = A2 is the upper asymptote of the curve when x → +∞. The midpoint of the curve is (x0, (A1 + A2)/2): when x = x0 then e^(k(x − x0)) = e^0 = 1 and y = (A1 + A2)/2.


Sigmoid Function
The sigmoid function is a special case of the logistic function. The sigmoid curve is a symmetric S-shape with a midpoint at (0, 1/2) and asymptotes y = 0 as x → −∞ and y = 1 as x → +∞.

    y = 1/(1 + e^(−x)) = e^x/(e^x + 1)    (24)

Figure 4. Sigmoid curve: y = 1 / (1 + e^(−x))

The derivative of the sigmoid function, denoted by y′ = dy/dx, is

    y′ = d/dx (1 + e^(−x))^(−1) = −(1 + e^(−x))^(−2) (−e^(−x)) = e^(−x) / (1 + e^(−x))²    (25)

and from (24) e^(−x) = (1 − y)/y and 1 + e^(−x) = 1/y, giving

    y′ = y(1 − y)    (26)

The 2nd derivative of the sigmoid function, denoted by y″ = d²y/dx² = d/dx(dy/dx), is

    y″ = d/dx [e^(−x)(1 + e^(−x))^(−2)]
       = e^(−x)[−2(1 + e^(−x))^(−3)(−e^(−x))] + (1 + e^(−x))^(−2)(−e^(−x))
       = e^(−x)/(1 + e^(−x))² · [2e^(−x)/(1 + e^(−x)) − 1]
       = y′(1 − 2y)    (27)

And as before, the inflexion point of the sigmoid curve will be at the point where y″ = 0, and using (27) y″ = 0 when y = 1/2, which occurs at the midpoint (0, 1/2).
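The identities (26) and (27) can be verified against central-difference approximations of the derivatives; a short sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

h = 1e-4
for x in (-2.0, 0.0, 1.5):
    y = sigmoid(x)
    d1 = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)          # numerical y'
    d2 = (sigmoid(x + h) - 2.0 * y + sigmoid(x - h)) / (h * h)  # numerical y''
    assert abs(d1 - y * (1.0 - y)) < 1e-7                    # (26): y' = y(1 - y)
    assert abs(d2 - y * (1.0 - y) * (1.0 - 2.0 * y)) < 1e-5  # (27): y'' = y'(1 - 2y)
```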


The Logistic Distribution


This section assumes some knowledge of statistical concepts and a brief outline of some of these is given in
Appendix A.
Suppose X is a continuous random variable and X = x denotes x as a possible real value of X.

The probability density function fX ( x ) has the following properties

1. fX ( x ) ≥ 0 for any value of x (28)

+∞
2. ∫ fX ( x )dx = 1 (29)
−∞

The probability that a random variable X lies between any two values x = a and x = b is the area under
the density curve between those two values and is found by methods of integral calculus
    P(a < X < b) = ∫_a^b fX(x) dx    (30)

The cumulative distribution function FX ( x ) has the following properties

    1. FX(x) = P(X ≤ x) = ∫_{−∞}^{x} fX(x) dx    (31)

    2. d/dx FX(x) = fX(x)    (32)
In many scientific analyses of experimental data it is assumed the data are members of a probability
distribution with a density function having a smooth bell-shaped curve with tails that approach the
asymptote fX ( x ) = 0 as x → ±∞ and a cumulative distribution function that is a symmetric S-shape with
asymptotes FX ( x ) = 0 and FX ( x ) = 1 as x → −∞ and x → +∞ respectively.

As an example, Appendix A shows the probability density curve and the cumulative distribution curve of the familiar Normal distribution with probability density function

    fX(x) = (1/(σ√(2π))) e^(−(1/2)((x − µ)/σ)²)

where the infinite population has mean µ, variance σ² and standard deviation σ = +√(σ²).

Now suppose that our experimental data has a Logistic distribution and the cumulative distribution function is the logistic function with location parameter a and shape parameter b [see (23) with A1 = 0, A2 = 1, x0 = a and k = 1/b]

    FX(x) = 1 / (1 + e^(−(x − a)/b))    (33)

The probability density function fX(x) is given by (32) as

    fX(x) = d/dx [1 / (1 + e^(−(x − a)/b))] = d/dx [(1 + e^(−(x − a)/b))^(−1)]

Let u = −(x − a)/b and with du/dx = −1/b we write

    d/dx [(1 + e^u)^(−1)] = d/du [(1 + e^u)^(−1)] · du/dx

and using the rule d/dx e^x = e^x the probability density function of the Logistic distribution is

    fX(x) = e^(−(x − a)/b) / [b(1 + e^(−(x − a)/b))²]    (34)

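As a sanity check of (32) and (34), the density can be compared with a numerical derivative of the cumulative distribution function (33); the values a = 2 and b = 1.5 below are arbitrary illustrative choices:

```python
import math

def logistic_cdf(x, a, b):
    """Cumulative distribution function (33)."""
    return 1.0 / (1.0 + math.exp(-(x - a) / b))

def logistic_pdf(x, a, b):
    """Probability density function (34)."""
    u = math.exp(-(x - a) / b)
    return u / (b * (1.0 + u) ** 2)

# The density should match the central-difference slope of the CDF.
a, b, h = 2.0, 1.5, 1e-5
for x in (-3.0, 2.0, 6.0):
    slope = (logistic_cdf(x + h, a, b) - logistic_cdf(x - h, a, b)) / (2.0 * h)
    assert abs(slope - logistic_pdf(x, a, b)) < 1e-8
```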
Mean and Variance of the Logistic Distribution

The mean µX and variance σX² are special mathematical expectations (see Appendix A) and

    µX = E{X} = ∫_{−∞}^{+∞} x fX(x) dx    (35)

    σX² = E{(X − µX)²} = ∫_{−∞}^{+∞} (x − µX)² fX(x) dx    (36)

Using the rules for expectations a more useful expression for the variance can be developed as

    σX² = E{(X − µX)²} = E{X² − 2XµX + (µX)²}
        = E{X²} − 2µX E{X} + (µX)² = E{X²} − 2µXµX + (µX)²
        = E{X²} − (µX)² = E{X²} − (E{X})²    (37)

The following derivations are due to Max Hunter (2018), my mentor in all things mathematical especially the
lovely integral solutions that follow1.

The mean

Using (34) and (35) the mean can be written as

    µX = ∫_{−∞}^{+∞} x fX(x) dx = ∫_{−∞}^{+∞} x e^(−(x − a)/b) / [b(1 + e^(−(x − a)/b))²] dx    (38)

With the substitution t = (x − a)/b then x = tb + a and dx = b dt, and with t = ±∞ when x = ±∞, (38) becomes

1 Max Hunter is a retired mathematician from RMIT University, Melbourne Australia. In an earlier version
of this document I had resorted to the use of a divergent series in solving integrals for the mean and
variance. Max, on reading this, sent me a note suggesting I had set back mathematics and statistics about
300 years! But very kindly attached several elegant solutions that avoided the use of said series.

    µX = ∫_{−∞}^{+∞} (tb + a) e^(−t)/(1 + e^(−t))² dt = b ∫_{−∞}^{+∞} t e^(−t)/(1 + e^(−t))² dt + a ∫_{−∞}^{+∞} e^(−t)/(1 + e^(−t))² dt    (39)

Three results are useful in evaluating (39).

First

    e^x/(1 + e^x)² = e^x/(1 + 2e^x + e^(2x)) = e^x/[e^(2x)(e^(−2x) + 2e^(−x) + 1)] = e^(−x)/(1 + e^(−x))²    (40)

Second

    cosh x = (e^x + e^(−x))/2  and  cosh²(x/2) = [(e^(x/2) + e^(−x/2))/2]² = (e^x + 2 + e^(−x))/4

and with e^x(1 + e^(−x))² = e^x(1 + 2e^(−x) + e^(−2x)) = e^x + 2 + e^(−x) then cosh²(x/2) = e^x(1 + e^(−x))²/4 and

    (1/4) sech²(x/2) = e^(−x)/(1 + e^(−x))²    (41)

Third

    d/dx tanh u = sech²u · du/dx  so  (1/2) d/dx [tanh(x/2)] = (1/4) sech²(x/2)    (42)

Now to evaluate (39) it is useful to note that ∫_{−∞}^{+∞} t e^(−t)/(1 + e^(−t))² dt = 0, since the integrand f(t) = t e^(−t)/(1 + e^(−t))² is an odd function of t, f(−t) = −f(t), and the interval of integration is symmetric. Hence the mean becomes

    µX = a ∫_{−∞}^{+∞} e^(−t)/(1 + e^(−t))² dt    (43)

Now, using (40), (41) and (42)

    ∫_{−∞}^{+∞} e^(−t)/(1 + e^(−t))² dt = ∫_{−∞}^{+∞} (1/4) sech²(t/2) dt = (1/2) ∫_{−∞}^{+∞} d/dt [tanh(t/2)] dt = (1/2) [tanh(t/2)]_{−∞}^{+∞} = (1/2)[1 − (−1)] = 1    (44)

and using this result in (43) gives the mean of the Logistic distribution as

µX = a (45)


The variance

Using (34), (35) and (37) gives

    σX² = E{(X − µX)²} = E{X²} − (E{X})²
        = ∫_{−∞}^{+∞} x² fX(x) dx − (µX)²
        = ∫_{−∞}^{+∞} x² e^(−(x − a)/b) / [b(1 + e^(−(x − a)/b))²] dx − a²    (46)

and similarly to the derivation of the mean, the substitution t = (x − a)/b gives x = tb + a and dx = b dt, giving (46) as

    σX² = ∫_{−∞}^{+∞} (tb + a)² e^(−t)/(1 + e^(−t))² dt − a² = ∫_{−∞}^{+∞} [(tb)² + 2abt + a²] e^(−t)/(1 + e^(−t))² dt − a²

and

    σX² = b² ∫_{−∞}^{+∞} t² e^(−t)/(1 + e^(−t))² dt + 2ab ∫_{−∞}^{+∞} t e^(−t)/(1 + e^(−t))² dt + a² ∫_{−∞}^{+∞} e^(−t)/(1 + e^(−t))² dt − a²    (47)

Now the second integral of (47) equals zero, since the integrand is an odd function of t, and, using (44), the third integral of (47) equals one, giving the variance as

    σX² = b² ∫_{−∞}^{+∞} t² e^(−t)/(1 + e^(−t))² dt = 2b² ∫_0^{∞} t² e^(−t)/(1 + e^(−t))² dt    (48)

since the function f(t) = t² e^(−t)/(1 + e^(−t))² is symmetric about the f(t) axis.

Now, using the series expression 1/(1 + x)² = 1 − 2x + 3x² − 4x³ + 5x⁴ − ⋯

    e^(−x)/(1 + e^(−x))² = e^(−x) − 2e^(−2x) + 3e^(−3x) − 4e^(−4x) + 5e^(−5x) − ⋯ = Σ_{n=1}^{∞} n(−1)^(n−1) e^(−nx)    (49)

and using (49) in (48) gives

    σX² = 2b² ∫_0^{∞} t² Σ_{n=1}^{∞} n(−1)^(n−1) e^(−nt) dt = 2b² Σ_{n=1}^{∞} n(−1)^(n−1) ∫_0^{∞} t² e^(−nt) dt

Using the standard integral result ∫ x² e^(ax) dx = (e^(ax)/a)(x² − 2x/a + 2/a²) gives

    σX² = 2b² Σ_{n=1}^{∞} n(−1)^(n−1) [(e^(−nt)/(−n))(t² − 2t/(−n) + 2/(−n)²)]_0^{∞}
        = 2b² Σ_{n=1}^{∞} n(−1)^(n−1) [0 − (e^0/(−n))(2/n²)]
        = 2b² Σ_{n=1}^{∞} n(−1)^(n−1) (2/n³)
        = 4b² Σ_{n=1}^{∞} (−1)^(n−1) (1/n²)    (50)
And

    Σ_{n=1}^{∞} (−1)^(n−1) (1/n²) = 1 − 1/2² + 1/3² − 1/4² + 1/5² − ⋯
        = (1 + 1/2² + 1/3² + 1/4² + 1/5² + ⋯) − 2(1/2² + 1/4² + 1/6² + ⋯)
        = Σ_{n=1}^{∞} 1/n² − (2/2²) Σ_{n=1}^{∞} 1/n²
        = π²/6 − π²/12
        = π²/12    (51)

[Note here Euler’s remarkable result: Σ_{n=1}^{∞} 1/n² = π²/6 (Euler 1784)]
n =1

Hence, using (51) in (50) gives the variance of the Logistic distribution as

    σX² = b²π²/3    (52)

The standard deviation σ (the positive square-root of the variance) and the shape parameter b are

    σ = bπ/√3,  b = σ√3/π    (53)
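The result (52) can be checked by numerically integrating (48) with b = 1: the integral of t²e^(−t)/(1 + e^(−t))² over the real line should equal π²/3. A trapezoidal-rule sketch:

```python
import math

def integrand(t):
    # t^2 e^(-t)/(1 + e^(-t))^2; by (40) the factor e^(-t)/(1 + e^(-t))^2 is even,
    # so e^(-|t|) is used to avoid overflow for large negative t.
    u = math.exp(-abs(t))
    return t * t * u / (1.0 + u) ** 2

T, n = 40.0, 80000                      # integrate over [-40, 40] in n steps
h = 2.0 * T / n
total = 0.5 * (integrand(-T) + integrand(T))
for i in range(1, n):
    total += integrand(-T + i * h)
total *= h

print(round(total, 6))                  # 3.289868  (pi^2/3 = 3.2898681...)
```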
If X is a random variable having a Logistic distribution with parameters µ and σ the usual statistical
notation is X ∼ LOG ( µ, σ ) and the probability density function is

π  x − µ 
−  
π e 3  σ  π π  x − µ 
fX ( x : µ, σ ) = = sech2   (54)
σ 3 π  x − µ  
−   
2
4σ 3 2 3  σ 
 
3 σ    
 1 + e 
 

where −∞ < µ < ∞ and σ > 0 are parameters.



Figure 5. Probability density curve: fX(x : µ, σ), µ = 5 and σ = √3
The cumulative distribution function of the Logistic distribution is

    FX(x : µ, σ) = 1 / (1 + e^(−(π/√3)((x − µ)/σ))) = (1/2)[1 + tanh((π/(2√3))((x − µ)/σ))]    (55)

using tanh x = (e^x − e^(−x))/(e^x + e^(−x)) and (1/2)(1 + tanh x) = e^x/(e^x + e^(−x)) = 1/(1 + e^(−2x))


Figure 6. Cumulative distribution curve: FX(x : µ, σ), µ = 5 and σ = √3


Logistic Regression
Logistic regression was developed by statistician David Cox (Cox 1958) as a means of measuring the
relationship between a binary dependent variable (yes or no, win or loss, 1 or 0, etc.) and one or more
independent variables by estimating probabilities using a logistic function. The key features of logistic
regression are (i) the conditional distribution y | x is a Bernoulli distribution2 where y | x means y given x
and (ii) the predicted values are probabilities and are therefore restricted to (0,1).
The model for logistic regression is the function

    y = 1 / (1 + e^(−z))    (56)

where z = β0 + β1x1 + β2x2 + ⋯ + βnxn, y is the predicted variable (probability), x1, x2, x3, … are independent variables (parameters) and β0, β1, β2, … are coefficients.

An expression for z may be obtained from (56) as follows

    y = 1/(1 + e^(−z))
    y + ye^(−z) = 1
    e^(−z) = (1 − y)/y
    e^z = y/(1 − y)

Taking natural logarithms of both sides, noting that ln e^z = z and z = β0 + β1x1 + β2x2 + ⋯ + βnxn, gives

    β0 + β1x1 + β2x2 + ⋯ + βnxn = ln(y/(1 − y))    (57)

This is a non-linear function relating probabilities y with independent variables x and coefficients β .
To enable the determination of coefficients of variables given probabilities it would be desirable to have a
linear function relating these quantities and this may be achieved with the aid of Taylor’s theorem, that for
a single variable, can be expressed in the following form

    f(x) = f(a) + (df/dx)|a (x − a) + (d²f/dx²)|a (x − a)²/2! + (d³f/dx³)|a (x − a)³/3! + ⋯ + (d^(n−1)f/dx^(n−1))|a (x − a)^(n−1)/(n − 1)! + Rn    (58)

where Rn is the remainder after n terms and lim_{n→∞} Rn = 0 for f(x) about x = a, and (df/dx)|a, (d²f/dx²)|a, etc. are derivatives of the function f(x) evaluated at x = a.

Using Taylor’s theorem on the right-hand-side of (57) and evaluating the derivatives about the point y = p
gives

2The probability distribution (in honour of the Swiss mathematician Jacob Bernoulli) of a random variable
which takes the value of 1 with probability p and the value of 0 with the probability q = 1 − p.

    ln(y/(1 − y)) = ln(p/(1 − p)) + (y − p)/(p(1 − p)) + (2p − 1)(y − p)²/(2p²(1 − p)²) + (3p² − 3p + 1)(y − p)³/(3p³(1 − p)³) + ⋯
                  = ln(p/(1 − p)) + (y − p)/(p(1 − p)) + higher order terms    (59)

If p is a close approximation of y the term (y − p) is small, (y − p)² is very much smaller and (y − p)³ exceedingly small, and we may neglect the higher order terms in (59) and write

    ln(y/(1 − y)) ≅ ln(p/(1 − p)) + (y − p)/(p(1 − p))

Substituting this result into (57) gives

    β0 + β1x1 + β2x2 + ⋯ + βnxn ≅ ln(p/(1 − p)) + (y − p)/(p(1 − p))    (60)

Now, suppose that the approximation p of the probability y is obtained from

    p = 1 / (1 + e^(−z0))    (61)

where z0 = β0⁰ + β1⁰x1 + β2⁰x2 + ⋯ + βn⁰xn and β0⁰, β1⁰, β2⁰, … are approximate values of the coefficients β0, β1, β2, … . And bearing in mind (57)

    β0⁰ + β1⁰x1 + β2⁰x2 + ⋯ + βn⁰xn = ln(p/(1 − p))    (62)

Now, using (62) in (60) and adding a residual v to the left-hand-side to account for small random errors in the variables x, an equation that can be used to determine the coefficients β in an iterative scheme is

    v + β0 + β1x1 + β2x2 + ⋯ + βnxn = β0⁰ + β1⁰x1 + β2⁰x2 + ⋯ + βn⁰xn + (y − p)/(p(1 − p))    (63)

To see how this may work, suppose, as an example, that 20 students sit an examination with the observed outcomes y1, y2, …, y20 as pass or fail, recorded as 1 or 0. These outcomes are thought to be related to a single variable x, the hours of study in preparation for the exam, and the logistic function is assumed to be y = 1/(1 + e^(−(β0 + β1x))). The wish is to determine the two coefficients β0 and β1.

Now assume approximate values β0⁰, β1⁰ for the coefficients; following (63) the observation equation for the kth outcome is

    vk + β0 + β1xk = β0⁰ + β1⁰xk + (yk − pk)/(pk(1 − pk))    (64)

and for the 20 observed outcomes the following equations arise

    v1 + β0 + β1x1 = β0⁰ + β1⁰x1 + (y1 − p1)/(p1(1 − p1))
    v2 + β0 + β1x2 = β0⁰ + β1⁰x2 + (y2 − p2)/(p2(1 − p2))
    v3 + β0 + β1x3 = β0⁰ + β1⁰x3 + (y3 − p3)/(p3(1 − p3))
    ⋮
    v20 + β0 + β1x20 = β0⁰ + β1⁰x20 + (y20 − p20)/(p20(1 − p20))

These equations can be rearranged into matrix form

    v + Bx = f    (65)

where v = [v1 v2 v3 … v20]^T is a vector of residuals, B is the (20 × 2) coefficient matrix whose kth row is [1 xk], x = [β0 β1]^T is the vector of coefficients, and

    f = Bx0 + [(y1 − p1)/(p1(1 − p1))  (y2 − p2)/(p2(1 − p2))  …  (y20 − p20)/(p20(1 − p20))]^T

is a vector of numeric terms that can be rearranged as

    f = Bx0 − c + Ay = d + Ay    (66)

where x0 = [β0⁰ β1⁰]^T is the vector of approximate coefficients, c = [1/(1 − p1)  1/(1 − p2)  …  1/(1 − p20)]^T is a vector of numeric terms, y = [y1 y2 … y20]^T is the vector of measurements, A = diag(1/(p1(1 − p1)), 1/(p2(1 − p2)), …, 1/(p20(1 − p20))) is a diagonal coefficient matrix and d = Bx0 − c is a vector of numeric terms.


Applying propagation of variances³ to (66) gives

    Vff = A Vyy A^T    (67)

where Vyy is a diagonal matrix containing the variances of the measurements y and Vff is a diagonal matrix containing the variances of the numeric terms f in (66). The y’s in the right-hand-side of (66) are random variables that follow a Bernoulli distribution and take the value of 1 with a probability of p and a value of 0 with a probability of q = 1 − p. The variance of these variables is p(1 − p) and the general form of Vff is given by

    Vff = diag(1/(p1(1 − p1)), 1/(p2(1 − p2)), …, 1/(p20(1 − p20)))    (68)

Now, with the general relationship that weights are inversely proportional to variances, i.e. W = V⁻¹, the general form of the weight matrix W of (65) is

    W = diag(p1(1 − p1), p2(1 − p2), …, p20(1 − p20))    (69)
The coefficients in the vector x in (65) can now be solved using least squares⁴ with the standard result (Mikhail 1976)

    x = (B^T W B)^(−1) B^T W f    (70)

and these are ‘updates’ of the approximate values in x0 . This iterative process can be repeated until there is
no appreciable change in the updated coefficients. In the literature associated with logistic regression this
iterative least squares process of determining the coefficients is known as Iteratively Reweighted Least
Squares (IRLS).

3 Propagation of Variances is a mathematical technique of estimating the variance of functions of random


variables that have assumed (or known) variances. See Appendix A.
4 Least squares is a mathematical estimation process used to calculate the best estimate of quantities from

overdetermined systems of equations. It is commonly used to determine the line of best fit through a number
of data points and this application is known as linear regression.

A MATLAB5 function logistic.m that solves for the parameters of a logistic regression has been developed by
Professor Geoffrey J. Gordon in the Machine Learning Department at Carnegie Mellon University, USA and
is available from his website: http://www.cs.cmu.edu/~ggordon/IRLS-example/. This function will also run
under GNU OCTAVE6. A copy of the function and an example of its use is shown in Appendix B.
The data for the example used for explanation (20 students studying for an examination) is from the
Wikipedia page Logistic Regression (https://en.wikipedia.org/wiki/Logistic_regression) and is shown in the
table below.

     #  Hours  Pass     #  Hours  Pass     #  Hours  Pass     #  Hours  Pass
     1  0.50   0        6  1.75   0       11  2.75   1       16  4.25   1
     2  0.75   0        7  1.75   1       12  3.00   0       17  4.50   1
     3  1.00   0        8  2.00   0       13  3.25   1       18  4.75   1
     4  1.25   0        9  2.25   1       14  3.50   0       19  5.00   1
     5  1.50   0       10  2.50   0       15  4.00   1       20  5.50   1

Table 1. Number of hours each student spent studying, and whether they passed (1) or failed (0)

20 students sit an examination with the observed outcomes y1, y2, …, y20 as pass or fail and recorded as 1 or 0. These outcomes are related to a single variable x that is hours of study in preparation for the exam and the logistic function is assumed to be

y = 1/(1 + e^(−(β0 + β1x)))

The coefficients β0 and β1 are computed using the MATLAB function logistic.m. The input data and results are shown below
>> a
a =

1.00000 0.50000
1.00000 0.75000
1.00000 1.00000
1.00000 1.25000
1.00000 1.50000
1.00000 1.75000
1.00000 1.75000
1.00000 2.00000
1.00000 2.25000
1.00000 2.50000
1.00000 2.75000
1.00000 3.00000
1.00000 3.25000
1.00000 3.50000
1.00000 4.00000
1.00000 4.25000
1.00000 4.50000
1.00000 4.75000
1.00000 5.00000
1.00000 5.50000

>> y'
ans =

0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1

5 MATLAB (matrix laboratory) is a numerical computing environment and programming language developed by MathWorks.
6 GNU OCTAVE is free software featuring a high-level programming language, primarily intended for
numerical computations that is mostly compatible with MATLAB. It is part of the GNU Project and is free
software under the terms of the GNU General Public License.

>> xhat = logistic(a, y, [], [], struct('verbose', 1))


1: [ -2.61571 0.938375 ]
2: [ -3.66126 1.33855 ]
3: [ -4.03513 1.48734 ]
4: [ -4.07709 1.5044 ]
5: [ -4.07757 1.5046 ]
6: [ -4.07757 1.5046 ]
7: [ -4.07757 1.5046 ]
8: [ -4.07757 1.5046 ]
Converged.
xhat =

-4.0776
1.5046

The array a is the coefficient matrix B and the vector y is y, noting that y' denotes the transpose y^T.

The iterative process has converged after 8 iterations and the coefficients are in the array xhat where
β0 = −4.0776 and β1 = 1.5046 .

Figure 7. Probability y of passing exam after x hours of study:

y = 1/(1 + e^(−(β0 + β1x))) where β0 = −4.0776 and β1 = 1.5046

The midpoint of the symmetric curve is the point (x0, ½), and y = ½ when the exponent is zero, i.e., when

β0 + β1x0 = 0 giving x0 = −β0/β1 = 2.7101

In this example, a student who studies 2 hours has a probability of passing the exam of 0.26

P(pass) = 1/(1 + e^(−(−4.0776 + 1.5046×2))) = 0.2557

and similarly, a student who studies 4 hours has a probability of passing the exam of 0.87

P(pass) = 1/(1 + e^(−(−4.0776 + 1.5046×4))) = 0.8744
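These probabilities are easy to reproduce in code. A Python sketch of the fitted curve (the function name is mine; the coefficients are those computed above):

```python
import math

def p_pass(hours, b0=-4.0776, b1=1.5046):
    """Probability of passing the exam after `hours` of study,
    using the coefficients fitted above by logistic regression."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * hours)))
```

For example, `p_pass(2.0)` gives 0.2557 and `p_pass(4.0)` gives 0.8744, matching the hand calculations, and `p_pass(2.7101)` returns 0.5 at the curve's midpoint.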


Table 2 shows probabilities of passing the exam for several values of hours of study

Hours of study    Probability of passing exam
1                 0.07
2                 0.26
2.7101            0.50
3                 0.61
4                 0.87
5                 0.97

Table 2. Probability of passing the exam for selected hours of study

Logistic Curve for Dove Open Doubles Petanque Tournament 01-May-2016


In this tournament there were 18 teams. There were 4 Qualifying rounds (Swiss System7) and the top 9
teams were seeded and did not play each other in Round 1 of the Qualifying. After the Qualifying the teams
were ranked 1–18 and teams ranked 1 to 8 went into a Principale and teams ranked 9 to 16 went into a
Complémentaire. The remaining two teams took no further part in the tournament. The Principale and
Complémentaire were single-elimination finals series with play-offs for 3rd and 4th places. There were 36
matches in the Qualifying and 8 matches each in the Principale and Complémentaire making a total of 52
matches involving the 18 teams.
Tables 3 (Qualifying), 5 (Principale) and 6 (Complémentaire) show matches and game scores in the
tournament and Tables 4 and 7 show ranking after Qualifying and final ranking in Principale and
Complémentaire respectively.

Round 1 Round 2 Round 3 Round 4


16(13) v 9(4) 2(7) v 14(11) 3(13) v 14(8) 3(13) v 16(8)
18(6) v 14(10) 7(9) v 15(13) 15(11) v 16(12) 18(12) v 7(13)
5(12) v 11(8) 8(3) v 16(13) 17(11) v 6(12) 14(5) v 4(10)
7(13) v 1(6) 5(5) v 17(13) 5(10) v 18(13) 15(13) v 6(11)
2(13) v 12(0) 3(13) v 6(3) 4(13) v 13(12) 1(12) v 13(9)
8(13) v 4(8) 4(10) v 11(6) 8(8) v 1(9) 5(8) v 2(13)
15(13) v 17(1) 9(2) v 18(13) 2(10) v 7(13) 17(12) v 8(11)
13(10) v 6(13) 13(13) v 12(2) 11(13) v 10(4) 11(11) v 12(8)
3(13) v 10(7) 10(3) v 1(13) 9(8) v 12(9) 10(9) v 9(11)
Table 3. Dove Open Doubles: Qualifying matches
(game scores shown in parentheses beside team number)

7 The Swiss System (also known as the Swiss Ladder System) is a tournament system that allows
participants to play a limited number of rounds against opponents of similar strength. The system was
introduced in 1895 by Dr. J. Muller in a chess tournament in Zurich, hence the name ‘Swiss System’. The
principles of the system are: (i) In every round, each player is paired with an opponent with an equal score
(or as nearly equal as possible); (ii) Two players are paired at most once; (iii) After a predetermined number
of rounds the players are ranked according to a set of criteria. The leading player wins; or the ranking is the
basis of subsequent elimination series.


Qualifying Ranking
Rank Team Score BHN fBHN Games Points delta
1 3 4 7 40 4:0 52:26 +26
2 15 3 10 36 3:1 50:33 +17
3 7 3 10 29 3:1 48:41 +7
4 16 3 9 34 3:1 46:31 +15
5 4 3 6 36 3:1 41:36 +5
6 1 3 5 40 3:1 40:33 +7
7 14 2 11 27 2:2 34:36 –2
8 6 2 10 33 2:2 39:47 –8
9 17 2 7 39 2:2 37:41 –4
10 2 2 7 35 2:2 43:32 +11
11 18 2 7 35 2:2 44:35 +9
12 11 2 5 30 2:2 38:34 +4
13 8 1 11 27 1:3 35:42 –7
14 13 1 9 27 1:3 44:40 +4
15 5 1 8 26 1:3 35:47 –12
16 9 1 6 32 1:3 25:44 –19
17 12 1 6 27 1:3 19:45 –26
18 10 0 10 23 0:4 23:50 –27
Table 4. Dove Open Doubles: Ranking after Qualifying rounds
(BHN is Buchholtz Number8, fBHN is Fine Buchholtz Number)

Quarter-Finals Semi-Finals Final Playoff


3(11) v 6(9) 3(13) v 16(2) 3(13) v 14(7) 16(6) v 7(13)
16(11) v 4(10) 7(8) v 14(9)
1(9) v 7(11)
14(11) v 15(9)
Table 5. Dove Open Doubles: Principale matches

Quarter-Finals Semi-Finals Final Playoff


17(10) v 9(8) 17(2) v 11(13) 11(10) v 18(13) 17(3) v 2(13)
11(13) v 8(4) 18(13) v 2(1)
13(2) v 18(13)
5(7) v 2(13)
Table 6. Dove Open Doubles: Complémentaire matches

Principale              Complémentaire
Rank  Team              Rank  Team
 1     3                 1    18
 2    14                 2    11
 3     7                 3     2
 4    16                 4    17
=5     4                =5     8
=5     6                =5    13
=5    15                =5     9
=5     1                =5     5
Table 7. Dove Open Doubles: Final Ranking in Principale & Complémentaire

8 The Buchholtz system is a ranking system, first used by Bruno Buchholtz in a Swiss System chess
tournament in 1932. The principle of the system is that when two players have equal scores at the end of a
defined number of rounds a tie break is required to determine the top ranked player. The scores of both
player’s opponents (in all rounds) are added giving each their Buchholtz Number (BHN). The player having
the larger BHN is ranked higher on the assumption they have played against better performing players. The
Fine Buchholtz Number (fBHN) is the sum of the opponents’ Buchholtz Numbers and is used to break ties where players’ BHN are equal. In the rare case that Score, BHN and fBHN are all equal then delta = points For – points Against is used as a tie break (see Teams 2 & 18 in Table 7).

A least squares solution for team ratings, based on the 36 Qualifying matches yielded the following values
with the highest rating team, (team 15) having a rating r15 = 100 .

r1 = 93.670    r7 = 97.207     r13 = 92.312
r2 = 94.100    r8 = 92.923     r14 = 94.837
r3 = 99.666    r9 = 85.586     r15 = 100.000
r4 = 93.526    r10 = 85.238    r16 = 98.294
r5 = 89.598    r11 = 89.030    r17 = 93.204
r6 = 94.295    r12 = 83.757    r18 = 94.057

Using these ratings an analysis of the Qualifying matches (Table 3) yields the following tabulated results
where Team A is the first-named team and Team B is the second-named team. A win is recorded as 1 and a
loss is recorded as 0, and dr is the difference in ratings. If dr is positive then Team A is the higher-rated team.

Match Team A Team B dr = rA − rB Match Team A Team B dr = rA − rB


1 1 0 12.708 19 1 0 4.829
2 0 1 -0.780 20 0 1 1.706
3 1 0 0.568 21 0 1 -1.091
4 1 0 3.537 22 0 1 -4.459
5 1 0 10.343 23 1 0 1.214
6 1 0 -0.603 24 0 1 -0.747
7 1 0 6.796 25 0 1 -3.107
8 0 1 -1.983 26 1 0 3.792
9 1 0 14.428 27 0 1 1.829
10 0 1 -0.737 28 1 0 1.372
11 0 1 -2.793 29 0 1 -3.150
12 0 1 -5.371 30 0 1 1.311
13 0 1 -3.606 31 1 0 5.705
14 1 0 5.371 32 1 0 1.358
15 1 0 4.496 33 0 1 -4.502
16 0 1 -8.471 34 1 0 0.281
17 1 0 8.555 35 1 0 5.273
18 0 1 -8.432 36 0 1 -0.348

Table 8. Win/Loss and rating difference for Qualifying matches of Dove Open Doubles.
Using the values in Table 8, Logistic Regression is used to compute the parameters of a curve representing
the probability of Team A winning given a certain rating difference.

The curve is assumed to have the following form: y = 1/(1 + e^(−(β0 + β1x))) where y is the probability of winning and x is the rating difference between the two teams, and the MATLAB function logistic.m is used to compute the coefficients β0 and β1.


The input arrays a and y are shown below together with the output vector xhat (Note that a′ denotes
transpose)
>> a'
ans =

Columns 1 through 14:

1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000


1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
12.70800 -0.78000 0.56800 3.53700 10.34300 -0.60300 6.79600 -1.98300
14.42800 -0.73700 -2.79300 -5.37100 -3.60600 5.37100

Columns 15 through 28:

1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000


1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
4.49600 -8.47100 8.55500 -8.43200 4.82900 1.70600 -1.09100 -4.45900
1.21400 -0.74700 -3.10700 3.79200 1.82900 1.37200

Columns 29 through 36:

1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000


-3.15000 1.31100 5.70500 1.35800 -4.50200 0.28100 5.27300 -0.34800

>> y'
ans =

1 0 1 1 1 1 1 0 1 0 0 0 0 1 1 0 1 0 1 0 0 0
1 0 0 1 0 1 0 0 1 1 0 1 1 0

>>

>> xhat = logistic(a, y, [], [], struct('verbose',1))


1: [ -0.348917 0.277334 ]
2: [ -0.485595 0.504256 ]
3: [ -0.60651 0.733499 ]
4: [ -0.690405 0.902133 ]
5: [ -0.722152 0.964799 ]
6: [ -0.725415 0.970895 ]
7: [ -0.725443 0.970944 ]
8: [ -0.725443 0.970944 ]
9: [ -0.725443 0.970944 ]
10: [ -0.725443 0.970944 ]
Converged.
xhat =

-0.72544
0.97094

>>

The iterative process has converged after 10 iterations and the coefficients are in the array xhat where
β0 = −0.72544 and β1 = 0.97094 .


Figure 8. Probability y of winning a petanque match given a rating difference x:

y = 1/(1 + e^(−(β0 + β1x))) where β0 = −0.72544 and β1 = 0.97094

The midpoint of the symmetric curve is the point (x0, ½), and y = ½ when the exponent is zero, i.e., when

β0 + β1x0 = 0 giving x0 = −β0/β1 = 0.74715
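As with the examination example, the fitted curve can be evaluated directly. A Python sketch (the function name is mine; the coefficients are those fitted to the Qualifying matches):

```python
import math

def p_win(dr, b0=-0.72544, b1=0.97094):
    """Probability that the first-named team (Team A) wins, given the
    rating difference dr = rA - rB, using the coefficients fitted
    above to the 36 Qualifying matches."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * dr)))
```

At the midpoint rating difference of 0.74715 the probability is 0.5, and the probability increases monotonically with the rating difference.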

Logistic Curve for Elo Rating System


The Elo Rating System (Elo 1978) is a mathematical process based on a statistical model relating match
results to underlying variables representing the abilities of a team or player. The name “Elo” derives from
Arpad Elo9, the inventor of a system for rating chess players and his system, in various modified forms, is
used for player or team ratings in many sports.
The Elo Rating System calculates, for every player or team, a numerical rating based on performance in
competitions. A rating is a number (usually an integer) between 0 and 3000 that changes over time
depending on the outcome of tournament games. The system depends on a curve defined by a logistic
function (Langville & Meyer 2012, Glickman & Jones, 1999)

pA = 1/(1 + 10^(−(rA − rB)/b))     (71)

where pA is the probability of player A winning in a match A versus B given the player ratings rA, rB and b is a shape parameter.
The curve of this function has a similar form to the cumulative distribution curve of the Logistic distribution (33) with e replaced by 10 as a base, x = rA − rB is the rating difference between players A and B, a = 0 and b is the shape parameter.

9 Arpad Elo (1903 – 1992) the Hungarian-born US physics professor and chess-master who devised a system
to rate chess players that was implemented by the United States Chess Federation (USCF) in 1960 and
adopted by the World Chess Federation (FIDE) in 1970. Elo described his work in his book The Rating of
Chess Players, Past & Present, published in 1978 and his system has been adapted to many sports.

Figure 9. Elo curve: y = 1/(1 + 10^(−x/400)). y = pA is the probability of A winning, x = rA − rB is the rating difference and the shape parameter b = 400. The three points on the curve shown thus ○ have rating differences −265, 174 and 626 that correspond with probabilities 0.179, 0.731 and 0.973 respectively.

The curve above has the shape parameter b = 400 and this value is chosen so that a player rating difference of approximately 200 corresponds to a probability of winning of approximately 0.75. With y = pA and x = rA − rB (71) becomes

y = 1/(1 + 10^(−x/400))     (72)

and this equation can be rearranged as 10^(−x/400) = (1 − y)/y. Now using the rule for logarithms that if p = loga N then N = a^p the expression for rating difference x is

x = −400 log10((1 − y)/y)     (73)

And if probability y = 0.75 then rating difference x = 190.848501.


For example, suppose two players A and B with ratings 1862 and 1671 respectively play a match. The probability of A winning is given by (71) as

pA = 1/(1 + 10^(−(1862 − 1671)/400)) = 1/(1 + 10^(−191/400)) = 1/(1 + 10^(−0.4775)) = 0.750163482
We might express this probability of A winning as:
(i) If A played B in 100 matches then A would win 75 of them (75.0163482 actually), or
(ii) A has a 75% chance of winning.
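Equations (71) and (73) translate directly into code. A Python sketch (the function names are mine):

```python
import math

def elo_expected(ra, rb, b=400.0):
    """Probability of A beating B under the Elo curve, equation (71)."""
    return 1.0 / (1.0 + 10.0 ** (-(ra - rb) / b))

def rating_difference(p, b=400.0):
    """Inverse of the curve, equation (73): the rating difference
    that corresponds to a win probability p."""
    return -b * math.log10((1.0 - p) / p)
```

For the worked example above, `elo_expected(1862, 1671)` returns 0.750163…, and `rating_difference(0.75)` returns 190.8485….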

Elo’s logistic function uses exponents with a base of 10. This is the base of common logarithms and the
following relationships may be useful.

If

y = 1/(1 + e^(−x)) = 1/(1 + 10^(−αx))     (74)

noting that e^x = 10^(αx) and e^1 = 2.718281828459... = 10^α, then

α = log10 e = 0.434294481903... = 1/2.302585092994...     (75)

Alternatively, if

y = 1/(1 + 10^(−x)) = 1/(1 + e^(−βx))     (76)

noting that 10^x = e^(βx) and 10^1 = 10 = e^β, then

β = ln 10 = 2.302585092994... = 1/0.434294481903...
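These identities can be checked numerically. A quick Python sketch:

```python
import math

# conversion constants between base-e and base-10 logistic curves
ALPHA = math.log10(math.e)   # 0.434294481903...
BETA = math.log(10.0)        # 2.302585092994...

def logistic_base_e(x):
    """y = 1/(1 + e^(-x))"""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_base_10(x, scale=1.0):
    """y = 1/(1 + 10^(-scale*x))"""
    return 1.0 / (1.0 + 10.0 ** (-scale * x))
```

With these constants, `logistic_base_10(x, ALPHA)` agrees with `logistic_base_e(x)` for any x, and `logistic_base_10(x)` agrees with `logistic_base_e(BETA * x)`.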


References
Bacaër, N, 2011, A Short History of Mathematical Population Dynamics, Chapter 6, Verhulst and the logistic
equation (1838), pp.35-39, Springer-Verlag London Limited.
http://webpages.fc.ul.pt/~mcgomes/aulas/dinpop/Mod13/Verhulst.pdf [accessed 10-May-2018]
Cox, D.R., 1958, ‘The regression analysis of binary sequences’, Journal of the Royal Statistical Society. Series
B (Methodological), Vol. 20, No. 2, pp. 215-242.
http://www.jstor.org/stable/2983890 [accessed 10-May-2018]
Cramer, J.S., 2002, The Origins of Logistic Regression, Tinbergen Institute Discussion Paper, TI 2002-119/4,
Faculty of Economics and Econometrics, University of Amsterdam, and Tinbergen Institute, 14 pages,
November 2002.
https://papers.tinbergen.nl/02119.pdf [accessed 15-May-2018]
Elo, A.E., 1978, The Rating of Chess Players, Past & Present, 2nd printing, April 2008, Ishi Press
International.
Euler, L., 1748, Introduction to Analysis of the Infinite (On the use of the Discovered Fractions to Sum
Infinite Series), in Pi: A Source Book, by Berggren, L., Borwein, J. and Borwein, P., 1997, Springer,
New York.
Glickman, M.E. and Jones, A., 1999, ‘Rating the chess rating system’, Chance, Vol. 12, No. 2, pp. 21-28.
http://glicko.net/research/chance.pdf [accessed 25-Aug-2018]
Hunter, M.N, 2018, ‘Some comments about The Logistic Function’, Private correspondence, 6 pages, 01-Sep-
2018.
Johnson, N.L. and Leone, F.C., 1964, Statistics and Experimental Design In Engineering and the Physical
Sciences, Vol. I, John Wiley & Sons, Inc., New York
Kreyszig, Erwin, 1970, Introductory Mathematical Statistics, John Wiley & Sons, New York.
Langville, A.N. and Meyer, C.D., 2012, Who’s #1? The Science of Rating and Ranking, Princeton University
Press, Princeton.
Mikhail, E.M., 1976, Observations and Least Squares, IEP―A Dun-Donnelley, New York
O’Connor, J.J. and Robertson, E.F., 2014, ‘Pierre François Verhulst’, MacTutor History of Mathematics,
http://www-history.mcs.st-andrews.ac.uk/Biographies/Verhulst.html [accessed 16-May-2018]
Verhulst, P.F., 1838, ‘Notice sur la loi que la population suit dans son accroissement’, Correspondance
Mathématique et Physique, Publiée par A. Quetelet, Vol. 4, pp. 113-121
https://books.google.com.au/books?id=8GsEAAAAYAAJ&hl=fr&pg=PA113&redir_esc=y#v=onepa
ge&q&f=false [accessed 15-May-2018]
Verhulst, P.F., 1844, ‘Recherches mathématiques sur la loi d’accroissement de la population’, Nouveaux
Mémoires de l'Académie Royale des Sciences et Belles-Lettres de Bruxelles, Vol. 18, pp. 1-38
http://www.med.mcgill.ca/epidemiology/Hanley/anniversaries/ByTopic/Verhulst1844.pdf [accessed
15-May-2018]


APPENDIX A: Some Statistical Definitions

Experiments, Sets, Sample Spaces, Events and Probability


The term statistical experiment can be used to describe any process by which several chance observations are
obtained. All possible outcomes of an experiment comprise a set called the sample space and a set or sample
space contains N elements or members. An event is a subset of the sample space containing n elements.
Experiments, sets, sample spaces and events are the fundamental tools used to determine the probability of
certain events where probability is defined as

P(Event) = n/N     (77)
For example, if a card is drawn from a deck of playing cards, what is the probability that it is a heart? In
this case, the experiment is the drawing of the card and the possible outcomes of the experiment could be one
of 52 different cards, i.e., the sample space is the set of N = 52 possible outcomes and the event is the
subset containing n = 13 hearts. The probability of drawing a heart is

P(Heart) = n/N = 13/52 = 0.25
This definition of probability is a simplification of a more general concept of probability that can be
explained in the following manner (see Johnson & Leone, 1964, pp.32-3).
Suppose observations are made on a series of occasions (often termed trials) and during these
trials it is noted whether or not a certain event occurs. The event can be almost any observable
phenomenon, for example, that the height of a person walking through a doorway is greater
than 1.8 metres, that a family leaving a cinema contains three children, that a defective item is
selected from an assembly line, and so on. These trials could be conducted twice a week for a
month, three times a day for six months or every hour for every day for 10 years. In the
theoretical limit, the number of trials N would approach infinity and we could assume, at this
point, that we had noted every possible outcome. Therefore, as N → ∞ then N becomes the
number of elements in the sample space containing all possible outcomes of the trials. Now for
each trial we note whether or not a certain event occurs, so that at the end of N trials we have
noted nN events. The probability of the event (if it in fact occurs) can then be defined as

P(Event) = lim_{N→∞} (nN/N)

Since nN and N are both non-negative numbers and nN is not greater than N then

0 ≤ nN/N ≤ 1
Hence

0 ≤ P { Event } ≤ 1

If the event occurs at every trial then nN = N and nN/N = 1 for all N and so P(Event) = 1.
This relationship can be described as: the probability of a certain (or sure) event is equal to 1.

If the event never occurs, then nN = 0 and nN/N = 0 for all N and so P(Event) = 0. This
relationship can be described as: the probability of an impossible event is zero.
The converse of these two relationships need not hold, i.e., a probability of one need not imply certainty since it is possible that lim_{N→∞} nN/N = 1 without nN = N for all values of N and a

probability of zero need not imply impossibility since it is possible that lim_{N→∞} nN/N = 0 even though nN > 0. Despite these qualifications, it is useful to think of probability as measured on


a scale varying from (near) impossibility at 0 to (near) certainty at 1. It should also be noted
that this definition of probability (or any other definition) is not directly verifiable in the sense
that we cannot actually carry out the infinite series of trials to see whether there really is a
unique limiting value for the ratio nN/N. The justification for this definition of probability is
utilitarian, in that the results of applying theory based on this definition prove to be useful and
that it fits with intuitive ideas. However, it should be realized that it is based on the concept of
an infinitely long series of trials rather than an actual series, however long it may be.

Random Variables and Probability Distributions of Random Variables


A random variable X is a rule or a function, which associates a real number with each point in a sample
space. As an example, consider the following experiment where two identical coins are tossed; h denotes a
head and t denotes a tail.
Experiment: Toss two identical coins.

Sample space: S = { hh, ht, th, tt } .

Random Variable: X, the number of heads obtained, may be written as

X ( hh ) = 2
X ( ht ) = 1
X ( th ) = 1
X ( tt ) = 0

In this example X is the random variable defined by the rule: "the number of heads obtained". The possible
values (or real numbers) that X may take are 0, 1, 2. These possible values are usually denoted by x and the
notation X = x denotes x as a possible real value of the random variable X.
Random variables may be discrete or continuous. A discrete random variable assumes each of its possible
values with a certain probability. For example, in the experiment above; the tossing of two coins, the sample
space S = { hh, ht, th, tt } has N = 4 elements and the probability the random variable X (the number of
heads) assumes the possible values 0, 1 and 2 is given by

x            0     1     2
P(X = x)    1/4   2/4   1/4

Note that the values of x exhaust all possible cases and hence the probabilities add to 1
A continuous random variable has a probability of zero of assuming any of its values and consequently, its
probability distribution cannot be given in tabular form. The concept of the probability of a continuous
random variable assuming a particular value equals zero may seem strange, but the following example
illustrates the point. Consider a random variable whose values are the heights of all people over 21 years of
age. Between any two values, say 1.75 metres and 1.85 metres, there are an infinite number of heights, one
of which is 1.80 metres. The probability of selecting a person at random exactly 1.80 metres tall and not one
of the infinitely large set of heights so close to 1.80 metres that you cannot humanly measure the difference is
extremely remote, and thus we assign a probability of zero to the event. It follows that probabilities of
continuous random variables are defined by specifying an interval within which the random variable lies and
it does not matter whether an end-point is included in the interval or not.

P (a < X ≤ b ) = P (a < X < b ) + P ( X = b )


= P (a < X < b )


It is most convenient to represent all the probabilities of a random variable X by a formula or function
denoted by fX ( x ) , g X ( x ) , hX ( x ) , etc., or by FX ( x ) , GX ( x ) , H X ( x ) , etc.

In this notation the subscript X denotes that fX ( x ) or FX ( x ) is a function of the random variable X which
takes the numerical values x within the function. Such functions are known as probability distribution
functions and they are paired; i.e., fX ( x ) pairs with FX ( x ) , g X ( x ) pairs with GX ( x ) , etc. The functions
with the lowercase letters are probability density functions and those with uppercase letters are cumulative
distribution functions.
For discrete random variables, the probability density function has the properties

1.  fX(xk) = P(X = xk)

2.  ∑k fX(xk) = 1, the sum taken over all possible values xk

And the cumulative distribution function has the properties

1.  FX(xk) = P(X ≤ xk)

2.  FX(x) = ∑_{xk ≤ x} fX(xk)

As an example consider the probability distribution functions fX ( x ) and FX ( x ) of the sum of the numbers
when a pair of dice is tossed.
Experiment: Toss two identical dice.

Sample space:
S = { 1,1  1,2  1,3  1,4  1,5  1,6
      2,1  2,2  2,3  2,4  2,5  2,6
      3,1  3,2  3,3  3,4  3,5  3,6
      4,1  4,2  4,3  4,4  4,5  4,6
      5,1  5,2  5,3  5,4  5,5  5,6
      6,1  6,2  6,3  6,4  6,5  6,6 }

Random Variable: X, the total of the two numbers


The probability the random variable X assumes the possible values x = 2, 3, 4, …, 12 is given in Table A1

x           2     3     4     5     6     7     8     9     10    11    12
P(X = x)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
Table A1. Table of probabilities
Note that the values of x exhaust all possible cases and hence the probabilities add to 1

The probability density function fX(x) can be deduced from Table A1

fX(x) = (6 − |x − 7|)/36,    x = 2, 3, 4, …, 12
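The table and the density formula can be verified by direct enumeration of the 36 outcomes. A Python sketch using exact fractions:

```python
from collections import Counter
from fractions import Fraction

# tally the 36 equally likely outcomes of tossing a pair of dice
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
probs = {x: Fraction(c, 36) for x, c in counts.items()}

def f_X(x):
    """Density of the sum of two dice: f(x) = (6 - |x - 7|)/36."""
    return Fraction(6 - abs(x - 7), 36)
```

The enumerated probabilities match the formula for every x from 2 to 12, and they sum to 1.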
Probability distributions are often shown in graphical form. For discrete random variables, probability
distributions are generally shown in the form of histograms consisting of series of rectangles associated with
values of the random variable. The width of each rectangle is one unit and the height is the probability
given by the function fX ( x ) and the sum of the areas of all the rectangles is 1.


Figure A1 shows the Probability histogram for the random variable X, the sum of the numbers when a pair
of dice is tossed.

Figure A1. Probability histogram (probability f(x) against x)

Figure A2 shows the cumulative distribution function FX(x) = ∑_{xk ≤ x} fX(xk) for the random variable X, the sum of the numbers when a pair of dice is tossed.

Figure A2. Cumulative distribution function (F(x) against x). [The dots at the left ends of the line segments indicate the value of FX(x) at those values of x.]


For continuous random variables, the probability distribution functions fX ( x ) and FX ( x ) are curves,
which may take various forms depending on the nature of the random variable. Probability density functions
fX ( x ) that are used in practice to model the behaviour of continuous random variables are always positive
and the total area under its curve, bounded by the x-axis, is equal to one. These density functions have the
following properties

1.  fX(x) ≥ 0 for any value of x

2.  ∫_{−∞}^{+∞} fX(x) dx = 1

The probability that a random variable X lies between any two values x = a and x = b is the area under
the density curve between those two values and is found by methods of integral calculus
P(a < X < b) = ∫_{a}^{b} fX(x) dx     (78)

The equations of the density functions fX ( x ) are usually complicated and areas under their curves are found
from tables. In many scientific studies, the Normal probability density function is the usual model for the
behaviour of measurements (regarded as random variables) and the probability density function is (Kreyszig,
1970, p. 107)
fX(x) = (1/(σ√(2π))) e^(−½((x−µ)/σ)²)     (79)

µ and σ are the mean and standard deviation respectively of the infinite population of x and Figure A3 shows a plot of the Normal probability density curve for µ = 2.0 and σ = 2.5.

Figure A3. Normal probability density function for µ = 2.0 and σ = 2.5


For continuous random variables X, the cumulative distribution function FX ( x ) has the following properties

1.  FX(x) = P(X ≤ x) = ∫_{−∞}^{x} fX(x) dx

2.  (d/dx) FX(x) = fX(x)
In many scientific studies, the Normal distribution is the usual model for the behaviour of measurements and
the cumulative distribution function is (Kreyszig, 1970, p. 108)
FX(x) = (1/(σ√(2π))) ∫_{−∞}^{x} e^(−½((x−µ)/σ)²) dx     (80)

The probability that X assumes any value in an interval a < X < b is

P(a < X < b) = FX(b) − FX(a) = (1/(σ√(2π))) ∫_{a}^{b} e^(−½((x−µ)/σ)²) dx     (81)
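Equation (81) can be evaluated without numerical integration by using the standard identity FX(x) = ½(1 + erf((x − µ)/(σ√2))). A Python sketch, with the constants following the µ = 2.0, σ = 2.5 example:

```python
import math

MU, SIGMA = 2.0, 2.5   # the example values used in the figures

def normal_cdf(x, mu=MU, sigma=SIGMA):
    """Normal cumulative distribution function, equation (80),
    evaluated via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_between(a, b, mu=MU, sigma=SIGMA):
    """P(a < X < b) = F(b) - F(a), equation (81)."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
```

For example, `prob_between(-0.5, 4.5)` (the interval µ ± σ) gives the familiar value of about 0.6827.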
Figure A4 shows a plot of the Normal cumulative distribution curve for µ = 2.0 and σ = 2.5 .

Figure A4. Normal cumulative distribution function for µ = 2.0 and σ = 2.5

Expectations
The expectation E { X } of a random variable X is defined as the average value µX of the variable over all
possible values. It is computed by taking the sum of all possible values of X = x multiplied by its
corresponding probability. In the case of a discrete random variable the expectation is given by
E{X} = µX = ∑_{k=1}^{N} xk P(xk)     (82)

Equation (82) is a general expression from which we can obtain the usual expression for the arithmetic mean
µ = (1/N) ∑_{k=1}^{N} xk     (83)


If there are N possible values xk of the random variable X, each having equal probability P(xk) = 1/N (which is a constant), then the expectation computed from (82) is identical to the arithmetic mean of the N values of xk from (83).
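The equivalence can be checked with the two-dice example from the previous section. A Python sketch using exact fractions:

```python
from fractions import Fraction

# E{X} via (82): sum of x * P(X = x), using the density from Table A1
mean_via_expectation = sum(
    x * Fraction(6 - abs(x - 7), 36) for x in range(2, 13))

# the arithmetic mean (83) over the 36 equally likely outcomes,
# each with probability 1/36
outcomes = [d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7)]
mean_via_average = Fraction(sum(outcomes), len(outcomes))
```

Both routes give exactly 7, the expected total of a pair of dice.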

In the case of a continuous random variable the expectation is given by


E{X} = µX = ∫_{−∞}^{+∞} x fX(x) dx     (84)

This relationship may be extended to a more general form if we consider the expectation of a function g ( X )
of a random variable X whose probability density function is fX ( x ) . In this case

E{g(X)} = ∫_{−∞}^{+∞} g(x) fX(x) dx     (85)

Extending (85) to the case of two random variables X and Y


E{g(X,Y)} = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} g(x, y) fXY(x, y) dx dy

Similarly for n random variables


E{g(X1, X2, …, Xn)} = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} g(x1, x2, …, xn) fX(x1, x2, …, xn) dx1 dx2 … dxn     (86)

Expressing (86) in matrix notation gives a general form of the expected value of a multivariate function
g ( X ) as

E{g(X)} = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} g(x) fX(x) dx     (87)

where fX ( x ) is the multivariate probability density function.

There are some rules that are useful in calculating expectations. They are given here without proof but can
be found in many statistical texts. With a and b as constants and X and Y as random variables

E {a } = a

E {aX } = a E { X }

E {aX + b } = a E { X } + b

E { g ( X ) ± h ( X )} = E { g ( X )} ± E {h ( X )}

E { g ( X ,Y ) ± h ( X ,Y ) } = E { g ( X ,Y ) } ± E { h ( X ,Y ) }


Special Mathematical Expectations


The mean of a random variable
µX = E{X} = ∫_{−∞}^{+∞} x fX(x) dx     (88)

The mean vector mX of a multivariate distribution is

mX = [µX1, µX2, µX3, …]^T = [E(X1), E(X2), E(X3), …]^T = E{X}     (89)

mX can be taken as representing the mean of a multivariate probability density function.

The variance of a random variable


σX² = E{(X − µX)²} = ∫_{−∞}^{+∞} (x − µX)² fX(x) dx     (90)

The covariance between two random variables X and Y is


σXY = E{(X − µX)(Y − µY)} = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} (x − µX)(y − µY) fXY(x, y) dx dy     (91)

Equation (91) can be expanded to give

σXY = E { ( X − µX )(Y − µY ) }
= E { XY − X µY − Y µX + µX µY }
= E { XY } − E { X µY } − E {Y µX } + E { µX µY }
= E { XY } − µY E { X } − µX E {Y } + µX µY
= E { XY } − µY µX − µX µY + µX µY
= E { XY } − µX µY

If the random variables X and Y are independent, the expectation of the product is equal to the product of
the expectations, i.e., E { XY } = E { X } E {Y } . Since the expected values of X and Y are the means µX
and µY then E { XY } = µX µY if X and Y are independent. Substituting this result into the expansion
above shows that the covariance σXY is zero if X and Y are independent.
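The two-dice example illustrates this: with X the first die and Y the second, the dice are independent and the covariance E{XY} − µXµY is exactly zero. A Python sketch using exact fractions:

```python
from fractions import Fraction
from itertools import product

# X and Y: the faces of two independent dice, all 36 pairs equally likely
P = Fraction(1, 36)
pairs = list(product(range(1, 7), repeat=2))

E_X = sum(x * P for x, _ in pairs)
E_Y = sum(y * P for _, y in pairs)
E_XY = sum(x * y * P for x, y in pairs)
covariance = E_XY - E_X * E_Y
```

Here E{X} = E{Y} = 7/2, E{XY} = 49/4 = E{X}E{Y}, and the covariance is 0.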

For a multivariate function, the variances and covariances of the random variables X are given by the
matrix equation

    V_XX = E{[X − m_X][X − m_X]^T}        (92)

V_XX is a symmetric matrix known as the variance-covariance matrix and its general form can be seen when
(92) is expanded

    V_XX = E{[X1 − µ_X1  X2 − µ_X2  ⋯  Xn − µ_Xn]^T [X1 − µ_X1  X2 − µ_X2  ⋯  Xn − µ_Xn]}


giving

           [ σ²_X1    σ_X1X2   ⋯   σ_X1Xn ]
    V_XX = [ σ_X2X1   σ²_X2    ⋯   σ_X2Xn ]        (93)
           [   ⋮        ⋮       ⋱     ⋮   ]
           [ σ_XnX1   σ_XnX2   ⋯   σ²_Xn  ]

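The structure of (93) — symmetric, with variances on the diagonal — can be seen in a sample estimate. A NumPy sketch (ours, not from the text; np.cov treats each row of its argument as one variable):

```python
import numpy as np

rng = np.random.default_rng(3)
# build three correlated random variables X = A Z from independent unit-variance Z,
# so the true variance-covariance matrix is A A^T
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
Z = rng.standard_normal((3, 100_000))
X = A @ Z

Vxx = np.cov(X)   # sample estimate of V_XX in (93)
```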
Law of Propagation of Variances for Linear Functions


Consider two vectors of random variables x = [X1  X2  ⋯  Xn]^T and y = [Y1  Y2  ⋯  Yn]^T that are
linearly related by the matrix equation

    y = Ax + b        (94)
where A is a coefficient matrix and b is a vector of constants. Then, using the rules for expectations
developed above, we may write an expression for the mean m_y using (89)

    m_y = E{y}
        = E{Ax + b}
        = E{Ax} + E{b}
        = A E{x} + b
        = A m_x + b

Using (92), the variance-covariance matrix V_yy is given by

    V_yy = E{(y − m_y)(y − m_y)^T}
         = E{(Ax + b − A m_x − b)(Ax + b − A m_x − b)^T}
         = E{(Ax − A m_x)(Ax − A m_x)^T}
         = E{A(x − m_x)(A(x − m_x))^T}
         = E{A(x − m_x)(x − m_x)^T A^T}
         = A E{(x − m_x)(x − m_x)^T} A^T
         = A V_xx A^T

or
If y = Ax + b and y and x are random variables linearly related then

    V_yy = A V_xx A^T        (95)

Equation (95) is known as the Law of Propagation of Variances.
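Equation (95) can be checked against the sample covariance of simulated data. A NumPy sketch (our illustration, with an arbitrarily chosen A, b and V_xx):

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [0.5,  3.0]])
b = np.array([1.0, -2.0])
Vxx = np.array([[1.0, 0.3],
                [0.3, 0.5]])

# propagated variance-covariance matrix, equation (95)
Vyy = A @ Vxx @ A.T

# Monte Carlo check: simulate x ~ N(0, Vxx), form y = A x + b by (94), estimate cov(y)
rng = np.random.default_rng(4)
x = rng.multivariate_normal([0.0, 0.0], Vxx, size=200_000)
y = x @ A.T + b
Vyy_mc = np.cov(y.T)
```

As (95) shows, the constant vector b shifts the mean of y but has no effect on the variances.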


Law of Propagation of Variances for Non-Linear Functions


In many practical applications of variance propagation the random variables in x and y are nonlinearly
related, i.e.,

    y = f(x)        (96)

In such cases, we can expand the function on the right-hand-side of (96) using Taylor's theorem.
For a non-linear function of a single variable Taylor's theorem may be expressed in the following form

    f(x) = f(a) + (df/dx)|_a (x − a) + (d²f/dx²)|_a (x − a)²/2! + (d³f/dx³)|_a (x − a)³/3! + ⋯
           + (d^(n−1)f/dx^(n−1))|_a (x − a)^(n−1)/(n − 1)! + R_n        (97)

where R_n is the remainder after n terms, lim(n→∞) R_n = 0 for the expansion of f(x) about x = a, and
(df/dx)|_a, (d²f/dx²)|_a, etc. are derivatives of the function f(x) evaluated at x = a.
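The behaviour of the truncated series can be seen numerically (our own check, not part of the text): for f(x) = e^x about a = 0, the first-order approximation f(x) ≈ 1 + x leaves a remainder of order (x − a)², so reducing the step by a factor of 10 reduces the error by roughly a factor of 100.

```python
import math

def taylor1_exp(x):
    """First-order Taylor approximation of exp(x) about a = 0: f(0) + f'(0)*x."""
    return 1.0 + x

err_big = abs(math.exp(0.1) - taylor1_exp(0.1))      # truncation error at x = 0.1
err_small = abs(math.exp(0.01) - taylor1_exp(0.01))  # truncation error at x = 0.01
ratio = err_big / err_small                          # ~100 for a quadratic remainder
```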

For a non-linear function of two random variables, say φ = f(x, y), the Taylor series expansion of the
function φ about x = a and y = b is

    φ = f(a, b) + (∂f/∂x)|_(a,b) (x − a) + (∂f/∂y)|_(a,b) (y − b)
        + (1/2!)[(∂²f/∂x²)|_(a,b) (x − a)² + (∂²f/∂y²)|_(a,b) (y − b)²
        + 2 (∂²f/∂x∂y)|_(a,b) (x − a)(y − b)] + ⋯        (98)

where f(a, b) is the function φ evaluated at x = a and y = b, and (∂f/∂x)|_(a,b), (∂f/∂y)|_(a,b),
(∂²f/∂x²)|_(a,b), etc. are partial derivatives of the function φ evaluated at x = a and y = b.

Extending to n random variables, we may write a Taylor series approximation of the function f(x) as a
matrix equation

    f(x) = f(x0) + (∂f/∂x)|_x0 (x − x0) + higher order terms        (99)

where f(x0) is the function evaluated at the approximate values x0 and (∂f/∂x)|_x0 are the partial
derivatives evaluated at the approximations x0.

Replacing f(x) in (96) by its Taylor series approximation, ignoring higher order terms, gives

    y = f(x) = f(x0) + (∂f/∂x)|_x0 (x − x0)        (100)

Then, using the rules for expectations


    m_y = E{y}
        = E{f(x0) + (∂f/∂x)|_x0 (x − x0)}
        = E{f(x0)} + E{(∂f/∂x)|_x0 (x − x0)}
        = f(x0) + (∂f/∂x)|_x0 E{x − x0}
        = f(x0) + (∂f/∂x)|_x0 (E{x} − E{x0})
        = f(x0) + (∂f/∂x)|_x0 (m_x − x0)

and

    y − m_y = [f(x0) + (∂f/∂x)|_x0 (x − x0)] − [f(x0) + (∂f/∂x)|_x0 (m_x − x0)]
            = (∂f/∂x)|_x0 (x − m_x)        (101)
            = J_yx (x − m_x)

J_yx is the (m,n) Jacobian matrix of partial derivatives, noting that y and x are (m,1) and (n,1) vectors
respectively

           [ ∂y1/∂x1   ∂y1/∂x2   ⋯   ∂y1/∂xn ]
    J_yx = [ ∂y2/∂x1   ∂y2/∂x2   ⋯   ∂y2/∂xn ]        (102)
           [    ⋮          ⋮      ⋱      ⋮    ]
           [ ∂ym/∂x1   ∂ym/∂x2   ⋯   ∂ym/∂xn ]

Using (92), the variance-covariance matrix V_yy is given by

    V_yy = E{(y − m_y)(y − m_y)^T}
         = E{(J_yx (x − m_x))(J_yx (x − m_x))^T}
         = E{J_yx (x − m_x)(x − m_x)^T J_yx^T}
         = J_yx E{(x − m_x)(x − m_x)^T} J_yx^T
         = J_yx V_xx J_yx^T

Thus, in a similar manner to above, we may express the Law of Propagation of Variances for non-linear
functions of random variables as

If y = f ( x ) and y and x are random variables non-linearly related then

    V_yy = J_yx V_xx J_yx^T        (103)
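As a worked illustration (our own example, not from the text), consider the non-linear transformation from polar observations (r, θ) to plane coordinates x = r cos θ, y = r sin θ. A NumPy sketch applying (103), with assumed standard deviations:

```python
import numpy as np

r0, t0 = 100.0, np.radians(30.0)     # approximate values of the observations
sr = 0.01                            # std dev of the distance r (metres, assumed)
st = np.radians(10.0 / 3600.0)       # std dev of the direction theta (10 arc-seconds, assumed)
Vxx = np.diag([sr**2, st**2])        # independent observations: diagonal V_xx

# Jacobian (102) of (x, y) = (r cos t, r sin t), evaluated at (r0, t0)
J = np.array([[np.cos(t0), -r0 * np.sin(t0)],
              [np.sin(t0),  r0 * np.cos(t0)]])

# propagated variance-covariance matrix of the coordinates, equation (103)
Vyy = J @ Vxx @ J.T
```

Although r and θ are independent, the propagated coordinates x and y are correlated: the off-diagonal terms of V_yy are non-zero.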


The Special Law of Propagation of Variances


The Law of Propagation of Variances is often expressed as an algebraic equation. For example, if z is a
function of two random variables x and y, i.e., z = f(x, y), then the variance of z is

    σ_z² = (∂z/∂x)² σ_x² + (∂z/∂y)² σ_y² + 2 (∂z/∂x)(∂z/∂y) σ_xy        (104)
Equation (104) can be derived from the general matrix equation (103) in the following manner. Let
z = f(x, y) be written as y = f(x), where y = [z] is a (1,1) matrix and x = [x  y]^T is a (2,1) vector.
The variance-covariance matrix of the random vector x is

    V_xx = [ σ_x²   σ_xy ]
           [ σ_xy   σ_y² ]

the Jacobian is J_yx = [∂z/∂x   ∂z/∂y], and the variance-covariance matrix V_yy, which contains the
single element σ_z², is given by

    V_yy = [σ_z²] = [∂z/∂x   ∂z/∂y] [ σ_x²   σ_xy ] [ ∂z/∂x ]
                                    [ σ_xy   σ_y² ] [ ∂z/∂y ]

Expanding this equation gives (104).


In the case where the random variables in x are independent, i.e., their covariances are zero, we have the
Special Law of Propagation of Variances. For the case of z = f(x, y) where the random variables x and y
are independent, the Special Law of Propagation of Variances is written as

    If z = f(x, y) and x and y are independent random variables then

        σ_z² = (∂z/∂x)² σ_x² + (∂z/∂y)² σ_y²        (105)
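For example (a numerical sketch of ours), let z = x·y, the area of a rectangle whose sides are measured independently. Then ∂z/∂x = y and ∂z/∂y = x, and (105) gives σ_z² = y²σ_x² + x²σ_y² evaluated at the means, which a Monte Carlo simulation confirms:

```python
import numpy as np

mx, my = 4.0, 3.0       # mean side lengths (assumed values)
sx, sy = 0.02, 0.05     # standard deviations of the independent measurements

# Special Law of Propagation of Variances (105) applied to z = x*y
var_z = my**2 * sx**2 + mx**2 * sy**2

# Monte Carlo check
rng = np.random.default_rng(7)
x = rng.normal(mx, sx, 500_000)
y = rng.normal(my, sy, 500_000)
var_z_mc = np.var(x * y)
```

The first-order formula omits a small σ_x²σ_y² term, so the simulated variance is very slightly larger than var_z.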


APPENDIX B: MATLAB function logistic.m

http://www.cs.cmu.edu/~ggordon/IRLS-example/logistic.m

% function x = logistic(a, y, w, ridge, param)
%
% Logistic regression. Design matrix A, targets Y, optional instance
% weights W, optional ridge term RIDGE, optional parameters object PARAM.
%
% W is a vector with length equal to the number of training examples; RIDGE
% can be either a vector with length equal to the number of regressors, or
% a scalar (the latter being synonymous to a vector with all entries the
% same).
%
% PARAM has fields PARAM.MAXITER (an iteration limit), PARAM.VERBOSE
% (whether to print diagnostic information), PARAM.EPSILON (used to test
% convergence), and PARAM.MAXPRINT (how many regression coefficients to
% print if VERBOSE==1).
%
% Model is
%
% E(Y) = 1 ./ (1+exp(-A*X))
%
% Outputs are regression coefficients X.
%
% Copyright 2007 Geoffrey J. Gordon
%
% This program is free software: you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation, either version 3 of the License, or (at
% your option) any later version.
%
% This program is distributed in the hope that it will be useful, but
% WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% General Public License for more details.
%
% You should have received a copy of the GNU General Public License
% along with this program. If not, see <http://www.gnu.org/licenses/>.

function x = logistic(a, y, w, ridge, param)

% process parameters

[n, m] = size(a);

if ((nargin < 3) || (isempty(w)))
    w = ones(n, 1);
end

if ((nargin < 4) || (isempty(ridge)))
    ridge = 1e-5;
end

if (nargin < 5)
param = [];
end

if (length(ridge) == 1)
ridgemat = speye(m) * ridge;
elseif (length(ridge(:)) == m)
ridgemat = spdiags(ridge(:), 0, m, m);
else
error('ridge weight vector should be length 1 or %d', m);

end

if (~isfield(param, 'maxiter'))
param.maxiter = 200;
end

if (~isfield(param, 'verbose'))
param.verbose = 0;
end

if (~isfield(param, 'epsilon'))
param.epsilon = 1e-10;
end

if (~isfield(param, 'maxprint'))
param.maxprint = 5;
end

% do the regression

x = zeros(m,1);
oldexpy = -ones(size(y));
for iter = 1:param.maxiter

adjy = a * x;
expy = 1 ./ (1 + exp(-adjy));
deriv = expy .* (1-expy);
wadjy = w .* (deriv .* adjy + (y-expy));
weights = spdiags(deriv .* w, 0, n, n);

x = inv(a' * weights * a + ridgemat) * a' * wadjy;

if (param.verbose)
len = min(param.maxprint, length(x));
fprintf('%3d: [',iter);
fprintf(' %g', x(1:len));
if (len < length(x))
fprintf(' ... ');
end
fprintf(' ]\n');
end

if (sum(abs(expy-oldexpy)) < n*param.epsilon)
if (param.verbose)
fprintf('Converged.\n');
end
return;
end

oldexpy = expy;

end

warning('logistic:notconverged', 'Failed to converge');


Usage Example
http://www.cs.cmu.edu/~ggordon/IRLS-example/logistic-ex.txt

>> a = randn(500,5);
>> x = 2*randn(5,1);
>> y = (rand(500,1) < 1./(1+exp(-a*x)));
>> xhat = logistic(a, y, [], [], struct('verbose', 1))
1: [ -0.842889 -0.959492 0.843404 0.198022 0.199493 ]
2: [ -1.55055 -1.75901 1.57622 0.360507 0.398254 ]
3: [ -2.35678 -2.71685 2.4373 0.545032 0.62605 ]
4: [ -3.20879 -3.74533 3.35828 0.740927 0.854976 ]
5: [ -3.86696 -4.54162 4.0753 0.899695 1.02575 ]
6: [ -4.12265 -4.85136 4.35625 0.964904 1.09126 ]
7: [ -4.14969 -4.88416 4.38617 0.972078 1.09818 ]
8: [ -4.14995 -4.88447 4.38646 0.972149 1.09825 ]
9: [ -4.14995 -4.88447 4.38646 0.972149 1.09825 ]
10: [ -4.14995 -4.88447 4.38646 0.972149 1.09825 ]
11: [ -4.14995 -4.88447 4.38646 0.972149 1.09825 ]
Converged.

xhat =

-4.1499
-4.8845
4.3865
0.9721
1.0982

>> x

x =

-3.9412
-4.0619
3.6705
1.1123
0.9645
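For readers without MATLAB, the same iteratively reweighted least squares scheme can be sketched in NumPy (our translation, not Gordon's code; it follows the variable names in logistic.m, drops the optional instance weights, and uses a linear solve rather than the explicit inverse):

```python
import numpy as np

def logistic_irls(A, y, ridge=1e-5, maxiter=200, epsilon=1e-10):
    """IRLS for the logistic model E(y) = 1 / (1 + exp(-A @ x))."""
    n, m = A.shape
    x = np.zeros(m)
    old_expy = -np.ones(n)
    for _ in range(maxiter):
        adjy = A @ x
        expy = 1.0 / (1.0 + np.exp(-adjy))
        deriv = expy * (1.0 - expy)            # IRLS weights
        wadjy = deriv * adjy + (y - expy)      # weighted working response
        H = A.T @ (deriv[:, None] * A) + ridge * np.eye(m)
        x = np.linalg.solve(H, A.T @ wadjy)    # normal equations with ridge term
        if np.abs(expy - old_expy).sum() < n * epsilon:
            break                              # converged
        old_expy = expy
    return x

# synthetic data in the spirit of the transcript above
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 5))
x_true = 2.0 * rng.standard_normal(5)
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-A @ x_true))).astype(float)
x_hat = logistic_irls(A, y)
```

On synthetic data like this, the recovered coefficients track the generating values to within sampling error.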
