Foundations of Descriptive and Inferential Statistics (Version 4)
Lecture notes for a quantitative–methodological module at the Bachelor degree (B.Sc.) level
parcIT GmbH
Erftstraße 15
50672 Köln
Germany
E–Mail: [email protected]
Abstract

Contents
Introductory remarks
1 Statistical variables
1.1 Scale levels of measurement
1.2 Raw data sets and data matrices
Outlook
Bibliography
Introductory remarks
Following a standard pedagogical concept, these lecture notes are split into three main parts: Part I,
comprising Chapters 1 to 5, covers the basic considerations of Descriptive Statistics; Part II, which
consists of Chapters 6 to 8, introduces the foundations of Probability Theory. Finally, the mate-
rial of Part III, provided in Chapters 9 to 13, first reviews a widespread method for operationalising
latent statistical variables, and then introduces a number of standard uni- and bivariate analytical
tools of Inferential Statistics within the frequentist framework that prove valuable in applica-
tions. As such, the contents of Part III are the most important ones for quantitative–empirical
research work. Useful mathematical tools and further material have been gathered in appendices.
Recommended introductory textbooks, which may be used for study in parallel to these lecture
notes, are Levin et al (2010) [61], Hatzinger and Nagel (2013) [37], Weinberg and Abramowitz
(2008) [115], Wewel (2014) [116], Toutenburg (2005) [108], or Duller (2007) [16].
These lecture notes do not include explicit exercises on the topics to be discussed; exercises are
reserved for the lectures given throughout term time.
The present lecture notes are designed to be dynamical in character. On the one hand, this
means that they will be updated on a regular basis. On the other hand, the
*.pdf version of the notes contains interactive features such as fully hyperlinked refer-
ences to original publications at the websites doi.org and jstor.org, as well as
many active links to biographical information on scientists that have been influential in
the historical development of Probability Theory and Statistics, hosted by the websites
The MacTutor History of Mathematics archive (www-history.mcs.st-and.ac.uk) and
en.wikipedia.org.
Throughout these lecture notes references have been provided to respective descriptive and in-
ferential statistical functions and routines that are available in the excellent and widespread sta-
tistical software package R, on a standard graphic display calculator (GDC), and in the statis-
tical software packages EXCEL, OpenOffice and SPSS (Statistical Package for the Social Sci-
ences). R and its exhaustive documentation are distributed by the R Core Team (2019) [85] via
the website cran.r-project.org. R, too, has been employed for generating all the fig-
ures contained in these lecture notes. Useful and easily accessible textbooks on the application
of R for statistical data analysis are, e.g., Dalgaard (2008) [15], or Hatzinger et al (2014) [38].
Further helpful information and assistance is available from the website www.r-tutor.com.
For active statistical data analysis with R, we strongly recommend the use of the convenient
custom-made work environment R Studio, provided free of charge at www.rstudio.com. An-
other user-friendly statistical software package is GNU PSPP, which is available as free software from
www.gnu.org/software/pspp/.
A few examples from the inbuilt R data sets package are referred to in these lecture notes in
the context of the visualisation of distributional features of statistical data. Further information on
these data sets can be obtained by typing library(help = "datasets") at the R prompt.
Lastly, we hope the reader will discover something useful and/or enjoyable for her/himself when
working through these lecture notes. Constructive criticism is always welcome.
Acknowledgments: I am grateful to Kai Holschuh, Eva Kunz and Diane Wilcox for valuable com-
ments on an earlier draft of these lecture notes, to Isabel Passin for being a critical sparring part-
ner in evaluating pedagogical considerations concerning co-created accompanying lectures, and to
Michael Rüger for compiling an initial list of online survey tools for the Social Sciences.
Chapter 1
Statistical variables
More specifically, the general intention of empirical scientific activities is to modify or strengthen
the theoretical foundations of an empirical scientific discipline by means of observational and/or
experimental testing of sets of hypotheses; see Ch. 11. This is generally achieved by employing
the quantitative–empirical techniques that have been developed in Statistics, in particular in the
course of the 20th Century. At the heart of these techniques is the concept of a statistical vari-
able X as an entity which represents a single common aspect of the system of objects selected for
analysis — the target population Ω of a statistical investigation. In the ideal case, a variable
entertains a one-to-one correspondence with an observable, and thus is directly amenable to mea-
surement. In the Social Sciences, Humanities, and Economics, however, one needs to carefully
distinguish between manifest variables corresponding to observables on the one-hand side, and
latent variables representing in general unobservable “social constructs” on the other. It is this
latter kind of variables which is commonplace in the fields mentioned. Hence, it becomes an un-
avoidable task to thoroughly address the issue of a reliable, valid and objective operationalisation
of any given latent variable one has identified as providing essential information on the objects
under investigation.¹ A standard approach to dealing with the important matter of rendering latent
variables measurable is reviewed in Ch. 9.
¹ A particularly sceptical view on the ability of making reliable predictions in certain empirical scientific disciplines is voiced in Taleb (2007) [105, pp 135–211].
In Statistics, it has proven useful to classify variables on the basis of their intrinsic information
content into one of three hierarchically ordered categories, referred to as the scale levels of mea-
surement; cf. Stevens (1946) [98]. We provide the definition of these scale levels next.
– Ratio scale: X has an absolute zero point and otherwise only non-negative values;
analysis of both differences ai − aj and ratios ai /aj is meaningful.
Examples: body height, monthly net income, . . . .
– Interval scale: X has no absolute zero point; only differences ai − aj are meaningful.
Examples: year of birth, temperature in centigrades, Likert scales (cf. Ch. 9), . . . .
Note that the values obtained for a metrically scaled variable (e.g. in a survey) always
constitute definite numerical multiples of a specific unit of measurement.
– Ordinal scale: the values of X can be arranged in a natural rank order, but neither differences nor ratios are meaningful.
Examples: Likert item rating scales (cf. Ch. 9), grading of commodities, . . . .
sampling unit i    X          Y           · · ·    Z
1                  x1 = a5    y1 = b9     · · ·    z1 = c3
2                  x2 = a2    y2 = b12    · · ·    z2 = c8
⋮                  ⋮          ⋮                     ⋮
n                  xn = a8    yn = b9     · · ·    zn = c15
To systematically record the information obtained from measuring the values of a portfolio of
statistical variables in a statistical sample SΩ , in the (n × m) data matrix X every one of the
n sampling units investigated is assigned a particular row, while every one of the m statistical
variables measured is assigned a particular column. In the following, Xij denotes the data entry in
the ith row (i = 1, . . . , n) and the jth column (j = 1, . . . , m) of X. To clarify standard terminology
used in Statistics, a raw data set is referred to as univariate when m = 1, as bivariate when m = 2,
and as multivariate when m ≥ 3.
According to Hair et al (2010) [36, pp 102, 175], a rough rule of thumb concerning an adequate
sample size |SΩ | = n for multivariate data analysis is given by
n ≥ 10m . (1.1)
Considerations of statistical power of particular methods of data analysis lead to more refined
recommendations; cf. Sec. 11.1.
“Big data” scenarios apply when n, m ≫ 1 (i.e., when n is typically on the order of 10^4, or very
much larger still, and m is on the order of 10^2, or larger).
In general, an (n × m) data matrix X is the starting point for the application of a statistical soft-
ware package such as R, SPSS, GNU PSPP, or other for the purpose of systematic data analysis.
When the sample comprises exclusively metrically scaled data, the data matrix is real-valued,
i.e.,
X ∈ Rn×m ; (1.2)
cf. Ref. [18, Sec. 2.1]. Then the information contained in X uniquely positions a collection of
n sampling units according to m quantitative characteristic variable features in (a subset of) an
m-dimensional Euclidian space Rm .
R: datMat <- data.frame(x = c(x1,...,xn), y = c(y1,...,yn), ..., z = c(z1,...,zn))
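As a minimal illustrative sketch (with hypothetical measured values for two metrically scaled variables), such a data matrix could be assembled and inspected as follows:
R:
datMat <- data.frame( x = c(1.72, 1.65, 1.81, 1.77) , y = c(2300, 1950, 2700, 2450) )
str(datMat)    # structure: n = 4 sampling units, m = 2 statistical variables
dim(datMat)    # returns the dimensions n and m of the data matrix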
Chapter 2
Univariate frequency distributions
The first task at hand in unravelling the intrinsic structure potentially residing in a given raw data
set {xi }i=1,...,n for some statistical variable X corresponds to Cinderella’s task of separating the
“good peas” from the “bad peas,” and collecting them in respective bowls (or bins). This is to
say, the first question to be answered requires determination of the frequency with which a value
(or attribute, or category) aj in the spectrum of possible values of X was observed in a statistical
sample SΩ of size n.
The k value pairs (aj , oj )j=1,...,k resp. (Kj , oj )j=1,...,k represent the univariate distribution of ab-
solute frequencies, the k value pairs (aj , hj )j=1,...,k resp. (Kj , hj )j=1,...,k represent the univariate
distribution of relative frequencies of the aj resp. Kj in SΩ .
R: table(variable), prop.table(table(variable))
EXCEL, OpenOffice: FREQUENCY (dt.: HÄUFIGKEIT)
SPSS: Analyze → Descriptive Statistics → Frequencies . . .
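As a brief sketch (with a hypothetical nominally scaled variable), the univariate distributions of absolute and relative frequencies can be obtained in R as follows:
R:
grade <- c("A", "B", "B", "C", "A", "B")    # hypothetical raw data set of size n = 6
table(grade)                                # absolute frequencies o_j
prop.table( table(grade) )                  # relative frequencies h_j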
Typical graphical representations of univariate relative frequency distributions, regularly em-
ployed in visualising results of descriptive statistical data analyses, are the histogram (cf. Fig. 2.1),
the bar chart (cf. Fig. 2.2), and the pie chart (cf. Fig. 2.3).
Figure 2.1: Example of a histogram, representing the relative frequency density for the variable
“magnitude” in the R data set “quakes.”
R:
data("quakes")
?quakes
hist( quakes$mag , breaks = 20 , freq = FALSE )
¹ The appearance of graphs generated in R can be prettified by employing the advanced graphical package ggplot2 by Wickham (2016) [117].
Figure 2.2: Example of a bar chart, representing the relative frequency distribution for the variable
“age group” in the R data set “esoph.”
R:
data("esoph")
?esoph
barplot( prop.table( table( esoph$agegp ) ) )
It is standard practice in Statistics to compile from the univariate relative frequency distribution
(aj , hj )j=1,...,k resp. (Kj , hj )j=1,...,k of data for some ordinally or metrically scaled one-dimensional
statistical variable X the associated empirical cumulative distribution function. Hereby it is neces-
sary to distinguish the case of data for a variable with a discrete spectrum of values from the case
of data for a variable with a continuous spectrum of values. We will discuss this issue next.
Figure 2.3: Example of a pie chart, representing the relative frequency distribution for the variable
“education” in the R data set “infert.”
R:
data("infert")
?infert
pie( table( infert$education ) )
Def.: For data for a variable X with a discrete spectrum of values aj , the function
\[ F_n(x) := \sum_{j \,|\, a_j \leq x} h_n(a_j) \,, \quad x \in \mathbb{R} \,, \]
defines the empirical cumulative distribution function for X. The value of Fn at x ∈ R represents the
cumulative relative frequencies of all aj which are less or equal to x; cf. Fig. 2.4. Fn (x) has the
following properties:
• its domain is D(Fn ) = R, and its range is W (Fn ) = [0, 1]; hence, Fn is bounded from above
and from below,
• it is constant on all half-open intervals [aj , aj+1 ), but exhibits jump discontinuities of size
hn (aj+1 ) at all aj+1 , and,
R: ecdf(variable), plot(ecdf(variable))
Computational rules for Fn (x)
1. h(x ≤ d) = Fn (d)
Figure 2.4: Example of an empirical cumulative distribution function, here for the variable “mag-
nitude” in the R data set “quakes.”
R:
data("quakes")
?quakes
plot( ecdf( quakes$mag ) )
wherein c denotes an arbitrary lower bound, and d denotes an arbitrary upper bound, on the
argument x of Fn (x).
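These computational rules can be evaluated directly with the step function returned by ecdf(); a brief sketch, reusing the R data set “quakes” from Fig. 2.1:
R:
data("quakes")
Fn <- ecdf( quakes$mag )
Fn(5.0)              # h(x <= 5.0)
Fn(5.5) - Fn(5.0)    # h(5.0 < x <= 5.5)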
defines the empirical cumulative distribution function for X. F̃n (x) has the following proper-
ties:
• its domain is D(F̃n ) = R, and its range is W (F̃n ) = [0, 1]; hence, F̃n is bounded from above
and from below,
R: ecdf(variable), plot(ecdf(variable))
Computational rules for F̃n (x)
3. h(c < x < d) = h(c ≤ x < d) = h(c < x ≤ d) = h(c ≤ x ≤ d) = F̃n (d) − F̃n (c),
wherein c denotes an arbitrary lower bound, and d denotes an arbitrary upper bound, on the
argument x of F̃n (x).
Our next steps comprise the introduction of a set of scale-level-dependent standard descriptive
measures which characterise specific properties of univariate and bivariate relative frequency dis-
tributions of statistical variables X resp. (X, Y ).
Chapter 3
Measures for univariate distributions
There are four families of scale-level-dependent standard measures one employs in Statistics to
describe characteristic properties of univariate relative frequency distributions. On a technical
level, the determination of the values of these measures from available data does not go beyond
application of the four fundamental arithmetical operations: addition, subtraction, multiplication
and division. We will introduce these measures in turn. In the following we suppose given from a
survey for some one-dimensional statistical variable X either (i) a raw data set {xi }i=1,...,n of n
measured values, or (ii) a relative frequency distribution (aj , hj )j=1,...,k resp. (Kj , hj )j=1,...,k .
3.1.1 Mode
The mode xmod (nom, ord, metr) of the relative frequency distribution for any one-dimensional
variable X is that value aj in X’s spectrum which was observed with the highest relative frequency
in a statistical sample SΩ . Note that the mode does not necessarily take a unique value.
Def.: hn (xmod ) ≥ hn (aj ) for all j = 1, . . . , k.
EXCEL, OpenOffice: MODE.SNGL (dt.: MODUS.EINF, MODALWERT)
SPSS: Analyze → Descriptive Statistics → Frequencies . . . → Statistics . . . : Mode
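Base R does not provide a ready-made function for the mode; a minimal sketch (hypothetical data) obtains it from the frequency table:
R:
x <- c("blue", "red", "red", "green", "red", "blue")   # hypothetical raw data set
tab <- table(x)
names(tab)[ tab == max(tab) ]    # value(s) observed with the highest frequency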
3.1.2 Median
To determine the median x̃0.5 (or Q2 ) (ord, metr) of the relative frequency distribution for an
ordinally or metrically scaled one-dimensional variable X, it is necessary to first arrange the n
observed values {xi }i=1,...,n in their ascending natural rank order, i.e., x(1) ≤ x(2) ≤ . . . ≤ x(n) .
Def.: For the ascendingly ordered n observed values {xi }i=1,...,n , at most 50% have a rank lower
or equal to resp. are less or equal to the median value x̃0.5 , and at most 50% have a rank higher or
equal to resp. are greater or equal to the median value x̃0.5 .
(i) Discrete data: Fn (x̃0.5 ) ≥ 0.5, with
\[ \tilde{x}_{0.5} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if $n$ is odd} \\[4pt] \frac{1}{2}\left[ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right] & \text{if $n$ is even} \end{cases} \,. \qquad (3.1) \]
(ii) Binned data: with ui the lower boundary, bi the width and hi the relative frequency of the class
interval Ki containing the median,
\[ \tilde{x}_{0.5} = u_i + \frac{b_i}{h_i}\left( 0.5 - \sum_{j=1}^{i-1} h_j \right) \,. \qquad (3.2) \]
Alternatively, the median of a statistical sample SΩ for a continuous variable X with binned
data (Kj , hj )j=1,...,k can be obtained from the associated empirical cumulative distribution
function by solving the condition F̃n (x̃0.5 ) = 0.5 for x̃0.5 ; cf. Eq. (2.4).
Remark: Note that the value of the median of a univariate relative frequency distribution is rea-
sonably insensitive to so-called outliers in a statistical sample.
R: median(variable)
EXCEL, OpenOffice: MEDIAN (dt.: MEDIAN)
SPSS: Analyze → Descriptive Statistics → Frequencies . . . → Statistics . . . : Median
3.1.3 α–Quantile
A generalisation of the median is the concept of the α–quantile x̃α (ord, metr) of the relative
frequency distribution for an ordinally or metrically scaled one-dimensional variable X. Again,
it is necessary to first arrange the n observed values {xi }i=1,...,n in their ascending natural rank
order, i.e., x(1) ≤ x(2) ≤ . . . ≤ x(n) .
Def.: For the ascendingly ordered n observed values {xi }i=1,...,n , and for given α with 0 < α < 1,
at most α×100% have a rank lower or equal to resp. are less or equal to the α–quantile x̃α , and at
most (1 − α)×100% have a rank higher or equal to resp. are greater or equal to the α–quantile x̃α .
For binned data, in analogy to Eq. (3.2),
\[ \tilde{x}_{\alpha} = u_i + \frac{b_i}{h_i}\left( \alpha - \sum_{j=1}^{i-1} h_j \right) \,. \qquad (3.4) \]
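In R, empirical α–quantiles are available via the quantile() function, which implements several computational conventions through its type argument; a brief sketch with hypothetical data:
R:
x <- c(2, 4, 4, 5, 7, 9, 11, 12, 13, 20)       # hypothetical raw data set
median(x)                                      # the 0.5-quantile
quantile( x , probs = c(0.25, 0.5, 0.75) )     # quartiles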
Remarks: (i) The value of the sample mean is very sensitive to outliers.
(ii) For binned data one selects the midpoint of each class interval Ki to represent the aj (provided
the raw data set is no longer accessible).
R: mean(variable)
EXCEL, OpenOffice: AVERAGE (dt.: MITTELWERT)
SPSS: Analyze → Descriptive Statistics → Frequencies . . . → Statistics . . . : Mean
\[ \bar{x}_w := w_1 x_1 + \ldots + w_n x_n =: \sum_{i=1}^{n} w_i x_i \; ; \qquad (3.8) \]
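A weighted sample mean according to Eq. (3.8) can be evaluated in R with weighted.mean(); a brief sketch with hypothetical values and weights summing to one:
R:
x <- c(10, 20, 30)          # hypothetical measured values
w <- c(0.5, 0.3, 0.2)       # hypothetical weights w_i with sum(w) = 1
weighted.mean( x , w )      # weighted sample mean
sum( w * x )                # Eq. (3.8) evaluated directly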
A very convenient graphical method for transparently displaying distributional features of metri-
cally scaled data relating to a five number summary, also making explicit the interquartile range,
outliers and extreme values, is provided by a box plot; see, e.g., Tukey (1977) [110]. An example
of a single box plot is depicted in Fig. 3.1, of parallel box plots in Fig. 3.2.
R: boxplot(variable), boxplot(variable ~ group variable)
Figure 3.1: Example of a box plot, representing elements of the five number summary for the
distribution of measured values for the variable “magnitude” in the R data set “quakes.” The open
circles indicate the positions of outliers.
R:
data("quakes")
?quakes
boxplot( quakes$mag )
there are only n − 1 degrees of freedom involved in this measure. The sample variance is thus
defined by:
(i) From a raw data set:
\[ s^2 := \frac{1}{n-1}\left[ (x_1 - \bar{x})^2 + \ldots + (x_n - \bar{x})^2 \right] =: \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \; ; \qquad (3.12) \]
Figure 3.2: Example of parallel box plots, comparing elements of the five number summary for
the distribution of measured values for the variable “weight” between categories of the variable
“group” in the R data set “PlantGrowth.” The open circle indicates the position of an outlier.
R:
data("PlantGrowth")
?PlantGrowth
boxplot( PlantGrowth$weight ~ PlantGrowth$group )
alternatively:
alternatively:
\[ s^2 = \frac{n}{n-1}\left[ a_1^2\, h_n(a_1) + \ldots + a_k^2\, h_n(a_k) - \bar{x}^2 \right] = \frac{n}{n-1}\left[ \sum_{j=1}^{k} a_j^2\, h_n(a_j) - \bar{x}^2 \right] . \qquad (3.15) \]
Remarks: (i) We point out that the alternative formulae for a sample variance provided here prove
computationally more efficient.
(ii) For binned data, when one selects the midpoint of each class interval Kj to represent the aj
(given the raw data set is no longer accessible), a correction of Eqs. (3.14) and (3.15) by an addi-
tional term (1/12)[n/(n − 1)] Σ_{j=1}^{k} b_j² h_j becomes necessary, assuming uniformly distributed data
within each of the class intervals Kj of width bj ; cf. Eq. (8.41).
R: var(variable)
EXCEL, OpenOffice: VAR.S (dt.: VAR.S, VARIANZ)
SPSS: Analyze → Descriptive Statistics → Frequencies . . . → Statistics . . . : Variance
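A quick sketch (hypothetical data) confirming that var() employs the n − 1 denominator of Eq. (3.12):
R:
x <- c(4.2, 5.1, 3.8, 6.0, 5.5)          # hypothetical raw data set
n <- length(x)
var(x)                                   # built-in sample variance
sum( (x - mean(x))^2 ) / (n - 1)         # Eq. (3.12) evaluated directly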
3.2.4 Sample standard deviation
For ease of handling dimensions associated with a metrically scaled one-dimensional variable X,
one defines the dimensionful sample standard deviation s (metr) simply as the positive square
root of the sample variance (3.12), i.e.,
\[ s := +\sqrt{s^2} \,, \qquad (3.16) \]
such that a measure for the spread of data results which shares the dimension of X and its sample
mean x̄.
R: sd(variable)
EXCEL, OpenOffice: STDEV.S (dt.: STABW.S, STABW)
SPSS: Analyze → Descriptive Statistics → Frequencies . . . → Statistics . . . : Std. deviation
3.2.6 Standardisation
Data for metrically scaled one-dimensional variables X is amenable to the process of standardis-
ation. By this is meant a linear affine transformation X → Z, which generates from a univariate
raw data set {xi }i=1,...,n of n measured values for a dimensionful variable X, with sample mean x̄
and sample standard deviation sX > 0, data for an equivalent dimensionless variable Z according
to
\[ x_i \mapsto z_i := \frac{x_i - \bar{x}}{s_X} \quad \text{for all} \quad i = 1, \ldots, n \,. \qquad (3.18) \]
For the resultant Z-data, referred to as the Z scores of the original metrical X-data, this has the
convenient practical consequences that (i) all one-dimensional metrical data is thus represented on
the same dimensionless measurement scale, and (ii) the corresponding sample mean and sample
standard deviation of the Z-data amount to
z̄ = 0 and sZ = 1 ,
respectively. Employing Z scores, specific values xi of the original metrical X-data will be ex-
pressed in terms of sample standard deviation units, i.e., by how many sample standard deviations
they fall on either side of the common sample mean. Essential information on characteristic distri-
butional features of one-dimensional metrical data will be preserved by the process of standardis-
ation.
R: scale(variable, center = TRUE, scale = TRUE)
EXCEL, OpenOffice: STANDARDIZE (dt.: STANDARDISIERUNG)
SPSS: Analyze → Descriptive Statistics → Descriptives . . . → Save standardized values as vari-
ables
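A brief sketch (hypothetical data) confirming that the resultant Z scores have sample mean 0 and sample standard deviation 1:
R:
x <- c(12, 15, 9, 21, 18)                                       # hypothetical raw data set
z <- as.numeric( scale( x , center = TRUE , scale = TRUE ) )    # Z scores, Eq. (3.18)
mean(z)                                                         # numerically equal to zero
sd(z)                                                           # equal to one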
3.3 Measures of relative distortion
The third family of measures characterising relative frequency distributions for univariate
data {xi }i=1,...,n for metrically scaled one-dimensional variables X, having specific sample mean x̄
and sample standard deviation sX , relate to the issue of the shape of a distribution. These measures
take a Gaußian normal distribution (cf. Sec. 8.6 below) as a reference case, with the values of
its two free parameters equal to the given x̄ and sX . With respect to this reference distribution, one
defines two kinds of dimensionless measures of relative distortion as described in the following
(cf., e.g., Joanes and Gill (1998) [45]).
3.3.1 Skewness
The skewness g1 (metr) is a dimensionless measure to quantify the degree of relative distortion
of a given frequency distribution in the horizontal direction. Its implementation in the software
package EXCEL employs the definition
\[ g_1 := \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_X} \right)^3 \quad \text{for} \quad n > 2 \,, \qquad (3.19) \]
wherein the observed values {xi }i=1,...,n enter in their standardised form according to Eq. (3.18).
Note that g1 = 0 for an exact Gaußian normal distribution.
R: skewness(variable, type = 2) (package: e1071, by Meyer et al (2019) [71])
EXCEL, OpenOffice: SKEW (dt.: SCHIEFE)
SPSS: Analyze → Descriptive Statistics → Frequencies . . . → Statistics . . . : Skewness
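The following sketch (hypothetical data) evaluates Eq. (3.19) directly in base R; the result should agree with skewness(variable, type = 2) from the e1071 package:
R:
x <- c(2, 3, 3, 4, 4, 4, 5, 9)               # hypothetical raw data set
n <- length(x)
z <- ( x - mean(x) ) / sd(x)                 # standardised values, Eq. (3.18)
n / ( (n - 1) * (n - 2) ) * sum( z^3 )       # skewness g1, Eq. (3.19)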
wherein the observed values {xi }i=1,...,n enter in their standardised form according to Eq. (3.18).
Note that g2 = 0 for an exact Gaußian normal distribution.
R: kurtosis(variable, type = 2) (package: e1071, by Meyer et al (2019) [71])
EXCEL, OpenOffice: KURT (dt.: KURT)
SPSS: Analyze → Descriptive Statistics → Frequencies . . . → Statistics . . . : Kurtosis
\[ S := \sum_{i=1}^{n} x_i = \sum_{j=1}^{k} a_j\, o_n(a_j) \overset{\text{Eq. (3.6)}}{=} n\bar{x} \,, \qquad (3.21) \]
where (aj , on (aj ))j=1,...,k is the absolute frequency distribution for the observed values (or cat-
egories) of X. Then the relative proportion that the value aj (or the category Kj ) takes in S
is
\[ \frac{a_j\, o_n(a_j)}{S} = \frac{a_j\, h_n(a_j)}{\bar{x}} \,. \qquad (3.22) \]
• Horizontal axis:
\[ k_i := \sum_{j=1}^{i} \frac{o_n(a_j)}{n} = \sum_{j=1}^{i} h_n(a_j) \qquad (i = 1, \ldots, k) \,, \qquad (3.23) \]
• Vertical axis:
\[ l_i := \sum_{j=1}^{i} \frac{a_j\, o_n(a_j)}{S} = \sum_{j=1}^{i} \frac{a_j\, h_n(a_j)}{\bar{x}} \qquad (i = 1, \ldots, k) \,. \qquad (3.24) \]
The initial point on a Lorenz curve is generally the coordinate system’s origin, (k0 , l0 ) = (0, 0),
the final point is (1, 1). As a reference facility to measure concentration in the distribution of X
in qualitative terms, one defines a null concentration curve as the bisecting line linking (0, 0)
to (1, 1). The Lorenz curve is interpreted as stating that a point on the curve with coordinates
(ki , li ) represents the fact that ki × 100% of the n statistical units take a share of li × 100% in
the total sum S for the ratio scaled one-dimensional variable X. Qualitatively, for given univariate
data {xi }i=1,...,n , the concentration in the distribution of X is the stronger, the larger is the dip of the
Lorenz curve relative to the null concentration curve. Note that in addition to the null concentration
curve, one can define as a second reference facility a maximum concentration curve such that
only the largest value ak (or category Kk ) in the spectrum of values of X takes the full share of
100% in the total sum S for {xi }i=1,...,n .
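A minimal sketch (hypothetical, ascendingly ordered ratio scaled data) computing and plotting the Lorenz curve coordinates (ki , li ) of Eqs. (3.23) and (3.24) in base R:
R:
x <- c(5, 10, 15, 20, 100)        # hypothetical data, sorted in ascending order
k <- seq_along(x) / length(x)     # cumulative shares of statistical units, Eq. (3.23)
l <- cumsum(x) / sum(x)           # cumulative shares in the total sum S, Eq. (3.24)
plot( c(0, k) , c(0, l) , type = "b" , xlab = "cumulative share of units" ,
      ylab = "cumulative share of total sum" )
abline( 0 , 1 , lty = 2 )         # null concentration curve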
3.4.2 Normalised Gini coefficient
The Italian statistician, demographer and sociologist Corrado Gini (1884–1965) devised a quanti-
tative measure for concentration in the distribution for a ratio scaled one-dimensional variable X;
cf. Gini (1921) [33]. The dimensionless normalised Gini coefficient G+ (metr: ratio) can be
interpreted geometrically as the ratio of areas
³ In September 2012 it was reported (implicitly) in the public press that the coordinates underlying the Lorenz
curve describing the distribution of private equity in Germany at the time were (0.00, 0.00), (0.50, 0.01), (0.90, 0.50),
and (1.00, 1.00); cf. Ref. [101]. Given that in this case n ≫ 1, these values amount to a Gini coefficient of
G+ = 0.64. The Oxfam Report on Wealth Inequality 2019 can be found at the URL (cited on May 31, 2019):
www.oxfam.org/en/research/public-good-or-private-wealth.
Chapter 4
Measures of association for bivariate distributions
Now we come to describe and characterise specific features of bivariate frequency distributions,
i.e., intrinsic structures of bivariate raw data sets {(xi , yi )}i=1,...,n obtained from samples SΩ for a
two-dimensional statistical variable (X, Y ) from some target population of study objects Ω. Let
us suppose that the spectrum of values resp. categories of X is a1 , a2 , . . . , ak , and the spectrum of
values resp. categories of Y is b1 , b2 , . . . , bl , where k, l ∈ N. Hence, for the bivariate joint dis-
tribution there exists a total of k × l possible combinations {(ai , bj )}i=1,...,k;j=1,...,l of values resp.
categories for (X, Y ). In the following, we will denote associated bivariate absolute (observed)
frequencies by oij := on (ai , bj ), and bivariate relative frequencies by hij := hn (ai , bj ).
The corresponding univariate marginal absolute frequencies of X and of Y are
\[ o_{i+} := o_{i1} + o_{i2} + \ldots + o_{ij} + \ldots + o_{il} =: \sum_{j=1}^{l} o_{ij} \qquad (4.3) \]
\[ o_{+j} := o_{1j} + o_{2j} + \ldots + o_{ij} + \ldots + o_{kj} =: \sum_{i=1}^{k} o_{ij} \,. \qquad (4.4) \]
hij   | b1    b2    · · ·  bj    · · ·  bl    | Σj
------+--------------------------------------+------
a1    | h11   h12   · · ·  h1j   · · ·  h1l   | h1+
a2    | h21   h22   · · ·  h2j   · · ·  h2l   | h2+
 ⋮    |  ⋮     ⋮            ⋮            ⋮    |  ⋮
ai    | hi1   hi2   · · ·  hij   · · ·  hil   | hi+
 ⋮    |  ⋮     ⋮            ⋮            ⋮    |  ⋮
ak    | hk1   hk2   · · ·  hkj   · · ·  hkl   | hk+
------+--------------------------------------+------
Σi    | h+1   h+2   · · ·  h+j   · · ·  h+l   | 1          (4.5)
On the basis of a (k × l) contingency table displaying the relative frequencies of the bivariate
joint distribution for some two-dimensional variable (X, Y ), one may define two kinds of related
conditional relative frequency distributions, namely (i) the conditional distribution of X given Y
by
\[ h(a_i | b_j) := \frac{h_{ij}}{h_{+j}} \,, \qquad (4.9) \]
and (ii) the conditional distribution of Y given X by
\[ h(b_j | a_i) := \frac{h_{ij}}{h_{i+}} \,. \qquad (4.10) \]
Then, by means of these conditional distributions, a notion of statistical independence of variables
X and Y is defined to correspond to the simultaneous properties
h(ai |bj ) = h(ai ) = hi+ and h(bj |ai ) = h(bj ) = h+j . (4.11)
Given these properties hold, it follows from Eqs. (4.9) and (4.10) that hij = hi+ h+j , i.e., that
the bivariate relative frequencies hij in this case are numerically equal to the product of the corre-
sponding univariate marginal relative frequencies hi+ and h+j .
alternatively:
\[ s_{XY} = \frac{1}{n-1}\left[ x_1 y_1 + \ldots + x_n y_n - n\bar{x}\bar{y} \right] = \frac{1}{n-1}\left[ \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} \right] . \qquad (4.14) \]
Figure 4.1: Example of a scatter plot, representing the joint distribution of measured values for the
variables “temperature” and “ozone” in the R data set “airquality.”
R:
data("airquality")
?airquality
plot( airquality$Temp , airquality$Ozone )
alternatively:
\[ s_{XY} = \frac{n}{n-1}\left[ a_1 b_1 h_{11} + \ldots + a_k b_l h_{kl} - \bar{x}\bar{y} \right] = \frac{n}{n-1}\left[ \sum_{i=1}^{k}\sum_{j=1}^{l} a_i b_j h_{ij} - \bar{x}\bar{y} \right] . \qquad (4.16) \]
Remark: The alternative formulae provided here prove computationally more efficient.
R: cov(variable1, variable2)
EXCEL, OpenOffice: COVARIANCE.S (dt.: KOVARIANZ.S, KOVAR)
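A brief sketch (hypothetical paired data) confirming that cov() implements the n − 1 denominator of Eqs. (4.13) and (4.14):
R:
x <- c(1.2, 2.4, 3.1, 4.8, 5.0)                      # hypothetical X-data
y <- c(2.0, 2.9, 3.8, 5.1, 6.2)                      # hypothetical Y-data
n <- length(x)
cov( x , y )                                         # built-in sample covariance
( sum(x * y) - n * mean(x) * mean(y) ) / (n - 1)     # Eq. (4.14) evaluated directly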
In view of its defining equation (4.13), the sample covariance can be given the following geo-
metrical interpretation. For a total of n data points (xi , yi ), it quantifies the degree of excess of
signed rectangular areas (xi − x̄)(yi − ȳ) with respect to the common centroid r_C := (x̄, ȳ)ᵀ of
the n data points in favour of either positive or negative signed areas, if any.¹
It is worthwhile to point out that in the research literature it is standard to define for the joint
distribution for a metrically scaled two-dimensional variable (X, Y ) a dimensionful symmetric
(2 × 2) sample covariance matrix S 2 according to
\[ S^2 := \begin{pmatrix} s_X^2 & s_{XY} \\ s_{XY} & s_Y^2 \end{pmatrix} , \qquad (4.17) \]
the components of which are defined by Eqs. (3.12) and (4.13). The determinant of S², given by
det(S²) = s_X² s_Y² − s_XY², is positive as long as s_X² s_Y² − s_XY² > 0, which applies in most practical
cases. Then S² is regular, and thus a corresponding inverse (S²)⁻¹ exists; cf. Ref. [18, Sec. 3.5].
The concept of a regular sample covariance matrix S 2 and its inverse (S 2 )−1 generalises in a
straightforward fashion to the case of multivariate joint distributions for metrically scaled m-
dimensional statistical variables (X, Y, . . . , Z), where S 2 ∈ Rm×m is given by
\[ S^2 := \begin{pmatrix} s_X^2 & s_{XY} & \ldots & s_{ZX} \\ s_{XY} & s_Y^2 & \ldots & s_{YZ} \\ \vdots & \vdots & \ddots & \vdots \\ s_{ZX} & s_{YZ} & \ldots & s_Z^2 \end{pmatrix} , \qquad (4.18) \]
\[ r := \frac{s_{XY}}{s_X\, s_Y} \,. \qquad (4.19) \]
¹ The centroid is the special case of equal mass points, with masses m_i = 1/n, of the centre of gravity of a system
of n discrete massive objects, defined by r_C := (Σ_{i=1}^{n} m_i r_i)/(Σ_{j=1}^{n} m_j). In two Euclidian dimensions the position vector is r_i = (x_i, y_i)ᵀ.
With Eq. (4.13) for sXY , this becomes
\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_X} \right)\!\left( \frac{y_i - \bar{y}}{s_Y} \right) = \frac{1}{n-1} \sum_{i=1}^{n} z_i^X z_i^Y \,, \qquad (4.20) \]
employing standardisation according to Eq. (3.18) in the final step. Due to its normalisation, the
range of the sample correlation coefficient is −1 ≤ r ≤ +1. The sign of r encodes the direction
of a correlation. As to interpreting the strength of a correlation via the magnitude |r|, in practice
one typically employs the following qualitative
Rule of thumb:
0.0 = |r|: no correlation
0.0 < |r| < 0.2: very weak correlation
0.2 ≤ |r| < 0.4: weak correlation
0.4 ≤ |r| < 0.6: moderately strong correlation
0.6 ≤ |r| < 0.8: strong correlation
0.8 ≤ |r| < 1.0: very strong correlation
1.0 = |r|: perfect correlation.
R: cor(variable1, variable2)
EXCEL, OpenOffice: CORREL (dt.: KORREL)
SPSS: Analyze → Correlate → Bivariate . . . : Pearson
In line with Eq. (4.17), it is convenient to define a dimensionless symmetric (2 × 2) sample
correlation matrix R by
\[ R := \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix} , \qquad (4.21) \]
which is regular and positive definite as long as its determinant det(R) = 1 − r² > 0. In this case,
its inverse R⁻¹ is given by
\[ R^{-1} = \frac{1}{1-r^2} \begin{pmatrix} 1 & -r \\ -r & 1 \end{pmatrix} . \qquad (4.22) \]
Note that for non-correlating metrically scaled variables X and Y , i.e., when r = 0, the sample
correlation matrix degenerates to become a unit matrix, R = 1.
Again, the concept of a regular and positive definite sample correlation matrix R, with inverse
R−1 , generalises to multivariate joint distributions for metrically scaled m-dimensional statistical
variables (X, Y, . . . , Z), where R ∈ Rm×m is given by2
\[ R := \begin{pmatrix} 1 & r_{XY} & \ldots & r_{ZX} \\ r_{XY} & 1 & \ldots & r_{YZ} \\ \vdots & \vdots & \ddots & \vdots \\ r_{ZX} & r_{YZ} & \ldots & 1 \end{pmatrix} , \qquad (4.23) \]
² Given a data matrix X ∈ R^{n×m} for a metrically scaled m-dimensional statistical variable (X, Y, . . . , Z), one
can show that upon standardisation of the data according to Eq. (3.18), which amounts to a transformation X ↦ Z ∈ R^{n×m},
the sample correlation matrix can be represented by R = (1/(n−1)) Zᵀ Z. The form of this relation is equivalent to Eq. (4.20).
and det(R) ≠ 0. Note that R is a dimensionless quantity which, hence, is scale-invariant; cf.
Sec. 8.10.
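A brief sketch (hypothetical bivariate data) illustrating the relation R = Zᵀ Z/(n − 1) stated in the footnote:
R:
X <- cbind( x = c(1.2, 2.4, 3.1, 4.8, 5.0) ,
            y = c(2.0, 2.9, 3.8, 5.1, 6.2) )    # hypothetical (5 x 2) data matrix
Z <- scale( X )                                 # columnwise standardisation, Eq. (3.18)
t(Z) %*% Z / ( nrow(X) - 1 )                    # reproduces the sample correlation matrix
cor( X )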
Then, every individual xi resp. yi is assigned a rank number which corresponds to its position in
the ordered sequences (4.24):
Should there be any “tied ranks” due to equality of some xi or yi , one assigns the arithmetical
mean of the corresponding rank numbers to all xi resp. yi involved in the “tie.” Ultimately, by this
procedure, the entire bivariate raw data undergoes a transformation
yielding n pairs of rank numbers to numerically represent the original bivariate ordinal data.
Given surrogate rank number data, the means of rank numbers always amount to
\[ \bar{R}(x) := \frac{1}{n}\sum_{i=1}^{n} R(x_i) = \frac{n+1}{2} \qquad (4.27) \]
\[ \bar{R}(y) := \frac{1}{n}\sum_{i=1}^{n} R(y_i) = \frac{n+1}{2} \,. \qquad (4.28) \]
The variances of rank numbers are defined in accordance with Eqs. (3.13) and (3.15), i.e.,
\[ s_{R(x)}^2 := \frac{1}{n-1}\left[ \sum_{i=1}^{n} R^2(x_i) - n\bar{R}^2(x) \right] = \frac{n}{n-1}\left[ \sum_{i=1}^{k} R^2(a_i)\, h_{i+} - \bar{R}^2(x) \right] \qquad (4.29) \]
\[ s_{R(y)}^2 := \frac{1}{n-1}\left[ \sum_{i=1}^{n} R^2(y_i) - n\bar{R}^2(y) \right] = \frac{n}{n-1}\left[ \sum_{j=1}^{l} R^2(b_j)\, h_{+j} - \bar{R}^2(y) \right] . \qquad (4.30) \]
In addition, to characterise the joint distribution of rank numbers, a sample covariance of rank
numbers is defined in line with Eqs. (4.14) and (4.16) by
\[ s_{R(x)R(y)} := \frac{1}{n-1}\left[ \sum_{i=1}^{n} R(x_i) R(y_i) - n\bar{R}(x)\bar{R}(y) \right] = \frac{n}{n-1}\left[ \sum_{i=1}^{k}\sum_{j=1}^{l} R(a_i) R(b_j)\, h_{ij} - \bar{R}(x)\bar{R}(y) \right] . \qquad (4.31) \]
On this fairly elaborate technical backdrop, the English psychologist and statistician
Charles Edward Spearman FRS (1863–1945) defined a dimensionless sample rank correlation
coefficient rS (ord), in analogy to Eq. (4.19), by (cf. Spearman (1904) [96])
\[ r_S := \frac{s_{R(x)R(y)}}{s_{R(x)}\, s_{R(y)}} \,. \qquad (4.32) \]
The range of this rank correlation coefficient is −1 ≤ rS ≤ +1. Again, while the sign of rS
encodes the direction of a rank correlation, in interpreting the strength of a rank correlation via
the magnitude |rS | one usually employs the qualitative
Rule of thumb:
0.0 = |rS |: no rank correlation
0.0 < |rS | < 0.2: very weak rank correlation
0.2 ≤ |rS | < 0.4: weak rank correlation
0.4 ≤ |rS | < 0.6: moderately strong rank correlation
0.6 ≤ |rS | < 0.8: strong rank correlation
0.8 ≤ |rS | < 1.0: very strong rank correlation
1.0 = |rS |: perfect rank correlation.
R: cor(variable1, variable2, method = "spearman")
SPSS: Analyze → Correlate → Bivariate . . . : Spearman
When no tied ranks occur, Eq. (4.32) simplifies to (cf. Hartung et al (2005) [39, p 554])
\[ r_S = 1 - \frac{6\sum_{i=1}^{n}\left[ R(x_i) - R(y_i) \right]^2}{n(n^2 - 1)} \,. \qquad (4.33) \]
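A brief sketch (hypothetical rankable data without tied ranks) comparing the built-in Spearman coefficient with Eq. (4.33):
R:
x <- c(3, 1, 4, 2, 5)                          # hypothetical X-data, no tied ranks
y <- c(2, 1, 5, 3, 4)                          # hypothetical Y-data, no tied ranks
n <- length(x)
cor( x , y , method = "spearman" )             # built-in rank correlation r_S
d <- rank(x) - rank(y)
1 - 6 * sum( d^2 ) / ( n * (n^2 - 1) )         # Eq. (4.33)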
with range 0 ≤ V ≤ 1. For the interpretation of the strength of statistical association in the joint
distribution for a two-dimensional categorical variable (X, Y ), one may thus employ the qualitative
Rule of thumb:
0.0 ≤ V < 0.2: weak association
0.2 ≤ V < 0.6: moderately strong association
0.6 ≤ V ≤ 1.0: strong association.
R: assocstats(contingency table) (package: vcd, by Meyer et al (2017) [70])
SPSS: Analyze → Descriptive Statistics → Crosstabs . . . → Statistics . . . : Chi-square, Phi and
Cramer’s V
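Alternatively to the vcd package, Cramér's V may be computed directly from the χ²-statistic via the standard relation V = sqrt(χ² / (n · (min(k, l) − 1))); a minimal sketch with a hypothetical (2 × 3) contingency table:
R:
tab <- matrix( c(20, 10, 5, 15, 25, 10) , nrow = 2 , byrow = TRUE )   # hypothetical contingency table
n <- sum(tab)
X2 <- chisq.test( tab , correct = FALSE )$statistic                   # chi-square statistic
sqrt( X2 / ( n * ( min(dim(tab)) - 1 ) ) )                            # Cramér's V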
Chapter 5
Descriptive linear regression analysis
For strongly correlating bivariate sample data {(xi , yi )}i=1,...,n for a metrically scaled two-
dimensional statistical variable (X, Y ), i.e., when 0.71 ≤ |r| ≤ 1.0, it is meaningful to con-
struct a mathematical model of the linear quantitative statistical association so diagnosed. The
standard method to realise this by systematic means is due to the German mathematician and
astronomer Carl Friedrich Gauß (1777–1855) and is known by the name of descriptive linear re-
gression analysis; cf. Gauß (1809) [29]. We here restrict our attention to the case of simple
linear regression, which aims to explain the variability in one dependent variable in terms of the
variability in a single independent variable.
To be determined is a best-fit linear model to given bivariate metrical data {(xi , yi )}i=1,...,n . The
linear model in question can be expressed in mathematical terms by
ŷ = a + bx , (5.1)
with unknown regression coefficients y-intercept a and slope b. Gauß’ method of least squares
works as follows.
constitutes a non-negative real-valued function of two variables, a and b. Hence, determining its
(local) minimum values entails satisfying (i) the necessary condition of simultaneously vanishing
first partial derivatives
\[ 0 \overset{!}{=} \frac{\partial S(a,b)}{\partial a} \,, \qquad 0 \overset{!}{=} \frac{\partial S(a,b)}{\partial b} \,, \qquad (5.3) \]
— this yields a well-determined (2 × 2) system of linear algebraic equations for the unknowns
a and b, cf. Ref. [18, Sec. 3.1] —, and (ii) the sufficient condition of a positive definite Hessian
matrix H(a, b) of second partial derivatives,
\[ H(a,b) := \begin{pmatrix} \dfrac{\partial^2 S(a,b)}{\partial a^2} & \dfrac{\partial^2 S(a,b)}{\partial a \partial b} \\[8pt] \dfrac{\partial^2 S(a,b)}{\partial b \partial a} & \dfrac{\partial^2 S(a,b)}{\partial b^2} \end{pmatrix} , \qquad (5.4) \]
at the candidate optimal values of a and b. H(a, b) is referred to as positive definite when all of its
eigenvalues are positive; cf. Ref. [18, Sec. 3.6].
Figure 5.1: Example of a best-fit linear model obtained by the method of least squares for the
case of the bivariate joint distribution featured in Fig. 4.1. The least squares estimators for the
y-intercept and the slope take values a = 69.41 °F and b = 0.20 °F/ppb, respectively.
R:
data("airquality")
?airquality
regMod <- lm( airquality$Temp ~ airquality$Ozone )
summary(regMod)
plot( airquality$Temp , airquality$Ozone )
abline(regMod)
\[ B := \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \,, \qquad (5.9) \]
with range 0 ≤ B ≤ 1. A perfect fit is signified by B = 1, while no fit amounts to B = 0. The
coefficient of determination provides a descriptive measure for the proportion of variability of Y
in a bivariate data set {(xi , yi )}i=1,...,n that can be accounted for as due to the association with X
via the simple linear regression model. Note that in simple linear regression it holds that
B = r2 ; (5.10)
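A short sketch, continuing the “airquality” example of Fig. 5.1, illustrating Eq. (5.10) numerically:
R:
data("airquality")
regMod <- lm( airquality$Temp ~ airquality$Ozone )
summary(regMod)$r.squared                                             # coefficient of determination B
cor( airquality$Temp , airquality$Ozone , use = "complete.obs" )^2    # r^2, cf. Eq. (5.10)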
This concludes Part I of these lecture notes, the introductory discussion on uni- and bivariate
descriptive statistical methods of data analysis. We wish to encourage the interested reader to
adhere to accepted scientific standards when actively getting involved with data analysis her/him-
self. This entails, amongst other aspects, foremost the truthful documentation of all data taken
into account in a specific analysis conducted. Features facilitating understanding such as visu-
alisations of empirical distributions by means of, where appropriate, histograms, bar charts, box
plots or scatter plots, or providing the values of five number summaries, sample means, sample
standard deviations, standardised skewness and excess kurtosis measures, or sample correlation
coefficients should be commonplace in any kind of research report. It must be a prime objective of
the researcher to empower potential readers to retrace the inferences made by her/him.
To set the stage for the application of inferential statistical methods in Part III, we now turn to
review the elementary concepts underlying Probability Theory, predominantly as interpreted in
the frequentist approach to this topic.
Chapter 6
Elements of Probability Theory
All examples of inferential statistical methods of data analysis to be presented in Chs. 12 and 13
have been developed in the context of the so-called frequentist approach to Probability Theory.1
The issue in Inferential Statistics is to estimate the plausibility or likelihood of hypotheses given
the observational evidence for them. The frequentist approach was pioneered by the Italian
mathematician, physician, astrologer, philosopher and gambler Girolamo Cardano (1501–1576),
the French lawyer and amateur mathematician Pierre de Fermat (1601–1665), the French math-
ematician, physicist, inventor, writer and Catholic philosopher Blaise Pascal (1623–1662), the
Swiss mathematician Jakob Bernoulli (1654–1705), and the French mathematician and astronomer
Marquis Pierre Simon de Laplace (1749–1827). It is deeply rooted in the two fundamental as-
sumptions that any particular random experiment can be repeated arbitrarily often (i) under the
“same conditions,” and (ii) completely “independent of one another,” so that a theoretical basis
is given for defining allegedly “objective probabilities” for random events and hypotheses via the
relative frequencies of very long sequences of repetition of the same random experiment.2 This
is a highly idealised viewpoint, however, which shares only a limited degree of similarity with
the actual conditions pertaining to an observer’s resp. experimenter’s reality. Renowned textbooks
adopting the frequentist viewpoint of Probability Theory and Inferential Statistics are, e.g.,
Cramér (1946) [13] and Feller (1968) [21].
Not everyone in Statistics is entirely happy, though, with the philosophy underlying the fre-
quentist approach to introducing the concept of probability, as a number of its central ideas
rely on unobserved data (information). A complementary viewpoint is taken by the frame-
work which originated from the work of the English mathematician and Presbyterian minister
Thomas Bayes (1702–1761), and later of Laplace, and so is commonly referred to as the Bayes–
Laplace approach; cf. Bayes (1763) [2] and Laplace (1812) [58]. A striking conceptual difference
to the frequentist approach consists in its use of prior, allegedly “subjective probabilities” for ran-
dom events and hypotheses, quantifying a person’s individual reasonable degree-of-belief in their
¹ The origin of the term “probability” is traced back to the Latin word probabilis, which the Roman philosopher
Cicero (106 BC–43 BC) used to capture a notion of plausibility or likelihood; see Mlodinow (2008) [73, p 32].
² A special role in the context of the frequentist approach to Probability Theory is assumed by Jakob Bernoulli’s
law of large numbers, as well as the concept of independently and identically distributed (in short: “i.i.d.”) random
variables; we will discuss these issues in Sec. 8.15 below.
likelihood, which are subsequently updated by analysing relevant empirical data.3 Renowned text-
books adopting the Bayes–Laplace viewpoint of Probability Theory and Inferential Statistics
are, e.g., Jeffreys (1939) [44] and Jaynes (2003) [43], while general information regarding the
Bayes–Laplace approach is available from the website bayes.wustl.edu. More recent text-
books, which assist in the implementation of advanced computational routines, have been issued
by Gelman et al (2014) [30] and by McElreath (2016) [69]. A discussion of the pros and cons of
either of these two competing approaches to Probability Theory can be found, e.g., in Sivia and
Skilling (2006) [92, p 8ff], or in Gilboa (2009) [31, Sec. 5.3].
A common denominator of both frameworks, frequentist and Bayes–Laplace, is the attempt to
quantify a notion of uncertainty that can be related to in formal treatments of decision-making.
In the following we turn to discuss the general principles on which Probability Theory is built.
• Random experiments: Random experiments are experiments which can be repeated arbi-
trarily often under identical conditions, with events — also called outcomes — that can-
not be predicted with certainty. Well-known simple examples are found amongst games of
chance such as tossing a coin, rolling dice, or playing roulette.
• Sample space Ω = {ω1 , ω2 , . . .}: The sample space associated with a random experiment
is constituted by the set of all possible elementary events (or elementary outcomes) ωi
(i = 1, 2, . . .), which are signified by their property of mutual exclusivity. The sample
space Ω of a random experiment may contain either
The essential concept of the sample space associated with a random experiment was intro-
duced to Probability Theory by the Italian mathematician Girolamo Cardano (1501–1576);
see Cardano (1564) [10], Mlodinow (2008) [73, p 42], and Bernstein (1998) [3, p 47ff].
• Random events A, B, . . . ⊆ Ω: Random events are formally defined as all kinds of subsets
of Ω that can be formed from the elementary events ωi ∈ Ω.
³ Anscombe and Aumann (1963) [1] in their seminal paper refer to “objective probabilities” as associated with
“roulette lotteries,” and to “subjective probabilities” as associated with “horse lotteries.” Savage (1954) [89] employs
the alternative terminology of distinguishing between “objectivistic probabilities” and “personalistic probabilities.”
⁴ For reasons of definiteness, we will assume in this case that the sample space Ω associated with a random exper-
iment is compact.
• Certain event Ω: The certain event is synonymous with the sample space itself. When a
particular random experiment is conducted, “something will happen for sure.”
• Impossible event ∅ = {} = Ω̄: The impossible event is the natural complement to the
certain event. When a particular random experiment is conducted, “it is not possible that
nothing will happen at all.”
• Event space P(Ω) := {A|A ⊆ Ω}: The event space, also referred to as the power set
of Ω, is the set of all possible subsets (random events!) that can be formed from elementary
events ωi ∈ Ω. Its size (or cardinality) is given by |P(Ω)| = 2^|Ω| . The event space P(Ω)
constitutes a so-called σ–algebra associated with the sample space Ω; cf. Rinne (2008) [87,
p 177]. When |Ω| = n, i.e., when Ω is finite, then |P(Ω)| = 2^n .
In the formulation of probability theoretical laws and computational rules, the following set oper-
ations and identities prove useful.
Set operations
1. Ā = Ω\A — complementation of a set (or event) A (“not A”)
2. A\B = A ∩ B̄ — formation of the difference of sets (or events) A and B (“A, but not B”)
3. A ∪ B — formation of the union of sets (or events) A and B, otherwise referred to as the
disjunction of A and B (“A or B”)
4. A ∩ B — formation of the intersection of sets (or events) A and B, otherwise referred to as
the conjunction of A and B (“A and B”)
5. A ⊆ B — inclusion of a set (or event) A in a set (or event) B (“A is a subset of or equal
to B”)
Before addressing the central axioms of Probability Theory, we first provide the following im-
portant definition.
Def.: Suppose given a compact sample space Ω of some random experiment. Then one under-
stands by a finite complete partition of Ω a set of n ∈ N random events {A1 , . . . , An } such
that
(i) Ai ∩ Aj = ∅ for i ≠ j, i.e., they are pairwise disjoint (mutually exclusive), and
(ii) \(\bigcup_{i=1}^{n} A_i = \Omega\), i.e., their union is identical to the full sample space.
P (A) ≥ 0 , (6.2)
P (Ω) = 1 , (6.3)
3. for all pairwise disjoint random events A1 , A2 , . . . ∈ P(Ω), i.e., Ai ∩ Aj = ∅ for all i ≠ j,
(σ–additivity)
\[ P\!\left( \bigcup_{i=1}^{\infty} A_i \right) = P(A_1 \cup A_2 \cup \ldots) = P(A_1) + P(A_2) + \ldots = \sum_{i=1}^{\infty} P(A_i) \,. \qquad (6.4) \]
the expression P (A) itself is referred to as the probability of a random event A ∈ P(Ω). A
less strict version of the third axiom is given by requiring only finite additivity of a probability
measure. This means it shall possess the property
1. P (Ā) = 1 − P (A)
2. P (∅) = P (Ω̄) = 0
Employing its complementation Ā and the first of the consequences stated above, one defines by
the ratio
\[ O(A) := \frac{P(A)}{P(\bar{A})} = \frac{P(A)}{1 - P(A)} \qquad (6.7) \]
the so-called odds of a random event A ∈ P(Ω).
The renowned Israeli–US-American experimental psychologists Daniel Kahneman and Amos
Tversky (the latter of which deceased in 1996, aged fifty-nine) refer to the third of the conse-
quences stated above as the extension rule; see Tversky and Kahneman (1983) [111, p 294]. It
provides a cornerstone to their remarkable investigations on the “intuitive statistics” applied by
Humans in everyday decision-making, which focus in particular on the conjunction rule,
Both may be perceived as subcases of the fourth law above, which is occasionally referred to as
the convexity property of a probability measure; cf. Gilboa (2009) [31, p 160]. By means of
their famous “Linda the bank teller” example in particular, Tversky and Kahneman (1983) [111,
p 297ff] were able to demonstrate the startling empirical fact that the conjunction rule is frequently
violated in everyday (intuitive) decision-making; in their view, in consequence of decision-makers
often resorting to a so-called representativeness heuristic as an aid in corresponding situations; see
also Kahneman (2011) [46, Sec. 15]. In recognition of their as much intriguing as groundbreaking
work, which sparked the discipline of Behavioural Economics, Daniel Kahneman was awarded
the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel in 2002.
All random experiments of this nature are referred to as Laplacian random experiments.
Def.: For a Laplacian random experiment, the probability of an arbitrary random event A ∈ P(Ω)
can be computed according to the rule
\[ P(A) = \frac{|A|}{|\Omega|} \,, \]
i.e., as the number of elementary events contained in A divided by the total number of elementary
events in Ω. Any probability measure P which can be constructed in this fashion is called a Laplacian proba-
bility measure.
The systematic counting of the numbers of possible outcomes of random experiments in general is
the central theme of combinatorics. We now briefly address its main considerations.
6.4 Combinatorics
At the heart of combinatorical considerations is the well-known urn model. This supposes given
an urn containing N ∈ N balls that are either
(a) all different, and thus can be uniquely distinguished from one another, or
(b) there are s ∈ N (s ≤ N) subsets of indistinguishable like balls, of sizes n1 , . . . , ns resp.,
such that n1 + . . . + ns = N.
The first systematic developments in Combinatorics date back to the Italian astronomer, physicist,
engineer, philosopher, and mathematician Galileo Galilei (1564–1642) and the French mathemati-
cian Blaise Pascal (1623–1662); cf. Mlodinow (2008) [73, p 62ff].
6.4.1 Permutations
Permutations relate to the number of distinguishable possibilities of arranging N balls in an or-
dered sequences. Altogether, for cases (a) resp. (b) one finds that there are a total number of
N!
N!
n1 !n2 ! · · · ns !
N! := N × (N − 1) × (N − 2) × · · · × 3 × 2 × 1 . (6.12)
R: factorial(N)
(a) the order in which balls were selected is either neglected or instead accounted for, and
(b) a ball that was selected once either cannot be selected again or indeed can be selected again
as often as a ball is being drawn.
variations (order accounted for): \(\binom{N}{n}\, n!\) without repetition, resp. \(N^n\) with repetition.
Note that, herein, the binomial coefficient for two natural numbers n, N ∈ N, n ≤ N, introduced
by Blaise Pascal (1623–1662), is defined by
\[ \binom{N}{n} := \frac{N!}{n!\,(N-n)!} \,. \qquad (6.13) \]
For fixed value of N and running value of n ≤ N, it generates the positive integer entries of
Pascal’s well-known numerical triangle; see, e.g., Mlodinow (2008) [73, p 72ff]. The binomial
coefficient satisfies the identity
\[ \binom{N}{n} \equiv \binom{N}{N-n} \,. \qquad (6.14) \]
R: choose(N, n)
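A small sketch evaluating the counting formulae of this section for hypothetical values N = 5 and n = 3:
R:
N <- 5 ; n <- 3                 # hypothetical urn with N balls, samples of size n
factorial(N)                    # permutations of N distinguishable balls, case (a)
choose(N, n) * factorial(n)     # variations without repetition
N^n                             # variations with repetition
choose(N, n)                    # binomial coefficient, Eq. (6.13)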
To conclude this chapter, we turn to discuss the essential concept of conditional probabilities of
random events.
(i) random events A1 , . . . , Am ∈ P(Ω) which constitute a finite complete partition of Ω into
m ∈ N pairwise disjoint events,
(ii) P (Ai ) > 0 for all i = 1, . . . , m, with \(\sum_{i=1}^{m} P(A_i) = 1\) by Eq. (6.3), and
(iii) a random event B ∈ P(Ω) with \(P(B) = \sum_{i=1}^{m} P(B|A_i)\,P(A_i) > 0\) by Eq. (6.17), that is known to have
occurred,
the identity
\[ P(A_i | B) = \frac{P(B|A_i)\, P(A_i)}{\sum_{j=1}^{m} P(B|A_j)\, P(A_j)} \qquad (6.18) \]
applies. This form of the theorem was given by Laplace (1774) [56]. By Eq. (6.3), it necessarily
follows that \(\sum_{i=1}^{m} P(A_i|B) = 1\). Again, the content of Bayes’ theorem may be conveniently
visualised by means of a Venn diagram.
Some of the different terms appearing in Eq. (6.18) have been given names in their own right:
• P (Ai ) is referred to as the prior probability of random event, or hypothesis, Ai ,
• P (B|Ai ) is the likelihood of random event, or empirical evidence, B, given random event,
or hypothesis, Ai , and
• P (Ai |B) is called the posterior probability of random event, or hypothesis, Ai , given ran-
dom event, or empirical evidence, B.
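A minimal numerical sketch of Eq. (6.18) in R, for a hypothetical complete partition into m = 2 hypotheses with assumed prior probabilities and likelihoods:
R:
prior      <- c(0.3, 0.7)     # hypothetical P(A_1), P(A_2); sums to one
likelihood <- c(0.9, 0.2)     # hypothetical P(B|A_1), P(B|A_2)
posterior  <- likelihood * prior / sum( likelihood * prior )   # Eq. (6.18)
posterior                     # P(A_1|B), P(A_2|B); sums to one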
The most common interpretation of Bayes’ theorem is that it essentially provides a means for
computing the posterior probability of a random event, or hypothesis, Ai , given information on
the factual realisation of an associated random event, or evidence, B, in terms of the product of the
likelihood of B, given Ai , and the prior probability of Ai ,
Chapter 7
Discrete and continuous random variables
Def.: A one-dimensional random variable X is defined as a mapping
X : Ω → D ⊆ R   (7.1)
of the sample space Ω of some random experiment with associated probability space (Ω, P, P )
into a subset D of the real numbers R.
Depending on the nature of the spectrum of values of X, we will distinguish in the following
between random variables of the discrete and continuous kinds.
(ii) \(\sum_{i=1}^{n} p_i = 1\). (normalisability)
Specific distributional features of a discrete random variable X deriving from its probability func-
tion P (X = xi ) are encoded in the associated theoretical
Cumulative distribution function (cdf):
\[ F_X(x) = \mathrm{cdf}(x) := P(X \leq x) = \sum_{i \,|\, x_i \leq x} P(X = x_i) \,. \qquad (7.4) \]
Information on the central tendency and the variability of a discrete random variable X is quantified
in terms of its
Expectation value and variance:
\[ E(X) := \sum_{i=1}^{n} x_i\, P(X = x_i) \qquad (7.6) \]
\[ \mathrm{Var}(X) := \sum_{i=1}^{n} \left( x_i - E(X) \right)^2 P(X = x_i) \,. \qquad (7.7) \]
One of the first occurrences of the notion of the expectation value of a random variable relates
to the famous “wager” put forward by the French mathematician Blaise Pascal (1623–1662); cf.
Gilboa (2009) [31, Sec. 5.2].
By the so-called shift theorem it holds that the variance may alternatively be obtained from the
computationally more efficient formula
\[ \mathrm{Var}(X) = E\!\left[ (X - E(X))^2 \right] = E(X^2) - \left[ E(X) \right]^2 \,. \qquad (7.8) \]
Specific values of E(X) and Var(X) will be denoted throughout by the Greek letters µ and σ²,
respectively. The standard deviation of X amounts to √Var(X); its specific values will be
denoted by σ.
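A small sketch (a hypothetical discrete random variable with four possible values) evaluating Eqs. (7.6), (7.7) and the shift theorem (7.8) in R:
R:
x <- c(0, 1, 2, 3)                        # hypothetical spectrum of values
p <- c(0.1, 0.3, 0.4, 0.2)                # hypothetical probability function, sums to one
EX <- sum( x * p )                        # expectation value, Eq. (7.6)
sum( (x - EX)^2 * p )                     # variance, Eq. (7.7)
sum( x^2 * p ) - EX^2                     # shift theorem, Eq. (7.8)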
The evaluation of event probabilities for a discrete random variable X with known probability
function P (X = xi ) follows from the
Computational rules:
P (X ≤ d) = FX (d) (7.9)
P (X < d) = FX (d) − P (X = d) (7.10)
P (X ≥ c) = 1 − FX (c) + P (X = c) (7.11)
P (X > c) = 1 − FX (c) (7.12)
P (c ≤ X ≤ d) = FX (d) − FX (c) + P (X = c) (7.13)
P (c < X ≤ d) = FX (d) − FX (c) (7.14)
P (c ≤ X < d) = FX (d) − FX (c) − P (X = d) + P (X = c) (7.15)
P (c < X < d) = FX (d) − FX (c) − P (X = d) , (7.16)
where c and d denote arbitrary lower and upper cut-off values imposed on the spectrum of X.
In applications it is frequently of interest to know the values of a discrete cdf’s
α–quantiles:
These are realisations xα of X specifically determined by the condition that X take values x ≤ xα
at least with probability α (for 0 < α < 1), i.e.,
\[ F_X(x_\alpha) = P(X \leq x_\alpha) \overset{!}{\geq} \alpha \quad \text{and} \quad F_X(x) = P(X \leq x) < \alpha \ \text{for } x < x_\alpha \,. \qquad (7.17) \]
Hence, approximately,
P (X ∈ dx) ≈ fX (ξ) dx ,
for some representative ξ ∈ dx. The pdf of an arbitrary continuous random variable X has the
defining properties:
The central tendency and the variabilty of a continuous random variable X are quantified by its
Expectation value and variance:
\[ E(X) := \int_{-\infty}^{+\infty} x\, f_X(x)\, \mathrm{d}x \qquad (7.26) \]
\[ \mathrm{Var}(X) := \int_{-\infty}^{+\infty} \left( x - E(X) \right)^2 f_X(x)\, \mathrm{d}x \,. \qquad (7.27) \]
Specific values of E(X) and Var(X) will be denoted throughout by µ and σ², respectively. The standard deviation of X
amounts to √Var(X); its specific values will be denoted by σ.
The construction of interval estimates for unknown distribution parameters of continuous one-
dimensional random variables X in given target populations Ω, and null hypothesis significance
testing (to be discussed later in Chs. 12 and 13), both require explicit knowledge of the α–
quantiles associated with the cdfs of the Xs. Generally, these are defined as follows.
α–quantiles:
These are realisations x_α of X determined by the condition that X take values x ≤ x_α with
probability α (for 0 < α < 1), i.e.,
\[ P(X \leq x_\alpha) = F_X(x_\alpha) \overset{!}{=} \alpha \quad \Leftrightarrow \quad x_\alpha = F_X^{-1}(\alpha) \,, \qquad (7.28) \]
where the equivalence holds since F_X(x) is strictly monotonously increasing.
Hence, α–quantiles of the probability distribution for a continuous one-dimensional random vari-
able X are determined by the inverse cdf, FX−1 . For given α, the spectrum of X is thus naturally
partitioned into domains x ≤ xα and x ≥ xα . Occasionally, α–quantiles of a probability distribu-
tion are also referred to as percentile values.
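In R, the inverse cdfs of the standard families of continuous probability distributions are provided by the q-functions; a brief sketch obtaining α–quantiles of a Gaußian normal distribution (cf. Sec. 8.6):
R:
alpha <- c(0.025, 0.5, 0.975)
qnorm( alpha , mean = 0 , sd = 1 )    # x_alpha = F^(-1)(alpha) for the standard normal distribution
pnorm( qnorm(0.975) )                 # applying the cdf recovers alpha = 0.975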
If, in addition, the X1 , . . . , Xn are mutually stochastically independent according to Eq. (6.16) (see
also Sec. 7.7.4 below), it follows from Sec. 7.5.2 that the variances of Yn and X̄n are given by
$$\mathrm{Var}(Y_n) = \mathrm{Var}\!\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i) \quad \text{and} \quad \mathrm{Var}(\bar{X}_n) = \left(\frac{1}{n}\right)^{2} \mathrm{Var}(Y_n) \ , \quad (7.38)$$
¹ That is: E(X₁ + X₂) = E(X₁) + E(X₂).
respectively.
Def.: Reproductivity of a probability distribution law (cdf) F (x) is given when the total sum Yn
of n independent and identically distributed (in short: “i.i.d.”) additive one-dimensional random
variables X1 , . . . , Xn , which each individually satisfy distribution laws FXi (x) ≡ F (x), inherits
this very distribution law F (x) from its underlying n random variables. Examples of reproductive
distribution laws, to be discussed in the following Ch. 8, are the binomial, the Gaußian normal,
and the χ2 –distributions.
(X, Y ) : Ω → D ⊆ R2 (7.39)
of the sample space Ω of some random experiment with associated probability space (Ω, P, P )
into a subset D of the two-dimensional Euclidean space R².
We proceed by sketching some important concepts relating to two-dimensional random variables.
All pairs of values (xi , yj )i=1,...,k;j=1,...,l in this spectrum are assigned individual probabilities pij
by a real-valued
Joint probability function:
with properties
Continuous case:
For two-dimensional continuous random variables the range can be represented by the
Spectrum of values:
Probabilities are now assigned to infinitesimally small areas dx×dy ∈ D by means of a real-valued
Joint probability density function (pdf):
with properties:
(i) fXY (x, y) ≥ 0 for all (x, y) ∈ D, and (non-negativity)
(ii) $\displaystyle\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f_{XY}(x, y)\, \mathrm{d}x\, \mathrm{d}y = 1$. (normalisability)
Approximately, one now has
for representative ξ ∈ dx and η ∈ dy. Specific event probabilities for (X, Y ) are obtained from
the associated
Joint cumulative distribution function (cdf):
$$F_{XY}(x, y) = \text{cdf}(x, y) := P(X \le x, Y \le y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f_{XY}(t, u)\, \mathrm{d}t\, \mathrm{d}u \ . \quad (7.45)$$
In addition, one defines conditional probability functions for X given Y = yj , with p+j > 0,
and for Y given X = xi , with pi+ > 0, by
$$p_{i|j} := \frac{p_{ij}}{p_{+j}} = P(X = x_i \,|\, Y = y_j) \quad \text{for } i = 1, \ldots, k \ , \quad (7.48)$$
respectively
$$p_{j|i} := \frac{p_{ij}}{p_{i+}} = P(Y = y_j \,|\, X = x_i) \quad \text{for } j = 1, \ldots, l \ . \quad (7.49)$$
Continuous case:
The univariate marginal probability density functions for X and Y induced by the joint proba-
bility density function fXY (x, y) are
$$f_X(x) = \int_{-\infty}^{+\infty} f_{XY}(x, y)\, \mathrm{d}y \ , \quad (7.50)$$
and
$$f_Y(y) = \int_{-\infty}^{+\infty} f_{XY}(x, y)\, \mathrm{d}x \ . \quad (7.51)$$
Moreover, one defines conditional probability density functions for X given Y , and for Y given
X, by
$$f_{X|Y}(x|y) := \frac{f_{XY}(x, y)}{f_Y(y)} \quad \text{for } f_Y(y) > 0 \ , \quad (7.52)$$
respectively
$$f_{Y|X}(y|x) := \frac{f_{XY}(x, y)}{f_X(x)} \quad \text{for } f_X(x) > 0 \ . \quad (7.53)$$
Discrete case:
Let P (X = xi ) = pi+ > 0 be a prior probability function for a discrete random variable X.
Then, on the grounds of a joint probability function P (X = xi , Y = yj ) = pij and Eqs. (7.48) and
(7.49), the posterior probability function for X given Y = y_j, with P(Y = y_j) = p_{+j} > 0, is
determined by
$$p_{i|j} = \frac{p_{j|i}}{p_{+j}}\, p_{i+} \quad \text{for } i = 1, \ldots, k \ . \quad (7.54)$$
By using Eqs. (7.47) and (7.49) to re-express the denominator p_{+j}, this may be given in the standard form
$$p_{i|j} = \frac{p_{j|i}\, p_{i+}}{\displaystyle\sum_{i=1}^{k} p_{j|i}\, p_{i+}} \quad \text{for } i = 1, \ldots, k \ . \quad (7.55)$$
Continuous case:
Let fX (x) > 0 be a prior probability density function for a continuous random variable X.
Then, on the grounds of a joint probability density function fXY (x, y) and Eqs. (7.52) and (7.53),
the posterior probability density function for X given Y , with fY (y) > 0, is determined by
$$f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x)}{f_Y(y)}\, f_X(x) \ . \quad (7.56)$$
By using Eqs. (7.51) and (7.53) to re-express the denominator f_Y(y), this may be stated in the standard form
$$f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x)\, f_X(x)}{\displaystyle\int_{-\infty}^{+\infty} f_{Y|X}(y|x)\, f_X(x)\, \mathrm{d}x} \ . \quad (7.57)$$
In practical applications, evaluation of the, at times intricate, single and double integrals con-
tained in this representation of Bayes’ theorem is managed by employing sophisticated numeri-
cal approximation techniques; cf. Saha (2002) [88], Sivia and Skilling (2006) [92], Greenberg
(2013) [35], Gelman et al (2014) [30], or McElreath (2016) [69].
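To give a flavour of such techniques, the following minimal R sketch evaluates the posterior pdf of Eq. (7.57) on a discrete grid; the specific prior (a Beta density) and likelihood (a binomial model with 7 successes in 10 trials) chosen here are assumptions made purely for illustration, not part of the discussion above.

# Minimal grid approximation of the posterior pdf in Eq. (7.57); prior and likelihood are illustrative assumptions
x.grid <- seq(0.001, 0.999, length.out = 1000)   # grid covering the spectrum of X
step   <- diff(x.grid)[1]                        # grid spacing
prior  <- dbeta(x.grid, shape1 = 2, shape2 = 2)  # assumed prior pdf f_X(x)
lik    <- dbinom(7, size = 10, prob = x.grid)    # assumed likelihood f_{Y|X}(y|x)
unnorm <- lik * prior                            # numerator of Eq. (7.57)
posterior <- unnorm / (sum(unnorm) * step)       # the sum approximates the integral in the denominator
# plot(x.grid, posterior, type = "l")            # optional visual check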
for (x, y) ∈ D ⊆ R². Moreover, in this case (i) E(X × Y) = E(X) × E(Y), and (ii) Var(aX + bY) = a² Var(X) + b² Var(Y).
In the next chapter we will highlight a number of standard univariate probability distributions for
discrete and continuous one-dimensional random variables.
Chapter 8
Standard univariate probability distributions
In this chapter, we review (i) the univariate probability distributions for one-dimensional random
variables which one typically encounters as theoretical probability distributions in the context
of frequentist null hypothesis significance testing (cf. Chs. 12 and 13), but we also include
(ii) cases of well-established pedagogical merit, and (iii) a few examples of rather specialised uni-
variate probability distributions, which, nevertheless, prove to be of interest in the description and
modelling of various theoretical market situations in Economics. We split our considerations into
two main parts according to whether a one-dimensional random variable X underlying a particular
distribution law varies discretely or continuously. For each of the cases to be presented, we list
the spectrum of values of X, its probability function (for discrete X) or probability density
function (pdf) (for continuous X), its cumulative distribution function (cdf), its expectation
value and its variance, and, in some continuous cases, also its skewness, excess kurtosis and
α–quantiles. Additional information, e.g., commands in R, on a GDC, in EXCEL, or in OpenOf-
fice, by which a specific distribution function may be activated for computational purposes or be
plotted, is included where available.
X ∼ L(n) , (8.1)
Probability function:
$$P(X = x_i) = \frac{1}{n} \quad \text{for } i = 1, \ldots, n \ ; \quad (8.3)$$
its graph is shown in Fig. 8.1 below for n = 6.
Figure 8.1: Probability function of the discrete uniform distribution according to Eq. (8.3) for the
case L(6). An enveloping line is also shown.
For skewness and excess kurtosis, see, e.g., Rinne (2008) [87, p 372f].
The discrete uniform distribution is identical to a Laplacian probability measure; cf. Sec. 6.3. This
is well-known from games of chance such as tossing a fair coin once, selecting a single card from
a deck of cards, rolling a fair die once, or the fair roulette lottery.
R: ddunif(x, x1 , xn ), pdunif(x, x1 , xn ), qdunif(α, x1 , xn ), rdunif(nsimulations , x1 , xn )
(package: extraDistr, by Wolodzko (2018) [121])
8.2 Binomial distribution
8.2.1 Bernoulli distribution
Another simple probability distribution, for a discrete one-dimensional random variable
X with only two possible values, 0 and 1,1 is due to the Swiss mathematician
Jakob Bernoulli (1654–1705). The Bernoulli distribution,
X ∼ B(1; p) , (8.7)
depends on a single free parameter, the probability p ∈ [0; 1] for the event X = x = 1.
Spectrum of values:
X 7→ x ∈ {0, 1} . (8.8)
Probability function:
$$P(X = x) = \binom{1}{x}\, p^{x} (1 - p)^{1-x} \ , \quad \text{with } 0 \le p \le 1 \ ; \quad (8.9)$$
its graph is shown in Fig. 8.2 below for p = 1/3.
Figure 8.2: Probability function of the Bernoulli distribution according to Eq. (8.9) for the case B(1; 1/3).
¹ Any one-dimensional random variable of this kind is referred to as dichotomous.
Cumulative distribution function (cdf):
$$F_X(x) = P(X \le x) = \sum_{k=0}^{\lfloor x \rfloor} \binom{1}{k}\, p^{k} (1 - p)^{1-k} \ . \quad (8.10)$$
² In the context of an urn model with M black balls and N − M white balls, and the random selection of n balls from a total of N, with repetition, this probability function can be derived from Laplace's principle of forming the ratio between the "number of favourable cases" and the "number of all possible cases," cf. Eq. (6.11). Thus, $P(X = x) = \binom{n}{x}\,\dfrac{M^{x} (N - M)^{n-x}}{N^{n}}$, where x denotes the number of black balls drawn, and one substitutes accordingly from the definition p := M/N.
Figure 8.3: Probability function of the binomial distribution according to Eq. (8.16) for the case B(10; 3/5). An enveloping line is also shown.
Expectation value, variance, skewness and excess kurtosis (cf. Rinne (2008) [87, p 260]):
$$E(X) = \sum_{i=1}^{n} p = np \quad (8.18)$$
$$\mathrm{Var}(X) = \sum_{i=1}^{n} p(1 - p) = np(1 - p) \quad (8.19)$$
$$\mathrm{Skew}(X) = \frac{1 - 2p}{\sqrt{np(1 - p)}} \quad (8.20)$$
$$\mathrm{Kurt}(X) = \frac{1 - 6p(1 - p)}{np(1 - p)} \ . \quad (8.21)$$
The results for E(X) and Var(X) are based on the rules (7.37) and (7.38), the latter of which
applies to a set of mutually stochastically independent random variables.
R: dbinom(x, n, p), pbinom(x, n, p), qbinom(α, n, p), rbinom(nsimulations , n, p)
GDC: binompdf(n, p, x), binomcdf(n, p, x)
EXCEL, OpenOffice: BINOM.DIST (dt.: BINOM.VERT, BINOMVERT), BINOM.INV (for α–
quantiles)
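By way of a worked example, the R commands just listed can be combined as sketched below; the parameter values simply reproduce the case B(10; 3/5) of Fig. 8.3 and are otherwise arbitrary.

# Example: X ~ B(10; 3/5), cf. Fig. 8.3
dbinom(6, size = 10, prob = 3/5)       # P(X = 6)
pbinom(6, size = 10, prob = 3/5)       # P(X <= 6)
1 - pbinom(6, size = 10, prob = 3/5)   # P(X > 6)
qbinom(0.95, size = 10, prob = 3/5)    # 0.95-quantile
10 * (3/5)                             # E(X) = np
10 * (3/5) * (1 - 3/5)                 # Var(X) = np(1 - p)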
X ∼ H(n, M, N) . (8.22)
In particular, this model forms the mathematical basis of the internationally popular National Lot-
tery “6 out of 49,” in which case there are M = 6 winning numbers amongst a total of N = 49
numbers, and X ∈ {0, 1, . . . , 6} counts the total of correctly guessed winning numbers on an
individual gambler’s lottery ticket.
Spectrum of values:
Probability function:
$$P(X = x) = \frac{\dbinom{M}{x}\dbinom{N - M}{n - x}}{\dbinom{N}{n}} \ ; \quad (8.24)$$
its graph is shown in Fig. 8.4 below for the National Lottery example, so n = 6, M = 6 and
N = 49.
Figure 8.4: Probability function of the hypergeometric distribution according to Eq. (8.24) for the case H(6, 6, 49). An enveloping line is also shown.
Cumulative distribution function (cdf):
$$F_X(x) = P(X \le x) = \sum_{k=\max(0,\, n-(N-M))}^{\lfloor x \rfloor} \frac{\dbinom{M}{k}\dbinom{N - M}{n - k}}{\dbinom{N}{n}} \ . \quad (8.25)$$
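For the National Lottery example, the corresponding probabilities can be computed in R with the hypergeometric routines; note that R's parameterisation dhyper(x, m, n, k) uses m = M, n = N − M and k = n (the sample size), so the translation below is an assumption about notation only.

# Probabilities for the "6 out of 49" lottery, X ~ H(6, 6, 49)
M <- 6; N <- 49; n.draw <- 6
dhyper(6, m = M, n = N - M, k = n.draw)   # P(X = 6), the jackpot: approx. 7.15e-08
dhyper(3, m = M, n = N - M, k = n.draw)   # P(X = 3)
phyper(2, m = M, n = N - M, k = n.draw)   # P(X <= 2)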
Figure 8.5: Probability function of the Poisson distribution according to Eq. (8.30) for the case Pois(3/2). An enveloping line is also shown.
Expectation value, variance, skewness and excess kurtosis (cf. Rinne (2008) [87, p 285f]):3
$$E(X) = \lambda \quad (8.32)$$
$$\mathrm{Var}(X) = \lambda \quad (8.33)$$
$$\mathrm{Skew}(X) = \frac{1}{\sqrt{\lambda}} \quad (8.34)$$
$$\mathrm{Kurt}(X) = \frac{1}{\lambda} \ . \quad (8.35)$$
X ∼ U(a; b) , (8.36)
also referred to as the rectangular distribution. Its two free parameters, a and b, denote the limits
of X’s
³ Note that for a binomial distribution, cf. Sec. 8.2, in the limit that n ≫ 1 while simultaneously 0 < p ≪ 1 it holds that np ≈ np(1 − p), and so the corresponding expectation value and variance become more and more equal.
Spectrum of values:
X 7→ x ∈ [a, b] ⊂ R . (8.37)
Probability density function (pdf):4
$$f_X(x) = \begin{cases} \dfrac{1}{b - a} & \text{for } x \in [a, b] \\[1ex] 0 & \text{otherwise} \end{cases} \ ; \quad (8.38)$$
its graph is shown in Fig. 8.6 below for four different combinations of the parameters a and b.
Figure 8.6: pdf of the continuous uniform distribution according to Eq. (8.38) for the cases U(0; 5), U(1; 4), U(3/2; 7/2) and U(2; 3).
⁴ It is a nice and instructive little exercise, strongly recommended to the reader, to go through the details of explicitly computing from this simple pdf the corresponding cdf, expectation value, variance, skewness and excess kurtosis of X ∼ U(a; b).
Expectation value, variance, skewness and excess kurtosis:
$$E(X) = \frac{a + b}{2} \quad (8.40)$$
$$\mathrm{Var}(X) = \frac{(b - a)^2}{12} \quad (8.41)$$
$$\mathrm{Skew}(X) = 0 \quad (8.42)$$
$$\mathrm{Kurt}(X) = -\frac{6}{5} \ . \quad (8.43)$$
Using some of these results, as well as Eq. (8.39), one finds that for all continuous uniform distri-
butions the event probability
$$P\!\left(|X - E(X)| \le \sqrt{\mathrm{Var}(X)}\right) = P\!\left(\frac{\sqrt{3}(a+b) - (b-a)}{2\sqrt{3}} \le X \le \frac{\sqrt{3}(a+b) + (b-a)}{2\sqrt{3}}\right) = \frac{1}{\sqrt{3}} \approx 0.5773 \ , \quad (8.44)$$
i.e., the event probability that X falls within one standard deviation ("1σ") of E(X) is $1/\sqrt{3}$. α–quantiles of continuous uniform distributions are obtained by straightforward inversion, i.e., for 0 < α < 1,
$$\alpha \overset{!}{=} F_X(x_\alpha) = \frac{x_\alpha - a}{b - a} \quad \Leftrightarrow \quad x_\alpha = F_X^{-1}(\alpha) = a + \alpha(b - a) \ . \quad (8.45)$$
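A quick numerical check of Eqs. (8.44) and (8.45) in R, for an arbitrarily chosen pair of parameters a and b, might look as follows.

# Check of Eq. (8.44) for an illustrative U(1; 4)
a <- 1; b <- 4
EX <- (a + b) / 2; SD <- (b - a) / sqrt(12)
punif(EX + SD, min = a, max = b) - punif(EX - SD, min = a, max = b)   # approx. 0.5773 = 1/sqrt(3)
qunif(0.9, min = a, max = b)                                          # 0.9-quantile, cf. Eq. (8.45): a + 0.9*(b - a)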
We emphasise the fact that the normal–cdf cannot be expressed in terms of elementary mathe-
matical functions.
Expectation value, variance, skewness and excess kurtosis (cf. Rinne (2008) [87, p 301]):
E(X) = µ (8.53)
Var(X) = σ² (8.54)
Skew(X) = 0 (8.55)
Kurt(X) = 0 . (8.56)
Figure 8.7: pdf of the Gaußian normal distribution according to Eq. (8.51). Cases N(−2; 1/4), N(0; 1/4), N(1; 1/4) and N(3/2; 1/4), which have constant σ.
Figure 8.8: pdf of the Gaußian normal distribution according to Eq. (8.51). Cases N(0; 1/4), N(0; 1), N(0; 2) and N(0; 4), which have constant µ.
Figure 8.9: pdf of the standard normal distribution according to Eq. (8.57).
P (Z ≤ b) = Φ(b) (8.59)
P (Z ≥ a) = 1 − Φ(a) (8.60)
P (a ≤ Z ≤ b) = Φ(b) − Φ(a) (8.61)
Φ(−z) = 1 − Φ(z) (8.62)
P (−z ≤ Z ≤ z) = 2Φ(z) − 1 . (8.63)
The event probability that a (standard) normally distributed one-dimensional random variable takes
values inside an interval of length k times two standard deviations, centred on its expectation value,
is given by the important kσ–rule. This states that
$$P(|X - \mu| \le k\sigma) \overset{\text{Eq. (7.34)}}{=} P(-k \le Z \le +k) \overset{\text{Eq. (8.63)}}{=} 2\Phi(k) - 1 \quad \text{for } k > 0 \ . \quad (8.64)$$
According to this rule, the event probability of a normally distributed one-dimensional random variable to deviate from its mean by more than six standard deviations amounts to $P(|X - \mu| > 6\sigma) = 1 - [2\Phi(6) - 1] \approx 2 \times 10^{-9}$, i.e., about two parts in one billion. Thus, in this scenario the occurrence of extreme outliers
for X is practically impossible. In turn, the persistent occurrence of so-called 6σ–events, or larger
deviations from the mean, in quantitative statistical surveys can be interpreted as evidence against
the assumption of an underlying Gaußian random process; cf. Taleb (2007) [105, Ch. 15].
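The probabilities entering the kσ–rule, and the 6σ tail probability just mentioned, can be reproduced directly in R from Eq. (8.64); the snippet below is purely illustrative.

# Event probabilities of the k-sigma rule, Eq. (8.64)
k <- 1:6
round(2 * pnorm(k) - 1, 9)    # P(|X - mu| <= k*sigma) for k = 1, ..., 6
2 * (1 - pnorm(6))            # complementary 6-sigma tail probability, approx. 2e-09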
The rapid, accelerated decline in the event probabilities for deviations from the mean of a Gaußian
normal distribution can be related to the fact that the elasticity of the standard normal–pdf is given
by (cf. Ref. [18, Sec. 7.6])
$$\varepsilon_{\varphi}(z) = -z^{2} \ . \quad (8.66)$$
Manifestly this is negative for all z ≠ 0 and increases non-linearly in absolute value as one moves
away from z = 0.
α–quantiles associated with Z ∼ N(0; 1) are obtained from the inverse standard normal–cdf
according to
$$\alpha \overset{!}{=} P(Z \le z_\alpha) = \Phi(z_\alpha) \quad \Leftrightarrow \quad z_\alpha = \Phi^{-1}(\alpha) \quad \text{for all } 0 < \alpha < 1 \ . \quad (8.67)$$
Due to the reflection symmetry of ϕ(z) with respect to the vertical axis at z = 0, it holds that
zα = −z1−α . (8.68)
For this reason, one typically finds zα -values listed in textbooks on Statistics only for α ∈ [1/2, 1).
Alternatively, a particular zα may be obtained from R, a GDC, EXCEL, or from OpenOffice. The
backward transformation from a particular zα of the standard normal distribution to the corre-
sponding xα of a given normal distribution follows from Eq. (7.34) and amounts to xα = µ + zα σ.
R: qnorm(α)
GDC: invNorm(α)
EXCEL, OpenOffice: NORM.S.INV (dt.: NORM.S.INV, NORMINV)
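As a brief illustration of these commands, and of the backward transformation x_α = µ + z_α σ, consider the following R sketch; the values of µ and σ are chosen arbitrarily.

# Selected quantiles of the standard normal distribution, Eq. (8.67)
qnorm(0.95)                    # z_0.95, approx. 1.645
qnorm(0.975)                   # z_0.975, approx. 1.960
qnorm(0.05)                    # z_0.05 = -z_0.95, cf. Eq. (8.68)
mu <- 100; sigma <- 15         # illustrative parameter values
mu + qnorm(0.975) * sigma      # x_0.975; equivalently qnorm(0.975, mean = mu, sd = sigma)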
At this stage, a few historical remarks are in order. The Gaußian normal distribution
gained a prominent, though in parts questionable status in the Social Sciences through the
highly influential work of the Belgian astronomer, mathematician, statistician and sociologist
Lambert Adolphe Jacques Quetelet (1796–1874) during the 19th Century. In particular, his re-
search programme on the generic properties of l’homme moyen (engl.: the average man), see
Quetelet (1835) [84], an ambitious and to some extent obsessive attempt to quantify and clas-
sify physiological and sociological human characteristics according to the principles of a nor-
mal distribution, left a lasting impact on the field, with repercussions to this day. Quetelet,
by the way, co-founded the Royal Statistical Society (rss.org.uk) in 1834. Further vis-
ibility was given to Quetelet’s ideas at the time by a contemporary, the English empiricist
Sir Francis Galton FRS (1822–1911), whose intense studies on heredity in Humans, see Galton
(1869) [27], which he later subsumed under the term “eugenics,” complemented Quetelet’s in-
vestigations, and profoundly shaped subsequent developments in social research; cf. Bernstein
(1998) [3, Ch. 9]. Incidentally, amongst many other contributions to the field, Galton's activities
helped to pave the way for making questionnaires and surveys a commonplace for collecting
statistical data from Humans.
The (standard) normal distribution, as well as the next three examples of probability distributions
for a continuous one-dimensional random variable X, are commonly referred to as the test distri-
butions, due to the central roles they play in null hypothesis significance testing (cf. Chs. 12 and
13).
X ∼ χ2 (n) , (8.69)
$$X := \sum_{i=1}^{n} Z_i^{2} = Z_1^{2} + \ldots + Z_n^{2} \ , \quad \text{with } n \in \mathbb{N} \ . \quad (8.70)$$
Spectrum of values:
X 7→ x ∈ D ⊆ R≥0 . (8.71)
Figure 8.10: pdf of the χ²–distribution for df = n ∈ {3, 5, 10, 30} degrees of freedom.
Expectation value, variance, skewness and excess kurtosis (cf. Rinne (2008) [87, p 320f]):
$$E(X) = n \quad (8.72)$$
$$\mathrm{Var}(X) = 2n \quad (8.73)$$
$$\mathrm{Skew}(X) = \sqrt{\frac{8}{n}} \quad (8.74)$$
$$\mathrm{Kurt}(X) = \frac{12}{n} \ . \quad (8.75)$$
α–quantiles, χ2n;α , of χ2 –distributions are generally tabulated in textbooks on Statistics. Alterna-
tively, they may be obtained from R, EXCEL, or from OpenOffice.
Note that for n ≥ 50 a χ2 –distribution may be approximated reasonably well by a normal distri-
bution, N(n, 2n). This is a reflection of the central limit theorem, to be discussed in Sec. 8.15
below.
R: dchisq(x, n), pchisq(x, n), qchisq(α, n), rchisq(nsimulations , n)
GDC: χ2 pdf(x, n), χ2 cdf(0, x, n)
EXCEL, OpenOffice: CHISQ.DIST, CHISQ.INV (dt.: CHIQU.VERT, CHIQVERT,
CHIQU.INV, CHIQINV)
Figure 8.11: pdf of the t–distribution for df = n ∈ {2, 3, 5, 50} degrees of freedom. For the case
t(50), the tpdf is essentially equivalent to the standard normal pdf. Notice the fatter tails of the
tpdf for small values of n.
Expectation value, variance, skewness and excess kurtosis (cf. Rinne (2008) [87, p 327]):
$$E(X) = 0 \quad (8.78)$$
$$\mathrm{Var}(X) = \frac{n}{n - 2} \quad \text{for } n > 2 \quad (8.79)$$
$$\mathrm{Skew}(X) = 0 \quad \text{for } n > 3 \quad (8.80)$$
$$\mathrm{Kurt}(X) = \frac{6}{n - 4} \quad \text{for } n > 4 \ . \quad (8.81)$$
α–quantiles, tn;α , of t–distributions, for which, due to the reflection symmetry of the tpdf, the
identity tn;α = −tn;1−α holds, are generally tabulated in textbooks on Statistics. Alternatively,
they may be obtained from R, some GDCs, EXCEL, or from OpenOffice.
Note that for n ≥ 50 a t–distribution may be approximated reasonably well by the standard normal
distribution, N(0; 1). Again, this is a manifestation of the central limit theorem, to be discussed
in Sec. 8.15 below. For n = 1, a t–distribution amounts to the special case a = 1, b = 0 of the
Cauchy distribution; cf. Sec. 8.14.
each of which satisfies a χ2 –distribution with n1 resp. n2 degrees of freedom. Then the quotient
random variable
$$F_{n_1,n_2} := \frac{X/n_1}{Y/n_2} \sim F(n_1, n_2) \ , \quad \text{with } n_1, n_2 \in \mathbb{N} \ , \quad (8.83)$$
Spectrum of values:
Fn1 ,n2 7→ fn1 ,n2 ∈ D ⊆ R≥0 . (8.84)
Figure 8.12: pdf of the F –distribution for four combinations of degrees of freedom (df1 =
n1 , df2 = n2 ). The curves correspond to the cases F (80, 40), F (40, 20), F (6, 10) and F (3, 5),
respectively.
Expectation value, variance, skewness and excess kurtosis (cf. Rinne (2008) [87, p 332]):
$$E(X) = \frac{n_2}{n_2 - 2} \quad \text{for } n_2 > 2 \quad (8.85)$$
$$\mathrm{Var}(X) = \frac{2\, n_2^{2}\, (n_1 + n_2 - 2)}{n_1 (n_2 - 2)^{2} (n_2 - 4)} \quad \text{for } n_2 > 4 \quad (8.86)$$
$$\mathrm{Skew}(X) = \frac{(2n_1 + n_2 - 2)\sqrt{8(n_2 - 4)}}{(n_2 - 6)\sqrt{n_1 (n_1 + n_2 - 2)}} \quad \text{for } n_2 > 6 \quad (8.87)$$
$$\mathrm{Kurt}(X) = 12\, \frac{n_1 (5n_2 - 22)(n_1 + n_2 - 2) + (n_2 - 2)^{2} (n_2 - 4)}{n_1 (n_2 - 6)(n_2 - 8)(n_1 + n_2 - 2)} \quad \text{for } n_2 > 8 \ . \quad (8.88)$$
α–quantiles, fn1 ,n2 ;α , of F –distributions are tabulated in advanced textbooks on Statistics. Alter-
natively, they may be obtained from R, EXCEL, or from OpenOffice.
R: df(x, n1 , n2 ), pf(x, n1 , n2 ), qf(α, n1 , n2 ), rf(nsimulations , n1 , n2 )
GDC: F pdf(x, n1 , n2 ), F cdf(0, x, n1 , n2 )
EXCEL, OpenOffice: F.DIST, F.INV (dt.: F.VERT, FVERT, F.INV, FINV)
Figure 8.13: pdf of the Pareto distribution. Displayed are the cases Par(1/3, 1), Par(1/2, 1), Par(ln(5)/ln(4), 1) and Par(5/2, 1).
It is important to realise that E(X), Var(X), Skew(X) and Kurt(X) are well-defined only for the
values of γ indicated; otherwise these measures do not exist.
α–quantiles:
$$\alpha \overset{!}{=} F_X(x_\alpha) = 1 - \left(\frac{x_{\min}}{x_\alpha}\right)^{\gamma} \quad \Leftrightarrow \quad x_\alpha = F_X^{-1}(\alpha) = \sqrt[\gamma]{\frac{1}{1 - \alpha}}\; x_{\min} \quad \text{for all } 0 < \alpha < 1 \ . \quad (8.97)$$
This result forms the basis of Pareto’s famous 80/20 rule concerning concentration in the distribu-
tion of various assets of general importance in a given population. According to Pareto’s empirical
findings, typically 80% of such an asset are owned by just 20% of the population considered (and
vice versa); cf. Pareto (1896) [77].⁵ The 80/20 rule applies exactly for a value of the power-law index of γ = ln(5)/ln(4) ≈ 1.16. It is a prominent example of the phenomenon of universality, fre-
quently observed in the mathematical modelling of quantitative–empirical relationships between
variables in a wide variety of scientific disciplines; cf. Gleick (1987) [34, p 157ff].
For purposes of numerical simulation it is useful to work with a truncated Pareto distribution,
for which the one-dimensional random variable X takes values in an interval [xmin , xcut ] ⊂ R>0 .
Samples of random values for such an X can be easily generated from a one-dimensional random
variable Y that is uniformly distributed on the interval [0, 1]. The sample values of the latter are
subsequently transformed according to the formula; cf. Ref. [120]:
$$x(y) = \frac{x_{\min}\, x_{\text{cut}}}{\left[x_{\text{cut}}^{\gamma} - \left(x_{\text{cut}}^{\gamma} - x_{\min}^{\gamma}\right) y\right]^{1/\gamma}} \ . \quad (8.102)$$
The required uniformly distributed random numbers y ∈ [0, 1] can be obtained, e.g., from
R by means of runif(nsimulations , 0, 1), or from the random number generator RAND() (dt.:
ZUFALLSZAHL()) in EXCEL or in OpenOffice.
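A minimal R sketch of this sampling procedure, with arbitrarily chosen values for γ, x_min and x_cut, is given below.

# Sampling from a truncated Pareto distribution via Eq. (8.102); parameter values are illustrative
gamma <- log(5) / log(4)       # the 80/20 power-law index
x.min <- 1; x.cut <- 100       # truncation interval [x.min, x.cut]
y <- runif(10000, 0, 1)        # uniformly distributed random numbers on [0, 1]
x <- (x.min * x.cut) / (x.cut^gamma - (x.cut^gamma - x.min^gamma) * y)^(1 / gamma)
summary(x)                     # all sampled values lie within [x.min, x.cut]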
X ∼ Ex(λ) , (8.103)
Figure 8.14: pdf of the exponential distribution according to Eq. (8.105). Displayed are the cases
Ex(1/4), Ex(1/2), Ex(1) and Ex(2).
$$E(X) = \frac{1}{\lambda} \quad (8.107)$$
$$\mathrm{Var}(X) = \frac{1}{\lambda^{2}} \quad (8.108)$$
$$\mathrm{Skew}(X) = 2 \quad (8.109)$$
$$\mathrm{Kurt}(X) = 6 \ . \quad (8.110)$$
α–quantiles:
$$\alpha \overset{!}{=} F_X(x_\alpha) = 1 - \exp[-\lambda x_\alpha] \quad \Leftrightarrow \quad x_\alpha = F_X^{-1}(\alpha) = -\frac{\ln(1 - \alpha)}{\lambda} \quad \text{for all } 0 < \alpha < 1 \ . \quad (8.111)$$
X ∼ Lo(µ; s) , (8.112)
depends on two free parameters: a location parameter µ ∈ R and a scale parameter s ∈ R>0 .
Spectrum of values:
X 7→ x ∈ R . (8.113)
Probability density function (pdf):
$$f_X(x) = \frac{\exp\!\left(-\dfrac{x - \mu}{s}\right)}{s\left[1 + \exp\!\left(-\dfrac{x - \mu}{s}\right)\right]^{2}} \ , \quad \mu \in \mathbb{R} \ , \ s \in \mathbb{R}_{>0} \ ; \quad (8.114)$$
Figure 8.15: pdf of the logistic distribution according to Eq. (8.114). Displayed are the cases
Lo(−2; 1/4), Lo(−1; 1/2), Lo(0; 1) and Lo(1; 2).
$$F_X(x) = P(X \le x) = \frac{1}{1 + \exp\!\left(-\dfrac{x - \mu}{s}\right)} \ . \quad (8.115)$$
Expectation value, variance, skewness and excess kurtosis (cf. Rinne (2008) [87, p 359]):
$$E(X) = \mu \quad (8.116)$$
$$\mathrm{Var}(X) = \frac{s^{2}\pi^{2}}{3} \quad (8.117)$$
$$\mathrm{Skew}(X) = 0 \quad (8.118)$$
$$\mathrm{Kurt}(X) = \frac{6}{5} \ . \quad (8.119)$$
α–quantiles:
$$\alpha \overset{!}{=} F_X(x_\alpha) = \frac{1}{1 + \exp\!\left(-\dfrac{x_\alpha - \mu}{s}\right)} \quad \Leftrightarrow \quad x_\alpha = F_X^{-1}(\alpha) = \mu + s \ln\!\left(\frac{\alpha}{1 - \alpha}\right) \quad \text{for all } 0 < \alpha < 1 \ . \quad (8.120)$$
R: dlogis(x, µ, s), plogis(x, µ, s), qlogis(α, µ, s), rlogis(nsimulations , µ, s)
Figure 8.16: pdf of the special hyperbolic distribution according to Eq. (8.123).
X ∼ Ca(b; a) , (8.130)
with properties
⁷ Use polynomial division to simplify the integrands in the ensuing moment integrals when verifying these results.
Spectrum of values:
X 7→ x ∈ R . (8.131)
Probability density function (pdf):
$$f_X(x) = \frac{1}{\pi}\,\frac{a}{a^{2} + (x - b)^{2}} \ , \quad \text{with } a \in \mathbb{R}_{>0} \ , \ b \in \mathbb{R} \ ; \quad (8.132)$$
its graph is shown in Fig. 8.17 below for four particular cases.
its graph is shown in Fig. 8.17 below for four particular cases.
Figure 8.17: pdf of the Cauchy distribution according to Eq. (8.132). Displayed are the cases
Ca(−2; 2), Ca(−1; 3/2), Ca(0; 1) and Ca(1; 3/4). The case Ca(0; 1) corresponds to a t–
distribution with df = 1 degree of freedom; cf. Sec. 8.8.
(ii) finite variances σ12 , . . . , σn2 , which are not too different from one another, and
Introduce for this set a total sum Yn according to Eq. (7.36), and, by standardisation via Eq. (7.34),
a related standardised summation random variable
$$Z_n := \frac{Y_n - \displaystyle\sum_{i=1}^{n} \mu_i}{\sqrt{\displaystyle\sum_{j=1}^{n} \sigma_j^{2}}} \ . \quad (8.139)$$
i.e., that asymptotically the standard deviation of the total sum dominates the standard devia-
tions of any of the individual Xi , and certain additional regularity requirements (see, e.g., Rinne
(2008) [87, p 427 f]), the central limit theorem in its general form according to the Finnish
mathematician Jarl Waldemar Lindeberg (1876–1932) and the Croatian–American mathematician
William Feller (1906–1970) states that in the asymptotic limit of infinitely many Xi contributing
to Yn (and so to Zn ), it holds that
$$\lim_{n \to \infty} F_n(z_n) = \Phi(z) \ , \quad (8.141)$$
i.e., the limit of the sequence of probability distributions Fn (zn ) for the standardised sum-
mation random variables Zn is constituted by the standard normal distribution N(0; 1),
discussed in Sec. 8.6; cf. Lindeberg (1922) [63] and Feller (1951) [20]. Earlier results
on the asymptotic distributional properties of a sum of independent additive one-dimensional
random variables were obtained by the Russian mathematician, mechanician and physicist
Aleksandr Mikhailovich Lyapunov (1857–1918); cf. Lyapunov (1901) [66].
Thus, under fairly general conditions, the normal distribution acts as a stable attractor distri-
bution for the sum of n mutually stochastically independent, additive random variables Xi .9 In
oversimplified terms: this result bears a certain economical convenience for most practical pur-
poses in that, given favourable conditions, when the size of a random sample is sufficiently large
(in practice, a typical rule of thumb is n ≥ 50), one essentially needs to know the characteristic
features of only a single continuous univariate probability distribution to perform, e.g., null hypoth-
esis significance testing within the frequentist framework; cf. Ch. 11. As will become apparent
in subsequent chapters, the central limit theorem has profound ramifications for applications in all
empirical scientific disciplines.
Note that for finite n the central limit theorem makes no statement as to the nature of the tails of
the probability distribution for Zn (or for Yn ), where, in principle, it can be very different from a
normal distribution; cf. Bouchaud and Potters (2003) [4, p 25f].
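The attractor property expressed by the central limit theorem is easily visualised by a small simulation in R; the choice of an exponential parent distribution, the sample size n and the number of repetitions below are arbitrary assumptions for illustration only.

# Simulation of the central limit theorem: standardised sums of i.i.d. Ex(1) variables
set.seed(42)
n <- 50; n.sim <- 10000
z.n <- replicate(n.sim, {
  x <- rexp(n, rate = 1)                 # n i.i.d. exponentially distributed variables
  (sum(x) - n) / sqrt(n)                 # standardised sum Z_n of Eq. (8.139), since mu = sigma^2 = 1
})
# hist(z.n, freq = FALSE, breaks = 50); curve(dnorm(x), add = TRUE)   # compare with N(0; 1)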
A direct consequence of the central limit theorem and its preconditions is the fact that for the
sample mean X̄n , defined in Eq. (7.36) above, both
$$\lim_{n \to \infty} E(\bar{X}_n) = \lim_{n \to \infty} \frac{\displaystyle\sum_{i=1}^{n} \mu_i}{n} \quad \text{and} \quad \lim_{n \to \infty} \mathrm{Var}(\bar{X}_n) = \lim_{n \to \infty} \frac{\displaystyle\sum_{i=1}^{n} \sigma_i^{2}}{n^{2}}$$
converge to finite values. This property is most easily recognised in the special case of n mu-
tually stochastically independent and identically distributed (in short: “i.i.d.”) additive one-
dimensional random variables X1 , . . . , Xn , which have common finite expectation value µ, com-
mon finite variance σ 2 , and common cdf F (x).10 Then,
$$\lim_{n \to \infty} E(\bar{X}_n) = \lim_{n \to \infty} \frac{n\mu}{n} = \mu \quad (8.142)$$
$$\lim_{n \to \infty} \mathrm{Var}(\bar{X}_n) = \lim_{n \to \infty} \frac{n\sigma^{2}}{n^{2}} = \lim_{n \to \infty} \frac{\sigma^{2}}{n} = 0 \ . \quad (8.143)$$
⁹ Put differently, for increasingly large n the cdf of the total sum Y_n approximates a normal distribution with expectation value $\sum_{i=1}^{n} \mu_i$ and variance $\sum_{i=1}^{n} \sigma_i^{2}$ to an increasingly accurate degree. In particular, all reproductive distributions may be approximated by a normal distribution as n becomes large.
¹⁰ These conditions lead to the central limit theorem in the special form according to Jarl Waldemar Lindeberg (1876–1932) and the French mathematician Paul Pierre Lévy (1886–1971).
This result is known as the law of large numbers according to the Swiss mathematician
Jakob Bernoulli (1654–1705); the sample mean X̄n converges stochastically to its expectation
value µ.
We point out that a counter-example to the central limit theorem is given by a set of n i.i.d. Pareto-distributed one-dimensional random variables Xi with exponent γ ≤ 2, since in this case the variance of the Xi is undefined; cf. Eq. (8.94).
This ends Part II of these lecture notes, and we now turn to Part III in which we focus on a number
of useful applications of inferential statistical methods of data analysis within the frequentist
framework. Data analysis techniques within the conceptually compelling Bayes–Laplace frame-
work have been reviewed, e.g., in the online lecture notes by Saha (2002) [88], in the textbooks
by Sivia and Skilling (2006) [92], Gelman et al (2014) [30] and McElreath (2016) [69], and in the
lecture notes of Ref. [19].
Chapter 9
Likert's scaling method of summated item ratings
2: disagree/unfavourable
3: undecided
4: agree/favourable
In the research literature, one also encounters 7–level or 10–level item rating scales, which offer
more flexibility. Note that it is assumed (!) from the outset that the items Xi , and thus their ratings,
can be treated as additive, so that the conceptual principles of Sec. 7.6 relating to sums of random
variables can be relied upon. When forming the sum over the ratings of all the indicator items
one selected, it is essential to carefully pay attention to the polarity of the items involved. For the
resultant total sum $\sum_i X_i$ to be consistent, the polarity of all items used needs to be uniform.¹
The construction of a consistent and coherent Likert scale for a one-dimensional latent statistical
variable XL involves four basic steps (see, e.g., Trochim (2006) [109]):
(i) the compilation of an initial list of 80 to 100 potential indicator items Xi for the one-
dimensional latent variable of interest,
(ii) the draw of a gauge random sample from the target population Ω,
(iii) the computation of the total sum $\sum_i X_i$ of item ratings, and, most importantly,

(iv) the performance of an item analysis based on the sample data and the associated total sum $\sum_i X_i$ of item ratings.
The item analysis, in particular, consists of the consequential application of two exclusion criteria,
which aim at establishing the scientific quality of the final Likert scale. Items are being discarded
from the list when either
(a) they show a weak item-to-total correlation with the total sum $\sum_i X_i$ (a rule of thumb is to exclude items with correlations less than 0.5), or
(b) it is possible to increase the value of Cronbach’s2 α–coefficient (see Cronbach (1951)
[14]), a measure of the scale’s internal consistency reliability, by excluding a particular
item from the list (the objective being to attain α-values greater than 0.8).
$$\alpha := \frac{k}{k - 1}\left(1 - \frac{\sum_{i=1}^{k} S_i^{2}}{S_{\text{total}}^{2}}\right) \ ,$$
where $S_i^{2}$ denotes the sample variance associated with the ith indicator item (perceived as being metrically scaled), and $S_{\text{total}}^{2}$ is the sample variance of the total sum $\sum_i X_i$.
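A minimal base-R sketch of this computation is given below; the simulated item ratings are purely hypothetical and serve only to show how the α–coefficient is obtained from the sample variances just described.

# Cronbach's alpha from a data set of k item ratings (rows: respondents); hypothetical simulated data
set.seed(1)
items <- as.data.frame(replicate(5, sample(1:5, size = 100, replace = TRUE)))
k <- ncol(items)
S2.items <- apply(items, 2, var)    # sample variances S_i^2 of the individual items
S2.total <- var(rowSums(items))     # sample variance of the total sum of item ratings
(k / (k - 1)) * (1 - sum(S2.items) / S2.total)   # Cronbach's alpha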
Table 9.1: Structure of a discrete k-indicator-item Likert scale for some one-dimensional latent
statistical variable XL , based on a visualised equidistant 5–level item rating scale.
The outcome of the item analysis is a drastic reduction of the initial list to a set of just k ∈ N
indicator items Xi (i = 1, . . . , k) of high discriminatory power, where k is typically in the range
of 10 to 15.3 The associated total sum
$$X_L := \sum_{i=1}^{k} X_i \quad (9.2)$$
The structure of a finalised discrete k-indicator-item Likert scale for some one-dimensional la-
tent statistical variable XL with an equidistant graphical 5–level item rating scale is displayed in
Tab. 9.1.
Likert’s scaling method of aggregating information from a set of k highly interdependent
X ordinally
scaled items to form an effectively quasi-metrical, one-dimensional total sum XL = Xi draws
i
its legitimisation to a large extent from a generalised version of the central limit theorem (cf.
Sec. 8.15), wherein the precondition of mutually stochastically independent variables contributing
to the sum is relaxed. In practice it is found that
Xfor many cases of interest in the samples one
has available for research the total sum XL = Xi is normally distributed in to a very good
i
approximation. Nevertheless, the normality property of Likert scale data needs to be established
on a case-by-case basis. The main shortcoming of Likert’s approach is its dependency of the
gauging process of the scale on the target population.
In the Social Sciences there is available a broad variety of operationalisation procedures alter-
native to the discrete Likert scale. We restrict ourselves here to mention but one example,
³ However, in many research papers one finds Likert scales with a minimum of just four indicator items.
namely the continuous psychometric visual analogue scale (VAS) developed by Hayes and Pa-
terson (1921) [40] and by Freyd (1923) [26]. Further measurement scales for latent statistical
variables can be obtained from the websites zis.gesis.org, German Social Sciences mea-
surement scales (ZIS), and ssrn.com, Social Science Research Network (SSRN). On a historical
note: one of the first systematically designed questionnaires as a measurement tool for collecting
socio-economic data (from workers on strike at the time in Britain) was published by the Statistical
Society of London in 1838; see Ref. [97].
Chapter 10
Random sampling of target populations
Quantitative–empirical research methods may be employed for exploratory as well as for con-
firmatory data analysis. Here we will focus on the latter, in the context of a frequentist view-
point of Probability Theory and statistical inference. To investigate research questions sys-
tematically by statistical means, with the objective to make inferences about the distributional
properties of a set of statistical variables in a specific target population Ω of study objects, on
the basis of analysis of data from just a few units in a sample SΩ , the following three issues have
to be addressed in a clearcut fashion:
(i) the target population Ω of the research activity needs to be defined in an unambiguous way,
(ii) an adequate random sample SΩ needs to be drawn from an underlying sampling frame LΩ
associated with Ω, and
We will briefly discuss these issues in turn, beginning with a review in Tab. 10.1 of conventional
notation for distinguishing specific statistical measures relating to target populations Ω on the
one-hand side from the corresponding ones relating to random samples SΩ on the other.
One-dimensional random variables in a target population Ω (of size N), as what statistical vari-
ables will be understood to constitute subsequently, will be denoted by capital Latin letters such
as X, Y , . . . , Z, while their realisations in random samples SΩ (of size n) will be denoted by
lower case Latin letters such as xi , yi , . . . , zi (i = 1, . . . , n). In addition, one denotes population
parameters by lower case Greek letters, while for their corresponding point estimator functions
relating to random samples, which are also perceived as random variables, again capital Latin let-
ters are used for representation. The ratio n/N will be referred to as the sampling fraction. As
is standard in the statistical literature, we will denote a particular random sample of size n for a
one-dimensional random variable X by a set SΩ : (X1 , . . . , Xn ), with Xi representing any arbitrary
random variable associated with X in this sample.
In actual practice, it is often not possible to acquire access for the purpose of enquiry to every single
statistical unit belonging to an identified target population Ω, not even in principle. For example,
this could be due to the fact that Ω’s size N is far too large to be determined accurately. In this case,
Table 10.1: Notation for distinguishing between statistical measures relating to a target popula-
tion Ω on the one-hand side, and to the corresponding quantities and unbiased maximum likelihood
point estimator functions obtained from a random sample SΩ on the other.
to ensure a reliable investigation, one needs to resort to using a sampling frame LΩ for Ω. By
this one understands a representative list of elements in Ω to which access can actually be obtained
one way or another. Such a list will have to be compiled by some authority of scientific integrity.
In an attempt to avoid a notational overflow in the following, we will continue to use N to denote
both: the size of the target population Ω and the size of its associated sampling frame LΩ (even
though this is not entirely accurate). As regards the specific sampling process, one may distinguish
cross-sectional one-off sampling at a fixed instant from longitudinal multiple sampling over a
finite time interval.1
We now proceed to introduce the three most commonly practiced methods of drawing random
samples from given fixed target populations Ω of statistical units.
We emphasise at this point that empirical data gained from convenience samples (in contrast to
random samples) is not amenable to statistical inference, in that its information content cannot
be generalised to the target population Ω from which it was drawn; see, e.g., Bryson (1976) [9,
p 185], or Schnell et al (2013) [91, p 289].
For metrically scaled one-dimensional random variables X, defining for a given random sample
SΩ : (X1 , . . . , Xn ) of size n a sample total sum by
$$Y_n := \sum_{i=1}^{n} X_i \ , \quad (10.5)$$
the two most prominent maximum likelihood point estimator functions satisfying the unbiased-
ness and consistency conditions are the sample mean and sample variance, defined by
$$\bar{X}_n := \frac{1}{n}\, Y_n \quad (10.6)$$
$$S_n^{2} := \frac{1}{n - 1} \sum_{i=1}^{n} \left(X_i - \bar{X}_n\right)^{2} \ . \quad (10.7)$$
These will be frequently employed in subsequent considerations in Ch. 12 for point-estimating the
values of the location and scale parameters µ and σ 2 of the distribution for a one-dimensional
random variable X in a target population Ω. Sampling theory in the frequentist framework
holds it that the standard errors (SE) associated with the maximum likelihood point estimator
functions X̄n and Sn2 , defined in Eqs. (10.6) and (10.7), amount to the standard deviations of the
underlying theoretical sampling distributions for these functions; see, e.g., Cramér (1946) [13,
Chs. 27 to 29]. For a given target population Ω (or sampling frame LΩ) of size N, imagine drawing all $\binom{N}{n}$ possible mutually independent random samples of a fixed size n (no order accounted for and repetitions excluded), from each of which individual realisations of $\bar{X}_n$ and $S_n^{2}$
are obtained. The theoretical distributions for all such realisations of X̄n resp. Sn2 for given N and
n are referred to as their corresponding sampling distributions. A useful simulation illustrating
the concept of a sampling distribution is available at the website onlinestatbook.com. In
the limit that N → ∞ while keeping n fixed, the theoretical sampling distributions of X̄n and Sn2
become normal (cf. Sec. 8.6) resp. χ2 with n − 1 degrees of freedom (cf. Sec. 8.7), with standard
deviations
$$SE_{\bar{X}_n} := \frac{S_n}{\sqrt{n}} \quad (10.8)$$
$$SE_{S_n^2} := \sqrt{\frac{2}{n - 1}}\; S_n^{2} \ ; \quad (10.9)$$
cf., e.g., Lehman and Casella (1998) [59, p 91ff], and Levin et al (2010) [61, Ch. 6]. Thus,
for a finite sample standard deviation $S_n$, these two standard errors decrease with the sample size n in proportion to the inverse of $\sqrt{n}$ resp. the inverse of $\sqrt{n - 1}$. It is a main criticism of
proponents of the Bayes–Laplace approach to Probability Theory and statistical inference that
the concept of a sampling distribution for a maximum likelihood point estimator function is based
on unobserved data; cf. Greenberg (2013) [35, p 31f].
There are likewise unbiased maximum likelihood point estimators for the shape parameters γ1 and
γ2 of the probability distribution for a one-dimensional random variable X in a target population Ω,
as given in Eqs. (7.29) and (7.30). For n > 2 resp. n > 3, the sample skewness and sample excess
kurtosis in, e.g., their implementation in the software packages R (package: e1071, by Meyer et
al (2019) [71]) or SPSS are defined by (see, e.g., Joanes and Gill (1998) [45, p 184])
$$G_1 := \frac{\sqrt{(n-1)\,n}}{n-2}\; \frac{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^{3}}{\left[\frac{1}{n}\sum_{j=1}^{n}\left(X_j - \bar{X}_n\right)^{2}\right]^{3/2}} \quad (10.10)$$
$$G_2 := \frac{n-1}{(n-2)(n-3)}\left[(n+1)\left(\frac{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^{4}}{\left[\frac{1}{n}\sum_{j=1}^{n}\left(X_j - \bar{X}_n\right)^{2}\right]^{2}} - 3\right) + 6\right] \ , \quad (10.11)$$
with associated standard errors (cf. Joanes and Gill (1998) [45, p 185f])
$$SE_{G_1} := \sqrt{\frac{6(n-1)n}{(n-2)(n+1)(n+3)}} \quad (10.12)$$
$$SE_{G_2} := 2\sqrt{\frac{6(n-1)^{2} n}{(n-3)(n-2)(n+3)(n+5)}} \ . \quad (10.13)$$
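The implementation in the e1071 package referred to above can be called as sketched below (its option type = 2 corresponds to the definitions of G1 and G2 given here); the simulated data are arbitrary.

# Sample skewness G_1 and excess kurtosis G_2 via the e1071 package; illustrative simulated data
library(e1071)
set.seed(7)
x <- rexp(100, rate = 1)
n <- length(x)
skewness(x, type = 2)                                   # G_1
kurtosis(x, type = 2)                                   # G_2
sqrt(6 * (n - 1) * n / ((n - 2) * (n + 1) * (n + 3)))   # SE of G_1, Eq. (10.12)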
Chapter 11
Null hypothesis significance testing
Null hypothesis significance testing by means of observable quantities is the centrepiece of the
current body of inferential statistical methods in the frequentist framework. Its logic of an on-
going routine of systematic falsification of null hypotheses by empirical means is firmly rooted in
the ideas of critical rationalism and logical positivism. The latter were expressed most emphat-
ically by the Austro–British philosopher Sir Karl Raimund Popper CH FRS FBA (1902–1994);
see, e.g., Popper (2002) [83]. The systematic procedure for null hypothesis significance
testing on the grounds of observational evidence, as practiced today within the frequentist
framework as a standardised method of probability-based decision-making, was developed
during the first half of the 20th Century, predominantly by the English statistician, evolution-
ary biologist, eugenicist and geneticist Sir Ronald Aylmer Fisher FRS (1890–1962), the Polish–
US-American mathematician and statistician Jerzy Neyman (1894–1981), the English mathe-
matician and statistician Karl Pearson FRS (1857–1936), and his son, the English statistician
Egon Sharpe Pearson CBE FRS (1895–1980); cf. Fisher (1935) [24], Neyman and Pearson
(1933) [75], and Pearson (1900) [78]. We will describe the main steps of the systematic test
procedure in the following.
Generically, statistical hypotheses need to be viewed as probabilistic statements. As such the
researcher will always have to deal with a fair amount of uncertainty in deciding whether an
observed, potentially only apparent effect is statistically significant and/or practically significant
in Ω or not. Bernstein (1998) [3, p 207] summarises the circumstances relating to the test of a
specific hypothesis as follows:
The question arises as to which kinds of quantitative problems can be efficiently settled by statistical
means? With respect to a given target population Ω, in the simplest kinds of applications of null
hypothesis significance testing, one may (a) test for differences in the distributional properties of
a single one-dimensional statistical variable X between a number of subgroups of Ω, necessitating
univariate methods of data analysis, or one may (b) test for association for a two-dimensional
statistical variable (X, Y ), thus requiring bivariate methods of data analysis. The standardised
procedure for null hypothesis significance testing, practiced within the frequentist framework
for the purpose of assessing statistical significance of an observed, potentially apparent effect,
takes the following six steps on the way to making a decision:
Six-step procedure for null hypothesis significance testing
1. Formulation, with respect to the target population Ω, of a pair of mutually exclusive hy-
potheses:
(a) the null hypothesis H0 conjectures that “there exists no effect in Ω of the kind envis-
aged by the researcher,” while
(b) the research hypothesis H1 conjectures that “there does exist a true effect in Ω of the
kind envisaged by the researcher.”
The starting point of the test procedure is the assumption (!) that it is the content of the H0
conjecture which is realised in Ω. The objective is to try to refute H0 empirically on the basis
of random sample data drawn from Ω, to a level of significance which needs to be specified
in advance. In this sense it is H0 which is being subjected to a statistical test.1 The striking
asymmetry regarding the roles of H0 and H1 in the test procedure embodies the notion of a
falsification of hypotheses, as advocated by critical rationalism.
2. Specification of a significance level α prior to the performance of the test, where, by conven-
tion, α ∈ [0.01, 0.05]. The parameter α is synonymous with the probability of committing a
Type I error (to be defined below) in making a test decision.
A fitting metaphor for the six-step procedure for null hypothesis significance testing just de-
scribed is that of a statistical long jump competition. The issue here is to find out whether actual
empirical data deviates sufficiently strongly from the “no effect” reference state conjectured in the
given null hypothesis H0 , so as to land in the corresponding rejection region Bα within the spec-
trum of values of the test statistic Tn (X1 , . . . , Xn ). Steps 1 to 4 prepare the long jump facility
(the test stage), while the evaluation of the outcome of the jump attempt takes place in steps 5
and 6. Step 4 necessitates the direct application of Probability Theory within the frequentist
framework in that the determination of the rejection region Bα for H0 entails the calculation of
a conditional event probability from an assumed test distribution.
When an effect observed on the basis of random sample data proves to possess statistical signifi-
cance (to a predetermined significance level), this means that most likely it has come about not
by chance due to the sampling methodology. A different matter altogether is whether such an
effect also possesses practical significance, so that, for instance, management decisions ought
to be adapted to it. Practical significance of an observed effect can be evaluated, e.g., with the
standardised and scale-invariant effect size measures proposed by Cohen (1992, 2009) [11, 12].
Addressing the practical significance of an observed effect should be commonplace in any report
on inferential statistical data analysis; see also Sullivan and R Feinn (2012) [102].
When performing null hypothesis significance testing, the researcher is always at risk of making
a wrong decision. Hereby, one distinguishes between the following two kinds of potential error:
² Within the frequentist framework of null hypothesis significance testing the test statistic and its partner test distribution form an intimate pair of decision-making devices.
³ The statistical software packages R and SPSS provide p–values as a means for making decisions in null hypothesis significance testing.
The Bayes–Laplace approach can be viewed as a proposal to the formalisation of the process
of learning. Note that the posterior probability distribution of one round of data generation and
analysis can serve as the prior probability distribution for a subsequent round of generation and
analysis of new data. Further details on the principles within the Bayes–Laplace framework
underlying the estimation of distribution parameters, the optimal curve-fitting to a given set of
empirical data points, and the related selection of an adequate mathematical model are given in,
e.g., Greenberg (2013) [35, Chs. 3 and 4], Saha (2002) [88, p 8ff], Lupton (1993) [65, p 50ff], and
in Ref. [19].
This result specialises to p = 2 [1 − FTn (|tn |)] if the respective pdf of the test distribu-
tion exhibits reflection symmetry with respect to a vertical axis at tn = 0, i.e., when
FTn (−|tn |) = 1 − FTn (|tn |) holds.
• left-sided statistical test,
p := P (Tn < tn |H0 ) = FTn (tn ) , (11.4)
With respect to the test decision criterion of rejecting an H0 whenever p < α, one refers to
(i) cases with p < 0.05 as significant test results, and to (ii) cases with p < 0.01 as highly
significant test results.4
Remark: User-friendly routines for the computation of p–values are available in R, SPSS, EXCEL
and OpenOffice, and also on some GDCs.
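For instance, the p–values of Eqs. (11.3)–(11.5) can be computed in R from a realised test statistic as sketched below; the t–test distribution and the numerical values of t_n and df are assumptions chosen purely for illustration.

# p-values from a realised test statistic t_n, assuming a t(df) test distribution; values are illustrative
t.n <- 2.1; df <- 24
2 * (1 - pt(abs(t.n), df))   # two-sided test (reflection-symmetric test distribution)
pt(t.n, df)                  # left-sided test, Eq. (11.4)
1 - pt(t.n, df)              # right-sided test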
In the following two chapters, we will turn to discuss a number of standard problems in Inferential
Statistics within the frequentist framework, in association with the quantitative–empirical tools
that have been developed in this context to tackle them. In Ch. 12 we will be concerned with prob-
lems of a univariate nature, in particular, testing for statistical differences in the distributional
properties of a single one-dimensional statistical variable X between two of more subgroups of
some target population Ω, while in Ch. 13 the problems at hand will be of a bivariate nature,
testing for statistical association in Ω for a two-dimensional statistical variable (X, Y ). An en-
tertaining exhaustive account of the history of statistical methods of data analysis prior to the year
1900 is given by Stigler (1986) [99].
⁴ Lakens (2017) [55] posted a stimulating blog entry on the potential traps associated with the interpretation of a p–value in statistical data analysis. His remarks come along with illustrative demonstrations in R, including the underlying codes.
Chapter 12
Univariate methods of statistical data analysis
In this chapter we present a selection of standard inferential statistical techniques within the fre-
quentist framework that, based upon the random sampling of some target population Ω, were
developed for the purpose of (a) range-estimating unknown distribution parameters by means of
confidence intervals, (b) testing for differences between a given empirical distribution of a one-
dimensional statistical variable and its a priori assumed theoretical distribution, and (c) comparing
distributional properties and parameters of a one-dimensional statistical variable between two or
more subgroups of Ω. Since the methods to be introduced relate to considerations on distributions
of a single one-dimensional statistical variable only, they are thus referred to as univariate.
such that P (θ ∈ K1−α (θ)) = 1 − α applies. The interpretation of the confidence interval K1−α
is that upon arbitrarily many independent repetitions of the random sampling process, in (1 −
α)×100% of all cases the unknown distribution parameter θ will fall inside the boundaries of
K1−α and in α×100% of all cases it will not.1 In the following we will consider the two cases
which result when choosing θ ∈ {µ, σ 2 }.
in terms of a sum of rescaled squared residuals $\dfrac{(O_i - E_i)^{2}}{E_i}$, which, under H₀, approximately
follows a χ2 –test distribution with df = k − 1 − r degrees of freedom (cf. Sec. 8.7); r denotes
the number of free parameters of the reference distribution F0 (x) which need to be estimated from
the random sample data. For this test procedure to be reliable, it is important (!) that the size n of
the random sample be chosen such that the condition
$$E_i \overset{!}{\ge} 5 \quad (12.10)$$
holds for all categories i = 1, . . . , k, due to the fact that the Ei appear in the denominator of the test statistic in Eq. (12.9) (and so would artificially inflate the magnitudes of the summed ratios when the denominators become too small).
Test decision: The rejection region for H0 at significance level α is given by (right-sided test)
R: chisq.test(table(variable))
SPSS: Analyze → Nonparametric Tests → Legacy Dialogs → Chi-square . . .
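A hypothetical worked example in R, testing a die for compatibility with the discrete uniform distribution L(6), might look as follows; the observed frequencies are invented for illustration.

# Chi-squared goodness-of-fit test for a die; invented counts of faces 1 to 6 in n = 120 rolls
observed <- c(18, 25, 24, 16, 20, 17)
chisq.test(observed, p = rep(1/6, 6))   # H0: discrete uniform distribution; all E_i = 20 >= 5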
Effect size: In the present context, the practical significance of the phenomenon investigated can
be estimated from the realisation tn and the sample size n by
$$w := \sqrt{\frac{t_n}{n}} \ . \quad (12.13)$$
For the interpretation of its strength Cohen (1992) [11, Tab. 1] recommends the
Rule of thumb:
0.10 ≤ w < 0.30: small effect
0.30 ≤ w < 0.50: medium effect
0.50 ≤ w: large effect.
Note that in the spirit of critical rationalism the one-sample χ2 –goodness–of–fit–test provides a
tool for empirically excluding possibilities of distribution laws for X.
For sample sizes n < 50, however, the validity of the normality assumption for the X-distribution
may be estimated in terms of the magnitudes of the standardised skewness and excess kurtosis
measures,
$$\frac{G_1}{SE_{G_1}} \quad \text{and} \quad \frac{G_2}{SE_{G_2}} \ , \quad (12.14)$$
which are constructed from the quantities defined in Eqs. (10.10)–(10.13). At a significance level
α = 0.05, the normality assumption may be maintained as long as both measures are smaller than
the critical value of 1.96; cf. Hair et al (2010) [36, p 72f].
Formulated in a non-directed or a directed fashion, the starting point of the t–test resp. Z–test
procedures are the
Hypotheses:
$$\begin{cases} H_0: \ \mu = \mu_0 \ \text{ or } \ \mu \ge \mu_0 \ \text{ or } \ \mu \le \mu_0 \\ H_1: \ \mu \ne \mu_0 \ \text{ or } \ \mu < \mu_0 \ \text{ or } \ \mu > \mu_0 \end{cases} \ . \quad (12.15)$$
To measure the deviation of the sample data from the state conjectured to hold in the null hypoth-
esis H0 , the difference between the sample mean X̄n and the hypothesised population mean µ0 ,
normalised in analogy to Eq. (7.34) by the standard error
$$SE_{\bar{X}_n} := \frac{S_n}{\sqrt{n}} \quad (12.16)$$
Test statistic:
$$T_n(X_1, \ldots, X_n) = \frac{\bar{X}_n - \mu_0}{SE_{\bar{X}_n}} \overset{H_0}{\sim} \begin{cases} t(n - 1) & \text{for } n < 50 \\ N(0; 1) & \text{for } n \ge 50 \end{cases} \ , \quad (12.17)$$
which, under H0 , follows a t–test distribution with df = n − 1 degrees of freedom (cf. Sec. 8.8)
resp. a standard normal test distribution (cf. Sec. 8.6).
Test decision: Depending on the kind of test to be performed, the rejection region for H0 at
significance level α is given by
(a) two-sided test: H₀: µ = µ₀, H₁: µ ≠ µ₀; reject H₀ if |t_n| > t_{n−1;1−α/2} (t–test) resp. |t_n| > z_{1−α/2} (Z–test)
(b) left-sided test: H₀: µ ≥ µ₀, H₁: µ < µ₀; reject H₀ if t_n < t_{n−1;α} = −t_{n−1;1−α} (t–test) resp. t_n < z_α = −z_{1−α} (Z–test)
(c) right-sided test: H₀: µ ≤ µ₀, H₁: µ > µ₀; reject H₀ if t_n > t_{n−1;1−α} (t–test) resp. t_n > z_{1−α} (Z–test)
p–values associated with realisations tn of the test statistic (12.17) can be obtained from
Eqs. (11.3)–(11.5), using the relevant t–test distribution resp. the standard normal test dis-
tribution.
R: t.test(variable, mu = µ0 ),
t.test(variable, mu = µ0 , alternative = "less"),
t.test(variable, mu = µ0 , alternative = "greater")
GDC: mode STAT → TESTS → T-Test... when n < 50, resp. mode STAT → TESTS →
Z-Test... when n ≥ 50.
SPSS: Analyze → Compare Means → One-Sample T Test . . .
Note: Regrettably, SPSS provides no option for selecting between a “one-tailed” (left-/right-sided)
and a “two-tailed” (two-sided) t–test. The default setting is for a two-sided test. For the purpose
of one-sided tests the p–value output of SPSS needs to be divided by 2.
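A hypothetical worked example of the one-sample t–test in R reads as follows; the data are simulated purely for illustration.

# One-sample t-test of H0: mu = 100; simulated illustrative data
set.seed(123)
x <- rnorm(30, mean = 104, sd = 10)
t.test(x, mu = 100)                            # two-sided test
t.test(x, mu = 100, alternative = "greater")   # right-sided test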
Effect size: The practical significance of the phenomenon investigated can be estimated from the
sample mean x̄n, the sample standard deviation sn, and the reference value µ0 by the scale-invariant ratio
$$d := \frac{|\bar{x}_n - \mu_0|}{s_n} \ . \quad (12.18)$$
For the interpretation of its strength Cohen (1992) [11, Tab. 1] recommends the
Rule of thumb:
0.20 ≤ d < 0.50: small effect
0.50 ≤ d < 0.80: medium effect
0.80 ≤ d: large effect.
We remark that the statistical software package R holds available a routine
power.t.test(power, sig.level, delta, sd, n, alternative, type
= "one.sample") for the purpose of calculating any one of the parameters power, delta
or n (provided all remaining parameters have been specified) in the context of empirical investi-
gations employing the one-sample t–test for a population mean. One-sided tests are specified via
the parameter setting alternative = "one.sided".
(a) two-sided test: H₀: σ² = σ₀², H₁: σ² ≠ σ₀²; reject H₀ if t_n < χ²_{n−1;α/2} or t_n > χ²_{n−1;1−α/2}
p–values associated with realisations tn of the test statistic (12.20), which are to be calculated
from the χ2 –test distribution, can be obtained from Eqs. (11.3)–(11.5).
R: varTest(variable, sigma.squared = σ02 ) (package: EnvStats, by Millard
(2013) [72]),
varTest(variable, sigma.squared = σ02 , alternative = "less"),
varTest(variable, sigma.squared = σ02 , alternative = "greater")
Regrettably, the one-sample χ2 –test for a population variance does not appear to have been imple-
mented in the SPSS software package.
A test statistic is constructed from the difference of sample means, X̄n1 − X̄n2 , standardised by the
standard error
$$SE\!\left(\bar{X}_{n_1} - \bar{X}_{n_2}\right) := \sqrt{\frac{S_{n_1}^{2}}{n_1} + \frac{S_{n_2}^{2}}{n_2}} \ , \quad (12.22)$$
which derives from the associated theoretical sampling distribution for X̄n1 − X̄n2 . Thus, one
obtains the
Test statistic:
$$T_{n_1,n_2} := \frac{\bar{X}_{n_1} - \bar{X}_{n_2}}{SE\!\left(\bar{X}_{n_1} - \bar{X}_{n_2}\right)} \overset{H_0}{\sim} t(df) \ , \quad (12.23)$$
which, under H0 , satisfies a t–test distribution (cf. Sec. 8.8) with a number of degrees of freedom
determined by the relations
$$df := \begin{cases} n_1 + n_2 - 2 \ , & \text{when } \sigma_1^{2} = \sigma_2^{2} \\[2ex] \dfrac{\left(\dfrac{S_{n_1}^{2}}{n_1} + \dfrac{S_{n_2}^{2}}{n_2}\right)^{2}}{\dfrac{(S_{n_1}^{2}/n_1)^{2}}{n_1 - 1} + \dfrac{(S_{n_2}^{2}/n_2)^{2}}{n_2 - 1}} \ , & \text{when } \sigma_1^{2} \ne \sigma_2^{2} \end{cases} \ . \quad (12.24)$$
Test decision: Depending on the kind of test to be performed, the rejection region for H0 at
significance level α is given by
p–values associated with realisations tn1 ,n2 of the test statistic (12.23), which are to be calculated
from the t–test distribution, can be obtained from Eqs. (11.3)–(11.5).
R: t.test(variable~group variable),
t.test(variable~group variable, alternative = "less"),
t.test(variable~group variable, alternative = "greater")
GDC: mode STAT → TESTS → 2-SampTTest...
SPSS: Analyze → Compare Means → Independent-Samples T Test . . .
Note: Regrettably, SPSS provides no option for selecting between a one-sided and a two-sided
t–test. The default setting is for a two-sided test. For the purpose of one-sided tests the p–value
output of SPSS needs to be divided by 2.
Effect size: The practical significance of the phenomenon investigated can be estimated from the
sample means x̄n1 and x̄n2 and the pooled sample standard deviation
spooled := √( [ (n1 − 1)s²n1 + (n2 − 1)s²n2 ] / (n1 + n2 − 2) ) (12.25)
by the scale-invariant ratio
d := |x̄n1 − x̄n2| / spooled . (12.26)
For the interpretation of its strength Cohen (1992) [11, Tab. 1] recommends the
Rule of thumb:
0.20 ≤ d < 0.50: small effect
0.50 ≤ d < 0.80: medium effect
0.80 ≤ d: large effect.
R: cohen.d(variable, group variable, pooled = TRUE) (package: effsize,
by Torchiano (2018) [106])
We remark that the statistical software package R holds available a routine
power.t.test(power, sig.level, delta, sd, n, alternative) for the
purpose of calculation of any one of the parameters power, delta or n (provided all remaining
parameters have been specified) in the context of empirical investigations employing the indepen-
dent samples t–test for a population mean. Equal values of n are required here. One-sided tests
are addressed via the parameter setting alternative = "one.sided".
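A minimal R sketch of the workflow (hypothetical simulated data; the data frame df and its columns y and g are ours), combining the Welch variant of t.test, the pooled effect size of Eq. (12.26) via cohen.d, and a power calculation:

# Hypothetical example data: two independent groups of sizes 35 and 40
set.seed(1)
df <- data.frame(y = c(rnorm(35, 100, 15), rnorm(40, 108, 15)),
                 g = factor(rep(c("A", "B"), times = c(35, 40))))

t.test(y ~ g, data = df)                  # Welch two-sample t-test (var.equal = FALSE by default)

library(effsize)                          # pooled effect size d, Eq. (12.26)
cohen.d(df$y, df$g, pooled = TRUE)

# Sample size per group to detect delta = 8 (sd = 15) with power 0.8
power.t.test(delta = 8, sd = 15, sig.level = 0.05, power = 0.8,
             type = "two.sample")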
When the necessary conditions for the application of the independent samples t–test are not satis-
fied, the following alternative test procedures (typically of a weaker test power, though) for com-
paring two subgroups of Ω with respect to the distribution of a metrically scaled variable X exist:
(i) at the nominal scale level, provided Eij ≥ 5 for all i, j, the χ2 –test for homogeneity; cf.
Sec. 12.10 below, and
(ii) at the ordinal scale level, provided n1 , n2 ≥ 8, the two independent samples Mann–
Whitney–U –test for a median; cf. the following Sec. 12.6.
for which the identity U1 + U2 = n1 n2 applies. Choose U := min(U1 , U2 ).3 For independent
random samples of sizes n1 , n2 ≥ 8 (see, e.g., Bortz (2005) [5, p 151]), the standardised U–value
serves as the
Test statistic:
Tn1,n2 := (U − µU) / SEU ≈ N(0; 1) , (12.30)
which, under H0 , approximately satisfies a standard normal test distribution; cf. Sec. 8.6. Here,
µU denotes the mean of the U–value expected under H0 ; it is defined in terms of the sample sizes
by
µU := n1n2 / 2 ; (12.31)
SEU denotes the standard error of the U–value and can be obtained, e.g., from Bortz (2005) [5,
Eq. (5.49)].
Test decision: Depending on the kind of test to be performed, the rejection region for H0 at
significance level α is given by
Kind of test | Null hypothesis H0 | Research hypothesis H1 | Rejection region for H0
(a) two-sided | x̃0.5(1) = x̃0.5(2) | x̃0.5(1) ≠ x̃0.5(2) | |tn1,n2| > z1−α/2
(b) left-sided | x̃0.5(1) ≥ x̃0.5(2) | x̃0.5(1) < x̃0.5(2) | tn1,n2 < zα = −z1−α
(c) right-sided | x̃0.5(1) ≤ x̃0.5(2) | x̃0.5(1) > x̃0.5(2) | tn1,n2 > z1−α
3
Since the U –values are tied to each other by the identity U1 + U2 = n1 n2 , it makes no difference to this method
when one chooses U := max(U1 , U2 ) instead.
p–values associated with realisations tn1 ,n2 of the test statistic (12.30), which are to be calculated
from the standard normal test distribution, can be obtained from Eqs. (11.3)–(11.5).
Note: Regrettably, SPSS provides no option for selecting between a one-sided and a two-sided
U–test. The default setting is for a two-sided test. For the purpose of one-sided tests the p–value
output of SPSS needs to be divided by 2.
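By way of illustration, a minimal R sketch (hypothetical ordinal-level scores) of the two independent samples Mann–Whitney–U–test via the base routine wilcox.test; the statistic reported as W corresponds to a U–value:

# Hypothetical ordinal-level scores for two independent groups
set.seed(7)
scoreA <- sample(1:7, 20, replace = TRUE)
scoreB <- sample(2:7, 22, replace = TRUE)

wilcox.test(scoreA, scoreB)                           # two-sided U-test
wilcox.test(scoreA, scoreB, alternative = "less")     # left-sided
wilcox.test(scoreA, scoreB, alternative = "greater")  # right-sided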
Dealing with independent random samples of sizes n1 and n2 , the ratio of the corresponding sample
variances serves as a
Test statistic:
Tn1,n2 := S²n1 / S²n2 ∼ F(n1 − 1, n2 − 1) , (12.33)
which, under H0 , satisfies an F –test distribution with df1 = n1 − 1 and df2 = n2 − 1 degrees of
freedom; cf. Sec. 8.9.
Test decision: Depending on the kind of test to be performed, the rejection region for H0 at
significance level α is given by
4
Run the Kolmogorov–Smirnov–test to check whether the assumption of normality of the distribution of X in the
two random samples drawn needs to be rejected.
Kind of test | Null hypothesis H0 | Research hypothesis H1 | Rejection region for H0
(a) two-sided | σ1² = σ2² | σ1² ≠ σ2² | tn1,n2 < 1/fn2−1,n1−1;1−α/2 or tn1,n2 > fn1−1,n2−1;1−α/2
(b) left-sided | σ1² ≥ σ2² | σ1² < σ2² | tn1,n2 < 1/fn2−1,n1−1;1−α
(c) right-sided | σ1² ≤ σ2² | σ1² > σ2² | tn1,n2 > fn1−1,n2−1;1−α
p–values associated with realisations tn1 ,n2 of the test statistic (12.33), which are to be calculated
from the F –test distribution, can be obtained from Eqs. (11.3)–(11.5).
R: var.test(variable ~ group variable),
var.test(variable ~ group variable, alternative = "less"),
var.test(variable ~ group variable, alternative = "greater")
GDC: mode STAT → TESTS → 2-SampFTest...
Regrettably, the two-sample F –test for a population variance does not appear to have been imple-
mented in the SPSS software package. Instead, to address quantitative issues of the kind raised
here, one may resort to Levene’s test; cf. Sec. 12.5.
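A minimal R sketch (hypothetical data) of the two-sample F–test via var.test, together with Levene's test from the car package as the more robust alternative just mentioned:

# Hypothetical example data: metric variable y, two-level group variable g
set.seed(99)
dat <- data.frame(y = c(rnorm(30, 0, 1.0), rnorm(30, 0, 1.6)),
                  g = factor(rep(c("A", "B"), each = 30)))

var.test(y ~ g, data = dat)               # two-sided F-test for equality of variances

library(car)                              # Levene's test as a more robust alternative
leveneTest(y ~ g, data = dat)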
An important test prerequisite demands that D itself may be assumed normally distributed in Ω;
cf. Sec. 8.6. Whether this property holds true, can be checked for n ≥ 50 via the Kolmogorov–
Smirnov–test; cf. Sec. 12.3. When n < 50, one may resort to a consideration of the magnitudes
of the standardised skewness and excess kurtosis measures, Eqs. (12.14).
With µD denoting the population mean of the difference variable D, the
Hypotheses: (test for differences)
H0: µD = 0 or µD ≥ 0 or µD ≤ 0
H1: µD ≠ 0 or µD < 0 or µD > 0 (12.35)
can be given in a non-directed or a directed formulation. From the sample mean D̄ and its associ-
ated standard error,
SED̄ := SD / √n , (12.36)
which derives from the theoretical sampling distribution for D̄, one obtains by means of stan-
dardisation according to Eq. (7.34) the
Test statistic:
Tn := D̄ / SED̄ ∼ t(n − 1) , (12.37)
which, under H0 , satisfies a t–test distribution with df = n − 1 degrees of freedom; cf. Sec. 8.8.
Test decision: Depending on the kind of test to be performed, the rejection region for H0 at
significance level α is given by
p–values associated with realisations tn of the test statistic (12.37), which are to be calculated
from the t–test distribution, can be obtained from Eqs. (11.3)–(11.5).
R: t.test(variableA, variableB, paired = TRUE),
t.test(variableA, variableB, paired = TRUE, alternative = "less"),
t.test(variableA, variableB, paired = TRUE, alternative = "greater")
SPSS: Analyze → Compare Means → Paired-Samples T Test . . .
Note: Regrettably, SPSS provides no option for selecting between a one-sided and a two-sided
t–test. The default setting is for a two-sided test. For the purpose of one-sided tests the p–value
output of SPSS needs to be divided by 2.
Effect size: The practical significance of the phenomenon investigated can be estimated from the
sample mean D̄ and the sample standard deviation sD by the scale-invariant ratio
d := D̄ / sD . (12.38)
For the interpretation of its strength Cohen (1992) [11, Tab. 1] recommends the
Rule of thumb:
0.20 ≤ d < 0.50: small effect
0.50 ≤ d < 0.80: medium effect
0.80 ≤ d: large effect.
R: cohen.d(variable, group variable, paired = TRUE) (package: effsize,
by Torchiano (2018) [106])
We remark that the statistical software package R holds available a routine
power.t.test(power, sig.level, delta, sd, n, alternative, type
= "paired") for the purpose of calculation of any one of the parameters power, delta or n
(provided all remaining parameters have been specified) in the context of empirical investigations
employing the dependent samples t–test for a population mean. One-sided tests are addressed via
the parameter setting alternative = "one.sided".
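A minimal R sketch (hypothetical pre/post measurements on the same statistical units) of the dependent samples t–test, the effect size of Eq. (12.38), and the associated power calculation:

# Hypothetical pre/post measurements on the same n = 25 statistical units
set.seed(5)
pre  <- rnorm(25, 60, 8)
post <- pre + rnorm(25, 3, 6)

t.test(post, pre, paired = TRUE)          # two-sided dependent samples t-test

D <- post - pre
mean(D) / sd(D)                           # effect size d of Eq. (12.38)

# Sample size needed to detect a mean difference of 3 (sd of D = 6) with power 0.8
power.t.test(delta = 3, sd = 6, sig.level = 0.05, power = 0.8, type = "paired")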
µW+ := nred(nred + 1) / 4 , (12.42)
while the standard error SEW + can be computed from, e.g., Bortz (2005) [5, Eq. (5.52)].
Test decision: Depending on the kind of test to be performed, the rejection region for H0 at
significance level α is given by
(b) left-sided | x̃0.5(D) ≥ 0 | x̃0.5(D) < 0 | tnred < zα = −z1−α
(c) right-sided | x̃0.5(D) ≤ 0 | x̃0.5(D) > 0 | tnred > z1−α
p–values associated with realisations tnred of the test statistic (12.41), which are to be calculated
from the standard normal test distribution, can be obtained from Eqs. (11.3)–(11.5).
R: wilcox.test(variableA, variableB, paired = TRUE),
wilcox.test(variableA, variableB, paired = TRUE, alternative = "less"),
wilcox.test(variableA, variableB, paired = TRUE, alternative = "greater")
SPSS: Analyze → Nonparametric Tests → Legacy Dialogs → 2 Related Samples . . . : Wilcoxon
5
Due to the identity W + + W − = nred (nred + 1)/2, choosing instead W − would make no qualitative difference
to the subsequent test procedure.
Note: Regrettably, SPSS provides no option for selecting between a one-sided and a two-sided
Wilcoxon–test. The default setting is for a two-sided test. For the purpose of one-sided tests the
p–value output of SPSS needs to be divided by 2.
With Oij denoting the observed frequency of category aj in subgroup i (i = 1, . . . , k), and Eij the, under H0, expected frequency of category aj in subgroup i, the sum of rescaled squared residuals (Oij − Eij)²/Eij provides a useful
Test statistic:
Tn := Σ_{i=1}^{k} Σ_{j=1}^{l} (Oij − Eij)²/Eij ≈ χ²[(k − 1) × (l − 1)] . (12.44)
Eij := Oi+ O+j / n . (12.45)
Note the important (!) test prerequisite that the total sample size n be such that
Eij ≥ 5 (12.46)
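A minimal R sketch (hypothetical observed frequencies) of how such a test for homogeneity can be run on a (k × l) contingency table with the base routine chisq.test; the returned expected frequencies allow prerequisite (12.46) to be checked directly:

# Hypothetical observed frequencies O_ij: k = 2 subgroups, l = 3 categories
O <- matrix(c(20, 30, 25,
              35, 25, 15), nrow = 2, byrow = TRUE)

res <- chisq.test(O)      # chi^2 test statistic, df = (k - 1)(l - 1), p-value
res
res$expected              # expected frequencies E_ij; check E_ij >= 5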
(ii) testing for differences of the mean of a quantitative one-dimensional statistical variable X
between k ≥ 3 different subgroups of some target population Ω.
A necessary condition for the application of the one-way analysis of variance (ANOVA) test
procedure is that the quantitative one-dimensional statistical variable X to be investigated may be
reasonably assumed to be (a) normally distributed (cf. Sec. 8.6) in the k ≥ 3 subgroups of the
target population Ω considered, with, in addition, (b) equal variances. Both of these conditions
also have to hold for each of a set of k mutually stochastically independent random variables
X1 , . . . , Xk representing k random samples drawn independently from the identified k subgroups
of Ω, of sizes n1 , . . . , nk ∈ N, respectively. In the following, the element Xij of the underlying
(n × 2) data matrix X represents the jth value of X in the random sample drawn from the ith
subgroup of Ω, with X̄i the corresponding subgroup sample mean. The k independent random
samples can be understood to form a total random sample of size n := n1 + . . . + nk = Σ_{i=1}^{k} ni, with total sample mean X̄n; cf. Eq. (10.6).
6
Only experimental designs with fixed effects are considered here.
The intention of the ANOVA procedure in the variant (ii) stated above is to empirically test the
null hypothesis H0 in the set of
Hypotheses: (test for differences)
H0: µ1 = . . . = µk = µ0
H1: µi ≠ µ0 at least for one i = 1, . . . , k . (12.49)
The necessary test prerequisites can be checked by (a) the Kolmogorov–Smirnov–test for nor-
mality of the X-distribution in each of the k subgroups of Ω (cf. Sec. 12.3) when ni ≥ 50, or,
when ni < 50, by a consideration of the magnitudes of the standardised skewness and excess
kurtosis measures, Eqs. (12.14), and likewise by (b) Levene’s test for H0 : σ12 = . . . = σk2 = σ02
against H1 : “σi2 6= σ02 at least for one i = 1, . . . , k” to test for equality of the variances in these k
subgroups (cf. Sec. 12.5).
R: leveneTest(variable, group variable) (package: car, by Fox and Weisberg
(2011) [25])
The starting point of the ANOVA procedure is a simple algebraic decomposition of the random
sample values Xij into three additive components according to
Xij = X̄n + (X̄i − X̄n ) + (Xij − X̄i ) . (12.50)
This expresses the Xij in terms of the sum of the total sample mean, X̄n , the deviation of the
subgroup sample means from the total sample mean, (X̄i − X̄n ), and the residual deviation of the
sample values from their respective subgroup sample means, (Xij − X̄i ). The decomposition of
the Xij motivates a linear stochastic model for the target population Ω of the form7
in Ω : Xij = µ0 + αi + εij (12.51)
in order to quantify, via the αi (i = 1, . . . , k), the potential influence of the qualitative one-dimensional variable Y on the quantitative one-dimensional variable X. Here µ0 is the population mean of X, it holds that Σ_{i=1}^{k} ni αi = 0, and it is assumed for the random errors εij that εij ∼ N(0; σ0²) i.i.d., i.e., that they are identically normally distributed and mutually stochastically independent.
Having established the decomposition (12.50), one next turns to consider the associated set of
sums of squared deviations, defined by
BSS := Σ_{i=1}^{k} Σ_{j=1}^{ni} (X̄i − X̄n)² = Σ_{i=1}^{k} ni (X̄i − X̄n)² (12.52)
RSS := Σ_{i=1}^{k} Σ_{j=1}^{ni} (Xij − X̄i)² (12.53)
TSS := Σ_{i=1}^{k} Σ_{j=1}^{ni} (Xij − X̄n)² , (12.54)
7
Formulated in the context of this linear stochastic model, the null and research hypotheses are H0 : α1 = . . . =
αk = 0 and H1 : at least one αi 6= 0, respectively.
where the summations are (i) over all ni sample units within a subgroup, and (ii) over all of the
k subgroups themselves. The sums are referred to as, resp., (a) the sum of squared deviations be-
tween the subgroup samples (BSS), (b) the residual sum of squared deviations within the subgroup
samples (RSS), and (c) the total sum of squared deviations (TSS) of the individual Xij from the
total sample mean X̄n . It is a fairly elaborate though straightforward algebraic exercise to show
that these three squared deviation terms relate to one another according to the strikingly simple
and elegant identity (cf. Bosch (1999) [7, p 220f])
TSS = BSS + RSS . (12.55)
Now, from the sums of squared deviations (12.52)–(12.54), one defines, resp., the total sample
variance,
S²total := (1/(n − 1)) Σ_{i=1}^{k} Σ_{j=1}^{ni} (Xij − X̄n)² = TSS/(n − 1) , (12.56)
involving df = n − 1 degrees of freedom, the sample variance between subgroups,
S²between := (1/(k − 1)) Σ_{i=1}^{k} ni (X̄i − X̄n)² = BSS/(k − 1) , (12.57)
with df = k − 1, and the mean sample variance within subgroups,
S²within := (1/(n − k)) Σ_{i=1}^{k} Σ_{j=1}^{ni} (Xij − X̄i)² = RSS/(n − k) , (12.58)
for which df = n − k.
Employing the latter two subgroup-specific dispersion measures, the set of hypotheses (12.49) may
be recast into the alternative form
Hypotheses: (test for differences)
H0: S²between / S²within ≤ 1
H1: S²between / S²within > 1 . (12.59)
Finally, as a test statistic for the ANOVA procedure one chooses this very ratio of variances8 we just employed,
Tn,k := (sample variance between subgroups) / (mean sample variance within subgroups) = [BSS/(k − 1)] / [RSS/(n − k)] ,
8 This ratio is sometimes given as Tn,k := (explained variance)/(unexplained variance), in analogy to expression (13.10) below. Occasionally, one also considers the coefficient η² := BSS/TSS, which, however, does not account for the degrees of freedom involved. In this respect, the modified coefficient η̃² := S²between/S²total would constitute a more sophisticated measure.
Table 12.1: One-way ANOVA summary table.
ANOVA variability | sum of squares | df | mean square | test statistic
between groups | BSS | k − 1 | S²between | tn,k
within groups | RSS | n − k | S²within |
total | TSS | n − 1 | |
expressing the size of the “sample variance between subgroups” in terms of multiples of the “mean
sample variance within subgroups”; it thus constitutes a relative measure. A real effect of differ-
ence between subgroups is thus given when the non-negative numerator turns out to be significantly
larger than the non-negative denominator. Mathematically, this statistical measure of deviations
between the data and the null hypothesis is captured by the
Test statistic:9
Tn,k := S²between / S²within ∼ F(k − 1, n − k) . (12.60)
Under H0 , it satisfies an F –test distribution with df1 = k −1 and df2 = n−k degrees of freedom;
cf. Sec. 8.9.
It is a well-established standard in practical applications of the one-way ANOVA procedure to
display the results of the data analysis in the form of a summary table, here given in Tab. 12.1.
Test decision: The rejection region for H0 at significance level α is given by (right-sided test)
With Eq. (11.5), the p–value associated with a specific realisation tn,k of the test statistic (12.60),
which is to be calculated from the F –test distribution, amounts to
p = P (Tn,k > tn,k |H0 ) = 1 − P (Tn,k ≤ tn,k |H0 ) = 1 − F cdf(0, tn,k , k − 1, n − k) . (12.62)
For the interpretation of its strength Cohen (1992) [11, Tab. 1] recommends the
9
Note the one-to-one correspondence to the test statistic (12.33) employed in the independent samples F –test for
a population variance.
Rule of thumb:
0.10 ≤ f < 0.25: small effect
0.25 ≤ f < 0.40: medium effect
0.40 ≤ f : large effect.
We remark that the statistical software package R holds available a routine
power.anova.test(groups, n, between.var, within.var, sig.level,
power) for the purpose of calculation of any one of the parameters power or n (provided all
remaining parameters have been specified) in the context of empirical investigations employing
the one-way ANOVA. Equal values of n are required here.
When a one-way ANOVA yields a statistically significant result, so-called post-hoc tests need to
be run subsequently in order to identify those subgroups i whose means µi differ most drastically
from the reference value µ0 . The Student–Newman–Keuls–test (Newman (1939) [74] and Keuls
(1952) [48]), e.g., successively subjects the pairs of subgroups with the largest differences in sam-
ple means to independent samples t–tests; cf. Sec. 12.5. Other useful post-hoc tests are those
developed by Holm–Bonferroni (Holm (1979) [42]), Tukey (Tukey (1977) [110]), or by Scheffé
(Scheffé (1959) [90]).
R: pairwise.t.test(variable, group variable, p.adj = "bonferroni")
SPSS: Analyze → Compare Means → One-Way ANOVA . . . → Post Hoc . . .
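A minimal R sketch (hypothetical data with k = 3 subgroups) of the complete one-way ANOVA workflow: Levene's test for equality of variances, the F–test via aov, and Bonferroni-corrected post-hoc comparisons:

# Hypothetical example data: metric variable y in k = 3 subgroups
set.seed(11)
dat <- data.frame(y = c(rnorm(30, 20, 4), rnorm(30, 23, 4), rnorm(30, 25, 4)),
                  g = factor(rep(c("G1", "G2", "G3"), each = 30)))

library(car)                              # (b) Levene's test for equality of variances
leveneTest(y ~ g, data = dat)

summary(aov(y ~ g, data = dat))           # one-way ANOVA: F-test of H0: mu_1 = mu_2 = mu_3

pairwise.t.test(dat$y, dat$g, p.adj = "bonferroni")   # post-hoc pairwise comparisons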
By Eq. (11.5), the p–value associated with a realisation tn,k of the test statistic (12.66), which is
to be calculated from the χ2 –test distribution, amounts to
p = P (Tn,k > tn,k |H0 ) = 1 − P (Tn,k ≤ tn,k |H0 ) = 1 − χ2 cdf(0, tn,k , k − 1) . (12.68)
Recognising patterns of regularity in the variability of data sets for given (observable) statisti-
cal variables, and explaining them in terms of causal relationships in the context of a suitable
theoretical model, is one of the main objectives of any empirical scientific discipline, and thus
motivation for corresponding research; see, e.g., Penrose (2004) [82]. Causal relationships are
intimately related to interactions between objects or agents of the physical or/and of the social
kind. A necessary (though not sufficient) condition on the way to theoretically fathoming causal
relationships is to establish empirically the existence of significant statistical associations be-
tween the variables in question. Replication of positive observational or experimental results of
this kind, when accomplished, yields strong support in favour of this idea. Regrettably, however,
the existence of causal relationships between two statistical variables cannot be established with
absolute certainty by empirical means; compelling theoretical arguments need to stand in. Causal
relationships between statistical variables imply an unambiguous distinction between independent
variables and dependent variables. In the following, we will discuss the principles of the sim-
plest three inferential statistical methods within the frequentist framework, each associated with
specific null hypothesis significance tests, that provide empirical checks of the aforementioned
necessary condition in the bivariate case.
Hypotheses: (test for association)
H0: ρ = 0 or ρ ≥ 0 or ρ ≤ 0
H1: ρ ≠ 0 or ρ < 0 or ρ > 0 , (13.1)
with −1 ≤ ρ ≤ +1.
For sample sizes n ≥ 50, the assumption of normality of the marginal X- and Y -distributions
in a given random sample SΩ : (X1 , . . . , Xn ; Y1 , . . . , Yn ) drawn from Ω can be tested by means
of the Kolmogorov–Smirnov–test; cf. Sec. 12.3. For sample sizes n < 50, on the other hand,
the magnitudes of the standardised skewness and excess kurtosis measures, Eqs. (12.14), can
be considered instead. A scatter plot of the bivariate raw sample data {(xi , yi )}i=1,...,n displays
characteristic features of the joint (X, Y )-distribution.
R: ks.test(variable, "pnorm")
SPSS: Analyze → Nonparametric Tests → Legacy Dialogs → 1-Sample K-S . . . : Normal
Normalising the sample correlation coefficient r of Eq. (4.19) by its standard error,
SEr := √( (1 − r²)/(n − 2) ) , (13.2)
the latter of which can be derived from the corresponding theoretical sampling distribution for r,
presently yields the (see, e.g., Toutenburg (2005) [108, Eq. (7.18)])
Test statistic:
Tn := r / SEr ∼ t(n − 2) , (13.3)
which, under H0 , satisfies a t–test distribution with df = n − 2 degrees of freedom; cf. Sec. 8.8.
Test decision: Depending on the kind of test to be performed, the rejection region for H0 at
significance level α is given by
p–values associated with realisations tn of the test statistic (13.3), which are to be calculated from
the t–test distribution, can be obtained from Eqs. (11.3)–(11.5).
R: cor.test(variable1, variable2),
cor.test(variable1, variable2, alternative = "less"),
cor.test(variable1, variable2, alternative = "greater")
SPSS: Analyze → Correlate → Bivariate . . . : Pearson
Effect size: The practical significance of the phenomenon investigated can be estimated directly
from the absolute value of the scale-invariant sample correlation coefficient r according to Cohen’s
(1992) [11, Tab. 1]
Rule of thumb:
0.10 ≤ |r| < 0.30: small effect
0.30 ≤ |r| < 0.50: medium effect
0.50 ≤ |r|: large effect.
It is generally recommended to handle significant test results of correlation analyses for metrically
scaled two-dimensional statistical variables (X, Y ) with some care, due to the possibility of spuri-
ous correlations induced by additional control variables Z, . . ., acting hidden in the background.
To exclude this possibility, a correlation analysis should, e.g., be repeated for homogeneous sub-
groups of the sample SΩ . Some rather curious and startling cases of spurious correlations have
been collected at the website www.tylervigen.com.
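A minimal R sketch (hypothetical bivariate data) of the steps just outlined: a scatter plot of the raw sample data, the two-sided test of H0: ρ = 0 via cor.test, and the sample correlation coefficient r as the effect size:

# Hypothetical bivariate metric data
set.seed(3)
x <- rnorm(60, 50, 10)
y <- 0.4 * x + rnorm(60, 0, 8)

plot(x, y)                 # scatter plot of the joint (X, Y)-distribution

res <- cor.test(x, y)      # two-sided test of H0: rho = 0
res
res$estimate               # sample correlation coefficient r (effect size)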
in Ω : Yi = α + βxi + εi (i = 1, . . . , n) , (13.4)
which, for instance, assigns X the role of an independent variable (and so its values xi can be
considered prescribed by the modeller) and Y the role of a dependent variable; such a model
is essentially univariate in nature. The regression coefficients α and β denote the unknown y–
intercept and slope of the model in Ω. For the random errors εi it is assumed that
εi ∼ N(0; σ²) i.i.d. , (13.5)
meaning they are identically normally distributed (with zero mean and constant variance σ 2 )
and mutually stochastically independent. With respect to the bivariate random sample
SΩ : (X1 , . . . , Xn ; Y1 , . . . , Yn ), the supposed linear relationship between X and Y is expressed
by
in SΩ : yi = a + bxi + ei (i = 1, . . . , n) . (13.6)
So-called residuals are then defined according to
ei := yi − ŷi = yi − a − bxi (i = 1, . . . , n) , (13.7)
which, for given values of xi , encode the differences between the observed realisations yi of Y and
the corresponding (by the linear regression model) predicted values ŷi of Y. Given the assumption expressed in Eq. (13.5), the residuals must satisfy the condition Σ_{i=1}^{n} ei = 0.
Next, introduce sums of squared deviations for the Y -data, in line with the ANOVA procedure of
Sec. 12.11, i.e.,
TSS := Σ_{i=1}^{n} (yi − ȳ)² (13.8)
RSS := Σ_{i=1}^{n} (yi − ŷi)² = Σ_{i=1}^{n} ei² . (13.9)
In terms of these quantities, the coefficient of determination of Eq. (5.9) for assessing the
goodness-of-the-fit of a regression model can be expressed by
B = (TSS − RSS)/TSS = [(total variance of Y ) − (unexplained variance of Y )] / (total variance of Y ) . (13.10)
This normalised measure expresses the proportion of variability in a data set of Y which can be
explained by the corresponding variability of X through the best-fit regression model. The range
of B is 0 ≤ B ≤ 1.
In the methodology of a regression analysis within the frequentist framework, the first issue to
be addressed is to test the significance of the overall simple linear regression model (13.4), i.e.,
to test H0 against H1 in the set of
Hypotheses: (test for differences)
H0: β = 0
H1: β ≠ 0 . (13.11)
Exploiting the goodness-of-the-fit aspect of the regression model as quantified by B in Eq. (13.10),
one arrives via division by the standard error of B,
SEB := (1 − B)/(n − 2) , (13.12)
which derives from the theoretical sampling distribution for B, at the (see, e.g., Hatzinger and
Nagel (2013) [37, Eq. (7.8)])
Test statistic:1
Tn := B / SEB ∼ F(1, n − 2) . (13.13)
1
Note that with the identity B = r2 of Eq. (5.10), which applies in simple linear regression, this is just the square
of the test statistic (13.3).
Under H0 , this satisfies an F –test distribution with df1 = 1 and df2 = n − 2 degrees of freedom;
cf. Sec. 8.9.
Test decision: The rejection region for H0 at significance level α is given by (right-sided test)
tn > f1,n−2;1−α . (13.14)
With Eq. (11.5), the p–value associated with a specific realisation tn of the test statistic (13.13),
which is to be calculated from the F –test distribution, amounts to
p = P (Tn > tn |H0 ) = 1 − P (Tn ≤ tn |H0 ) = 1 − F cdf(0, tn , 1, n − 2) . (13.15)
yielding solutions
b = r (SY / sX) and a = Ȳ − b x̄ . (13.17)
The equation of the best-fit simple linear regression model is thus given by
ŷ = Ȳ + r (SY / sX)(x − x̄) , (13.18)
and can be employed for purposes of predicting values of Y from given values of X in the empirical
interval [x(1) , x(n) ].
Next, the standard errors associated with the values of the maximum likelihood point estimators
a and b in Eq. (13.17) are derived from the corresponding theoretical sampling distributions and
amount to (cf., e.g., Hartung et al (2005) [39, p 576ff])
SEa := √( 1/n + x̄²/((n − 1)s²X) ) SEe (13.19)
SEb := SEe / ( √(n − 1) sX ) , (13.20)
where the standard error of the residuals ei is defined by
SEe := √( Σ_{i=1}^{n} (Yi − Ŷi)² / (n − 2) ) . (13.21)
We now describe the test procedure for the regression coefficient β. To be tested is H0 against H1
in one of the alternative pairs of
Hypotheses: (test for differences)
H0: β = 0 or β ≥ 0 or β ≤ 0
H1: β ≠ 0 or β < 0 or β > 0 . (13.22)
Dividing the sample regression slope b by its standard error (13.20) yields the
Test statistic:
Tn := b / SEb ∼ t(n − 2) , (13.23)
which, under H0 , satisfies a t–test distribution with df = n − 2 degrees of freedom; cf. Sec. 8.8.
Test decision: Depending on the kind of test to be performed, the rejection region for H0 at
significance level α is given by
(ii) homoscedasticity of the ei (i = 1, . . . , n), i.e., whether or not they can be assumed to have
constant variance, can be investigated qualitatively in terms of a scatter plot that marks
the standardised ei (along the vertical axis) against the corresponding predicted Y -values ŷi
(i = 1, . . . , n) (along the horizontal axis). An elliptically shaped envelope of the cloud of
data points thus obtained indicates that homoscedasticity applies.
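In R, the corresponding workflow rests on the routine lm; a minimal sketch (hypothetical data) of fitting the simple linear regression model, reading off B, the F–test and the t–test for β from the model summary, and producing the residuals-versus-predicted-values diagnostic plot described in (ii):

# Hypothetical bivariate data for a simple linear regression of y on x
set.seed(8)
x <- runif(80, 0, 10)
y <- 2 + 0.7 * x + rnorm(80, 0, 1.5)

fit <- lm(y ~ x)
summary(fit)               # a, b with standard errors, t-test for beta, F-test, B = R^2

# Diagnostic plot: standardised residuals against predicted values (homoscedasticity check)
plot(fitted(fit), rstandard(fit))
abline(h = 0, lty = 2)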
Simple linear regression analysis can be easily modified to provide a tool to test bivariate empirical
data {(xi , yi )}i=1,...,n for positive metrically scaled statistical variables (X, Y ) for an association in
the form of a Pareto distribution; cf. Sec. 8.10. To begin with, the original data is subjected to log-
arithmic transformations in order to obtain data for the logarithmic quantities ln(yi ) resp. ln(xi ).
Subsequently, a correlation analysis can be performed on the transformed data. Given there exists
a functional relationship between the original Y and X of the form y = K x^−(γ+1), the logarithmic
quantities are related by
ln(y) = ln(K) − (γ + 1) × ln(x) , (13.24)
i.e., one finds a straight line relationship between ln(y) and ln(x) with negative slope equal to
−(γ + 1).
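A minimal R sketch (hypothetical positive-valued data) of this log–log procedure: correlating ln(y) with ln(x) and estimating the slope −(γ + 1) of Eq. (13.24) by simple linear regression:

# Hypothetical positive-valued data following an approximate power law y = K * x^-(gamma + 1)
set.seed(21)
x <- exp(runif(100, 0, 3))
y <- 5 * x^(-2) * exp(rnorm(100, 0, 0.2))

lx <- log(x)
ly <- log(y)

cor.test(lx, ly)           # correlation analysis on the log-transformed data
coef(lm(ly ~ lx))          # intercept ln(K) and slope -(gamma + 1)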
We like to draw the reader’s attention to a remarkable statistical phenomenon that was discovered,
and emphatically publicised, by the English empiricist Sir Francis Galton FRS (1822–1911), fol-
lowing years of intense research during the late 19th Century; see Galton (1886) [28], and also
Kahneman (2011) [46, Ch. 17]. Regression toward the mean is best demonstrated on the basis
of the standardised version of the best-fit simple linear regression model of Eq. (13.18), namely
ẑY = rzX . (13.25)
For bivariate metrically scaled random sample data that exhibits a non-perfect positive correlation
(i.e., 0 < r < 1), one observes that, on average, large (small) zX -values (i.e., values that are
far from their mean; that are, perhaps, even outliers) pair with smaller (larger) zY -values (i.e.,
values that are closer to their mean; that are more mediocre). Since this phenomenon persists
after the roles of X and Y in the regression model have been switched, this is clear evidence
that regression toward the mean is a manifestation of randomness, and not of causality (which
requires an unambiguous temporal order between a cause and an effect). Incidently, regression
toward the mean ensures that many physical and social processes cannot become unstable.
We end this section by pointing out that, in reality, many of the processes studied in the Natural Sciences and in the Social Sciences prove to be of an inherently non-linear nature; see, e.g., Gleick (1987) [34], Penrose (2004) [82], and Smith (2007) [94]. On the one hand, this increases the level of complexity involved in the analysis of data; on the other, non-linear processes offer the reward of a plethora of interesting and intriguing (dynamical) phenomena.
p–values associated with realisations tn of the test statistic (13.29), which are to be calculated
from the t–test distribution, can be obtained from Eqs. (11.3)–(11.5).
R: cor.test(variable1, variable2, method = "spearman"),
cor.test(variable1, variable2, method = "spearman", alternative =
"less"),
cor.test(variable1, variable2, method = "spearman", alternative =
"greater")
SPSS: Analyze → Correlate → Bivariate . . . : Spearman
Effect size: The practical significance of the phenomenon investigated can be estimated directly
from the absolute value of the scale-invariant sample rank correlation coefficient rS according to
(cf. Cohen (1992) [11, Tab. 1])
Rule of thumb:
0.10 ≤ |rS | < 0.30: small effect
0.30 ≤ |rS | < 0.50: medium effect
0.50 ≤ |rS |: large effect.
For the subsequent test procedure to be reliable, it is very important (!) that the empirical prerequisite
Eij ≥ 5 (13.33)
holds for all values of i = 1, . . . , k and j = 1, . . . , l, such that one avoids the possibility of individual rescaled squared residuals (Oij − Eij)²/Eij becoming artificially magnified. The latter constitute the core of the
Test statistic:
Tn := Σ_{i=1}^{k} Σ_{j=1}^{l} (Oij − Eij)²/Eij ≈ χ²[(k − 1) × (l − 1)] , (13.34)
By Eq. (11.5), the p–value associated with a realisation tn of the test statistic (13.34), which is to be calculated from the χ²–test distribution, amounts to p = P(Tn > tn|H0) = 1 − P(Tn ≤ tn|H0) = 1 − χ²cdf(0, tn, (k − 1) × (l − 1)).
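A minimal R sketch (hypothetical categorical raw data) of the χ²–test for independence via the base routine chisq.test, applied to the cross tabulation of the two variables; the returned expected frequencies allow prerequisite (13.33) to be verified:

# Hypothetical categorical raw data on n = 200 statistical units
set.seed(17)
A <- factor(sample(c("a1", "a2", "a3"), 200, replace = TRUE))
B <- factor(sample(c("b1", "b2"), 200, replace = TRUE))

tab <- table(A, B)         # (k x l) contingency table of observed frequencies O_ij
res <- chisq.test(tab)
res
res$expected               # expected frequencies E_ij; check E_ij >= 5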
Our discussion on the foundations of statistical methods of data analysis and their application to
specific quantitative problems ends here. We have focused on the description of uni- and bivari-
ate data sets and making inferences from corresponding random samples within the frequentist
approach to Probability Theory. At this stage, the attentive reader should feel well-equipped for
confronting problems concerning more complex, multivariate data sets, and adequate methods for
tackling them by statistical means. Many modules at the Master degree level review a broad spec-
trum of advanced topics such as multiple linear regression, generalised linear models, principal
component analysis, or cluster analysis, which in turn relate to computational techniques presently
employed in the context of machine learning. The ambitious reader might even think of getting in-
volved with proper research and work towards a Ph.D. degree in an empirical scientific discipline.
To gain additional data analytical flexibility, and to increase the chances of obtaining transparent and satisfactory research results, it is strongly recommended to consult the conceptually compelling inductive Bayes–Laplace approach to statistical inference. In order to leave behind the methodological shortcomings uncovered by the recent replication crisis (cf., e.g., Refs. [17], [76], or [112]), strict adherence to accepted scientific standards must not be compromised.2
Beyond activities within the scientific community, the dedicated reader may feel encouraged to use her/his solid topical qualification in statistical methods of data analysis for a career in any of the fields of higher education, public health, renewable energy supply chains, evaluation of climate change adaptation, development of plans for sustainable production in agriculture and the global economy, civil service, business management, marketing, logistics, or the financial services, amongst a multitude of other inspirational possibilities.
Not every single matter of human life is amenable to quantification, or, acknowledging an indi-
vidual freedom of making choices, needs to be quantified in the first place. Blind faith in the
powers of quantitative methods is certainly misplaced. Thorough reflection and introspection on
the options available for action and their implied consequences, together with a critical evalua-
tion of relevant tangible facts, might suggest a viable alternative approach to a given research or
practical problem. Generally, there is a potential for looking behind curtains, shifting horizons, or
anticipating prospects and opportunities. Finally, more often than not, there exists a dimension of
non-knowledge on the part of the individual investigator that needs to be taken into account as an
integral part of the boundary conditions of the overall problem in question. The adventurous mind
will always excel in view of the intricate challenge of making inferences on the basis of incomplete
information.
2
With regard to the replication crisis, the interested reader might be aware of the international initiative known as
the Open Science Framework. URL (cited on August 17, 2019): https://osf.io.
Appendix A
Simple Principal Component Analysis
where Tr(M) = √2 and det(M) = 1. The correlation matrix R can now be diagonalised by
means of a rotation with M according to1
Rdiag = M⁻¹RM = (1/√2)[[1, 1], [−1, 1]] · [[1, r], [r, 1]] · (1/√2)[[1, −1], [1, 1]] = [[1 + r, 0], [0, 1 − r]] . (A.7)
Note that Tr(Rdiag) = 2 and det(Rdiag) = 1 − r², i.e., the trace and determinant of R remain
invariant under the diagonalising transformation.
The concepts of eigenvalues and eigenvectors (principal components), as well as of diagonalisation
of symmetric matrices, generalise in a straightforward though computationally more demanding
fashion to arbitrary real-valued correlation matrices R ∈ Rm×m , with m ∈ N.
R: prcomp(data matrix)
1 Alternatively one can write M = [[cos(π/4), −sin(π/4)], [sin(π/4), cos(π/4)]], thus emphasising the character of a rotation of R by an angle ϕ = π/4.
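A minimal R sketch (hypothetical data matrix with two correlated standardised variables) of diagonalising the sample correlation matrix via eigen and comparing the result with the principal components delivered by prcomp:

# Hypothetical data matrix with two correlated metric variables
set.seed(4)
x1 <- rnorm(100)
x2 <- 0.6 * x1 + sqrt(1 - 0.6^2) * rnorm(100)
X  <- cbind(x1, x2)

R <- cor(X)                # (2 x 2) sample correlation matrix
eigen(R)                   # eigenvalues 1 + r and 1 - r, cf. Eq. (A.7), with eigenvectors

prcomp(X, scale. = TRUE)   # principal component analysis of the standardised data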
Appendix B
Distance Measures in Statistics
Statistics employs a number of different measures of distance dij to quantify the separation in
an m–D space of metrically scaled statistical variables X, Y, . . . , Z of two statistical units i and
j (i, j = 1, . . . , n). Note that, by construction, these measures dij exhibit the properties dij ≥ 0,
dij = dji and dii = 0. In the following, Xik is the entry of the data matrix X ∈ Rn×m relating to
the ith statistical unit and the kth statistical variable, etc. The dij define the elements of a (n × n)
proximity matrix D ∈ Rn×n .
where δkl denotes the elements of the unit matrix 1 ∈ Rm×m ; cf. Ref. [18, Eq. (2.2)].
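A minimal R sketch (hypothetical data matrix) of computing (n × n) proximity matrices with the base routine dist, which implements, amongst others, the Euclidean and Manhattan distance measures:

# Hypothetical (n x m) data matrix: n = 4 statistical units, m = 3 metric variables
X <- matrix(c(1.2, 3.4, 0.5,
              2.1, 2.9, 1.0,
              0.7, 4.2, 0.3,
              1.8, 3.1, 0.8), nrow = 4, byrow = TRUE)

dist(X, method = "euclidean")   # Euclidean distances d_ij
dist(X, method = "manhattan")   # Manhattan (city-block) distances d_ij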
Appendix C
List of Online Survey Tools
A first version of the following list of online survey tools for the Social Sciences, the use of some
of which is free of charge, was compiled and released courtesy of an investigation by Michael
Rüger (IMC, year of entry 2010):
• easy-feedback.de/de/startseite
• www.evalandgo.de
• www.limesurvey.org
• www.netigate.de
• polldaddy.com
• q-set.de
• www.qualtrics.com
• www.soscisurvey.de
• www.surveymonkey.com
• www.umfrageonline.com
Appendix D
Glossary of Technical Terms (GB – D)
A
additive: additiv, summierbar
ANOVA: Varianzanalyse
arithmetical mean: arithmetischer Mittelwert
association: Zusammenhang, Assoziation
attribute: Ausprägung, Eigenschaft
B
bar chart: Balkendiagramm
Bayes’ theorem: Satz von Bayes
Bayesian probability: Bayesianischer Wahrscheinlichkeitsbegriff
best-fit model: Anpassungsmodell
bin: Datenintervall
binomial coefficient: Binomialkoeffizient
bivariate: bivariat, zwei variable Größen betreffend
box plot: Kastendiagramm
C
category: Kategorie
causality: Kausalität
causal relationship: Kausalbeziehung
census: statistische Vollerhebung
central limit theorem: Zentraler Grenzwertsatz
centre of gravity: Schwerpunkt
centroid: geometrischer Schwerpunkt
certain event: sicheres Ereignis
class interval: Ausprägungsklasse
cluster analysis: Klumpenanalyse
cluster random sample: Klumpenzufallsstichprobe
coefficient of determination: Bestimmtheitsmaß
coefficient of variation: Variationskoeffizient
combination: Kombination
combinatorics: Kombinatorik
compact: geschlossen, kompakt
complementation of a set: Bilden der Komplementärmenge
concentration: Konzentration
conditional distribution: bedingte Verteilung
conditional probability: bedingte Wahrscheinlichkeit
confidence interval: Konfidenzintervall
conjunction: Konjunktion, Mengenschnitt
contingency table: Kontingenztafel
continuous data: stetige Daten
control variable: Störvariable
convenience sample: Gelegenheitsstichprobe
convexity: Konvexität
correlation matrix: Korrelationsmatrix
covariance matrix: Kovarianzmatrix
critical value: kritischer Wert
cross tabulation: Kreuztabelle
cumulative distribution function (cdf): theoretische Verteilungsfunktion
D
data: Daten
data matrix: Datenmatrix
decision: Entscheidung
deductive method: deduktive Methode
degree-of-belief: Glaubwürdigkeitsgrad, Plausibilität
degrees of freedom: Freiheitsgrade
dependent variable: abhängige Variable
descriptive statistics: Beschreibende Statistik
deviation: Abweichung
difference: Differenz
direction: Richtung
discrete data: diskrete Daten
disjoint events: disjunkte Ereignisse, einander ausschließend
disjunction: Disjunktion, Mengenvereinigung
dispersion: Streuung
distance: Abstand
distortion: Verzerrung
distribution: Verteilung
distributional properties: Verteilungseigenschaften
E
econometrics: Ökonometrie
effect size: Effektgröße
eigenvalue: Eigenwert
elementary event: Elementarereignis
empirical cumulative distribution function: empirische Verteilungsfunktion
estimator: Schätzer
Euclidian distance: Euklidischer Abstand
Euclidian space: Euklidischer (nichtgekrümmter) Raum
event: Ereignis
event space: Ereignisraum
evidence: Anzeichen, Hinweis, Anhaltspunkt, Indiz
expectation value: Erwartungswert
extreme value: extremer Wert
F
fact: Tatsache, Faktum
factorial: Fakultät
falsification: Falsifikation
five number summary: Fünfpunktzusammenfassung
frequency: Häufigkeit
frequentist probability: frequentistischer Wahrscheinlichkeitsbegriff
G
Gini coefficient: Ginikoeffizient
goodness-of-the-fit: Anpassungsgüte
H
Hessian matrix: Hesse’sche Matrix
histogram: Histogramm
homoscedasticity: Homoskedastizität, homogene Varianz
hypothesis: Hypothese, Behauptung, Vermutung
I
inclusion of a set: Mengeninklusion
independent variable: unabhängige Variable
inductive method: induktive Methode
inferential statistics: Schließende Statistik
interaction: Wechselwirkung
intercept: Achsenabschnitt
interquartile range: Quartilsabstand
interval scale: Intervallskala
impossible event: unmögliches Ereignis
J
joint distribution: gemeinsame Verteilung
K
kσ–rule: kσ–Regel
kurtosis: Wölbung
L
latent variable: latente Variable, nichtbeobachtbares Konstrukt
law of large numbers: Gesetz der großen Zahlen
law of total probability: Satz von der totalen Wahrscheinlichkeit
Likert scale: Likertskala, Verfahren zum Messen von eindimensionalen latenten Variablen
linear regression analysis: lineare Regressionsanalyse
location parameter: Lageparameter
Lorenz curve: Lorenzkurve
M
Mahalanobis distance: Mahalanobis’scher Abstand
manifest variable: manifeste Variable, Observable
marginal distribution: Randverteilung
marginal frequencies: Randhäufigkeiten
measurement: Messung, Datenaufnahme
method of least squares: Methode der kleinsten Quadrate
median: Median
metrical: metrisch
mode: Modalwert
N
nominal: nominal
O
observable: beobachtbare/messbare Variable, Observable
observation: Beobachtung
odds: Wettchancen
operationalisation: Operationalisieren, latente Variable messbar gestalten
opinion poll: Meinungsumfrage
ordinal: ordinal
outlier: Ausreißer
P
p–value: p–Wert
partition: Zerlegung, Aufteilung
percentile value: Perzentil, α–Quantil
pie chart: Kreisdiagramm
point estimator: Punktschätzer
population: Grundgesamtheit
power: Teststärke
power set: Potenzmenge
practical significance: praktische Signifikanz, Bedeutung
principal component analysis: Hauptkomponentenanalyse
probability: Wahrscheinlichkeit
probability density function (pdf): Wahrscheinlichkeitsdichte
probability function: Wahrscheinlichkeitsfunktion
probability measure: Wahrscheinlichkeitsmaß
probability space: Wahrscheinlichkeitsraum
projection: Projektion
proportion: Anteil
proximity matrix: Distanzmatrix
Q
quantile: Quantil
quartile: Quartil
questionnaire: Fragebogen
R
randomness: Zufälligkeit
random experiment: Zufallsexperiment
random sample: Zufallsstichprobe
random variable: Zufallsvariable
range: Spannweite
rank: Rang
rank number: Rangzahl
rank order: Rangordnung
ratio scale: Verhältnisskala
raw data set: Datenurliste
realisation: Realisierung, konkreter Messwert für eine Zufallsvariable
regression analysis: Regressionsanalyse
regression coefficient: Regressionskoeffizient
regression model: Regressionsmodell
regression toward the mean: Regression zur Mitte
rejection region: Ablehnungsbereich
replication: Nachahmung
research: Forschung
research question: Forschungsfrage
residual: Residuum, Restgröße
risk: Risiko (berechenbar)
S
σ–algebra: σ–Algebra
6σ–event: 6σ–Ereignis
sample: Stichprobe
sample correlation coefficient: Stichprobenkorrelationskoeffizient
sample covariance: Stichprobenkovarianz
sample mean: Stichprobenmittelwert
sample size: Stichprobenumfang
sample space: Ergebnismenge
sample variance: Stichprobenvarianz
sampling distribution: Stichprobenkenngrößenverteilung
sampling error: Stichprobenfehler
sampling frame: Auswahlgesamtheit
sampling unit: Stichprobeneinheit
scale-invariant: skaleninvariant
scale level: Skalenniveau
scale parameter: Skalenparameter
scatter plot: Streudiagramm
scientific method: Wissenschaftliche Methode
shift theorem: Verschiebungssatz
significance level: Signifikanzniveau
simple random sample: einfache Zufallsstichprobe
skewness: Schiefe
slope: Steigung
spectrum of values: Wertespektrum
spurious correlation: Scheinkorrelation
standard error: Standardfehler
standardisation: Standardisierung
statistical (in)dependence: statistische (Un)abhängigkeit
statistical unit: Erhebungseinheit
statistical significance: statistische Signifikanz
statistical variable: Merkmal, Variable
stochastic: stochastisch, wahrscheinlichkeitsbedingt
stochastic independence: stochastische Unabhängigkeit
stratified random sample: geschichtete Zufallsstichprobe
strength: Stärke
summary table: Zusammenfassungstabelle
survey: statistische Erhebung, Umfrage
T
test statistic: Teststatistik, statistische Effektmessgröße
type I error: Fehler 1. Art
type II error: Fehler 2. Art
U
unbiased: erwartungstreu, unverfälscht, unverzerrt
uncertainty: Unsicherheit (nicht berechenbar)
univariate: univariat, eine variable Größe betreffend
unit: Einheit
urn model: Urnenmodell
V
value: Wert
variance: Varianz
variation: Variation
Venn diagram: Venn–Diagramm
visual analogue scale: visuelle Analogskala
W
weighted mean: gewichteter Mittelwert
Z
Z scores: Z–Werte
zero point: Nullpunkt
Bibliography
[2] T Bayes (1763) An essay towards solving a problem in the doctrine of chances
Philosophical Transactions 53 370–418
[3] P L Bernstein (1998) Against the Gods — The Remarkable Story of Risk (New York: Wiley)
ISBN–10: 0471295639
[4] J–P Bouchaud and M Potters (2003) Theory of Financial Risk and Derivative Pricing —
From Statistical Physics to Risk Management 2nd Edition (Cambridge: Cambridge University
Press) ISBN–13: 9780521741866
[5] J Bortz (2005) Statistik für Human– und Sozialwissenschaftler 6th Edition (Berlin: Springer)
ISBN–13: 9783540212713
[6] J Bortz and N Döring (2006) Forschungsmethoden und Evaluation für Human– und Sozial-
wissenschaftler 4th Edition (Berlin: Springer) ISBN–13: 9783540333050
[7] K Bosch (1999) Grundzüge der Statistik 2nd Edition (München: Oldenbourg) ISBN–10:
3486252593
[8] A Bravais (1846) Analyse mathématique sur les probabilités des erreurs de situation d’un
point Mémoires présentés par divers savants à l’Académie royale des sciences de l’Institut
de France 9 255–332
[9] M C Bryson (1976) The Literary Digest poll: making of a statistical myth
The American Statistician 30 184–185
[12] J Cohen (2009) Statistical Power Analysis for the Behavioral Sciences 2nd Edition (New
York: Psychology Press) ISBN–13: 9780805802832
[13] H Cramér (1946) Mathematical Methods of Statistics (Princeton, NJ: Princeton University
Press) ISBN–10: 0691080046
[14] L J Cronbach (1951) Coefficient alpha and the internal structure of tests
Psychometrika 16 297–334
[15] P Dalgaard (2008) Introductory Statistics with R 2nd Edition (New York: Springer) ISBN–13:
9780387790534
[16] C Duller (2007) Einführung in die Statistik mit EXCEL und SPSS 2nd Edition (Heidelberg:
Physica) ISBN–13: 9783790819113
[17] The Economist (2013) Trouble at the lab URL (cited on August 25, 2015):
www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-i
[19] H van Elst (2018) An introduction to inductive statistical inference: from parameter estima-
tion to decision-making Preprint arXiv:1808.10137v1 [stat.AP]
[20] W Feller (1951) The asymptotic distribution of the range of sums of independent random
variables The Annals of Mathematical Statistics 22 427–432
[21] W Feller (1968) An Introduction to Probability Theory and Its Applications — Volume 1 3rd
Edition (New York: Wiley) ISBN–13: 9780471257080
[22] R A Fisher (1918) The correlation between relatives on the supposition of Mendelian inheri-
tance Transactions of the Royal Society of Edinburgh 52 399–433
[23] R A Fisher (1924) On a distribution yielding the error functions of several well known statis-
tics Proc. Int. Cong. Math. Toronto 2 805–813
[26] M Freyd (1923) The graphic rating scale Journal of Educational Psychology 14 83–102
[27] F Galton (1869) Hereditary Genius: An Inquiry into its Laws and Consequences (London:
Macmillan)
[29] C F Gauß (1809) Theoria motus corporum celestium in sectionibus conicis solem ambientium
[30] A Gelman, J B Carlin, H S Stern, D B Dunson, A Vehtari and D B Rubin (2014) Bayesian
Data Analysis 3rd Edition (Boca Raton, FL: Chapman & Hall) ISBN–13: 9781439840955
[31] I Gilboa (2009) Theory of Decision under Uncertainty (Cambridge: Cambridge University
Press) ISBN–13: 9780521571324
[33] C Gini (1921) Measurement of inequality of incomes The Economic Journal 31 124–126
[34] J Gleick (1987) Chaos — Making a New Science nth Edition 1998 (London: Vintage) ISBN–
13: 9780749386061
[35] E Greenberg (2013) Introduction to Bayesian Econometrics 2nd Edition (Cambridge: Cam-
bridge University Press) ISBN–13: 9781107015319
[36] J F Hair jr, W C Black, B J Babin and R E Anderson (2010) Multivariate Data Analysis 7th
Edition (Upper Saddle River (NJ): Pearson) ISBN–13: 9780135153093
[37] R Hatzinger and H Nagel (2013) Statistik mit SPSS — Fallbeispiele und Methoden 2nd Edition
(München: Pearson Studium) ISBN–13: 9783868941821
[38] R Hatzinger, K Hornik, H Nagel and M J Maier (2014) R — Einführung durch angewandte
Statistik 2nd Edition (München: Pearson Studium) ISBN–13: 9783868942507
[39] J Hartung, B Elpelt and K–H Klösener (2005) Statistik: Lehr– und Handbuch der ange-
wandten Statistik 14th Edition (München: Oldenburg) ISBN–10: 3486578901
[40] M H S Hayes and D G Paterson (1921) Experimental development of the graphic rating
method Psychological Bulletin 18 98–99
[41] J M Heinzle, C Uggla and N Röhr (2009) The cosmological billiard attractor Advances in Theoretical and Mathematical Physics 13 293–407 and Preprint
arXiv:gr-qc/0702141v1
[43] E T Jaynes (2003) Probability Theory — The Logic of Science (Cambridge: Cambridge Uni-
versity Press) ISBN–13: 9780521592710
[45] D N Joanes and C A Gill (1998) Comparing measures of sample skewness and kurtosis
Journal of the Royal Statistical Society: Series D (The Statistician) 47 183–189
[46] D Kahneman (2011) Thinking, Fast and Slow (London: Penguin) ISBN–13: 9780141033570
[47] D Kahneman and A Tversky (1979) Prospect Theory: an analysis of decision under risk
Econometrica 47 263–292
[48] M Keuls (1952) The use of the “studentized range” in connection with an analysis of variance
Euphytica 1 112–122
[51] A N Kolmogorov (1933) Sulla determinazione empirica di una legge di distribuzione Inst.
Ital. Atti. Giorn. 4 83–91
[52] C Kredler (2003) Einführung in die Wahrscheinlichkeitsrechnung und Statistik Online lec-
ture notes (München: Technische Universität München) URL (cited on August 20, 2015):
www.ma.tum.de/foswiki/pub/Studium/ChristianKredler/Stoch1.pdf
[53] J K Kruschke and T M Liddell (2017) The Bayesian New Statistics: hypothesis
testing, estimation, meta-analysis, and power analysis from a Bayesian perspective
Psychonomic Bulletin & Review 24 1–29 (Brief Report)
[54] W H Kruskal and W A Wallis (1952) Use of ranks in one-criterion variance analysis
Journal of the American Statistical Association 47 583–621
[55] D Lakens (2017) Understanding common misconceptions about p-values (blog entry: De-
cember 5, 2017) URL (cited on June 19, 2019): http://daniellakens.blogspot.com/2017/
[56] P S Laplace (1774) Mémoire sur la probabilité des causes par les évènements Mémoires de
l’Académie Royale des Sciences Presentés par Divers Savans 6 621–656
[57] P S Laplace (1809) Mémoire sur les approximations des formules qui sont
fonctions de très grands nombres et sur leur application aux probabilités
Mémoires de l’Académie des sciences de Paris
[59] E L Lehman and G Casella (1998) Theory of Point Estimation 2nd Edition (New York:
Springer) ISBN–13: 9780387985022
[60] H Levene (1960) Robust tests for equality of variances Contributions to Probability and
Statistics: Essays in Honor of Harold Hotelling eds I Olkin et al (Stanford, CA: Stanford
University Press) 278–292
[61] J A Levin, J A Fox and D R Forde (2010) Elementary Statistics in Social Research 11th
Edition (München: Pearson Education) ISBN–13: 9780205636921
[62] R Likert (1932) A technique for the measurement of attitudes Archives of Psychology 140
1–55
[63] J W Lindeberg (1922) Eine neue Herleitung des Exponentialgesetzes in der Wahrschein-
lichkeitsrechnung Mathematische Zeitschrift 15 211–225
[65] R Lupton (1993) Statistics in Theory and Practice (Princeton, NJ: Princeton University Press)
ISBN–13: 9780691074290
[66] A M Lyapunov (1901) Nouvelle forme du théorème sur la limite de la probabilité Mé-
moires de l’Académie Impériale des Sciences de St.-Pétersbourg VIIIe Série, Classe Physico–
Mathématique 12 1–24 [in Russian]
[68] H B Mann and D R Whitney (1947) On a test of whether one of two random variables is
stochastically larger than the other The Annals of Mathematical Statistics 18 50–60
[69] R McElreath (2016) Statistical Rethinking — A Bayesian Course with Examples in R and
Stan (Boca Raton, FL: Chapman & Hall) ISBN–13: 9781482253443
[70] D Meyer, A Zeileis and K Hornik (2017) vcd: Visualizing categorical data (R package ver-
sion 1.4-4) URL (cited on June 7, 2019): https://CRAN.R-project.org/package=vcd
[71] D Meyer, E Dimitriadou, K Hornik, A Weingessel and F Leisch (2019) Misc functions of the
Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien (R package
version 1.7-1) URL (cited on May 16, 2019): https://CRAN.R-project.org/package=e1071
[72] S P Millard (2013) EnvStats: An R Package for Environmental Statistics (New York:
Springer) ISBN–13: 9781461484554
[73] L Mlodinow (2008) The Drunkard’s Walk — How Randomness Rules Our Lives (New York:
Vintage Books) ISBN–13: 9780307275172
[74] D Newman (1939) The distribution of range in samples from a normal population, expressed
in terms of an independent estimate of standard deviation Biometrika 31 20–30
[75] J Neyman and E S Pearson (1933) On the problem of the most efficient tests of statistical hy-
potheses Philosophical Transactions of the Royal Society of London, Series A 231 289–337
[76] R Nuzzo (2014) Scientific method: statistical errors — P values, the ‘gold standard’ of sta-
tistical validity, are not as reliable as many scientists assume Nature 506 150–152
[78] K Pearson (1900) On the criterion that a given system of deviations from the probable in the
case of a correlated system of variables is such that it can be reasonably supposed to have
arisen from random sampling Philosophical Magazine Series 5 50 157–175
[79] K Pearson (1901) LIII. On lines and planes of closest fit to systems of points in space
Philosophical Magazine Series 6 2 559–572
[82] R Penrose (2004) The Road to Reality — A Complete Guide to the Laws of the Universe
1st Edition (London: Jonathan Cape) ISBN–10: 0224044478
[83] K R Popper (2002) Conjectures and Refutations: The Growth of Scientific Knowledge 2nd
Edition (London: Routledge) ISBN–13: 9780415285940
[84] A Quetelet (1835) Sur l’ Homme et le Développment de ses Facultés, ou Essai d’une Physique
Sociale (Paris: Bachelier)
[85] R Core Team (2019) R: A language and environment for statistical computing (Wien: R Foun-
dation for Statistical Computing) URL (cited on June 24, 2019): https://www.R-project.org/
[86] W Revelle (2019) psych: Procedures for psychological, psychometric, and per-
sonality research (R package version 1.8.12) URL (cited on June 2, 2019):
https://CRAN.R-project.org/package=psych
[87] H Rinne (2008) Taschenbuch der Statistik 4th Edition (Frankfurt/Main: Harri Deutsch)
ISBN–13: 9783817118274
[88] P Saha (2002) Principles of Data Analysis Online lecture notes URL (cited on August 15,
2013): www.physik.uzh.ch/~psaha/pda/
[91] R Schnell, P B Hill and E Esser (2013) Methoden der empirischen Sozialforschung 10th
Edition (München: Oldenbourg) ISBN–13: 9783486728996
[92] D S Sivia and J Skilling (2006) Data Analysis — A Bayesian Tutorial 2nd Edition (Oxford:
Oxford University Press) ISBN–13: 9780198568322
[93] N Smirnov (1939) On the estimation of the discrepancy between empirical curves of distri-
bution for two independent samples Bull. Math. Univ. Moscou 2 fasc. 2
[94] L Smith (2007) Chaos — A Very Short Introduction (Oxford: Oxford University Press)
ISBN–13: 9780192853783
[95] G W Snedecor (1934) Calculation and Interpretation of Analysis of Variance and Covariance
(Ames, IA: Collegiate Press)
[96] C Spearman (1904) The proof and measurement of association between two things
The American Journal of Psychology 15 72–101
[97] Statistical Society of London (1838) Fourth Annual Report of the Council of the Statistical
Society of London Journal of the Statistical Society of London 1 5–13
[98] S S Stevens (1946) On the theory of scales of measurement Science 103 677–680
[99] S M Stigler (1986) The History of Statistics — The Measurement of Uncertainty before 1900
(Cambridge, MA: Harvard University Press) ISBN–10: 067440341x
[100] Student [W S Gosset] (1908) The probable error of a mean Biometrika 6 1–25
[102] G M Sullivan and R Feinn (2012) Using effect size — or why the p value is not enough
Journal of Graduate Medical Education 4 279–282
[103] E Svetlova and H van Elst (2012) How is non-knowledge represented in economic theory?
Preprint arXiv:1209.2204v1 [q-fin.GN]
[104] E Svetlova and H van Elst (2014) Decision-theoretic approaches to non-knowledge in eco-
nomics Preprint arXiv:1407.0787v1 [q-fin.GN]
[105] N N Taleb (2007) The Black Swan — The Impact of the Highly Improbable (London: Pen-
guin) ISBN–13: 9780141034591
[106] M Torchiano (2018) effsize: Efficient effect size computation (R package version 0.7.4)
URL (cited on June 8, 2019): https://CRAN.R-project.org/package=effsize
[107] H Toutenburg (2004) Deskriptive Statistik 4th Edition (Berlin: Springer) ISBN–10:
3540222332
[108] H Toutenburg (2005) Induktive Statistik 3rd Edition (Berlin: Springer) ISBN–10:
3540242937
[109] W M K Trochim (2006) Web Center for Social Research Methods URL (cited on June 22,
2012): www.socialresearchmethods.net
[110] J W Tukey (1977) Exploratory Data Analysis (Reading, MA: Addison–Wesley) ISBN–10:
0201076160
[111] A Tversky and D Kahneman (1983) Extensional versus intuitive reasoning: the conjunction
fallacy in probability judgment Psychological Review 90 293–315
[112] S Vasishth (2017) The replication crisis in science (blog entry: December 29, 2017) URL
(cited on July 2, 2018): https://thewire.in/science/replication-crisis-science
[113] J Venn (1880) On the employment of geometrical diagrams for the sensible representations
of logical propositions Proceedings of the Cambridge Philosophical Society 4 47–59
[114] G R Warnes, B Bolker, T Lumley and R C Johnson (2018) gmodels: Various R program-
ming tools for model fitting (R package version 2.18.1) URL (cited on June 27, 2019):
https://CRAN.R-project.org/package=gmodels
[115] S L Weinberg and S K Abramowitz (2008) Statistics Using SPSS 2nd Edition (Cambridge:
Cambridge University Press) ISBN–13: 9780521676373
[116] M C Wewel (2014) Statistik im Bachelor–Studium der BWL und VWL 3rd Edition
(München: Pearson Studium) ISBN–13: 9783868942200
[117] H Wickham (2016) ggplot2: Elegant Graphics for Data Analysis (New York: Springer)
ISBN–13: 9783319242774 URL (cited on June 14, 2019): ggplot2.tidyverse.org
[120] WolframMathWorld (2015) Random number URL (cited on January 28, 2015):
mathworld.wolfram.com/RandomNumber.html
[122] G U Yule (1897) On the theory of correlation Journal of the Royal Statistical Society 60 812–854