Katarzyna Stapor
Introduction to Probabilistic and Statistical Methods with Examples in R
Intelligent Systems Reference Library, Volume 176
Series Editors
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
Lakhmi C. Jain, Faculty of Engineering and Information Technology, Centre for
Artificial Intelligence, University of Technology, Sydney, NSW, Australia;
KES International, Shoreham-by-Sea, UK;
Liverpool Hope University, Liverpool, UK
The aim of this series is to publish a Reference Library, including novel advances
and developments in all aspects of Intelligent Systems in an easily accessible and
well structured form. The series includes reference works, handbooks, compendia,
textbooks, well-structured monographs, dictionaries, and encyclopedias. It contains
well integrated knowledge and current information in the field of Intelligent
Systems. The series covers the theory, applications, and design methods of
Intelligent Systems. Virtually all disciplines such as engineering, computer science,
avionics, business, e-commerce, environment, healthcare, physics and life science
are included. The list of topics spans all the areas of modern intelligent systems
such as: Ambient intelligence, Computational intelligence, Social intelligence,
Computational neuroscience, Artificial life, Virtual society, Cognitive systems,
DNA and immunity-based systems, e-Learning and teaching, Human-centred
computing and Machine ethics, Intelligent control, Intelligent data analysis,
Knowledge-based paradigms, Knowledge management, Intelligent agents,
Intelligent decision making, Intelligent network security, Interactive entertainment,
Learning paradigms, Recommender systems, Robotics and Mechatronics including
human-machine teaming, Self-organizing and adaptive systems, Soft computing
including Neural systems, Fuzzy systems, Evolutionary computing and the Fusion
of these paradigms, Perception and Vision, Web intelligence and Multimedia.
Indexing: The books of this series are submitted to ISI Web of Science, SCOPUS, DBLP and Springerlink.
Katarzyna Stapor
Faculty of Automatic Control,
Electronics and Computer Science
Silesian Technical University
Gliwice, Poland
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Statisticians use the word experiment to describe any process that generates a set of
data. A simple example of a statistical experiment is the tossing of a coin in which
there are only two possible outcomes, heads or tails. Many board games require the
rolling of an ordinary six-sided die. As in the case of a coin, before the die is rolled,
we cannot predict the outcome. We cannot predict with certainty the number of
accidents that occur monthly at a given road intersection, or whether an item coming off an assembly line will be classified as "defective" or "nondefective", etc.
We are now ready to give a definition of a statistical experiment as the one in which
there are a number of possible outcomes (greater than one), each possible outcome
can be specified in advance, and we have no way of predicting which outcome will
actually occur.
In this section, we will present a mathematical construct of probability theory, a
probability space that is used as a model of a statistical experiment. We will also
provide examples of probability spaces for various statistical experiments.
In probability theory, a probability space is a construct consisting of three elements:

(Ω, Z, P)   (1.1.1)

where
Ω is a sample space,
Z is a set of events (a σ-field of subsets of Ω),
P is a probability measure on Z.

A σ-field Z of subsets of Ω satisfies:

1. Ω ∈ Z
2. A ∈ Z ⇒ Ac = Ω\A ∈ Z
3. A1, A2, … ∈ Z ⇒ ∪_{i=1}^{∞} Ai ∈ Z   (1.1.2)
A union of the events Ai, i ∈ J, denoted as:

∪_{i∈J} Ai   (1.1.3)

is defined as a set consisting of those points of the sample space which belong to at least one of the events Ai, i ∈ J.
An intersection of the events Ai, i ∈ J, denoted as:

∩_{i∈J} Ai   (1.1.4)

is a set consisting of those points of the sample space that belong to every event Ai, i ∈ J.
A difference of the events A1 and A2, denoted as:

A1 − A2 or A1\A2   (1.1.5)

is a set consisting of those points of the sample space that belong to A1 and do not belong to A2.
We say that the occurrence of an event A1 implies the occurrence of an event A2, which we denote as A1 ⊂ A2, if the set of points of the sample space belonging to A1 is included in the set A2.
An event A = Ω is called a sure event, while an empty subset of the set Ω is called an impossible (null) event and denoted with a symbol ∅.
An opposite event Ac of an event A, also called a complementary event, is defined as:

Ac = Ω − A   (1.1.6)
Two events A1 and A2 are called mutually exclusive (disjoint) if:

A1 ∩ A2 = ∅   (1.1.7)

The events Ai, i ∈ J, are pairwise mutually exclusive if:

Ai ∩ Aj = ∅ for all i ≠ j   (1.1.8)

The events Ai, i ∈ J, form a complete (exhaustive) system of events if:

∪_{i∈J} Ai = Ω   (1.1.10)
A probability measure P on Z is a function satisfying the following axioms:

1. ∀ A ∈ Z: 0 ≤ P(A) ≤ 1
2. P(Ω) = 1   (1.1.11)
3. for any countable sequence of pairwise mutually exclusive events A1, A2, … ∈ Z:

P(∪_{i=1}^{∞} Ai) = Σ_{i=1}^{∞} P(Ai)   (1.1.12)
A number P(A) is called a probability of an event A. The first axiom states that
a probability of an event is a non-negative real number from 0 to 1. According to
the second axiom the probability that at least one of elementary events in the entire
sample space will occur is 1 (the assumption of unit measure). The third axiom
specifies the σ -additivity (the countable additivity) which means that a probability
of a union of any countable sequence of pairwise mutually exclusive events in Z is
equal to a sum of the probabilities of these events.
The above stated axioms are also known as the axiomatic definition of probability
(proposed by the Russian mathematician A. N. Kolmogorov).
We will now consider the construction of probability spaces for the three types of
sample spaces: finite, countable and uncountable ones.
Ω = {ω1, ω2, …, ωn}   (1.1.13)

Then the σ-algebra Z is defined as the power set of Ω, that is Z = 2^Ω. To each elementary event {ωi}, ωi ∈ Ω, we can assign a probability P({ωi}) such that

Σ_{i=1}^{n} P({ωi}) = 1   (1.1.14)

To any event

A = {ω1, ω2, …, ωm} = ∪_{k=1}^{m} {ωk},  m ≤ n   (1.1.15)

we can assign a probability P(A) from the third axiom (1.1.12) constituting the definition of the probability measure:

P(A) = Σ_{ωi ∈ A} P({ωi})   (1.1.16)

In particular, when all n elementary events are equiprobable, P({ωi}) = 1/n, we obtain the classical definition of probability:

P(A) = m · (1/n) = m/n = |A|/|Ω|   (1.1.17)
For an uncountable sample space, for example Ω = [a, b] ⊂ R with a point drawn "at random" from the interval, the probability of the event [a1, b1] ⊆ [a, b] is:

P([a1, b1]) = (b1 − a1)/(b − a)   (1.1.24)

More generally, for an event A ⊆ Ω:

P(A) = m(A)/m(Ω)   (1.1.25)

where m(·) denotes the measure (length, area, volume) of a set; this is the so-called geometric definition of probability.
Example 1.1.1 Consider an experiment of tossing a fair coin. There are two possible
outcomes (elementary events): H (head) and T (tail). The associated sample space
is:
Ω = {H, T}   (1.1.26)
Since the two elementary events are equiprobable, we get the following assign-
ment of probabilities:
P({H}) = P({T}) = 1/2   (1.1.28)
Example 1.1.2 Let us consider throwing a fair die. There are six possible outcomes:
"one dot", "two dots", …, "six dots", resulting in a sample space:

Ω = {1, 2, 3, 4, 5, 6}   (1.1.29)

A set Z of events contains 2^6 = 64 elements: the empty set, all possible 1-element subsets, 2-element subsets, …, 5-element subsets, and only one 6-element subset, which is the sample space Ω itself.
Since the six elementary events are equiprobable, we get the following assignment
of probabilities:
P({1}) = … = P({6}) = 1/6   (1.1.30)
Probabilities of other events from a set Z are calculated using the third axiom
of the probability definition. Let’s look for the interpretation of some of the events
from a set Z. Events A1 = {{2}, {4}, {6}} and A2 = {{1}, {3}, {5}} can be described
as “the even/odd number of dots”. Events A1 and A2 are mutually exclusive events.
The event A3 = {{1}, {2}, {3}, {4}, {5}, {6}} = Ω is a sure event: "one of the six possible numbers of dots occurs".
Example 1.1.3 Consider an experiment of tossing a fair coin until a head is obtained.
We may have to toss a coin any number of times before a head is obtained. Thus, the
possible outcomes are: H, TH, TTH, TTTH…. How many outcomes are there? The
outcomes are countable but infinite in number. A countably infinite sample space is
= {H, T H, T T H, . . . . . .} (1.1.31)
Let us denote
ω1 = H, ω2 = T H, ω3 = T T H . . . . . . (1.1.32)
the elementary events ω1, ω2, ω3, …: "head in the 1st, 2nd, 3rd, … toss". We can
assign the probabilities as:
P(ωn) = 1/2^n   (1.1.33)
This is a special case of a so-called geometric distribution. Note that probabilities
cannot be evenly distributed!
For any event A from a set Z, a P measure is calculated using the third axiom of probability. So, for example, for A = {ω1, …, ω4} we have:

P(A) = P(ω1) + P(ω2) + P(ω3) + P(ω4) = 1/2 + 1/4 + 1/8 + 1/16 = 15/16
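A quick numerical check of this calculation can be done in R (a minimal sketch; the variable name is ours, not from the text):

**R**
# probabilities of "head appears for the first time in the n-th toss", n = 1..4
p <- 0.5^(1:4)
sum(p)          # 0.9375 = 15/16
***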
Example 1.1.5 Consider the popular sports lottery. The experiment relies on randomly drawing 6 out of 49 numbers. Before the experiment, a player selects (crosses out) 6 numbers (from 49). The player's win depends on the level of agreement between the numbers selected and those drawn in the experiment (the better the match, the greater the win). We will build the probability space for this experiment and calculate the probability of the event A = "3 numbers fitted".
The set of simple events consists of all 6-element subsets (i.e. combinations) ωi = {ωi1, …, ωi6}, ωik ∈ {1, 2, …, 49}, k = 1, 2, …, 6, which can be created from 49 numbers. The number of such combinations is:

C(49, 6) = 13 983 816   (1.1.35)

Since all combinations are equally likely, each elementary event gets the probability:

pi = 1/C(49, 6)   (1.1.36)
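The probability of the event A = "3 numbers fitted" can be computed in R by the classical counting argument (a sketch; the book's own derivation is not reproduced above):

**R**
# favourable outcomes: 3 of the 6 drawn numbers and 3 of the remaining 43
choose(6, 3) * choose(43, 3) / choose(49, 6)   # ~0.01765
# the same value from the hypergeometric distribution
dhyper(3, 6, 43, 6)
***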
From the earlier stated axioms of probability (1.1.12), several basic results or
properties can be established. We list only some of them.
1. P(∅) = 0
2. P(Ac) = P(Ω) − P(A) = 1 − P(A)
Proof We have
as a union of disjoint sets. Then, based on the 3rd axiom of probability we have:
Substituting in the above formula for P(A2 \A1 ) the right side of expression
(1.1.43), we obtain the final property:
The conditional probability of an event Ai given that an event A0 has occurred is defined as:

P(Ai|A0) = P(Ai ∩ A0)/P(A0),  provided P(A0) > 0   (1.2.1)

Two events Ai and A0 are called independent if:

P(Ai|A0) = P(Ai)   (1.2.2)
From the above two formulas, it follows that two events Ai and A0 are independent
if and only if:
P(Ai ∩ A0 ) = P( Ai ) · P( A0 ) (1.2.3)
Example 1.2.1 Consider rolling a six-sided die twice. Let us determine the
probability space of this experiment. The set has 36 elementary events:
11 21 31 41 51 61
12 22 32 42 52 62
13 23 33 43 53 63
14 24 34 44 54 64
15 25 35 45 55 65
16 26 36 46 56 66
The symbol “13”, for example, means the event “one point” has occurred in a first
roll and "three points" in a second. Each subset of the set Ω is an event, i.e., Z = 2^Ω.
The elementary events are equally likely, therefore, we can assign probabilities as:
pi = P(ωi) = 1/36,  i = 1, …, 36   (1.2.5)
Let’s introduce the following denotation of events:
A1 one point in a first roll,
A2 two points in a second roll,
A3 sum of points in two rolls is equal to 3.
The probability of the event A1 = {11, 12, 13, 14, 15, 16} is equal:
P(A1) = 6/36 = 1/6   (1.2.6)
Similarly, we calculate the probabilities of other events. Probability of the event
A2 = {12, 22, 32, 42, 52, 62} is equal:
P(A2) = 6/36 = 1/6   (1.2.7)
and of the event A3 = {12, 21}:
P(A3) = 2/36   (1.2.8)
Intuitively, it seems that the events A1 and A2 are independent because the result
of the first roll has nothing in common with the result obtained in the second roll. This
fact follows from the equality:
P(A1 ∩ A2) = P({12}) = 1/36 = (1/6) · (1/6) = P(A1) · P(A2)   (1.2.9)
On the other hand, events A1 and A3 are dependent, because the occurrence of an
event A1 influences the probability of an event A3 :
P(A3|A1) = P(A3 ∩ A1)/P(A1) = (1/36)/(1/6) = 1/6 > P(A3) = 2/36   (1.2.10)

Formally, the dependence also follows from:

P(A1 ∩ A3) = P({12}) = 1/36 ≠ P(A1) · P(A3) = (1/6) · (2/36)   (1.2.11)
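The independence of A1 and A2 can also be checked empirically with a short simulation in R (a sketch written for this example, not code from the text):

**R**
set.seed(1)                              # for reproducibility
n  <- 1e5
r1 <- sample(1:6, n, replace = TRUE)     # first roll
r2 <- sample(1:6, n, replace = TRUE)     # second roll
mean(r1 == 1 & r2 == 2)                  # estimate of P(A1 and A2), close to 1/36
mean(r1 == 1) * mean(r2 == 2)            # estimate of P(A1)*P(A2)
***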
Bernoulli Trials
A statistical experiment often consists of independent repeated trials, each with
exactly two possible outcomes that may be labeled “success” (S) and “failure” (F).
The probability p of success (or failure q) is the same each time the experiment
is conducted. Such independent repeated trials of an experiment with exactly two
possible outcomes are called Bernoulli trials. The probability of success and the
probability of failure sum to unity (i.e. p + q = 1) since these are complementary
events.
The most obvious example is flipping a coin in which obverse (“heads”) con-
ventionally denotes success and reverse (“tails”) denotes failure. A fair coin has the
probability of success p = 0.5 by definition.
One may be interested in probability of obtaining a certain number of successes,
denote it k, in n Bernoulli trials. The following theorem enables computation of such
probability.
Consider one particular sequence of n trials in which exactly k successes occur, for example k successes followed by n − k failures:

S S … S F F … F   (1.2.13)
(k times) (n − k times)

Since the trials are independent, the probability of this particular sequence is:

p^k · q^(n−k)   (1.2.14)

There are C(n, k) such sequences, hence the probability of obtaining exactly k successes in n Bernoulli trials is:

P_{n,k} = C(n, k) · p^k · q^(n−k)
In the Bernoulli trials, the number k0 for which P_{n,k0} is not smaller than any of the other probabilities P_{n,k} is called the most probable number of successes in a series of n trials. It satisfies the following inequalities:
(n + 1) p − 1 ≤ k0 ≤ (n + 1) p (1.2.17)
If the numbers:

(n + 1)p − 1 and (n + 1)p   (1.2.18)

are integers, then there are two most likely numbers of successes. Otherwise, there is only one such number:
k0 = [(n + 1) p] (1.2.19)
Example 1.2.2 The products manufactured by a factory are packed into packages
of 10 items. The quality control states that 35% of packages on average contain
damaged items. A customer has purchased 5 packages. What is the probability of an
event that 3 of purchased packages will contain items without damage? What is the
most likely number of undamaged packages?
The purchase of a single package can be treated as a Bernoulli trial. The success
is a purchase of a package without defective items inside, and its probability is
p = 1 − 0.35 = 0.65. Therefore, the probability of purchasing exactly 3 packages
without damaged items is:
P_{5,3} = C(5, 3) · (0.65)^3 · (0.35)^2 = 0.3364   (1.2.20)
Because (n + 1)p = 6 · 0.65 = 3.9 and (n + 1)p − 1 = 2.9 are not integers, the most likely number of undamaged packages is the single value k0 = [3.9] = 3.
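These values can be verified in R with the built-in binomial functions (a sketch; the numbers follow the example above):

**R**
dbinom(3, size = 5, prob = 0.65)          # P(exactly 3 undamaged packages) ~ 0.3364
probs <- dbinom(0:5, size = 5, prob = 0.65)
which.max(probs) - 1                      # most probable number of successes: 3
***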
Theorem of total probability. Let A1, A2, …, An be mutually exclusive events whose union is the sure event Ω (MECE events), with P(Ai) > 0 for i = 1, …, n. Then, for any event A ∈ Z:

P(A) = Σ_{i=1}^{n} P(A|Ai) · P(Ai)   (1.3.1)
Proof An event A can be represented in terms of the MECE events in the following way:

A = A ∩ Ω = A ∩ (∪_{i=1}^{n} Ai) = ∪_{i=1}^{n} (A ∩ Ai)   (1.3.2)

The events A ∩ Ai are pairwise mutually exclusive, so the third axiom and the definition of conditional probability give P(A) = Σ_i P(A ∩ Ai) = Σ_i P(A|Ai) · P(Ai).
The probabilities P(Ai ) are called a priori probabilities, i.e. they are given in
advance. The probability P( A) defined by Formula (1.3.1) is called the total proba-
bility. The above theorem can be given the following interpretation. Having a given
set of mutually exclusive events Ai in probability space (, Z , P), whose union
is a sure event, it is possible to calculate the probability of any other event A ∈ Z ,
based on a priori probabilities P( Ai ) of those events and the conditional probabilities
P(A|Ai ) for i = 1, . . . , n.
Example 1.3.1 We have three containers A1 , A2 , A3 with some products among
which there are also defective ones. There are 1% defective products in container A1 ,
3% defective products in containers A2 and 10% in container A3 . All the containers
have the same chance to be selected. We randomly select a container, and, then, the
product from this container. We’re interested in a probability that the selected product
is defective.
Let us denote by Ak (k = 1, 2, 3) an event that the kth container is selected. These
events are the MECE events. A priori probabilities of these events are the same:
P(Ak) = 1/3,  k = 1, 2, 3   (1.3.4)
The conditional probabilities follow from the given assumptions about defectivity:
P(A|A1) = 0.01,  P(A|A2) = 0.03,  P(A|A3) = 0.1   (1.3.5)
Using the theorem of total probability, we compute the probability of event A that
a selected product is defective:
P(A) = (1/3) · (0.01 + 0.03 + 0.1) ≈ 0.05   (1.3.6)
Bayes' theorem. Under the same assumptions as in the theorem of total probability, for any event A with P(A) > 0:

P(Ai|A) = P(A|Ai) · P(Ai) / Σ_{j=1}^{n} P(A|Aj) · P(Aj),   i = 1, …, n   (1.3.7)
This is what we want to know: the probability of a hypothesis Ai given the observed
evidence A.
Proof
From the definition of conditional probability we have:
P(Ai|A) = P(A ∩ Ai)/P(A)   (1.3.8)

and

P(A|Ai) = P(A ∩ Ai)/P(Ai)

We can substitute the numerator of the first expression with P(A ∩ Ai) = P(A|Ai) · P(Ai) (this follows from the second expression) and the denominator using the theorem of total probability, which results in:

P(Ai|A) = P(A ∩ Ai)/P(A) = P(A|Ai) · P(Ai) / Σ_{j=1}^{n} P(A|Aj) · P(Aj)   (1.3.9)
Continuing the Example 1.3.1, using the Bayes’ rule, we can now compute a
posteriori probability P(A1 |A) that the randomly selected, defective product comes
from the first container A1 :
P(A1|A) = (0.01 · 1/3) / [(1/3) · (0.01 + 0.03 + 0.1)] = 0.01/0.14 = 1/14   (1.3.10)
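The total probability and the a posteriori probabilities for all three containers can be computed in R as follows (a sketch using the numbers of Example 1.3.1; the variable names are ours):

**R**
prior <- c(1/3, 1/3, 1/3)            # P(A1), P(A2), P(A3)
lik   <- c(0.01, 0.03, 0.10)         # P(A|A1), P(A|A2), P(A|A3)
p_def <- sum(lik * prior)            # total probability of a defective product, ~0.0467
post  <- lik * prior / p_def         # Bayes' rule; post[1] = 1/14
***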
The next example illustrates the nature of Bayesian inference: how the successive
application of Bayes rule enables to update the probability for a hypothesis as more
evidence becomes available.
Example 1.3.2 A patient comes to a doctor and reports some complaints. Based on
symptoms doctor suspects certain disease H (an initial diagnosis). The frequency
of disease H in general population is about 3%. This is a priori probability of an
event H+ that a patient is ill with this disease: p(H+) = 0.03. A doctor orders the
appropriate blood test T which appears to be positive (event T +). The following
two facts are also known: sensitivity of the T test, i.e. the probability of detecting a
disease in a group of patients with this disease is p(T + | H+) = 0.90. The specificity
of the T test, i.e. the probability of negative result of the test T is p(T − | H−) = 0.94.
The probability of a false-positive result is, therefore:

p(T+|H−) = 1 − p(T−|H−) = 1 − 0.94 = 0.06   (1.3.11)
Using the Bayes’ rule, we can now compute how the positive result of the test T
(i.e. an event T +) will modify a priori probability p(H+), the level of our belief in an
initial diagnosis:
p(H+|T+) = p(T+|H+) p(H+) / [p(T+|H+) p(H+) + p(T+|H−) p(H−)]
= (0.90 × 0.03) / (0.90 × 0.03 + 0.06 × 0.97) = 0.027/0.0852 ≈ 0.3169 ≈ 31.7%   (1.3.12)
After a positive test result has occurred, the level of our belief in the initial diagnosis increases significantly: the small a priori probability of 3% increases about tenfold, i.e. up to roughly 31.7%, but it is still more likely that the patient is not ill with the disease H. Do not rely on the results of a single test!
Therefore, the doctor orders another test U, now more specific to the disease H. The sensitivity of the test U is p(U+|H+) = 0.95 and the specificity is p(U−|H−) = 0.97.
The probability of a false-positive result is, therefore:

p(U+|H−) = 1 − 0.97 = 0.03   (1.3.13)

We now take p(H+) = 0.3169 as the a priori probability, i.e. the previously obtained final probability. Then, the positive result of the test U (i.e. an event U+) increases it further:
p(H+|U+) = p(U+|H+) p(H+) / [p(U+|H+) p(H+) + p(U+|H−) p(H−)]
= (0.95 × 0.3169) / (0.95 × 0.3169 + 0.03 × 0.6831) ≈ 0.9363 ≈ 93.6%   (1.3.14)
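The successive updating can be reproduced in R (a sketch with the example's numbers; the function and variable names are ours):

**R**
bayes_update <- function(prior, sens, fpr) {
  # posterior probability of the disease after a positive test result
  sens * prior / (sens * prior + fpr * (1 - prior))
}
p1 <- bayes_update(prior = 0.03, sens = 0.90, fpr = 0.06)   # ~0.317 after test T
p2 <- bayes_update(prior = p1,   sens = 0.95, fpr = 0.03)   # ~0.936 after test U
***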
A random variable X defined on the probability space (Ω, Z, P) is a function:

X: Ω → R   (1.4.1.1)

that assigns to each elementary event ω ∈ Ω a real number X(ω) ∈ R in such a way that for each r ∈ R the set:

{ω ∈ Ω: X(ω) < r}   (1.4.1.2)

is an event, i.e. it belongs to Z.
We shall use a capital letter, say X , to denote a random variable and the
corresponding small letter, x, for one of its values.
In the following examples which illustrate a notion of random variable, we assume
that a probability space (, Z , P) has been defined for a given statistical experiment.
Example 1.4.1.1
1. Consider rolling a six-sided die as a statistical experiment. We can define the
random variable X that assigns to each elementary event ωi ∈ Ω the number of dots that appears on the die:

X(ωi) = i,   i = 1, …, 6   (1.4.1.3)
The corresponding functions for a continuous random variable are the cumulative
distribution function, defined in the same way as in the case of a discrete random
variable, and the probability density function. The definitions of these functions will
follow.
Given a statistical experiment with its associated random variable X and given
a real number x, let us consider the probability of the event {ω : X (ω) < x}, or,
simply, P(X < x). This probability is clearly dependent on the assigned value x.
The function FX: R → [0, 1] defined by the formula FX(x) = P(X < x) is called the cumulative distribution function (cdf) of the random variable X. Some of its properties:
• It exists for discrete and continuous random variables and has values between 0
and 1.
• If a and b are two real numbers such that a < b, then P(a ≤ X < b) = FX(b) − FX(a).
Let X be a discrete random variable that assumes at most a countably infinite number
of values x1 , x2 , . . . with nonzero probabilities. If we denote:
P(X = xi ) = pi i = 1, 2, . . . (1.4.2.1)
then, clearly,
0 ≤ pi ≤ 1 for all i,  Σ_i pi = 1   (1.4.2.2)
A function p(xi) = P(X = xi) = pi, i = 1, 2, …, is called the probability mass function (pmf) of the discrete random variable X. Its cumulative distribution function can then be written as:

FX(x) = Σ_{i: xi ≤ x} pi   (1.4.2.5)
The upper limit for the sum in Eq. (1.4.2.5) means that the sum is taken over all
i satisfying xi ≤ x. Hence, we see that a cdf and pmf of a discrete random variable
contain the same information; each one is recoverable from the other.
Example 1.4.2.1 Consider the simple game involving rolling a six-sided die. If the
number of dots obtained is divisible by 3, the player wins 100 $, otherwise he/she loses 100 $. Let a discrete random variable X associate each elementary event ωi from the sample space Ω = {ω1, ω2, ω3, ω4, ω5, ω6} with a real number equal to his/her win or loss, as shown in the following Table 1.2.
The assignment defined in the table above defines the random variable X . Let’s
define now the distribution of this random variable X. Based on the 3rd axiom of
probability, we first calculate the following probabilities:
P(X = +100) = P({ω3, ω6}) = 1/6 + 1/6 = 1/3   (1.4.2.6)

P(X = −100) = P({ω1, ω2, ω4, ω5}) = 4 · (1/6) = 2/3   (1.4.2.7)
The probability mass function of the random variable X is presented in the
Table 1.3 and its plot in the Fig. 1.1.
The cumulative distribution function of our random variable X is:
F(x) = 0 = P(∅)  for x ≤ −100
F(x) = 2/3 = P(∅ ∪ {1, 2, 4, 5})  for −100 < x ≤ +100   (1.4.2.8)
F(x) = 1 = P(∅ ∪ {1, 2, 4, 5} ∪ {3, 6})  for +100 < x
Fig. 1.1 Plot of the probability mass function of the random variable X (Example 1.4.2.1)
Fig. 1.2 Plot of the cumulative distribution function of the random variable X (Example 1.4.2.1)
The Fig. 1.2 shows a ‘staircase’ plot of the calculated cdf of X which is typical
for a discrete random variable.
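Plots of this kind can be reproduced with a few lines of R (a sketch for this example, not the book's original listing):

**R**
x <- c(-100, 100)
p <- c(2/3, 1/3)                                  # pmf of X: P(X = -100), P(X = +100)
plot(x, p, type = "h", lwd = 2, ylim = c(0, 1), ylab = "P(X=x)")   # needle plot of the pmf
plot(stepfun(x, c(0, cumsum(p))), ylab = "F(x)", main = "")        # staircase cdf
***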
Suppose that the derivative fX(x) = dFX(x)/dx (1.4.3.1) of the cumulative distribution function exists for all x. A function fX(x) is called a probability density function (pdf), or
simply a density function of X . Since FX (x) is monotone nondecreasing, we clearly
have f X (x) ≥ 0 for all x. Additional properties of f X (x) can be derived easily from
Eq. (1.4.3.1). These include:
FX(x) = ∫_{−∞}^{x} fX(t) dt  for all x ∈ R   (1.4.3.2)

∫_{−∞}^{+∞} fX(x) dx = 1   (1.4.3.3)
Example 1.4.3.1 Let us consider random sampling of a number from the range [0, 1], constituting the sample space Ω. Let us specify the random variable X as follows:

X(ω) = ω,  ω ∈ [0, 1]

i.e. the value of the variable X for the particular number selected from the interval [0, 1] is equal to the number that has been selected.
Figure 1.3 shows the plots of probability density and cumulative distribution
functions of the defined random variable X. The plot of cdf (dashed line) is typical
for continuous random variables. It has no jumps or discontinuities as in the case of a
discrete random variable. A continuous random variable assumes a non-denumerable (uncountable)
number of values over a real line. Hence, a probability of a continuous random
variable assuming any particular value is zero and, therefore, no discrete jumps are
possible for its cdf. A probability of having a value in a given interval is found by
using the Eq. (1.4.3.4). A pdf of a continuous random variable plays exactly the same
role as a pmf of a discrete random variable. A function f X (x) can be interpreted as
a mass density (mass per unit length).
Fig. 1.3 Plot of pdf (solid line) and cdf (dashed line) of random variable X (Example 1.4.3.1)
The expectation (expected value) of a discrete random variable X with values xi and pmf pi is defined as:

E(X) = Σ_i xi · pi   (1.4.4.1)

When the range of i extends from 1 to infinity, the sum in Eq. (1.4.4.1) exists if it converges absolutely.
An expectation of a continuous random variable X is defined by the formula:
E(X) = ∫_{−∞}^{+∞} x · fX(x) dx   (1.4.4.2)
if the improper integral is absolutely convergent. The symbol E(.) is regarded here
and in the sequel as the expectation operator.
Let us note two basic properties associated with the expectation operator. For any
real constants a, b, we have
E(b) = b   (1.4.4.3)

E(aX + b) = a · E(X) + b   (1.4.4.4)
D²(X) = ∫_{−∞}^{+∞} (x − E(X))² fX(x) dx   (1.4.4.9)
The variance and standard deviation are the measures of spread, or, dispersion of a
distribution. A standard deviation is measured in the same units as X, while a variance
is in X-units squared. Large values of D 2 (X ) imply a large spread in a distribution
of X about its mean. Conversely, small values imply a sharp concentration of a mass
of distribution in a neighborhood of a mean.
For a variance to exist, it must be assumed that E(X 2 ) < ∞. We note two
properties of a variance of random variable X. They are:
D 2 (b) = 0 (1.4.4.11)
D 2 (a X + b) = a 2 D 2 (X ) (1.4.4.12)
Chebyshev’s Theorem
The probability that any random variable X will assume a value within k standard deviations of its mean is at least (1 − 1/k²). That is,

P(|X − E(X)| < k · D(X)) ≥ 1 − 1/k²

In particular, for k = 3 the complementary event satisfies:

P(|X − E(X)| ≥ 3 · D(X)) ≤ 1/3² = 1/9   (1.4.4.16)
which means that such an observation practically does not happen. For this reason, such values, if observed, are considered as outliers and are usually rejected in statistical analyses.
A standardization of a random variable
Suppose X is a random variable with mean E(X) and standard deviation D(X) > 0. Then the random variable U:

U = (X − E(X)) / D(X)   (1.4.4.17)

has expectation 0 and standard deviation 1 and is called the standardized random variable.
Example 1.4.4.1 The expected value and the variance of discrete random variable
X from Example 1.4.2.1 are:
E(X) = −100 · (2/3) + 100 · (1/3) = −100/3   (1.4.4.18)

D²(X) = (−100 + 100/3)² · (2/3) + (100 + 100/3)² · (1/3) = 80 000/9   (1.4.4.19)
This result means that a player who plays over and over again will, on the average, lose approximately 33.33 $ per game. The spread of these losses around −33.33 $ is about D(X) = √(80 000/9) ≈ 94.28 $.
The analogous values for the continuous random variable X from Example 1.4.3.1
are:
E(X) = ∫_{−∞}^{+∞} x f(x) dx = ∫_{0}^{1} x dx = 1/2   (1.4.4.20)

D²(X) = ∫_{−∞}^{+∞} (x − E(X))² f(x) dx = ∫_{0}^{1} (x − 1/2)² dx = 1/12   (1.4.4.21)
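Both integrals can be checked numerically in R with the integrate() function (a sketch for this uniform example):

**R**
f  <- function(x) dunif(x, 0, 1)                              # pdf of X ~ unif(0, 1)
EX <- integrate(function(x) x * f(x), 0, 1)$value             # 0.5
VX <- integrate(function(x) (x - EX)^2 * f(x), 0, 1)$value    # 1/12 ~ 0.0833
***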
Often, the observations generated by different statistical experiments have the same
general type of behavior. Consequently, the discrete or continuous random variables
associated with these experiments can be described by essentially the same proba-
bility distribution and, therefore, can be represented by a single formula. In fact, one
needs only a handful of important probability distributions to describe many of the
discrete or continuous random variables encountered in practice. Such a handful of
distributions actually describe several real-life random phenomena. For instance, in
an industrial example, when a sample of items selected from a batch of production
is tested, the number of defective items in the sample usually can be modeled as a
hypergeometric random variable.
The current and the next section present the commonly used discrete and
continuous distributions with various examples.
Discrete Uniform Distribution
If the random variable X assumes the values x1, x2, …, xk, each with the same probability, then its probability function is:

P(X = x; k) = 1/k,   x = x1, x2, …, xk   (1.4.5.1)

For example, when a fair die is rolled, each element of the sample space occurs with probability P(X = x; 6) = 1/6 for x = 1, 2, 3, 4, 5, 6.
Two-Point Distribution
The random variable X has a two-point distribution if its probability function is
defined by the formula:
P(X = x1 ) = p
P(X = x2 ) = q
where p, q are the parameters (p + q = 1). An example of such a distribution has been given in Example 1.4.2.1. The expectation and variance of this random variable are:

E(X) = x1 · p + x2 · q   D²(X) = (x1 − x2)² · p · q

Binomial Distribution
The binomial random variable X counts the number of successes x in n Bernoulli trials; its probability function is:

P(X = x) = b(x; n, p) = C(n, x) · p^x · (1 − p)^(n−x),   x = 0, 1, …, n

where p is a probability of success. Its values are denoted by b(x; n, p) (Fig. 1.4).
The expectation and variance of this random variable are:

E(X) = n · p   D²(X) = n · p · (1 − p)
Fig. 1.4 Plot of the probability function of a binomial random variable
Poisson Distribution
Experiments yielding the number of outcomes occurring during a given time interval
are called Poisson experiments. Hence, a Poisson experiment can generate observa-
tions for a random variable X , representing for example a number of telephone calls
per hour received by someone or a number of failures in a production process dur-
ing one day, etc. The number of outcomes occurring in one-time interval in Poisson
experiment is independent of the number that occurs in any other disjoint time inter-
val (no memory property). The probability that a single outcome will occur during
a very short time interval is proportional to its length and the probability that more
than one outcome will occur in such a short time interval is negligible.
The probability distribution of the Poisson random variable X , representing the
number x of outcomes occurring in a given time interval denoted by t, is given as:
P(X = x) = p(x; λt) = e^(−λt) · (λt)^x / x!,   x = 0, 1, …   (1.4.5.5)
where λ (λ > 0) is the average number of outcomes per unit time, and e = 2.71828.
The expectation and variance of Poisson random variables are:
E(X ) = λt D 2 (X ) = λt (1.4.5.6)
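For instance, if a switchboard receives on average λ = 3 calls per hour, the probability of exactly 5 calls in a 2-hour interval can be computed in R as follows (a sketch with made-up numbers, not an example from the text):

**R**
lambda <- 3; t <- 2
dpois(5, lambda * t)        # P(X = 5) for a Poisson(lambda*t) variable, ~0.161
ppois(5, lambda * t)        # P(X <= 5), the cumulative probability
***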
Geometric Distribution
Let us consider an experiment where the properties are the same as those listed for
a binomial experiment (a success with probability p and a failure with probability
q = 1 − p). The only difference now is that independent trials will be repeated until
a first success occur. Let’s define the random variable X , being the number of trial
on which the first success occurs, then, its probability function is described by the
formula:

P(X = x) = (1 − p)^(x−1) · p,   x = 1, 2, …   (1.4.5.7)

The expectation and variance of this random variable are:

E(X) = 1/p   D²(X) = (1 − p)/p²   (1.4.5.8)
Hypergeometric Distribution
A hypergeometric experiment can be described as follows: (1) a random sample of
size n is selected without replacement from population of N items; (2) k of the N
items may be classified as successes and N − k are classified as failures. In general, we are interested in the probability of selecting x successes among the k available and n − x failures among the N − k available. The probability function of such a hypergeometric random variable X is:

P(X = x) = h(x; N, n, k) = C(k, x) · C(N − k, n − x) / C(N, n)

The expectation and variance of the hypergeometric random variable are:

E(X) = n · k / N   (1.4.5.10)

D²(X) = (N − n)/(N − 1) · n · (k/N) · (1 − k/N)   (1.4.5.11)
Fig. 1.7 Plot of probability function of a hypergeometric random variable N = 35, k = 15, n = 10
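Values like those plotted in Fig. 1.7 can be reproduced with R's hypergeometric functions (a sketch; dhyper's arguments are the number of successes drawn, the numbers of successes and failures in the population, and the sample size):

**R**
N <- 35; K <- 15; n <- 10
dhyper(0:n, m = K, n = N - K, k = n)   # P(X = 0), ..., P(X = 10)
n * K / N                              # E(X) from formula (1.4.5.10), ~4.29
***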
Normal Distribution
The most important, continuous probability distribution in the entire field of statistics
is the normal distribution. Its plot, called the normal curve, is the bell-shaped curve
(see Fig. 1.8), which describes approximately many phenomena that occur in nature,
industry, or research. The normal distribution is often referred to as the Gaussian
distribution, in honour of Karl Friedrich Gauss (1777–1855), who also derived its
equation from a study of errors in repeated measurements of the same quantity. The
mathematical equation for the probability distribution of the normal random variable
depends upon the two parameters m and σ , its mean and standard deviation. Hence,
we denote the values of the density as N (x; m, σ ). The probability density function
has the form:
f(x) = N(x; m, σ) = 1/(σ√(2π)) · exp(−(x − m)²/(2σ²))   (1.4.6.1)
It can be shown that the two parameters are related to the expected value and the
variance:
E(X ) = m D 2 (X ) = σ 2 (1.4.6.2)
Fig. 1.8 Plot of the normal probability density function (the normal curve) with mean m and standard deviation σ
P(x1 < X < x2) = 1/(σ√(2π)) · ∫_{x1}^{x2} exp(−(x − m)²/(2σ²)) dx   (1.4.6.3)
By applying the standardization:

Z = (X − m)/σ   (1.4.6.4)
we obtain the standard normal distribution (denoted as N (x; 0, 1)) with mean 0
and variance 1. The probability density and cumulative distribution functions of
N (x; 0, 1) are:
ϕ(x) = 1/√(2π) · exp(−x²/2),   Φ(x) = ∫_{−∞}^{x} ϕ(t) dt   (1.4.6.5)
We have now reduced the required number of tables of normal-curve areas to one,
that of the standard normal distribution which is tabulated.
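In R there is no need for printed tables: pnorm() evaluates the normal cdf directly (a sketch; the parameter values are arbitrary):

**R**
m <- 10; s <- 2
pnorm(12, mean = m, sd = s) - pnorm(8, mean = m, sd = s)   # P(8 < X < 12)
pnorm((12 - m)/s) - pnorm((8 - m)/s)                       # the same value via standardization
***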
We will now provide three important distributions related to the normal distribu-
tion:
• chi-square χ 2
• Student’s t
• F-Snedecor.
Chi-square Distribution (χ 2 )
Let U1, …, Uk be independent random variables, each of which has a standard
normal distribution N (x; 0, 1). The random variable Y:
Y = Σ_{i=1}^{k} Ui²   (1.4.6.6)
has a chi-square distribution with k degrees of freedom (chi(k)). The density function
of this distribution is expressed by the formula:
f(x) = x^(k/2 − 1) · e^(−x/2) / (2^(k/2) · Γ(k/2)),   x > 0   (1.4.6.7)
where Γ (.) means the gamma function defined by the formula (1.4.6.8). Sample plot
of chi-square distribution is shown in Fig. 1.9.
Γ(p) = ∫_{0}^{+∞} x^(p−1) · e^(−x) dx  for p > 0   (1.4.6.8)
Student’s t-Distribution
Let U and Y be the independent random variables with normal N (x; 0, 1) and chi(k)
distributions, respectively. The random variable:
T = U / √(Y/k)   (1.4.6.9)
Fig. 1.9 Plot of the chi-square probability density function
Fig. 1.10 Plot of the Student's t probability density function
has a Student’s t-distribution with k degrees of freedom. The density function of the
Student’s t-distribution is expressed by the formula:
f(x) = Γ((k + 1)/2) / (√(kπ) · Γ(k/2)) · (1 + x²/k)^(−(k+1)/2)   (1.4.6.10)
where Γ(.) means the gamma function defined above. The sample plot of Student's
t-distribution is shown in Fig. 1.10.
F-Snedecor Distribution
Let U and V be two independent random variables with chi(knom ) and chi(kden )
distributions, respectively. The random variable:
F = (U/k_nom) / (V/k_den)   (1.4.6.11)
has the F-Snedecor distribution with k_nom and k_den degrees of freedom of the numerator and denominator, respectively (F in honour of the English statistician R. Fisher). The sample
plot of F-Snedecor’s distribution is shown in Fig. 1.11 (the complicated formula is
not included).
Exponential Distribution
The probability density function of a continuous random variable X with exponential
distribution with parameter a is expressed by the formula:
Fig. 1.11 Plot of F-Snedecor’s probability density, number of degrees of freedom of the
nominator—10, denominator—5
f(x; a) = 0 for x < 0,  f(x; a) = a · e^(−ax) for x ≥ 0   (1.4.6.12)
The expectation and variance of this distribution are defined by the formula
(1.4.6.13). The sample plot of exponential distribution is shown in Fig. 1.12.
E(X) = 1/a   D²(X) = 1/a²   (1.4.6.13)
Exponential distribution is a special case (p = 1) of gamma distribution (with
scale a and shape p parameters) characterized by the following probability density
function:
f(x; a, p) = (a^p / Γ(p)) · x^(p−1) · e^(−ax),   x > 0   (1.4.6.14)
Fig. 1.12 Plot of the exponential probability density function
Uniform Distribution
The probability density function of a continuous random variable X with the uniform distribution on the interval [a, b] is:

f(x) = 1/(b − a) for x ∈ [a, b],  f(x) = 0 for x ∉ [a, b]   (1.4.6.15)
The notation X ∼ uni f (a, b) states that a random variable X has a uniform
distribution on the interval [a, b]. It can be shown that the expectation and variance
of such random variable X are:
E(X) = (a + b)/2   D²(X) = (b − a)²/12   (1.4.6.17)
The plot of probability density and cumulative distribution functions of the
uniform random variable on the interval [0, 1] have been shown in the Example
1.4.3.1.
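R provides the d/p/q/r function families for all of the distributions above; a few illustrative calls (a sketch with arbitrary parameter values):

**R**
pexp(2, rate = 0.5)            # P(X <= 2) for an exponential variable with a = 0.5
qnorm(0.975)                   # 97.5% quantile of N(0, 1), ~1.96
pchisq(3.84, df = 1)           # cdf of chi-square with 1 degree of freedom, ~0.95
punif(0.3, min = 0, max = 1)   # cdf of unif(0, 1) at 0.3
***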
In real life, we are often interested in several random variables that are related to each
other. For example, suppose that we choose a random family, and we would like to
study the number of people in the family (X 1 ) and the household income (X 2 ). Both
X 1 and X 2 are random variables, and we suspect that they are dependent. Formally,
a vector:

(X1, X2, …, Xn)   (1.5.1)

whose components X1, …, Xn are random variables defined on a common probability space is called an n-dimensional random vector (a multidimensional random variable).
Let the two random variables X and Y be defined on the common probability space of
an experiment. A discrete random vector (X, Y ) assumes a finite or countable number
of pair of values (xi , y j ) ∈ R 2 . The joint probability distribution of a random vector
(X, Y ) is determined by the joint probability mass function (pmf) P(x, y) defined
for each pair of values (xi , y j ) of a vector (X, Y ) by:
P({X = xi} ∩ {Y = yj}) = P(X = xi, Y = yj) = p(xi, yj) = pij   (1.5.1.1a)
The joint pmf of discrete random vector (X, Y ) describes how much probability
mass is placed on each possible pair of values (x, y). Now let A be any set consisting
of pairs of values (x, y). Then, the probability P((X, Y) ∈ A) is obtained by summing the joint pmf over the pairs in A:

P((X, Y) ∈ A) = Σ_{(xi, yj) ∈ A} p(xi, yj)   (1.5.1.2)
The joint pmf of discrete random vector (X, Y ) can be presented in the accom-
panying joint probability table (see Table 1.4) in which the first row and column
Table 1.4 The joint probability table of (X, Y ) with marginal distributions
        Y
X       y1     y2     …     yk     pi. = Σ_{j=1}^{k} pij
x1      p11    p12    …     p1k    p1.
x2      p21    p22    …     p2k    p2.
…       …      …      …     …      …
xr      pr1    pr2    …     prk    pr.
p.j = Σ_{i=1}^{r} pij   p.1   p.2   …   p.k   Σ_{i=1}^{r} Σ_{j=1}^{k} pij = 1
contain the values of variables Y and X , respectively. In the central part of the table
there are the probabilities pi j of the joint probability mass function.
Marginal Distribution
Once the joint pmf of the two variables X and Y is available, it is straight-forward
to obtain the distribution of just one of these variables. For any possible value x of
X , the probability P({X = x}) results from holding x fixed and summing the joint
pmf P(x, y) over all y for which the pair (x, y) has a positive probability mass. The
same strategy applies to obtaining the distribution P({Y = y}) of Y .
The marginal probability mass function of X is given by:

P(xi) = P({X = xi}) = pi. = Σ_{j=1}^{k} pij,   i = 1, …, r   (1.5.1.4)

where

Σ_i pi. = 1   (1.5.1.5)

The marginal probability mass function of Y is given by:

P(Y = yj) = p.j = Σ_{i=1}^{r} pij,   j = 1, …, k   (1.5.1.6)

where

Σ_j p.j = 1   (1.5.1.7)
The use of the word marginal here is a consequence of the fact that if the joint
pmf is displayed in a rectangular table as in Table 1.4, then, the row totals give the
marginal pmf of X and the column totals give the marginal pmf of Y .
The marginal distribution of a random vector (X, Y ) can also be determined by
marginal cumulative distribution functions:
If (X, Y ) is of discrete type, then, the above expressions are reduced to the
following:
FX(x) = P(X < x, Y < ∞) = Σ_{xi < x} pi.   (1.5.1.10)

FY(y) = P(X < ∞, Y < y) = Σ_{yj < y} p.j   (1.5.1.11)
Conditional Distribution
Given the two jointly distributed discrete random variables X and Y, the conditional
probability distribution of Y given X, is the probability distribution of Y when X is
known to be a particular value. Formally, the conditional probability mass function
of variable X given Y = y is defined as follows (see Table 1.4):
P(X = xi | Y = yj) = P({X = xi} ∩ {Y = yj}) / P(Y = yj) = pij / p.j,   i = 1, …, r   (1.5.1.12)
Below we present the example illustrating the calculation of the joint, marginal
and conditional distributions of random vector (X, Y ).
Table 1.5 The joint and marginal distributions of vector (X, Y) from Example 1.5.1.1
        Y
X       1      2      3      4      5      6      pi.
0       1/12   1/12   1/12   1/12   1/12   1/12   6/12
1       1/12   1/12   1/12   1/12   1/12   1/12   6/12
p.j     2/12   2/12   2/12   2/12   2/12   2/12   1
Example 1.5.1.1 Consider simultaneously tossing a fair coin and rolling a fair die; let X take the value 0 or 1 depending on the side of the coin, and let Y be the number of dots obtained. The result of such simultaneous tossing and rolling can be described by a random vector (X, Y) that can assume the following pairs of values:
(0, 1) (0, 2) (0, 3) (0, 4) (0, 5) (0, 6) (1, 1) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6)
The probability of occurring of each such pair (event) is:
P(X = xi, Y = yj) = (1/2) · (1/6) = 1/12,   i = 1, 2;  j = 1, …, 6   (1.5.1.18)
The joint distribution of (X, Y) has thus been defined and is presented in the
Table 1.5. The last column and row of this table show the values of the probability
mass function of the two marginal distributions of (X, Y ).
Let us now determine the conditional distributions of the vector (X, Y ). There
are 6 conditional distributions of the random variable X given the variable Y , each
containing the two values of the random variable X at a particular value of variable
Y = 1, . . . , 6:
P(X = 0|Y = 1) = (1/12)/(2/12) = 1/2   P(X = 1|Y = 1) = (1/12)/(2/12) = 1/2
P(X = 0|Y = 2) = (1/12)/(2/12) = 1/2   P(X = 1|Y = 2) = (1/12)/(2/12) = 1/2
P(X = 0|Y = 3) = (1/12)/(2/12) = 1/2   P(X = 1|Y = 3) = (1/12)/(2/12) = 1/2
P(X = 0|Y = 4) = (1/12)/(2/12) = 1/2   P(X = 1|Y = 4) = (1/12)/(2/12) = 1/2
P(X = 0|Y = 5) = (1/12)/(2/12) = 1/2   P(X = 1|Y = 5) = (1/12)/(2/12) = 1/2
P(X = 0|Y = 6) = (1/12)/(2/12) = 1/2   P(X = 1|Y = 6) = (1/12)/(2/12) = 1/2
There are two conditional distributions of the random variable Y given the variable
X . The first distribution, containing 6 values of the variable Y at the value X = 0:
P(Y = 1|X = 0) = (1/12)/(6/12) = 1/6
···
P(Y = 6|X = 0) = (1/12)/(6/12) = 1/6

The second distribution, containing the 6 values of the variable Y at the value X = 1:

P(Y = 1|X = 1) = (1/12)/(6/12) = 1/6
···
P(Y = 6|X = 1) = (1/12)/(6/12) = 1/6
The continuous random vector (X, Y ) assumes the infinite and uncountable number
of pairs of values (xi , y j ) ∈ R 2 . The probability that the pair (X, Y ) of continuous
random variables falls in a two-dimensional set A (such as a rectangle) is obtained
by integrating a function called the joint density function.
Formally, let X and Y be continuous random variables. A joint probability density
function f (x, y) for these two variables is a function satisfying conditions:
f(x, y) ≥ 0   (1.5.2.1)

∫_{−∞}^{+∞} ∫_{−∞}^{+∞} f(x, y) dx dy = 1   (1.5.2.2)

The joint cumulative distribution function of (X, Y) is then:

F(x, y) = P(X < x, Y < y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(u, v) du dv   (1.5.2.3)
Marginal Distribution
The marginal pdf of each variable can be obtained in a manner analogous to what
we did in the case of two discrete variables. The marginal pdf of X at the value x
results from holding x fixed in the pair (x, y) and integrating the joint pdf over y.
Integrating the joint pdf with respect to x gives the marginal pdf of Y. The marginal probability density functions are therefore:

fX(x) = ∫_{−∞}^{+∞} f(x, y) dy   (1.5.2.4)

fY(y) = ∫_{−∞}^{+∞} f(x, y) dx   (1.5.2.5)
The formulas for the marginal cumulative distribution functions in the case of a
continuous vector (X, Y ) take the form:
FX(x) = ∫_{−∞}^{x} fX(t) dt   FY(y) = ∫_{−∞}^{y} fY(t) dt   (1.5.2.6)
Conditional Distribution
The conditional probability density function of the variable X given Y = y is defined
as follows:
f(x|y) = f(x, y) / fY(y)   (1.5.2.7)

and the conditional pdf of Y given X = x as:

f(y|x) = f(x, y) / fX(x)   (1.5.2.8)

The corresponding conditional cumulative distribution functions are:

F(x|y) = ∫_{−∞}^{x} f(t|y) dt   F(y|x) = ∫_{−∞}^{y} f(t|x) dt   (1.5.2.9)
Example 1.5.2.1 Let the joint probability density function of the random vector (X, Y) be specified by the formula:

f(x, y) = (x − y)² for −1 ≤ x ≤ 1, −1 ≤ y ≤ 1,  f(x, y) = 0 otherwise   (1.5.2.10)
Let −1 ≤ x ≤ 1. We have
fX(x) = ∫_{−∞}^{+∞} f(x, y) dy = ∫_{−1}^{1} (x − y)² dy = ∫_{−1}^{1} (x² − 2xy + y²) dy
= [x²y − xy² + y³/3] evaluated from −1 to +1 = 2(x² + 1/3)   (1.5.2.11)
Due to the symmetry, the marginal probability density function of the variable Y
has the same form:
fY(y) = 2(y² + 1/3) for −1 ≤ y ≤ 1,  fY(y) = 0 otherwise   (1.5.2.13)
The conditional probability density function of X given Y = y is:

f(x|y) = f(x, y)/fY(y) = (x − y)² / (2(y² + 1/3))  if x ∈ [−1, 1]   (1.5.2.14)

f(x|y) = 0  if x ∉ [−1, 1]   (1.5.2.15)
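The marginal density obtained above can be verified numerically in R (a sketch; we check the value at x = 0.5):

**R**
f  <- function(x, y) (x - y)^2                        # joint pdf on [-1,1] x [-1,1]
fx <- integrate(function(y) f(0.5, y), -1, 1)$value   # numeric marginal at x = 0.5
fx                                                    # ~1.1667
2 * (0.5^2 + 1/3)                                     # formula (1.5.2.11), also 1.1667
***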
In many situations, information about the observed value of one of the two random
variables X and Y gives information about the value of the other variable, thus, there
is a dependence between the two variables.
Two random variables X and Y are said to be independent if for each pair of their
values (xi , y j ) the following holds:
P(X = xi, Y = yj) = P(X = xi) · P(Y = yj)   (1.5.3.1)
or equivalently
pi j = pi. · p. j (1.5.3.2)
Example 1.5.3.1 The random variables X and Y from Example 1.5.2.1 are dependent, because the analogous factorization condition for densities, f(x, y) = fX(x) · fY(y), does not hold:

f(x, y) = (x − y)² ≠ fX(x) · fY(y) = 4 · (x² + 1/3) · (y² + 1/3)   (1.5.3.4)
Let X and Y be jointly distributed random variables with pmf pi j or pdf f (x, y)
according to whether the variables are discrete or continuous. Then, their expected
values (expectations) are given by:
E(X) = Σ_{i=1}^{r} xi · pi.   E(Y) = Σ_{j=1}^{k} yj · p.j   (1.5.4.1)

for a discrete random vector (X, Y), and

E(X) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} x f(x, y) dx dy   E(Y) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} y f(x, y) dx dy   (1.5.4.2)
for a continuous random vector (X, Y ). These are the expectations in the marginal
distribution of X and Y, respectively.
When two random variables X and Y are not independent, it is frequently of
interest to assess how strongly they are related to one another.
The covariance between the random variables X and Y is defined by the following formula:

cov(X, Y) = E[(X − E(X)) · (Y − E(Y))] = E(XY) − E(X) · E(Y)
For a strong positive relationship, cov(X, Y ) should be quite positive while for a
strong negative—quite negative. If X and Y are not strongly related, covariance will
be near 0.
The defect of covariance is that its computed value depends critically on the units
of measurement. Ideally, the choice of units should have no effect on a measure of
strength of relationship. This is achieved by scaling the covariance.
To this end, the correlation coefficient between the variables X and Y is used, which is a normalized covariance:

ρ = cov(X, Y) / √(D²(X) · D²(Y)) = cov(X, Y) / (D(X) · D(Y))   (1.5.4.5)
where D(X ) and D(Y ) are the standard deviations in the marginal distributions of
X and Y, respectively.
Correlation coefficient between the variables X and Y satisfies the condition:
−1 ≤ ρ ≤ +1 (1.5.4.6)
Theorem 1.5.4.1
1. If X and Y are independent, then ρ = 0, but ρ = 0 does not imply independence.
2. Correlation coefficient ρ 2 = 1 if and only if:
P(Y = a X + b) = 1 (1.5.4.7)
Only when the two variables are perfectly related in a linear manner will ρ be as
positive or negative as it can be. A value of |ρ| less than 1 indicates only that the relationship is
not completely linear. When ρ = 0, X and Y are said to be uncorrelated. Two vari-
ables could be uncorrelated but highly dependent because there is a strong nonlinear
relationship.
In the case of a multidimensional random variable X = (X 1 , X 2 , . . . , X n ), the two
most important parameters characterizing its distribution are: the vector of expected
values E(X) = (E(X1), E(X2), …, E(Xn)) and the covariance matrix:

C = E[(X − EX)(X − EX)^T] =

    | D²(X1)        cov(X1, X2)   …   cov(X1, Xn) |
    | cov(X2, X1)   D²(X2)        …   cov(X2, Xn) |
    | …                                           |
    | cov(Xn, X1)   cov(Xn, X2)   …   D²(Xn)      |   (1.5.4.9)
Example 1.5.4.1 Let’s calculate the covariance of random variables X and Y from
Example 1.5.1.1. The expected values of the two marginal distributions are:
E(X) = 0 · (6/12) + 1 · (6/12) = 6/12   (1.5.4.10)

E(Y) = (2/12) · (1 + 2 + 3 + 4 + 5 + 6) = 42/12   (1.5.4.11)
cov(X, Y) = (1/12) · {(0 − 6/12) · [(1 − 42/12) + … + (6 − 42/12)] + (1 − 6/12) · [(1 − 42/12) + … + (6 − 42/12)]}
= (1/12) · [(1 − 42/12) + … + (6 − 42/12)] · (−6/12 + 6/12) = 0   (1.5.4.12)
It follows that the random variables X and Y are uncorrelated.
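The same result can be obtained numerically from the joint probability table in R (a sketch for Example 1.5.1.1):

**R**
x <- c(0, 1); y <- 1:6
p <- matrix(1/12, nrow = 2, ncol = 6)          # joint pmf p_ij of (X, Y)
EX <- sum(x * rowSums(p)); EY <- sum(y * colSums(p))
covXY <- sum(outer(x - EX, y - EY) * p)        # sum over all pairs (x_i, y_j)
covXY                                          # 0: X and Y are uncorrelated
***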
We now consider distributions of multidimensional discrete random vectors X = (X1, X2, …, Xk) taking values x = (x1, x2, …, xk).
Polynomial Distribution
Consider n independent repetitions of some statistical experiment, which can result
in one of k mutually exclusive events:
A1 , . . . , Ak (1.5.5.2)
P(Ai) = pi,  i = 1, …, k,   Σ_{i=1}^{k} pi = 1   (1.5.5.3)

Let Xi denote the number of occurrences of the event Ai in the n repetitions. The random vector

(X1, …, Xk)   (1.5.5.4)

then assumes values (x1, …, xk) satisfying

Σ_{i=1}^{k} xi = n   (1.5.5.5)

with the probability function:

P(X1 = x1, …, Xk = xk) = n!/(x1! · … · xk!) · p1^{x1} · … · pk^{xk},   xi = 0, 1, …, n,  i = 1, …, k   (1.5.5.6)
The above formula determines the joint probability of an event that in n experi-
ments event A1 will happen x1 times and the event A2 will happen x2 times, and so
on.
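In R the corresponding probabilities are available through dmultinom() (a sketch with made-up numbers: n = 10 trials, three possible events):

**R**
dmultinom(c(5, 3, 2), prob = c(0.5, 0.3, 0.2))   # P(X1 = 5, X2 = 3, X3 = 2), ~0.085
***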
The function r which assigns to each individual value x the expected value of the conditional distribution of the variable Y given X = x:

r(x) = E(Y|X = x)   (1.5.6.1)

is called the type 1 regression function of Y given X. The values E(Y|X = x) are called conditional expectations.
In an analogous way, one can define the type 1 regression function of X given Y = y:

r(y) = E(X|Y = y)   (1.5.6.2)
The regression function can be linear, quadratic, exponential, and so on. The
regression function of independent random variables X and Y is a constant function.
The regression function r (x) of the random variable Y given X, in case when these
variables are correlated (and therefore stochastically dependent), enables prediction
of the value of the variable Y based on the known value of the variable X.
For the random vector (X, Y) from Example 1.5.1.1 the conditional expectations of X given Y are:

E(X|Y = 1) = 0 · (1/2) + 1 · (1/2) = 1/2   (1.5.6.3)
E(X|Y = 2) = 0 · (1/2) + 1 · (1/2) = 1/2   (1.5.6.4)
E(X|Y = 3) = 0 · (1/2) + 1 · (1/2) = 1/2   (1.5.6.5)
E(X|Y = 4) = 0 · (1/2) + 1 · (1/2) = 1/2   (1.5.6.6)
E(X|Y = 5) = 0 · (1/2) + 1 · (1/2) = 1/2   (1.5.6.7)
E(X|Y = 6) = 0 · (1/2) + 1 · (1/2) = 1/2   (1.5.6.8)
As you can see, the obtained regression function r(y) of the random variable X given Y is a constant function:

r(y) = E(X|Y = y) = 1/2 for y = 1, …, 6   (1.5.6.9)
Fig. 1.13 Plot of the constant regression function r(y) = 1/2 of X given Y
This is a consequence of the independence of X and Y in this example, i.e. of the fact that:

pij = pi. · p.j   (1.5.6.10)
In this section, we will discuss the important theorems in probability, often called the
limit theorems, namely the laws of large numbers (LLN) and the central limit theorem
(CLT). Since the limit theorems concern themselves with an asymptotic behavior of
a sequence of random variables, we should first give the necessary definitions of
different types of convergence of random variables.
We begin by recalling some definitions pertaining to convergence of a sequence of real numbers. Let {xn} be a real-valued sequence. We say that the sequence {xn} converges to some x ∈ R (Rudin 1976, p. 47) if for every ε > 0 there exists an n0 ∈ N such that |xn − x| < ε for all n ≥ n0.
Now consider a sequence of random variables defined on a common probability space:

X1, X2, …, Xn   (1.6.2)
Note that for a fixed ω, {X n (ω)} is a sequence of real numbers. Hence, the conver-
gence for this sequence is same as the one in Definition (1.6.1). Since this notion is
too strict for most practical purposes, and does not consider the probability measure,
we define other notions.
Almost sure convergence or convergence with probability 1
A sequence of random variables {Xn} is said to converge almost surely or with probability 1 to X if:

P({ω : lim_{n→∞} Xn(ω) = X(ω)}) = 1

It is shortly denoted by Xn → X w.p. 1. Almost sure convergence demands that the set of ω's where the random variables converge has probability one. In other words, this definition gives the random variables "freedom" not to converge on a set of zero measure!
Convergence in probability
A sequence of random variables {Xn} is said to converge in probability to X if for every ε > 0:

lim_{n→∞} P(|Xn − X| ≥ ε) = 0

It is shortly denoted by Xn →^P X. As seen from the above definition, this notion concerns itself with the convergence of a sequence of probabilities!
Convergence in distribution or weak convergence
A sequence of random variables {Xn} is said to converge in distribution to X if:

lim_{n→∞} F_{Xn}(x) = FX(x)

at all points of continuity of FX(x). It is shortly denoted by Xn →^D X.
Let X1, X2, … be i.i.d. random variables with E(|Xi|) = mi < ∞. In general, we say that a sequence {Xn} of random variables satisfies the weak law of large numbers (WLLN) if the convergence is according to probability:

(1/n) Σ_{i=1}^{n} (Xi − mi) →^P 0   (1.6.7)

and the strong law of large numbers (SLLN) if the convergence is with probability 1:

(1/n) Σ_{i=1}^{n} (Xi − mi) →^{w.p.1} 0   (1.6.8)
The central limit theorem (CLT) states that the standardized sum (or mean) of a sequence of i.i.d. random variables tends toward a standard normal distribution even if the original variables
themselves are not normally distributed.
More formally, let {Xn} be a sequence of i.i.d. random variables having a common finite expectation E(Xi) = m and a finite variance σ². Then the standardized sample mean X̄n = (1/n) Σ_{i=1}^{n} Xi satisfies:

Un = (X̄n − m) / (σ/√n) →^D N(x; 0, 1)   (1.6.11)
X = F −1 (U ) (1.7.3)
where U has the continuous uniform distribution over the interval (0, 1). Then X
is distributed as F, that is, P(X ≤ x) = F(x), x ∈ R.
The inverse transform method can be used in practice as long as we are able to
get an explicit formula for F −1 (y) in closed form. We illustrate this method with
example.
Example 1.7.1 Suppose we have a random variable U ∼ unif(0, 1) and the cdf of the exponential distribution:

F(x) = 1 − e^(−λx),  x ≥ 0   (1.7.4)

Solving y = F(x) for x gives:

x = F^(−1)(y) = −ln(1 − y)/λ   (1.7.5)
This yields X = (−1/λ) · ln(1 − U). But from U ∼ unif(0, 1), it follows that 1 − U ∼ unif(0, 1) and, thus, we can simplify the algorithm by replacing 1 − U by U, obtaining the following
Algorithm for generating an exponential random number at rate λ
1. Generate U ∼ unif(0, 1)
2. Set X = (−1/λ) · ln(U)
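A direct translation of this algorithm into R, compared against the theoretical mean, might look as follows (a sketch; rexp_inv is our own name):

**R**
rexp_inv <- function(n, lambda) {
  u <- runif(n)             # step 1: uniform random numbers
  -log(u) / lambda          # step 2: inverse transform
}
set.seed(7)
x <- rexp_inv(10000, lambda = 2)
mean(x)                     # should be close to 1/lambda = 0.5
***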
We describe now a simple method for generating random numbers from a standard
normal distribution N (0, 1) based on the central limit theorem.
Let's consider the mean

X̄12 = (X1 + … + X12)/12 = S12/12   (1.7.6)
of 12 random variables, each of which has a uniform distribution on the interval (0, 1),
e.g. X i ∼ unif(0, 1). The expectation E(X i ) = 1/2 and the variance D 2 (X i ) = 1/12.
Fig. 1.14 Histogram of standardized sum of 12 random variables from the uniform distribution for
10,000 repetitions. The plot of normal probability density function is also superimposed
According to the CLT, the standardized variable:

Z = (X̄12 − 1/2) / (√(1/12)/√12) = S12 − 6   (1.7.7)
follows a distribution close to the standard normal distribution N (0, 1) which gives
a “recipe” for generating random numbers from this distribution.
Figure 1.14 presents the histogram of distribution of such standardized sum of 12
uniformly distributed random variables for 10,000 repetitions of random experiment
described by the Formula 1.7.7.
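A sketch of R code that produces a histogram of this kind (the exact listing used for Fig. 1.14 is not reproduced in the text above):

**R**
m <- 10000                                    # number of repetitions
Z <- replicate(m, sum(runif(12)) - 6)         # formula (1.7.7): S12 - 6
hist(Z, prob = TRUE, breaks = 40)
curve(dnorm(x), col = "blue", add = TRUE)     # N(0,1) density superimposed
***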
To simulate rolling a fair die, we can divide the interval [0, 1] into six equal parts and assign to each of them one elementary event: to the interval [0, 1/6]—"1 dot", to [1/6, 2/6]—"2 dots", and so on. Sampling a number from the interval [0, 1/6] using the mentioned "runif( )" function is as likely as for the
remaining five intervals. The code for simulation of rolling a die has the following
form.
**R**
number_dots = floor(6*runif(1)+1)
***
Since we have “forced” a computer to “roll a die”, we now show how to design a
computer simulation to answer a question about a chance to win in a certain game.
Example 1.8.1 Let’s consider the following game. A player pays 2 $ and rolls a die
6 times. If a player gets at least two “6 dots”, he wins 6 $, otherwise he loses 2 $
he paid. We are interested in estimation of a chance to win more money than we’ve
paid in 60 such games.
Note that in order to win more than we’ve paid, we must win more than 12 games
in 60 games (at 12 we break even). To answer the question about a chance of financial
success in 60 games, we should count the number of simulations in which we win
more than 12 games of 60 in a large number of simulations, for example 1,000,000.
According to Bernoulli LLN, the resulting frequency is close to the probability of
success.
Below, we present R code that “plays” 1,000,000 times, each time 60 games, and,
then, returns a percentage of those games with at least 13 wins.
**R**
cnt = 0                                  # counts simulations with more than 12 wins
for (r in 1:1000000) {                   # 1,000,000 simulated series of 60 games
  cnt60 = 0                              # number of games won in the current series
  for (i in 1:60) {                      # one series = 60 games
    k = 0                                # number of "6 dots" in the current game
    for (j in 1:6) { if (runif(1) > 5/6) k = k + 1 }   # 6 rolls of the die
    if (k >= 2) cnt60 = cnt60 + 1        # at least two sixes: the game is won
  }
  if (cnt60 > 12) cnt = cnt + 1          # more than 12 wins in the series
}
chance = cnt / 1000000                   # estimated probability
***
With the above presented simulation, the estimated chance to win is approximately
0.898606.
We can also calculate the above probability analytically using the formulas from
the Bernoulli trials. The probability of success in a single game is:
S6 stands for the number of successes in 6 rolls. The probability of obtaining more
than 12 successes in 60 games is:
+ ,
P(S60 > 12) = 1 − P(S60 ≤ 12) = 1 − P60,0 + P60,1 + . . . + P60,12
12
60
=1− (0.2632)i · (1 − 0.2632)60−i ≈ 0.8989 (1.8.2)
i
i=0
The above, precise calculations are time-consuming and not easy to calculate even
with the help of a calculator, arguing in favour of simulation.
Computer simulations can also be used to investigate the distribution of random
variables. Such simulations rely on multiple repetitions of random experiment and
taking the resulting empirical distribution as an approximation of the probability
distribution of random variable under consideration.
Example 1.8.2 Let's consider sampling n independent values from the exponential distribution with parameter λ = 2 using the "rexp" R function (see the R code sketch of the simulation below). Let's calculate the empirical means of such generated random variables Xi, i = 1, …, n:
Mn = Sn/n = (Σ_{i=1}^{n} Xi)/n   (1.8.3)
and, then, make a plot of a sequence of increasing means:
M1 , M2 , . . . , Mn (1.8.4)
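The original listing is not reproduced above; a minimal sketch that produces a plot like Fig. 1.15 (assuming n = 1000 repetitions) could be:

**R**
n <- 1000; lambda <- 2
x <- rexp(n, rate = lambda)          # n independent exponential values
M <- cumsum(x) / (1:n)               # sequence of increasing means M1, ..., Mn
plot(M, type = "l", xlab = "n", ylab = "mean")
abline(h = 1/lambda, col = "red")    # expected value 1/lambda = 0.5
***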
Figure 1.15 shows the plot produced by the above simulation. As you can see,
according to the LLN as the sample size increases, the mean of many independent
realizations of a random variable tends to its expected value, here equal 1/λ = 1/2.
Fig. 1.15 A plot of the sequence of increasing means for 1000 repetitions of the random experiment

Example 1.8.3
Let us now simulate the convergence of a sequence of distributions as described by the
CLT. Consider the sequence of independent random variables from exponential dis-
tribution with parameter λ = 2 and plot the histogram of empirical distribution of
their sums for 10,000 repetitions (Fig. 1.16). The appropriate R code is given below.
**R**
m = 10000
n = 100
lambda = 2
S = replicate(m, sum(rexp(n, rate = lambda)))   # 10,000 sums of 100 exponential values
hist(S, prob = TRUE)
curve(dnorm(x, mean = n/lambda, sd = sqrt(n)/lambda),
      col = "blue", add = TRUE)
***
Fig. 1.16 Histogram of distribution of sums of 100 random variables from exponential distribution
for 10,000 repetitions of random experiment with normal density curve superimposed
Bibliography
Billingsley, P.: Probability and Measure, 3rd edn. Wiley, New York (1995)
Hodges Jr., J.L., Lehmann, E.L.: Basic Concepts of Probability and Statistics, 2nd edn. Society for
Industrial and Applied Mathematics, Philadelphia (2005)
Papoulis, A., Pillai, S.U.: Probability, Random Variables and Stochastic Processes, 4th edn.
McGraw-Hill, New York (2002)
R Core Team: R language definition. http://cran.r-project.org/doc/manuals/r-release/R-lang.pdf
(2019). Accessed 16 Dec 2019
Rudin, W.: Principles of Mathematical Analysis, 3rd edn. McGraw-Hill, New York (1976)
Soong, T.T.: Fundamentals of Probability and Statistics for Engineers, 1st edn. Wiley, Chichester
(2004)
Chapter 2
Descriptive and Inferential Statistics
The 25 salt bags have been randomly sampled from the whole production of a factory (called here a population) and weighed on an independent weighing machine.
The results (in grams) which make up our Sample No. 1 are given below.
*
1000.33, 1004.97, 998.98, 1000.85, 1001.42, 1001.68, 999.58, 1001.16,
1001.79, 997.64, 1001.59, 1000.56, 1003.26, 996.25, 995.83, 999.56,
1002.08, 998.89, 998.09, 1004.42, 1002.14, 998.01, 1002.79, 999.56,
1001.72
*
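As a preview of the descriptive statistics discussed later in this chapter, the basic summaries of Sample No. 1 can be obtained in R as follows (a sketch; the variable name is ours):

**R**
weights <- c(1000.33, 1004.97, 998.98, 1000.85, 1001.42, 1001.68, 999.58, 1001.16,
             1001.79, 997.64, 1001.59, 1000.56, 1003.26, 996.25, 995.83, 999.56,
             1002.08, 998.89, 998.09, 1004.42, 1002.14, 998.01, 1002.79, 999.56,
             1001.72)
mean(weights)   # sample mean weight of the 25 bags
sd(weights)     # sample standard deviation
***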
Before we can understand what a particular sample can tell us about the population,
we should first understand the uncertainty associated with taking a sample from a
given population. This is why we have been studying probability before statistics.
We would like to emphasize the differences between approaches to a problem in
probability and statistics. In a probability problem, properties of the population under
study are assumed known (e.g., some specified distribution), and questions regarding
a sample taken from the population are posed and answered. In an inferential statistics
problem, characteristics of a sample are available, and this information enables us to
draw conclusions about the population. The relationship between the two disciplines
can be summarized by saying that probability reasons from the population to the
sample (deductive reasoning), whereas inferential statistics reasons from the sample
to the population (inductive reasoning).
In this chapter, we will present the four main stages of a statistical study like the one above:
1. designing a study,
2. collecting the data,
3. obtaining descriptive statistics,
4. performing some inferential statistics.
As a result of performing such a four-step study, we’ll be able to draw conclu-
sions like “whether the mean weight of salt bags is conforming to manufacturing
specifications or not” based on different methods of statistical inference.
The question about the mean weight of salt bags produced in a factory (Exam-
ple 2.1.1) is, in fact, a question about an expected value E(X ) of distribution of salt
bags weights in a whole population of manufactured salt bags. The answer can be
obtained by methods of statistical inference in three different ways: by point or interval estimation, or by testing statistical hypotheses, all of which will be presented throughout this chapter.
Population, Sample
A statistical study like the one above will typically focus on a well-defined collection
of objects constituting a population of interest. In the study above, the population
might consist of all salt bags produced during a specified period. When desired
information is available for all objects in the population, we have what is called a
census. Constraints on time, money, and other scarce resources usually make a census
impractical or infeasible. Instead, a subset of the population, a sample, is selected in
some prescribed manner (as in our Example 2.1.1, the set of 25 salt bags sampled
from a particular production). A sampling frame, that is, a listing of the individuals or
objects to be sampled is either available to an investigator or else can be constructed.
Types of Data
We are usually interested only in certain characteristics of the objects in a population:
the weight of salt bags, the number of flaws on the surface of each casing, the gender
of an engineering graduate, and so on. A variable (or feature) is any characteristic
whose value may change from one object to another in the population. Generally,
we are dealing with four types of data.
Nominal (categorical) data are numbers that are used only as names/labels for
categories. If, for example, you had two groups, you could label them “1”, “2”, but
it would mean just the same as if they were entitled “A”, “B”. You cannot do any
statistics or arithmetic on nominal data.
Ordinal (rank) data can be ordered in terms of some property such as size, length,
etc. However, successive points on the scale are not necessarily spaced equally apart.
To rank the swimmers in terms of whether they finish first, second, and so on, is
to measure them on an ordinal scale: all that you know for certain is the order of
finishing—the difference between first and second placed swimmers may be quite
different from the difference between those placed second and third.
Interval data are measured on a scale on which measurements are spaced at
equal intervals, but on which there is no true zero point. The classic example is
the Centigrade temperature scale. On the Centigrade scale, a rise of “one degree”
represents an increase in heat by the same amount, whether one is talking about the
shift from 2 to 3°, or from 25 to 26°. However, because the zero point on the scale is
arbitrary, one cannot make any statements like “4 degrees is twice 2 degrees”.
Ratio scale is the same as the interval scale, except that there is a true zero on
the scale, representing an absence of the measured thing. Measurements of weight,
height, length, time are all measurements on ratio scales. With this kind of scale, the
intervals are equally spaced and we can make meaningful statements about the ratios
of quantities measured.
Data results from making observations either on a single variable or simultane-
ously on two or more variables. An univariate data set consists of observations on a
single variable, while multivariate data arises when observations are made on more
than one variable.
When data collection entails selecting individuals or objects from a frame, the
simplest method for ensuring a representative selection is to take a simple random
sample. This is one for which any particular subset of the specified size (e.g., a sample
of size 100) has the same chance of being selected. For example, if the frame consists
of 1,000,000 serial numbers, the numbers 1, 2, …, up to 1,000,000 could be placed
on identical slips of paper. After placing these slips in a box and thoroughly mixing,
slips could be drawn one by one until the requisite sample size has been obtained.
Alternatively (and much to be preferred), a table of random numbers or a computer’s
random number generator could be employed.
Sometimes, alternative sampling methods like stratified, cluster, systematic and
other ones can be used to make the selection process easier, to obtain extra
information, or to increase the degree of confidence in conclusions.
The next step after the completion of data collection is to organize or represent the
data into a meaningful form, so that a trend, if any, emerging out of the data can be
seen easily. For this purpose, the methods of descriptive statistics can be used. There
are two general subject areas of representing a data set:
1. using tabular forms and visual techniques like frequency tables, histograms,
2. using some numerical summary measures describing location of central tendency,
measures of data variability, asymmetry or kurtosis.
Tabular Methods
One of the common methods for organizing data is to construct frequency distri-
bution. A frequency distribution is a table or graph that displays the frequency of
various outcomes/observations in a sample: each entry in the frequency table con-
tains the frequency, or, count of the occurrences of values within a particular group
or interval, and in this way, the table summarizes the distribution of values in the
sample.
Frequency distribution tables (or shortly frequency tables) can be used for both
discrete and continuous variables. They can be constructed by grouping observations,
i.e. the realizations x1 , x2 , . . . , xn of some random variable X , which stands as a “model” of our data, into certain classes (intervals). If the range of the data is divided into k classes, the class width is:

h = (xmax − xmin) / k   (2.2.1.2)
Generally, the class width is the same for all classes. Note, that equal class intervals
are preferred in frequency distribution, while unequal class interval may be necessary
in certain situations to avoid empty classes. Table 2.1 shows the frequency distribution
table for the Sample Nr. 1 (Example 2.1.1).
Frequency distribution allows us to have a glance at the entire data conveniently.
It shows, whether the observations are high or low, and also, whether they are
concentrated in one area or spread out across the entire scale.
Table 2.1 Frequency distribution table for Sample Nr. 1 (Example 2.1.1)
No. From To Frequency Rel. freq. Cum. freq.
1 995.827 997.352 2 0.08 0.08
2 997.352 998.876 3 0.12 0.20
3 998.876 1000.401 6 0.24 0.44
4 1000.401 1001.925 8 0.32 0.76
5 1001.925 1003.450 4 0.16 0.92
6 1003.450 1004.974 2 0.08 1.00

Let us denote by r the number of sample elements which take values lower than a fixed number x:

r = Σ_{i=1}^{n} 1(xi < x)   (2.2.1.3)

where the indicator function is defined as:

1(arg) = 1 if arg is true, 0 if arg is false   (2.2.1.4)

The empirical cumulative distribution function is then F̂n(x) = r/n.
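As an illustration, a small R sketch of this construction might look as follows (the vector name sample1 and the class boundaries, taken from Table 2.1, are ours; boundary handling in cut() may shift a borderline observation between neighbouring classes):
**R**
sample1 = c(1000.33, 1004.97, 998.98, 1000.85, 1001.42, 1001.68, 999.58, 1001.16,
            1001.79, 997.64, 1001.59, 1000.56, 1003.26, 996.25, 995.83, 999.56,
            1002.08, 998.89, 998.09, 1004.42, 1002.14, 998.01, 1002.79, 999.56, 1001.72)
breaks = seq(995.827, 1004.974, length.out = 7)          # 6 classes as in Table 2.1
classes = cut(sample1, breaks = breaks, include.lowest = TRUE, right = FALSE)
freq = table(classes)                                    # class frequencies
rel = freq / length(sample1)                             # relative frequencies
cbind(freq, rel, cum = cumsum(rel))
plot(ecdf(sample1))                                      # empirical cdf, cf. Fig. 2.4
***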
Pictorial Methods
The information provided by a frequency distribution table is easier to grasp if pre-
sented graphically. The most popular is a histogram which, for interval frequency
table, is a set of rectangles, whose width is equal to the length of class interval while
their heights are equal to the relative frequency divided by the class width (see the example histogram in Fig. 2.2). In the case of point frequency tables, the rectangles in the histogram are reduced to vertical bars at the successive values of the random variable. As the sample size becomes larger, the histogram approaches the plot of the unknown probability distribution of X.
A distribution is said to be symmetric if it can be folded along a vertical axis
so that the two sides coincide. A distribution that lacks symmetry with respect to a
vertical axis is said to be skewed. The distribution illustrated in Fig. 2.2 is slightly skewed to the left since it has a longer left tail and a shorter right tail.
Another display that is helpful for reflecting properties of a sample is the box-
and-whisker plot (see the example in the bottom of Fig. 2.2). This plot encloses the
interquartile range i (for explanations see next subsection) of the data in a box that
has the median displayed within. The interquartile range i has as its extremes the 75th
percentile (upper quartile) and the 25th percentile (lower quartile). For reasonably
large samples, the display shows center of location, variability and the degree of
symmetry: the position of the median symbol relative to the two edges conveys
information about skewness in the middle 50% of the data. In addition to the box, the whiskers extend from the box towards the extreme observations, giving an impression of the overall spread of the data.
Visual summaries of data are excellent tools for obtaining preliminary impressions and insights. More formal data analysis often requires the calculation and
interpretation of numerical summary measures. That is, from the data we try to
extract several summarizing numbers that might serve to characterize the data set
and convey some of its salient features.
In general, four groups of summary measures (statistics) are distinguished:
• measures of location of a data center,
• measures of variability,
• measures of symmetry,
• measures of kurtosis.
Measures of Location
Location measures in a sample are designed to give some quantitative measure of
the location (center) of a sample. A common measure is the arithmetic mean:
x̄ = (1/n) Σ_{i=1}^{n} xi   (2.2.2.1)
It is called the sample mean and is the point which “balances” the system of
“weights” corresponding to the values of elements in a sample. The sample mean
can be greatly affected by the presence of even a single outlier (unusually large or
small observation).
Quantiles (empirical) are the other measures of central tendency that are unin-
fluenced by outliers. Formally, quantile of order p (0 < p < 1) is the lowest value
x p in a sample, for which the empirical cumulative distribution function satisfies the
condition:
F̂n (x p ) ≥ p (2.2.2.2)
Quantiles can be computed exactly from data available, or, based on interpola-
tion expressions built on frequency distribution tables. For example, given that the
observations x1 ≤ x2 ≤ . . . ≤ xn in a sample are arranged in increasing order
of magnitude, the sample median, which is a quantile of order p = 1/2, can be
computed as:
me = q2 = x_((n+1)/2) if n is odd,   me = q2 = (1/2)(x_(n/2) + x_(n/2+1)) if n is even   (2.2.2.3)
The sample median is one of the quartiles, which divide the sample into four equal parts; me is the second quartile q2. The observations above the third quartile q3 constitute the upper quarter of the sample, and the first quartile q1 separates the lower quarter from the upper three-quarters. The quartile q1 can be obtained as the element whose position in the increasingly ordered sample is (n + 1)/4 (with rounding up). Similarly, the quartile q3 is the element whose position in the increasingly ordered sample is 3(n + 1)/4 (with rounding down).
A sample can be even more finely divided using percentiles: the 99th percentile
separates the highest 1% from the bottom 99%, and so on.
The mode is the value which occurs most frequently in a sample.
Measures of Variability
There are many measures of variability. The simplest one is the sample range:

r = xmax − xmin   (2.2.2.4)

and a related measure, the interquartile range:

i = q3 − q1   (2.2.2.5)

The most commonly used measure of variability is the sample variance:

s² = (1/n) Σ_{i=1}^{n} (xi − x̄)²   (2.2.2.6)
which measures the average squared deviation from an arithmetic mean x̄, and the
sample standard deviation, the positive square root of s 2 :
s = √(s²)   (2.2.2.7)
Large variability in a data produces relatively large (xi − x̄)2 , and, thus a large
sample variance.
Coefficient of Variation:

v = s / x̄   (2.2.2.8)
Measures of Symmetry
Skewness is a measure of the asymmetry of the probability distribution. The already
introduced coefficient As of skewness of random variable (also known as Pearson’s
moment) can be estimated from a sample resulting in sample skewness:
as = [ (1/n) Σ_{i=1}^{n} (xi − x̄)³ ] / s³   (2.2.2.9)
A value of zero corresponds to a symmetrical distribution, positive values to right asymmetry, and negative values to left asymmetry. Skewness is positive if the tail on the right side of the distribution is longer, or fatter, than the tail on the left side.
Measures of Kurtosis
Kurtosis is a measure of the “tailedness” of the probability distribution of a real-
valued random variable, and there are different ways of quantifying it. The standard
measure of kurtosis, originating with Karl Pearson, is based on a scaled version of
the fourth moment of the data and can be estimated from a sample resulting in sample
kurtosis:
krt = [ (1/n) Σ_{i=1}^{n} (xi − x̄)⁴ ] / s⁴   (2.2.2.10)
The kurtosis of any univariate normal distribution is equal to 3. Distributions with kurtosis less than 3 are said to be platykurtic, which means the distribution produces fewer and less extreme outliers than the normal distribution (for example the uniform distribution, which does not produce outliers). Distributions with kurtosis greater than 3 are said to be leptokurtic (for example the Laplace distribution).
It is also common practice to use an adjusted version of Pearson’s kurtosis, the
excess kurtosis, which is the kurtosis minus 3, to provide the comparison to the
normal distribution.
Fig. 2.1 Histogram based on distribution Table 2.1 from Sample Nr. 1 with a plot of the normal
probability density superimposed
Fig. 2.2 Histogram for Sample Nr. 1 with box-and-whisker plot in the bottom part
Fig. 2.3 The empirical cumulative distribution function based on distribution Table 2.1 for Sample
Nr. 1
Fig. 2.4 The empirical cumulative distribution function based on raw data from Sample Nr. 1 in the observed points xi
The minimum and maximum values are: xmin = 995.83,
xmax = 1004.97 which gives the value of the range r = 9.14. The sample mean,
is x̄ = 1000.53. In order to calculate the quartiles, we order the sample data as shown in Table 2.3. The position of the q1 quartile is k = (25 + 1)/4 ≈ 7, so the 7th element of the ordered sample is the quartile q1 = 998.98. The q3 quartile occupies position k = 3(25 + 1)/4 ≈ 19, hence its value is q3 = 1001.80. The median takes position k = (25 + 1)/2 = 13, hence its value is q2 = 1000.85. The interquartile range is i = q3 − q1 = 2.82. This range includes the central 50% of the observations in the sample.
The calculated sample and interquartile ranges characterize in some way the variability of weights in Sample Nr. 1. The standard deviation is s = √5.13 ≈ 2.27 and it represents about 0.23% of the sample mean (the coefficient of variation is v = s/x̄ ≈ 0.0023). The sample skewness as = −0.17 confirms the already observed slight left asymmetry of the empirical distribution. The sample kurtosis is krt = 2.39 < 3, so the distribution produces fewer outliers than the normal distribution.
In the bottom part of Fig. 2.2, under the histogram, there is a box-and-whisker
plot for Sample Nr. 1. The median is slightly shifted to the right in relation to the
interquartile range, which indicates a slight left asymmetry (i.e. more elements in
the sample are concentrated to the left of the median).
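The numerical summaries above can be reproduced in R. The sketch below is only illustrative (the vector name sample1 is ours, and R's quantile() uses an interpolation rule that may differ slightly from the positional rule used above):
**R**
sample1 = c(1000.33, 1004.97, 998.98, 1000.85, 1001.42, 1001.68, 999.58, 1001.16,
            1001.79, 997.64, 1001.59, 1000.56, 1003.26, 996.25, 995.83, 999.56,
            1002.08, 998.89, 998.09, 1004.42, 1002.14, 998.01, 1002.79, 999.56, 1001.72)
xbar = mean(sample1)                       # sample mean, approx. 1000.53
me = median(sample1)                       # sample median
q = quantile(sample1, c(0.25, 0.75))       # quartiles q1 and q3
s2 = mean((sample1 - xbar)^2)              # sample variance with 1/n, formula (2.2.2.6)
s = sqrt(s2)                               # sample standard deviation
v = s / xbar                               # coefficient of variation
as = mean((sample1 - xbar)^3) / s^3        # sample skewness, formula (2.2.2.9)
krt = mean((sample1 - xbar)^4) / s^4       # sample kurtosis, formula (2.2.2.10)
c(mean = xbar, sd = s, skewness = as, kurtosis = krt)
***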
We begin this section by discussing more formally the notions of populations and
samples which were already introduced in the previous sections. However, much
more needs to be discussed about them here, particularly in the context of random
variables. The totality of observations with which we are concerned (finite or infi-
nite) constitutes what we call a population. Groups of people, animals, or all possible
outcomes from some complicated engineering system are the examples. Each obser-
vation in a population is a value of some random variable X having some probability
distribution f X (x) (or p X (x) if X is of discrete type). For example, if one is inspecting
items coming off an assembly line for defects, then each observation x in the popu-
lation might be a value 0 or 1 of the Bernoulli random variable X with probability
distribution
x̄ = (1/n) Σ_{i=1}^{n} xi   (2.3.3)
The value x̄ is then used to make an inference concerning the true E(X). Now, x̄ is a function

x̄ = h(x1 , . . . , xn )   (2.3.4)

of the observed sample values; before the sample is actually drawn, it corresponds to the random variable

X̄ = h(X1 , . . . , Xn )   (2.3.5)
Such a random variable is called a statistic. Thus, any function of the random
variables constituting a random sample is called a statistic. In the above example,
the chosen statistic h is:
X̄ = h(X1 , . . . , Xn ) = (1/n) Σ_{i=1}^{n} Xi   (2.3.6)
and is commonly called the sample mean of population X. The term sample mean is applied to both the statistic X̄ and its computed value x̄.
Another example of a statistic is:

S² = (1/n) Σ_{i=1}^{n} (Xi − X̄)²   (2.3.7)

which measures the population variability and is called the sample variance of population X. The statistic S = √(S²) is known as the sample standard deviation.
For the reasons presented in the next section, the following statistic is also frequently used:

S∗² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²   (2.3.8)

Theorem 2.3.1 (sampling distribution of the sample mean) If X̄ is the mean of a random sample of size n drawn from the normal population N(x; m, σ), then the statistic:

U1 = ((X̄ − m)/σ) √n   (2.3.9)

has a standard normal distribution N(x; 0, 1).
A far more important application involves two populations. A scientist, or, engi-
neer is interested in a comparative experiment in which two manufacturing methods,
say 1 and 2, are to be compared. The basis for that comparison is X 1 − X 2 , the dif-
ference in the population means. The following theorem enables statistical inference
about such a difference.
Theorem 2.3.2 (sampling distribution of the difference between two sample means)
If independent samples of size n 1 and n 2 are drawn at random from two nor-
mal populations N (x; m 1 , σ1 ) and N (x; m 2 , σ2 ), respectively, in which σ1 , σ2 are
known, then, the statistic:

U2 = [X̄1 − X̄2 − (m1 − m2)] / √(σ1²/n1 + σ2²/n2)   (2.3.10)

has a standard normal distribution N(x; 0, 1).
When σ is unknown and is replaced by the sample standard deviation S, the statistic:

T = ((X̄ − m)/S) √(n − 1)   (2.3.11)

has a t-distribution with n − 1 degrees of freedom. Similarly, for two independent samples from normal populations with equal (unknown) variances, the statistic:

T = (X̄1 − X̄2) / √{ [(n1 S1² + n2 S2²)/(n1 + n2 − 2)] · (1/n1 + 1/n2) }   (2.3.12)

has a t-distribution with n1 + n2 − 2 degrees of freedom.

Theorem 2.3.5 (sampling distribution of the sample variance) If S² is the variance of a random sample of size n taken from a normal population with variance σ², then the statistic:

χ² = n S² / σ²   (2.3.13)

has a chi-squared distribution with v = n − 1 degrees of freedom.
While it is of interest to let sample information shed light on two population
means, it is often the case that a comparison of variability is equally important, if
not more so. The F-distribution finds enormous application in comparing sample
variances.
Theorem 2.3.6 (sampling distribution of two sample variances)
If
S∗1² = (1/(n1 − 1)) Σ_{i=1}^{n1} (Xi − X̄1)²   (2.3.14)

S∗2² = (1/(n2 − 1)) Σ_{i=1}^{n2} (Xi − X̄2)²   (2.3.15)
are the variances of independent random samples of sizes n1 and n2 , taken from normal populations with variances σ1² and σ2², respectively, then the statistic:
F = (S∗1²/σ1²) / (S∗2²/σ2²)   (2.3.16)

has an F-distribution with n1 − 1 and n2 − 1 degrees of freedom.
2.4 Estimation
As was stated in Sect. 2.1, the field of statistical inference consists of those methods
used to make decisions or to draw conclusions about a population. Statistical infer-
ence may be divided into two major areas: estimation and tests of hypotheses. We
treat these two areas separately, dealing with estimation in this section and hypothesis
testing in the next.
Statisticians distinguish between the classical methods of estimating a population
parameter, whereby inferences are based strictly on information obtained from a ran-
dom sample selected from the population, and the Bayesian method, which utilizes
prior subjective knowledge about the probability distribution of the unknown param-
eters in conjunction with the information provided by the sample data. Throughout
this book we shall use only the classical methods to estimate unknown population
parameters such as the mean and the variance, by computing statistics from ran-
dom samples, and applying the theory of sampling distributions covered in Sect. 2.3.
Bearing in mind the mentioned parametric and nonparametric statistical models,
the methods of statistical inference, i.e. the estimation and hypothesis testing, can
have, depending on a knowledge about a population, parametric or nonparametric
form. We start with parametric point and interval estimation, and then, for the nonparametric case, we present the histogram and kernel probability density estimators. We also describe a bootstrap method for estimating confidence intervals of unknown distribution parameters.
In what follows we use a generic Greek symbol

θ   (2.4.1.1.1)

(theta) to represent the parameter.
eter θ , the torsional resistance of metal bars. Since variability in torsional resistance
is naturally present between the individual bars because of differences in raw mate-
rial batches, manufacturing processes, and measurement procedures, the engineer
is interested in estimating the mean torsional resistance of bars. Thus, in practice,
the engineer will use the random sample of n bars for which the torsional resistance
x1 , . . . , xn has been determined, to compute a number:
θ̂ = h(x1 , . . . , xn ) (2.4.1.1.2)
that is in some sense a reasonable value (or guess) of an unknown, true mean torsional
resistance θ . This number θ̂ is called a point estimate of parameter θ . The function
h(x1 , . . . , xn ), when applied to a different sample, would generally yield a different value of θ̂. We thus see that the estimate is itself a realization of a random variable possessing a probability distribution. Thus, if (X1 , . . . , Xn ) is a random sample of size n, the
statistic:
Θ̂ = h(X1 , . . . , Xn )   (2.4.1.1.3)

is called a point estimator of θ (please note the hat “ˆ” notation above the symbol of a parameter). After the sample has been selected, Θ̂ takes on a particular numerical value θ̂ called the point estimate of θ.
Example 2.4.1.1.1 Let us consider the population represented by the random variable X. The sample mean introduced in Sect. 2.3:

m̂ = X̄ = h(X1 , . . . , Xn ) = (1/n) Σ_{i=1}^{n} Xi   (2.4.1.1.4)

is a point estimator of the unknown population mean

E(X) = m   (2.4.1.1.4a)

Similarly, the sample variance:

σ̂² = S² = h(X1 , . . . , Xn ) = (1/n) Σ_{i=1}^{n} (Xi − X̄)²   (2.4.1.1.5)

is a point estimator of the unknown population variance

D²(X) = σ²   (2.4.1.1.5a)
After the sample from population X has been selected, for example (3, 4, 5), the numerical value x̄ = 4 calculated from this sample data is the point estimate of the unknown parameter, the population mean m = E(X). Similarly, the numerical value s² ≈ 0.67 is the point estimate of the second unknown population parameter, the variance σ² = D²(X).
MSE(Θ̂) = E(Θ̂ − θ)²   (2.4.1.1.6)

which is the expected squared difference between a point estimator Θ̂ and the true parameter value θ. It can be shown that:

MSE(Θ̂) = D²(Θ̂) + bias²(Θ̂)   (2.4.1.1.7)

where

bias(Θ̂) = E(Θ̂) − θ   (2.4.1.1.8)

is the bias of the estimator Θ̂, i.e. the difference between the expected value of the estimator Θ̂ and the true value of the parameter θ. Thus, the mean squared error of a point estimator Θ̂ is the sum of its variance and its squared bias. The mean squared error is an important criterion for comparing two estimators. The relative efficiency of two estimators Θ̂1, Θ̂2 is defined as:

MSE(Θ̂1) / MSE(Θ̂2)   (2.4.1.1.9)
Unbiasedness of an Estimator
If the expected value of a point estimator Θ̂ is equal to the parameter θ:

E(Θ̂) = θ   (2.4.1.1.10)

that is, if its bias is zero:

bias(Θ̂) = E(Θ̂) − θ = 0   (2.4.1.1.11)

then Θ̂ is said to be an unbiased estimator of θ.
Example 2.4.1.1.2 Continuing the previous example, let’s check the unbiasedness of the two estimators, the sample mean X̄ and the sample variance S².
E(X̄) = E( (1/n) Σ_{i=1}^{n} Xi ) = (1/n) Σ_{i=1}^{n} E(Xi) = (1/n) (n · m) = m   (2.4.1.1.12)
The next-to-last step in the derivation above follows directly from the definition of a random sample (E(Xi) = E(X), i = 1, . . . , n). As the bias of X̄ is zero, the sample mean is an unbiased estimator of the population mean m.
Regarding the second estimator S², it can be shown that its expected value is:

E(S²) = σ² − σ²/n = ((n − 1)/n) σ²   (2.4.1.1.14)
bias(S²) = E(S²) − σ² = σ² − σ²/n − σ² = −σ²/n   (2.4.1.1.15)

Thus, S² is a biased estimator of σ². The bias can be removed by using instead the corrected sample variance:

S∗² = (n/(n − 1)) · S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²   (2.4.1.1.16)

which is an unbiased estimator of σ².
Minimum Variance of an Estimator
An unbiased estimator Θ̂ of θ is called a uniformly minimum variance estimator (UMVE) if, for any other unbiased estimator Θ̂∗ of θ:

D²(Θ̂) ≤ D²(Θ̂∗)   (2.4.1.1.19)

for all θ.
Given two unbiased estimators for a given parameter, the one with smaller variance
is preferred because smaller variance implies that observed values of the estimator
tend to be closer to its mean, the true parameter value.
Example 2.4.1.1.3 We have seen in Example 2.4.1.1.2 that X obtained from a sample
of size n is an unbiased estimator for population mean m = E(X ). Does the quality
of X improve as n increases?
D²(X̄) = D²( (1/n) Σ_{i=1}^{n} Xi ) = (1/n²) Σ_{i=1}^{n} D²(Xi) = (1/n²) · n · D²(Xi) = σ²/n   (2.4.1.1.20)
which decreases as n increases. Thus, based on the minimum variance criterion, the
quality of X as an estimator for m = E(X ) improves as n increases.
We may ask further whether, for a fixed sample size n, the estimator X̄ is the best estimator of m = E(X) in terms of unbiasedness and minimum variance, that is, whether X̄ is the UMVE for m = E(X).
In order to answer this question, it is necessary to show that the variance of X ,
which is equal to σ 2 /n, is the smallest among all the unbiased estimators that can
be constructed from the sample. A powerful theorem below shows that it is possible
to determine the minimum achievable variance of any unbiased estimator obtained
from a given sample.
D²(Θ̂) ≥ 1 / [ n · E( (∂ ln f(X, θ)/∂θ)² ) ]   (2.4.1.1.21)

The right-hand side of this inequality is known as the Rao–Cramér lower bound (RCLB).

Example 2.4.1.1.4 Let (X1 , . . . , Xn ) be a random sample taken from the normal population N(x; m, σ) with σ known. We will show that in this case the sample mean X̄ is the UMVE estimator for m = E(X).
We have:
ln f(x, m) = −ln(σ √(2π)) − (x − m)²/(2σ²)   (2.4.1.1.23)
and the derivative is:
∂ ln f(x, m)/∂m = (x − m)/σ²   (2.4.1.1.24)
Thus, the RCLB is equal to:

1 / [ n · E( (∂ ln f(x, m)/∂m)² ) ] = 1 / [ n · E( ((x − m)/σ²)² ) ] = 1 / (n · σ²/σ⁴) = σ²/n   (2.4.1.1.25)
On the other hand, we have already shown in (2.4.1.1.20) that:

D²(X̄) = σ²/n   (2.4.1.1.26)
which proves that the sample mean X is the UMVE for m = E(X ), i.e. is the most
efficient estimator in normal populations.
It should be noted that sometimes biased estimators are preferable to unbiased
estimators because they have smaller mean square error. That is, we may be able to
reduce the variance of the estimator considerably by introducing a relatively small
amount of bias. As long as the reduction in variance is greater than the squared
bias, an improved estimator from a mean square error viewpoint will result. Linear
regression analysis is an area in which biased estimators are occasionally used.
Consistency of an Estimator
An estimator Θ̂ is said to be a consistent estimator for θ if, as the sample size n increases, the following holds for every ε > 0:

lim_{n→∞} P(|Θ̂ − θ| < ε) = 1   (2.4.1.1.27)
The consistency condition states that the estimator Θ̂ converges, in the sense above, to the true value θ as the sample size increases. It is thus a large-sample concept and is a desirable quality for an estimator to have.
Example 2.4.1.1.5 Point estimation of the population mean of salt bag’s weight
Continuing Example 2.1.1, we can use the sample mean X to estimate the unknown
population mean m = E(X ) of salt bag’s weights based on the random Sample Nr.
1. The point estimate is:
x̄ = (1/n) Σ_{i=1}^{n} xi = 1000.53   (2.4.1.1.29)
If we can assume that the variance of the salt bag’s weight distribution is known, for example σ² = 3² = 9 (otherwise, it has to be estimated from the sample), then the standard error of the estimator X̄ is:

D(X̄) = σ/√n = 3/√25 = 0.6   (2.4.1.1.30)
Often, the estimators of parameters are those that appeal to intuition. The estimator X̄ certainly seems reasonable as an estimator of a population mean m =
E(X ). The virtue of S∗2 as an estimator of σ 2 = D 2 (X ) is underscored through the
discussion of unbiasedness in the previous section. But there are many situations
in which it is not obvious what the proper estimator should be. As a result, there
is much to be learned by the student in statistics concerning different methods of
estimation. In this section, we describe the method of maximum likelihood, one
of the most important approaches to estimation in all of the statistical inference.
This method was developed in the 1920s by a famous British statistician, Sir R. A.
Fisher. We introduce the method of maximum likelihood through the example with
a discrete distribution and a single parameter. Denote by (X 1 , . . . , X n ), the random
sample taken from a discrete probability distribution represented by p(x; θ ), where
θ is a single parameter of the distribution, and by x1 , . . . , xn the observed values in
a sample. The following joint probability:
P(X 1 = x1 , . . . , X n = xn |θ ) (2.4.1.2.1)
which, by the independence of the observations in a random sample, is equal to:

L(x1 , . . . , xn ; θ) = Π_{i=1}^{n} p(xi ; θ)   (2.4.1.2.2)
The quantity L(x1 , . . . , xn ; θ ) is called the likelihood of the sample, and also often
referred to as a likelihood function. Note that the variable of the likelihood function
is θ , not the xi .
In the case when X is continuous, we write the likelihood function as:
L(x1 , . . . , xn ; θ) = Π_{i=1}^{n} f(xi ; θ)   (2.4.1.2.3)
The maximum likelihood estimator (MLE) of θ is the value θ̂ that maximizes the likelihood function. The algorithm for determining such an estimator can be presented in the following steps.
1. Find the likelihood function L of a sample for a given population distribution.
2. Applying the necessary condition for the existence of an extreme, find the first
derivative of the likelihood function L and solve the equation:
∂L/∂θ = 0   (2.4.1.2.5)
Since the logarithm is a monotone function, it is usually more convenient to solve the equivalent equation:

∂ ln(L)/∂θ = 0   (2.4.1.2.6)
If we have r unknown parameters, then we obtain a system of r equations, in which the left side of each equation is the partial derivative with respect to one of the r parameters. The solution of such a system is the vector of r estimators of the unknown parameters.
It can be shown that estimators obtained by the maximum likelihood method are asymptotically unbiased, efficient and consistent.
As an example, consider the Poisson distribution with parameter θ:

p(x; θ) = (θ^x / x!) e^(−θ),   x = 0, 1, 2, . . .   (2.4.1.2.7)
Suppose that a random sample x1 , . . . , xn is taken from this distribution to obtain
MLE of unknown parameter θ .
ln L = (Σ_{i=1}^{n} xi) ln θ − nθ − Σ_{i=1}^{n} ln(xi !)   (2.4.1.2.9)

Setting the derivative d ln L/dθ = (Σ_{i=1}^{n} xi)/θ − n equal to zero gives the maximum likelihood estimate θ̂ = x̄, the sample mean.
As a second example, let us find the MLE of the parameter m of the normal distribution N(x; m, σ), with σ treated as known. The likelihood function of the sample is:

L(x1 , . . . , xn ; m) = Π_{i=1}^{n} (2πσ²)^(−1/2) exp( −(xi − m)²/(2σ²) )
= (2πσ²)^(−n/2) exp( −(1/(2σ²)) Σ_{i=1}^{n} (xi − m)² )   (2.4.1.2.14)
ln L = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} (xi − m)²   (2.4.1.2.15)
d ln L / dm = (1/σ²) Σ_{i=1}^{n} (xi − m)   (2.4.1.2.16)
Setting the derivative to zero and solving for the parameter m we obtain:
Σ_{i=1}^{n} xi − n m = 0   (2.4.1.2.17)
m̂ = (Σ_{i=1}^{n} Xi) / n   (2.4.1.2.18)
The MLE estimator of parameter m in the normal distribution, is therefore, the
sample mean.
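To complement the analytic derivation, the following R sketch (our own illustration, with simulated data and an assumed known σ = 2) maximizes the normal log-likelihood numerically and compares the result with the sample mean:
**R**
set.seed(1)                       # for reproducibility of the simulated data
x = rnorm(50, mean = 5, sd = 2)   # simulated sample, true m = 5, known sigma = 2
loglik = function(m) sum(dnorm(x, mean = m, sd = 2, log = TRUE))
opt = optimize(loglik, interval = c(0, 10), maximum = TRUE)
opt$maximum                       # numerical maximum likelihood estimate of m
mean(x)                           # analytic MLE: the sample mean
***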
Even the most efficient unbiased estimator is unlikely to estimate the population
parameter exactly. There are many situations in which it is preferable to determine
an interval within which we would expect to find the value of the parameter. Such an
interval is called an interval estimate. The interval estimate indicates, by its length,
the accuracy of the point estimate. Moreover, the interval estimation provides, on the
basis of a sample from a population, not only information on the parameter values
to be estimated, but also an indication of the level of confidence that can be placed
on possible numerical values of the parameters.
Suppose that a sample (X1 , . . . , Xn ) is drawn from a population having probability density function f(x; θ), θ being the parameter to be estimated. Further suppose that alow (X1 , . . . , Xn ) and aup (X1 , . . . , Xn ) are two statistics such that alow < aup with probability 1.
The interval:

(alow (X1 , . . . , Xn ), aup (X1 , . . . , Xn ))   (2.4.2.1)

is called a [100(1 − α)]% confidence interval for θ (0 < α < 1) if alow and aup can be selected such that:

P(alow < θ < aup ) = 1 − α   (2.4.2.2)
The limits alow and aup are called the lower and upper confidence limits for θ ,
respectively, and (1 − α) is the confidence coefficient. The value of confidence
level (1 − α) is generally taken as 0.90, 0.95, 0.99, and 0.999. Thus, when α = 0.05,
we have a 95% confidence interval. The wider the confidence interval is, the more
confident we can be that the given interval contains the unknown parameter. Of
course, it is better to be 95% confident that the average salt bag’s weight is between 0.9 and 1.1 kg than to be 99% confident that it is between 0.7 and 1.3 kg! Ideally, we
prefer a short interval with a high degree of confidence. Sometimes, restrictions on
the size of our sample prevent us from achieving short intervals without sacrificing
some of our degree of confidence.
Point and interval estimation represent different approaches to gain information
regarding a parameter, but they are related in the sense, that confidence interval
estimators are based on point estimators. We know that the estimator X is a very
reasonable point estimator of E(X ). As a result, the important confidence interval
estimator of E(X ) depends on knowledge of the sampling distribution of X . We only
illustrate the procedure for construction of a confidence interval for a population mean
E(X ) in the case of normal population with known variance σ 2 . This confidence
interval is also approximately valid (because of the central limit theorem) regardless
of whether or not the underlying population is normal, so long as n is reasonably
large (n > 30). However, when the sample is small and σ 2 is unknown, the procedure
for constructing a valid confidence interval is based on t distribution (it is not covered
in this book—the interested reader is referred to the suggested literature at the end).
Many populations encountered in practice are well approximated by the normal
distribution, so this assumption will lead to confidence interval procedures of wide
applicability. When the normality assumption is unreasonable, an alternative is to use nonparametric procedures.
From Theorem 2.3.1, the statistic:

U = ((X̄ − m)/σ) √n   (2.4.2.3)

has the distribution N(x; 0, 1). Suppose we specify that the probability of U falling in the interval (u1 , u2 ) is equal to 1 − α. Choosing u1 = −u(1 − α/2) and u2 = u(1 − α/2) and solving the inequalities for m leads to the [100(1 − α)]% confidence interval for the mean:

( x̄ − u(1 − α/2) · σ/√n , x̄ + u(1 − α/2) · σ/√n )
The obtained interval is a random interval and must be interpreted carefully. The
mean m, although unknown, is nevertheless deterministic and it either lies in an inter-
val or it does not. The probability that this random interval covers the distribution’s
true mean m is 1 − α. Based upon the given sample values, we get the observed
interval.
The obtained interval is symmetrical with respect to x̄, and has the length:

ln = 2 u(1 − α/2) · σ/√n   (2.4.2.13)
For a fixed sample size and confidence coefficient, the length is constant. The
greater the sample size, the shorter the confidence interval, i.e. the greater the
accuracy.
The [100(1 − α)]% confidence interval provides an estimate of the accuracy of our point estimate x̄. If m = E(X) is actually the center value of the interval, then x̄ estimates m without error. The size of the error is the absolute value of the difference between m and x̄, and we can be [100(1 − α)]% confident that this difference will not exceed ln /2 = u(1 − α/2) · σ/√n, the maximum error of estimation.
The accuracy of the estimation can be quantitatively described by the so-called
relative precision:
relprec = (ln /2) / x̄ · 100%   (2.4.2.14)
If we require the maximum error of estimation not to exceed a prescribed value d, the necessary sample size is:

n ≥ u²(1 − α/2) · σ² / d²   (2.4.2.16)
where the value of the quantile, read from tables of the standard normal distribution for the assumed confidence level, equals u(1 − 0.05/2) = 1.96.
So, we have 95% certainty that the interval (999.35, 1001.70) covers the unknown
mean salt bag weight. The length of this interval is:
2 u(1 − α/2) · σ/√n = 2 · 1.96 · 3/√25 = 2.35   (2.4.2.18)
relprec = (1.175 / 1000.53) · 100% = 0.117%   (2.4.2.19)
Now, let’s assume that we want even greater accuracy, i.e. a shorter interval, with maximum error of estimation d = 0.75. The necessary sample size is then:

n ≥ u²(1 − α/2) · σ² / d² = (1.96² · 3²) / (0.75)² ≈ 61.47   (2.4.2.21)

so at least n = 62 salt bags should be weighed.
The bootstrap was first published by Bradley Efron in “Bootstrap methods: another
look at the jackknife” (The Annals of Statistics, 7(1), pp. 1–26 (1979)). In statistics,
bootstrapping is any method that relies on random sampling with replacement. The
idea behind bootstrap is to use the data of a sample study at hand as a “surrogate
population” for the purpose of approximating the sampling distribution of a statistic.
Let us develop now some mathematical notations for explaining the bootstrap
approach. If we have a random sample data of size n:
(x1 , . . . , xn ) (2.4.3.1)
resampling with replacement from the sample data at hand creates a large number of “phantom samples”:

(x1∗ , . . . , xn∗ )   (2.4.3.2)

known as bootstrap samples (we denote a resample of size n by adding a star to the symbols).
Suppose now the population mean is the target of our study (in general the popu-
lation parameter θ ). The corresponding sample statistic computed from the sample
data is:
θ̂ = h(x1 , . . . , xn ) = x̄ = (1/n) Σ_{i=1}^{n} xi   (2.4.3.3)
Similarly, just as x̄ is the sample mean of the original data, we write x̄∗ (in general, θ̂∗) for the mean of the resampled data. We then compute:
θ̂i∗ = h(x1∗ , . . . , xn∗ ),   i = 1, . . . , N   (2.4.3.4)

Fig. 2.6 Histogram of the bootstrap sample means x̄∗ for Sample Nr. 1 (the bootstrap distribution of the sample mean)
using the same computing formula as the one used for θ̂, but now based on N (usually a few thousand) different bootstrap samples (each of size n). From the values θ̂1∗ , θ̂2∗ , . . . , θ̂N∗ of the statistic computed on the subsequent bootstrap samples, a histogram is created (referred to here as the bootstrap distribution of the statistic) (see Fig. 2.6).
The primary application of bootstrap is approximating the standard error of a
sample estimate. One defines the bootstrap standard error as:
SE_B (θ̂) = √[ (1/N) Σ_{i=1}^{N} (θ̂i∗ − θ̂)² ]   (2.4.3.5)
Suppose N = 1000 bootstrap values of the statistic have been computed. After ranking them from bottom to top, let us denote these bootstrap values as:

(θ̂(1)∗ , θ̂(2)∗ , . . . , θ̂(1000)∗ )   (2.4.3.7)

Then, the bootstrap percentile confidence interval at the 95% level of confidence is:

(θ̂(25)∗ , θ̂(975)∗ )   (2.4.3.8)
(i.e. the 2.5th and 97.5th percentiles). It should be pointed out that this method requires the symmetry of the sampling distribution of θ̂ around θ.
Example 2.4.3.1 Let us continue the example of estimating the mean salt bag’s
weight (Example 2.1.1). We determine the confidence intervals at 95% level of con-
fidence for the mean salt bag’s weight using the percentile method. From the Sample
No. 1 we generate N = 1000 bootstrap samples, and for each of them we calculate the sample mean x̄∗. We then sort the 1000 values in ascending order and take the 25th and the 975th of them (i.e. the 2.5th and 97.5th percentiles). Figure 2.6 shows the histogram of the bootstrap sample means x̄∗, i.e. the bootstrap distribution of the sample mean. These two order statistics form the bootstrap percentile confidence interval for the mean salt bag’s weight at the 95% level of confidence.
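A minimal R sketch of this percentile bootstrap is given below (the vector name sample1, the seed, and therefore the exact numerical limits are ours; the book's own resampling results are not reproduced):
**R**
sample1 = c(1000.33, 1004.97, 998.98, 1000.85, 1001.42, 1001.68, 999.58, 1001.16,
            1001.79, 997.64, 1001.59, 1000.56, 1003.26, 996.25, 995.83, 999.56,
            1002.08, 998.89, 998.09, 1004.42, 1002.14, 998.01, 1002.79, 999.56, 1001.72)
set.seed(123)                                  # arbitrary seed for reproducibility
N = 1000
boot_means = replicate(N, mean(sample(sample1, replace = TRUE)))
hist(boot_means, prob = TRUE)                  # bootstrap distribution, cf. Fig. 2.6
sort(boot_means)[c(25, 975)]                   # percentile 95% confidence interval
***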
In this section, instead of assuming a parametric model for the distribution (e.g.
normal distribution with unknown expectation and variance), we only assume that
the probability density function exists and is suitably smooth (e.g. differentiable).
It is then possible to estimate the unknown probability density function f (x). We
assume that we have a random sample (X 1 , . . . , X n ) where X i ∼ F, and F denotes
an unknown cumulative distribution function.
The goal is to estimate the distribution F. In particular, we are interested in
estimating the density f = d F/d x, assuming that it exists. The resulting estimator
will be denoted as fˆ(x).
In this section, we consider two such approaches: the histogram and kernel
estimators.
The histogram is the oldest density estimator. We need to specify an “origin” x0 and
the class width h (also called a bandwidth) for the specification of the intervals:

ck = [x0 + (k − 1) h, x0 + k h),   k = . . . , −1, 0, 1, . . .   (2.4.4.1.1)
for which the histogram counts the number of observations falling into each ck . We
then plot the histogram such that the area of each bar is proportional to the number
of observations falling into the corresponding class (interval ck ).
In searching for the optimal density estimator, we can minimize the mean squared
error M S E(x) at a point x,
MSE(x) = E[ ( f̂(x) − f(x) )² ] = [E( f̂(x) ) − f(x)]² + Var( f̂(x) )   (2.4.4.1.2)
but, then we are confronted with a bias-variance trade-off. Instead of optimizing the
mean squared error at a point x, the integrated Mean Square Error (MISE) defined
by the following formula is used:
MISE = ∫_{−∞}^{+∞} MSE(x) dx   (2.4.4.1.3)
Minimizing the MISE leads to a formula for the optimal class width h in which the unknown value of σ is estimated from the sample by the sample standard deviation s. This gives quite good results also for distributions differing significantly from the normal distribution.
Example 2.4.4.1.1 Let’s return to Example 2.1.1 and assume that we do not know
the functional form of the salt bag’s weight probability density function.
We will construct the histogram estimator based on the Sample Nr. 1. The band-
width h is calculated as (a sample standard deviation for sample Nr 1 is appr.
2.27):
h = 3.486 · σ/√n = 3.486 · 2.27/√25 ≈ 1.579   (2.4.4.1.5)
The calculations of histogram estimator are shown in Table 2.4 for the selected
range (990.00, 1012.109) which was divided into 14 intervals with the length h =
1.579. Figure 2.7 presents the plot of the constructed histogram estimator.
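A corresponding R sketch is shown below; the break points follow Table 2.4 (the range starting at 990.00 with 14 classes of width h ≈ 1.579), and the superimposed normal density N(x; 1000, 3) mirrors Fig. 2.7. The vector name sample1 is ours:
**R**
sample1 = c(1000.33, 1004.97, 998.98, 1000.85, 1001.42, 1001.68, 999.58, 1001.16,
            1001.79, 997.64, 1001.59, 1000.56, 1003.26, 996.25, 995.83, 999.56,
            1002.08, 998.89, 998.09, 1004.42, 1002.14, 998.01, 1002.79, 999.56, 1001.72)
h = 1.579
breaks = 990.00 + h * (0:14)                        # 14 classes covering (990.00, 1012.11)
hist(sample1, breaks = breaks, prob = TRUE)         # histogram density estimator
curve(dnorm(x, mean = 1000, sd = 3), add = TRUE)    # reference normal density
***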
As in the histogram, the relative frequency of observations falling into a small region
can be computed. The density function f (x) at a point x can be represented as:
f(x) = lim_{h→0} (1/(2h)) P(x − h < X ≤ x + h)   (2.4.4.2.1)
Fig. 2.7 Histogram estimator for salt bag’s weight distribution based on Sample Nr. 1. The plot of
N(x; 1000, 3) superimposed
For a fixed small h, this suggests the estimator:

f̂(x) = (1/(2hn)) |{i : Xi ∈ (x − h, x + h)}|   (2.4.4.2.2)
We can represent this estimator in an alternative way:
f̂(x) = (1/(nh)) Σ_{i=1}^{n} K( (x − Xi)/h )   (2.4.4.2.3)
where
K(x) = 1/2 if |x| ≤ 1, and K(x) = 0 otherwise   (2.4.4.2.4)
More generally, any nonnegative function K satisfying ∫_{−∞}^{+∞} K(x) dx = 1 can be used as a kernel.
The asymptotically optimal kernel (i.e. the one minimizing the mean square error) is the Epanechnikov kernel:

Ke(u) = (3/4)(1 − u²) for u ∈ B = [−1, 1],   Ke(u) = 0 for u ∈ R\B   (2.4.4.2.8)

where

u = (x − Xi)/h   (2.4.4.2.9)
The optimal bandwidth h depends on the form of the kernel and the unknown den-
sity function f. Assuming that f is the density of the normal distribution N (x; m, σ ),
the following formula for bandwidth is obtained:
The σ parameter can be estimated from the sample standard deviation s. It turns
out that such an estimation of the parameter h gives quite good results also for
distributions differing significantly from the normal distribution.
Study of the asymptotic mean square error shows that the use of another kernel,
the Gaussian kernel:
Kg(u) = (1/√(2π)) exp(−u²/2)   (2.4.4.2.11)
leads to an increase in the mean square error by only a few percent, and the optimal bandwidth is practically the same as for the Epanechnikov kernel.
Example 2.4.4.2.1 Continuing the previous example, we will construct now a kernel
estimator of the unknown density function using the Gaussian kernel. The bandwidth
h is calculated according to Formula (2.4.4.2.10), which gives h ≈ 1.249. Let’s calculate, for example, the value of the kernel estimator at the point x = 1000:
f̂(1000) = (1/(25 · 1.249 · √(2π))) · [ exp(−(1000 − 1000.33)²/(2 · 1.249²)) + · · · + exp(−(1000 − 1001.72)²/(2 · 1.249²)) ]
= 0.0128 · [ exp(−0.111/3.12) + · · · + exp(−2.970/3.12) ]
= 0.0128 · [ exp(−0.036) + · · · + exp(−0.952) ]
= 0.0128 · [ 0.965 + · · · + 0.386 ] = 0.0128 · 11.010 = 0.1409   (2.4.4.2.13)
In a similar way, we calculate the values of the kernel estimator at the remaining points of the range (990.00, 1012.109). We performed the calculations at 350 points in this range, with a step of about 0.06. Figure 2.8 shows a plot of the kernel estimator obtained for the value h = 1.249.
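R's built-in density() function computes the kernel estimator (2.4.4.2.3) (up to numerical interpolation on a grid); the sketch below, with the vector name sample1 ours, uses the Gaussian kernel with bandwidth h = 1.249 and should roughly reproduce the hand calculation above at x = 1000:
**R**
sample1 = c(1000.33, 1004.97, 998.98, 1000.85, 1001.42, 1001.68, 999.58, 1001.16,
            1001.79, 997.64, 1001.59, 1000.56, 1003.26, 996.25, 995.83, 999.56,
            1002.08, 998.89, 998.09, 1004.42, 1002.14, 998.01, 1002.79, 999.56, 1001.72)
est = density(sample1, bw = 1.249, kernel = "gaussian",
              from = 990.00, to = 1012.109)
plot(est)                               # kernel estimator, cf. Fig. 2.8
approx(est$x, est$y, xout = 1000)$y     # value at x = 1000, close to 0.1409
***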
Besides the estimation of parameters presented in the previous section, often the
problem of inference about a population relies on the formation of a data-based
decision procedure. For example, a medical researcher may decide on the basis of
experimental evidence whether smoking cigarettes increases the risk of cancer in
humans; an engineer might have to decide on the basis of sample data whether there
is a difference between the accuracy of two kinds of gauges produced. Similarly, a
psychologist might wish to collect appropriate data to enable him to decide whether
a person’s personality type and intelligence quotient are independent variables.
In each of the above cases, the person postulates or conjectures something about
an investigated situation. In addition, each must involve the use of experimental data
and decision-making that is based on the data. The conjecture can be put in the form of a statistical hypothesis about the investigated population, for example:
Fig. 2.8 Kernel estimator for salt bag’s weight distribution based on Sample Nr. 1. The plot of N(x;
1000, 3) superimposed (dashed line)
H0 : m = 1000
H1 : m > 1000   (2.5.1)

The statement H0 : m = 1000 is called the null hypothesis, and the statement H1 : m > 1000 the alternative hypothesis. This is an example of a one-sided alternative hypothesis. However, in some situations, we may wish to formulate the two-sided alternative hypothesis, H1 : m ≠ 1000.
Testing the hypothesis involves taking a random sample, computing a test statistic
from a sample data, and then, using a test statistic to make a decision about a null
hypothesis. Referring to Example 2.1.1, a value of the sample mean x̄ that falls close
to the hypothesized value of m = 1000 g is the evidence that the true mean m is really
1000 g, that is, such evidence supports the null hypothesis H0 . On the other hand, a
sample mean that is considerably different from 1000 g is the evidence in support of
the alternative hypothesis H1 . Thus, the sample mean is the test statistic in this case
(i.e. Tn = X ). The sample mean can take on many different values. Suppose, that if
x̄ > 1003 we will reject the null hypothesis in favor of the alternative: the values of
x̄ that are greater than 1003 constitute the critical region for the test:

K0 = {x̄ : x̄ > 1003}

while the values x̄ ≤ 1003 form the acceptance region, for which we will fail to reject the null hypothesis. It
should be noted that the critical and acceptance regions correspond with the form of
alternative hypothesis stated above.
The boundaries between the critical and the acceptance regions are called the
critical values (here, c = 1003). We reject H0 in favor of H1 if the test statistic falls
in the critical region and fail to reject H0 otherwise.
This decision procedure can lead to either of two wrong conclusions. For example,
the true mean salt bag weight could be equal to 1000 g. However, for the randomly
selected sample that are tested, we could observe a value of the test statistic x̄ that
falls into the critical region. We would then reject the null hypothesis H0 in favor of
the alternative H1 when, in fact, H0 is really true. This type of wrong conclusion is
called a type I error.
Now, suppose that the true mean salt bag weight is different from 1000 g, yet the
sample mean falls in the acceptance region. In this case, we would fail to reject H0
when it is false. This type of wrong conclusion is called a type II error.
Summarizing, rejecting the null hypothesis H0 when it is true is defined as a type
I error. Failing to reject the null hypothesis when it is false is defined as a type II
error.
Because our decision is based on random variables, probabilities can be associated
with the type I and type II errors.
The probability of type I error is denoted by α and defined as:

α = P(type I error) = P(reject H0 when H0 is true)

The type I error probability is called the significance level of the test.
The probability of type II error, which we will denote by β, is defined as:

β = P(type II error) = P(fail to reject H0 when H0 is false)

For instance, if the true mean weight is m = 1009 g, then for our test:

β = P(X̄ ≤ 1003 | m = 1009) ≈ 0.023

because the standardized value that corresponds to the critical value 1003 is now z = (1003 − 1009)/(15/√25) = −2.
The obtained results imply that about 15% of all random samples would lead to
rejection of the null hypothesis H0 : m = 1000 g when the true mean weight is
really 1000 g. If the true value of the mean weight is m = 1009 g, the probability
that we will fail to reject the false null hypothesis is 0.023 (2.3%).
From inspection of Fig. 2.10, we notice that we can reduce the probability of type
I error α by widening the acceptance region. For example, if we make the critical
value c = 1005, the value of α decreases to about 0.048 (= P(X̄ > 1005 | m = 1000)), at the cost of a larger β.
The power of a test is defined as 1 − β. The power can be interpreted as the probability of correctly rejecting a false null
hypothesis. We often compare statistical tests by comparing their power properties.
In our example of testing H0 : m = 1000 based on the critical region K 0 = {x :
X > 1003}, we found that β = 0.023, so the power of this test is 1−β = 1−0.023 =
0.977.
Power is a measure of the sensitivity of a statistical test, where by sensitivity we
mean the ability of the test to detect differences. In this case, the sensitivity of the test
for detecting the difference between the mean salt bag’s weight of 1000 and 1009 g
is 0.977. That means if the true mean weight is really 1009 g, this test will correctly
reject H0 : m = 1000, and “detect” this difference 97.7% of the time.
A test of any hypothesis such as:

H0 : m = m0
H1 : m < m0 or H1 : m > m0   (2.5.1.12)

is called a one-tailed (one-sided) test, whereas a test of:

H0 : m = m0
H1 : m ≠ m0   (2.5.1.13)

is called a two-tailed (two-sided) test.
In this section we have developed the general philosophy for hypothesis test-
ing. We recommend the following sequence of steps in applying hypothesis testing
methodology.
• State the null hypothesis H0 about the investigated population depending on the
problem.
• Specify the appropriate alternative hypothesis H1 .
• Choose the significance level α (i.e. the probability of type I error).
• Determine the appropriate test statistic.
• Determine the critical (rejection) region for the statistic.
• For the random sample from an investigated population, compute the value of test
statistic.
• Decide whether or not H0 should be rejected and report that in the problem context.
The above stated sequence of steps will be illustrated in subsequent two sec-
tions. We would like to notice that we do not provide statistical tables in this book
since R environment provides the relevant critical values to construct critical regions.
The hypothesis-testing procedures discussed in this section are based on the assump-
tion that we are working with random samples from normal populations. Tradition-
ally, we have called these procedures parametric methods because they are based
on a particular parametric family of distributions—in this case, the normal. Later, in
the next section, we describe procedures called nonparametric or distribution-free
methods, which do not make assumptions about the distribution of the underlying
population other than that it is continuous. However, the assumptions of normality
often cannot be justified and we do not always have quantitative measurements. In
such cases, the nonparametric methods are used with increasing frequency by data
analysts.
In this section we will assume that a random sample (x1 , . . . , xn ) has been taken
from the normal population N (x; m, σ ). We will consider tests of hypotheses on a
single population parameters such as mean and variance and then, we extend those
results to the case of two independent populations: we present hypotheses tests for a
difference in means and variances.
Suppose first that the standard deviation σ of the population is known and that we wish to test the hypotheses about the mean:

H0 : m = m0
H1 : m ≠ m0   (2.5.2.1.1)
The test is based on the statistic:

U = (X̄ − m0) / (σ/√n)   (2.5.2.1.2)
that has a normal distribution N (x; 0, 1) which follows from Theorem 2.3.1. Hence,
the expression:
P( −u(1 − α/2) < (X̄ − m0)/(σ/√n) < u(1 − α/2) ) = 1 − α   (2.5.2.1.3)
can be used to write an appropriate nonrejection region which follows from the
symmetry of the normal distribution N (x; 0, 1), where u(1 − α/2) is a quantile of
order (1 − α/2) of this distribution (see Fig. 2.11).
Clearly, a sample producing a value of the test statistic U that falls in the tails of
the distribution of U would be unusual if H0 : m = m 0 is true. Therefore, it is an
indication that H0 is false. Thus, we should reject H0 if the observed value of the test
statistic U falls in the critical (rejection) region:
K0 = (−∞, −u(1 − α/2)) ∪ (u(1 − α/2), +∞)   (2.5.2.1.4)
Note that the probability is α that the test statistic U will fall in the critical region.
It should be kept in mind that, formally, the critical region is designed to control α,
the probability of type I error.
Example 2.5.2.1.1 Hypothesis test for the population mean salt bag’s weight (σ is
known)
Fig. 2.11 Plot of the distribution of the standardized mean test statistic with the critical region
shaded
Let us consider the already investigated problem of the assessment of the mean
weight of salt bags produced in a certain factory (Example 2.1.1, Sample 1). Assum-
ing that the weight’s distribution is approximately normal with the standard deviation
σ = 3 known, we’ve already found the 100 · (1 − 0.05)% confidence interval for
this mean weight as (999.35, 1001.70). Now, we can put the problem differently
and ask: “is it true that the weight of a salt bag is 1000 g?”
We may solve this problem by following the seven-step general procedure for
hypothesis testing outlined in the previous section. This results in the statement of
the null and alternative hypotheses:
H0 : m = 1000
H1 : m ≠ 1000   (2.5.2.1.5)
We set the significance level α = 0.05. The calculated mean from the Sample Nr.
1 is x̄ = 1000.53, thus the value of the test statistic U is:
u = (1000.53 − 1000) / (3/√25) = 0.88   (2.5.2.1.6)
From the tables of the standard normal distribution, the quantile u(1 − 0.05/2)
is:
u(1 − α/2) = u(1 − 0.05/2) = u(0.975) ≈ 1.96   (2.5.2.1.7)
The calculated value of the test statistic does not fall into the critical region:

0.88 ∉ (−∞, −1.96) ∪ (1.96, +∞)   (2.5.2.1.9)
Therefore, we fail to reject the null hypothesis which states that the mean weight
of the produced salt bags is 1000 g. Stated more completely, we conclude that there
is no strong evidence that the mean salt bag’s weight is different from 1000 g.
Below, we illustrate the above calculations using a computer with the R language.
**R**
We calculate the value of the test statistic:
> (u = (1000.53 - 1000)/(3/sqrt(25)))
[1] 0.8833333
We then read the value of the quantile u(0.975) of the standard normal distribution:
> qnorm(0.975)
[1] 1.959964
and construct the critical region as in the above example, checking whether the value of the test statistic falls into it.
***
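The final decision step can also be carried out directly in R; this is our own small addition to the listing above:
**R**
u = (1000.53 - 1000)/(3/sqrt(25))
crit = qnorm(0.975)
abs(u) > crit   # FALSE: u is not in the critical region, so we fail to reject H0
***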
When the standard deviation σ is unknown, the test is based on the statistic:

T = ((X̄ − m0)/S) √(n − 1)   (2.5.2.1.10)

where

S = √[ (1/n) Σ_{i=1}^{n} (Xi − X̄)² ]   (2.5.2.1.11)

Under H0, the statistic T has a t-distribution with n − 1 degrees of freedom, so the critical region of the test is:

K0 = (−∞, −t(1 − α/2, n − 1)) ∪ (t(1 − α/2, n − 1), +∞)   (2.5.2.1.12)
Example 2.5.2.1.2 Hypothesis test for the population mean salt bag’s weight (σ is
unknown)
Let us consider again the problem of assessing the mean weight of salt bags produced in a certain factory (Example 2.1.1, Sample 1), now assuming that the weight distribution is approximately normal with an unknown standard deviation σ. As in Example 2.5.2.1.1, we set the significance level α = 0.05.
We follow the seven-step general procedure for hypothesis testing. The hypothesis
statement is the same as in Example 2.5.2.1.1. We calculate the value of the test
statistic based on the Sample 1:
t = ((X̄ − m0)/S) √(n − 1) = ((1000.53 − 1000)/2.27) √(25 − 1) = 1.138   (2.5.2.1.13)
The quantile of the t-distribution is:
t(1 − α/2, n − 1) = t(1 − 0.05/2, 25 − 1) = 2.064   (2.5.2.1.14)
As the calculated value of the test statistic does not fall into the critical region:
1.138 ∉ (−∞, −2.064) ∪ (2.064, +∞)   (2.5.2.1.16)
we state that there is no strong evidence to reject the null hypothesis that the mean
weight of a salt bag is 1000 g.
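In R, the same test can be run with the built-in t.test() function; its statistic (x̄ − m0)/(s∗/√n), computed with the (n − 1)-denominator standard deviation, is algebraically the same as (2.5.2.1.10). The vector name sample1 is ours:
**R**
sample1 = c(1000.33, 1004.97, 998.98, 1000.85, 1001.42, 1001.68, 999.58, 1001.16,
            1001.79, 997.64, 1001.59, 1000.56, 1003.26, 996.25, 995.83, 999.56,
            1002.08, 998.89, 998.09, 1004.42, 1002.14, 998.01, 1002.79, 999.56, 1001.72)
t.test(sample1, mu = 1000)   # two-sided one-sample t-test; p-value > 0.05,
                             # so we fail to reject H0: m = 1000
***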
For large samples, even when σ is unknown, we can use the test statistic:

U = (X̄ − m0) / (S/√n)   (2.5.2.1.17)
Such procedure is valid regardless of the form of the distribution of the population.
This large-sample test relies on the central limit theorem.
Suppose that we wish to test the hypothesis that the variance σ 2 of a normal population
equals a specified value, say σ02 . To test
H0 : σ² = σ0²   (2.5.2.2.1)

against the alternative

H1 : σ² ≠ σ0²   (2.5.2.2.2)

we use the test statistic:

χ² = n S² / σ0²   (2.5.2.2.3)
which has χ 2 distribution with n − 1 degrees of freedom when H0 is true (it follows
from Theorem 2.3.5 about the sample variance).
Therefore, we calculate the critical region K0 of the test statistic:
K0 = (0, χ²(α/2, n − 1)) ∪ (χ²(1 − α/2, n − 1), +∞) (2.5.2.2.4)
Example 2.5.2.2.1 Let us consider the problem of the assessment of the variance of
weight of salt bags produced in a certain factory (Example 2.1.1, Sample 1). As in
Example 2.5.2.1.1, we set the significance level α = 0.05.
We may solve this problem by following the seven-step general procedure for
hypothesis testing. This results in the statement of the null and alternative hypotheses:
H0 : σ² = 3²
H1 : σ² ≠ 3² (2.5.2.2.5)
The value of the test statistic calculated from the sample is:
χ₁² = n·S²/σ₀² = 25 · 5.13/3² = 14.25 (2.5.2.2.6)
and the quantile of the chi-square distribution is:
χ²(1 − α/2, n − 1) = χ²(1 − 0.05/2, 25 − 1) = 39.36 (2.5.2.2.8)
Since the value of the test statistic does not fall into the critical region, we fail to reject, at the significance level α = 0.05, the null hypothesis that the variance of the weight of the produced salt bags equals 3².
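The same calculation can be sketched in R (with the sample variance S² = 5.13 used above); both chi-square quantiles delimiting the two-sided critical region are obtained with qchisq:
> chi2 = 25*5.13/3^2                # value of the test statistic
> chi2
[1] 14.25
> qchisq(0.025, 24)                 # lower quantile chi^2(alpha/2, n-1)
[1] 12.40115
> qchisq(0.975, 24)                 # upper quantile chi^2(1-alpha/2, n-1)
[1] 39.36408
> chi2 < qchisq(0.025, 24) | chi2 > qchisq(0.975, 24)   # falls into the critical region?
[1] FALSE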
The previous section presented statistical inference based on hypothesis tests for
a single population parameter (the mean m, the variance σ 2 ). This and the next
section extend those results to the case of two independent normal populations with
distributions: N (x; m 1 , σ1 ) and N (x; m 2 , σ2 ). Inferences will be based on the two
independent random samples of sizes n 1 and n 2 , respectively. We now consider the
hypothesis testing on the difference in the means
H0 : m₁ = m₂
H1 : m₁ ≠ m₂ (2.5.2.3.1)
Case 1 Assume first that the standard deviations σ₁, σ₂ are known. Then the statistic
U₂ = (X̄₁ − X̄₂ − (m₁ − m₂)) / √(σ₁²/n₁ + σ₂²/n₂) (2.5.2.3.2)
follows the standard normal distribution N(x; 0, 1) if the null hypothesis is true.
Assuming the truth of the null hypothesis ((m₁ − m₂) = 0), the test statistic:
U = (X̄₁ − X̄₂) / √(σ₁²/n₁ + σ₂²/n₂) (2.5.2.3.3)
also follows the standard normal distribution. Based on this fact, we can construct a two-tailed critical region for this test statistic at the specified level of significance α:
K0 = (−∞, −u(1 − α/2)) ∪ (u(1 − α/2), +∞) (2.5.2.3.4)
Case 2 We will now consider the case when the standard deviations σ1 , σ2 are
unknown, but equal σ1 = σ2 = σ . In most situations we do not know whether this
assumption is met, therefore, first we must verify the hypothesis about equality of
variances (presented in the next section).
Assuming the null hypothesis is true, the test statistic
T = (X̄₁ − X̄₂) / √( ((n₁S₁² + n₂S₂²)/(n₁ + n₂ − 2)) · (1/n₁ + 1/n₂) ) (2.5.2.3.5)
follows the t-distribution with n₁ + n₂ − 2 degrees of freedom, where
S₁² = (1/n₁) Σ_{i=1}^{n₁} (X_i − X̄₁)²,  S₂² = (1/n₂) Σ_{i=1}^{n₂} (X_i − X̄₂)² (2.5.2.3.6)
are the sample variances computed from the two samples.
The sample from the first factory is the already known Sample No. 1, while the sample from the second factory consists of the following 25 observations:
1001.08, 1004.89, 991.86, 999.48, 1002.54, 996.83, 997.41, 1003.30,
1005.05, 999.14, 989.04, 1003.13, 994.79, 1003.32, 996.33, 990.90,
992.84, 999.24, 1005.54, 1001.13, 996.56, 1004.00, 995.09, 1006.98,
1005.03
We wish to test the null hypothesis
H0 : m₁ = m₂ (2.5.2.3.8)
at the significance level α = 0.05, assuming the known standard deviations σ₁ = 3 and σ₂ = 5. The value of the test statistic is:
u = (X̄₁ − X̄₂) / √(σ₁²/n₁ + σ₂²/n₂) = (1000.53 − 999.42) / √(3²/25 + 5²/25) = 0.95 (2.5.2.3.9)
As the value of the test statistic does not fall into the critical region, we have no strong evidence to reject the null hypothesis about the equality of the mean weights of salt bags produced in the two factories.
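In R, this two-sample test can be sketched directly from the values used above (the standard deviations σ₁ = 3 and σ₂ = 5 are treated as known, as in the example):
> u = (1000.53-999.42)/sqrt(3^2/25 + 5^2/25)   # value of the test statistic
> round(u, 2)
[1] 0.95
> abs(u) > qnorm(0.975)                        # TRUE would mean rejection of H0
[1] FALSE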
Now, let us consider the problem of testing the equality of the two variances σ12 , σ22 of
the two normal populations: N (x; m 1 , σ1 ) and N (x; m 2 , σ2 ), respectively. Assume
that two independent samples of sizes n 1 and n 2 have been taken from the two
populations. We wish to test the following null hypothesis against the alternative:
H0 : σ₁² = σ₂²
H1 : σ₁² ≠ σ₂² (2.5.2.4.1)
The test is based on the ratio of the quantities
S*₁² = n₁S₁²/(n₁ − 1),  S*₂² = n₂S₂²/(n₂ − 1) (2.5.2.4.3)
which are unbiased estimators of the variances in the first and the second population, respectively. When H0 is true, this ratio has the F-distribution with n₁ − 1 and n₂ − 1 degrees of freedom. Based on this fact, the critical region of the test at the prespecified significance level α is constructed from the quantiles of this F-distribution.
Based on the samples, the calculated values of the unbiased sample variances are:
s*₁² = 5.34,  s*₂² = 25.74 (2.5.2.4.6)
The critical region, based on the quantile f(1 − 0.05, 24, 24) = 1.98 of the F-distribution, is (1.98, +∞). The value of the test statistic falls into the critical region, therefore, at the significance level α = 0.05, we reject the null hypothesis about the equality of variances of the salt bag’s weights in the two factories.
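R also provides the built-in function var.test, which performs an F test for the equality of variances of two normal populations. A minimal sketch, assuming the 25 weights from each factory are stored in vectors sample1 and sample2 (hypothetical names), would be:
> var.test(sample1, sample2)    # F test for equality of two variances
Note that var.test reports a two-sided p-value based on the ratio of the unbiased variance estimates, so its output is organized slightly differently from the hand calculation above; the conclusion is read from the reported p-value in the usual way.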
In the previous sections we have presented parametric tests, that is, hypothesis-testing procedures for problems in which the form of the probability distribution is known and the hypotheses involve the parameters of that distribution. Often, we do not know the underlying distribution of the population, and we wish to test the hypothesis that a particular distribution will be satisfactory as a population model. For example, we might wish to test the hypothesis that the population is normal. Such tests are examples of nonparametric tests and are called goodness-of-fit tests. We will describe three examples of such tests: the test based on the chi-square distribution, the Kolmogorov test and the Shapiro–Wilk test. Moreover, we present the test for independence of two variables, also based on the chi-square distribution.
Generally, nonparametric tests, as part of nonparametric statistics, are distribution-free methods which do not rely on the assumption that the data are drawn from a given parametric family of probability distributions. If the assumptions of a parametric method are difficult or impossible to justify, or if the data are reported on a nominal scale, then nonparametric tests should be used. For example, if the sample means of the two populations being compared cannot be assumed to follow a normal distribution, the popular t-test for comparing two population means, presented earlier, should not be used. The nonparametric Mann–Whitney U test, also known as the Wilcoxon rank-sum test, is a good alternative and will also be described in this section.
The seven-step general hypothesis-testing procedure described earlier can also be used in nonparametric settings.
The chi-square goodness-of-fit test is a nonparametric test that is used to find out whether the observed values of a given phenomenon differ significantly from the expected values. In this test, the term goodness of fit refers to the comparison of the observed sample distribution with the expected probability distribution. The chi-square goodness-of-fit test thus determines how well a theoretical distribution (such as the normal, binomial, Poisson, etc.) fits the empirical distribution.
The null hypothesis H0 in this test assumes that there is no significant difference
between the observed and the expected values of the hypothesized distribution of
random variable X . The alternative hypothesis H1 states that there is a significant
difference between them. The test procedure requires a random sample of size n from
the population whose probability distribution is unknown. These n observations are
arranged in a frequency table having k classes: points or intervals depending on the
discrete or continuous nature of the data, respectively (see Table 2.5). The value n_i in that table is called the observed frequency of data in the ith class. From the hypothesized probability distribution, we then compute the expected frequency in the ith class in the following way. First, we compute the theoretical, hypothesized probability associated with the ith class: in the case of a point frequency table as
p_i = P₀(X = x_i) (2.5.3.1.1)
and in the case of an interval frequency table, with class intervals (x_{i−1}, x_i], as
p_i = F₀(x_i) − F₀(x_{i−1}) (2.5.3.1.2)
where F₀(x) denotes the hypothesized cumulative distribution function of X. The expected frequencies are then computed by multiplying the sample size n by the probabilities p_i.
The test statistic is:
χ² = Σ_{i=1}^{k} (n_i − n·p_i)² / (n·p_i) (2.5.3.1.3)
Assuming the truth of the null hypothesis, the above test statistic has approximately the χ²(k − 1) distribution, or χ²(k − m − 1), if m unknown parameters of the hypothesized distribution are estimated by the maximum likelihood method. Based on this fact, we can construct, for a given level of significance α, the critical region:
K0 = (χ²(1 − α, k − m − 1), +∞) (2.5.3.1.4)
Example 2.5.3.1.1 We wish to verify the reliability of the purchased die. In our null
hypothesis, we hypothesize that the distribution is (discrete) uniform, that is:
p_i = P₀(X = i) = 1/6 for i = 1, …, 6 (2.5.3.1.5)
To verify this hypothesis with the chi-squared goodness-of-fit test, n = 120 rolls were made; the obtained results are presented in the first two columns of the point frequency Table 2.5.
The 3rd column in Table 2.5 contains the expected frequencies, while the last two contain the calculation of the value of the test statistic, which is χ² = 24.50. The appropriate quantile of the chi-square distribution is χ²(1 − 0.05, 5) = 11.070. As 24.50 > 11.07, the value of the test statistic falls into the critical region and we reject the null hypothesis at the significance level α = 0.05. There is strong evidence that the purchased die is not reliable.
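Because Table 2.5 is not reproduced here, the following R sketch uses hypothetical observed counts of the six faces (summing to n = 120) only to illustrate the mechanics of chisq.test; the actual frequencies from Table 2.5 should be substituted:
> observed = c(15, 28, 17, 20, 14, 26)     # hypothetical counts of the six faces
> chisq.test(observed, p = rep(1/6, 6))    # H0: the die is fair (discrete uniform)
> qchisq(0.95, 5)                          # critical value chi^2(1-0.05, 5)
[1] 11.0705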
The chi-square test for independence is used to determine whether there is a significant relationship between two nominal (categorical) variables. Suppose we wish to examine whether a company’s productivity depends on the level of absence of its workers. A sample of n companies is randomly selected, then the productivity and the level of absence are measured on r- and k-point scales, respectively. The data, i.e. the observed frequencies n_ij, can then be displayed in the contingency table (see Table 2.6), where each of the r rows represents a category of the first variable, and each of the k columns represents a category of the second variable. The elements of the last row and the last column contain the sums of the observed frequencies in the corresponding column or row, respectively:
Table 2.6 The contingency table of observed frequencies
        y1     y2     ...    yk     Total
x1      n11    n12    ...    n1k    n1.
x2      n21    n22    ...    n2k    n2.
...     ...    ...    ...    ...    ...
xr      nr1    nr2    ...    nrk    nr.
Total   n.1    n.2    ...    n.k    n
where the marginal sums are defined as
n_i. = Σ_{j=1}^{k} n_ij, i = 1, …, r;   n_.j = Σ_{i=1}^{r} n_ij, j = 1, …, k (2.5.3.2.1)
and the total sample size is
n = Σ_{i=1}^{r} n_i. = Σ_{j=1}^{k} n_.j = Σ_{i=1}^{r} Σ_{j=1}^{k} n_ij (2.5.3.2.2)
Denote by
p_ij = P(X = x_i, Y = y_j), i = 1, …, r; j = 1, …, k (2.5.3.2.3)
the joint probabilities of the two variables, and by
p_i. = Σ_{j=1}^{k} p_ij,  p_.j = Σ_{i=1}^{r} p_ij (2.5.3.2.4)
the corresponding marginal probabilities. The null hypothesis of independence states that
H0 : p_ij = p_i. · p_.j for all i, j (2.5.3.2.5)
The test statistic is:
χ² = Σ_{i=1}^{r} Σ_{j=1}^{k} (n_ij − n̂_ij)² / n̂_ij (2.5.3.2.9)
where n̂_ij = (n_i. · n_.j)/n are the expected frequencies estimated under the null hypothesis.
Assuming the null hypothesis is true, the above test statistic has asymptotic chi-
squared distribution with (r − 1)(k − 1) degrees of freedom.
For a given level of significance α, the critical region for this test is:
K0 = (χ²(1 − α, (r − 1)(k − 1)), +∞)
The strength of the dependence between the two variables can be measured by the V-Cramer coefficient
V = √( χ² / (n · min(r, k)) )
where χ² is the value of the test statistic. This coefficient takes values from the interval [0, 1]: the value 0 when the variables are independent, and the value 1 in the case of perfect functional dependence.
Let’s verify the hypothesis about the independence of the company productivity and the level of absence at the significance level α = 0.05.
We calculate the expected frequencies:
n̂₁₁ = 14 · 17/60 = 3.97,  n̂₁₂ = 17 · 23/60 = 6.52,  …,  n̂₃₃ = 21 · 23/60 = 8.05
Then, the value of the test statistic is χ² = 19.2.
Based on the quantile χ 2 (1−0.05, 4) = 9.49, we then construct the critical region
(9.49, +∞). Since the calculated value of the test statistic falls into the critical region,
we reject the null hypothesis on the independence of company’s productivity and the
level of absence at the significance level α = 0.05.
The strength of this relationship can be quantified by the value of the V-Cramer coefficient:
V = √( 19.2 / (60 · 3) ) ≈ 0.3265
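In R, the whole analysis can be sketched with chisq.test applied to the contingency table; since the observed frequencies of this example are not reproduced here, the matrix below is purely hypothetical, and the V coefficient is computed with the convention used in the example above:
> tab = matrix(c(5,8,4, 6,9,8, 3,6,11), nrow=3, byrow=TRUE)   # hypothetical 3 x 3 counts
> res = chisq.test(tab)            # chi-square test of independence
> res$statistic                    # value of the test statistic
> res$p.value
> sqrt(res$statistic/(sum(tab)*min(dim(tab))))   # V coefficient as in the example above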
The Kolmogorov goodness-of-fit test checks how well a theoretical distribution fits the empirical distribution. It is based on the test statistic
D_n = max(d_n⁺, d_n⁻)
where
d_n⁺ = max_{1≤i≤n} | i/n − F₀(x_(i)) | (2.5.3.3.3)
d_n⁻ = max_{1≤i≤n} | F₀(x_(i)) − (i − 1)/n | (2.5.3.3.4)
x_(1) ≤ x_(2) ≤ … ≤ x_(n) are the ordered sample values and F₀ is the hypothesized cumulative distribution function. The null hypothesis is rejected at the significance level α when D_n exceeds the critical value d_n(1 − α), which can be found in the tables of the exact Kolmogorov distribution for the appropriate values of n and α.
If the sample size n is large (n ≥ 100, e.g. a few hundred), then the limiting distribution of the D_n statistic can be used. This fact results from the following theorem.
P(√n · D_n ≤ d) → K(d) = Σ_{i=−∞}^{+∞} (−1)^i e^{−2i²d²}  (d > 0) (2.5.3.3.6)
From the above theorem, it follows that the limit K (d) is a known cumulative
distribution function. Thus, for a given significance level α, the d(1 − α) quantiles
of the limit Kolmogorov’s distribution can be found.
If the value √n · D_n calculated for a given sample is greater than d(1 − α), then at the level of significance α, we reject the verified hypothesis.
Example 2.5.3.3.1 We wish to verify at the significance level α = 0.05 that the
25-element sample of salt bag’s weights (Sample No. 1, Example 2.1.1) comes from
a normal population.
From the sample, we calculate the values of estimates of the two parameters in
normal distribution:
m̂ = (1/25) Σ_{i=1}^{25} x_i = 1000.53 (2.5.3.3.7)
σ̂ = √( (1/25) Σ_{i=1}^{25} (x_i − m̂)² ) = 2.27 (2.5.3.3.8)
Based on these estimates, the Kolmogorov test does not reject the null hypothesis that the population distribution of the salt bag’s weight is a normal distribution.
Normality is one of the most common assumptions when using different statistical procedures. There are many tests for checking normality, including the Kolmogorov test described in the previous section. All normality tests are sensitive to sample size. If the sample contains fewer than 50 observations, it is preferable to use the Shapiro–Wilk test described below. Graphical methods are also a good complement when evaluating normality, in particular Q–Q (quantile–quantile) plots, which plot the data quantiles against those of the standard normal distribution.
The Shapiro–Wilk test is used to verify the null hypothesis that the distribution of the population (random variable X) is normal. It is based on the test statistic
W = ( Σ_{i=1}^{n} a_i X_(i) )² / Σ_{i=1}^{n} (X_i − X̄)² (2.5.3.4.1)
where X_(1) ≤ X_(2) ≤ … ≤ X_(n) are the ordered values of the sample (X₁, …, X_n) and a_i are tabulated coefficients. Small values of W indicate non-normality. The null hypothesis is rejected at the significance level α if:
W ≤ W(α, n) (2.5.3.4.2)
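In R, this test is available as the built-in function shapiro.test, which reports both the W statistic and a p-value, so the tabulated critical values W(α, n) are not needed; for a sample stored in a vector (here called data) one would simply call:
> shapiro.test(data)      # Shapiro-Wilk normality test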
To carry out the Mann–Whitney U test, all n₁ + n₂ observations from the two samples are pooled and ranked in ascending order; in the case of tied observations, we replace the observations by the mean of the ranks that the observations would have if they were distinguishable. The sum of the ranks corresponding to the n₁ observations in the smaller sample is denoted by R₁. Similarly, the value R₂ represents the sum of the n₂ ranks corresponding to the larger sample.
The total R1 + R2 depends only on the number of observations in the two samples
(R1 + R2 = (n 1 + n 2 )(n 1 + n 2 + 1)/2). The null hypothesis will be rejected if R1
is small and R2 is large or vice versa. In actual practice, we usually base our decision
on the value of test statistic:
U = min(U1 , U2 ) (2.5.3.5.1)
U₁ = R₁ − n₁(n₁ + 1)/2,  U₂ = R₂ − n₂(n₂ + 1)/2 (2.5.3.5.2)
The null hypothesis is rejected at the significance level α, when the value of the U
statistic calculated from the sample is less than or equal to the tabled critical value.
Example 2.5.3.5.1 The following data show the number of defects in some products
manufactured by the two methods of which the second is the improved version of
the first:
X 1 : 30 32 20 23 44 31 28 33
X 2 : 41 46 29 32 20 23 48 36 42 43
We wish to know if there is a difference in the number of defects before and after the modification; in other words, we ask “did the modification introduce an effect?”
The observations are arranged in ascending order and ranks are assigned to them (3rd row):
value:  20  20  23  23  28  29  30  31  32  32  33  36  41  42  43  44  46  48
sample: 1   2   1   2   1   2   1   1   1   2   1   2   2   2   2   1   2   2
rank:   1.5 1.5 3.5 3.5 5   6   7   8   9.5 9.5 11  12  13  14  15  16  17  18
In the second row, the labels of the populations have been preserved for clarity. From the sums of ranks for the two samples we obtain the values U₁ and U₂ and hence the test statistic U.
Since u = 74.5 > 17 (the tabled critical value for n₁ = 8, n₂ = 10 and α = 0.05), we do not reject the null
hypothesis. There is no significant difference in the number of defects before and
after the modification of the technology (no effect).
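The same comparison can be sketched in R with the built-in function wilcox.test, using the two samples given above; note that R reports a statistic called W, whose definition differs slightly from the u value used here, and it warns that the p-value is approximate in the presence of ties. The decision is read from the p-value:
> x1 = c(30, 32, 20, 23, 44, 31, 28, 33)
> x2 = c(41, 46, 29, 32, 20, 23, 48, 36, 42, 43)
> wilcox.test(x1, x2)     # Mann-Whitney U (Wilcoxon rank-sum) test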
One way to present the results of a hypothesis test is to state that the null hypothesis H0 is, or is not, rejected at a specified significance level α. This statement is often inadequate because it gives the decision maker no idea about whether the computed value of the test statistic was just barely inside the critical region or far inside it. Moreover, reporting the results this way imposes the predefined level of significance α on other users of the information. This approach may be unsatisfactory because it might be difficult for some decision makers to assess the risks implied by α.
To avoid these difficulties the p-value approach has been proposed, and it is now commonly used in practice. The p-value is the probability that the test statistic will take on a value that is at least as extreme as the observed value of the statistic, when the null hypothesis H0 is true. More formally, the p-value is the smallest level of significance that would lead to rejection of the null hypothesis H0 with the given data.
It is important to notice that the p-value conveys much information about the weight of evidence against H0, and allows a decision maker to draw a conclusion at any specified level of significance. Having computed the p-value based on given data, we know that the null hypothesis H0 would be rejected at any level of significance α ≥ p-value. Thus, the p-value associated with a specific decision gives us the opportunity to display the risk levels.
On the other hand, the purpose is most often only to decide whether or not to reject the null hypothesis H0 at a given level of significance α. In such cases, the p-value approach to hypothesis testing leads to the commonly used decision rule:
• we reject the null hypothesis H0 when:
p-value ≤ α (2.5.4.1)
If the p-value approach is used, it is not necessary to state explicitly the critical region in the general procedure for hypothesis testing presented earlier.
Most modern computer programs for statistical analysis report p-values. Below,
we illustrate the approach to hypothesis testing based on p-values using R. A short
introduction to calculations in R is in Appendix B.
Example 2.5.4.1 Hypothesis test for the population mean salt bag’s weight based
on p-value
Let us consider the already investigated problem of the assessment of the mean
weight of salt bags produced in a certain factory (Example 2.1.1, Sample Nr. 1). Our
null hypothesis is H0 : m = 1000 g.
**R**
The values of salt bag’s weights forming a random sample are combined into a
vector and assigned variable data:
> data = c(1000.33, 1004.97, 998.98, 1000.85, 1001.42,
1001.68, 999.58, 1001.16, 1001.79, 997.64, 1001.59,
1000.56, 1003.26, 996.25, 995.83, 999.56, 1002.08, 998.89,
998.09, 1004.42, 1002.14, 998.01, 1002.79, 999.56,
1001.72)
On the created variable data, further calculations are carried out. We first check the assumption, i.e. verify whether the data come from a normal distribution:
> ks.test(data, "pnorm", mean(data), sd(data))
One-sample Kolmogorov-Smirnov test
data: data
D = 0.0906, p-value = 0.9865
alternative hypothesis: two-sided
Because the obtained p-value = 0.9865 > 0.05, there is no evidence to reject the
null hypothesis about the normality of weight distribution.
Then, we perform the test for the mean weight by calling the function t.test with
the appropriate parameters:
> t.test(data, mu=1000)
One Sample t-test
data: data
t = 1.1383, df = 24, p-value = 0.2662
Because the obtained p-value = 0.2662 > 0.05, there is no evidence to reject the null hypothesis that the mean weight of a salt bag is 1000 g.
Example 2.5.4.2 This illustrative example presents the testing procedure for the
comparison of two means on the simulated data (the case of unknown vari-
ances described previously). The data are generated from the two different normal
distributions.
**R**
> x=rnorm(50,mean=5,sd=1)
> y=rnorm(30,mean=7,sd=1)
We first verify the assumption of equal variances (for example with var.test(x, y)). Because the obtained p-value = 0.4121 > 0.05, there is no evidence for rejecting the null hypothesis of equal variances, so we can perform the test for the comparison of the two means.
> t.test(x,y)
Welch Two Sample t-test
data: x and y
t = -9.0594, df = 68.131, p-value = 2.595e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.497739 -1.596047
sample estimates:
mean of x mean of y
5.042800 7.089693
Due to the very low p-value, less than α = 0.05, we reject the null hypothesis
about the equality of means in the two considered populations.
As we saw in the previous sections, standard parametric statistical tests have numerous assumptions that should be respected prior to drawing conclusions from their results. Violation of these assumptions can lead to erroneous conclusions about the populations under study. Permutation tests provide yet another alternative to parametric tests.
A permutation test (also called a randomization test) is a type of statistical signif-
icance test in which the distribution of the test statistic under the null hypothesis is
obtained by calculating all possible values of the test statistic under rearrangements
of the labels on the observed data points. Like bootstrapping, a permutation test
builds sampling distribution (called the “permutation distribution”) by resampling
(without replacement) the observed data: it permutes the observed data by assigning
different outcome values to each observation from among the set of actually observed
outcomes. Permutation tests exist for any test statistic, regardless of whether or not
its distribution is known. However, an important assumption behind a permutation
test is that the observations are exchangeable under the null hypothesis. An impor-
tant consequence of this assumption is that tests of difference in location (like a
permutation t-test) require equal variances.
Permutation tests are particularly relevant in experimental studies, where we are
often interested in the null hypothesis of no difference between two groups. We
illustrate the basic idea of a permutation test through the following example.
Example 2.5.5.1 Consider the problem of comparing the mean weights of salt bags
produced by the two different factories, say f 1 and f 2. Suppose the random variables
X 1 , X 2 represent the two weights of salt bags produced by the two considered
factories, respectively. Let n 1 = 25 and n 2 = 25 be the sample sizes collected from
each population. The sample means of weights in the two samples are x̄₁ = 1000.53 and x̄₂ = 999.42 (Sample No. 1 and Sample No. 2 from the previous examples).
We shall determine whether the observed difference between the sample means is large enough to reject, at some significance level α, the null hypothesis
H0 : m 1 = m 2 (2.5.5.3)
that there is no difference in mean salt bag’s weights E(X 1 ) = m 1 and E(X 2 ) = m 2
of the two populations (i.e. that they come from the same distribution). A reasonable
test statistic is:
R = X1 − X2 (2.5.5.4)
The distribution of test statistics is not known, since no assumptions about the
distributions of weights has been done.
The test proceeds as follows. First, the sample observations of two populations
are pooled and the difference in sample means is calculated and recorded for every
possible way of dividing the pooled values into the two groups of size n 1 and n 2 (for
every permutation of the two group labels f 1 and f 2). With a different allocation of
sample observations, a different value for R statistics would be obtained. The set of
all calculated differences is the permutation distribution of possible differences (for
this sample) under the null hypothesis.
While a permutation test requires that we see all possible permutations of the data
(which can become quite large), we can easily conduct “approximate permutation
tests” by generating the reference distribution by Monte Carlo sampling, which takes
a small (relative to the total number of permutations) random sample of the possible
replicates, in our example B = 1000. These 1000 permutations give the histogram
shown in Fig. 2.12 which approximates the unknown distribution of R statistics.
Fig. 2.12 Histogram of the distribution of R statistics for B = 1000 random permutations
The p-value of the test is calculated as the proportion of sampled permutations in which the absolute difference in means was greater than or equal to the observed one:
p = #{ |r_i| ≥ r₀, i = 1, …, B } / B (2.5.5.5)
where r_i is the value of the test statistic in the ith permutation. In our example, we obtained p = 0.332, which gives the approximate probability of obtaining the value r₀ = 1.11 or a larger one when the null hypothesis is true.
Hence, since 0.332 > α = 0.05, there is no strong evidence for rejecting the null hypothesis, that is, no strong evidence of a difference between the two mean salt bag weights.
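A Monte Carlo version of this permutation test is easy to sketch in R. Assuming the two samples of weights are stored in vectors x1 and x2 (hypothetical names), B = 1000 random relabelings can be generated as follows:
> r0 = abs(mean(x1) - mean(x2))       # observed difference in sample means
> pooled = c(x1, x2)                  # pool the two samples
> n1 = length(x1)
> B = 1000
> r = replicate(B, {perm = sample(pooled); abs(mean(perm[1:n1]) - mean(perm[-(1:n1)]))})
> hist(r)                             # approximate permutation distribution (cf. Fig. 2.12)
> mean(r >= r0)                       # Monte Carlo p-value as in (2.5.5.5)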
Chapter 3
Linear Regression and Correlation
The simple linear regression model assumes that the mean of the dependent (response) variable Y is a linear function of a single independent variable x:
Y = a₁x + a₀ (3.1.1)
and that each observation in a sample satisfies
Y_i = a₁x_i + a₀ + ε_i, i = 1, …, n (3.1.2)
where ε_i are random variables called random errors and n is the number of paired observations used to estimate the model. In addition, the random errors are assumed to meet the following conditions:
E(ε_i) = 0 (3.1.3)
D²(ε_i) = σ² (3.1.4)
cov(ε_i, ε_j) = 0 for i ≠ j (3.1.5)
The symbols a₀, a₁, σ are the unknown parameters of the model. The first two are the intercept and the slope of the line; the third, when squared, is called the error variance. The simple linear regression model described in the above formula consists of two parts:
• A systematic component a₁x_i + a₀ expressing the effect of x on the variable Y,
• A random component ε_i expressing the combined effect of all other variables or factors, except the variable x, affecting the variable Y.
From the assumptions above, several things become visible. The first condition
implies that at a specific x, the y values are distributed around the true or population
regression line E(Y |X = x) = a1 x + a0 , i.e. the true regression line goes through
the means of the responses, and actual observations are on the distribution around the
means. The second condition is often called a homogeneous variance assumption,
and the last means there is no correlation between the individual random errors. For
the purpose of inferential procedures, we shall need to add one assumption:
εi ∼ N (x; 0, σ 2 ) (3.1.6)
In the simple linear regression model, the values of the independent variable x are
treated as pre-determined or non-random values. In the example of canola cultivation considered later in this chapter (Example 3.1.3.1), we perform the procedure of seeding the n experimental plots with
the pre-determined quantities x1 , . . . , xn of nitrogen fertilizer. After some time, we
measure the crop yields responses y1 , . . . , yn from the seeded experimental plots. The
experimental (regression) data comprise the n-element sample of paired observations
{(x1 , y1 ), . . . , (xn , yn )} that can be plotted in a scatter diagram (see for example
Fig. 3.2) indicating visually, if our assumption of linearity between regressor x and
the response Y appears to be reasonable.
It should be noted that the model described above is conceptual in nature: as we
never observe the actual values of random errors in practice, we can never draw the
true regression line E(Y |X = x) = a0 + a1 x, only the estimated line can be drawn.
An important aspect of regression analysis is to estimate the parameters of the
model, based on n-element sample which will be presented in the next section.
The method of least squares is used to estimate the parameters of simple linear
regression model. Before we present this method, we should introduce the concept
of a residual.
Given a set of regression data {(x_i, y_i) : i = 1, …, n} and an already fitted (estimated) regression model
ŷ_i = â₁x_i + â₀ (3.1.1.1)
(see Fig. 3.1), the ith residual is the difference between the ith observed response value and the ith response value predicted from the linear regression model:
Fig. 3.1 The variations for an observation in the linear regression model: the total variation is split into the variation explained by regression and the variation not explained by regression
ei = yi − ŷi i = 1, . . . , n (3.1.1.2)
Obviously, small residuals are a sign of a good fit. It should be noted that the
εi are not observed but ei are observed and play an important role in the regression
analysis. The sum of squares of regression residuals for all observations in the sample
is denoted by SSE (Sum of Square Errors, sometimes Residual Sum of Squares):
SSE = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − ŷ_i)² (3.1.1.3)
The method of least squares finds the estimators of the parameters a₀ and a₁ so that the sum of squares of the residuals SSE is a minimum. Hence, we shall find a₀ and a₁ so as to minimize
SSE = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − a₁x_i − a₀)² (3.1.1.4)
Differentiating SSE with respect to a₀ and a₁, and setting the partial derivatives to zero, we have:
∂SSE/∂a₀ = 2 Σ_{i=1}^{n} (y_i − a₁x_i − a₀)(−1) = 0 (3.1.1.5)
∂SSE/∂a₁ = 2 Σ_{i=1}^{n} (y_i − a₁x_i − a₀)(−x_i) = 0 (3.1.1.6)
which leads to the normal equations:
a₁ Σ_{i=1}^{n} x_i + a₀·n = Σ_{i=1}^{n} y_i (3.1.1.7)
a₁ Σ_{i=1}^{n} x_i² + a₀ Σ_{i=1}^{n} x_i = Σ_{i=1}^{n} x_i y_i (3.1.1.8)
The solution of this linear system gives the following estimators â₁, â₀ of the parameters a₁, a₀:
â₁ = (n Σx_i y_i − Σx_i Σy_i) / (n Σx_i² − (Σx_i)²) = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² (3.1.1.9)
â₀ = ȳ − â₁x̄ (3.1.1.10)
where:
x̄ = (1/n) Σ_{i=1}^{n} x_i,  ȳ = (1/n) Σ_{i=1}^{n} y_i (3.1.1.11)
Introducing the notation SS_x = Σ_{i=1}^{n} (x_i − x̄)², SS_y = Σ_{i=1}^{n} (y_i − ȳ)² and SS_xy = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ), we can rewrite the derived formulas for the model parameters in the following form:
slope of the regression line:
â₁ = SS_xy / SS_x (3.1.1.15)
intercept:
â₀ = ȳ − â₁x̄ (3.1.1.16)
It can be shown that these least squares estimates, â₀, â₁, are unbiased.
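These formulas translate directly into R. A minimal sketch, assuming the paired observations are stored in vectors x and y (hypothetical names), computes the two estimates and compares them with the built-in lm function:
> SSx  = sum((x - mean(x))^2)
> SSxy = sum((x - mean(x))*(y - mean(y)))
> a1 = SSxy/SSx               # slope (3.1.1.15)
> a0 = mean(y) - a1*mean(x)   # intercept (3.1.1.16)
> lm(y ~ x)                   # the same estimates obtained by least squares in lm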
The third, yet not estimated parameter of the simple linear regression model is
the error or model variance σ 2 which measures squared deviations between Y values
and their mean (i.e. true regression line E(Y |X = x) = a1 x + a0 ). The unbiased
estimator of the parameter σ 2 is a residual variance (sometimes called a Mean Square
Error, MSE), defined as follows:
σ̂² = MSE = SSE/(n − 2) (3.1.1.17)
Before drawing inferences from the fitted model, checking whether the assumptions stated at the beginning of the chapter are met is necessary (here, the plot function in the R environment can be helpful).
The standard errors of the estimated coefficients, s(â₀) and s(â₁), are expressed in terms of the so-called residual standard error:
RSE = √MSE (3.1.1.20)
The standard errors can then be used to construct confidence intervals at the confidence level 1 − α for the regression parameters. An important special case is testing the hypothesis that the slope is zero:
H0 : a₁ = 0 (3.1.1.23)
versus
H1 : a₁ ≠ 0 (3.1.1.24)
The following test statistic, which follows the t-distribution with (n − 2) degrees of freedom, is used in this test:
t = â₁ / s(â₁) (3.1.1.25)
When the null hypothesis is not rejected, the conclusion is that there is no significant linear relationship between Y and the independent variable x. Rejection of H0 indicates that the linear term in x residing in the model explains a significant portion of the variability in Y.
Assessing the quality of the estimated regression line is often performed by the analysis-of-variance (ANOVA) approach. This relies on the partitioning of the total sum of squares of y (i.e. the total variation of y), designated here as SST, into two components (see Fig. 3.1):
Σ_{i=1}^{n} (y_i − ȳ)² = Σ_{i=1}^{n} (ŷ_i − ȳ)² + Σ_{i=1}^{n} (y_i − ŷ_i)², i.e. SST = SSR + SSE (3.1.2.1)
where:
SST = Σ_{i=1}^{n} (y_i − ȳ)² (3.1.2.2)
SSR = Σ_{i=1}^{n} (ŷ_i − ȳ)² (3.1.2.3)
SSE = Σ_{i=1}^{n} (y_i − ŷ_i)² (3.1.2.4)
The first component, SSR, is called the regression sum of squares and it reflects the amount of variation in the y values explained by the straight line. The second component is the already introduced Sum of Square Errors SSE, which reflects the variation about the regression line (i.e. the part of the total variation not explained by the regression).
The analysis of variance is conveniently presented in the form of the following
Table 3.1.
Suppose now that we are interested in testing the following null hypothesis:
H0 : a₁ = 0 (3.1.2.5)
H1 : a₁ ≠ 0 (3.1.2.6)
The test statistic
F = MSR/MSE (3.1.2.7)
is calculated based on the analysis of variance table (the last column of this table). The critical region for this test statistic is:
K0 = (F(1 − α, 1, n − 2), +∞) (3.1.2.8)
When the null hypothesis is rejected, that is, when the computed F statistic exceeds the critical value F(1 − α, 1, n − 2), we conclude that there is a significant amount of variation in the response resulting from the postulated straight-line relationship: the linear relationship between the variables Y and x is statistically significant at the significance level α. If the null hypothesis is not rejected, we conclude that the data do not provide sufficient evidence to support the postulated model. In the case of simple regression, the above test is equivalent to the previously presented t-test.
The already introduced RSE is an absolute measure of the lack of fit of the linear model to the data. An alternative measure is the R² statistic, also called the coefficient of determination, which is a measure of the goodness of fit of a model to the observed data:
R² = SSR/SST = 1 − SSE/SST (3.1.2.9)
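In R, the analysis-of-variance table, the F statistic and R² for a fitted simple regression are obtained directly from the fitted lm object; a sketch (again with hypothetical vectors x and y):
> model = lm(y ~ x)
> anova(model)                 # sums of squares, mean squares and the F statistic
> summary(model)$r.squared     # coefficient of determination R^2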
3.1.3 Prediction
One reason for building a linear regression model is to predict response values at one or more values x₀ of the independent variable x. The equation ŷ = â₀ + â₁x may be used to predict the mean response E(Y|x₀) (i.e. the conditional expectation E(Y|X = x₀)) at a prechosen value x = x₀, or it may be used to predict a single value y₀ of the variable Y₀ when x = x₀. As one would expect, the prediction error is higher in the case of a single predicted value y₀ than in the case where the mean E(Y|x₀) is predicted. This, of course, will affect the width of the intervals for the predicted values.
In this section, the focus is on errors associated with the prediction.
A 100 · (1 − α)% confidence interval for the mean response E(Y|x₀) is:
ŷ₀ ± t(1 − α/2, n − 2) · RSE · √( 1/n + (x₀ − x̄)²/SS_x ) (3.1.3.1)
and the corresponding 100 · (1 − α)% prediction interval for a single future value y₀ at x = x₀ is:
ŷ₀ ± t(1 − α/2, n − 2) · RSE · √( 1 + 1/n + (x₀ − x̄)²/SS_x ) (3.1.3.2)
There is a distinction between the concept of a confidence interval and the predic-
tion interval described above. The confidence interval interpretation is identical to
that described for all confidence intervals on population parameters discussed in the
previous chapter. This follows from the fact that E(Y |x0 ) is a population parameter.
The prediction interval, however, represents an interval that has a probability equal
to (1 − α) of containing not a parameter, but a future value y0 of the random variable
Y.
Prediction intervals are always wider than confidence ones because they incor-
porate both the error in the estimate of the model function (reducible error) and the
uncertainty as to how much an individual point will differ from the population model,
here, the regression line (irreducible error).
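In R, both intervals are produced by the predict function applied to a fitted lm model; a sketch for a chosen value x0 (hypothetical names; the column name in the new data frame must match the name of the regressor used in the model formula):
> new = data.frame(x = x0)
> predict(model, new, interval = "confidence", level = 0.95)   # interval for E(Y|x0)
> predict(model, new, interval = "prediction", level = 0.95)   # interval for a single y0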
Example 3.1.3.1 Suppose we wish to examine the relationship between the two
variables: the canola crop yield (dependent variable Y ) and the quantity of nitrogen
fertilizer used (independent variable x). The n = 12 experimental plots with different
levels of nitrogen fertilizer were seeded. The obtained experimental data are given
in Table 3.2.
We assume linear relationship between the canola crop yield and the quantity
of nitrogen fertilizer, which follows from the scatter diagram (Fig. 3.2). Using the formulas for the parameter estimators introduced in Sect. 3.1.1, we obtain the following estimates of the model parameters (at the significance level α = 0.05):
ŷ = 3.143x + 5.152,  MSE = 9.398,  RSE = √MSE = 3.066
Fig. 3.2 Scatter diagram for the canola crop yield data with the regression line superimposed
The calculated values of the variations (i.e. sums of squares), degrees of freedom, and mean squares are included in the analysis of variance table (Table 3.3). The value of the test statistic in the F test is:
F = SSR/MSE = 150.345 > F(1 − 0.05, 1, 10) = 4.96
And, as we see, it falls into the critical region. Hence, at the significance level
α = 0.05, we reject the null hypothesis that the slope of the regression line is equal
to zero. Thus, the examined linear relationship between the canola crop yield and
fertilizer’s quantities is statistically significant.
The coefficient of determination is:
R² = SSR/SST = 0.938
Almost 94% of the variation in the canola crop yield can be predicted from the fertilizer quantity. It follows that our linear regression model fits the experimental data well.
Calculating the confidence interval for each experimental point xi , we obtain two
series of points resulting in confidence limits for the mean E(Y |X = x), which is
shown in Fig. 3.3.
In most real-life problems, the complexity of their mechanisms is such that in order to be able to predict an important response, more than one independent variable is needed. If regression analysis is to be applied, this means that a multiple regression model is needed. When this model is linear in the coefficients, it is called a multiple linear regression model. For the case of k independent variables (regressors) x₁, …, x_k, the mean of Y given x₁, …, x_k is described by the multiple linear regression model
E(Y | x₁, …, x_k) = a₀ + a₁x₁ + … + a_k x_k
and the estimated response is obtained from the sample regression equation
ŷ = â₀ + â₁x₁ + … + â_k x_k
Fig. 3.3 Confidence limits (dashed line) for E(Y|x) in canola crop yield data
where y_i is the observed response to the values x_{1i}, x_{2i}, …, x_{ki} of the k independent variables x₁, x₂, …, x_k. Similar to simple regression, it is assumed that each observation in a sample satisfies the following condition:
y_i = a₀ + a₁x_{1i} + … + a_k x_{ki} + ε_i (3.2.4)
where εi are the random errors, independent and identically distributed with mean
zero and common variance σ 2 .
Using the method of least squares to obtain the estimates â_i, we minimize the expression:
SSE = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − ŷ_i)² (3.2.5)
In matrix notation the model can be written as Y = Xa + ε, where X is the n × (k + 1) matrix of the values of the regressors (with a leading column of ones), Y is the column vector of the response variables and a is the vector of unknown model coefficients.
Obtaining an unambiguous solution requires that the matrix XᵀX be non-singular, hence the requirement that the rank r(X) of the matrix X satisfy the following condition: r(X) = k + 1 ≤ n. Then, it can be shown that the least squares estimate of
vector a of coefficients is given by:
â = (X T X )−1 X T Y (3.2.7)
The least squares estimator â possesses a minimum variance property among all
unbiased linear estimators (this follows from the Gauss-Markov theorem).
Making the additional assumption that ε ∼ N (0, σ 2 I ), by using the maximum
likelihood method, we can obtain the following estimator of unknown variance of
errors:
σ̂² = (1/n)·(y − Xâ)ᵀ(y − Xâ) (3.2.8)
The maximum likelihood estimate of vector a is the same.
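Formula (3.2.7) can be sketched directly in R; here X is the design matrix with a leading column of ones and y the response vector (hypothetical names). In practice one would rather call lm, which performs an equivalent but numerically more stable computation:
> X = cbind(1, x1, x2)                      # design matrix for k = 2 regressors
> a.hat = solve(t(X) %*% X, t(X) %*% y)     # least squares estimate (X'X)^(-1) X'y
> lm(y ~ x1 + x2)                           # equivalent fit with lm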
Similarly to simple linear regression, in the multiple linear regression model the total variability of Y can be decomposed into two components: the variation explained by the regression (SSR) and the variation not explained (SSE). Based on this, the accuracy of the fit can be measured using the coefficient of determination:
R² = SSR/SST = 1 − SSE/SST (3.2.9)
The quantity R 2 gives the proportion of the total variation in yi s “explained” by,
or attributed to, the predictor variables x1 , . . . , xk .
The null hypothesis about the significance of the linear relationship between
independent variables and the dependent one has the form:
H0 : a1 = . . . = ak = 0 (3.2.10)
â = (XᵀX)⁻¹Xᵀy
Using the calculated values, we obtain the equation describing the relationship between the amounts of the alloy components and its hardness.
The analysis of variance table presents the two sources of variation in alloy hardness: that explained by the regression (SSR) and SSE (Table 3.5).
The quantile of the F distribution at the significance level α = 0.05 is F(1 − 0.05, 2, 5) = 5.786, which gives the critical region:
Table 3.5 The analysis of variance table for the alloy hardness data
Source of variation   Sum of squares   Degrees of freedom   Mean squares       F statistic
SSR                   2507.155         2                    MSR = 1253.577     F = 28.527
SSE                   219.720          5                    MSE = 43.944
SST                   2726.875         7                    MST = 389.554
(5.786, +∞)
The value of the F statistic falls into the critical region, thus the null hypothesis is rejected at the significance level α = 0.05. The linear relationship between the alloy hardness and the amounts of its two components is, thus, statistically significant. The coefficient of determination is:
R² = SSR/SST = 0.92
The multiple linear regression on the two predictors explains 92% of the variability in alloy hardness. It follows that our linear model fits the experimental data well.
3.3 Correlation
So far in this chapter we have assumed that the independent regressor variable x is
a physical variable but not a random variable. In many applications of regression
techniques it is more realistic to assume that both X and Y are random variables, and
the measurements {(xi , yi ) : i = 1, 2, . . . , n} are observations from a population
having the joint density function f (x, y). We shall consider the problem of measuring
the relationship between the two variables X and Y. For example, if X represents a worker’s salary and Y the worker’s efficiency, then we may suspect that larger values of X are associated with larger values of Y and vice versa. This is an example of positive correlation, which is a relationship between two variables in which both variables move in tandem, that is, in the same direction. If X represents one’s time spent working and Y one’s free time, then the more time one works, the less free time one has. This is a negative correlation, which is a relationship between two variables whereby they move in opposite directions. The already introduced scatter plots are a visualization of correlation. Correlation is said to be linear if the ratio of change of the two variables is constant (for example, when a worker’s salary doubles whenever his efficiency doubles). When the ratio of change is not constant, we speak of nonlinear correlation.
In statistics, correlation is any statistical association or dependence between two random variables (whether causal or not), though it commonly refers to
the degree to which a pair of variables are linearly related. Formally, the degree of the linear relationship between two random variables X and Y is measured by the (Pearson) correlation coefficient
ρ = cov(X, Y) / (D(X) · D(Y)) (3.3.1)
where D(X), D(Y) are the standard deviations of X and Y, respectively, while cov(X, Y) is the covariance of X and Y, already defined.
Pearson correlation coefficient, when applied to a sample, may be referred to as
the sample Pearson correlation coefficient (often denoted by r). We can obtain a
formula for it by substituting estimates of the covariances and variances into the
formula given above.
Given a sample, i.e. the paired data {(x_i, y_i) : i = 1, 2, …, n}, the sample Pearson correlation coefficient is defined as:
r = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^{n} (x_i − x̄)² · Σ_{i=1}^{n} (y_i − ȳ)² ) = SS_xy / √(SS_x · SS_y) (3.3.2)
To verify whether the correlation in the population is statistically significant, we test the null hypothesis:
H0 : ρ = 0 (3.3.3)
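In R, the sample correlation coefficient and the test of H0: ρ = 0 are available directly; a sketch with hypothetical paired vectors x and y:
> cor(x, y)        # sample Pearson correlation coefficient r
> cor.test(x, y)   # test of H0: rho = 0, reported with its p-value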
Appendix A
Permutations, Variations, Combinations
The number of permutations of n distinct objects is:
Pn = n! (A.1)
When some of those objects are identical copies, this is a permutation with repe-
tition. The number of such permutations of n objects of which n 1 are of the 1st kind,
n 2 are of the 2nd kind, …, n k are of the k-th kind is:
P_n^{n₁,n₂,…,n_k} = n! / (n₁! · n₂! · … · n_k!) (A.2)
If repetitions are allowed, the counting procedure is as follows. The 1st object can
be selected in n ways and is then returned, then the same is true for the 2nd object and
so on until the k-th object. This gives the following number of k-element variations
of n objects (with repetitions):
V̄_n^k = n^k (A.6)
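These counting formulas can be evaluated in R with the functions factorial and choose, for example:
> factorial(5)                               # P5 = 5!, permutations of 5 distinct objects
[1] 120
> factorial(5)/(factorial(2)*factorial(3))   # permutations with repetition: 2 and 3 alike
[1] 10
> 4^3                                        # 3-element variations of 4 objects with repetition
[1] 64
> choose(5, 2)                               # number of 2-element combinations of 5 objects
[1] 10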
Appendix B
An Introduction to R
About R
R is a scripting language for statistical data manipulation and analysis. It was inspired
by, and is mostly compatible with, the statistical language S developed by AT&T
(S obviously standing for statistics). R is sometimes called “GNU S,” to reflect its
open source nature. R is available for Windows, macOS and Linux. R also provides an environment in which you can produce graphics. As a programming language, R has a large library of pre-defined functions that can be used to perform various tasks.
A major focus of these pre-defined functions is statistical data analysis, and these
allow R to be used purely as a toolbox for standard statistical techniques. However,
some knowledge of R programming is essential to use it well at any level. Therefore,
in this Appendix, we learn about common data structures and programming features
in R. For more resources, see the R Project homepage:
http://www.r-project.org,
which links to various manuals and other user-contributed documentation.
One typically submits commands to R via text in a terminal window (a console),
rather than mouse clicks in a Graphical User Interface. If you can’t live without GUIs, you can use one of the free GUIs that have been developed for R, e.g. RStudio.
Installation, libraries
To install R on your computer, go to the home website of R:
http://cran.r-project.org
and do the following (assuming you work on a windows computer): (1) click down-
load CRAN in the left bar, (2) choose a download site, then choose Windows as
target operation system and click base, (3) choose Download R x.x.x for Windows
(x.x.x stands for an actual version nr) and choose default answers for all questions.
R can do many statistical and data analyses. They are organized in the so-called
packages or libraries. With the standard installation, most common packages are
installed. To get a list of all installed packages, go to the packages window or type
library() in a console window. There are many more packages available on the R website. If you want to use a package that is not yet installed, for example the package called “utils”, you should first install it:
> install.packages("utils")
and then load it with:
> library(utils)
R can be used as a calculator. You can just type your equation in the command
window, just after the “>":
> 1+1
You can also give numbers a name. By doing so, they become so-called variables
which can be used later. For example, you can type in the command window:
>a=2
The variable a appears now in the workspace, which means that R now remembers
what a is 2. You can also ask R what a is:
>a
[1] 2
A vector can be created with the function c (combine), for example:
> b = c(1,2,3)
Matrices are nothing more than 2-dimensional vectors. To define a matrix, use
the function matrix:
> m = matrix(data = c(1,2,3,4,5,6), ncol=3)
>m
[, 1] [, 2] [, 3]
[1,] 1 2 3
[2,] 4 5 6
The argument data specifies which numbers should be in the matrix. Use either
ncol to specify the number of columns or nrow to specify the number of rows.
A data frame is a matrix with names above the columns.
> f = data.frame(x=c(1,2,3), y=c(4,5,6))
You can for example select the column called y from the data frame called f for
computing its mean using pre-defined function mean:
>mean(f$y)
The operator “$” allows you to extract elements by name from a named list.
There are many ways to write data from within the R environment to files, and to
read data from files. The following commands first write the f data frame to a text
file, called f1.txt and then read a file into a data frame g:
> write.table(f, file="f1.txt")
> g = read.table(file="f1.txt")
You can often automate your calculations using functions. Some functions are standard in R or in one of the packages. You can also program your own functions. Within the brackets you specify the arguments. As an example, the function rnorm is a standard R function which creates random samples from a normal distribution.
If you want 5 random numbers out of normal distribution with mean 1 and standard
deviation 0.5 you can type:
> rnorm(5, mean=1, sd=0.5)
The R package also has built-in functions to calculate the values of density, and
quantiles of the most commonly used distributions. Preceding the name of the distri-
bution with the letter d, we obtain the value of density function of this distribution.
For example, the following call:
> dnorm(0)
[1] 0.3989423
Plots can be produced with the plot function, for example:
> plot(rnorm(200), type="l", col="gold")
Two hundred random numbers are plotted by connecting the points by lines (type="l") in gold. Another example is the classical histogram plot, generated by the simple command:
> hist(rnorm(200))
In a for-loop you specify what has to be done and how many times. To tell “how
many times”, you specify a so-called counter, as in the following example:
> a=seq(from=1,to=4)
> b= c()
> for(i in 1:4) {b[i]=a[i]*2}
>b
[1] 2 4 6 8
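Note that an explicit loop is not needed for element-wise operations like this one, because arithmetic in R is vectorized; the following one-liner gives the same result:
> b = a*2
> b
[1] 2 4 6 8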
First, we need to verify if the sample comes from a normal population. The sample is small, therefore we use the Shapiro–Wilk test. We install the appropriate package first and then load it:
> install.packages("nortest")
> library(nortest)
Then we run the test on the sample stored in the vector dane:
> shapiro.test(dane)
Shapiro-Wilk normality test
data: dane
W = 0.9853, p-value = 0.9871
The very large p-value indicates that there is no evidence to reject null the hypoth-
esis about normality. Then, we perform a t-test to verify a null hypothesis about the
mean population weight:
> t.test(dane, mu=250)
One Sample t-test
data: dane
t = 4.4764, df = 9, p-value = 0.00154
alternative hypothesis: true mean is not equal to 250
95 percent confidence interval:
254.2045 262.7955
sample estimates:
mean of x
258.5
The low p-value (less than the test significance level α = 0.05) indicates that the
null hypothesis should be rejected. We also obtain a 95% confidence interval for the
mean weight:
[254.2045 262.7955]