Lecture Notes
Lecture Notes
Lecture notes
written by Tomasz Kosmala and Tatiana Tyukina
2
Contents
2 Descriptive Statistics 18
2.1 Graphical representation of Data . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1 Distribution Shapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Descriptive Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Measures of central tendency . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Measures of spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Estimators 24
3.1 Point Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.1 Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.2 Method of Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . 27
3.1.3 Expectation and variance of the sample mean . . . . . . . . . . . . . . . 31
3.2 Properties of Point Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.3 Sufficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.4 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.1 Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.2 χ2 distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3
3.3.3 Student’s t-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.4 F -distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Interval Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.2 Pivots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.3 Confidence intervals based on the normal distribution . . . . . . . . . . 53
3.4.4 Confidence intervals based on the t-distribution . . . . . . . . . . . . . . 54
3.4.5 Confidence intervals for variance . . . . . . . . . . . . . . . . . . . . . . 55
3.4.6 Confidence intervals for binomial random variables . . . . . . . . . . . . 56
3.4.7 Sample Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4.8 Interval Estimation for two population means . . . . . . . . . . . . . . . 58
4 Hypothesis testing 63
4.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.1 Hypotheses, test statistic, rejection region . . . . . . . . . . . . . . . . . 63
4.1.2 Z-test for sample mean . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.1.3 Errors when testing hypotheses . . . . . . . . . . . . . . . . . . . . . . . 66
4.1.4 p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 List of most common statistical test . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.1 t-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.2 Testing for proportions (small sample) . . . . . . . . . . . . . . . . . . . 69
4.2.3 Testing for variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.4 Tests for paired samples . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5 Goodness of Fit 72
5.1 For continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Probability Distributions in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3 Plots of the denities and probability mass functions . . . . . . . . . . . . . . . . 77
Bibliography 80
4
Chapter 1
1.1 Introduction
What is statistics? The Cambridge Dictionary defines:
Statistics is the science of collecting and studying numbers that give information about
particular situations or events.
Statistics is a form of mathematical analysis that uses quantified models, representations and
synopses for a given set of experimental data or real-life studies. Statistics studies
methodologies to gather, review, analyze and draw conclusions from data.
We all have some intuitive understanding of what statistics is. We can identify two distinct
branches of statistics:
• Inferential statistics consists of methods for drawing and measuring the reliability of
conclusions about population based on information obtained from a sample of the
populations.
5
find the tallest person, etc. On the other hand, the role of the inferential statistics is to char-
acterize a large population based on some selection of elements. For example, one could select
100 people in Leicester, measure their height, calculate the average and then argue that this an
estimate of the average height of people living in Leiceser. The role of statistician is often to
develop methods for making the step from the sample to the whole population mathematically
precise, that is to measure its reliability and precision.
Statistics is closely connected to probability. Probabilist use knowledge about the distribu-
tion of a random variable to make a conclusion about the probability of obtaining a particular
sample. Continuing the earlier example, a probabilist that knows the distribution of people’s
height is able to calculate how many people in Leicester, on average, are taller than 1.9m.
On the other hand, the goal of the statistician is often to investigate the properties of the
distribution that can be inferred from a sample. Statisticians analyse data for the purpose
of making generalizations and decisions. In summary, probability can be considered a more
abstract approach, whereas statistics is more practical and data-driven.
Statisticians
• assist in organizing, describing, summarizing, and displaying experimental data;
6
Definition 1.1. Let S be a sample space of an experiment. Probability P (·) is a real-valued
function that assigns to each event A in the sample space S a number P (A) ∈ [0, 1], called
the probability of A, with the following conditions satisfied:
(i) It is zero for the impossible event: P (∅) = 0 and unity for the certain event: P (S) = 1.
(ii) It is additive over the union of an infinite number of pairwise disjoint events, that is, if
A1 , A2 , . . . form a sequence of pairwise mutually exclusive events (i.e. Ai ∩ Aj = ∅, for
i 6= j) in S, then
∞ ∞
!
[ X
P Ai = P (Ai ).
i=1 i=1
A discrete random variable takes a finite or countably infinite number of possible values
with specific probabilities associated with each value.
The probability mass function (abbreviated as pmf) of a discrete random variable X is the
function
p(xi ) = P (X = xi ), i = 1, 2, 3, . . .
For a discrete random variable we can make a list x1 , x2 , . . . of values attained with positive
probability. The list may be finite or infinite (in such case we say it is countable).
The cumulative distribution function (abbreviated as cdf) F of the random variable X is
defined by X
F (x) = P (X ≤ x) = p(xi ), for − ∞ < x < ∞.
all y≤x
A cumulative distribution function is also called a distribution function. We may use the
notation pX and FX to stress that it is a probability mass function and cumulative distribution
function of a random variable X. For a given function p to be a pmf, it needs to satisfy the
P∞
following two conditions: p(x) ≥ 0 for all values of x ∈ R, and p(xi ) = 1.
i=1
Exercise 1.2. Suppose that a fair coin is tossed twice so that the sample space is S =
{HH, HT, T H, T T }. Let X be a number of heads.
(a) Find the probability function for X.
(b) Find the cumulative distribution function of X.
Solution. We have
x 0 1 2
1 1 1
p(x) 4 2 4
7
and
0, x≤0
1,
x ∈ [0, 1)
F (x) = 43
4, x ∈ [1, 2)
1, x ≥ 2.
Exercise 1.3. The probability mass function of a discrete random variable X is given in the
following table:
x -1 2 5 8 10
p(x) 0.1 0.15 0.25 0.3 0.2
a) Find the cumulative distribution function F (x) and graph it.
b) Find P (X > 2).
A continuous random variable attains uncountably many values, such as the points on a
real line.
Definition 1.4. Let X be a random variable. Suppose that there exists a nonnegative real-
valued function: f : R → R+ such that for any interval [a, b],
Z b
P (X ∈ [a, b]) = f (x) dx.
a
Then X is called a continuous random variable. The function f is called the probability density
function or simply density (and abbreviated as pdf) of X.
Sometimes, we use the notation fX and FX to stress the fact that fX is the density of a
random variable X and that FX is the cumulative distribution function of X.
For a given function f toRbe a pdf, it needs to satisfy the following two conditions: f (x) ≥ 0
∞
for all values of x ∈ R, and −∞ f (t) dt = 1.
(a) Find c.
(b) Find the distribution function F (x).
(c) Compute P (1 < X < 3).
8
1.2 Discrete Probability Distributions
Definition 1.6. A random variable X is said to be uniformly distributed over the numbers
1, 2, 3, . . . , n if
1
P (X = i) = , for i = 1, 2, . . . , n.
n
The uniform probability mass function is:
1
p(i) = P (X = i) = , for i = 1, 2, . . . , n.
n
The cumulative uniform distribution function is:
X X1 bxc
F (x) = P (X ≤ x) = p(i) = = ,
n n
i≤x i≤x
for 0 ≤ x ≤ n. Here, bxc is the floor of x, that is, the largest integer smaller than or equal to
x.
Definition 1.7. A binomial random variable is a discrete random variable that describes the
number of successes X in a sequence of n Bernoulli trials. The binomial probability mass
function is defined as:
n k
p(k) = P (X = k) = p (1 − p)n−k , k = 0, 1, 2, 3 . . . , n.
k
The corresponding cumulative distribution function is:
X X n
F (x) = P (X ≤ x) = p(k) = pk (1 − p)n−k ,
k
k≤x k≤x
for 0 ≤ x ≤ n.
Bernoulli distribution arises in the following scenario. Suppose we carry out an experiment,
which may result in 2 outcomes, say a success or a failure. We repeat this experiment n times
and count the number of successes. Bernoulli trials are the fixed sequence of n identical
repetitions of the same Bernoulli experiment; For each trial probability of success is p ∈ (0, 1)
and the probability of failure is q = 1 − p. The outcome of each experiment is independent on
previous experiments and does not influence any subsequent outcomes.
Definition 1.8. The probability distribution function and the cumulative distribution func-
tions for the geometric distribution are:
p(k) = P (X = k) = (1 − p)k−1 p, for k = 1, 2, . . . (1.1)
X
F (x) = P (X ≤ x) = (1 − p)k−1 p = 1 − (1 − p)bxc (1.2)
k≤x
for x ≥ 0
9
The geometric random variable is the number X of Bernoulli trials needed to get one
success. Note that X is supported on the set S = {1, 2, 3, . . .};
Sometimes geometric distribution is defined differently. One can say say that X is the
number failures before the first success. Note that with this definition, X is supported on the
set S = {0, 1, 2, . . .} as one can have X = 0 if the first trial result in a success. If X represents
number of ’failures’ before first success, then instead of (1.1)-(1.2) we have
Definition 1.9. A discrete random variable X has a Poisson distribution with parameter
λ > 0 if its probability mass function is given by:
λk e−λ
p(k) = P (X = k) = , for k = 0, 1, 2, . . .
k!
for 0 ≤ x < ∞.
A Poisson random variable describes the number of random events occurring in a fixed
unit of time and space. For example, the number of customers that entered a supermarket in
between 9 and 10 AM (here ‘a random event’ means arrival of a customer at the supermarket).
More precisely,
• The occurrence of one event does not affect the probability that a second event will
occur. That is, events occur independently.
• Two events cannot occur at exactly the same instant; instead, at each very small sub-
interval, either exactly one event occurs, or no event occurs.
The fact that these properties give rise to a Poisson random variable can be derived form-
ally, see D. D. Wackerly, W. Mendenhall and R. L. Scheaffer, Mathematical Statistics with
Applications, 7th edition, Section 3.8.
10
Figure 1.1: Probability mass function of the Poisson distribution
for 0 ≤ x < ∞.
The uniform distribution describes an experiment where there is an arbitrary outcome that
lies between certain bounds and a and b and obtaining a value in each small interval [x, x + dx]
is equally probable.
Definition 1.11. A random variable X is called Gaussian or normal if its probability density
function f (x) is of the form
1 (x−µ)2
f (x) = √ e− 2σ2 ,
2πσ
where −∞ < x < ∞ and µ and σ are two parameters, such that −∞ < µ < ∞, σ > 0.
11
Figure 1.2: Density of the normal distribution
for −∞ < x < ∞. This integral cannot be simplified further. There is no ’nice’ formula for
the Gaussian cumulative distribution function.
Notation 1.12. We denote the normal distribution with mean µ and variance σ 2 by N (µ, σ 2 ).
Definition 1.13. The exponential distribution is the probability distribution with the density
λe−λx x ≥ 0
f (x) = , (1.3)
0, x<0
where λ > 0 is parameter called a rate parameter. The cumulative distribution function is
1 − e−λx x ≥ 0
F (x) =
0, x<0
12
Exponential distribution is often used for modelling the waiting time for an event. For
example if Nt is the number of customers that joined the queue before time t, the time
between arrivals of the customers would be modelled by the exponential random variables.
1.4 Quantiles
Definition 1.14. For a random variable X and 0 < p < 1, the p-th quantile of X, denoted
by ϕp , is the smallest value such that P (X ≤ ϕp ) = F (ϕp ) ≥ p.
If F is an invertible function, then the p-th quantile is simply
ϕp = F −1 (p).
ϕ0.5 is the median. We often express quantiles in percentages e.g. the 0.05-quantile is the 5-th
percentile
1.5 Moments
Definition 1.15. Let X be any random variable. For any positive integer r the r-th moment
(r-th moment about the origin) is
µr := E[X r ].
The central r-th moment of X is µ0r (r-th moment of X about the mean), is given by
µ0r := E[(X − µ1 )r ].
The moments are only defined if the expectation is well-defined and finite.2
If X is a continuous random variable with a pdf fX (x), then
Z +∞
µr = xr fX (x) dx,
−∞
and Z +∞
µ0r = (x − µ1 )r fX (x) dx,
−∞
R∞ r f (x) dx
provided −∞ |x| X < ∞.
Definition 1.16. The r−th standardized moment, µ̃r , is a moment that is normalized, typic-
ally by the standard deviation raised to the power of r, σ r .
E [(W − µ)r ]
µ̃r = .
σr
2
When we calculate the r-th moment we either find a sum of a series (discrete case) or an integral (continuous
case). In some rare cases these series and integrals may diverge.
13
We also have
• When r = 1, the subscript is usually omitted, i.e. µ1 := µ is the mean (or expectation)
of X.
• When r = 2, µ02 = σ 2 is called the variance, and σ is called the standard deviation.
• We define the skewness E[(X − µ)3 ]/(E[(X − µ)2 ])3/2 , the third standardized moment
and the kurtosis E[(X − µ)4 ]/(E[(X − µ)2 ])2 , the forth standardized moment.
The third moment of the distribution shows the extent to which the distribution is not
symmetric about the mean. For example, since the density of Z ∼ N (0, 1) is symmetric, its
third moment is zero: Z +∞
1 1 2
x3 √ e− 2 x dx = 0
3
E Z =
−∞ 2π
1 2
Figure 1.3: The function x3 √12π e− 2 x .
The fourth moment allows to express the flatness or both the ‘peakedness’ of the distribu-
tion and the heaviness of its tail.
We illustrate these definitions with two exercises - one involving a discrete random variable
and one continuous.
Exercise 1.17. To find out the prevalence of smallpox vaccine use, a researcher inquired into
the number of times a randomly selected 200 people aged 16 and over in an African village
had been vaccinated. He obtained the following figures:
N 0 1 2 3 4 5
proportion 17/200 30/200 58/200 50/200 38/200 7/200
Assume that these proportions continue to hold exhaustively for the population of that village.
14
a) What is the expected number of times those people in the village had been vaccinated?
Example 1.19. Find the expectation and the variance for a random variable X:
tX X, suppose
Definition 1.20. For a random variable
3
that there is a positive number h such that
for −h < t < h the expectation E e exists . The moment-generating function (abbreviated
as mgf) of the random variable X is defined by
MX (t) = E etX
We have
∞
X
MX (t) = etxi pX (xi ), if X is discrete, taking values x1 , x2 , . . . ,
Zi=1∞
MX = etx fX (x) dx, if X is continuous.
−∞
(tX)2 (tX)n
etX = 1 + tX + + ··· + + ···
2! n!
3
This assumption is needed because, as mentioned earlier, the series and integrals used for finding an
expectation may diverge
15
Plugging in the definition of mgf
t2 tn
MX (t) = E etX = 1 + tE[X] + E[X 2 ] + · · · + E[X n ] + · · ·
2! n!
On the other hand
00 (0) (n)
0 MX M (0) n
MX (t) = MX (0) + MX (0)t + t2 + · · · + X t + ··· .
2! n!
Comparing the coefficients, we must have
0 00 (n)
MX (0) = E[X], MX (0) = E[X 2 ], ..., MX (0) = E[X n ].
dr MX
(r)
r
= MX (0) = µr .
dt
t=0
This is an important result, which provides a way of finding expectation, variance and
higher moments of random variables.
Other properties of moment-generating functions are:
Theorem 1.22. (i) The moment-generating function of X is unique in the sense that, if
two random variables X and Y have the same mgf (MX (t) = MY (t), for t in an interval
containing 0), then X and Y have the same distribution.
(ii) If X and Y are independent, then MX+Y (t) = MX (t)MY (t). That is, the mgf of the sum
of two independent random variables is the product of the mgfs of the individual random
variables. The result can be extended to n random variables.
Example 1.23. We find the moment generating function of Z ∼ N (0, 1). We have
MZ (t) = E etZ
Z ∞
1 1 2
= etx √ e− 2 x dx
2π
Z−∞∞
1 − 1 (x−t)2 + 1 t2
= √ e 2 2 dx we put the exponentials together and completed the square
−∞ 2π
Z ∞
1 2 1 1 2
=e 2
t
√ e− 2 (x−t) dx.
−∞ 2π
16
Since density of N (µ, σ 2 ) integrates to 1, we know that for any µ and σ > 0
Z ∞
1 (x−µ)2
√ e− 2σ2 dx = 1.
−∞ 2πσ
17
Chapter 2
Descriptive Statistics
Recall from Section 1.1 that a population is the collection of the individuals or items under
consideration in a statistical study. Recall that a sample is that part of the population from
which information is obtained.
A sample should reflect all the characteristics (of importance) of the population - be rep-
resentative - reflect as closely as possible the relevant characteristics of the population. A
sample that is not representative of the population characteristics is called a biased sample.
Nevertheless, the results of the studies almost never replicate the features of the whole
population exactly. There is always some error: a non-zero difference between the quantities
found in a study and the (unknown) value characteristic for the whole population:
Definition 2.1. Sampling error occur when the sample is not representative, its characteristics
differ from the population.
The sample size is an important feature of any empirical study. Sample size depends:
Definition 2.2. Nonsampling errors occur in the collection, recording, and processing of
sample data.
• missing values;
18
2.1 Graphical representation of Data
Graphical representation
◦ Pareto Chart;
19
2.2.1 Measures of central tendency
Measures of central tendency are
• The Mean (Sample Mean) - the sum of observations divided by the number of observa-
tions.
n
1X
x̄ := xi ,
n
i=1
The order statistics of x1 , . . . , xn are their values put in increasing order, which we denote
x(1) ≤ x(2) ≤ . . . ≤ x(n) . x(i) is called i-th order statistic.
• An α-trimmed Mean - the mean of the remaining middle 100(1−2α)% of the observations.
n−bnαc
P
x(i)
x(bnαc+1) + x(bnαc+2) + · · · + x(n−bnαc) i=bnαc+1
x̄α := = ,
n − 2bnαc n − 2bnαc
Note that the mode is the only measure of center that can be used for qualitative data.
• A weighted mean is used when some data points contribute more ‘weight’ than others.
Pn
j=1 xj wj
x̄w := Pn ,
j=1 wj
• A Grouped-Data Mean Grouped data are data formed by aggregating individual obser-
vations of a variable into groups, so that a frequency distribution of these groups serves
as a convenient means of summarizing or analyzing the data. Suppose n observations
20
are grouped into m classes indexed by j = 1, . . . , m Let fj be the frequency for each
group. The grouped-data mean can be calculated as:
Pm Pm
j=1 xj fj j=1 xj fj
x̄g := Pm = ,
j=1 fj n
79, 61, 77, 74, 54, 70, 62, 83, 93, 66, 70, 89
80, 98, 54, 90, 83, 50, 72, 82, 60, 83, 51, 86, 61
Using the limit grouping with a first class of 50 − 59 and a class width of 10. Find the
sample mean, the median, the α-trimmed mean (α = 5%), and the mode.
Solution. The ordered sample is
50, 51, 54, 54, 60, 61, 61, 62, 66, 70, 70, 72, 74, 77, 79,
21
or v
u n
uP
u (xi − x̄)2
sx = i=1
t
.
n−1
The Sample Variance for grouped data:
Pm
j=1 (xj − x̄)2 fj
s2x,grouped = ,
n−1
where xj denotes either a class
Pm mark or a midpoint, fj - a class frequency, m is the number of
classes or intervals, and n = j=1 fj is the sample size.
The Sample Standard Deviation for grouped data:
sP
m 2
j=1 (xj − x̄) fj
sx,grouped = ,
n−1
The Range is the difference between the largest and smallest values of the sample
The Midrange is average of the largest and the smallest values of the sample
x(max) + x(min)
midrange = .
2
We defined quantiles of a probability distribution in Section 1.4. Similarly, we can define
quantiles of a sample. If p ∈ (0, 1), the p-th quantile is a value x which divides the ordered
sample in a way that p100% values are below x and (1 − p)100% are above x.
Instead of quantiles, we often use percentiles. Percentile is the value below which a per-
centage of data falls and vary between 0 and 100. E.g. the 20th percentile is value that divides
the sample into the bottom 20% and top 80%.
Quartiles are the values that split the data into quarters, denoted as Q0 , Q1 , Q2 , Q3 , Q4 .
A five-number summary consists of
• Q1 = the first quartile, which divides the bottom half of the sample into halves
• Q2 = the median,
• Q3 = the third quartile, , which divides the top half of the sample into halves,
22
The five-number summary usually presented as a box-plot. The range is r = Q4 − Q0 . The
Interquartile Range (IQR) is the difference between the upper and lower quartiles of the
sample:
IQR = Q3 − Q1 .
Formally, we may define
x(b1+(n−1)/4c) + x(d1+(n−1)/4e)
Q1 = ,
2
x(b1+3(n−1)/4c) + x(d1+3(n−1)/4e)
Q3 = .
2
23
Chapter 3
Estimators
24
We formalize this procedure. Let k be the dimension of θ, that is θ = (θ1 , . . . , θk ). The
problem of point estimation is to determine the statistics
based on the observed sample data from the population. Any random variable that is calcu-
lated using the sample is called a statistic. Note that the functions gi establish the link between
the results of the experiments (the sample) and our guess for the value of the parameters, which
we decided to seek in (3.1).
The statistics θ̂i are called estimators for the parameters. The estimators θ̂i are random
variables as they depend on X1 , . . . , Xn . When a particular sample x1 , . . . , xn is taken, the
values calculated from these statistics are called estimates of the parameters.
The rest of this Section is to study two most popular methods of finding estimators:
We then study the criteria for choosing a desired point estimator such as bias, efficiency,
sufficiency, and consistency.
25
(iii) From the system of equations, µr = mr , r = 1, 2, . . . , k, solved for the parameter θ =
(θ1 , . . . , θk ) we find a moment estimator of θ̂ = (θ̂1 , . . . , θ̂k ).
Example 3.3. Let X1 , . . . , Xn be a random sample from a Bernoulli population with para-
meter p. Tossing a coin 10 times and equating heads to value 1 and tails to value 0, we
obtained the following values: 0 1 1 0 1 0 1 1 1 0. We find an estimator for p using the method
of moments.
For the Bernoulli random variable, µ1 = E[X] = 1 · p + 0 · (1 − p) = p. We require that
µ1 = m1 , that is
n
1X
p= Xi .
n
i=1
Thus, the estimator of p is
n
1X
p̂ = Xi
n
i=1
i.e. we can estimate p as the ratio of the total number of heads to the total number of tosses.
n
P
For the partciluar sample x = (0, 1, 1, 0, 1, 0, 1, 1, 1, 0) we have xi = 6, so the moment
i=1
estimate of p is p̂ = 6/10.
Example 3.4. (i) Let the distribution of X be N (µ, σ 2 ). For a sample X1 , . . . , Xn of size
n, we use the method of moments to estimate µ and σ 2 . Since there are two parameters,
we find the first two moments: µ1 = µ and
Thus
n n
1X 2 1X 2
σ2 = Xi − µ2 = Xi − (X̄)2
n n
i=1 i=1
n n
! !
1 X 2 2 1 X 2 2 2
= Xi − n(X̄) = Xi − 2n(X̄) + n(X̄) (3.3)
n n
i=1 i=1
n n n n
!
1 X
2
X X
2 1X
Xi2 − 2Xi X̄ + (X̄)2
= Xi − 2 Xi X̄ + (X̄) =
n n
i=1 i=1 i=1 i=1
26
n
1X 2
= Xi − X̄ .
n
i=1
Note that (3.3) is ‘reverse engineering’ as we are performing these steps knowing that
the variance should be estimated by an expression similar to (2.1).
(ii) The following data (rounded to the third decimal digit) were generated from a normal dis-
tribution with mean 2 and a standard deviation of 1.5. 3.163, 1.883, 3.252, 3.716, −0.049,
−0.653, 0.057, 4.098, 1.670, 1.396, 2.332, 1.838, 3.024, 2.706, 3.830, 3.349, −0.230, 1.496, 0.231,
2.987. Obtain the method of moments estimates of the true mean and the true variance.
µ̂ = 2.0048 and σ̂ = 1.44824. 1
Exercise 3.5. Let X1 , X2 , . . . , Xn be a sample of i.i.d. random variables with pdf with θ > 0:
(
θxθ−1 , 0 < x ≤ 1,
fX (x) =
0, otherwise.
(ii) For the following observations of X calculate the method of moments estimate for θ:
0.3, 0.5, 0.8, 0.6, 0.4, 0.4, 0.5, 0.8, 0.6, 0.3
X̄
Solution. Answer: θ̂ = 1−X̄
.
If X1 , . . . , Xn are discrete i.i.d. random variables with probability mass function p(x, θ),
then, the likelihood function is given by
n
Y
L(θ) = Pθ (X1 = x1 , . . . , Xn = xn ) = Pθ (X1 = x1 ) · · · Pθ (Xn = xn ) = Pθ (Xi = xi ).
i=1
1
If you use a software to calculate the standard deviation and need to have exact results, you need to check
if the programme is using the factor n1 or n−11
. In Excel, the function STDEV.P has n1 and STDEV.S has n−1 1
.
27
Thus in the discrete case we have
n
Y
L(θ) = p(xi , θ) (3.4)
i=1
and similarly in the continuous case, if the probability density function is f (x, θ), the likelihood
function is
Yn
L(θ) = f (xi , θ) (3.5)
i=1
In practice, we will always use either (3.4) or (3.5) to calculate the likelihood.
Example 3.7. Continuing Example 3.1 we have
k
(i) If X1 , . . . , Xn ∼ P ois(λ), then as we said θ = λ and p(k, λ) = e−λ λk! for k = 0, 1, 2, . . ..
Thus
n n
Y Y λ xi λx1 +···+xn
L(λ) = p(xi , λ) = e−λ = e−nλ .
xi ! x1 ! · · · xn !
i=1 i=1
1 (x−µ)2
(ii) If X1 , . . . , Xn ∼ N (µ, σ 2 ), then θ = (µ, σ 2 ) and f (x; µ, σ 2 ) = √ 1 e− 2 σ2 for x ∈ R.
2πσ
Thus
n n (xi −µ)2
Pn 2
1 1 i=1 (xi −µ)
e− 2σ2 = e−
Y Y
L(µ, σ 2 ) = f (xi , µ, σ 2 ) = √ 2σ 2 .
i=1 i=1
2πσ (2π)n/2 σ n
Definition 3.8. The Maximum Likelihood Estimates are those values θ̂ of the parameters
that maximize the likelihood function with respect to the parameter θ. That is,
L(θ̂, x1 , . . . , xn ) ≥ L(θ, x1 , . . . , xn )
for all θ ∈ ΩΘ .
As explained earlier, since θ̂ depends on the sample, it is also a random variable.
The procedure to find Maximum Likelihood Estimators (MLEs) is:
(i) Define the likelihood function, L(θ).
(ii) Often it is easier to take the natural logarithm of L(θ), that is work with the log-likelihood
l(θ) = ln(L(θ)).
28
(iii) Differentiate l(θ) with respect to θ, and then equate the derivative to zero.
Taking the logarithm in (ii) simplifies the calculations, and is justified as logarithm is a
monotonically increasing function so it doesn’t change the location of the maxima.
We now try this recipe in a few examples and exercises.
Example 3.9. Suppose X1 , . . . , Xn are discrete i.i.d. random variables with the geomet-
ric distribution with an unknown parameter p, so that P (X = x) = (1 − p)x−1 p for x =
1, 2, . . .. We find the maximum likelihood estimator for the parameter p for n independent
observations x1 , x2 , . . . , xn and calculate the estimate for the following set of observations:
231, 127, 60, 4, 3, 183, 2, 4, 71, 22.
Exercise 3.10. Suppose the isolated weather-reporting station has an electronic device with
operating for a random time until a failure occurs. The station also has one spare device, and
the time Y until this second instruments is not available has a distributed with the density
1 −y
fY (y, θ) = ye θ , 0 ≤ y < ∞, 0 < θ < ∞ (3.7)
θ2
Five data points have been collected: 9.2, 5.6, 18.4, 12.1, 10.7. Find the maximum likelihood
estimate for θ.
29
Solution. We find the likelihood
n n
!
Y 1 yi 1 Y 1 Pn
L(θ) = 2
yi e− θ = 2n yi e− θ i=1 yi
.
θ θ
i=1 i=1
Exercise 3.11. For a random sample X1 , X2 , . . . , Xn from normal distribution N (µ, σ 2 ) with
the pdf:
1 (x−µ)2
fX (x) = √ e− 2σ2
2πσ
find the maximum likelihood estimators for µ and σ 2 . Compare these estimators to the method
of moments estimators discussed in the previous section.
n
1
Solution. Answer: µ̂ = X̄ and σ̂ 2 = (Xi − X̄)2 .
P
n
i=1
Example 3.12. Let X1 , . . . , Xn be a random sample from U (0, θ), θ > 0. We find the MLE
of θ. (
1
, 0<x≤θ
fX (x) = θ
0, otherwise
Thus (
1
θn 0 < x1 , x2 , . . . , xn ≤ θ
L(θ, x1 , . . . , xn ) =
0, otherwise.
30
Unlike the previous examples, we don’t take logarithm nor calculate derivatives. Instead, we
observe that L is non-zero only when all xi are smaller than or equal to θ, or equivalently,
when max xi ≤ θ. L is increasing on [0, max xi ] and attains maximum at θ̂ = max xi .
i=1,...,n i=1,...,n
Theorem 3.13. Let X1 , . . . , Xn be a random sample of size n from a population with mean
2
µ and variance σ 2 . Then E[X̄] = µ and Var(X̄) = σn .
Proof. We have
n n n
" #
1X 1X 1X nµ
E[X̄] = E Xi = E[Xi ] = µ= =µ
n n n n
i=1 i=1 i=1
and by independence
n n
!
1X 1 X 1 2 σ2
Var(X̄) = Var Xi = Var(Xi ) = nσ = .
n n2 n2 n
i=1 i=1
2 σ2
µX̄ = E[X̄] = µ, σX̄ = Var(X̄) =
n
Definition 3.14. The value σX̄ is called the standard error of the mean.
If in Theorem 3.13 we additionally assume that the sample is from a Gaussian population,
then the sample mean is also Gaussian. Firtst though we recall the following results from
MA1061 - Probability module:
31
Proof. By Proposition 3.15, a linear combination of independent Gaussian random variables
is Gaussian, thus both X̄ = n1 ni=1 Xi and σ/X̄−µ
P
√ are Gaussian. Their mean and variance can
n
easily be found similarly to the previous proof.
The above theorems formalize the important concept of standardizing, that is creating
mean-zero and unit-variance random variables:
X −µ
X ∼ N (µ, σ 2 ) =⇒ Z = ∼ N (0, 1).
σ
3.2.1 Unbiasedness
Definition 3.17. A point estimator θ̂ is called an unbiased estimator of the parameter θ if
E[θ̂] = θ. Otherwise θ̂ is said to be biased. Furthermore, the bias of θ̂ is given by
h i
B θ̂, θ = Bias θ̂, θ = E θ̂ − θ.
Example 3.18. Let X1 , . . . , Xn be a random sample from a Bernoulli population with para-
meter p. We show that the method of moments estimator obtained in Example 3.3 is also an
unbiased estimator. We have Pn
Xi Y
p̂ = i=1 = ,
n n
where Y is the binomial random variable, E[Y ] = np, hence,
1 1
E [p̂] = E[Y ] = np = p.
n n
Theorem 3.19. The mean of a random sample X̄ is an unbiased estimator of the population
mean µ.
Proof. Let X1 , . . . , Xn be random variables with mean µ. Then, the sample mean is X̄ =
1 Pn
n i=1 Xi .
n
1X 1
E X̄ = E[Xi ] = nµ = µ.
n n
i=1
32
Theorem 3.20. Let X1 , . . . , Xn be random sample drawn from an infinite population with
variance σ 2 < ∞ . Let
n
1 X
S2 = (Xi − X̄)2
n−1
i=1
be the variance of the random sample, then S 2 is an unbiased estimator for σ 2 .
Proof.
" n #
2 1 X 2
E S = E (Xi − µ) − (X̄ − µ)
n−1
i=1
" n n n
#
1 X X X
= E (Xi − µ)2 − 2(X̄ − µ) (Xi − µ) + (X̄ − µ)2
n−1
i=1 i=1 i=1
" n #
1 X
2 2
= E (Xi − µ) − 2(X̄ − µ)n(X̄ − µ) + n(X̄ − µ)
n−1
i=1
n
!
1 X 2
2
= E (Xi − µ) − nE (X̄ − µ) .
n−1
i=1
n
!
2 1 X σ 2
E S = σ2 − n = σ2
n−1 n
i=1
n−1 2
E[σ̂ 2 ] = σ .
n
However, if the sample size n → ∞
E[σ̂ 2 ] → σ 2 ,
hence, σ̂ 2 is asymptotically unbiased.
Exercise 3.21. Let θ̂1 and θ̂2 be two unbiased estimators of θ. Show that a convex combin-
ation of θ̂1 and θ̂2
θ̂3 = aθ̂1 + (1 − a)θ̂2 , 0 ≤ a ≤ 1
is an unbiased estimator of θ.
Solution. We are given that E[θ̂1 ] = θ and E[θ̂2 ] = θ. Therefore,
E[θ̂3 ] = E[aθ̂1 + (1 − a)θ̂2 ] = aE[θ̂1 ] + (1 − a)E[θ̂2 ] = aθ + (1 − a)θ = θ.
Hence, θ̂3 is unbiased.
33
3.2.2 Efficiency
Among unbiased estimators, we may still argue that one is better than the other as the
following example shows.
Example 3.22. Let X1 , X2 , X3 be a sample of size n = 3 from a distribution with unknown
mean µ, −∞ < µ < ∞, where the variance σ 2 is a known positive number.
One can show that both θ̂1 = X̄ and θ̂2 = (2X1 + X2 + 5X3 )/8 are unbiased estimators
for µ. Indeed, for θ̂1 we have
1
E[θ̂1 ] = E[X̄] = · 3µ = µ
3
and for θ̂2 :
1 2µ + µ + 5µ
E[θ̂2 ] = · (2E[X1 ] + E[X2 ] + 5E[X3 ]) = = µ.
8 8
However, the variances of θ̂1 and θ̂2 are different:
σ2
Var(θ̂1 ) = ,
3
2X1 + X2 + 5X3 4 2 1 25 30
Var(θ̂2 ) = Var = σ + σ2 + σ2 = σ2.
8 64 64 64 64
Var(θ̂∗ ) ≤ Var(θ̂)
34
We skip the proof of this Theorem.
The expression
" #
∂ ln fX (X, θ) 2
2
∂ ln fX (X, θ)
I(θ) = nE = −nE
∂θ ∂θ2
is called the Fisher information. This Theorem says that the lowest possible variance of an
estimator equals to I(θ)−1 .
Definition 3.26. The unbiased estimator θ̂ is said to be efficient if the variance of θ̂ equals
to the Cramér-Rao lower bound associated with fX (x, θ).
The efficiency of an unbiased estimator θ̂ is the ratio of the Cramér-Rao lower bound for
fX (x, θ) to the variance of θ̂.
An efficient estimator is optimal in the sense that it has the smallest possible variance
among all unbiased estimators.
Example 3.27. Let X1 , X2 , . . . , Xn be a random sample from the Poisson distribution pX (x, θ) =
e−λ λx
x! , x = 0, 1, . . .. We compare the Cramér-Rao lower bound for pX (x, λ) to the variance
of the maximum likelihood estimator P
for λ. Firstly, we find the MLE. The likelihood is given
Qn e−λ λxi xi
−nλ λ
. The log-likelihood is ln L(λ) = −nλ + ln(λ) ni=1 xi −
P
by L(θ) = i=1 xi ! = e Q
x i !
Pn
i=1 ln(xi !). Differentiating with respect to λ we get
n
d ln L(λ) X
= −n + λ−1 xi = 0.
dλ
i=1
d ln pX (x, λ) x
= −n +
dλ λ
d2 ln pX (x, λ) x
=− 2
dλ2 λ
Thus, the Fisher information is equal to (recall from Exercise 1.19 that E[X] = λ)
2
d ln fX (X, λ) 1 λ 1
I(λ) = −E 2
= 2 E[X] = 2 = .
dλ λ λ λ
1 λ
CRLB = 1 = n
nλ
35
and is equal to the variance of λ̂:
Pn Pn Pn
i=1 Xi i=1 Var(Xi ) λ nλ λ
Var(λ̂) = Var = 2
= i=1
2
= 2 = .
n n n n n
We conclude that the maximum-likelihood estimator λ̂ = X̄ for the parameter λ of the Poisson
distribution is efficient.
For biased estimators, one can measure the precision of an estimator by finding expected
squared distance between the true value of the parameter and its estimator:
Definition 3.28. The mean square error of the estimator θ̂, denoted by MSE θ̂ , is defined
as h i
MSE θ̂ = E (θ̂ − θ)2 .
We have
h 2 i
MSE θ̂ = E θ̂ − E[θ̂] + E[θ̂] − θ
h 2 2 i
= E θ̂ − E[θ̂] + E[θ̂] − θ + 2 θ̂ − E[θ̂] E[θ̂] − θ
h 2 i h 2 i h i
= E θ̂ − E[θ̂] + E E[θ̂] − θ + 2E θ̂ − E[θ̂] E[θ̂] − θ
2 h i
= Var θ̂ + E[θ̂] − θ because E θ̂ − E[θ̂] = 0,
2
= Var θ̂ + Bias θ̂, θ .
Definition 3.29. The unbiased estimator θ̂ that minimizes the mean square error is called
the Minimum-Variance Unbiased Estimator of θ.
Exercise 3.30. If X has a binomial distribution with parameters n and p, then p̂1 = X/n is
an unbiased estimator of p. Another estimator of p is p̂2 = (X + 1)/(n + 2).
1) Derive the bias of p̂2 .
2) Derive MSE(p̂1 ) and MSE(p̂2 ).
3) Show that for p ≈ 0.5 MSE(p̂1 ) < MSE(p̂2 ).
3.2.3 Sufficiency
Definition 3.31. Let X = (X1 , . . . , Xn ) be a random sample from a probability distribution
with unknown parameter θ. Then, the statistic U = g(X1 , . . . , Xn ) is said to be sufficient for
θ if the conditional pdf fX (x1 , . . . , xn |U = u) (or pmf pX (x1 , . . . , xn |U = u)) does not depend
on θ for any value of u.
An estimator of θ that is a function of a sufficient statistic for θ is said to be a sufficient
estimator of θ.
36
Example 3.32. Let X1 , . . . , Xn be a random sample of size n drawn from the Bernoulli pmf:
pX (k, p) = pk (1 − p)1−k , where k =P0, 1 and p is an unknown parameter. The maximum
likelihood estimator for p is p̂ = n1 ni=1 Xi . Let us also P denote the maximum likelihood
estimate for p given a sample x = (k1 , . . . , kn ) by p̂e = n1 ni=1 ki .
We show that p̂ is a sufficient estimator for p. We have
n
!
X n
P (p̂ = pe ) = P Xi = npe = pnpe (1 − p)n−npe . (3.8)
npe
i=1
P (X1 = k1 , . . . , Xn = kn | p̂ = pe )
P (X1 = k1 , . . . , Xn = kn and p̂ = pe ) P (X1 = k1 , . . . , Xn = kn )
= =
P (p̂ = pe ) P (p̂ = pe )
Pn Pn
pk1 (1 − p)1−k1 · · · pkn (1 − p)1−kn p i=1 ki (1− p)n− i=1 ki pnpe (1 − p)n−npe
= = = .
P (p̂ = pe ) P (p̂ = pe ) P (p̂ = pe )
Plugging in (3.8) we get
1
P (X1 = k1 , . . . , Xn = kn | p̂ = pe ) = n
,
npe
37
where g(θ̂, θ) is a function only of θ̂ and θ and h(x1 , . . . , xn ) is a function of only x1 , . . . , xn
and not of θ.
The same statement holds for continuous case.
Note that h may also depend on θ̂ as θ̂ = u(x1 , . . . , xn ) is a itself a function of the sample.
Example 3.36. Let X1 = k1 , ..., Xn = kn be a random sample of size n from the Poisson
distribution. We show that λ̂ = X̄i is a sufficient statistic for λ. We have
n
λki
= e−nλ λ ki (k1 ! · · · kn !)−1 = e−nλ λnλ̂ (k1 ! · · · kn !)−1
Y P
pX (k1 , . . . , kn , λ) = e−λ
ki
i=1
= g(λ, λ̂)h(x1 , . . . , xn ),
Exercise 3.37. Let X1 , ..., Xn denote a random sample from a geometric population with
parameter p:
pX (x; p) = p(1 − p)x−1 , x = 1, 2, 3...
Show that X̄ is sufficient for p.
3.2.4 Consistency
Definition 3.38. A sequence of random variables X1 , X2 , . . . converges in probability to a
random variable X if for every ε > 0
p
Convergence in probability is denoted as Xn →
− X.
lim P (|Xn − X| ≥ ε) = 0.
n→∞
Consistency, which we now define, involves investigating the convergence of the estimator
as the sample size increases. Intuitively, the larger the sample, the more accurate estimate of
the parameter. Since the sample size n is very important in these considerations, we denote
by θ̂n an estimator, which is based on the sample containing n observations.
38
Consistency means that the probability of our estimator being within some small ε-interval
of θ can be made as close to one as we like by making the sample size n sufficiently large.
The fact that the sample mean is a consistent estimator for the true mean no matter what
pdf the data come from is often refer as the weak law of large numbers:
Theorem 3.40 (Weak Law of Large Numbers). LetPX1 , X2 , . . . be i.i.d random variables with
E[Xi ] = µ and Var(Xi ) = σ 2 < ∞. Define X̄n = n1 ni=1 Xi . Then for every ε > 0,
lim P |X̄n − µ| < ε = 1)
n→∞
Proof. Recall from MA1601, the Markov inequality: for any random variable X and any
non-negative function g we have for any k > 0,
E[g(X)]
P (g(X) ≥ k) ≤ .
k
Using this, we have
X̄ − µ
Un = √
σ/ n
39
The assertion of this theorem means that lim FUn (u) = Φ(u), where Φ is the cdf of the
n→∞
standard normal distribution. Equivalently:
Z z
1 2
lim P (Un ≤ z) = √ e−t /2 dt.
n→∞ 2π −∞
The Central Limit Theorem means that the sample mean is asymptotically normally dis-
tributed whatever the distribution of the original random variables is.
Γ(n) = (n − 1)!
and
√
1
Γ = π.
2
Definition 3.44. A random variable X is said to possess a gamma probability distribution
with parameters α > 0 and β > 0 if it has the pdf given by
(
1 α−1 e−x/β ,
f (x) = β α Γ(α) x x>0
0, otherwise
We denote this by Gamma(α, β) or Γ(α, β). The parameter α is called a shape parameter,
and β is called a scale parameter.
If X ∼ Γ(α, β):
E[X] = αβ, Var(X) = αβ 2
and the moment-generating function:
MX (t) = (1 − βt)−α
40
Figure 3.1: Density of the Gamma distribution
3.3.2 χ2 distribution
Usually when defining a new continuous distribution, we give a formula for its density. This
time we will construct it differently.
Definition
Pn 3.48. Let Z1 , Z2 , . . . , Zn be independent random variables with Zi ∼ N (0, 1). If
Y = i=1 Zi2 , then Y follows the chi-square distribution with n degrees of freedom. We write,
Y ∼ χ2n .
In particular, if Z ∼ N (0, 1) and X = Z 2 , then X follows the chi-square distribution with
1 degree of freedom.
If X ∼ χ2n , then the probability density is2 :
(
1 n/2−1 e−x/2 , x > 0
2n/2 Γ(n/2) x
f (x) =
0, otherwise.
2
Derivation of the χ2 pdf: https://en.wikipedia.org/wiki/Proofs_related_to_chi-squared_
distribution
41
Figure 3.2: Density of the χ2 -distribution
Comparing with Definition 3.44, we see the χ2 -distribution with n degrees of freedom is the
same as Γ(n/2, 1/2). The expectation and variance are
E[X] = n, Var(X) = 2n,
and the moment-generating function
1
M (t) = (1 − 2t)−n/2 for t < . (3.10)
2
We have the following result about sums of χ2 -random variables (without proof):
Corollary 3.49. If X1 , X2 , . . . , Xn are independent RVs such that Xj ∼ χ2 (rj ), j = 1, 2, . . . , n,
then Y is a χ2(Pn rj ) RV.
i=1
Xi −µ
Proof. This obvious from the definition of the χ2 -distribution, since σ is standard normal.
42
Then
(n − 1)S 2
∼ χ2n−1 .
σ2
Proof. This is the most advanced proof of this module. We have
n n
(n − 1) 2 (n − 1) 1 X 2 1 X 2
S = · Xi − X̄ = (Xi − µ) − (X̄ − µ)
σ2 σ2 n−1 σ2
i=1 i=1
n n n
X (Xi − µ)2 X (Xi − µ) (X̄ − µ) X (X̄ − µ)2
= −2 + (3.12)
σ2 σ σ σ2
i=1 i=1 i=1
Plugging in to (3.12)
n
(n − 1) 2 X (Xi − µ)2 (X̄ − µ)2
S = − n .
σ2 σ2 σ2
i=1
Equivalently,
n
(n − 1) 2 (X̄ − µ)2 X (Xi − µ)2
S + n = .
σ2 σ2 σ2
i=1
Let
n
(n − 1) 2 (X̄ − µ)2 X (Xi − µ)2
Y1 = S , Y2 = n , Y3 = .
σ2 σ2 σ2
i=1
Since Y3 ∼ χ2n , we have MY3 (t) = (1 − 2t)−n/2 (cf. Theorem 3.50 and formula (3.10)). Note
2
X̄−µ X̄−µ
that σ/√ is a standard normal random variable. Thus Y2 =
n
√
σ/ n
∼ χ21 . We conclude
MY2 (t) = (1 − 2t)−1/2 . From (3.13)
which is the moment-generating function of the χ2 -distribution with (n−1) degrees of freedom.
Since the moment genrating function uniquelly identifies the distribution (Theorem 1.22(i))
we conclude that (n−1)
σ2
S 2 has χ2 -distribution with (n − 1) degrees of freedom.
43
If X ∼ χ2n , then from the χ2 table, we can read the values of χ2α,n (or χ2α (n)) such that
Solution. Since 5i=1 (Xi − 5)2 ∼ χ25 we are looking in the table for the row with 5 degrees
P
of freedom (df) and for for χ20.10 we obtain ε = 9.23635.
Definition 3.55. If Y and Z are independent random variables, Y has a χ2n distribution, and
Z ∼ N (0, 1), then
Z
T =p
Y /n
is said to have a (Student) t-distribution with n degrees of freedom. We denote this by T ∼ Tn .
3
William Sealy Gossett - English statistician, chemist and brewer who served as Head Brewer of Guinness
and Head Experimental Brewer of Guinness - derived the formula for the pdf in 1908.
Ronald Aylmer Fisher presented a rigorous mathematical derivation of Gossett’s density in 1924.
44
Figure 3.3: Density of the t-distribution
If X ∼ Tn :
Γ( n+1
2 )
f (x) = (n+1)/2 , −∞ < x < ∞
√ n
x2
nπΓ( 2 ) 1 + n
The pdf for Tn is symmetric: fTn (x) = fTn (−x), for all x.
n
E[X] = 0, Var(X) =
n−2
Theorem 3.56. If X̄ and S 2 are the mean and the variance of a random sample of size n
from a normal population with the mean µ and variance σ 2 , then
X̄ − µ
T = √
S/ n
Hence,
√ X̄−µ
X̄ − µ σ/ n Z
T = √ =q =q
S/ n (n−1)S 2 Y
σ 2 (n−1) n−1
Also, by Theorem 3.51 X̄ and S 2 are independent. Thus, Y and Z are independent, and by
definition, T follows a t-distribution with (n − 1) degrees of freedom.
45
A conclusion from this Theorem is that if a random sample of size n is given, then the
corresponding degrees of freedom will be (n − 1).
Exercise 3.57. The 95% quantile for the normal distribution are given by 1.96, and the 99%
quantile by 2.58.
What are the corresponding quantiles for the t distribution if
(a) n = 4, (b) n = 12, (c) n = 30, (d) n = 40, (e) n = 100
3.3.4 F -distribution
Definition 3.58. Suppose that U ∼ χ2m and V ∼ χ2n are independent random variables. A
random variable of the form
U/m
F =
V /n
is said to have an F -distribution with m and n degrees of freedom. We denote this by F ∼
F (m, n)
If X ∼ F (m, n):
Γ( m+n
2 ) m m/2 m
− m+n
X m/2−1
1+ nx
2
, x>0
m n n
f (x) = Γ( 2 )Γ( )
2
0, otherwise
n 2n2 (m + n − 2)
E[X] = , Var(X) =
n−2 m(n − 2)2 (n − 4)
We denote the quantiles of the F -distribution by Fα (m, n). If we know Fα (m, n), it is possible
to find F1−α (n, m) by using the identity
1
F1−α (n, m) = (3.15)
Fα (m, n)
The reason for studying the F distribution is that it allows us to study the ratio of sample
variances of two populations:
46
Figure 3.4: Density of the F -distribution
Theorem 3.59. Let two independent random samples of size m and n be drawn from two
normal populations with variances σ12 , σ22 , respectively. If the variances of the random samples
are given by S12 , S22 , respectively, then the statistic
Two populations, with σ12 = σ22 , are called homogeneous with respect to their variances.
Exercise 3.61. (i) Use table for F -distributions to find the values of x that P (0.109 <
F4,6 < x) = 0.95, where F4,6 denotes a random variable with the F -distribution with 6
and 4 degrees of freedom.
(ii) Use table for F -distributions to find the values of x that P (0.427 < F11,7 < 1.69) = x,
where F11,7 denotes a random variable with the F -distribution with 11 and 7 degrees of
freedom..
Solution. We have the following tables in Richard J. Larsen, Morris L. Marx An Introduction
to Mathematical Statistics and Its Applications, 5th Edition, page 704-717.
47
(i) We have P (F4,6 < 1.09) = 0.025. Thus P (F4,6 < x) = P (0.109 < F4,6 < x) + P (F4,6 <
1.09) = 0.975. Thus x ≈ 6.23.
(ii) P (0.427 < F11,7 < 1.69) = P (F11,7 < 1.69) − P (F11,7 < 0.427) ≈ 0.75 − 0.10 = 0.65.
Exercise 3.62. Let S12 denote the sample variance for a random sample of size 10 from
Population I with normal pdf and let S22 denote the sample variance for a random sample of
size 8 from Population II with normal pdf. The variance of Population I is assumed to be
three times the variance of Population II.
Find two numbers a and b such that
S12
P a < 2 < b = 0.90
S2
assuming S12 to be independent of S22 .
Solution. From the assumption σ12 = 3σ22 , n1 = 10, n2 = 8
S12 /σ12 S12 /3σ22 S12
= = ∼ Fn1 −1,n2 −1
S22 /σ22 S22 /σ22 3S22
S12
P L< < U = 0.90 (3.16)
3S22
48
From the tables we get: L = F0.05,(9,7) , U = F0.95,(9,7) . Rearranging (3.16)
S12
P 0.912 < 2 < 11.04 = 0.90
S2
• P (L ≤ θ ≤ U ) is high,
In the confidence interval estimation, the sets S(X) are intervals: S(X) = [L(X), U (X)]. The
limits L and U are called the lower and the upper confidence limits, respectively.
The probability 1 − α that a confidence interval contains the true parameter θ is called the
confidence coefficient.
Example 3.64. Let’s say that we have an i.i.d sample X1 , . . . , Xn from the normal distribution
N (µ, σ 2 ) and the value of σ is known but µ is unknown and needs to be estimated. We find
a 95%-confidence interval for µ.
49
The estimator for the mean is µ̂ = X̄ = n1 ni=1 Xi . Hence, according to Theorem 3.16,
P
X̄−µ
X̄ ∼ N (µ, σ 2 /n) and Z := σ/ √
n
∼ N (0, 1). We need two numbers a, b ∈ R such that
P (a < Z < b) = 0.95. These can’t be easily calculated from the density nor cdf of the
Gaussian distribution. However, one can use statistical tables to pick a and b. By convention,
Figure 3.5: Construction of the 95% confidence interval using the standard normal density
we pick these values in a symmetric manner, that is we make sure P (Z < a) ≈ 0.025 and
P (Z > b) ≈ 0.025. This will make the confidence interval as narrow as possible.
50
Example 3.65. Suppose that 6.5, 9.2, 9.9, 12.4 are the realisations of a random variable X
from N (µ, 0.82 ).
We construct a 95% confidence interval for µ using (3.17):
The correct interpretation of confidence interval for the population mean is that if samples
of the same size, n, are drawn repeatedly from a population, and a confidence interval is
calculated from each sample, then 95% of these intervals should contain the population mean.
3.4.2 Pivots
Definition 3.66. Let X ∼ Pθ . A random variable T (X, θ) is known as a pivot if the distri-
bution of T (X, θ) does not depend on θ.
• It is a function of the random sample (a statistic or an estimator θ̂) and the unknown
parameter θ, where θ is the only unknown quantity, and
Suppose that θ̂ = g(X) is a point estimate of θ, and let T (θ̂, θ) be the pivotal quantity.
Let a and b be constants with (a < b), such that
P (a ≤ T (θ̂, θ) ≤ b) = 1 − α
51
In the previous example involving the normal distribution we had
X̄ − µ
T (X̄, µ) = √ , T ∼ N (0, 1).
σ/ n
Note that T (X̄, µ) depends on the estimaor and the parameter but its distribution doesn’t.
As we saw earlier (in Example 3.64), the inequality a ≤ T (X̄, µ) ≤ b) can be solved and gives
a confidence interval for µ.
Theorem 3.67. Let T (X, θ) be a pivot such that for each θ, T (X, θ) is a statistic, and as a
function of θ, T is either strictly increasing or decreasing at each x ∈ R.
Let Λ ⊆ R be the range of T , and for every λ ∈ Λ and x ∈ R, let the equation λ = T (x, θ) be
solvable. Then one can construct a confidence interval for θ at any level.
T is monotone in θ, hence, T (x, θ) = λ1 (α) and T (x, θ) = λ2 (α) for every x uniquely for θ.
The condition that λ = T (x, θ) be solvable will be satisfied if, for example, T is continuous
and strictly increasing or decreasing as a function of θ ∈ Θ.
Procedure for finding CI for θ using pivot
(ii) Find a function of θ̂ and θ, T (θ̂, θ) (pivot), such that the probability distribution of
T (θ̂, θ) does not depend on θ.
(iv) Transform the pivot confidence interval to a confidence interval for the parameter θ:
P (L < θ < U ) = 1 − α, where L is the lower confidence limit and U is the upper
confidence limit.
Exercise 3.68. Suppose the random sample X1 , . . . , Xn has U (0, θ) distribution. Construct
a 90% confidence interval for θ. Identify the upper and lower confidence limits.
52
Figure 3.7: Construction of the 95% confidence interval for the fT (t) = ntn−1
P (|Z| ≤ zα/2 ) = 1 − α.
X̄−µ √σ zα/2 √σ ,
Since √
σ/ n
≤ zα/2 ⇐⇒ X̄ − n
≤ µ ≤ X̄ + n
we get
σ σ
P X̄ − √ zα/2 ≤ µ ≤ X̄ + √ = 1 − α.
n n
h i
The (1 − α)-confidence interval for µ is thus X̄ − √σn zα/2 ≤ µ ≤ X̄ + √σn .
Due to the Central limit theorem (Theorem 3.42) this method can also be used when the
sample is not Gaussian but the sample size n is large. Also, typically the standard deviation
is unknown, but the estimate S 2 from the sample is a good enough replacement (when sample
size is large).
53
Exercise 3.69. Hemoglobin levels in 11-year old boys have a normal distribution with un-
known mean µ and σ = 1.209 g/dl. Suppose that a random sample of size 10 has the sample
mean 12 g/dl. Find the 90% confidence interval for µ.
Exercise 3.70. Let X be the mean of a random sample of size n from a distribution that is
N (µ, 9). Find n such that P (X − 1 < µ < X + 1) = 0.90, approximately.
Exercise 3.71. Let a random sample of size 17 from the normal distribution N (µ, σ 2 ) yield
x̄ = 4.7 and σ 2 = 5.76. Determine a 90% confidence interval for µ.
Example 3.72. A manufacturer wants to estimate the reaction time of fuses under 20%
overload. To run the test a random sample of 20 of fuses was subjected to a 20% overload,
and it was found that the times it took them to blow had the mean of 10.4 minutes and
a sample standard deviation of 1.6 minutes. It can be assumed that the data constitute a
random sample from a normal population.
(i) Construct 95% confidence interval to estimate the mean of reaction time.
(ii) Construct one-tailed(lower) 95% confidence interval to estimate the mean reaction time.
54
Given X̄ = 10.4, and S 2 = 1.6. From the table for Student t distribution, obtain
tα/2,n−1 = t0.025,19 = 2.093. Hence,
√ √
P 10.4 − 2.093 · 1.6/ 20 ≤ µ ≤ 10.4 + 2.093 · 1.6/ 20 = 0.95
i.e.
P (9.65 ≤ µ ≤ 11.15) = 0.95.
Hence, the 95% confidence interval for the mean fuse operating time is (9.65, 11.15)
(ii) For one-tailed interval, we need to consider the upper boundary, because the reaction
time should be as short as possible. The confidence interval is:
√
P µ ≤ X̄ + tα,n−1 · S/ n = 1 − α.
α α
P (Y > χ2α/2,n−1 ) = , P (Y < χ1−α/2,n−1 = forY ∼ χ2n−1 .
2 2
Suppose that the population is normal. As (n − 1)S 2 /σ 2 ∼ χ2n−1 we have
(n − 1)S 2
2
P χ1−α/2,n−1 < < χα/2,n−1 = 1 − α
σ2
(ii) Find L = χ2α/2,n−1 , and U = χ21−α/2,n−1 using the χ2 table with (n − 1) degrees of
freedom.
2 (n−1)s2 (n−1)s2
(iii) Compute the (1−α)100% confidence interval for the population variance s as χ2 , χ2 ,
1−α/2 α/2
Exercise 3.73. Suppose we have an independent random sample X1 , . . . , X10 from N (µ, σ 2 )
and we want to find a 95% confidence interval for σ 2 .
55
Solution. As (n − 1)S 2 /σ 2 ∼ χ2n−1 , the pivot T = 9S 2 /σ 2 ∼ χ29 . Recall that the appropriate
notation for the quantiles of the χ-distribution was introduced in (3.14). Using χ2 table, we
find the lower bound L = χ2α/2,n−1 and the upper bound U = χ21−α/2,n−1 : L = 2.70, U = 19.02,
then
(n − 1)S 2
P (L < T < U ) = P L < <U =1−α
σ2
(n − 1)S 2 (n − 1)S 2
2
P <σ < =1−α
U L
9S 2 9S 2
2
P <σ < = 0.95
19.02 2.70
The confidence interval is
9S 2 9S 2
,
19.02 2.70
56
point estimate ± margin of error.
Definition 3.74. The margin of error E for the estimate of µ is
√
E = zα/2 · σ/ n. (3.18)
The margin of error for the estimate of a population mean indicates the accuracy with
which a sample mean estimates the unknown population mean.
Let d be the width of a (1 − α)% confidence interval for the true proportion, p. Then
r r !
X (X/n)(1 − (X/n)) X (X/n)(1 − (X/n))
d= + zα/2 − − zα/2
n n n n
r r
(X/n)(1 − (X/n)) 1 zα/2
= 2zα/2 ≤ 2zα/2 = √
n 4n n
Definition 3.75. The margin of error for confidence level 100(1 − α)% associated with an
estimate X
n , where X is a number of success in n independent trials, and p is unknown, is
zα/2
E= √ .
2 n
If we can estimate that true value of p is greater than 21 (or less than 1
2 ) the margin of
error for confidence level 100(1 − α)% associated with an estimate X n is
p
zα/2 pg (1 − pg )
E= √ .
n
57
rounded up to the next integer. Or, if we can make ‘educated guess’ about the value of p (let’s
say it is at most equal to a value pg ), then the smallest sample sample required is
2
zα/2
n= pg (1 − pg ).
E2
Exercise 3.76. The Bureau of Labor Statistics collects information on the ages of people in
the civilian labor force and publishes the results in Current Population Survey.
(i) Determine the sample size needed to be collected in order to be 95% confident that µ
(mean age of all people in the civilian labor force) is within 0.5 year of the point estimate,
X̄. Assuming that σ = 12.1 years.
(ii) Find a 95% confidence interval for µ if a sample of the size determined in part i) has a
mean age of 43.8 years.
Solution. (i) We have E = 0.5, for 95% confidence interval α = 0.05 and zα/2 = 1.96
2 σ2
zα/2 1.962 · 12.12
n≥ = ≈ 2249.8
E2 0.052
Hence, n = 2250.
58
Large sample - normal distribution
Let X1,1 , . . . , X1,n1 be a random sample from a normal distribution N (µ1 , σ12 ), and let X2,1 , . . . , X2,n2
be a random sample from a normal distribution N (µ2 , σ22 ).
Let X̄1 = n11 ni=1 X1,i and X̄2 = n12 ni=1
P 1 P 2
X2,i .
As we assume that the two samples are independent, the averages X̄1 and X̄2 are also
independent, and the distribution of X̄1 − X̄2 is N (µ1 − µ2 , n11 σ12 + n12 σ22 ). Thus, similarly to
the one-sample case, the (1 − α)-confidence interval for µ1 − µ2 is:
q
X̄1 − X̄2 ± zα/2 σ12 /n1 + σ22 /n2
To apply this formula we need to know σ1 and σ2 . If σ12 and σ22 are unknown, then σ1
and σ2 can be replaced by respective sample standard deviations S1 and S2 , provided that the
samples are large (say n1 , n2 ≥ 30). In such case the confidence interval is:
q
X̄1 − X̄2 ± zα/2 S12 /n1 + S22 /n2 .
For a large sample, by the Central Limit Theorem, this formula may be applied even if the
sample is not Gaussian.
59
Rearranging for µ1 − µ2 we get
q
P (X̄1 − X̄2 ) − tα/2,n1 +n2 −2 · Sp n11 + 1
n2 < µ 1 − µ2
q
1 1
< (X̄1 − X̄2 ) + tα/2,n1 +n2 −2 · Sp n1 + n2 =1−α
Exercise 3.79. Independent random samples from two normal populations with equal vari-
ances produced the following data.
Sample 1 : 1.2, 3.1, 1.7, 2.8, 3.0
Sample 2 : 4.2, 2.7, 3.6, 3.9
Hence,
(n1 − 1)s21 + (n2 − 1)s22
s2p = = 0.599
n1 + n2 − 2
(ii) For the confidence coefficient 0.90, α = 0.10 and from the t-table, t0.05,7 = 1.895. Thus,
a 90% confidence interval for µ1 − µ2 is
r
1 1
(x̄1 − x̄2 ) ± tα/2,n1 +n2 −2 · sp + =
n1 n2
s
1 1
(2.36 − 3.6) ± 1.895 · 0.599 + = −1.24 ± 0.98
5 4
or (−2.22, −0.26).
60
Small samples with different variances
If the equality of the variances cannot be reasonably assumed, σ1² ≠ σ2², the previous procedure can still be used, except that the pivot random variable is
T = ((X̄1 − X̄2 ) − (µ1 − µ2 )) / √(S1²/n1 + S2²/n2 ) ∼ tν ,
where the degrees of freedom ν are given by the Welch–Satterthwaite approximation
ν = (S1²/n1 + S2²/n2 )² / ( (S1²/n1 )²/(n1 − 1) + (S2²/n2 )²/(n2 − 1) ).
Proportions
Let X1 and X2 denote the numbers of successes observed in two independent sets of n1 and
n2 Bernoulli trials, respectively, where p1 and p2 are the true success probabilities associated
with each set of trials. According to the Central Limit Theorem, the distribution of X1 and
X2 can be approximated by the Gaussian distribution and thus
T = ((X1/n1 − X2/n2 ) − (p1 − p2 )) / √( (X1/n1)(1 − X1/n1)/n1 + (X2/n2)(1 − X2/n2)/n2 ) ∼ N (0, 1).
The approximation is valid provided that the two samples are independent and large, with ni (Xi /ni ) = Xi > 5 and ni (1 − Xi /ni ) = ni − Xi > 5 for i = 1, 2.
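A minimal sketch in R of the resulting confidence interval for p1 − p2 (the counts x1, n1, x2, n2 below are hypothetical):
x1 <- 30; n1 <- 120   # hypothetical successes and trials, set 1
x2 <- 45; n2 <- 150   # hypothetical successes and trials, set 2
p1 <- x1/n1; p2 <- x2/n2
alpha <- 0.05
z <- qnorm(1 - alpha/2)
se <- sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)
c((p1 - p2) - z*se, (p1 - p2) + z*se)   # 95% CI for p1 - p2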
Exercise 3.81. The phenomenon of handedness has been extensively studied in human pop-
ulations. The percentages of adults who are right-handed, left-handed, and ambidextrous are
well documented. What is not so well known is that a similar phenomenon is present in lower
animals. Dogs, for example, can be either right-pawed or left-pawed.
Suppose that in a random sample of 200 beagles, it is found that 55 are left-pawed and
that in a random sample of 200 collies, 40 are left-pawed.
Obtain a 95% confidence interval for p1 − p2 .
Chapter 4
Hypothesis testing
Let’s motivate hypothesis testing with the following example: a car manufacturer is looking for additives that might increase a car’s performance. As a pilot study, they send thirty cars fuelled with a new additive out on the road. Without the additive, those same cars are known to have an average fuel consumption of µ0 = 25.0 mpg with a standard deviation of σ0 = 2.4 mpg. The fuel consumption is assumed to be normally distributed.
Suppose it turns out that the thirty cars average x̄ = 26.3 mpg with the additive. Can the company claim that the additive has a significant effect on mileage?
To formalize this setting, we may say that the company wants to collect a sample X1 , . . . , Xn ∼
N (µ, σ02 ) to compare two claims:
(i) H0 : the cars with the new additive have average fuel consumption equal to µ = 25.0 mpg, i.e. there is no improvement, with
(ii) H1 : the cars with the new additive have mileage higher than 25.0 mpg, i.e. there is an improvement.
These can be written mathematically as:
(i) H0 : µ = 25.0,
(ii) H1 : µ > 25.0.
It seems natural to use Z = (X̄ − 25.0)/(2.4/√n) and to reject H0 in favour of H1 if we get a value of Z larger than some threshold value z.
Let us now introduce this framework formally.
4.1 Definitions
4.1.1 Hypotheses, test statistic, rejection region
A statistical test consists of the following four components:
(i) The null hypothesis, denoted by H0, is usually the nullification of a claim. Unless evidence
from the data indicates otherwise, the null hypothesis is assumed to be true.
(ii) The alternative hypothesis, denoted by HA (or sometimes denoted by H1), is the claim
the truth (or validity) of which needs to be shown.
(iii) The test statistic, denoted by TS, is a function of the sample measurements upon which
the statistical decision, to reject or not reject the null hypothesis, will be based.
(iv) A rejection region (or a critical region) is the region (denoted by RR) that specifies the
values of the observed test statistic for which the null hypothesis will be rejected. This
is the range of values of the test statistic that corresponds to the rejection of H0 at some
fixed level of significance α.
Having specified these components and carried out the calculations, we reach a conclusion, that is, an answer to the question posed at the beginning of the whole process. If the value of the observed test statistic falls in the rejection region, the null hypothesis is rejected and we conclude that there is enough evidence to decide that the alternative hypothesis is true. If the test statistic does not fall in the rejection region, we conclude that we cannot reject the null hypothesis.
Failure to reject the null hypothesis does not necessarily mean that the null hypothesis is
true.
The test statistic is
Z = (X̄ − µ0 )/(σ/√n) ∼ N (0, 1).
Let us denote the calculated value of the statistic by zTS = (x̄ − µ0 )/(σ/√n). Given a significance level α, we find the rejection region using the quantile zα/2 of the standard normal distribution:
RR = {|Z| ≥ zα/2 }.
Let X1 , X2 , . . . , Xn be from the Normal Distribution N (µ, σ²). The test hypotheses are:
H0 : µ = µ0 against HA : µ > µ0 (resp. HA : µ < µ0 ),
RR = {Z ≥ zα } resp. RR = {Z ≤ −zα }.
Conclusion: If zTS ≥ zα we reject the null hypothesis (resp. if zTS ≤ −zα we reject the null hypothesis). Otherwise, we do not reject H0.
The probability distribution of the test statistic is known and does not depend on the
parameters of the population!
With this framework we can now solve the problem of the car manufacturer posed at the beginning of this chapter.
Exercise 4.1. A car manufacturer is looking for additives that might increase a car’s performance. As a pilot study, they send thirty cars fuelled with a new additive out on the road. Without the additive, those same cars are known to have an average fuel consumption of µ0 = 25.0 mpg with a standard deviation of σ0 = 2.4 mpg. The fuel consumption is assumed to be normally distributed. Suppose it turns out that the thirty cars average x̄ = 26.3 mpg with the additive.
At the 5% significance level, what should the company conclude?
H0 : µ = µ0    HA : µ > µ0
TS: Z = (X̄ − µ0 )/(σ/√n) ∼ N (0, 1)
Thus zTS = (x̄ − µ0 )/(σ/√n) = (26.3 − 25.0)/(2.4/√30) = 2.97. This is a one-tailed test, so
RR = {Z ≥ zα }, where zα = z0.05 = 1.65.
Since zTS = 2.97 > zα , we reject H0 and conclude that at the 5% significance level the company can claim that the additive provides an increase in petrol mileage.
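The calculation can be reproduced in R as a quick check:
xbar <- 26.3; mu0 <- 25.0; sigma <- 2.4; n <- 30; alpha <- 0.05
z_ts <- (xbar - mu0) / (sigma / sqrt(n))   # about 2.97
z_crit <- qnorm(1 - alpha)                 # about 1.645
z_ts > z_crit                              # TRUE, so H0 is rejected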
4.1.3 Errors when testing hypotheses
In a statistical test, it is impossible to establish the truth of a hypothesis with 100% certainty.
There are two possible types of errors.
Definition 4.2. A type I error is made if H0 is rejected when in fact H0 is true. The probability of a type I error is denoted by α. That is,
α = P (T S ∈ RR | H0).   (4.1)
A type II error is made if H0 is not rejected when in fact HA is true. The probability of a type II error is denoted by β. That is,
β = P (T S ∉ RR | HA).   (4.2)
The right and wrong decisions we may take when conducting a statistical test are sum-
marized in the following table:
Statistical Decision and Error Probabilities
True State of Null Hypothesis
Statistical Decision H0 True H0 False
Do not reject H0 Correct decision Type II error (β)
Reject H0 Type I error (α) Correct decision
In another nomenclature, these outcomes are called true/false positive/negative results.
It is desirable that a test have both α and β as small as possible. It can be shown that there is a relation between the sample size, α (the probability of a type I error) and β (the probability of a type II error).
Once we are given the sample size n, an α, a simple alternative HA, and a test statistic, we have no control over β: it is exactly determined. In other words, for a given sample size and test statistic, any effort to lower β will increase α and vice versa. However, by increasing the sample size n, we can decrease β to an acceptable level for the same α.
Definition 4.3. The power or sensitivity of a test is the probability that the null hypothesis is rejected given that the alternative hypothesis is true: 1 − β.
4.1.4 p-value
Definition 4.4. Corresponding to an observed value of a test statistic, the p-value is the
lowest level of significance at which the null hypothesis would have been rejected.
• If the p-value of the test is less than the chosen significance level α, reject H0.
For the simple Z-test of Section 4.1.2 the p-value can be found as follows (zTS is the observed value of the test statistic):
p-value = P (T S < zTS | H0 ) = Φ(zTS ) for a lower-tail test,
p-value = P (T S > zTS | H0 ) = 1 − Φ(zTS ) for an upper-tail test,
p-value = P (|T S| > |zTS | | H0 ) = 2(1 − Φ(|zTS |)) for a two-tailed test.
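In R these p-values are obtained with pnorm; here z_ts stands for the observed value of the statistic (the number 2.97 from Exercise 4.1 is used as an example):
z_ts <- 2.97
pnorm(z_ts)                  # lower-tail test
1 - pnorm(z_ts)              # upper-tail test
2 * (1 - pnorm(abs(z_ts)))   # two-tailed test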
Solution. 1) Hypotheses: H0 : µ = 72, H1 : µ > 72 (one-sided test).
2) The test statistic T S = (X̄ − µ0 )/(S/√n) gives the value (78.3 − 72)/(11.2/√25) = 2.8125.
5) Additionally we may report the p-value: P (t > 2.81) < P (t > 2.797) = 0.005.
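A quick check of this test statistic and its p-value in R:
xbar <- 78.3; mu0 <- 72; s <- 11.2; n <- 25
t_ts <- (xbar - mu0) / (s / sqrt(n))       # 2.8125
pt(t_ts, df = n - 1, lower.tail = FALSE)   # p-value, below 0.005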
In the following exercise, since the sample size is large (greater than 30), we use the z-test:
Exercise 4.7. We need to analyse the height of wheat plants in a field under a different irrigation (watering) regime. Assume that the height has a normal distribution with unknown variance. We measured 35 plants from different parts of the field and found that the average height is 102 cm and the sample variance is 16. It was shown that under the usual regime the mean height of the plants is 100 cm.
We want to test the hypothesis that the change of irrigation regime has a significant influence on the height of the plants, at the 5% significance level.
Solution. 1) Hypotheses:
2) TS:
3) RR:
4) Conclusion:
5) p-value:
Exercise 4.9. How long sporting events last is quite variable. This variability can cause problems for TV broadcasters, since the amount of commercials and commentator blather varies with the length of the event. Assume that the lengths of a random sample of 16 middle-round contests at the 2008 Wimbledon Championships in women’s tennis have a sample standard deviation of 27.25 minutes.
Assuming that match lengths are normally distributed, test the hypothesis that the standard deviation of a match length is no more than 25 minutes, using α = 0.05.
Compute the p-value.
Solution. :
1) Hypotheses:
2) TS:
3) RR:
4) Conclusion:
5) p-value:
4.2.3 Testing for variance
When testing hypotheses about the equality of two variances σ1² and σ2², the F -distribution introduced in Section 3.3.4 is useful. Let X1 , X2 , . . . , Xn1 be from the Normal Distribution N (µ1 , σ1²) and Y1 , Y2 , . . . , Yn2 be from the Normal Distribution N (µ2 , σ2²). We test
H0 : σ1² = σ2²
against
HA : σ1² ≠ σ2² (resp. HA : σ1² > σ2² or σ1² < σ2² ).
The test statistic is
F = S1²/S2² = (S1²/σ1²) / (S2²/σ2²) ∼ Fn1−1,n2−1
(because under the null hypothesis σ1 = σ2 ). We denote the value of the test statistic for a particular sample by fTS = s1²/s2².
We find the rejection region for fixed α using the statistical tables of the F -distribution.
Exercise 4.10. Consider two independent random samples X1 , . . . , Xn1 from the N (µ1 , σ1²) distribution and Y1 , . . . , Yn2 from the N (µ2 , σ2²) distribution.
Test H0 : σ1² = σ2² versus HA : σ1² ≠ σ2² for α = 0.20 using the following basic statistics:
Sample   Size   Sample Mean   Sample Variance
1        25     410           95
2        16     390           300
1) Hypotheses: H0 : σ1² = σ2² versus HA : σ1² ≠ σ2².
2) TS: F = S1²/S2² ∼ F24,15 ; the observed value is fTS = 95/300 ≈ 0.317.
3) RR: From the tables F0.1,24,15 = 1.90. By (3.15), F0.9,24,15 = 1/F0.1,15,24 = 1/1.78 = 0.56, so RR = {F ≥ 1.90 or F ≤ 0.56}.
4) Conclusion: fTS ≈ 0.317 < 0.56; hence, reject H0; there is evidence that the population variances are not equal.
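The observed statistic and the critical values can be checked in R:
f_ts <- 95 / 300                      # observed value, about 0.317
qf(0.90, df1 = 24, df2 = 15)          # upper critical value, about 1.90
qf(0.10, df1 = 24, df2 = 15)          # lower critical value, about 0.56
f_ts < qf(0.10, df1 = 24, df2 = 15)   # TRUE, so H0 is rejected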
Let X1,1 , . . . , X1,n be the first sample and X2,1 , . . . , X2,n be the second sample. The procedure to test the significance of the difference between two population means when the samples are dependent is as follows:
(i) Calculate for each pair of scores the difference Di = X1,i − X2,i , i = 1, 2, . . . , n, between the two scores.
(ii) Because D1 , . . . , Dn are i.i.d. random variables, if d1 , . . . , dn are the observed values of D1 , . . . , Dn , compute
d̄ = (1/n) Σ_{i=1}^{n} di ,
sd² = (1/(n − 1)) Σ_{i=1}^{n} (di − d̄)².
(iii) Now the testing proceeds as in the case of a single sample. Let µD = E[D] be the expected value of the difference. The hypotheses are H0 : µD = 0 against HA : µD ≠ 0.
Exercise 4.11. A new diet and exercise program has been advertised as a remarkable way to reduce blood glucose levels in diabetic patients. Ten randomly selected diabetic patients are put on the program, and the results after 1 month are given in the following table:
Before 268 225 252 192 307 228 246 298 231 185
After 106 186 223 110 203 101 211 176 194 203
Do the data provide sufficient evidence to support the claim that the new program reduces
blood glucose level in diabetic patients? Use α = 0.05
Solution. We find the differences:
Before 268 225 252 192 307 228 246 298 231 185
After 106 186 223 110 203 101 211 176 194 203
D -162 -39 -29 -82 -104 -127 -35 -122 -37 18
From the table, the mean of the differences is d̄ = −71.9 and the standard deviation is sd = 56.2.
1) Hypotheses: H0 : µD = 0 versus HA : µD < 0 (one-sided).
2) TS: T = (D̄ − 0)/(SD/√n) ∼ tn−1 ; we find tTS = −71.9/(56.2/√10) = −4.046.
3) RR: {T ≤ −t0.05,9 } = {T ≤ −1.833}.
4) Conclusion: tTS = −4.046 < −1.833, so we reject H0. The sample evidence suggests that the new diet and exercise program is effective.
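The paired test can be reproduced in R:
before <- c(268, 225, 252, 192, 307, 228, 246, 298, 231, 185)
after  <- c(106, 186, 223, 110, 203, 101, 211, 176, 194, 203)
t.test(after, before, paired = TRUE, alternative = "less")   # t about -4.05, H0 rejected at alpha = 0.05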
Chapter 5
Goodness of Fit
The idea behind the chi-square goodness-of-fit test is to check if given data come from a
particular probability distribution. We have outcomes X of an experiment that can produce
k different results. We say that X is a categorical random variable (with k categories). Let
pi be the probability of observing category i. We want to test whether the data come from a certain probability distribution:
Categories 1 2 ··· k
probabilities p1,0 p2,0 ··· pk,0
For example X could be a result of a coin toss X ∈ {head, tail} (k = 2) or a result of rolling a
die X ∈ {1, 2, 3, 4, 5, 6} (k = 6). The experiment is repeated n times and we count the number
of times that each category is attained.
The null hypothesis can be stated as
H0 : p1 = p1,0 , p2 = p2,0 , . . . , pk = pk,0 .
The alternative hypothesis is that at least one of these equalities doesn’t hold.
Let
• ni be the number of observations in the i-th category, n = n1 + · · · + nk . ni ’s are called
observed frequencies.
• pi be the probability of getting an observation in the i-th class
Then npi,0 is the expected count for the i-th category and is called expected frequency. Note
that the expected frequencies are calculated assuming the null hypothesis is true. The prob-
abilities pi,0 can either be given from the very beginning or estimated from the sample (in
which case the notation p̂i,0 would be more appropriate).
We use the statistic
G = Σ_{i=1}^{k} (ni − npi,0 )² / (npi,0 ) ∼ χ²k−1−l ,
which has approximately a χ² distribution with k − 1 − l degrees of freedom, where l is the number of parameters that were estimated while calculating the probabilities pi,0 .
This approximation works well, provided that the expected frequency of each class is at least
5. If the expected frequencies are smaller than 5 we may combine categories.
Note that small values of the test statistic suggest that the null hypothesis holds whereas
large values of g suggest that the alternative is true. Therefore, the rejection region is constructed using the right tail of the χ² distribution. The rejection region is RR = {G ≥ χ²α,k−1−l }.
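As a sketch in R, for the die example mentioned above (the counts in obs are hypothetical, from n = 120 imaginary rolls of a fair die):
obs <- c(18, 22, 17, 25, 20, 18)      # hypothetical observed frequencies
p0  <- rep(1/6, 6)                    # probabilities under H0
n   <- sum(obs)
g   <- sum((obs - n*p0)^2 / (n*p0))   # observed test statistic
g
qchisq(0.95, df = length(obs) - 1)    # critical value chi^2_{0.05,5}; here l = 0
chisq.test(obs, p = p0)               # the same test with the built-in function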
Exercise 5.1. Researchers in Germany concluded that the risk of heart attack on a Monday for a working person may be as much as 50% greater than on any other day.
The researchers kept track of heart attacks and coronary arrests over a period of 5 years among 330,000 people who lived near Augsburg, Germany. In an attempt to verify the researchers’ claim, 200 working people who had recently had heart attacks were surveyed.
Sunday Monday Tuesday Wednesday Thursday Friday Saturday
24 36 27 26 32 26 29
Do these data present sufficient evidence to indicate that there is a difference in the percentages
of heart attacks that occur on different days of the week? Test using α = 0.05.
Solution. The sample size is n = 200 and the observed counts ni are listed in the table. The hypotheses are
H0 : p1 = p2 = · · · = p7 = 1/7
versus
HA : The data do not follow the specified probability distribution.
To apply the test to a continuous random variable, we divide the range of the random variable into a finite number of bands. Specify the ranges as ri = (YiL , YiU ], i = 1, 2, . . . , k, where k is the number of classes and YiL and YiU are the lower and upper limits of class i, respectively.
Then the goodness-of-fit statistic is defined similarly as before:
G = Σ_{i=1}^{k} (ni − npi )² / (npi ),
where ni is the i-th observed outcome frequency (in class i) and pi is the i-th expected (theoretical) relative frequency. Note that pi can be found using the cumulative distribution function F0 appearing in H0 :
pi = F0 (YiU ) − F0 (YiL ).
Exercise 5.2. A sample of size 40 is grouped into classes with the following observed frequencies:
Class       (0, 32]  (32, 34]  (34, 36]  (36, 38]  (38, 40]  (40, 42]  (42, ∞)
Frequency      3        6         8         9         8         4        2
Test the hypothesis that the distribution is normal N (µ, 10), for α = 0.05.
Solution. As we need to test for N (µ, 10), where µ is unknown, then according to the theorem the value of µ for the null hypothesis can be found using the MLE for µ.
We can show that the MLE of µ for a population with a normal distribution is X̄; hence we take µ = x̄ = 36.765.
Using the statistical tables or software we can find the values of the cumulative distribution
function and the expected frequencies for each interval. The information is collected in the
table:
(YiL , YiU ]   (0, 34]   (34, 36]   (36, 38]   (38, 40]   (40, ∞)
ni                9          8          9          8          6
F0 (YiL )      0.0000     0.1910     0.4044     0.6519     0.8468
F0 (YiU )      0.1910     0.4044     0.6519     0.8468     1.0000
p̂i             0.1910     0.2135     0.2475     0.1949     0.1532
np̂i            7.64       8.5386     9.9003     7.7965     6.128
Note that we combined some classes in order to have npi ≥ 5. After this operation there are
k = 5 classes. We estimated one parameter, so l = 1 and thus there are k − 1 − l = 3 degrees
of freedom.
For the goodness-of-fit statistic
G = Σ_{i=1}^{k} (ni − np̂i )² / (np̂i ),
the observed value is g = 0.374; the corresponding critical value from the χ² distribution is χ²0.05,5−1−1 = 7.81. Therefore, we cannot reject the null hypothesis, and so it is likely that the sample was obtained from a population with a normal distribution.
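The expected frequencies and the observed statistic in this example can be reproduced in R (using the estimated mean 36.765 and variance 10 as above; small rounding differences are to be expected):
breaks <- c(0, 34, 36, 38, 40, Inf)     # class limits after combining classes
obs    <- c(9, 8, 9, 8, 6)              # observed frequencies
n      <- sum(obs)                      # 40
mu <- 36.765; sigma <- sqrt(10)
p  <- pnorm(breaks[-1], mu, sigma) - pnorm(breaks[-length(breaks)], mu, sigma)
n * p                                   # expected frequencies
sum((obs - n*p)^2 / (n*p))              # observed g, about 0.37
qchisq(0.95, df = length(obs) - 1 - 1)  # critical value, 7.81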
Exercise 5.3. The speeds of vehicles (in mph) passing through a section of Highway 75 are
recorded for a random sample of 150 vehicles and are given below. Test the hypothesis that
the speeds are normally distributed with a mean of 70 and a standard deviation of 4. Use
α = 0.01.
Range 40 − 55 56 − 65 66 − 75 76 − 85 > 85
Number 12 14 78 40 6
Exercise 5.4. Based on the sample data of 50 days contained in the following table, test the
hypothesis that the daily mean temperatures in the City of Tampa are normally distributed
with mean 77 and variance 6. Use α = 5%.
Temperature 46 − 55 56 − 65 66 − 75 76 − 85 86 − 95
Number of days 4 6 13 23 4
Appendix A: Summary of important
random variables and their
distributions
Appendix B: Matlab and R code
Figure 3.2 can be plotted in MATLAB:
n=1;
x=0:0.1:10;
y=chi2pdf(x,n);
plot(x,y)
or in R:
n<-1;
x<-seq(0,10,0.1);
y<-dchisq(x, df=n);
plot(x,y)
To plot pdf of t-distribution in Figure 3.3 in MATLAB:
n=5;
x=-10:0.1:10;
y=tpdf(x,n);
plot(x,y)
and in R:
n<-5;
x<-seq(-10,10,0.1);
y<-dt(x, df=n);
plot(x,y)
To find the quantile of the t-distribution corresponding to a given probability x, in MATLAB:
n=5;
x=0.025;
y= tinv(x,n)
R:
n<-5;
x<-0.025;
y<-qt(x, df=n)
To plot pdf of F (m, n)-distribution (Figure 3.4), MATLAB:
m=10;
n=5;
x=0:0.1:10;
y=fpdf(x,m,n);
plot(x,y)
R:
m<-10;
n<-5;
x<-seq(0,10,0.1);
y<-df(x, df1=m, df2=n);
plot(x,y)
To find the quantile of the F (m, n)-distribution corresponding to a given probability x, in MATLAB:
m=10;
n=5;
x=0.025;
y=finv(x,m,n)
in R:
m<-10;
n<-5;
x<-0.025;
y<-qf(x, df1=m, df2=n)
Bibliography