
Introduction to Statistics for Engineers with Python

Prepared by:
Dr. Gokhan Bingol (gbingol@hotmail.com)
December 13, 2024

Document version: 1.0


Engineering Documents: https://www.pebytes.com/pubs
Follow on GitHub: https://github.com/gbingol
1. Introduction
With the increasing digitalization of the process industry, engineers must be equipped not only with
core engineering principles but also with strong digital and computational skills (Proctor & Chiang,
2023)1. The growing popularity of machine learning (ML) and its broader field, artificial intelligence
(AI), has further highlighted the need for engineers to develop a solid foundation in descriptive and
inferential statistics, as well as in supervised and unsupervised modeling techniques (Pinheiro &
Patetta, 2021). Statistical tests, long-standing tools in data analysis, offer interpretable results and well-
defined hypothesis testing (Montgomery, 2012). They are particularly valuable for analyzing small
datasets and determining the significance of relationships (Box et al. 2005).

On the other hand, ML techniques excel at uncovering patterns in complex datasets with intricate
relationships that traditional statistics may overlook (Hastie et al., 2009). However, these methods often
require larger datasets and computational resources. Moreover, the "black-box" nature of certain ML
models, such as deep learning, can limit their interpretability, which is crucial for making informed
decisions in process engineering (Rudin, 2019)2.

Random variables—such as the lifespan of a pump, the time required to complete a task, or the
occurrence of natural phenomena like earthquakes—play a pivotal role in both everyday life and
engineering applications (Forbes et al., 2011). The probability distribution of a random variable
provides a mathematical description of how probabilities are assigned across its possible values. While
statistical literature describes a vast array of distributions (Wolfram MathWorld) 3, only a limited subset
is commonly used in engineering, as highlighted by Forbes et al. (2011) and Bury (1999).

Statistical tools and tests are indispensable in engineering analysis. Common parametric tests like
t-tests and ANOVA are widely used for comparing means and analyzing variance (Montgomery, 2012).
Non-parametric tests, such as the Kruskal-Wallis test or the sign test, are particularly useful when data
fail to meet the assumptions of normality (Gopal, 2006; Kreyszig et al., 2011). Regression analysis,
another critical tool, enables the investigation of relationships between variables (Montgomery et al.,
2021).

1 https://www.thechemicalengineer.com/features/data-science-and-digitalisation-for-chemical-engineers/
2 Rudin C (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206-215.
3 https://mathworld.wolfram.com/topics/StatisticalDistributions.html
The current work emphasizes the application of statistics in engineering, leveraging Python as the
computational tool of choice. Furthermore, it relies extensively on Python packages such as numpy and
scisuit4. The scisuit statistical library draws inspiration from R5, enabling readers to transfer the
knowledge gained here to R, a widely used software in the data science domain.

4 scisuit, at least v1.4.3.

5 https://www.r-project.org
2. Probability & Random Variables

2.1. Permutations / Combinations

2.1.1. Permutations
Any ordered sequence of k objects taken without replacement from a set of n objects is called a
permutation of size k of the objects (Devore et al., 2021). There are two cases:

A) Objects are Distinct: The set contains only distinct objects, such as A, B, C… Then the number of
permutations of length k, that can be formed from the set of n elements is:

{}_{n}P_{k} = n \cdot (n-1) \cdots (n-k+1) = \frac{n!}{(n-k)!}    (2.1)

The interpretation of Eq. (2.1) is fairly straightforward: Initially there are n objects in the set and once
one is taken out (since without replacement), n-1 objects are left and then the sequence continues in
similar fashion.

Example 2.1
How many permutations of length k=3 can be formed from the elements A, B, C and D (Adapted from
Larsen & Marx, 2011)?

Solution:

Mathematically the solution is: \frac{4!}{(4-3)!} = 24

Script 2.1
from itertools import permutations
print(list ( permutations(["A", "B", "C", "D"], 3) ))
This will printout 24 tuples, each representing a permutation.

B) Objects are NOT Distinct: The set contains n objects, n1 being of one kind, n2 of a second kind, … and nr
of the rth kind, then:

\frac{n!}{n_1! \cdot n_2! \cdots n_r!}    (2.2)

where n_1 + n_2 + ... + n_r = n

Example 2.2
A biscuit in a vending machine costs 85 cents. In how many ways can a customer put in 2 quarters, 3 dimes
and 1 nickel (Adapted from Larsen & Marx, 2011)?

Solution:

n1=2, n2=3 and n3=1 therefore

n = n1+n2+n3 = 2+3+1=6

Eq. (2.2) can now be used:

\frac{6!}{2!\,3!\,1!} = 60
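The count of 60 can also be verified by brute-force enumeration. The following is a minimal sketch (not part of the original text) that lists all 6! orderings with itertools.permutations and removes the duplicates caused by identical coins:

from itertools import permutations

# 2 quarters (Q), 3 dimes (D), 1 nickel (N)
coins = ["Q", "Q", "D", "D", "D", "N"]

# set() removes duplicate orderings arising from interchangeable identical coins
distinct_orderings = set(permutations(coins))
print(len(distinct_orderings))   # 60, matching Eq. (2.2)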

2.1.2. Combinations
The number of different combinations of n different things taken k at a time, without repetitions, is
computed by (Kreyszig et al., 2011):

\binom{n}{k} = \frac{n!}{k!(n-k)!}    (2.3)

and if repetitions are allowed:

\binom{n+k-1}{k}    (2.4)
Example 2.3
Given a set of elements A, B, C and D list the combinations of unique elements of size 2.

Solution:

There are 4 unique characters, therefore: \binom{4}{2} = \frac{4!}{2!(4-2)!} = 6

Script 2.2
from itertools import combinations
print (list( combinations(["A", "B", "C", "D"], 2) ))
('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')
Note that each tuple contains k=2 different “things” and none of the tuples contains exactly the same 2
things, i.e. there is no (‘A’, ‘A’). Please also note that unlike permutations, there is no ('B', 'A'), ('C', 'A')
since there is already (‘A’, ‘B’) and (‘A’, ‘C’), respectively.

Note that if repetitions were allowed Eq. (2.4) would be used:

\binom{4+2-1}{2} = \binom{5}{2} = \frac{5 \cdot 4}{2 \cdot 1} = 10

adding (AA, BB, CC, DD) to the above list. ■
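The case with repetition can be listed directly with itertools as well; a minimal sketch (not in the original text):

from itertools import combinations_with_replacement

print(list(combinations_with_replacement(["A", "B", "C", "D"], 2)))
# 10 tuples, i.e. the 6 tuples above plus ('A','A'), ('B','B'), ('C','C'), ('D','D')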

A summary of permutations and combinations for k elements from a set of n candidates is given by
Liben-Nowell (2022) as follows:

1) Order matters and repetition is allowed: n^k

2) Order matters and repetition is not allowed: \frac{n!}{(n-k)!}

3) Order does not matter and repetition is allowed: \binom{n+k-1}{k}

4) Order does not matter and repetition is not allowed: \binom{n}{k}


2.2. Random Variables

A random variable is a variable and can therefore assume different values; however, the value it takes depends on
the outcome of a chance experiment (Peck et al. 2016; Devore et al. 2021).

For example, when two dice are tossed, a sample space consisting of 36 ordered pairs,
S(i, j) = [(1,1), …, (1,6), …, (6,1), …, (6,6)], is obtained. In many cases the set of 36 ordered pairs itself is not of
interest; for example, in some games only the sum of the numbers matters, so we are only interested in the
eleven possible sums (2, 3, …, 11, 12). That is, if we were interested in the sum being 7, it would not matter
whether the outcome was (4, 3) or (6, 1). In this case, the random variable can be defined as
X(i, j) = i + j (Larsen & Marx, 2011).

There are two types of random variables:

1. Discrete: Takes values from either a finite set or a countably infinite set.

2. Continuous: Takes values from uncountably infinite number of outcomes, i.e. all numbers in a
single interval on the number line.

2.2.1. Discrete Random Variables


With each discrete random variable X a probability density function is associated:

p_X(k) = P(\{s \in S \mid X(s) = k\})    (2.5)

where,
• X = the random variable
• k = a specified number the random variable can assume
• P(X = k) = the probability that X equals k (Utts and Heckard, 2007).
For the dice example, let’s say we are interested in the sum of numbers being 2. Then the notation
would be P(X=2) = 1/36.
2.2.2. Continuous Random Variables
With each continuous random variable Y a probability density function is associated:

P(a \le Y \le b) = P(\{s \in S \mid a \le Y(s) \le b\}) = \int_a^b f_Y(y)\,dy    (2.6)

It is seen that unlike Eq. (2.5) which gives the probability at a particular value, Eq. (2.6) yields
probability at an interval [a, b].

2.2.3. Cumulative Distribution Function


Unlike the PDF, the cumulative distribution function (CDF) has the same form for discrete and continuous
random variables, that is:

F_W(w) = P(W \le w)    (2.7)

Example 2.4
A fair die is rolled 4 times. Let X denote the number of sixes that appear. Find PDF and CDF ( Adapted
from Larsen & Marx, 2011).

Solution:

X has a binomial distribution (see chapter 3.2) with n=4 and p=1/6. Therefore, the PDF:

p_X(k) = \binom{4}{k} \cdot \left(\frac{1}{6}\right)^k \cdot \left(\frac{5}{6}\right)^{4-k}, \quad k = 0, 1, 2, 3, 4

Let’s see how the probability of getting number of sixes changes with a simple plot:

Script 2.3
import scisuit.plot as plt
import scisuit.plot.gdi as gdi
from scisuit.stats import dbinom, pbinom

k = range(0, 5)
x = [*k]
y = [dbinom(x=i, size=4, prob=1/6) for i in x]
plt.scatter(x=x, y=y)
for i,v in enumerate(k):
    gdi.line(p1=(v, 0), p2=(x[i], y[i]))

plt.show(antialiasing=True)

As expected, the probability of getting k sixes decreases as k increases.

Fig 2.1: Probability of getting a given number of sixes

Note that Fig. (2.1) only shows probabilities for individual data points, i.e., for k=0, 1, 2, 3 and 4 sixes.
However, it does not immediately show the probability for P(X≤2). The cumulative distribution
function is:

F_X(x) = \begin{cases}
0 & x < 0 \\
\left(\frac{5}{6}\right)^4 = 0.482 & 0 \le x < 1 \\
\left(\frac{5}{6}\right)^4 + 4\left(\frac{1}{6}\right)\left(\frac{5}{6}\right)^3 = 0.868 & 1 \le x < 2 \\
\left(\frac{5}{6}\right)^4 + 4\left(\frac{1}{6}\right)\left(\frac{5}{6}\right)^3 + 6\left(\frac{1}{6}\right)^2\left(\frac{5}{6}\right)^2 = 0.984 & 2 \le x < 3 \\
\left(\frac{5}{6}\right)^4 + 4\left(\frac{1}{6}\right)\left(\frac{5}{6}\right)^3 + 6\left(\frac{1}{6}\right)^2\left(\frac{5}{6}\right)^2 + 4\left(\frac{1}{6}\right)^3\left(\frac{5}{6}\right) = 0.999 & 3 \le x < 4 \\
1 & 4 \le x
\end{cases}
With minor changes to Script (2.3):

Script 2.4
y = [pbinom(q=i, size=4, prob=1/6) for i in x]

plt.scatter(x=x, y=y)

for i,v in enumerate(k):
    gdi.line(p1=(v, y[i]), p2=(i+1, y[i]))
    gdi.marker(xy=(i+1, y[i]))

plt.show(antialiasing=True)

What is the probability of getting at most 2 sixes?

It is now easier to answer this question using the cumulative distribution plot: it is seen that
P(X≤2) ≈ 0.98.

Fig 2.2: Cumulative distribution

2.2.4. Empirical Distribution Function


The empirical distribution function for a random sample X1, X2, …, Xn from a distribution F is the
function defined by:

F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le x\}    (2.8)

where 1{xi≤x} is an indicator function that is equal to 1 if xi≤x and 0 otherwise.


Example 2.5
A random sample of n=8 people yields the following counts of the number of times they exercised in
the past 2 weeks:

0, 2, 1, 2, 7, 6, 4, 6

Calculate Fn(x) (adapted from Anon. 2024)6.

Solution:

The general equation for the given data is:

F_8(x) = \frac{1}{8} \sum_{i=1}^{8} \mathbf{1}\{x_i \le x\}

For example, for x = 2,

F_8(2) = \frac{1}{8} \sum_{i=1}^{8} \mathbf{1}\{x_i \le 2\} = \frac{1}{8}\,(1+1+1+1+0+0+0+0) = \frac{4}{8}

As demonstrated below for larger datasets, it is considerably more convenient to use Numpy:

Script 2.5
import numpy as np

x = np.array([0, 2, 1, 2, 7, 6, 4, 6])
f2 = np.sum(x<=2)/len(x)
print(f2)
0.5
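The same counting idea extends to every point of interest. Below is a minimal sketch (not in the original text) that evaluates Eq. (2.8) over a range of x values with numpy:

import numpy as np

x = np.array([0, 2, 1, 2, 7, 6, 4, 6])

# F_n(t) = (number of observations <= t) / n, evaluated on a grid of t values
for t in range(0, 8):
    Fn = np.sum(x <= t) / len(x)
    print(f"F_8({t}) = {Fn}")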

2.2.5. Moment-Generating Function


Let W be a random variable. Then the moment-generating function for W is denoted by M_W(t) and
expressed as:

M_W(t) = E(e^{tW}) = \begin{cases} \sum_{\text{all } k} e^{tk}\, p_W(k) & \text{if } W \text{ is discrete} \\ \int_{-\infty}^{\infty} e^{tw}\, f_W(w)\,dw & \text{if } W \text{ is continuous} \end{cases}    (2.9)

6 https://online.stat.psu.edu/stat415/lesson/empirical-distribution-functions
Theorem: Let W1, W2, …, Wn be independent random variables with mgfs M_{W_1}(t), M_{W_2}(t), …, M_{W_n}(t),
respectively. Let W = W1 + W2 + … + Wn. Then,

M_W(t) = M_{W_1}(t) \cdot M_{W_2}(t) \cdots M_{W_n}(t)    (2.10)

Example 2.6
Find the moment-generating function of a Bernoulli random variable:

X_i = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases}, \quad 0 < p < 1

Solution:

Note that Bernoulli random variable is a discrete random variable, therefore condensing Eq. (2.9) for
only discrete random variables yields:

M_X(t) = \sum_{\text{all } k} e^{tk}\, p_X(k)

One should notice the condition in the equation which states that the summation should be performed
over "all k". For a Bernoulli random variable there exist only two values of k, therefore:

M_X(t) = e^{t \cdot 0} \cdot p(X=0) + e^{t \cdot 1} \cdot p(X=1) = (1-p) + p \cdot e^t

Example 2.7
Find the MGF of a binomial random variable given by the following equation:

p_X(k) = P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}

Solution:

The binomial random variable is a discrete random variable, therefore M_X(t) is:

M_X(t) = E(e^{tX}) = \sum_{k=0}^{n} e^{tk} \binom{n}{k} p^k (1-p)^{n-k}

Rewriting the equation yields:

M_X(t) = \sum_{k=0}^{n} \binom{n}{k} (pe^t)^k (1-p)^{n-k}

Newton's binomial expansion formula:

(x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k \cdot y^{n-k}

Observe that the mgf and the binomial expansion are exactly the same if we replace x and y with x = pe^t and
y = 1-p. Therefore the moment-generating function is:

M_X(t) = \left(1 - p + pe^t\right)^n
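This closed form can be checked numerically: computing E(e^{tX}) term by term from the binomial PDF should match (1-p+pe^t)^n. A minimal sketch (the values of n, p and t below are arbitrary choices, not from the original text):

import math

n, p, t = 10, 0.3, 0.5   # arbitrary parameters for the check

# E(e^{tX}) summed directly over the binomial PDF
mgf_sum = sum(math.exp(t*k) * math.comb(n, k) * p**k * (1-p)**(n-k) for k in range(n+1))

# Closed form derived in Example 2.7
mgf_closed = (1 - p + p*math.exp(t))**n

print(mgf_sum, mgf_closed)   # the two values agree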

2.2.6. Expected Value


It is the most frequently used statistical measure to describe central tendency (Larsen & Marx, 2011).
Let X and Y be discrete and continuous random variables, respectively. The expected values of X and Y
are denoted by E(X) and E(Y), respectively, and given by the following equations:

E(X) = \mu = \sum_{\text{all } k} k \cdot p_X(k)    (2.11)

E(Y) = \mu = \int_{-\infty}^{\infty} y\, f_Y(y)\,dy    (2.12)

One notable property of expected value is that it is a linear operator and therefore,

E (aX +bY )=a⋅E( X)+b⋅E(Y ) (2.13)


Example 2.8
The table below shows the number of courses students registered for at a university with 15,000 students. Find
the average number of courses per student (Adapted from Carlton & Devore, 2014).

x            1     2     3      4      5      6      7
# Students   150   450   1950   3750   5850   2550   300

Solution #1:

The simplest approach: \bar{x} = \frac{1 \times 150 + 2 \times 450 + ... + 7 \times 300}{15000} = 4.57

Solution #2:

We define a random variable X as the number of courses a student has enrolled. The mean value
(weighted average) of a random variable is its expected value. Furthermore, since the random variable
is discrete, Eq. (2.11) will be applied. However, we first need to compute the probabilities:

p(x)   0.01 (=150/15000)   0.03   0.13   0.25   0.39   0.17   0.02 (=300/15000)

Eq. (2.11) can now be applied: x̄=1×0.01+2×0.03+...+7×0.02=4.57 ■
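The same weighted average can be computed directly in Python; a brief sketch (assuming numpy, not part of the original text):

import numpy as np

x = np.arange(1, 8)                                    # number of courses: 1..7
counts = np.array([150, 450, 1950, 3750, 5850, 2550, 300])

p = counts / counts.sum()     # probabilities p(x)
EX = np.sum(x * p)            # Eq. (2.11)
print(EX)                     # 4.57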

Although Eqs. (2.11 & 2.12) can be used to find the expected value of a random variable, it is not
always very convenient to do so.

If MW(t) is the moment-generating function (mgf) of the random variable W, then the following
relationship holds as long as the rth derivative of mgf exists:

M_W^{(r)}(0) = E(W^r)    (2.14)

Let’s prove for r=1.

d ∞ ty
M (1)
Y (0)= ∫ e f Y ( y )dy
dt −∞
Placing the derivative as the integrand, then equation can be rewritten as:


d ty
M Y (0)= ∫
(1)
e f Y ( y )dy
−∞ dt

Noting that only ety is a function of t and performing the derivation yields:

Y (0)= ∫ y e f Y ( y)dy
M (1) ty

−∞

Replacing t=0 gives:

Y (0)= ∫ y f Y ( y)dy=E (Y )
M (1)
−∞

Note that the last equation is exactly the same as Eq. (2.12), which is the expected value, E(Y).
Therefore, the first-derivative of mgf with respect to t=0 gives E(Y) and second-derivative E(Y2) and so
on…

Example 2.9
Find the expected value of the binomial random variable.

Solution:

The moment-generating function was already computed in Example (2.7) as:

M_X(t) = \left(1 - p + pe^t\right)^n

Taking the derivative with respect to t:

M_X^{(1)}(t) = n\left(1 - p + pe^t\right)^{n-1} \cdot pe^t

Setting t=0 yields the final answer:

M_X^{(1)}(t=0) = E(X) = np    ■
Example 2.10
The PDF of the maximum order statistic is given by:

f_{Y_{(n)}}(y) = n\left[F_Y(y)\right]^{n-1} f_Y(y)

Find the expected maximum for the uniform distribution on the interval [0, 1].

Solution:

Substituting the above equation into Eq. (2.12) yields:

E[Y_{(n)}] = \int_{-\infty}^{\infty} y \cdot n\left[F_Y(y)\right]^{n-1} f_Y(y)\,dy

The PDF of the uniform distribution on the interval [0, 1] is f_Y(y)=1, therefore the cumulative
distribution function is F_Y(y)=y. Substituting these knowns into the above equation and integrating over the
interval [0, 1] yields \frac{n}{n+1}.

Note that as n increases, the expected maximum approaches 1, which is what we would expect if we
draw a large number of samples from a uniform distribution. ■
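The result E[Y_(n)] = n/(n+1) can also be checked with a quick simulation; a sketch assuming numpy (not part of the original text):

import numpy as np

rng = np.random.default_rng()

for n in [2, 5, 10, 50]:
    # 10000 experiments, each taking the maximum of n uniform(0, 1) samples
    maxima = rng.uniform(0, 1, size=(10000, n)).max(axis=1)
    print(f"n={n}: simulated={maxima.mean():.4f}, theory={n/(n+1):.4f}")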

2.2.7. Variance
Although the expected value is an effective statistical measure of central tendency, it gives no
information about the spread of a probability density function. Although the spread could be
calculated using X-μ, it is immediately noted that negative deviations will cancel positive ones (Larsen
& Marx, 2011). The variance of a random variable is therefore defined as the expected value of its squared
deviations. In mathematical terms,

Var ( X )=E [( X −μ)2 ] (2.15)

Noting the following property of expected value for the random variable X and g(X) any function,

E[g(X)] = \sum_{\text{all } k} g(k) \cdot p_X(k)    (2.16)

If g(X) in Eq. (2.16) is replaced with (X-μ)2, then Eq. (2.15) can also be expressed as,
Var(X) = \sum_{\text{all } k} (k-\mu)^2 \cdot p_X(k)    (2.17)

If Y is a continuous random variable with PDF fY(y), then


Var(Y) = E[(Y-\mu)^2] = \int_{-\infty}^{\infty} (y-\mu)^2 \cdot f_Y(y)\,dy    (2.18)

Let W be any random variable, discrete or continuous, and a and b any two constants. Then,

Var(a \cdot W + b) = a^2 \cdot Var(W)    (2.19)

Let W1, W2, …., Wn be a set of independent random variables. Then,

Var (W 1 +W 2 +...+W n )=Var (W 1 )+Var (W 2 )+...+Var (W n ) (2.20)

Example 2.11
Test whether Eq. (2.15) represents population or sample variance.

Solution:

Let’s work on an arbitrarily chosen dataset: [4, 7, 6, 2, 7, 6].

Spreadsheet’s have two equations for computing sample and population variance, namely Var.S and
Var.P, respectively. Computation with Var.S and Var.P yielded 3.86667 and 3.22222, respectively. Let’s
investigate using Python libraries:

Script 2.6
import numpy as np
import statistics as stat

x = np.array([4, 7, 6, 2, 7, 6]) #arbitrary numbers

#returns sample variance
varS1 = stat.variance(x.tolist())
varS2 = np.var(x, ddof=1) #notice ddof=1

#Using Equation: Var(X) = E(X^2) - E(X)^2
EX, EX2 = np.mean(x), np.mean(x**2)
varEq = EX2 - EX**2
varP = np.var(x, ddof=0) #notice ddof=0

print(f"Sample: statistics pkg= {varS1} and Numpy={varS2}")
print(f"Population: Equation= {varEq} and Numpy={varP}")

Sample: statistics pkg= 3.8666 and Numpy=3.8666
Population: Equation= 3.2222 and Numpy=3.2222

Notice that the number of samples was intentionally kept low to see the difference between sample and
population variance since for large samples the difference becomes negligible.

Although Eqs. (2.17 & 2.18) can be used to find variances of discrete and continuous random variables,
respectively, using MGF (if known/available) to find the variance can be more convenient as
demonstrated in the following example.

Example 2.12
Find the variance of the binomial random variable.

Solution:

From Example (2.7) the moment-generating function:

M_X(t) = \left(1 - p + pe^t\right)^n

From Example (2.9) the expected value:

E(X) = np

From Eq. (2.14) we know that the second derivative of the mgf with respect to t gives E(X²), therefore:

M_X^{(2)}(t) = pe^t \cdot n(n-1)\left(1 - p + pe^t\right)^{n-2} pe^t + n\left(1 - p + pe^t\right)^{n-1} pe^t

Setting t=0 gives E(X²):

E(X^2) = n(n-1)p^2 + np

From Eq. (2.15), remembering that:

E[(X-\mu)^2] = E(X^2) - E(X)^2

Now all we have to do is substitute E(X²) and E(X), which yields:

Var(X) = n(n-1)p^2 + np - (np)^2

Tidying up the equation gives the final answer:

Var(X) = np(1-p)
3. Discrete Probability Distributions
All discrete probability distributions have the following properties:

1. For every possible value x, 0 ≤ p(x) ≤ 1.

2. \sum_{\text{all } x} p(x) = 1

The general characteristics of a discrete probability distribution can be visualized using a probability
histogram.

Script 3.1
import scisuit.plot as plt

plt.histogram([1, 2, 3, 4, 5, 3, 4, 2, 5, 4, 6])
plt.show()

Note that after the histogram has been plotted, the density option was selected and the number of bins was
adjusted to 5.

Fig 3.1: Density histogram for random data


3.1. Bernoulli Distribution

A Bernoulli trial can have one of the two outcomes, success or failure. The probability of success is p
and therefore the probability of failure is 1-p (Forbes et al., 2011). It is the simplest discrete
distribution; however, it serves as the building block for other complicated discrete distributions
(Weisstein 2023)7.

The PDF is:

X_i = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases}, \quad 0 < p < 1    (3.1)

MGF, Mean and Variance

M_X(t) = (1-p) + p \cdot e^t    (3.2)

E(X) = p    (3.3)

Var(X) = E(X^2) - E(X)^2 = p - p^2 = p(1-p)    (3.4)

7 Weisstein, Eric W. "Bernoulli Distribution." From https://mathworld.wolfram.com/BernoulliDistribution.html


3.2. Binomial Distribution

The outcome of the experiment is either a success or a failure. The term success is determined by the
random variable of interest (X). For example, if X counts the number of female births among the next n
births, then a female birth can be considered as a success (Peck et al., 2016).

We run n independent trials and define probability as p=P(success occurs) and assume p remains
constant from trial to trial (Larsen & Marx, 2011). However, since we are only interested in the total
number of successes, we therefore define X as the total number of successes in n trials. This definition
then leads to binomial distribution and is expressed as:

p_X(k) = P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}    (3.5)

Imagine 3 coins being tossed, each having a probability of p of coming up heads. Then the probability
of all heads (HHH) coming up is p3 and all tails (no heads, TTT) is (1-p)3 and HHT is 3p2(1-p).

Observe that in Eq. (3.5) the combination part gives the number of ways to arrange k heads and n-k
tails (section 2.1), therefore:

\binom{n}{k} = \frac{n!}{k!(n-k)!}    (3.6)

The remaining part of Eq. (3.5), p^k \cdot (1-p)^{n-k}, is the probability of any particular sequence having k heads
and n-k tails.
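For instance, the three-coin case above can be checked directly: the probability of exactly 2 heads in 3 tosses should equal 3p²(1-p). A minimal sketch using math.comb (the value of p is an arbitrary choice; dbinom from scisuit.stats would give the same number):

import math

p = 0.6   # arbitrary probability of heads

# Eq. (3.5) with n=3, k=2
prob_eq = math.comb(3, 2) * p**2 * (1-p)**(3-2)
print(prob_eq, 3*p**2*(1-p))   # both 0.432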

Example 3.1
An IT center uses 9 drives for storage. The probability that any of them is out of service is 0.06. For the
center at least 7 of the drives must function properly. What is the probability that the computing center
can get its work done (Adapted from Larsen & Marx, 2011)?
Solution #1:

\binom{9}{7} 0.94^7\, 0.06^2 + \binom{9}{8} 0.94^8\, 0.06^1 + \binom{9}{9} 0.94^9\, 0.06^0 = 0.986

sum( dbinom(x=[7, 8, 9], size=9, prob=0.94) )
0.986

Solution #2:

\binom{9}{7} 0.94^7\, 0.06^2 + \binom{9}{8} 0.94^8\, 0.06^1 + \binom{9}{9} 0.94^9\, 0.06^0 = 1 - \sum_{i=0}^{6} \binom{9}{i} 0.94^i\, 0.06^{9-i}

1 - pbinom(q=6, size=9, prob=0.94)
0.986

Example 3.2
Find the 10% quantile of a binomial distribution with 10 trials where the probability of success on each trial
is 0.4.

qbinom(p=0.10, size=10, prob=0.4)


2.0

The figure below shows the results of a simulation run by generating 100 random data points from
the binomial distribution.
Script 3.2
import scisuit.plot as plt
from scisuit.stats import rbinom

data = rbinom(n=100, size=10, prob=0.4)

plt.hist(data, cumulative=True)
plt.show()

Fig 3.2: Cumulative histogram of 100 random data points
It is seen that the 10% quantile is somewhere around 1.8, less than 2; however, when reporting, it is
rounded up8. The following two commands shine more light on this policy:
pbinom(q=1, size=5, prob=0.3)


0.52822
pbinom(q=2, size=5, prob=0.3)
0.83692
qbinom(p=[0.53, 0.80], size=5, prob=0.3)
[2, 2]
It can be seen that although p=0.53 is close to q=1 (0.52822) and p=0.80 is closer to q=2 (0.83692),
any number between p=0.52822 and p=0.83692 will be reported as q=2 by the qbinom function.

MGF, Mean and Variance

The derivations of MGF, E(X) and Var(X) were already presented in Examples (2.7), (2.9) and (2.12),
respectively. Although approaches presented in the examples work very well, one can also keep in
mind that each binomial trial is actually a Bernoulli trial, therefore the random variable W for binomial
distribution is a function of Bernoulli random variables: X1, X2, …, Xn, yielding W=X1+X2+...+Xn. Thus
Eq. (3.2) and Eq. (2.10) can be combined to derive Eq. (3.7). Remembering the linearity of the expected
value, a similar approach can be used for E(W) and Var(W) to obtain Eqs. (3.8 & 3.9).

M_X(t) = \left(1 - p + pe^t\right)^n    (3.7)

E ( X )=np (3.8)

Var ( X )=np(1− p) (3.9)

8 https://www.boost.org/doc/libs/1_40_0/libs/math/doc/sf_and_dist/html/math_toolkit/policy/pol_tutorial/understand_dis_quant.html
Let’s run a simple simulation to test Eq. (3.8):

Script 3.3
import numpy as np
from scisuit.stats import rbinom

for size in [5, 10]:
    x = rbinom(n=1000, size=size, prob=0.3)
    print(f"size={size}, mean= {np.mean(x)}")

size=5, mean= 1.489
size=10, mean= 2.983

We have intentionally run a large number of experiments (N=1000) for the simulation. Note that Eq.
(3.8) and the rbinom function match when n=size and p=prob. Therefore, for the first case E(X)=5×0.3=1.5,
which is close to 1.49.

To test Eq. (3.9), the following simulation can be run:

Script 3.4
import numpy as np
from scisuit.stats import rbinom

p, n = 0.3, 10
x = rbinom(n=5000, size=10, prob=0.3)

print(f"variance = {np.var(x, ddof=0)}")
print(f"equation = {n*p*(1-p)}")

variance = 2.108
equation = 2.099

Finally, let’s test our understanding in the meaning of randomness of Binomial distribution. First let’s
generate 10 random numbers from a Binomial distribution.

rbinom(n=10, size=5, prob=0.5)


[1, 2, 2, 1, 2, 3, 2, 2, 3, 2]

What do the numbers returned by the function mean?

In an analogy, we flip 5 coins (size=5) and count the number of heads (prob=0.5), which we consider a
success. We run this experiment 10 times (n=10). In the first experiment we got 1 head, in the
second 2 heads, and so on.
3.3. Hypergeometric Distribution

Suppose that an urn contains r good chips and w defective chips (total number of chips N = r + w). If n
chips are drawn out at random without replacement, and X denotes the total number of good chips
selected, then X has a hypergeometric distribution and,

P(X=k) = \frac{\binom{r}{k} \cdot \binom{w}{n-k}}{\binom{N}{n}}    (3.10)
Notes:
1. If the selected chip was returned to the population, that is, if the chips were drawn with
replacement, then X would have a binomial distribution (see Example 3.3).
2. Since we are interested in the total number of good chips, it does not matter whether the order is
r1r2r3… or r2r1r3…. Therefore \frac{r!}{(r-k)!} was divided by k!, and we used \binom{r}{k} = \frac{r!}{k! \cdot (r-k)!}.

Example 3.3
An urn has 100 items, 70 good and 30 defective. A sample of 7 items is drawn. What is the probability
that it has 3 good and 4 defective items? (adapted from Tesler 2017)9

Solution #1: Sampling with replacement

P(X=3) = \binom{7}{3} \cdot 0.7^3 \cdot (1-0.7)^4 = 0.0972        (dbinom(x=3, size=7, prob=0.7))

Solution #2: Sampling without replacement

P(3 \text{ good and } 4 \text{ bad}) = \frac{\binom{70}{3} \cdot \binom{30}{4}}{\binom{100}{7}} = 0.0937        (dhyper(x=3, m=70, n=30, k=7))

9 https://mathweb.ucsd.edu/~gptesler/186/slides/186_hypergeom_17-handout.pdf
MGF, Mean and Variance

The MGF, mean and variance of the hypergeometric distribution are presented by Walck (2007) and the
derivation of the expected value is given by Hogg et al. (2019).

M_X(t) = \frac{\binom{w}{n}}{\binom{N}{n}} \cdot {}_2F_1(-n, -r;\ w-n+1;\ e^t)    (3.11)

where {}_2F_1 is the hypergeometric function.

Let p = \frac{r}{N} and q = 1-p, then

E(X) = np    (3.12)

Var(X) = npq\,\frac{N-n}{N-1}    (3.13)

Let’s demonstrate Eq. (3.12) with a simple code:

Script 3.5
import numpy as np
from scisuit.stats import rhyper

x = rhyper(nn=1000, m=70, n=30, k=7)
avg = np.mean(x)

print(f"mean = {avg}")
mean = 4.931

Explicitly expressing Eq. (3.12):

E(X) = np = n \cdot \frac{r}{N}

Transforming the above equation to the notation used by the rhyper function:

E(X) = k \cdot \frac{m}{m+n} = 7 \cdot \frac{70}{70+30} = 4.9
3.4. Geometric Distribution

It is similar to binomial distribution such that trials have two possible outcomes: success or failure.
However, unlike binomial distribution where we were interested in the total number of successes, now
we are only interested in the trial where first success occurs. Therefore, if k trials were carried out, k-1
trials end up in failures and the kth one occurs with success. Thus we define the random variable X as
the trial at which the first success occurs (Larsen & Marx, 2011).

In more explicit terms, we have thus far said that: “first k-1 trials end up in failure” and “kth trial ends in
success”. Mathematically expressing,

P(X=k) = P(first success on kth trial)

= P(first k-1 ends in failure) · P(kth trial ends in success)

which then leads to the following equation:

P(X=k) = (1-p)^{k-1} \cdot p    (3.14)

MGF, Mean and Variance

M_X(t) = \frac{pe^t}{1-(1-p)e^t}    (3.15)

E(X) = \frac{1}{p}    (3.16)

Var(X) = E(X^2) - E(X)^2 = \frac{1-p}{p^2}    (3.17)
Example 3.4
A political pollster randomly selects persons on the street until he encounters someone who voted for
the Fun-Party. What is the probability that he encounters 3 people who did not vote for the Fun-Party before
he encounters one who did? It is known that 20% of the population voted for the Fun-Party (adapted
from Foley10, 2019).

Solution:

The probability of success (voted for the Fun-Party) is: p = \frac{20}{100} = 0.2

Since 3 people have not voted for the Fun-Party (failures) and the next one voted, 4 trials were carried out.

P(X=4) = (1-0.2)^3 \cdot 0.2^1 = 0.1024

Using Python code:

dgeom(x=3, prob=0.2)
0.1024
Note that, in the definition of the function dgeom x is the number of failures, therefore, instead of x=4,
x=3 was used.

10 https://rpubs.com/mpfoley73/458721
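Eq. (3.16) can also be checked by simulation. A sketch assuming numpy, whose geometric sampler counts the trial at which the first success occurs (support starting at 1), matching the definition of X used here:

import numpy as np

rng = np.random.default_rng()
p = 0.2

x = rng.geometric(p, size=10000)   # trial numbers of the first success
print(np.mean(x), 1/p)             # simulated mean vs E(X) = 1/p = 5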
3.5. Negative Binomial Distribution

In section (3.4) the geometric distribution was introduced where we defined the random variable X as
the trial at which the first success occurs. Therefore the trials were discontinued as soon as a success
occurred. Now instead of first success, we are interested in rth success. Similar to geometric distribution
each trial has a probability p of ending in success.

Therefore, we might have a sequence of {S, F, F, S, S, S} if we were interested in the r=4th success out
of k=6 trials. Putting it in more mathematical terms,

3 successes before the 4th success: r-1

2 failures before the 4th success in k-1=5 trials: (k-1) – (r-1) = k-r

Now if we define the random variable X as the trial at which the rth success occurs, then all the
background work to obtain the probability density function has been done.

Before proceeding with the final pdf, also note that before the rth success occurs, k-1 trials might have
various different sequences having r-1 successes, such as {SFFSS} or {FSFSS} or so on… Note that
this is indeed very similar to the idea presented in section (3.2) by Eq. (3.6). Therefore,

I) Before the rth success occurs (k-1 trials), the number of different sequences with r-1 successes:

\binom{k-1}{r-1}

II) (r-1 successes in the first k-1 trials) and (success on the kth trial):

p^{r-1} (1-p)^{k-1-(r-1)} \cdot p

Putting the equations in (I) and (II) together gives the pdf for negative binomial distribution:

p_X(k) = \binom{k-1}{r-1} p^r \cdot (1-p)^{k-r}    (3.18)
Example 3.5
A process engineer wishes to recruit 4 interns to aid in carrying out lab tests for the development of a
new technology. Let p= P(randomly chosen CV is a fit). If p is 0.2, what is the probability that exactly
15 CVs must be examined before 4 interns can be recruited (Adapted from Carlton & Devore, 2014)?

Solution:

The pdf for the negative-binomial distribution is:

p_X(k) = \binom{k-1}{r-1} p^r \cdot (1-p)^{k-r}

where k=15, r=4 and p=0.2.

Substituting k and r in the equation:

P(X=15) = \binom{15-1}{4-1} 0.2^4 \cdot (1-0.2)^{15-4} = 0.050

Using Python:
dnbinom(x=15-4, size=4, prob=0.2)
0.050
Note that in dnbinom function the argument x represents the number of failures (k-r).

Now, let’s ask ourselves a simple question? Does the probability increase or decrease if the number of
CVs to be examined increase or decrease?

for k in [4, 5, 10, 15, 20, 25, 50]:


print(dnbinom(x=k-4, size=4, prob=0.2))
0.0016 0.005 0.035 0.05 0.043 0.029 0.001
It is seen that the probability rises to a maximum at around k=15 and then decreases. In this context, this
means that it is very unlikely to find 4 suitable candidates by examining only 5 CVs, and it is not really
necessary to examine more than 25 CVs. Finally note that,

dbinom(x=4, size=4, prob=0.2)


0.0016 #0.2**4
3.5.1. Relationship to Geometric Distribution
Let G and B be random variables for geometric and negative-binomial distributions. The definitions of
random variables are then as follows:

G: Trial at which the first success occurs

B: Trial at which the rth success occurs

It is clearly seen that if r=1 then B=G, it can therefore be said that the negative-binomial distribution
generalizes the geometric distribution.

Larsen & Marx (2011) expresses the relationship between negative-binomial and geometric
distributions in the following way which is easier to derive a mathematical relationship between the
random variables:

X = total number of trials to achieve rth success

= number of trials to achieve 1st success +

number of additional trials to achieve 2nd +

…+

number of additional trials to achieve rth success.

X = X 1 + X 2 +…+ X r (3.19)

where X1, X2,…, Xr are random variables for geometric distributions.

It should be observed that the trials until the 1st success occurs overlap with the definition of a geometric
random variable. However, after the 1st success we are interested in the additional trials (note the
word additional) to observe the 2nd success, and therefore the trials between the 1st and 2nd successes again fit
the definition of a geometric random variable. Continuing in this fashion, the rationale for Eq.
(3.19) is justified.
3.5.2. MGF, Mean and Variance

M_X(t) = \left[\frac{pe^t}{1-(1-p)e^t}\right]^r    (3.20)

E(X) = \frac{r}{p}    (3.21)

Var(X) = \frac{r(1-p)}{p^2}    (3.22)

Although the above-given equations can be derived directly from the PDF of the negative-binomial
distribution, Eq. (3.19) paves the way to combine Eqs. (2.10 & 3.15) to derive the MGF in a very
straightforward fashion. Also, by using Eqs. (3.16 & 3.17) the expected value and variance can be derived
conveniently as shown below:

1) M_X(t) = M_{X_1}(t) \cdot M_{X_2}(t) \cdots M_{X_r}(t) = \left[\frac{pe^t}{1-(1-p)e^t}\right]^r

2) E(X) = E(X_1) + E(X_2) + ... + E(X_r) = 1/p + 1/p + ... + 1/p = r/p

3) Var(X) = Var(X_1) + Var(X_2) + ... + Var(X_r) = \frac{1-p}{p^2} + \frac{1-p}{p^2} + ... + \frac{1-p}{p^2} = \frac{r(1-p)}{p^2}
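A quick numerical check of Eq. (3.21) can be made by simulation. Note that numpy's negative_binomial sampler returns the number of failures before the r-th success, so adding r converts it to the total number of trials X used here; a sketch (not from the original text):

import numpy as np

rng = np.random.default_rng()
r, p = 4, 0.2

failures = rng.negative_binomial(r, p, size=10000)
x = failures + r                 # total trials until the r-th success
print(np.mean(x), r/p)           # simulated mean vs E(X) = r/p = 20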
3.6. Poisson Distribution

The Poisson distribution is a consequence of the Poisson limit, which is an approximation to the binomial
distribution when n→∞ and p→0.

3.6.1. Poisson Limit


The Poisson limit states that, if n→∞ and p→0 such that λ=np remains constant, then for k≥0, the
following relationship holds (Larsen & Marx, 2011):

\lim_{n \to \infty} \binom{n}{k} p^k (1-p)^{n-k} = \frac{e^{-np}(np)^k}{k!}    (3.23)

A proof of Eq. (3.23) is presented in various textbooks (Devore et al., 2021; Larsen & Marx, 2011).

Let’s inspect the accuracy of Eq. (3.23) using Python code. There are two tests where each has different
probabilities (p); however for both tests λ=np remains constant as 1.

Script 3.6
import numpy as np
from scisuit.stats import dbinom, dpois

n, kmax = 5, 5        #Test #1 (for Test #2 use: n, kmax = 100, 10)

p = 1/n               #probability
x = list(range(kmax+1))

binom = dbinom(x=x, size=n, prob=p)
pois = dpois(x=x, mu=n*p) #lambda=1

D = np.abs(np.array(binom)-np.array(pois)) #difference

print(f"min:{min(D)} at k={np.argmin(D)}")
print(f"max:{max(D)} at k={np.argmax(D)}")

Test #1: min:0.0027 at k=5 & max:0.0417 at k=1,
Test #2: min:3.13e-08 at k=10, max:0.0018 at k=1
It is clearly seen that in both tests the Poisson limit approximates binomial probabilities fairly well.
However, as evidenced from Test #2 where n was larger and p was smaller, the agreement between
Poisson limit and binomial probabilities became remarkably good for all k.
Example 3.6
When data is transmitted over a data link, there is a possibility of errors being introduced. Bit error rate
is defined as the rate (errors/total number of bits) at which errors occur in a transmission system 11.
Assume you have a 4 MBit modem with bit error probability 10^{-8}. What is the probability of exactly 3
bit errors in the next minute (adapted from Devore et al. 2021)?

Solution:

In a minute, 4 \cdot 10^6 \frac{bits}{s} \times 60\,s = 240 \cdot 10^6 bits will be transferred, and the probability of error is 10^{-8}. The errors
can occur in any sequence and we are interested in the total number of errors, which by definition is the
binomial probability:

P(3) = \binom{240 \cdot 10^6}{3} (10^{-8})^3 (1-10^{-8})^{240 \cdot 10^6 - 3}

Since n is very large (240,000,000) and p is very small (10^{-8}), the above computation is an excellent
candidate for the Poisson limit: \lambda = np = 2.4 \cdot 10^8 \times 10^{-8} = 2.4

#Binomial probability
dbinom(x=3, size=240000000, prob=1E-8)
0.2090142
#Poisson limit
dpois(x=3, mu=2.4)
0.2090142
If we pose the question, “what is the probability at most 3 bit errors in the next minute?”, then the
solution is:

P(X \le 3) = \sum_{k=0}^{3} \binom{240 \cdot 10^6}{k} (10^{-8})^k (1-10^{-8})^{240 \cdot 10^6 - k}

#Poisson limit
ppois(q=3, mu=2.4)
0.7787229

11 https://www.electronics-notes.com/articles/radio/bit-error-rate-ber/what-is-ber-definition-tutorial.php
3.6.2. Poisson Distribution
The random variable X is said to have a Poisson distribution if,

P_X(k) = \frac{e^{-\lambda} \lambda^k}{k!}    (3.24)

where λ>0.

Example 3.7
7 cards are drawn (with replacement) from a deck containing numbers from 1 to 10. Success is considered
to be when a 5 is drawn. Can the produced data be described by the Poisson distribution?

Solution: Simulation will be run using the following script:

Script 3.7
import numpy as np
from scisuit.stats import rbinom, dpois

#size=7 cards, prob=1/10
XX = rbinom(n=10000, size=7, prob=0.1)

#Get unique elements (e.g. [0, 1, 2, 3, 4, 5]) and their frequencies
unique, Frequencies = np.unique(XX, return_counts=True)

total = float(np.sum(Frequencies))

#frequencies / total gives the weighted average
aver = sum(Frequencies*unique)/total

probabilities = Frequencies/total
poisson = [dpois(x=float(i), mu=aver) for i in unique]

print(probabilities)
print(poisson)
[0.4781 0.3733 0.1253 0.0209 0.0021 0.0003]
[0.4983, 0.34708, 0.12087, 0.02806, 0.0048, 0.0007]
It is seen from the output that the probabilities can be well described by Poisson distribution. It should
be noted that when the probability value in the simulation was increased to 0.5, the difference between
actual and predicted probabilities increased.
3.6.3. MGF, Mean and Variance

M_X(t) = e^{\lambda \cdot (e^t - 1)}    (3.25)

E( X)= λ (3.26)

Var ( X )=λ (3.27)

Derivation of Eq. (3.25) can be found in mathematical statistic textbooks (Devore et al. 2021; Wackerly
et al. 2008).

3.6.4. Poisson Process


It is a widely used counting process (the number of accidents in an area; the outbreaks of diseases; …)
and is mostly used in situations where we only know the rate of occurrence of an event but the events
occur completely at random; for example, using historic data we may know that earthquakes occur in a
certain area at a rate of 3 per year. Note that we only know the rate of earthquakes and do not have
any information on the timings of the earthquakes, as they occur completely at random (Anon12. 2023). If
an event satisfies the above-mentioned conditions, we can assume that a Poisson process might be a good
candidate to model such an event.

12 https://www.probabilitycourse.com/chapter11/11_1_2_basic_concepts_of_the_poisson_process.php
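As a rough illustration (not from the original text), a Poisson process with rate λ = 3 events per year can be simulated by accumulating exponential inter-arrival times and counting the events falling in each year; a sketch assuming numpy:

import numpy as np

rng = np.random.default_rng()
lam = 3.0                                    # events per year

# exponential inter-arrival times; cumulative sums give the event times
arrivals = np.cumsum(rng.exponential(scale=1/lam, size=200))

# number of events in each of the first 20 years
counts = np.histogram(arrivals, bins=np.arange(0, 21))[0]
print(counts.mean())                         # close to lambda = 3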
3.7. Multinomial Distribution

The multinomial distribution is a generalization of the binomial distribution (Forbes et al., 2011;
Larsen & Marx, 2011). Let Xi show the number of times the random variable Y equals yi, i=1,2,…,k in a
series of n independent trials where pi=P(Y=yi). Then,

P(X_1 = x_1, X_2 = x_2, ..., X_k = x_k) = \frac{n!}{x_1! \cdot x_2! \cdots x_k!}\, p_1^{x_1} \cdot p_2^{x_2} \cdots p_k^{x_k}    (3.28)

where x_i = 0, 1, …, n and \sum_{i=1}^{k} x_i = n.

Notes:

1. The rationale for the \frac{n!}{x_1! \cdot x_2! \cdots x_k!} part is directly related to Eq. (2.2) in section (2.1).
2. Thinking along the lines of probability events:
Trial #1: Event 1 (E1) with probability p1 → n independent trials, x1 successes
Trial #2: Event 2 (E2) with probability p2 → n independent trials, x2 successes
Trial #k: Event k (Ek) with probability pk → n independent trials, xk successes
Since the trials are independent, P(E_1 \cap E_2 \cap ... \cap E_k) = p_1^{x_1} \cdot p_2^{x_2} \cdots p_k^{x_k}

Example 3.8
A die is tampered with such that the probability of each of its faces appearing is pi = P(face i appears) = ki,
where k is a constant. If the die is tossed 12 times, what is the probability that each face will appear
exactly twice? Compute the probability for the case of a normal die (Adapted from Larsen & Marx, 2011).

Solution:

Since a die has 6 faces and the sum of probabilities must be equal to 1.0, it is straightforward to
compute the constant k: \sum_{i=1}^{6} k \cdot i = k \cdot \sum_{i=1}^{6} i = k \cdot \frac{6 \times 7}{2} = 1 \ \rightarrow \ k = \frac{1}{21}

Since the question asks that each face appear exactly twice, all that is left is to apply Eq. (3.28):
P(X_1 = 2, ..., X_6 = 2) = \frac{12!}{2!\,2!\,2!\,2!\,2!\,2!} \cdot \left(\frac{1}{21}\right)^2 \left(\frac{2}{21}\right)^2 \cdots \left(\frac{6}{21}\right)^2 = 0.0005

With a normal die, each face would have probability of 1/6 and therefore:

P(X_1 = 2, ..., X_6 = 2) = \frac{12!}{2!\,2!\,2!\,2!\,2!\,2!} \cdot \left(\frac{1}{6}\right)^2 \left(\frac{1}{6}\right)^2 \cdots \left(\frac{1}{6}\right)^2 = \frac{12!}{2!^6} \cdot \left(\frac{1}{6}\right)^{12} = 0.0034

Script 3.8
from scisuit.stats import dmultinom

#Tampered die
probs = [1/21*i for i in range(1,7)]
x = [2]*6

p = dmultinom(x=x, size=12, prob=probs)
print(f"probability (tampered)={p}")

#Normal die
probs = [1/6]*6

p = dmultinom(x=x, size=12, prob=probs)
print(f"probability (normal)={p}")

probability (tampered) = 0.00052
probability (normal) = 0.0034

3.7.1. Binomial/Multinomial Relationship


At the beginning of this section (3.7) it was already mentioned that the multinomial distribution is a
generalization of the binomial distribution. The binomial distribution is characterized by two outcomes:
success or failure, where the probability of success is p. In the language of the multinomial distribution this
corresponds to two events: p1=p and p2=1-p. Furthermore, if there are n trials, x1=k will end up with
success and x2=n-k with failure. Therefore, replacing p1, p2 and x1, x2 in Eq. (3.28) yields:

P(X_1 = k, X_2 = n-k) = \frac{n!}{k!(n-k)!}\, p^k \cdot (1-p)^{n-k}

Noting that \frac{n!}{k!(n-k)!} = \binom{n}{k}, one can see that the above equation is exactly the same as Eq. (3.5).
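This equivalence can be illustrated numerically: calling dmultinom with two categories should match dbinom. A brief sketch using the scisuit functions already introduced (the parameter values are arbitrary; treat this as illustrative):

from scisuit.stats import dbinom, dmultinom

n, k, p = 10, 4, 0.3   # arbitrary values for the check

print(dbinom(x=k, size=n, prob=p))
print(dmultinom(x=[k, n-k], size=n, prob=[p, 1-p]))
# both calls should report the same probability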
3.7.2. MGF, Mean and Variance
The moment-generating function, mean and variance of multinomial distribution is given in various
textbooks (Forbes et al., 2011; Larsen & Marx, 2011).

M_X(t) = \left(\sum_{i=1}^{k} p_i e^{t_i}\right)^n    (3.29)

E(X_i) = np_i    (3.30)

Var(X_i) = np_i(1-p_i)    (3.31)

A proof of Eq. (3.29) is given by Taboga13 (2024).

Let’s simulate Eq. (3.30):

Script 3.9
import numpy as np
from scisuit.stats import rmultinom

n=10

#testing probabilities
p = np.array([0.05, 0.15, 0.30, 0.50 ])

#2D array
arr = np.array(rmultinom(n=1000, size=n, prob=p))

#4 means, each is mean of 1000 random numbers with probabilities 0.05, 0.15 ...
means = np.mean(arr, axis=1)

#expected value (n*p[i])
E_X = n*p

print(f"Difference = {means - E_X}")

Difference = [0.003 0.013 0.064 0.042]
It is seen that the difference for each probability is less than 0.1, therefore for a reasonably large
number of random samples Eq. (3.30) predicts the mean adequately well.

13 https://www.statlect.com/probability-distributions/multinomial-distribution
3.8. Summary

Bernoulli
Description: One of two outcomes, success (p) or failure (1-p).
Equation: X_i = \begin{cases} 1 & p \\ 0 & 1-p \end{cases}, \quad 0 < p < 1

Binomial
Description: Two possible outcomes: success (p) or failure (1-p). In n independent trials where p remains constant, we are only interested in the total number of successes (k).
Equation: p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}

Hypergeometric
Description: n chips are drawn out at random without replacement, and X denotes the total number of good chips (N = r + w).
Equation: P(X=k) = \frac{\binom{r}{k} \cdot \binom{w}{n-k}}{\binom{N}{n}}

Geometric
Description: Two possible outcomes: success or failure. However, unlike the binomial distribution, in k trials we are only interested in the trial at which the first success occurs.
Equation: P(X=k) = (1-p)^{k-1} \cdot p

Negative Binomial
Description: In the geometric distribution we defined the random variable X as the trial at which the first success occurs. Now, instead of the first success, we are interested in the rth success. Therefore, it generalizes the geometric distribution.
Equation: p_X(k) = \binom{k-1}{r-1} p^r \cdot (1-p)^{k-r}

Poisson
Description: We only know the rate of occurrence of an event, but the events occur completely at random. The Poisson limit states that, if n→∞ and p→0 such that λ=np remains constant, then for k≥0,
Equation: P_X(k) = \frac{e^{-\lambda} \lambda^k}{k!}

Multinomial
Description: It is a generalization of the binomial distribution. In the language of the multinomial distribution, the binomial corresponds to two events: p1=p and p2=1-p. Furthermore, if there are n trials, x1=k will end up with success and x2=n-k with failure.
Equation: P(X_1 = x_1, ..., X_k = x_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}
4. Continuous Probability Distributions
Continuous probability distributions have the following properties:

1. f(x) \ge 0,

2. \int_{-\infty}^{\infty} f(x)\,dx = 1

Continuous probability distributions can be visualized by a curve called a density curve. The function
that defines this curve is called the density function.

Script 4.1
from numpy import linspace
from scisuit.plot import scatter, plot, show
from scisuit.stats import dnorm

x = linspace(start=-3, stop=3, num=100)
y = dnorm(x)

scatter(x=x, y=y, marker="c", markersize=3)
plot(x=x, y=y)

show()

Fig 4.1: Density curve of standard normal distribution

Using the rationale of the above-given script, the probability density curve for other distributions
can be obtained.
4.1. Uniform Distribution

If you generate random numbers between 0 and 1 using a computer, you will get observations from a
uniform distribution, since there will be almost the same amount of numbers in each equally spaced sub-
interval, e.g. 0-0.2 or 0.2-0.4. Let's run a simulation:

Script 4.2
import random
import numpy as np

x = np.array([random.random() for i in range(1000)])

start, dx = 0.0, 0.2

while start < 1.0:
    L = len( np.where( np.logical_and(x>=start, x<(start+dx)) )[0] )
    print(f"({start}, {start+dx}): {L}")
    start += dx

Output14 is: (0.0, 0.2): 218, (0.2, 0.4): 202, (0.4, 0.6): 192, (0.6, 0.8): 209, (0.8, 1.0): 179

Although the number of samples drawn was relatively small, it is seen that each sub-interval in the range
[0, 1] contains a similar amount of numbers. Instead of 1000 samples, if the simulation were run with
10,000,000 samples, the difference between the amount of numbers in each sub-interval would have been
negligible.

A random variable Y has a continuous uniform probability distribution on the interval (a, b) if the PDF
is defined as follows:

f(y) = \begin{cases} \frac{1}{b-a} & a \le y \le b \\ 0 & \text{elsewhere} \end{cases}    (4.1)

The uniform distribution is very important for theoretical studies (Wackerly et al., 2008). For example,
if F(y) is a distribution function, it is often possible to transform the uniform distribution to F(y); for
instance, it can be transformed to the standard normal distribution using the Box-Muller transform15.

14 It should be reminded that in random sampling each run will produce different results.
15 https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform
MGF, Mean and Variance

For t≠0:

M_Y(t) = \int_{-\infty}^{a} 0 \cdot e^{ty}\,dy + \int_{a}^{b} \frac{e^{ty}}{b-a}\,dy + \int_{b}^{\infty} 0 \cdot e^{ty}\,dy = \frac{e^{tb}-e^{ta}}{t(b-a)}

and for t=0:

M_Y(t) = \int_{-\infty}^{\infty} e^{ty} \cdot f_Y(y)\,dy = \int_{a}^{b} 1 \cdot \frac{1}{b-a}\,dy = 1

Therefore the moment-generating function is:

M_Y(t) = \begin{cases} \frac{e^{tb}-e^{ta}}{t(b-a)} & t \ne 0 \\ 1 & t = 0 \end{cases}    (4.2)

E(Y) = \frac{a+b}{2}    (4.3)

Var(Y) = \frac{(b-a)^2}{12}    (4.4)

It should be noted that the derivation (presented by Wolfram16) of Eq. (4.3) from Eq. (4.2) might pose
challenges for many. Instead, it is recommended to use Eq. (2.12), as then the derivation becomes
considerably more convenient.

16 https://mathworld.wolfram.com/UniformDistribution.html
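Eqs. (4.3 & 4.4) can be verified quickly by simulation; a sketch assuming numpy, with a = 2 and b = 5 chosen arbitrarily (not from the original text):

import numpy as np

rng = np.random.default_rng()
a, b = 2.0, 5.0   # arbitrary interval

y = rng.uniform(a, b, size=100000)
print(np.mean(y), (a+b)/2)          # ~3.5
print(np.var(y), (b-a)**2/12)       # ~0.75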
Example 4.1
As evidenced above, a random number generator will spread its output uniformly across the entire
interval from 0 to 1. What is the probability that the numbers will be between 0.3 and 0.7?

Solution

This is a rather straightforward question and the answer is P(0.3≤X≤0.7)=0.4. Let’s demonstrate it with
a short script:

Script 4.3
from random import random
from numpy import array, logical_and, where

#[10, 100, 1000, ...]
arr = array( [10**i for i in range(1, 6)] )

#helper function to create 1D list with j random numbers
func = lambda j: [random() for _ in range(j)]

# x[0] has 10, x[1] has 100 elements
x = list( map(func, arr) )

L = []
for lst in x:
    cond = logical_and(array(lst)>=0.3, array(lst)<0.7)
    length = len( where( cond )[0] )
    L.append(length)

print(array(L)/arr)

[0.3 0.41 0.396 0.4022 0.39924]
As evidenced from the above output, as the number of samples in the array (arr) increased from 10 to
10^5, the simulated probability approached the computed probability.
4.2. Normal Distribution

Normal distributions are bell-shaped and symmetric curves. They are widely used and are the single
most important probability model in all of statistics since:

1. They provide a reasonable approximation to the distribution of many different variables,

2. They play a central role in many of the inferential procedures (Larsen & Marx, 2011; Peck et
al., 2016).

In section (3.6) it was shown that the Poisson limit approximated binomial probabilities when n→∞
and p→0. Historically, this was not the only approximation [the interested reader can find a historical
evolution of the normal distribution in the paper by Stahl (2006)]. Abraham DeMoivre showed that
when X is a binomial random variable and n is large, the probability P\left(a \le \frac{X-np}{\sqrt{np(1-p)}} \le b\right) can be
estimated using the following equation:

f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad -\infty < z < \infty    (4.5)

The formal statement of the approximation is known as the DeMoivre-Laplace limit theorem (Larsen &
Marx, 2011):

\lim_{n \to \infty} P\left(a \le \frac{X-np}{\sqrt{np(1-p)}} \le b\right) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-z^2/2}\,dz    (4.6)

Eq. (4.5) is referred to as the standard normal curve where μ=0 and σ=1. If μ≠0 and σ≠1 then the
equation is expressed as follows:

f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x < \infty    (4.7)
In order to show DeMoivre’s idea, let’s write a fairly short Python script. rbinom function was used to
sample 1000 experiments where each experiment consists of 60 trials with a probability of success of
0.4 (adapted from Larsen & Marx, 2011).
Script 4.4
import math
import numpy as np
import scisuit.plot as plt
from scisuit.stats import rbinom

n, p = 60, 0.4

#Generate random numbers from a binomial distribution
x = np.array(rbinom(n=1000, size=n, prob=p))

#z-ratio
z = (x - n*p)/math.sqrt(n*p*(1-p))

#DeMoivre's equation
f = 1.0/math.sqrt(2*math.pi)*np.exp(-z**2/2.0)

#Density scaled histogram
plt.hist(z, density=True, breaks=5)

#Overlay scatter plot
plt.scatter(x=z, y=f)
plt.show()

It is seen that the curve generated by DeMoivre's approximation equation describes the variation of the
histogram generated by the binomial data fairly well.

Fig 4.2: Density scaled histogram and scatter plot (x-axis: z-ratio, y-axis: density)
4.2.1. MGF, Mean and Variance

M_Y(t) = e^{\mu t + \sigma^2 t^2/2}    (4.8)

E(Y) = \mu    (4.9)

Var(Y) = \sigma^2    (4.10)

4.2.2. Sampling Variability


When we would like to estimate the mean value of a population, we would take samples of size n from
the population and try to make inferences based on the sample. It is natural that the average value of
samples will change from sample to sample. This is known as sampling variability.

In order to simulate this we will generate a sample space of size 250 from an exponential distribution.
Then we will draw samples of size 5, 10, 20 and 30 (250 times each) from the sample space and compute the
average of each sample. This will reveal how the choice of sample size affects the sampling distribution.
We will run the following script:

Script 4.5
import numpy as np
import scisuit.plot as plt
import scisuit.stats as st

N = 250

#Generate random numbers from an exponential distribution
SS = np.array(st.rexp(n=N))

plt.layout(3, 2)

#Density scaled histogram of exponential distribution
plt.subplot(0,0)
plt.hist(SS)

n = [5, 10, 20, 30]
colors=["#FF0000", "#FFA500", "#00FF00", "#964B00"]
r, c = 1, 0
for i, v in enumerate(n):
    #take samples of size n[i] from the sample space (SS)
    x = [np.mean(np.random.choice(SS, size=v, replace=False)) for _ in range(N)]

    plt.subplot(r, c)
    plt.hist(x, fc=colors[i])
    plt.title(f"{chr(65+i)}) n={v}")

    c += 1
    if c%2 == 0:
        r += 1; c = 0

plt.show()

Fig 4.3: Frequency histogram of sample space and different sample sizes
The following inferences can be made from Fig. (4.3):

1. Although the histogram of the sample space (variable SS) does not look normal in shape, each
of the four histograms (A-D) resembles a normal shape,

2. Each of the histograms (A-D) has an average value close to the sample space's average value.
Generally, x̄ based on a larger sample size is closer to the mean value of the population.

3. The smaller the sample size, the more the sampling distribution spreads out (compare
the limits of the x-axis for A and D, where the sample sizes were 5 and 30, respectively).

4.2.3. Central Limit Theorem


When n is sufficiently large (n≥30), the sampling distribution of x̄ is well approximated by a normal
curve (Peck et al., 2016). Formally expressing, let W1, W2, … be an infinite sequence of independent
random variables, each with the same distribution. Then,

\lim_{n \to \infty} P\left(a \le \frac{W_1 + ... + W_n - n\mu}{\sqrt{n}\,\sigma} \le b\right) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-z^2/2}\,dz    (4.11)

E\left[\frac{1}{n}(W_1 + ... + W_n)\right] = E(\bar{W}) = \mu    (4.12)

Var\left[\frac{1}{n}(W_1 + ... + W_n)\right] = \frac{\sigma^2}{n}    (4.13)

The implication of Eq. (4.13) could be observed from Fig. (4.3), where increasing the sample size
decreased the variability of the distribution. In order to show how Eq. (4.11) works, we will generate
an array with 5 columns and 250 rows from a standard uniform distribution. Then, the sum of the 5
columns will be computed to generate another array (250 rows). Since for a standard uniform
distribution μ=0.5 and σ²=1/12, the z-ratio will be computed using \frac{y - 5/2}{\sqrt{5/12}}.
√ 5 /12
Script 4.6
from math import sqrt, pi
from numpy import array, exp, sum

import scisuit.plot as plt
from scisuit.stats import runif

n = 5

#For uniform distribution
mu, sigma = 0.5, sqrt(1/12)

#generate a list of random numbers from uniform distribution
G = lambda _: runif(n=250)

#2D list (5 rows and 250 columns) Python list
L = list(map(G, [None]*n))

#2D Numpy array (250*5)
W = array(L).transpose()

#W1+W2+...
x = sum(W, axis=1) #len=250

#z-ratio
z = (x - n*mu)/(sqrt(n)*sigma)

#DeMoivre's equation
f = 1.0 / sqrt(2*pi)*exp(-z**2/2.0)

#Density scaled histogram
plt.hist(z, density=True)

#Overlay scatter plot
plt.scatter(x=z, y=f)

plt.show()
It is seen that even though the number of samples summed was small (n=5), the sums yielded a
distribution closely resembling a normal distribution.

Larsen & Marx (2011) state that samples from symmetric distributions will produce sums that quickly
converge to the theoretical limit (normal distribution). However, if samples come from a skewed
distribution, then a larger n is needed (see the section on sampling variability).

Fig 4.4: Density scaled histogram and scatter plot (x-axis: z-ratio, y-axis: density)

4.2.4. The 68-95-99.7 Rule

For a normal distribution with mean μ and standard deviation σ:

1. Approximately 68% of the observations fall within σ of the mean μ.
2. Approximately 95% of the observations fall within 2σ of μ.
3. Approximately 99.7% of the observations fall within 3σ of μ.

Script 4.7
import numpy as np
from scisuit.stats import rnorm

N = 10000 #number of samples

#sample from standard normal distribution
x = np.array(rnorm(n=N))

for sigma in [1, 2, 3]:
    L = len(np.where(np.logical_and(x>=-sigma, x<=sigma))[0])
    print(f"{sigma} sigma= {L/N*100}%")

1 sigma= 68.61%, 2 sigma= 95.42%, 3 sigma= 99.75%
Note that rnorm(n=,) function samples from standard normal distribution where μ=0 and σ=1.
Example 4.2
A producer claims that bottles contain μ=12 deciliters of soda with σ=0.16 deciliters. To verify this
claim as a quality control engineer you have randomly selected 16 bottles and measured the volume in
each bottle. What is the probability that the average value of 16 bottles is in between 11.96 and 12.08
deciliters (adapted from Peck et al., 2016)?

Solution:

It is reasonable to assume that the samples come from a normal distribution. The standard deviation of the
sample mean:

\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.16}{\sqrt{16}} = 0.04

Approach #1

Standardizing the given limits:

z_1 = \frac{11.96-12}{0.04} = -1.0 \qquad z_2 = \frac{12.08-12}{0.04} = 2.0

The probability that the sample average will be between 11.96 and 12.08 is:

P(z_1 \le Z \le z_2) = P(-1.0 \le Z \le 2.0) = 0.8185

Since the limits have been standardized we can use standard normal distribution to compute
probabilities:

pnorm(q=2) - pnorm(q=-1)
0.8186

Approach #2

If not using the standard normal distribution then mean and standard deviation must be specified.

pnorm(q=12.08, mean=12, sd=0.04) - pnorm(q=11.96, mean=12, sd=0.04)


0.8186
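The answer can also be approximated by simulation: draw many samples of 16 bottles from a normal distribution with μ = 12 and σ = 0.16, and count how often the sample mean lands between 11.96 and 12.08. A sketch assuming numpy (not part of the original text):

import numpy as np

rng = np.random.default_rng()
mu, sigma, n = 12.0, 0.16, 16

# 100000 simulated samples of 16 bottles each
means = rng.normal(mu, sigma, size=(100000, n)).mean(axis=1)
frac = np.mean((means >= 11.96) & (means <= 12.08))
print(frac)   # close to 0.8186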
4.3. Exponential Distribution

In section (3.6.4) it was mentioned that situations where we only know the rate of occurrence (λ) of
an event and where the events occur completely at random might be good candidates to be modeled by a
Poisson model. However, situations might arise where the time interval between consecutively
occurring events is an important random variable. The exponential distribution has many applications:

• The time to decay of a radioactive atom,


• The time to failure of components with constant failure rates,
• In the theory of waiting lines or queues (for example, time taken for an ambulance to arrive at
the scene of an accident) (Forbes et al., 2011).
Suppose a series of events satisfying the Poisson process are occurring at a rate of λ per unit time. Let
random variable Y denote the interval between consecutive events. Then,

f_Y(y) = λ·e^(−λy),  y > 0    (4.14)

MGF, Mean and Variance

M_Y(t) = λ/(λ − t)    (4.15)

E(Y) = 1/λ    (4.16)

Var(Y) = 1/λ²    (4.17)

Example 4.3
During the period of 1832 to 1950, the following data was collected for the eruptions of a volcano:

126 73 3 6 37 23 73 23 2 65 94 51
26 21 6 68 16 20 6 18 6 41 40 18
41 11 12 38 77 61 26 3 38 50 91 12
Can the data be described by an exponential distribution model? (Adapted from Larsen & Marx, 2011)
Solution:
In order to test whether exponential distribution is an adequate choice, first a density histogram of the
data needs to be plotted. Then a scatter plot in the domain of the data using Eq. (4.14) will be overlaid.
The following script handles both tasks:

Script 4.8
from numpy import array, linspace, average
import scisuit.plot as plt
from scisuit.stats import dexp

Data = array([126, 73, 3, 6, 37, 23, 73, 23, 2, 65, 94, 51, 26, 21, 6, 68, 16, 20, 6,
18, 6, 41, 40, 18, 41, 11, 12, 38, 77, 61, 26, 3, 38, 50, 91, 12])

xvals = range(0, 160, 5)

plt.hist(Data, density=True, fc = "0 255 0", ec="255 0 0", label="data")


plt.scatter(x=xvals, y=dexp(x=xvals, rate=1/average(Data)), label="exp pdf")
plt.show()
The script will produce the following figure:

It is seen that the shape of the histogram is consistent with the theoretical model (exponential distribution).

Fig 4.5: Histogram of raw data and exponential distribution describing the data
4.4. Gamma Distribution

In section (4.3), it was mentioned that if a series of events satisfying the Poisson process are occurring
at a rate of λ per unit time and the random variable Y denote the interval between consecutive events it
could be modeled with exponential distribution. Here the random variable Y can also be interpreted as
the waiting time for the first occurrence.

This is similar to the geometric distribution (section 3.4), where we were only interested in the trial at which the first success occurs. In section (3.5), for the negative-binomial distribution, instead of the first success we were interested in the rth success. Therefore, it was mentioned that the negative-binomial distribution generalizes the geometric distribution.

In a similar fashion, the gamma distribution generalizes the exponential distribution such that we are now interested in the occurrence of (waiting time for) the rth event. However, before we proceed with the probability density function of the gamma distribution we need to define the gamma function.

4.4.1. Gamma Function


It is a commonly used extension of the factorial function and is defined as:

Γ(z) = ∫₀^∞ t^(z−1)·e^(−t) dt    (4.18)

With minor calculus, one can quickly see that Γ(1)=1. Using integration by parts17, it is seen that Γ(z+1) = z·Γ(z). Using induction one can further see that Γ(n) = (n−1)!.

Script 4.9
from math import gamma, factorial

for i in [1, 2, 3, 4]:


print(f"T({i})={gamma(i)}, ({i}-1)!={factorial(i-1)}")
T(1)=1.0, (1-1)!=1
T(2)=1.0, (2-1)!=1
T(3)=2.0, (3-1)!=2
T(4)=6.0, (4-1)!=6

17 https://en.wikipedia.org/wiki/Gamma_function
4.4.2. Probability Density Function
Suppose that Poisson events are occurring at a constant rate of λ. Let the random variable Y denote the waiting time for the rth event. Then,

f_Y(y) = λ^r·y^(r−1)·e^(−λy) / (r−1)!,  y > 0    (4.19)

A proof of Eq. (4.19) can be found in mathematical statistics textbooks (Larsen & Marx, 2011). Eq.
(4.19) is often expressed in the following form (Devore et al., 2021; Miller & Miller, 2014; R-
Documentation18) :

f(x; α, β) = 1/(β^α·Γ(α)) · x^(α−1)·e^(−x/β),  x > 0    (4.20)

Software such as R and the scisuit Python package call α the shape and β the scale parameter. Note that in Eq. (4.19) r=α and λ=1/β.
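As a quick sanity check (a sketch, not from the original text), Eq. (4.19) with r=α and λ=1/β can be evaluated directly and compared with dgamma:

from math import exp, factorial
from scisuit.stats import dgamma

r, lam = 3, 0.5   #α=3, β=1/λ=2
ys = [0.5, 1.0, 2.5, 5.0]

#Eq. (4.19) evaluated directly
manual = [lam**r * y**(r-1) * exp(-lam*y) / factorial(r-1) for y in ys]

#Eq. (4.20) via dgamma with shape=α=r and scale=β=1/λ
lib = dgamma(x=ys, shape=r, scale=1/lam)

for y, m, v in zip(ys, manual, lib):
    print(y, round(m, 6), round(v, 6))

Both columns should agree for every y, confirming that the two parameterizations describe the same density.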

When β=1 the distribution is called the standard gamma distribution.

Devore et al. (2021) states that the parameter β is called a scale parameter because values other than 1
either stretch or compress the pdf in the x-direction. Let’s visualize this using a constant shape factor,
shape=2:

Script 4.10
from numpy import linspace
from scisuit.plot import scatter, show, legend
from scisuit.stats import dgamma

x=linspace(0, 7, num=100)

for beta in [0.5, 1, 2, 4]:


scatter(x, dgamma(x=x, shape=2, scale=beta), label=str(beta), lw=3, ls="-")

legend()
show()
The following figure will be generated:

18 https://search.r-project.org/CRAN/refmans/ExtDist/html/Gamma.html
It is seen that for β=1, the max value is around 0.35.

For a smaller β value, the curve is “compressed” and therefore became narrower and the max value increased to ~0.7.

For a larger β value the curve is “stretched” and therefore became wider and the max value decreased to ~0.2 for β=2.

Fig 4.6: Gamma density curves for different scale (β) values
(α=2)

With minor editing if the same script is run for different values of α=[0.6, 1, 2, 4], where β=1, then the
following figure will be obtained:

It is seen that:
1) when α≤1, the curve is strictly
decreasing as x increases.
2) when α> 1, f(x; α) rises to a
maximum and then decreases as x
increases.

Fig 4.7: Standard gamma (β=1) density curves for different shapes (α)
4.4.3. MGF, Mean and Variance

M_Y(t) = 1/(1 − βt)^α    (4.21)

E(Y) = α·β    (4.22)

Var(Y) = α·β²    (4.23)
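Eqs. (4.22) & (4.23) can be verified with a short simulation. The sketch below uses NumPy's gamma sampler (np.random.gamma) rather than scisuit, purely for the check:

import numpy as np

alpha, beta = 2.0, 3.0
sample = np.random.gamma(shape=alpha, scale=beta, size=100_000)

print(f"sample mean = {sample.mean():.3f}, theory = {alpha*beta:.3f}")        #E(Y)=α·β
print(f"sample variance = {sample.var():.3f}, theory = {alpha*beta**2:.3f}")  #Var(Y)=α·β²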

Example 4.4
As a process engineer you are given the task of designing a system to pump fluid from a reservoir to the processing plant. As this is important for manufacturing to continue smoothly, you have included two pumps: one active and one as a backup to be brought on line.

The manufacturer of the pump specifies that the pump is expected to fail once every 100 hours. What are the chances that the whole manufacturing process will fail to remain functioning for 50 hours? (Adapted from Larsen & Marx, 2011)

Solution:

For the whole manufacturing process to be interrupted, both pumps should fail, for example the first after 10 hours and the second after 40 hours… Failure rate: λ = 0.01 failures/hour

Approach #1:

We are going to use Eq. (4.19) where λ= 0.01 and r=2.


P(manufacturing fails to last for 50 hours) = ∫₀^50 [0.01²·y^(2−1)·e^(−0.01y) / (2−1)!] dy = 0.09

Approach 2: We are going to use Eq. (4.20) where β= 100 and α=2.

pgamma(q=50, shape=2, scale=100)


0.09
Assume that a 9% probability is too high for you. Another manufacturer claims that the pump they are offering is expected to fail once every 200 hours, but the price is double, therefore your costs will double. To lower the 9% probability, would you use 3 pumps where each is expected to fail once every 100 hours, or 2 pumps where each is expected to fail once every 200 hours?

We will use a short script to generate the probability density curves and inspect the pdf’s.

Script 4.11
from numpy import linspace
from scisuit.plot import plot, legend, show
from scisuit.stats import dgamma

x = linspace(0, 500, num=100)

plot(x, dgamma(x=x, shape=2, scale=100), label="2 pumps, 100h")


plot(x, dgamma(x=x, shape=3, scale=100), label="3 pumps, 100h")
plot(x, dgamma(x=x, shape=2, scale=200), label="2 pumps, 200h")

legend()
show()

It is seen that using 3 pumps, where each pump is expected to fail once every 100 hours, gives a lower probability than using 2 pumps where each pump is expected to fail once every 200 hours. Therefore, there is no need to double the cost.

However, also note that if instead of 50 hours we require 80 hours or more, then using 2 pumps where each pump is expected to fail once every 200 hours becomes a more reasonable approach in terms of lowering the probability.

Fig 4.8: Gamma curves for different number of pumps and failure rates.
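As a quick numeric check of the comparison above at the 50-hour mark, the cumulative probabilities can be computed directly with pgamma (a sketch; the approximate values in the comments follow from Eq. 4.20):

from scisuit.stats import pgamma

#P(system fails to last 50 hours)
p3_100 = pgamma(q=50, shape=3, scale=100)   #3 pumps, each failing once every 100 h (≈0.014)
p2_200 = pgamma(q=50, shape=2, scale=200)   #2 pumps, each failing once every 200 h (≈0.027)

print(f"3 pumps @ 100 h: {p3_100}")
print(f"2 pumps @ 200 h: {p2_200}")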
4.5. Chi-Square Distribution

The chi-squared distribution is the distribution of the sum of the squares of a number of independent standard normal random variables, and this fact gives rise to important applications, e.g. the analysis of contingency tables (Forbes et al., 2011).

4.5.1. One-way Frequency Table


Categorical univariate data consists of non-numerical observations which may be placed in categories (Wikipedia19) and are most conveniently summarized in a one-way frequency table (Peck et al., 2016).

Suppose 100 people are surveyed on whether they will go to a certain movie, and the choices (categories) are: Definitely, Probably, Probably not, Definitely not. A table can now be formed by counting the observations:

Table 4.1: Results of the hypothetical survey

Definitely Probably Probably not Definitely not

Frequency 20 40 25 15

Let k be the number of categories of a categorical variable and p_i the population proportion for category i (i = 1, …, k). Then,

H0: each p_i equals the hypothesized proportion for category i

Ha: H0 is not true (at least one of the population category proportions differs from the corresponding hypothesized value).

Χ² = Σ_(all cells) (Observed cell count − Expected cell count)² / Expected cell count    (4.24)

where Χ2 has approximately a chi-square distribution with df= k-1.

19 https://en.wikipedia.org/wiki/Univariate_(statistics)
Example 4.5
An urban legend claims that more babies are born during certain phases of the lunar cycle, especially near the full moon. Data for a sample of randomly selected births occurring during 24 lunar cycles are given in the table. Test whether the data support the urban legend claim (Adapted from Peck et al., 2016).

Lunar Phase    Number of Days    Number of Births
Phase 1        24                7680
Phase 2        152               48442
Phase 3        24                7579
Phase 4        149               47814
Phase 5        24                7711
Phase 6        150               47595
Phase 7        24                7733
Phase 8        152               48230

Solution:

There are 699 total days and a total of 222,784 births. The probability of a birth to happen at Phase 1 is
24/699=0.0343 and at Phase 8 is 152/699=0.2175.

So if lunar phase did not have any effect, then we expect that at Phase 1 there would be
0.0343×222784=7649.23 births. We continue our computations in this fashion and then use Eq. (4.24)
to compute Χ2 value.

Script 4.12
import numpy as np
from scisuit.stats import pchisq

#Lunar periods
days = np.array([24, 152, 24, 149, 24, 150, 24, 152])

#Observed births at each lunar cycle
observed = np.array([7680, 48442, 7579, 47814, 7711, 47595, 7733, 48230])

#probabilities (ratios)
probs = days / np.sum(days)

#expected birth numbers
expected = np.sum(observed)*probs

chisq = (expected-observed)**2 / expected

#pchisq gives the left-tail probability
pval = 1 - pchisq(q=np.sum(chisq), df=len(chisq) - 1)

print(f"p-value: {round(pval, 3)}")


The output is: p-value: 0.504. Therefore, we fail to reject H0 (there is no evidence that the population category proportions differ from the corresponding hypothesized values). Thus, the claim is not supported by statistical evidence.

Forbes et al. (2011) state that to be able to use Eq. (4.24), the differences between the observed and expected values should be normally distributed. We will use a QQ plot to check whether this is the case.

Script 4.13
import scisuit.plot as plt
from scisuit.stats import test_norm_ad

diff = observed-expected
print(f"Anderson-Darling test: {test_norm_ad(x=diff)}")

plt.qqnorm(data=diff)
plt.show()

Although there is an outlier point, the rest of the data follows the QQ-line fairly well.

Moreover, the p-value reported by the Anderson-Darling test is 0.448, therefore we cannot reject H0 that “the data follows a normal distribution”.

Fig 4.9: QQ plot of the differences between observed and expected birth rates.
4.5.2. Probability Density Function
A random variable Y is said to have a chi-square distribution with n degrees of freedom (n>0), if

f_Y(y) = 1/(2^(n/2)·Γ(n/2)) · y^((n/2)−1)·e^(−y/2),  y > 0    (4.25)

Please note that Eq. (4.25) is a special case of Eq. (4.19) where r=n/2 and λ=1/2. Substituting these values in Eq. (4.19) and tidying up slightly yields:

f_Y(y) = 1/(2^(n/2)·(n/2 − 1)!) · y^((n/2)−1)·e^(−y/2)

Noticing that (n/2 − 1)! = Γ(n/2), one can see that the above equation is equal to Eq. (4.25).

The shape of the chi-square distribution depends on the value of the degrees of freedom (df):
df<3: decreases strictly as x increases,
df≥3: increases to a maximum and then decreases.

It should also be noted that regardless of df, all chi-square distributions are skewed to the right.

Fig 4.10: Chi-square distribution with different degrees of freedom

Theorem: Let Z1, Z2, …, Zn be n independent standard normal random variables. Then,

Σ_(i=1)^n Z_i²

has a chi-square distribution with n degrees of freedom. A proof of the theorem can be found in mathematical statistics textbooks (Larsen & Marx, 2011).
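The theorem can be illustrated with a short simulation (a sketch using NumPy's standard normal sampler): sums of n squared standard normals should have mean n and variance 2n, the values given by Eqs. (4.27) & (4.28) in the next section.

import numpy as np

n, N = 5, 100_000
Z = np.random.standard_normal(size=(N, n))

#sum of n squared standard normals for each of the N replicates
chi = (Z**2).sum(axis=1)

print(f"mean = {chi.mean():.3f} (theory {n})")
print(f"variance = {chi.var():.3f} (theory {2*n})")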
4.5.3. MGF, Mean and Variance

M_Y(t) = (1 − 2t)^(−n/2),  t < 1/2    (4.26)

E(Y) = n    (4.27)

Var(Y) = 2n    (4.28)

Script 4.14
import numpy as np
from scisuit.stats import rchisq

#number of samples
N = 1000

#arbitrary values for degrees of freedom
df = [1, 3, 5, 10]

#2D array of random values for each degrees of freedom
X = np.array([rchisq(N, x) for x in df])

#mean and variance
print(f"mean = {np.average(X, axis=1)}")
print(f"variance = {np.var(X, axis=1, ddof=0)}")
mean = [1.019 2.968 4.914 9.817]
variance = [2.131 5.696 9.669 19.553]
Notice how close the values are to the values that would be computed by Eqs. ( 4.27 & 4.28). For
example for df=1, E(Y)=1 and Var(Y)=2.
4.6. The Student’s t distribution

The t distribution is used to test whether the difference between the means of two samples of
observations is statistically significant assuming they were drawn from the same population (Forbes et
al., 2011).

In sections (4.2.2 & 4.2.3) it was shown that if y1, y2, …, yn is a random sample from a normal distribution with mean μ and standard deviation σ, then (Ȳ − μ)/(σ/√n) has a standard normal distribution (SND). However, Gosset (Student, 1908) realized that (Ȳ − μ)/(S/√n) does not have a SND and derived its probability density function.

Let’s see the differences between SND and t-distribution using the short Python code:

Script 4.15
from numpy import linspace
from scisuit.stats import dnorm, dt
from scisuit.plot import plot, scatter, show, legend

x = linspace(-4, 4, num=100)

plot(x, dnorm(x=x), label="normal")

for n in [2, 10]:
    scatter(x, dt(x=x, df=n), label=f"df={n}")

legend()
show()
1) Both dists are symmetric.

2) Both dists have a mean of 0.

3) t-dist is characterized by the degrees of freedom (df). As df increases, t-dist becomes more similar to a normal dist.

4) The curves of t-dist with larger df are taller and have thinner tails.

5) t-dist is most useful for small sample sizes.

Fig 4.11: Standard normal distribution and t-distribution with different degrees of freedom

In the comparison of the t-distribution with the SND it was mentioned that the t-distribution is most useful for small sample sizes, but we have not explained what is meant by small. Larsen & Marx (2011) state that many tables providing probability values for the t-distribution list degrees of freedom in the range [1, 30]. Furthermore, elsewhere20 it was mentioned that for a sample size of at least 30, the SND can be used instead of the t-distribution.

Let Z be a standard normal random variable and V an independent chi-square random variable with n
degrees of freedom. The Student t ratio with n degrees of freedom is,

T_n = Z/√(V/n)    (4.29)
In line with observations from Fig. (4.11), Eq. (4.29) is symmetric: f_Tn(t) = f_Tn(−t)

The PDF for a Student t random variable with n degrees of freedom is,

20 https://www.jmp.com/en_no/statistics-knowledge-portal/t-test/t-distribution.html
f_Tn(t) = Γ((n+1)/2) / [√(nπ)·Γ(n/2)·(1 + t²/n)^((n+1)/2)],  −∞ < t < ∞    (4.30)
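Eq. (4.30) can be checked against the dt function used in Script 4.15 (a small sketch using math.gamma):

from math import gamma, sqrt, pi
from scisuit.stats import dt

n = 5
ts = [-2.0, -0.5, 0.0, 1.0, 2.5]

#Eq. (4.30) evaluated directly
manual = [gamma((n+1)/2) / (sqrt(n*pi)*gamma(n/2)*(1 + t**2/n)**((n+1)/2)) for t in ts]

lib = dt(x=ts, df=n)
for t, m, v in zip(ts, manual, lib):
    print(t, round(m, 6), round(v, 6))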

MGF, Mean and Variance

The moment-generating function of t-distribution is undefined 21 and its mean is 0 as can be observed
from Fig. 4.11 for different degrees of freedom.

Var(Y) = n/(n − 2),  n > 2    (4.31)

Script 4.16
from scisuit.plot import scatter, show
from scisuit.stats import rt
from statistics import pvariance

dfs = range(3, 100, 2)

var = []
for df in dfs:
    var.append(pvariance( rt(n=5000, df=df) ))

scatter(x=list(dfs), y=var)
show()

It is seen that for df>2 the variance is always larger than 1, and for large df the variance is close to 1 (this can also be observed from the equation).

Devore et al. (2021) state that for small dfs the t-dist curve spreads out more than the standard normal dist curve; however, for large dfs the t-dist curve approaches the standard normal dist curve (μ=0, σ=1) (see above figure).

Fig 4.12: Variance of t-distribution with different df values

21 https://en.wikipedia.org/wiki/Student's_t-distribution
4.7. F (Fisher–Snedecor) Distribution

It is the ratio of independent chi-square random variables. Many experimental scientists use the
technique called analysis of variance (ANOVA) (Forbes et al., 2011). ANOVA analyzes the variability
in the data to see how much can be attributed to differences in the means and how much is due to
variability in the individual populations (Peck et al., 2016). In one-way ANOVA, F is the ratio of
variation among the samples to variation within the samples.

Suppose that U and V are independent chi-square random variables with m and n degrees of freedom,
respectively. Then,

F = (U/m)/(V/n)    (4.32)

The PDF for F distribution is:

f_F(r) = [Γ((m+n)/2) / (Γ(m/2)·Γ(n/2))] · m^(m/2)·n^(n/2) · r^((m/2)−1) / (n + mr)^((m+n)/2),  r > 0    (4.33)

The derivation of Eq. (4.33) is detailed in the textbook from Larsen & Marx (2011). Let’s use a fairly
short script to generate F-distribution curves for constant m (df1) and varying n (df2) and for constant
df2 and varying df1.

Script 4.17
from scisuit.stats import df
from numpy import linspace
import scisuit.plot as plt

x_axis=linspace(0.1, 6, num=500)
dfree = 10

plt.layout(2,1)
plt.subplot(0,0)
for x in [1, 2, 3, 5]:
plt.plot(x_axis, df(x_axis, df1=x, df2=dfree), label=f"df1={x}")
plt.title("df2=10")
plt.legend()
plt.subplot(1,0)
for x in [1, 2, 3, 5]:
plt.plot (x_axis, df(x_axis, df1=dfree, df2=x), label=f"df2={x}")
plt.title("df1=10")
plt.legend()

plt.show()

It is seen that when df2 is constant, the F dist curves look very much like typical chi-square dist curves.

When df1 is constant, all F dist curves rapidly rise to a maximum and then decrease in value as x increases.

In all cases, F values are never negative and the curves are sharply skewed to the right.

Fig 4.13: F-distribution curves for constant A) df2, B) df1

MGF, Mean and Variance

The moment-generating function of F distribution does not exist22.

E(Y) = n/(n − 2),  n > 2    (4.34)

Var(Y) = 2n²(m + n − 2) / [m(n − 2)²(n − 4)],  n > 4    (4.35)
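Eqs. (4.34) & (4.35) can be verified by building F ratios from chi-square samples, as in Eq. (4.32). The sketch below uses rchisq together with NumPy (the sample variance estimate is noisy, so expect only rough agreement):

import numpy as np
from scisuit.stats import rchisq

m, n, N = 5, 20, 50_000
U = np.array(rchisq(N, m))   #chi-square with m degrees of freedom
V = np.array(rchisq(N, n))   #chi-square with n degrees of freedom
F = (U/m) / (V/n)            #Eq. (4.32)

mean_theory = n/(n - 2)
var_theory = 2*n**2*(m + n - 2) / (m*(n - 2)**2*(n - 4))
print(f"mean = {F.mean():.3f} (theory {mean_theory:.3f})")
print(f"variance = {F.var():.3f} (theory {var_theory:.3f})")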

22 https://en.wikipedia.org/wiki/F-distribution
4.8. Weibull Distribution

Weibull distribution is commonly used as a lifetime distribution in reliability applications (Forbes et al., 2011). It is of great interest to statisticians and to practitioners because of its ability to fit data from various fields including the engineering sciences (Rinne, 2009).

A random variable X has Weibull distribution (α>0, β>0), if the PDF is defined as follows:

f(x; α, β) = (α/β)·(x/β)^(α−1)·e^(−(x/β)^α)  for x ≥ 0;  f(x; α, β) = 0  for x < 0    (4.36)

where α and β are the shape and scale parameters, respectively. According to Rinne (2009), Eq. (4.36)
is the most often used two-parameter (third parameter, the location was assumed to be 0) Weibull
distribution.

4.8.1. Effect of parameters


When α=1, the Eq. (4.36) reduces to the exponential distribution (Eq. 4.14). Therefore, exponential
distribution is a special case of both the gamma and Weibull distributions.

Replacing α with 1 in Eq. (4.36) gives:

f(x; α=1, β) = (1/β)·e^(−x/β)

If λ = 1/β then

f(x; β) = λ·e^(−λx)

which is exactly the same as Eq. (4.14).
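This equivalence is easy to verify numerically with the dweibull and dexp functions (a small sketch):

from scisuit.stats import dweibull, dexp

beta = 2.0
xs = [0.5, 1.0, 2.0, 4.0]

w = dweibull(x=xs, shape=1, scale=beta)   #Weibull with α=1
e = dexp(x=xs, rate=1/beta)               #exponential with λ=1/β

for x, wv, ev in zip(xs, w, e):
    print(x, round(wv, 6), round(ev, 6))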

The shape parameter (α) can be interpreted in the following way too:

• 0<α<1 →the failure rate decreases over time (waiting time between two subsequent stock
exchange transactions of the same stock),
• α=1 → the failure rate is constant over time (radioactive decay of unstable atoms),
• α>1 → the failure rate increases over time (wind speeds, distribution of the size of droplets)
A “bathtub” diagram and the α-values for the above-mentioned examples are presented by Kızılersü et al.23 (2018).

Now, let’s remember that in section (4.4.2), it was mentioned that Gamma distribution has shape and
scale parameters, which is similar to Weibull distribution. Let’s investigate the similarities and
differences:

Script 4.18
from numpy import linspace
from scisuit.stats import dgamma, dweibull
import scisuit.plot as plt

x=linspace(0, 7, num=1000)

plt.layout(nrows=2, ncols=1)

plt.subplot(0,0)
for beta in [0.5, 1, 2, 4]:
plt.plot(x, dgamma(x=x, shape=2, scale=beta), label=str(beta))
plt.title("Gamma")
plt.legend()

plt.subplot(1,0)
for beta in [0.5, 1, 2, 4]:
plt.plot(x, dweibull(x=x, shape=2, scale=beta), label=str(beta))
plt.title("Weibull")
plt.legend()

plt.show()

23 Kızılersü A, Kreer M, Thomas AW. The Weibull Distribution. Significance, April 2018.
Similarities:

1) For a smaller β value, the curve is “compressed” and therefore became narrower.

2) For a larger β value the curve is “stretched” and therefore became wider.

Differences:

1) Weibull is compressed more and stretched less.

2) Both are right-skewed; however, the Gamma distribution has a longer tail.

Fig 4.14: Gamma and Weibull density curves for different scale
(β) values (α=2)

Similarities:
1) when α≤1, the curve is strictly decreasing as x increases.
2) when α>1, f(x; α) rises to a maximum and then decreases as x increases.
3) when α=1, both distributions show exactly the same characteristics (Why?).

Differences:
1) when α>1, f(x; α) rises to a maximum; however, the Weibull dist decreases sharply whereas the Gamma dist decreases gradually as x increases.
2) when α>1, by modifying the script it was observed that the Weibull dist has its maximum peak approximately around x=0.5–1.0, whereas for the Gamma dist the x-values ranged considerably.
Fig 4.15: Gamma and Weibull density curves for different shape
(α) values (β=1)
4.8.2. MGF, Mean and Variance

M_Y(t) = Σ_(n=0)^∞ (t^n·β^n / n!)·Γ(1 + n/α),  α ≥ 1    (4.37)

E(Y) = β·Γ(1 + 1/α)    (4.38)

Var(Y) = β²·{Γ(1 + 2/α) − [Γ(1 + 1/α)]²}    (4.39)
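Eqs. (4.38) & (4.39) can be verified with a short simulation. The sketch below uses NumPy's Weibull sampler, which draws from the β=1 case, so the samples are multiplied by β:

import numpy as np
from math import gamma

alpha, beta = 2.0, 1.5
sample = beta * np.random.weibull(alpha, size=100_000)   #NumPy samples the β=1 case; scale by β

mean_theory = beta * gamma(1 + 1/alpha)                              #Eq. (4.38)
var_theory = beta**2 * (gamma(1 + 2/alpha) - gamma(1 + 1/alpha)**2)  #Eq. (4.39)

print(f"mean = {sample.mean():.3f} (theory {mean_theory:.3f})")
print(f"variance = {sample.var():.3f} (theory {var_theory:.3f})")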

Example 4.6
The article24 by Field and Blumenfeld (2016) investigates modeling the time to repair for reusable shipping containers, which are fairly expensive and need to be monitored carefully. The random variable X is defined as the time required for repair, in months. The authors recommended the Weibull distribution with parameters α=10.0 and β=3.5. What is the probability that a container requires repair within the first 3 months?

Solution:

Script 4.19
from scisuit.stats import pweibull

for m in [2, 3, 4]:


print(f"P ({m} months) = {pweibull(q=m, shape=10, scale=3.5)}")
P (2 months) = 0.0037
P (3 months) = 0.193
P (4 months) = 0.978

Note that we are almost certain that a container will not require repair in the first two months, but will almost certainly require repair within the first 4 months. Why is that?

24 Field DA, Blumenfeld D. Supply Chain Inventories of Engineered Shipping Containers. International Journal of
Manufacturing Engineering. Available at: https://doi.org/10.1155/2016/2021395
1. First of all in the previous section (4.8.1) it was mentioned that if α>1 then the failure rate
increases over time, which coincides with our observation.

2. Secondly, let's compute the mean and standard deviation of this specific distribution:

μ = 3.5·Γ(1 + 1/10) = 3.33

and the variance (and hence standard deviation) is:

σ² = 3.5²·{Γ(1 + 2/10) − [Γ(1 + 1/10)]²} = 0.16  →  σ = 0.4

Thus it is reasonable to expect a high probability of requiring repair in the range 3.33 − 0.4 ≤ x ≤ 3.33 + 0.4. This is also evidenced in the following figure:

It is seen that in the first 2 months the probability is very low and then the probability drastically increases between 2 to 4 months. The difference after 4 months can be considered negligible for many practical purposes.

P (4 months) = 0.978
P (5 months) = 0.9999
P (6 months) = 1.0

Fig 4.16: Weibull pdf α=10.0 and β=3.5


4.9. Beta Distribution

Beta distribution is defined on the interval [0, 1] or (0, 1) in terms of two parameters, α>0 and β>0
which control the shape of the distribution (Wikipedia 25, 2023). It is frequently used as a prior
distribution for binomial proportions in Bayesian analysis (Forbes et al., 2011) and often used as a
model for proportions, i.e. proportion of impurities in a chemical product or the proportion of time that
a machine is under repair (Wackerly et al., 2008).

A random variable Y is said to have a beta distribution with parameters α, β, A and B if the pdf is,

f(y; α, β, A, B) = 1/(B−A) · Γ(α+β)/(Γ(α)·Γ(β)) · ((y−A)/(B−A))^(α−1) · ((B−y)/(B−A))^(β−1),  A ≤ y ≤ B    (4.40)

If A=0 and B=1 then Eq. (4.40) gives standard26 beta distribution:

f(y; α, β) = Γ(α+β)/(Γ(α)·Γ(β)) · y^(α−1)·(1−y)^(β−1)    (4.41)

Eq. (4.41) is sometimes expressed as (Wackerly et al., 2008):

f(y; α, β) = y^(α−1)·(1−y)^(β−1) / Β(α, β)    (4.42)

where Β is:

Β(α, β) = ∫₀¹ y^(α−1)·(1−y)^(β−1) dy = Γ(α)·Γ(β)/Γ(α+β)    (4.43)

25 https://en.wikipedia.org/wiki/Beta_distribution
26 R (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Beta.html) and scisuit uses standard beta distribution where
parameters shape1, shape2 corresponds to α and β, respectively.
Let’s demonstrate the relationship between beta and gamma functions:

Script 4.20
#Python’s built-in math library does not have the beta function
from scipy.special import beta
from math import gamma
from random import randint

a, b = randint(1,10), randint(1,10)

print(f"beta({a}, {b}) = {beta(a,b)}")


print(f"Using gamma = {gamma(a)*gamma(b)/gamma(a+b)}")
beta(9, 4) = 0.00050505
Using gamma = 0.00050505

4.9.1. MGF, Mean and Variance


Moment-generating function of beta distribution is fairly complex and is given by Wikipedia 27 as
follows:

M_Y(t) = 1 + Σ_(k=1)^∞ [∏_(r=0)^(k−1) (α + r)/(α + β + r)] · t^k/k!    (4.44)

Series expansion of Eq. (4.44) can be conveniently obtained using hypergeometric function at Wolfram
Alpha28. Let’s present the first 3 terms of the series:

M_Y(t) = 1 + αt/(α + β) + α(α+1)t² / [2(α + β)(α + β + 1)] + ⋯

Taking the first derivative and setting t=0:

d/dt M_Y(t)|_(t=0) = α/(α + β)

Therefore the expected value is:

27 https://en.wikipedia.org/wiki/Beta_distribution
28 https://www.wolframalpha.com/input?i=1F1%28%CE%B1%2C+%CE%B1+%2B+%CE%B2%2C+t%29+series
μ = E(Y) = α/(α + β)    (4.45)

Taking the second derivative and setting t=0 yields E(X2):

E(Y²) = α(α + 1) / [(α + β)(α + β + 1)]

Using Eq. (2.15) and some mathematical manipulation:

Var(Y) = α·β / [(α + β)²·(α + β + 1)]    (4.46)
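Eqs. (4.45) & (4.46) can be verified with a short simulation (a sketch using NumPy's beta sampler):

import numpy as np

a, b = 4.0, 2.0
sample = np.random.beta(a, b, size=100_000)

mean_theory = a/(a + b)                        #Eq. (4.45)
var_theory = a*b/((a + b)**2 * (a + b + 1))    #Eq. (4.46)

print(f"mean = {sample.mean():.4f} (theory {mean_theory:.4f})")
print(f"variance = {sample.var():.4f} (theory {var_theory:.4f})")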

4.9.2. Effect of parameters


Standard beta distribution has two parameters, namely α and β. Let’s simulate the effect of these two
parameters on the shape of the beta curves:

It is seen from Fig. A that when α=β, the curves are symmetric and have a peak value at x=0.5 (Why?).

It is also observed that larger α=β values spread less than lower α=β values, since if x=α=β then Eq. (4.46) gives:

Var(Y; x=α=β) = 1/[4·(2x + 1)]

In Figs. B&C one can observe that when α>β the curves are left-skewed, whereas when α<β the curves are right-skewed.

Fig 4.17: Pdf of beta dist for different α and β values


Finally, note that in Fig. (4.17-A) the α=β values started from 2. It is recommended to try to get an insight into how the curve would look when α=β=1. [Tip: use Eq. (4.41)].

Example 4.7
A wholesale distributor has storage tanks to hold fixed supplies of gasoline which are filled at the
beginning of the week. The wholesaler is interested in the proportion of the supply that is sold during
the week. After several weeks of data collection it is found that the proportion that is sold could be
modeled by a beta distribution with α=4 and β=2. Find the probability that the wholesaler will sell at
least 90% of her stock in a given week (Adapted from Wackerly et al., 2008).

Solution:

A simple script will yield the solution (remember that shape1= α and shape2= β).

Script 4.21
from scisuit.stats import pbeta
prob = pbeta(q=0.9, shape1=4, shape2=2)
print(f"P(Y>0.9) = {1 - prob}")
P(Y>0.9) = 0.081

Imagine also that the wholesaler is interested in the proportions of the supply below which sales will fall 50, 75 and 95% of the time:

Script 4.22
from scisuit.stats import qbeta
probs = [0.5, 0.75, 0.95]
for p in probs:
print(f"{p*100}%: {qbeta(p=p, shape1=4, shape2=2)}")
50%: 0.68, 75%: 0.81, 95%: 0.92
It is seen that the curve is left-skewed (Why?). This might be considered “lucky” for the wholesaler, as low sale proportions occur with low probability.

Fig 4.18: pdf of beta dist for α=4 and β=2


4.10. Summary

Uniform
Description: There is almost the same amount of numbers in each equally spaced sub-interval.
Equation: f(y) = 1/(b−a) for a ≤ y ≤ b; 0 elsewhere

Normal
Description: The Poisson limit approximated binomial probabilities when n→∞ and p→0. Abraham DeMoivre showed that when X is a binomial random variable and n is large, the probability P(a ≤ (X−np)/√(np(1−p)) ≤ b) can be estimated.
Equation: f_Z(z) = (1/√(2π))·e^(−z²/2),  −∞ < z < ∞

Exponential
Description: A Poisson model is a good candidate when we only know the rate of occurrence (λ) of an event where the events occur completely at random. However, situations might arise where the time interval between consecutively occurring events is an important random variable.
Equation: f_Y(y) = λ·e^(−λy),  y > 0

Gamma
Description: Events satisfying the Poisson process occurring at a rate of λ can be modeled with the exponential distribution, where the random variable Y is interpreted as the waiting time for the first occurrence. The gamma distribution generalizes the exponential distribution such that we are now interested in the occurrence of (waiting time for) the rth event.
Equation: f = 1/(β^α·Γ(α)) · x^(α−1)·e^(−x/β),  x > 0

Chi-square
Description: The chi-squared distribution with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables.
Equation: f_Y(y) = 1/(2^(n/2)·Γ(n/2)) · y^((n/2)−1)·e^(−y/2)

Student t
Description: If y1, y2, …, yn is a random sample from a normal distribution with mean μ and standard deviation σ, then (Ȳ−μ)/(σ/√n) has a standard normal distribution (SND). However, Gosset realized that (Ȳ−μ)/(S/√n) does not have a SND and derived its probability density function.
Equation: T_n = Z/√(V/n)

F
Description: Ratio of independent chi-square random variables.
Equation: F = (U/m)/(V/n)

Weibull
Description: Commonly used as a lifetime distribution in reliability applications. It is of great interest to statisticians and practitioners because of its ability to fit data from various fields.
Equation: f = (α/β)·(x/β)^(α−1)·e^(−(x/β)^α) for x ≥ 0; 0 for x < 0

Beta
Description: Defined on the interval [0, 1] or (0, 1) in terms of two parameters, α>0 and β>0, which control the shape of the distribution. It is frequently used as a prior distribution for binomial proportions in Bayesian analysis and often used as a model for proportions, e.g. the proportion of impurities in a chemical product or the proportion of time that a machine is under repair.
Equation: f = Γ(α+β)/(Γ(α)·Γ(β)) · y^(α−1)·(1−y)^(β−1)
5. Estimation and Hypothesis Testing
5.1. Point Estimation

A point estimate is a single value (e.g., mean, median, proportion, ...) based on sampled data used to represent a plausible value of a population characteristic (Peck et al. 2016).

Example 5.1
Out of 7421 US College students 2998 reported using internet more than 3 hours a day. What is the
proportion of all US College students who use internet more than 3 hours a day? ( Adapted from Peck et
al. 2016).

Solution:

The solution is straightforward: p = 2998/7421 = 0.40

Based on the statistics it is possible to claim that approximately 40% of the students in US spend more
than 3 hours a day using the internet. Please note that based on the survey result, we made a claim
about the population, students in US. ■

Now that we have made an estimate based on the survey, we should ask ourselves: “How reliable is this estimate?”. We know that if we had another group of students, the percentage might not have been 40; maybe it would be 45 or 35. There are no perfect estimators, but we expect that on average the estimator should give us the right answer.

5.1.1. Unbiased estimators


Definition: A statistic Θ is an unbiased estimator of the parameter θ of a given distribution if and only
if,

E(Θ) = θ (5.1)

for all possible values of θ (Miller and Miller 2014).


Example 5.2
If X has binomial distribution with the parameters n and p, show that the sample proportion, X/n is an
unbiased estimator of p.

Before we proceed with the solution, let’s refresh ourselves with a simple example: Suppose we
conduct an experiment where we flip a coin 10 times. We already know that the probability of getting
heads (success) is 𝑝=0.5. However, we want to estimate p by flipping the coin and calculating the
sample proportion, X/n. If we flip the coin 10 times and get X=6 heads, the estimate is 0.6. However, averaged over many repeated experiments, the estimate will be 0.5. Therefore, X/n is an unbiased estimator.

E(X/n) = (1/n)·E(X) = (1/n)·np = p
therefore, X/n is an unbiased estimator of p. ■
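The coin-flipping argument can be illustrated with a short simulation (a sketch using NumPy's binomial sampler): the average of many sample proportions should approach p.

import numpy as np

n, p = 10, 0.5
X = np.random.binomial(n, p, size=100_000)   #number of heads in each simulated experiment

props = X / n
print(f"average sample proportion = {props.mean():.4f} (true p = {p})")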

Example 5.3

Prove that E[1/(n−1)·Σ_(i=1)^n (X_i − X̄)²] is an unbiased estimator of the population variance (σ²).

Solution:

E(S²) = E[1/(n−1)·Σ_(i=1)^n (X_i − X̄)²]

Now we are going to add and subtract μ inside the parenthesis:

= 1/(n−1)·E[Σ_(i=1)^n ((X_i − μ) − (X̄ − μ))²]

After a straightforward algebraic manipulation,

= 1/(n−1)·E[Σ_(i=1)^n ((X_i − μ)² − (X̄ − μ)²)]

= 1/(n−1)·E[Σ_(i=1)^n (X_i − μ)² − n(X̄ − μ)²]

Note that E(X_i − μ)² = σ² and E(X̄ − μ)² = σ²/n. Putting the knowns in the last equation,

= 1/(n−1)·(n·σ² − n·σ²/n) = σ²

Therefore, S² is an unbiased estimator of the population variance. ■

Example 5.4
For the uniform probability distribution f_Y(y; θ) = 1/θ, 0 ≤ y ≤ θ, there are two estimates for θ:

1. θ̂₁ = (2/n)·Σ_(i=1)^n Y_i

2. θ̂₂ = Y_max

Which one is an unbiased estimator of θ? (Adapted from Larsen & Marx 2011)

Solution #1:

E(θ̂₁) = E[(2/n)·Σ_(i=1)^n Y_i] = (2/n)·Σ_(i=1)^n E(Y_i)

The expected value of the uniform distribution is θ/2, therefore

E(θ̂₁) = (2/n)·Σ_(i=1)^n θ/2 = (2/n)·n·(θ/2) = θ

Therefore θ̂₁ is an unbiased estimator.

Solution #2:

Using the equation given in Example (2.10), the PDF can be found as follows:

f_Ymax(u) = n·(1/θ)·(u/θ)^(n−1)

E(θ̂₂) = ∫₀^θ u·n·(1/θ)·(u/θ)^(n−1) du = [n/(n+1)]·θ

It is seen that as n increases, the “bias” decreases and for large n it becomes unbiased.
5.1.2. Efficiency
It is seen in section (5.1.1) that a parameter can have more than one unbiased estimators. Which one
should we choose? We should choose one with higher precision, in other words, with smaller variance.

Definition: Let θ̂₁ and θ̂₂ be two unbiased estimators for parameter θ. If,

Var(θ̂₁) < Var(θ̂₂)    (5.2)

then θ̂₁ is more efficient than θ̂₂ (Larsen & Marx 2011).

Example 5.5
Given the estimators in Example (5.4), which one is more efficient?

Solution:

A tedious mathematical derivation is presented by Larsen & Marx (2011). The results are as follows:

Var(θ̂₁) = θ²/(3n)

Var(θ̂₂) = θ²/[n(n+2)]

For n>1, it is seen that the second estimator has a smaller variance than the first one. Therefore, it is more efficient. ■
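Both examples can be illustrated with a short simulation (a sketch using NumPy). Note that the variance quoted for θ̂₂ in Example (5.5) matches the bias-corrected maximum, (n+1)/n·Ymax, so that both estimators are unbiased; the sketch uses that form.

import numpy as np

theta, n, N = 10.0, 5, 200_000
Y = np.random.uniform(0, theta, size=(N, n))

est1 = 2 * Y.mean(axis=1)             #(2/n)·ΣYi, unbiased
est2 = (n + 1)/n * Y.max(axis=1)      #bias-corrected maximum, also unbiased

print(f"estimator 1: mean = {est1.mean():.3f}, variance = {est1.var():.3f} (theory {theta**2/(3*n):.3f})")
print(f"estimator 2: mean = {est2.mean():.3f}, variance = {est2.var():.3f} (theory {theta**2/(n*(n + 2)):.3f})")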

There are more properties of estimators: i) minimum variance, ii) robustness, iii) consistency, iv)
sufficiency. Interested readers can refer to textbooks on mathematical statistics (Devore et al. 2021;
Larsen & Marx 2011; Miller & Miller 2014).
5.2. Statistical Confidence

Suppose you want to estimate the SAT scores of students. For that purpose, 500 randomly selected students have been given an SAT test and a mean value of 461 is obtained (adapted from Moore et al. 2009). Although it is known that the sample mean is an unbiased estimator of the population mean (μ), we already know that had we sampled another 500 students, the mean could (most likely would) have been different than 461. Therefore, how confident are we in claiming that the population mean is 461?

Suppose that the standard deviation of the population is known (σ=100). We know that if we repeatedly sample 500 students, the means of these samples will follow the N(μ, 100/√500 = 4.5) curve.
Script 5.1
import scisuit.plot as plt
from scisuit.stats import rnorm
aver = []
for i in range(1000):
sample = rnorm(n=500, mean= 461, sd= 100)
aver.append(sum(sample)/500)
plt.hist(data=aver, density=True)
plt.show()

It is seen that the interval (447.5, 474.5) represents almost all possible mean values. Therefore we are 99.7% (3σ) confident (confidence level) that the population mean will be in this interval. Note also that, as a natural consequence, our confidence level decreases as the interval length decreases.

461 − 3×4.5 = 447.5
461 + 3×4.5 = 474.5

Fig 5.1: Density scaled histogram


5.3. Confidence Intervals

A way to quantify the amount of uncertainty in a point estimator is to construct a confidence interval
(Larsen & Marx 2011). The definition of confidence interval is as follows: “... an interval computed
from sample data by a method that has probability C of producing an interval containing the true value
of the parameter.” (Moore et al. 2009). Peck et al. (2016) gives a general form of confidence interval as
follows:

(point estimate using a specified statistic) ± (critical value)·(estimated standard deviation of the statistic)    (5.3)

Note that the estimated standard deviation of the statistic is also known as standard error. In other
words, when the standard deviation of a statistic is estimated from the data (because the population’s
standard deviation is not known), the result is called the standard error of the statistic (Moore et al.
2009).

Example 5.6
Establish a confidence interval for binomial distribution.

Solution:

We already know (chapter 4.2) that Abraham DeMoivre showed that when X is a binomial random
variable and n is large the probability can be approximated as follows:
lim_(n→∞) P(a ≤ (X − np)/√(np(1−p)) ≤ b) = (1/√(2π))·∫_a^b e^(−z²/2) dz

To establish an approximate 100(1−α)% confidence interval,

P[−z_(α/2) ≤ (X − np)/√(np(1−p)) ≤ z_(α/2)] = 1 − α

Rewriting the equation,

P[−z_(α/2) ≤ (X/n − p)/√((X/n)(1 − X/n)/n) ≤ z_(α/2)] = 1 − α
Rewriting the equation by isolating p leads to,

[k/n − z_(α/2)·√((k/n)(1 − k/n)/n),  k/n + z_(α/2)·√((k/n)(1 − k/n)/n)]

If a 95% confidence interval to be established, then zα/2 would be ≈1.96.

Script 5.2
from scisuit.stats import qnorm

alpha1 = 0.05
alpha2 = 0.01
print(qnorm(alpha1/2), qnorm(1-alpha1/2))
print(qnorm(alpha2/2), qnorm(1-alpha2/2))
-1.95996 1.95996
-2.57583 2.57583
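As an illustration (a sketch, not part of the original example), the interval derived above can be applied to the survey of Example (5.1), where k=2998 and n=7421:

from math import sqrt
from scisuit.stats import qnorm

k, n = 2998, 7421
phat = k/n
z = qnorm(1 - 0.05/2)              #≈1.96 for a 95% CI
half = z*sqrt(phat*(1 - phat)/n)

print(f"95% CI for p: ({phat - half:.3f}, {phat + half:.3f})")   #roughly (0.393, 0.415)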
Note that if a 95% confidence interval (CI) yields an interval (0.52, 0.57), it is tempting to say that there is a probability of 0.95 that p will be between 0.52 and 0.57. Larsen & Marx (2011) and Peck et al. (2016) warn against this temptation. A close look at Eq. (5.3) reveals that from sample to sample the constructed CI will be different. However, in the long run 95% of the constructed CIs will contain the true p and 5% will not. This is well depicted in the figure (Figure 9.4 at pp. 471) presented by Peck et al. (2016).

Note also that a 99% CI will be wider than a 95% CI. However, the higher reliability causes a loss in
precision. Therefore, Peck et al. (2016) remarks that many investigators consider a 95% CI as a
reasonable compromise between reliability and precision.
5.4. Hypothesis Testing

Confidence intervals and statistical tests are the two most important ideas in the age of modern
statistics (Kreyszig et al. 2011). A confidence interval is constructed when we would like to estimate a population parameter. Another type of inference is to assess the evidence provided by data against a claim about a parameter of the population (Moore et al. 2009). Therefore, after carrying out an experiment, conclusions must be drawn based on the obtained data. The two competing propositions are called the null hypothesis (H0) and the alternative hypothesis (H1) (Larsen & Marx 2011).

We initially assume that a particular claim about a population (H0) is correct. Then, based on the evidence from the data, we either reject H0 and accept H1 if there is compelling evidence, or we fail to reject H0 (Peck et al. 2016).

An example from Larsen & Marx (2011) would clarify the concepts better: Imagine as an automobile
company you are looking for additives to increase gas mileage. Without the additives, the cars are
known to average 25.0 mpg with a σ=2.4 mpg and with the addition of additives, it was found
(experiment involved 30 cars) that the mileage increased to 26.3 mpg.

Now, in terms of null and alternative hypothesis, H0 is 25 mpg and H1 claims 26.3 mpg. We know that
if the experiments were carried out with another 30 cars, the result would be different (lower or higher)
than 26.3 mpg. Therefore, “is an increase to 26.3 mpg due to additives or not?”. At this point we
should rephrase our question: “if we sample 30 cars from a population with μ=25.0 mpg and σ=2.4,
what are the chances that we will get 26.3 mpg on average?”. If the chances are high, then the additive
is not working; however, if the chances are low, then it must be due to the additives that the cars are
getting 26.3 mpg. Let’s evaluate this with a script (note the similarity to Script 5.1):

Script 5.3
from scisuit.stats import rnorm

aver = []
for i in range(10000):
    sample = rnorm(n=30, mean= 25, sd= 2.4)
    aver.append(sum(sample)/30)

filtered = list(filter(lambda x: x>=26.5, aver))
print(f"probability = {len(filtered)/len(aver)}")
probability = 0.0002
We observe that the probability is too low for this to happen by chance (random sampling from the
population). Therefore, we conclude that in light of the statistical evidence the additives indeed work
(H1 wins) and reject H0.

Directly computing the probability:

P(x̄ ≥ 26.5) = P(Z ≥ (26.50 − 25.0)/(2.4/√30)) = 0.0003

which is very close to the simulation result of Script (5.3).

Wackerly et al. (2008) lists the elements of a statistical test as follows:

1. Null hypothesis (μ=25.0 mpg),

2. Alternative hypothesis (26.5 mpg),

3. Test statistic ( (x̄ − 25.0)/(2.4/√30) ),

4. Rejection region ( x̄ ≥ 25.718 for α=0.05 )

5.4.1. The P-Value


We have seen that using a level of significance a critical region (H 0 being rejected) can be identified
(Larsen & Marx 2011); for example, z≥2 is a rejection criteria for the Supreme Court of the United
States (Moore et al. 2009). However, not all test statistics are normal and therefore a new strategy is to
calculate the p-value, which is defined as (Larsen & Marx 2011): “… the probability of getting a value
for that test statistic as extreme as or more extreme than what was actually observed given that H 0 is
true.”. The term extreme is also used by Moore et al. (2009) and is explained as “far from what we
would expect if H0 were true”.

If for example the test statistics yield Z=1.37 and we are carrying out a two-sided test, the p-value
would be, P(Z≤−1.37 or Z≥1.37) where Z has a standard normal distribution.
z=1.37

#pnorm computes left-tailed probability


pvalue = pnorm(-z) + (1-pnorm(z))
print(f"p-value = {pvalue}")
p-value = 0.17

Fig 5.2: P-value (i.e. z≥1.37 is considered extreme) (adapted from Moore et al. 2009)

A simpler definition is given by Miller & Miller (2014): “… the lowest level of significance at which
the null hypothesis could have been rejected”. Let’s rephrase Miller & Miller (2014) definition: once a
level of significance is decided (e.g. α=0.05), if the computed p-value is less than the α, then we reject
H0. For example, in the gasoline additive example, p-value was computed as 0.0003 and if α=0.05, then
since p< α, we reject H0 in favor of H1 (i.e., additive has effect).

In terms of a standard normal distribution, there are 3 cases of computing p-values:

1. H1: μ > μ0 → P(Z≥z) (alternative is greater than)


2. H1: μ < μ0 → P(Z≤z) (alternative is smaller than)
3. H1: μ ≠ μ0 → 2P(Z≥|z|) (alternative is not equal)

Example 5.7
A bakery claims on its packages that its cookies are 8 g. It is known that the standard deviation of the 8
g packages of cookies is 0.16 g. As a quality control engineer, you collected 25 packages and found that
the average is 8.091 g. Is the production process going alright? (adapted from Miller & Miller 2014).

Solution:

The null hypothesis is H0: μ=8 g,

The alternative hypothesis H1: μ≠8 g.

The test statistic: z = (8.091 − 8)/(0.16/√25) = 2.84
1-pnorm(2.84) + pnorm(-2.84) #2*(1- pnorm(2.84))
0.0045
Since p<0.05, we reject the null hypothesis. Therefore, the process should be checked and suitable
adjustments should be made.
6. Z-Test for Population Means
The fundamental limitation to applying z-test is that the population variance must be known in advance
(Kanji 2006; Moore et al. 2009; Peck et al. 2016). The test is accurate when the population is normally
distributed; however, it will give an approximate value even if the population is not normally
distributed (Kanji 2006). In most practical applications, population variance is unknown and the sample
size is small therefore a t-test is more commonly used.

6.1. One-sample z-test

From a population with known mean (μ) and standard deviation (σ), a random sample of size n is taken (generally n≥30) and the sample mean (x̄) is calculated. The test statistic is:

Z = (x̄ − μ)/(σ/√n)    (6.1)

Example 6.1
A filling process is set to fill tubs with powder of 4 g on average. For this filling process it is known
that the standard deviation is 1 g. An inspector takes a random sample of 9 tubs and obtains the
following data: Weights = [3.8, 5.4, 4.4, 5.9, 4.5, 4.8, 4.3, 3.8, 4.5]

Is the filling process working fine? (Adapted from Kanji 2006).

Solution:

The average of 9 samples is: 4.6 g

Test statistic: Z = (4.6 − 4)/(1/√9) = 1.8,

Since 1.8 is in the range of -1.96<Z<1.96, we cannot reject the null hypothesis, therefore the filling
process works fine (i.e. there is no evidence to suggest it is different than 4 g).

Is it over-filling?

Now, we are going to carry out 1-tailed z-test and therefore acceptance region is Z<1.645. Since the test
statistic is greater than 1.645, we reject the null hypothesis and have evidence that the filling process is
over-filling.
Script 6.1
import scisuit.plot as plt
from scisuit.stats import test_z

data = [3.8, 5.4, 4.4, 5.9, 4.5, 4.8, 4.3, 3.8, 4.5]
result = test_z(x=data, sd1=1, mu=4)
print(result)
N=9, mean=4.6, Z=1.799
p-value = 0.072 (two.sided)
Confidence interval (3.95, 5.25)
Since p>0.05, we cannot reject H0.

Script 6.1 requires a minor change to analyze whether it is over-filling or not. We will set the parameter, namely alternative (whose default value is “two.sided”), to “greater”.

Script 6.2
result = test_z(x=data, sd1=1, mu=4, alternative="greater")
print(result)
p-value = 0.036 (greater)
Confidence interval (4.052, inf)
Since p<0.05, we reject the null hypothesis in favor of alternative hypothesis.

6.2. Two-sample z-test

In essence, the two-sample z-test is very similar to the one-sample z-test: we take n1 and n2 samples from two populations with means (μ1 and μ2) and standard deviations (σ1 and σ2). The test statistic is computed as:

Z = [(x̄1 − x̄2) − (μ1 − μ2)] / (σ1²/n1 + σ2²/n2)^(1/2)    (6.2)
Example 6.2
A survey has been conducted to see if studying over or under 10 h/week has an effect on overall GPA.
For those who studied less (x) and more (y) than 10 h/week the GPAs were:

x=[2.80, 3.40, 4.00, 3.60, 2.00, 3.00, 3.47, 2.80, 2.60, 2.0]

y = [3.00, 3.00, 2.20, 2.40, 4.00, 2.96, 3.41, 3.27, 3.80, 3.10, 2.50].

respectively. It is known that the standard deviation of GPAs for the whole campus is σ=0.6. Does
studying over or under 10 h/week has an effect on GPA? (Adapted from Devore et al. 2021)

Solution:

We have two groups (those studying over and under 10 h/week) from the same population (whole
campus) whose standard deviation is known (σ=0.6).

We will solve this question directly using a Python script and the mathematical computations are left as
an exercise to the reader.

Script 6.3
x = [2.80, 3.40, 4.00, 3.60, 2.00, 3.00, 3.47, 2.80, 2.60, 2.0]
y = [3.00, 3.00, 2.20, 2.40, 4.00, 2.96, 3.41, 3.27, 3.80, 3.10, 2.50]
mu = 0
sd1, sd2 = 0.6, 0.6

result = test_z(x=x, y=y, sd1=sd1, sd2=sd2, mu=0)


print(result)

n1=10, n2=11, mean1=2.967, mean2=3.058


Z=-0.3478
p-value = 0.728 (two.sided)
Confidence interval (-0.605, 0.423)
Since p>0.05 there is no statistical evidence to reject H0 (μ1 − μ2 = 0) and therefore there is no statistically significant difference between studying over or under 10 h/week.
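The Z value reported by test_z can also be reproduced by applying Eq. (6.2) by hand (a short sketch):

from math import sqrt

x = [2.80, 3.40, 4.00, 3.60, 2.00, 3.00, 3.47, 2.80, 2.60, 2.0]
y = [3.00, 3.00, 2.20, 2.40, 4.00, 2.96, 3.41, 3.27, 3.80, 3.10, 2.50]
sd = 0.6   #population standard deviation for the whole campus

mean_x, mean_y = sum(x)/len(x), sum(y)/len(y)
Z = (mean_x - mean_y) / sqrt(sd**2/len(x) + sd**2/len(y))
print(f"Z = {Z:.4f}")   #should match the -0.3478 reported by test_z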
7. Student’s t-test for Population Means
In Chapter 6, we have seen that the z-test is only possible when the standard deviation (σ) of the population is known. Therefore, the z-statistic is not commonly used (Peck et al. 2016). When standard deviation(s) are not known they must be estimated from samples, and in Example (5.3) it was already shown that S² is an unbiased estimator of σ². Therefore, one might immediately be tempted to use Eq. (6.1).

Then comes the question: What effect does replacing σ with S have on the Z ratio? (Larsen & Marx 2011). In order to answer this question, let's demonstrate the effect of replacing σ with S on the Z ratio with a script:

Script 7.1
import numpy as np
from math import sqrt
import scisuit.plot as plt
from scisuit.stats import dnorm, rnorm

#plotting f(z) curve


x = np.linspace(-3, 3, num=100)
y = dnorm(x)

N=4
sigma, mu = 1.0, 0.0 #stdev and mean of population
z, t = [], []
for i in range(1000):
sample = rnorm(n=N)
aver = sum(sample)/N

#using population stdev


z_ratio = (aver-mu)/(sigma/sqrt(N))
z.append(z_ratio)

#computing stdev from sample


s = float(np.std(sample, ddof=1))
z_ratio = (aver-mu)/(s/sqrt(N))

#filter out too big and too small ones


if(-4<z_ratio<4):
t.append(z_ratio)

plt.layout(nrows=2, ncols=1)

plt.subplot(row=0, col=0)
plt.scatter(x=x, y=y)
plt.hist(data=z, density=True)
plt.title("Population Std Deviation")

plt.subplot(row=1, col=0)
plt.scatter(x=x, y=y)
plt.hist(data=t, density=True)
plt.title("Sample Std Deviation")

plt.show()

In the top figure, it is seen that when the standard deviation of the population (σ) is known, f(z) is consistent with (x̄ − μ)/(σ/√n).

However, when σ is not known and instead S is used to compute the ratio, (x̄ − μ)/(S/√n), it is seen that f(z) underestimates the frequency of ratios much less than zero as well as the ratios much larger than zero.

Credit for recognizing this difference goes to William Sealy Gosset.29

Fig 7.1: (top) (x̄ − μ)/(σ/√n) vs (bottom) (x̄ − μ)/(S/√n)

Note that in Script (7.1), N was intentionally chosen a small value (N=4). It is recommended to change
N to a greater number, such as 10, 20 or 50 in order to observe the effect of large samples.

29 Student (1908). The probable error of a mean. Biometrika, 6(1), 1-25.


7.1. One-sample t-test

Let x̄ and s be the mean and standard deviation of a random sample from a normally distributed
population. Then,

t = (x̄ − μ)/(s/√n)    (7.1)

has a t distribution with df = n−1. Here s is the sample standard deviation, computed as:

s = √[ 1/(n−1)·Σ_(i=1)^n (X_i − X̄)² ]    (7.2)

Example 7.1
In 2006, a report revealed that UK subscribers with 3G phones listen on average 8.3 hours/month full-
track music. The data for a random sample of size 8 for US subscribers is x=[5, 6, 0, 4, 11, 9, 2, 3]. Is
there a difference between US and UK subscribers? (Adapted from Moore et al. 2009).

Solution:

Script 7.2
from math import sqrt
from statistics import stdev
from scisuit.stats import qt

x = [5, 6, 0, 4, 11, 9, 2, 3]
n = len(x)
df = n-1 #degrees of freedom
aver = sum(x)/n
stderr = stdev(x)/sqrt(n) #standard error

#construct a 95% interval
tval = qt(0.025, df=df) #alpha/2=0.025
v_1 = aver - tval*stderr
v_2 = aver + tval*stderr
print(f"Interval: ({min(v_1, v_2)}, {max(v_1, v_2)})")
Interval: (1.97, 8.03)
Since the confidence interval does not contain 8.3 and furthermore since its upper limit is smaller than
8.3, it can be concluded that US subscribers listen less than UK subscribers.
Directly solving using scisuit's built-in function:

Script 7.3
from scisuit.stats import test_t
x=[5, 6, 0, 4, 11, 9, 2, 3]
result = test_t(x=x, mu=8.3)
print(result)
One-sample t-test for two.sided
N=8, mean=5.0
SE=1.282, t=-2.575
p-value =0.037
Confidence interval: (1.97, 8.03)
Since p<0.05 we reject H0 and claim that there is statistically significant difference between US and
UK subscribers. [If in test_t function H1 was set to “less” instead of “two.sided” then p=0.018. Therefore, we
would reject the H0 in favor of H1, i.e. US subscribers indeed listen less than UK’s. ]

7.2. Two-sample t-test

7.2.1. Equal Variances


Assume we are drawing n and m samples from two populations, namely X and Y, with equal variances but different means μX and μY. Let S_P² be the pooled variance; then:

S_P² = [ Σ_(i=1)^n (X_i − X̄)² + Σ_(i=1)^m (Y_i − Ȳ)² ] / (n + m − 2)    (7.3)

and the test statistic, defined as:

T_(n+m−2) = [X̄ − Ȳ − (μX − μY)] / [S_P·√(1/n + 1/m)]    (7.4)

has a Student’s t-distribution with n+m-2 degrees of freedom.


Example 7.2
Student surveys are important in academia. An academic who scored low on a student survey joined workshops to improve “enthusiasm” in teaching. X and Y are survey scores from his fall and spring semester classes, which he selected to have the same demographics.

X = [3, 1, 2, 1, 3, 2, 4, 2, 1]
Y = [5, 4, 3, 4, 5, 4, 4, 5, 4]

Is there a difference in the scores of the two semesters? (Adapted from Larsen & Marx 2011).

Solution:

We can make the following assumptions:

1. The variance of the populations are not known, therefore z-test cannot be applied.

2. It is reasonable to assume equal variances since the X and Y have the same demographics.

Script 7.4
from scisuit.stats import test_t
x = [3, 1, 2, 1, 3, 2, 4, 2, 1]
y = [5, 4, 3, 4, 5, 4, 4, 5, 4]
result = test_t(x=x, y=y, varequal=True)
print(result)
Two-sample t-test assuming equal variances
n1=9, n2=9, df=16
s1=1.054, s2=0.667
Pooled std = 0.882
t = -5.07
p-value = 0.0001 (two.sided)
Confidence interval: (-2.992, -1.230)
Since p<0.05, the difference between the scores of fall and spring are statistically significant.
7.2.2. Unequal Variances
Similar to section 7.2.1, we are drawing random samples of size n1 and n2 from normal distributions
with means μX and μY, but with standard deviations σX and σY, respectively.

S₁² = Σ_(i=1)^(n1) (X_i − X̄)² / (n1 − 1)   and   S₂² = Σ_(i=1)^(n2) (Y_i − Ȳ)² / (n2 − 1)    (7.5)

The test statistic is computed as follows:

t = [X̄ − Ȳ − (μX − μY)] / √(s₁²/n1 + s₂²/n2)    (7.6)

In 1938 Welch30 showed that t is approximately distributed as a Student’s t random variable with df:

df = (s₁²/n1 + s₂²/n2)² / [ s₁⁴/(n1²(n1−1)) + s₂⁴/(n2²(n2−1)) ]    (7.7)

Example 7.3
A study by Larson and Morris31 (2008) surveyed the annual salary of men and women working as
purchasing managers subscribed to Purchasing magazine. The salaries are (in thousands of US dollars):

Men = [81, 69, 81, 76, 76, 74, 69, 76, 79, 65]
Women = [78, 60, 67, 61, 62, 73, 71, 58, 68, 48]

Is there a difference in salaries between men and women? (Adapted from Peck et al. 2016)

30 https://www.jstor.org/stable/2332010
31 Larson PD & Morris M (2008). Sex and Salary: A Survey of Purchasing and Supply Professionals, Journal of
Purchasing and Supply Management, 112–124.
Solution:

The following assumptions can be made:

1. The z-test cannot be applied because the variances of the populations are not known.

2. Although the samples were selected from the subscribers of Purchasing magazine, Larson and Morris (2008) considered two populations of interest, i.e. male and female purchasing managers. Therefore, equal variances should not be assumed.

Script 7.5
from scisuit.stats import test_t
Men = [81, 69, 81, 76, 76, 74, 69, 76, 79, 65]
Women = [78, 60, 67, 61, 62, 73, 71, 58, 68, 48]
result = test_t(x=Women, y=Men, varequal=False)
print(result)
Two-sample t-test assuming unequal variances
n1=10, n2=10, df=15
s1=8.617, s2=5.399
t = -3.11
p-value = 0.007 (two.sided)
Confidence interval: (-16.7, -3.1)
Since p<0.05, there is statistically significant difference between salaries of each group.
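The df=15 reported by test_t can be reproduced by applying Eq. (7.7) by hand (a short sketch):

from statistics import variance

Men = [81, 69, 81, 76, 76, 74, 69, 76, 79, 65]
Women = [78, 60, 67, 61, 62, 73, 71, 58, 68, 48]

s1sq, s2sq = variance(Women), variance(Men)   #sample variances
n1, n2 = len(Women), len(Men)

num = (s1sq/n1 + s2sq/n2)**2
den = s1sq**2/(n1**2*(n1 - 1)) + s2sq**2/(n2**2*(n2 - 1))
print(f"Welch df = {num/den:.2f}")   #≈15, as reported by test_t above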

7.3. Paired t-test

In essence a paired t-test is a two-sample t-test as there are two samples. However, the two samples are
not independent as one of the factors in the first sample is paired in a meaningful way with a particular
observation in the second sample (Larsen & Marx 2011; Peck et al. 2016).

The equation to compute the test statistics is similar to one-sample t-test, Eq. (7.1):

t = (x̄ − μ)/(s/√n)    (7.8)

where x̄ and s are mean and standard deviation of the sample differences, respectively. The degrees of
freedom is: df=n-1.
Example 7.4
In a study where 6th grade students who had not previously played chess participated in a program in
which they took chess lessons and played chess daily for 9 months. Below data demonstrates their
memory test score before and after taking the lessons:

Pre = [510, 610, 640, 675, 600, 550, 610, 625, 450, 720, 575, 675]
Post = [850, 790, 850, 775, 700, 775, 700, 850, 690, 775, 540, 680]

Is there evidence that playing chess increases the memory scores? (Adapted from Peck et al. 2016).

Solution:

Before we attempt to solve the question, we make the following assumptions:

1. Z-test cannot be applied since population variance is not known,

2. Pre- and post-test scores are not independent since they were applied to the same subjects.

Script 7.6
from scisuit.stats import test_t
Pre = [510, 610, 640, 675, 600, 550, 610, 625, 450, 720, 575, 675]
Post = [850, 790, 850, 775, 700, 775, 700, 850, 690, 775, 540, 680]
result = test_t(x=Post, y=Pre, paired=True)
print(result)
Paired t-test for two.sided
N=12, mean1=747.9, mean2=603.3, mean diff=144.6
t =4.564
p-value =0.0008
Confidence interval: (74.9, 214.3)
Since p<0.05, there is statistical evidence that playing chess indeed made a difference in increasing the
memory scores.

If the parameter alternative had been set to "less", then p=0.99; therefore, we would fail to reject H0 in favor of
the alternative hypothesis (Post < Pre). If, on the other hand, alternative had been set to "greater", then
p=0.0004; therefore, we would reject H0 and accept H1 (Post > Pre).
8. F-Test for Population Variances
Assume that a metal rod production facility uses two machines on the production line. Each machine
produces rods with mean thicknesses μX and μY, which are not significantly different. However, if the
variabilities are significantly different, then some of the produced rods might become unacceptable as
they will be outside the engineering specifications.

In Section (7.2), it was shown that there are two cases for two-sample t-tests: whether variances were
equal or not. To be able to choose the right procedure, Larsen & Marx (2011) recommended that F test
should be used prior to testing for μX=μY.

Let’s draw random samples from populations with normal distribution. Let X1, … , Xm be a random
sample from a population with standard deviation σ1 and let Y1, …, Yn be another random sample from a
population with standard deviation σ2. Let S1 and S2 be the sample standard deviations. Then the test
statistic is:

$$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \tag{8.1}$$

has an F distribution with df1=m-1 and df2=n-1, (Devore et al. 2021).

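Under H0 (σ1=σ2), Eq. (8.1) reduces to the ratio of the sample variances. The sketch below is a minimal illustration (the helper name and data are hypothetical, and it assumes scisuit's qf quantile function for the F distribution); H0 is rejected when F falls outside the two critical values:

import numpy as np
from scisuit.stats import qf

def f_ratio(x, y, alpha=0.05):
    x, y = np.asarray(x, float), np.asarray(y, float)
    F = np.var(x, ddof=1) / np.var(y, ddof=1) #Eq. (8.1) with sigma1 = sigma2
    df1, df2 = len(x) - 1, len(y) - 1
    lower = qf(alpha/2, df1=df1, df2=df2)     #lower critical value
    upper = qf(1 - alpha/2, df1=df1, df2=df2) #upper critical value
    return F, (lower, upper)

print(f_ratio([10.7, 10.4, 9.6, 11.1], [9.6, 10.4, 9.2, 9.9])) #hypothetical data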
Example 8.1
α-waves produced by the brain have a characteristic frequency of 8 to 13 Hz. The subjects were 20
inmates in a Canadian prison who were randomly split into two groups: one group was placed in
solitary confinement; the other group was allowed to remain in their own cells. Seven days later,
α-wave frequencies were measured for all twenty subjects; the results are shown below:

non-confined = [10.7, 10.7, 10.4, 10.9, 10.5, 10.3, 9.6, 11.1, 11.2, 10.4]
confined = [9.6, 10.4, 9.7, 10.3, 9.2, 9.3, 9.9, 9.5, 9, 10.9]

Is there a significant difference in variability between two groups?


Solution:

Using a box-whisker plot, let’s first visualize the data as shown in Fig. (8.1).

It is seen that inmates placed in solitary confinement (red box) show a clear decrease in the α-wave frequency. Furthermore, the variability of that particular group seems higher than that of the non-confined inmates.

Fig 8.1: Non-confined (blue) vs solitary confined (red)

Script 8.1
from scisuit.stats import test_f

nonconfined = [10.7, 10.7, 10.4, 10.9, 10.5, 10.3, 9.6, 11.1, 11.2, 10.4]
confined = [9.6, 10.4, 9.7, 10.3, 9.2, 9.3, 9.9, 9.5, 9, 10.9]
result = test_f(x=confined, y=nonconfined)
print(result)
F test for two.sided
df1=9, df2=9, var1=0.357, var2=0.211
F=1.696
p-value =0.443
Confidence interval: (0.42, 6.83)
Since p>0.05, we cannot reject H0 (σ1=σ2). Therefore, there is no statistically significant difference
between the variances of the two groups.
9. Analysis of Variance (ANOVA)
In Section (7.2) we saw that when exactly two means need to be compared, we can use a two-sample t-test. The methodology for comparing several means is called analysis of variance (ANOVA). When there is only a single factor with multiple levels, e.g., the color of strawberries subjected to different power levels of infrared radiation, then we can use one-way ANOVA. However, if besides infrared power we are also interested in different exposure times, then two-way ANOVA needs to be employed.

9.1. One-Way ANOVA

There are 3 essential assumptions for the test to be accurate (Anon 2024)32:

1. Each group comes from a normal population distribution.


2. The population distributions have the same standard deviations (σ1=σ2=…=σn).
In practice the population standard deviations will rarely be exactly equal; Peck et al. (2016) suggest that ANOVA can still safely be used if σmax ≤ 2·σmin.
3. The data are independent.

A similarity comparison of two-sample t-test and ANOVA is given by Moore et al. (2009). Suppose we
are analyzing whether the means of two different groups of the same size (n) are different. Then we would
employ a two-sample t-test with equal variances (due to assumption #2):

$$t = \frac{\bar{X}-\bar{Y}}{S_p\sqrt{\dfrac{1}{n}+\dfrac{1}{n}}} = \frac{\sqrt{\dfrac{n}{2}}\left(\bar{X}-\bar{Y}\right)}{S_p} \tag{9.1}$$

The square of the test statistic is:

$$t^2 = \frac{\dfrac{n}{2}\left(\bar{X}-\bar{Y}\right)^2}{S_p^2} \tag{9.2}$$

32 https://online.stat.psu.edu/stat500/lesson/10/10.2/10.2.1
If we had used ANOVA, the F-statistic would have been exactly equal to the t² computed using Eq. (9.2); a quick numerical check of this equality is sketched after the list below. A careful inspection of Eq. (9.2) reveals a couple of things:

1. The numerator measures the variation between the groups (known as fit).

2. The denominator measures the variation within groups (known as residual), see Eq. (7.3).

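This relationship is easy to confirm numerically. The following minimal sketch (with two small hypothetical groups of equal size) computes the pooled two-sample t from Eq. (9.1) and the one-way ANOVA F from Eq. (9.4) directly from their definitions and shows that F = t²:

import numpy as np

#two hypothetical groups of equal size
x = np.array([4.1, 5.0, 4.6, 5.3])
y = np.array([5.8, 6.1, 5.5, 6.6])
n = len(x)

#pooled two-sample t, Eq. (9.1)
sp2 = (np.var(x, ddof=1) + np.var(y, ddof=1)) / 2 #pooled variance (equal sample sizes)
t = (x.mean() - y.mean()) / np.sqrt(sp2*(1/n + 1/n))

#one-way ANOVA F, Eq. (9.4), for the same two groups
grand = np.concatenate([x, y]).mean()
ss_tr = n*(x.mean() - grand)**2 + n*(y.mean() - grand)**2            #df = 1
ss_error = (n - 1)*np.var(x, ddof=1) + (n - 1)*np.var(y, ddof=1)     #df = 2n - 2
F = (ss_tr/1) / (ss_error/(2*n - 2))

print(round(t**2, 10), round(F, 10)) #identical values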
The null- and alternative-hypothesis for ANOVA are:

$$H_0: \mu_1=\mu_2=\dots=\mu_k \qquad H_a: \text{at least two of the } \mu\text{'s are different} \tag{9.3}$$

Therefore the basic idea is, to test H0, we simply compare the variation between the means of the
groups with the variation within groups. A graphical example adapted from Peck et al. (2016) can
cement our understanding:

It is clearly seen from Fig. (9.1-A) that H0 can be rejected, as the means of the 3 samples are different: the variability within each sample is smaller than the differences between the sample means.

Fig 9.1-A: A dataset with small within-sample variability

In Fig. (9.1-B), the differences between the sample means are the same as in Fig. (9.1-A); however, there is considerable overlap between the samples. Therefore, the difference between the sample means could simply be due to sampling variability rather than differences in the population means.

Fig 9.1-B: A dataset with high within-sample variability

Computing the statistics:

Let k be the number of populations being compared [in Fig. (9.1) k=3] and n1, n2, …, nk be the sample sizes:

1. Total number of observations:

N = n1 + n2 + …+ nk

2. Grand total (the sum of all observations):


$$T = \sum_{j=1}^{k}\sum_{i=1}^{n_j} X_{j,i}$$

3. Grand mean (average of all observations):

$$\bar{x} = \frac{T}{N}$$

4. Sum of squares of treatment:

$$SS_{TR} = n_1\left(\bar{x}_1-\bar{x}\right)^2 + n_2\left(\bar{x}_2-\bar{x}\right)^2 + \dots + n_k\left(\bar{x}_k-\bar{x}\right)^2$$

where df=k-1

5. Sum of squares of error:


$$SS_{Error} = (n_1-1)s_1^2 + (n_2-1)s_2^2 + \dots + (n_k-1)s_k^2$$

where df = N-k

6. Mean squares:

$$MS_{TR} = \frac{SS_{TR}}{k-1} \quad\text{and}\quad MS_{Error} = \frac{SS_{Error}}{N-k}$$

The test statistics:

$$F = \frac{MS_{TR}}{MS_{Error}} \tag{9.4}$$

with df1=k-1 and df2=N-k.

Before proceeding with an example on ANOVA, let’s further investigate Eq. (9.4). Remember that F
distribution is the ratio of independent chi-square random variables and is given with the following
equation:

$$F = \frac{U/m}{V/n} \tag{9.5}$$

where U and V are independent chi-square random variables with m and n degrees of freedom.
The following theorem establishes the link between Eqs. (9.4 & 9.5):

Theorem: Let Y1, Y2, …, Yn be random sample from a normal distribution with mean μ and variance σ2.
Then,

$$\frac{(n-1)S^2}{\sigma^2} = \frac{1}{\sigma^2}\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2 \tag{9.6}$$

has a chi-square distribution with n-1 degrees of freedom. A proof of Eq. (9.6) is given by Larsen &
Marx (2011) and is beyond the scope of this study.

Using Eq. (9.6), it is now easy to see that when the sum of squares of treatment (or error) is divided by σ²,
it has a chi-square distribution. Therefore Eq. (9.4) is indeed of the same form as Eq. (9.5) and thus
follows an F distribution with df1=k-1 and df2=N-k.

Example 9.1
In most of the integrated circuit manufacturing, a plasma etching process is widely used to remove
unwanted material from the wafers which are coated with a layer of material, such as silicon dioxide. A
process engineer is interested in investigating the relationship between the radio frequency power and
the etch rate. The etch rate data (in Å/min) from a plasma etching experiment is given below:

160 W    180 W    200 W    220 W
575      565      600      725
542      593      651      700
530      590      610      715
539      579      637      685
570      610      629      710

Does the RF power affect etching rate? (Adapted from Montgomery 2012)
Solution:

Before attempting any numerical solution, let’s first visualize the data using box-whisker plot generated
with a Python script:

Script 9.1
import scisuit.plot as plt

rf_160 = [575, 542, 530, 539, 570]
rf_180 = [565, 593, 590, 579, 610]
rf_200 = [600, 651, 610, 637, 629]
rf_220 = [725, 700, 715, 685, 710]

for dt in [rf_160, rf_180, rf_200, rf_220]:
    #recover the variable name (e.g. "rf_160") to use as the box label
    _name = [k for k, v in locals().items() if v == dt][0]
    plt.boxplot(data=dt, label=_name)

plt.show()

It is immediately seen from the figure that μ220 is considerably different from the other means. It can thus be inferred that the null hypothesis will be rejected, since H0 claims μ160=μ180=μ200=μ220.

Fig 9.2: The etch rate data at different RFs

Before using scisuit’s


scisuit built-in function, let’s compute F-value using a Python script so that above-
shown steps to calculate test statistics become clearer.
Script 9.2
import numpy as np
from scisuit.stats import qf

#create a 2D array
data = np.array([rf_160, rf_180, rf_200, rf_220]) #see Script (9.1)

#compute grand mean
grandmean = np.mean(data)

ss_tr, ss_error = 0, 0
for dt in data:
    n = len(dt) #size of each sample
    ss_tr += n*(np.mean(dt)-grandmean)**2
    ss_error += (n-1)*np.var(dt, ddof=1) #note ddof=1, the sample variance

row, col = data.shape
df_tr = row - 1
df_error = row*(col - 1)

Fvalue = (ss_tr/df_tr) / (ss_error/df_error)
Fcritical = qf(1-0.05, df1=df_tr, df2=df_error)

print(f"F={Fvalue}, F-critical={Fcritical}")
F=66.8, F-critical=3.24
Since the computed F-value is considerably greater than F-critical, we can safely reject H0. Using
scisuit's built-in aov function:

Script 9.3
from scisuit.stats import aov

aovresult = aov(rf_160, rf_180, rf_200, rf_220)
print(aovresult)
One-Way ANOVA Results
Source df SS MS F p-value
Treatment 3 66870.55 22290.18 66.80 2.8829e-09
Error 16 5339.20 333.70
Total 19 72209.75
Since p<0.05, we can reject H0 in favor of H1.

Now, had we not plotted Fig. (9.2), we would not be able to see why H0 has been rejected. As a matter
of fact, due to overlap in whiskers and boxes or the presence of outliers, a box-whisker plot does
not always clearly show whether H0 will be rejected. Therefore, we need to use post hoc tests along
with ANOVA. There are several tests33 for this purpose; here we will be using Tukey's test34.
Continuing from Script (9.3):

from scisuit.stats import tukey #assuming tukey is exported by scisuit.stats, like aov

tukresult = tukey(alpha=0.05, aovresult=aovresult)
print(tukresult)
Tukey Test Results (alpha=0.05)
Pairwise Diff i-j Interval
1-2 -36.20 (-69.25, -3.15)
1-3 -74.20 (-107.25, -41.15)
1-4 -155.80 (-188.85, -122.75)
2-3 -38.00 (-71.05, -4.95)
2-4 -119.60 (-152.65, -86.55)
3-4 -81.60 (-114.65, -48.55)
Since none of the intervals contains the value 0.0, the Tukey procedure shows that the means of all pairs are
significantly different. Thus it can be concluded that each power level has an effect on etch rate that is
different from the other power levels.

9.2. Two-Way ANOVA

In one-way ANOVA, the populations were classified according to a single factor; whereas in two-way ANOVA, as the name implies, there are two factors, each with a number of levels. For example, a baker might choose 3 different baking temperatures (150, 175, 200°C) and 2 different baking times (45 and 60 min) to optimize a cake recipe. In this example we have two factors (baking temperature and time), each with a different number of levels (Devore et al. 2021; Moore et al. 2009).

Moore et al. (2009) lists the following advantages for using two-way ANOVA:

1. It is more efficient (i.e., less costly) to study two factors rather than each separately,
2. The variation in residuals can be decreased by the inclusion of a second factor,
3. Interactions between factors can be explored.

33 https://en.wikipedia.org/wiki/Post_hoc_analysis
34 https://en.wikipedia.org/wiki/Tukey%27s_range_test
In order to analyze a data set with two-way ANOVA the following assumptions must be satisfied (Field
2024; Moore 2012):

1. The response variable must be continuous (e.g., weight, height, yield, … ),


2. The two independent variables must consist of discrete levels (e.g., type of treatment, brand of
product) and each factor must have at least two levels,
3. In order to analyze interaction effects between independent variables, there should be replicates,
4. The observations must be independent,
5. It is desirable that the design should be balanced.

Let's start from #5 and take a look at what balanced and unbalanced mean. In ANOVA or design of
experiments, a balanced design has an equal number of observations for all possible combinations of
factor levels. For example35, assume that the independent variables are A, B and C, each with 2 levels. Table
(9.1) shows a balanced design whereas Table (9.2) shows an unbalanced design of the same factors
(since the combination [1, 0, 0] is missing).

Table 9.1: Balanced Design          Table 9.2: Unbalanced Design

A  B  C                             A  B  C
0  0  0                             0  0  0
0  0  1                             0  1  0
0  1  0                             0  1  0
0  1  1                             0  0  1
1  0  0                             0  1  0
1  0  1                             1  0  1
1  1  0                             1  1  0
1  1  1                             1  1  1

Note that if Table (9.1) were re-designed such that each row displayed a factor level (0 or 1) and each
column displayed a factor (A, B or C), then there would be no empty cells in that table. If the data
include multiple observations for each treatment, the design includes replication.

35 https://support.minitab.com/en-us/minitab/help-and-how-to/statistical-modeling/anova/supporting-topics/anova-
models/balanced-and-unbalanced-designs/
Example 9.2
A study by Moore and Eddleman36 (1991) investigated the removal of marks made by erasable pens on
cotton and cotton/polyester fabrics. The following data compare three different pens and four different
wash treatments with respect to their ability to remove marks from the fabric. The response variable is based on the
color change, and the lower the value, the more marks were removed.

Table 9.3: Effect of washing treatment and different pen brands on color change

         Wash 1   Wash 2   Wash 3   Wash 4
Pen #1   0.97     0.48     0.48     0.46
Pen #2   0.77     0.14     0.22     0.25
Pen #3   0.67     0.39     0.57     0.19

Is there any difference in color change due either to different brands of pen or to the different washing
treatments? (Adapted from Devore et al. 2021)

Solution:

The data satisfies the requirements to be analyzed with two-factor ANOVA, since:

1. There are two independent factors (pen brands and washing treatment),
2. The independent variables consist of discrete levels (e.g., brand #1, #2 and #3)
3. There are no empty cells (data is balanced),
4. There are no replicates (interaction cannot be explored),
5. Observations are independent.

Once a table similar to Table (9.3) is prepared, finding the F-values for both factors is fairly
straightforward if spreadsheet software is used.

Grand mean (T) = 0.466

36 Moore MA, Eddleman VL (1991). An Assessment of the Effects of Treatment, Time, and Heat on the Removal of
Erasable Pen Marks from Cotton and Cotton/Polyester Blend Fabrics. J. Test. Eval.. 19(5): 394-397
Averages of treatments (μtreatments) = [0.803, 0.337, 0.423, 0.3]

$$SS_{treatment} = \sum_{i=1}^{4}\left(\mu_{treatments}[i]-T\right)^2\times 3 = 0.48 \quad\text{and}\quad MS_{treatment} = \frac{SS_{treatment}}{df} = \frac{0.48}{4-1} = 0.16$$

Averages of brands (μbrands) = [0.598, 0.345, 0.455]

$$SS_{brand} = \sum_{i=1}^{3}\left(\mu_{brands}[i]-T\right)^2\times 4 = 0.128 \quad\text{and}\quad MS_{brand} = \frac{SS_{brand}}{df} = \frac{0.128}{3-1} = 0.06$$

$$SS_{Error} = \sum_i\sum_j\left(x_{ij}-T\right)^2 - SS_{treatment} - SS_{brand} = 0.087 \quad\text{and}\quad MS_{Error} = \frac{SS_{Error}}{df} = \frac{0.087}{(3-1)\times(4-1)} = 0.014$$

$$F_{treatment} = \frac{MS_{treatment}}{MS_{Error}} = \frac{0.16}{0.014} = 11.05$$

$$F_{brand} = \frac{MS_{brand}}{MS_{Error}} = \frac{0.06}{0.014} = 4.15$$

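The hand computation above can be verified with a short numpy sketch (a minimal check, not scisuit's implementation), where the array simply re-enters Table (9.3) with rows as pen brands and columns as wash treatments:

import numpy as np

#Table (9.3): rows = pen brands, columns = wash treatments
data = np.array([[0.97, 0.48, 0.48, 0.46],
                 [0.77, 0.14, 0.22, 0.25],
                 [0.67, 0.39, 0.57, 0.19]])

grand = data.mean()
wash_means = data.mean(axis=0) #treatment (column) averages
pen_means = data.mean(axis=1)  #brand (row) averages

ss_wash = 3*np.sum((wash_means - grand)**2)                 #df = 4 - 1
ss_pen = 4*np.sum((pen_means - grand)**2)                   #df = 3 - 1
ss_error = np.sum((data - grand)**2) - ss_wash - ss_pen     #df = (3-1)*(4-1)

ms_wash, ms_pen, ms_error = ss_wash/3, ss_pen/2, ss_error/6
print(ms_wash/ms_error, ms_pen/ms_error) #F values, roughly 11.0 and 4.4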
Although the solution is straightforward, it is still cumbersome and error-prone; therefore, it is best to
use functions dedicated for this purpose:

Script 9.4
from scisuit.stats import aov2

brand = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
treatment = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
removal = [0.97, 0.48, 0.48, 0.46, 0.77, 0.14, 0.22, 0.25, 0.67, 0.39, 0.57, 0.19]

result = aov2(y=removal, x1=treatment, x2=brand)
print(result)
Two-way ANOVA Results
Source df SS MS F p-value
x1 3 0.48 0.16 11.05 7.40e-03
x2 2 0.13 0.06 4.43 6.58e-02

Unlike Example (9.2), in which the data do not have replicates, the following example will
demonstrate a data set that has replicates. It should be noted that when replicates are involved the
solution becomes slightly more tedious; therefore, the following example will be solved directly
using scisuit's built-in function. Interested readers can consult textbooks (Devore et al. 2021) for a
detailed solution.

Example 9.3
A process engineer is testing the effect of catalyst type (A, B, C) and reaction temperature (high,
medium, low) on the yield of a chemical reaction. She designs an experiment with 3 replicates for each
combination as shown in the following data. Do both catalyst type and reaction temperature have an
effect on the reaction yield?

Catalyst = [A, A, A, A, A, A, A, A, A, B, B, B, B, B, B, B, B, B, C, C, C, C, C, C, C, C, C]
Temperature = [L, L, L, M, M, M, H, H, H, L, L, L, M, M, M, H, H, H, L, L, L, M, M, M, H, H, H]
%Yield = [85, 88, 90, 80, 82, 84, 75, 78, 77, 90, 92, 91, 85, 87, 89, 80, 83, 82, 88, 90, 91, 84, 86, 85, 79, 80, 81]

Solution:

If one wishes to use a spreadsheet for the solution, a table of averages needs to be prepared as shown
below:

Table 9.4: Effect of temperature and catalyst type on reaction yield

               Temperature
Catalyst       L          M         H
A              87.667     82        76.667
B              91         87        81.667
C              89.667     85        80

After preparing the above-shown table, a methodology similar to Example (9.2) can be followed.
Let's solve the question directly by using scisuit's built-in function:

Script 9.5
from scisuit.stats import aov2

Catalyst = ["A", "A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "B", "B", "B",
"C", "C", "C", "C", "C", "C", "C", "C", "C"]

Temperature = ["L", "L", "L", "M", "M", "M", "H", "H", "H",
"L", "L", "L", "M", "M", "M", "H", "H", "H",
"L", "L", "L", "M", "M", "M", "H", "H", "H"]

Yield = [85, 88, 90, 80, 82, 84, 75, 78, 77, 90, 92, 91,
85, 87, 89, 80, 83, 82, 88, 90, 91, 84, 86, 85, 79, 80, 81]

result = aov2(y=Yield, x1=Temperature, x2=Catalyst)
print(result)
Two-way ANOVA Results
Source df SS MS F p-value
x1 2 450.30 225.15 83.27 7.9886e-10
x2 2 90.74 45.37 16.78 7.7004e-05
x1*x2 4 3.04 0.76 0.28 8.8654e-01
From the ANOVA results, it is seen that both temperature and catalyst have a significant (p<0.05) effect
on the reaction yield, whereas the interaction term (x1*x2) is not significant (p>0.05).
10. Linear Regression
Based on the amount of error associated with data, there are two general approaches for curve fitting
(Chapra & Canale 2013):

1. Regression: When data shows a significant degree of error or “noise” (generally originates from
experimental measurements), we want a curve that represents the general trend of the data.

2. Interpolation: When the noise in data can be ignored (generally originates from tables), we
would like a curve(s) that pass directly through each of the data points.

In terms of mathematical expressions, interpolation (Eq. 10.1) and regression (Eq. 10.2) can be shown
as follows:

Y =f ( X ) (10.1)

Y =f ( X )+ϵ (10.2)

Peck et al. (2016) used the terms deterministic and probabilistic relationships for Eq. (10.1) and Eq.
(10.2), respectively. Therefore a probabilistic relationship is actually a deterministic relationship with
noise (random deviations).

To further our understanding of Eq. (10.2), a simple example from Larsen & Marx (2011) can be
helpful: Consider a tooling process where the initial weight of the sample determines the finished
weight of the steel rods. For example, in a simple experiment, if the initial weight was measured as
2.745 g then the finished weight was measured as 2.080 g. However, even if the initial weight is
controlled and is exactly 2.745 g, in reality the finished weight would fluctuate around 2.080 g;
therefore, with each x (independent variable) there is a range of possible y values (dependent
variable), which is exactly what Eq. (10.2) tells us.
10.1. Simple Linear Regression

When there is only a single explanatory (independent) variable, the model is referred to as “simple”
linear regression. Therefore, Eq. (10.2) can be expressed as:

$$Y = \beta_0 + \beta_1 x + \epsilon \tag{10.3}$$

where regardless of the x value, the random variable ε is assumed to follow a N(0, σ) distribution.

Let x* show a particular value of x, then:

$$E\left(\beta_0+\beta_1 x^*+\epsilon\right) = \beta_0+\beta_1 x^* + E(\epsilon) = \beta_0+\beta_1 x^* = \mu_{Y|x^*} \tag{10.4}$$

$$Var\left(\beta_0+\beta_1 x^*+\epsilon\right) = Var(\epsilon) = \sigma^2_{Y|x^*} \tag{10.5}$$

where the notation Y|x* should be read as the value of Y when x=x*; i.e., μY|x* is the mean value of Y when x=x*. Note also that Eq. (10.4) tells us something important: the population regression line is the line of mean values of Y.

The following assumptions are made for a linear model (Larsen & Marx, 2011):

1. fY|x(y) is a normal probability density function for all x (i.e., for a known x value, there is a
probability density function associated with y values)

2. The standard deviations, σ, of the y-values are the same for all x values.

3. For all x-values, the distributions associated with fY|x(y) are independent.

Example 10.1
Suppose that the relationship between applied stress (x) and time to fracture (y) is given by the simple
linear regression model with β0=65, β1=-1.2, and σ=8. What is the probability of getting a fracture value
greater than 50 when the applied stress is 20? (Adapted from Devore et al. 2021)

Solution:

Let’s compute y when x=20:

y=65−1.2 x=65−1.2×20=41
Note that if this was a curve fitting problem in nature, then whenever the stress value was 20, the
fracture time would have always been equal to 41. However, since Eq. (10.2) tells us that random
deviations are involved, this cannot be the case. We already know that the random deviations, namely ε,
follows a normal distribution. Therefore, it becomes straightforward to compute the probability:

$$P\left(Z > \frac{50-41}{8}\right) = P(Z > 1.125) = 1 - \text{pnorm}(1.125) = 0.13 \;\; \blacksquare$$

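As a quick numerical check, the same probability can be obtained with the pnorm function from scisuit.stats (assuming the mean/sd keyword arguments used elsewhere in this document):

from scisuit.stats import pnorm

#P(Y > 50) when Y ~ N(41, 8)
print(1 - pnorm(50, mean=41, sd=8)) #approximately 0.13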
In Example (10.1), the coefficients of the regression line, namely β0 and β1, were given. However, in
practice we need to estimate these coefficients. It should be noted that there are two commonly used37
methods for estimating the regression coefficients (please note that we use the word estimate):

1. Least squares estimation method,

2. Maximum likelihood estimation method.

10.1.1. Least Squares Estimation


Let (x1, y1), (x2, y2), … , (xn, yn) represent n observation pairs, from the measurement of X and Y. Our
goal is to find β0 and β1 in Eq. (10.3) such that the drawn line is as close as possible to all data points.

In Fig. (10.1), yi is the measured data point whereas yp is the predicted value, both of which correspond to xi. The error (also known as the residual) associated with this prediction is ei = yi – yp.

Since by definition we want the line as close as possible to all data points, our goal is to minimize the (squared) sum of the ei's by varying β0 and β1.

Fig 10.1: Fitting a line through a set of data points

37 https://support.minitab.com/en-us/minitab/help-and-how-to/statistical-modeling/reliability/supporting-topics/estimation-
methods/least-squares-and-maximum-likelihood-estimation-methods/
The residual sum of squares (RSS) also known as sum of squares of error (SSE):

$$RSS = \sum_{i=1}^{n} e_i^2 = e_1^2 + e_2^2 + \dots + e_n^2 \tag{10.6}$$

If the coefficients of the best line passing through the data points are β0 and β1 then:

$$L = RSS = \sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2 \tag{10.7}$$

The partial derivatives of Eq. (10.7) with respect to β0 and β1 are:


$$\frac{\partial L}{\partial \beta_0} = \sum_{i=1}^{n} -2\left(y_i - \beta_0 - \beta_1 x_i\right) = 0$$

$$\frac{\partial L}{\partial \beta_1} = \sum_{i=1}^{n} -2x_i\left(y_i - \beta_0 - \beta_1 x_i\right) = 0$$

Dropping the constant −2 from both equations and simply rearranging the terms yields:

$$\sum y_i = n\beta_0 + \beta_1\sum x_i$$

$$\sum x_i y_i = \beta_0\sum x_i + \beta_1\sum x_i^2$$

We have two equations and two unknowns, therefore it is possible to solve this system of equations.
Here, one can use the elimination method; however, Cramer’s rule provides a direct solution. Let’s
solve for β1 and leave β0 as an exercise:

$$\hat{\beta}_1 = \frac{\begin{vmatrix} n & \sum y_i \\ \sum x_i & \sum x_i y_i \end{vmatrix}}{\begin{vmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{vmatrix}}$$

Evaluating the determinants in the numerator and denominator gives:

$$\hat{\beta}_1 = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n\sum x_i^2 - \left(\sum x_i\right)^2} \tag{10.8}$$

β̂1 can be further simplified if the notations Sxx, Syy and Sxy are defined as:

$$S_{xx} = \sum\left(x_i-\bar{x}\right)^2 = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2$$

$$S_{yy} = \sum\left(y_i-\bar{y}\right)^2 = \sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2$$

$$S_{xy} = \sum\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right) = \sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)$$

Then β̂1 can be written as:

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \tag{10.9}$$

and β̂0 is equal to:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x} \tag{10.10}$$

and the estimated variance is:

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 \tag{10.11}$$

where $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i,\; i=1,2,\dots,n$

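Eqs. (10.9) and (10.10) translate directly into a few lines of numpy. The sketch below is only an illustration (the helper name and data pairs are hypothetical, not scisuit's implementation):

import numpy as np

def least_squares_line(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxx = np.sum((x - x.mean())**2)                #Sxx
    sxy = np.sum((x - x.mean())*(y - y.mean()))    #Sxy
    b1 = sxy/sxx                                   #Eq. (10.9)
    b0 = y.mean() - b1*x.mean()                    #Eq. (10.10)
    return b0, b1

print(least_squares_line([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])) #hypothetical data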

10.1.2. Maximum likelihood estimation
Before proceeding with the derivation based on maximum likelihood estimation (MLE), let’s work on a
simple example.

Example 10.2
Suppose you have been tasked with finding the probabilities of heads (H) and tails (T) for an unknown
coin. You flipped the coin 3 times and the sequence is HTH. What is the probability of heads, p? (Adapted
from Larsen & Marx)

Solution:

It makes sense to define a random variable, X, as follows:

$$X = \begin{cases} 1 & \text{heads come up} \\ 0 & \text{tails come up} \end{cases}$$

Then a probability model is defined:

$$p_X(k) = p^k(1-p)^{1-k} = \begin{cases} p & k=1 \\ 1-p & k=0 \end{cases}$$

Therefore, based on the probability model, the likelihood of the sequence HTH is:

$$L(p) = p\cdot(1-p)\cdot p = p^2(1-p)$$

Using calculus, the value that maximizes the likelihood is easily found: setting $\frac{d}{dp}\left[p^2(1-p)\right] = 2p - 3p^2 = p(2-3p) = 0$ gives p=2/3. ■

Now, instead of the sequence HTH (Example 10.2) we have data pairs (x1, y1), (x2, y2), … , (xn, yn)
obtained from a random experiment. Furthermore, it is known that the yi’s are normally distributed with
mean β0+β1xi and variance σ2 (Eqs. 10.4 & 10.5).

The equation for the normal distribution is:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2},\quad -\infty < x < \infty \tag{10.12}$$

Replacing x and μ in Eq. (10.12) with yi and Eq. (10.4), respectively, yields the probability model for a
single data pair:

$$f(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{y_i-\beta_0-\beta_1 x_i}{\sigma}\right)^2} \tag{10.13}$$
For n data pairs, the likelihood function is:

$$L = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{y_i-\beta_0-\beta_1 x_i}{\sigma}\right)^2} \tag{10.14}$$
In order to find MLE of β0 and β1 partial derivatives with respect to β0 and β1 must be taken. However,
Eq. (10.14) is not easy to work with as is. Therefore, as suggested by Larsen and Marx (2011), taking
the logarithm will make it more convenient to work with.

$$-2\ln L = n\ln(2\pi) + n\ln(\sigma^2) + \frac{1}{\sigma^2}\sum_{i=1}^{n}\left(y_i-\beta_0-\beta_1 x_i\right)^2 \tag{10.15}$$

Taking the partial derivatives of Eq. (10.15) with respect to β0 and β1 and solving the resulting set of
equations similar to as shown in section (10.1.1) will yield Eqs. (10.9 & 10.10).

10.1.3. Properties of Linear Estimators


Due to the assumptions made for a linear model (section 10.1), the estimators, β^0 , β^1 and σ,
^ are random
variables (i.e., probability distribution functions are associated with them). Then,

1. β^0 and β^1 are normally distributed.

2. β^0 and β^1 are unbiased, therefore, E ( β^0 )=β 0 and E ( β^1 )=β 1

σ2
3. Var ( β^1 )= n

∑ ( x i− x̄)2
i=1

n
σ
2
∑ x i2
4. Var ( β^0 )= n
i=1

n ∑ ( x i − x̄)2
i=1
Proof of #2:
In section (5.1.1), it was mentioned that to be an unbiased estimator, E(Θ) = θ must be satisfied. In the
case of β^1, we need to show that E ( β^1 )=β 1. If Eq. (10.8) is divided by n, the following equation is
obtained:

$$\hat{\beta}_1 = \frac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2} \tag{I}$$

Noting that $\bar{x} = \frac{\sum x_i}{n}$, Eq. (I) can be rewritten as:

$$\hat{\beta}_1 = \frac{\sum x_i y_i - \bar{x}\sum y_i}{\sum x_i^2 - n\bar{x}^2} \tag{II}$$

Rearranging the terms in the numerator:

$$\hat{\beta}_1 = \frac{\sum y_i\left(x_i-\bar{x}\right)}{\sum x_i^2 - n\bar{x}^2} \tag{III}$$

Note that due to the assumptions of the linear model, in Eq. (III) all terms except yi can be treated
as constants. Therefore, replacing the expected value of yi with Eq. (10.4) gives:

$$E(\hat{\beta}_1) = \frac{\sum\left(\beta_0+\beta_1 x_i\right)\left(x_i-\bar{x}\right)}{\sum x_i^2 - n\bar{x}^2} \tag{IV}$$

Expanding the terms in the numerator:

$$E(\hat{\beta}_1) = \frac{\beta_0\sum\left(x_i-\bar{x}\right) + \beta_1\sum\left(x_i-\bar{x}\right)x_i}{\sum x_i^2 - n\bar{x}^2} \tag{V}$$

Noting that the first term in the numerator equals 0 and the remaining term in the numerator (except
β1) equals the denominator, the proof is completed:

$$E(\hat{\beta}_1) = \beta_1 \tag{VI}$$

A similar proof can be obtained for β0. For cases #3 and #4, Larsen & Marx (2011) presented a detailed
proof.

Example 10.3
It seems logical that riskier investments might offer higher returns. A study by Statman et al. (2008)38
explored this by conducting an experiment. One group of investors rated the risk (x) of a company’s
stock on a scale from 1 to 10, while a different group rated the expected return (y) on the same scale.
This was done for 210 companies, and the average risk and return scores were calculated for each. Data
for a sample of ten companies, ordered by risk level, is given below:

x = [4.3, 4.6, 5.2, 5.3, 5.5, 5.7, 6.1, 6.3, 6.8, 7.5]
y = [7.7, 5.2, 7.9, 5.8, 7.2, 7, 5.3, 6.8, 6.6, 4.7]

How is the risk of an investment related to its expected return? (Adapted from Devore et al. 2021)

Solution:

Let’s first visualize the data using a scatter plot.

Script 10.1
import scisuit.plot as plt
x = [4.3, 4.6, 5.2, 5.3, 5.5, 5.7, 6.1, 6.3, 6.8, 7.5]
y = [7.7, 5.2, 7.9, 5.8, 7.2, 7, 5.3, 6.8, 6.6, 4.7]
plt.scatter(x=x, y=y)
plt.show()

38 Statman M, Fisher KL, Anginer D (2008). Affect in a Behavioral Asset-Pricing Model. Financial Analysts Journal,
64-2, 20-29.
It is seen that there is a weak inverse relationship between the perceived risk of a company's stock and its expected return.

Note: Since scisuit's charts are interactive, the trendline was added after plotting by first selecting the data and then selecting the "Add trendline" option.

Fig 10.2: Relationship between risk and expected return

Fig. (10.2) shows that there is no convincing relationship between the risk and expected return of an
investment. Let's check whether this is numerically the case. Continuing from Script (10.1):

Script 10.2
from scisuit.stats import linregress
result = linregress(yobs=y, factor=x)
print(result)
Simple Linear Regression
F=1.85, p-value=0.211, R2=0.19

The regression equation: Y = 9.235 - 0.491·X

Predictor Coeff StdError T p-value


Intercept 9.235 2.10 4.40 0.0023
Slope -0.491 0.36 -1.36 0.2110
Since p>0.05, we cannot reject the null hypothesis (H0: β1=0) in favor of H1.

Have we carried out a reliable analysis, i.e., is there really no relationship between risk and expected returns?
Devore et al. (2021) suggested that with a small number of observations it is possible not to detect a
relationship, because when the sample size is small hypothesis tests do not have much power. Also note
that the original study used 210 observations, and Statman et al. (2008) concluded that risk is a useful
predictor of expected return, although risk accounted for only 19% of the variation in expected returns. ■

10.2. Multiple Linear Regression

Suppose the taste of a fruit juice is related to sugar content and pH. We wish to establish an empirical
model, which can be described as follows:

y=β 0 + β 1 x 1 + β 2 x 2 +ϵ (10.16)

where y is the response variable (taste) and x1 and x2 are independent variables (sugar content and pH).
Unlike the simple linear regression (SLR) model, where only one independent variable exists, in multiple
linear regression (MLR) problems at least 2 independent variables are of interest to us. Therefore, in
general, the response variable may be related to k independent (regressor) variables. The model is:

y=β 0 + β 1 x 1 + β 2 x 2 +...+ β k x k +ϵ (10.17)

This model describes a hyperplane and the regression coefficient, βj, represents the expected change in
response to per unit change in xj when all other variables are held constant (Montgomery 2012). If one
enters the data in a spreadsheet, it would generally be in the following format:

Table 10.1: Data for multiple linear regression

y      x1      x2      …      xk
y1     x11     x12     …      x1k
y2     x21     x22     …      x2k
⋮      ⋮       ⋮              ⋮
yn     xn1     xn2     …      xnk

y is the response variable and the x's are the regressor variables. It is assumed that n > k.

The model equation for the data in Table (10.1):


$$y_i = \beta_0 + \sum_{j=1}^{k}\beta_j x_{ij} + \epsilon_i,\quad i=1,2,\dots,n \tag{10.18}$$

For example, for the 1st row (i=1) in Table (10.1), Eq. (10.18) yields, y1 = β0 + β1·x11 + β2·x12 +… +
βk·x1k.

To find the regression coefficients, we will use a similar approach presented in section (10.1.1), such
that the sum of the squares of errors, εi, is minimized. Therefore,

$$L = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\right)^2 \tag{10.19}$$

where the function L will be minimized with respect to β0, β1, …, βk, which then gives the least
squares estimators β̂0, β̂1, …, β̂k. The derivatives with respect to β0 and βj are:

$$\left.\frac{\partial L}{\partial \beta_0}\right|_{\hat{\beta}_0,\hat{\beta}_1,\dots,\hat{\beta}_k} = -2\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \sum_{j=1}^{k}\hat{\beta}_j x_{ij}\right) = 0 \tag{10.20-a}$$

$$\left.\frac{\partial L}{\partial \beta_j}\right|_{\hat{\beta}_0,\hat{\beta}_1,\dots,\hat{\beta}_k} = -2\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \sum_{j=1}^{k}\hat{\beta}_j x_{ij}\right)x_{ij} = 0 \tag{10.20-b}$$

After some algebraic manipulation, Eq. (10.20) can be written in matrix notation as follows:

$$\begin{bmatrix} n & \sum x_{i1} & \sum x_{i2} & \dots & \sum x_{ik} \\ \sum x_{i1} & \sum x_{i1}^2 & \sum x_{i1}x_{i2} & \dots & \sum x_{i1}x_{ik} \\ \vdots & \vdots & \vdots & & \vdots \\ \sum x_{ik} & \sum x_{ik}x_{i1} & \sum x_{ik}x_{i2} & \dots & \sum x_{ik}^2 \end{bmatrix} \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_k \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_{i1}y_i \\ \vdots \\ \sum x_{ik}y_i \end{bmatrix} \tag{10.21}$$
which can be condensed to the following expression:

$$X\cdot\beta = y \tag{10.22}$$

Note that since X is an n by (k+1) matrix, and therefore generally not square, its inverse does not exist and
the equation cannot be solved directly. The least-squares approach to solving Eq. (10.22) is to multiply both
sides by the transpose of X:

$$X^T X\cdot\beta = X^T\cdot y \tag{10.23}$$

The test of significance of regression involves the hypotheses:

$$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \qquad H_1: \beta_j \neq 0 \;\text{for at least one}\; j \tag{10.24}$$

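Eq. (10.23) can be solved directly with numpy by forming the design matrix X (a column of ones followed by one column per regressor) and solving the normal equations. The sketch below is a minimal illustration of the algebra (the function name and data are hypothetical, not scisuit's implementation):

import numpy as np

def mlr_coefficients(y, *factors):
    y = np.asarray(y, float)
    #design matrix: intercept column plus one column per regressor
    X = np.column_stack([np.ones(len(y))] + [np.asarray(f, float) for f in factors])
    #solve (X'X) beta = X'y, Eq. (10.23)
    return np.linalg.solve(X.T @ X, X.T @ y)

beta = mlr_coefficients([3.1, 4.0, 5.2, 6.1], [1, 2, 3, 4], [0, 1, 0, 1]) #hypothetical data
print(beta) #[intercept, coefficient of x1, coefficient of x2]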
Example 10.4
A process engineer, tasked with improving the viscosity of a polymer, chose two process variables from
among several factors: reaction temperature and feed rate. She ran 16 experiments and collected
the following data:

Temperature = [80, 93, 100, 82, 90, 99, 81, 96, 94, 93, 97, 95, 100, 85, 86, 87]

Feed Rate = [8, 9, 10, 12, 11, 8, 8, 10, 12, 11, 13, 11, 8, 12, 9, 12]

Viscosity = [2256, 2340, 2426, 2293, 2330, 2368, 2250, 2409, 2364, 2379, 2440, 2364, 2404, 2317, 2309, 2328]

Explain the effect of feed rate and temperature on polymer viscosity. (Adapted from Montgomery 2012).

Solution:

The solution involves several computations which can be performed by using a spreadsheet or by using
Python with the numpy library (a minimal sketch of the normal-equations approach is given after Eq. 10.24).
A step-by-step solution for the coefficients can be found in the textbook by Montgomery (2012). We will
skip these steps and solve the problem directly using scisuit's built-in linregress function.

Script 10.3
from scisuit.stats import linregress

#input values
temperature = [80, 93, 100, 82, 90, 99, 81, 96, 94, 93, 97, 95, 100, 85, 86, 87]
feedrate = [8, 9, 10, 12, 11, 8, 8, 10, 12, 11, 13, 11, 8, 12, 9, 12]
viscosity = [2256, 2340, 2426, 2293, 2330, 2368, 2250, 2409, 2364, 2379, 2440, 2364, 2404, 2317, 2309, 2328]

#note the order of input to factor
result = linregress(yobs=viscosity, factor=[temperature, feedrate])
print(result)
Multiple Linear Regression
F=82.5, p-value=4.0997e-08, R2=0.93

Predictor Coeff StdError T p-value


X0 1566.078 61.59 25.43 9.504e-14
X1 7.621 0.62 12.32 3.002e-09
X2 8.585 2.44 3.52 3.092e-03

Based on Eq. (10.24), the p-value tells us that at least one of the two variables (temperature and feed
rate) has a nonzero regression coefficient. Furthermore, analysis of the individual regression coefficients
shows that both temperature and feed rate have an effect on the polymer's viscosity.

According to Larsen & Marx (2011), applied statisticians find residual plots to be very helpful in
assessing the appropriateness of fitting. Continuing from Script (10.3), let’s plot the residuals:

Script 10.4
import scisuit.plot as plt
import scisuit.plot.gdi as gdi

#x=Fits, y=Residuals
plt.scatter(x=result.Fits, y= result.Residuals)

#show a line at y=0
x0, x1 = min(result.Fits), max(result.Fits)*1.005
gdi.line(p1=(x0,0), p2=(x1, 0), lw=2, ls = "---" )

plt.show()
It is seen that the magnitudes of the residuals are comparable and that they are randomly distributed. Therefore, the applied regression can be considered appropriate.

Fig 10.3: Fits vs residuals (y-axis)


11. Exploring Normality
Most statistical methods rest on one basic assumption: that the observations come from normally
distributed populations. Thus it is important to check the normality assumption (Das and Rahmatullah
Imon, 2016). There are commonly two approaches to check normality:

1. Graphical Tests
   a) Histogram
   b) Box and Whisker Plot
   c) Normal Percent-Percent Plot

2. Analytical Test Procedures
   a) Kolmogorov-Smirnov Test
   b) Shapiro-Wilk Test
   c) Anderson-Darling Test

11.1. Graphical Tests

Let's start with the histogram, the easiest and simplest (in terms of interpretation) plot. A histogram
provides a visual representation of the distribution of quantitative data. Before attempting to plot a
histogram, let's first generate random data from normal and exponential (highly skewed) distributions.

Script 11.1
import scisuit.plot as plt
from scisuit.stats import rnorm, rexp

n=1000
dt_norm, dt_exp = rnorm(n), rexp(n)

Let’s see how we could visualize the data from Script (11.1) by histogram:

Script 11.2
plt.layout(1,2)
plt.subplot(0,0)
plt.hist(dt_norm, density=True)
plt.title("Normal")

plt.subplot(0, 1)
plt.hist(dt_exp, density=True)
plt.title("Exponential")

plt.show(antialiasing=True)
Fig 11.1: Histogram of normal and exponential distributions
It is clearly seen from Fig. (11.1) that the normal data have a nearly bell-shaped distribution whereas
the exponential data are highly skewed to the right.

The box-whisker plot is also known as the five-number summary, since it displays the 1st quartile, median,
3rd quartile, minimum and maximum values. Continuing from Script (11.1):

Script 11.3
plt.boxplot(dt_norm)
plt.title("Normal")

plt.boxplot(dt_exp)
plt.title("Exponential")

plt.show(antialiasing=True)
The following observations can be made.

Normal distribution:

1. The mean marker (x) and the median line almost overlap.
2. The mean divides the box into two roughly equal halves.
3. |Q1-min| is more or less equivalent to |Q3-max|.

Exponential distribution:

1. There is a clear separation between the mean marker and the median line.
2. Outliers exist.
3. The mean marker is considerably closer to Q3.
4. |Q1-min| is clearly different from |Q3-max|.

Fig 11.2: Box-whisker plots of normal and exponential distributions (n=100)
A Q–Q plot (quantile–quantile plot) is a graphical method for comparing two probability distributions by
plotting their quantiles against each other39. A normal Q-Q plot is obtained by plotting the quantiles of the
data against the quantiles of the normal distribution. If the data come from a normal distribution, the points
align along a straight line. To visualize the Q-Q plot, we will slightly modify Script (11.2), replacing hist
with the qqnorm function. Continuing from Script (11.1) and applying the following changes, we should
obtain Fig. (11.3):

Script 11.4
plt.qqnorm(dt_norm)
plt.qqnorm(dt_exp)

39 https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot
Fig 11.3: QQ plot of normal and exponential distributions (n=250)
It is seen from Fig. (11.3) that the data coming from a normal distribution align well with the straight
line whereas the data from the exponential distribution show apparent deviations.
11.2. Analytical Test Procedures

11.2.1. Kolmogorov - Smirnov test


The Kolmogorov-Smirnov test was first derived by Kolmogorov40 (1933) and more than a decade later
modified by Smirnov41 (1948). It is also known as the KS test and is used to test whether a sample comes
from a population with a specific distribution (NIST 2024)42. The KS test is defined by:

H0: The data follows a specific distribution, i.e. normal distribution


H1: The data does not follow the specific distribution

The test statistic is:

$$D = \sup_x \left|F_n(x) - F(x)\right| \tag{11.1}$$

where F(X) is the theoretical cumulative distribution function (must be a continuous distribution and
must be fully specified) of the normal distribution and Fn(X) is the empirical CDF of the data.

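Eq. (11.1) can also be evaluated manually: the ECDF jumps from (i−1)/n to i/n at each ordered observation, so the sup-distance is the larger of the two one-sided maxima. The sketch below is only an illustration (the helper name is hypothetical) and uses pnorm from scisuit.stats as the theoretical normal CDF:

import numpy as np
from scisuit.stats import pnorm

def ks_distance(data, mean, sd):
    x = np.sort(np.asarray(data, float))
    n = len(x)
    cdf = np.asarray(pnorm(x, mean=mean, sd=sd))   #theoretical CDF at the ordered data
    d_plus = np.max(np.arange(1, n + 1)/n - cdf)   #ECDF above the CDF
    d_minus = np.max(cdf - np.arange(0, n)/n)      #ECDF below the CDF
    return max(d_plus, d_minus)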
Example 11.1
Does the following data come from a normal distribution?

[2.39798, -0.16255, 0.54605, 0.68578, -0.78007, 1.34234, 1.53208, -0.86899, -0.50855, -0.58256, -0.54597, 0.08503,
0.38337, 0.26072, 0.34729]

Solution:

We will write a script to run the KS test and at the same time visualize the CDF and ECDF. The
following script performs both tasks:

Script 11.5
import numpy as np
import scisuit.plot as plt
from scisuit.stats import ks_1samp, pnorm

data = [2.39798, -0.16255, 0.54605, 0.68578, -0.78007, 1.34234, 1.53208, -0.86899, -0.50855,
-0.58256, -0.54597, 0.08503, 0.38337, 0.26072, 0.34729]
40 Kolmogorov A (1933). ‘‘Sulla determinazione empirica di una legge di distribuzione.’’ G. Ist. Ital. Attuari, 4, 83–91
41 Smirnov N (1948). ‘‘Table for estimating the goodness of fit of empirical distributions.’’ Annals of Mathematical
Statistics, 19(2): 279–281.
42 https://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm
mu, sd = np.mean(data), np.std(data)

"""Analytic test"""
result = ks_1samp(x=data, cdf=pnorm, args=( mu, sd))
print(result)

""" Visualization """


# Sort the data and compute ECDF
data_sorted = np.sort(data)
ecdf_y = np.arange(1, len(data) + 1) / len(data)
plt.scatter(x=data_sorted, y=ecdf_y, label="ECDF")

# Theoretical CDF for normal distribution


x = np.linspace(min(data), max(data), 100)
cdf_y = pnorm(x, mean=mu, sd=sd)
plt.scatter(x=x, y=cdf_y, label="CDF")

plt.legend(nrows=2)
plt.show()
Kolmogorov-Smirnov test
p-value: 0.885
Test statistic: 0.1414 and its sign 1
Max distance at: -0.50855

It is seen that the maximum vertical distance between the CDF and the ECDF occurs at around x = -0.5, as confirmed by the ks_1samp output (-0.508).

The sign of the test is +1, which indicates that the ECDF is above the theoretical CDF, as can be seen from Fig. (11.4).

The test statistic is the vertical distance between the CDF and the ECDF at that point. The values of the ECDF and CDF at the maximum distance are roughly 0.334 and 0.192, respectively; the difference is approximately 0.142.

Fig 11.4: CDF and ECDF of given data


11.2.2. Shapiro-Wilk Test
The Shapiro-Wilk test, introduced by Shapiro and Wilk (1965), is among the most popular tests and is
particularly powerful for small to medium sample sizes. Similar to the Kolmogorov-Smirnov test, the null
and alternative hypotheses are:
and alternative hypotheses are:

H0: The data follows normal distribution


H1: The data does not follow the normal distribution

The test statistic is:

$$W = \frac{\left(\sum a_i\, y_{(i)}\right)^2}{\sum \left(y_i - \bar{y}\right)^2} \tag{11.2}$$

where y(i) denotes the i-th smallest value (order statistic) of the sample.

Example 11.2
Compute the test statistic (W) for the following data (from Shapiro and Wilk, 1965):

[6, 1, -4, 8, -2, 5, 0]

Solution:

1) The coefficients, ai, in Eq. (11.2) is given by Shapiro and Wilk (1965) as:
a = [0.6233, 0.3031, 0.1401, 0.0, 0.1401, 0.3031, 0.6233]

2) The sorted sequence is: [-4, -2, 0, 1, 5, 6, 8]

3) In Shapiro and Wilk (1965) paper, the numerator in Eq. (11.2) is computed as follows (note that the
indices start from 1):

• If the number of samples (n) is even, then n=2k and

$$b = \sum_{i=1}^{k} a_{n-i+1}\,\left(y_{n-i+1} - y_{i}\right)$$

• If n is odd, then n=2k+1 and the computation of b is the same as above (the middle observation, whose coefficient is 0, does not contribute).

Since n=7 for the example data, k=3 and b is computed as follows:

$$b = \sum_{i=1}^{3} a_{7-i+1}\,\left(y_{7-i+1} - y_{i}\right)$$
Let’s automate these 3 steps using a Python script:

Script 11.6
import numpy as np
from scisuit.stats import shapiro

arr = np.array([6, 1, -4, 8, -2, 5, 0])


sorted_x = np.sort(arr)

n = len(arr)

a = np.array([0.6233, 0.3031, 0.1401, 0.0, 0.1401, 0.3031, 0.6233])

k = n/2 if n%2 == 0 else (n-1)/2

b=0
for i in range(int(k)):
b += a[n-i-1]*(sorted_x[n-i-1]- sorted_x[i])

W_test_stat = b**2/(np.var(arr)*n) #np.var (ddof=0) times n gives the sum of squared deviations
print(f"Test statistic: {W_test_stat}")

result = shapiro(arr)
print(result)
Test statistic: 0.95308

Shapiro-Wilk Test
p-value: 0.7612
Test statistic: 0.9535

One way to visualize the Shapiro-Wilk test is through a QQ plot. If the points lie on a straight line, it
indicates that the data is approximately normally distributed, matching the expectation used in the
Shapiro-Wilk test. Therefore,

• When 𝑊 is close to 1.0, the sample data aligns closely with the expected normal distribution,
indicating that the data is likely normal.

• When 𝑊 deviates significantly from 1.0, it suggests that the data does not follow a normal
distribution.

Continuing from Script (11.6):


Script 11.7
import scisuit.plot as plt
plt.qqnorm(arr)
plt.show()

It is seen that the data aligns itself well with the straight line
which is why the test statistic is close to 1.0.

Fig 11.5: QQ plot of given data

11.2.3. Anderson-Darling Test


The Anderson-Darling test is similar to the KS test; however, it gives more weight to the tails (Anon.
2024)43. Similar to the KS and Shapiro-Wilk tests, the Anderson-Darling hypotheses are defined as:

H0: The data follows specified distribution


H1: The data does not follow the specified distribution

The test statistic is:

$$A^2 = -n - S \tag{11.3}$$

where

$$S = \sum_{i=1}^{n}\frac{2i-1}{n}\left[\ln F(Y_i) + \ln\left(1 - F(Y_{n+1-i})\right)\right] \tag{11.4}$$

where F is the CDF of the specified distribution and the Yi are the ordered data.

43 https://www.itl.nist.gov/div898/handbook/eda/section3//eda35e.htm
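Eqs. (11.3) and (11.4) can be computed in a few lines. The sketch below is only an illustration (the helper name is hypothetical); it uses pnorm from scisuit.stats with parameters estimated from the data, and statistical packages often apply additional finite-sample corrections, so the value may differ slightly from the anderson output:

import numpy as np
from scisuit.stats import pnorm

def anderson_darling_A2(data):
    y = np.sort(np.asarray(data, float))
    n = len(y)
    F = np.asarray(pnorm(y, mean=y.mean(), sd=y.std(ddof=1))) #fitted normal CDF
    i = np.arange(1, n + 1)
    S = np.sum((2*i - 1)/n * (np.log(F) + np.log(1 - F[::-1]))) #Eq. (11.4)
    return -n - S #Eq. (11.3)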
Example 11.3
Test whether the above-given normality tests can detect that data drawn from a t-distribution do not come from a standard normal distribution (see Fig. 4.11).

Solution:

Let's first briefly remind ourselves of the similarities and differences between the t-distribution and the
standard normal distribution:

1) Both distributions are symmetric.

2) The t-distribution is characterized by the degrees of freedom (df). As df increases, the t-distribution
becomes more similar to a normal distribution.

3) The curves of the t-distribution with larger df are taller and have thinner tails.

Script 11.8
from scisuit.stats import rt, anderson, ks_1samp, shapiro

n=50

data = rt(n=n, df=3)


for func in [anderson, ks_1samp, shapiro]:
print(func(data))
print("\n")
Anderson-Darling Test
p-value: 0.0
Test statistic: 2.3554

Kolmogorov-Smirnov test
p-value: 0.645
Test statistic: 0.1014 and its sign 1
Max distance at: 0.1982

Shapiro-Wilk Test
p-value: 0.0
Test statistic: 0.8508
It is seen that, unlike the KS test, both the Shapiro-Wilk and Anderson-Darling tests can detect the difference
between the t-distribution (with small degrees of freedom) and the standard normal distribution. However,
when df=10, all of the above-mentioned tests yielded a p-value greater than 0.05.
11.2.4. Summary
The above-mentioned tests are sensitive to sample size: if n<20 it can be difficult to detect
deviations from normality, whereas if n>5000 even minor departures from normality may be flagged as
statistically significant (Anon. 2024)44. As a rule of thumb, sample sizes between 30 and 300 observations
are recommended for reliable normality assessment.

Table 11.1: Comparison of normality tests

Test                   Sample Size                         Strengths
Anderson-Darling       Small to large                      Focuses on the tails of the distribution and is applicable to various distributions.
Kolmogorov-Smirnov     Large                               Focuses on the entire distribution and is considered a general-purpose test.
Shapiro-Wilk           Small to medium (generally n<50)    Focuses on the entire distribution and is powerful for small samples.

44 https://www.6sigma.us/six-sigma-in-focus/normality-test-lean-six-sigma/
References
Box GEP., Hunter WG, Hunter JS (2005). Statistics for Experimenters: Design, Innovation, and
Discovery, 2nd Ed., Wiley.
Bury K (1999). Statistical Distributions in Engineering, Cambridge University Press.
Carlton MA, Devore JL (2014). Probability with Applications in Engineering, Science and
Technology. Springer USA.
Chapra SC, Canale RP (2013). Numerical methods for engineers, seventh edition. McGraw Hill
Education.
Das KR, Rahmatullah Imon AHM (2016). A Brief Review of Tests for Normality. American Journal
of Theoretical and Applied Statistics. 5(1), 5-12.
Devore JL, Berk KN, Carlton MA (2021). Modern Mathematical Statistics with Applications. 3rd Ed.,
Springer.
Forbes C, Evans M, Hastings N, Peacock B (2011). Statistical Distributions, 4th Ed., Wiley.
Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer.
Hogg RV, McKean JW, Craig AT (2019). Introduction to mathematical statistics, 8th Ed., Pearson.
Kanji GK (2006). 100 Statistical Tests, 3rd Ed., Sage Publications.
Kreyszig E, Kreyszig H, Norminton EJ (2011). Advanced Engineering Mathematics, 10th Ed., John
Wiley & Sons Inc.
Larsen RJ, Marx ML (2011). An Introduction to Mathematical Statistics and Its Applications. 5th Ed.,
Prentice Hall.
Liben-Nowell D (2022). Connecting Discrete Mathematics and Computer Science (2nd Ed.).
Cambridge: Cambridge University Press.
Miller I, Miller M (2014). John E. Freund's Mathematical Statistics with Applications. 8th Ed., Pearson
New International Edition.
Montgomery DC (2012). Design and analysis of experiments, 8th Ed., John Wiley & Sons, Inc.
Montgomery DC, Peck EA, Vining GG (2021). Introduction to Linear Regression Analysis, 6th Ed.,
Wiley.
Moore DS, McCabe GP, Craig BA (2009). Introduction to the Practice of Statistics. 6th Ed., W. H.
Freeman and Company, New York.
Peck R, Olsen C, Devore JL (2016). Introduction to Statistics and Data Analysis. 5th Ed., Cengage
Learning.
Pinheiro, CAR, Patetta M (2021). Introduction to Statistical and Machine Learning Methods for Data
Science. Cary, NC: SAS Institute Inc.
Rinne H (2009). The Weibull Distribution A Handbook. CRC Press.
Shapiro SS, Wilk MB (1965). An Analysis of Variance Test for Normality (Complete Samples).
Biometrika, 52(3/4), 591-611.
Stahl S (2006). The Evolution of the Normal Distribution. Mathematics Magazine, 76(2), pp. 96-113.
Available at: https://www.maa.org/sites/default/files/pdf/upload_library/22/Allendoerfer/stahl96.pdf
Student (1908). The probable error of a mean. Biometrika, 6(1), 1-25.
Utts JM, Heckard RF (2007). Mind on Statistics, 3rd Ed., Thomson/Brooks Cole.
Wackerly DD, Mendenhall W, Scheaffer RL (2008). Mathematical Statistics with Applications, 7th
Ed., Thomson/Brooks Cole.
Walck C (2007). Handbook on Statistical Distributions for Experimentalists. Available at:
https://s3.cern.ch/inspire-prod-files-1/1ab434101d8a444500856db124098f9c
Acronyms

CDF: Cumulative Distribution Function

MGF: Moment-generating Function

PDF: Probability Density Function
