Intro to Statistics for Engineers using Python
Prepared by:
Dr. Gokhan Bingol (gbingol@hotmail.com)
December 13, 2024
On the other hand, ML techniques excel at uncovering patterns in complex datasets with intricate
relationships that traditional statistics may overlook (Hastie et al., 2009). However, these methods often
require larger datasets and computational resources. Moreover, the "black-box" nature of certain ML
models, such as deep learning, can limit their interpretability, which is crucial for making informed
decisions in process engineering (Rudin, 2019)2.
Random variables—such as the lifespan of a pump, the time required to complete a task, or the
occurrence of natural phenomena like earthquakes—play a pivotal role in both everyday life and
engineering applications (Forbes et al., 2011). The probability distribution of a random variable
provides a mathematical description of how probabilities are assigned across its possible values. While
statistical literature describes a vast array of distributions (Wolfram MathWorld) 3, only a limited subset
is commonly used in engineering, as highlighted by Forbes et al. (2011) and Bury (1999).
Statistical tools and tests are indispensable in engineering analysis. Common parametric tests like
t-tests and ANOVA are widely used for comparing means and analyzing variance (Montgomery, 2012).
Non-parametric tests, such as the Kruskal-Wallis test or the sign test, are particularly useful when data
fail to meet the assumptions of normality (Gopal, 2006; Kreyszig et al., 2011). Regression analysis,
another critical tool, enables the investigation of relationships between variables (Montgomery et al.,
2021).
1 https://www.thechemicalengineer.com/features/data-science-and-digitalisation-for-chemical-engineers/
2 Rudin C (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable
models instead. Nature Machine Intelligence, 1(5), 206-215.
3 https://mathworld.wolfram.com/topics/StatisticalDistributions.html
The current work emphasizes the application of statistics in engineering, leveraging Python as the
computational tool of choice. Furthermore, it relies extensively on Python packages such as numpy and
scisuit4. The scisuit’s statistical library draws inspiration from R5, enabling readers to transfer the
knowledge gained here to R, a widely used software in the data science domain.
2.1.1. Permutations
Any ordered sequence of k objects taken without replacement from a set of n objects is called a
permutation of size k of the objects (Devore et al., 2021). There are two cases:
A) Objects are Distinct: The set contains only distinct objects, such as A, B, C… Then the number of
permutations of length k, that can be formed from the set of n elements is:
${}_{n}P_{k} = n\cdot(n-1)\cdots(n-k+1) = \dfrac{n!}{(n-k)!} \qquad (2.1)$
The interpretation of Eq. (2.1) is fairly straightforward: Initially there are n objects in the set and once
one is taken out (since without replacement), n-1 objects are left and then the sequence continues in
similar fashion.
Example 2.1
How many permutations of length k=3 can be formed from the elements A, B, C and D (Adapted from
Larsen & Marx, 2011)?
Solution:
Mathematically the solution is: $\dfrac{4!}{(4-3)!} = 24$
Script 2.1
from itertools import permutations
print(list ( permutations(["A", "B", "C", "D"], 3) ))
This will printout 24 tuples, each representing a permutation.
B) Objects are NOT Distinct: The set contains n objects, n1 being one kind, n2 of second kind … and n r
of rth kind, then:
$\dfrac{n!}{n_1!\cdot n_2!\cdots n_r!} \qquad (2.2)$
Example 2.2
A biscuit in a vending machine costs 85 cents. In how many ways can a customer insert 2 quarters, 3 dimes
and 1 nickel (Adapted from Larsen & Marx, 2011)?
Solution:
n = n1+n2+n3 = 2+3+1=6
$\dfrac{6!}{2!\,3!\,1!} = 60$
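As a quick check (a minimal sketch, not one of the numbered scripts), Eq. (2.2) can be evaluated directly with Python's math.factorial:

from math import factorial

n1, n2, n3 = 2, 3, 1   #quarters, dimes, nickel
n = n1 + n2 + n3
print(factorial(n) // (factorial(n1)*factorial(n2)*factorial(n3)))   #60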
2.1.2. Combinations
The number of different combinations of n different things, taken k at a time without repetitions, is
computed by (Kreyszig et al., 2011):
$\dbinom{n}{k} = \dfrac{n!}{k!\,(n-k)!} \qquad (2.3)$
and, if repetitions are allowed,
$\dbinom{n+k-1}{k} \qquad (2.4)$
Example 2.3
Given a set of elements A, B, C and D list the combinations of unique elements of size 2.
Solution:
Script 2.2
from itertools import combinations
print (list( combinations(["A", "B", "C", "D"], 2) ))
('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')
Note that each tuple contains k=2 different “things” and none of the tuples contains exactly the same 2
things, i.e. there is no (‘A’, ‘A’). Please also note that unlike permutations, there is no ('B', 'A'), ('C', 'A')
since there is already (‘A’, ‘B’) and (‘A’, ‘C’), respectively.
Mathematically, Eq. (2.3) gives $\dbinom{4}{2} = \dfrac{4\cdot 3}{2\cdot 1} = 6$ combinations, matching the script output; had repetitions been allowed, Eq. (2.4) would give $\dbinom{4+2-1}{2} = \dbinom{5}{2} = 10$.
A summary of permutations and combinations for k elements from a set of n candidates is given by
Liben-Nowell (2022) as follows:
1) Order matters and repetition is allowed: $n^k$
2) Order matters and repetition is not allowed: $\dfrac{n!}{(n-k)!}$
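These counts can be verified with Python's standard library (math.perm and math.comb, available in Python 3.8+); the short sketch below is only a quick check of the formulas above:

from math import perm, comb

n, k = 4, 3
print(n**k)        #order matters, repetition allowed
print(perm(n, k))  #order matters, no repetition: n!/(n-k)! = 24
print(comb(n, k))  #order does not matter, no repetition (section 2.1.2)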
A random variable is a variable and can therefore assume different values; the value it takes, however,
depends on the outcome of a chance experiment (Peck et al. 2016; Devore et al. 2021).
For example, when two dice are tossed, a sample space consisting of 36 ordered pairs,
$S = \{(1,1), (1,2), \ldots, (6,5), (6,6)\}$, is obtained. In many cases the full set of 36 ordered pairs is not of
interest to us; for some games only the sum of the numbers matters, so we are interested only in the eleven
possible sums (2, 3, …, 11, 12). For instance, if we were interested in the sum being 7, it does not matter
whether the outcome was (4, 3) or (6, 1). Therefore, in this case, the random variable can be defined as
$X(i,j) = i + j$ (Larsen & Marx, 2011).
1. Discrete: Takes values from either a finite set or a countably infinite set.
2. Continuous: Takes values from uncountably infinite number of outcomes, i.e. all numbers in a
single interval on the number line.
where,
• X = the random variable
• k = a specified number the random variable can assume
• P(X=k) = the probability that X equals k (Utts and Heckard, 2007).
For the dice example, let’s say we are interested in the sum of numbers being 2. Then the notation
would be P(X=2) = 1/36.
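This can be confirmed by enumerating the 36 ordered pairs; the sketch below uses itertools.product and is only a quick check:

from itertools import product

outcomes = list(product(range(1, 7), repeat=2))     #the 36 ordered pairs (i, j)
count = sum(1 for i, j in outcomes if i + j == 2)   #only (1, 1) sums to 2
print(count, "/", len(outcomes))                    #1 / 36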
2.2.2. Continuous Random Variables
With each continuous random variable Y a probability density function is associated:
It is seen that unlike Eq. (2.5) which gives the probability at a particular value, Eq. (2.6) yields
probability at an interval [a, b].
Example 2.4
A fair die is rolled 4 times. Let X denote the number of sixes that appear. Find PDF and CDF ( Adapted
from Larsen & Marx, 2011).
Solution:
X has a binomial distribution (see chapter 3.2) with n=4 and p=1/6. Therefore, the PDF:
$p_X(k) = P(X=k) = \dbinom{4}{k}\left(\dfrac{1}{6}\right)^{k}\left(\dfrac{5}{6}\right)^{4-k}, \quad k = 0, 1, 2, 3, 4$
Let’s see how the probability of getting number of sixes changes with a simple plot:
Script 2.3
import scisuit.plot as plt
import scisuit.plot.gdi as gdi
from scisuit.stats import dbinom, pbinom
k = range(0, 5)
x = [*k]
y = [dbinom(x=i, size=4, prob=1/6) for i in x]
plt.scatter(x=x, y=y)
for i,v in enumerate(k):
    gdi.line(p1=(v, 0), p2=(x[i], y[i]))
plt.show(antialiasing=True)
Note that Fig. (2.1) only shows probabilities for individual data points, i.e., for k=0, 1, 2, 3 and 4 sixes.
However, it does not immediately show the probability for P(X≤2). The cumulative distribution
function is:
$F_X(x) = \begin{cases} 0 & x < 0 \\ \left(\frac{5}{6}\right)^4 = 0.482 & 0 \le x < 1 \\ \left(\frac{5}{6}\right)^4 + 4\left(\frac{1}{6}\right)\left(\frac{5}{6}\right)^3 = 0.868 & 1 \le x < 2 \\ \;\vdots & \\ 1 & 4 \le x \end{cases}$
With minor changes to Script (2.3):
Script 2.4
y = [pbinom(q=i, size=4, prob=1/6) for i in x]
plt.scatter(x=x, y=y)
plt.show(antialiasing=True)
$F_n(x) = \dfrac{1}{n}\sum_{i=1}^{n} 1\{x_i \le x\} \qquad (2.8)$
0, 2, 1, 2, 7, 6, 4, 6
Solution:
As demonstrated below for larger datasets, it is considerably more convenient to use Numpy:
Script 2.5
import numpy as np
x = np.array([0, 2, 1, 2, 7, 6, 4, 6])
f2 = np.sum(x<=2)/len(x)
print(f2)
0.5
$M_W(t) = E\!\left(e^{tW}\right) = \begin{cases} \displaystyle\sum_{\text{all } k} e^{tk}\, p_W(k) & \text{if } W \text{ is discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty} e^{tw}\, f_W(w)\, dw & \text{if } W \text{ is continuous} \end{cases} \qquad (2.9)$
6 https://online.stat.psu.edu/stat415/lesson/empirical-distribution-functions
Theorem: Let W1, W2, …, Wn be independent random variables with mgfs $M_{W_1}(t), M_{W_2}(t), \ldots, M_{W_n}(t)$,
respectively. Let $W = W_1 + W_2 + \dots + W_n$. Then,
$M_W(t) = M_{W_1}(t)\cdot M_{W_2}(t)\cdots M_{W_n}(t) \qquad (2.10)$
Example 2.6
Find the moment-generating function of a Bernoulli random variable:
$X_i = \begin{cases} 1 & p \\ 0 & 1-p \end{cases}, \quad 0 < p < 1$
Solution:
Note that Bernoulli random variable is a discrete random variable, therefore condensing Eq. (2.9) for
only discrete random variables yields:
$M_X(t) = \displaystyle\sum_{\text{all } k} e^{tk}\, p_X(k)$
One should notice the condition in the equation which states that the summation should be performed
for “all k”. Note that for Bernoulli random variable there exists only 2 k’s, therefore:
$M_X(t) = e^{t\cdot 0}\, P(X=0) + e^{t\cdot 1}\, P(X=1) = (1-p) + p\,e^{t}$
Example 2.7
Find the MGF of a binomial random variable given by the following equation:
$p_X(k) = P(X=k) = \dbinom{n}{k}\, p^{k}\,(1-p)^{n-k}$
Solution:
$M_X(t) = \displaystyle\sum_{k=0}^{n} \dbinom{n}{k} \left(pe^{t}\right)^{k} (1-p)^{n-k}$
Recall the binomial expansion:
$(x+y)^n = \displaystyle\sum_{k=0}^{n} \dbinom{n}{k} x^{k}\, y^{n-k}$
Observing that the mgf and the binomial expansion are exactly the same if we replace x and y with $x = pe^t$ and
$y = 1-p$, the moment-generating function is:
$M_X(t) = \left(1 - p + p e^{t}\right)^{n}$
$E(Y) = \mu = \displaystyle\int_{-\infty}^{\infty} y\, f_Y(y)\, dy \qquad (2.12)$
One notable property of expected value is that it is a linear operator and therefore,
x: 1 2 3 4 5 6 7
Solution #1:
The simplest approach: $\bar{x} = \dfrac{1\times 150 + 2\times 450 + \dots + 7\times 300}{15000} = 4.57$
15000
Solution #2:
We define a random variable X as the number of courses a student has enrolled. The mean value
(weighted average) of a random variable is its expected value. Furthermore, since the random variable
is discrete, Eq. (2.11) will be applied. However, we first need to compute the probabilities:
p(x): 0.01 (=150/15000), 0.03, 0.13, 0.25, 0.39, 0.17, 0.02 (=300/15000)
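Using these probabilities, the expected value can be computed in a few lines of numpy (a minimal sketch; the probabilities are those tabulated above):

import numpy as np

x = np.arange(1, 8)
p = np.array([0.01, 0.03, 0.13, 0.25, 0.39, 0.17, 0.02])
print(np.sum(x*p))   #E(X) = 4.57, matching Solution #1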
Although Eqs. (2.11 & 2.12) can be used to find the expected value of a random variable, it is not
always very convenient to do so.
If $M_W(t)$ is the moment-generating function (mgf) of the random variable W, then the following
relationship holds as long as the rth derivative of the mgf exists:
$M_W^{(r)}(0) = E(W^r) \qquad (2.14)$
To see why, consider a continuous random variable Y:
$M_Y^{(1)}(0) = \dfrac{d}{dt}\displaystyle\int_{-\infty}^{\infty} e^{ty} f_Y(y)\, dy \bigg|_{t=0}$
Placing the derivative inside as part of the integrand, the equation can be rewritten as:
$M_Y^{(1)}(0) = \displaystyle\int_{-\infty}^{\infty} \dfrac{d}{dt}\, e^{ty} f_Y(y)\, dy \bigg|_{t=0}$
Noting that only $e^{ty}$ is a function of t and performing the differentiation yields:
$M_Y^{(1)}(0) = \displaystyle\int_{-\infty}^{\infty} y\, e^{ty} f_Y(y)\, dy \bigg|_{t=0}$
Setting t=0:
$M_Y^{(1)}(0) = \displaystyle\int_{-\infty}^{\infty} y\, f_Y(y)\, dy = E(Y)$
Note that the last equation is exactly the same as Eq. (2.12), which is the expected value, E(Y).
Therefore, the first derivative of the mgf evaluated at t=0 gives E(Y), the second derivative gives E(Y²), and so
on…
Example 2.9
Find the expected value of the binomial random variable.
Solution:
$M_X(t) = \left(1 - p + p e^{t}\right)^{n}$
$M_X^{(1)}(t) = n\left(1 - p + p e^{t}\right)^{n-1} \cdot p e^{t}$
$M_X^{(1)}(t=0) = E(X) = np$ ■
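The same result can be checked symbolically; the short sketch below (using sympy, which is not used elsewhere in this text) differentiates the binomial mgf and evaluates it at t=0:

import sympy as sp

t, n, p = sp.symbols("t n p", positive=True)
M = (1 - p + p*sp.exp(t))**n        #mgf of the binomial distribution
EX = sp.diff(M, t).subs(t, 0)       #first derivative evaluated at t=0
print(sp.simplify(EX))              #n*p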
Example 2.10
The PDF of the maximum order statistic is given by:
$f_{Y_{(n)}}(y) = n\left[F_Y(y)\right]^{n-1} f_Y(y)$
Find the expected maximum for uniform distribution in the interval of [0, 1].
Solution:
The PDF of the uniform distribution on the interval [0, 1] is $f_Y(y)=1$, therefore the cumulative
distribution function is $F_Y(y)=y$. Substituting these knowns into the above equation and integrating over
the interval [0, 1] yields $E(Y_{(n)}) = \displaystyle\int_0^1 y\cdot n\,y^{n-1}\, dy = \dfrac{n}{n+1}$.
Note that as n increases, the expected maximum approaches 1, which is what we would expect if we
draw a large number of samples from a uniform distribution. ■
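A quick numpy simulation (a sketch, not one of the numbered scripts) illustrates how the average observed maximum approaches n/(n+1):

import numpy as np

for n in [2, 5, 10, 50]:
    #100000 experiments, each taking the maximum of n uniform(0,1) samples
    maxima = np.random.rand(100_000, n).max(axis=1)
    print(n, maxima.mean(), n/(n+1))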
2.2.7. Variance
Although the expected value is an effective statistical measure of central tendency, it gives no
information about the spread of a probability density function. Although the spread could be quantified
using X−μ, it is immediately noted that negative deviations will cancel positive ones (Larsen
& Marx, 2011). The variance of a random variable is therefore defined as the expected value of its squared
deviations. In mathematical terms,
$\operatorname{Var}(X) = \sigma^2 = E\left[(X-\mu)^2\right] = E(X^2) - E(X)^2 \qquad (2.15)$
Noting the following property of the expected value for a random variable X and any function g(X),
$E\left[g(X)\right] = \displaystyle\sum_{\text{all } k} g(k)\, p_X(k) \qquad (2.16)$
If g(X) in Eq. (2.16) is replaced with (X−μ)², then Eq. (2.15) can also be expressed as,
$\operatorname{Var}(X) = \displaystyle\sum_{\text{all } k} (k-\mu)^2\, p_X(k) \qquad (2.17)$
$\operatorname{Var}(Y) = E\left[(Y-\mu)^2\right] = \displaystyle\int_{-\infty}^{\infty} (y-\mu)^2\, f_Y(y)\, dy \qquad (2.18)$
Let W be any random variable, discrete or continuous, and a and b any two constants. Then,
$\operatorname{Var}(aW + b) = a^2\,\operatorname{Var}(W) \qquad (2.19)$
Example 2.11
Test whether Eq. (2.15) represents population or sample variance.
Solution:
Spreadsheets have two functions for computing sample and population variance, namely Var.S and
Var.P, respectively. Computation with Var.S and Var.P yielded 3.86667 and 3.22222, respectively. Let's
investigate using Python libraries:
Script 2.6
import numpy as np
import statistics as stat

#x: the data array used in this example (values not reproduced here)

#Using Eq. (2.15)
EX, EX2 = np.mean(x), np.mean(x**2)
varEq = EX2 - EX**2

varP = np.var(x, ddof=0) #population variance, notice ddof=0
varS = np.var(x, ddof=1) #sample variance, ddof=1
Notice that the number of samples was intentionally kept low to see the difference between sample and
population variance since for large samples the difference becomes negligible.
Although Eqs. (2.17 & 2.18) can be used to find variances of discrete and continuous random variables,
respectively, using MGF (if known/available) to find the variance can be more convenient as
demonstrated in the following example.
Example 2.12
Find the variance of the binomial random variable.
Solution:
$M_X(t) = \left(1 - p + p e^{t}\right)^{n}$
$E(X) = np$
From Eq. (2.14) we know that the second derivative of the mgf with respect to t gives E(X²), therefore:
$M_X^{(2)}(t) = n(n-1)\left(1 - p + pe^{t}\right)^{n-2}\left(pe^{t}\right)^{2} + n\left(1 - p + pe^{t}\right)^{n-1} pe^{t}$
$E(X^2) = n(n-1)p^2 + np$
From Eq. (2.15), remembering that $\operatorname{Var}(X) = E(X^2) - E(X)^2$:
$\operatorname{Var}(X) = n(n-1)p^2 + np - (np)^2$
$\operatorname{Var}(X) = np(1-p)$
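As with the expected value, this result can be checked with a quick simulation; the sketch below uses numpy's binomial generator rather than scisuit:

import numpy as np

n, p = 10, 0.3
x = np.random.binomial(n, p, size=100_000)
print(np.var(x), n*p*(1-p))   #both should be close to 2.1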
3. Discrete Probability Distributions
All discrete probability distributions have the following properties:
1. $p(x) \ge 0$ for every value of x,
2. $\displaystyle\sum_{\text{all } x \text{ values}} p(x) = 1$
The general characteristics of a discrete probability distribution can be visualized using a probability
histogram.
Script 3.1
import scisuit.plot as plt
plt.histogram([1, 2, 3, 4, 5, 3, 4, 2, 5, 4, 6])
plt.show()
A Bernoulli trial can have one of the two outcomes, success or failure. The probability of success is p
and therefore the probability of failure is 1-p (Forbes et al., 2011). It is the simplest discrete
distribution; however, it serves as the building block for other complicated discrete distributions
(Weisstein 2023)7.
$X_i = \begin{cases} 1 & p \\ 0 & 1-p \end{cases}, \quad 0 < p < 1 \qquad (3.1)$
$M_X(t) = 1 - p + p e^{t} \qquad (3.2)$
$E(X) = p \qquad (3.3)$
$\operatorname{Var}(X) = E(X^2) - E(X)^2 = p - p^2 = p(1-p) \qquad (3.4)$
The outcome of the experiment is either a success or a failure. The term success is determined by the
random variable of interest (X). For example, if X counts the number of female births among the next n
births, then a female birth can be considered as a success (Peck et al., 2016).
We run n independent trials and define probability as p=P(success occurs) and assume p remains
constant from trial to trial (Larsen & Marx, 2011). However, since we are only interested in the total
number of successes, we therefore define X as the total number of successes in n trials. This definition
then leads to binomial distribution and is expressed as:
$p_X(k) = P(X=k) = \dbinom{n}{k}\, p^{k}\,(1-p)^{n-k} \qquad (3.5)$
Imagine 3 coins being tossed, each having a probability of p of coming up heads. Then the probability
of all heads (HHH) coming up is p3 and all tails (no heads, TTT) is (1-p)3 and HHT is 3p2(1-p).
Observe that in Eq. (3.5) the combination part shows the number of ways to arrange k heads and n−k
tails (section 2.1), therefore:
$\dbinom{n}{k} = \dfrac{n!}{k!\,(n-k)!} \qquad (3.6)$
The remaining part of Eq. (3.5), $p^{k}(1-p)^{n-k}$, is the probability of any particular sequence having k heads and n−k
tails.
Example 3.1
An IT center uses 9 drives for storage. The probability that any of them is out of service is 0.06. For the
center at least 7 of the drives must function properly. What is the probability that the computing center
can get its work done (Adapted from Larsen & Marx, 2011)?
Solution #1:
$\dbinom{9}{7} 0.94^7\, 0.06^2 + \dbinom{9}{8} 0.94^8\, 0.06^1 + \dbinom{9}{9} 0.94^9\, 0.06^0$
Solution #2:
$1 - \displaystyle\sum_{i=0}^{6} \dbinom{9}{i}\, 0.94^{i}\, 0.06^{9-i}$
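Solution #2 can also be computed directly with pbinom (already used in Script 2.4), where success is defined as a drive functioning properly (p=0.94); a minimal sketch:

from scisuit.stats import pbinom

#P(X >= 7) = 1 - P(X <= 6), X = number of functioning drives out of 9
print(1 - pbinom(q=6, size=9, prob=0.94))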
Example 3.2
Find the 10% quantile of a binomial distribution with 10 trials, where the probability of success on each trial
is 0.4.
plt.hist(data, cumulative=True)
plt.show()
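Since scisuit's statistical library mirrors R, the quantile can presumably be obtained with a qbinom function; the function name and its arguments are assumed here by analogy with dbinom/pbinom:

from scisuit.stats import qbinom   #assumed to exist, following R's naming

#smallest k such that P(X <= k) >= 0.10
print(qbinom(p=0.1, size=10, prob=0.4))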
The derivations of MGF, E(X) and Var(X) were already presented in Examples (2.7), (2.9) and (2.12),
respectively. Although the approaches presented in the examples work very well, one can also keep in
mind that each binomial trial is actually a Bernoulli trial, therefore the random variable W for the binomial
distribution is a function of Bernoulli random variables X1, X2, …, Xn, yielding W=X1+X2+...+Xn. Thus
Eq. (3.2) and Eq. (2.10) can be combined to derive Eq. (3.7). Remembering the linearity of the expected
value, a similar approach can be used for E(W) and Var(W) to obtain Eqs. (3.8 & 3.9).
$M_X(t) = \left(1 - p + p e^{t}\right)^{n} \qquad (3.7)$
$E(X) = np \qquad (3.8)$
$\operatorname{Var}(X) = np(1-p) \qquad (3.9)$
8 https://www.boost.org/doc/libs/1_40_0/libs/math/doc/sf_and_dist/html/math_toolkit/policy/pol_tutorial/
understand_dis_quant.html
Let’s run a simple simulation to test Eq. (3.8):
Script 3.3
import numpy as np
from scisuit.stats import rbinom

#N=1000 experiments, each with 5 trials and probability of success 0.3
x = rbinom(n=1000, size=5, prob=0.3)
print(np.mean(x))
We have intentionally run large number of experiments (N=1000) for the simulation. Note that Eq.
(3.8) and rbinom function match when n=size and p=prob. Therefore for the first case E=5×0.3=1.5,
which is close to 1.49.
Script 3.4
p, n = 0.3, 10
x = rbinom(n=5000, size=n, prob=p)
print(np.mean(x)) #expected to be close to n*p = 3
Finally, let's test our understanding of the randomness of the binomial distribution. First let's
generate 10 random numbers from a binomial distribution.
In an analogy, we flip 5 coins (size=5) and count the number of heads (prob=0.5), which we consider a
success. We run this experiment 10 times (n=10). In the first experiment we might have 1 head, in the
second 2 heads, and so on.
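A minimal sketch of this experiment (the actual numbers will differ on each run, since the draws are random):

from scisuit.stats import rbinom

#10 experiments; in each, 5 coins are flipped and the heads are counted
x = rbinom(n=10, size=5, prob=0.5)
print(x)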
3.3. Hypergeometric Distribution
Suppose that an urn contains r good chips and w defective chips (total number of chips N =r +w ). If n
chips are drawn out at random without replacement, and X denotes the total number of good chips
selected, then X has a hypergeometric distribution and,
$P(X=k) = \dfrac{\dbinom{r}{k}\dbinom{w}{n-k}}{\dbinom{N}{n}} \qquad (3.10)$
Notes:
1. If the selected chip was returned back to the population, that is the chips were drawn with
replacement, then X would have a binomial distribution (see Example 3.3).
2. Since we are interested in the total number of good chips, it does not matter whether the selection order is
r1r2r3… or r2r1r3…. Therefore $\dfrac{r!}{(r-k)!}$ was divided by k! and we used $\dbinom{r}{k} = \dfrac{r!}{k!\,(r-k)!}$.
Example 3.3
An urn has 100 items, 70 good and 30 defective. A sample of 7 items is drawn. What is the probability
that it has 3 good and 4 defective items? (adapted from Tesler 2017)9
9 https://mathweb.ucsd.edu/~gptesler/186/slides/186_hypergeom_17-handout.pdf
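The solution follows directly from Eq. (3.10); a small sketch using math.comb is given below (scisuit presumably also offers an R-like dhyper function, but the direct computation avoids assuming its signature):

from math import comb

#3 good out of 70, 4 defective out of 30, sample of 7 from 100
print(comb(70, 3)*comb(30, 4) / comb(100, 7))   #approximately 0.094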
MGF, Mean and Variance
The MGF, mean and variance of hypergeometric distribution are presented by Walck (2007) and
derivation of expected-value is given by Hogg et al. (2019).
$M_X(t) = \dfrac{\dbinom{w}{n}}{\dbinom{N}{n}}\; {}_2F_1\!\left(-n, -r;\; w-n+1;\; e^{t}\right) \qquad (3.11)$
Let $p = \dfrac{r}{N}$ and $q = 1 - p$, then,
$E(X) = np \qquad (3.12)$
$\operatorname{Var}(X) = npq\,\dfrac{N-n}{N-1} \qquad (3.13)$
Script 3.5
import numpy as np
from scisuit.stats import rhyper

x = rhyper(nn=1000, m=70, n=30, k=7)
avg = np.mean(x)
print(f"mean = {avg}")
mean = 4.931
$E(X) = np = n\cdot\dfrac{r}{N}$
In the parameterization of the rhyper function (m good items, n bad items, k draws):
$E(X) = k\cdot\dfrac{m}{m+n} = 7\cdot\dfrac{70}{70+30} = 4.9$
3.4. Geometric Distribution
It is similar to binomial distribution such that trials have two possible outcomes: success or failure.
However, unlike binomial distribution where we were interested in the total number of successes, now
we are only interested in the trial where first success occurs. Therefore, if k trials were carried out, k-1
trials end up in failures and the kth one occurs with success. Thus we define the random variable X as
the trial at which the first success occurs (Larsen & Marx, 2011).
In more explicit terms, we have thus far said that: “first k-1 trials end up in failure” and “kth trial ends in
success”. Mathematically expressing,
$P(X=k) = (1-p)^{k-1}\, p \qquad (3.14)$
$M_X(t) = \dfrac{p\, e^{t}}{1-(1-p)e^{t}} \qquad (3.15)$
$E(X) = \dfrac{1}{p} \qquad (3.16)$
$\operatorname{Var}(X) = E(X^2) - E(X)^2 = \dfrac{1-p}{p^2} \qquad (3.17)$
Example 3.4
A political pollster randomly selects persons on the street until he encounters someone who voted for
the Fun-Party. What is the probability that he encounters 3 people who did not vote for the Fun-Party before
encountering one who did? It is known that 20% of the population voted for the Fun-Party (adapted
from Foley10 2019).
Solution:
The probability of success (voted for the Fun-Party) is: $p = \dfrac{20}{100} = 0.2$
Since 3 people have not voted for the Fun-Party (failures) and the next one voted, 4 trials were carried out.
$P(X=4) = (1-0.2)^{3}\cdot 0.2^{1} = 0.1024$
dgeom(x=3, prob=0.2)
0.1024
Note that, in the definition of the function dgeom x is the number of failures, therefore, instead of x=4,
x=3 was used.
10 https://rpubs.com/mpfoley73/458721
3.5. Negative Binomial Distribution
In section (3.4) the geometric distribution was introduced where we defined the random variable X as
the trial at which the first success occurs. Therefore the trials were discontinued as soon as a success
occurred. Now instead of first success, we are interested in rth success. Similar to geometric distribution
each trial has a probability p of ending in success.
Therefore, we might have a sequence such as {S, F, F, S, S, S} if we were interested in the r=4th success
occurring on the k=6th trial. Putting it in more mathematical terms, there are r−1=3 successes in the first
k−1=5 trials and (k−1) − (r−1) = k−r = 2 failures before the 4th success.
Now if we define the random variable X as the trial at which the rth success occurs, then all the
background work to obtain the probability density function has been done.
Before proceeding with the final pdf, also note that before the rth success occurs, k-1 trials might have
various different sequences having r-1 successes, such as {SFFSS} or {FSFSS} or so on… Note that
this is indeed very similar to the idea presented in section (3.2) by Eq. (3.6). Therefore,
I) Before the rth success occurs (k−1 trials), the number of different sequences with r−1 successes:
$\dbinom{k-1}{r-1}$
II) P(r−1 successes in the first k−1 trials and success on the kth trial):
$p^{r-1}(1-p)^{k-1-(r-1)}\cdot p$
Putting the expressions in (I) and (II) together gives the pdf of the negative binomial distribution:
$p_X(k) = \dbinom{k-1}{r-1}\, p^{r}\,(1-p)^{k-r} \qquad (3.18)$
Example 3.5
A process engineer wishes to recruit 4 interns to aid in carrying out lab tests for the development of a
new technology. Let p= P(randomly chosen CV is a fit). If p is 0.2, what is the probability that exactly
15 CVs must be examined before 4 interns can be recruited (Adapted from Carlton & Devore, 2014)?
Solution:
$p_X(k) = \dbinom{k-1}{r-1}\, p^{r}\,(1-p)^{k-r}$
$p(X=15) = \dbinom{15-1}{4-1}\, 0.2^{4}\,(1-0.2)^{15-4} = 0.050$
Using Python:
dnbinom(x=15-4, size=4, prob=0.2)
0.050
Note that in the dnbinom function the argument x represents the number of failures (k−r).
Now, let's ask ourselves a simple question: does the probability increase or decrease as the number of
CVs to be examined increases? This can be answered with the short sketch below.
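The sketch computes Eq. (3.18) with dnbinom for a range of k values (the range is chosen only for illustration) so the probabilities can be compared directly:

from scisuit.stats import dnbinom

r, p = 4, 0.2
for k in range(5, 41, 5):
    #x is the number of failures, k - r
    print(k, dnbinom(x=k-r, size=r, prob=p))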
It is clearly seen that if r=1 the negative-binomial pdf reduces to the geometric pdf; it can therefore be
said that the negative-binomial distribution generalizes the geometric distribution.
Larsen & Marx (2011) expresses the relationship between negative-binomial and geometric
distributions in the following way which is easier to derive a mathematical relationship between the
random variables:
$X = X_1 + X_2 + \dots + X_r \qquad (3.19)$
It should be observed that until the 1st success occurs, the trials overlap with the definition of a geometric
random variable. However, after the 1st success we are interested in the additional trials (please note the
word additional) needed to observe the 2nd success, and therefore the trials between the 1st and 2nd successes
again fit the definition of a geometric random variable. Continuing in this fashion, the rationale for Eq.
(3.19) is justified.
3.5.2. MGF, Mean and Variance
$M_X(t) = \left[\dfrac{p\, e^{t}}{1-(1-p)e^{t}}\right]^{r} \qquad (3.20)$
$E(X) = \dfrac{r}{p} \qquad (3.21)$
$\operatorname{Var}(X) = \dfrac{r(1-p)}{p^2} \qquad (3.22)$
Although above-given equations can be derived directly from the PDF of negative-binomial
distribution, Eq. (3.19) paves the way to combine Eqs. (2.10 & 3.15) to derive MGF in a very
straightforward fashion. Also by using Eqs. (3.16 & 3.17) expected-value and variance can be derived
conveniently as shown below:
1) $M_X(t) = M_{X_1}(t)\cdot M_{X_2}(t)\cdots M_{X_r}(t) = \left[\dfrac{p\, e^{t}}{1-(1-p)e^{t}}\right]^{r}$
$\displaystyle\lim_{n\to\infty} \dbinom{n}{k}\, p^{k}\,(1-p)^{n-k} = \dfrac{e^{-np}(np)^{k}}{k!} \qquad (3.23)$
A proof of Eq. (3.23) is presented in various textbooks (Devore et al., 2021; Larsen & Marx, 2011).
Let’s inspect the accuracy of Eq. (3.23) using Python code. There are two tests where each has different
probabilities (p); however for both tests λ=np remains constant as 1.
Script 3.6
import numpy as np
from scisuit.stats import dbinom, dpois

n, kmax = 5, 5     #Test #1 (for Test #2 use n, kmax = 100, 10)
p = 1/n            #probability, so that lambda = n*p = 1

binom = [dbinom(x=k, size=n, prob=p) for k in range(kmax+1)]
pois = [dpois(x=k, mu=n*p) for k in range(kmax+1)]

D = np.abs(np.array(binom)-np.array(pois)) #difference
print(f"min:{min(D)} at k={np.argmin(D)}")
print(f"max:{max(D)} at k={np.argmax(D)}")
Test #1: min:0.0027 at k=5 & max:0.0417 at k=1,
Test #2: min:3.13e-08 at k=10, max:0.0018 at k=1
It is clearly seen that in both tests the Poisson limit approximates binomial probabilities fairly well.
However, as evidenced from Test #2 where n was larger and p was smaller, the agreement between
Poisson limit and binomial probabilities became remarkably good for all k.
Example 3.6
When data is transmitted over a data link, there is a possibility of errors being introduced. Bit error rate
is defined as the rate (errors/total number of bits) at which errors occur in a transmission system 11.
Assume you have a 4 MBit modem with bit error probability 10 −8. What is the probability of exactly 3
bit errors in the next minute (adapted from Devore et al. 2021)?
Solution:
In one minute, $4\cdot 10^{6}\ \tfrac{\text{bits}}{\text{s}} \times 60\ \text{s} = 240\cdot 10^{6}$ bits will be transferred and the probability of a bit error is $10^{-8}$. The errors
can occur in any sequence and we are interested in the total number of errors, which is by definition a
binomial probability:
$P(3) = \dbinom{240\cdot 10^{6}}{3}\left(10^{-8}\right)^{3}\left(1-10^{-8}\right)^{240\cdot 10^{6}-3}$
Since n is very large (240,000,000) and p is very small ($10^{-8}$), the above computation is an excellent
candidate for the Poisson limit: $\lambda = np = 2.4\cdot 10^{8}\times 10^{-8} = 2.4$
#Binomial probability
dbinom(x=3, size=240000000, prob=1E-8)
0.2090142
#Poisson limit
dpois(x=3, mu=2.4)
0.2090142
If we pose the question, “what is the probability at most 3 bit errors in the next minute?”, then the
solution is:
$P(X \le 3) = \displaystyle\sum_{k=0}^{3}\dbinom{240\cdot 10^{6}}{k}\left(10^{-8}\right)^{k}\left(1-10^{-8}\right)^{240\cdot 10^{6}-k}$
#Poisson limit
ppois(q=3, mu=2.4)
0.7787229
11 https://www.electronics-notes.com/articles/radio/bit-error-rate-ber/what-is-ber-definition-tutorial.php
3.6.2. Poisson Distribution
The random variable X is said to have a Poisson distribution if,
$p_X(k) = \dfrac{e^{-\lambda}\lambda^{k}}{k!} \qquad (3.24)$
where λ>0.
Example 3.7
7 cards are drawn (with replacement) from a deck containing the numbers 1 to 10. Success is defined as
drawing a 5. Can the generated data be described by a Poisson distribution?
Script 3.7
import numpy as np
from scisuit.stats import rbinom, dpois

#size=7 cards, prob=1/10
XX = rbinom(n=10000, size=7, prob=0.1)

#empirical frequencies of 0, 1, 2, ... successes
unique, Frequencies = np.unique(XX, return_counts=True)
total = float(np.sum(Frequencies))
probabilities = Frequencies/total

aver = np.mean(XX) #sample mean used as the Poisson parameter
poisson = [dpois(x=float(i), mu=aver) for i in unique]
print(probabilities)
print(poisson)
[0.4781 0.3733 0.1253 0.0209 0.0021 0.0003]
[0.4983, 0.34708, 0.12087, 0.02806, 0.0048, 0.0007]
It is seen from the output that the probabilities can be well described by Poisson distribution. It should
be noted that when the probability value in the simulation was increased to 0.5, the difference between
actual and predicted probabilities increased.
3.6.3. MGF, Mean and Variance
$M_X(t) = e^{\lambda(e^{t}-1)} \qquad (3.25)$
$E(X) = \lambda \qquad (3.26)$
$\operatorname{Var}(X) = \lambda \qquad (3.27)$
Derivation of Eq. (3.25) can be found in mathematical statistics textbooks (Devore et al. 2021; Wackerly
et al. 2008).
12 https://www.probabilitycourse.com/chapter11/11_1_2_basic_concepts_of_the_poisson_process.php
3.7. Multinomial Distribution
The multinomial distribution is a generalization of the binomial distribution (Forbes et al., 2011;
Larsen & Marx, 2011). Let Xi show the number of times the random variable Y equals yi, i=1,2,…,k in a
series of n independent trials where pi=P(Y=yi). Then,
$P(X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k) = \dfrac{n!}{x_1!\, x_2! \cdots x_k!}\, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k} \qquad (3.28)$
where each $x_i = 0, 1, \ldots, n$ and $\displaystyle\sum_{i=1}^{k} x_i = n$.
Notes:
1. The rationale for the $\dfrac{n!}{x_1!\, x_2!\cdots x_k!}$ part is directly related to Eq. (2.2) in section (2.1).
2. Thinking along the lines of probability events:
Event 1 (E1) with probability p1 → x1 successes in n independent trials
Event 2 (E2) with probability p2 → x2 successes in n independent trials
…
Event k (Ek) with probability pk → xk successes in n independent trials
Since the trials are independent, $P(E_1 \cap E_2 \cap \dots \cap E_k) = p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$
Example 3.8
A die is tampered with such that the probability of each of its faces appearing is pi = P(face i appears) = k·i,
where k is a constant. If the die is tossed 12 times, what is the probability that each face will appear
exactly twice? Compute the probability also for the case of a normal die (Adapted from Larsen & Marx, 2011).
Solution:
Since a die has 6 faces and the sum of probabilities must be equal to 1.0, it is straightforward to
compute the constant k: $\displaystyle\sum_{i=1}^{6} k\cdot i = k\sum_{i=1}^{6} i = k\cdot\dfrac{6\times 7}{2} = 1 \;\Rightarrow\; k = \dfrac{1}{21}$
Since the question asks that each face appear exactly twice, all that is left is to apply Eq. (3.28):
$P(X_1=2, \ldots, X_6=2) = \dfrac{12!}{2!\,2!\,2!\,2!\,2!\,2!}\left(\dfrac{1}{21}\right)^{2}\left(\dfrac{2}{21}\right)^{2}\cdots\left(\dfrac{6}{21}\right)^{2} = 0.0005$
With a normal die, each face would have probability 1/6 and therefore:
$P(X_1=2, \ldots, X_6=2) = \dfrac{12!}{2!\,2!\,2!\,2!\,2!\,2!}\left(\dfrac{1}{6}\right)^{2}\left(\dfrac{1}{6}\right)^{2}\cdots\left(\dfrac{1}{6}\right)^{2} = \dfrac{12!}{2^{6}}\left(\dfrac{1}{6}\right)^{12} = 0.0034$
Script 3.8
from scisuit.stats import dmultinom

x = [2]*6 #each face appears exactly twice

#Tampered die (x/prob argument names assumed to follow R's dmultinom)
probs = [1/21*i for i in range(1,7)]
print(dmultinom(x=x, prob=probs))

#Normal die
probs = [1/6]*6
print(dmultinom(x=x, prob=probs))
$P(X_1 = k, X_2 = n-k) = \dfrac{n!}{k!\,(n-k)!}\, p^{k}\,(1-p)^{n-k}$
Noting that $\dfrac{n!}{k!\,(n-k)!} = \dbinom{n}{k}$, one can see that the above equation is exactly the same as Eq. (3.5).
3.7.2. MGF, Mean and Variance
The moment-generating function, mean and variance of the multinomial distribution are given in various
textbooks (Forbes et al., 2011; Larsen & Marx, 2011).
$M_X(t) = \left(\displaystyle\sum_{i=1}^{k} p_i e^{t_i}\right)^{n} \qquad (3.29)$
$E(X_i) = np_i \qquad (3.30)$
$\operatorname{Var}(X_i) = np_i(1-p_i) \qquad (3.31)$
Script 3.9
import numpy as np
from scisuit.stats import rmultinom
n=10
#testing probabilities
p = np.array([0.05, 0.15, 0.30, 0.50 ])
#2D array
arr = np.array(rmultinom(n=1000, size=n, prob=p))
#4 means, each the mean of 1000 random numbers drawn with probabilities 0.05, 0.15, ...
means = np.mean(arr, axis=1)
print(means) #expected to be close to n*p = [0.5, 1.5, 3.0, 5.0]
13 https://www.statlect.com/probability-distributions/multinomial-distribution
3.8. Summary
Bernoulli: One of two outcomes, success (p) or failure (1−p). $X_i = \begin{cases}1 & p\\ 0 & 1-p\end{cases}, \; 0<p<1$
Hypergeometric: n chips are drawn out at random without replacement, and X denotes the total number of good chips (N = r + w). $P(X=k) = \dfrac{\binom{r}{k}\binom{w}{n-k}}{\binom{N}{n}}$
Multinomial: $P(X_1=x_1, \ldots, X_k=x_k) = \dfrac{n!}{x_1!\cdots x_k!}\, p_1^{x_1}\cdots p_k^{x_k}$
4. Continuous Probability Distributions
Continuous probability distributions have the following properties:
1. $f(x) \ge 0$,
2. $\displaystyle\int_{-\infty}^{\infty} f(x)\, dx = 1$
Continuous probability distributions can be visualized by a curve called a density curve. The function
that defines this curve is called the density function.
Script 4.1
from numpy import linspace
from scisuit.plot import scatter, plot, show
from scisuit.stats import dnorm

#density curve of the standard normal distribution
x = linspace(-4, 4, num=200)
plot(x, dnorm(x=x))
show()
Following the same rationale as in the above script, probability density curves for other distributions
can be obtained.
4.1. Uniform Distribution
If you generate random numbers between 0 and 1 using a computer, you will get observations from a
uniform distribution since there will be almost same amount of numbers in each equally spaced sub-
interval, i.e. 0-0.2 or 0.2-0.4. Let’s run a simulation:
Script 4.2
import random
import numpy as np

#1000 random numbers between 0 and 1
x = np.array([random.random() for _ in range(1000)])

start, dx = 0.0, 0.2
while start<1.0:
    L = len( np.where( np.logical_and(x>=start, x<(start+dx)) )[0] )
    print(f"({start}, {start+dx}): {L}")
    start += dx
Output14 is: (0.0, 0.2): 218, (0.2, 0.4): 202, (0.4, 0.6): 192, (0.6, 0.8): 209, (0.8, 1.0): 179
Although number of samples drawn was relatively small, it is seen that each sub-interval in the range of
[0, 1] has similar amount of numbers. Instead of 1000 samples, if the simulation was run with
10,000,000 samples the difference between amount of numbers in each sub-interval would have been
negligible.
A random variable Y has a continuous uniform probability distribution on the interval (a, b) if the PDF
is defined as follows:
$f(y) = \begin{cases} \dfrac{1}{b-a} & a \le y \le b \\ 0 & \text{elsewhere} \end{cases} \qquad (4.1)$
The uniform distribution is very important for theoretical studies (Wackerly et al., 2008). For example
if F(y) is a distribution function, it is often possible to transform uniform distribution to F(y). For
example, it is possible to transform it to standard normal distribution using Box-Muller transform 15.
14 It should be reminded that in random sampling each run will produce different results.
15 https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform
MGF, Mean and Variance
For t≠0:
$M_Y(t) = \displaystyle\int_{-\infty}^{a} 0\cdot e^{ty}\, dy + \int_{a}^{b} \dfrac{e^{ty}}{b-a}\, dy + \int_{b}^{\infty} 0\cdot e^{ty}\, dy = \dfrac{e^{tb}-e^{ta}}{t(b-a)}$
$M_Y(t) = \begin{cases} \dfrac{e^{tb}-e^{ta}}{t(b-a)} & t \ne 0 \\ 1 & t = 0 \end{cases} \qquad (4.2)$
$E(Y) = \dfrac{a+b}{2} \qquad (4.3)$
$\operatorname{Var}(Y) = \dfrac{(b-a)^2}{12} \qquad (4.4)$
It should be noted that the derivation (presented by Wolfram 16) of Eq. (4.3) from Eq. (4.2) might pose
challenges for many. Instead, it is recommended to use Eq. (2.12) as then the derivation becomes
considerably more convenient.
16 https://mathworld.wolfram.com/UniformDistribution.html
Example 4.1
As evidenced from above a random number generator will spread its output uniformly across the entire
interval from 0 to 1. What is the probability that the numbers will be in between 0.3 and 0.7?
Solution
This is a rather straightforward question and the answer is P(0.3≤X≤0.7)=0.4. Let’s demonstrate it with
a short script:
Script 4.3
import numpy as np
from random import random
from numpy import array, logical_and, where

#sample sizes: 10, 100, ..., 100000
arr = [10, 100, 1000, 10000, 100000]
x = [[random() for _ in range(n)] for n in arr]

L = []
for lst in x:
    cond = logical_and(array(lst)>=0.3, array(lst)<0.7)
    length = len( where( cond )[0])
    L.append(length)

print(np.array(L)/arr)
[0.3 0.41 0.396 0.4022 0.39924]
As evidenced from the above output, as the number of samples in the array (arr) increased from 10 to
105, the simulated probability approached to the computed probability.
4.2. Normal Distribution
Normal distributions are bell-shaped and symmetric curves. They are widely used and are the single
most important probability model in all of statistics since:
2. They play a central role in many of the inferential procedures (Larsen & Marx, 2011; Peck et
al., 2016).
In section (3.6) it was shown that the Poisson limit approximated binomial probabilities when n→∞
and p→0. Historically, this was not the only approximation [interested reader can find a historical
evolution of the normal distribution in the paper from Stahl (2006)]. Abraham DeMoivre showed that
when X is a binomial random variable and n is large, the probability $P\!\left(a \le \dfrac{X-np}{\sqrt{np(1-p)}} \le b\right)$ can be
estimated using the following equation:
$f_Z(z) = \dfrac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad -\infty < z < \infty \qquad (4.5)$
The formal statement of the approximation is known as the DeMoivre-Laplace limit theorem (Larsen &
Marx, 2011):
$\displaystyle\lim_{n\to\infty} P\!\left(a \le \dfrac{X-np}{\sqrt{np(1-p)}} \le b\right) = \dfrac{1}{\sqrt{2\pi}}\int_{a}^{b} e^{-z^2/2}\, dz \qquad (4.6)$
Eq. (4.5) is referred to as the standard normal curve where μ=0 and σ=1. If μ≠0 and σ≠1 then the
equation is expressed as follows:
$f(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty \qquad (4.7)$
In order to show DeMoivre’s idea, let’s write a fairly short Python script. rbinom function was used to
sample 1000 experiments where each experiment consists of 60 trials with a probability of success of
0.4 (adapted from Larsen & Marx, 2011).
Script 4.4
import math
import numpy as np
import scisuit.plot as plt
from scisuit.stats import rbinom

n, p = 60, 0.4
#1000 experiments, each with 60 trials and p=0.4
x = np.array(rbinom(n=1000, size=n, prob=p))
#z-ratio
z = (x - n*p)/math.sqrt(n*p*(1-p))
#DeMoivre's equation
f = 1.0/math.sqrt(2*math.pi)*np.exp(-z**2/2.0)
plt.scatter(x=z, y=f)
plt.show()
$M_Y(t) = e^{\mu t + \sigma^2 t^2/2} \qquad (4.8)$
$E(Y) = \mu \qquad (4.9)$
$\operatorname{Var}(Y) = \sigma^2 \qquad (4.10)$
In order to simulate this we will generate a sample space of size 250 from an exponential distribution.
Then we will draw samples of size 5, 10, 20 and 30 (250 times) from the sample space and compute the
average of each sample. It will reveal us how the choice of sample size affects sampling distribution.
We will run the following script:
Script 4.5
import numpy as np
import scisuit.plot as plt
import scisuit.stats as st

N = 250
#sample space: 250 observations from an exponential distribution
#(rexp is assumed to follow R-like naming in scisuit.stats)
SS = np.array(st.rexp(n=N))

sizes = [5, 10, 20, 30]
colors = ["red", "green", "blue", "orange"]

plt.layout(3, 2)
r, c = 0, 0
for i, v in enumerate(sizes):
    #draw 250 samples of size v from the sample space and compute their averages
    x = [np.mean(np.random.choice(SS, size=v)) for _ in range(N)]
    plt.subplot(r, c)
    plt.hist(x, fc = colors[i])
    plt.title(f"{chr(65+i)}) n={v}")
    c += 1
    if c%2 == 0:
        r += 1 ; c = 0
plt.show()
Fig 4.3: Frequency histogram of sample space and different sample sizes
The following inferences can be made from Fig. (4.3):
1. Although the histogram of the sample space (variable SS) does not look normal in shape, each
of the four histograms (A-D) resembles a normal shape,
2. Each of the histograms (A-D) has an average value close to the sample space's average value.
Generally, x̄ based on a larger sample size tends to be closer to the population mean.
3. The smaller the sample size, the more the sampling distribution spreads out (compare
the x-axis limits of A and D, where the sample sizes were 5 and 30, respectively).
$\displaystyle\lim_{n\to\infty} P\!\left(a \le \dfrac{W_1 + \dots + W_n - n\mu}{\sqrt{n}\,\sigma} \le b\right) = \dfrac{1}{\sqrt{2\pi}}\int_{a}^{b} e^{-z^2/2}\, dz \qquad (4.11)$
$E\!\left[\dfrac{1}{n}(W_1 + \dots + W_n)\right] = E(\bar{W}) = \mu \qquad (4.12)$
$\operatorname{Var}\!\left[\dfrac{1}{n}(W_1 + \dots + W_n)\right] = \dfrac{\sigma^2}{n} \qquad (4.13)$
The implication of Eq. (4.13) can be observed from Fig. (4.3), where increasing the sample size
decreased the variability of the distribution. In order to show how Eq. (4.11) works, we will generate an
array with 5 columns and 250 rows from a standard uniform distribution. Then the sum of the 5 columns will
be computed to generate another array (250 rows). Since for a standard uniform distribution μ=0.5 and
σ²=1/12, the z-ratio will be computed using $\dfrac{y - 5/2}{\sqrt{5/12}}$.
Script 4.6
from math import sqrt, pi
from numpy import array, exp, sum
from numpy.random import uniform
import scisuit.plot as plt

n = 5
mu, sigma = 0.5, sqrt(1/12) #standard uniform distribution

#250 rows, 5 columns of standard uniform random numbers
W = uniform(size=(250, n))

#W1+W2+...
x = sum(W, axis=1) #len=250

#z-ratio
z = (x - n*mu)/(sqrt(n)*sigma)

#DeMoivre's equation
f = 1.0 / sqrt(2*pi)*exp(-z**2/2.0)

plt.scatter(x=z, y=f)
plt.show()
It is seen that even though the number of summed random variables was small (n=5), the sums yielded a
distribution closely resembling the normal distribution.
Script 4.7
N = 10000 #number of samples
Solution:
It is reasonable to assume that the samples come from a normal distribution. The standard deviation of
the sample mean is:
$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{0.16}{\sqrt{16}} = 0.04$
Approach #1
$z_1 = \dfrac{11.96-12}{0.04} = -1.0 \qquad z_2 = \dfrac{12.08-12}{0.04} = 2.0$
0.04 0.04
Probability that sample average will be between 11.96 and 12.08 is:
Since the limits have been standardized we can use standard normal distribution to compute
probabilities:
pnorm(q=2) - pnorm(q=-1)
0.8186
Approach #2
If not using the standard normal distribution then mean and standard deviation must be specified.
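A minimal sketch of Approach #2 is given below; the mean and sd argument names are assumed to follow R's pnorm convention:

from scisuit.stats import pnorm

#P(11.96 <= sample mean <= 12.08) without standardizing
print(pnorm(q=12.08, mean=12, sd=0.04) - pnorm(q=11.96, mean=12, sd=0.04))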
In section (3.6.4) it was mentioned that situations where we only know the rate of occurrence (λ) of
an event, and where the events occur completely at random, might be good candidates for a Poisson model.
However, situations might arise where the time interval between consecutively occurring events is itself
an important random variable. The exponential distribution has many applications:
$f_Y(y) = \lambda e^{-\lambda y}, \quad y > 0 \qquad (4.14)$
$M_Y(t) = \dfrac{\lambda}{\lambda - t} \qquad (4.15)$
$E(Y) = \dfrac{1}{\lambda} \qquad (4.16)$
$\operatorname{Var}(Y) = \dfrac{1}{\lambda^2} \qquad (4.17)$
Example 4.3
During the period of 1832 to 1950, the following data was collected for the eruptions of a volcano:
126 73 3 6 37 23 73 23 2 65 94 51
26 21 6 68 16 20 6 18 6 41 40 18
41 11 12 38 77 61 26 3 38 50 91 12
Can the data be described by an exponential distribution model? (Adapted from Larsen & Marx, 2011)
Solution:
In order to test whether exponential distribution is an adequate choice, first a density histogram of the
data needs to be plotted. Then a scatter plot in the domain of the data using Eq. (4.14) will be overlaid.
The following script handles both tasks:
Script 4.8
from numpy import array, linspace, average
import scisuit.plot as plt
from scisuit.stats import dexp

Data = array([126, 73, 3, 6, 37, 23, 73, 23, 2, 65, 94, 51, 26, 21, 6, 68, 16, 20, 6,
18, 6, 41, 40, 18, 41, 11, 12, 38, 77, 61, 26, 3, 38, 50, 91, 12])

plt.hist(Data, density=True) #density histogram (density arg assumed)
x = linspace(min(Data), max(Data), num=100)
plt.scatter(x=x, y=dexp(x, rate=1/average(Data))) #Eq. (4.14), rate arg assumed (R-like), λ ≈ 1/mean
plt.show()
In section (4.3), it was mentioned that if a series of events satisfying the Poisson process are occurring
at a rate of λ per unit time and the random variable Y denote the interval between consecutive events it
could be modeled with exponential distribution. Here the random variable Y can also be interpreted as
the waiting time for the first occurrence.
This is similar to geometric distribution (section 3.4) where we were only interested in the trial where
first success occurs. In section (3.5), in negative-binomial distribution instead of first success, we were
interested in rth success. Therefore, it was mentioned that the negative-binomial distribution generalizes
the geometric distribution.
In a similar fashion, gamma distribution generalizes the exponential distribution such that we are now
interested in the occurrence of (waiting time of) r th event. However, before we proceed with the
probability density function of gamma distribution we need to define the gamma function.
$\Gamma(z) = \displaystyle\int_{0}^{\infty} t^{z-1}\, e^{-t}\, dt \qquad (4.18)$
With minor calculus, one can quickly see that Γ(1)=1. Using integration by parts17, it is seen that:
Γ (z +1)= z⋅Γ( z). Using induction one can further see that Γ(n)=(n-1)! .
Script 4.9
from math import gamma, factorial

#Γ(n) = (n-1)!
for n in range(1, 6):
    print(gamma(n), factorial(n-1))
17 https://en.wikipedia.org/wiki/Gamma_function
4.4.2. Probability Density Function
Suppose that Poisson events are occurring at a constant rate of λ. Let the random variable Y denote the
waiting time for the rth event. Then,
$f_Y(y) = \dfrac{\lambda^{r}}{\Gamma(r)}\, y^{r-1} e^{-\lambda y}, \quad y > 0 \qquad (4.19)$
A proof of Eq. (4.19) can be found in mathematical statistics textbooks (Larsen & Marx, 2011). Eq.
(4.19) is often expressed in the following form (Devore et al., 2021; Miller & Miller, 2014; R-
Documentation18) :
$f(x; \alpha, \beta) = \dfrac{1}{\beta^{\alpha}\,\Gamma(\alpha)}\, x^{\alpha-1} e^{-x/\beta}, \quad x > 0 \qquad (4.20)$
Software such as R and the scisuit Python package call α the shape and β the scale parameter. Note
that in Eq. (4.19) r=α and λ=1/β.
Devore et al. (2021) states that the parameter β is called a scale parameter because values other than 1
either stretch or compress the pdf in the x-direction. Let’s visualize this using a constant shape factor,
shape=2:
Script 4.10
from numpy import linspace
from scisuit.plot import scatter, show, legend
from scisuit.stats import dgamma

x=linspace(0, 7, num=100)

#constant shape (α=2), different scale (β) values (label arg assumed, as in plot)
for beta in [0.5, 1, 2, 4]:
    scatter(x=x, y=dgamma(x=x, shape=2, scale=beta), label=str(beta))

legend()
show()
The following figure will be generated:
18 https://search.r-project.org/CRAN/refmans/ExtDist/html/Gamma.html
It is seen that for β=1, the maximum value is around 0.35.
Fig 4.6: Gamma density curves for different scale (β) values (α=2)
With minor editing if the same script is run for different values of α=[0.6, 1, 2, 4], where β=1, then the
following figure will be obtained:
It is seen that:
1) when α≤1, the curve is strictly decreasing as x increases,
2) when α>1, f(x; α) rises to a maximum and then decreases as x increases.
$M_Y(t) = \dfrac{1}{(1-\beta t)^{\alpha}} \qquad (4.21)$
$E(Y) = \alpha\beta \qquad (4.22)$
$\operatorname{Var}(Y) = \alpha\beta^2 \qquad (4.23)$
Example 4.4
As a process engineer you are given the task of designing a system to pump fluid from a reservoir to
the processing plant. As this is important for the manufacturing to continue smoothly you have included
two pumps, one active and one as a backup to be brought on line.
The manufacturer of the pump specifies that the pump is expected to fail once every 100 hours. What
are the chances that the whole manufacturing will not remain functioning for 50 hours? ( Adapted from
Larsen & Marx, 2011)
Solution:
For the whole manufacturing to be interrupted, 2 pumps should fail, for example first after 10 hours
and second after 40 hours… Failure rate: λ= 0.01 failure/hour
Approach #1:
Approach 2: We are going to use Eq. (4.20) where β= 100 and α=2.
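Assuming scisuit provides an R-like pgamma function (the name and arguments are assumed by analogy with dgamma), the probability that both pumps fail within 50 hours can be computed directly:

from scisuit.stats import pgamma   #assumed R-like cumulative gamma

#waiting time until the 2nd failure: shape=2, scale=1/λ=100
print(pgamma(q=50, shape=2, scale=100))   #approximately 0.09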
We will use a short script to generate the probability density curves and inspect the pdf’s.
Script 4.11
from numpy import linspace
from scisuit.plot import plot, legend, show
from scisuit.stats import dgamma

x = linspace(0, 500, num=500)

#waiting time until the 2nd pump failure: shape=2, scale=1/λ=100
plot(x, dgamma(x=x, shape=2, scale=100), label="shape=2, scale=100")

legend()
show()
The chi-squared distribution is the distribution of a sum of squares of independent standard normal random
variables, and this fact gives rise to important applications, e.g. the analysis of contingency tables (Forbes et al., 2011).
Suppose 100 people are surveyed on whether they will go to a certain movie, and the choices (categories)
are: Definitely, Probably, Probably not, Definitely not. A table can now be formed by counting the
observations:
Category:  Definitely | Probably | Probably not | Definitely not
Frequency: 20 | 40 | 25 | 15
Let k be the number of categories of a categorical variable and pi the population proportion for category i
(pi > 0). Then,
H0: p1, p2, …, pk are equal to the hypothesized proportions,
Ha: H0 is not true (at least one of the population category proportions differs from the corresponding
hypothesized value).
The test statistic compares the observed (O) and expected (E) counts:
$\chi^2 = \displaystyle\sum_{i=1}^{k} \dfrac{(O_i - E_i)^2}{E_i} \qquad (4.24)$
19 https://en.wikipedia.org/wiki/Univariate_(statistics)
Example 4.5
Phase 5: 24 days, 7711 births
Phase 7: 24 days, 7733 births
Solution:
There are 699 total days and a total of 222,784 births. The probability of a birth to happen at Phase 1 is
24/699=0.0343 and at Phase 8 is 152/699=0.2175.
So if lunar phase did not have any effect, then we expect that at Phase 1 there would be
0.0343×222784=7649.23 births. We continue our computations in this fashion and then use Eq. (4.24)
to compute Χ2 value.
Script 4.12
import numpy as np
from scisuit.stats import pchisq

#observed: births for each lunar phase, taken from the example's table
#(only two of its rows are reproduced in this text)

#Lunar periods (days)
days = np.array([24, 152, 24, 149, 24, 150, 24, 152])

#probabilities (ratios)
probs = days / np.sum(days)

#expected birth numbers
expected = np.sum(observed)*probs

chisq = np.sum((expected-observed)**2 / expected)
print(1 - pchisq(q=chisq, df=len(days)-1)) #p-value, assuming R-like pchisq(q, df)
Forbes et al. (2011) states that to be able to use Eq. (4.24), the data produced from the differences
between observed and expected values should be normally distributed. We will use QQ plot to check
whether the data is normally distributed.
Script 4.13
import scisuit.plot as plt
from scisuit.stats import test_norm_ad
diff = observed-expected
print(f"Anderson-Darling test: {test_norm_ad(x=diff)}")
plt.qqnorm(data=diff)
plt.show()
$f_Y(y) = \dfrac{1}{2^{n/2}\,\Gamma(n/2)}\, y^{(n/2)-1} e^{-y/2}, \quad y > 0 \qquad (4.25)$
Please note that Eq. (4.25) is a special case of Eq. (4.19) where r=n/2 and λ=1/2. Substituting these
values in Eq. (4.19) and tidying up slightly yields:
$f_Y(y) = \dfrac{(1/2)^{n/2}}{(n/2-1)!}\, y^{n/2-1} e^{-y/2}$
Noticing that (n/2−1)! = Γ(n/2), one can see that the above equation is equal to Eq. (4.25).
Theorem: Let Z1, Z2, …, Zn be n independent standard normal random variables. Then,
$\displaystyle\sum_{i=1}^{n} Z_i^2$
has chi-square distribution with n degrees of freedom. A proof of the theorem can be found in
mathematical statistics textbooks (Larsen & Marx, 2011).
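The theorem is easy to demonstrate numerically; the sketch below uses numpy rather than scisuit, and the chosen n is only for illustration:

import numpy as np

n = 10
Z = np.random.randn(100_000, n)   #independent standard normal variables
Y = np.sum(Z**2, axis=1)          #sum of their squares
print(Y.mean(), Y.var())          #approximately n and 2n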
4.5.3. MGF, Mean and Variance
$M_Y(t) = (1-2t)^{-n/2}, \quad t < 1/2 \qquad (4.26)$
$E(Y) = n \qquad (4.27)$
$\operatorname{Var}(Y) = 2n \qquad (4.28)$
Script 4.14
import numpy as np
from scisuit.stats import rchisq

N = 1000 #number of samples
x = rchisq(n=N, df=10) #df=10 chosen only for demonstration
print(np.mean(x), np.var(x)) #expected to be close to df and 2*df
The t distribution is used to test whether the difference between the means of two samples of
observations is statistically significant assuming they were drawn from the same population (Forbes et
al., 2011).
In sections (4.2.2 & 4.2.3) it was shown that if y1, y2, …, yn is a random sample from a normal
distribution with mean μ and standard deviation σ, then $\dfrac{\bar{Y}-\mu}{\sigma/\sqrt{n}}$ has a standard normal distribution (SND).
However, Gosset (Student, 1908) realized that $\dfrac{\bar{Y}-\mu}{s/\sqrt{n}}$ does not have a SND and derived its probability
density function.
Let’s see the differences between SND and t-distribution using the short Python code:
Script 4.15
from numpy import linspace
from scisuit.stats import dnorm, dt
from scisuit.plot import plot, show, legend

x = linspace(-4, 4, num=100)

plot(x, dnorm(x=x), label="SND")
for df in [3, 10, 30]: #df values chosen only for demonstration
    plot(x, dt(x=x, df=df), label=f"df={df}")

legend()
show()
1) Both dists are symmetric.
In the comparison of the t-distribution with the SND it was mentioned that the t-distribution is most useful for small
sample sizes, but we have not yet explained what is meant by small. Larsen & Marx (2011) state that many
tables providing probability values for the t-distribution list degrees of freedom in the range
[1, 30]. Furthermore, elsewhere20 it is mentioned that for a sample size of at least 30, the SND can be
used instead of the t-distribution.
Let Z be a standard normal random variable and V an independent chi-square random variable with n
degrees of freedom. The Student t ratio with n degrees of freedom is,
$T_n = \dfrac{Z}{\sqrt{V/n}} \qquad (4.29)$
In line with the observations from Fig. (4.11), Eq. (4.29) is symmetric: $f_{T_n}(t) = f_{T_n}(-t)$
The PDF for a Student t random variable with n degrees of freedom is,
20 https://www.jmp.com/en_no/statistics-knowledge-portal/t-test/t-distribution.html
$f_{T_n}(t) = \dfrac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\;\Gamma\!\left(\frac{n}{2}\right)\left(1+\frac{t^2}{n}\right)^{(n+1)/2}}, \quad -\infty < t < \infty \qquad (4.30)$
The moment-generating function of t-distribution is undefined 21 and its mean is 0 as can be observed
from Fig. 4.11 for different degrees of freedom.
$\operatorname{Var}(Y) = \dfrac{n}{n-2}, \quad n > 2 \qquad (4.31)$
Script 4.16
from scisuit.plot import scatter, show
from scisuit.stats import rt
from statistics import pvariance

var = []
dfs = range(3, 100, 2)
for df in dfs:
    var.append(pvariance( rt(n=5000, df=df) ))

scatter(x=list(dfs), y=var)
show()
21 https://en.wikipedia.org/wiki/Student's_t-distribution
4.7. F (Fisher–Snedecor) Distribution
It is the ratio of independent chi-square random variables. Many experimental scientists use the
technique called analysis of variance (ANOVA) (Forbes et al., 2011). ANOVA analyzes the variability
in the data to see how much can be attributed to differences in the means and how much is due to
variability in the individual populations (Peck et al., 2016). In one-way ANOVA, F is the ratio of
variation among the samples to variation within the samples.
Suppose that U and V are independent chi-square random variables with m and n degrees of freedom,
respectively. Then,
$F = \dfrac{U/m}{V/n} \qquad (4.32)$
$f_{F_{m,n}}(r) = \dfrac{\Gamma\!\left(\frac{m+n}{2}\right)\, m^{m/2}\, n^{n/2}\, r^{(m/2)-1}}{\Gamma\!\left(\frac{m}{2}\right)\Gamma\!\left(\frac{n}{2}\right)\,(n+mr)^{(m+n)/2}}, \quad r > 0 \qquad (4.33)$
The derivation of Eq. (4.33) is detailed in the textbook from Larsen & Marx (2011). Let’s use a fairly
short script to generate F-distribution curves for constant m (df1) and varying n (df2) and for constant
df2 and varying df1.
Script 4.17
from scisuit.stats import df
from numpy import linspace
import scisuit.plot as plt
x_axis=linspace(0.1, 6, num=500)
dfree = 10
plt.layout(2,1)
plt.subplot(0,0)
for x in [1, 2, 3, 5]:
    plt.plot(x_axis, df(x_axis, df1=x, df2=dfree), label=f"df1={x}")
plt.title("df2=10")
plt.legend()
plt.subplot(1,0)
for x in [1, 2, 3, 5]:
    plt.plot(x_axis, df(x_axis, df1=dfree, df2=x), label=f"df2={x}")
plt.title("df1=10")
plt.legend()
plt.show()
$E(Y) = \dfrac{n}{n-2}, \quad n > 2 \qquad (4.34)$
$\operatorname{Var}(Y) = \dfrac{2n^2(m+n-2)}{m(n-2)^2(n-4)}, \quad n > 4 \qquad (4.35)$
22 https://en.wikipedia.org/wiki/F-distribution
4.8. Weibull Distribution
A random variable X has Weibull distribution (α>0, β>0), if the PDF is defined as follows:
$f(x; \alpha, \beta) = \begin{cases} \dfrac{\alpha}{\beta}\left(\dfrac{x}{\beta}\right)^{\alpha-1} e^{-(x/\beta)^{\alpha}} & x \ge 0 \\ 0 & x < 0 \end{cases} \qquad (4.36)$
where α and β are the shape and scale parameters, respectively. According to Rinne (2009), Eq. (4.36)
is the most often used two-parameter (third parameter, the location was assumed to be 0) Weibull
distribution.
Setting α=1 in Eq. (4.36) gives:
$f(x; \alpha=1, \beta) = \dfrac{1}{\beta}\, e^{-(x/\beta)}$
If λ=1/β then
$f(x; \beta) = \lambda e^{-\lambda x}$
which is the exponential distribution.
The shape parameter (α) can be interpreted in the following way too:
• 0<α<1 →the failure rate decreases over time (waiting time between two subsequent stock
exchange transactions of the same stock),
• α=1 → the failure rate is constant over time (radioactive decay of unstable atoms),
• α>1 → the failure rate increases over time (wind speeds, distribution of the size of droplets)
A “bathtub” diagram and the α-values for above-mentioned examples are presented by Kızılersü 23 et al.
(2018).
Now, let’s remember that in section (4.4.2), it was mentioned that Gamma distribution has shape and
scale parameters, which is similar to Weibull distribution. Let’s investigate the similarities and
differences:
Script 4.18
from numpy import linspace
from scisuit.stats import dgamma, dweibull
import scisuit.plot as plt
x=linspace(0, 7, num=1000)
plt.layout(nrows=2, ncols=1)
plt.subplot(0,0)
for beta in [0.5, 1, 2, 4]:
    plt.plot(x, dgamma(x=x, shape=2, scale=beta), label=str(beta))
plt.title("Gamma")
plt.legend()
plt.subplot(1,0)
for beta in [0.5, 1, 2, 4]:
    plt.plot(x, dweibull(x=x, shape=2, scale=beta), label=str(beta))
plt.title("Weibull")
plt.legend()
plt.show()
23 Kızılersü A, Kreer M, Thomas AW. The Weibull Distribution. Significance, April 2018.
Fig 4.14: Gamma and Weibull density curves for different scale (β) values (α=2)
Similarities:
1) when α≤1, the curve is strictly decreasing as x increases,
2) when α>1, f(x; α) rises to a maximum and then decreases as x increases,
3) when α=1, both distributions show exactly the same characteristics (Why?).
Differences:
1) when α>1, f(x; α) rises to a maximum; however, the Weibull density decreases sharply whereas the Gamma density decreases gradually as x increases.
$M_Y(t) = \displaystyle\sum_{n=0}^{\infty} \dfrac{t^{n}\beta^{n}}{n!}\, \Gamma\!\left(1+\dfrac{n}{\alpha}\right), \quad \alpha \ge 1 \qquad (4.37)$
$E(Y) = \beta\, \Gamma\!\left(1+\dfrac{1}{\alpha}\right) \qquad (4.38)$
$\operatorname{Var}(Y) = \beta^{2}\left\{\Gamma\!\left(1+\dfrac{2}{\alpha}\right) - \left[\Gamma\!\left(1+\dfrac{1}{\alpha}\right)\right]^{2}\right\} \qquad (4.39)$
Example 4.6
The article24 by Field and Blumenfeld (2016) investigates modeling the time to repair for reusable
shipping containers, which are fairly expensive and need to be monitored carefully. The random
variable X is defined as the time required for repair, in months. The authors recommended the Weibull
distribution with parameters α=10.0 and β=3.5. What is the probability that a container requires repair
within the first 3 months?
Solution:
Script 4.19
from scisuit.stats import pweibull

#P(X ≤ 3); parameter names follow dweibull (shape=α, scale=β)
print(pweibull(q=3, shape=10, scale=3.5))
Note that we are almost certain that a container will not require repair in the first two months, but it will
almost certainly require repair within the first 4 months. Why is that?
24 Field DA, Blumenfeld D. Supply Chain Inventories of Engineered Shipping Containers. International Journal of
Manufacturing Engineering. Available at: https://doi.org/10.1155/2016/2021395
1. First of all, in the previous section (4.8.1) it was mentioned that if α>1 then the failure rate
increases over time, which coincides with our observation.
2. Secondly, let's compute the mean and standard deviation of this specific distribution:
μ = 3.5 Γ(1+1/10) = 3.33
σ = 3.5 [Γ(1+2/10) − Γ(1+1/10)²]^(1/2) ≈ 0.40
Thus it is reasonable to expect a high probability of repair being required in the range 3.33−0.4 ≤ x ≤
3.33+0.4. This is also evidenced in the following figure:
P (4 months) = 0.978
P (5 months) = 0.9999
P (6 months) = 1.0
Beta distribution is defined on the interval [0, 1] or (0, 1) in terms of two parameters, α>0 and β>0
which control the shape of the distribution (Wikipedia 25, 2023). It is frequently used as a prior
distribution for binomial proportions in Bayesian analysis (Forbes et al., 2011) and often used as a
model for proportions, i.e. proportion of impurities in a chemical product or the proportion of time that
a machine is under repair (Wackerly et al., 2008).
A random variable Y is said to have a beta distribution with parameters α, β, A and B if the pdf is,
$f(y; \alpha, \beta, A, B) = \dfrac{1}{B-A}\cdot\dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\cdot\left(\dfrac{y-A}{B-A}\right)^{\alpha-1}\cdot\left(\dfrac{B-y}{B-A}\right)^{\beta-1}, \quad A \le y \le B \qquad (4.40)$
If A=0 and B=1 then Eq. (4.40) gives standard26 beta distribution:
where Β is:
$B(\alpha, \beta) = \displaystyle\int_{0}^{1} y^{\alpha-1}(1-y)^{\beta-1}\, dy = \dfrac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \qquad (4.43)$
25 https://en.wikipedia.org/wiki/Beta_distribution
26 R (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Beta.html) and scisuit uses standard beta distribution where
parameters shape1, shape2 corresponds to α and β, respectively.
Let’s demonstrate the relationship between beta and gamma functions:
Script 4.20
#Python's built-in math library does not have the beta function
from scipy.special import beta
from math import gamma
from random import randint

a, b = randint(1,10), randint(1,10)

#Eq. (4.43): B(a, b) = Γ(a)Γ(b)/Γ(a+b)
print(beta(a, b), gamma(a)*gamma(b)/gamma(a+b))
$M_Y(t) = 1 + \displaystyle\sum_{k=1}^{\infty}\left(\prod_{r=0}^{k-1}\dfrac{\alpha+r}{\alpha+\beta+r}\right)\dfrac{t^{k}}{k!} \qquad (4.44)$
Series expansion of Eq. (4.44) can be conveniently obtained using hypergeometric function at Wolfram
Alpha28. Let’s present the first 3 terms of the series:
$M_Y(t) = 1 + \dfrac{\alpha t}{\alpha+\beta} + \dfrac{\alpha(\alpha+1)t^{2}}{2(\alpha+\beta)(\alpha+\beta+1)} + \cdots$
$\dfrac{d}{dt} M_Y(t)\Big|_{t=0} = \dfrac{\alpha}{\alpha+\beta}$
27 https://en.wikipedia.org/wiki/Beta_distribution
28 https://www.wolframalpha.com/input?i=1F1%28%CE%B1%2C+%CE%B1+%2B+%CE%B2%2C+t%29+series
$\mu = E(Y) = \dfrac{\alpha}{\alpha+\beta} \qquad (4.45)$
$E(Y^2) = \dfrac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}$
$\operatorname{Var}(Y) = \dfrac{\alpha\beta}{(\alpha+\beta)^{2}(\alpha+\beta+1)} \qquad (4.46)$
Example 4.7
A wholesale distributor has storage tanks to hold fixed supplies of gasoline which are filled at the
beginning of the week. The wholesaler is interested in the proportion of the supply that is sold during
the week. After several weeks of data collection it is found that the proportion that is sold could be
modeled by a beta distribution with α=4 and β=2. Find the probability that the wholesaler will sell at
least 90% of her stock in a given week (Adapted from Wackerly et al., 2008).
Solution:
A simple script will yield the solution (remember that shape1= α and shape2= β).
Script 4.21
from scisuit.stats import pbeta
prob = pbeta(q=0.9, shape1=4, shape2=2)
print(f"P(Y>0.9) = {1 - prob}")
P(Y>0.9) = 0.081
Imagine also the wholesaler is interested in up to what proportions of gasoline could be sold 50, 75 and
95% of the time?
Script 4.22
from scisuit.stats import qbeta
probs = [0.5, 0.75, 0.95]
for p in probs:
    print(f"{p*100}%: {qbeta(p=p, shape1=4, shape2=2)}")
50%: 0.68, 75%: 0.81, 95%: 0.92
It is seen that the curve is left-skewed (Why?). This might be considered "lucky" for the wholesaler,
as lower probabilities are associated with lower gasoline sale proportions.
Uniform: There is almost the same amount of numbers in each equally spaced sub-interval. $f(y) = \begin{cases}\frac{1}{b-a} & a \le y \le b \\ 0 & \text{elsewhere}\end{cases}$
Weibull: Commonly used as a lifetime distribution in reliability. $f(x; \alpha, \beta) = \frac{\alpha}{\beta}\left(\frac{x}{\beta}\right)^{\alpha-1} e^{-(x/\beta)^{\alpha}}, \; x \ge 0$
A point estimate is a single value (i.e., mean, median, proportion, ...) computed from sampled data to
represent a plausible value of a population characteristic (Peck et al. 2016).
Example 5.1
Out of 7421 US College students 2998 reported using internet more than 3 hours a day. What is the
proportion of all US College students who use internet more than 3 hours a day? ( Adapted from Peck et
al. 2016).
Solution:
The solution is straightforward: p = 2998/7421 ≈ 0.40
Based on the statistics it is possible to claim that approximately 40% of the students in US spend more
than 3 hours a day using the internet. Please note that based on the survey result, we made a claim
about the population, students in US. ■
Now that we made an estimate based on the survey, we should ask ourselves: "How reliable is this estimate?". We know that if we had another group of students, the percentage might not have been 40; maybe it would be 45 or 35. There are no perfect estimators, but we expect that on average the estimator should give us the right answer:

E(Θ) = θ    (5.1)
Before we proceed with the solution, let's refresh ourselves with a simple example: Suppose we conduct an experiment where we flip a coin 10 times. We already know that the probability of getting heads (success) is p = 0.5. However, we want to estimate p by flipping the coin and calculating the sample proportion, X/n. If we flip the coin 10 times and get X = 6 heads, the estimate is 0.6; averaged over many repetitions of the experiment, however, the estimate settles at 0.5. More formally,

E(X/n) = (1/n)·E(X) = (1/n)·n·p = p
therefore, X/n is an unbiased estimator of p. ■
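A quick simulation of the coin experiment described above (a sketch using only Python's standard library) shows that the average of X/n over many repetitions settles near p = 0.5:

from random import random

n, repeats = 10, 100_000
estimates = []
for _ in range(repeats):
    X = sum(1 for _ in range(n) if random() < 0.5)  #number of heads in 10 flips
    estimates.append(X/n)

print(sum(estimates)/repeats)  #close to 0.5, illustrating E(X/n) = p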
Example 5.3
Prove that (1/(n−1))·Σ_{i=1}^{n} (X_i − X̄)² is an unbiased estimator of the population variance (σ²).
Solution:
E(S²) = E[ (1/(n−1))·Σ_{i=1}^{n} (X_i − X̄)² ]

= (1/(n−1))·E[ Σ_{i=1}^{n} ((X_i − μ) − (X̄ − μ))² ]

Expanding the square, the cross term −2(X̄ − μ)·Σ(X_i − μ) equals −2n(X̄ − μ)², so

= (1/(n−1))·E[ Σ_{i=1}^{n} (X_i − μ)² − n·(X̄ − μ)² ]

Note that E(X_i − μ)² = σ² and E(X̄ − μ)² = σ²/n. Putting the knowns in the last equation,

= (1/(n−1))·( n·σ² − n·σ²/n ) = σ²
Therefore, E(S2) is an unbiased estimator of population variance. ■
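The result of Example 5.3 can be illustrated numerically: dividing by n−1 (ddof=1 in numpy) gives an average close to σ², whereas dividing by n systematically underestimates it. A minimal sketch (the population here is chosen arbitrarily as N(0, 2)):

import numpy as np
from scisuit.stats import rnorm

n, repeats, sigma = 5, 50_000, 2.0
s2_unbiased, s2_biased = [], []
for _ in range(repeats):
    sample = np.array(rnorm(n=n, mean=0, sd=sigma))
    s2_unbiased.append(np.var(sample, ddof=1))  #divide by n-1
    s2_biased.append(np.var(sample))            #divide by n

print(np.mean(s2_unbiased))  #close to sigma**2 = 4
print(np.mean(s2_biased))    #close to (n-1)/n * sigma**2 = 3.2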
Example 5.4
For the uniform probability distribution f_Y(y; θ) = 1/θ, 0 ≤ y ≤ θ, there are two estimators for θ:

1. θ̂₁ = (2/n)·Σ_{i=1}^{n} Y_i
2. θ̂₂ = Y_max
Which one is an unbiased estimator of θ? (Adapted from Larsen & Marx 2011)
Solution #1:

E(θ̂₁) = E( (2/n)·Σ_{i=1}^{n} Y_i ) = (2/n)·Σ_{i=1}^{n} E(Y_i) = (2/n)·n·(θ/2) = θ

since E(Y_i) = θ/2 for a uniform distribution on (0, θ); therefore θ̂₁ is an unbiased estimator of θ.
Solution #2:
Using the equation given in Example (2.10), the pdf of Y_max can be found as follows:

f_{Y_max}(u) = n·(1/θ)·(u/θ)^(n−1)

E(θ̂₂) = ∫₀^θ u·n·(1/θ)·(u/θ)^(n−1) du = (n/(n+1))·θ
It is seen that as n increases, the "bias" decreases, and for large n the estimator becomes asymptotically unbiased.
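A simulation (a sketch using only Python's standard library, with θ = 10 chosen arbitrarily) makes the contrast concrete: the average of θ̂₁ stays near θ, whereas the average of θ̂₂ stays near n·θ/(n+1):

from random import uniform

theta, n, repeats = 10.0, 5, 50_000
est1, est2 = [], []
for _ in range(repeats):
    y = [uniform(0, theta) for _ in range(n)]
    est1.append(2*sum(y)/n)   #theta_hat_1
    est2.append(max(y))       #theta_hat_2

print(sum(est1)/repeats)  #close to theta = 10
print(sum(est2)/repeats)  #close to n/(n+1)*theta = 8.33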
5.1.2. Efficiency
It is seen in section (5.1.1) that a parameter can have more than one unbiased estimators. Which one
should we choose? We should choose one with higher precision, in other words, with smaller variance.
Definition: Let θ̂₁ and θ̂₂ be two unbiased estimators for parameter θ. If Var(θ̂₁) < Var(θ̂₂), then θ̂₁ is said to be more efficient than θ̂₂.
Example 5.5
Given the estimators in Example (5.4), which one is more efficient?
Solution:
A tedious mathematical derivation is presented by Larsen & Marx (2011). The results are as follows:
Var(θ̂₁) = θ²/(3n)

Var(θ̂₂) = θ²/(n·(n+2))
For n>1, it is seen that second estimator has a smaller variance than the first one. Therefore, it is more
efficient. ■
There are more properties of estimators: i) minimum variance, ii) robustness, iii) consistency, iv)
sufficiency. Interested readers can refer to textbooks on mathematical statistics (Devore et al. 2021;
Larsen & Marx 2011; Miller & Miller 2014).
5.2. Statistical Confidence
Suppose you want to estimate the SAT scores of students. For that purpose, a randomly selected 500
students have been given an SAT test and a mean value of 461 is obtained ( adapted from Moore et al.
2009). Although it is known that the sample mean is an unbiased estimator of the population mean (μ),
we already know that had we sampled another 500 students, the mean could (most likely would) have
been different than 461. Therefore, how confident are we in claiming that the population mean is 461?
Suppose that the standard deviation of the population is known (σ = 100). We know that if we repeatedly take samples of size 500, the means of these samples will follow the N(μ, 100/√500 ≈ 4.5) curve.
Script 5.1
import scisuit.plot as plt
from scisuit.stats import rnorm
aver = []
for i in range(1000):
    sample = rnorm(n=500, mean= 461, sd= 100)
    aver.append(sum(sample)/500)
plt.hist(data=aver, density=True)
plt.show()
Since about 99.7% of sample means fall within three standard deviations of μ, the interval x̄ ± 3×4.5 captures μ in roughly 99.7% of samples:

461 − 3×4.5 = 447.5
461 + 3×4.5 = 474.5
A way to quantify the amount of uncertainty in a point estimator is to construct a confidence interval
(Larsen & Marx 2011). The definition of confidence interval is as follows: “... an interval computed
from sample data by a method that has probability C of producing an interval containing the true value
of the parameter." (Moore et al. 2009). Peck et al. (2016) gives a general form of the confidence interval as follows:

point estimate ± (critical value) × (estimated standard deviation of the statistic)
Note that the estimated standard deviation of the statistic is also known as standard error. In other
words, when the standard deviation of a statistic is estimated from the data (because the population’s
standard deviation is not known), the result is called the standard error of the statistic (Moore et al.
2009).
Example 5.6
Establish a confidence interval for binomial distribution.
Solution:
We already know (chapter 4.2) that Abraham DeMoivre showed that when X is a binomial random
variable and n is large the probability can be approximated as follows:
lim_{n→∞} P( a ≤ (X − np)/√(np(1−p)) ≤ b ) = (1/√(2π)) ∫_a^b e^(−z²/2) dz

P( −z_{α/2} ≤ (X − np)/√(np(1−p)) ≤ z_{α/2} ) = 1 − α

Dividing numerator and denominator by n and replacing p under the square root with its estimate X/n gives

P( −z_{α/2} ≤ (X/n − p)/√((X/n)(1 − X/n)/n) ≤ z_{α/2} ) = 1 − α

Rewriting the equation by isolating p leads to the confidence interval

( k/n − z_{α/2}·√((k/n)(1 − k/n)/n) ,  k/n + z_{α/2}·√((k/n)(1 − k/n)/n) )    (5.3)
■
If a 95% confidence interval is to be established, then z_{α/2} ≈ 1.96:
Script 5.2
from scisuit.stats import qnorm

alpha1 = 0.05
alpha2 = 0.01
print(qnorm(alpha1/2), qnorm(1-alpha1/2))
print(qnorm(alpha2/2), qnorm(1-alpha2/2))
-1.95996 1.95996
-2.57583 2.57583
Note that if a 95% confidence interval (CI) yields an interval (0.52, 0.57), it is tempting to say that
there is a probability of 0.95 that p will be in between 0.52 and 0.57. Larsen & Marx (2011) and Peck
et al. (2016) warn against this temptation. A close look at Eq. (5.3) reveals that from sample to sample
the constructed CI will be different. However, in the long run 95% of the constructed CIs will contain
the true p and 5% will not. This is well depicted in the figure (Figure 9.4 at pp. 471) presented by Peck
et al. (2016).
Note also that a 99% CI will be wider than a 95% CI. However, the higher reliability causes a loss in
precision. Therefore, Peck et al. (2016) remarks that many investigators consider a 95% CI as a
reasonable compromise between reliability and precision.
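Using the survey of Example 5.1 (p̂ = 2998/7421 ≈ 0.40, n = 7421), the loss of precision is easy to see: the 99% interval is noticeably wider than the 95% one. A short sketch based on the binomial CI derived in Example 5.6:

from math import sqrt
from scisuit.stats import qnorm

phat, n = 2998/7421, 7421
se = sqrt(phat*(1 - phat)/n)

for conf in (0.95, 0.99):
    z = qnorm(1 - (1 - conf)/2)
    print(f"{conf:.0%} CI: ({phat - z*se:.4f}, {phat + z*se:.4f})")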
5.4. Hypothesis Testing
Confidence intervals and statistical tests are the two most important ideas in the age of modern
statistics (Kreyszig et al. 2011). A confidence interval is constructed when we would like to estimate a population parameter. Another type of inference is to assess the evidence provided by data against a
claim about a parameter of the population (Moore et al. 2009). Therefore, after carrying out an
experiment conclusions must be drawn based on the obtained data. The two competing propositions are
called the null hypothesis (H0) and the alternative hypothesis (H1) (Larsen & Marx 2011).
We initially assume that a particular claim about a population (H0) is correct. Then, based on the evidence from the data, we either reject H0 in favor of H1 if there is compelling evidence against it, or we fail to reject H0 (Peck et al. 2016).
An example from Larsen & Marx (2011) would clarify the concepts better: Imagine as an automobile
company you are looking for additives to increase gas mileage. Without the additives, the cars are
known to average 25.0 mpg with a σ=2.4 mpg and with the addition of additives, it was found
(experiment involved 30 cars) that the mileage increased to 26.3 mpg.
Now, in terms of the null and alternative hypotheses, H0 claims μ = 25.0 mpg while H1 claims that the additive increases the mileage (the observed average with the additive was 26.3 mpg). We know that
if the experiments were carried out with another 30 cars, the result would be different (lower or higher)
than 26.3 mpg. Therefore, “is an increase to 26.3 mpg due to additives or not?”. At this point we
should rephrase our question: “if we sample 30 cars from a population with μ=25.0 mpg and σ=2.4,
what are the chances that we will get 26.3 mpg on average?”. If the chances are high, then the additive
is not working; however, if the chances are low, then it must be due to the additives that the cars are
getting 26.3 mpg. Let’s evaluate this with a script (note the similarity to Script 5.1):
Script 5.3
from scisuit.stats import rnorm

aver = []
for i in range(10000):
    sample = rnorm(n=30, mean= 25, sd= 2.4)
    aver.append(sum(sample)/30)

Inspecting the simulated sample means shows that averages as large as the observed one are extremely rare; the corresponding tail probability is

P( Z ≥ (26.50 − 25.0)/(2.4/√30) ) ≈ 0.0003
If, for example, the test statistic yields Z = 1.37 and we are carrying out a two-sided test, the p-value would be P(Z ≤ −1.37 or Z ≥ 1.37), where Z has a standard normal distribution.
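For instance, the two-sided p-value for Z = 1.37 can be computed directly with pnorm (a short sketch):

from scisuit.stats import pnorm

z = 1.37
pvalue = pnorm(-z) + (1 - pnorm(z))  #P(Z <= -1.37) + P(Z >= 1.37)
print(pvalue)  #approximately 0.17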
A simpler definition is given by Miller & Miller (2014): “… the lowest level of significance at which
the null hypothesis could have been rejected". Let's rephrase Miller & Miller's (2014) definition: once a
level of significance is decided (e.g. α=0.05), if the computed p-value is less than the α, then we reject
H0. For example, in the gasoline additive example, p-value was computed as 0.0003 and if α=0.05, then
since p< α, we reject H0 in favor of H1 (i.e., additive has effect).
Example 5.7
A bakery claims on its packages that its cookies are 8 g. It is known that the standard deviation of the 8
g packages of cookies is 0.16 g. As a quality control engineer, you collected 25 packages and found that
the average is 8.091 g. Is the production process going alright? (adapted from Miller & Miller 2014).
Solution:
The test statistic: z = (8.091 − 8)/(0.16/√25) = 2.84

The two-sided p-value:

from scisuit.stats import pnorm
print(1 - pnorm(2.84) + pnorm(-2.84))   #equivalently 2*(1 - pnorm(2.84))

0.0045
Since p<0.05, we reject the null hypothesis. Therefore, the process should be checked and suitable
adjustments should be made.
6. Z-Test for Population Means
The fundamental limitation to applying z-test is that the population variance must be known in advance
(Kanji 2006; Moore et al. 2009; Peck et al. 2016). The test is accurate when the population is normally
distributed; however, it will give an approximate value even if the population is not normally
distributed (Kanji 2006). In most practical applications the population variance is unknown and the sample size is small; therefore, a t-test is more commonly used.
From a population with known mean (μ) and standard deviation (σ), a random sample of size n is taken (generally n ≥ 30) and the sample mean (x̄) is calculated. The test statistic is:

Z = (x̄ − μ)/(σ/√n)    (6.1)
Example 6.1
A filling process is set to fill tubs with powder of 4 g on average. For this filling process it is known
that the standard deviation is 1 g. An inspector takes a random sample of 9 tubs and obtains the
following data: Weights = [3.8, 5.4, 4.4, 5.9, 4.5, 4.8, 4.3, 3.8, 4.5]. Is the filling process on target?
Solution:
Test statistic: Z = (4.6 − 4)/(1/√9) = 1.8
Since 1.8 is in the range of -1.96<Z<1.96, we cannot reject the null hypothesis, therefore the filling
process works fine (i.e. there is no evidence to suggest it is different than 4 g).
Is it over-filling?
Now we are going to carry out a one-tailed z-test; therefore, the acceptance region is Z < 1.645. Since the test statistic is greater than 1.645, we reject the null hypothesis and have evidence that the filling process is over-filling.
Script 6.1
import scisuit.plot as plt
from scisuit.stats import test_z
data = [3.8, 5.4, 4.4, 5.9, 4.5, 4.8, 4.3, 3.8, 4.5]
result = test_z(x=data, sd1=1, mu=4)
print(result)
N=9, mean=4.6, Z=1.799
p-value = 0.072 (two.sided)
Confidence interval (3.95, 5.25)
Since p>0.05, we cannot reject H0.
Script 6.1 requires a minor change to analyze whether the process is over-filling or not: we set the parameter, namely alternative, to "greater" (its default value is "two.sided").
Script 6.2
result = test_z(x=data, sd1=1, mu=4, alternative="greater")
print(result)
p-value = 0.036 (greater)
Confidence interval (4.052, inf)
Since p<0.05, we reject the null hypothesis in favor of alternative hypothesis.
In essence, the two-sample z-test is very similar to the one-sample z-test: we take n1 and n2 samples from two populations with means (μ1 and μ2) and standard deviations (σ1 and σ2). The test statistic is computed as:
Z = ( x̄₁ − x̄₂ − (μ₁ − μ₂) ) / ( σ₁²/n₁ + σ₂²/n₂ )^(1/2)    (6.2)
Example 6.2
A survey has been conducted to see if studying over or under 10 h/week has an effect on overall GPA.
For those who studied less (x) and more (y) than 10 h/week the GPAs were:
x=[2.80, 3.40, 4.00, 3.60, 2.00, 3.00, 3.47, 2.80, 2.60, 2.0]
y = [3.00, 3.00, 2.20, 2.40, 4.00, 2.96, 3.41, 3.27, 3.80, 3.10, 2.50].
respectively. It is known that the standard deviation of GPAs for the whole campus is σ=0.6. Does
studying over or under 10 h/week has an effect on GPA? (Adapted from Devore et al. 2021)
Solution:
We have two groups (those studying over and under 10 h/week) from the same population (whole
campus) whose standard deviation is known (σ=0.6).
We will solve this question directly using a Python script and the mathematical computations are left as
an exercise to the reader.
Script 6.3
from scisuit.stats import test_z

x = [2.80, 3.40, 4.00, 3.60, 2.00, 3.00, 3.47, 2.80, 2.60, 2.0]
y = [3.00, 3.00, 2.20, 2.40, 4.00, 2.96, 3.41, 3.27, 3.80, 3.10, 2.50]
mu = 0
sd1, sd2 = 0.6, 0.6
#two-sample call; the parameter names below follow the one-sample usage in Script (6.1), with sd2 assumed analogous
result = test_z(x=x, y=y, sd1=sd1, sd2=sd2, mu=mu)
print(result)
Then, comes the question: What effect does replacing σ with S have on Z ratio? ( Larsen & Marx 2011).
In order to answer this question, let’s demonstrate the effect of replacing σ with S on Z ratio with a
script:
Script 7.1
import numpy as np
import scisuit.plot as plt
from scisuit.stats import dnorm, rnorm

N = 4
sigma, mu = 1.0, 0.0 #stdev and mean of population

z, t = [], []
for i in range(1000):
    sample = rnorm(n=N)
    aver = sum(sample)/N
    s = np.std(sample, ddof=1)                 #sample standard deviation
    z.append((aver - mu)/(sigma/np.sqrt(N)))   #sigma known
    t.append((aver - mu)/(s/np.sqrt(N)))       #sigma replaced by S

#standard normal curve for comparison (dnorm evaluated pointwise)
x = np.linspace(-4, 4, 100)
y = [dnorm(float(v)) for v in x]

plt.layout(nrows=2, ncols=1)
plt.subplot(row=0, col=0)
plt.scatter(x=x, y=y)
plt.hist(data=z, density=True)
plt.title("Population Std Deviation")
plt.subplot(row=1, col=0)
plt.scatter(x=x, y=y)
plt.hist(data=t, density=True)
plt.title("Sample Std Deviation")
plt.show()
Fig 7.1: (x̄−μ)/(σ/√n) (top) vs (x̄−μ)/(S/√n) (bottom)
Note that in Script (7.1), N was intentionally chosen a small value (N=4). It is recommended to change
N to a greater number, such as 10, 20 or 50 in order to observe the effect of large samples.
Let x̄ and s be the mean and standard deviation of a random sample from a normally distributed
population. Then,
t = (x̄ − μ)/(s/√n)    (7.1)

has a t distribution with df = n−1. Here s is the sample's standard deviation, computed as:

s = √( (1/(n−1))·Σ_{i=1}^{n} (X_i − X̄)² )    (7.2)
Example 7.1
In 2006, a report revealed that UK subscribers with 3G phones listen to full-track music for 8.3 hours/month on average. The data for a random sample of size 8 of US subscribers is x = [5, 6, 0, 4, 11, 9, 2, 3]. Is there a difference between US and UK subscribers? (Adapted from Moore et al. 2009).
Solution:
Script 7.2
from math import sqrt
from statistics import stdev
from scisuit.stats import qt

x = [5, 6, 0, 4, 11, 9, 2, 3]
n = len(x)
df = n - 1                  #degrees of freedom
aver = sum(x)/n
stderr = stdev(x)/sqrt(n)   #standard error

t = (aver - 8.3)/stderr     #test statistic, approximately -2.57
print(t)
#the critical value for a two-sided test at α=0.05 can be obtained with qt
#(an R-like call such as qt(0.975, df=df) ≈ 2.365 is assumed)
Script 7.3
from scisuit.stats import test_t
x=[5, 6, 0, 4, 11, 9, 2, 3]
result = test_t(x=x, mu=8.3)
print(result)
One-sample t-test for two.sided
N=8, mean=5.0
SE=1.282, t=-2.575
p-value =0.037
Confidence interval: (1.97, 8.03)
Since p<0.05 we reject H0 and claim that there is a statistically significant difference between US and UK subscribers. [If the alternative parameter in test_t was set to "less" instead of "two.sided", then p = 0.018; therefore, we would reject H0 in favor of H1, i.e., US subscribers indeed listen less than UK subscribers.]
S_P² = ( Σ_{i=1}^{n} (X_i − X̄)² + Σ_{i=1}^{m} (Y_i − Ȳ)² ) / (n + m − 2)    (7.3)

T_{n+m−2} = ( X̄ − Ȳ − (μ_X − μ_Y) ) / ( S_p·√(1/n + 1/m) )    (7.4)
X= [3, 1, 2, 1, 3, 2, 4, 2, 1]
Y = [5, 4, 3, 4, 5, 4, 4, 5, 4]
Is there a difference between the scores of the two semesters? (Adapted from Larsen & Marx 2011).
Solution:
1. The variance of the populations are not known, therefore z-test cannot be applied.
2. It is reasonable to assume equal variances since the X and Y have the same demographics.
Script 7.4
from scisuit.stats import test_t
x = [3, 1, 2, 1, 3, 2, 4, 2, 1]
y = [5, 4, 3, 4, 5, 4, 4, 5, 4]
result = test_t(x=x, y=y, varequal=True)
print(result)
Two-sample t-test assuming equal variances
n1=9, n2=9, df=16
s1=1.054, s2=0.667
Pooled std = 0.882
t = -5.07
p-value = 0.0001 (two.sided)
Confidence interval: (-2.992, -1.230)
Since p<0.05, the difference between the scores of fall and spring are statistically significant.
7.2.2. Unequal Variances
Similar to section 7.2.1, we are drawing random samples of size n1 and n2 from normal distributions
with means μX and μY, but with standard deviations σX and σY, respectively.
S₁² = Σ_{i=1}^{n₁} (X_i − X̄)² / (n₁ − 1)   and   S₂² = Σ_{i=1}^{n₂} (Y_i − Ȳ)² / (n₂ − 1)    (7.5)

t = ( X̄ − Ȳ − (μ_X − μ_Y) ) / √( s₁²/n₁ + s₂²/n₂ )    (7.6)
In 1938 Welch30 showed that t is approximately distributed as a Student’s t random variable with df:
df = ( s₁²/n₁ + s₂²/n₂ )² / ( s₁⁴/(n₁²·(n₁−1)) + s₂⁴/(n₂²·(n₂−1)) )    (7.7)
Example 7.3
A study by Larson and Morris31 (2008) surveyed the annual salary of men and women working as
purchasing managers subscribed to Purchasing magazine. The salaries are (in thousands of US dollars):
Men = [81, 69, 81, 76, 76, 74, 69, 76, 79, 65]
Women = [78, 60, 67, 61, 62, 73, 71, 58, 68, 48]
Is there a difference in salaries between men and women? (Adapted from Peck et al. 2016)
30 https://www.jstor.org/stable/2332010
31 Larson PD & Morris M (2008). Sex and Salary: A Survey of Purchasing and Supply Professionals, Journal of
Purchasing and Supply Management, 112–124.
Solution:
1. Z-test cannot be applied because the variance of the populations are not known.
2. Although the samples were selected from the subscribers of Purchasing magazine, Larson and Morris (2008) considered two populations of interest, i.e. male and female purchasing managers. Therefore, equal variances should not be assumed.
Script 7.5
from scisuit.stats import test_t
Men = [81, 69, 81, 76, 76, 74, 69, 76, 79, 65]
Women = [78, 60, 67, 61, 62, 73, 71, 58, 68, 48]
result = test_t(x=Women, y=Men, varequal=False)
print(result)
Two-sample t-test assuming unequal variances
n1=10, n2=10, df=15
s1=8.617, s2=5.399
t = -3.11
p-value = 0.007 (two.sided)
Confidence interval: (-16.7, -3.1)
Since p<0.05, there is statistically significant difference between salaries of each group.
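The degrees of freedom reported by test_t (df = 15) can be reproduced from Eq. (7.7). A short sketch using the sample variances of the two salary groups:

from statistics import variance

Men = [81, 69, 81, 76, 76, 74, 69, 76, 79, 65]
Women = [78, 60, 67, 61, 62, 73, 71, 58, 68, 48]

v1, n1 = variance(Women), len(Women)
v2, n2 = variance(Men), len(Men)

df = (v1/n1 + v2/n2)**2 / ((v1/n1)**2/(n1 - 1) + (v2/n2)**2/(n2 - 1))
print(df)  #approximately 15.1, consistent with the df=15 reported by test_t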
In essence a paired t-test is a two-sample t-test, as there are two samples. However, the two samples are not independent, as each observation in the first sample is paired in a meaningful way with a particular observation in the second sample (Larsen & Marx 2011; Peck et al. 2016).
The equation to compute the test statistics is similar to one-sample t-test, Eq. (7.1):
t = (x̄ − μ)/(s/√n)    (7.8)
where x̄ and s are mean and standard deviation of the sample differences, respectively. The degrees of
freedom is: df=n-1.
Example 7.4
In a study, 6th grade students who had not previously played chess participated in a program in which they took chess lessons and played chess daily for 9 months. The data below shows their memory test scores before and after taking the lessons:
Pre = [510, 610, 640, 675, 600, 550, 610, 625, 450, 720, 575, 675]
Post = [850, 790, 850, 775, 700, 775, 700, 850, 690, 775, 540, 680]
Is there evidence that playing chess increases the memory scores? (Adapted from Peck et al. 2016).
Solution:
Pre- and post-test scores are not independent since they were applied to the same subjects; therefore, a paired t-test is appropriate.
Script 7.6
from scisuit.stats import test_t
Pre = [510, 610, 640, 675, 600, 550, 610, 625, 450, 720, 575, 675]
Post = [850, 790, 850, 775, 700, 775, 700, 850, 690, 775, 540, 680]
result = test_t(x=Post, y=Pre, paired=True)
print(result)
Paired t-test for two.sided
N=12, mean1=747.9, mean2=603.3, mean diff=144.6
t =4.564
p-value =0.0008
Confidence interval: (74.9, 214.3)
Since p<0.05, there is statistical evidence that playing chess indeed made a difference in increasing the
memory scores.
If the parameter alternative was set to "less", then p = 0.99; therefore, we would fail to reject H0 (there is no evidence that Post < Pre). If, on the other hand, alternative was set to "greater", then p = 0.0004; therefore, we would reject H0 and accept H1 (Post > Pre).
8. F-Test for Population Variances
Assume that a metal rod production facility uses two machines on the production line. The machines produce rods with mean thicknesses μX and μY which are not significantly different. However, if the variabilities are significantly different, then some of the produced rods might become unacceptable as they will fall outside the engineering specifications.
In Section (7.2), it was shown that there are two cases for two-sample t-tests: whether variances were
equal or not. To be able to choose the right procedure, Larsen & Marx (2011) recommended that F test
should be used prior to testing for μX=μY.
Let’s draw random samples from populations with normal distribution. Let X1, … , Xm be a random
sample from a population with standard deviation σ1 and let Y1, …, Yn be another random sample from a
population with standard deviation σ2. Let S1 and S2 be the sample standard deviations. Then the test
statistic is:
F = ( S₁²/σ₁² ) / ( S₂²/σ₂² )    (8.1)
Example 8.1
α-waves produced by the brain have a characteristic frequency from 8 to 13 Hz. The subjects were 20 inmates in a Canadian prison who were randomly split into two groups: one group was placed in solitary confinement; the other group was allowed to remain in their own cells. Seven days later, α-wave frequencies were measured for all twenty subjects; the results are shown below:
non-confined = [10.7, 10.7, 10.4, 10.9, 10.5, 10.3, 9.6, 11.1, 11.2, 10.4]
confined = [9.6, 10.4, 9.7, 10.3, 9.2, 9.3, 9.9, 9.5, 9, 10.9]
Using a box-whisker plot, let’s first visualize the data as shown in Fig. (8.1).
Script 8.1
from scisuit.stats import test_f, test_f_Result
nonconfined = [10.7, 10.7, 10.4, 10.9, 10.5, 10.3, 9.6, 11.1, 11.2, 10.4]
confined = [9.6, 10.4, 9.7, 10.3, 9.2, 9.3, 9.9, 9.5, 9, 10.9]
result = test_f(x=confined, y=nonconfined)
print(result)
F test for two.sided
df1=9, df2=9, var1=0.357, var2=0.211
F=1.696
p-value =0.443
Confidence interval: (0.42, 6.83)
Since p>0.05, we cannot reject H0 (σ1=σ2). Therefore, there is no statistically significant difference
between the variances of two groups.
9. Analysis of Variance (ANOVA)
In Section (7.2) we have seen that when exactly two means needs to be compared, we could use two-
sample t-test. The methodology for comparing several means is called analysis of variance (ANOVA).
When there is only a single factor with multiple levels, i.e. color of strawberries subjected to different
power levels of infrared radiation, then we can use one-way ANOVA. However, besides infrared power
if we are also interested in different exposure times, then two-way ANOVA needs to be employed.
There are 3 essential assumptions for the test to be accurate (Anon 2024)32: (1) the responses for each factor level are normally distributed, (2) the distributions have equal variances, and (3) the observations are independent.
A similarity comparison of two-sample t-test and ANOVA is given by Moore et al. (2009). Suppose we
are analyzing whether the means of two different groups of same size are different. Then we would
employ two-sample t-test with equal variances (due to assumption #2):
t = ( X̄ − Ȳ ) / ( S_p·√(1/n + 1/n) ) = √(n/2)·( X̄ − Ȳ )/S_p    (9.1)

t² = (n/2)·( X̄ − Ȳ )² / S_p²    (9.2)
32 https://online.stat.psu.edu/stat500/lesson/10/10.2/10.2.1
If we had used ANOVA, the F-statistic would have been exactly equal to t² computed using Eq. (9.2). A careful inspection of Eq. (9.2) reveals a couple of things:
1. The numerator measures the variation between the groups (known as fit).
2. The denominator measures the variation within groups (known as residual), see Eq. (7.3).
H0: μ₁ = μ₂ = ... = μ_k
Ha: At least two of the μ's are different    (9.3)
Therefore the basic idea is, to test H0, we simply compare the variation between the means of the
groups with the variation within groups. A graphical example adapted from Peck et al. (2016) can
cement our understanding:
Let k be the number of populations being compared [in Fig. (9.1) k=3] and n₁, n₂, …, n_k be the sample sizes:

1. Total sample size: N = n₁ + n₂ + … + n_k
2. Grand total T: the sum of all N observations
3. Grand mean: x̄ = T/N
4. Treatment sum of squares: SS_TR = Σ_{i=1}^{k} n_i·(x̄_i − x̄)², where df = k−1
5. Error sum of squares: SS_Error = Σ_{i=1}^{k} (n_i − 1)·s_i², where df = N−k
6. Mean squares:
   MS_TR = SS_TR/(k−1) and MS_Error = SS_Error/(N−k)

F = MS_TR / MS_Error    (9.4)
Before proceeding with an example on ANOVA, let’s further investigate Eq. (9.4). Remember that F
distribution is the ratio of independent chi-square random variables and is given with the following
equation:
F = (U/m) / (V/n)    (9.5)
where U and V are independent chi-square random variables with m and n degrees of freedom.
The following theorem establishes the link between Eqs. (9.4 & 9.5):
Theorem: Let Y1, Y2, …, Yn be random sample from a normal distribution with mean μ and variance σ2.
Then,
(n−1)·S²/σ² = (1/σ²)·Σ_{i=1}^{n} (Y_i − Ȳ)²    (9.6)
has a chi-square distribution with n-1 degrees of freedom. A proof of Eq. (9.6) is given by Larsen &
Marx (2011) and is beyond the scope of this study.
Using Eq. (9.6), it is now easy to see that when the sum of squares of treatment (or error) is divided by σ², it will have a chi-square distribution. Therefore Eq. (9.4) is indeed equivalent to Eq. (9.5) and gives an F distribution with df1 = k−1 and df2 = N−k.
Example 9.1
In most of the integrated circuit manufacturing, a plasma etching process is widely used to remove
unwanted material from the wafers which are coated with a layer of material, such as silicon dioxide. A
process engineer is interested in investigating the relationship between the radio frequency power and
the etch rate. The etch rate data (in Å/min) from a plasma etching experiment is given below:
Does the RF power affect etching rate? (Adapted from Montgomery 2012)
Solution:
Before attempting any numerical solution, let’s first visualize the data using box-whisker plot generated
with a Python script:
Script 9.1
import numpy as np
import scisuit.plot as plt

#etch rate data (Å/min) at RF power levels of 160, 180, 200 and 220 W
rf_160, rf_180 = [575, 542, 530, 539, 570], [565, 593, 590, 579, 610]
rf_200, rf_220 = [600, 651, 610, 637, 629], [725, 700, 715, 685, 710]

for dt in [rf_160, rf_180, rf_200, rf_220]:
    plt.boxplot(dt)
plt.show()

Script 9.2
#create a 2D array
data = np.array([rf_160, rf_180, rf_200, rf_220]) #see Script (9.1)
grandmean = np.mean(data)

ss_tr, ss_error = 0, 0
for dt in data:
    n = len(dt) #size of each sample
    ss_tr += n*(np.mean(dt)-grandmean)**2
    ss_error += (n-1)*np.var(dt, ddof=1) #note ddof=1, the sample variance

k, N = len(data), data.size
Fvalue = (ss_tr/(k-1)) / (ss_error/(N-k))
Fcritical = 3.24  #F(0.05; 3, 16) from a table
print(f"F={Fvalue}, F-critical={Fcritical}")

F=66.8, F-critical=3.24
Since the computed F-value is considerably greater than F-critical, we can safely reject H0. Using scisuit's built-in aov function:
Script 9.3
from scisuit.stats import aov

aovresult = aov(rf_160, rf_180, rf_200, rf_220)
print(aovresult)
One-Way ANOVA Results
Source df SS MS F p-value
Treatment 3 66870.55 22290.18 66.80 2.8829e-09
Error 16 5339.20 333.70
Total 19 72209.75
Since p<0.05, we can reject H0 in favor of H1.
Now, had we not plotted Fig. (9.2), we would not be able to see why H0 has been rejected. As a matter of fact, due to overlapping whiskers and boxes or the presence of outliers, among other reasons, a box-whisker plot does not always clearly show whether H0 will be rejected. Therefore, we need to use post hoc tests along with ANOVA. There are several tests33 for this purpose; here we will be using Tukey's test34.
A post hoc comparison can then be applied to the etch-rate groups from Script (9.1), as sketched below.
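A minimal sketch of Tukey's test using scipy.stats.tukey_hsd (an assumption: requires SciPy 1.8 or newer; the document otherwise relies on scisuit):

from scipy.stats import tukey_hsd

#etch rate data from Script (9.1)
rf_160 = [575, 542, 530, 539, 570]
rf_180 = [565, 593, 590, 579, 610]
rf_200 = [600, 651, 610, 637, 629]
rf_220 = [725, 700, 715, 685, 710]

res = tukey_hsd(rf_160, rf_180, rf_200, rf_220)
print(res)  #pairwise mean differences with confidence intervals and p-values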
In one-way ANOVA, the populations were classified according to a single factor; whereas in two-way
ANOVA, as the name implies, there are two factors, each with different number of levels. For example,
a baker might choose 3 different baking temperatures (150, 175, 200°C) and 2 different baking times
(45 and 60 min) to optimize a cake recipe. In this example we have two factors (baking time and
temperature) each with different number of levels (Devore et al. 2021; Moore et al. 2009).
Moore et al. (2009) lists the following advantages for using two-way ANOVA:
1. It is more efficient (i.e., less costly) to study two factors rather than each separately,
2. The variation in residuals can be decreased by the inclusion of a second factor,
3. Interactions between factors can be explored.
33 https://en.wikipedia.org/wiki/Post_hoc_analysis
34 https://en.wikipedia.org/wiki/Tukey%27s_range_test
In order to analyze a data set with two-way ANOVA the following assumptions must be satisfied (Field
2024; Moore 2012):
Let's start from assumption #5 and take a look at what it means for a design to be balanced or unbalanced. In ANOVA or design of
experiments, a balanced design has equal number of observations for all possible combinations of
factor levels. For example35, assume that the independent variables are A, B, C with 2 levels. Table
(9.1) shows a balanced design whereas Table (9.2) shows an unbalanced design of the same factors
(since the combination [1, 0, 0] is missing).
Table 9.1: Balanced design        Table 9.2: Unbalanced design
A  B  C                           A  B  C
0  0  0                           0  0  0
0  0  1                           0  1  0
0  1  0                           0  1  0
0  1  1                           0  0  1
1  0  0                           0  1  0
1  0  1                           1  0  1
1  1  0                           1  1  0
1  1  1                           1  1  1
Note that if Table (9.1) was re-designed such that each row displayed a factor level (0 or 1) and each column displayed a factor (A, B or C), then there would be no empty cells in that table. If the data includes multiple observations for each treatment, the design includes replication.
35 https://support.minitab.com/en-us/minitab/help-and-how-to/statistical-modeling/anova/supporting-topics/anova-
models/balanced-and-unbalanced-designs/
Example 9.2
A study by Moore and Eddleman36 (1991) investigated the removal of marks made by erasable pens on
cotton and cotton/polyester fabrics. The following data compare three different pens and four different
wash treatments with respect to their ability to remove marks from the fabrics. The response variable is based on the color change, and the lower the value the more marks were removed.
Table 9.3: Effect of washing treatment and different pen brands on color change

           Wash 1    Wash 2    Wash 3    Wash 4
Brand 1    0.97      0.48      0.48      0.46
Brand 2    0.77      0.14      0.22      0.25
Brand 3    0.67      0.39      0.57      0.19
Is there any difference in color change due either to different brands of pen or to the different washing
treatments? (Adapted from Devore et al. 2021)
Solution:
The data satisfies the requirements to be analyzed with two-factor ANOVA, since:
1. There are two independent factors (pen brands and washing treatment),
2. The independent variables consist of discrete levels (e.g., brand #1, #2 and #3)
3. There are no empty cells (data is balanced),
4. There are no replicates (interaction cannot be explored),
5. Observations are independent.
Once a table similar to Table (9.3) is prepared, finding the F-values for both factors is fairly
straightforward if a spreadsheet software is used.
36 Moore MA, Eddleman VL (1991). An Assessment of the Effects of Treatment, Time, and Heat on the Removal of
Erasable Pen Marks from Cotton and Cotton/Polyester Blend Fabrics. J. Test. Eval.. 19(5): 394-397
Averages of treatments (μ_treatments) = [0.803, 0.337, 0.423, 0.3] and the grand average T̄ ≈ 0.466.

SS_treatment = 3·Σ_{i=1}^{4} (μ_treatments[i] − T̄)² ≈ 0.48  and  MS_treatment = SS_treatment/df = 0.48/(4−1) ≈ 0.16

SS_brand = 4·Σ_{i=1}^{3} (μ_brands[i] − T̄)² ≈ 0.128  and  MS_brand = 0.128/(3−1) ≈ 0.064

SS_Error = Σ_i Σ_j (y_ij − T̄)² − SS_treatment − SS_brand ≈ 0.087  and  MS_Error = SS_Error/df = 0.087/((3−1)×(4−1)) ≈ 0.0145

F_treatment = MS_treatment/MS_Error = 0.16/0.0145 ≈ 11.05

F_brand = MS_brand/MS_Error = 0.064/0.0145 ≈ 4.43
Although the solution is straightforward, it is still cumbersome and error-prone; therefore, it is best to use functions dedicated to this purpose:

Script 9.4
from scisuit.stats import aov2

brand = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
treatment = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
removal = [0.97, 0.48, 0.48, 0.46, 0.77, 0.14, 0.22, 0.25, 0.67, 0.39, 0.57, 0.19]
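As a numeric cross-check of the hand computation above, the two-way sums of squares can also be obtained directly with numpy (a sketch using the Table 9.3 layout):

import numpy as np

#rows: brands 1-3, columns: washes 1-4 (Table 9.3)
y = np.array([[0.97, 0.48, 0.48, 0.46],
              [0.77, 0.14, 0.22, 0.25],
              [0.67, 0.39, 0.57, 0.19]])

a, b = y.shape            #a=3 brands, b=4 wash treatments
grand = y.mean()

ss_brand = b*np.sum((y.mean(axis=1) - grand)**2)
ss_treatment = a*np.sum((y.mean(axis=0) - grand)**2)
ss_error = np.sum((y - grand)**2) - ss_brand - ss_treatment

ms_error = ss_error/((a - 1)*(b - 1))
print("F_treatment =", (ss_treatment/(b - 1))/ms_error)  #about 11.05
print("F_brand =", (ss_brand/(a - 1))/ms_error)          #about 4.4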
Unlike Example (9.2), in which the data does not have replicates, the following example will demonstrate a data set which has replicates. It should be noted that when replicates are involved the solution becomes slightly more tedious; therefore, the following example will be solved directly using scisuit's built-in function. Interested readers can consult textbooks (Devore et al. 2021) for a detailed solution.
Example 9.3
A process engineer is testing the effect of catalyst type (A, B, C) and reaction temperature (high,
medium, low) on the yield of a chemical reaction. She designs an experiment with 3 replicates for each
combination as shown in the following data. Do both catalyst type and reaction temperature have an
effect on the reaction yield?
Catalyst = [A, A, A, A, A, A, A, A, A, B, B, B, B, B, B, B, B, B, C, C, C, C, C, C, C, C, C]
Temperature = [L, L, L, M, M, M, H, H, H, L, L, L, M, M, M, H, H, H, L, L, L, M, M, M, H, H, H]
%Yield = [85, 88, 90, 80, 82, 84, 75, 78, 77, 90, 92, 91, 85, 87, 89, 80, 83, 82, 88, 90, 91, 84, 86, 85, 79, 80, 81]
Solution:
If one wishes to use a spreadsheet for the solution, a table of averages needs to be prepared as shown
below:
Catalyst L M H
A 87.667 82 76.667
B 91 87 81.667
C 89.667 85 80
After preparing the above-shown table, a methodology similar to Example (9.2) can be followed.
Let's solve the question directly by using scisuit's built-in function:
Script 9.5
from scisuit.stats import aov2
Catalyst = ["A", "A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "B", "B", "B",
"C", "C", "C", "C", "C", "C", "C", "C", "C"]
Temperature = ["L", "L", "L", "M", "M", "M", "H", "H", "H",
"L", "L", "L", "M", "M", "M", "H", "H", "H",
"L", "L", "L", "M", "M", "M", "H", "H", "H"]
Yield = [85, 88, 90, 80, 82, 84, 75, 78, 77, 90, 92, 91,
85, 87, 89, 80, 83, 82, 88, 90, 91, 84, 86, 85, 79, 80, 81]
1. Regression: When data shows a significant degree of error or “noise” (generally originates from
experimental measurements), we want a curve that represents the general trend of the data.
2. Interpolation: When the noise in data can be ignored (generally originates from tables), we
would like a curve(s) that pass directly through each of the data points.
In terms of mathematical expressions, interpolation (Eq. 10.1) and regression (Eq. 10.2) can be shown
as follows:
Y =f ( X ) (10.1)
Y =f ( X )+ϵ (10.2)
Peck et al. (2016) used the terms deterministic and probabilistic relationships for Eq. (10.1) and Eq.
(10.2), respectively. Therefore a probabilistic relationship is actually a deterministic relationship with
noise (random deviations).
To further our understanding on Eq. (10.2), a simple example from Larsen & Marx (2011) can be
helpful: Consider a tooling process where the initial weight of the sample determines the finished
weight of the steel rods. For example, in a simple experiment if the initial weight was measured as
2.745 g then the finished weight was measured as 2.080 g. However, even if the initial weight is
controlled and is exactly 2.745 g, in reality the finished weight would fluctuate around 2.080 g; therefore, with each x (independent variable) there will be a range of possible y values (dependent variable), which is exactly what Eq. (10.2) tells us.
10.1. Simple Linear Regression
When there is only a single explanatory (independent) variable, the model is referred to as “simple”
linear regression. Therefore, Eq. (10.2) can be expressed as:
Y =β 0 + β 1 x +ϵ (10.3)
where regardless of the x value, the random variable ε is assumed to follow a N(0, σ) distribution.
μ_{Y|x*} = E(β₀ + β₁·x* + ε) = β₀ + β₁·x*    (10.4)

Var(β₀ + β₁·x* + ε) = Var(ε) = σ²_{Y|x*}    (10.5)

where the notation Y|x* should be read as "the value of Y when x = x*", so μ_{Y|x*} is the mean value of Y when x = x*. Note also that Eq. (10.4) tells us something important: the population regression line is the line of mean values of Y.
The following assumptions are made for a linear model (Larsen & Marx, 2011):
1. fY|x(y) is a normal probability density function for all x (i.e., for a known x value, there is a
probability density function associated with y values)
3. For all x-values, the distributions associated with fY|x(y) are independent.
Example 10.1
Suppose that the relationship between applied stress (x) and time to fracture (y) is given by the simple linear regression model with β0 = 65, β1 = −1.2, and σ = 8. What is the probability of getting a time to fracture greater than 50 when the applied stress is 20? (Adapted from Devore et al. 2021)
Solution:
y=65−1.2 x=65−1.2×20=41
Note that if this was a curve fitting problem in nature, then whenever the stress value was 20, the
fracture time would have always been equal to 41. However, since Eq. (10.2) tells us that random
deviations are involved, this cannot be the case. We already know that the random deviations, namely ε,
follows a normal distribution. Therefore, it becomes straightforward to compute the probability:
P( Z > (50 − 41)/8 ) = P(Z > 1.125) = 1 − pnorm(1.125) ≈ 0.13 ■
In Example (10.1), the coefficients of the regression line, namely β0 and β1, were given. However, in practice we need to estimate these coefficients. It should be noted that there are two commonly37 used methods for estimating the regression coefficients (please note that we use the word estimate): least squares and maximum likelihood.
37 https://support.minitab.com/en-us/minitab/help-and-how-to/statistical-modeling/reliability/supporting-topics/estimation-
methods/least-squares-and-maximum-likelihood-estimation-methods/
The residual sum of squares (RSS) also known as sum of squares of error (SSE):
RSS = Σ_{i=1}^{n} e_i² = e₁² + e₂² + ... + e_n²    (10.6)

If the coefficients of the best line passing through the data points are β0 and β1 then:

L = RSS = Σ_{i=1}^{n} ( y_i − β0 − β1·x_i )²    (10.7)
Taking the partial derivatives of Eq. (10.7) with respect to β0 and β1 and setting them to zero, then dropping the constants −2 and 2 from both equations and rearranging the terms, yields:

Σ y_i = n·β0 + β1·Σ x_i

Σ x_i·y_i = β0·Σ x_i + β1·Σ x_i²
We have two equations and two unknowns, therefore it is possible to solve this system of equations.
Here, one can use the elimination method; however, Cramer’s rule provides a direct solution. Let’s
solve for β1 and leave β0 as an exercise:
β̂1 = det[ n, Σy_i ; Σx_i, Σx_i·y_i ] / det[ n, Σx_i ; Σx_i, Σx_i² ]

If one takes the determinants in the numerator and denominator, then:

β̂1 = ( n·Σx_i·y_i − (Σx_i)(Σy_i) ) / ( n·Σx_i² − (Σx_i)² )    (10.8)
β1 can be further simplified if the notations Sxx, Syy and Sxy are defined as:

S_xx = Σ(x_i − x̄)² = Σx_i² − (Σx_i)²/n

S_yy = Σ(y_i − ȳ)² = Σy_i² − (Σy_i)²/n

S_xy = Σ(x_i − x̄)(y_i − ȳ) = Σx_i·y_i − (Σx_i)(Σy_i)/n

β̂1 = S_xy / S_xx    (10.9)
σ̂² = (1/n)·Σ_{i=1}^{n} (Y_i − Ŷ_i)²    (10.11)
Example 10.2
Suppose you have been tasked with finding the probability of heads (H) and tails (T) of an unknown
coin. You flipped the coin 3 times and the sequence is HTH. What is the probability, p? (Adapted
from Larsen & Marx)
Solution:
Each head occurs with probability p and each tail with probability 1 − p; assuming independent flips, the probability model that defines the sequence HTH is:

p_X(k) = p²·(1 − p)

Using calculus, d/dp [p²(1 − p)] = 2p − 3p² = p(2 − 3p) = 0, so the value that maximizes the probability model is p = 2/3. ■
Now, instead of the sequence HTH (Example 10.2) we have data pairs (x1, y1), (x2, y2), … , (xn, yn)
obtained from a random experiment. Furthermore, it is known that the yi’s are normally distributed with
mean β0+β1xi and variance σ2 (Eqs. 10.4 & 10.5).
f(x) = (1/(√(2π)·σ))·e^( −(1/2)·((x − μ)/σ)² ),  −∞ < x < ∞    (10.12)
Replacing x and μ in Eq. (10.12) with y_i and Eq. (10.4), respectively, yields the probability model for a single data pair:

f(y_i) = (1/(√(2π)·σ))·e^( −(1/2)·((y_i − β0 − β1·x_i)/σ)² )    (10.13)
For n data pairs, the maximum likelihood function is:
L = ∏_{i=1}^{n} (1/(√(2π)·σ))·e^( −(1/2)·((y_i − β0 − β1·x_i)/σ)² )    (10.14)
In order to find MLE of β0 and β1 partial derivatives with respect to β0 and β1 must be taken. However,
Eq. (10.14) is not easy to work with as is. Therefore, as suggested by Larsen and Marx (2011), taking
the logarithm will make it more convenient to work with.
−2·ln L = n·ln(2π) + n·ln(σ²) + (1/σ²)·Σ_{i=1}^{n} (y_i − β0 − β1·x_i)²    (10.15)
Taking the partial derivatives of Eq. (10.15) with respect to β0 and β1 and solving the resulting set of
equations similar to as shown in section (10.1.1) will yield Eqs. (10.9 & 10.10).
2. β̂0 and β̂1 are unbiased, therefore E(β̂0) = β0 and E(β̂1) = β1

3. Var(β̂1) = σ² / Σ_{i=1}^{n} (x_i − x̄)²

4. Var(β̂0) = σ²·Σ_{i=1}^{n} x_i² / ( n·Σ_{i=1}^{n} (x_i − x̄)² )
Proof of #2:
In section (5.1.1), it was mentioned that to be an unbiased estimator, E(Θ) = θ must be satisfied. In the
case of β^1, we need to show that E ( β^1 )=β 1. If Eq. (10.8) is divided by n, the following equation is
obtained:
β̂1 = ( Σx_i·y_i − (1/n)(Σx_i)(Σy_i) ) / ( Σx_i² − (1/n)(Σx_i)² )    (I)

β̂1 = ( Σx_i·y_i − x̄·Σy_i ) / ( Σx_i² − n·x̄² )    (II)

Rearranging the terms in the numerator:

β̂1 = Σ y_i·(x_i − x̄) / ( Σx_i² − n·x̄² )    (III)

Note that due to the assumptions of the linear model, in Eq. (III) all terms except y_i can be treated as constants. Therefore, replacing the expected value of y_i with Eq. (10.4) gives:

E(β̂1) = Σ (β0 + β1·x_i)·(x_i − x̄) / ( Σx_i² − n·x̄² )    (IV)

Expanding the terms in the numerator:

E(β̂1) = ( β0·Σ(x_i − x̄) + β1·Σ(x_i − x̄)·x_i ) / ( Σx_i² − n·x̄² )    (V)

Noting that the first term in the numerator equals 0 and the remaining term in the numerator (except β1) equals the denominator, the proof is complete.
A similar proof can be obtained for β0. For cases #3 and #4, Larsen & Marx (2011) presented a detailed
proof.
Example 10.3
It seems logical that riskier investments might offer higher returns. A study by Statman et al. (2008)38
explored this by conducting an experiment. One group of investors rated the risk (x) of a company’s
stock on a scale from 1 to 10, while a different group rated the expected return (y) on the same scale.
This was done for 210 companies, and the average risk and return scores were calculated for each. Data
for a sample of ten companies, ordered by risk level, is given below:
x = [4.3, 4.6, 5.2, 5.3, 5.5, 5.7, 6.1, 6.3, 6.8, 7.5]
y = [7.7, 5.2, 7.9, 5.8, 7.2, 7, 5.3, 6.8, 6.6, 4.7]
How is the risk of an investment related to its expected return? (Adapted from Devore et al. 2021)
Solution:
Script 10.1
import scisuit.plot as plt
x = [4.3, 4.6, 5.2, 5.3, 5.5, 5.7, 6.1, 6.3, 6.8, 7.5]
y = [7.7, 5.2, 7.9, 5.8, 7.2, 7, 5.3, 6.8, 6.6, 4.7]
plt.scatter(x=x, y=y)
plt.show()
38 Statman M, Fisher KL, Anginer D (2008). Affect in a Behavioral Asset-Pricing Model. Financial Analysts Journal,
64-2, 20-29.
It is seen that there is a weak inverse relationship
between the perceived risk of a company’s stock
and its expected return value.
Fig. (10.2) shows that there is no convincing relationship between risk and expected return of an
investment. Let’s take a look if this is numerically the case. Continuing from Script (10.1):
Script 10.2
from scisuit.stats import linregress
result = linregress(yobs=y, factor=x)
print(result)
Simple Linear Regression
F=1.85, p-value=0.211, R2=0.19
Have we carried out a reliable analysis, i.e., is there no relationship between risk and expected returns?
Devore et al. (2021) suggested that with small number of observations, it is possible not to detect a
relationship because when the sample size is small hypothesis tests do not have much power. Also note
that the original study uses 210 observations where Statman et al. (2008) concluded that risk is a useful
predictor of expected return, although the risk only accounted for 19% of expected returns. ■
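As a cross-check of Eq. (10.9), the slope reported by linregress can be reproduced by hand; the intercept below uses the standard companion formula β̂0 = ȳ − β̂1·x̄ (the Eq. (10.10) left as an exercise earlier). A short sketch with the same data:

x = [4.3, 4.6, 5.2, 5.3, 5.5, 5.7, 6.1, 6.3, 6.8, 7.5]
y = [7.7, 5.2, 7.9, 5.8, 7.2, 7, 5.3, 6.8, 6.6, 4.7]

n = len(x)
xbar, ybar = sum(x)/n, sum(y)/n
Sxx = sum((xi - xbar)**2 for xi in x)
Sxy = sum((xi - xbar)*(yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy/Sxx           #slope, Eq. (10.9)
b0 = ybar - b1*xbar    #intercept
print(b1, b0)          #negative slope, consistent with the weak inverse relationship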
Suppose the taste of a fruit juice is related to sugar content and pH. We wish to establish an empirical
model, which can be described as follows:
y=β 0 + β 1 x 1 + β 2 x 2 +ϵ (10.16)
where y is the response variable (taste) and x1 and x2 are independent variables (sugar content and pH).
Unlike simple linear regression (SLR) model, where only one independent variable exists, in multiple
linear regression (MLR) problems at least 2 independent variables are of interest to us. Therefore, in
general, the response variable may be related to k independent (regressor) variables. The model is:

y = β0 + β1·x1 + β2·x2 + ⋯ + βk·xk + ε

This model describes a hyperplane, and the regression coefficient βj represents the expected change in the response per unit change in xj when all the other variables are held constant (Montgomery 2012). If one enters the data in a spreadsheet, it would generally be in the following format:
y     x1     x2    …   xk
y1    x11    x12   …   x1k
y2    x21    x22   …   x2k
⋮     ⋮      ⋮          ⋮
yn    xn1    xn2   …   xnk

where y is the response variable and the x's are the regressor variables. It is assumed that n > k. For example, for the 1st row (i = 1) in Table (10.1), Eq. (10.18) yields y1 = β0 + β1·x11 + β2·x12 + … + βk·x1k.
To find the regression coefficients, we will use a similar approach presented in section (10.1.1), such
that the sum of the squares of errors, εi, is minimized. Therefore,
L = Σ_{i=1}^{n} ( y_i − β0 − Σ_{j=1}^{k} βj·x_ij )²    (10.19)
where the function L will be minimized with respect to β0, β1, …, βk which then will give the least
square estimators, β^1 , β^2 , .. , β^k . The derivatives with respect to β0 and βj are:
∂L/∂β0 |_{β̂0, β̂1, ..., β̂k} = −2·Σ_{i=1}^{n} ( y_i − β̂0 − Σ_{j=1}^{k} β̂j·x_ij ) = 0    (10.20-a)

∂L/∂βj |_{β̂0, β̂1, ..., β̂k} = −2·Σ_{i=1}^{n} ( y_i − β̂0 − Σ_{j=1}^{k} β̂j·x_ij )·x_ij = 0    (10.20-b)
After some algebraic manipulation, Eq. (10.20) can be written in matrix notation as follows:
[ n        Σx_i1        Σx_i2        ...   Σx_ik      ]   [ β̂0 ]   [ Σy_i       ]
[ Σx_i1    Σx_i1²       Σx_i1·x_i2   ...   Σx_i1·x_ik ] · [ β̂1 ] = [ Σx_i1·y_i  ]    (10.21)
[ ⋮        ⋮            ⋮            ...   ⋮          ]   [ ⋮   ]   [ ⋮          ]
[ Σx_ik    Σx_ik·x_i1   Σx_ik·x_i2   ...   Σx_ik²     ]   [ β̂k ]   [ Σx_ik·y_i  ]
which can be condensed to the following expression:
X⋅β= y (10.22)
Note that since X is an n by k matrix, hence not square, its inverse does not exist and Eq. (10.22) cannot be solved by direct inversion. The least-squares approach to solving Eq. (10.22) is to multiply both sides by the transpose of X:
X T X⋅β= X T⋅y (10.23)
Example 10.4
A process engineer, tasked with improving the viscosity of a polymer, chose two process variables among the several candidate factors: reaction temperature and feed rate. She ran 16 experiments and collected the following data:
Temperature = [80, 93, 100, 82, 90, 99, 81, 96, 94, 93, 97, 95, 100, 85, 86, 87]
Feed Rate = [8, 9, 10, 12, 11, 8, 8, 10, 12, 11, 13, 11, 8, 12, 9, 12]
Viscosity = [2256, 2340, 2426, 2293, 2330, 2368, 2250, 2409, 2364, 2379, 2440, 2364, 2404, 2317, 2309, 2328]
Explain the effect of feed rate and temperature on polymer viscosity. (Adapted from Montgomery 2012).
Solution:
The solution involves several computations which can be performed using a spreadsheet or Python with the numpy library. A step-by-step solution for the coefficients can be found in the textbook by Montgomery (2012). We will skip these steps and solve the problem directly using scisuit's built-in linregress function.
Script 10.3
from scisuit.stats import linregress
#input values
temperature = [80, 93, 100, 82, 90, 99, 81, 96, 94, 93, 97, 95, 100, 85, 86, 87]
feedrate = [8, 9, 10, 12, 11, 8, 8, 10, 12, 11, 13, 11, 8, 12, 9, 12]
viscosity = [2256, 2340, 2426, 2293, 2330, 2368, 2250, 2409, 2364, 2379, 2440, 2364, 2404, 2317, 2309, 2328]
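As a cross-check of Eq. (10.23), the least-squares coefficients can also be computed directly with numpy (a sketch continuing from the lists defined in Script 10.3; np.linalg.solve solves the normal equations XᵀX·β = Xᵀ·y):

import numpy as np

#design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(len(viscosity)), temperature, feedrate])
y = np.array(viscosity)

beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  #[intercept, temperature coefficient, feed-rate coefficient]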
Based on Eq. (10.24), the p-value tells us that at least one of the two variables (temperature and feed
rate) has a nonzero regression coefficient. Furthermore, analysis on individual regression coefficients
show that both temperature and feed rate have an effect on polymer’s viscosity.
According to Larsen & Marx (2011), applied statisticians find residual plots to be very helpful in assessing the appropriateness of the fit. Continuing from Script (10.3), let's plot the residuals:
Script 10.4
import scisuit.plot as plt
import scisuit.plot.gdi as gdi
#x=Fits, y=Residuals
plt.scatter(x=result.Fits, y= result.Residuals)
plt.show()
It is seen that the magnitudes of the residuals
are comparable and they are randomly
distributed. Therefore, the applied regression
can be considered as appropriate.
Script 11.1
import scisuit.plot as plt
from scisuit.stats import rnorm, rexp
n=1000
dt_norm, dt_exp = rnorm(n), rexp(n)
Let’s see how we could visualize the data from Script (11.1) by histogram:
Script 11.2
plt.layout(1,2)
plt.subplot(0,0)
plt.hist(dt_norm, density=True)
plt.title("Normal")
plt.subplot(0, 1)
plt.hist(dt_exp, density=True)
plt.title("Exponential")
plt.show(antialiasing=True)
Fig 11.1: Histogram of normal and exponential distributions
It is clearly seen from Fig. (11.1) that the normal distribution has a nearly bell-shaped histogram whereas the exponential distribution is highly skewed to the right.
The box-whisker plot is also known as the five-number summary since it displays the 1st quartile, median, 3rd quartile, min and max values. Continuing from Script (11.1):
Script 11.3
plt.boxplot(dt_norm)
plt.title("Normal")
plt.boxplot(dt_exp)
plt.title("Exponential")
plt.show(antialiasing=True)
The following observations can be made.
Normal distribution: the box is roughly symmetric about the median and the whiskers have similar lengths.
Exponential distribution: the median sits low in the box and the upper whisker is much longer, with several high outliers, reflecting the strong right skew.
A Q–Q plot (quantile–quantile plot) is a graphical method for comparing two probability distributions by plotting their quantiles against each other39. A normal Q–Q plot is obtained by plotting the quantiles of the data against the quantiles of the normal distribution; if the data come from a normal distribution, the points align on a straight line. To visualize the Q–Q plot, we will slightly modify Script (11.2) and replace hist with the qqnorm function. Continuing from Script (11.1) and applying the following changes, we should obtain Fig. (11.3):
Script 11.4
plt.qqnorm(dt_norm)
plt.qqnorm(dt_exp)
39 https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot
Fig 11.3: QQ plot of normal and exponential distributions (n=250)
It is seen from Fig. (11.3) that the data coming from a normal distribution aligns well with the straight
line whereas the data from exponential distribution shows apparent deviations.
11.2. Analytical Test Procedures
The Kolmogorov–Smirnov (KS) test statistic is the largest vertical distance between the two cumulative distribution functions, D = max_x |F_n(x) − F(x)|, where F(X) is the theoretical cumulative distribution function (must be a continuous distribution and must be fully specified) of the normal distribution and F_n(X) is the empirical CDF of the data.
Example 11.1
Does the following data come from a normal distribution:
[2.39798, -0.16255, 0.54605, 0.68578, -0.78007, 1.34234, 1.53208, -0.86899, -0.50855, -0.58256, -0.54597, 0.08503,
0.38337, 0.26072, 0.34729]
Solution:
We will write a script to run the KS test and at the same time visualize the CDF and ECDF. The
following script performs both tasks:
Script 11.5
import numpy as np
import scisuit.plot as plt
from scisuit.stats import ks_1samp, pnorm
data = [2.39798, -0.16255, 0.54605, 0.68578, -0.78007, 1.34234, 1.53208, -0.86899, -0.50855,
        -0.58256, -0.54597, 0.08503, 0.38337, 0.26072, 0.34729]

mu, sd = np.mean(data), np.std(data)

"""Analytic test"""
result = ks_1samp(x=data, cdf=pnorm, args=(mu, sd))
print(result)

"""Visualize the empirical CDF and the fitted normal CDF (pnorm evaluated pointwise)"""
xs = np.sort(data)
ecdf = np.arange(1, len(xs)+1)/len(xs)
plt.scatter(x=xs, y=ecdf)                            #empirical CDF
plt.scatter(x=xs, y=[pnorm(v, mu, sd) for v in xs])  #theoretical CDF
plt.legend(nrows=2)
plt.show()

40 Kolmogorov A (1933). "Sulla determinazione empirica di una legge di distribuzione." G. Ist. Ital. Attuari, 4, 83–91
41 Smirnov N (1948). "Table for estimating the goodness of fit of empirical distributions." Annals of Mathematical Statistics, 19(2): 279–281.
42 https://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm
Kolmogorov-Smirnov test
p-value: 0.885
Test statistic: 0.1414 and its sign 1
Max distance at: -0.50855
W = ( Σ a_i·y_(i) )² / Σ ( y_i − ȳ )²    (11.2)
Example 11.2
Compute the test statistic (W) for the following data (from Shapiro and Wilk, 1965):
Solution:
1) The coefficients, ai, in Eq. (11.2) is given by Shapiro and Wilk (1965) as:
a = [0.6233, 0.3031, 0.1401, 0.0, 0.1401, 0.3031, 0.6233]
3) In Shapiro and Wilk (1965) paper, the numerator in Eq. (11.2) is computed as follows (note that the
indices start from 1):
• If the number of samples (n) is even, then n = 2k (if n is odd, then n = 2k + 1), and the numerator of Eq. (11.2) is:

b = Σ_{i=1}^{k} a_{n−i+1}·( y_{(n−i+1)} − y_{(i)} )

Since n = 7 for the example data, k = 3 and b is computed as follows:

b = Σ_{i=1}^{3} a_{7−i+1}·( y_{(7−i+1)} − y_{(i)} )
Let’s automate these 3 steps using a Python script:
Script 11.6
import numpy as np
from scisuit.stats import shapiro

arr = [...]  #the ordered sample from Shapiro and Wilk (1965) used in Example 11.2
a = [0.6233, 0.3031, 0.1401, 0.0, 0.1401, 0.3031, 0.6233]

sorted_x = np.sort(arr)
n = len(arr)
k = n // 2

b = 0
for i in range(int(k)):
    b += a[n-i-1]*(sorted_x[n-i-1] - sorted_x[i])

W_test_stat = b**2/(np.var(arr)*n)
print(f"Test statistic: {W_test_stat}")

result = shapiro(arr)
print(result)
Test statistic: 0.95308
Shapiro-Wilk Test
p-value: 0.7612
Test statistic: 0.9535
One way to visualize the Shapiro-Wilk test is through a QQ plot. If the points lie on a straight line, it
indicates that the data is approximately normally distributed, matching the expectation used in the
Shapiro-Wilk test. Therefore,
• When 𝑊 is close to 1.0, the sample data aligns closely with the expected normal distribution,
indicating that the data is likely normal.
• When 𝑊 deviates significantly from 1.0, it suggests that the data does not follow a normal
distribution.
It is seen that the data aligns itself well with the straight line
which is why the test statistic is close to 1.0.
A² = −n − S    (11.3)

where

S = Σ_{i=1}^{n} ((2i − 1)/n)·[ ln F(Y_i) + ln( 1 − F(Y_{n+1−i}) ) ]    (11.4)

where F is the CDF of the specified distribution and the Y_i are the ordered data.
43 https://www.itl.nist.gov/div898/handbook/eda/section3//eda35e.htm
Example 11.3
Test whether the above-given normality tests can distinguish data generated from a t-distribution from a normal distribution (see Fig. 4.11).
Solution:
Let’s first remind ourselves briefly similarities and differences between t-distribution and standard
normal distribution:
3) The curves of t- distribution with larger df are taller and have thinner tails.
Script 11.8
from scisuit.stats import rt, anderson, ks_1samp, shapiro
n=50
Kolmogorov-Smirnov test
p-value: 0.645
Test statistic: 0.1014 and its sign 1
Max distance at: 0.1982
Shapiro-Wilk Test
p-value: 0.0
Test statistic: 0.8508
It is seen that, unlike the KS test, both the Shapiro-Wilk and Anderson-Darling tests can detect the difference between a t-distribution (with small degrees of freedom) and the standard normal distribution. However, when df = 10, all of the above-mentioned tests yielded a p-value greater than 0.05.
11.2.4. Summary
The above-mentioned tests are sensitive to sample size such that if n<20 it can be difficult to detect
deviations from normality whereas if n>5000 even minor departures from normality may be flagged as
statistically significant (Anon. 2024)44. As a rule of thumb, sample sizes between 30-300 observations
are recommended for reliable normality assessment.
44 https://www.6sigma.us/six-sigma-in-focus/normality-test-lean-six-sigma/
References
Box GEP., Hunter WG, Hunter JS (2005). Statistics for Experimenters: Design, Innovation, and
Discovery, 2nd Ed., Wiley.
Bury K (1999). Statistical Distributions in Engineering, Cambridge University Press.
Carlton MA, Devore JL (2014). Probability with Applications in Engineering, Science and
Technology. Springer USA.
Chapra SC, Canale RP (2013). Numerical methods for engineers, seventh edition. McGraw Hill
Education.
Das KR, Rahmatullah Imon AHM (2016). A Brief Review of Tests for Normality. American Journal
of Theoretical and Applied Statistics. 5(1), 5-12.
Devore JL, Berk KN, Carlton MA (2021). Modern Mathematical Statistics with Applications. 3rd Ed.,
Springer.
Forbes C, Evans M, Hastings N, Peacock B (2011). Statistical Distributions, 4th Ed., Wiley.
Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer.
Hogg RV, McKean JW, Craig AT (2019). Introduction to mathematical statistics, 8th Ed., Pearson.
Kanji GK (2006). 100 Statistical Tests, 3rd Ed., Sage Publications.
Kreyszig E, Kreyszig H, Norminton EJ (2011). Advanced Engineering Mathematics, 10th Ed., John
Wiley & Sons Inc.
Larsen RJ, Marx ML (2011). An Introduction to Mathematical Statistics and Its Applications. 5th Ed.,
Prentice Hall.
Liben-Nowell D. (2022). Connecting Discrete Mathematics and Computer Science (2nd Ed.).
Cambridge: Cambridge University Press.
Miller I, Miller M (2014). John E. Freund's Mathematical Statistics with Applications. 8th Ed., Pearson
New International Edition.
Montgomery DC (2012). Design and analysis of experiments, 8th Ed., John Wiley & Sons, Inc.
Montgomery DC, Peck EA, Vining GG (2021). Introduction to Linear Regression Analysis, 6th Ed.,
Wiley.
Moore DS, McCabe GP, Craig BA (2009). Introduction to the Practice of Statistics. 6th Ed., W. H.
Freeman and Company, New York.
Peck R, Olsen C, Devore JL (2016). Introduction to Statistics and Data Analysis. 5th Ed., Cengage
Learning.
Pinheiro, CAR, Patetta M (2021). Introduction to Statistical and Machine Learning Methods for Data
Science. Cary, NC: SAS Institute Inc.
Rinne H (2009). The Weibull Distribution A Handbook. CRC Press.
Shapiro SS, Wilk MB (1965). An Analysis of Variance Test for Normality (Complete Samples).
Biometrika, 52(3/4), 591-611.
Stahl S (2006). The Evolution of the Normal Distribution. Mathematics Magazine, 76(2), pp. 96-113.
Available at: https://www.maa.org/sites/default/files/pdf/upload_library/22/Allendoerfer/stahl96.pdf
Student (1908). The probable error of a mean. Biometrika, 6(1), 1-25.
Utts JM, Heckard RF (2007). Mind on Statistics, 3rd Ed., Thomson/Brooks Cole.
Wackerly DD, Mendenhall W, Scheaffer RL (2008). Mathematical Statistics with Applications, 7th
Ed., Thomson/Brooks Cole.
Walck C (2007). Handbook on Statistical Distributions for Experimentalists. Available at:
https://s3.cern.ch/inspire-prod-files-1/1ab434101d8a444500856db124098f9c
Acronyms