Statistics and Risk Modelling Using Python
Eric Marsden
<[email protected]>
Where does this fit into risk engineering?
[Diagram: data → curve fitting → probabilistic model → consequence model → risks; risks are combined with costs and criteria to support decision-making. These slides cover the data → curve fitting → probabilistic model part of the chain.]
Angle of attack: computational approach to statistics
▷ Free software
→ colab.research.google.com
Jupyter: interact with Python in a web browser
In [1]: import numpy
In [2]: 2 + 2
Out[2]: 4
In [3]: numpy.sqrt(2 + 2)
Out[3]: 2.0
In [4]: numpy.pi
Out[4]: 3.141592653589793
In [5]: numpy.sin(numpy.pi)
Out[5]: 1.2246467991473532e-16
In [6]: numpy.random.uniform(20, 30)
Out[6]: 28.890905809912784

Download this content as a Python notebook at risk-engineering.org
Python as a statistical calculator
In [3]: obs = numpy.random.uniform(20, 30, 10)
In [4]: obs
Out[4]:
array([25.64917726, 21.35270677, 21.71122725, 27.94435625, 25.43993038,
       22.72479854, 22.35164765, 20.23228629, 26.05497056, 22.01504739])
In [5]: len(obs)
Out[5]: 10
In [7]: obs - 25
Out[7]:
array([ 0.64917726, -3.64729323, -3.28877275,  2.94435625,  0.43993038,
       -2.27520146, -2.64835235, -4.76771371,  1.05497056, -2.98495261])
In [8]: obs.mean()
Out[8]: 23.547614834213316
In [9]: obs.sum()
Out[9]: 235.47614834213317
In [10]: obs.min()
Out[10]: 20.232286285845483
Python as a statistical calculator: plotting
In [2]: import numpy, matplotlib.pyplot as plt
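The plotting code itself did not survive extraction; here is a minimal sketch of what such a plot might look like (the uniform sample and the bin count are illustrative assumptions, reusing the obs array from the previous slide):

import numpy
import matplotlib.pyplot as plt

obs = numpy.random.uniform(20, 30, 10)   # ten uniform observations between 20 and 30
plt.hist(obs, bins=5)                    # histogram of the observations
plt.xlabel("Observed value")
plt.show()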
Discrete
▷ A discrete variable takes separate, countable values
▷ Examples:
• outcomes of a coin toss: {head, tail}
• number of students in the class
• questionnaire responses {very unsatisfied, unsatisfied, satisfied, very satisfied}

Continuous
▷ A continuous variable is the result of a measurement (a floating point number)
▷ Examples:
• height of a person
• flow rate in a pipeline
• volume of oil in a drum
• time taken to cycle from home to university
Random variables
Examples:
▷ sum of the values on two dice throws (a discrete random variable)
Probability Mass Functions
▷ Example: 𝑋 = “number of heads when tossing a coin twice”
• 𝑝𝑋(0) ≝ Pr(𝑋 = 0) = 1/4
• 𝑝𝑋(1) ≝ Pr(𝑋 = 1) = 2/4
• 𝑝𝑋(2) ≝ Pr(𝑋 = 2) = 1/4
[Figure: PMF bar plot over the values 0, 1, 2]
Probability Mass Functions: two coins
▷ Task: simulate “expected number of heads when tossing a coin twice”
import numpy
import matplotlib.pyplot as plt

N = 1000
heads = numpy.zeros(N, dtype=int)
for i in range(N):
    # second argument to randint is the exclusive upper bound
    heads[i] = numpy.random.randint(0, 2, 2).sum()
plt.stem(numpy.bincount(heads))

heads[i]: element number i of the array heads
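As a quick check (not in the original slide), the sample mean of the simulated counts should be close to 1, the expected number of heads computed later in these slides:

heads.mean()   # ≈ 1.0 for a fair coin tossed twice
plt.show()     # needed to display the stem plot outside Jupyter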
More information on probability and statistics
→ https://oli.cmu.edu/courses/probability-statistics-open-free/
Free self-paced course whose only prerequisite is basic algebra
Probability Mass Functions: properties
▷ 𝑝𝑋(𝑥) ≥ 0
▷ ∑𝑥 𝑝𝑋(𝑥) = 1
▷ Pr(𝑎 ≤ 𝑋 ≤ 𝑏) = ∑(𝑥 ∈ [𝑎,𝑏]) 𝑝𝑋(𝑥)
Probability Density Functions
▷ It is non-negative: 𝑓𝑋(𝑥) ≥ 0
▷ It integrates to 1: ∫ 𝑓𝑋(𝑥) d𝑥 = 1
▷ In reliability engineering, we are often interested in the random variable 𝑇 representing the time at which a component fails.
▷ The PDF 𝑓(𝑡) is the “failure density function”. It tells you how likely failure is in a small interval around time 𝑡:

lim(Δ𝑡→0) Pr(𝑡 < 𝑇 < 𝑡 + Δ𝑡)/Δ𝑡 = lim(Δ𝑡→0) (1/Δ𝑡) ∫[𝑡, 𝑡+Δ𝑡] 𝑓(𝑢) d𝑢 = 𝑓(𝑡)

[Figure: PDF of a failure time distribution]
Expectation of a random variable
▷ Interpretation:
• the center of gravity of the pmf or pdf
• the average in a large number of independent realizations of your experiment
Important concept: independence
▷ pmf for the two-coin example (the tree diagram shows the four equally likely outcomes of two independent tosses):
• 𝑝𝑋(0) ≝ Pr(𝑋 = 0) = 1/4
• 𝑝𝑋(1) ≝ Pr(𝑋 = 1) = 2/4
• 𝑝𝑋(2) ≝ Pr(𝑋 = 2) = 1/4
▷ 𝔼[𝑋] ≝ ∑𝑘 𝑘 × 𝑝𝑋(𝑘) = 0 × 1/4 + 1 × 2/4 + 2 × 1/4 = 1
Illustration: expected value of a dice roll
▷ Expected value of a dice roll is ∑(𝑖=1 to 6) 𝑖 × 1/6 = 3.5
(These numbers will be different for different executions. The greater the number of
random “dice throws” we simulate, the greater the probability that the mean will be
close to 3.5.)
Illustration: expected value of a dice roll

N = 1000
roll = numpy.zeros(N)
expectation = numpy.zeros(N)
for i in range(N):
    roll[i] = numpy.random.randint(1, 7)
    # running estimate of the expected value after i+1 rolls
    expectation[i] = roll[:i+1].mean()
plt.plot(expectation)

[Figure: the running estimate of the expected value converges towards 3.5]
Properties of expectation
▷ 𝔼[𝑐𝑋] = 𝑐 𝔼[𝑋]
▷ 𝔼[𝑐 + 𝑋] = 𝑐 + 𝔼[𝑋]
▷ Some random variables have no expectation: if both the positive part and the negative part of 𝑋 have infinite expectation, then 𝔼[𝑋] would have to be ∞ − ∞ (meaningless)
Variance of a random variable
▷ Var(𝑋) ≝ 𝔼[(𝑋 − 𝔼[𝑋])²]
▷ In Python:
• obs.var() if obs is a NumPy vector
• numpy.var(obs) for any Python sequence (vector or list)
▷ In Excel: the function VAR
Variance with coins
▷ pmf:
• 𝑝𝑋 (0) ≝ Pr(𝑋 = 0) = 1/4
• 𝑝𝑋 (1) ≝ Pr(𝑋 = 1) = 2/4
• 𝑝𝑋 (2) ≝ Pr(𝑋 = 2) = 1/4
Var(𝑋) ≝ ∑(𝑖=1 to 𝑁) (𝑥𝑖 − 𝜇𝑋)² Pr(𝑋 = 𝑥𝑖)
       = 1/4 × (0 − 1)² + 2/4 × (1 − 1)² + 1/4 × (2 − 1)² = 1/2
Variance of a dice roll
Var(𝑋) ≝ ∑(𝑖=1 to 𝑁) (𝑥𝑖 − 𝜇𝑋)² Pr(𝑋 = 𝑥𝑖)
       = 1/6 × ((1 − 3.5)² + (2 − 3.5)² + (3 − 3.5)² + (4 − 3.5)² + (5 − 3.5)² + (6 − 3.5)²)
       ≈ 2.917
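A small simulation check with numpy (a sketch, not part of the original slide):

import numpy

rolls = numpy.random.randint(1, 7, 100_000)   # 100 000 simulated dice rolls
rolls.var()                                   # ≈ 2.917, close to the theoretical value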
Properties of variance
▷ Var(𝑐) = 0
▷ Var(𝑐 + 𝑋) = Var(𝑋)
Beware:
▷ 𝔼[𝑋²] ≠ (𝔼[𝑋])²
Note:
▷ Cov(𝑋, 𝑌) ≝ 𝔼[(𝑋 − 𝔼[𝑋])(𝑌 − 𝔼[𝑌])]
▷ Cov(𝑋, 𝑋) = Var(𝑋)
Standard deviation
▷ Formula for variance: Var(𝑋) ≝ ∑(𝑖=1 to 𝑁) (𝑥𝑖 − 𝜇𝑋)² Pr(𝑋 = 𝑥𝑖)
▷ Standard deviation: 𝜎𝑋 ≝ √Var(𝑋)
▷ In Python:
• obs.std() if obs is a NumPy vector
• numpy.std(obs) for any Python sequence (vector or list)
▷ In Excel: the function STDEV
Properties of standard deviation
▷ Suppose 𝑌 = 𝑎𝑋 + 𝑏, where
• 𝑎 and 𝑏 are scalars
• 𝑋 and 𝑌 are two random variables
▷ Then Var(𝑌) = 𝑎² Var(𝑋), so 𝜎𝑌 = |𝑎| 𝜎𝑋
▷ In particular, Var(𝑐𝑋) = 𝑐² Var(𝑋)
Cumulative distribution function
▷ Definition: 𝐹𝑋(𝑥) ≝ Pr(𝑋 ≤ 𝑥)
▷ Properties of a cdf:
• 𝐹𝑋 (𝑥) is a non-decreasing function of 𝑥
• 0 ≤ 𝐹𝑋 (𝑥) ≤ 1
• lim𝑥→∞ 𝐹𝑋 (𝑥) = 1
• lim𝑥→−∞ 𝐹𝑋 (𝑥) = 0
• if 𝑥 ≤ 𝑦 then 𝐹𝑋 (𝑥) ≤ 𝐹𝑋 (𝑦)
• Pr(𝑎 < 𝑋 ≤ 𝑏) = 𝐹𝑋 (𝑏) − 𝐹𝑋 (𝑎) ∀𝑏 > 𝑎
CDF of a discrete distribution
▷ 𝐹𝑋(𝑥) = Pr(𝑋 ≤ 𝑥) = ∑(𝑥𝑖 ≤ 𝑥) Pr(𝑋 = 𝑥𝑖)
▷ The CDF is built by accumulating probability as 𝑥 increases.
[Figures: PMF and corresponding staircase CDF for the two-coin example (values 0, 1, 2), and for a discrete distribution on 0…12]
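For the two-coin example above, the discrete CDF can be built with a cumulative sum (a minimal sketch, not in the original slide):

import numpy

pmf = numpy.array([0.25, 0.5, 0.25])   # p_X(0), p_X(1), p_X(2)
cdf = numpy.cumsum(pmf)                # array([0.25, 0.75, 1.0])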
CDF of a continuous distribution
▷ In reliability engineering, we are often interested in the random variable 𝑇 representing the time to failure of a component.
▷ The cumulative distribution function tells you the probability that the lifetime is ≤ 𝑡:

𝐹(𝑡) = Pr(𝑇 ≤ 𝑡)

[Figure: F(t) = Pr(T ≤ t) plotted against the time to failure t]
Exercise
Problem
Field data tells us that the time to failure of a pump, 𝑋, is normally distributed.
The mean and standard deviation of the time to failure are estimated from
historical data as μ = 3200 hours and σ = 600 hours.
What is the probability that a pump will fail before 2000 hours of operation?
Solution
We are interested in calculating Pr(𝑋 ≤ 2000) and we know that 𝑋 follows a norm(3200, 600) distribution. We can use the CDF to calculate Pr(𝑋 ≤ 2000).
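In Python, this could look as follows (a sketch; scipy.stats.norm takes the mean and the standard deviation as arguments):

import scipy.stats

scipy.stats.norm(3200, 600).cdf(2000)   # ≈ 0.0228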
Exercise
Problem
Field data tells us that the time to failure of a pump, 𝑋, is normally distributed.
The mean and standard deviation of the time to failure are estimated from
historical data as μ = 3200 hours and σ = 600 hours.
What is the probability that a pump will fail after it has worked for at least
2000 hours?
Solution
We are interested in calculating Pr(𝑋 > 2000) and we know that 𝑋 follows a norm(3200, 600) distribution. We know that Pr(𝑋 > 2000) = 1 − Pr(𝑋 ≤ 2000).
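In Python (a sketch along the same lines as the previous exercise):

import scipy.stats

1 - scipy.stats.norm(3200, 600).cdf(2000)   # ≈ 0.977
# equivalently, use the survival function
scipy.stats.norm(3200, 600).sf(2000)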
Quantile measures

import scipy.stats
# the 5% quantile of the standard normal distribution: 5% of observations fall below this value
scipy.stats.norm(0, 1).ppf(0.05)
-1.6448536269514729
The geometric distribution
▷ What’s the probability that it takes 𝑘 trials to get a success?
• Before we can succeed at trial 𝑘, we must have had 𝑘 − 1 failures, with probability (1 − 𝑝)^(𝑘−1)
• Then a single success, with probability 𝑝
▷ Pr(𝑋 = 𝑘) = (1 − 𝑝)^(𝑘−1) 𝑝
[Figure: geometric distribution PMF]
The geometric distribution: intuition
▷ Suppose I am at a party and I start asking girls to dance. Let 𝑋 be the number
of girls that I need to ask in order to find a partner.
• If the first girl accepts, then 𝑋 = 1
• If the first girl declines but the next girl accepts, then 𝑋 = 2
▷ 𝑋 = 𝑘 means that I failed on the first 𝑘 − 1 tries and succeeded on the 𝑘th try
• My probability of failing on the first try is (1 − 𝑝)
• My probability of failing on the first two tries is (1 − 𝑝)(1 − 𝑝)
• My probability of failing on the first 𝑘 − 1 tries is (1 − 𝑝)^(𝑘−1)
• Then, my probability of succeeding on the 𝑘th try is 𝑝
▷ Properties:
• 𝔼[𝑋] = 1/𝑝
• Var(𝑋) = (1 − 𝑝)/𝑝²
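A quick check with scipy.stats (a sketch, not in the original slide; the value p = 0.3 is illustrative):

import scipy.stats

p = 0.3
X = scipy.stats.geom(p)      # geometric distribution, support k = 1, 2, ...
X.mean()                      # 1/p ≈ 3.33
X.var()                       # (1 - p)/p² ≈ 7.78
X.rvs(10_000).mean()          # sample mean of 10 000 simulated values, close to 1/p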
The binomial distribution (counting successes)
▷ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑝, 𝑘, 𝑛): probability of observing 𝑘 successes in 𝑛 trials, where each trial succeeds with probability 𝑝
[Figure: Binomial distribution PMF, n=20, p=0.3]
▷ Example (9 wells, each failing with probability 0.9): the probability of all 9 wells failing is 0.9⁹ = 0.3874 (and also wells.pmf(0)).
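In Python, assuming the wells example uses a binomial distribution with n = 9 wells and success probability p = 0.1 per well (a sketch, since the original definition of wells did not survive extraction):

import scipy.stats

wells = scipy.stats.binom(n=9, p=0.1)
wells.pmf(0)    # probability of zero successes (all 9 wells fail), ≈ 0.3874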
The normal (Gaussian) distribution
▷ The famous “bell shaped” curve, fully described by its mean and standard deviation
▷ Python: scipy.stats.norm(μ, σ)
▷ Excel: NORMINV(RAND(), μ, σ)
Scipy.stats examples
In [2]: from scipy.stats import norm
In [3]: norm.ppf(0.5)   # quantile function, the inverse of the cdf
Out[3]: 0.0
In [4]: norm.cdf(0)
Out[4]: 0.5
[Figure: PDF of the standard normal distribution]
In prehistoric times, statistics textbooks contained large tables of quantile values for the normal distribution. With cheap computing power, no longer necessary!
The “68–95–99.7 rule”
▷ The 68–95–99.7 rule (aka the three-sigma rule)
states that if 𝑥 is an observation from a normally
distributed random variable with mean μ and
standard deviation σ, then
• Pr(𝜇 − 𝜎 ≤ 𝑥 ≤ 𝜇 + 𝜎) ≈ 0.6827
• Pr(𝜇 − 2𝜎 ≤ 𝑥 ≤ 𝜇 + 2𝜎) ≈ 0.9545
• Pr(𝜇 − 3𝜎 ≤ 𝑥 ≤ 𝜇 + 3𝜎) ≈ 0.9973
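These probabilities are easy to check with scipy (a small sketch, not in the original slide):

import scipy.stats

norm = scipy.stats.norm(0, 1)
norm.cdf(1) - norm.cdf(-1)   # ≈ 0.6827
norm.cdf(2) - norm.cdf(-2)   # ≈ 0.9545
norm.cdf(3) - norm.cdf(-3)   # ≈ 0.9973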
▷ Part of the reason for the ubiquity of the normal distribution in science
→ randomservices.org/random/apps/GaltonBoardGame.html
[Figure: histogram of a simulation with N = 10_000 samples, close to the normal bell curve]
Exponential distribution
▷ PDF: 𝑓(𝑥) = 𝜆 𝑒^(−𝜆𝑥) for 𝑥 ≥ 0
▷ Property: expected value of an exponential random variable is 1/𝜆
[Figure: Exponential distribution PDF for λ = 0.5, λ = 1.0 and λ = 10.0]
Exponential distribution
• An observed failure is the result of some suddenly appearing fault, rather than gradual deterioration
Failure of power transistors (1/2)
𝐹(𝑡) = Pr(𝑇 ≤ 𝑡) = 1 − 𝑒^(−0.000575𝑡)
[Figure: CDF of the exponential failure distribution]
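A sketch of how this CDF can be evaluated with scipy, assuming the failure rate λ = 0.000575 per hour read off the equation above (the 1000-hour query is illustrative):

import scipy.stats

# scipy parameterizes the exponential distribution by scale = 1/λ
T = scipy.stats.expon(scale=1/0.000575)
T.cdf(1000)   # probability of failure within the first 1000 hours, ≈ 0.437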
The Poisson distribution
▷ Models, among other things:
• the number of events of a Poisson process over a time interval
▷ Probability mass function: Pr(𝑍 = 𝑘) = 𝜆^𝑘 𝑒^(−𝜆) / 𝑘!,  𝑘 = 0, 1, 2…
▷ The parameter λ is called the intensity of the Poisson distribution
[Figures: Poisson distribution PMF for a small λ and for λ = 12]
Poisson distribution and Prussian horses
▷ Number of fatalities for the Prussian cavalry resulting from being kicked by a horse was recorded over a period of 20 years
• for 10 army corps, so the total number of observations is 200

Deaths   Occurrences
0        109
1        65
2        22
3        3
4        1
>4       0

[Figure: “Prussian army deaths from horse kicks”, bar chart comparing the observed counts with a Poisson fit]
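A sketch of the Poisson fit shown in the figure (assuming the table above, with 109 corps-years recording zero deaths):

import numpy, scipy.stats

deaths = numpy.array([0, 1, 2, 3, 4])
observed = numpy.array([109, 65, 22, 3, 1])
lam = (deaths * observed).sum() / observed.sum()            # estimate of λ: 0.61
expected = scipy.stats.poisson(lam).pmf(deaths) * observed.sum()
# expected ≈ [108.7, 66.3, 20.2, 4.1, 0.6], close to the observed counts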
Notation: ∼ means “follows in distribution”
Simulating earthquake occurrences (1/2)
▷ Rate: λ = 1/60.8 per hour
▷ Probability of an earthquake in the next day (24 hours)?
• scipy.stats.expon(scale=60.8).cdf(24) = 0.326
• right: plot of the cdf of the corresponding exponential distribution
[Figure: CDF of the exponential distribution, probability of an earthquake against elapsed time (hours)]
The Weibull distribution
▷ Python: scipy.stats.dweibull(k, 𝜇, 𝜆)
[Figure: Weibull distribution PDF for k=0.5, λ=1; k=1.0, λ=1; k=2.0, λ=1; k=2.0, λ=2]
Student’s t distribution
▷ Python: scipy.stats.t(df)
[Figure: Student’s t distribution PDF]
The source of this useful flowchart is uncertain; it is possibly due to Aswath Damodaran, NYU.
@LearnRiskEng
fb.me/RiskEngineering
Was some of the content unclear? Which parts were most useful to
you? Your comments to [email protected]
(email) or @LearnRiskEng (Twitter) will help us to improve these
materials. Thanks!