Probability & Stats Notes
Expected value
• The expected value of a r.v. X, denoted as E(X) or µX , is intuitively the long-run average value of
repetitions of the experiment it represents.
[Figure: Probability mass function (left panel) and cumulative distribution function (right panel) of a discrete r.v. with support {1, 2, 3, 4}.]
– If X is a discrete r.v. taking N values with probability p(x), then
  E(X) = \sum_{i=1}^{N} x_i \, p(x_i)
– If X is a continuous r.v. with p.d.f. f(x) and support on (−∞, +∞), then
  E(X) = \int_{-\infty}^{+\infty} x f(x) \, dx
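As a quick numerical illustration (an added sketch, not part of the original notes), the integral can be approximated in R for an assumed density, here the standard normal, whose true mean is 0:

    # Hedged sketch: approximate E(X) = integral of x f(x) dx for an assumed p.d.f. (standard normal).
    f <- dnorm                                                  # assumed p.d.f. f(x)
    EX <- integrate(function(x) x * f(x), lower = -Inf, upper = Inf)$value
    EX                                                          # approximately 0, the true mean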
Variance
• The variance of a r.v. X, denoted as Var(X) or σ²_X, measures the dispersion of the probability distribution around the mean, µ_X = E(X).
• The variance is defined as the expected value of the squared deviation from the mean, i.e.
  Var(X) = E[(X - \mu_X)^2]
– If X is a discrete r.v. taking N values with probability p(x), then
  Var(X) = \sum_{i=1}^{N} (x_i - \mu_X)^2 \, p(x_i)
– If X is a continuous r.v. with p.d.f. f(x) and support on (−∞, +∞), then
  Var(X) = \int_{-\infty}^{+\infty} (x - \mu_X)^2 f(x) \, dx
[Figure: Probability density function (left panel) and cumulative distribution function (right panel) of a continuous r.v.]
• A common measure of dispersion is the standard deviation, denoted as σ_X = √Var(X).
• Some useful properties of the variance are the following:
– Var(X) ≥ 0.
– Var(aX) = a²Var(X), where a is a constant.
– Var(aX + bY) = a²Var(X) + b²Var(Y) + 2abCov(X, Y), where Cov(X, Y) is the covariance between X and Y (a simulation check is sketched below).
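A minimal simulation sketch of the last property (the coefficients a and b and the data-generating process are my own assumptions, not from the notes):

    # Hedged sketch: check Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y).
    set.seed(1)
    x <- rnorm(1e5)
    y <- 0.5 * x + rnorm(1e5)                             # y correlated with x by construction
    a <- 2; b <- -3
    var(a * x + b * y)                                    # left-hand side (empirical)
    a^2 * var(x) + b^2 * var(y) + 2 * a * b * cov(x, y)   # right-hand side (empirical)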
• Example. Three Normal p.d.f.s with different expected values and variances are depicted in Figure 3.
• The kurtosis of a r.v. X measures the thickness of the tails of its distribution and is defined as
  Kurt(X) = \frac{E[(X - \mu_X)^4]}{\sigma_X^4}.
– If Kurt(X) > 3, the distribution is leptokurtic. If Kurt(X) < 3, the distribution is platykurtic.
– The higher the kurtosis, the higher the likelihood of extreme events or outliers.
• Example. Distributions with different degrees of skewness and kurtosis are depicted in Figure 4.
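As a quick numerical illustration (an added sketch with assumed distributions), the sample kurtosis of normal draws is close to 3, while a heavier-tailed Student's t sample gives a larger value:

    # Hedged sketch: sample kurtosis of a normal sample vs. a heavy-tailed t(5) sample.
    set.seed(1)
    kurt <- function(x) mean((x - mean(x))^4) / mean((x - mean(x))^2)^2
    kurt(rnorm(1e5))       # close to 3 (mesokurtic)
    kurt(rt(1e5, df = 5))  # well above 3 (leptokurtic)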
[Figure 3: Three Normal p.d.f.s with E(X)=0, Var(X)=1; E(X)=2, Var(X)=0.5; and E(X)=−4, Var(X)=2.]
Moments
• The r-th moment of a r.v. X is defined as E(X^r). Example. The mean of X is the first moment of X.
• In general, mean, variance, skewness and kurtosis are all functions of the moments of the r.v.
• Example. Var(X) = E(X²) − [E(X)]² (a numerical check is sketched below).
• Moments are important to establish asymptotic properties.
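A minimal numerical check of the identity Var(X) = E(X²) − [E(X)]² on simulated data (the distribution below is an arbitrary choice of mine):

    # Hedged sketch: sample moments illustrating Var(X) = E(X^2) - E(X)^2.
    set.seed(1)
    x <- rnorm(1e5, mean = 2, sd = 3)      # assumed example distribution, true variance 9
    mean(x^2) - mean(x)^2                  # second moment minus squared first moment
    mean((x - mean(x))^2)                  # direct variance estimate, identical value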
Figure 4: Normal density (black), leptokurtic (kurt=190) and symmetric density (red), leptokurtic (kurt=12)
and right-asymmetric (skew=1.5) density (green), leptokurtic (kurt=7) and left-asymmetric (skew=-1.5)
density (blue).
– For two continuous r.v. with joint density function f(x, y) on R², we write the marginal density of Y as
  f(y) = \int_{-\infty}^{+\infty} f(x, y) \, dx
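As an illustrative check (my own construction, not from the notes), take X and Y independent standard normal, so that f(x, y) = φ(x)φ(y); integrating out x recovers the standard normal marginal of Y:

    # Hedged sketch: marginal density obtained by integrating an assumed joint density.
    f_joint <- function(x, y) dnorm(x) * dnorm(y)           # assumed joint p.d.f.
    y0 <- 1.3
    integrate(function(x) f_joint(x, y0), -Inf, Inf)$value  # marginal f(y0)
    dnorm(y0)                                               # same value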
• Example continued. The marginal probability of the journey time r.v. X is 0.22 and 0.78 for the slow
and fast journey, respectively.
Covariance and correlation
• Two r.v. X and Y are independent if the value of X is not informative of the value of Y, and vice versa. Their joint distribution is then the product of the marginals, i.e. f(x, y) = f(x)f(y).
• The covariance between two r.v. X and Y, Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)], measures how the two variables move together.
– For two discrete r.v. with joint probability mass function p(x, y) on a set of N² joint outcomes, we write
  Cov(X, Y) = \sum_{i=1}^{N} \sum_{j=1}^{N} (x_i - \mu_X)(y_j - \mu_Y) \, p(x_i, y_j)
– For two continuous r.v. with joint density function f(x, y) on R², we write
  Cov(X, Y) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} (x - \mu_X)(y - \mu_Y) f(x, y) \, dx \, dy
• If X and Y are independent then Cov(X, Y ) = 0, implying that E(XY ) = E(X)E(Y ). The converse is
not true, i.e. zero covariance does not imply independence.
• Note that Cov(X, X) = V ar(X) and Cov(X, Y ) = Cov(Y, X).
• Given two r.v. X and Y, the correlation coefficient is defined in terms of covariance as
  Corr(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X) Var(Y)}}
– −1 ≤ Corr(X, Y ) ≤ 1.
– Corr(X, Y ) = 1 means perfect linear positive association.
– Corr(X, Y ) = −1 means perfect linear negative association.
– Corr(X, Y ) = 0 means no linear association.
• Example. Figure 5 depicts four different situations with different degrees of correlation.
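The panels of Figure 5 can be reproduced, up to sampling noise, with a short R script along these lines (the sample size n = 200 and the seed are assumptions; the exact values used in the notes are unknown):

    # Hedged sketch reproducing scatterplots like those of Figure 5.
    set.seed(1)
    n <- 200
    x <- rnorm(n)
    par(mfrow = c(2, 2))
    plot(x,  0.8 * x + sqrt(1 - 0.8^2) * rnorm(n), main = "Corr(X,Y)=0.8")
    plot(x, -0.8 * x + sqrt(1 - 0.8^2) * rnorm(n), main = "Corr(X,Y)=-0.8")
    plot(x, rnorm(n), main = "Corr(X,Y)=0 (independent)")
    plot(x, 2 - x^2,  main = "Corr(X,Y)=0 (nonlinear dependence)")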
[Figure 5: Scatterplots of simulated data with Corr(X,Y)=0.8, Corr(X,Y)=−0.8, Corr(X,Y)=0 (independent), and Corr(X,Y)=0 (nonlinear dependence).]
The normal distribution
• The normal distribution is symmetric around the mean and concentrates 95% of its probability mass in the interval [µ − 1.96σ, µ + 1.96σ].
• If X ∼ N(µ, σ²), then Z = (X − µ)/σ ∼ N(0, 1); N(0, 1) is called the standard normal distribution, and its c.d.f. is often denoted by Φ(·).
• If X ∼ N_d(µ, Σ), and A and b are a non-random matrix and vector, respectively, then
  AX + b ∼ N_d(Aµ + b, AΣA′)
(a simulation check of this property is sketched below).
• If d = 2, we have
  – X = (X_1, X_2)
  – µ = (µ_1, µ_2), where µ_1 = E(X_1) and µ_2 = E(X_2)
  – Σ = \begin{pmatrix} \sigma_1 & \sigma_{12} \\ \sigma_{21} & \sigma_2 \end{pmatrix}, where σ_1 = Var(X_1), σ_2 = Var(X_2), σ_12 = Cov(X_1, X_2).
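A simulation sketch of the affine-transformation property, using mvrnorm from the MASS package (the particular µ, Σ, A and b below are my own illustrative choices):

    # Hedged sketch: if X ~ N2(mu, Sigma), then AX + b ~ N2(A mu + b, A Sigma A').
    library(MASS)
    mu    <- c(1, 2)
    Sigma <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)
    A     <- matrix(c(1, 0, 1, 1), nrow = 2)   # an example non-random 2x2 matrix
    b     <- c(-1, 3)
    X <- mvrnorm(1e5, mu, Sigma)               # one draw per row
    Z <- t(A %*% t(X) + b)                     # apply the affine map to every draw
    colMeans(Z);  A %*% mu + b                 # empirical vs. theoretical mean
    cov(Z);       A %*% Sigma %*% t(A)         # empirical vs. theoretical covariance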
[Figure: p.d.f.s for df = 2, df = 4, and df = 8.]
[Figure: p.d.f.s of the N(0,1), t_4, and t_10 distributions.]
The bottle factory. You are the CEO of a firm producing handcrafted bottles of glass. You
have only one client, and you know that to satisfy his annual demand of bottles you need to
produce a bottle every ten minutes.
The bottles are handcrafted, so the time the artisan takes to produce one bottle changes every time.
Your concern is whether you will be able to reach the annual target, and therefore you monitor the production process.
Statistical question: The production time of a bottle Y is a continuous r.v. Is the population
mean of Y , µY = E(Y ) equal to 10?
• Statistical theory can help us answer these questions, following these steps:
– Sampling
– Estimation
– Hypothesis testing
Sampling
• In statistics, the interest is always in the population distribution. The population is the collection of all possible entities of interest. It can be made of existing or hypothetical objects and can be thought of as an infinitely large quantity.
• Example: All the bottles in the production process; Men and Women; People with a PhD.
• To make statistical inference on the population distribution, we collect data on a subset of the population, called a sample, that needs to be representative of the population itself.
[Figure: p.d.f.s of the F_{2,4}, F_{8,16}, and F_{4,2} distributions.]
• In what follows, we assume that the random phenomenon of interest is a random variable Y with population mean µ_Y and variance σ²_Y.
• We perform simple random sampling: Choose N individuals at random from the population,
Y1 , . . . , YN .
– We can interpret these as N copies of the same r.v. Y , and thus as N different r.v.
– Once sampled, they take a specific value.
• Example: Record N production times randomly.
• Since Y_1, . . . , Y_N are sampled independently from the same distribution, we say they are independent and identically distributed, or i.i.d.
• Note that a sample is just one specific realization of Y_1, . . . , Y_N. One can obtain as many samples as desired simply by re-sampling.
Estimation
• Once we have a sample Y1 , . . . , YN , we can try to obtain an estimate of the population mean µY , and
to do so we need an estimator.
• An estimator is a function of a random sample that is informative of the quantity of interest, i.e. the
population mean in our case. The estimator is a random variable, and its outcome changes with
the sample. An estimate is the outcome for a specific sample.
• A “good” estimator should satisfy some desirable properties:
  – Unbiasedness: The expected value of the estimator is equal to the true quantity.
  – Consistency: As the sample size increases, the uncertainty around the true value reduces.
  – Efficiency: If the variance of an estimator is lower than that of all other estimators, then it is efficient.
• The natural estimator of the population mean is the sample mean Ȳ of Y_1, . . . , Y_N, defined as
  \bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i.
• Ȳ is the least squares estimator of µ_Y, i.e. it solves the minimization problem \min_m \sum_{i=1}^{N} (Y_i - m)^2.
• The sample average Ȳ is a natural estimator of µY , but not the only one. Is it the best?
– Ȳ is an unbiased estimator of µY , i.e. E(Ȳ ) = µY .
– Ȳ is consistent because of the Law of Large Numbers, i.e. Ȳ → µ_Y in probability.
– Ȳ is the most efficient among all the unbiased linear estimators of µY . Ȳ is the best linear
unbiased estimator (BLUE) of µY .
• The Law of Large Numbers (LLN) states that if (Y_i)_{i=1}^{N} are i.i.d. and σ²_Y < ∞, the probability that Ȳ falls within an arbitrarily small interval of the true population value µ_Y tends to one as the sample size N increases.
• Figure 9 provides an intuitive representation of the LLN.
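A minimal R sketch in the same spirit as Figure 9 (the Uniform(0,1) population, the sample sizes, and the number of replications are assumptions for illustration):

    # Hedged sketch of the LLN: sample means concentrate around the true mean as N grows.
    set.seed(1)
    for (N in c(10, 100, 1000)) {
      Ybar <- replicate(500, mean(runif(N)))   # 500 sample means; the true mean is 0.5
      cat("N =", N, " sd of the sample means =", round(sd(Ybar), 4), "\n")
    }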
Figure 9: Law of Large Numbers. The figure shows how the mean estimates concentrate around the true value as the number of observations increases from 10 to 100 to 1000.
• Since individuals in the sample are randomly drawn, the sample mean is a r.v. itself, and the distribution
of Ȳ over possible samples is called sampling distribution. This plays a fundamental role in statistical
inference.
• Simple math can be used to show that the mean and variance of the sampling distribution are:
  E(Ȳ) = µ_Y
  Var(Ȳ) = σ²_Y / N
• There are two approaches to derive the sampling distribution of Ȳ :
– Exact distribution or finite sample distribution. This describes the distribution of Ȳ for every N. Unfortunately, this is not easy to obtain in general as it depends on the distribution of Y. The only exception is when the Y_i are i.i.d. and normally distributed, because then Ȳ is normal with mean µ_Y and variance σ²_Y/N.
– Asymptotic distribution or large sample distribution. This is an approximation of the exact distribution that holds when N is large. Deriving this approximate distribution is easy using the Law of Large Numbers and the Central Limit Theorem.
• The Central Limit Theorem establishes that the normalized sum of independent r.v. converges
toward a normal distribution.
• Under certain assumptions on the moments of Y , the sampling distribution of Ȳ is approximated by a
normal distribution as N increases, i.e.
  \frac{\bar{Y} - E(\bar{Y})}{\sqrt{Var(\bar{Y})}} \xrightarrow{d} N(0, 1)
• This result holds regardless of the distribution of Y, but the speed of convergence, i.e. how large N needs to be for the approximation to be good, depends on it.
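A short simulation sketch of this approximation (the exponential population, N, and the number of replications are my own choices): standardized sample means of a skewed distribution look approximately standard normal.

    # Hedged sketch of the CLT: standardized sample means are approximately N(0,1).
    set.seed(1)
    N <- 200
    z <- replicate(5000, {
      y <- rexp(N, rate = 1)           # a skewed population with mean 1 and variance 1
      (mean(y) - 1) / sqrt(1 / N)      # (Ybar - E(Ybar)) / sqrt(Var(Ybar))
    })
    c(mean(z), sd(z))                  # close to 0 and 1
    hist(z, breaks = 50, freq = FALSE) # bell-shaped histogram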
The bottle factory. We collect N = 100 observations on the production time of the bottles.
Figure 10 plots the observations, the population mean (black line) and the sample mean (red line).
Figure 10: Bottle factory example. Sample of recorded production times.
Hypothesis Testing
• Example continued. Can we conclude that we are under the target production level?
• The hypothesis testing (on the population mean) verifies the statistical validity of a null hypothesis,
H0 , based on the observations, against an alternative hypothesis, H1 . The test can be:
– One-sided(>): H0 : E(Y) = µ_{Y,0} vs. H1 : E(Y) > µ_{Y,0}
– One-sided(<): H0 : E(Y) = µ_{Y,0} vs. H1 : E(Y) < µ_{Y,0}
– Two-sided: H0 : E(Y) = µ_{Y,0} vs. H1 : E(Y) ≠ µ_{Y,0}
• We can compute the test statistic
  z = \frac{\bar{Y} - \mu_{Y,0}}{\sigma_Y / \sqrt{N}}.
For large N , CLT suggests that z has distribution N (0, 1), and we reject H0 if the value of z is far
enough from 0. How far?
• Hypothesis testing can lead to two types of errors:
– Type I error: Rejecting H0 when this is true.
– Type II error: Not rejecting H0 when this is false.
• We want to reach a decision on H0 while controlling the probability of committing a type I error. Decisions in statistics are never conclusive; they are always subject to a significance level.
• The significance level α of the test is a pre-specified probability of incorrectly rejecting the null
hypothesis when the null is true (Type I error).
• The critical value c∗α of the test statistic is the threshold value that allows us to reach a conclusion
on H0 , knowing that we have probability α of committing a type I error.
– If the test statistic exceeds the critical value, it falls in the rejection region, and H0 is rejected.
– If the test statistic does not exceed the critical value, it falls in the acceptance region, and
H0 is not rejected.
• The critical values are quantiles of the sampling distribution that depend on the significance level α and the alternative hypothesis. In large samples,
  – One-sided(>): cv := Φ^{-1}(1 − α)
  – One-sided(<): cv := Φ^{-1}(α)
  – Two-sided: cv := Φ^{-1}(α/2) and Φ^{-1}(1 − α/2)
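In R these large-sample critical values are just standard normal quantiles (α = 0.05 assumed below):

    # Hedged sketch: large-sample critical values for alpha = 0.05.
    alpha <- 0.05
    qnorm(1 - alpha)                    # one-sided (>):  1.645
    qnorm(alpha)                        # one-sided (<): -1.645
    qnorm(c(alpha / 2, 1 - alpha / 2))  # two-sided: -1.96 and 1.96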
• In practice σ_Y is unknown and is replaced by the sample standard deviation s_Y, which yields the t-statistic
  t = \frac{\bar{Y} - \mu_{Y,0}}{s_Y / \sqrt{N}}
– The exact distribution of t is the Student’s t distribution with N − 1 d.o.f, if Y is normal.
– In large samples, the Student’s t is well approximated by the standard normal. We can thus use
the critical value of the latter to decide on H0 .
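A minimal sketch of this test in R on simulated data (the sample and the null value 10 are assumptions for illustration); t.test() computes the same statistic and uses the exact Student's t distribution:

    # Hedged sketch: one-sample t-test of H0: E(Y) = 10 on simulated data.
    set.seed(1)
    y <- rnorm(100, mean = 9.8, sd = 3)            # assumed sample of production times
    (mean(y) - 10) / (sd(y) / sqrt(length(y)))     # the t-statistic defined above
    t.test(y, mu = 10, alternative = "two.sided")  # same statistic, Student's t p-value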
Figure 11: The sampling distribution N(0,1), the significance level at the 5 percent level (grey area), and the
critical values (red crosses) at +1.96 and -1.96 for the two-sided test.
• In practice, if t exceeds c∗α , the test rejects H0 and we say that µY is statistically different (> or
<) from µY,0 at the significance level α.
• An alternative way to decide on H0 is based on the p-value. This is the probability of drawing a
statistic t at least as adverse to H0 as the value computed with the data, t∗ . In large samples:
  – One-sided(>): p-value = Pr_{H_0}\left( \frac{\bar{Y} - \mu_{Y,0}}{s_Y/\sqrt{N}} > t^* \right) = 1 − Φ(t^*)
  – One-sided(<): p-value = Pr_{H_0}\left( \frac{\bar{Y} - \mu_{Y,0}}{s_Y/\sqrt{N}} < t^* \right) = Φ(t^*)
  – Two-sided: p-value = Pr_{H_0}\left( \left| \frac{\bar{Y} - \mu_{Y,0}}{s_Y/\sqrt{N}} \right| > |t^*| \right) = 2Φ(−|t^*|)
  where t^* = \frac{\bar{Y}^* - \mu_{Y,0}}{s_Y^*/\sqrt{N}} is the value of the statistic computed with the data (Ȳ* and s*_Y denote the observed sample mean and standard deviation).
• Once the significance level is fixed (α = 5% is a common choice), we conclude that H0 is rejected in
favor of the alternative hypothesis if the computed p-value is lower than the significance level.
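In large samples these p-values are computed from the standard normal c.d.f.; in R, for a hypothetical computed statistic t_star:

    # Hedged sketch: large-sample p-values from the standard normal c.d.f.
    t_star <- -1.3              # hypothetical value of the computed statistic
    1 - pnorm(t_star)           # one-sided (>)
    pnorm(t_star)               # one-sided (<)
    2 * pnorm(-abs(t_star))     # two-sided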
• Technical issues in hypothesis testing: size and power of the test.
• The size of the test is the probability that the test falsely rejects H0 when this is true.
– The test has correct size if the size of the test is equal to the significance level.
– Remark. When the size is correct, we commit a Type I error with probability α when testing at significance level α.
– The size of the test must be correct in order to avoid the over-rejection problem, i.e. committing
type I error too often.
– Problems with the size may emerge when N is small or when the data fail to match the assumptions underlying the asymptotic approximation.
• The power of the test is the probability that the test rejects the null when this is false.
– The more powerful the test, the lower the probability of committing a type II error.
The bottle factory. My concern as the CEO of the firm is that the mean production rate stays at one bottle every ten minutes, no less, no more. Therefore, I want to test
H0 : E(Y) = 10 vs. H1 : E(Y) ≠ 10.
I collected N = 100 observations on the production time and computed Ȳ* = 9.62 and s*_Y = 2.92. The t-statistic becomes
  t = (9.62 − 10) / (2.92/10) ≈ −1.30.
Fixing the significance level at α = 0.05, for the two-sided test the critical values are c*_α = ±1.96. Thus we do not reject H0. See Figure 12.
We can compute the p-value as 2Φ(−|t*|) = 2Φ(−1.30) ≈ 0.19. This is larger than the significance level, thus we do not reject H0.
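The computations in the box can be reproduced from the reported summary statistics alone (a sketch; the raw data are not available here):

    # Hedged sketch: two-sided test of H0: E(Y) = 10 from the reported summaries.
    ybar <- 9.62;  s <- 2.92;  N <- 100;  mu0 <- 10
    t_stat <- (ybar - mu0) / (s / sqrt(N));  t_stat   # about -1.30
    2 * pnorm(-abs(t_stat))                           # p-value, about 0.19
    qnorm(c(0.025, 0.975))                            # two-sided critical values, about +/-1.96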
Figure 12: Bottle factory example. The significance level at the 5 percent level (grey area), and the critical values (red crosses). The value of the t-statistic (blue line) and the p-value (blue area).
Confidence interval
• A 100(1 − α)% confidence interval (C.I.) for µ_Y is an interval that contains the true value of µ_Y in 100(1 − α)% of the repeated samples.
• Note that the confidence interval will differ from one sample to another; it is a r.v.
• The 100(1 − α)% C.I. contains all the values of µ_Y that cannot be rejected at significance level α in the two-sided hypothesis test, given the sample:
  C.I. := \left[ \bar{Y} - \Phi^{-1}(1 - \alpha/2) \frac{s_Y}{\sqrt{N}}, \; \bar{Y} + \Phi^{-1}(1 - \alpha/2) \frac{s_Y}{\sqrt{N}} \right]
• Example. We can obtain a 95% confidence interval for the mean production time of the bottles, i.e.
  C.I. = [9.62 − 1.96 · 2.92/10 ; 9.62 + 1.96 · 2.92/10] = [9.05; 10.19].
The interval contains 10, consistent with not rejecting H0 : E(Y) = 10 above.
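The same interval in R, computed from the reported summary statistics (a sketch):

    # Hedged sketch: 95% confidence interval for mu_Y from the reported summaries.
    ybar <- 9.62;  s <- 2.92;  N <- 100
    ybar + c(-1, 1) * qnorm(0.975) * s / sqrt(N)   # approximately [9.05, 10.19]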
The bottle factory. As CEO of the bottle factory you have ascertained that the company is able to satisfy the client's demand of a bottle every 10 minutes. However, you suspect that artisan A is faster than artisan B.
Statistical question: Is the average production time of artisan A lower than that of artisan B?
We can formalize this hypothesis as
  H0 : µ_A − µ_B = 0 vs. H1 : µ_A − µ_B < 0.
More generally, one can test H0 : µ_A − µ_B = d_0 against H1 : µ_A − µ_B ≠ d_0 (or >, or <).
• The test statistic is
  t = \frac{(\bar{Y}_A - \bar{Y}_B) - d_0}{\sqrt{\sigma_A^2/N_A + \sigma_B^2/N_B}}
  – This is not feasible as the variances are unknown. A consistent estimator of the variance of Ȳ_A − Ȳ_B is s²_{Ȳ_A−Ȳ_B} = s²_A/N_A + s²_B/N_B, and the feasible statistic replaces the denominator with s_{Ȳ_A−Ȳ_B}.
The bottle factory. We collect N = 100 observations from artisans A and B and compute the sample means, Ȳ_A = 9.36 and Ȳ_B = 10.41. See the scatterplot in Figure 13.
Our test hypothesis is H0 : µ_A − µ_B = 0 vs. H1 : µ_A − µ_B < 0.
We can compute the test statistic t = (9.36 − 10.41)/s_{Ȳ_A−Ȳ_B} = −5.45, using the sample variances s²_A = 0.94 and s²_B = 0.96. The critical value at level α = 0.05 for the normal distribution in the one-sided (<) test is −1.65. We thus reject the null hypothesis. See Figure 14.
We can also compute the p-value of the test as
  p-value = Pr_{H_0}\left( \frac{(\bar{Y}_A - \bar{Y}_B) - d_0}{s_{\bar{Y}_A - \bar{Y}_B}} < \frac{(\bar{Y}_A^* - \bar{Y}_B^*) - d_0}{s_{\bar{Y}_A^* - \bar{Y}_B^*}} \right) = \Phi\left( \frac{(\bar{Y}_A^* - \bar{Y}_B^*) - d_0}{s_{\bar{Y}_A^* - \bar{Y}_B^*}} \right) ≈ 2e−08
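With the raw observations available, the whole comparison can be carried out with t.test(); below is a sketch on simulated data with roughly the reported means (the data and their standard deviations are assumptions):

    # Hedged sketch: one-sided two-sample (Welch) test of H0: mu_A - mu_B = 0 vs. mu_A - mu_B < 0.
    set.seed(1)
    yA <- rnorm(100, mean = 9.36, sd = 1)    # assumed data for artisan A
    yB <- rnorm(100, mean = 10.41, sd = 1)   # assumed data for artisan B
    t.test(yA, yB, alternative = "less")     # strongly rejects H0 for this simulated data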
Figure 13: Bottle factory example. Scatterplot of the observations from artisans A and B.
Figure 14: Bottle factory example. The significance level at the 5 percent level (grey area), and the critical value (red cross) at −1.65. The value of the t-statistic (blue line) and the p-value (blue area).