
Applied Econometrics

Review of probability and statistics

Notions of Probability Theory


Random Variable
• A random variable (r.v.) is a variable whose possible values are numerical outcomes of a random
phenomenon. Example. The Bernoulli r.v. takes values {0,1} and can describe the outcomes of a coin
toss.
• There are two types of random variables, discrete and continuous.
– A discrete r.v. is one which may take only a countable number of distinct values. Examples
include the Friday night attendance at a cinema, the number of patients in a doctor’s surgery, the
number of defective light bulbs in a box of ten.
– A continuous r.v. is one which may take an uncountably infinite number of possible values, typically any value in an interval. Continuous random variables are usually measurements, such as height, weight, or the time required to run a mile.
• The probability of an outcome is a number between zero and one measuring the likelihood that the event will occur. The probabilities of all possible outcomes must sum to one.
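
As an illustration, the coin-toss example can be simulated in R (a minimal sketch; the fair coin with Pr(X = 1) = 0.5 is an assumption made for illustration):

# Simulate 1000 tosses of a fair coin: a Bernoulli r.v. with Pr(X = 1) = 0.5
set.seed(1)
x <- rbinom(n = 1000, size = 1, prob = 0.5)
table(x) / length(x)   # empirical frequencies of 0 and 1, both close to 0.5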

Discrete random variables


• The probability mass function (p.m.f) of a discrete r.v. is a list of probabilities associated with
each of its possible values.
• The cumulative distribution function (c.d.f) is a function giving the probability that the r.v. is less than or equal to a certain value. For a discrete random variable, it is obtained by summing the probabilities of all values up to and including that value.
• Example. The r.v. X can take values {1, 2, 3, 4} with probabilities {0.1, 0.3, 0.4, 0.2}. The p.m.f. at 3 is Pr(X = 3) = 0.4. The c.d.f. at 3 is Pr(X ≤ 3) = 0.1 + 0.3 + 0.4 = 0.8. Figure 1 depicts the probability mass function and the cumulative distribution function for this random variable.
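
A short R check of this example (values taken from the bullet above): the c.d.f. is the cumulative sum of the p.m.f.

x   <- 1:4
pmf <- c(0.1, 0.3, 0.4, 0.2)   # Pr(X = x)
cdf <- cumsum(pmf)             # Pr(X <= x)
pmf[3]   # 0.4 = Pr(X = 3)
cdf[3]   # 0.8 = Pr(X <= 3)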

Continuous random Variable


• The probability distribution of a continuous r.v. is described by a continuous function called the probability density function (p.d.f).
• The cumulative distribution function (c.d.f.) is defined as the integral of the p.d.f from the
beginning of the support up to a certain value.
• Example. The r.v. X has support on (−∞, +∞) and p.d.f. f(x). The c.d.f. evaluated at 1 is Pr(X ≤ 1) = ∫_{−∞}^{1} f(x) dx. Figure 2 depicts this situation.
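
As a small illustration, taking f(x) to be the standard normal density (an assumption, since the notes leave f unspecified), Pr(X ≤ 1) can be computed in R either with the built-in c.d.f. pnorm or by numerical integration of the p.d.f.:

pnorm(1)                                          # Pr(X <= 1) for X ~ N(0,1), about 0.841
integrate(dnorm, lower = -Inf, upper = 1)$value   # same value via numerical integration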

Expected value
• The expected value of a r.v. X, denoted as E(X) or µX , is intuitively the long-run average value of
repetitions of the experiment it represents.

Figure 1: Discrete random variable. Probability mass function (left) and cumulative distribution function (right).

– If X is a discrete r.v. taking N values with probabilities p(xi), then
E(X) = Σ_{i=1}^{N} xi p(xi)
– If X is a continuous r.v. with p.d.f. f(x) and support on (−∞, +∞), then
E(X) = ∫_{−∞}^{+∞} x f(x) dx

• Some useful properties of the expected value are the following:


– E(X + Y ) = E(X) + E(Y )
– E(aX) = aE(X) where a is a constant.
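
A quick R sketch using the discrete example from the previous page: the expected value is the probability-weighted sum of the outcomes, and linearity can be checked directly.

x <- 1:4
p <- c(0.1, 0.3, 0.4, 0.2)
EX <- sum(x * p)       # E(X) = 2.7
EX
sum((5 * x) * p)       # E(5X) = 5 * E(X) = 13.5, illustrating E(aX) = aE(X)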

Variance
• The variance of a r.v. X, denoted as Var(X) or σX², measures the dispersion of the probability distribution around the mean, µX = E(X).
• The variance is defined as the expected value of the squared deviation from the mean, i.e.
Var(X) = E((X − µX)²)
– If X is a discrete r.v. taking N values with probabilities p(xi), then
Var(X) = Σ_{i=1}^{N} (xi − µX)² p(xi)
– If X is a continuous r.v. with p.d.f. f(x) and support on (−∞, +∞), then
Var(X) = ∫_{−∞}^{+∞} (x − µX)² f(x) dx

Figure 2: Continuous random variable. Probability density function (left) and cumulative distribution function (right).

• A common measure of dispersion is the standard deviation, denoted as σX = √Var(X).
• Some useful properties of the variance are the following:
– Var(X) ≥ 0.
– V ar(aX) = a2 V ar(X) where a is a constant.
– V ar(aX + bY ) = a2 V ar(X) + b2 V ar(Y ) + 2abCov(X, Y ) where Cov(X, Y ) is the covariance
between X and Y .
• Example. Three Normal p.d.f. with different expected values and variance are depicted in Figure 3.
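
A small R check of the variance definition and of the property Var(aX) = a²Var(X), again using the discrete example above:

x  <- 1:4
p  <- c(0.1, 0.3, 0.4, 0.2)
mu <- sum(x * p)                 # E(X) = 2.7
vx <- sum((x - mu)^2 * p)        # Var(X) = 0.81
vx
sum((2 * x - 2 * mu)^2 * p)      # Var(2X) = 4 * Var(X) = 3.24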

Skewness and Kurtosis


• The skewness of a r.v. measures the asymmetry of the probability distribution around the mean, and
it is defined as
Skew(X) = E((X − µX)³) / σX³.
– If a distribution is symmetric then Skew(X) = 0.
– If Skew(X) > 0, the distribution has a longer right tail (it is right-skewed), and vice versa.
• The kurtosis of a r.v. measures the tailedness of the probability distribution, and it is defined as

Kurt(X) = E((X − µX)⁴) / σX⁴.

– If Kurt(X) > 3, the distribution is leptokurtic. If Kurt(X) < 3, the distribution is platykurtic.
– The higher the kurtosis, the higher the likelihood of extreme events or outliers.
• Example. Distributions with different degrees of skewness and kurtosis are depicted in Figure 4.
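
For a sample, skewness and kurtosis can be estimated with the sample analogues of the formulas above; a minimal R sketch on simulated data (the normal sample is an assumption chosen for illustration):

set.seed(1)
y <- rnorm(10000)                           # symmetric, normally distributed sample
skew <- mean((y - mean(y))^3) / sd(y)^3     # sample skewness, close to 0
kurt <- mean((y - mean(y))^4) / sd(y)^4     # sample kurtosis, close to 3
c(skew, kurt)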

Figure 3: Normal densities with different expected values and variances: E(X)=0, Var(X)=1; E(X)=2, Var(X)=0.5; E(X)=−4, Var(X)=2.

Moments
• The r-th moment of a r.v. X is defined as E(X^r). Example. The mean of X is the first moment of X.
• In general, the mean, variance, skewness and kurtosis are all functions of the moments of the r.v.
• Example. Var(X) = E(X²) − E(X)².
• Moments are important to establish asymptotic properties.
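
The identity Var(X) = E(X²) − E(X)² can be verified directly on the discrete example used earlier:

x <- 1:4
p <- c(0.1, 0.3, 0.4, 0.2)
sum(x^2 * p) - sum(x * p)^2   # 0.81, the same value obtained from sum((x - mu)^2 * p)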

Joint Probability Distribution


• The joint probability distribution of two or more r.v. describes the probability of joint events. The probabilities of all possible combinations of events must sum to one.
– Joint probability distributions may involve continuous r.v., discrete r.v., or both.
• Example. An employee living in the countryside travels to the downtown office every morning. The journey time is affected by the rainfall. These random events can be described by two Bernoulli r.v. X and Y.
              Rain (X=0)   No rain (X=1)
Slow (Y=1)       0.15           0.07
Fast (Y=0)       0.15           0.63
• The term marginal probability distribution of a r.v. X is used to distinguish the distribution of X from its joint distribution with other variables. It can be obtained from the joint distribution by summing (integrating) the joint probabilities over all the values of the other variable.
– For two discrete r.v. with joint probability mass function p(x, y) on a set of N² joint outcomes, we write the marginal mass function of Y as
p(y) = Σ_{i=1}^{N} p(X = xi, Y = y)

Figure 4: Normal density (black), leptokurtic (kurt=190) and symmetric density (red), leptokurtic (kurt=12)
and right-asymmetric (skew=1.5) density (green), leptokurtic (kurt=7) and left-asymmetric (skew=-1.5)
density (blue).

– For two continuous r.v. with joint density function f(x, y) on R², we write the marginal density of Y as
f(y) = ∫_{−∞}^{+∞} f(x, y) dx

• Example continued. The marginal probability distribution of the journey time r.v. Y is 0.22 and 0.78 for the slow and fast journey, respectively (see the sketch below).
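
The marginal probabilities of this example can be obtained in R by summing the joint table over rows and columns (a sketch reproducing the numbers above):

joint <- matrix(c(0.15, 0.15, 0.07, 0.63), nrow = 2,
                dimnames = list(Y = c("Slow", "Fast"), X = c("Rain", "No rain")))
rowSums(joint)   # marginal of Y: Slow 0.22, Fast 0.78
colSums(joint)   # marginal of X: Rain 0.30, No rain 0.70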

Conditional distribution, expectation and variance


• The conditional distribution of Y given X is the probability that Y takes value y given that X has
value x, and it is denoted as
Pr(Y = y | X = x) = Pr(Y = y, X = x) / Pr(X = x).
– The sum of all the conditional probabilities over y given x must sum to one.
• Example continued. The conditional probability of a slow journey time given that it is raining is
0.15/0.30 = 0.5, and it is the same for the fast journey. While unconditionally, a fast journey is three
times more likely, fast and slow journeys are equally likely when it rains.
• The conditional mean is the mean of the conditional distribution of Y given X, i.e.
E(Y | X = x) = Σ_{i=1}^{k} yi Pr(Y = yi | X = x).
• The conditional variance is the variance of the conditional distribution of Y given X, i.e.
Var(Y | X = x) = Σ_{i=1}^{k} (yi − E(Y | X = x))² Pr(Y = yi | X = x).

Covariance and correlation
• Two r.v. X and Y are independent if the value of X is not informative about the value of Y, and vice versa. Their joint distribution is the product of the marginals, i.e.

Pr(X = x, Y = y) = Pr(X = x) Pr(Y = y),

and Pr(Y = y | X = x) = Pr(Y = y).


• Given two r.v. X and Y , the covariance, denoted as Cov(X, Y ), is a measure of linear dependence
between X and Y and is written as

Cov(X, Y ) = E [(X − µX )(Y − µY )]

– For two discrete r.v. with joint probability mass function p(x, y) on a set of N 2 joint outcomes,
we write
Cov(X, Y) = Σ_{i=1}^{N} Σ_{j=1}^{N} (xi − µX)(yj − µY) p(xi, yj)

– For two continuous r.v. with joint density function f(x, y) on R², we write
Cov(X, Y) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} (x − µX)(y − µY) f(x, y) dx dy

• If X and Y are independent then Cov(X, Y ) = 0, implying that E(XY ) = E(X)E(Y ). The converse is
not true, i.e. zero covariance does not imply independence.
• Note that Cov(X, X) = V ar(X) and Cov(X, Y ) = Cov(Y, X).
• Given two r.v. X and Y , the correlation coefficient is defined in terms of covariance as

Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))

– −1 ≤ Corr(X, Y ) ≤ 1.
– Corr(X, Y ) = 1 means perfect linear positive association.
– Corr(X, Y ) = −1 means perfect linear negative association.
– Corr(X, Y ) = 0 means no linear association.
• Example. Figure 5 depicts four different situations with different degrees of correlation.
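
The panels in Figure 5 can be reproduced along these lines (a sketch consistent with the construction visible in the figure labels: Y = 0.8X + √(1 − 0.8²)·ε gives Corr(X, Y) ≈ 0.8):

set.seed(1)
x <- rnorm(200)
y <- 0.8 * x + sqrt(1 - 0.8^2) * rnorm(200)   # linearly related to x
cor(x, y)         # close to 0.8
cor(x, 2 - x^2)   # close to 0: nonlinear dependence but no linear association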

Figure 5: Examples of different degrees of correlation: Corr(X,Y)=0.8, Corr(X,Y)=−0.8, Corr(X,Y)=0 (independent), and Corr(X,Y)=0 (nonlinear dependence).

Some continuous distributions


The normal distribution
• The normal or Gaussian distribution with mean µ and variance σ² is a r.v. denoted as N(µ, σ²) with p.d.f.
f(x) = (1 / (σ√(2π))) exp(−(1/2) ((x − µ)/σ)²)

• The normal distribution is symmetric around the mean and concentrates 95% of its probability mass in the interval [µ − 1.96σ, µ + 1.96σ].
• If X ∼ N(µ, σ²), then Z = (X − µ)/σ ∼ N(0, 1); N(0, 1) is called the standard normal distribution, and its c.d.f. is often denoted by Φ(·).

Pr(X ≤ c1 ) = Pr(Z ≤ d1 ) = Φ(d1 )


Pr(X ≤ c2 ) = Pr(Z ≤ d2 ) = Φ(d2 )
Pr(c1 ≤ X ≤ c2 ) = Pr(d1 ≤ Z ≤ d2 ) = Φ(d2 ) − Φ(d1 )
where d1 = (c1 − µ)/σ and d2 = (c2 − µ)/σ.
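
In R, such probabilities can be obtained either by standardizing and using Φ (pnorm) or by passing the mean and standard deviation directly; the values µ = 10, σ = 2, c1 = 8, c2 = 12 below are assumptions chosen only for illustration:

mu <- 10; sigma <- 2; c1 <- 8; c2 <- 12
d1 <- (c1 - mu) / sigma; d2 <- (c2 - mu) / sigma
pnorm(d2) - pnorm(d1)                                                  # Pr(c1 <= X <= c2) via standardization
pnorm(c2, mean = mu, sd = sigma) - pnorm(c1, mean = mu, sd = sigma)    # same value without standardizing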

The multivariate normal distribution


• Let X be a vector of d r.v. with mean vector µ and covariance matrix Σ. We say that X has a multivariate normal distribution, X ∼ Nd(µ, Σ), if the joint probability density is written as:
f(x) = (2π)^(−d/2) |Σ|^(−1/2) exp(−(1/2) (x − µ)′ Σ⁻¹ (x − µ))

• If X ∼ Nd(µ, Σ), and A and b are a non-random matrix and vector, respectively, then

AX + b ∼ N(Aµ + b, AΣA′)

• If d = 2, we have

– X = (X1, X2)
– µ = (µ1, µ2) where µ1 = E(X1) and µ2 = E(X2)
– Σ = [ σ1, σ12 ; σ21, σ2 ], a 2×2 matrix where σ1 = Var(X1), σ2 = Var(X2), and σ12 = σ21 = Cov(X1, X2).
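
A simulation sketch of the linear-transformation property, using mvrnorm from the MASS package (a recommended package bundled with R); the specific µ, Σ, A and b below are assumptions made for illustration, and the sample moments of AX + b should be close to Aµ + b and AΣA′:

library(MASS)   # for mvrnorm
set.seed(1)
mu    <- c(1, 2)
Sigma <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)
X <- mvrnorm(n = 100000, mu = mu, Sigma = Sigma)   # draws from N2(mu, Sigma), one row per draw
A <- matrix(c(1, 0, 1, 1), nrow = 2); b <- c(0, 3)
Y <- t(A %*% t(X) + b)                             # apply AX + b to every draw
colMeans(Y)   # close to A %*% mu + b, i.e. (3, 5)
cov(Y)        # close to A %*% Sigma %*% t(A)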

The Chi-squared distribution


• Let Xi ∼ N(0, 1), i = 1, . . . , N, be N identically and independently distributed standard normal r.v. The r.v. W = Σ_{i=1}^{N} Xi² has a chi-squared distribution with N degrees of freedom, and it is denoted as W ∼ χ²N.
• Example. Figure 6 shows the p.d.f. of the χ2 distribution for three different values of the degrees of
freedom parameter.
Figure 6: Chi-squared distribution for 2, 4 and 8 degrees of freedom.

The Student’s t distribution


• Let X ∼ N(0, 1) and W ∼ χ²n be two independent r.v. The r.v. Y = X / √(W/n) has a Student's t distribution with n degrees of freedom, and it is denoted as Y ∼ tn. For n large, the Student's t converges to the normal distribution.
• Example. Figure 7 shows the p.d.f. of the Student's t distribution for two different values of the degrees of freedom parameter along with the normal distribution.

The Snedecor’s F distribution


• Let W ∼ χ²n and U ∼ χ²m be two independent r.v. The r.v. Z = (W/n) / (U/m) has a Snedecor's F distribution with n and m degrees of freedom, and it is written as Z ∼ Fn,m.
• Example. Figure 8 shows the p.d.f. of the Snedecor’s F distribution for three different values of the
degrees of freedom parameters.
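
Curves like those in Figures 6-8 can be drawn with the built-in density functions dchisq, dt and df; a minimal sketch for one member of each family (the chosen degrees of freedom are illustrative):

curve(dchisq(x, df = 4), from = 0, to = 20, ylab = "p.d.f.")        # chi-squared with 4 d.o.f.
curve(dt(x, df = 4), from = -4, to = 4, ylab = "p.d.f.")            # Student's t with 4 d.o.f.
curve(df(x, df1 = 2, df2 = 4), from = 0, to = 15, ylab = "p.d.f.")  # F with (2, 4) d.o.f.
# each call draws a separate plot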

Figure 7: Student's t distribution with 4 and 10 degrees of freedom, compared with the standard normal N(0,1).

Review of statistical theory

The bottle factory. You are the CEO of a firm producing handcrafted bottles of glass. You
have only one client, and you know that to satisfy his annual demand of bottles you need to
produce a bottle every ten minutes.
The bottles are handcrafted, so the time the artisan takes to produce one bottle changes every time.
Your concern is to be able to reach the annual target, and therefore you control the production
process.
Statistical question: The production time of a bottle Y is a continuous r.v. Is the population
mean of Y , µY = E(Y ) equal to 10?

• Statistical theory can help us answer this question, following these steps:
– Sampling
– Estimation
– Hypothesis testing

Sampling
• In statistics, the interest is always in the population distribution. The population is the collection of all possible entities of interest. It can be made of existing or hypothetical objects and can be thought of as an infinitely large quantity.
• Example: All the bottles in the production process; Men and Women; People with a PhD.
• To make statistical inference on the population distribution, we collect data on a subset of the
population, called sample, that needs to be representative of the population itself.

Figure 8: Snedecor's F distribution for degrees of freedom (2, 4), (8, 16) and (4, 2).

• In what follows, we assume that the random phenomenon of interest is a random variable Y with population mean µY and variance σY².
• We perform simple random sampling: choose N individuals at random from the population, Y1, . . . , YN.
– We can interpret these as N copies of the same r.v. Y, and thus as N different r.v.
– Once sampled, they take a specific value.
• Example: Record N production times randomly.
• Since Y1, . . . , YN are sampled independently from the same distribution, we say they are independent and identically distributed or i.i.d.
• Note that a sample is just one specific realization of Y1, . . . , YN. One can obtain as many samples as desired simply by re-sampling.

Estimation
• Once we have a sample Y1 , . . . , YN , we can try to obtain an estimate of the population mean µY , and
to do so we need an estimator.
• An estimator is a function of a random sample that is informative of the quantity of interest, i.e. the
population mean in our case. The estimator is a random variable, and its outcome changes with
the sample. An estimate is the outcome for a specific sample.
• A “good” estimator should satisfy some desirable properties:
– Unbiasedness: The expected value of the estimator is equal to the true quantity.
– Consistency: As the sample size increases, the uncertainty around the true value shrinks.
– Efficiency: An estimator is efficient if its variance is lower than that of any other estimator (within a given class).

• The natural estimator of the population mean is the sample mean Ȳ of Y1, . . . , YN, defined as
Ȳ = (1/N) Σ_{i=1}^{N} Yi.
• Ȳ is the least squares estimator of µY, i.e. it solves the minimization problem min_m Σ_{i=1}^{N} (Yi − m)².
• The sample average Ȳ is a natural estimator of µY , but not the only one. Is it the best?
– Ȳ is an unbiased estimator of µY , i.e. E(Ȳ ) = µY .
– Ȳ is consistent because of the Law of Large Numbers, i.e. Ȳ → µY in probability.
– Ȳ is the most efficient among all the unbiased linear estimators of µY . Ȳ is the best linear
unbiased estimator (BLUE) of µY .
• The Law of Large Numbers (LLN) states that if (Yi)_{i=1}^{N} are i.i.d. and σY² < ∞, the probability that Ȳ falls within an arbitrarily small interval of the true population value µY tends to one as the sample size N increases.
• Figure 9 provides an intuitive representation of the LLN.
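
A simulation along the lines of Figure 9 (a sketch; the Bernoulli(0.5) population is an assumption made for illustration): for each sample size, many samples are drawn and the dispersion of the sample means shrinks as N grows.

set.seed(1)
sample_means <- function(N, reps = 1000) replicate(reps, mean(rbinom(N, 1, 0.5)))
sapply(c(10, 100, 1000), function(N) sd(sample_means(N)))   # dispersion of Ybar shrinks roughly like 1/sqrt(N)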

Figure 9: Law of Large Numbers. The figure shows how the sample mean estimates concentrate around the true value as the number of observations increases from 10 to 100 to 1000.

• Since individuals in the sample are randomly drawn, the sample mean is a r.v. itself, and the distribution
of Ȳ over possible samples is called sampling distribution. This plays a fundamental role in statistical
inference.
• Simple math can be used to show that the mean and variance of the sampling distribution are:
E(Ȳ) = µY
Var(Ȳ) = σY² / N
where µY = E(Y) and σY² = Var(Y).

• There are two approaches to derive the sampling distribution of Ȳ:
– Exact distribution or finite sample distribution. This describes the distribution of Ȳ for every N. Unfortunately, this is not easy to obtain in general as it depends on the distribution of Y. The only exception is when Y is i.i.d. and normally distributed, because then Ȳ is normal with mean µY and variance σY²/N.
– Asymptotic distribution or large sample distribution. This is an approximation of the exact distribution that holds when N is large. Deriving this approximate distribution is easy using the Law of Large Numbers and the Central Limit Theorem.
• The Central Limit Theorem establishes that the normalized sum of independent r.v. converges
toward a normal distribution.
• Under certain assumptions on the moments of Y, the sampling distribution of Ȳ is approximated by a normal distribution as N increases, i.e.
(Ȳ − E(Ȳ)) / √Var(Ȳ) → N(0, 1) in distribution
• This result holds regardless of the distribution of Y, but the speed of convergence, i.e. how large N needs to be for the approximation to be good, depends on it (see the sketch below).
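
A small simulation sketch of the CLT: even for a markedly non-normal population (an exponential, chosen here only for illustration), the standardized sample mean is approximately N(0, 1) when N is large.

set.seed(1)
N <- 500
z <- replicate(2000, {
  y <- rexp(N, rate = 1)                # population with mean 1 and variance 1
  (mean(y) - 1) / (sd(y) / sqrt(N))     # standardized sample mean
})
c(mean(z), sd(z))                       # close to 0 and 1
# hist(z, freq = FALSE) would show a bell shape close to the standard normal density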

The bottle factory. We collect N = 100 observations on the production time of the bottles.
Figure 10 plots the observations, the population mean (black line) and the sample mean (red line).
Figure 10: Bottles factory example. Sample of recorded production times (minutes); the population mean (black line) and the sample mean (red line).

Hypothesis Testing
• Example continued. Can we conclude that we are under the target production level?
• The hypothesis testing (on the population mean) verifies the statistical validity of a null hypothesis,
H0 , based on the observations, against an alternative hypothesis, H1 . The test can be:

– One-sided(>): H0 : E(Y) = µY,0 vs. H1 : E(Y) > µY,0
– One-sided(<): H0 : E(Y) = µY,0 vs. H1 : E(Y) < µY,0
– Two-sided: H0 : E(Y) = µY,0 vs. H1 : E(Y) ≠ µY,0
• We can compute the test statistic
z = (Ȳ − µY,0) / (σY /√N).
For large N , CLT suggests that z has distribution N (0, 1), and we reject H0 if the value of z is far
enough from 0. How far?
• Hypothesis testing can lead to two types of errors:
– Type I error: Rejecting H0 when this is true.
– Type II error: Not rejecting H0 when this is false.
• We want to reach a decision on H0 while controlling the probability of committing a type I error. Decisions in statistics are never conclusive; they are always subject to a significance level.
• The significance level α of the test is a pre-specified probability of incorrectly rejecting the null
hypothesis when the null is true (Type I error).
• The critical value c∗α of the test statistic is the threshold value that allows us to reach a conclusion
on H0 , knowing that we have probability α of committing a type I error.
– If the test statistic exceeds the critical value, it falls in the rejection region, and H0 is rejected.
– If the test statistic does not exceed the critical value, it falls in the acceptance region, and
H0 is not rejected.
• The critical values are quantiles of the sampling distribution that depend on the significance level α and the alternative hypothesis. In large samples,
– One-sided(>). cv := Φ⁻¹(1 − α)
– One-sided(<). cv := Φ⁻¹(α)
– Two-sided. cv := Φ⁻¹(α/2) and Φ⁻¹(1 − α/2)


• Tables with tabulated critical values are important!
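
The same quantiles can also be obtained directly in R with qnorm (shown for α = 0.05):

alpha <- 0.05
qnorm(1 - alpha)                     # one-sided (>): about 1.64
qnorm(alpha)                         # one-sided (<): about -1.64
qnorm(c(alpha / 2, 1 - alpha / 2))   # two-sided: about -1.96 and 1.96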


• Example. A graphical representation of the sampling distribution, the significance level, and the critical
values for the two-sided test is given in Figure 11.
• Recall that z = (Ȳ − µY,0) / (σY /√N); in practice σY is unknown. An estimator of the variance of Y is the sample variance
sY² = (1/(N − 1)) Σ_{i=1}^{N} (Yi − Ȳ)²
– sY² → σY² in probability (LLN).
– If Y is normal, (N − 1)sY²/σY² has an exact χ² distribution with N − 1 degrees of freedom; more generally, sY² is asymptotically normal (CLT).
• The feasible test statistic thus becomes the t-statistic,
t = (Ȳ − µY,0) / (sY /√N)
– The exact distribution of t is the Student’s t distribution with N − 1 d.o.f, if Y is normal.
– In large samples, the Student’s t is well approximated by the standard normal. We can thus use
the critical value of the latter to decide on H0 .

Figure 11: The sampling distribution N(0,1), the significance level at the 5 percent level (grey area), and the
critical values (red crosses) at +1.96 and -1.96 for the two-sided test.

• In practice, if t exceeds c∗α , the test rejects H0 and we say that µY is statistically different (> or
<) from µY,0 at the significance level α.
• An alternative way to decide on H0 is based on the p-value. This is the probability of drawing a
statistic t at least as adverse to H0 as the value computed with the data, t∗ . In large samples:
– One-sided(>). p-value = Pr_{H0}((Ȳ − µY,0)/(sY /√N) > t*) = 1 − Φ(t*)
– One-sided(<). p-value = Pr_{H0}((Ȳ − µY,0)/(sY /√N) < t*) = Φ(t*)
– Two-sided. p-value = Pr_{H0}(|(Ȳ − µY,0)/(sY /√N)| > |t*|) = 2Φ(−|t*|)

• Once the significance level is fixed (α = 5% is a common choice), we conclude that H0 is rejected in
favor of the alternative hypothesis if the computed p-value is lower than the significance level.
• Technical issues in hypothesis testing: size and power of the test.
• The size of the test is the probability that the test falsely rejects H0 when this is true.
– The test has correct size if the size of the test is equal to the significance level.
– Remark. When the size is correct, we commit a Type I error with probability α at the significance level α.
– The size of the test must be correct in order to avoid the over-rejection problem, i.e. committing
type I error too often.
– Problems with the size may emerge when N is small or when the data fail to match the assumptions underlying the asymptotics.
• The power of the test is the probability that the test rejects the null when this is false.
– The more powerful the test, the lower the probability of committing a type II error.

The bottle factory. My concern as CEO of the firm is that the mean production rate stays at one bottle every ten minutes, no less no more. Therefore, I want to test
H0 : E(Y) = 10 vs. H1 : E(Y) ≠ 10.
I collected N = 100 observations on the production time and computed Ȳ* = 9.62 and s*Y = 2.92. The t-statistic becomes
t = (9.62 − 10) / (2.92/10) ≈ −1.30.
Fixing the significance level at α = 0.05, the critical values for the two-sided test are c*α = ±1.96. Thus we do not reject H0. See Figure 12.
We can compute the p-value as 2Φ(−|−1.30|) ≈ 0.19. This is larger than the significance level, thus we do not reject H0.
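
The same computation from the reported summary statistics, as an R sketch:

ybar <- 9.62; s <- 2.92; N <- 100; mu0 <- 10
t_stat  <- (ybar - mu0) / (s / sqrt(N))   # about -1.30, inside (-1.96, 1.96)
p_value <- 2 * pnorm(-abs(t_stat))        # about 0.19, larger than alpha = 0.05
c(t_stat, p_value)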
Figure 12: Bottles factory example. The significance level at the 5 percent level (grey area), the critical values (red crosses), the value of the t-statistic (blue line) and the p-value (blue area).

Confidence interval
• A (1 − α)% confidence interval (C.I.) for µY is an interval that contains the true value of µY in
(1 − α)% of the repeated samples.
• Note that the confidence interval will differ from one sample to another; it is a r.v. itself.
• The (1 − α)% C.I. contains all the values of µY that cannot be rejected at the α% level in the hypothesis testing, given the sample:
C.I. := {Ȳ − Φ⁻¹(1 − α/2) sY /√N , Ȳ + Φ⁻¹(1 − α/2) sY /√N}

• Example. We can obtain a 95% confidence interval for our production process of the bottles, i.e.
C.I. = {9.62 − 1.96 × 2.92/10 ; 9.62 + 1.96 × 2.92/10} = {9.05; 10.19}.
Since the target value 10 lies inside the interval, this is consistent with not rejecting H0 at the 5% level.
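
The corresponding R computation, using the same summary statistics:

ybar <- 9.62; s <- 2.92; N <- 100; alpha <- 0.05
ybar + c(-1, 1) * qnorm(1 - alpha / 2) * s / sqrt(N)   # 95% C.I., roughly (9.05, 10.19)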

The bottle factory. As CEO of the bottle factory you have ascertained that the company is able to satisfy the client demand of a bottle every 10 minutes. However, you suspect that artisan A is faster than artisan B.
Statistical question: Is the average production time of artisan A lower than that of artisan B?
We can formalize this hypothesis as
H0 : µA − µB = 0 vs. H1 : µA − µB < 0
and can verify it using a test for the difference in mean.

Hypothesis testing on difference in mean


• Sampling: Collect N observations for the r.v. YA and YB .
• Estimation: An estimator of µA − µB is ȲA − ȲB .
– By CLT, Ȳi ∼ N (µi , σi2 /Ni ) for large Ni with i ∈ {A, B}.
– Since YA and YB are independent, ȲA − ȲB ∼ N(µA − µB, σA²/NA + σB²/NB).
• Hypothesis testing: Our general test hypothesis is
H0 : µA − µB = d0 vs. H1 : µA − µB ≠ d0 (or >, or < for one-sided tests).
• We can test H0 using the test statistic
t = ((ȲA − ȲB) − d0) / √(σA²/NA + σB²/NB)
– This is not feasible as the variances are unknown. A consistent estimator is s²_{ȲA−ȲB} = sA²/NA + sB²/NB.
– A feasible test statistic is t = ((ȲA − ȲB) − d0) / s_{ȲA−ȲB}. This is only valid for NA and NB large!

The bottle factory. We collect N = 100 observations from artisan A and B and compute the
sample means, ȲA = 9.36 and ȲB = 10.41. See the scatterplot in Figure 13.
Our test hypothesis is H0 : µA − µB = 0 vs. H1 : µA − µB < 0.
We can compute the test statistic, t = ((ȲA − ȲB) − 0) / s_{ȲA−ȲB} = −5.45. The critical value at level α = 0.05 for the normal distribution in a one-sided (<) test is −1.65. We thus reject the null hypothesis. See Figure 14.
We can also compute the p-value of the test as
p-value = Pr_{H0}(((ȲA − ȲB) − d0) / s_{ȲA−ȲB} < t*) = Φ(t*) = Φ(−5.45) ≈ 2e−08.
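
A simulation sketch of the difference-in-means test; the data-generating parameters below are assumptions chosen only to mimic the example, not the original data:

set.seed(1)
yA <- rnorm(100, mean = 9.4, sd = 1)    # hypothetical production times of artisan A
yB <- rnorm(100, mean = 10.4, sd = 1)   # hypothetical production times of artisan B
se      <- sqrt(var(yA) / length(yA) + var(yB) / length(yB))   # estimate of s_{YbarA - YbarB}
t_stat  <- (mean(yA) - mean(yB) - 0) / se
p_value <- pnorm(t_stat)                # one-sided (<) p-value
c(t_stat, p_value)                      # t typically far below -1.65, so H0 is rejected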

Figure 13: Bottles factory example. Scatterplot of the observations from artisans A and B.
Figure 14: Bottles factory example. The significance level at the 5 percent level (grey area), the critical value (red cross) at −1.65, the value of the t-statistic (blue line) and the p-value (blue area).
