
Random Variables and Probability Theory

1 Random Variables
Econometrics essentially applies statistical methods to examine questions of interest
to economists, such as quantifying relationships between different economic variables,
testing competing hypotheses and computing forecasts. Models of relationships be-
tween economic variables (e.g. consumption and income) and models of how economic
variables change over time (e.g. the time path of GDP growth) all involve a degree
of uncertainty when applied to real world data, since we cannot capture all possible
variations in a simple model. As a result, econometric analysis treats economic data
as observations on random variables.
A random (or stochastic) variable is any variable whose value is a real number that
cannot be predicted exactly, and can be viewed as the outcome of a chance experiment.
A random variable is said to be discrete if it can only take a limited number of distinct
values (e.g. the total score when two dice are thrown). A random variable is said to
be continuous if it can assume any value over a continuum (e.g. the temperature in a
room). Most economic variables (e.g. GDP, exchange rates, etc.) are considered to be
continuous random variables, so our focus in this module is on this type of variable.
We now consider some basic probability theory associated with continuous random
variables.

2 Probability Density Function


The probability density function (PDF) for a continuous random variable is a function
that governs how probabilities are assigned to interval values for the random variable.
Let X be a continuous random variable defined on the interval −∞ ≤ x ≤ ∞. Then
if f (x) is the PDF of X, we have:

P(a ≤ X ≤ b) = \int_a^b f(x) dx

Thus the integral of the PDF over a certain range gives the probability that the random
variable will fall in that interval. To be a valid PDF, f (x) must satisfy the following
conditions:
(i) f (x) ≥ 0, −∞ ≤ x ≤ ∞
(ii) \int_{−∞}^{∞} f(x) dx = 1

so that all probabilities are non-negative, and the continuous sum of probabilities for
all possible outcomes is one. Notice that:

P(X = a) = P(a ≤ X ≤ a)
         = \int_a^a f(x) dx
         = 0

so that probabilities for a continuous random variable are only non-zero when measured
over an interval.
Example
Consider a spinner on a 0–100 dial with four equal sections (i.e. dividing lines at 0/100,
25, 50 and 75), and let X denote the value the spinner lands on. Since the spinner can
land on an infinite number of positions, the probability of it landing on any particular
position is zero, e.g. P (X = 32) = 0. However, the probability that the spinner
lands in a specified interval can be easily established, e.g. P (0 ≤ X ≤ 50) = 0.5,
P (75 ≤ X ≤ 100) = 0.25. The PDF of X here is given by:

f(x) = 1/100   for 0 ≤ x ≤ 100
     = 0       otherwise

and probabilities can be calculated using this formula, e.g.:

P(75 ≤ X ≤ 100) = \int_{75}^{100} (1/100) dx
                = [x/100]_{75}^{100}
                = 1 − 0.75
                = 0.25

This example is a case of a uniform distribution, more generally defined by X ∼ U(a, b) with PDF:

f(x) = 1/(b−a)   for a ≤ x ≤ b
     = 0         otherwise

Here we have, for a ≤ v ≤ w ≤ b:

P(v ≤ X ≤ w) = \int_v^w 1/(b−a) dx
             = [x/(b−a)]_v^w
             = (w − v)/(b − a)

Notice that P (v ≤ X ≤ w) depends only on the width of the interval w − v relative to
the total range b − a, but not on its position relative to a and b; hence the terminology
uniform distribution.
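
As an illustration (not part of the original notes), the spinner and the general U(a, b) probabilities can be checked numerically. The sketch below assumes Python with scipy is available; scipy.stats.uniform is parameterized by loc = a and scale = b − a, and the values of a, b, v, w are purely illustrative.

```python
# A minimal sketch (not from the notes): checking uniform probabilities with
# scipy.stats.uniform, which is parameterized by loc = a and scale = b - a.
from scipy.stats import uniform

spinner = uniform(loc=0, scale=100)               # X ~ U(0, 100)

print(spinner.cdf(100) - spinner.cdf(75))         # P(75 <= X <= 100) = 0.25
print(spinner.cdf(50) - spinner.cdf(0))           # P(0 <= X <= 50) = 0.5

# General U(a, b): P(v <= X <= w) = (w - v)/(b - a); values here are illustrative
a, b, v, w = 2.0, 10.0, 3.0, 7.0
X = uniform(loc=a, scale=b - a)
print(X.cdf(w) - X.cdf(v), (w - v) / (b - a))     # both 0.5
```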

3 Cumulative Distribution Function
The cumulative distribution function (CDF) for a random variable X gives the probability that X will take a value less than or equal to a specified number x. It is a monotonically increasing function, defined in terms of the PDF as:

F(x) = P(−∞ ≤ X ≤ x)
     = \int_{−∞}^{x} f(t) dt

where f (t) is the PDF. Note that F (−∞) = 0 and F (∞) = 1.

Example
In the above example of the spinner on a 0–100 dial, the PDF was given by:

f(x) = 1/100   for 0 ≤ x ≤ 100
     = 0       otherwise

From this we can obtain the CDF as follows:


F(x) = \int_{−∞}^{x} f(t) dt
     = \int_0^x (1/100) dt
     = [t/100]_0^x
     = x/100

Then, for example, we can calculate P (X ≤ 70) = 0.7.
More generally, for the uniform distribution X ∼ U (a, b) we obtain:

F(x) = \int_a^x 1/(b−a) dt
     = [t/(b−a)]_a^x
     = (x − a)/(b − a)
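
The relationship F(x) = ∫ f(t) dt can also be verified numerically. A minimal sketch, assuming Python with scipy.integrate.quad, integrates the U(a, b) density and compares the result with the closed form (x − a)/(b − a):

```python
# A minimal sketch: recovering the uniform CDF by numerically integrating the
# PDF and comparing it with the closed form (x - a)/(b - a).
from scipy.integrate import quad

a, b = 0.0, 100.0                     # the spinner example: X ~ U(0, 100)

def pdf(t):
    """PDF of U(a, b)."""
    return 1.0 / (b - a) if a <= t <= b else 0.0

def cdf(x):
    """F(x) obtained by integrating the PDF up to x."""
    value, _ = quad(pdf, a, x)
    return value

print(cdf(70.0), (70.0 - a) / (b - a))   # both 0.7, matching P(X <= 70)
```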

4 Multivariate Distributions
In the previous section we considered the distribution of a single continuous random
variable. In econometrics we are often concerned with the joint distribution of more
than one continuous random variable. We now extend our analysis to consider the
joint distribution of two random variables.

4.1 Joint Probability Density Function


Let X and Y be continuous random variables defined on the interval −∞ ≤ X ≤ ∞
and −∞ ≤ Y ≤ ∞. Then if f (x, y) is the joint (or bivariate) PDF of X and Y , we
have:

P(a ≤ X ≤ b, c ≤ Y ≤ d) = \int_c^d \int_a^b f(x, y) dx dy

Thus the double integral of the PDF over certain ranges now gives the probability that
both X and Y will fall in specified intervals. To be a valid PDF, f (x, y) must satisfy
the following conditions:
(i) f (x, y) ≥ 0, −∞ ≤ x ≤ ∞, −∞ ≤ y ≤ ∞
(ii) \int_{−∞}^{∞} \int_{−∞}^{∞} f(x, y) dx dy = 1
Example
Consider the random variables X and Y that have the joint PDF:

f(x, y) = 1/[(b−a)(d−c)]   for a ≤ x ≤ b and c ≤ y ≤ d
        = 0                otherwise

In this case X and Y are said to have a bivariate uniform distribution on the rectangle [a, b] × [c, d]. We can establish the probabilities of X and Y lying in certain intervals
using this PDF, e.g.:

P(v ≤ X ≤ w, e ≤ Y ≤ f) = \int_e^f \int_v^w 1/[(b−a)(d−c)] dx dy
                         = \int_e^f ( \int_v^w 1/[(b−a)(d−c)] dx ) dy
                         = \int_e^f [x/((b−a)(d−c))]_v^w dy
                         = \int_e^f (w−v)/[(b−a)(d−c)] dy
                         = [(w−v)y/((b−a)(d−c))]_e^f
                         = (w−v)(f−e)/[(b−a)(d−c)]
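
For a concrete rectangle the double integral above can be evaluated numerically. A minimal sketch, assuming Python with scipy.integrate.dblquad and illustrative values of a, b, c, d, v, w, e, f (none of which come from the notes):

```python
# A minimal sketch: checking P(v <= X <= w, e <= Y <= f) for the bivariate
# uniform by numerical double integration (the numbers are illustrative).
from scipy.integrate import dblquad

a, b, c, d = 0.0, 2.0, 0.0, 4.0          # X ~ U(a, b), Y ~ U(c, d)
v, w, e, f = 0.5, 1.5, 1.0, 3.0          # rectangle of interest

def joint_pdf(y, x):
    """Bivariate uniform joint PDF; dblquad passes arguments as (y, x)."""
    inside = (a <= x <= b) and (c <= y <= d)
    return 1.0 / ((b - a) * (d - c)) if inside else 0.0

prob, _ = dblquad(joint_pdf, v, w, lambda x: e, lambda x: f)
closed_form = (w - v) * (f - e) / ((b - a) * (d - c))
print(prob, closed_form)                 # both 0.25
```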

4.2 Marginal Probability Density Functions
Given a joint distribution for a pair of random variables X and Y , it is possible to
work out the univariate PDFs of the individual variables X and Y , regardless of the
values that the other variable might take. When the PDF of X or Y is obtained from
the joint distribution, we refer to it as the marginal PDF. The marginal PDFs for X
and Y are defined as:

f(x) = \int_{−∞}^{∞} f(x, y) dy

f(y) = \int_{−∞}^{∞} f(x, y) dx

Note that f (x) (or f (y)) is a function of x (or y) alone and both are legitimate univariate
PDFs in their own right.
The marginal PDF of X is used to assign probabilities to a range of values of X
irrespective of the range of values in which Y is located, i.e.

P(a ≤ X ≤ b) = P(a ≤ X ≤ b, −∞ ≤ Y ≤ ∞)
             = \int_{−∞}^{∞} \int_a^b f(x, y) dx dy
             = \int_a^b ( \int_{−∞}^{∞} f(x, y) dy ) dx
             = \int_a^b f(x) dx

Similarly, the marginal PDF of Y is used to assign probabilities to a range of values of Y irrespective of the range of values in which X is located:

P(c ≤ Y ≤ d) = P(−∞ ≤ X ≤ ∞, c ≤ Y ≤ d)
             = \int_c^d \int_{−∞}^{∞} f(x, y) dx dy
             = \int_c^d ( \int_{−∞}^{∞} f(x, y) dx ) dy
             = \int_c^d f(y) dy

Example
Consider again the bivariate uniform random variables X and Y :

f(x, y) = 1/[(b−a)(d−c)]   for a ≤ x ≤ b and c ≤ y ≤ d
        = 0                otherwise

The marginal PDF of X is obtained as

f(x) = \int_{−∞}^{∞} f(x, y) dy
     = \int_c^d 1/[(b−a)(d−c)] dy
     = [y/((b−a)(d−c))]_c^d
     = (d−c)/[(b−a)(d−c)]
     = 1/(b−a)

so that the marginal distribution of X is X ∼ U (a, b). A similar analysis shows that

f(y) = \int_{−∞}^{∞} f(x, y) dx
     = \int_a^b 1/[(b−a)(d−c)] dx
     = [x/((b−a)(d−c))]_a^b
     = (b−a)/[(b−a)(d−c)]
     = 1/(d−c)

so that the marginal distribution of Y is Y ∼ U (c, d).
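
The marginalization step can be checked numerically for particular values of a, b, c, d. A minimal sketch, assuming Python with scipy.integrate.quad (the specific numbers are illustrative, not from the notes):

```python
# A minimal sketch: the marginal PDF of X for the bivariate uniform, obtained
# by integrating the joint PDF over y and compared with 1/(b - a).
from scipy.integrate import quad

a, b, c, d = 0.0, 2.0, 0.0, 4.0

def joint_pdf(x, y):
    inside = (a <= x <= b) and (c <= y <= d)
    return 1.0 / ((b - a) * (d - c)) if inside else 0.0

def marginal_x(x):
    """f(x) = integral of f(x, y) over y."""
    value, _ = quad(lambda y: joint_pdf(x, y), c, d)
    return value

print(marginal_x(1.0), 1.0 / (b - a))    # both 0.5
```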

4.3 Conditional Probability Density Functions


We are sometimes interested in the distribution of one random variable given that
another takes a certain value. Given the joint distribution f (x, y), the conditional
PDFs of X and Y are defined as:

f(x|y) = f(x, y) / f(y)

f(y|x) = f(x, y) / f(x)

This is analogous to the conditional probability of event A occurring given that event
B occurs, specified as P (A|B) = P (A and B)/P (B).

The conditional PDF of X is used to assign probabilities to a range of values of X
given that Y takes the value Y = y, i.e.:

P(a ≤ X ≤ b | Y = y) = \int_a^b f(x|y) dx
                     = ( \int_a^b f(x, y) dx ) / f(y)

Similarly, the conditional PDF of Y is used to assign probabilities to a range of values of Y given that X takes the value X = x, i.e.:

P(c ≤ Y ≤ d | X = x) = \int_c^d f(y|x) dy
                     = ( \int_c^d f(x, y) dy ) / f(x)

Example
Consider again the bivariate uniform random variables X and Y :

f(x, y) = 1/[(b−a)(d−c)]   for a ≤ x ≤ b and c ≤ y ≤ d
        = 0                otherwise

The conditional PDF of X is obtained as:


f(x|y) = [1/((b−a)(d−c))] / [1/(d−c)]
       = 1/(b−a)

so that for this particular case, the conditional distribution of X given Y is the same
as the marginal distribution of X, f (x). Similarly:

f(y|x) = [1/((b−a)(d−c))] / [1/(b−a)]
       = 1/(d−c)

which is identical to the marginal distribution of Y , f (y). Although this result holds for
our bivariate uniform example, it is not always the case, and only arises because here the
X and Y random variables are independent of each other, a notion we consider further
in the next sub-section. If random variables are not independent, the conditional
distributions will differ from the corresponding marginal distributions.

4.4 Statistical Independence
The notion of the independence of two events is that knowledge of one event occurring
has no effect on the probability of the second event occurring. In the continuous random
variable context, given two random variables X and Y with joint PDF f (x, y), X and
Y are said to be statistically independent if and only if:
f (x, y) = f (x)f (y)
i.e. the joint PDF can be expressed as the product of the marginal PDFs. Under
independence, it also follows that:

f(x|y) = f(x, y) / f(y)
       = f(x)f(y) / f(y)
       = f(x)

and similarly that:
f (y|x) = f (y)
so that marginal and conditional PDFs are identical.

Example
In the above bivariate uniform example, we found that f(x|y) = f(x) and f(y|x) = f(y), showing that X and Y are independent. This can also be confirmed by demonstrating that the joint PDF is the product of the marginal PDFs:

f(x)f(y) = [1/(b−a)] × [1/(d−c)]
         = 1/[(b−a)(d−c)]
         = f(x, y)

5 Moments of a Probability Distribution


The PDF of a random variable provides a full probabilistic description of the stochastic
behaviour of the variable, giving probabilities for all possible intervals that the random
variable can lie in. However, it is often desirable to summarize this information in
certain ways, and this can be done using the moments of a distribution. The two most
widely reported moments are the first moment, called the mean or expected value and
the second moment, called the variance. The expected value is a measure of the random
variable’s average value or location, and the variance is a measure of its dispersion
around this average value. It is important to note that the mean and variance are not
random variables themselves, but are simply fixed numbers that are functions of the
underlying PDF.

5.1 Expected Value
Consider first an example of a discrete random variable: the score from throwing a die.
The probability of scoring 1, 2, 3, 4, 5 or 6 is in each case 1/6, and so the expected score
on average would clearly be:

(1 × 1/6) + (2 × 1/6) + (3 × 1/6) + (4 × 1/6) + (5 × 1/6) + (6 × 1/6) = 3.5

The notion of expected value follows from this idea of the mean outcome of a random
variable, and can be defined formally for a discrete random variable X as:

E(X) = \sum_i x_i P(X = x_i)

Extending this concept to the case of a continuous random variable X with PDF f (x),
the expected value is defined as:

E(X) = \int_{−∞}^{∞} x f(x) dx

When considering more than one random variable, e.g. two random variables X
and Y with joint PDF f (x, y), we define:

E(X) = \int_{−∞}^{∞} \int_{−∞}^{∞} x f(x, y) dx dy

E(Y) = \int_{−∞}^{∞} \int_{−∞}^{∞} y f(x, y) dx dy

These expected values could alternatively be calculated by first working out the marginal
PDFs of X and Y , and then using the expected value formula for a univariate random
variable. To show that these approaches give the same answer, consider for example
E(X):

E(X) = \int_{−∞}^{∞} \int_{−∞}^{∞} x f(x, y) dx dy
     = \int_{−∞}^{∞} x ( \int_{−∞}^{∞} f(x, y) dy ) dx
     = \int_{−∞}^{∞} x f(x) dx

Example
In the case where X ∼ U (a, b), we have:

f(x) = 1/(b−a)   for a ≤ x ≤ b
     = 0         otherwise

and so:

E(X) = \int_{−∞}^{∞} x f(x) dx
     = \int_a^b x/(b−a) dx
     = [x²/(2(b−a))]_a^b
     = (b² − a²)/(2(b−a))
     = (b−a)(b+a)/(2(b−a))
     = (b+a)/2

Hence, for a uniformly distributed random variable, E(X) is simply the midpoint
between a and b.
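
This midpoint result is easy to confirm numerically. A minimal sketch, assuming Python with numpy and scipy and illustrative values of a and b, checks E(X) = (a + b)/2 both by integrating x f(x) and by a Monte Carlo average:

```python
# A minimal sketch: E(X) = (a + b)/2 for X ~ U(a, b), checked by numerical
# integration of x f(x) and by a Monte Carlo average (values illustrative).
import numpy as np
from scipy.integrate import quad

a, b = 2.0, 10.0

expected, _ = quad(lambda x: x / (b - a), a, b)   # integral of x f(x) over [a, b]
print(expected, (a + b) / 2)                      # both 6.0

rng = np.random.default_rng(0)
print(rng.uniform(a, b, size=1_000_000).mean())   # close to 6.0
```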

Properties of Expected Value

1. If g(X) is a continuous function of X, then we define:


E[g(X)] = \int_{−∞}^{∞} g(x) f(x) dx

2. If a is a constant then:
E(a) = a

3. If a and b are constants then:


E(aX + b) = aE(X) + b
This follows since:

E(aX + b) = \int_{−∞}^{∞} (ax + b) f(x) dx
          = a \int_{−∞}^{∞} x f(x) dx + b \int_{−∞}^{∞} f(x) dx
          = aE(X) + b

4. In the bivariate case, if g(X, Y ) is a continuous function of X and Y , then we define:


E[g(X, Y)] = \int_{−∞}^{∞} \int_{−∞}^{∞} g(x, y) f(x, y) dx dy

5. The expected value of a sum of random variables is the sum of their expected values,
i.e.:
E(X + Y ) = E(X) + E(Y )
This follows since:

E(X + Y) = \int_{−∞}^{∞} \int_{−∞}^{∞} (x + y) f(x, y) dx dy
         = \int_{−∞}^{∞} \int_{−∞}^{∞} x f(x, y) dx dy + \int_{−∞}^{∞} \int_{−∞}^{∞} y f(x, y) dx dy
         = E(X) + E(Y)

6. If X and Y are independent random variables, the expected value of their product
is the product of their expected values, i.e.:

E(XY ) = E(X)E(Y )

This follows since:


E(XY) = \int_{−∞}^{∞} \int_{−∞}^{∞} xy f(x, y) dx dy
      = \int_{−∞}^{∞} \int_{−∞}^{∞} xy f(x) f(y) dx dy
      = \int_{−∞}^{∞} y f(y) ( \int_{−∞}^{∞} x f(x) dx ) dy
      = E(X) \int_{−∞}^{∞} y f(y) dy
      = E(X)E(Y)

7. If X has a PDF f (x) which is symmetric about x = a, then:

E(X) = a

5.2 Variance
The variance is a measure of the dispersion of a random variable about its expected
value, and for a continuous random variable X with PDF f (x), it is defined as:

V(X) = E{[X − E(X)]²}
     = \int_{−∞}^{∞} [x − E(X)]² f(x) dx

i.e. the average squared deviation of X from its mean. An alternative formula can be
derived as follows:

V(X) = \int_{−∞}^{∞} [x − E(X)]² f(x) dx
     = \int_{−∞}^{∞} x² f(x) dx + E(X)² \int_{−∞}^{∞} f(x) dx − 2E(X) \int_{−∞}^{∞} x f(x) dx
     = E(X²) + E(X)² − 2E(X)E(X)
     = E(X²) − E(X)²

Since the variance measures the average squared deviation of X from its mean, the units
of measurement are the squares of those of X. An alternative measure which converts
the variance into the same units of measurement as X is the standard deviation, simply
defined as the square root of the variance, i.e.:

s.d.(X) = \sqrt{V(X)}

When considering more than one random variable, e.g. two random variables X
and Y with joint PDF f (x, y), we define:

V(X) = \int_{−∞}^{∞} \int_{−∞}^{∞} [x − E(X)]² f(x, y) dx dy

V(Y) = \int_{−∞}^{∞} \int_{−∞}^{∞} [y − E(Y)]² f(x, y) dx dy

These variances could alternatively be calculated by first working out the marginal
PDFs of X and Y , and then using the variance formula for a univariate random variable.
To show that these approaches give the same answer, consider for example V (X):

V(X) = \int_{−∞}^{∞} \int_{−∞}^{∞} [x − E(X)]² f(x, y) dx dy
     = \int_{−∞}^{∞} [x − E(X)]² ( \int_{−∞}^{∞} f(x, y) dy ) dx
     = \int_{−∞}^{∞} [x − E(X)]² f(x) dx

Example
In the case where X ∼ U (a, b), we have:

f(x) = 1/(b−a)   for a ≤ x ≤ b
     = 0         otherwise

and so:

E(X²) = \int_{−∞}^{∞} x² f(x) dx
      = \int_a^b x²/(b−a) dx
      = [x³/(3(b−a))]_a^b
      = (b³ − a³)/(3(b−a))

Together with the earlier result that E(X) = (b + a)/2 we have:

V(X) = E(X²) − E(X)²
     = (b³ − a³)/(3(b−a)) − [(b+a)/2]²
     = (b − a)²/12
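
Again this can be confirmed numerically. A minimal sketch, assuming Python with scipy.integrate.quad and illustrative values of a and b, computes V(X) = E(X²) − E(X)² and compares it with (b − a)²/12:

```python
# A minimal sketch: V(X) = (b - a)^2 / 12 for X ~ U(a, b), checked via
# E(X^2) - E(X)^2 computed by numerical integration.
from scipy.integrate import quad

a, b = 2.0, 10.0

ex, _ = quad(lambda x: x / (b - a), a, b)         # E(X)
ex2, _ = quad(lambda x: x**2 / (b - a), a, b)     # E(X^2)

print(ex2 - ex**2, (b - a)**2 / 12)               # both ~5.333
```
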
Properties of Variance

1. If a is a constant then:
V (a) = 0

2. If a and b are constants then:

V(aX + b) = a²V(X)

This follows since:

V(aX + b) = E[(aX + b)²] − E(aX + b)²
          = E(a²X² + b² + 2abX) − [aE(X) + b]²
          = a²E(X²) + b² + 2abE(X) − a²E(X)² − b² − 2abE(X)
          = a²[E(X²) − E(X)²]
          = a²V(X)

5.3 Covariance
When we have a joint distribution involving more than one variable, it is also useful
to have a summary measure of the association between the variables. A widely used
measure of linear association between two random variables is the covariance. Given
two random variables X and Y with joint PDF f (x, y), the covariance between X and
Y is defined as:

C(X, Y) = E{[X − E(X)][Y − E(Y)]}
        = \int_{−∞}^{∞} \int_{−∞}^{∞} [x − E(X)][y − E(Y)] f(x, y) dx dy

As with the variance formula, an alternative version can also be derived:


C(X, Y) = \int_{−∞}^{∞} \int_{−∞}^{∞} [x − E(X)][y − E(Y)] f(x, y) dx dy
        = \int_{−∞}^{∞} \int_{−∞}^{∞} [xy + E(X)E(Y) − xE(Y) − yE(X)] f(x, y) dx dy
        = \int_{−∞}^{∞} \int_{−∞}^{∞} xy f(x, y) dx dy + E(X)E(Y) \int_{−∞}^{∞} \int_{−∞}^{∞} f(x, y) dx dy
          − E(Y) \int_{−∞}^{∞} \int_{−∞}^{∞} x f(x, y) dx dy − E(X) \int_{−∞}^{∞} \int_{−∞}^{∞} y f(x, y) dx dy
        = E(XY) + E(X)E(Y) − E(Y)E(X) − E(X)E(Y)
        = E(XY) − E(X)E(Y)

The sign of the covariance indicates whether the random variables are positively related (i.e. when X increases, Y typically increases as well) or negatively related (i.e. when X increases, Y typically decreases); if the covariance is zero, there is no linear relationship between X and Y.
A standardized measure of covariance, which provides a unit free measure of the
strength of linear association between X and Y (as well as the sign of the relationship),
is the correlation. The correlation between X and Y is defined by:

ρXY = C(X, Y) / [\sqrt{V(X)} \sqrt{V(Y)}]

If ρXY > 0 then X and Y are said to be positively correlated, and if ρXY < 0 then X
and Y are said to be negatively correlated.
It is possible to show that ρXY always lies in the following range:

−1 ≤ ρXY ≤ 1

If ρXY = 0 then X and Y are said to be uncorrelated, while if ρXY = ±1 then there
exists between X and Y an exact linear relationship of the form Y = aX + b. In between these extremes, the correlation measures the degree of linear relationship between X and Y, and does not depend on the units of measurement of X and Y.
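
A small simulation illustrates these points. The sketch below, assuming Python with numpy (the data-generating process is purely illustrative, not from the notes), computes a sample covariance and correlation and shows that rescaling a variable changes the covariance but not the correlation:

```python
# A minimal sketch: sample covariance and correlation for two positively
# related variables (the data-generating process is illustrative only).
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)            # y increases with x on average

print(np.cov(x, y)[0, 1])                         # covariance, roughly 2
print(np.corrcoef(x, y)[0, 1])                    # correlation, roughly 0.89

# Rescaling y changes the covariance but leaves the correlation unchanged
print(np.cov(x, 10 * y)[0, 1], np.corrcoef(x, 10 * y)[0, 1])
```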

Properties of Covariance

1. For sums and differences of random variables:

V (X + Y ) = V (X) + V (Y ) + 2C(X, Y )
V (X − Y ) = V (X) + V (Y ) − 2C(X, Y )

This follows since:

V(X + Y) = E[(X + Y)²] − E(X + Y)²
         = E(X²) + E(Y²) + 2E(XY) − E(X)² − E(Y)² − 2E(X)E(Y)
         = V(X) + V(Y) + 2C(X, Y)

V(X − Y) = E[(X − Y)²] − E(X − Y)²
         = E(X²) + E(Y²) − 2E(XY) − E(X)² − E(Y)² + 2E(X)E(Y)
         = V(X) + V(Y) − 2C(X, Y)

2. If X and Y are independent random variables:

C(X, Y) = E(XY) − E(X)E(Y)
        = E(X)E(Y) − E(X)E(Y)
        = 0

so that

ρXY = 0
V (X + Y ) = V (X) + V (Y )
V (X − Y ) = V (X) + V (Y )

It is also important to recognize that ρXY = 0 is implied by independence, but ρXY = 0 does not imply independence, since correlation only measures the strength of any linear relationship between X and Y.

6 Important Continuous Probability Distributions


In this section we examine a number of important continuous probability distributions
that are extensively used in econometric analysis. They allow us to build models of
the underlying processes that produce economic data, and allow us to make statisti-
cal inference about the estimates of parameters from empirical models of economic
behaviour.

6.1 The Normal Distribution
A random variable X is said to have a normal distribution if its PDF has the form:

f(x) = (1/\sqrt{2πσ²}) exp( −(x − µ)²/(2σ²) ),   −∞ ≤ x ≤ ∞

The normal distribution is fully characterized by the two parameters µ and σ². It can be shown that:

E(X) = \int_{−∞}^{∞} x f(x) dx = µ

V(X) = \int_{−∞}^{∞} (x − µ)² f(x) dx = σ²

We write X ∼ N (µ, σ 2 ).

Properties of the Normal Distribution

1. The normal distribution is bell-shaped and symmetric about its mean µ. The PDF corresponding to X ∼ N(0, 1) is called the standard normal distribution.

2. The following probabilities can be obtained for X ∼ N(µ, σ²):

P(|X − µ| ≥ σ) ≈ 0.320
P(|X − µ| ≥ 2σ) ≈ 0.050
P(|X − µ| ≥ 3σ) ≈ 0.003

3. A linear function of a normal random variable is also normally distributed:

if X ∼ N(µ, σ²), then Y = aX + b ∼ N(aµ + b, a²σ²)

4. A linear combination of independent normal random variables is also normally distributed:

if X_1 ∼ N(µ_1, σ_1²), X_2 ∼ N(µ_2, σ_2²) and X_1 and X_2 are independent, then

Y = a_1X_1 + a_2X_2 ∼ N(a_1µ_1 + a_2µ_2, a_1²σ_1² + a_2²σ_2²)

5. If X ∼ N(µ, σ²) then:

X − µ ∼ N(0, σ²)

and:

(X − µ)/σ ∼ N(0, 1)

Thus any probability statement concerning X ∼ N(µ, σ²) can be transformed into an equivalent one concerning (X − µ)/σ ∼ N(0, 1). For this reason, textbooks only report probability values for the N(0, 1) distribution.
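
Property 5 is what makes tabulated N(0, 1) probabilities sufficient in practice. A minimal sketch, assuming Python with scipy.stats.norm, reproduces the tail probabilities quoted in property 2 via the standardized variable:

```python
# A minimal sketch: the tail probabilities quoted above for X ~ N(mu, sigma^2),
# computed through the standardized variable (X - mu)/sigma ~ N(0, 1).
from scipy.stats import norm

for k in (1, 2, 3):
    # P(|X - mu| >= k*sigma) = 2 * (1 - Phi(k)), where Phi is the N(0, 1) CDF
    print(k, 2 * (1 - norm.cdf(k)))
# roughly 0.317, 0.046 and 0.003 (the notes quote these as 0.320, 0.050, 0.003)
```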

6.2 The Chi-Square Distribution


Consider n different random variables which are all standard normal and independent of each other, i.e. X_i ∼ N(0, 1), i = 1, 2, ..., n, with X_i and X_j independent for all i ≠ j. Then the sum of the X_i², i = 1, 2, ..., n, has a chi-square distribution with n degrees of freedom:

\sum_{i=1}^{n} X_i² ∼ χ²_n

Properties of the Chi-Square Distribution

1. The chi-square distribution is asymmetric, and is right skewed, with the skewness
decreasing with the degrees of freedom n. As n → ∞, the chi-square distribution
approaches the (symmetric) normal distribution.

2. Let Z ∼ χ²_n. We can show that the mean and variance of Z are given by:

E(Z) = n
V (Z) = 2n

3. A sum of independent chi-square random variables is also chi-square distributed:

if Z_1 ∼ χ²_{n_1}, Z_2 ∼ χ²_{n_2} and Z_1 and Z_2 are independent, then Z_1 + Z_2 ∼ χ²_{n_1 + n_2}

6.3 The t Distribution


Let X ∼ N(0, 1) and Z ∼ χ²_n, where X and Z are independent. Then:

X / \sqrt{Z/n} ∼ t_n

with t_n denoting a t distribution with n degrees of freedom.

Properties of the t Distribution

1. The t distribution is symmetric, but is less peaked and has thicker tails than the
normal distribution. As n → ∞, the t distribution approaches the standard normal
distribution.

2. Let Y ∼ t_n. We can show that the mean and variance of Y are given by:

E(Y) = 0
V(Y) = n/(n − 2),   n > 2

3. Textbooks give probability values for different values of n, but if n is large standard
normal probabilities provide a good approximation.

6.4 The F Distribution


Consider two independent chi-square random variables Z_1 ∼ χ²_{n_1} and Z_2 ∼ χ²_{n_2}. Then:

(Z_1/n_1) / (Z_2/n_2) ∼ F_{n_1,n_2}

with F_{n_1,n_2} denoting an F distribution with n_1 and n_2 degrees of freedom.

Properties of the F Distribution

1. Like the chi-square distribution, the F distribution is asymmetric and right skewed.
As n_1, n_2 → ∞, the F distribution approaches the (symmetric) normal.

2. Let W ∼ F_{n_1,n_2}. We can show that the mean and variance of W are given by:

E(W) = n_2/(n_2 − 2),   n_2 > 2
V(W) = 2n_2²(n_1 + n_2 − 2) / [n_1(n_2 − 2)²(n_2 − 4)],   n_2 > 4

3. If Y ∼ t_n, then Y² ∼ F_{1,n}.
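
The constructions in Sections 6.2–6.4 can be illustrated by simulation. A minimal sketch, assuming Python with numpy and illustrative degrees of freedom, builds chi-square, t and F draws from independent standard normals and checks their means and variances against the formulas above:

```python
# A minimal sketch: chi-square, t and F draws built from independent standard
# normals, with moments compared against the formulas above (values illustrative).
import numpy as np

rng = np.random.default_rng(0)
n1, n2, reps = 5, 12, 200_000

z1 = (rng.normal(size=(reps, n1)) ** 2).sum(axis=1)   # chi-square, n1 df
z2 = (rng.normal(size=(reps, n2)) ** 2).sum(axis=1)   # chi-square, n2 df
print(z1.mean(), z1.var())                            # ~n1 and ~2*n1

t = rng.normal(size=reps) / np.sqrt(z2 / n2)          # t with n2 df
print(t.mean(), t.var(), n2 / (n2 - 2))               # ~0 and ~n2/(n2 - 2)

f = (z1 / n1) / (z2 / n2)                             # F with (n1, n2) df
print(f.mean(), n2 / (n2 - 2))                        # ~n2/(n2 - 2)
```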

7 Samples and Estimators
When the PDF of a random variable is known, we can work out its probabilistic prop-
erties, and establish the mean, variance and other characteristics we may be interested
in. However, in practice, we typically do not know the precise probability distributions
governing the continuous random variables that we deal with in empirical economic
analysis. In this situation we refer to the true unknown distribution as the population
distribution, and we try and investigate aspects of this population distribution using
a sample of observed data. The sample data can then be used to estimate quantities
associated with the population distribution, for example we may wish to estimate the
population mean.

7.1 Random Samples


A sample of n observations on a random variable X, denoted x1 , x2 , ..., xn , is said to
be a random sample if the n observations are drawn independently from the same
population distribution (i.e. from the same PDF f (x)). We call such a sample a set of
independent and identically distributed (iid) random variables. Notice that although
a given sample observation takes a specific numerical value in each sample, it is still
treated as a random variable, since if the random sampling process was to be repeated,
it would take a different numerical value.
Economic data is typically referred to as a sample of observations on the unobserved
population random variable that is being examined, and there are two main types of
economic data:

Cross-sectional data – a sample of n observations drawn at the same point in time (e.g.
a sample of annual income for n different households in a given year).

Time series data – a sample of n observations drawn on a variable at each of n discrete and equally spaced points in time (e.g. a single household’s annual income measured for n consecutive years).

7.2 Estimators
A function of the sample observations is known as a sample statistic, and since each
observation in a random sample is itself a random variable, then any sample statistic
will also be a random variable. Sample statistics are often constructed in a way to
provide information about the unknown parameters of the population distribution,
and when this is done the sample statistic is called an estimator. For example, the
population mean is usually unknown, and we may wish to estimate it using sample
data; in such a case, we would construct an estimator of the mean using the data.
In general, let the random variable X have the PDF f (x; θ), where the notation
f (x; θ) is used to denote the fact that the PDF depends on some parameters collectively
referred to as θ. For example, if X ∼ N(µ, σ²) then θ = (µ, σ²). Now suppose we have a random sample x1, x2, ..., xn from f(x; θ) but θ is unknown. Then we let our estimator
of θ be denoted by:
θ̂ = g(x1 , x2 , ..., xn )
where g(.) is some suitably chosen function of the sample observations x1 , x2 , ..., xn . For
example, if X ∼ N (µ, σ 2 ) and we wanted to estimate the mean µ, a sensible estimator
would be:

x̄ = (1/n) \sum_{i=1}^{n} x_i
i.e. the average of the data, known as the sample mean. Another common estimator
is the sample variance, defined as:

s² = (1/(n−1)) \sum_{i=1}^{n} (x_i − x̄)²

which can be used to estimate the population variance σ 2 . Note that while population
parameters like E(X) and V (X) are fixed values related to a given PDF, estimators
of these values are random variables – if different samples are drawn from the same
population, their values will change. Since an estimator is a random variable, it will
have its own probability distribution, called a sampling distribution.

Example

Suppose X ∼ N (µ, σ 2 ) and we have a random (iid) sample x1 , x2 , ..., xn from this
distribution. The sampling distribution of the sample mean x̄ can be found as follows.
Since x̄ is a linear combination of independent, normally distributed random variables,
it is itself normally distributed. The mean can be derived as follows:

E(x̄) = E( (1/n) \sum_{i=1}^{n} x_i )
     = (1/n) E( \sum_{i=1}^{n} x_i )
     = (1/n) \sum_{i=1}^{n} E(x_i)
     = (1/n) \sum_{i=1}^{n} µ
     = µ

and the variance can be shown to be:


V(x̄) = σ²/n

Hence, the sampling distribution of x̄ is given by:

x̄ ∼ N(µ, σ²/n)
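
This sampling distribution can be illustrated by repeated sampling. A minimal sketch, assuming Python with numpy and illustrative parameter values, draws many samples of size n and checks that the mean and variance of x̄ across samples are close to µ and σ²/n:

```python
# A minimal sketch: simulating the sampling distribution of the sample mean
# for X ~ N(mu, sigma^2) and checking xbar ~ N(mu, sigma^2 / n).
import numpy as np

mu, sigma2, n, reps = 3.0, 2.0, 64, 100_000   # illustrative values
rng = np.random.default_rng(1)

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
xbar = samples.mean(axis=1)

print(xbar.mean(), mu)                        # ~3.0
print(xbar.var(), sigma2 / n)                 # ~0.03125
```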

We now consider two desirable properties that we would wish our estimators to
possess.

Unbiasedness

An estimator θ̂ is said to be unbiased if:

E(θ̂) = θ

i.e. the mean of its sampling distribution equals the parameter of interest θ. Conversely, if E(θ̂) ≠ θ, we say that θ̂ is a biased estimator. Other things equal, an unbiased estimator is preferred to a biased one. Examples of unbiased estimators are x̄ and s² since, as shown above, E(x̄) = µ, and we can also show that E(s²) = σ².

Efficiency

An efficient estimator is one that achieves the smallest possible variance among a given
class of estimators, i.e. it delivers as precise an estimator as possible. More formally, let
θ̂1 , θ̂2 , ..., θ̂k be k different unbiased estimators of θ. Then θ̂j is said to be the minimum
variance unbiased estimator, i.e. efficient among unbiased estimators, if:

V(θ̂j) < V(θ̂i),   i = 1, 2, ..., k, i ≠ j

8 Hypothesis Testing
Suppose we observe a random sample x1 , x2 , ..., xn on a random variable X with PDF
f (x; θ) and we obtain an estimate of θ, denoted by θ̂. The problem of hypothesis testing
involves deciding whether the value of θ̂ is compatible with some hypothesized value
of θ, say θ∗ . In other words does the value of θ̂ lend support to the assertion that the
observed sample could have originated from the PDF f (x; θ∗ )? To be able to perform a
test, we need to assume a functional form for f (.), e.g. the normal distribution, but of
course we will not typically know the parameters of f (.) as these are the objects of the
hypothesis testing exercise, e.g. testing whether the mean of the normal distribution
is equal to some particular value.
To set up a hypothesis test, we formally state the question as:

H0: θ = θ∗
H1: θ ≠ θ∗

where H0 is called the null hypothesis and H1 is called the alternative hypothesis.
The idea behind hypothesis testing is as follows. If θ̂ is a decent estimator of θ then
it will typically take a value that is close to θ, i.e. the distance θ̂ − θ will be “small”.
So, if H0 is true, θ̂ −θ∗ = (θ̂ −θ)+(θ −θ∗ ) should be “small” since (θ̂ −θ) is “small” and
(θ − θ∗ ) = 0 if H0 is true. On the other hand, if H1 is true, θ̂ − θ∗ = (θ̂ − θ) + (θ − θ∗ )
should be “large” since (θ̂ − θ) is “small” but (θ − θ∗) ≠ 0 is “large”. We need to know
the distribution of θ̂, and particularly its mean and variance, to be able to calibrate
what we mean by “small” and “large”.
A sample statistic based on θ̂ − θ∗ which is used to discriminate between H0 and
H1 is known as a test statistic, which we shall denote by t∗ . The idea is that t∗ has a
different sampling distribution under H0 and H1 , this being “larger” under H1 than H0 ,
thereby making discrimination between the two hypotheses possible. We then compare
the value of t∗ with a specified cut-off value, known as the critical value, and if t∗ is
greater than the critical value we reject H0 in favour of H1 .
Although we associate large values of t∗ as being consistent with H1 , there is always
the possibility that t∗ happens to be greater than the critical value even when H0 is
true. In such cases we would reject H0 even though it is true – this is known as making
a Type I error. However, once we have worked out the distribution of t∗ under H0 , we
can quantify what risk there is of a Type I error for a given critical value. Put another
way, we can decide what risk of Type I error we are happy with, and set the critical
value accordingly. Usually, we set this risk level, known as the significance level or size,
to be 0.05, and we then determine the critical value (c.v.) so as to make the following
probability statement true under H0 :

P (t∗ > c.v.) = 0.05

If H1 is true, we want t∗ to be larger than the critical value so that we correctly reject H0. The probability of rejecting H0 when H1 is true is known as the power of
the test. The other possibility is that we incorrectly accept H0 when H1 is true – this
is known as making a Type II error, and clearly power = 1 − P (Type II error ). In a
well-designed test, power is an increasing function of the distance θ − θ∗ , so that the
less θ∗ resembles θ, the higher is the power of the test and the more likely we are to
correctly conclude that H1: θ ≠ θ∗ is true. If, as the sample size n → ∞, the power of
the test approaches 1, we say that the test is consistent, which is a desirable property
for a test to possess.

Example

Suppose we have a random (iid) sample x1 , x2 , ..., xn from the distribution X ∼ N (µ, σ 2 ),
and suppose that we assume X is normally distributed but we do not know the values of µ or σ². We have already considered unbiased estimators of these parameters – the sample mean and variance:

x̄ = (1/n) \sum_{i=1}^{n} x_i

s² = (1/(n−1)) \sum_{i=1}^{n} (x_i − x̄)²

Suppose now that we wish to test a hypothesis about the population mean, specifically:

H0: µ = µ∗
H1: µ ≠ µ∗

Given that we are conducting a test about the population mean, our test statistic will
naturally be based on the sample mean x̄. We have already established the sampling
distribution of x̄:

x̄ ∼ N(µ, σ²/n)

which we can standardize to give:

x̄ − µ ∼ N(0, σ²/n)

(x̄ − µ)/\sqrt{σ²/n} ∼ N(0, 1)

\sqrt{n} (x̄ − µ)/σ ∼ N(0, 1)

Now consider replacing σ with its estimator s. It can be shown that:

(n − 1)s²/σ² ∼ χ²_{n−1}

and we can write:

\sqrt{n} (x̄ − µ)/s = \sqrt{n} [(x̄ − µ)/σ] × (σ/s)
                   = \sqrt{n} [(x̄ − µ)/σ] / \sqrt{[(n−1)s²/σ²]/(n−1)}
                   ∼ N(0, 1) / \sqrt{χ²_{n−1}/(n−1)}

Then, since \sqrt{n}(x̄ − µ)/σ and (n−1)s²/σ² are independent:

\sqrt{n} (x̄ − µ)/s ∼ t_{n−1}

Now if H0 is true it follows that:

\sqrt{n} (x̄ − µ∗)/s ∼ t_{n−1}

We can then use this result to define a test statistic for distinguishing between H0 and
H1 :

t∗ = \sqrt{n} (x̄ − µ∗)/s

The appropriate critical values can be obtained from the t_{n−1} distribution using a chosen significance level. The test can distinguish between H0 and H1 because if H1 is true, we can write:

t∗ = \sqrt{n} (x̄ − µ∗)/s
   = \sqrt{n} [(x̄ − µ) + (µ − µ∗)]/s
   = \sqrt{n} (x̄ − µ)/s + \sqrt{n} (µ − µ∗)/s
   ∼ t_{n−1} + \sqrt{n} (µ − µ∗)/s

and so, given that µ ≠ µ∗ under H1, t∗ should be “larger” (in absolute value terms)
than it would be if H0 was true. As n → ∞ we find that |t∗ | → ∞ under H1 and so
the probability of rejecting H0 approaches 1, i.e. the test is consistent.
Using a numerical example, suppose the unknown population parameters are µ = 3
and σ² = 2, and suppose we obtain estimates from a sample of n = 64 observations, giving x̄ = 2.9 and s² = 4. If we consider testing H0: µ = 3 at the 0.05 significance level, the hypothesis is true, so the distribution of the test statistic t∗ is t_{n−1}. Given that we use critical values from a t_{n−1} distribution, there is a 0.05 chance that we
incorrectly reject the null. Suppose instead we consider testing H0 : µ = 1 at the 0.05
significance level. In this case the null hypothesis is false (as the true population value
is µ = 3), and the distribution of the test statistic t∗ is

√ µ − µ∗ √
   
3−1
tn−1 + n = t63 + 64 √
s 4
= t63 + 8

The test statistic is then very likely to reject the null hypothesis as it comes from a
distribution that is heavily right-shifted compared to the distribution from which the
critical values are obtained.
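
The numerical example can be reproduced directly. A minimal sketch, assuming Python with numpy and scipy.stats.t, computes t∗ = √n(x̄ − µ∗)/s for both null hypotheses and compares it with the two-sided 5% critical value from the t(63) distribution:

```python
# A minimal sketch: the test statistic t* = sqrt(n) (xbar - mu*) / s for the
# numerical example above, compared with the two-sided 5% critical value.
import numpy as np
from scipy.stats import t

n, xbar, s2 = 64, 2.9, 4.0
s = np.sqrt(s2)
crit = t.ppf(0.975, df=n - 1)                 # ~2.0 for 63 degrees of freedom

for mu_star in (3.0, 1.0):
    t_stat = np.sqrt(n) * (xbar - mu_star) / s
    print(mu_star, round(t_stat, 2), abs(t_stat) > crit)
# H0: mu = 3 gives t* = -0.4 (not rejected); H0: mu = 1 gives t* = 7.6 (rejected)
```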
