
Statistics for economists

Module

Chapter one

OVERVIEW OF BASIC PROBABILITY THEORY

1. Introduction

There exist a tremendous number of random phenomena in nature, real life, and the experimental
sciences. Almost everything in nature is random: occurrences and durations of rain, the number of
double stars in a region of the sky, the lifetimes of plants, humans, and animals, the life span of a
radioactive atom, the phenotypes of offspring of plants or any biological beings, etc. Moreover, most
of the day-to-day activities of human beings are also full of uncertainty. In such cases we use
words such as probability or chance, event or happening, randomness, etc. The general theory
states that each phenomenon has a structural part, which is deterministic, and a random part, called
the error or the deviation. Randomness also appears in conceptual experiments: tossing a coin
once or 100 times, throwing three dice, arranging a deck of cards, matching two decks, playing
roulette, etc.

In this chapter students will be introduced to the nature of probability theory in general and
to some specific concepts related to it.

Objectives:
After successful completion of this chapter, students will be able to:
 Understand the concept of probability theory;
 Define basic probability concepts and terms;
 Understand the axioms of probability theory;
 Identify the counting rules/procedures of probability;
 Compute different probability problems;
 Solve permutation and combinatorial problems;
 Deal with conditional probabilities and independence.


1.1 What is probability?


Everyday human life is subject to randomness: waiting times for buses, traffic, the number of calls
on a telephone, the number of busy lines in a communication network, the sex of a newborn, etc. This
randomness of phenomena has a strong relation with the concept of probability. Therefore,
probability can be defined as the chance of an outcome of an experiment. It is the measure of
how likely an outcome is to occur. Alternatively, probability can also be defined as an attempt to
work mathematically with the relative uncertainties of random events. Probability
theory is the foundation upon which the logic of inference is built. Probability is important for us
to cope with uncertainty.

1.2 Definitions of some basic concepts in probability

Since we are dealing with the concept of probability, there are some basic terms related to probability
that we need to introduce. Here are some of the terms:
1. Experiment: Any process of observation or measurement or any process which generates
well defined outcome.
Examples:
 Rolling a die number of times
 Flip a coin
 Selecting a card from standard deck of cards etc.
2. Probability Experiment: It is an experiment that can be repeated any number of times
under similar conditions and it is possible to enumerate the total number of outcomes
without predicting an individual outcome. It is also called random experiment.
Example:
i. If a fair die is rolled once it is possible to list all the possible outcomes i.e.1, 2, 3, 4, 5,
6 but it is not possible to predict which outcome will occur.
ii. If a fair coin is tossed once, clearly the outcome will be either head or tail with equal
probability, but no one can surely know which outcome will occur until the
experiment is done.
3. Outcome: The result of a single trial of a random experiment.

For instance, if a head occurs in a single trial of the above experiment, then it is
called the outcome of that specific trial.

4. Sample Space: The set of all possible outcomes of a probability/random experiment.


5. Event: It is a subset of the sample space.
 It is a statement about one or more outcomes of a random experiment.
 They are denoted by capital letters.
Example: Considering the above experiment rolling a fair die once.
Let A be the event of odd numbers, B the event of even numbers, and C be the event of number 8.
A = {1, 3, 5}
B = {2, 4, 6}
C = { }, the empty set; C is an impossible event.
 A, B and C are called events and the numbers within the bracket are called
outcomes.
6. Equally Likely Events: Events which have the same chance of occurring.
 The outcomes in tossing a fair coin have equal chance of being occurred with
probability 0.5 each and are best examples of equally likely events.
7. Complement of an Event: the complement of an event A means the nonoccurrence of A; it
is denoted by A′, Aᶜ, or Ᾱ and contains those points of the sample space which do not belong
to event A.
 In an experiment of rolling a die the event of odd numbers and the event of even numbers
are complementary events.
8. Elementary Event: an event having only a single element or sample point.
9. Mutually Exclusive Events: Two events which cannot happen at the same time.
10. Independent Events: Two events are independent if the occurrence of one does not
affect the probability of the other occurring.


11. Dependent Events: Two events are dependent if the first event affects the outcome
or occurrence of the second event in such a way that the probability is changed.


1.3 Axioms of Probability

The probability of any event A, P(A), is assigned in such a way that it satisfies certain conditions.
These conditions for assigning probability are known as the axioms of probability. Classically there
are three such axioms; several useful consequences are often stated alongside them, as below. All
conclusions drawn in probability theory relate, directly or indirectly, to these axioms. Let E be a
random experiment and S be a sample space associated with E. With each event A we associate a
real number, called the probability of A, satisfying the following properties, called the axioms (or
postulates) of probability.
Axiom 1: For any event A belonging to the sample space S, the value of the probability of the
event lies between zero and one, i.e. mathematically expressed as 0 ≤ P(A) ≤ 1. Thus, this axiom
states that probabilities of events for a particular sample space are real numbers on the interval
[0, 1].

Axiom 2: The probability of the sample space S is P(S) = 1, i.e. S is a sure event. This axiom
states that the event described by the entire sample space has probability 1. If γ is the outcome
of an experimental trial, then γ is an element of S by the definition of the sample space. Therefore,
the event described by S must occur on every trial. Intuitively, every experiment
we perform must yield some kind of result, that result must be in the sample space, and so the
“event” described by the sample space is so general that it must always occur.
Axiom 3: If A and B are mutually exclusive (disjoint) events, i.e. A ∩ B is empty, then
the probability that one or the other occurs equals the sum of the two probabilities:
P(A∪B) = P(A) + P(B)
In general, P(A∪B) = P(A) + P(B) − P(A∩B); for disjoint events P(A∩B) = 0.
Axiom 4: The probability of any event must be greater than or equal to zero:
P(A) ≥ 0, i.e. the non-negativity property of probability theory.

Axiom 5: The probability of a complementary event is equal to one minus the probability of that
event: P(A′) = 1 − P(A)
Axiom 6: P(∅) = 0; ∅ is the impossible event, and the probability of an impossible event is zero.

1.4 Counting Procedures

 In order to calculate probabilities, we have to know


i. The number of elements of an event
ii. The number of elements of the sample space.
That is, in order to judge what is probable, we have to know what is possible first.
In order to determine the number of outcomes, one can use several rules of counting.
 The addition rule
 The multiplication rule
 Permutation rule
 Combination rule
 To list the outcomes of a sequence of events, a useful device called a tree diagram is
used.
Example: A student goes to the nearest snack bar to have breakfast. He can take tea, coffee, or milk
with bread, cake, or a sandwich. How many possibilities does he have?
Solutions:
Tea → bread, cake, or sandwich
Coffee → bread, cake, or sandwich
Milk → bread, cake, or sandwich
 There are nine possibilities.

The Multiplication Rule

If a choice consists of k steps, of which the first can be made in n1 ways, the second
in n2 ways, …, and the kth in nk ways, then the whole choice
can be made in (n1 × n2 × … × nk) ways.
Example 1: The first four natural numbers are to be used in a 4-digit identification card.
How many different cards are possible if
a) Repetitions are permitted.
b) Repetitions are not permitted.
Solutions
a)
1st digit 2nd digit 3rd digit 4th digit
4 4 4 4

There are four steps


1. Selecting the 1st digit, this can be made in 4 ways.
2. Selecting the 2nd digit, this can be made in 4 ways.
3. Selecting the 3rd digit, this can be made in 4 ways.
4. Selecting the 4th digit, this can be made in 4 ways.
Therefore, 4*4*4*4 = 256 different cards are possible.
b)
1st digit 2nd digit 3rd digit 4th digit
4 3 2 1


There are four steps


1. Selecting the 1st digit, this can be made in 4 ways.
2. Selecting the 2nd digit, this can be made in 3 ways.
3. Selecting the 3rd digit, this can be made in 2 ways.
4. Selecting the 4th digit, this can be made in 1 way.
Therefore, 4*3*2*1 = 24 different cards are possible.
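As a quick check, both counts can be reproduced by brute-force enumeration. Below is a minimal Python sketch (an illustration, not part of the original module):

from itertools import product, permutations

digits = [1, 2, 3, 4]

# Repetitions permitted: every 4-tuple over the digits is a valid card.
print(len(list(product(digits, repeat=4))))      # 256 = 4 * 4 * 4 * 4

# Repetitions not permitted: each digit may appear at most once.
print(len(list(permutations(digits, 4))))        # 24 = 4 * 3 * 2 * 1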

Permutations and combinations

The study of permutations and combinations is concerned with determining the number of
different ways of arranging and selecting objects out of a given number of objects, without
actually listing them. There are some basic counting techniques which will be useful in
determining the number of different ways of arranging or selecting objects.
The two basic counting principles are given below:
1. Permutation
An arrangement of n objects in a specified order is called permutation of the objects.
Permutation Rules:
1. The number of permutations of n distinct objects taken all together is n!,
where n! = n × (n − 1) × (n − 2) × … × 3 × 2 × 1, or equivalently
n! = 1 × 2 × 3 × 4 × … × n.
2. The arrangement of n objects in a specified order using r objects at a time is called the
permutation of n objects taken r objects at a time. It is written as nPr, and the formula is:
nPr = n! / (n − r)!

3. The number of permutations of n objects in which k1 are alike, k2 are alike, …, etc. is:
n! / (k1! × k2! × … × kn!)
Example:
1. Suppose we have a letters A, B, C, D

a. How many permutations are there taking all the four?
b. How many permutations are there two letters at a time?
2. How many different permutations can be made from the letters in the word
“CORRECTION”?
3. The final night of the Folklore Festival will feature 3 different bands. There are 7 bands
to choose from. How many different programs are possible?
Solutions:
1.
a) Here n = 4, there are four distinct objects

There are 4! = 24 permutations


b) Here n = 4 and r = 2
Therefore, there are 4P2 = 4!/(4 − 2)! = 24/2 = 12 permutations.

2. Here n = 10,
of which 2 are C, 2 are O, 2 are R, 1 is E, 1 is T, 1 is I, and 1 is N
⇒ k1 = 2, k2 = 2, k3 = 2, k4 = k5 = k6 = k7 = 1
 Using the third rule of permutation, there are
10! / (2! × 2! × 2! × 1! × 1! × 1! × 1!) = 453,600 different permutations possible.
3. Here n = 7 i.e. there are seven different bands/ objects and r = 3

Therefore, we have 7P3 = 7·6·5 = 210 different permutations.


 This expression is not the same as asking “How many ways are there to choose 3 bands
from 7?”
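The three permutation rules can be checked with a short Python sketch; math.factorial is all that is needed (the word and the numbers are those from the examples above):

from math import factorial
from collections import Counter

# Rule 1: all 4 letters A, B, C, D taken together: 4! = 24
print(factorial(4))

# Rule 2: nPr = n! / (n - r)!
def n_p_r(n, r):
    return factorial(n) // factorial(n - r)

print(n_p_r(4, 2))   # 12 two-letter arrangements
print(n_p_r(7, 3))   # 210 festival programs

# Rule 3: permutations of "CORRECTION" with repeated letters,
# n! divided by k! for each repeated letter.
count = factorial(len("CORRECTION"))
for k in Counter("CORRECTION").values():
    count //= factorial(k)
print(count)         # 453600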

Exercise:
1. Six different statistics books, seven different physics books, and three different economics
books are arranged on a shelf. How many different arrangements are possible if:
i. The books in each particular subject must all stand together
ii. Only the statistics books must stand together

2. If a permutation of the word WHITE is selected at random, how many of the
permutations
i. Begins with a consonant?
ii. Ends with a vowel?
iii. Has a consonant and vowels alternating?

2. Combinations
On many occasions we are not interested in arranging but only in selecting k objects from a given
n objects. A combination is a selection of some or all of a number of different objects where the
order of selection is immaterial. Therefore, a combination can be defined as a selection of
objects without regard to order. The number of selections of k objects from the given n objects
is denoted by nCk, and is given by:

nCk = n! / (k!(n − k)!)

Remark

 nCk is sometimes called a “combinatorial symbol” or “binomial coefficient” (in
connection with a fundamental mathematical result called the Binomial Theorem; you
may also recall the related “Pascal’s Triangle”).
 Use permutations if a problem calls for the number of arrangements of objects and
different orders are to be counted.
 Use combinations if a problem calls for the number of ways of selecting objects and the
order of selection is not to be counted.
Example: Given the letters A, B, C, and D, list the permutations and combinations for selecting
two letters.
Solutions:
Permutations: AB, BA, CA, DA, AC, BC, CB, DB, AD, BD, CD, DC (12 in total)
Combinations: AB, AC, AD, BC, BD, CD (6 in total)
 Note that in permutations AB is different from BA, but in combinations AB is the same as
BA.
Examples:
1. In how many ways can a committee of 5 people be chosen out of 9 people?

Solutions:
n = 9, r = 5

9C5 = 9! / ((9 − 5)! × 5!) = 9! / (4! × 5!) = 126 ways

2. Among 15 clocks there are two defectives. In how many ways can an inspector choose three of
the clocks for inspection so that:
a) There is no restriction.
b) None of the defective clocks is included.
c) Only one of the defective clocks is included.
d) Two of the defective clocks are included.

Solutions:
n = 15, of which 2 are defectives and 13 are non- defectives
r=3
a. If there is no restriction select three clocks from 15 clocks and this can be
done in :
n = 15, r = 3
15C3 = 15! / (12! × 3!) = 455 ways
b. None of the defective clocks is included.
This is equivalent to zero defective and three non-defectives, which can be done
in:


2C0 × 13C3 = 286 ways


c. Only one of the defective clocks is included. This is equivalent to one defective and two
non-defective, which can be done in:

2C1 × 13C2 = 156 ways


d. Two of the defective clocks are included.
This is equivalent to two defectives and one non-defective, which can be done in:

2C2 × 13C1 = 13 different ways
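The four clock-inspection counts can be verified with math.comb (available in Python 3.8+); a minimal sketch:

from math import comb

print(comb(15, 3))               # a) no restriction: 455
print(comb(2, 0) * comb(13, 3))  # b) no defectives: 286
print(comb(2, 1) * comb(13, 2))  # c) exactly one defective: 156
print(comb(2, 2) * comb(13, 1))  # d) both defectives: 13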


Exercise:
1. In a class there are 27 boys and 14 girls. The teacher wants to select 1 boy and 1 girl to
represent the class at a function. In how many ways can the teacher make this selection?
2. Out of 5 mathematicians and 7 statisticians, a committee consisting of 2 mathematicians and
3 statisticians is to be formed. In how many ways can this be done if
a) There is no restriction
b) One particular statistician should be included
c) Two particular mathematicians cannot be included on the committee.
 Some important results
Let n and r be positive integers such that r ≤ n. Then:
(i) nCr = nC(n−r)
(ii) nCr + nC(r−1) = (n+1)Cr
(iii) n × (n−1)C(r−1) = (n − r + 1) × nC(r−1)

1.5. Conditional Probability and Independence


Before we get directly involved in what conditional probability is, let us first introduce what
conditional events mean. Conditional Events: If the occurrence of one event has an effect on the
occurrence of the other event, then the two events are conditional or dependent events.
Example: Suppose we have two red and three white balls in a bag
1. Draw a ball with replacement

Let A = the event that the first draw is red; P(A) = 2/5
B = the event that the second draw is red; P(B) = 2/5
A and B are independent.
2. Draw a ball without replacement
Let A = the event that the first draw is red; P(A) = 2/5
B = the event that the second draw is red; P(B) = ?
This is conditional.
Let B = the event that the second draw is red given that the first draw is red; then P(B|A) = 1/4.

Conditional probability of an event


One of the most important concepts in the theory of probability is based on the question: How do
we modify the probability of an event in light of the fact that something new is known? What is
the chance that we will win the game now that we have taken the first point? What is the chance
that I am a victim of a genetic disease now that my father has a genetic problem? What is the
chance that a child smokes if the household has two parents who smoke? This question leads us
to the concept of conditional probability.
Suppose we know that an event A already occurred. Then a natural question asked is ``what is
the effect of this on the probabilities of other events?'' This leads to the notion of conditional
probability.
Definition: Let (Ω, S, P) be a probability space and A ∈ S be such that P(A) > 0. The conditional
probability of an event B ∈ S given A, denoted by P(B|A), is defined as:

P(B|A) = P(A∩B) / P(A), P(A) ≠ 0 ………………………………1

Remark:
i. P(A′|B) = 1 − P(A|B)
ii. P(B′|A) = 1 − P(B|A)
Examples


1. For a student enrolling as a freshman at a certain university, the probability is 0.25 that he/she
will get a scholarship and 0.75 that he/she will graduate. If the probability is 0.20 that he/she
will get a scholarship and will also graduate, what is the probability that a student who got
a scholarship will graduate?

Solution: Let A= the event that a student will get a scholarship


B= the event that a student will graduate
Given, P (A) = 0.25, P (B) = 0.75, P( A ∩ B) = 0.20

Required, P (B/A) =?
P(B|A) = P(A∩B) / P(A) = 0.20 / 0.25 = 0.80

Note: for any two events A and B the following relation holds:
P(B) = P(B|A)·P(A) + P(B|A′)·P(A′)

Exercises:

1. If the probability that a research project will be well planned is 0.60 and the probability
that it will be well planned and well executed is 0.54, what is the probability that it will
be well executed given that it is well planned?
2. A lot consists of 20 defective and 80 non-defective items from which two items are
chosen without replacement. Events A & B are defined as A = {the first item chosen is
defective}, B = {the second item chosen is defective}
a. What is the probability that both items are defective?
b. What is the probability that the second item is defective?
3. 27 students out of a class of 43 are economists. 20 of the students are female, of whom 7
are economists. Find the probability that a randomly selected student is an economist
given that she is female.

The Multiplication Rule

Re-arranging the conditional probability formula gives:
P(A∩B) = P(A) · P(B|A) ………………………………………2
We will see that our use of the multiplication rule is very similar to our use of the rule of product
in counting. In fact, the multiplication rule is just a souped-up version of the rule of product.
It is often useful in computing the probability of the intersection of events.

Let’s start with a simple example where we can check all the probabilities directly by counting.

Example1. Draw two cards from a deck. Define the events: S1 = ‘first card is a spade’ and S2 =
‘second card is a spade’. What is P(S2|S1)?

Solution: We can do this directly by counting: if the first card is a spade then of the 51 cards
remaining, 12 are spades.
P (S2|S1) = 12/51.

Now, let’s recompute this using the conditional probability formula. We have to compute P(S1),
P(S2) and P(S1 ∩ S2). We know that P(S1) = 1/4 because there are 52 equally likely ways to
draw the first card and 13 of them are spades. The same logic says that there are 52 equally likely
ways the second card can be drawn, so P(S2) = 1/4.

The probability P(S2) = 1/4 may seem surprising since the value of the first card certainly affects the
probabilities for the second card. However, if we look at all possible two-card sequences we will
see that every card in the deck has equal probability of being the second card. Since 13 of the 52
cards are spades we get P(S2) = 13/52 = 1/4. Another way to say this is: if we are not given the
value of the first card then we have to consider all possibilities for the second card.

Continuing, we see that

P(S1 ∩ S2) = (13/52) · (12/51) = 3/51

This was found by counting the number of ways to draw a spade followed by a second spade and
dividing by the number of ways to draw any card followed by any other card. Now, using (1) we
get;

P(S2|S1) = P(S1 ∩ S2) / P(S1) = (3/51) / (1/4) = 12/51


Think: For S1 and S2 in the previous example, what is P(S2|S1′)?

Law of Total Probability

The law of total probability will allow us to use the multiplication rule to find probabilities in
more interesting examples. It involves a lot of notation, but the idea is fairly simple. We state the
law when the sample space is divided into 3 pieces. It is a simple matter to extend the rule when
there are more than 3 pieces.

Law of Total Probability

Suppose the sample space Ω is divided into 3 disjoint events B1, B2, B3 (see the figure below).
Then for any event A:

P (A) = P (A ∩ B1) + P (A ∩ B2) + P (A ∩ B3)

P (A) = P (A|B1) P (B1) + P (A|B2) P (B2) + P (A|B3) P (B3)………………………………3

The top equation says ‘if A is divided into 3 pieces then P (A) is the sum of the probabilities of
the pieces’. The bottom equation (3) is called the law of total probability. It is just a rewriting of
the top equation using the multiplication rule.

The sample space Ω and the event A are each divided into 3 disjoint pieces.

The law holds if we divide Ω into any number of events, so long as they are disjoint and cover all
of Ω. Such a division is often called a partition of Ω.

Example: An urn contains 5 red balls and 2 green balls. Two balls are drawn one after the other.
What is the probability that the second ball is red?

Solution: The sample space is Ω = {rr, rg, gr, gg}. Let R1 be the event ‘the first ball is red’, G1
= ‘first ball is green’, R2 = ‘second ball is red’, G2 = ‘second ball is green’. We are asked to find
P (R2).

The fast way to compute this is just like P (S2) in the card example above. Every ball is equally
likely to be the second ball. Since 5 out of 7 balls are red, P (R2) = 5/7.

Let’s compute this same value using the law of total probability (3). First, we’ll find the
conditional probabilities. This is a simple counting exercise.

P (R2|R1) = 4/6, P (R2|G1) = 5/6.

Since R1 and G1 partition Ω the law of total probability says:

P (R2) = P (R2|R1) P (R1) + P (R2|G1) P (G1)

= (4/6) · (5/7) + (5/6) · (2/7)

= 30/42 = 5/7
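A minimal sketch of this computation in Python, once exactly by the law of total probability and once approximately by simulating the urn (5 red, 2 green):

from fractions import Fraction
import random

# Law of total probability: P(R2) = P(R2|R1)P(R1) + P(R2|G1)P(G1)
p = Fraction(4, 6) * Fraction(5, 7) + Fraction(5, 6) * Fraction(2, 7)
print(p)  # 5/7

# Simulation check: shuffle the urn and look at the second ball.
trials, hits = 100_000, 0
for _ in range(trials):
    urn = ["r"] * 5 + ["g"] * 2
    random.shuffle(urn)
    if urn[1] == "r":
        hits += 1
print(hits / trials)  # close to 5/7 ~ 0.714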

Independence

Two events are independent if knowledge that one occurred does not change the probability that
the other occurred. Informally, events are independent if they do not influence one another.

Formal definition of independence: Two events A and B are independent if and only if,

P (A ∩ B) = P (A) · P (B) ………………………..4

This is a nice symmetric definition which makes clear that A is independent of B if and only if B
is independent of A. Unlike the equation with conditional probabilities, this definition makes
sense even when P (B) = 0. In terms of conditional probabilities, we have:
1. If P (B) ≠ 0 then A and B are independent if and only if P (A|B) = P (A).
2. If P (A) ≠ 0 then A and B are independent if and only if P (B|A) = P (B).

Independent events commonly arise as different trials in an experiment, as in the following


example.

Example1. Toss a fair coin twice. Let H1 = ‘heads on first toss’ and let H2 = ‘heads on second
toss’. Are H1 and H2 independent?

Answer: Since H1 ∩ H2 is the event ‘both tosses are heads’ we have

P (H1 ∩ H2) =1/4= P (H1) P (H2).

Therefore the events are independent.

Example2. Toss a fair coin 3 times. Let H1 = ‘heads on first toss’ and A = ‘two heads total’. Are
H1 and A independent?

Answer: We know that P(A) = 3/8. Since this is not 0 we can check whether P(A|H1) = P(A) holds.
Now, H1 = {HHH, HHT, HTH, HTT} contains exactly two outcomes (HHT, HTH) from A, so
we have P(A|H1) = 2/4. Since P(A|H1) ≠ P(A), these events are not independent.

Example3. Draw one card from a standard deck of playing cards. Let’s examine the
independence of 3 events ‘the card is an ace’, ‘the card is a heart’ and ‘the card is red’.

Define the events as A = ‘ace’, H = ‘hearts’, R = ‘red’.

(a) We know that P(A) = 4/52 = 1/13 and P(A|H) = 1/13. Since P(A) = P(A|H), we have that A is
independent of H.

(b) P (A|R) = 2/26 = 1/13. So A is independent of R. That is, whether the card is an ace is
independent of whether it’s red.

(c) Finally, what about H and R? Since P (H) = 1/4 and P (H|R) = 1/2, H and R are not
independent. We could also see this the other way around: P (R) = 1/2 and P (R|H) = 1, so H and
R are not independent.

Exercise:
1. A box contains four black and six white balls. What is the probability of getting two black
balls in drawing one after the other under the following conditions?
a. The first ball drawn is not replaced
b. The first ball drawn is replaced

1.6. Bayes' Theorem

Bayes’ theorem is a pillar of both probability and statistics and it is central to the rest of this
course. For two events A and B Bayes' theorem (also called Bayes' rule and Bayes' formula)
says;
P(B|A) = P(A|B) · P(B) / P(A) …………………………………….. 5
 Bayes’ rule tells us how to invert conditional probabilities, i.e. to find P (B/A) from P( A /B).
 In practice, P (A) is often computed using the law of total probability.

Proof of Bayes’ rule

The key point is that A∩B is symmetric in A and B, so the multiplication rule says

P(B|A) · P(A) = P(A∩B) = P(A|B) · P(B)

Now divide both sides by P(A) to get Bayes’ formula.

Exercise:
1. Suppose 99% of people with HIV test positive, 95% of people without HIV test negative,
and 0.1% of people have HIV.
What is the chance that someone testing positive has HIV?
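Try the exercise by hand first. As a check of your answer, here is a sketch of the computation in Python, combining the law of total probability with Bayes' rule (the numbers are those stated in the exercise):

p_hiv = 0.001            # 0.1% of people have HIV
p_pos_given_hiv = 0.99   # 99% of people with HIV test positive
p_neg_given_no = 0.95    # 95% of people without HIV test negative

# Law of total probability for P(positive):
p_pos = p_pos_given_hiv * p_hiv + (1 - p_neg_given_no) * (1 - p_hiv)

# Bayes' rule: P(HIV | positive)
print(p_pos_given_hiv * p_hiv / p_pos)  # roughly 0.019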


Chapter two
2. Random variables and probability distributions

Introduction
Random variables are important concepts in dealing with probability and probability
distributions. Generally, random variables are of two types; this chapter is therefore dedicated to
an overview of random variables and related terms.

Chapter objectives:
After successfully completing this chapter, students are expected to be able to:
 Define the terms probability distribution and random variables.
 Distinguish between a discrete and continuous probability distribution.
 Calculate the mean, variance, and standard deviation of discrete and continuous
probability distributions.
 Define moments and moment generating functions.

Definition: A random variable is a numerical description of the outcomes of an experiment, or
a numerical-valued function defined on a sample space, usually denoted by a capital letter.
Alternatively, a random variable is a function that associates a real number with each element in
the sample space.

If X is a random variable, then it is a function from the elements of the sample space to the set
of real numbers. i.e. X is a function X: S → R
 A random variable takes a possible outcome and assigns a number to it.
NOTE. By convention, we use a capital letter, say X, to denote a random variable, and use the
corresponding lower-case letter x to denote the realization (values) of the random variable.
Example: Flip a coin three times, let X be the number of heads in three tosses.
⇒ S = {(HHH), (HHT), (HTH), (HTT), (THH), (THT), (TTH), (TTT)}

⇒ X (HHH) = 3,

X (HHT) = X (HTH) = X (THH) = 2,

X (HTT) = X (TTH) = X (THT) = 1,

X (TTT) = 0

X = {0, 1, 2, 3}

⇒ X assumes a specific number of values with some probabilities.

2.1. Types of random variables


Random variables are of two types;
1. Discrete Random Variable: If a sample space contains a finite number of possibilities or
an unending sequence with as many elements as there are whole numbers (countable), it is
called a discrete sample space. A random variable is called a discrete random variable if
its set of possible outcomes is countable. They are variables which can assume only a
specific number of values. They have values that can be counted.

Examples:
 Toss coin n times and count the number of heads.
 Number of children in a family.
 Number of car accidents per week.
 Number of defective items in a given company.
 Number of bacteria per two cubic centimeter of water.

2. Continuous Random Variable: If a sample space contains an infinite number of
possibilities equal to the number of points on a line segment, it is called a continuous
sample space. When a random variable can take on values on a continuous scale, it is
called a continuous random variable. They are variables that can assume all values
between any two given values.
Examples:
 Height of students at certain college.
 Mark of a student.
 Life time of light bulbs.
 Length of time required to complete a given training.

2.2. Probability distribution


Definition: a probability distribution consists of the values a random variable can assume and the
corresponding probabilities of those values. It may be specified in the form of a graph, a table, or a
function giving the probability associated with each value of the random variable.
2.2.1. Discrete Random Variables and their Probability Distributions
The probability distribution of a discrete random variable X lists the values and their
probabilities.
Example: Consider the experiment of tossing a coin three times. Let X be the number of heads.
Construct the probability distribution of X.
Solution:
 First identify the possible value that X can assume.
 Calculate the probability of each possible distinct value of X and express X in the form of
frequency distribution.
The 8 possible elementary events, and the corresponding values for X, are:
Elementary event Value of X
TTT 0
TTH 1
THT 1
HTT 1

THH 2
HTH 2
HHT 2
HHH 3

 Therefore, the probability distribution for the number of heads occurring in three coin
tosses is:

X=x 0 1 2 3
P (X = x) 1/8 3/8 3/8 1/8
F (x) 1/8 1/2 7/8 1

 Graphically, we might depict this as a bar graph of P(X = x) against x.
From the above, any number of other questions can be answered. For example, P(1 ≤ X ≤ 3) (i.e.
the probability that you will get at least one head) = P(1) + P(2) + P(3) = 3/8 + 3/8 + 1/8 = 7/8.
Or, if you prefer, you can use the complement rule and note that P(at least 1 head) = 1 − P(no
heads) = 1 − 1/8 = 7/8.
Function rules: Sometimes it is easier to specify the distribution of a discrete random variable
by its rule, rather than by a simple listing or graph. A rule for our coin-tossing experiment would
be:
P(x) = 1/8 if x = 0
       3/8 if x = 1, 2
       1/8 if x = 3
       0 otherwise
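The same distribution can be written down as a plain dictionary; the Python sketch below (assuming a fair coin, as above) checks that the probabilities sum to 1 and recovers P(at least one head) and the CDF:

from fractions import Fraction

pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

assert sum(pmf.values()) == 1                    # probabilities sum to 1
print(sum(p for x, p in pmf.items() if x >= 1))  # P(at least one head) = 7/8

# Cumulative distribution F(x) = P(X <= x):
F = {x: sum(pmf[k] for k in pmf if k <= x) for x in pmf}
print(F)   # {0: 1/8, 1: 1/2, 2: 7/8, 3: 1}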

Probability Mass Function (PMF): A probability distribution involving only discrete values of
X, denoted by P. Graphically, it is illustrated by a graph in which the x-axis shows the
different possible values of X and the y-axis the corresponding values of P(X = x).

Properties of Probability Distribution of discrete random variables:


The set of ordered pairs (x, p (x)) is a probability function, probability mass function, or
probability distribution of the discrete random variable X if, for each possible outcome x,

1. P(x) ≥ 0

2. ∑_x P(X = x) = 1

3. If X is a discrete random variable, then:

P(a < X < b) = ∑ P(x), summing over x = a+1, …, b−1
P(a ≤ X < b) = ∑ P(x), summing over x = a, …, b−1
P(a < X ≤ b) = ∑ P(x), summing over x = a+1, …, b
P(a ≤ X ≤ b) = ∑ P(x), summing over x = a, …, b

Exercise:
1. Three men and three women apply for an executive position in a small company. Two of
the applicants are selected for interview. Let X denote the number of women in the
interview pool.
a. Find the PMF of X, assuming that the selection is done randomly. Plot it.
b. What is the probability that at least one woman is included in the interview pool?

2. Let the PMF of X be p(x) = x/10 for x ∈ {1, 2, 3, 4}.
Construct its PMF and CDF
a. Using a table
b. Using a graph
c. Compute
 P(1 < X < 4)
 P(2 ≤ X < 4)
 P(1 ≤ X ≤ 4)
Cumulative Distribution Function: The probability that a discrete random variable X takes on
a value less than or equal to some particular value a is often written as:

F(a) = P(X ≤ a) = ∑_{x ≤ a} P(x)

For a discrete random variable X, F(x) satisfies the following properties:

 F(x) = P(X ≤ x) = ∑_{xi ≤ x} f(xi)
 0 ≤ F(x) ≤ 1
 If x ≤ y, then F(x) ≤ F(y)

2.2.2. Continuous random variables and their probability density functions


Suppose that we could measure height to any degree of precision, regardless of how many
decimal points that would require. Height would then be measured as a continuous variable.
Continuous variables are treated somewhat differently than discrete variables. But fortunately,
most probability theory is basically the same for discrete and continuous variables.
In general, for continuous random variables, the occurrence of any exact value of X may be
regarded as having zero probability. For this reason, one does not usually discuss the probability
per se for a value of a continuous random variable. Instead of the probability that X takes on
some value a, we deal with the so-called probability density of X at a, symbolized by;

f (a) = probability density of X at a


However, intervals of values can always be assigned probabilities. The probability of any
continuous interval is given by:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx = area under f(x) from a to b

Probability Density Function (PDF):

The function denoted by f(x) is a probability density function (pdf) for the continuous random
variable X, defined over the set of real numbers, if

i. f(x) ≥ 0 for all x ∈ R
ii. ∫_{−∞}^{∞} f(x) dx = 1
iii. P(a < X < b) = ∫_a^b f(x) dx
iv. The probability of a fixed value of a continuous random variable is zero:

⇒ P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b)

Cumulative Distribution Function (CDF) of continuous random variables

The cumulative distribution function F(x) of a continuous random variable X with probability
density function f(x) is

F(a) = P(X ≤ a) = ∫_{−∞}^{a} f(x) dx, for −∞ < a < ∞ = area under f(x) up to X = a

NOTE. A random variable is continuous if and only if its CDF is an everywhere continuous
function.

2.3. Introduction to expectation
The expected value of a random variable X is denoted by E(X). The expected value can be thought
of as the ‘average’ value attained by the random variable; in fact, the expected value of a random
variable is also called its mean, in which case we use the notation μX.

Let X be a random variable assuming the values x1, x2, x3, … with corresponding probabilities
p(x1), p(x2), p(x3), …. The mean or expected value of X is defined by E(X) = ∑_k xk p(xk).
Expectations can be used to describe the potential gains and losses from games.
1. Let a discrete random variable X assume the values x1, x2, …, xn with the probabilities
P(x1), P(x2), …, P(xn) respectively. Then the expected value of X, denoted E(X), is
defined as:

E(X) = x1 P(x1) + x2 P(x2) + … + xn P(xn) = ∑_{i=1}^{n} xi P(xi)

2. Let X be a continuous random variable assuming values in the interval (a, b) such that
∫_a^b f(x) dx = 1; then

E(X) = ∫_a^b x f(x) dx

2.4. Variance of a random variable

We often seek to summarize the essential properties of a random variable in as simple terms as
possible.
Definition: If X is a random variable with mean E(X), then the variance of X, denoted Var(X),
is defined by:

Var(X) = E[(X − E(X))²] = E(X²) − [E(X)]²

where E(X²) = ∑_{i=1}^{n} xi² P(X = xi), if X is discrete,
and E(X²) = ∫ x² f(x) dx, if X is continuous.

A small variance indicates a small spread.


There are some basic general rules for mathematical expectation.
Let X and Y be random variables and k be a constant.
 RULE 1
E (k) = k

 RULE 2
Var (k) = 0
 RULE 3
E (k X) = k E (X)
 RULE 4
Var(kX) = k²Var(X)
 RULE 5
E (X +Y) = E (X) + E(Y)
E (a X + b) = a E(X) + b
Var(aX + b) = a²Var(X)
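These rules are easy to check numerically. Below is a minimal Python sketch, reusing the coin-tossing PMF from Chapter 2 (X = number of heads in three tosses of a fair coin) and arbitrary constants a and b:

pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

def E(g):
    # Expectation of g(X) for a discrete PMF.
    return sum(g(x) * p for x, p in pmf.items())

mean = E(lambda x: x)                 # E(X) = 1.5
var = E(lambda x: x**2) - mean**2     # Var(X) = E(X^2) - [E(X)]^2 = 0.75

a, b = 3.0, 4.0
print(E(lambda x: a * x + b), a * mean + b)   # RULE: E(aX + b) = aE(X) + b
print(E(lambda x: (a * x + b - (a * mean + b))**2), a**2 * var)  # Var(aX + b) = a^2 Var(X)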
Example: if E(X) = 1 and Var(X) = 5, find
a. E[(2 + X)²]

b. Var(4 + 3X)
Solution: Exercise

2.5. Moments and Moment Generating Functions

The moment generating function of the random variable X, denoted MX(t), is defined for all real
values of t by:

MX(t) = E(e^(tX)) = ∑_x e^(tx) P(x), if X is discrete with pmf P(x)
MX(t) = E(e^(tX)) = ∫_{−∞}^{∞} e^(tx) f(x) dx, if X is continuous with pdf f(x)

The reason MX(t) is called a moment generating function is that all the moments of
X can be obtained by successively differentiating MX(t) and evaluating the result at t = 0.

The “moment generating function” gives us a nice way of collecting together all the moments of
a random variable X into a single power series (i.e. Maclaurin series) in the variable t. It is
defined to be;

E(e^(Xt)) = ∑_{k=0}^{∞} E(X^k) t^k / k!

This is a clever way of organizing all the moments into one mathematical object, and that object
is called the moment generating function. It is a function M(t) of a new variable t defined by
M(t) = E(e^(tX))
Since the exponential function e^t has the power series

e^t = ∑_{k=0}^{∞} t^k/k! = 1 + t + t²/2! + … + t^k/k! + …

We can rewrite M(t) as follows:

M(t) = E(e^(tX))
= E(1 + tX + (tX)²/2! + … + (tX)^k/k! + …)
= 1 + t E(X) + t² E(X²)/2! + … + t^k E(X^k)/k! + …
= 1 + μ1 t + μ2 t²/2! + … + μk t^k/k! + …
Theorem: The kth derivative of M(t) evaluated at t = 0 is the kth moment μk of X.

In other words, the moment generating function generates the moments of X by differentiation.
 First moment:
(d/dt) MX(t) = (d/dt) E(e^(tX)) = E((d/dt) e^(tX)) = E(X e^(tX))
M′X(0) = E(X)
(For any of the distributions we will use, we can move the derivative inside the expectation.)
 Second moment:
M″X(t) = (d/dt) M′X(t) = (d/dt) E(X e^(tX)) = E(X² e^(tX))
M″X(0) = E(X²)
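As an illustration of generating moments by differentiation, the sketch below uses sympy to differentiate the MGF of a Bernoulli(θ) variable, M(t) = (1 − θ) + θe^t (the Bernoulli example is an assumption chosen here for its simplicity):

import sympy as sp

t, theta = sp.symbols("t theta")
M = (1 - theta) + theta * sp.exp(t)    # MGF of Bernoulli(theta)

first = sp.diff(M, t).subs(t, 0)       # E(X) = theta
second = sp.diff(M, t, 2).subs(t, 0)   # E(X^2) = theta
print(first, second)
print(sp.expand(second - first**2))    # variance: theta - theta**2 = theta*(1 - theta)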


Properties of moment generating functions


 Translation: If Y = X + a, then
MY(t) = e^(at) MX(t)

Proof: MY(t) = E(e^(Yt)) = E(e^((X+a)t)) = E(e^(Xt) e^(at)) = e^(at) E(e^(Xt)) = e^(at) MX(t).

 Scaling: If Y = bX, then
MY(t) = MX(bt)

Proof: MY(t) = E(e^(Yt)) = E(e^((bX)t)) = E(e^(X(bt))) = MX(bt).


 Standardizing: From the last two properties, if
X* = (X − μ)/σ
is the standardized random variable for X, then
MX*(t) = e^(−μt/σ) MX(t/σ)

Proof: first translate by −μ to get

M(X−μ)(t) = e^(−μt) MX(t)

then scale that by a factor of 1/σ to get

MX*(t) = M(X−μ)(t/σ) = e^(−μt/σ) MX(t/σ)

 Convolution: If X and Y are independent variables, and Z = X + Y, then

MZ(t) = MX(t) MY(t)

Proof: MZ(t) = E(e^(Zt)) = E(e^((X+Y)t)) = E(e^(Xt) e^(Yt)).

Now, since X and Y are independent, so are e^(Xt) and e^(Yt).

Therefore, E(e^(Xt) e^(Yt)) = E(e^(Xt)) E(e^(Yt)) = MX(t) MY(t).


Chapter Three

3. Special probability distributions and densities

Introduction
Having dealt with probability theory and random variables, we now come to the
heart of the matter. The previous chapters gave a bird’s-eye view of probability
theory and random variables. On top of this, we also briefly defined discrete and
continuous random variables and their probability distributions, namely the probability mass
functions and probability density functions of the random variables, respectively. In this chapter
we deal with some important probability distributions, both discrete and continuous, in a more
detailed manner.

Learning Objectives:
After successfully completing this chapter, students are expected to:
 Describe the characteristics and compute probabilities using the Bernoulli and binomial
probability distributions.

 Understand the characteristics and solve probabilities using the Poisson and
hypergeometric probability distributions.
 Distinguish the characteristics and compute probabilities using the uniform, normal, beta
and gamma distributions.
 Describe how to approximate different probability distributions and the conditions
necessary for approximating probability distributions.
 Prove the expected value and variance of probability distributions, both discrete and
continuous.

3.1 Common Discrete Probability Distributions


3.1.1 Bernoulli distributions
A Bernoulli random variable is the simplest kind of random variable. It can take on two values,
1 and 0. It takes the value 1 if an experiment resulted in success, which happens with probability p, and 0 otherwise.
Some examples include:
 Coin flips (equally likely if the coin is fair)
 Whether a disk drive crashed or not
 Whether someone likes a romance movie or not.
 Success of medical treatment
 Interviewed person is female or male
 Student passes exam
 Transmittance of a disease etc…
Definition: A random variable X is said to be a Bernoulli random variable with parameter θ if it
takes values in the set {0, 1} with probability mass function

P(x) = θ^x (1 − θ)^(1−x), for x ∈ {0, 1}
P(x) = 0, otherwise

P(0) = P(X = 0) = 1 − θ
P(1) = P(X = 1) = θ
We write concisely:
X ~ Bernoulli(θ)
We often code experiments which can either result in failure or success using a Bernoulli random
variable, assigning the value 1 if there is a success and 0 if there is a failure. Therefore, the probability
of success is regarded as θ and the probability of failure as 1 − θ.

3.1.2 Binomial distribution

The Binomial Distribution represents the number of successes and failures in n independent
Bernoulli trials for some given value of n. A binomial experiment is a probability experiment that
satisfies the following four requirements called assumptions of a binomial distribution.
1. The experiment consists of n identical trials.
2. Each trial has only one of the two possible mutually exclusive outcomes, success or a
failure.
3. The probability of each outcome does not change from trial to trial, and
4. The trials are independent, thus we must sample with replacement.
Examples of binomial experiments
Tossing a coin 20 times to see how many tails occur.
Asking 200 people if they watch BBC news.
Registering a newly produced product as defective or non-defective.
Asking 100 people if they favor the ruling party.
Rolling a die to see if a 5 appears etc…


The random variable X that counts the number of successes, x, in the n trials is said to follow a
binomial distribution with parameters n and θ, written bin(x; n, θ), with probability mass function:

P(X = x) = nCx θ^x (1 − θ)^(n−x), where x = 0, 1, 2, …, n (i.e. 0 ≤ x ≤ n)
P(X = x) = 0, otherwise

If X is a binomial random variable, we sometimes denote this

X ~ bin(n, θ), where n and θ are the parameters of the distribution.
In using the binomial formula to solve problems, we have to identify three things:
 The number of trials (n )
 The probability of a success on any one trial (θ ) and
 The number of successes desired (x).

Examples:

1. A biased coin is tossed 6 times. The probability of heads on any toss is 0.3. Let X denotes
the number of heads that come up.
Calculate:
i. P (X = 2)
ii. P (X = 3)
iii. P (1 < X ≤ 5).
Solution:
If we call heads a success, then X has a binomial distribution with parameters n = 6 and θ = 0.3.

i. P(X = 2) = 6C2 (0.3)² (0.7)⁴ = 0.324135

ii. Do it yourself!
iii. P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5)
= 0.324 + 0.185 + 0.059 + 0.01
= 0.578
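Both parts can be checked with a short Python sketch of the binomial PMF (math.comb is in the standard library from Python 3.8):

from math import comb

def binom_pmf(x, n, theta):
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

n, theta = 6, 0.3
print(binom_pmf(2, n, theta))                              # i.   0.324135
print(sum(binom_pmf(x, n, theta) for x in (2, 3, 4, 5)))   # iii. ~0.578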


2. What is the probability of getting three heads by tossing a fair coin four times?
Solution:
Let X be the number of heads in tossing a fair coin four times.
X ~ Bin(n = 4, θ = 0.50)

→ P(X = x) = 4Cx θ^x (1 − θ)^(4−x), x = 0, 1, 2, 3, 4
= 4Cx 0.5^x (0.5)^(4−x)
= 4Cx 0.5⁴

P(X = 3) = 4C3 × 0.5⁴ = 0.25

3. Let X = the number of heads after a coin is flipped three times; X ~ Bin(3, θ = 0.5).
What is the probability of each of the different values of X?

P(X = 0) = 3C0 θ⁰ (1 − θ)³ = 1/8
P(X = 1) = 3C1 θ¹ (1 − θ)² = 3/8
P(X = 2) = 3C2 θ² (1 − θ)¹ = 3/8
P(X = 3) = 3C3 θ³ (1 − θ)⁰ = 1/8
Remark: If X is a binomial random variable with parameters n and θ, then

E(X) = nθ, Var(X) = nθ(1 − θ)

3.1.3 The Poisson distribution

Certain events in nature are said to occur spontaneously, i.e. they occur at random times,
independently of each other, but with a certain intensity λ. The intensity is the average number of
spontaneous events per time interval. The number of spontaneous events X in any given concrete
time interval is then Poisson distributed, and we write X ~ Pois(λ). X is a discrete
random variable and takes values in the set {0, 1, 2, 3, …}.
Often we are interested in the number of events which occur in a specific period of time or in a
specific area of volume: The Poisson distribution is used as a distribution of rare events, such as:
 Number of alpha particles emitted from a radioactive source during a given period of
time.
 Number of telephone calls coming into an exchange during one unit of time.
 Number of diseased trees per acre of certain woodland.
 Number of death claims received per day by an insurance company.
 Number of misprints.
 Natural disasters like;
 Accidents.
 Hereditary.
 Arrivals.
 Earthquake.
The process that gives rise to such events is called Poisson process.

A random variable X is said to have a Poisson distribution if its probability distribution, which is
its probability mass function, is given by:

P(X = x) = λ^x e^(−λ) / x!, x = 0, 1, 2, …

where λ = the average number of occurrences per unit of time or space.
The Poisson distribution depends only on this average number. It has one parameter, λ > 0.
Notation: X ~ Poisson(λ)

Examples:
1. If 1.6 accidents can be expected in an intersection on any given day, what is the
probability that there will be 3 accidents on any given day?
Solution:
Let X = the number of accidents; λ = 1.6
X ~ Poisson(1.6) ⇒ P(X = x) = 1.6^x e^(−1.6) / x!
P(X = 3) = 1.6³ e^(−1.6) / 3! = 0.1380
2. Operators of toll roads and bridges need information for staffing tollbooths so as to
minimize queues (waiting lines) without using too many operators. Assume that in a
specified time period the number of cars per minute approaching a tollbooth has a mean
of 10. Traffic engineers are interested in the probability that exactly 11 cars approach the
tollbooth in the minute from 12 noon to 12:01.
P(11) = 10¹¹ e^(−10) / 11! = 0.114
Thus, there is about an 11% chance that exactly 11 cars would approach the tollbooth the first
minute after noon.
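Both Poisson examples can be reproduced with a few lines of Python; a minimal sketch:

from math import exp, factorial

def poisson_pmf(x, lam):
    return lam**x * exp(-lam) / factorial(x)

print(poisson_pmf(3, 1.6))    # accident example: ~0.1380
print(poisson_pmf(11, 10))    # tollbooth example: ~0.114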

NB. If X is a Poisson random variable with parameter λ, then

E(X) = λ, Var(X) = λ

Note: E(X) = Var(X).

 The Poisson probability distribution provides a close approximation to the binomial
probability distribution when n is large and θ is quite small, with λ = nθ:

P(x) = nCx θ^x (1 − θ)^(n−x) ≈ λ^x e^(−λ) / x!

with λ = nθ = the average number
Usually we use this approximation if nθ ≤ 5. In other words, if n > 20 and nθ ≤ 5 or
n(1 − θ) ≤ 5, then we may use the Poisson distribution as an approximation to the binomial distribution.

Example:
1. Find the binomial probability P(X = 3) by using the Poisson distribution if θ = 0.01 and
n = 200.
Solution: using the Poisson, λ = nθ = 0.01 × 200 = 2
⇒ P(X = 3) = 2³ e^(−2) / 3! = 0.1804
Using the binomial, n = 200, θ = 0.01:
⇒ P(X = 3) = 200C3 × 0.01³ × 0.99¹⁹⁷ = 0.1814
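A sketch comparing the two numbers directly (exact binomial versus its Poisson approximation for n = 200, θ = 0.01, so λ = 2):

from math import comb, exp, factorial

n, theta = 200, 0.01
lam = n * theta

exact = comb(n, 3) * theta**3 * (1 - theta)**(n - 3)
approx = lam**3 * exp(-lam) / factorial(3)
print(exact, approx)   # ~0.1814 vs ~0.1804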

3.1.4 Hypergeometric distribution

Description

Stat foreconomists Page 37


Statistics for economists
In an urn we have m red balls and L black balls, in total N = m + L balls. We now pick n
balls from the urn, randomly and without returning the chosen balls to the urn; necessarily
n ≤ N. The number of red balls X among the balls chosen is then hypergeometrically
distributed, and we write X ~ HG(n, m, L, N). X is a discrete random variable with
values in the set {0, 1, 2, …, min{n, m}}.

The hypergeometric distribution arises when sampling is performed from a finite population
without replacement, thus making trials dependent on each other.
The assumptions leading to the hypergeometric distribution are as follows:
1. The population or set to be sampled consists of N individuals, objects, or elements (a
finite population).
2. Each individual can be characterized as a success (S) or a failure (F), and there are m
successes in the population.
3. A sample of n individuals is selected without replacement in such a way that each subset
of size n is equally likely to be chosen.
If X is the number of successes in a completely random sample of size n drawn from a
population consisting of m successes and (N − m) failures, then the probability distribution of
X, called the hypergeometric distribution, is given by:

P(X = x) = (mCx × LC(n−x)) / NCn, where

m = number of elements called successes
L = N − m = number of elements called failures
A sample of n elements is selected at random without replacement.
x = number of successes
Therefore, X is said to follow a hypergeometric distribution, denoted
X ~ HG(n, m, L, N).
Example:
1. Draw 6 cards from a deck without replacement.
What is the probability of getting two hearts?
m = 13 (number of hearts), L = 39 (number of non-hearts)
N = 52 (total), x = 2, n = 6

P(2 hearts) = (13C2 × 39C4) / 52C6 = 0.31513
2. What is the probability of getting at most 2 diamonds in 5 cards selected without
replacement from a well-shuffled deck?
P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2)
= (13C0 × 39C5)/52C5 + (13C1 × 39C4)/52C5 + (13C2 × 39C3)/52C5
= 0.222 + 0.411 + 0.274
= 0.907
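Both card probabilities can be verified with a small Python sketch of the hypergeometric PMF:

from math import comb

def hypergeom_pmf(x, n, m, N):
    # x successes in a sample of n, drawn from m successes among N items.
    return comb(m, x) * comb(N - m, n - x) / comb(N, n)

print(hypergeom_pmf(2, 6, 13, 52))                          # two hearts: ~0.315
print(sum(hypergeom_pmf(x, 5, 13, 52) for x in range(3)))   # at most 2 diamonds: ~0.907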

The mean and variance of the hypergeometric random variable X having pmf h(x; n, m, L, N)
are given below. The ratio m/N is the proportion of successes in the population. If we replace m/N
by θ in E(X) and Var(X), we get:

E(X) = nθ
Var(X) = [(N − n)/(N − 1)] × nθ(1 − θ)

Note: for any values of the parameters, the mean of X is the same, whether the sampling is with
or without replacement. On the other hand, the variance of X is smaller, by a factor of (N − n)/(N − 1),

when the sampling is without replacement than with replacement. Think about these results. The
factor (N − n)/(N − 1) is sometimes called the finite population correction factor.
This factor is less than 1, so the hypergeometric variable has smaller variance than does the
binomial random variable. The correction factor written above is approximately 1 when n is
small relative to N.

Binomial Distribution as a Limit of Hypergeometric Distributions

Suppose that the population size N is very large compared to the sample size n. In this case, it
seems reasonable that sampling without replacement is not too different from sampling
with replacement, and hence the hypergeometric distribution should be well approximated by
the binomial. As N → ∞, the hypergeometric distribution converges to the binomial.
Population size = N
Proportion of successes = θ
Number of successes in the population = Nθ
Number of failures = N(1 − θ)
Let X = the number of successes in a sample of size n drawn without replacement from the N
items (Nθ successes, N(1 − θ) failures).

Then,

P(X = x) = (NθCx × N(1−θ)C(n−x)) / NCn → nCx θ^x (1 − θ)^(n−x) as N → ∞
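The convergence can be seen numerically: fix n and θ, grow N, and compare. A minimal sketch (the particular values n = 5, θ = 0.2, x = 1 are arbitrary):

from math import comb

def hypergeom_pmf(x, n, m, N):
    return comb(m, x) * comb(N - m, n - x) / comb(N, n)

n, theta, x = 5, 0.2, 1
print(comb(n, x) * theta**x * (1 - theta)**(n - x))   # binomial limit: 0.4096
for N in (50, 500, 5000):
    m = int(N * theta)                                # successes in the population
    print(N, hypergeom_pmf(x, n, m, N))               # approaches 0.4096 as N grows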

3.2 Common Continuous Probability Distributions

So far we have dealt with random variables with a finite number of possible values. For example,
if X is the number of heads that will appear when you flip a coin 5 times, X can only take the
values 0, 1, 2, 3, 4, or 5.
Some variables can take a continuous range of values, for example a variable such as the height
of a human being, weight of human beings, the lifetime of an electronic component etc... For a
continuous random variable X, the analogue of a histogram is a continuous curve (the probability
density function) and it is our primary tool in finding probabilities related to the variable. As
with the histogram for a random variable with a finite number of values, the total area under the
curve equals 1. Here are some of the important continuous distributions.

3.2.1 Uniform Distribution

The uniform distribution is the simplest example of a continuous probability distribution. A
continuous random variable X is said to be uniformly distributed between θ1 and θ2, with θ1 < θ2,
if its probability density function is given by:

f(x) = 1/(θ2 − θ1), for −∞ < θ1 ≤ x ≤ θ2 < ∞ …………………..1
f(x) = 0, otherwise

X ~ U(θ1, θ2), for short

θ1 and θ2 are the parameters of the distribution.
It is easy to see that this is a valid PDF (because f(x) ≥ 0 and ∫ f(x) dx = 1).
We can also write this distribution with the alternative notation:
x | θ1, θ2 ∼ U(θ1, θ2) ………………………….2
The first and second equations are equivalent. The latter simply says: x is distributed uniformly
in the range θ1 to θ2, and it is impossible for x to lie outside of that range.
Intuitively, this distribution states that all values within the given range [θ1, θ2] are equally likely.

Visually, the density is a rectangle of height 1/(θ2 − θ1) over the interval [θ1, θ2], where the
rectangular region has area (θ2 − θ1) × [1/(θ2 − θ1)] = 1 (width times height).

Occurrence of the Uniform distribution


1. Waiting times from a random arrival time until a regular event.
2. Engineering tolerances: e.g. if a diameter is quoted “±0.1 mm”, it is sometimes assumed
(probably incorrectly) that the error has a U(−0.1, 0.1) distribution.
3. Simulation: programming languages often have a standard routine for simulating the
U(0, 1) distribution. This can be used to simulate other probability distributions, etc.

Mean and variance of uniform distribution


If θ1 and θ2 are the parameters of the uniform distribution, then:

E(X) = (θ1 + θ2)/2 and Var(X) = σ² = (θ2 − θ1)²/12
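These formulas are easy to check against a simulation; a minimal Python sketch with the (arbitrary) parameters θ1 = 2, θ2 = 10:

import random

theta1, theta2 = 2.0, 10.0
print((theta1 + theta2) / 2)          # E(X) = 6.0
print((theta2 - theta1) ** 2 / 12)    # Var(X) = 64/12 ~ 5.33

sample = [random.uniform(theta1, theta2) for _ in range(100_000)]
m = sum(sample) / len(sample)
v = sum((x - m) ** 2 for x in sample) / len(sample)
print(m, v)                           # close to 6.0 and 5.33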


3.2.2 Normal Distribution

The normal distribution is the most important distribution in statistics, since it arises naturally in
numerous applications. The key reason is that large sums of (small) random variables often turn
out to be normally distributed.
A random variable X is said to have the normal distribution with parameters μ and σ if its
probability density function is given by:

f(x) = (1/(σ√(2π))) exp{−(x − μ)²/(2σ²)}, −∞ < x < ∞, −∞ < μ < ∞, σ > 0

X ~ N(μ, σ²), where μ = E(X) and σ² = Var(X)

μ and σ² are the parameters of the distribution.

Properties of Normal Distribution:


1. It is bell shaped, symmetrical about its mean, and mesokurtic. The maximum
ordinate is at x = μ and is given by:
f(x) = 1/(σ√(2π))
2. It is asymptotic to the horizontal axis, i.e., it extends indefinitely in either direction from the mean.
3. It is a continuous distribution.
4. It is a family of curves, i.e., every unique pair of mean and standard deviation defines a
different normal distribution. Thus, the normal distribution is completely described by
two parameters: mean and standard deviation. For the same mean μ, a smaller value of σ
gives a taller and narrower curve, whereas a larger value of σ gives a flatter curve; i.e.,
the standard deviation σ controls the “flatness” of the curve.

Increasing the mean shifts the density curve to the right.



Increasing the standard deviation flattens the density curve

5. Total area under the curve sums to 1, i.e., the area under the curve to the right of the
mean is 0.5, and the area under the curve to the left of the mean is 0.5:

∫_{−∞}^{∞} f(x) dx = 1

6. It is unimodal, i.e., values mound up only in the center of the curve.


7. Mean = Median = Mode = μ

8. The probability that a random variable will have a value between any two points is equal
to the area under the curve between those points.
9. The empirical rule (68%, 95%, and 99.7%) for mound-shaped data applies to variables
with normal distributions, i.e.:
 approximately 68% of the measurements will fall within one standard deviation
of the mean, i.e. within the interval (μ − σ, μ + σ)
 approximately 95% of the measurements will fall within 2 standard deviations of
the mean, i.e. within the interval (μ − 2σ, μ + 2σ)
 approximately 99.7% or almost all of the observations will fall within 3 standard
deviations of the mean, i.e. within the interval (μ − 3σ, μ + 3σ)

Why is the normal distribution important?


 Many things actually are normally distributed, or very close to it. For
example, height and intelligence are approximately normally distributed;
measurement errors also often have a normal distribution.
 The normal distribution is easy to work with mathematically. In many practical
cases, the methods developed using normal theory work quite well even when the
distribution is not normal.
 There is a very strong connection between the size of a sample N and the extent to
which a sampling distribution approaches the normal form. Many sampling
distributions based on large N can be approximated by the normal distribution even
though the population distribution itself is definitely not normal.
3.2.2.1 The standard normal distribution

A normal distribution with expected value μ=0 and variance σ 2=1 is called a standard normal
distribution. The standard deviation in a standard normal distribution equals one obviously. The
density φ ( x ) of a standard normal distribution is;

φ(x) = (1/√(2π)) exp(−x²/2)

The distribution function Φ of a standard normal distribution is

Φ(x) = ∫_{−∞}^{x} (1/√(2π)) exp(−t²/2) dt
Properties of the Standard Normal Distribution:
Same as a normal distribution, but also...
 Mean is zero
 Variance is one
 Standard Deviation is one

Calculating Normal Probabilities


As you might suspect from the formula for the normal density function, it would be difficult and
tedious to do the calculus every time we had a new set of parameters for μ and σ. So instead, we
usually work with the standardized normal distribution, where μ = 0 and σ = 1, i.e. N (0, 1). That
is, rather than directly solve a problem involving a normally distributed variable X with mean μ
and standard deviation σ, an indirect approach is used.

Note: To facilitate the use of normal distribution to calculate probabilities, the following
distribution known as the standard normal distribution was derived by using the transformation.
Z = (X − μ)/σ

⇒ f(z) = (1/√(2π)) exp(−z²/2)
Areas under the standard normal distribution curve have been tabulated in various ways. The
most common ones are the areas between;
Z = 0 and a positive value of Z.
Given a normally distributed random variable X with Mean μ and standard deviation σ

P(a < X < b) = P((a − μ)/σ < (X − μ)/σ < (b − μ)/σ)

⇒ P(a < X < b) = P((a − μ)/σ < Z < (b − μ)/σ)

Note:
P(a < X < b) = P(a ≤ X < b)
= P(a < X ≤ b)
= P(a ≤ X ≤ b)
 A table of standardized normal values attached at the end of the module as an appendix
can then be used to obtain an answer in terms of the converted problem.
The interpretation of Z values is straightforward. Since σ = 1, if Z = 2, the corresponding X value
is exactly 2 standard deviations above the mean. If Z = -1, the corresponding X value is one
standard deviation below the mean. If Z = 0, X = the mean, i.e. μ.
Rules for using the standardized normal distribution
It is very important to understand how the standardized normal distribution works, so we will
spend some time here going over it. Recall that, for a random variable X,
F(x) = P(X ≤ x)
NOTE: While memorization may be useful, you will be much better off if you gain an intuitive
understanding as to why the rules that follow are correct. Try drawing pictures of the normal
distribution to convince you that each rule is valid.
RULES:
1. P(Z ≤ a) = F(a) ⇒ one-tail probability
When a is positive, read F(a) directly from the table.
When a is negative, use F(a) = 1 − F(−a), where F(−a) = F(|a|) is read from the table.

For positive values of a


For negative values of a

Examples:
Find:
1. P (Z ≤ a) for a = 1.96, -1.96, 2.00, -2.00
To solve: for positive values of a, look up and report the value of F(a) given in the standard
normal table attached as an appendix at the end. For negative values of a, look up the value of
F(|a|) and report 1 − F(|a|).
Solution:
P (Z ≤ 1.96) = F (1.96) = 0.975
P (Z ≤ -1.96) = F (-1.96) = 1 – F (1.96) = .025
P (Z ≤ 2.00) = F (2.00) = 0.9772
P (Z ≤ -2.00) = F (-2.00) = 1 – F (2.00) = 0.0228
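These table look-ups can also be verified with software; the following one-line check is an illustrative sketch, assuming SciPy is installed:

from scipy.stats import norm

for a in (1.96, -1.96, 2.00, -2.00):
    print(a, norm.cdf(a))   # 0.975, 0.025, 0.9772, 0.0228 (rounded)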
It is possible also to work easily in the other direction, and determine what a is given P (Z ≤ a)

Finding “Z” for Given Probability

So far, most of the calculations above are of the form: find the probability P(Z ≤ z) for a given
value of z. Often we are also interested in an inverse problem: find the value z_A such that the
probability that Z is greater than z_A equals a specified value A.
Formally, our question is: for what value of z_A do we have

P(Z > z_A) = A?

Examples:

Find z_A for A = 0.025 (or 2.5%). That is, what is z_0.025?


Answer: Observe that
P(Z > z_0.025) = 1 − P(Z ≤ z_0.025), so
P(Z ≤ z_0.025) = 1 − P(Z > z_0.025)
= 1 − 0.025
= 0.975, where the second equality follows from the definition of z_0.025.
From the table, F(1.96) = 0.975; therefore the value of Z associated with the tail area 0.025 is 1.96.
2. Find the value of Z if
a. The normal curve area between 0 and Z (positive) is 0.4726
Solution:

P(0 < Z < z) = 0.4726, and from the table
P(0 < Z < 1.92) = 0.4726
⇒ z = 1.92 … uniqueness of the area

b. The area to the left of z is 0.9868


Solution:
P(Z < z) = 0.9868
= P(Z < 0) + P(0 < Z < z)
= 0.50 + P(0 < Z < z)
⇒ P(0 < Z < z) = 0.9868 − 0.50 = 0.4868, and from the table
P(0 < Z < 2.22) = 0.4868
⇒ z = 2.22
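The inverse look-ups can likewise be checked with SciPy's quantile function (ppf, the inverse CDF); a small sketch, assuming SciPy:

from scipy.stats import norm

print(norm.ppf(0.975))          # z with P(Z <= z) = 0.975 -> 1.96
print(norm.ppf(0.5 + 0.4726))   # part (a): area from 0 to z is 0.4726 -> 1.92
print(norm.ppf(0.5 + 0.4868))   # part (b): area to the left is 0.9868 -> 2.22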

3.2.3 Gamma and Beta distributions


3.2.3.1 Gamma distribution
Many continuous random variables can only take positive values, like height, thickness, or the life
expectancy of transistors. Such random variables are often modeled by gamma-type random
variables. The corresponding density functions contain two parameters, α and β. The first is known
as the shape parameter, the second as the scale parameter. For instance, one can draw the gamma
densities with α = 1, 2, 3 and β = 1. In general, α might not be an integer.

Gamma random variables are used to model a number of physical quantities. Some examples are

 The time it takes for something to occur, e.g. a lifetime or the service time in a queue.
 The rate at which some physical quantity is accumulating during a certain period of time,
e.g. the excess water flowing into a dam during a certain period of time due to rain or the
amount of grain harvested during a certain season.

A random variable X is said to have a gamma distribution, X ∼ Γ(α, β), with parameters α > 0 and
β > 0, if its probability density function has the form

f(x | α, β) = (1/(β^α Γ(α))) x^(α−1) e^(−x/β),  for x > 0, with α, β > 0
f(x | α, β) = 0,  otherwise

where

Γ(α) = ∫₀^∞ x^(α−1) e^(−x) dx  is the gamma function.

Properties of the gamma function:

 Γ(1) = 1 (Show!)
 Γ(α) = (α − 1)Γ(α − 1) for any α > 1
 Γ(α) = (α − 1)! for integer α
 Γ(1/2) = √π
 Γ(α + 1) = αΓ(α), which equals α! for integer α

The mean and variance of the gamma distribution are:

E(X) = αβ  and  Var(X) = αβ²
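The following short sketch (assuming SciPy; the parameter values are arbitrary) checks these formulas numerically. Note that SciPy's scale argument plays the role of β:

from scipy.stats import gamma

alpha, beta = 3.0, 2.0                 # arbitrary shape and scale
dist = gamma(a=alpha, scale=beta)
print(dist.mean(), alpha * beta)       # 6.0 6.0
print(dist.var(), alpha * beta**2)     # 12.0 12.0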

3.2.3.2 Beta distribution


In probability theory and statistics, the beta distribution is a family of continuous
probability distributions defined on the interval [0, 1] with two positive shape parameters,
denoted by α and β, that appear as exponents of the random variable and control the
shape of the distribution.
The beta distribution has been applied to model the behavior of random variables limited
to intervals of finite length in a wide variety of disciplines. For example, it has been used
as a statistical description of allele frequencies in;
 population genetics;

 time allocation in project management / control systems;
 sunshine data;
 variability of soil properties;
 proportions of the minerals in rocks in stratigraphy; and
 Heterogeneity in the probability of HIV transmission etc...
The beta distribution is a suitable model for the random behavior of percentages and
proportions. The usual formulation of the beta distribution is also known as the beta
distribution of the first kind, whereas beta distribution of the second kind is an alternative
name for the beta prime distribution.

A random variable X has a beta distribution, X ∼ Beta(α, β), with parameters α > 0 and β > 0,
if the density of X is

f(x) = x^(α−1)(1 − x)^(β−1)/B(α, β),  0 ≤ x ≤ 1
f(x) = 0,  elsewhere

Alternatively,

f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^(α−1)(1 − x)^(β−1),  0 ≤ x ≤ 1
f(x) = 0,  elsewhere

B(α, β) = ∫₀¹ x^(α−1)(1 − x)^(β−1) dx = some constant

Relationship between the gamma and beta functions:

B(α, β) = Γ(α)Γ(β)/Γ(α + β)

 This makes it easy to compute for whole numbers, since Γ(n) = (n − 1)!

E(X) = α/(α + β)   and   σ² = Var(X) = αβ/((α + β)²(α + β + 1))

Example:
1. The length of your morning commute (in hours) is a random variable X that has a beta
distribution with α = β = 2.
a. Find the probability that your commute tomorrow will take longer than 30
minutes.
b. Your rage level is R = X2 + 2X +1. Find the expected value of R
Solution:
f(x) = x^(α−1)(1 − x)^(β−1)/B(α, β)
= (Γ(4)/(Γ(2)Γ(2))) x(1 − x)
= 6(x − x²)

so f(x) = 6(x − x²), 0 ≤ x ≤ 1.

a. Thirty minutes is x = 1/2, so

P(X ≥ 1/2) = ∫_{1/2}^{1} 6(x − x²) dx
= 6(x²/2 − x³/3) evaluated from 1/2 to 1
= 6(1/2 − 1/3 − 1/8 + 1/24)
= 1/2

b. E(X) = α/(α + β) = 1/2

E(X²) = Var(X) + [E(X)]²
= αβ/((α + β)²(α + β + 1)) + (α/(α + β))²
= 4/(16 · 5) + 1/4 = 3/10

E(R) = E(X²) + 2E(X) + 1 = 3/10 + 2·(1/2) + 1 = 23/10
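A numerical check of this example, as a sketch assuming SciPy is available:

from scipy.stats import beta as beta_dist

X = beta_dist(2, 2)                           # alpha = beta = 2
print(1 - X.cdf(0.5))                         # a. P(X >= 1/2) = 0.5
ex, ex2 = X.mean(), X.var() + X.mean() ** 2   # E(X) and E(X^2)
print(ex2 + 2 * ex + 1)                       # b. E(R) = 23/10 = 2.3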

Chapter Four
4. Joint and conditional probability distributions

Introduction
So far we have considered only distributions with one random variable. There are many
problems that involve two or more random variables. We need to examine how these random
variables are distributed together ("jointly"). There are discrete and continuous joint
distributions, based on the type of random variables.

Learning objectives:
After successful completion of this chapter, students are able to:
 Differentiate Joint and Marginal Distributions
 Formulate marginal distributions from joint discrete and continuous distributions.
 Understand the Conditional Distributions and Independence
 Compute expectations, covariance and Correlations for joint random variables.
 Solve the conditional Expectation of joint random variables.

4.1 Joint and Marginal Distributions

In general, if X and Y are two random variables, the probability distribution that defines their
simultaneous behavior is called a joint probability distribution.

Shown here as a table for two discrete random variables, which gives P(X = x, Y = y):

        x = 1   x = 2   x = 3
y = 1    0      1/6     1/6
y = 2   1/6      0      1/6
y = 3   1/6     1/6      0

 If X and Y are discrete, this distribution can be described with a joint probability mass
function, denoted by P_{X,Y}(X = x, Y = y).
 If X and Y are continuous, this distribution can be described with a joint probability
density function, denoted by f_{X,Y}(X = x, Y = y).
 If we are given a joint probability distribution for X and Y, we can obtain the
individual probability distribution for X or for Y (and these are called the
Marginal Probability Distributions)...
The rule for finding a marginal is simple.
 To obtain a marginal PMF/PDF from a joint PMF/PDF, sum or integrate out the
variable(s) you don't want.
For discrete, this is obvious from the definition of the PMF of a random variable.

p_X(x) = P(X = x) = Σ_y p_{X,Y}(x, y)

p_Y(y) = P(Y = y) = Σ_x p_{X,Y}(x, y)

 To obtain the marginal of X sum out y.


 To obtain the marginal of Y sum out x.

 For continuous, this is a bit less obvious, but if we define

f_X(x) = ∫ f_{X,Y}(x, y) dy

f_Y(y) = ∫ f_{X,Y}(x, y) dx


 To obtain the marginal of X integrate out y.
 To obtain the marginal of Y integrate out x.
“Joint” and “marginal” are just verbal shorthand to distinguish the univariate distributions
(marginal) from the bivariate (joint) distribution.

Example:
1. Measurements for the length and width of a rectangular plastic covers for CDs are
rounded to the nearest mm (so they are discrete).
Let: X denote the length and
Y denotes the width.
The possible values of X are 129, 130, and 131 mm. The possible values of Y are 15 and 16 mm
(Thus, both X and Y are discrete).
There are 6 possible pairs (X; Y).
We show the probability for each pair in the following table:
             x = length
y = width    129    130    131
   15        0.12   0.42   0.06
   16        0.08   0.28   0.04

The sum of all the probabilities is 1.0. The combination with the highest probability is (130, 15).
The combination with the lowest probability is (131, 16).
The joint probability mass function is the function p_{X,Y}(x, y) = P(X = x, Y = y). For example,
we have p_{X,Y}(129, 15) = 0.12.

Questions:
a. Find the probability that a CD cover has length of 129mm (i.e. X = 129).
P (X = 129) = P (X = 129 and Y = 15) + P (X = 129 and Y = 16)
= 0.12 + 0.08 = 0.20
b. What is the probability distribution of X?
             x = length
y = width    129    130    131
   15        0.12   0.42   0.06
   16        0.08   0.28   0.04
Column total 0.20   0.70   0.10

The probability distribution for X appears in the column totals.

x        129    130    131
f_X(x)   0.20   0.70   0.10

 This table represents the values of the random variable X and its
corresponding probabilities and it is called the marginal distribution of X.

NOTE: We've used a subscript X in the probability mass function of X, i.e. f_X(x), for clarification,
since we're considering more than one variable at a time now.
We can do the same for the Y random variable.


             x = length
y = width    129    130    131    Row totals
   15        0.12   0.42   0.06   0.60
   16        0.08   0.28   0.04   0.40
Column total 0.20   0.70   0.10   1.00

y        15     16
f_Y(y)   0.60   0.40

 Because the probability mass functions for X and Y appear in the margins of the table
(i.e. column and row totals), they are often referred to as the Marginal Distributions for X
and Y.
 When there are two random variables of interest, we also use the term bivariate
probability distribution or bivariate distribution to refer to the joint distribution.
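In software, the marginals are just row and column sums of the joint table. A minimal NumPy sketch for the CD-cover table above:

import numpy as np

joint = np.array([[0.12, 0.42, 0.06],    # y = 15; columns x = 129, 130, 131
                  [0.08, 0.28, 0.04]])   # y = 16

print(joint.sum(axis=0))   # marginal of X: [0.2 0.7 0.1]
print(joint.sum(axis=1))   # marginal of Y: [0.6 0.4]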


4.1.1 Joint Probability Mass Function

The joint probability mass function of the discrete random variables X and Y, denoted as
P X ,Y (x , y ), satisfies;

1. p_{X,Y}(x, y) ≥ 0

2. Σ_x Σ_y p_{X,Y}(x, y) = 1

3. p_{X,Y}(x, y) = P(X = x, Y = y)

4.1.2 Marginal Probability Mass Function

If X and Y are discrete random variables with joint probability mass function P X ,Y ( x , y ), then the
marginal probability mass functions of X and Y are;

P_X(x) = Σ_y P_{X,Y}(x, y), and

P_Y(y) = Σ_x P_{X,Y}(x, y)

where the sum for P_X(x) is over all points in the range of (X, Y) for which X = x, and the sum
for P_Y(y) is over all points in the range of (X, Y) for which Y = y.
4.1.3 Joint Probability Density Function

A bivariate function with values f (X , Y ) defined over the X Y plane is called joint probability
density function for the continuous random variable X and Y, denoted as f X , Y ( x , y ) and satisfies
the following properties:
1. f_{X,Y}(x, y) ≥ 0, for all values of x, y

2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = 1

3. for any region R of two-dimensional space,

P((X, Y) ∈ R) = ∫∫_R f_{X,Y}(x, y) dx dy
4.1.4 Marginal Probability Density function

If X and Y are continuous random variables with joint probability density function f XY (x ; y),
then the marginal density functions for X and Y are:


f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy, and

f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx

where the first integral is over all points in the range of (X, Y) for which X = x, and the second
integral is over all points in the range of (X, Y) for which Y = y.

4.2 Conditional Distributions and Independence

We have previously shown that the conditional probability of A given B can be obtained by
dividing the probability of the intersection by the probability of B, specifically,

P(A|B) = P(A ∩ B)/P(B)
Definition: for discrete random variables X and Y, the conditional PMF of X given Y (and vice
versa) is defined as:

P_{X|Y}(x|y) = P(X = x | Y = y) = P_{X,Y}(x, y)/P_Y(y),  for P_Y(y) > 0

P_{Y|X}(y|x) = P(Y = y | X = x) = P_{X,Y}(x, y)/P_X(x),  for P_X(x) > 0

 The conditional probability can be stated as the joint probability over the
marginal probability.
A conditional probability distribution P_{Y|X}(y|x) and P_{X|Y}(x|y) have the following properties:
1. P_{Y|X}(y|x) ≥ 0, P_{X|Y}(x|y) ≥ 0
2. Σ_y P_{Y|X}(y|x) = 1, Σ_x P_{X|Y}(x|y) = 1
3. P_{Y|X}(y|x) = P(Y = y | X = x), P_{X|Y}(x|y) = P(X = x | Y = y)


Example: continuing the plastic covers...
             X = length
Y = width    129    130    131    Row totals
   15        0.12   0.42   0.06   0.60
   16        0.08   0.28   0.04   0.40
Column total 0.20   0.70   0.10   1.00

a. Find the probability that a CD cover has a length of 130mm given the width is
15mm.
Solution: P(X = 130 | Y = 15) = P(X = 130, Y = 15)/P(Y = 15)
= 0.42/0.60
= 0.70

b. Find the conditional distribution of X given Y = 15.

P(X = 129 | Y = 15) = 0.12/0.60 = 0.20

P(X = 130 | Y = 15) = 0.42/0.60 = 0.70

P(X = 131 | Y = 15) = 0.06/0.60 = 0.10

Therefore the conditional distribution of X given Y = 15, or p_{X|Y=15}(X = x | Y = 15), is:

x                            129    130    131
p_{X|Y=15}(X = x | Y = 15)   0.20   0.70   0.10

The sum of these probabilities is 1, and this is a legitimate probability distribution.


c. Find the conditional distribution of Y given X =130.
d. Find P(Y = 16 | X = 131).

In the continuous case, the idea of a conditional distribution takes on a slightly different meaning
than in the discrete case. If X and Y are both continuous, P(X = x | Y = y) is not defined, because
the probability of any one point is identically zero. It makes sense, however, to define a
conditional distribution function, i.e.

P(X ≤ x | Y = y)

because the value of Y is known when we compute the probability that X is less than some
specific value.

Definition: Let X and Y be jointly continuous random variables with joint probability density
f_{X,Y}(x, y) and marginal densities f_X(x) and f_Y(y), respectively. For any y such that f_Y(y) > 0,
the conditional probability density function of X given Y = y is defined to be

f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y)

and similarly,

f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x)


Example:
f(x, y) = (1/6)(x + 4y),  0 < x < 2, 0 < y < 1
f(x, y) = 0,  otherwise

The marginal density of X is f_X(x) = (1/6)(x + 2), while the marginal density of Y is
f_Y(y) = (1/6)(2 + 8y). Now,

a. find the conditional distribution of X given Y. This is given by:

f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y) = [(1/6)(x + 4y)]/[(1/6)(2 + 8y)] = (x + 4y)/(8y + 2)

b. For 0 < x < 2 and 0 < y < 1, find the probability that X ≤ 1 given that Y = 1/2.

First determine the density function when y = 1/2 as follows:

f(x | y = 1/2) = (x + 4(1/2))/(8(1/2) + 2) = (x + 2)/(4 + 2) = (x + 2)/6

Then,

P(X ≤ 1 | Y = 1/2) = ∫₀¹ (1/6)(x + 2) dx
= (1/6)(1/2 + 2) − 0
= 1/12 + 2/6 = 5/12

4.2.1 Independent Random Variables

We have previously shown that two events A and B are independent if the probability of their
intersection is the product of their individual probabilities, i.e.

P(A ∩ B) = P(A)P(B)

In parallel, if X and Y are independent, then the joint probability can be written as the product of
the marginal probabilities. We then have the following definition.

Discrete Random Variables: If X and Y are discrete random variables with joint probability
mass function p(x, y) and marginal mass functions P_X(x) and P_Y(y), respectively, then X and Y
are independent if, and only if,

P_{X,Y}(x, y) = P_X(x)P_Y(y) = p(x)p(y)

for all pairs of real numbers (x, y).
Example: Continuing the plastic covers example once again...
             X = length
Y = width    129    130    131    Row totals
   15        0.12   0.42   0.06   0.60
   16        0.08   0.28   0.04   0.40
Column total 0.20   0.70   0.10   1.00

Show whether X and Y are dependent or independent.

Continuous Bivariate Random Variables: If X and Y are continuous random variables with
joint probability density function f(x, y) and marginal density functions f_X(x) and f_Y(y),
respectively, then X and Y are independent if and only if

f_{X,Y}(x, y) = f_X(x)f_Y(y) = f(x)f(y)

for all pairs of real numbers (x, y).
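For a finite joint table, independence can be checked numerically: the joint probabilities must equal the product of the marginals in every cell. A rough sketch for the plastic-covers table, assuming NumPy:

import numpy as np

joint = np.array([[0.12, 0.42, 0.06],
                  [0.08, 0.28, 0.04]])
px = joint.sum(axis=0)                       # marginal of X
py = joint.sum(axis=1)                       # marginal of Y

print(np.allclose(joint, np.outer(py, px)))  # True: every cell factors here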

4.3 Expectation

Definition: for random variables X and Y, their joint expectation E(XY) is given by

E(XY) = Σ_x Σ_y xy P_{X,Y}(x, y), if X and Y are discrete

E(XY) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy f_{X,Y}(x, y) dx dy, if X and Y are continuous random variables

4.3.1 Variance

Variance of a Single Random Variable: The variance of a random variable X with mean μ is
given by;

Var(X) ≡ σ² ≡ E[(X − μ)²]

≡ ∫_{−∞}^{∞} (x − μ)² f(x) dx

≡ ∫_{−∞}^{∞} x² f(x) dx − μ²

≡ E(X²) − (E(X))²

The variance is a measure of the dispersion of the random variable about the mean.
 Variance of a sum:
Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab Cov(X, Y)

4.3.2 Covariance

Definition: Let X and Y be any two random variables defined in the same probability space. The
covariance of X and Y, denoted Cov[X, Y] or σ_{X,Y}, is defined as

Cov[X, Y] ≡ E[(X − μ_X)(Y − μ_Y)]
≡ E[XY] − E[μ_X Y] − E[X μ_Y] + E[μ_X μ_Y]
≡ E[XY] − E(X)E(Y)

Therefore,

Cov[X, Y] ≡ Σ_x Σ_y xy P_{X,Y}(x, y) − E(X)E(Y)  …. discrete random variables

Cov[X, Y] ≡ ∫∫ xy f_{X,Y}(x, y) dx dy − E(X)E(Y)  …… continuous random variables

The covariance measures the interaction between two random variables, but its numerical value
is not independent of the units of measurement of X and Y. Positive values of the covariance
imply that X increases when Y increases; negative values indicate that X decreases as Y
increases.

 The covariance has the following properties:


1. Cov(X,X) = Var (X)
2. if X and Y are independent then Cov(X; Y ) = 0;
3. Cov(X,Y ) = Cov(Y,X)
4. Cov(a X, Y ) = a Cov(X, Y )
5. Cov(X + c, Y ) = Cov(X, Y )
6. Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z)
7. more generally,

Cov(Σ_{i=1}^{k} a_i X_i, Σ_{j=1}^{n} b_j Y_j) = Σ_{i=1}^{k} Σ_{j=1}^{n} a_i b_j Cov(X_i, Y_j)

4.3.3 Correlation

The correlation coefficient of random variables X and Y, denoted by ρ[X, Y] or ρ_{X,Y}, is defined
to be

ρ_{X,Y} = Cov[X, Y]/(σ_X σ_Y)

provided that Cov[X, Y], σ_X and σ_Y exist, and σ_X, σ_Y are positive. The correlation
coefficient, denoted by ρ_{XY} or ρ(X, Y), is obtained by normalizing the covariance. In
particular, we define the correlation coefficient of two random variables X and Y as the
covariance of the standardized versions of X and Y.
the interaction between them. It also has the property of being independent of the units of
measurement and being bounded between negative one and one. The sign of the correlation
coefficient is the same as the sign of the covariance. Thus ρ > 0 indicates that Y increases as X
increases and ρ = 1 indicates perfect correlation, with all the points falling on a straight line with
positive slope. If ρ = 0, there is no correlation and the covariance is zero.

Generally,
1. −1 ≤ ρ(X, Y) ≤ 1
2. if ρ(X, Y) = 1, then Y = aX + b, where a > 0
3. if ρ(X, Y) = −1, then Y = aX + b, where a < 0
4. ρ(aX + b, cY + d) = ρ(X, Y) for a, c > 0
5. if ρ(X, Y) = 0, we say that X and Y are uncorrelated
6. if ρ(X, Y) > 0, we say that X and Y are positively correlated
7. if ρ(X, Y) < 0, we say that X and Y are negatively correlated.

Note: If X and Y are uncorrelated, then

Var(X + Y) = Var(X) + Var(Y)

More generally, if X1, X2, …, Xn are pairwise uncorrelated, i.e., ρ(Xi, Xk) = 0 when i ≠ k, then

Var(X1 + X2 + … + Xn) = Var(X1) + Var(X2) + … + Var(Xn).
Example:
1. Let X and Y be discrete random variables with joint mass function defined by:

P_{X,Y}(x, y) = 1/4 for (x, y) ∈ {(0, 0), (1, 1), (1, −1), (2, 0)}, and 0 otherwise.

The marginal mass functions, expectations and variances of X and Y are:

P_X(x) = Σ_y P_{X,Y}(x, y) = 1/4 for x = 0, 2, and 1/2 for x = 1

E(X) = Σ x P_X(x) = 0(1/4) + 1(1/2) + 2(1/4) = 1

E(X²) = Σ x² P_X(x) = 0²(1/4) + 1²(1/2) + 2²(1/4) = 3/2

Var(X) = E(X²) − [E(X)]² = 3/2 − 1² = 1/2

P_Y(y) = Σ_x P_{X,Y}(x, y) = 1/4 for y = −1, 1, and 1/2 for y = 0

E(Y) = Σ y P_Y(y) = (−1)(1/4) + 0(1/2) + 1(1/4) = 0

E(Y²) = Σ y² P_Y(y) = (−1)²(1/4) + 0²(1/2) + 1²(1/4) = 1/2

Var(Y) = E(Y²) − [E(Y)]² = 1/2 − 0² = 1/2

And to compute the covariance we also need to compute E(XY):

E(XY) = Σ_x Σ_y xy P_{X,Y}(x, y)
= (0·0)(1/4) + (1·1)(1/4) + (1·(−1))(1/4) + (2·0)(1/4) = 0

Therefore,

Cov(X, Y) = E(XY) − E(X)E(Y) = 0 − 1·0 = 0

and

Corr(X, Y) = ρ_{X,Y} = Cov[X, Y]/(σ_X σ_Y) = 0/√((1/2)(1/2)) = 0

Hence, the two random variables have covariance and correlation zero. But note that X and Y are
not independent, as it is not true that P_{X,Y}(x, y) = P_X(x)P_Y(y) for all x and y.

2. Let X and Y be continuous random variables with joint pdf

f_{X,Y}(x, y) = 3x,  0 ≤ y ≤ x ≤ 1
f_{X,Y}(x, y) = 0,  elsewhere

The marginal PDFs, expectations and variances of X and Y are:

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy = ∫₀ˣ 3x dy = 3x²,  0 ≤ x ≤ 1

E(X) = ∫₀¹ x · 3x² dx = 3/4

E(X²) = ∫₀¹ x² · 3x² dx = 3/5

Var(X) = E(X²) − [E(X)]² = 3/5 − 9/16 = 3/80

f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx = ∫_y¹ 3x dx = (3/2)(1 − y²),  0 ≤ y ≤ 1

E(Y) = ∫₀¹ y (3/2)(1 − y²) dy = (3/2)(1/2 − 1/4) = 3/8

E(Y²) = ∫₀¹ y² (3/2)(1 − y²) dy = (3/2)(1/3 − 1/5) = 1/5

Var(Y) = E(Y²) − [E(Y)]² = 1/5 − 9/64 = 19/320

And to compute the covariance we also need to compute E(XY):

E(XY) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy f_{X,Y}(x, y) dy dx
= ∫₀¹ ∫₀ˣ xy · 3x dy dx
= ∫₀¹ {∫₀ˣ y dy} 3x² dx
= ∫₀¹ (x²/2) 3x² dx
= 3/10

Now,

Cov[X, Y] ≡ E(XY) − E(X)E(Y)
≡ 3/10 − (3/4)(3/8)
≡ 3/160

Corr(X, Y) = ρ_{X,Y} = Cov[X, Y]/(σ_X σ_Y)
= (3/160)/√((3/80)(19/320))
≈ 0.397
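The double integrals above can be verified numerically; a sketch with SciPy's dblquad, which integrates a function f(y, x) with y as the inner variable:

from scipy.integrate import dblquad

# joint density 3x on the triangle 0 <= y <= x <= 1
exy, _ = dblquad(lambda y, x: x * y * 3 * x, 0, 1, 0, lambda x: x)
ex,  _ = dblquad(lambda y, x: x * 3 * x,     0, 1, 0, lambda x: x)
ey,  _ = dblquad(lambda y, x: y * 3 * x,     0, 1, 0, lambda x: x)

print(exy - ex * ey)   # Cov(X, Y) = 3/160 = 0.01875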
4.3.4 Conditional Expectation

The conditional expectation or mean of discrete random variables, X given Y = y, denoted as
E(X | Y = y) or μ_{X|Y=y}, is:

E(X | Y = y) = Σ_x x P_{X|Y=y}(x|y) = μ_{X|Y=y}

NB: E(Y | X = x) can be defined in a similar fashion.


 Law of Iterated Expectations: E[X] = E [E [X|Y ]]
 If X and Y are independent random variables, then
a. E[X|Y ] = E[X]
b. E[XY ] = E[X] E[Y ]
Example:
You roll two 6-sided dice D1 and D2. Let X = D1+D2 and let Y = the value of D2. What is E [X|
Y = 6].

E(X | Y = 6) = Σ_x x P(X = x | Y = 6)
= (1/6)(7 + 8 + 9 + 10 + 11 + 12)
= 57/6 = 9.5

This makes intuitive sense since 6 + E(value of D1) = 6 + 3.5 = 9.5.
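This can also be confirmed by brute force, enumerating all 36 equally likely outcomes:

from itertools import product

pairs = [(d1 + d2, d2) for d1, d2 in product(range(1, 7), repeat=2)]
xs = [x for x, y in pairs if y == 6]   # outcomes with D2 = 6
print(sum(xs) / len(xs))               # 9.5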
The conditional variance of discrete random variables X given Y = y, denoted as Var(X | Y = y)
or σ²_{X|Y=y}, is:

Var(X | Y = y) = Σ_x (x − μ_{X|Y=y})² P_{X|Y=y}(x|y)
= Σ_x x² P_{X|Y=y}(x|y) − μ²_{X|Y=y}
= E(X² | Y = y) − [E(X | Y = y)]²
= σ²_{X|Y=y}
The conditional expectation or mean of continuous random variables, X given Y = y, denoted as
E(X | Y = y) or μ_{X|Y=y}, is:

E(X | Y = y) = ∫_{−∞}^{∞} x f_{X|Y=y}(x|y) dx = μ_{X|Y=y}

Example: Find E[Y | X] if the joint probability density function is f_{X,Y}(x, y) = 1/x, 0 < y ≤ x ≤ 1.

Solution: f_X(x) = ∫₀ˣ (1/x) dy = 1, 0 ≤ x ≤ 1

f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x) = 1/x, 0 < y ≤ x

E(Y | X = x) = ∫₀ˣ y f_{Y|X}(y|x) dy = ∫₀ˣ (y/x) dy = x/2

Therefore, the conditional expectation E(Y | X) = X/2.

The conditional variance of continuous random variables X given Y = y, denoted as Var(X | Y = y)
or σ²_{X|Y=y}, is:

Var(X | Y = y) = ∫ (x − μ_{X|Y=y})² f_{X|Y=y}(x|y) dx
= ∫ x² f_{X|Y=y}(x|y) dx − μ²_{X|Y=y}
= E(X² | Y = y) − [E(X | Y = y)]²
= σ²_{X|Y=y}
4.3.5 Independence and Expectation
If two random variables X and Y are independent, then
E(XY) = E(X)E(Y)
Var(X + Y) = Var(X) + Var(Y)
Cov(X, Y) = 0
 Warning: zero covariance does not mean that the random variables are
independent.

Chapter five
5. Sampling and Sampling Distribution
Introduction
In statistics, sampling plays a vital role. The purpose of this chapter is to introduce students to,
and equip them with, the concepts of sampling and sampling distributions.

Learning objectives:
At the end of the chapter students are expected to:
 Understand what sampling is and why we sample, and explain the different
types of sampling techniques/methods.
 Describe the concept of a sampling distribution and elaborate the different
types of sampling distributions (i.e., the sampling distribution of the sample mean
and of the sample proportion).

 Understand the law of large numbers and the central limit theorem

5.1 The concept of sampling

Sampling is very often used in our daily life. For example while purchasing food grains from a
shop we usually examine a handful from the bag to assess the quality of the commodity.
A doctor examines a few drops of blood as sample and draws conclusion about the blood
constitution of the whole body. Thus most of our investigations are based on samples. In this
chapter, let us see the importance of sampling and the various methods of sample selections from
the population.
1. A (statistical) population: is the complete set of possible measurements for which
inferences are to be made. The population represents the target of an investigation, and
the objective of the investigation is to draw conclusions about the population hence we
sometimes call it target population. Sometimes it is possible and practical to examine
every person or item in the population we wish to describe. We call this a complete
enumeration, or census. We use sampling when it is not possible to measure every item
in the population. Statisticians use the word population to refer not only to people but to
all items that have been chosen for study.
Examples
 Population of trees under specified climatic conditions
 Population of animals fed a certain type of diet
 Population of farms having a certain type of natural fertility
 Population of households, etc.

Finite population and infinite population:


A population is said to be finite if it consists of a finite number of units. The number of workers in a
factory or the production of articles on a particular day for a company are examples of finite
populations. The total number of units in a population is called the population size. A population is
said to be infinite if it has an infinite number of units, for example the number of stars in the sky or
the number of people watching television programs.
 There are two ways of investigation: Census and sample survey.

2. Census: a complete enumeration of the population. But in most real problems it cannot
be realized, hence we take sample. In census method every element of the population is
included in the investigation. For example, if we study the average annual income of the
families of a particular village or area, and if there are 1000 families in that area, we must
study the income of all 1000 families. In this method no family is left out, as each family
is a unit.
3. Sample: A sample from a population is the set of measurements that are actually
collected in the course of an investigation. It should be selected using some predefined
sampling technique in such a way that they represent the population very well.
Examples:
 Monthly production data of a certain factory in the past 10 years.
 Small portion of a finite population.
In practice, we don’t conduct census, instead we conduct sample survey
4. Parameter and Statistic:
We can describe samples and populations by using measures such as the mean, median,
mode and standard deviation. When these terms describe the characteristics of a
population, they are called parameters. When they describe the characteristics of a
sample, they are called statistics. A parameter is a characteristic of a population and a
statistic is a characteristic of a sample. Since samples are subsets of population statistics
provide estimates of the parameters.
That is, when the parameters are unknown, they are estimated from the values of the
statistics.
In general, we use Greek or capital letters for population parameters and lower-case
Roman letters to denote sample statistics. N, μ, σ are the standard symbols for the size,
mean and standard deviation of the population; n, x̄, s are the standard symbols for the
size, mean and standard deviation of the sample, respectively.
5. Sampling: The process or method of sample selection from the population.
6. Sampling unit: the ultimate unit to be sampled or elements of the population to be
sampled.
Examples:

 If somebody studies the socio-economic status of households, the household is the
sampling unit.
 If one studies the performance of freshman students in some college, the student is
the sampling unit.
7. Sampling frame: is the list of all elements in a population.
Examples:
 List of households.
 List of students in the registrar office.
8. Sampling errors and non-sampling errors:
The two types of errors in a sample survey are sampling errors and non – sampling errors.
a. Sampling errors:
Although a sample is a part of a population, it cannot generally be expected to supply full
information about the population. So in most cases there may be a difference between a statistic
and the corresponding parameter. The discrepancy between a parameter and its estimate due to
the sampling process is known as sampling error.
b. Non-sampling errors:
In all surveys some errors may occur during collection of actual information. These errors are
called Non-sampling errors.
 May arise due to incorrect response.
 Measurement problems
 Errors at different stages in processing the data.
Reasons for Sampling
 Reduced cost
 Greater speed
 Greater accuracy
 Greater scope
 Avoids destructive test
 The only option when the population is infinite
Because of the above consideration, in practice we take sample and make conclusion about the
population values such as population mean and population variance, known as parameters of the
population.

Sometimes taking a census makes more sense than using a sample. Some of the reasons include:

 Universality
 Qualitativeness
 Detailedness
 Non-representativeness

5.1.1 Principles of Sampling:


Samples have to provide good estimates. The following principles tell us that the sample
methods provide such good estimates.

1. Principle of statistical regularity:


A moderately large number of units chosen at random from a large group are almost sure on the
average to possess the characteristics of the large group.
2. Principle of Inertia of large numbers:
Other things being equal, as the sample size increases, the results tend to be more accurate and
reliable.
3. Principle of Validity:
This states that the sampling methods provide valid estimates about the population units
(parameters).
4. Principle of Optimization:
This principle takes into account the desirability of obtaining a sampling design which gives
optimum results. This minimizes the risk or loss of the sampling design.
The foremost purpose of sampling is to gather maximum information about the population under
consideration at minimum cost, time and human power. This is best achieved when the sample
contains all the properties of the population.
Exercises:
1. Describe briefly the difference between
 Population and sample
 Parameter and statistic
 Census and sample survey

 Sampling and non-sampling error
 Sampling frame and sampling units
2. Why do researchers usually select sample elements from a given population?
3. Mention some of the advantages and disadvantage of sampling.
4. An insurance company has insured 300,000 cars over the last six years. The company
would like to know the number of cars involved in one or more accidents over this
period. The manger selected 1000 cars from the files and made a record of cars that were
involved in one or more accidents.
a. What is the population?
b. What is the sample?
c. What is the variable of interest to the insurance company?
5.1.2 Types of Sampling Techniques

The technique of selecting a sample is of fundamental importance in sampling theory and it


depends upon the nature of investigation. The sampling procedures which are commonly used
may be classified as;
a. Probability sampling.
b. Non-probability sampling.
c. Mixed sampling.
a. Probability sampling (Random sampling):
A probability sample is one where the selection of units from the population is made according
to known probabilities. i.e. It is a method of sampling in which all elements in the population
have a pre-assigned non-zero probability to be included in to the sample.
Examples:
i. Simple random sampling
A simple random sample from finite population is a sample selected such that each possible
sample combination has equal probability of being chosen. It is also called unrestricted random
sampling. This can be done in two ways:
a. Simple random sampling without replacement:
In this method the population elements can enter the sample only once, the units once selected is
not returned to the population before the next draw.


b. Simple random sampling with replacement:


In this method the population units may enter the sample more than once. Simple random
sampling may be with or without replacement.
Generally, Simple random sampling can be done either using the lottery method or table of
random numbers.
ii. Systematic sampling
This method is widely employed because of its ease and convenience. A frequently used method
of sampling when a complete list of the population is available is systematic sampling. It is also
called quasi-random sampling.
The whole sample selection is based on just a random start. The first unit is selected with the
help of random numbers, and the rest get selected automatically according to some pre-designed
pattern. With systematic random sampling, every Kth element in the frame is selected for the
sample, with the starting point among the first K elements determined at random.
For example, if we want to select a sample of 10 students from 60 students, under this method
every Kth item is picked up from the sampling frame, and K is called the sampling interval.

Sampling interval: K = N/n = population size / sample size

K = N/n = 60/10 = 6

K = 6 is the sampling interval. A systematic sample consists in selecting a random number, say
i ≤ K, and every Kth unit subsequently. Suppose the random number i is 4; then we select 4, 10,
16, 22, 28, 34, 40, 46, 52 and 58 as the sample using this method. The random number i is called
the random start. The technique will generate K systematic samples with equal probability.
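A minimal sketch of this procedure in Python (N = 60, n = 10, so K = 6):

import random

N, n = 60, 10
K = N // n                             # sampling interval
start = random.randint(1, K)           # random start, e.g. 4
sample = list(range(start, N + 1, K))  # e.g. [4, 10, 16, ..., 58]
print(sample)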
iii. Stratified random sampling
Of all the methods of sampling the procedure commonly used in surveys is stratified sampling.
This technique is mainly used to reduce the population heterogeneity and to increase the
efficiency of the estimates. Stratification means division into groups. In this method the
population is divided into a number of subgroups called strata. The strata should be so formed
that each stratum is homogeneous as far as possible. Then from each stratum a simple random

sample may be selected and these are combined together to form the required sample from the
population.
Types of Stratified Sampling:
There are two types of stratified sampling. They are proportional and non-proportional. In the
proportional sampling equal and proportionate representation is given to subgroups or strata. If
the number of items is large, the sample will have a higher size and vice versa.
The population size is denoted by N and the sample size is denoted by n. The sample size is
allocated to each stratum in such a way that the sampling fraction is a constant for each stratum;
that is, n/N = c. So in this method each stratum is represented according to its size.
In a non-proportionate sample, equal representation is given to all the sub-strata regardless of
their existence in the population.
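Proportional allocation gives each stratum n·(Nh/N) sample units. A sketch with hypothetical stratum sizes (the values below are illustrative and not taken from the exercises):

strata = {"A": 600, "B": 300, "C": 100}   # hypothetical stratum sizes, N = 1000
n = 50                                    # total sample size

N = sum(strata.values())
for name, Nh in strata.items():
    print(name, n * Nh // N)              # A: 30, B: 15, C: 5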
Exercise:
A sample of 20 students is to be drawn from a population consisting of 300 students belonging to
two institutions A and B. The number of students in the institution A is 120 and the institution B
is 180. How will you draw the sample using proportional allocation?
iv. Cluster sampling
The population is divided in to non-overlapping groups called clusters. A simple random sample
of groups or cluster of elements is chosen and all the sampling units in the selected clusters will
be surveyed. Clusters are formed in a way that elements within a cluster are heterogeneous, i.e.
observations in each cluster should be more or less dissimilar. Cluster sampling is useful when it
is difficult or costly to generate a simple random sample. For example, to estimate the average
annual household income in a large city we use cluster sampling, because to use simple random
sampling we need a complete list of households in the city from which to sample. To use
stratified random sampling, we would again need the list of households. A less expensive way is
to let each block within the city represent a cluster. A sample of clusters could then be randomly
selected, and every household within these clusters could be interviewed to find the average
annual household income.

b. Non-Probability(non-random) sampling:
It is a sampling technique in which the choice of individuals for a sample depends on the basis of
convenience, personal choice or interest. It is the one where discretion is used to select

‘representative’ units from the population, or to infer that a sample is ‘representative’ of the
population. This method is called judgment or purposive sampling. This method is mainly used
for opinion surveys; a common type of judgment sample used in surveys is the quota sample.
This method is not used in general because of the prejudice and bias of the enumerator. However,
if the enumerator is experienced and expert, this method may yield valuable results. For example,
in a market research survey of the performance of a new car, the sample could be all new-car
purchasers.

 Judgment Sampling
In this case, the person taking the sample has direct or indirect control over which items are
selected for the sample.
 Convenience Sampling
In this method, the decision maker selects a sample from the population in a manner that is
relatively easy and convenient.
 Quota Sampling
In this method, the decision maker requires the sample to contain a certain number of items with
a given characteristic. Many political polls are, in part, quota sampling.
 Snowball/networking sampling
In this sampling method samples are generated in a networked manner. Important method if the
researcher has little or no knowledge about the population under study.
c. Mixed Sampling:
Here samples are selected partly according to some probability and partly according to a fixed
sampling rule; they are termed as mixed samples and the technique of selecting such samples is
known as mixed sampling.
Exercises:
1. What are the merits and limitations of simple random sampling stratified random
sampling, systematic sampling and cluster sampling techniques?
2. A population of size 800 is divided into 3 strata of sizes 500, 200, 100 respectively. A
stratified sample size of 160 is to be drawn from the population. Determine the sizes of
the samples from each stratum under proportional allocation.


Note:
Let N = population size, n = sample size.
1. Suppose simple random sampling is used:
 We have Nⁿ possible samples if sampling is with replacement.
 We have (N choose n) = N!/(n!(N − n)!) possible samples if sampling is without replacement.

5.2 Sampling Distribution

So far we have seen the techniques how samples can be drawn from the population. Using one of
the already discussed sampling techniques above if we take different samples from a population,
the statistic that we would compute for each sample combinations need not be the same and most
likely would vary from sample to sample.

Example: Suppose that a population has four elements;


N = 4, consisting of 2, 4, 6, and 8.
If we draw samples of size n = 2 with replacement, we have 4² = 16 samples in total.
Samples: 2,2   2,4   2,6   2,8
         4,2   4,4   4,6   4,8
         6,2   6,4   6,6   6,8
         8,2   8,4   8,6   8,8

For each sample combination given above we can compute the mean value (i.e., the sample
statistic). The following table shows the mean value of each sample combination.

Sample  Mean    Sample  Mean
2, 2     2      6, 2     4
2, 4     3      6, 4     5
2, 6     4      6, 6     6
2, 8     5      6, 8     7
4, 2     3      8, 2     5
4, 4     4      8, 4     6
4, 6     5      8, 6     7
4, 8     6      8, 8     8

Therefore, as we see from the table, different sample combinations yield different sample means
(statistics). The population mean can be computed as:

μ = Σx/N = 20/4 = 5

Some of the sample combinations have a mean value equal to the population mean 5, but the
population mean is different from most of the sample means, and the mean of the sample itself
differs from sample to sample. This leads us to the concept of a sampling distribution.

Sampling Distribution: Given a statistic such as X̄, if we arrange its values in ascending order
and assign a probability to each of the values, or if we present the X̄ᵢ in the form of a relative
frequency distribution, the result is called the sampling distribution of X̄. Alternatively, it is the
probability distribution of all the values of a sample statistic.
 We have sampling distributions of the mean, the proportion, etc.
Generally, the probability distribution of a statistic is called sampling distribution. The
probability distribution of X is called the sampling distribution of the mean. The sampling
distribution of a statistic depends on the size of the population, the size of the samples, and the
method of choosing the samples.
 Random Sample
The random variables X1, X2, …, Xn are a random sample of size n if:
a. the Xi's are independent random variables, and
b. every Xi has the same probability distribution (i.e. they are drawn from the same
population).
NOTE: the observed data x1, x2, …, xn are also referred to as a random sample.
 Statistic
A statistic is any function of the observations in a random sample.
Example:
The mean of X is a function of the observations (specifically, a linear combination of the
observations).
X̄ = (Σ_{i=1}^{n} Xi)/n = (1/n)X1 + (1/n)X2 + … + (1/n)Xn
A statistic is a random variable, and it has a probability distribution.
The distribution of a statistic is called the sampling distribution of the statistic because it depends
on the sample chosen.

Sampling Distribution of the sample mean

The first important sampling distribution to be considered is that of the mean X̄. The sampling
distribution of the sample mean is a theoretical probability distribution that shows the functional
relationship between the possible values of a given sample mean based on samples of size n and
the probability associated with each value, for all possible samples of size n drawn from that
particular population.
There are commonly three properties of interest of a given sampling distribution.
 Its Mean
 Its Variance
 Its Functional form.

Steps for the construction of Sampling Distribution of the mean


1. From a finite population of size N, randomly draw all possible samples of size n .
2. Calculate the mean for each sample.
3. Summarize the mean obtained in step 2 in terms of frequency distribution or relative
frequency distribution.

Example:
Suppose we have a population of size N = 5, consisting of the age of five children: 4, 6, 8, 10,
and 12.
Population mean, μ = 8
Population variance, σ² = 8
Take samples of size 2 with replacement and construct sampling distribution of the sample mean.
Solution:
N = 5, n = 2
We have Nⁿ = 5² = 25 possible samples since sampling is with replacement.
Step 1: Draw all possible samples:
4 6 8 10 12
4 4, 4 4, 6 4, 8 4, 10 4, 12
6 6, 4 6, 6 6, 8 6, 10 6, 12
8 8, 4 8, 6 8, 8 8, 10 8, 12
10 10, 4 10, 6 10, 8 10, 10 10, 12
12 12, 4 12, 6 12, 8 12, 10 12, 12


Step 2: Calculate the mean for each sample


4 6 8 10 12
4 4 5 6 7 8
6 5 6 7 8 9
8 6 7 8 9 10
10 7 8 9 10 11
12 8 9 10 11 12

Step 3: Summarize the mean obtained in step 2 in terms of frequency distribution.


X̄ᵢ   4   5   6   7   8   9   10   11   12
fᵢ    1   2   3   4   5   4    3    2    1

a. Find the mean of X̄, say μ_X̄:

μ_X̄ = Σ X̄ᵢfᵢ / Σfᵢ = 200/25 = 8 = μ

b. Find the variance of X̄, say σ²_X̄:

σ²_X̄ = Σ(X̄ᵢ − μ_X̄)²fᵢ / Σfᵢ = 100/25 = 4 ≠ 8 = σ²

Exercise: construct the sampling distribution taking sample size of 2 without replacement.
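The construction can be automated; the sketch below reproduces the with-replacement example, and swapping product for itertools.combinations(pop, 2) handles the without-replacement exercise:

from itertools import product
from statistics import mean, pvariance

pop = [4, 6, 8, 10, 12]
means = [mean(s) for s in product(pop, repeat=2)]   # all 25 sample means
print(mean(means), pvariance(means))                # 8.0 4.0 (= mu, sigma^2/n)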
Properties of the sampling distribution of the mean
1. The variance of the sample mean, σ²_X̄: given the population mean (μ), the
population variance (σ²), the sample size (n) and the population size (N), the
variance of the sample mean is given as follows:
 if sampling is with replacement:

σ²_X̄ = σ²/n  (show!)

and the standard deviation is given by σ_X̄ = σ/√n

 if sampling is without replacement:

σ²_X̄ = (σ²/n)((N − n)/(N − 1))

and the standard deviation is

σ_X̄ = (σ/√n)√((N − n)/(N − 1))

The value √((N − n)/(N − 1)) is referred to as the finite population correction factor.

2. In any case the sample mean is an unbiased estimator of the population mean, i.e.
statisticians are in agreement that the expected value of the sample mean is equal to
the population mean. Algebraically, μ_X̄ = μ, i.e. E(X̄) = μ (show!).
 Sampling may be from a normally distributed population or from a non-normally
distributed population.

Sampling from a Normal Distribution


 When sampling is from a normally distributed population, the distribution of X̄ will
possess the following properties:
i. the distribution of X̄ will be normal for any sample size;
ii. the mean of X̄ is equal to the population mean, i.e. μ_X̄ = μ;
iii. the variance of X̄ is equal to the population variance divided by the sample size:

σ²_X̄ = σ²/n

⇒ X̄ ∼ N(μ, σ²/n)

⇒ Z = (X̄ − μ)/(σ/√n) ∼ N(0, 1)
Example:
If the uric acid values in normal adult males are approximately normally distributed with mean
5.7 mgs and standard deviation 1mg find the probability that a sample of size 9 will yield a
mean.
i. greater than 6
ii. between 5 and 6
iii. less than 5.2


Solution:
Let X be the uric acid level of a normal adult male; μ = 5.7, σ = 1, n = 9.

⇒ X̄ ∼ N(μ, σ²/n) = N(5.7, 1/9), so σ_X̄ = σ/√n = 1/3

⇒ Z = (X̄ − μ)/(σ/√n) ∼ N(0, 1)

i. P(X̄ > 6) = P(Z > (6 − 5.7)/(1/3)) = P(Z > 0.9) = 0.1841

ii. P(5 < X̄ < 6) = P((5 − 5.7)/(1/3) < Z < (6 − 5.7)/(1/3)) = P(−2.1 < Z < 0.9) = 0.7980

iii. P(X̄ < 5.2) = P(Z < (5.2 − 5.7)/(1/3)) = P(Z < −1.5) = 0.0668
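These probabilities can be checked with SciPy, using X̄ ∼ N(5.7, (1/3)²); a sketch:

from scipy.stats import norm

xbar = norm(loc=5.7, scale=1/3)     # sampling distribution of the mean
print(1 - xbar.cdf(6))              # i.   ~0.1841
print(xbar.cdf(6) - xbar.cdf(5))    # ii.  ~0.798
print(xbar.cdf(5.2))                # iii. ~0.0668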

If we are sampling from a population with unknown distribution, either normal or non-normal,
the sampling distribution of X̄ will still be approximately normal with mean μ and variance σ²/n,
provided that the sample size is large. This amazing result is an immediate consequence of the
following theorem, called the central limit theorem.

Central Limit Theorem

Given a population of any functional form with mean μ and finite variance σ², the sampling
distribution of X̄, computed from samples of size n from the population, will be approximately
normally distributed with mean μ and variance σ²/n when the sample size is large. That is,

Z = (X̄ − μ)/(σ/√n)

approaches the standard normal distribution N(0, 1) as n → ∞.

The normal approximation for X will generally be good if n ≥ 30. If n < 30, the approximation is
good only if the population is not too different from a normal distribution and, as stated above, if
the population is known to be normal, the sampling distribution of X will follow a normal
distribution exactly, no matter how small the size of the samples.
The sample size n = 30 is a guideline to use for the central limit theorem. However, as the
statement of the theorem implies, the presumption of normality on the distribution of X becomes
more accurate as n grows larger. The following figure illustrates how the theorem works. It
shows how the distribution of X becomes closer to normal as n grows larger, beginning with the
clearly non-symmetric distribution of that of an individual observation (n = 1). It also illustrates
that the mean of X remains μ X for any sample size and the variance of X gets smaller as n
increases.
As one might expect, the distribution of X will be near normal for sample size n < 30 if the
distribution of an individual observation itself is close to normal.

Illustration of the central limit theorem (distribution of X for n = 1, moderate n, and large n).

Example:
1. An electrical firm manufactures light bulbs that have a length of life that is approximately
normally distributed, with mean equal to 800 hours and a standard deviation of 40 hours.
Find the probability that a random sample of 16 bulbs will have an average life of less
than 775 hours.
Solution:
The sampling distribution of X will be approximately normal, with μ X = 800 and σ = 40/√ 16
= 10. The desired probability is given by the area of the shaded region in the following
figure.
Corresponding to x = 775, we find that;

775−800
z= =−2.5
10
And therefore
P ( X <775 ) =P ( Z <−2.5 ) =0.0062
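A small simulation illustrating the theorem for a clearly non-normal population (an exponential distribution; the values below are illustrative, assuming NumPy):

import numpy as np

rng = np.random.default_rng(1)
n = 30
means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

print(means.mean())   # close to mu = 1.0
print(means.std())    # close to sigma/sqrt(n) = 1/sqrt(30) ~ 0.183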

Law of large numbers


To deal with the Law of Large Numbers, we first need an important inequality called the
Chebyshev Inequality.

Chebyshev Inequality
Let X be a random variable with mean μ and standard deviation σ. Then for any positive number k,

P(|X − μ| ≤ k) ≥ 1 − σ²/k²

Equivalently,

P(|X − μ| ≥ k) ≤ σ²/k²

This is an important result in probability and will be especially useful in our proof of the Weak
Law of Large Numbers.
Example:
1. Let X be any random variable with E(X) = μ and V(X) = σ². Then, if k = cσ, the Chebyshev
inequality states that

P(|X − μ| ≥ cσ) ≤ σ²/(c²σ²) = 1/c²

Thus, for any random variable, the probability of a deviation from the mean of more than c
standard deviations is ≤ 1/c². If, for instance, c = 4, then 1/c² = 0.0625.
2. Let X be a random variable that follows a Poisson distribution with parameter θ = 7. Give
a lower bound for P(|X − μ| ≤ 4).
For a Poisson distribution with parameter 7 we have μ = σ² = 7. Then,
P(3 ≤ X ≤ 11) = P(|X − 7| ≤ 4) = P(|X − μ| ≤ 4) ≥ 1 − σ²/4²
= 1 − 7/16 = 0.5625.
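The Chebyshev bound is typically conservative; comparing it with the exact Poisson probability (a sketch, assuming SciPy):

from scipy.stats import poisson

p_exact = poisson.cdf(11, mu=7) - poisson.cdf(2, mu=7)   # P(3 <= X <= 11)
print(p_exact)   # ~0.917, well above the Chebyshev lower bound 0.5625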
Law of large numbers (LLN)

Actually, this concept comprises the Weak and Strong Law of Large Numbers as well as
Kolmogorov's Law of Large Numbers. But, for the purpose of this module we will see and prove
the Weak Law using two different methods. The first proof uses Chebyshev inequality, and the
second uses what is known as characteristic functions.

Generally, consider a sequence X1, X2, X3, … of independent random variables with the same
distribution, let μ = E(X) be the common expected value, with finite variance σ², and denote by Sn the sums

Sn = X1 + X2 + … + Xn.

The law of large numbers states that

P(|Sn/n − μ| ≥ ε) → 0 as n → ∞,

or equivalently,

P(|Sn/n − μ| < ε) → 1 as n → ∞,

for every ε > 0.


Expressed in words;
The mean value of a sample from any given distribution converges to the expected value of that
distribution when the size n of the sample approaches ∞ .

Definition: Random variables X1, X2, X3…Xn are said to be independent and identically
distributed or i.i.d. if each random variable Xi has the same probability distribution as X1 and the
occurrence of one does not affect the probability of another.

Proof of the weak law using the Chebyshev inequality

Let X1, X2, X3, …, Xn be a sequence of i.i.d. random variables, each with mean E(Xi) = μ and
standard deviation σ, and define X̄n = (X1 + X2 + … + Xn)/n. The weak law of large numbers
(WLLN) states that, for all k > 0,

lim(n→∞) P(|X̄n − μ| > k) = 0.

Proof: Assume that Var(Xi) = σ² < ∞ for all i.

Since X1, X2, X3, …, Xn are i.i.d., there is no correlation among them; thus

Var(X̄n) = Var((X1 + X2 + … + Xn)/n)
        = (1/n²) Var(X1 + X2 + X3 + … + Xn)
        = (1/n²) [Var(X1) + Var(X2) + Var(X3) + … + Var(Xn)]
        = (1/n²) (σ² + σ² + σ² + … + σ²)
        = nσ²/n²
        = σ²/n.

We note that the mean of each Xi in the sequence is also the mean of the sample average, i.e.
E(X̄n) = μ. We can now apply Chebyshev's inequality to X̄n to get, for all k > 0,

P(|X̄n − μ| > k) ≤ σ²/(nk²),

which → 0 as n → ∞.
Which → 0 as n → ∞
Definition: A characteristic function of a probability distribution is given by

γX(t) = E(e^(itX)),

where i = √−1 and X is a random variable with the given distribution.

Proof of the weak law using characteristic functions:

Let X1, X2, X3, …, Xn be a sequence of i.i.d. random variables with E|X1| < ∞ and μ = E(X1).
Again define X̄n = (X1 + X2 + … + Xn)/n, n ≥ 1, and let γ(t) = E(e^(itX1)) for t ∈ (−∞, ∞).
Then, as n → ∞,

γ(t/n) = E(e^(i(t/n)X1))
       = E(1 + i(t/n)X1 + o(1/n))
       = 1 + i(t/n)μ + o(1/n).

Thus,

γX̄n(t) = E(e^(itX̄n))
       = E(e^(i(t/n)(X1 + X2 + … + Xn)))
       = E(∏(j=1 to n) e^(i(t/n)Xj))
       = ∏(j=1 to n) E(e^(i(t/n)Xj))        (since X1, X2, X3, …, Xn are i.i.d.)
       = [γ(t/n)]^n
       = (1 + itμ/n + o(1/n))^n.

Now recall that (1 + a/n)^n → e^a, so (1 + itμ/n + o(1/n))^n → e^(itμ). Hence
γX̄n(t) → γD(t) = e^(itμ), where P(D = μ) = 1, and thus X̄n →P μ.

Example: die rolling. Consider n rolls of a die, and let Xi be the outcome of the ith roll. Then
Sn = X1 + X2 + … + Xn is the sum of the first n rolls. The rolls are i.i.d. with E(Xi) = 7/2,
so by the LLN, for any k > 0,

P(|Sn/n − 7/2| ≥ k) → 0 as n → ∞.

This can be restated as: for any k > 0,

P(|Sn/n − 7/2| < k) → 1 as n → ∞.
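A small simulation makes the law concrete. The Python sketch below is illustrative only (it assumes NumPy and an arbitrary seed) and shows the running mean of simulated die rolls drifting toward E(X) = 3.5:

```python
# Sketch: LLN for die rolls -- the sample mean S_n / n approaches 3.5.
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)   # fair die: faces 1..6
for n in (10, 1_000, 100_000):
    print(n, rolls[:n].mean())             # means settle near 3.5
```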
The strong law of large numbers

Consider a sequence X1, X2, X3, … of independent random variables with the same distribution,
and let μ = E(X) be the common expected value and σ the standard deviation. Then

P(lim(n→∞) Sn/n = μ) = 1.
The proof of the Strong Law of Large Numbers (SLLN) is much more complex than that of the
WLLN as it is much more powerful. However something should be noted.

Kolmogorov's Strong Law of Large Numbers
Let X1, X2, …, Xn, … be a sequence of independent random variables with |E(Xn)| < ∞ for n ≥ 1.
Then the SLLN holds if one of the following is satisfied:
1. The random variables are identically distributed, or
2. For all n, Var(Xn) < ∞ and Σ(n=1 to ∞) Var(Xn)/n² < ∞.
Like SLLN, the proof of Kolmogorov's is very complex and as such will not be covered in this
module.

CHAPTER SIX

6. ESTIMATION
Introduction
We now come to the heart of the matter, statistical inference. So far, given values of the
population parameters μ and σ², we have asked for the probability that the sample mean X̄, from a
sample of size n, is greater than some specified value or falls within some range of values. The
parameters μ and σ² are assumed known, and the objective is to draw conclusions about possible
values of X̄. In practice, however, it is usually the sample values X̄ and s² that are known,
while the population parameters μ and σ² are not. An interesting question to ask is therefore:
given the values of X̄ and s², what can be said about μ and σ²? This is called statistical
inference. Inference is the process of making interpretations or conclusions from sample data
about the totality of the population or its parameters. Alternatively, statistical inference is
the act of generalizing from the data (“sample”) to a larger phenomenon (“population”) with a
calculated degree of certainty. The act of generalizing and deriving statistical judgments is
the process of inference.
Note: There is a distinction between causal inference and statistical inference. Here we consider
only statistical inference.
Sometimes the population variance is known, and inferences have to be made about μ alone.
For instance, if a sample of 100 Ethiopian families find an average weekly expenditure on food (
X ) of 250 birr with a standard deviation (s) of 48 birr, what can be said about the average
expenditure (μ) of all Ethiopian families?
Diagrammatically this type of problem is shown as follows:


Numeri Analyz Populati


Sample
cal data ed data Inference on

Generally, in statistics there are two ways in which inference can be made.
 Statistical estimation
 Statistical hypothesis testing.

This chapter covers the estimation of population parameters such as μ and σ 2 from the sample
statistic while the coming chapter describes statistical hypotheses testing about these
parameters. The two concepts are very closely related statistical terms.

Learning objectives:
After successful completion of this chapter, students will be able to;
 Differentiate between point and interval estimation.
 Show how point estimates can be calculated for the population parameters.
 Elaborate the desirable properties of the estimators
 Describe how confidence intervals are computed for the population parameters.
 Indicate how to determine how large the significance level must be in order for an
estimate to have a desired level of accuracy.

6.1 Statistical estimation

This is one way of making inference about the population parameter where the investigator does
not have any prior notion about values or characteristics of the population parameter.
Basically there are two ways of conducting estimation.
i. Point Estimation
ii. Interval estimation

Point Estimation

Point estimation draws inferences about a population by estimating the value of an unknown

population parameter using a single value or point. It is a procedure that results in a single value
as an estimate for a parameter. Point estimation is the one which is most prevalent in everyday
activities; for instance, the average number of economics instructors surfing the internet in Raya
University per day for an hour. Although this is presented as a truth, it is actually an estimate,
obtained from a survey of instructor’s use of either personal computers or their mobile phones.
Since it is obtained from a sample there must be some doubt about its accuracy of being the
population average. As a result, interval estimates are also used, which give some information
about the likely accuracy of the estimate.
Example: The following table shows the age of a simple random sample of 18 economics
students in Raya University in years. The figures are obtained from sampling, not from a census.
Use the data to estimate the populations mean age, μ, of all economics students.

18 22 19 20 21 18
23 21 22 24 18 20
17 23 18 19 21 19

Solution: We estimate the populations mean age, μ, of all economics students by the sample
mean age, x of the 18 students sampled. From the Table above,

x̄ = Σxi/n = 363/18 = 20.17
Interpretation: Based on the sample data, we estimate the mean age, μ, of all economics
students to be approximately 20.17 years.
An estimate of this kind is called a point estimate for μ because it consists of a single number, or
point.
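For completeness, the same point estimate can be computed directly; a minimal Python sketch (assuming NumPy) is:

```python
# Sketch: the point estimate of mu from the age data above.
import numpy as np

ages = [18, 22, 19, 20, 21, 18, 23, 21, 22, 24, 18, 20,
        17, 23, 18, 19, 21, 19]
print(round(float(np.mean(ages)), 2))   # 20.17
```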

Interval estimation
It is the procedure that results in the interval of values as an estimate for a parameter, which is
interval that contains the likely values of a parameter. It deals with identifying the upper and
lower limits of a parameter. The limits themselves are random variables. Interval estimates are
better in one way for the user of statistics, since they not only show the estimate of the
parameter but also give an idea of the confidence which the researcher has in that estimate.
For instance, suppose we are trying to estimate the mean summer income of students. Then, an
interval estimate might say that the (unknown) mean income is between 1000 and 2000 birr with
probability 0.95.

Estimator and Estimate


In order to estimate a parameter such as the population mean, a rule (or set of rules) is required
which describes how to derive the estimate of the parameter from the sample data. Such a rule is
known as an estimator. An estimator is the rule or random variable that helps us approximate a
population parameter, while an estimate is one of the possible values that an estimator can
assume.

Example: From the above example, the sample mean X̄ = ΣXi/n is an estimator for the
population mean, and x̄ = 20.17 is an estimate, which is one of the possible values of X̄.

6.2 Methods of Point Estimation

Generally, there are two principal methods of point estimation: the method of moments and the
method of maximum likelihood. Though there are other methods of point estimation, such as the
method of least squares and the Bayesian method, for the purpose of this module and this course
we will focus on the first two constructive methods of obtaining point estimators. By
constructive we mean that the general definition of each type of estimator suggests explicitly
how to obtain the estimator in any specific problem.

6.2.1 Method of moment

The method of moments is a very intuitive approach to the problem of parameter estimation: the
argument is as follows. On the one hand, we have defined population moments as expressions of
the variability in a population; on the other hand, for random samples drawn from such
populations, we have calculated sample moments to summarize the data. Population moments

involve unknown parameters; sample moments are numbers. The method of moments matches
analogous moments to obtain estimates for the unknown parameters.
For instance, μ1 is the population mean and X̄ is the sample mean, which are the first moments of
the population and the sample respectively. Therefore, the method of moments estimates
parameters by finding expressions for them in terms of the lowest possible order moments, and
then equating these moments to the corresponding sample moments; for example, equate μ1 to X̄,
equate μ2 to (1/n)ΣXi², etc. That is,

μ1 = M1,  μ2 = M2,  μ3 = M3,  and so on, i.e.

E(X) = (1/n)ΣXi,  E(X²) = (1/n)ΣXi²,  E(X³) = (1/n)ΣXi³.

Continue this until there are enough equations to solve for the unknown parameters.
 Procedures to find MoM estimators
Step 1: Identify how many parameters the distribution has. (Let's say m).
Step 2: Find the first m population moments (using econ 1041 knowledge).
Step 3: Equalize each of the population moments to the corresponding sample moment.
Step 4: Solve the system to find solutions to the parameters.
Step 5: The solutions are the MoM estimators.

Definition: Let X1, X2… Xn be a random sample from a distribution with pmf or pdf f (x; θ1. . .
θm), where θ1, . . . , θm are parameters whose values are unknown. Then the moment estimators
θ^ 1. . . θ^ m are obtained by equating the first m sample moments to the corresponding first m
population moments and solving for θ1, . . . , θm.
If, for example, m = 2, E(X) and E(X²) will be functions of θ1 and θ2. Setting
E(X) = (1/n)ΣXi (= X̄) and E(X²) = (1/n)ΣXi² gives two equations in θ1 and θ2. The solution
then defines the estimators.
Example:
1. Let Xi ~ N(μ, σ²). Find the method of moments estimators of μ and σ².
We have

μ1 = E(X) = μ,  μ2 = E(X²) = σ² + μ²,
m1 = X̄  and  m2 = (1/n)ΣXi².

Therefore, the method of moments estimators of μ and σ² satisfy the equations

μ̂ = X̄,
σ̂² + μ̂² = (1/n)ΣXi².

Thus, we find

μ̂ = X̄,
σ̂² = (1/n)ΣXi² − X̄² = (1/n)Σ(Xi − X̄)².

2. Let Xi ~ Bernoulli(p), i.i.d., i = 1, 2, …, n. Derive the MoM estimator of p.

Solution: E(X) = p = (X1 + X2 + … + Xn)/n ⇒ p̂ = ΣXi/n = X̄.
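The normal example above can be mimicked numerically. The following sketch is illustrative only (the simulated data, seed, and true parameter values are assumptions, not part of the module); it equates the first two sample moments to μ and σ² + μ²:

```python
# Sketch: method-of-moments estimates for a normal sample.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5, scale=2, size=10_000)     # "population": mu = 5, sigma^2 = 4
mu_hat = x.mean()                               # first sample moment m1
sigma2_hat = (x ** 2).mean() - mu_hat ** 2      # m2 - m1^2
print(round(mu_hat, 3), round(sigma2_hat, 3))   # close to 5 and 4
```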

6.2.2 Maximum likelihood estimation

The method of maximum likelihood was first introduced by R. A. Fisher, a geneticist and
statistician, in the 1920s. Most statisticians recommend this method, at least when the sample
size is large, since the resulting estimators have certain desirable efficiency properties.

We estimate model parameters by maximizing the likelihood, which is the joint probability
density function of the random sample; the resulting point estimators can be regarded as coming
from the distribution most likely to have generated the given sample.

Let X1, X2, …, Xn be a random sample from a population with pmf or pdf f(x; θ). Then the
likelihood function is defined as

L(θ | x1, x2, …, xn) = f(x1, x2, …, xn | θ)
                     = P(X1 = x1) P(X2 = x2) … P(Xn = xn)   (discrete case), or
                     = f(x1; θ) f(x2; θ) … f(xn; θ)
                     = ∏(i=1 to n) f(xi | θ)  or  ∏(i=1 to n) P(Xi = xi), and

log L(θ | x1, x2, …, xn) = Σ(i=1 to n) log P(Xi = xi).

To find an MLE:
 Write down the likelihood function (i.e. the joint distribution)
 Take the natural logarithm of the likelihood and simplify
 Differentiate with respect to the appropriate parameters, set to 0, and solve
 Write resulting functions (RVs) as your estimator.
The likelihood function is the joint pdf or pmf of the sample observations. The maximum
likelihood estimate (MLE) of a parameter θ is obtained by maximizing the likelihood function
with respect to θ.
Computational trick: Instead of maximizing L(θ) with respect to θ, we often maximize
ln L(θ) = Σi log f(xi | θ) with respect to θ. Since the log function is monotonic, finding the θ
that maximizes L(x1, x2, …, xn | θ) is equivalent to finding the θ that maximizes
ln L(x1, x2, …, xn | θ). We typically find where the maximum occurs by differentiating with
respect to the parameter(s) and setting the derivative(s) to zero. Doing this for ln L instead
of L is simpler since we have a sum to differentiate rather than a product.

Example:
1. Suppose that the population has a Poisson distribution with unknown parameter λ. find
the ML estimator of λ.
P(x; θ) = P(x; λ) = (λ^x / x!) e^(−λ),  x = 0, 1, 2, …,

for some λ > 0. To estimate λ by MLE, let's proceed as follows.
STEP 1: Calculate the likelihood function L (λ).

L(λ) = ∏(i=1 to n) P(xi; λ) = ∏(i=1 to n) (λ^(xi)/xi!) e^(−λ) = (λ^(x1+x2+…+xn) / (x1! x2! ⋯ xn!)) e^(−nλ)
STEP 2: Calculate the log-likelihood log L (λ).
log L(λ) = Σ(i=1 to n) xi log λ − nλ − Σ(i=1 to n) log xi!

STEP 3: Differentiate log L (λ) with respect to λ, and equate the derivative to zero to find the
MLE.
d/dλ log L(λ) = (Σ(i=1 to n) xi)/λ − n = 0  ⇒  λ̂ = (1/n) Σ(i=1 to n) xi = X̄

Thus the maximum likelihood estimate of λ is λ̂ = x̄.


Check that the second derivative of log L(λ) with respect to λ is negative at λ = λ̂:

d²/dλ² log L(λ) = −(1/λ²) Σ(i=1 to n) xi < 0 at λ = λ̂.
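The result λ̂ = x̄ can be verified numerically by scanning the log-likelihood over a grid of candidate λ values. This Python sketch assumes SciPy and simulated data (the true λ = 4 and the seed are illustrative assumptions):

```python
# Sketch: the Poisson log-likelihood peaks at lambda = x-bar.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
x = rng.poisson(lam=4, size=500)
grid = np.linspace(1, 8, 701)                          # candidate lambda values
loglik = [poisson.logpmf(x, lam).sum() for lam in grid]
print(grid[int(np.argmax(loglik))], x.mean())          # both near 4
```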
2. Let Xi ~ N(μ, σ²), i.i.d., i = 1, 2, …, n; derive the MLE for μ and σ².

Solution:
i. f(xi) = (1/√(2πσ²)) exp(−(xi − μ)²/(2σ²)),  xi ∈ R, i = 1, 2, …, n

ii. Likelihood function:
L = f(x1, x2, …, xn) = ∏(i=1 to n) f(xi) = (2πσ²)^(−n/2) exp(−Σ(i=1 to n)(xi − μ)²/(2σ²))

iii. Log-likelihood function:
l = ln L = −(n/2) ln(2π) − (n/2) ln σ² − Σ(i=1 to n)(xi − μ)²/(2σ²)

iv. Setting the partial derivatives to zero:
∂l/∂μ = Σ(i=1 to n)(xi − μ)/σ² = 0
∂l/∂σ² = −n/(2σ²) + Σ(i=1 to n)(xi − μ)²/(2σ⁴) = 0

⇒ μ̂ = X̄  and  σ̂² = Σ(i=1 to n)(Xi − X̄)²/n
Remarks: Advantages/Disadvantages between the MOME and MLE methods


 The ML estimators have higher probability of being close to the quantities to be
estimated.
 However, in some cases (e.g. gamma distribution) the likelihood equations may be
intractable without computers; in contrast the MOM estimators can be quickly and easily
calculated by hand.
 Estimates by the MOM may be used as first approximations to the solutions of the
likelihood equations, and successive improved approximations may then be found by
iterative approaches to solving the likelihood equations (e.g. the Newton-Raphson method).
 In some cases (infrequent with large samples) the estimates given by the MOM are
outside of the parameter space; it does not make sense to rely on them then. That problem
never arises in the method of ML.

6.3 Desirable Properties of Estimators

Why are statistical properties of estimators important? These statistical properties are extremely
important because they provide criteria for choosing among alternative estimators. Knowledge of

these properties is therefore essential to understanding why we use the estimation procedures we
do in econometric analysis.
In general there can be several alternative procedures that can be adopted to obtain the point
estimate of the population parameter. For instance, we may compute arithmetic mean, median or
geometric mean to estimate the population mean. As we never know the true value of the
parameter, we cannot claim that our estimate is exactly correct. Therefore, if there is more
than one estimator, the question arises which among them is better. This means that we must
stipulate some criteria which can be applied to decide whether one estimator is better than
another; although an estimator is not expected to estimate the population parameter without
error, we do not expect it to be far off either.
When we want to study the properties of the obtained estimators, it is convenient to distinguish
between two categories of properties: i) the small (or finite) sample properties, which are valid
whatever the sample size, and ii) the asymptotic properties, which are associated with large
samples, i.e., when n tends to ∞.
Generally, there are two categories of statistical properties of estimators.
1. Small-sample, or finite-sample, properties of estimators
The most fundamental desirable small-sample properties of an estimator are:
 Unbiasedness
 Minimum Variance
 Efficiency
2. Large-sample, or asymptotic, properties of estimators
The most important desirable large-sample property of an estimator is:
 Consistency
Both sets of statistical properties refer to the properties of the sampling distribution, or
probability distribution, of the estimator θ^ for different sample sizes.

Small-Sample (Finite-Sample) Properties


The small-sample, or finite-sample, properties of the estimator θ^ refer to the properties of the
sampling distribution of θ^ for any sample of fixed size n, where n is a finite number (i.e., a
number less than infinity) denoting the number of observations in the sample.
n = number of sample observations, where n < ∞.

Definition: The sampling distribution of θ̂ for any finite sample size n < ∞ is called the small-
sample, or finite-sample, distribution of the estimator θ̂. In fact, there is a family of
finite-sample distributions for the estimator θ̂, one for each finite value of n.
The sampling distribution of θ̂ is based on the concept of repeated sampling.
 Suppose a large number of samples of size n are randomly selected from some underlying
population.
 Each of these samples contains n observations.
 Each of these samples in general contains different sample values of the observable
random variables that enter the formula for the estimatorθ^ .
For each of these samples of n observations, the formula for θ^ is used to compute a numerical
estimate of the population parameter θ.
Each sample yields a different numerical estimate of the unknown parameter θ.
Why? Because each sample typically contains different sample values of the observable random
variables that enter the formula for the estimatorθ^ .
If we tabulate or plot these different sample estimates of the parameter θ for a very large number
of samples of size n, we obtain the small-sample, or finite-sample, distribution of the estimatorθ^ .

Large-Sample (Asymptotic) Properties


The large-sample, or asymptotic, properties of the estimator θ^ refer to the properties of the
sampling distribution of θ^ as the sample size n becomes indefinitely large, i.e., as sample size n
approaches infinity (as n→∞).
Definition: The probability distribution to which the sampling distribution of θ^ converges as
sample size n becomes indefinitely large (i.e., as n →∞) is called the asymptotic, or large-
sample, distribution of the estimatorθ^ .
The properties of the asymptotic distribution of θ^ are what we call the large sample, or
asymptotic, properties of the estimatorθ^ .
To define all the large-sample properties of an estimator, we need to distinguish between two
large-sample distributions of an estimator:
 the asymptotic distribution of the estimator;
 the ultimate, or final, distribution of the estimator

Small-Sample Properties
 Nature of Small-Sample Properties
The small-sample, or finite-sample, distribution of the estimator θ^ for any finite sample size n <
∞ has
 a mean, or expectation, denoted as E(θ^ ), and
 a variance denoted as Var(θ^ ).
 The small-sample properties of the estimator θ^ are defined in terms of the mean E (θ^ )
and the variance Var (θ^ ) of the finite-sample distribution of the estimatorθ^ .

Unbiasedness
Definition of Unbiasedness: The estimator θ^ is an unbiased estimator of the population
parameter θ if the mean or expectation of the finite-sample distribution of θ^ is equal to the true θ.
That is, θ^ is an unbiased estimator of θ if,
E (θ^ ) = θ for any given finite sample size n < ∞.
Definition of the Bias of an Estimator: The bias of the estimator θ^ is defined as,
Bias (θ^ ) = E (θ^ ) − θ = the mean of θ^ minus the true value of θ.
The estimator θ^ is an unbiased estimator of the population parameter θ if the bias of θ^ is equal
to zero; i.e., if,
Bias(θ̂) = E(θ̂) − θ = 0 ⇔ E(θ̂) = θ.

Alternatively, the estimator θ^ is a biased estimator of the population parameter θ if the bias of θ^
is non-zero; i.e., if,

Bias(θ̂) = E(θ̂) − θ ≠ 0 ⇔ E(θ̂) ≠ θ.


1. The estimator θ^ is an upward biased (or positively biased) estimator of the population
parameter θ if the bias of θ^ is greater than zero; i.e., if,
Bias(θ̂) = E(θ̂) − θ > 0 ⇔ E(θ̂) > θ.

2. The estimator θ^ is a downward biased (or negatively biased) estimator of the population
parameter θ if the bias of θ^ is less than zero; i.e., if

Bias(θ̂) = E(θ̂) − θ < 0 ⇔ E(θ̂) < θ.


Meaning of the Unbiasedness Property
 The estimator θ^ is an unbiased estimator of θ if on average it equals the true
parameter value θ.
This means that on average the estimator θ^ is correct, even though any single estimate of θ for a
particular sample of data may not equal θ. More technically, it means that the finite-sample
distribution of the estimator θ^ is centered on the value θ, not on some other real value.
 The bias of an estimator is an inverse measure of its average accuracy. The
smaller in absolute value is Bias (θ^ ), the more accurate on average is the estimator
θ^ in estimating the population parameter θ.
Thus, an unbiased estimator for which Bias ( θ^ ) = 0 that is, for which E (θ^ ) = θ, is on average a
perfectly accurate estimator of θ. Given a choice between two estimators of the same population
parameter θ, of which one is biased and the other is unbiased, we prefer the unbiased estimator
because it is more accurate on average than the biased estimator.

Steps to check whether a given estimator is unbiased or not


1. Draw all possible samples of a given size from the population.
2. Calculate the value of given estimator for all these samples separately.
3. Take the average of all these values obtained in step 2.
If this average is equal to the population parameter, then the estimator is unbiased and if this
average is more than the population parameter, then the estimator is said to be positively biased
and when this average is less than the population parameter, it is said to be negatively biased.

Minimum Variance
Definition of Minimum Variance: The estimator θ^ is a minimum-variance estimator of the
population parameter θ if the variance of the finite-sample distribution of θ^ is less than or equal

to the variance of the finite-sample distribution of θ̃, where θ̃ is any other estimator of the
population parameter θ; i.e., if

Var(θ̂) ≤ Var(θ̃) for all finite sample sizes n such that 0 < n < ∞,

where
Var(θ̂) = E[θ̂ − E(θ̂)]² = the variance of the estimator θ̂, and
Var(θ̃) = E[θ̃ − E(θ̃)]² = the variance of any other estimator θ̃.

Note: Either or both of the estimators θ̂ and θ̃ may be biased. The minimum variance property
implies nothing about whether the estimators are biased or unbiased.

Meaning of the Minimum Variance Property


The variance of an estimator is an inverse measure of its statistical precision, i.e., of its
dispersion or spread around its mean.
The smaller the variance of an estimator, the more statistically precise it is.
A minimum variance estimator is therefore the statistically most precise estimator of an unknown
population parameter, although it may be biased or unbiased.

Efficiency
 A Necessary condition for Efficiency is Unbiasedness
The small-sample property of efficiency is defined only for unbiased estimators.
Therefore, a necessary condition for efficiency of the estimator θ^ is that
E (θ^ ) = θ, i.e., θ^ must be an unbiased estimator of the population parameter θ.
Definition of Efficiency: Efficiency = Unbiasedness + Minimum Variance
Verbal definition: If θ̂ and θ̃ are two unbiased estimators of the population parameter θ, then
the estimator θ̂ is efficient relative to the estimator θ̃ if the variance of θ̂ is smaller than
the variance of θ̃ for any finite sample size n < ∞.

Formal definition: Let θ̂ and θ̃ be two unbiased estimators of the population parameter θ, such
that E(θ̂) = θ and E(θ̃) = θ. Then the estimator θ̂ is efficient relative to the estimator θ̃ if the

variance of the finite-sample distribution of θ̂ is less than or at most equal to the variance
of the finite-sample distribution of θ̃; i.e., if

Var(θ̂) ≤ Var(θ̃) for all finite n, where E(θ̂) = θ and E(θ̃) = θ.

Note: Both of the estimators θ̂ and θ̃ must be unbiased, since the efficiency property refers
only to the variances of unbiased estimators.
Meaning of the Efficiency Property
Efficiency is a desirable statistical property because, of two unbiased estimators of the same
population parameter, we prefer the one that has the smaller variance, i.e., the one that is
statistically more precise.
In the above definition of efficiency, if θ̃ is any other unbiased estimator of the population
parameter θ, then the estimator θ̂ is the best unbiased, or minimum-variance unbiased, estimator
of θ.
Large-Sample Properties
Nature of Large-Sample Properties
The large-sample properties of an estimator are the properties of the sampling distribution of that
estimator as sample size n becomes very large, as n approaches infinity, as n→∞. Recall that the
sampling distribution of an estimator differs for different sample sizes i.e., for different values of
n.
The sampling distribution of a given estimator for one sample size is different from the sampling
distribution of that same estimator for some other sample size.
Consider the estimator θ̂ for two different values of n, n1 and n2.
In general, the sampling distributions of θ̂(n1) and θ̂(n2) are different; they can have

 different means
 different variances
 different mathematical forms
Desirable Large-Sample Properties
 Consistency
 Asymptotic Unbiasedness;

 Asymptotic Efficiency.
We need to distinguish between two large-sample distributions of an estimator:
 the asymptotic distribution of the estimator;
 the ultimate, or final, distribution of the estimator.

The Ultimate Distribution of an Estimator


The asymptotic distribution of an estimator θ^ is not necessarily the final or ultimate form that
the sampling distribution of the estimator takes as sample size n approaches infinity (as n →∞).
For many estimators, the sampling distribution collapses to a single point as sample size n
approaches infinity.
More specifically, as sample size n → ∞, the sampling distribution of the estimator collapses to a
column of unit probability mass (or unit probability density) at a single point on the real line.
Such an estimator is said to converge in probability to some value.
A distribution that is completely concentrated at one point (on the real line) is called a degenerate
distribution. Graphically, a degenerate distribution is represented by a perpendicular to the real
line with height equal to one.

Distinguishing Between an Estimator's Asymptotic and Ultimate Distributions


The ultimate distribution of an estimator is the distribution to which the sampling distribution of
the estimator finally converges as sample size n "reaches" infinity. The asymptotic distribution of
an estimator is the distribution to which the sampling distribution of the estimator converges just
before it collapses (if indeed it does) as sample size n "approaches" infinity.
If the ultimate distribution of an estimator is degenerate, then its asymptotic distribution is not
identical to its ultimate (final) distribution. But if the ultimate distribution of an estimator is non-
degenerate, then its asymptotic distribution is identical to its ultimate (final) distribution.

Consistency
A Necessary Condition for Consistency
Let θ̂n be an estimator of the population parameter θ based on a sample of size n observations.

A necessary condition for consistency of the estimator θ^ n is that the ultimate distribution of θ^ n be
a degenerate distribution at some point on the real line, meaning that it converges to a single
point on the real line.
The Probability Limit of an Estimator……… the plim of an estimator
Short Definition: The probability limit of an estimator θ^ n is the value or point on the real line to
which that estimator's sampling distribution converges as sample size n increases without limit.
Long definition: If the estimator θ̂n has a degenerate ultimate distribution, i.e., if the
sampling distribution of θ̂n collapses on a single point as sample size n → ∞, then the point on
which the sampling distribution of θ̂n converges is called the probability limit of θ̂n, denoted
plim θ̂n or plim(n→∞) θ̂n, where "plim" means "probability limit".

Formal definition of probability limit: The point θ0 on the real line is the probability limit
of the estimator θ̂n if the ultimate sampling distribution of θ̂n is degenerate at the point θ0,
meaning that the sampling distribution of θ̂n collapses to a column of unit density on the point
θ0 as sample size n → ∞.
 This definition can be written concisely as plim θ̂n = θ0, or plim(n→∞) θ̂n = θ0.

The statements above are equivalent to the statement

lim(n→∞) Pr(θ0 − ε ≤ θ̂n ≤ θ0 + ε) = lim(n→∞) Pr(−ε ≤ θ̂n − θ0 ≤ ε) = lim(n→∞) Pr(|θ̂n − θ0| ≤ ε) = 1,

where ε > 0 is an arbitrarily small positive number and |θ̂n − θ0| denotes the absolute value of
the difference between the estimator θ̂n and the point θ0.


This statement means that as sample size n→ ∞, the probability that the estimator θ^ n is arbitrarily
close to the point θ0 approaches one. The more technical way of saying this is that the estimator
θ^ n converges in probability to the pointθ0 .

Definition of Consistency
Verbal Definition: The estimator θ^ n is a consistent estimator of the population parameter θ if its
sampling distribution collapses on, or converges to, the value of the population parameter θ as n
→ ∞.
Formal definition: The estimator θ̂n is a consistent estimator of the population parameter θ if
the probability limit of θ̂n is θ, i.e., if

plim(n→∞) θ̂n = θ, or lim(n→∞) Pr(|θ̂n − θ| ≤ ε) = 1, or Pr(|θ̂n − θ| ≤ ε) → 1 as n → ∞.

The estimator θ^ n is a consistent estimator of the population parameter θ if,


 the probability that θ^ n is arbitrarily close to θ approaches 1 as the sample size n → ∞ or,
 If the estimator θ^ n converges in probability to the population parameter θ.

Meaning of the Consistency Property


As sample size n becomes larger and larger, the sampling distribution of θ^ n becomes more and
more concentrated around θ. As sample size n becomes larger and larger, the value of θ^ n is more
and more likely to be very close to θ.

A Sufficient Condition for Consistency


One way of determining if the estimator θ^ n is consistent is to trace the behavior of the sampling
distribution of θ^ n as sample size n becomes larger and larger.
→If as n → ∞ (sample size n approaches infinity) both the bias of θ^ n and the variance of θ^ n

approach zero, then θ^ n is a consistent estimator of the parameter θ.


→ Recall that the bias of θ̂n is defined as

Bias(θ̂n) = E(θ̂n) − θ.

Thus, the bias of θ̂n approaches zero as n → ∞ if and only if the mean or expectation of the
sampling distribution of θ̂n approaches θ as n → ∞:

lim(n→∞) Bias(θ̂n) = lim(n→∞) E(θ̂n) − θ = 0 ⇔ lim(n→∞) E(θ̂n) = θ.

Result: A sufficient condition for consistency of the estimator θ^ n is that,

lim(n→∞) Bias(θ̂n) = 0 (i.e., lim(n→∞) E(θ̂n) = θ) and lim(n→∞) Var(θ̂n) = 0.

This condition states that if both the bias and variance of the estimator θ^ n approach zero as sample
size n → ∞, then θ^ n is a consistent estimator of θ.

6.4 Interval Estimation

In point estimation, we produce a single point value for an unknown parameter. But knowing the
point value is not enough; we also want to know how close to the truth it is. Using interval
estimation, we make statements that the true parameter lies within some region (typically
depending on the point estimate) with some prescribed probability. A point estimate by itself
provides no information about the precision and reliability of estimation. Therefore, an
alternative to reporting a single number is to report an entire interval of plausible values,
that is, an interval estimate.
Confidence interval estimation of the population mean
Although X possesses nearly all the qualities of a good estimator, because of sampling error, we
know that it's not likely that our sample statistic will be equal to the population parameter, but
instead will fall into an interval of values. We will have to be satisfied knowing that the statistic
is "close to" the parameter. That leads to the obvious question, what is "close"?
We can phrase the latter question differently: How confident can we be that the value of the
statistic falls within a certain "distance" of the parameter? Or, what is the probability that the
parameter's value is within a certain range of the statistic's value? This range is the confidence
interval.
A Confidence Interval is an interval of numbers containing the most plausible values for our
Population Parameter. The probability that this procedure produces an interval that contains the
actual true parameter value is known as the Confidence Level and is generally chosen to be 0.9,
0.95 or 0.99.
The confidence level is the probability that the value of the parameter falls within the range
specified by the confidence interval surrounding the statistic.

Confidence Intervals take the form:

Point Estimate ± Critical Value * Standard Error

 There are different cases to be considered to construct confidence intervals.

Case 1:
If sample size is large or if the population is normal with known variance
Recall the Central Limit Theorem, which applies to the sampling distribution of the mean of a
sample. Consider a sample of size n drawn, with replacement and order important, from a
population whose mean is μ and standard deviation is σ. The population can have any frequency
distribution. The sampling distribution of X̄ will have mean μX̄ = μ and standard deviation
σX̄ = σ/√n, and it approaches a normal distribution as n gets large.
This allows us to use the normal distribution curve for computing confidence intervals.
⇒ Z = (X̄ − μ)/(σ/√n) has a normal distribution with mean 0 and variance 1
⇒ μ = X̄ ± Z σ/√n = X̄ ± ε, where ε = Z σ/√n is the margin of error.

For the interval estimator to be a good estimator, the error should be small, i.e. narrower CIs
are desirable. We can use the following methods to keep the level of error small.
 By making n large, i.e. taking a relatively large sample size (a good idea if possible)
 Small variability, i.e. decreasing σ (not an option; it is fixed by the original distribution)
 Taking Z small
 Decreasing the confidence level? (Not a great idea: you reduce the CI width, but you are
less likely to capture μ)
To obtain the value of Z, we attach this to a statement of probability. That is, there is an
area of size 1 − α such that

P(−Zα/2 < Z < Zα/2) = 1 − α,
where α is the probability that the parameter lies outside the interval. The α ("alpha") level
represents the "lack of confidence" and is the chance the researcher is willing to take of not
capturing the value of the parameter.
Zα/2 stands for the standard normal value to the right of which α/2 probability lies, i.e.
P(Z > Zα/2) = α/2. The reason we use Zα/2 instead of Zα in this formula is that the random error
(imprecision) is split between underestimates (left tail of the sampling distribution of the
mean) and overestimates (right tail). The confidence level 1 − α is the area lying between
−Zα/2 and Zα/2.

⇒ P(−Zα/2 < (X̄ − μ)/(σ/√n) < Zα/2) = 1 − α
⇒ P(X̄ − Zα/2 σ/√n < μ < X̄ + Zα/2 σ/√n) = 1 − α
⇒ (X̄ − Zα/2 σ/√n, X̄ + Zα/2 σ/√n)
is a 100 (1 – α ) % confidence interval for μ.

Where:
Zα/2 is the reliability factor,
SE = σ/√n is the standard error,
ME = Zα/2 σ/√n is the margin of error,
w = 2ME is the width,
UCL = X̄ + Zα/2 σ/√n is the upper confidence limit, and
LCL = X̄ − Zα/2 σ/√n is the lower confidence limit.

But usually σ² is not known; in that case we estimate it by its point estimator s².

⇒ (X̄ − Zα/2 s/√n, X̄ + Zα/2 s/√n) is a 100(1 − α)% confidence interval for μ.

Here are the z values corresponding to the most commonly used confidence levels.


100(1 − α)%     α       α/2      Zα/2
90              0.10    0.05     1.645
95              0.05    0.025    1.96
99              0.01    0.005    2.58

Case 2:
If sample size is small and the population variance, σ 2 is not known.
t = (X̄ − μ)/(s/√n) has a t distribution with n − 1 degrees of freedom.

⇒ (X̄ − tα/2 s/√n, X̄ + tα/2 s/√n) is a 100(1 − α)% confidence interval for μ.

The unit of measurement of the confidence interval is the standard error, which is just the
standard deviation of the sampling distribution of the statistic.

Student’s t Distribution
 The t is a family of distributions.

 The tα/2 value depends on degrees of freedom (d.f.).

Degrees of freedom: the number of observations that are free to vary after the sample mean has
been calculated.
Degrees of freedom, d.f. = n − 1

Example: Suppose the mean of 3 numbers is 6.

Let X1 = 5

Let X2 = 6

What is X3?

If the mean of these three values is 6, then X3 must be 7 i.e., X3 is not free to vary.
Here, n = 3, so degrees of freedom = n – 1 = 3 – 1 = 2

(2 values can be any numbers, but the third is not free to vary for a given mean)

Note: t →Z as n increases.

Summary

Assumption                                    100(1 − α)% confidence interval (two sided)
n large and σ² known, or normal population    P(X̄ − Zα/2 σ/√n < μ < X̄ + Zα/2 σ/√n) = 1 − α
n large and σ² unknown                        P(X̄ − Zα/2 s/√n < μ < X̄ + Zα/2 s/√n) = 1 − α
n small, population normal, σ² unknown        P(X̄ − t(n−1, α/2) s/√n < μ < X̄ + t(n−1, α/2) s/√n) = 1 − α

Example:
1. From a normal sample of size 25 a mean of 32 was found .Given that the population
standard deviation is 4.2. Find
a. A 95% confidence interval for the population mean.
b. A 99% confidence interval for the population mean.
c. A 90 % confidence interval for the population mean.

Solution:
a. X̄ = 32, σ = 4.2, 1 − α = 0.95 ⇒ α = 0.05, α/2 = 0.025
⇒ Zα/2 = 1.96 from the standard normal table.

The required interval is X̄ ± Zα/2 σ/√n
= 32 ± 1.96 × 4.2/√25
= 32 ± 1.65

= (30.35, 33.65)
b. X̄ = 32, σ = 4.2, 1 − α = 0.99 ⇒ α = 0.01, α/2 = 0.005
⇒ Zα/2 = 2.58 from the standard normal table.

The required interval is X̄ ± Zα/2 σ/√n
= 32 ± 2.58 × 4.2/√25
= 32 ± 2.17
= (29.83, 34.17)
c. Exercise
2. A Drug Company is testing a new drug which is supposed to reduce blood pressure. From
the six people who are used as subjects, it is found that the average drop in blood pressure
is 2.28 points, with a standard deviation of .95 points.
Compute:
a. A 90 % confidence interval for the population mean.
b. A 95% confidence interval for the population mean.
c. A 99% confidence interval for the population mean.

Solution:
b. X̄ = 2.28, s = 0.95, 1 − α = 0.95 ⇒ α = 0.05, α/2 = 0.025
⇒ tα/2 = 2.571 with df = 6 − 1 = 5, from the t table.

The required interval is X̄ ± tα/2 s/√n
= 2.28 ± 2.571 × 0.95/√6
= 2.28 ± 1.00
= (1.28, 3.28)
That is, we can be 95% confident that the mean decrease in blood pressure is between 1.28 and
3.28 points.
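Both worked examples can be reproduced with a few lines of Python, drawing the quantiles from SciPy rather than from printed tables (a sketch, not part of the original solutions):

```python
# Sketch: the z-interval of Example 1(a) and the t-interval of Example 2(b).
from scipy.stats import norm, t

xbar, sigma, n = 32, 4.2, 25                 # Example 1(a): sigma known
z = norm.ppf(0.975)                          # 1.96
print(xbar - z * sigma / n**0.5, xbar + z * sigma / n**0.5)

xbar, s, n = 2.28, 0.95, 6                   # Example 2(b): small sample
tc = t.ppf(0.975, df=n - 1)                  # 2.571
print(xbar - tc * s / n**0.5, xbar + tc * s / n**0.5)
```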


Chapter seven
7. HYPOTHESIS TESTING

Introduction

This is also one way of making inference about a population parameter, where the investigator
has a prior notion about the value of the parameter; i.e. a statistical test is a procedure that
leads to either acceptance or rejection of a null hypothesis H0 given in advance.

Learning objectives:
After completing this chapter, students will be able to;
 Define the null and alternative hypothesis:
 Develop and test the hypothesis:
 Carry out tests of association between variables.

Definitions:
Statistical hypothesis: an assertion or statement about the population whose plausibility is to
be evaluated on the basis of the sample data.
Test statistic: a statistic whose value serves to determine whether to reject or accept the
hypothesis to be tested. It is a random variable.
Statistical test: a test or procedure used to evaluate a statistical hypothesis; its value
depends on the sample data.

7.1 Developing Null and Alternative Hypotheses

 There are two types of hypothesis:


In statistical hypothesis testing, there are always two hypotheses. The hypothesis to be tested is
called the null hypothesis and given the symbol H 0. The null hypothesis states that there is no
difference between a hypothesized population mean and a sample mean. It is the status quo
hypothesis.
For example, if we were to test the hypothesis that college freshmen study 20 hours per
week, we would express our null hypothesis as:
H0: μ = 20

We test the null hypothesis against an alternative hypothesis, which is given the symbol H a. The
alternative hypothesis is often the hypothesis that you believe yourself! It includes the outcomes

not covered by the null hypothesis. In this example, our alternative hypothesis would express that
freshmen do not study 20 hours per week:
Ha: μ ≠ 20
In short,
Null hypothesis is,
 It is the hypothesis to be tested.
 It is the hypothesis of equality or the hypothesis of no difference.
 Usually denoted by H0.
Alternative hypothesis:
 It is the hypothesis available when the null hypothesis has to be rejected.
 It is the hypothesis of difference.

 Usually denoted by H1 or Ha.


Types and size of errors:
Testing hypothesis is based on sample data which may involve sampling and non-sampling
errors.
The following table gives a summary of the possible results of any hypothesis test:

                  Decision
Truth        Reject H0          Accept H0
H0 true      Type I error       Right decision
H1 true      Right decision     Type II error

 Type I error (α): Rejecting the null hypothesis when it is true. It is sometimes called
level of significance.

Level of Significance

The level of significance of a statistical test is the probability level α for obtaining a critical value
under the distribution specified by the null hypothesis. A calculated test-statistic is compared to
the critical value to make a decision as to rejection or non-rejection of the null hypothesis. The
rejection of the null hypothesis is equivalent to saying that the calculated probability for
obtaining the test statistic is less than α.

 Type II error (β): Failing to reject the null hypothesis when it is actually false is type II
error.

Power of a test:
The most powerful test is a test that fixes the level of significance and minimizes type II error
(β). The power of a test is defined as the probability of rejecting the null hypothesis when it is
actually false. It is given as:

Power of test =1− β

NOTE:
1. Errors are prevalent in any two-choice decision-making problem.
2. There is always a possibility of committing one or the other error.
3. Type I error (α) and type II error (β) have an inverse relationship and therefore cannot be
minimized at the same time.
 In practice we set α at some value and design a test that minimizes β. This is because a
type I error is often considered to be more serious, and therefore more important to avoid,
than a type II error.

General steps in hypothesis testing:


1. The first step in hypothesis testing is to specify the null hypothesis (H 0) and the

alternative hypothesis (H1).


2. The next step is to select a significance level,α
3. Identify the sampling distribution of the estimator.
4. The fourth step is to calculate a statistic analogous to the parameter specified by the null
hypothesis.
5. Identify the critical region.
6. Making decision.
7. Summarization of the result.

7.2. Tests about the population mean μ:

Suppose the assumed or hypothesized value of μ is denoted by μ0. Then one can formulate
two-sided (1) and one-sided (2 and 3) hypotheses as follows:
1. H0: μ = μ0 vs H1: μ ≠ μ0 → two-sided
2. H0: μ = μ0 vs H1: μ > μ0 → one-sided
3. H0: μ = μ0 vs H1: μ < μ0 → one-sided

CASES:

Case 1: When sampling is from a normal distribution with σ 2known


 The relevant test statistic is;
Z = (X̄ − μ)/(σ/√n)
After specifying α we have the following regions (critical and acceptance) on the standard
normal distribution corresponding to the above three hypothesis. See the following summary
table for decision.

H1         Reject H0 if       Accept H0 if       Inconclusive if
μ ≠ μ0     |Zcal| > Zα/2      |Zcal| < Zα/2      Zcal = Zα/2 or Zcal = −Zα/2
μ < μ0     Zcal < −Zα         Zcal > −Zα         Zcal = −Zα
μ > μ0     Zcal > Zα          Zcal < Zα          Zcal = Zα

Where: Zcal = (X̄ − μ0)/(σ/√n)

Case 2: When sampling is from a normal distribution with σ 2 unknown and small sample size
 The relevant test statistic is;

t = (X̄ − μ)/(S/√n) ~ t with n − 1 degrees of freedom.
After specifying α we have the following regions on the student t-distribution corresponding to
the above three hypothesis.

H1         Reject H0 if       Accept H0 if       Inconclusive if
μ ≠ μ0     |tcal| > tα/2      |tcal| < tα/2      tcal = tα/2 or tcal = −tα/2
μ < μ0     tcal < −tα         tcal > −tα         tcal = −tα
μ > μ0     tcal > tα          tcal < tα          tcal = tα

Where: tcal = (X̄ − μ0)/(S/√n)

Case 3: When sampling is from a non- normally distributed population or a population whose
functional form is unknown.
 If a sample size is large one can perform a test hypothesis about the mean by using:
Zcal = (X̄ − μ0)/(σ/√n), if σ² is known
Zcal = (X̄ − μ0)/(S/√n), if σ² is unknown
 The decision rule is the same as case I.

Examples

1. Test the hypotheses that the average height content of containers of certain lubricant is 10
liters if the contents of a random sample of 10 containers are 10.2, 9.7, 10.1, 10.3, 10.1,
9.8, 9.9, 10.4, 10.3, and 9.8 liters. Use the 0.01 level of significance and assume that the
distribution of contents is normal.

Solution:
Let μ =Population mean, μ0=10
Step 1: Identify the appropriate hypothesis
H0: μ = 10 vs. H1: μ ≠ 10
Step 2: select the level of significance, α = 0.01 (given)
Step 3: Select an appropriate test statistics
t- Statistic is appropriate because population variance is not known and the sample size is also
small.
Step 4: identify the critical region.
Here we have two critical regions since we have two tailed hypothesis.
The critical region is |tcal| > t0.005(9) = 3.2498
⇒ (−3.2498, 3.2498) is the acceptance region.


Step 5: Computations:
X̄ = 10.06, S = 0.25

⇒ tcal = (X̄ − μ0)/(S/√n) = (10.06 − 10)/(0.25/√10) = 0.76
Step 6: Decision
Accept H 0, since t cal is in the acceptance region.
Step 7: Conclusion: At 1% level of significance, we have no evidence to say that the average
height content of containers of the given lubricant is different from 10 litters, based on the given
sample data.
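The same test can be run in one line with SciPy's one-sample t-test, which recomputes the statistic from the raw data and reports a two-sided p-value (a sketch for checking the hand computation above):

```python
# Sketch: one-sample t-test for the lubricant example.
from scipy.stats import ttest_1samp

contents = [10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, 9.8]
stat, pvalue = ttest_1samp(contents, popmean=10)
print(round(stat, 2), round(pvalue, 3))   # t close to 0.76; p > 0.01, so accept H0
```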
2. The mean life time of a sample of 16 fluorescent light bulbs produced by a company is
computed to be 1570 hours. The population standard deviation is 120 hours. Suppose the
hypothesized value for the population mean is 1600 hours. Can we conclude that the life

time of light bulbs is decreasing? (Use α = 0.05 and assume the normality of the
population)
Solution:
Let μ = Population mean. , μ0=1600
Step 1: Identify the appropriate hypothesis
H0: μ = 1600 vs. H1: μ < 1600
Step 2: select the level of significance, α = 0.05 (given)
Step 3: Select an appropriate test statistics
Z- Statistic is appropriate because population variance is known.
Step 4: identify the critical region.
The critical region is Zcal < −Z0.05 = −1.645
⇒ (−1.645, ∞) is the acceptance region.


Step 5: Computations:

⇒ Zcal = (X̄ − μ0)/(σ/√n) = (1570 − 1600)/(120/√16) = −1
Step 6: Decision
Accept H 0, since Z cal is in the acceptance region.
Step 7: Conclusion
At 5% level of significance, we have no evidence to say that that the life time of light bulbs is
decreasing, based on the given sample data.
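Since σ is known here, the one-sided z-test can also be sketched directly from the standard normal distribution (assuming SciPy for the CDF):

```python
# Sketch: one-sided z-test for the light-bulb example.
from scipy.stats import norm

xbar, mu0, sigma, n = 1570, 1600, 120, 16
z = (xbar - mu0) / (sigma / n**0.5)       # -1.0
print(z, round(norm.cdf(z), 3))           # p ≈ 0.159 > 0.05, so accept H0
```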

Exercises:
1. It is known in a pharmacological experiment that rats fed with a particular diet over a
certain period gain an average of 40 gms in weight. A new diet was tried on a sample of
20 rats, yielding a weight gain of 43 gms with variance 7 gms². Test the hypothesis that
the new diet is an improvement, assuming normality.
2. A batch of 100 resistors has an average of 102 Ohms. Assuming a population standard
deviation of 8 Ohms, test whether the population mean is 100 Ohms at a significance
level of α = 0.05.

3. A quality control engineer finds that a sample of 100 light bulbs had an average life-time
of 480 hours. Assuming a population standard deviation of σ = 25 hours, test whether the
population mean is 480 hours vs. the alternative hypothesis that the population mean is
greater at a significance level of α = 0.05.
4. It is claimed that a vacuum cleaner expends 46 kWh per year. A random sample of 12
homes indicates that vacuum cleaners expend an average of 42 kWh per year with
standard deviation 11.9 kWh. At a 0.05 level of significance, does this suggest that, on
average, vacuum cleaner expend less than 46 kWh per year? Assume the population to be
normally distributed.

7.3. Test of association

 Suppose we have a population consisting of observations having two attributes or


qualitative characteristics say A and B.
 If the attributes are independent, then the probability of possessing both A and B is
PA × PB,
where
PA is the probability that a member possesses attribute A, and
PB is the probability that a member possesses attribute B.
Suppose, A has r mutually exclusive and exhaustive classes, and
B has c mutually exclusive and exhaustive classes
Then the entire set of data can be represented using r *c contingency table.

A\B      B1     B2     …    Bj     …    Bc      Total
A1       O11    O12         O1j         O1c     R1
A2       O21    O22         O2j         O2c     R2
⋮
Ai       Oi1    Oi2         Oij         Oic     Ri
⋮
Ar       Or1    Or2         Orj         Orc     Rr
Total    C1     C2          Cj          Cc      n

The chi-square test procedure is used to test the hypothesis of independence of two attributes.
For instance we may be interested;
 Whether the presence or absence of hypertension is independent of smoking habit or not.
 Whether the size of the family is independent of the level of education attained by the
mothers.
 Whether there is association between father and son regarding boldness.
 Whether there is association between stability of marriage and period of acquaintance
ship prior to marriage.

The χ² statistic is given by:

χ²cal = Σ(i=1 to r) Σ(j=1 to c) [(Oij − eij)²/eij]  ~  χ²((r−1)(c−1))

Where,
Oij = the number of units that belong to category i of A and j of B,
eij = the expected frequency that belongs to category i of A and j of B.

The eij is given by:

eij = (Ri × Cj)/n

Where, Ri = the ith row total
Cj = the jth column total
n = total number of observations
Remark:
n = Σ(i=1 to r) Σ(j=1 to c) Oij = Σ(i=1 to r) Σ(j=1 to c) eij

 The null and alternative hypothesis may be stated as:


H0: There is no association between A and B.

H1: not H0, i.e. there is association between A and B.

Decision Rule:
Reject H0 (independence) at the α level of significance if the calculated value of χ² exceeds
the tabulated value with degrees of freedom equal to (r − 1)(c − 1).

Examples:
1. A geneticist took a random sample of 300 men to study whether there is association
between father and son regarding boldness. He obtained the following results.
                   Son
Father        Bold     Not bold
Bold          85       59
Not bold      65       91

Using α = 5% test whether there is association between father and son regarding boldness.
Solution:

H0: there is no association between father and Son regarding boldness.

H1: not H0
First calculate the row and column totals
R1 = 144, R2 = 156, C1 = 150, C2 = 150

Then calculate the expected frequencies: eij = (Ri × Cj)/n

e11 = (R1 × C1)/n = (144 × 150)/300 = 72      e21 = (R2 × C1)/n = (156 × 150)/300 = 78
e12 = (R1 × C2)/n = (144 × 150)/300 = 72      e22 = (R2 × C2)/n = (156 × 150)/300 = 78

Obtain the calculated value of the chi-square:

χ²cal = Σ(i=1 to 2) Σ(j=1 to 2) (Oij − eij)²/eij
      = (85 − 72)²/72 + (59 − 72)²/72 + (65 − 78)²/78 + (91 − 78)²/78 = 9.028

Obtain the tabulated value of chi-square:
df = (r − 1)(c − 1) = 1 × 1 = 1
χ²0.05(1) = 3.841 from the chi-square (χ²) table.
The decision is to reject H0, since χ²cal > χ²0.05(1).

Conclusion: At 5% level of significance we have evidence to say there is association between


father and son regarding boldness, based on this sample data.
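The same test is available in SciPy as a contingency-table routine, which computes the expected counts internally (a sketch; `correction=False` turns off the Yates continuity correction so the statistic matches the hand calculation):

```python
# Sketch: chi-square test of association for the boldness example.
from scipy.stats import chi2_contingency

table = [[85, 59],
         [65, 91]]
chi2, p, df, expected = chi2_contingency(table, correction=False)
print(round(chi2, 3), df, round(p, 4))    # chi2 ≈ 9.028, df = 1, p < 0.05
```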

2. Random samples of 200 men, all retired were classified according to education and
number of children as shown below.

Education level           Number of children
                          0–1    2–3    Over 3
Elementary                14     37     32
Secondary and above       31     59     27

Test the hypothesis that the size of the family is independent of the level of education attained by
fathers. (Use 5% level of significance).

Solution:
H 0: There is no association between the size of the family and the level of education attained by
fathers
H 1: not H 0
First calculate the row and column totals
R1 = 83, R2 = 117, C1 = 45, C2 = 96, C3 = 59
Then compute the expected frequencies: eij = (Ri × Cj)/n

e11 = (83 × 45)/200 = 18.675      e21 = (117 × 45)/200 = 26.325
e12 = (83 × 96)/200 = 39.84       e22 = (117 × 96)/200 = 56.16
e13 = (83 × 59)/200 = 24.485      e23 = (117 × 59)/200 = 34.515
Obtain the calculated value of the chi-square:

χ²cal = Σ(i=1 to 2) Σ(j=1 to 3) (Oij − eij)²/eij
      = (14 − 18.675)²/18.675 + (37 − 39.84)²/39.84 + … + (27 − 34.515)²/34.515 = 6.3

Obtain the tabulated value of chi-square at α = 0.05:
Degrees of freedom = (r − 1)(c − 1) = 1 × 2 = 2
χ²0.05(2) = 5.99 from the chi-square (χ²) table.
The decision is to reject H0, since χ²cal > χ²0.05(2), and conclude that there is association.

Conclusion: At 5% level of significance we have evidence to say there is association between


the size of the family and the level of education attained by fathers, based on this sample data.



The end

In loving memory of my little brother


Minwyelet Tadesse (Kenya)
Thanks to God!!!
