WEEK 1: INTRODUCTION & REVIEW
FALL 2024
COURSE DESCRIPTION
➢ Econometrics ISL355E introduces you to the
regression methods for analyzing data in economics.
➢ This course emphasizes both the theoretical and
practical aspects of statistical analysis. It focuses on
techniques for estimating various econometric
models and conducting tests of hypotheses that are
of interest to economists.
➢ The goal is to help you develop a solid theoretical
background in introductory-level econometrics, the ability to
implement the techniques, and the ability to critique
empirical studies in economics.
TEXTBOOKS
• Kivedal, B. K. (2024), Applied Statistics and Econometrics: Basic Topics and Tools with Gretl and R
• Kacapyr, E. (2022), Essential Econometric Techniques: A Guide to Concepts and Applications, 3rd Ed.
• Dougherty, C. (2016), Introduction to Econometrics, 5th Ed., Oxford University Press
• Studenmund, A. H. (2017), Using Econometrics: A Practical Guide, 7th Edition, Pearson
✓ Pedace, R. (2013) Econometrics for Dummies, John Wiley & Sons, Inc
✓ Hanck C., Arnold M., Gerber A. and Schmelzer M. (2020), Introduction
to Econometrics with R, https://www.econometrics-with-r.org/
✓ Griffiths, W. E., Hill, R. C., Lim, G.C. (2008), Using Eviews for
Principles of Econometrics, 3rd. Ed. John Wiley
➢ Quizzes (15%)
➢ Midterm (30%)
ECONOMETRICS
➢ Econometrics is a branch of economics that
utilizes mathematical and statistical methods to
analyze economic theories and validate them
through empirical evidence.
ECONOMETRICS
• Theoretical foundations
– Behavioral modeling: Economic growth, Labor
supply, Demand equations, etc.
– Microeconometrics, Macroeconometrics, Financial
econometrics, Marketing …
• Mathematical elements
• Statistical foundations
• 'Econometric Model' building
[Diagram: econometrics lies at the intersection of ECONOMICS, MATHEMATICS, and STATISTICS]
[Flowchart: MODEL SPECIFICATION → MODEL ESTIMATION → DIAGNOSTIC TESTS → performance check (respecify if not good) → INTERPRETATION OF FINDINGS & MODEL USAGE]
ECONOMETRICS
➢ The exciting aspect of econometrics is its focus on verifying
or disproving economic laws, such as purchasing power
parity, the life cycle hypothesis, and the quantity theory of
money, using economic data.
➢ David F. Hendry (1980) emphasized this function of
econometrics:
– The three golden rules of econometrics are test, test, and
test; all three rules are broken regularly in empirical
applications and are fortunately easily remedied.
Rigorously tested models, which adequately described the
available data, encompassed previous findings, and were
derived from well-based theories, would enhance any
claim to be scientific.
USAGE OF ECONOMETRIC STUDY
➢ STRUCTURAL ANALYSIS
– Price and income elasticity estimation
– Smoking and cancer/ heart attack relationship, reality or myth?
– Effect of exchange rate on import and export of Turkey
➢ POLICY RECOMMENDATIONS
– If the interest rate increases by 1 percentage point, what effect
does it have on inflation?
– If income tax increases by 5 percentage points, how will it
affect economic growth?
– If customer satisfaction increases, how does it affect the sales
volume of the firm?
➢ FORECASTING
– GDP growth rate in 2021
– Firm sales in 2021
– Population of İstanbul in 2040
TRENDS IN ECONOMETRICS
Data-Generating Process (DGP)
ECONOMETRICS: SCIENCE + ART
➢ Econometrics, while based on scientific principles, still retains a
particular element of art.
➢ According to Malinvaud (1966), the art of econometrics is finding
the correct set of sufficiently specific yet realistic assumptions to
enable us to take the best possible advantage of the available data.
➢ Data in economics are not generated under ideal experimental
conditions as in a physics laboratory. This data cannot be replicated
and is most likely measured with error.
➢ Many published empirical studies find that economic data may not
have enough variation to discriminate between competing
economic theories.
➢ To some, the “art” element in econometrics has left several
distinguished economists doubtful of the power of econometrics to
yield sharp predictions.
CRITIQUES OF ECONOMETRICS
➢ Econometrics has its critics. Interestingly, John Maynard
Keynes (1940, p. 156) had the following to say about Jan
Tinbergen’s (1939) pioneering work:
– No one could be more frank, painstaking, or free from
subjective bias or parti pris than Professor Tinbergen. There is
no one, therefore, so far as human qualities go, whom it
would be safer to trust with black magic. I am not yet
persuaded that there is anyone I would trust with it at the
present stage or that this brand of statistical alchemy is ripe
to become a branch of science. But Newton, Boyle, and
Locke all played with alchemy. So, let him continue.
➢ In 1969, Jan Tinbergen shared the first Nobel Prize in
economics with Ragnar Frisch.
RESPONSE TO THE CRITIQUES
➢ Econometrics has limitations due to incomplete economic theory
and non-experimental data, but it has played a fundamental role
in developing economics as a scientific discipline.
➢ Economic theories can't be conclusively rejected using
econometric methods, but testing specific formulations against
rival alternatives can still be valuable. Despite the challenge of
specification searches, econometric modeling remains
worthwhile.
➢ Econometric models are essential tools for forecasting and
policy analysis, and it is unlikely that they will be discarded.
The challenge is recognizing their limitations and working
towards turning them into more reliable and practical tools.
There seem to be no viable alternatives.
DATA STRUCTURES
➢ Observation mechanisms
– Passive, nonexperimental (the usual)
– Randomly assigned experiment (wishful)
➢ Data types
– Cross-section: X_i
– Time series: X_t
– Panel: X_it
➢ The data type you’re using may influence how you
estimate your econometric model. In particular,
specialized techniques are usually required to deal
with time series and panel data.
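As a rough illustration of the three data types listed above (a sketch, not from the slides; the variable names and numbers are made up), each structure corresponds to a different index in pandas:

```python
import pandas as pd

# Cross-section: one observation per unit i (e.g., firms in a single year)
cross = pd.DataFrame({"firm": ["A", "B", "C"],
                      "invest": [10.2, 5.4, 7.9]}).set_index("firm")

# Time series: one observation per period t (e.g., quarterly GDP growth)
ts = pd.Series([1.2, 0.8, 1.5, 0.9],
               index=pd.period_range("2020Q1", periods=4, freq="Q"),
               name="gdp_growth")

# Panel: repeated observations on the same units over time -> MultiIndex (i, t)
panel = pd.DataFrame({"firm": ["A", "A", "B", "B"],
                      "year": [2019, 2020, 2019, 2020],
                      "invest": [10.2, 11.0, 5.4, 6.1]}).set_index(["firm", "year"])

print(cross, ts, panel, sep="\n\n")
```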
EXPERIMENTAL DATA
➢ Practical situations often arise where the questions that interest us are such
that no data are available to answer the questions. We may have to generate
the required data.
➢ Simple example. A coffee powder manufacturer would like to design a
packaging and pricing strategy for the product that maximizes its revenue.
➢ He knows that using a plastic bag with color positively affects the
consumer’s choice, while a colored plastic bag is more costly than a plain
plastic cover. He needs to estimate the net benefit he would have in
introducing a colored plastic bag.
➢ He also knows that consumers prefer fresh coffee powder; thus, depending
on the weekly consumption rate, they choose the packet size. The larger the
packet size that a household wants, the lower its willingness to pay, but
smaller packets will increase the cost of packaging.
➢ He would like to know the net benefits to the firm of different sizes of the
packets at different price levels he could fix for them given different types
of demand.
EXPERIMENTAL DATA
➢ Historically collected data on coffee sales may be useless in answering
these questions as colored plastic bags were not used in the past. The
manufacturer cannot introduce the new colored package, incurring
higher costs.
➢ To introduce more realism and more complexity, let us assume that
there is also a cost-saving ingredient available.
➢ A coffee substitute, called chicory, brings thickness and bitterness to
coffee that some people may like when mixed with coffee. However,
too much chicory is not appreciated by many consumers. As a result,
the manufacturer expects that the greater the chicory content, the lower
the price the customer is willing to pay.
➢ The coffee manufacturer wishes to conduct a small-scale pilot
marketing experiment to estimate the effects on net revenue of
different types of packaging, different levels of chicory, and different
packet sizes.
EXPERIMENTAL DATA
➢ How should one experiment?
Each factor is set at two levels, labeled Low (L) and High (H): chicory content (10%), packet size (100 and 200 grams), and plain versus colored cover.
➢ The questions of interest are:
1. How do you choose the factors and assign them to the experimental subjects of
the pilot experiment?
2. How do the changes in the three factors affect people’s willingness to pay for
100 grams of coffee powder?
3. Is the relation between these factors and willingness to pay linear or nonlinear?
4. How can we estimate the effects?
➢ These questions can be answered using the statistical theory of
design of experiments and the statistical methods of analysis of
variance or conjoint analysis.
CROSS-SECTION DATA
➢ This data type consists of measurements for individual observations
(persons, households, firms, counties, states, countries, or whatever)
at a given time. The observed changes are due to the unit’s
characteristics.
➢ For example, if the research question is to determine the determinants
of a big firm’s investment decisions, ISO 500 data for 2019 may be
used to design models.
➢ TUIK conducts nationwide sample surveys of households to record
their consumption expenditure patterns. This database is now an
excellent tool for understanding consumer behavior in Turkey and
developing retail marketing strategies.
➢ Given this sample information, one might want to know (i) if there is
any pattern implied by the theory of consumer behavior that relates
expenditure on cereals to household size and total expenditure; (ii) if
such a relation is linear or nonlinear; (iii) how to estimate alternate
specifications; and (iv) how to choose between alternate
specifications.
NON-EXPERIMENTAL DATA TAKEN
FROM SECONDARY SOURCES
AGGREGATION LEVEL
➢ The level of aggregation used in measuring the variables: The level of
aggregation refers to the unit of analysis when information is acquired for
the data. In other words, the variable measurements may originate at a
lower level of aggregation (like an individual, household, or firm) or
higher (like a city, county, or state).
➢ The frequency with which the data is captured refers to the rate at
which measurements are obtained. Time-series data may be recorded at a
higher frequency (hourly, daily, or weekly) or a lower frequency (like
monthly, quarterly, or yearly).
➢ Remember: Having a large amount of data won't help you get accurate
results if the level of aggregation or frequency isn't suitable for your
specific problem. For instance, if you want to figure out how spending
per student impacts academic performance, using city-level data may not
work well because spending and student characteristics differ
significantly from city to city within states. This could lead to misleading
results.
A FIRST LOOK AT THE DATA
DESCRIPTIVE STATISTICS
AN APPLICATION:
LABOR MARKET DATA
IS WAGE RELATED TO EDUCATION?
Cornwell and Rupert Returns to Schooling Data, 595 Individuals, 7 Years
Variables:
EXP = work experience
WKS = weeks worked
OCC = 1 if blue-collar occupation
IND = 1 if manufacturing industry
SOUTH = 1 if resides in south
SMSA = 1 if resides in a city (SMSA)
MS = 1 if married
FEM = 1 if female
UNION = 1 if wage set by union contract
ED = years of education
LWAGE = log of wage = dependent variable in regressions
These data were analyzed in Cornwell, C. and Rupert, P., "Efficient Estimation with
Panel Data: An Empirical Comparison of Instrumental Variable Estimators," Journal
of Applied Econometrics, 3, 1988, pp. 149-155.
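A minimal sketch of a first look at these data with pandas (the file name "cornwell_rupert.csv" is made up here; it assumes the panel is available locally with the variable names listed above):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("cornwell_rupert.csv")   # hypothetical local copy of the data

# Descriptive statistics for a few key variables
print(df[["EXP", "WKS", "ED", "LWAGE"]].describe())

# A first look at the wage-education relationship: mean log wage by years of education
print(df.groupby("ED")["LWAGE"].mean())

# Histogram of the pooled log wages
df["LWAGE"].hist(bins=30)
plt.show()
```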
DESCRIPTIVE STATISTICS
HISTOGRAMS: POOLED DATA AND WITHIN-PERSON VARIATION
GRAPHICAL DEVICES: BOX PLOTS — MEDIAN LOG WAGE
➢ Estimation
➢ Inference
➢ Analysis
SIMPLE LINEAR REGRESSION
MULTIPLE REGRESSION
AN APPLICATION:
Professor’s Overall Teaching Ability
➢ Have you heard of “ RateMyProfessors.com ”?
➢ On this website, students evaluate a professor’s overall teaching ability
and various other attributes. The website then summarizes these
student-submitted ratings for the benefit of any student considering
taking a class from the professor.
Model:
RATING_i = β0 + β1·EASE_i + β2·HOT_i + ε_i
where:
RATING_i = the overall rating (5 = best) of the ith professor
EASE_i = the easiness rating (5 = easiest) of the ith professor (in terms of workload and grading)
HOT_i = 1 if the ith professor is considered "hot," 0 otherwise (apparently in terms of physical attractiveness)
AN APPLICATION: Professor's Overall Teaching Ability

Professor   RATING   EASE   HOT
    1         2.8     3.7    0
    2         4.3     4.1    1
    3         4.0     2.8    1
    4         3.0     3.0    0
    5         4.3     2.4    0
    6         2.7     2.7    0
    7         3.0     3.3    0
    8         3.7     2.7    0
    9         3.9     3.0    1
   10         2.7     3.2    0
   11         4.2     1.9    1
   12         1.9     4.8    0
   13         3.5     2.4    1
   14         2.1     2.5    0
   15         2.0     2.7    1
   16         3.8     1.6    0
   17         4.1     2.4    0
   18         5.0     3.1    1
   19         1.2     1.6    0
   20         3.7     3.1    0
   21         3.6     3.0    0
   22         3.3     2.1    0
   23         3.2     2.5    0
   24         4.8     3.3    0
   25         4.6     3.0    0

[Scatter plot: RATING against EASE, with HOT=0 and HOT=1 observations marked separately]
Data for these variables from 25 randomly chosen professors on RateMyProfessors.com.
AN APPLICATION: Professor's Overall Teaching Ability
Population:  RATING_i = β0 + β1·EASE_i + β2·HOT_i + ε_i
Sample:      RATING_i = b0 + b1·EASE_i + b2·HOT_i + e_i
Estimation:  RATING-hat = 3.23 + 0.0063·EASE + 0.59·HOT
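A sketch of how estimates of this kind could be reproduced with statsmodels, using the 25 observations from the table above (the reported coefficients 3.23, 0.0063 and 0.59 come from the slide; the code itself is an assumption, not part of the original material):

```python
import numpy as np
import statsmodels.api as sm

rating = np.array([2.8, 4.3, 4.0, 3.0, 4.3, 2.7, 3.0, 3.7, 3.9, 2.7, 4.2, 1.9, 3.5,
                   2.1, 2.0, 3.8, 4.1, 5.0, 1.2, 3.7, 3.6, 3.3, 3.2, 4.8, 4.6])
ease   = np.array([3.7, 4.1, 2.8, 3.0, 2.4, 2.7, 3.3, 2.7, 3.0, 3.2, 1.9, 4.8, 2.4,
                   2.5, 2.7, 1.6, 2.4, 3.1, 1.6, 3.1, 3.0, 2.1, 2.5, 3.3, 3.0])
hot    = np.array([0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

X = sm.add_constant(np.column_stack([ease, hot]))   # intercept, EASE, HOT
results = sm.OLS(rating, X).fit()
print(results.params)   # intercept, EASE and HOT coefficients
```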
This sequence provides an example of a discrete random variable. Suppose that you have a red die which, when thrown, takes the numbers from 1 to 6 with equal probability.
PROBABILITY DISTRIBUTION EXAMPLE: x IS THE SUM OF TWO DICE
Suppose that you also have a green die that can take the numbers from 1 to 6 with equal probability.
We will define a random variable x as the sum of the numbers when the dice are thrown.
For example, if the red die is 4 and the green one is 6, x is equal to 10.
PROBABILITY DISTRIBUTION EXAMPLE: x IS THE SUM OF TWO DICE

            red:   1   2   3   4   5   6
green 1:           2   3   4   5   6   7
green 2:           3   4   5   6   7   8
green 3:           4   5   6   7   8   9
green 4:           5   6   7   8   9  10
green 5:           6   7   8   9  10  11
green 6:           7   8   9  10  11  12
If you look at the table, you can see that x can be any of the numbers from 2 to 12.
We will now define f, the frequencies associated with the possible values of x. For example, there are four outcomes which make x equal to 5.
PROBABILITY DISTRIBUTION EXAMPLE: x IS THE SUM OF TWO DICE

x:    2     3     4     5     6     7     8     9    10    11    12
f:    1     2     3     4     5     6     5     4     3     2     1
p:  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Similarly you can work out the frequencies for all the other values of x.
Finally we will derive the probability of obtaining each value of x.
If there is 1/6 probability of obtaining each number on the red die, and the same on the green die, each outcome in the table will occur with 1/36 probability.
Hence to obtain the probabilities associated with the different values of x, we divide the frequencies by 36.
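A quick way to reproduce this table (a sketch, not part of the slides) is to enumerate all 36 equally likely outcomes:

```python
from collections import Counter
from fractions import Fraction

# Count how often each sum occurs among the 36 (red, green) outcomes
freq = Counter(red + green for red in range(1, 7) for green in range(1, 7))

for x in sorted(freq):
    p = Fraction(freq[x], 36)
    print(f"x = {x:2d}   f = {freq[x]}   p = {p}")   # e.g. x = 7, f = 6, p = 1/6
```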
PROBABILITY DISTRIBUTION EXAMPLE: x IS THE SUM OF TWO DICE
[Bar chart: probabilities 1/36, 2/36, ..., 6/36, ..., 2/36, 1/36 plotted against x = 2, ..., 12]
The distribution is shown graphically. In this example it is symmetrical, highest for x equal to 7 and declining on either side.
PROBABILITY
DISTRIBUTION
➢ THEORETICAL PROBABILITY
DISTRIBUTION
➢ EXPERIMENTAL PROBABILITY
DISTRIBUTION
EXPECTED VALUE OF A
RANDOM VARIABLE
EXPECTED VALUE OF A RANDOM VARIABLE

E(x) = x_1 p_1 + ... + x_n p_n = Σ_{i=1}^n x_i p_i

The expected value of a random variable, also known as its population mean, is the weighted average of its possible values, the weights being the probabilities attached to the values.
Note that the sum of the probabilities must be unity, so there is no need to divide by the sum of the weights.
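A small sketch (an assumption, not from the slides) computing the weighted average directly for the two-dice variable:

```python
from fractions import Fraction

# Probabilities of x = 2, ..., 12 for the sum of two dice
probs = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}

expected = sum(x * p for x, p in probs.items())   # E(x) = sum of x_i * p_i
print(expected)   # 7
```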
EXPECTED VALUE OF A RANDOM VARIABLE
This sequence shows how the expected value is calculated, first in abstract and then with the random variable defined in the first sequence. We begin by listing the possible values of x: x_1, ..., x_11.
Next to each possible value x_i we list its probability p_i.
Then we define a column in which the values are weighted by the corresponding probabilities: x_i p_i.
The expected value is the sum of the entries in the third column: Σ x_i p_i = E(x).
EXPECTED VALUE OF A RANDOM VARIABLE
The random variable x defined in the previous sequence could be any of the integers from 2 to 12 with probabilities as shown:
x_i:  2,    3,    4,    5,    6,    7,    8,    9,   10,   11,   12
p_i: 1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36
x could be equal to 2 with probability 1/36, so the first entry in the calculation of the expected value is 2/36. The probability of x being equal to 3 was 2/36, so the second entry is 6/36.
EXPECTED VALUE OF A RANDOM VARIABLE

x_i     p_i     x_i p_i
 2      1/36     2/36
 3      2/36     6/36
 4      3/36    12/36
 5      4/36    20/36
 6      5/36    30/36
 7      6/36    42/36
 8      5/36    40/36
 9      4/36    36/36
10      3/36    30/36
11      2/36    22/36
12      1/36    12/36

Σ x_i p_i = E(x) = 252/36
The expected value turns out to be 252/36 = 7. Actually, this was obvious anyway. We saw in the previous sequence that the distribution is symmetrical about 7.
EXPECTED VALUE OF A RANDOM VARIABLE
E(x) = μ_x
Very often the expected value of a random variable is represented by μ, the Greek letter mu. If there is more than one random variable, their expected values are differentiated by adding subscripts to μ.
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

E(g(x)) = g(x_1) p_1 + ... + g(x_n) p_n = Σ_{i=1}^n g(x_i) p_i

To find the expected value of a function of a random variable, you calculate all the possible values of the function, weight them by the corresponding probabilities, and sum the results.
Example:

E(x²) = x_1² p_1 + ... + x_n² p_n = Σ_{i=1}^n x_i² p_i

For example, the expected value of x² is found by calculating all its possible values, multiplying them by the corresponding probabilities, and summing.
The calculation of the expected value of a function of a random variable will be outlined in general and then illustrated with an example.
First you list the possible values of x and the corresponding probabilities. Next you calculate the function of x for each possible value of x. Then, one at a time, you weight the value of the function by its corresponding probability.
You do this individually for each possible value of x. The sum of the weighted values, Σ g(x_i) p_i, is the expected value of the function of x.
The process will be illustrated for x², where x is the random variable defined in the first sequence. The 11 possible values of x and the corresponding probabilities are listed.
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE
Note that E(x²) is not the same thing as E(x) squared. In the previous sequence we saw that E(x) for this example was 7. Its square is 49.
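A short sketch (an assumption, not from the slides) making the point numerically for the two-dice variable:

```python
from fractions import Fraction

probs = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}

e_x  = sum(x * p for x, p in probs.items())        # E(x)   = 7
e_x2 = sum(x**2 * p for x, p in probs.items())     # E(x^2) = 1974/36 = 329/6 ≈ 54.83

print(e_x, e_x**2, e_x2)   # E(x^2) differs from [E(x)]^2 = 49
```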
POPULATION VARIANCE
OF A DISCRETE RANDOM
VARIABLE
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE
Population variance of x:

E[(x − μ)²] = (x_1 − μ)² p_1 + ... + (x_n − μ)² p_n = Σ_{i=1}^n (x_i − μ)² p_i
We will calculate the population variance of the random variable x defined in the first sequence. We start as usual by listing the possible values of x and the corresponding probabilities.
Next we need a column giving the deviations of the possible values of x about its population mean, x_i − μ. In the second sequence we saw that the population mean of x was 7: μ_x = E(x) = 7.
A reason for making an initial guess is that it may help you to identify an arithmetical error, if you make one. If the initial guess and the outcome are very different, that is a warning.
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

x_i    p_i    x_i − μ   (x_i − μ)²   (x_i − μ)² p_i
 2     1/36     −5         25           0.69
 3     2/36     −4         16           0.89
 4     3/36     −3          9           0.75
 5     4/36     −2          4           0.44
 6     5/36     −1          1           0.14
 7     6/36      0          0           0.00
 8     5/36      1          1           0.14
 9     4/36      2          4           0.44
10     3/36      3          9           0.75
11     2/36      4         16           0.89
12     1/36      5         25           0.69
Summing the entries in the last column gives the population variance: Σ (x_i − μ)² p_i = 5.83.
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE
Population variance of x
There are several ways of writing the population variance. First the formal mathematical definition:
pop.var(x) = E[(x − μ)²] = σ_x²
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE
Standard deviation of x:
σ_x = √E[(x − μ)²]
EXPECTED VALUE
RULES
EXPECTED VALUE RULES
This sequence states the rules for manipulating expected values. First, the additive rule: the expected value of the sum of two random variables is the sum of their expected values, E(x + y) = E(x) + E(y).
EXPECTED VALUE RULES
The second rule is the multiplicative rule: the expected value of a variable multiplied by a constant is equal to the constant multiplied by the expected value of the variable, E(bx) = bE(x).
EXPECTED VALUE RULES
For example, the expected value of 3x is three times the expected value of x.
97
EXPECTED VALUE RULES
Finally, the expected value of a constant is just the constant. Of course this is obvious.
98
EXPECTED VALUE RULES
y = a + bx
E(y) = E(a + bx)
As an exercise, we will use the rules to simplify the expected value of an expression.
Suppose that we are interested in the expected value of a variable y, where y = a + bx.
99
EXPECTED VALUE RULES
y = a + bx
E(y) = E(a + bx)
= E(a) + E(bx)
We use the first rule to break up the expected value into its two components.
100
EXPECTED VALUE RULES
y = a + bx
E(y) = E(a + bx)
= E(a) + E(bx)
= a + bE(x)
Then we use the second rule to replace E(bx) by bE(x) and the third rule to simplify E(a) to
just a. This is as far as we can go in this example.
101
INDEPENDENCE
OF TWO RANDOM
VARIABLES
ALTERNATIVE EXPRESSION FOR POPULATION VARIANCE

pop.var(x) = E[(x − μ)²] = E(x²) − μ²

This sequence derives an alternative expression for the population variance of a random variable. It provides an opportunity for practising the use of the expected value rules.

pop.var(x) = E[(x − μ)²]
           = E[x² − 2μx + μ²]
           = E(x²) − 2μE(x) + μ²
           = E(x²) − 2μ² + μ²
           = E(x²) − μ²

Now the first expected value rule is used to decompose the expression into three separate expected values. The second expected value rule is used to simplify the middle term and the third rule is used to simplify the last one. The middle term is rewritten, using the fact that E(x) and μ_x are just different ways of writing the population mean of x, which gives the result.
DISCRETE RANDOM VARIABLES
[Bar chart: probabilities 1/36, ..., 6/36, ..., 1/36 plotted against x = 2, ..., 12]
A discrete random variable is one that can take only a finite set of values. The sum of the numbers when two dice are thrown is an example.
Each value has associated with it a finite probability, which you can think of as a "packet" of probability. The packets sum to unity because the variable must take one of the values.
CONTINUOUS RANDOM VARIABLES
[Graph: probability density over the range 55 to 75]
However, most random variables encountered in econometrics are continuous. They can take any one of an infinite set of values defined over a range (or possibly, ranges).
As a simple example, take the temperature in a room. We will assume that it can be anywhere from 55 to 75 degrees Fahrenheit with equal probability within the range.
In the case of a continuous random variable, the probability of it being equal to a given finite value (for example, temperature equal to 55.473927) is always infinitesimal: P(x = a) = 0.
For this reason, you can only talk about the probability of a continuous random variable lying between two given values. The probability is represented graphically as an area.
For example, you could measure the probability of the temperature being between 55 and 56, both measured exactly.
The probability per unit interval is 0.05 and accordingly the area of the rectangle representing the probability of the temperature lying in any given unit interval is 0.05.
The probability per unit interval is called the probability density and it is equal to the height of the unit-interval rectangle.
The vertical axis is given the label probability density, rather than height. f(x) is known as the probability density function and is shown graphically in the diagram as the thick black line.
CONTINUOUS RANDOM VARIABLES
f(x) = 0.05 for 55 ≤ x ≤ 75
f(x) = 0 for x < 55 and x > 75
Typically you have to use the integral calculus to work out the area under a curve, but in this very simple example all you have to do is calculate the area of a rectangle.
The height of the rectangle is 0.05 and its width is 5, so its area is 0.25.
CONTINUOUS RANDOM VARIABLES
[Graph: triangular probability density declining from 0.20 at x = 65 to 0 at x = 75]
Now suppose that the temperature can lie in the range 65 to 75 degrees, with uniformly decreasing probability as the temperature gets higher.
The total area of the triangle is unity because the probability of the temperature lying in the 65 to 75 range is unity. Since the base of the triangle is 10, its height must be 0.20.
CONTINUOUS RANDOM VARIABLES
f(x) = 1.50 − 0.02x for 65 ≤ x ≤ 75
f(x) = 0 for x < 65 and x > 75
In this example, the probability density function is a line of the form f(x) = a + bx. To pass through the points (65, 0.20) and (75, 0), a must equal 1.50 and b must equal −0.02.
Suppose that we are interested in finding the probability of the temperature lying between 65 and 70 degrees.
We could do this by evaluating the integral of the function over this range, but there is no need. It is easy to show geometrically that the answer is 0.75. This completes the introduction to continuous random variables.
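The 0.75 figure can be checked by numerical integration (a sketch, assuming scipy is available; it is not part of the slides):

```python
from scipy.integrate import quad

# Triangular density from the slide: f(x) = 1.50 - 0.02x on [65, 75], 0 elsewhere
f = lambda x: 1.50 - 0.02 * x

prob, _ = quad(f, 65, 70)    # P(65 <= x <= 70)
total, _ = quad(f, 65, 75)   # should integrate to 1 over the full range
print(round(prob, 4), round(total, 4))   # 0.75  1.0
```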
EXPECTED VALUE, VARIANCE &
COVARIANCE RULES
A SUMMARY
➢ VARIANCE
▪ Var[a]=0 a is a constant
▪ Var[aX] = a2 Var[X]
▪ Var[a+X] = Var[X]
▪ Var[X+Y] = Var[X] + Var[Y] + 2 Cov[X,Y]
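The variance rules above can be checked numerically with simulated data (a sketch under assumed, made-up distributions; not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # correlated with x
a = 3.0

print(np.var(a + x), np.var(x))            # Var[a+X] = Var[X]
print(np.var(a * x), a**2 * np.var(x))     # Var[aX] = a^2 Var[X]
print(np.var(x + y),
      np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1])   # Var[X+Y] rule
```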
THE FIXED AND RANDOM
COMPONENTS OF
A RANDOM VARIABLE
THE FIXED AND RANDOM COMPONENTS OF A RANDOM VARIABLE
In this short sequence we shall decompose a random variable x into its fixed and random components. Let the population mean of x be μ_x.
The actual value of x in any observation will in general be different from μ_x. We will call the difference u_i, so u_i = x_i − μ_x.
Re-arranging this equation, we can write x as the sum of its fixed component, μ_x, which is the same for all observations, and its random component, u: x_i = μ_x + u_i.
The expected value of the random component is zero. It does not systematically tend to increase or decrease x. It just makes it deviate from its population mean.
ESTIMATORS
ESTIMATORS
Mean μ_x:  x̄ = (1/n) Σ_{i=1}^n x_i
Population variance σ_x²:  s² = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)²
ESTIMATORS

x̄ = (1/n) Σ_{i=1}^n x_i = (1/n)(x_1 + ... + x_n)
ESTIMATORS
Estimators are random variables.

x̄ = (1/n) Σ_{i=1}^n x_i = (1/n)(x_1 + ... + x_n)

x_i = μ_x + u_i

x̄ = (1/n)(μ_x + ... + μ_x) + (1/n)(u_1 + ... + u_n) = (1/n)(n μ_x) + ū = μ_x + ū
ESTIMATORS
[Graph: probability density functions of x and x̄, both centred on μ_x]
The graph compares the probability density functions of x and x̄. As we have seen, they have the same fixed component. However the distribution of the sample mean is more concentrated.
Its random component tends to be smaller than that of x because it is the average of the random components in all the observations, and these tend to cancel each other out.
TYPES OF ESTIMATORS
➢ Least Squares
➢ The Method of Moments
➢ Maximum Likelihood
ESTIMATION OF
SAMPLE MEAN
144
THE LEAST SQUARES

x_i = μ + u_i,   e_i = x_i − x̄
[Diagram: observations x_1, ..., x_7 scattered about μ, with residuals e_1, ..., e_7]

Minimizing Σ e_i² with respect to x̄:
d(Σ e_i²)/dx̄ = 2 Σ (x_i − x̄)(−1) = −2 Σ (x_i − x̄) = 0
⇒ Σ x_i = n x̄  ⇒  x̄ = (Σ x_i) / n
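A quick numerical check of this result (a sketch with made-up data, assuming scipy is available; not from the slides): the value that minimizes the sum of squared deviations coincides with the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([4.0, 6.0, 5.5, 3.0, 7.5])

# Minimize the sum of squared deviations around a candidate value c
res = minimize_scalar(lambda c: np.sum((x - c) ** 2))
print(res.x, x.mean())   # both equal the sample mean, 5.2
```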
THE METHOD OF MOMENTS
➢ Suppose there are k unknown parameters
➢ Select k population moments in terms of the unknown parameters
➢ If there are k moments and k unknown parameters, the unknown parameters can be solved for
– An advantage of this method is that it is based on moments that are often easy to compute.
– However, it should be noted that if the number of moments is greater than the number of unknown parameters, the obtained estimates depend on the chosen moments.
– The rth (central) moment: μ_r = E[(x − μ)^r]
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION
[Graphs: probability density p against μ (top panel) and likelihood L against μ (bottom panel)]
This sequence introduces the principle of maximum likelihood estimation and illustrates it with some simple examples.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

f(x) = (1 / (σ√(2π))) · exp(−½((x − μ)/σ)²)

Suppose that you have a normally-distributed random variable x with unknown population mean μ and standard deviation σ, and that you have a sample of two observations, 4 and 6.
For the time being, we will assume that σ is equal to 1.
Suppose initially you consider the hypothesis μ = 3.5. Under this hypothesis the probability density at 4 would be 0.3521 and that at 6 would be 0.0175.
The joint probability density, shown in the bottom chart, is the product of these, 0.0062.
Next consider the hypothesis μ = 4.0. Under this hypothesis the probability densities associated with the two observations are 0.3989 and 0.0540, and the joint probability density is 0.0215.
Under the hypothesis μ = 4.5, the probability densities are 0.3521 and 0.1295, and the joint probability density is 0.0456.
Under the hypothesis μ = 5.0, the probability densities are both 0.2420 and the joint probability density is 0.0585.
Under the hypothesis μ = 5.5, the probability densities are 0.1295 and 0.3521 and the joint probability density is 0.0456.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

 μ      p(4)     p(6)      L
3.5    0.3521   0.0175   0.0062
4.0    0.3989   0.0540   0.0215
4.5    0.3521   0.1295   0.0456
5.0    0.2420   0.2420   0.0585
5.5    0.1295   0.3521   0.0456

The complete joint density function for all values of μ has now been plotted in the lower diagram. We see that it peaks at μ = 5.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

f(x) = (1 / (σ√(2π))) · exp(−½((x − μ)/σ)²)

For the time being, we are assuming σ is equal to 1, so the density function simplifies to the second expression:

f(x) = (1 / √(2π)) · exp(−½(x − μ)²)
f(4) = (1 / √(2π)) · exp(−½(4 − μ)²)        f(6) = (1 / √(2π)) · exp(−½(6 − μ)²)

joint density = [(1 / √(2π)) · exp(−½(4 − μ)²)] · [(1 / √(2π)) · exp(−½(6 − μ)²)]

The joint probability density for the two observations in the sample is just the product of their individual densities.
In maximum likelihood estimation we choose as our estimate of μ the value that gives us the greatest joint density for the observations in our sample. This value is associated with the greatest probability, or maximum likelihood, of obtaining the observations in the sample.
In the graphical treatment we saw that this occurs when μ is equal to 5. We will prove this must be the case mathematically.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

L(μ | 4, 6) = [(1 / √(2π)) · exp(−½(4 − μ)²)] · [(1 / √(2π)) · exp(−½(6 − μ)²)]

To do this, we treat the sample values x = 4 and x = 6 as given and we use the calculus to determine the value of μ that maximizes the expression.
When it is regarded in this way, the expression is called the likelihood function for μ, given the sample observations 4 and 6. This is the meaning of L(μ | 4, 6).
To maximize the expression, we could differentiate with respect to μ and set the result equal to 0. This would be a little laborious. Fortunately, we can simplify the problem with a trick.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

log L = log { [(1 / √(2π)) · exp(−½(4 − μ)²)] · [(1 / √(2π)) · exp(−½(6 − μ)²)] }
      = log [(1 / √(2π)) · exp(−½(4 − μ)²)] + log [(1 / √(2π)) · exp(−½(6 − μ)²)]
      = log(1 / √(2π)) + log exp(−½(4 − μ)²) + log(1 / √(2π)) + log exp(−½(6 − μ)²)
      = 2 log(1 / √(2π)) − ½(4 − μ)² − ½(6 − μ)²

log L is a monotonically increasing function of L (meaning that log L increases if L increases and decreases if L decreases). It follows that the value of μ which maximizes log L is the same as the one that maximizes L. As it so happens, it is easier to maximize log L.
The logarithm of the product of the density functions can be decomposed as the sum of their logarithms.
Using the product rule a second time, we can decompose each term as shown.
Now one of the basic rules for manipulating logarithms, log a^b = b log a, allows us to rewrite the second term as shown: log exp(−½(4 − μ)²) = −½(4 − μ)² · log e.
log e is equal to 1, another basic logarithm result. (Remember, as always, we are using natural logarithms, that is, logarithms to base e.)
Hence the second term reduces to a simple quadratic in μ. And so does the fourth. We will now choose μ so as to maximize this expression.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

log L = 2 log(1 / √(2π)) − ½(4 − μ)² − ½(6 − μ)²

−½(a − μ)² = −½(a² − 2aμ + μ²) = −½a² + aμ − ½μ²

d[−½(a − μ)²]/dμ = a − μ
d log L / dμ = (4 − μ) + (6 − μ)

d log L / dμ = 0  ⇒  μ̂ = 5

Thus from the first order condition we confirm that 5 is the value of μ that maximizes the log-likelihood function, and hence the likelihood function.
Note that a caret mark has been placed over μ, because we are now talking about an estimate of μ, not its true value.
Note also that the second differential of log L with respect to μ is −2. Since this is negative, we have found a maximum, not a minimum.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

f(x_i) = (1 / √(2π)) · exp(−½(x_i − μ)²)

Now treating the sample values as fixed, we can re-interpret the joint density function as the likelihood function for μ, given this sample. We will find the value of μ that maximizes it.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

L(μ | x_1, ..., x_n) = [(1 / √(2π)) · exp(−½(x_1 − μ)²)] · ... · [(1 / √(2π)) · exp(−½(x_n − μ)²)]

log L = n log(1 / √(2π)) − ½(x_1 − μ)² − ... − ½(x_n − μ)²

We will do this indirectly, as before, by maximizing log L with respect to μ. The logarithm decomposes as shown.
d log L / dμ = (x_1 − μ) + ... + (x_n − μ)

d log L / dμ = 0  ⇒  Σ x_i − n μ̂ = 0  ⇒  μ̂ = (1/n) Σ x_i = x̄

f(x_i) = (1 / (σ√(2π))) · exp(−½((x_i − μ)/σ)²)
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION
[Graphs: probability densities (top panel) and likelihood L against σ (bottom panel)]
We will illustrate the process graphically with the two-observation example, keeping μ fixed at 5. We will start with σ equal to 2.
σ = 2.0:  p(4) = 0.1760,  p(6) = 0.1760,  L = 0.0310
Now try σ equal to 1. The individual densities are 0.2420 and so the joint density, 0.0586, has increased.
Now try putting σ equal to 0.5. The individual densities have fallen and the joint density is only 0.0117.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

 σ      p(4)     p(6)      L
2.0    0.1760   0.1760   0.0310
1.0    0.2420   0.2420   0.0586
0.5    0.1080   0.1080   0.0117

The joint density has now been plotted as a function of σ in the lower diagram. You can see that in this example it is greatest for σ equal to 1.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

f(x_i) = (1 / (σ√(2π))) · exp(−½((x_i − μ)/σ)²)

L(μ, σ | x_1, ..., x_n) = [(1 / (σ√(2π))) · exp(−½((x_1 − μ)/σ)²)] · ... · [(1 / (σ√(2π))) · exp(−½((x_n − μ)/σ)²)]
log L = log { [(1 / (σ√(2π))) · exp(−½((x_1 − μ)/σ)²)] · ... · [(1 / (σ√(2π))) · exp(−½((x_n − μ)/σ)²)] }
      = log [(1 / (σ√(2π))) · exp(−½((x_1 − μ)/σ)²)] + ... + log [(1 / (σ√(2π))) · exp(−½((x_n − μ)/σ)²)]
      = n log(1/σ) + n log(1/√(2π)) − (1/(2σ²)) [(x_1 − μ)² + ... + (x_n − μ)²]

∂ log L / ∂μ = (1/σ²) [(x_1 − μ) + ... + (x_n − μ)] = (1/σ²) (Σ x_i − nμ)
∂ log L / ∂μ = 0  ⇒  μ̂ = x̄
Using the rule log a^b = b log a, we have log(1/σ) = log σ⁻¹ = (−1) log σ = −log σ, so

log L = −n log σ + n log(1/√(2π)) − (1/(2σ²)) Σ (x_i − μ)²
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

∂ log L / ∂σ = −n/σ + σ⁻³ Σ (x_i − μ)²
∂ log L / ∂σ = 0  ⇒  −n/σ̂ + σ̂⁻³ Σ (x_i − μ̂)² = 0
−n σ̂² + Σ (x_i − x̄)² = 0

σ̂² = (1/n) Σ (x_i − x̄)² = Var(x)

Hence the maximum likelihood estimator of the population variance is the sample variance.
Note that it is biased. The unbiased estimator is obtained by dividing by (n − 1), not n.
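The bias can be seen in a small simulation (a sketch with made-up parameter values; not from the slides): averaging the two estimators over many samples, the divide-by-n estimator falls short of the true variance while the divide-by-(n−1) estimator does not.

```python
import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0     # variance of N(0, 2^2)
n = 10

samples = rng.normal(0, 2, size=(50_000, n))
ml = samples.var(axis=1, ddof=0).mean()        # divide by n (maximum likelihood)
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n - 1

print(ml, unbiased, true_var)   # the ddof=0 average sits below 4; ddof=1 is close to 4
```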
However it can be shown that the maximum likelihood estimator is asymptotically efficient, in the sense of having a smaller mean square error than the unbiased estimator in large samples.
COMPARISON OF METHODS
➢ It depends on the application which method is the most attractive
one.
➢ If the model is expressed in terms of an equation, then least squares
is intuitively appealing, as it optimizes the fit of the model with
respect to the observations.
➢ Least squares and the method of moments are both based on the idea
of minimizing a distance function.
– For least squares, the distance is measured directly in terms of the
observed data
– For the method of moments, the distance is measured in terms of the
sample and population moments.
➢ The maximum likelihood method is not based on a distance
function, but on the likelihood function that expresses the likelihood
or 'credibility' of parameter values with respect to the observed data.
➢ The maximum likelihood estimators have optimal properties in
large samples.
UNBIASEDNESS
AND
EFFICIENCY
UNBIASEDNESS & EFFICIENCY
[Diagram: four target-style panels — Unbiased & Efficient, Unbiased & Inefficient, Biased & Efficient, Biased & Inefficient]
UNBIASEDNESS AND EFFICIENCY

Unbiasedness of x̄:
E(x̄) = E[(1/n)(x_1 + ... + x_n)] = (1/n) E(x_1 + ... + x_n)
     = (1/n) [E(x_1) + ... + E(x_n)] = (1/n)(n μ_x) = μ_x

We use the second expected value rule to take the (1/n) factor out of the expectation expression.
Next we use the first expected value rule to break up the expression into the sum of the expectations of the observations.
Each expectation is equal to μ_x, and hence the expected value of the sample mean is μ_x.
However, the sample mean is not the only unbiased estimator of the population mean. We will demonstrate this supposing that we have a sample of two observations (to keep it simple).
UNBIASEDNESS AND EFFICIENCY

Consider the generalized estimator Z = λ_1 x_1 + λ_2 x_2:
E(Z) = E(λ_1 x_1 + λ_2 x_2) = E(λ_1 x_1) + E(λ_2 x_2)
     = λ_1 E(x_1) + λ_2 E(x_2) = (λ_1 + λ_2) μ_x
     = μ_x  if  (λ_1 + λ_2) = 1

We will analyze the expected value of Z and find out what condition the weights have to satisfy for Z to be an unbiased estimator.
Now we use the second expected value rule to bring λ_1 and λ_2 out of the expected value expressions.
[Graph: probability density functions of estimators A and B, both centred on μ_x]
UNBIASEDNESS AND EFFICIENCY
[Graph: f(λ_1) plotted against λ_1]
[Diagram: biased versus unbiased and efficient versus inefficient estimators]
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE
One way is to define a loss function which reflects the cost to you of making errors, positive or negative, of different sizes.
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

MSE(Z) = E[(Z − θ)²] = σ_Z² + (μ_Z − θ)²

[Graph: probability density function of Z, with the bias μ_Z − θ marked]

The mean square error can be shown to be equal to the sum of the population variance of the estimator and the square of the bias.
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

MSE(Z) = E[(Z − θ)²]
       = E[(Z − μ_Z + μ_Z − θ)²]
       = E[(Z − μ_Z)² + (μ_Z − θ)² + 2(Z − μ_Z)(μ_Z − θ)]
       = E[(Z − μ_Z)²] + E[(μ_Z − θ)²] + E[2(Z − μ_Z)(μ_Z − θ)]
       = σ_Z² + (μ_Z − θ)² + 2(μ_Z − θ) E(Z − μ_Z)
       = σ_Z² + (μ_Z − θ)² + 2(μ_Z − θ)(μ_Z − μ_Z)
       = σ_Z² + (μ_Z − θ)²
We use the first expected value rule to break up the expectation into its three components.
In the third term, (μ_Z − θ) may be brought out of the expectation, again because it is a constant, using the second expected value rule.
Hence the third term is zero and the mean square error of Z is shown to be the sum of the population variance of Z and the bias squared.
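The decomposition MSE = variance + bias² can be checked by simulation (a sketch with made-up parameter values; not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 5.0                                # true parameter value
Z = rng.normal(5.4, 1.0, size=200_000)     # a biased estimator: mean 5.4, sd 1.0

mse = np.mean((Z - theta) ** 2)
var_plus_bias_sq = Z.var() + (Z.mean() - theta) ** 2
print(mse, var_plus_bias_sq)   # both close to 1.0 + 0.4^2 = 1.16
```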
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE
[Graph: probability density functions of estimators A and B]
[Graph: probability density function of x̄ for n = 1, with σ_x̄ = 50]
The sample mean will have the same population mean as x, but its standard deviation will be 50/√n, where n is the number of observations in the sample.
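A simulation sketch of this result (the population mean of 100 and the repetition count are made-up values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
pop_sd = 50

for n in [1, 4, 25, 100, 1000, 5000]:
    means = rng.normal(100, pop_sd, size=(2_000, n)).mean(axis=1)
    print(n, round(means.std(), 1), round(pop_sd / np.sqrt(n), 1))
# The simulated standard deviation of the sample mean tracks 50 / sqrt(n).
```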
EFFECT OF INCREASING THE SAMPLE SIZE ON THE DISTRIBUTION OF x̄
The larger is the sample, the smaller will be the standard deviation of the sample mean.
We will see how the shape of the distribution changes as the sample size is increased.
To see what happens for n greater than 100, we will have to change the vertical scale.
EFFECT OF INCREASING THE SAMPLE SIZE ON THE DISTRIBUTION OF x̄

   n       σ_x̄
    1      50
    4      25
   25      10
  100       5
 1000       1.6
 5000       0.7

plim x̄ = μ
EXAMPLE OF AN ESTIMATOR BIASED IN FINITE SAMPLES BUT CONSISTENT
[Graph: probability density function of Z for n = 20]
In the diagram, Z is an estimator of a population characteristic θ. Looking at the probability distribution of Z, you can see that Z is biased upwards.
[Graphs: probability density functions of Z for n = 100, n = 1,000, and n = 100,000, increasingly concentrated around θ as n grows]
In the case of the estimator in the diagram, both of the conditions are approximately satisfied when the sample size is 100,000.
EXAMPLE OF AN ESTIMATOR BIASED IN FINITE SAMPLES BUT CONSISTENT
If Z = X / Y, then plim Z = plim X / plim Y.