
ECONOMETRICS

INTRODUCTION
&
REVIEW

WEEK1

FALL 2024

Prof. Dr. Burç Ülengin


COURSE OBJECTIVE
➢ The objective of this course is to provide the basic
knowledge of econometrics, which is essential
equipment for any business student, to a level
where the participant would be competent to
continue studying the subject at a graduate level.

➢ The course emphasizes both technical knowledge


and intuitive understanding, enabling participants to
apply their skills creatively and confidently.

2
COURSE DESCRIPTION
➢ Econometrics ISL355E introduces you to the
regression methods for analyzing data in economics.
➢ This course emphasizes both the theoretical and
practical aspects of statistical analysis. It focuses on
techniques for estimating various econometric
models and conducting tests of hypotheses that are
of interest to economists.
➢ The goal is to help you develop a solid theoretical
background in introductory-level econometrics, the
ability to implement the techniques, and critique
empirical studies in economics.
3
TEXTBOOKS
• Applied Statistics and Econometrics. Basic Topics and Tools
with Gretl and R (2024) B. K. Kivedal
• Elia Kacapyr (2022) Essential Econometric Techniques A Guide to
Concepts and Applications 3rd Ed.
• Dougherty, C. (2016), Introduction to Econometrics, 5th. Ed., Oxford
University Press
• Studenmund, A. H. (2017), Using Econometrics A Practical Guide 7th
Edition, Pearson
✓ Pedace, R. (2013) Econometrics for Dummies, John Wiley & Sons, Inc
✓ Hanck C., Arnold M., Gerber A. and Schmelzer M. (2020), Introduction
to Econometrics with R, https://www.econometrics-with-r.org/
✓ Griffiths, W. E., Hill, R. C., Lim, G.C. (2008), Using Eviews for
Principles of Econometrics, 3rd. Ed. John Wiley

Lecture notes and materials are on the course website at NINOVA.
SYLLABUS
WEEK1 REVIEW - STATISTICS
WEEK2 BASIC CONCEPTS OF ECONOMETRICS
WEEK3 SIMPLE REGRESSION
WEEK4 PROPERTIES OF REGRESSION COEFFICIENTS
WEEK5 MULTIPLE REGRESSION
WEEK6 HYPOTHESIS TESTS AND DIAGNOSTIC CHECKS
WEEK7 NONLINEAR MODELS
WEEK8 MIDTERM
WEEK9 DUMMY VARIABLES
WEEK10 MODEL MISSPECIFICATION
WEEK11 TIME SERIES ECONOMETRICS
WEEK12 DYNAMIC ECONOMETRIC MODELS
WEEK13 INTEGRATION AND COINTEGRATION
WEEK14 ECONOMETRIC APPLICATIONS
COURSE REQUIREMENTS

➢ Problem sets (15%)

➢ Quizzes (15%)

➢ Midterm (30%)

➢ Final exam (40%)

6
ECONOMETRICS
➢ Econometrics is a branch of economics that
utilizes mathematical and statistical methods to
analyze economic theories and validate them
through empirical evidence.

ECONOMETRICS PROVIDES EMPIRICAL CONTENT FOR MUCH ECONOMIC THEORY

ECONOMIC THEORY answers WHY?
ECONOMETRICS answers HOW MUCH / HOW MANY?
ECONOMETRICS
• Theoretical foundations
– Behavioral modeling: Economic growth, Labor
supply, Demand equations, etc.
– Microeconometrics, Macroeconometrics, Financial
econometrics, Marketing …
• Mathematical elements
• Statistical foundations
• ‘Econometric Model’ building
– Mathematical elements
– The underlying truth – is there one?

[Diagram: ECONOMETRICS shown at the intersection of STATISTICS, MATHEMATICS, and ECONOMICS]
HISTORICAL PERSPECTIVE
➢ 1900 first studies; Ragnar Frisch
➢ Tinbergen, the first applied econometric model –United Nations
➢ Acceleration with Keynesian Theory
➢ 1950-1970 post-war years golden age
➢ 1970 oil crisis, decline in econometrics studies
➢ 1970-1990 New research
➢ 1990 re-shining
➢ Econometric studies increased in number and became widespread in
– Economics
– Finance
– Marketing
➢ Two different approaches
– Theoretical – based on causal relationships
– Atheoretical – time series econometrics
➢ Two main schools
– LSE
– USA
STEPS OF APPLIED ECONOMETRIC STUDY

ECONOMIC THEORY & LITERATURE SURVEY
↓
MODEL SPECIFICATION
↓
DATA COLLECTION & PRELIMINARY ANALYSIS
↓
MODEL ESTIMATION
↓
DIAGNOSTIC TESTS
↓
PERFORMANCE GOOD?
– NO: revise the model specification
– YES: INTERPRETATION OF FINDINGS & MODEL USAGE
ECONOMETRICS
➢ The exciting aspect of econometrics is its focus on verifying
or disproving economic laws, such as purchasing power
parity, the life cycle hypothesis, and the quantity theory of
money, using economic data.
➢ David F. Hendry (1980) emphasized this function of
econometrics:
– The three golden rules of econometrics are test, test, and
test; all three rules are broken regularly in empirical
applications and are fortunately easily remedied.
Rigorously tested models, which adequately described the
available data, encompassed previous findings, and were
derived from well-based theories, would enhance any
claim to be scientific.
12
USAGE OF ECONOMETRIC STUDY
➢ STRUCTURAL ANALYSIS
– Price and income elasticity estimation
– Smoking and cancer/ heart attack relationship, reality or myth?
– Effect of exchange rate on import and export of Turkey
➢ POLICY RECOMMENDATIONS
– If the interest rate increases by 1 percentage point, what effect
does it have on inflation?
– If income tax increases by 5 percentage points, how will it
affect economic growth?
– If customer satisfaction increases, how does it affect the sales
volume of the firm?
➢ FORECASTING
– GDP growth rate in 2021
– Firm sales in 2021
– Population of İstanbul in 2040
13
TRENDS IN ECONOMETRICS

➢ Small structural models


➢ Pervasiveness of an econometrics paradigm
➢ Non- and semiparametric methods vs. parametric
➢ Robust methods / Estimation and inference
➢ Nonlinear modeling (the role of software)
➢ Behavioral and structural modeling vs. “reduced
form,” “covariance analysis.”
➢ Identification and “causal” effects

14
Data-Generating Process (DGP)
ECONOMETRICS: SCIENCE + ART
➢ Econometrics, while based on scientific principles, still retains a
certain element of art.
➢ According to Malinvaud (1966), the art of econometrics is finding
the correct set of sufficiently specific yet realistic assumptions to
enable us to take the best possible advantage of the available data.
➢ Data in economics are not generated under ideal experimental
conditions as in a physics laboratory. This data cannot be replicated
and is most likely measured with error.
➢ Many published empirical studies find that economic data may not
have enough variation to discriminate between competing
economic theories.
➢ To some, the “art” element in econometrics has left several
distinguished economists doubtful of the power of econometrics to
yield sharp predictions.
16
CRITIQUES OF ECONOMETRICS
➢ Econometrics has its critics. Interestingly, John Maynard
Keynes (1940, p. 156) had the following to say about Jan
Tinbergen’s (1939) pioneering work:
– No one could be more frank, painstaking, or free of
subjective bias or parti pris than Professor Tinbergen. There is
no one, therefore, so far as human qualities go, whom it
would be safer to trust with black magic. I am not yet
persuaded that there is anyone I would trust with it at the
present stage or that this brand of statistical alchemy is ripe
to become a branch of science. But Newton, Boyle, and
Locke all played with alchemy. So, let him continue.
➢ In 1969, Jan Tinbergen shared the first Nobel Prize in
economics with Ragnar Frisch.
17
RESPONSE TO THE CRITIQUES
➢ Econometrics has limitations due to incomplete economic theory
and non-experimental data, but it has played a fundamental role
in developing economics as a scientific discipline.
➢ Economic theories can't be conclusively rejected using
econometric methods, but testing specific formulations against
rival alternatives can still be valuable. Despite the challenge of
specification searches, econometric modeling remains
worthwhile.
➢ Econometric models are essential tools for forecasting and
policy analysis, and it is unlikely that they will be discarded.
The challenge is recognizing their limitations and working
towards turning them into more reliable and practical tools.
There seem to be no viable alternatives.
18
DATA STRUCTURES
➢ Observation mechanisms
– Passive, nonexperimental (the usual)
– Randomly assigned experiment (wishful)
➢ Data types
– Cross-section: X_i
– Time series: X_t
– Panel: X_it
➢ The data type you’re using may influence how you
estimate your econometric model. In particular,
specialized techniques are usually required to deal
with time series and panel data.
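As an added illustration (not on the original slides), the three data types can be sketched as pandas objects; all variable names and numbers below are made up:

import pandas as pd

# Cross-section: many units, one period (X_i)
cross_section = pd.DataFrame(
    {"firm": ["A", "B", "C"], "investment": [10.2, 7.5, 12.1]}
)

# Time series: one unit, many periods (X_t)
time_series = pd.Series(
    [2.1, 2.4, 2.2],
    index=pd.period_range("2021Q1", periods=3, freq="Q"),
    name="gdp_growth",
)

# Panel: many units observed over many periods (X_it)
panel = pd.DataFrame(
    {"bank": ["A", "A", "B", "B"], "year": [2020, 2021, 2020, 2021],
     "roa": [1.1, 1.3, 0.9, 1.0]}
).set_index(["bank", "year"])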
EXPERIMENTAL DATA
➢ Practical situations often arise where the questions that interest us are such
that no data are available to answer the questions. We may have to generate
the required data.
➢ Simple example. A coffee powder manufacturer would like to design a
packaging and pricing strategy for the product that maximizes its revenue.
➢ He knows that using a plastic bag with color positively affects the
consumer’s choice, while a colored plastic bag is more costly than a plain
plastic cover. He needs to estimate the net benefit he would have in
introducing a colored plastic bag.
➢ He also knows that consumers prefer fresh coffee powder; thus, depending
on the weekly consumption rate, they choose the packet size. The larger the
packet size that a household wants, the lower its willingness to pay, but
smaller packets will increase the cost of packaging.
➢ He would like to know the net benefits to the firm of different sizes of the
packets at different price levels he could fix for them given different types
of demand. 21
EXPERIMENTAL DATA
➢ Historically collected data on coffee sales may be useless in answering
these questions, as colored plastic bags were not used in the past. The
manufacturer cannot simply introduce the new colored package, and incur the
higher costs, without evidence on its net benefit.
➢ To introduce more realism and more complexity, let us assume that
there is also a cost-saving option.
➢ A coffee substitute, called chicory, brings thickness and bitterness to
coffee that some people may like when mixed with coffee. However,
too much chicory is not appreciated by many consumers. As a result,
the manufacturer expects that the greater the chicory content, the lower
the price the customer is willing to pay.
➢ The coffee manufacturer wishes to conduct a small-scale pilot
marketing experiment to estimate the effects on net revenue of
different types of packaging, different levels of chicory, and different
packet sizes.
22
EXPERIMENTAL DATA
➢ How should one experiment?
Each factor is set at two levels, labeled Low (L) and High (H): chicory
content (10% at the low level), packet size (100 and 200 grams), and
plain versus colored cover.
➢ The questions of interest are:
1. How do you choose the factors and assign them to the experimental subjects of
the pilot experiment?
2. How do the changes in the three factors affect people’s willingness to pay for
100 grams of coffee powder?
3. Is the relation between these factors and willingness to pay linear or nonlinear?
4. How can we estimate the effects?
➢ These questions can be answered using the statistical theory of
design of experiments and the statistical methods of analysis of
variance or conjoint analysis.
CROSS-SECTION DATA
➢ This data type consists of measurements for individual observations
(persons, households, firms, counties, states, countries, or whatever)
at a given time. The observed variation is due to differences in the units'
characteristics.
➢ For example, if the research question is to determine the determinants
of a big firm’s investment decisions, ISO 500 data for 2019 may be
used to design models.
➢ TUIK conducts nationwide sample surveys of households to record
their consumption expenditure patterns. This database is now an
excellent tool for understanding consumer behavior in Turkey and
developing retail marketing strategies.
➢ Given this sample information, one might want to know (i) if there is
any pattern implied by the theory of consumer behavior that relates
expenditure on cereals to household size and total expenditure; (ii) if
such a relation is linear or nonlinear; (iii) how to estimate alternate
specifications; and (iv) how to choose between alternate
specifications.
NON-EXPERIMENTAL DATA TAKEN
FROM SECONDARY SOURCES

➢ An advertising company noted that the pharmaceutical


industry is poised for rapid growth in India owing to
several factors, such as switching to a new product
patenting regime and economic reforms that permitted
foreign direct investment.
➢ The researcher wishes to examine the data on sales and
advertisement expenditures and demonstrate that
advertisement expenditures pay rich dividends by
generating a substantial increase in sales.
➢ The agency collected data from an industry database,
such as TISD.
25
NON-EXPERIMENTAL DATA TAKEN
FROM SECONDARY SOURCES
• The issues to be examined are:
1. Is the effect of advertising
on sales the same for all
companies in the database?
2. Do all companies in the
database have the same
structural pattern to be treated
as one sample?
3. What are the various
drivers of sales?
4. What is the most plausible functional form for the multivariate
relation between sales and these drivers?
5. How does one estimate the separate effect of each of these factors on
sales? These questions can be answered using the multiple regression
methods for cross-sectional data.
TIME SERIES DATA
➢ This data type consists of measurements on one or more
variables (such as gross domestic product, interest rates, or
unemployment rates) over time in a given space (like a
specific country or state).
➢ Time interval might be
– Annual
– Quarterly
– Monthly
– Weekly
– Daily
– or higher frequencies such as minutes, seconds, etc.
➢ Generally, annual, quarterly, and monthly data are used in
macroeconomics; weekly and higher-frequency data are used
especially in financial econometrics.
PANEL DATA
➢ This data type consists of a time series for each cross-sectional unit in the
sample. The data contains measurements for individual observations
(persons, households, firms, counties, countries, and so on) over some time
(months, quarters, or years).
➢ Several exciting questions arise concerning the banking sector in Turkey as
a result of the financial sector reforms:
1. Do the private sector banks perform better than the public sector banks?
2. After the introduction of financial sector reforms, are the public-sector banks
improving their performance relative to the private-sector banks?
3. Is the performance of all banks improving after the introduction of
financial sector reforms?
➢ There are several private-sector banks and only a few public ones. Data on
banks’ economic operations has been available for several years.
➢ Regression models for such panel data have some unique characteristics of
their own, and ordinary multiple regression models must be suitably
modified to address the data’s unique features.
28
AGGREGATION LEVEL
➢ The level of aggregation used in measuring the variables: The level of
aggregation refers to the unit of analysis when information is acquired for
the data. In other words, the variable measurements may originate at a
lower level of aggregation (like an individual, household, or firm) or
higher (like a city, county, or state).
➢ The frequency with which the data is captured refers to the rate at
which measurements are obtained. Time-series data may be captured at a
higher frequency (hourly, daily, or weekly) or a lower frequency (like
monthly, quarterly, or yearly).
➢ Remember: Having a large amount of data won't help you get accurate
results if the level of aggregation or frequency isn't suitable for your
specific problem. For instance, if you want to figure out how spending
per student impacts academic performance, using city-level data may not
work well because spending and student characteristics differ
significantly from city to city within states. This could lead to misleading
results.
A FIRST LOOK AT THE DATA
DESCRIPTIVE STATISTICS

➢ Basic Measures of Location and Dispersion


➢ Graphical Devices
– Box Plots
– Histogram
– Scatter Diagrams
– Heat Maps
– …..

31
AN APPLICATION:
LABOR MARKET DATA
IS WAGE RELATED TO EDUCATION?
Cornwell and Rupert Returns to Schooling Data, 595 Individuals, 7 Years
Variables:
EXP = work experience
WKS = weeks worked
OCC = occupation, 1 if blue collar,
IND = 1 if manufacturing industry
SOUTH = 1 if resides in south
SMSA = 1 if resides in a city (SMSA)
MS = 1 if married
FEM = 1 if female
UNION = 1 if wage set by union contract
ED = years of education
LWAGE = log of wage = dependent variable in regressions
These data were analyzed in Cornwell, C. and Rupert, P., "Efficient Estimation with
Panel Data: An Empirical Comparison of Instrumental Variable Estimators," Journal
of Applied Econometrics, 3, 1988, pp. 149-155.
32
DESCRIPTIVE STATISTICS
[Table of descriptive statistics for the wage data not reproduced]

HISTOGRAM: POOLED DATA WITHIN PERSON VARIATION
[Histogram not reproduced]

GRAPHICAL DEVICES: BOX PLOTS
MEDIAN LOG WAGE
[Box plots of log wage; they show an upward trend in median log wage]

OBJECTIVE: IMPACT OF EDUCATION ON (Log) WAGE

➢ Specification: What is the right model to use to analyze this association?
➢ Estimation
➢ Inference
➢ Analysis
SIMPLE LINEAR REGRESSION

MULTIPLE REGRESSION
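The equations on these two slides are images in the original file and are not reproduced here. In standard textbook notation (an added note, not from the slides), the two models are

$y_i = \beta_0 + \beta_1 x_i + u_i$  (simple regression)

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + u_i$  (multiple regression)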
AN APPLICATION:
Professor’s Overall Teaching Ability
➢ Have you heard of “ RateMyProfessors.com ”?
➢ On this website, students evaluate a professor’s overall teaching ability
and various other attributes. The website then summarizes these
student-submitted ratings for the benefit of any student considering
taking a class from the professor.
Model
RATINGi = β0 + β1 EASEi + β2 HOTi + εi
where:
RATINGi = the overall rating (5 = best) of the ith professor
EASEi = the easiness rating (5 = easiest) of the ith professor
(in terms of workload and grading),
HOTi = 1 if the ith professor is considered “hot” (apparently in terms of
physical attractiveness), 0 otherwise
AN APPLICATION:
Professor’s Overall Teaching Ability
Professor   RATING   EASE   HOT
    1        2.8      3.7    0
    2        4.3      4.1    1
    3        4.0      2.8    1
    4        3.0      3.0    0
    5        4.3      2.4    0
    6        2.7      2.7    0
    7        3.0      3.3    0
    8        3.7      2.7    0
    9        3.9      3.0    1
   10        2.7      3.2    0
   11        4.2      1.9    1
   12        1.9      4.8    0
   13        3.5      2.4    1
   14        2.1      2.5    0
   15        2.0      2.7    1
   16        3.8      1.6    0
   17        4.1      2.4    0
   18        5.0      3.1    1
   19        1.2      1.6    0
   20        3.7      3.1    0
   21        3.6      3.0    0
   22        3.3      2.1    0
   23        3.2      2.5    0
   24        4.8      3.3    0
   25        4.6      3.0    0

[Scatter plot “RATING by HOT”: RATING plotted against EASE, with separate markers for HOT = 0 and HOT = 1]

Data for these variables from 25 randomly chosen professors on RateMyProfessors.com.
AN APPLICATION:
Professor’s Overall Teaching Ability
Population:  RATINGi = β0 + β1 EASEi + β2 HOTi + εi
Sample:      RATINGi = b0 + b1 EASEi + b2 HOTi + ei
Estimation:  RATING = 3.23 + 0.0063*EASE + 0.59*HOT

➢ An article by Otto and colleagues indicates that


being “hot” improves a professor’s rating more
than being “easy.”
➢ Is this the correct conclusion?
James Otto, Douglas Sanford, and Douglas Ross, “Does RateMyProfessors.com Really Rate My
Professor?” Assessment and Evaluation in Higher Education, August 2008, pp. 355–368.
AN APPLICATION:
Professor’s Overall Teaching Ability
1. Do the estimated coefficients support our expectations? Explain.
2. This model includes two independent variables. Does it make sense
to think that a professor’s teaching rating depends on these two
variables? What other variable(s) might be important?
3. Suppose you add your suggested variable(s) to the equation. What
would happen to the coefficients of EASE and HOT when you add
the variable(s)? Would you expect them to change? Would you
expect them to remain the same? Explain.
4. Choose 25 new observations at random from the website and
estimate your version of the equation. Do your estimated coefficients
have the same signs as in the estimated equation above? Why or why not?
AN APPLICATION:
Professor’s Overall Teaching Ability
RATING = 3.23 + 0.0063*EASE + 0.59*HOT. Is it reliable?

Dependent Variable: RATING


Method: Least Squares
Included observations: 25

Variable Coefficient Std. Error t-Statistic Prob.

C 3.232149 0.807546 4.002435 0.0006


EASE 0.006313 0.274075 0.023033 0.9818
HOT 0.592672 0.428815 1.382117 0.1808

R-squared 0.079985 Mean dependent var 3.416000


Adjusted R-squared -0.003653 S.D. dependent var 0.960764
S.E. of regression 0.962517 Akaike info criterion 2.873636
Sum squared resid 20.38165 Schwarz criterion 3.019901
Log likelihood -32.92045 Hannan-Quinn criter. 2.914204
F-statistic 0.956323 Durbin-Watson stat 1.839778
Prob(F-statistic) 0.399711
43
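As an added sketch, the estimation can be reproduced in Python with statsmodels, using the 25 observations from the table above (the commented values are the ones reported in the EViews output):

import pandas as pd
import statsmodels.formula.api as smf

# The 25 observations from the RateMyProfessors.com table above.
rating = [2.8, 4.3, 4.0, 3.0, 4.3, 2.7, 3.0, 3.7, 3.9, 2.7, 4.2, 1.9, 3.5,
          2.1, 2.0, 3.8, 4.1, 5.0, 1.2, 3.7, 3.6, 3.3, 3.2, 4.8, 4.6]
ease   = [3.7, 4.1, 2.8, 3.0, 2.4, 2.7, 3.3, 2.7, 3.0, 3.2, 1.9, 4.8, 2.4,
          2.5, 2.7, 1.6, 2.4, 3.1, 1.6, 3.1, 3.0, 2.1, 2.5, 3.3, 3.0]
hot    = [0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

df = pd.DataFrame({"RATING": rating, "EASE": ease, "HOT": hot})
results = smf.ols("RATING ~ EASE + HOT", data=df).fit()
print(results.params)    # intercept ~3.23, EASE ~0.006, HOT ~0.59
print(results.pvalues)   # EASE is far from significant (p ~ 0.98); HOT p ~ 0.18

Neither coefficient is statistically significant at conventional levels, which is why the reliability of the "hot beats easy" conclusion is questioned on this slide.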
STATISTICS
BRIEF REVIEW
PROBABILITY
DISTRIBUTION
DISCRETE RANDOM
VARIABLES
PROBABILITY DISTRIBUTION EXAMPLE: x IS THE SUM OF TWO DICE

This sequence provides an example of a discrete random variable. Suppose that you have
a red die which, when thrown, takes the numbers from 1 to 6 with equal probability.

Suppose that you also have a green die that can take the numbers from 1 to 6 with equal
probability. We will define a random variable x as the sum of the numbers when the dice
are thrown.

For example, if the red die is 4 and the green one is 6, x is equal to 10. Similarly, if
the red die is 2 and the green one is 5, x is equal to 7.
PROBABILITY DISTRIBUTION EXAMPLE: x IS THE SUM OF TWO DICE

           red:   1   2   3   4   5   6
green: 1          2   3   4   5   6   7
       2          3   4   5   6   7   8
       3          4   5   6   7   8   9
       4          5   6   7   8   9  10
       5          6   7   8   9  10  11
       6          7   8   9  10  11  12

The table shows all the possible outcomes.
PROBABILITY DISTRIBUTION EXAMPLE: x IS THE SUM OF TWO DICE

If you look at the table, you can see that x can be any of the numbers from 2 to 12.

We will now define f, the frequencies associated with the possible values of x.
For example, there are four outcomes which make x equal to 5. Similarly you can work out
the frequencies for all the other values of x.

Finally we will derive the probability of obtaining each value of x. If there is 1/6
probability of obtaining each number on the red die, and the same on the green die, each
outcome in the table will occur with 1/36 probability. Hence to obtain the probabilities
associated with the different values of x, we divide the frequencies by 36.

  x    f     p
  2    1    1/36
  3    2    2/36
  4    3    3/36
  5    4    4/36
  6    5    5/36
  7    6    6/36
  8    5    5/36
  9    4    4/36
 10    3    3/36
 11    2    2/36
 12    1    1/36
PROBABILITY DISTRIBUTION EXAMPLE: x IS THE SUM OF TWO DICE

[Bar chart of the probability distribution: probabilities 1/36, 2/36, 3/36, 4/36, 5/36,
6/36, 5/36, 4/36, 3/36, 2/36, 1/36 for x = 2, 3, ..., 12]

The distribution is shown graphically. In this example it is symmetrical, highest for x
equal to 7 and declining on either side.
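An added check in Python: enumerate the 36 equally likely outcomes and tabulate the frequency and probability of each value of x.

from collections import Counter
from fractions import Fraction

# Count how many of the 36 outcomes give each sum.
freq = Counter(red + green for red in range(1, 7) for green in range(1, 7))

for x in sorted(freq):
    p = Fraction(freq[x], 36)
    print(x, freq[x], p)   # e.g. 2 1 1/36, ..., 7 6 1/6, ..., 12 1 1/36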
PROBABILITY
DISTRIBUTION

➢ THEORETICAL PROBABILITY
DISTRIBUTION

➢ EXPERIMENTAL PROBABILITY
DISTRIBUTION

55
EXPECTED VALUE OF A
RANDOM VARIABLE
EXPECTED VALUE OF A RANDOM VARIABLE

Definition of E(x), the expected value of x:

$E(x) = x_1 p_1 + \dots + x_n p_n = \sum_{i=1}^{n} x_i p_i$

The expected value of a random variable, also known as its population mean, is the
weighted average of its possible values, the weights being the probabilities attached to
the values.

Note that the sum of the probabilities must be unity, so there is no need to divide by
the sum of the weights.
EXPECTED VALUE OF A RANDOM VARIABLE

This sequence shows how the expected value is calculated, first in abstract and then with
the random variable defined in the first sequence. We begin by listing the possible values
of x. Next we list the probabilities attached to the different possible values of x. Then
we define a column in which the values are weighted by the corresponding probabilities.
We do this for each value separately. Here we are assuming that n, the number of possible
values, is equal to 11, but it could be any number. The expected value is the sum of the
entries in the third column.

  xi     pi      xi pi
  x1     p1      x1 p1
  x2     p2      x2 p2
  ...    ...     ...
  x11    p11     x11 p11
                 Σ xi pi = E(x)
EXPECTED VALUE OF A RANDOM VARIABLE

The random variable x defined in the previous sequence could be any of the integers from 2
to 12 with probabilities as shown. x could be equal to 2 with probability 1/36, so the
first entry in the calculation of the expected value is 2/36. The probability of x being
equal to 3 was 2/36, so the second entry is 6/36. Similarly for the other 9 possible
values. To obtain the expected value, we sum the entries in this column.

  xi     pi      xi pi
   2    1/36     2/36
   3    2/36     6/36
   4    3/36    12/36
   5    4/36    20/36
   6    5/36    30/36
   7    6/36    42/36
   8    5/36    40/36
   9    4/36    36/36
  10    3/36    30/36
  11    2/36    22/36
  12    1/36    12/36
                Σ xi pi = 252/36 = 7

The expected value turns out to be 7. Actually, this was obvious anyway. We saw in the
previous sequence that the distribution is symmetrical about 7.
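The same calculation can be done in a couple of lines of Python (an added check, not part of the slides):

from fractions import Fraction

# Probabilities of the dice-sum variable: 1/36, 2/36, ..., 6/36, ..., 1/36.
probs = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}
expected_value = sum(x * p for x, p in probs.items())
print(expected_value)   # 7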
EXPECTED VALUE OF A RANDOM VARIABLE

Alternative notation for E(x):

E(x) = μx

Very often the expected value of a random variable is represented by μ, the Greek letter
mu. If there is more than one random variable, their expected values are differentiated
by adding subscripts to μ.
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

Definition of E[g(x)], the expected value of a function of x:

$E[g(x)] = g(x_1) p_1 + \dots + g(x_n) p_n = \sum_{i=1}^{n} g(x_i) p_i$

To find the expected value of a function of a random variable, you calculate all the
possible values of the function, weight them by the corresponding probabilities, and sum
the results.
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

Example:

$E(x^2) = x_1^2 p_1 + \dots + x_n^2 p_n = \sum_{i=1}^{n} x_i^2 p_i$

For example, the expected value of x² is found by calculating all its possible values,
multiplying them by the corresponding probabilities, and summing.
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

The calculation of the expected value of a function of a random variable will be outlined
in general and then illustrated with an example. First you list the possible values of x
and the corresponding probabilities. Next you calculate the function of x for each
possible value of x. Then, one at a time, you weight the value of the function by its
corresponding probability. You do this individually for each possible value of x. The sum
of the weighted values is the expected value of the function of x.

  xi     pi      g(xi)     g(xi) pi
  x1     p1      g(x1)     g(x1) p1
  x2     p2      g(x2)     g(x2) p2
  ...    ...     ...       ...
  xn     pn      g(xn)     g(xn) pn
                           Σ g(xi) pi
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

The process will be illustrated for x², where x is the random variable defined in the
first sequence. The 11 possible values of x and the corresponding probabilities are
listed. First you calculate the possible values of x². The first value is 4, which arises
when x is equal to 2. The probability of x being equal to 2 is 1/36, so the weighted
function is 4/36, which we shall write in decimal form as 0.11. Similarly for all the
other possible values of x.

  xi     pi      xi²     xi² pi
   2    1/36       4      0.11
   3    2/36       9      0.50
   4    3/36      16      1.33
   5    4/36      25      2.78
   6    5/36      36      5.00
   7    6/36      49      8.17
   8    5/36      64      8.89
   9    4/36      81      9.00
  10    3/36     100      8.33
  11    2/36     121      6.72
  12    1/36     144      4.00
                 Σ xi² pi = 54.83

The expected value of x² is the sum of its weighted values in the final column. It is
equal to 54.83. It is the average value of the figures in the previous column, taking the
differing probabilities into account.

Note that E(x²) is not the same thing as E(x), squared. In the previous sequence we saw
that E(x) for this example was 7; its square is 49.
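Extending the added Python check to E(x²):

from fractions import Fraction

probs = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}
e_x  = sum(x * p for x, p in probs.items())
e_x2 = sum(x**2 * p for x, p in probs.items())
print(float(e_x2))        # 54.83... (= 1974/36)
print(float(e_x) ** 2)    # 49.0, so E(x**2) differs from [E(x)]**2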
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

Population variance of x: $E[(x - \mu)^2]$

$E[(x - \mu)^2] = (x_1 - \mu)^2 p_1 + \dots + (x_n - \mu)^2 p_n = \sum_{i=1}^{n} (x_i - \mu)^2 p_i$

The third sequence defined the expected value of a function of a random variable x. There
is only one function that is of much interest to us, at least initially: the squared
deviation from the population mean.

The expected value of the squared deviation is known as the population variance of x. It
is a measure of the dispersion of the distribution of x about its population mean.
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

We will calculate the population variance of the random variable x defined in the first
sequence. We start as usual by listing the possible values of x and the corresponding
probabilities. Next we need a column giving the deviations of the possible values of x
about its population mean. In the second sequence we saw that the population mean of x
was 7 (μx = E(x) = 7). When x is equal to 2, the deviation is -5; similarly for all the
other possible values.

Next we need a column giving the squared deviations. When x is equal to 2, the squared
deviation is 25; similarly for the other values of x.

Now we start weighting the squared deviations by the corresponding probabilities. What do
you think the weighted average will be? Have a guess. A reason for making an initial
guess is that it may help you to identify an arithmetical error, if you make one. If the
initial guess and the outcome are very different, that is a warning.

  xi     pi     xi-μ    (xi-μ)²    (xi-μ)² pi
   2    1/36     -5       25         0.69
   3    2/36     -4       16         0.89
   4    3/36     -3        9         0.75
   5    4/36     -2        4         0.44
   6    5/36     -1        1         0.14
   7    6/36      0        0         0.00
   8    5/36      1        1         0.14
   9    4/36      2        4         0.44
  10    3/36      3        9         0.75
  11    2/36      4       16         0.89
  12    1/36      5       25         0.69
                                     5.83

We calculate all the weighted squared deviations. The sum is the population variance of x.
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

Population variance of x

There are several ways of writing the population variance. First the formal mathematical
definition: $E[(x - \mu)^2]$.

In text, it is convenient to refer to it as pop.var(x).

In equations, the population variance of x is usually written $\sigma_x^2$, σ being the
Greek letter sigma.
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

Standard deviation of x: $\sigma_x = \sqrt{E[(x - \mu)^2]}$

The standard deviation of x is the square root of its population variance. Usually
written σx, it is an alternative measure of dispersion. It has the same units as x.
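An added numerical check of the population variance and standard deviation of the dice-sum variable:

from fractions import Fraction
import math

probs = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}
mu  = sum(x * p for x, p in probs.items())                  # 7
var = sum((x - mu) ** 2 * p for x, p in probs.items())      # 210/36
print(float(var))        # 5.833..., matching the 5.83 computed above
print(math.sqrt(var))    # 2.415..., the standard deviation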
EXPECTED VALUE
RULES
EXPECTED VALUE RULES

1. E(x+y) = E(x) + E(y)

This sequence states the rules for manipulating expected values. First, the additive rule.
The expected value of the sum of two random variables is the sum of their expected values.
94
EXPECTED VALUE RULES

1. E(x+y) = E(x) + E(y)


Example generalization:
E(w+x+y+z) = E(w) + E(x) + E(y) + E(z)

This generalizes to any number of variables. An example is shown.

95
EXPECTED VALUE RULES

1. E(x+y) = E(x) + E(y)


2. E(ax) = aE(x)

The second rule is the multiplicative rule. The expected value of (a variable multiplied by a
constant) is equal to the constant multiplied by the expected value of the variable.
96
EXPECTED VALUE RULES

1. E(x+y) = E(x) + E(y)


2. E(ax) = aE(x)
Example:
E(3x) = 3E(x)

For example, the expected value of 3x is three times the expected value of x.

97
EXPECTED VALUE RULES

1. E(x+y) = E(x) + E(y)


2. E(ax) = aE(x)
3. E(a) = a

Finally, the expected value of a constant is just the constant. Of course this is obvious.

98
EXPECTED VALUE RULES

1. E(x+y) = E(x) + E(y)


2. E(ax) = aE(x)
3. E(a) = a

y = a + bx
E(y) = E(a + bx)

As an exercise, we will use the rules to simplify the expected value of an expression.
Suppose that we are interested in the expected value of a variable y, where y = a + bx.
99
EXPECTED VALUE RULES

1. E(x+y) = E(x) + E(y)


2. E(ax) = aE(x)
3. E(a) = a

y = a + bx
E(y) = E(a + bx)
= E(a) + E(bx)

We use the first rule to break up the expected value into its two components.

100
EXPECTED VALUE RULES

1. E(x+y) = E(x) + E(y)


2. E(ax) = aE(x)
3. E(a) = a

y = a + bx
E(y) = E(a + bx)
= E(a) + E(bx)
= a + bE(x)

Then we use the second rule to replace E(bx) by bE(x) and the third rule to simplify E(a) to
just a. This is as far as we can go in this example.
101
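As an added numerical illustration of these rules with the dice-sum variable, for which E(x) = 7: if y = a + bx with a = 2 and b = 3, then E(y) = a + bE(x) = 23.

from fractions import Fraction

probs = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}
a, b = 2, 3
e_y = sum((a + b * x) * p for x, p in probs.items())
print(e_y)   # 23, equal to a + b*E(x)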
INDEPENDENCE
OF TWO RANDOM
VARIABLES
INDEPENDENCE OF TWO RANDOM VARIABLES

Two random variables x and y are said to be


independent if

E[f(x)g(y)] = E[f(x)] E[g(y)]

for any functions f(x) and g(y)

This very short sequence presents an important definition, that of


the independence of two random variables.
Two variables x and y are independent if, given any functions f(x)
and g(y), the expected value of the product f(x)g(y) is equal to the
expected value of f(x) multiplied by the expected value of g(y).
103
INDEPENDENCE OF TWO RANDOM VARIABLES
Two random variables x and y are said to be
independent if
E[f(x)g(y)] = E[f(x)] E[g(y)]
for any functions f(x) and g(y)

Special case: if x and y are independent,


E(xy) = E(x) E(y)

As a special case, if x and y are independent, the expected value of xy is equal to the
expected value of x multiplied by the expected value of y.
ALTERNATIVE EXPRESSION
FOR POPULATION
VARIANCE
ALTERNATIVE EXPRESSION FOR POPULATION VARIANCE

pop.var(x) = E(x²) - μ²

This sequence derives an alternative expression for the population variance of a random
variable. It provides an opportunity for practising the use of the expected value rules.

We start with the definition of the population variance of x and expand the quadratic:

pop.var(x) = E[(x - μ)²]
           = E(x² - 2μx + μ²)
           = E(x²) + E(-2μx) + E(μ²)
           = E(x²) - 2μE(x) + μ²
           = E(x²) - 2μ² + μ²
           = E(x²) - μ²

The first expected value rule is used to decompose the expression into three separate
expected values. The second expected value rule is used to simplify the middle term and
the third rule is used to simplify the last one. The middle term is then rewritten, using
the fact that E(x) and μx are just different ways of writing the population mean of x.
Hence we get the result.
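For the dice example above, this alternative expression gives the same answer as the direct calculation:

$\operatorname{pop.var}(x) = E(x^2) - \mu^2 = 54.83 - 49 = 5.83$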
DISCRETE RANDOM VARIABLES

[Bar chart of the dice-sum distribution: probabilities 1/36 up to 6/36 and back down to
1/36 for x = 2, 3, ..., 12]

A discrete random variable is one that can take only a finite set of values. The sum of
the numbers when two dice are thrown is an example.

Each value has associated with it a finite probability, which you can think of as a
“packet” of probability. The packets sum to unity because the variable must take one of
the values.
CONTINUOUS RANDOM VARIABLES

However, most random variables encountered in econometrics are continuous. They can take
any one of an infinite set of values defined over a range (or possibly, ranges).

As a simple example, take the temperature in a room. We will assume that it can be
anywhere from 55 to 75 degrees Fahrenheit with equal probability within the range.

In the case of a continuous random variable, the probability of it being equal to a given
finite value (for example, temperature equal to 55.473927) is always infinitesimal:
P(x = a) = 0.

For this reason, you can only talk about the probability of a continuous random variable
lying between two given values. The probability is represented graphically as an area.
For example, you could measure the probability of the temperature being between 55 and
56, both measured exactly.
CONTINUOUS RANDOM VARIABLES

Given that the temperature lies anywhere between 55 and 75 with equal probability, the
probability of it lying between 55 and 56 must be 0.05. Similarly, the probability of the
temperature lying between 56 and 57 is 0.05.

The probability per unit interval is 0.05 and accordingly the area of the rectangle
representing the probability of the temperature lying in any given unit interval is 0.05.

The probability per unit interval is called the probability density and it is equal to
the height of the unit-interval rectangle.
CONTINUOUS RANDOM VARIABLES

f(x) = 0.05 for 55 ≤ x ≤ 75
f(x) = 0    for x < 55 and x > 75

Mathematically, the probability density is written as a function of the variable, for
example f(x). In this example, f(x) is 0.05 for 55 ≤ x ≤ 75 and it is zero elsewhere.

The vertical axis is given the label probability density, rather than height. f(x) is
known as the probability density function and is shown graphically in the diagram as the
thick black line.

Suppose that you wish to calculate the probability of the temperature lying between 65
and 70 degrees.
CONTINUOUS RANDOM VARIABLES

Typically you have to use the integral calculus to work out the area under a curve, but
in this very simple example all you have to do is calculate the area of a rectangle. The
height of the rectangle is 0.05 and its width is 5, so its area is 0.25.
CONTINUOUS RANDOM VARIABLES

Now suppose that the temperature can lie in the range 65 to 75 degrees, with uniformly
decreasing probability as the temperature gets higher. The total area of the triangle is
unity because the probability of the temperature lying in the 65 to 75 range is unity.
Since the base of the triangle is 10, its height must be 0.20.

f(x) = 1.50 - 0.02x for 65 ≤ x ≤ 75
f(x) = 0            for x < 65 and x > 75

In this example, the probability density function is a line of the form f(x) = a + bx.
To pass through the points (65, 0.20) and (75, 0), a must equal 1.50 and b must equal
-0.02. Suppose that we are interested in finding the probability of the temperature lying
between 65 and 70 degrees.

We could do this by evaluating the integral of the function over this range, but there is
no need. It is easy to show geometrically that the answer is 0.75. This completes the
introduction to continuous random variables.
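An added check of both probabilities by numerical integration (the densities are the ones defined on the slides above):

from scipy.integrate import quad

uniform_pdf    = lambda x: 0.05 if 55 <= x <= 75 else 0.0
triangular_pdf = lambda x: 1.50 - 0.02 * x if 65 <= x <= 75 else 0.0

p_uniform, _    = quad(uniform_pdf, 65, 70)
p_triangular, _ = quad(triangular_pdf, 65, 70)
print(p_uniform)      # 0.25
print(p_triangular)   # 0.75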
EXPECTED VALUE, VARIANCE &
COVARIANCE RULES
A SUMMARY

➢ EXPECTED VALUE ➢ COVARIANCE


▪ E[a]=a a is a constant ▪ Cov[a,X]=0 a is a constant
▪ E[aX] = aE[X] ▪ Cov[aX, Y] = a Cov[X,Y]
▪ E[a+X] = a+E[X] ▪ Cov[X,(Y+Z)] = Cov[X,Y]+ Cov[X,Z]
▪ E[X+Y] = E[X] + E[Y] ▪ Cov(X,X)=Var(X)

➢ VARIANCE
▪ Var[a]=0 a is a constant
▪ Var[aX] = a2 Var[X]
▪ Var[a+X] = Var[X]
▪ Var[X+Y] = Var[X] + Var[Y] + 2 Cov[X,Y]
128
THE FIXED AND RANDOM
COMPONENTS OF
A RANDOM VARIABLE
THE FIXED AND RANDOM COMPONENTS OF A RANDOM VARIABLE

Population mean of x: E(x) = μx

In this short sequence we shall decompose a random variable x into its fixed and random
components. Let the population mean of x be μx.

The actual value of x in any observation will in general be different from μx. We will
call the difference ui, so ui = xi - μx.

Re-arranging this equation, we can write x as the sum of its fixed component, μx, which
is the same for all observations, and its random component, u:

xi = μx + ui

Note that the expected value of ui is zero:

E(ui) = E(xi - μx) = E(xi) + E(-μx) = μx - μx = 0

The expected value of the random component is zero. It does not systematically tend to
increase or decrease x. It just makes it deviate from its population mean.
ESTIMATORS
ESTIMATORS

Estimators and estimates:

An estimator is a mathematical formula.

An estimate is a number obtained by applying


this formula to a set of sample data.

It is important to distinguish between estimators and estimates.


Definitions are given above.
135
ESTIMATORS

Population characteristic          Estimator

Mean: μx                           $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Population variance: σx²           $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

A common example of an estimator is the sample mean, which is the usual estimator of the
population mean. Here it is defined for a random variable x and a sample of n
observations.

Another common estimator is s², defined above. It is used to estimate the population
variance, σx².
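A minimal added sketch of the two estimators applied to a small made-up sample:

import numpy as np

x = np.array([4.0, 6.0, 5.0, 7.0, 3.0])   # illustrative sample only

x_bar = x.mean()         # estimate of the population mean
s2 = x.var(ddof=1)       # estimate of the population variance
                         # (divides by n - 1, as in the formula above)
print(x_bar, s2)         # 5.0 2.5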
ESTIMATORS

Estimators are random variables

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n}(x_1 + \dots + x_n)$

An estimator is a special kind of random variable. We will demonstrate this in the case
of the sample mean.

We saw in the previous sequence that each observation on x can be decomposed into a fixed
component and a random component:

xi = μx + ui
ESTIMATORS

Estimators are random variables

$\bar{x} = \frac{1}{n}(\mu_x + \dots + \mu_x) + \frac{1}{n}(u_1 + \dots + u_n) = \frac{1}{n}(n \mu_x) + \bar{u} = \mu_x + \bar{u}$

So the sample mean is the average of n fixed components and n random components. It thus
has a fixed component μx and a random component ū, the average of the random components
in the observations in the sample.
ESTIMATORS

[Figure: probability density functions of x and of x̄, both centred on μx; the
distribution of x̄ is more concentrated]

The graph compares the probability density functions of x and x̄. As we have seen, they
have the same fixed component. However the distribution of the sample mean is more
concentrated. Its random component tends to be smaller than that of x because it is the
average of the random components in all the observations, and these tend to cancel each
other out.
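A small added simulation illustrating the point: the sample mean has the same fixed component as x but a much less dispersed random component (the normal distribution and the numbers used here are only illustrative):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 7.0, 2.42, 25          # roughly the dice-sum mean and sd

draws = rng.normal(mu, sigma, size=(100_000, n))
sample_means = draws.mean(axis=1)

print(draws.std())          # ~2.42, the dispersion of x itself
print(sample_means.std())   # ~0.48, about sigma / sqrt(n): much more concentrated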
TYPE OF ESTIMATORS

➢ The Least Squares

➢ The Method of Moments

➢ Maximum Likelihood

143
ESTIMATION OF
SAMPLE MEAN

144
THE LEAST SQUARES

[Diagram: observations x1, x2, ..., x7 scattered around the mean, with deviations e1,
e2, ..., e7]

The model is $x_i = \mu + u_i$ and the residual from a candidate value $\bar{x}$ is
$e_i = x_i - \bar{x}$.

Choose $\bar{x}$ to minimize $\sum e_i^2 = \sum (x_i - \bar{x})^2$:

$\frac{\partial \sum e_i^2}{\partial \bar{x}} = 2 \sum (x_i - \bar{x})(-1) = -2 \sum (x_i - \bar{x}) = 0$

$\sum (x_i - \bar{x}) = 0 \;\Rightarrow\; \sum x_i - n\bar{x} = 0 \;\Rightarrow\; \bar{x} = \frac{\sum x_i}{n}$
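The same minimization can be reproduced symbolically with sympy (an added check), confirming that the least squares estimator of μ is the sample mean:

import sympy as sp

n = 5
x = sp.symbols("x1:%d" % (n + 1))          # x1, ..., x5
m = sp.symbols("m")                        # candidate value of the mean

ssr = sum((xi - m) ** 2 for xi in x)       # sum of squared deviations
solution = sp.solve(sp.diff(ssr, m), m)[0]
print(sp.simplify(solution - sum(x) / n))  # 0: the minimizer is the sample mean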
THE METHOD OF MOMENTS
➢ Suppose there are k unknown parameters
➢ Select k population moments in terms of unknown parameters
➢ If there are k moments and k unknown parameters, the unknown
parameters can be solved
– An advantage of this method is that it is based on moments that
are often easy to compute.
– However, it should be noted that if the number of moments is
greater than the number of unknown parameters, the obtained
estimates depend on the chosen moments.
– The rth moment about the mean is $m_r = E[(x - \mu)^r]$
– The first (raw) moment is the mean
– The second central moment is the variance
– The third relates to the skewness
– The fourth relates to the kurtosis
ESTIMATION OF SAMPLE MEAN
THE METHOD OF MOMENTS
➢ The summary statistics of the sample are mean = 2.79,
standard deviation = 0.460, skewness = 0.168 and
kurtosis = 2.511, n = 609
– The first moment is the mean, therefore m = 2.79
– The second moment is m2 = (n-1)s²/n; therefore
s² = 609 × 0.460² / (609-1) = 0.2119 and s = 0.460
– If the higher moments are used, the estimates may be different. The
population fourth moment of the normal distribution is equal to 3σ⁴.
The sample kurtosis K is m4/s⁴, so that
m4 = K·s⁴ = 2.511 × 0.460⁴ = 0.112
3σ⁴ = m4 = 0.112, therefore the estimate of σ² = (m4/3)^(1/2) = 0.194,
which is slightly different from the above estimate of 0.212
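A rough added check of the two method-of-moments estimates from the reported summary statistics (small differences from the slide come from rounding):

n, sd, kurt = 609, 0.460, 2.511

m2 = (n - 1) * sd**2 / n        # second central moment implied by s
m4 = kurt * sd**4               # fourth central moment implied by the kurtosis

var_from_m2 = n * m2 / (n - 1)          # ~0.212, i.e. s**2 itself
var_from_m4 = (m4 / 3) ** 0.5           # ~0.194, using E[(x-mu)**4] = 3*sigma**4
print(var_from_m2, var_from_m4)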
MAXIMUM LIKELIHOOD
ESTIMATION
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

[Two charts: p against μ (top) and L against μ (bottom)]

This sequence introduces the principle of maximum likelihood estimation and illustrates
it with some simple examples.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$

Suppose that you have a normally-distributed random variable x with unknown population
mean μ and standard deviation σ, and that you have a sample of two observations, 4 and 6.
For the time being, we will assume that σ is equal to 1.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

Suppose initially you consider the hypothesis μ = 3.5. Under this hypothesis the
probability density at 4 would be 0.3521 and that at 6 would be 0.0175. The joint
probability density, shown in the bottom chart, is the product of these, 0.0062.

Next consider the hypothesis μ = 4.0. Under this hypothesis the probability densities
associated with the two observations are 0.3989 and 0.0540, and the joint probability
density is 0.0215.

Under the hypothesis μ = 4.5, the probability densities are 0.3521 and 0.1295, and the
joint probability density is 0.0456. Under the hypothesis μ = 5.0, the probability
densities are both 0.2420 and the joint probability density is 0.0585. Under the
hypothesis μ = 5.5, the probability densities are 0.1295 and 0.3521 and the joint
probability density is 0.0456.

   μ      p(4)     p(6)      L
  3.5    0.3521   0.0175   0.0062
  4.0    0.3989   0.0540   0.0215
  4.5    0.3521   0.1295   0.0456
  5.0    0.2420   0.2420   0.0585
  5.5    0.1295   0.3521   0.0456

The complete joint density function for all values of μ has now been plotted in the lower
diagram. We see that it peaks at μ = 5.
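The table can be reproduced with a few lines of Python (an added illustration):

from scipy.stats import norm

sample = [4, 6]
for mu in [3.5, 4.0, 4.5, 5.0, 5.5]:
    densities = norm.pdf(sample, loc=mu, scale=1)
    print(mu, densities.round(4), densities.prod().round(4))
# The joint density (likelihood) is largest at mu = 5.0.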
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

Now we will look at the mathematics of the example. If x is normally distributed with
mean μ and standard deviation σ, its density function is

$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$

For the time being, we are assuming σ is equal to 1, so the density function simplifies to

$f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} (x - \mu)^2}$
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

Hence we obtain the probability densities for the observations where x = 4 and x = 6:

$f(4) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} (4 - \mu)^2} \qquad f(6) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} (6 - \mu)^2}$

The joint probability density for the two observations in the sample is just the product
of their individual densities:

$\text{joint density} = \left[ \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} (4 - \mu)^2} \right] \left[ \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} (6 - \mu)^2} \right]$
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

1 x − m  2
1 −  
2 s 
f ( x) = e
s 2
1 2
1 − ( x−m )
f ( x) = e 2
2

1 2 1 2
1 − ( 4− m ) 1 − ( 6− m )
f ( 4) = e 2
f ( 6) = e 2
2 2

 1 − 1 ( 4 − m ) 2  1 − 1 ( 6 − m ) 2 
joint density =  e 2  e 2 
 2  2 
  
In maximum likelihood estimation we choose as our estimate of m the value
that gives us the greatest joint density for the observations in our sample.
This value is associated with the greatest probability, or maximum likelihood,
of obtaining the observations in the sample. 162
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

[Figure: as before, the upper panel shows the densities at x = 4 and x = 6 and the lower panel shows the joint density L as a function of m, peaking at m = 5.]

In the graphical treatment we saw that this occurs when m is equal to 5. We will prove this must be the case mathematically.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

L(m | 4, 6) = [(1/√(2π)) exp(−(1/2)(4 − m)²)] × [(1/√(2π)) exp(−(1/2)(6 − m)²)]

To do this, we treat the sample values x = 4 and x = 6 as given and we use the calculus to determine the value of m that maximizes the expression.

When it is regarded in this way, the expression is called the likelihood function for m, given the sample observations 4 and 6. This is the meaning of L(m | 4, 6).

To maximize the expression, we could differentiate with respect to m and set the result equal to 0. This would be a little laborious. Fortunately, we can simplify the problem with a trick.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

log L is a monotonically increasing function of L (meaning that log L increases if L increases and decreases if L decreases). It follows that the value of m which maximizes log L is the same as the one that maximizes L. As it so happens, it is easier to maximize log L with respect to m than it is to maximize L.

log L = log{ [(1/√(2π)) exp(−(1/2)(4 − m)²)] × [(1/√(2π)) exp(−(1/2)(6 − m)²)] }

      = log[(1/√(2π)) exp(−(1/2)(4 − m)²)] + log[(1/√(2π)) exp(−(1/2)(6 − m)²)]

The logarithm of the product of the density functions can be decomposed as the sum of their logarithms.

      = log(1/√(2π)) + log exp(−(1/2)(4 − m)²) + log(1/√(2π)) + log exp(−(1/2)(6 − m)²)

Using the product rule a second time, we can decompose each term as shown.

Now one of the basic rules for manipulating logarithms, log aᵇ = b log a, allows us to rewrite the exponential terms:

      log exp(−(1/2)(4 − m)²) = −(1/2)(4 − m)² log e = −(1/2)(4 − m)²

since log e is equal to 1, another basic logarithm result. (Remember, as always, we are using natural logarithms, that is, logarithms to base e.)

Hence the second term reduces to a simple quadratic in m, and so does the fourth. Collecting terms,

log L = 2 log(1/√(2π)) − (1/2)(4 − m)² − (1/2)(6 − m)²

We will now choose m so as to maximize this expression.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

log L = 2 log(1/√(2π)) − (1/2)(4 − m)² − (1/2)(6 − m)²

−(1/2)(a − m)² = −(1/2)(a² − 2am + m²) = −(1/2)a² + am − (1/2)m²

Quadratic terms of the type in the expression can be expanded as shown.

d/dm [ −(1/2)(a − m)² ] = a − m

Thus we obtain the differential of the quadratic term.

d log L / dm = (4 − m) + (6 − m)

Applying this result, we obtain the differential of log L with respect to m. (The first term in the expression for log L disappears completely since it is not a function of m.)

d log L / dm = 0  ⇒  m̂ = 5

Thus from the first order condition we confirm that 5 is the value of m that maximizes the log-likelihood function, and hence the likelihood function. Note that a caret mark has been placed over m, because we are now talking about an estimate of m, not its true value.

Note also that the second differential of log L with respect to m is −2. Since this is negative, we have found a maximum, not a minimum.
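If one prefers a numerical check to the calculus, the log-likelihood can simply be evaluated on a grid of candidate values of m. This is a small sketch, not part of the original slides, assuming Python with NumPy; the grid limits are chosen arbitrarily.

import numpy as np

xs = np.array([4.0, 6.0])                 # the two sample observations
grid = np.linspace(0.0, 10.0, 100001)     # candidate values of m (step 0.0001)

# log L(m) for independent N(m, 1) observations, including the constant term
log_L = -0.5 * ((xs[0] - grid) ** 2 + (xs[1] - grid) ** 2) - np.log(2 * np.pi)

print(grid[log_L.argmax()])               # 5.0, agreeing with the analytic result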
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

f(xi) = (1/√(2π)) exp(−(1/2)(xi − m)²)

We will generalize this result to a sample of n observations x1, ..., xn. The probability density for xi is given by the first line.

joint density = [(1/√(2π)) exp(−(1/2)(x1 − m)²)] × ... × [(1/√(2π)) exp(−(1/2)(xn − m)²)]

The joint density function for a sample of n observations is the product of their individual densities.

L(m | x1, ..., xn) = [(1/√(2π)) exp(−(1/2)(x1 − m)²)] × ... × [(1/√(2π)) exp(−(1/2)(xn − m)²)]

Now treating the sample values as fixed, we can re-interpret the joint density function as the likelihood function for m, given this sample. We will find the value of m that maximizes it.

log L = n log(1/√(2π)) − (1/2)(x1 − m)² − ... − (1/2)(xn − m)²

We will do this indirectly, as before, by maximizing log L with respect to m. The logarithm decomposes as shown.

d log L / dm = (x1 − m) + ... + (xn − m)

We differentiate log L with respect to m.

d log L / dm = 0  ⇒  Σ xi − nm̂ = 0

The first order condition for a maximum is that the differential be equal to zero.

m̂ = (1/n) Σ xi = x̄

Thus we have demonstrated that the maximum likelihood estimator of m is the sample mean. The second differential, −n, is negative, confirming that we have maximized log L.
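As a quick illustration of the general result, the grid maximizer of the log-likelihood coincides with the sample mean. Again this is only a sketch, assuming Python with NumPy and made-up simulation settings.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=7.0, scale=1.0, size=200)   # simulated sample with s = 1

# Evaluate the log-likelihood (up to a constant) on a grid of candidate means
grid = np.linspace(x.min(), x.max(), 20001)
log_L = np.array([-0.5 * np.sum((x - m) ** 2) for m in grid])

print(grid[log_L.argmax()])   # agrees, up to the grid spacing, with ...
print(x.mean())               # ... the sample mean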
MAXIMUM LIKELIHOOD ESTIMATION OF s

f(xi) = (1/(s√(2π))) exp(−(1/2)((xi − m)/s)²)

So far we have assumed that s, the standard deviation of the distribution of x, is equal to 1. We will now relax this assumption and find the maximum likelihood estimator of it.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

[Figure: the upper panel shows the normal density with mean m = 5 evaluated at x = 4 and x = 6 for each hypothesized value of s; the lower panel plots the joint density L as a function of s.]

   s      p(4)      p(6)       L
  2.0    0.1760    0.1760    0.0310
  1.0    0.2420    0.2420    0.0586
  0.5    0.1080    0.1080    0.0117

We will illustrate the process graphically with the two-observation example, keeping m fixed at 5. We will start with s equal to 2.

With s equal to 2, the probability density is 0.1760 for both x = 4 and x = 6, and the joint density is 0.0310.

Now try s equal to 1. The individual densities are 0.2420 and so the joint density, 0.0586, has increased.

Now try putting s equal to 0.5. The individual densities have fallen and the joint density is only 0.0117.

The joint density has now been plotted as a function of s in the lower diagram. You can see that in this example it is greatest for s equal to 1.
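The entries in this table can be reproduced with the same kind of sketch as before (Python with NumPy assumed, not part of the slides).

import numpy as np

def normal_pdf(x, m, s):
    # Density of a normal distribution with mean m and standard deviation s
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Keep m fixed at 5 and vary s, as in the graphical treatment
for s in [2.0, 1.0, 0.5]:
    L = normal_pdf(4.0, 5.0, s) * normal_pdf(6.0, 5.0, s)
    print(f"s = {s:3.1f}   L = {L:.4f}")   # 0.0310, 0.0586, 0.0117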
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

f(xi) = (1/(s√(2π))) exp(−(1/2)((xi − m)/s)²)

We will now look at this mathematically, starting with the probability density function for x given m and s.

joint density = [(1/(s√(2π))) exp(−(1/2)((x1 − m)/s)²)] × ... × [(1/(s√(2π))) exp(−(1/2)((xn − m)/s)²)]

The joint density function for the sample of n observations is given by the second line.

L(m, s | x1, ..., xn) = [(1/(s√(2π))) exp(−(1/2)((x1 − m)/s)²)] × ... × [(1/(s√(2π))) exp(−(1/2)((xn − m)/s)²)]

As before, we can re-interpret this function as the likelihood function, now for m and s, given the sample of observations.

We will find the values of m and s that maximize this function. We will do this indirectly by maximizing log L.

log L = log[(1/(s√(2π))) exp(−(1/2)((x1 − m)/s)²)] + ... + log[(1/(s√(2π))) exp(−(1/2)((xn − m)/s)²)]

      = n log(1/(s√(2π))) − (1/2)((x1 − m)/s)² − ... − (1/2)((xn − m)/s)²

      = n log(1/s) + n log(1/√(2π)) + (1/s²)[ −(1/2)(x1 − m)² − ... − (1/2)(xn − m)² ]

We can decompose the logarithm as shown. To maximize it, we will set the partial derivatives with respect to m and s equal to zero.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

log L = n log(1/s) + n log(1/√(2π)) + (1/s²)[ −(1/2)(x1 − m)² − ... − (1/2)(xn − m)² ]

∂ log L / ∂m = (1/s²)[(x1 − m) + ... + (xn − m)] = (1/s²)(Σ xi − nm)

When differentiating with respect to m, the first two terms disappear. We have already seen how to differentiate the other terms.

∂ log L / ∂m = 0  ⇒  m̂ = x̄

Setting the first differential equal to 0, the maximum likelihood estimate of m is the sample mean, as before.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

log L = n log(1/s) + n log(1/√(2π)) + (1/s²)[ −(1/2)(x1 − m)² − ... − (1/2)(xn − m)² ]

      = −n log s + n log(1/√(2π)) − (s⁻²/2) Σ (xi − m)²

Next, we take the partial differential of the log-likelihood function with respect to s. Before doing so, it is convenient to rewrite the equation, using the rules log aᵇ = b log a and log(1/s) = log s⁻¹ = −log s.

∂ log L / ∂s = −n/s + s⁻³ Σ (xi − m)²

The derivative of log s with respect to s is 1/s. The derivative of s⁻² is −2s⁻³.

∂ log L / ∂s = 0  ⇒  −n/ŝ + ŝ⁻³ Σ (xi − m̂)² = 0

Setting the first derivative of log L to zero gives us a condition that must be satisfied by the maximum likelihood estimator.
INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION

−nŝ² + Σ (xi − x̄)² = 0

We have already demonstrated that the maximum likelihood estimator of m is the sample mean, so m̂ has been replaced by x̄.

ŝ² = (1/n) Σ (xi − x̄)² = Var(x)

Hence the maximum likelihood estimator of the population variance is the sample variance.

Note that it is biased. The unbiased estimator is obtained by dividing by (n − 1), not n.

However, it can be shown that the maximum likelihood estimator is asymptotically efficient, in the sense of having a smaller mean square error than the unbiased estimator in large samples.
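The downward bias of the maximum likelihood variance estimator in small samples is easy to see by simulation. A minimal sketch, assuming Python with NumPy and an arbitrary true variance of 4:

import numpy as np

rng = np.random.default_rng(0)
n, true_var, reps = 10, 4.0, 100_000

mle, unbiased = [], []
for _ in range(reps):
    x = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=n)
    mle.append(np.var(x, ddof=0))        # divides by n     (maximum likelihood)
    unbiased.append(np.var(x, ddof=1))   # divides by n - 1 (unbiased)

print(np.mean(mle))        # close to 3.6 = 4*(n-1)/n, i.e. biased downwards
print(np.mean(unbiased))   # close to 4.0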
COMPARISON OF METHODS
➢ It depends on the application which method is the most attractive
one.
➢ If the model is expressed in terms of an equation, then least squares
is intuitively appealing, as it optimizes the fit of the model with
respect to the observations.
➢ Least squares and the method of moments are both based on the idea
of minimizing a distance function.
– For least squares, the distance is measured directly in terms of the
observed data
– For the method of moments, the distance is measured in terms of the
sample and population moments.
➢ The maximum likelihood method is not based on a distance
function, but on the likelihood function, which expresses the likelihood
or ‘credibility’ of parameter values with respect to the observed data.
➢ The maximum likelihood estimators have optimal properties in
large samples.
UNBIASEDNESS
AND
EFFICIENCY
UNBIASEDNESS & EFFICIENCY

[Figure: four scatter diagrams of repeated estimates around the true value, illustrating the four combinations: unbiased & efficient, unbiased & inefficient, biased & efficient, and biased & inefficient.]
UNBIASEDNESS AND EFFICIENCY

Unbiasedness of x̄:

E(x̄) = E[(1/n)(x1 + ... + xn)] = (1/n) E(x1 + ... + xn)
     = (1/n)[E(x1) + ... + E(xn)] = (1/n)·n·mx = mx

Suppose that you wish to estimate the population mean mx of a random variable x given a sample of observations. We will demonstrate that the sample mean is an unbiased estimator, but not the only one.

We use the second expected value rule to take the (1/n) factor out of the expectation expression. Next we use the first expected value rule to break up the expression into the sum of the expectations of the observations. Each expectation is equal to mx, and hence the expected value of the sample mean is mx.
UNBIASEDNESS AND EFFICIENCY

Generalized estimator:  Z = l1·x1 + l2·x2

However, the sample mean is not the only unbiased estimator of the population mean. We will demonstrate this supposing that we have a sample of two observations (to keep it simple).

We will define a generalized estimator Z which is the weighted average of the two observations, l1 and l2 being the weights.

E(Z) = E(l1·x1 + l2·x2) = E(l1·x1) + E(l2·x2)
     = l1·E(x1) + l2·E(x2) = (l1 + l2)·mx
     = mx   if l1 + l2 = 1

We will analyze the expected value of Z and find out what condition the weights have to satisfy for Z to be an unbiased estimator. We begin by decomposing the expectation using the first expected value rule. Next we use the second expected value rule to bring l1 and l2 out of the expected value expressions. The expected value of x in each observation is mx.

Thus Z is an unbiased estimator of mx if the sum of the weights is equal to one. An infinite number of combinations of l1 and l2 satisfy this condition, not just the sample mean.
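A small simulation makes the point concrete. This is only an illustrative sketch, assuming Python with NumPy and an arbitrary population mean of 10: every weighting scheme whose weights sum to one averages out to the population mean.

import numpy as np

rng = np.random.default_rng(0)
mu_x, reps = 10.0, 200_000
x1 = rng.normal(mu_x, 1.0, reps)   # repeated draws of the first observation
x2 = rng.normal(mu_x, 1.0, reps)   # repeated draws of the second observation

for l1 in [0.1, 0.5, 0.9]:
    l2 = 1.0 - l1                  # weights constrained to sum to one
    Z = l1 * x1 + l2 * x2
    print(l1, Z.mean())            # each average is close to mu_x = 10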
UNBIASEDNESS AND EFFICIENCY

[Figure: probability density functions of two unbiased estimators, A and B, both centred on mx; estimator B has the smaller variance.]

How do we choose among them? The answer is to use the most efficient estimator, the one with the smallest population variance, because it will tend to be the most accurate.

In the diagram, A and B are both unbiased estimators, but B is superior because it is more efficient.
UNBIASEDNESS AND EFFICIENCY

Generalized estimator:  Z = l1·x1 + l2·x2

pop.var(Z) = pop.var(l1·x1 + l2·x2)
           = pop.var(l1·x1) + pop.var(l2·x2)
           = l1²·pop.var(x1) + l2²·pop.var(x2)
           = (l1² + l2²)·sx²
           = (l1² + [1 − l1]²)·sx²      if l1 + l2 = 1
           = (2l1² − 2l1 + 1)·sx²

We will analyze the population variance of the generalized estimator and find out what condition the weights must satisfy in order to minimize it.

The first variance rule is used to decompose the population variance. Note that we are assuming that x1 and x2 are independent observations, so their population covariance is zero.

The second variance rule is used to bring l1 and l2 out of the population variance expressions. The population variance of x is sx².

Now we take account of the condition for unbiasedness and re-write the population variance of Z, substituting for l2.

The quadratic is expanded. To minimize the population variance of Z, we must choose l1 so as to minimize the final expression.
UNBIASEDNESS AND EFFICIENCY

Generalized estimator:  Z = l1·x1 + l2·x2

pop.var(Z) = (2l1² − 2l1 + 1)·sx²

d pop.var(Z) / dl1 = 0  ⇒  4l1 − 2 = 0  ⇒  l1 = l2 = 0.5

Differentiating the population variance with respect to l1 and setting the result equal to zero, we find that it is minimized when l1 = 0.5, and hence l2 = 0.5 as well. In other words, the most efficient estimator of this form is the one that weights the two observations equally, the sample mean.
UNBIASEDNESS AND EFFICIENCY

[Figure: the expression 2l1² − 2l1 + 1 plotted as a function of l1 between 0 and 1; the curve reaches its minimum at l1 = 0.5.]

Alternatively, we could find the minimum graphically. Here is a graph of the expression as a function of l1.

The expression is minimized for l1 = 0.5. It follows that l2 = 0.5 as well. So we have demonstrated that the sample mean is the most efficient unbiased estimator, at least in this example.
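The same conclusion can be checked numerically. The sketch below (Python with NumPy assumed, illustrative settings) estimates the variance of Z for several weights and compares it with the expression 2l1² − 2l1 + 1 derived above.

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(10.0, 1.0, 500_000)
x2 = rng.normal(10.0, 1.0, 500_000)

for l1 in [0.1, 0.3, 0.5, 0.7, 0.9]:
    Z = l1 * x1 + (1 - l1) * x2
    theory = 2 * l1**2 - 2 * l1 + 1          # times sx^2, which is 1 here
    print(f"l1 = {l1:.1f}   var(Z) = {Z.var():.3f}   theory = {theory:.3f}")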
CONFLICTS BETWEEN
UNBIASEDNESS AND
MINIMUM VARIANCE
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

[Figure: probability density functions of two estimators of q; estimator A is unbiased, while estimator B is biased but has a smaller variance.]

Suppose that you have alternative estimators of a population characteristic q, one unbiased, the other biased but with a smaller population variance. How do you choose between them?
UNBIASEDNESS & EFFICIENCY

[Figure: scatter diagrams of repeated estimates, recalling the earlier comparison of unbiased and biased, efficient and inefficient estimators.]
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

[Figure: a loss function plotted against the estimation error, for negative and positive errors.]

One way is to define a loss function which reflects the cost to you of making errors, positive or negative, of different sizes.
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

MSE(Z) = E[(Z − q)²] = sZ² + (mZ − q)²

A widely used loss function is the mean square error of the estimator, defined as the expected value of the square of the deviation of the estimator about the true value of the population characteristic.
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

MSE(Z) = E[(Z − q)²] = sZ² + (mZ − q)²

[Figure: the density function of Z is centred on mZ; the distance between mZ and the true value q is the bias.]

The mean square error involves a trade-off between the population variance of the estimator and its bias. Suppose you have a biased estimator like estimator B above, with expected value mZ.

The mean square error can be shown to be equal to the sum of the population variance of the estimator and the square of the bias.
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

MSE(Z) = E[(Z − q)²]
       = E[(Z − mZ + mZ − q)²]
       = E[(Z − mZ)² + (mZ − q)² + 2(Z − mZ)(mZ − q)]
       = E[(Z − mZ)²] + E[(mZ − q)²] + E[2(Z − mZ)(mZ − q)]
       = sZ² + (mZ − q)² + 2(mZ − q)·E(Z − mZ)
       = sZ² + (mZ − q)² + 2(mZ − q)·(mZ − mZ)
       = sZ² + (mZ − q)²

To demonstrate this, we start by subtracting and adding mZ.

We expand the quadratic using the rule (a + b)² = a² + b² + 2ab, where a = Z − mZ and b = mZ − q.

We use the first expected value rule to break up the expectation into its three components.

The first term in the expression is by definition the population variance of Z. (mZ − q) is a constant, so the second term is a constant.

In the third term, (mZ − q) may be brought out of the expectation, again because it is a constant, using the second expected value rule. Now E(Z) is mZ, and E(−mZ) is −mZ.

Hence the third term is zero and the mean square error of Z is shown to be the sum of the population variance of Z and the bias squared.
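The decomposition is easy to verify by simulation for any estimator. Here is a minimal sketch (Python with NumPy assumed, made-up parameter values) using a deliberately biased estimator, namely the sample mean shrunk by a factor of 0.9.

import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 10.0, 5, 200_000

# A deliberately biased estimator: shrink each sample mean towards zero
Z = 0.9 * rng.normal(theta, 2.0, size=(reps, n)).mean(axis=1)

mse_direct = np.mean((Z - theta) ** 2)
var_plus_bias_sq = Z.var() + (Z.mean() - theta) ** 2
print(mse_direct, var_plus_bias_sq)   # the two numbers agree closely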
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

[Figure: the density functions of estimators A and B, as before.]

In the case of the estimators shown, estimator B is a little better than estimator A according to the MSE criterion.
EFFECT OF INCREASING
THE SAMPLE SIZE
EFFECT OF INCREASING THE SAMPLE SIZE ON THE DISTRIBUTION OF x̄

[Figure: the probability density function of the sample mean x̄ for samples of size n = 1, 4, 25, 100, 1000 and 5000 drawn from a population with mean 100 and standard deviation 50; as n increases, the distribution becomes more and more concentrated around 100.]

    n        standard deviation of x̄
    1        50
    4        25
   25        10
  100         5
 1000         1.6
 5000         0.7

The sample mean is the usual estimator of a population mean, for reasons discussed in the previous sequence. In this sequence we will see how its properties are affected by the sample size.

Suppose that a random variable x has population mean 100 and standard deviation 50, as in the diagram. Suppose that we do not know the population mean and we are using the sample mean to estimate it.

The sample mean will have the same population mean as x, but its standard deviation will be 50/√n, where n is the number of observations in the sample. The larger is the sample, the smaller will be the standard deviation of the sample mean.

If n is equal to 1, the sample consists of a single observation. x̄ is the same as x and its standard deviation is 50.

We will see how the shape of the distribution changes as the sample size is increased. The distribution becomes more concentrated about the population mean.

To see what happens for n greater than 100, we will have to change the vertical scale. We have reduced the vertical scale by a factor of 10. The distribution continues to contract about the population mean.

In the limit, the variance of the distribution tends to zero. The distribution collapses to a spike at the true value. The sample mean is therefore a consistent estimator of the population mean.
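The table of standard deviations can be checked by simulation. The following sketch (Python with NumPy assumed, 2,000 replications chosen arbitrarily) draws repeated samples of each size from the population described above.

import numpy as np

rng = np.random.default_rng(0)
for n in [1, 4, 25, 100, 1000, 5000]:
    # 2,000 samples of size n, each reduced to its sample mean
    xbar = rng.normal(100.0, 50.0, size=(2_000, n)).mean(axis=1)
    print(f"n = {n:5d}   sd of sample mean = {xbar.std():6.2f}   theory 50/sqrt(n) = {50/np.sqrt(n):6.2f}")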
EFFECT OF INCREASING THE SAMPLE SIZE ON THE DISTRIBUTION OF x̄

Finite samples:  x̄ is an unbiased estimator of m

Large samples:   the probability distribution of x̄ collapses to a spike at m

                 plim x̄ = m

The sequence has illustrated the difference between the concepts of unbiasedness and consistency.

Unbiasedness is a finite-sample concept. The expected value of the sample mean is equal to the population mean, but in general its actual value will be different.

Consistency is a large-sample concept. A consistent estimator becomes an increasingly accurate estimator of the population characteristic and in the limit becomes equal to it.

As the sample size becomes large, the distribution of the sample mean collapses to a spike located at the true value. The sample mean is therefore consistent as well as unbiased.
EXAMPLE OF AN ESTIMATOR BIASED IN FINITE SAMPLES BUT CONSISTENT

[Figure: probability density functions of an estimator Z of a population characteristic q, drawn for sample sizes n = 20, 100, 1000 and 100,000; as n increases, the distributions shift towards q and become more concentrated.]

It is possible for an estimator to be consistent, despite being biased in finite samples.

In the diagram, Z is an estimator of a population characteristic q. Looking at the probability distribution of Z, you can see that Z is biased upwards.

For the estimator to be consistent, two things must happen as the sample size increases. One is that the bias should diminish as n increases, as shown here. The other is that the distribution should collapse to a spike.

The vertical axis has been re-scaled to accommodate distributions with large sample sizes.

In the case of the estimator in the diagram, both of the conditions are approximately satisfied when the sample size is 100,000.
EXAMPLE OF AN ESTIMATOR BIASED IN FINITE SAMPLES BUT CONSISTENT

Useful rule for large samples:

If Z = X / Y, then plim Z = plim X / plim Y,

provided that plim X and plim Y exist (and plim Y is not zero).

Here is a rule which we shall use many times in future analysis. Suppose that a random variable Z is equal to the ratio of two other random variables, X and Y.

If X and Y tend to limiting values as the sample size becomes large, Z will tend to a limiting value which is the ratio of the limits of X and Y.
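A quick simulation illustrates the rule. In this sketch (Python with NumPy assumed, arbitrary distributions), X and Y are sample means with probability limits 8 and 4, so their ratio should settle down near 2.

import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 10_000, 1_000_000]:
    X = rng.normal(8.0, 3.0, n).mean()   # plim X = 8
    Y = rng.normal(4.0, 3.0, n).mean()   # plim Y = 4 (non-zero)
    print(f"n = {n:8d}   X/Y = {X / Y:.4f}")   # approaches 8/4 = 2 as n grows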