Quant Econ
Ryan T. Godwin
Visit https://rtgodwin.com/quantecon.pdf for the most up-to-date version of this book.
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1 About this Book 8
1.2 Quantitative Methods 8
1.3 Objectives 8
1.4 Format of this Book 9
1.5 Acknowledgements 9
4 Describing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1 How data is arranged 29
4.2 Types of observations 29
4.3 Types of variables 30
4.3.1 Qualitative / categorical variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3.2 Quantitative variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.4 Graphing categorical data 34
4.5 Graphing quantitative data 36
4.5.1 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.5.2 Describing distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.5.3 Skew . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5.4 Multi-peaked distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5.5 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.6 Time plots 41
4.6.1 Logarithms in time plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.7 Scatter plots 43
4.7.1 Explanatory and dependent variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.7.2 Points on a scatter plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.7.3 Scatter plots: types of relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.7.4 Categorical variables in scatter plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6 Density curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.1 Probability distributions (densities) 61
6.2 Continuous uniform distribution 61
6.3 Discrete uniform distribution 62
6.4 The Normal distribution 62
6.4.1 Areas under the Normal density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.4.2 68-95-99.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.4.3 Standard Normal distribution N(0,1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.4.4 Testing for Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.5 t-distribution 66
8 Statistical Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8.1 Parameter versus statistic 82
8.2 Population versus sample 83
8.2.1 Collecting a random sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
8.3 The sample mean is a random variable 86
8.4 Distribution of the sample mean 87
8.4.1 The central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
9.0.1 Simplifying assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
9.1 Sample mean and population mean 92
9.2 Exact sampling distribution of ȳ 92
9.3 Accuracy of ȳ increases with n 95
9.4 Sampling distribution of ȳ with unknown µ 96
9.5 Confidence intervals 98
9.5.1 Standard error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
9.5.2 Interpreting confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.5.3 The width of a confidence interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
9.5.4 Confidence level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Articles 146
1. Introduction
1.3 Objectives
Some objectives of this text are the following:
Example 1.1 This is an example of the examples you will see in this book. They
will appear in these boxes.
The upper box contains the input, and the lower box contains the output.
1.5 Acknowledgements
Janelle Mann for arranging funding for the book, and for help with the outline, content, and
edits. University of Manitoba for providing financial support. Cover images and chapter
heading images produced by NASA and the Space Telescope Science Institute (STScI).
Statistics performed using R and RStudio. Figures produced using R and Inkscape.
2. The R Programming Language
2.1 What is R?
Although R is a programming language, it is unlike most others. It is designed to
analyse data. It isn’t too difficult to learn, and is extremely popular. R has the
advantage that it is free and open-source, and that thousands of users have contributed
“add-on” packages that are readily downloadable by anyone.
R is found in all areas of academia that encounter data, and in many private and
public organizations. R is great for any job or task that uses data.
In the top left is your Script file. R commands can be run from the R Script file,
and saved at any time.
In the bottom left is the Console window. Output is displayed here. R commands
can be run from the Console, but not saved.
In the top right is the Environment. Data and variables will be visible here.
The bottom right will display graphics (e.g. histograms and scatterplots).
print("Hello, World!")
Run the code by highlighting it, or making sure the cursor is active at the end of the
line, and clicking “Run” (you can also press Ctrl + Enter on PC or Cmd + Return on
Mac).
Often we will display R output in boxes. The output from your program is reproduced
in the box below:
[1] "Hello, World!"
Operator Function
+ addition
- subtraction
* multiplication
/ division
^ exponentiation
1. 3 + 5
3 + 5
[1] 8
2. 12 − 4
12 - 4
[1] 8
3. 2 × 13
2 * 13
[1] 26
4. 16/4
16 / 4
[1] 4
5. 2⁸
2 ^ 8
[1] 256
6. (10 + 6)/2
(10 + 6) / 2
[1] 8
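The assignment commands themselves are not reproduced in this excerpt. A minimal sketch that is consistent with the output that follows (a * b giving 15, b == 6 being FALSE, and b > 2 being TRUE) is:
a <- 3  # assumed value; any pair of numbers with a * b equal to 15 would match the output below
b <- 5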
We have created two new objects called a and b, and have assigned them values
using the assignment operator <- (the “less than” symbol followed by the “minus”
symbol). Notice that a and b pop up in the top-right of your screen (the Environment
window). We can now refer to these objects by name:
a * b
[1] 15
produces the output 15. To create a vector in R we use the “combine” function, c():
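The creation command is not reproduced in this excerpt; based on the values of myvector displayed below (and its sum of 20), it would be:
myvector <- c(1, 2, 4, 6, 7)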
Notice that after creating it, the myvector object appears in the top-right Environment
window. myvector is just a list of numbers:
myvector = {1, 2, 4, 6, 7}
sum(myvector)
[1] 20
which provides the output 20. We have asked the computer to add up an object by
calling the function sum(), and putting the name of the object myvector inside of
the parentheses. Try all of the functions in Table 2.1 on myvector.
Logical operators are used to determine whether something is TRUE or FALSE. Some
logical operators are:
Operator Function
> greater than
== equal to
< less than
>= greater than or equal to
<= less than or equal to
!= not equal to
Logical operators are useful for creating “subsamples” or “subsets” from our data.
Using logical operators, we can calculate statistics separately for ethnicities, treatment
group vs. control group, developed vs. developing countries, etc. (we will see how
to do this later). For now, let’s try some simple logical operations. Try entering and
running each of the following lines of code one by one:
8 > 4
[1] TRUE
b == 6
[1] FALSE
b > 2
[1] TRUE
myvector > 3
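Assuming myvector was created as above with the values {1, 2, 4, 6, 7}, the output is a logical value for each element:
[1] FALSE FALSE  TRUE  TRUE  TRUE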
Operator Function
& “and”
| “or”
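These operators combine with the comparisons above (the book's example line is not reproduced in this excerpt, so the following is a sketch). For example, the expression
myvector > 3 & myvector < 7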
checks to see whether each element in myvector is greater than 3 and less than 7.
2.8 Loading data into R
There are several ways to load data into R. We cover three of them here. In this
book, we work mostly with the comma-separated values file format (CSV format).
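The general template is not reproduced in this excerpt; it presumably has the form below, where mydata and file location are the placeholders referred to in the next sentences:
mydata <- read.csv("file location")  # replace "file location" with an actual path or URL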
We need to replace file location with the actual location of the file, either on the
internet or on your computer. We can also replace the name of the data set mydata
with any name we like. For example, to load data directly from the internet into R,
try the following:
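The example line itself is not shown in this excerpt; using the Mars data set that is loaded with the same command elsewhere in the book, it would be:
mars <- read.csv("http://rtgodwin.com/data/mars.csv")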
After running the above line of code, you should see the data set appear in the top-right
of RStudio (the environment pane).
2.8.3 file.choose()
Using the file.choose() command will prompt you to select the file using file explorer:
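The command is not reproduced in this excerpt; a sketch consistent with the View(mars) call in the next section is:
mars <- read.csv(file.choose())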
2.9 View your data in spreadsheet form
View(mars)
Note the uppercase V (R is case sensitive). This command allows you to view your
data in spreadsheet form. See Figure 2.1.
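The code that produces the output below is missing from this excerpt; from the surrounding text (the object is named my.number and holds one million), it would be:
my.number <- 1000000
my.number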
[1] 1e+06
The e in the output signifies an exponent to base 10. Similarly, the number 0.000001
would be output as 1e-06 (note the negative sign on the exponent).
The scientific notation can be difficult to read at times, and you can suppress this
notation using options(scipen=999). Try this option, and print out my.number again:
options(scipen=999)
my.number
[1] 1000000
3. Collecting Data
Quantitative and statistical studies are often trying to provide answers. The key feature
that differentiates quantitative analysis from other methods that attempt to provide
answers is the use of data collected through experimentation, or by simply observing
what happens.
Where does data come from? In this chapter, we discuss various sources of data,
and issues involved with collecting or obtaining data. Not all data are created equally.
Some are better at answering specific questions than others, and some may not be
useful at all!
Several aspects of the data collection process can lead to sampling bias. In this chap-
ter, we will discuss situations in which the data set might not represent the population,
in ways that create misleading statistical conclusions.
Another anecdote:
“My friend Pooh doesn’t have an anti-tiger rock, and is continuously attacked by a
tiger.”
From these anecdotes, one might be tempted to draw the conclusion that anti-tiger
rocks prevent tiger attacks. From the standpoint of the statistician, this information
is not very valuable because it only contains 2 data points. The data set from these
anecdotes might look something like Table 3.1.
There is a perfect negative correlation (−1) between the two variables in the data
set. There is no way for a statistician to disprove the notion that rocks prevent tiger
attacks using this data set. However, if more information was collected on tiger attacks
(providing a bigger data set), we would likely see that there is no relationship between
rocks and tiger attacks at all!
Not only are sample sizes typically too small, but anecdotal evidence may also exist only because the personal experiences are unusual or memorable in some way. In this chapter
we will talk about random sampling from a population. Anecdotes might just contain
the most extreme cases in a population, and so might not be very representative of the
population itself.
The treatment group consists of those individuals who receive the “treatment”; the control
group does not receive treatment (in a medical study they might receive a “placebo”).
The effect of the treatment can sometimes be determined by comparing the outcomes
of the two groups. An outcome is something particular that is thought to be influenced
by the treatment. See Table 3.2 for examples of treatments and outcomes.
The outcomes in Table 3.2 could differ depending on the researcher. Rather than
wages, a criminologist might be interested in the effect of education (treatment) on
crime rates (outcome). Governments may wonder whether adopting universal health
care (treatment) might reduce health care costs (outcome). The labelling of things as
treatment or outcome is part of a framework that allows researchers to try to figure
out cause and effect.
Example 3.1 How could we use an experiment to determine the value (in terms of
wages), of a university education? We could randomly select 10 individuals, and
then randomly choose 5 to receive a free university education (this is the treatment
group). The other 5 are not allowed to receive an education. 20 years later, we
measure the wages of the individuals (wage is the outcome). The experimental data
is displayed in the table below.
One way to quantify the effect of the treatment is to calculate the sample aver-
age outcome between the treatment group (university) and the control group (high
school). We will talk about the sample average in depth in a later chapter, but you
should be able to calculate this difference now. The sample average wage for the group with a university education works out to 105.6, while the sample average wage for the group with only a high school education is:

$$\overline{\text{wage}}_{\text{highschool}} = \frac{59 + 135 + 126 + 69 + 80}{5} = 93.8$$
Taking the difference between these two sample averages (105.6−93.8 = 11.8) might
lead us to conclude that one of the effects of an education is to increase wages by
$11,800 on average. Better yet, we might express this increase as a percentage
instead. That is, we estimate that wages increase by 11.8/93.8 = 12.6% due to a
university education.
Can you identify any problems with this experiment? The sample size is probably
too small for us to have much confidence in our result, and we would want to include
many more individuals in this experiment (but we wanted to fit the table on the
page). More importantly, conducting this experiment would be very expensive,
and would be unethical. We would have to pay for the university education of each
member in the treatment group. Each member in the control group would be denied
an education, the access to which is a human right. This experiment, like many
that would be useful in economics, is too expensive to perform and would not pass
an ethics board!
Observational data is recorded without being able to apply any control over whether
the individuals in the data are in the treatment group, or in the control group. We
simply observe the choices that people make, and the outcomes that occur. There is
little to no influence over the behaviour or actions of the individuals in the data set.
There is no random assignment in observational data.²
The lack of random assignment and control means that individuals have some degree of choice in whether they are in the treatment group or the control group. This can
have very serious consequences when trying to make causal statements using observa-
tional data. As an example, we will reconsider the link between education and wage,
but in a setting where the individuals in the data have chosen to obtain an education.
Example 3.2 Suppose now that there is no experiment available to determine the
effect of education on wages. Instead, we merely observe an individual’s wage and
educational attainment. We are powerless over which individuals obtain an ed-
ucation. Consider that the same individuals who were enrolled in the previous
experiment were instead allowed to live their lives free of interference. Some chose
to get a university education, some did not.
² Natural experiments are one exception.
The sample average wage for those who chose a university education is now 132.6, while for the group with only a high school education it is

$$\overline{\text{wage}}_{\text{highschool}} = \frac{62 + 59 + 69 + 87 + 80}{5} = 71.4,$$

so that the average increase in wages is (132.6 − 71.4)/71.4 = 85.7%! This is much more
than what was indicated using the experimental data (12.6%). What happened
here? Those individuals who had more to gain (a higher base salary) chose to get
an education.
In this example, education is not just increasing wages, it is indicating the
earning potential (base salary) of individuals. This makes it impossible to attribute
the increase in wages between the two groups to the difference in education. Here,
the lurking variable is an individual's perceived benefit of obtaining an education (their self-assessed earning potential).
Observational data, such as in the above example, often involve something economists
refer to as endogeneity. A large part of econometrics is dedicated to being able to make
causal statements (such as how much education causes an increase in wages) in the face
of “threats” of endogeneity. In this textbook we will not tackle such issues, but we will
be working primarily with observational data. We need to be aware of the limitations
of observational data, especially when attempting to infer causality.
3.2 Populations and Samples
In a sample, only a portion of the population is contacted. The main reason for using a sample is that it
is usually too costly (or it is impossible) to record information on an entire population.
Every 5 years, the Canadian Census of Population attempts to contact every house-
hold in Canada, costing more than half a billion dollars. While census data is important,
most economics researchers have a much smaller budget, and so must rely on a sample.
In addition, a census may require too much time to collect, and may be less accurate
than a carefully collected sample.
3.2.1 Population
Population. The population contains every member of a group of interest.
A population contains all cases, units, or members that we are interested in. In
economics a “member” or a “case” is usually an individual, a firm, or a country. The
terminology case/unit/member just refers to a single component of the population.
If we are interested in the effect of education on wage, the population consists of
every working individual, and a case refers to each individual. If we are comparing GDP
between countries then the population consists of all countries in the world, and each
case/unit/member is a separate country. If we are describing increasing food prices in
Manitoba, then the population might be every grocery store in the province. In the
following discussion, we will often refer to a “member” or a “case” as an “individual”,
but the discussions are valid whether we are talking about individuals, businesses,
schools, institutions, countries, etc.
3.2.2 Sample
Sample. A sample collects data on a subset of members from the population.
A sample is simply a subset of the population. It usually consists of far fewer cases
or members than the entire population. Information in a sample is meant to reflect the
properties and characteristics of the population of interest. The sample contains those
members of the population that are actually examined, and from which the data set
is created. A sample is in contrast to a census, where there is an attempt to contact
every member of the population.
Census. In a census, there is an attempt to contact and record data on every member
of a population.
Sample biases.
Example 3.3 An infamous example of the failure of sampling is that of the Literary
Digest poll of 1936. Some 10 million questionnaire cards were mailed out, 2.4
million of which were returned. Based on the data in the returned questionnaires
the Literary Digest mistakenly predicted that Landon (Republican), not Roosevelt,
would win the presidential election. Many academics have since held that the poll
failed so miserably due to the Digest selecting its sample from telephone books and
car registries [lusinchi2012], which contained more affluent individuals (those that
could afford a telephone and a car), and who tended to vote Republican.
Voluntary response sampling and on-line surveys are also prone to sample selection
bias. What type of person would answer an on-line survey? Likely it is those individuals most passionate, and holding the most extreme views, who are willing to take the time and effort to voluntarily provide information.
Example 3.4 In 2016, polls predicted that Hillary Clinton would likely win the pres-
idential election, putting her probability of winning around 90%[kennedy2018].
How did the polls get it so wrong? One theory is non-response bias. The sample
was biased in the sense that Trump supporters simply refused to respond. This the-
ory is backed by findings that individuals with lower education, and anti-government
views, are less likely to respond to surveys.
Example 3.5 The view that the Literary Digest disproportionately sampled Republi-
can voters (see Example 3.3) has been challenged[lusinchi2012]. Non-response bias
is an alternate suspected culprit. 1⁄3 of Landon’s supporters answered the survey,
compared to only 1⁄5 of Roosevelt supporters. Most of the 7.6 million unanswered
surveys were from Democrats!
3.3.3 Misreporting
With any survey, misreporting is a concern. Misreporting is when a survey or poll
respondent does not provide accurate information. The reasons for this can be many.
For example, the “Shy Trump Hypothesis” supposes that the 2016 polls failed due
to Trump supporters feeling that their views were unaccepted by society. Individuals
may be too embarrassed to report truthfully, may be worried about social stigma, may
not understand the questions, or may not recall information accurately. If there is
systematic misreporting (in the sense that there is a pattern or a commonality among
the people who report), then inferences drawn from such surveys can be biased.
Example 3.6 The Current Population Survey (CPS) is an important survey that is
used in a variety of quantitative analyses, and that has hundreds of thousands of
citations in economics research.
The CPS asks respondents questions on enrolment in food stamp programs. This
information is important for understanding poverty, and ways to mitigate poverty.
A study investigating misreporting in CPS data has found that approximately 50%
of households on food stamps do not report it on the CPS[meyer2020], and that
theories such as stigma may explain the misreporting. When individuals feel that
they may be judged, they may not answer survey questions accurately.
In order to avoid sample selection bias, simple random samples are often recommended. A simple random sample is one in which members of the population of interest are selected at random, so that each member has an equal chance of being selected. Imagine a bowl containing pieces of paper, each with the name of one member of the population written on it. Pulling out n pieces of paper from the bowl, and contacting those selected, would create a simple random sample of size n.
Simple random sampling is in contrast to convenience sampling, voluntary sam-
pling, and on-line polls. In a simple random sample, information and opinions will
not be skewed by those individuals who are the most motivated or the most willing to
participate in a study. There will be no underlying link between the members in the
sample.
There are more complicated versions of random sampling. For example, a stratified
random sample selects members from subgroups of a population. In this way, members
with certain characteristics have a higher probability of being sampled.
Example 3.7 — Stratified sample. Suppose that we want the portion of ethnicities in
our sample to perfectly reflect the portion of ethnicities in the population. Suppose
that we know that the population contains only 3% of a certain ethnicity. If we
take a sample of 100 from the population, what is the probability that no one in
the sample is from that ethnicity? It turns out to be approximately 5%.ᵃ We might
completely miss this group! Instead of pure random sampling, we could randomly
select a certain number of individuals from each ethnicity, where the number that we
select is based on their proportions in the population. That is, we could randomly
select exactly 3 people (if our sample size is going to be 100) from the ethnicity that
comprises 3% of the population.
ᵃ Assuming that the population is very large, the probability of not drawing the certain ethnic group is maintained at 97% for each draw, and the probability of 0 draws is 0.97^100 ≈ 0.048.
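This calculation is easy to check in R:
0.97 ^ 100   # approximately 0.048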
4. Describing Data
In this chapter, we will begin to describe the variables in our data set. We start by
explaining the structure of a data set. Each row in a data set corresponds to a differ-
ent observation, and each column is a different variable. We then discuss some basic
characteristics of the variables, such as whether they are quantitative or categorical,
and whether they are continuous or discrete.
Such considerations not only help us understand our data set, but also inform the
type of graph that we should use to visualize the data. We will learn about the following
ways to graph a single variable in this chapter:
pie charts
bar graphs
histograms
time plots
Creating graphics from data is a powerful way to learn about the distribution of a
variable. Graphics are also used to convey information, to make a point, or to try to
convince the reader of some hypothesis. In this chapter we will match the appropriate
type of graph to the different types of variables, and learn how to create those graphs
in R.
By graphing a quantitative variable in a histogram, we can learn about the shape,
location, and spread of its distribution. These are important considerations that help
to characterize the population that we are studying. Graphs help us look for patterns
and exceptions to the pattern.
Finally, we will discuss the scatterplot. A scatterplot graphs two variables at once
(sometimes more!), and is a powerful way to begin to describe the relationship between
two variables that may be related to each other. We can use a scatterplot to describe
the direction, form, and strength of a relationship, whether the relationship is linear or
nonlinear, and to see if a relationship even exists!
4.1 How data is arranged
The number of observations, or the number of rows in the data set, is called the
sample size and is denoted n. It is always better to have a larger n!
Sample size. The sample size is the number of observations (rows) in the data set,
and is denoted by n.
Example 4.1 — Data example: Mars has been colonized. At several points in the
book we will use data on Mars colonists (see Figure 4.1 for a few rows and columns
of the data set). Mars has been colonized, with 720,720 individuals thriving in
Mars City. Due to the importance that Mars City represents for the survival of
humanity, detailed information on the inhabitants is available. People who want
to live on Mars are subjected to intense scrutiny and have agreed to allow detailed
information about themselves to be available. The data is of course fake (randomly
generated), but has variables that mimic many real data sets, such as the Current
Population Survey.
Figure 4.2: Data from the 2019 World Happiness report. Each observation (row) is a
different country. The variables (columns) are the average Happiness Score, and GDP
per capita. The name of the first column reveals that we have observations on countries,
rather than individuals, provinces, businesses etc.
Example 4.2 — Data example: 2019 World Happiness Report. We will use the World
Happiness report for several examples throughout the book. The First World Happi-
ness report was prepared in 2013, in support of a United Nations High-Level Meeting
on “Well-Being and Happiness: Defining a New Economic Paradigm.” The World
Happiness Reports are funded and supported by many individuals and institutions,
and based on a wide variety of data. The most important source of data, however,
is the Gallup World Poll question of life evaluations. The English wording is:
“Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at
the top. The top of the ladder represents the best possible life for you and the
bottom of the ladder represents the worst possible life for you. On which step of
the ladder would you say you personally feel you stand at this time?”
The responses can be averaged so that each country is ranked (see Figure 4.2 for
the happiest countries in the world!) By including other variables in the data set
for each country, researchers have an opportunity to investigate what factors lead
to differences in happiness between countries (differences such as GDP per capita).
In this data set, GDP per capita is in terms of Purchasing Power Parity adjusted
to constant 2011 international dollars.
A qualitative variable is one that takes on two or more possible qualitative values
(qualitative variables are also called categorical variables). When we say qualitative we
mean something that is not necessarily numerical, but that has a quality or a property.
For example, red is a quality, three is a quantity. The colour of someone’s eyes or hair
could be a quality that fits into one of several categories, whereas their height or weight
could be quantified. We could say that one person is twice as tall as another, but we
can’t make the same kind of algebraic comparisons for eye colour.
Some typical examples of qualitative variables encountered in the social sciences
are:
gender
treatment
ethnicity
province or territory of residence
marital status
political affiliation
exchange rate regime
For most of the examples above, the categorical variable can take on one of several
different possible values. A key feature of a categorical variable is that its categories
must be exhaustive. That is, each observation must be able to fit in one of the categories.
A simple way to ensure this is to have an “other” category that acts as a catch-all for
observations that are not easily categorized.
Ethnicity is a categorical variable reported in many data sets that collect infor-
mation at the individual level. “Ethnicity” as a categorical variable is problematic in
terms of developing appropriate concepts, avoiding ambiguity, and avoiding offensive
constructs and terminology (for example Eskimo in reference to Inuit). However, the
international meeting on the Challenges of Measuring an Ethnic World (Ottawa, 1992)
noted that ethnicity is a fundamental factor of human life inherent in human experi-
ence, and that data on ethnicity is in high demand by a diverse audience. Statistics
Canada has a standard that classifies individuals into one of eight categories; see Figure 4.4.
The number of categories that a categorical variable can take is often up to the
discretion of the researcher, and can vary. For example, countries must decide how
to manage their currency on the foreign exchange market. A categorical variable
could be used to describe which regime (currency exchange system) each country fol-
lows. There are three basic types, so for example each country could have a vari-
able called exchange.regime which takes on one of three values: floating.exchange,
fixed.exchange and pegged.float.exchange. However, the IMF classifies countries
in 1 of 8 exchange rate regime categories, so the exchange.regime variable could in-
stead take on one of eight possible values.
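As a rough sketch of how such a variable might be stored in R (the four-country toy data here is hypothetical), a factor can hold the regime categories:
exchange.regime <- factor(c("floating.exchange", "fixed.exchange",
                            "pegged.float.exchange", "floating.exchange"))
table(exchange.regime)   # counts how many countries fall into each regime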
Finally, why are categorical variables used? They are important for predicting,
modelling, and understanding the differences between groups. Is a drug effective? We
can compare the outcomes between the treated and placebo/control groups. The cate-
gorical variable will identify which individuals belong to which group. Do women earn
less than men? To be able to investigate, and perhaps ultimately solve discrimination
by gender or race, we first need a way to identify differences between groups; this task
is greatly aided by categorical variables.
Dummy variables
Gender was traditionally considered a binary or dummy variable in the social sciences.
A dummy variable is a special kind of categorical variable that can take on one of
only two values (binary refers to a number system with a base of 2). Historically, a
gender categorical variable could take on the values either “male” or “female”; each
person was forced to belong to one of the two categories. With the more common
understanding that gender is a spectrum rather than a binary, more contemporary
statistical analyses try to recognize broader categories, such as non-binary, trans, and
possibly dozens of others. For example, Statistics Canada has slightly broadened its
sex and gender classifications. A person’s sex can be “male”, “female” or “intersex”,
and a person can be “Cisgender”, “Transgender”, “Male gender”, “Female gender” or
“Gender diverse”. With more than two categories, gender is no longer the quintessential
“dummy” variable in the social sciences.
A classic example of a dummy variable answers the question: did the subject receive the “treatment”? The treatment variable could take on values
yes or no. Numbers are typically assigned to these dummy variables: 1 indicates “yes”
and 0 indicates “no”. Don’t be fooled by the numerical values! The numbers don’t
actually mean anything, other than to provide a key to the categories.
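A minimal sketch (with hypothetical data) of this 1/0 coding in R:
treatment <- c("yes", "no", "yes", "yes", "no")   # hypothetical responses
as.numeric(treatment == "yes")                    # 1 indicates "yes", 0 indicates "no"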
Other examples of dummy variables in economics include whether a firm is “domes-
tic” or “foreign”, whether an individual has participated in a social program or not,
whether a person has ever received social assistance, whether an individual or country
has ever defaulted on a loan, whether an individual has ever committed a crime, etc.
Ordinal variables
Ordinal variables rank observations (order them) relative to one another. For example,
the position that an athlete places in a race (1st for gold, 2nd for silver, etc.) is
an ordinal variable. The ranking of countries by happiness (see Figure 4.2) is an
ordinal variable. Ordinal variables do not contain as much information as quantitative
variables, and are not considered as useful. The magnitudes of ordinal variables don’t
have much meaning. Did the athlete who received a silver medal (position = 2) take
twice as long to complete the race as the athlete that received gold (position = 1)?
Ordinal variables provide a type of qualitative information.
Ordinal variable. An ordinal variable ranks each observation among all the observa-
tions.
Ordinal variables usually occur due to the ordering of some other latent or hidden
variable. In the case of the athletes, the time to complete the race is the underlying
variable that generates the ordinal position variable. It would always be better to
have the underlying variable time instead. The ordinal variable does not contain as
much information. Similarly, we would rather know the actual Happiness.score of
each country rather than their happiness rank. Ordinal variables are used when no
such quantitative alternative exists.
A quantitative variable takes on different numbers, and the magnitude of the vari-
able is important (whether the number is small or large). Depending on the nature
of the variable, it may have a certain domain. A domain is all the possible places the
variable can occur or “live”. For example, income cannot be below 0, so an income
variable might be confined to the set of positive real numbers. A variable measuring
temperature on Earth might realistically be confined between -100 and 70 degrees Cel-
sius. In some situations, the domain might be the entire real line, so that the variable
might take on any value between negative and positive infinity!
In Figure 4.1 we see that age and income are quantitative variables. Yet, there is
something different in the nature of these two variables. In fact, quantitative variables
can be divided into two types: discrete and continuous. In the Mars colonist data set, age (recorded in whole years) is an example of a discrete variable, while income is treated as continuous.
Continuous variables
A continuous variable is obtained by measuring, and can take any value over its range.
Even if the range is not infinity, a continuous variable has an uncountable number of
possibilities! For example, the possible heights of an individual are uncountable, even
though the possibilities are between 0m and 3m, for example. The person could be
1.63m tall. What about 1.63001m tall? Or 1.630000001m tall? We could keep adding
zeros. The possibilities are uncountably infinite. In Figure 4.2, the Happiness.score
and Log.GDP.per.capita variables are continuous. They can take on any values in a
range, but we can’t count all the possible values.
The distinction between discrete and continuous variables leads to very important
mathematical considerations in statistical modelling. For example, where a discrete
variable might be added up, a continuous variable would be integrated. Similarly, we
could find the derivative for a function of a continuous variable, but we can’t take the
derivative of a function of a discrete variable. We do not get into these topics in this
book, but rather focus on the consequences that these differences have for the way in
which we graph the variable.
Figure 4.5: Pie chart (left) and bar plot (right) of marital status for a sample of Mars
colonists.
Similar to a pie chart is the bar plot. A bar plot simply uses the number of observations in a category for the height of a bar. The bar plot has the added benefit that
it conveys the actual number of observations in each category. For example, in Figure
4.5 we can see that approximately 100 individuals in the sample are divorced.
Example 4.3 — Pie chart for marital status in Mars city. Let’s recreate Figure 4.5.
We’ll make a pie chart and bar plot for the marital status of a sample of 1000 Mars
colonists. First, load the data:
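The loading command is not reproduced in this excerpt; as elsewhere in the book, it is:
mars <- read.csv("http://rtgodwin.com/data/mars.csv")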
table(mars$marital.status)
pie(table(mars$marital.status))
barplot(table(mars$marital.status))
Note that you can “export” the images that you create (that’s how we got them
into this book!).
Typically, a pie chart or a bar plot is used, not both. In fact, it is questionable if
these graphs are even needed for categorical data. The table below, upon which the
graphs are based, takes up very little space and conveys a lot of information:
4.5.1 Histograms
Histogram. A common graphic for portraying the distribution of a continuous vari-
able. A histogram “bins” the variable, and draws the height of each bin by using the
number of observations in that bin.
A histogram is created by breaking up the range of a variable into several “bins”,
counting the number of observations that fall into each bin, and then graphing the
heights of the bins. This gives us a visual representation of how often the variable
takes on ranges of values. The histogram tells us if there are extreme values, if the
variable is spread out or tightly packed, and which values the variable tends to take.
Example 4.4 illustrates how a histogram is produced.
To create a histogram, the computer will “bin” this data, count how many scores
fall into each bin, and then use the number of values in the bin to graph its height.
For bins of size 10, we could have:
IQ <- read.csv("http://rtgodwin.com/data/IQ.csv")
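The command that draws the histogram is not reproduced in this excerpt; assuming the scores are stored in a column named IQ, it would look something like:
hist(IQ$IQ, main = "Histogram of IQ scores", xlab = "IQ")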
There are hundreds of different statistical distributions. In this book we will limit
ourselves to only a few. We will, however, develop terminology that is helpful in select-
ing the right distribution. For example, if a distribution is spread out or condensed, if it
has skew or is multi-peaked, then the bell curve of Figure 4.6 would not be appropriate.
Shape, location, and spread
How would you describe the distribution for IQ scores (as it is portrayed by the his-
togram) from Example 4.4? Use words like shape, location, and spread.
Shape: IQ scores appear to have a single peak, are not skewed, and follow a
“bell” like shape.
Location: The distribution is located at around 100. This appears to be the
centre of the distribution (around where most values are located).
Spread: The distribution is not particularly spread out, nor is it tightly packed.
It matches the bell-curve nicely.
The “bell” curve that we mention is in reference to the important Normal distribu-
tion, which we discuss in a later chapter. It is a famous and important shape that you
should already be familiar with: see Figure 4.6.
4.5.3 Skew
A non-symmetrical distribution is said to be skewed if it looks as if one of the “tails”
of the curve has been stretched out. That is, a distribution is skewed if one of the tails
is longer than the other. Skew is a descriptor that helps to characterize a distribution.
Figure 4.7 illustrates two skewed distributions.⁴
Example 4.5 — Left skew: age at death. The distribution of people's ages at the time
of their death is an example of left skew. Using Life Tables from Statistics Canada,
download constructed data on the age-at-death of 99,976 individuals:
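The code is not reproduced in this excerpt; a sketch with a hypothetical file name (death.csv) and column name (age) would be:
death <- read.csv("http://rtgodwin.com/data/death.csv")   # hypothetical file name
hist(death$age, main = "Histogram of age at death", xlab = "age at death")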
What do you see? This is a left skewed distribution. It peaks at age 85-90, with a
bit of an extra “spike” at around age 0 (reflecting infant mortality).
⁴ The left skew distribution is from the “Skew Normal distribution” and the right skew is from the “Log-Normal distribution”.
Example 4.6 — Right skew: income. The incomes of individuals typically follow a
right skewed distribution. For example, the majority of workers might be within
the $30,000 to $100,000 range, with a small portion of workers making very large
incomes. Let’s draw a histogram of incomes from the sample of 1,000 employed
Mars colonists:
mars <- read.csv("http://rtgodwin.com/data/mars.csv")
hist(mars$income, breaks = 16,
main = "Histogram of Mars incomes", xlab = "income")
The option breaks = 16 was used to control the number of bins in the histogram
(try removing it and see what happens). We see that incomes on Mars appear to
follow a right skewed distribution, with the majority making under $100,000, and
with some very large incomes in the sample.
4.5.4 Multi-peaked distributions
A multi-peaked distribution can arise for a variety of reasons, one of which is when two distributions are mixed
together to create a single random variable. Figure 4.8 shows a multi-peaked (bi-modal)
distribution.
For example, the percentage grades in university courses are often bi-modal (have
two peaks): one peak around “C” grades and another peak around “B+”. The number
of years of education of individuals is often multi-peaked as well: for example one peak
at 12 years (high school) and another peak at 16 (university degree).
Figure 4.8: A multi-peaked distribution. 3/4 of the values come from a bell curve located
at 20, the other 1/4 are located at 60.
We see that the distribution has at least two peaks: one for a high school degree,
and one for a university degree. Note that barplots have spaces between the bars,
whereas histograms do not.
4.5.5 Outliers
An important reason to graph data, using histograms and bar plots (and later scatter
plots), is to detect the presence of outliers. Outliers are extreme values that differ
significantly from the other observations in the data. An outlier might be sampled by
chance, in which case the observation should usually remain in the data set.
Outlier. An extreme value that may indicate the presence of an error, in which case
the observation should be removed from the sample.
If the outlier is due to an error, then it should be removed or corrected. Such errors
may occur as the data is being measured or recorded. For example, an extra 0 might be
typed when recording income, a single weight might be recorded in kilograms instead
of pounds, or an economist may forget to convert Pesos to Dollars when examining
trade.
Another possible source for outliers is if an observation comes from a different
population. Remember that the sample is meant to represent the population. If we
are interested in the income of employed Martian colonists, then the sample should not
include a student, for example. Observing a small value for income might induce us to
examine the observation more carefully and perhaps discover that the observation is
indeed from the wrong population.
In Example 4.6 we see some outliers: some very high incomes in the right tail of
the distribution. We should do our best to examine these observations to see if we can
detect any data recording mistakes, or any indication that these observations do not
belong in the sample.
Example 4.8 — Time plot of GDP. The data is from Statistics Canada[statscanGDP],
and contains GDP by year in millions of 2012 dollars:
GDP is really accelerating! Notice in the formula that the year is an exponent, so
GDP is growing exponentially over time. If we take the logarithm of both sides of the
equation, then log(GDP) (to any base) is growing linearly over time! (The 40 no longer appears as a “power”.) This is extremely useful for graphing variables that grow exponentially; it is also useful when trying to fit “straight line” models to non-linear relationships.
To see this, load some data on Mars GDP:
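The code is not reproduced in this excerpt; a sketch with hypothetical file and column names (marsgdp.csv, year, gdp) is:
marsgdp <- read.csv("http://rtgodwin.com/data/marsgdp.csv")   # hypothetical file name
plot(marsgdp$year, marsgdp$gdp, type = "l",
     xlab = "year", ylab = "GDP")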
Figure 4.9: Mars GDP. It is difficult to locate the years in which recessions took place
without taking GDP in logs.
In the left pane of Figure 4.9, it is difficult to see the values of GDP at the beginning of
the time period. It looks like a smooth ride! This is because, by the end of the sample
(year 60), the values for GDP are very large, making the scale of the y-axis unhelpful
for seeing what is happening with GDP around year 10. Looking at the left pane of
Figure 4.9, it appears that there were two recessions at the end of the sample. Let’s
now put the log of GDP on the y-axis instead (right pane of Figure 4.9):
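The corresponding command, using the same hypothetical names as in the sketch above, would be:
plot(marsgdp$year, log(marsgdp$gdp), type = "l",
     xlab = "year", ylab = "log GDP")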
After graphing the log of GDP, we see a linear relationship. This shows that GDP
is growing constantly over time. It also allows us to see that GDP at the beginning
of the time period was actually quite tumultuous! The two major recessions occurred
near the beginning of the time period, not the end. This was only visible after taking
logs.
To summarize, if a variable is growing exponentially (or is growing with a constant
percentage increase), then a common trick for visualizing such a variable is to take logs.
By looking at a scatter plot we can comment on the strength, form, and direction of
the relationship between two variables.
In this section, we will:
Each point on the scatter plot represents a single observation (a row in the data set,
see Section 4.2). The position on the plot is determined by the values of the dependent
and explanatory variables; these values provide the coordinates.
Using the World Happiness report, Table 4.1 shows the average happiness score,
and log GDP per capita, for a few countries. When dealing with GDP, it is common to
use the log. Hypothesizing that GDP may cause happiness, we’ll call “Happiness score”
our dependent variable (the y variable) and “GDP per capita” our explanatory variable
(the x variable). The three observations in Table 4.1 are plotted in Figure 4.10. (Make
sure you can locate all the points!) By plotting all 127 countries in the data set, the
scatter plot will show us whether there is a relationship between the two variables, and
allow us to comment on the strength, form, and direction of the relationship.
Example 4.9 — Scatter plot for happiness and GDP per capita. Load the Happiness
data, and create the scatter plot:
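The code is not reproduced in this excerpt; a sketch, assuming a hypothetical file name (happy.csv) and using the variable names Happiness.score and Log.GDP.per.capita that appear elsewhere in the chapter:
happy <- read.csv("http://rtgodwin.com/data/happy.csv")   # hypothetical file name
plot(happy$Log.GDP.per.capita, happy$Happiness.score,
     xlab = "log GDP per capita", ylab = "Happiness score")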
Describe what you see using terms like strength, form, and direction.
There is a fairly strong relationship between the two variables (the data points
are quite tightly packed together, rather than being spread out).
The form seems to be linear (rather than non-linear).
There is a positive (direct) relationship between the two variables. When
one variable increases, so does the other (rather than a negative or indirect
relationship where the values move in opposite directions).
A problem with Figure 4.12 is that there are some very large values for CO2 leading
to a scale for the graph that makes it difficult to see what is happening for the majority
of countries. As in Section 4.6.1, a trick for handling this is to take the logs of the
variables.⁶ We can do this easily in R:
plot(log(co2$gdp.per.cap), log(co2$co2),
ylab = "log CO2 emissions per capita", xlab = "log GDP per capita")
In Figure 4.13, it is much easier to see that there is a strong and positive relationship
between per capita CO2 emissions and per capita GDP.
Figure 4.14: Log per capita CO2 emissions and GDP by continent.
Finally, let’s add colour to the scatter plot, by giving each point on the plot a
different colour based on the country’s continent. “Continent” is a qualitative variable
in the data set that places each country in 1 of 5 categories. From Figure 4.14 we
can now see that, compared to other countries with similar GDP, the Americas have
fewer CO2 emissions. These types of revelations occur much more easily when colour (or symbol) coding scatter plots using qualitative variables. The R code necessary for adding colour to the scatter plot is provided in Example 4.10.
⁶ Taking the logs of both variables leads to an approximate percentage change interpretation. That is, a percentage increase in GDP will be associated with a percentage increase in CO2 emissions (approximately).
Example 4.10 — Colour coding CO2 emissions by continent. First, load the data:
We need to create a colour variable that will control the colour of each data point.
We begin this by initializing a colour variable:
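The initialization code is not reproduced in this excerpt. One possible sketch (the continent labels used here are assumptions) assigns a colour name to each of the five continent categories:
colour <- rep(NA, nrow(co2))
colour[co2$continent == "Africa"]   <- "red"
colour[co2$continent == "Americas"] <- "blue"
colour[co2$continent == "Asia"]     <- "green"
colour[co2$continent == "Europe"]   <- "purple"
colour[co2$continent == "Oceania"]  <- "orange"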
Now we create the scatter plot, choosing the colour of each data point using the
variable we have created:
plot(log(co2$gdp.per.cap), log(co2$co2),
ylab = "log CO2 emissions per capita", xlab = "log GDP per capita",
col = colour, pch = 16)
legend("topleft",
legend = unique(co2$continent),
col = unique(colour), pch = 16)
This reproduces Figure 4.14. There are much easier ways to accomplish colour
coding in R, for example by using the ggplot2 downloadable extension for R. This
example instead serves to illustrate the principle behind colour coding in a scatter
plot: linking each possible value in a qualitative variable to a unique colour.
5. Describing distributions with statistics
A statistic is a numerical value that is a function of the sample data. When we say
“function of the sample data,” we mean a formula, algorithm, set of rules, etc. that
uses the information in the data. Statistics can be used to describe a distribution.
Some of the visual descriptors from the previous chapter, such as location, spread, and
skew, can actually be measured using a numerical value.
Some statistics that we will cover in this chapter are:
sample mean
median
interquartile range
pth percentile
sample variance and standard deviation
sample correlation
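The formula for the sample mean (Equation 5.1) is not reproduced in this excerpt; it is the familiar

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad (5.1)$$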
where yi denotes the ith observation, and where n denotes the sample size. The symbol
Σ tells you to add, starting at the 1st observation (i = 1) and ending at the last (n).
Equation 5.1 is a very common statistic, and should already be very familiar to you.
Example 5.1 — Sample mean in R. Load the variable y = {6, 2, 5, 6, 1} into R using:
y <- c(6, 2, 5, 6, 1)
mean(y)
[1] 4
We can calculate the sample mean of “income” using the mean() function:
mean(mars$income)
[1] 80938.1
What does the sample mean ȳ tell us? For one, it is an estimate of the true
population mean. The true population mean is the “centre” of the distribution (e.g.
the centre of a bell curve) that is generating the values for the variable. The true
population mean of y is the value that we expect to observe for y.
The sample mean gives us an idea about the centre or location of the variable’s
distribution, and is called a “measure of central tendency.” The sample mean is the
“centre of mass” of the variable. That is, if the histogram of the variable were a physical
object, the mean would be the location where we could balance the object on one finger
along the x-axis.
The sample mean is one of the most important statistics, because it defines a very
important feature of a distribution: its location.
We can calculate the sample median of “income” using the median() function:
median(mars$income)
[1] 70094
The sample median is important for similar reasons that the sample mean is im-
portant. It is a defining feature of the true underlying distribution that is generating
or describing the data that we observe. In addition, it gives an idea about the “centre”
or “middle” of the distribution of a random variable.
The median is resistant to extreme values in the tail. That is, once the median has been found, all the values to the left (or right) of
the median could be stretched out or rearranged and the median would be unchanged.
Example 5.5 — Resistance of the median to outliers. Take the ordered y variable from
Example 5.2: y = {1, 2, 5, 6, 6}. The sample mean and median of y are:
ȳ = 4
median(y) = 5
Now, let’s try changing the last value in the y variable so that it is an outlier, for
example let:
y = {1, 2, 5, 6, 100}
> mean(y)
[1] 22.8
> median(y)
[1] 5
The sample mean has been drastically affected by this outlier (it went from 4 to
22.8), and the sample median has remained unchanged.
In a symmetrical distribution, the mean and median are the same. In an asymmetrical distribution, they are usually different. For example, in a right skewed distribution, the mean will usually be greater than the median. If outliers are suspected
to be in the data set, the median might be a safer measure of “central tendency” since
the sample mean can be greatly swayed by extreme values.
5.4.1 Percentiles
To calculate a percentile, we again start by arranging the values of the variable in
increasing order. Then, we count to the required percentage starting at the first ob-
servation. For example, the 20th percentile is located at roughly observation number (0.2 × n) + 1 of the ordered variable. For a sample size of n = 100, for example, this would be the 21st value in the ordered list: 20% of the values would be smaller, 80% of the values would be larger.
Similar to the median, there may not be an exact correspondence between the
desired percentile and the observation number in the ordered list. In this case, we
would take the sample mean of two values instead. For example, if the calculation places the 20th percentile between two ordered observations (say the 20th and 21st), we would take the sample average of those two values.
Percentiles can be used to measure the spread of a variable. A 5% probability (a 1
in 20 chance) is a common value chosen in statistics for classifying an “extreme” event.
We might wonder, in the extreme, what is the best and worst that could happen? The
mean height of a person may be 1.65m, but that doesn’t tell us anything about the
extremes or spread of the distribution. What height marks the shortest 5%? At what
income level are the top 5% of earners above?
Finally, quantiles are very similar to percentiles. A quantile is just expressed in
different units (not in percentage points but as a real number between 0 and 1). For
example, the 20th percentile is the 0.2 quantile.
Example 5.6 — Top and bottom 5% of Mars income earners. Load the Mars data:
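As elsewhere in the book, the data is loaded with:
mars <- read.csv("http://rtgodwin.com/data/mars.csv")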
To find the 5th percentile (the value for which approximately 5% of incomes are
smaller):
quantile(mars$income, 0.05)
5%
39092.15
So, 39092.15 is the 5th percentile of income. To find the income that marks the top
5% of earners, we can use:
quantile(mars$income, 0.95)
5.4.2 Quartiles
Quartiles break up a distribution into four quarters. That is, one-quarter of the values
will fall into each quartile. The 1st, 2nd, and 3rd quartiles correspond to the 25th, 50th,
and 75th percentiles, respectively. The 2nd quartile is the same as the median. The
first quartile is found by ordering the values, and then counting to the (0.25 × n) + 1
observation. Again, two values might need to be averaged if (0.25 × n) + 1 is not an
integer. Similarly, the 3rd quartile is the (0.75 × n) + 1 ordered value, and we already
know how to find the 2nd quartile (the median).
Quartiles are more common than percentiles, and are a simple way to summarize
the spread and shape of a distribution. When summarizing a variable, it is common to
report the values for the 1st, 2nd, and 3rd quartiles, as well as the sample mean. The
values for the quartiles tell us if the distribution is skewed, and in which direction, and
can help to select the right distribution in order to characterize a variable.
To find the quartiles of “income” we can ask for the 25th, 50th and 75th percentiles
using the quantile() function:
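The call itself is not reproduced in this excerpt; a sketch is:
quantile(mars$income, c(0.25, 0.50, 0.75))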
So, the quartiles of income are {53516, 70094, 96815}. What does this tell us? No-
tice that the gap between the 1st quartile and the median (approximately 17k) is
smaller than the gap between median and 3rd quartile (approximately 27k). In a
symmetrical distribution, these gaps would be equal. The distribution is skewed to
the right.
The minimum income in the data is 24973, and the maximum is 358318. To find
this in R use:
min(mars$income)
max(mars$income)
> min(mars$income)
[1] 24973
> max(mars$income)
[1] 358318
summary(mars$income)
Notice how several statistics from the previous few examples have all been calculated at once.
5.7 Sample variance
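The definition of the sample variance (Equation 5.3) is not reproduced in this excerpt; from the description that follows, it is

$$s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 \qquad (5.3)$$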
Similar to how we used the symbol ȳ to denote the sample mean, we also use a
symbol to denote the sample variance: s²y. If we were calculating the sample variance of income we would denote it s²income. The summation operator Σ is again telling us
to add something up, starting at the first observation and ending at the last. This
time, however, we are subtracting the sample mean from each observation, squaring
that “distance”, and then adding up all of these squared “distances”.
Notice that sample variance is essentially a measure of distance.2 Each value in the
variable is compared to the sample mean (yi − ȳ). This is measuring how far away
the values tend to be from the “centre”. However, we want to combine all of these
distances into a single measure, so we add them up. But if we just added up all of
the (y_i − ȳ), negative distances would cancel out the positive distances!
To avoid this, we could take the absolute values of the distances: |yi − ȳ|. This would
lead to an alternative measure of the spread or dispersion of a variable, called the “Mean
Absolute Deviation.” The sample variance is a more popular measure of dispersion,
and instead of taking absolute distances, we take squared distances: (y_i − ȳ)². The
squaring in the formula means that all distances are now positive, but it also makes the variance
very sensitive to large values in the data: as a value gets further and further away
from the sample mean, its squared distance grows disproportionately. Note that the "square"
in Equation 5.3 means that sample variance can never be negative. The smallest
possible value for Equation 5.3 is 0, which can only occur when all of the values of y_i are identical
(and hence there is no variation).
When we calculated the sample mean, we added everything up and then divided by
n. Here we are instead dividing by n − 1. Why? The reason is somewhat complicated,
and we will not go into depth in this book. Instead, we will provide a cursory treatment
of the topic of degrees of freedom in order to understand this n − 1 in the formula for
sample variance.
Degrees of freedom
Degrees of freedom can be thought to account for the number of independent pieces
of information available when calculating a statistic. In Equation 5.3, notice that the
2
In particular, sample variance involves the squared Euclidean distance.
formula for the statistic s_y² involves another statistic (ȳ)! Having the ȳ on the right-
hand-side of the formula for s_y² turns out to cause a distortion, and one degree of
freedom is lost. Instead of n pieces of sample information, there are now only n − 1
pieces of information available when calculating s_y².
This can be seen in a simple example. Take the variable y = {1, 3, ?}, and suppose
the sample mean of y is ȳ = 3. Can you figure out the missing y value? Good job!
Together with ȳ, only 2 out of the 3 sample values (n − 1) actually provide any unique
information.
Example 5.10 — Sample variance of y. We'll use the variable y = {6, 2, 5, 6, 1} again,
and calculate the sample variance. First we need to calculate ȳ = 4. We take each
of the values in the variable, subtract the mean, square the difference, and add all
the squared differences:
s_y² = [(6 − 4)² + (2 − 4)² + (5 − 4)² + (6 − 4)² + (1 − 4)²] / (5 − 1) = 22/4 = 5.5
In R:
y <- c(6, 2, 5, 6, 1)
var(y)
> var(y)
[1] 5.5
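As a check, Equation 5.3 can also be computed "by hand" in R; this is just a sketch of the same calculation that var() performs:
y <- c(6, 2, 5, 6, 1)
sum((y - mean(y))^2) / (length(y) - 1)   # 5.5, the same as var(y)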
Example 5.11 — Sample variance of Mars incomes. Load the Mars data and take the
sample variance:
> var(mars$income)
[1] 1605382317
This is quite a large number! What does it tell us? It is difficult to interpret this
number, unless we compare it to some other distribution. For example, we could
calculate the sample variance for Earth incomes, and see which distribution is more
spread out.
5.8 Sample standard deviation
The sample standard deviation is the square root of the sample variance:
s_y = √(s_y²) = √[ (1/(n−1)) Σ_{i=1}^{n} (y_i − ȳ)² ]   (5.4)
The standard deviation is obviously closely related to the sample variance, and
often the two are used interchangeably. An important difference, however, is that s_y
has the same units of measurement as y (whereas s_y² does not). Sometimes, the value
of a variable is compared to the number of “standard deviations” it is away from the
sample mean. This can provide an idea of how “extreme” a value is.
Example 5.12 — Standard deviation of Mars incomes. What is the standard deviation
of Mars incomes? We know from Example 5.11 that s_y² = 1605382317, so:
s_y = √(s_y²) = √1605382317 ≈ 40067.22
In R, the sd() function returns the sample standard deviation directly:
sd(mars$income)
[1] 40067.22
5.10 Correlation
Correlation is a measure of the relationship between two variables. Correlation mea-
sures:
how two variables move or vary in relation to each other.
the direction of the relationship between two variables.
the strength of the relationship between two variables.
The equation for the sample correlation between two variables (x and y for example)
is:
r_xy = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / [ (n−1) s_x s_y ]   (5.6)
Lower case "r" is used to denote the sample correlation coefficient. s_x and s_y are the
sample standard deviations of x and y (see Section 5.8).
Correlation measures how often and how far two variables differ from their sample
mean value (notice the xi − x̄ and yi − ȳ terms in Equation 5.6). If both variables tend
to be larger than their mean at the same time, then correlation will be positive. If
when one variable is larger than its mean, the other tends to be smaller than its mean,
correlation will be negative. The larger the magnitude of the correlation number, the
more often this statement holds true for specific pairs of values.
If the correlation is positive, then when one variable is larger (or smaller) than
its mean, the other variable tends to be larger (or smaller) as well. The larger the
magnitude of covariance, the more often this statement tends to be true. Covariance
tells us about the direction and strength of the relationship between two variables.
Note the following properties of rxy :
[1] -0.7467756
The scatterplot for this data is shown in Figure 5.1. The sample correlation of
-0.75 tells us that there is a negative or inverse relationship between x and y, and
that the relationship is quite strong.
Using the Mars data, calculate the sample correlation between income and ed-
ucation:
mars <- read.csv("http://rtgodwin.com/data/mars.csv")
cor(mars$income, mars$years.education)
[1] 0.4552673
The sample correlation of 0.46 indicates a moderate, positive relationship
between education and income. That is, when education tends to be higher (than
the sample mean value) so does income.
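As a check on Equation 5.6, the correlation can also be computed "by hand"; a sketch, assuming the Mars data are still loaded as above:
x <- mars$years.education
y <- mars$income
sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))   # matches cor(x, y)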
6. Density curves
y ∼ U[0,1]
Such a variable has an equal probability of taking on any value in the interval [0,1].
The probability distribution can be written as:
f(y) = 1 for 0 ≤ y ≤ 1, and f(y) = 0 for y < 0 or y > 1   (6.1)
Equation 6.1 defines the height of the density curve for any value of y, and is depicted
in Figure 6.1. To calculate the probability of y taking on a certain value in a range,
we calculate the area under the density curve. For example, if we want to know the
probability that y will be between 0.2 and 0.6, we calculate the area under the density
curve, between 0.2 and 0.6. This area, and probability, is height × width = 1 × (0.6 −
0.2) = 0.4.
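This area can also be checked in R with the punif() function (a quick sketch):
punif(0.6, min = 0, max = 1) - punif(0.2, min = 0, max = 1)   # 0.4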
Figure 6.1: Density curve for a uniform U(0,1) distribution. The area under the density
curve represents the probability that y will be between 0.2 and 0.6.
y ∼ U {1, 6}
If the variable is discrete, then probabilities are determined by the height of the density,
not the area under the density curve. Figure 6.2 shows the density function for a die
roll that follows a discrete uniform distribution U {1, 6}.
Figure 6.2: The discrete uniform distribution describes a die roll. For a discrete variable,
the height of the distribution is the probability of y taking a value.
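A die roll can be simulated with the sample() function; with many simulated rolls, the relative frequency of each outcome should be close to 1/6 (a sketch):
rolls <- sample(1:6, 10000, replace = TRUE)
table(rolls) / 10000   # each relative frequency is close to 1/6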
f(y) = (1/(σ√(2π))) × exp( −(y − µ)² / (2σ²) )   (6.2)
Equation 6.2 can look a little scary. But this is just the bell curve! If you plug in a
y-value, you get a height on the Normal (bell) curve. If you plug in many y values
into this equation, you can trace out the curve. Equation 6.2 has two parameters: µ
(the mean) and σ (the standard deviation). These parameters control the location and
shape of the curve.
If a variable y follows a Normal distribution we can write:1
y ∼ N (µ, σ)
Figure 6.3 shows Normal distributions for three different means and standard devia-
tions: N (100, 15), N (100, 30), and N (130, 15).
Figure 6.3: The mean (µ) controls the location of the normal distribution, and the
standard deviation (σ) controls the shape.
Figure 6.4: The probability of 85 ≤ y ≤ 115 is an area under the Normal density.
The area under the Normal density between two values gives the probability that the
variable is in the specified range. Calculating an area under the Normal distribution is
trickier than, for example, the continuous uniform distribution, and requires integration
(not covered in this book).
For example, suppose that y ∼ N (100, 15) and we wish to know the probability
of y being between 85 and 115. This probability is the area under the N (100, 15)
curve shown in Figure 6.4. Note that, in this example, the range of values (85 to
115) happens to be plus-and-minus one standard deviation around the mean of the
distribution: µ ± σ = 100 ± 15 = [85, 115].
6.4.2 68-95-99.7
The Normal distribution has an interesting property. No matter what the mean (µ) or
variance (σ 2 ) of the Normal distribution, the area under the curve is always the same
when the region of values is measured in standard deviations.
For example, if we take a range of ±σ around the centre of the distribution (µ),
then the area under the curve in this region is always 0.68 (68%) (see Figure 6.4).2 This
holds true no matter what the values of µ and σ. 95% of the area is within 2 standard
deviations of the mean (µ ± 2σ), and almost all of the area (99.7%) is within 3 standard
deviations. Remember that area under the curve is probability. So this means that,
for example, if a variable is Normally distributed then there is an approximate 95%
probability that it will be within 2 standard deviations from its mean.
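These areas can be verified in R with pnorm(), which by default uses the Standard Normal distribution (a sketch):
pnorm(1) - pnorm(-1)   # about 0.68
pnorm(2) - pnorm(-2)   # about 0.95
pnorm(3) - pnorm(-3)   # about 0.997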
z = (y − µ)/σ
The original variable y has a Normal distribution with mean µ and variance σ 2 (we
write this N (µ, σ 2 )). The z variable, which is created from y, has mean 0 and variance
1 (N (0, 1)). No matter what the values are for µ and σ 2 , z will always be N (0, 1).
N (0, 1) is a special case of the Normal distribution, and is called the Standard Normal
distribution.
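A small sketch of this equivalence in R, using the N(100, 15) example from above:
pnorm(115, mean = 100, sd = 15)   # P(y <= 115) for y ~ N(100, 15)
z <- (115 - 100) / 15             # standardize: z = 1
pnorm(z)                          # the same probability, from N(0, 1)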
If we want to know the probability that y is within a range of values, we need
to draw the Normal curve for y, and then calculate the area under the curve. Each
situation presents different values for µ and σ, meaning that for each situation we have
to draw a unique curve and calculate a unique area. This was historically problematic.
Without computers, drawing these curves and calculating these areas was difficult.
Instead of drawing a curve and calculating an area for each unique situation, we
can transform the situation such that it is characterized by the standard Normal dis-
tribution. This means we can have one curve, and we can calculate a bunch of areas
under that curve once. These areas are reported in a Standard Normal table.
The benefit of “standardizing” a variable (subtracting its mean and dividing by its
standard deviation) has somewhat been diminished along with advances in computing
power. However, the topic is still worth studying. Standardization is ingrained in
statistics, and some other concepts build upon or are analogous to it.
6.5 t-distribution
Figure 6.5: Comparison of t-distributions with different degrees of freedom (df ), and
the Standard Normal N (0, 1) distribution.
7.1 Randomness
Something is said to be random if its occurrence involves a degree of unpredictability
or uncertainty. Outcomes that we cannot perfectly predict are random. Randomness
represents a human failing, an inability to accurately predict what will happen. For
example, if we roll two dice, the outcome is random because we are not skilled enough
to predict what the roll will be. Things that we cannot, or do not want to predict
(because it is too difficult), are random. We cannot know everything. However, we can
attempt to model randomness mathematically.
The idea that randomness embodies a lack of information does not oppose a deter-
ministic world view. While many things in our lives appear to be random, it is possible
that all events are potentially predictable. In the dice example, it is not too far-fetched
to believe that a camera connected to a computer could analyze hand movements and
perfectly predict the result of a dice roll before the dice finish rolling!
Just because an outcome or event is random, doesn’t mean that it is completely
unpredictable, or that we can’t at least try to guess what will happen. This is where
probability comes in. Probability is a way of providing structure for things that are
uncertain or random.
Sample space. The sample space is the set of all possibilities (all outcomes) that can
occur as a result of the random process.
Outcome. An outcome is a single point in the sample space. After the randomness
resolves (is realized), the random process results in a single outcome.
Depending on the nature of the random process, the sample space may consist of
integers or real numbers, qualities (for example ethnicity or gender), colours, locations,
time, etc. The nature of the sample space, and the properties of the elements in
the sample space, vary by random process. The sample space could be countably or
uncountably infinite (e.g. the set of all integers or the set of all real numbers), could
be bounded (e.g. between the number 0 and 1) or unbounded (e.g. between −∞ and
+∞), and could take on a finite number of possibilities (e.g. 1, 2, 3, 4, 5, 6).
Out of all the possibilities in the sample space, the outcome is where the random
process arrives at. The outcome is a single element, point, or number, in the sample
space.
An event is a collection of outcomes. There are three good reasons for caring to
define events. (i) When we want to know the probability of something occurring, that
something is usually a collection of outcomes. For example, what is the probability
that someone is a millionaire? The event of interest consists of all the dollar outcomes
that are greater than $1 million. What is the probability that it will be cold tomorrow?
“Cold” means below a certain temperature, not an exact temperature.
(ii) When the sample space has an infinite number of possibilities, as is the case
for any continuous random variable (such as temperature or income), the probability
of any one outcome occurring tends to zero. What is the probability that it will be
−20°C? What about −20.1°C? What about −20.00001°C? Since there are infinite
possibilities, the probability of any one of them occurring goes to 0. Instead, we must
talk about ranges of values if we want to end up with non-zero probabilities. A range
of values is just a collection of outcomes, or an event.
(iii) Finally, an event represents an area under the density curve (see Section 6.1).
We will be able to calculate the probability of events using a density curve.
Example 7.1 — Rolling 2 dice. Consider the random process of rolling 2 dice.
Example 7.2 — Percentage mark on a midterm. What percentage mark will you
receive on your next midterm? If the professor told you that the midterm was out
of 100 marks and no part-marks would be given, then the sample space would have
exactly 101 values: {0%, 1%, 2%, . . . , 100%}.
If the professor does not tell you the marking structure of the midterm, then your
percentage score could be anything between 0% and 100%! Infinite possibilities. In
this case, the sample space is written as [0%, 100%]. An outcome is any value in this
range, and you will receive one of them for your score. An event, such as receiving
an “A+”, is the collection of outcomes [93%, 100%], and is a subset of the sample
space.
7.3 Probability
Probability can be defined several ways. There is a somewhat philosophical debate
between “frequentists” and “Bayesians” on the definition and meaning of probability.
I take the frequentist approach.
Probability. The probability of an event is the portion of times the event will occur,
if the event could occur repeatedly.
Bayesians instead view probability as a way of quantifying uncertainty. For the Trump example, Bayesians would say that the probability
of re-election is subjective. I may think the probability is 0.1, but someone else may
assign a probability of 0.9. Which is right? These problems are better suited to a
Bayesian framework, which is not discussed further in this book. The first definition
of probability will be sufficient for the topics covered here.
Example 7.3 — Probabilities when rolling dice. When rolling 2 dice, what are the
probabilities of various events? It turns out we can assign a probability to any
event by making a simple assumption: there is an equal probability of the die
landing on any one of its six sides. That is, we assume that the die is "fair". This
means that each of the 36 outcomes in the sample space has a probability of 1/36.
1. The probability of rolling a "7" is 1/6. This is because there are 6 "ways" to
roll a "7", out of the 36 possible outcomes: (1,6), (2,5), (3,4), (4,3), (5,2), and
(6,1). So, Pr[Y = 7] = 6/36 = 1/6. Here, "Pr" stands for probability, the event
is written in the square brackets [ ], and Y needs to be defined as the sum of
the 2 die rolls.
2. The probability of rolling a "2" is 1/36. Only 1 outcome satisfies the event: (1,1).
3. The probability of rolling a “1” is 0, since there are no outcomes in the sample
space that can satisfy the event.
4. The probability that the roll is higher than “10” is 3/36: 3 outcomes out of 36
satisfy the event.
a
“Die” is the singular of “dice”.
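These probabilities can also be checked by simulating many rolls of two dice in R (a sketch):
set.seed(1)
die1 <- sample(1:6, 100000, replace = TRUE)
die2 <- sample(1:6, 100000, replace = TRUE)
mean(die1 + die2 == 7)    # close to 6/36
mean(die1 + die2 > 10)    # close to 3/36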
A random variable is created when outcomes are translated into numerical values. For
example, a die roll only has numerical meaning because someone has etched numbers
onto the sides of a cube. A random variable is a human-made construct, and the choice
of numerical values can be arbitrary. Different choices can lead to different properties of
the random variable. For example, I could measure temperature in Celsius, Fahrenheit,
Kelvin or something new (degrees Ryans).
A random variable can take on different values (or ranges of values), with different
probabilities.
It is sometimes helpful to differentiate between discrete and continuous random
variables.
Continuous random variables can take on an infinite number of possible values,
so we can only assign probabilities to ranges of values (events).
We can assign probabilities to all possible values (outcomes) for a discrete random
variable, because we can count all the outcomes that can occur.
When randomness resolves, we see the outcome as a realization of the random
process. It is now just a number.
7.5 Independence
Often, we consider two or more random processes simultaneously. In economics, we
frequently want to know if one variable is “associated” with, or “causes” another. For
example, how does a change in inflation affect GDP or employment? How does the
number of years of education of a worker influence their wages? There are elements of
1
We have already discussed this idea in Sections 4.3.2 and 7.2.
randomness in all of these variables. When considering two or more potentially random
processes together, a key consideration is whether or not the processes are independent.
If two random variables are independent, then the outcome of one variable does
not influence the outcome of the other variable. Observing the value (outcome) of one
variable does not give any clues about what the other variable will be. Finding out
that two random variables are independent is very important in statistical analyses.
Example 7.4 — The gambler’s fallacy. The gambler’s fallacy occurs when the inde-
pendence of events is ignored. It is an incorrect belief that if independent events
occur more or less than usual, then that must mean that the event is more (or less)
likely to occur in the future.
For example, if a slot machine has been “cold” all night (has not paid out any
jackpots), then that must imply that the probability of a jackpot on the next pull is
somehow affected (different gamblers may avoid or seek out such a machine). This
is an incorrect belief. The probability of a jackpot is the same for each pull. Each
pull is independent from the last - the past events do not change the probabilities
of future events.
What is the probability of rolling a “7” with 2 dice? From Example 7.3 we know
this to be 1/6. What if we had just rolled a “7” three times in a row? As long as the
dice are fair, and there is no magical being interfering with the dice, the probability
of rolling a "7" is still 1/6. Each roll of the dice is independent of the others.
Knowing past events does not help predict future events that are independent.
0 ≤ P(A) ≤ 1
The probability that some outcome in the sample space occurs is 1. To express
this mathematically, we define an event called "S" which is comprised of all outcomes
in the sample space. Then:
P(S) = 1
Rule 3: Complements
If event A does not occur, then the event “not A” must occur. Ac is the event “not
A”, and is called the complement of event A. The probability of Ac occurring is:
P(Ac ) = 1 − P(A)
Rule 4: Addition
The probability of either event “A” or event “B” occurring can be determined by
adding and subtracting probabilities depending on whether the two events are mutually
exclusive or not.
(a) If "A" and "B" are mutually exclusive (meaning that they do not have any outcomes in common) then:
P(A or B) = P(A) + P(B)
(b) If "A" and "B" are not mutually exclusive (they have some outcomes in common), then:
P(A or B) = P(A) + P(B) − P(A and B)
This rule extends to more than two events, for example: P(A or B or C) = P(A) +
P(B) + P(C) (in the case of mutually exclusive events).
Rule 5: Multiplication
The probability of both event “A” and “B” occurring can be determined through mul-
tiplication.
(a) If events A and B are independent (neither event influences or affects the probability of the other occurring) then:
P(A and B) = P(A) × P(B)
(b) If events A and B are dependent then we must condition on one of the events occurring:
P(A and B) = P(A) × P(B|A)
The vertical line | means “conditional” or given. P(B|A) means the probability
of event B, given that A has already occurred.
This rule also extends to more than two events, for example: P(A and B and C) =
P(A) × P(B) × P(C) (in the case of independent events).
Example 7.5 — Snow storm and a cancelled midterm. What is the probability of both
a snow storm and a cancelled midterm occurring? Suppose that in good weather the
probability of a midterm being cancelled tomorrow is only 1%. However, if there is
a snow storm, then the probability of a cancelled midterm is:
Suppose further that there is a risk of a snow storm tomorrow and the weather
forecast gives it a 20% chance:
P(snow) = 20%
Using the multiplication rule, the probability that both a snow storm and a cancelled
midterm occurs is:
Example 7.6 — Probability distribution for a coin flip. What is the probability distri-
bution for a coin flip? We begin by describing all the possible outcomes that can
occur. We can either get “tails” (T ) or “heads” (H). So, the sample space for the
coin flip (call it Y ) is {T, H}. Next, we need to assign a probability to each possible
outcome. If the coin is fair (not weighted), then the probability of each outcome is
equal. Putting this all together, we can write the probability distribution as:
P(Y = T ) = 0.5
P(Y = H) = 0.5
Example 7.7 — Probability distribution for a die roll. What is the probability dis-
tribution for a die roll? The sample space (all the outcomes that can occur) is:
S = {1, 2, 3, 4, 5, 6}. If the die is fair (not weighted), then the probability of each
outcome is equal. Denoting the result of the die roll as Y, we can write the probability distribution as:
P(Y = 1) = 1/6
P(Y = 2) = 1/6
P(Y = 3) = 1/6
P(Y = 4) = 1/6
P(Y = 5) = 1/6
P(Y = 6) = 1/6
Probability distributions can be written in alternate ways. The important points are
that (i) all of the possible outcomes are defined, and (ii) probabilities are assigned
to each outcome. We can rewrite the above probability function as:
P(Y = k) = 1/6 ; k = 1, . . . , 6
Let Y be a discrete random variable, for example the result of a die roll. Notation
for the mean of Y or expectation of Y is µY or E[Y ]. As mentioned above, E[Y ] can
be determined from its probability distribution.
Mean of a discrete random variable. For discrete random variables, the mean is
determined by taking a weighted average of all possible outcomes, where the weights
are the probabilities of each outcome occurring. The equation for the mean of discrete
random variable Y is:
E[Y] = Σ_{k=1}^{K} P_k Y_k   (7.1)
Example 7.8 — Mean of a die roll. Let Y be the result of a die roll. What is E[Y ]?
We will use Equation 7.1, and from the probability function for the die roll (see
Example 7.7) we know that K = 6 and each Pk = 1/6, so:
E[Y] = Σ_{k=1}^{K} P_k Y_k = (1/6)×(1) + (1/6)×(2) + ... + (1/6)×(6) = 3.5
Notice that the mean of 3.5 is not a number that we can possibly roll on the die!
However, it is still the expected result.
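Equation 7.1 can also be evaluated directly in R (a sketch):
outcomes <- 1:6
probs <- rep(1/6, 6)
sum(probs * outcomes)   # 3.5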
For a continuous random variable, the mean is determined by integrating over its density: E[Y] = ∫ y f(y) dy.
If y is normally distributed, then f(y) is equation (6.2), and the mean of y turns out
to be µ. You do not need to integrate for this course, but you should have some idea
about how the mean of a continuous random variable is determined from its probability
function.
Note that, in general, the expected value is not multiplicative: E[XY] ≠ E[X]E[Y].
Only if X and Y are independent is the expected value multiplicative. Note that a
constant c and a random variable Y are always independent!
Example 7.9 — Changing the numbers on a die. Suppose that we create our own
custom die. A typical die has sides that read {1, 2, 3, 4, 5, 6}. On our custom die,
we instead make the sides read {3, 4, 5, 6, 7, 8}. What is the expected value (mean)
of the custom die?
Instead of defining the probability distribution for this custom die, and using
Equation 7.1, we can instead use the rules of the mean. Let Y represent the typical
die, and X represent the custom die. What is the relationship between Y and X?
We can get the custom die by adding 2 to each side of the typical die, so:
X =2+Y
and using the rules of means for constants and addition we have that:
E[X] = E[2 + Y] = 2 + E[Y] = 2 + 3.5 = 5.5
Example 7.10 — The sum of two dice. Often, in games, players roll two dice and take
the sum (backgammon, craps, Monopoly, Catan, Dungeons and Dragons, etc.). In
Monopoly, you move forward a number of spaces equal to the number that you roll
with two dice. How many spaces forward do you expect to move? That is, what is
the mean value of the sum of two dice?
To answer this question, we could either determine the probability function for
the sum of two dice and use Equation 7.1, or we could use the rules of means. Let
X be the result of one of the die rolls, and Y the result of the other. The number of
spaces the game piece moves forward is equal to X + Y . The mean dice roll, using
the rules of the mean, is:
E[X + Y] = E[X] + E[Y] = 3.5 + 3.5 = 7
You can expect to move forward 7 spaces. This is an important thing to know if
you gamble or play board games!
7.7.3 Variance
Sample versus population variances. Again, be aware of the difference between the
true population variance (which we are discussing here), and the sample variance
(see Equation 5.3 for sample variance). In this section, the probability functions
are completely known, so that we can use them to determine the variance of the
random variable directly. In cases where there is some question as to the shape of
the probability function, we can use the sample variance to guess or estimate the
true population variance.
A large variance means that the random variable often takes values that are far away from the mean or expected value.
Example 7.11 — Variance of a die roll. What is the variance of a die roll, Y ? We
already know that E[Y ] = 3.5, and using Equation 7.3, we have:
Var[Y] = Σ_{k=1}^{K} P_k × (Y_k − E[Y])²
       = (1/6)(1 − 3.5)² + (1/6)(2 − 3.5)² + · · · + (1/6)(6 − 3.5)²
       = (1/6)(6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25)
       = 17.5/6 ≈ 2.92
This is telling us that we expect the squared distance between the die roll and the
mean value of 3.5 to be equal to 2.92. This is a measure of the dispersion of Y .
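The same calculation can be done in R by evaluating the weighted sum of squared distances (a sketch):
outcomes <- 1:6
probs <- rep(1/6, 6)
EY <- sum(probs * outcomes)        # 3.5
sum(probs * (outcomes - EY)^2)     # about 2.92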
Rule 2: Addition
The variance of the sum of two random variables is equal to the sum of the variances,
plus a covariance term (covariance defined later):
Var[X + Y] = Var[X] + Var[Y] + 2Cov[X, Y]
If the two random variables are independent (the outcome of one does not influence or
affect the outcome of the other), then the covariance2 between them is 0, and:
Var[X + Y] = Var[X] + Var[Y]
Since the variance of a constant is zero, adding constants to random variables does not
change the variance:
Var[c + Y ] = Var[Y ]
Rule 3: Multiplication
The variance of a random variable multiplied by a constant is equal to the square of
the constant multiplied by the variance of the random variable:
Var[cY ] = c2 Var[Y ]
The variance rules for the product of two random variables are more complicated and
are not used here.
Rule 4: Non-negativity
Variance cannot be a negative number. Note the “square” in the formula for variance
(Equation 7.2). Since distance from the mean is being squared, we can never get a
negative variance for a random variable Y : Var[Y ] ≥ 0.
Example 7.12 — Variance of a custom die. Consider the custom die from Example
7.9, with sides {3, 4, 5, 6, 7, 8}. Call the result of the custom die roll random variable
X. What is Var[X]? It’s the same as the standard die! That is, Var[X] ≈ 2.92.
This makes sense: the distance between each consecutive outcome is 1, whether
looking at the custom die or the standard die.
We can verify this intuition either using Equation 7.3, or by using the rules of
variance. Again, the relationship between the custom die X, and a standard die
(call it Y ) is: X = 2 + Y . Using the rules of variance:
Var[X] = Var[2 + Y] = Var[Y] ≈ 2.92
Adding and subtracting a constant does not affect the variance of a random variable.
Example 7.13 — Variance of the sum of two dice. Take the situation in Example
7.10. What is the variance of the sum of two dice? Call one die X and the other
Y . From the rules of variance:
Var[X + Y] = Var[X] + Var[Y] + 2Cov[X, Y]
but the dice are independent (the result of one roll cannot influence the other), so the
covariance is 0 and Var[X + Y] = Var[X] + Var[Y] = 35/12 + 35/12 ≈ 5.83.
2
Covariance is very similar to correlation (see Section 5.10).
Example 7.14 — Another custom die. Let’s create another custom die. The sides of
the die will be equal to 2 times the sides of a regular die, so that the six sides read:
{2, 4, 6, 8, 10, 12}. Call the result of this new custom die Z. What is the variance of
Z?
Again, we could use Equation 7.3, or we could make our lives simpler by using
the rules of variance. The relationship between this new custom die, and a standard
die, is: Z = 2 × Y . Using the rules of variance, we have:
Var[Z] = Var[2 × Y] = 2² × Var[Y] = 4 × Var[Y] ≈ 11.67
When the values on the traditional die are multiplied by 2, the variance increases by a factor of 4.
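A quick check of this result in R, evaluating the variance of the new custom die directly (a sketch):
outcomes <- 1:6
probs <- rep(1/6, 6)
custom <- 2 * outcomes                           # sides {2, 4, 6, 8, 10, 12}
sum(probs * (custom - sum(probs * custom))^2)    # about 11.67, which is 4 x 2.92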
8. Statistical Inference
Statistical inference. Statistical inference is when a statistic (for example the sample
mean) is used to infer (i.e. “guess” or “estimate”) something about the population.
In this chapter, we put some of what we have learned about statistics and probability
together. This introduction quickly outlines what statistical inference means, and some
key points. We will spend the remainder of the chapter dissecting and explaining the
statements made in this overview. The key point in this chapter is that the sample
mean is a random variable!
You have seen the equation for the sample mean before (see Section 5.1). The
sample mean is found by adding up all values of a variable in the sample, and dividing
by the sample size:
ȳ = (1/n) Σ_{i=1}^{n} y_i   (8.1)
where yi denotes the ith observation, and where n denotes the sample size. The sample
mean is a very popular method for inferring an unknown population mean. Suppose
there is a random variable (call it y), and we want to know the true population mean of
y. The true population mean value for y is denoted µy , and is an unknown parameter.
We can infer 1 the value of µy by randomly drawing a sample from the population, and
calculating the sample average. This process is called statistical inference.
Since ȳ is calculated from a randomly selected sample, ȳ is itself a random variable.
ȳ turns out to be Normally distributed, thanks to the central limit theorem (we will
cover the central limit theorem in Section 8.4.1). The mean of ȳ turns out to be the
true population mean! This partly explains why ȳ is such a popular estimator.
We end this introduction with a simulation experiment. We already know from
Example 7.8 that the true population mean of a die roll is 3.5. Let’s pretend, however,
that we don’t know this true population mean. How could we estimate it? We can use
1
Instead of using the word “infer”, we could also say “guess” or “estimate”. In fact, ȳ is an “estimator”
for µy .
the sample mean! First we must collect a sample. We could sit at our desks and roll
a die repeatedly, recording each result. Suppose we only have enough time to collect a
sample of n = 20. Then, we take the sample average of all recorded die rolls. You can
use your own die to accomplish this, or use the following R code to simulate rolling a
die 20 times:
dierolls <- sample(1:6, 20, replace=TRUE)
dierolls
[1] 4 5 3 4 3 6 6 3 2 1 2 1 2 6 2 1 6 5 5 5
mean(dierolls)
[1] 3.6
If we didn’t know that the true mean die roll is 3.5, we could collect a sample and use
the sample mean to come up with a pretty good guess!
The mean2 income, or the mean years of education, of all Martian colonists.
The mean height of a human being.
The mean quantity demanded of Mars diamonds, given prices.
The mean number of doctor visits for individuals with health insurance.
The mean sales for Fortune 500 companies.
The mean temperature and CO2 emissions by country.
There are many other examples, and many reasons for wanting to know the mean
of a population. In each example, we could calculate a sample mean, and use that
sample mean to infer the true population mean. The true expected or mean wage
of a Mars colonist could be estimated using the sample average. It is important to
note the distinction between the true population mean, and the sample mean. The
2
Instead of “mean” we could use the word “average” for these examples, but we want to stress that we
are talking about the true mean of the population, and not a “sample average”.
true population mean income of Mars colonists is µincome , a parameter that determines
the entire population distribution of incomes. It can be considered a fixed number,
unknown to us, but that we desire to discover. The sample mean, \overline{income}, is a statistic
that can be used in place of the unknown µincome.
Statistic. A statistic is calculated from a sample of data, and can be used to estimate
an unknown parameter.
In each of the above examples, we could collect a sample, and calculate a sample
mean in order to infer the unknown population mean. Measuring and averaging the
heights of a sample of humans would allow us to guess at the expected human height.
Observing a few diamond sales allows us to guess at the true quantity demanded for
diamonds. Recording the number of doctor visits made by some randomly selected
individuals would allow us to guess at the overall demand for healthcare.
Statistics are random variables!. Since statistics are calculated from the sample,
and the sample is randomly selected from the population, statistics are themselves
random variables! This idea is key to understanding many of the concepts that
follow.
3
The closest we can actually get are called pseudo random numbers. We start with a seed, and apply
a complicated process to obtain an unpredictable result.
Example 8.1 — Random sample of Mars colonists. Begin by downloading a data set
containing information on all 620,136 Mars colonists aged 18 and older (give it a
few minutes, it’s a large data set):
Now, pretend that we don’t have this entire data set. This is the entire population - if
we had information on the entire population we would not need statistical inference.
We will simulate sampling from this population. Let’s pretend that our budget
allows us to interview 100 individuals. Draw a random sample of 100 individuals
from the population:
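A sketch of one way to draw the sample, assuming the full population is stored in a data frame called mars18 (the name used later in this example):
msample <- mars18[sample(nrow(mars18), 100), ]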
View(msample)
Let’s calculate the sample mean income from the randomly drawn sample:
mean(msample$income)
[1] 51686.45
The sample mean value is \overline{income} = 51,686.45. You will get a different sample mean!
This is because your random sample will consist of different colonists. Try the
following lines of code many times:
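For example (a sketch, under the same assumption about the population data frame):
msample <- mars18[sample(nrow(mars18), 100), ]
mean(msample$income)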
Each time you will get a different value for the sample mean of income. We have
just conducted a simulation experiment. In reality, we will only have one sample of
size n = 100 to work with. In this experiment, we are drawing many samples in
order to imagine what else we could possibly calculate for the sample mean. Since
this is an experiment, we also know the population mean:
mean(mars18$income)
[1] 51737.09
How close were the sample averages to the true population mean? It is important
to keep in mind that this is an experiment: in reality we only have one sample, and
we do not know the true population mean.
Height of a human
We could go out into the street at 2 pm and record people’s heights, and obtain a
sample. Since we don’t know who we’re going to meet, the sample is random. But
what if we had decided to go out at 3 pm instead? We would have recorded a different
sample of heights. Any statistics calculated from the two hypothetical samples (2 pm
and 3 pm) would differ, even though the true population height remains unchanged
and is a fixed parameter.
Figure 8.1: The true population mean is 5 (µ = 5). Each possible random sample y
that we could draw from the population gives us a different sample average (ȳ A = 5.2
and ȳ B = 4.9 for example). ȳ is a random variable because it is calculated from a
randomly drawn sample.
ȳ = (1/n) Σ_{i=1}^{n} y_i
where yi denotes the ith observation, and where n denotes the sample size.
To reinforce the idea that the sample mean is random, consider the following situa-
tion. You will roll a die 20 times, collect the results, and calculate the sample average,
\overline{dierolls}. What will be the number that you calculate for \overline{dierolls}? You might guess
that it will be close to 3.5, but you can’t completely predict the result. It is random,
because the sample values {1, 2, 3, 4, 5, 6} are randomly collected.
When we think about sampling individuals from a population, remember that they
are chosen randomly. Imagine what would happen if we got a different sample, if we
were in a parallel universe, if we collected the sample on a Tuesday instead of a Monday,
etc. The sample values could be different, meaning that anything that is calculated
from the sample could be different. See Figure 8.1 for a visualization of this idea.
Figure 8.2: Histogram of sample means: simulated sampling distribution for the sample
mean of 20 die rolls.
An important question is: how good is the estimator? That is, how good of a job
is the estimator doing at “guessing” the true unobservable thing in the population? In
our specific example: how good is the sample mean at estimating the true population
mean of heights? This is an important question, because there are many ways that
we could use the information in the sample to try to estimate the true mean. Why is
equation (8.1) so popular?
The fact that a statistic is a random variable has important implications for statis-
tical inference. If we are using the sample mean to estimate the population mean, we
might wonder: “how well does the sample mean represent the true population mean?”
One way to answer this question is to consider the distribution of the sample mean.
It’s a random variable after all, and it has a distribution!
Let’s start from the fact that the sample mean, ȳ, is random. What is the dis-
tribution for ȳ? Remember that the probability distribution for a random variable
accomplishes two things: (i) it lists all possible numerical values that the random vari-
able can take, and (ii) assigns probabilities to ranges of values. So, what are the possible
values that ȳ can take? How likely is ȳ to take on certain values? Ideally, we would
like to know the exact location and shape of the probability distribution for ȳ.
Before we proceed, let’s define the term sampling distribution. When the random
variable is an estimator (such as the sample mean), then its probability distribution
gets a special name - sampling distribution. That is, a sampling distribution is just a
fancy name for the probability function of an estimator.
Sampling distribution. Imagine that you could draw all possible random samples of
size n from the population, calculate ȳ each time, and construct a relative frequency
diagram (a histogram) for all of the ȳs. This relative frequency diagram would be
the sampling distribution of the estimator ȳ for sample size n.
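A minimal sketch of a simulation along these lines, storing 1 million sample means of 20 die rolls in an object called allmeans (the object used below):
allmeans <- replicate(1000000, mean(sample(1:6, 20, replace = TRUE)))
hist(allmeans)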
The histogram from this R code is shown in Figure 8.2. Notice that there were a few
“weird” samples drawn, where the sample mean was calculated to be very low or high,
but this happens rarely. Most of the sample means from the experiment tend to be
centered between 3 and 4. What is the exact location of this distribution? In fact, we
can find this location by taking the sample mean of all 1 million ȳ:4
mean(allmeans)
[1] 3.498985
Wow, the value of 3.499 is very close to the true population mean die roll of 3.5! So,
even though the possible values for ȳ can be all over the place, on average they give
the correct answer! In this example, the expected value of the sample mean is exactly
equal to the true population mean: E[ȳ] = µy, which is part of the reason why ȳ is a
popular statistic.
Unbiased estimator. When the expected value of the estimator is equal to the true
population parameter intended to be estimated, the estimator is said to be “unbi-
ased.” The sample mean, ȳ is an unbiased estimator (under certain assumptions).
Returning to Figure 8.2, what shape characterizes the histogram? It is the familiar
Normal distribution, or bell curve! Figure 8.3 shows a Normal distribution super-
imposed onto the histogram of ȳ for die rolls. In fact, the sample average ȳ always
(approximately) follows a Normal distribution, regardless of the distribution of the
variables in the sample! This is due to the central limit theorem.
4
It may be confusing to take the sample mean of sample means. Just focus on the fact that ȳ is a
random variable. It is natural to try to find the mean and variance of a random variable.
Figure 8.3: Normal distribution with µ = 3.5 and σ 2 = 0.145, and histogram simulating
the sampling distribution for the sample mean of 20 die rolls.
Central limit theorem. Loosely speaking, the central limit theorem (CLT) says that
the sums of random variables are Normally distributed.
To illustrate the CLT, let’s again use dice for an example. We know that a single die
roll is uniformly distributed (equal probability of each number coming up). But what
if we start adding the results of dice rolls? Figure 8.4 shows the probability function
for the sum of two dice. It’s no longer flat (uniform)! It even seems to have a bit of a
curve to it.
Now, let’s add a third die, and see if the probability function looks more normal.
Let Y = the sum of three dice. It turns out the mean of Y is 10.5 and the variance is
8.75. The probability function for Y is shown in Figure 8.4. Also in Figure 8.4 is
the probability function for a Normal distribution with µ = 10.5 and σ² = 8.75. Notice
the similarity between the two probability functions.
The CLT says that if we add up the result of enough dice, the resulting probability
function should become Normal. Finally, we add up eight dice, and show the probability
function for both the dice and the Normal distribution in Figure(8.4), where the mean
and variance of the normal probability function has been set equal to that of the sum
of the dice.
Figure 8.4: Probability function for the sums of dice, with Normal density functions
superimposed. As the number of random variables that we sum increases, the distri-
bution of the sum becomes Normal. This is due to the central limit theorem (CLT).
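A sketch of how the last panel could be simulated in R, by summing eight dice many times:
sums <- replicate(100000, sum(sample(1:6, 8, replace = TRUE)))
hist(sums)   # roughly bell-shaped, as the CLT predicts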
CLT and ȳ. So, what does the CLT have to do with the sample mean? Look at Equa-
tion 8.1 again. Notice the summation Σ operator. Taking a sample average involves
adding up random variables, so the CLT means that ȳ is (approximately) Normally distributed.
9. Confidence intervals
This chapter discusses how to construct and interpret confidence intervals. Confidence
intervals are very easy to calculate but very difficult to understand, and are commonly
misinterpreted.
One of the uses of a confidence interval is to quantify the uncertainty surrounding
the estimate ȳ. A confidence interval can be calculated along with ȳ,
providing a measure of how "close" ȳ might be to the true population mean µy.
The first part of the chapter lays down the groundwork necessary to understand
confidence intervals. Sampling distributions, estimators, and the variance of estimators,
are some of the required concepts that we begin with.
Assumptions.
Since the sample mean is a random variable, we can consider probabilities of ȳ taking
on certain values. For example, we could try to determine: P (ȳ > 4) for a sample of
dice rolls, or P (80k ≤ ȳ ≤ 90k) for a sample of Mars incomes. As long as we know the
distribution for ȳ, we can determine these probabilities by taking the area under the
probability distribution (density) curve. So, what is the exact sampling distribution
(probability distribution) for ȳ?
1
What constitutes “extreme” and “far away from the truth” is subjective.
ȳ ∼ N( µy , σ_y²/n )
This says that ȳ is Normally distributed with mean µy and with variance σ_y²/n. The
mean of ȳ is the same as the mean of y. The variance of ȳ is whatever the variance
of y is, divided by n.
This sampling distribution of N(µy, σ_y²/n) is only valid in certain situations. The
sample size n has to be large enough for the central limit theorem to provide a Normal
distribution, and the y data must be identically and independently distributed (which
is assured if the y data was collected by simple random sampling).
Example 9.1 — Probability of getting a ȳ > 4. Suppose that you are about to roll 10
dice, and take the sample average ȳ. We know that the sample average “should”
give us an answer that is close to 3.5 (the true mean of a die roll). What is P (ȳ > 4)?
That is, what is the probability that we get some “extreme” value for the sample
average? We now know that the sample average (approximately) follows the Normal
distribution with mean µy and variance σy2/n. From Example 7.8 we know that the
mean of a die roll is 3.5:
µy = 3.5
From Example 7.11 we know that the variance of a die roll is:
σ_y² = 35/12 ≈ 2.92
The sample size is going to be n = 10, so the variance of ȳ is:
σ_y²/n = (35/12)/10 ≈ 0.292
Putting this together, we have that the sampling distribution for the sample average
of 10 die rolls is:
N (3.5, 0.292)
We can now get R to draw this Normal distribution, and calculate the area under
the curve to the right of 4. This area tells us the probability of getting a ȳ that is
"extreme", or greater than 4. The R code for calculating this probability is:
pnorm(4, mean = 3.5, sd = sqrt(0.292), lower.tail = FALSE)
[1] 0.1774071
The pnorm() function calculates an area (a probability) under the Normal curve.
The first argument in the function is 4: we want P (ȳ > 4). Next we give the function
the mean and standard deviationa so that we draw the correct curve: mean = 3.5
and sd = sqrt(0.292). Finally, we tell the function that we want the “upper tail”
(the area to the right of 4), so we set lower.tail = FALSE.
So, there is only a 17.7% probability of getting a ȳ > 4 when we average 10 dice!
a
Remember that standard deviation is just the square root of the variance.
Example 9.2 — Number of times getting a ȳ > 4. In the previous example (Example
9.1) we found that if we were to roll 10 dice, and take the sample average, then P(ȳ > 4) ≈ 0.177.
One way of interpreting this probability of 0.177 is that, of all the samples of
n = 10 die rolls that we could obtain, 17.7% will give a sample average higher than
4. This can easily be verified! Roll 10 dice, take the average. Repeat this many
times. 17.7% of sample averages calculated should be above 4. Instead of actually
rolling dice, we can use R:
mean(sample(1:6, 10, replace = TRUE))
[1] 3.9
Repeat the above code many times, and you will see that roughly 17.7% ȳs are
above 4!
var(ȳ) = σ_y²/n
Variance measures the “spread” of a random variable. The formula shows that as n
gets larger, the variance of ȳ decreases. ȳ gets more accurate with a bigger n! This is
one of the reasons we want the sample size n to be as large as possible. As we collect
more information in the sample, the sample average gets “better”. This is true for
many other statistics as well, such as the median or mode.
σ_y²/n = (35/12)/10 ≈ 0.292
where 35/12 is the variance of a single die roll. If we were to instead roll 20 dice and
take the average, the variance of ȳ would be:
σ_y²/n = (35/12)/20 ≈ 0.146
This gives us the sampling distribution for ȳ when n = 20: ȳ ∼ N (3.5, 0.146). The
probability of getting a ȳ > 4 is similarly found by calculating the area under the
N(3.5, 0.146) curve, to the right of ȳ = 4. R can do this for us:
pnorm(4, mean = 3.5, sd = sqrt(0.146), lower.tail = FALSE)
[1] 0.09534175
The probability of getting a ȳ that is “far away” from the true mean of 3.5 is getting
smaller as n increases! Let’s try one more time for 40 dice.
σ_y²/n = (35/12)/40 ≈ 0.073
pnorm(4, mean = 3.5, sd = sqrt(0.073), lower.tail = FALSE)
[1] 0.03211478
The Normal curves, and P (ȳ > 4), for n = 10, 20, 40, are shown above.
The problem is, in reality µy is usually not known!2 We need to calculate ȳ as a way of
estimating the unknown true population mean µy . So, in reality we don’t know where
the sampling distribution is “centered”. That is:
ȳ ∼ N( ? , σ_y²/n )
How can we calculate probabilities involving ȳ? We need to locate the sampling distri-
bution before we can do so.
2
Our dice examples are an exception.
Figure 9.1: An “actually” calculated value for the sample mean ȳ ACT is used to locate
the sampling distribution, since the true location µy is typically unknown.
What is our best guess for the unknown population mean µy ? It’s ȳ! We can
estimate the sampling distribution for ȳ by replacing the unknown µy with a value
that we “actually” calculate for the sample average. Call this value ȳ ACT , where ACT
stands for a number that we actually calculate:
ȳ ∼ N( ȳ^ACT , σ_y²/n )
The idea of replacing the unknown parts of the sampling distribution with estimated
numbers is the beginning step in constructing confidence intervals, and performing
hypothesis tests.
1. Collect a sample.
2. Calculate ȳ to use as an estimate for µy .
3. Use ȳ in place of µy in the sampling distribution, in order to calculate confi-
dence intervals and hypothesis tests.
set.seed(2040)
roll <- sample(1:6, 10, replace=TRUE)
roll
[1] 1 4 5 2 6 6 6 4 1 6
Here we used set.seed(2040) so that all the randomly generated numbers will be
the same no matter who runs the code! Next, use this sample to calculate ȳ ACT :
mean(roll)
[1] 4.1
Lastly, we can calculate probabilities of getting different values for ȳ, by using the
N(ȳ^ACT, σ_y²/n) distribution. In Example 9.1 we calculated the probability of getting
an "extreme" ȳ. Let's calculate the probability that, if we were to draw another
sample of size n = 10, the ȳ we calculate from this sample is within ±1 of
ȳ^ACT. That is, we want: P(3.1 ≤ ȳ ≤ 5.1). To get this probability, we use a Normal
distribution with µ = 4.1 and σ_y²/n = 2.92/10 = 0.292.
Notice that P(3.1 ≤ ȳ ≤ 5.1) = P(ȳ ≤ 5.1) − P(ȳ ≤ 3.1). We calculate two
probabilities in R, and subtract:
pnorm(5.1, mean = 4.1, sd = sqrt(0.292)) - pnorm(3.1, mean = 4.1, sd = sqrt(0.292))
[1] 0.9357704
This tells us that, if the true population mean were 4.1, there would be a 93.6%
chance of calculating a ȳ between 3.1 and 5.1 with a new sample of size n = 10.
These types of probability statements, involving what would happen if we could hy-
pothetically recalculate ȳ with a new sample, form the basis for confidence intervals
and hypothesis testing.
and said that we can replace the unknown µy with an actual value for ȳ:
ȳ ∼ N( ȳ^ACT , σ_y²/n )
Figure 9.2: Solving for “lower value” and “upper value” provide the 95% confidence
interval around ȳ ACT .
Which interval around ȳ ACT has a 95% probability of containing a new ȳ?
To answer this question, we could draw the N(ȳ^ACT, σ_y²/n) distribution, put 95% of the
area in the middle, and figure out the lower and upper bounds. See Figure 9.2.
n = 10
ȳ = 4.1
σ_y² = 2.92
σ_y²/n = 0.292
√(σ_y²/n) = 0.54
The estimated sampling distribution for ȳ is N(4.1, 0.292). We can use R to find
the lower value, that puts 2.5% of the area under the curve to the left:
qnorm(0.025, mean = 4.1, sd = sqrt(0.292))
[1] 3.040894
Find the value that puts 2.5% of the area under the curve to the right:
qnorm(0.975, mean = 4.1, sd = sqrt(0.292))
[1] 5.159106
These two values define the confidence interval around ȳ ACT : [3.04 , 5.16].
In addition to using R (Example 9.5), we can also find the 95% confidence interval
using:
lower value = ȳ − 1.96 × √(σ_y²/n)
upper value = ȳ + 1.96 × √(σ_y²/n)
or:
95% confidence interval.
[ ȳ − 1.96 × √(σ_y²/n) , ȳ + 1.96 × √(σ_y²/n) ]   (9.1)
The number 1.96 in Equation 9.1 comes from the Standard Normal distribution
N(0, 1) (see Section 6.4.3). In a Standard Normal distribution, 2.5% of the area lies in
each "tail", outside of the values −1.96 and +1.96.
Instead of drawing out the N(ȳ^ACT, σ_y²/n) distribution and calculating areas, Equa-
tion 9.1 uses the values ȳ^ACT and σ_y²/n in order to transform the distribution to the
Standard Normal distribution N(0, 1), where the number 1.96 is well known. Essen-
tially, we are "standardizing"3 ȳ: creating a different variable that instead follows
N(0, 1), and using what we know about N(0, 1) (that ±1.96 puts 95% area in the
middle).
Example 9.6 — Confidence intervals using Equation 9.1. Returning to earlier dice
examples (Examples 9.4 and 9.5):
n = 10
ȳ = 4.1
σ_y² = 2.92
σ_y²/n = 0.292
√(σ_y²/n) = 0.54
95% CI = [ 4.1 − 1.96 × 0.54 , 4.1 + 1.96 × 0.54 ] = [3.04 , 5.16]
which matches the interval found in Example 9.5.
Figure 9.3: Each hypothetical sample of size n that we could draw (sample A, sample
B, etc.) provides a 95% confidence interval that has a 95% probability of containing the
true population mean µy . In reality, we will only draw one sample from the population,
and calculate one sample mean and interval. The confidence interval provides a measure
of the uncertainty surrounding ȳ.
The confidence interval is itself a random interval. The randomness all begins with
the random sample. From the random sample we get ȳ, which is a random variable.
From ȳ we get the confidence interval. Since ȳ is random, so must be the confidence
interval.
It turns out that there is a 95% probability that we will draw a sample that leads to
a 95% confidence interval containing µy . Of all the possible samples that we could draw
from the population, 95% of them will produce 95% confidence intervals that contain
the truth.
How not to interpret a confidence interval
Some misconceptions on how to interpret confidence intervals persist. The following
interpretations are wrong:
There’s a 95% probability that the true µy lies inside the 95% confidence interval.
The 95% confidence interval contains the true µy 95% of the time.
These interpretations are subtly wrong. The reason is that the interval is random,
and µy is a fixed parameter, not the other way around.
Margin of error
The distance in the confidence interval, on either side of ȳ, is sometimes called the
margin of error. That is:
95% margin of error = 1.96 × √(σ_y²/n)
The term "error" is in keeping with the idea that the confidence interval is measuring
the uncertainty surrounding an estimate.
As long as we are sampling from the same population so that σ_y² is constant, then
the width of the confidence interval only depends on n. From the dice examples,
where the true variance of a die roll is σ_y² = 35/12, the margin of error for a 95%
confidence interval using n = 20 will be:
1.96 × √((35/12)/20) = 0.749
so that any 95% confidence interval calculated by sampling n = 20 from this population will just be:
ȳ ± 0.749
With the smaller sample of n = 10 used in the earlier examples, the margin of error is 1.96 × √((35/12)/10) ≈ 1.06, so the interval is wider:
ȳ ± 1.06
qnorm(.005, mean = 0, sd = 1)
[1] -2.575829
In the qnorm() function we chose the area .005. This calculates the value for a Standard
Normal variable that puts 0.5% area in the left tail. This gives 1% area in both tails,
hence 99% area in the middle. Similarly, for the 95% critical value, we put 2.5% area
in the left tail of the distribution:
qnorm(.025, mean = 0, sd = 1)
[1] -1.959964
For the 90% critical value, we put 5% of the area in the left tail:
qnorm(.05, mean = 0, sd = 1)
[1] -1.644854
Figure 9.4: Standard Normal distribution. “Critical values” of ±2.58, ±1.96, and
±1.65 are used to construct 99%, 95%, 90% confidence intervals (respectively). These
numbers can be used when ȳ (at least approximately) follows a Normal distribution.
These critical values are depicted in Figure 9.4. Using these values, we can construct
confidence intervals with varying levels of confidence:
99% CI = ȳ ± 2.58 × √(σ_y²/n)
95% CI = ȳ ± 1.96 × √(σ_y²/n)
90% CI = ȳ ± 1.65 × √(σ_y²/n)
Example 9.8 — Confidence intervals for Mars incomes. Let’s calculate 99%, 95% and
90% confidence intervals using the sample of Mars colonists. In this chapter, we are
operating under the unrealistic assumption that the population variance is known.
Since the Mars data is fake (I generated it), in this example we can know what the
true population variance is. Load up the entire population of 720,720 colonists:
This will take a while since the file is large! Now that we can unrealistically “see”
the entire population, we can get the population variance of income for “employed”
individuals:
var(wholepop$income[wholepop$occupation == "employed"])
[1] 1452833175
So, the true population variance is σ²income = 1.4 billion. Now, pretend that we do
not have any other information on the population other than σ²income = 1.4 billion.
Let’s use the sample of n = 1000 colonists, calculate the sample mean income, and
construct the confidence intervals.
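Assuming the sample is loaded into a data frame called mars (the name used in later chapters),
the sample mean is:
mean(mars$income)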
[1] 80938.1
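The three confidence intervals can then be computed (a sketch, using the critical values 2.58,
1.96, and 1.65 from Figure 9.4):
se <- sqrt(1452833175 / 1000)     # standard error of the sample mean, using the known population variance
80938.1 + c(-1, 1) * 2.58 * se    # 99% CI, approximately 77828 to 84048
80938.1 + c(-1, 1) * 1.96 * se    # 95% CI, approximately 78576 to 83301
80938.1 + c(-1, 1) * 1.65 * se    # 90% CI, approximately 78949 to 82927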
H0 is the null hypothesis. The null hypothesis is “choosing” a value for the unknown
population mean, µy . The hypothesized value of the population mean is denoted µy,0 .
The alternative hypothesis is denoted by HA . One of the two situations must occur.
This is called a “two-sided” hypothesis test: the null hypothesis is wrong if the popu-
lation mean (µy ) is either “too small” or “too big” relative to the hypothesized value.
The hypothesis test concludes with either: (i) “reject” H0 in favour of HA , or (ii)
“fail to reject” H0 . We should never say that we “accept” either of the hypotheses: we
either have evidence to reject H0 , or we do not have enough evidence to reject H0 .
The decision to “reject” or “fail to reject” H0 may begin by the researcher subjectively
deciding on a significance level, and then doing one or more of the following: comparing
a p-value to the significance level, comparing a test statistic to a critical value, or
checking whether the hypothesized value lies inside a confidence interval.
We’ll use the example of the incomes of Mars colonists in order to illustrate hy-
pothesis testing. Suppose that the Mars government claims that the population mean
income of employed Mars colonists is 82,000. Let’s begin by formally stating the null
and alternative hypotheses:
H0 : µincome = 82000
HA : µincome ≠ 82000          (10.1)
If we think the government is lying, we will reject their claim. This is a “two sided”
hypothesis test; the government is lying if the true income is either greater than or less
than 82000.¹
Hypothesis Testing Assumption 2. The true population variance is known. For the
Mars income example, this means that σ²income is assumed to be known.
¹ We cover one-sided hypothesis tests in Section 10.6.
Figure 10.1: Sampling distribution of the sample average income if H0 : µincome = 82000
is correct. 2 × 19% = 38% is the probability of getting a “worse” sample average, and is
called the p-value.
[1] 0.04852874
If the population mean income is truly 82000, then there is only a 4.9% chance that
the sample average income is less than 80000.
10.3 p-values
After stating H0 and HA, the next step is to actually estimate the parameter (µy
for example) that the hypothesis is about. A p-value can then tell us whether the
difference between what we hypothesize (µy,0) and what we actually observe
from the sample (ȳ) is “large” enough to warrant rejection of the hypothesis.
For example, to test whether µincome = 82000 or not, we proceed by estimating
the unknown µincome . In several examples using a sample of n = 1000 employed Mars
colonists we have calculated that:
sample mean income = 80938
Notice that our estimate of 80938 is clearly different from our hypothesis that the
true population mean is 82000. The difference between what we actually estimated
from the sample, and our null hypothesis, is 82000 − 80938 = 1062. Just because there
is a difference does not imply we should reject H0 outright. We need to assess whether
this difference is “large”. Assessing whether the difference is large can be accomplished
using a p-value. We will only reject H0 if the probability of getting a sample mean income
(from another hypothetical sample drawn from the population) further away than 1062 is
small. This probability is called a p-value.
By “more adverse” we mean a difference ȳ − µy,0 that is even larger than the differ-
ence calculated with our given sample. If H0 is actually true, then the probability of
calculating a sample average that is more “extreme” than the one we just calculated is
two times the probability that the sample average income is less than 80938, using the
N(82000, 1452833) curve. From R, this probability is:
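# A sketch of the command, using the rounded sample mean of 80938 and the known population variance:
2 * pnorm(80938, mean = 82000, sd = sqrt(1452833175 / 1000))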
[1] 0.3782731
This is the p-value for our example hypothesis test. It tells us that, if H0 is true,
there is a 38% chance of getting a sample that would lead to a sample average income
that is further away from 82000 than the 80938 that was just calculated.
That is, out of all the hypothetical samples of n = 1000 that we could draw from
this population, 38% would give a sample average further than
82000 − 80938 = 1062 from the hypothesized value.
All that remains is to decide whether the p-value of 38% is “large” or “small”. This
decision is subjective. With a p-value of 38%, most researchers would decide to “fail
to reject” the null hypothesis.
Now, z is still Normally distributed, but has mean 0 and variance 1, since E[z] = 0 and:
Var[z] = Var(y/σy) = Var[y]/σ²y = σ²y/σ²y = 1
(refer to the rules of mean and variance in Sections 7.7.2 and 7.7.4).
How is standardization helpful for hypothesis testing? The sampling distribution of
ȳ under the null hypothesis is ȳ ∼ N(µy,0, σ²y/n). Create a new variable z. Subtract µy,0
(the mean of ȳ if the null is true) from ȳ; z then has mean 0 (if the null is actually true).
Divide by the standard error (standard error = the standard deviation of an estimator)
of ȳ, and z has variance of 1. That is:
z = (ȳ − µy,0) / √(σ²y/n) ∼ N(0, 1)
This is the “z -test statistic” for the null hypothesis that µy = µy,0 . If the null is
true, then z should be “close” to 0. The probability of observing a ȳ further away from
H0 than what we just observed from the sample is obtained by plugging ȳ and µy,0
into the z statistic formula, and calculating a probability using the Standard Normal
distribution. From our Mars incomes example, the z statistic is:
z = (80938 − 82000) / √(1452833175/1000) = −0.881
“What is the probability of getting further away than 80938 from the null hypothesis
of 82000?”
has just been translated to:
“What is the probability of an N (0, 1) variable being less than -0.881, or greater
than 0.881?”
Get this probability from R:
2 * pnorm(-0.881, mean = 0, sd = 1)
[1] 0.3783178
It is the same p-value that we obtained in Section 10.3! We only need to calculate the
area under the curve for several possible z values. These were tabulated long ago, and
are reproduced in Table 10.1.
mean(mars$income)
[1] 80938.1
z = (80938.1 − 82000) / √(1452833175/1000) = −0.881
2 * pnorm(-0.881, mean = 0, sd = 1)
[1] 0.3783178
Make a decision.
Since the p-value is greater than the significance level, we fail to reject H0 . There
is insufficient evidence to reject the claim that µincome = 82000. Note that we also
fail to reject the null at the 10% significance level.
If the null hypothesis value is inside the (1 − α) confidence interval, we will “fail to reject” that
null hypothesis at the α significance level.
Table 10.1: Area under the Standard Normal curve, to the right of z.
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 .5000 .4960 .4920 .4880 .4840 .4801 .4761 .4721 .4681 .4641
0.1 .4602 .4562 .4522 .4483 .4443 .4404 .4364 .4325 .4286 .4247
0.2 .4207 .4168 .4129 .4090 .4052 .4013 .3974 .3936 .3897 .3859
0.3 .3821 .3783 .3745 .3707 .3669 .3632 .3594 .3557 .3520 .3483
0.4 .3446 .3409 .3372 .3336 .3300 .3264 .3228 .3192 .3156 .3121
0.5 .3085 .3050 .3015 .2981 .2946 .2912 .2877 .2843 .2810 .2776
0.6 .2743 .2709 .2676 .2643 .2611 .2578 .2546 .2514 .2483 .2451
0.7 .2420 .2389 .2358 .2327 .2296 .2266 .2236 .2206 .2177 .2148
0.8 .2119 .2090 .2061 .2033 .2005 .1977 .1949 .1922 .1894 .1867
0.9 .1841 .1814 .1788 .1762 .1736 .1711 .1685 .1660 .1635 .1611
1.0 .1587 .1562 .1539 .1515 .1492 .1469 .1446 .1423 .1401 .1379
1.1 .1357 .1335 .1314 .1292 .1271 .1251 .1230 .1210 .1190 .1170
1.2 .1151 .1131 .1112 .1093 .1075 .1056 .1038 .1020 .1003 .0985
1.3 .0968 .0951 .0934 .0918 .0901 .0885 .0869 .0853 .0838 .0823
1.4 .0808 .0793 .0778 .0764 .0749 .0735 .0721 .0708 .0694 .0681
1.5 .0668 .0655 .0643 .0630 .0618 .0606 .0594 .0582 .0571 .0559
1.6 .0548 .0537 .0526 .0516 .0505 .0495 .0485 .0475 .0465 .0455
1.7 .0446 .0436 .0427 .0418 .0409 .0401 .0392 .0384 .0375 .0367
1.8 .0359 .0351 .0344 .0336 .0329 .0322 .0314 .0307 .0301 .0294
1.9 .0287 .0281 .0274 .0268 .0262 .0256 .0250 .0244 .0239 .0233
2.0 .0228 .0222 .0217 .0212 .0207 .0202 .0197 .0192 .0188 .0183
2.1 .0179 .0174 .0170 .0166 .0162 .0158 .0154 .0150 .0146 .0143
2.2 .0139 .0136 .0132 .0129 .0125 .0122 .0119 .0116 .0113 .0110
2.3 .0107 .0104 .0102 .0099 .0096 .0094 .0091 .0089 .0087 .0084
2.4 .0082 .0080 .0078 .0075 .0073 .0071 .0069 .0068 .0066 .0064
2.5 .0062 .0060 .0059 .0057 .0055 .0054 .0052 .0051 .0049 .0048
2.6 .0047 .0045 .0044 .0043 .0041 .0040 .0039 .0038 .0037 .0036
2.7 .0035 .0034 .0033 .0032 .0031 .0030 .0029 .0028 .0027 .0026
2.8 .0026 .0025 .0024 .0023 .0023 .0022 .0021 .0021 .0020 .0019
2.9 .0019 .0018 .0018 .0017 .0016 .0016 .0015 .0015 .0014 .0014
3.0 .0013 .0013 .0013 .0012 .0012 .0011 .0011 .0011 .0010 .0010
3.1 .0010 .0009 .0009 .0009 .0008 .0008 .0008 .0008 .0007 .0007
3.2 .0007 .0007 .0006 .0006 .0006 .0006 .0006 .0005 .0005 .0005
3.3 .0005 .0005 .0005 .0004 .0004 .0004 .0004 .0004 .0004 .0003
3.4 .0003 .0003 .0003 .0003 .0003 .0003 .0003 .0003 .0003 .0002
11. Hypothesis testing with unknown σ²
In this chapter, we tackle the situation where the population variance σ² is unknown.
If we want to perform hypothesis testing, we need to estimate this variance. So far our
confidence intervals, test statistics, and p-values all rely on the unknown value σ²:
confidence interval for ȳ:  ȳ ± zc × √(σ²y/n)
z test statistic:  z = (ȳ − µy,0) / √(σ²y/n)
Note that we have written the confidence interval using zc , instead of 1.96 (for
example). When σ²y is known, zc = 1.96 for the 95% confidence interval. This “1.96”
is coming from the Standard Normal distribution. In this chapter, zc will change to tc ,
once we estimate σ²y.
We will replace the unknown σ² with an estimator, s², and discuss how confidence
intervals, test statistics, and p-values are altered slightly by this substitution.
11.1 Estimating σ²
So far we have assumed that σ²y is known. After calculating ȳ, we needed this σ²y for
confidence intervals, test statistics, and p-values. Hypothesis testing relies on knowing
the population variance σ².
If we have to estimate µy , it is unlikely that we would know σ²y. That is, if the
population mean is unknown, it is likely that the population variance is unknown as
well. Equation 11.1 provides a way of estimating the unknown σ²y.
Sample variance of y.
s²y = [1/(n − 1)] × Σᵢ (yᵢ − ȳ)², where the sum runs over i = 1, . . . , n          (11.1)
A discussion of this formula (including where the n − 1 comes from), along with exam-
ples, was presented in Section 5.7 (review this section now). Review Example 5.11 to
see how to use R to calculate the sample variance.
All instances where we used σ² can use s² instead, with some minor modifications:
confidence interval for ȳ:  ȳ ± tc × √(s²y/n)
t test statistic:  t = (ȳ − µy,0) / √(s²y/n)
Replacing σ² with s². Confidence intervals and hypothesis testing change because
we are replacing a parameter (σ²) with a random estimator (s²).
s² has its own sampling distribution. Introducing another element of randomness
into confidence intervals and hypothesis testing has the effect of changing the relevant
underlying distributions slightly.
Where we used the Standard Normal distribution before for confidence intervals and
hypothesis testing, we should now use the t-distribution.
11.2 t-distribution
The t-distribution (in place of the Standard Normal distribution) can be used in the
calculation of confidence intervals, p-values, and for hypothesis testing in general. See
Section 6.5 for a review of the t-distribution. It is the appropriate distribution when σ²
is replaced by the estimator s² in the formula for the z-statistic. Whereas the z-statistic
follows the Standard Normal distribution:
z = (ȳ − µy) / √(σ²y/n) ∼ N(0, 1)
the t-statistic follows the t-distribution:
t = (ȳ − µy) / √(s²y/n) ∼ t(n−k)
The only difference is that σ² has been replaced with s², but introducing the random
estimator s² into the equation changes the distribution of z.
denoted t(n−k) . It is very similar to the N (0, 1) distribution, but it has fatter tails. It is
symmetric and centered at 0. The shape of the t-distribution depends on the degrees
of freedom (n − k), where n is our sample size, and k (in this case) is 1.
11.3 Confidence intervals using s²
Relationship between the t-distribution and Standard Normal. As the sample size
n grows, the t-distribution becomes identical to the Standard Normal distribution.
The developments in this chapter can essentially be ignored when the sample size n
is large enough. That is, the Standard Normal distribution is an approximation to
the t-distribution, and the approximation gets better as n increases.
The confidence interval for µy now uses the estimated variance s²y and a critical value tc
from the t-distribution:
ȳ ± tc × √(s²y/n)
In the previous chapters, recall that if we wanted a 95% confidence interval the critical
value (zc ) was found by finding the values on the x-axis that put 2.5% area in each tail
of the N (0, 1) distribution (see Figure 9.4). The 95% critical value using a Standard
Normal distribution can be found in R using:
qnorm(.025)
[1] -1.959964
This is where the number “1.96” in the confidence interval formula comes from. The
95% critical value using the t-distribution can be found in R using:
qt(.025, 19)
[1] -2.093024
Try increasing the number “19” in the command qt(.025, 19). You will see that
the critical value produced approaches 1.96. This number is the “degrees of freedom”
(n − k) for the t-distribution. The “19” would correspond to a sample size of n = 20.
For a sample this size, we can see that the confidence interval will be quite a bit wider
under the t-distribution compared to the Standard Normal distribution. This is always
the case: confidence intervals using the t-distribution are always wider than those using
the Standard Normal.
Example 11.1 — Confidence intervals for Mars incomes - unknown σ². This example
mimics Example 9.8, but here we use s² instead of σ² and the t-distribution instead
of the Standard Normal distribution. Let’s calculate the 95% confidence interval
around the sample mean income of Mars colonists. In this chapter, we are operating
under the realistic assumption that the population variance is unknown.
Using the sample of n = 1000 colonists, calculate the sample mean income:
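Assuming the sample is stored in a data frame called sample (the name used in the next command):
mean(sample$income)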
[1] 80938.1
sqrt(var(sample$income) / 1000)
[1] 1267.037
So, √(s²income/n) = 1267. Next we need the critical value from the t999 distribution:
qt(.025, 999)
[1] -1.962341
The critical value of 1.96 is a value that we have become accustomed to. This critical
value from the t-distribution is nearly identical to that from the Standard Normal.
This is because the sample size of n = 1000 is large enough that the t-distribution
is approximately Normal. Finally, we can calculate the confidence interval:
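# A sketch of the calculation, using the two numbers just computed:
80938.1 + c(-1, 1) * 1.962341 * 1267.037    # approximately 78452 to 83424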
We will see in Example 11.2 that the confidence interval can be generated automat-
ically in R using the t.test() function.
11.4 The t-test
Estimated variance of ȳ = s²y/n
We can implement hypothesis testing by replacing the unknown σ²y with its estimator
s²y. The z test statistic now becomes:
t = (ȳ − µy,0) / √(s²y/n)
This is the t statistic. Because we have replaced σ²y with s²y (a random estimator) in
the z statistic formula, the form of the randomness of z has changed. The t statistic is
no longer a Standard Normal variable. It follows its own probability distribution, called
the t-distribution. When performing a t test, the p-values are different than in Table
10.1 (those obtained from the Standard Normal distribution). However, as the sample
size grows, the t-distribution becomes the Standard Normal distribution. This means
that, for sample sizes of approximately n > 100, using the Standard Normal distribution
(Table 10.1) instead of the t-distribution makes very little difference.
Example 11.2 — Hypothesis on Mars incomes - t test. This example mimics Example
10.2, in which σ² was assumed to be known.
mean(mars$income)
[1] 80938.1
var(mars$income)
[1] 1605382317
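Calculate the t statistic. A sketch of this step, using the sample mean and sample variance just computed:
(mean(mars$income) - 82000) / sqrt(var(mars$income) / 1000)    # approximately -0.838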
This is somewhat close to the value of the z test statistic in Example 10.2 (-0.881).
Calculate the p-value.
We can look up the value 0.838 in Table 10.1 to get an approximate p-value. The
number in the table is 0.2005. Multiplying by 2 (since it’s a two sided test), gives
a p-value of 0.4010 (compared to 0.378 in Example 10.2). Get a p-value from the
t-distribution using R:
2 * pt(-0.838, 999)
[1] 0.4022311
The p-value for this test is 0.402. The “999” in pt(-0.838, 999) is the degrees of
freedom (n − k = 1000 − 1).
Using the R function t.test()
Use R to accomplish all of the above, in one command:
t.test(mars$income, mu=82000)
data: mars$income
t = -0.8381, df = 999, p-value = 0.4022
alternative hypothesis: true mean is not equal to 82000
95 percent confidence interval:
78451.74 83424.45
sample estimates:
mean of x
80938.1
The same p-value of 0.402 was found above. Notice that t.test() also provides
the 95% confidence interval. The null hypothesis is inside this confidence interval,
so we will end up failing to reject H0 at the 5% significance level (at least).
Make a decision.
Since the p-value is greater than the significance level, we fail to reject H0 . There
is insufficient evidence to reject the claim that µincome = 82000. Note that we also
fail to reject the null at the 10% significance level.
12. Least-squares regression
This chapter introduces least-squares regression, which is a way of modelling and quan-
tifying the relationship between two or more variables. In the preceding chapters, we
have mostly considered methods of analysis that deal with only one variable. In many
cases in economics (and other subjects), we want to know how much a change in one
variable might be associated with, or cause, a change in another variable.
How might you quantify the relationship between two variables? The relationship
between an x and a y variable is depicted in Figure 12.1 (see Section 4.7 for a review
of scatter plots). Download the data from Figure 12.1, and reproduce the scatter plot
yourself:
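A sketch of these commands; the file name xy.csv is hypothetical, while the object name data
matches the code used below:
data <- read.csv("http://rtgodwin.com/data/xy.csv")   # hypothetical file name
plot(data$x, data$y)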
Looking at Figure 12.1, there appears to be a strong, positive, and linear relationship
between x and y. One way to quantify this linear relationship is by using the correlation
coefficient:
cor(data$x, data$y)
[1] 0.8919703
Another way to represent the relationship is to fit a line through the data points. In R:
plot(data$x, data$y)
abline(lm(data$y ~ data$x))
Figure 12.2: A least-squares line has been “fitted” through the scatter plot of x and y.
The line in Figure 12.2 is called a least-squares regression line. The term “regres-
sion” refers to taking the information from all of the data points and “regressing” or
reducing it to a single line. If the vertical distances between the data points and the
line are just random “noise”, then the relationship between x and y can be represented
by the regression line (provided we make a few other assumptions, for example that
the relationship is linear). The equation of a straight line can be written:
y = a + bx
where y and x are variables, a is the “intercept” of the line, and b is the “slope” of
the line. Often in econometrics we instead use the symbol b0 for the intercept, and b1
for the slope. Since there is some randomness involved such that the data points in
Figure 12.2 are scattered around the regression line, we will write the equation of the
line in terms of ŷ instead of y:
ŷ = b0 + b1 x
The variable ŷ is the least-squares predicted y value, which we will explain in Section
12.5. The values for the intercept b0 and the slope b1 can be calculated using the lm()
function in R:
lm(data$y ~ data$x)
Call:
lm(formula = data$y ~ data$x)
Coefficients:
(Intercept) data$x
18.264 2.071
From the R output, the intercept for the line in Figure 12.2 is b0 = 18.3. The slope of
the line is b1 = 2.1. The line in Figure 12.2 is written as:
ŷ = 18.3 + 2.1x
Figure 12.3: Mars completely controls the price of alcohol, and has experimented with
different prices to see how colonists respond with their quantity of drinks demanded.
The slope of b1 = −1.79 is the average decrease in drinks demanded when the Mars
government increases the price of alcohol by 1.
12.4 Formula for the intercept and slope of the regression line
How are the least-squares regression lines in Figures 12.2 and 12.3 “fitted”? That is,
how are b0 and b1 chosen? As the name implies, the “least-squares” regression line
has something to do with minimizing squared values. In particular, the line is chosen
such that the sum of all of the squared vertical distances between the regression line
and data points is minimized. “Least” refers to minimizing, and “squares” refers to
squaring the distance between the points and the regression line.
Each of these vertical distances is called a “residual”. There is one residual for
each data point, and we refer to a single residual as ei . See Figure 12.4. The formulas
for b0 and b1 can be found by solving a calculus minimization problem (which is beyond
this course):
b1 = Σᵢ (yᵢ − ȳ)(xᵢ − x̄) / Σᵢ (xᵢ − x̄)², where the sums run over i = 1, . . . , n
b0 = ȳ − b1 x̄          (12.1)
Equation 12.1 tells us how to use the x and y data in such a way as to pick the intercept
and slope of a line that passes closely through the data points, by minimizing the sum
of all of the squared vertical distances.
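As a check, Equation 12.1 can be applied directly in R (a sketch, assuming the data frame data
from Figure 12.1 is loaded) and compared to the lm() output:
b1 <- sum((data$y - mean(data$y)) * (data$x - mean(data$x))) / sum((data$x - mean(data$x))^2)
b0 <- mean(data$y) - b1 * mean(data$x)
b1    # should match the lm() slope of about 2.07
b0    # should match the lm() intercept of about 18.26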
Each residual is the difference between the actual and the predicted value of y:
e = y − ŷ
The sum of all of these squared residuals is the very thing that the least-squares re-
gression line minimizes.
Now that we have defined the least-squares residuals, we can write a new equation
for the y variable:
y = b0 + b1 x + e (12.2)
Equation 12.2 says that each y value has a predictable part (b0 + b1 x), and an unpre-
dictable part that cannot be explained (e).
12.6 R-squared
R-squared (R²) is a “measure of fit” of the least-squares regression line. It is a number
between 0 and 1. R² indicates how close the data points are to the regression line.
R-squared is the portion of sample variance in the y variable that can be explained by
variation in the x variable.
The assumption is that changes in x are associated with or are leading to changes
in y. But, changes in x are not the only reason, or explanation, for changes in y. There
are unobservable variables that are leading to changes in y, otherwise all of the data
points in the scatter plot would line up exactly in a straight line. R² helps answer the
question: how much of the change in y is coming from x? Two equivalent ways of
interpreting R² are: (i) the proportion of the sample variation in y that is explained by
variation in x, and (ii) the square of the correlation coefficient between x and y.
To get the R² using R, we need to put the lm() function inside of the summary()
function in order to get some more information about the “fitted” regression line:
summary(lm(data$y ~ data$x))
Call:
lm(formula = data$y ~ data$x)
Residuals:
Min 1Q Median 3Q Max
-32.811 -12.356 -1.758 9.887 55.437
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.2636 15.6668 1.166 0.249
data$x 2.0712 0.1515 13.669 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In this R output, make sure you can find the intercept and slope (b0 = 18.2636, b1 =
2.0712). There is quite a bit of information provided, including R² = 0.7956. This
value can be interpreted as: 79.6% of the variation in y can be explained using the x
variable.
R-squared provides a measure of the predictive power of the fitted least-squares
regression line. The higher the value of R², the more power x has for explaining or
predicting values of y. A low value for R² does not mean that x is insignificant, however.
Whether or not x is significant in explaining changes in y is better left to a hypothesis
test (covered in the next chapter).
The fitted least-squares regression line has several useful properties: (1) the residuals
sum to zero; (2) the mean of the predicted values ŷ equals the mean of the actual y
values; (3) the regression line passes through the point (x̄, ȳ).
These facts become important when we delve into more advanced topics in econometrics.
These three results can be proven mathematically, but in this text we only illustrate
that they are true via Example 12.1.
Example 12.1 — Three least-squares facts illustrated. We’ll illustrate the three results
using the data from Figures 12.1 - 12.6. Load the data into R, and in this example
we’ll suppress scientific notation:
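A sketch of the setup (the file name is hypothetical; the object name ls.line is the one used below):
options(scipen = 999)                                 # suppress scientific notation
data <- read.csv("http://rtgodwin.com/data/xy.csv")   # hypothetical file name
ls.line <- lm(data$y ~ data$x)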
From the saved object ls.line we can extract the residuals and the in-sample
predicted or fitted values. To illustrate result (1), save the residuals and check that
they sum to zero (or are at least very close to zero):
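One way to do this (the residuals are stored in the ls.line object):
sum(ls.line$residuals)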
[1] 0.0000000000001425526
To illustrate fact number (2), get the predicted or fitted values, and check that the
mean of the fitted values equals the mean of the actual y values:
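For example (the fitted values are also stored in ls.line):
mean(ls.line$fitted.values)
mean(data$y)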
[1] 229.5679
[1] 229.5679
To verify fact (3), we can plot the point (x̄, ȳ), draw the least-squares regression
line, and check that the line passes through this point:
⁷ There are rare situations where the intercept b0 is excluded from the model. In this case, these
properties do not necessarily hold.
plot(mean(data$x), mean(data$y))
abline(ls.line)
12.8 Least-squares regression example
The idea here is that GDP per capita is associated with per capita CO2 emissions.
We’ll fit a least-squares line through the data, which will give us b1 . Then, b1 will tell
us how an increase in GDP is associated with an increase in CO2 emissions. We don’t
know if GDP causes CO2 emissions (or vice versa); such causal statements are beyond
the scope of this statistical analysis.
To make the interpretation of b1 easier, measure GDP per capita in 1000s of dollars:
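A sketch of this step, assuming GDP per capita is recorded in dollars in the data frame data:
data$gdp <- data$gdp / 1000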
Now, estimate the least-squares regression line, and get some “summary” information:
summary(lm(data$co2 ~ data$gdp))
Call:
lm(formula = data$co2 ~ data$gdp)
Residuals:
Min 1Q Median 3Q Max
-11.8964 -1.0479 -0.6367 0.0841 28.2401
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.49730 0.45799 1.086 0.28
data$gdp 0.33110 0.02675 12.380 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Plot the data, and add the least-squares regression line to the plot (see Figure 12.7):
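A sketch of the plotting commands, mirroring those used for Figure 12.2:
plot(data$gdp, data$co2)
abline(lm(data$co2 ~ data$gdp))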
The slope of the least-squares regression line is b1 = 0.33. This can be interpreted as:
an increase in per capita GDP of $1000 is associated with an increase in per capita
CO2 emissions of 0.33. The R-squared of 0.54 tells us that per capita GDP explains
54% of the variation in per capita CO2 emissions between countries.
Figure 12.7: Per capita CO2 and GDP, with fitted least-squares regression line
Note that it would be better to fit a regression line through the logs of per capita
GDP and CO2 emissions (see Figure 4.13). Doing so would give b1 a percentage change
interpretation. Using logs in a least-squares regression is a more advanced topic that
is left out of this text.
13. Least-squares continued
y = β0 + β1 x + ϵ (13.1)
Once again Greek letters (β) are being used to denote unknown population parameters.
β0 and β1 are expressing a true linear relationship between y and x. The least-squares
intercept (b0 ) and slope (b1 ) from the previous chapter are just estimators for β0 and
β1 . The terms β0 , β1 , and ϵ are unobservable components of the model. Typically, the
goal is to use the y and x data in order to estimate β0 and β1 , which can be tricky
given the random “noise” introduced through ϵ.
An observable counterpart to the linear population model has already been shown
in Section 12.5:
y = b0 + b1 x + e
The least-squares regression model “replaces” the unobservable components of the lin-
ear population model: β0 → b0 , β1 → b1 , and ϵ → e.
In the linear population model, the slope β1 has a very important interpretation. It
represents the true change in y associated with a 1 unit change in x. If there is a causal
relationship between x and y, then β1 measures the causal effect of x on y.
Random error term ϵ. The linear population model contains a random error term ϵ.
The error term encapsulates all variables that determine y, besides x. ϵ contains the
effects of many variables acting on y, all summed together into one term.
Next, we need to install and load the tseries package into R, which contains the
Jarque-Bera test:
install.packages("tseries")
library("tseries")
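The Jarque-Bera test is applied to the least-squares residuals, so they need to be saved first.
A sketch of this step (the model object name is hypothetical):
resids <- model$residuals   # hypothetical: the fitted lm() object from the regression being checked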
Finally, we apply the Jarque-Bera test to the residuals from our least-squares regression:
jarque.bera.test(resids)
data: resids
X-squared = 3.7476, df = 2, p-value = 0.1535
Since the p-value is greater than 0.1, we fail to reject the null hypothesis at the 10%
significance level. In this model, the assumption that the random error term is Normally
distributed seems plausible.
Q-Q plot
A quantile-quantile plot compares the sample quantiles¹ of a variable to the theoretical
quantiles of the Normal distribution (or any other distribution). The Q-Q
plot is a visual diagnostic tool. If the sample quantiles, plotted against the theoretical
quantiles, appear to follow a straight line, then the variable is thought to be Normally
distributed. To generate a Q-Q Normal plot in R, we can use the qqnorm function,
along with qqline in order to draw a straight line through the plot.
qqnorm(resids)
qqline(resids)
The plot generated by the above R code is shown in Figure 13.1. Most researchers
would conclude that the Q-Q plot supports the conclusion of the Jarque-Bera test: the
residuals appear to be Normally distributed, indicating that the Normality assumption
for ϵ is reasonable.
13.4 Hypothesis testing and confidence intervals
H0 : β1 = β1,0
HA : β1 ≠ β1,0
The 0 subscript in β1,0 denotes the value for β1 under the null hypothesis. We could
also make hypotheses about β0 , but usually the focus is on β1 . We can use the t-test
to perform hypothesis tests involving the β in the linear population model.
¹ See Section 5.4 on quartiles and percentiles, which are similar to quantiles.
Figure 13.1: Q-Q Normal plot. If the variable is Normally distributed, then the sample
quantiles should “line up” with the theoretical quantiles from the Normal distribution.
In Section 11.4, the t-test statistic used for hypotheses on the population mean µ was:
t = (ȳ − µy,0) / √(s²y/n)
In general, a t-statistic divides the difference between an estimate and the hypothesized
value of the parameter by the standard error of the estimator:
t = (estimate − hypothesized value) / s.e.(estimate)          (13.2)
Applying the generic formula (Equation 13.2) to the least-squares situation gives
us the t-test statistic for testing hypotheses about β1:
t = (b1 − β1,0) / s.e.(b1)
This t-statistic follows the same t-distribution (see Section 11.2) whether it is used
to test a population mean µ or a population slope β1 . We can obtain p-values from
the t-distribution the same way in which we did in Chapter 11. If the sample size n is
large, then this t-statistic is approximately Standard Normal N (0, 1), and we can use
Table 10.1 to obtain p-values.
The p-value is compared to a significance level (see Section 10.4). As before, if
the p-value exceeds the significance level (if p-value > α), we fail to reject the null
hypothesis.
Suppose that we want to test:
H0 : β1 = 2
HA : β1 ≠ 2
To calculate the t-test statistic using R, we need the least-squares estimate b1 , and the
s.e.(b1 ). Load data, estimate the model, and use the summary() function:
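This repeats the summary() command from Section 12.6:
summary(lm(data$y ~ data$x))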
From the output, we see that b1 = 2.07, and s.e.(b1 ) = 0.15. The t-statistic for testing
H0 : β1 = 2 is:
t = (2.07 − 2) / 0.15 = 0.47
This t-statistic follows the t-distribution, whose shape is determined by the degrees of
freedom, n − k (see Section 11.2). In the present context, n − k = 50 − 2 = 48 (because
the sample size is 50 and two least-squares estimates have been calculated, b0 and b1 ).
The p-value is:
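# A sketch of the command (lower.tail = FALSE gives the area to the right of 0.47, as described below):
pt(0.47, 48, lower.tail = FALSE)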
[1] 0.3202417
If the null is correct, then the expected value of the t-statistic is 0. That is, the dif-
ference between what we estimate and hypothesize (b1 −β1,0 ) should be zero on average.
The p-value for this hypothesis test comes from the area in the t-distribution, to the
right of 0.47. This gives us the probability of calculating a b1 that is more “extreme”
than the value of 2.07 that we just calculated. Thus, we need to set lower.tail =
FALSE in the above R code. If the sample size n is large enough, then the t-distribution
is well approximated by the N (0, 1) distribution. Using Table 10.1, we obtain the same
value of 0.32.
13.5 Tests of “significance”
y = β0 + β1 x + ϵ
H0 : β1 = 0
HA : β1 ≠ 0
If β1 = 0 then x does not have a linear effect on y. That is, a change in x does not
lead to a change in y. The marginal effect of x on y is zero. If H0 : β1 = 0 is rejected,
then x is said to be “significant”. If we fail to reject H0 : β1 = 0, then x is said to be
“insignificant”.
A confidence interval for β1 takes the same form as the confidence intervals in earlier chapters:
95% CI for β1 = b1 ± 1.96 × s.e.(b1)          (13.3)
In Equation 13.3, the value of 1.96 comes from the 95% confidence level (−1.96 and
1.96 put 2.5% area in each tail of the Standard Normal distribution). A 90% or 99%
confidence interval would use 1.65 or 2.58 respectively, instead of 1.96. We can obtain
these critical values in R using:
these critical values in R using:
qnorm(.05)
qnorm(.025)
qnorm(.005)
[1] -1.644854
[1] -1.959964
[1] -2.575829
For small samples, we could instead use the critical values from the t-distribution,
to get more accurate confidence intervals (the critical values of 1.65, 1.96, and 2.58
are approximate values for when the sample is large). Using a degrees of freedom of
n − k = 48 (for example), we get critical values of 1.68, 2.01, and 2.68 for the 90%, 95%
and 99% confidence levels:
qt(.05, 48)
qt(.025, 48)
qt(.005, 48)
[1] -1.677224
[1] -2.010635
[1] -2.682204
Recall the estimated coefficients from the summary() output in Section 12.6:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.2636 15.6668 1.166 0.249
data$x 2.0712 0.1515 13.669 <2e-16 ***
With b1 = 2.07, a critical value of 2.01 (from the t-distribution), and s.e.(b1 ) = 0.15,
the 95% confidence interval for the above example is:
95% CI = 2.07 ± 2.01 × 0.15 = [1.77 , 2.37]
Notice that this confidence interval contains the value β1 = 2 from the null hypothesis in
the previous section. The 95% confidence interval around b1 contains all null hypothesis
values for β1 that will not be rejected at the 5% significance level.
13.7 Least-squares regression analysis
The following examples go through these steps using the Mars data. The population model is:
income = β0 + β1 years.on.earth + ϵ
β1 is the true effect on income of an additional year spent living on Earth. ϵ contains
all of the other factors that determine income.
To estimate β0 and β1 in this model, use the lm() command in R, and the summary()
command to see the results:
model1 <- lm(income ~ years.on.earth, data = mars)
summary(model1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 78287.77 2338.59 33.476 <2e-16 ***
years.on.earth 123.45 91.91 1.343 0.18
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
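For example, a 95% confidence interval for β1 can be constructed from the estimate and standard
error in the output above (a sketch of this step):
123.45 ± 1.96 × 91.91 = [−56.7, 303.6]
Notice that zero is inside this interval.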
Because the sample size is large (n = 1000), it makes little difference whether we
get the critical value (1.96) from the Standard Normal distribution, or from the
t-distribution.
Conduct a hypothesis test
Test the hypothesis that years spent on Earth has no effect on income:
H0 : β1 = 0
HA : β1 ≠ 0
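The t-statistic is t = (123.45 − 0)/91.91 = 1.34. A sketch of the p-value calculation, using the
Standard Normal approximation (reasonable here because n is large):
2 * pnorm(-1.34)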
[1] 0.1802453
With a p-value of 18%, we fail to reject the null hypothesis. It appears that years
on Earth has no significant effect on income. Note that R has already calculated this
t-statistic and p-value for us: in the output from summary(model1) above, make sure
that you can find the values for the t-statistic of 1.343 and p-value of 0.18.
14. Multiple regression
In this final chapter, we briefly introduce multiple regression (in contrast to single
variable regression in the previous two chapters). Multiple regression refers to least-
squares estimation of population models that include more than one “x” variable.
In practice, the vast majority of models estimated by researchers contain multiple
regressors (x variables), and in reality there are likely many factors that determine the
value for y. A linear population model with multiple x variables can be written:
y = β0 + β1 x1 + β2 x2 + β3 x3 + · · · + βk xk + ϵ (14.1)
k is the total number of x variables in the model, all of which may determine the value
for y. β1 is the effect of x1 on y, holding all other x variables constant. Similarly, β2 is
the effect of x2 on y, etc.
There are several reasons for including more than one x variable in the population
model. One of the most important is to control for lurking or confounding variables,
and we will briefly investigate this reason as it relates to causal inference.
Estimation, interpretation of the estimates, and hypothesis testing and confidence
intervals, all remain largely unchanged in the multiple regression model compared to
the single variable regression model.
Figure 14.1: A hidden x2 variable that determines both y and x1 will make estimation
of the effect of x1 on y difficult (or impossible).
y = β0 + β1 x1 + ϵ
If a hidden x2 variable determines both y and x1 (as in Figure 14.1) but is left out of the
model, the estimate b1 gives the wrong answer for the true effect of x1 on y. The reason is that:
A change in x2 is associated with a change in both x1 and y.
When we “see” x1 changing, we know x2 is also changing.
Attributing changes in y due to changes in x1 alone becomes impossible, since we
don’t know how much of the change in y came from x2 .
The solution to the problem is to include the x2 variable in the model! If we can’t
actually observe x2 then we must use clever strategies and more advanced methods to
attempt to estimate the effect of x1 on y.
In R:
mars <- read.csv("http://rtgodwin.com/data/mars.csv")
model <- lm(income ~ years.education + years.on.earth, data = mars)
summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6746.60 5313.59 1.270 0.2045
years.education 4740.39 315.68 15.016 <2e-16 ***
years.on.earth -141.89 74.33 -1.909 0.0566 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that the R output <2e-16 is using scientific notation for the p-value, and means
that the p-value is less than 2 × 10−16 , and is the smallest decimal that R can represent
(the p-value is essentially zero).
14.3 Lurking or confounding variables
Figure 14.2: A hidden x2 variable that determines both y and x1 will make estimation
of the effect of x1 on y difficult (or impossible).
A lurking (or confounding) variable is a hidden x2 variable that determines both y and x1
but is left out of the population model. That is, the estimated β1 (b1 ) is wrong in the single
variable population model:
y = β0 + β1 x1 + ϵ
To illustrate the issue, consider the population model:
income = β0 + β1 age + ϵ
We might guess that age has a positive effect on income, as we tend to see people
making more money the older they are. Let’s try estimating this model in R using the
Mars data:
mars <- read.csv("http://rtgodwin.com/data/mars.csv")
model1 <- lm(income ~ age, data = mars)
summary(model1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71634.84 4161.95 17.212 <2e-16 ***
age 223.49 95.26 2.346 0.0192 *
The estimation results suggest that each additional year of age is associated with an
increase in income of 223.49, on average. Now, test the hypothesis that age has zero
effect on income (that age does not determine or is not associated with income). This
is a test of the “significance” of the variable age:
H0 : β1 = 0
HA : β1 ≠ 0
R has already calculated the t-statistic (2.346) and p-value (0.0192) for this hypothesis
test. Since the p-value is less than 0.05, we reject the null hypothesis at the 5%
significance level, and conclude that age is a “significant” determinant of income.
Given our recent discussion for the need for the multiple regression model, can you
think of any lurking variables? We should be thinking about other variables that are
correlated with (or related to) age, and that also determine income. What about education?
The older a person is, the more education they are likely to have. A worker who is 18
cannot have more than 12 years of education.
indicating (acting as a proxy for) years of education. Consider the following population
model instead:
income = β0 + β1 age + β2 years.education + ϵ
In this model, we can examine the effect of age on income while controlling for edu-
cation. It is as if we can compare workers who all have the same education, but differ
only in their age. Estimate this model in R:
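A sketch of the commands (the object name model2 is hypothetical):
model2 <- lm(income ~ age + years.education, data = mars)
summary(model2)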
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -607.56 5861.52 -0.104 0.917
age 42.88 85.83 0.500 0.617
years.education 5010.40 314.32 15.940 <2e-16 ***
Notice that the effect of age on income has fallen, and is no longer significant!
After controlling for education, we now conclude that age is not a determining factor
of income. The positive relationship between age and education (older people tend to
have more education) resulted in the overestimation of the effect of age, when education
was omitted from the model.
What if we include only years of education? Consider the population model:
income = β0 + β1 years.education + ϵ
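A sketch of the estimation command for this model:
summary(lm(income ~ years.education, data = mars))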
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 847.5 5084.9 0.167 0.868
years.education 5031.1 311.5 16.154 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A possible lurking variable here is IQ. People with a higher IQ may tend to get more education,
because they have an easier time obtaining an education (it is less costly). A higher IQ may
also lead to higher incomes.
Let’s include these variables in a population model:
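A sketch of this step. The population model (with the variables ordered as in the output below) is:
income = β0 + β1 years.education + β2 IQ + β3 age + β4 years.on.earth + ϵ
and it can be estimated with:
summary(lm(income ~ years.education + IQ + age + years.on.earth, data = mars))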
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38238.81 8804.80 -4.343 1.55e-05 ***
years.education 2447.18 508.11 4.816 1.69e-06 ***
IQ 788.99 124.13 6.356 3.14e-10 ***
age 31.55 91.04 0.347 0.729
years.on.earth 35.19 120.39 0.292 0.770
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Notice that: