The target audience for this book is college students who are required to learn
statistics, students with little background in mathematics and often no motiva-
tion to learn more. It is assumed that the students do have basic skills in using
computers and have access to one. Moreover, it is assumed that the students
are willing to actively follow the discussion in the text, to practice, and more
importantly, to think.
Teaching statistics is a challenge. Teaching it to students who are required
to learn the subject as part of their curriculum is an art mastered by few. In
the past I have tried to master this art and failed. In desperation, I wrote this
book.
This book uses the basic structure of a generic introduction to statistics course.
However, in some ways I have chosen to diverge from the traditional approach.
One divergence is the introduction of R as part of the learning process. Many
have used statistical packages or spreadsheets as tools for teaching statistics.
Others have used R in advanced courses. I am not aware of attempts to use
R in introductory level courses. Indeed, mastering R requires much investment
of time and energy that may be distracting and counterproductive for learning
more fundamental issues. Yet, I believe that if one restricts the application of
R to a limited number of commands, the benefits that R provides outweigh the
difficulties that R engenders.
Another departure from the standard approach is the treatment of proba-
bility as part of the course. In this book I do not attempt to teach probability
as a subject matter, but only specific elements of it which I feel are essential
for understanding statistics. Hence, Kolmogorov’s Axioms are out, as are
attempts to prove basic theorems and a Balls and Urns type of discussion. On
the other hand, emphasis is given to the notion of a random variable and, in
that context, the sample space.
The first part of the book deals with descriptive statistics and provides prob-
ability concepts that are required for the interpretation of statistical inference.
Statistical inference is the subject of the second part of the book.
The first chapter is a short introduction to statistics and probability. Stu-
dents are required to have access to R right from the start. Instructions regarding
the installation of R on a PC are provided.
The second chapter deals with data structures and variation. Chapter 3
provides numerical and graphical tools for presenting and summarizing the dis-
tribution of data.
The fundamentals of probability are treated in Chapters 4 to 7. The concept
of a random variable is presented in Chapter 4 and examples of special types of
random variables are discussed in Chapter 5. Chapter 6 deals with the Normal
Contents

Preface

I Introduction to Statistics

1 Introduction
1.1 Student Learning Objectives
1.2 Why Learn Statistics?
1.3 Statistics
1.4 Probability
1.5 Key Terms
1.6 The R Programming Environment
1.6.1 Some Basic R Commands
1.7 Solved Exercises
1.8 Summary

3 Descriptive Statistics
3.1 Student Learning Objectives
3.2 Displaying Data
3.2.1 Histograms
3.2.2 Box Plots
3.3 Measures of the Center of Data
3.3.1 Skewness, the Mean and the Median
3.4 Measures of the Spread of Data

4 Probability
4.1 Student Learning Objective
4.2 Different Forms of Variability
4.3 A Population
4.4 Random Variables
4.4.1 Sample Space and Distribution
4.4.2 Expectation and Standard Deviation
4.5 Probability and Statistics
4.6 Solved Exercises
4.7 Summary

5 Random Variables
5.1 Student Learning Objective
5.2 Discrete Random Variables
5.2.1 The Binomial Random Variable
5.2.2 The Poisson Random Variable
5.3 Continuous Random Variable
5.3.1 The Uniform Random Variable
5.3.2 The Exponential Random Variable
5.4 Solved Exercises
5.5 Summary
Part I
Introduction to Statistics

Chapter 1
Introduction
[Figure 1.1: Frequency of Average Time (in Hours) Spent Sleeping per Night. A bar plot with Time (in hours) on the x-axis and Frequency on the y-axis.]
Included in this chapter are the basic ideas and words of probability and
statistics. In the process of studying this course, and more so in the second part
of the book, you will understand that statistics and probability work together.
1.3 Statistics
The science of statistics deals with the collection, analysis, interpretation, and
presentation of data. We see and use data in our everyday lives. To be able
to use data correctly is essential to many professions and in your own best
self-interest.
For example, assume the average time (in hours, to the nearest half-hour) a
group of people sleep per night has been recorded. Consider the following data:
5, 5.5, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 7, 7, 8, 8, 9 .
In Figure 1.1 this data is presented in a graphical form (called a bar plot). A
bar plot consists of a number line and bars (vertical lines) positioned above the
number line. The length of each bar corresponds to the number of data points
that obtain the given numerical value. In the given plot the frequency of average
time (in hours) spent sleeping per night is presented with hours of sleep on the
horizontal x-axis and frequency on the vertical y-axis.
Think of the following questions:
• Would the bar plot constructed from data collected from a different group
of people look the same as or different from the example? Why?
• If the same example were carried out in a different group with the same
size and age composition as the one used for the example, do you think the results
would be the same? Why or why not?
• Where does the data appear to cluster? How could you interpret the
clustering?
The questions above ask you to analyze and interpret your data. With this
example, you have begun your study of statistics.
In this course, you will learn how to organize and summarize data. Or-
ganizing and summarizing data is called descriptive statistics. Two ways to
summarize data are by graphing and by numbers (for example, finding an av-
erage). In the second part of the book you will also learn how to use formal
methods for drawing conclusions from “good” data. The formal methods are
called inferential statistics. Statistical inference uses probabilistic concepts to
determine if conclusions drawn are reliable or not.
Effective interpretation of data is based on good procedures for producing
data and thoughtful examination of the data. In the process of learning how
to interpret data you will probably encounter what may seem to be too many
mathematical formulae that describe these procedures. However, you should
always remember that the goal of statistics is not to perform numerous calcu-
lations using the formulae, but to gain an understanding of your data. The
calculations can be done using a calculator or a computer. The understanding
must come from you. If you can thoroughly grasp the basics of statistics, you
can be more confident in the decisions you make in life.
1.4 Probability
Probability is the mathematical theory used to study uncertainty. It provides
tools for the formalization and quantification of the notion of uncertainty. In
particular, it deals with the chance of an event occurring. For example, if the
different potential outcomes of an experiment are equally likely to occur then
the probability of each outcome is taken to be the reciprocal of the number of
potential outcomes. As an illustration, consider tossing a fair coin. There are
two possible outcomes – a head or a tail – and the probability of each outcome
is 1/2.
If you toss a fair coin 4 times, the outcomes may not necessarily be 2 heads
and 2 tails. However, if you toss the same coin 4,000 times, the outcomes will
be close to 2,000 heads and 2,000 tails. It is very unlikely to obtain more than
2,060 tails and it is similarly unlikely to obtain less than 1,940 tails. This is
consistent with the expected theoretical probability of heads in any one toss.
Even though the outcomes of a few repetitions are uncertain, there is a regular
pattern of outcomes when the number of repetitions is large.
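One may get a feel for this regular pattern with a short simulation in R. The following is a minimal sketch that uses the function “sample” to toss a fair coin 4,000 times and to count the outcomes; the counts change from run to run, but they will typically be close to 2,000 heads and 2,000 tails:
> tosses <- sample(c("Head","Tail"), size = 4000, replace = TRUE)
> table(tosses)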
The theory of probability began with the study of games of chance such as
poker. Today, probability is used to predict the likelihood of an earthquake, of
rain, or whether you will get an “A” in this course. Doctors use probability
to determine the chance of a vaccination causing the disease the vaccination is
supposed to prevent. A stockbroker uses probability to determine the rate of
return on a client’s investments. You might use probability to decide to buy a
lottery ticket or not.
Although probability is instrumental for the development of the theory of
statistics, in this introductory course we will not develop the mathematical the-
ory of probability. Instead, we will concentrate on the philosophical aspects of
the theory and use computerized simulations in order to demonstrate proba-
bilistic computations that are applied in statistical inference.
> 1+2
[1] 3
>
The prompt “>” indicates that the system is ready to receive commands. Writ-
ing an expression, such as “1+2”, and hitting the Return key sends the expression
to be executed. The execution of the expression may produce an object, in this
case an object that is composed of a single number, the number “3”.
Whenever required, the R system takes an action. If no other specifications
are given regarding the required action then the system will apply the pre-
programmed action. This action is called the default action. In the case of
hitting the Return key after the expression that we wrote, the default action is to
display the produced object on the screen.
Next, let us demonstrate R in a more meaningful way by using it in order
to produce the bar-plot of Figure 1.1. First we have to input the data. We
will produce a sequence of numbers that form the data2 . For that we will use
the function “c” that combines its arguments and produces a sequence with the
arguments as the components of the sequence. Write the expression:
> c(5,5.5,6,6,6,6.5,6.5,6.5,6.5,7,7,8,8,9)
at the prompt and hit return. The result should look like this:
> c(5,5.5,6,6,6,6.5,6.5,6.5,6.5,7,7,8,8,9)
[1] 5.0 5.5 6.0 6.0 6.0 6.5 6.5 6.5 6.5 7.0 7.0 8.0 8.0 9.0
>
The function “c” is an example of an R function. A function has a name, “c”
in this case, that is followed by brackets that include the input to the function.
We call the components of the input the arguments of the function. Arguments
are separated by commas. A function produces an output, which is typically
an R object. In the current example an object of the form of a sequence was
created and, according to the default application of the system, was sent to the
screen and not saved.
If we want to create an object for further manipulation then we should save
it and give it a name. For example, if we want to save the vector of data under
the name “X” we may write the following expression at the prompt (and then
hit return):
> X <- c(5,5.5,6,6,6,6.5,6.5,6.5,6.5,7,7,8,8,9)
>
The arrow that appears after the “X” is produced by typing the less than key
“<” followed by the minus key “-”. This arrow is the assignment operator.
Observe that you may save typing by calling and editing lines of code that
were processed in an earlier part of the session. One may browse through the
lines using the up and down arrows on the right-hand side of the keyboard and
use the right and left arrows to move along the line presented at the prompt.
For example, the last expression may be produced by finding first the line that
used the function “c” with the up and down arrow and then moving to the
beginning of the line with the left arrow. At the beginning of the line all one
has to do is type “X <- ” and hit the Return key.
Notice that no output was sent to the screen. Instead, the output from the
“c” function was assigned to an object that has the name “X”. A new object
by the given name was formed and it is now available for further analysis. In
order to verify this you may write “X” at the prompt and hit return:
> X
[1] 5.0 5.5 6.0 6.0 6.0 6.5 6.5 6.5 6.5 7.0 7.0 8.0 8.0 9.0
The content of the object “X” is sent to the screen, which is the default output.
Notice that we have not changed the given object, which is still in the memory.
2 In R, a sequence of numbers is called a vector. However, we will use the term sequence to
refer to vectors.
The object “X” is in the memory, but it is not saved on the hard disk.
With the end of the session the objects created in the session are erased unless
specifically saved. The saving of all the objects that were created during the
session can be done when the session is finished. Hence, when you close the
R Console window a dialog box will open (See the screenshot in Figure 1.2).
Via this dialog box you can choose to save the objects that were created in the
session by selecting “Yes”, not to save by selecting the option “No”, or you may
decide to abort the process of shutting down the session by selecting “Cancel”.
If you save the objects then they will be uploaded to the memory the next time
that the R Console is opened.
We used a capital letter to name the object. We could have used a small
letter just as well or practically any combination of letters. However, you should
note that R distinguishes between capital and small letters. Hence, typing “x”
in the console window and hitting return will produce an error message:
> x
Error: object "x" not found
An object named “x” does not exist in the R system and we have not created
such an object. The object “X”, on the other hand, does exist.
Names of functions that are part of the system are fixed but you are free to
choose names for the objects that you create. For example, if one wants to create
an object by the name “my.vector” that contains the numbers 3, 7, 3, 3, and
-5 then one may write the expression “my.vector <- c(3,7,3,3,-5)” at the
prompt and hit the Return key.
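For instance, a minimal sketch of such a session, together with the output one may expect when the name of the new object is typed at the prompt:
> my.vector <- c(3,7,3,3,-5)
> my.vector
[1]  3  7  3  3 -5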
If we want to produce a table that contains a count of the frequency of the
different values in our data we can apply the function “table” to the object “X”.
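The output of “table” may then be passed to the function “plot” in order to produce the bar plot. A sketch of these two steps (the alignment of the printed table may look slightly different on your screen):
> table(X)
X
  5 5.5   6 6.5   7   8   9
  1   1   3   4   2   2   1
> plot(table(X))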
[Figure: The bar plot produced by the expression “plot(table(X))”, with the values of “X” on the x-axis and their frequencies, “table(X)”, on the y-axis.]
Clearly, if one wants to produce a bar-plot to other numerical data all one has
to do is replace in the expression “plot(table(X))” the object “X” by an object
that contains the other data. For example, to plot the data in “my.vector” you
may use “plot(table(my.vector))”.
1.7 Solved Exercises
2. The percentage, among all registered voters of the given party, of those
that prefer a male candidate.
4. The voters in the state that are registered to the given party.
Solution (to Question 1.1.2): The percentage, among all registered voters
of the given party, of those that prefer a male candidate is a parameter. This
quantity is a characteristic of the population.
Solution (to Question 1.1.3): It is given that 42% of the sample prefer a
female candidate. This quantity is a numerical characteristic of the data, of the
sample. Hence, it is a statistic.
Solution (to Question 1.1.4): The voters in the state that are registered to
the given party is the target population.
Question 1.2. The number of customers that wait in front of a coffee shop at
the opening was reported during 25 days. The results were:
4, 2, 1, 1, 0, 2, 1, 2, 4, 2, 5, 3, 1, 5, 1, 5, 1, 2, 1, 1, 3, 4, 2, 4, 3 .
[Figure 1.4: The bar plot produced by the expression “plot(table(n.cost))”, with the number of waiting customers (n.cost) on the x-axis and the frequencies on the y-axis.]
3. The number of waiting customers that occurred the least number of times.
Solution (to Question 1.2): One may read the data into R and create a table
using the code:
> n.cost <- c(4,2,1,1,0,2,1,2,4,2,5,3,1,5,1,5,1,2,1,1,3,4,2,4,3)
> table(n.cost)
n.cost
0 1 2 3 4 5
1 8 6 3 4 3
For convenience, one may also create the bar plot of the data using the code:
> plot(table(n.cost))
The bar plot is presented in Figure 1.4.
Solution (to Question 1.2.2): The number of waiting customers that oc-
curred the largest number of times is 1. The value “1” occurred 8 times, more
than any other value. Notice that the bar above this value is the highest.
Solution (to Question 1.2.3): The value “0”, which occurred only once,
occurred the least number of times.
1.8 Summary
Glossary
Data: A set of observations taken on a sample from a population.
Statistic: A numerical characteristic of the data. A statistic estimates the
corresponding population parameter. For example, the average number
of contributions to the course’s forum for this term is an estimate for the
average number of contributions in all future terms (parameter).
Statistics: The science that deals with processing, presentation and inference
from data.
Probability: A mathematical field that models and investigates the notion of
randomness.
Chapter 2
Sampling and Data Structures
liquid, was not put into the cans. Manufacturers regularly run tests to determine
if the amount of beverage in a 16-ounce can falls within the desired range.
Be aware that if an investigator collects data, the data may vary somewhat
from the data someone else is taking for the same purpose. This is completely
natural. However, if two investigators or more are taking data from the same
source and get very different results, it is time for them to reevaluate their
data-collection methods and data recording accuracy.
2.2.3 Frequency
The primary way of summarizing the variability of data is via the frequency
distribution. Consider an example. Twenty students were asked how many
hours they worked per day. Their responses, in hours, are listed below:
5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3 .
Let us create an R object by the name “work.hours” that contains these data:
> work.hours <- c(5,6,3,3,2,4,7,5,2,3,5,6,5,4,4,3,5,2,5,3)
Next, let us create a table that summarizes the different values of working hours
and the frequency in which these values appear in the data:
> table(work.hours)
work.hours
2 3 4 5 6 7
3 5 3 6 2 1
Recall that the function “table” takes as input a sequence of data and produces
as output the frequencies of the different values.
We may get a clearer understanding of the meaning of the output of the
function “table” if we present the outcome as a frequency table that lists the different
data values in ascending order together with their frequencies. To that end we may apply
the function “data.frame” to the output of the “table” function and obtain:
> data.frame(table(work.hours))
work.hours Freq
1 2 3
2 3 5
3 4 3
4 5 6
5 6 2
6 7 1
A frequency is the number of times a given datum occurs in a data set.
According to the table above, there are three students who work 2 hours, five
students who work 3 hours, etc. The total of the frequency column, 20, repre-
sents the total number of students included in the sample.
The function “data.frame” transforms its input into a data frame, which is
the standard way of storing statistical data. We will introduce data frames in
more detail in Section 2.3 below.
A relative frequency is the fraction of times a value occurs. To find the
relative frequencies, divide each frequency by the total number of students in
the sample – 20 in this case. Relative frequencies can be written as fractions,
percents, or decimals.
As an illustration let us compute the relative frequencies in our data:
> freq <- table(work.hours)
> freq
work.hours
2 3 4 5 6 7
3 5 3 6 2 1
> sum(freq)
[1] 20
> freq/sum(freq)
work.hours
2 3 4 5 6 7
0.15 0.25 0.15 0.30 0.10 0.05
We stored the frequencies in an object called “freq”. The content of the object
is the sequence of frequencies 3, 5, 3, 6, 2 and 1. The function “sum” sums the components
of its input. The sum of the frequencies is the sample size, the total number of
students that responded to the survey, which is 20. Hence, when we apply the
function “sum” to the object “freq” we get 20 as an output.
The outcome of dividing an object by a number is a division of each ele-
ment in the object by the given number. Therefore, when we divide “freq” by
“sum(freq)” (the number 20) we get a sequence of relative frequencies. The
first entry to this sequence is 3/20 = 0.15, the second entry is 5/20 = 0.25, and
the last entry is 1/20 = 0.05. The sum of the relative frequencies should always
be equal to 1:
> sum(freq/sum(freq))
[1] 1
The cumulative relative frequency is the accumulation of previous relative
frequencies. To find the cumulative relative frequencies, add all the previous
relative frequencies to the relative frequency of the current value. Alternatively,
we may apply the function “cumsum” to the sequence of relative frequencies:
> cumsum(freq/sum(freq))
2 3 4 5 6 7
0.15 0.40 0.55 0.85 0.95 1.00
Observe that the cumulative relative frequency of the smallest value 2 is the
frequency of that value (0.15). The cumulative relative frequency of the second
value 3 is the sum of the relative frequency of the smaller value (0.15) and
the relative frequency of the current value (0.25), which produces a total of
0.15 + 0.25 = 0.40. Likewise, for the third value 4 we get a cumulative relative
frequency of 0.15 + 0.25 + 0.15 = 0.55. The last entry of the cumulative relative
frequency column is one, indicating that one hundred percent of the data has
been accumulated.
The computation of the cumulative relative frequency was carried out with
the aid of the function “cumsum”. This function takes as an input argument a
numerical sequence and produces as output a numerical sequence of the same
length with the cumulative sums of the components of the input sequence.
Causality: A relationship between two variables does not mean that one causes
the other to occur. They may both be related (correlated) because of their
relationship to a third variable.
following rows may contain the data values associated with this variable. When
saving, the spreadsheet should be saved in the CSV format by the use of the
“Save As” dialog and choosing there the option of CSV in the “Save as type”
selection.
After saving a file with the data in a directory, R should be notified where
the file is located in order to be able to read it. A simple way of doing so is
by setting the directory with the file as R’s working directory. The working
directory is the first place R is searching for files. Files produced by R are saved
in that directory. In Windows, during an active R session, one may set the
working directory to be some target directory with the “File/Change Dir...”
dialog. This dialog is opened by selecting the option “File” on the left hand
side of the menu bar on the top of the R Console window. Selecting the option of
“Change Dir...” in the menu that opens will start the dialog. (See Figure 2.2.)
Browsing via this dialog window to the directory of choice, selecting it, and
approving the selection by clicking the “OK” button in the dialog window will
set the directory of choice as the working directory of R.
Rather than changing the working directory every time that R is opened one
may set a selected directory to be R’s working directory on opening. Again, we
demonstrate how to do this on the XP Windows operating system.
The R icon was added to the Desktop when the R system was installed.
The R Console is opened by double-clicking on this icon. One may change
the properties of the icon so that it sets a directory of choice as R’s working
directory.
In order to do so click on the icon with the right mouse button. A menu
opens in which you should select the option “Properties”. As a result, a dialog
window opens. (See Figure 2.3.) Look at the line that starts with the words
“Start in” and continues with a name of a directory that is the current working
directory. The name of this directory is enclosed in double quotes and is given
with its full path, i.e. its address on the computer. This name and path should
be changed to the name and path of the directory that you want to fix as the
new working directory.
Consider again Figure 2.1. Imagine that one wants to fix the directory that
contains the file “ex1.csv” as the permanent working directory. Notice that
the full address of the directory appears at the “Address” bar on the top of
the window. One may copy the address and paste it instead of the name of the
current working directory that is specified in the “Properties” dialog of the
R icon. One should make sure that the address to the new directory is, again,
placed between double-quotes. (See in Figure 2.4 the dialog window after the
changing the address of the working directory. Compare this to Figure 2.3 of
the window before the change.) After approving the change by clicking the
“OK” button the new working directory is set. Henceforth, each time that the
R Console is opened by double-clicking the icon it will have the designated
directory as its working directory.
In the rest of this book we assume that a designated directory is set as R’s
working directory and that all external files that need to be read into R, such
as “ex1.csv” for example, are saved in that working directory. Once a working
directory has been set then the history of subsequent R sessions is stored in that
directory. Hence, if you choose to save the image of the session when you end
the session then objects created in the session will be uploaded the next time
that the R Console is opened.
to the file, should be provided. The file need not reside on the computer. One may provide,
for example, a URL (an internet address) as the address. Thus, instead of saving the file of the
example on the computer one may read its content into an R object by using the line of code
“ex.1 <- read.csv("http://pluto.huji.ac.il/~msby/StatThink/Datasets/ex1.csv")” in-
stead of the code that we provide and the working method that we recommend to follow.
When the values of the variable are numerical we say that it is a quantitative
variable or a numeric variable. On the other hand, if the variable has qualitative
or level values we say that it is a factor. In the given example, sex is a factor
and height is a numeric variable.
The rows of the table are called observations and correspond to the subjects.
In this data set there are 100 subjects, with subject number 1, for example, being
a female of height 182 cm and identifying number 5696379. Subject number 98,
on the other hand, is a male of height 195 cm and identifying number 9383288.
Notice that backpacks carrying three books can have different weights. Weights
are quantitative continuous data because weights are measured.
Example 3 (Data Sample of Qualitative Data). The data are the colors of
backpacks. Again, you sample the same five students. One student has a red
backpack, two students have black backpacks, one student has a green backpack,
and one student has a gray backpack. The colors red, black, black, green, and
gray are qualitative data.
The distinction between continuous and discrete numeric data is usually not
reflected in the statistical methods that are used in order to analyze the data.
Indeed, R does not distinguish between these two types of numeric data and
stores them both as “numeric”. Consequently, we will also not worry about
the specific categorization of numeric data and will treat them as one. On the other
hand, emphasis will be given to the difference between numeric variables and factors.
One may collect data as numbers and report it categorically. For example,
the quiz scores for each student are recorded throughout the term. At the end
of the term, the quiz scores are reported as A, B, C, D, or F. On the other hand,
one may code categories of qualitative data with numerical values and report
the values. The resulting data should nonetheless be treated as a factor.
By default, R saves variables that contain non-numeric values as factors.
Otherwise, the variables are saved as numeric. The variable type is important
because different statistical methods are applied to different data types. Hence,
one should make sure that the variables that are analyzed have the appropriate
type. In particular, factors that use numbers to denote their levels should be labeled
as factors. Otherwise R will treat them as quantitative data.
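As an illustration (a hypothetical sketch; the object name “code” is made up for this example), a sequence of numerical codes that actually denote categories can be converted explicitly with the function “factor”, after which R summarizes it by counting the levels:
> code <- c(1,2,2,1,1,2)
> code.factor <- factor(code)
> summary(code.factor)
1 2
3 3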
Solution (to Question 2.1.1): The relative frequency of direct hits of cat-
egory 1 is 0.3993. Notice that the cumulative relative frequency of category
1 and 2 hits, the sum of the relative frequency of both categories, is 0.6630.
The relative frequency of category 2 hits is 0.2637. Consequently, the relative
frequency of direct hits of category 1 is 0.6630 - 0.2637 = 0.3993.
Solution (to Question 2.1.2): The relative frequency of direct hits of cat-
egory 4 or more is 0.0769. Observe that the cumulative relative frequency of the value “3” is
0.6630 + 0.2601 = 0.9231. This follows from the fact that the cumulative rela-
tive frequency of the value “2” is 0.6630 and the relative frequency of the value
“3” is 0.2601. The total cumulative relative frequency is 1.0000. The relative
frequency of direct hits of category 4 or more is the difference between the to-
tal cumulative relative frequency and cumulative relative frequency of 3 hits:
1.0000 - 0.9231 = 0.0769.
Question 2.2. The number of calves that were born to some cows during their
productive years was recorded. The data was entered into an R object by the
name “calves”. Refer to the following R code:
3. What is the relative frequency of cows that gave birth to at least 4 calves?
Solution (to Question 2.2.1): The total number of cows that were involved
in this study is 45. The object “freq” contains the table of frequencies of the
cows, divided according to the number of calves that they had. The cumulative
frequency of all the cows that had 7 calves or less, which includes all cows in
the study, is reported under the number “7” in the output of the expression
“cumsum(freq)”. This number is 45.
Solution (to Question 2.2.2): The number of cows that gave birth to a total
of 4 calves is 10. Indeed, the cumulative frequency of cows that gave birth to
4 calves or less is 28. The cumulative frequency of cows that gave birth to 3
calves or less is 18. The frequency of cows that gave birth to exactly 4 calves is
the difference between these two numbers: 28 - 18 = 10.
Solution (to Question 2.2.3): The relative frequency of cows that gave birth
to at least 4 calves is 27/45 = 0.6. Notice that the cumulative frequency of cows that
gave at most 3 calves is 18. The total number of cows is 45. Hence, the number
of cows with 4 or more calves is the difference between these two numbers, 45
- 18 = 27, and the relative frequency is the ratio 27/45 = 0.6.
2.5 Summary
Glossary
Population: The collection, or set, of all individuals, objects, or measurements
whose properties are being studied.
Sample: A portion of the population under study. A sample is representative
if it characterizes the population being studied.
Frequency: The number of times a value occurs in the data.
Relative Frequency: The ratio between the frequency and the size of data.
Cumulative Relative Frequency: The term applies to an ordered set of data
values from smallest to largest. The cumulative relative frequency is the
sum of the relative frequencies for all values that are less than or equal to
the given value.
Data Frame: A tabular format for storing statistical data. Columns corre-
spond to variables and rows correspond to observations.
Chapter 3
Descriptive Statistics
[Figure 3.1: Histogram of ex.1$height, with the height values on the x-axis and Frequency on the y-axis.]
our emphasis will be on histograms and box plots, which are other types of
plots. Some of the other types of graphs that are frequently used, but will not
be discussed in this book, are the stem-and-leaf plot, the frequency polygon
(a type of broken line graph) and the pie chart. The types of plots that will
be discussed and the types that will not are all tightly linked to the notion of
frequency of the data that was introduced in Chapter 2 and are intended to give a
graphical representation of this notion.
3.2.1 Histograms
The histogram is a frequently used method for displaying the distribution of
continuous numerical data. An advantage of a histogram is that it can readily
display large data sets. A rule of thumb is to use a histogram when the data
set consists of 100 values or more.
One may produce a histogram in R by the application of the function “hist”
to a sequence of numerical data. Let us read into R the data frame “ex.1” that
contains data on the sex and height and create a histogram of the heights:
> ex.1 <- read.csv("ex1.csv")
> hist(ex.1$height)
The outcome of the function is a plot that appears in the graphical window and
is presented in Figure 3.1.
The data set, which is the content of the CSV file “ex1.csv”, was used in
Chapter 2 in order to demonstrate the reading of data that is stored in an external
file into R. The first line of the above script reads in the data from “ex1.csv”
into a data frame object named “ex.1” that maintains the data internally in R.
The second line of the script produces the histogram. We will discuss below the
code associated with this second line.
A histogram consists of contiguous boxes. It has both a horizontal axis and
a vertical axis. The horizontal axis is labeled with what the data represents (the
height, in this example). The vertical axis presents frequencies and is labeled
”Frequency”. By the examination of the histogram one can appreciate the shape
of the data, the center, and the spread of the data.
The histogram is constructed by dividing the range of the data (the x-axis)
into equal intervals, which are the bases for the boxes. The height of each box
represents the count of the number of observations that fall within the interval.
For example, consider the box with the base between 160 and 170. There is a
total of 19 subjects with height larger than 160 but no more than 170 (that is,
160 < height ≤ 170). Consequently, the height of that box is 19.
The input to the function “hist” should be a sequence of numerical values.
In principle, one may use the function “c” to produce a sequence of data and
apply the histogram plotting function to the output of the sequence producing
function. However, in the current case we already have the data stored in the
data frame “ex.1”; all we need to learn is how to extract that data so it can be
used as input to the function “hist” that plots the histogram.
Notice the structure of the input that we have used in order to construct
the histogram of the variable “height” in the “ex.1” data frame. One may
address the variable “variable.name” in the data frame “dataframe.name”
using the format: “dataframe.name$variable.name”. Indeed, when we type
the expression “ex.1$height” we get as an output the values of the variable
“height” from the given data frame:
> ex.1$height
[1] 182 168 172 154 174 176 193 156 157 186 143 182 194 187 171
[16] 178 157 156 172 157 171 164 142 140 202 176 165 176 175 170
[31] 169 153 169 158 208 185 157 147 160 173 164 182 175 165 194
[46] 178 178 186 165 180 174 169 173 199 163 160 172 177 165 205
[61] 193 158 180 167 165 183 171 191 191 152 148 176 155 156 177
[76] 180 186 167 174 171 148 153 136 199 161 150 181 166 147 168
[91] 188 170 189 117 174 187 141 195 129 172
This is a numeric sequence and can serve as the input to a function that expects a
numeric sequence as input, a function such as “hist”. (But also other functions,
for example, “sum” and “cumsum”.)
There are 100 observations in the variable “ex.1$height”. So many ob-
servations cannot be displayed on the screen on one line. Consequently, the
sequence of the data is wrapped and displayed over several lines. Notice that
the square brackets on the left hand side of each line indicate the position in
the sequence of the first value on that line. Hence, the number on the first line
is “[1]”. The number at the beginning of the second line is “[16]” since in the display
given in the book the second line starts with the 16th observation. Notice that the
numbers in the square brackets on your R Console window may be different,
depending on the setting of the display on your computer.
The median is between the 7th value, 6.8, and the 8th value 7.2. To find the
median, add the two values together and divide by 2:
(6.8 + 7.2) / 2 = 7
The median is 7. Half of the values are smaller than 7 and half of the values
are larger than 7.
Quartiles are numbers that separate the data into quarters. Quartiles may
or may not be part of the data. To find the quartiles, first find the median or
second quartile. The first quartile is the middle value of the lower half of the
data and the third quartile is the middle value of the upper half of the data.
For illustration consider the same data set from above:
1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5 .
The median or second quartile is 7. The lower half of the data is:
1, 1, 2, 2, 4, 6, 6.8 .
The middle value of the lower half is 2. The number 2, which is part of the data
in this case, is the first quartile which is denoted Q1. One-fourth of the values
are the same or less than 2 and three-fourths of the values are more than 2.
The upper half of the data is:
7.2, 8, 8.3, 9, 10, 10, 11.5 .
The middle value of the upper half is 9. The number 9 is the third quartile
which is denoted Q3. Three-fourths of the values are less than 9 and one-fourth
of the values are more than 9.
Outliers are values that do not fit with the rest of the data and lie outside of
the normal range. Data points with values that are much too large or much too
small in comparison to the vast majority of the observations will be identified
as outliers. In the context of the construction of a box plot we identify potential
outliers with the help of the inter-quartile range (IQR). The inter-quartile range
is the distance between the third quartile (Q3) and the first quartile (Q1), i.e.,
IQR = Q3 − Q1. A data point that is larger than the third quartile plus 1.5
times the inter-quartile range will be marked as a potential outlier. Likewise,
a data point smaller than the first quartile minus 1.5 times the inter-quartile
range will also be so marked. Outliers may have a substantial effect on the
outcome of statistical analysis, therefore it is important that one is alerted to
the presence of outliers.
In the running example we obtained an inter-quartile range of size 9-2=7.
1 The actual computation in R of the first quartile and the third quartile may vary slightly
from the description given here, depending on the exact structure of the data.
[Figure 3.3: Box plot of the height data in “ex.1”.]
The upper threshold for defining an outlier is 9 + 1.5 × 7 = 19.5 and the lower
threshold is 2 − 1.5 × 7 = −8.5. All data points are within the two thresholds,
hence there are no outliers in this data.
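As a quick check (a sketch; R's default method for computing quartiles differs slightly from the hand computation above, and the exact print format may vary), one may apply the function “summary” to the 14 data points:
> summary(c(1,11.5,6,7.2,4,8,9,10,6.8,8.3,2,2,10,1))
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.500 7.000 6.200 8.825 11.500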
In the construction of a box plot one uses a vertical rectangular box and two
vertical “whiskers” that extend from the ends of the box to the smallest and
largest data values that are not outliers. Outlier values, if any exist, are marked
as points above or below the endpoints of the whiskers. The smallest and largest
non-outlier data values label the endpoints of the axis. The first quartile marks
one end of the box and the third quartile marks the other end of the box. The
central 50% of the data fall within the box.
One may produce a box plot with the aid of the function “boxplot”. The
input to the function is a sequence of numerical values and the output is a plot.
As an example, let us produce the box plot of the 14 data points that were used
as an illustration:
> boxplot(c(1,11.5,6,7.2,4,8,9,10,6.8,8.3,2,2,10,1))
The resulting box plot is presented in Figure 3.2. Observe that the end
points of the whiskers are 1, for the minimal value, and 11.5 for the largest
value. The end values of the box are 9 for the third quartile and 2 for the first
quartile. The median 7 is marked inside the box.
Next, let us examine the box plot for the height data:
> boxplot(ex.1$height)
The resulting box plot is presented in Figure 3.3. In order to assess the plot let
us compute quartiles of the variable:
> summary(ex.1$height)
Min. 1st Qu. Median Mean 3rd Qu. Max.
117.0 158.0 171.0 170.1 180.2 208.0
The function “summary”, when applied to a numerical sequence, produces the
minimal and maximal entries, as well as the first, second and third quartiles (the
second is the Median). It also computes the average of the numbers (the Mean),
which will be discussed in the next section.
Let us compare the results with the plot in Figure 3.3. Observe that the
median 171 coincides with the thick horizontal line inside the box and that the
lower end of the box coincides with the first quartile 158.0 and the upper end with
180.2, which is the third quartile. The inter-quartile range is 180.2 − 158.0 =
22.2. The upper threshold is 180.2 + 1.5 × 22.2 = 213.5. This threshold is
larger than the largest observation (208.0). Hence, the largest observation is
not an outlier and it marks the end of the upper whisker. The lower threshold
is 158.0 − 1.5 × 22.2 = 124.7. The minimal observation (117.0) is less than this
threshold. Hence it is an outlier and it is marked as a point below the end of the
lower whisker. The second smallest observation is 129. It lies above the lower
threshold and it marks the end point of the lower whisker.
[Figure 3.4: Three histograms of the data stored in “x”. In each panel the values, between 4 and 10, are on the x-axis and Frequency is on the y-axis.]
Alternatively, we may note that the distinct values in the sample are 1, 2, 3,
and 4 with relative frequencies of 3/11, 2/11, 1/11 and 5/11, respectively. The
alternative method of computation produces:
x̄ = 1 × (3/11) + 2 × (2/11) + 3 × (1/11) + 4 × (5/11) = 2.7 .
4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10
This data produces the upper most histogram in Figure 3.4. Each interval has
width one and each value is located at the middle of an interval. The histogram
displays a symmetrical distribution of data. A distribution is symmetrical if a
vertical line can be drawn at some point in the histogram such that the shape
to the left and to the right of the vertical line are mirror images of each other.
Let us compute the mean and the median of this data:
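A minimal sketch of this computation, with the data stored in the object “x” that is used in the text below:
> x <- c(4,5,6,6,6,7,7,7,7,7,7,8,8,8,9,10)
> mean(x)
[1] 7
> median(x)
[1] 7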
The mean and the median are each 7 for these data. In a perfectly symmetrical
distribution, the mean and the median are the same.
The functions “mean” and “median” were used in order to compute the mean
and median. Both functions expect a numeric sequence as an input and produce
the appropriate measure of centrality of the sequence as an output.
The histogram for the data:
4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8
is not symmetrical and is displayed in the middle of Figure 3.4. The right-hand
side seems “chopped off” compared to the left side. The shape of the distribution
is called skewed to the left because it is pulled out towards the left.
Let us compute the mean and the median for this data:
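Again a sketch, this time reassigning the object “x” to the new data:
> x <- c(4,5,6,6,6,7,7,7,7,7,7,8)
> mean(x)
[1] 6.416667
> median(x)
[1] 7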
(Notice that the original data is replaced by the new data when object x is
reassigned.) The median is still 7, but the mean is less than 7. The relation
between the mean and the median reflects the skewing.
Consider yet another set of data:
6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10
The histogram for the data is also not symmetrical and is displayed at the
bottom of Figure 3.4. Notice that it is skewed to the right. Compute the mean
and the median:
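A sketch of the computation for this third data set:
> x <- c(6,7,7,7,7,7,7,8,8,8,9,10)
> mean(x)
[1] 7.583333
> median(x)
[1] 7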
The median is yet again equal to 7, but this time the mean is greater than 7.
Again, the mean reflects the skewing.
In summary, if the distribution of data is skewed to the left then the mean
is less than the median. If the distribution of data is skewed to the right then
the median is less than the mean.
2 In the case of a symmetric distribution the vertical line of symmetry is located at the
mean, which coincides with the median.
Examine the data on the height in “ex.1”:
> mean(ex.1$height)
[1] 170.11
> median(ex.1$height)
[1] 171
Observe that the histogram of the height (Figure 3.1) is skewed to the left. This
is consistent with the fact that the mean is less than the median.
9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5, 11, 11, 11, 11, 11, 11,
11.5, 11.5, 11.5 .
In order to explain the computation of the variance of these data let us create
an object x that contains the data:
> x <- c(9,9.5,9.5,10,10,10,10,10.5,10.5,10.5,10.5,11,11,11,11,11,
+ 11,11.5,11.5,11.5)
> length(x)
[1] 20
Pay attention to the fact that we did not write the “+” at the beginning of the
second line. That symbol was produced by R when moving to the next line to
indicate that the expression is not complete yet and will not be executed. Only
after inputting the right bracket and hitting the Return key does R carry
out the command and create the object “x”. When you execute this example
yourself on your own computer make sure not to copy the “+” sign. Instead, if
you hit the return key after the last comma on the first line, the plus sign will
be produced by R as a new prompt and you can go on typing in the rest of the
numbers.
The function “length” returns the length of the input sequence. Notice that
we have a total of 20 data points.
The next step involves the computation of the deviations:
> x.bar <- mean(x)
> x.bar
[1] 10.525
> x - x.bar
[1] -1.525 -1.025 -1.025 -0.525 -0.525 -0.525 -0.525 -0.025
[9] -0.025 -0.025 -0.025  0.475  0.475  0.475  0.475  0.475
[17]  0.475  0.975  0.975  0.975
The average of the observations is equal to 10.525 and when we subtract this
number from each of the components of the sequence x we obtain the deviations.
For example, the first deviation is obtained as 9 - 10.525 = -1.525, the second
deviation is 9.5 - 10.525 = -1.025, and so forth. The 20th deviation is 11.5 -
10.525 = 0.975, and this is the last number that is presented in the output.
From a more technical point of view observe that the expression that com-
puted the deviations, “x - x.bar”, involved the subtraction of a single value
(x.bar) from a sequence with 20 values (x). The expression resulted in the
subtraction of the value from each component of the sequence. This is an example
of the general way by which R operates on sequences. The typical behavior of
R is to apply an operation to each component of the sequence.
As yet another illustration of this property consider the computation of the
squares of the deviations:
> (x - x.bar)^2
[1] 2.325625 1.050625 1.050625 0.275625 0.275625 0.275625
[7] 0.275625 0.000625 0.000625 0.000625 0.000625 0.225625
[13] 0.225625 0.225625 0.225625 0.225625 0.225625 0.950625
[19] 0.950625 0.950625
Recall that “x - x.bar” is a sequence of length 20. We apply the square func-
tion to this sequence. This function is applied to each of the components of the
sequence. Indeed, for the first component we have that (−1.525)2 = 2.325625,
for the second component (−1.025)2 = 1.050625, and for the last component
(0.975)2 = 0.950625.
For the variance we sum the square of the deviations and divide by the total
number of data values minus one (n − 1). The standard deviation is obtained
by taking the square root of the variance:
> var(x)
[1] 0.5125
> sd(x)
[1] 0.715891
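One may verify, as a sketch, that the same value of the variance is obtained directly from the definition by summing the squared deviations and dividing by the number of observations minus one:
> sum((x - x.bar)^2)/(length(x) - 1)
[1] 0.5125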
The reason for dividing by n − 1, rather than by n, stems from the theory of statistical
inference that will be discussed in Part II of this book. Unless the size of the data is small,
dividing by n or by n − 1 does not introduce much of a difference.
The variance is a squared measure and does not have the same units as
the data. Taking the square root solves the problem. The standard deviation
measures the spread in the same units as the data.
The sample standard deviation, s, is either zero or is larger than zero. When
s = 0, there is no spread and the data values are equal to each other. When s
is a lot larger than zero, the data values are very spread out about the mean.
Outliers can make s very large.
The standard deviation is a number that measures how far data values are
from their mean. For example, if the data contains the value 7 and if the mean
of the data is 5 and the standard deviation is 2, then the value 7 is one standard
deviation from its mean because 5 + 1 × 2 = 7. We say, then, that 7 is one
standard deviation larger than the mean 5 (or also say “to the right of 5”). If
the value 1 was also part of the data set, then 1 is two standard deviations
smaller than the mean (or two standard deviations to the left of 5) because
5 − 2 × 2 = 1.
The standard deviation, when first presented, may not be too simple to
interpret. By graphing your data, you can get a better “feel” for the deviations
and the standard deviation. You will find that in symmetrical distributions, the
standard deviation can be very helpful but in skewed distributions, the standard
deviation is less so. The reason is that the two sides of a skewed distribution
have different spreads. In a skewed distribution, it is better to look at the first
quartile, the median, the third quartile, the smallest value, and the largest value.
> summary(x1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.498 3.218 3.081 3.840 4.871
> summary(x2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0001083 0.5772000 1.5070000 1.8420000 2.9050000 4.9880000
> summary(x3)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 3.391 4.020 4.077 4.690 6.414
In Figure 3.5 one may find the histograms of these three data sequences, given
in a random order. In Figure 3.6 one may find the box plots of the same data,
given in yet a different order.
1. Match the summary result with the appropriate histogram and the appro-
priate box plot.
[Figure 3.5: Histogram 1, Histogram 2 and Histogram 3 of the three data sequences, given in a random order. Histograms 1 and 3 span the range 0 to 5; Histogram 2 spans the range 2 to 6.]
Solution (to Question 3.1.1): Consider the data “x1”. From the summary
we see that it is distributed in the range between 0 and slightly below 5. The
central 50% of the distribution are located between 2.5 and 3.8. The mean and
median are approximately equal to each other, which suggests an approximately
symmetric distribution. Consider the histograms in Figure 3.5. Histograms 1
and 3 correspond to distributions in the appropriate range. However, the
distribution in Histogram 3 is concentrated in lower values than suggested by
the given first and third quartiles. Consequently, we match the summary of
“x1” with Histogram 1.
Consider the data “x2”. Again, the distribution is in the range between 0 and
slightly below 5. The central 50% of the distribution are located between 0.6 and
1.8. The mean is larger than the median, which suggests a distribution skewed
to the right. Therefore, we match the summary of “x2” with Histogram 3.
For the data in “x3” we may note that the distribution is in the range
between 2 and 6. The histogram that fits this description is Histogram 2.
[Figure 3.6: Box plots of the same three data sequences, given in yet a different order (Boxplot 1, Boxplot 2 and Boxplot 3).]
The box plot is essentially a graphical representation of the information presented
by the function “summary”. Following the rationale of matching the sum-
mary with the histograms we may obtain that Histogram 1 should be matched
with Boxplot 2 in Figure 3.6, Histogram 2 matches Boxplot 3, and Histogram 3
matches Boxplot 1. Indeed, it is easier to match the box plots with the sum-
maries. However, it is a good idea to practice the direct matching of histograms
with box plots.
Solution (to Question 3.1.2): Notice that the data in “x1” fits Boxplot 2 in
Figure 3.6. The value 0.000 is the smallest value in the data and it corresponds to
the smallest point in the box plot. Since this point is below the bottom whisker
it follows that it is an outlier. More directly, we may note that the inter-quartile
range is equal to IQR = 3.840 − 2.498 = 1.342. The lower threshold is equal to
2.498 − 1.5 × 1.342 = 0.485, which is larger than the given value. Consequently,
the given value 0.000 is an outlier.
Solution (to Question 3.1.3): Observe that the data in “x3” fits Boxplot 3
in Figure 3.6. The value 6.414 is the largest value in the data and it corresponds
to the endpoint of the upper whisker in the box plot and is not an outlier.
Solution (to Question 3.2.1): In order to compute the mean of the data we
may write the following simple R code:
> x.val <- c(2,4,6,8,10)
> freq <- c(10,6,10,2,2)
> rel.freq <- freq/sum(freq)
> x.bar <- sum(x.val*rel.freq)
> x.bar
[1] 4.666667
We created an object “x.val” that contains the unique values of the data
and an object “freq” that contains the frequencies of the values. The object
“rel.freq” contains the relative frequencies, the ratios between the frequencies
and the number of observations. The average is computed as the sum of the
products of the values with their relative frequencies. It is stored in the object
“x.bar” and obtains the value 4.666667.
An alternative approach is to reconstruct the original data from the fre-
quency table. A simple trick that will do the job is to use the function “rep”.
The first argument to this function is a sequence of values. If the second argu-
ment is a sequence of the same length that contains integers then the output
will be composed of a sequence that contains the values of the first sequence,
each repeated a number of times indicated by the second argument. Specifically,
if we enter to this function the unique value “x.val” and the frequency of the
values “freq” then the output will be the sequence of values of the original
sequence “x”:
> x <- rep(x.val,freq)
> x
[1] 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 6 6 6
[20] 6 6 6 6 6 6 6 8 8 10 10
> mean(x)
[1] 4.666667
Observe that when we apply the function “mean” to “x” we get again the value
4.666667.
Solution (to Question 3.2.3): In order to compute the median one may
produce the table of cumulative relative frequencies of “x”:
> data.frame(x.val,cumsum(rel.freq))
x.val cumsum.rel.freq.
1 2 0.3333333
2 4 0.5333333
3 6 0.8666667
4 8 0.9333333
5 10 1.0000000
Recall that the object “x.val” contains the unique values of the data. The
expression “cumsum(rel.freq)” produces the cumulative relative frequencies.
The function “data.frame” puts these two variables into a single data frame
and provides a clearer representation of the results.
Notice that more than 50% of the observations have value 4 or less. However,
strictly less than 50% of the observations have value 2 or less. Consequently,
the median is 4. (If the value of the cumulative relative frequency at 4 would
have been exactly 50% then the median would have been the average between
4 and the value larger than 4.)
Since we have reconstructed the values of the data “x”, we may also apply
the function “summary” to it and obtain the median this way:
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 4.000 4.667 6.000 10.000
Solution (to Question 3.2.4): As for the inter-quartile range (IQR), notice that
the first quartile is 2 and the third quartile is 6. Hence, the inter-quartile range
is equal to 6 - 2 = 4. The quartiles can be read directly from the output of the
function “summary” or can be obtained from the data frame of the cumulative
relative frequencies. For the latter, observe that more than 25% of the data are
less than or equal to 2 and more than 75% of the data are less than or equal to 6
(with strictly less than 75% less than or equal to 4).
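Alternatively, a sketch that uses the reconstructed data “x” and the base R function “IQR”, which computes the inter-quartile range directly:
> IQR(x)
[1] 4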
Solution (to Question 3.2.5): In order to answer the last question we conduct
the computation: (10 − 4.666667)/2.425914 = 2.198484. We conclude that the
value 10 is approximately 2.1985 standard deviations above the mean.
3.6 Summary
Glossary
Median: A number that separates ordered data into halves: half the values are
the same number or smaller than the median and half the values are the
same number or larger than the median. The median may or may not be
part of the data.
Quartiles: The numbers that separate the data into quarters. Quartiles may
or may not be part of the data. The second quartile is the median of the
data.
Outlier: An observation that does not fit the rest of the data.
Interquartile Range (IQR) : The distance between the third quartile (Q3)
and the first quartile (Q1). IQR = Q3 - Q1.
Mean: A number that measures the central tendency. A common name for
mean is ‘average.’ The term ‘mean’ is a shortened form of ‘arithmetic
mean.’ By definition, the mean for a sample (denoted by x̄) is
(Sample) Variance: Mean of the squared deviations from the mean. Square
of the standard deviation. For a set of data, a deviation can be represented
as x − x̄ where x is a value of the data and x̄ is the sample mean. The
sample variance is equal to the sum of the squares of the deviations divided
by the difference of the sample size and 1:
• x = numerical value.

Formulas:

• Mean: x̄ = (1/n) ∑_{i=1}^n x_i = ∑_x x × (f_x/n)

• Variance: s² = (1/(n−1)) ∑_{i=1}^n (x_i − x̄)² = (n/(n−1)) ∑_x (x − x̄)² × (f_x/n)

• Standard Deviation: s = √(s²)
Chapter 4
Probability
4.3 A Population
In this section we introduce the variability of a population and present some
numerical summaries that characterize this variability. Before doing so, let us
review, with the aid of an example, some of the numerical summaries that were
used for the characterization of the variability of data.
Recall the file “ex1.csv” that contains data on the height and sex of 100
subjects. (The data file can be obtained from http://pluto.huji.ac.il/
~msby/StatThink/Datasets/ex1.csv.) We read the content of the file into a
data frame by the name “ex.1” and apply the function “summary” to the data
frame:
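A minimal sketch of these two steps, assuming the file “ex1.csv” was saved in R’s working directory (the summary output itself is omitted here):
> ex.1 <- read.csv("ex1.csv")   # read the file into a data frame
> summary(ex.1)                 # summarize each of the variables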
We saw in the previous chapter that, when applied to a numeric sequence, the
function “summary” produces the smallest and largest values in the sequence,
the three quartiles (including the median) and the mean. If the input of the
same function is a factor then the outcome is the frequency in the data of each
of the levels of the factor. Here “sex” is a factor with two levels. From the
summary we can see that 54 of the subjects in the sample are female and 46 are
male.
Notice that when the input to the function “summary” is a data frame, as is
the case in this example, then the output is a summary of each of the variables
of the data frame. In this example two of the variables are numeric (“id” and
“height”) and one variable is a factor (“sex”).
Recall that the mean is the arithmetic average of the data which is computed
by summing all the values of the variable and dividing the result by the number
of observations. Hence, if n is the number of observations (n = 100 in this
example) and xi is the value of the variable for subject i, then one may write
the mean in a formula form as
x̄ = (Sum of all values in the data)/(Number of values in the data) = (∑_{i=1}^n x_i)/n ,

where x̄ corresponds to the mean of the data and the symbol ∑_{i=1}^n x_i corresponds to the sum of all values in the data.
[Figure 4.1: Bar plot of the frequency of each height (117 to 217 centimeter) in the population.]
The median is computed by ordering the data values and selecting a value
that splits the ordered data into two equal parts. The first and third quartile
are obtained by further splitting each of the halves into two parts.
Let us discuss the variability associated with an entire target population.
The file “pop1.csv” that contains the population data can be found on the inter-
net (http://pluto.huji.ac.il/~msby/StatThink/Datasets/pop1.csv). It
is a CSV file that contains the information on sex and height of an entire adult
population of some imaginary city. The data in “ex.1” corresponds to a sample
from this city. Read the data into R and examine it:
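Again, a minimal sketch of the reading step, assuming the file “pop1.csv” was saved in the working directory:
> pop.1 <- read.csv("pop1.csv")   # read the population data into a data frame
> summary(pop.1)                  # summarize the variables id, sex and height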
The object “pop.1” is a data frame of the same structure as the data frame
“ex.1”. It contains three variables: a unique identifier of each subject (id),
the sex of the subject (sex), and its height (height). Applying the function
“summary” to the data frame produces the summary of the variables that it
contains. In particular, for the variable “sex”, which is a factor, it produces the
frequency of its two categories – 48,888 female and 51,112 male – a total of 100,000
subjects. For the variable “height”, which is a numeric variable, it produces
the extreme values, the quartiles, and the mean.
Let us concentrate on the variable “height”. A bar plot of the distribution of
the heights in the entire population is given in Figure 4.1 1 . Recall that a vertical
bar is placed above each value of height that appears in the population, with
the height of the bar representing the frequency of the value in the population.
One may read out of the graph or obtain from the numerical summaries that
the variable takes integer values in the range between 117 and 217 (heights
are rounded to the nearest centimeter). The distribution is centered at 170
centimeter, with the central 50% of the values spreading between 162 and 178
centimeters.
The mean of the height in the entire population is equal to 170 centimeter.
This mean, just like the mean for the distribution of data, is obtained by the
summation of all the heights in the population divided by the population size.
Let us denote the size of the entire population by N . In this example N =
100, 000. (The size of the sample for the data was called n and was equal to
n = 100 in the example.) The mean of an entire population is denoted by the
Greek letter µ and is read “mew ”. (The average for the data was denoted x̄).
The formula of the population mean is:
µ = (Sum of all values in the population)/(Number of values in the population) = (∑_{i=1}^N x_i)/N .
Observe the similarity between the definition of the mean for the data and
the definition of the mean for the population. In both cases the arithmetic
average is computed. The only difference is that in the case of the mean of the
data the computation is with respect to the values that appear in the sample
whereas for the population all the values in the population participate in the
computation.
In reality, we will not have all the values of a variable in the entire population.
Hence, we will not be able to compute the actual value of the population mean.
However, it is still meaningful to talk about the population mean because this
number exists, even though we do not know what its value is. As a matter of
fact, one of the issues in statistics is to try to estimate this unknown quantity
on the basis of the data we do have in the sample.
A characteristic of the distribution of an entire population is called a pa-
rameter. Hence, µ, the population average, is a parameter. Other examples
of parameters are the population median and the population quartiles. These
parameters are defined exactly like their data counterparts, but with respect
to the values of the entire population instead of the observations in the sample
alone.
1 Such a bar plot can be produced with the expression “plot(table(pop.1$height))”.
> sample(pop.1$height,1)
[1] 162
2 Observe that the function “var” computes the sample variance. Consequently, the sum
of squares is divided by N − 1. We can correct that when computing the population variance
by multiplying the result by N − 1 and dividing by N . Notice that the difference between
the two quantities is negligible for a large population. Henceforth we will use the functions
“var” and “sd” to compute the variance and standard deviations of populations without the
application of the correction.
The first entry to the function is the given sequence of heights. When we
set the second argument to 1 then the function selects one of the entries of
the sequence at random, with each entry having the same likelihood of being
selected. Specifically, in this example an entry that contains the value 162 was
selected. Let us run the function again:
> sample(pop.1$height,1)
[1] 192
In this instance an entry with a different value was selected. Try to run the
command several times yourself and see what you get. Would you necessarily
obtain a different value in each run?
Now let us enter the same command without pressing the return key:
> sample(pop.1$height,1)
Can you tell, before pressing the key, what value you will get?
The answer to this question is of course “No”. There are 100,000 entries
with a total of 94 distinct values. In principle, any of the values may be selected
and there is no way of telling in advance which of the values will turn out as an
outcome.
A random variable is the future outcome of a measurement, before the
measurement is taken. It does not have a specific value, but rather a collection
of potential values with a distribution over these values. After the measurement
is taken and the specific value is revealed then the random variable ceases to be
a random variable! Instead, it becomes data.
Although one is not able to say what the outcome of a random variable
will turn out to be, one may still identify patterns in this potential outcome.
For example, knowing that the distribution of heights in the population ranges
between 117 and 217 centimeter one may say in advance that the outcome of
the measurement must also be in that interval. Moreover, since there is a total
of 3,476 subjects with height equal to 168 centimeter and since the likelihood
of each subject to be selected is equal then the likelihood of selecting a subject
of this height is 3,476/100,000 = 0.03476. In the context of random variables
we call this likelihood probability. In the same vein, the frequency of subjects
with height 192 centimeter is 488, and therefore the probability of measuring
such a height is 0.00488. The frequency of subjects with height 200 centimeter
or above is 393, hence the probability of obtaining a measurement in the range
between 200 and 217 centimeter is 0.00393.
In the notation of random variables, letting X denote the height of the randomly selected subject, these statements read P(X = 168) = 0.03476, P(X = 192) = 0.00488, and P(X ≥ 200) = 0.00393.
Consider, as another example, the probability that the height of a random
person sampled from the population differs from 170 centimeter by no more
than 10 centimeters. (In other words, that the height is between 160 and 180
centimeters.) Denote by X the height of that random person. We are interested
in the probability P(|X − 170| ≤ 10).3
The random person can be any of the subjects of the population with equal
probability. Thus, the sequence of the heights of the 100,000 subjects represents
the distribution of the random variable X:
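A sketch of the step that creates this object, using the name “X” that is referred to below:
> X <- pop.1$height   # the heights of all 100,000 subjects in the population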
Notice that the object “X” is a sequence of length 100,000 that stores all the
heights of the population. The probability we seek is the relative frequency in
this sequence of values between 160 and 180. First we compute the probability
and then explain the method of computation:
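Based on the method that is explained below, the computation presumably takes the form:
> mean(abs(X - 170) <= 10)
[1] 0.64541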
We get that the height of a person randomly sampled from the population is
between 160 and 180 centimeters with probability 0.64541.
Let us produce a small example that will help us explain the computation
of the probability. We start by forming a sequence with 10 numbers:
> Y <- c(6.3, 6.9, 6.6, 3.4, 5.5, 4.3, 6.5, 4.7, 6.1, 5.3)
The goal is to compute the proportion of numbers that are in the range [4, 6]
(or, equivalently, |Y − 5| ≤ 1).
The function “abs” computes the absolute value of its input argument.
When the function is applied to the sequence “Y-5” it produces a sequence of
the same length with the distances between the components of “Y” and the
number 5:
> abs(Y-5)
[1] 1.3 1.9 1.6 1.6 0.5 0.7 1.5 0.3 1.1 0.3
Compare the resulting output to the original sequence. The first value in the
input sequence is 6.3. Its distance from 5 is indeed 1.3. The fourth value in
3 The expression {|X − 170| ≤ 10} reads as “the absolute value of the difference between X
and 170 is no more that 10”. In other words, {−10 ≤ X − 170 ≤ 10}, which is equivalent to
the statement that {160 ≤ X ≤ 180}. It follows that P(|X − 170| ≤ 10) = P(160 ≤ X ≤ 180).
4.4. RANDOM VARIABLES 55
the input sequence is 3.4. The difference 3.4 - 5 is equal to -1.6, and when the
absolute value is taken we get a distance of 1.6.
The function “<=” expects an argument to the right and an argument to the
left. It compares each component to the left with the parallel component to the
right and returns a logical value, “TRUE” or “FALSE”, depending on whether the
relation that is tested holds or not:
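Applied to our small example, with the distances compared to the number 1, a sketch of the expression and its output is:
> abs(Y-5) <= 1
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE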
Observe that in this example the function “<=” produced 10 logical values, one
for each of the elements of the sequence to the left of it. The first input in the
sequence “Y” is 6.3, which is more than one unit away from 5. Hence, the first
output of the logical expression is “FALSE”. On the other hand, the last input
in the sequence “Y” is 5.3, which is within the range. Therefore, the last output
of the logical expression is “TRUE”.
Next, we compute the proportion of “TRUE” values in the sequence:
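A sketch of this computation:
> mean(abs(Y-5) <= 1)
[1] 0.4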
When a sequence with logical values is entered into the function “mean” then the
function replaces the TRUE’s by 1 and the FALSE’s by 0. The average produces
then the relative frequency of TRUE’s in the sequence as required. Specifically,
in this example there are 4 TRUE’s and 6 FALSE’s. Consequently, the output of
the final expression is 4/10 = 0.4.
The computation of the probability that the sampled height falls within 10
centimeter of 170 is based on the same code. The only differences are that the
input sequence “Y” is replaced by the sequence of population heights “X”,
the number “5” is replaced by the number “170”, and the number “1”
is replaced by the number “10”. In both cases the result of the computation is
the relative proportion of the times that the values of the input sequence fall
within a given range of the indicated number.
The probability function of a random variable is defined for any value that
the random variable may obtain and produces the distribution of the random
variable. The probability function may emerge as a relative frequency as in
the given example or it may be a result of theoretical modeling. Examples of
theoretical random variables are presented mainly in the next two chapters.
The sample space and the probability function specify the distribution of
the random variable. For example, assume it is known that a random variable
X may obtain the values 0, 1, 2, or 3. Moreover, imagine that it is known that
P(X = 1) = 0.25, P(X = 2) = 0.15, and P(X = 3) = 0.10. What is P(X = 0),
the probability that X is equal to 0?
The sample space, the collection of possible values that the random variable
may obtain is the collection {0, 1, 2, 3}. Observe that the sum over the positive
values is:
It follows, since the sum of probabilities over the entire sample space is equal
to 1, that P(X = 0) = 1 − 0.5 = 0.5.
The expectation of a random variable X, denoted E(X), is defined as

E(X) = ∑_x x × P(x) .

In this definition all the unique values of the sample space are considered. For
each value, the product of the value and the probability of the value is taken. The
expectation is obtained by the summation of all these products. In this definition
the probability P(x) replaces the relative frequency f_x/n but otherwise
the definition of the expectation and the second formulation of the mean are
identical to each other.
Consider the random variable X with distribution that is described in Ta-
ble 4.4.1. In order to obtain its expectation we multiply each value in the sample
space by the probability of the value. Summation of the products produces the
expectation (see Table 4.4.2):
E(X) = 0 × 0.5 + 1 × 0.25 + 2 × 0.15 + 3 × 0.10 = 0.85 .
In the example of height we get that the expectation is equal to 170.035 cen-
timeter. Notice that this expectation is equal to µ, the mean of the population4 .
This is no accident. The expectation of a potential measurement of a randomly
selected subject from a population is equal to the average of the measurement
across all subjects.
The sample variance (s2 ) is obtained as the sum of the squared deviations
from the average, divided by the sample size (n) minus 1:
s² = (∑_{i=1}^n (x_i − x̄)²)/(n − 1) .
4 The mean of the population can be computed with the expression “mean(pop.1$height)”
A second formulation for the computation of the same quantity is via the use
of relative frequencies. The formula for the sample variance takes the form
s² = (n/(n−1)) ∑_x (x − x̄)² × (f_x/n) .
In this formulation one considers each of the unique values that are present in
the data. For each value the deviation between the value and the average is
computed. These deviations are then squared and multiplied by the relative
frequency. The products are summed up. Finally, the sum is multiplied by the
ratio between the sample size n and n − 1 in order to correct for the fact that
in the sample variance the sum of squared deviations is divided by the sample
size minus 1 and not by the sample size.
In a similar way, the variance of a random variable may be defined via the
probability of the values that make the sample space. For each such value one
computes the deviation from the expectation. This deviation is then squared
and multiplied by the probability of the value. The multiplications are summed
up in order to produce the variance:
Var(X) = ∑_x (x − E(X))² × P(x) .
Notice that the formula for the computation of the variance of a random variable
is very similar to the second formulation for the computation of the sample
variance. Essentially, the mean of the data is replaced by the expectation of
the random variable and the relative frequency of a value is replaced by the
probability of the value. Another difference is that the correction factor is not
used for the variance of a random variable.
Value  Prob.  x − E(X)  (x − E(X))²  (x − E(X))² × P(X = x)
0 0.50 -0.85 0.7225 0.361250
1 0.25 0.15 0.0225 0.005625
2 0.15 1.15 1.3225 0.198375
3 0.10 2.15 4.6225 0.462250
Var(X) = 1.027500
In the example that involves the height of a subject selected from the population
at random we obtain that the variance is 126.1576, equal to the population
variance, and the standard deviation is 11.23199, the square root of the variance.
Other characterizations of the distribution that were computed for data, such
as the median, the quartiles, etc., may also be defined for random variables.
Value Probability
0 p
1 2p
2 3p
3 4p
4 5p
5 6p
7. Var(Y ) = ?
8. What is the standard deviation of Y ?
Solution (to Question 4.1.1): Consult Table 4.6. The probabilities of the
different values of Y are {p, 2p, . . . , 6p}. These probabilities sum to 1, conse-
quently
p + 2p + 3p + 4p + 5p + 6p = (1 + 2 + 3 + 4 + 5 + 6)p = 21p = 1 =⇒ p = 1/21 .
Solution (to Question 4.1.2): The event {Y < 3} contains the values 0, 1
and 2. Therefore,
P(Y < 3) = P(Y = 0) + P(Y = 1) + P(Y = 2) = 1/21 + 2/21 + 3/21 = 6/21 = 0.2857 .
Solution (to Question 4.1.3): The event {Y = odd} contains the values 1,
3 and 5. Therefore,
P(Y = odd) = P(Y = 1) + P(Y = 3) + P(Y = 5) = 2/21 + 4/21 + 6/21 = 12/21 = 0.5714 .
Solution (to Question 4.1.4): The event {1 ≤ Y < 4} contains the values 1,
2 and 3. Therefore,
P(1 ≤ Y < 4) = P(Y = 1) + P(Y = 2) + P(Y = 3) = 2/21 + 3/21 + 4/21 = 9/21 = 0.4286 .
Solution (to Question 4.1.5): The event {|Y − 3| < 1.5} contains the values
2, 3 and 4. Therefore,
P(|Y − 3| < 1.5) = P(Y = 2) + P(Y = 3) + P(Y = 4) = 3/21 + 4/21 + 5/21 = 12/21 = 0.5714 .
Solution (to Question 4.1.6): The values that the random variable Y ob-
tains are the numbers 0, 1, 2, . . . , 5, with probabilities {1/21, 2/21, . . . , 6/21},
respectively. The expectation is obtained by the multiplication of the values by
their respective probabilities and the summation of the products. Let us carry
out the computation in R:
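A sketch of this computation, using object names that match those in the next solution:
> Y.val <- 0:5          # the values of the sample space
> P.val <- (1:6)/21     # the probabilities of the values
> E <- sum(Y.val*P.val)
> E
[1] 3.333333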
Solution (to Question 4.1.7): The values that the random variable Y ob-
tains are the numbers 0, 1, 2, . . . , 5, with probabilities {1/21, 2/21, . . . , 6/21},
respectively. The expectation is equal to E(Y ) = 3.333333. The variance is
obtained by the multiplication of the squared deviation from the expectation of
the values by their respective probabilities and the summation of the products.
Let us carry out the computation in R:
> Var <- sum((Y.val-E)^2*P.val)
> Var
[1] 2.222222
We obtain a variance Var(Y ) = 2.2222.
Solution (to Question 4.2.2): Consider the previous solution. One loses if
any of the other outcomes occurs. Hence, the probability of losing is 7/8.
Solution (to Question 4.2.3): Denote the gain of the player by X. The
random variable X may obtain two values: 10 - 2 = 8 if the player wins and -2 if
the player loses. The probabilities of these values are {1/8, 7/8}, respectively.
Therefore, the expected gain, the expectation of X, is:

E(X) = 8 × (1/8) + (−2) × (7/8) = −0.75 .
4.7 Summary
Glossary
Random Variable: The probabilistic model for the value of a measurement,
before the measurement is taken.
Expectation: The central value for a random variable. The expectation of the
random variable X is marked by E(X).
Summary of Formulas
Population Size: N = the number of people, things, etc. in the population.
Population Average: µ = (1/N) ∑_{i=1}^N x_i

Expectation of a Random Variable: E(X) = ∑_x x × P(x)

Population Variance: σ² = (1/N) ∑_{i=1}^N (x_i − µ)²

Variance of a Random Variable: Var(X) = ∑_x (x − E(X))² × P(x)
Chapter 5
Random Variables
type of random variable we will identify first the sample space — the values it
may obtain — and then describe the probabilities of the values. The R system
provides functions for the computation of these probabilities. We will use these
functions in this and in subsequent chapters in order to carry out computations
associated with these random variables and in order to plot these distributions.
The distribution of a random variable, just like the distribution of data, can
be characterized using numerical summaries. For the latter we used summaries
such as the mean and the sample variance and standard deviation. The mean
is used to describe the central location of the distribution and the variance and
standard deviation are used to characterize the total spread. Parallel summaries
are used for random variables. In the case of a random variable the name
expectation is used for the central location of the distribution, and the variance
and the standard deviation (the square root of the variance) are used to
summarize the spread. In all the examples of random variables we will identify the
expectation and the variance (and, thereby, also the standard deviation).
Random variables are used as probabilistic models of measurements. Theo-
retical considerations are used in many cases in order to define random variables
and their distribution. A random variable for which the values in the sample
space are separated from each other, say the values are integers, is called a
discrete random variable. In this section we introduce two important integer-
valued random variables: The Binomial and the Poisson random variables.
These random variables may emerge as models in contexts where the measure-
ment involves counting the number of occurrences of some phenomena.
Many other models, apart from the Binomial and Poisson, exist for discrete
random variables. An example of such a model, the Negative-Binomial model, will
be considered in Section 5.4. Depending on the specific context that involves
measurements with discrete values, one may select the Binomial, the Poisson,
or one of these other models to serve as a theoretical approximation of the
distribution of the measurement.
[Figure: bar plot of the probabilities of a Binomial random variable over the values 0 to 10.]
“Tail”. Hence, the sample space of X is the set of integers {0, 1, 2, . . . , 10}. The
probability of each outcome may be computed by an appropriate mathematical
formula that will not be discussed here1 .
The probabilities of the various possible values of a Binomial random vari-
able may be computed with the aid of the R function “dbinom” (that uses the
mathematical formula for the computation). The input to this function is a se-
quence of values, the value of n, and the value of p. The output is the sequence
of probabilities associated with each of the values in the first input.
For example, let us use the function in order to compute the probability
that the given Binomial obtains an odd value. A sequence that contains the
odd values in the Binomial sample space can be created with the expression
“c(1,3,5,7,9)”. This sequence can serve as the input in the first argument of
the function “dbinom”. The other arguments are “10” and “0.5”, respectively:
> dbinom(c(1,3,5,7,9),10,0.5)
[1] 0.009765625 0.117187500 0.246093750 0.117187500 0.009765625
¹If X ∼ Binomial(n, p) then P(X = x) = (n choose x) · p^x · (1 − p)^(n−x), for x = 0, 1, . . . , n.
Observe that the output of the function is a sequence of the same length as the
first argument. This output contains the Binomial probabilities of the values in
the first argument. In order to obtain the probability of the event {X is odd}
we should sum up these probabilities, which we can do by applying the function
“sum” to the output of the function that computes the Binomial probabilities:
> sum(dbinom(c(1,3,5,7,9),10,0.5))
[1] 0.5
Observe that the probability of obtaining an odd value in this specific case is
equal to one half.
Another example is to compute all the probabilities of all the potential values
of a Binomial(10, 0.5) random variable:
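A sketch of this computation, assuming the sequence “x” holds the values of the sample space:
> x <- 0:10
> dbinom(x,10,0.5)
 [1] 0.0009765625 0.0097656250 0.0439453125 0.1171875000
 [5] 0.2050781250 0.2460937500 0.2050781250 0.1171875000
 [9] 0.0439453125 0.0097656250 0.0009765625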
> pbinom(x,10,0.5)
[1] 0.0009765625 0.0107421875 0.0546875000 0.1718750000
[5] 0.3769531250 0.6230468750 0.8281250000 0.9453125000
[9] 0.9892578125 0.9990234375 1.0000000000
> cumsum(dbinom(x,10,0.5))
[1] 0.0009765625 0.0107421875 0.0546875000 0.1718750000
[5] 0.3769531250 0.6230468750 0.8281250000 0.9453125000
[9] 0.9892578125 0.9990234375 1.0000000000
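The sum that is referred to in the next paragraph appears to be the cumulative probability P(X ≤ 3); a sketch of that computation:
> sum(dbinom(0:3,10,0.5))
[1] 0.171875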
The numbers in the sum are the first 4 values from the output of the function
“dbinom(x,10,0.5)”, which computes the probabilities of the values of the
sample space.
In principle, the expectation of the Binomial random variable, like the expec-
tation of any other (discrete) random variable is obtained from the application
of the general formulae:
E(X) = ∑_x x × P(X = x) ,    Var(X) = ∑_x (x − E(X))² × P(x) .
However, in the specific case of the Binomial random variable, in which the
probability P(X = x) obeys the specific mathematical formula of the Binomial
distribution, the expectation and the variance reduce to the specific formulae:

E(X) = n · p ,    Var(X) = n · p · (1 − p) .

Hence, the expectation is the product of the number of trials n with the proba-
bility of success in each trial p. In the variance the number of trials is multiplied
by the product of a probability of success (p) with the probability of a failure
(1 − p).
As illustration, let us compute for the given example the expectation and
the variance according to the general formulae for the computation of the ex-
pectation and variance in random variables and compare the outcome to the
specific formulae for the expectation and variance in the Binomial distribution:
> X.val <- 0:10
> P.val <- dbinom(X.val,10,0.5)
> EX <- sum(X.val*P.val)
> EX
[1] 5
> sum((X.val-EX)^2*P.val)
[1] 2.5
This agrees with the specific formulae for Binomial variables, since 10 × 0.5 = 5
and 10 × 0.5 × (1 − 0.5) = 2.5.
Recall that the general formula for the computation of the expectation calls
for the multiplication of each value in the sample space with the probability of
that value, followed by the summation of all the products. Notice that the object
“X.val” contains all the values of the random variable and the object “P.val”
contains the probabilities of these values. Hence, the expression “X.val*P.val”
produces the product of each value of the random variable times the probability
of that value. Summation of these products with the function “sum” gives the
expectation, which is saved in an object that is called “EX”.
The general formula for the computation of the variance of a random variable
involves the product of the squared deviation associated with each value with
the probability of that value, followed by the summation of all products. The
expression “(X.val-EX)^2” produces the sequence of squared deviations from
the expectation for all the values of the random variable. Summation of the
product of these squared deviations with the probabilities of the values (the
outcome of “(X.val-EX)^2*P.val”) gives the variance.
When the value of p changes (without changing the number of trials n) then
the probabilities that are assigned to each of the values of the sample space
[Figure 5.2: Bar plots of the Binomial(10, p) probabilities for p = 0.5, p = 1/6 and p = 0.6, over the values 0 to 10.]
of the Binomial random variable change, but the sample space itself does not.
For example, consider rolling a die 10 times and counting the number of times
that the face 3 was obtained. Having the face 3 turning up is a “Success”. The
probability p of a success in this example is 1/6, since the given face is one
out of 6 equally likely faces. The resulting random variable that counts the
total number of successes in 10 trials has a Binomial(10, 1/6) distribution. The
sample space is yet again equal to the set of integers {0, 1, . . . , 10}. However, the
probabilities of the values are different. These probabilities can again be computed
with the aid of the function “dbinom”:
> dbinom(x,10,1/6)
[1] 1.615056e-01 3.230112e-01 2.907100e-01 1.550454e-01
[5] 5.426588e-02 1.302381e-02 2.170635e-03 2.480726e-04
[9] 1.860544e-05 8.269086e-07 1.653817e-08
In this case smaller values of the random variable are assigned higher probabil-
ities and larger values are assigned lower probabilities.
In Figure 5.2 the probabilities for Binomial(10, 1/6), the Binomial(10, 1/2),
and the Binomial(10, 0.6) distributions are plotted side by side. In all these
3 distributions the sample space is the same, the integers between 0 and 10.
However, the probabilities of the different values differ. (Note that all bars
should be placed on top of the integers. For clarity of the presentation, the
bars associated with the Binomial(10, 1/6) are shifted slightly to the left and
the bars associated with the Binomial(10, 0.6) are shifted slightly to the right.)
The expectation of the Binomial(10, 0.5) distribution is equal to 10×0.5 = 5.
Compare this to the expectation of the Binomial(10, 1/6) distribution, which is
10 × (1/6) = 1.666667 and to the expectation of the Binomial(10, 0.6) distribu-
tion which equals 10 × 0.6 = 6.
The variance of the Binomial(10, 0.5) distribution is 10 × 0.5 × 0.5 = 2.5.
The variance when p = 1/6 is 10 × (1/6) × (5/6) = 1.388889 and the variance
when p = 0.6 is 10 × 0.6 × 0.4 = 2.4.
Observe that the outcome is almost, but not quite, equal to 2.00, which is the
actual value of the expectation. The reason for the inaccuracy is the fact that
we have based the computation in R on the first 11 values of the distribution
only, instead of the infinite sequence of values. A more accurate result may be
obtained by the consideration of the first 101 values:
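A sketch of the more accurate computation, assuming the expectation in this example is λ = 2:
> x <- 0:100
> P.val <- dpois(x,2)   # Poisson probabilities of the first 101 values
> sum(x*P.val)
[1] 2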
[Figure 5.4: Bar plots of the Poisson probabilities for λ = 0.5, λ = 1 and λ = 2, over the values 0 to 10.]
In the last expression we have computed the variance of the Poisson distribution
and obtained that it is equal to the expectation. This result can be validated
mathematically. For the Poisson distribution it is always the case that the
variance is equal to the expectation, namely to λ:
E(X) = Var(X) = λ .
In Figure 5.4 you may find the probabilities of the Poisson distribution for
λ = 0.5, λ = 1 and λ = 2. Notice once more that the sample space is the same
for all the Poisson distributions. What varies when we change the value of λ are
the probabilities. Observe that as λ increases then probability of larger values
increases as well.
[Figure 5.5: The density of the Uniform(3, 7) distribution, with the value x = 4.73 marked.]
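A sketch of the code that may have produced the objects “x” and “den” and the scatter plot in the upper panel of the figure below:
> x <- seq(0,10,length=1000)   # 1,000 values between 0 and 10
> den <- dunif(x,3,7)          # the Uniform(3,7) density at these values
> plot(x,den)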
The object “den” is a sequence of length 1,000 that contains the density of the
Uniform(3, 7) evaluated over the values of “x”. When we apply the function
“plot” to the two sequences we get a scatter plot of the 1,000 points that is
presented in the upper panel of Figure 5.6.
A scatter plot is a plot of points. Each point in the scatter plot is identified by
its horizontal location on the plot (its “x” value) and by its vertical location on
the plot (its y value). The horizontal value of each point in the plot is determined by
the first argument to the function “plot” and the vertical value is determined
by the second argument. For example, the first value in the sequence “x” is 0.
The value of the Uniform density at this point is 0. Hence, the first value of the
sequence “den” is also 0. A point that corresponds to these values is produced
in the plot. The horizontal value of the point is 0 and the vertical value is 0. In
a similar way the other 999 points are plotted. The last point to be plotted has
a horizontal value of 10 and a vertical value of 0.
[Figure 5.6: The Uniform(3, 7) density plotted as 1,000 points (upper panel, “den” against “x”), the same density plotted as a connected line (middle panel), and the cumulative distribution function of the Uniform(3, 7) (lower panel, “cdf” against “x”), over x values between 0 and 10.]
The number of points that are plotted is large and they overlap each other
in the graph and thus produce an impression of a continuum. In order to obtain
nicer looking plots we may choose to connect the points to each other with
segments and use smaller points. This may be achieved by the addition of the
argument “type="l"”, with "l" for line, to the plotting function:
> plot(x,den,type="l")
The output of the function is presented in the second panel of Figure 5.6. In
the last panel the cumulative probability of the Uniform(3, 7) is presented. This
function is produced by the code:
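A sketch of that code, assuming the cumulative probability is computed with the function “punif”:
> cdf <- punif(x,3,7)     # cumulative probability of the Uniform(3,7)
> plot(x,cdf,type="l")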
One can think of the density of the Uniform as a histogram³. The expectation
of a Uniform random variable is the middle point of the histogram. Hence,
³If X ∼ Uniform(a, b) then the density is f(x) = 1/(b − a), for a ≤ x ≤ b, and it is equal to 0 otherwise.
if X ∼ Uniform(a, b) then:
E(X) = (a + b)/2 .
For the X ∼ Uniform(3, 7) distribution the expectation is E(X) = (3+7)/2 = 5.
Observe that 5 is the center of the Uniform density in Plot 5.5.
It can be shown that the variance of the Uniform(a, b) is equal to
Var(X) = (b − a)²/12 ,

with the standard deviation being the square root of this value. Specifically, for
X ∼ Uniform(3, 7) we get that Var(X) = (7 − 3)²/12 = 1.333333. The standard
deviation is equal to √1.333333 = 1.154701.
The difference is the probability of belonging to the interval, namely the area
marked in the plot.
4 If X ∼ Exponential(λ) then the density is f (x) = λe−λx , for 0 ≤ x, and it is equal to 0
for x < 0.
E(X) = 1/λ ,    Var(X) = 1/λ² .
The standard deviation is the square root of the variance, namely 1/λ. Observe
that the larger the rate the smaller the expectation and the standard deviation
are.
In Figure 5.8 the densities of the Exponential distribution are plotted for
λ = 0.5, λ = 1, and λ = 2. Notice that the larger the value of the parameter
is the smaller the values of the random variable tend to become. This inverse
relation makes sense in connection to the Poisson distribution. Recall that
the Poisson distribution corresponds to the total number of occurrences in a
unit interval of time when the time between occurrences has an Exponential
distribution. The larger the expectation λ of the Poisson is the larger is the
number of occurrences that are likely to take place during the unit interval of
[Figure 5.8: The densities of the Exponential distribution for λ = 0.5, λ = 1 and λ = 2, over values between 0 and 10.]
time. The larger the number of occurrences is the smaller are the time intervals
between occurrences.
1. What is the expected number of people that will develop a reaction each
day?
2. What is the standard deviation of the number of people that will develop
a reaction each day?
3. In a given day, what is the probability that more than 40 people will
develop a reaction?
4. In a given day, what is the probability that the number of people that will
develop a reaction is between 45 and 50 (inclusive)?
Solution (to Question 5.1.1): The Binomial distribution is a reasonable
model for the number of people that develop high fever as a result of the
vaccination. Let X be the number of people that do so in a given day. Hence,
X ∼ Binomial(500, 0.09). According to the formula for the expectation in the
Binomial distribution, since n = 500 and p = 0.09, we get that:

E(X) = n · p = 500 × 0.09 = 45 .
Solution (to Question 5.1.4): The probability that the number of people
that will develop a reaction is between 45 and 50 (inclusive) is the difference
between P(X ≤ 50) and P(X < 45) = P(X ≤ 44). Apply the function “pbinom”
to get:
> pbinom(50,500,0.09) - pbinom(44,500,0.09)
[1] 0.3292321
The bar plots of these random variables are presented in Figure 5.9, re-organized
in a random order.
Use Figure 5.9 in order to match the random variable with its associated
pair. Do not use numerical computations or formulae for the expectation
and the variance in the Negative-Binomial distribution in order to carry
out the matching5 . Use, instead, the structure of the bar-plots.
Solution (to Question 5.2.1): The plots can be produced with the following
code, which should be run one line at a time:
5.5 Summary
Glossary
Binomial Random Variable: The number of successes among n repeats of
independent trials with a probability p of success in each trial. The dis-
tribution is marked as Binomial(n, p).
Poisson Random Variable: An approximation to the number of occurrences
of a rare event, when the expected number of events is λ. The distribution
is marked as Poisson(λ).
Density: Histogram that describes the distribution of a continuous random
variable. The area under the curve corresponds to probability.
Uniform Random Variable: A model for a measurement with equally likely
outcomes over an interval [a, b]. The distribution is marked as Uniform(a, b).
Exponential Random Variable: A model for times between events. The
distribution is marked as Exponential(λ).
Summary of Formulas
Discrete Random Variable:

E(X) = ∑_x x × P(x) ,    Var(X) = ∑_x (x − E(X))² × P(x)
[Figure 5.9: Bar plots of the three Negative-Binomial random variables (Barplot 1, Barplot 2 and Barplot 3), over the values 0 to 15.]
• Recognize the Normal density and apply R functions for computing Normal
probabilities and percentiles.
[Figure 6.1: The Normal(2, 9) density, with the probability P(0 < X < 5) marked as the area between 0 and 5.]
[Figure 6.2: The densities of the Normal(0, 1), Normal(2, 9) and Normal(−3, 0.25) distributions.]
interval (0, 5]. The required probability is indicated by the marked area in
Figure 6.1. This area can be computed as the difference between the probability
P(X ≤ 5), the area to the left of 5, and the probability P(X ≤ 0), the area to
the left of 0:
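A sketch of this computation with the function “pnorm”:
> pnorm(5,2,3) - pnorm(0,2,3)
[1] 0.5888522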
The difference is the indicated area that corresponds to the probability of being
inside the interval, which turns out to be approximately equal to 0.589. Notice
that the expectation µ of the Normal distribution is entered as the second
argument to the function. The third argument to the function is the standard
deviation, i.e. the square root of the variance. In this example, the standard
deviation is √9 = 3.
Figure 6.2 displays the densities of the Normal distribution for the combina-
tions µ = 0, σ 2 = 1 (the red line); µ = 2, σ 2 = 9 (the black line); and µ = −3,
σ 2 = 1/4 (the green line). Observe that the smaller the variance the more
concentrated is the distribution of the random variable about the expectation.
(0 =) x = µ + z · σ (= 2 + z · 3)
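The computation referred to below presumably applies “pnorm” to the standardized values; a sketch:
> pnorm((5-2)/3) - pnorm((0-2)/3)
[1] 0.5888522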
The value that is being computed, the area under the graph for the standard
Normal distribution, is presented in Figure 6.3. Recall that 3 arguments were
specified in the previous application of the function “pnorm”: the x value, the
expectation, and the standard deviation. In the given application we did not
specify the last two arguments, only the first one. (Notice that the output of
[Figure 6.3: The standard Normal density, with the area under the curve that corresponds to the computed probability marked.]
the expression “(5-2)/3” is a single number and, likewise, the output of the
expression “(0-2)/3” is also a single number.)
Most R functions have many arguments that enable flexible application in a
wide range of settings. For convenience, however, default values are set to most
of these arguments. These default values are used unless an alternative value for
the argument is set when the function is called. The default value of the second
argument of the function “pnorm” that specifies the expectation is “mean=0”,
and the default value of the third argument that specifies the standard devi-
ation is “sd=1”. Therefore, if no other value is set for these arguments the
function computes the cumulative distribution function of the standard Normal
distribution.
[Figure 6.4: The standard Normal density, with the central region that contains 95% of the distribution marked between z0 = −1.96 and z1 = 1.96.]
We may find the z-values of the boundaries of the region, denoted in the
figure as z0 and z1 by the investigation of the cumulative distribution function.
Indeed, in order to have 95% of the distribution in the central region one should
leave out 2.5% of the distribution in each of the two tails. That is, 0.025
should be the area of the unshaded region to the right of z1 and, likewise, 0.025
should be the area of the unshaded region to the left of z0 . In other words, the
cumulative probability up to z0 should be 0.025 and the cumulative distribution
up to z1 should be 0.975.
In general, given a random variable X and given a probability p, the x
value with the property that the cumulative distribution up to x is equal to
the probability p is called the p-percentile of the distribution. Here we seek the
0.025-percentile and the 0.975-percentile of the standard Normal distribution.
The percentiles of the Normal distribution are computed by the function
“qnorm”. The first argument to the function is a probability (or a sequence
of probabilities), the second and third arguments are the expectation and the
standard deviations of the normal distribution. The default values to these argu-
ments are set to 0 and 1, respectively. Hence if these arguments are not provided
the function computes the percentiles of the standard Normal distribution. Let us apply the function in order to obtain the percentiles that we seek:
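A sketch of the computation of the two percentiles of the standard Normal distribution:
> qnorm(0.975)
[1] 1.959964
> qnorm(0.025)
[1] -1.959964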
[Figure 6.5: The Normal(2, 9) density, with the boundaries x0 = ? and x1 = ? of the central region that contains 95% of the distribution marked.]
> qnorm(0.975,2,3)
[1] 7.879892
> qnorm(0.025,2,3)
[1] -3.879892
Hence, we get that x0 = −3.88 has the property that the total probability to
its left is 0.025 and x1 = 7.88 has the property that the total probability to its
right is 0.025. The total probability in the range [−3.88, 7.88] is 0.95.
An alternative approach for obtaining the given interval exploits the inter-
val that was obtained for the standardized values. An interval [−1.96, 1.96] of
standardized z-values corresponds to an interval [2 − 1.96 · 3, 2 + 1.96 · 3] of the
original x-values:
> 2 + qnorm(0.975)*3
[1] 7.879892
> 2 + qnorm(0.025)*3
[1] -3.879892
Hence, we again produce the interval [−3.88, 7.88], the interval that was ob-
tained before as the central interval that contains 95% of the distribution of the
Normal(2, 9) random variable.
In general, if X ∼ Normal(µ, σ²) is a Normal random variable then the interval
[µ − 1.96 · σ, µ + 1.96 · σ] contains 95% of the distribution of the random variable.
Frequently one uses the notation µ ± 1.96 · σ to describe such an interval.
ends of this rectangle to the smallest and to the largest data values that are not
outliers. Outliers are values that lie outside of the normal range of the data.
Outliers are identified as values that are more than 1.5 times the interquartile
range away from the ends of the central rectangle. Hence, a value is an outlier
if it is larger than the third quartile plus 1.5 times the interquartile range or if
it is less than the first quartile minus 1.5 times the interquartile range.
How likely is it to obtain an outlier value when the measurement has the
standard Normal distribution? We obtained that the third quartile of the stan-
dard Normal distribution is equal to 0.6744898 and the first quartile is minus
this value. The interquartile range is the difference between the third and first
quartiles. The upper and lower thresholds for defining outliers are:
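A sketch of this computation in terms of the quartiles of the standard Normal distribution:
> qnorm(0.75) + 1.5*(qnorm(0.75) - qnorm(0.25))
[1] 2.697959
> qnorm(0.25) - 1.5*(qnorm(0.75) - qnorm(0.25))
[1] -2.697959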
Hence, a value larger than 2.697959 or smaller than -2.697959 would be identified
as an outlier.
The probability of being less than the upper threshold 2.697959 in the stan-
dard Normal distribution is computed with the expression “pnorm(2.697959)”.
The probability of being above the threshold is 1 minus that probability, which
is the outcome of the expression “1-pnorm(2.697959)”.
By the symmetry of the standard Normal distribution we get that the prob-
ability of being below the lower threshold -2.697959 is equal to the probability
of being above the upper threshold. Consequently, the probability of obtaining
an outlier is equal to twice the probability of being above the upper threshold:
> 2*(1-pnorm(2.697959))
[1] 0.006976603
We get that for the standard Normal distribution the probability of an outlier
is approximately 0.7%.
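The objects “mu” and “sig” that appear in the next computation are presumably the expectation and the standard deviation of the Binomial(4000, 0.5) distribution that is examined below; a sketch of their definition:
> mu <- 4000*0.5            # expectation of the Binomial(4000,0.5)
> sig <- sqrt(4000*0.5*0.5) # standard deviation of the Binomial(4000,0.5)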
> qnorm(0.975,mu,sig)
[1] 2061.980
> qnorm(0.025,mu,sig)
[1] 1938.020
After rounding to the nearest integer we get the interval [1938, 2062] as a pro-
posed central region.
In order to validate the proposed region we may repeat the computation
under the actual Binomial distribution:
> qbinom(0.975,4000,0.5)
[1] 2062
> qbinom(0.025,4000,0.5)
[1] 1938
Again, we get the interval [1938, 2062] as the central region, in agreement with
the one proposed by the Normal approximation. Notice that the function
“qbinom” produces the percentiles of the Binomial distribution. It may not
come as a surprise to learn that “qpois”, “qunif”, “qexp” compute the per-
centiles of the Poisson, Uniform and Exponential distributions, respectively.
The ability to approximate one distribution by the other, when computation
tools for both distributions are handy, seems to be of questionable importance.
Indeed, the significance of the Normal approximation is not so much in its
ability to approximate the Binomial distribution as such. Rather, the important
point is that the Normal distribution may serve as an approximation to a wide
class of distributions, with the Binomial distribution being only one example.
Computations that are based on the Normal approximation will be valid for
all members in the class of distributions, including cases where we don’t have
the computational tools at our disposal or even in cases where we do not know
what the exact distribution of the member is! As promised, a more detailed
discussion of the Normal approximation in a wider context will be presented in
the next chapter.
On the other hand, one should not assume that every distribution is well approximated
by the Normal distribution. For example, the distribution of wealth in
the population tends to be skewed, with more than 50% of the people possessing
less than 50% of the wealth and a small percentage of the people possessing the
majority of the wealth. The Normal distribution is not a good model for such a
distribution. The Exponential distribution, or distributions similar to it, may
be more appropriate.
> pbinom(6,30,0.3)
[1] 0.1595230
> pnorm(6,30*0.3,sqrt(30*0.3*0.7))
[1] 0.1159989
The Normal approximation, which is equal to 0.1159989, is not too close to the
actual probability, which is equal to 0.1595230.
A naïve application of the Normal approximation for the Binomial(n, p) dis-
tribution may not be so good when the number of trials n is small. Yet, a
small modification of the approximation may produce much better results. In
order to explain the modification consult Figure 6.6 where you will find the bar
[Figure 6.6: Bar plot of the Binomial(30, 0.3) distribution with the approximating Normal density superimposed; the bars that form the target probability are highlighted.]
plot of the Binomial distribution with the density of the approximating Nor-
mal distribution superimposed on top of it. The target probability is the sum
of heights of the bars that are painted in red. In the naïve application of the
Normal approximation we used the area under the normal density which is to
the left of the bar associated with the value x = 6.
Alternatively, you may associate with each bar located at x the area under
the normal density over the interval [x − 0.5, x + 0.5]. The resulting correction
to the approximation will use the Normal probability of the event {X ≤ 6.5},
which is the area shaded in red. The application of this approximation, which
is called the continuity correction, produces:
> pnorm(6.5,30*0.3,sqrt(30*0.3*0.7))
[1] 0.1596193
Observe that the corrected approximation is much closer to the target prob-
ability, which is 0.1595230, and is substantially better than the uncorrected
approximation which was 0.1159989. Generally, it is recommended to apply the
continuity correction to the Normal approximation of a discrete distribution.
Solution (to Question 6.2.2): Refer again to the probability P(X > 11).
A formal application of the Normal approximation replaces in the computation
the Binomial distribution by the Normal distribution with the same mean and
variance. Since E(X) = n · p = 27 · 0.32 = 8.64 and Var(X) = n · p · (1 − p) =
27 · 0.32 · 0.68 = 5.8752. If we take X ∼ Normal(8.64, 5.8752) and use the
function “pnorm” we get:
> 1 - pnorm(11,27*0.32,sqrt(27*0.32*0.68))
[1] 0.1651164
Therefore, the current Normal approximation proposes P(X > 11) ≈ 0.1651164.
Solution (to Question 6.2.3): The continuity correction, which considers an
interval of radius 0.5 about each value, replaces P(X > 11), that involves the values
{12, 13, . . . , 27}, by the probability P(X > 11.5). The Normal approximation uses the
Normal distribution with the same mean and variance, E(X) = 8.64 and
Var(X) = 5.8752. If we take X ∼ Normal(8.64, 5.8752) and use the function
“pnorm” we get:
> 1 - pnorm(11.5,27*0.32,sqrt(27*0.32*0.68))
[1] 0.1190149
The Normal approximation with continuity correction proposes P(X > 11) ≈
0.1190149.
Solution (to Question 6.2.4): The Poisson approximation replaces the Bi-
nomial distribution by the Poisson distribution with the same expectation. The
expectation is E(X) = n · p = 27 · 0.32 = 8.64. If we take X ∼ Poisson(8.64)
and use the function “ppois” we get:
> 1 - ppois(11,27*0.32)
[1] 0.1635232
Therefore, the Poisson approximation proposes P(X > 11) ≈ 0.1635232.
6.5 Summary
Glossary
Normal Random Variable: A bell-shaped distribution that is frequently used
to model a measurement. The distribution is marked with Normal(µ, σ 2 ).
Standard Normal Distribution: The Normal(0, 1). The distribution of a
standardized Normal measurement.
Percentile: Given a probability p, the value x is the p-percentile of a continuous
random variable X if it satisfies the equation P(X ≤ x) = p.
The Normal Approximation of the Binomial Distribution: Approximate
computations associated with the Binomial distribution with parallel com-
putations that use the Normal distribution with the same expectation and
standard deviation as the Binomial.
The Poisson Approximation of the Binomial Distribution: Approximate
computations associated with the Binomial distribution with parallel com-
putations that use the Poisson distribution with the same expectation as
the Binomial.
Consider, for example, testing IQ. The score of many IQ tests are modeled
as having a Normal distribution with an expectation of 100 and a standard
deviation of 15. The sample space of the Normal distribution is the entire line
of real numbers, including the negative numbers. In reality, IQ tests produce
only positive values.
Chapter 7
The Sampling Distribution
get a particular value. This is the observed value and is no longer a random
variable. In this section we extend the concept of a random variable and define
the concept of a random sample.
The sampled values are stored in the object “X.samp” and are presented above. The role of statistics is to make inference
on the parameters of the unobserved population based on the information that
is obtained from the sample.
For example, we may be interested in estimating the mean value of the
heights in the population. A reasonable proposal is to use the sample average
to serve as an estimate:
> mean(X.samp)
[1] 170.73
In our artificial example we can actually compute the true population mean:
> mean(pop.1$height)
[1] 170.035
Hence, we may see that although the match between the estimated value and
the actual value is not perfect, they are still close enough.
The actual estimate that we have obtained resulted from the specific sample
that was collected. Had we collected a different subset of 100 individuals we
would have obtained a different numerical value for the estimate. Consequently,
one may wonder: Was it pure luck that we got such good estimates? How likely
is it to get estimates that are close to the target parameter?
Notice that in realistic settings we do not know the actual value of the
target population parameters. Nonetheless, we would still want to have at
least a probabilistic assessment of the distance between our estimates and the
parameters they try to estimate. The sampling distribution is the vehicle that
may enable us to address these questions.
In order to illustrate the concept of the sampling distribution let us select
another sample and compute its average:
> X.samp <- sample(pop.1$height,100)
> X.bar <- mean(X.samp)
> X.bar
[1] 171.87
and do it once more:
> X.samp <- sample(pop.1$height,100)
> X.bar <- mean(X.samp)
> X.bar
[1] 171.08
In each case we got a different value for the sample average. In the first of
the last two iterations the result was more than 1 centimeter away from the
population average, which is equal to 170.035, and in the second it was within
the range of 1 centimeter. Can we say, prior to taking the sample, what is the
probability of falling within 1 centimeter of the population mean?
Chapter 4 discussed the random variable that emerges by randomly sampling
a single number from the population presented by the sequence “pop.1$height”.
The distribution of the random variable resulted from the assignment of the
probability 1/100,000 to each one of the 100,000 possible outcomes. The same
principle applies when we randomly sample 100 individuals. Each possible out-
come is a collection of 100 numbers and each collection is assigned equal prob-
ability. The resulting distribution is called the sampling distribution.
The distribution of the average of the sample emerges from this distribution:
With each sample one may associate the average of that sample. The probability
assigned to that average outcome is the probability of the sample. Hence, one
may assess the probability of falling within 1 centimeter of the population mean
using the sampling distribution. Each sample produces an average that either
falls within the given range or not. The probability of the sample average falling
within the given range is the proportion of samples for which this event happens
among the entire collection of samples.
However, we face a technical difficulty when we attempt to assess the sam-
pling distribution of the average and the probability of falling within 1 cen-
timeter of the population mean. Examination of the distribution of a sample
of a single individual is easy enough. The total number of outcomes, which is
100,000 in the given example, can be handled with no effort by the computer.
However, when we consider samples of size 100 we get that the total number
of ways to select 100 numbers out of 100,000 numbers is in the order of 10^342
(1 followed by 342 zeros) and cannot be handled by any computer. Thus, the
probability cannot be computed.
As a compromise we will approximate the distribution by selecting a large
number of samples, say 100,000, to represent the entire collection, and use
the resulting distribution as an approximation of the sampling distribution.
Indeed, the larger the number of samples that we create the more accurate
the approximation of the distribution is. Still, taking 100,000 repeats should
produce approximations which are good enough for our purposes.
Consider the sampling distribution of the sample average. We simulated
above a few examples of the average. Now we would like to simulate 100,000
such examples. We do this by creating first a sequence of the length of the
number of evaluations we seek (100,000) and then write a small program that
produces each time a new random sample of size 100 and assigns the value of
the average of that sample to the appropriate position in the sequence. Do first
and explain later1 :
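In outline, the program may take the following form (a sketch; its components are explained next):
> X.bar <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+   X.samp <- sample(pop.1$height,100)
+   X.bar[i] <- mean(X.samp)
+ }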
In the first line we produce a sequence of length 100,000 that contains zeros.
The function “rep” creates a sequence that contains repeats of its first argument
a number of times that is specified by its second argument. In this example,
the numerical value 0 is repeated 100,000 times to produce a sequence of zeros
of the length we seek.
1 Running this simulation, and similar simulations of the same nature that will be consid-
ered in the sequel, demands more of the computer’s resources than the examples that were
considered up until now. Beware that running times may be long and, depending on the
strength of your computer and your patience, too long. You may save time by running fewer it-
erations, replacing, say, “10^5” by “10^4”. The results of the simulation will be less accurate,
but will still be meaningful.
[Figure: histogram of the population heights (upper panel) and histogram of the 100,000 simulated sample averages “X.bar” (lower panel)]
The main part of the program is a “for” loop. The argument of the function
“for” takes the special form: “index.name in index.values”, where index.name
is the name of the running index and index.values is the collection of values over
which the running index is evaluated. In each iteration of the loop the running
index is assigned a value from the collection and the expression that follows the
brackets of the “for” function is evaluated with the given value of the running
index.
In the given example the collection of values is produced by the expression
“1:n”. Recall that the expression “1:n” produces the collection of integers be-
tween 1 and n. Here, n = 100,000. Hence, in the given application the collection
of values is a sequence that contains the integers between 1 and 100,000. The
running index is called “i”. The expression is evaluated 100,000 times, each time
with a different integer value for the running index “i”.
The R system treats a collection of expressions enclosed within curly brackets
as one entity. Therefore, in each iteration of the “for” loop, the lines that are
within the curly brackets are evaluated. In the first line a random sample of size
100 is produced and in the second line the average of the sample is computed and
stored in the i-th position of the sequence “X.bar”. Observe that the specific
sample, and hence its average, changes from one iteration to the next; at the end
of the loop the sequence “X.bar” contains 100,000 simulated sample averages.
Compare the summaries of the population distribution to those of the simulated
sampling distribution of the average:
> mean(pop.1$height)
[1] 170.035
> sd(pop.1$height)
[1] 11.23205
> mean(X.bar)
[1] 170.037
> sd(X.bar)
[1] 1.122116
Observe that the expectation of the population and the expectation of the sample
average are practically the same, whereas the standard deviation of the sample
average is about 10 times smaller than the standard deviation of the population.
This result is not accidental and actually reflects a general phenomenon
that will be seen below in other examples.
We may use the simulated sampling distribution in order to compute an ap-
proximation of the probability of the sample average falling within 1 centimeter
of the population mean. Let us first compute the relevant probability and then
explain the details of the computation:
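In outline, the computation may read:
> mean(abs(X.bar - mean(pop.1$height)) <= 1)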
Hence we get that the probability of the given event is about 62.6%.
The object “X.bar” is a sequence of length 100,000 that contains the simu-
lated sample averages. This sequence represents the distribution of the sample
average. Notice that the expression “abs(X.bar - mean(pop.1$height)) <=
1” produces a sequence of logical “TRUE” or “FALSE” values, depending on the
value of the sample average being less or more than one unit away from the
population mean. The application of the function “mean” to the output of the
last expression results in the computation of the relative frequency of TRUEs,
which corresponds to the probability of the event of interest.
Again, one may wonder: what is the distribution of the sample average X̄ in this
case?
We can approximate the distribution of the sample average by simulation.
The function “rbinom” produces a random sample from the Binomial distribu-
tion. The first argument to the function is the sample size, which we take in this
example to be equal to 64. The second and third arguments are the parameters
of the Binomial distribution, 10 and 0.5 in this case. We can use this function
in the simulation:
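In outline, the simulation may take the following form:
> X.bar <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+   X.samp <- rbinom(64,10,0.5)
+   X.bar[i] <- mean(X.samp)
+ }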
Observe that in this code we created a sequence of length 100,000 with evalua-
tions of the sample average of 64 Binomial random variables. We start with a
sequence of zeros and in each iteration of the “for” loop a zero is replaced by
the average of a random sample of 64 Binomial random variables.
Examine the sampling distribution of the Binomial average:
> hist(X.bar)
> mean(X.bar)
[1] 4.999074
> sd(X.bar)
[1] 0.1982219
The histogram of the sample average is presented in the lower panel of Figure 7.2.
Compare it to the distribution of a single Binomial random variable that appears
in the upper panel. Notice, once more, that the centers of the two distributions
coincide but the spread of the sample average is smaller. The sample space of
a single Binomial random variable is composed of integers. The sample space
of the average of 64 Binomial random variables, on the other hand, contains
many more values and is closer to the sample space of a random variable with
a continuous distribution.
Recall that the expectation of a Binomial(10, 0.5) random variable is E(X) =
10 · 0.5 = 5 and
the variance is Var(X) = 10 · 0.5 · 0.5 = 2.5 (thus, the standard
deviation is √2.5 = 1.581139). Observe that the expectation of the sample
average that we got from the simulation is essentially equal to 5 and the standard
deviation is 0.1982219.
One may prove mathematically that the expectation of the sample mean is
equal to the theoretical expectation of its components:
E(X̄) = E(X) .
The results of the simulation for the expectation of the sample average are
consistent with the mathematical statement. The mathematical theory of prob-
ability may also be used in order to prove that the variance of the sample average
is equal to the variance of each of the components, divided by the sample size:
Var(X̄) = Var(X)/n .
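As a numerical check of this relation in the current example, one may compare the theoretical value √(Var(X)/n) = √(2.5/64) with the standard deviation that was obtained in the simulation:
> sqrt(2.5/64)
[1] 0.1976424
The result is indeed close to the simulated value 0.1982219.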
Consider the problem of identifying the central interval that contains 95%
of the distribution. In the Normal distribution we were able to use the function
“qnorm” in order to compute the percentiles of the theoretical distribution. A
function that can be used for the same purpose for simulated distribution is
the function “quantile”. The first argument to this function is the sequence
of simulated values of the statistic, “X.bar” in the current case. The second
argument is a number between 0 and 1, or a sequence of such numbers:
> quantile(X.bar,c(0.025,0.975))
2.5% 97.5%
4.609375 5.390625
We used the sequence “c(0.025,0.975)” as the input to the second argument.
As a result we obtained the output 4.609375, which is the 0.025-percentile of the
sampling distribution of the average, and 5.390625, which is the 0.975-percentile
of the sampling distribution of the average.
Of interest is to compare these percentiles to the parallel percentiles of the
Normal distribution with the same expectation and the same standard deviation
as the average:
> qnorm(c(0.025,0.975),mean(X.bar),sd(X.bar))
[1] 4.611456 5.389266
Observe the similarity between the percentiles of the distribution of the average
and the percentiles of the Normal distribution. This similarity is a reflection of
the Normal approximation of the sampling distribution of the average, which is
formulated in the next section under the title: The Central Limit Theorem.
0.4
N(0,1)
n=10
n=100
n=1000
0.3
Density
0.2
0.1
0.0
−3 −2 −1 0 1 2 3
Consider the standardized sample average, which is obtained by subtracting from
the sample average its expectation and dividing by its standard deviation:

Z = (X̄ − E(X̄)) / √Var(X̄) .
Recall that the expectation of the sample average is equal to the expectation
of a single random variable (E(X̄) = E(X)) and that the variance of the sample
average is equal to the variance of a single observation, divided by the sample
size (Var(X̄) = Var(X)/n). Consequently, one may rewrite the standardized
sample average in the form:
Z = (X̄ − E(X)) / √(Var(X)/n) = √n (X̄ − E(X)) / √Var(X) .
The second equality follows from moving the square root of n, which divides the
term inside the denominator, up to the numerator. Observe that with the increase
of the sample size the decreasing difference between the average and the
expectation is magnified by the square root of n.
The Central Limit Theorem states that, with the increase in sample size,
the sample average converges (after standardization) to the standard Normal
distribution.
Let us examine the Central Limit Theorem in the context of the example
of the Uniform measurement. In Figure 7.3 you may find the (approximated)
density of the standardized average for the three sample sizes, based on the
simulation that we carried out previously (as red, green, and blue lines).
Alongside these densities you may also find the theoretical density of the standard
Normal distribution (as a black line). Observe that the four curves lie almost
one on top of the other, suggesting that the approximation of the distribution of
the average by the Normal distribution is good even for a sample size as small
as n = 10.
However, before jumping to the conclusion that the Central Limit Theorem
applies to any sample size, let us consider another example. In this example we
repeat the same simulation that we did with the Uniform distribution, but this
time we take Exponential(0.5) measurements instead:
> exp.10 <- rep(0,10^5)
> exp.100 <- rep(0,10^5)
> exp.1000 <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+ X.samp.10 <- rexp(10,0.5)
+ exp.10[i] <- mean(X.samp.10)
+ X.samp.100 <- rexp(100,0.5)
+ exp.100[i] <- mean(X.samp.100)
+ X.samp.1000 <- rexp(1000,0.5)
+ exp.1000[i] <- mean(X.samp.1000)
+ }
The expectation of an Exponential(0.5) random variable is E(X) = 1/λ =
1/0.5 = 2 and the variance is Var(X) = 1/λ² = 1/(0.5)² = 4. Notice
that the expectations of the sample averages are equal to the expectation of
the measurement and the variances of the sample averages follow the relation
Var(X̄) = Var(X)/n:
> mean(exp.10)
[1] 1.999888
> mean(exp.100)
[1] 2.000195
> mean(exp.1000)
[1] 1.999968
So the expectations of the sample averages are all essentially equal to 2. For the variance
we get:
> var(exp.10)
[1] 0.4034642
> var(exp.100)
[1] 0.03999479
> var(exp.1000)
[1] 0.004002908
[Figure: densities of the standardized average of Exponential(0.5) measurements for n = 10, 100 and 1000, together with the density of the standard Normal distribution]
The Central Limit Theorem allows us, in the context of the sample average, to carry out probabilistic computations
using the Normal distribution even if we do not know the actual distribution of
the measurement. All we need to know for the computation are the expectation
of the measurement, its variance (or standard deviation) and the sample size.
The theorem can be applied whenever probability computations associated
with the sampling distribution of the average are required. The computation
of the approximation is carried out by using the Normal distribution with the
same expectation and the same standard deviation as the sample average.
An example of such computation was conducted in Subsection 7.2.3 where
the central interval that contains 95% of the sampling distribution of a Binomial
average was required. The 0.025- and the 0.975-percentiles of the Normal distri-
bution with the same expectation and variance as the sample average produced
boundaries for the interval. These boundaries were in good agreement with the
boundaries produced by the simulation. More examples will be provided in the
quizzes that are associated with this chapter.
With all its usefulness, one should treat the Central Limit Theorem with a
grain of salt. The approximation may be valid for large samples, but may be
bad for samples that are not large enough. When the sample is small a careless
application of the Central Limit Theorem may produce misleading conclusions.
sex: A factor variable. The sex of each subject. The values are either “MALE”
or “FEMALE”.
group: A factor variable. The blood pressure category of each subject. The
value is “NORMAL” if both the systolic blood pressure is within its normal
range (between 90 and 139) and the diastolic blood pressure is within its
normal range (between 60 and 89). The value is “HIGH” if either measurement
of blood pressure is above its normal upper limit and it is
“LOW” if either measurement is below its normal lower limit.
Our goal in this question is to investigate the sampling distribution of the sample
average of the variable “bmi”. We assume a sample of size n = 150.
3. Compute the expectation of the sampling distribution for the sample av-
erage of the variable.
4. Compute the standard deviation of the sampling distribution for the sam-
ple average of the variable.
5. Identify, using simulations, the central region that contains 80% of the
sampling distribution of the sample average.
Solution (to Question 7.1.1): After placing the file “pop2.csv” in the work-
ing directory one may produce a data frame with the content of the file and
compute the average of the variable “bmi” using the code:
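In outline, the code may read (the data frame is given the name “pop.2”, which is the name used in the computations that follow):
> pop.2 <- read.csv("pop2.csv")
> mean(pop.2$bmi)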
Solution (to Question 7.1.2): Applying the function “sd” to the sequence
of population values produces the population standard deviation:
> sd(pop.2$bmi)
[1] 4.188511
Solution (to Question 7.1.3): In order to compute the expectation under the
sampling distribution of the sample average we conduct a simulation. The sim-
ulation produces (an approximation) of the sampling distribution of the sample
average. The sampling distribution is represented by the content of the sequence
“X.bar”:
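In outline, the simulation may take the following form, with the expectation then approximated by the mean of the simulated values:
> X.bar <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+   X.samp <- sample(pop.2$bmi,150)
+   X.bar[i] <- mean(X.samp)
+ }
> mean(X.bar)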
Solution (to Question 7.1.4): The standard deviation of the sample average
under the sampling distribution is computed using the function “sd”:
> sd(X.bar)
[1] 0.3422717
The resulting standard deviation is 0.3422717. Recall that the standard devi-
ation of a single measurement is equal to 4.188511 and that the sample size is
n = 150. The ratio between the standard deviation of the measurement and the
square root of 150 is 4.188511/√150 = 0.3419905, which is similar in value to
the standard deviation of the sample average3 .
Solution (to Question 7.1.5): The central region that contains 80% of the
sampling distribution of the sample average can be identified with the aid of the
function “quantile”:
> quantile(X.bar,c(0.1,0.9))
10% 90%
24.54972 25.42629
The value 24.54972 is the 0.10-percentile of the sampling distribution. To the
left of this value are 10% of the distribution. The value 25.42629 is the 0.90-
percentile of the sampling distribution. To the right of this value are 10% of the
distribution. Between these two values are 80% of the sampling distribution.
Solution (to Question 7.1.6): The Normal approximation, which is the con-
clusion of the Central Limit Theorem, substitutes the sampling distribution of
the sample average by the Normal distribution with the same expectation and
standard deviation. The percentiles are computed with the function “qnorm”:
> qnorm(c(0.1,0.9),mean(X.bar),sd(X.bar))
[1] 24.54817 25.42545
Observe that we used the expectation and the standard deviation of the sample
average in the function. The resulting interval is [24.54817, 25.42545], which is
similar to the interval [24.54972, 25.42629] which was obtained via simulations.
The small discrepancy between the two values stems from the fact that the sequence “X.bar” is only an approximation of the sampling distribution.
3 It can be shown theoretically that the variance of the sample average, in the case of
sampling from a population, is equal to [(N − n)/(N − 1)] · Var(X)/n, where Var(X) is the
population variance of the measurement, n is the sample size, and N is the population size.
The factor [(N − n)/(N − 1)] is called the finite population correction. In the current setting
the finite population correction is equal to 0.99851, which is practically equal to one.
4. The central region that contains 99% of the distribution of the average is
of the form 5 ± c. Use the Central Limit Theorem in order to approximate
the value of c.
Solution (to Question 7.2.1): Denote by X the distance from the specified
endpoint of a random hit. Observe that X ∼ Uniform(0, 10). The 25 hits form a
sample X1 , X2 , . . . , X25 from this distribution and the sample average X̄ is the
average of these random locations. The expectation of the average is equal to the
expectation of a single measurement. Since E(X) = (a + b)/2 = (0 + 10)/2 = 5
we get that E(X̄) = 5.
Solution (to Question 7.2.2): The variance of the sample average is equal to
the variance of a single measurement, divided by the sample size. The variance
of the Uniform distribution is Var(X) = (b − a)²/12 = (10 − 0)²/12 = 8.333333.
The standard deviation of the sample average is equal to the standard deviation
of a single measurement, divided by the square root of the sample size. The
sample size is n = 25. Consequently, the standard deviation of the average is
√(8.333333/25) = 0.5773503.
Solution (to Question 7.2.3): The left-most third of the detector is the
interval to the left of 10/3. The distribution of the sample average, according
to the Central Limit Theorem, is Normal. The probability of being less than
10/3 for the Normal distribution may be computed with the function “pnorm”:
> mu <- 5
> sig <- sqrt(100/(12*25))
> pnorm(10/3,mu,sig)
[1] 0.001946209
The expectation and the standard deviation of the sample average are used in
computation of the probability. The probability is 0.001946209, about 0.2%.
Solution (to Question 7.2.4): The central region in the Normal(µ, σ²) distri-
bution that contains 99% of the distribution is of the form µ ± qnorm(0.995) · σ,
where “qnorm(0.995)” is the 0.995-percentile of the Standard Normal distribu-
tion. Therefore, c = qnorm(0.995) · σ:
> qnorm(0.995)*sig
[1] 1.487156
7.5 Summary
Glossary
Random Sample: The probabilistic model for the values of the measurements
in the sample, before the measurements are taken.
Sampling Distribution: The distribution of a random sample.
Sampling Distribution of a Statistic: A statistic is a function of the data;
i.e. a formula applied to the data. The statistic becomes a random variable
when the formula is applied to a random sample. The distribution of this
random variable, which is inherited from the distribution of the sample,
is its sampling distribution.
Sampling Distribution of the Sample Average: The distribution of the sam-
ple average, considered as a random variable.
The Law of Large Numbers: A mathematical result regarding the sampling
distribution of the sample average. States that the distribution of the av-
erage of measurements is highly concentrated in the vicinity of the expec-
tation of a measurement when the sample size is large.
The Central Limit Theorem: A mathematical result regarding the sampling
distribution of the sample average. States that the distribution of the av-
erage is approximately Normal when the sample size is large.
Summary of Formulas
Expectation of the sample average: E(X̄) = E(X)
Variance of the sample average: Var(X̄) = Var(X)/n
Chapter 8
Overview and Integration
• Integrate the tools that were given in the first part of the book in order
to solve complex problems.
8.2 An Overview
The purpose of the first part of the book was to introduce the fundamentals
of statistics and teach the concepts of probability which are essential for the
understanding of the statistical procedures that are used to analyze data.
Data is typically obtained by selecting a sample from a population and taking
measurements on the sample. There are many ways to select a sample, but no
method of selection should violate the most important characteristic that a
sample must possess, namely that it represents the population it came
from. In this book we concentrate on simple random sampling. However, the
reader should be aware of the fact that other sampling designs exist and may
be more appropriate in specific applications. Given the sampled data, the main
concern of the science of statistics is making inference on the parameters of
the population on the basis of the data collected. Such inferences are carried
out with the aid of statistics, which are functions of the data.
Data is frequently stored in the format of a data frame, in which columns
are the measured variables and the rows are the observations associated with the
selected sample. The main types of variables are numeric, either discrete or not,
and factors. We learned how one can produce data frames and read them into
R for further analysis.
Statistics is geared towards dealing with variability. Variability may emerge
in different forms and for different reasons. It can be summarized, analyzed and
handled with many tools. Frequently, the same tool, or tools that have much
resemblance to each other, may be applied in different settings and for different
forms of variability. In order not to lose track it is important to understand in
each scenario the source and nature of the variability that is being examined.
An important split in terms of the source of variability is between descriptive
statistics and probability. Descriptive statistics examines the distribution of
data. The frame of reference is the data itself. Plots, such as bar plots,
histograms and box plots; tables, such as the frequency and relative frequency
as well as the cumulative relative frequency; and numerical summaries, such as
the mean, median and standard deviation, can all serve in order to understand
the distribution of the given data set.
In probability, on the other hand, the frame of reference is not the data at
hand but, instead, it is all data sets that could have been sampled (the sample
space). One may use similar plots, tables, and numerical summaries in order
to analyze the distribution of functions of the sample, but the meaning of the
analysis is different. As a matter of fact, the relevance of the probabilistic
analysis to the data actually sampled is indirect. The given sample is only
one realization within the sample space among all possible realizations. In
the probabilistic context there is no special role for the observed realization in
comparison to all other potential realizations.
The fact that the relation between probabilistic variability and the observed
data is not direct does not make the relation unimportant. On the contrary,
this indirect relation is the basis for making statistical inference. In statistical
inference the characteristics of the data may be used in order to extrapolate
from the sampled data to the entire population. Probabilistic description of the
distribution of the sample is then used in order to assess the reliability of the
extrapolation. For example, one may try to estimate the value of population
parameters, such as the population average and the population standard devi-
ation, on the basis of the parallel characteristics of the data. The variability
of the sampling distribution is used in order to quantify the accuracy of this
estimation. (See Example 5 below.)
Statistics, like many other empirically driven forms of science, uses theo-
retical modeling for assessing and interpreting observational data. In statistics
this modeling component usually takes the form of a probabilistic model for
the measurements as random variables. In the first part of this book we have
encountered several such models. The model of simple sampling assumed that
each subset of a given size from the population has equal probability to be se-
lected as the sample. Other, more structured models, assumed a specific form
to the distribution of the measurements. The examples we considered were the
Binomial, the Poisson, the Uniform, the Exponential and the Normal distribu-
tions. Many more models may be found in the literature and may be applied
when appropriate. Some of these other models have R functions that can be
used in order to compute the distribution and produce simulations.
A statistic is a function of sampled data that is used for making statistical
inference. When a statistic, such as the average, is computed on a random
sample then the outcome, from a probabilistic point of view, is a random variable.

8.3 Integrated Applications
8.3.1 Example 1
A study involving stress is done on a college campus among the students. The
stress scores follow a (continuous) Uniform distribution with the lowest stress
score equal to 1 and the highest equal to 5. Using a sample of 75 students, find:
1. The probability that the average stress score for the 75 students is less
than 2.
2. The 90th percentile for the average stress score for the 75 students.
3. The probability that the total of the 75 stress scores is less than 200.
4. The 90th percentile for the total stress score for the 75 students.
Solution:
Denote by X the stress score of a random student. We are given that X ∼
Uniform(1, 5). We use the formulas E(X) = (a + b)/2 and Var(X) = (b − a)²/12
in order to obtain the expectation and variance of a single observation and then
we use the relations E(X̄) = E(X) and Var(X̄) = Var(X)/n to translate these
results to the expectation and variance of the sample average:
> a <- 1
> b <- 5
> n <- 75
> mu.bar <- (a+b)/2
> sig.bar <- sqrt((b-a)^2/(12*n))
> mu.bar
[1] 3
> sig.bar
[1] 0.1333333
After obtaining the expectation and the variance of the sample average we
can forget about the Uniform distribution and proceed only with the R functions
that are related to the Normal distribution. By the Central Limit Theorem we
get that the distribution of the sample average is approximately Normal(µ, σ²),
with µ = mu.bar and σ = sig.bar.
In Question 1.1 we are asked to find the value of the cumulative distri-
bution function of the sample average at x = 2:
> pnorm(2,mu.bar,sig.bar)
[1] 3.190892e-14
The goal of Question 1.2 is to identify the 0.9-percentile of the sample aver-
age:
> qnorm(0.9,mu.bar,sig.bar)
[1] 3.170874
The sample average is equal to the total sum divided by the number of
observations, n = 75 in this example. The total sum is less than 200 if, and
only if the average is less than 200/n. Therefore, for Question 1.3:
> pnorm(200/n,mu.bar,sig.bar)
[1] 0.006209665
Finally, if 90% of the distribution of the average is less than 3.170874 then
90% of the distribution of the total sum is less than 3.170874 · n. In Question 1.4
we get:
> n*qnorm(0.9,mu.bar,sig.bar)
[1] 237.8155
8.3.2 Example 2
Consider again the same stress study that was described in Example 1 and
answer the same questions. However, this time assume that the stress score may
obtain only the values 1, 2, 3, 4 or 5, with the same likelihood for obtaining
each of the values.
Solution:
Denote again by X the stress score of a random student. The modified dis-
tribution states that the sample space of X is the set of integers {1, 2, 3, 4, 5}, with
equal probability for each value. Since the probabilities must sum to 1 we get
that P(X = x) = 1/5, for all x in the sample space. In principle we may repeat
the steps of the solution of previous example, substituting the expectation and
standard deviation of the continuous measurement by the discrete counterpart:
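In outline, the computation may read (the names “x” and “p” for the values and their probabilities are illustrative):
> x <- 1:5
> p <- rep(1/5,5)
> n <- 75
> mu.bar <- sum(x*p)
> sig.bar <- sqrt(sum((x-mu.bar)^2*p)/n)
> mu.bar
[1] 3
> sig.bar
[1] 0.1632993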
Notice that the expectation of the sample average is the same as before but
the standard deviation is somewhat larger due to the larger variance in the
distribution of a single response.
We may apply the Central Limit Theorem again in order to conclude that the
distribution of the average is approximately Normal(µ, σ²), with µ = mu.bar as
before and with the new value of σ = sig.bar.
For Question 2.1 we compute that the cumulative distribution function of
the sample average at x = 2 is approximately equal to:
> pnorm(2,mu.bar,sig.bar)
[1] 4.570649e-10
> qnorm(0.9,mu.bar,sig.bar)
[1] 3.209276
> pnorm(200/n,mu.bar,sig.bar)
[1] 0.02061342
Observe that in the current version of the question the score is
integer-valued. Clearly, the sum of scores is also integer valued. Hence we may
choose to apply the continuity correction for the Normal approximation whereby
we approximate the probability that the sum is less than 200 (i.e. is less than
or equal to 199) by the probability that a Normal random variable is less than
or equal to 199.5. Translating this event back to the scale of the average we get
the approximation1 :
> pnorm(199.5/n,mu.bar,sig.bar)
[1] 0.01866821
Finally, if 90% of the distribution of the average is less than 3.209276 then
90% of the distribution of the total sum is less than 3.209276 · n. Therefore:
> n*qnorm(0.9,mu.bar,sig.bar)
[1] 240.6957
or, after rounding to the nearest integer we get for Question 2.4 the answer 241.
8.3.3 Example 3
Suppose that a market research analyst for a cellular phone company conducts
a study of their customers who exceed the time allowance included on their
basic cellular phone contract. The analyst finds that for those customers who
exceed the time included in their basic contract, the excess time used follows an
exponential distribution with a mean of 22 minutes. Consider a random sample
of 80 customers and find
1. The probability that the average excess time used by the 80 customers in
the sample is longer than 20 minutes.
2. The 95th percentile for the average excess time for samples of 80 customers
who exceed their basic contract time allowances.
Solution:
Let X be the excess time for customers who exceed the time included in their
basic contract. We are told that X ∼ Exponential(λ). For the Exponential
distribution E(X) = 1/λ. Hence, given that E(X) = 22 we can conclude that
λ = 1/22. For the Exponential we also have that Var(X) = 1/λ². Therefore:
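In outline, the expectation and the standard deviation of the sample average may be obtained as follows (the names follow the pattern of the previous examples):
> lam <- 1/22
> n <- 80
> mu.bar <- 1/lam
> sig.bar <- sqrt(1/(lam^2*n))
> mu.bar
[1] 22
> sig.bar
[1] 2.459675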
Like before, we can forget at this stage about the Exponential distribution
and refer henceforth to the Normal distribution. In Question 3.1 we are asked
to compute the probability of being above x = 20. The total probability is 1. Hence,
the required probability is the difference between 1 and the probability of being
less or equal to x = 20:
> 1-pnorm(20,mu.bar,sig.bar)
[1] 0.7919241
The goal in Question 3.2 is to find the 0.95-percentile of the sample average:
> qnorm(0.95,mu.bar,sig.bar)
[1] 26.04580
8.3.4 Example 4
A beverage company produces cans that are supposed to contain 16 ounces of
beverage. Under normal production conditions the expected amount of beverage
in each can is 16.0 ounces, with a standard deviation of 0.10 ounces.
As a quality control measure, each hour the QA department samples 50 cans
from the production during the previous hour and measures the content in each
of the cans. If the average content of the 50 cans is below a control threshold
then production is stopped and the can filling machine is re-calibrated.
3. Find a threshold with the property that the probability of stopping the
machine in a given hour is 5% when, in fact, the production conditions
are normal.
5. Based on the data in the file “QC.csv”, which of the hours contains mea-
surements which are suspected outliers in comparison to the other mea-
surements conducted during that hour?
Solution
The only information we have on the distribution of each measurement is its
expectation (16.0 ounces under normal conditions) and its standard deviation
(0.10, under the same condition). We do not know, from the information pro-
vided in the question, the actual distribution of a measurement. (The fact
that the production conditions are normal does not imply that the distribution
of the measurement is Normal.) Nonetheless, by the Central Limit Theorem, the
distribution of the average of the 50 measurements is approximately Normal.
2 URL for the file: http://pluto.huji.ac.il/~msby/StatThink/Datasets/QC.csv
[Figure 8.1: side-by-side box plots of the 50 measurements in each of the hours h1-h8]
> pnorm(15.95,16,0.1/sqrt(50))
[1] 0.000203476
Hence, we get that the probability of the average being less than 15.95 ounces
is (approximately) 0.0002, which is a solution to Question 4.2.
In order to solve Question 4.3 we may apply the function “qnorm” in order
to compute the 0.05-percentile of the distribution of the average:
> qnorm(0.05,16,0.1/sqrt(50))
[1] 15.97674
Consider the data in the file “QC.csv”. Let us read the data into a data
frame by the name “QC” and apply the function “summary” to obtain an
overview of the content of the file:
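In outline (assuming the file is located in the working directory):
> QC <- read.csv("QC.csv")
> summary(QC)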
Observe that the file contains 8 quantitative variables that are given the names
h1, . . . , h8. Each of these variables contains the 50 measurements conducted in
the given hour.
Observe that the mean is computed as part of the summary. The threshold
that we apply to monitor the filling machine is 15.97674. Clearly, the average
of the measurements at the third hour “h3” is below the threshold. Not enough
significant digits of the average of the 8th hour are presented to be able to
tell whether that average is below or above the threshold. A more accurate
presentation of the computed mean is obtained by the application of the function
“mean” directly to the data:
> mean(QC$h8)
[1] 15.9736
Now we can see that the average is below the threshold. Hence, the machine
required re-calibration after the 3rd and the 8th hours, which is the answer to
Question 4.4.
In Chapter 3 it was proposed to use box plots in order to identify points that
are suspected to be outliers. We can use the expression “boxplot(QC$h1)” in
order to obtain the box plot of the data of the first hour, and then go through the
variables one by one in order to screen them all. Alternatively,
we may apply the function “boxplot” directly to the data frame “QC” and get
a plot with box plots of all the variables in the data frame plotted side by side
(see Figure 8.1):
> boxplot(QC)
Examining the plots we may see that evidence for the existence of outliers can be
spotted in the 4th, 6th, 7th, and 8th hours, providing an answer to Question 4.5.
8.3.5 Example 5
A measurement follows the Uniform(0, b) distribution, for an unknown value of b. Two
statisticians propose two distinct ways to estimate the unknown quantity b with
the aid of a sample of size n = 100. Statistician A proposes to use twice the
sample average (2X̄) as an estimate. Statistician B proposes to use the largest
observation instead.
In order to choose between the two options they agree to prefer the statistic
that tends to have values that are closer to b (with respect to the sampling
distribution). They agree to compute the expectation and variance of each
statistic. The performance of a statistic is evaluated using the mean square
error (MSE), which is defined as the sum of the variance and the squared
difference between the expectation and b. Namely, if T is the statistic (either
the one proposed by Statistician A or Statistician B) then

MSE(T) = Var(T) + (E(T) − b)² .
Solution
In Questions 5.1 and 5.2 we take the value of b to be equal to 10. Consequently,
the distribution of a measurement is Uniform(0, 10). In order to generate the
sampling distributions we produce two sequences, “A” and “B”, both of length
100,000, with the evaluations of the statistics:
> A <- rep(0,10^5)
> B <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+   X.samp <- runif(100,0,10)
+   A[i] <- 2*mean(X.samp)
+   B[i] <- max(X.samp)
+ }
Observe that in each iteration of the “for” loop a sample of size n = 100
from the Uniform(0, 10) distribution is generated. The statistic proposed by
Statistician A (“2*mean(X.samp)”) is computed and stored in sequence “A”
and the statistic proposed by Statistician B (“max(X.samp)”) is computed and
stored in sequence “B”.
Consider the statistic proposed by Statistician A:
> mean(A)
[1] 9.99772
> var(A)
[1] 0.3341673
> var(A) + (mean(A)-10)^2
[1] 0.3341725
The expectation of the statistic is 9.99772 and the variance is 0.3341673. Con-
sequently, we get that the mean square error is equal to 0.3341725. Consider
next the statistic proposed by Statistician B:
> mean(B)
[1] 9.901259
> var(B)
[1] 0.00950006
> var(B) + (mean(B)-10)^2
[1] 0.01924989
Observe that the mean square error of the statistic proposed by Statistician B
is smaller.
For Questions 5.3 and 5.4 we run the same type of simulations. All we
change is the value of b (from 10 to 13.7):
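In outline, the modified simulation may read:
> A <- rep(0,10^5)
> B <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+   X.samp <- runif(100,0,13.7)
+   A[i] <- 2*mean(X.samp)
+   B[i] <- max(X.samp)
+ }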
> mean(A)
[1] 13.70009
> var(A)
[1] 0.6264204
> var(A) + (mean(A)-13.7)^2
[1] 0.6264204
The expectation of the statistic in this setting is 13.70009 and the variance is
0.6264204. Consequently, we get that the mean square error is equal to 0.6264204.
Once more, the mean square error of the statistic proposed by Statistician B is
smaller.
Considering the fact that the mean square error of the statistic proposed
by Statistician B is smaller in both cases we may conclude that this statistic
seems to be better for the estimation of b in this setting of Uniformly distributed
measurements3 .
3 As a matter of fact, it can be proved that the statistic proposed by Statistician B has a
smaller mean square error than the statistic proposed by Statistician A, for any value of b.
Part II
Statistical Inference
Chapter 9
Introduction to Statistical Inference
data emerges from the examination of the probabilistic properties of the formal
computations.
Typically, the formal computations will involve statistics, which are functions
of the data. The assessment of the probabilistic properties of the computations
will result from the sampling distribution of these statistics.
An example of a problem that requires statistical inference is the estimation
of a parameter of the population using the observed data. Point estimation
attempts to obtain the best guess to the value of that parameter. An estimator
is a statistic that produces such a guess. One may prefer an estimator whose
sampling distribution is more concentrated about the population parameter
value over another estimator whose sampling distribution is less so. Hence, the
justification for selecting a specific statistic as an estimator is a consequence of
the probabilistic characteristics of this statistic in the context of the sampling
distribution.
An alternative approach for the estimation of a parameter is to construct
an interval that is most likely to contain the population parameter. Such an
interval, which is computed on the basis of the data, is called a confidence
interval. The sampling probability that the confidence interval will indeed con-
tain the parameter value is called the confidence level. Confidence intervals are
constructed so as to have a prescribed confidence level.
A different problem in statistical inference is hypothesis testing. The scientific
paradigm involves the proposal of new theories and hypotheses that presumably
provide a better description of the laws of Nature. On the basis of these hy-
potheses one may propose predictions that can be examined empirically. If the
empirical evidence is consistent with the predictions of the new hypothesis but
not with those of the old theory then the old theory is rejected in favor of the
new one. Otherwise, the established theory maintains its status. Statistical hy-
pothesis testing is a formal method for determining which of the two hypotheses
should prevail.
Each of the two hypotheses, the old and the new, predicts a different dis-
tribution for the empirical measurements. In order to decide which of the dis-
tributions is more in tune with the data a statistic is computed. This statistic
is called the test statistic. A threshold is set and, depending on where the test
statistic falls with respect to this threshold, the decision is made whether or not
to reject the old theory in favor of the new one.
This decision rule is not error proof, since the test statistic may fall by
chance on the wrong side of the threshold. Nonetheless, by the examination of
the sampling distribution of the test statistic one is able to assess the probability
of making an error. In particular, the probability of erroneously rejecting the
currently accepted theory (the old one) is called the significance level of the test.
Indeed, the threshold is selected in order to assure a small enough significance
level.
The method of testing hypothesis is also applied in other practical settings
where it is required to make decisions. For example, before a new treatment
to a medical condition is approved for marketing by the appropriate authorities
it must undergo a process of objective testing through clinical trials. In these
trials the new treatment is administered to some patients while others obtain the
(currently) standard treatment. Statistical tests are applied in order to compare
the two groups of patients. The new treatment is released to the market only if
it is shown to be beneficial with statistical significance and it is shown to have
         (Other):100
 drive.wheels  engine.location   wheel.base        length
 4wd:  9       front:202        Min.   : 86.60   Min.   :141.1
 fwd:120       rear :  3        1st Qu.: 94.50   1st Qu.:166.3
 rwd: 76                        Median : 97.00   Median :173.2
                                Mean   : 98.76   Mean   :174.0
                                3rd Qu.:102.40   3rd Qu.:183.1
                                Max.   :120.90   Max.   :208.1
Observe that the first 6 variables are factors, i.e. they contain qualitative
data that is associated with categorization or the description of an attribute.
The last 11 variables are numeric and contain quantitative data.
Factors are summarized in R by listing the attributes and the frequency of
each attribute value. If the number of attributes is large then only the most
frequent attributes are listed. Numerical variables are summarized in R with
the aid of the smallest and largest values, the three quartiles (Q1, the median,
and Q3) and the average (mean).
The third factor variable, “num.of.doors”, as well as several of the numeri-
cal variables have a special category titled “NA’s”. This category describes the
number of missing values among the observations. For a given variable, the
observations for which a value for the variable is not recorded, are marked as
missing. R uses the symbol “NA” to identify a missing value3 .
Missing observations are a concern in the analysis of statistical data. If
3 Indeed, if you scan the CSV file directly by opening it with a spreadsheet then every here
and there you will find an empty cell; these cells correspond to the missing values.
the relative frequency of missing values is substantial and the reason for not
obtaining the data for specific observations is related to the phenomenon under
investigation, then naïve statistical inference may produce biased conclusions.
In the “cars” data frame missing values are less of a concern since their relative
frequency is low.
Generally, one should be on the lookout for missing values when applying R
to data since the different functions may have different ways for dealing with
missing values. One should make sure that the appropriate way is applied for
the specific analysis.
Consider the variables of the data frame “cars”:
fuel.type: The type of fuel used by the car, either diesel or gas (a factor).
wheel.base: The distance between the centers of the front and rear wheels in
inches (numeric).
engine.size: The volume swept by all the pistons inside the cylinders in cubic
inches (numeric).
city.mpg: The fuel consumption of the car in city driving conditions, measured
as miles per gallon of fuel (numeric).
In the context of statistical inference the use of theoretical models for the
sampling distribution is the standard approach. There are situations, such as in
the application of statistical surveys to the population by the US Census Bureau,
where the consideration of the entire population as the frame of reference is more
natural. But, in most other applications the consideration of theoretical models
is the method of choice. In this part of the book, where we consider statistical
inference, we will always use the theoretical approach for modeling the sampling
distribution.
Poisson: The Poisson distribution is also used in settings that involve count-
ing. This distribution approximates the Binomial distribution when the
number of examinations n is large but the probability p of the particular
outcome is small. The parameter that determines the distribution is the
expectation λ. The expression “Poisson(λ)” is used to mark the Poisson
distribution. The sample space for this distribution is the entire collection
of natural numbers {0, 1, 2, . . .}. The expectation of the distribution is λ
and the variance is also λ. The functions “dpois”, “ppois”, and “qpois”
may be used in order to compute the probability, the cumulative proba-
bility, and the percentiles, respectively, for the Poisson distribution. The
function “rpois” can be used in order to simulate a random sample from
this distribution.
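For illustration, the following computations use the arbitrarily chosen value λ = 3 (any other value could serve equally well):
> dpois(2,3)
[1] 0.2240418
> ppois(2,3)
[1] 0.4231901
> qpois(0.95,3)
[1] 6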
are called confidence intervals. In the case of testing hypotheses these statistics
are called test statistics.
In all cases of inference, the relevant statistic possesses a distribution that it
inherits from the sampling distribution of the observations. This distribution is
the sampling distribution of the statistic. The properties of the statistic as a tool
for inference are assessed in terms of its sampling distribution. The sampling
distribution of a statistic is a function of the sample size and of the parameters
that determine the distribution of the measurements, but otherwise may be of
complex structure.
In order to assess the performance of the statistics as agents of inference
one should be able to determine their sampling distribution. We will apply two
approaches for this determination. One approach is to use a Normal approxima-
tion. This approach relies on the Central Limit Theorem. The other approach
is to simulate the distribution. This other approach relies on the functions
available in R for the simulation of a random sample from a given distribution.
On the other hand, one need not always assume that the distribution of a
statistic is necessarily Normal. In many cases it is not, even for a large sample
size. For example, the minimal value of a sample that is generated from the
Exponential distribution can be shown to follow the Exponential distribution
with an appropriate rate4 , regardless of the sample size.
9.4.6 Simulations
In most problems of statistical inference that will be discussed in this book we
will be using the Normal approximation for the sampling distribution of the
statistic. However, every now and then we may want to check the validity of
this approximation in order to reassure ourselves of its appropriateness. Com-
puterized simulations can be carried out for that checking. The simulations are
equivalent to those used in the first part of the book.
A model for the distribution of the observations is assumed each time a
simulation is carried out. The simulation itself involves the generation of random
samples from that model for the given sample size and for a given value of the
parameter. The statistic is evaluated and stored for each generated sample.
Thereby, via the generation of many samples, an approximation of the sampling
distribution of the statistic is produced. A probabilistic statement inferred from
the Normal approximation can be compared to the results of the simulation.
Substantial disagreement between the Normal approximation and the outcome
of the simulations is evidence that the Normal approximation may not be
valid in the specific setting.
As an example, assume the statistic is the average price of a car. It is
assumed that the price of a car follows an Exponential distribution with some
unknown rate parameter λ. We consider the sampling distribution of the average
of 201 Exponential random variables. (Recall that in our sample there are 4
missing values among the 205 observations.) The expectation of the average
is 1/λ, which is the expectation of a single Exponential random variable. The
variance of a single observation is 1/λ². Consequently, the standard deviation
of the average is √((1/λ²)/201) = (1/λ)/√201 = (1/λ)/14.17745 = 0.0705/λ.
In the first part of the book we found out that for Normal(µ, σ²), the Normal
distribution with expectation µ and variance σ², the central region that contains
95% of the distribution takes the form µ ± 1.96 σ (namely, the interval [µ −
1.96 σ, µ + 1.96 σ]). Thereby, according to the Normal approximation for the
sampling distribution of the average price we state that the region 1/λ ± 1.96 ·
0.0705/λ should contain 95% of the distribution.
We may use simulations in order to validate this approximation for selected
values of the rate parameter λ. Hence, for example, we may choose λ = 1/12, 000
(which corresponds to an expected price of $12,000 for a car) and validate the
approximation for that parameter value.
The simulation itself is carried out by the generation of a sample of size
n = 201 from the Exponential(1/12,000) distribution using the function “rexp”
for generating Exponential samples5 . The function for computing the average
(mean) is applied to each sample and the result is stored. We repeat this process a
large number of times (100,000 is the typical number we use) in order to produce
an approximation of the sampling distribution of the average.
4 If the rate of an Exponential measurement is λ then the rate of the minimum of n such
measurements is nλ.
5 The expression for generating a sample is “rexp(201,1/12000)”
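In outline, such a simulation might read (the object names are illustrative; the last line computes the proportion of simulated averages that fall within the stated region):
> lam <- 1/12000
> X.bar <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+   X.samp <- rexp(201,lam)
+   X.bar[i] <- mean(X.samp)
+ }
> mean(abs(X.bar - 1/lam) <= 1.96*0.0705/lam)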
output is the maximal value in the sequence. The statistic itself is obtained by
adding the two extreme values to each other and dividing the sum by two8 .
We produce, just as before, a large number of samples and compute the
value of the statistic to each sample. The distribution of the simulated values
of the statistic serves as an approximation of the sampling distribution of the
statistic. The central range that contains 95% of the sampling distribution may
be approximated with the aid of this simulated distribution.
Specifically, we approximate the central range by the identification of the
0.025-percentile and the 0.975-percentile of the simulated distribution. Between
these two values are 95% of the simulated values of the statistic. The percentiles
of a sequence of simulated values of the statistic can be identified with the aid of
the function “quantile” that was presented in the first part of the book. The
first argument to the function is a sequence of values and the second argument
is a number p between 0 and 1. The output of the function is the p-percentile
of the sequence9 . The p-percentile of the simulated sequence serves as an ap-
proximation of the p-percentile of the sampling distribution of the statistic.
The second argument to the function “quantile” may be a sequence of val-
ues between 0 and 1. If so, the percentile for each value in the second argument
is computed10 .
Let us carry out the simulation that produces an approximation of the central
region that contains 95% of the sampling distribution of the mid-range statistic
for the Uniform distribution:
> mid.range <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+ X <- runif(100,3,7)
+ mid.range[i] <- (max(X)+min(X))/2
+ }
> quantile(mid.range,c(0.025,0.975))
2.5% 97.5%
4.941680 5.059004
Observe that (approximately) 95% of the sampling distribution of the statistic
are in the range [4.941680, 5.059004].
Simulations can be used in order to compute the expectation, the standard
deviation or any other numerical summary of the sampling distribution of a
statistic. All one needs to do is compute the required summary for the simulated
sequence of statistic values and hence obtain an approximation of the required
summary. For example, we may use the sequence “mid.range” in order to obtain
the expectation and the standard deviation of the mid-range statistic of a sample
of 100 observations from the Uniform(3, 7) distribution:
> mean(mid.range)
8 If the sample is stored in an object by the name “X” then one may compute the mid-range
statistic with the expression “(max(X)+min(X))/2”.
9 The p-percentile of a sequence is (roughly) a number such that the proportion of entries
with values smaller than that number is p and the proportion of entries with values larger
than the number is 1 − p.
10 If the simulated values of the statistic are stored in a sequence by the name “mid.range”
then the 0.025-percentile and the 0.975-percentile of the sequence can be computed with the
expression “quantile(mid.range,c(0.025,0.975))”.
[1] 5.000168
> sd(mid.range)
[1] 0.02767719
The expectation of the statistic is obtained by the application of the function
“mean” to the sequence. Observe that it is practically equal to 5. The stan-
dard deviation is obtained by the application of the function “sd”. Its value is
approximately equal to 0.028.
static magnetic fields in postpolio patients: A double-blind pilot study. Archives of Physical
and Rehabilitation Medicine 78(11): 1200-1203.
4. Compute the sample standard deviation of the variable “change” for the
patients that received an active magnet and the sample standard devia-
tion for those that received an inactive placebo.
5. Produce a boxplot of the variable “change” for the patients that received
an active magnet and for patients that received an inactive placebo.
What is the number of outliers in each subsequence?
Solution (to Question 9.1.1): Let us read the data into a data frame by the
name “magnets” and apply the function “summary” to the data frame:
> magnets <- read.csv("magnets.csv")
> summary(magnets)
score1 score2 change active
Min. : 7.00 Min. : 0.00 Min. : 0.0 "1":29
1st Qu.: 9.25 1st Qu.: 4.00 1st Qu.: 0.0 "2":21
Median :10.00 Median : 6.00 Median : 3.5
Mean : 9.58 Mean : 6.08 Mean : 3.5
3rd Qu.:10.00 3rd Qu.: 9.75 3rd Qu.: 6.0
Max. :10.00 Max. :10.00 Max. :10.0
The variable “change” contains the difference between the patient’s rating be-
fore the application of the device and the rating after the application. The
sample average of this variable is reported as the “Mean” for this variable and
is equal to 3.5.
Solution (to Question 9.1.3): Based on the hint we know that the expres-
sions “change[1:29]” and “change[30:50]” produce the values of the variable
“change” for the patients that were treated with active magnets and by inactive
placebo, respectively. We apply the function “mean” to these sub-sequences:
> mean(magnets$change[1:29])
[1] 5.241379
> mean(magnets$change[30:50])
[1] 1.095238
The sample average for the patients that were treated with active magnets is
5.241379 and sample average for the patients that were treated with inactive
placebo is 1.095238.
Solution (to Question 9.1.4): We apply the function “sd” to these sub-
sequences:
> sd(magnets$change[1:29])
[1] 3.236568
> sd(magnets$change[30:50])
[1] 1.578124
12 The number codes are read as character strings into R. Notice that the codes are given in
quotation marks in the output of the function “summary”.
[Figure 9.1: box plots of the variable “change” for the patients that received an active magnet (left panel) and for those that received an inactive placebo (right panel)]
The sample standard deviation for the patients that were treated with active
magnets is 3.236568 and sample standard deviation for the patients that were
treated with inactive placebo is 1.578124.
> boxplot(magnets$change[1:29])
> boxplot(magnets$change[30:50])
The box-plots are presented in Figure 9.1. The box-plot on the left corresponds
to the sub-sequence of the patients that received an active magnet. There are no
outliers in this plot. The box-plot on the right corresponds to the sub-sequence
of the patients that received an inactive placebo. Three values, the values “3”,
“4”, and “5” are associated with outliers. Let us see what is the total number
of observations that receive these values:
> table(magnets$change[30:50])
0 1 2 3 4 5
11 5 1 1 2 1
One may see that a single observation obtained the value “3”, another one
obtained the value “5” and 2 observations obtained the value “4”, a total of
4 outliers13 . Notice that the single point that is associated with the value “4”
actually represents 2 observations and not one.
The statistic under consideration is

(X̄1 − X̄2) / √(S1²/29 + S2²/21) ,

where X̄1 and X̄2 are the sample averages for the 29 patients that receive active
magnets and for the 21 patients that receive inactive placebo, respectively. The
quantities S1² and S2² are the sample variances for each of the two samples. Our
goal is to investigate the sampling distribution of this statistic in a case where
both expectations are equal to each other and to compare this distribution to
the observed value of the statistic.
2. Does the observed value of the statistic, computed for the data frame
“magnets”, fall inside or outside of the interval that is computed in 1?
Observe that each iteration of the simulation involves the generation of two
samples. One sample is of size 29 and it is generated from the Normal(3.5, 3²)
distribution and the other sample is of size 21 and it is generated from the
Normal(3.5, 1.5²) distribution. The sample average and the sample variance are
computed for each sample. The test statistic is computed based on these aver-
ages and variances and it is stored in the appropriate position of the sequence
“test.stat”.
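The code below is a sketch of the simulation just described. The object name “test.stat” follows the text; the use of 10^5 iterations is an assumption that is consistent with the other simulations in the book:
> test.stat <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+ X1 <- rnorm(29,3.5,3)
+ X2 <- rnorm(21,3.5,1.5)
+ test.stat[i] <- (mean(X1)-mean(X2))/sqrt(var(X1)/29 + var(X2)/21)
+ }
> quantile(test.stat,c(0.025,0.975))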
The values of the sequence “test.stat” at the end of all the iterations rep-
resent the sampling distribution of the statistic. The application of the function
“quantile” to the sequence gives the 0.025-percentile and the 0.975-percentile
of the sampling distribution, which are -2.014838 and 2.018435. It follows that
the interval [−2.014838, 2.018435] contains about 95% of the sampling distribu-
tion of the statistic.
Solution (to Question 9.2.2): In order to evaluate the statistic for the given
data set we apply the same steps that were used in the simulation for the
computation of the statistic:
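A sketch of this computation is given below; the object names are illustrative and are not taken from the original code, but the numerical result agrees with the value quoted in the text:
> x.bar.1 <- mean(magnets$change[1:29])
> x.bar.2 <- mean(magnets$change[30:50])
> s.sq.1 <- var(magnets$change[1:29])
> s.sq.2 <- var(magnets$change[30:50])
> (x.bar.1 - x.bar.2)/sqrt(s.sq.1/29 + s.sq.2/21)
[1] 5.985601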
In the first line we compute the sample average for the first 29 patients and in
the second line we compute it for the last 21 patients. In the third and fourth
lines we do the same for the sample variances of the two types of patients.
Finally, in the fifth line we evaluate the statistic. The computed value of the
statistic turns out to be 5.985601, a value that does not belong to the interval
[−2.014838, 2.018435].
9.6 Summary
Glossary
Statistical Inference: Methods for gaining insight regarding the population
parameters from the observed data.
Chapter 10

Point Estimation
• Become familiar with the notions of bias, variance and mean squared error
(MSE) of an estimator.
• Be able to estimate specific parameters from the data and assess the per-
formance of the estimator.
The application of the function “mean” for the computation of the sample
average produced a missing value. The reason is that the variable “price”
contains 4 missing values. By default, when applied to a sequence that contains
missing values, the function “mean” produces a missing value as its output.
The behavior of the function “mean” in the presence of missing values is
determined by the argument “na.rm”1 . If we want to compute the average of the
non-missing values in the sequence we should specify the argument “na.rm” as
“TRUE”. This can be achieved by the inclusion of the expression “na.rm=TRUE”
in the arguments of the function:
> mean(cars$price,na.rm=TRUE)
[1] 13207.13
Consider a simulation of the sampling distribution of the sample average of the prices of 201 car types, under the assumption that the price follows the Exponential(1/13,000) distribution:
> lam <- 1/13000
> X.bar <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+ X <- rexp(201,lam)
+ X.bar[i] <- mean(X)
+ }
> mean(abs(X.bar - 1/lam) <= 1000)
[1] 0.7247
In the last line of the code we compute the probability of being within $1,000
of the expected price. Recall that the expected price in the Exponential case
is the reciprocal of the rate λ. In this simulation we obtained 0.7247 as an
approximation of the probability.
In the case of the sample average we may also apply the Normal approximation
in order to assess the probability under consideration. In particular,
if λ = 1/13,000 then the expectation of an Exponential observation is
E(X) = 1/λ = 13,000 and the variance is Var(X) = 1/λ² = (13,000)². The
expectation of the sample average is equal to the expectation of the measurement,
13,000 in this example. The variance of the sample average is equal to the
variance of the observation, divided by the sample size. In the current setting
it is equal to (13,000)²/201. The standard deviation is equal to the square root
of the variance.
The Normal approximation uses the Normal distribution in order to compute
probabilities associated with the sample average. The Normal distribution that
is used has the same expectation and standard deviation as the sample average:
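The computation itself is not shown in the text; a sketch of how it may be carried out is given below (the interval from $12,000 to $14,000 corresponds to being within $1,000 of the expected price):
> mu.bar <- 13000
> sig.bar <- 13000/sqrt(201)
> pnorm(14000,mu.bar,sig.bar) - pnorm(12000,mu.bar,sig.bar)
The Normal approximation gives a probability of roughly 0.72, which is close to the value 0.7247 that was obtained in the simulation.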
Notice that for larger sample sizes the estimator is more accurate. The larger
the sample size n becomes, the smaller the variance of the estimator is and the
more concentrated about the expectation its values tend to be. Hence, one may
make the estimator more accurate by increasing the sample size.
Another method for improving the accuracy of the average of measurements
in estimating the expectation is the application of a more accurate measurement
device. If the variance Var(X) of the measurement device decreases so does the
variance of the sample average of such measurements.
In the sequel, when we investigate the accuracy of estimators, we will gener-
ally use overall summaries of the spread of their distribution around the target
value of the parameter.
the function “max” that computes the maximum value of its input and the
function “min” that computes the minimum value:
> mu <- 3
> sig <- sqrt(2)
> X.bar <- rep(0,10^5)
> mid.range <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+ X <- rnorm(100,mu,sig)
+ X.bar[i] <- mean(X)
+ mid.range[i] <- (max(X)+min(X))/2
+ }
> var(X.bar)
[1] 0.02020161
> var(mid.range)
[1] 0.1850595
We get that the variance of the sample average3 is approximately equal to 0.02.
The variance of the mid-range statistic is approximately equal to 0.185, more
than 9 times as large. We see that the accuracy of the sample average is better
in this case than the accuracy of the mid-range estimator. Evaluating the two
estimators at other values of the parameter will produce the same relation.
Hence, in the current example it seems as if the sample average is the better of
the two.
Is the sample average necessarily the best estimator for the expectation?
The next example will demonstrate that this need not always be the case.
Consider again a situation of observing a sample of size n = 100. How-
ever, this time the measurement X is Uniform and not Normal. Say X ∼
Uniform(0.5, 5.5) has the Uniform distribution over the interval [0.5, 5.5]. The
expectation of the measurement is equal to 3 like before, since E(X) = (0.5 +
5.5)/2 = 3. The variance of an observation is Var(X) = (5.5 − 0.5)²/12 =
2.083333, not much different from the variance that was used in the Normal
case. The Uniform distribution, like the Normal distribution, is a symmetric
distribution. Hence, using the mid-range statistic as an estimator of the expec-
tation makes sense4 .
We re-run the simulations, using the function “runif” for the simulation
of a sample from the Uniform distribution and the parameters of the Uniform
distribution instead of the function “rnorm” that was used before:
4 For a symmetric distribution, the middle point between the maximum value of the distribution b and the minimal value a, (a + b)/2, is equal to the expectation of the distribution.
> a <- 0.5
> b <- 5.5
> X.bar <- rep(0,10^5)
> mid.range <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+ X <- runif(100,a,b)
+ X.bar[i] <- mean(X)
+ mid.range[i] <- (max(X)+min(X))/2
+ }
> var(X.bar)
[1] 0.02074304
> var(mid.range)
[1] 0.001209732
Again, we get that the variance of the sample average is approximately equal to
0.02, which is close to the theoretical value5. The variance of the mid-range statistic
is approximately equal to 0.0012.
Observe that in the current comparison between the sample average and
the mid-range estimator we get that the latter is a clear winner. Examination
of other values of a and b for the Uniform distribution will produce the same
relation between the two competitors. Hence, we may conclude that for the case
of the Uniform distribution the sample average is an inferior estimator.
The last example may serve as yet another reminder that life is never simple.
A method that is good in one situation may not be as good in a different
situation.
Still, the estimator of choice of the expectation is the sample average. In-
deed, in some cases we may find that other methods may produce more accurate
estimates. However, in most settings the sample average beats its competitors.
The sample average also possesses other useful benefits. Its sampling distribu-
tion is always centered at the expectation it is trying to estimate. Its variance
has a simple form, i.e. it is equal to the variance of the measurement divided by
the sample size. Moreover, its sampling distribution can be approximated by
the Normal distribution. Henceforth, due to these properties, we will use the
sample average whenever estimation of the expectation is required.
Recall the formula of the sample variance:

s² = Σᵢ(xi − x̄)²/(n − 1) ,

where x̄ is the sample average, n is the sample size, and the sum Σᵢ extends over
i = 1, …, n. The term xi − x̄ is the deviation of the i-th observation from the
sample average and Σᵢ(xi − x̄)² is the sum of the squares of the deviations. It
is pointed out in Chapter 3 that the
5 Actually, the exact value of the variance of the sample average is Var(X)/100 =
0.02083333. The results of the simulation are consistent with this theoretical computation.
reason for dividing the sum of squares by (n − 1), rather than n, stems from
considerations of statistical inference. A promise was made that this reasoning
would be discussed in due course. Now we want to deliver on that promise.
Let us compare between two competing estimators for the variance, both
considered as random variables. One is the estimator S 2 , which is equal to the
formula for the sample variance applied to a random sample:
S² = Sum of the squares of the deviations / (Number of values in the sample − 1) = Σᵢ(Xi − X̄)²/(n − 1) .
The computation of this statistic can be carried out with the function “var”.
The second estimator is the one obtained when the sum of squares is divided
by the sample size (instead of the sample size minus 1):
Sum of the squares of the deviations / Number of values in the sample = Σᵢ(Xi − X̄)²/n .
Observe that the second estimator can be represented in the form:
Σᵢ(Xi − X̄)²/n = [(n − 1)/n] · Σᵢ(Xi − X̄)²/(n − 1) = [(n − 1)/n] · S² .
Hence, the second estimator may be obtained by the multiplication of the first
estimator S 2 by the ratio (n − 1)/n. We seek to compare between S 2 and
[(n − 1)/n]S 2 as estimators of the variance.
In order to make the comparison concrete, let us consider it in the context
of a Normal measurement with expectation µ = 5 and variance σ 2 = 3. Let us
assume that the sample is of size 20 (n = 20).
Under these conditions we carry out a simulation. Each iteration of the
simulation involves the generation of a sample of size n = 20 from the given
Normal distribution. The sample variance S 2 is computed from the sample with
the application of the function “var”. The resulting estimate of the variance is
stored in an object that is called “X.var”:
> mu <- 5
> std <- sqrt(3)
> X.var <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+ X <- rnorm(20,mu,std)
+ X.var[i] <- var(X)
+ }
The content of the object “X.var”, at the end of the simulation, approximates
the sampling distribution of the estimator S 2 .
Our goal is to compare between the performance of the estimator of the
variance S 2 and that of the alternative estimator. In this alternative estimator
the sum of squared deviations is divided by the sample size (n = 20) and not by
the sample size minus 1 (n − 1 = 19). Consequently, the alternative estimator is
obtained by multiplying S 2 by the ratio 19/20. The sampling distribution of the
values of S² is approximated by the content of the object “X.var”. It follows
that the sampling distribution of the alternative estimator is approximated by
the content of “X.var” multiplied by the ratio 19/20. Let us compute the
expectation of each of the two estimators by applying the function “mean” to
the appropriate sequence:
> mean(X.var)
[1] 2.995400
> mean((19/20)*X.var)
[1] 2.845630
Note that 3 is the value of the variance of the measurement that was used in the
simulation. Observe that the expectation of S 2 is essentially equal to 3, whereas
the expectation of the alternative estimator is less than 3. Hence, at least in
the example that we consider, the center of the distribution of S 2 is located on
the target value. On the other hand, the center of the sampling distribution of
the alternative estimator is located off that target value.
As a matter of fact it can be shown mathematically that the expectation of
the estimator S 2 is always equal to the variance of the measurement. This holds
true regardless of what is the actual value of the variance. On the other hand
the expectation of the alternative estimator is always off the target value6 .
An estimator is called unbiased if its expectation is equal to the value of
the parameter that it tries to estimate. We get that S 2 is an unbiased estima-
tor of the variance. Similarly, the sample average is an unbiased estimator of
the expectation. Unlike these two estimators, the alternative estimator of the
variance is a biased estimator.
The default is to use S 2 as the estimator of the variance of the measurement
and to use its square root as the estimator of the standard deviation of the
measurement. A justification that is frequently quoted for this selection is the
fact that S² is an unbiased estimator of the variance7 .
In the previous section, when comparing two competing estimators of the
expectation, our main concern was the quantification of the spread of the sam-
pling distribution of either estimator about the target value of the parameter.
We used that spread as a measure of the distance between the estimator and the
value it tries to estimate. In the setting of the previous section both estimators
were unbiased. Consequently, the variance of the estimators, which measures
the spread of the distribution about its expectation, could be used in order to
6 For the estimator S 2 we get that E(S 2 ) = Var(X). On the other hand, for the alternative
estimator we get that E([(n − 1)/n] · S²) = [(n − 1)/n] · Var(X) ≠ Var(X). This statement
holds true also in the cases where the distribution of the measurement is not Normal.
7 As part of your homework assignment you are required to investigate the properties of
quantify the distance between the estimator and the parameter (Since, for un-
biased estimators, the parameter is equal to the expectation of the sampling
distribution).
In the current section one of the estimators (S 2 ) is unbiased, but the other
(the alternative estimator) is not. In order to compare their accuracy in esti-
mation we need to figure out a way to quantify the distance between a biased
estimator and the value it tries to estimate.
Towards that end let us recall the definition of the variance. Given a random
variable X with an expectation E(X), we consider the square of the deviations
(X − E(X))2 , which measure the (squared) distance between each value of the
random variable and the expectation. The variance is defined as the expectation
of the squared distance: Var(X) = E[(X − E(X))2 ]. One may think of the
variance as an overall measure of the distance between the random variable and
the expectation.
Assume now that the goal is to assess the distance between an estimator
and the parameter it tries to estimate. In order to keep the discussion on an
abstract level let us use the Greek letter θ (theta) to denote this parameter8 .
The estimator is denoted by θ̂ (read: theta hat). It is a statistic, a formula
applied to the data. Hence, with respect to the sampling distribution, θ̂ is a
random variable9 . The issue is to measure the distance between the random
variable θ̂ and the parameter θ.
Motivated by the method that led to the definition of the variance we con-
sider the deviations between the estimator and the parameter. The square
deviations (θ̂ − θ)2 may be considered in the current context as a measure of the
(squared) distance between the estimator and the parameter. When we take the
expectation of these square deviations we get an overall measure of the distance
between the estimator and the parameter. This overall distance is called the
mean square error of the estimator and is denoted by MSE:
The mean square error of an estimator is tightly linked to the bias and the
variance of the estimator. The bias of an estimator θ̂ is the difference between
the expectation of the estimator and the parameter it seeks to estimate:
Bias = E(θ̂) − θ .
distribution. In the previous section we considered θ = E(X) and in this section we consider
θ = Var(X).
9 Observe that we diverge here slightly from our promise to use capital letters to denote
random variables. However, denoting the parameter by θ and denoting the estimator of the
parameter by θ̂ is standard in the statistical literature. As a matter of fact, we will use the
“hat” notation, where a hat is placed over a Greek letter that represents the parameter, in
other places in this book. The letter with the hat on top will represent the estimator and will
always be considered as a random variable. For Latin letters we will still use capital letters,
with or without a hat, to represent a random variable and small letters to represent the evaluation
of the random variable for given data.
One may show that the mean square error decomposes into these two components:

MSE = Var(θ̂) + Bias² .

Hence, the mean square error of an estimator is the sum of its variance, the
(squared) distance between the estimator and its expectation, and the square of
the bias, the squared distance between the expectation and the parameter.
The mean square error is influenced both by the spread of the distribution about
the expected value (the variance) and by the distance between the expected
value and the parameter (the bias). The larger either of them becomes, the
larger the mean square error, namely the distance between the estimator and
the parameter.
Let us compare between the mean square error of the estimator S 2 and
the mean square error of the alternative estimator [19/20]S 2 . Recall that we
have computed their expectations and found out that the expectation of S 2 is
essentially equal to 3, the target value of the variance. The expectation of the
alternative estimator turned out to be equal to 2.845630, which is less than the
target value10 . It turns out that the bias of S 2 is zero (or essentially zero in
the simulations) and the bias of the alternative estimator is 2.845630 − 3 =
−0.15437 ≈ −0.15.
In order to compute the mean square errors of both estimators, let us com-
pute their variances:
> var(X.var)
[1] 0.9361832
> var((19/20)*X.var)
[1] 0.8449054
Observe that the variance of S 2 is essentially equal to 0.936 and the variance of
the alternative estimator is essentially equal to 0.845.
The estimator S² is unbiased. Consequently, the mean square error of S² is
equal to its variance. The bias of the alternative estimator is −0.15. As a result
we get that the mean square error of this estimator, which is the sum of the
variance and the square of the bias, is essentially equal to 0.845 + (−0.15)² ≈ 0.87.
Observe that the mean square error of the estimator S 2 , which is equal to 0.936,
is larger than the mean square error of the alternative estimator.
Notice that even though the alternative estimator is biased it still has a
smaller mean square error than the default estimator S². Indeed, it can be
proved mathematically that when the measurement has a Normal distribution
the mean square error of the alternative estimator is always smaller than
the mean square error of the sample variance S².
Still, although the alternative estimator is slightly more accurate than S²
in the estimation of the variance, the tradition is to use the latter. Obeying
this tradition, we will keep using S² for the estimation of the variance.
10 It can be shown mathematically that E([(n − 1)/n]S 2 ) = [(n − 1)/n]E(S 2 ). Consequently,
the actual value of the expectation of the alternative estimator in the current setting is [19/20]·
3 = 2.85 and the bias is −0.15. The results of the simulation are consistent with this fact.
10.5 Estimation of Other Parameters
For each observation the event either occurs or it does not. A natural estimator
of the probability of the event is its relative frequency in the sample. Let us
show that this estimator can be represented as an average of a Bernoulli sample
and that, consequently, the sample average is used for the estimation of a
Bernoulli expectation.
Considering an event, one may code a measurement X, associated with an
observation, by 1 if the event occurs and by 0 if it does not. Given a sample of
size n, one thereby produces n observations with values 0 or 1. An observation
has the value 1 if the event occurs for that observation or, else, the value is 0.
Notice that E(X) = 1 · p = p. Consequently, the probability of the event is
equal to the expectation of the Bernoulli measurement11 . It turns out that the
parameter one seeks to estimate is the expectation of a Bernoulli measurement.
The estimation is based on a sample of size n of Bernoulli observations.
In Section 10.3 it was proposed to use the sample average as an estimate of
the expectation. The sample average is the sum of the observations, divided by
the number of observation. In the specific case of a sample of Bernoulli obser-
vations, the sum of observation is the sum of zeros and ones. The zeros do not
contribute to the sum. Hence, the sum is equal to the number of times that 1
occurs, namely the frequency of the occurrences of the event. When we divide
by the sample size we get the relative frequency of the occurrences. The con-
clusion is that the sample average of the Bernoulli observations and the relative
frequency of occurrences of the event in the sample are one and the same. Consequently, the
sample relative frequency of the event is also a sample average that estimates
the expectation of the Bernoulli measurement.
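For illustration, consider a short, made-up sequence of zero-one indicators:
> x <- c(1,0,0,1,1,0,1,0,0,0)
> sum(x)
[1] 4
> mean(x)
[1] 0.4
The sum counts the occurrences of the event and the average is the relative frequency of occurrences in the sample.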
We seek to estimate p, the probability of the event. The estimator is the
relative frequency of the event in the sample. We denote this estimator by P̂ .
This estimator is a sample average of Bernoulli observations that is used in order
to estimate the expectation of the Bernoulli distribution. From the discussion
in Section 10.3 one may conclude that this estimator is an unbiased estimator
of p (namely, E(P̂ ) = p) and that its variance is equal to:

Var(P̂ ) = Var(X)/n = p(1 − p)/n ,

where the variance of the measurement is obtained from the formula for the
variance of a Binomial(1, p) distribution12 .
The second example of an integer valued random variable that was consid-
ered in the first part of the book is the Poisson(λ) distribution. Recall that λ
is the expectation of a Poisson measurement. Hence, one may use the sample
average of Poisson observations in order to estimate this parameter.
The first example of a continuous distribution that was discussed in the first
part of the book is the Uniform(a, b) distribution. This distribution is param-
eterized by a and b, the end-points of the interval over which the distribution
is defined. A natural estimator of a is the smallest value observed and a natu-
ral estimator of b is the largest value. One may use the function “min” for the
computation of the former estimate from the sample and use the function “max”
for the computation of the latter. Both estimators are slightly biased but have
a relatively small mean square error.
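As an illustration, one may simulate a Uniform sample and apply the two functions. This is only a sketch (the end-points 0.5 and 5.5 are borrowed from the earlier example, and the output varies between runs):
> X <- runif(100,0.5,5.5)
> min(X)
> max(X)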
11 The expectation of X ∼ Binomial(n, p) is E(X) = np. In the Bernoulli case n = 1.
Therefore, E(X) = 1 · p = p.
12 The variance of X ∼ Binomial(n, p) is Var(X) = np(1 − p). In the Bernoulli case n = 1. Therefore, Var(X) = p(1 − p).
Yet another example is the Exponential(λ) distribution, which is parameterized
by the rate λ. The rate is equal to the reciprocal of the expectation. The expectation can
be estimated by the sample average. Hence a natural proposal is to use the
reciprocal of the sample average as an estimator of the rate:
λ̂ = 1/X̄ .
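A brief illustration of this estimator on simulated data (a sketch only; the rate 1/13,000 is borrowed from the car-price example used earlier):
> X <- rexp(201,1/13000)
> 1/mean(X)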
The final example that we mention is the Normal(µ, σ 2 ) case. The parameter
µ is the expectation of the measurement and may be estimated by the sample
average X̄. The parameter σ 2 is the variance of a measurement, and can be
estimated using the sample variance S 2 .
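The output below belongs to the solution of Question 10.1.1, in which the sample average and the sample median are compared as estimators of the expectation of a Normal measurement. The simulation that produces the sequences “X.bar” and “X.med” is not shown in the text; a sketch that is consistent with the reported values (a Normal measurement with µ = 3 and variance 2, as in Subsection 10.3.2, samples of size 100, and 10^5 iterations) is:
> mu <- 3
> sig <- sqrt(2)
> X.bar <- rep(0,10^5)
> X.med <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+ X <- rnorm(100,mu,sig)
+ X.bar[i] <- mean(X)
+ X.med[i] <- median(X)
+ }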
> mean(X.bar)
[1] 3.000010
> mean(X.med)
[1] 3.000086
> var(X.bar)
[1] 0.02013529
> var(X.med)
[1] 0.03120206
Observe that the variance of the sample average is essentially equal to 0.020
and the variance of the sample median is essentially equal to 0.0312. The mean
square error of an unbiased estimator is equal to its variance. Hence, these
numbers represent the mean square errors of the estimators. It follows that the
mean square error of the sample average is less than the mean square error of the
sample median in the estimation of the expectation of a Normal measurement.
Solution (to Question 10.1.2): We repeat the same steps as before for the
Uniform distribution. Notice that we use the parameters a = 0.5 and b = 5.5 the
same way we did in Subsection 10.3.2. These parameters produce an expectation
E(X) = 3 and a variance Var(X) = 2.083333:
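A sketch of the corresponding simulation for the Uniform case (same structure and object names as before):
> a <- 0.5
> b <- 5.5
> X.bar <- rep(0,10^5)
> X.med <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+ X <- runif(100,a,b)
+ X.bar[i] <- mean(X)
+ X.med[i] <- median(X)
+ }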
Applying the function “mean” to the sequences that represent the sampling
distribution of the estimators we obtain that both estimators are essentially
unbiased14 :
13 It can be proved mathematically that for a symmetric distribution the expectation of the
sample average and the expectation of the sample median are both equal to the expectation
of the measurement. The Normal distribution is a symmetric distribution.
14 The Uniform distribution is symmetric. Consequently, both estimators are unbiased.
> mean(X.bar)
[1] 3.000941
> mean(X.med)
[1] 3.001162
Compute the variances:
> var(X.bar)
[1] 0.02088268
> var(X.med)
[1] 0.06069215
Observe that 0.021 is, essentially, the value of the variance of the sample average15.
The variance of the sample median is essentially equal to 0.061. The variance
of each of the estimators is equal to its mean square error. This is the case
since the estimators are unbiased. Consequently, we again obtain that the mean
square error of the sample average is less than that of the sample median.
The expression “variable==level” produces a sequence with logical “TRUE” or “FALSE” entries that identify entries
in the sequence “variable” that obtain the value “level”.
Solution (to Question 10.2.1): Assuming that the file “ex2.csv” is saved
in the working directory, one may read the content of the file into a data frame
and produce a summary of the content of the data frame using the code:
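The code itself is not shown in the text; a minimal sketch (assuming the file is indeed in the working directory) is:
> ex2 <- read.csv("ex2.csv")
> summary(ex2)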
Examine the variable “group”. Observe that the sample contains 37 subjects
with high levels of blood pressure. Dividing 37 by the sample size we get:
> 37/150
[1] 0.2466667
Solution (to Question 10.2.2): Make sure that the file “pop2.csv” is saved
in the working directory. In order to compute the proportion in the population
we read the content of the file into a data frame and compute the relative
frequency of the level “HIGH” in the variable “group”:
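A sketch of this computation (the variable “group” and the level “HIGH” are quoted from the text, and the resulting proportion is the one reported below):
> pop2 <- read.csv("pop2.csv")
> mean(pop2$group == "HIGH")
[1] 0.28126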
Solution (to Question 10.2.3): Observe that the sampling distribution is stored
in the object “P.hat”. The function “sample” is used in order to sample 150
observations from the sequence
“pop2$group”. The sample is stored in the object “X”. The expression “mean(X
== "HIGH")” computes the relative frequency of the level “HIGH” in the sequence
“X”.
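A sketch of the simulation that is described in this paragraph (the number of iterations, 10^5, is an assumption consistent with the other simulations in the book):
> P.hat <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+ X <- sample(pop2$group,150)
+ P.hat[i] <- mean(X == "HIGH")
+ }
> mean(P.hat)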
At the last line, after the production of the sequence “P.hat” is completed,
the function “mean” is applied to the sequence. The result is the expected value
of estimator P̂ , which is equal to 0.2812307. This expectation is essentially
equal to the probability of the event p = 0.28126.18
Solution (to Question 10.2.4): The application of the function “var” to the
sequence “P.hat” produces:
> var(P.hat)
[1] 0.001350041
Solution (to Question 10.2.5): Compute the variance according to the formula
that is proposed in Section 10.5:
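A sketch of this computation, using the population proportion p = 0.28126 and the sample size n = 150 that are given in the text:
> p <- 0.28126
> p*(1-p)/150
[1] 0.001347685
The value, approximately 0.00135, is essentially equal to the variance of the sequence “P.hat” that was computed in the simulation.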
18 It can be shown mathematically that for random sampling from a population we have
E(P̂ ) = p. The discrepancy from the mathematical theory results from the fact that the
simulation serves only as an approximation to the sampling distribution.
19 It can be shown theoretically that the variance of the sample proportion, in the case of
sampling from a population, is equal to [(N − n)/(N − 1)] · p(1 − p)/n, where n is the sample
size, and N is the population size. The factor [(N − n)/(N − 1)] is called the finite population
correction. In the current setting the finite population correction is equal to 0.99851, which
is practically equal to one.
10.7 Summary
Glossary
Point Estimation: An attempt to obtain the best guess of the value of a
population parameter. An estimator is a statistic that produces such a
guess. The estimate is the observed value of the estimator.
Bias: The difference between the expectation of the estimator and the value
of the parameter. An estimator is unbiased if the bias is equal to zero.
Otherwise, it is biased.
Bernoulli Random Variable: A random variable that obtains the value “1”
with probability p and the value “0” with probability 1 − p. It coincides
with the Binomial(1, p) distribution. Frequently, the Bernoulli random
variable emerges as the indicator of the occurrence of an event.
Formulas:
• Bias: Bias = E(θ̂) − θ.
• Variance: Var(θ̂) = E[(θ̂ − E(θ̂))²].
• Mean Square Error: MSE = E[(θ̂ − θ)²].
Chapter 11
Confidence Intervals
The confidence intervals are computed based on the data in the file “cars.csv”.
In the subsequent subsections we discuss the theory behind the computation of
the confidence intervals and explain the meaning of the confidence level. Sub-
section 11.2.2 does so with respect to the confidence interval for the expectation
and Subsection 11.2.3 with respect to the confidence interval for the proportion.
diesel gas
20 185
Notice that only 20 of the 205 types of cars are run on diesel in this data set.
Let us compute the point estimation of the probability of such car types and
the confidence interval for this probability:
> n <- 205
> p.hat <- 20/n
> p.hat
[1] 0.09756098
> p.hat - 1.96*sqrt(p.hat*(1-p.hat)/n)
[1] 0.05694226
> p.hat + 1.96*sqrt(p.hat*(1-p.hat)/n)
[1] 0.1381797
The point estimation of the probability is p̂ = 20/205 ≈ 0.098 and the confidence
interval, after rounding, is [0.057, 0.138].
The Central Limit Theorem states that the distribution of the (standardized)
sample average Z = (X̄ − E(X))/√(Var(X)/n) is approximately standard Normal
for a large enough sample size. The variance of the measurement can be
estimated using the sample variance S².
Suppose we are interested in a confidence interval with a confidence level of
95%. The value 1.96 is the 0.975-percentile of the standard Normal. Therefore,
about 95% of the distribution of the standardized sample average is concentrated
in the range [−1.96, 1.96]:

P( |X̄ − E(X)| / √(Var(X)/n) ≤ 1.96 ) ≈ 0.95 .
The event, the probability of which is being described in the last display,
states that the absolute value of deviation of the sample average from the ex-
pectation, divided by the standard deviation of the sample average, is no more
than 1.96. In other words, the distance between the sample average and the
expectation is at most 1.96 units of standard deviation. One may rewrite this
event in a form that puts the expectation within an interval that is centered at
the sample average1 :
{ |X̄ − E(X)| ≤ 1.96 · √(Var(X)/n) } ⟺ { X̄ − 1.96 · √(Var(X)/n) ≤ E(X) ≤ X̄ + 1.96 · √(Var(X)/n) } .
Clearly, the probability of the latter event is (approximately) 0.95, since we are
considering the same event, each time represented in a different form. Notice
that the last representation states that the expectation E(X) belongs to an
interval about the sample average: X̄ ± 1.96 · √(Var(X)/n). This interval is,
almost, the confidence interval we seek.
The difficulty is that we do not know the value of the variance Var(X), hence
we cannot compute the interval in the proposed form from the data. In order
to overcome this difficulty we recall that the unknown variance may nonetheless
be estimated from the data:
S² ≈ Var(X) ⟹ √(Var(X)/n) ≈ S/√n .

Hence, X̄ ± 1.96 · S/√n is an (approximate) confidence interval of the (approximate)
confidence level 0.95.
Let us demonstrate the issue of confidence level by running a simulation.
We are interested in a confidence interval for the expected price of a car. In the
simulation we assume that the distribution of the price is Exponential(1/13000).
(Consequently, E(X) = 13,000.) We take the sample size to be equal to n = 201
and compute the actual probability of the confidence interval containing the
value of the expectation:
> lam <- 1/13000
> n <- 201
> X.bar <- rep(0,10^5)
> S <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+ X <- rexp(n,lam)
+ X.bar[i] <- mean(X)
+ S[i] <- sd(X)
+ }
> LCL <- X.bar - 1.96*S/sqrt(n)
> UCL <- X.bar + 1.96*S/sqrt(n)
> mean((13000 >= LCL) & (13000 <= UCL))
[1] 0.94518
Below we will go over the code and explain the simulation. But, before doing
so, notice that the actual probability that the confidence interval contains the
expectation is about 0.945, which is slightly below the nominal confidence level
of 0.95. Still, quoting the nominal value as the confidence level of the confidence
interval is not too far from reality.
Let us look now at the code that produced the simulation. In each iteration
of the simulation a sample is generated. The sample average and standard
deviations are computed and stored in the appropriate locations of the sequences
“X.bar” and “S”. At the end of all the iterations the content of these two
sequences represents the sampling distribution of the sample average X̄ and the
sample standard deviation S, respectively.
The lower and the upper end-points of the confidence interval are computed
in the next two lines of code. The lower level of the confidence interval is stored
in the object “LCL” and the upper level is stored in “UCL”. Consequently, we
obtain the sampling distribution of the confidence interval. This distribution is
approximated by 100,000 random confidence intervals that were generated by
the sampling distribution. Some of these random intervals contain the value of
the expectation, namely 13,000, and some do not. The proportion of intervals
that contain the expectation is the (simulated) confidence level. The last ex-
pression produces this confidence level, which turns out to be equal to about
0.945.
The last expression involves a new element, the term “&”, which calls for
more explanations. Indeed, refer to this last expression in the code above.
This expression involves the application of the function “mean”. The input to
this function contains two sequences with logical values (“TRUE” or “FALSE”),
separated by the character “&”. The character “&” corresponds to the logical
“AND” operator. This operator produces a “TRUE” if a “TRUE” appears at both of its sides.
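For illustration, consider two short logical sequences (the particular values are arbitrary and serve only to demonstrate the two operators):
> a <- c(TRUE,TRUE,FALSE,FALSE)
> b <- c(TRUE,FALSE,TRUE,FALSE)
> a & b
[1]  TRUE FALSE FALSE FALSE
> a | b
[1]  TRUE  TRUE  TRUE FALSE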
Notice that indeed the term “&” produces a “TRUE” only if parallel components
in the sequences “a” and “b” both obtain the value “TRUE”. On the other hand,
the term “|” produces a “TRUE” if at least one of the parallel components is
“TRUE”. Observe, also, that the output of the expression that puts either of the
two terms between two sequences with logical values is a sequence of the same
length (with logical components as well).
The expression “(13000 >= LCL)” produces a logical sequence of length
100,000 with “TRUE” appearing whenever the expectation is larger than the lower
level of the confidence interval. Similarly, the expression “(13000 <= UCL)”
produces “TRUE” values whenever the expectation is less than the upper level
of the confidence interval. The expectation belongs to the confidence interval
if the value in both expressions is “TRUE”. Thus, the application of the term
“&” to these two sequences identifies the confidence intervals that contain the
expectation. The application of the function “mean” to a logical vector produces
the relative frequency of TRUE’s in the vector. In our case this corresponds to the
relative frequency of confidence intervals that contain the expectation, namely
the confidence level.
We calculated before the confidence interval [12108.47, 14305.79] for the ex-
pected price of a car. This confidence interval was obtained via the application
of the formula for the construction of confidence intervals with a 95% confi-
dence level to the variable “price” in the data frame “cars”. Casually speak-
ing, people frequently refer to such an interval as an interval that contains the
expectation with probability of 95%.
However, one should be careful when interpreting the confidence level as a
probabilistic statement. The probability computations that led to the method
for constructing confidence intervals were carried out in the context of the sam-
pling distribution. Therefore, probability should be interpreted in the context
of all data sets that could have emerged and not in the context of the given
data set. No probability is assigned to the statement “The expectation belongs
to the interval [12108.47, 14305.79]”. The probability is assigned to the statement
“The expectation belongs to the interval X̄ ± 1.96 · S/√n”, where X̄ and
S are interpreted as random variables. Therefore the statement that the in-
terval [12108.47, 14305.79] contains the expectation with probability of 95% is
meaningless. What is meaningful is the statement that the given interval was
constructed using a procedure that produces, when applied to random samples,
intervals that contain the expectation with the assigned probability.
Let us run a simulation in order to assess the confidence level of the con-
fidence interval for the probability. Assume that n = 205 and p = 0.12. The
simulation we run is very similar to the simulation of Subsection 11.2.2. In the
first stage we produce the sampling distribution of P̂ (stored in the sequence
“P.hat”) and in the second stage we compute the relative frequency in the sim-
ulation of the intervals that contain the actual value of p that was used in the
simulation:
> p <- 0.12
> n <- 205
> P.hat <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+ X <- rbinom(n,1,p)
+ P.hat[i] <- mean(X)
+ }
> LCL <- P.hat - 1.96*sqrt(P.hat*(1-P.hat)/n)
> UCL <- P.hat + 1.96*sqrt(P.hat*(1-P.hat)/n)
> mean((p >= LCL) & (p <= UCL))
[1] 0.95131
In this simulation we obtained that the actual confidence level is approximately
0.951, which is slightly above the nominal confidence level of 0.95.
The formula X̄ ± 1.96 · S/√n that is used for a confidence interval for the
expectation and the formula P̂ ± 1.96 · √(P̂ (1 − P̂ )/n) for the probability both
rely on the Central Limit Theorem and are therefore appropriate only when the
sample size is large enough.
[Figure 11.1: Box plots of the difference in miles per gallon between highway and urban driving conditions, for cars that run on diesel (left) and cars that run on gas (right).]
For each car type we calculate the difference variable that measures the
difference between the number of miles per gallon in highway conditions and
the number in urban conditions. The cars are sub-divided between cars that
run on diesel and cars that run on gas. Our concern is to estimate, for each fuel
type, the expectation of the difference variable and to estimate the variance of
that variable. In particular, we are interested in the construction of a confidence
interval for the expectation and a confidence interval for the variance.
Box plots of the difference in fuel consumption between highway and urban
conditions are presented in Figure 11.1. The box plot on the left hand side
corresponds to cars that run on diesel and the box plot on the right hand side
corresponds to cars that run on gas. Recall that 20 of the 205 car types use
diesel and the other 185 car types use gas. One may suspect that the fuel
consumption characteristics vary between the two types of fuel. Indeed, the
measurement tends to have slightly higher values for vehicles that use gas.
We conduct inference for each fuel type separately. However, since the sam-
ple size for cars that run on diesel is only 20, one may have concerns regarding
the application of methods that assume a large sample size to a sample size this
small.
Notice that the equation associated with the probability is not an approximation
but an exact relation. Rewriting the event that is described in the probability
in the form of a confidence interval, produces
X̄ ± qt(0.975,n-1) · S/√n
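The code that produces the objects discussed in the next paragraph is not shown here. The sketch below indicates how they may be computed; the variable names “highway.mpg”, “city.mpg” and “fuel.type” are assumptions about the data frame “cars” and are not quoted in the surrounding text:
> dif.mpg <- cars$highway.mpg - cars$city.mpg
> x.bar <- tapply(dif.mpg,cars$fuel.type,mean)
> s <- tapply(dif.mpg,cars$fuel.type,sd)
> n <- tapply(dif.mpg,cars$fuel.type,length)
> x.bar - qt(0.975,n-1)*s/sqrt(n)
> x.bar + qt(0.975,n-1)*s/sqrt(n)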
The objects “x.bar” and “s” contain the sample averages and sample standard
deviations, respectively. Both are sequences of length two, with the first compo-
nent referring to “diesel” and the second component referring to “gas”. The
object “n” contains the two sample sizes, 20 for “diesel” and 185 for “gas”.
In the expression next to last the lower boundary for each of the confidence
intervals is computed and in the last expression the upper boundary is com-
puted. Notice that the confidence interval for the expected difference in diesel
cars is [3.148431, 5.751569] and the confidence interval for cars using gas is
[5.440699, 5.856598].
The 0.975-percentiles of the t-distributions are computed with the expression
“qt(0.975,n-1)”:
> qt(0.975,n-1)
[1] 2.093024 1.972941
Observe that the second argument of the function “qt” is a sequence with
two components, the number 19 and the number 184. Accordingly, the first
position in the output of the function is the percentile associated with 19 degrees
of freedom and the second position is the percentile associated with 184 degrees
of freedom.
Compare the resulting percentiles to the 0.975-percentile of the standard
Normal distribution, which is essentially equal to 1.96. When the sample size
is small, 20 for example, the percentile of the t-distribution is noticeably larger
than the percentile of the standard Normal. However, for a larger sample size
the percentiles, more or less, coincide. It follows that for a large sample the
method proposed in Subsection 11.2.2 and the method discussed in this subsec-
tion produce essentially the same confidence intervals.
is read “Kai”.) This distribution is associated with the sum of squares of Normal
variables. It is parameterized, just like the t-distribution, by a parameter called
the number of degrees of freedom. This number is equal to (n − 1) in the
situation we discuss. The chi-square distribution on (n − 1) degrees of freedom
is denoted with the symbol χ²(n−1).
The R system contains functions for the computation of the density, the cu-
mulative probability function and the percentiles of the chi-square distribution,
as well as for the simulation of a random sample from this distribution. Specifi-
cally, the percentiles of the chi-square distribution are computed with the aid of
the function “qchisq”. The first argument to the function is a probability and
the second argument is the number of degrees of freedom. The output of the
function is the percentile associated with the probability of the first argument.
Namely, it is a value such that the probability that the chi-square distribution
is below the value is equal to the probability in the first argument.
For example, let “n” be the sample size. The output of the expression
“qchisq(0.975,n-1)” is the 0.975-percentile of the chi-square distribution. By def-
inition, 97.5% of the chi-square distribution are below this value and 2.5%
are above it. Similarly, the expression “qchisq(0.025,n-1)” is the 0.025-
percentile of the chi-square distribution, with 2.5% of the distribution below
this value. Notice that between these two percentiles, namely within the inter-
val [qchisq(0.025,n-1), qchisq(0.975,n-1)], are 95% of the chi-square dis-
tribution.
We may summarize that for Normal measurements:
(n − 1)S²/σ² ∼ χ²(n−1) ⟹
P( qchisq(0.025,n-1) ≤ (n − 1)S²/σ² ≤ qchisq(0.975,n-1) ) = 0.95 .
The left most and the right most expressions in this event mark the end points
of the confidence interval.
Observe that the structure of the confidence interval is of the form:

[ S² · (n − 1)/qchisq(0.975,n-1) , S² · (n − 1)/qchisq(0.025,n-1) ] ,

namely the point estimate S² multiplied on each side by a ratio that involves a percentile of the chi-square distribution:
> (n-1)/qchisq(0.975,n-1)
[1] 0.5783456 0.8234295
> (n-1)/qchisq(0.025,n-1)
[1] 2.133270 1.240478
The ratios that are used in the left hand side of the intervals are 0.5783456 and
0.8234295, respectively. Both ratios are less than one. On the other hand, the
ratios associated with the other end of the intervals, 2.133270 and 1.240478, are
both larger than one.
Let us compute the point estimates of the variance and the associated con-
fidence intervals. Recall that the object “s” contains the sample standard de-
viations of the difference in fuel consumption for diesel and for gas cars. The
object “n” contains the two sample sizes:
> s^2
diesel gas
7.734211 2.055229
> (n-1)*s^2/qchisq(0.975,n-1)
diesel gas
4.473047 1.692336
> (n-1)*s^2/qchisq(0.025,n-1)
diesel gas
16.499155 2.549466
Observe that for diesel cars the variance of the difference in fuel consumption is
estimated to be 7.734211 with a 95%-confidence interval of [4.473047, 16.499155]
and for cars that use gas the estimated variance is 2.055229, with a confidence
interval of [1.692336, 2.549466].
As a final example in this section let us simulate the confidence level for
a confidence interval for the expectation and for a confidence interval for the
variance of a Normal measurement. In this simulation we assume that the ex-
pectation is equal to µ = 4 and the variance is equal to σ² = 3² = 9. The sample
size is taken to be n = 20. We start by producing the sampling distribution of
the sample average X̄ and of the sample standard deviation S:
> mu <- 4
> sig <- 3
> n <- 20
> X.bar <- rep(0,10^5)
> S <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+ X <- rnorm(n,mu,sig)
+ X.bar[i] <- mean(X)
+ S[i] <- sd(X)
+ }
Consider first the confidence interval for the expectation:
> mu.LCL <- X.bar - qt(0.975,n-1)*S/sqrt(n)
> mu.UCL <- X.bar + qt(0.975,n-1)*S/sqrt(n)
> mean((mu >= mu.LCL) & (mu <= mu.UCL))
[1] 0.95033
The nominal confidence level of the confidence interval is 95%, which is prac-
tically identical to the confidence level that was computed in the simulation.
The confidence interval for the variance is obtained in a similar way. The
only difference is that we apply now different formulae for the computation of
the upper and lower confidence limits:
> var.LCL <- (n-1)*S^2/qchisq(0.975,n-1)
> var.UCL <- (n-1)*S^2/qchisq(0.025,n-1)
> mean((sig^2 >= var.LCL) & (sig^2 <= var.UCL))
[1] 0.94958
Again, we obtain that the nominal confidence level of 95% coincides with the
confidence level computed in the simulation.
Finally,
0.98/√n ≤ 0.05 ⟹ √n ≥ 0.98/0.05 = 19.6 ⟹ n ≥ (19.6)² = 384.16 .
The conclusion is that n should be larger than 384 in order to assure the given
radius. Namely, n = 385 should be sufficient.
If the request is for an interval of radius 0.025 then the last line of reasoning
should be modified accordingly:
0.98/√n ≤ 0.025 ⟹ √n ≥ 0.98/0.025 = 39.2 ⟹ n ≥ (39.2)² = 1536.64 .
Consequently, n = 1537 will do. Reducing the radius of the interval by half
requires a sample size that is 4 times larger.
More examples that involve selection of the sample size will be considered
as part of the homework.
Solution (to Question 11.1.1): We read the content of the file “teacher.csv”
into a data frame by the name “teacher” and produce a summary of the content
of the data frame:
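The code itself is not shown in the text; a minimal sketch (assuming the file is saved in the working directory) is:
> teacher <- read.csv("teacher.csv")
> summary(teacher)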
There are two variables: The variable “condition” is a factor with two lev-
els, “C” that codes the Charismatic condition and “P” that codes the Punitive
condition. The second variable is “rating”, which is a numeric variable.
Solution (to Question 11.1.2): The sample average for the variable “rating”
can be obtained from the summary or from the application of the function
“mean” to the variable. The standard deviation is obtained from the application
of the function “sd” to the variable:
> mean(teacher$rating)
[1] 2.428567
> sd(teacher$rating)
[1] 0.5651949
Observe that the sample average is equal to 2.428567 and the sample standard
deviation is equal to 0.5651949.
Solution (to Question 11.1.3): The sample average and standard deviation
for each sub-sample may be produced with the aid of the function “tapply”.
We apply the function given in the third argument (first “mean” and then “sd”)
to the variable “rating”, which is passed in the first argument, over each level of
the factor “condition” that is passed in the second argument:
> tapply(teacher$rating,teacher$condition,mean)
C P
2.613332 2.236104
> tapply(teacher$rating,teacher$condition,sd)
C P
0.5329833 0.5426667
[Figure 11.2: Box plots of the ratings under the two conditions, “C” and “P”.]
We obtain that the average for the condition “C” is 2.613332 and the standard
deviation is 0.5329833.
You may note that the rating given by students that were exposed to the
description of the lecturer as charismatic is higher on the average than the rating
given by students that were exposed to a less favorable description. The box
plots of the ratings for the two conditions are presented in Figure 11.2.
Solution (to Question 11.1.4): The 99% confidence interval for the expectation
is computed by the formula x̄ ± qt(0.995,n-1) · s/√n. Observe that
only 0.5% of the t-distribution on (n − 1) degrees of freedom resides above the
percentile “qt(0.995,n-1)”. Using this percentile leaves out a total of 1% in
both tails and leaves 99% of the distribution inside the central interval.
For the students that were exposed to Condition “C”, x̄ = 2.613332, s =
0.5329833, and n = 25:
[1] 2.672961
The confidence interval for the expectation is [2.553703, 2.672961].
Solution (to Question 11.1.5): The 90% confidence interval for the variance
is computed by the formula

[ s² · (n − 1)/qchisq(0.95,n-1) , s² · (n − 1)/qchisq(0.05,n-1) ] .
Observe that 5% of the chi-square distribution on (n − 1) degrees of freedom
is above the percentile “qchisq(0.95,n-1)” and 5% are below the percentile
“qchisq(0.05,n-1)”. Using these percentiles leaves out a total of 10% in both
tails and leaves 90% of the distribution inside the central interval.
For the students that were exposed to Condition “C”, s = 0.5329833, and
n = 25:
> (24/qchisq(0.95,24))*0.5329833^2
[1] 0.1872224
> (24/qchisq(0.05,24))*0.5329833^2
[1] 0.4923093
The point estimate of the variance is s² = 0.5329833² = 0.2840712. The confidence
interval for the variance is [0.1872224, 0.4923093].
Solution (to Question 11.2.2): Using the same sampling distribution that
was produced in the solution to Question 1 we now compute the actual con-
fidence level of a confidence interval that is constructed under the assumption
that the measurement has a Normal distribution:
> t.LCL <- X.bar - qt(0.975,n-1)*S/sqrt(n)
> t.UCL <- X.bar + qt(0.975,n-1)*S/sqrt(n)
> mean((4 >= t.LCL) & (4 <= t.UCL))
[1] 0.91953
Based on the assumption we used the percentiles of the t-distribution. The
actual confidence level is 91.953% ≈ 92%, still short of the nominal 95% confidence
level.
Solution (to Question 11.2.1): The 80% confidence interval for the probability
is computed by the formula p̂ ± qnorm(0.90) · √(p̂(1 − p̂)/n):
> n <- 400
> p.hat <- 320/400
> p.hat - qnorm(0.90)*sqrt(p.hat*(1-p.hat)/n)
[1] 0.774369
> p.hat + qnorm(0.90)*sqrt(p.hat*(1-p.hat)/n)
[1] 0.825631
We obtain a confidence interval of the form [0.774369, 0.825631].
11.6 Summary
Glossary
Confidence Interval: An interval that is most likely to contain the population
parameter.
Confidence Level: The sampling probability that random confidence intervals
contain the parameter value. The confidence level of an observed interval
indicates that it was constructed using a formula that produces, when
applied to random samples, such random intervals.
t-Distribution: A bell-shaped distribution that resembles the standard Nor-
mal distribution but has wider tails. The distribution is characterized by
a positive parameter called degrees of freedom.
Chi-Square Distribution: A distribution associated with the sum of squares
of Normal random variables. The distribution obtains only positive values
and it is not symmetric. The distribution is characterized by a positive
parameter called degrees of freedom.
When the sample size is small and the observations have a distribution different from the
Normal, the nominal confidence level may not coincide with the actual
confidence level.
Chapter 12

Testing Hypothesis
> cars <- read.csv("cars.csv")
> t.test(cars$price, mu=13662)

        One Sample t-test

data:  cars$price
t = -0.8115, df = 200, p-value = 0.4181
alternative hypothesis: true mean is not equal to 13662
95 percent confidence interval:
12101.80 14312.46
sample estimates:
mean of x
13207.13
The data in the file “cars.csv” is read into a data frame that is given the
name “cars”. Afterwards, the data on prices of car types in 1985 is entered
as the first argument to the function “t.test”. The other argument is the
expected value that we want to test, the current average price of cars, given in
terms of 1985 Dollar value. The output of the function is reported under the
title: “One Sample t-test”.
Let us read the report from the bottom up. The bottom part of the report
describes the confidence interval and the point estimate of the expected price of
a car in 1985, based on the given data. Indeed, the last line reports the sample
1 Source: “http://wiki.answers.com/Q/Average_price_of_a_car_in_2009”.
2 Source: “http://www.westegg.com/inflation/”. The interpretation of adjusting prices
to inflation is that our comparison will correspond to changes in the price of cars, relative to
other items that enter into the computation of the Consumer Price Index.
average of the price, which is equal to 13,207.13. This number, which is the
average of the 201 non-missing values of the variable “price”, serves as the
estimate of the expected price of a car in 1985. The 95% confidence interval of
the expectation, the interval [12101.80, 14312.46], is presented on the 4th line
from the bottom. This is the confidence interval for the expectation that was
computed in Subsection 11.2.13 .
The information relevant to conducting the statistical test itself is given
in the upper part of the report. Specifically, it is reported that the data in
“cars$price” is used in order to carry out the test. Based on this data a test
statistic is computed and obtains the value of “t = -0.8115”. This statistic
is associated with the t-distribution with “df = 200” degrees of freedom. The
last quantity that is being reported is denoted the p-value and it obtains the
value “p-value = 0.4181”. The test may be carried out with the aid of the
value of the t statistic or, more directly, using the p-value. Currently we will
use the p-value.
The test itself examines the hypothesis that the expected price of a car
in 1985 was equal to $13,662, the average price of a car in 2009, given in 1985
values. This hypothesis is called the null hypothesis. The alternative hypothesis
is that the expected price of a car in 1985 was not equal to that figure. The
specification of the alternative hypothesis is reported on the third line of the
output of the function “t.test”.
One may decide between the two hypothesis on the basis of the size of the
p-value. The rule of thumb is to reject the null hypothesis, and thus accept the
alternative hypothesis, if the p-value is less than 0.05. In the current example the
p-value is equal to 0.4181 and is larger than 0.05. Consequently, we may conclude
that the expected price of a car in 1985 was not significantly different from the
current price of a car.
In the rest of this section we give a more rigorous explanation of the theory
and practice of statistical hypothesis testing.
3 The confidence interval computed in Subsection 11.2.1 was [12108.47, 14305.79], which is not identical to the confidence interval that appears in the report. The reason for
the discrepancy is that we used the 0.975-percentile of the Normal distribution, 1.96, whereas
the confidence interval computed here uses the 0.975-percentile of the t-distribution on 201-
1=200 degrees of freedom. The latter is equal to 1.971896. Nonetheless, for all practical
purposes, the two confidence intervals are the same.
(ii) Specifying the test: The second step in hypothesis testing involves the
selection of the decision rule, i.e. the statistical test, to be used in order to decide
between the two hypotheses. The decision rule typically involves a statistic that
is computed from the data and a subset of values of the statistic that correspond
to the rejection of the null hypothesis. The statistic is called the test statistic
and the subset of values is called the rejection region. The decision is to reject
the null hypothesis (and consequently choose the alternative hypothesis) if the
test statistic falls in the rejection region. Otherwise, if the test statistic does
not fall in the rejection region then the null hypothesis is selected.
Return to the example in which we test between H0 : E(X) = 13,662 and
H1 : E(X) ≠ 13,662. One may compute the statistic:

T = (X̄ − 13,662)/(S/√n) ,
where X̄ is the sample average (of the variable “price”), S is the sample stan-
dard deviation, and n is the sample size (n = 201 in the current example).
The sample average X̄ is an estimator of the expected price of a car. In prin-
ciple, the statistic T measures the discrepancy between the estimated value of
the expectation (X̄) and the expected value under the null hypothesis (E(X) =
13, 662). This discrepancy is measured in units of the (estimated) standard
deviation of the sample average4 .
If the null hypothesis H0 : E(X) = 13,662 is true then the sampling distri-
bution of the sample average X̄ should be concentrated about the value 13,662.
Values of the sample average much larger or much smaller than this value may
serve as evidence against the null hypothesis.
By the same reasoning, if the null hypothesis holds true then the values of the sampling
distribution of the statistic T should tend to be in the vicinity of 0. Values with
a relatively small absolute value are consistent with the null hypothesis. On
the other hand, extremely positive or extremely negative values of the statistic
indicate that the null hypothesis is probably false.
It is natural to set a value c and to reject the null hypothesis whenever the
absolute value of the statistic T is larger than c. The resulting rejection region
is of the form {|T| > c}. The rule of thumb, again, is to take the threshold c to be
equal to the 0.975-percentile of the t-distribution on n − 1 degrees of freedom, where
n is the sample size. In the current example, the sample size is n = 201 and the
percentile of the t-distribution is qt(0.975,200) = 1.971896. Consequently,
the subset {|T | > 1.971896} is the rejection region of the test.
A change in the hypotheses that are being tested may lead to a change in
the test statistic and/or the rejection region. For example, for testing H0 :
E(X) ≥ 13,662 versus H1 : E(X) < 13,662 one may still use the same test
statistic T as before. However, only very negative values of the statistic are
inconsistent with the null hypothesis. It turns out that the rejection region in
this case is of the form {T < −1.652508}, where qt(0.05,200) = -1.652508
is the 0.05-percentile of the t-distribution on 200 degrees of freedom. On the
other hand, the rejection region for testing between H0 : E(X) ≤ 13,662 and
H1 : E(X) > 13,662 is {T > 1.652508}. In this case, qt(0.95,200) = 1.652508
is the 0.95-percentile of the t-distribution on 200 degrees of freedom.
4 If the variance of the measurement, Var(X), was known one could have used Z = (X̄ −
13,662)/√(Var(X)/n) as a test statistic. This statistic corresponds to the discrepancy of the
sample average from the null expectation in units of its standard deviation, i.e. the z-value of
the sample average. Since the variance of the observation is unknown, we use an estimator of
the variance (S²) instead.
Selecting the test statistic and deciding what rejection region to use specifies
the statistical test and completes the second step.
(iii) Reaching a conclusion: After the stage is set, all that is left is to apply
the test to the observed data. This is done by computing the observed value of
the test statistic and checking whether or not the observed value belongs to the
rejection region. If it does belong to the rejection region then the decision is
to reject the null hypothesis. Otherwise, if the statistic does not belong to the
rejection region, then the decision is to accept the null hypothesis.
Return to the example of testing the price of car types. The observed value of
the T statistic is part of the output of the application of the function “t.test”
to the data. The value is “t = -0.8115”. As an exercise, let us recompute
directly from the data the value of the T statistic:
The observed value of the sample average is x̄ = 13207.13 and the observed
value of the sample standard deviation is s = 7947.066. The sample size (due
to having 4 missing values) is n = 201. The formula for the computation of
the test statistic in this example is t = (x̄ − 13,662)/(s/√n). Plugging the sample
size and the computed values of the sample average and standard deviation
into this formula produces:
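A minimal sketch of this computation in R, using the rounded values quoted above:

> (13207.13 - 13662)/(7947.066/sqrt(201))   # approximately -0.8115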
This value, after rounding, is equal to the value “t = -0.8115” that is
reported in the output of the function “t.test”.
The critical threshold for the absolute value of the T statistic on 201 − 1 = 200
degrees of freedom is qt(0.975,200) = 1.971896. Since the absolute observed
value (|t| = 0.8114824) is less than the threshold, the value of the statistic
does not belong to the rejection region (which is composed of absolute values
larger than the threshold). Consequently, we accept the null hypothesis
that declares that the expected price of a car was equal to the current expected
price (after adjusting for the change in the Consumer Price Index)5.
5 The p-value reported in the current example is 0.4181. The null hypothesis was accepted since this value is larger than 0.05. As a
matter of fact, the test that uses the T statistic as a test statistic and rejects the null hypothesis
for absolute values larger than qt(0.975,n-1) is equivalent to the test that uses the p-value
and rejects the null hypothesis for p-values less than 0.05. Below we discuss the computation
of the p-value.
under the null hypothesis. The structure of the rejection region of the test is
{|T | > c}, where c is an appropriate threshold. In the current example the value
of the threshold c was set to be equal to qt(0.975,200) = 1.971896. In general,
the specification of the threshold c depends on the error probabilities that are
associated with the test. In this section we describe these error probabilities.
The process of making decisions may involve errors. In the case of hypothesis
testing one may specify two types of error. On the one hand, the case may be
that the null hypothesis is correct (in the example, E(X) = 13, 662). However,
the data is such that the null hypothesis is rejected (here, |T | > 1.971896). This
error is called a Type I error.
A different type of error occurs when the alternative hypothesis holds (E(X) 6=
13, 662) but the null hypothesis is not rejected (|T | ≤ 1.971896). This other type
of error is called Type II error. A summary of the types of errors can be found
in Table 12.1:
                               H0 : E(X) = 13,662    H1 : E(X) ≠ 13,662
Accept H0 : |T| ≤ 1.971896     correct decision      Type II Error
Reject H0 : |T| > 1.971896     Type I Error          correct decision
In statistical hypothesis testing the two types of error are not treated
symmetrically. Rather, making a Type I error is considered more severe than
making a Type II error. Consequently, the test’s decision rule is designed so
as to assure an acceptable probability of making a Type I error. Reducing the
probability of a Type II error is desirable, but is of secondary importance.
Indeed, in the example that deals with the price of car types, the threshold
for rejecting the null hypothesis was set as high as qt(0.975,200) = 1.971896.
It is not sufficient that the sample average differs from 13,662 (which would
correspond to a threshold of 0); the distance between the sample average and
the null expectation has to be relatively large, relative to the sampling variability,
in order to exclude H0 as an option.
The significance level of the evidence for rejecting the null hypothesis is
based on the probability of the Type I error. The probabilities associated with
the different types of error are presented in Table 12.2:
Observe that the probability of a Type I error is called the significance level.
The significance level is set at some pre-specified level such as 5% or 1%, with
5% being the most widely used level. In particular, setting the threshold in the
example to be equal to qt(0.975,200) = 1.971896 produces a test with a 5%
significance level.
This lack of symmetry between the two hypotheses suggests another inter-
pretation of the difference between the hypotheses. According to this interpre-
tation the null hypothesis is the one in which the cost of making an error is
greater. Thus, when one splits the collection that forms the statistical model
into two sub-collections, the sub-collection associated with the more severe error
is designated as the null hypothesis and the other sub-collection becomes the
alternative.
For example, a new drug must pass a sequence of clinical trials before it
is approved for distribution. In these trials one may want to test whether the
new drug produces a beneficial effect in comparison to the current treatment.
Naturally, the null hypothesis in this case would be that the new drug is no
better than the current treatment and the alternative hypothesis would be that
it is better. Only if the clinical trials demonstrate a significant beneficial
effect of the new drug would it be released to the market.
In scientific research, in general, the currently accepted theory, the conser-
vative explanation, is designated as the null hypothesis. A claim of novelty in
the form of an alternative explanation requires strong evidence in order for it to
be accepted and be favored over the traditional explanation. Hence, the novel
explanation is designated as the alternative hypothesis. It replaces the current
theory only if the empirical data clearly supports it. The test statistic is a
summary of the empirical data. The rejection region corresponds to values that
are unlikely to be observed according to the current theory. Obtaining a value
in the rejection region is an indication that the current theory is probably not
adequate and should be replaced by an explanation that is more consistent with
the empirical evidence.
The second type of error probability in Table 12.2 is the probability of a
Type II error. Instead of dealing directly with this probability the tradition is
to consider the complementary probability that corresponds to the probability
of not making a Type II error. This complementary probability is called the
statistical power:
The statistical power is the probability of rejecting the null hypothesis when the
state of nature is the alternative hypothesis. (In comparison, the significance
level is the probability of rejecting the null hypothesis when the state of nature is
the null hypothesis.) When comparing two decision rules for testing hypotheses,
both having the same significance level, the one that possesses a higher statistical
power should be favored.
12.2.4 p-Values
The p-value is another test statistic. It is associated with a specific test statistic
and a structure of the rejection region. The p-value is equal to the significance
level of the test in which the observed value of the statistic serves as the thresh-
old. In the current example, where T is the underlying test statistic and
the structure of the rejection region is of the form {|T| > c}, the p-value is
equal to the probability of rejecting the null hypothesis in the case where the
threshold is equal to the observed value of the statistic. In other words:
p-value = P(|T| > |t|) = P(|T| > |−0.8114824|) = P(|T| > 0.8114824),
where t = −0.8114824 is the observed value of the T statistic and the compu-
tation of the probability is conducted under the null hypothesis.
> 2*(1-pt(0.8114824,200))
[1] 0.4180534
(Figure 12.1: bar plot of the distribution of the variable “dif.mpg”, produced from the frequency table table(dif.mpg); the values on the horizontal axis range from 0 to 11.)
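The expressions referred to in the next paragraph are not reproduced here. A sketch of what they may have looked like, assuming the data frame “cars” contains the columns “highway.mpg” and “city.mpg” (the column names are an assumption), is:

> dif.mpg <- cars$highway.mpg - cars$city.mpg   # difference in miles-per-gallon (column names assumed)
> summary(dif.mpg)                              # numerical summary of the distribution
> plot(table(dif.mpg))                          # bar plot of the frequencies (Figure 12.1)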
In the first expression we created the variable “dif.mpg” that contains the
difference in miles-per-gallon. The difference is computed for each car type be-
tween highway driving conditions and urban driving conditions. The summary
of this variable is produced in the second expression. Observe that the values of
the variable range between 0 and 11, with 50% of the distribution concentrated
between 5 and 7. The median is 6 and the mean is 5.532. The last expres-
sion produces the bar plot of the distribution. This bar plot is presented in
Figure 12.1. It turns out that the variable “dif.mpg” obtains integer values.
In this section we test hypotheses regarding the expected difference in fuel
consumption between highway and city conditions.
Energy is required in order to move cars. The heavier the car, the more
energy is required to move it. Consequently, one may conjecture
that the mileage per gallon for heavier cars is less than the mileage per gallon for
lighter cars.
The relation between the weight of the car and the difference between the
mileage per gallon in highway and city driving conditions is less clear. On the one
hand, urban traffic involves frequent changes in speed in comparison to highway
conditions. One may presume that this change in speed is a cause for reduced
efficiency in fuel consumption. If this is the case then one may predict that
heavier cars, which require more energy for acceleration, will be associated with
a bigger difference between highway and city driving conditions in comparison
to lighter cars.
On the other hand, heavier cars get fewer miles per gallon overall. The differ-
ence between two smaller numbers (the mileage per gallon in highway and in city
conditions for heavier cars) may tend to be smaller than the difference between
two larger numbers (the mileage per gallon in highway and in city conditions for
lighter cars). If this is the case then one may predict that heavier cars will be
associated with a smaller difference between highway and city driving conditions
in comparison to lighter cars.
The average difference between highway and city conditions is approximately
5.53 for all cars. Divide the cars into two groups of equal size: one group
composed of the heavier cars and the other group composed of the lighter cars.
We will examine the relation between the weight of the car and the difference in
miles per gallon between the two driving conditions by testing hypotheses sep-
arately for each weight group6. For each such group we start by testing against the
two-sided alternative H1 : E(X) ≠ 5.53, where X is the difference between
highway and city miles-per-gallon in cars that belong to the given weight group.
After conducting the test for the two-sided alternative we will discuss the results of
the application of tests for one-sided alternatives.
We start by the definition of the weight groups. The variable “curb.weight”
measures the weight of the cars in the data frame “cars”. Let us examine the
summary of the content of this variable:
> summary(cars$curb.weight)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1488 2145 2414 2556 2935 4066
Notice that half of the cars in the data frame weigh less than 2,414 lb and half
the cars weigh more. The average weight of a car is 2,556 lb. Let us take 2,414
as a threshold and denote cars below this weight as “light” and cars above this
threshold as “heavy”:
> heavy <- cars$curb.weight > 2414
> table(heavy)
heavy
FALSE TRUE
103 102
The variable “heavy” indicates for each car type whether its weight is above or
below the threshold weight of 2,414 lb. The variable is composed of a sequence
with as many components as the number of observations in the data frame
“cars” (n = 205). Each component is a logical value: “TRUE” if the car is heavier
than the threshold and “FALSE” if it is not. When we apply the function “table”
to this sequence we get that 102 of the cars are heavier than the threshold and
103 are not. This outcome is expected, since 2,414 is the median weight of
the cars.
6 In the next chapters we will consider more direct ways for comparing the effect of one
variable (curb.weight in this example) on the distribution of another variable (dif.mpg in this
example). Here, instead, we investigate the effect indirectly by the investigation of hypotheses
on the expectation of the variable dif.mpg separately for heavier cars and for lighter cars.
We would like to apply the t-test first to the sub-collection of all cars with
weight above 2,414 lb (cars that are associated with the value “TRUE” in the
variable “heavy”), and then to all cars with weights not exceeding the threshold
(cars associated with value “FALSE”). In the past we showed that one may
address components of a sequence using its position in the sequence7 . Here we
demonstrate an alternative approach for addressing specific locations with the
aid of a sequence with logical components.
In order to illustrate this second approach consider the two sequences:
> w <- c(5,3,4,6,2,9)
> d <- c(13,22,0,12,6,20)
We would like to select the components of the sequence “d” in all the locations
where the components of the sequence “w” obtain values larger than 5. Consider
the code:
> w > 5
[1] FALSE FALSE FALSE TRUE FALSE TRUE
> d[w > 5]
[1] 12 20
Notice that the expression “w > 5” is a sequence of logical components, with
the value “TRUE” at the positions where “w” is above the threshold and the
value “FALSE” at the positions where “w” is at or below the threshold. We may use
the sequence with logical components as an index to the sequence of the same
length “d”. The expression is “d[w > 5]”. The output of this expression is the
sub-sequence of elements from “d” that are associated with the “TRUE” values of
the logical sequence. Indeed, “TRUE” values are present at the 4th and the 6th
positions of the logical sequence. Consequently, the output of the expression
“d[w > 5]” contains the 4th and the 6th components of the sequence “d”.
The operator “!”, when applied to a logical value, reverses the value. A
“TRUE” becomes “FALSE” and a “FALSE” becomes “TRUE”. Consider the code:
> !(w > 5)
[1] TRUE TRUE TRUE FALSE TRUE FALSE
> d[!(w > 5)]
[1] 13 22 0 6
Observe that the sequence “!(w > 5)” obtains a value of “TRUE” at the positions
where “w” is less than or equal to 5. Consequently, the output of the expression
“d[!(w > 5)]” consists of all the values of “d” that are associated with components
of “w” that are less than or equal to 5.
The variable “dif.mpg” contains data on the difference in miles-per-gallon
between highway and city driving conditions for all the car types. The se-
quence “heavy” identifies the car types with curb weight above the threshold
of 2,414 lb. The components of this sequence are logical, with the value “TRUE”
at positions associated with the heavier car types and the value “FALSE” at positions
associated with the lighter car types. Observe that the output of the expres-
sion “dif.mpg[heavy]” is the subsequence of differences in miles-per-gallon for
7 For example, in Question 9.1 we referred to the first 29 observations of the sequence
“change” using the expression “change[1:29]” and to the last 21 observations using the
expression “change[30:50]”.
the cars with curb weight above the given threshold. We apply the function
“t.test” to this expression in order to conduct the t-test on the expectation of
the variable “dif.mpg” for the heavier cars:
> t.test(dif.mpg[heavy],mu=5.53)
data: dif.mpg[heavy]
t = -1.5385, df = 101, p-value = 0.1270
alternative hypothesis: true mean is not equal to 5.53
95 percent confidence interval:
4.900198 5.609606
sample estimates:
mean of x
5.254902
The target population is the heavier car types. Notice that we test the null hy-
pothesis that the expected difference among the heavier cars is equal to 5.53 against
the alternative hypothesis that the expected difference among the heavier cars is not
equal to 5.53. The null hypothesis is not rejected at the 5% significance level
since the p-value, which is equal to 0.1270, is larger than 0.05. Consequently,
based on the data at hand, we cannot conclude that the expected difference
in miles-per-gallon for heavier cars is significantly different from the average
difference for all cars.
Observe also that the estimate of the expectation, the sample mean, is equal
to 5.254902, with a confidence interval of the form [4.900198, 5.609606].
Next, let us apply the same test to the lighter cars. Notice that the expression
“dif.mpg[!heavy]” produces the subsequence of differences in miles-per-gallon
for the cars with curb weight below the given threshold. The application of the
function “t.test” to this subsequence gives:
> t.test(dif.mpg[!heavy],mu=5.53)
data: dif.mpg[!heavy]
t = 1.9692, df = 102, p-value = 0.05164
alternative hypothesis: true mean is not equal to 5.53
95 percent confidence interval:
5.528002 6.083649
sample estimates:
mean of x
5.805825
Observe that again the null hypothesis is not rejected at the 5% significance level,
since the p-value of 0.05164 is still larger than 0.05. However, unlike the case of the
heavier cars, where the p-value was undeniably larger than the threshold, here
the p-value is very close to the threshold of 0.05. Consequently, we almost
conclude that the expected difference in miles-per-gallon for lighter cars
is significantly different from the average difference for all cars.
Why did we not reject the null hypothesis for the heavier cars but
almost did so for the lighter cars? Recall that both tests are based on the T
statistic, which measures the ratio between the deviation of the sample average
from its expectation under the null and the estimate of the standard deviation
of the sample average. The value of this statistic is “t = -1.5385” for the heavier
cars and it is “t = 1.9692” for the lighter cars, an absolute value that is about 28%
higher.
The deviation between the sample average for the heavier cars and the expectation
under the null is 5.254902 − 5.53 = −0.275098. On the other hand, the deviation
between the sample average for the lighter cars and the expectation under the null is
5.805825 − 5.53 = 0.275825. The two deviations are practically equal to each
other in absolute value.
The estimator of the standard deviation of the sample average is S/√n,
where S is the sample standard deviation and n is the sample size. The sample
sizes, 103 for lighter cars and 102 for heavier cars, are almost equal. Therefore,
the reason for the difference in the values of the T statistics for both weight
groups must be differences in the sample standard deviations. Indeed, when we
compute the sample standard deviation for lighter and heavier cars respectively8
we get that the standard deviation for lighter cars (1.421531) is much smaller
than the standard deviation for heavier cars (1.805856):
> tapply(dif.mpg,heavy,sd)
FALSE TRUE
1.421531 1.805856
The important lesson to learn from this exercise is that a simple-minded notion
of significance and statistical significance are not the same. A simple-minded
assessment of the discrepancy from the null hypothesis would put the evidence
from the data on lighter cars and the evidence from the data on heavier cars on
the same level: in both cases the estimated value of the expectation is the same
distance away from the null value.
However, the statistical assessment conducts the analysis in the context of the
sampling distribution. The deviation of the sample average from the expectation
is compared to the standard deviation of the sample average. Consequently, in
statistical hypothesis testing a smaller deviation of the sample average from
the expectation under the null may be more significant than a larger one if the
sampling variability of the former is much smaller than the sampling variability
of the latter.
Let us proceed with the demonstration of the application of the t-test by
the testing of one-sided alternatives in the context of the lighter cars. One may
test the one-sided alternative H1 : E(X) > 5.53 that the expected value of the
difference in miles-per-gallon among cars with curb weight no more than 2,414
lb is greater than 5.53 by the application of the function “t.test” to the data
on lighter cars. This data is the output of the expression “dif.mpg[!heavy]”.
We once again specify the null value of the expectation by the introduction
8 Recall that the function “tapply” applies the function that is given as its third argument
(the function “sd” in this case) to each subset of values of the sequence that is given as its first
argument (the sequence “dif.mpg” in the current application). The subsets are determined
by the levels of the second arguments (the sequence “heavy” in this case). The output is the
sample standard deviation of the variable “dif.mpg” for lighter cars (the level “FALSE”) and
for heavier cars (the level “TRUE”).
of the expression “mu=5.53”. The fact that we are interested in the testing
of the specific alternative is specified by the introduction of a new argument
of the form: “alternative="greater"”. The default value of the argument
“alternative” is “"two.sided"”, which produces a test of a two-sided alter-
native. By changing the value of the argument to “"greater"” we produce a
test for the appropriate one-sided alternative:
> t.test(dif.mpg[!heavy],mu=5.53,alternative="greater")
data: dif.mpg[!heavy]
t = 1.9692, df = 102, p-value = 0.02582
alternative hypothesis: true mean is greater than 5.53
95 percent confidence interval:
5.573323 Inf
sample estimates:
mean of x
5.805825
Observe that the value of the test statistic (t = 1.9692) is the same as for the
test of the two-sided alternative and so is the number of degrees of freedom asso-
ciated with the statistic (df = 102). However, the p-value is smaller (p-value
= 0.02582), compared to the p-value in the test for the two-sided alternative
(p-value = 0.05164). The p-value for the one-sided test is the probability un-
der the sampling distribution that the test statistic obtains values larger than
the observed value of 1.9692. The p-value for the two-sided test is twice that
figure since it also involves the probability of being less than the negative of the
observed value.
The estimated value of the expectation, the sample average, is unchanged.
However, instead of a two-sided confidence interval for the expectation, the re-
port produces a one-sided confidence interval of the form [5.573323, ∞). Such an
interval corresponds to the smallest value that the expectation may reasonably
obtain on the basis of the observed data.
Finally, consider the test of the other one-sided alternative H1 : E(X) < 5.53:
> t.test(dif.mpg[!heavy],mu=5.53,alternative="less")
data: dif.mpg[!heavy]
t = 1.9692, df = 102, p-value = 0.9742
alternative hypothesis: true mean is less than 5.53
95 percent confidence interval:
-Inf 6.038328
sample estimates:
mean of x
5.805825
The p-value in this case, 0.9742, is the probability under the sampling distribution that the
test statistic obtains values less than the observed value of 1.9692. Clearly, the
null hypothesis is not rejected in this test.
Let us test the hypothesis that the median weight of cars that run on diesel
is also 2,414 lb. Recall that 20 out of the 205 car types in the sample have diesel
engines. Let us use the weights of these cars in order to test the hypothesis.
Recall that the variable “fuel.type” is a factor with two levels “diesel” and
“gas” that identify the fuel type of each car. The variable “heavy” identifies for
each car whether its weight is above the threshold of 2,414 lb or not. Let us produce a
2 × 2 table that summarizes the frequency of each combination of weight group
and the fuel type:
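The table itself is not shown above. A sketch of the expression that produces it, together with the frequencies described in the following paragraph (the layout of the output is an approximate reconstruction), is:

> table(cars$fuel.type,heavy)   # output below reconstructed from the counts in the text
        heavy
         FALSE TRUE
  diesel     6   14
  gas       97   88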
Originally the function “table” was applied to a single factor and produced
a sequence with the frequencies of each level of the factor. In the current ap-
plication the input to the function consists of two factors9. The output is a table of
frequencies. Each entry of the table corresponds to the frequency of a combi-
nation of levels, one from the first input factor and the other from the second
input factor. In this example we obtain that 6 cars use diesel and their curb
weight was below the threshold. There are 14 cars that use diesel and their curb
weight is above the threshold. Likewise, there are 97 light cars that use gas and
88 heavy cars with gas engines.
The function “prop.test” produces statistical tests for proportions. The
relevant information for the current application of the function is the fact that the
frequency of light diesel cars is 6 among a total number of 20 diesel cars. The
first entry to the function is the frequency of the occurrence of the event, 6 in
this case, and the second entry is the relevant sample size, the total number of
diesel cars which is 20 in the current example:
> prop.test(6,20)
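The report printed on the screen should look approximately as follows (a reconstruction based on the values described in the following paragraphs):

	1-sample proportions test with continuity correction

data:  6 out of 20, null probability 0.5
X-squared = 2.45, df = 1, p-value = 0.1175
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.1283909 0.5433071
sample estimates:
  p
0.3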
The function produces a report that is printed on the screen. The title
identifies the test as a one-sample test of proportions. In later chapters we
will apply the same function to more complex data structures and the title will
9 To be more accurate, the variable “heavy” is not a factor but a sequence with logical
components. Nonetheless, when the function “table” is applied to such a sequence it treats
it as a factor with two levels, “TRUE” and “FALSE”.
change accordingly. The title also identifies the fact that a continuity correction
is used in the computation of the test statistic.
The line under the title indicates the frequency of the event in the sample
and the sample size. (In the current example, 6 diesel cars with weights below
the threshold among a total of 20 diesel cars.) The probability of the event,
under the null hypothesis, is described. The default value of this probability
is “p = 0.5”, which is the proper value in the current example. This default
value can be modified by replacing the value 0.5 by the appropriate probability.
The next line presents the information relevant for the test itself. The test
statistic, which is essentially the square of the Z statistic described above10 ,
obtains the value 2.45. The sampling distribution of this statistic under the null
hypothesis is, approximately, the chi-square distribution on 1 degree of freedom.
The p-value, which is the probability that chi-square distribution on 1 degree of
freedom obtains a value above 2.45, is equal to 0.1175. Consequently, the null
hypothesis is not rejected at the 5% significance level.
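A minimal sketch in R that reproduces these numbers from the continuity-corrected statistic (the formula is given in footnote 10 below); the last expression gives approximately 0.1175, the reported p-value:

> p.hat <- 6/20
> p0 <- 0.5
> n <- 20
> (abs(p.hat - p0) - 0.5/n)^2/(p0*(1-p0)/n)   # the continuity-corrected test statistic
[1] 2.45
> 1 - pchisq(2.45,1)                          # upper tail of the chi-square distribution on 1 df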
The bottom part of the report provides the confidence interval and the point
estimate for the probability of the event. The confidence interval for the given
data is [0.1283909, 0.5433071] and the point estimate is p̂ = 6/20 = 0.3.
It is interesting to note that although the deviation between the estimated
proportion p̂ = 0.3 and the null value of the probability p = 0.5 is relatively large,
the null hypothesis was still not rejected. The reason for that is the small size
of the sample, n = 20, that was used in order to test the hypothesis. Indeed, as
an exercise let us examine the application of the same test to a setting where
n = 200 and the number of occurrences of the event is 60:
> prop.test(60,200)
The estimated value of the probability is the same as before since p̂ = 60/200 =
0.3. However, the p-value is 2.322 × 10−8 , which is way below the significance
threshold of 0.05. In this scenario the null hypothesis is rejected with flying
colors.
This last example is yet another demonstration of the basic characteristic
of statistical hypothesis testing. The consideration is based not on the discrep-
ancy of the estimator of the parameter from the value of the parameter under
10 The test statistic that is computed by default is based on Yates’ correction for continuity,
which is very similar to the continuity correction that was used in Chapter 6 for the Normal
approximation of the Binomial distribution. Specifically, the test statistic with the continuity
correction for testing H0 : p = p0 takes the form [|p̂ − p0| − 0.5/n]²/[p0(1 − p0)/n]. Compare
this statistic with the statistic proposed in the text, which takes the form [p̂ − p0]²/[p0(1 − p0)/n].
The latter statistic is used if the argument “correct = FALSE” is added to the function.
12.5 Solved Exercises
2. Identify the observations that can be used in order to test the hypotheses.
3. Carry out the test and report your conclusion. (Use a significance level of
5%.)
Solution (to Question 12.2.2): The observations that can be used in order
to test the hypothesis are those associated with patients that were treated with
the inactive placebo, i.e. the last 21 observations. We extract these values from
the data frame using the expression “magnets$change[30:50]”.
Solution (to Question 12.2.3): In order to carry out the test we read the
data from the file “magnets.csv” into the data frame “magnets”. The function
“t.test” is applied to the observations extracted from the data frame. Note
that the default value of the expectation tested by the function is “mu = 0”:
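A call of the following form (a reconstruction) produces the report below:

> t.test(magnets$change[30:50])   # reconstructed call; the null value defaults to mu = 0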
data: magnets$change[30:50]
t = 3.1804, df = 20, p-value = 0.004702
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.3768845 1.8135916
sample estimates:
mean of x
1.095238
Question 12.2. It is assumed, when constructing the t-test, that the mea-
surements are Normally distributed. In this exercise we want to examine the
robustness of the test to divergence from that assumption. You are required to
compute the significance level of a two-sided t-test of H0 : E(X) = 4 versus
H1 : E(X) ≠ 4. Assume there are n = 20 observations and use a t-test with a
nominal 5% significance level.
We compute the test statistic “T” from the sample average “X.bar” and the
sample standard deviation “S”. In the last expression we compute the probabil-
ity that the absolute value of the test statistic is larger than “qt(0.975,19)”,
which is the threshold that should be used in order to obtain a significance level
of 5% for Normal measurements.
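The code that produces this result is not shown; a sketch of the simulation, assuming the measurements are Exponentially distributed with rate 1/4 (so that E(X) = 4), is:

> n <- 20
> X.bar <- rep(0,10^5)
> S <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+ X <- rexp(n,1/4)   # rate 1/4 assumed so that the expectation equals 4
+ X.bar[i] <- mean(X)
+ S[i] <- sd(X)
+ }
> T <- (X.bar - 4)/(S/sqrt(n))
> mean(abs(T) > qt(0.975,n-1))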
We obtain that the actual significance level of the test is 0.08047, which is
substantially larger than the nominal significance level.
> a <- 0
> b <- 8
> n <- 20
> X.bar <- rep(0,10^5)
> S <- rep(0,10^5)
> for(i in 1:10^5)
+ {
+ X <- runif(n,a,b)
+ X.bar[i] <- mean(X)
+ S[i] <- sd(X)
+ }
> T <- (X.bar - 4)/(S/sqrt(n))
> mean(abs(T) > qt(0.975,n-1))
[1] 0.05163
The actual significance level of the test is 0.05163, much closer to the nominal
significance level of 5%.
A possible explanation for the difference between the two cases is that the
Uniform distribution is symmetric like the Normal distribution, whereas the
Exponential is skewed. In any case, for larger sample sizes one may expect the
Central Limit Theorem to kick in and produce more satisfactory results, even
for the Exponential case.
Solution (to Question 12.3.1): We input the data to R and then compute
the test statistic and the appropriate percentile of the t-distribution:
> n <- 55
> x.bar <- 22.7
> s <- 5.4
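The remaining expressions are not shown; a sketch of the computation, taking the null value of the expectation to be µ0 = 20 (the value consistent with the reported statistic), is:

> (x.bar - 20)/(s/sqrt(n))   # null value 20 inferred from the reported statistic
[1] 3.708099
> qt(0.975,n-1)
[1] 2.004879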
Observe that the absolute value of the statistic (3.708099) is larger than the
threshold for rejection (2.004879). Consequently, we reject the null hypothesis.
> qt(0.995,n-1)
[1] 2.669985
Again, the absolute value of the statistic (3.708099) is larger than the threshold
for rejection (2.669985). Consequently, we reject the null hypothesis.
The absolute value of the new statistic (1.785381) is smaller than the threshold
for rejection (2.004879). Consequently, we do not reject the null hypothesis.
12.6 Summary
Glossary
Hypothesis Testing: A method for deciding between two hypotheses, with
one of the two being the currently accepted hypothesis. The decision is
based on the value of the test statistic. The probability of falsely rejecting
the currently accepted hypothesis is the significance level of the test.
Test Statistic: A statistic that summarizes the data in the sample in order to
decide between the two alternative hypotheses.
Rejection Region: A set of values that the test statistic may obtain. If the
observed value of the test statistic belongs to the rejection region then the
null hypothesis is rejected. Otherwise, the null hypothesis is not rejected.
Type I Error: The null hypothesis is correct but it is rejected by the test.
Type II Error: The alternative hypothesis holds but the null hypothesis is not
rejected by the test.
Significance Level: The probability of a Type I error. The probability, com-
puted under the null hypothesis, of rejecting the null hypothesis. The
test is constructed to have a given significance level. A commonly used
significance level is 5%.
Formulas:
• Test Statistic for Expectation: t = (x̄ − µ0)/(s/√n).
• Two-Sided Test: Reject H0 if {|t| > qt(0.975,n-1)}.
Chapter 13: Comparing Two Samples
the value of the first variable is determined by the value of the second. How-
ever, in the statistical context relations between variables are more complex.
Typically, a statistical relation between variables does not make one a direct
function of the other. Instead, the distribution of values of one of the variables
is affected by the value of the other variable. For a given value of the second
variable the first variable may have one distribution, but for a different value
of the second variable the distribution of the first variable may be different. In
statistical terminology the second variable in this setting is called an explana-
tory variable and the first variable, with a distribution affected by the second
variable, is called the response.
As an illustration of the relation between the response and the explana-
tory variable consider the following example. In a clinical trial, which is a
precondition for the marketing of a new medical treatment, a group of pa-
tients is randomly divided into a treatment and a control sub-group. The new
treatment is administered to the patients in the treatment sub-group. At the
same time, the patients in the control sub-group obtain the currently standard
treatment. The new treatment passes the trial and is approved for marketing
by the Health Authorities only if the response to the medical intervention is
better for the treatment sub-group than it is for the control sub-group. This
treatment-control experimental design, in which a response is measured under
two experimental conditions, is used in many scientific and industrial settings.
In the example of a clinical trial one may identify two variables. One vari-
able measures the response to the medical intervention for each patient that
participated in the trial. This variable is the response variable, the distribu-
tion of which one seeks to investigate. The other variable indicates to which
sub-group, treatment or control, each patient belongs. This variable is the ex-
planatory variable. In the setting of a clinical trial the explanatory variable
is a factor with two levels, “treatment” and “control”, that splits the sample
into two sub-samples. The statistical inference compares the distribution of
the response variable among the patients in the treatment sub-sample to the
distribution of the response among the patients in the control sub-group.
The analysis of experimental settings such as the treatment-control trial is a
special case that involves the investigation of the effect an explanatory variable
may have on the response variable. In this special case the explanatory variable
is a factor with two distinct levels. Each level of the factor is associated with
a sub-sample, either treatment or control. The analysis seeks to compare the
distribution of the response in one sub-sample with the distribution in the other
sub-sample. If the response is a numeric measurement then the analysis may
take the form of comparing the response’s expectation in each sub-group to
each other. Alternatively, the analysis may involve comparing variances. On
the other hand, if the response is the indicator of the occurrence of an event
then the analysis may compare two probabilities, the probability of the event in
the treatment group and the probability of the same event in the control group,
to each other.
In this chapter we deal with statistical inference that corresponds to the
comparison of the distribution of a numerical response variable between two
sub-groups that are determined by a factor. The inference includes testing
hypotheses, mainly the null hypothesis that the distribution of the response
is the same in both subgroups versus the alternative hypothesis that it is not
the same, as well as point estimation and confidence intervals of appropriate
parameters.
In the next chapter we will consider the case where the explanatory variable
is numeric and in the subsequent chapter we describe the inference that is used
in the case where the response is the indicator of the occurrence of an event.
(Figure: box plots of the variable “dif.mpg” for the two levels of the factor “heavy”, FALSE and TRUE; extreme observations are plotted as individual points.)
logical components. It cannot be used, for example, as an index to another sequence in order
to select the components that are associated with the “TRUE” logical value.
for heavier cars (cars associated with the level “TRUE”) tends to obtain smaller
values than the distribution of the response for lighter cars (cars associated with
the level “FALSE”).
The input to the function “plot” is a formula expression of the form
“response ~ explanatory.variable”. A formula identifies the role of the variables.
The variable to the left of the tilde character in a formula is the response and
the variable to the right is the explanatory variable. In the current case the vari-
able “dif.mpg” is the response and the variable “heavy” is the explanatory
variable.
Let us use a formal test in order to negate the hypothesis that the expectation
of the response for the two weight groups is the same. The test is provided by
the application of the function “t.test” to the formula “dif.mpg˜heavy”:
> t.test(dif.mpg~heavy)
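The report printed by the function should look approximately as follows (a reconstruction based on the values quoted in the discussion below):

	Welch Two Sample t-test

data:  dif.mpg by heavy
t = 2.4255, df = 191.561, p-value = 0.01621
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1029150 0.9989315
sample estimates:
mean in group FALSE  mean in group TRUE
           5.805825            5.254902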
The estimated expectation of the response for the lighter cars is 5.805825, the average
of the measurements associated with the level “FALSE”, and for the heavier cars it is
5.254902, the average of the measurements associated with the level “TRUE”.
The point estimate for the difference between the two expectations is the
difference between the two sample averages: 5.805825 − 5.254902 = 0.550923.
A confidence interval for the difference between the expectations is reported
under the title “95 percent confidence interval:”. The computed value of
the confidence interval is [0.1029150, 0.9989315].
In the rest of this section we describe the theory behind the construction of
the confidence interval and the statistical test.
Our goal in this section is to construct a confidence interval for the differ-
ence in expectations E(Xa ) − E(Xb ). A natural estimator for this difference in
expectations is the difference in averages X̄a − X̄b . The average difference will
also serve as the basis for the construction of a confidence interval.
Recall that the construction of the confidence interval for a single expec-
tation was based on the sample average X̄. We exploited the fact that the
distribution of Z = (X̄ − E(X))/√(Var(X)/n), the standardized sample aver-
age, is approximately standard Normal. From this Normal approximation we
obtained an approximate 0.95 probability for the event

−1.96 · √(Var(X)/n) ≤ X̄ − E(X) ≤ 1.96 · √(Var(X)/n),
Var(X̄a − X̄b) = Var(X̄a) + Var(X̄b) = Var(Xa)/na + Var(Xb)/nb,
where na is the size of the sub-sample that produces the sample average X̄a and
nb is the size of the sub-sample that produces the sample average X̄b. Observe
that both X̄a and X̄b contribute to the variability of the difference. The total
variability is the sum of the two contributions3. Finally, we use the fact that the
variance of the sample average is equal to the variance of a single measurement
divided by the sample size. This fact is used for both averages in order to obtain
a representation of the variance of the estimator in terms of the variances of the
measurement in the two sub-populations and the sizes of the two sub-samples.
The standardized deviation takes the form:

Z = (X̄a − X̄b − {E(Xa) − E(Xb)}) / √(Var(Xa)/na + Var(Xb)/nb).
2 In the case where the sample size is small and the observations are Normally distributed we
used the t-distribution instead. The percentile that was used in that case was qt(0.975,n-1),
the 0.975 percentile of the t-distribution on n − 1 degrees of freedom.
3 It can be proved mathematically that the variance of a difference (or a sum) of two
independent random variables is the sum of the variances. The situation is different when the
two random variables are correlated.
When both sample sizes na and nb are large then the distribution of Z is ap-
proximately standard Normal. As a corollary from the Normal approximation
one gets that P(−1.96 ≤ Z ≤ 1.96) ≈ 0.95.
The values of the variances Var(Xa) and Var(Xb) that appear in the definition
of Z are unknown. However, these values can be estimated using the sub-
sample variances S²a and S²b. When the size of both sub-samples is large then
these estimators will produce good approximations of the unknown variances:
Var(Xa) ≈ S²a , Var(Xb) ≈ S²b  ⟹  Var(Xa)/na + Var(Xb)/nb ≈ S²a/na + S²b/nb.
The approximation results from the use of the sub-sample variances as a substi-
tute for the unknown variances of the measurement in the two sub-populations.
When the two sample sizes na and nb are large then the probability of the given
event will also be approximately equal to 0.95.
Finally, re-expressing the last event in a format that puts the parameter
E(Xa) − E(Xb) in the center will produce the confidence interval with boundaries
of the form:

X̄a − X̄b ± 1.96 · √(S²a/na + S²b/nb)
In order to illustrate the computations that are involved in the creation of
a confidence interval for the difference between two expectations let us return
to the example of difference in miles-per-gallon for lighter and for heavier cars.
Compute the two sample sizes, sample averages, and sample variances:
> table(heavy)
heavy
FALSE TRUE
103 102
> tapply(dif.mpg,heavy,mean)
FALSE TRUE
5.805825 5.254902
> tapply(dif.mpg,heavy,var)
FALSE TRUE
2.020750 3.261114
Observe that there are 103 lighter cars and 102 heavier ones. These counts were
obtained by the application of the function “table” to the factor “heavy”. The
lighter cars are associated with the level “FALSE” and the heavier cars are associated
with the level “TRUE”.
The average difference in miles-per-gallon for lighter cars is 5.805825 and
the variance is 2.020750. The average difference in miles-per-gallon for heavier
cars is 5.254902 and the variance is 3.261114. These quantities were obtained
by the application of the functions “mean” or “var” to the values of the vari-
able “dif.mpg” that are associated with each level of the factor “heavy”. The
application was carried out using the function “tapply”.
Notice that the computed values of the means are equal to the values re-
ported in the output of the application of the function “t.test” to the for-
mula “dif.mpg ~ heavy”. The difference between the averages is x̄a − x̄b =
5.805825 − 5.254902 = 0.550923. This value is the center of the confidence
interval. The estimate of the standard deviation of the difference in averages is:

√(s²a/na + s²b/nb) = √(2.020750/103 + 3.261114/102) = 0.227135.
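A minimal sketch of the resulting interval in R, using the Normal percentile 1.96 = qnorm(0.975) (the result is roughly [0.106, 0.996]; see footnote 4 below for the comparison with the interval produced by “t.test”):

> se <- sqrt(2.020750/103 + 3.261114/102)   # estimated standard deviation of the difference
> 0.550923 - qnorm(0.975)*se                # lower boundary of the interval
> 0.550923 + qnorm(0.975)*se                # upper boundary of the interval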
T = (X̄ − 0)/(S/√n) = X̄/(S/√n),
4 The confidence interval given in the output of the function “t.test” is
[0.1029150, 0.9989315], which is very similar, but not identical, to the confidence interval
that we computed. The discrepancy stems from the selection of the percentile. We used the
percentile of the Normal distribution, 1.96 = qnorm(0.975). On the other hand, the function
“t.test” uses the percentile of the t-distribution, 1.972425 = qt(0.975,191.561). Using this
value instead would give 0.550923 ± 1.972425 · 0.227135, which coincides with the interval
reported by “t.test”. For practical applications the difference between the two confidence
intervals is not important.
where X̄ is the sample average and S is the sample standard deviation. The
rejection region of this test is {|T | > qt(0.975,n-1)}, for “qt(0.975,n-1)”
the 0.975-percentile of the t-distribution on n − 1 degrees of freedom.
Alternatively, one may compute the p-value and reject the null hypothesis
if the p-value is less than 0.05. The p-value in this case is equal to P(|T | > |t|),
where t is the computed value of the test statistic. The distribution of T is the
t-distribution of n − 1 degrees of freedom.
A similar approach can be used in the situation where two sub-populations
are involved and one wants to test the null hypothesis that the expectations
are equal versus the alternative hypothesis that they are not. The null hypoth-
esis can be written in the form H0 : E(Xa) − E(Xb) = 0 with the alternative
hypothesis given as H1 : E(Xa) − E(Xb) ≠ 0.
It is natural to base the test statistic on the difference in sub-sample averages
X̄a − X̄b. The T statistic is the ratio between the deviation of the estimator from
the null value of the parameter, divided by the (estimated) standard deviation of
the estimator. In the current setting the estimator is the difference in sub-sample
averages X̄a − X̄b. The null value of the parameter, the difference between
the expectations, is 0. The (estimated) standard deviation of the estimator is
√(S²a/na + S²b/nb). It turns out that the test statistic in the current setting is:
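In line with the construction just described, the statistic is T = (X̄a − X̄b)/√(S²a/na + S²b/nb). A sketch of its computation for the current data (which gives approximately 2.4255) is:

> (5.805825 - 5.254902)/sqrt(2.020750/103 + 3.261114/102)   # approximately 2.4255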
which, after rounding, is equal to the value presented in the report that was
produced by the function “t.test”.
The p-value is computed as the probability of obtaining values of the test
statistic more extreme than the value that was obtained in our data. The
computation is carried out under the assumptions of the null hypothesis. The
limit distribution of the T statistic, when both sub-sample sizes na and nb are
large, is standard Normal. When the measurements are Normally distributed,
a refined approximation of the distribution of the statistic is the
t-distribution. Both the standard Normal and the t-distribution are symmetric
about the origin.
The function “t.test” computes the p-value using the t-distribution. For
the current data, the number of degrees of freedom that are used in this approxi-
mation5 is df = 191.561. When we apply the function “pt” for the computation
of the cumulative probability of the t-distribution we get:
> 2*(1-pt(2.4255,191.561))
[1] 0.01621458
which (after rounding) is equal to the reported p-value of 0.01621. This p-value
is less than 0.05, hence the null hypothesis is rejected in favor of the alternative
hypothesis that assumes an effect of the weight on the expectation.
5 The t-distribution is used as an approximation of the null distribution of the T test statistic. The number of degrees of freedom is
computed by the formula df = (va + vb)²/{v²a/(na − 1) + v²b/(nb − 1)}, where va = s²a/na
and vb = s²b/nb.
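A minimal sketch that recovers this number of degrees of freedom from the formula in the preceding footnote, using the sub-sample variances computed earlier (the result is approximately 191.56):

> va <- 2.020750/103
> vb <- 3.261114/102
> (va + vb)^2/(va^2/(103-1) + vb^2/(102-1))   # approximately 191.56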
which are computed from the observations in the first and the second sub-
sample, respectively.
Consider, first, the confidence interval for the ratio of the variances. In
Chapter 11 we discussed the construction of the confidence interval for the
variance in a single sample. The derivation was based on the sample variance
S 2 that serves as an estimator of the population variance Var(X). In particular,
the distribution of the random variable (n − 1)S 2 /Var(X) was identified as the
chi-square distribution on n − 1 degrees of freedom6 . Using this observation, a
confidence interval for the variance was obtained as a result of the identification
of a central region in the chi-square distribution that contains a prescribed
probability7.
In order to construct a confidence interval for the ratio of the variances we
consider the random variable that is obtained as a ratio of the estimators of the
variances:
{S²a/Var(Xa)} / {S²b/Var(Xb)} ∼ F(na−1, nb−1).
The distribution of this random variable is identified as the F -distribution8 .
This distribution is characterized by the number of degrees of freedom associ-
ated with the estimator of the variance at the numerator and by the number
of degrees of freedom associated with the estimator of the variance at the de-
nominator. The number of degrees of freedom associated with the estimation
of each variance is the number of observations used for the computation of the
estimator, minus 1. In the current setting the numbers of degrees of freedom
are na − 1 and nb − 1, respectively.
The percentiles of the F -distribution can be computed in R using the function
“qf”. For example, the 0.025-percentile of the distribution for the ratio between
sample variances of the response for two sub-samples is computed by the expres-
sion “qf(0.025,dfa,dfb)”, where dfa = na − 1 and dfb = nb − 1. Likewise,
the 0.975-percentile is computed by the expression “qf(0.975,dfa,dfb)”. Be-
tween these two numbers lie 95% of the given F -distribution. Consequently, the
probability that the random variable {Sa2 /Var(Xa )}/{Sb2 /Var(Xb )} obtains its
values between these two percentiles is equal to 0.95:
{S²a/Var(Xa)} / {S²b/Var(Xb)} ∼ F(na−1, nb−1)  ⟹

P( qf(0.025,dfa,dfb) ≤ {S²a/Var(Xa)} / {S²b/Var(Xb)} ≤ qf(0.975,dfa,dfb) ) = 0.95.
A confidence interval for the ratio between Var(Xa ) and Var(Xb ) is obtained
by reformulation of the last event. In the reformulation, the ratio of the variances
is placed in the center:
{ (S²a/S²b)/qf(0.975,dfa,dfb) ≤ Var(Xa)/Var(Xb) ≤ (S²a/S²b)/qf(0.025,dfa,dfb) }.
Next, consider testing hypotheses regarding the relation between the vari-
ances. Of particular interest is testing the equality of the variances. One may
formulate the null hypothesis as H0 : Var(Xa )/Var(Xb ) = 1 and test it against
the alternative hypothesis H1 : Var(Xa)/Var(Xb) ≠ 1.
The statistic F = S²a/S²b can be used in order to test the given null hypothesis.
Values of this statistic that are either much larger or much smaller than 1 are
evidence against the null hypothesis and in favor of the alternative hypothesis.
The sampling distribution of this statistic, under the null hypothesis, is the
F(na−1, nb−1) distribution. Consequently, the null hypothesis is rejected either if
F < qf(0.025,dfa,dfb) or if F > qf(0.975,dfa,dfb), where dfa = na − 1
and dfb = nb − 1. The significance level of this test is 5%.
Given an observed value of the statistic, the p-value is computed as the
significance level of the test which uses the observed value as the threshold. If
the observed value f is less than 1 then the p-value is twice the probability of
the lower tail: 2 · P(F < f ). On the other hand, if f is larger than 1 one takes
twice the upper tail as the p-value: 2 · P(F > f ) = 2 · [1 − P(F ≤ f )]. The null
hypothesis is rejected with a significance level of 5% if the p-value is less than
0.05.
In order to illustrate the inference that compares variances let us return to
the variable “dif.mpg” and compare the variances associated with the two levels
of the factor “heavy”. The analysis will include testing the hypothesis that the
two variances are equal and constructing a confidence interval for their ratio.
The function “var.test” may be used in order to carry out the required
tasks. The input to the function is a formula such as “dif.mpg ~ heavy”, with
a numeric variable on the left and a factor with two levels on the right. The
default application of the function to the formula produces the desired test and
confidence interval:
> var.test(dif.mpg~heavy)
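The report printed by the function should look approximately as follows (a reconstruction based on the values quoted in the discussion below):

	F test to compare two variances

data:  dif.mpg by heavy
F = 0.6197, num df = 102, denom df = 101, p-value = 0.01663
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.4189200 0.9162126
sample estimates:
ratio of variances
         0.6196502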
Consider the report produced by the function. Notice that the observed value
of the test statistic is “F = 0.6197”, and it is associated with the F -distribution
on “num df = 102” and “denom df = 101” degrees of freedom. The test statis-
tic can be used in order to test the null hypothesis H0 : Var(Xa )/Var(Xb ) = 1,
which states that the two variances are equal, against the alternative hypoth-
esis that states that they are not. The p-value for this test is “p-value =
0.01663”, which is less than 0.05. Consequently, the null hypothesis is rejected
and the conclusion is that the two variances are significantly different from each
other. The estimated ratio of variances, given at the bottom of the report, is
0.6196502. The confidence interval for the ratio is reported also and is equal to
[0.4189200, 0.9162126].
In order to relate the report to the preceding discussion regarding the test, let
us recall that the sub-sample variances are s²a = 2.020750 and s²b = 3.261114.
The sub-sample sizes are na = 103 and nb = 102, respectively. The observed
value of the statistic is the ratio s²a/s²b = 2.020750/3.261114 = 0.6196502, which
is the value that appears in the report. Notice that this is the estimate of the
ratio between the variances that is given at the bottom of the report.
The p-value of the two-sided test is equal to twice the probability of the tail
that is associated with the observed value of the test statistic as a threshold.
The number of degrees of freedom is dfa = na −1 = 102 and dfb = nb −1 = 101.
The observed value of the ratio test statistic is f = 0.6196502. This value is
less than one. Consequently, the probability P(F < 0.6196502) enters into the
computation of the p-value, which equals twice this probability:
> 2*pf(0.6196502,102,101)
[1] 0.01662612
Compare this value to the p-value that appears in the report and see that, after
rounding, the two are the same.
For the confidence interval of the ratio compute the percentiles of the F
distribution:
> qf(0.025,102,101)
[1] 0.676317
> qf(0.975,102,101)
[1] 1.479161
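Dividing the observed ratio by these percentiles produces the end points of the
reported confidence interval; a minimal sketch of the computation:

> 0.6196502/qf(0.975,102,101)   # lower end of the confidence interval
> 0.6196502/qf(0.025,102,101)   # upper end of the confidence interval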
Solution (to Question 13.2.1): The score of pain before the application of
the device is measured in the variable “score1”. This variable is used as the
response. We apply the function “t.test” in order to test the equality of the
expectation of the response in the two groups. First we read in the data from
the file into a data frame and then we apply the test:
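A minimal sketch of these steps; the file name “magnets.csv” and the name of the
grouping factor, “active”, are assumptions made here for illustration and may
differ in the actual exercise files:

> magnets <- read.csv("magnets.csv")        # hypothetical file name
> t.test(score1 ~ active, data=magnets)     # "active" is an assumed grouping factor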
Solution (to Question 13.2.2): Again, we use the variable “score1” as the
response. Now apply the function “var.test” in order to test the equality of
the variances of the response in the two groups:
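A sketch of the expression, under the same assumed names as above:

> var.test(score1 ~ active, data=magnets)   # "active" is an assumed grouping factor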
The computed p-value is 0.3687, which is once more above 0.05. Consequently,
we do not reject the null hypothesis that the variances in the two groups are
equal. This fact is reassuring that indeed, prior to the application of the device,
the two groups have the same characteristics.
Solution (to Question 13.2.3): The difference in score between the treatment
and the control groups is measured in the variable “change”. This variable is
used as the response for the current analysis. We apply the function “t.test”
in order to test the equality of the expectation of the response in the two groups:
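A sketch of the expression, under the same assumed names as above:

> t.test(change ~ active, data=magnets)     # "active" is an assumed grouping factor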
The computed p-value is 3.86 × 10−7 , which is much below 0.05. Consequently,
we reject the null hypothesis that the expectations in the two groups are equal.
The conclusion is that, according to this trial, magnets do have an effect on the
expectation of the response9 .
Solution (to Question 13.2.4): Once more we consider the variable “change”
as the response. We apply the function “var.test” in order to test the equality
of the variances of the response in the two groups:
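A sketch of the expression, under the same assumed names as above:

> var.test(change ~ active, data=magnets)   # "active" is an assumed grouping factor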
9 The effectiveness of magnets as a medical treatment is a subject of debate; more information can be found in the NIH NCCAM site.
ratio of variances
4.206171
The computed p-value is 0.001535, which is much below 0.05. Consequently, we
reject the null hypothesis that the variances in the two groups are equal. Hence,
magnets also affect the variance of the response.
The actual significance level of the test is 0.27596, which is much higher than
the nominal significance level of 5%.
Through this experiment we see that the F -test may not be robust to di-
vergence from the assumed Normal distribution of the measurement. If, as a
matter of fact, the measurement has a skewed distribution (the Exponential
distribution is an example of such a distribution) then the application of the test
to the data may produce unreliable conclusions.
Question 13.3. The sample average in one sub-sample is x̄a = 124.3 and the
sample standard deviation is sa = 13.4. The sample average in the second sub-
sample is x̄b = 80.5 and the sample standard deviation is sb = 16.7. The size of
the first sub-sample is na = 15 and this is also the size of the second sub-sample.
We are interested in the estimation of the ratio of variances Var(Xa )/Var(Xb ).
Solution (to Question 13.3.1): We input the data to R and then compute
the estimate:
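A minimal sketch of the computation, using the sample standard deviations given
in the question:

> s.a <- 13.4       # sample standard deviation of the first sub-sample
> s.b <- 16.7       # sample standard deviation of the second sub-sample
> s.a^2/s.b^2       # ratio of the sample variances
[1] 0.6438381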
The estimate is equal to the ratio of the sample variances s2a /s2b . It obtains
the value 0.6438381. Notice that the information regarding the sample averages
and the sizes of the sub-samples is not relevant for the point estimation of the
parameter.
Solution (to Question 13.3.3): The estimate of the parameter is not affected
by the change in the sample sizes and it is still equal to 0.6438381. For the
confidence interval we now use the formula:

[(s2a/s2b)/qf(0.975,149,149), (s2a/s2b)/qf(0.025,149,149)] .
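A sketch of the same computation in R, reusing the objects “s.a” and “s.b”
defined above:

> (s.a^2/s.b^2)/qf(0.975,149,149)   # lower end of the confidence interval
> (s.a^2/s.b^2)/qf(0.025,149,149)   # upper end of the confidence interval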
13.6 Summary
Glossary
Response: The variable whose distribution one seeks to investigate.
Explanatory Variable: A variable that may affect the distribution of the re-
sponse.
Some say that the quantity of data that is collected is most important. Others
say that the quality of the data is more important than the quantity. What is
your opinion?
When formulating your answer it may be useful to come up with an example
from your past experience where the quantity of data was not sufficient. Else,
you can describe a case where the quality of the data was less than satisfactory.
How did these deficiencies affect the validity of the conclusions of the analysis
of the data?
For illustration, consider surveys. Conducting a survey by telephone
may be a fast way to reach a large number of respondents. However, the quality
of the responses may be less than that of responses obtained by face-to-face interviews.
Formulas:

• Test Statistic for Equality of Expectations: t = (x̄a − x̄b)/√(s2a/na + s2b/nb).

• (Approx.) Confidence Interval: (x̄a − x̄b) ± qnorm(0.975) · √(s2a/na + s2b/nb).

• Test Statistic for Equality of Variances: f = s2a/s2b.

• Confidence Interval: [(s2a/s2b)/qf(0.975,dfa,dfb), (s2a/s2b)/qf(0.025,dfa,dfb)].
Chapter 14
Linear Regression
• Fit the linear regression to data using the function “lm” and conduct
statistical inference on the fitted model.
Observation x y
1 4.5 9.5
2 3.7 8.2
3 1.8 4.9
4 1.3 6.7
5 3.2 12.9
6 3.8 14.1
7 2.5 5.6
8 4.5 8.0
9 4.1 12.6
10 1.1 7.2
Let us display this data in a scatter plot. Towards that end, let us read the
length data into an object by the name “x” and the weight data into an object
by the name “y”. Finally, let us apply the function “plot” to the formula that
relates the response “y” to the explanatory variable “x”:
> x <- c(4.5,3.7,1.8,1.3,3.2,3.8,2.5,4.5,4.1,1.1)
> y <- c(9.5,8.2,4.9,6.7,12.9,14.1,5.6,8.0,12.6,7.2)
> plot(y~x)
The scatter plot that is produced by the last expression is presented in
Figure 14.1. A scatter plot is a graph that displays jointly the data of two
numerical variables. The variables (“x” and “y” in this case) are represented
by the x-axis and the y-axis, respectively. The x-axis is associated with the
explanatory variable and the y-axis is associated with the response.
Each observation is represented by a point. The x-value of the point cor-
responds to the value of the explanatory variable for the observation and the
y-value corresponds to the value of the response. For example, the first observa-
tion corresponds to the point (x = 4.5, y = 9.5). The two rightmost points have
an x value of 4.5. The higher of the two has a y value of 9.5 and is therefore
the point associated with the first observation. The lower of the two has a y value
of 8.0, and is thus associated with the 8th observation. Altogether there are 10
points in the plot, corresponding to the 10 observations in the data frame.
Let us consider another example of a scatter plot. The file “cars.csv”
contains data regarding characteristics of cars. Among the variables in this
data frame are the variables “horsepower” and “engine.size”.
Both variables are numeric.
The variable “engine.size” describes the volume, in cubic inches, that
is swept by all the pistons inside the cylinders. The variable “horsepower”
measures the power of the engine in units of horsepower. Let us examine the
relation between these two variables with a scatter plot:
> cars <- read.csv("cars.csv")
> plot(horsepower ~ engine.size, data=cars)
In the first line of code we read the data from the file into an R data frame that
is given the name “cars”. In the second line we produce the scatter plot with
“horsepower” as the response and “engine.size” as the explanatory variable.
Both variables are taken from the data frame “cars”. The plot that is produced
by the last expression is presented in Figure 14.2.
[Figure 14.2: Scatter plot of “horsepower” versus “engine.size” for the cars data.]
Examine the scatter plot in Figure 14.2. One may see that the values of
the response (horsepower) tend to increase with the increase in the values of
the explanatory variable (engine.size). Overall, the increase tends to follow
a linear trend, a straight line, although the data points are not located exactly
on a single line. The role of linear regression, which will be discussed in the
subsequent sections, is to describe and assess this linear trend.
[Figure 14.3: The fish data scatter plot with three lines: a=7, b=1 (green); a=14, b=−2 (blue); a=8.97, b=0 (red).]
A linear equation is an equation of the form

y = a + b · x ,
where y and x are variables and a and b are the coefficients of the equation.
The coefficient a is called the intercept and the coefficient b is called the slope.
A linear equation can be used in order to plot a line on a graph. With
each value on the x-axis one may associate a value on the y-axis: the value
that satisfies the linear equation. The collection of all such pairs of points, all
possible x values and their associated y values, produces a straight line in the
two-dimensional plane.
As an illustration consider the three lines in Figure 14.3. The green line is
produced via the equation y = 7 + x; the intercept of the line is 7 and the slope is
1. The blue line is the result of the equation y = 14 − 2x. For this line the intercept is
14 and the slope is −2. Finally, the red line is produced by the equation y = 8.97.
The intercept of the line is 8.97 and the slope is equal to 0.
The intercept describes the value of y when the line crosses the y-axis. Equiv-
alently, it is the result of the application of the linear equation for the value
x = 0. Observe in Figure 14.3 that the green line crosses the y-axis at the level
y = 7. Likewise, the blue line crosses the y-axis at the level y = 14. The red
line stays constantly at the level y = 8.97, and this is also the level at which it
crosses the y-axis.
The slope is the change in the value of y for each unit change in the value
of x. Consider the green line. When x = 0 the value of y is y = 7. When x
changes to x = 1 then the value of y changes to y = 8. A change of one unit in
x corresponds to an increase of one unit in y. Indeed, the slope for this line is
b = 1. As for the blue line, when x changes from 0 to 1 the value of y changes
from y = 14 to y = 12; a decrease of two units. This decrease is associated with
the slope b = −2. Lastly, for the constant red line there is no change in the
value of y when x changes its value from x = 0 to x = 1. Therefore, the slope
is b = 0. Notice that a positive slope is associated with an increasing line, a
negative slope is associated with a decreasing line and a zero slope is associated
with a constant line.
Lines can be considered in the context of scatter plots. Observe that Fig-
ure 14.3 contains the scatter plot of the data on the relation between the length
of fish and their weight. A regression line is the line that best describes the
linear trend of the relation between the explanatory variable and the response.
Neither of the lines in the figure is the regression line, although the green line is
a better description of the trend than the blue line. In principle, the regression
line is the best description of the linear trend.
The red line is a fixed line that is constructed at a level equal to the average
value1 of the variable y. This line partly reflects the information in the data.
The regression, which we fit in the next section, reflects more of the information
by including a description of the trend in the data.
Lastly, let us see how one can add lines to a plot in R. Functions to produce
plots in R can be divided into two categories: high level and low level plotting
functions. High level functions produce an entire plot, including the axes and the
labels of the plot. The plotting functions that we encountered in the past such
as “plot”, “hist”, “boxplot” and the like are all high level plotting functions.
Low level functions, on the other hand, add features to an existing plot.
An example of a low level plotting function is the function “abline”. This
function adds a straight line to an existing plot. The first argument to the
function is the intercept of the line and the second argument is the slope of the
line. Other arguments may be used in order to specify the characteristics of
the line. For example, the argument “col=color.name” may be used in order
to change the color of the line from its default black color. A plot that is very
similar to the plot in Figure 14.3 may be produced with the following code2:
> plot(y~x)
1 Run the expression “mean(y)” to obtain ȳ = 8.97 as the value of the sample average.
2 The actual plot in Figure 14.3 is produced by a slightly modified code. First an empty
plot is produced with the expression “plot(c(0,5),c(5,15),type="n",xlab="x",ylab="y")”
and then the points are added with the expression “points(y~x)”. The lines are added as in
the text. Finally, a legend is added with the function “legend”.
> abline(7,1,col="green")
> abline(14,-2,col="blue")
> abline(mean(y),0,col="red")
Initially, the scatter plot is created and the lines are added to the plot one after
the other. Observe that the color of the first line that is added is green; it has an
intercept of 7 and a slope of 1. The second line is blue, with an intercept of 14
and a negative slope of −2. The last line is red, and its constant value is the
average of the variable y.
In the next section we discuss the computation of the regression line, the line
that describes the linear trend in the data. This line will be added to scatter
plots with the aid of the function “abline”.
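The report below is produced by fitting the linear model and printing the fitted
object; a minimal sketch of the expressions, with the fitted model stored in the
object “fit”, the name used for it later in the text:

> fit <- lm(y~x)
> fit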
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
4.616 1.427
[Figure 14.4: The fish data scatter plot with the fitted regression line.]
When displayed, the output of the function “lm” shows the formula that was
used by the function and provides the coefficients of the regression linear equa-
tion. Observe that the intercept of the line is equal to 4.616. The slope of the
line, the coefficient that multiplies “x” in the linear equation, is equal to 1.427.
One may add the regression line to the scatter plot with the aid of the
function “abline”:
> plot(y~x)
> abline(fit)
The first expression produces the scatter plot of the data on fish. The second
expression adds the regression line to the scatter plot. When the input to
the graphical function “abline” is the output of the function “lm” that fits
the regression line, then the result is the addition of the regression line to the
existing plot. The line that is added is the line characterized by the coefficients
that are computed by the function “lm”. The coefficients in the current setting
are 4.616 for the intercept and 1.427 for the slope.
The scatter plot and the added regression line are displayed in Figure 14.4.
Observe that the line passes through the points, balancing between the points that
are above the line and the points that are below. The line captures the linear
trend in the data.
Examine the line in Figure 14.4. When x = 1 then the y value of the line
is slightly above 6. When the value of x is equal to 2, a change of one unit,
then the value of y is below 8, and is approximately equal to 7.5. This observation
is consistent with the fact that the slope of the line is 1.427. The value of x is
decreased by 1 when changing from x = 1 to x = 0. Consequently, the value of
y when x = 0 should decrease by 1.427 in comparison to its value when x = 1.
The value at x = 1 is approximately 6. Therefore, the value at x = 0 should be
approximately 4.6. Indeed, we do get that the intercept is equal to 4.616.
The coefficients of the regression line are computed from the data and are
hence statistics. Specifically, the slope of the regression line is computed as
the covariance between the response and the explanatory variable divided by
the variance of the explanatory variable. The intercept of the re-
gression line is computed using the sample averages of both variables and the
computed slope.
Start with the slope. The main ingredient in the formula for the slope, the
numerator in the ratio, is the covariance between the two variables. The covari-
ance measures the joint variability of two variables. Recall that the formula for
the sample variance of the variable x is equal to:

s2 = Sum of the squares of the deviations / (Number of values in the sample − 1) = Σi (xi − x̄)2 / (n − 1) .
The sample covariance between x and y, on the other hand, replaces the square
of the deviations by the product of deviations. The product is between a y
deviation and the parallel x deviation:

covariance = Sum of products of the deviations / (Number of values in the sample − 1) = Σi (yi − ȳ)(xi − x̄) / (n − 1) .
The function “cov” computes the sample covariance between two numeric
variables. The two variables enter as arguments to the function and the sample
covariance is the output. Let us demonstrate the computation by first applying
the given function to the data on fish and then repeating the computations
without the aid of the function:
> cov(y,x)
[1] 2.386111
> sum((y-mean(y))*(x-mean(x)))/9
[1] 2.386111
In both cases we obtained the same result. Notice that the sum of products of
deviations in the second expression was divided by 9, which is the number of
observations minus 1.
The slope of the regression line is the ratio between the covariance and the
variance of the explanatory variable. The regression line itself passes through the
point (x̄, ȳ), a point that is determined by the means of both the explanatory
variable and the response. It follows that the intercept should obey the equation:
ȳ = a + b · x̄ =⇒ a = ȳ − b · x̄ ,
The left-hand-side equation corresponds to the statement that the value of the
regression line at the average x̄ is equal to the average of the response ȳ. The
right-hand-side equation is the solution to the left-hand-side equation.
One may compute the coefficients of the regression model manually by com-
puting first the slope as the ratio between the covariance and the variance of the
explanatory variable. The intercept can then be obtained from the equation that
uses the computed slope and the averages of both variables:
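A minimal sketch of the manual computation; it reproduces, up to rounding, the
coefficients 4.616 and 1.427 reported above:

> b <- cov(y,x)/var(x)       # slope: covariance divided by the variance of x
> a <- mean(y) - b*mean(x)   # intercept: solves ybar = a + b*xbar
> c(a,b)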
Applying the manual method we obtain, after rounding up, the same coefficients
that were produced by the application of the function “lm” to the data.
As an exercise, let us fit the regression model to the data on the relation be-
tween the response “horsepower” and the explanatory variable “engine.size”.
Apply the function “lm” to the data and present the results:
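A sketch of the expressions that produce the report below, with the fitted model
stored in the object “fit.power”, the name used for it later in the chapter:

> fit.power <- lm(horsepower ~ engine.size, data=cars)
> fit.power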
Call:
lm(formula = horsepower ~ engine.size, data = cars)
Coefficients:
(Intercept) engine.size
6.6414 0.7695
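A minimal sketch of plotting expressions of the kind referred to below (the scatter
plot with the fitted regression line added):

> plot(horsepower ~ engine.size, data=cars)
> abline(fit.power)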
The output of the plotting functions is presented in Figure 14.5. Observe once
more that the regression line describes the general linear trend in the data.
Overall, with the increase in engine size one obtains an increase in the power
produced by the engine.
14.3.2 Inference
Up to this point we have been considering the regression model in the context
of descriptive statistics. The aim in fitting the regression line to the data was to
characterize the linear trend observed in the data. Our next goal is to deal with
statistical inference for the coefficients of the regression model.
[Figure 14.5: Scatter plot of “horsepower” versus “engine.size” with the fitted regression line.]
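The detailed report below may be produced by applying the function “summary”
to the fitted model; a minimal sketch:

> summary(fit.power)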
Call:
lm(formula = horsepower ~ engine.size, data = cars)
Residuals:
Min 1Q Median 3Q Max
-59.643 -12.282 -5.515 10.251 125.153
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.64138 5.23318 1.269 0.206
engine.size 0.76949 0.03919 19.637 <2e-16 ***
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1
The output produced by the application of the function “summary” to the out-
put of the function “lm” is long and detailed. We will discuss this output in
the next section. Here we concentrate on the table that goes under the title
“Coefficients:”. The said table is made of 2 rows and 4 columns. It contains
information for testing, for each of the coefficients, the null hypothesis that the
value of the given coefficient is equal to zero. In particular, the second row may
be used in order to test this hypothesis for the slope of the regression line, the
coefficient that multiplies the explanatory variable.
Consider the second row. The first value on this row is 0.76949, which is
equal (after rounding up) to the slope of the line that was fitted to the data
in the previous subsection. However, in the context of statistical inference this
value is the estimate of the slope of the population regression coefficient, the
realization of the estimator of the slope4 .
The second value is 0.03919. This is an estimate of the standard deviation
of the estimator of the slope. The third value is the test statistic. This statis-
tic is the ratio between the deviation of the sample estimate of the parameter
(0.76949) from the value of the parameter under the null hypothesis (0), di-
vided by the estimated standard deviation (0.03919): (0.76949 − 0)/0.03919 =
0.76949/0.03919 = 19.63486, which is essentially the value given in the report5 .
The last value is the computed p-value for the test. It can be shown that
the sampling distribution of the given test statistic, under the null distribution
which assumes no slope, is asymptotically the standard Normal distribution.
If the distribution of the response itself is Normal then the distribution of the
statistic is the t-distribution on n−2 degrees of freedom. In the current situation
this corresponds to 201 degrees of freedom6 . The computed p-value is extremely
small, practically eliminating the possibility that the slope is equal to zero.
The first row presents information regarding the intercept. The estimated
intercept is 6.64138 with an estimated standard deviation of 5.23318. The value
of the test statistic is 1.269 and the p-value for testing the null hypothesis that
the intercept is equal to zero against the two sided alternative is 0.206. In this
case the null hypothesis is not rejected since the p-value is larger than 0.05.
The report contains an inference for the intercept. However, one is advised to
take this inference in the current case with a grain of salt. Indeed, the intercept
is the expected value of the response when the explanatory variable is equal to
zero. Here the explanatory variable is the size of the engine and the response
is the power of that engine. The power of an engine of size zero is a quantity
that has no physical meaning! In general, unless the value 0 is in the range
of the observed explanatory variable, one is advised to treat the inference on
the intercept carefully. Such inference requires extrapolation and is sensitive
to misspecification of the regression model.
Apart from testing hypotheses one may also construct confidence intervals
for the parameters. A crude confidence interval may be obtained by taking
4 The estimator of the slope is obtained via the application of the formula for the compu-
tation of the slope to the sample: [Σi (Yi − Ȳ)(Xi − X̄)/(n − 1)] / [Σi (Xi − X̄)2/(n − 1)].
5 Our computation involves rounding up errors, hence the small discrepancy between the
value that we computed and the value that is given in the report.
6 Two of the 205 observations have missing values and these obser-
vations are deleted for the analysis, leaving a total of n = 203 observations. The number of
degrees of freedom is n − 2 = 203 − 2 = 201.
1.96 standard deviations on each side of the estimate of the parameter. Hence,
a confidence interval for the slope is approximately equal to 0.76949 ± 1.96 ×
0.03919 = [0.6926776, 0.8463024]. In a similar way one may obtain a confidence
interval for the intercept7: 6.64138 ± 1.96 × 5.23318 = [−3.615653, 16.89841].
Alternatively, one may compute confidence intervals for the parameters of
the linear regression model using the function “confint”. The input to this
function is the fitted model and the output is a confidence interval for each of
the parameters:
> confint(fit.power)
2.5 % 97.5 %
(Intercept) -3.6775989 16.9603564
engine.size 0.6922181 0.8467537
Observe the similarity between the confidence intervals that are computed by
the function and the crude confidence intervals that were produced by us. The
small discrepancies that do exist between the intervals result from the fact that
the function “confint” uses the t-distribution whereas we used the Normal
approximation.
14.4 R-Squared and the Variance of Residuals

Let us apply the function “summary” to the regression model that was fitted
to the fish data:

> summary(fit)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-3.0397 -2.1388 -0.6559 1.8518 4.0595
7 The warning message that was made in the context of testing hypotheses on the intercept
should be applied also to the construction of confidence intervals. If the value 0 is not in the
range of the explanatory variable then one should be careful when interpreting a confidence
interval for the intercept.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.6165 2.3653 1.952 0.0868 .
x 1.4274 0.7195 1.984 0.0826 .
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1
The given report contains a table with estimates of the regression coefficients
and information for conducting hypothesis testing. The report also contains
other information. Our goal is to understand this other information.
This other information is associated mainly with the notion of the residuals
from regression. The residual from regression for each observation is the differ-
ence between the value of the response for the observation and the estimated
expectation of the response under the regression model8 . An observation is a
pair (xi , yi ), with yi being the value of the response. The expectation of the
response according to the regression model is a + b · xi , where a and b are the co-
efficients of the model. The estimated expectation is obtained by the use of the
coefficients that are estimated from the data in the formula for the expectation.
The residual is the difference between yi and a + b · xi .
Consider an example. The first observation on the fish is (4.5, 9.5), where
x1 = 4.5 and y1 = 9.5. The estimated intercept is 4.6165 and the estimated
slope is 1.4274. The estimated expectation of the response for the first observation
is equal to

a + b · x1 = 4.6165 + 1.4274 · 4.5 = 11.0398 .

The residual is the difference between the observed response and this value:

y1 − (a + b · x1) = 9.5 − 11.0398 = −1.5398 .
The residuals for the other observations are computed in the same manner.
The values of the intercept and the slope are kept the same but the values of
the explanatory variable and the response are changed.
Consult the upper plot in Figure 14.6. This is a scatter plot of the data,
together with the regression line in black and the line of the average in red.
Notice that a vertical arrow extends from each data point to the regression
line. The point where each arrow hits the regression line is associated with the
estimated value of the expectation for that point. The residual is the difference
between the value of the response at the origin of the arrow and the value of
the response at the tip of its head. Notice that there are as many residuals as
there are observations.
The function “residuals” computes the residuals. The input to the function
is the fitted regression model and the output is the sequence of residuals. The
function is applied to the object “fit” that contains the fitted regression model
for the fish data:
8 The estimated expectation of the response is also called the predicted response.
[Figure 14.6: The fish data with the regression line and the line of the average. The upper panel (“Residuals”) shows vertical arrows from the data points to the regression line; the lower panel shows vertical arrows from the data points to the line of the average.]
> residuals(fit)
1 2 3 4 5
-1.5397075 -1.6977999 -2.2857694 0.2279229 3.7158923
6 7 8 9 10
4.0594616 -2.5849385 -3.0397075 2.1312463 1.0133998
Notice that 10 residuals were produced, one for each observation. In particular,
the residual for the first observation is -1.5397075, which is essentially the value
that we obtained9 .
Return to the report produced by the application of the function “summary”
to the fitted regression model. The first element in the report is the formula
that identifies the response and the explanatory variable. The second element,
that comes under the title “Residuals:”, gives a summary of the distribution
of the residuals. This summary contains the smallest and the largest values in
the sequence of residuals, as well as the first and third quartiles and the median.
9 The discrepancy between the value that we computed and the value computed by the
function results from rounding up errors. We used the values of the coefficients that appear
in the report. These values are rounded up. The function “residuals” uses the coefficients
without rounding.
The average is not reported since the average of the residuals from the regression
line is always equal to 0.
The table that contains information on the coefficients was discussed in the
previous section. Let us consider the last 3 lines of the report.
The first of the three lines contains the estimated value of the standard
deviation of the response from the regression model. If the expectations of
the measurements of the response are located on the regression line then the
variability of the response corresponds to the variability about this line. The
resulting variance is estimated by the sum of squares of the residuals from the
regression line, divided by the number of observations minus 2. A division
by the number of observations minus 2 produces an unbiased estimator of the
variance of the response about the regression model. Taking the square root of
the estimated variance produces an estimate of the standard deviation:
> sqrt(sum(residuals(fit)^2)/8)
[1] 2.790787
Notice that the value that we get for the estimated standard deviation is 2.790787,
which coincides with the value that appears in the first of the last 3 lines of the
report.
The second of the last 3 lines reports the R-squared of the linear fit. In
order to explain the meaning of R-squared let us consider Figure 14.6 once
again. The two plots in the figure present the scatter plot of the data together
with the regression line and the line of the average. Vertical black arrows that
represent the residuals from the regression are added to the upper plot. The
lower plot contains vertical red arrows that extend from the data points to the
line of the average. These arrows represent the deviations of the response from
the average.
Consider two forms of variation. One form is the variation of the response
from its average value. This variation is summarized by the sample variance, the
sum of the squared lengths of the red arrows divided by the number of observa-
tions minus 1. The other form of variation is the variation of the response from
the fitted regression line. This variation is summarized by the sample variance
of the residuals, the sum of squared lengths of the black arrows divided by the
number of observations minus 1. The ratio between these two quantities gives
the relative variability of the response that remains after fitting the regression
line to the data.
The line of the average is a straight line. The deviations of the observations
from this straight line can be thought of as residuals from that line. The vari-
ability of these residuals, the sum of squares of the deviations from the average
divided by the number of observations minus 1, is equal to the sample variance.
The regression line is the unique straight line that minimizes the variability
of its residuals. Consequently, the variability of the residuals from the regres-
sion, the sum of squares of the residuals from the regression divided by the
number of observations minus 1, is the smallest produced by any straight line.
It follows that the sample variance of the regression residuals is less than the
sample variance of the response. Therefore, the ratio between the variance of
the residuals and the variance of the response is less than 1.
R-squared is the difference between 1 and the ratio of the variances. Its
value is between 0 and 1 and it represents the fraction of the variability of the
response that is explained by the regression line. The closer the points are to
the regression line the larger the value of R-squared becomes. On the other
hand, the less there is a linear trend in the data the closer to 0 is the value of
R-squared. In the extreme case of R-squared being equal to 1 all the data points
are positioned exactly on a single straight line. In the other extreme, a value of
0 for R-squared implies no linear trend in the data.
Let us compute manually the difference between 1 and the ratio between the
variance of the residuals and the variance of the response:
> 1-var(residuals(fit))/var(y)
[1] 0.3297413
Observe that the computed value of R-squared is the same as the value “Multiple
R-squared: 0.3297” that is given in the report.
The report provides another value of R-squared, titled Adjusted R-squared.
The difference between the adjusted and un-adjusted quantities is that in the
former the sample variance of the residuals from the regression is replaced by an
unbiased estimate of the variability of the response about the regression line. The sum
of squares in the unbiased estimator is divided by the number of observations
minus 2. Indeed, when we re-compute the ratio using the unbiased estimate,
the sum of squared residuals divided by 10 − 2 = 8, we get:
> 1-(sum(residuals(fit)^2)/8)/var(y)
[1] 0.245959
The value of this adjusted quantity is equal to the value “Adjusted R-squared:
0.246” in the report.
Which value of R-squared to use is a matter of personal taste. In any case, for
a larger number of observations the difference between the two values becomes
negligible.
The last line in the report produces an overall goodness of fit test for the
regression model. In the current application of linear regression this test reduces
to a test of the slope being equal to zero, the same test that is reported in the
second row of the table of coefficients10 . The F statistic is simply the square of
the t value that is given in the second row of the table. The sampling distribution
of this statistic under the null hypothesis is the F -distribution on 1 and n − 2
degrees of freedom, which is the sampling distribution of the square of the test
statistic for the slope. The computed p-value, “p-value: 0.08255”, is
identical (after rounding up) to the p-value given in the second line of the table.
Return to the R-squared coefficient. This coefficient is a convenient measure
of the goodness of fit of the regression model to the data. Let us demon-
strate this point with the aid of the “cars” data. In Subsection 14.3.2 we
fitted a regression model to the power of the engine as a response and the size
of the engine as an explanatory variable. The fitted model was saved in the
object called “fit.power”. A report of this fit, the output of the expression
“summary(fit.power)” was also presented. Recall that the null hypothesis of
no slope was clearly rejected. The value of R-squared for this fit is 0.6574. Con-
sequently, about 2/3 of the variability in the power of the engine is explained
by the size of the engine.
10 In more complex applications of linear regression, applications that are not considered in
this book, the test in the last line of the report and the tests of coefficients do not coincide.
Consider trying to fit a different regression model for the power of the engine
as a response. The variable “length” describes the length of the car (in inches).
How well would the length explain the power of the car? We may examine this
question using linear regression:
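In line with the remark below that a single expression both fits the model and
summarizes the fit, a minimal sketch of such an expression:

> summary(lm(horsepower ~ length, data=cars))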
Call:
lm(formula = horsepower ~ length, data = cars)
Residuals:
Min 1Q Median 3Q Max
-53.57 -20.35 -6.69 14.45 180.72
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -205.3971 32.8185 -6.259 2.30e-09 ***
length 1.7796 0.1881 9.459 < 2e-16 ***
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1
We used one expression to both fit the regression model to the data and to
summarize the outcome of the fit.
A scatter plot of the two variables together with the regression line is pre-
sented in Figure 14.7. This plot may be produced using the code:
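A minimal sketch of code that produces a plot of this type:

> plot(horsepower ~ length, data=cars)
> abline(lm(horsepower ~ length, data=cars))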
From the examination of the figure we may see that indeed there is a linear
trend in the relation between the length and the power of the car. Longer cars
tend to have more power. Testing the null hypothesis that the slope is equal
to zero produces a very small p-value and leads to the rejection of the null
hypothesis.
The length of the car and the size of the engine are both statistically signif-
icant in their relation to the response. However, which of the two explanatory
variables produces a better fit?
An answer to this question may be provided by the examination of the values
of R-squared, the fraction of the variance of the response that is explained by each
of the explanatory variables. Recall that the R-squared for the size of the engine
as an explanatory variable is 0.6574, about 2/3. On the other hand, the value
of R-squared for the length of the car as an explanatory variable is 0.308, less
than 1/3. It follows that the size of the engine explains twice as much of the
variability of the power of the engine as does the length of the car.
[Figure 14.7: Scatter plot of “horsepower” versus “length” with the fitted regression line.]
1. y = 4
2. y = 5 − 2x
3. y = x
You are asked to match the marked line to the appropriate linear equation and
match the marked point to the appropriate observation:
1. The marked line corresponds to which of the linear equations? (Identify the
equation number.)

2. The point marked with a red triangle represents which of the observations?
(Identify the observation number.)
Observation x y
1 2.3 -3.0
2 -1.9 9.8
3 1.6 4.3
4 -1.6 8.2
5 0.8 5.9
6 -1.0 4.3
7 -0.2 2.0
8 2.4 -4.7
9 1.8 1.8
10 1.4 -1.1
Solution (to Question 14.1.1): The line marked in red is increasing and at
x = 0 it seems to obtain the value y = 0. An increasing line is associated with
a linear equation with a positive slope coefficient (b > 0). The only equation
with that property is Equation 3, for which b = 1. Notice that the intercept of
this equation is a = 0, which agrees with the fact that the line passes through
the origin (x, y) = (0, 0). If the x-axis and the y-axis were on the same scale
then one would expect the line to be tilted at a 45 degree angle. However,
here the axes are not on the same scale, so the tilt is different.
Solution (to Question 14.1.2): The x-value of the line marked with a red
triangle is x = −1. The y-value is below 5. The observation that has an x-value
of -1 is Observation 6. The y-value of this observation is y = 4.3. Notice that
there is another observation with the same y-value, Observation 3. However,
the x-value of that observation is x = 1.6. Hence it is the point that is on the
same level as the marked point, but it is placed to the right of it.
Question 14.2. Assume a regression model that describes the relation between
the expectation of the response and the value of the explanatory variable in the
form:
E(Yi ) = 2.13 · xi − 3.60 .
1. What is the value of the intercept and what is the value of the slope in
the linear equation that describes the model?
2. Assume that x1 = 5.5, x2 = 12.13, x3 = 4.2, and x4 = 6.7. What is the
expected value of the response of the 3rd observation?
Solution (to Question 14.2.1): The intercept is equal to a = −3.60 and
the slope is equal to b = 2.13. Notice that the slope is the coefficient that
multiplies the explanatory variable and the intercept is the coefficient that does
not multiply the explanatory variable.
Solution (to Question 14.2.2): The value of the explanatory variable for the
3rd observation is x3 = 4.2. When we use this value in the regression formula
we obtain that:

E(Y3) = 2.13 · 4.2 − 3.60 = 5.346 .
In words, the expectation of the response of the 3rd observation is equal to 5.346.
Question 14.3. The file “aids.csv” contains data on the number of diag-
nosed cases of Aids and the number of deaths associated with Aids among
adults and adolescents in the United States between 1981 and 200211 . The file
can be found on the internet at http://pluto.huji.ac.il/~msby/StatThink/
Datasets/aids.csv.
The file contains 3 variables: The variable “year” that tells the relevant
year, the variable “diagnosed” that reports the number of Aids cases that were
diagnosed in each year, and the variable “deaths” that reports the number of
Aids related deaths in each year. The following questions refer to the data in
the file:
1. Consider the variable “deaths” as the response and the variable “diagnosed”
11 Taken from Table 1 in section “Practice in Linear Regression” of the
online Textbook “Collaborative Statistics” (Connexions. March 22, 2010.
http://cnx.org/content/col10522/1.38/) by Barbara Illowsky and Susan Dean.
as the explanatory variable. What is the slope of the regression line? Pro-
duce a point estimate and a confidence interval. Is it statistically signifi-
cant (significantly different than 0)?
2. Plot the scatter plot that is produced by these two variables and add
the regression line to the plot. Does the regression line provide a good
description of the trend in the data?
3. Consider the variable “diagnosed” as the response and the variable “year”
as the explanatory variable. What is the slope of the regression line? Pro-
duce a point estimate and a confidence interval. Is it statistically signifi-
cant (significantly different than 0)?
4. Plot the scatter plot that is produced by the later pair of variables and
add the regression line to the plot. Does the regression line provide a
good description of the trend in the data?
Solution (to Question 14.3.1): After saving the file “aids.csv” in the work-
ing directory of R we read its content into a data frame by the name “aids”.
We then produce a summary of the fit of the linear regression model of “deaths”
as a response and “diagnosed” as the explanatory variable:
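A minimal sketch of these steps, with the fitted model stored in the object
“fit.deaths”, the name used for it below:

> aids <- read.csv("aids.csv")
> fit.deaths <- lm(deaths ~ diagnosed, data=aids)
> summary(fit.deaths)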
Call:
lm(formula = deaths ~ diagnosed, data = aids)
Residuals:
Min 1Q Median 3Q Max
-7988.73 -680.86 23.94 1731.32 7760.67
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 88.7161 1370.7191 0.065 0.949
diagnosed 0.6073 0.0312 19.468 1.81e-14 ***
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1
The estimated value of the slope is 0.6073. The computed p-value associated
with this slope is 1.81 × 10−14 , which is much smaller than the 5% threshold.
Consequently, the slope is statistically significant. In order to compute the
confidence interval for the slope we apply the function “confint” to the fitted
model:
> confint(fit.deaths)
2.5 % 97.5 %
[Figure 14.9: Scatter plot of “deaths” versus “diagnosed” with the fitted regression line.]
Solution (to Question 14.3.2): A scatter plot of the two variables is produced
by the application of the function “plot” to the formula that involves these two
variables. The regression line is added to the plot using the function “abline”
with the fitted model as an input:
> plot(deaths~diagnosed,data=aids)
> abline(fit.deaths)
The plot that is produced is given in Figure 14.9. Observe that the points are
placed close to a straight line and are characterized well by the linear trend of the
regression.
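Solution (to Question 14.3.3): We fit the linear regression model with
“diagnosed” as the response and “year” as the explanatory variable and summarize
the fit. A minimal sketch of the expressions, with the fitted model stored in the
object “fit.diagnosed”, the name used for it below:

> fit.diagnosed <- lm(diagnosed ~ year, data=aids)
> summary(fit.diagnosed)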
Call:
lm(formula = diagnosed ~ year, data = aids)
Residuals:
Min 1Q Median 3Q Max
-28364 -18321 -3909 14964 41199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3448225.0 1535037.3 -2.246 0.0361 *
year 1749.8 770.8 2.270 0.0344 *
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1
The estimated value of the slope is 1749.8. The computed p-value associated with
this slope is 0.0344, which is less than 0.05. Consequently, we declare the
slope to be statistically significant. Confidence intervals are produced using the
function “confint”:
> confint(fit.diagnosed)
2.5 % 97.5 %
(Intercept) -6650256.6654 -246193.429
year 141.9360 3357.618
We get that the 95% confidence interval for the slope is [141.9360, 3357.618].
Solution (to Question 14.3.4): A scatter plot of the two variables is produced
by the application of the function “plot” and the regression line is added with
function “abline”:
> plot(diagnosed~year,data=aids)
> abline(fit.diagnosed)
The plot is given in Figure 14.10. Observe that the points do not follow a
linear trend. It seems as if in the first years after Aids was discovered the
number of diagnosed cases increased at an exponential rate. The trend changed
in the mid 90’s with a big drop in the number of diagnosed Aids patients. This
drop may be associated with the administration of therapies such as AZT to
HIV infected subjects that reduced the number of such subjects that developed
Aids. In the late 90’s there seems to be yet again a change in the trend and a
possible increase in the numbers. The line of linear regression misses this history
altogether.
[Figure 14.10: Scatter plot of “diagnosed” versus “year” with the fitted regression line.]
The take home message from this exercise is not to use models blindly. Good
advice is to plot the data. A simple examination of the plot would have warned
us that the linear model is probably not a good model for the given problem.
Question 14.4. Below are the percents of the U.S. labor force (excluding self-
employed and unemployed) that are members of a labor union12 . We use this
data in order to practice the computation of the regression coefficients.
1. Produce the scatter plot of the data and add the regression line. Is the
regression model reasonable for this data?
2. Compute the sample averages and the sample standard deviations of both
variables. Compute the covariance between the two variables.
3. Using the summaries you have just computed, compute the coefficients of
the regression model.
12 Taken from the Homework section in the chapter on linear regression of the online textbook “Collaborative Statistics” (Connexions) by Barbara Illowsky and Susan Dean.
year percent
1945 35.5
1950 31.5
1960 31.4
1970 27.3
1980 21.9
1986 17.5
1993 15.8
Solution (to Question 14.4.1): We read the data in the table into R. The
variable “year” is the explanatory variable and the variable “percent” is the
response. The scatter plot is produced using the function “plot” and the re-
gression line, fitted to the data with the function “lm”, is added to the plot
using the function “abline”:
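A minimal sketch of these steps, entering the data from the table and producing
the plot:

> year <- c(1945,1950,1960,1970,1980,1986,1993)
> percent <- c(35.5,31.5,31.4,27.3,21.9,17.5,15.8)
> plot(percent ~ year)
> abline(lm(percent ~ year))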
The scatter plot and regression line are presented in Figure 14.11. Observe that
a linear trend is a reasonable description of the reduction in the percentage of
workers that belong to labor unions in the post World War II period.
Solution (to Question 14.4.2): The average of the variable “year” is 1969.143
and the standard deviation is 18.27957. The average of the variable “percent”
is 25.84286 and the standard deviation is 7.574927. The covariance between the
two variables is −135.6738.
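A sketch of expressions that produce these summaries:

> mean(year); sd(year)
> mean(percent); sd(percent)
> cov(percent,year)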
Solution (to Question 14.4.3): The slope of the regression line is the ratio
between the covariance and the variance of the explanatory variable. The inter-
cept is the solution of the equation that states that the value of the regression line
at the average of the explanatory variable is the average of the response:
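A minimal sketch of the computation; it reproduces the values reported below:

> b <- cov(percent,year)/var(year)    # slope
> a <- mean(percent) - b*mean(year)   # intercept
> c(a,b)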
We get that the intercept is equal to 825.3845 and the slope is equal to −0.4060353.
In order to validate these figures, let us apply the function “lm” to the data:
> lm(percent~year)
Call:
lm(formula = percent ~ year)
Coefficients:
(Intercept) year
825.384 -0.406
Solution (to Question 14.5.2): The residual from the regression line is the
difference between the observed value of the response and the estimated expec-
tation of the response. For the 4th observation we have that the observed value
of the response is y4 = 0.17. The estimated expectation was computed in the
previous question. Therefore, the residual from the regression line for the 4th
observation is:
y4 − (a + b · x4 ) = 0.17 − (−5.071) = 5.241 .
[Figure 14.12: Scatter plots of “dif.mpg” versus “cars$curb.weight” (left) and versus “cars$engine.size” (right), each with its fitted regression line.]
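Solution (to Question 14.6.1): We fit the regression model with “dif.mpg”
as the response and the curb weight as the explanatory variable and summarize
the fit. A minimal sketch of the expression that produces the report below:

> summary(lm(dif.mpg ~ cars$curb.weight))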
Call:
lm(formula = dif.mpg ~ cars$curb.weight)
Residuals:
Min 1Q Median 3Q Max
-5.4344 -0.7755 0.1633 0.8844 6.3035
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.1653491 0.5460856 14.953 < 2e-16 ***
cars$curb.weight -0.0010306 0.0002094 -4.921 1.77e-06 ***
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1
The p-value associated with the slope, 1.77 × 10−6 , is much smaller than the
5% threshold, indicating a significant (negative) trend. The value of R-squared,
the fraction of the variability of the response that is explained by a regression
model, is 0.1066.
The standard deviation is the square root of the variance. It follows that
the fraction of the standard deviation of the response that is explained by the
regression is √0.1066 = 0.3265.
Following our own advice, we plot the data and the regression model:
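A minimal sketch of the plotting expressions:

> plot(dif.mpg ~ cars$curb.weight)
> abline(lm(dif.mpg ~ cars$curb.weight))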
The resulting plot is presented on the left-hand side of Figure 14.12. One may
observe that although there seems to be an overall downward trend, there is
still a lot of variability about the line of regression.
Solution (to Question 14.6.2): We now fit and summarize the regression
model with the size of the engine as the explanatory variable:
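A sketch of the expression that produces the report below:

> summary(lm(dif.mpg ~ cars$engine.size))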
Call:
lm(formula = dif.mpg ~ cars$engine.size)
Residuals:
Min 1Q Median 3Q Max
-5.7083 -0.7083 0.1889 1.1235 6.1792
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.717304 0.359385 18.691 < 2e-16 ***
cars$engine.size -0.009342 0.002691 -3.471 0.000633 ***
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1
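A minimal sketch of the expressions that produce the plot referred to below:

> plot(dif.mpg ~ cars$engine.size)
> abline(lm(dif.mpg ~ cars$engine.size))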
The plot is given on the right-hand side of Figure 14.12. Again, there is vari-
ability about the line of regression.
Solution (to Question 14.6.3): Of the two models, the model that uses the
curb weight as the explanatory variable explains a larger portion of the variability
in the response. Hence, unless other criteria tell us otherwise, we will prefer
this model over the model that uses the size of the engine as an explanatory
variable.
14.6 Summary
Glossary
Regression: Relates different variables that are measured on the same sample.
Regression models are used to describe the effect of one of the variables
on the distribution of the other one. The former is called the explanatory
variable and the latter is called the response.
Scatter Plot: A plot that presents the data in a pair of numeric variables. The
axes represent the variables and each point represents an observation.
Slope: A coefficient of a linear equation. The change in the value of y for each
unit change in the value of x. A positive slope corresponds to an increasing
line and a negative slope corresponds to a decreasing line.
R-Squared: The difference between 1 and the ratio between the variance of
the residuals from the regression and the variance of the response. Its
value is between 0 and 1 and it represents the fraction of the variability
of the response that is explained by the regression line.
the same topic, but consider it specifically in the context of statistical
models.
Some statisticians prefer complex models, models that try to fit the data as
closely as one can. Others prefer a simple model. They claim that although
simpler models are more remote from the data, they are easier to interpret
and thus provide more insight. What do you think? Which type of model is
better to use?
When formulating your answer to this question you may think of a situation
that involves inference based on data conducted by yourself for the sake of
others. What would be the best way to report your findings and explain them
to the others?
Formulas:
• A Linear Equation: y = a + b · x.
• Covariance: Sum of products of the deviations / (Number of values in the sample − 1) = Σi (yi − ȳ)(xi − x̄) / (n − 1).
Chapter 15
A Bernoulli Response
• Fit the logistic regression model to data using the function “glm” and
conduct statistical inference on the fitted model.
section we will use the factor “fuel.type” as the explanatory variable. Recall
that this variable identified the type of fuel, diesel or gas, that the car uses. The
aim of the analysis is to compare the proportion of cars with four doors between
cars that run on diesel and cars that run on gas.
Let us first summarize the data in a 2 × 2 frequency table. The function
“table” may be used in order to produce such a table:
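A sketch of the expression that produces the table below (the same expression
serves later as input to the function “prop.test”):

> table(cars$fuel.type, cars$num.of.doors)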
four two
diesel 16 3
gas 98 86
[Figure 15.1: Mosaic plot of “num.of.doors” by “fuel.type”.]
The null hypothesis states that the probability of the level “four” of the response within the
sub-population of diesel cars (the height of the leftmost darker rectangle in the
theoretic mosaic plot that is produced for the entire population) is equal to the
probability of the same level of the response within the sub-population of cars
that run on gas (the height of the rightmost darker rectangle in that theoretic
mosaic plot). Specifically, let us test the hypothesis that the two probabilities
of the level “four”, one for diesel cars and one for cars that run on gas, are equal
to each other.
The output of the function “table” may serve as the input to the function
“prop.test”2 . Notice that the Bernoulli response variable should be the second
variable in the input to the table whereas the explanatory factor is the first
variable in the table. When we apply the test to the data we get the report:
> prop.test(table(cars$fuel.type,cars$num.of.doors))
2 The function “prop.test” was applied in Section 12.4 in order to test that the probability
of an event is equal to a given value (“p = 0.5” by default). The input to the function was a
pair of numbers: the total number of successes and the sample size. In the current application
the input is a 2 × 2 table. When applied to such input the function carries out a test of the
equality of the probability of the first column between the rows of the table.
[Figure 15.2: Mosaic plot of “num.of.doors” by intervals of the variable “length”.]

[Figure 15.3: Histogram of the variable “length”.]
Figure 15.2 presents the mosaic plot of the response “num.of.doors” with “length” as the
explanatory variable. The plot presents, for interval levels of the explanatory
variable, the relative frequencies of each interval. It also presents the relative fre-
quency of the levels of the response within each interval level of the explanatory
variable.
In order to get a better understanding of the meaning of the given mosaic
plot one may consider the histogram of the explanatory variable. This histogram
is presented in Figure 15.3. Observe that the histogram involves the partition
of the range of the variable length into intervals. These intervals are the basis for
rectangles. The height of each rectangle represents the frequency of cars with
lengths that fall in the given interval.
The mosaic plot in Figure 15.2 is constructed on the basis of the histogram.
Notice that the x-axis in this plot corresponds to the explanatory variable
“length”. The total area of the square in the plot is divided between 7 vertical
rectangles. These vertical rectangles correspond to the 7 rectangles in the his-
togram of Figure 15.3, turned on their sides. Hence, the width of each rectangle in
Figure 15.2 corresponds to the height of the parallel rectangle in the histogram.
Consequently, the area (width times a height of 1) of the vertical rectangles in
the mosaic plot represents the relative frequency of the associated interval of the explanatory variable.
Under the logistic regression model, the probability pi of the level “four” for
observation i is related to the value xi of the explanatory variable by the formula:

pi = e^(a+b·xi) / (1 + e^(a+b·xi)) ,

where a and b are coefficients common to all observations. Equivalently, one
may write the same relation in the form:

log(pi /[1 − pi ]) = a + b · xi .
One may fit the logistic regression to the data and test the null hypothesis
by the use of the function “glm”:
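A minimal sketch of the expressions that produce the report below, with the
fitted model stored in the object “fit.doors”, the name used for it later in the
section:

> fit.doors <- glm(num.of.doors=="four" ~ length, family=binomial, data=cars)
> summary(fit.doors)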
Call:
glm(formula = num.of.doors == "four" ~ length, family = binomial,
data = cars)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1646 -1.1292 0.5688 1.0240 1.6673
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.14767 2.58693 -5.082 3.73e-07 ***
length 0.07726 0.01495 5.168 2.37e-07 ***
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1
Generally, the function “glm” can be used in order to fit regression models in
cases where the distribution of the response has special forms. Specifically, when
the argument “family=binomial” is used then the model that is being fitted is
the model of logistic regression. The formula that is used in the function involves
a response and an explanatory variable. The response may be a sequence with
logical “TRUE” or “FALSE” values as in the example3 . Alternatively, it may be
a sequence with “1” or “0” values, “1” corresponding to the event occurring to
the subject and “0” corresponding to the event not occurring. The argument
“data=cars” is used in order to inform the function that the variables are
located in the given data frame. The “glm” function is applied to the data and
the fitted model is stored in the object “fit.doors”.
A report is produced when the function “summary” is applied to the fitted
model. Notice the similarities and the differences between the report presented
here and the reports for linear regression that are presented in Chapter 14. Both
reports contain estimates of the coefficients a and b and tests for the equality of
these coefficients to zero. When the coefficient b, the coefficient that multiplies
the explanatory variable, is equal to 0 then the probability of the response is
unrelated to the explanatory variable. In the current case we may note that the
null hypothesis H0 : b = 0, the hypothesis that claims that there is no relation
between the explanatory variable and the response, is clearly rejected.
The estimated values of the coefficients are −13.14767 for the intercept a
and 0.07726 for the slope b. One may produce confidence intervals for these
coefficients by the application of the function “confint” to the fitted model:
> confint(fit.doors)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -18.50384373 -8.3180877
length 0.04938358 0.1082429
This case study is taken from the Rice Virtual Lab in Statistics. More details
on this case study can be found in the case study “Mediterranean Diet and
Health” that is presented on that site.
The subjects, 605 survivors of a heart attack, were randomly assigned to fol-
low either (1) a diet close to the “prudent diet step 1” of the American Heart
Association (AHA) or (2) a Mediterranean-type diet consisting of more bread
and cereals, more fresh fruit and vegetables, more grains, more fish, less deli-
catessen food, and less meat.
The subjects’ diet and health condition were monitored over a period of four
years. Information regarding deaths, development of cancer or the development
of non-fatal illnesses was collected. The information from this study is stored in
the file “diet.csv”. The file “diet.csv” contains two factors: “health” that
describes the condition of the subject, either healthy, suffering from a non-fatal
illness, suffering from cancer, or dead; and the “type” that describes the type
of diet, either Mediterranean or the diet recommended by the AHA. The file
can be found on the internet at http://pluto.huji.ac.il/~msby/StatThink/
Datasets/diet.csv. Answer the following questions based on the data in the
file:
1. Produce a frequency table of the two variables. Read off from the table the
number of healthy subjects that use the Mediterranean diet and the
number of healthy subjects that use the diet recommended by the
AHA.
2. Test the null hypothesis that the probability of keeping healthy following
a heart attack is the same for those that use the Mediterranean diet and
for those that use the diet recommended by the AHA. Use a two-sided
alternative and a 5% significance level.
3. Compute a 95% confidence interval for the difference between the two
probabilities of keeping healthy.
Solution (to Question 15.1.1): After saving the file “diet.csv” in the work-
ing directory of R, we read its content into a data frame. We apply the function
“table” to the two variables in order to obtain the requested frequency table:
> diet <- read.csv("diet.csv")
> table(diet$health,diet$type)
aha med
cancer 15 7
death 24 14
healthy 239 273
illness 25 8
The resulting table has two columns and 4 rows. The third row corresponds to
healthy subjects. Of these, 239 subjects used the AHA recommended diet and
273 used the Mediterranean diet. We may also plot this data using a mosaic
plot:
> plot(health~type,data=diet)
The mosaic plot produced by the function “plot” is presented in Figure 15.4.
Examining this plot one may appreciate the fact that the vast majority of the
[Figure 15.4: Mosaic plot of the response “health” (levels “cancer”, “death”, “healthy”, “illness”) against the explanatory factor “type” (levels “aha” and “med”).]
subjects were healthy and the relative proportion of healthy subjects among
users of the Mediterranean diet is higher than the relative proportion among
users of the AHA recommended diet.
Solution (to Question 15.1.2): In order to test the hypothesis that the
probability of keeping healthy following a heart attack is the same for those
that use the Mediterranean diet and for those that use the diet recommended by
the AHA we create a 2 × 2 table. This table compares the response of being healthy
or not to the type of diet as an explanatory variable. A sequence with logical
components, “TRUE” for healthy and “FALSE” for not, can be used as the response.
Such a sequence is produced via the expression “diet$health=="healthy"”:
> table(diet$health=="healthy",diet$type)
aha med
FALSE 64 29
TRUE 239 273
The table may serve as input to the function “prop.test”:
> prop.test(table(diet$health=="healthy",diet$type))
Solution (to Question 15.1.3): The confidence interval for the difference
in probabilities is reported in the output of the function “prop.test” and is equal
to [0.1114300, 0.3313203]. The point estimate of the difference between the
probabilities is p̂a − p̂b = 0.6881720 − 0.4667969 ≈ 0.22 in favor of the Mediterranean
diet. The confidence interval suggests that differences as low as 0.11 or as high
as 0.33 are not excluded by the data.
[Figure: Left panel, mosaic plot of the response “type” (levels “a”, “b”, “c”, “u”) against the explanatory variable “tetra”; right panel, histogram of “cushings$tetra” (y-axis: Frequency).]
3. Repeat the analysis from 2 using only the observations for which the type
is known. (Hint: you may fit the model to the required subset by the
inclusion of the argument “subset=(type!="u")” in the function that
fits the model.) Which of the analyses do you think is more appropriate?
Solution (to Question 15.2.1): We save the data of the file in a data frame
by the name “cushings”. We produce the mosaic plot with the function “plot”
and the histogram with the function “hist”:
> cushings <- read.csv("cushings.csv")
> plot(type~tetra,data=cushings)
> hist(cushings$tetra)
The mosaic plot describes the distribution of the 4 levels of the response within
the different intervals of values of the explanatory variable. The intervals coin-
cide with the intervals that are used in the construction of the histogram. In
particular, the third vertical rectangle from the left in the mosaic is associated
with the third interval from the left in the histogram6. This interval is associated
with the range of values between 20 and 30. The height of the given interval in
the histogram is 2, which is the number of patients with “tetra” levels that fall
in the interval.
6 This is also the third interval from the left in the histogram. However, since the second
interval from the right in the histogram is empty, it turns out that this is also associated with
the second rectangle from the right in the mosaic plot, the rectangle of interest.
There are 4 shades of grey in the first vertical rectangle from the left. Each
shade is associated with a different level of the response. The lightest shade
of grey, the uppermost one, is associated with the level “u”. Notice that this is
also the shade of grey of the entire third vertical rectangle from the left. The
conclusion is that both patients that are associated with this rectangle have
Tetrahydrocortisone levels between 20 and 30 and have an unknown type of
syndrome.
Solution (to Question 15.2.2): We fit the logistic regression to the en-
tire data in the data frame “cushings” using the function “glm”, with the
“family=binomial” argument. The response in the fit is the indicator that the
type is equal to “b”. The fitted model is saved in an object called “cushings.fit.all”.
The application of the function “summary” to the fitted model produces a report
that includes the test of the hypothesis of interest:
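The fitting call itself is not reproduced in the text; a form consistent with the report below and with the object name “cushings.fit.all” would be:

> cushings.fit.all <- glm((type=="b") ~ tetra, family=binomial, data=cushings)  # assumed call, matching the Call line in the report
> summary(cushings.fit.all)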
Call:
glm(formula = (type == "b") ~ tetra, family = binomial,
data = cushings)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0924 -1.0461 -0.8652 1.3427 1.5182
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.12304 0.61330 -0.201 0.841
tetra -0.04220 0.05213 -0.809 0.418
The test of interest examines the coefficient that is associated with the explana-
tory variable “tetra”. The estimated value of this parameter is −0.04220. The
p-value for testing that the coefficient is 0 is equal to 0.418. Consequently, since
the p-value is larger than 0.05, we do not reject the null hypothesis that states
that the response and the explanatory variable are statistically unrelated.
Confidence intervals for the coefficients are produced by applying the function
“confint” to the fitted model:
> confint(cushings.fit.all)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -1.2955624 1.18118256
tetra -0.1776113 0.04016772
Specifically, the confidence interval for the coefficient that is associated with the
explanatory variable is equal to [−0.1776113, 0.04016772].
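Solution (to Question 15.2.3): We refit the model using only the observations for which the type of the syndrome is known. A plausible form of the call, consistent with the report that follows and with the object name “cushings.fit.known” used below, adds the argument “subset=(type!="u")”:

> cushings.fit.known <- glm((type=="b") ~ tetra, family=binomial, data=cushings, subset=(type!="u"))  # assumed call, matching the Call line in the report
> summary(cushings.fit.known)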
Call:
glm(formula = (type == "b") ~ tetra, family = binomial,
data = cushings, subset = (type != "u"))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2078 -1.1865 -0.7548 1.2033 1.2791
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.11457 0.59947 0.191 0.848
tetra -0.02276 0.04586 -0.496 0.620
The estimated value of the coefficient when considering only subjects with a
known type of the syndrome is slightly changed to −0.02276. The new p-value,
which is equal to 0.620, is larger than 0.05. Hence, yet again, we do not reject
the null hypothesis.
7 This argument may be used in other functions as well.
> confint(cushings.fit.known)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -1.0519135 1.40515473
tetra -0.1537617 0.06279923
For the modified confidence interval we apply the function “confint” to the new
fitted model. We now get [−0.1537617, 0.06279923] as a confidence interval for the
coefficient of the explanatory variable.
Before, we fitted the model to all the observations. Here we use only
the observations for which the type of the syndrome is known. The practical
meaning of using all observations is to declare that the type of the syndrome
for those for which the type is unknown is not type “b”. This seems not to
be appropriate and may introduce bias. It is more appropriate to treat the
observations associated with the level “u” as missing observations and to delete
them from the analysis. This is the approach that was used in the
second analysis.
Glossary
Mosaic Plot: A plot that describes the relation between a response factor and
an explanatory variable. Vertical rectangles represent the distribution
of the explanatory variable. Horizontal rectangles within the vertical ones
represent the distribution of the response.
Formulas:
• Logistic Regression (Probability): p_i = e^(a + b·x_i) / (1 + e^(a + b·x_i)).
Case Studies
• Revise the concepts and methods for statistical inference that were pre-
sented in the second part of the book.
16.2 A Revision
The second part of the book dealt with statistical inference: the science of
making general statements about an entire population on the basis of the data at
hand. The statements are based on the formulation of theoretical models that
produce the sampling distribution. Procedures for making the inference are
evaluated based on their properties in the context of this sampling distribution.
Procedures with desirable properties are applied to the data. One may attach
to the output of this application summaries that describe the properties.
In particular, we dealt with two forms of making inference. One form was
estimation and the other was hypothesis testing. The goal in estimation is
to determine the value of a parameter in the population. Point estimates or
confidence intervals may be used in order to fulfill this goal. The properties of
point estimators may be assessed using the mean square error (MSE) and the
properties of the confidence interval may be assessed using the confidence level.
The target in hypothesis testing is to decide between two competing hypotheses.
These hypotheses are formulated in terms of population parameters. The
decision rule is called a statistical test and is constructed with the aid of a test
statistic and a rejection region. The default hypothesis among the two is rejected if
the test statistic falls in the rejection region. The major property a test must
possess is a restriction on the probability of a Type I error, erroneously rejecting
the null hypothesis. This restriction is called the significance level of the test.
A test may also be assessed in terms of its statistical power, the probability of
rightfully rejecting the null hypothesis.
Estimation and testing were applied in the context of single measurements
and for the investigation of the relations between a pair of measurements. For
single measurements we considered both numeric variables and factors. For
numeric variables one may attempt to conduct inference on the expectation
and/or the variance. For factors we considered the estimation of the probability
of obtaining a level, or, more generally, the probability of the occurrence of an
event.
We introduced statistical models that may be used to describe the relations
between variables. One of the variables was designated as the response. The
other variable, the explanatory variable, is identified as a variable which may
affect the distribution of the response. Specifically, we considered numeric vari-
ables and factors that have two levels. If the explanatory variable is a factor with
two levels then the analysis reduces to the comparison of two sub-populations,
each one associated with a level. If the explanatory variable is numeric then a
regression model may be applied, either linear or logistic regression, depending
on the type of the response.
Statistical inference is based on the assumption of statistical models. These
models attempt to reflect reality. However, one s advised to apply healthy skep-
ticism when using the models. First, one should be aware what the assumptions
are. Then one should ask oneself how reasonable are these assumption in the
context of the specific analysis. Finally, one should check as much as one can
the validity of the assumptions in light of the information at hand. It is useful
to plot the data and compare the plot to the assumptions of the model.
> hist(patient$time)
[Figure: Histogram of the variable “patient$time” (y-axis: Frequency; values range from about 10 to 60).]
Given the fact that the median is equal to 30, one may suspect that, as a matter of fact,
a large number of the values are actually equal to 30. Indeed, let us produce a
table of the response:
> table(patient$time)
5 15 20 25 30 40 45 50 60
1 10 15 3 30 4 5 2 1
Notice that 30 of the 72 physicians marked “30” as the time they expect to
spend with the patient. This is the middle value in the range, and may just be
the default value one marks if one needs to fill in a form and does not really
attach much importance to the question that was asked.
The goal of the analysis is to examine the relation between the patient being
overweight and the doctor’s response. The explanatory variable is a factor with two
levels. The response is numeric. A natural tool to use in order to test this hypothesis
is the t-test, which is implemented with the function “t.test”.
First we plot the relation between the response and the explanatory variable
and then we apply the test:
[Figure 16.2: Box plots of the response “time” for the two levels of the explanatory variable “weight” (BMI=23 and BMI=30).]
> boxplot(time~weight,data=patient)
> t.test(time~weight,data=patient)
The box plots that describe the distribution of the response for each level of
the explanatory variable are presented in Figure 16.2. Nothing seems problem-
atic in this plot. The two distributions, as they are reflected in the box plots,
look fairly symmetric.
[Figure: Mosaic plot of the factor “patient$time >= 30” (levels FALSE and TRUE) against “weight” (BMI=23 and BMI=30).]
The equality of the variances of the response in the two groups may be examined
with the function “var.test”:
> var.test(time~weight,data=patient)
In this test we do not reject the null hypothesis since the p-value is larger
than 0.05. The sample variances are almost equal to each other (their ratio
is 1.044316), with a confidence interval for the ratio that essentially ranges
between 1/2 and 2.
The production of p-values and confidence intervals is just one aspect of the
analysis of data. Another aspect, which typically is much more time consuming
and requires experience and healthy skepticism, is the examination of the
assumptions that are used in order to produce the p-values and the confidence
intervals. A clear violation of the assumptions may warn the statistician that
perhaps the computed nominal quantities do not represent the actual statistical
properties of the tools that were applied.
In this case, we have noticed the high concentration of the response at the
value “30”. What is the situation when we split the sample between the two
levels of the explanatory variable? Let us apply the function “table” once more,
this time with the explanatory variable included:
> table(patient$time,patient$weight)
BMI=23 BMI=30
5 0 1
15 2 8
20 6 9
25 1 2
30 14 16
40 4 0
45 4 1
50 2 0
60 0 1
Not surprisingly, there is still a high concentration at the level “30”. But one
can see that only 2 of the responses in the “BMI=30” group are above that value,
in comparison to a much more symmetric distribution of responses for the other
group.
The simulations of the significance level of the one-sample t-test for an Ex-
ponential response that were conducted in Question 12.2 may cast some doubt
on how trustworthy the nominal p-values of the t-test are when the measurements
are skewed. The skewness of the response for the group “BMI=30” is a reason to
be worried.
We may consider a different test, which is more robust, in order to validate
the significance of our findings. For example, we may turn the response into
a factor by setting one level for values larger than or equal to “30” and a different
level for values less than “30”. The relation between the new response and the
explanatory variable may then be examined with the tools for a Bernoulli response,
for example with the function “prop.test”.
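A sketch of such an analysis (the exact expression does not appear in the surviving text; the split at 30 follows the factor displayed in the mosaic plot above):

> prop.test(table(patient$time >= 30, patient$weight))  # assumed expression; compares the two groups on the dichotomized response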
An alternative is to split the response with one level associated with values of “time” strictly
larger than 30 minutes and the other with values less than or equal to 30. The resulting
p-value from the expression “prop.test(table(patient$time>30,patient$weight))” is 0.01276.
However, the number of subjects in one of the cells of the table is equal only to 2, which casts
doubt on the validity of the Normal approximation that is used by this test.
4 Blakley, B.A., Quiñones, M.A., Crawford, M.S., and Jago, I.A. (1994). The validity of
[Figure: Histograms of the four variables “job$ratings”, “job$sims”, “job$grip” and “job$arm” (y-axis: Frequency).]
Measures of physical strength were collected from each participant. These included grip and arm strength. A piece of equip-
ment known as the Jackson Evaluation System (JES) was used to collect the
strength data. The JES can be configured to measure the strength of a number
of muscle groups. In this study, grip strength and arm strength were measured.
The outcomes of these measurements were summarized in two scores of physical
strength called “grip” and “arm”.
Two separate measures of job performance are presented in this case study.
First, the supervisors for each of the participants were asked to rate how well
their employee(s) perform on the physical aspects of their jobs. This measure
is summarized in the variable “ratings”. Second, simulations of physically de-
manding work tasks were developed. The summary score of these simulations
is given in the variable “sims”. Higher values of either measure of perfor-
mance indicate better performance.
The data for the 4 variables and 147 observations is stored in the file “job.csv”.
We start by reading the content of the file into a data frame by the name “job”.
All variables are numeric. Their histograms are presented in Figure 16.5. Exam-
ination of the 4 summaries and histograms does not produce interesting findings.
All variables are, more or less, symmetric, with the distribution of the variable
“ratings” tending perhaps to be more uniform than the other three.
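The commands themselves are not reproduced in the text; a minimal sketch that reads the data and produces the summaries and the four histograms might be:

> job <- read.csv("job.csv")  # assumed to be in the working directory
> summary(job)
> hist(job$ratings)
> hist(job$sims)
> hist(job$grip)
> hist(job$arm)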
The main analyses of interest are attempts to relate the two measures of
physical strength “grip” and “arm” with the two measures of job performance,
“ratings” and “sims”. A natural tool to consider in this context is a linear
regression analysis that relates a measure of physical strength as an explanatory
variable to a measure of job performance as a response.
Let us consider the variable “sims” as a response. The first step is to plot
a scatter plot of the response and explanatory variable, for both explanatory
variables. To the scatter plot we add the line of regression. In order to add the
regression line we fit the regression model with the function “lm” and then apply
the function “abline” to the fitted model. The plot for the relation between
the response and the variable “grip” is produced by the code:
> plot(sims~grip,data=job)
> sims.grip <- lm(sims~grip,data=job)
> abline(sims.grip)
The plot that is produced by this code is presented on the upper-left panel of
Figure 16.5.
The plot for the relation between the response and the variable “arm” is
produced by this code:
> plot(sims~arm,data=job)
> sims.arm <- lm(sims~arm,data=job)
> abline(sims.arm)
The plot that is produced by the last code is presented on the upper-right panel
of Figure 16.5.
Both plots show similar characteristics. There is an overall linear trend in
the relation between the explanatory variable and the response. The value of
the response increases with the increase in the value of the explanatory variable
(a positive slope). The regression line seems to follow, more or less, the trend
that is demonstrated by the scatter plot.
[Figure 16.5: Scatter plots with fitted regression lines: “sims” against “grip” (upper left), “sims” against “arm” (upper right), “sims” against “score” (lower left), and “grip” against “arm” (lower right).]
> summary(sims.grip)
Call:
lm(formula = sims ~ grip, data = job)
Residuals:
Min 1Q Median 3Q Max
-2.9295 -0.8708 -0.1219 0.8039 3.3494
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.809675 0.511141 -9.41 <2e-16 ***
grip 0.045463 0.004535 10.03 <2e-16 ***
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1
Examination of the report reveals a clear statistical significance for the effect
of the explanatory variable on the distribution of the response. The value of R-
squared, the ratio of the variance of the response that is explained by the regression,
is 0.4094. The square root of this quantity, √0.4094 ≈ 0.64, is the proportion
of the standard deviation of the response that is explained by the explanatory
variable. Hence, about 64% of the variability in the response can be attributed
to the measure of the strength of the grip.
For the variable “arm” we get:
> summary(sims.arm)
Call:
lm(formula = sims ~ arm, data = job)
Residuals:
Min 1Q Median 3Q Max
-3.64667 -0.75022 -0.02852 0.68754 3.07702
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.095160 0.391745 -10.45 <2e-16 ***
arm 0.054563 0.004806 11.35 <2e-16 ***
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1
Both measures of strength may also be included together as explanatory variables. The code expression that can be used is “lm(sims ~ grip + arm, data=job)”.
We use this combined score as an explanatory variable. First we form the score
and plot the relation between it and the response:
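The code that forms the score and fits the model is not reproduced in the surviving text. A sketch of what it may look like, under the assumption that the combined score is the average of the standardized versions of “grip” and “arm” (the exact construction is given in the omitted part of the text), is:

> job$score <- as.numeric((scale(job$grip) + scale(job$arm))/2)  # assumed construction of the combined score
> plot(sims~score, data=job)
> sims.score <- lm(sims~score, data=job)
> abline(sims.score)
> summary(sims.score)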
Call:
lm(formula = sims ~ score, data = job)
Residuals:
Min 1Q Median 3Q Max
-3.18897 -0.73905 -0.06983 0.74114 2.86356
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.07479 0.09452 0.791 0.43
score 1.01291 0.07730 13.104 <2e-16 ***
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1
Indeed, the score is highly significant. More important, the R-squared coefficient
that is associated with the score is 0.5422, which corresponds to a ratio of the
standard deviation that is explained by the model of √0.5422 ≈ 0.74. Thus,
almost 3/4 of the variability is accounted for by the score, so the score is a
reasonable means of guessing what the results of the simulations will be. This
guess is based only on the results of the simple tests of strength that are conducted
with the JES device.
Before putting the final seal on the results let us examine as much as we can
the assumptions we made. First, with respect to the two explanatory variables.
Do they really each measure a different property or do they actually measure
the same phenomenon, call it overall strength, twice? In order to examine this
question let us look at the scatter plot that describes the relation between the
two explanatory variables. This plot is produced using the code:
> plot(grip~arm,data=job)
It is presented in the lower-right panel of Figure 16.5. Indeed, one may see that
the two measurements of strength are not independent of each other but rather
tend to increase together.
[Figure 16.6: Upper panel, histogram of the residuals of “sims.score”; lower panel, Normal Quantile-Quantile plot of the residuals (Sample Quantiles against theoretical quantiles).]
Another assumption worth examining is the Normality of the error terms. The
residuals are the differences between the observed values of the response and its
estimated expectation, namely the fitted regression line.
The residuals can be computed via the application of the function “residuals”
to the fitted regression model.
Specifically, let us look at the residuals from the regression line that uses the
score that is combined from the grip and arm measurements of strength. One
may plot a histogram of the residuals:
> hist(residuals(sims.score))
The produced histogram is presented in the upper panel of Figure 16.6. The
histogram portrays a symmetric distribution that may result from Normally dis-
tributed observations. A better method to compare the distribution of the
residuals to the Normal distribution is to use the Quantile-Quantile plot. This
plot can be found on the lower panel of Figure 16.6. We do not discuss here the
method by which this plot is produced7 . However, we do say that any devia-
tion of the points from a straight line is an indication for the violation of the
assumption of Normality. In the current case, the points seem to be on a single
line, which is consistent with the assumptions of the regression model.
The next task should be an analysis of the relations between the explanatory
variables and the other response “ratings”. In principle one may use the same
steps that were presented for the investigation of the relations between the
explanatory variables and the response “sims”. But of course, the conclusion
may differ. We leave this part of the investigation as an exercise to the students.
16.4 Summary
16.4.1 Concluding Remarks
The book included a description of some elements of statistics, elements that we
thought are simple enough to be explained as part of an introductory course of
statistics and are the minimum that is required of any person that is involved
in academic activities of any field in which the analysis of data is required. Now,
as you finish the book, it is as good a time as any to say some words regarding
the elements of statistics that are missing from this book.
One element is more of the same. The statistical models that were presented
are as simple as a model can get. A typical application will require more complex
models. Each of these models may require specific methods for estimation
and testing. The characteristics of inference, e.g. significance or confidence
levels, rely on assumptions that the models are assumed to possess. The user
should be familiar with computational tools that can be used for the analysis of
these more complex models. Familiarity with the probabilistic assumptions is
required in order to be able to interpret the computer output, to diagnose pos-
sible divergence from the assumptions and to assess the severity of the possible
effect of such divergence on the validity of the findings.
Statistical tools can be used for tasks other than estimation and hypothesis
testing. For example, one may use statistics for prediction. In many applica-
tions it is important to assess what the values of future observations may be
7 Generally speaking, the plot is composed of the empirical percentiles of the residuals,
plotted against the theoretical percentiles of the standard Normal distribution. The current
plot is produced by the expression “qqnorm(residuals(sims.score))”.
and in what range of values they are likely to occur. Statistical tools such as
regression are natural in this context. However, the required task is not the testing
or estimation of the values of parameters, but the prediction of future values of
the response.
A different element is the role of statistics in the design stage. We hinted
in that direction when we talked in Chapter 11 about the selection of
a sample size in order to assure a confidence interval with a given accuracy.
In most applications, the selection of the sample size emerges in the context
of hypothesis testing and the criterion for selection is the minimal power of the
test, a minimal probability to detect a true finding. Yet, statistical design is
much more than the determination of the sample size. Statistics may have a
crucial input in the decision of how to collect the data. With an eye on the
requirements for the final analysis, an experienced statistician can make sure
that the data that is collected is indeed appropriate for that final analysis. Too often
is the case where a researcher steps into the statistician’s office with data that he
or she collected and asks, when it is already too late, for help in the analysis
of data that cannot provide a satisfactory answer to the research question the
researcher tried to address. It may be said, with some exaggeration, that good
statisticians are required for the final analysis only in the case where the initial
planning was poor.
Last, but not least, is the mathematical theory of statistics. We
tried to introduce as little as possible of the relevant mathematics in this course.
However, if one seriously intends to learn and understand statistics then one
must become familiar with the relevant mathematical theory. Clearly, deep
knowledge in the mathematical theory of probability is required. But apart
from that, there is a rich and rapidly growing body of research that deals with
the mathematical aspects of data analysis. One cannot be a good statistician
unless one becomes familiar with the important aspects of this theory.
I should have started the book with the famous quotation: “Lies, damned
lies, and statistics”. Instead, I am using it to end the book. Statistics can be
used and can be misused. Learning statistics can give you the tools to tell the
difference between the two. My goal in writing the book is achieved if reading
it will mark for you the beginning of the process of learning statistics and not
the end of the process.