Basic Statistics With R - Reaching Decisions With Data
Stephen C. Loftus
Division of Science, Technology, Engineering and Math
Sweet Briar College
Sweet Briar, VA, United States
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
Copyright © 2022 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopying, recording, or any information storage and
retrieval system, without permission in writing from the publisher. Details on how to seek
permission, further information about the Publisher’s permissions policies and our arrangements
with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency,
can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the
Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and
experience broaden our understanding, changes in research methods, professional practices, or
medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in
evaluating and using any information, methods, compounds, or experiments described herein. In
using such information or methods they should be mindful of their own safety and the safety of
others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors,
assume any liability for any injury and/or damage to persons or property as a matter of products
liability, negligence or otherwise, or from any use or operation of any methods, products,
instructions, or ideas contained in the material herein.
ISBN: 978-0-12-820788-8
Biography
Preface
Acknowledgments
Part I
An introduction to statistics and R
1. What is statistics and why is it important?
1.1 Introduction
1.2 So what is statistics?
1.2.1 The process of statistics
1.2.2 Hypothesis/questions
1.2.3 Data collection
1.2.4 Data description
1.2.5 Statistical inference
1.2.6 Theories/decisions
1.3 Computation and statistics
2. An introduction to R
2.1 Installation
2.2 Classes of data
2.3 Mathematical operations in R
2.4 Variables
2.5 Vectors
2.6 Data frames
2.7 Practice problems
2.8 Conclusion
Part II
Collecting data and loading it into R
3. Data collection: methods and concerns
3.1 Introduction
Part III
Exploring and describing data
6. Exploratory data analyses: describing our data
6.1 Introduction
6.2 Parameters and statistics
6.3 Parameters, statistics, and EDA for categorical variables
6.3.1 Practice problems
6.4 Parameters, statistics, and EDA for a single quantitative variable
6.4.1 Statistics for the center of a variable
6.4.2 Practice problems
6.4.3 Statistics for the spread of a variable
6.4.4 Practice problems
6.5 Visual summaries for a single quantitative variable
6.6 Identifying outliers
7. R tutorial: EDA in R
7.1 Introduction
7.2 Frequency and contingency tables in R
7.3 Numerical exploratory analyses in R
7.3.1 Summaries for the center of a variable
7.3.2 Summaries for the spread of a variable
7.3.3 Summaries for the association between two quantitative variables
7.4 Missing data
7.5 Practice problems
7.6 Graphical exploratory analyses in R
7.6.1 Scatterplots
7.6.2 Histograms
7.7 Boxplots
7.8 Practice problems
7.9 Conclusion
Part IV
Mechanisms of inference
8. An incredibly brief introduction to probability
8.1 Introduction
8.2 Random phenomena, probability, and the Law of Large Numbers
8.3 What is the role of probability in inference?
8.4 Calculating probability and the axioms of probability
8.5 Random variables and probability distributions
8.6 The binomial distribution
8.7 The normal distribution
8.8 Practice problems
8.9 Conclusion
Part V
Statistical inference
13. Hypothesis tests for a single parameter
13.1 Introduction
13.2 One-sample test for proportions
13.2.1 State hypotheses
B. List of R datasets
References
Index
Online Resources
Please visit the student companion site for access to Data Sets referenced in the text: https://www.elsevier.com/books-and-journals/book-companion/9780128207888
Preface
Over the past 15 years—ever since I took AP Statistics as a high school junior—
I have heard many horror stories from friends, family, and complete strangers
about their experiences with statistics. “It was too difficult” or “It just did not
click for me” were common refrains. As such, the thought of doing statistics or
working with data fills them with fear.
This is something I consider very unfortunate, for two reasons. The first is
that statistics is a subject with which I am personally fascinated. The ability to
draw information out of and tell stories with data is something that has drawn
me to the subject. Second, and much more importantly, the ability to work with
data is becoming an essential skill in the 21st century. As the amount of data
increases in every field, employees of all types are expected to take a larger part
in the process of drawing decisions from this data. This cannot happen when a
person’s sole experience with statistics is negative.
What follows is an attempt to provide a minimal-stress introduction
to statistics for some or a gentle reintroduction to the subject for others. In this
book, we will be looking at many of the foundations of statistics: from how
data is collected, to exploratory analysis, to basic statistical inference. In doing
so, we will have many opportunities to practice the techniques learned through
examples and practice problems.
Additionally, many of the problems posed by modern statistics require the
use of software to find a solution. As such, the statistical methods taught in this
book are accompanied by instruction in the statistical programming language R.
While not intended to be a comprehensive introduction to either coding or R, it
should provide a good starting point for individuals working in this language for
the first time.
With all this in mind, let us go forward into the process of statistics.
Stephen C. Loftus
Acknowledgments
I would like to thank everyone who helped to make this book come to fruition, to
include Raina Robeva, for putting me in touch with the right people to start this
project, and everyone at Elsevier who had a hand in this process, particularly
Katey Birtcher, Alice Grant, Andreh Akeh, and Beula Christopher. The initial
three reviewers provided helpful suggestions relating to content that could aug-
ment one’s understanding. Additionally, I would like to thank all of my various
math and statistics teachers, professors, and mentors who trained me to be the
statistician I am today. Furthermore, I would like to thank the students at Sweet
Briar who gave me cause to create this text. Finally, I must especially thank my
wife, Michelle, who first suggested I turn my class notes into a textbook a full
year prior to my contact with Elsevier, and also to my family for their support.
Chapter 1
What is statistics and why is it important?
1.1 Introduction
On August 5, 2009, the Technology section of The New York Times ran the
following headline: “For Today’s Graduate, Just One Word: Statistics.” The ar-
ticle’s author, Steve Lohr, used the moment to argue that the most important
skill for college graduates entering the workforce was—and is still today—
statistics [1].
Statistics, as an academic discipline, is relatively young. Aristotle wrote on
chemistry in the 4th century BCE, and calculus was discovered by Newton and
Leibniz in the late 17th century. The first major papers on statistics were pub-
lished in the late 19th century, but many of the foundational statistical techniques
still used today were developed in the 1920s.
What changed from that time to today? Before the past decade, the field of
statistics was generally regarded with skepticism, dismissed with pithy quips
that are still quoted today—for a few examples, consider Benjamin Disraeli’s
famous salvo, “There are three types of lies: lies, damned lies, and statistics,” or
George Canning’s, “I can prove anything by statistics except the truth.” How-
ever, the days of the statistician punch line seem to be passing. Now, being a
statistician or data scientist is consistently regarded by many outlets—including
U.S. News and World Report [2]—as one of the best jobs of the year.
What changed was the importance of data. Nearly every business field or
academic discipline now recognizes the importance of data in decision-making.
Data supports or disproves theories and pushes the scientific process forward in
biology and chemistry. Data fuels policymaking in economics and political sci-
ence. Data pervades all fields at all times, to the point that the World Economic
Forum declared data a new class of economic asset, akin to currency or precious
metals [3].
1.2.2 Hypothesis/questions
We always want to try to answer questions about the world. We see an event, some sort of phenomenon, and we want to try to understand it. With that in mind, we ask questions or develop hypotheses to explain what we have seen.
FIGURE 1.1 The Process of Statistics. Notice that this process is arranged as a circle because, like the scientific process, the process of statistics is a constant procedure, continuously looking for better solutions and better decisions as time goes on.
For example, medical professionals may want to see whether the use of exter-
nal warming, such as blankets or forced-air warming blankets, helps decrease
instances of hypothermia in the operating room. Agricultural researchers may
want to know what types of fertilizer give the best yields. Political pollsters may
want to gauge public opinion of a law to be enacted.
1.2.5 Statistical inference
Our descriptive statistics are based on our sample, and no two samples are the same.
Ideally, we should make the same decision regardless of our data, at least most
of the time. Statistical inference helps us to make decisions by quantifying how
unlikely our data is assuming a specific state of the world. If our collected data
is rare enough, we conclude that our assumed state of the world is incorrect.
1.2.6 Theories/decisions
Once we complete our statistical inference, the subject matter experts—the
economists, the scientists, etc.—have to use the results accordingly. If you are
the expert in the field, it is important to understand what the results state. Other-
wise, you must be able to effectively communicate the results of your statistical
inference to other individuals, a skill that comes only with practice.
Chapter 2
An introduction to R
Contents
2.1 Installation
2.2 Classes of data
2.3 Mathematical operations in R
2.4 Variables
2.5 Vectors
2.6 Data frames
2.7 Practice problems
2.8 Conclusion
2.1 Installation
R is an open-source programming language that is used for data analysis and
graphics creation. Across a wide variety of academic fields, R is the standard
statistical software and boasts an extensive online community. As such, solu-
tions to problems one might encounter in working with R can easily be found
through a quick web search. For these reasons, we will be using R to
conduct our analyses in this book.
In order to install R, you have to download the program—be sure to choose
the correct version for your Windows or Macintosh machine—from the Com-
prehensive R Archive Network at https://cran.r-project.org/. The installation
wizard is simple to follow, and thus will not be explained in detail.
The R console is fairly simple for both Windows and Mac operating sys-
tems. As a coding language, there are minimal point-and-click aspects to the
system. All code for calculations, statistical methods, and creating graphics is
typed into the console and executed by hitting the <Enter> key. The output
for calculations and statistical methods will be seen in the same console where
the code is typed. When creating graphics, other windows will open in the pro-
gram containing the various plots that we create. Unless otherwise noted, there
is no difference in the code to complete a task between Windows and Macintosh
machines. (See Fig. 2.1.)
FIGURE 2.1 The R console for Windows (Top) and Mac (Bottom) operating systems.
To do the calculations, you type the formula into the R console and hit
<Enter>. Be sure to be careful about your order of operations and, if needed,
your parentheses, as R will follow the standard order of operations learned in
math class.
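For example, typing

(3+5)/2^2

and hitting <Enter> returns 2, as R evaluates the parentheses and the exponent before performing the division.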
R has particular conventions about parentheses. Where the
mathematical formula 2(1 + 2) is completely valid, R does not know how to
read this, and will return an error. In order to enter this in a way that R will
understand, you will have to say
2*(1+2)
which will return the answer 6. In addition, for every parenthesis that you open
in an equation, a closing parenthesis is necessary. If an open parenthesis does not
have a matching close parenthesis, R will not return your answer, but will return
an empty line with a + sign at the left. To exit out of the line, hit <Escape> and
reenter your formula. (See Fig. 2.3.)
FIGURE 2.3 A parenthesis error in R. Exit the line by hitting the <Escape> key.
2.4 Variables
Variables are the crux of most of what we will be using in R. These variables
are saved data of either numeric or character values. The simplest variable is a
variable with a single saved value. In general, to save a variable with a specific
value in R, you type
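Variable Name = Value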
and then hit <Enter> to save the variable. From that point onward that variable
will be associated with that value, until you either exit R or overwrite the value.
For example, if you wanted to assign the value 71 (inches) to a variable named
height, you would enter
height = 71
and then hit <Enter>. From that point on, height will be associated with the
value 71, and can be acted on by mathematical operators. For example, say you
wanted to find the value of height in feet. You would want to divide the height—
71 inches—by 12. To do this in R, you would type
height/12
and hit <Enter> to get the height in feet value of 5.9167. If you wanted to
keep that value around, it could be stored in a new variable by using that exact
formula, for example,
heightfeet=height/12
At which point the variable heightfeet will be associated with the value
71/12=5.9167. It is important to note that for variable names in R, you can
use any combination of letters, numbers, and select special characters such as
periods or underscores. Spaces and other special characters—such as the excla-
mation point or commas—are not allowed in variable names. If a variable name
breaks a naming rule—a disallowed character, for example—R will return an
error, saying “Error: unexpected input” or “Error: unexpected symbol.” (See
Fig. 2.4.)
One other important aspect of variable names is that they are case sensi-
tive, meaning that capitalization matters to variable names. To R, the variable
named height—what we defined to be equal to 71—and the variable named
Height—something we never created in R so it does not exist—are two differ-
ent variables. If you enter Height into the console, R will not return the value 71
but will tell you that the variable “Height” is not found.
Character variables have a similar entering process in R, with one key dis-
tinction. Character values must be entered using quotes, otherwise R will return
an error. The entry process is again
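Variable Name = "Value"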
and then hit <Enter> to save the variable. The variable is then associated with
that value until exiting R or until it is overwritten. For example,
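name="Stephen Loftus"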
and the variable name will be associated with the value “Stephen Loftus” from
that point onward. Note that if you do not put the value of your variable in
quotes, R will return an error. All of the naming conventions for character vari-
able names and numeric variable names are identical. (See Fig. 2.5.)
2.5 Vectors
Variables can be more than just single values. Commonly in our datasets, vari-
ables will have one value for every participant in our study. These multiple
values are stored in vectors in R, which will be stored in data frames (more
on them in a minute). In R, vectors are defined using the notation
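Variable Name = c(Value 1, Value 2, ..., Value n)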
So, for example, if you recorded heights for five people and stored them in a
vector named height, it would be
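height=c(75, 74, 67, 83, 75)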
and then hit <Enter>. The variable height will now be associated with those five
values until exiting R or overwriting. Vectors can be acted on using arithmetic
operations similar to single-value variables. For example, to find the height in
feet of these five individuals, you would divide the whole height vector by 12,
using
height/12
This would return 6.25, 6.17, 5.58, 6.92, and 6.25, the heights of each of these
five people in feet. These values can be stored in a vector of their own if desired.
In addition, two vectors of the same length can be added together, with the
first elements of both vectors being added together, the second elements added
together, etc. (See Fig. 2.6.)
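2.6 Data frames
Data frames collect several vectors—each representing a variable in our dataset—into a single object. To create a data frame out of existing vectors, we use the data.frame function, typing

Data Frame Name = data.frame(vector1, vector2, ...)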
and hit <Enter>. Naming conventions are identical for data frames and vari-
ables. It is important to note that in creating data frames, we need to ensure the
ordering for each vector is the same. In other words, the first values in vector 1
and vector 2 both belong to observation 1, the second values from observation 2,
and so on. So, for example, to create a data frame name athlete with the athlete’s
names and heights from the previous section, we would use
athlete=data.frame(name, height)
Printing this out in R, we can see that Alex Ovechkin is 75 inches tall, Mike
Trout is 74 inches tall, etc. Data frames are able to have both character vectors
and numeric vectors in the same data frame. (See Fig. 2.7.)
9. Each of these games falls under the genre of adventure games. In 2011,
adventuregamers.com created a ranking of the top 100 adventure games of
all time. Create a vector—containing the rankings of the aforementioned
games—named Rank with the values 14, 11, 6, and 1 [6].
10. What code would you use to create a data frame called AdventureGames
containing the vectors contained in Problem 6, Problem 7, and Problem 9?
2.8 Conclusion
R provides a flexible coding framework in which we can store entered data—
both numbers and character strings—in vectors and data frames alike. Addi-
tionally, it can act as a calculator, doing standard mathematical operations on
numbers or inputted data as a whole. Our next goal will be to select individ-
ual values from a vector, variables from a data frame, and rows—representing
observations—from a dataset.
Chapter 3
Data collection: methods and concerns
3.1 Introduction
The first step in the process of statistics is the development of a hypothesis or
question to test. While careful thought does need to be taken when choosing the
question you want answered, we will assume that you have a clear definition of
the question you want answered. With this in mind, we move on to the second
step in the process: collecting data.
Data collection is one of the more overlooked roles of statistics in science,
but is crucial nonetheless. If data is not carefully collected, it is very possible
that we will influence the results of our study or possibly make it so that we
are unable to answer our question at all. In this chapter, we will look at two
of the most common methods of data collection—observational studies and de-
signed experiments—and see what each of these methods brings to the process
of statistics.
A variable is any value that we measure or record on a subject, regardless of what it is. There are many types of variables that we
can record, but we will concentrate on two main types. Quantitative variables,
sometimes referred to as numeric variables, are any variable where the recorded
value is a number and that number indicates some sort of magnitude. Exam-
ples of quantitative variables could include height, weight, or age. Qualitative
variables, more commonly referred to as categorical variables, are any variable
where the recorded value is a category. Sex, political affiliation, job, or academic
class are all examples of categorical variables.
As an example, consider a situation where we collect 5-digit zip codes from
our subjects. Would a zip code be considered a quantitative variable or a cat-
egorical variable? Zip codes are comprised of numbers, which might imply
quantitative. However, the numbers of zip codes do not indicate any magni-
tude. A zip code 34815 does not imply anything better or worse than a zip code
of 26970. As such, zip codes are categorical variables.
In order to answer our questions using data, we of course must gather a sam-
ple. Generally speaking, there are two ways we can get our data: observational
studies and designed experiments. Both methods have their strengths and weak-
nesses, as well as scenarios when one method is better than the other. To begin,
let us look at observational studies, how they are conducted, and questions we
need to be concerned about in order to properly conduct such a study.
In order to ensure that these assumptions are met, we have to use a ran-
dom sample of subjects from our population. There are many different ways to
choose a random sample, but for this chapter we will concentrate on the simplest
form of a random sample: the Simple Random Sample (SRS).
In an SRS, we start—as with all forms of random sampling—with our sam-
pling frame. Each subject in the sampling frame is assigned a number or some
form of identifier, and then we select n numbers at random to create our sam-
ple. In this way, each member of the population has an equal chance of being
selected, and on the average our sample will reflect the population.
An important aspect of the SRS is the selection of random numbers and
how we generate them. We cannot just choose n numbers that come to mind to
make up our random numbers. They must be generated through some random
process—such as drawing numbers out of a hat—or through computational soft-
ware such as R. This is because, as amazing as the human mind is, it is incapable
of replicating true randomness. We will look into how we choose random num-
bers for our sample using R in the next chapter.
While there are many more complicated—and arguably better—sampling
methods, the SRS illustrates one of the foundations of statistical inference: ran-
domness. Further, by looking only at the SRS, we are able to see possible biases
existent in sampling more clearly, a key consideration in collecting a sample.
In the 1948 United States presidential election, polls predicted a landslide victory for candidate Thomas Dewey over incumbent Harry Truman.
Despite the seeming certainty of the result, President Truman was reelected in a
landslide, winning the popular vote by 4.5%.
How were the polls wrong, leading to one of the most famous pictures in
American political history? It turns out that the polls leading up to the elec-
tion were tainted twice by sampling bias. First, several major polls stopped
collecting data two weeks before election day, not allowing them to track the
ever-shifting public opinion toward Truman. In fact, some—admittedly less
scientific—polls that showed a Truman lead were discontinued due to their “ir-
regular” results. Even more importantly, several major polls showed particular
bias toward Dewey because they were conducted via telephone. At the time
telephones were more commonly found in wealthy homes, and wealthy individ-
uals were more likely to vote for Dewey, resulting in poll results showing clear
Dewey leads right up until election day.
A second type of bias common in surveys is nonresponse bias. Nonresponse
bias is introduced because people did not respond to a survey, possibly for a
common reason. This extends beyond the common complaint “Why do tele-
marketers or survey people always call around dinner time?” Nonresponse bias
can be a symptom of some common trait among the nonresponders. For exam-
ple, if a researcher fails to reach individuals via a phone survey in the evening,
it may be that the nonrespondents are, say, forced to work a second job in the
evenings due to lower incomes.
In Lahaut et al. (2002) [7], researchers sent out surveys to 310 individu-
als trying to assess rates of alcohol consumption across various socioeconomic
conditions. Of the 133 responses to their survey, they found that 27.2% of
respondents abstained from alcohol use. At a later date, they were able to con-
tact 80 nonrespondents to their original survey and asked them the same set
of questions. They found that 52.5% of this nonrespondent group abstained
from alcohol use, a drastic difference from their original survey results. Had the
nonresponse bias been ignored, they likely would have come to very different
conclusions for their study.
A final type of common bias is response bias. Response bias occurs when
respondents give incorrect responses to the surveyor, whether intentionally—
i.e., lying—or unintentionally—i.e., failing to remember the proper response for
them. Response bias is harder for researchers to control, as it entirely depends
on the respondent’s honesty and accuracy.
One instance where response bias partially influenced results comes from
the 2016 presidential election. That year, Hillary Clinton was widely expected to
defeat Donald Trump by a landslide; however, Trump won the Electoral College
and a higher share of the popular vote than expected. Many news outlets [8]
pointed to “Shy Trump Voters,” voters who did not say that they would vote
Trump due to social desirability biases, as a major reason why the polls were so
inaccurate. (See Fig. 3.1.)
FIGURE 3.1 County by county results from 2016 presidential election. Some analysts believed
that “Shy Trump Voters,” an example of response bias, swung the election.
When creating a survey, you can only directly control the sampling bias of
your survey. Response and nonresponse bias entirely come from the respon-
dents, and are out of a surveyor’s control. With this in mind, while there are
some methods to correct these latter two biases—such as demographic weight-
ing for nonresponse bias—the majority of efforts to correct biases are directed
toward avoiding any form of sampling bias.
Designed experiments, unlike observational studies, can establish that changes in an explanatory variable cause something in the response. Observational studies cannot establish that one variable causes another, only that they are related in some way.
This is because experiments control for confounding factors through ran-
domization in a way that observational studies are unable to do. Remember the
example from earlier about confounding variables, in which it appeared that us-
ing nightlights earlier in life led to myopia in children. The observational study
did not consider the confounding variable of myopic parents, which ultimately
did explain the myopia in the children. The experiment assigned families to ei-
ther the nightlight or no nightlight group randomly. Some families with myopic
parents would have been forced to not use a nightlight, despite their possible
preference. This would have lessened the chance that the confounding variable
would have made it appear that myopia was related to nightlight use.
So, with these concerns in mind, we can choose between observational stud-
ies and experiments accordingly. If you want to establish causality between your
explanatory variable and response, you have to use an experiment. If you only
want to establish that the variables are related in some way, an observational
study will likely be sufficient, as well as less expensive and less prone to ethical concerns than an experiment.
3.6 Conclusion
While not often thought of as statistics proper, data collection is an essential
portion of the process of statistics. As such, it needs to also be considered with
a statistical mindset. Thus, we have a myriad of considerations, including cost,
ethics, the types of data collected, the method of collection, and the ways that
the sample could have been biased. Each of these considerations can greatly
affect the questions that can be answered using our data, particularly focusing
on whether or not we are able to make causal claims about our data. Ignoring
these aspects of statistics can lead to lost time, money, and the inability to answer
desired questions.
Once we have data, our next concern will be gaining an understanding of
our data. This is done through looking at our variables, summaries both numeric
and graphical, through a process called Exploratory Data Analysis, or EDA.
Chapter 4
R tutorial: subsetting data
4.1 Introduction
In this chapter, we will concentrate on how to choose a random sample from
a sampling frame. In order to do this, we will need to build up a few skills
first. Initially, we will talk about subsetting vectors and data frames, followed
by a little information about how R generates random numbers. Finally, we will
discuss the sample function to select a sample of numbers, and then combine
all the information to select a random sample from a sampling frame. In all of
these examples, it is assumed that you have a sampling frame loaded into R (we
will discuss how to do this in Chapter 5).
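To begin, consider the following vector of data entered into R: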
hours=c(8.84, 3.26, 2.81, 0.64, 0.60, 0.53, 0.37, 0.35, 0.31, 0.24)
This data is the average daily hours spent on the ten most popular daily activities
in 2019, according to the Bureau of Labor Statistics [95]. R subsets data using
the square brackets [] after the vector. You can place any vector indices or logical
statements to select specific elements of the vector. For example, if you wanted
to know the first element of the vector hours, you would type
hours[1]
and hit <Enter>. R will return the value 8.84, or the average number of hours
spent sleeping—the activity with the most time spent on it. With
that in mind, how would you look up the eighth entry of the hours vector, or
the average number of hours spent socializing or communicating in 2019—the
eighth-most popular daily activity? You would type hours[8] and hit <Enter>.
It is possible to look up more than one entry in the vector in a single line of
code. In addition to having a single number inside the square brackets, you can
have a vector with all the entries of interest. For example, if you want to select
the hours spent on the most popular, third-most popular, and ninth-most popular
activities—sleeping, watching television, and participating in sports, exercise,
or recreation—you would use the code
hours[c(1,3,9)]
which would return 8.84, 2.81, and 0.31. What code would you use to look up
the age of father and son John Adams (2nd president) and John Quincy Adams
(6th) at their inaugurations? The code would be
age[c(2,6)]
Finally, you can look up specific values in a vector based on a logical state-
ment. That is, you could look up specifically the entries in the vector that are
greater than some value (or less than, or equal to, etc.). For example, if you
wanted to look up the average hours spent for activities that were specifically greater than an hour, you would use the code
hours[hours>1]
which will return 8.84, 3.26, and 2.81, as these are the only three values in the vector
that are greater than 1. To use logical statements, we need to know the logical
operators that R recognizes.
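The logical operators R recognizes are:

>   greater than
<   less than
>=  greater than or equal to
<=  less than or equal to
==  equal to
!=  not equal to
&   and
|   or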
The last two statements help us make multiple statements to subset the data.
For example, if you wanted the average hours spent in the vector that fall be-
tween 30 and 45 minutes—that is, greater than or equal to 0.5 hours and less
than or equal to 0.75 hours—we would use
hours[hours>=0.5&hours<=0.75]
which would return 0.64, 0.6, and 0.53. As another example, if we wanted to
select the hours that are either less than 15 minutes—0.25 hours—or longer
than 4 hours (not including 0.25 and 4), we would use
hours[hours<0.25|hours>4]
which would return 8.84 and 0.24. So, with this in mind, what code would
you use to find hours that fell between 15 and 30 minutes (including both end-
points)? How about hours that are equal to 0.6 or greater than 2 (not including
2)? For the first, it would be hours[hours>=0.25&hours<=0.5], and the second
is hours[hours==0.6|hours>2].
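Suppose now that this information—each activity's name, average hours, and category—is stored in a data frame called Activities, with one row per activity.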
Now, the data frame is basically just a matrix, so we can select an element of
the data frame by defining a specific row-column combination. So, if we wanted
to choose the category of food preparation, we would want to choose the fifth
row and third column of the data frame, using
Activities[5,3]
which returns the value “Household.” So, say, what code would we use if we
wanted the name of the ninth-most popular activity? The code would be
Activities[9,1]
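We can also select an entire row—every variable for a single observation—by leaving the column index blank. For example, to see all of the information for the tenth-most popular activity, we would use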
Activities[10,]
which returns the activity's name, average hours, and category as its values. We can choose more than one row as well by using vectors in the
row index. For example, if you wanted the information on Sleeping (Row 1),
Socializing (Row 4), and Childcare (Row 7) the code would be
Activities[c(1,4,7),]
So, with this in mind, what code would you use to look up the information
for Watching Television (Row 3) and Housework (Row 6)? The code would be
Activities[c(3,6),].
Thus far, we have concentrated on subsetting data frames using the rows.
However, we can work with the columns in data frames as well. We saw earlier
that data frames are composed of the named vectors we created. The Activities
data frame, for example, is composed of three vectors named Name, Aver-
ageHours, and Category. We may be interested in looking at only individual
vectors. For example, given this dataset, we may only want to look at the cate-
gories of these activities. In R, we can look at columns within a data frame using
the $ operator, specifically
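Data Frame Name$Variable Name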
So, to look at the categories of these popular daily activities, we would use
Activities$Category
which returns Personal, Work, Leisure, etc. How would we look at the average
hours spent on these activities? Activities$AverageHours.
We use this $ variable notation to help us choose rows based on logical state-
ments. For example, we may want to look at the information for all activities that
receive more than an hour daily, using
Activities[Activities$AverageHours>1,]
How would you look at the information for all activities not classified as
Leisure? Activities[Activities$Category!="Leisure",]. Note that this answer in-
cludes having Leisure in quotes, as it is a character value.
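R does not generate truly random numbers, but rather pseudo-random numbers from an algorithm whose starting point is determined by a value called the seed. We can set this seed ourselves through the set.seed function, using the code

set.seed(Seed Number)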
where you supply the number for the seed. If you supply a seed, you will al-
ways get the exact same series of random numbers with that particular seed. For
example, in the R screenshot below, we set the seed to be 8 and then generate
10 “random” numbers between 0 and 1, reset the seed to 8 and generate 10 new
“random” numbers. You will notice that both sets are identical. (See Fig. 4.1.)
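To select which rows of our sampling frame end up in the sample, we can use the sample.int function, which randomly selects a set of numbers from 1 up to a population size. The code is

sample.int(n=Population Size, size=Sample Size)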
which will return a vector of row numbers that will be selected in the sample.
For example, one of the most important datasets in baseball analysis is the
PitchF/x dataset. This dataset contains a large number of variables—including
pitch velocity, revolutions of the pitch, pitch type, where the pitch crossed the
plate, among others—about ever pitch thrown in a regular season Major League
Baseball game since 2007, 8,812,107 after the conclusion of the 2019 season.
Trying to analyze all these pitches would be difficult at best, so we could choose
to analyze a subset—say of 1000 pitches—of this data. If we wanted to choose
which 1000 pitches to analyze (or in other words, which rows of the data frame),
we would use
sample.int(n=8812107, size=1000)
Note that R does not allow the use of commas to break up thousands or mil-
lions. This code would only tell us which rows to use. In practice, we would want
to look up—or possibly save—those specific rows from the sampling frame. We
would do this by subsetting the rows of the data frame using the output of the
sample.int function. Assuming that the entire PitchF/x dataset is stored in a data
frame called pitch, we would specifically use
pitch[sample.int(n=8812107, size=1000),]
to print out the 1000 rows that we selected. Now, generally we want to save those
chosen rows in a data frame, so we would save that like we would generally save
anything in R. The code could look something like
sample=pitch[sample.int(n=8812107, size=1000),]
which would save the selected rows to a data frame called sample. Say that we wanted to select a sample of 5000 pitches from the same pitch data frame as before and save it to a data frame called pitchsample. What would the code be? pitchsample=pitch[sample.int(n=8812107, size=5000),].
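Finally, if you are ever unsure about how a function works, R has built-in help files for its functions, accessed by typing a question mark followed by the function name. For example, to get help with sample.int, we would type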
?sample.int
R will then open up a window that provides a description of the function, its
inputs, and its outputs. Even further, this window will provide sample code that
should hopefully make it easier to accomplish what your analysis intends to do.
1. What code would you use to select the first, third, tenth, and twelfth entries
in the TopSalary vector from the Colleges data frame?
2. What code would you use to select the elements of the MedianSalary vector
where the TopSalary is greater than $400,000?
3. What code would you use to select the rows of the data frame for colleges
with less than or equal to 1000 employees?
4. What code would you use to select a sample of 5 colleges from this data
frame (there are 14 rows)?
5. What code would you use to select the rows of the data frame that have
GDP per capita less than 10000 and are not in the Asia region?
6. What code would you use to select a sample of three nations from this data
frame (There are 10 rows)?
7. What code would you use to select which nations saw a population percent
increase greater than 1.5%?
8. What code would you use to select the rows of the data frame where the
host nation was also the medal leader?
9. What code would you use to select the rows of the data frame where the
number of competitors per event is greater than 35?
10. What code would you use to select the rows of the data frame where the
number of competing nations in the Winter Olympics is at least 80?
4.8 Conclusion
Where we previously had the ability to input data into R, we now add the ability
to pick off individual data values, columns, and rows within a dataset. This will
allow us to eventually compare groups in order to see if and how they differ.
However, at this time we still have to enter our data ourselves. We will
next endeavor to load pre-entered datasets into R, either through csv files or
libraries designed for R.
Chapter 5
R tutorial: libraries and loading data into R
5.1 Introduction
Once we have collected data, our next step in the process of statistics is to begin
analyzing our data through exploratory data analyses. In very few cases, we can
analyze data by hand because large datasets make calculation by hand tedious if
not impossible. As such, we need to be able to load data into R to take advantage
of the program's functions and computing ability. This chapter will focus on
libraries in R—which hold data as well as specific functions that help make
analysis easier—and loading data into R—either from libraries or csv files.
5.2 Libraries in R
Because R is entirely open-source software, anyone can create packages of
user-defined functions to help make analysis easier. These packages are called
libraries and are mostly stored on the CRAN. These libraries can be downloaded
and installed in R and then called into use at any time.
The first step to using a library is downloading and installing it; we will demonstrate this through installing the MASS package [10] into R.
This process of downloading and installation differs for Windows machines and
Macs (the process is much easier for Macs), so we will detail both here, begin-
ning with the Windows machine.
In order to install packages in R, you must run R in administrator mode on
your computer. We do this by right-clicking on the R icon and selecting “Run as
Administrator” from the menu. (See Fig. 5.1.)
After this, once R opens up, we click on the “Packages” tab on the top menu
and select “Install Packages” from the drop-down menu. (See Fig. 5.2.)
After this, we need to select a CRAN mirror. Basically, all this does is define which saved version of the CRAN—these mirrors are consistently updated—we download from. The choice of CRAN mirror usually has no effect on installing the packages. (See Fig. 5.3.)
After this, we are given a full alphabetical list of libraries available in R.
Scroll down to select MASS and click okay. The package will install and can be
called upon whenever the user needs it. (See Fig. 5.4.)
To call a library—and its functions and datasets—into use in R, we use the
library command. The code for this is
library(Library Name)
And then hit <Enter>. Unless there is an error in your code, R will just show
a blank line and you then can use any of the functions or datasets stored in the
library. So, to call on the functions in the MASS library, we would load the MASS
library using
library(MASS)
For Mac operating systems, the process is considerably simpler. To begin, open
the R application regularly, and click the Packages & Data tab at the top. From
the dropdown menu, select Package Installer. (See Fig. 5.5.)
In the search box, type the name of the package (in this case MASS) and
click Get List. This will return all packages with names including the string of
characters that you searched for. Click on the package you want to install, and
before clicking Install Selected, check the box that says Install Dependencies.
Some R libraries depend on other libraries to work, and this last step ensures
that all the other dependent packages are installed as well. (See Fig. 5.6.)
At this point, the procedure is identical to the Windows listed above. To
call a library—and its functions and datasets—into use in R, we use the library
command. The code for this is
library(Library Name)
And then hit <Enter>. Unless there is an error in your code, R will just show
a blank line and you then can use any of the functions or datasets stored
in the library. So, to call on the functions in the MASS library, we would
use
library(MASS)
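In addition to functions, many libraries contain datasets that we can load into the R workspace. To do this, we use the data function, typing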
data(Dataset Name)
and then hitting <Enter>. As an example, consider the nlschools dataset in the
MASS library [10,11]. This dataset is the test scores for over 2000 8th grade
students in the Netherlands. To load this, we would use the code
data(nlschools)
and then hit <Enter>. There is now a data frame in the workspace named
nlschools that can be used for analysis.
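Data that does not live in an R library is most often stored in a comma-separated value (csv) file. To load a csv file, R provides the read.csv function, which takes as its input the location of the file on your computer: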
read.csv(File Location)
The file location can occasionally be difficult to find, and even when known
there are opportunities to make a mistake typing in the file pathway into the
function. With that in mind, we will use the file.choose function inside read.csv.
The code could look something like
data=read.csv(file.choose())
This will cause R to open a window where you are able to navigate to your csv
file and select it to be loaded into R. It is important to note that you have to
save the loaded data frame as something, otherwise the read.csv function will
just print out the dataset, making it unusable for analysis. So, in the code
above, the data frame that is read into R is saved under the name data, but in
practice you can choose any name you want for the dataset assuming it follows
the naming conventions we laid out earlier.
5.6 Conclusion
Up until this point, any data that we wanted to work with in R would have to
be entered by ourselves. This would be a very tedious process, especially as
datasets get larger. Through the use of libraries and the read.csv function, we
are able to easily load datasets into the R workspace. I might add that R is very
flexible in the forms of data that it can work with. With the right functions and
knowledge, R can get data from PDF files, text files, and web pages.
Now that we have ways to load data into R and subset the data into groups,
we can now move into the portion of statistics that comes to mind when people
discuss the subject: analyzing data. Doing so in R will open up a wide array of
functions and graphs to us, allowing us to express our data in more understandable terms.
Chapter 6
Exploratory data analyses: describing our data
6.1 Introduction
Once we have collected data, the next step in the statistical process is begin-
ning to explore the data. Now, we often cannot explore the entire dataset, as the
number of subjects in our study can be quite large. As such, we need to work
with summaries of our data, both graphical summaries and numeric summaries.
These summaries, referred to as descriptive statistics, are not the end of the sta-
tistical process—which does not really have an end as it is continually repeating
itself. Further, we cannot—or should not—make definitive decisions from de-
scriptive statistics. Rather, descriptive statistics can give us a general idea what
is going on in the dataset and point us in the direction of interesting relationships
between variables in the dataset.
6.2 Parameters and statistics
For example, we may be asking, “On the average, does the population of individuals who engage in exercise
experience a decrease in blood pressure level?” This single number that de-
scribes the entire population is called a parameter or population parameter.
It is considered fixed and unknown, and in order to go about answering ques-
tions through the process of statistics we need to estimate the value(s) of the
parameter(s) of interest. (See Fig. 6.1.)
6.3 Parameters, statistics, and EDA for categorical variables
For categorical variables, the most common summary is the frequency table. A frequency table gives the counts
for the number of subjects that fall in each category, and occasionally the total
number of subjects in the study. From the frequency table, we can easily get the
sample proportions for each category in our categorical response.
For example, in 2016, the Pew Research Group conducted a survey [14]
of 1488 people to see through what forms Americans consume books. They
found that 395 people did not read books, 577 people only read print books, 425
people read both digital and print books, and 91 people only read digital books.
The frequency table would be
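Format              Count
No books            395
Print only          577
Print and digital   425
Digital only        91
Total               1488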
What would the sample proportion be for the people who read only print
books? We would divide the number of people who read print-only books by
the total sample size, so the sample proportion for print-only readers would be
$\hat{p} = \frac{577}{1488} = 0.378$.
Oftentimes, we are interested in how two categorical variables are related.
With this in mind, a frequency table, which counts the number of subjects in
each category for a single variable only, is an inadequate summary. When we are
interested in the relationship between two variables, the best descriptive statistic
is a contingency table. A contingency table is an m × p table where the counts
in each cell of the table give the number of subjects who fall in the specific
combination of categories defined by the row-column combination. In addition,
contingency tables give the marginal totals for each row and column, which
gives the subjects who fall into a row or column category ignoring the other
dimension of the table, essentially creating a frequency table for either rows or
columns.
In 2017, the Pew Research Group conducted a survey [15] of 3701 people to
see what causes Americans to be active consumers of science news. Among the
variables they looked at were ethnic group, including black, white, and Hispanic.
These subjects were then asked if they were an active, casual, or uninterested
consumer of science news. The contingency table for the data was
What would the frequency table be for the columns (Defined by the question
“Are you an active, casual, or uninterested consumer of science news?”), and
what would the sample proportion be for subjects who are Hispanic and who
are active science consumers? If we look at the column totals, we would find that
the frequency table for the columns is
Additionally, by finding the Hispanic row and Active column in the table,
we would find that the sample proportion of individuals who are Hispanic and active science consumers is $\hat{p} = \frac{89}{3701} = 0.024$.
Calculating frequency and contingency tables by hand is a tedious task. For-
tunately, R has a function that creates both of these tables for us. The table
function in R takes in one or multiple categorical variables and outputs the de-
sired frequency or contingency table. The code used for this is
table(variable1, variable2)
where variable1 is the first variable desired in the table and variable2 is the
optional second variable, used if you are creating a contingency table. As with
all the functions discussed in this chapter, more details and examples will be
included in Chapter 7.
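As a small sketch—using a hypothetical vector of five responses rather than the full Pew data—the function works like this:

reading=c("Print", "None", "Print", "Digital", "Both")
table(reading)

which returns the count of each distinct value: Both 1, Digital 1, None 1, Print 2.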
Visual summaries for categorical variables are very popular in the media, es-
pecially pie charts and bar charts. We will not discuss these charts because they
often cause multiple problems in interpretation, particularly pie charts. Edward
Tufte, a statistician noted in the field of data visualization, once stated that “Pie
chart users deserve the same suspicion and skepticism as those who mix up its
and it’s or there and their.” In short, the best descriptive statistics for categorical
data remain sample proportions, frequency tables, and contingency tables.
             No Internet   Internet
None of it        48          334
Not much          66          990
Some of it        62          848
Most of it        11          166
6.4 Parameters, statistics, and EDA for a single quantitative variable
Before defining statistics for a quantitative variable, we need some notation for our samples. Say that we have a sample of size n. We assign the values of a
variable that we collect to be
$$x_1, x_2, \ldots, x_n$$
where $x_1$ is the value of the variable for the first subject, $x_2$ is the value of the variable for the second subject, $x_n$ is the value of the last subject, etc. In addition to ordering the values of our variable by subject number, we could alternatively order the variable values in our sample from smallest to largest. We notate this by
$$x_{(1)}, x_{(2)}, \ldots, x_{(n)}$$
where $x_{(1)}$ is the smallest value of the variable in the sample, $x_{(2)}$ is the second-smallest value in the sample, $x_{(n)}$ is the largest value in the sample, etc.
Now that we have defined our sample, we can talk about the sample statistics.
The sample mean (notated x̄), our best estimate for the population mean, is
merely the average value for all values in the sample, so
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}(x_1 + x_2 + \ldots + x_n)$$
The sample median (notated x̃), our best estimate for the population me-
dian, is the median value for the variable in the sample, so there are two possible
formulas for the sample median. If the sample size is odd, the sample median is
$$\tilde{x} = x_{\left(\frac{n+1}{2}\right)}$$
or the point in the sample where half of the data—$(n-1)/2$ data points—are greater than and less than that value. If the sample size is even, the sample median is
$$\tilde{x} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$$
or the average between the two values in the sample closest to the middle.
For example, Nigel Richards is generally considered one of the best Scrabble
players in the world. At a 2014 tournament, he put up scores of
550, 526, 458, 440, 440, 389, 432, 508, 433, 433, 548, 497, 531, 484.
The sample mean for this data would be $\bar{x} = \frac{1}{14}(550 + 526 + \ldots + 484) = 476.36$. The sample median would require first ordering the scores from smallest to largest, from 389 to 550. Then, as the sample size is even at n = 14, we find
the two values closest to the middle of 458 and 484, and average them to get the
median of x̃ = 471.
389, 432, 433, 433, 440, 440, 458, 484, 497, 508, 526, 531, 548, 550
In R, the mean and median are calculated using the mean and median func-
tions respectively. The code to calculate the average and median for some data—
named variable below—would be
mean(variable)
median(variable)
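For example, entering Nigel Richards' scores from above and computing both summaries:

scores=c(550, 526, 458, 440, 440, 389, 432, 508, 433, 433, 548, 497, 531, 484)
mean(scores)
median(scores)

returns 476.3571 and 471, matching the hand calculations.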
Now, because there are two summaries for the center of the data, a reason-
able question would be, which is better? Both the sample median and mean have
properties that make them the better option. The sample mean is not robust,
which means it is highly sensitive to outliers. An outlier is any strange point in
the dataset far away from other values. We will discuss them more later on. If a
single value in a sample is changed to become an outlier, the sample mean will
drastically change. On the other hand, the sample median is robust, which im-
plies that the sample median is not highly affected by outliers. Generally speak-
ing, we like our statistics to be robust, so this is a point in the median’s favor.
Despite this, generally speaking we use the sample mean to summarize the
center of a variable. This is because the mean has several theoretical properties
that make it very convenient to do statistical inference, much more so than the
sample median. We will not discuss these properties now, but will return to them
in later chapters. It suffices to say that unless otherwise specified, we will use
the sample mean to describe the center of a variable.
6.4.3 Statistics for the spread of a variable
In addition to the center of a variable, we want to describe its spread. The building block for this is the deviation of an observation: its distance from the sample mean,
$$d_i = x_i - \bar{x}$$
Now, while deviations are useful for individual data points, we want to have
a single number to summarize the spread for the variable in the entire sample.
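This summary is the sample variance—notated $s^2$, our estimate of the unknown population variance—built from the squared deviations:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$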
You will notice that this is not an average in the strict sense, as the sum of
the squared deviations is divided by n − 1 and not n. This is to make the sample
variance an unbiased estimator of the population variance. Directly related to
the variance is the standard deviation. The standard deviation is the square
root of the variance, and also comes as an unknown population parameter—notated $\sigma$—and a sample statistic—notated $s$. So $\sigma = \sqrt{\sigma^2}$, and $s = \sqrt{s^2}$. If the
standard deviation is known, the variance is known and vice versa.
R has two simple functions to calculate the variance and standard deviation,
the var and sd functions, respectively. All that is required is the data that we
want the variance and standard deviation for, using the code
var(variable)
sd(variable)
Because the sample variances and sample standard deviations are a func-
tion of the squared deviations, they will always be greater than or equal to zero.
A sample variance or sample standard deviation of 0 implies that every obser-
vation in the sample has the same value. As the sample variance or standard
deviation increases, the spread of the data is larger.
For example, the Sweet Briar soccer team allowed the following number
of goals (from fewest to most) in the 15 games of the 2017 season: 1, 1, 2,
2, 4, 5, 6, 6, 7, 7, 10, 10, 10, 10, 11. If we were to find the variance—and
standard deviation of the number of goals allowed—we first need to find the
sample average, which is x̄ = 6.133. Our sample variance will then be
$$s^2 = \frac{1}{15-1}\left((1 - 6.133)^2 + (1 - 6.133)^2 + \ldots + (10 - 6.133)^2 + (11 - 6.133)^2\right) = 12.69524$$
Our sample standard deviation will be the square root of our sample variance, so $s = \sqrt{12.69524} = 3.563038$. While the variance and standard deviation are
the most commonly-used measures of spread in the data, neither are robust to
outliers. If we changed our largest number of goals allowed from 11 to 50, the
variance and standard deviation would drastically change. This brings up the
question, are there any robust measures of spread that provide alternatives to the
variance or standard deviation?
The Interquartile Range (IQR) is just such a robust measure of spread in
the dataset. The IQR is defined to be the distance between the 75th and 25th
percentile in the sample. The 75th percentile is the point in the dataset where
75% of the data is less than that value and the 25th percentile is the point in the
dataset where 25% of the data is less than that value. When calculating the 75th
percentile, we essentially find the median of the upper half of the dataset, while
the 25th percentile is the median of the lower half of the dataset.
So let us look at our Sweet Briar goals allowed again. As a reminder, the
team allowed the following number of goals in 2017: 1, 1, 2, 2, 4, 5, 6, 6, 7, 7,
10, 10, 10, 10, 11. To find the 75th and 25th percentiles, we need to first find the
median in order to split the dataset into the upper and lower halves. Since there
are 15 observations, the median will be $x_{(8)} = 6$. This splits the dataset into two halves:
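Lower half: 1, 1, 2, 2, 4, 5, 6
Upper half: 7, 7, 10, 10, 10, 10, 11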
The 75th percentile is the median of the upper half, or $x_{(12)} = 10$. The 25th percentile is the median of the lower half, or $x_{(4)} = 2$. So the IQR will be 10 − 2 = 8. R uses the function IQR to calculate the IQR.
IQR(variable)
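Applied to the goals-allowed vector from earlier, IQR(goals) computes this spread directly—though be aware that R's default quantile algorithm interpolates between observations, so its answer can differ slightly from the median-of-halves value found by hand.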
For example, consider the voter turnout proportions recorded for 11 precincts, given below.
Precinct Turnout
1 0.409
2 0.427
3 0.439
4 0.477
5 0.461
6 0.475
7 0.524
8 0.52
9 0.445
10 0.363
11 0.399
To create the histogram, we need to decide on the size of the bins. For this
data, we can use the intervals from 0.35 to 0.4, 0.4 to 0.45, 0.45 to 0.5, and 0.5
to 0.55. There are two observations in the first bin, four in the second bin, three
in the third bin, and two in the fourth. Graphing this yields the histogram seen
in Fig. 6.2.
R uses the hist function to create histograms. There are multiple extra options
that we will discuss in Chapter 7, but the most basic histogram can be created
by solely inputting the data.
hist(variable)
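For the turnout data, a sketch of such a call, supplying the bins chosen above through the breaks option (an option covered in Chapter 7):
turnout <- c(0.409, 0.427, 0.439, 0.477, 0.461, 0.475,
             0.524, 0.52, 0.445, 0.363, 0.399)
hist(turnout, breaks=seq(0.35, 0.55, by=0.05))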
When we plot a histogram, there are several things to look for that tell us
information about the data. The first is the shape of the histogram, either sym-
metric or skewed. Symmetric histograms are what the name implies: the data are evenly split and similarly shaped on either side of the middle value. Note, however, that symmetric histograms do not have to be perfectly symmetric, only roughly so. Skewed histograms are nonsymmetric histograms and can be
either skewed right or left, with the direction of the skew being whichever side
has the long tail of data.
The second thing to look for is the number of modes, or local high points,
in the histogram. The most common numbers of modes are one (unimodal) and
two (bimodal). On very rare occasions, you will see a trimodal histogram. Modes that are very close together may in fact be one mode, depending on how the values are binned together. (See Fig. 6.4.)
The last three things to look for are where the center of the data is located,
how spread out the data is, and if there are any outliers. Generally, we use the
sample mean and sample variance to describe the center and spread, but how do
we identify outliers? We have mentioned several times that various statistics and
summaries are sensitive to outliers without talking about how to identify poten-
tial outliers—stated this way because we do not necessarily know for certain if
a data point is an outlier.
FIGURE 6.3 Various shapes of histograms (from left to right) Symmetric, Skewed Right, and
Skewed Left.
FIGURE 6.4 Number of modes of histograms (from left to right) Unimodal and Bimodal.
One common method for flagging potential outliers is the z-score, which measures how many standard deviations an observation lies from the sample average: zᵢ = (xᵢ − x̄)/s. Generally speaking, if |zᵢ| > 3—that is, the observation is more than
three standard deviations from the average—we will flag the observation as a
potential outlier.
For example, on average, the Old Faithful geyser [10,19] has 72.31 minutes in between eruptions with a standard deviation of 13.89 minutes. If you have been waiting for 2 hours for an eruption, is this observation possibly an outlier? Yes, as a 2-hour wait has a z-score of (120 − 72.31)/13.89 = 3.43, which has an absolute value greater than 3.
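This z-score is, of course, quick to compute in R:
(120 - 72.31) / 13.89   # z-score for a 2-hour (120-minute) wait, about 3.43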
While z-scores can be useful for identifying outliers, they come with a major
caveat: They can only be used to identify outliers for bell-shaped data. Bell-
shaped data is a specific type of symmetric data that is ultimately very important
to statistics, and we will discuss it later on in the course. The symmetric his-
togram in Fig. 6.3 is an example of bell-shaped data, while Fig. 6.5 shows an
example of bell-shaped data with a single outlier.
If a dataset is not bell-shaped, we cannot use the z-score to identify outliers, because the |zᵢ| > 3 cutoff is only justified for bell-shaped data. As many datasets are not going to be bell-shaped, we need another method
to identify potential outliers.
Such a method can be found using the Interquartile Range. Once we find the
75th percentile (also referred to as the third quartile or Q3) and 25th percentile
(also called the first quartile or Q1), we calculate the Interquartile Range as IQR = Q3 − Q1. Then we can label an observation as a potential outlier if it is greater than Q3 + 1.5 × IQR or less than Q1 − 1.5 × IQR. This rule is reliant on the robust IQR rather than the nonrobust sample average and sample standard deviation. But both methods have their uses and instances where one should be chosen.
If you made a histogram of the Old Faithful data, you would see that it is decidedly not bell-shaped. With that in mind, we need to use the IQR to identify outliers. The 75th percentile (or third quartile) is at 82 and the 25th percentile (first quartile) is at 58, so the IQR is 82 − 58 = 24. This means that any data points greater than 82 + 24 × 1.5 = 118 or less than 58 − 24 × 1.5 = 22 are potential outliers. So, if you have been waiting for two hours, this would still be flagged as an outlier, since 120 > 118.
850, 850, 1000, 810, 960, 800, 830, 830, 880, 720,
880, 840, 890, 770, 910, 720, 890, 810, 870, 940
To explore the association between a categorical predictor and a quantitative response, we can simply begin by comparing sample averages or sample medians across the levels of the categorical predictor to look for major differences. However, this does not take the variability of the response into account, so we need to bring that into the picture.
While there are many methods to explore these relationships, we are go-
ing to focus on a visual comparison, specifically boxplots. Boxplots allow us
to visually see if there is overlap for a response among multiple groups, but
they can be used to visualize single variables as well. The boxplot is basically a
visual display of the five number summary (5NS) of a variable. The five num-
ber summary is a simple summary of the values of a variable, consisting of
the minimum, maximum, median, first quartile (Q1, 25th percentile), and third
quartile (Q3, 75th percentile). Again, a boxplot is just a visual plotting of this
five number summary, and the steps are quite simple.
1. Calculate the five number summary.
2. Determine if there are any outliers using the interquartile range (IQR).
3. On a number line, draw three vertical lines at the first quartile (Q1), median,
and third quartile (Q3). Form a box with these three lines.
4. Draw a horizontal line from Q1 down to the smallest nonoutlier value.
5. Draw a horizontal line from Q3 up to the largest nonoutlier value.
6. Mark all outliers with a symbol.
R uses the function summary to calculate the five number summary. This
function also returns the sample average as well as the number of missing data
points if any exist.
summary(variable)
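As a quick illustration, applying summary to the goals-allowed vector from earlier in the chapter—summary(goals)—would report a minimum of 1, a median of 6, a mean of 6.133, and a maximum of 11, along with the two quartiles.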
Looking at the side-by-side boxplots in Fig. 6.8, there appears to be an association between Day/Night games and attendance. This makes some intuitive sense, as day games tend to be held on weekends, holidays, or special occasions—such as Opening Day—when people are more likely to be able to attend games.
The boxplot function in R allows us to create both individual and side-by-
side boxplots. We’ll discuss the particulars of this function in Chapter 7, but the
basic boxplot can be created using
boxplot(variable)
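Although we will save the details for Chapter 7, side-by-side boxplots use R's formula notation, with the quantitative response to the left of a tilde and the categorical predictor to the right—for instance, boxplot(response ~ group), where response and group stand in for your own variables.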
FIGURE 6.8 Side-by-side boxplot for home attendance for the 2019 Orioles day and night games.
30. Does there appear to be an association between species and sepal length?
FIGURE 6.9 A scatterplot of cat heart weight (Y) versus cat body weight (X) [10,21].
R has many options for creating scatterplots, but using the code below, we can create a simple scatterplot where variable1 will be the x-coordinate and variable2 will be the y-coordinate.
plot(variable1, variable2)
Just like when using a histogram to consider the shape of a variable, there
are specific things we look for in a scatterplot. The first thing we look for is
some sort of a pattern in the data. This could be linear, quadratic, or any other
pattern we can think of. (See Fig. 6.10.)
Second, we look to see if there is an increasing or decreasing relationship
between our predictor and response. Increasing implies that as our predictor
gets larger, so does our response. On the other hand, decreasing implies that as
our predictor gets larger, our response gets smaller. (See Fig. 6.11.)
The third thing we look for is any anomalies, specifically outliers, in our pre-
dictor or response. Existence of these outliers can drastically change or possibly
invalidate our results. Other anomalies may include clusters in our predictor,
response, or both. (See Fig. 6.12.)
Finally, assuming the relationship between the predictor and response is lin-
ear, we look for how strong the association is. Stronger associations between
predictor and response imply the scatterplot looks much more like a line, while
a weak association between predictor and response implies that the scatterplot
looks like random scatter. (See Fig. 6.13.)
Now, using “strong” and “weak” to denote the strength of the association
between a predictor and the response can be very subjective and vague. We
need a better way to describe this association, specifically with a statistic that
can be calculated from our sample.
This better way to describe associations between variables is correlation.
It exists as a population parameter (ρ, the fixed and unknown true correlation
between two quantitative variables for the entire population) and as a sample
FIGURE 6.10 Four possible scatterplot patterns (clockwise from top left): Linear, Quadratic, Ex-
ponential, No Pattern.
FIGURE 6.11 Scatterplot relationships (from left to right): Increasing relationship, No relation-
ship, Decreasing relationship.
statistic (r, the calculated correlation between two quantitative variables in our
sample). Correlations are able to summarize both the strength of the associa-
tion between two variables as well the direction (Increasing/Decreasing) of the
relationship between the predictor and response.
FIGURE 6.12 Scatterplot anomalies (from left to right): Outlier in the predictor, Outlier in the
response, Clustered Data.
FIGURE 6.13 Scatterplot strength of association (from left to right): Weak association, Strong
association.
The correlation is ultimately the average of the product of z-scores for our
two variables in question. That is,
r = (1/(n − 1)) Σ [(xᵢ − x̄)/sx] × [(yᵢ − ȳ)/sy]
where the sum runs over all n observations, x̄ and ȳ are the sample means for x and y, and sx and sy are the sample standard deviations for x and y. So, if we are calculating the correlation by hand, we follow these steps:
1. Calculate the sample means for x and y
2. Calculate the sample standard deviations for x and y
3. Calculate the z-score for each observation of x
4. Calculate the z-score for each observation of y
5. Multiply the z-score for x and the z-score for y together for each observation
6. Add all the multiplied z-scores together and divide by the sample size n
minus 1.
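These steps translate almost directly into R; a sketch, where x and y stand for any two numeric vectors of equal length:
sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (length(x) - 1)
This returns the same value as R's built-in cor(x, y).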
An agricultural researcher in the 1980s was looking into the effect of over-
seeding on crop yield [22,23]. He planted the fields with various seed rates and
recorded the resulting yield.
To find the correlation, we first need to find the sample averages for seed rate
(100) and yield weight (19.32). Next, we need the sample standard deviations
for seed rate (39.53) and yield weight (1.29). Then we need to calculate the z-scores for each observation and multiply them together. Finally, we add all of these products together and divide by 5 − 1 = 4 to get −3.94/4 = −0.985 for our correlation.
As is quite obvious, this process is very tedious to do by hand. Thus, we often
turn to the cor function in R. Given two variables that we want to correlate, this
function will return the correlation as calculated above.
cor(variable1, variable2)
The sample correlation r tells us first the direction of the relationship, where
if r is positive, this implies that there is an increasing relationship between the two
variables. r is bounded between −1 and 1, and can also tell us the strength of
the association between our two variables. If |r| is close to 1, this implies a very
strong relationship between the two variables, while if |r| is close to 0, there is
a weak relationship between the variables. (See Fig. 6.14.)
There are two main caveats when dealing with correlations. The first is that
correlations are only valid and interpretable for linear relationships between two
variables. If the two variables we are interested in have any other relationship—
quadratic, cubic, exponential, etc.—we cannot interpret our correlations and our
FIGURE 6.14 Plots of various correlations (clockwise from top left): r = 0, r = 0.99, r = −0.25, r = −0.75.
calculated value is not valid. In addition, the correlation is not a robust statistic,
so outliers in either variable will strongly affect the correlation value.
The second caveat—and I cannot stress this enough—is that if two variables
are associated (or have a high correlation) this does not imply that one variable
causes the other. Put another way, Correlation does not imply causation. The
only thing that can establish causation is a designed experiment like the ones
we discussed earlier. For proof, I would suggest visiting the website Spurious
Correlations [24], which is filled with many instances of pairs of highly corre-
lated variables that very likely have no causal relationship. For example, while
the marriage rate in Virginia is incredibly highly correlated with the per capita
consumption of high fructose corn syrup (r = 0.982) [25], we would in no way
expect high marriage rates in Virginia to cause the per capita consumption of
corn syrup to increase. For another example, while the number of lawyers in
Virginia is incredibly highly correlated with the per capita consumption of mar-
garine (r = −0.946) [26], we would in no way expect high numbers of lawyers
in Virginia to cause the per capita consumption of margarine to decrease. The
lesson here is, again, correlation does not imply causation. (See Fig. 6.15.)
Ultimately, when looking at associations between two quantitative variables, we need to use both scatterplots and correlations to understand the relationship between predictor and response. Beyond that, we need to use
FIGURE 6.15 Plots of spurious correlations (from left to right): Marriage rates in Virginia versus
per capita consumption of corn syrup (r = 0.9816), number of lawyers in Virginia versus per capita
consumption of margarine (r = −0.9461).
a fair bit of common sense to ensure that we do not try to claim more than the
data allows us to claim.
32. Given the following data, calculate the correlation between the two vari-
ables [27].
33. Given the following data describing the amount of wear on each shoe in a
pair of shoes, calculate the correlation between the two variables [85].
Shoe A Shoe B
13.2 14.0
8.2 8.8
10.9 11.2
14.3 14.2
10.7 11.8
6.6 6.4
9.5 9.8
10.8 11.3
8.8 9.3
13.3 13.6
6.10 Conclusion
Exploratory data analysis takes many forms, from numerical summaries of a
single variable to graphical summaries of multiple variables. Each of these sum-
maries has its uses, and they all share a goal of trying to make our data more
understandable. However, in choosing a summary, we need to be sure that it is appropriate and consider how the summary will be affected by the data as a whole.
No matter how useful our exploratory analyses may be, they cannot be the
foundation of decision-making on their own. This is because these summaries
are calculated from random samples or randomized experiments, and as such
will have a certain amount of variability ascribed to them. We need to account
for this variability, which is the basis of statistical inference.
Chapter 7
R tutorial: EDA in R
Contents
7.1 Introduction 73
7.2 Frequency and contingency tables in R 73
7.3 Numerical exploratory analyses in R 74
7.3.1 Summaries for the center of a variable 74
7.3.2 Summaries for the spread of a variable 75
7.3.3 Summaries for the association between two quantitative variables 76
7.4 Missing data 77
7.5 Practice problems 78
7.6 Graphical exploratory analyses in R 78
7.6.1 Scatterplots 78
7.6.2 Histograms 80
7.7 Boxplots 82
7.8 Practice problems 84
7.9 Conclusion 85
7.1 Introduction
In our previous R tutorial, we talked about how to load data into our workspace, using either the libraries available for use with R or loading csv files using the read.csv function. In this tutorial, we will talk about calculating the summaries
discussed in Chapter 6 as well as generating various plots used in visualizing
data.
To create a frequency table for a single categorical variable, we type
table(Variable)
and then hit <Enter>. The “Variable” mentioned can be either the name of an
entered vector or a variable selected from a larger data frame using the $ nota-
tion. R then returns the frequency table of the variable you specified in the table
statement. Consider the quine dataset [10,28] in the MASS library. This data
was an investigation into Australian student absences, with data collected on
each student including ethnicity (aboriginal and nonaboriginal), sex (male and
female), learning ability (average learner and slow learner), etc. If we wanted
to know the ethnic breakdown of the subjects in the survey—defined by the Eth
variable in the quine data frame—we would use the code given below to see that
there were 69 aboriginal students and 77 nonaboriginal students.
table(quine$Eth)
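Note that the MASS library must be loaded first—via library(MASS)—for the quine data frame to be available.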
For contingency tables, the code is almost identical, except instead of one vari-
able in the table we have two, so the code will change to
table(Variable1, Variable2)
Again, these variables can be entered vectors or variables selected from a larger
data frame. Using the same quine dataset, we could look at the contingency
table for student ethnicity—the Eth variable in the quine data frame—and their
learning ability—the Lrn variable in the quine data frame. Our code would be
table(quine$Eth, quine$Lrn)
Notice that R does not give the marginal totals for the rows or columns,
nor the table total. These can be determined by adding the appropriate rows
and columns as desired or taking advantage of other functions in R such as
addmargins.
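For instance, addmargins(table(quine$Eth, quine$Lrn)) would append the row, column, and overall totals to this contingency table.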
To calculate the center of a quantitative variable, R provides the mean and median functions:
mean(Variable)
median(Variable)
In both cases, the “Variable” can either be saved in an inputted vector or saved
within a data frame. So, for example, let us look at the Housing dataset [12,29]
in the Ecdat library. This dataset looks at the characteristics and sales price of
houses in the city of Windsor. If we wanted to find the mean of the sales price
of the houses—the price variable—we would use
mean(Housing$price)
to find that the mean sales price was 68,121.6; using the median function in the same way shows that the median was 62,000. We
can also use the mean and median functions in conjunction with subsetting vec-
tors to compare the means and medians for multiple groups. For example, say
that we wanted to compare the average sales price of houses based on whether
they were in a preferred area—found in the variable prefarea with values of
“yes” and “no”—we would use
mean(Housing$price[Housing$prefarea=="yes"])
mean(Housing$price[Housing$prefarea=="no"])
to see that the average sales price for preferred areas is 83,986.37 versus
63,263.49 for nonpreferred areas. The code would be similar for comparing
medians, using the median function in place of the mean function.
To summarize the spread of a variable, we can again use the summary function—summary(Variable). This will give you the minimum, first quartile, median, third quartile, and
maximum, plus the added mean for a variable. From this, we can calculate the
range as well as the IQR. Alternatively, we could use the range function to get
the minimum and maximum values of a variable or IQR function to calculate
the IQR directly.
range(Variable)
IQR(Variable)
While the summary function does not include variance or standard deviation,
they both are simple to calculate using R. Our variance can be found using the
var function with the code
var(Variable)
and the standard deviation using the function sd with the code
sd(Variable)
To illustrate all these options, we can look at the Pima.te dataset [10,30] in
the MASS library. This data looks at possible indicators of diabetes for females
in the Pima Native American tribe. If, say, we wanted to find the range and
IQR for the plasma glucose level (the glu variable), we would first find the five
number summary by using
summary(Pima.te$glu)
to get that the range is 197 − 65 = 132 and the IQR is 136.2 − 96 = 40.2.
The range and IQR can be confirmed using the code
range(Pima.te$glu)
IQR(Pima.te$glu)
Similarly, we could use var(Pima.te$glu) and sd(Pima.te$glu) to get the sample variance s² = 30.5² ≈ 930 and the standard deviation s = 30.5. Notice that s is the square root of s², as is part of the definition of variance and standard deviation.
To calculate the correlation between two quantitative variables, we use the cor function with the code
cor(Variable1, Variable2)
where both variables could either be an inputted vector or stored in a data frame.
As an example, let us look at the cars dataset [31] stored in the base R environ-
ment. This dataset looks at the relationship between the speed of a car and the
resultant stopping distance—specifically for cars in the 1920s. To calculate the
correlation between the speed—stored in the variable speed—and the stopping
distance—stored in dist—we would use
cor(cars$speed, cars$dist)
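to find a correlation of approximately 0.81—consistent with a strong, increasing linear relationship between speed and stopping distance.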
If a variable has missing values, many functions will return NA unless we tell R to remove the missing observations; for the mean, we add the option na.rm=TRUE, as in mean(Variable, na.rm=TRUE). This process is similar for many other functions, including median, var, and
sd. One notable difference is the cor function. As there are two variables in-
volved in this function, we only want to use observations that have no missing
values, or complete observations. In this case, we would add use=“complete”
to our cor function, with the result looking something like
cor(variable1, variable2, use="complete")
It should be noted that the na.rm and use options can be included in any
instance of mean or cor, but if there are no missing observations these options
will have no effect on the resulting calculations.
7.6.1 Scatterplots
Scatterplots are used to determine the relationship between two quantitative
variables. This is important for many reasons, including identifying the afore-
mentioned linear relationship required for correlation interpretability. The base
code that we use to generate a scatterplot for two variables is
plot(X-variable, Y-variable)
where “X-variable” is the variable that you want to plot along the x-axis, while
“Y-variable” is the variable that you want to plot along the y-axis. So, say that
we wanted to confirm the linearity of the relationship between speed and stop-
ping distance in the cars dataset. The code we would use is
plot(cars$speed, cars$dist)
to generate the left-most plot in Fig. 7.1. As we can see, it is a very spartan plot
and somewhat difficult to read. However, plot has a number of options that can
be included after the variables to improve the look and readability of the plot.
FIGURE 7.1 Left-to-right. A base scatterplot for speed versus stopping distance and an enhanced
scatterplot of the same data using several of the described options in plot.
• xlab, ylab: The axis title for the x-axis and y-axis, respectively. These are
changed through xlab=“Title” and ylab=“Title”. The default axis titles are
the variable names entered in plot.
• main: The main title for the plot. This is changed through main=“Title”. The
default is no title.
• pch: The plotting character, or what the (x, y) points look like when plotted.
The standard options are numbers 1 through 25 with each number being a
different plotting character—shown below in Fig. 7.2. The default value is
pch=1.
• col: The color of the plotted points. This value can be entered as a string—
col=“red”—a number—col=4 results in a blue point—or even a HTML
hex code—col=“#8B1F41” for Chicago Maroon. The default value is
col=“black”.
• cex, cex.lab, cex.axis: The magnification factor of the plotted points—
cex—the axis titles—cex.lab—and the axis values—cex.axis. These are all
numeric values, with a baseline default of 1. Larger numbers increase the
size of points or text, smaller numbers shrink the size.
FIGURE 7.2 Different plotting character—pch—options with their corresponding number value.
• xlim, ylim: The range of the plot window for the x-axis and y-axis, respec-
tively. This is entered as a vector of values—for example, xlim=c(minimum,
maximum). By default, R sets the xlim and ylim to be the range of the X and
Y variables, respectively.
Using all these options, let us improve the speed and stopping distance plot—a sketch of such code follows below—with the result seen in the right-most plot in Fig. 7.1. As we can see, the linearity assumption between speed and stopping distance seems reasonable given the data.
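The exact code used for that figure is not shown here; a sketch along these lines—the option values are illustrative assumptions—produces a similar plot:
plot(cars$speed, cars$dist, xlab="Speed", ylab="Stopping Distance",
     main="Speed versus Stopping Distance", pch=16,
     cex.axis=1.25, cex.lab=1.25)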
7.6.2 Histograms
Histograms are used to help us describe the shape of our data. In R, histograms
are created through the hist function, with the baseline code being
hist(Variable)
For an example, we can look at the geyser dataset [10,19] in the MASS pack-
age in R. This dataset is the Old Faithful eruption wait and duration that we
discussed last chapter. To create the histogram for the wait time between erup-
tions (the wait variable in the geyser data frame), we would use
hist(geyser$wait)
The wait time variable seems to be bimodal, symmetric, and centered around
70. Like scatterplots, histograms have several options that can affect the look or
readability of the histogram.
FIGURE 7.3 Left-to-right: A base histogram for the wait time between eruptions and an enhanced
histogram of the same data using several of the described options.
• breaks: The number of bins for the data. This can be entered as either the raw
number—breaks=10—or a vector defining the cutoff points of the bins—
breaks=c(10,15,20,25,30). The effect of the breaks variable can be seen in
Fig. 7.4.
• xlab, ylab: The axis title for the x-axis and y-axis, respectively. These are
changed through xlab=“Title” and ylab=“Title”. The default axis titles are
the variable names entered in plot.
• main: The main title for the plot. This is changed through main=“Title”. For histograms, the default title is “Histogram of” followed by the variable name.
• col: The color of the bars. This value can be entered as a string—col=“red”—a number—col=4 results in blue bars—or even a HTML hex code—col=“#8B1F41” for Chicago Maroon. By default, the bars are unfilled.
• cex.lab, cex.axis: The magnification factor of the axis titles—cex.lab—and
the axis values—cex.axis. These are all numeric values, with a baseline de-
fault of 1. Larger numbers increase the size of points or text, smaller numbers
shrink the size.
• xlim, ylim: The range of the plot window for the x-axis and y-axis, respec-
tively. This is entered as a vector of values—for example, xlim=c(minimum,
maximum). By default, R sets the xlim and ylim to be the range of the X and
Y variables, respectively.
We can use these histogram options to improve the base histogram in Fig. 7.3
with the following code, resulting in the right-most histogram in that same fig-
ure.
hist(geyser$wait, breaks=20,
main="Histogram of Time Between Eruptions",
xlab="Wait Time", cex.axis=1.25, cex.lab=1.25, cex.main=1.5)
FIGURE 7.4 The effect of the number of breaks on the resulting histogram. From left to right, 20
breaks and 50 breaks.
7.7 Boxplots
In using boxplots, we can either make a boxplot for a single variable or make side-by-side boxplots exploring the association between a categorical predictor and a quantitative response. Both of these goals can be accomplished through
a single function: the boxplot function. The base code to create a boxplot for a
single variable is
boxplot(Variable)
As an example, we can look at the birthwt dataset [10,33] in the MASS pack-
age of R. This dataset is looking at several possible indicators of low birth
weight in newborns. If we just wanted to look at the birth weights of these
subjects (The variable bwt in the birthwt data frame), our code would be
boxplot(birthwt$bwt)
Like the previous plots, the boxplot has multiple options to improve the plot.
• pch: The plotting character for outlier points. The standard options are num-
bers 1 through 25 with each number being a different plotting character—
shown in Fig. 7.2. The default value is pch=1.
• xlab, ylab: The axis title for the x-axis and y-axis, respectively. These are
changed through xlab=“Title” and ylab=“Title”. The default axis titles are
the variable names entered in plot.
• main: The main title for the plot. This is changed through main=“Title”. The
default is no title.
• cex.lab, cex.axis: The magnification factor of the axis titles—cex.lab—and
the axis values—cex.axis. These are all numeric values, with a baseline de-
fault of 1. Larger numbers increase the size of points or text, smaller numbers
shrink the size.
FIGURE 7.5 From left to right: The base boxplot of the birth weights from the birthwt dataset and
the enhanced boxplot of the same data using some of the options.
• xlim, ylim: The range of the plot window for the x-axis and y-axis, respec-
tively. This is entered as a vector of values—for example, xlim=c(minimum,
maximum). By default, R sets the xlim and ylim to be the range of the X and
Y variables, respectively.
We can use these options to modify the base boxplot in Fig. 7.5, using the following code. The result will be the right-most boxplot in that same figure.
boxplot(birthwt$bwt, ylab="Birth Weight",
pch=16, cex.axis=1.25, cex.lab=1.25)
To create side-by-side boxplots, we use R’s formula notation—boxplot(Response~Predictor)—and this base code can be modified with the same options as the single-variable boxplot. Additionally, we can make changes to the plot with one further option. (See Fig. 7.6.)
• names: The names of the levels for the predictor variable. This can be
changed with names=c(Value1, Value2,...). By default, the boxplot names
will be the variable levels. It is important that the values in the names argument are in the same order as the levels of the original variable.
We can use these options to create the side-by-side boxplots seen in Fig. 7.6, resulting in the right-most boxplot in that figure. Notice that in the names argument, “Nonsmoker” comes first to match the ordering of the levels of the smoke variable.
FIGURE 7.6 From left to right: The base boxplot of the birth weights from the birthwt dataset and
the enhanced boxplot of the same data using some of the options.
boxplot(birthwt$bwt~birthwt$smoke,
names=c("Nonsmoker", "Smoker"),
ylab="Birth Weight", xlab="Mother’s Smoking Status",
pch=16, cex.axis=1.25, cex.lab=1.25)
7.9 Conclusion
With the exploratory analyses learned in this chapter, we suddenly have a lot
of tools at our disposal to help us understand and visualize our data. These are
not the only options available in statistics or in R, but they do represent some
of the most common. The next time we return to R, we will have the tools of
statistical inference at our disposal, and thus will be focused on implementing
these techniques in R.
Chapter 8
An incredibly brief introduction to probability
8.1 Introduction
Now that we are able to summarize our data, we want to be able to take these
summaries and answer questions using statistical inference. Ultimately, for us
to use statistical inference, we need a basic understanding of probability. This is
because much of statistical inference comes down to the question, “What is the
chance that we observe more extreme data if a certain hypothesis is true?” To
quantify that chance, we need probability.
Unlike statistics, probability as a field of study has been around since the
1600s. Originally, many of the rules and theorems were designed to solve prob-
lems related to gambling, but the early field is full of many of the most signifi-
cant names in mathematics, including Pascal, Fermat, Bernoulli, and Gauss.
We have occasional encounters with probability in our daily lives. We see
the probability that it will rain or snow in weather reports. Sports broadcasts talk about the probability of teams winning before and during games. Then there is of
course gambling, the father of probability, where we have odds in craps, poker,
roulette, and blackjack. Despite this, there are often misunderstandings or mis-
conceptions about probability and the general idea behind it. (See Fig. 8.1.)
This chapter will introduce us to the basics of probability, and the general
ideas that we will return to in statistical inference. We will begin with the ideas
behind random phenomena, random variables, the role of probability, and the
rules of probability.
FIGURE 8.1 A win probability plot for Game 5 of the 2017 World Series. Plotted is the Houston
Astros’s probability of winning the game at the time.
Say we are flipping a coin and want to know the probability of flipping a
heads. Assuming the coin is fair, the chances of flipping a head should be 0.5.
If we flip a coin over and over, the proportion of heads should get closer and
closer to 0.5, as can be seen in Fig. 8.2.
FIGURE 8.2 The frequentist proportion view of probability as applied to flipping a fair coin. As
the number of flips gets larger and larger, the proportion of heads gets closer and closer to 0.5.
Why does this long-run view of probability work? The reason is a statistical
theorem called the Law of Large Numbers. This law states that as our sample
size gets larger—or n → ∞—the sample mean x̄ of any sample we take will
converge toward the true value of the mean μ. Similarly, as the sample size
gets larger, the sample proportion p̂ will converge toward the true value of the
proportion p. Mathematically, we can write this as
x̄ → μ and p̂ → p as n → ∞.
As an example, the probability that you roll any number on a fair, six-sided
die is 1/6. So, the proportion of 3s will converge to 1/6 as the sample size goes
to infinity.
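We can watch the Law of Large Numbers at work with a quick simulation in R; a minimal sketch:
rolls <- sample(1:6, 10000, replace = TRUE)   # 10,000 fair die rolls
mean(rolls == 3)                              # proportion of 3s, near 1/6 ≈ 0.167
The more rolls we simulate, the closer this proportion settles toward 1/6.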
a really good game for you because you should win five times more often than
your opponent. Now, say that your opponent rolls five consecutive 6s, leading
him to win $25. What would you likely think? After two or maybe even three
6s, you might think that your opponent got really lucky. By the fifth consecutive
6, you would likely be thinking that your opponent either cheated, that the die
was not fair, or both. You would not be unwarranted in thinking this. If the die
was fair and your opponent did not cheat, the chances of rolling five consecutive
6s is 1/7776, a very unlikely event.
In a similar sort of question, say you were flipping a coin—that you assume
is fair—20 times and counting the number of times you see a heads. You would
expect to see 10 heads if the coin was fair. If you saw 12 heads in the 20 flips,
would you start to think that the coin was unfair? 14 heads? How many heads
would you need to see before you start to suspect that the coin is unfair? This
is the general idea behind statistical inference: Assume a certain state of the
world, collect data, and consider if the data makes you think your assumption is
incorrect. Probability will help us quantify this, and we will go into more detail
on this topic in later chapters.
Under the classical view of probability, where every outcome in the sample space S is equally likely, the probability of an event A is
P(A) = (# of outcomes in A) / (# of outcomes in S).
Consider the example again where you roll two dice and add them together—
essentially the gambling game called craps. There are 36 total combinations of
first die and second die, defined below with die 1 on the rows, die 2 on the
columns.
1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
5 6 7 8 9 10 11
6 7 8 9 10 11 12
Each one of these 36 outcomes is equally likely, assuming the dice are fair.
This makes the game an opportunity to exercise classical probability. Say that
we are interested in the event A where we roll a 7 or 11. In the grid, there are
eight 7s or 11s, so the probability that event A occurs is P (A) = 8/36 or 0.2222.
Now, say that we want to know the probability of event B where we roll a 2, 3,
or 12. There are four 2s, 3s, or 12s, leading to P (B) = 4/36 = 0.1111.
On occasion, we are not interested in some event A occurring so much as the chance that it does not occur. This event, A not occurring, is referred to as the complement of A, or Aᶜ. The complement of A is made up of all elements in the sample space S that are not in A.
Say you run a simple lottery, with a drawing of two balls with possible num-
bers being 1 through 3. The sample space will be S = {11, 12, 13, 21, 22, 23, 31,
32, 33}. If the event A is where the winning combo includes at least a single 2,
what will Aᶜ be? Removing every outcome that includes a 2 leaves Aᶜ = {11, 13, 31, 33}.
Notice that if you combine A and Aᶜ, you will get the entire sample space
S. Once we notice that, we are able to define our basic rules for probability, re-
ferred to as the Axioms of Probability or Kolmogorov Axioms—after Russian
probabilist Andrey Kolmogorov, who set down the axioms in 1933.
1. P(A) ≥ 0
2. P(S) = 1
3. If events A and B are disjoint sets—in other words, have no overlap—then P(A or B) = P(A) + P(B).
These axioms hold regardless of the paradigm of probability being considered. Additionally, further properties can be derived from these axioms.
• P(Aᶜ) = 1 − P(A), known as the complement rule.
• If event A contains all of event B, then P(A) ≥ P(B).
• If A is the empty set—an event with no outcomes—then P(A) = 0.
FIGURE 8.4 The effect of the mean and variance parameters on the Normal distribution. On the
left, the effect of changing the mean μ. On the right, the effect of changing the variance σ 2 .
For example, if X follows a Normal distribution with mean 68 and standard deviation 3, standardizing gives
Z = (X − 68)/3
and Z will follow a standard Normal distribution.
Normal distributions can also be combined through addition: if X follows a Normal distribution with mean μX and variance σX², and Y follows a Normal distribution with mean μY and variance σY², then the sum X + Y follows a Normal distribution with mean μX + μY and variance σX² + σY². So, for example, say X ∼ N(6, 16) and Y ∼ N(3, 4). If we were to add X and Y together, then X + Y will follow a Normal distribution with the mean being 6 + 3 = 9 and variance 16 + 4 = 20. That is, X + Y ∼ N(9, 20).
A similar property holds for subtracting two Normal distributions. If X follows a Normal distribution with mean μX and variance σX², and Y follows a Normal distribution with mean μY and variance σY², then the difference of those two variables, X − Y, will also follow a Normal distribution with mean μX − μY and variance σX² + σY². That is,
X ∼ N(μX, σX²), Y ∼ N(μY, σY²)
X − Y ∼ N(μX − μY, σX² + σY²).
Taking the same example from above—X ∼ N(6, 16) and Y ∼ N(3, 4)—the difference X − Y will follow a Normal distribution with the mean being 6 − 3 = 3 and variance 16 + 4 = 20. That is, X − Y ∼ N(3, 20).
8.9 Conclusion
Probability allows us to quantify uncertainty in a way that makes a wide variety of events comparable. There are multiple ways to do this quantification, but they all follow a similar set of rules that define our probabilities. These rules and calculations come together in our probability distributions, which allow us to describe our random variables using just a couple of numbers, called our parameters. These distributions will ultimately allow us to say how rare our observed data is, in a way that is essential to statistical inference.
Chapter 9
Sampling distributions
9.1 Introduction
As we saw in previous chapters, exploratory data analyses can give us a fair bit
of information about variables in our dataset. These analyses can show us the
center and shape of numeric data, the proportions of categories, and if variables
appear to be associated with each other. However, exploratory analyses are not
enough to help us answer questions. In this chapter, we will explore the reasons
why, and thus move toward the ideas behind statistical inference.
For example, suppose we have a small population of five students, with the number of credit hours each student is taking given below.
Student ID Credits
1 18
2 15
3 12
4 13
5 15
The true population mean is 14.6, but depending on the sample, we can
have a variety of sample averages. So, our sample mean—or any of our sam-
ple statistics—is a random variable because our sample is chosen randomly.
Recall that a random variable is a numeric summary of some event that we do
not know the outcome of in advance, in this case our sample. Since the mean is
a random variable, like every other random variable it will follow some proba-
bility distribution. The sampling distribution is the probability distribution for
a sample statistic for a sample of size n. Every sample statistic has a sampling
distribution, with some being easier to work with than others. In general, the
sampling distribution helps give us an idea if the data that we collected is un-
usual in any way. We can view the sampling distribution of our example above,
visualized through a histogram. (See Fig. 9.1.)
FIGURE 9.1 Histogram describing the sampling distribution of the sample mean for our credit
hours example.
We know the population mean—and the average of all the sample means—is 14.6. The standard deviation of these sample means—a quantity called the standard error—can then be calculated to be s.e.(x̄) = 1.26.
Now, we mentioned that the sampling distribution is the probability distribu-
tion for a sample statistic for a sample of size n. This implies that our sampling
distribution is in some way affected by the sample size, and if the distribution
is affected by the sample size, it is entirely reasonable to expect that the stan-
dard error of the statistic is also affected by the sample size. We know from the
Law of Large Numbers that as the sample size gets larger, our sample statistics
get closer and closer to their parameter’s true value. But what happens to the
standard error?
Consider the same example above, where we calculate a sample mean, but
this time we up our sample size to four. This will have five possible samples
with their sample means given below.
Students Average
1,2,3,4 14.5
1,2,3,5 15
1,2,4,5 15.25
1,3,4,5 14.5
2,3,4,5 13.75
The average of the sample means is still 14.6, so the standard error would
now become s.e.(x̄) = 0.515. So, our standard error decreases as our sample
size increases. This makes intuitive sense as well. If we were able to sample
the entire population, we would only calculate one sample mean—equal to the
population mean—and you would not have any variability in that one sample
mean. In general,
s.e.(Statistic) → 0 as n → ∞.
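We can verify the credit-hours calculations directly in R; a sketch that enumerates all five samples of size four with the combn function:
credits <- c(18, 15, 12, 13, 15)
xbars <- combn(credits, 4, mean)      # the five possible sample means
mean(xbars)                           # 14.6, the population mean
sqrt(mean((xbars - mean(xbars))^2))   # 0.515, the standard error
Note that the last line averages the squared deviations over all five possible samples, matching the calculation above.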
So, since the variability of the sample statistic decreases as the sample size
gets larger, this means that our sampling distribution will become more com-
pressed as the sample size gets larger. Consider the situation where you are
finding the average commute time for a population of 20 employees at a com-
pany. The individual commute times are given in Fig. 9.2.
If you knew the commute times for all 20 employees and considered all
possible samples of size n = 5, n = 10, and n = 15, you get the sampling dis-
FIGURE 9.3 Sampling distribution for sample average commute time for n = 5, n = 10, and
n = 15.
tributions below, which become more compressed as our sample size increases.
(See Fig. 9.3.)
To summarize, sample statistics are—by the nature of random samples—
random variables themselves. Thus, they have probability distributions—called the sampling distribution—and variability—called the standard error. These
distributions and variability are affected by the sample size, with the variability
decreasing and the distributions constricting as the sample size increases.
If you look closely at the sampling distributions in Fig. 9.3, you will notice that each appears roughly bell-shaped. The same holds if you look at the sampling distribution of the sample proportion.
Consider the scenario where you flip a coin 100 times and count the number
of heads in the 100 flips. Say you did this 10,000 times. The sampling distribu-
tion for the sample proportion of heads would look like Fig. 9.4.
FIGURE 9.4 Sampling distribution for the proportion of heads in 100 coin flips.
This bell-shaped nature of the sampling distribution for these two statistics
is not a coincidence. It turns out that as the sample size goes to infinity, the
sampling distribution for the sample mean becomes a Normal with parameters
μ equal to the true value of the population mean and variance being the standard
error of x̄ squared. That is,
x̄ ∼ N(μ, s.e.(x̄)²) as n → ∞.
A similar property holds for the sample proportion. Namely, as the sample
size goes to infinity, the sampling distribution for the sample proportion will
become Normal with parameters μ being the true value of the population pro-
portion p and variance being the standard error of p̂ squared. That is,
p̂ ∼ N(p, s.e.(p̂)²) as n → ∞.
This result—that the sampling distributions of the sample mean and proportion become Normal—is called the central limit theorem. Thanks in part to
this result, it is possible for us to do statistical inference in the form of hypothesis
testing and confidence intervals, which will begin next chapter.
As an example of the central limit theorem, let us look at the example seen in Fig. 9.4. In this example, we are flipping a fair coin—true probability of flipping a heads p = 0.5—100 times and calculating the sample proportion p̂. If this is the case, the standard error of p̂ would be s.e.(p̂) = 0.05, meaning that our sample proportion would roughly follow a Normal distribution with mean 0.5 and variance 0.05².
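A simulation sketch of this in R (the number of repetitions is arbitrary):
phat <- replicate(10000, mean(sample(c(0, 1), 100, replace = TRUE)))
hist(phat)   # roughly bell-shaped and centered at 0.5
sd(phat)     # close to the standard error of 0.05
The histogram of these 10,000 simulated sample proportions looks very much like Fig. 9.4.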
9.5 Conclusion
As we had previously suggested, data summaries are not sufficient to be the
foundation for making decisions. This is because they are calculated from ran-
dom samples or randomized experiments and are therefore random variables
themselves. However, this means that they follow probability distributions—
called the sampling distribution—and in some cases these probability distri-
butions are known. Particularly in the case of the sample average or sample
proportion they follow Normal distributions, a result known as the central limit
theorem. This result will help us going forward to be able to understand how un-
likely our data is, providing us the eventual foundation for coming to a decision
based on our data.
Chapter 10
The idea behind testing hypotheses
10.1 Introduction
We have finally reached the most important step of the process of statistics:
statistical inference. Inferential statistics allow us to consider the data that we
have collected in the light of probability and assumptions we make about the
state of the world. In doing so, we consider if the data we collected is rare under
our assumed world state and whether we should revise our assumptions. This
chapter will focus on introducing us to the most common of inferential statistics:
the hypothesis test. To begin, let us look at how hypothesis testing came about,
with a tale that reminds us how scientific and statistical discoveries can come
about in the most unlikely ways.
If Dr. Bristol truly could not tell the difference, she would be expected to identify half of the cups just by pure chance. However, if she identified many more cups correctly than expected, then the group's assumption that she could not tell the difference was likely incorrect.
And that is exactly what they did. Fisher and his colleagues presented Dr.
Bristol with eight cups of tea—four where the tea was poured into the milk
and four where the milk was poured into the tea—in random order. They then
recorded Dr. Bristol’s guesses and tallied up the results. It turned out that Dr.
Bristol got all eight guesses correct.
In hypothesis testing, we begin by assuming that the default state of the world is in fact true—that there really is nothing strange going on in the data. This mirrors court cases: the defendant—our null hypothesis—is innocent until proven guilty.
To illustrate this, let us consider the case where you want to discover if a coin is fair, which we want to determine through a hypothesis test. To begin this hypothesis test, we need to state both our null and alternative hypotheses. In most
cases, when writing out the null and alternative hypotheses, we want to express
our claims in terms of some population parameter. This allows us to more easily
test these hypotheses mathematically, and requires us to define our population
parameters in many cases.
In the case of our fair coin, that parameter is the true probability of flipping a
heads—or the true proportion of heads. Let us define p to be this true probability of flipping a heads, or p = Pr(H). If the coin is fair, the probability of flipping
a heads would be p = 0.5. This is our assumed state of the world, our status quo,
a world where nothing strange is happening. Because of this, this claim—that
p = 0.5—is our null hypothesis. We write this out as
H0 : p = 0.5.
Once we have defined our null hypothesis, we need to define our alternative
hypothesis. This alternative is going to completely depend on what we want
to show. For example, we might think that the coin that we are flipping makes
heads more likely to occur. If this is the case, the population proportion of heads
will be greater than 0.5, or p > 0.5. In that case, our alternative hypothesis is
HA : p > 0.5.
Or you believe that the coin makes tails more likely to occur, or heads less
likely to occur. In this case, the population proportion of heads will be less than
0.5, or p < 0.5. For this, our alternative hypothesis is
HA : p < 0.5.
Finally, we may simply believe that the coin is unfair without specifying a direction—heads could be either more or less likely than tails. In this case, our alternative hypothesis is
HA : p ≠ 0.5.
When defining our hypotheses, we cannot have any overlap between the null
and alternative. This makes reasonable sense, as we cannot have both of these
hypotheses be true. For example, the pair of hypotheses listed below are not a
valid set of null and alternative hypotheses.
H0 : p = 0.5
HA : p ≥ 0.5.
For example, say that we flip a coin 10 times and count the number of heads, testing the hypotheses
H0 : p = 0.5
HA : p > 0.5
so we would hope to see more heads than tails to support our alternative hypoth-
esis. The question for you is: Is six heads and four tails enough evidence that
the coin is unfair and makes heads more likely? Six heads and four tails is pretty
close to 50/50, so maybe it is not enough evidence for you. What about seven
heads and three tails? Is that enough evidence? What do you consider enough
evidence?
The idea behind hypothesis testing is that we go in assuming that our null
hypothesis is true, and if we see sufficiently rare data we conclude that the as-
sumption in the truth of the null hypothesis is wrong and we have statistically
significant results. The problem is how to determine if our data is “sufficiently
rare.” That is where the p-value comes in. A p-value is the probability that we
see data that is at least as extreme as what we observed, assuming of course that
the null hypothesis is true. For example, in the coin flip sequence above, our
p-value would be the probability that we see 6 or more heads in a series of 10
coin flips if the coin is indeed fair (A value of 0.377).
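Although we will wait until the next chapter to calculate p-values in general, this particular one can be checked in R with the binomial distribution; a minimal sketch:
1 - pbinom(5, size = 10, prob = 0.5)   # P(6 or more heads in 10 fair flips), 0.377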
P-values can have many misinterpretations. The p-value is not...
• the probability that the sample statistic equals our observed value
• the probability that the null hypothesis is true.
The second misinterpretation is much more common as well as more dam-
aging. It is important to recall that our parameters—about which we make our
hypotheses—are fixed values, so the hypotheses will either be true or false, with
no probability one way or the other. The p-value only makes a statement of how
rare our data is assuming that the null hypothesis is true. There are several ways
to calculate p-values, but we will leave the discussion of calculating p-values for
a later chapter.
Whenever we carry out a hypothesis test, there is a chance that our decision does not match reality. In statistics, these particular errors have specific names. If we reject the
null hypothesis when it is in reality true, this is called a Type I Error. If you
fail to reject the null hypothesis when it is in reality false, this is called a Type
II Error.
Test result:            Reject H0       Fail to reject H0
Reality: H0 True        Type I Error    Correct
Reality: H0 False       Correct         Type II Error
Ideally, we would like to avoid both errors, but this is impossible. If we make
it hard to reject the null hypothesis by setting α to be very low—and avoid Type
I errors—we make it likely that we will fail to reject the null hypothesis when
we should—i.e., we make more Type II errors. If we make it easy to reject the
null hypothesis by setting α higher—and avoiding Type II errors—we make it
more likely that we will reject the null hypothesis when we should not do so—
i.e., we make more Type I errors. This all becomes a large balancing act where
we have to consider the consequences of making a Type I or a Type II error.
Generally, it is much more costly to commit a Type I error. Our null hypoth-
esis represents the current state of the world, so to reject that null hypothesis
and change the state of the world will require money, time, and manpower. If
we undertake all this cost supporting something that is not true in reality, this all
becomes a sunk cost and is usually more detrimental than remaining with the
status quo. With this in mind, we will try and control how often we make Type
I errors in our hypothesis test, even as we make it more likely that we will make
a few more Type II errors.
The next question is: How do we control Type I errors? It ultimately all
comes down to α. Our significance level determines when we reject the null
hypothesis, so it will determine how likely we are to make a Type I error. In
fact, α equals the probability that we make a Type I error, or
α = Pr(Type I Error).
You cannot change your hypotheses after seeing the data to suit the evidence. You cannot bias your results toward one result or another and retain your integrity in research.
10.5 Conclusion
The philosophy behind hypothesis testing is fairly simple: place a claim about
our population parameters on trial, set a reasonable doubt, and let the data speak
as evidence. If the data is sufficiently rare that there is no reasonable way it
could have happened by chance, we conclude that we should reject our initial
claim. Otherwise, it should remain in place as a plausible claim. This mindset is
the foundation of many statistical techniques ranging from the simple to com-
plex. With this mindset now in place, we can work toward conducting tests for
population means and proportions.
Chapter 11
Making hypothesis testing work with the central limit theorem
11.1 Introduction
Now that we have introduced the idea of hypothesis testing, we are going to
begin delving into it a little more. Specifically, we need to talk about how to get
our p-values, the driving force behind our decision-making process. There are
many techniques to get p-values, but a majority of them can be tedious or require
more theoretical explanations beyond the scope of an introductory course. The
most common, and indeed one of the simpler, methods of getting p-values takes
advantage of the central limit theorem, allowing us to turn sample statistics into
probabilities about how rare our data is. But first, we need to talk a little more
about the Normal distribution and how to work with this very important class of
distributions.
Recall that if a random variable X follows a Normal distribution with mean μ and standard deviation σ, we can standardize X by calculating
Z = (X − μ)/σ.
Then Z will follow a standard Normal distribution. This will become very im-
portant to us when we are trying to do any hypothesis testing. Our other key
property of Normal distributions was that they could be combined using ad-
dition and subtraction. Namely, if two random variables X and Y both follow
Normal distributions, the X + Y and X − Y will also follow Normal distribu-
tions. Namely,
X ∼ N μX , σX2 , Y ∼ N μY , σY2
X + Y ∼ N μX + μY , σX2 + σY2
X − Y ∼ N μX − μY , σX2 + σY2 .
This property will also be important for hypothesis testing, specifically when we
are comparing two populations to see if they are similar or differ in some way.
To find probabilities for Normal distributions, R provides the pnorm function. One caution: pnorm expects the standard deviation rather than the variance. Thankfully, the variance and standard deviation are directly connected—√σ² = σ—so all we have to do is be cognizant of what we are inputting into the function. So, for example, say that we want P(N(5, 4) > 6.46). To get this in R, we would use the code
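pnorm(6.46, mean = 5, sd = 2, lower.tail = FALSE)
which returns approximately 0.233. Note that sd = 2 here, since the variance is 4, and lower.tail=FALSE requests the upper-tail probability. Returning to our coin-flipping test, the central limit theorem tells us that the sample proportion of heads p̂ will roughly follow a Normal distribution, p̂ ∼ N(p, s.e.(p̂)²), whose probabilities we can now find.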
However, this leaves a few things missing. We need to know what the true
value of p is along with the standard error of the sampling proportion. We never
know what the value of p is, as it is a population parameter and, therefore,
unknown. However, in hypothesis testing, we assume that the null hypothesis is
true, which gives us some assumptions about that value of p. Our null hypothesis
is that the coin is fair, or p = 0.5. So, if the null hypothesis is true, then the
sampling distribution of the sample proportion of heads would be
p̂ ∼ N(0.5, s.e.(p̂)²).
This just leaves the standard error of the sampling proportion s.e.(p̂). As
we move forward, we will give you formulas for what the standard error of our
various sample statistics is. However, for this chapter we will fill in the standard
errors for the sampling distributions.
Our goal is to calculate our p-value, or the probability that we see more
extreme data from this sampling distribution. Put another way, we need the prob-
ability that we see a more extreme sample statistic. We have a sample statistic
that follows a Normal distribution for which we now know how to find probabil-
ities. Based on this distribution and the knowledge we now have, we just need
to understand what we mean by more extreme data.
This definition of “more extreme” depends on the alternative hypothesis. Say
that our alternative hypothesis is HA : p > 0.5, or that the coin makes flipping
a heads more likely. “More extreme” data for this alternative hypothesis—data
that would support the alternative hypothesis—would result in seeing at least
as many heads or more than we actually did. So, our p-value would be the
probability that we see greater than or equal to 60 heads in a series of 100
flips—or, the probability that we see a sample proportion greater than or equal
to 60/100 = 0.6.
To calculate this out, we need the full sampling distribution of p̂. The standard error of p̂ for this particular scenario—testing if the coin is fair using 100 flips—is 0.05, so the sampling distribution of p̂ under the assumption that the null hypothesis is true would be N(0.5, 0.05²). Seeing “more extreme” data in support of the alternative hypothesis that p > 0.5 is equivalent to seeing a sample proportion greater than or equal to 0.6, so our p-value will be
$$P(N(0.5, 0.05^2) \ge 0.6).$$
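We can get this probability with the pnorm function:

pnorm(q=0.6, mean=0.5, sd=0.05, lower.tail=FALSE)

which returns 0.02275, our p-value for the alternative hypothesis HA : p > 0.5.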
If instead our alternative hypothesis were HA : p < 0.5, “more extreme” data would mean seeing fewer heads than we did, so our p-value is P(N(0.5, 0.05²) ≤ 0.6). Changing lower.tail to TRUE in the code above gives the returned probability of 0.9772, which becomes our p-value for the alternative hypothesis HA : p < 0.5.
The final alternative hypothesis that we have to consider is when HA : p ≠ 0.5. This is going to require a little more thought, because data that would support the alternative hypothesis would mean that we saw evidence that the coin makes heads more likely or tails more likely, so “extreme” now comes in two directions.
In this way, we could define “more extreme” as further away from the null
hypothesis value—or 0.5 in this case—than what we observed. In our example,
the observed value for the sample proportion is 0.6, which is 0.1 away from
0.5. On the other side, 0.4 is 0.1 away from 0.5. So, more extreme would be all
sample proportions that are greater than or equal to 0.6 and less than or equal to
0.4. Now, our sample proportion will still follow a N(0.5, 0.05²) according to the central limit theorem. So, our p-value will wind up being
$$P(N(0.5, 0.05^2) \le 0.4) + P(N(0.5, 0.05^2) \ge 0.6).$$
Now, we still have to calculate two probabilities to get our p-value. However, let us look at these two probabilities individually, standardizing each one. Since (0.4 − 0.5)/0.05 = −2 and (0.6 − 0.5)/0.05 = 2, these two probabilities are P(N(0, 1) ≤ −2) and P(N(0, 1) ≥ 2), and both equal 0.02275. This is not coincidental. Normal distributions are perfectly symmetric, which means that they will have the same tail probabilities, or in other words,
$$P(N(0,1) \le -a) = P(N(0,1) \ge a).$$
We can use this to our advantage, taking our two probabilities and converting
them to a single probability.
$$P(N(0.5, 0.05^2) \le 0.4) + P(N(0.5, 0.05^2) \ge 0.6) = P(N(0,1) \le -2) + P(N(0,1) \ge 2)$$
$$= P(N(0,1) \ge 2) + P(N(0,1) \ge 2)$$
$$= 2 \times P(N(0,1) \ge 2).$$
We can get this using the pnorm function in R, as shown below, to find that our p-value is 0.0455. We will use this conversion in all instances where our alternative hypothesis is of the form HA : p ≠ p0. (See Fig. 11.1.)
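2 * pnorm(q=2, mean=0, sd=1, lower.tail=FALSE)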
In calculating all these p-values, regardless of the alternative hypothesis, there is a common thread. In order to see this common thread, we need to look at how these p-values were derived by hand. If you look back at the calculations of all three p-values, you will notice they all begin in the same place. Namely, looking at how our sampling distribution—a N(0.5, 0.05²)—is related to our sample proportion—p̂ = 0.6. After employing one of the properties of Normal distributions, this changes to how a standard Normal is connected to
$$t = \frac{0.6 - 0.5}{0.05} = 2.$$
HA            P-value
p > 0.5       P(N(0, 1) ≥ t)
p < 0.5       P(N(0, 1) ≤ t)
p ≠ 0.5       2 × P(N(0, 1) ≥ |t|)
This value t takes into account information we know about the sample
statistic—in this case p̂—the null hypothesis value p0 , and the standard error—
here, s.e.(p̂). Specifically, it is
$$t = \frac{\text{Sample Statistic} - \text{Null Hypothesis Value}}{\text{Standard Error}}.$$
Essentially, it completely encompasses all our information about the hypoth-
esis test. This quantity t is referred to as the test statistic, and it calculates how
far our sample statistic is from the null hypothesis value, scaled by the standard
error. In general, we can use this knowledge about the relationship between the
test statistic, alternative hypothesis, and p-values to generalize which p-value
we should use for which alternative hypothesis. If we define our test statistic as
$$t = \frac{\text{Sample Statistic} - \text{Null Hypothesis Value}}{\text{Standard Error}},$$
then our p-values will be as defined in the table below, solely depending on the
alternative hypothesis.
HA                P-value
HA : p < p0       P(N(0, 1) ≤ t)
HA : p > p0       P(N(0, 1) ≥ t)
HA : p ≠ p0       2 × P(N(0, 1) ≥ |t|)
We will see tables defining our p-values very similar to this for each individ-
ual test as we go through them, but this general idea behind getting our p-values
will carry across a variety of parameters and tests.
11.5 Conclusion
P-values represent an important part of our hypothesis-testing procedure. Ulti-
mately, they are the evidence with which we decide whether or not to reject our
null hypothesis. Based on their definition, they could be difficult to calculate
exactly. However, thanks to the central limit theorem discussed earlier, we are
able to connect the sampling distribution to our p-values. With an assist from R, calculating these p-values becomes a simple task, relying only on our alternative
hypothesis and the observed data. As we will see in the following chapter, the
importance of the central limit theorem also extends to the other common form
of statistical inference.
Chapter 12
12.1 Introduction
So far in our discussion of statistical inference, we have concentrated solely on
the hypothesis tests. Further, in calculating our statistics from our sample we
have concentrated on only calculating single values as estimates for our popula-
tion parameters. However, there is a collection of estimators for our parameters
that involve an interval of plausible values for that parameter. In this chapter, we
will talk about the spirit and interpretation of these interval estimates.
The margin of error gives us an idea of how precise the point estimate is, expressing the
amount of variability that may exist in our estimate due to uncontrollable error.
For example, prior to the 2017 Virginia gubernatorial election, Quinnipiac
University conducted a poll and found that the Democratic candidate Ralph
Northam led the Republican, Ed Gillespie, by 9 points plus-or-minus 3.7 points
[35]. In this case, the point estimate for Northam’s lead is 9 points and the
margin of error is 3.7 points. This implied that in reality it would be plausible that Northam could lead the race by any number between 9 − 3.7 = 5.3 and 9 + 3.7 = 12.7 points.
Under ideal circumstances, our margin of error is able to give us an idea
of how confident we can be in the estimate of our population parameter. Say
we have two interval estimates where our point estimates are the same but our
margins of error differ: 6 ± 2 and 6 ± 4. Our first interval estimate implies that
the plausible values of our parameter are from 4 to 8, while the second implies
a range of 2 to 10. Because the range of plausible values is smaller for the
first interval estimate, we would be inclined to trust the results more. However,
there are many things that go into our margin of error: the amount of variability
inherent in our data, the sample size, and most importantly the probability that
our interval is “right.”
A confidence interval is a range of plausible values for our parameter calculated from our data whose probabilistic interpretation is directly connected to whether the interval covers the true value of the parameter. A confidence interval that is calculated from a sample will cover the true value of the population parameter a predetermined portion of the time; an 80% confidence interval will cover the true value of a parameter 80% of the time, and a 95% confidence interval will cover the true value of the parameter 95% of the time.
In general, if we were able to take an infinite number of samples and calculate
(1 − α)100% confidence intervals from each of them, then (1 − α)100% of these
intervals would cover the true value of the parameter. This (1 − α)100% prede-
termined coverage proportion—the probability that the interval is “right”—is
called the confidence level.
Let us look at an example of this where we know the true value of our pa-
rameter. Say that we flipped a fair coin 100 times and created a 95% confidence
interval for the true probability of flipping a heads p—more details on this to
come. Again, this is a fair coin, so the true value of p = 0.5. Thus, we will know
whether or not our confidence interval covers the true value of p. Now, we re-
peat this process of collecting data and creating a confidence interval 99 more
times. Assuming that our 95% confidence interval is “right” 95% of the time, we
should see that approximately 95 out of the 100 intervals cover the true value
of p = 0.5. In Fig. 12.1, we can see this phenomenon, with exactly 95 out of the
100 intervals covering p = 0.5.
FIGURE 12.1 95% Confidence intervals for the proportion of heads p with a sample size of 100.
Approximately 95% of the confidence intervals should cover the true value of p, given by the vertical
dotted line at p = 0.5.
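We can mimic this picture with a short simulation; the sketch below assumes the standard error formula sqrt(p̂(1 − p̂)/n) that appears in later chapters, and the seed is arbitrary:

set.seed(1)                                   # arbitrary seed for reproducibility
covers <- replicate(100, {
  flips <- rbinom(n=100, size=1, prob=0.5)    # 100 flips of a fair coin
  phat  <- mean(flips)
  se    <- sqrt(phat * (1 - phat) / 100)
  (phat - 1.96 * se < 0.5) & (0.5 < phat + 1.96 * se)
})
sum(covers)                                   # roughly 95 of the 100 intervals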
Confidence intervals are similar to the interval estimates we talked about above. While the exact form of
our confidence intervals will vary depending on what parameter we are doing
confidence intervals for, we can see the process of making confidence intervals
and how they are connected to the probability of being “right.” Specifically, we
will look at this through calculating our confidence interval for the population
proportion p.
For our confidence interval to be “right” (1 − α)100% of the time, we need
the interval to have a lower bound and upper bound calculated from our sample
such that
$$P(\text{Lower} < p < \text{Upper}) = 1 - \alpha.$$
So, let us start with a distribution that we have worked with before: the stan-
dard Normal distribution. We could find a value z∗ so that the probability that a
standard Normal is between −z∗ and z∗ is 1 − α, or
$$P(-z^* < N(0, 1) < z^*) = 1 - \alpha.$$
We will call this z∗ the critical value. Now, if we can find something that
connects the population proportion p to the standard Normal distribution, we
will be able to create an interval. If we recall, there is an important theorem
that connects our sample proportion, population proportion, and the standard
Normal: the central limit theorem. This states that the sample proportion p̂ will
follow a Normal distribution assuming the sample size n is large enough, specif-
ically
$$\hat{p} \sim N(p, \text{s.e.}(\hat{p})^2).$$
As we saw in hypothesis testing, we can translate this to a standard Normal
by subtracting off the mean p and dividing by the standard deviation s.e.(p̂)
$$\frac{\hat{p} - p}{\text{s.e.}(\hat{p})} \sim N(0, 1).$$
So let us go back to our standard Normal probability, P(−z* < N(0, 1) < z*) = 1 − α. We can sub in $\frac{\hat{p} - p}{\text{s.e.}(\hat{p})}$ for the N(0, 1), since $\frac{\hat{p} - p}{\text{s.e.}(\hat{p})} \sim N(0, 1)$. Our standard Normal probability then becomes
$$P\left(-z^* < \frac{\hat{p} - p}{\text{s.e.}(\hat{p})} < z^*\right) = 1 - \alpha.$$
To get our bounds—and, therefore, our interval—we need to solve for p inside the probability statement:
$$P\left(-z^* < \frac{\hat{p} - p}{\text{s.e.}(\hat{p})} < z^*\right) = 1 - \alpha$$
$$P(-z^*\,\text{s.e.}(\hat{p}) < \hat{p} - p < z^*\,\text{s.e.}(\hat{p})) = 1 - \alpha$$
$$P(-\hat{p} - z^*\,\text{s.e.}(\hat{p}) < -p < -\hat{p} + z^*\,\text{s.e.}(\hat{p})) = 1 - \alpha$$
$$P(\hat{p} - z^*\,\text{s.e.}(\hat{p}) < p < \hat{p} + z^*\,\text{s.e.}(\hat{p})) = 1 - \alpha.$$
Because the standard Normal distribution is symmetric, rather than finding two probabilities we can choose z* such that P(N(0, 1) < z*) = 1 − α/2. This will allow us to look up only one probability to find our z*. Let us see this in practice. Say we wanted to calculate a 95% confidence interval for p. To find our value of α, we solve (1 − α)100% = 95%, so that our α = 0.05. In finding z*, this implies that
$$P(N(0, 1) < z^*) = 1 - 0.05/2 = 0.975.$$
To find this z* in R, we turn to the qnorm function, which takes three arguments: the mean of the distribution mean, the standard deviation of the distribution sd, and the probability associated with that quantile p. The code to use this function is
qnorm(p, mean, sd)
And R will return the quantile z* such that P(N(mean, sd) < z*) = p. For
our example—finding the z∗ associated with a 95% confidence interval—we
would use the code
qnorm(p=0.975, mean=0, sd=1)
And R will return a z∗ value of 1.959964, which we can use to get our 95%
confidence interval. Let us take a look at this in practice. In 1996, the United
States General Social Survey took a survey of 96 people asking if they were
satisfied with their job [36,37]. They found that 79 of them were satisfied, so
the sample proportion would be p̂ = 79/96 = 0.8229. Say we wanted to create
a 90% confidence interval (α = 0.1) for the true proportion of people satisfied
with their job. The first thing is to find z∗ , so that
$$P(N(0, 1) < z^*) = 1 - 0.1/2 = 0.95.$$
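Using the qnorm function,

qnorm(p=0.95, mean=0, sd=1)

we find that our critical value is z* = 1.645.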
Our final part of the confidence interval is the standard error of p̂, which in
this case is s.e.(p̂) = 0.039. Taking all this information, our 90% confidence
interval for p will be
$$\left(0.8229 - 1.645 \times 0.039,\ 0.8229 + 1.645 \times 0.039\right) = (0.7585, 0.8871).$$
There are several common misinterpretations of confidence intervals worth addressing.
• A (1 − α)100% confidence interval does not mean that there is a (1 − α)100% chance that the parameter falls between the bounds; this treats the parameter as random rather than a fixed value. This is false, as our parameter is fixed while the bounds of our confidence interval are calculated from a sample and are therefore random.
• A (1 − α)100% confidence interval does not mean that we are (1 − α)100%
confident that the parameter equals the interval defined by the given confi-
dence interval. Again, our parameter is a fixed, unknown, single value and
will not be equal to an interval.
• A (1 − α)100% confidence interval does not mean that (1 − α)100% of
the population falls within the bounds of a confidence interval. Confidence
intervals make statements about population parameters, not about individuals
within the population.
• A (1 − α)100% confidence interval does not mean that a calculated sample
statistic will fall in the confidence interval (1 − α)100% of the time. Again,
confidence intervals make statements about population parameters and not
about the individuals within the population.
12.8 Conclusion
Confidence intervals represent another important component of statistical infer-
ence. Rather than testing a claim, they allow us to see the range of values that our
parameter may take. We will see how we can connect this directly to hypothesis
testing, but the two methods of inference are already connected via the central
limit theorem. Both techniques rely on this theorem to connect our data to our
parameter, allowing us to understand the values for these important summaries
of our population. Going forward, we can now fill in the specifics of several key
inferential situations, focusing on population means and proportions.
Chapter 13
Hypothesis tests for a single parameter
13.1 Introduction
In the past few chapters, we have been laying the general foundations for sta-
tistical inference through hypothesis testing and confidence intervals. In this
chapter, we will start really going through the process of hypothesis testing, filling in the gaps from earlier chapters. We will specifically be looking at
one-sample tests in this chapter. One-sample hypothesis tests are concentrated
on the questions related to a single parameter, for example, is a coin fair—i.e.,
is p = 0.5?
While statistics is a process that is often as much art as science, hypothesis testing is a very methodical procedure that in all its forms consists of
a series of steps. These steps are given below and will be referred to as we go
through each of the tests in this chapter and future chapters:
1. State hypotheses
2. Set significance level
3. Collect and summarize data
4. Calculate test statistic
5. Calculate p-value
6. Conclude
In this chapter, our one-sample tests will include a one-sample test about our
population proportion p and a one-sample test about our population mean μ. We
will begin with the more familiar of these two tests, looking at our one-sample
test for the population proportion p.
H0 : p = 0.14
HA : p ≠ 0.14.

Our test statistic takes the familiar form
$$t = \frac{\hat{p} - p_0}{\text{s.e.}(\hat{p})}.$$
All that remains is our standard error s.e.(p̂). This can be a little more difficult to calculate, but it turns out that the standard error of our sample proportion is
$$\text{s.e.}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$$
where p is the true value of our population proportion and n is our sample size.
This is problematic because it relies on knowing the true value of p, which—as
a parameter—is unknown. If we knew the value of p, there would be no reason
to go through this procedure. This would seem to be a dead end in calculating
our standard error. Fortunately, one of our key assumptions in hypothesis testing
provides an answer. In hypothesis testing, we always assume that the null hy-
pothesis is true until proven otherwise. In this case, this would mean that p = p0
until we prove it false. We can substitute this value of p = p0 into our standard error, which becomes
$$\text{s.e.}(\hat{p}) = \sqrt{\frac{p_0(1-p_0)}{n}}.$$
Putting all this together, our test statistic for this particular test is
$$t = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}.$$
For our Sweet Briar example, our null hypothesis value will be p0 = 0.14,
our sample proportion is p̂ = 0.1519, and our sample size is n = 79. We can fill
this into our test statistic formula above to get
$$t = \frac{0.1519 - 0.14}{\sqrt{\frac{0.14(1-0.14)}{79}}} = 0.3048.$$
HA                P-value
HA : p < p0       P(N(0, 1) ≤ t)
HA : p > p0       P(N(0, 1) ≥ t)
HA : p ≠ p0       2 × P(N(0, 1) ≥ |t|)
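For the Sweet Briar example, with HA : p ≠ 0.14 and t = 0.3048, we can carry out this calculation in R; a minimal sketch from the summaries above:

phat <- 0.1519; p0 <- 0.14; n <- 79
t_stat <- (phat - p0) / sqrt(p0 * (1 - p0) / n)           # test statistic, 0.3048
2 * pnorm(q=abs(t_stat), mean=0, sd=1, lower.tail=FALSE)  # p-value, about 0.7605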
13.2.6 Conclude
In general, our conclusions come in three parts—justification, result, and con-
text. We want to be sure to mention the result of our hypothesis test—we
either reject or fail to reject the null hypothesis—to go with our justification
for reaching that result—because our p-value was less than α or greater than
α, respectively. Finally, we want to be sure to put this conclusion in context, in
order for us to understand what this result means for our study. If we reject the
null hypothesis, we will conclude in favor of HA . However, if we fail to reject
the null hypothesis, we can only conclude that the null hypothesis is plausible—
remember that the null hypothesis can never be concluded true, only plausible.
In our Sweet Briar legacy example, our p-value = 0.7605184 is greater than
our significance level of α = 0.05. Since this is the case, we will fail to reject
the null hypothesis and conclude that it is plausible that the true proportion of
legacy students at Sweet Briar is equal to 0.14.
Now, let us see all our whole hypothesis testing procedure at once in another
example. In March 2016, a Pew Research poll found that 57% of Americans
were “frustrated” with the federal government [39]. Pew repeated the poll in
December 2017 and found that 826 of 1503 respondents said that they were
“frustrated” with the federal government. Say we wanted to test to see if the true proportion of Americans frustrated with the government had decreased since the 2016 poll. Our hypotheses would be

H0 : p = 0.57
HA : p < 0.57.
We will set our significance level to be α = 0.05 for this data. The summaries we need for this data are the sample proportion p̂ = 826/1503 = 0.549 and the sample size n = 1503. Assuming that the null hypothesis is true, our test statistic is
$$t = \frac{0.549 - 0.57}{\sqrt{\frac{0.57 \times 0.43}{1503}}} = -1.644.$$
Because our alternative hypothesis is HA : p < 0.57, our p-value will be P(N(0, 1) < −1.644). Using the R code

pnorm(q=-1.644, mean=0, sd=1, lower.tail=TRUE)

we find that this probability—which is our p-value—will be 0.0500881. Since this p-value is just barely greater than our significance level of α = 0.05, we would fail to reject the null hypothesis and conclude it is plausible that the proportion of frustrated Americans has not decreased.
6. What would the null and alternative hypotheses be to test this question?
7. In surveying a set of 276 e-mails, 122 were determined to be spam. What
is the sample proportion for this data?
8. Based on the null hypothesis and data collected, what would your test
statistic be?
9. Based on your alternative hypothesis and test statistic, what will your p-
value be?
10. Assuming that α = 0.1, what conclusions will you draw from this test?
In a 2011 survey, 79% of Americans said they had read a book in the past
year [42]. A 2019 survey found that out of 1502 respondents, 1082 had read a
book in the previous year [43].
11. Test if the population proportion of Americans who have read a book in the past year has decreased from the 2011 level of p = 0.79 at the α = 0.05 level.
H0 : μ = 13.13
HA : μ < 13.13.
$$t = \frac{\bar{x} - \mu_0}{\text{s.e.}(\bar{x})}.$$
All that remains is the standard error of our sample average s.e.(x̄). Deriving
this is difficult, but it turns out that the standard error of the sample average is
$$\text{s.e.}(\bar{x}) = \frac{\sigma}{\sqrt{n}}$$
where σ is the population standard deviation and n is the sample size. Putting
this together, this would make our test statistic
$$t = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}.$$
However, there is a problem with this test statistic, namely that it depends
on population parameter σ . As a population parameter, σ is assumed to be fixed
and unknown, so it would be impossible to calculate a test statistic based on this
unknown parameter. Unlike the one-sample test for proportions, assuming that
the null hypothesis is true does not help us. Knowing μ0 does not give us the
value of σ , which we need to calculate our test statistic. Despite this, there is
another option for our test statistic; we can find a sample statistic that estimates
σ . Such an estimate does exist, specifically the sample standard deviation s. We
can substitute s into our test statistic in place of σ for our calculations, with our
resulting test statistic for this hypothesis test being
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}.$$
In the video slots example, using our data and null hypothesis, we can calculate our test statistic to be
$$t = \frac{0.672 - 13.13}{2.551/\sqrt{345}} = -90.7084.$$
Recall that if we knew the population standard deviation σ, the central limit theorem would give us
$$t = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).$$
However, the test statistic we can actually calculate substitutes the sample standard deviation s for σ:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}.$$
This means that our test statistic will not follow a standard Normal distribu-
tion. But what do we do in this case? To calculate the p-values, we need to know
what distribution our test statistic follows, which currently is a roadblock. With
this in mind, let us take a short detour to take a look at a new class of probability
distributions: t distributions.
Back to p-values
So, given our slight detour, it probably is no surprise to find that our test statistic
from Step 4 will follow a t-distribution. However, which one? In order to define
our t-distribution, we will need to define the degrees of freedom. It turns out that
our test statistic follows a t-distribution with n − 1 degrees of freedom, where n
is our sample size. That is,
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t_{n-1}.$$
FIGURE 13.1 The t-distribution with various degrees of freedom and the standard Normal distri-
bution.
As before, our p-values depend on our alternative hypothesis and our test statistic. Additionally, the rationale for each p-value will be similar
to that for our standard Normal p-values. To illustrate this, let us look at the p-
value for the alternative hypothesis HA : μ > μ0 . Our p-value is the probability
that we observe “more extreme” data, which in this case would mean that we
see a sample average that is greater than the value we actually observed. If the
sample average is greater than observed, this will result in a test statistic greater
than the one calculated from our observed data. This is equivalent to saying that
a tn−1 distribution is greater than our calculated test statistic, because our test
statistic formula follows a tn−1 distribution. The probability that this occurs is
our p-value, which is P(t_{n−1} > t) where t is our test statistic calculated in Step 4. The rationale for our other two alternative hypotheses is similar, and their respective p-values are given below. (See Fig. 13.2.)
HA                P-value
HA : μ < μ0       P(t_{n−1} ≤ t)
HA : μ > μ0       P(t_{n−1} ≥ t)
HA : μ ≠ μ0       2 × P(t_{n−1} ≥ |t|)
So, in order to get our p-values, we will need to be able to get p-values from our t-distributions. Like with the standard Normal, we can do this through a simple function in R. Similar to how the pnorm function got us probabilities from the Normal distribution, the pt function will get us probabilities from the t-distribution. The pt function takes three arguments: the critical value of the distribution desired t*, the degrees of freedom ν, and whether we want the lower or upper tail probability—that is, if we want P(t_{n−1} < t) or P(t_{n−1} > t), respectively. Our code for this function, assuming we want the lower tail probability, will be
pt(q=t*, df=v, lower.tail=TRUE)
FIGURE 13.2 P-values based on the t-distribution for our three alternative hypotheses.
To get the upper tail probability, all we will have to change is our lower.tail
option to be FALSE.
One final important thing to note is the assumptions built into using our t-
distribution for p-values in this test. As with the one-sample test for proportions,
we are able to use the t-distributions only if the central limit theorem holds
for our sample statistic—in this case x̄. The central limit theorem will hold
only if one of two conditions are true: if our sample size is large enough or
our response variable follows a Normal distribution. This means that only for
large enough sample sizes—generally requiring that our sample size n be at
least 30—or Normal data can we use the t-distribution for p-values.
For the video slots example, since our alternative hypothesis is HA : μ < 13.13, our p-value will be P(t_{344} < −90.7084). Our sample size is sufficiently large with n = 345 for us to say that the central limit theorem will hold, and thus our p-value will be valid. We use the following code in R to find that our p-value is $1.834 \times 10^{-247}$:
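pt(q=-90.7084, df=344, lower.tail=TRUE)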
13.3.7 Conclude
Just as in the one-sample test for proportions, we either will reject or fail to reject
our null hypothesis based on our p-value and the significance level α. In doing
so, we want to be sure to reference our decision—reject or fail to reject—the
reason for the decision—p-value less than or greater than α—and the context of
the decision in terms of the problem.
Finally, in our video slots example, since our p-value of $1.834 \times 10^{-247}$
is decidedly less than the significance level α = 0.05, we will reject the null
hypothesis and conclude that the true average payout of the video slot machine
is less than $13.13.
Let us see all these steps in practice. Say there is a factory process for making washers [45,46]. If the population average of washer size differs from 4.0 in any way, the process is reset. A quality control sample of 20 washers is taken and the sample average was found to be x̄ = 3.9865 with sample standard deviation s = 0.073. In this setting, to test if the process should be reset our hypotheses would be

H0 : μ = 4
HA : μ ≠ 4.
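From these summaries, the remaining steps can be sketched in R; the two-sided p-value this produces (about 0.4185) is the value referenced again in the next chapter:

xbar <- 3.9865; s <- 0.073; n <- 20; mu0 <- 4
t_stat <- (xbar - mu0) / (s / sqrt(n))            # about -0.827
2 * pt(q=abs(t_stat), df=n-1, lower.tail=FALSE)   # two-sided p-value, about 0.4185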
13.4 Conclusion
Our tests for single parameters are crucial to understanding our populations. As
the population mean provides the average value for our population, the test for
a single mean helps us to recognize where that central value for our popula-
tion lies. The test for proportions can help us understand the true probability
of success—however it is defined—in addition to allowing us to then estimate
the total number of successes in the population. Fortunately, hypothesis testing has a well-defined set of procedures: defining hypotheses, setting our significance level, collecting and summarizing our data, calculating our test statistic, calculating our p-value, and drawing a conclusion. This process allows us to understand the testing procedure for a wide variety of tests. In addition to understanding this procedure, we need to understand what makes these procedures work by recognizing the assumptions that are essential to making our tests valid. These procedures
and assumptions will continue going forward as we learn more inferential tech-
niques, both tests and confidence intervals.
Chapter 14
Confidence intervals for a single parameter
14.1 Introduction
In previous chapters, we introduced the general ideas of interval estimation and confidence intervals. In this chapter, we will expand our confidence intervals further, calculating intervals for p and μ as well as providing additional uses for these intervals beyond merely providing a range of plausible values.
Recall that the central limit theorem tells us that, for large enough samples,
$$\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right)$$
where p is the true value of the population proportion. This implies that the standard error of p̂ is
$$\text{s.e.}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}.$$
However, we do not know the
true value of p because the population proportion is a parameter and, therefore,
assumed to be unknown. We ultimately need to substitute something else into
our formula for the standard error that estimates our population proportion p.
We know that the sample proportion p̂ estimates our population proportion p,
so we can substitute p̂ for p in our standard error. This results in
$$\text{s.e.}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$
With this change, our (1 − α)100% confidence interval for p will be
$$\left(\hat{p} - z^*\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z^*\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$$
where p̂ is the sample proportion, n is our sample size, and z∗ is our critical
value chosen so that P (N(0, 1) < z∗ ) = 1 − α/2. The final piece of the confi-
dence interval is the interpretation, which we discussed previously. We interpret
confidence intervals by saying we are (1 − α)100% confident that our calculated
interval covers the true value of p.
Similar to hypothesis testing, there is a key assumption in creating our con-
fidence interval. In order to use the Normal distribution to get our critical value
z∗ and our confidence interval, we need the central limit theorem to hold. For
the central limit theorem to hold, we stated that our sample size had to be suf-
ficiently large—generally n ≥ 30—and there needed to be a sufficient number
of successes and failures in our sample—at least 10 of each. So, for our confi-
dence intervals to be valid we need a sufficiently large sample size and enough
successes and failure in our sample.
As an example, Steph Curry made 325 of 362 free throws in 2016. Let us
create a 99% confidence interval (α = 0.01) for the true probability of Steph
Curry hitting a free throw p. Our first step is finding our critical value z∗ such
that P (N(0, 1) < z∗ ) = 1 − 0.01/2 = 0.995. Using the R code
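qnorm(p=0.995, mean=0, sd=1)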
we can get that our critical value is z* = 2.576. With our critical value, our sample proportion p̂ = 325/362 = 0.898, and our sample size of n = 362, our 99% confidence interval for p is
$$\left(0.898 - 2.576\sqrt{\frac{0.898(1-0.898)}{362}},\ 0.898 + 2.576\sqrt{\frac{0.898(1-0.898)}{362}}\right) = (0.857, 0.939).$$
We interpret this as: we are 99% confident that (0.857, 0.939) will cover the true probability that Steph Curry makes a free throw. Further, our sample size is sufficiently large—n = 362—and we see sufficient successes and failures—325 successes and 362 − 325 = 37 failures—to trust that the results are valid.
The prop.test function mentioned earlier has the ability to calculate our
confidence intervals for p. By inputting our data as before and changing the
conf.level option to our desired confidence level, we can get the (1 − α)100%
confidence interval for p. Details and examples of this function will be discussed
in Chapter 17.
For our confidence interval for μ, we again need a lower bound and upper bound calculated from our sample such that
$$P(\text{Lower} < \mu < \text{Upper}) = 1 - \alpha.$$
Recall that our standardized sample average follows a t-distribution:
$$\frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{n-1}.$$
With this in mind, let us start with a probability statement related to the tn−1
distribution. Let us choose a value t ∗ —again called the critical value—such that
$$P(-t^* < t_{n-1} < t^*) = 1 - \alpha.$$
Now, since $\frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{n-1}$, we can substitute $\frac{\bar{x} - \mu}{s/\sqrt{n}}$ for $t_{n-1}$ in the probability statement, resulting in
$$P\left(-t^* < \frac{\bar{x} - \mu}{s/\sqrt{n}} < t^*\right) = 1 - \alpha.$$
At this point, if we solve for μ using algebra inside our probability state-
ment, we will wind out with an interval estimate with our desired probabilistic
properties.
$$P\left(-t^* < \frac{\bar{x} - \mu}{s/\sqrt{n}} < t^*\right) = 1 - \alpha$$
$$P\left(-t^*\frac{s}{\sqrt{n}} < \bar{x} - \mu < t^*\frac{s}{\sqrt{n}}\right) = 1 - \alpha$$
$$P\left(-\bar{x} - t^*\frac{s}{\sqrt{n}} < -\mu < -\bar{x} + t^*\frac{s}{\sqrt{n}}\right) = 1 - \alpha$$
$$P\left(\bar{x} - t^*\frac{s}{\sqrt{n}} < \mu < \bar{x} + t^*\frac{s}{\sqrt{n}}\right) = 1 - \alpha.$$
We now have an interval (Lower, Upper) such that P(Lower < μ < Upper) = 1 − α. This means that in the end, our (1 − α)100% confidence interval for μ is
$$\left(\bar{x} - t^*\frac{s}{\sqrt{n}},\ \bar{x} + t^*\frac{s}{\sqrt{n}}\right).$$
So, to calculate our confidence interval for μ, we need to get our sample mean x̄, the sample standard deviation s, and sample size n, all of which come from our sample. This just leaves finding our critical value t*. Similar to our confidence interval for p, finding t* as defined—choosing t* such that P(−t* < t_{n−1} < t*) = 1 − α—is problematic, as it involves finding two probabilities. However, since the t-distribution is symmetric, we can adjust the probability used to choose t* such that
$$P(t_{n-1} < t^*) = 1 - \alpha/2.$$
To find our value of t ∗ , we again turn to R. Just as the pnorm function had a
t-distribution analogue in the pt function, the qnorm function—used to find our
critical values from the Normal distribution—has a t-distribution analogue in
the qt function. The qt function finds the critical value of a t-distribution based
on the given probability and degrees of freedom of the t-distribution. These two
values—the probability p associated with the critical value and the degrees of
freedom ν—are the inputs into the qt function, with the code given below.
qt(p=p, df=v)
For example, say we measured the amount of DDT in a sample of n = 15 kale specimens, finding a sample average of x̄ = 3.328 with sample standard deviation s = 0.4372, and we want a 96% confidence interval (α = 0.04) for the population mean. We need t* such that P(t_{14} < t*) = 1 − 0.04/2 = 0.98; using the code qt(p=0.98, df=14), we find that t* = 2.263781. Putting this all together, our 96% confidence interval for the population mean amount of DDT in the kale sample is
$$3.328 \pm 2.264 \times \frac{0.4372}{\sqrt{15}} = (3.0724, 3.5836).$$
We are 96% confident that (3.0724, 3.5836) covers the true value of μ. How-
ever, it is important to note that our sample size is less than 30, implying that
there may be concerns for our results. However, checking our histogram of
the data shows reasonably symmetric data, implying reasonable Normality and
that the central limit theorem still holds. Thus, our results are trustworthy. (See
Fig. 14.1.)
Similar to how the prop.test function can be used to get confidence intervals
for p, the t.test function can be used to get confidence intervals for μ. As before,
we merely have to add information about our desired confidence level in order
to get this interval.
sample size. In initially defining interval estimates, we said that the most basic interval estimate was defined by
$$\text{Point Estimate} \pm \text{Margin of Error}.$$
Let us define this in terms of our confidence interval for p. So, recall, our confidence interval for p is
$$\hat{p} \pm z^*\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$
In the case of our population proportion p, the point estimate of that population parameter is the sample proportion p̂ and our margin of error is
$$m = z^*\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$
The margin of error is predicated on knowing the sample size n, the confidence level determined through z*, and our sample proportion p̂. Say that we wanted to know the sample size, and we knew our margin of error m, our sample proportion p̂, and the confidence level we want to have in our estimate—which is conveyed through z*. We can use our equation for the margin of error to solve for n through a little algebra, getting
$$n \ge \frac{(z^*)^2\,\hat{p}(1-\hat{p})}{m^2}.$$
Of course, before collecting our data we do not know the sample proportion p̂, so we substitute a value p̃: either a sample proportion from a previous study or, if none exists, the worst-case value p̃ = 0.5, which maximizes the necessary sample size. You can confirm that p̃ = 0.5 is the worst case through either calculus or by calculating what the sample size would be for various values of p̃. In the end, if we know that we want to estimate our population proportion within a margin of error m with (1 − α)100% confidence, our sample size will have to be
$$n \ge \frac{(z^*)^2\,\tilde{p}(1-\tilde{p})}{m^2}.$$
When calculating this necessary sample size, often we will find that n is a non-integer. Of course, we cannot sample half a person, so we need to round up to the next integer greater than or equal to our calculated n. If we rounded down, the sample size would result in a larger margin of error than desired. By rounding up, we ensure that our confidence intervals will have a slightly smaller margin of error than initially desired.
For example, say that you wanted to estimate the true proportion p of Amer-
icans who believe that life is better for people like them today than it was 50
years ago. To design the survey, you determine that you want to estimate p
within a margin of error of 0.04 with 99.5% confidence. To begin, we know that
our margin of error is m = 0.04. Next, we need to find our critical value z∗ . For
99.5% confidence, our value of α will be α = 0.005, so we choose z* such that P(N(0, 1) < z*) = 1 − 0.005/2 = 0.9975. Using the code qnorm(p=0.9975, mean=0, sd=1) in R, we will find that z* = 2.807.
The final piece we are missing is p̃. Say that we do not have an idea of the
sample proportion from a previous study, so we will have to assume the worst
case scenario and set p̃ = 0.5. Putting all this together, our minimum sample
size that will estimate p within a margin of error of 0.04 with 99.5% confidence
is
$$n \ge \frac{(2.807)^2 \times 0.5(1-0.5)}{0.04^2} = 1231.133.$$
So we will need a sample size of at least 1232 subjects. On the other hand,
say we did a little digging and found a Pew Research Survey from 2017 [51]
that found a p̂ = 0.37. In this case, our p̃ = 0.37 and our sample size calculation
would change to
$$n \ge \frac{(2.807)^2 \times 0.37(1-0.37)}{0.04^2} = 1147.9.$$
So our sample size would need to be at least 1148 people. Notice that our necessary
sample size is lower than the worst case scenario, which is what we expect since
the sample size is maximized at p̃ = 0.5.
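This sample size calculation is easy to script; a minimal sketch in R using the values above:

z_star <- qnorm(p=0.9975, mean=0, sd=1)              # 2.807
m <- 0.04                                            # desired margin of error
p_tilde <- 0.37                                      # planning value from the 2017 survey
ceiling(z_star^2 * p_tilde * (1 - p_tilde) / m^2)    # 1148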
We can see this in Fig. 14.2, where the area under the curve (equal to α/2) in the figure on the left is P(t_{n−1} ≥ t*) and the area under the curve (equal to the p-value for a rejected null hypothesis) in the figure at the bottom is P(t_{n−1} ≥ |t|). (See Fig. 14.2.)
FIGURE 14.2 If our test statistic t is greater than a critical value t ∗ , we will reject the null hypoth-
esis.
So, we will reject H0 : μ = μ0 in favor of HA : μ ≠ μ0 if |t| > t*, where P(t_{n−1} ≥ t*) = α/2. If P(t_{n−1} ≥ t*) = α/2, this means that P(t_{n−1} ≤ t*) = 1 − α/2. If this is our rule to reject the null hypothesis, we can define our other possible decision by its opposite. That is, we will fail to reject H0 : μ = μ0 if |t| < t*, where we choose t* so that P(t_{n−1} ≤ t*) = 1 − α/2. We can expand this a little more fully by writing out our test statistic that we calculate in our hypothesis test. In doing so, we will fail to reject the null hypothesis if
$$\left|\frac{\bar{x} - \mu_0}{s/\sqrt{n}}\right| < t^*.$$
Our values of x̄, s, and n will not change in a study regardless of the test that
we choose. Similarly, our t ∗ will not change once we set our significance level
α. In fact, the one thing that can change across these tests is the value of μ0. So, given a sample and a set significance level, we can determine if a value
of μ0 will result in rejection or failure to reject. We do this by solving for μ0 in
the equation above. Our first step is to recognize that if |t| < t ∗ , this implies that
−t ∗ < t < t ∗ . With this in mind, we can rewrite our above equation and solve
for μ0 using a little algebra. This will define under what conditions—in other
words, for what values of μ0 —we will fail to reject our null hypothesis.
$$-t^* < \frac{\bar{x} - \mu_0}{s/\sqrt{n}} < t^*$$
$$-t^*\frac{s}{\sqrt{n}} < \bar{x} - \mu_0 < t^*\frac{s}{\sqrt{n}}$$
$$-\bar{x} - t^*\frac{s}{\sqrt{n}} < -\mu_0 < -\bar{x} + t^*\frac{s}{\sqrt{n}}$$
$$\bar{x} - t^*\frac{s}{\sqrt{n}} < \mu_0 < \bar{x} + t^*\frac{s}{\sqrt{n}}.$$
This means that we will fail to reject the null hypothesis if μ0 is in the interval
$$\left(\bar{x} - t^*\frac{s}{\sqrt{n}},\ \bar{x} + t^*\frac{s}{\sqrt{n}}\right)$$
where we choose t* such that P(t_{n−1} ≤ t*) = 1 − α/2. If you look closely, you will notice that this interval is our (1 − α)100% confidence interval for μ. This is not a coincidence. If we create a (1 − α)100% confidence interval for μ and then want to conduct a hypothesis test of H0 : μ = μ0 versus HA : μ ≠ μ0 at the same α significance level, we will fail to reject the null hypothesis if μ0 is contained in our (1 − α)100% confidence interval. On the other hand, if μ0 is not contained in the (1 − α)100% confidence interval, then we will reject the null hypothesis at the α significance level.
Let us see this in practice, calculating our confidence interval, applying this
result, and confirming it through hypothesis testing. Recall a previous exam-
ple [45,46] in which a factory that manufactured washers wanted to test if the
population average of washer size differed from 4 in any way. The sample av-
erage of 20 washers was found to be 3.9865 with sample standard deviation of
s = 0.073. If we wanted to create a 95% confidence interval, we would find—
using qt(p=0.975, df=19)—that t ∗ is 2.093. Putting this and our sample data
together, our 95% confidence interval for μ would be
$$3.9865 \pm 2.093 \times \frac{0.073}{\sqrt{20}} = (3.952, 4.021).$$
With this in mind, if we were testing H0 : μ = 4 versus HA : μ ≠ 4 at the α = 0.05 level, we could just look at our 95% confidence interval, since the α level for a 95% confidence interval is α = 0.05. Since 4 is contained in the interval (3.952, 4.021)—our 95% confidence interval for μ—we would fail to reject the null
hypothesis. If we look back at our previous chapter, we found that our p-value
for this test was 0.4185027. This clearly implies that we should fail to reject the
null hypothesis, which matches our conclusion based on our confidence interval.
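A quick check of this interval in R, using the summaries above:

xbar <- 3.9865; s <- 0.073; n <- 20
t_star <- qt(p=0.975, df=n - 1)              # 2.093
xbar + c(-1, 1) * t_star * s / sqrt(n)       # (3.952, 4.021)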
12. Say you created a 99% confidence interval for μ. Would you reject the null hypothesis in the test of H0 : μ = 4 versus HA : μ ≠ 4 at the α = 0.005 level?
14.5 Conclusion
Our confidence intervals for a single parameter grow out of the same assump-
tions and foundational theorems as hypothesis testing. At the center is the central
limit theorem, which again allows us to connect our data to our parameters. This
connection brings additional uses to the confidence interval, allowing us to both test with confidence intervals as well as determine sample size for future studies. This is all in addition to its main use: giving us the range of plausible values for our parameter. Similar to hypothesis testing, the procedure for confidence intervals remains consistent across intervals for both population means and proportions. We will see this continue going forward as we begin to compare groups to see if they are similar.
Chapter 15
Hypothesis tests for two parameters
15.1 Introduction
To this point, we have talked about statistical inference for only a single param-
eter. Our questions have been limited to asking if a single population proportion
or population mean is plausibly equal to a value. However, the majority of
statistical questions are ones of comparisons: Does one group have a greater
population mean than another? Does one group have a population proportion
equal to another?
We see these sorts of questions all the time. In a 2016 pre-election survey
[54], respondents were asked if they had a candidate’s sign in their yard. Of
persons who had voted in the 2012 election, 26.6% had a sign in their yard,
while among those who had not voted in 2012 only 7.8% had a sign in their
yard. A reasonable question could be: “Does the population proportion of 2012
voters who have a candidate’s sign in their yard differ from those who didn’t
vote in 2012?”
This chapter looks into just these sorts of questions and the hypothesis tests
that can help answer them. There are three tests that we will talk about in
this chapter: the two-sample test for proportions, the two-sample t-test for
means, and the paired t-test for means. Each test is designed to work in a
specific scenario, so knowing which test to use is important. Ultimately, identi-
fying the correct test to use comes down to answering two questions about our
samples:
1. Is the response of interest categorical or quantitative?
2. Are the two samples independent? Put another way, is one sample affected
by or related to another sample in some way?
As we go through our test, we will go through the answers to these two
questions to choose the appropriate test, as well as the steps that go into each of
these hypothesis tests.
If our two population proportions are equal, then p1 = p2, or, subtracting p2 from both sides, p1 − p2 = 0. This will become our null hypothesis:

H0 : p1 − p2 = 0.
Our alternative hypothesis will again be one of three options. The first option is that the population proportion for our first population is greater than the population proportion for our second population, or p1 > p2. This is equivalent to

HA : p1 − p2 > 0.
Our second option is the opposite of the first, with p1 < p2 . Again, we can
do a little algebra to see that this is equivalent to p1 − p2 < 0, resulting in the
alternative hypothesis
HA : p1 − p2 < 0.
Last of all is the two-sided test, where we are interested in whether the population proportions of our two groups differ in any way, or p1 ≠ p2. In this case, our alternative hypothesis will become

HA : p1 − p2 ≠ 0.
Let us take an example to illustrate this and the following steps. In a 2018
survey [55], Pew research did a study to investigate America’s attitudes toward
space exploration. One of the key questions was: “Is it essential for the United
States to continue to be a world leader in space exploration?” In order to see
how opinions differed across generations, the researchers grouped responses
by generation. Say the researchers were interested in testing if the population
proportion of individuals who believed it was essential for the United States to
continue being a world leader in space exploration was the same or differed for
Millenials and Gen Xers. Our hypotheses for this test would be
H0 : pMillenial − pGenX = 0
HA : pMillenial − pGenX ≠ 0.
To summarize our data, we need the sample proportions from both samples—p̂1 and p̂2—as well as the sample sizes for our samples from both populations one and two—n1 and n2. Outside of these summaries, we do not need to know anything else about the sample.
In the space exploration survey, 467 out of 667 Millenials believed it was important for the United States to be a leader in space exploration, while 407 out of 558 Gen Xers believed the same. Thus our sample proportions would be p̂Millenial = 467/667 = 0.7001 and p̂GenX = 407/558 = 0.7293.
To build our test statistic, we again begin by assuming that the null hypothesis H0 : p1 − p2 = 0 is true and that the sample sizes are large enough for both groups. By the central limit theorem, this means that
$$\hat{p}_1 \sim N\left(p_1, \frac{p_1(1-p_1)}{n_1}\right)$$
$$\hat{p}_2 \sim N\left(p_2, \frac{p_2(1-p_2)}{n_2}\right)$$
where p1 and p2 are the true population proportions for our two populations
and n1 and n2 are the sample sizes for our two samples. Now, we need to know
the distribution of p̂1 − p̂2 in order to get the standard error. Based on the fact
that p̂1 and p̂2 follow Normal distributions for sufficient sample sizes and what
we know about how we can combine Normal distributions, it turns out that the
distribution of p̂1 − p̂2 is a Normal distribution, specifically
$$\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2, \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}\right).$$
Now, in hypothesis testing we always assume going in that the null hypoth-
esis is true, and thus we need to calculate this standard error assuming that the
null hypothesis is true. If that is the case, that means that p1 = p2 , so we can
drop the subscripts and use a common population proportion p. This means that
our standard error would become
$$\text{s.e.}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2}}$$
where p is the true population proportion for both populations one and two.
However, this common population proportion p is a parameter and, therefore,
an unknown quantity. To calculate the test statistic, we would need to get an
estimate for p. Usually, we estimate the population proportion with the sample
proportion. However, we now have two samples that we are working with, so
which do we use? If p1 = p2 = p, that implies that the two populations are no
different from one another. If those two populations are no different from each
other, we can combine information from both samples to better estimate the
common population proportion p. This estimate for p using information from
both samples is called the pooled proportion because it pools together all the
information available to get a single estimate for p. The pooled proportion p̂
will be the combined successes y1 + y2 divided by the combined sample size n1 + n2, or
$$\hat{p} = \frac{y_1 + y_2}{n_1 + n_2}.$$
Here, y1 and y2 are the number of successes in the sample from populations
one and two, while n1 and n2 are the sample sizes. If we define our pooled
proportion in this way, our standard error for p̂1 − p̂2 becomes
$$\text{s.e.}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n_1} + \frac{\hat{p}(1-\hat{p})}{n_2}} = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}.$$
This is the final component of our test statistic for this hypothesis test, resulting in
$$t = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}.$$
In our space exploration survey example, we first need to find our pooled proportion p̂. Based on our data, this will be p̂ = (467 + 407)/(667 + 558) = 0.7135. Putting this together with our remaining sample data, we can calculate that our test statistic is
$$t = \frac{0.7001 - 0.7293}{\sqrt{0.7135(1 - 0.7135)\left(\frac{1}{667} + \frac{1}{558}\right)}} = -1.1273.$$
HA                      P-value
HA : p1 − p2 < 0        P(N(0, 1) ≤ t)
HA : p1 − p2 > 0        P(N(0, 1) ≥ t)
HA : p1 − p2 ≠ 0        2 × P(N(0, 1) ≥ |t|)
Again, we can only use our Normal distribution for p-values if the central
limit theorem holds, which is contingent on the sample size. We need for both
sample sizes n1 and n2 to be at least 30, and need to observe at least 10 successes
and failures in both our sample from population one and population two.
In the space exploration example, the alternative hypothesis is HA : pMillenial − pGenX ≠ 0 and our test statistic is t = −1.1273, so our p-value will be 2 × P(N(0, 1) > |−1.1273|). We can get this through the pnorm function, specifically with the code given below, to find that our p-value is 0.2596.
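2 * pnorm(q=1.1273, mean=0, sd=1, lower.tail=FALSE)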
15.2.6 Conclude
Finally, once we have our p-value, we can draw our conclusions. As before, if
the p-value is less than the significance level α, we will reject the null hypothesis
and conclude in favor of our alternative hypothesis. Otherwise, we will fail to
reject the null hypothesis and conclude that it is plausible that the two population
proportions are equal.
To conclude our space exploration example, since our p-value of 0.2596 is greater than our significance level of α = 0.1, we would fail to reject the null hypothesis and conclude it is plausible that the proportion of individuals who believe that it is important for the United States to continue to be a leader in space exploration is the same for Millenials and Gen Xers.
Let us see all these steps in practice all at once. A January 2018 poll done
by the Pew Research Group investigated the social media habits of Americans
[56]. One of the questions they were interested in was if the proportion of
Americans aged 18–29—population one with proportion p1 —were online “con-
stantly” more than ages 30–49—population two with proportion p2 . They found
that 137 out of 352 participants in the 18–29 sample were online “constantly”
as opposed to 190 out of 528 participants in the 30–49 sample. To answer this
question, our hypotheses would be
H0 : p1 − p2 = 0
HA : p1 − p2 > 0.
Here, we will assume that our significance level is α = 0.05. In collecting and summarizing our data, we have that our sample proportions are p̂1 = 137/352 = 0.3892 and p̂2 = 190/528 = 0.3598. Additionally, we will need the pooled proportion for our test, calculated using p̂ = (137 + 190)/(352 + 528) = 0.3716. Now, to calculate our test statistic we will have
$$t = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.3892 - 0.3598}{\sqrt{0.3716(1 - 0.3716)\left(\frac{1}{352} + \frac{1}{528}\right)}} = 0.8829.$$
So, based on our alternative hypothesis, our p-value will be P(N(0, 1) > 0.8829). We can use the R code
pnorm(q=0.8829, mean=0, sd=1, lower.tail=FALSE)
to get that this probability, and our p-value, is equal to 0.1887. Since the p-value is greater than α = 0.05, we fail to reject the null hypothesis and conclude that it is plausible that the proportions of 18–29-year-olds and 30–49-year-olds who are online “constantly” are equal.
The previously mentioned prop.test function is also able to do our two-
sample test for proportions. Rather than just including information about a
single sample, we merely add in our information about our second sample. As
before, the function will give us our test statistic and p-values. More detail and
examples are given in Chapter 17.
A study examined whether former inmates who received financial aid after release were rearrested at a different rate than those who did not receive aid [48,58]. Out of 216 former inmates who did not receive financial aid, 66
were arrested within a year of release. Of the 216 who did receive financial aid,
48 were arrested within a year of release.
6. Set up and test this hypothesis at the α = 0.01 level.
A researcher is interested in whether the proportion of individuals who re-
ceived corporal punishment as a child believe in corporal punishment for chil-
dren at a higher rate than those who did not receive corporal punishment [90,91].
Out of 307 individuals who remember receiving corporal punishment as a child,
263 believe in moderate corporal punishment of children. Of 1156 individuals
who do not remember receiving corporal punishment, 783 believe in moderate
corporal punishment of children.
7. Set up and test this hypothesis at the α = 0.1 level.
If our two population means are equal, then μ1 = μ2, or equivalently μ1 − μ2 = 0. This will become our null hypothesis, that the two populations have equal means, mathematically stated as

H0 : μ1 − μ2 = 0.
Our alternative hypothesis will be one of three options that we have seen
before. The first option is that population mean for population one is greater
than the population mean for population two, or μ1 > μ2 . With a little algebra,
this leads to the alternative hypothesis
HA : μ1 − μ2 > 0.
Option two has the population mean for population one being less than the
population mean for population two, or μ1 < μ2 . After subtracting μ2 from both
sides of the inequality, our alternative hypothesis becomes
HA : μ1 − μ2 < 0.
Last of all we have our two-sided test, used if we are interested in asking whether the two population means differ in any way, or μ1 ≠ μ2. In this case, our alternative hypothesis is

HA : μ1 − μ2 ≠ 0.
For our conformity example, say we were interested in testing if partners of
high status resulted in equal versus higher conformity compared to partners of
low status. In this case, our hypotheses would be
H0 : μHigh − μLow = 0
HA : μHigh − μLow > 0.
To summarize our data, we need the sample averages for both samples—x̄1 and x̄2, respectively—the sample variances for both samples from population one and population two—s1² and s2², respectively—and the sample size for both samples—n1 and n2.
In the conformity example, 23 subjects experienced a partner of higher status, with the average number of conforming responses being x̄High = 14.217 with variance s²High = 19.087. For our second group, 22 subjects experienced a partner of lower status, with the average number of conforming responses being x̄Low = 9.955 with a sample variance of s²Low = 27.855.
By the central limit theorem, assuming large enough samples, our two sample averages follow Normal distributions:
$$\bar{x}_1 \sim N\left(\mu_1, \frac{\sigma_1^2}{n_1}\right), \quad \bar{x}_2 \sim N\left(\mu_2, \frac{\sigma_2^2}{n_2}\right).$$
However, we need the standard error of x̄1 − x̄2 . Based on what we know
about combining Normal distributions, we find that x̄1 − x̄2 also follows a Nor-
mal distribution, specifically
$$\bar{x}_1 - \bar{x}_2 \sim N\left(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)$$
where σ1² and σ2² are the population variances and n1 and n2 are our sample sizes. If we can assume the two population variances are equal—σ1² = σ2²—we can estimate the common variance by pooling our two sample variances into the pooled sample variance
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2},$$
where s1² and s2² are our sample variances for sample one and sample two, respectively. This pooled sample variance can be substituted for our pooled population variance in the standard error of x̄1 − x̄2, with a little factoring making the standard error now
$$\text{s.e.}(\bar{x}_1 - \bar{x}_2) = \sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}.$$
We then substitute this standard error into the formula for our test statistic, making our test statistic for equal variances
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}.$$
Our other option is if our variances are not equal, or σ1² ≠ σ2². If this is the case, we need two statistics that estimate σ1² and σ2² in order to use these statistics in our standard error. It seems reasonable that our two sample variances should estimate these population variances, so we can substitute these sample variances for our population variances. This makes our standard error
$$\text{s.e.}(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
and our test statistic
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}.$$
In the conformity example, the ratio of the sample variances suggests it is plausible that the population variances are equal, so we pool them: sp² = (22 × 19.087 + 21 × 27.855)/(23 + 22 − 2) = 23.37. Our test statistic is then
$$t = \frac{14.217 - 9.955}{\sqrt{23.37\left(\frac{1}{23} + \frac{1}{22}\right)}} = 2.956.$$
Which distribution our test statistic follows depends on the answer to the question that we posed earlier: Are our variances equal or not? We saw how we can roughly determine the answer to this question and how that answer can affect our test statistic. This answer again comes into play, as it will change our p-values.
In the case where we determine that our variances are equal, our test statistic
will follow a t-distribution, similar to what we saw before. However, our degrees
of freedom have changed. In this case, our degrees of freedom will be ν = n1 +
n2 − 2, where n1 and n2 are our sample sizes from our two samples.
When the variances are not equal, the distribution of our test statistic is a little
more difficult to define. It turns out that our test statistic follows approximately
a t-distribution. This is an approximation, not precise—the exact distribution
of this test statistic with unequal variances is one of the classic unsolved ques-
tions in statistics. However, knowing that the test statistic follows approximately
a t-distribution is not enough; we need our degrees of freedom as well. Exact
solutions for the degrees of freedom are hard to come by and difficult to imple-
ment, so more often approximations are used.
The most common of these approximations is referred to as the Satterthwaite
approximation, or the Welch approximation. In this case, our degrees of freedom
ν of our t-distribution used for p-values will be given by
$$\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2}.$$
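This formula is easy to mistype, so a small helper function can be useful; the name welch_df is ours, not a built-in:

welch_df <- function(s1sq, n1, s2sq, n2) {
  # Welch-Satterthwaite approximate degrees of freedom
  (s1sq/n1 + s2sq/n2)^2 /
    ((s1sq/n1)^2 / (n1 - 1) + (s2sq/n2)^2 / (n2 - 1))
}
welch_df(19.087, 23, 27.855, 22)   # about 40.8 for the conformity data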
Once we know the distribution of our test statistic, we only need our alterna-
tive hypothesis. As before, calculating our p-values is dependent on our alterna-
tive hypothesis, with identical rationale to our one-sample t-test for means—and
therefore that rationale is omitted here. Under each of the three alternative hy-
potheses, our p-values will be
HA                      P-value
HA : μ1 − μ2 < 0        P(tν ≤ t)
HA : μ1 − μ2 > 0        P(tν ≥ t)
HA : μ1 − μ2 ≠ 0        2 × P(tν ≥ |t|)
As before, these p-values are valid only if the central limit theorem holds for both samples: either both sample sizes are large enough or both responses follow Normal distributions. That is,
$$n_1 \ge 30,\ n_2 \ge 30 \quad \text{OR} \quad x_{1,i} \sim N(\mu_1, \sigma_1^2),\ x_{2,i} \sim N(\mu_2, \sigma_2^2).$$
15.3.6 Conclude
Our conclusion step has not changed, as we again reject the null hypothesis if
the p-value is less than our significance level α, while failing to reject otherwise.
In our conformity example, since our p-value of 0.00266 is less than our signifi-
cance level of α = 0.01, we reject the null hypothesis and conclude that partners
of higher status result in higher levels of conformity than partners of lower sta-
tus. However, it is important to note that our sample size is smaller than 30 in
both samples and a look at histograms of our data brings Normality into doubt,
bringing the results of this test somewhat under scrutiny. (See Fig. 15.1.)
Let us see all these steps in practice. A researcher is investigating the effect
of two different diets on the weight of newborn chickens [60]. Specifically, they
were interested in if there was any difference in weight gain for chicks on a diet
supplemented with sunflower seeds versus a diet supplemented with soybeans.
Say we tested this at an α = 0.05. Upon collecting the data, a sample of 12 chicks on sunflowers had an average weight gain of x̄sunflower = 328.92 with a sample variance of s²sunflower = 2384.99. Meanwhile, amongst 14 chicks on soybeans, the sample average weight gain was x̄soybean = 246.43 with a sample variance of s²soybean = 2929.96. Since s²sunflower/s²soybean = 0.814, we can say that it is plausible that the variances are equal. Because of this, we can calculate our pooled sample variance to be

sp² = ((12 − 1) × 2384.99 + (14 − 1) × 2929.96)/(12 + 14 − 2) = 2680.182.

Based on this data, our test statistic would be

t = (246.43 − 328.92)/√(2680.182 × (1/14 + 1/12)) = −4.050.
Our test statistic will follow a t-distribution, with our degrees of freedom
being ν = 12 + 14 − 2 = 24. Based on our alternative hypothesis, our p-value
will be 2 × P (t24 > | − 4.050|). We turn to the R code pt(q=4.050, df=24,
lower.tail=FALSE) to get that this probability is 0.0002322, then double it to get
our p-value of 0.0004644. So, since our p-value is definitely less than α = 0.05,
we will reject the null hypothesis and conclude that there is a difference between
the two diets. The only concern with this result is that the small sample sizes for both of our diets and a lack of Normality in our data mean that our results may not be fully reliable. (See Fig. 15.2.)
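As an aside, the summary statistics in this example appear to match R's built-in chickwts data; assuming that correspondence, the test can be reproduced with the t.test function discussed later:

# Assuming this example corresponds to R's built-in chickwts data
sunflower <- chickwts$weight[chickwts$feed == "sunflower"]
soybean <- chickwts$weight[chickwts$feed == "soybean"]
t.test(soybean, sunflower, alternative="two.sided", var.equal=TRUE)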
15.4 The paired t-test for means
Why would we want overlap between our two samples? In most cases, we
want our samples or our variables to be independent, so why the exception now?
To answer this question, let us consider the following scenario. A shoe company
claims that their running shoe will help improve your mile run time. To test this
claim, you get a bunch of people together and randomly assign half the people
to use the new shoe in a mile run while the other half will run using their regular
old shoes. You then plan to do hypothesis testing using the two-sample t-test for
means to compare the two groups.
What is the problem with this setup? We wind up comparing the two groups,
so what is the problem? If we assign the groups in this way—half the partic-
ipants in each group—we do not account for each individual runner’s ability.
Under this scenario it is entirely possible that one group happens to have better
runners, resulting in that group’s times being lower, regardless of the effect of
the shoe.
So what setup would be better? It would be better to have each person run a
mile in both the new shoes and old shoes so that we can directly compare each
individual’s mile time under both conditions. In fact, we can take the difference
between their time with their old shoe and with the new shoe and conduct our
hypothesis test on this difference.
This will be the general idea of our paired t-test for means. We collect two
samples that are connected in some way—called a paired sample—and then
conduct what essentially amounts to a one-sample t-test for means on the dif-
ferences for each individual. The steps of this paired test are identical to what we've seen before, with an especial connection to our one-sample t-test.
As we work through these steps, we will illustrate them with the fol-
lowing example. A researcher is comparing the accuracy of two analysis
techniques determining the sugar concentration of a breakfast cereal: liquid
chromatography—a slow but accurate process—and an infra-analyzer 400 [23].
Letting μD denote the population average difference between the two methods, our null hypothesis is that this difference is zero:

H0 : μD = 0.
Our alternative hypothesis comes in one of the three familiar forms: either the population average difference is less than 0, greater than 0, or just not equal to 0.

HA : μD < 0
HA : μD > 0
HA : μD ≠ 0.
In the breakfast cereal example, say we wanted to test if there was any differ-
ence between the results of the two analysis techniques. Our hypotheses would
be
H0 : μD = 0
HA : μD ≠ 0.
After calculating the individual differences, we will need the sample aver-
age of the differences x̄D , the sample standard deviation (or variance) of the
differences sD , and the sample size n.
In the breakfast cereal analysis, the researcher collected the percentage of
sugar in 100 samples of breakfast cereal using both the liquid chromatography
method and infra-analyzer. The differences for each sample were calculated, with
the sample average being x̄D = −0.622 with a standard deviation of sD = 1.988.
Recall the test statistic from our one-sample t-test for means:

t = (x̄ − μ0)/(s/√n).

Substituting in our differences, with the null hypothesis value μ0 = 0, our test statistic becomes

t = x̄D/(sD/√n).

This will be our test statistic for our paired t-test for means, with its direct connections to the one-sample t-test for means. For the breakfast cereal example, we have our data; we just need to fill in the various parts of our test statistic:

t = −0.622/(1.988/√100) = −3.129.
HA               P-value
HA : μD < 0      P(tn−1 ≤ t)
HA : μD > 0      P(tn−1 ≥ t)
HA : μD ≠ 0      2 × P(tn−1 ≥ |t|)
Just as with the one-sample t-test, this result will only hold if the central
limit theorem holds. This requires a large enough sample size—n ≥ 30 is the
standard—or that our differences are Normally distributed so that we are able to
trust the results of our test.
In the breakfast cereal example, our alternative hypothesis is HA : μD ≠ 0, so our p-value will be 2 × P(t99 > |−3.129|). Using the following R code, we will find that our p-value is 0.0023.
2*pt(3.129, df=99, lower.tail=FALSE)
15.4.6 Conclude
As with all our other hypothesis tests, our conclusion step consists of the jus-
tification for our decision—our p-value will either be greater than or less than
our significance level α—the decision we reached—fail to reject or reject the
null hypothesis, respectively—and the resulting context about what this deci-
sion means for our parameter.
In the breakfast cereal analysis, since our p-value of 0.0023 is less than our
significance level α = 0.05, we will reject the null hypothesis and conclude there
is a difference between the two analysis techniques.
We can illustrate these steps using one of the early statistical datasets: Stu-
dent’s—a pseudonym of statistician William Gosset—sleep data [17]. Ten in-
dividuals were given one of two soporific drugs, also known as sleep aids, and
their hours of sleep were recorded. Then the process was repeated with the other
drug. They wanted to test if the two drugs—Dextro and Laevo—were any dif-
ferent in hours of sleep added. So, the hypotheses for this question are
H0 : μD = 0
HA : μD ≠ 0
which we will test at the α = 0.1 level. In the sample, it was found that the
average difference—calculated as Dextro minus Laevo—was x̄D = −1.58 with
a sample standard deviation of sD = 1.23. So the test statistic was
t = −1.58/(1.23/√10) = −4.06.
Because we are dealing with a two-tailed test, our p-value is 2 × P (t10−1 >
| − 4.06|). Using the R code pt(q=4.06, df=9, lower.tail=FALSE), we get that
this probability is 0.00142, which we double to get the p-value of 0.00284. Since
this p-value is very much below the α = 0.1 threshold, we will reject the null
hypothesis and conclude that there is some difference in the effectiveness of
the two drugs. While the sample sizes are small, a histogram of the differences makes it appear that they are roughly Normal. (See Fig. 15.3.)
The t.test function is adaptable to all our two-sample tests for means—paired and unpaired, equal or unequal variances. By including both our samples and changing a few of the function's options—namely the paired and var.equal options—we are able to access all of these varied tests, as sketched below.
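A minimal sketch of these variants, assuming numeric vectors x and y hold our two samples:

t.test(x, y, var.equal=TRUE)    # two-sample t-test with pooled (equal) variances
t.test(x, y, var.equal=FALSE)   # two-sample t-test with unequal variances (the default)
t.test(x, y, paired=TRUE)       # paired t-test on the differences x - y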
19. Test if the population average percentage of solids is the same or different
for the shaded and exposed halves of the at the α = 0.05 level.
In 1876, Darwin tested corn plants to see if the height of cross-pollinated corn plants differed from that of self-pollinated corn plants grown in the same pots under the same conditions. Of his n = 15 observations, he found a sample average difference in heights—cross-pollinated minus self-pollinated—of x̄D = 2.617 with a sample variance of the differences of sD² = 22.26 [83]. Histograms of the
15.5 Conclusion
Comparing two populations is one of the most common and most important
problems in statistics. Whether it be proportions, means, or paired means, there
is a hypothesis test for each of these scenarios. As we saw previously, the pro-
cedures for these scenarios remain the same, with adjustments for the fact that
we now have two sets of data from two separate populations. We will be able to
apply these results to the other form of inference in confidence intervals.
Chapter 16
Confidence intervals for two parameters
16.1 Introduction
When we first introduced one-sample hypothesis tests, we commented that these
tests did not represent the only form of statistical inference for one sample.
Confidence intervals, intervals giving us a range of plausible values for the pa-
rameter of interest, also were a valid form of inference that corresponded with
each hypothesis test. This is also the case for our two-sample tests. Just as we
were testing if the two population means differed, we are able to get a range of
plausible values for the difference between the two population proportions, two
population means, or a paired mean. In this chapter, we will derive and interpret
our interval as well as seeing some of the applications of the intervals beyond
providing ranges of plausible values.
Our first such interval will describe the relationship between p1 and p2, including the range of plausible values for that difference.
As before, our goal is to come up with an interval for p1 − p2 of the form
(Lower, Upper) such that
P(Lower < p1 − p2 < Upper) = 1 − α.
Let us keep this probability statement in the back of our mind for now. Next,
we need to find some way to connect our data and our parameters p1 and p2 to
the standard Normal distribution. We will do this, as we have for previous inter-
vals and hypothesis tests, through the central limit theorem and the properties of
Normal distributions. According to the central limit theorem, if our sample size
is large enough and we have seen sufficient successes and failures, our two sam-
ple proportions p̂1 and p̂2 will each individually follow a Normal distribution
centered around the true population proportions p1 and p2 .
p̂1 ∼ N(p1, p1(1 − p1)/n1)
p̂2 ∼ N(p2, p2(1 − p2)/n2).
Now, we are interested in the interval for p1 − p2 . With that in mind, let us
look at what the distribution of p̂1 − p̂2 becomes. If we take the difference in
our sample proportions, the distribution will be a Normal distribution centered
at p1 − p2
p̂1 − p̂2 ∼ N(p1 − p2, p1(1 − p1)/n1 + p2(1 − p2)/n2).

This implies that the standard error of p̂1 − p̂2 is √(p1(1 − p1)/n1 + p2(1 − p2)/n2).
Knowing the value of this standard error involves knowing the values of p1 and
p2 , which is impossible since both p1 and p2 are parameters and, therefore,
unknown. We need to substitute in an estimate for each of the population pro-
portions in order for us to know the standard errors. These estimates will be the
sample proportions for each sample, substituting p̂1 for p1 and p̂2 for p2 .
Now, since the distribution of p̂1 − p̂2 is a Normal distribution, we are able
to convert it to a standard Normal by subtracting off the mean p1 − p2 and
dividing by the standard error. That is,
(p̂1 − p̂2 − (p1 − p2))/√(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2) ∼ N(0, 1).
Now, to find our (1 − α)100% confidence interval, we need to take the above probability and solve for p1 − p2. Doing so requires a little algebra—omitted here—and will result in a confidence interval with the probability statement that we desire:

P(−z* < (p̂1 − p̂2 − (p1 − p2))/√(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2) < z*)
= P(p̂1 − p̂2 − z*√(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2) < p1 − p2 < p̂1 − p̂2 + z*√(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)) = 1 − α,
where p̂1 and p̂2 are our sample proportions, n1 and n2 are our sample sizes,
and z∗ is our critical value. In practice, we will find the critical value z∗ using
the qnorm function as previously described.
You might notice that this confidence interval formula does not use the
pooled proportion, something that was a key component of our two-sample test
for proportions. This discrepancy is entirely explainable by one of our funda-
mental assumptions of hypothesis testing. In hypothesis testing, we assume that
the null hypothesis is true, which in this case implies that p1 = p2 = p, leading
to the need for our pooled proportion estimate that we use in our hypothesis
test. In confidence intervals, we have no null hypothesis to assume to be true,
and thus there is no common proportion that is estimated by the pooled propor-
tion.
It’s important to note that as this confidence interval relies on the results
of the central limit theorem, we will need the assumptions of the central limit
theorem to hold for our confidence interval to be valid. As we are dealing
with two sample proportions p̂1 and p̂2 , the assumptions will have to hold for
both proportions. This means that we need both our sample sizes to be large
enough—generally n1 ≥ 30 and n2 ≥ 30—and we see enough successes and
failures in both samples—usually at least 10 each.
Let us look at this in practice. In a 2018 survey, the Pew Research group
asked 743 teens and 1058 adults the question, “Do you spend too much time
on your cellphone?” Four hundred and one of the teens and 381 of the adults
answered “Yes” [65]. Say that we wanted to create a 98% confidence interval
for the difference between the true proportions pT een and pAdult .
To fill in the various components of our confidence interval, we need our sample proportions, our sample sizes, and our critical value. Our sample proportions for the two groups are p̂Teen = 401/743 = 0.5397 and p̂Adult = 381/1058 = 0.3601. Our sample sizes will be nTeen = 743 and nAdult = 1058. All that remains is to find the critical value z*. Our α level is α = 0.02 since (1 − 0.02)100% = 98%, so we will choose z* so that P(N(0, 1) < z*) = 1 − 0.02/2 = 0.99. We use the code qnorm(0.99, mean=0, sd=1) to get our critical value z* = 2.326348. So, our 98% CI for pTeen − pAdult will be
p̂Teen − p̂Adult ± z*√(p̂Teen(1 − p̂Teen)/nTeen + p̂Adult(1 − p̂Adult)/nAdult)
= 0.5397 − 0.3601 ± 2.326 × √(0.5397(1 − 0.5397)/743 + 0.3601(1 − 0.3601)/1058)
= (0.1249, 0.2343).
So we are 98% confident that the interval (0.1249, 0.2343) covers the true
value of pT een − pAdult .
The prop.test function allows us to get our confidence interval for p1 −
p2 . By merely changing our confidence level—the conf.level option in the
function—R will get us the confidence interval we are interested in. Further
details and examples are given in Chapter 17.
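As a preview, a sketch of the call for the Pew example above, with conf.level set to give the 98% interval (correct=FALSE disables the continuity correction, matching our by-hand work):

prop.test(x=c(401, 381), n=c(743, 1058), conf.level=0.98, correct=FALSE)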
Just as with proportions, the central limit theorem tells us that the difference in our sample means x̄1 − x̄2 follows a Normal distribution:

x̄1 − x̄2 ∼ N(μ1 − μ2, σ1²/n1 + σ2²/n2).

This result does come with some problems, though, as the denominator requires knowing the true values of the population variances σ1² and σ2². This is impossible, as both values are population parameters and, therefore, unknown.
We need a substitution for σ1²/n1 + σ2²/n2. In fact, we know two substitutions, one each depending on our answer to the question: “Are our population variances equal?”
If the variances are equal, we again compute the pooled sample variance

sp² = ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2).

This pooled sample variance estimates the common population variance and is then inserted into our standard error for x̄1 − x̄2. We can do this exact same procedure in this case, making our standard error for equal variances

s.e.(x̄1 − x̄2) = √(sp²(1/n1 + 1/n2)).
We can substitute this in for the standard error based on our unknown population variances. However, when we do this, the probability distribution changes. Instead of dealing with a standard Normal, our focus changes to a t-distribution with ν = n1 + n2 − 2 degrees of freedom.
We will use this t-distribution as our probability distribution. So, for our
(1 − α)100% confidence interval we will want to choose a critical value t ∗ so
that
P(−t* < tν < t*) = 1 − α.
Now, as we saw above, we have something to connect our parameters μ1 −
μ2 and data to this distribution: our test statistic. So, we can sub this into the
probability statement, resulting in,
P(−t* < (x̄1 − x̄2 − (μ1 − μ2))/√(sp²(1/n1 + 1/n2)) < t*) = 1 − α.
Solving this statement for μ1 − μ2, exactly as we do below, yields the equal-variance interval. If instead the variances are not equal, we can substitute our sample variances s1² and s2² for σ1² and σ2² in our standard error. Thus, our standard error of x̄1 − x̄2 becomes

s.e.(x̄1 − x̄2) = √(s1²/n1 + s2²/n2).
We can then substitute this into our probability statement instead of us-
ing the standard error based on the unknown population variances. However,
when we use this standard error in our probability statement, the associated
probability distribution changes from a Normal distribution to approximately a
t-distribution. As with the hypothesis test, this distribution is an approximation,
as an exact answer does not exist. Like in our hypothesis test, we approximate
the degrees of freedom ν using the Satterthwaite approximation, resulting in a
degrees of freedom of
ν = (s1²/n1 + s2²/n2)² / [(1/(n1 − 1))(s1²/n1)² + (1/(n2 − 1))(s2²/n2)²].
We will use this t-distribution as our probability distribution. So, for our
(1 − α)100% confidence interval we will want to choose a critical value t ∗ so
that
P(−t* < tν < t*) = 1 − α.
Now, as we saw above, we have something to connect our parameters μ1 −
μ2 and data to this distribution: our test statistic. So, we can sub this into the
probability statement, resulting in,
P(−t* < (x̄1 − x̄2 − (μ1 − μ2))/√(s1²/n1 + s2²/n2) < t*) = 1 − α.
Now, all we have to do is solve for μ1 − μ2 inside the probability statement
and we will have our (1 − α)100% confidence interval for μ1 − μ2
P(−t* < (x̄1 − x̄2 − (μ1 − μ2))/√(s1²/n1 + s2²/n2) < t*) = 1 − α
⋮
P(x̄1 − x̄2 − t*√(s1²/n1 + s2²/n2) < μ1 − μ2 < x̄1 − x̄2 + t*√(s1²/n1 + s2²/n2)) = 1 − α.
This gives us our interval with lower and upper bounds so that P(Lower < μ1 − μ2 < Upper) = 1 − α. Thus our (1 − α)100% confidence interval for μ1 − μ2 will be

x̄1 − x̄2 ± t*√(s1²/n1 + s2²/n2).
So, we are 90% confident that (−47.460, −4.008) will cover the true value of μ298 − μ493. However, our small sample and slight lack of Normality gives us some doubt about the validity of these results.
[10,28].
5. Calculate and interpret the 96% confidence interval for μF − μM .
Say a researcher is interested in the difference in atmospheric ozone levels during the summer and winter in Leeds. In 578 summer days sampled, the average ozone level was found to be x̄S = 32.00 parts per billion with a variance of sS² = 106.81. Meanwhile, in 532 winter days, the average ozone level was x̄W = 20.06 with a variance of sW² = 118.72 [97,98].
That is, we will again need a sample size of n ≥ 30 paired observations or the differences to be Normally distributed in order for the central limit theorem—and thus our confidence intervals—to hold.
To see this in practice, let us look at an example from the PairedData li-
brary in R. Say that we want to determine if two processes for determining the
percentage of iron in an ore differ in their results [63,67]. To do so, we take
10 samples of iron ore and measure the percentage of iron by method A and
method B. Since there is overlap between the two samples, this is a paired sam-
ple. In the sample, they find that the average difference (A-B) between the two
methods is x̄D = −0.13 with a sample standard deviation of sD = 0.177. Say
we wanted to create a 90% confidence interval for μD . They only thing we are
missing is the critical value t ∗ . To find this, we use the qt(0.95, df=9) function
to get t ∗ = 1.833. So, our 90% Confidence Interval for μD is
0.177
−0.13 ± 1.833 × √ = (−0.027, −0.233)
10
with the interpretation that we are 90% confidence that the true population dif-
ference is covered by the interval (−0.027, −0.233). However, the small sample
size and non-Normal differences may cast some doubts about the validity of the
results. (See Fig. 16.2.)
As with our hypothesis tests, the t.test function is able to give us our con-
fidence intervals for μ across the board. Adjusting similar options that were
mentioned in our two-sample tests for μ and changing the conf.level option al-
lows us to get any (1 − α)100% confidence interval for μ1 − μ2 or μD that is
required.
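For the iron ore example, a sketch of the call, assuming hypothetical vectors methodA and methodB hold the paired measurements:

t.test(x=methodA, y=methodB, paired=TRUE, conf.level=0.9)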
differences appear skewed right. Calculate and interpret the 99% confidence
interval for μD [62,63].
8. A researcher is interested in the effectiveness of a drug at stabilizing CD4
counts in HIV-positive patients. A decrease in CD4 counts can mark the
onset of full-blown AIDS in a patient. A baseline CD4 count was measured
for 20 patients and then the CD4 counts were remeasured a year later. They
found a sample average of x̄D = −0.805 with a standard deviation of sD =
0.8017. Histograms of the differences seem Normal. Calculate and interpret
the 90% confidence interval for μD [68–70].
9. Researchers were developing a device that generates electricity from wave
power at sea. In order to keep the device in one location, one of two mooring
methods was used—one of which was much cheaper than the other. Then a
series of simulations in a wave tank was done to test the bending stress on one
particular part of the device, with each of the two mooring methods receiving
the same simulated wave type. In 18 simulations, the average difference in
stress was x̄D = 0.062 with a standard deviation of sD = 0.29. Histograms
of the differences seem Normal. Calculate and interpret the 80% confidence
interval for μD [99,100].
Since 0 is not in this interval, we would conclude that we should reject the
null hypothesis at the α = 0.05 level. This matches up with the results from
our hypothesis test, as we would expect to see given the connection between
hypothesis testing and confidence intervals. However, we should note that our
sample size and Normality concerns remain from before.
The exact same procedure applies to the paired test as well. In order to do a test of H0 : μD = 0 versus HA : μD ≠ 0, we can calculate a (1 − α)100%
confidence interval with matching α levels. If 0 is contained in the interval we
fail to reject the null hypothesis, while if 0 is not contained in the interval we
will reject the null hypothesis.
Again let us confirm this with the results of a previous hypothesis test. Recall
our earlier example involving analyzing breakfast cereal [23]. We previously tested our hypothesis of H0 : μD = 0 versus HA : μD ≠ 0 at the α = 0.05 level and found that we would reject the null hypothesis. We could alternatively test this by calculating a 95% confidence interval—again with α = 0.05—for μD. As x̄D = −0.622, sD = 1.988, and n = 100, all we are missing is our critical value t*. Our level of α will be α = 0.05, so we would use the R code qt(0.975,
df=99) to find that t ∗ = 1.9842. Putting all this together we get our 95% confi-
dence interval to be
−0.622 ± 1.9842 × 1.988/√100 = (−1.016, −0.228).
Since 0 is not in this interval, we would reject the null hypothesis at the
α = 0.05 level, the result that we got earlier when doing the entire hypothesis
testing procedure.
16.6 Conclusion
As single-parameter confidence intervals allow us to get plausible values for the
parameter, two-parameter confidence intervals allow us to understand the plau-
sible differences between two populations. These intervals still follow the same
procedure as single-parameter intervals, with adjustments for the two samples
that occur in this case. Now, these intervals are much more easily calculated
using R, which we will see in the following chapter.
Chapter 17
R tutorial: statistical inference in R
17.1 Introduction
As we have seen from the past few chapters, hypothesis tests and confidence
intervals are two statistical techniques that can be done mostly by hand, es-
pecially in small datasets. However, as is often the case when working with
modern datasets, real data often can be quite large and complex, often having
hundreds or thousands of observations. When this is the case, it is impractical,
if not impossible, to perform these inference techniques by hand. Fortunately,
in addition to providing the exploratory data analyses and plots that we previ-
ously discussed, R allows us to perform the whole of statistical inference—not
just probabilities and p-values—for all of the hypothesis tests and confidence
intervals that we discussed.
FIGURE 17.1 The process of identifying the correct hypothesis test to use.
HA                R-code
HA : p < p0       alternative=“less”
HA : p > p0       alternative=“greater”
HA : p ≠ p0       alternative=“two.sided”
So, let us use an example to see how the function works. Let us use the
example we used to walk through the one-sample test for proportions. In this
example, we were asking if the proportion of Sweet Briar College legacy stu-
dents differed from the national average of 14%. To test this, we collected a
sample of 79 students and found that 12 of them were legacy students. In this
case, our hypotheses were
H0 : p = 0.14
HA : p ≠ 0.14.
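Based on these hypotheses and our sample of 12 legacy students out of 79, the call would look like the following sketch, where correct=FALSE disables the continuity correction to match our by-hand work:

prop.test(x=12, n=79, p=0.14, alternative="two.sided", correct=FALSE)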
R will then output several results, all relevant to our hypothesis test proce-
dure.
1-sample proportions test without continuity correction
There is a lot of output here to sift through, some of which are intuitive. The
first important output is the X-squared value, which gives us our test statistic squared. If we had calculated our test statistic by hand using the formula we are familiar with, we would either get the positive square root of X-squared—if p̂ is greater than p0—or the negative square root of X-squared—if p̂ is less than p0. So, for our example, since our p̂ = 0.1519 is greater than the null hypothesis value of 0.14, our test statistic would be +√0.0929 = 0.3048, the value we got when we did the work by hand. The reason why prop.test reports
the test statistic squared instead of the test statistic is due to the function’s direct
connection to a more general form of our test for proportions—the chi-squared
test for independence—that can be learned in a second course in statistics.
Next, we have the p-value, which is just what it says it is; the p-value for
our hypothesis test based on our test statistic and alternative hypothesis. Here,
our p-value will be 0.7605, which would likely be greater than any reasonable
significance level that we would choose. Again, this is the result we got when
going through our test step-by-step.
The function returns a few other results—the sample proportion p̂, the al-
ternative hypothesis—but one is of particular interest. We see that the function
outputs a 95% confidence interval for p based on our results. This is helpful
when we want a 95% confidence interval, but what about when we want a 90%
or 99% confidence interval? It turns out that there is one additional option in the
prop.test function that will give us any confidence interval we want: the confi-
dence level conf.level.
The important change here comes in our successes x and sample sizes n.
Previously, they were single values as we only were concerned with a single
sample. Here—as we have two groups to compare—both x and n will be vectors
of values. It is important to ensure that the ordering of successes in x matches
up with the ordering of sample sizes in n. Additionally, the alternative hypoth-
esis and alternative option will be affected by this ordering. For example, the
“greater” option implies HA : p1 > p2 , where group one—with population pro-
portion p1 —will be defined by the successes and sample size entered first into
x and n. So we must make sure that the ordering of x and n matches up with
what our alternative argument implies.
For example, let us look back at the example we walked through while learn-
ing the steps of our two-sample test for proportions [55]. In this example, the
Pew Research Group asked 667 Millenials and 558 Generation Xers if they
thought it was essential for the United States to be a world leader in space explo-
ration. 467 Millenials and 407 Generation Xers answered “Yes” to the question.
We wanted ultimately to test if there was any difference between the population
proportions for those two groups. The hypotheses in this case would be
H0 : pMillenial − pGenX = 0
HA : pMillenial − pGenX ≠ 0,

leading to alternative="two.sided". Here, our successes would be x = c(467, 407), and our sample sizes would be n = c(667, 558). All put together, the code would be
prop.test(x=c(467, 407), n=c(667, 558),
alternative="two.sided", correct=FALSE)
The resulting output for this code is given below. Again, our test statistic will be the positive or negative square root of X-squared: positive if p̂1 > p̂2 and negative if p̂1 < p̂2. Since p̂1 < p̂2 in this case, our test statistic is the negative square root of X-squared, so −√1.2707 = −1.127. The p-value, determined by our test statistic and alternative hypothesis, is 0.2596, likely higher than our significance level.
2-sample test for equality of proportions without continuity
correction
The t.test function handles our one-sample test for means in a similar fashion, with the alternative option again matching our alternative hypothesis:

HA                R-code
HA : μ < μ0       alternative=“less”
HA : μ > μ0       alternative=“greater”
HA : μ ≠ μ0       alternative=“two.sided”
To illustrate this, let us look at the example we used to illustrate the one-
sample test for means. We tested if a video slot machine’s average payout was
equal to or less than the theoretical payout of $13.13. This data can be found
in the vlt dataset in the DAAG library [23,44]. The payout amount for the 345
games played are found in the prize variable in the dataset. So, to do our hy-
pothesis test we will use the code
t.test(vlt$prize, mu=13.13,
alternative="less")
And R will return the following output. Just like the prop.test function, the
t.test output covers all the bases for our hypothesis test. The value of t in the
output is the value of our test statistic, using the formula we saw in previous
chapters. The df value is the degrees of freedom for this test, calculated as
before using df = n − 1 where n is our sample size. The p-value is of course our
p-value, calculated with respect to our test statistic and alternative hypothesis.
As we can see, all of these match up with our results that we saw before.
data: vlt$prize
t = -90.718, df = 344, p-value < 2.2e-16
alternative hypothesis: true mean is less than 13.13
95 percent confidence interval:
-Inf 0.8989483
sample estimates:
mean of x
0.6724638
Just like in the prop.test function, we are able to create our (1 − α)100%
confidence interval—in this case for μ—using the same function with which we
do hypothesis testing. By default, the t.test function returns a 95% confidence
interval. However, we can change the desired confidence level for our interval
using the same conf.level argument that we saw in the prop.test function.
For example, let us look at our conformity example from the previous chap-
ter. The data is stored in the Moore dataset in the carData library [48,59]. We
were interested in testing if partners of a higher status resulted in equal versus higher conformity than partners of lower status, or H0 : μHigh − μLow = 0 versus HA : μHigh − μLow > 0. Based on the ratio of our variances—s1²/s2² = 27.855/19.087 = 1.459—we can say that our variances are plausibly equal. Thus, we can use the t.test function to test this using the following code:
t.test(Moore$conformity[Moore$partner.status=="high"],
Moore$conformity[Moore$partner.status=="low"],
alternative="greater", var.equal=TRUE)
The function gives us the same set of output as our one-sample t-test for
means with our test statistic and p-value among other items. If we look at the
output of the function, we see that they all match up with the results we previ-
ously calculated by hand.
Two Sample t-test
It is important to note that when entering our x and y samples for the paired
t-test for means we need our samples to be the same length and match up ac-
cordingly. Otherwise, R will return an error if our samples are different lengths
or incorrect results if our samples don’t match up correctly. Let us illustrate this
through the example we saw when learning about this test.
Let us see this applied to Student’s sleep data [17]. This is stored in the sleep
dataset in the base package of R. We were interested in testing if there was any difference between the two sleep aids, or H0 : μD = 0 versus HA : μD ≠ 0. The
code to test this will be
t.test(x=sleep$extra[sleep$group==1],y=sleep$extra[sleep$group==2],
alternative="two.sided", paired=TRUE)
The function returns the same results that we expect to see, among them
the test statistic t, our degrees of freedom df , and the p-value of the test. As
we can see, these results match up with the results that we calculated by hand
previously.
Paired t-test
In addition to our hypothesis test, the t.test function provides us the confidence interval for μD as well. By default this is a 95% confidence interval, but we can change to our specified confidence level by adding the conf.level option to our function. So, if we wanted the 90% confidence interval for μD, we would use the following code to find that the interval is (−2.2930053, −0.8669947):
t.test(x=sleep$extra[sleep$group==1],y=sleep$extra[sleep$group==2],
alternative="two.sided", paired=TRUE, conf.level=0.9)
17.5 Conclusion
While all forms of statistical inference can be done by hand with simple formulas and a little code, it can be a tedious process. Instead of going through this process, R provides two functions—prop.test and t.test—that allow us to calculate confidence intervals and conduct hypothesis tests for us.
Chapter 18
Inference for two quantitative variables
18.1 Introduction
When we first introduced the definition of variables, we said that we would focus
on two main types of variables: categorical and quantitative. We then worked
our way through the process of statistics, doing exploratory analyses and then
conducting statistical inference. Our previous inferential techniques focused on
one of two situations: a quantitative response with a categorical predictor and a
categorical response with a categorical predictor. In both instances, we have a
categorical predictor which defined our one or two groups that we intended to
compare.
However, not all problems in statistics are limited to this idea of a categorical
predictor. Consider the following situation: you are interested in whether or not
the speed a vehicle was traveling at the time of braking affects the stopping
distance of the vehicle. In this case, your response is the stopping distance of
the vehicle which is quantitative. More importantly, your predictor variable is
the speed of the vehicle at the time of braking, again a quantitative variable.
To begin investigating this question, we start with collecting data, which
we can find in the cars dataset in R [31]. We then could do some exploratory
analyses, such as creating a scatterplot. Through this, we could see that there
appears to be a positive linear relationship between the speed of the vehicle and
its stopping distance that is moderately strong with no outliers. (See Fig. 18.1.)
Even with this scatterplot, our most powerful tool to describe the associa-
tion between two quantitative variables is the correlation. As a reminder, the correlation measures the direction and strength of the linear relationship between two quantitative variables. We begin by supposing that the variables are uncorrelated, which implies that the true value of our population correlation is ρ = 0. Thus, our null hypothesis will be

H0 : ρ = 0.
Our alternative hypothesis is one of the three forms that we have seen be-
fore: the population correlation is greater than 0, less than 0, or just not equal
to 0. This leads to the three possible alternative hypotheses, with the not-equal option being the most common:

HA : ρ < 0    HA : ρ > 0    HA : ρ ≠ 0.
In our example dealing with vehicle speed and stopping distance say we are
interested in testing if the variables are correlated versus not correlated. With
this in mind, our hypotheses would be
H0 : ρ = 0
HA : ρ ≠ 0.
However, for the sample correlation the central limit theorem does not hold.
Does this mean that we have no test statistic that we can use? Not at all. It turns
out that there exists a test statistic that directly relates our sample correlation and
sample size to a specific—and familiar—probability distribution. Specifically,
the test statistic for our test for correlations is
t = r/√((1 − r²)/(n − 2)).
In our speed and stopping distance example, our test statistic would be
t = 0.8069/√((1 − 0.8069²)/(50 − 2)) = 9.464.
Our alternative hypothesis again leads to one of three p-values, with the rationale for each being identical to previous arguments. As such, we will omit the explanation here. Under each of the three alternative hypotheses, our p-values will be
HA              P-value
HA : ρ < 0      P(tn−2 ≤ t)
HA : ρ > 0      P(tn−2 ≥ t)
HA : ρ ≠ 0      2 × P(tn−2 ≥ |t|)
We of course can calculate this in R using the pt function. For our vehicle
speed and stopping distance example, we had the alternative hypothesis that
HA : ρ = 0 with a test statistic of t = 9.464 and n = 50. Thus our p-value will be
2 × P (t50−2 > |9.464|). We use the R code pt(9.464, df=48, lower.tail=FALSE)
to get our p-value of 7.45 × 10−13 .
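A sketch reproducing this by-hand work with the built-in cars data:

r <- cor(cars$speed, cars$dist)             # 0.8069
n <- nrow(cars)                             # 50
t_stat <- r/sqrt((1 - r^2)/(n - 2))         # 9.464
2*pt(t_stat, df=n - 2, lower.tail=FALSE)    # 7.45e-13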
18.2.6 Conclusion
Our conclusion step has not changed, as we again reject the null hypothesis if
the p-value is less than our significance level α, while failing to reject otherwise.
In our speed and stopping distance example, this means that since our p-value
is 7.45 × 10−13 which is less than α = 0.05, we reject the null hypothesis and
conclude that the correlation between the vehicle’s speed and its stopping dis-
tance is not zero. However, as noted earlier, the assumptions required for this
test are not met. Therefore, it is possible that the conclusions of this test may be
suspect.
It is important to note what our conclusion is. We do not make any claim
about our variables being associated, independent, or anything along those lines.
Our hypothesis test—and, therefore, our conclusion—is about our parameter ρ,
the population correlation. This is the only claim we can rule on based on our
test; the claim that was stated in our hypotheses.
In practice, our test for correlations are done using the cor.test function in R.
This function takes in three key arguments, the two variables used to calculate
our correlation x and y, as well as the alternative hypothesis stated in alternative.
The general form of this function is
cor.test(x,y,alternative)
We will discuss the details of this function in a section later in this chapter.
For now, let us see all these steps in a single run. Say we were interested in if
the heights of parents are correlated with children’s height. In order to test this,
we set up the following hypotheses to be tested at the α = 0.05 level:
H0 : ρ = 0
HA : ρ ≠ 0.
Data exists on this in the Galton dataset in the HistData library [75,76]. The
two variables in the dataset are the average height of the two parents parent and
the child’s height child. Looking at a scatterplot of this data, we can see that
there appears to be a linear relationship between our two variables. Also, we
should note that the data values seem to be constrained to particular values. (See
Fig. 18.3.)
We then find that in the sample of n = 928 parent-child trios, the sample
correlation is r = 0.4588. This results in a test statistic of

t = 0.4588/√((1 − 0.4588²)/(928 − 2)) = 15.713.
Now, based on our data and alternative hypothesis, our p-value will be 2 × P(t928−2 > |15.713|). We calculate this probability through the R code pt(15.713, df=926, lower.tail=FALSE) and double it to get our p-value.
Since our p-value is decidedly less than our significance level of α = 0.05,
we would reject the null hypothesis and conclude that the correlation between
the average height of two parents and their child’s height will be nonzero.
With this in mind, we need a way to relate our data—particularly our sample
correlation—and our population correlation ρ to the standard Normal distribu-
tion. It turns out that R.A. Fisher, the same statistician at the lunch with the lady
tasting tea, found that a transformation of the sample correlation approximately
will follow the Normal distribution. Specifically,
(1/2) ln((1 + r)/(1 − r)) ∼ N((1/2) ln((1 + ρ)/(1 − ρ)), 1/(n − 3)).
We would then solve this equation for ρ, where we would get an inter-
val of the form P (Lower < ρ < Upper) = 1 − α. Our interpretation of our
(1 − α)100% confidence interval for ρ remains similar to our previous intervals.
Namely, we are (1 − α)100% confident that our calculated interval will cover the
true value of ρ. We should note that for this interval to be valid, the relationship
between our two variables needs to be linear. There is no Normality assumption
necessary.
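For the curious, a minimal sketch of carrying out this calculation directly, assuming a sample correlation r and sample size n are already defined:

z <- 0.5*log((1 + r)/(1 - r))              # Fisher's transformation of r
zstar <- qnorm(0.975)                      # critical value for a 95% interval
tanh(z + c(-1, 1)*zstar/sqrt(n - 3))       # back-transform to the correlation scale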
As can be seen from our equation above, doing this math by hand would be tedious. As such, we will let R do the work for us. The cor.test function can
create a confidence interval of this form, with only one added argument to the
function: the desired confidence level defined in conf.level. Let us take a look at
this function in some detail.
cor.test(x, y, alternative)
Our two variables x and y are entered as vectors that we define manually or
take from a data frame. Our alternative argument is defined by the alternative
hypothesis we are interested in, exactly like in our prop.test and t.test functions.
HA              R-code
HA : ρ < 0      alternative=“less”
HA : ρ > 0      alternative=“greater”
HA : ρ ≠ 0      alternative=“two.sided”
So let us see this function at work with the Galton dataset—found in the
HistData library. As discussed, this looks at the relationship between parent
height and child height. Assuming, as before, that we are testing H0 : ρ = 0 versus HA : ρ ≠ 0 at the significance level α = 0.05, our code would be
cor.test(x=Galton$parent, y=Galton$child,
alternative="two.sided")
R returns the following output below. cor.test includes all the output we have
discussed above. We see our test statistic given in t matching the value above.
The degrees of freedom for our t-distribution are given in df and our p-value
is also provided. The only thing missing to definitively draw a conclusion is
checking our assumptions for the test, which we can do by creating a scatterplot of our variables—plot(Galton$parent,Galton$child)—and histograms of the two
variables—hist(Galton$parent) and hist(Galton$child). These graphs are seen
in examples above, so I will not repeat them here, but they show that our as-
sumptions are reasonably satisfied.
Pearson’s product-moment correlation
Thus, since our p-value is decidedly less than α = 0.05, we reject the null
hypothesis and conclude that the correlation between average parent height and
child height is not equal to 0.
4. Run the test for correlations with the alternative hypothesis HA : ρ ≠ 0 at the α = 0.1 level.
5. Find and interpret the 90% confidence interval for ρ.
Say a researcher wants to investigate the relationship between academic staff
pay at universities (acpay) and the number of academic grants the university
receives (resgr). Data to investigate this question can be found in the University
dataset in the Ecdat package in R [12,80].
6. Run the test for correlations with hypotheses H0 : ρ = 0 versus HA : ρ > 0
at the α = 0.05 level.
7. Find and interpret the 99% confidence interval for ρ.
The Caschool dataset in the Ecdat library looks at school district performance on a standardized test based on various factors [12,107]. Say a researcher
wants to test the correlation between test score (testscr) and expenditure per
student (expnstu).
8. Run the test for correlations with the hypotheses H0 : ρ = 0 versus HA : ρ >
0 at the α = 0.01 level.
9. Find and interpret the 91% confidence interval for ρ.
18.7 Conclusion
Our test for correlations and confidence interval for correlations give us an important tool for understanding the relationship between two quantitative variables. However, they do not tell the whole story. While correlations can tell us the direction and strength of the relationship between the variables, they do not tell us how one variable affects the other. This can only be done through methods
that describe the linear relationship between our predictor and response, esti-
mating the slope of the line that describes the data. With this in mind, we have
to take one further step to have a fuller understanding of our variables.
Chapter 19
Simple linear regression
19.1 Introduction
In conducting inference for correlations, we begin to understand the linear re-
lationship between two quantitative variables, both in terms of direction and
strength. However, inference for correlations tells us nothing about how the vari-
ables affect each other. Consider the scatterplots in Fig. 19.1; in both cases, the
correlation between our x and y variables is approximately r ≈ 0.91. However,
in the scatterplot on the left, every time the predictor x increases by 1, the re-
sponse y increases by ≈ 0.4. On the right, every time the x increases by 1, the
y increases by ≈ 1.8. These are two very different effects, but with identical
correlations. We need a way to distinguish these two graphs and more fully un-
derstand the relationship between our variables.
Simple linear regression is a method to do this. It takes the two variables that
are linearly related—one predictor and one response—and finds the line that
best describes that relationship. This line will be related to the correlation, but
give us more information than just the correlation alone. From this technique,
we can look at a variety of aspects and consequences of this line, including pre-
dicting our response. This opens up a variety of possibilities in terms of analysis
and questions that we can answer.
FIGURE 19.1 Two scatterplots with identical correlations but different slopes.
While much of the calculations for simple linear regression can be derived
by hand, it is highly impractical to complete them given the sizes of datasets. As
such, we will discuss how the estimates are derived in general terms but not go
through the full derivation. As an example, let us look at the following dataset. It
is generally understood that there is a connection between the length of eruption
time for the Old Faithful geyser and the waiting time until the next eruption. We
will use the faithful dataset in R to look at this phenomenon in order to develop
a simple linear regression to predict waiting time [19].
FIGURE 19.2 Eruption time and wait time for the Old Faithful geyser.
y = β0 + β1x.

There are two key parts to this form of the line: the slope β1 and the intercept β0. The slope describes how much our dependent variable increases when the independent variable increases by a single unit. If the slope is positive, the relationship between independent and dependent variable is positive. If the slope is negative, the relationship is negative. Finally, if the slope is zero, the independent variable has no effect on the dependent variable.

The intercept tells us the value of our dependent variable when our independent variable is equal to zero. We can see this clearly by substituting x = 0 into our line. In this case, we would get

y = β0 + β1 × 0 = β0 + 0 = β0.
SSE = Σ εi² = Σ (yi − β0 − β1xi)², where the sums run over all n observations.
Now that we have defined the SSE, we want to choose the values of β0 and
β1 that make the SSE as small as possible. We will not go through how this is
done, but for those with a Calculus background it is accomplished in a similar
way to minimizing any function. If we went through this process of deriving our
estimates for our slope and intercept—β̂1 and β̂0 , respectively—we will find
they are
β̂1 = rx,y × (sy/sx)
β̂0 = ȳ − β̂1x̄
where rx,y is the correlation between our predictor and response, sx and sy are
our sample standard deviations for predictor and response, respectively, and x̄
FIGURE 19.3 Visualizing errors in regression. The vertical lines represent the error in our regres-
sion model, which is squared and summed to make our SSE.
and ȳ are the sample mean of the predictor and response. In finding the estimate
for our slope particularly, we can see the connection between our correlation
and regression. If the correlation is positive, the slope will be positive. If the
correlation is negative, the slope will be negative. As the correlation grows, so
too does the slope. Finally, when the correlation is close to zero—which implies
that the association between the two variables is weak—the estimated slope will
also be close to zero—meaning the predictor has very little influence on the
response.
Let us apply this to the Old Faithful dataset. In this dataset, we can find
in R that our correlation between eruption time and wait time is r = 0.9008,
the standard deviation for our predictor eruption time is sx = 1.1414, and the
standard deviation for our response wait time is sy = 13.5950. Thus, the estimate for our slope in the regression line will be β̂1 = 0.9008 × (13.5950/1.1414) = 10.7296. This implies that for every minute longer that the eruption of Old Faithful lasts, we would expect to wait 10.7296 extra minutes for the next eruption.
Our next step is to estimate our intercept β̂0. In order to do this, we need our estimate for the slope—β̂1 = 10.7296—as well as the means for eruption time x̄ = 3.4878 and wait time ȳ = 70.8971. This means that our intercept will be β̂0 = 70.8971 − 10.7296 × 3.4878 = 33.4744. Thus, our estimated regression line is

yi = 33.4744 + 10.7296xi.
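A sketch reproducing these estimates from the faithful data using the formulas above:

r <- cor(faithful$eruptions, faithful$waiting)                # 0.9008
b1 <- r*sd(faithful$waiting)/sd(faithful$eruptions)           # 10.7296
b0 <- mean(faithful$waiting) - b1*mean(faithful$eruptions)    # 33.4744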
The last parameter that we have to estimate for our model is the variance
of our errors σ 2 . Given our errors, we could estimate the true variance of the
errors by using the sample variance of those values. The difficulty with this task is, of course, that we do not know our errors. Since the formula for the errors is εi = yi − β0 − β1xi, in order to know the true value of the errors we would need to know the true values of β0 and β1—which is impossible.
Instead, we use our residuals ei, our observed values minus the values predicted by our estimated line:

ei = yi − β̂0 − β̂1xi.
We can use these residuals to estimate the variance of our errors σ². This estimate is called the mean squared error (MSE), is notated σ̂², and is calculated as

σ̂² = (1/(n − 2)) Σ ei².
19.5 Regression in R
While it is possible to calculate the components of our regression model—
sample correlation, standard deviations, and means—in R and then use them
to calculate the slope and intercept, R provides a considerably easier method.
The lm function in R is used to estimate simple linear regression models.
It takes in our response and predictor variables and our dataset and outputs a
variety of important statistics, including our slope, intercept, and MSE. The code
for the lm function is

lm(Response~Predictor, data=dataset)

where Response is our response variable, Predictor is our predictor variable, and
dataset is the dataset where our variables are stored. If we were to apply this to
our Old Faithful example—where our dataset is named faithful, our response
variable is the wait time waiting, and the predictor variable is the eruption time
eruptions—our code would be
lm(waiting~eruptions, data=faithful)
R then will return the output below. The (Intercept) column gives us the
value of our intercept estimate β̂0 , while the eruptions column gives the slope
estimate β̂1 . This column name will change depending on the name of the pre-
dictor variable.
Call:
lm(formula = waiting ~ eruptions, data = faithful)
Coefficients:
(Intercept) eruptions
33.47 10.73
In several cases, we will need to call parts of the R output for the lm function.
In order to do so, we will need to save the results of the lm function. We can do
this like we would save any variable, vector, or data frame in R.
saved_name=lm(...)
For example, say we save the Old Faithful regression as slr. The code would
be
slr=lm(waiting~eruptions, data=faithful)
One instance where we need to call on the output of the lm is to get the
MSE, our estimate of the error variance. The MSE is stored in the summary
of our lm object, accessible through the summary function in R. We have seen
this function previously, where it was used to get our five-number summary of
a variable. When the input into the summary function is an lm object, it returns
a summary of the regression.
summary(lm object)
This summary will meet several of our needs when evaluating our regression.
If we wanted to access the summary of our Old Faithful regression—saved as
slr—the code would be
summary(slr)
And R will return the summary below. The MSE, or a statistic derived from
it, is stored in the Residual standard error portion of the summary. The resid-
ual standard error is the square root of the MSE, making it our estimate of the
standard deviation of our errors, or σ̂ . In our Old Faithful regression, our resid-
ual standard error is 5.914, making our MSE and estimate of our error variance
σ̂ 2 = 5.9142 = 34.9755.
Call:
lm(formula = waiting ~ eruptions, data = faithful)
Residuals:
Min 1Q Median 3Q Max
-12.0796 -4.4831 0.2122 3.9246 15.9719
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.4744 1.1549 28.98 <2e-16 ***
eruptions 10.7296 0.3148 34.09 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ei = yi − ŷi .
In this way, our residuals are often referred to as being calculated as “observed minus predicted.”
While our predictions are relatively easy to create by hand for an individual,
when fitted values or predictions are needed en masse R provides several options
to get predictions. If you need the fitted values for our observations, they are
stored in the saved lm object under the name fitted.values. In order to access
them for plotting or analyzing, we would use the code
lm object$fitted.values
In the Old Faithful regression saved as slr, we access our fitted values using
the code slr$fitted.values. The residuals can be accessed in a similar way as
they are stored in the lm object under the name residuals. For the Old Faithful
regression, we would access the residuals using
slr$residuals
In our data frame for our new predictor values, there is a key point to recog-
nize. As our predictor variable is specifically named in our lm formula, the name
of our variable in the newdata data frame must match. For example, in the Old
Faithful regression—saved as slr—say that we wanted to generate a prediction
for wait time assuming that the previous eruption was 2.75 minutes long. Our R
code for this would be
predict(slr, newdata=data.frame(eruptions=2.75))
And R will return the prediction of 62.9809. We can confirm with a little
arithmetic that this is correct, up to rounding of course.
One important thing to note about our predicted values specifically is that
there are values for which we should not generate predictions. In general, we should not generate predictions for values outside of our predictor range—what is called extrapolating. Why is extrapolation such a problem? We generally
assume that the relationship between predictor and response is linear in the
range of our predictor. Outside of that range we do not know how the predictor
and response are related. It could remain linear, or it could change drastically.
We can see an example of this in Fig. 19.4, where a small range of the pre-
dictor appears linear but the overall relationship between predictor and response
remains highly nonlinear. If we were to assume linearity based on the smaller
range and extrapolate, we would have drastically different predictions than real-
ity.
FIGURE 19.4 An example of a relationship where a small range of the predictor is linear on the
left, but the overall relationship on the right is not.
6. Load the water dataset from the HSAUR library [99,102]. This dataset looks
at the relationship between water hardness and mortality. If the calcium con-
centration of the water (hardness) is equal to 60 parts per million, what
would be the expected mortality per 100,000 male inhabitants (mortality)
based on the simple linear regression.
yi = β0 + β1xi + εi.

This only allows for a linear component in our model. If the true relationship between our predictor and response was quadratic—that is, it has an xi² component in it—our regression model would be inadequate. To check this assumption,
we plot the regression residual for each observation versus the observation’s fit-
ted value. If linearity holds, the residuals will show no pattern, looking like
random scatter. If there is any pattern in the residuals, this implies that the lin-
earity assumption does not hold. (See Fig. 19.5.)
FIGURE 19.5 An example of random scatter in the residuals—so the assumption holds—and pat-
tern in the residuals—so the assumption is violated.
In order to check this in R, we need to find our fitted values and residuals.
Recall that our fitted values ŷi = β̂0 + β̂1 xi are our predicted values for a given
observation in our dataset and our residuals ei are our estimates of our random
errors i , calculated as our observed value minus our predicted value, or ei =
yi − ŷi .
Our fitted values are stored in our lm object under the name fitted.values,
while our residuals are stored in the same place under the name residuals. In
order to access them we can use the $ notation, similar to how we call a variable
from a data frame. For example, recall that we saved our simple linear regres-
sion for the Old Faithful dataset under the name slr. If we wanted to see these
residuals and fitted values, we would use
slr$fitted.values
slr$residuals
plot(slr$fitted.values,slr$residuals)
to produce the plot in Fig. 19.6. Note that this plot has several additional
options included in the code to change—among other things—the plotting character
and axis labels. Looking at this plot, it appears that the linearity assumption
does hold.
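For reference, a sketch of code along these lines (the exact options used for the book's figure are not shown, so the choices here are assumptions):

plot(slr$fitted.values, slr$residuals,
     pch=19,                  # solid plotting character
     xlab="Fitted values",    # descriptive axis labels
     ylab="Residuals")
abline(h=0)                   # horizontal reference line at zero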
In plotting the residuals versus fitted values to check the assumption, it
seems like there would be an easier method. Namely, we could merely plot
FIGURE 19.6 Checking the linearity assumption for the Old Faithful regression.
our predictor versus our response. If we see linearity in this relationship, the as-
sumption would be satisfied. However, plotting the predictor versus response
can be misleading. Consider the following data: James Forbes was a Scot-
tish researcher looking at the effect of altitude—represented through baromet-
ric pressure—on the boiling point of water. The data is stored in the forbes
dataset in the MASS library [10,93]. Plotting the barometric pressure—our
predictor—versus the boiling point—our response—it appears that the rela-
tionship is linear, albeit with one odd outlier point. However, if we run a
linear regression and plot the residuals versus fitted values we clearly see a
curved pattern in the residuals. The residuals are much better at picking up
these patterns, making them more suitable for checking these assumptions. (See
Fig. 19.7.)
FIGURE 19.7 Forbes’ boiling point data plotted on left and residuals versus fitted for Forbes’ data
on the right. A clear pattern exists in the residuals, implying the linearity assumption is violated.
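As a sketch, the Forbes analysis could be reproduced with code along the following lines; bp (boiling point) and pres (barometric pressure) are the variable names in the MASS version of the dataset:

library(MASS)                          # contains the forbes dataset
data(forbes)
plot(forbes$pres, forbes$bp)           # predictor versus response
forbes.lm=lm(bp~pres, data=forbes)     # fit the simple linear regression
plot(forbes.lm$fitted.values, forbes.lm$residuals)   # residuals versus fitted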
If the linearity assumption is violated, you can often transform the response
to create a model where linearity exists. Forbes’ data is an example of this,
where modeling the log of the boiling point with the atmospheric pressure is an
appropriate model.
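A minimal sketch of this transformed model, assuming the same forbes data as above:

forbes.log=lm(log(bp)~pres, data=forbes)              # model log boiling point
plot(forbes.log$fitted.values, forbes.log$residuals)  # pattern should be greatly reduced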
FIGURE 19.8 A simulated dataset with the plot of predictor versus response and the residual plot
showing heteroskedastic errors.
$$H_0: \beta_1 = 0$$
$$H_A: \beta_1 \neq 0.$$
This set of hypotheses will hold for our Old Faithful regression as well,
testing if the slope is zero or not:
$$H_0: \beta_1 = 0, \quad H_A: \beta_1 \neq 0.$$
Thus we need a sample statistic calculated from our data that estimates β1 . It
seems fairly clear that the slope from our simple linear regression β̂1 is a reason-
able choice for our sample statistic. Additionally, looking at our null hypothesis
we can see that our null hypothesis value is equal to 0. Thus, our test statistic at
this stage is
$$t = \frac{\hat{\beta}_1 - 0}{\text{Standard Error}}.$$
All that remains is the standard error of our sample statistic, or s.e.(β̂1 ).
It turns out that under certain conditions—we will discuss them shortly—the
test statistic follows a t-distribution with n − 2 degrees of freedom. For the
Old Faithful regression, plugging in our estimates gives
$$t = \frac{10.7296}{\sqrt{\dfrac{34.9755}{271 \times 1.3027}}} = 34.09.$$
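We can check this arithmetic in R, where 34.9755 is the mean squared error, 271 is n − 1, and 1.3027 is the sample variance of the eruption times:

se.slope=sqrt(34.9755/(271*1.3027))   # standard error of the slope, about 0.3148
10.7296/se.slope                      # test statistic, about 34.09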
For the two-sided alternative $H_A: \beta_1 \neq 0$, the p-value is $2 \times P(t_{n-2} \geq |t|)$.
What then is required for this to be true? In this case, one of two conditions
must be true: either our errors must come from a Normal distribution, or our
sample size must be sufficiently large—usually n ≥ 30—for the central limit
theorem to kick in.
We will know whether or not our sample size is large enough, but what
about the Normally distributed errors? It seems reasonable that we can use our
residuals in lieu of our errors to determine this. But how do we go about this?
Formal tests exist for seeing if a sample of data follows a Normal distribution.
However, in this book we will focus on informal techniques, namely looking at
a histogram of our residuals. If the histogram seems roughly Normal, we will
consider this assumption met.
In order to calculate our p-value, we again turn to the pt function in R. In
our Old Faithful example, our sample size is decidedly greater than 30, so the
central limit theorem will hold and our test statistic will follow a t-distribution
with 272 − 2 = 270 degrees of freedom. Thus we can use the following R code
to find that our p-value is 8.08 × 10−100 .
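As a sketch (the exact code printed in the book may differ slightly):

2*pt(34.09, df=270, lower.tail=FALSE)   # two-sided p-value, about 8.08e-100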
19.10.6 Conclusion
Our conclusion step has not changed, as we reject the null hypothesis if the p-
value is less than the significance level α. Otherwise, we would fail to reject the
null hypothesis. In our Old Faithful example, since our p-value is decidedly less
than our α = 0.05, we will reject the null hypothesis and conclude that β1 ≠ 0,
and thus eruption time does help predict the wait time until the next eruption.
Call:
lm(formula = waiting ~ eruptions, data = faithful)
Residuals:
Min 1Q Median 3Q Max
-12.0796 -4.4831 0.2122 3.9246 15.9719
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  33.4744     1.1549   28.98   <2e-16 ***
eruptions    10.7296     0.3148   34.09   <2e-16 ***
Our SSR is directly related to our mean squared error $\hat{\sigma}^2$, as $\hat{\sigma}^2 = \frac{SSR}{n-2}$ for
our simple linear regression. In general, if the SSR is close to 0, this implies that
our model is doing a good job fitting the data. If the SSR is large, the model is
doing a poor job fitting the data.
Unfortunately, there is a problem with this interpretation: it is dependent on
how much variability exists in our response. Consider the following two simu-
lated datasets presented in Fig. 19.9. In looking at the two graphs, it looks like
the regression on the right would do a better job predicting its response as the
observed values are more tightly clustered about the line. However, the SSR for
the regression on the left is 2421.9, while the SSR for the regression on the right
is 12050.5.
FIGURE 19.9 Two simulated regressions. The regression on the right appears to predict its re-
sponse better, despite a larger SSR.
Why is this the case? The variance of the response in the dataset on the
right—13516.02—is much larger than the variance of the response in the dataset
on the left—36.29. We need a statistic about our regression that both takes into
account our total variability and our SSR.
The most common statistic used that accounts for both of these is the coefficient
of determination, or $R^2$. Our $R^2$ value represents the proportion of the
total variability in our response that is explained by the regression model. Since
$R^2$ is a proportion, it is bounded between 0 and 1, with larger values
indicating that the regression explains more variability in the response—and
thus that the model is doing a better job.
The formula for $R^2$ is given by
$$R^2 = 1 - \frac{SSR}{SST}$$
where SSR is our previously mentioned sum of squared errors and SST is the
total sum of squares. We calculate this total sum of squares through the formula
$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$, where $\bar{y}$ is the average of our response. The SST is directly
connected to the variance of our response, just as the SSR is connected to our
mean squared error. Thus we can restate our $R^2$ formula using this information:
$$R^2 = 1 - \frac{(n-2)\hat{\sigma}^2}{(n-1)s_y^2}.$$
In this formula, $\hat{\sigma}^2$ is our mean squared error and $s_y^2$ is the variance of our
response. In our Old Faithful regression, we know that our mean squared error
is $\hat{\sigma}^2 = 34.9755$, the variance of the wait time response is $s_y^2 = 184.8233$, and
our sample size is n = 272. Using this information, our $R^2$ value is
$$R^2 = 1 - \frac{270 \times 34.9755}{271 \times 184.8233} = 0.8115.$$
Thus we can say that our regression using eruption time to predict wait time
until the next eruption accounts for 81.15% of the total variability in wait time.
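As a quick verification of this calculation in R (a sketch, assuming the regression is saved as slr):

n=nrow(faithful)
mse=summary(slr)$sigma^2                       # mean squared error, about 34.9755
1-((n-2)*mse)/((n-1)*var(faithful$waiting))    # about 0.8115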
The coefficient of determination has a specific connection to our correlation
for simple linear regressions. It turns out that for a simple linear regression
involving a response y and a predictor x, the $R^2$ value of that regression will be
$R^2 = r_{x,y}^2$, where $r_{x,y}$ is the correlation between x and y. We can see this in our Old
Faithful regression, as the correlation between eruption time and wait time is equal
to $r_{x,y} = 0.9008$, which squared matches up to our $R^2 = 0.9008^2 = 0.8115$.
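We can verify this in R directly:

cor(faithful$eruptions, faithful$waiting)^2   # about 0.8115, matching R-squared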
Similar to our hypothesis test for regressions, we can find our value of $R^2$ using
the summary function with an lm input in R. The $R^2$ is stored in the Multiple
R-squared section of our summary, beneath the Coefficients section. In looking
at the summary of our Old Faithful regression saved as slr—using the code
summary(slr)—we can easily see that the value in the Multiple R-squared section
matches up with our calculated value of $R^2 = 0.8115$:
Call:
lm(formula = waiting ~ eruptions, data = faithful)
Residuals:
Min 1Q Median 3Q Max
-12.0796 -4.4831 0.2122 3.9246 15.9719
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.4744 1.1549 28.98 <2e-16 ***
eruptions 10.7296 0.3148 34.09 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.914 on 270 degrees of freedom
Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
19.13 Conclusion
Regression complements our inference for correlations by giving us a more complete
understanding of the relationship between predictors and responses. By
making a few assumptions about the random errors that exist in our data, we are
able to estimate the line that best describes our data. From this line, we can make
predictions, do inference on our line components, and understand how much of
our variability is described by the regression line. In doing all this, we have laid
down the foundations of linear models, one of the most useful techniques in
statistics.
Chapter 20
Statistics: the world beyond this book
There is a thread that connects through every idea and technique that exists in the
field of statistics. This book is the beginning of that thread. Through the course
of this book, we have been developing the mindset of statistics as a field of study
that is all about translating data into decisions, theories, and knowledge. We have
talked briefly about data collection and initially exploring our datasets. Then we
saw why these exploratory analyses are not enough to draw conclusions, leading
us finally to statistical inference.
We have been able to answer a variety of questions using these techniques,
but even so, these questions represent only a beginning in the grander scheme
of research. The complexity of questions has only increased as we have moved
forward, as has the size and scope of data. These questions require answers, and
as such statistics and its techniques must be there to meet these questions with
ideas. And while our techniques learned in this book put us on the right track
toward answering many of these complex questions, they often leave us a little
short.
Even when a technique seems well designed to answer a question, there are
times when generalizations are needed. We talked extensively about the two-
sample t-test for means, as well we should. It is the test to work with when
we are comparing a quantitative response across two groups. When we wanted
to compare the weight gain in chickens to look for the results of two different
dietary supplements—soybeans and sunflowers—this test supplied us with an
answer that we could trust.
However, this test limits us to comparing two groups at a time. What if we
wanted to compare six groups? What is to be done then? This is not a trivial
question, as it is directly related to our chicken weights question. This dataset
[60] that we presented—looking at soybeans and sunflowers—is just a subset
of the complete dataset which contains the weight gains of chickens on one of
six feed supplements—casein, horsebean, linseed, meatmeal, soybean, and sun-
flower. How would we know if there is any difference between the six different
supplements? Looking at the side-by-side boxplots of the weight gain for the six
supplements it seems like there is a difference, but how can we run statistical
inference to more definitively answer the question? We could compare each pair
of supplements in a series of two-sample t-tests for means, but this results in 15
comparisons, and thus it would seem more likely that we would make a mistake
along the way. It would seem like there is a more efficient and downright better
way to answer this question. (See Fig. 20.1.)
FIGURE 20.1 Side-by-side boxplots of chicken weight gain for six feed supplements.
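As a sketch, boxplots like these can be produced from R's built-in chickwts data, assuming it is the dataset described here:

boxplot(weight~feed, data=chickwts)   # weight gain across the six feed supplements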
This sort of scenario can extend to categorical responses as well. Right now,
our two-sample test for proportions allows us to see if the sample proportion of
a binary response differs across two groups. This can answer a wide array of
questions and give us a variety of useful information. For example, we could
look at whether the proportion of liberals differs for individuals with at least a
Bachelor’s degree versus those without a Bachelor’s degree [54]. Data and inference
like this can help us understand and confirm or disprove conventional wisdom.
Now, this example is still a binary response. Say we were to expand this
to a multiary response, with as many options in our response as we want. In
our education and political leanings question, this is equivalent to going from
our Not Liberal versus Liberal—hardly a representation of the variety of the
political spectrum—to a more complete view of the spectrum ranging from Very
Conservative (represented as 1 in the table) to Moderate (4) to Very Liberal (7).
                 1    2    3    4    5    6    7
No HS           11   11    8   42    5    9   10
HS Grad         57   57   36  120   25   33   49
Some College    24   35   21   58   23   38   42
2-Year Grad     10    7   12   29    6   20   20
4-Year Grad     21   32   28   37   17   34   29
Post-Grad       16   16   14   21   13   23   16
This data is more granular than what we started with and can offer us a
wider variety of insights. Our original question—does the population propor-
tion of liberals differ among those who have a bachelor’s degree and those who
do not—can now be generalized to the question: “How does education affect po-
litical leaning?” However, the two-sample test for proportions is ill-equipped to
answer this question, meaning that we need more complex inferential techniques
to develop an answer.
This need for more complex techniques is not confined to inference. There
are instances when the exploratory analyses discussed in the book are not suf-
ficient to understand a question. Say we wanted to try and classify a series of
pitches in a baseball game based on data about each individual pitch: the speed
of the pitch, spin rate, vertical movement, and horizontal movement. If we were
only dealing with two variables, we could plot them in a scatterplot and try to
identify clusters in the positions of the plotted points. However, with four vari-
ables this is impossible; at most we could visualize three variables at a time. We
need a way to visualize our data so that we can identify clusters within the data.
We can even extend this need for new ideas to the entire philosophy of statis-
tics. The methods of statistics discussed in this book fall into the paradigm
of statistics called frequentist statistics, in which we assume—among other
things—the frequentist or long-run view of probability and that our parameters
are fixed values. This mindset led us to hypothesis testing, in which we eval-
uate claims about our parameters through p-values. Say we want to know the
probability that one of our hypotheses is true. This seemingly simple question
is impossible in the framework of frequentist statistics and hypothesis testing,
as our p-values do not work in this way. Because frequentist statistics does not
have the tools to attack these problems, a different mindset is necessary.
The same sort of ability to compare multiple groups is found in the chi-squared
test of independence. In this test, we take our data—two categorical
variables with two or more categories apiece—and compare it to what we would
expect to see if the two variables had no effect on each other. In doing so, we
can find if any association exists between the two variables, such as if education
level and political leaning are associated in any way. This is a question we have
been asking—whether two categorical variables are associated—just with our
comparisons restricted to merely two categories. This test allows us to expand our
possibilities, answering questions that are much more complex and allowing us
to gain further insight.
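As a sketch of what this test looks like in R, using the education-by-leaning counts from the table above:

counts=rbind(c(11,11,8,42,5,9,10),      # No HS
             c(57,57,36,120,25,33,49),  # HS Grad
             c(24,35,21,58,23,38,42),   # Some College
             c(10,7,12,29,6,20,20),     # 2-Year Grad
             c(21,32,28,37,17,34,29),   # 4-Year Grad
             c(16,16,14,21,13,23,16))   # Post-Grad
chisq.test(counts)                      # chi-squared test of independence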
In the instance when we want to visualize more than two variables, a tech-
nique called Principal Component Analysis (PCA) exists. Similar to ANOVA,
PCA tries to partition the variability that exists in our data. PCA does this by
finding the independent combinations of variables that best explain the variabil-
ity. If we plot the two best combinations, this will help us identify clusters in the
data—should they exist—while maintaining the structure of the data. Consider
the baseball pitch data we mentioned earlier. In plotting the two best principal
components—the name for the linear combinations of variables—we see that
there seem to be three clusters within our data. With a little more information
about the data, this makes intuitive sense. Our data is the pitches thrown in
2019 by Mychal Givens, a reliever on the Baltimore Orioles. He throws three
pitches: a fastball, curveball, and change-up. We’d expect to see three clusters—
corresponding to those three pitch types—in our data, exactly what we do see in
the plot of our PCA. (See Fig. 20.2.)
FIGURE 20.2 Principal component analysis of Mychal Givens’s pitches from 2019.
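A minimal sketch of a PCA in R, assuming a hypothetical data frame pitches with one column per pitch measurement:

pca=prcomp(pitches, scale.=TRUE)   # scale variables before extracting components
plot(pca$x[,1], pca$x[,2],         # plot the first two principal components
     xlab="PC1", ylab="PC2")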
Finally, even the question about the probability of a hypothesis being true
is able to be answered with new philosophies. Bayesian statistics is a philoso-
phy of statistics that is based around the ideas of subjective probability—that
probability is the quantification of a degree of belief—and the idea that we can
determine the probability of events given that some other event has occurred.
Solutions to practice problems
Chapter 3
1. Due to social desirability, individuals might not be willing to admit their
depression in person.
2. This could introduce sampling bias because individuals may not be available
during the survey times selected.
3. The response is the change in blood pressure. The explanatory variables are
the medication and exercise level. The treatments are Drug A-1 hour, Drug
A-5 hours, Drug B-1 hour, and Drug B-5 hours.
4. The response is the improvement of the student measured by the difference
in pre- and post-exam scores. The explanatory variable is the teaching style.
The treatments are Team-based method or Traditional method.
5. This study is an observational study, and observational studies are not able
to establish causality.
6. The subjects are the 200 participants, the explanatory variable is their amount
of exercise, and the response is their weight.
7. Possible confounding variables could be dietary choices and family history,
among many others.
8. Design an experiment where subjects are assigned to different exercise regi-
mens in order to control for the various confounding factors.
Chapter 4
1. Colleges$TopSalary[c(1, 3, 10, 12)]
2. Colleges$MedianSalary[Colleges$TopSalary>400000]
3. Colleges[Colleges$Employees<=1000,]
4. Colleges[sample.int(n=14, size=5),]
5. Countries[Countries$GDPcapita<10000 & Countries$Region!="Asia",]
6. Countries[sample.int(n=10, size=3),]
7. Countries$Nation[Countries$PctIncrease>1.5]
8. Olympics[Olympics$Host==Olympics$Leader,]
9. Olympics[Olympics$Competitors/Olympics$Events>35,]
10. Olympics[Olympics$Type=="Winter" & Olympics$Nations>=80,]
Chapter 5
1. library(Ecdat)
2. data(Diamond)
Chapter 6
1. The population of interest is American adults.
2. n = 2458 + 1796 + 378 + 94 = 4726
3. $\hat{p} = \frac{2458}{4726} = 0.5202$
4. n = 48 + 334 + 66 + 990 + 62 + 848 + 11 + 166 = 2525
5. $\hat{p} = \frac{334 + 990 + 848 + 166}{2525} = 0.9259$
6.
   None of it   Not much   Some of it   Most of it
          382       1056          910          177
7. $x_1 = 0.7$, $y_8 = 1.6$, $x_{(2)} = -1.2$, $y_{(6)} = 1.9$
8. ȳ = 2.33
9. x̃ = 0.35
10. $DDT_{(3)} = 3.06$, $DDT_{(14)} = 3.78$
11. $\overline{DDT} = 3.328$
12. $\widetilde{DDT} = 3.22$
13. x̄ = 101.425
14. x̃ = 100.9
15. s² = 1.898
16. s = 1.378
17. IQR = 102.9 − 100.4 = 2.5
18. x̄ = 94.14
19. x̄ = 92
20. s² = 197.476
21. s = 14.053
22. IQR = 106 − 78 = 28
23. Our z-scores are 0.74, −2.33, −0.34, −0.13, −0.4, 0.05, 0.19, 0.53, −0.18,
0.21, 0.93, 0.08, 1.43, 1.04, and −1.83, so according to z-scores, none of
our data points are potential outliers.
24. As any observation less than 1 − 1.5 × (6.125 − 1) = −6.6875 or greater
than 6.125 + 1.5 × (6.125 − 1) = 13.8125 is a potential outlier, we would
say −8.375 is a potential outlier.
25. Our z-scores are −0.03, −0.03, 2.05, −0.59, 1.49, −0.73, −0.31, −0.31,
0.38, −1.84, 0.38, −0.17, 0.52, −1.15, 0.8, −1.84, 0.52, −0.59, 0.24, and
1.22, so according to z-scores, none of our data points are potential outliers.
26. Any observation less than 810 − 1.5 × (890 − 810) = 690 or greater than
890 + 1.5 × (890 − 810) = 1010 is a potential outlier, so according to the
IQR, there are no potential outliers.
27. Top left: Bimodal, symmetric, centered around 0.5. Top right: Bimodal,
skewed to the right, centered near 0. Bottom left: Unimodal, skewed
left, centered around −2. Bottom right: Unimodal, skewed right, centered
around 0.75, potential outlier near −2.
28. See figure.
Chapter 7
1. table(leuk$ag)
2. median(leuk$time), mean(leuk$time)
3. range(leuk$wbc), IQR(leuk$wbc), var(leuk$wbc), sd(leuk$wbc)
4. cor(leuk$wbc, leuk$time)
5. table(survey$Smoke, survey$Exer)
6. mean(survey$Pulse), median(survey$Pulse)
7. range(survey$Age), IQR(survey$Age), var(survey$Age), sd(survey$Age)
8. cor(survey$Wr.Hnd, survey$NW.Hnd)
9. table(Housing$recroom, Housing$fullbase)
10. mean(Housing$lotsize), median(Housing$lotsize)
11. range(Housing$price), IQR(Housing$price), var(Housing$price), sd(Housing$price)
12. cor(Housing$price, Housing$bedrooms)
13. plot(Star$tmathssk, Star$treadssk)
14. hist(Star$totexpk)
15. Star$totalscore=Star$tmathssk+Star$treadssk
16. boxplot(totalscore~classk, data=Star)
17. plot(survey$Height, survey$Wr.Hnd)
18. hist(survey$Age)
19. boxplot(Pulse~Exer, data=survey)
Chapter 8
1. S = {Ace of Spades, 2 of Spades...Queen of Hearts, King of Hearts}
2. $P(A) = \frac{12}{52}$
3. $P(A^C) = 1 - P(A) = 1 - \frac{1}{4} = \frac{3}{4}$
4. S = {1, 2, ..., 100}
5. $P(A) = \frac{10}{100} = \frac{1}{10}$
6. X can be equal to 0, 1, 2, 3.
7. $P(X = 2) = \frac{3}{8}$
8. Binomial(19, 0.4)
9. X ∼ Binomial(8, 0.492)
Chapter 9
1. The sampling distribution would be centered at 0.5.
2. Your colleague will have a smaller standard error.
3. As the sample size increases, the sampling distribution of the sample average
will approach a Normal distribution.
4. The standard error will get smaller and eventually go to 0. The sampling
distribution will approach a Normal distribution.
Chapter 10
1. H0 : p = 1/6, HA : p > 1/6.
2. One-sided test
3. P (Type I) = 0.05
4. Type II Error
5. We would reject the null hypothesis because our p-value is less than our
significance level.
6. H0: p = 0.5, HA: p ≠ 0.5.
7. Two-sided test
8. The significance level α does not affect the probability of a Type II error.
9. We would fail to reject the null hypothesis because our p-value is greater
than our significance level.
Chapter 11
1. 0.9772499
2. 0.01222447
3. 0.9875338
4. 0.9583677
5. 0.001349898
6. H0 : p = 0.47, HA : p < 0.47
7. $\hat{p} = \frac{20}{50} = 0.4$
8. $t = \frac{0.4 - 0.47}{0.07} = -1$
9. $\hat{p} \sim N(0.47, 0.07^2)$
10. 0.1586553
Chapter 12
1. We are 90% confident that (0.438486, 0.461514) covers the true value of p
2. We are 95% confident that (0.4362803, 0.4637197) covers the true value
of p
3. We are 99% confident that (0.4319692, 0.4680308) covers the true value
of p
4. We are 99.5% confident that (0.4303508, 0.4696492) covers the true value
of p
5. We are 80% confident that (0.7610291, 0.7789709) covers the true value
of p
6. We are 90% confident that (0.758486, 0.781514) covers the true value of p
7. We are 95% confident that (0.7562803, 0.7837197) covers the true value
of p
Chapter 13
1. H0 : p = 0.68, HA : p > 0.68
2. $\hat{p} = \frac{1270}{1549} = 0.8199$
3. $t = \frac{0.8199 - 0.68}{\sqrt{\frac{0.68(1 - 0.68)}{1549}}} = 11.8036$
4. 1.87 × 10−32
5. Because our p-value is less than α, we reject the null hypothesis and conclude
that the proportion of Americans who say children should be required
to be vaccinated is greater than 0.68.
6. H0: p = 0.5, HA: p ≠ 0.5
7. $\hat{p} = \frac{122}{276} = 0.442$
8. $t = \frac{0.442 - 0.5}{\sqrt{\frac{0.5(1 - 0.5)}{276}}} = -1.927$
9. 0.0540
10. Because our p-value is less than α, we reject the null hypothesis and con-
clude that the proportion of e-mails that are spam is not equal to 0.5.
11. H0 : p = 0.79, HA : p < 0.79, α = 0.05
$\hat{p} = \frac{1082}{1502} = 0.7204$
$t = \frac{0.7204 - 0.79}{\sqrt{\frac{0.79(1 - 0.79)}{1502}}} = -6.622$
$P(N(0, 1) < -6.622) = 1.77 \times 10^{-11}$
Because our p-value is less than α, we reject our null hypothesis and conclude
that the proportion of Americans who have read a book in the last year
is less than 0.79.
12. H0 : μ = 6, HA : μ > 6
13. $t = \frac{6.2 - 6}{2.06/\sqrt{1439}} = 3.683$
14. t distribution with 1438 degrees of freedom.
15. 0.000120
16. Because our p-value is less than α, we will reject the null hypothesis and
conclude μ > 6.
17. H0 : μ = 2.5, HA : μ > 2.5
18. $t = \frac{6.143 - 2.5}{\sqrt{13.229}/\sqrt{34}} = 5.84$
19. t distribution with 33 degrees of freedom.
20. 7.7 × 10−7
21. Because our p-value is less than α, we will reject the null hypothesis and
conclude μ > 2.5.
Chapter 14
1. We are 95% confident that (0.3991859, 0.4853184) covers the true propor-
tion of times Mike Trout gets on base.
2. We are 99% confident that (0.1791346, 0.6706865) covers the true proba-
bility of winning at craps betting the pass line. However, the small sample
size may cast doubt on the results.
3. We are 90% confident that (0.4181515, 0.4767203) covers the true propor-
tion of women who volunteer.
4. We are 99% confident that (34.1071, 50.1529) covers the true mean ozone
ppm.
5. We are 95% confident that (1.815092, 2.324908) covers the true mean difference
in actual minus reported height.
6. We are 95% confident that (60.5361, 61.3039) covers the true mean age of
women who have a heart attack.
7. $n \geq \frac{1.645^2 \times 0.41 \times (1 - 0.41)}{0.1^2} = 65.4471$
8. $n \geq \frac{2.576^2 \times 0.5 \times (1 - 0.5)}{0.05^2} = 663.4897$
9. $n \geq \frac{1.96^2 \times 0.5 \times (1 - 0.5)}{0.01^2} = 9604$
10. (2.427, 6.134). Since 4 is in the interval, we would fail to reject the null
hypothesis.
11. (2.044, 6.517). Since 2 is outside the interval, we would reject the null
hypothesis.
12. (1.506, 7.054). Because our confidence level does not match up with our
significance level, we cannot determine the result of our hypothesis test
based on this confidence interval.
Chapter 15
1. H0: pM − pW = 0, HA: pM − pW ≠ 0
2. $\hat{p} = \frac{794 + 919}{1119 + 1225} = 0.7308$
3. $t = \frac{0.7096 - 0.7502}{\sqrt{0.7308(1 - 0.7308)\left(\frac{1}{1119} + \frac{1}{1225}\right)}} = -2.219$
4. 0.02648
5. Since our p-value is less than α, we reject the null hypothesis and conclude
that men and women value work-life balance at different rates.
6. H0: pAid − pNoAid = 0, HA: pAid − pNoAid < 0
α = 0.01
$\hat{p}_{Aid} = \frac{48}{216} = 0.2222, \quad \hat{p}_{NoAid} = \frac{66}{216} = 0.3056$
$\hat{p} = \frac{48 + 66}{216 + 216} = 0.2639$
$t = \frac{0.2222 - 0.3056}{\sqrt{0.2639(1 - 0.2639)\left(\frac{1}{216} + \frac{1}{216}\right)}} = -1.676934$
$P(N(0, 1) < -1.676934) = 0.04677769$
Because our p-value is greater than α, we fail to reject the null hypothesis
and conclude it’s plausible that the rate of recidivism is equivalent for
former inmates with and without financial aid.
7. H0: pYes − pNo = 0, HA: pYes − pNo > 0
α = 0.1
$\hat{p}_{Yes} = \frac{263}{307} = 0.8567, \quad \hat{p}_{No} = \frac{783}{1156} = 0.6773$
$\hat{p} = \frac{263 + 783}{307 + 1156} = 0.7150$
$t = \frac{0.8567 - 0.6773}{\sqrt{0.7150(1 - 0.7150)\left(\frac{1}{307} + \frac{1}{1156}\right)}} = 6.189747$
$P(N(0, 1) > 6.189747) = 3.01 \times 10^{-10}$
Because our p-value is less than α, we reject the null hypothesis and con-
clude that people who remember receiving corporal punishment are in
favor of moderate corporal punishment for children at a higher proportion
than those who do not remember receiving corporal punishment.
8. H0: μBenign − μMalignant = 0, HA: μBenign − μMalignant ≠ 0
9. Ratio of variances is between 0.25 and 4, so
$s_p^2 = \frac{1.67^2 \times (458 - 1) + 2.43^2 \times (241 - 1)}{458 + 241 - 2} = 3.862$
$t = \frac{2.96 - 7.2}{\sqrt{3.862\left(\frac{1}{458} + \frac{1}{241}\right)}} = -27.112$
10. df = 458 + 241 − 2 = 697
11. $2 \times P(t_{697} > |-27.112|) = 4.35 \times 10^{-111}$
12. Since our p-value is less than α, we reject the null hypothesis and conclude
that clump thickness differs for benign and malignant tumors.
13. H0: μWC − μBC = 0, HA: μWC − μBC > 0
α = 0.1
$\frac{s_{WC}^2}{s_{BC}^2} = \frac{139.067}{325.991} = 0.4266$
$s_p^2 = \frac{325.991 \times (21 - 1) + 139.067 \times (6 - 1)}{21 + 6 - 2} = 288.6062$
$t = \frac{36.667 - 22.762}{\sqrt{288.6062\left(\frac{1}{6} + \frac{1}{21}\right)}} = 1.76816$
$P(t_{25} > 1.76816) = 0.04462$
Because our p-value is less than α, we reject the null hypothesis and con-
clude that white collar jobs are viewed as more prestigious than blue collar
jobs. However, the small sample sizes and lack of Normality cast some
doubt on the results.
14. H0: μC − μMP = 0, HA: μC − μMP ≠ 0
α = 0.05
$\frac{s_C^2}{s_{MP}^2} = \frac{0.89^2}{1.25^2} = 0.5069$
$s_p^2 = \frac{0.89^2 \times (45 - 1) + 1.25^2 \times (74 - 1)}{45 + 74 - 2} = 1.2728$
$t = \frac{22.3 - 19.7}{\sqrt{1.2728\left(\frac{1}{45} + \frac{1}{74}\right)}} = 12.19107$
$P(t_{117} > 12.19107) = 7.22 \times 10^{-23}$
Because our p-value is less than α, we reject the null hypothesis and con-
clude that the length of cuckoo and host meadow pipit eggs differ.
15. H0 : μD = 0, HA : μD > 0
16. $t = \frac{15.97}{15.86/\sqrt{33}} = 5.7844$
17. $P(t_{32} > 5.7844) = 1.01 \times 10^{-6}$
18. Because our p-value is less than α, we reject the null hypothesis and con-
clude that the blood lead levels are higher for children of parents exposed
to lead in their factory work.
19. H0: μD = 0, HA: μD ≠ 0
α = 0.05
$t = \frac{0.1936}{0.3066/\sqrt{25}} = 3.157$
$2 \times P(t_{24} > 3.157) = 0.00426$
Because our p-value is less than α, we reject the null hypothesis and con-
clude there is a difference in the percentage of solids for shaded and ex-
posed grapefruits. However, the sample size and lack of Normality might
cause one to doubt the results.
20. H0 : μD = 0, HA : μD > 0
α = 0.1
$t = \frac{2.617}{\sqrt{22.26}/\sqrt{15}} = 2.14826$
$P(t_{14} > 2.14826) = 0.02484$
Because our p-value is less than α, we reject the null hypothesis and conclude
that cross-pollinated corn plants are taller than self-pollinated corn
plants. However, the small sample size and lack of Normality cast some
doubt on the analysis.
Chapter 16
1. We are 96% confident that (0.08112345, 0.1392416) covers the true value of pCollege − pNoCollege.
2. We are 80% confident that (0.07300417, 0.1050089) covers the true value of pDiabetes − pNoDiabetes.
3. We are 95% confident that (−0.1037826, 0.09569645) covers the true value of pFrench − pEnglish.
4. We are 90% confident that (3.982428, 4.497572) covers the true value of μMalignant − μBenign.
5. We are 96% confident that (−8.331023, 2.871023) covers the true value of μF − μM.
6. We are 92% confident that (10.82324, 13.05676) covers the true value of μS − μW.
7. We are 99% confident that (8.407745, 23.53225) covers the true value of μD.
8. We are 99% confident that (−1.113974, −0.494026) covers the true value of μD.
9. We are 80% confident that (−0.02914, 0.15314) covers the true value of μD.
10. Since 0 is in the 95% confidence interval, we would fail to reject the null
hypothesis.
11. Since 0 is not contained in the 90% confidence interval, we would reject
the null hypothesis.
12. Since 0 is not contained in the 95% confidence interval, we would reject
the null hypothesis.
Chapter 17
1. prop.test(x=c(7,10), n=c(117,118), alternative="two.sided", conf.level=0.87, correct=FALSE)
2. prop.test(x=705, n=1500, p=0.5, alternative="less", correct=FALSE), prop.test(x=705, n=1500, p=0.5, alternative="two.sided", conf.level=0.93, correct=FALSE)
3. prop.test(x=c(171,124), n=c(212,541), alternative="greater", correct=FALSE), prop.test(x=c(171,124), n=c(212,541), alternative="two.sided", correct=FALSE)
4. prop.test(x=428, n=753, p=0.5, alternative="greater", conf.level=0.98, correct=FALSE), prop.test(x=428, n=753, p=0.5, alternative="two.sided", conf.level=0.84, correct=FALSE)
5. t.test(DDT, mu=3, alternative="greater"), t.test(DDT, mu=3, alternative="two.sided", conf.level=0.99)
6. t.test(x=monica$age[monica$sex=="men"], y=monica$age[monica$sex=="women"], alternative="less", var.equal=TRUE), t.test(x=monica$age[monica$sex=="men"], y=monica$age[monica$sex=="women"], alternative="two.sided", conf.level=0.9, var.equal=TRUE)
7. t.test(x=cd4$baseline, y=cd4$oneyear, alternative="less", paired=TRUE), t.test(x=cd4$baseline, y=cd4$oneyear, alternative="two.sided", paired=TRUE, conf.level=0.6)
8. t.test(x=Guyer$cooperation[Guyer$condition=="anonymous"], y=Guyer$cooperation[Guyer$condition=="public"], alternative="two.sided", var.equal=TRUE, conf.level=0.97)
9. t.test(x=pair65$heated, y=pair65$ambient, alternative="two.sided", var.equal=TRUE)
Chapter 18
1. H0: ρ = 0, HA: ρ ≠ 0
α = 0.05
$t = \frac{0.177}{\sqrt{\frac{1 - 0.177^2}{202 - 2}}} = 2.5433$
$2 \times P(t_{200} > |2.5433|) = 0.01173731$
Because our p-value is less than α, we reject the null hypothesis and conclude
that the correlation between body mass index and white blood cell count in
athletes is not 0.
2. H0: ρ = 0, HA: ρ ≠ 0
α = 0.01
$t = \frac{0.1857}{\sqrt{\frac{1 - 0.1857^2}{189 - 2}}} = 2.58436$
$2 \times P(t_{187} > |2.58436|) = 0.0105183$
Because our p-value is greater than α, we fail to reject the null hypothesis and
conclude that the correlation between a newborn’s weight and its mother’s
weight is plausibly 0.
3. H0: ρ = 0, HA: ρ ≠ 0
α = 0.05
$t = \frac{0.6557}{\sqrt{\frac{1 - 0.6557^2}{3000 - 2}}} = 47.55122$
$2 \times P(t_{2998} > |47.55122|) = 0$
Because our p-value is less than α, we reject the null hypothesis and conclude
that the correlation between height and index finger length is not equal to 0.
4. cor.test(x=Weekly$Volume, y=Weekly$Today, alternative="two.sided", conf.level=0.9)
5. We are 90% confident that (−0.08281261, 0.01682141) covers the true value
of ρ.
6. cor.test(x=University$acpay, y=University$resgr, alternative="two.sided", conf.level=0.99)
7. We are 99% confident that (0.7564774, 0.9300306) covers the true value of ρ.
8. cor.test(x=Caschool$testscr, y=Caschool$expnstu, alternative="two.sided", conf.level=0.91)
9. We are 91% confident that (0.1101848, 0.2698314) covers the true value of ρ.
Chapter 19
1. ŷ = −17.579 + 3.932x
2. ŷ = 0.2068 + 0.0031x, σ̂² = 0.04226² = 0.001786
3. ŷ = 0.154010 + 0.065066x, σ̂² = 0.1861² = 0.03463
4. ŷ = −17.579 + 3.932 × 30 = 100.381
5. ŷ = 23.94153 + 0.64629 × 68.5 = 68.2124
6. ŷ = 1676.3556 − 3.2261 × 60 = 1482.79
7. lm=lm(VitC~HeadWt, data=cabbages)
plot(lm$fitted.values, lm$residuals)
Based on the plot generated by this code, it appears that the regression
assumptions hold.
8. As the p-value from our lm summary is 9.75 × 10−9 , we would reject the
null hypothesis and conclude that the slope coefficient in our regression is
not zero.
9. R 2 = 0.4355
10. As the p-value from our lm summary is less than 2.2 × 10−16 , we would
reject the null hypothesis and conclude that the slope coefficient in our
regression is not zero.
11. R 2 = 0.09688
12. lm=lm(sales~price, data=Cigar)
plot(lm$fitted.values, lm$residuals)
Based on the plot generated by this code, it appears that the regression
assumptions do not hold, specifically our assumption about the constant
variance of the errors.
13. As the p-value from our lm summary is less than 2.2 × 10−16 , we would
reject the null hypothesis and conclude that the slope coefficient in our
regression is not zero.
14. R 2 = 0.6387
15. lm=lm(ohms~juice, data=fruitohms)
plot(lm$fitted.values, lm$residuals)
Based on the plot generated by this code, it appears that the regression
assumptions do not hold, specifically our assumption about the relationship
between our predictor and response being linear.
Appendix B
List of R datasets
Note: Datasets may appear in multiple locations in the book. The listed chapter
is their first occurrence.