0 - Getting started in R
by Mark Greenwood and Katharine Banner
This book and access to a computer (PC, Mac, or just computer lab computers on campus) are
the only required materials for the course. You will need to download the statistical software
package called R and an enhanced interface to R called R-studio (Rstudio, 2014). They are open
source and free to download and use (and will always be that way). This means that the skills
you learn now can follow you the rest of your life. R is becoming the primary language of
statistics and is being adopted across academia, government, and businesses to help manage and
learn from the growing volume of data being obtained. Hopefully you will get a sense of some of
the power of R this semester.
The next pages will walk you through the process of getting the software downloaded and
provide you with an initial experience using R-studio to do things that should look familiar even
though the interface will be a new experience. Do not expect to master R quickly - it takes years
(sorry!) even if you know all the statistical methods being used. We will try to keep all of your
interactions with R code in a similar coding form and that should help you learn how to use
R as we move through various methods. Everyone who learns R starts by copying other
people's code and then making changes for specific applications - so expect to go back to
examples and learn how to modify that code to work for your particular data set. In Chapter 1,
we will exploit the power of R to compare quantitative responses from two groups, making some
graphical displays, doing hypothesis testing and creating confidence intervals in a couple of
different ways.
You will have two downloading activities to complete before you can do anything more than
read this book. First, you need to download R. It is the engine that will do all the computing for
us, but you will only interact with it once. Go to http://cran.rstudio.com and click on
the "Download R for..." button that corresponds to your operating system. Second, you need to
download R-studio. It is an enhanced interface that will make interacting with R less frustrating.
Go to http://www.rstudio.com/products/rstudio/download/ and select the "installer" for your
operating system under the column for "Installers for all platforms". From this point forward,
you should only open R-studio; it provides your interface with R. Note that both R and R-studio
are updated frequently (up to four times a year) and if you downloaded either more than a few
months previously, you should download the up-to-date versions, especially if something you are
trying to do is not working. Sometimes code will not work in older versions of R and sometimes
old code won't work in new versions of R3.
Now we get to complete some basic tasks in R using the R-studio interface. When you open R-
studio, you will see a screen like Figure 0-2. The added notes can help you get initially oriented
to the software interface. R is command-line software - meaning that most of the time you have
to create code and then execute it to get any results. R-studio makes the management and
execution of code more efficient than the basic version of R. The lower left panel in R-studio is
called the "console" window and is where you can type R code directly into R or where you will
see the code you run and (most importantly!) where the results of your executed commands will
show up. The most basic interaction with R is available once you get the cursor active at the
command prompt ">". The upper left panel is for writing, saving, and running your R code. Once
you have code available in this window, the "Run" button will execute the code for the line that
your cursor is on or for any text that you have highlighted with your mouse. The "data
management" or environment panel is in the upper right, providing information on what data sets
have been loaded. It also contains the "Import Dataset" button that makes reading data into R
easier. The lower right panel contains information on the "Packages" that are available and is
where you will see plots that you make and requests for "Help".
For example, typing 3+4 at the command prompt and hitting enter returns the result on the next line:
> 3+4
[1] 7
You can do more interesting calculations, like finding the mean of the numbers -3, 5, 7, and 8 by
adding them up and dividing by 4:
> (-3+5+7+8)/4
[1] 4.25
Note that the parentheses help R to figure out your desired order of operations. If you drop
that grouping, you get a very different result:
> -3+5+7+8/4
[1] 11
We could estimate the standard deviation similarly using the formula you might remember from
introductory statistics, but that will only work in very limited situations. To use the real power of
R this semester, we need to work with data sets that store the observations for our subjects
in variables. Basically, we need to store observations in named vectors that contain a list of the
observations. To create a vector containing the four numbers and assign it to a variable
named variable1, we use the function c, which combines the items that are inside its
parentheses and separated by commas into a single vector:
> c(-3,5,7,8)
[1] -3 5 7 8
To get this vector stored in a variable called variable1 we need to use the assignment operator,
"<-" (read as "stored as"), that assigns the information on the right into the variable that you are
creating.
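For example, the vector of four numbers from above can be stored in a variable named variable1 like this:
> variable1 <- c(-3,5,7,8)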
In R, the assignment operator, <-, is created by typing a less than symbol (<) followed by a
minus sign (-) without a space between them. If you ever want to see what numbers are
residing in an object in R, just type its name and hit enter:
> variable1
[1] -3 5 7 8
You can see how that variable contains the same information that was initially generated by c(-
3,5,7,8) but is easier to access since we just need the text representing that vector. Now we can
use functions such as mean and sd to find the mean and standard deviation of the observations
contained in variable1:
> mean(variable1)
[1] 4.25
> sd(variable1)
[1] 4.99166
When dealing with real data, we will often have information about more than one variable. We
could enter all observations by hand for each variable but this is prone to error and onerous for
all but the smallest data sets. If you are to ever utilize the power of statistics in the evolving data-
centered world, data management has to be accomplished in a more sophisticated way. While
you can manage data sets quite effectively in R, it is often easiest to start with your data set in
something like Microsoft Excel or OpenOffice's Calc. You want to make sure that observations
are in the rows and the names of variables are in the columns and that there is no "extra stuff" in
the spreadsheet. If you have missing observations, they should be represented with blank cells.
The file should be saved as a ".csv" file (stands for comma-separated values, although Excel calls
it "CSV (Comma Delimited)"), which basically strips off some of the junk that Excel adds to the
necessary information in the file. Excel will tell you that this is a bad idea, but it actually creates
a more stable long-term storage format and one that R can use directly. There will be a few
words in the last chapter regarding why we use R in this course instead of Excel or other
(commercial) statistical software. We'll wait until we show you some of the cool things that R
can do to discuss why we didn't use other software.
With a data set converted to a CSV file, we need to read the data set into R. There are two ways
to do this, either using the GUI point-and-click interface in R-studio or modifying
the read.csv function to find the file of interest. To practice this, you can download an Excel
(.xls) file from https://dl.dropboxusercontent.com/u/77307195/treadmill.xls that contains
observations on 31 males that volunteered for a study on methods for measuring fitness (Westfall
and Young, 1993). In the spreadsheet, you will find variables including each subject's Age, BodyWeight, and 1.5 mile run time (RunTime), which are used below.
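If you save the spreadsheet as a .csv file and then use the "Import Dataset" button (or type similar code yourself), R-studio will run two lines of code that look something like the following sketch; the file path here is only a placeholder and will depend on where you saved your file:
> treadmill <- read.csv("C:/Users/yourname/Documents/treadmill.csv")
> View(treadmill)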
What is put inside the " " will depend on the location of your saved .csv file. A version of the
data set in what looks like a spreadsheet will appear in the upper left window due to the second
line of code (View(treadmill)). Just directly typing (or using) a line of code like this is
actually the other way that we can read in files. If you choose to use this, you need to tell R
where to look in your computer to find the data file. read.csv is a function that takes a path as
an argument. To use it, specify the path to your data file, put quotes around it, and put it as the
input to read.csv(...). For some examples later in the book, you will be able to copy a
command like this and read data sets and other code directly from my Dropbox folder using an
internet connection.
Figure 0-3: R-studio with initial data set loaded.
To verify that you read in the data set correctly, it is good to check its contents. We can view the
first and last rows in the data set using the head and tail functions on the data set, which
show the following results for the treadmill data. Note that you will sometimes need to
resize the console window in R-studio to get all the columns to display in a single row which can
be performed by dragging the grey bars that separate the panels.
> head(treadmill)
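Many of the summaries used below come from the mosaic package, which needs to be installed once before it can be used. A minimal way to do that, assuming you have an internet connection, is with the install.packages function:
> install.packages("mosaic")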
After installing the package, we need to load it to make it active. We need to go to the command
prompt and type (or copy and paste) require(mosaic):
> require(mosaic)
You may see a warning message about versions of the package and versions of R - this is usually
something you can ignore. Other warning messages could be more ominous for proceeding but
before getting too concerned, there are a couple of basic things to check. First, double check that
the package is installed. Second, check for typographical errors in your code - especially for
misspellings or unintended capitalization. If you are still having issues, try repeating the installation
process or find someone more used to using R to help you. Most computers in computer labs on
campus at MSU have R and R-studio installed and provide another venue to use the software if
you are having problems5.
To help you go from basic to intermediate R usage, you will want to learn how to manage and
save your R code. The best way to do this is using the upper left panel in R-studio using what are
called R-scripts and they have a file extension of .R. To start a new .R file to store your code,
click on File, then New File, then R Script. This will create a blank page to enter and edit code -
then save the file as MyFileName.R in your preferred location. Saving your code will mean that
you can return to where you last were working by simply re-running the saved script file. With
code in the script window, you can place the cursor on a line of code or highlight a chunk of
code and hit the "Run" button on the upper part of the panel. It will appear in the console with
results just like what you got if you typed it after the command prompt. Figure 0-4 shows the
screen with the code used in this section in the upper left panel, saved in a file called Ch0.R, with
the results of highlighting and executing the first section of code using the "Run" button.
Figure 0-4: R-studio with highlighted code run.
3 The need to keep the code up-to-date as R continues to evolve is one reason that this book is locally published...
4 If you are having trouble getting the file converted and read into R, copy and run the following code: treadmill=read.csv("http://dl.dropboxusercontent.com/u/77307195/treadmill.csv",header=T)
5 We highly recommend that you do not wait until the last minute to try to get R code to work for your own assignments. Even experienced R users can sometimes need a little time to find their errors.
With R-studio running, the mosaic package loaded, a place to write and save code, and
the treadmill data set loaded, we can (finally!) start to summarize the results of the study.
The treadmill object is what R calls a data.frame and contains columns corresponding to
each variable in the spreadsheet. Every function in R will involve specifying the variable(s) of
interest and how you want to use them. To access a particular variable (column) in a data.frame,
you can use a $ between the data.frame name and the name of the variable of interest,
as dataframename$variablename. To identify the RunTime variable here it would
be treadmill$RunTime and in the command would look like:
> treadmill$RunTime
> mean(treadmill$RunTime)
[1] 10.58613
> sd(treadmill$RunTime)
[1] 1.387414
And now we know that the average running time for 1.5 miles for the subjects in the study was
10.6 minutes with a standard deviation (SD) of 1.39 minutes. But you should remember that the
mean and SD are only appropriate summaries if the distribution is roughly symmetric.
The mosaic package provides a useful function called favstats that provides the mean and
SD as well as the 5 number summary: the minimum (min), the first quartile (Q1, the
25th percentile), the median (50th percentile), the third quartile (Q3, the 75th percentile), and the
maximum (max). It also provides the number of observations (n), which was 31, as noted above,
and a count of whether any missing values were encountered (missing), which was 0 here.
> favstats(treadmill$RunTime)
> hist(treadmill$RunTime)
Figure 0-5: Histogram of Run Times in minutes of n=31 subjects in Treadmill study.
I used the Export button found above the plot, followed by Copy to Clipboard and clicking on
the Copy Plot button, to make the figure available to paste into your favorite word-processing
program. You can see the first parts of this process in the screen grab in Figure 0-6.
Figure 0-6: R-studio while in the process of copying the histogram.
You can also directly save the figures as separate files using Save as image or Save as PDF and
then insert them into other documents.
The function defaults into providing a histogram on the frequency or count scale. In most R
functions, there are the default options that will occur if we don't make any specific choices and
options that we can modify. One option we can modify here is to add labels to the bars to be able
to see exactly how many observations fell into each bar. Specifically, we can turn
the labels option "on" by adding labels=T to the previous call to the hist function,
separated by a comma:
> hist(treadmill$RunTime,labels=T)
Figure 0-7: Histogram of Run Times with counts in bars labelled.
Based on this histogram, it does not appear that there are any outliers in the responses since there are
no bars that are separated from the other observations. However, the distribution does not look
symmetric and there might be a skew to the distribution. Specifically, it appears to be skewed
right (the right tail is longer than the left). But histograms can sometimes mask features of the
data set by binning observations and it is hard to find the percentiles accurately from the plot.
When assessing outliers and skew, the boxplot (or Box and Whiskers plot) can also be helpful
(Figure 0-8) to describe the shape of the distribution as it displays the 5-number summary and
will also indicate observations that are "far" above the middle of the observations.
R's boxplot function uses the standard rule to indicate an observation as a potential outlier if it
falls more than 1.5 times the IQR (Inter-Quartile Range, calculated as Q3-Q1) below Q1 or
above Q3. The potential outliers are plotted with circles and the Whiskers (lines that extend from
Q1 and Q3 typically to the minimum and maximum) are shortened to only go as far as
observations that are within 1.5*IQR of the upper and lower quartiles. The box part of the
boxplot is a box that goes from Q1 to Q3 and the median is displayed as a line somewhere inside
the box6. Looking back at the summary statistics above, Q1=9.78 and Q3=11.27, providing an
IQR of:
> IQR<-11.27-9.78
> IQR
[1] 1.49
One observation (the maximum value of 14.03) is indicated as a potential outlier based on this
result by being larger than Q3+1.5*IQR, which was 13.505:
> 11.27+1.5*IQR
[1] 13.505
The boxplot also shows a slight indication of a right skew (skew towards larger values) with the
distance from the minimum to the median being smaller than the distance from the median to the
maximum. Additionally, the distance from Q1 to the median is smaller than the distance from the
median to Q3. It is modest skew, but is worth noting.
> boxplot(treadmill$RunTime)
Figure 0-8: Boxplot of 1.5 mile Run Times.
While the default boxplot is fine, it fails to provide good graphical labels, especially on the y-
axis. Additionally, there is no title on the plot. The following code provides some enhancements
to the plot by using the ylab and main options in the call to boxplot, with the results
displayed in Figure 0-9.
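A sketch of that enhanced call is below; the exact ylab and main label text is an assumption that you can edit to suit your own plot:
> boxplot(treadmill$RunTime, ylab="1.5 Mile Run Time (minutes)", main="Boxplot of Run Times of n=31 subjects")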
6 The median, quartiles and whiskers sometimes occur at the same values when there are many tied observations. If you can't see all the components of the boxplot, produce the numerical summary to help you understand what happened.
You should have R and R-studio downloaded and working after going through this preliminary
chapter. You should be able to read a data set into R and run some basic functions, all done using
the R-studio interface. If you are struggling with this, you should seek additional help with these
technical issues so that you are ready for more complicated statistical methods that are coming
very soon. For most assignments, we will give you a seed of the basic R code that you need.
Then you will modify it to work on your data set of interest. As mentioned previously, the way
everyone learns and uses R involves starting with someone else's code and then modifying it. If
you can complete the Practice Problems that follow, you are on your way to learning to use R.
The statistical methods in this chapter were minimal and all should have been review. They
involved a quick reminder of summarizing the center, spread, and shape of distributions using
numerical summaries of the mean and SD and/or the min, Q1, median, Q3, and max and the
histogram and boxplot as graphical summaries. The main point was really to get a start on using
R to provide results you should be familiar with from your previous statistics experiences.
DATASETNAME$VARIABLENAME
To access a particular variable in a data.frame called DATASETNAME, use a $ and then
the VARIABLENAME.
head(DATASETNAME)
Provides a list of the first few rows of the data set for all the variables in it.
mean(DATASETNAME$VARIABLENAME)
Calculates the mean of the observations in a variable.
sd(DATASETNAME$VARIABLENAME)
Calculates the SD of the observations in a variable.
favstats(DATASETNAME$VARIABLENAME)
Provides a suite of numerical summaries of the observations in a variable.
hist(DATASETNAME$VARIABLENAME)
Makes a histogram.
boxplot(DATASETNAME$VARIABLENAME)
Makes a boxplot.
At the end of each chapter, there is a section filled with questions related to the material. Your
instructor has a file that contains the R code required to provide the results to answer all these
questions. To practice learning R, it would be most useful for you to try to accomplish the
requested tasks first yourself in R and then refer to the provided R code when you struggle.
These questions provide a great venue to check what you are learning, see the methods applied to
another data set, and to discuss in study groups, with your instructor, or at the Math Learning
Center, especially if you have any questions about the correct responses.
0.1. Read in the treadmill data set discussed above and find the mean and SD of the Ages
(Age variable) and Body Weights (BodyWeight). In studies involving human subjects, it
is common to report summaries of the characteristics of the subjects. Why does this matter?
Think about how your interpretation of any study of the fitness of subjects would change
if the mean age had been 20 years older or 35 years younger.
0.2. How does knowing about the distribution of results for Age and BodyWeight help
you understand the results for the Run Times discussed above?
0.3. The mean and SD are most useful as summary statistics only if the distribution is
relatively symmetric. Make a histogram of Age responses and discuss the shape of the
distribution (is it skewed right, skewed left, approximately symmetric?; are there
outliers?). Approximately what range of ages does this study pertain to?
0.4. The weight responses are in kilograms and you might prefer to see them in pounds.
The conversion is lbs=2.205*kgs. Create a new variable in the treadmill data.frame
called BWlb using this code:
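Based on the conversion given above, a one-line assignment like this (a sketch, assuming the treadmill data.frame is loaded) will create the new variable:
> treadmill$BWlb <- 2.205*treadmill$BodyWeight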
0.5. Make histograms and boxplots of the original BodyWeight and new BWlb variables.
Discuss aspects of the distributions that changed and those that remained the same with
the transformation from kilograms to pounds.
1 - (R)e-introduction to statistics
by Mark Greenwood and Katharine Banner
It is also really important to note that variables have to vary - if you measure the sex of your
subjects but are only measuring females, then you do not have an interesting variable. The last,
but probably most important, aspect of data is the context of the measurement. The who, what,
when, and where of the collection of the observations is critical to the sort of conclusions we will
make based on the observations. The study design provides the information required to assess
the scope of inference of the study. Generally, remember to think
about the research questions the researchers were trying to answer and whether their study
actually would answer those questions. There are no formulas to help us sort some of these
things out, just critical thinking about the context of the measurements.
To make this concrete, consider the data collected from a study (Plaster, 1989) to investigate
whether perceived physical attractiveness had an impact on the sentences or perceived
seriousness of a crime that male jurors might give to female defendants. The researchers showed
the participants in the study (men who volunteered from a prison) pictures of one of three young
women. Each picture had previously been decided to be either beautiful, average, or unattractive
by the researchers. Each "juror" was randomly assigned to one of three levels of this factor
(which is a categorical predictor or explanatory variable) and then each rated their picture on a
variety of traits such as how warm or sincere the woman appeared. Finally, they were told the
women had committed a crime (also randomly assigned to either be told she committed a
burglary or a swindle) and were asked to rate the seriousness of the crime and provide a
suggested length of sentence. We will bypass some aspects of their research and just focus on
differences in the sentence suggested among the three pictures. To get a sense of these data, let's
consider the first and last parts of the data set:
Instead of loading this data set into R using the "Import Dataset" functionality, we can load a R
package that contains the data, making for easy access to this data set. The package
called heplots (Fox, Friendly, and Monette, 2013) contains a data set called MockJury that
contains the results of the study. We will also rely on the R package called mosaic (Pruim,
Kaplan, and Horton, 2014) that was introduced previously. First (but only once), you need to
install both packages, which can be done using the install.packages function with quotes
around the package name:
> install.packages("heplots")
After making sure that the packages are installed, we use the require function around the
package name (no quotes now!) to load the package.
> require(heplots)
> require(mosaic)
To load the data set that is in a loaded package, we use the data function.
> data(MockJury)
Now there will be a data.frame called MockJury available for us to analyze. We can find out
more about the data set as before in a couple of ways. First, we can use the View function to
provide a spreadsheet sort of view in the upper left panel. Second, we can use
the head and tail functions to print out the beginning and end of the data set. Because there
are so many variables, it may wrap around to show all the columns.
> View(MockJury)
> head(MockJury)
> help(MockJury)
With many variables in a data set, it is often useful to get some quick information about all of
them; the summary function provides useful information whether the variables are categorical or
quantitative and notes if any values were missing.
> summary(MockJury)
To accompany the numerical summaries, histograms and boxplots can provide some initial
information on the shape of the distribution of the responses for the suggested sentences
in Years. Figure 1-1 contains the histogram and boxplot of Years, ignoring any information on
which picture the "jurors" were shown. The code is enhanced slightly to make it better labeled:
> hist(MockJury$Years,xlab="Years",labels=T,main="Histogram of Years")
> hist(MockJury$Years,freq=F,xlab="Years",main="Histogram of Years with density curve")
> lines(density(MockJury$Years),lwd=3,col="red")
Figure 1-2: Histogram and density curve of Years data.
Histograms can be sensitive to the choice of the number of bars and even the cut-offs used to
define the bins for a given number of bars. Small changes in the definition of cut-offs for the bins
can have noticeable impacts on the shapes observed but this does not impact density curves. We
are not going to over-ride the default choices for bars in histogram, but we can add information
on the original observations being included in each bar. In the previous display, we can add what
is called a rug to the plot, where a tick mark is made for each observation. Because the responses
were provided as whole years (1, 2, 3, ..., 15), we need to use a graphical technique
called jittering to add a little noise10 to each observation so all observations at each year value do
not plot at the same points. In Figure 1-3, the added tick marks on the x-axis show the
approximate locations of the original observations. We can clearly see how there are 3
observations at 15 (all were 15 and the noise added makes it possible to see them all). The
limitations of the histogram arise around the 10 year sentence area where there are many
responses at 10 years and just one at both 9 and 11 years, but the histogram bars sort of miss
that aspect of the data set. The density curve did show a small bump at 10 years. Density curves
are, however, not perfect and this one shows area for sentences less than 0 years which is not
possible here.
> hist(MockJury$Years,freq=F,xlab="Years",main="Histogram of Years with density curve and rug")
> lines(density(MockJury$Years),lwd=3,col="red")
> rug(jitter(MockJury$Years),col="blue",lwd=2)
Figure 1-3: Histogram and density curve and rug of the jittered responses.
The tools we've just discussed are going to help us move to comparing the distribution of
responses across more than one group. We will have two displays that will help us make these
comparisons. The simplest is the side-by-side boxplot, where a boxplot is displayed for each
group of interest using the same y-axis scaling. In R, we can use its formula notation to see if the
response (Years) differs based on the group (Attr) by using something like Y~X or,
here, Years~Attr. We also need to tell R where to find the variables and use the last option in
the command, data=DATASETNAME, to inform R of the data.frame to look in to find the
variables. In this example, data=MockJury. We will use the formula and data=... options in
almost every function we use from here forward. Figure 1-4 contains the side-by-side boxplots
showing right skew for all the groups, slightly higher median and more variability for
the Unattractive group along with some potential outliers indicated in two of the three groups.
> boxplot(Years~Attr,data=MockJury)
7 You will more typically hear "data is" but that more often refers to information, sometimes even statistical summaries of data sets, than to observations collected as part of a study, suggesting the confusion of this term in the general public. We will explore a data set in Chapter 4 related to perceptions of this issue collected by researchers at http://fivethirtyeight.com.
8 We will try to reserve the term "effect" for situations where random assignment allows us to consider causality as the reason for the differences in the response variable among levels of the explanatory variable, if we find evidence against the null hypothesis of no difference.
9 If you've taken calculus, you will know that the curve is being constructed so that the integral from −∞ to ∞ is 1.
10 Jittering typically involves adding random variability to each observation that is uniformly distributed in a range determined based on the spacing of the observations. If you re-run the jitter function, the results will change. For more details, type help(jitter) in R.
1.1 - Beanplots
by Mark Greenwood and Katharine Banner
The other graphical display for comparing multiple groups we will use is a newer display called
a beanplot (Kampstra, 2008). It provides a side-by-side display that contains the density curve,
the original observations that generated the density curve in a rug-plot, and the mean of each
group. For each group the density curves are mirrored to aid in visual assessment of the shape of
the distribution. This mirroring will often create a shape that resembles a violin with skewed
distributions. Long, bold horizontal lines are placed at the mean for each group. All together this
plot shows us information on the center (mean), spread, and shape of the distributions of the
responses. Our inferences typically focus on the means of the groups and this plot allows us to
compare those across the groups while gaining information on whether the mean is a reasonable
summary of the center of the distribution.
To use the beanplot function we need to install and load the beanplot package. The
function works like the boxplot used previously except that options for log, col,
and method need to be specified. Use these options for any beanplots you
make: log="",col="bisque", method="jitter".
> require(beanplot)
> beanplot(Years~Attr,data=MockJury,log="",col="bisque",method="jitter")
Figure 1-5 reinforces the strong right skews that were also detected in the boxplots previously.
The three large sentences of 15 years can now be clearly viewed, one in the Beautiful group and
two in the Unattractive group. The Unattractive group seems to have more high observations
than the other groups even though the Beautiful group had the largest number of observations
around 10 years. The mean sentence was highest for the Unattractive group and the difference
in the means between Beautiful and Average was small.
Figure 1-5: Beanplot of Years by picture group. Long, bold lines correspond to mean of each
group.
In this example, it appears that the mean for Unattractive is larger than the other two groups. But
is this difference real? We will never know the answer to that question, but we can assess how
likely we are to have seen a result as extreme or more extreme than our result, assuming that
there is no difference in the means of the groups. And if the observed result is (extremely)
unlikely to occur, then we can reject the hypothesis that the groups have the same mean and
conclude that there is evidence of a real difference. We can get means and standard deviations by
groups easily using the same formula notation with the mean and sd functions if
the mosaic package is loaded.
> mean(Years~Attr,data=MockJury)
> favstats(Years~Attr,data=MockJury)
Because comparing two groups is easier than comparing more than two groups, we will start
with comparing the Average and Unattractive groups. We could remove the Beautiful group
observations in a spreadsheet program and read that new data set back into R, but it is easier to
use R to do data management once the data set is loaded. To remove the observations that came
from the Beautiful group, we are going to generate a new variable that we will
call NotBeautiful that is true when observations came from another group
(Average or Unattractive) and false for observations from the Beautiful group. To do this, we
will apply the not equal logical function (!=) to the variable Attr, inquiring whether it was
different from the "Beautiful" level.
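A minimal version of that command, assuming the MockJury data.frame from above is loaded, is:
> NotBeautiful <- MockJury$Attr != "Beautiful"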
> NotBeautiful
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
 [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 ...
 [97]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[109]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
This new variable is only FALSE for the Beautiful responses as we can see if we compare some
of the results from the original and new variable:
> data.frame(MockJury$Attr,NotBeautiful)
MockJury.Attr NotBeautiful
1 Beautiful FALSE
2 Beautiful FALSE
3 Beautiful FALSE
...
20 Beautiful FALSE
21 Beautiful FALSE
22 Unattractive TRUE
23 Unattractive TRUE
24 Unattractive TRUE
25 Unattractive TRUE
26 Unattractive TRUE
...
112 Average TRUE
113 Average TRUE
114 Average TRUE
To get rid of one of the groups, we need to learn a little bit about data management in
R. Brackets ([,]) are used to modify the rows or columns in a data.frame with entries before
the comma operating on rows and entries after the comma on the columns. For example, if you
want to see the results for the 5th subject we can reference the 5th row of the data.frame
using [5,] after the data.frame name:
> MockJury[5,]
> MockJury[5,3]
[1] 7
In R, we can use logical vectors to keep any rows of the data.frame where the variable is true and
drop any rows where it is false by placing the logical variable in the first element of the brackets.
The reduced version of the data set should be saved with a different name such
as MockJury2 that is used here:
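A sketch of that command, using the NotBeautiful variable in the row position of the brackets, is:
> MockJury2 <- MockJury[NotBeautiful, ]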
You will always want to check that the correct observations were dropped either
using View(MockJury2) or by doing a quick summary of the Attr variable in the new
data.frame.
> summary(MockJury2$Attr)
Average Unattractive
38 37
Now the boxplot and beanplots only contain results for the two groups of interest here as seen in
Figure 1-6.
> boxplot(Years~Attr,data=MockJury2)
> beanplot(Years~Attr,data=MockJury2,log="",col="bisque",method="jitter")
The two-sample mean techniques you learned in your previous course start with comparing the
means of the two groups. We can obtain the two means using the mean function or directly obtain
the difference in the means using the compareMean function (both require
the mosaic package). The compareMean function provides x̄Unattractive − x̄Average, where x̄ is the
sample mean of observations in the subscripted group. Note that there are two directions to
compare the means and this function chooses to take the mean from the second group name
alphabetically and subtracts the mean from the first alphabetical group name. It is always good to
check the direction of this calculation as having a difference of -1.84 years versus 1.84 years
could be important to note.
> mean(Years~Attr,data=MockJury2)
Average Unattractive
3.973684 5.810811
> compareMean(Years ~ Attr, data=MockJury2)
[1] 1.837127
Figure 1-6: Boxplot and beanplot of the Years responses on the reduced data set.
There appears to be some evidence that the Unattractive group is getting higher average lengths
of sentences from the mock jurors than the Average group, but we want to make sure that the
difference is real - that there is evidence to reject the assumption that the means are the same "in
the population". First, a null hypothesis11 which defines a null model12 needs to be determined in
terms of parameters (the true values in the population). The research question should help you
determine the form of the hypotheses for the assumed population. In the 2 independent sample
mean problem, the interest is in testing a null hypothesis of H0: μ1=μ2 versus the alternative
hypothesis of HA: μ1≠μ2, where μ1 is the parameter for the true mean of the first group and μ2 is
the parameter for the true mean of the second group. The alternative hypothesis involves
assuming a statistical model where the ith (i=1,...,nj) response from the jth group (j=1,2), yij, is
modeled as yij = μj + εij, where we typically assume that εij ~ N(0,σ²). For the moment, focus on
the models implied by assuming the means are the same (null) or different (alternative):
• Null Model: yij = μ + εij There is no difference in true means for the two groups.
• Alternative Model: yij = μj + εij There is a difference in true means for the two groups.
Suppose we are considering the alternative model for the 4th observation (i=4) from the second
group (j=2), then the model for this observation is y42 = μ2 + ε42. And for, say, the 5th observation
from the first group (j=1), the model is y51 = μ1 + ε51. If we were working with the null model, the
mean is always the same (μ) and the group specified does not change that aspect of the model.
It can be helpful to think about the null and alternative models graphically. By assuming the null
hypothesis is true (means are equal) and that the random errors around the mean follow a normal
distribution, we assume that the truth is as displayed in the left panel of Figure 1-7 - two normal
distributions with the same mean and variability. The alternative model allows the two groups to
potentially have different means, such as those displayed in the right panel of Figure 1-7, but
otherwise assumes that the responses have the same distribution. We assume that the
observations (yij) would either have been generated as samples from the null or alternative model
- imagine drawing observations at random from the pictured distributions. The hypothesis testing
task in this situation involves first assuming that the null model is true and then assessing how
unusual the actual result was relative to that assumption so that we can conclude that the
alternative model is likely correct. The researchers obviously would have hoped to encounter
some sort of noticeable difference in the sentences provided for the different pictures and been
able to find enough evidence to reject the null model where the groups "looked the same".
Figure 1-7: Illustration of the assumed situations under the null (left) and a single possibility
that could occur if the alternative were true (right).
In statistical inference, null hypotheses (and their implied models) are set up as "straw men" with
every interest in rejecting them even though we assume they are true to be able to assess the
evidence against them. Consider the original study design here: the pictures were randomly
assigned to the subjects. If the null hypothesis were true, then we would have no difference in the
population means of the groups. And this would apply if we had done a different random
assignment of the pictures to the subjects. So let's try this: assume that the null hypothesis is true
and randomly re-assign the treatments (pictures) to the observations that were obtained. In other
words, keep the sentences (Years) the same and shuffle the group labels randomly. The technical
term for this is doing a permutation (a random shuffling of the treatments relative to the
responses). If the null is true and the means in the two groups are the same, then we should be
able to re-shuffle the groups to the observed sentences (Years) and get results similar to those we
actually observed. If the null is false and the means are really different in the two groups, then
what we observed should differ from what we get under other random permutations. The
differences between the two groups should be more noticeable in the observed data set than in
(most) of the shuffled data sets. It helps to see this to understand what a permutation means in
this context.
In the mosaic R package, the shuffle function allows us to easily perform a permutation13.
Just one time, we can explore what a permutation of the treatment labels could look like.
> Perm1
The comparison of the beanplots for the real data set and permuted version of the labels is what
is really interesting (Figure 1-8). The original difference in the sample means of the two groups
was 1.84 years (Unattractive minus Average). The sample means are the statistics that estimate
the parameters for the true means of the two groups. In the permuted data set, the difference in
the means is 0.66 years.
> mean(Years ~ PermutedAttr, data=Perm1)
Average Unattractive
4.552632 5.216216
> compareMean(Years ~ PermutedAttr, data=Perm1)
[1] 0.6635846
Figure 1-8: Boxplots of Years responses versus actual treatment groups and permuted groups.
These results suggest that the observed difference was larger than what we got when we did a
single permutation. The important aspect of this is that the permutation is valid if the null
hypothesis is true - this is a technique to generate results that we might have gotten if the null
hypothesis were true. We just need to repeat the permutation process many times and track how
unusual our observed result is relative to this distribution of responses. If the observed
differences are unusual relative to the results under permutations, then there is evidence against
the null hypothesis, the null hypothesis should be rejected (Reject H0) and a conclusion should be
made, in the direction of the alternative hypothesis, that there is evidence that the true means
differ. If the observed differences are similar to (or at least not unusual relative to) what we get
under random shuffling under the null model, we would have a tough time concluding that there
is any real difference between the groups based on our observed data set.
11 The hypothesis of no difference that is typically generated in the hopes of being rejected in favor of the alternative hypothesis which contains the sort of difference that is of interest in the application.
12 The null model is the statistical model that is implied by the chosen null hypothesis. Here, a null hypothesis of no difference will translate to having a model with the same mean for both groups.
13 We'll see the shuffle function in a more common usage below; while the code to generate Perm1 is provided, it isn't something to worry about right now: Perm1<-with(MockJury2,data.frame(Years,Attr,PermutedAttr=shuffle(Attr)))
In any testing situation, you must define some function of the observations that gives us a single
number that addresses our question of interest. This quantity is called a test statistic. These often
take on complicated forms and have names like t or z statistics that relate to their parametric
(named) distributions so we know where to look up p-values. In randomization settings, they can
have simpler forms because we use the data set to find the distribution of the statistic. We will
label our test statistic T (for Test statistic) unless the test statistic has a commonly used name.
Since we are interested in comparing the means of the two groups, we can
define T = x̄Unattractive − x̄Average, which coincidentally is what the compareMean
function provided us previously. We label our observed test statistic (the one from the
original data set) as Tobs = x̄Unattractive − x̄Average, which happened to be 1.84 years here. We will
compare this result to the results for the test statistic that we obtain from permuting the group
labels. To denote permuted results, we will add a * to the labels: T* = x̄Unattractive* − x̄Average*. We then
compare the Tobs = x̄Unattractive − x̄Average = 1.84 to the distribution of results that are possible for the
permuted results (T*), which corresponds to assuming the null hypothesis is true.
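For later use in plots and p-value calculations, it helps to store the observed test statistic in R. A minimal way to do that, re-using the compareMean call from above, is:
> Tobs <- compareMean(Years ~ Attr, data=MockJury2)
> Tobs
[1] 1.837127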
To do permutations, we are going to learn how to write a for loop in R to be able to repeatedly
generate the permuted data sets and record T*. Loops are a basic programming task that make
randomization methods possible as well as potentially simplifying any repetitive computing task.
To write a "for loop", we need to choose how many times we want to do the loop (call that B)
and decide on a counter to keep track of where we are at in the loops (call that b, which goes
from 1 to B). The simplest loop would just involve printing out the index, print(b). This is
our first use of curly braces, { and }, that are used to group the code we want to repeatedly run as
we proceed through the loop. The code in the script window is:
B<-5
for (b in (1:B)){
print(b)
}
And when you highlight and run the code, it will look about the same with "+" printed after the
first line to indicate that all the code is connected, looking like this:
> for (b in (1:B)){
+ print(b)
+ }
When you run these three lines of code, the console will show you the following output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
This is basically the result of running the print function on b as it has values from 1 to 5.
Instead of printing the counter, we want to use the loop to repeatedly compute our test statistic
when permuting observations. The shuffle function will perform permutations of the group
labels relative to responses and the compareMean function will calculate the difference in two
group means. For a single permutation, the combination of shuffling Attr and finding the
difference in the means, storing it in a variable called Ts is:
> Ts<-compareMean(Years ~ shuffle(Attr), data=MockJury2)
> Ts
[1] 0.3968706
And putting this inside the print function allows us to find the test statistic under 5 different
permutations easily:
> for (b in (1:B)){
+ Ts<-compareMean(Years ~ shuffle(Attr), data=MockJury2)
+ print(Ts)
+ }
[1] 0.9302987
[1] 0.6635846
[1] 0.7702703
[1] -1.203414
[1] -0.7766714
Finally, we would like to store the values of the test statistic instead of just printing them out on
each pass through the loop. To do this, we need to create a variable to store the results, let's call
it Tstar. We know that we need to store B results so we will create a vector of length B, containing
B elements, full of missing values (NA) using the matrix function:
> Tstar<-matrix(NA,nrow=B)
> Tstar
[,1]
[1,] NA
[2,] NA
[3,] NA
[4,] NA
[5,] NA
Now we can run our loop B times and store the results in Tstar:
> for (b in (1:B)){
+ Tstar[b]<-compareMean(Years ~ shuffle(Attr), data=MockJury2)
+ }
> Tstar
[,1]
[1,] 1.1436700
[2,] -0.7233286
[3,] 1.3036984
[4,] -1.1500711
[5,] -1.0433855
The Tstar vector, when we set B to be large, say B=1,000, generates the permutation distribution
for the selected test statistic under14 the null hypothesis - what is called the null distribution of
the statistic and also its sampling distribution. We want to visualize this distribution and use it to
assess how unusual our Tobs result of 1.84 years was relative to all the possibilities under
permutations (under the null hypothesis). So we repeat the loop, now with B=1000 and generate
a histogram, density curve and summary statistics of the results:
> B<-1000
> Tstar<-matrix(NA,nrow=B)
> for (b in (1:B)){
+ Tstar[b]<-compareMean(Years ~ shuffle(Attr), data=MockJury2)
+ }
> hist(Tstar,labels=T)
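> plot(density(Tstar),main="Density curve of Tstar") # assumed command: Figure 1-9 also shows a density curve; the main= title is a placeholder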
> favstats(Tstar)
       min         Q1     median        Q3      max       mean        sd    n missing
 -2.536984 -0.5633001 0.02347084 0.6102418 2.903983 0.01829659 0.8625767 1000       0
Figure 1-9 contains visualizations of the results for the distribution of T* and the favstats
summary provides the related numerical summaries. Our observed Tobs of 1.837 seems fairly
unusual relative to these results with only 11 T* values over 2 based on the histogram. We need
to make more specific assessments of the permuted results versus our observed result to be able
to clearly decide whether our observed result is really unusual.
Figure 1-9: Histogram (with counts in bars) and density curve of values of test statistic for 1,000
permutations.
We can enhance the previous graphs by adding the value of the test statistic from the real data
set, as shown in Figure 1-10, using the abline function.
> hist(Tstar,labels=T)
> abline(v=Tobs,lwd=2,col="red")
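> plot(density(Tstar),main="Density curve of Tstar") # assumed command: Figure 1-10 also shows a density curve, and the repeated abline below adds the observed statistic to it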
> abline(v=Tobs,lwd=2,col="red")
Figure 1-10: Histogram and density curve of values of test statistic for 1,000 permutations with
bold line for value of observed test statistic.
Second, we can calculate the exact number of permuted results that were larger than what we
observed. To calculate the proportion of the 1,000 values that were larger than what we
observed, we will use the pdata function. To use this function, we need to provide the cut-off
point (Tobs), the distribution of values to compare to the cut-off (Tstar), and whether we want
the lower or upper tail of the distribution (lower.tail=F option provides the proportion of
values above).
> pdata(Tobs,Tstar,lower.tail=F)
[1] 0.016
The proportion of 0.016 tells us that 16 of the 1,000 permuted results (1.6%) were larger than
what we observed. This type of work is how we can generate p-values using permutation
distributions. P-values are the probability of getting a result as extreme or more extreme than
what we observed, given that the null is true. Finding only 16 permutations of 1,000 that were
larger than our observed result suggests that it is hard to find a result like what we observed if
there really were no difference, although it is not impossible.
When testing hypotheses for two groups, there are two types of alternative hypotheses, one-sided
or two-sided. One-sided tests involve only considering differences in one-direction (like μ1>μ2)
and are performed when researchers can decide a priori15 which group should have a larger
mean. We did not know enough about the potential impacts of the pictures to know which group
should be larger than the other and without much knowledge we could have gotten the direction
wrong relative to the observed results and we can't look at the responses to decide on the
hypotheses. It is often safer and more conservative16 to start with a two-sided alternative (HA:
μ1≠μ2). To do a 2-sided test, find the area larger than what we observed as above. We also need
to add the area in the other tail (here the left tail) similar to what we observed in the right tail.
Here we need to also find how many of the permuted results were smaller than -1.84 years,
using pdata with -Tobs as the cut-off and lower.tail=T:
> pdata(-Tobs,Tstar,lower.tail=T)
[1] 0.015
So the p-value to test our null hypothesis of no difference in the true means between the groups
is 0.016+0.015, providing a p-value of 0.031. Figure 1-11 shows both cut-offs on the histogram
and density curve.
> hist(Tstar,labels=T)
> abline(v=c(-1,1)*Tobs,lwd=2,col="red")
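> plot(density(Tstar),main="Density curve of Tstar") # assumed command: Figure 1-11 also shows a density curve, and the repeated abline below adds the cut-offs to it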
> abline(v=c(-1,1)*Tobs,lwd=2,col="red")
Figure 1-11: Histogram and density curve of values of test statistic for 1,000 permutations with
bold lines for value of observed test statistic and its opposite value required for performing two-
sided test.
In general, the one-sided test p-value is the proportion of the permuted results that are more
extreme than observed in the direction of the alternative hypothesis (lower or upper tail, which
also depends on the direction of the difference taken). For the 2-sided test, the p-value is the
proportion of the permuted results that are less than the negative version of the observed statistic
and greater than the positive version of the observed statistic. Using absolute values, we can
simplify this: the two-sided p-value is the proportion of the |permuted statistics| that are larger
than |observed statistic|. This will always work and finds areas in both tails regardless of whether
the observed statistic is positive or negative. In R, the abs function provides the absolute
value and we can again use pdata to find our p-value:
> pdata(abs(Tobs),abs(Tstar),lower.tail=F)
[1] 0.031
We will discuss the choice of significance level below, but for the moment, assume
a significance level (α) of 0.05. Since the p-value is smaller than α, this suggests that we
can reject the null hypothesis and conclude that there is evidence of some difference in the true
mean sentences given between the two types of pictures.
Before we move on, let's note some interesting features of the permutation distribution of the
difference in the sample means shown in Figure 1-11.
1) It is basically centered at 0. Since we are performing permutations assuming the null model
is true, we are assuming that μ1=μ2 which implies that μ1−μ2= 0 and 0 is always the center of the
permutation distribution.
2) It is approximately normally distributed. This is due to the Central Limit Theorem17,
where the sampling distribution of the difference in the sample means (x̄1 − x̄2) will be
approximately normal if the sample sizes are large enough. This result will allow us to use a
parametric method to approximate this distribution under the null model if some assumptions are
met, as we'll discuss below.
3) Our observed difference in the sample means (1.84 years) is a fairly unusual result relative
to the rest of these results but there are some permuted data sets that produce more extreme
differences in the sample means. When the observed differences are really large, we may not see
any permuted results that are as extreme as what we observed. When pdata gives you 0, the p-
value should be reported to be smaller than 0.001 (not 0!) since it happened in less than 1 in
1000 tries.
4) Since our null model is not specific about the direction of the difference, considering a
result like ours but in the other direction (-1.84 years) needs to be included. The observed result
seems to put about the same area in both tails of the distribution but it is not exactly the same.
The small difference in the tails is a useful aspect of this approach compared to the parametric
method discussed below as it accounts for slight asymmetry in the sampling distribution.
Earlier, we decided that the p-value was small enough to reject the null hypothesis since it was
smaller than our chosen level of significance. In this course, you will often be allowed to use
your own judgement about an appropriate significance level in a particular situation (in other
words, if we forget to tell you an α-level, you can still make a decision using a reasonably
selected significance level). Remembering that the p-value is the probability you would observe
a result like you did (or more extreme), assuming the null hypothesis is true, this tells you that
the smaller the p-value is, the more evidence you have against the null. The next section provides
a more formal review of the hypothesis testing infrastructure, terminology, and some of the things
that can happen when testing hypotheses.
14 We often say "under" in statistics and we mean "given that the following is true".
15 This is a fancy way of saying "in advance", here in advance of seeing the observations.
16 Statistically, a conservative method is one that provides less chance of rejecting the null hypothesis in comparison to some other method or some pre-defined standard.
17 We'll leave the discussion of the CLT to your previous stat coursework or an internet search.
Hypothesis testing is much like a criminal trial where you are in the role of a jury member (or
judge if no jury is present). Initially, the defendant is assumed innocent. In our situation, the true
means are assumed to be equal between the groups. Then evidence is presented and, as a juror,
you analyze it. In statistical hypothesis testing, data are collected and analyzed. Then you have to
decide if there was "enough" evidence to reject the initial assumption ("innocence" is initially
assumed). To make this decision, you want to have previously decided on the standard of
evidence required to reject the initial assumption. In criminal cases, "beyond a reasonable doubt"
is used. Wikipedia's definition suggests that this standard is that "there can still be a doubt, but
only to the extent that it would not affect a reasonable person's belief regarding whether or not
the defendant is guilty". In civil trials, a lower standard called a "preponderance of evidence" is
used. Based on that defined and pre-decided (a priori) measure, you decide that the defendant is
guilty or not guilty. In statistics, we compare our p-value to a significance level, α, which is most
often 5%. If our p-value is less than α, we reject the null hypothesis. The choice of the
significance level is like the variation in standards of evidence between criminal and civil trials -
and in all situations everyone should know the standards required for rejecting the initial
assumption before any information is "analyzed". Once someone is found guilty, then there is the
matter of sentencing which is related to the impacts ("size") of the crime. In statistics, this is
similar to the estimated size of differences and the related judgements about whether the
differences are practically important or not. If the crime is proven beyond a reasonable doubt but
it is a minor crime, then the sentence will be small. With the same level of evidence and a more
serious crime, the sentence will be more dramatic.
There are some important aspects of the testing process to note that inform how we interpret
statistical hypothesis test results. When someone is found "not guilty", it does not mean
"innocent", it just means that there was not enough evidence to find the person guilty "beyond a
reasonable doubt". Not finding enough evidence to reject the null hypothesis does not imply that
the true means are equal, just that there was not enough evidence to conclude that they were
different. There are many potential reasons why we might fail to reject the null, but the most
common one is that our sample size was too small (which is related to having too little evidence).
Throughout the semester, we will continue to re-iterate the distinctions between parameters and
statistics and want you to be clear about the distinctions between estimates based on the sample
and inferences for the population or true values of the parameters of interest. Remember that
statistics are summaries of the sample information and parameters are characteristics of
populations (which we rarely know). In the two-sample mean situation, the sample means are
always at least a little different - that is not an interesting conclusion. What is interesting is
whether we have enough evidence to prove that the population means differ "beyond a
reasonable doubt".
The scope of any inferences is constrained based on whether there is a random sample (RS)
and/or random assignment (RA). Table 1-1 contains the four possible combinations of these two
characteristics of a given study. Random assignment allows for causal inferences for differences that are observed - the difference in treatment levels causes differences in the mean responses.
Random sampling (or at least some sort of representative sample) allows inferences to be made
to the population of interest. If we do not have RA, then causal inferences cannot be made. If we
do not have a representative sample, then our inferences are limited to the sampled subjects.
A simple example helps to clarify how the scope of inference can change. Suppose we are
interested in studying the GPA of students and have a sample mean GPA and a confidence
interval for the population mean GPA available. If we had taken a random sample from, say, the
STAT 217 students in a given semester, our scope of inference would be the population of 217
students in that semester. If we had taken a random sample from the entire MSU population, then
the inferences would be to the entire MSU population in that semester. These are similar types of
problems but the two populations are very different and the group you are trying to make
conclusions about should be noted carefully in your results - it does matter! If we did not have a
representative sample, say the students could choose to provide this information or not, then we
can only make inferences to volunteers. These volunteers might differ in systematic ways from
the entire population of STAT 217 students so we cannot safely extend our inferences beyond
the group that volunteered.
A quick summary of the terminology of hypothesis testing is useful at this point. The null
hypothesis (H0) states that there is no difference or no relationship in the population. This is the
statement of no effect or no difference and the claim that we are trying to find evidence against.
In this chapter, it is always H0: μ1 = μ2. When doing two-group problems, you always need to
specify which group is 1 and which is 2. The alternative hypothesis (H1 or HA) states a specific
difference between parameters. This is the research hypothesis and the claim about the
population that we hope to demonstrate is more reasonable to conclude than the null hypothesis.
In the two-group situation, we can have one-sided alternatives of HA: μ1 > μ2 (greater than) or HA:
μ1 < μ2 (less than) or, the more common, two-sided alternative of HA: μ1 ≠ μ2 (not equal to). We
usually default to using two-sided tests because we often do not know enough to know the
direction of a difference in advance, especially in more complicated situations. The sampling
distribution is the distribution of a statistic under the assumption that H0 is true and is used to
calculate the p-value, the probability of obtaining a result as extreme or more extreme than what
we observed given that the null hypothesis is true. We will find sampling distributions
using nonparametric approaches (like the permutation approach used above) and parametric
methods (using "named" distributions like the t, F, and χ2).
Small p-values are evidence against the null hypothesis because the observed result is
unlikely due to chance if H0 is true. Large p-values provide no evidence against H0 but do not
allow us to conclude that there is no difference. The level of significance is an a priori definition
of how small the p-value needs to be to provide "enough" (sufficient) evidence against H0. This
is most useful to prevent sliding the standards after the results are found. We compare the p-
value to the level of significance to decide if the p-value is small enough to constitute sufficient
evidence to reject the null hypothesis. We use α to denote the level of significance and most
typically use 0.05 which we refer to as the 5% significance level. We compare the p-value to this
level and make a decision. The two options for decisions are to either reject the null hypothesis if
the p-value ≤ α or fail to reject the null hypothesis if the p-value > α. When interpreting
hypothesis testing results, remember that the p-value is a measure of how unlikely the observed
outcome was, assuming that the null hypothesis is true. It is NOT the probability of the data or
the probability of either hypothesis being true. The p-value is a measure of evidence against the
null hypothesis.
The specific definition of α is that it is the probability of rejecting H0 when H0 is true, the probability of what is called a Type I error. Type I errors are also called false rejections. In the two-group mean situation, a Type I error would be concluding that there is a difference in the true means between the groups when none really exists in the population. In the courtroom setting, this is like falsely finding someone guilty. We don't want to do this very often, so we use small values of the significance level, allowing us to control the rate of Type I errors at α. We
also have to worry about Type II errors, which are failing to reject the null hypothesis when it's
false. In a courtroom, this is the same as failing to convict a guilty person. This most often occurs
due to a lack of evidence. You can use the Table 1-2 to help you remember all the possibilities.
Table 1-2: Table of decisions and truth scenarios in a hypothesis testing situation. We never
know the truth in a real situation.
                H0 True             H0 False
FTR H0          Correct decision    Type II error
Reject H0       Type I error        Correct decision
In comparing different procedures, there is an interest in studying the rate or probability of Type
I and II errors. The probability of a Type I error was defined previously as α, the significance
level. The power of a procedure is the probability of rejecting the null hypothesis when it is false.
Power is defined as power = 1 - Probability(Type II error) = Probability(Reject H0 | H0 is false),
or, in words, the probability of detecting a difference when it actually exists. We want to use a
statistical procedure that controls the Type I error rate at the pre-specified level and has high
power to detect false null alternatives. Increasing the sample size is one of the most commonly
used methods for increasing the power in a given situation but sometimes we can choose among
different procedures and use the power of the procedures to help us make that selection. Note
that there are many ways to make H0 false and the power changes based on how false the null
hypothesis actually is. To make this concrete, suppose that the true mean sentences differed by either 1 or 20 years in the previous example. The chances of rejecting the null hypothesis are much larger when the groups actually differ by 20 years than when they differ by just 1 year.
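To make the idea of controlling the Type I error rate concrete, here is a small simulation sketch (not part of the original example) that generates two groups from the same normal population, so the null hypothesis really is true, and estimates how often the equal variance t-test rejects at the 5% level; the group sizes (20 each) and the number of simulated data sets (1,000) are arbitrary choices for illustration:
> set.seed(123)
> reject <- replicate(1000, t.test(rnorm(20), rnorm(20), var.equal=TRUE)$p.value <= 0.05)
> mean(reject)   # proportion of false rejections; should be close to the nominal 0.05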
After making a decision (was there enough evidence to reject the null or not), we want to make
the conclusions specific to the problem of interest. If we reject H0, then we can conclude that
there was sufficient evidence at the α-level that the null hypothesis is wrong (and the results
point in the direction of the alternative). If we fail to reject H0 (FTR H0), then we can conclude
that there was insufficient evidence at the α-level to say that the null hypothesis is wrong. We
are NOT saying that the null is correct and we NEVER accept the null hypothesis. We just
failed to find enough evidence to say it's wrong. If we find sufficient evidence to reject the null,
then we need to revisit the method of data collection and design of the study. This allows us to
consider the scope of the inferences we can make. Can we discuss causality (due to RA) and/or
make inferences to a larger group than those in the sample (due to RS)?
To perform a hypothesis test, there are several steps to complete to make sure you have thought through all the aspects of the results.
2) Assess the "Things To Check" for the procedure being used (discussed below)
5) Make a decision
In developing statistical inference techniques, we need to define the test statistic, T, that
measures the quantity of interest. To compare the means of two groups, a statistic is needed that
measures their differences. In general, for comparing two groups, the choices are simple - a
difference in the means often works well and is a natural choice. There are other options such as
tracking the ratio of means or possibly the difference in medians. Instead of just using the
difference in the means, we could "standardize" the difference in the means by dividing by an
appropriate quantity. It ends up that there are many possibilities for testing using the
randomization (nonparametric) techniques introduced previously. Parametric statistical methods
focus on means because the statistical theory surrounding means is quite a bit easier (not easy,
just easier) than other options. Randomization techniques allow inference for other quantities but
our focus here will be on using randomization for inferences on means to see the similarities with
the more traditional parametric procedures.
In two-sample mean situations, instead of working with the difference in the means, we often calculate a test statistic that is called the equal variance two-independent samples t-statistic. The test statistic is

t = (x̄1 − x̄2) / (sp √(1/n1 + 1/n2)),

where s1² and s2² are the sample variances for the two groups, n1 and n2 are the sample sizes for the two groups, and the pooled sample standard deviation is

sp = √[ ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ].
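As a rough check on this formula, a minimal sketch that plugs in approximate group summaries for these data (Average: n=38, mean about 3.97, sd about 2.82; Unattractive: n=37, mean about 5.81, sd about 4.36 - values that appear in the favstats output later) should reproduce the observed t-statistic of about -2.17:
> sp <- sqrt(((38-1)*2.824^2 + (37-1)*4.364^2)/(38+37-2))   # pooled standard deviation, about 3.67
> (3.974 - 5.811)/(sp*sqrt(1/38 + 1/37))                    # approximately -2.17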
The t-statistic keeps the important comparison between the means in the numerator that we used
before and standardizes (re-scales) that difference so that t will follow a t-distribution (a
parametric "named" distribution) if certain assumptions are met. But first we should see if
standardizing the difference in the means had an impact on our permutation test results. Instead
of using the compareMean function, we will use the t.test function (see its full use below)
and have it calculate the formula for t for us. The R code "$statistic" is basically a way of
extracting just the number we want to use for T from a larger set of output the t.test function
wants to provide you. We will see below that t.test switches the order of the difference (now
it is Average - Unattractive) - always carefully check for the direction of the difference in the
results. Since we are doing a two-sided test, the code resembles the permutation test code in
Section 1.3 with the new t-statistic replacing the difference in the sample means.
The permutation distribution in Figure 1-12 looks similar to the previous results with slightly different x-axis scaling. The observed t-statistic was -2.17 and the proportion of permuted results that were more extreme than the observed result was 0.034. Any difference from the earlier permutation results is due to a different set of random permutations being selected. If you run permutation code, you will often get slightly different results each time you run it. If you are uncomfortable with the variation in the results, you can run more than B=1,000 permutations (say 10,000) and the variability will be reduced further. Usually this uncertainty will not cause any substantive problems - but do not be surprised if your results vary from a colleague's if you are both analyzing the same data set.
> Tobs <- t.test(Years ~ Attr, data=MockJury2, var.equal=T)$statistic; Tobs
        t
 -2.17023
> B <- 1000
> Tstar <- matrix(NA, nrow=B)
> for (b in (1:B)){
+   Tstar[b] <- t.test(Years ~ shuffle(Attr), data=MockJury2, var.equal=T)$statistic
+ }
> hist(Tstar, labels=T)
> abline(v=c(-1,1)*Tobs, lwd=2, col="red")
> pdata(abs(Tobs), abs(Tstar), lower.tail=F)
0.034
Figure 1-12: Permutation distribution of the t-statistic.
The parametric version of these results is based on using what is called the two-independent
sample t-test. There are actually two versions of this test, one that assumes that variances are
equal in the groups and one that does not. There is a rule of thumb that if the ratio of the larger
standard deviation over the smaller standard deviation is less than 2, the equal variance
procedure is ok. It ends up that this assumption is less important if the sample sizes in the groups
are approximately equal and more important if the groups contain different numbers of
observations. In comparing the two potential test statistics, the procedure that assumes equal
variances has a complicated denominator (see the formula above for t involving sp) but a simple
formula for degrees of freedom (df) for the t-distribution (df=n1+n2−2) that approximates the
distribution of the test statistic, t, under the null hypothesis. The procedure that assumes unequal
variances has a simpler test statistic and a very complicated degrees of freedom formula. The
equal variance procedure is most similar to the ANOVA methods we will consider later this
semester so that will be our focus here. Fortunately, both of these methods are readily available
in the t.test function in R if needed.
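Both versions come from the same function; a quick sketch of the two calls (the equal variance version requires the var.equal=T option, while leaving that option off gives R's default, the Welch unequal variance version):
> t.test(Years ~ Attr, data=MockJury2, var.equal=T)   # equal variance two-sample t-test
> t.test(Years ~ Attr, data=MockJury2)                # Welch (unequal variance) version, the default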
If the assumptions for the equal variance t-test are met and the null hypothesis is true, then the
sampling distribution of the test statistic should follow a t-distribution with n1+n2−2 degrees of
freedom. The t-distribution is a bell-shaped curve that is more spread out for smaller values of
degrees of freedom as shown in Figure 1-13. The t-distribution looks more and more like a
standard normal distribution (N(0,1)) as the degrees of freedom increase.
We can find the area to the left of the observed t-statistic of -2.1702 under a t-distribution with 73 degrees of freedom using the pt function:
> pt(-2.1702,df=73,lower.tail=T)
[1] 0.01662286
And we can double it to get the two-sided p-value that t.test provides (0.03324, as seen below), because the t-distribution is symmetric:
> 2*pt(-2.1702,df=73,lower.tail=T)
[1] 0.03324571
More generally, we could always make the test statistic positive using the absolute value, find the area to the right of it, and then double that for a two-sided test p-value:
> 2*pt(abs(-2.1702),df=73,lower.tail=F)
[1] 0.03324571
Permutation distributions do not need to match the named parametric distribution to work
correctly, although this happened in the previous example. The parametric approach, the t-test,
requires certain conditions to be met for the sampling distribution of the statistic to follow the named distribution and provide accurate p-values. The conditions for the equal variance t-test are:
1) Independent observations: the response for one subject should not be related to or influenced by the responses of the other subjects.
2) Equal variances in the groups (because we used a procedure that assumes equal variances! - there is another procedure that allows you to relax this assumption if needed...). To assess this, compare the standard deviations and see if they look noticeably different, especially if the sample sizes differ between groups.
3) Normal distributions of the observations in each group. We'll learn more diagnostics later,
but the boxplots and beanplots are a good place to start to help you look for skews or outliers,
which were both present here. If you find skew and/or outliers, that would suggest a problem
with this condition.
In contrast, the permutation approach only requires independent observations and similar distributions between the groups: it provides valid inferences as long as the two groups have similar shapes and only possibly differ in their centers. In other words, the distributions need not look normal for the permutation procedure to work well.
In the mock jury study, we can assume that the independent observation condition is met because
there is no information suggesting that the same subjects were measured more than once or that
some other type of grouping in the responses was present (like the subjects were divided in
groups and placed in the same room discussing their responses). The equal variance condition
might be violated although we do get some lee-way in this assumption and are still able to get
reasonable results. The standard deviations are 2.8 vs 4.4, so this difference is not "large" according to the rule of thumb. It is, however, close to being considered problematic. It would be difficult to reasonably assume that the normality condition, which is assumed in the derivation of the parametric procedure, is met here (Figure 1-6), with clear right skews in both groups and potential outliers. The shapes look similar for the two groups, so there is less reason to be concerned with using the permutation approach than with the parametric approach.
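Those group standard deviations (2.8 and 4.4) can be pulled out directly using the formula interface from the mosaic package (a small sketch using functions already loaded above):
> sd(Years ~ Attr, data=MockJury2)   # about 2.8 for Average and 4.4 for Unattractive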
18 On exams, you will be asked to describe the area of interest, sketch a picture of the area of interest and/or note the distribution you would use.
19 In some studies, the same subject might be measured in both conditions and this violates the assumptions of this procedure.
In every chapter, we will follow the first example used to explain the methods with a "worked"
example where we focus on the results provided. In a previous semester, some of the STAT 217
students (n=79) provided information on their gender, Age, and current GPA. We might be
interested in whether Males and Females had different average GPAs. First, we can take a look at
the difference in the responses by groups as displayed in Figure 1-15.
> s217 <- read.csv("http://dl.dropboxusercontent.com/u/77307195/s217.csv")
> require(mosaic)
> par(mfrow=c(1,2))
> boxplot(GPA~Sex,data=s217)
> require(beanplot)
> beanplot(GPA~Sex,data=s217,
log="",col="lightblue",method="jitter")
>
> mean(GPA~Sex,data=s217)
F M
3.338378 3.088571
> favstats(GPA~Sex,data=s217)
> compareMean(GPA~Sex,data=s217)
[1] -0.2498069
> t.test(GPA~Sex,data=s217,var.equal=T)
95 percent confidence interval:
 0.06501838 0.43459552
sample estimates:
> Tobs <- t.test(GPA~Sex, data=s217, var.equal=T)$statistic; Tobs
       t
2.691883
> B <- 1000
> Tstar <- matrix(NA, nrow=B)
> for (b in (1:B)){
+   Tstar[b] <- t.test(GPA~shuffle(Sex), data=s217, var.equal=T)$statistic
+ }
> hist(Tstar, labels=T)
> abline(v=c(-1,1)*Tobs, lwd=2, col="red")
> pdata(abs(Tobs), abs(Tstar), lower.tail=F)
0.011
Figure 1-16: Histogram and density curve of permutation distribution of test statistic for STAT
217 GPAs.
Here is a full write-up of the results using all 6+ hypothesis testing steps, using the permutation
results:
Isolate the claim to be proved and method to use (define a test statistic T)
We want to test for a difference in the means between males and females and will use the equal-
variance two-sample t-test statistic to compare them, making a decision at the 5% significance
level.
• H0: μMale = μFemale versus HA: μMale ≠ μFemale
  ◦ where μMale is the true mean GPA for males and μFemale is the true mean GPA for females
• Equal variance condition: There is a small difference in the range of the observations in the two groups but the standard deviations are very similar so there is no evidence that this condition is violated.
• This means that there is about a 1.2% chance we would observe a difference in mean GPA (female-male or male-female) of 0.25 points or more if there is in fact no difference in true mean GPA between females and males in STAT 217 in a particular semester.
5) Decision
•Since the p-value is "small" (a priori 5% significance level selected), we can reject the
null hypothesis.
•There is evidence against the null hypothesis of no difference in the true mean GPA
between males and females for the STAT 217 students in this semester and so we
conclude that there is evidence of a difference in the mean GPAs between males and
females.
•Because this was not a randomized experiment, we can't say that the difference in sex
causes the difference in mean GPA and because it was not a random sample from a larger
population, our inferences only pertain the STAT 217 students that responded to the
survey in that semester.
(Output of table(as.numeric(resample(MockJury2)$orig.ids)) for a first bootstrap sample, listing the original observation IDs that were selected and how many times each was re-sampled.)
A second bootstrap sample is also provided. It did not re-sample observations 1, 2, or 4
but does sample observation 5 three times. You can see other variations in the resulting
re-sampling of subjects.
> table(as.numeric(resample(MockJury2)$orig.ids))
(Output listing the original observation IDs in this second bootstrap sample and the number of times each was re-sampled; observations 1, 2, and 4 do not appear and observation 5 appears three times.)
Each run of the resample function provides a new version of the data set. Repeating
this B times using another for loop, we will track our quantity of interest, say T, in all
these new "data sets" and call those results T*. The distribution of the bootstrapped T*
statistics will tell us about the range of results to expect for the statistic and the middle __
% of the T*'s provides a bootstrap confidence interval for the true parameter - here
the difference in the two population means.
To make this concrete, we can revisit our previous examples, starting with
the MockJury2 data created before and our interest in comparing the mean sentences
for the Average and Unattractive picture groups. The bootstrapping code is very similar
to the permutation code except that we apply the resample function to the entire data
set as opposed to the shuffle function being applied to the explanatory variable.
> Tobs <- compareMean(Years ~ Attr, data=MockJury2); Tobs
[1] 1.837127
> B<- 1000
> Tstar<-matrix(NA,nrow=B)
> for (b in (1:B)){
+ Tstar[b]<-compareMean(Years ~ Attr,
data=resample(MockJury2))
+ }
> hist(Tstar,labels=T)
> plot(density(Tstar),main="Density curve of Tstar")
> favstats(Tstar)
       min       Q1   median       Q3      max     mean        sd    n missing
 -1.252137 1.262018 1.853615 2.407143 5.462006 1.839887 0.8426969 1000       0
In this situation, the observed difference in the mean sentences is 1.84 years
(Unattractive-Average), which is the vertical line in Figure 1-17. The bootstrap
distribution shows the results for the difference in the sample means when fake data sets
are re-constructed by sampling from the data set with replacement. The bootstrap
distribution is approximately centered at the observed value and relatively symmetric.
Figure 1-18: Histogram and density curve of bootstrap distribution with 95% bootstrap
confidence intervals displayed (vertical lines).
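The vertical lines in Figure 1-18 come from the middle 95% of the bootstrap distribution; a minimal sketch of how those endpoints can be extracted with the qdata function (assuming the Tstar results generated above):
> quantiles <- qdata(c(.025,.975), Tstar)   # 2.5th and 97.5th percentiles of the bootstrap distribution
> quantiles                                 # endpoints of the 95% bootstrap confidence interval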
Although confidence intervals can exist without referencing hypotheses, we can revisit
our previous hypotheses and see what this confidence interval tells us about the test of
H0: μUnattr = μAve. This null hypothesis is equivalent to testing H0: μUnattr - μAve=0, that the
difference in the true means is equal to 0 years. And the difference in the means was the
scale for our confidence interval, which did not contain 0 years. We will call 0 an
interesting reference value for the confidence interval, because here it is the value where the true means are equal to each other (have a difference of 0 years). In general, if our confidence
interval does not contain 0, then it is saying that 0 is not one of our likely values for the
difference in the true means. This implies that we should reject a claim that they are
equal. This provides the same inferences for the hypotheses that we considered
previously using both a parametric and permutation approach. The general summary is
that we can use confidence intervals to test hypotheses by assessing whether the reference
value under the null hypothesis is in the confidence interval (FTR H0) or outside the
confidence interval (Reject H0).
As in the previous situation, we also want to consider the parametric approach for
comparison purposes and to have that method available for the rest of the semester. The
parametric confidence interval is called the equal variance, two-sample t-based
confidence interval and assumes that the populations being sampled from are normally
distributed and leads to using a t-distribution to form the interval. The output from
the t.test function provides the parametric 95% confidence interval calculated for
you:
> t.test(Years ~ Attr, data=MockJury2,var.equal=T)
Two Sample t-test
data: Years by Attr
t = -2.1702, df = 73, p-value = 0.03324
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
-3.5242237 -0.1500295
sample estimates:
The t*df is a multiplier that comes from finding the percentile from the t-distribution that
puts C% in the middle of the distribution with C being the confidence level. It is
important to note that this t* has nothing to do with the previous test statistic t. It is
confusing and many of you will, at some point, happily take the result from a test statistic
calculation and use it for a multiplier in a t-based confidence interval. Figure 1-19 shows
the t-distribution with 73 degrees of freedom and the cut-offs that put 95% of the area in
the middle.
Figure 1-19: Plot of t(73) with cut-offs for putting 95% of the distribution in the middle.
For 95% confidence intervals, the multiplier is going to be close to 2 - anything else is a
sign of a mistake. We can use R to get the multipliers for us using the qt function in a
similar fashion to how we used qdata in the bootstrap results, except that this new value
must be used in the previous formula. This function produces values for requested
percentiles. So if we want to put 95% in the middle, we place 2.5% in each tail of the
distribution and need to request the 97.5th percentile. Because the t-distribution is always
symmetric around 0, we merely need to look up the value for the 97.5th percentile. The t*
multiplier to form the confidence interval is 1.993 for a 95% confidence interval when
the df=73 based on the results from qt:
> qt(.975,df=73)
[1] 1.992997
Note that the 2.5th percentile is just the negative of this value due to symmetry and the
real source of the minus in the plus/minus in the formula for the confidence interval.
> qt(.025,df=73)
[1] -1.992997
We can also re-write the general confidence interval formula more simply as

x̄1 − x̄2 ± ME,

where the margin of error is ME = t*df · SE(x̄1 − x̄2) and the standard error of the difference in the sample means is SE(x̄1 − x̄2) = sp √(1/n1 + 1/n2).
In some situations, researchers will report the standard error (SE) or margin of
error (ME) as a method of quantifying the uncertainty in a statistic. The SE is an estimate
of the standard deviation of the statistic (here x̄1 − x̄2) and the ME is an estimate of the
precision of a statistic that can be used to directly form a confidence interval. The ME
depends on the choice of confidence level although 95% is almost always selected.
To finish this example, we can use R to help us do calculations much like a calculator
except with much more power "under the hood". You have to make sure you are careful
with using ( ) to group items and remember that the asterisk (*) is used for
multiplication. To do this, we need the pertinent information which is available from the
bolded parts of the favstats output repeated below.
> favstats(Years~Attr,data=MockJury2)
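The calculation itself is not reproduced here, but a minimal sketch of what it would look like, plugging in approximate group summaries from favstats (Average: n=38, mean about 3.974, sd about 2.824; Unattractive: n=37, mean about 5.811, sd about 4.364), is:
> sp <- sqrt(((38-1)*2.824^2 + (37-1)*4.364^2)/(38+37-2))        # pooled standard deviation
> 5.811 - 3.974 + c(-1,1)*qt(.975, df=73)*sp*sqrt(1/38 + 1/37)   # roughly 0.15 to 3.52; flipping the signs matches the t.test interval above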
We can now repeat the methods on the STAT 217 grade data. This time we can start with the
parametric 95% confidence interval "by hand" and then using t.test. The favstats output
provides us with the required information to do this ourselves:
> favstats(GPA~Sex,data=s217)
> sp=sqrt(((37-1)*(0.4075^2)+(42-1)*(0.41518^2))/(37+42-2))
> sp
[1] 0.4116072
> qt(.975,df=77)*sp*sqrt(1/37+1/42)
[1] 0.1847982
> 3.338-3.0886+c(-1,1)*qt(.975,df=77)*sp*sqrt(1/37+1/42)
So we are 95% confident that the difference in the true mean GPAs between females and males (females minus males) is between 0.065 and 0.434 GPA points. We get a similar20 result from the
bolded part of the t.test output:
> t.test(GPA~Sex,data=s217,var.equal=T)
95 percent confidence interval:
 0.06501838 0.43459552
sample estimates:
The t* multipliers change with the confidence level; for 90% and 99% intervals with df=77 they are:
> qt(.95,df=77)
[1] 1.664885
> qt(.995,df=77)
[1] 2.641198
> t.test(GPA~Sex,data=s217,var.equal=T,conf.level=.90)
90 percent confidence interval:
 0.09530553 0.40430837
> t.test(GPA~Sex,data=s217,var.equal=T,conf.level=.99)
99 percent confidence interval:
 0.004703598 0.494910301
As a review of some basic ideas with confidence intervals, make sure you can answer the following questions:
2) What happens to the width of the confidence interval if the size of the SE increases or decreases?
3) What about increasing the sample size - should that increase or decrease the width of the interval?
All of the general results you learned before about impacts to widths of CIs hold in this situation
whether we are considering the parametric or bootstrap methods.
To finish this example, we will generate the comparable bootstrap 90% confidence interval using
the bootstrap distribution in Figure 1-20.
> Tobs <- compareMean(GPA~Sex, data=s217); Tobs
[1] -0.2498069
> par(mfrow=c(1,2))
> B <- 1000
> Tstar <- matrix(NA, nrow=B)
> for (b in (1:B)){
+   Tstar[b] <- compareMean(GPA~Sex, data=resample(s217))
+ }
> qdata(.05,Tstar)
p quantile
0.0500000 -0.3974425
> qdata(.95,Tstar)
p quantile
0.9500000 -0.1147324
> quantiles<-qdata(c(.05,.95),Tstar)
> quantiles
quantile p
5% -0.3974425 0.05
95% -0.1147324 0.95
The output tells us that the 90% confidence interval is from -0.397 to -0.115 GPA points. The
bootstrap distribution with the observed difference in the sample means and these cut-offs is
displayed in Figure 1-20 using this code:
> hist(Tstar,labels=T)
> abline(v=Tobs,col="red",lwd=2)
> abline(v=quantiles$quantile,col="blue",lwd=3,lty=2)
> plot(density(Tstar),main="Density curve of Tstar")
> abline(v=Tobs,col="red",lwd=2)
> abline(v=quantiles$quantile,col="blue",lwd=3,lty=2)
In the previous output, the parametric 90% confidence interval is from 0.095 to 0.404, suggesting
similar results again from the two approaches once you account for the two different orders of
differencing. There was a slight left skew in the bootstrap distribution with one much smaller
difference observed which generated some of the observed difference in the results. Based on the
bootstrap CI, we can say that we are 90% confident that the difference in the true mean GPAs for STAT 217 students is between -0.397 and -0.115 GPA points (males minus females). Because sex
cannot be assigned to the subjects, we cannot infer that sex is causing this difference and because
this was a voluntary response sample of STAT 217 students in a given semester, we cannot infer
that a difference of this size would apply to all STAT 217 students or even students in another
semester.
Figure 1-20: Histogram and density curve of bootstrap distribution of difference in sample mean
GPAs (male minus female) with observed difference (solid vertical line) and quantiles that
delineate the 90% confidence intervals (dashed vertical lines).
Throughout the semester, pay attention to the distinctions between parameters and statistics,
focusing on the differences between estimates based on the sample and inferences for the
population of interest in the form of the parameters of interest. Remember that statistics are
summaries of the sample information and parameters are characteristics of populations (which
we rarely know). And that our inferences are limited to the population that we randomly sampled
from, if we randomly sampled.
20 We rounded the means a little and that caused the small difference in results.
1.8b - Chapter summary
by Mark Greenwood and Katharine Banner
In this chapter, we reviewed basic statistical inference methods in the context of a two-sample
mean problem. You were introduced to using R to do permutation testing and generate bootstrap
confidence intervals as well as obtaining parametric t-test and confidence intervals in this same
situation. You should have learned how to use a for loop for doing the nonparametric
inferences and the t.test function for generating parametric inferences. In the two examples
considered, the parametric and nonparametric methods provided similar results, suggesting that
the assumptions were at least close to being met for the parametric procedures. When parametric
and nonparametric approaches disagree, the nonparametric methods are likely to be more
trustworthy since they have less restrictive assumptions but can still have problems. When the
noted conditions are not met in a hypothesis testing situation, the Type I error rates can be inflated, meaning that we reject the null hypothesis more often than the chosen significance level allows. Specifically, we could have a situation where our assumed 5% significance level test might actually reject the null when it is true 20% of the time. If this is occurring, we call a
procedure liberal (it rejects too easily) and if the procedure is liberal, how could we trust a small
p-value to be a "real" result and not just an artifact of violating the assumptions of the procedure?
Likewise, for confidence intervals we hope that our 95% confidence level procedure, when
repeated, will contain the true parameter 95% of the time. If our assumptions are violated, we
might actually have an 80% confidence level procedure and it makes it hard to trust the reported
results for our observed data set. Statistical inference relies on a belief in the methods underlying
our inferences. If we don't trust our assumptions, we shouldn't trust the conclusions to perform
the way we want them to. As sample sizes increase and violations of conditions lessen, then the
procedures will perform better. In Chapter 2, we'll learn some new tools for doing diagnostics to
help us assess how much those conditions are violated.
The main components of R code used in this chapter follow with components to modify in red,
remembering that any R packages mentioned need to be installed and loaded for this code to
have a chance of working:
• summary(DATASETNAME)
◦ Provides numerical summaries of all variables in the data set.
• t.test(Y~X,data=DATASETNAME,conf.level=0.95)
◦ Provides two-sample t-test test statistic, df, p-value, and 95% confidence interval.
• 2*pt(abs(Tobs),df=DF,lower.tail=F)
◦ Finds the two-sided test p-value for an observed 2-sample t-test statistic of Tobs.
• hist(DATASETNAME$Y)
◦ Makes a histogram of a variable named Y from the data set of interest.
• boxplot(Y~X,data=DATASETNAME)
◦ Makes a boxplot of a variable named Y for groups in X from the data set.
• beanplot(Y~X,data=DATASETNAME)
◦ Makes a beanplot of a variable named Y for groups in X from the data set.
• mean(Y~X,data=DATASETNAME); sd(Y~X,data=DATASETNAME)
◦ Provides the mean and sd of responses of Y for each group described in X.
• favstats(Y~X,data=DATASETNAME)
◦ Provides numerical summaries of Y by groups described in X.
• Tstar<-matrix(NA,nrow=B)
  for (b in (1:B)){
    Tstar[b]<-t.test(Y~shuffle(X),data=DATASETNAME,var.equal=T)$statistic
  }
  ◦ Code to run a for loop to generate 1000 permuted versions of the test statistic using the shuffle function and keep track of the results in Tstar.
• pdata(abs(Tobs),abs(Tstar),lower.tail=F)
  ◦ Finds the proportion of the permuted test statistics in Tstar that are less than -|Tobs| or greater than |Tobs|.
• Tstar<-matrix(NA,nrow=B)
  for (b in (1:B)){
    Tstar[b]<-compareMean(Y~X,data=resample(DATASETNAME))
  }
  ◦ Code to run a for loop to generate 1000 bootstrapped versions of the data set using the resample function and keep track of the results of the statistic in Tstar.
• qdata(c(0.025,0.975),Tstar)
◦ Provides the values that delineate the middle 95% of the results in the bootstrap
distribution (Tstar).
Load the HELPrct data set from the mosaicData package. The HELP study was a clinical
trial for adult inpatients recruited from a detoxification unit. Patients with no primary care
physician were randomly assigned to receive a multidisciplinary assessment and a brief
motivational intervention or usual care and various outcomes were observed. Two of the
variables in the dataset are sex, a factor with levels (male and female) and daysanysub, time
(in days) to first use of any substance post-detox. We are interested in the difference in mean
number of days to first use of any substance post-detox between males and females. There are
some missing responses and the following code will produce favstats with the missing values included and then provide a data set with only the complete observations by applying the na.omit function, which removes any observations with missing values.
data(HELPrct)
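The rest of that code is not reproduced here; a minimal sketch matching the description (the subsetting to the two variables of interest and the HELPrct3 name are assumptions, with the name taken from the questions below) could be:
> favstats(daysanysub ~ sex, data=HELPrct)                  # summaries, including counts of missing responses
> HELPrct3 <- na.omit(HELPrct[, c("daysanysub", "sex")])    # keep only complete observations on these two variables
> favstats(daysanysub ~ sex, data=HELPrct3)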
1.1. Based on the results provided, how many observations were missing for males and females? Missing values here likely mean that the subjects didn't use any substances post-detox during the time of the study. This is called censoring. What is the problem with the numerical summaries if the missing responses were all something larger than the largest observation?
1.2. Make a beanplot and a boxplot of daysanysub ~ sex using the HELPrct3 data
set created above. Compare the distributions, recommending parametric or nonparametric
inferences.
1.3. Generate the permutation results and write out the 6+ steps of the hypothesis test,
making sure to note the numerical value of observed test statistic you are using. Include
scope of inference.
1.5. Generate the parametric t.test results, reporting the test-statistic, its distribution under
the null hypothesis, and compare the p-value to those observed using the permutation
approach.
1.6. Make and interpret a 95% bootstrap confidence interval for the difference in the
means.
In Chapter 1, tools for comparing the means of two groups were considered. More generally,
these methods are used for a quantitative response and a categorical explanatory variable (group)
which had two and only two levels. The MockJury data set actually contained three groups
(Figure 2-1) with Beautiful, Average, and Unattractive rated pictures randomly assigned to the
subjects for sentence ratings. In a situation with more than two groups, we have two choices.
First, we could rely on our two group comparisons, performing tests for every possible pair
(Beautiful vs Average, Beautiful vs Unattractive, and Average vs Unattractive). We spent
Chapter 1 doing inferences for differences between Average and Unattractive. The other two
comparisons would lead us to initially end up with three p-values and no direct answer about our
initial question of interest - is there some overall difference in the average sentences provided
across the groups? In this chapter, we will learn a new method, called Analysis of Variance,
ANOVA, that directly assesses whether there is evidence of some overall difference in the means
among the groups. This version of an ANOVA is called a One-Way ANOVA since there is just
one21 grouping variable. After we perform our One-Way ANOVA test for overall evidence of a
difference, we will revisit the comparisons similar to those considered in Chapter 1 to get more
details on specific differences among the pairs of groups - what we call pair-wise comparisons.
An issue is created when you perform many tests simultaneously; we will augment our previous methods with an adjusted method for pairwise comparisons, called Tukey's Honest Significant Difference, to keep our results valid.
To make this more concrete, we return to the original MockJury data, making side-by-side boxplots and beanplots (Figure 2-1) as well as summarizing the sentences for the three groups using favstats.
> require(heplots)
> require(mosaic)
> data(MockJury)
> par(mfrow=c(1,2))
> boxplot(Years~Attr,data=MockJury)
> beanplot(Years~Attr,data=MockJury,log="",col="bisque",method="jitter")
> favstats(Years~Attr,data=MockJury)
  .group        min Q1 median   Q3 max     mean       sd  n missing
1 Beautiful       1  2      3  6.5  15 4.333333 3.405362 39       0
2 Average         1  2      3  5.0  12 3.973684 2.823519 38       0
3 Unattractive    1  2      5 10.0  15 5.810811 4.364235 37       0
There are slight differences in the sample sizes in the three groups with 37 Unattractive, 38 Average, and 39 Beautiful group responses, providing a data set with a total sample size of N=114. The Beautiful and Average groups do not appear to be very different with means of 4.33
and 3.97 years. In Chapter 1, we found moderate evidence regarding the difference
in Average and Unattractive. It is less clear whether we might find evidence of a difference
between Beautiful and Unattractive groups since we are comparing means of 5.81 and 4.33
years. All the distributions appear to be right skewed with relatively similar shapes. The
variability in Average and Unattractive groups seems like it could be slightly different leading to
an overall concern of whether the variability is the same in all the groups.
Figure 2-1: Boxplot and beanplot of the sentences (years) for the three treatment groups.
We introduced the statistical model yij = μj + εij in Chapter 1 for the situation with j = 1 or 2 to
denote a situation where there were two groups and, for the alternative model, the means
differed. Now we have three groups and the previous model can be extended to this new
situation by allowing j to be 1, 2, or 3. Now that we have more than two groups, we need to
admit that what we were doing in Chapter 1 was actually fitting what is called a linear model.
The linear model assumes that the responses follow a normal distribution with the linear model
defining the mean, all observations have the same variance, and the parameters for the mean in
the model enter linearly. This last condition is hard to explain at this level of material - it is sufficient to know that there are models where the parameters enter the model nonlinearly and that they are beyond the scope of this course. The result of this constraint is that we will be able to
use the same general modeling framework for the rest of the course.
As in Chapter 1, we have a null hypothesis that defines a situation (and model) where all the
groups have the same mean. Specifically, the null hypothesis in the general situation
with J groups (J≥2) is to have all the true group means equal,

H0: μ1 = μ2 =... = μJ.

The alternative hypothesis is that not all of the μj are equal or, in words, at least one of the true means differs among the J groups. You will be attracted to trying to say that all means are different in the alternative but we do not put this strict a requirement in place to reject the null hypothesis. The alternative model allows all the true group means to differ but does not require that they all differ, and it can be written as

yij = μj + εij.

This linear model states that the response for the ith observation in the jth group, yij, is modeled
with a group j (j=1,...,J) population mean, μj, and a random error for each subject in each group,
εij, that we assume follows a normal distribution and that all the random errors have the same
variance, σ2. We can write the assumption about the random errors, often called the normality
assumption, as εij~N(0,σ2). There is a second way to write out this model that will allow
extensions to more complex models discussed below, so we need a name for this version of the
model. The model written in terms of the μj's is called the cell means model and is the easier version of this model to understand.
One of the reasons we learned about beanplots is that it helps us visually consider all the aspects
of this model. In the right panel of Figure 2-1, we can see the wider, bold horizontal lines that
provide the estimated group means. The bigger the differences, the more likely we are to find
evidence against the null hypothesis. You can also see the null model on the plot, which assumes all the groups have the same mean, as displayed in the dashed horizontal line at 4.7 years (the R code below shows the overall mean of Years is 4.7). While the hypotheses focus on the means, the model also contains assumptions about the distribution of the responses - specifically that the distributions are normal and that all the groups have the same variability. As discussed
previously, it appears that the distributions are right skewed and the variability might not be the
same for all the groups. The boxplot provides the information about the skew and variability but
since it doesn't display the means it is not directly related to the linear model and hypotheses we
are considering.
> mean(MockJury$Years)
[1] 4.692982
There is a second way to write out the One-Way ANOVA model that will allow extensions to
more complex models in Chapter 3. The other parameterization (way of writing out or defining)
of the model is called the reference-coded model since it writes out the model in terms of a
baseline group and deviations from that baseline or reference level. The reference-coded model
for the ith subject in the jth group is yij = α + τj + εij where α (alpha) is the true mean for the
baseline group (first alphabetically) and the τj (tau j) are the deviations from the baseline group
for group j. The deviation for the baseline group, τ1, is always set to 0 so there are really just
deviations for groups 2 through J. The equivalence between the two models can be seen by considering the mean for the first, second, and Jth groups in both models:

Group   Cell means model   Reference-coded model
1       μ1                 α
2       μ2                 α + τ2
J       μJ                 α + τJ

In the reference-coded version, the null hypothesis of no difference in the true group means becomes
H0: τ2 =... = τJ = 0.
You are welcome to use either version unless we instruct you to use a particular version in this
chapter but we have to use the reference-coding in subsequent chapters. The next task is to learn
how to use R's linear model (lm) function to get estimates of the parameters in each model, but
first a review of these new ideas:
Cell-means version:
• H0: μ1 =... = μJ HA: Not all μj equal
• Null hypothesis in words: No difference in the true means between the groups.
• Alternative hypothesis in words: At least one of the true means differs between the groups.
Reference-coded version:
• H0: τ2 =... = τJ = 0 HA: Not all τj equal 0
• Null hypothesis in words: No deviation of the true mean for any groups from the baseline
group.
• Alternative hypothesis in words: At least one of the true deviations is different from 0 or that at
least one group has a different true mean than the baseline group.
In order to estimate the models discussed above, the lm function will be used. If you look closely in the code for the rest of the semester, any model for a quantitative response will use this function, suggesting a common thread in the most commonly used statistical models. The lm function continues to use the same format as previous functions, lm(Y~X,data=datasetname). It ends up that this code will give you the reference-coded version of the model by default. We want to start with the cell-means version of the model, so we have to add a "-1" to the formula interface to tell R that we want the cell-means coding. Generally, this looks like lm(Y~X-1,data=datasetname) and you will
find a row of output for each group. It will contain columns for an estimate (Estimate),
standard error (Std. Error), t-value (t value), and p-value (Pr(>|t|)). We'll learn to
use all of the output in the following material, but for now we will just focus on the estimates of
the parameters that the function provides that we put in bold.
> lm1 <- lm(Years~Attr-1, data=MockJury)
> summary(lm1)
Coefficients:
> mean(Years~Attr,data=MockJuryR)
> lm2 <- lm(Years~Attr, data=MockJury)
> summary(lm2)
Coefficients:
Remember that this is the standard version of the linear model so it will be something that gets
used repeatedly this semester. The estimated model coefficients are α̂ = 4.333 years, τ̂2 =-0.3596
years, and τ̂3 =1.4775 years where group 1 is Beautiful, 2 is Average, and 3 is Unattractive. The
way you can figure out the baseline group (group 1 is Beautiful here) is to see which category
label is not present in the output. The baseline level is typically the first group label
alphabetically, but you should always check this. Based on these definitions, there are
interpretations available for each coefficient. For α̂ = 4.333 years, this is an estimate of the mean
sentencing time for the Beautiful group. τ̂2 =-0.3596 years is the deviation of the Average group's mean from the Beautiful group's mean (specifically, it is 0.36 years lower). Finally, τ̂3 =1.4775
years tells us that the Unattractive group mean sentencing time is 1.48 years higher than
the Beautiful group mean sentencing time. These interpretations lead directly to reconstructing
the estimated means for each group by combining the baseline and pertinent deviations as shown
in Table 2-1.
Table 2-1: Constructing group mean estimates from the reference-coded linear model estimates.
Group          Formula    Estimates
Beautiful      α̂          4.3333 years
Average        α̂ + τ̂2     4.3333 - 0.3596 = 3.974 years
Unattractive   α̂ + τ̂3     4.3333 + 1.4775 = 5.811 years
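This reconstruction can also be done directly in R from the fitted model; a quick sketch using the lm2 object from above:
> coef(lm2)                             # alpha-hat, tau2-hat, and tau3-hat
> coef(lm2)[1] + c(0, coef(lm2)[2:3])   # group means: Beautiful, Average, Unattractive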
We can also visualize the results of our linear models using what are called term or effect
plots (from the effects package; Fox, 2003) as displayed in Figure 2-2 (we don't want to use
"effect" unless we have random assignment in the study design so we will mainly call these term
plots). These plots take an estimated model and show you its estimates along with 95%
confidence intervals generated by the linear model, which will be especially useful for some of
the more complicated models encountered later in the semester. To make this plot, you need to install and load the effects package and then use the plot and allEffects functions together on the lm object called lm2 generated above. You can find the correspondence between
the displayed means and the estimates that were constructed in Table 2-1.
> require(effects)
> plot(allEffects(lm2))
Figure 2-2: Plot of the estimated group mean sentences from the reference-coded model for the
MockJury data.
In order to assess evidence for having different means for the groups, we will compare either of
the previous models (cell-means or reference-coded) to a null model based on the null hypothesis
(H0: μ1 =... = μJ) which implies a model of yij = μ + εij in the cell-means version where μ is a
common mean for all the observations. We will call this the mean-only model since it is boring
and only has a single mean in it. In the reference-coding version of the model, we have a null
hypothesis that H0: τ2 =... = τJ = 0, so the "mean-only" model is yij = α + εij with α having the
same definition as μ for the cell means model - it forces a common estimate for every group.
The mean-only model is also an example of a reduced model where we set some coefficients in
the model to 0 and get a simpler model. Simple can be good as it is easy to interpret, but having a
model for J groups that suggests no difference in the groups is not a very exciting result in most,
but not all, situations. In order for R to provide results for the mean-only model, we remove the
grouping variable, Attr, from the model formula and just include a "1". The (Intercept)
row of the output provides the estimate for either model when we assume that the mean is the
same for all groups:
> lm3 <- lm(Years~1, data=MockJury)
> summary(lm3)
Coefficients:
This model provides an estimate of the common mean for all observations of 4.693 = μ̂=α̂ years.
This value also is the dashed, horizontal line in the beanplot in Figure 2-1.
The previous discussion showed two ways of estimating the model but still hasn't addressed how
to assess evidence related to whether the observed differences in the means among the groups is
"real". In this section, we develop what is called the ANOVA F-test that provides a method of
aggregating the differences among the means of 2 or more groups and testing our null hypothesis
of no difference in the means vs the alternative. In order to develop the test, some additional
notation needs to be defined. The sample size in each group is denoted nj and the total sample
size is N=Σnj = n1+n2+...+nJ where Σ (capital sigma) means "add up over whatever follows". An
estimated residual (eij) is the difference between an observation, yij, and the model estimate, ŷij = μ̂j, for that observation: eij = yij − ŷij. It is basically what is left over that the mean part of the model (μ̂j) does not explain and is our window into how "good" the model might be.
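In R, the model estimates and the estimated residuals are available directly from a fitted lm object; a small sketch using the lm2 model fit above:
> head(fitted(lm2))      # model estimates (the group means) for the first few observations
> head(residuals(lm2))   # the corresponding estimated residuals, eij = yij - muhat_j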
Figure 2-3: Demonstration of different amount of difference in means relative to variability.
Consider the four different fake results for a situation with four groups in Figure 2-3. In Situation
1, it looks like there is little evidence for a difference in the means and in Situation 2, it looks
fairly clear that there is a difference in the group means. Why? It is because the variation in the
means looks "clear" relative to the variation around the means. Consider alternate versions of
each result in Situations 3 and 4 and how much evidence there appears to be for the same sizes of differences in the means. In the plots, there are two sources of variability in the responses - how
much the group means vary across the groups and how much variability there is around the
means in each group. So we need a test statistic to help us make some sort of comparison of the
groups and to account for the amount of variability present around the means. The statistic is
called the ANOVA F-statistic. It is developed using sums of squares, which are measures of total variation like the one used in the numerator of the standard deviation: take all the observations, subtract the mean, square the differences, and then add up the results over all the observations to generate a measure of total variability. With multiple groups, we will
focus on decomposing that total variability (Total Sums of Squares) into variability among the
means (we'll call this Explanatory Variable A's Sums of Squares) and variability in the
residuals or errors (Error Sums of Squares). We define each of these quantities in the One-Way
ANOVA situation as follows:
• Total Sums of Squares: SSTotal = Σ(yij − ȳ)², the total variation of all N responses around the grand mean, ȳ.
  ◦ Note: this is the residual variation if the null model is used, so there is no further decomposition possible for that model.
  ◦ This is also equivalent to the numerator of the sample variance, which is what you get when you ignore the information on the potential differences in the groups.
• Explanatory Variable A's Sums of Squares: SSA = Σ nj(ȳj − ȳ)², where ȳj is the sample mean for group j.
  ◦ Variation in the group means around the grand mean based on explanatory variable A.
• Error Sums of Squares: SSE = Σ(yij − ȳj)², the variation in the responses around their own group means.
With these definitions, SSTotal = SSA + SSE, which we can verify with the sums of squares from the ANOVA table for lm2:
> anova(lm2)
Response: Years
> 70.94+1421.32
[1] 1492.26
One way to think about SSA is that it is a function that converts the variation in the group means
into a single value. This makes it a reasonable test statistic in a permutation testing context. By
comparing the observed SSA=70.9 to the permutation results of 6.7, 6.6, and 11 we see that the
observed result is much more extreme than the three alternate versions. In contrast to our
previous test statistics where positive and negative differences were possible, SSA is always
positive with a value of 0 corresponding to no variation in the means. The larger the SSA, the
more variation there was in the means. The permutation p-value for the alternative hypothesis
of some (not of greater or less than!) difference in the true means of the groups will involve
counting the number of permuted SSA* results that are larger than what we observed.
Figure 2-4: Plot of means and 95% confidence intervals for the three groups for the real data (a) and three different permutations of the treatment labels to the same responses in (b), (c), and (d).
To do a permutation test, we need to be able to calculate and extract the SSA value. In the
ANOVA table, it is in the first row and is the second number and we can use the [,] referencing
to extract that number from the ANOVA table that anova produces (anova(lm(Years~Attr,data=MockJury))[1,2]). We'll store the observed value of SSA in Tobs:
> Tobs <- anova(lm(Years~Attr,data=MockJury))[1,2]; Tobs
[1] 70.93836
The following code performs the permutations using the shuffle function and then makes a
plot of the resulting permutation distribution:
> B<-1000
> Tstar<-matrix(NA,nrow=B)
> for (b in (1:B)){
+   Tstar[b]<-anova(lm(Years~shuffle(Attr),data=MockJury))[1,2]
+ }
> hist(Tstar,labels=T)
> abline(v=Tobs,col="red",lwd=3)
> abline(v=Tobs,col="red",lwd=3)
Figure 2-5: Permutation distributions of SSA with the observed value of SSA (bold, vertical line).
The right-skewed distribution (Figure 2-5) contains the distribution of SSA*'s under permutations
(where all the groups are assumed to be equivalent under the null hypothesis). While the
observed result is larger than many SSA*'s, there are also many results that are much larger than
observed that showed up when doing permutations. The proportion of permuted results that
exceed the observed value is found using pdata as before, except only for the area to the right
of the observed result. We know that Tobs will always be positive so no absolute values are
required now.
> pdata(Tobs,Tstar,lower.tail=F)
[1] 0.071
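For the record, pdata is simply counting permutation results here; essentially the same proportion can be
computed directly (up to how a permuted result exactly equal to the observed value is handled), as in
this sketch:
> mean(Tstar >= Tobs)   # proportion of permuted SSA* results at least as large as the observed SSA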
This provides a permutation-based p-value of 0.071 and suggests marginal evidence against the
null hypothesis of no difference in the true means. We would interpret this as saying that there is
a 7.1% chance of getting a SSA as large or larger than we observed, given that the null hypothesis
is true.
It ends up that some nice parametric statistical results are available (if our assumptions are met)
for the ratio of estimated variances, which are called Mean Squares. To turn sums of squares
into mean square (variance) estimates, we divide the sums of squares by the amount of free
information available. For example, remember the typical variance estimator from introductory
statistics, s² = Σ(yi - ȳ)²/(N-1), where we "lose" one piece of information to estimate
the mean and there are N deviations around the single mean so we divide by N-1. Now
consider SSE = Σ(yij - ȳj)², which still has N deviations but it varies
around the J group means, so the Mean Square Error = MSE = SSE/(N-J). Basically, we lose J pieces of
information in this calculation because we have to estimate J means. The sums of squares for
explanatory variable A, SSA = Σ nj(ȳj - ȳ)², is harder to see in the formula, but
the same reasoning can be used to understand the denominator for forming the Mean Square for
variable A or MSA: there are J means that vary around the grand mean so MSA = SSA/(J-1). In
summary, the two mean squares are simply:
■ MSA = SSA/(J-1), which estimates the variance of the group means around the grand
mean.
■ MSError = SSError/(N-J), which estimates the variation of the errors around the group
means.
These results are put together using a ratio to define the ANOVA F-statistic (also called the F-
ratio) as
F=MSA/MSError.
This statistic is close to 1 if the variability in the means is "similar" to the variability in the
residuals and would lead to no evidence being found of a difference in the means. If the MSA is
much larger than the MSE, the F-statistic will provide evidence against the null hypothesis. The
"size" of the F-statistic is formalized by finding the p-value. The F-statistic, if assumptions
discussed below are met and we assume the null hypothesis is true, follows an F-distribution.
The F-distribution is a right-skewed distribution whose shape is defined by what are called
the numerator degrees of freedom (J-1) and the denominator degrees of freedom (N-J). These
names correspond to the values that we used to calculate the mean squares and where in the F-
ratio each mean square was used; F-distributions are denoted by their degrees of freedom using
the convention of F(numerator df, denominator df). Some examples of different F-distributions
are displayed for you in Figure 2-6.
Figure 2-6: Density curves of four different F-distributions.
The characteristics of the F-distribution can be summarized as:
⚪ Right skewed,
⚪ Takes on values of 0 or greater only, and
⚪ Its shape is determined by the numerator (J-1) and denominator (N-J) degrees of freedom.
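To see how these pieces fit together numerically, the mean squares, F-statistic, and parametric p-value
can be reconstructed by hand from the sums of squares reported above for the MockJury model
(SSA = 70.94, SSE = 1421.32, with J = 3 groups and N = 114 observations); this sketch is just arithmetic
on those stored results:
> SSA <- 70.94; SSE <- 1421.32; J <- 3; N <- 114
> MSA <- SSA/(J-1); MSE <- SSE/(N-J)
> c(MSA, MSE, MSA/MSE)                              # the F-statistic should be close to 2.77
> pf(MSA/MSE, df1=J-1, df2=N-J, lower.tail=FALSE)   # right-tail area from the F(2,111) distribution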
Now that we know about all of its components, we are ready to look at a full ANOVA table. The
general format of the ANOVA table is22:
Table 2-2: General One-Way ANOVA table.
Source       DF     Sums of Squares   Mean Squares      F-ratio       P-value
Variable A   J-1    SSA               MSA = SSA/(J-1)   F = MSA/MSE   Right tail of F(J-1, N-J)
Residuals    N-J    SSE               MSE = SSE/(N-J)
Total        N-1    SSTotal
The table is oriented to help you reconstruct the F-ratio from each of its components. The output
from R is similar although it does not provide the last row. The R version of the table for the type
of picture effect (Attr) with J=3 levels and N=114 observations, repeated from above, is:
> anova(lm2)
Response: Years
           Df  Sum Sq Mean Sq F value  Pr(>F)
Attr        2   70.94  35.469  2.7702 0.06699
Residuals 111 1421.32  12.805
> pf(2.77,df1=2,df2=111,lower.tail=F)
[1] 0.06699803
The result from the F-distribution using this parametric procedure is similar to the p-value
obtained using permutations with the SSA test statistic, which was 0.071. The F-statistic is
obviously another potential test statistic to use in a permutation approach. We
should check that we get similar results from it with permutations as we did from using SSA as a
test statistic. The following code generates the permutation distribution for the F-statistic (Figure
2-7) and assesses how unusual the observed F-statistic of 2.77 was in this permutation
distribution. The only change in the code involves moving from extracting SSA to extracting
the F-ratio which is in the 4th column of the anova output:
> Tobs <- anova(lm(Years~Attr,data=MockJury))[1,4]; Tobs
[1] 2.770024
> B<-1000
> Tstar<-matrix(NA,nrow=B)
> for (b in 1:B){
+ Tstar[b]<-anova(lm(Years~shuffle(Attr),data=MockJury))[1,4]
+ }
> hist(Tstar,labels=T)
> abline(v=Tobs,col="red",lwd=3)
> pdata(Tobs,Tstar,lower.tail=F)
[1] 0.064
Figure 2-7: Permutation distribution of the F-statistic with bold, vertical line for observed value
of test statistic of 2.77.
The permutation-based p-value is 0.064 which, again, matches the other results closely. The first
conclusion is that using either the F-statistic or the SSA as the test statistic provides similar
permutation results. However, we tend to favor using the F-statistic because it is more commonly
used in reporting ANOVA results, not because it is any better in a permutation context.
It is also interesting to compare the permutation distribution for the F-statistic and the
parametric F(2,111) distribution (Figure 2-8). They do not match perfectly but are quite similar.
Some of the differences around 0 are due to the behavior of the method used to create the density
curve and are not really a problem for the methods. This correspondence explains why both approaches
give similar p-values. In some situations, the correspondence will not be quite so close.
Figure 2-8: Comparison of F(2,111) (dashed line) and permutation distribution (solid line).
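A comparison along the lines of Figure 2-8 can be sketched by overlaying the parametric F(2,111) density
on a density estimate of the permutation results, assuming the Tstar collection of permuted F-statistics
from above is still available:
> plot(density(Tstar), main="Permutation distribution vs F(2,111)")
> curve(df(x, df1=2, df2=111), add=TRUE, lty=2)   # dashed line: parametric F(2,111) density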
So how can we rectify this result (p-value≈0.06) and the Chapter 1 result that detected a
difference between Average and Unattractive with a p-value≈0.03? I selected the two groups to
compare in Chapter 1 because they were furthest apart. "Cherry-picking" the comparison that is
likely to be most different creates a false sense of the real situation and inflates the Type I error
rate because of the selection. If the entire suite of comparisons is considered, this result may
lose some of its luster. In other words, if we consider the suite of all pair-wise differences (and
the tests) implicit in comparing all of them, we need stronger evidence in the most different pair
than a p-value of 0.033 to suggest overall differences. The Beautiful and Average groups are not
that different from each other so they do not contribute much to the overall F-test. In Section 2.5,
we will revisit this topic and consider a method that is statistically valid for performing all
possible pair-wise comparisons.
2.3 - ANOVA model diagnostics
including QQ-plots
by Mark Greenwood and Katharine Banner
The requirements for a One-Way ANOVA F-test are similar to those discussed in Chapter 1,
except that there are now J groups instead of only 2. Specifically, the linear model assumes:
1) Independent observations
2) Equal variances
3) Normal distributions
To assess the equal variance assumption across the groups, we must rely on plots. We can use
boxplots and beanplots to compare the spreads of the groups, which are provided in Figure 2-1.
The range and IQRs should be similar across the groups, although you should always note how
clear or big the violation of the assumption might be, remembering that there will always be
some differences in the variation among groups. In this section, we learn how to work with the
diagnostic plots that are provided from the lm function that can help us more clearly assess
potential violations of the previous assumptions.
We can obtain a suite of diagnostic plots by using the plot function on the ANOVA model
object that we fit. To get all of the plots together in four panels we need to add
the par(mfrow=c(2,2)) command to tell R to make a graph with 4 panels23.
> par(mfrow=c(2,2))
> plot(lm2)
There are two plots in Figure 2-9 with useful information for the equal variance assumption. The
"Residuals vs Fitted" in the top left panel displays the residuals (eij= γij - γ̂ij) on the y-axis and the
fitted values (γ̂ij) on the x-axis. This allows you to see if the variability of the observations differs
across the groups because all observations in the same group get the same fitted value. In this
plot, the points seem to have fairly similar spreads at the fitted values for the three groups of 4,
4.3, and 6. The "Scale-Location" plot in the lower left panel has the same x-axis but the y-axis
contains the square-root of the absolute value of the standardized residuals. The absolute value
transforms all the residuals into a magnitude scale (removing direction) and the square-root helps
you see differences in variability more accurately. The usage is similar in the two plots - you
want to assess whether it appears that the groups have somewhat similar or noticeably different
amounts of variability. If you see a clear funnel shape in the Residuals vs Fitted or an increase or
decrease in the edge of points in the Scale-Location plot, that may indicate a violation of the
constant variance assumption. Remember that some variation across the groups is expected and
is ok, but large differences in spreads are problematic for all the procedures we will learn this
semester.
> eij=residuals(lm2)
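Only the extraction of the residuals is shown above; a minimal sketch of code that would generate a
histogram and density curve of those residuals, similar to Figure 2-10, is:
> par(mfrow=c(1,2))
> hist(eij, main="Histogram of residuals")
> plot(density(eij), main="Density curve of residuals")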
Figure 2-10: Histogram and density curve of the linear model raw residuals.
Figure 2-10 shows that there is a right skew present in the residuals, which is consistent with the
initial assessment of some right skew in the plots of observations in each group.
I extracted the previous QQ-plot of the linear model residuals and enhanced it a little to make
Figure 2-11. We know from looking at the histogram that this is a slightly right skewed
distribution. The QQ-plot places the observed standardized25 residuals on the y-axis and the
theoretical normal values on the x-axis. The most noticeable deviation from the 1-1 line is in the
lower left corner of the plot. These are for the negative residuals (left tail) and there are many
residuals at around the same value a little smaller than -1. If the distribution had followed the
normal here, the points would be on the 1-1 line and would actually be even smaller. So we are
not getting as much spread in the lower observations as we would expect in a normal
distribution. If you go back to the histogram you can see that the lower observations are all
stacked up and do not spread out like the left tail of a normal distribution should. In the right tail
(positive) residuals, there is also a systematic lifting from the 1-1 line to larger values in the
residuals than the normal would generate. For example, the point labeled as "82" (the
82nd observation in the data set) has a residual of about 3 but would be expected to be smaller
(maybe around 2.5) if the distribution were normal. Put together, this pattern in the QQ-plot suggests that
the left tail is too compacted (too short) and the right tail is too spread out - this is the right skew
we identified from the histogram and density curve!
Figure 2-11: QQ-plot of residuals from linear model.
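The enhanced plot in Figure 2-11 is based on the standardized residuals from the default diagnostics; a
basic version of a QQ-plot can be sketched from the raw residuals (so the y-axis scaling will differ
slightly from the figure) with:
> qqnorm(residuals(lm2), main="QQ-plot of residuals from lm2")
> qqline(residuals(lm2))   # reference line to compare the points against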
Generally, when both tails deviate on the same side of the line (forming a sort of quadratic curve,
especially in more extreme cases), that is evidence of a skew. To see some different potential
shapes of QQ-plots, six different data sets are displayed in Figures 2-12 and 2-13. In each row, a QQ-plot
and density curve are displayed. If the points are both above the 1-1 line in the lower and upper tails as
in Figure 2-12(a), then the pattern is a right skew, here even more extreme than in the real data
set. If the points are below the 1-1 line in both tails as in Figure 2-12(c), then the pattern should
be identified as a left skew. These are both problematic for models that assume normally
distributed responses but not necessarily for our permutation approaches if all the groups have
similar skewed shapes. The other problematic pattern is to have more spread than a normal curve
as in Figure 2-12(e) and (f). This shows up with the points being below the line in the left tail
(more extreme negative than expected by the normal) and the points being above the line for the
right tail (more extreme positive than the normal). We call these distributions heavy-tailed; they
can manifest as distributions with outliers in both tails or just a bit more spread out than a normal
distribution. Heavy-tailed residual distributions can be problematic for our models as the
variation is greater than what the normal distribution can account for and our methods might
under-estimate the variability in the results. The opposite pattern with the left tail above the line
and the right tail below the line suggests less spread (lighter-tailed) than a normal as in Figure 2-
12(g) and (h). This pattern is relatively harmless and you can proceed with methods that assume
normality safely.
Figure 2-12: QQ-plots and density curves of four fake distributions with different shapes.
Finally, to help you calibrate expectations for data that are actually normally distributed, two
data sets simulated from normal distributions are displayed below in Figure 2-13. Note how
neither follows the line exactly but that the overall pattern matches fairly well. You have to allow
for some variation from the line in real data sets and focus on when there are really noticeable
issues in the distribution of the residuals such as those displayed above.
Figure 2-13: Two more simulated data sets, generated from normal distributions.
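You can calibrate your own expectations in the same way by repeatedly simulating samples that really are
normally distributed and inspecting their QQ-plots; a sketch using a sample size of 114 (to match the
MockJury residuals, although any size of interest could be used) is:
> par(mfrow=c(2,2))
> for (i in 1:4){
+   fake <- rnorm(114)            # simulated from a normal distribution
+   qqnorm(fake); qqline(fake)    # even truly normal data wander around the line
+ }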
The last issue with assessing the assumptions in an ANOVA relates to situations where the
models are more or less resistant26 to violations of assumptions. For reasons beyond the scope of
this class, the parametric ANOVA F-test is more resistant to violations of the normality and
equal variance assumptions if the design is balanced. A balanced design occurs
when each group is measured the same number of times. The resistance decreases as the data set
becomes less balanced, so having close to balance is preferred to a more imbalanced situation if
there is a choice available. There is some intuition available here - it makes some sense that you
would have better results if all groups are equally (or nearly equally) represented in the data set.
We can check the number of observations in each group to see if they are equal or similar using
the tally function from the mosaic package:
> tally(~Attr,data=MockJury)
previous next
23
We have been using this function quite a bit to make multi-panel graphs but you will always
want to use this command for linear model diagnostics or you will have to use the arrows above
the plots to go back and see previous plots.
24
Along with multiple names, there is variation of what is plotted on the x and y axes and the
scaling of the values plotted, increasing the challenge of interpreting QQ-plots. We will try to be
consistent about the x and y axis choices.
25
Here this means re-scaled so that they should have similar scaling to a standard normal with
mean 0 and standard deviation 1. This does not change the shape of the distribution but can make
outlier identification by value of the residuals simpler - having a standardized residual more
extreme than 5 or -5 would suggest a deviation from normality. But mainly focus on the shape of
the pattern in the QQ-plot.
26
A resistant procedure is one that is not severely impacted by a particular violation of an
assumption. For example, the median is resistant to the impact of an outlier.
A second example of the One-way ANOVA methods involves a study of growth rates of the
teeth of Guinea Pigs (measured in millimeters, mm). N=60 Guinea Pigs were obtained from a
local breeder and each received Orange Juice (OJ) or ascorbic acid (the stuff in vitamin C
capsules, called VC below) at one of three dosages (0.5, 1, or 2 mg) as a source of added Vitamin
C in their diets. Each guinea pig was randomly assigned to receive one of the six different
treatment combinations possible (OJ at 0.5 mg, OJ at 1 mg, OJ at 2 mg, VC at 0.5 mg, VC at 1
mg, and VC at 2 mg). The animals were treated similarly otherwise and, we can assume, lived in
separate cages. We need to create a variable that combines the levels of delivery type (OJ, VC)
and the dosages (0.5, 1, and 2) to use our One-Way ANOVA on the six levels.
The interaction function creates a new variable in the ToothGrowth data.frame that we
called Treat that will be used as a six-level grouping variable.
> ToothGrowth$Treat=with(ToothGrowth,interaction(supp,dose))
#Creates a new variable Treat with 6 levels
The tally function helps us to check for balance; this is a balanced design because the same
number of guinea pigs (nj=10 for all j) were measured in each treatment combination.
> require(mosaic)
> tally(~Treat,data=ToothGrowth)
  OJ.0.5   VC.0.5     OJ.1     VC.1     OJ.2     VC.2
      10       10       10       10       10       10
The next task is to visualize the results using boxplots and beanplots27 (Figure 2-14) and generate
some summary statistics for each group using favstats.
> par(mfrow=c(1,2))
> boxplot(len~Treat,data=ToothGrowth)
> beanplot(len~Treat,data=ToothGrowth,log="",col="yellow",method="jitter")
> favstats(len~Treat,data=ToothGrowth)
Figure 2-14 suggests that the mean tooth growth increases with the dosage level and that OJ
might lead to higher growth rates than VC except at dosages of 2 mg. The variability around the
means looks to be small relative to the differences among the means, so we should expect a small
p-value from our F-test. The design is balanced as noted above (nj = 10 for all six groups) so the
methods are somewhat resistant to impacts from non-normality and non-constant variance. There
is some suggestion of non-constant variance in the plots but this will be explored further below
when we can visually remove the difference in the means from this comparison. There might be
some skew in the responses in some of the groups but there are only 10 observations per group
so skew in the boxplots could be generated by very few observations.
Figure 2-14: Boxplot and beanplot of tooth growth responses for the six treatment level combinations.
Now we can apply our 6+ steps for performing a hypothesis test with these observations. The
initial step is deciding on the claim to be assessed and the test statistic to use. This is a six group
situation with a quantitative response, identifying it as a One-Way ANOVA where we want to
test a null hypothesis that all the groups have the same population mean. We will use a 5%
significance level.
1) Hypotheses: H0: μOJ0.5 = μVC0.5 = μOJ1 = μVC1 = μOJ2 = μVC2 vs HA: Not all μj equal
• The null hypothesis could also be written in reference-coding as H0: τVC0.5 = τOJ1 = τVC1 = τOJ2 =
τVC2 = 0 since OJ.0.5 is chosen as the baseline group (discussed below).
• The alternative hypothesis can be left a bit less specific: HA: Not all τj equal 0.
2) Validity conditions:
• Independence:
   ◦ This is where the note above about separate cages is important. Suppose that there were
   cages that contained multiple animals and they competed for food or could share
   illness. The animals in one cage might be systematically different from the others and
   this "clustering" of observations would present a potential violation of the
   independence assumption. If the experiment had the animals in separate cages, there is
   no clear dependency in the design of the study and we can assume that there is no problem
   with this assumption.
• Constant variance:
   ◦ As noted above, there is some indication of a difference in the variability among the
   groups in the boxplots but the sample size was small in each group. We need to fit the
   linear model to get the other diagnostic plots to make an overall assessment.
> m2=lm(len~Treat,data=ToothGrowth)
> par(mfrow=c(2,2))
> plot(m2)
   ◦ The Scale-Location plot also shows just a little less variability in the group with the
   smallest fitted value but the spread of the groups looks fairly similar in this alternative
   scaling.
   ◦ Put together, the evidence for non-constant variance is not that strong and we can
   assume that there is at least not a major problem with this assumption.
• Normality of residuals:
   ◦ The Normal Q-Q plot shows a small deviation in the lower tail but nothing that we
   wouldn't expect from a normal distribution. There is no evidence of a problem with this
   assumption in the upper right panel of Figure 2-15.
3) Calculate the test statistic:
> anova(m2)
Response: len
4) Find the p-value:
• There are two options here, especially since it seems that our assumptions about variance
and normality are not violated (note that we do not say "met" - we just have no strong evidence
against them). The parametric and nonparametric approaches should provide similar results
here.
• The parametric approach is easiest - the p-value comes from the previous ANOVA table as
<2.2e-16. This is in scientific notation and means that the p-value is below the numerical precision
of the computer, so R only reports that it is a very small number. You can report that the
p-value < 0.00001 but should not report that it is 0. This p-value came from an F(5,54)
distribution (the distribution of the test statistic if the null hypothesis is true).
• The nonparametric approach is not too hard so we can compare the two approaches here.
> Tobs <- anova(lm(len~Treat,data=ToothGrowth))[1,4]; Tobs
[1] 41.55718
> par(mfrow=c(1,2))
> B<-1000
> Tstar<-matrix(NA,nrow=B)
> for (b in 1:B){
+ Tstar[b]<-anova(lm(len~shuffle(Treat),data=ToothGrowth))[1,4]
+ }
> hist(Tstar,xlim=c(0,Tobs+3))
> abline(v=Tobs,col="red",lwd=3)
> plot(density(Tstar),xlim=c(0,Tobs+3),main="Density curve of Tstar")
> abline(v=Tobs,col="red",lwd=3)
> pdata(Tobs,Tstar,lower.tail=F)
[1] 0
Figure 2-16: Histogram and density curve of permutation distribution for F-statistic for tooth growth
data. Observed test statistic in bold, vertical line at 41.56.
5) Make a decision:
• Reject H0 since the p-value is less than 5%.
6) Write a conclusion:
• There is evidence at the 5% significance level that the different treatments (combinations of
OJ/VC and dosage levels) cause some difference in the true mean tooth growth for these Guinea
Pigs.
   ◦ We can make the causal statement because the treatments were randomly assigned
   but these inferences only apply to these Guinea Pigs since they were not randomly
   selected from a larger population.
   ◦ Remember that we are making inferences to the population means and not the
   sample means and want to make that clear in any conclusion.
   ◦ The alternative is that there is some difference in the true means - be sure to make
   the wording clear that you aren't saying that all differ. In fact, if you look back at Figure
   2-14, the means for the 2 mg dosages look almost the same. The F-test is about finding
   evidence of some difference somewhere among the true means. The next section will
   provide some additional tools to get more specific about the source of those detected
   differences.
Before we leave this example, we should revisit our model estimates and interpretations. The
default model parameterization uses the reference-coding. Running the summary function on the
model m2 provides the estimated coefficients:
> summary(m2)
Coefficients:
For some practice with the reference coding used in these models, we will find the estimates for
observations for a couple of the groups. To work with the parameters, you need to start with
diagnosing the baseline category by considering which level is not displayed in the output. The
levels function can list the groups and their coding in the data set. The first level is usually the
baseline category but you should check this in the model summary as well.
> levels(ToothGrowth$Treat)
[1] "OJ.0.5" "VC.0.5" "OJ.1"   "VC.1"   "OJ.2"   "VC.2"
There is a VC.0.5 in the second row of the model summary, but there is no row
for OJ.0.5 and so this must be the baseline category. That means that the fitted value or model
estimate for the OJ at 0.5 mg group is the same as the (Intercept) row or α̂, estimating a
mean tooth growth of 13.23 mm when the pigs get OJ at a 0.5 mg dosage level. You should
always start with working on the baseline level in a reference-coded model. To get estimates for
any other group, then you can use the (Intercept) estimate and add the deviation for the group of
interest. For VC.0.5, the estimated mean tooth growth is α̂ + τ̂2 = α̂ + τ̂VC.0.5 = 13.23 + (-5.25) =
7.98 mm. It is also potentially interesting to directly interpret the estimated difference (or
deviation) between OJ0.5 (the baseline) and VC0.5 (group 2) that is τ̂VC.0.5 = -5.25: we estimate
that the mean tooth growth in VC.0.5 is 5.25 mm shorter than it is in OJ.0.5. This and many
other direct comparisons of groups are likely of interest to researchers involved in studying the
impacts of these supplements on tooth growth and the next section will show us how to do that
(correctly!).
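One way to double-check this sort of hand calculation is to ask R for the sample mean of each treatment
group and compare those to the intercept-plus-deviation arithmetic. This sketch assumes the mosaic
package is still loaded (for the formula interface to mean) and that the coefficients of m2 are ordered
with the intercept first and the VC.0.5 deviation second, as in the model summary above:
> mean(len~Treat, data=ToothGrowth)   # sample mean tooth growth for each of the six groups
> coef(m2)                            # intercept (baseline OJ.0.5) plus deviations for other groups
> coef(m2)[1] + coef(m2)[2]           # should reproduce the VC.0.5 estimate of about 7.98 mm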
27
Note that to see all the group labels in the plot when I copied it into R, I had to widen the plot
window. You can resize the plot window using the small "=" signs in the grey bars that separate
the different panels in R-studio.
With evidence that the true means are likely not all equal, many researchers want to know which
groups show evidence of differing from one another. This provides information on the source of
the overall difference that was detected and detailed information on which groups differed from
one another. Because this is a shot-gun/unfocused sort of approach, some people think it is an
over-used procedure. Others feel that it is an important method of addressing detailed questions
about group comparisons in a valid way. For example, we might want to know if OJ is different
from VC at the 0.5 mg dosage level and these methods will allow us to get an answer to this sort
of question. It also will test for differences between the OJ-0.5 and VC-2 groups and every other
pair you can construct. This method actually takes us back to the methods in Chapter 1 where we
compared the means of two groups except that we need to deal with potentially many pair-wise
comparisons, making an adjustment to account for that inflation in Type I errors that occurs due
to many tests being performed at the same time. There are many different statistical methods to
make all the pair-wise comparisons, but we will employ the most commonly used one,
called Tukey's Honest Significant Difference (Tukey's HSD) method28. The name suggests that
not using it could lead to a dishonest answer and that it will give you an honest result. It is more
that if you don't do some sort of correction for all the tests you are performing, you might find
some spurious29 results. There are other methods that could be used to do a similar correction.
The general challenge in this situation is that if you perform many tests at the same
time, you inflate the Type I error rate. We can define the family-wise error rate as the
probability that at least one error is made on a set of tests or P(At least 1 error is made). The
family-wise error is meant to capture the overall situation in terms of measuring the likelihood of
making a mistake if we consider many tests, each with some chance of making their own
mistake, and focus on how often we make at least one error when we do many tests. A quick
probability calculation shows the magnitude of the problem. If we start with a 5% significance
level test, then P(Type I error on one test) =0.05 and the P(no errors made on one test) =0.95, by
definition. This is our standard hypothesis testing situation. Now, suppose we have m
independent tests, then P(make at least 1 Type I error given all null hypotheses are true) = 1 -
P(no errors made) = 1 - 0.95^m. Figure 2-17 shows how the probability of having at least one false
detection grows rapidly with the number of tests. The plot stops at 100 tests since by then there is
effectively a 100% chance of at least one false detection. It might seem like doing 100 tests is a lot, but in
Genetics research it is possible to consider situations where millions of tests are considered so
these are real issues to be concerned about in many situations.
Figure 2-17: Plot of family-wise error rate as the number of tests performed increases. Dashed line indicates 0.05.
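A plot like Figure 2-17 takes only a few lines to recreate using the 1 - 0.95^m formula from above:
> m <- 1:100                       # number of independent tests considered
> fwer <- 1 - 0.95^m               # P(at least one Type I error) when each test uses a 5% level
> plot(m, fwer, type="l", xlab="Number of tests", ylab="Family-wise error rate")
> abline(h=0.05, lty=2)            # dashed reference line at the single-test rate of 0.05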
In pair-wise comparisons between all the pairs of means in a One-Way ANOVA, the number of
tests is based on the number of pairs. We can calculate the number of tests using J choose 2,
which is J(J-1)/2, to get the number of pairs of size 2 that we can make out of J individual treatment
levels. We
won't explore the combinatorics formula for this, as the choose function can give us the
answers:
> choose(3,2)
[1] 3
> choose(4,2)
[1] 6
> choose(5,2)
[1] 10
> choose(6,2)
[1] 15
So if you have 6 groups, like in the Guinea Pig study, we will have to consider 15 tests to
compare all the pairs of groups. 15 tests seems like enough that we should be worried about
inflated family-wise error rates. Fortunately, the Tukey's HSD method controls the family-wise
error rate at your specified level (say 0.05) across any number of pair-wise comparisons. This
means that the overall rate of at least one Type I error is controlled at the specified significance
level, often 5%. To do this, each test must use a slightly more conservative cut-off than if just
one test is performed and the procedure helps us figure out how much more conservative we
need to be.
Tukey's HSD starts with focusing on the difference between the groups with the largest and
smallest means (ȳmax - ȳmin). If (ȳmax - ȳmin) ≤ Margin of Error for the difference in the means, then
all other pairwise differences, say |ȳj - ȳj'|, will be less than or equal to that margin of error. This
also means that any confidence intervals for any difference in the means will contain 0. Tukey's
HSD selects a critical value so that (ȳmax - ȳmin) will be less than the margin of error in 95% of
data sets drawn from populations with a common mean. This implies that in 95% of data sets in
which all the population means are the same, all confidence intervals for differences in pairs of
means will contain 0. Tukey's HSD provides confidence intervals for the difference in true
means between groups j and j', μj - μj', for all pairs where j ≠ j', by adding and subtracting a
margin of error (a multiplier times the standard error of the difference in the sample means) to each
observed difference, ȳj - ȳj'. The studentized range distribution that is
used to find the multiplier, q, for the confidence intervals is available in the qtukey function
and generally provides a slightly larger multiplier than the regular t* from our two-sample t-
based confidence interval, discussed in Chapter 1. We will use the confint, cld,
and plot functions applied to output from the glht function (multcomp package; Hothorn,
Bretz and Westfall, 2008) to easily get the required comparisons from our ANOVA model.
Unfortunately, its code format is a little complicated - but there are just two places to modify the
code: you include the model name and, after mcp (which stands for multiple comparisons) in
the linfct option, you include the explanatory variable name
as VARIABLENAME="Tukey". The last part is what requests the Tukey HSD multiple comparisons.
Once we obtain the intervals, we can use them to test H0: μj = μj' vs HA: μj ≠ μj' by assessing
whether 0 is in the confidence interval for each pair. If 0 is in the interval, then there is no evidence of a
difference for that pair. If 0 is not in the interval, then we reject H0 and have evidence at the
specified family-wise significance level of a difference for that pair. The following code provides
the numerical and graphical30 results of applying Tukey's HSD to the linear model for the Guinea
Pig data:
> require(multcomp)
> Tm2 <- glht(m2, linfct = mcp(Treat = "Tukey"))
> confint(Tm2)
Quantile = 2.9549
Linear Hypotheses:
> plot(Tm2)
Figure 2-18: Graphical display of pair-wise comparisons from Tukey's HSD for the Guinea Pig data. Any
confidence intervals that do not contain 0 provide evidence of a difference in the groups.
Figure 2-18 contains confidence intervals for the difference in the means for all 15 pairs of
groups. For example, the first confidence interval in the first row is comparing VC.0.5 and
OJ.0.5 (VC.0.5 minus OJ.0.5). In the numerical output, you can find that this 95% family-wise
confidence interval goes from -10.05 to -0.45 mm (lwr and upr in the numerical output provide
the CI endpoints). This interval does not contain 0 since its upper end point is -0.45 mm and so
we can now say that there is evidence that OJ and VC have different true mean growth rates at
the 0.5 mg dosage level. We can go further and say that we are 95% confident that the difference
in the true mean tooth growth between VC0.5 and OJ0.5 (VC0.5-OJ0.5) is between -10.05 and -
0.45 mm. But there are fourteen more similar intervals...
If you put all these pair-wise tests together, you can generate an overall interpretation of Tukey's
HSD results that discusses sets of groups that are not detectably different from one another and
those groups distinguished from other sets of groups. To do this, start with listing out the pairs
that are not detectably different (CIs contain 0), which, here, only occurs for four of the pairs.
The CIs that contain 0 are for the pairs VC.1 and OJ.0.5, OJ.2 and OJ.1, VC.2 and OJ.1, and,
finally, VC.2 and OJ.2. So VC.2, OJ.1, and OJ.2 are all not detectably different from each other
and VC.1 and OJ.0.5 are also not detectably different. If you look carefully, VC.0.5 is detected
as different from every other group. So there are basically three sets of groups that can be
grouped together as "similar": VC.2, OJ.1, and OJ.2; VC.1 and OJ.0.5; and VC.0.5. Sometimes
groups overlap with some levels not being detectably different from other levels that belong to
different groups and the story is not as clear as it is in this case. An example of this sort of
overlap is seen in the next section.
There is a method that many researchers use to more efficiently generate and report these sorts of
results that is called a compact letter display (CLD). The cld function can be applied to the
results from glht to provide a "simple" summary of the sets of groups that we generated above.
In this discussion, we are using a set as a union of different groups that can contain one or more
members, and the members of these sets are the six different treatment levels.
> cld(Tm2)
  OJ.0.5   VC.0.5     OJ.1     VC.1     OJ.2     VC.2
     "b"      "a"      "c"      "b"      "c"      "c"
Groups with the same letter are not detectably different (are in the same set) and groups that are
detectably different get different letters (different sets). Groups can have more than one letter to
reflect "overlap" between the sets of groups and sometimes a set of groups contains only a single
treatment level (VC.0.5 is a set of size 1). Note that if the groups have the same letter, this does
not mean they are the same, just that there is no evidence of a difference for that pair. If we
consider the previous output for the CLD, the "a" set contains VC.0.5, the "b" set contains OJ.0.5
and VC.1, and the "c" set contains OJ.1, OJ.2, and VC.2. These are exactly the groups of
treatment levels that we obtained by going through all fifteen pairwise results. And these letters
can be added to a beanplot to help fully report the results and understand the sorts of differences
Tukey's HSD can detect.
> beanplot(len~Treat,data=ToothGrowth,log="",col="white",method="jitter")
> text(c(2),c(10),"a",col="blue",cex=2)
> text(c(3,5,6),c(25,28,28),"b",col="green",cex=2)
> text(c(1,4),c(15,18),"c",col="red",cex=2)
Figure 2-19 can be used to enhance the discussion by showing that the "a" group with VC.0.5
had the lowest average tooth growth, the "b" group had intermediate tooth growth for treatments
OJ.0.5 and VC.1, and the highest growth rates came from the "c" group of OJ.1, OJ.2, and VC.2.
Even though VC.2 had the highest average growth rate, we are not able to prove that its true mean
is any higher than the other groups labeled with "c". Hopefully the ease of getting to the story of the
Tukey's HSD results from a plot like this explains why it is common to report results using these
methods instead of reporting 15 confidence intervals.
Figure 2-19: Beanplot of tooth growth by group with Tukey's HSD compact letter display.
There are just a couple of other details to mention on this set of methods. First, note that we
interpret the set of confidence intervals simultaneously: We are 95% confident that ALL the
intervals contain the respective differences in the true means (this is a family-wise
interpretation). These intervals are adjusted (wider) from our regular 2 sample t intervals from
Chapter 1 to allow this stronger interpretation. Second, if sample sizes are unequal in the groups,
Tukey's HSD is conservative and provides a family-wise error rate that is lower than the nominal
level. In other words, it makes false detections less often than advertised and the intervals provided
are a little wider than needed, containing all the pairwise differences at higher than the nominal
confidence level
of (typically) 95%. Third, this is a parametric approach and violations of normality and constant
variance will push the method in the other direction, potentially making the technique
dangerously liberal. Nonparametric approaches to this problem are possible, but will not be
considered here.
28
When this procedure is used with unequal group sizes it is also sometimes called Tukey-
Kramer's method.
29
We often use "spurious" to describe falsely rejected null hypotheses which are also called false
detections.
30
The plot of results usually contains all the labels of groups but if the labels are long or there
are many groups, sometimes the row labels are hard to see even with re-sizing the plot to make it
taller in R-studio, and the numerical output is useful as a guide to help you read the plot.
Figure 2-20: Tukey's HSD confidence interval results at the 95% family-wise confidence
level.
At the family-wise 5% significance level, there are no pairs that are detectably different -
they all get the same letter of "a". Now we will produce results for the reader who thought
a 10% significance level was suitable for this application before seeing any of the results. We
just need to change the confidence level or significance level that the CIs or tests are
produced with inside the functions. For the confint function, the level option is the
confidence level, and for cld, it is the family-wise significance level.
> confint(Tm2,level=0.9)
Simultaneous Confidence Intervals
Multiple Comparisons of Means: Tukey Contrasts
90% family-wise confidence level
Figure 2-22: Beanplot of sentences with compact letter display results from 10% family-
wise significance level Tukey's HSD.
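Assuming the Tukey's HSD results object for this model is again stored as Tm2, as in the confint call
above, the compact letter display at the 10% family-wise significance level can be requested through the
level option of cld (a sketch of the call; the argument name should be checked against ?cld in your
version of multcomp):
> cld(Tm2, level=0.1)   # letters based on a 10% family-wise significance level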
The main components of R code used in this chapter follow, with the components you need to modify
shown in capital letters (Y, X, DATASETNAME, MODELNAME), remembering that any R packages
mentioned need to be installed and loaded for this code to have a chance of working:
• MODELNAME=lm(Y~X,data=DATASETNAME)
◦ Here it is used to fit the reference-coded One-Way ANOVA model with Y as the
response variable and X as the grouping variable, storing the estimated model object in
MODELNAME.
• MODELNAME=lm(Y~X-1,data=DATASETNAME)
◦ Fits the cell means version of the One-Way ANOVA model.
• summary(MODELNAME)
◦ Generates model summary information including the estimated model coefficients, SEs,
t-tests, and p-values.
• anova(MODELNAME)
◦ Generates the ANOVA table but must only be run on the reference-coded version of
the model.
◦ Results are incorrect if run on the cell-means model since the reduced model under the
null is that the mean of all the observations is 0!
• pf(FSTATISTIC,df1=NUMDF,df2=DENDF,lower.tail=F)
◦ Finds the p-value for an observed F-statistic with NUMDF and DENDF degrees of
freedom.
• par(mfrow=c(2,2)); plot(MODELNAME)
◦ Generates four diagnostic plots including the Residuals vs Fitted and Normal Q-Q plot.
• plot(allEffects(MODELNAME))
◦ Requires the effects package to be installed and loaded.
◦ Plots the estimated model.
• Tm2=glht(MODELNAME,linfct=mcp(X="Tukey"); confint(Tm2);
plot(Tm2); cld(Tm2)
◦ Requires the multcomp package to be installed and loaded.
◦ Generates the text output and plot for Tukey's HSD as well as the compact letter
display.
For these practice problems, you will work with the cholesterol data set from
the multcomp package that you should already have loaded. To load the data set and learn
more about the study, use the following code:
require(multcomp)
data(cholesterol)
help(cholesterol)
2.1. Graphically explore the differences in the changes in Cholesterol levels for the five
treatment levels using boxplots and beanplots.
2.2. Is the design balanced?
2.3. Complete all 6+ steps of the hypothesis test using the parametric F-test, reporting the
ANOVA table and the distribution of the test statistic under the null.
2.4. Discuss the scope of inference using the information that the treatment levels were
randomly assigned to volunteers in the study.
2.5. Generate the permutation distribution and find the p-value. Compare the parametric
p-value to the permutation test results.
2.6. Perform Tukey's HSD on the data set. Discuss the results - which pairs were detected
as different and which were not? Bigger reductions in cholesterol are good, so are there
any levels you would recommend or that might provide similar reductions?
2.7. Find and interpret the CLD and compare that to your interpretation of results from
2.6.