0 - Getting started in R
by Mark Greenwood and Katharine Banner
This book and access to a computer (PC, Mac, or just computer lab computers on campus) are
the only required materials for the course. You will need to download the statistical software
package called R and an enhanced interface to R called R-studio (Rstudio, 2014). They are open
source and free to download and use (and will always be that way). This means that the skills
you learn now can follow you the rest of your life. R is becoming the primary language of
statistics and is being adopted across academia, government, and businesses to help manage and
learn from the growing volume of data being obtained. Hopefully you will get a sense of some of
the power of R this semester.
The next pages will walk you through the process of getting the software downloaded and
provide you with an initial experience using R-studio to do things that should look familiar even
though the interface will be a new experience. Do not expect to master R quickly - it takes years
(sorry!) even if you know all the statistical methods being used. We will try to keep all of your
interactions with R code in a similar coding form and that should help you learn how to use
R as we move through various methods. Everyone who learns R starts by copying other
people's code and then making changes for specific applications - so expect to go back to
examples and learn how to modify that code to work for your particular data set. In Chapter 1,
we will exploit the power of R to compare quantitative responses from two groups, making some
graphical displays, doing hypothesis testing and creating confidence intervals in a couple of
different ways.
You will have two downloading activities to complete before you can do anything more than
read this book. First, you need to download R. It is the engine that will do all the computing for
us, but you will only interact with it once. Go to http://cran.rstudio.com and click on
the "Download R for..." button that corresponds to your operating system. Second, you need to
download R-studio. It is an enhanced interface that will make interacting with R less frustrating.
Go to http://www.rstudio.com/products/rstudio/download/ and select the "installer" for your
operating system under the column for "Installers for all platforms". From this point forward,
you should only open R-studio; it provides your interface with R. Note that both R and R-studio
are updated frequently (up to four times a year) and if you downloaded either more than a few
months previously, you should download the up-to-date versions, especially if something you are
trying to do is not working. Sometimes code will not work in older versions of R and sometimes
old code won't work in new versions of R3.
Now we get to complete some basic tasks in R using the R-studio interface. When you open R-
studio, you will see a screen like Figure 0-2. The added notes can help you get initially oriented
to the software interface. R is command-line software - meaning that most of the time you have
to create code and then execute it to get any results. R-studio makes the management and
execution of code more efficient than the basic version of R. The lower left panel in R-studio is
called the "console" window and is where you can type R code directly into R or where you will
see the code you run and (most importantly!) where the results of your executed commands will
show up. The most basic interaction with R is available once you get the cursor active at the
command prompt ">". The upper left panel is for writing, saving, and running your R code. Once
you have code available in this window, the "Run" button will execute the code for the line that
your cursor is on or for any text that you have highlighted with your mouse. The "data
management" or environment panel is in the upper right, providing information on what data sets
have been loaded. It also contains the "Import Dataset" button that makes reading data into R
easier. The lower right panel contains information on the "Packages" that are available and is
where you will see plots that you make and requests for "Help".
For example, typing 3+4 at the command prompt and hitting enter returns the result on the next line:
> 3+4
[1] 7
You can do more interesting calculations, like finding the mean of the numbers -3, 5, 7, and 8 by
adding them up and dividing by 4:
> (-3+5+7+8)/4
[1] 4.25
Note that the parentheses help R to figure out your desired order of operations. If you drop
that grouping, you get a very different result:
> -3+5+7+8/4
[1] 11
We could estimate the standard deviation similarly using the formula you might remember from
introductory statistics, but that will only work in very limited situations. To use the real power of
R this semester, we need to work with data sets that store the observations for our subjects
in variables. Basically, we need to store observations in named vectors that contain a list of the
observations. To create a vector containing the four numbers and assign it to a variable
named variable1, we use the function c, which combines the items that are inside its
parentheses and separated by commas into a single vector:
> c(-3,5,7,8)
[1] -3 5 7 8
To get this vector stored in a variable called variable1 we need to use the assignment operator,
"<-" (read as "stored as"), that assigns the information on the right into the variable that you are
creating.
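For example, the vector of four numbers from above can be stored in a variable named variable1 like this:
> variable1 <- c(-3,5,7,8)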
In R, the assignment operator, <-, is created by typing a less than symbol (<) followed by a
minus sign (-) without a space between them. If you ever want to see what numbers are
residing in an object in R, just type its name and hit enter:
> variable1
[1] -3 5 7 8
You can see how that variable contains the same information that was initially generated by c(-
3,5,7,8) but is easier to access since we just need the text representing that vector. Now we can
use functions such as mean and sd to find the mean and standard deviation of the observations
contained in variable1:
> mean(variable1)
[1] 4.25
> sd(variable1)
[1] 4.99166
When dealing with real data, we will often have information about more than one variable. We
could enter all observations by hand for each variable but this is prone to error and onerous for
all but the smallest data sets. If you are to ever utilize the power of statistics in the evolving data-
centered world, data management has to be accomplished in a more sophisticated way. While
you can manage data sets quite effectively in R, it is often easiest to start with your data set in
something like Microsoft Excel or OpenOffice's Calc. You want to make sure that observations
are in the rows and the names of variables are in the columns and that there is no "extra stuff" in
the spreadsheet. If you have missing observations, they should be represented with blank cells.
The file should be saved as a ".csv" file (stands for comma-separated values, although Excel calls
it "CSV (Comma Delimited)"), which basically strips off some of the junk that Excel adds to the
necessary information in the file. Excel will tell you that this is a bad idea, but it actually creates
a more stable long-term storage format and one that R can use directly. There will be a few
words in the last chapter regarding why we use R in this course instead of Excel or other
(commercial) statistical software. We'll wait until we show you some of the cool things that R
can do to discuss why we didn't use other software.
With a data set converted to a CSV file, we need to read the data set into R. There are two ways
to do this, either using the GUI point-and-click interface in R-studio or modifying
the read.csv function to find the file of interest. To practice this, you can download an Excel
(.xls) file from https://dl.dropboxusercontent.com/u/77307195/treadmill.xls that contains
observations on 31 males that volunteered for a study on methods for measuring fitness (Westfall
and Young, 1993). In the spreadsheet, you will find variables including each subject's Age, BodyWeight, and 1.5 mile run time (RunTime), which are used below.
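If you save the spreadsheet as a .csv file and then use the "Import Dataset" button (or type similar code yourself), R-studio will run two lines of code that look something like the following sketch; the file path here is only a placeholder and will depend on where you saved your file:
> treadmill <- read.csv("C:/Users/yourname/Documents/treadmill.csv")
> View(treadmill)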
What is put inside the " " will depend on the location of your saved .csv file. A version of the
data set in what looks like a spreadsheet will appear in the upper left window due to the second
line of code (View(treadmill)). Just directly typing (or using) a line of code like this is
actually the other way that we can read in files. If you choose to use this, you need to tell R
where to look in your computer to find the data file. read.csv is a function that takes a path as
an argument. To use it, specify the path to your data file, put quotes around it, and put it as the
input to read.csv(...). For some examples later in the book, you will be able to copy a
command like this and read data sets and other code directly from my Dropbox folder using an
internet connection.
Figure 0-3: R-studio with initial data set loaded.
To verify that you read in the data set correctly, it is good to check its contents. We can view the
first and last rows in the data set using the head and tail functions on the data set, which
show the following results for the treadmill data. Note that you will sometimes need to
resize the console window in R-studio to get all the columns to display in a single row which can
be performed by dragging the grey bars that separate the panels.
> head(treadmill)
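Many of the summaries used below come from the mosaic package, which needs to be installed once before it can be used. A minimal way to do that, assuming you have an internet connection, is with the install.packages function:
> install.packages("mosaic")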
After installing the package, we need to load it to make it active. We need to go to the command
prompt and type (or copy and paste) require(mosaic):
> require(mosaic)
You may see a warning message about versions of the package and versions of R - this is usually
something you can ignore. Other warning messages could be more ominous for proceeding but
before getting too concerned, there are a couple of basic things to check. First, double check that
the package is installed. Second, check for typographical errors in your code - especially for
misspellings or unintended capitalization. If you are still having issues, try repeating the installation
process or find someone more used to using R to help you. Most computers in computer labs on
campus at MSU have R and R-studio installed and provide another venue to use the software if
you are having problems5.
To help you go from basic to intermediate R usage, you will want to learn how to manage and
save your R code. The best way to do this is using the upper left panel in R-studio using what are
called R-scripts and they have a file extension of .R. To start a new .R file to store your code,
click on File, then New File, then R Script. This will create a blank page to enter and edit code -
then save the file as MyFileName.R in your preferred location. Saving your code will mean that
you can return to where you last were working by simply re-running the saved script file. With
code in the script window, you can place the cursor on a line of code or highlight a chunk of
code and hit the "Run" button on the upper part of the panel. It will appear in the console with
results just like what you got if you typed it after the command prompt. Figure 0-4 shows the
screen with the code used in this section in the upper left panel, saved in a file called Ch0.R, with
the results of highlighting and executing the first section of code using the "Run" button.
Figure 0-4: R-studio with highlighted code run.
3 The need to keep the code up-to-date as R continues to evolve is one reason that this book is locally published...
4 If you are having trouble getting the file converted and read into R, copy and run the following code: treadmill=read.csv("http://dl.dropboxusercontent.com/u/77307195/treadmill.csv",header=T)
5 We highly recommend that you do not wait until the last minute to try to get R code to work for your own assignments. Even experienced R users can sometimes need a little time to find their errors.
With R-studio running, the mosaic package loaded, a place to write and save code, and
the treadmill data set loaded, we can (finally!) start to summarize the results of the study.
The treadmill object is what R calls a data.frame and contains columns corresponding to
each variable in the spreadsheet. Every function in R will involve specifying the variable(s) of
interest and how you want to use them. To access a particular variable (column) in a data.frame,
you can use a $ between the data.frame name and the name of the variable of interest,
as dataframename$variablename. To identify the RunTime variable here it would
be treadmill$RunTime and in the command would look like:
> treadmill$RunTime
> mean(treadmill$RunTime)
[1] 10.58613
> sd(treadmill$RunTime)
[1] 1.387414
And now we know that the average running time for 1.5 miles for the subjects in the study was
10.6 minutes with a standard deviation (SD) of 1.39 minutes. But you should remember that the
mean and SD are only appropriate summaries if the distribution is roughly symmetric.
The mosaic package provides a useful function called favstats that provides the mean and
SD as well as the 5 number summary: the minimum (min), the first quartile (Q1, the
25th percentile), the median (50th percentile), the third quartile (Q3, the 75th percentile), and the
maximum (max). It also provides the number of observations (n), which was 31, as noted above,
and a count of whether any missing values were encountered (missing), which was 0 here.
> favstats(treadmill$RunTime)
> hist(treadmill$RunTime)
Figure 0-5: Histogram of Run Times in minutes of n=31 subjects in Treadmill study.
I used the Export button found above the plot, followed by Copy to Clipboard and clicking on
the Copy Plot button, to make the figure available to paste into your favorite word-processing
program. You can see the first parts of this process in the screen grab in Figure 0-6.
Figure 0-6: R-studio while in the process of copying the histogram.
You can also directly save the figures as separate files using Save as image or Save as PDF and
then insert them into other documents.
The function defaults into providing a histogram on the frequency or count scale. In most R
functions, there are the default options that will occur if we don't make any specific choices and
options that we can modify. One option we can modify here is to add labels to the bars to be able
to see exactly how many observations fell into each bar. Specifically, we can turn
the labels option "on" by adding labels=T to the previous call to the hist function,
separated by a comma:
> hist(treadmill$RunTime,labels=T)
Figure 0-7: Histogram of Run Times with counts in bars labelled.
Based on this histogram, it does not appear that there are any outliers in the responses since there are
no bars that are separated from the other observations. However, the distribution does not look
symmetric and there might be a skew to the distribution. Specifically, it appears to be skewed
right (the right tail is longer than the left). But histograms can sometimes mask features of the
data set by binning observations and it is hard to find the percentiles accurately from the plot.
When assessing outliers and skew, the boxplot (or Box and Whiskers plot) can also be helpful
(Figure 0-8) to describe the shape of the distribution as it displays the 5-number summary and
will also indicate observations that are "far" above the middle of the observations.
R's boxplot function uses the standard rule to indicate an observation as a potential outlier if it
falls more than 1.5 times the IQR (Inter-Quartile Range, calculated as Q3-Q1) below Q1 or
above Q3. The potential outliers are plotted with circles and the Whiskers (lines that extend from
Q1 and Q3 typically to the minimum and maximum) are shortened to only go as far as
observations that are within 1.5*IQR of the upper and lower quartiles. The box part of the
boxplot is a box that goes from Q1 to Q3 and the median is displayed as a line somewhere inside
the box6. Looking back at the summary statistics above, Q1=9.78 and Q3=11.27, providing an
IQR of:
> IQR<-11.27-9.78
> IQR
[1] 1.49
One observation (the maximum value of 14.03) is indicated as a potential outlier based on this
result by being larger than Q3+1.5*IQR, which was 13.505:
> 11.27+1.5*IQR
[1] 13.505
The boxplot also shows a slight indication of a right skew (skew towards larger values) with the
distance from the minimum to the median being smaller than the distance from the median to the
maximum. Additionally, the distance from Q1 to the median is smaller than the distance from the
median to Q3. It is modest skew, but is worth noting.
> boxplot(treadmill$RunTime)
Figure 0-8: Boxplot of 1.5 mile Run Times.
While the default boxplot is fine, it fails to provide good graphical labels, especially on the y-
axis. Additionally, there is no title on the plot. The following code provides some enhancements
to the plot by using the ylab and main options in the call to boxplot, with the results
displayed in Figure 0-9.
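A sketch of that enhanced call is below; the exact ylab and main label text is an assumption that you can edit to suit your own plot:
> boxplot(treadmill$RunTime, ylab="1.5 Mile Run Time (minutes)", main="Boxplot of Run Times of n=31 subjects")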
6 The median, quartiles and whiskers sometimes occur at the same values when there are many tied observations. If you can't see all the components of the boxplot, produce the numerical summary to help you understand what happened.
You should have R and R-studio downloaded and working after going through this preliminary
chapter. You should be able to read a data set into R and run some basic functions, all done using
the R-studio interface. If you are struggling with this, you should seek additional help with these
technical issues so that you are ready for more complicated statistical methods that are coming
very soon. For most assignments, we will give you a seed of the basic R code that you need.
Then you will modify it to work on your data set of interest. As mentioned previously, the way
everyone learns and uses R involves starting with someone else's code and then modifying it. If
you can complete the Practice Problems that follow, you are on your way to learning to use R.
The statistical methods in this chapter were minimal and all should have been review. They
involved a quick reminder of summarizing the center, spread, and shape of distributions using
numerical summaries of the mean and SD and/or the min, Q1, median, Q3, and max and the
histogram and boxplot as graphical summaries. The main point was really to get a start on using
R to provide results you should be familiar with from your previous statistics experiences.
DATASETNAME$VARIABLENAME
To access a particular variable in a data.frame called DATASETNAME, use a $ and then
the VARIABLENAME.
head(DATASETNAME)
Provides a list of the first few rows of the data set for all the variables in it.
mean(DATASETNAME$VARIABLENAME)
Calculates the mean of the observations in a variable.
sd(DATASETNAME$VARIABLENAME)
Calculates the SD of the observations in a variable.
favstats(DATASETNAME$VARIABLENAME)
Provides a suite of numerical summaries of the observations in a variable.
hist(DATASETNAME$VARIABLENAME)
Makes a histogram.
boxplot(DATASETNAME$VARIABLENAME)
Makes a boxplot.
At the end of each chapter, there is a section filled with questions related to the material. Your
instructor has a file that contains the R code required to provide the results to answer all these
questions. To practice learning R, it would be most useful for you to try to accomplish the
requested tasks first yourself in R and then refer to the provided R code when you struggle.
These questions provide a great venue to check what you are learning, see the methods applied to
another data set, and to discuss in study groups, with your instructor, or at the Math Learning
Center, especially if you have any questions about the correct responses.
0.1. Read in the treadmill data set discussed above and find the mean and SD of the Ages
(Age variable) and Body Weights (BodyWeight). In studies involving human subjects, it
is common to report summaries of the characteristics of the subjects. Why does this matter?
Think about how your interpretation of any study of the fitness of subjects would change
if the mean age had been 20 years older or 35 years younger.
0.2. How does knowing about the distribution of results for Age and BodyWeight help
you understand the results for the Run Times discussed above?
0.3. The mean and SD are most useful as summary statistics only if the distribution is
relatively symmetric. Make a histogram of Age responses and discuss the shape of the
distribution (is it skewed right, skewed left, approximately symmetric?; are there
outliers?). Approximately what range of ages does this study pertain to?
0.4. The weight responses are in kilograms and you might prefer to see them in pounds.
The conversion is lbs=2.205*kgs. Create a new variable in the treadmill data.frame
called BWlb using this code:
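Based on the conversion given above, a one-line assignment like this (a sketch, assuming the treadmill data.frame is loaded) will create the new variable:
> treadmill$BWlb <- 2.205*treadmill$BodyWeight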
0.5. Make histograms and boxplots of the original BodyWeight and new BWlb variables.
Discuss aspects of the distributions that changed and those that remained the same with
the transformation from kilograms to pounds.
1 - (R)e-introduction to statistics
by Mark Greenwood and Katharine Banner
It is also really important to note that variables have to vary - if you measure the sex of your
subjects but are only measuring females, then you do not have an interesting variable. The last,
but probably most important, aspect of data is the context of the measurement. The who, what,
when, and where of the collection of the observations is critical to the sort of conclusions we will
make based on the observations. The study design provides the information required to assess
the scope of inference of the study. Generally, remember to think
about the research questions the researchers were trying to answer and whether their study
actually would answer those questions. There are no formulas to help us sort some of these
things out, just critical thinking about the context of the measurements.
To make this concrete, consider the data collected from a study (Plaster, 1989) to investigate
whether perceived physical attractiveness had an impact on the sentences or perceived
seriousness of a crime that male jurors might give to female defendants. The researchers showed
the participants in the study (men who volunteered from a prison) pictures of one of three young
women. Each picture had previously been decided to be either beautiful, average, or unattractive
by the researchers. Each "juror" was randomly assigned to one of three levels of this factor
(which is a categorical predictor or explanatory variable) and then each rated their picture on a
variety of traits such as how warm or sincere the woman appeared. Finally, they were told the
women had committed a crime (also randomly assigned to either be told she committed a
burglary or a swindle) and were asked to rate the seriousness of the crime and provide a
suggested length of sentence. We will bypass some aspects of their research and just focus on
differences in the sentence suggested among the three pictures. To get a sense of these data, let's
consider the first and last parts of the data set:
Instead of loading this data set into R using the "Import Dataset" functionality, we can load a R
package that contains the data, making for easy access to this data set. The package
called heplots (Fox, Friendly, and Monette, 2013) contains a data set called MockJury that
contains the results of the study. We will also rely on the R package called mosaic (Pruim,
Kaplan, and Horton, 2014) that was introduced previously. First (but only once), you need to
install both packages, which can be done using the install.packages function with quotes
around the package name:
> install.packages("heplots")
After making sure that the packages are installed, we use the require function around the
package name (no quotes now!) to load the package.
> require(heplots)
> require(mosaic)
To load the data set that is in a loaded package, we use the data function.
> data(MockJury)
Now there will be a data.frame called MockJury available for us to analyze. We can find out
more about the data set as before in a couple of ways. First, we can use the View function to
provide a spreadsheet sort of view in the upper left panel. Second, we can use
the head and tail functions to print out the beginning and end of the data set. Because there
are so many variables, it may wrap around to show all the columns.
> View(MockJury)
> head(MockJury)
> help(MockJury)
With many variables in a data set, it is often useful to get some quick information about all of
them; the summary function provides useful information whether the variables are categorical or
quantitative and notes if any values were missing.
> summary(MockJury)
To accompany the numerical summaries, histograms and boxplots can provide some initial
information on the shape of the distribution of the responses for the suggested sentences
in Years. Figure 1-1 contains the histogram and boxplot of Years, ignoring any information on
which picture the "jurors" were shown. The code is enhanced slightly to make it better labeled:
> hist(MockJury$Years,xlab="Years",labels=T,main="Histogram of Years")
> hist(MockJury$Years,freq=F,xlab="Years",main="Histogram of Years with density curve")
> lines(density(MockJury$Years),lwd=3,col="red")
Figure 1-2: Histogram and density curve of Years data.
Histograms can be sensitive to the choice of the number of bars and even the cut-offs used to
define the bins for a given number of bars. Small changes in the definition of cut-offs for the bins
can have noticeable impacts on the shapes observed but this does not impact density curves. We
are not going to over-ride the default choices for bars in histogram, but we can add information
on the original observations being included in each bar. In the previous display, we can add what
is called a rug to the plot, where a tick mark is made for each observation. Because the responses
were provided as whole years (1, 2, 3, ..., 15), we need to use a graphical technique
called jittering to add a little noise10 to each observation so all observations at each year value do
not plot at the same points. In Figure 1-3, the added tick marks on the x-axis show the
approximate locations of the original observations. We can clearly see how there are 3
observations at 15 (all were 15 and the noise added makes it possible to see them all). The
limitations of the histogram arise around the 10 year sentence area where there are many
responses at 10 years and just one at both 9 and 11 years, but the histogram bars sort of miss
that aspect of the data set. The density curve did show a small bump at 10 years. Density curves
are, however, not perfect and this one shows area for sentences less than 0 years which is not
possible here.
> hist(MockJury$Years,freq=F,xlab="Years",main="Histogram of Years with density curve and rug")
> lines(density(MockJury$Years),lwd=3,col="red")
> rug(jitter(MockJury$Years),col="blue",lwd=2)
Figure 1-3: Histogram and density curve and rug of the jittered responses.
The tools we've just discussed are going to help us move to comparing the distribution of
responses across more than one group. We will have two displays that will help us make these
comparisons. The simplest is the side-by-side boxplot, where a boxplot is displayed for each
group of interest using the same y-axis scaling. In R, we can use its formula notation to see if the
response (Years) differs based on the group (Attr) by using something like Y~X or,
here, Years~Attr. We also need to tell R where to find the variables and use the last option in
the command, data=DATASETNAME, to inform R of the data.frame to look in to find the
variables. In this example, data=MockJury. We will use the formula and data=... options in
almost every function we use from here forward. Figure 1-4 contains the side-by-side boxplots
showing right skew for all the groups, slightly higher median and more variability for
the Unattractive group along with some potential outliers indicated in two of the three groups.
> boxplot(Years~Attr,data=MockJury)
7 You will more typically hear "data is" but that more often refers to information, sometimes even statistical summaries of data sets, than to observations collected as part of a study, suggesting the confusion of this term in the general public. We will explore a data set in Chapter 4 related to perceptions of this issue collected by researchers at http://fivethirtyeight.com.
8 We will try to reserve the term "effect" for situations where random assignment allows us to consider causality as the reason for the differences in the response variable among levels of the explanatory variable, if we find evidence against the null hypothesis of no difference.
9 If you've taken calculus, you will know that the curve is being constructed so that the integral from −∞ to ∞ is 1.
10 Jittering typically involves adding random variability to each observation that is uniformly distributed in a range determined based on the spacing of the observations. If you re-run the jitter function, the results will change. For more details, type help(jitter) in R.
1.1 - Beanplots
by Mark Greenwood and Katharine Banner
The other graphical display for comparing multiple groups we will use is a newer display called
a beanplot (Kampstra, 2008). It provides a side-by-side display that contains the density curve,
the original observations that generated the density curve in a rug-plot, and the mean of each
group. For each group the density curves are mirrored to aid in visual assessment of the shape of
the distribution. This mirroring will often create a shape that resembles a violin with skewed
distributions. Long, bold horizontal lines are placed at the mean for each group. All together this
plot shows us information on the center (mean), spread, and shape of the distributions of the
responses. Our inferences typically focus on the means of the groups and this plot allows us to
compare those across the groups while gaining information on whether the mean is a reasonable
summary of the center of the distribution.
To use the beanplot function we need to install and load the beanplot package. The
function works like the boxplot used previously except that options for log, col,
and method need to be specified. Use these options for any beanplots you
make: log="",col="bisque", method="jitter".
> require(beanplot)
> beanplot(Years~Attr,data=MockJury,log="",col="bisque",method="jitter")
Figure 1-5 reinforces the strong right skews that were also detected in the boxplots previously.
The three large sentences of 15 years can now be clearly viewed, one in the Beautiful group and
two in the Unattractive group. The Unattractive group seems to have more high observations
than the other groups even though the Beautiful group had the largest number of observations
around 10 years. The mean sentence was highest for the Unattractive group and the difference
in the means between Beautiful and Average was small.
Figure 1-5: Beanplot of Years by picture group. Long, bold lines correspond to mean of each
group.
In this example, it appears that the mean for Unattractive is larger than the other two groups. But
is this difference real? We will never know the answer to that question, but we can assess how
likely we are to have seen a result as extreme or more extreme than our result, assuming that
there is no difference in the means of the groups. And if the observed result is (extremely)
unlikely to occur, then we can reject the hypothesis that the groups have the same mean and
conclude that there is evidence of a real difference. We can get means and standard deviations by
groups easily using the same formula notation with the mean and sd functions if
the mosaic package is loaded.
> mean(Years~Attr,data=MockJury)
> favstats(Years~Attr,data=MockJury)
Because comparing two groups is easier than comparing more than two groups, we will start
with comparing the Average and Unattractive groups. We could remove the Beautiful group
observations in a spreadsheet program and read that new data set back into R, but it is easier to
use R to do data management once the data set is loaded. To remove the observations that came
from the Beautiful group, we are going to generate a new variable that we will
call NotBeautiful that is true when observations came from another group
(Average or Unattractive) and false for observations from the Beautiful group. To do this, we
will apply the not equal logical function (!=) to the variable Attr, inquiring whether it was
different from the "Beautiful" level.
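A minimal version of that command, assuming the MockJury data.frame from above is loaded, is:
> NotBeautiful <- MockJury$Attr != "Beautiful"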
> NotBeautiful
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
 [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 ...
 [97]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[109]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
This new variable is only FALSE for the Beautiful responses as we can see if we compare some
of the results from the original and new variable:
> data.frame(MockJury$Attr,NotBeautiful)
MockJury.Attr NotBeautiful
1 Beautiful FALSE
2 Beautiful FALSE
3 Beautiful FALSE
...
20 Beautiful FALSE
21 Beautiful FALSE
22 Unattractive TRUE
23 Unattractive TRUE
24 Unattractive TRUE
25 Unattractive TRUE
26 Unattractive TRUE
...
112 Average TRUE
113 Average TRUE
114 Average TRUE
To get rid of one of the groups, we need to learn a little bit about data management in
R. Brackets ([,]) are used to modify the rows or columns in a data.frame with entries before
the comma operating on rows and entries after the comma on the columns. For example, if you
want to see the results for the 5th subject we can reference the 5th row of the data.frame
using [5,] after the data.frame name:
> MockJury[5,]
> MockJury[5,3]
[1] 7
In R, we can use logical vectors to keep any rows of the data.frame where the variable is true and
drop any rows where it is false by placing the logical variable in the first element of the brackets.
The reduced version of the data set should be saved with a different name such
as MockJury2 that is used here:
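A sketch of that command, using the NotBeautiful variable in the row position of the brackets, is:
> MockJury2 <- MockJury[NotBeautiful, ]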
You will always want to check that the correct observations were dropped either
using View(MockJury2) or by doing a quick summary of the Attr variable in the new
data.frame.
> summary(MockJury2$Attr)
Average Unattractive
38 37
Now the boxplot and beanplots only contain results for the two groups of interest here as seen in
Figure 1-6.
> boxplot(Years~Attr,data=MockJury2)
> beanplot(Years~Attr,data=MockJury2,log="",col="bisque",method="jitter")
The two-sample mean techniques you learned in your previous course start with comparing the
means of the two groups. We can obtain the two means using the mean function or directly obtain
the difference in the means using the compareMean function (both require
the mosaic package). The compareMean function provides x̄Unattractive − x̄Average, where x̄ is the
sample mean of observations in the subscripted group. Note that there are two directions to
compare the means and this function chooses to take the mean from the second group name
alphabetically and subtracts the mean from the first alphabetical group name. It is always good to
check the direction of this calculation as having a difference of -1.84 years versus 1.84 years
could be important to note.
> mean(Years~Attr,data=MockJury2)
Average Unattractive
3.973684 5.810811
> compareMean(Years ~ Attr, data=MockJury2)
[1] 1.837127
Figure 1-6: Boxplot and beanplot of the Years responses on the reduced data set.
There appears to be some evidence that the Unattractive group is getting higher average lengths
of sentences from the mock jurors than the Average group, but we want to make sure that the
difference is real - that there is evidence to reject the assumption that the means are the same "in
the population". First, a null hypothesis11 which defines a null model12 needs to be determined in
terms of parameters (the true values in the population). The research question should help you
determine the form of the hypotheses for the assumed population. In the 2 independent sample
mean problem, the interest is in testing a null hypothesis of H0: μ1=μ2 versus the alternative
hypothesis of HA: μ1≠μ2, where μ1 is the parameter for the true mean of the first group and μ2 is
the parameter for the true mean of the second group. The alternative hypothesis involves
assuming a statistical model where the ith (i=1,...,nj) response from the jth group (j=1,2), yij, is
modeled as yij = μj + εij, where we typically assume that εij ~ N(0,σ²). For the moment, focus on
the models implied by assuming the means are the same (null) or different (alternative):
• Null Model: yij = μ + εij There is no difference in true means for the two groups.
• Alternative Model: yij = μj + εij There is a difference in true means for the two groups.
Suppose we are considering the alternative model for the 4th observation (i=4) from the second
group (j=2), then the model for this observation is y42 = μ2 + ε42. And for, say, the 5th observation
from the first group (j=1), the model is y51 = μ1 + ε51. If we were working with the null model, the
mean is always the same (μ) and the group specified does not change that aspect of the model.
It can be helpful to think about the null and alternative models graphically. By assuming the null
hypothesis is true (means are equal) and that the random errors around the mean follow a normal
distribution, we assume that the truth is as displayed in the left panel of Figure 1-7 - two normal
distributions with the same mean and variability. The alternative model allows the two groups to
potentially have different means, such as those displayed in the right panel of Figure 1-7, but
otherwise assumes that the responses have the same distribution. We assume that the
observations (yij) would either have been generated as samples from the null or alternative model
- imagine drawing observations at random from the pictured distributions. The hypothesis testing
task in this situation involves first assuming that the null model is true and then assessing how
unusual the actual result was relative to that assumption so that we can conclude that the
alternative model is likely correct. The researchers obviously would have hoped to encounter
some sort of noticeable difference in the sentences provided for the different pictures and been
able to find enough evidence to reject the null model where the groups "looked the same".
Figure 1-7: Illustration of the assumed situations under the null (left) and a single possibility
that could occur if the alternative were true (right).
In statistical inference, null hypotheses (and their implied models) are set up as "straw men" with
every interest in rejecting them even though we assume they are true to be able to assess the
evidence against them. Consider the original study design here: the pictures were randomly
assigned to the subjects. If the null hypothesis were true, then we would have no difference in the
population means of the groups. And this would apply if we had done a different random
assignment of the pictures to the subjects. So let's try this: assume that the null hypothesis is true
and randomly re-assign the treatments (pictures) to the observations that were obtained. In other
words, keep the sentences (Years) the same and shuffle the group labels randomly. The technical
term for this is doing a permutation (a random shuffling of the treatments relative to the
responses). If the null is true and the means in the two groups are the same, then we should be
able to re-shuffle the groups to the observed sentences (Years) and get results similar to those we
actually observed. If the null is false and the means are really different in the two groups, then
what we observed should differ from what we get under other random permutations. The
differences between the two groups should be more noticeable in the observed data set than in
(most) of the shuffled data sets. It helps to see this to understand what a permutation means in
this context.
In the mosaic R package, the shuffle function allows us to easily perform a permutation13.
Just one time, we can explore what a permutation of the treatment labels could look like.
> Perm1
The comparison of the beanplots for the real data set and permuted version of the labels is what
is really interesting (Figure 1-8). The original difference in the sample means of the two groups
was 1.84 years (Unattractive minus Average). The sample means are the statistics that estimate
the parameters for the true means of the two groups. In the permuted data set, the difference in
the means is 0.66 years.
> mean(Years ~ PermutedAttr, data=Perm1)
Average Unattractive
4.552632 5.216216
> compareMean(Years ~ PermutedAttr, data=Perm1)
[1] 0.6635846
Figure 1-8: Boxplots of Years responses versus actual treatment groups and permuted groups.
These results suggest that the observed difference was larger than what we got when we did a
single permutation. The important aspect of this is that the permutation is valid if the null
hypothesis is true - this is a technique to generate results that we might have gotten if the null
hypothesis were true. We just need to repeat the permutation process many times and track how
unusual our observed result is relative to this distribution of responses. If the observed
differences are unusual relative to the results under permutations, then there is evidence against
the null hypothesis, the null hypothesis should be rejected (Reject H0) and a conclusion should be
made, in the direction of the alternative hypothesis, that there is evidence that the true means
differ. If the observed differences are similar to (or at least not unusual relative to) what we get
under random shuffling under the null model, we would have a tough time concluding that there
is any real difference between the groups based on our observed data set.
11 The hypothesis of no difference that is typically generated in the hopes of being rejected in favor of the alternative hypothesis which contains the sort of difference that is of interest in the application.
12 The null model is the statistical model that is implied by the chosen null hypothesis. Here, a null hypothesis of no difference will translate to having a model with the same mean for both groups.
13 We'll see the shuffle function in a more common usage below; while the code to generate Perm1 is provided, it isn't something to worry about right now: Perm1<-with(MockJury2,data.frame(Years,Attr,PermutedAttr=shuffle(Attr)))
In any testing situation, you must define some function of the observations that gives us a single
number that addresses our question of interest. This quantity is called a test statistic. These often
take on complicated forms and have names like t or z statistics that relate to their parametric
(named) distributions so we know where to look up p-values. In randomization settings, they can
have simpler forms because we use the data set to find the distribution of the statistic. We will
label our test statistic T (for Test statistic) unless the test statistic has a commonly used name.
Since we are interested in comparing the means of the two groups, we can
define T = x̄Unattractive − x̄Average, which coincidentally is what the compareMean
function provided us previously. We label our observed test statistic (the one from the
original data set) as Tobs = x̄Unattractive − x̄Average, which happened to be 1.84 years here. We will
compare this result to the results for the test statistic that we obtain from permuting the group
labels. To denote permuted results, we will add a * to the labels: T* = x̄Unattractive* − x̄Average*. We then
compare the Tobs = x̄Unattractive − x̄Average = 1.84 to the distribution of results that are possible for the
permuted results (T*), which corresponds to assuming the null hypothesis is true.
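For later use in plots and p-value calculations, it helps to store the observed test statistic in R. A minimal way to do that, re-using the compareMean call from above, is:
> Tobs <- compareMean(Years ~ Attr, data=MockJury2)
> Tobs
[1] 1.837127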
To do permutations, we are going to learn how to write a for loop in R to be able to repeatedly
generate the permuted data sets and record T*. Loops are a basic programming task that make
randomization methods possible as well as potentially simplifying any repetitive computing task.
To write a "for loop", we need to choose how many times we want to do the loop (call that B)
and decide on a counter to keep track of where we are at in the loops (call that b, which goes
from 1 to B). The simplest loop would just involve printing out the index, print(b). This is
our first use of curly braces, { and }, that are used to group the code we want to repeatedly run as
we proceed through the loop. The code in the script window is:
B<-5
for (b in (1:B)){
print(b)
}
And when you highlight and run the code, it will look about the same with "+" printed after the
first line to indicate that all the code is connected, looking like this:
> for (b in (1:B)){
+ print(b)
+ }
When you run these three lines of code, the console will show you the following output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
This is basically the result of running the print function on b as it has values from 1 to 5.
Instead of printing the counter, we want to use the loop to repeatedly compute our test statistic
when permuting observations. The shuffle function will perform permutations of the group
labels relative to responses and the compareMean function will calculate the difference in two
group means. For a single permutation, the combination of shuffling Attr and finding the
difference in the means, storing it in a variable called Ts is:
> Ts<-compareMean(Years ~ shuffle(Attr), data=MockJury2)
> Ts
[1] 0.3968706
And putting this inside the print function allows us to find the test statistic under 5 different
permutations easily:
> for (b in (1:B)){
+ Ts<-compareMean(Years ~ shuffle(Attr), data=MockJury2)
+ print(Ts)
+ }
[1] 0.9302987
[1] 0.6635846
[1] 0.7702703
[1] -1.203414
[1] -0.7766714
Finally, we would like to store the values of the test statistic instead of just printing them out on
each pass through the loop. To do this, we need to create a variable to store the results, let's call
it Tstar. We know that we need to store B results so we will create a vector of length B, containing
B elements, full of missing values (NA) using the matrix function:
> Tstar<-matrix(NA,nrow=B)
> Tstar
[,1]
[1,] NA
[2,] NA
[3,] NA
[4,] NA
[5,] NA
Now we can run our loop B times and store the results in Tstar:
> for (b in (1:B)){
+ Tstar[b]<-compareMean(Years ~ shuffle(Attr), data=MockJury2)
+ }
> Tstar
[,1]
[1,] 1.1436700
[2,] -0.7233286
[3,] 1.3036984
[4,] -1.1500711
[5,] -1.0433855
The Tstar vector, when we set B to be large, say B=1,000, generates the permutation distribution
for the selected test statistic under14 the null hypothesis - what is called the null distribution of
the statistic and also its sampling distribution. We want to visualize this distribution and use it to
assess how unusual our Tobs result of 1.84 years was relative to all the possibilities under
permutations (under the null hypothesis). So we repeat the loop, now with B=1000 and generate
a histogram, density curve and summary statistics of the results:
> B<-1000
> Tstar<-matrix(NA,nrow=B)
> for (b in (1:B)){
+ Tstar[b]<-compareMean(Years ~ shuffle(Attr), data=MockJury2)
+ }
> hist(Tstar,labels=T)
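> plot(density(Tstar),main="Density curve of Tstar") # assumed command: Figure 1-9 also shows a density curve; the main= title is a placeholder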
> favstats(Tstar)
       min         Q1     median        Q3      max       mean        sd    n missing
 -2.536984 -0.5633001 0.02347084 0.6102418 2.903983 0.01829659 0.8625767 1000       0
Figure 1-9 contains visualizations of the results for the distribution of T* and the favstats
summary provides the related numerical summaries. Our observed Tobs of 1.837 seems fairly
unusual relative to these results with only 11 T* values over 2 based on the histogram. We need
to make more specific assessments of the permuted results versus our observed result to be able
to clearly decide whether our observed result is really unusual.
Figure 1-9: Histogram (with counts in bars) and density curve of values of test statistic for 1,000
permutations.
We can enhance the previous graphs by adding the value of the test statistic from the real data
set, as shown in Figure 1-10, using the abline function.
> hist(Tstar,labels=T)
> abline(v=Tobs,lwd=2,col="red")
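> plot(density(Tstar),main="Density curve of Tstar") # assumed command: Figure 1-10 also shows a density curve, and the repeated abline below adds the observed statistic to it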
> abline(v=Tobs,lwd=2,col="red")
Figure 1-10: Histogram and density curve of values of test statistic for 1,000 permutations with
bold line for value of observed test statistic.
Second, we can calculate the exact number of permuted results that were larger than what we
observed. To calculate the proportion of the 1,000 values that were larger than what we
observed, we will use the pdata function. To use this function, we need to provide the cut-off
point (Tobs), the distribution of values to compare to the cut-off (Tstar), and whether we want
the lower or upper tail of the distribution (lower.tail=F option provides the proportion of
values above).
> pdata(Tobs,Tstar,lower.tail=F)
[1] 0.016
The proportion of 0.016 tells us that 16 of the 1,000 permuted results (1.6%) were larger than
what we observed. This type of work is how we can generate p-values using permutation
distributions. P-values are the probability of getting a result as extreme or more extreme than
what we observed, given that the null is true. Finding only 16 permutations of 1,000 that were
larger than our observed result suggests that it is hard to find a result like what we observed if
there really were no difference, although it is not impossible.
When testing hypotheses for two groups, there are two types of alternative hypotheses, one-sided
or two-sided. One-sided tests involve only considering differences in one-direction (like μ1>μ2)
and are performed when researchers can decide a priori15 which group should have a larger
mean. We did not know enough about the potential impacts of the pictures to know which group
should be larger than the other and without much knowledge we could have gotten the direction
wrong relative to the observed results and we can't look at the responses to decide on the
hypotheses. It is often safer and more conservative16 to start with a two-sided alternative (HA:
μ1≠μ2). To do a 2-sided test, find the area larger than what we observed as above. We also need
to add the area in the other tail (here the left tail) similar to what we observed in the right tail.
Here we need to also find how many of the permuted results were smaller than -1.84 years,
using pdata with -Tobs as the cut-off and lower.tail=T:
> pdata(-Tobs,Tstar,lower.tail=T)
[1] 0.015
So the p-value to test our null hypothesis of no difference in the true means between the groups
is 0.016+0.015, providing a p-value of 0.031. Figure 1-11 shows both cut-offs on the histogram
and density curve.
> hist(Tstar,labels=T)
> abline(v=c(-1,1)*Tobs,lwd=2,col="red")
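> plot(density(Tstar),main="Density curve of Tstar") # assumed command: Figure 1-11 also shows a density curve, and the repeated abline below adds the cut-offs to it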
> abline(v=c(-1,1)*Tobs,lwd=2,col="red")
Figure 1-11: Histogram and density curve of values of test statistic for 1,000 permutations with
bold lines for value of observed test statistic and its opposite value required for performing two-
sided test.
In general, the one-sided test p-value is the proportion of the permuted results that are more
extreme than observed in the direction of the alternative hypothesis (lower or upper tail, which
also depends on the direction of the difference taken). For the 2-sided test, the p-value is the
proportion of the permuted results that are less than the negative version of the observed statistic
and greater than the positive version of the observed statistic. Using absolute values, we can
simplify this: the two-sided p-value is the proportion of the |permuted statistics| that are larger
than |observed statistic|. This will always work and finds areas in both tails regardless of whether
the observed statistic is positive or negative. In R, the abs function provides the absolute
value and we can again use pdata to find our p-value:
> pdata(abs(Tobs),abs(Tstar),lower.tail=F)
[1] 0.031
We will discuss the choice of significance level below, but for the moment, assume
a significance level (α) of 0.05. Since the p-value is smaller than α, this suggests that we
can reject the null hypothesis and conclude that there is evidence of some difference in the true
mean sentences given between the two types of pictures.
Before we move on, let's note some interesting features of the permutation distribution of the
difference in the sample means shown in Figure 1-11.
1) It is basically centered at 0. Since we are performing permutations assuming the null model
is true, we are assuming that μ1=μ2 which implies that μ1−μ2= 0 and 0 is always the center of the
permutation distribution.
2) It is approximately normally distributed. This is due to the Central Limit Theorem17,
where the sampling distribution of the difference in the sample means (x̄1 − x̄2) will be
approximately normal if the sample sizes are large enough. This result will allow us to use a
parametric method to approximate this distribution under the null model if some assumptions are
met, as we'll discuss below.
3) Our observed difference in the sample means (1.84 years) is a fairly unusual result relative
to the rest of these results but there are some permuted data sets that produce more extreme
differences in the sample means. When the observed differences are really large, we may not see
any permuted results that are as extreme as what we observed. When pdata gives you 0, the p-
value should be reported to be smaller than 0.001 (not 0!) since it happened in less than 1 in
1000 tries.
4) Since our null model is not specific about the direction of the difference, considering a
result like ours but in the other direction (-1.84 years) needs to be included. The observed result
seems to put about the same area in both tails of the distribution but it is not exactly the same.
The small difference in the tails is a useful aspect of this approach compared to the parametric
method discussed below as it accounts for slight asymmetry in the sampling distribution.
Earlier, we decided that the p-value was small enough to reject the null hypothesis since it was
smaller than our chosen level of significance. In this course, you will often be allowed to use
your own judgement about an appropriate significance level in a particular situation (in other
words, if we forget to tell you an α-level, you can still make a decision using a reasonably
selected significance level). Remembering that the p-value is the probability you would observe
a result like you did (or more extreme), assuming the null hypothesis is true, this tells you that
the smaller the p-value is, the more evidence you have against the null. The next section provides
a more formal review of the hypothesis testing infrastructure, terminology, and some of the things
that can happen when testing hypotheses.
14 We often say "under" in statistics and we mean "given that the following is true".
15 This is a fancy way of saying "in advance", here in advance of seeing the observations.
16 Statistically, a conservative method is one that provides less chance of rejecting the null hypothesis in comparison to some other method or some pre-defined standard.
17 We'll leave the discussion of the CLT to your previous stat coursework or an internet search.
Hypothesis testing is much like a criminal trial where you are in the role of a jury member (or
judge if no jury is present). Initially, the defendant is assumed innocent. In our situation, the true
means are assumed to be equal between the groups. Then evidence is presented and, as a juror,
you analyze it. In statistical hypothesis testing, data are collected and analyzed. Then you have to
decide if there was "enough" evidence to reject the initial assumption ("innocence" is initially
assumed). To make this decision, you want to have previously decided on the standard of
evidence required to reject the initial assumption. In criminal cases, "beyond a reasonable doubt"
is used. Wikipedia's definition suggests that this standard is that "there can still be a doubt, but
only to the extent that it would not affect a reasonable person's belief regarding whether or not
the defendant is guilty". In civil trials, a lower standard called a "preponderance of evidence" is
used. Based on that defined and pre-decided (a priori) measure, you decide that the defendant is
guilty or not guilty. In statistics, we compare our p-value to a significance level, α, which is most
often 5%. If our p-value is less than α, we reject the null hypothesis. The choice of the
significance level is like the variation in standards of evidence between criminal and civil trials -
and in all situations everyone should know the standards required for rejecting the initial
assumption before any information is "analyzed". Once someone is found guilty, then there is the
matter of sentencing which is related to the impacts ("size") of the crime. In statistics, this is
similar to the estimated size of differences and the related judgements about whether the
differences are practically important or not. If the crime is proven beyond a reasonable doubt but
it is a minor crime, then the sentence will be small. With the same level of evidence and a more
serious crime, the sentence will be more dramatic.
There are some important aspects of the testing process to note that inform how we interpret
statistical hypothesis test results. When someone is found "not guilty", it does not mean
"innocent", it just means that there was not enough evidence to find the person guilty "beyond a
reasonable doubt". Not finding enough evidence to reject the null hypothesis does not imply that
the true means are equal, just that there was not enough evidence to conclude that they were
different. There are many potential reasons why we might fail to reject the null, but the most
common one is that our sample size was too small (which is related to having too little evidence).
Throughout the semester, we will continue to re-iterate the distinctions between parameters and
statistics and want you to be clear about the distinctions between estimates based on the sample
and inferences for the population or true values of the parameters of interest. Remember that
statistics are summaries of the sample information and parameters are characteristics of
populations (which we rarely know). In the two-sample mean situation, the sample means are
always at least a little different - that is not an interesting conclusion. What is interesting is
whether we have enough evidence to prove that the population means differ "beyond a
reasonable doubt".
The scope of any inferences is constrained based on whether there is a random sample (RS)
and/or random assignment (RA). Table 1-1 contains the four possible combinations of these two
characteristics of a given study. Random assignment allows for causal inferences for differences that are observed - the difference in treatment levels causes differences in the mean responses.
Random sampling (or at least some sort of representative sample) allows inferences to be made
to the population of interest. If we do not have RA, then causal inferences cannot be made. If we
do not have a representative sample, then our inferences are limited to the sampled subjects.
A simple example helps to clarify how the scope of inference can change. Suppose we are
interested in studying the GPA of students and have a sample mean GPA and a confidence
interval for the population mean GPA available. If we had taken a random sample from, say, the
STAT 217 students in a given semester, our scope of inference would be the population of 217
students in that semester. If we had taken a random sample from the entire MSU population, then
the inferences would be to the entire MSU population in that semester. These are similar types of
problems but the two populations are very different and the group you are trying to make
conclusions about should be noted carefully in your results - it does matter! If we did not have a
representative sample, say the students could choose to provide this information or not, then we
can only make inferences to volunteers. These volunteers might differ in systematic ways from
the entire population of STAT 217 students so we cannot safely extend our inferences beyond
the group that volunteered.
A quick summary of the terminology of hypothesis testing is useful at this point. The null
hypothesis (H0) states that there is no difference or no relationship in the population. This is the
statement of no effect or no difference and the claim that we are trying to find evidence against.
In this chapter, it is always H0: μ1 = μ2. When doing two-group problems, you always need to
specify which group is 1 and which is 2. The alternative hypothesis (H1 or HA) states a specific
difference between parameters. This is the research hypothesis and the claim about the
population that we hope to demonstrate is more reasonable to conclude than the null hypothesis.
In the two-group situation, we can have one-sided alternatives of HA: μ1 > μ2 (greater than) or HA:
μ1 < μ2 (less than) or, the more common, two-sided alternative of HA: μ1 ≠ μ2 (not equal to). We
usually default to using two-sided tests because we often do not know enough to know the
direction of a difference in advance, especially in more complicated situations. The sampling
distribution is the distribution of a statistic under the assumption that H0 is true and is used to
calculate the p-value, the probability of obtaining a result as extreme or more extreme than what
we observed given that the null hypothesis is true. We will find sampling distributions
using nonparametric approaches (like the permutation approach used above) and parametric
methods (using "named" distributions like the t, F, and χ2).
Small p-values are evidence against the null hypothesis because the observed result is
unlikely due to chance if H0 is true. Large p-values provide no evidence against H0 but do not
allow us to conclude that there is no difference. The level of significance is an a priori definition
of how small the p-value needs to be to provide "enough" (sufficient) evidence against H0. This
is most useful to prevent sliding the standards after the results are found. We compare the p-
value to the level of significance to decide if the p-value is small enough to constitute sufficient
evidence to reject the null hypothesis. We use α to denote the level of significance and most
typically use 0.05 which we refer to as the 5% significance level. We compare the p-value to this
level and make a decision. The two options for decisions are to either reject the null hypothesis if
the p-value ≤ α or fail to reject the null hypothesis if the p-value > α. When interpreting
hypothesis testing results, remember that the p-value is a measure of how unlikely the observed
outcome was, assuming that the null hypothesis is true. It is NOT the probability of the data or
the probability of either hypothesis being true. The p-value is a measure of evidence against the
null hypothesis.
The specific definition of α is that it is the probability of rejecting H0 when H0 is true, the probability of what is called a Type I error. Type I errors are also called false rejections. In the two-group mean situation, a Type I error would be concluding that there is a difference in the true means between the groups when none really exists in the population. In the courtroom setting, this is like falsely finding someone guilty. We don't want to do this very often, so we use small values of the significance level, allowing us to control the rate of Type I errors at α. We
also have to worry about Type II errors, which are failing to reject the null hypothesis when it's
false. In a courtroom, this is the same as failing to convict a guilty person. This most often occurs
due to a lack of evidence. You can use the Table 1-2 to help you remember all the possibilities.
Table 1-2: Table of decisions and truth scenarios in a hypothesis testing situation. We never
know the truth in a real situation.
                H0 True             H0 False
FTR H0          Correct decision    Type II error
Reject H0       Type I error        Correct decision
In comparing different procedures, there is an interest in studying the rate or probability of Type
I and II errors. The probability of a Type I error was defined previously as α, the significance
level. The power of a procedure is the probability of rejecting the null hypothesis when it is false.
Power is defined as power = 1 - Probability(Type II error) = Probability(Reject H0 | H0 is false),
or, in words, the probability of detecting a difference when it actually exists. We want to use a
statistical procedure that controls the Type I error rate at the pre-specified level and has high
power to detect false null alternatives. Increasing the sample size is one of the most commonly
used methods for increasing the power in a given situation but sometimes we can choose among
different procedures and use the power of the procedures to help us make that selection. Note
that there are many ways to make H0 false and the power changes based on how false the null
hypothesis actually is. To make this concrete, suppose that the true mean sentences differed by either 1 or 20 years in the previous example. The chances of rejecting the null hypothesis are much larger when the groups actually differ by 20 years than when they differ by just 1 year.
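To make the idea of controlling the Type I error rate concrete, here is a small simulation sketch (not part of the original example) that generates two groups from the same normal population, so the null hypothesis really is true, and estimates how often the equal variance t-test rejects at the 5% level; the group sizes (20 each) and the number of simulated data sets (1,000) are arbitrary choices for illustration:
> set.seed(123)
> reject <- replicate(1000, t.test(rnorm(20), rnorm(20), var.equal=TRUE)$p.value <= 0.05)
> mean(reject)   # proportion of false rejections; should be close to the nominal 0.05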
After making a decision (was there enough evidence to reject the null or not), we want to make
the conclusions specific to the problem of interest. If we reject H0, then we can conclude that
there was sufficient evidence at the α-level that the null hypothesis is wrong (and the results
point in the direction of the alternative). If we fail to reject H0 (FTR H0), then we can conclude
that there was insufficient evidence at the α-level to say that the null hypothesis is wrong. We
are NOT saying that the null is correct and we NEVER accept the null hypothesis. We just
failed to find enough evidence to say it's wrong. If we find sufficient evidence to reject the null,
then we need to revisit the method of data collection and design of the study. This allows us to
consider the scope of the inferences we can make. Can we discuss causality (due to RA) and/or
make inferences to a larger group than those in the sample (due to RS)?
To perform a hypothesis test, there are several steps to complete to make sure you have thought through all the aspects of the results.
2) Assess the "Things To Check" for the procedure being used (discussed below)
5) Make a decision
In developing statistical inference techniques, we need to define the test statistic, T, that
measures the quantity of interest. To compare the means of two groups, a statistic is needed that
measures their differences. In general, for comparing two groups, the choices are simple - a
difference in the means often works well and is a natural choice. There are other options such as
tracking the ratio of means or possibly the difference in medians. Instead of just using the
difference in the means, we could "standardize" the difference in the means by dividing by an
appropriate quantity. It ends up that there are many possibilities for testing using the
randomization (nonparametric) techniques introduced previously. Parametric statistical methods
focus on means because the statistical theory surrounding means is quite a bit easier (not easy,
just easier) than other options. Randomization techniques allow inference for other quantities but
our focus here will be on using randomization for inferences on means to see the similarities with
the more traditional parametric procedures.
In two-sample mean situations, instead of working with the difference in the means, we often calculate a test statistic that is called the equal variance two-independent samples t-statistic. The test statistic is

t = (x̄1 − x̄2) / (sp √(1/n1 + 1/n2)),

where s1² and s2² are the sample variances for the two groups, n1 and n2 are the sample sizes for the two groups, and the pooled sample standard deviation is

sp = √[ ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ].
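As a rough check on this formula, a minimal sketch that plugs in approximate group summaries for these data (Average: n=38, mean about 3.97, sd about 2.82; Unattractive: n=37, mean about 5.81, sd about 4.36 - values that appear in the favstats output later) should reproduce the observed t-statistic of about -2.17:
> sp <- sqrt(((38-1)*2.824^2 + (37-1)*4.364^2)/(38+37-2))   # pooled standard deviation, about 3.67
> (3.974 - 5.811)/(sp*sqrt(1/38 + 1/37))                    # approximately -2.17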
The t-statistic keeps the important comparison between the means in the numerator that we used
before and standardizes (re-scales) that difference so that t will follow a t-distribution (a
parametric "named" distribution) if certain assumptions are met. But first we should see if
standardizing the difference in the means had an impact on our permutation test results. Instead
of using the compareMean function, we will use the t.test function (see its full use below)
and have it calculate the formula for t for us. The R code "$statistic" is basically a way of
extracting just the number we want to use for T from a larger set of output the t.test function
wants to provide you. We will see below that t.test switches the order of the difference (now
it is Average - Unattractive) - always carefully check for the direction of the difference in the
results. Since we are doing a two-sided test, the code resembles the permutation test code in
Section 1.3 with the new t-statistic replacing the difference in the sample means.
The permutation distribution in Figure 1-12 looks similar to the previous results with slightly different x-axis scaling. The observed t-statistic was -2.17 and the proportion of permuted results that were more extreme than the observed result was 0.034. Any difference from the earlier permutation results is due to a different set of random permutations being selected. If you run permutation code, you will often get slightly different results each time you run it. If you are uncomfortable with the variation in the results, you can run more than B=1,000 permutations (say 10,000) and the variability will be reduced further. Usually this uncertainty will not cause any substantive problems - but do not be surprised if your results vary from a colleague's if you are both analyzing the same data set.
> Tobs <- t.test(Years ~ Attr, data=MockJury2, var.equal=T)$statistic; Tobs
        t
 -2.17023
> B <- 1000
> Tstar <- matrix(NA, nrow=B)
> for (b in (1:B)){
+   Tstar[b] <- t.test(Years ~ shuffle(Attr), data=MockJury2, var.equal=T)$statistic
+ }
> hist(Tstar, labels=T)
> abline(v=c(-1,1)*Tobs, lwd=2, col="red")
> pdata(abs(Tobs), abs(Tstar), lower.tail=F)
0.034
Figure 1-12: Permutation distribution of the t-statistic.
The parametric version of these results is based on using what is called the two-independent
sample t-test. There are actually two versions of this test, one that assumes that variances are
equal in the groups and one that does not. There is a rule of thumb that if the ratio of the larger
standard deviation over the smaller standard deviation is less than 2, the equal variance
procedure is ok. It ends up that this assumption is less important if the sample sizes in the groups
are approximately equal and more important if the groups contain different numbers of
observations. In comparing the two potential test statistics, the procedure that assumes equal
variances has a complicated denominator (see the formula above for t involving sp) but a simple
formula for degrees of freedom (df) for the t-distribution (df=n1+n2−2) that approximates the
distribution of the test statistic, t, under the null hypothesis. The procedure that assumes unequal
variances has a simpler test statistic and a very complicated degrees of freedom formula. The
equal variance procedure is most similar to the ANOVA methods we will consider later this
semester so that will be our focus here. Fortunately, both of these methods are readily available
in the t.test function in R if needed.
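Both versions come from the same function; a quick sketch of the two calls (the equal variance version requires the var.equal=T option, while leaving that option off gives R's default, the Welch unequal variance version):
> t.test(Years ~ Attr, data=MockJury2, var.equal=T)   # equal variance two-sample t-test
> t.test(Years ~ Attr, data=MockJury2)                # Welch (unequal variance) version, the default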
If the assumptions for the equal variance t-test are met and the null hypothesis is true, then the
sampling distribution of the test statistic should follow a t-distribution with n1+n2−2 degrees of
freedom. The t-distribution is a bell-shaped curve that is more spread out for smaller values of
degrees of freedom as shown in Figure 1-13. The t-distribution looks more and more like a
standard normal distribution (N(0,1)) as the degrees of freedom increase.
We can find the area to the left of the observed t-statistic of -2.1702 under a t-distribution with 73 degrees of freedom using the pt function:
> pt(-2.1702,df=73,lower.tail=T)
[1] 0.01662286
And we can double it to get the two-sided p-value that t.test provides (0.03324, as seen below), because the t-distribution is symmetric:
> 2*pt(-2.1702,df=73,lower.tail=T)
[1] 0.03324571
More generally, we could always make the test statistic positive using the absolute value, find the area to the right of it, and then double that for a two-sided test p-value:
> 2*pt(abs(-2.1702),df=73,lower.tail=F)
[1] 0.03324571
Permutation distributions do not need to match the named parametric distribution to work
correctly, although this happened in the previous example. The parametric approach, the t-test,
requires certain conditions to be met for the sampling distribution of the statistic to follow the named distribution and provide accurate p-values. The conditions for the equal variance t-test are:
1) Independent observations: the response for one subject should not be related to or influenced by the responses of the other subjects.
2) Equal variances in the groups (because we used a procedure that assumes equal variances! - there is another procedure that allows you to relax this assumption if needed...). To assess this, compare the standard deviations and see if they look noticeably different, especially if the sample sizes differ between groups.
3) Normal distributions of the observations in each group. We'll learn more diagnostics later,
but the boxplots and beanplots are a good place to start to help you look for skews or outliers,
which were both present here. If you find skew and/or outliers, that would suggest a problem
with this condition.
In contrast, the permutation approach only requires independent observations and similar distributions between the groups: it provides valid inferences as long as the two groups have similar shapes and only possibly differ in their centers. In other words, the distributions need not look normal for the permutation procedure to work well.
In the mock jury study, we can assume that the independent observation condition is met because
there is no information suggesting that the same subjects were measured more than once or that
some other type of grouping in the responses was present (like the subjects were divided in
groups and placed in the same room discussing their responses). The equal variance condition
might be violated although we do get some lee-way in this assumption and are still able to get
reasonable results. The standard deviations are 2.8 vs 4.4, so this difference is not "large" according to the rule of thumb. It is, however, close to being considered problematic. It would be difficult to reasonably assume that the normality condition, which is assumed in the derivation of the parametric procedure, is met here (Figure 1-6), with clear right skews in both groups and potential outliers. The shapes look similar for the two groups, so there is less reason to be concerned with using the permutation approach than with the parametric approach.
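Those group standard deviations (2.8 and 4.4) can be pulled out directly using the formula interface from the mosaic package (a small sketch using functions already loaded above):
> sd(Years ~ Attr, data=MockJury2)   # about 2.8 for Average and 4.4 for Unattractive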
18 On exams, you will be asked to describe the area of interest, sketch a picture of the area of interest and/or note the distribution you would use.
19 In some studies, the same subject might be measured in both conditions and this violates the assumptions of this procedure.
In every chapter, we will follow the first example used to explain the methods with a "worked"
example where we focus on the results provided. In a previous semester, some of the STAT 217
students (n=79) provided information on their gender, Age, and current GPA. We might be
interested in whether Males and Females had different average GPAs. First, we can take a look at
the difference in the responses by groups as displayed in Figure 1-15.
> s217 <- read.csv("http://dl.dropboxusercontent.com/u/77307195/s217.csv")
> require(mosaic)
> par(mfrow=c(1,2))
> boxplot(GPA~Sex,data=s217)
> require(beanplot)
> beanplot(GPA~Sex,data=s217,
log="",col="lightblue",method="jitter")
>
> mean(GPA~Sex,data=s217)
F M
3.338378 3.088571
> favstats(GPA~Sex,data=s217)
> compareMean(GPA~Sex,data=s217)
[1] -0.2498069
> t.test(GPA~Sex,data=s217,var.equal=T)
95 percent confidence interval:
 0.06501838 0.43459552
sample estimates:
> Tobs <- t.test(GPA~Sex, data=s217, var.equal=T)$statistic; Tobs
       t
2.691883
> B <- 1000
> Tstar <- matrix(NA, nrow=B)
> for (b in (1:B)){
+   Tstar[b] <- t.test(GPA~shuffle(Sex), data=s217, var.equal=T)$statistic
+ }
> hist(Tstar, labels=T)
> abline(v=c(-1,1)*Tobs, lwd=2, col="red")
> pdata(abs(Tobs), abs(Tstar), lower.tail=F)
0.011
Figure 1-16: Histogram and density curve of permutation distribution of test statistic for STAT
217 GPAs.
Here is a full write-up of the results using all 6+ hypothesis testing steps, using the permutation
results:
Isolate the claim to be proved and method to use (define a test statistic T)
We want to test for a difference in the means between males and females and will use the equal-
variance two-sample t-test statistic to compare them, making a decision at the 5% significance
level.
• H0: μMale = μFemale versus HA: μMale ≠ μFemale
  ◦ where μMale is the true mean GPA for males and μFemale is the true mean GPA for females
• Equal variance condition: There is a small difference in the range of the observations in the two groups but the standard deviations are very similar so there is no evidence that this condition is violated.
• This means that there is about a 1.2% chance we would observe a difference in mean GPA (female-male or male-female) of 0.25 points or more if there is in fact no difference in true mean GPA between females and males in STAT 217 in a particular semester.
5) Decision
•Since the p-value is "small" (a priori 5% significance level selected), we can reject the
null hypothesis.
•There is evidence against the null hypothesis of no difference in the true mean GPA
between males and females for the STAT 217 students in this semester and so we
conclude that there is evidence of a difference in the mean GPAs between males and
females.
•Because this was not a randomized experiment, we can't say that the difference in sex
causes the difference in mean GPA and because it was not a random sample from a larger
population, our inferences only pertain the STAT 217 students that responded to the
survey in that semester.
(Output of table(as.numeric(resample(MockJury2)$orig.ids)) for a first bootstrap sample, listing the original observation IDs that were selected and how many times each was re-sampled.)
A second bootstrap sample is also provided. It did not re-sample observations 1, 2, or 4
but does sample observation 5 three times. You can see other variations in the resulting
re-sampling of subjects.
> table(as.numeric(resample(MockJury2)$orig.ids))
(Output listing the original observation IDs in this second bootstrap sample and the number of times each was re-sampled; observations 1, 2, and 4 do not appear and observation 5 appears three times.)
Each run of the resample function provides a new version of the data set. Repeating
this B times using another for loop, we will track our quantity of interest, say T, in all
these new "data sets" and call those results T*. The distribution of the bootstrapped T*
statistics will tell us about the range of results to expect for the statistic and the middle __
% of the T*'s provides a bootstrap confidence interval for the true parameter - here
the difference in the two population means.
To make this concrete, we can revisit our previous examples, starting with
the MockJury2 data created before and our interest in comparing the mean sentences
for the Average and Unattractive picture groups. The bootstrapping code is very similar
to the permutation code except that we apply the resample function to the entire data
set as opposed to the shuffle function being applied to the explanatory variable.
> Tobs <- compareMean(Years ~ Attr, data=MockJury2); Tobs
[1] 1.837127
> B<- 1000
> Tstar<-matrix(NA,nrow=B)
> for (b in (1:B)){
+ Tstar[b]<-compareMean(Years ~ Attr,
data=resample(MockJury2))
+ }
> hist(Tstar,labels=T)
> plot(density(Tstar),main="Density curve of Tstar")
> favstats(Tstar)
       min       Q1   median       Q3      max     mean        sd    n missing
 -1.252137 1.262018 1.853615 2.407143 5.462006 1.839887 0.8426969 1000       0
In this situation, the observed difference in the mean sentences is 1.84 years
(Unattractive-Average), which is the vertical line in Figure 1-17. The bootstrap
distribution shows the results for the difference in the sample means when fake data sets
are re-constructed by sampling from the data set with replacement. The bootstrap
distribution is approximately centered at the observed value and relatively symmetric.
Figure 1-18: Histogram and density curve of bootstrap distribution with 95% bootstrap
confidence intervals displayed (vertical lines).
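The vertical lines in Figure 1-18 come from the middle 95% of the bootstrap distribution; a minimal sketch of how those endpoints can be extracted with the qdata function (assuming the Tstar results generated above):
> quantiles <- qdata(c(.025,.975), Tstar)   # 2.5th and 97.5th percentiles of the bootstrap distribution
> quantiles                                 # endpoints of the 95% bootstrap confidence interval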
Although confidence intervals can exist without referencing hypotheses, we can revisit
our previous hypotheses and see what this confidence interval tells us about the test of
H0: μUnattr = μAve. This null hypothesis is equivalent to testing H0: μUnattr - μAve=0, that the
difference in the true means is equal to 0 years. And the difference in the means was the
scale for our confidence interval, which did not contain 0 years. We will call 0 an
interesting reference value for the confidence interval, because here it is the value where the true means are equal to each other (have a difference of 0 years). In general, if our confidence
interval does not contain 0, then it is saying that 0 is not one of our likely values for the
difference in the true means. This implies that we should reject a claim that they are
equal. This provides the same inferences for the hypotheses that we considered
previously using both a parametric and permutation approach. The general summary is
that we can use confidence intervals to test hypotheses by assessing whether the reference
value under the null hypothesis is in the confidence interval (FTR H0) or outside the
confidence interval (Reject H0).
As in the previous situation, we also want to consider the parametric approach for
comparison purposes and to have that method available for the rest of the semester. The
parametric confidence interval is called the equal variance, two-sample t-based
confidence interval and assumes that the populations being sampled from are normally
distributed and leads to using a t-distribution to form the interval. The output from
the t.test function provides the parametric 95% confidence interval calculated for
you:
> t.test(Years ~ Attr, data=MockJury2,var.equal=T)
Two Sample t-test
data: Years by Attr
t = -2.1702, df = 73, p-value = 0.03324
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
-3.5242237 -0.1500295
sample estimates:
The t*df is a multiplier that comes from finding the percentile from the t-distribution that
puts C% in the middle of the distribution with C being the confidence level. It is
important to note that this t* has nothing to do with the previous test statistic t. It is
confusing and many of you will, at some point, happily take the result from a test statistic
calculation and use it for a multiplier in a t-based confidence interval. Figure 1-19 shows
the t-distribution with 73 degrees of freedom and the cut-offs that put 95% of the area in
the middle.
Figure 1-19: Plot of t(73) with cut-offs for putting 95% of the distribution in the middle.
For 95% confidence intervals, the multiplier is going to be close to 2 - anything else is a
sign of a mistake. We can use R to get the multipliers for us using the qt function in a
similar fashion to how we used qdata in the bootstrap results, except that this new value
must be used in the previous formula. This function produces values for requested
percentiles. So if we want to put 95% in the middle, we place 2.5% in each tail of the
distribution and need to request the 97.5th percentile. Because the t-distribution is always
symmetric around 0, we merely need to look up the value for the 97.5th percentile. The t*
multiplier to form the confidence interval is 1.993 for a 95% confidence interval when
the df=73 based on the results from qt:
> qt(.975,df=73)
[1] 1.992997
Note that the 2.5th percentile is just the negative of this value due to symmetry and the
real source of the minus in the plus/minus in the formula for the confidence interval.
> qt(.025,df=73)
[1] -1.992997
We can also re-write the general confidence interval formula more simply as

x̄1 − x̄2 ± ME,

where the margin of error is ME = t*df · SE(x̄1 − x̄2) and the standard error of the difference in the sample means is SE(x̄1 − x̄2) = sp √(1/n1 + 1/n2).
In some situations, researchers will report the standard error (SE) or margin of
error (ME) as a method of quantifying the uncertainty in a statistic. The SE is an estimate
of the standard deviation of the statistic (here x̄1 − x̄2) and the ME is an estimate of the
precision of a statistic that can be used to directly form a confidence interval. The ME
depends on the choice of confidence level although 95% is almost always selected.
To finish this example, we can use R to help us do calculations much like a calculator
except with much more power "under the hood". You have to make sure you are careful
with using ( ) to group items and remember that the asterisk (*) is used for
multiplication. To do this, we need the pertinent information which is available from the
bolded parts of the favstats output repeated below.
> favstats(Years~Attr,data=MockJury2)
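The calculation itself is not reproduced here, but a minimal sketch of what it would look like, plugging in approximate group summaries from favstats (Average: n=38, mean about 3.974, sd about 2.824; Unattractive: n=37, mean about 5.811, sd about 4.364), is:
> sp <- sqrt(((38-1)*2.824^2 + (37-1)*4.364^2)/(38+37-2))        # pooled standard deviation
> 5.811 - 3.974 + c(-1,1)*qt(.975, df=73)*sp*sqrt(1/38 + 1/37)   # roughly 0.15 to 3.52; flipping the signs matches the t.test interval above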
We can now repeat the methods on the STAT 217 grade data. This time we can start with the
parametric 95% confidence interval "by hand" and then using t.test. The favstats output
provides us with the required information to do this ourselves:
> favstats(GPA~Sex,data=s217)
> sp=sqrt(((37-1)*(0.4075^2)+(42-1)*(0.41518^2))/(37+42-2))
> sp
[1] 0.4116072
> qt(.975,df=77)*sp*sqrt(1/37+1/42)
[1] 0.1847982
> 3.338-3.0886+c(-1,1)*qt(.975,df=77)*sp*sqrt(1/37+1/42)
So we are 95% confident that the difference in the true mean GPAs between females and males (females minus males) is between 0.065 and 0.434 GPA points. We get a similar20 result from the
bolded part of the t.test output:
> t.test(GPA~Sex,data=s217,var.equal=T)
95 percent confidence interval:
 0.06501838 0.43459552
sample estimates:
The t* multipliers change with the confidence level; for 90% and 99% intervals with df=77 they are:
> qt(.95,df=77)
[1] 1.664885
> qt(.995,df=77)
[1] 2.641198
> t.test(GPA~Sex,data=s217,var.equal=T,conf.level=.90)
90 percent confidence interval:
 0.09530553 0.40430837
> t.test(GPA~Sex,data=s217,var.equal=T,conf.level=.99)
99 percent confidence interval:
 0.004703598 0.494910301
As a review of some basic ideas with confidence intervals, make sure you can answer the following questions:
2) What happens to the width of the confidence interval if the size of the SE increases or decreases?
3) What about increasing the sample size - should that increase or decrease the width of the interval?
All of the general results you learned before about impacts to widths of CIs hold in this situation
whether we are considering the parametric or bootstrap methods.
To finish this example, we will generate the comparable bootstrap 90% confidence interval using
the bootstrap distribution in Figure 1-20.
> Tobs <- compareMean(GPA~Sex, data=s217); Tobs
[1] -0.2498069
> par(mfrow=c(1,2))
> B <- 1000
> Tstar <- matrix(NA, nrow=B)
> for (b in (1:B)){
+   Tstar[b] <- compareMean(GPA~Sex, data=resample(s217))
+ }
> qdata(.05,Tstar)
p quantile
0.0500000 -0.3974425
> qdata(.95,Tstar)
p quantile
0.9500000 -0.1147324
> quantiles<-qdata(c(.05,.95),Tstar)
> quantiles
quantile p
5% -0.3974425 0.05
95% -0.1147324 0.95
The output tells us that the 90% confidence interval is from -0.397 to -0.115 GPA points. The
bootstrap distribution with the observed difference in the sample means and these cut-offs is
displayed in Figure 1-20 using this code:
> hist(Tstar,labels=T)
> abline(v=Tobs,col="red",lwd=2)
> abline(v=quantiles$quantile,col="blue",lwd=3,lty=2)
> plot(density(Tstar),main="Density curve of Tstar")
> abline(v=Tobs,col="red",lwd=2)
> abline(v=quantiles$quantile,col="blue",lwd=3,lty=2)
In the previous output, the parametric 90% confidence interval is from 0.095 to 0.404, suggesting
similar results again from the two approaches once you account for the two different orders of
differencing. There was a slight left skew in the bootstrap distribution with one much smaller
difference observed which generated some of the observed difference in the results. Based on the
bootstrap CI, we can say that we are 90% confident that the difference in the true mean GPAs for STAT 217 students is between -0.397 and -0.115 GPA points (males minus females). Because sex
cannot be assigned to the subjects, we cannot infer that sex is causing this difference and because
this was a voluntary response sample of STAT 217 students in a given semester, we cannot infer
that a difference of this size would apply to all STAT 217 students or even students in another
semester.
Figure 1-20: Histogram and density curve of bootstrap distribution of difference in sample mean
GPAs (male minus female) with observed difference (solid vertical line) and quantiles that
delineate the 90% confidence intervals (dashed vertical lines).
Throughout the semester, pay attention to the distinctions between parameters and statistics,
focusing on the differences between estimates based on the sample and inferences for the
population of interest in the form of the parameters of interest. Remember that statistics are
summaries of the sample information and parameters are characteristics of populations (which
we rarely know). And that our inferences are limited to the population that we randomly sampled
from, if we randomly sampled.
20 We rounded the means a little and that caused the small difference in results.
1.8b - Chapter summary
by Mark Greenwood and Katharine Banner
In this chapter, we reviewed basic statistical inference methods in the context of a two-sample
mean problem. You were introduced to using R to do permutation testing and generate bootstrap
confidence intervals as well as obtaining parametric t-test and confidence intervals in this same
situation. You should have learned how to use a for loop for doing the nonparametric
inferences and the t.test function for generating parametric inferences. In the two examples
considered, the parametric and nonparametric methods provided similar results, suggesting that
the assumptions were at least close to being met for the parametric procedures. When parametric
and nonparametric approaches disagree, the nonparametric methods are likely to be more
trustworthy since they have less restrictive assumptions but can still have problems. When the
noted conditions are not met in a hypothesis testing situation, the Type I error rates can be inflated, meaning that we reject the null hypothesis more often than the chosen significance level allows. Specifically, we could have a situation where our assumed 5% significance level test might actually reject the null when it is true 20% of the time. If this is occurring, we call a
procedure liberal (it rejects too easily) and if the procedure is liberal, how could we trust a small
p-value to be a "real" result and not just an artifact of violating the assumptions of the procedure?
Likewise, for confidence intervals we hope that our 95% confidence level procedure, when
repeated, will contain the true parameter 95% of the time. If our assumptions are violated, we
might actually have an 80% confidence level procedure and it makes it hard to trust the reported
results for our observed data set. Statistical inference relies on a belief in the methods underlying
our inferences. If we don't trust our assumptions, we shouldn't trust the conclusions to perform
the way we want them to. As sample sizes increase and violations of conditions lessen, then the
procedures will perform better. In Chapter 2, we'll learn some new tools for doing diagnostics to
help us assess how much those conditions are violated.
The main components of R code used in this chapter follow with components to modify in red,
remembering that any R packages mentioned need to be installed and loaded for this code to
have a chance of working:
• summary(DATASETNAME)
◦ Provides numerical summaries of all variables in the data set.
• t.test(Y~X,data=DATASETNAME,conf.level=0.95)
◦ Provides two-sample t-test test statistic, df, p-value, and 95% confidence interval.
• 2*pt(abs(Tobs),df=DF,lower.tail=F)
◦ Finds the two-sided test p-value for an observed 2-sample t-test statistic of Tobs.
• hist(DATASETNAME$Y)
◦ Makes a histogram of a variable named Y from the data set of interest.
• boxplot(Y~X,data=DATASETNAME)
◦ Makes a boxplot of a variable named Y for groups in X from the data set.
• beanplot(Y~X,data=DATASETNAME)
◦ Makes a beanplot of a variable named Y for groups in X from the data set.
• mean(Y~X,data=DATASETNAME); sd(Y~X,data=DATASETNAME)
◦ Provides the mean and sd of responses of Y for each group described in X.
• favstats(Y~X,data=DATASETNAME)
◦ Provides numerical summaries of Y by groups described in X.
• Tstar<-matrix(NA,nrow=B)
  for (b in (1:B)){
    Tstar[b]<-t.test(Y~shuffle(X),data=DATASETNAME,var.equal=T)$statistic
  }
  ◦ Code to run a for loop to generate 1000 permuted versions of the test statistic using the shuffle function and keep track of the results in Tstar.
• pdata(abs(Tobs),abs(Tstar),lower.tail=F)
  ◦ Finds the proportion of the permuted test statistics in Tstar that are less than -|Tobs| or greater than |Tobs|.
• Tstar<-matrix(NA,nrow=B)
  for (b in (1:B)){
    Tstar[b]<-compareMean(Y~X,data=resample(DATASETNAME))
  }
  ◦ Code to run a for loop to generate 1000 bootstrapped versions of the data set using the resample function and keep track of the results of the statistic in Tstar.
• qdata(c(0.025,0.975),Tstar)
◦ Provides the values that delineate the middle 95% of the results in the bootstrap
distribution (Tstar).
Load the HELPrct data set from the mosaicData package. The HELP study was a clinical
trial for adult inpatients recruited from a detoxification unit. Patients with no primary care
physician were randomly assigned to receive a multidisciplinary assessment and a brief
motivational intervention or usual care and various outcomes were observed. Two of the
variables in the dataset are sex, a factor with levels (male and female) and daysanysub, time
(in days) to first use of any substance post-detox. We are interested in the difference in mean
number of days to first use of any substance post-detox between males and females. There are
some missing responses and the following code will produce favstats with the missing values included and then provide a data set with only the complete observations by applying the na.omit function, which removes any observations with missing values.
data(HELPrct)
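The rest of that code is not reproduced here; a minimal sketch matching the description (the subsetting to the two variables of interest and the HELPrct3 name are assumptions, with the name taken from the questions below) could be:
> favstats(daysanysub ~ sex, data=HELPrct)                  # summaries, including counts of missing responses
> HELPrct3 <- na.omit(HELPrct[, c("daysanysub", "sex")])    # keep only complete observations on these two variables
> favstats(daysanysub ~ sex, data=HELPrct3)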
1.1. Based on the results provided, how many observations were missing for males and females? Missing values here likely mean that the subjects didn't use any substances post-detox during the time of the study. This is called censoring. What is the problem with the numerical summaries if the missing responses were all something larger than the largest observation?
1.2. Make a beanplot and a boxplot of daysanysub ~ sex using the HELPrct3 data
set created above. Compare the distributions, recommending parametric or nonparametric
inferences.
1.3. Generate the permutation results and write out the 6+ steps of the hypothesis test,
making sure to note the numerical value of observed test statistic you are using. Include
scope of inference.
1.5. Generate the parametric t.test results, reporting the test-statistic, its distribution under
the null hypothesis, and compare the p-value to those observed using the permutation
approach.
1.6. Make and interpret a 95% bootstrap confidence interval for the difference in the
means.
In Chapter 1, tools for comparing the means of two groups were considered. More generally,
these methods are used for a quantitative response and a categorical explanatory variable (group)
which had two and only two levels. The MockJury data set actually contained three groups
(Figure 2-1) with Beautiful, Average, and Unattractive rated pictures randomly assigned to the
subjects for sentence ratings. In a situation with more than two groups, we have two choices.
First, we could rely on our two group comparisons, performing tests for every possible pair
(Beautiful vs Average, Beautiful vs Unattractive, and Average vs Unattractive). We spent
Chapter 1 doing inferences for differences between Average and Unattractive. The other two
comparisons would lead us to initially end up with three p-values and no direct answer about our
initial question of interest - is there some overall difference in the average sentences provided
across the groups? In this chapter, we will learn a new method, called Analysis of Variance,
ANOVA, that directly assesses whether there is evidence of some overall difference in the means
among the groups. This version of an ANOVA is called a One-Way ANOVA since there is just
one21 grouping variable. After we perform our One-Way ANOVA test for overall evidence of a
difference, we will revisit the comparisons similar to those considered in Chapter 1 to get more
details on specific differences among the pairs of groups - what we call pair-wise comparisons.
An issue is created when you perform many tests simultaneously; we will augment our previous methods with an adjusted method for pairwise comparisons, called Tukey's Honest Significant Difference, to keep our results valid.
To make this more concrete, we return to the original MockJury data, making side-by-side boxplots and beanplots (Figure 2-1) as well as summarizing the sentences for the three groups using favstats.
> require(heplots)
> require(mosaic)
> data(MockJury)
> par(mfrow=c(1,2))
> boxplot(Years~Attr,data=MockJury)
> beanplot(Years~Attr,data=MockJury,log="",col="bisque",method="jitter")
> favstats(Years~Attr,data=MockJury)
  .group        min Q1 median   Q3 max     mean       sd  n missing
1 Beautiful       1  2      3  6.5  15 4.333333 3.405362 39       0
2 Average         1  2      3  5.0  12 3.973684 2.823519 38       0
3 Unattractive    1  2      5 10.0  15 5.810811 4.364235 37       0
There are slight differences in the sample sizes in the three groups with 37 Unattractive, 38 Average, and 39 Beautiful group responses, providing a data set with a total sample size of N=114. The Beautiful and Average groups do not appear to be very different with means of 4.33
and 3.97 years. In Chapter 1, we found moderate evidence regarding the difference
in Average and Unattractive. It is less clear whether we might find evidence of a difference
between Beautiful and Unattractive groups since we are comparing means of 5.81 and 4.33
years. All the distributions appear to be right skewed with relatively similar shapes. The
variability in Average and Unattractive groups seems like it could be slightly different leading to
an overall concern of whether the variability is the same in all the groups.
Figure 2-1: Boxplot and beanplot of the sentences (years) for the three treatment groups.
We introduced the statistical model yij = μj + εij in Chapter 1 for the situation with j = 1 or 2 to
denote a situation where there were two groups and, for the alternative model, the means
differed. Now we have three groups and the previous model can be extended to this new
situation by allowing j to be 1, 2, or 3. Now that we have more than two groups, we need to
admit that what we were doing in Chapter 1 was actually fitting what is called a linear model.
The linear model assumes that the responses follow a normal distribution with the linear model
defining the mean, all observations have the same variance, and the parameters for the mean in
the model enter linearly. This last condition is hard to explain at this level of material - it is sufficient to know that there are models where the parameters enter the model nonlinearly and that they are beyond the scope of this course. The result of this constraint is that we will be able to
use the same general modeling framework for the rest of the course.
As in Chapter 1, we have a null hypothesis that defines a situation (and model) where all the
groups have the same mean. Specifically, the null hypothesis in the general situation
with J groups (J≥2) is to have all the true group means equal,

H0: μ1 = μ2 =... = μJ.

The alternative hypothesis is that not all of the μj are equal or, in words, at least one of the true means differs among the J groups. You will be attracted to trying to say that all means are different in the alternative but we do not put this strict a requirement in place to reject the null hypothesis. The alternative model allows all the true group means to differ but does not require that they all differ, and it can be written as

yij = μj + εij.

This linear model states that the response for the ith observation in the jth group, yij, is modeled
with a group j (j=1,...,J) population mean, μj, and a random error for each subject in each group,
εij, that we assume follows a normal distribution and that all the random errors have the same
variance, σ2. We can write the assumption about the random errors, often called the normality
assumption, as εij~N(0,σ2). There is a second way to write out this model that will allow
extensions to more complex models discussed below, so we need a name for this version of the
model. The model written in terms of the μj's is called the cell means model and is the easier version of this model to understand.
One of the reasons we learned about beanplots is that it helps us visually consider all the aspects
of this model. In the right panel of Figure 2-1, we can see the wider, bold horizontal lines that
provide the estimated group means. The bigger the differences, the more likely we are to find
evidence against the null hypothesis. You can also see the null model on the plot, which assumes all the groups have the same mean, as displayed in the dashed horizontal line at 4.7 years (the R code below shows the overall mean of Years is 4.7). While the hypotheses focus on the means, the model also contains assumptions about the distribution of the responses - specifically that the distributions are normal and that all the groups have the same variability. As discussed
previously, it appears that the distributions are right skewed and the variability might not be the
same for all the groups. The boxplot provides the information about the skew and variability but
since it doesn't display the means it is not directly related to the linear model and hypotheses we
are considering.
> mean(MockJury$Years)
[1] 4.692982
There is a second way to write out the One-Way ANOVA model that will allow extensions to
more complex models in Chapter 3. The other parameterization (way of writing out or defining)
of the model is called the reference-coded model since it writes out the model in terms of a
baseline group and deviations from that baseline or reference level. The reference-coded model
for the ith subject in the jth group is yij = α + τj + εij where α (alpha) is the true mean for the
baseline group (first alphabetically) and the τj (tau j) are the deviations from the baseline group
for group j. The deviation for the baseline group, τ1, is always set to 0 so there are really just
deviations for groups 2 through J. The equivalence between the two models can be seen by considering the mean for the first, second, and Jth groups in both models:

Group   Cell means model   Reference-coded model
1       μ1                 α
2       μ2                 α + τ2
J       μJ                 α + τJ

In the reference-coded version, the null hypothesis of no difference in the true group means becomes
H0: τ2 =... = τJ = 0.
You are welcome to use either version unless we instruct you to use a particular version in this
chapter but we have to use the reference-coding in subsequent chapters. The next task is to learn
how to use R's linear model (lm) function to get estimates of the parameters in each model, but
first a review of these new ideas:
Cell-means version:
• H0: μ1 =... = μJ HA: Not all μj equal
• Null hypothesis in words: No difference in the true means between the groups.
• Alternative hypothesis in words: At least one of the true means differs between the groups.
Reference-coded version:
• H0: τ2 =... = τJ = 0 HA: Not all τj equal 0
• Null hypothesis in words: No deviation of the true mean for any groups from the baseline
group.
• Alternative hypothesis in words: At least one of the true deviations is different from 0 or that at
least one group has a different true mean than the baseline group.
In order to estimate the models discussed above, the lm function will be used. If you look closely in the code for the rest of the semester, any model for a quantitative response will use this function, suggesting a common thread in the most commonly used statistical models. The lm function continues to use the same format as previous functions, lm(Y~X,data=datasetname). It ends up that this code will give you the reference-coded version of the model by default. We want to start with the cell-means version of the model, so we have to add a "-1" to the formula interface to tell R that we want the cell-means coding. Generally, this looks like lm(Y~X-1,data=datasetname) and you will
find a row of output for each group. It will contain columns for an estimate (Estimate),
standard error (Std. Error), t-value (t value), and p-value (Pr(>|t|)). We'll learn to
use all of the output in the following material, but for now we will just focus on the estimates of
the parameters that the function provides that we put in bold.
> lm1 <- lm(Years~Attr-1, data=MockJury)
> summary(lm1)
Coefficients:
> mean(Years~Attr,data=MockJuryR)
> lm2 <- lm(Years~Attr, data=MockJury)
> summary(lm2)
Coefficients:
Remember that this is the standard version of the linear model so it will be something that gets
used repeatedly this semester. The estimated model coefficients are α̂ = 4.333 years, τ̂2 =-0.3596
years, and τ̂3 =1.4775 years where group 1 is Beautiful, 2 is Average, and 3 is Unattractive. The
way you can figure out the baseline group (group 1 is Beautiful here) is to see which category
label is not present in the output. The baseline level is typically the first group label
alphabetically, but you should always check this. Based on these definitions, there are
interpretations available for each coefficient. For α̂ = 4.333 years, this is an estimate of the mean
sentencing time for the Beautiful group. τ̂2 =-0.3596 years is the deviation of the Average group's mean from the Beautiful group's mean (specifically, it is 0.36 years lower). Finally, τ̂3 =1.4775
years tells us that the Unattractive group mean sentencing time is 1.48 years higher than
the Beautiful group mean sentencing time. These interpretations lead directly to reconstructing
the estimated means for each group by combining the baseline and pertinent deviations as shown
in Table 2-1.
Table 2-1: Constructing group mean estimates from the reference-coded linear model estimates.
Group          Formula    Estimates
Beautiful      α̂          4.3333 years
Average        α̂ + τ̂2     4.3333 - 0.3596 = 3.974 years
Unattractive   α̂ + τ̂3     4.3333 + 1.4775 = 5.811 years
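This reconstruction can also be done directly in R from the fitted model; a quick sketch using the lm2 object from above:
> coef(lm2)                             # alpha-hat, tau2-hat, and tau3-hat
> coef(lm2)[1] + c(0, coef(lm2)[2:3])   # group means: Beautiful, Average, Unattractive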
We can also visualize the results of our linear models using what are called term or effect
plots (from the effects package; Fox, 2003) as displayed in Figure 2-2 (we don't want to use
"effect" unless we have random assignment in the study design so we will mainly call these term
plots). These plots take an estimated model and show you its estimates along with 95%
confidence intervals generated by the linear model, which will be especially useful for some of
the more complicated models encountered later in the semester. To make this plot, you need to install and load the effects package and then use the plot and allEffects functions together on the lm object called lm2 generated above. You can find the correspondence between
the displayed means and the estimates that were constructed in Table 2-1.
> require(effects)
> plot(allEffects(lm2))
Figure 2-2: Plot of the estimated group mean sentences from the reference-coded model for the
MockJury data.
In order to assess evidence for having different means for the groups, we will compare either of
the previous models (cell-means or reference-coded) to a null model based on the null hypothesis
(H0: μ1 =... = μJ) which implies a model of yij = μ + εij in the cell-means version where μ is a
common mean for all the observations. We will call this the mean-only model since it is boring
and only has a single mean in it. In the reference-coding version of the model, we have a null
hypothesis that H0: τ2 =... = τJ = 0, so the "mean-only" model is yij = α + εij with α having the
same definition as μ for the cell means model - it forces a common estimate for every group.
The mean-only model is also an example of a reduced model where we set some coefficients in
the model to 0 and get a simpler model. Simple can be good as it is easy to interpret, but having a
model for J groups that suggests no difference in the groups is not a very exciting result in most,
but not all, situations. In order for R to provide results for the mean-only model, we remove the
grouping variable, Attr, from the model formula and just include a "1". The (Intercept)
row of the output provides the estimate for either model when we assume that the mean is the
same for all groups:
> lm3 <- lm(Years~1, data=MockJury)
> summary(lm3)
Coefficients:
This model provides an estimate of the common mean for all observations of 4.693 = μ̂=α̂ years.
This value also is the dashed, horizontal line in the beanplot in Figure 2-1.
The previous discussion showed two ways of estimating the model but still hasn't addressed how
to assess evidence related to whether the observed differences in the means among the groups is
"real". In this section, we develop what is called the ANOVA F-test that provides a method of
aggregating the differences among the means of 2 or more groups and testing our null hypothesis
of no difference in the means vs the alternative. In order to develop the test, some additional
notation needs to be defined. The sample size in each group is denoted nj and the total sample
size is N=Σnj = n1+n2+...+nJ where Σ (capital sigma) means "add up over whatever follows". An
estimated residual (eij) is the difference between an observation, yij, and the model estimate, ŷij = μ̂j, for that observation: eij = yij − ŷij. It is basically what is left over that the mean part of the model (μ̂j) does not explain and is our window into how "good" the model might be.
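In R, the model estimates and the estimated residuals are available directly from a fitted lm object; a small sketch using the lm2 model fit above:
> head(fitted(lm2))      # model estimates (the group means) for the first few observations
> head(residuals(lm2))   # the corresponding estimated residuals, eij = yij - muhat_j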
Figure 2-3: Demonstration of different amount of difference in means relative to variability.
Consider the four different fake results for a situation with four groups in Figure 2-3. In Situation
1, it looks like there is little evidence for a difference in the means and in Situation 2, it looks
fairly clear that there is a difference in the group means. Why? It is because the variation in the
means looks "clear" relative to the variation around the means. Consider alternate versions of
each result in Situations 3 and 4 and how much evidence there appears to be for the same sizes of differences in the means. In the plots, there are two sources of variability in the responses - how
much the group means vary across the groups and how much variability there is around the
means in each group. So we need a test statistic to help us make some sort of comparison of the
groups and to account for the amount of variability present around the means. The statistic is
called the ANOVA F-statistic. It is developed using sums of squares, which are measures of total variation like the one used in the numerator of the standard deviation: take all the observations, subtract the mean, square the differences, and then add up the results over all the observations to generate a measure of total variability. With multiple groups, we will
focus on decomposing that total variability (Total Sums of Squares) into variability among the
means (we'll call this Explanatory Variable A's Sums of Squares) and variability in the
residuals or errors (Error Sums of Squares). We define each of these quantities in the One-Way
ANOVA situation as follows:
• Total Sums of Squares: SSTotal = Σ(yij − ȳ)², the total variation of all N responses around the grand mean, ȳ.
  ◦ Note: this is the residual variation if the null model is used, so there is no further decomposition possible for that model.
  ◦ This is also equivalent to the numerator of the sample variance, which is what you get when you ignore the information on the potential differences in the groups.
• Explanatory Variable A's Sums of Squares: SSA = Σ nj(ȳj − ȳ)², where ȳj is the sample mean for group j.
  ◦ Variation in the group means around the grand mean based on explanatory variable A.
• Error Sums of Squares: SSE = Σ(yij − ȳj)², the variation in the responses around their own group means.
With these definitions, SSTotal = SSA + SSE, which we can verify with the sums of squares from the ANOVA table for lm2:
> anova(lm2)
Response: Years
> 70.94+1421.32
[1] 1492.26
One way to think about SSA is that it is a function that converts the variation in the group means
into a single value. This makes it a reasonable test statistic in a permutation testing context. By
comparing the observed SSA=70.9 to the permutation results of 6.7, 6.6, and 11 we see that the
observed result is much more extreme than the three alternate versions. In contrast to our
previous test statistics where positive and negative differences were possible, SSA is always
positive with a value of 0 corresponding to no variation in the means. The larger the SSA, the
more variation there was in the means. The permutation p-value for the alternative hypothesis
of some (not of greater or less than!) difference in the true means of the groups will involve
counting the number of permuted SSA* results that are larger than what we observed.
Figure 2-4: Plot of means and 95% confidence intervals for the three groups for the real data (a) and three different permutations of the treatment labels to the same responses in (b), (c), and (d).
To do a permutation test, we need to be able to calculate and extract the SSA value. In the
ANOVA table, it is in the first row and is the second number and we can use the [,] referencing
to extract that number from the ANOVA table that anova produces (anova(lm(Years~Attr,data=MockJury))[1,2]). We'll store the observed value of SSA in Tobs:
> Tobs <- anova(lm(Years~Attr,data=MockJury))[1,2]; Tobs
[1] 70.93836
The following code performs the permutations using the shuffle function and then makes a
plot of the resulting permutation distribution:
> B<-1000
> Tstar<-matrix(NA,nrow=B)
> for (b in (1:B)){
+   Tstar[b]<-anova(lm(Years~shuffle(Attr),data=MockJury))[1,2]
+ }
> hist(Tstar,labels=T)
> abline(v=Tobs,col="red",lwd=3)
> abline(v=Tobs,col="red",lwd=3)
Figure 2-5: Permutation distributions of SSA with the observed value of SSA (bold, vertical line).
The right-skewed distribution (Figure 2-5) contains the distribution of SSA*'s under permutations
(where all the groups are assumed to be equivalent under the null hypothesis). While the
observed result is larger than many SSA*'s, there are also many results that are much larger than
observed that showed up when doing permutations. The proportion of permuted results that
exceed the observed value is found using pdata as before, except only for the area to the right
of the observed result. We know that Tobs will always be positive so no absolute values are
required now.
> pdata(Tobs,Tstar,lower.tail=F)
[1] 0.071
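For the record, pdata is simply counting permutation results here; essentially the same proportion can be
computed directly (up to how a permuted result exactly equal to the observed value is handled), as in
this sketch:
> mean(Tstar >= Tobs)   # proportion of permuted SSA* results at least as large as the observed SSA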
This provides a permutation-based p-value of 0.071 and suggests marginal evidence against the
null hypothesis of no difference in the true means. We would interpret this as saying that there is
a 7.1% chance of getting a SSA as large or larger than we observed, given that the null hypothesis
is true.
It ends up that some nice parametric statistical results are available (if our assumptions are met)
for the ratio of estimated variances, which are called Mean Squares. To turn sums of squares
into mean square (variance) estimates, we divide the sums of squares by the amount of free
information available. For example, remember the typical variance estimator from introductory
statistics, s² = Σ(yi - ȳ)²/(N-1), where we "lose" one piece of information to estimate
the mean and there are N deviations around the single mean so we divide by N-1. Now
consider SSE = Σ(yij - ȳj)², which still has N deviations but it varies
around the J group means, so the Mean Square Error = MSE = SSE/(N-J). Basically, we lose J pieces of
information in this calculation because we have to estimate J means. The sums of squares for
explanatory variable A, SSA = Σ nj(ȳj - ȳ)², is harder to see in the formula, but
the same reasoning can be used to understand the denominator for forming the Mean Square for
variable A or MSA: there are J means that vary around the grand mean so MSA = SSA/(J-1). In
summary, the two mean squares are simply:
■ MSA = SSA/(J-1), which estimates the variance of the group means around the grand
mean.
■ MSError = SSError/(N-J), which estimates the variation of the errors around the group
means.
These results are put together using a ratio to define the ANOVA F-statistic (also called the F-
ratio) as
F=MSA/MSError.
This statistic is close to 1 if the variability in the means is "similar" to the variability in the
residuals and would lead to no evidence being found of a difference in the means. If the MSA is
much larger than the MSE, the F-statistic will provide evidence against the null hypothesis. The
"size" of the F-statistic is formalized by finding the p-value. The F-statistic, if assumptions
discussed below are met and we assume the null hypothesis is true, follows an F-distribution.
The F-distribution is a right-skewed distribution whose shape is defined by what are called
the numerator degrees of freedom (J-1) and the denominator degrees of freedom (N-J). These
names correspond to the values that we used to calculate the mean squares and where in the F-
ratio each mean square was used; F-distributions are denoted by their degrees of freedom using
the convention of F(numerator df, denominator df). Some examples of different F-distributions
are displayed for you in Figure 2-6.
Figure 2-6: Density curves of four different F-distributions.
The characteristics of the F-distribution can be summarized as:
⚪ Right skewed,
⚪ Takes on values of 0 or greater only, and
⚪ Its shape is determined by the numerator (J-1) and denominator (N-J) degrees of freedom.
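To see how these pieces fit together numerically, the mean squares, F-statistic, and parametric p-value
can be reconstructed by hand from the sums of squares reported above for the MockJury model
(SSA = 70.94, SSE = 1421.32, with J = 3 groups and N = 114 observations); this sketch is just arithmetic
on those stored results:
> SSA <- 70.94; SSE <- 1421.32; J <- 3; N <- 114
> MSA <- SSA/(J-1); MSE <- SSE/(N-J)
> c(MSA, MSE, MSA/MSE)                              # the F-statistic should be close to 2.77
> pf(MSA/MSE, df1=J-1, df2=N-J, lower.tail=FALSE)   # right-tail area from the F(2,111) distribution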
Now that we know about all of its components, we are ready to look at a full ANOVA table. The
general format of the ANOVA table is22:
Table 2-2: General One-Way ANOVA table.
Source       DF     Sums of Squares   Mean Squares      F-ratio       P-value
Variable A   J-1    SSA               MSA = SSA/(J-1)   F = MSA/MSE   Right tail of F(J-1, N-J)
Residuals    N-J    SSE               MSE = SSE/(N-J)
Total        N-1    SSTotal
The table is oriented to help you reconstruct the F-ratio from each of its components. The output
from R is similar although it does not provide the last row. The R version of the table for the type
of picture effect (Attr) with J=3 levels and N=114 observations, repeated from above, is:
> anova(lm2)
Response: Years
           Df  Sum Sq Mean Sq F value  Pr(>F)
Attr        2   70.94  35.469  2.7702 0.06699
Residuals 111 1421.32  12.805
> pf(2.77,df1=2,df2=111,lower.tail=F)
[1] 0.06699803
The result from the F-distribution using this parametric procedure is similar to the p-value
obtained using permutations with the SSA test statistic, which was 0.071. The F-statistic is
obviously another potential test statistic to use in a permutation approach. We
should check that we get similar results from it with permutations as we did from using SSA as a
test statistic. The following code generates the permutation distribution for the F-statistic (Figure
2-7) and assesses how unusual the observed F-statistic of 2.77 was in this permutation
distribution. The only change in the code involves moving from extracting SSA to extracting
the F-ratio which is in the 4th column of the anova output:
> Tobs <- anova(lm(Years~Attr,data=MockJury))[1,4]; Tobs
[1] 2.770024
> B<-1000
> Tstar<-matrix(NA,nrow=B)
> for (b in 1:B){
+ Tstar[b]<-anova(lm(Years~shuffle(Attr),data=MockJury))[1,4]
+ }
> hist(Tstar,labels=T)
> abline(v=Tobs,col="red",lwd=3)
> pdata(Tobs,Tstar,lower.tail=F)
[1] 0.064
Figure 2-7: Permutation distribution of the F-statistic with bold, vertical line for observed value
of test statistic of 2.77.
The permutation-based p-value is 0.064 which, again, matches the other results closely. The first
conclusion is that using either the F-statistic or the SSA as the test statistic provides similar
permutation results. However, we tend to favor using the F-statistic because it is more commonly
used in reporting ANOVA results, not because it is any better in a permutation context.
It is also interesting to compare the permutation distribution for the F-statistic and the
parametric F(2,111) distribution (Figure 2-8). They do not match perfectly but are quite similar.
Some of the differences around 0 are due to the behavior of the method used to create the density
curve and are not really a problem for the methods. This correspondence explains why both approaches
give similar p-values. In some situations, the correspondence will not be quite so close.
Figure 2-8: Comparison of F(2,111) (dashed line) and permutation distribution (solid line).
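A comparison along the lines of Figure 2-8 can be sketched by overlaying the parametric F(2,111) density
on a density estimate of the permutation results, assuming the Tstar collection of permuted F-statistics
from above is still available:
> plot(density(Tstar), main="Permutation distribution vs F(2,111)")
> curve(df(x, df1=2, df2=111), add=TRUE, lty=2)   # dashed line: parametric F(2,111) density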
So how can we rectify this result (p-value≈0.06) and the Chapter 1 result that detected a
difference between Average and Unattractive with a p-value≈0.03? I selected the two groups to
compare in Chapter 1 because they were furthest apart. "Cherry-picking" the comparison that is
likely to be most different creates a false sense of the real situation and inflates the Type I error
rate because of the selection. If the entire suite of comparisons is considered, this result may
lose some of its luster. In other words, if we consider the suite of all pair-wise differences (and
the tests) implicit in comparing all of them, we need stronger evidence in the most different pair
than a p-value of 0.033 to suggest overall differences. The Beautiful and Average groups are not
that different from each other so they do not contribute much to the overall F-test. In Section 2.5,
we will revisit this topic and consider a method that is statistically valid for performing all
possible pair-wise comparisons.
2.3 - ANOVA model diagnostics
including QQ-plots
by Mark Greenwood and Katharine Banner
The requirements for a One-Way ANOVA F-test are similar to those discussed in Chapter 1,
except that there are now J groups instead of only 2. Specifically, the linear model assumes:
1) Independent observations
2) Equal variances
3) Normal distributions
To assess the equal variance assumption across the groups, we must rely on plots. We can use
boxplots and beanplots to compare the spreads of the groups, which are provided in Figure 2-1.
The range and IQRs should be similar across the groups, although you should always note how
clear or big the violation of the assumption might be, remembering that there will always be
some differences in the variation among groups. In this section, we learn how to work with the
diagnostic plots that are provided from the lm function that can help us more clearly assess
potential violations of the previous assumptions.
We can obtain a suite of diagnostic plots by using the plot function on the ANOVA model
object that we fit. To get all of the plots together in four panels we need to add
the par(mfrow=c(2,2)) command to tell R to make a graph with 4 panels23.
> par(mfrow=c(2,2))
> plot(lm2)
There are two plots in Figure 2-9 with useful information for the equal variance assumption. The
"Residuals vs Fitted" in the top left panel displays the residuals (eij= γij - γ̂ij) on the y-axis and the
fitted values (γ̂ij) on the x-axis. This allows you to see if the variability of the observations differs
across the groups because all observations in the same group get the same fitted value. In this
plot, the points seem to have fairly similar spreads at the fitted values for the three groups of 4,
4.3, and 6. The "Scale-Location" plot in the lower left panel has the same x-axis but the y-axis
contains the square-root of the absolute value of the standardized residuals. The absolute value
transforms all the residuals into a magnitude scale (removing direction) and the square-root helps
you see differences in variability more accurately. The usage is similar in the two plots - you
want to assess whether it appears that the groups have somewhat similar or noticeably different
amounts of variability. If you see a clear funnel shape in the Residuals vs Fitted or an increase or
decrease in the edge of points in the Scale-Location plot, that may indicate a violation of the
constant variance assumption. Remember that some variation across the groups is expected and
is ok, but large differences in spreads are problematic for all the procedures we will learn this
semester.
> eij=residuals(lm2)
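Only the extraction of the residuals is shown above; a minimal sketch of code that would generate a
histogram and density curve of those residuals, similar to Figure 2-10, is:
> par(mfrow=c(1,2))
> hist(eij, main="Histogram of residuals")
> plot(density(eij), main="Density curve of residuals")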
Figure 2-10: Histogram and density curve of the linear model raw residuals.
Figure 2-10 shows that there is a right skew present in the residuals, which is consistent with the
initial assessment of some right skew in the plots of observations in each group.
I extracted the previous QQ-plot of the linear model residuals and enhanced it a little to make
Figure 2-11. We know from looking at the histogram that this is a slightly right skewed
distribution. The QQ-plot places the observed standardized25 residuals on the y-axis and the
theoretical normal values on the x-axis. The most noticeable deviation from the 1-1 line is in the
lower left corner of the plot. These are for the negative residuals (left tail) and there are many
residuals at around the same value a little smaller than -1. If the distribution had followed the
normal here, the points would be on the 1-1 line and would actually be even smaller. So we are
not getting as much spread in the lower observations as we would expect in a normal
distribution. If you go back to the histogram you can see that the lower observations are all
stacked up and do not spread out like the left tail of a normal distribution should. In the right tail
(positive) residuals, there is also a systematic lifting from the 1-1 line to larger values in the
residuals than the normal would generate. For example, the point labeled as "82" (the
82nd observation in the data set) has a residual of about 3 but would be expected to be smaller
(maybe around 2.5) if the distribution were normal. Put together, this pattern in the QQ-plot suggests that
the left tail is too compacted (too short) and the right tail is too spread out - this is the right skew
we identified from the histogram and density curve!
Figure 2-11: QQ-plot of residuals from linear model.
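The enhanced plot in Figure 2-11 is based on the standardized residuals from the default diagnostics; a
basic version of a QQ-plot can be sketched from the raw residuals (so the y-axis scaling will differ
slightly from the figure) with:
> qqnorm(residuals(lm2), main="QQ-plot of residuals from lm2")
> qqline(residuals(lm2))   # reference line to compare the points against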
Generally, when both tails deviate on the same side of the line (forming a sort of quadratic curve,
especially in more extreme cases), that is evidence of a skew. To see some different potential
shapes of QQ-plots, six different data sets are displayed in Figures 2-12 and 2-13. In each row, a QQ-plot
and density curve are displayed. If the points are both above the 1-1 line in the lower and upper tails as
in Figure 2-12(a), then the pattern is a right skew, here even more extreme than in the real data
set. If the points are below the 1-1 line in both tails as in Figure 2-12(c), then the pattern should
be identified as a left skew. These are both problematic for models that assume normally
distributed responses but not necessarily for our permutation approaches if all the groups have
similar skewed shapes. The other problematic pattern is to have more spread than a normal curve
as in Figure 2-12(e) and (f). This shows up with the points being below the line in the left tail
(more extreme negative than expected by the normal) and the points being above the line for the
right tail (more extreme positive than the normal). We call these distributions heavy-tailed; they
can manifest as distributions with outliers in both tails or just a bit more spread out than a normal
distribution. Heavy-tailed residual distributions can be problematic for our models as the
variation is greater than what the normal distribution can account for and our methods might
under-estimate the variability in the results. The opposite pattern with the left tail above the line
and the right tail below the line suggests less spread (lighter-tailed) than a normal as in Figure 2-
12(g) and (h). This pattern is relatively harmless and you can proceed with methods that assume
normality safely.
Figure 2-12: QQ-plots and density curves of four fake distributions with different shapes.
Finally, to help you calibrate expectations for data that are actually normally distributed, two
data sets simulated from normal distributions are displayed below in Figure 2-13. Note how
neither follows the line exactly but that the overall pattern matches fairly well. You have to allow
for some variation from the line in real data sets and focus on when there are really noticeable
issues in the distribution of the residuals such as those displayed above.
Figure 2-13: Two more simulated data sets, generated from normal distributions.
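You can calibrate your own expectations in the same way by repeatedly simulating samples that really are
normally distributed and inspecting their QQ-plots; a sketch using a sample size of 114 (to match the
MockJury residuals, although any size of interest could be used) is:
> par(mfrow=c(2,2))
> for (i in 1:4){
+   fake <- rnorm(114)            # simulated from a normal distribution
+   qqnorm(fake); qqline(fake)    # even truly normal data wander around the line
+ }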
The last issue with assessing the assumptions in an ANOVA relates to situations where the
models are more or less resistant26 to violations of assumptions. For reasons beyond the scope of
this class, the parametric ANOVA F-test is more resistant to violations of the normality and
equal variance assumptions if the design is balanced. A balanced design occurs
when each group is measured the same number of times. The resistance decreases as the data set
becomes less balanced, so having close to balance is preferred to a more imbalanced situation if
there is a choice available. There is some intuition available here - it makes some sense that you
would have better results if all groups are equally (or nearly equally) represented in the data set.
We can check the number of observations in each group to see if they are equal or similar using
the tally function from the mosaic package:
> tally(~Attr,data=MockJury)
previous next
23
We have been using this function quite a bit to make multi-panel graphs but you will always
want to use this command for linear model diagnostics or you will have to use the arrows above
the plots to go back and see previous plots.
24
Along with multiple names, there is variation of what is plotted on the x and y axes and the
scaling of the values plotted, increasing the challenge of interpreting QQ-plots. We will try to be
consistent about the x and y axis choices.
25
Here this means re-scaled so that they should have similar scaling to a standard normal with
mean 0 and standard deviation 1. This does not change the shape of the distribution but can make
outlier identification by value of the residuals simpler - having a standardized residual more
extreme than 5 or -5 would suggest a deviation from normality. But mainly focus on the shape of
the pattern in the QQ-plot.
26
A resistant procedure is one that is not severely impacted by a particular violation of an
assumption. For example, the median is resistant to the impact of an outlier.
A second example of the One-way ANOVA methods involves a study of growth rates of the
teeth of Guinea Pigs (measured in millimeters, mm). N=60 Guinea Pigs were obtained from a
local breeder and each received Orange Juice (OJ) or ascorbic acid (the stuff in vitamin C
capsules, called VC below) at one of three dosages (0.5, 1, or 2 mg) as a source of added Vitamin
C in their diets. Each guinea pig was randomly assigned to receive one of the six different
treatment combinations possible (OJ at 0.5 mg, OJ at 1 mg, OJ at 2 mg, VC at 0.5 mg, VC at 1
mg, and VC at 2 mg). The animals were treated similarly otherwise and, we can assume, lived in
separate cages. We need to create a variable that combines the levels of delivery type (OJ, VC)
and the dosages (0.5, 1, and 2) to use our One-Way ANOVA on the six levels.
The interaction function creates a new variable in the ToothGrowth data.frame that we
called Treat that will be used as a six-level grouping variable.
> ToothGrowth$Treat=with(ToothGrowth,interaction(supp,dose))
#Creates a new variable Treat with 6 levels
The tally function helps us to check for balance; this is a balanced design because the same
number of guinea pigs (nj=10 for all j) were measured in each treatment combination.
> require(mosaic)
> tally(~Treat,data=ToothGrowth)
  OJ.0.5   VC.0.5     OJ.1     VC.1     OJ.2     VC.2
      10       10       10       10       10       10
The next task is to visualize the results using boxplots and beanplots27 (Figure 2-14) and generate
some summary statistics for each group using favstats.
> par(mfrow=c(1,2))
> boxplot(len~Treat,data=ToothGrowth)
> beanplot(len~Treat,data=ToothGrowth,log="",col="yellow",method="jitter")
> favstats(len~Treat,data=ToothGrowth)
Figure 2-14 suggests that the mean tooth growth increases with the dosage level and that OJ
might lead to higher growth rates than VC except at dosages of 2 mg. The variability around the
means looks to be small relative to the differences among the means, so we should expect a small
p-value from our F-test. The design is balanced as noted above (nj = 10 for all six groups) so the
methods are somewhat resistant to impacts from non-normality and non-constant variance. There
is some suggestion of non-constant variance in the plots but this will be explored further below
when we can visually remove the difference in the means from this comparison. There might be
some skew in the responses in some of the groups but there are only 10 observations per group
so skew in the boxplots could be generated by very few observations.
Figure 2-14: Boxplot and beanplot of tooth growth responses for the six treatment level combinations.
Now we can apply our 6+ steps for performing a hypothesis test with these observations. The
initial step is deciding on the claim to be assessed and the test statistic to use. This is a six group
situation with a quantitative response, identifying it as a One-Way ANOVA where we want to
test a null hypothesis that all the groups have the same population mean. We will use a 5%
significance level.
1) Hypotheses: H0: μOJ0.5 = μVC0.5 = μOJ1 = μVC1 = μOJ2 = μVC2 vs HA: Not all μj equal
• The null hypothesis could also be written in reference-coding as H0: τVC0.5 = τOJ1 = τVC1 = τOJ2 =
τVC2 = 0 since OJ.0.5 is chosen as the baseline group (discussed below).
• The alternative hypothesis can be left a bit less specific: HA: Not all τj equal 0.
2) Validity conditions:
• Independence:
   ◦ This is where the note above about separate cages is important. Suppose that there were
   cages that contained multiple animals and they competed for food or could share
   illness. The animals in one cage might be systematically different from the others and
   this "clustering" of observations would present a potential violation of the
   independence assumption. If the experiment had the animals in separate cages, there is
   no clear dependency in the design of the study and we can assume that there is no problem
   with this assumption.
• Constant variance:
   ◦ As noted above, there is some indication of a difference in the variability among the
   groups in the boxplots but the sample size was small in each group. We need to fit the
   linear model to get the other diagnostic plots to make an overall assessment.
> m2=lm(len~Treat,data=ToothGrowth)
> par(mfrow=c(2,2))
> plot(m2)
   ◦ The Scale-Location plot also shows just a little less variability in the group with the
   smallest fitted value but the spread of the groups looks fairly similar in this alternative
   scaling.
   ◦ Put together, the evidence for non-constant variance is not that strong and we can
   assume that there is at least not a major problem with this assumption.
• Normality of residuals:
   ◦ The Normal Q-Q plot shows a small deviation in the lower tail but nothing that we
   wouldn't expect from a normal distribution. There is no evidence of a problem with this
   assumption in the upper right panel of Figure 2-15.
3) Calculate the test statistic:
> anova(m2)
Response: len
4) Find the p-value:
• There are two options here, especially since it seems that our assumptions about variance
and normality are not violated (note that we do not say "met" - we just have no strong evidence
against them). The parametric and nonparametric approaches should provide similar results
here.
• The parametric approach is easiest - the p-value comes from the previous ANOVA table as
<2.2e-16. This is in scientific notation and means that the p-value is below the numerical precision
of the computer, so R only reports that it is a very small number. You can report that the
p-value < 0.00001 but should not report that it is 0. This p-value came from an F(5,54)
distribution (the distribution of the test statistic if the null hypothesis is true).
• The nonparametric approach is not too hard so we can compare the two approaches here.
> Tobs <- anova(lm(len~Treat,data=ToothGrowth))[1,4]; Tobs
[1] 41.55718
> par(mfrow=c(1,2))
> B<-1000
> Tstar<-matrix(NA,nrow=B)
> for (b in 1:B){
+ Tstar[b]<-anova(lm(len~shuffle(Treat),data=ToothGrowth))[1,4]
+ }
> hist(Tstar,xlim=c(0,Tobs+3))
> abline(v=Tobs,col="red",lwd=3)
> plot(density(Tstar),xlim=c(0,Tobs+3),main="Density curve of Tstar")
> abline(v=Tobs,col="red",lwd=3)
> pdata(Tobs,Tstar,lower.tail=F)
[1] 0
Figure 2-16: Histogram and density curve of permutation distribution for F-statistic for tooth growth
data. Observed test statistic in bold, vertical line at 41.56.
5) Make a decision:
• Reject H0 since the p-value is less than 5%.
6) Write a conclusion:
• There is evidence at the 5% significance level that the different treatments (combinations of
OJ/VC and dosage levels) cause some difference in the true mean tooth growth for these Guinea
Pigs.
   ◦ We can make the causal statement because the treatments were randomly assigned
   but these inferences only apply to these Guinea Pigs since they were not randomly
   selected from a larger population.
   ◦ Remember that we are making inferences to the population means and not the
   sample means and want to make that clear in any conclusion.
   ◦ The alternative is that there is some difference in the true means - be sure to make
   the wording clear that you aren't saying that all differ. In fact, if you look back at Figure
   2-14, the means for the 2 mg dosages look almost the same. The F-test is about finding
   evidence of some difference somewhere among the true means. The next section will
   provide some additional tools to get more specific about the source of those detected
   differences.
Before we leave this example, we should revisit our model estimates and interpretations. The
default model parameterization uses the reference-coding. Running the summary function on the
model m2 provides the estimated coefficients:
> summary(m2)
Coefficients:
For some practice with the reference coding used in these models, we will find the estimates for
observations for a couple of the groups. To work with the parameters, you need to start with
diagnosing the baseline category by considering which level is not displayed in the output. The
levels function can list the groups and their coding in the data set. The first level is usually the
baseline category but you should check this in the model summary as well.
> levels(ToothGrowth$Treat)
[1] "OJ.0.5" "VC.0.5" "OJ.1"   "VC.1"   "OJ.2"   "VC.2"
There is a VC.0.5 in the second row of the model summary, but there is no row
for OJ.0.5 and so this must be the baseline category. That means that the fitted value or model
estimate for the OJ at 0.5 mg group is the same as the (Intercept) row or α̂, estimating a
mean tooth growth of 13.23 mm when the pigs get OJ at a 0.5 mg dosage level. You should
always start with working on the baseline level in a reference-coded model. To get estimates for
any other group, then you can use the (Intercept) estimate and add the deviation for the group of
interest. For VC.0.5, the estimated mean tooth growth is α̂ + τ̂2 = α̂ + τ̂VC.0.5 = 13.23 + (-5.25) =
7.98 mm. It is also potentially interesting to directly interpret the estimated difference (or
deviation) between OJ0.5 (the baseline) and VC0.5 (group 2) that is τ̂VC.0.5 = -5.25: we estimate
that the mean tooth growth in VC.0.5 is 5.25 mm shorter than it is in OJ.0.5. This and many
other direct comparisons of groups are likely of interest to researchers involved in studying the
impacts of these supplements on tooth growth and the next section will show us how to do that
(correctly!).
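One way to double-check this sort of hand calculation is to ask R for the sample mean of each treatment
group and compare those to the intercept-plus-deviation arithmetic. This sketch assumes the mosaic
package is still loaded (for the formula interface to mean) and that the coefficients of m2 are ordered
with the intercept first and the VC.0.5 deviation second, as in the model summary above:
> mean(len~Treat, data=ToothGrowth)   # sample mean tooth growth for each of the six groups
> coef(m2)                            # intercept (baseline OJ.0.5) plus deviations for other groups
> coef(m2)[1] + coef(m2)[2]           # should reproduce the VC.0.5 estimate of about 7.98 mm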
27
Note that to see all the group labels in the plot when I copied it into R, I had to widen the plot
window. You can resize the plot window using the small "=" signs in the grey bars that separate
the different panels in R-studio.
With evidence that the true means are likely not all equal, many researchers want to know which
groups show evidence of differing from one another. This provides information on the source of
the overall difference that was detected and detailed information on which groups differed from
one another. Because this is a shot-gun/unfocused sort of approach, some people think it is an
over-used procedure. Others feel that it is an important method of addressing detailed questions
about group comparisons in a valid way. For example, we might want to know if OJ is different
from VC at the 0.5 mg dosage level and these methods will allow us to get an answer to this sort
of question. It also will test for differences between the OJ-0.5 and VC-2 groups and every other
pair you can construct. This method actually takes us back to the methods in Chapter 1 where we
compared the means of two groups except that we need to deal with potentially many pair-wise
comparisons, making an adjustment to account for that inflation in Type I errors that occurs due
to many tests being performed at the same time. There are many different statistical methods to
make all the pair-wise comparisons, but we will employ the most commonly used one,
called Tukey's Honest Significant Difference (Tukey's HSD) method28. The name suggests that
not using it could lead to a dishonest answer and that it will give you an honest result. It is more
that if you don't do some sort of correction for all the tests you are performing, you might find
some spurious29 results. There are other methods that could be used to do a similar correction.
The general challenge in this situation is that if you perform many tests at the same
time, you inflate the Type I error rate. We can define the family-wise error rate as the
probability that at least one error is made on a set of tests or P(At least 1 error is made). The
family-wise error is meant to capture the overall situation in terms of measuring the likelihood of
making a mistake if we consider many tests, each with some chance of making their own
mistake, and focus on how often we make at least one error when we do many tests. A quick
probability calculation shows the magnitude of the problem. If we start with a 5% significance
level test, then P(Type I error on one test) =0.05 and the P(no errors made on one test) =0.95, by
definition. This is our standard hypothesis testing situation. Now, suppose we have m
independent tests, then P(make at least 1 Type I error given all null hypotheses are true) = 1 -
P(no errors made) = 1 - 0.95^m. Figure 2-17 shows how the probability of having at least one false
detection grows rapidly with the number of tests. The plot stops at 100 tests since by then there is
effectively a 100% chance of at least one false detection. It might seem like doing 100 tests is a lot, but in
Genetics research it is possible to consider situations where millions of tests are considered so
these are real issues to be concerned about in many situations.
Figure 2-17: Plot of family-wise error rate as the number of tests performed increases. Dashed line indicates 0.05.
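A plot like Figure 2-17 takes only a few lines to recreate using the 1 - 0.95^m formula from above:
> m <- 1:100                       # number of independent tests considered
> fwer <- 1 - 0.95^m               # P(at least one Type I error) when each test uses a 5% level
> plot(m, fwer, type="l", xlab="Number of tests", ylab="Family-wise error rate")
> abline(h=0.05, lty=2)            # dashed reference line at the single-test rate of 0.05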
In pair-wise comparisons between all the pairs of means in a One-Way ANOVA, the number of
tests is based on the number of pairs. We can calculate the number of tests using J choose 2,
which is J(J-1)/2, to get the number of pairs of size 2 that we can make out of J individual treatment
levels. We
won't explore the combinatorics formula for this, as the choose function can give us the
answers:
> choose(3,2)
[1] 3
> choose(4,2)
[1] 6
> choose(5,2)
[1] 10
> choose(6,2)
[1] 15
So if you have 6 groups, like in the Guinea Pig study, we will have to consider 15 tests to
compare all the pairs of groups. 15 tests seems like enough that we should be worried about
inflated family-wise error rates. Fortunately, the Tukey's HSD method controls the family-wise
error rate at your specified level (say 0.05) across any number of pair-wise comparisons. This
means that the overall rate of at least one Type I error is controlled at the specified significance
level, often 5%. To do this, each test must use a slightly more conservative cut-off than if just
one test is performed and the procedure helps us figure out how much more conservative we
need to be.
Tukey's HSD starts with focusing on the difference between the groups with the largest and
smallest means (ȳmax - ȳmin). If (ȳmax - ȳmin) ≤ Margin of Error for the difference in the means, then
all other pairwise differences, say |ȳj - ȳj'|, will be less than or equal to that margin of error. This
also means that any confidence intervals for any difference in the means will contain 0. Tukey's
HSD selects a critical value so that (ȳmax - ȳmin) will be less than the margin of error in 95% of
data sets drawn from populations with a common mean. This implies that in 95% of data sets in
which all the population means are the same, all confidence intervals for differences in pairs of
means will contain 0. Tukey's HSD provides confidence intervals for the difference in true
means between groups j and j', μj - μj', for all pairs where j ≠ j', by adding and subtracting a
margin of error (a multiplier times the standard error of the difference in the sample means) to each
observed difference, ȳj - ȳj'. The studentized range distribution that is
used to find the multiplier, q, for the confidence intervals is available in the qtukey function
and generally provides a slightly larger multiplier than the regular t* from our two-sample t-
based confidence interval, discussed in Chapter 1. We will use the confint, cld,
and plot functions applied to output from the glht function (multcomp package; Hothorn,
Bretz and Westfall, 2008) to easily get the required comparisons from our ANOVA model.
Unfortunately, its code format is a little complicated - but there are just two places to modify the
code: you include the model name and, after mcp (which stands for multiple comparisons) in
the linfct option, you include the explanatory variable name
as VARIABLENAME="Tukey". The last part is what requests the Tukey HSD multiple comparisons.
Once we obtain the intervals, we can use them to test H0: μj = μj' vs HA: μj ≠ μj' by assessing
whether 0 is in the confidence interval for each pair. If 0 is in the interval, then there is no evidence of a
difference for that pair. If 0 is not in the interval, then we reject H0 and have evidence at the
specified family-wise significance level of a difference for that pair. The following code provides
the numerical and graphical30 results of applying Tukey's HSD to the linear model for the Guinea
Pig data:
> require(multcomp)
> Tm2 <- glht(m2, linfct = mcp(Treat = "Tukey"))
> confint(Tm2)
Quantile = 2.9549
Linear Hypotheses:
> plot(Tm2)
Figure 2-18: Graphical display of pair-wise comparisons from Tukey's HSD for the Guinea Pig data. Any
confidence intervals that do not contain 0 provide evidence of a difference in the groups.
Figure 2-18 contains confidence intervals for the difference in the means for all 15 pairs of
groups. For example, the first confidence interval in the first row is comparing VC.0.5 and
OJ.0.5 (VC.0.5 minus OJ.0.5). In the numerical output, you can find that this 95% family-wise
confidence interval goes from -10.05 to -0.45 mm (lwr and upr in the numerical output provide
the CI endpoints). This interval does not contain 0 since its upper end point is -0.45 mm and so
we can now say that there is evidence that OJ and VC have different true mean growth rates at
the 0.5 mg dosage level. We can go further and say that we are 95% confident that the difference
in the true mean tooth growth between VC0.5 and OJ0.5 (VC0.5-OJ0.5) is between -10.05 and -
0.45 mm. But there are fourteen more similar intervals...
If you put all these pair-wise tests together, you can generate an overall interpretation of Tukey's
HSD results that discusses sets of groups that are not detectably different from one another and
those groups distinguished from other sets of groups. To do this, start with listing out the pairs
that are not detectably different (CIs contain 0), which, here, only occurs for four of the pairs.
The CIs that contain 0 are for the pairs VC.1 and OJ.0.5, OJ.2 and OJ.1, VC.2 and OJ.1, and,
finally, VC.2 and OJ.2. So VC.2, OJ.1, and OJ.2 are all not detectably different from each other
and VC.1 and OJ.0.5 are also not detectably different. If you look carefully, VC.0.5 is detected
as different from every other group. So there are basically three sets of groups that can be
grouped together as "similar": VC.2, OJ.1, and OJ.2; VC.1 and OJ.0.5; and VC.0.5. Sometimes
groups overlap with some levels not being detectably different from other levels that belong to
different groups and the story is not as clear as it is in this case. An example of this sort of
overlap is seen in the next section.
There is a method that many researchers use to more efficiently generate and report these sorts of
results that is called a compact letter display (CLD). The cld function can be applied to the
results from glht to provide a "simple" summary of the sets of groups that we generated above.
In this discussion, we are using a set as a union of different groups that can contain one or more
members, and the members of these sets are the six different treatment levels.
> cld(Tm2)
  OJ.0.5   VC.0.5     OJ.1     VC.1     OJ.2     VC.2
     "b"      "a"      "c"      "b"      "c"      "c"
Groups with the same letter are not detectably different (are in the same set) and groups that are
detectably different get different letters (different sets). Groups can have more than one letter to
reflect "overlap" between the sets of groups and sometimes a set of groups contains only a single
treatment level (VC.0.5 is a set of size 1). Note that if the groups have the same letter, this does
not mean they are the same, just that there is no evidence of a difference for that pair. If we
consider the previous output for the CLD, the "a" set contains VC.0.5, the "b" set contains OJ.0.5
and VC.1, and the "c" set contains OJ.1, OJ.2, and VC.2. These are exactly the groups of
treatment levels that we obtained by going through all fifteen pairwise results. And these letters
can be added to a beanplot to help fully report the results and understand the sorts of differences
Tukey's HSD can detect.
> beanplot(len~Treat,data=ToothGrowth,log="",col="white",method="jitter")
> text(c(2),c(10),"a",col="blue",cex=2)
> text(c(3,5,6),c(25,28,28),"b",col="green",cex=2)
> text(c(1,4),c(15,18),"c",col="red",cex=2)
Figure 2-19 can be used to enhance the discussion by showing that the "a" group with VC.0.5
had the lowest average tooth growth, the "b" group had intermediate tooth growth for treatments
OJ.0.5 and VC.1, and the highest growth rates came from the "c" group of OJ.1, OJ.2, and VC.2.
Even though VC.2 had the highest average growth rate, we are not able to prove that its true mean
is any higher than the other groups labeled with "c". Hopefully the ease of getting to the story of the
Tukey's HSD results from a plot like this explains why it is common to report results using these
methods instead of reporting 15 confidence intervals.
Figure 2-19: Beanplot of tooth growth by group with Tukey's HSD compact letter display.
There are just a couple of other details to mention on this set of methods. First, note that we
interpret the set of confidence intervals simultaneously: We are 95% confident that ALL the
intervals contain the respective differences in the true means (this is a family-wise
interpretation). These intervals are adjusted (wider) from our regular 2 sample t intervals from
Chapter 1 to allow this stronger interpretation. Second, if sample sizes are unequal in the groups,
Tukey's HSD is conservative and provides a family-wise error rate that is lower than the nominal
level. In other words, it makes false detections less often than advertised and the intervals provided
are a little wider than needed, containing all the pairwise differences at higher than the nominal
confidence level
of (typically) 95%. Third, this is a parametric approach and violations of normality and constant
variance will push the method in the other direction, potentially making the technique
dangerously liberal. Nonparametric approaches to this problem are possible, but will not be
considered here.
28
When this procedure is used with unequal group sizes it is also sometimes called Tukey-
Kramer's method.
29
We often use "spurious" to describe falsely rejected null hypotheses which are also called false
detections.
30
The plot of results usually contains all the labels of groups but if the labels are long or there
are many groups, sometimes the row labels are hard to see even with re-sizing the plot to make it
taller in R-studio, and the numerical output is useful as a guide to help you read the plot.
Figure 2-20: Tukey's HSD confidence interval results at the 95% family-wise confidence
level.
At the family-wise 5% significance level, there are no pairs that are detectably different -
they all get the same letter of "a". Now we will produce results for the reader who thought
a 10% significance level was suitable for this application before seeing any of the results. We
just need to change the confidence level or significance level that the CIs or tests are
produced with inside the functions. For the confint function, the level option is the
confidence level, and for cld, it is the family-wise significance level.
> confint(Tm2,level=0.9)
Simultaneous Confidence Intervals
Multiple Comparisons of Means: Tukey Contrasts
90% family-wise confidence level
Figure 2-22: Beanplot of sentences with compact letter display results from 10% family-
wise significance level Tukey's HSD.
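Assuming the Tukey's HSD results object for this model is again stored as Tm2, as in the confint call
above, the compact letter display at the 10% family-wise significance level can be requested through the
level option of cld (a sketch of the call; the argument name should be checked against ?cld in your
version of multcomp):
> cld(Tm2, level=0.1)   # letters based on a 10% family-wise significance level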
The main components of R code used in this chapter follow, with the components you need to modify
shown in capital letters (Y, X, DATASETNAME, MODELNAME), remembering that any R packages
mentioned need to be installed and loaded for this code to have a chance of working:
• MODELNAME=lm(Y~X,data=DATASETNAME)
◦ Here it is used to fit the reference-coded One-Way ANOVA model with Y as the
response variable and X as the grouping variable, storing the estimated model object in
MODELNAME.
• MODELNAME=lm(Y~X-1,data=DATASETNAME)
◦ Fits the cell means version of the One-Way ANOVA model.
• summary(MODELNAME)
◦ Generates model summary information including the estimated model coefficients, SEs,
t-tests, and p-values.
• anova(MODELNAME)
◦ Generates the ANOVA table but must only be run on the reference-coded version of
the model.
◦ Results are incorrect if run on the cell-means model since the reduced model under the
null is that the mean of all the observations is 0!
• pf(FSTATISTIC,df1=NUMDF,df2=DENDF,lower.tail=F)
◦ Finds the p-value for an observed F-statistic with NUMDF and DENDF degrees of
freedom.
• par(mfrow=c(2,2)); plot(MODELNAME)
◦ Generates four diagnostic plots including the Residuals vs Fitted and Normal Q-Q plot.
• plot(allEffects(MODELNAME))
◦ Requires the effects package to be installed and loaded.
◦ Plots the estimated model.
• Tm2=glht(MODELNAME,linfct=mcp(X="Tukey"); confint(Tm2);
plot(Tm2); cld(Tm2)
◦ Requires the multcomp package to be installed and loaded.
◦ Generates the text output and plot for Tukey's HSD as well as the compact letter
display.
For these practice problems, you will work with the cholesterol data set from
the multcomp package that you should already have loaded. To load the data set and learn
more about the study, use the following code:
require(multcomp)
data(cholesterol)
help(cholesterol)
2.1. Graphically explore the differences in the changes in Cholesterol levels for the five
treatment levels using boxplots and beanplots.
2.2. Is the design balanced?
2.3. Complete all 6+ steps of the hypothesis test using the parametric F-test, reporting the
ANOVA table and the distribution of the test statistic under the null.
2.4. Discuss the scope of inference using the information that the treatment levels were
randomly assigned to volunteers in the study.
2.5. Generate the permutation distribution and find the p-value. Compare the parametric
p-value to the permutation test results.
2.6. Perform Tukey's HSD on the data set. Discuss the results - which pairs were detected
as different and which were not? Bigger reductions in cholesterol are good, so are there
any levels you would recommend or that might provide similar reductions?
2.7. Find and interpret the CLD and compare that to your interpretation of results from
2.6.