Lucero R Tutorial 2016
Dr. Christian Lucero
Version: 2016.12.31
1 This tutorial was designed by Dr. Christian Lucero for use in his courses at Virginia Tech. Please acknowledge the
original author’s contribution when using portions of this tutorial.
Instructions
The tutorials and assignments are designed to teach you how to do basic statistical data analysis using
R/RStudio.
You will need to read the tutorials thoroughly in order to complete an assignment.
You should read the tutorial, line-by-line, in its entirety. Skipping around is ill-advised.
It is recommended that you dedicate a folder to this course and a sub-folder for each assignment!!
You will be asked to write your own snippets of R programming code to carry out basic statistical data
analysis. You should save your Rcode in files by naming them appropriately. We will give you specific
naming requirements for each assignment that you turn in.
Note: In order to edit the R Script / R Program Code, RStudio has a built-in R-code editor that we highly
advise you to use.
A Few Words About R & General Advice
What is most important is that you know where to look up an example related to the question that you want to
answer.
Once you learn and memorize some common commands within R, it becomes easier to use.
Until then, copy & paste earlier examples and modify the syntax to suit the needs of your current problem.
One of the most efficient ways to learn any program, calculator, or even an app on your phone is through exploration
along with trial-and-error.
If you ever feel lost, R has a built-in help system. Additionally, a few minutes googling for answers usually helps
in most cases.
Most importantly: don't be afraid to seek help from other students, TAs, or the instructors if you are truly lost.
Chapter 1
5. Functions in R.
1.1 Getting Started with R: The RStudio Interface.
To quote Wikipedia:
“R is a programming language and software environment for statistical computing and graphics.”
“RStudio is a free and open source integrated development environment (IDE) for R, a programming language
for statistical computing and graphics.”
Essentially, R is a very powerful statistical analysis software suite & RStudio is an independently developed project
that was designed to run on top of the R framework to make programming in R easier.
Therefore you need to install both on your system. However, you should never need to open the R program, only
RStudio.
Instructions on how to install R & RStudio can be found on the R and RStudio websites.
If you are still stuck, please see a classmate, a TA, or talk to your instructor during office hours.
The basic RStudio interface is divided into specific sections.
1. The area in the top left corner is dedicated to writing R Programs and is called the Source Window.
Specific lines of code can be run from this window one line at a time, in selected chunks, or as the entire
program.
2. The console is in the bottom left corner. The console can be used to type in basic commands, functions,
etc. Additionally the console is where you see basic output from running specific commands and programs.
3. A command history and short list of user defined variables is located in the top right corner.
4. The bottom right corner contains a number of useful items organized by tabs. There is a file browser, a tab
which houses the plots when they are generated, an installed package viewer, and the help system interface
for RStudio.
1.2 Naming Files and Organizing Your Work
Organization of your files is extremely important! You should dedicate a folder to this course, specific homework
assignments/projects, and projects that you might be working on in the real world.
Within this folder, you should make sub-folders. You should have a folder for each assignment.
There will be a number of assignments in this course, so you should have the following sub-folders:
Assignment_1
Assignment_2
Assignment_3
You should generally keep a separate R Script file for every assignment. You should call these R Scripts something
appropriate.
Lucero_Christian_CMDA_3654_HW1.R
If you are familiar with version control, such as git, please use it!!
One thing that you should notice is that all R Scripts/Program Files must end with a .R
R is a statistical programming language that is highly customizable and has a lot of built-in functions that do
many advanced statistical computations and plots for you.
R Programs/Codes/Scripts (all names are commonly used) are the instructions that outline the exact computa-
tions and other operations that you want R to run.
1.3.1 Comments in R
To start with, you should familiarize yourself with comments. Comments are inert statements that start with a
# symbol.
Commenting or annotation is a very common practice in science.
The purpose of commenting is to help explain pieces of your code so that you and others can understand, in plain
language, the steps that are about to be carried out.
Commenting can also help you debug your code, by preventing R from running certain pieces of code which allows
you to see what is working, and what is not.
# Example 1
# I want to plot the function y = sin(x), where x is from 0 to 2*pi
x <- seq(0,2*pi,length.out=100)
y <- sin(x)
# 3+5*7 # Isn't computed because it's commented out.
# plot(y,x,type="l"), # this would produce the wrong plot, so it's commented out.
plot(x,y,type="l", main="y=sin(x)") # This will produce the plot we want!
[Figure: plot of y = sin(x) for x from 0 to 2*pi, titled "y=sin(x)".]
We should see that anywhere there is a comment, R does nothing. In particular, '3+5*7' is never calculated, and
similarly for 'plot(y,x,type="l")'.
For the other lines that were not commented out, i.e:
x <- seq(0,2*pi,length.out=100)
y <- sin(x)
plot(x,y,type="l", main="y=sin(x)") # This will produce the plot we want!
The first 2 lines do some work, and assign the results to the variables named ‘x’, and ‘y’. The plot() function
then takes those values and produces the plot you see above.
1.3.2 How to Run R-Code: The Basics
1. In the Console, you can type each command individually or cut & paste, then hit enter and observe the
output.
While this works, this is generally not advised. Instead do your work in the Source Window and save it to
an R Script. Then do one of the following.
2. If you have a command inside an R Script that you want the computer to run, you can put your blinking
cursor anywhere on that line and hit CTRL-ENTER on the keyboard.
If you have a bunch of lines in a row that you need to run, just repeat the above step as many times as you
need.
3. If you want to run a specific line or multiple lines, then highlight those lines and hit CTRL-ENTER on
the keyboard.
4. You can highlight the code that you want to run, and click on the Run button in the top right of Source
Window.
If you want to run ALL LINES inside of the R script, save, then click on the Run Button.
For the most part, you are going to run lines individually in this class. You will then observe the output and
probably cut and paste the output results into a word document.
All programming languages allow you to define basic objects or variables to allow you to store and retrieve
information in order to do both simple and complex computations.
# We can assign values to objects using arrows or equals signs.
a <- 12
b = 2
# Typing just the name of the object in R will print the value.
a
[1] 12
b
[1] 2
Notice that we can use either an arrow <- or an equals sign = to assign objects. In practice it doesn't matter
which one you use, so long as you know what's going on when reading other people's code.
Essentially, you could treat R as just a fancy scientific calculator.
Run the following lines in the console and observe the output:
a-31
a*b
a^2
(a-b)^(b*4)
Of course you can store your fancy computations into new objects/variables.
my.answer <- sqrt(57)+32 + 11^3
When we assign a value to an object we don't see anything unless we ask for the object by itself or use a print
command.
my.answer
[1] 1370.55
print(my.answer)
[1] 1370.55
There are times when we will also need to store characters (not numbers) into objects.
specimen.name = 'Charlie The Horse' # Notice that single ' ' and " " work identically in R.
specimen.breed = "Paso Fino"
specimen.age = 7
print(specimen.name)
[1] "Charlie The Horse"
print(specimen.breed)
[1] "Paso Fino"
print(specimen.age)
[1] 7
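The command that produced the next line of output is not shown; a paste() call along these lines (a sketch, not the original code) would reproduce it:
# Combine the character and numeric objects into a single sentence.
paste0("The specimen was as a ", specimen.breed, " with the name ",
       specimen.name, ", age ", specimen.age, ".")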
[1] "The specimen was as a Paso Fino with the name Charlie The Horse, age 7."
1.5 Functions in R
There are many functions built into R that you will learn as you go along.
Sometimes these functions can be called with just 1 argument, and sometimes additional arguments are needed.
Some function arguments are optional but help you to customize the output.
If you know the name of the function, say it’s called fcn.name but you don’t remember how to use it or what
arguments it needs, then use one of the following commands.
help('fcn.name')
# or equivalently
? fcn.name
Of course, you can click on the Help tab in the bottom right panel and do a search as well.
Go ahead and try this now. Look at the help file for plot function.
Often times when referring to a function instead of a variable or object name, we make this explicit
by referring to the function using the notation fcn.name(), for example, the plot() function.
When naming variables or other objects, it is best not to give them the same name as function
names. For example, length() is a function, so don’t do length <- 32.5, instead call it something
else arm.length <- 32.5.
If you need an example of how a function might be used, you can read about it on the help page or there may be
a more extensive example using the example() command function: example('fcn.name')
example('plot')
When the data is not too long, we can type it in by hand using the concatenation function c().
# Suppose our list started getting too long...we can continue onto the next line
# so long as the previous line ends with a comma.
# Important Note: this is true for all functions
strength <- c(580, 400, 428, 825, 850, 875, 920, 550,
575, 750, 636, 360, 590, 735, 950)
strength
[1] 580 400 428 825 850 875 920 550 575 750 636 360 590 735 950
# or
print(strength)
[1] 580 400 428 825 850 875 920 550 575 750 636 360 590 735 950
Now we can compute the mean, sd, etc of the strength variable observations. We can also plot values too.
mean(strength)
[1] 668.2667
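A few other summaries work the same way (these calls are not shown in the original, but they are standard base R functions):
sd(strength)      # sample standard deviation
median(strength)  # sample median
summary(strength) # five number summary plus the mean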
The data that we just entered is now stored in what is called a vector or an array.
R can read in many different types of files from other programs, far more than we can cover in this tutorial.
Instead we will show you the basic idea of how to load .csv and .dat files that are reasonably well-structured.
A good way to learn about importing other types of data is to use Google.
For example type: R import .mat
Quite often, data is stored in a spreadsheet where the variables are given by the columns with the variable name
in the first entry of each column. We will focus upon importing data that is in this type of format.
Consider the following dataset which has 6609 observations for each variable (only the first 15 are shown).
If you ever have a spreadsheet of data formatted as in the example above, then explore the options in your spreadsheet
program. Every "good" spreadsheet program has the ability to export to a specific file type known as a .csv file
(comma separated values).
We will show you how to read in .csv files below. If you ever need to read in other file types, a little Google will
get you a long way on this subject. If we need to read in other file types for this course, we’ll specifically give you
the command.
There are a few different ways to read in data from a file. R is rather powerful in that it can load data from a
wide variety of sources. Typically there are different functions for different file types and we don’t want to explore
all of them here. Instead we want to give you the main idea about how loading in files works.
Let’s focus on loading in .csv files. Comma separated value, csv, files are files that can easily be read/written by
spreadsheet programs.
We want to read in the crime dataset which is stored in the Crimedata.csv file.
Make sure that you know where this file is on your computer and then run the following line.
A dialog box will appear and you must select the file.
# Note the header = T option means that the first row corresponds to a variable name.
crime <- read.csv(file.choose(), header=T)
Alternatively, if you know where the file is, especially if you run the code often and want automation, then you
can load the data explicitly.
# Note the header = T option means that the first row corresponds to a variable name.
crime <- read.csv("/home/Username/CMDA_3654/Tutorial1/Crimedata.csv", header=T)
Note the use of the quotation marks!
Now the crime dataset is in working memory. By default R treats this as a specific type of data structure called
a data frame. Essentially a data frame is just a way of organizing all of our variables in a related data set.
To see the first few observations from this dataset use the following command:
head(crime)
We can also ask R to tell us how many rows are in the dataset.
nrow(crime)
[1] 6609
When a dataset does not have column headings that serve as the names of the variables, we have to assign them
ourselves.
First, let's use a different type of data set called a .DAT file (a simple data file). The values in the data set are
separated by a simple space between them. The first value is the number of hours of snowfall (call it "snowhours").
The second value is the number of hours it takes for workers to clear the snow (call it "clearhours").
# There is no header so we don't say header = TRUE
snowdata <- read.table("T3-2.DAT")
print(snowdata)
V1 V2
1 12.5 13.7
2 14.5 16.5
3 8.0 17.4
4 9.0 11.0
5 19.5 23.6
6 8.0 13.2
7 9.0 32.1
8 7.0 12.3
9 7.0 11.8
10 9.0 24.4
11 6.5 18.2
12 10.5 22.0
13 10.0 32.5
14 4.5 18.7
15 7.0 15.8
16 8.5 15.6
17 6.5 12.0
18 8.0 12.8
19 3.5 26.1
20 8.0 14.5
21 17.5 42.3
22 10.5 17.5
23 12.0 21.8
24 6.0 10.4
25 13.0 25.6
We'll see that R assigns generic variable names "V1" and "V2" to the columns. The observation numbers
were also filled in by R.
We can assign variable names to the data after it has been read in.
colnames(snowdata) <- c("snowhours","clearhours")
We can also assign names to the row elements (such as patient names or other labels) using rownames()
head(snowdata)
snowhours clearhours
1 12.5 13.7
2 14.5 16.5
3 8.0 17.4
4 9.0 11.0
5 19.5 23.6
6 8.0 13.2
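A sketch of how rownames() could be used here (the labels below are made up for illustration and are not part of the original tutorial):
# Label each row without altering the original snowdata object.
snowdata.labeled <- snowdata
rownames(snowdata.labeled) <- paste0("storm_", 1:nrow(snowdata.labeled))
head(snowdata.labeled)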
1.8 Installing R Library Packages
R is a very open language. People from all over the world can submit their own libraries of code to the central R
repositories for other people to use.
In addition to new programs, functions, & plotting tools, installing new library packages in R can also provide
new practice datasets to work on.
In order to install a new library package (sometimes just called a library or just a package) in RStudio, there are
several ways.
Method 1: Go to Tools -> Install Packages, and search for and install the package you are looking for.
Method 2: In the bottom right panel, click on Packages, then click on Install, then search for and install the
package.
Method 3: Type the install.packages() command directly into the Console (or run it from a script), as shown below.
Install the psych package using any of the three methods stated above.
# To install the psych library using the Method 3 command, you should have used:
install.packages("psych", dependencies = TRUE)
Some other libraries that you should go ahead and install right now are:
install.packages("binom", dependencies = TRUE)
install.packages("epitools", dependencies = TRUE)
install.packages("car", dependencies = TRUE)
install.packages("multcomp", dependencies = TRUE)
install.packages("Sleuth3", dependencies = TRUE)
You can see which packages are already installed on your computer a couple of different ways, but the easiest
in RStudio is the Packages Tab in the bottom right portion of the screen. If you click there, you can see what
packages are installed, and those with checkmarks are the ones currently enabled. Certain packages are always
on by default, others you have to enable yourself. We can simply click on the check box to enable a package or
use the command line, which is often preferred.
Enabling a Package
In order to use a package, you must first enable it. If you are already on the Packages tab in RStudio, simply
make sure the box next to the library you want is checked.
Alternatively, from the Console or within a script, use one of the following commands:
library('package.name'), or library("package.name"), or library(package.name)
If you attempt to load a library before you have installed it, you will get an error.
We already installed the psych library above, now let’s enable it.
library("psych")
Go ahead and enable the other libraries that we installed today; we'll use at least one of them later.
Now that we installed the psych library, let’s look at one of the new functions.
A neat function from the psych library is describe(). Keep in mind, you’ll get an error if you did not install
and enable the psych library.
The dataset mtcars is a dataset that we can use this function on (it’s in the datasets System Library which is
auto-enabled when you start RStudio).
describe(mtcars)
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 23.50 0.61 -0.37 1.07
cyl 2 32 6.19 1.79 6.00 6.23 2.97 4.00 8.00 4.00 -0.17 -1.76 0.32
disp 3 32 230.72 123.94 196.30 222.52 140.48 71.10 472.00 400.90 0.38 -1.21 21.91
hp 4 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 283.00 0.73 -0.14 12.12
drat 5 32 3.60 0.53 3.70 3.58 0.70 2.76 4.93 2.17 0.27 -0.71 0.09
wt 6 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 3.91 0.42 -0.02 0.17
qsec 7 32 17.85 1.79 17.71 17.83 1.42 14.50 22.90 8.40 0.37 0.34 0.32
vs 8 32 0.44 0.50 0.00 0.42 0.00 0.00 1.00 1.00 0.24 -2.00 0.09
am 9 32 0.41 0.50 0.00 0.38 0.00 0.00 1.00 1.00 0.36 -1.92 0.09
gear 10 32 3.69 0.74 4.00 3.62 1.48 3.00 5.00 2.00 0.53 -1.07 0.13
carb 11 32 2.81 1.62 2.00 2.65 1.48 1.00 8.00 7.00 1.05 1.26 0.29
However, you may have to re-enable/activate the library if you quit R/RStudio.
Make sure to check to see if a package is enabled or else you will get errors when trying to use functions from
those packages.
1.9 Data Frames and Their Variables: The Basics
When you enter data into an object by hand, it is usually just a scalar variable or a simple vector/array of
numerical values or characters.
Example:
student <- c('Joe','Sara','Chen') # Vector/Array of Characters
age <- c(23, 22, 22) # Vector/Array of Numerical Values
grade <- c('Junior','Sophomore','Sophomore') # Vector/Array of Characters
In the above variables, we have 3 vectors of equal length: two contain characters, denoted by quotation marks ' ', and
one contains numerical values.
A data frame is a special type of dataset that is used for storing data tables. It is a list of vectors of equal length
where the columns correspond to variables and the rows within each column are the corresponding observation.
A natural example of a data frame is data organized in the spreadsheet in the Crimedata.csv file.
By default, most of the data that is naturally found in the R libraries (as well as the data we import from .csv
files) will be in the data frame format.
We can make a dataframe from vectors of equal length by using the data.frame() function:
student.data <- data.frame(student,age,grade)
print(student.data)
The trees dataset is provided by the datasets library which should already be enabled when you turn on
RStudio.
You can learn about the trees dataset by typing ? trees or help('trees') in the Console.
We can look at the entire dataset by simply typing the name of the dataset:
trees
9 11.1 80 22.6
10 11.2 75 19.9
11 11.3 79 24.2
12 11.4 76 21.0
13 11.4 76 21.4
14 11.7 69 21.3
15 12.0 75 19.1
16 12.9 74 22.2
17 12.9 85 33.8
18 13.3 86 27.4
19 13.7 71 25.7
20 13.8 64 24.9
21 14.0 78 34.5
22 14.2 80 31.7
23 14.5 74 36.3
24 16.0 72 38.3
25 16.3 77 42.6
26 17.3 81 55.4
27 17.5 82 55.7
28 17.9 80 58.3
29 18.0 80 51.5
30 18.0 80 51.0
31 20.6 87 77.0
If it's not too long, looking at the whole dataset is okay. But if it is very long and you just want to get a general
sense of what the data looks like, then we have a number of options.
First, we could use the head() function which lists the first 6 rows.
head(trees)
What we notice is that this dataset has 3 variables (all of which are numeric), they are “Girth”, “Height”,
“Volume”.
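The paragraph below describes the output of the str() function; the call itself is not shown, but it would be:
str(trees)  # display the structure of the trees data frame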
The output of str() specifically says the trees dataset is a 'data.frame' with 3 variables, each of which
has 31 observations. It then names the variables, states the type of each variable (num stands for numerical), and
lists some of the observations for each variable.
We could have also used the names() and nrow() functions to tell us the names of the variables and the number
of rows, respectively.
Suppose we wish to just look at a particular variable from this dataset, say Height, we can do the following:
trees$Height
[1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74 72 77 81 82 80
[29] 80 80 87
The above shows all 31 observations for the Height variable from the trees dataset.
To look at a specific variable from a dataset such as a data frame, do the following:
dataset.name$variable.name
For now, we'll just show the easiest approach. You can save a variable from a dataset into a new object that we
name ourselves.
our.tree.heights <- trees$Height
Now our new variable is called our.tree.heights and we can just call it by name.
our.tree.heights
[1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74 72 77 81 82 80
[29] 80 80 87
Suppose we want to know the 7th value of the Volume variable in the trees dataset.
We have a number of options to do this. Do the following and scroll up to the previous section to check if the
answer is correct.
trees$Volume[7]
[1] 15.6
trees[7,3]
[1] 15.6
The square brackets ‘[ ]’ are used for indexing locations in matrices, vectors, and data frames in R.
When we just have a single vector, then you only need one number for your index.
my.vector = c(10,20,30,40,50,60,70,80,90)
[1] 50
[1] 10 20 30 40 50
[1] 30 40 50 60 70
[1] NA
[1] 9
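The commands that produced the five lines of output above are not shown; indexing calls like these (a sketch) would reproduce them:
my.vector[5]      # the 5th element: 50
my.vector[1:5]    # elements 1 through 5
my.vector[3:7]    # elements 3 through 7
my.vector[10]     # there is no 10th element, so R returns NA
length(my.vector) # the number of elements: 9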
Here is an example of a matrix; ignore the details about how it is constructed for now.
my.matrix <- matrix(1:12, byrow=T, 3,4) # A matrix with 3 rows by 4 columns
my.matrix
Notice that the matrix is a 3-by-4 matrix. In general we describe matrices by their size: an m-by-n, matrix has
m rows and n columns.
When indexing a matrix, you must ask for the row location first, then the column location.
my.matrix[2,3] # row 2, column 3 should be 7
[1] 7
my.matrix[3, ] # Row 3, no number for the 2nd entry returns all columns
[1] 9 10 11 12
[1] 4 8 12
[,1] [,2]
[1,] 7 8
[2,] 11 12
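The remaining output above comes from commands that are not shown; a sketch that reproduces it:
my.matrix[, 4]       # column 4, all rows: 4 8 12
my.matrix[2:3, 3:4]  # rows 2-3 and columns 3-4: the 2-by-2 block shown above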
Getting back to the first example: the Volume variable was in the 3rd column of the trees dataset, so that's
why we could find the 7th value of the variable using either trees$Volume[7] or trees[7,3].
This should come in handy. One reason this might be useful is to split our dataset into different pieces.
all.heights <- trees$Height
all.heights
[1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74 72 77 81 82 80
[29] 80 80 87
some.heights
[1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75
remaining.heights
[1] 74 85 86 71 64 78 80 74 72 77 81 82 80 80 80 87
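The definitions of some.heights and remaining.heights are not shown; they were presumably created by splitting the vector by index, for example:
some.heights      <- all.heights[1:15]   # the first 15 heights
remaining.heights <- all.heights[16:31]  # heights 16 through 31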
What if you wanted to use R as a graphing calculator, what would you do?
# Create a sequence of x-values from 0 to 2*pi.
x <- seq(0, 2*pi, length.out=100)
# Using the x-values, evaluate a function and store the result in an array object called "y".
y <- sin(x)
# Plot the function as a line.
plot(x, y, type="l")
# To add a single point to an existing plot we can use the following command:
points(2,0,col="red",pch=2) # This new point will be red (col stands for color)
w <- 2*pi*x-1
points(x, w, col="blue") # points() adds the (x,w) points to the existing plot
lines(x, w) # lines() connects the (x,w) points with a solid line on the existing plot
We can specify a title for our plot, rename our axes, as well as redefine our axes.
We will illustrate how to do these with more examples later in the tutorial.
There are other ways to put new points/lines on existing plots. But the above examples illustrate the basic idea
for most simple cases.
A picture of the final plot that you should obtain is given below.
[Figure: the final plot of y = sin(x) with the added points and lines.]
In addition to the plot function, there are other functions that do a lot of fancy plotting with a few simple options.
For example, the hist() function will be used to make histograms, while boxplot() will generate boxplots.
We will see some examples of these functions and many more throughout the R tutorials.
Chapter 2
In this part of the tutorial you will learn how to use R to do many of the basic statistical methods that you have
learned to do by hand from the textbook.
4. Probability.
2.1 Elementary Statistics Using R.
[1] 76
[1] 6.371813
[1] 40.6
[1] 76
78%
80.4
# Okay, so that last one is probably not obvious, unless you've been following the class lecture.
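The commands that produced the six lines of output above are not shown; a sketch that reproduces them, assuming the tree height data from the previous chapter, is:
mean(trees$Height)           # 76
sd(trees$Height)             # 6.371813
var(trees$Height)            # 40.6
median(trees$Height)         # 76
quantile(trees$Height, 0.78) # the 78th percentile, 80.4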
If you can think of a statistical method, then someone has probably already made a function for that technique!
Even if it’s not in the base installation of R, the function you want can usually be found in one of the library
packages that we can download.
There are some other functions that are quite useful at times. Please note that these functions work differently
depending on whether we are using them on a data frame, a variable, or some other type of object.
For example, summary(trees) gives a "5 number summary" along with the mean for every variable in the data frame.
summary(trees$Height)
The summary() function will be used to show us other useful information in more advanced tutorials.
Investigate what the describe() function from the psych library returns when we use it on the trees dataset.
First we are going to look at univariate summaries. Our main tools are bar plots, histograms, stripcharts, &
boxplots.
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
270 264 234 302 229 265 281 213 236 216 207 164 176 148 150 146 125 137 135 129 118 136
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
122 114 146 116 122 123 119 121 118 73 94 94 97 77 70 97 63 52 50 56 57 32
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
34 29 21 22 21 25 23 17 8 10 8 7 11 7 6 9 5 6 4 3 9 4
78 79 80 82 84 85 87 89 90
3 3 3 1 6 3 2 1 4
# In the above, we see that there were 270 - 12 year old victims, 264 - 13 year olds, etc.
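# The table() calls behind the two frequency tables in this section are not shown;
# a sketch (the column names crime$age and crime$Police are assumptions based on the surrounding text):
table(crime$age)                                    # frequency table of victim ages
crime.police.reported.counts <- table(crime$Police) # counts of 'No Police' vs 'Police'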
# This will display a frequency table of whether the police were called or not for the crimes.
crime.police.reported.counts
No Police Police
2695 3914
We could use the frequency tables to make bar plots & histograms by hand but let’s let R do this for us.
Bar Plots
A simple bar plot is made using the barplot() function. We use this on categorical data only consisting of counts
(absolute frequency) for particular categories such as the crime.police.reported.counts table.
barplot(crime.police.reported.counts)
[Figure: bar plot of crime.police.reported.counts.]
This is an ugly looking plot, but we can customize many aspects of it. A lot of this is done just through playing
around with the various options.
Most of the time, a simple bar plot is all we need. However, we often prefer using relative frequencies instead of
absolute.
This requires us to rescale our counts so that everything is expressed as a fraction of the total.
# Divide our counts by the total sum for relative frequencies.
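# The assignment that creates relative.crime.reports is not shown above; it was
# presumably the following (dividing the table by its total):
relative.crime.reports <- crime.police.reported.counts / sum(crime.police.reported.counts)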
relative.crime.reports
No Police Police
0.4077773 0.5922227
barplot(relative.crime.reports, col=c('purple','green'),
main='Rates of Police Calls for Crimes' )
[Figure: Rates of Police Calls for Crimes (relative frequency bar plot).]
Histograms
Histograms are easy in R. Simply use the hist() function. By default this will use the absolute frequencies
(counts) instead of relative frequencies. There is an option that we can feed the hist() function to produce a
histogram with relative frequencies (aka proportions/densities) instead.
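The default (count) histogram shown below was presumably produced with:
hist(crime$age)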
[Figure: Histogram of crime$age (frequency counts).]
Now, if we want relative frequencies (proportions/densities) instead of absolute frequencies (or counts), we use
the following option:
# Histogram with Relative Frequency instead
hist(crime$age, prob=TRUE)
[Figure: Histogram of crime$age (relative frequency / density).]
We can zoom in or out, that is change the class-width interval of the bins, by changing the number of breaks. By
default, R picks the number of breaks to use.
# Histograms with different number of bins
hist(crime$age, breaks=10, main="Histogram of crime$age, breaks=10")
hist(crime$age, breaks=20, main="Histogram of crime$age, breaks=20")
hist(crime$age, breaks=50, main="Histogram of crime$age, breaks=50")
hist(crime$age, breaks=100, main="Histogram of crime$age, breaks=100")
[Figure: four histograms of crime$age with breaks = 10, 20, 50, and 100.]
We can customize just about everything, the title, the x & y labels, axes, colors, etc.
Don’t be afraid to google or use the help system to help you get what you want!
# Add your own title using , main="Your Title"
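# A sketch of the calls that likely produced the next two figures (the titles and
# axis labels are taken from the figures; the break points are an assumption):
hist(crime$age, main="Histogram of Age")
hist(crime$age, breaks=seq(10, 90, by=5), main="Histogram of Age", xlab="Age in Years")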
[Figure: Histogram of Age.]
[Figure: Histogram of Age with the x-axis labeled "Age in Years" and tick marks every 5 years.]
Stripcharts are fairly easy to use, but customizing is often very important. Let’s generate a stripchart for the Height
variable from the trees dataset. Let’s also learn how to generate a dotplot which uses the same stripchart()
function but with different options.
Some Notes:
pch = 20 plots little filled circles and col=’red’ makes these circles red.
Since there are multiple trees at certain heights (for example, height = 80), the method='jitter' option adds a tiny
random variation, so the dots aren't on top of each other.
Since jitter relies on the random number generator, if you want the same plot that I have then we need to reset
our random number generators.
The random number generators are reset to a starting value by using the set.seed() function.
First let’s see what a stripchart looks like with no extra options, then we’ll see how it looks after we modify it a
bit.
stripchart(trees$Height, main="Stripchart for Tree Height Data")
[Figure: default stripchart of trees$Height.]
You can't tell this from the above plot, but there are actually multiple observations at certain values.
As you can see below, there are five trees with a height equal to 80. So we need to modify the stripchart to be a little
more clear about this fact.
trees$Height
[1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74 72 77 81 82 80
[29] 80 80 87
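The jittered stripchart described in the notes above was produced with a call along these lines (a sketch; the seed value is an assumption, and any fixed seed gives a reproducible jitter):
set.seed(1)
stripchart(trees$Height, method='jitter', pch=20, col='red',
           main="Stripchart for Tree Height Data")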
[Figure: Stripchart for Tree Height Data (jittered points).]
The only change that needs to be made for a dotplot is the method='stack' option.
# Note we are also making our plot horizontal which is the default anyway.
# method = 'stack', which produces a dotplot.
stripchart(trees$Height, method='stack', pch=20, col='red', main="Dotplot for Tree Height Data")
[Figure: Dotplot for Tree Height Data.]
I'm not a huge fan of where R puts the tick marks by default.
The xaxt='n' option deletes the x-axis along with the tick marks, and the axis() function can then be used to customize my own.
stripchart(trees$Height, method='stack', pch=20, col='red', xaxt='n',
main="Dotplot for Tree Height Data")
axis(side=1, at=seq(from=63, to=87, by=2)) # put ticks every 2 units apart between 63 and 87
[Figure: Dotplot for Tree Height Data with tick marks every 2 units from 63 to 87.]
Boxplots
The boxplot() is fairly easy to use now that you’ve seen some other functions.
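The call that produced the boxplot below is not shown; it was presumably:
boxplot(trees$Height, main="Boxplot of Tree Height Data")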
[Figure: Boxplot of Tree Height Data.]
No outliers, kind of boring! Let’s try making a horizontal boxplot of the crime$age observations.
# By default boxplots are vertical, to change to horizontal, use horizontal = TRUE
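# A sketch of the horizontal boxplot call (the title and axis label are taken from the figure below):
boxplot(crime$age, horizontal=TRUE, xlab="Age", main="Boxplot of Age of Crime Victims")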
[Figure: Boxplot of Age of Crime Victims (horizontal, x-axis "Age").]
This data corresponds to Olympic track records for various countries in various running events. The 100 meter
and 200 meter times are both numerical variables. The appropriate tool to use is a scatter plot.
plot(track.records$m100, track.records$m200, xlab='100 meter dash times',
ylab='200 meter dash times', pch=18, col="blue")
[Figure: scatter plot of 100 meter dash times vs. 200 meter dash times.]
When we have numerical-categorical data, we can compare the categories by using side-by-side boxplots.
The categorical variable contains specific names of categories, such as colors "Yellow", "Green", "Blue"; or perhaps
general ranges "Low", "Medium", "High"; or specific outcomes such as "Positive", "Negative".
Sometimes the categories have a number as a label. Consider dosages. Instead of “10 mg”, “20 mg”, “30 mg”,
sometimes people just put “10”, “20”, “30” for the labels. We have to be careful when using numbers as labels!
Suppose I have three categories for which I have the same type of measurements, for example, Vitamin C measurements
(in mg per 200 g of fruit) for Apples, Cranberries, and Oranges.
First note that each category does not have to have the same number of measurements.
apple <- c(8.4, 7.3, 10.9, 37.1, 17.4) # 5 measurements here
cranberry <- c(20.6, 16.7, 26.8, 44.2) # 4 measurements here
orange <- c(61.1, 56.5, 46.3, 53.2, 35.1, 57.8) # 6 measurements here
Method 1: To make side-by-side boxplots when the data is in the form given by simple numerical vectors designated
by the category names, do the following:
boxplot(apple,cranberry,orange, names=c('apple','cranberry','orange'), col=c('red','pink','orange'),
main='Vitamin C in Fruit (mg per 200g of fruit)')
[Figure: Vitamin C in Fruit (mg per 200g of fruit), side-by-side boxplots.]
If you are importing data from a .csv file or some other method, the data will usually be in a data frame.
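The long-format data frame printed below was presumably built along these lines (a sketch; the construction is not shown):
fruit <- rep(c('apple','cranberry','orange'), times=c(5,4,6))
vit.c.mg <- c(apple, cranberry, orange)
vitamin.c.content <- data.frame(fruit, vit.c.mg)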
print(fruit)
print(vit.c.mg)
[1] 8.4 7.3 10.9 37.1 17.4 20.6 16.7 26.8 44.2 61.1 56.5 46.3 53.2 35.1 57.8
print(vitamin.c.content)
fruit vit.c.mg
1 apple 8.4
2 apple 7.3
3 apple 10.9
4 apple 37.1
5 apple 17.4
6 cranberry 20.6
7 cranberry 16.7
8 cranberry 26.8
9 cranberry 44.2
10 orange 61.1
11 orange 56.5
12 orange 46.3
13 orange 53.2
14 orange 35.1
15 orange 57.8
Let’s explore the pattern for this other way of producing boxplots.
Method 2(a): To make side-by-side boxplots, the general R command is:
boxplot(data$numerical.variable ~ data$categorical.variable)
Depending on the circumstance, the categorical variable is sometimes called a factor variable and the categories
of the factor variable are often called the levels.
Please note: If you have used only numbers as labels and have not used the quotations to designate the numbers
as characters, i.e. '20', then you can use the factor() function to tell R that the numerical values are actually
labels.
Method 2(b): To make side-by-side boxplots when the categories are numerical values:
boxplot(data$numerical.variable ~ factor(data$categorical.variable))
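The legend code below is added to a side-by-side boxplot of the mtcars data that is not shown; a sketch (the title, labels, and colors are inferred from the legend code and the figure):
boxplot(mtcars$mpg ~ factor(mtcars$cyl), col=c("orange","green","cyan"),
        xlab="Cylinders", ylab="MPG", main="MPG by Number of Cylinders")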
# Let's add a legend just for fun. Use the legend() function.
# It seems redundant in this plot though.
legend("topright", inset=0.05, title="Number of Cylinders", c("4","6","8"),
fill=c("orange","green","cyan"))
[Figure: MPG by Number of Cylinders, side-by-side boxplots with a "Number of Cylinders" legend.]
In order to create a stacked bar chart of the MSA and Police variables, you first need to create a bivariate
frequency table (or a contingency table) with the counts of each of these categorical variables.
table(data$variable1, data$variable2)
Note: The first variable in the argument will be along the left side column in the plot.
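For this example the call was presumably (the column names crime$MSA and crime$Police are assumptions based on the surrounding text):
counts <- table(crime$MSA, crime$Police)
counts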
No Police Police
Rural 325 489
Suburban 1315 1856
Urban 1055 1569
We will use the barplot() function to obtain a stacked bar plot for this data.
barplot(counts, main="Stacked Bar Chart of MSA and Police Reporting")
[Figure: Stacked Bar Chart of MSA and Police Reporting.]
Stacked bar plots won’t make sense without a legend. So let’s add one. Also, we should add an x-axis label.
To add a legend, make sure to use c("Category1", "Category2", "etc.") for as many categories as you need.
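A sketch of the stacked bar plot with a legend shown below (the title and axis labels are taken from the figure; the exact call is not shown):
barplot(counts, main="Different Neighborhood Crime Locations and Police Reporting",
        xlab="Police Reporting", ylab="Number of Neighborhoods",
        legend.text=c("Rural","Suburban","Urban"))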
[Figure: Different Neighborhood Crime Locations and Police Reporting (stacked bar plot with legend).]
There are still some formatting issues that we should fix, but before we do that, what if we actually wanted a
different stacked bar plot with the crime locations on the x-axis and the frequency of police calls to be the heights?
We need to use the t() function. This function takes the transpose of matrices and tables.
barplot(t(counts), main='Police Reporting for Crimes in Different Neighborhood Types',
xlab='Neighborhood Type', ylab='Frequency of Police Non-Reporting/Reporting',
col=c('red','blue'))
[Figure: Police Reporting for Crimes in Different Neighborhood Types (stacked bar plot by neighborhood type).]
There are many options that we can use to modify the plots. We can add colors, rotate the plot, or put the bars
next to each other instead of stacked (this is called a grouped bar plot).
barplot(t(counts), main='Police Reporting for Crimes in Different Neighborhood Types',
ylab='Neighborhood Type', xlab='Police Non-Reporting/Reporting Frequency',
col=c("cyan","orange"), horiz=T, beside=T, xaxt = 'n')
# side=1 refers to the x-axis. Side: 1=below, 2=left, 3=above and 4=right
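# The axis() call that the comment above refers to is not shown; presumably
# something like this (the tick positions are an assumption):
axis(side=1, at=seq(from=0, to=2000, by=500))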
# Let's add a legend, inset pushes away from the side of the plot by a tiny bit.
legend("bottomright", inset=0.05, c('No Police Called','Police Called'),
fill = c('cyan','orange'))
[Figure: Police Reporting for Crimes in Different Neighborhood Types (horizontal grouped bar plot with legend).]
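The row-wise relative frequencies shown below were presumably computed with prop.table() (margin=1 gives proportions within each neighborhood type):
relative.counts <- prop.table(counts, margin=1)
relative.counts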
No Police Police
Rural 0.3992629 0.6007371
Suburban 0.4146957 0.5853043
Urban 0.4020579 0.5979421
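A sketch of the relative-frequency stacked bar plot shown below (the title and axis labels are taken from the figure; the colors are an assumption):
barplot(t(relative.counts), col=c("cyan","orange"), xlab="Neighborhood Type",
        ylab="Relative Frequency of Police Reports",
        legend.text=c("No Police Called","Police Called"),
        main="Relative Frequency of Police Reporting for Crimes\nin Different Neighborhood Types")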
[Figure: Relative Frequency of Police Reporting for Crimes in Different Neighborhood Types.]
2.4 Probability
R has most of the commonly used probability distributions already built into the base installation.
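As a reminder, the standard base R short names for some common distributions are listed here (the tutorial's original table is not reproduced):
# Common base R distribution names (combined with a prefix letter such as d, p, q, or r):
# Normal = "norm", Student's t = "t", Exponential = "exp",
# Binomial = "binom", Uniform = "unif", Chi-square = "chisq".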
2.4.2 Evaluating the probability density function at a specific value of the random
variable.
By prefixing a "d" to the distribution name (see the names above), you can get probability density values (pdf).
In general, d<dist.name> evaluates y = f(x), where f(x) is the probability density function that describes the
probability distribution.
The dnorm() function returns the height of the normal curve at the desired value along the x-axis.
# Example 1: Evaluate the probability density function (pdf)
# for a normal distribution with mean=20, sd=4
# for x = 23
dnorm(23, mean=20, sd=4)
[1] 0.07528436
[Figure: The Normal Distribution with parameters N(20,4); the pdf is evaluated at x=23.]
# Example 2:
dexp(3,rate=1/2)
[1] 0.1115651
[Figure: The Exponential Distribution with rate parameter = 1/2; the pdf is evaluated at x=3.]
By prefixing a "p" to the distribution name, you get the CDF, or cumulative distribution function, which is the left-tail probability for a given value of x, say x0.
pnorm(23, mean=20, sd=4)
[1] 0.7733726
[Figure: The Normal Distribution N(20,4); the probability P(X <= 23) = pnorm(23,mean=20,sd=4).]
The default action of pnorm() is to always give the left-tail probability. To get a right-tail probability, subtract from 1 or use the lower.tail=FALSE option.
1 - pnorm(23, mean=20, sd=4)
[1] 0.2266274
# or equivalently
pnorm(23, mean=20, sd=4, lower.tail=FALSE)
[1] 0.2266274
[Figure: The Normal Distribution N(20,4); the probability P(X > 23) = 1 - pnorm(23,mean=20,sd=4).]
pt(1.3, df = 23)
[1] 0.8967606
# Note: Unlike your tables from the book, R can handle fractional degrees of freedom as well,
# what if df = 23.7? P(T<1.3) = ?
pt(1.3, df = 23.7)
[1] 0.8969486
qnorm(0.37,mean=20,sd=4)
[1] 18.67259
[Figure: For the distribution N(20,4), the value x0 = 18.67 has an area of 37% to its left; that is, P(X <= x0) = 0.37.]
The answer is given by the 37th percentile of the distribution, qnorm(0.37,mean=20,sd=4).
qt(0.33, df=14)
[1] -0.4494312
Of course, we might also wish to know which value of the random variable has a certain percentage of values
above it. To obtain this answer, we use the option lower.tail = FALSE.
# Example 3: If T ~ t-distribution with df = 14, what if we want to know
# which value of t corresponds to having 67% of the values ABOVE it?
# (This is still the 33rd percentile just worded differently!)
qt(0.67, df=14, lower.tail=FALSE)
[1] -0.4494312
[Figure: The t-distribution with df=14; the 33rd percentile (-0.449) has 33% of the data below it and 67% above it.]
By prefixing an "r" in front of your desired distribution, you can generate random numbers from that distribution.
# Example 1: Generate a single random number from a normal distribution
# with mean = 20, sd = 4
rnorm(1,mean=20,sd=4)
[1] 13.26149
rnorm(20, mean=20, sd = 4)
[1] 17.64615 17.59858 19.69242 23.81140 18.45567 20.17301 23.07194 16.42878 18.83256
[10] 21.28481 22.05607 19.18567 21.42000 21.26179 26.65697 22.64397 18.38407 27.49874
[19] 22.15625 28.72936
# Let's run the command again and get a different set of random numbers but this time we'll store it!
x <- rnorm(20, mean=20, sd=4)
print(x)
[1] 18.53378 19.99741 21.53830 19.45958 17.94471 31.27115 19.45154 22.86772 21.69416
[10] 17.01573 24.06403 13.17149 24.65973 15.60269 21.26390 17.52044 20.90198 24.36188
[19] 27.65063 18.82910
Obviously this is a bunch of numbers that we probably don’t want to see, so it’s best to store it and view them
with a picture.
We sampled 20 random numbers from the normal distribution with µ = 20, and σ = 4.
We can see where these 20 values correspond to with respect to the distribution from where they were sampled.
[Figure: the N(20,4) density curve with the 20 sampled values marked along the x-axis.]
If we wanted to, we could keep sampling more and more values. As the sample becomes large enough, we can
start to collect these values into bins and create a histogram. This histogram should resemble the distribution
from which the random variables were sampled.
To see this, let’s increase our sample size but this time make a histogram.
# Let's do this again, this time for n = 200
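# A sketch of the code the comment above refers to (the exact call is not shown):
x <- rnorm(200, mean=20, sd=4)
hist(x)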
[Figure: Histogram of x for the larger random sample.]
The main distribution that we focus upon for the first third of the course is the normal distribution.
The normal distribution is considered to be the most important distribution in Statistics. We will cover this more
in detail in class.
In practice, there are a large number of techniques that can be used when data come from a normal distribution.
But how can we know if our data actually come from a normal distribution?
There are three primary tools that we use to assess whether or not a sample comes from a normal distribution.
1. Histograms
2. Normal probability plots (QQ plots)
3. The Shapiro-Wilk test
2.5.1 Histograms
Let's use some data from the Sleuth3 library that we installed earlier.
library('Sleuth3')
Note, if we have not installed and enabled the Sleuth3 library, the following commands will not work!
To learn more about this data set run the following command
? case0201
head(case0201)
Year Depth
1 1976 6.2
2 1976 6.8
3 1976 7.1
4 1976 7.1
5 1976 7.4
6 1976 7.8
describe(case0201)
vars n mean sd median trimmed mad min max range skew kurtosis se
Year 1 178 1977.0 1.00 1977.0 1977.00 1.48 1976.0 1978.0 2.0 0.00 -2.01 0.08
Depth 2 178 9.8 1.03 9.9 9.86 1.04 6.2 11.7 5.5 -0.69 0.55 0.08
We see that there are 2 variables, Year and Depth (corresponding to beak depth).
To avoid typing case0201$Depth a lot, let’s make life a little easier by naming a new variable.
beak.depth <- case0201$Depth
The main idea behind using a histogram to check for normality of a sample is to see if the shape is approximately
normal and the empirical rule seems to mostly hold.
## Run the following lines and read carefully
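# A reconstruction of the lines described below (the pattern matches the later
# examples in this tutorial; the exact number of breaks is an assumption):
hist(beak.depth, breaks=10, probability=T,
     main="Histogram of Beak Depths with Normal Curve", xlab="Beak Depth")
xfit <- seq( min(beak.depth), max(beak.depth), length=40)
yfit <- dnorm(xfit, mean=mean(beak.depth), sd=sd(beak.depth) )
lines(xfit, yfit, col="blue", lwd=2)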
[Figure: Histogram of Beak Depths with Normal Curve overlaid.]
The above lines look long and confusing (in fact this is the hardest part of this first tutorial) so let’s break them
down and try to understand what they do.
The above is nothing new, feel free to change breaks = 10 to 20 and re-run the lines. Notice that probability =
T means that we are using relative frequencies.
We saw this earlier in the tutorial when we looked at plotting. This line generates equally spaced values between
the min & max values of the beak.depth observations.
Investigate xfit by typing it and looking at the output.
In particular, y = f (x) where f (x) is the Gaussian function used in the normal distribution.
Since this data may or may not come from a normal distribution, we are going to assume it does and see how
well a normal curve (the dark blue curve) with the same mean and standard deviation compares.
Since the real µ and σ are unknown quantities we’ll estimate them using sample mean and sample standard
deviation.
This is simply plugging the x-values into the Gaussian function f (x) and getting the corresponding y-values.
Finally, we just want to overlay this theoretical curve on the existing plot without plotting a new plot using:
lines(xfit, yfit, col="blue", lwd=2)
Once again, the overlayed “normal curve” is the theoretical distribution that uses µ equal to the sample mean,
X, and σ equal to the sample standard deviation, s.
The overlayed curve and the histogram don't have to match perfectly, but the better they do, the more plausible
it is that the sample came from a normal distribution.
For the beak.depth observations, we see that the histogram indicates that the sample is slightly left skewed, but
some may also be tempted to judge the data as “normal enough”.
Sometimes the number of breaks that you use can also affect your initial judgement on normality.
This is one of the reasons we use other tools to help make the judgement.
Histograms are only the first tool to assessing whether data are truly from a normal distribution or not.
These were used for comparing a sample with a theoretical normal distribution.
In general, "qqplot" is the more generic term used when we want to compare a data set to a particular distribution, not
necessarily the normal distribution.
Run the following commands. Read to understand what the commands are doing and what we should be seeing.
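A reconstruction of the commands referred to here (the pattern matches the later examples in this tutorial; the plot title is taken from the figure):
qqnorm(beak.depth, pch=20, main="Normal Probability Plot of Beak Depths")
qqline(beak.depth, col="red", lwd=2)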
[Figure: Normal Probability Plot of Beak Depths.]
You should see a normal probability plot for the beak.depth data.
The function qqline(), plots the “ideal” line that the data should follow if it was from a perfectly normal
distribution.
In general, the data won’t follow this line exactly, but it will help you get an idea of how approximately normal
the data is.
Extreme deviations from the line are indicators that the sample may not be from a normal distribution.
The beak.depth data deviate from the line a little at the tails. It seems difficult to make an absolute judgement
based upon this plot, but we would be tempted to say that the data is not from a normal distribution and is
slightly left skewed, which agrees with our histogram.
We can use a computational test called the Shapiro-Wilk’s test for normality.
It assumes that the sample data values are from a normally distributed population and looks for evidence to the
contrary. It then conducts a statistical test. The details of this test are too advanced for this course, but we can
still learn how to use and interpret the results of this test.
The command for testing the beak.depth variable for normality is below.
shapiro.test(beak.depth)
data: beak.depth
W = 0.96781, p-value = 0.000393
If the p-value < 0.10, then the data IS ASSUMED TO NOT BE FROM a normally distributed population.
If the p-value ≥ 0.10, then it is considered that the data IS PLAUSIBLY FROM a normally distributed
population.
In this case, the p-value = 0.000393, which is less than 0.10, so we have evidence to conclude that the
data is not from a normal distribution.
Let’s look at another example. This time we will look at a case where the data is definitely not from a normal
distribution.
# Example 2: Test a random sample to see if it comes from a normally distributed population.
# Let's investigate the beta distribution with shape parameters alpha = beta = 0.5
# Clearly a beta distribution is not a normal distribution!!
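# The line generating my.sample is not shown; it was presumably something like
# this (the sample size of 100 is an assumption):
my.sample <- rbeta(100, shape1=0.5, shape2=0.5)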
# A histogram
hist(my.sample, breaks=20, probability = TRUE)
xfit <- seq( min(my.sample), max(my.sample), length=40)
yfit <- dnorm(xfit, mean=mean(my.sample), sd=sd(my.sample) )
lines(xfit, yfit, col="blue", lwd=2)
[Figure: Histogram of my.sample with the fitted normal curve overlaid.]
# A normality plot
qqnorm(my.sample, pch=20)
qqline(my.sample, col="red", lwd=2)
[Figure: Normal Q-Q Plot of my.sample.]
# Shapiro-Wilk's test
shapiro.test(my.sample)
data: my.sample
W = 0.86781, p-value = 5.725e-08
All three tools are in perfect agreement. This data is not from a normally distributed population.
Let’s look at one more example from the Sleuth3 library. This data is not normally distributed but we’ll use a
log transformation to make it into one.
# Example 3: Salaries are often NOT normally distributed.
# You can investigate that this salary data is NOT normally distributed on your own.
# Run the following line, the Sleuth3 library must be installed and enabled.
salaries <- case0102$Salary
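# Take the natural log of the salaries; this is the log transformation referred to above.
log.salaries <- log(salaries)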
# A histogram
hist(log.salaries, breaks=20, probability = T)
xfit <- seq( min(log.salaries), max(log.salaries), length=40)
yfit <- dnorm(xfit, mean=mean(log.salaries), sd=sd(log.salaries) )
lines(xfit, yfit, col="blue", lwd=2)
[Figure: Histogram of log.salaries with the fitted normal curve overlaid.]
# A normality plot
qqnorm(log.salaries, pch=20)
qqline(log.salaries, col="red", lwd=2)
[Figure: Normal Q-Q Plot of log.salaries.]
# Shapiro-Wilk's test
shapiro.test(log.salaries)
data: log.salaries
W = 0.9817, p-value = 0.2183
Since the p-value = 0.2183 is greater than 0.10, it is plausible that the log-transformed salaries come from a normally distributed population.
Sampling distributions are very important in Statistics. Whether we explicitly say so or not, we will use several
kinds in this course.
While it's not important for this class to study every sampling distribution in detail, we will choose to look at
the easiest sampling distribution, which is the sampling distribution for X̄.
In class, we learned that the sampling distribution for the sample mean, X̄, has some useful properties.
First, recall that

X̄ = (X1 + X2 + · · · + Xn)/n

where the measurements X1, X2, . . ., Xn are independent and identically distributed from a population which has
a population mean µ and population standard deviation σ.
Since X̄ is a linear combination of X1, . . ., Xn, we can use the properties of linear combinations of random variables
to obtain the mean and standard deviation of X̄.
The mean of X̄ is given by E(X̄) = µ and the standard deviation is given by sqrt(Var(X̄)) = σ/√n.
We learned that
1. If the population from which samples are drawn is normal, then the sampling distribution of X̄ is also
normal regardless of the sample size n.
2. Central Limit Theorem (CLT): If n is large, then the sampling distribution of X̄ is approximately normal,
even if the population from which the measurements are taken is not normal.
In this section we are going to try to understand a little more about sampling distributions and the central limit
theorem.
We are going to look at how well the central limit theorem holds up for different sample sizes and for different
distributions.
To investigate, we are going to conduct a "meta-study" consisting of repeatedly taking samples of equal size,
computing the sample mean, x̄, for each sample, and then constructing a histogram of the sample means.
Run the following commands and read the explanations.
mu <- 25
sigma <- 7
# Take 500 samples, each of size n1 = 50, compute the sample mean for each,
# then store the answers in the vector xbar1.
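# A reconstruction of the simulation loop described above (it follows the same
# pattern as the exponential example later in this section):
n1 <- 50
xbar1 <- rep(0, 500)
for (i in 1:500) { xbar1[i] <- mean( rnorm(n1, mean=mu, sd=sigma) ) }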
# For our vector of sample means (little xbar) xbar1, make a histogram.
hist(xbar1, main=("Hist of 500 Random Sample Means, samples of \n size n=50 sampled from X ~ N(25,7)"),
xlab=expression(bar("X")), breaks=30, prob=TRUE)
[Figure: Hist of 500 Random Sample Means, samples of size n=50 sampled from X ~ N(25,7).]
Since we are sampling from a normal distribution (note the use of rnorm()), the sampling distribution should
look rather normal in shape for any sample size. We chose n = 50.
In addition to the histogram, don't forget that you can look at qqplots too and run Shapiro-Wilk's test to assess
normality.
qqnorm(xbar1, pch=20); qqline(xbar1, col="red")
[Figure: Normal Q-Q Plot of xbar1.]
shapiro.test(xbar1)
data: xbar1
W = 0.99774, p-value = 0.7451
We can verify that the mean of X̄ (the mean of the sample means, E(X̄)) is close to the population mean of
X ~ N(25, 7), which is µ = 25.
## Is the mean of xbar close to 25?
mean(xbar1)
[1] 25.00307
Additionally, we can verify that the standard deviation of X̄ is σ/√n = 7/√50 = 0.9899495.
# Is the standard deviation of xbar close to 0.9899495?
sd(xbar1)
[1] 1.006624
Repeat the above, but this time, change the sample size from
n1 = 50 to n1 = 5
You should just be able to cut and paste the lines of code from the example above, changing n1 in the definition and
n=5 in the title for the plot.
Now let's repeat the above analysis and see how well the central limit theorem works when the population we sample
from is not normal. To illustrate this, let's look at samples taken from the Exponential Distribution. Don't worry
about the details of this distribution.
If you want to see what this distribution looks like, the dark blue line represents the theoretical curve.
hist(rexp(100000,rate=25), breaks=20, prob=T)
lines(seq(0,0.3,len=50), dexp(seq(0,0.3,len=50),rate=25),type='l', col="blue", lwd=2)
[Figure: histogram of 100,000 draws from Exp(rate=25) with the theoretical density curve overlaid.]
Let’s see how the central limit theorem works in this case.
set.seed(303) # for reproducibility
n2 <- 5
xbar2=rep(0,500)
for (i in 1:500) { xbar2[i]=mean( rexp(n2,rate=25)) } # Draw a sample of size n2 from X ~ Exp(25)
hist(xbar2, main="Hist of 500 Random Sample Means \n samples of size n=5 \n sampled from X ~ Exp(25)",
xlab=expression(bar("X")), breaks=20)
[Figure: Hist of 500 Random Sample Means, samples of size n=5 sampled from X ~ Exp(25).]
mean(xbar2) # The mean for this distribution in this case should be close to 1/25 = 0.04
[1] 0.03847333
sd(xbar2) # The sd for this dist for n=5 should be close to (1/25)/sqrt(n) = 0.01788854
[1] 0.01715966
We see that the mean and standard deviation seem to match fairly well, but the shape is not exactly normally
distributed as it appears to be slightly skewed to the right.
We can repeat the above analysis but this time, we’ll change the sample size from n = 5 to n = 50.
set.seed(303) # for reproducibility
n2 <- 50
xbar2=rep(0,500)
for (i in 1:500) { xbar2[i]=mean( rexp(n2,rate=25)) } # Draw a sample of size n2 from X ~ Exp(25)
hist(xbar2, main="Hist of 500 Random Sample Means \n samples of size n=50 \n sampled from X ~ Exp(25)",
xlab=expression(bar("X")), breaks=20)
[Figure: Hist of 500 Random Sample Means, samples of size n=50 sampled from X ~ Exp(25).]
mean(xbar2) # The mean for this distribution in this case should be close to 1/25 = 0.04
[1] 0.03965549
sd(xbar2) # The sd for this dist for n=50 should be close to (1/25)/sqrt(50) = 0.005656854
[1] 0.00570074
Increasing the sample size has helped to make the distribution more normal in shape.
If we increase the sample size even further, this approximation continues to improve.
The general rule of thumb is that for most distributions a sample size of n > 30 is good enough to satisfy the
central limit theorem.
However, if the original population is extremely skewed (as is the case of the exponential distribution), then larger
sample sizes are generally needed.
Chapter 3
In this part of the tutorial you will learn how to use R to:
1. Produce confidence intervals and conduct hypothesis tests for a single mean.
2. Produce confidence intervals and conduct hypothesis tests for differences of means from two independent
samples.
3.1 Inference for a Single Mean
Let’s start by reviewing a bit about confidence intervals for a single mean µ and how to generate them in R.
Before you start, realize that we are going to discuss the long way to compute confidence intervals first, then I
will show you the R shortcut later on.
Confidence Intervals are used to provide us with an interval estimate for the true population mean, µ.
We want to find confidence intervals for the mpg variable from the mtcars data set. In order to compute confidence
intervals, we need to know the following:
1. the sample size
2. the sample mean
3. the sample standard deviation
To get the sample size, there are a number of ways to accomplish this: First, we can use the describe command
that we used earlier describe(mtcars). You will see n=32.
n <- length(mtcars$mpg)
n
[1] 32
nrow(mtcars)
[1] 32
# Next, the sample mean:
ybar <- mean(mtcars$mpg)
ybar
[1] 20.09062
s <- sd(mtcars$mpg)
s
[1] 6.026948
Now, 100(1 − α)% confidence intervals are found using the Student's t-distribution. In order to use the
t-distribution we need to know alpha and the degrees of freedom. The degrees of freedom are always n − 1. In this
case: n − 1 = 31.
Now, similarly to the normal distribution, we want the value of t such that 100(1 − α/2)% of the data is less than
or equal to t (hence the 1 − (α/2) percentile). This is called the critical value for α/2. Note: your book uses t(α/2)
instead for simplicity. This should have been explained in class and in the reading.
The confidence interval is then given by

X̄ ± t(1−α/2, n−1) · s/√n

where t(1−α/2, n−1) is the critical value of t that corresponds to the 1 − α/2 percentile from a t-distribution with
n − 1 degrees of freedom.
[1] 18.28418
upper
[1] 21.89707
# Either way, we should see that a 90% CI for the mpg is approximately
# (18.28, 21.90) miles per gallon. Be sure to always remember the UNITS!!
# Note:
# A trick to get a confidence interval using a single line is the following:
ybar + c(-1,1)*qt(0.95, df=n-1)*s/sqrt(n)
As you learned in class, the most common way to interpret this confidence interval is:
We are 90% confident that the true population mean mpg of all automobiles is between 18.28 and 21.90 miles per gallon.
Clearly there has to be some way for R to automatically compute confidence intervals.
In fact we can use a single command to compute hypothesis tests and confidence intervals simultaneously.
We will talk about hypothesis tests in more detail in class, so for now we will just look at using this function for
the automatic confidence interval computation.
t.test(mtcars$mpg)
data: mtcars$mpg
t = 18.857, df = 31, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
17.91768 22.26357
sample estimates:
mean of x
20.09062
By default, the t.test command will produce a 95% confidence interval for the mean. To request a different confidence level, use the conf.level argument:
t.test(mtcars$mpg, conf.level=0.90)
data: mtcars$mpg
t = 18.857, df = 31, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
18.28418 21.89707
sample estimates:
mean of x
20.09062
The only thing that should change is the reported confidence interval, which should now match what we computed the long way in the previous section.
Suppose in the previous example that we wanted to test the claim that the average mpg of the cars is less than
22 mpg.
The hypothesis statements are therefore
H0 : µ ≥ 22
versus
HA : µ < 22
We know how to do this the long way. First we compute a test statistic, ts = (x̄ − µ0)/(s/√n).
We then compute the P-value = P(Tn−1 < ts), where the "<" was determined by the alternative hypothesis. We can of course test "greater than" or "≠" alternative hypotheses too.
# Here are some summary statistics that we'll need
xbar.mpg <- mean(mtcars$mpg)
s.mpg <- sd(mtcars$mpg)
n.mpg <- length(mtcars$mpg)
df.mpg <- n.mpg - 1
ts <- (xbar.mpg - 22)/(s.mpg/sqrt(n.mpg)) # the test statistic
print(ts)
[1] -1.792127
p_value <- pt(ts, df=df.mpg, lower.tail=TRUE) # P( T_{n-1} < ts )
print(p_value)
[1] 0.04143924
Since the P-value is 0.0414392, which is less than α = 0.05, we have statistical evidence that the average mpg of the cars is less than 22 mpg.
To carry this out using the built-in R function, we already know to use t.test(); we just need to supply the hypothesized mean and the direction of the alternative:
t.test(mtcars$mpg, mu=22, alternative="less")
data: mtcars$mpg
t = -1.7921, df = 31, p-value = 0.04144
alternative hypothesis: true mean is less than 22
95 percent confidence interval:
-Inf 21.89707
sample estimates:
mean of x
20.09062
Notice that we use the function option alternative="less". This specifies the direction of the test; by default the test assumes "two.sided". We can of course use alternative="greater" as well.
3.2 Inference for the Difference of Means from Two Independent Samples
Example 1:
A researcher is investigating the differences between two catalysts (Catalyst A and Catalyst B) and their ability to speed up a certain process. Both catalysts produce identical results with regard to the end product of the process itself; the main question at hand is whether one catalyst takes less time than the other.
In order to investigate, the researcher was able to repeat the process six times with Catalyst A and five times with Catalyst B before running out of material. The times to complete the process, in minutes, are given below:
Can you conclude that the time to complete the process differs between the two catalysts? Test at the 5%
significance level.
Now we will talk about how to use the built-in R commands to carry out a t test for the difference of means.
Of course, we can use R as a fancy calculator and simply apply our formulas that we learned in class.
The following commands are what you would do by hand. The difference is that you get a more precise P-value.
xbar.A <-mean(catalystA)
xbar.B <-mean(catalystB)
sd.A <- sd(catalystA); sd.B <- sd(catalystB) # sample standard deviations
n.A <- length(catalystA); n.B <- length(catalystB) # sample sizes
se.A.minus.B <- sqrt( sd.A^2/n.A + sd.B^2/n.B ) # standard error of the difference
ts <- ( xbar.A - xbar.B )/se.A.minus.B
nu.df <- (sd.A^2/n.A + sd.B^2/n.B)^2 / ( (sd.A^2/n.A)^2/(n.A-1) + (sd.B^2/n.B)^2/(n.B-1) )
print(nu.df)
[1] 4.368632
Note that the above gives a fractional degrees of freedom. R can handle this, but we cannot use our tables. If you wanted to be able to use your table to compare answers, you need to round the degrees of freedom down to the nearest whole number. You can do this using the floor() function.
nu.df.rounded.down <- floor( (sd.A^2/n.A + sd.B^2/n.B)^2 / ( (sd.A^2/n.A)^2/(n.A-1) + (sd.B^2/n.B)^2/(n.B-1) ) )
print(nu.df.rounded.down)
[1] 4
By now we realize that for hypothesis tests, we can either look at the P-value or we can compare the test statistic
with the critical values that define our critical region.
The critical region consists of the values of the T-distribution (with the appropriate degrees of freedom) that lie in the extreme tails whose total area is alpha; the critical values mark the boundary of the critical region.
Hence, if our test statistic, ts, is beyond the critical values, then we reject the null hypothesis.
# To get the critical values, we will use the following:
alpha <- .05
t.half.alpha <- qt(1-alpha/2, df=nu.df)
# These are the critical values that cut-off the tails and define the boundary of our rejection region.
c(-t.half.alpha, t.half.alpha)
# Compare the test statistic with these critical values:
print(ts)
[1] -3.025679
The P-value for a two-sided hypothesis test is found by
# P-value = 2*Prob( T > |ts| )
pvalue.ex1 <- 2*pt( abs(ts), df=nu.df, lower.tail=FALSE)
print(pvalue.ex1)
[1] 0.03475186
Finally, we could have also studied the confidence intervals instead. A 95% confidence interval for µA − µB is given by:
lower.bound <- xbar.A-xbar.B - qt(1-.05/2,df=nu.df)*se.A.minus.B
upper.bound <- xbar.A-xbar.B + qt(1-.05/2,df=nu.df)*se.A.minus.B
print(lower.bound)
[1] -3.423359
print(upper.bound)
[1] -0.2033077
By default, the built-in Student's t-Test assumes that the samples are independent and that the variances are UNEQUAL. It automatically calculates the degrees of freedom using the exact same formula.
This form of the t-test is called Welch's two-sample t-Test (unequal variances).
## Run the following line and study the output.
t.test(catalystA, catalystB)
It tells us that it conducted a two-sided test, that is, "alternative hypothesis: true difference in means is not equal to 0". Finally, it provided us with a 95% confidence interval.
The test didn't provide us with a conclusion; we need to simply look at the P-value and make that decision based upon the significance level that we choose. However, if we wanted to use the confidence interval to make an interpretation, we can ask for different 100(1 − α)% confidence intervals using the following command variation.
t.test(catalystA, catalystB, alternative="two.sided", conf.level=0.99) # For a 99% confidence interval
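As a side note (not in the original), the t.test() result is a list, so if you only want the interval itself you can extract its conf.int component:
# Not in the original: pull just the confidence interval out of the t.test() result.
ci99 <- t.test(catalystA, catalystB, alternative="two.sided", conf.level=0.99)$conf.int
print(ci99)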
Example 1b: Suppose instead that we want to test whether Catalyst A takes less time than Catalyst B, i.e. HA : µA < µB.
The long method requires that we keep track of which tail of the t-test we want to observe.
The test statistic is still exactly the same, but we need to change how we calculate the P-value. Since the alternative µA < µB ⇒ µA − µB < 0, we notice that we use a less than "<".
So the P-value is the probability that we will observe a particular value of our test statistic, ts, or less. That is, the P-value for a "less than" hypothesis test is found by P-value = Prob(T < ts).
pvalue.ex1b <- pt( ts, df=nu.df, lower.tail=TRUE)
# lower.tail=TRUE gives Prob(T < ts); this is also the default
print(pvalue.ex1b)
[1] 0.01737593
Using our short command, to get the "less than" alternative we use the following modification:
t.test(catalystA, catalystB, alternative = "less")
We notice that the difference in the P-values between our method and the built-in code again comes from the fact that this built-in command uses df = 4.2267, whereas we used df = 4.
We also notice that t.test automatically generates the corresponding 1-sided confidence intervals too.
We discussed how to do 1-sided confidence intervals by hand in class, but let’s review how to do this using R.
For 1-sided confidence intervals, we simply need to find either the upper or lower "confidence bound".
A 95% upper confidence bound on µA − µB corresponds to the "less than" alternative (the interval is everything below the bound), and a lower confidence bound corresponds to the "greater than" alternative.
upper.bound.one.sided <- xbar.A-xbar.B + qt(1-.05,df=nu.df)*se.A.minus.B
print(upper.bound.one.sided)
[1] -0.5660423
You can compare our answer with the output from the t.test() with alternative="less".
Similarly, suppose we wanted to test
H0 : µA ≤ µB
versus
HA : µA > µB
Very simply, we can find the answer using the built-in t.test() command to conduct the test by specifying:
alternative="greater" in the argument.
In class, we studied the paired-sample t-test. Let’s look at the following example:
A sample of 10 diesel trucks were run both hot and cold to estimate the difference in fuel economy. The results,
in mpg, are presented in the following table.
This is a matched-pairs experimental design because it's the same 10 trucks at two different temperatures.
To assess whether diesel trucks are more fuel efficient in hot weather than in cold weather, we compare the mean difference, µd, to 0.
To do this by hand, or even using the long method in R, you should first calculate the differences.
diffs <- hot - cold
print(diffs)
[1] 0.30 0.38 0.66 0.41 -0.12 0.58 0.20 -0.04 0.01 0.50
In order to justify the use of a t-test, We need to check to see if the data are approximately normally distributed.
qqnorm(diffs, pch=19)
qqline(diffs, col="red")
shapiro.test(diffs)
data: diffs
W = 0.94751, p-value = 0.6392
Once we realize that this is a matched-pairs design, the actual t-test works in the same basic way as the one for a single mean.
Instead of practicing the long way, let’s see how to use the built-in R command to conduct the hypothesis test
and get a 90% confidence interval for µd .
t.test(hot,cold, alternative="two.sided", paired=TRUE, conf.level=0.90)
Paired t-test
We have strong statistical evidence that the average mpg is not the same for diesel trucks in hot and cold weather (p-value = 0.008). Our study suggests, with 90% confidence, that diesel trucks perform better in hot weather, with an average mpg as little as 0.13 and as much as 0.44 mpg greater than in the cold.
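As a check (this is not in the original), a paired t-test is equivalent to a one-sample t-test on the differences, so the following call reproduces the same t statistic, degrees of freedom, P-value, and confidence interval:
# Not in the original: the paired test is the same as a one-sample test on diffs.
t.test(diffs, alternative="two.sided", conf.level=0.90)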
Chapter 4
In this part of the tutorial you will learn how to use R to:
1. Construct confidence intervals for population proportions.
2. Test compound null hypotheses of population proportions, i.e., conduct χ2 goodness-of-fit tests for a single variable with k categories.
3. Conduct χ2 tests for association/independence between two variables, i.e. a χ2 contingency test for (r × k)
tables.
4.1 Confidence Intervals for Population Proportions
Here we wish to construct confidence intervals for population proportions, p, instead of population means µ.
The only difference is that before we calculated confidence intervals using the t-distribution; for proportions we use the standard normal (z) distribution instead.
The book told you that for the standard normal distribution, the z-score that corresponds to the 95% confidence interval is 1.96.
qnorm(0.975)
[1] 1.959964
So this is the z-score for the 97.5% quantile, which is the boundary of the 2.5% upper tail of the standard normal
distribution.
In a natural population of mice (Mus musculus) near Ann Arbor, Michigan, the coats of some individuals are white spotted on the belly. In a sample of 580 mice from the population, 28 individuals were found to have white-spotted bellies. Construct a 95% confidence interval for the population proportion of this trait.
# This is the number of trials
n.mice <- 580
# This is the number of 'successes'
n.spotted.mice <- 28
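Here is a sketch of the calculation, assuming the book's approximation is the "plus four" adjusted-proportion interval; under that assumption it reproduces the interval quoted below.
# A sketch assuming the plus-four (adjusted Wald) interval; the exact method in your
# book may differ slightly.
p.tilde <- (n.spotted.mice + 2)/(n.mice + 4) # adjusted sample proportion
se.tilde <- sqrt( p.tilde*(1 - p.tilde)/(n.mice + 4) ) # adjusted standard error
p.tilde + c(-1,1)*qnorm(0.975)*se.tilde # roughly (0.0335, 0.0693)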
We are 95% confident that the true proportion of white-spotted bellied mice is between 0.0335 and 0.0693.
We also installed a library earlier that allows us to generate confidence intervals for proportions: the binom library.
library(binom)
# binom.confint( <number of successes>, <number of trials>,method="wilson")
binom.confint(28,580,method="wilson")
Note that this is not the exact same interval as the one you computed using the method from the book; the book teaches a good approximation, while the binom.confint command uses something that is a bit more sophisticated.
The registrar assumes that for a particular art class on campus, the ratio of students who take it will be
Freshman:Sophomores:Juniors:Seniors = 3:2:1:1.
The final class roster for the new semester is in and the actual enrollment is 32 Freshman, 15 Sophomores, 13
Juniors, and 9 Seniors. Are these data consistent with the 3:2:1:1 ratio predicted by the registrar?
The hypotheses are
H0 : The data are consistent with the proposed model.
HA : The data are not consistent with the proposed model.
Under the registrar's model, the category probabilities are
p1 = 3/7
p2 = 2/7
p3 = 1/7
p4 = 1/7
Again, we will start out by doing this the long way, and then learn how to use a short simple command.
n.students <- 69 # total students
actual.students <- c(32, 15, 13, 9) # the observed counts: Fr, So, Jr, Sr
p.theory <- c(p1, p2, p3, p4) # where p_{i} is the probability of each category as indicated above
expected.students <- n.students*p.theory
Now we need our Chi-square test statistic. There are a couple of ways to do this. The formula that we studied
in class (also in the book) is the most efficient way to do this and is given with the following line.
# The chi-square test statistic
xs <- sum( (actual.students-expected.students)^2/expected.students )
print(xs)
[1] 2.403382
You should double check this by hand to see if you get the same answer.
Now we use the Chi-square distribution to get a P-value. We want the probability of observing a particular value of the test statistic, χ2s (xs above), or one more extreme. Remember, df = (number of categories) − 1 here.
p.value.students <- pchisq(xs,length(actual.students)-1,lower.tail=FALSE)
# Notice that we want the upper tail probability
print(p.value.students)
[1] 0.4930054
To call this test, all we need are the observed counts and the theoretical proportions that we are comparing them with.
Be careful: if we call chisq.test() on the counts alone, the default null hypothesis is that all categories are equally likely, which is not the test we want here.
chisq.test(actual.students)
data: actual.students
X-squared = 17.899, df = 3, p-value = 0.0004616
The null hypothesis for our example is the 3:2:1:1 ratio, so we must supply the theoretical proportions through the p argument.
chisq.test(actual.students, p=c(3/7, 2/7, 1/7, 1/7))
data: actual.students
X-squared = 2.4034, df = 3, p-value = 0.493
# or
chisq.test(actual.students,p=p.theory)
data: actual.students
X-squared = 2.4034, df = 3, p-value = 0.493
We can easily see that this matches what we did via the long method.
Example 2:
We've seen this dataset before; the MSA variable records the neighborhood type where each incident occurred. To remind ourselves, let's look at this dataset a little bit.
# This command looks at the first 6 rows of the dataset.
head(victims)
# This is a command that allows us to view names and types of variables in the data set.
str(victims)
$ YEAR : int 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
$ MSA : Factor w/ 3 levels "Rural","Suburban",..: 2 2 3 3 3 2 2 2 3 2 ...
$ ER : Factor w/ 2 levels "ER","No ER": 2 1 2 2 1 2 2 2 2 2 ...
$ Police : Factor w/ 2 levels "No Police","Police": 1 2 1 1 1 2 2 1 2 1 ...
$ age : int 14 24 45 14 37 15 14 30 21 19 ...
$ female : int 0 1 0 0 0 0 0 0 1 0 ...
$ stranger : int 0 0 1 1 0 0 1 0 0 0 ...
$ thirdparty: int 1 1 1 1 1 1 1 1 0 1 ...
$ private : int 1 1 0 0 1 0 0 1 0 1 ...
$ income : int 3 1 1 3 1 4 3 2 1 1 ...
We see that there are only 3 categories: Rural, Suburban, and Urban.
neighborhood.counts <- table(victims$MSA) # counts of incidents in each neighborhood type
prop.table(neighborhood.counts) # the corresponding sample proportions
So MSA is a variable with k = 3 categories. The proportions given by the prop.table() function are the p̂'s for the different categories.
We know that we can use our chisq.test() function to test the equality of these proportions very easily.
chisq.test(neighborhood.counts)
data: neighborhood.counts
X-squared = 1172.4, df = 2, p-value < 2.2e-16
Since the P-value is < 2.2e-16, we reject the null hypothesis of equal proportions; these crimes do not occur in equal proportions across the different neighborhood types.
Example 2b.
How about we test a different null? What if someone (before seeing the data) proposed that the proportion of crimes in the neighborhoods was (0.12, 0.50, 0.38) for Rural, Suburban, and Urban neighborhoods, respectively?
To do this test, we still use the chisq.test() function but we now specify the probabilities instead of the test’s
default where it assumes equality.
chisq.test(neighborhood.counts, p=c(0.12, 0.50, 0.38))
data: neighborhood.counts
X-squared = 3.0585, df = 2, p-value = 0.2167
In this case, we fail to reject H0, therefore it is plausible that the proportion of assaults occur 12% of the time in
Rural neighborhoods, 50% in Suburban neighborhoods, and 38% of the time in Urban neighborhoods.
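If you want to see the expected counts that this test used (they are not shown in the original), chisq.test() stores them in the expected component of its result:
# Not in the original: inspect the expected counts under the hypothesized proportions.
chisq.test(neighborhood.counts, p=c(0.12, 0.50, 0.38))$expected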
In class and in the book, you learned how to conduct a hypothesis test for the difference of population proportions by using the Chi-square test.
The general formulation of the test relies on you constructing an (r × k) contingency table. You then use the marginal frequencies and the expected frequencies to carry out the chi-square test.
Real data do not end up in those types of contingency tables on their own; we must build them ourselves.
Let's use the victims data that you loaded at the very beginning of the assignment and consider an example of how to do hypothesis tests for (2 × 2) contingency tables.
The victims data set is from the National Crime Victimization Survey from 1996-2005.
The data corresponds to incidents of serious assaults in which the victim sustained an injury.
We can express categorical data in a number of ways, using names as well as indicators (i.e. numbers, such as 1="low", 2="medium", 3="high").
MSA is the location where the incident occurred. To view the categories of the MSA variable, use the following command:
levels(victims$MSA)
levels(victims$Police)
Police is a categorical variable with categories (Police, No Police) which indicates that the incident was reported
to the police or not.
ER is a categorical variable with categories (ER, No ER) which indicates that the victim received treatment at
the ER or not.
Stranger is a categorical indicator variable which indicates whether the offender was a stranger (indicated with a 1) or not (indicated with a 0).
private is a categorical indicator which indicates whether the location was private (indicated with a 1) or public (indicated by a 0).
Suppose that we hypothesize that victims who call the police go to the ER more often than those who don’t.
Then the hypotheses would look like the following:
H0 : P r( ER | Police ) = P r( ER | No Police )
HA : P r( ER | Police ) > P r( ER | No Police )
In order to test this by hand (the long way), we need to construct a contingency table.
We have already seen how to construct a table of absolute frequencies in R assignment 1, but as a reminder, we
will use the following command.
tbl1 <- table(victims$ER, victims$Police)
# This is how we create r x k tables using real data.
No Police Police
ER 95 675
No ER 2201 2532
Notice that the ER variable is the rows, while the Police variable is the columns.
# To get the row sums of this table
rS <- rowSums(tbl1)
rS
ER No ER
770 4733
# To get the column sums of this table
cS <- colSums(tbl1)
cS
No Police Police
2296 3207
# The total number of observations
N <- sum(tbl1)
N
[1] 5503
# The expected frequencies under the null are (row total)*(column total)/N
expected.freqs.for.tbl1 <- rS %*% t(cS) / N
expected.freqs.for.tbl1
No Police Police
[1,] 321.2648 448.7352
[2,] 1974.7352 2758.2648
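As a quick check (not part of the original), base R's addmargins() function appends the row and column totals to the table, so you can see rS, cS, and the grand total at a glance:
# Not in the original: display the table together with its marginal totals.
addmargins(tbl1)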
To construct the test statistic, we use the regular formula.
xs.test.tbl1 <- sum((tbl1-expected.freqs.for.tbl1)^2/expected.freqs.for.tbl1)
# Hence our Chi-squared test statistic, xs, for this (2x2) table is
xs.test.tbl1
[1] 317.9321
For a 2x2 table, the degrees of freedom is df = 1. To find the P-value, we simply use the command:
p.value.tbl1.test <- pchisq(xs.test.tbl1, df=1, lower.tail=FALSE)
p.value.tbl1.test
[1] 4.086469e-71
We conclude that victims who call the police go to the ER more often than those who do not call the police.
Now, for what you’ve been waiting for, the short way. The short way simply uses the command:
chisq.test(tbl1, correct = FALSE)
data: tbl1
X-squared = 317.93, df = 1, p-value < 2.2e-16
# For the 2x2 case, we need to use the option correct = FALSE (this turns off the continuity correction)
When the null is simple (the proportions are equal), you don’t need any extra options.
What changes when the contingency table is not (2 × 2) but some other more general (r × k) table?
In case you don't recall, r is the number of rows in a contingency table and k is the number of columns.
For any (r × k) contingency table, the degrees of freedom is given by the formula df = (r − 1)(k − 1). Otherwise,
all other steps are identical.
We know that we can make a table from separate columns of a dataframe using the table() function, but what if we are not given the raw data and instead only given the already summarized data in the bivariate frequency table/contingency table?
Suppose we have a table that looks like the following (ignoring the row and column names for now):
                         Burrito
                Beef   Bean   Cheese
Salsa   Hot       42     10       27
        Mild       9     39       13
Recall that if we want to input an array into R, we use the following command
x <- c(1,2,3,4)
print(x)
[1] 1 2 3 4
Let's explore how to use the matrix() function with c() to enter a table.
# This makes a column vector of length 4
matrix(c(1,2,3,4))
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
#this makes a 2 x 2 matrix but say we want the first row to be [1 2], then this would be incorrect
matrix(c(1,2,3,4),nrow=2)
[,1] [,2]
[1,] 1 3
[2,] 2 4
# this makes a 2 x 2 matrix using c(1,2,3,4) putting the values in left-to-right, a row at a time.
matrix(c(1,2,3,4),nrow=2,byrow=T)
[,1] [,2]
[1,] 1 2
[2,] 3 4
Now that we know how to get the arrangement that we want, we can enter our own table.
y <- c(42,10,27,9,39,13)
lunch.data <-matrix( y ,nrow = 2,byrow = T)
print(lunch.data)
     [,1] [,2] [,3]
[1,]   42   10   27
[2,]    9   39   13
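If you want the printed table to carry the row and column names (this step is not shown in the original), you can attach them with dimnames(); the labels below come from the Salsa/Burrito table above:
# Not in the original: label the rows and columns of the matrix.
dimnames(lunch.data) <- list(Salsa = c("Hot", "Mild"),
Burrito = c("Beef", "Bean", "Cheese"))
print(lunch.data)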
Now that we have our table, we can finally run our χ2 test on this table.
chisq.test(lunch.data)
data: lunch.data
X-squared = 41.793, df = 2, p-value = 8.41e-10
Since the P-value is extremely small, we reject the null hypothesis of independence; the choice of salsa is associated with the type of burrito.
Chapter 5
In this part of the tutorial you will learn how to use R to carry out:
1. One-Way ANOVA
2. Multiple Comparisons
5.1 One-way ANOVA
Analysis of Variance or ANOVA is a method that is used to compare the means of groups simultaneously.
For example, suppose I have t different treatment groups that I wish to compare. Under the null, we assume all
the groups have the same population mean.
H0 : µ1 = µ2 = · · · = µt
versus
HA : At least one mean differs.
Example
A built-in data set in R looks at results from an experiment to compare yields (as measured by dried weight of
plants) obtained under a control and two different treatment conditions (assuming some type of fertilizer).
weight group
1 4.17 ctrl
2 5.58 ctrl
3 5.18 ctrl
4 6.11 ctrl
5 4.50 ctrl
6 4.61 ctrl
7 5.17 ctrl
8 4.53 ctrl
9 5.33 ctrl
10 5.14 ctrl
11 4.81 trt1
12 4.17 trt1
13 4.41 trt1
14 3.59 trt1
15 5.87 trt1
16 3.83 trt1
17 6.03 trt1
18 4.89 trt1
19 4.32 trt1
20 4.69 trt1
21 6.31 trt2
22 5.12 trt2
23 5.54 trt2
24 5.50 trt2
25 5.37 trt2
26 5.29 trt2
27 4.92 trt2
28 6.15 trt2
29 5.80 trt2
30 5.26 trt2
The response variable, consisting of the yield observations, is given by the weight variable.
The corresponding group for each observation is given by the group variable.
We can of course use the dollar sign $ notation to extract the data, dataset$variable.name as before.
We have two other options that we can use when we don’t want to keep using the $ sign notation.
Clearly we can use our own naming which might help us save a little bit of typing.
plant.weights <- PlantGrowth$weight
plant.groups <- PlantGrowth$group
Alternatively we can use the attach() function. This means that we can call the variables in a dataset without having to say the name of the dataset first. You must be careful because it will mask other variables/functions of the same name if they already exist. A good practice to follow is to make sure to use the detach() function when you are finished.
attach(PlantGrowth)
Now we can just type weight and group without having to type PlantGrowth$ first.
Running the following command shows how many observations are in each group.
summary(group)
ctrl trt1 trt2
  10   10   10
So we see that observations 1-10 are from the control group, observations 11-20 are from treatment 1, and observations 21-30 are from treatment 2. Remember, the number of observations won't always be the same for each group!
Before working on anything else, let's define our ANOVA model. This can be done in several different ways; here we use the aov() function with the model formula weight ~ group.
anovamodel <- aov(weight ~ group)
Even though we don't see any specific output, R has actually already completed a lot of preliminary calculations using our data.
It is always good practice to check the underlying assumptions before trying to use a particular analysis method.
All methods have slightly different requirements; for one-way ANOVA we need to check that:
i. The groups all have the same variances (or standard deviations).
ii. The groups all come from a normal distribution (especially important when the sample sizes are small).
A stripchart of the data, grouped by treatment, is a good way to look at both of these informally.
stripchart(weight~group, vertical=T, method="stack", xlab="Treatment", ylab="Weight",
main="Yields of Plants (Dried Weight) Due to Treatments", col="blue", las=1, pch=1, cex=0.75)
Again we see the Y ∼ X notation. We know that the observed weights are the response variable Y while the groups are the predictor variable X. Here, the formula tells the plotting function which group each observation belongs to so that it can be plotted appropriately.
The method="stack" option just makes overlapping data points so that they are easier to see.
The cex option is new to us. This essentially allows us to control the size of the plotting character. The default
is cex=1 which stands for 100%, while cex=3 stands for 300%. So cex=0.75 is 75% of the original character size.
To add the location of the group means to the plot, run the following lines.
stripchart(weight~group, vertical=T, method="stack", xlab="Treatment", ylab="Weight",
main="Yields of Plants (Dried Weight) Due to Treatments", col="blue", las=1, pch=1, cex=0.75)
Let's explore the tapply() command: it takes a vector of observations and applies a function (such as mean) to the observations within each group. Kinda clever! One way to look at this is the following:
tapply( vector of response observations , vector of groups , a function such as mean, median, etc. )
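For example (this illustration is not in the original), with the PlantGrowth variables attached we can apply other summary functions in exactly the same way:
# Not in the original: group medians and standard deviations via tapply().
tapply(weight, group, median)
tapply(weight, group, sd)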
To check the normality assumption, we have two options:
i. We can make normality plots or run Shapiro-Wilk tests on each of the individual groups.
ii. Or, as we discussed in class, you can simply look at a normality plot of the residuals.
qqnorm(anovamodel$residuals, main="Normal Q-Q Plot for Plant Growth Residuals", pch=19)
qqline(anovamodel$residuals, col="red")
shapiro.test(anovamodel$residuals)
data: anovamodel$residuals
W = 0.96607, p-value = 0.4379
Earlier, we obtained ybar (i.e. the group sample means). To make the residual plot below, we need appropriate (x, y) pairings, so we need to replicate the means by the appropriate sample sizes for each group using the rep() function.
rep(ybar, n_obs)
ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl trt1 trt1 trt1 trt1 trt1
5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 4.661 4.661 4.661 4.661 4.661
trt1 trt1 trt1 trt1 trt1 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2
4.661 4.661 4.661 4.661 4.661 5.526 5.526 5.526 5.526 5.526 5.526 5.526 5.526 5.526 5.526
In order to check the assumption of equal variances/standard deviations, we can start by using a graphical
approach.
A graphical way to look at whether the standard deviations are approximately equal is the following plot:
plot( rep(ybar,n_obs),anovamodel$residuals, pch=1, cex=0.75,
xlab="Fitted Value",ylab="Residuals",
main="Plots of Residuals vs sample mean (i.e. Fitted Value)")
abline(h=0, col="blue")
The rule of thumb is that the ratio of the largest spread to the smallest spread should not exceed 2. Additionally
we should not see any specific kind of pattern such as the spread increasing/decreasing significantly with the fitted
value.
Since we don’t see a huge deviation between the spreads, or a worrying pattern, we’ll assume they are okay.
Another test that can be used to assess whether the equal variance assumption is met is Levene’s test. Levene’s
test is similar to shapiro.test but for testing the equality of variances (and hence standard deviations).
The leveneTest() function is found in the car library. If you have not done so, install and enable the car library.
leveneTest(weight~group)
Df F value Pr(>F)
group 2 1.1192 0.3412
27
If the p-value is less than .05, we reject the null hypothesis of equal variances. We of course want to fail to reject
the null hypothesis so that we can use ANOVA.
To get the ANOVA Table, we have a couple of different commands, but to avoid confusion just use:
anova(anovamodel)
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
group 2 3.7663 1.8832 4.8461 0.01591 *
Residuals 27 10.4921 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You should know how to read an ANOVA table by now, but be aware that R does not provide you with the row
corresponding to the totals.
Note the variable names are the variable names that you used when defining the model in the aov() command
above.
As was mentioned in class, the Between (Groups) row is labeled with the actual name of the treatment variable. In this case that variable name is "group".
The Within(Groups) row, is also called the “Error” or the “Residuals”, depending on the command or software
that you obtain your ANOVA table with. By default, R calls them Residuals.
The following section requires the multcomp and ggplot2 libraries. Please make sure that these
libraries are installed and enabled before proceeding further.
After we have conducted our F test (or rather we have generated our ANOVA Table) we observe a P-value. If
that P-value is less than some predetermined significance level, say alpha = 0.05, then we have statistical evidence
that not all of the group population means are equal.
In this instance, we need to do pairwise comparisons to test to see which means might differ from each other.
From the multcomp library we have some wonderful tools that do this for us.
The names of some of the functions won’t make too much sense and aren’t worth explaining in full detail.
Instead learn what changes and what stays the same if you were to use the functions in future problems.
The basic setup to carry out multiple comparisons is made using the following command:
# Be aware that the "group" variable is one that is defined by you or the data
comparisons <- glht(anovamodel, linfct=mcp(group="Tukey"))
The above command ran some calculations and stored it in the “comparisons” object.
If you forget to run the above command, the rest of the work below will not work !!!!!!
Also, be sure to replace group with the name of the predictor variable for your dataset!
In order to do the following analysis, some other quantities are needed, including the number of comparisons to be made, k, which is usually different from the total number of groups, I. If we wish to carry out every pairwise test (or obtain every pairwise confidence interval), the total number of comparisons to be made is k = I(I − 1)/2.
k <- length(n_obs) * (length(n_obs)-1 )/2
print(k)
[1] 3
We need the degrees of freedom of the residuals (error) row, this is given by N − I = n· − I from the ANOVA
table. We can get the ANOVA table and the degrees of freedom using the following:
anova.table <- anova(anovamodel)
df.within <- anova.table[2,"Df"]
print(df.within)
[1] 27
We also need the Mean Square Error (the within-groups mean square), which comes from the same row of the ANOVA table:
MSE <- anova.table[2,"Mean Sq"]
print(MSE)
[1] 0.3885959
Now let’s look at conducting pairwise tests and confidence intervals using the different methods that we discussed
in class.
5.2.1 Fisher’s Least Significant Difference
These are simple pairwise comparisons that use the t-distribution that you are used to.
You should have had practice doing this by hand; the only difference now is:
The degrees of freedom are based upon the degrees of freedom within the groups, that is, the degrees of freedom found in the Error/Residuals row of the ANOVA table.
The FisherLSD pairwise hypothesis tests are carried out using the following command:
summary(comparisons,test=adjusted("none")) # adjusted("none") means we don't adjust alpha
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
trt1 - ctrl == 0 -0.3710 0.2788 -1.331 0.19439
trt2 - ctrl == 0 0.4940 0.2788 1.772 0.08768 .
trt2 - trt1 == 0 0.8650 0.2788 3.103 0.00446 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- none method)
The corresponding (unadjusted) Fisher LSD confidence intervals are obtained with:
confint(comparisons, calpha = univariate_calpha())
Quantile = 2.0518
95% confidence level
Linear Hypotheses:
Estimate lwr upr
trt1 - ctrl == 0 -0.37100 -0.94301 0.20101
trt2 - ctrl == 0 0.49400 -0.07801 1.06601
trt2 - trt1 == 0 0.86500 0.29299 1.43701
We can plot these confidence intervals using the following long function
qplot(lhs, estimate, data = confint(comparisons, calpha = univariate_calpha()),
main="FisherLSD 95% Confidence Intervals", geom = "pointrange", ymin = lwr, ymax = upr,
xlab="Comparisons", ylab="Estimates and Confidence Intervals") +
coord_flip() + geom_hline(yintercept = 0)
Carrying out the hypothesis tests and making the confidence intervals for Fisher's LSD by hand is not difficult.
The hypothesis statements for the pairwise tests are given by the following:
H0 : µi = µj
HA : µi ≠ µj
for any two populations i ≠ j from the larger set of I population groups in total.
Since we assume that the variances of the populations are equal, 100(1 − α)% confidence intervals for µi − µj, where i ≠ j, are given by
Ȳi − Ȳj ± t(1−α/2, df) · √( MSE (1/ni + 1/nj) )
where the MSE and df (the degrees of freedom) are obtained from the Residuals/Error row of the ANOVA table. The corresponding test statistic is
ts = ( Ȳi − Ȳj ) / √( MSE (1/ni + 1/nj) )
This can be a cumbersome chore by hand when the total number of groups and pairwise comparisons to be made
is large.
We already know that the sample means are stored in ybar, and the number of observations is stored in n_obs.
First, run the next couple of lines that strip the names. It’s clear that the first sample mean (the control group)
is given by ybar[1], the treatment 1 group sample mean is ybar[2], etc.
comp.names <- names(ybar)
ybar <- as.numeric(ybar)
n_obs <- as.numeric(n_obs)
print(comp.names)
[1] "ctrl" "trt1" "trt2"
print(ybar)
[1] 5.032 4.661 5.526
print(n_obs)
[1] 10 10 10
The difference between group 2 (treatment 1) and group 1 (the control) is given by
ybar[2] - ybar[1]
[1] -0.371
This matches the Estimate of the differences of the means given in the tables provided in the previous sections. The standard error of the difference is
sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) )
[1] 0.2787816
which matches what we see in the output from summary(comparisons,test=adjusted("none")) in the previous
section. Note that since we have a “balanced” experimental design (meaning all sample sizes are equal) then the
standard error is the same for all pairwise comparisons.
The test statistic and P-value for pairwise comparison test between the control group and treatment 1 is given
by:
ts <- (ybar[2] - ybar[1])/(sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) ))
print(ts)
[1] -1.330791
p.val <- 2*pt( abs(ts), df=df.within, lower.tail=FALSE) # two-sided P-value
print(p.val)
[1] 0.1943879
A 95% confidence interval for µ2 − µ1 (the treatment 1 population mean - the control population mean) is given
by
(ybar[2]-ybar[1]) + c(-1,1)*qt(1-0.05/2,df.within)*sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) )
# or
lower <- (ybar[2]-ybar[1]) - qt(1-0.05/2,df.within)*sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) )
print(lower)
[1] -0.9430126
upper <- (ybar[2]-ybar[1]) + qt(1-0.05/2,df.within)*sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) )
print(upper)
[1] 0.2010126
Which agrees with the corresponding line seen from the confint() output in the previous section.
I simply showed you how to carry out 1 of the hypothesis tests and obtain the corresponding confidence interval. You will need to do the other 2 for this dataset. In general, there might be more than 3 groups, so you may need to carry out MANY tests.
For the Bonferroni correction, each individual comparison is carried out at the corrected level α/k, where k is the number of pairwise comparisons we make. If we wish to make all of the pairwise comparisons possible, then k = I(I − 1)/2 where I is the total number of groups.
The Bonferroni tests can be done in a similar way as before using functions from the multcomp library. For
example, the hypothesis tests can be done using the following (note the comparisons object was computed in the
FisherLSD section). The only real change is that we explicitly tell R that we are using the Bonferroni method by
specifying type="bonferroni" .
summary(comparisons, test=adjusted(type="bonferroni"))
Fit: aov(formula = weight ~ group)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
trt1 - ctrl == 0 -0.3710 0.2788 -1.331 0.5832
trt2 - ctrl == 0 0.4940 0.2788 1.772 0.2630
trt2 - trt1 == 0 0.8650 0.2788 3.103 0.0134 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- bonferroni method)
Quantile = 2.5525
95% confidence level
Linear Hypotheses:
Estimate lwr upr
trt1 - ctrl == 0 -0.3710 -1.0826 0.3406
trt2 - ctrl == 0 0.4940 -0.2176 1.2056
trt2 - trt1 == 0 0.8650 0.1534 1.5766
The corresponding plot, titled "Bonferroni 95% Confidence Intervals", shows the estimate and confidence interval for each of the three pairwise comparisons, just as in the Fisher LSD plot above.
If we wanted to generate confidence intervals for µ2 − µ1 by hand, then we need to do the following.
We know that the difference between group 2 (treatment 1) and group 1 (the control) is given by
(ybar[2] - ybar[1])
[1] -0.371
sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) ) # the standard error, exactly as before
[1] 0.2787816
The only major change is the critical value that we use as a multiplier in the margin of error computation.
# Instead of the critical value used in Fisher's LSD, where
qt(1-0.05/2,df.within)*sqrt(MSE)
[1] 1.279059
# we now divide alpha by k when finding the critical value
qt(1-0.05/2/k,df.within)*sqrt(MSE)
[1] 1.591138
Now the 95% Bonferroni confidence interval for µ2 − µ1 based upon the sample we obtained earlier is
(ybar[2]-ybar[1]) + c(-1,1)*qt(1-0.05/2/k,df.within)*sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) )
In order to conduct a hypothesis by hand using the Bonferroni method, we note that we can do the following:
ts = (ybar[2]-ybar[1]) / ( sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) ) )
print(ts)
[1] -1.330791
p.val.bonf <- 2*pt( abs(ts), df=df.within, lower.tail=FALSE)
print(p.val.bonf)
[1] 0.1943879
It is not surprising that this gives the same P-value as before; what changes is the significance level that we compare it with.
First note that we have I = 3 groups, hence k = I(I − 1)/2 = 3 . The total number of tests that we will conduct
will be 3, this is simply the first one. We want to control the overall Type I error rate that is associated with
multiple tests. So we no longer make a simple comparisons with α = 0.05.
Formerly, we made a comparison at the α level, but since we are conducting k multiple tests, we conduct each
test at the α/k level. So if αfw = 0.05 is the desired familywise error rate, we need to compare each P-value with the per-comparison level αfw/k = 0.05/3 = 0.01666667.
Note, however, that the output of the previous section had different P-values than the ones we just obtained. In particular, that function returned something called "adjusted P-values". As a reminder, here is the output we are referencing:
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
trt1 - ctrl == 0 -0.3710 0.2788 -1.331 0.5832
trt2 - ctrl == 0 0.4940 0.2788 1.772 0.2630
trt2 - trt1 == 0 0.8650 0.2788 3.103 0.0134 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- bonferroni method)
Adjusted P-values are used in order to make decisions about the tests easier. Instead of conducting every test at the 0.01666667 level, each test is conducted at the 0.05 level, but what has changed is that the P-values that we obtained earlier have now been multiplied by k = 3.
Since our original P-value = 0.1943879, the adjusted P-value is 3 ∗ 0.1943879 = 0.5831637 which matches the
output above. We then see that the adjusted P-value = 0.5832 > 0.05 hence we fail to reject H0 .
When we first created the comparisons object using the glht() function, we used this option called "Tukey".
This did not automatically create Tukey intervals, but instead it told R that we wish to carry out all “pairwise”
comparisons. Since the most common method to do this is using the Tukey-Kramer method, just stating the
option that we will use Tukey already set this up for future computations.
Since we already specified that we intend to use pairwise comparisons using Tukey’s HSD, the default output (no
additional options as before) using the summary() function will provide us with Tukey’s HSD hypothesis tests
and confidence intervals for pairwise contrasts.
summary(comparisons)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
trt1 - ctrl == 0 -0.3710 0.2788 -1.331 0.3909
trt2 - ctrl == 0 0.4940 0.2788 1.772 0.1979
trt2 - trt1 == 0 0.8650 0.2788 3.103 0.0122 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
confint(comparisons)
Simultaneous Confidence Intervals
Quantile = 2.4795
95% family-wise confidence level
Linear Hypotheses:
Estimate lwr upr
trt1 - ctrl == 0 -0.3710 -1.0622 0.3202
trt2 - ctrl == 0 0.4940 -0.1972 1.1852
trt2 - trt1 == 0 0.8650 0.1738 1.5562
Finally, plots of the Tukey HSD confidence intervals are given by the function
qplot(lhs, estimate, data = confint(comparisons),
main="TukeyHSD 95% Confidence Intervals",
xlab="Comparisons",
ylab = "Estimates and Confidence Intervals",
geom = "pointrange", ymin = lwr, ymax = upr) +
coord_flip()+ geom_hline(yintercept = 0)
Since TukeyHSD is often the most preferred method for post-hoc analysis, hypothesis tests and confidence intervals
are given by a simple dedicated function
TukeyHSD(anovamodel)
$group
diff lwr upr p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
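As an alternative to the ggplot2-based plot above (this is not in the original), the object returned by TukeyHSD() has its own base-R plot method:
# Not in the original: base R plot of the Tukey HSD family-wise confidence intervals.
plot(TukeyHSD(anovamodel), las=1)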
In general, Tukey-Kramer 100(1 − α)% confidence intervals for µi − µj (i ≠ j) are given by
Ȳi − Ȳj ± ( q(α, I, N−I) / √2 ) · √( MSE (1/ni + 1/nj) )
In R, we obtain critical values from Tukey’s studentized range distribution using the qtukey() function.
A 95% confidence interval for µ2 − µ1 using based upon the sample is given by
(ybar[2] - ybar[1]) + c(-1,1) * qtukey(1-0.05, nmeans = 3, df=df.within)/sqrt(2) *
sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) )
where Q(I, N−I) is a random variable from Tukey's Studentized Range Distribution with I groups and N − I degrees of freedom for the error. Additionally, our test statistic is
qs = √2 · |Ȳi − Ȳj| / √( MSE (1/ni + 1/nj) ) = √2 · |ts|
The corresponding Tukey adjusted P-value for the trt1 - ctrl comparison can be computed with the ptukey() function:
ptukey( sqrt(2)*abs(ts), nmeans=3, df=df.within, lower.tail=FALSE)
[1] 0.3908711
We’re done with the PlantGrowth dataset, don’t forget to detach it!
detach(PlantGrowth)
Chapter 6
In this part of the tutorial you will learn how to use R to:
1. Fit simple linear regression models
2. Compute correlations
6.1 Linear Models in R
A statistical model that describes a relationship between the ith observation of a response variable Y and a predictor variable X is given by
Yi = β0 + β1 Xi + εi
where εi is the random error.
In general, regression analysis is used to describe the relationship between a single response variable Y and one or more predictor variables (X1, X2, . . ., Xp). When p = 1, we use "Simple Linear Regression" and if p > 1 we use Multiple Regression.
We will focus on only the elementary case for simple linear regression.
Hopefully your next statistics course will cover the more general and advanced cases.
Assume that we want to measure the length of a particular spring when masses of various weights have been attached.
We select masses of various weights, starting at zero and going up to 3.8 kg.
We then measure the length of the spring for the various weights.
Question: Are the weights of the masses and the length of the spring linearly related?
Let’s explore this by first plotting the data. The data is given below:
# Weights between 0 and 3.8 Kg increments of 0.2
weight <- seq(0,3.8, by=0.2)
# Note: You should avoid using length as a variable name since it is the name of a command,
# so the measured lengths are stored in the vector spring.length.
n_obs <- length(weight) # the number of observations in the dataset
plot(weight, spring.length, pch=16, main="Spring Length versus Mass")
The sample correlation coefficient between weight and spring length can be computed with the cor() function:
cor(weight, spring.length)
[1] 0.9743193
# You should avoid using r as a variable, same with length, as it is a command inside R
my.r <- cor(weight,spring.length)
print(my.r)
[1] 0.9743193
The linear model can be specified simply using the lm() function.
spring.modelfit <- lm( spring.length ~ weight)
Just like with the aov() command, the lm(), already carried out many calculations including computation of the
least-squares coefficients.
To find out the equation of the least-squares line, we just need to look at the coefficients
spring.modelfit$coefficients
(Intercept) weight
4.9997143 0.2046241
The slope can also be computed by hand from the correlation and the sample standard deviations, b1 = r · (sy/sx):
my_b1 <- my.r * sd(spring.length)/sd(weight)
print(my_b1)
[1] 0.2046241
b0 is found using: b0 = ȳ − b1 · x̄
b1 <- spring.modelfit$coefficients["weight"] # the slope extracted from the fitted model (a named value)
my_b0 <- mean(spring.length) - b1*mean(weight)
print(my_b0)
weight
4.999714
6.4 Inference on the Regression Coefficient β1
The following R syntax can be used to display a wealth of information. By the time you read this tutorial, hopefully you will know what most of it means; of course, we will go over it in class.
summary(spring.modelfit)
Call:
lm(formula = spring.length ~ weight)
Residuals:
Min 1Q Median 3Q Max
-0.09619 -0.03406 -0.00535 0.03761 0.12011
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.99971 0.02477 201.81 < 2e-16 ***
weight 0.20462 0.01115 18.36 4.2e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Among other things, this summary displays the estimates (b0 , b1 ), their standard errors, the value of their test
statistics, and their P-values.
Let’s look at how we can get these quantities the long way
The residual sum of squares can be found by taking the residuals, squaring them, and then taking the sum.
sum(spring.modelfit$residuals^2)
[1] 0.05948624
The residual standard deviation is se = √( SSresid / (n − 2) ).
my_resid_std_error <- sqrt(sum(spring.modelfit$residuals^2)/(n_obs-2))
print(my_resid_std_error)
[1] 0.05748731
# R stores this same value in the model summary as 'sigma'
s_e <- summary(spring.modelfit)$sigma
print(s_e)
[1] 0.05748731
The standard error of b1, SEb1, can be calculated explicitly using: se / ( sx · √(n − 1) ).
# since x = weight
std_error_b1 <- s_e/(sd(weight)*sqrt(n_obs-1))
print(std_error_b1)
[1] 0.01114631
As it turns out, R has already computed this for us, and is given in the summary(lm()) output. We can extract
this from the output using:
# Change "weight" to the variable you need
SE_b1 <- summary(spring.modelfit)$coefficients["weight",2]
print(SE_b1)
[1] 0.01114631
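As a side note (not in the original), R will also produce confidence intervals for the regression coefficients directly from the fitted model:
# Not in the original: 95% confidence intervals for the intercept and the slope.
confint(spring.modelfit, level=0.95)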
# The degrees of freedom for inference on the slope is n - 2
df.resid <- n_obs - 2
print(df.resid)
[1] 18
The standard hypothesis test on the slope asks whether there is a linear relationship between the predictor and the response. Using symbols,
H0 : β1 = 0
HA : β1 6= 0
# The test statistic is the estimated slope divided by its standard error
ts <- spring.modelfit$coefficients["weight"]/SE_b1
print(ts)
weight
18.35801
p.val.b1 <- 2*pt(ts,df.resid,lower.tail=FALSE)
print(p.val.b1)
weight
4.203758e-13
We can look at the output of summary(lm()) for comparison, or just at this small portion of it:
summary(spring.modelfit)$coefficients["weight",3:4]
t value Pr(>|t|)
1.835801e+01 4.203758e-13
For reference, here is the full summary output once more.
summary(spring.modelfit)
Call:
lm(formula = spring.length ~ weight)
Residuals:
Min 1Q Median 3Q Max
-0.09619 -0.03406 -0.00535 0.03761 0.12011
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.99971 0.02477 201.81 < 2e-16 ***
weight 0.20462 0.01115 18.36 4.2e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient of determination, R², is simply the square of the correlation coefficient:
my.r^2
[1] 0.9492981
This matches the Multiple R-squared value reported in the summary output.
anova(spring.modelfit)
Response: spring.length
Df Sum Sq Mean Sq F value Pr(>F)
weight 1 1.11377 1.1138 337.02 4.204e-13 ***
Residuals 18 0.05949 0.0033
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, summary(spring.modelfit) provides us with the F-statistic and the corresponding P-value.
The main idea is that ANOVA and regression can be used to test for the same thing:
H0 : β1 = 0
While this test seems silly for simple linear regression, it is more useful in multiple linear regression, which we are not covering.
In that case,
H0 : β1 = β2 = · · · = βp = 0
IMPORTANT NOTE:
While the above numerical analysis will work on your data sets in general, the visual analysis requires that you
pre-sort your data with respect to the x-variable. Otherwise your plots are likely to look strange.
We can also add the fitted values (or the predicted values) and plot them
plot(weight,spring.length,pch=16, main="Spring Length versus Mass",
xlab="Mass Weight", ylab="Spring Length")
Note: I will demonstrate one way to use the default plotting tools in R to plot confidence and prediction bands.
While this is not difficult, these plots tend to look better using ggplot2.
We will have discussed confidence and prediction bands a little in class. We will not discuss how to compute them the long way in R; instead we will just see how to plot them efficiently. The confidence and prediction bands are almost always shown on the same plot.
Look at the chunk of code below. To obtain confidence and prediction bands, we use the predict() function.
# Confidence Bands
spring.confint <- predict(spring.modelfit, interval="confidence")
# Prediction Bands
spring.predint <- predict(spring.modelfit, interval="prediction")
plot(weight, spring.length, pch=16,
main="Spring Length versus Mass \n with Confidence & Prediction Intervals",
xlab="Mass Weight", ylab="Spring Length")
6.4.4 Basic Residual Analysis for Checking Assumptions
Residual analysis is used to determine whether the assumptions of our linear model and the method of simple
linear regression are valid.
As a first step, we assumed a simple linear model described the relationship between our response and predictor
variables.
A simple scatter plot is a good first step to tell whether or not it is appropriate to use simple linear regression.
However, sometimes variables that are related in a non-linear manner appear to be linear in certain regions.
Another way that we can detect for curvature is to use residual plots
A scatter plot of the residuals, ri = (yi − ybi ), versus the fitted (predicted) data, ybi can show curvilinearity if it
exists.
spring.residuals <- resid(spring.modelfit)
spring.fitted.values <- fitted(spring.modelfit)
plot(spring.fitted.values,spring.residuals,pch=16,
main="Residuals vs Predicted Plot", xlab="Predicted",ylab="Residuals")
abline(h=0) # A horizontal line makes it easier to see
If there is a trend, such as a quadratic relationship, this means that our linear model assumption is false.
This might mean that we may need to transform our data or that we may need to use multiple regression instead
of simple linear regression.
Finally, we should also look at a normality plot of the residuals to assess whether the normality assumption has been satisfied.
qqnorm(spring.residuals, main="Normal Q-Q Plot for Residuals for \n the Spring Length Data", pch=20)
qqline(spring.residuals, col="red")