Lucero R Tutorial 2016
Dr. Christian Lucero
Version: 2016.12.31
1 This tutorial was designed by Dr. Christian Lucero for use in his courses at Virginia Tech. Please acknowledge the
original author’s contribution when using portions of this tutorial.
Instructions
The tutorials and assignments are designed to teach you how to do basic statistical data analysis using
R/RStudio.
You will need to read the tutorials thoroughly in order to complete an assignment.
You should read the tutorial, line-by-line, in its entirety. Skipping around is ill-advised.
It is recommended that you dedicate a folder to this course and a sub-folder for each assignment!!
You will be asked to write your own snippets of R programming code to carry out basic statistical data
analysis. You should save your Rcode in files by naming them appropriately. We will give you specific
naming requirements for each assignment that you turn in.
Note: In order to edit the R Script / R Program Code, RStudio has a built-in R-code editor that we highly
advise you to use.
A Few Words About R & General Advice
What is most important is that you know where to look up an example related to the question that you want to
answer.
Once you learn and memorize some common commands within R, it becomes easier to use.
Until then, copy & paste earlier examples and modify the syntax to suit the needs of your current problem.
One of the most efficient ways to learn any program, calculator, or even an app on your phone is through exploration
along with trial-and-error.
If you ever feel lost, R has a built-in help system. Additionally, a few minutes googling for answers usually helps
in most cases.
Most importantly: don't be afraid to seek help from other students, TAs, or the instructors if you are truly lost.
Chapter 1
5. Functions in R.
1.1 Getting Started with R: The RStudio Interface.
To quote Wikipedia:
“R is a programming language and software environment for statistical computing and graphics.”
“RStudio is a free and open source integrated development environment (IDE) for R, a programming language
for statistical computing and graphics.”
Essentially, R is a very powerful statistical analysis software suite & RStudio is an independently developed project
that was designed to run on top of the R framework to make programming in R easier.
Therefore you need to install both on your system. However, you should never need to open the R program, only
RStudio.
Instructions on how to install R & RStudio can be found on the R and RStudio websites.
If you are still stuck, please see a classmate, a TA, or talk to your instructor during office hours.
The basic RStudio interface is divided into specific sections.
1. The area in the top left corner is dedicated to writing R Programs and is called the Source Window.
Specific lines of code can be run from this window one line at a time, in selected chunks, or as the entire
program.
2. The console is in the bottom left corner. The console can be used to type in basic commands, functions,
etc. Additionally the console is where you see basic output from running specific commands and programs.
3. A command history and short list of user defined variables is located in the top right corner.
4. The bottom right corner contains a number of useful items organized by tabs. There is a file browser, a tab
which houses the plots when they are generated, an installed package viewer, and the help system interface
for RStudio.
1.2 Naming Files and Organizing Your Work
Organization of your files is extremely important! You should dedicate a folder to this course, specific homework
assignments/projects, and projects that you might be working on in the real world.
Within this folder, you should make sub-folders. You should have a folder for each assignment.
There will be a number of assignments in this course, so you should have the following sub-folders:
Assignment_1
Assignment_2
Assignment_3
You should generally keep a separate R Script file for every assignment. You should call these R Scripts something
appropriate.
Lucero_Christian_CMDA_3654_HW1.R
If you are familiar with version control, such as git, please use it!!
One thing that you should notice is that all R Scripts/Program Files must end with a .R
R is a statistical programming language that is highly customizable and has a lot of built-in functions that do
many advanced statistical computations and plots for you.
R Programs/Codes/Scripts (all names are commonly used) are the instructions that outline the exact computa-
tions and other operations that you want R to run.
1.3.1 Comments in R
To start with, you should familiarize yourself with comments. Comments are inert statements that start with a
# symbol.
Commenting or annotation is a very common practice in science.
The purpose of commenting is to help explain pieces of your code so that you and others can understand, in plain
language, the steps that are about to be carried out.
Commenting can also help you debug your code, by preventing R from running certain pieces of code which allows
you to see what is working, and what is not.
# Example 1
# I want to plot the function y = sin(x), where x is from 0 to 2*pi
x <- seq(0,2*pi,length.out=100)
y <- sin(x)
# 3+5*7 # Isn't computed because it's commented out.
# plot(y,x,type="l"), # this would produce the wrong plot, so it's commented out.
plot(x,y,type="l", main="y=sin(x)") # This will produce the plot we want!
[Figure: plot of y = sin(x) for x from 0 to 2*pi, titled "y=sin(x)".]
We should see that anywhere there is a comment, R does nothing. In particular, '3+5*7' is never calculated, and
similarly for 'plot(y,x,type="l")'.
For the other lines that were not commented out, i.e:
x <- seq(0,2*pi,length.out=100)
y <- sin(x)
plot(x,y,type="l", main="y=sin(x)") # This will produce the plot we want!
The first 2 lines do some work, and assign the results to the variables named ‘x’, and ‘y’. The plot() function
then takes those values and produces the plot you see above.
1.3.2 How to Run R-Code: The Basics
1. In the Console, you can type each command individually or cut & paste, then hit enter and observe the
output.
While this works, this is generally not advised. Instead do your work in the Source Window and save it to
an R Script. Then do one of the following.
2. If you have a command inside an R Script that you want the computer to run, you can put your blinking
cursor anywhere on that line and hit CTRL-ENTER on the keyboard.
If you have a bunch of lines in a row that you need to run, just repeat the above step as many times as you
need.
3. If you want to run a specific line or multiple lines, then highlight those lines and hit CTRL-ENTER on
the keyboard.
4. You can highlight the code that you want to run, and click on the Run button in the top right of Source
Window.
If you want to run ALL LINES inside of the R script, save, then click on the Run Button.
For the most part, you are going to run lines individually in this class. You will then observe the output and
probably cut and paste the output results into a word document.
All programming languages allow you to define basic objects or variables to allow you to store and retrieve
information in order to do both simple and complex computations.
# We can assign values to objects using arrows or equals signs.
a <- 12
b = 2
# Typing just the name of the object in R will print the value.
a
[1] 12
b
[1] 2
Notice that we can use either an arrow <- or an equals sign = to assign objects. In practice it doesn't matter
which one you use, so long as you know what's going on when reading other people's code.
Essentially, you could treat R as just a fancy scientific calculator.
Run the following lines in the console and observe the output:
a-31
a*b
a^2
(a-b)^(b*4)
Of course you can store your fancy computations into new objects/variables.
my.answer <- sqrt(57)+32 + 11^3
When we assign a value to an object we don't see anything unless we ask for the object by itself or use a print
command.
my.answer
[1] 1370.55
print(my.answer)
[1] 1370.55
There are times when we will also need to store characters (not numbers) into objects.
specimen.name = 'Charlie The Horse' # Notice that single ' ' and " " work identically in R.
specimen.breed = "Paso Fino"
specimen.age = 7
print(specimen.name)
[1] "Charlie The Horse"
print(specimen.breed)
[1] "Paso Fino"
print(specimen.age)
[1] 7
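The command that produced the next line of output is not shown; a paste() call along these lines (a sketch, not the original code) would reproduce it:
# Combine the character and numeric objects into a single sentence.
paste0("The specimen was as a ", specimen.breed, " with the name ",
       specimen.name, ", age ", specimen.age, ".")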
[1] "The specimen was as a Paso Fino with the name Charlie The Horse, age 7."
1.5 Functions in R
There are many functions built into R that you will learn as you go along.
Sometimes these functions can be called with just 1 argument, and sometimes additional arguments are needed.
Some function arguments are optional but help you to customize the output.
If you know the name of the function, say it’s called fcn.name but you don’t remember how to use it or what
arguments it needs, then use one of the following commands.
help('fcn.name')
# or equivalently
? fcn.name
Of course, you can click on the Help tab in the bottom right panel and do a search as well.
Go ahead and try this now. Look at the help file for plot function.
Often times when referring to a function instead of a variable or object name, we make this explicit
by referring to the function using the notation fcn.name(), for example, the plot() function.
When naming variables or other objects, it is best not to give them the same name as function
names. For example, length() is a function, so don’t do length <- 32.5, instead call it something
else arm.length <- 32.5.
If you need an example of how a function might be used, you can read about it on the help page or there may be
a more extensive example using the example() command function: example('fcn.name')
example('plot')
When the data is not too long, we can type it in by hand using the concatenation function c().
# Suppose our list started getting too long...we can continue onto the next line
# so long as the previous line ends with a comma.
# Important Note: this is true for all functions
strength <- c(580, 400, 428, 825, 850, 875, 920, 550,
575, 750, 636, 360, 590, 735, 950)
strength
[1] 580 400 428 825 850 875 920 550 575 750 636 360 590 735 950
# or
print(strength)
[1] 580 400 428 825 850 875 920 550 575 750 636 360 590 735 950
Now we can compute the mean, sd, etc of the strength variable observations. We can also plot values too.
mean(strength)
[1] 668.2667
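A few other summaries work the same way (these calls are not shown in the original, but they are standard base R functions):
sd(strength)      # sample standard deviation
median(strength)  # sample median
summary(strength) # five number summary plus the mean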
The data that we just entered is now stored in what is called a vector or an array.
R can read in many different types of files from other programs, far more than we can cover in this tutorial.
Instead we will show you the basic idea of how to load .csv and .dat files that are reasonably well-structured.
A good way to learn about importing other types of data is to use Google.
For example type: R import .mat
Quite often, data is stored in a spreadsheet where the variables are given by the columns with the variable name
in the first entry of each column. We will focus upon importing data that is in this type of format.
Consider the following dataset which has 6609 observations for each variable (only the first 15 are shown).
If you ever have a spreadsheet of data formatted as in the example above, then explore the options in your spreadsheet
program. Every "good" spreadsheet program has the ability to export to a specific file type known as a .csv file
(comma separated values).
We will show you how to read in .csv files below. If you ever need to read in other file types, a little Google will
get you a long way on this subject. If we need to read in other file types for this course, we’ll specifically give you
the command.
There are a few different ways to read in data from a file. R is rather powerful in that it can load data from a
wide variety of sources. Typically there are different functions for different file types and we don’t want to explore
all of them here. Instead we want to give you the main idea about how loading in files works.
Let’s focus on loading in .csv files. Comma separated value, csv, files are files that can easily be read/written by
spreadsheet programs.
We want to read in the crime dataset which is stored in the Crimedata.csv file.
Make sure that you know where this file is on your computer and then run the following line.
A dialog box will appear and you must select the file.
# Note the header = T option means that the first row corresponds to a variable name.
crime <- read.csv(file.choose(), header=T)
Alternatively, if you know where the file is, especially if you run the code often and want automation, then you
can load the data explicitly.
# Note the header = T option means that the first row corresponds to a variable name.
crime <- read.csv("/home/Username/CMDA_3654/Tutorial1/Crimedata.csv", header=T)
Note the use of the quotation marks!
Now the crime dataset is in working memory. By default R treats this as a specific type of data structure called
a data frame. Essentially a data frame is just a way of organizing all of our variables in a related data set.
To see the first few observations from this dataset use the following command:
head(crime)
We can also ask R to tell us how many rows are in the dataset.
nrow(crime)
[1] 6609
When a dataset does not have column headings that serve as the names of the variables, we have to assign them
ourselves.
First, let's use a different type of data set called a .DAT file (a simple data file). The values in the data set are
separated by a simple space between them. The first value is the number of hours of snowfall (call it "snowhours").
The second value is the number of hours it takes for workers to clear the snow (call it "clearhours").
# There is no header so we don't say header = TRUE
snowdata <- read.table("T3-2.DAT")
print(snowdata)
V1 V2
1 12.5 13.7
2 14.5 16.5
3 8.0 17.4
4 9.0 11.0
5 19.5 23.6
6 8.0 13.2
7 9.0 32.1
8 7.0 12.3
9 7.0 11.8
10 9.0 24.4
11 6.5 18.2
12 10.5 22.0
13 10.0 32.5
14 4.5 18.7
15 7.0 15.8
16 8.5 15.6
17 6.5 12.0
18 8.0 12.8
19 3.5 26.1
20 8.0 14.5
21 17.5 42.3
22 10.5 17.5
23 12.0 21.8
24 6.0 10.4
25 13.0 25.6
We'll see that R assigns generic variable names "V1" and "V2" to the columns. The observation numbers
were also filled in by R.
We can assign variable names to the data after it has been read in.
colnames(snowdata) <- c("snowhours","clearhours")
We can also assign names to the row elements (such as patient names or other labels) using rownames()
head(snowdata)
snowhours clearhours
1 12.5 13.7
2 14.5 16.5
3 8.0 17.4
4 9.0 11.0
5 19.5 23.6
6 8.0 13.2
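A sketch of how rownames() could be used here (the labels below are made up for illustration and are not part of the original tutorial):
# Label each row without altering the original snowdata object.
snowdata.labeled <- snowdata
rownames(snowdata.labeled) <- paste0("storm_", 1:nrow(snowdata.labeled))
head(snowdata.labeled)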
1.8 Installing R Library Packages
R is a very open language. People from all over the world can submit their own libraries of code to the central R
repositories for other people to use.
In addition to new programs, functions, & plotting tools, installing new library packages in R can also provide
new practice datasets to work on.
In order to install a new library package (sometimes just called a library or just a package) in RStudio, there are
several ways.
Method 1: Go to Tools -> Install Packages, and search for and install the package you are looking for.
Method 2: In the bottom right panel, click on Packages, then click on Install, then search for and install the
package.
Method 3: Type the install.packages() command directly into the Console (or run it from a script), as shown below.
Install the psych package using any of the three methods stated above.
# To install the psych library using the Method 3 command, you should have used:
install.packages("psych", dependencies = TRUE)
Some other libraries that you should go ahead and install right now are:
install.packages("binom", dependencies = TRUE)
install.packages("epitools", dependencies = TRUE)
install.packages("car", dependencies = TRUE)
install.packages("multcomp", dependencies = TRUE)
install.packages("Sleuth3", dependencies = TRUE)
You can see which packages are already installed on your computer a couple of different ways, but the easiest
in RStudio is the Packages Tab in the bottom right portion of the screen. If you click there, you can see what
packages are installed, and those with checkmarks are the ones currently enabled. Certain packages are always
on by default, others you have to enable yourself. We can simply click on the check box to enable a package or
use the command line, which is often preferred.
Enabling a Package
In order to use a package, you must first enable it. If you are already on the Packages tab in RStudio, simply
make sure the box next to the library you want is checked.
Alternatively, from the Console or within a script, use one of the following commands:
library('package.name'), or library("package.name"), or library(package.name)
If you attempt to load a library before you have installed it, you will get an error.
We already installed the psych library above, now let’s enable it.
library("psych")
Go ahead and enable the other libraries that we installed today; we'll use at least one of them later.
Now that we installed the psych library, let’s look at one of the new functions.
A neat function from the psych library is describe(). Keep in mind, you’ll get an error if you did not install
and enable the psych library.
The dataset mtcars is a dataset that we can use this function on (it’s in the datasets System Library which is
auto-enabled when you start RStudio).
describe(mtcars)
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 23.50 0.61 -0.37 1.07
cyl 2 32 6.19 1.79 6.00 6.23 2.97 4.00 8.00 4.00 -0.17 -1.76 0.32
disp 3 32 230.72 123.94 196.30 222.52 140.48 71.10 472.00 400.90 0.38 -1.21 21.91
hp 4 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 283.00 0.73 -0.14 12.12
drat 5 32 3.60 0.53 3.70 3.58 0.70 2.76 4.93 2.17 0.27 -0.71 0.09
wt 6 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 3.91 0.42 -0.02 0.17
qsec 7 32 17.85 1.79 17.71 17.83 1.42 14.50 22.90 8.40 0.37 0.34 0.32
vs 8 32 0.44 0.50 0.00 0.42 0.00 0.00 1.00 1.00 0.24 -2.00 0.09
am 9 32 0.41 0.50 0.00 0.38 0.00 0.00 1.00 1.00 0.36 -1.92 0.09
gear 10 32 3.69 0.74 4.00 3.62 1.48 3.00 5.00 2.00 0.53 -1.07 0.13
carb 11 32 2.81 1.62 2.00 2.65 1.48 1.00 8.00 7.00 1.05 1.26 0.29
However, you may have to re-enable/activate the library if you quit R/RStudio.
Make sure to check to see if a package is enabled or else you will get errors when trying to use functions from
those packages.
1.9 Data Frames and Their Variables: The Basics
When you enter data into an object by hand, it is usually just a scalar variable or a simple vector/array of
numerical values or characters.
Example:
student <- c('Joe','Sara','Chen') # Vector/Array of Characters
age <- c(23, 22, 22) # Vector/Array of Numerical Values
grade <- c('Junior','Sophomore','Sophomore') # Vector/Array of Characters
In the above variables, we have 3 vectors of equal length: two contain characters, denoted by quotation marks ' ', and
one contains numerical values.
A data frame is a special type of dataset that is used for storing data tables. It is a list of vectors of equal length
where the columns correspond to variables and the rows within each column are the corresponding observation.
A natural example of a data frame is data organized in the spreadsheet in the Crimedata.csv file.
By default, most of the data that is naturally found in the R libraries (as well as the data we import from .csv
files) will be in the data frame format.
We can make a dataframe from vectors of equal length by using the data.frame() function:
student.data <- data.frame(student,age,grade)
print(student.data)
The trees dataset is provided by the datasets library which should already be enabled when you turn on
RStudio.
You can learn about the trees dataset by typing ? trees or help('trees') in the Console.
We can look at the entire dataset by simply typing the name of the dataset:
trees
9 11.1 80 22.6
10 11.2 75 19.9
11 11.3 79 24.2
12 11.4 76 21.0
13 11.4 76 21.4
14 11.7 69 21.3
15 12.0 75 19.1
16 12.9 74 22.2
17 12.9 85 33.8
18 13.3 86 27.4
19 13.7 71 25.7
20 13.8 64 24.9
21 14.0 78 34.5
22 14.2 80 31.7
23 14.5 74 36.3
24 16.0 72 38.3
25 16.3 77 42.6
26 17.3 81 55.4
27 17.5 82 55.7
28 17.9 80 58.3
29 18.0 80 51.5
30 18.0 80 51.0
31 20.6 87 77.0
If it's not too long, looking at the whole dataset is okay. But if it is very long and you just want to get a general
sense of what the data looks like, then we have a number of options.
First, we could use the head() function which lists the first 6 rows.
head(trees)
What we notice is that this dataset has 3 variables (all of which are numeric), they are “Girth”, “Height”,
“Volume”.
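The paragraph below describes the output of the str() function; the call itself is not shown, but it would be:
str(trees)  # display the structure of the trees data frame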
The output of str() specifically says the trees dataset is a 'data.frame' with 3 variables, each of which
has 31 observations. It then names the variables, states the type of each variable (num stands for numerical), and
lists some of the observations for each variable.
We could have also used the names() and nrow() functions to tell us the names of the variables and the number
of rows, respectively.
Suppose we wish to just look at a particular variable from this dataset, say Height, we can do the following:
trees$Height
[1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74 72 77 81 82 80
[29] 80 80 87
The above shows all 31 observations for the Height variable from the trees dataset.
To look at a specific variable from a dataset such as a data frame, do the following:
dataset.name$variable.name
For now, we'll just show the easiest approach. You can save a variable from a dataset into a new object that we
name ourselves.
our.tree.heights <- trees$Height
Now our new variable is called our.tree.heights and we can just call it by name.
our.tree.heights
[1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74 72 77 81 82 80
[29] 80 80 87
Suppose we want to know the 7th value of the Volume variable in the trees dataset.
We have a number of options to do this. Do the following and scroll up to the previous section to check if the
answer is correct.
trees$Volume[7]
[1] 15.6
trees[7,3]
[1] 15.6
The square brackets ‘[ ]’ are used for indexing locations in matrices, vectors, and data frames in R.
When we just have a single vector, then you only need one number for your index.
my.vector = c(10,20,30,40,50,60,70,80,90)
[1] 50
[1] 10 20 30 40 50
[1] 30 40 50 60 70
[1] NA
[1] 9
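The commands that produced the five lines of output above are not shown; indexing calls like these (a sketch) would reproduce them:
my.vector[5]      # the 5th element: 50
my.vector[1:5]    # elements 1 through 5
my.vector[3:7]    # elements 3 through 7
my.vector[10]     # there is no 10th element, so R returns NA
length(my.vector) # the number of elements: 9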
Here is an example of a matrix; ignore the details about how it is constructed for now.
my.matrix <- matrix(1:12, byrow=T, 3,4) # A matrix with 3 rows by 4 columns
my.matrix
Notice that the matrix is a 3-by-4 matrix. In general we describe matrices by their size: an m-by-n, matrix has
m rows and n columns.
When indexing a matrix, you must ask for the row location first, then the column location.
my.matrix[2,3] # row 2, column 3 should be 7
[1] 7
my.matrix[3, ] # Row 3, no number for the 2nd entry returns all columns
[1] 9 10 11 12
[1] 4 8 12
[,1] [,2]
[1,] 7 8
[2,] 11 12
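The remaining output above comes from commands that are not shown; a sketch that reproduces it:
my.matrix[, 4]       # column 4, all rows: 4 8 12
my.matrix[2:3, 3:4]  # rows 2-3 and columns 3-4: the 2-by-2 block shown above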
Getting back to the first example: the Volume variable was in the 3rd column of the trees dataset, so that's
why we could find the 7th value of the variable using either trees$Volume[7] or trees[7,3].
This should come in handy. One reason this might be useful is to split our dataset into different pieces.
all.heights <- trees$Height
all.heights
[1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74 72 77 81 82 80
[29] 80 80 87
some.heights
[1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75
remaining.heights
[1] 74 85 86 71 64 78 80 74 72 77 81 82 80 80 80 87
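The definitions of some.heights and remaining.heights are not shown; they were presumably created by splitting the vector by index, for example:
some.heights      <- all.heights[1:15]   # the first 15 heights
remaining.heights <- all.heights[16:31]  # heights 16 through 31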
What if you wanted to use R as a graphing calculator, what would you do?
# Create a sequence of x-values from 0 to 2*pi.
x <- seq(0, 2*pi, length.out=100)
# Using the x-values, evaluate a function and store the result in an array object called "y".
y <- sin(x)
# Plot the function as a line.
plot(x, y, type="l")
# To add a single point to an existing plot we can use the following command:
points(2,0,col="red",pch=2) # This new point will be red (col stands for color)
w <- 2*pi*x-1
points(x, w, col="blue") # points() adds the (x,w) points to the existing plot
lines(x, w) # lines() connects the (x,w) points with a solid line on the existing plot
We can specify a title for our plot, rename our axes, as well as redefine our axes.
We will illustrate how to do these with more examples later in the tutorial.
There are other ways to put new points/lines on existing plots. But the above examples illustrate the basic idea
for most simple cases.
A picture of the final plot that you should obtain is given below.
[Figure: the final plot of y = sin(x) with the added points and lines.]
In addition to the plot function, there are other functions that do a lot of fancy plotting with a few simple options.
For example, the hist() function will be used to make histograms, while boxplot() will generate boxplots.
We will see some examples of these functions and many more throughout the R tutorials.
Chapter 2
In this part of the tutorial you will learn how to use R to do many of the basic statistical methods that you have
learned to do by hand from the textbook.
4. Probability.
2.1 Elementary Statistics Using R.
[1] 76
[1] 6.371813
[1] 40.6
[1] 76
78%
80.4
# Okay, so that last one is probably not obvious, unless you've been following the class lecture.
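The commands that produced the six lines of output above are not shown; a sketch that reproduces them, assuming the tree height data from the previous chapter, is:
mean(trees$Height)           # 76
sd(trees$Height)             # 6.371813
var(trees$Height)            # 40.6
median(trees$Height)         # 76
quantile(trees$Height, 0.78) # the 78th percentile, 80.4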
If you can think of a statistical method, then someone has probably already made a function for that technique!
Even if it’s not in the base installation of R, the function you want can usually be found in one of the library
packages that we can download.
There are some other functions that are quite useful at times. Please note that these functions work differently
depending on whether we are using them on a data frame, a variable, or some other type of object.
For example, summary(trees) gives a "5 number summary" along with the mean for every variable in the data frame.
summary(trees$Height)
The summary() function will be used to show us other useful information in more advanced tutorials.
Investigate what the describe() function from the psych library returns when we use it on the trees dataset.
First we are going to look at univariate summaries. Our main tools are bar plots, histograms, stripcharts, &
boxplots.
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
270 264 234 302 229 265 281 213 236 216 207 164 176 148 150 146 125 137 135 129 118 136
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
122 114 146 116 122 123 119 121 118 73 94 94 97 77 70 97 63 52 50 56 57 32
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
34 29 21 22 21 25 23 17 8 10 8 7 11 7 6 9 5 6 4 3 9 4
78 79 80 82 84 85 87 89 90
3 3 3 1 6 3 2 1 4
# In the above, we see that there were 270 - 12 year old victims, 264 - 13 year olds, etc.
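# The table() calls behind the two frequency tables in this section are not shown;
# a sketch (the column names crime$age and crime$Police are assumptions based on the surrounding text):
table(crime$age)                                    # frequency table of victim ages
crime.police.reported.counts <- table(crime$Police) # counts of 'No Police' vs 'Police'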
# This will display a frequency table of whether the police were called or not for the crimes.
crime.police.reported.counts
No Police Police
2695 3914
We could use the frequency tables to make bar plots & histograms by hand but let’s let R do this for us.
Bar Plots
A simple bar plot is made using the barplot() function. We use this on categorical data only consisting of counts
(absolute frequency) for particular categories such as the crime.police.reported.counts table.
barplot(crime.police.reported.counts)
[Figure: bar plot of crime.police.reported.counts.]
This is an ugly looking plot, but we can customize many aspects of it. A lot of this is done just through playing
around with the various options.
Most of the time, a simple bar plot is all we need. However, we often prefer using relative frequencies instead of
absolute.
This requires us to rescale our counts so that everything is expressed as a fraction of the total.
# Divide our counts by the total sum for relative frequencies.
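# The assignment that creates relative.crime.reports is not shown above; it was
# presumably the following (dividing the table by its total):
relative.crime.reports <- crime.police.reported.counts / sum(crime.police.reported.counts)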
relative.crime.reports
No Police Police
0.4077773 0.5922227
barplot(relative.crime.reports, col=c('purple','green'),
main='Rates of Police Calls for Crimes' )
[Figure: Rates of Police Calls for Crimes (relative frequency bar plot).]
Histograms
Histograms are easy in R. Simply use the hist() function. By default this will use the absolute frequencies
(counts) instead of relative frequencies. There is an option that we can feed the hist() function to produce a
histogram with relative frequencies (aka proportions/densities) instead.
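The default (count) histogram shown below was presumably produced with:
hist(crime$age)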
[Figure: Histogram of crime$age (frequency counts).]
Now, if we want relative frequencies (proportions/densities) instead of absolute frequencies (or counts), we use
the following option:
# Histogram with Relative Frequency instead
hist(crime$age, prob=TRUE)
[Figure: Histogram of crime$age (relative frequency / density).]
We can zoom in or out, that is change the class-width interval of the bins, by changing the number of breaks. By
default, R picks the number of breaks to use.
# Histograms with different number of bins
hist(crime$age, breaks=10, main="Histogram of crime$age, breaks=10")
hist(crime$age, breaks=20, main="Histogram of crime$age, breaks=20")
hist(crime$age, breaks=50, main="Histogram of crime$age, breaks=50")
hist(crime$age, breaks=100, main="Histogram of crime$age, breaks=100")
[Figure: four histograms of crime$age with breaks = 10, 20, 50, and 100.]
We can customize just about everything, the title, the x & y labels, axes, colors, etc.
Don’t be afraid to google or use the help system to help you get what you want!
# Add your own title using , main="Your Title"
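# A sketch of the calls that likely produced the next two figures (the titles and
# axis labels are taken from the figures; the break points are an assumption):
hist(crime$age, main="Histogram of Age")
hist(crime$age, breaks=seq(10, 90, by=5), main="Histogram of Age", xlab="Age in Years")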
[Figure: Histogram of Age.]
[Figure: Histogram of Age with the x-axis labeled "Age in Years" and tick marks every 5 years.]
Stripcharts are fairly easy to use, but customizing is often very important. Let’s generate a stripchart for the Height
variable from the trees dataset. Let’s also learn how to generate a dotplot which uses the same stripchart()
function but with different options.
Some Notes:
pch = 20 plots little filled circles and col=’red’ makes these circles red.
Since there are multiple trees at certain heights (for example, height = 80), the method='jitter' option adds a tiny
random variation, so the dots aren't on top of each other.
Since jitter relies on the random number generator, if you want the same plot that I have then we need to reset
our random number generators.
The random number generators are reset to a starting value by using the set.seed() function.
First let’s see what a stripchart looks like with no extra options, then we’ll see how it looks after we modify it a
bit.
stripchart(trees$Height, main="Stripchart for Tree Height Data")
[Figure: default stripchart of trees$Height.]
You can't tell this from the above plot, but there are actually multiple observations at certain values.
As you can see below, there are five trees with a height equal to 80. So we need to modify the stripchart to be a little
more clear about this fact.
trees$Height
[1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74 72 77 81 82 80
[29] 80 80 87
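The jittered stripchart described in the notes above was produced with a call along these lines (a sketch; the seed value is an assumption, and any fixed seed gives a reproducible jitter):
set.seed(1)
stripchart(trees$Height, method='jitter', pch=20, col='red',
           main="Stripchart for Tree Height Data")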
[Figure: Stripchart for Tree Height Data (jittered points).]
The only change that needs to be made for a dotplot is the method='stack' option.
# Note we are also making our plot horizontal which is the default anyway.
# method = 'stack', which produces a dotplot.
stripchart(trees$Height, method='stack', pch=20, col='red', main="Dotplot for Tree Height Data")
[Figure: Dotplot for Tree Height Data.]
I'm not a huge fan of where R puts the tick marks by default.
The xaxt='n' option deletes the x-axis along with the tick marks, and the axis() function can then be used to customize my own.
stripchart(trees$Height, method='stack', pch=20, col='red', xaxt='n',
main="Dotplot for Tree Height Data")
axis(side=1, at=seq(from=63, to=87, by=2)) # put ticks every 2 units apart between 63 and 87
[Figure: Dotplot for Tree Height Data with tick marks every 2 units from 63 to 87.]
Boxplots
The boxplot() is fairly easy to use now that you’ve seen some other functions.
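The call that produced the boxplot below is not shown; it was presumably:
boxplot(trees$Height, main="Boxplot of Tree Height Data")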
[Figure: Boxplot of Tree Height Data.]
No outliers, kind of boring! Let’s try making a horizontal boxplot of the crime$age observations.
# By default boxplots are vertical, to change to horizontal, use horizontal = TRUE
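# A sketch of the horizontal boxplot call (the title and axis label are taken from the figure below):
boxplot(crime$age, horizontal=TRUE, xlab="Age", main="Boxplot of Age of Crime Victims")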
[Figure: Boxplot of Age of Crime Victims (horizontal, x-axis "Age").]
This data corresponds to Olympic track records for various countries in various running events. The 100 meter
and 200 meter times are both numerical variables. The appropriate tool to use is a scatter plot.
plot(track.records$m100, track.records$m200, xlab='100 meter dash times',
ylab='200 meter dash times', pch=18, col="blue")
[Figure: scatter plot of 100 meter dash times vs. 200 meter dash times.]
When we have numerical-categorical data, we can compare the categories by using side-by-side boxplots.
The categorical variable contains specific names of categories, such as colors "Yellow", "Green", "Blue"; or perhaps
general ranges "Low", "Medium", "High"; or specific outcomes such as "Positive", "Negative".
Sometimes the categories have a number as a label. Consider dosages. Instead of “10 mg”, “20 mg”, “30 mg”,
sometimes people just put “10”, “20”, “30” for the labels. We have to be careful when using numbers as labels!
Suppose I have three categories for which I have the same type of measurements, for example, Vitamin C measurements
(in mg per 200 g of fruit) for Apples, Cranberries, and Oranges.
First note that each category does not have to have the same number of measurements.
apple <- c(8.4, 7.3, 10.9, 37.1, 17.4) # 5 measurements here
cranberry <- c(20.6, 16.7, 26.8, 44.2) # 4 measurements here
orange <- c(61.1, 56.5, 46.3, 53.2, 35.1, 57.8) # 6 measurements here
Method 1: To make side-by-side boxplots when the data is in the form given by simple numerical vectors designated
by the category names, do the following:
boxplot(apple,cranberry,orange, names=c('apple','cranberry','orange'), col=c('red','pink','orange'),
main='Vitamin C in Fruit (mg per 200g of fruit)')
[Figure: Vitamin C in Fruit (mg per 200g of fruit), side-by-side boxplots.]
If you are importing data from a .csv file or some other method, the data will usually be in a data frame.
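The long-format data frame printed below was presumably built along these lines (a sketch; the construction is not shown):
fruit <- rep(c('apple','cranberry','orange'), times=c(5,4,6))
vit.c.mg <- c(apple, cranberry, orange)
vitamin.c.content <- data.frame(fruit, vit.c.mg)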
print(fruit)
print(vit.c.mg)
[1] 8.4 7.3 10.9 37.1 17.4 20.6 16.7 26.8 44.2 61.1 56.5 46.3 53.2 35.1 57.8
print(vitamin.c.content)
fruit vit.c.mg
1 apple 8.4
2 apple 7.3
3 apple 10.9
4 apple 37.1
5 apple 17.4
6 cranberry 20.6
7 cranberry 16.7
8 cranberry 26.8
9 cranberry 44.2
10 orange 61.1
11 orange 56.5
12 orange 46.3
13 orange 53.2
14 orange 35.1
15 orange 57.8
Let’s explore the pattern for this other way of producing boxplots.
Method 2(a): To make side-by-side boxplots, the general R command is:
boxplot(data$numerical.variable ~ data$categorical.variable)
Depending on the circumstance, the categorical variable is sometimes called a factor variable and the categories
of the factor variable are often called the levels.
Please note: If you have used only numbers as labels and have not used the quotations to designate the numbers
as characters, i.e. '20', then you can use the factor() function to tell R that the numerical values are actually
labels.
Method 2(b): To make side-by-side boxplots when the categories are numerical values:
boxplot(data$numerical.variable ~ factor(data$categorical.variable))
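The legend code below is added to a side-by-side boxplot of the mtcars data that is not shown; a sketch (the title, labels, and colors are inferred from the legend code and the figure):
boxplot(mtcars$mpg ~ factor(mtcars$cyl), col=c("orange","green","cyan"),
        xlab="Cylinders", ylab="MPG", main="MPG by Number of Cylinders")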
# Let's add a legend just for fun. Use the legend() function.
# It seems redundant in this plot though.
legend("topright", inset=0.05, title="Number of Cylinders", c("4","6","8"),
fill=c("orange","green","cyan"))
[Figure: MPG by Number of Cylinders, side-by-side boxplots with a "Number of Cylinders" legend.]
In order to create a stacked bar chart of the MSA and Police variables, you first need to create a bivariate
frequency table (or a contingency table) with the counts of each of these categorical variables.
table(data$variable1, data$variable2)
Note: The first variable in the argument will be along the left side column in the plot.
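For this example the call was presumably (the column names crime$MSA and crime$Police are assumptions based on the surrounding text):
counts <- table(crime$MSA, crime$Police)
counts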
No Police Police
Rural 325 489
Suburban 1315 1856
Urban 1055 1569
We will use the barplot() function to obtain a stacked bar plot for this data.
barplot(counts, main="Stacked Bar Chart of MSA and Police Reporting")
[Figure: Stacked Bar Chart of MSA and Police Reporting.]
Stacked bar plots won’t make sense without a legend. So let’s add one. Also, we should add an x-axis label.
To add a legend, make sure to use c("Category1", "Category2", "etc.") for as many categories as you need.
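A sketch of the stacked bar plot with a legend shown below (the title and axis labels are taken from the figure; the exact call is not shown):
barplot(counts, main="Different Neighborhood Crime Locations and Police Reporting",
        xlab="Police Reporting", ylab="Number of Neighborhoods",
        legend.text=c("Rural","Suburban","Urban"))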
[Figure: Different Neighborhood Crime Locations and Police Reporting (stacked bar plot with legend).]
There are still some formatting issues that we should fix, but before we do that, what if we actually wanted a
different stacked bar plot with the crime locations on the x-axis and the frequency of police calls to be the heights?
We need to use the t() function. This function takes the transpose of matrices and tables.
barplot(t(counts), main='Police Reporting for Crimes in Different Neighborhood Types',
xlab='Neighborhood Type', ylab='Frequency of Police Non-Reporting/Reporting',
col=c('red','blue'))
[Figure: Police Reporting for Crimes in Different Neighborhood Types (stacked bar plot by neighborhood type).]
There are many options that we can use to modify the plots. We can add colors, rotate the plot, or put the bars
next to each other instead of stacked (this is called a grouped bar plot).
barplot(t(counts), main='Police Reporting for Crimes in Different Neighborhood Types',
ylab='Neighborhood Type', xlab='Police Non-Reporting/Reporting Frequency',
col=c("cyan","orange"), horiz=T, beside=T, xaxt = 'n')
# side=1 refers to the x-axis. Side: 1=below, 2=left, 3=above and 4=right
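# The axis() call that the comment above refers to is not shown; presumably
# something like this (the tick positions are an assumption):
axis(side=1, at=seq(from=0, to=2000, by=500))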
# Let's add a legend, inset pushes away from the side of the plot by a tiny bit.
legend("bottomright", inset=0.05, c('No Police Called','Police Called'),
fill = c('cyan','orange'))
[Figure: Police Reporting for Crimes in Different Neighborhood Types (horizontal grouped bar plot with legend).]
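The row-wise relative frequencies shown below were presumably computed with prop.table() (margin=1 gives proportions within each neighborhood type):
relative.counts <- prop.table(counts, margin=1)
relative.counts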
No Police Police
Rural 0.3992629 0.6007371
Suburban 0.4146957 0.5853043
Urban 0.4020579 0.5979421
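A sketch of the relative-frequency stacked bar plot shown below (the title and axis labels are taken from the figure; the colors are an assumption):
barplot(t(relative.counts), col=c("cyan","orange"), xlab="Neighborhood Type",
        ylab="Relative Frequency of Police Reports",
        legend.text=c("No Police Called","Police Called"),
        main="Relative Frequency of Police Reporting for Crimes\nin Different Neighborhood Types")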
[Figure: Relative Frequency of Police Reporting for Crimes in Different Neighborhood Types.]
2.4 Probability
R has most of the commonly used probability distributions already built into the base installation.
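As a reminder, the standard base R short names for some common distributions are listed here (the tutorial's original table is not reproduced):
# Common base R distribution names (combined with a prefix letter such as d, p, q, or r):
# Normal = "norm", Student's t = "t", Exponential = "exp",
# Binomial = "binom", Uniform = "unif", Chi-square = "chisq".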
2.4.2 Evaluating the probability density function at a specific value of the random
variable.
By prefixing a "d" to the distribution name (see the names above), you can get probability density values (pdf).
In general, d<dist.name> evaluates y = f(x), where f(x) is the probability density function that describes the
probability distribution.
The dnorm() function returns the height of the normal curve at the desired value along the x-axis.
# Example 1: Evaluate the probability density function (pdf)
# for a normal distribution with mean=20, sd=4
# for x = 23
dnorm(23, mean=20, sd=4)
[1] 0.07528436
[Figure: The Normal Distribution with parameters N(20,4); the pdf is evaluated at x=23.]
# Example 2:
dexp(3,rate=1/2)
[1] 0.1115651
[Figure: The Exponential Distribution with rate parameter = 1/2; the pdf is evaluated at x=3.]
By prefixing a "p" to the distribution name, you get the CDF, or cumulative distribution function, which is the left-tail probability for a given value of x, say x0.
pnorm(23, mean=20, sd=4)
[1] 0.7733726
[Figure: The Normal Distribution N(20,4); the probability P(X <= 23) = pnorm(23,mean=20,sd=4).]
The default action of pnorm() is to always give the left-tail probability. To get a right-tail probability, subtract from 1 or use the lower.tail=FALSE option.
1 - pnorm(23, mean=20, sd=4)
[1] 0.2266274
# or equivalently
pnorm(23, mean=20, sd=4, lower.tail=FALSE)
[1] 0.2266274
[Figure: The Normal Distribution N(20,4); the probability P(X > 23) = 1 - pnorm(23,mean=20,sd=4).]
pt(1.3, df = 23)
[1] 0.8967606
# Note: Unlike your tables from the book, R can handle fractional degrees of freedom as well,
# what if df = 23.7? P(T<1.3) = ?
pt(1.3, df = 23.7)
[1] 0.8969486
qnorm(0.37,mean=20,sd=4)
[1] 18.67259
[Figure: For the distribution N(20,4), the value x0 = 18.67 has an area of 37% to its left; that is, P(X <= x0) = 0.37.]
The answer is given by the 37th percentile of the distribution, qnorm(0.37,mean=20,sd=4).
qt(0.33, df=14)
[1] -0.4494312
Of course, we might also wish to know which value of the random variable has a certain percentage of values
above it. To obtain this answer, we use the option lower.tail = FALSE.
# Example 3: If T ~ t-distribution with df = 14, what if we want to know
# which value of t corresponds to having 67% of the values ABOVE it?
# (This is still the 33rd percentile just worded differently!)
qt(0.67, df=14, lower.tail=FALSE)
[1] -0.4494312
[Figure: The t-distribution with df=14; the 33rd percentile (-0.449) has 33% of the data below it and 67% above it.]
By prefixing an "r" in front of your desired distribution, you can generate random numbers from that distribution.
# Example 1: Generate a single random number from a normal distribution
# with mean = 20, sd = 4
rnorm(1,mean=20,sd=4)
[1] 13.26149
rnorm(20, mean=20, sd = 4)
[1] 17.64615 17.59858 19.69242 23.81140 18.45567 20.17301 23.07194 16.42878 18.83256
[10] 21.28481 22.05607 19.18567 21.42000 21.26179 26.65697 22.64397 18.38407 27.49874
[19] 22.15625 28.72936
# Let's run the command again and get a different set of random numbers but this time we'll store it!
x <- rnorm(20, mean=20, sd=4)
print(x)
[1] 18.53378 19.99741 21.53830 19.45958 17.94471 31.27115 19.45154 22.86772 21.69416
[10] 17.01573 24.06403 13.17149 24.65973 15.60269 21.26390 17.52044 20.90198 24.36188
[19] 27.65063 18.82910
Obviously this is a bunch of numbers that we probably don’t want to see, so it’s best to store it and view them
with a picture.
We sampled 20 random numbers from the normal distribution with µ = 20, and σ = 4.
We can see where these 20 values correspond to with respect to the distribution from where they were sampled.
[Figure: the N(20,4) density curve with the 20 sampled values marked along the x-axis.]
If we wanted to, we could keep sampling more and more values. As the sample becomes large enough, we can
start to collect these values into bins and create a histogram. This histogram should resemble the distribution
from which the random variables were sampled.
To see this, let’s increase our sample size but this time make a histogram.
# Let's do this again, this time for n = 200
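# A sketch of the code the comment above refers to (the exact call is not shown):
x <- rnorm(200, mean=20, sd=4)
hist(x)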
[Figure: Histogram of x for the larger random sample.]
The main distribution that we focus upon for the first third of the course is the normal distribution.
The normal distribution is considered to be the most important distribution in Statistics. We will cover this more
in detail in class.
In practice, there are a large number of techniques that can be used when data come from a normal distribution.
But how can we know if our data actually come from a normal distribution?
There are three primary tools that we use to assess whether or not a sample comes from a normal distribution.
1. Histograms
2. Normal probability plots (QQ plots)
3. The Shapiro-Wilk test
2.5.1 Histograms
Let's use some data from the Sleuth3 library that we installed earlier.
library('Sleuth3')
Note, if we have not installed and enabled the Sleuth3 library, the following commands will not work!
To learn more about this data set run the following command
? case0201
head(case0201)
Year Depth
1 1976 6.2
2 1976 6.8
3 1976 7.1
4 1976 7.1
5 1976 7.4
6 1976 7.8
describe(case0201)
vars n mean sd median trimmed mad min max range skew kurtosis se
Year 1 178 1977.0 1.00 1977.0 1977.00 1.48 1976.0 1978.0 2.0 0.00 -2.01 0.08
Depth 2 178 9.8 1.03 9.9 9.86 1.04 6.2 11.7 5.5 -0.69 0.55 0.08
We see that there are 2 variables, Year and Depth (corresponding to beak depth).
To avoid typing case0201$Depth a lot, let’s make life a little easier by naming a new variable.
beak.depth <- case0201$Depth
The main idea behind using a histogram to check for normality of a sample is to see if the shape is approximately
normal and the empirical rule seems to mostly hold.
## Run the following lines and read carefully
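# A reconstruction of the lines described below (the pattern matches the later
# examples in this tutorial; the exact number of breaks is an assumption):
hist(beak.depth, breaks=10, probability=T,
     main="Histogram of Beak Depths with Normal Curve", xlab="Beak Depth")
xfit <- seq( min(beak.depth), max(beak.depth), length=40)
yfit <- dnorm(xfit, mean=mean(beak.depth), sd=sd(beak.depth) )
lines(xfit, yfit, col="blue", lwd=2)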
[Figure: Histogram of Beak Depths with Normal Curve overlaid.]
The above lines look long and confusing (in fact this is the hardest part of this first tutorial) so let’s break them
down and try to understand what they do.
The above is nothing new, feel free to change breaks = 10 to 20 and re-run the lines. Notice that probability =
T means that we are using relative frequencies.
We saw this earlier in the tutorial when we looked at plotting. This line generates equally spaced values between
the min & max values of the beak.depth observations.
Investigate xfit by typing it and looking at the output.
In particular, y = f (x) where f (x) is the Gaussian function used in the normal distribution.
Since this data may or may not come from a normal distribution, we are going to assume it does and see how
well a normal curve (the dark blue curve) with the same mean and standard deviation compares.
Since the real µ and σ are unknown quantities we’ll estimate them using sample mean and sample standard
deviation.
This is simply plugging the x-values into the Gaussian function f (x) and getting the corresponding y-values.
Finally, we just want to overlay this theoretical curve on the existing plot without plotting a new plot using:
lines(xfit, yfit, col="blue", lwd=2)
Once again, the overlayed “normal curve” is the theoretical distribution that uses µ equal to the sample mean,
X, and σ equal to the sample standard deviation, s.
The overlayed curve and the histogram don't have to match perfectly, but the better they do, the more plausible
it is that the sample came from a normal distribution.
For the beak.depth observations, we see that the histogram indicates that the sample is slightly left skewed, but
some may also be tempted to judge the data as “normal enough”.
Sometimes the number of breaks that you use can also affect your initial judgement on normality.
This is one of the reasons we use other tools to help make the judgement.
Histograms are only the first tool to assessing whether data are truly from a normal distribution or not.
These were used for comparing a sample with a theoretical normal distribution.
In general, "qqplot" is the more generic term used when we want to compare a data set to a particular distribution, not
necessarily the normal distribution.
Run the following commands. Read to understand what the commands are doing and what we should be seeing.
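A reconstruction of the commands referred to here (the pattern matches the later examples in this tutorial; the plot title is taken from the figure):
qqnorm(beak.depth, pch=20, main="Normal Probability Plot of Beak Depths")
qqline(beak.depth, col="red", lwd=2)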
[Figure: Normal Probability Plot of Beak Depths.]
You should see a normal probability plot for the beak.depth data.
The function qqline(), plots the “ideal” line that the data should follow if it was from a perfectly normal
distribution.
In general, the data won’t follow this line exactly, but it will help you get an idea of how approximately normal
the data is.
Extreme deviations from the line are indicators that the sample may not be from a normal distribution.
The beak.depth data deviate from the line a little at the tails. It seems difficult to make an absolute judgement
based upon this plot, but we would be tempted to say that the data is not from a normal distribution and is
slightly left skewed, which agrees with our histogram.
We can use a computational test called the Shapiro-Wilk’s test for normality.
It assumes that the sample data values are from a normally distributed population and looks for evidence to the
contrary. It then conducts a statistical test. The details of this test are too advanced for this course, but we can
still learn how to use and interpret the results of this test.
The command for testing the beak.depth variable for normality is below.
shapiro.test(beak.depth)
data: beak.depth
W = 0.96781, p-value = 0.000393
If the p-value < 0.10, then the data IS ASSUMED TO NOT BE FROM a normally distributed population.
If the p-value ≥ 0.10, then it is considered that the data IS PLAUSIBLY FROM a normally distributed
population.
In this case, the p-value = 0.000393, which is less than 0.10, so we have evidence to conclude that the
data is not from a normal distribution.
Let’s look at another example. This time we will look at a case where the data is definitely not from a normal
distribution.
# Example 2: Test a random sample to see if it comes from a normally distributed population.
# Let's investigate the beta distribution with shape parameters alpha = beta = 0.5
# Clearly a beta distribution is not a normal distribution!!
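# The line generating my.sample is not shown; it was presumably something like
# this (the sample size of 100 is an assumption):
my.sample <- rbeta(100, shape1=0.5, shape2=0.5)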
# A histogram
hist(my.sample, breaks=20, probability = TRUE)
xfit <- seq( min(my.sample), max(my.sample), length=40)
yfit <- dnorm(xfit, mean=mean(my.sample), sd=sd(my.sample) )
lines(xfit, yfit, col="blue", lwd=2)
[Figure: Histogram of my.sample with the fitted normal curve overlaid.]
# A normality plot
qqnorm(my.sample, pch=20)
qqline(my.sample, col="red", lwd=2)
[Figure: Normal Q-Q Plot of my.sample.]
# Shapiro-Wilk's test
shapiro.test(my.sample)
data: my.sample
W = 0.86781, p-value = 5.725e-08
All three tools are in perfect agreement. This data is not from a normally distributed population.
Let’s look at one more example from the Sleuth3 library. This data is not normally distributed but we’ll use a
log transformation to make it into one.
# Example 3: Salaries are often NOT normally distributed.
# You can investigate that this salary data is NOT normally distributed on your own.
# Run the following line, the Sleuth3 library must be installed and enabled.
salaries <- case0102$Salary
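# Take the natural log of the salaries; this is the log transformation referred to above.
log.salaries <- log(salaries)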
# A histogram
hist(log.salaries, breaks=20, probability = T)
xfit <- seq( min(log.salaries), max(log.salaries), length=40)
yfit <- dnorm(xfit, mean=mean(log.salaries), sd=sd(log.salaries) )
lines(xfit, yfit, col="blue", lwd=2)
[Figure: Histogram of log.salaries with the fitted normal curve overlaid.]
# A normality plot
qqnorm(log.salaries, pch=20)
qqline(log.salaries, col="red", lwd=2)
[Figure: Normal Q-Q Plot of log.salaries.]
# Shapiro-Wilk's test
shapiro.test(log.salaries)
data: log.salaries
W = 0.9817, p-value = 0.2183
Since the p-value = 0.2183 is greater than 0.10, it is plausible that the log-transformed salaries come from a normally distributed population.
Sampling distributions are very important in Statistics. Whether we explicitly say so or not, we will use several
kinds in this course.
While it's not important for this class to study every sampling distribution in detail, we will choose to look at
the easiest sampling distribution, which is the sampling distribution for X̄.
In class, we learned that the sampling distribution for the sample mean, X̄, has some useful properties.
First, recall that

X̄ = (X1 + X2 + · · · + Xn)/n

where the measurements X1, X2, . . ., Xn are independent and identically distributed from a population which has
a population mean µ and population standard deviation σ.
Since X̄ is a linear combination of X1, . . ., Xn, we can use the properties of linear combinations of random variables
to obtain the mean and standard deviation of X̄.
The mean of X̄ is given by E(X̄) = µ and the standard deviation is given by sqrt(Var(X̄)) = σ/√n.
We learned that
1. If the population from which samples are drawn is normal, then the sampling distribution of X̄ is also
normal regardless of the sample size n.
2. Central Limit Theorem (CLT): If n is large, then the sampling distribution of X̄ is approximately normal,
even if the population from which the measurements are taken is not normal.
In this section we are going to try to understand a little more about sampling distributions and the central limit
theorem.
We are going to look at how well the central limit theorem holds up for different sample sizes and for different
distributions.
To investigate, we are going to conduct a "meta-study" consisting of repeatedly taking samples of equal size,
computing the sample mean, x̄, for each sample, and then constructing a histogram of the sample means.
Run the following commands and read the explanations.
mu <- 25
sigma <- 7
# Take 500 samples, each of size n1 = 50, compute the sample mean for each,
# then store the answers in the vector xbar1.
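# A reconstruction of the simulation loop described above (it follows the same
# pattern as the exponential example later in this section):
n1 <- 50
xbar1 <- rep(0, 500)
for (i in 1:500) { xbar1[i] <- mean( rnorm(n1, mean=mu, sd=sigma) ) }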
# For our vector of sample means (little xbar) xbar1, make a histogram.
hist(xbar1, main=("Hist of 500 Random Sample Means, samples of \n size n=50 sampled from X ~ N(25,7)"),
xlab=expression(bar("X")), breaks=30, prob=TRUE)
[Figure: Hist of 500 Random Sample Means, samples of size n=50 sampled from X ~ N(25,7).]
Since we are sampling from a normal distribution (note the use of rnorm()), the sampling distribution should
look rather normal in shape for any sample size. We chose n = 50.
In addition to the histogram, don't forget that you can look at qqplots too and run Shapiro-Wilk's test to assess
normality.
qqnorm(xbar1, pch=20); qqline(xbar1, col="red")
[Figure: Normal Q-Q Plot of xbar1.]
shapiro.test(xbar1)
data: xbar1
W = 0.99774, p-value = 0.7451
We can verify that the mean of X̄ (the mean of the sample means, E(X̄)) is close to the population mean of
X ~ N(25, 7), which is µ = 25.
## Is the mean of xbar close to 25?
mean(xbar1)
[1] 25.00307
Additionally, we can verify that the standard deviation of X̄ is σ/√n = 7/√50 = 0.9899495.
# Is the standard deviation of xbar close to 0.9899495?
sd(xbar1)
[1] 1.006624
Repeat the above, but this time, change the sample size from
n1 = 50 to n1 = 5
You should just be able to cut and paste the lines of code from the example above, changing n1 in the definition and
n=5 in the title for the plot.
Now let's repeat the above analysis and see how well the central limit theorem works when the population we sample
from is not normal. To illustrate this, let's look at samples taken from the Exponential Distribution. Don't worry
about the details of this distribution.
If you want to see what this distribution looks like, the dark blue line represents the theoretical curve.
hist(rexp(100000,rate=25), breaks=20, prob=T)
lines(seq(0,0.3,len=50), dexp(seq(0,0.3,len=50),rate=25),type='l', col="blue", lwd=2)
[Figure: histogram of 100,000 draws from Exp(rate=25) with the theoretical density curve overlaid.]
Let’s see how the central limit theorem works in this case.
set.seed(303) # for reproducibility
n2 <- 5
xbar2=rep(0,500)
for (i in 1:500) { xbar2[i]=mean( rexp(n2,rate=25)) } # Draw a sample of size n2 from X ~ Exp(25)
hist(xbar2, main="Hist of 500 Random Sample Means \n samples of size n=5 \n sampled from X ~ Exp(25)",
xlab=expression(bar("X")), breaks=20)
[Figure: Hist of 500 Random Sample Means, samples of size n=5 sampled from X ~ Exp(25).]
mean(xbar2) # The mean for this distribution in this case should be close to 1/25 = 0.04
[1] 0.03847333
sd(xbar2) # The sd for this dist for n=5 should be close to (1/25)/sqrt(n) = 0.01788854
[1] 0.01715966
We see that the mean and standard deviation seem to match fairly well, but the shape is not exactly normally
distributed as it appears to be slightly skewed to the right.
We can repeat the above analysis but this time, we’ll change the sample size from n = 5 to n = 50.
set.seed(303) # for reproducibility
n2 <- 50
xbar2=rep(0,500)
for (i in 1:500) { xbar2[i]=mean( rexp(n2,rate=25)) } # Draw a sample of size n2 from X ~ Exp(25)
hist(xbar2, main="Hist of 500 Random Sample Means \n samples of size n=50 \n sampled from X ~ Exp(25)",
xlab=expression(bar("X")), breaks=20)
[Figure: Hist of 500 Random Sample Means, samples of size n=50 sampled from X ~ Exp(25).]
mean(xbar2) # The mean for this distribution in this case should be close to 1/25 = 0.04
[1] 0.03965549
sd(xbar2) # The sd for this dist for n=50 should be close to (1/25)/sqrt(50) = 0.005656854
[1] 0.00570074
Increasing the sample size has helped to make the distribution more normal in shape.
If we increase the sample size even further, this approximation continues to improve.
The general rule of thumb is that for most distributions a sample size of n > 30 is good enough to satisfy the
central limit theorem.
However, if the original population is extremely skewed (as is the case of the exponential distribution), then larger
sample sizes are generally needed.
Chapter 3
In this part of the tutorial you will learn how to use R to:
1. Produce confidence intervals and conduct hypothesis tests for a single mean.
2. Produce confidence intervals and conduct hypothesis tests for differences of means from two independent
samples.
3.1 Inference for a Single Mean
Let’s start by reviewing a bit about confidence intervals for a single mean µ and how to generate them in R.
Before you start, realize that we are going to discuss the long way to compute confidence intervals first, then I
will show you the R shortcut later on.
Confidence Intervals are used to provide us with an interval estimate for the true population mean, µ.
We want to find confidence intervals for the mpg variable from the mtcars data set. In order to compute confidence
intervals, we need to know the following:
1. the sample size
2. the sample mean
3. the sample standard deviation
To get the sample size, there are a number of ways to accomplish this: First, we can use the describe command
that we used earlier describe(mtcars). You will see n=32.
n <- length(mtcars$mpg)
n
[1] 32
nrow(mtcars)
[1] 32
# Next, the sample mean:
ybar <- mean(mtcars$mpg)
ybar
[1] 20.09062
s <- sd(mtcars$mpg)
s
[1] 6.026948
Now, 100(1 − α)% confidence intervals are found using the Student's t-distribution. In order to use the
t-distribution we need to know alpha and the degrees of freedom. The degrees of freedom are always n − 1. In this
case: n − 1 = 31.
Now, similarly to the normal distribution, we want the value of t such that 100(1 − α/2)% of the data is less than
or equal to t (hence the 1 − (α/2) percentile). This is called the critical value for α/2. Note: your book uses t(α/2)
instead for simplicity. This should have been explained in class and in the reading.
The confidence interval is then given by

X̄ ± t(1−α/2, n−1) · s/√n

where t(1−α/2, n−1) is the critical value of t that corresponds to the 1 − α/2 percentile from a t-distribution with
n − 1 degrees of freedom.
[1] 18.28418
upper
[1] 21.89707
# Either way, we should see that a 90% CI for the mpg is approximately
# (18.28, 21.90) miles per gallon. Be sure to always remember the UNITS!!
# Note:
# A trick to get a confidence interval using a single line is the following:
ybar + c(-1,1)*qt(0.95, df=n-1)*s/sqrt(n)
As you learned in class, the most common way to interpret this confidence interval is:
We are 90% confident that the true population mean mpg of all automobiles is between 18.28 and 21.90 miles per gallon.
Clearly there has to be some way for R to automatically compute confidence intervals.
In fact we can use a single command to compute hypothesis tests and confidence intervals simultaneously.
We will talk about hypothesis tests in more detail in class, so for now we will just look at using this function for
the automatic confidence interval computation.
t.test(mtcars$mpg)
data: mtcars$mpg
t = 18.857, df = 31, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
17.91768 22.26357
sample estimates:
mean of x
20.09062
By default, the t.test command will produce a 95% confidence interval for the mean. To request a different confidence level, use the conf.level argument:
t.test(mtcars$mpg, conf.level=0.90)
data: mtcars$mpg
t = 18.857, df = 31, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
18.28418 21.89707
sample estimates:
mean of x
20.09062
The only thing that should change is the reported confidence interval, which should now match what we computed the long way in the previous section.
Suppose in the previous example that we wanted to test the claim that the average mpg of the cars is less than
22 mpg.
The hypothesis statements are therefore
H0 : µ ≥ 22
versus
HA : µ < 22
We know how to do this the long way. First we compute a test statistic, ts = (x̄ − µ0)/(s/√n).
We then compute the P-value = P(Tn−1 < ts), where the "<" was determined by the alternative hypothesis. We can of course test "greater than" or "≠" alternative hypotheses too.
# Here are some summary statistics that we'll need
xbar.mpg <- mean(mtcars$mpg)
s.mpg <- sd(mtcars$mpg)
n.mpg <- length(mtcars$mpg)
df.mpg <- n.mpg - 1
ts <- (xbar.mpg - 22)/(s.mpg/sqrt(n.mpg)) # the test statistic
print(ts)
[1] -1.792127
p_value <- pt(ts, df=df.mpg, lower.tail=TRUE) # P( T_{n-1} < ts )
print(p_value)
[1] 0.04143924
Since the P-value is 0.0414392, which is less than α = 0.05, we have statistical evidence that the average mpg of the cars is less than 22 mpg.
To carry this out using the built-in R function, we already know to use t.test(); we just need to supply the hypothesized mean and the direction of the alternative:
t.test(mtcars$mpg, mu=22, alternative="less")
data: mtcars$mpg
t = -1.7921, df = 31, p-value = 0.04144
alternative hypothesis: true mean is less than 22
95 percent confidence interval:
-Inf 21.89707
sample estimates:
mean of x
20.09062
Notice that we use the function option alternative="less". This specifies the direction of the test; by default the test assumes "two.sided". We can of course use alternative="greater" as well.
3.2 Inference for the Difference of Means from Two Independent Samples
Example 1:
A researcher is investigating the differences between two catalysts (Catalyst A and Catalyst B) and their ability to speed up a certain process. Both catalysts produce identical results with regard to the end product of the process itself; the main question at hand is whether one catalyst takes less time than the other.
In order to investigate, the researcher was able to repeat the process six times with Catalyst A and five times with Catalyst B before running out of material. The times to complete the process, in minutes, are given below:
Can you conclude that the time to complete the process differs between the two catalysts? Test at the 5%
significance level.
Now we will talk about how to use the built-in R commands to carry out a t test for the difference of means.
Of course, we can use R as a fancy calculator and simply apply our formulas that we learned in class.
The following commands are what you would do by hand. The difference is that you get a more precise P-value.
xbar.A <-mean(catalystA)
xbar.B <-mean(catalystB)
sd.A <- sd(catalystA); sd.B <- sd(catalystB) # sample standard deviations
n.A <- length(catalystA); n.B <- length(catalystB) # sample sizes
se.A.minus.B <- sqrt( sd.A^2/n.A + sd.B^2/n.B ) # standard error of the difference
ts <- ( xbar.A - xbar.B )/se.A.minus.B
nu.df <- (sd.A^2/n.A + sd.B^2/n.B)^2 / ( (sd.A^2/n.A)^2/(n.A-1) + (sd.B^2/n.B)^2/(n.B-1) )
print(nu.df)
[1] 4.368632
Note that the above gives a fractional degrees of freedom. R can handle this, but we cannot use our tables. If you wanted to be able to use your table to compare answers, you need to round the degrees of freedom down to the nearest whole number. You can do this using the floor() function.
nu.df.rounded.down <- floor( (sd.A^2/n.A + sd.B^2/n.B)^2 / ( (sd.A^2/n.A)^2/(n.A-1) + (sd.B^2/n.B)^2/(n.B-1) ) )
print(nu.df.rounded.down)
[1] 4
By now we realize that for hypothesis tests, we can either look at the P-value or we can compare the test statistic
with the critical values that define our critical region.
The critical region consists of the values of the T-distribution (with the appropriate degrees of freedom) that lie in the extreme tails whose total area is alpha; the critical values mark the boundary of the critical region.
Hence, if our test statistic, ts, is beyond the critical values, then we reject the null hypothesis.
# To get the critical values, we will use the following:
alpha <- .05
t.half.alpha <- qt(1-alpha/2, df=nu.df)
# These are the critical values that cut-off the tails and define the boundary of our rejection region.
c(-t.half.alpha, t.half.alpha)
# Compare the test statistic with these critical values:
print(ts)
[1] -3.025679
The P-value for a two-sided hypothesis test is found by
# P-value = 2*Prob( T > |ts| )
pvalue.ex1 <- 2*pt( abs(ts), df=nu.df, lower.tail=FALSE)
print(pvalue.ex1)
[1] 0.03475186
Finally, we could have also studied the confidence intervals instead. A 95% confidence interval for µA − µB is given by:
lower.bound <- xbar.A-xbar.B - qt(1-.05/2,df=nu.df)*se.A.minus.B
upper.bound <- xbar.A-xbar.B + qt(1-.05/2,df=nu.df)*se.A.minus.B
print(lower.bound)
[1] -3.423359
print(upper.bound)
[1] -0.2033077
By default, the built-in Student's t-Test assumes that the samples are independent and that the variances are UNEQUAL. It automatically calculates the degrees of freedom using the exact same formula.
This form of the t-test is called Welch's two-sample t-Test (unequal variances).
## Run the following line and study the output.
t.test(catalystA, catalystB)
It tells us that it conducted a two-sided test, that is, "alternative hypothesis: true difference in means is not equal to 0". Finally, it provided us with a 95% confidence interval.
The test didn't provide us with a conclusion; we need to simply look at the P-value and make that decision based upon the significance level that we choose. However, if we wanted to use the confidence interval to make an interpretation, we can ask for different 100(1 − α)% confidence intervals using the following command variation.
t.test(catalystA, catalystB, alternative="two.sided", conf.level=0.99) # For a 99% confidence interval
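As a side note (not in the original), the t.test() result is a list, so if you only want the interval itself you can extract its conf.int component:
# Not in the original: pull just the confidence interval out of the t.test() result.
ci99 <- t.test(catalystA, catalystB, alternative="two.sided", conf.level=0.99)$conf.int
print(ci99)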
Example 1b: Suppose instead that we want to test whether Catalyst A takes less time than Catalyst B, i.e. HA : µA < µB.
The long method requires that we keep track of which tail of the t-test we want to observe.
The test statistic is still exactly the same, but we need to change how we calculate the P-value. Since the alternative µA < µB ⇒ µA − µB < 0, we notice that we use a less than "<".
So the P-value is the probability that we will observe a particular value of our test statistic, ts, or less. That is, the P-value for a "less than" hypothesis test is found by P-value = Prob(T < ts).
pvalue.ex1b <- pt( ts, df=nu.df, lower.tail=TRUE)
# lower.tail=TRUE gives Prob(T < ts); this is also the default
print(pvalue.ex1b)
[1] 0.01737593
Using our short command, to get the "less than" alternative we use the following modification:
t.test(catalystA, catalystB, alternative = "less")
We notice that the difference in the P-values between our method and the built-in code again comes from the fact that this built-in command uses df = 4.2267, whereas we used df = 4.
We also notice that t.test automatically generates the corresponding 1-sided confidence intervals too.
We discussed how to do 1-sided confidence intervals by hand in class, but let’s review how to do this using R.
For 1-sided confidence intervals, we simply need to find either the upper or lower "confidence bound".
A 95% upper confidence bound on µA − µB corresponds to the "less than" alternative (the interval is everything below the bound), and a lower confidence bound corresponds to the "greater than" alternative.
upper.bound.one.sided <- xbar.A-xbar.B + qt(1-.05,df=nu.df)*se.A.minus.B
print(upper.bound.one.sided)
[1] -0.5660423
You can compare our answer with the output from the t.test() with alternative="less".
Similarly, suppose we wanted to test
H0 : µA ≤ µB
versus
HA : µA > µB
Very simply, we can find the answer using the built-in t.test() command to conduct the test by specifying:
alternative="greater" in the argument.
In class, we studied the paired-sample t-test. Let’s look at the following example:
A sample of 10 diesel trucks were run both hot and cold to estimate the difference in fuel economy. The results,
in mpg, are presented in the following table.
This is a matched-pairs experimental design because it's the same 10 trucks at two different temperatures.
To assess whether diesel trucks are more fuel efficient in hot weather than in cold weather, we compare the mean difference, µd, to 0.
To do this by hand, or even using the long method in R, you should first calculate the differences.
diffs <- hot - cold
print(diffs)
[1] 0.30 0.38 0.66 0.41 -0.12 0.58 0.20 -0.04 0.01 0.50
In order to justify the use of a t-test, We need to check to see if the data are approximately normally distributed.
qqnorm(diffs, pch=19)
qqline(diffs, col="red")
shapiro.test(diffs)
data: diffs
W = 0.94751, p-value = 0.6392
Once we realize that this is a matched-pairs design, the actual t-test works in the same basic way as the one for a single mean.
Instead of practicing the long way, let’s see how to use the built-in R command to conduct the hypothesis test
and get a 90% confidence interval for µd .
t.test(hot,cold, alternative="two.sided", paired=TRUE, conf.level=0.90)
Paired t-test
We have strong statistical evidence that the average mpg is not the same for diesel trucks in hot and cold weather (p-value = 0.008). Our study suggests, with 90% confidence, that diesel trucks perform better in hot weather, with an average mpg as little as 0.13 and as much as 0.44 mpg greater than in the cold.
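As a check (this is not in the original), a paired t-test is equivalent to a one-sample t-test on the differences, so the following call reproduces the same t statistic, degrees of freedom, P-value, and confidence interval:
# Not in the original: the paired test is the same as a one-sample test on diffs.
t.test(diffs, alternative="two.sided", conf.level=0.90)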
Chapter 4
In this part of the tutorial you will learn how to use R to:
1. Construct confidence intervals for population proportions.
2. Test compound null hypotheses of population proportions, i.e., conduct χ2 goodness-of-fit tests for a single variable with k categories.
3. Conduct χ2 tests for association/independence between two variables, i.e. a χ2 contingency test for (r × k)
tables.
4.1 Confidence Intervals for Population Proportions
Here we wish to construct confidence intervals for population proportions, p, instead of population means µ.
The only difference is that before we calculated confidence intervals using the t-distribution; for proportions we use the standard normal (z) distribution instead.
The book told you that for the standard normal distribution, the z-score that corresponds to the 95% confidence interval is 1.96.
qnorm(0.975)
[1] 1.959964
So this is the z-score for the 97.5% quantile, which is the boundary of the 2.5% upper tail of the standard normal
distribution.
In a natural population of mice (Mus musculus) near Ann Arbor, Michigan, the coats of some individuals are white spotted on the belly. In a sample of 580 mice from the population, 28 individuals were found to have white-spotted bellies. Construct a 95% confidence interval for the population proportion of this trait.
# This is the number of trials
n.mice <- 580
# This is the number of 'successes'
n.spotted.mice <- 28
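Here is a sketch of the calculation, assuming the book's approximation is the "plus four" adjusted-proportion interval; under that assumption it reproduces the interval quoted below.
# A sketch assuming the plus-four (adjusted Wald) interval; the exact method in your
# book may differ slightly.
p.tilde <- (n.spotted.mice + 2)/(n.mice + 4) # adjusted sample proportion
se.tilde <- sqrt( p.tilde*(1 - p.tilde)/(n.mice + 4) ) # adjusted standard error
p.tilde + c(-1,1)*qnorm(0.975)*se.tilde # roughly (0.0335, 0.0693)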
We are 95% confident that the true proportion of white-spotted bellied mice is between 0.0335 and 0.0693.
We also installed a library earlier that allows us to generate confidence intervals for proportions: the binom library.
library(binom)
# binom.confint( <number of successes>, <number of trials>,method="wilson")
binom.confint(28,580,method="wilson")
Note that this is not the exact same interval as the one you computed using the method from the book; the book teaches a good approximation, while the binom.confint command uses something that is a bit more sophisticated.
The registrar assumes that for a particular art class on campus, the ratio of students who take it will be
Freshman:Sophomores:Juniors:Seniors = 3:2:1:1.
The final class roster for the new semester is in and the actual enrollment is 32 Freshman, 15 Sophomores, 13
Juniors, and 9 Seniors. Are these data consistent with the 3:2:1:1 ratio predicted by the registrar?
The hypotheses are
H0 : The data are consistent with the proposed model.
HA : The data are not consistent with the proposed model.
Under the registrar's model, the category probabilities are
p1 = 3/7
p2 = 2/7
p3 = 1/7
p4 = 1/7
Again, we will start out by doing this the long way, and then learn how to use a short simple command.
n.students <- 69 # total students
actual.students <- c(32, 15, 13, 9) # the observed counts: Fr, So, Jr, Sr
p.theory <- c(p1, p2, p3, p4) # where p_{i} is the probability of each category as indicated above
expected.students <- n.students*p.theory
Now we need our Chi-square test statistic. There are a couple of ways to do this. The formula that we studied
in class (also in the book) is the most efficient way to do this and is given with the following line.
# The chi-square test statistic
xs <- sum( (actual.students-expected.students)^2/expected.students )
print(xs)
[1] 2.403382
You should double check this by hand to see if you get the same answer.
Now we use the Chi-square distribution to get a P-value. We want the probability of observing a particular value of the test statistic, χ2s (xs above), or one more extreme. Remember, df = (number of categories) − 1 here.
p.value.students <- pchisq(xs,length(actual.students)-1,lower.tail=FALSE)
# Notice that we want the upper tail probability
print(p.value.students)
[1] 0.4930054
To call this test, all we need are the observed counts and the theoretical proportions that we are comparing them with.
Be careful: if we call chisq.test() on the counts alone, the default null hypothesis is that all categories are equally likely, which is not the test we want here.
chisq.test(actual.students)
data: actual.students
X-squared = 17.899, df = 3, p-value = 0.0004616
The null hypothesis for our example is the 3:2:1:1 ratio, so we must supply the theoretical proportions through the p argument.
chisq.test(actual.students, p=c(3/7, 2/7, 1/7, 1/7))
data: actual.students
X-squared = 2.4034, df = 3, p-value = 0.493
# or
chisq.test(actual.students,p=p.theory)
data: actual.students
X-squared = 2.4034, df = 3, p-value = 0.493
We can easily see that this matches what we did via the long method.
Example 2:
We've seen this dataset before; the MSA variable records the neighborhood type where each incident occurred. To remind ourselves, let's look at this dataset a little bit.
# This command looks at the first 6 rows of the dataset.
head(victims)
# This is a command that allows us to view names and types of variables in the data set.
str(victims)
$ YEAR : int 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
$ MSA : Factor w/ 3 levels "Rural","Suburban",..: 2 2 3 3 3 2 2 2 3 2 ...
$ ER : Factor w/ 2 levels "ER","No ER": 2 1 2 2 1 2 2 2 2 2 ...
$ Police : Factor w/ 2 levels "No Police","Police": 1 2 1 1 1 2 2 1 2 1 ...
$ age : int 14 24 45 14 37 15 14 30 21 19 ...
$ female : int 0 1 0 0 0 0 0 0 1 0 ...
$ stranger : int 0 0 1 1 0 0 1 0 0 0 ...
$ thirdparty: int 1 1 1 1 1 1 1 1 0 1 ...
$ private : int 1 1 0 0 1 0 0 1 0 1 ...
$ income : int 3 1 1 3 1 4 3 2 1 1 ...
We see that there are only 3 categories: Rural, Suburban, and Urban.
neighborhood.counts <- table(victims$MSA) # counts of incidents in each neighborhood type
prop.table(neighborhood.counts) # the corresponding sample proportions
So MSA is a variable with k = 3 categories. The proportions given by the prop.table() function are the p̂'s for the different categories.
We know that we can use our chisq.test() function to test the equality of these proportions very easily.
chisq.test(neighborhood.counts)
data: neighborhood.counts
X-squared = 1172.4, df = 2, p-value < 2.2e-16
Since the P-value is < 2.2e-16, we reject the null hypothesis of equal proportions; these crimes do not occur in equal proportions across the different neighborhood types.
Example 2b.
How about we test a different null? What if someone (before seeing the data) proposed that the proportion of crimes in the neighborhoods was (0.12, 0.50, 0.38) for Rural, Suburban, and Urban neighborhoods, respectively?
To do this test, we still use the chisq.test() function but we now specify the probabilities instead of the test’s
default where it assumes equality.
chisq.test(neighborhood.counts, p=c(0.12, 0.50, 0.38))
data: neighborhood.counts
X-squared = 3.0585, df = 2, p-value = 0.2167
In this case, we fail to reject H0, therefore it is plausible that the proportion of assaults occur 12% of the time in
Rural neighborhoods, 50% in Suburban neighborhoods, and 38% of the time in Urban neighborhoods.
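If you want to see the expected counts that this test used (they are not shown in the original), chisq.test() stores them in the expected component of its result:
# Not in the original: inspect the expected counts under the hypothesized proportions.
chisq.test(neighborhood.counts, p=c(0.12, 0.50, 0.38))$expected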
In class and in the book, you learned how to conduct a hypothesis test for the difference of population proportions by using the Chi-square test.
The general formulation of the test relies on you constructing an (r × k) contingency table. You then use the marginal frequencies and the expected frequencies to carry out the chi-square test.
Real data do not end up in those types of contingency tables on their own; we must build them ourselves.
Let's use the victims data that you loaded at the very beginning of the assignment and consider an example of how to do hypothesis tests for (2 × 2) contingency tables.
The victims data set is from the National Crime Victimization Survey from 1996-2005.
The data corresponds to incidents of serious assaults in which the victim sustained an injury.
We can express categorical data in a number of ways, using names as well as indicators (i.e. numbers, such as 1="low", 2="medium", 3="high").
MSA is the location where the incident occurred. To view the categories of the MSA variable, use the following command:
levels(victims$MSA)
levels(victims$Police)
Police is a categorical variable with categories (Police, No Police) which indicates that the incident was reported
to the police or not.
ER is a categorical variable with categories (ER, No ER) which indicates that the victim received treatment at
the ER or not.
Stranger is a categorical indicator variable which indicates whether the offender was a stranger (indicated with a 1) or not (indicated with a 0).
private is a categorical indicator which indicates whether the location was private (indicated with a 1) or public (indicated by a 0).
Suppose that we hypothesize that victims who call the police go to the ER more often than those who don’t.
Then the hypotheses would look like the following:
H0 : P r( ER | Police ) = P r( ER | No Police )
HA : P r( ER | Police ) > P r( ER | No Police )
In order to test this by hand (the long way), we need to construct a contingency table.
We have already seen how to construct a table of absolute frequencies in R assignment 1, but as a reminder, we
will use the following command.
tbl1 <- table(victims$ER, victims$Police)
# This is how we create r x k tables using real data.
No Police Police
ER 95 675
No ER 2201 2532
Notice that the ER variable is the rows, while the Police variable is the columns.
# To get the row sums of this table
rS <- rowSums(tbl1)
rS
ER No ER
770 4733
# To get the column sums of this table
cS <- colSums(tbl1)
cS
No Police Police
2296 3207
# The total number of observations
N <- sum(tbl1)
N
[1] 5503
# The expected frequencies under the null are (row total)*(column total)/N
expected.freqs.for.tbl1 <- rS %*% t(cS) / N
expected.freqs.for.tbl1
No Police Police
[1,] 321.2648 448.7352
[2,] 1974.7352 2758.2648
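As a quick check (not part of the original), base R's addmargins() function appends the row and column totals to the table, so you can see rS, cS, and the grand total at a glance:
# Not in the original: display the table together with its marginal totals.
addmargins(tbl1)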
To construct the test statistic, we use the regular formula.
xs.test.tbl1 <- sum((tbl1-expected.freqs.for.tbl1)^2/expected.freqs.for.tbl1)
# Hence our Chi-squared test statistic, xs, for this (2x2) table is
xs.test.tbl1
[1] 317.9321
For a 2x2 table, the degrees of freedom is df = 1. To find the P-value, we simply use the command:
p.value.tbl1.test <- pchisq(xs.test.tbl1, df=1, lower.tail=FALSE)
p.value.tbl1.test
[1] 4.086469e-71
We conclude that victims who call the police go to the ER more often than those who do not call the police.
Now, for what you’ve been waiting for, the short way. The short way simply uses the command:
chisq.test(tbl1, correct = FALSE)
data: tbl1
X-squared = 317.93, df = 1, p-value < 2.2e-16
# For the 2x2 case, we need to use the option correct = FALSE (this turns off the continuity correction)
When the null is simple (the proportions are equal), you don’t need any extra options.
What changes when the contingency table is not (2 × 2) but some other more general (r × k) table?
In case you don't recall, r is the number of rows in a contingency table and k is the number of columns.
For any (r × k) contingency table, the degrees of freedom is given by the formula df = (r − 1)(k − 1). Otherwise,
all other steps are identical.
We know that we can make a table from separate columns of a dataframe using the table() function, but what if we are not given the raw data and instead only given the already summarized data in the bivariate frequency table/contingency table?
Suppose we have a table that looks like the following (ignoring the row and column names for now):
                         Burrito
                Beef   Bean   Cheese
Salsa   Hot       42     10       27
        Mild       9     39       13
Recall that if we want to input an array into R, we use the following command
x <- c(1,2,3,4)
print(x)
[1] 1 2 3 4
Let's explore how to use the matrix() function with c() to enter a table.
# This makes a column vector of length 4
matrix(c(1,2,3,4))
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
#this makes a 2 x 2 matrix but say we want the first row to be [1 2], then this would be incorrect
matrix(c(1,2,3,4),nrow=2)
[,1] [,2]
[1,] 1 3
[2,] 2 4
# this makes a 2 x 2 matrix using c(1,2,3,4) putting the values in left-to-right, a row at a time.
matrix(c(1,2,3,4),nrow=2,byrow=T)
[,1] [,2]
[1,] 1 2
[2,] 3 4
Now that we know how to get the arrangement that we want, we can enter our own table.
y <- c(42,10,27,9,39,13)
lunch.data <-matrix( y ,nrow = 2,byrow = T)
print(lunch.data)
     [,1] [,2] [,3]
[1,]   42   10   27
[2,]    9   39   13
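If you want the printed table to carry the row and column names (this step is not shown in the original), you can attach them with dimnames(); the labels below come from the Salsa/Burrito table above:
# Not in the original: label the rows and columns of the matrix.
dimnames(lunch.data) <- list(Salsa = c("Hot", "Mild"),
Burrito = c("Beef", "Bean", "Cheese"))
print(lunch.data)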
Now that we have our table, we can finally run our χ2 test on this table.
chisq.test(lunch.data)
data: lunch.data
X-squared = 41.793, df = 2, p-value = 8.41e-10
Since the P-value is extremely small, we reject the null hypothesis of independence; the choice of salsa is associated with the type of burrito.
Chapter 5
In this part of the tutorial you will learn how to use R to carry out:
1. One-Way ANOVA
2. Multiple Comparisons
5.1 One-way ANOVA
Analysis of Variance or ANOVA is a method that is used to compare the means of groups simultaneously.
For example, suppose I have t different treatment groups that I wish to compare. Under the null, we assume all
the groups have the same population mean.
H0 : µ1 = µ2 = · · · = µt
versus
HA : At least one mean differs.
Example
A built-in data set in R looks at results from an experiment to compare yields (as measured by dried weight of
plants) obtained under a control and two different treatment conditions (assuming some type of fertilizer).
weight group
1 4.17 ctrl
2 5.58 ctrl
3 5.18 ctrl
4 6.11 ctrl
5 4.50 ctrl
6 4.61 ctrl
7 5.17 ctrl
8 4.53 ctrl
9 5.33 ctrl
10 5.14 ctrl
11 4.81 trt1
12 4.17 trt1
13 4.41 trt1
14 3.59 trt1
15 5.87 trt1
16 3.83 trt1
17 6.03 trt1
18 4.89 trt1
19 4.32 trt1
20 4.69 trt1
21 6.31 trt2
22 5.12 trt2
23 5.54 trt2
24 5.50 trt2
25 5.37 trt2
26 5.29 trt2
27 4.92 trt2
28 6.15 trt2
29 5.80 trt2
30 5.26 trt2
The response variable, consisting of the yield observations, is given by the weight variable.
The corresponding group for each observation is given by the group variable.
We can of course use the dollar sign $ notation to extract the data, dataset$variable.name as before.
We have two other options that we can use when we don’t want to keep using the $ sign notation.
Clearly we can use our own naming which might help us save a little bit of typing.
plant.weights <- PlantGrowth$weight
plant.groups <- PlantGrowth$group
Alternatively we can use the attach() function. This means that we can call the variables in a dataset without having to say the name of the dataset first. You must be careful because it will mask other variables/functions of the same name if they already exist. A good practice to follow is to make sure to use the detach() function when you are finished.
attach(PlantGrowth)
Now we can just type weight and group without having to type PlantGrowth$ first.
Running the following command shows how many observations are in each group.
summary(group)
ctrl trt1 trt2
  10   10   10
So we see that observations 1-10 are from the control group, observations 11-20 are from treatment 1, and observations 21-30 are from treatment 2. Remember, the number of observations won't always be the same for each group!
Before working on anything else, let's define our ANOVA model. This can be done in several different ways; here we use the aov() function with the model formula weight ~ group.
anovamodel <- aov(weight ~ group)
Even though we don't see any specific output, R has actually already completed a lot of preliminary calculations using our data.
It is always good practice to check the underlying assumptions before trying to use a particular analysis method.
All methods have slightly different requirements; for one-way ANOVA we need to check that:
i. The groups all have the same variances (or standard deviations).
ii. The groups all come from a normal distribution (especially important when the sample sizes are small).
A stripchart of the data, grouped by treatment, is a good way to look at both of these informally.
stripchart(weight~group, vertical=T, method="stack", xlab="Treatment", ylab="Weight",
main="Yields of Plants (Dried Weight) Due to Treatments", col="blue", las=1, pch=1, cex=0.75)
Again we see the Y ∼ X notation. We know that the observed weights are the response variable Y while the groups are the predictor variable X. Here, the formula tells the plotting function which group each observation belongs to so that it can be plotted appropriately.
The method="stack" option just makes overlapping data points so that they are easier to see.
The cex option is new to us. This essentially allows us to control the size of the plotting character. The default
is cex=1 which stands for 100%, while cex=3 stands for 300%. So cex=0.75 is 75% of the original character size.
To add the location of the group means to the plot, run the following lines.
stripchart(weight~group, vertical=T, method="stack", xlab="Treatment", ylab="Weight",
main="Yields of Plants (Dried Weight) Due to Treatments", col="blue", las=1, pch=1, cex=0.75)
Let's explore the tapply() command: it takes a vector of observations and applies a function (such as mean) to the observations within each group. Kinda clever! One way to look at this is the following:
tapply( vector of response observations , vector of groups , a function such as mean, median, etc. )
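For example (this illustration is not in the original), with the PlantGrowth variables attached we can apply other summary functions in exactly the same way:
# Not in the original: group medians and standard deviations via tapply().
tapply(weight, group, median)
tapply(weight, group, sd)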
To check the normality assumption, we have two options:
i. We can make normality plots or run Shapiro-Wilk tests on each of the individual groups.
ii. Or, as we discussed in class, you can simply look at a normality plot of the residuals.
qqnorm(anovamodel$residuals, main="Normal Q-Q Plot for Plant Growth Residuals", pch=19)
qqline(anovamodel$residuals, col="red")
shapiro.test(anovamodel$residuals)
data: anovamodel$residuals
W = 0.96607, p-value = 0.4379
Earlier, we obtained ybar (i.e. the group sample means). To make the residual plot below, we need appropriate (x, y) pairings, so we need to replicate the means by the appropriate sample sizes for each group using the rep() function.
rep(ybar, n_obs)
ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl trt1 trt1 trt1 trt1 trt1
5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 4.661 4.661 4.661 4.661 4.661
trt1 trt1 trt1 trt1 trt1 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2
4.661 4.661 4.661 4.661 4.661 5.526 5.526 5.526 5.526 5.526 5.526 5.526 5.526 5.526 5.526
In order to check the assumption of equal variances/standard deviations, we can start by using a graphical
approach.
A graphical way to look at whether the standard deviations are approximately equal is the following plot:
plot( rep(ybar,n_obs),anovamodel$residuals, pch=1, cex=0.75,
xlab="Fitted Value",ylab="Residuals",
main="Plots of Residuals vs sample mean (i.e. Fitted Value)")
abline(h=0, col="blue")
The rule of thumb is that the ratio of the largest spread to the smallest spread should not exceed 2. Additionally
we should not see any specific kind of pattern such as the spread increasing/decreasing significantly with the fitted
value.
Since we don’t see a huge deviation between the spreads, or a worrying pattern, we’ll assume they are okay.
Another test that can be used to assess whether the equal variance assumption is met is Levene’s test. Levene’s
test is similar to shapiro.test but for testing the equality of variances (and hence standard deviations).
The leveneTest() function is found in the car library. If you have not done so, install and enable the car library.
leveneTest(weight~group)
Df F value Pr(>F)
group 2 1.1192 0.3412
27
If the p-value is less than .05, we reject the null hypothesis of equal variances. We of course want to fail to reject
the null hypothesis so that we can use ANOVA.
To get the ANOVA Table, we have a couple of different commands, but to avoid confusion just use:
anova(anovamodel)
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
group 2 3.7663 1.8832 4.8461 0.01591 *
Residuals 27 10.4921 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You should know how to read an ANOVA table by now, but be aware that R does not provide you with the row
corresponding to the totals.
Note the variable names are the variable names that you used when defining the model in the aov() command
above.
As was mentioned in class, the Between (Groups) row is labeled with the actual name of the treatment variable. In this case that variable name is "group".
The Within(Groups) row, is also called the “Error” or the “Residuals”, depending on the command or software
that you obtain your ANOVA table with. By default, R calls them Residuals.
The following section requires the multcomp and ggplot2 libraries. Please make sure that these
libraries are installed and enabled before proceeding further.
After we have conducted our F test (or rather we have generated our ANOVA Table) we observe a P-value. If
that P-value is less than some predetermined significance level, say alpha = 0.05, then we have statistical evidence
that not all of the group population means are equal.
In this instance, we need to do pairwise comparisons to test to see which means might differ from each other.
From the multcomp library we have some wonderful tools that do this for us.
The names of some of the functions won’t make too much sense and aren’t worth explaining in full detail.
Instead learn what changes and what stays the same if you were to use the functions in future problems.
The basic setup to carry out multiple comparisons is made using the following command:
# Be aware that the "group" variable is one that is defined by you or the data
comparisons <- glht(anovamodel, linfct=mcp(group="Tukey"))
The above command ran some calculations and stored it in the “comparisons” object.
If you forget to run the above command, the rest of the work below will not work !!!!!!
Also, be sure to replace group with the name of the predictor variable for your dataset!
In order to do the following analysis, some other quantities are needed, including the number of comparisons to be made, k, which is usually different from the total number of groups, I. If we wish to carry out every pairwise test (or obtain every pairwise confidence interval), the total number of comparisons to be made is k = I(I − 1)/2.
k <- length(n_obs) * (length(n_obs)-1 )/2
print(k)
[1] 3
We need the degrees of freedom of the residuals (error) row, this is given by N − I = n· − I from the ANOVA
table. We can get the ANOVA table and the degrees of freedom using the following:
anova.table <- anova(anovamodel)
df.within <- anova.table[2,"Df"]
print(df.within)
[1] 27
We also need the Mean Square Error (the within-groups mean square), which comes from the same row of the ANOVA table:
MSE <- anova.table[2,"Mean Sq"]
print(MSE)
[1] 0.3885959
Now let’s look at conducting pairwise tests and confidence intervals using the different methods that we discussed
in class.
5.2.1 Fisher’s Least Significant Difference
These are simple pairwise comparisons that use the t-distribution that you are used to.
You should have had practice doing this by hand; the only difference now is:
The degrees of freedom are based upon the degrees of freedom within the groups, that is, the degrees of freedom found in the Error/Residuals row of the ANOVA table.
The FisherLSD pairwise hypothesis tests are carried out using the following command:
summary(comparisons,test=adjusted("none")) # adjusted("none") means we don't adjust alpha
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
trt1 - ctrl == 0 -0.3710 0.2788 -1.331 0.19439
trt2 - ctrl == 0 0.4940 0.2788 1.772 0.08768 .
trt2 - trt1 == 0 0.8650 0.2788 3.103 0.00446 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- none method)
The corresponding (unadjusted) Fisher LSD confidence intervals are obtained with:
confint(comparisons, calpha = univariate_calpha())
Quantile = 2.0518
95% confidence level
Linear Hypotheses:
Estimate lwr upr
trt1 - ctrl == 0 -0.37100 -0.94301 0.20101
trt2 - ctrl == 0 0.49400 -0.07801 1.06601
trt2 - trt1 == 0 0.86500 0.29299 1.43701
We can plot these confidence intervals using the following long function
qplot(lhs, estimate, data = confint(comparisons, calpha = univariate_calpha()),
main="FisherLSD 95% Confidence Intervals", geom = "pointrange", ymin = lwr, ymax = upr,
xlab="Comparisons", ylab="Estimates and Confidence Intervals") +
coord_flip() + geom_hline(yintercept = 0)
Carrying out the hypothesis tests and making the confidence intervals for Fisher's LSD by hand is not difficult.
The hypothesis statements for the pairwise tests are given by the following:
H0 : µi = µj
HA : µi ≠ µj
for any two populations i ≠ j from the larger set of I population groups in total.
Since we assume that the variances of the populations are equal, 100(1 − α)% confidence intervals for µi − µj, where i ≠ j, are given by
Ȳi − Ȳj ± t(1−α/2, df) · √( MSE (1/ni + 1/nj) )
where the MSE and df (the degrees of freedom) are obtained from the Residuals/Error row of the ANOVA table. The corresponding test statistic is
ts = ( Ȳi − Ȳj ) / √( MSE (1/ni + 1/nj) )
This can be a cumbersome chore by hand when the total number of groups and pairwise comparisons to be made
is large.
We already know that the sample means are stored in ybar, and the number of observations is stored in n_obs.
First, run the next couple of lines that strip the names. It’s clear that the first sample mean (the control group)
is given by ybar[1], the treatment 1 group sample mean is ybar[2], etc.
comp.names <- names(ybar)
ybar <- as.numeric(ybar)
n_obs <- as.numeric(n_obs)
print(comp.names)
[1] "ctrl" "trt1" "trt2"
print(ybar)
[1] 5.032 4.661 5.526
print(n_obs)
[1] 10 10 10
The difference between group 2 (treatment 1) and group 1 (the control) is given by
ybar[2] - ybar[1]
[1] -0.371
This matches the Estimate of the differences of the means given in the tables provided in the previous sections. The standard error of the difference is
sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) )
[1] 0.2787816
which matches what we see in the output from summary(comparisons,test=adjusted("none")) in the previous
section. Note that since we have a “balanced” experimental design (meaning all sample sizes are equal) then the
standard error is the same for all pairwise comparisons.
The test statistic and P-value for pairwise comparison test between the control group and treatment 1 is given
by:
ts <- (ybar[2] - ybar[1])/(sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) ))
print(ts)
[1] -1.330791
p.val <- 2*pt( abs(ts), df=df.within, lower.tail=FALSE) # two-sided P-value
print(p.val)
[1] 0.1943879
A 95% confidence interval for µ2 − µ1 (the treatment 1 population mean - the control population mean) is given
by
(ybar[2]-ybar[1]) + c(-1,1)*qt(1-0.05/2,df.within)*sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) )
# or
lower <- (ybar[2]-ybar[1]) - qt(1-0.05/2,df.within)*sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) )
print(lower)
[1] -0.9430126
upper <- (ybar[2]-ybar[1]) + qt(1-0.05/2,df.within)*sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) )
print(upper)
[1] 0.2010126
Which agrees with the corresponding line seen from the confint() output in the previous section.
I simply showed you how to carry out 1 of the hypothesis tests and obtain the corresponding confidence interval. You will need to do the other 2 for this dataset. In general, there might be more than 3 groups, so you may need to carry out MANY tests.
For the Bonferroni correction, each individual comparison is carried out at the corrected level α/k, where k is the number of pairwise comparisons we make. If we wish to make all of the pairwise comparisons possible, then k = I(I − 1)/2 where I is the total number of groups.
The Bonferroni tests can be done in a similar way as before using functions from the multcomp library. For
example, the hypothesis tests can be done using the following (note the comparisons object was computed in the
FisherLSD section). The only real change is that we explicitly tell R that we are using the Bonferroni method by
specifying type="bonferroni" .
summary(comparisons, test=adjusted(type="bonferroni"))
Fit: aov(formula = weight ~ group)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
trt1 - ctrl == 0 -0.3710 0.2788 -1.331 0.5832
trt2 - ctrl == 0 0.4940 0.2788 1.772 0.2630
trt2 - trt1 == 0 0.8650 0.2788 3.103 0.0134 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- bonferroni method)
Quantile = 2.5525
95% confidence level
Linear Hypotheses:
Estimate lwr upr
trt1 - ctrl == 0 -0.3710 -1.0826 0.3406
trt2 - ctrl == 0 0.4940 -0.2176 1.2056
trt2 - trt1 == 0 0.8650 0.1534 1.5766
The corresponding plot, titled "Bonferroni 95% Confidence Intervals", shows the estimate and confidence interval for each of the three pairwise comparisons, just as in the Fisher LSD plot above.
If we wanted to generate confidence intervals for µ2 − µ1 by hand, then we need to do the following.
We know that the difference between group 2 (treatment 1) and group 1 (the control) is given by
(ybar[2] - ybar[1])
[1] -0.371
sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) ) # the standard error, exactly as before
[1] 0.2787816
The only major change is the critical value that we use as a multiplier in the margin of error computation.
# Instead of the critical value used in Fisher's LSD, where
qt(1-0.05/2,df.within)*sqrt(MSE)
[1] 1.279059
# we now divide alpha by k when finding the critical value
qt(1-0.05/2/k,df.within)*sqrt(MSE)
[1] 1.591138
Now the 95% Bonferroni confidence interval for µ2 − µ1 based upon the sample we obtained earlier is
(ybar[2]-ybar[1]) + c(-1,1)*qt(1-0.05/2/k,df.within)*sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) )
In order to conduct a hypothesis by hand using the Bonferroni method, we note that we can do the following:
ts = (ybar[2]-ybar[1]) / ( sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) ) )
print(ts)
[1] -1.330791
p.val.bonf <- 2*pt( abs(ts), df=df.within, lower.tail=FALSE)
print(p.val.bonf)
[1] 0.1943879
It is not surprising that this gives the same P-value as before; what changes is the significance level that we compare it with.
First note that we have I = 3 groups, hence k = I(I − 1)/2 = 3 . The total number of tests that we will conduct
will be 3, this is simply the first one. We want to control the overall Type I error rate that is associated with
multiple tests. So we no longer make a simple comparisons with α = 0.05.
Formerly, we made a comparison at the α level, but since we are conducting k multiple tests, we conduct each
test at the α/k level. So if αfw = 0.05 is the desired familywise error rate, we need to compare each P-value with the per-comparison level αfw/k = 0.05/3 = 0.01666667.
Note, however, that the output of the previous section had different P-values than the ones we just obtained. In particular, that function returned something called "adjusted P-values". As a reminder, here is the output we are referencing:
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
trt1 - ctrl == 0 -0.3710 0.2788 -1.331 0.5832
trt2 - ctrl == 0 0.4940 0.2788 1.772 0.2630
trt2 - trt1 == 0 0.8650 0.2788 3.103 0.0134 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- bonferroni method)
Adjusted P-values are used in order to make decisions about the tests easier. Instead of conducting every test at the 0.01666667 level, each test is conducted at the 0.05 level, but what has changed is that the P-values that we obtained earlier have now been multiplied by k = 3.
Since our original P-value = 0.1943879, the adjusted P-value is 3 ∗ 0.1943879 = 0.5831637 which matches the
output above. We then see that the adjusted P-value = 0.5832 > 0.05 hence we fail to reject H0 .
When we first created the comparisons object using the glht() function, we used this option called "Tukey".
This did not automatically create Tukey intervals, but instead it told R that we wish to carry out all “pairwise”
comparisons. Since the most common method to do this is using the Tukey-Kramer method, just stating the
option that we will use Tukey already set this up for future computations.
Since we already specified that we intend to use pairwise comparisons using Tukey’s HSD, the default output (no
additional options as before) using the summary() function will provide us with Tukey’s HSD hypothesis tests
and confidence intervals for pairwise contrasts.
summary(comparisons)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
trt1 - ctrl == 0 -0.3710 0.2788 -1.331 0.3909
trt2 - ctrl == 0 0.4940 0.2788 1.772 0.1979
trt2 - trt1 == 0 0.8650 0.2788 3.103 0.0122 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
confint(comparisons)
Simultaneous Confidence Intervals
Quantile = 2.4795
95% family-wise confidence level
Linear Hypotheses:
Estimate lwr upr
trt1 - ctrl == 0 -0.3710 -1.0622 0.3202
trt2 - ctrl == 0 0.4940 -0.1972 1.1852
trt2 - trt1 == 0 0.8650 0.1738 1.5562
Finally, plots of the Tukey HSD confidence intervals are given by the function
qplot(lhs, estimate, data = confint(comparisons),
main="TukeyHSD 95% Confidence Intervals",
xlab="Comparisons",
ylab = "Estimates and Confidence Intervals",
geom = "pointrange", ymin = lwr, ymax = upr) +
coord_flip()+ geom_hline(yintercept = 0)
Since TukeyHSD is often the most preferred method for post-hoc analysis, hypothesis tests and confidence intervals
are given by a simple dedicated function
TukeyHSD(anovamodel)
$group
diff lwr upr p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
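As an alternative to the ggplot2-based plot above (this is not in the original), the object returned by TukeyHSD() has its own base-R plot method:
# Not in the original: base R plot of the Tukey HSD family-wise confidence intervals.
plot(TukeyHSD(anovamodel), las=1)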
In general, Tukey-Kramer 100(1 − α)% confidence intervals for µi − µj (i ≠ j) are given by
Ȳi − Ȳj ± ( q(α, I, N−I) / √2 ) · √( MSE (1/ni + 1/nj) )
In R, we obtain critical values from Tukey’s studentized range distribution using the qtukey() function.
A 95% confidence interval for µ2 − µ1 using based upon the sample is given by
(ybar[2] - ybar[1]) + c(-1,1) * qtukey(1-0.05, nmeans = 3, df=df.within)/sqrt(2) *
sqrt(MSE)*sqrt( (1/n_obs[2]) + (1/n_obs[1]) )
where Q(I, N−I) is a random variable from Tukey's Studentized Range Distribution with I groups and N − I degrees of freedom for the error. Additionally, our test statistic is
qs = √2 · |Ȳi − Ȳj| / √( MSE (1/ni + 1/nj) ) = √2 · |ts|
The corresponding Tukey adjusted P-value for the trt1 - ctrl comparison can be computed with the ptukey() function:
ptukey( sqrt(2)*abs(ts), nmeans=3, df=df.within, lower.tail=FALSE)
[1] 0.3908711
We’re done with the PlantGrowth dataset, don’t forget to detach it!
detach(PlantGrowth)
Chapter 6
In this part of the tutorial you will learn how to use R to:
1. Fit simple linear regression models
2. Compute correlations
6.1 Linear Models in R
A statistical model that describes a relationship between the ith observation of a response variable Y and a predictor variable X is given by
Yi = β0 + β1 Xi + εi
where εi is the random error.
In general, regression analysis is used to describe the relationship between a single response variable Y and one or more predictor variables (X1, X2, . . ., Xp). When p = 1, we use "Simple Linear Regression" and if p > 1 we use Multiple Regression.
We will focus on only the elementary case for simple linear regression.
Hopefully your next statistics course will cover the more general and advanced cases.
Assume that we want to measure the length of a particular spring when masses of various weights have been attached.
We select masses of various weights, starting at zero and going up to 3.8 kg.
We then measure the length of the spring for the various weights.
Question: Are the weights of the masses and the length of the spring linearly related?
Let’s explore this by first plotting the data. The data is given below:
# Weights between 0 and 3.8 Kg increments of 0.2
weight <- seq(0,3.8, by=0.2)
# Note: You should avoid using length as a variable name since it is the name of a command,
# so the measured lengths are stored in the vector spring.length.
n_obs <- length(weight) # the number of observations in the dataset
plot(weight, spring.length, pch=16, main="Spring Length versus Mass")
The sample correlation coefficient between weight and spring length can be computed with the cor() function:
cor(weight, spring.length)
[1] 0.9743193
# You should avoid using r as a variable, same with length, as it is a command inside R
my.r <- cor(weight,spring.length)
print(my.r)
[1] 0.9743193
The linear model can be specified simply using the lm() function.
spring.modelfit <- lm( spring.length ~ weight)
Just like with the aov() command, the lm(), already carried out many calculations including computation of the
least-squares coefficients.
To find out the equation of the least-squares line, we just need to look at the coefficients
spring.modelfit$coefficients
(Intercept) weight
4.9997143 0.2046241
The slope can also be computed by hand from the correlation and the sample standard deviations, b1 = r · (sy/sx):
my_b1 <- my.r * sd(spring.length)/sd(weight)
print(my_b1)
[1] 0.2046241
b0 is found using: b0 = ȳ − b1 · x̄
b1 <- spring.modelfit$coefficients["weight"] # the slope extracted from the fitted model (a named value)
my_b0 <- mean(spring.length) - b1*mean(weight)
print(my_b0)
weight
4.999714
6.4 Inference on the Regression Coefficient β1
The following R syntax can be used to display a wealth of information. By the time you read this tutorial, hopefully you will know what most of it means; of course, we will go over it in class.
summary(spring.modelfit)
Call:
lm(formula = spring.length ~ weight)
Residuals:
Min 1Q Median 3Q Max
-0.09619 -0.03406 -0.00535 0.03761 0.12011
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.99971 0.02477 201.81 < 2e-16 ***
weight 0.20462 0.01115 18.36 4.2e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Among other things, this summary displays the estimates (b0 , b1 ), their standard errors, the value of their test
statistics, and their P-values.
Let’s look at how we can get these quantities the long way
The residual sum of squares can be found by taking the residuals, squaring them, and then taking the sum.
sum(spring.modelfit$residuals^2)
[1] 0.05948624
The residual standard deviation is se = √( SSresid / (n − 2) ).
my_resid_std_error <- sqrt(sum(spring.modelfit$residuals^2)/(n_obs-2))
print(my_resid_std_error)
[1] 0.05748731
# R stores this same value in the model summary as 'sigma'
s_e <- summary(spring.modelfit)$sigma
print(s_e)
[1] 0.05748731
The standard error of b1, SEb1, can be calculated explicitly using: se / ( sx · √(n − 1) ).
# since x = weight
std_error_b1 <- s_e/(sd(weight)*sqrt(n_obs-1))
print(std_error_b1)
[1] 0.01114631
As it turns out, R has already computed this for us, and is given in the summary(lm()) output. We can extract
this from the output using:
# Change "weight" to the variable you need
SE_b1 <- summary(spring.modelfit)$coefficients["weight",2]
print(SE_b1)
[1] 0.01114631
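As a side note (not in the original), R will also produce confidence intervals for the regression coefficients directly from the fitted model:
# Not in the original: 95% confidence intervals for the intercept and the slope.
confint(spring.modelfit, level=0.95)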
# The degrees of freedom for inference on the slope is n - 2
df.resid <- n_obs - 2
print(df.resid)
[1] 18
The standard hypothesis test on the slope asks whether there is a linear relationship between the predictor and the response. Using symbols,
H0 : β1 = 0
HA : β1 6= 0
# The test statistic is the estimated slope divided by its standard error
ts <- spring.modelfit$coefficients["weight"]/SE_b1
print(ts)
weight
18.35801
p.val.b1 <- 2*pt(ts,df.resid,lower.tail=FALSE)
print(p.val.b1)
weight
4.203758e-13
We can look at the output of summary(lm()) for comparison, or just at this small portion of it:
summary(spring.modelfit)$coefficients["weight",3:4]
t value Pr(>|t|)
1.835801e+01 4.203758e-13
For reference, here is the full summary output once more.
summary(spring.modelfit)
Call:
lm(formula = spring.length ~ weight)
Residuals:
Min 1Q Median 3Q Max
-0.09619 -0.03406 -0.00535 0.03761 0.12011
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.99971 0.02477 201.81 < 2e-16 ***
weight 0.20462 0.01115 18.36 4.2e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient of determination, R², is simply the square of the correlation coefficient:
my.r^2
[1] 0.9492981
This matches the Multiple R-squared value reported in the summary output.
anova(spring.modelfit)
Response: spring.length
Df Sum Sq Mean Sq F value Pr(>F)
weight 1 1.11377 1.1138 337.02 4.204e-13 ***
Residuals 18 0.05949 0.0033
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, summary(spring.modelfit) provides us with the F-statistic and the corresponding P-value.
The main idea is that ANOVA and regression can be used to test for the same thing:
H0 : β1 = 0
While this test seems silly for simple linear regression, it is more useful in multiple linear regression, which we are not covering.
In that case,
H0 : β1 = β2 = · · · = βp = 0
IMPORTANT NOTE:
While the above numerical analysis will work on your data sets in general, the visual analysis requires that you
pre-sort your data with respect to the x-variable. Otherwise your plots are likely to look strange.
We can also add the fitted values (or the predicted values) and plot them
plot(weight,spring.length,pch=16, main="Spring Length versus Mass",
xlab="Mass Weight", ylab="Spring Length")
Note: I will demonstrate one way to use the default plotting tools in R to plot confidence and prediction bands.
While this is not difficult, these plots tend to look better using ggplot2.
We will have discussed confidence and prediction bands a little in class. We will not discuss how to compute them the long way in R; instead we will just see how to plot them efficiently. The confidence and prediction bands are almost always shown on the same plot.
Look at the chunk of code below. To obtain confidence and prediction bands, we use the predict() function.
# Confidence Bands
spring.confint <- predict(spring.modelfit, interval="confidence")
# Prediction Bands
spring.predint <- predict(spring.modelfit, interval="prediction")
plot(weight, spring.length, pch=16,
main="Spring Length versus Mass \n with Confidence & Prediction Intervals",
xlab="Mass Weight", ylab="Spring Length")
6.4.4 Basic Residual Analysis for Checking Assumptions
Residual analysis is used to determine whether the assumptions of our linear model and the method of simple
linear regression are valid.
As a first step, we assumed a simple linear model described the relationship between our response and predictor
variables.
A simple scatter plot is a good first step to tell whether or not it is appropriate to use simple linear regression.
However, sometimes variables that are related in a non-linear manner appear to be linear in certain regions.
Another way that we can detect for curvature is to use residual plots
A scatter plot of the residuals, ri = (yi − ybi ), versus the fitted (predicted) data, ybi can show curvilinearity if it
exists.
spring.residuals <- resid(spring.modelfit)
spring.fitted.values <- fitted(spring.modelfit)
plot(spring.fitted.values,spring.residuals,pch=16,
main="Residuals vs Predicted Plot", xlab="Predicted",ylab="Residuals")
abline(h=0) # A horizontal line makes it easier to see
If there is a trend, such as a quadratic relationship, this means that our linear model assumption is false.
This might mean that we may need to transform our data or that we may need to use multiple regression instead
of simple linear regression.
Finally, we should also look at a normality plot of the residuals to assess whether the normality assumption has been satisfied.
qqnorm(spring.residuals, main="Normal Q-Q Plot for Residuals for \n the Spring Length Data", pch=20)
qqline(spring.residuals, col="red")