(DEEMED UNIVERSITY)
Established under Section 3 of the UGC Act, 1956
Awarded Category - I by UGC
E-CONTENT
R PROGRAMMING
MBA SEM – II
DR. BAJEET KAUR
Learning Objective/Outcome(s):
This course will help students understand the basics of R programming. After completing this course, students will be able to perform data analysis using R.
Pre-learning:
Basic knowledge of statistics
Books Recommended
R for Data Science, 1st Edition, by Hadley Wickham and Garrett Grolemund. ISBN-13: 978-1491910399; ISBN-10: 1491910399
Hands-On Programming with R, by Garrett Grolemund. ISBN-13: 978-1449359010; ISBN-10: 1449359019
R Cookbook, by Paul Teetor. ISBN-13: 978-0596809157; ISBN-10: 0596809158
Curran, J.M. (2010) Introduction to Data Analysis with R for Forensic Scientists. ISBN: 978-1420088267
Murrell, P. (2005) R Graphics. ISBN: 978-1584884866
Murrell, P. Introduction to Data Technologies, www.stat.auckland.ac.nz/~paul/ItDT
Table of Contents
1 Module 1: Introduction to R programming
   1.1 Installing R
      Installing RStudio
      Steps to Install RStudio (Windows)
   1.2 Working with R
   1.4 Help feature
   1.5 R workspace
   1.6 Input and output
      INPUT
      OUTPUT
   1.7 Packages
      What are packages?
      Installing a package
      Loading a package
   1.8 Working with large datasets
2 Module 2
   2.1 Datasets and Data
      Datasets
   2.2 Data structures
      Vectors
      Matrices
      Arrays
      Data frames
      Factors
      Lists
   2.3 Data input
      Importing data from a delimited text file
      Importing data from Excel
      Importing data from CSV
      Importing data from JSON
      Importing data from XML
   2.4 Useful functions for working with data objects
   2.5 R – Programming Constructs
      Decision making
3 User defined functions in R
   3.1 Function Components
   3.2 Built-in Functions
   3.3 User-defined Functions
   3.4 Calling a Function
4 Graphical Analysis using R
   4.1 Introduction
   4.2 Bar plots
      Simple bar plots
      Stacked and grouped bar plots
      Mean bar plots
      Tweaking bar plots
      Spinograms
   4.3 Box plots
      Using parallel box plots to compare groups
   4.4 Dot plots
   4.5 Pie charts
5 Advanced R
   5.1 Correlations
      Types of correlations
      Partial correlations
      Other types of correlations
   5.2 Testing correlations for significance
   5.3 Regression
      The many faces of regression
      Scenarios for using OLS regression
      Simple linear regression
   5.4 Polynomial regression
   5.5 Fitting ANOVA models
      Two-way factorial ANOVA
   5.6 ANOVA as regression
1 Module 1: Introduction to R programming
Dear Learners, the objective of this course is to teach beginners the basics of R programming and to enable them to do data analysis using R. A variety of topics important for data analytics will be covered in order to prepare students for real-life prediction and data-engineering tasks. The course will impart knowledge of data types in R, programming constructs in R, reading different file formats, user-defined functions in R, graphs and charts in R, statistical data analysis, and web scraping using R. It also gives an idea of how data is managed in various environments, with emphasis on predictive measures applied to datasets. Statistical programming in R is a multi-part course designed to get you up to speed with the most important and powerful methodologies in statistics. It is designed to prepare you to do data analysis in R, from simple computations to machine learning. The course has been written from scratch and assumes only that you are comfortable with basic math, algebra, and logical operations.
Topics to be covered:
Getting R, Managing R, Arithmetic and Matrix Operations, Introduction to Functions, Control Structures. Working with Objects and Data: Introduction to Objects, Manipulating Objects, Constructing Data Objects, Types of Data Items, Structure of Data Items, Reading and Getting Data, Manipulating Data, Storing Data.
1.1 Installing R
To install R to work on your own computer, you can download it freely from the
Comprehensive R Archive Network (CRAN). Note that CRAN makes several versions of R available: versions for multiple operating systems, and releases older than the current one. Read the CRAN instructions to ensure you download the correct version.
If you need further help, you can try the following resources:
Installing R on Windows
Installing R on Mac
Installing R on Ubuntu
Installing RStudio
RStudio is an integrated development environment (IDE). We highly recommend
installing and using RStudio to edit and test your code. You can install RStudio through
the RStudio website. Their cheatsheet is a great resource. You must install R before
installing RStudio.
Selected steps from the illustrated Windows installation walkthrough (the intermediate steps are screenshots of the installer dialogs):
Step 3: Click on "Download R 4.1.1 for Windows" (86 megabytes, 32/64 bit).
Step 4: Save the file.
Step 18: Click on R.exe and run your commands. Done.
1.2 Working with R
o For example, the statement x <- rnorm(5) creates a vector object named x which contains five random deviates from a standard normal distribution.
Note: R allows the = sign to be used for object assignment. However, you will rarely find programs written that way, because it is not standard syntax.
Comments are preceded by the # symbol. Any text appearing after the # is ignored by the R interpreter.
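A minimal sketch of these points at the R prompt:

x <- rnorm(5)   # assign five standard normal deviates to the object x
x               # print the vector
# any text after the # symbol is ignored by the interpreter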
Suppose you are studying the physical growth of infants in the first year of life. You have the ages and weights of ten infants, shown in the table below. You would be interested in the weight distribution and its relationship to age.

Age (months)   Weight (kg)
01             4.4
03             5.3
05             7.2
02             5.2
11             8.5
09             7.3
03             6.0
09             10.4
12             10.2
03             6.1
Enter the age and weight data as vectors, using the function c(), which combines its arguments into a vector or list.
You can then apply the following built-in functions to the data:
o mean and standard deviation of the weights
o correlation between age and weight
o plot of the relationship between age and weight, so that you can inspect any trend visually.
A sample R Session
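A minimal sketch of such a session, using the data from the table above:

age <- c(1, 3, 5, 2, 11, 9, 3, 9, 12, 3)
weight <- c(4.4, 5.3, 7.2, 5.2, 8.5, 7.3, 6.0, 10.4, 10.2, 6.1)
mean(weight)        # mean weight
sd(weight)          # standard deviation of the weights
cor(age, weight)    # correlation between age and weight
plot(age, weight)   # scatter plot of weight against age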
1.4 Help feature
You can use the Help window in the RStudio environment to access R documentation. The function help.start() opens a browser window with access to introductory and advanced manuals, FAQs, and reference materials. R provides extensive help facilities, and learning to navigate them will definitely help your programming efforts.
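A sketch of common entry points into the help system:

help.start()    # browser-based manuals, FAQs, and reference material
help(mean)      # help page for the mean() function
?mean           # shorthand for help(mean)
example(mean)   # run the examples from the help page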
1.5 R workspace
The workspace is your current R working environment, and it includes any user-defined objects (vectors, matrices, functions, data frames, or lists).
At the end of an R session, you can save the current workspace so that it is automatically reloaded the next time R starts.
You can use the up and down arrow keys to scroll through your command history. This allows you to select a previous command, edit it if desired, and resubmit it by pressing Enter.
The current working directory is the directory R will read files from and save results
to by default.
o getwd(): returns the current working directory.
o setwd(): sets the current working directory. If you need to input a file that isn't in the current working directory, use the full pathname in the call. Always enclose the names of files and directories from the operating system in quotation marks. Some standard commands for managing your workspace are sketched below.
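A minimal sketch of common workspace commands (the directory path and file names are hypothetical examples):

getwd()                      # print the current working directory
setwd("C:/myprojects")       # change the working directory
ls()                         # list the objects in the workspace
rm(x)                        # remove the object x
history()                    # display recent commands
save.image("myfile.RData")   # save the workspace to a file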
1.6 Input and output
By default, launching R starts an interactive session with input from the keyboard and output to the screen. You can also process commands from a script file (a file containing R statements) and direct output to a variety of destinations.
INPUT:
The source("filename") function submits a script to the current session. If the filename doesn't include a path, the file is assumed to be in the current working directory. For example, source("myscript.R") runs the set of R statements contained in the file myscript.R. By convention, script file names end with a .R extension, but this isn't required.
OUTPUT
The sink("filename") function redirects output to the file filename. If the file already exists by
default, its contents are overwritten. Include the option append=TRUE to append text to the file
rather than overwriting it. Including the option split=TRUE will send output to both the screen
and the output file. Issuing the command sink () without options will return output to the screen
alone. GRAPHIC OUTPUT Although sink () redirects text output, as it has no effect on graphic
output. To redirect graphic output, use one of the functions listed in Table below. Use dev.off()
to return output to the terminal.
Let's put it together with an example. Assume that you have three script files containing R code (script1.R, script2.R, and script3.R). Issuing the statement source("script1.R") will submit the R code from script1.R to the current session, and the results will appear on the screen. If you then issue the statements

sink("myoutput", append=TRUE, split=TRUE)
pdf("mygraphs.pdf")
source("script2.R")

the R code from script2.R will be submitted, and the results will again appear on the screen. In addition, the text output will be appended to the file myoutput, and the graphic output will be saved to the file mygraphs.pdf.
Finally, if you issue the statements

sink()
dev.off()
source("script3.R")

the R code from script3.R will be submitted, and the results will appear on the screen. This time, no text or graphic output is saved to files. The sequence is outlined in the figure below. R gives you quite a bit of flexibility and control over where input comes from and where output goes.
Input with source() function and output with sink() function.
1.7 Packages
R comes with an extensive capability right out of the box. But some of its most exciting features
are available as optional modules that you can download and install. There are more than 2,500 user-contributed modules called packages that you can download from http://cran.r-project.org/web/packages. They provide a large range of new capabilities, from the analysis of
geostatistical data to protein mass spectra processing to the analysis of psychological tests! We
will use many of these optional packages in later chapters.
Packages are a collection of R functions, data, and compiled code in a well-defined format. The
directory where packages are stored on your computer is known as the library. The function .libPaths() shows you where your library is located, and the function library() shows you what packages you've saved in your library.
R comes with a standard set of packages (including base, datasets, utils, grDevices, graphics,
stats, and methods). They give access to a wide range of functions and datasets that are available
by default. Other packages are also available for download and installation. After installing,
they have to be loaded into the session in order to be used. The command search() tells you which packages are loaded and ready to use.
Installing a package
To install a package for the first time, use the install.packages() command. For example, install.packages() without options brings up a list of CRAN mirror sites. Once you have selected a mirror, you will be presented with a list of all available packages. Selecting one downloads and installs it. If you know which package you want to install, you can do it directly by providing its name as an argument. For example, the gclus package contains functions for creating enhanced scatter plots. You can download and install it with the command install.packages("gclus"). You only need to install a package once. But like any software, packages are often updated by their authors. Use the update.packages() command to update any packages you have installed.
To see details about your packages, use the installed.packages() command. It lists the packages you have, along with their version numbers, dependencies, and other information.
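A minimal sketch of the package-management workflow described above:

install.packages("gclus")   # install the gclus package from CRAN
update.packages()           # update any installed packages
installed.packages()        # list installed packages with versions and dependencies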
Loading a package
Installing a package downloads it from a CRAN mirror and saves it in your library. To use it in an R session, you need to load the package with the library() function. For example, to use the gclus package, issue library(gclus). Of course, you must install a package before you can load it, but you only need to load it once within a given session. If you wish, you can customize your startup environment to automatically load the packages you use most often.
1.8 Working with large datasets
Analysts often ask whether R can handle big data problems, for example when working with large amounts of data collected from web, weather, or genetic research. Because R holds objects in memory, you are often limited by the amount of RAM available. For example, on my five-year-old Windows PC with 2 GB of RAM, I was able to easily handle data matrices with 10 million elements (100 variables by 100,000 observations). On an iMac with 4 GB of RAM, I was able to handle 100 million elements without difficulty. But there are two issues to consider: the size of the dataset and the statistical methods to be applied. R can handle data analysis problems in the gigabyte to terabyte range, but special procedures are required.
2 Module 2
Topics to be covered:
Data Types in R
Different vector operations
Programming constructs in R
Arrays
Lists
The first step of any data analysis is the creation of a dataset containing the information to be studied, in a format that meets your needs. In R, this task involves the following:
• Selecting a data structure to hold your data
• Entering or importing your data into that structure
The data sources may include text files, spreadsheets, statistical packages, and database management systems. For example, the data I work with usually comes from SQL databases. You will probably use only one or two of the methods described in this section, so feel free to choose the ones that suit your situation. Once the dataset is created, you will annotate it, adding descriptive labels for variables and codes. Let's start with the basics.
Datasets
A dataset is usually a rectangular array of data, with rows representing observations and columns representing variables.
2.2 Data structures
Vectors
• A vector is an ordered collection of basic data types of a given length.
• The only key thing here is all the elements of a vector must be of the identical data type e.g
homogenous data structures.
• Vectors are one-dimensional data structures. There are five atomic classes of Vectors.
Vectors are basically one-dimensional arrays which can hold numeric data, character data, or
logical data. The combine function c() is used to form the vector. Here are examples of each type
of vector:
a <- c(1, 2, 5, 3, 6, -2, 4)
b <- c("one", "two", "three")
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
Here, a is a numeric vector, b is a character vector, and c is a logical vector. Note that the data in a vector must be of only one type or mode (numeric, character, or logical); you can't mix modes in the same vector.
NOTE: Scalars are one-element vectors, for example f <- 3, g <- "US", and h <- TRUE. They're used to hold constants.
You can refer to elements of a vector using a numeric vector of positions within brackets. For example, a[c(2, 4)] refers to the 2nd and 4th elements of vector a. Here are additional examples:
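A brief sketch of vector indexing, using the vector a defined above:

a <- c(1, 2, 5, 3, 6, -2, 4)
a[3]            # the third element: 5
a[c(1, 3, 5)]   # the first, third, and fifth elements
a[2:6]          # elements two through six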
Matrices
• Matrices are the R objects in which the elements are arranged in a two-dimensional rectangular
layout.
• A Matrix is created using the matrix() function.
• Example: matrix(data, nrow, ncol, byrow, dimnames) where,
• data is the input vector which becomes the data elements of the matrix.
• nrow is the number of rows to be created.
• ncol is the number of columns to be created.
• byrow is a logical clue. If TRUE then the input vector elements are arranged by row.
• dimname is the names assigned to the rows and columns.
A matrix is a two-dimensional array in which each element has the same mode (numeric, character, or logical). Matrices are created with the matrix function. The general format is

mymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns,
                   byrow=logical_value, dimnames=list(char_vector_rownames, char_vector_colnames))

where vector contains the elements for the matrix, nrow and ncol specify the row and column dimensions, and dimnames contains optional row and column labels stored in character vectors. The option byrow indicates whether the matrix should be filled in by row (byrow=TRUE) or by column (byrow=FALSE); the default is by column. The following listing demonstrates the matrix function.
First, you create a 5×4 matrix. Then you create a 2×2 matrix with labels, filling the matrix by rows. Finally, you create a 2×2 matrix filled by columns. You can identify rows, columns, or elements of a matrix by using subscripts and brackets: X[i,] refers to the ith row of matrix X, X[,j] refers to the jth column, and X[i,j] refers to the ijth element. The subscripts i and j can be numeric vectors in order to select multiple rows or columns, as shown in the following listing.
First, a 2×5 matrix is created containing the numbers 1 to 10. By default, the matrix is filled by column. Then the elements in the 2nd row are selected, followed by the elements in the 2nd column. Next, the element in the 1st row and 4th column is selected. Finally, the elements in the 1st row and the 4th and 5th columns are selected. Matrices are two-dimensional and, like vectors, can contain only one data type. When there are more than two dimensions, you'll use arrays (section 2.2.3). When there are multiple modes of data, you'll use data frames (section 2.2.4).
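A sketch reconstructing the two listings just described (the cell values and labels of the 2×2 matrix are illustrative assumptions):

# A 2x2 matrix with labels, filled by rows
cells <- c(1, 26, 24, 68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
                   dimnames=list(rnames, cnames))

# Matrix subscripting
x <- matrix(1:10, nrow=2)   # a 2x5 matrix, filled by column
x[2,]           # second row
x[,2]           # second column
x[1,4]          # element in row 1, column 4
x[1, c(4,5)]    # elements in row 1, columns 4 and 5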
Arrays
Arrays are similar to matrices, but can have more than two dimensions. They're created with the array function of the following form:

myarray <- array(vector, dimensions, dimnames)

Here, vector contains the data for the array, dimensions is a numeric vector giving the maximal index for each dimension, and dimnames is an optional list of dimension labels. The following code creates a three-dimensional (2×3×4) array of numbers:

dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
Data frames
• A data frame is more general than a matrix in that different columns can contain different
modes of data (numeric, character, etc.).
• It's similar to the datasets you'd typically see in SAS, SPSS, and Stata. Data frames are the most common data structure you'll deal with in R.
Each column must have only one mode, but you can put columns of different modes together to form a data frame. Because data frames are closer to what analysts typically think of as datasets, we will use the terms columns and variables interchangeably when discussing data frames. There are several ways to identify the elements of a data frame: you can use the subscript notation you used before (for example, with matrices), or you can specify column names. Using the patientdata data frame created below, the following listing demonstrates these methods.
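A sketch of creating and indexing such a data frame (the variable values are illustrative assumptions):

patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status <- c("Poor", "Improved", "Excellent", "Poor")
patientdata <- data.frame(patientID, age, diabetes, status)

patientdata[1:2]                       # columns 1 and 2
patientdata[c("diabetes", "status")]   # columns selected by name
patientdata$age                        # the age variable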
Factors
• Variables can be described as nominal, ordinal, or continuous.
• Nominal variables are categorical, without an implied order. Diabetes (Type1, Type2) is
an example of a nominal variable. Even if Type1 is coded as a 1 and Type2 is coded as a
2 in the data, no order is implied.
• Ordinal variables imply order but not amount. Status (poor, improved, excellent) is a good
example of an ordinal variable. You know that a patient with a poor status isn’t doing as
well as a patient with an improved status, but not by how much.
• Continuous variables can take on any value within some range, and both order and amount
are implied. Age in years is a continuous variable and can take on values such as 14.5 or
22.8 and any value in between. You know that someone who is 15 is one year older than
someone who is 14.
• Categorical (nominal) and ordered categorical (ordinal) variables in R are called
factors.
• Factors are crucial in R because they determine how data will be analysed and presented
visually.
The function factor() stores the categorical values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), along with an internal vector of character strings (the original values) mapped to these integers. For example, suppose you have the vector
diabetes <- c("Type1", "Type2", "Type1", "Type1")
The statement diabetes <- factor(diabetes) stores this vector as (1, 2, 1, 1) and associates it with 1=Type1 and 2=Type2 internally (the assignment is alphabetical). Any analysis performed on the vector diabetes will treat the variable as nominal and select the statistical methods appropriate for this measurement scale. For vectors representing ordinal variables, add ordered=TRUE to the factor() function. Given the vector
the statement status <- factor(status, ordered=TRUE) will encode the vector as (3, 2, 1, 3) and associate these values internally as 1=Excellent, 2=Improved, and 3=Poor. Additionally, any analysis performed on this vector will treat the variable as ordinal and select the statistical methods accordingly. By default, factor levels for character vectors are created in alphabetical order. This worked for the status factor, because the order "Excellent", "Improved", "Poor" makes sense. There would be a problem if "Poor" had been coded as "Sick" instead, because the order would then be "Excellent", "Improved", "Sick". A similar problem exists whenever the desired order, here "Poor", "Improved", "Excellent", differs from the alphabetical one. For ordered factors, the alphabetical default is rarely sufficient. You can override the default by specifying the levels option. For example,
status <- factor(status, ordered=TRUE, levels=c("Poor", "Improved", "Excellent"))
would assign the levels as 1=Poor, 2=Improved, 3=Excellent. Be sure that the specified levels match your actual data values; any data value not in the list will be set to missing. The following listing demonstrates how specifying factors and ordered factors impacts data analyses.
First, you enter the data as vectors. Then you specify that diabetes is a factor and status is an ordered factor. Finally, you combine the data into a data frame. The function str(object) provides information on an object in R (the data frame, in this case). It clearly shows that diabetes is a factor and status is an ordered factor, along with how each is coded internally. Note that the summary() function treats the variables differently: it provides the minimum, maximum, mean, and quartiles for the continuous variable age, and frequency counts for the categorical variables diabetes and status.
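A sketch of the listing just described, reusing the illustrative patient data from earlier:

patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- factor(c("Type1", "Type2", "Type1", "Type1"))
status <- factor(c("Poor", "Improved", "Excellent", "Poor"), ordered=TRUE)
patientdata <- data.frame(patientID, age, diabetes, status)
str(patientdata)       # shows the structure, including the factor codings
summary(patientdata)   # summary statistics appropriate to each variable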
Lists
Lists is the most complex data types in R. A list is an ordered collection of objects
(components). List allows you to gather a variety of (possibly unrelated) objects under one name.
For example, a list may contain a combination of vectors, matrices, data frames, and even other
lists. You can create a list using the list() function
mylist <- list(object1, object2, …)
Here, the objects are any of the structures seen so far. Optionally, you can name the objects in a
list:
mylist <- list(name1=object1, name2=object2, …)
The following listing shows an example.
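A sketch of such a listing (the component values are illustrative assumptions):

g <- "My First List"
h <- c(25, 26, 18, 39)
j <- matrix(1:10, nrow=5)
k <- c("one", "two", "three")
mylist <- list(title=g, ages=h, j, k)   # a list with four components
mylist[[2]]        # the second component
mylist[["ages"]]   # the same component, selected by name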
In this example, you create a list with four components: a string, a numeric vector, a matrix, and a character vector. You can combine any number of objects and save them as a list. You can also access elements of the list by indicating a component number or a name within double brackets. In this example, mylist[[2]] and mylist[["ages"]] both refer to the same four-element numeric vector. Lists are important R structures for two reasons. First, they allow you to organize and recall disparate information in a simple way. Second, many R functions return lists; it's up to the analyst to pull out the components that are needed. You'll see numerous examples of functions that return lists in later chapters.
2.3 Data input
In the following example, you'll create a data frame named mydata with three variables: age (numeric), gender (character), and weight (numeric). You'll then invoke the text editor, add your data, and save the results.
mydata <- data.frame(age=numeric(0),
gender=character(0), weight=numeric(0))
mydata <- edit(mydata)
Assignments like age=numeric(0) create a variable of a specific mode, but without actual data.
Note that the result of the editing is assigned back to the object itself. The edit() function
operates on a copy of the object. If you don’t assign it a destination, all of your edits will be lost!
Importing data from a delimited text file
You can import data from delimited text files using read.table(), which reads a file in table format and saves it as a data frame. The general syntax is

mydataframe <- read.table(file, header=logical_value, sep="delimiter", row.names="name")

where file is a delimited ASCII file, header is a logical value indicating whether the first row contains variable names (TRUE or FALSE), sep specifies the delimiter separating data values, and row.names is an optional parameter specifying one or more variables to serve as row identifiers. For example, the statement

grades <- read.table("studentgrades.csv", header=TRUE, sep=",", row.names="STUDENTID")

reads a comma-delimited file named studentgrades.csv from the current working directory, gets the variable names from the first line of the file, specifies the variable STUDENTID as the row identifier, and saves the results as a data frame called grades.
Importing data from Excel
Excel 2007 uses the XLSX file format, which is a zipped set of XML files. The xlsx package can be used to access spreadsheets in this format. Be sure to download and install it before first use. The read.xlsx() function imports a worksheet from an XLSX file into a data frame. The simplest format is read.xlsx(file, n), where file is the path to an Excel 2007 workbook and n is the number of the worksheet to be imported. For example, on a Windows platform, the code

library(xlsx)
workbook <- "c:/myworkbook.xlsx"
mydataframe <- read.xlsx(workbook, 1)

imports the first worksheet from the workbook myworkbook.xlsx stored on the C: drive and saves it as the data frame mydataframe. The xlsx package can do more than import worksheets: it can create and manipulate Excel XLSX files as well. Programmers interested in developing an interface between R and Excel should check out this relatively new package.
Importing data from CSV
We will read data into R by loading a CSV file from the Stress-Lysis dataset. "Humidity", "Temperature", "Step count", and "Stress levels" are the column titles of the Stress-Lysis.csv file.
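A minimal sketch of loading this file, assuming Stress-Lysis.csv is in the current working directory:

stress_data <- read.csv("Stress-Lysis.csv", header=TRUE)   # read the CSV into a data frame
head(stress_data)                                          # inspect the first six rows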
Importing data from JSON
JSON files can be imported with the rjson package, whose fromJSON() function reads a JSON file into an R list:

library(rjson)
JsonData <- fromJSON(file = "drake_data.json")
print(JsonData[1])
Importing data from XML
An XML file can be read after installing the XML package and parsing the file with the xmlParse() function, which takes the XML file name (or a URL) as input and returns the contents of the file in the form of a list. A local file should be located in the current working directory. The base package named methods should also be loaded. For example, the following code parses an XML document from the web, and then the contents of a local file "sample.xml":

library("XML")
library("methods")
plant_xml_parse <- xmlParse("https://www.w3schools.com/xml/plant_catalog.xml")
result <- xmlParse(file = "sample.xml")
print(result)
2.4 Useful functions for working with data objects
R supplies a number of utility functions for examining data objects, such as length(), dim(), str(), class(), names(), head(), and tail(). We've already discussed most of these functions. The functions head() and tail() are useful for quickly scanning large datasets. For example, head(patientdata) lists the first six rows of the data frame, whereas tail(patientdata) lists the last six. We'll cover functions such as length(), cbind(), and rbind() in the next chapter. They're gathered here as a reference.
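A brief sketch applying a few of these functions to the patientdata data frame created earlier:

length(patientdata)   # number of components (columns)
dim(patientdata)      # dimensions: rows and columns
str(patientdata)      # structure of the object
names(patientdata)    # variable names
head(patientdata)     # first six rows
tail(patientdata)     # last six rows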
Summary
One of the most challenging tasks in data analysis is data preparation. We got off to a good start in this chapter by describing the various R structures that provide data storage and the many ways available to import data from both the keyboard and external sources. In particular, we will use the definitions of vectors, matrices, data frames, and lists again and again in future chapters. Your ability to specify the elements of these structures via bracket notation will be invaluable in selecting, subsetting, and transforming data.
As you can see, R provides a wealth of functions for accessing external data. This includes data from flat files, web files, statistical packages, spreadsheets, and databases. Although the focus of this chapter was on importing data into R, you can also export data from R into these external formats.
2.5 R – Programming Constructs:
Decision making
Decision making structures require the programmer to specify one or more conditions to be evaluated or
tested by the program, along with a statement or statements to be executed if the condition is determined
to be true, and optionally, other statements to be executed if the condition is determined to be false.
The following is the general form of a typical decision-making structure found in most programming languages:
if statement:
An if statement consists of a Boolean expression followed by one or more statements. The basic syntax is:

if(boolean_expression)
{ // statement(s) will execute if the boolean expression is true. }

If the Boolean expression evaluates to true, the block of code inside the if statement is executed. If it evaluates to false, the first statement after the end of the if statement (after the closing curly brace) is executed.

Example:
x <- 30L
if(is.integer(x))
{ print("X is an Integer") }

When the above code is compiled and executed, it produces the following result:
[1] "X is an Integer"

if...else statement:
An if statement can be followed by an optional else statement, which executes when the Boolean expression is false.
2.5.1.2 Syntax
The basic syntax for creating an if...else statement in R is:

if(boolean_expression)
{ // statement(s) will execute if the boolean expression is true. }
else
{ // statement(s) will execute if the boolean expression is false. }

If the Boolean expression evaluates to true, the if block of code is executed; otherwise, the else block of code is executed.

Example:
x <- c("what", "is", "truth")
if("Truth" %in% x) {
   print("Truth is found")
} else {
   print("Truth is not found")
}

When the above code is compiled and executed, it produces the following result:
[1] "Truth is not found"
Flow Diagram
2.5.1.4 R LOOPs
There may be a situation when you need to execute a block of code several times. In general, statements are executed sequentially: the first statement in a function is executed first, followed by the second, and so on. Programming languages provide various control structures that allow for more complicated execution paths. A loop statement allows us to execute a statement or group of statements multiple times.
The R programming language provides the following kinds of loops to handle looping requirements:

repeat loop: Executes a sequence of statements multiple times and abbreviates the code that manages the loop variable.
while loop: Repeats a statement or group of statements while a given condition is true. It tests the condition before executing the loop body.
for loop: Executes a sequence of statements once for each element of a vector or list.
Syntax
The basic syntax for creating a repeat loop in R is:

repeat {
   commands
   if(condition) {
      break
   }
}
The basic syntax for creating a while loop in R is:

while (test_expression) {
   statement
}
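A short sketch of both loop forms (the vector and counter values are illustrative):

v <- c("Hello", "loop")
cnt <- 2
repeat {
   print(v)
   cnt <- cnt + 1
   if(cnt > 5) {
      break   # exit once the counter passes 5
   }
}

while (cnt < 9) {   # the condition is tested before each pass
   print(v)
   cnt <- cnt + 1
}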
R's for loops are particularly flexible in that they are not limited to integers, or even numbers, in the input. We can pass character vectors, logical vectors, lists, or expressions.
Example
v <- LETTERS[1:4]
for (i in v) {
   print(i)
}
3 User defined functions in R
A function is a set of statements organized together to perform a specific task. R has a large number of built-in functions, and users can also create their own.
In R, a function is an object, so the R interpreter is able to pass control to the function, along with any arguments that may be necessary for the function to accomplish its actions.
The function in turn performs its task and returns control to the interpreter, along with any result, which may be stored in other objects.
An R function is created by using the keyword function. The basic syntax of an R function definition is as follows:

function_name <- function(arg_1, arg_2, ...) {
   Function body
}
3.1 Function Components
The different parts of a function are:
Function Name: This is the actual name of the function. It is stored in the R environment as an object with this name.
Arguments: An argument is a placeholder. When a function is invoked, you pass a value to the argument. Arguments are optional; that is, a function may contain no arguments. Arguments can also have default values.
Function Body: The function body contains a collection of statements that defines what the function does.
Return Value: The return value of a function is the last expression in the function body to be evaluated.
R has many built-in functions that can be called directly in a program without being defined first. We can also create and use our own functions, referred to as user-defined functions.
3.2 Built-in Functions
Simple examples of built-in functions are seq(), mean(), max(), sum(x), and paste(...). They are called directly from user-written programs.
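A brief sketch of calling a few built-in functions:

print(seq(32, 44))   # the sequence of numbers from 32 to 44
print(mean(25:82))   # the mean of the numbers from 25 to 82
print(sum(41:68))    # the sum of the numbers from 41 to 68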
3.3 User-defined Functions
We can create user-defined functions in R. They are specific to what a user wants, and once created they can be used like built-in functions. Below is an example of how a function is created and used.
3.4 Calling a Function

new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}
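Calling the function defined above prints the squares of 1 through its argument:

new.function(6)   # prints 1, 4, 9, 16, 25, 36, one value per line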
3.4.1.1 Calling a Function with Argument Values (by position and by name)
The arguments to a function call can be supplied in the same sequence as defined in the function or they
can be supplied in a different sequence but assigned to the names of the arguments.
# Create a function with arguments.
new.function(a = 11, b = 5, c = 3)
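This call assumes a version of new.function defined with three arguments. A minimal sketch of such a definition and of both calling styles (the body a * b + c is an illustrative assumption):

# Create a function with three arguments (hypothetical body).
new.function <- function(a, b, c) {
   result <- a * b + c
   print(result)
}

# Call the function by position of the arguments.
new.function(5, 3, 11)

# Call the function by names of the arguments.
new.function(a = 11, b = 5, c = 3)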
4 Graphical Analysis using R
Topics to be covered:
Basic Plotting
Manipulating the plotting window
Box-Whisker Plots
Scatter Plots
Pair Plots
Pie Charts
Bar Charts.
4.1 Introduction:
Whenever we analyze data, the first thing that we should do is look at it. For each variable,
what are the most common values? How much of a difference is there? Are there any
unusual observations? R provides many data visualization functions. In this chapter, we’ll
look at graphs that help you understand a single categorical or continuous variable.
In both cases, the variable could be continuous (for example, car mileage as miles per
gallon) or categorical (for example, treatment outcome as none, some, or marked). In later
chapters, we will examine graphs showing the bivariate and multivariate relationships
between variables. In the following sections, we’ll examine the use of bar plots, pie charts,
fan charts, histograms, kernel density plots, box plots, violin plots, and dot plots. Some of
these may be familiar to you, whereas others (such as fan plots or violin plots) may be
new to you. Our goal, as constantly, is to apprehend your facts higher and to communicate
this information to others.
4.2 Bar plots
A bar plot displays the distribution (frequency) of a categorical variable through vertical or horizontal bars. In its simplest form, the format of the barplot() function is

barplot(height)

where height is a vector or a matrix.
In the following examples, we’ll plot the outcome of a study investigating a new treatment
for rheumatoid arthritis. The data are contained in the Arthritis data frame distributed with
the vcd package. Because the vcd package isn’t included in the default R installation, be
sure to download and install it before first use (install.packages("vcd")). Note that the vcd package isn't needed to create bar plots. We're loading it in order to gain access to the Arthritis dataset.
Simple bar plots:
If height is a vector, the values determine the heights of the bars in the plot and a vertical
bar plot is produced. Including the option horiz=TRUE produces a horizontal bar chart
instead. You can also add annotating options. The main option adds a plot title, whereas
the xlab and ylab options add x-axis and y-axis labels, respectively.
In the Arthritis study, the variable Improved records the patient outcomes for individuals
receiving a placebo or drug.
library(vcd)
counts <- table(Arthritis$Improved)
counts

  None   Some Marked
    42     14     28
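Given these counts, vertical and horizontal versions of the plot can be drawn; a minimal sketch (titles and axis labels are illustrative):

barplot(counts, main="Simple Bar Plot", xlab="Improvement", ylab="Frequency")
barplot(counts, main="Horizontal Bar Plot", xlab="Frequency", ylab="Improvement", horiz=TRUE)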
Stacked and grouped bar plots:
If height is a matrix instead of a vector, the resulting graph will be a stacked or grouped bar plot. If beside=FALSE (the default), each column of the matrix produces a bar in the plot, with the values in the column giving the heights of stacked "sub-bars." If beside=TRUE, each column of the matrix represents a group, and the values in each column are juxtaposed rather than stacked.
Consider the cross-tabulation of treatment type and improvement status:
library(vcd)
counts <- table(Arthritis$Improved, Arthritis$Treatment)
counts
The first barplot function produces a stacked bar plot, whereas the second produces a
grouped bar plot. We’ve also added the col option to add color to the bars plotted.
The legend.text parameter provides bar labels for the legend (which are only useful
when height is a matrix).
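A sketch of the two calls just described (the colors and titles are illustrative):

barplot(counts, main="Stacked Bar Plot", xlab="Treatment", ylab="Frequency",
        col=c("red", "yellow", "green"), legend=rownames(counts))
barplot(counts, main="Grouped Bar Plot", xlab="Treatment", ylab="Frequency",
        col=c("red", "yellow", "green"), legend=rownames(counts), beside=TRUE)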
Tweaking bar plots:
In this example, we've rotated the bar labels (with las=2), changed the label text, increased the size of the y margin (with mar), and decreased the font size in order to fit the labels comfortably (using cex.names=0.8). The par() function allows you to make extensive modifications to the graphs that R produces by default. See chapter 3 for more details.
Spinograms:
Before finishing our discussion of bar plots, let’s take a look at a specialized version called
a spinogram. In a spinogram, a stacked bar plot is rescaled so that the height of each bar
is 1 and the segment heights represent proportions. Spinograms are created through
the spine() function of the vcd package. The following code produces a
simple spinogram:
library(vcd)
attach(Arthritis)
counts <- table(Treatment, Improved)
spine(counts, main="Spinogram Example")
detach(Arthritis)
4.3 Box plots:
A box-and-whiskers plot describes the distribution of a continuous variable by plotting its five-number summary: the minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum. For example:
boxplot(mtcars$mpg, main="Box plot", ylab="Miles per Gallon")
By default, each whisker extends to the most extreme data point that is no more than 1.5 times the interquartile range beyond the box. Values outside this range are depicted as dots (not shown here).
For example, in our sample of cars, the median mpg is 19.2, 50 percent of the scores fall between 15.3 and 22.8, the smallest value is 10.4, and the largest value is 33.9. How did I read this so precisely from the graph? Issuing boxplot.stats(mtcars$mpg) prints the statistics used to build the graph. There don't appear to be any outliers, and there is a mild positive skew (the upper whisker is longer than the lower whisker).
Using parallel box plots to compare groups:
Box plots can be created for individual variables or for variables by group, using the format
boxplot(formula, data=dataframe)
where formula is a formula and dataframe denotes the data frame (or list) providing the
data. An example of a formula is y ~ A, where a separate box plot for numeric variable y is
generated for each value of categorical variable A. The formula y ~ A*B would produce
a box plot of numeric variable y, for each combination of levels in categorical
variables A and B.
Adding the option varwidth=TRUE will make the box plot widths proportional to the
square root of their sample sizes. Add horizontal=TRUE to reverse the axis orientation.
In the following code, we revisit the impact of four, six, and eight cylinders on auto
mpg with parallel box plots. The plot is provided in below figure.
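A sketch of the parallel box plot call described here:

boxplot(mpg ~ cyl, data=mtcars,
        main="Car Mileage Data",
        xlab="Number of Cylinders",
        ylab="Miles Per Gallon")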
You can see in the above figure that there's a good separation of groups based on gas mileage. You can also see that the distribution of mpg for six-cylinder cars is
more symmetrical than for the other two car types. Cars with four cylinders show the
greatest spread (and positive skew) of mpg scores, when compared with six- and eight-
cylinder cars. There’s also an outlier in the eight-cylinder group.
Box plots are very versatile. By adding notch=TRUE, you get notched box plots. If two boxes' notches don't overlap, there's strong evidence that their medians differ. The following code will create notched box plots for our mpg example:
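A sketch of the notched version (the col and varwidth settings match the description that follows):

boxplot(mpg ~ cyl, data=mtcars,
        notch=TRUE, varwidth=TRUE, col="red",
        main="Car Mileage Data",
        xlab="Number of Cylinders",
        ylab="Miles Per Gallon")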
The col option fills the box plots with a red color, and varwidth=TRUE produces box
plots with widths that are proportional to their sample sizes. You can see in below figure
that the median car mileage for four-, six-, and eight- cylinder cars differ. Mileage is
significantly reduced by the number of cylinders.
Finally, you can produce box plots for more than one grouping factor. The following code provides box plots for mpg versus both the number of cylinders and the transmission type of an automobile. Again, you use the col option to fill the box plots with color. Note that colors recycle; in this case, there are six box plots and only two specified colors, so the colors repeat three times.
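A sketch of such a two-factor box plot (the factor labels and colors are illustrative):

mtcars$cyl.f <- factor(mtcars$cyl, levels=c(4, 6, 8), labels=c("4", "6", "8"))
mtcars$am.f <- factor(mtcars$am, levels=c(0, 1), labels=c("auto", "standard"))
boxplot(mpg ~ am.f * cyl.f, data=mtcars,
        varwidth=TRUE, col=c("gold", "darkgreen"),
        main="MPG Distribution by Auto Type",
        xlab="Auto Type")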
From the figure below, it's again clear that median mileage decreases with cylinder number.
For four and six- cylinder cars, mileage is higher for standard transmissions. But for eight-
cylinder cars there doesn’t appear to be a difference. You can also see from the widths of
the box plots that standard four-cylinder and automatic eight - cylinder cars are the most
common in this dataset.
4.4 Dot plots:
Dot plots provide a method of plotting a large number of labeled values on a simple horizontal scale. You create them with the dotchart() function, using the format
dotchart(x, labels=)
where x is a numeric vector and labels specifies a vector that labels each point. You can add a groups option to designate a factor specifying how the elements of x are grouped. If so, the option gcolor controls the color of the groups label and cex controls the size of the labels. Here's an example with the mtcars dataset:
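A sketch of the grouped, color-coded dot chart described next (the colors are illustrative):

x <- mtcars[order(mtcars$mpg),]   # sort by mpg, lowest to highest
x$cyl <- factor(x$cyl)            # convert cyl to a factor
x$color[x$cyl == 4] <- "red"
x$color[x$cyl == 6] <- "blue"
x$color[x$cyl == 8] <- "darkgreen"
dotchart(x$mpg,
         labels=row.names(x),     # label points with the car makes
         cex=.7,
         groups=x$cyl,            # group points by number of cylinders
         gcolor="black",
         color=x$color,
         pch=19,                  # filled circles
         main="Gas Mileage for Car Models\ngrouped by cylinder",
         xlab="Miles Per Gallon")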
In this example, the data frame mtcars is sorted by mpg (lowest to highest) and saved as
data frame x. The numeric vector cyl is transformed into a factor. A character vector
(color) is added to data frame x and contains the values "red", "blue", or "darkgreen", depending on the value of cyl. In addition, the labels for the data points are taken
from the row names of the data frame (car makes). Data points are grouped by number of
cylinders. The numbers 4, 6, and 8 are printed in black. The color of the points and labels
are derived from the color vector, and points are represented by filled circles. The code
produces the graph in below figure:
In the above figure, a number of features become evident for the first time. Again, you see an increase in gas mileage as the number of cylinders decreases. But you also see exceptions.
For example, the Pontiac Firebird, with eight cylinders, gets higher gas mileage than the
Mercury 280C and the Valiant, each with six cylinders. The Hornet 4 Drive, with six
cylinders, gets the same miles per gallon as the Volvo 142E, which has four cylinders.
It’s also clear that the Toyota Corolla gets the best gas mileage by far, whereas the Lincoln
Continental and Cadillac Fleetwood are outliers on the low end. You can gain significant
insight from a dot plot in this example because each point is labeled, the value of each
point is inherently meaningful, and the points are arranged in a manner that promotes
comparisons. But as the number of data points increase, the utility of the dot plot
decreases.
4.5 Pie charts:
Pie charts are created with the pie(x, labels=) function, where x is a non-negative numeric vector indicating the area of each slice and labels provides a character vector of slice labels. For the second pie chart in the sketch below, you convert the sample sizes to percentages and add the information to the slice labels. The second pie chart also defines the colors of the slices using the rainbow() function. Here rainbow(length(lbls2)) resolves to rainbow(5), providing five colors for the graph.
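A sketch of the first two pie charts (the slice values and country labels follow the fan-plot example later in this section):

slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels=lbls, main="Simple Pie Chart")

pct <- round(slices/sum(slices)*100)
lbls2 <- paste(lbls, " ", pct, "%", sep="")   # add percentages to the labels
pie(slices, labels=lbls2, col=rainbow(length(lbls2)),
    main="Pie Chart with Percentages")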
The third pie chart is a 3D chart created using the pie3D() function from
the plotrix package. Be sure to download and install this package before using it for the
first time. If statisticians dislike pie charts, they positively despise 3D pie charts (although
they may secretly find them pretty). This is because the 3D effect adds no additional
insight into the data and is considered distracting eye candy.
The fourth pie chart demonstrates how to create a chart from a table. In this case, you count the number of states by US region and append the information to the labels before producing the plot.
Pie charts make it difficult to compare the values of the slices (unless the values are
appended to the labels). For example, looking at the simple pie chart, can you tell how
the US compares to Germany? (If you can, you’re more perceptive than I am.) In an
attempt to improve on this situation, a variation of the pie chart, called a fan plot, has been
developed. The fan plot (Lemon & Tyagi, 2009) provides the user with a way to display
both relative quantities and differences. In R, it’s implemented through
the fan.plot() function in the plotrix package.
library(plotrix)
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
fan.plot(slices, labels=lbls, main="Fan Plot")
In a fan plot, the slices are rearranged to overlap each other and the radii have been
modified so that each slice is visible. Here you can see that Germany is the largest slice
and that the US slice is roughly 60 percent as large. France appears to be half as large as
Germany and twice as large as Australia. Remember that the width of the slice and not
the radius is what’s important here.
As you can see, it's much easier to determine the relative sizes of the slices in a fan plot than in a pie chart. Fan plots haven't caught on yet, but they're new. Now that we've covered pie and fan charts, let's move on to histograms. Unlike bar plots and pie charts, histograms describe the distribution of a continuous variable.
5 Advanced R
Topics to be covered:
Statistical models in R
Correlation and regression analysis
Analysis of Variance (ANOVA)
Creating data for complex analysis
Summarizing data, and case studies
5.1 Correlations:
Correlation coefficients are used to describe relationships among quantitative variables. The
sign ± indicates the direction of the relationship (positive or inverse) and the magnitude
indicates the strength of the relationship (ranging from 0 for no relationship to 1 for a perfectly
predictable relationship).
Types of correlations:
R can produce a variety of correlation coefficients, including Pearson, Spearman, Kendall,
partial, polychoric, and polyserial. Let’s look at each in turn.
PEARSON, SPEARMAN, AND KENDALL CORRELATIONS
The Pearson product-moment correlation assesses the degree of linear relationship between two quantitative variables. Spearman's rank-order correlation coefficient assesses the degree of relationship between two rank-ordered variables. Kendall's tau is also a nonparametric measure of rank correlation.
The cor() function produces all three correlation coefficients, whereas the cov() function
provides covariances. There are many options, but a simplified format for producing
correlations is
cor(x, use= , method= )
The default options are use="everything" and method="pearson". You can see an example in
the following listing.
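A sketch of such a listing, using the state.x77 dataset that the following discussion refers to:

states <- state.x77[, 1:6]       # population, income, illiteracy, life exp, murder, HS grad
cov(states)                      # variances and covariances
cor(states)                      # Pearson correlations (the default)
cor(states, method="spearman")   # Spearman rank-order correlations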
The first call produces the variances and covariances. The second provides Pearson product-moment correlation coefficients, whereas the third produces Spearman rank-order correlation coefficients. You can see, for example, that a strong positive correlation exists between income and high school graduation rate and that a strong negative correlation exists between illiteracy rates and life expectancy. Notice that you get square matrices by default (all variables crossed with all other variables).
PARTIAL CORRELATIONS:
A partial correlation is a correlation between two quantitative variables, controlling for one or
more other quantitative variables. You can use the pcor() function in the ggm package to provide
partial correlation coefficients. The ggm package isn’t installed by default, so be sure to install
it on first use.
The format is
pcor(u, S)
where u is a vector of numbers, with the first two numbers being the indices of the variables to be correlated and the remaining numbers the indices of the conditioning variables (that is, the variables being partialed out). S is the covariance matrix among the variables. An example will help clarify this:
library(ggm)
# partial correlation of population and murder rate, controlling
# for income, illiteracy rate, and HS graduation rate
pcor(c(1, 5, 2, 3, 6), cov(states))
[1] 0.346
In this case, 0.346 is the correlation between population and murder rate, controlling for the
influence of income, illiteracy rate, and HS graduation rate. The use of partial correlations is
common in the social sciences.
OTHER TYPES OF CORRELATIONS
The hetcor() function in the polycor package can compute a heterogeneous cor- relation matrix
containing Pearson product-moment correlations between numeric variables, polyserial
correlations between numeric and ordinal variables, polychoric correlations between ordinal
variables, and tetrachoric correlations between two di- chotomous variables. Polyserial,
polychoric, and tetrachoric correlations assume that the ordinal or dichotomous variables are
derived from underlying normal distribu- tions. See the documentation that accompanies this
package for more information.
5.2 Testing correlations for significance:
Once you've generated correlation coefficients, how do you test them for statistical significance? The typical null hypothesis is no relationship (that is, the correlation in the population is 0). You can use the cor.test() function to test an individual Pearson, Spearman, or Kendall correlation coefficient. A simplified format is
cor.test(x, y, alternative = , method = )
where x and y are the variables to be correlated, alternative specifies a two-tailed or one-tailed test ("two.sided", "less", or "greater"), and method specifies the type of correlation ("pearson", "kendall", or "spearman") to compute. Use alternative="less" when the research hypothesis is that the population correlation is less than 0. Use alternative="greater" when the research hypothesis is that the population correlation is greater than 0. By default, alternative="two.sided" (the population correlation isn't equal to 0) is assumed. See the following listing for an example.
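As a sketch, assuming the states matrix defined earlier from state.x77:
cor.test(states[, 3], states[, 5])   # Illiteracy (column 3) vs. Murder (column 5)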
This code tests the null hypothesis that the Pearson correlation between the illiteracy rate and the murder rate is 0. Assuming that the population correlation is 0, you'd expect to see a sample correlation as large as 0.703 less than 1 time out of 10 million (that is, p = 1.258e-08). Given how unlikely this is, you reject the null hypothesis in favor of the research hypothesis, that the population correlation between the illiteracy rate and the murder rate is not 0.
Unfortunately, you can test only one correlation at a time using cor.test. Luckily, the
corr.test() function provided in the psych package allows you to go further. The corr.test()
function produces correlations and significance levels for matrices of Pearson, Spearman, or
Kendall correlations. An example is given in the following listing.
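A minimal sketch, again assuming the states matrix:
library(psych)
corr.test(states, use="complete")   # correlations and p-values for the whole matrix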
The use= options can be "pairwise" or "complete" (for pairwise or listwise deletion of missing values, respectively). The method= option is "pearson" (the default), "spearman", or "kendall". Here you see that the correlation between population size and high school graduation rate (–0.10) is not significantly different from 0 (p = 0.5).
5.3 Regression:
In many ways, regression analysis lives at the heart of statistics. It’s a broad term for a set of
methodologies used to predict a response variable (also called a dependent, criterion, or
outcome variable) from one or more predictor variables (also called independent or explanatory
variables). In general, regression analysis can be used to identify the explanatory variables that
are related to a response variable, to describe the form of the relationships involved, and to
provide an equation for predicting the response variable from the explanatory variables.
The many faces of regression:
The term regression can be confusing because there are so many specialized varieties refer
below table. In addition, R has powerful and comprehensive features for fitting regression
models, and the abundance of options can be confusing as well.
Scenarios for using OLS regression:
In OLS regression, a quantitative dependent variable is predicted from a weighted sum of
predictor variables, where the weights are parameters estimated from the data. Let’s take a
look at a concrete example (no pun intended), loosely adapted from Fwa (2006). An engineer
wants to identify the most important factors related to bridge deterioration (such as age,
traffic volume, bridge design, construction materials and methods, construction quality, and
weather conditions) and determine the mathematical form of these relationships. She
collects data on each of these variables from a representative sample of bridges and models
the data using OLS regression.
The approach is highly interactive. She fits a series of models, checks their compliance with
underlying statistical assumptions, explores any unexpected or aberrant findings, and finally
chooses the “best” model from among many possible models. If successful, the results will
help her to
1. Focus on important variables, by determining which of the many collected variables are
useful in predicting bridge deterioration, along with their relative importance.
2. Look for bridges that are likely to be in trouble, by providing an equation that can be
used to predict bridge deterioration for new cases (where the values of the predictor
variables are known, but the degree of bridge deterioration isn’t).
3. Take advantage of serendipity, by identifying unusual bridges. If she finds that some
bridges deteriorate much faster or slower than predicted by the model, a study of these
“outliers” may yield important findings that could help her to understand the
mechanisms involved in bridge deterioration.
5.3.2.1 OLS regression:
For most of this chapter, we'll be predicting the response variable from a set of predictor variables (also called "regressing" the response variable on the predictor variables—hence the name) using OLS. Our goal is to select model parameters (intercept and slopes) that minimize the difference between actual response values and those predicted by the model. Specifically, model parameters are selected to minimize the sum of squared residuals.
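Concretely, for n observations, an intercept estimate, and k slope estimates, the standard OLS criterion (a textbook formulation, not specific to this text) is to choose the parameters that minimize

$$\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 \;=\; \sum_{i=1}^{n}\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{1i} - \cdots - \hat{\beta}_k X_{ki}\right)^2$$

where Ŷi is the response value predicted by the model for observation i.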
To properly interpret the coefficients of the OLS model, you must satisfy a number of statistical
assumptions:
1. Normality —For fixed values of the independent variables, the dependent variable is
normally distributed.
2. Independence —The Yi values are independent of each other.
3. Linearity —The dependent variable is linearly related to the independent variables.
4. Homoscedasticity —The variance of the dependent variable doesn’t vary with the levels
of the independent variables. We could call this constant variance, but saying homoscedasticity
makes me feel smarter.
If you violate these assumptions, your statistical significance tests and confidence intervals may
not be accurate. Note that OLS regression also assumes that the independent variables are
fixed and measured without error, but this assumption is typically relaxed in practice.
Simple linear regression:
The dataset women in the base installation provides the height and weight for a set of 15
women ages 30 to 39. We want to predict weight from height. Having an equation for
predicting weight from height can help us to identify overweight or underweight individuals.
The analysis is provided in the following listing, and the resulting graph is shown in figure.
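A sketch of the analysis (the axis labels are illustrative assumptions):
fit <- lm(weight ~ height, data=women)
summary(fit)          # regression output
women$weight          # actual values
fitted(fit)           # predicted values
residuals(fit)        # residuals
plot(women$height, women$weight,
     xlab="Height (in inches)", ylab="Weight (in pounds)")
abline(fit)           # add the fitted regression line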
From the output, you see that the prediction equation is
Weight = - 87.52 + 3.45 * Height
Because a height of 0 is impossible, you wouldn’t try to give a physical interpretation to the
intercept. It merely becomes an adjustment constant. From the Pr(>|t|) column, you see that
the regression coefficient (3.45) is significantly different from zero (p < 0.001) and indicates
that there’s an expected increase of 3.45 pounds of weight for every 1 inch increase in height.
The multiple R-squared (0.991) indicates that the model accounts for 99.1 percent of the variance in weights. The multiple R-squared is also the squared correlation between the actual and predicted values (that is, R² = r²). The residual standard error (1.53 lbs.) can be thought of as the average error in predicting weight from height using this model. The F statistic tests whether the predictor variables, taken together, predict the response variable above chance levels. Because there's only one predictor variable in simple regression, in this example the F test is equivalent to the t-test for the regression coefficient for height.
For demonstration purposes, we’ve printed out the actual, predicted, and residual values.
Evidently, the largest residuals occur for low and high heights, which can also be seen in the
plot in above figure.
5.4 Polynomial regression:
The plot in the above figure suggests that you might be able to improve your prediction using a regression with a quadratic term (that is, X²).
You can fit a quadratic equation using the statement
fit2 <- lm(weight ~ height + I(height^2), data=women)
The new term I(height^2) requires explanation. height^2 adds a height-squared term to the prediction equation. The I() function treats the contents within the parentheses as a regular R expression. You need this because the ^ operator has a special meaning in formulas that you don't want to invoke here.
Listing below shows the results of fitting the quadratic equation.
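A sketch of the corresponding listing (the plot labels are assumptions):
fit2 <- lm(weight ~ height + I(height^2), data=women)
summary(fit2)
plot(women$height, women$weight,
     xlab="Height (in inches)", ylab="Weight (in lbs)")
lines(women$height, fitted(fit2))   # overlay the fitted quadratic curve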
From this new analysis, the prediction equation is Weight = 261.88 – 7.35 × Height + 0.083 × Height², and both regression coefficients are significant at the p < 0.0001 level. The amount of variance accounted for has increased to 99.9 percent. The significance of the squared term (t = 13.89, p < .001) suggests that inclusion of the quadratic term improves the model fit. If you look at the plot of fit2 in the figure below, you can see that the curve does indeed provide a better fit.
5.5 Analysis of variance (ANOVA) and covariance (ANCOVA):
The table below provides formulas for several common research designs. In this table, lowercase letters are quantitative variables, uppercase letters are grouping factors, and Subject is a unique identifier variable for subjects.
One-way ANOVA: y ~ A
One-way ANCOVA with one covariate: y ~ x + A
Two-way factorial ANOVA: y ~ A * B
Two-way factorial ANCOVA with two covariates: y ~ x1 + x2 + A * B
Randomized block: y ~ B + A (where B is a blocking factor)
One-way within-groups ANOVA: y ~ A + Error(Subject/A)
Repeated measures ANOVA with one within-groups factor (W) and one between-groups factor (B): y ~ B * W + Error(Subject/W)
Adjusted mean birth weight by dose (dose effect):
dose:    0     5    50   500
mean: 32.4  28.9  30.6  29.3
In this case, the adjusted means are similar to the unadjusted means produced by the aggregate() function, but this won't always be the case. The effects package provides a powerful method of obtaining adjusted means for complex research designs and presenting them visually. See the package documentation on CRAN for more details.
As with the one-way ANOVA example in the last section, the F test for dose indicates that the treatments don't have the same mean birth weight, but it doesn't tell you which means differ from one another. Again you can use the multiple comparison procedures provided by the multcomp package to compute all pairwise mean comparisons. Additionally, the multcomp package can be used to test specific user-defined hypotheses about the means.
Assessing test assumptions
ANCOVA designs make the same normality and homogeneity of variance assumptions described for ANOVA designs. In addition, standard ANCOVA designs assume homogeneity of regression slopes. In this case, it's assumed that the regression slope for predicting birth
weight from gestation time is the same in each of the four treatment groups. A test for the
homogeneity of regression slopes can be obtained by including a gestation*dose interaction
term in your ANCOVA model. A significant interaction would imply that the relationship
between gestation and birth weight depends on the level of the dose variable. The code and
results are provided in the following listing.
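A sketch of that listing, again assuming the litter data from the multcomp package:
library(multcomp)
fit2 <- aov(weight ~ gesttime*dose, data=litter)
summary(fit2)   # a nonsignificant gesttime:dose term supports homogeneity of slopes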
The statement ancova(weight ~ gesttime + dose) from the HH package produces the plot shown in the following figure. Note: the figure has been modified to display better in black and white and will look slightly different when you run the code yourself.
Here you can see that the regression lines for predicting birth weight from gestation time are
parallel in each group but have different intercepts. As gestation time increases, birth weight
increases. Additionally, you can see that the 0-dose group has the largest intercept and the 5-
dose group has the lowest intercept. The lines are parallel because you’ve specified them to
be. If you’d used the statement ancova(weight ~ gesttime*dose) instead, you’d generate a plot
that allows both the slopes and intercepts to vary by group. This approach is useful for
visualizing the case where the homogeneity of regression slopes doesn’t hold.
Two-way factorial ANOVA:
In a two-way factorial ANOVA, subjects are assigned to groups that are formed from the cross-
classification of two factors. This example uses the ToothGrowth dataset in the base
installation to demonstrate a two-way between-groups ANOVA. Sixty guinea pigs are randomly
assigned to receive one of three levels of ascorbic acid (0.5, 1, or 2mg), and one of two delivery
methods (orange juice or Vitamin C), under the restriction that each treatment combination
has 10 guinea pigs. The dependent variable is tooth length. The following listing shows the code
for the analysis.
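A sketch of the analysis; converting dose to a factor is deliberate, so that it's treated as a grouping variable rather than a numeric covariate:
attach(ToothGrowth)
dose <- factor(dose)                              # treat dose as a grouping factor
table(supp, dose)                                 # confirm the balanced design
aggregate(len, by=list(supp, dose), FUN=mean)     # cell means
aggregate(len, by=list(supp, dose), FUN=sd)       # cell standard deviations
fit <- aov(len ~ supp*dose)                       # two-way ANOVA with interaction
summary(fit)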
The table statement indicates that you have a balanced design (equal sample sizes in each cell
of the design), and the aggregate statements provide the cell means and standard deviations.
The ANOVA table provided by the summary() function indicates that both main effects (supp
and dose) and the interaction between these factors are significant.
You can visualize the results in several ways. You can use the interaction.plot()
function to display the interaction in a two-way ANOVA. The code is
interaction.plot(dose, supp, len, type="b",
col=c("red","blue"), pch=c(16, 18),
main = "Interaction between Dose and Supplement Type")
and the resulting plot is presented in the figure above. The plot provides the mean tooth length for each supplement at each dosage.
With a little finesse, you can get an interaction plot out of the plotmeans() function in the gplots
package. The following code produces the graph in below figure:
library(gplots)
plotmeans(len ~ interaction(supp, dose, sep=" "),
          connect=list(c(1,3,5), c(2,4,6)),
          col=c("red", "darkgreen"),
          main="Interaction Plot with 95% CIs",
          xlab="Treatment and Dose Combination")
The graph includes the means, as well as error bars (95 percent confidence intervals) and
sample sizes.
All graphs indicate that tooth growth increases with the dose of ascorbic acid for both orange
juice and Vitamin C. For the 0.5 and 1mg doses, orange juice produced more tooth growth than
Vitamin C. For 2mg of ascorbic acid, both delivery methods produced identical growth. Of the three plotting methods provided, I prefer the interaction2wt() function in the HH package. It displays both the main effects (the box plots) and the two-way interactions for designs of any complexity (two-way ANOVA, three-way ANOVA, etc.).
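As a sketch, assuming the attached ToothGrowth variables from the earlier listing:
library(HH)
interaction2wt(len ~ supp*dose)   # main effects and two-way interactions in one display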
5.6 ANOVA as regression:
We noted that ANOVA and regression are both special cases of the same general linear model.
As such, the designs in this chapter could have been analyzed using the lm() function. However,
in order to understand the output, you need to understand how R deals with categorical
variables when fitting models.
Consider the one-way ANOVA problem discussed earlier in this unit, which compares the impact of five cholesterol-reducing drug regimens (trt).
library(multcomp)
levels(cholesterol$trt)
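To see how R handles the factor, you can fit the same model with lm(); a minimal sketch, assuming the cholesterol data from the multcomp package:
fit.lm <- lm(response ~ trt, data=cholesterol)
summary(fit.lm)
# lm() expands the five-level factor trt into four 0/1 indicator (dummy)
# variables; the first level serves as the reference group.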
If a patient is in the drugD condition, then the variable drugD equals 1, and the variables 2times, 4times, and drugE will each equal zero. You don't need a variable for the first group, because a zero on each of the four indicator variables uniquely determines that the patient is in the 1time condition.
This unit has covered the statistical methods most often used by researchers in a wide variety of fields.