STAT630 RSlide
Instructor
Samuel Iddi (PhD)
Learning Objectives:
Provide an introduction to R environment
Create R objects; list, factor and data frame
Subset and index objects
Import and export data
Create new columns, rename, sort and subset dataset
Merge two or more datasets and aggregate data
Perform simulations and bootstrapping
Use R to fit linear models, interpret results and draw
conclusions.
Textbook:
1 Maindonald, J. and Braun, J. (2003). Data Analysis and Graphics Using
R. New York: Cambridge University Press.
Reference:
1 Chambers (2008). Software for Data Analysis. Springer.
2 Venables W.N. and Ripley B.D (1997). Modern Applied Statistics with
S-PLUS. 2nd Ed. New York: Springer.
Introduction
Introduction to R
Learning a new language does not come easy, and the learning curve of R
is steep.
History of R
Overview of R
Advantages
Run program for any purpose.
Study how the program works and adapt it to your own needs (free access to
source code).
Redistribute copies to anyone.
Improve program and share with public.
Drawbacks
Based on an old technology.
Functionality is based on consumer demand and user contribution. If
your favorite method is not implemented, you have to do the job yourself.
Not ideal for all possible situations (a drawback of all software
packages).
Example: install.packages('ggplot2')
Within R
Access help file: ?rnorm.
sqrt(4^2+2/3)*pi^3
## [1] 126.5826
## [1] 40
## [1] 4.25
## [1] 4.25
## [1] 2.718282
## [1] 5
## [1] 2.061553
msg<-"hello"
print(msg)
## [1] "hello"
## [1] 1 2 3 4 5 6
Objects
Object Attributes
Object Attributes in R
length(x)#length
## [1] 5
z<-c("Male","Female")#character vector
mode(x)
## [1] "numeric"
class(z)
## [1] "character"
Coercion
Examples: Coercion
## [1] "character"
## [1] "numeric"
## [1] "character"
x<-0:5
class(x)
## [1] "integer"
x<-as.numeric(x)
class(x)
## [1] "numeric"
Examples: Coercion
as.logical(x)
##Nonsensical Coercion
y<-c("NPP","NDC","CPP")
as.numeric(y)
## [1] NA NA NA
as.logical(y)
## [1] NA NA NA
as.complex(y)
## [1] NA NA NA
Dr. S. Iddi (UG) R Training February 10, 2020 21 / 130
Operators and Special Values
Special Values
Special values:
◦ Logical values: TRUE/FALSE or T/F
◦ Missing values: NA (not available), NaN (not a number)
◦ Inf is a special number that represents infinity, e.g. 1/0.
◦ NaN represents the value of an undefined mathematical operation,
e.g. 0/0 (it is also treated as a missing value).
is.na() is used to test whether objects are NA.
is.nan() is used to test for NaN.
NA values also have a class, so there are integer NA and character NA.
A NaN value is also NA, but the converse is not true.
If an operation cannot be accomplished, NA is returned.
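The behaviour of these special values can be checked directly at the console. The sketch below uses only base R; the vector name v is chosen purely for illustration and does not come from the slides.

```r
# Illustrative vector (the name v is an assumption, not from the slides)
v <- c(1, NA, 0/0, 1/0)

is.na(v)      # TRUE for both NA and NaN
is.nan(v)     # TRUE only for NaN
is.finite(v)  # FALSE for NA, NaN and Inf

# NA carries a type: compare a logical, an integer and a character NA
class(NA)
class(NA_integer_)
class(NA_character_)

# Summaries return NA unless missing values are removed explicitly
mean(c(1, NA, 3))
mean(c(1, NA, 3), na.rm = TRUE)
```

Note that is.na() flags the NaN entry as well, which is what "a NaN value is also NA" means in practice.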
Examples in R
## [1] "logical"
## [1] -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
z<-y<=0
z
## [1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE
(p<-rnorm(10,0,1))
## [1] FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE
Examples in R
0/0
## [1] NaN
1/0
## [1] Inf
is.na(r)
is.nan(r)
R Examples on Numbers
(x<-20/sqrt(2))
## [1] 14.14214
floor(x)
## [1] 14
ceiling(x)
## [1] 15
trunc(x)
## [1] 14
round(x,digits=2)
## [1] 14.14
round(x,4)
Numbers and Vectors
Vectors
R Examples on Vectors
class(x)
## [1] "numeric"
(y<-integer(5))
## [1] 0 0 0 0 0
class(y)
## [1] "integer"
(z<-character(4))
R Examples on Vectors
class(z)
## [1] "character"
(a=rep(3,9))
## [1] 3 3 3 3 3 3 3 3 3
(b=seq(3,8,0.5))
## [1] 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
(c=1:5)
## [1] 1 2 3 4 5
(v=c(x,c))
Examples
x<-c(1,1,7,3,3,4,3,2,3,2,1,4,1,1,6,4)
sort(x)
## [1] 1 1 1 1 1 2 2 3 3 3 3 4 4 4 6 7
order(x)
## [1] 1 2 11 13 14 8 10 4 5 7 9 6 12 16 15 3
rank(x)
## [1] 3.0 3.0 16.0 9.5 9.5 13.0 9.5 6.5 9.5 6.5 3.0 13.0 3.0 3.0
## [15] 15.0 13.0
Examples
unique(x)
## [1] 1 7 3 4 2 6
length(x)
## [1] 16
diff(x,lag=2)
## [1] 6 2 -4 1 0 -2 0 0 -2 2 0 -3 5 3
Vector arithmetics
x*y
x-y
2*x+y-5*sqrt(x)
y/x
x/z
Character Vectors
## [1] NA
mean(x)
## [1] NA
(i=is.na(x))#identify NA's
## [1] 9
Creating Matrices
Matrices in R
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
dim(x)
## [1] 2 2
## [,1] [,2]
## [1,] 2 -1
## [2,] 4 0
## [3,] 5 -4
Matrices in R
rbind(x,y)
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## [3,] 2 -1
## [4,] 4 0
## [5,] 5 -4
(z<-c(2, 4, 5, -1,0,-4))
## [1] 2 4 5 -1 0 -4
## [,1] [,2]
## [1,] 2 -1
## [2,] 4 0
## [3,] 5 -4
Matrices in R
(p <- 1:3)
## [1] 1 2 3
(q <- 10:12)
## [1] 10 11 12
cbind(p, q)
## p q
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12
## [,1] [,2]
## [1,] "a" "A"
## [2,] "b" "B"
## [3,] "c" "C"
## [4,] "d" "D"
Matrices
Matrix operations
(x<-c(10,2,5,7,125,3,0,1,1))
## [1] 10 2 5 7 125 3 0 1 1
(y<-matrix(x,3,3))
det(y)#calculate determinant of y
## [1] 1241
sum(diag(y))#return trace of y
## [1] 136
t(y)#transpose of y
solve(y)#inverse of y
## eigen() decomposition
## $values
## [1] 125.148195 9.844519 1.007286
##
## $vectors
## [,1] [,2] [,3]
## [1,] -0.06065781 0.87316411 -0.006357337
## [2,] -0.99780533 -0.01939437 0.008167102
## [3,] -0.02655459 0.48704034 -0.999946440
eigen(y)$values
eigen(y)$vectors
5+y
Subsetting a matrix
## [,1]
## [1,] 5
## [1] 5 -1
## [,1]
## [1,] 5
## [2,] -1
Arrays
A matrix is a two-dimensional array, but arrays with more dimensions can be
defined as well.
Arrays have k dimensions.
Each element of an array is accessed with k indices, x[i1,...,ik].
E.g. an array of three 2 × 3 matrices is defined by dim=c(2,3,3).
Arrays
(x<-array(c(1:18),dim=c(2,3,3)))
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 13 15 17
## [2,] 14 16 18
Arrays
## [1] 7 8
dim(x)
## [1] 2 3 3
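As a minimal sketch of the k-index rule for arrays, the code below rebuilds the 2 × 3 × 3 array and extracts a single element, a whole matrix and a single column. The particular indices are chosen for illustration only.

```r
# Same array as above: three 2 x 3 matrices filled column by column
x <- array(1:18, dim = c(2, 3, 3))

x[2, 3, 1]   # one element: row 2, column 3 of the first matrix
x[, , 2]     # the whole second matrix (2 x 3)
x[, 1, 2]    # first column of the second matrix

# An empty index keeps that whole dimension; drop = FALSE keeps the
# result as an array instead of simplifying it to a vector
dim(x[, 1, 2, drop = FALSE])
```

Leaving an index empty selects everything along that dimension, so x[, 1, 2] returns the two values 7 and 8.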
Examples on lists
## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 1+4i
Examples on lists
## [[1]]
## [1] 1
##
## [[2]]
## [1] "Male" "Female"
##
## [[3]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## $b
## [1] "Male" "Female"
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
Examples on lists
z[["d"]] #alternatively
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## $b
## [1] "Male" "Female"
##
## $d
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
(ol<-eigen(z$d))
## eigen() decomposition
## $values
## [1] 5.3722813 -0.3722813
##
## $vectors
## [,1] [,2]
## [1,] -0.5657675 -0.9093767
## [2,] -0.8245648 0.4159736
summary(ol)
Factors
Example: Factors
(gender<-c(rep("F",3),rep("M",5)))
levels(gender)
## NULL
(gender<-factor(gender))
## [1] F F F M M M M M
## Levels: F M
levels(gender)
Example: Factors
as.numeric(gender)
## [1] 1 1 1 2 2 2 2 2
x<-data.frame(sex=c(rep("M",3),rep("F",4)),
height=ceiling(rnorm(7,10,2)),
stringsAsFactors=TRUE)#since R 4.0.0 strings are not factors by default
levels(x$sex)
Data frame
Can be constructed by reading saved data from a text file with
read.table() or read.csv().
Columns of a data frame can be referenced by index or name.
Use the $ sign to extract a column of the data frame as a vector.
names() can be used to see the column names of a data frame.
Alternatively, attach the data frame (with attach()) and refer to columns
by name alone.
Detach the data frame with detach().
Data frames can be converted into a matrix with data.matrix() or
as.matrix().
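The points above can be sketched with a small data frame. The object names dat, m1 and m2 are illustrative assumptions; the values mirror the exam scores used in these slides.

```r
# Small data frame of exam scores (object name dat is illustrative)
dat <- data.frame(exams = c("maths", "english", "science"),
                  scores = c(50, 45, 90))

names(dat)        # column names
dat$scores        # column by name with $
dat[["scores"]]   # column by name with [[ ]]
dat[, 2]          # column by index

# data.matrix() converts every column to numeric (character columns go
# through factor codes); as.matrix() gives a character matrix when the
# columns are of mixed type
m1 <- data.matrix(dat)
m2 <- as.matrix(dat)
mode(m1)
mode(m2)
```

The two conversions differ in what they preserve: data.matrix() keeps the result numeric at the cost of replacing labels by codes, while as.matrix() keeps the labels and turns the numbers into strings.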
#From Vectors#
scores<-c(50,45,90)
exams<-c("maths","english","science")
(dat1<-data.frame(exams,scores))
## exams scores
## 1 maths 50
## 2 english 45
## 3 science 90
#From Matrix#
(dat2<-data.frame(matrix(1 : 24, nrow = 4, ncol = 6)))
## X1 X2 X3 X4 X5 X6
## 1 1 5 9 13 17 21
## 2 2 6 10 14 18 22
## 3 3 7 11 15 19 23
## 4 4 8 12 16 20 24
## V1 V2 V3 V4 V5 V6
## 1 1 5 9 13 17 21
## 2 2 6 10 14 18 22
## 3 3 7 11 15 19 23
## 4 4 8 12 16 20 24
## exams scores
## [1,] 2 50
## [2,] 1 45
## [3,] 3 90
##Subset Dataframe##
dat1$exams #subset by name
## [1] 50 45 90
## [1] 50 45 90
Import data
The read.table function is one of the most commonly used functions for
reading data. It has a few important arguments:
file, the name of a file, or a connection.
header, logical indicating if the file has a header line.
sep, a string indicating how the columns are separated.
colClasses, a character vector indicating the class of each column in
the dataset.
nrows, the number of rows in the dataset.
comment.char, a character string indicating the comment character.
skip, the number of lines to skip from the beginning.
stringsAsFactors, should character variables be coded as factors?
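To see several of these arguments working together, the sketch below writes a tiny file to a temporary location and reads it back. The file contents, the column names and the object names tmp and dat are hypothetical and serve only as an illustration.

```r
# Write a tiny example file to a temporary location (hypothetical data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Exam results, first term",
             "name score passed",
             "Ama 71.5 TRUE",
             "Kofi 48.0 FALSE"), tmp)

dat <- read.table(tmp,
                  skip = 1,        # skip the title line
                  header = TRUE,   # next line holds the column names
                  sep = "",        # default: columns separated by white space
                  colClasses = c("character", "numeric", "logical"),
                  nrows = 2,       # number of data rows to read
                  stringsAsFactors = FALSE)
str(dat)
unlink(tmp)  # remove the temporary file
```

Supplying colClasses and nrows is optional, but it documents what the file is expected to contain and can speed up reading of large files.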
Export data
Activity
#mydata<- read.table("C:/Datasets/mydata.txt",header=TRUE)
mydata<-head(airquality)
head(mydata,n=3) #print first 3 rows of mydata
#install.packages('xlsx')
#install.packages('openxlsx')
library(openxlsx)
write.xlsx(expdat,"./Datasets/expdat.xlsx" )
Activity
Recoding variable
#install.packages('reshape')
library(reshape)
HosDat<-rename(HosDat, c(costs="Costs",totcosts="TotalCosts"))
HosDat
## hospital patients Costs TotalCosts costs.cat costs.cat2
## 1 Kolebu 150 3.1 465 Expensive Expensive
## 2 37 Military 350 2.5 875 Cheap Normal
## 3 Police 200 2.9 580 Expensive Expensive
## 4 Legon 500 2.0 1000 Cheap Cheap
Activity
Assume that we have registered the height and weight for seven people:
Heights in cm are 180, 165, 160, 193, 163, 145, 200; weights in kg are 87, 58,
65, 100, 150, 100, 75. Make two vectors, height and weight, with the data. The
body mass index (BMI) is defined as
$$\mathrm{BMI} = \frac{\text{weight in kg}}{(\text{height in m})^2}$$
Create a data frame. Make a column with the BMI values for the seven people,
and a column with the natural logarithm of the BMI values.
##Sort descending
irisdata<-irisdata[order(-irisdata$Sepal.Length),]
head(irisdata)
Activity
id h1 h2 h3 w1 w2 w3 sex
1 1 11 101 5 25 35 male
2 2 12 102 6 26 36 male
3 3 13 103 7 27 37 male
4 4 14 104 8 28 38 female
5 5 15 105 9 29 39 female
The following code can be used to import the data.
wide<-read.table("./Datasets/wide.txt", header=T)
subset() function
Activity
For the BMI data above make a vector with the weights for those people who
have a BMI larger than 25.
Two datasets can be merged by one or more common keys with the
command merge(data.frameA, data.frameB,
by=c("common key variable")).
By default, only the rows in which the left table has matching keys in the
right table are returned; this is called an inner join.
Other types of merge
◦ An outer join of data.frameA and data.frameB:
Returns all rows from both tables, joining records from the left that have matching keys in the
right table.
◦ A left outer join (or simply left join) of data.frameA and
data.frameB:
Returns all rows from the left table, and any rows with matching keys from the right table.
◦ A right outer join of data.frameA and data.frameB:
Returns all rows from the right table, and any rows with matching keys from the left table.
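The four join types can be compared on two small tables. The data frames df1 and df2 below are an assumption: they are constructed so that the joins give the same rows as the printed output in these slides, but the original definitions are not shown.

```r
# Two small tables (assumed values, chosen to match the printed joins)
df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), "Ohio"))

inner <- merge(df1, df2, by = "CustomerId")                # keys in both
outer <- merge(df1, df2, by = "CustomerId", all = TRUE)    # keys in either
left  <- merge(df1, df2, by = "CustomerId", all.x = TRUE)  # all rows of df1
right <- merge(df1, df2, by = "CustomerId", all.y = TRUE)  # all rows of df2

# Row counts show which keys each join keeps
c(inner = nrow(inner), outer = nrow(outer),
  left = nrow(left), right = nrow(right))
```

Because every key in df2 also appears in df1, the right join here returns the same rows as the inner join, and the left join the same rows as the outer join.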
#Inner join
merge(df1, df2,by="CustomerId")
## CustomerId Product State
## 1 2 Toaster Alabama
## 2 4 Radio Alabama
## 3 6 Radio Ohio
#Outer join
merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
## CustomerId Product State
## 1 1 Toaster <NA>
## 2 2 Toaster Alabama
## 3 3 Toaster <NA>
## 4 4 Radio Alabama
## 5 5 Radio <NA>
## 6 6 Radio Ohio
Merging
#Left outer
merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
#Right outer
merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
## CustomerId Product State
## 1 2 Toaster Alabama
## 2 4 Radio Alabama
## 3 6 Radio Ohio
#Cross join
#merge(x = df1, y = df2, by = NULL)
Data Manipulation Merging and Appending
Activity
Aggregating
Aggregating
attach(mtcars)
# ?mtcars
(agdata <-aggregate(mtcars, by=list(cyl,vs),FUN=mean, na.rm=TRUE))
Reshaping a dataset from wide to long and from long to wide is easily done
with the reshape() function, or with melt() and cast() from the reshape
package (library(reshape)).
str(reshape)
head(long)
#Long to Wide#
long<-long[order(long$id),]
head(long)
#Long to Wide#
w<-reshape(long, timevar = "time", idvar = c("id", "sex"),
direction = "wide")
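The reshape() call above converts long data back to wide. As a sketch of the full round trip, the code below types in the five-row wide table from the earlier activity instead of reading it from a file, converts it to long format and back again. The object names wide, long2 and wide2 and the long-format column names h and w are illustrative choices.

```r
# Wide data typed in directly (same values as the wide.txt table)
wide <- data.frame(id = 1:5,
                   h1 = 1:5, h2 = 11:15, h3 = 101:105,
                   w1 = 5:9, w2 = 25:29, w3 = 35:39,
                   sex = c("male", "male", "male", "female", "female"))

# Wide to long: h1..h3 stack into column h, w1..w3 into column w
long2 <- reshape(wide, direction = "long",
                 varying = list(c("h1", "h2", "h3"), c("w1", "w2", "w3")),
                 v.names = c("h", "w"),
                 timevar = "time", times = 1:3, idvar = "id")
head(long2[order(long2$id), ])

# Long to wide: one row per id again, columns named h.1, w.1, h.2, ...
wide2 <- reshape(long2, timevar = "time", idvar = c("id", "sex"),
                 direction = "wide")
wide2
```

Each of the 5 subjects contributes 3 rows in long format, one per time point, so long2 has 15 rows and wide2 has 5.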
Activity
Exercise
Below is a dataset in wide format. Students have been measured on five
metrics: read, write, math, science, and socst.
id female race ses schtyp prog read write math science socst
70 0 4 1 1 1 57 52 41 47 57
121 1 4 2 1 3 68 59 53 63 61
86 0 4 3 1 1 44 33 54 58 31
141 0 4 3 1 3 63 44 47 53 56
172 0 4 2 1 2 47 52 57 53 61
113 0 4 2 1 2 44 52 51 63 61
50 0 3 2 1 1 50 59 42 53 61
11 0 1 2 1 2 34 46 45 39 36
84 0 4 2 1 1 63 57 54 58 51
48 0 3 2 1 2 57 55 52 50 51
Reformat this dataset into long form, using the reshape function.
lapply() applies to a list or data frame (since a data frame has the structure
of a list of columns).
It is used to apply a function to each column of a data frame in turn.
The result of lapply() is a list.
sapply() is used to simplify the result into a vector or matrix.
◦ If the result is a list where every element is of length 1, then a vector
is returned.
◦ If the result is a list where every element is a vector of the same
length (>1), a matrix is returned.
◦ If it can’t figure things out, a list is returned.
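The three simplification cases of sapply() can be sketched as follows. The lists used here are illustrative and are not the ones from the slides.

```r
x <- list(a = 1:4, b = 5:8)   # illustrative list

v <- sapply(x, mean)    # every result has length 1: a named vector
m <- sapply(x, range)   # every result has length 2: a 2 x 2 matrix
l <- sapply(list(a = 1:2, b = 1:3), rev)  # lengths differ: stays a list

str(v)
dim(m)
class(l)
```

When the shape of the result matters for later code, lapply() is the safer choice because it always returns a list.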
lapply(x, mean)
## $a
## [1] 2.5
##
## $b
## [1] -0.6856285
##
## $c
## [1] 0.9688946
##
## $d
## [1] 5.045283
str(sapply)
sapply(x,mean)
## a b c d
## 2.5000000 -0.6856285 0.9688946 5.0452832
lapply(iris[,-5],mean,na.rm=TRUE)
## $Sepal.Length
## [1] 5.843333
##
## $Sepal.Width
## [1] 3.057333
##
## $Petal.Length
## [1] 3.758
##
## $Petal.Width
## [1] 1.199333
sapply(iris[,-5],mean,na.rm=TRUE)
## $a
## [1] 1 2
##
## $b
## [1] 1 2 3
## $a
## [1] 1 2
##
## $b
## [1] 1 2 3
apply() function
Can also be applied to a data frame, giving (with margin 2) the same results
as sapply().
Its first argument is an array or data frame; the second specifies the
margin (1 for rows, 2 for columns).
There are shortcuts to find the sums and means of matrix dimensions.
◦ rowSums(x) = apply(x, 1, sum)
◦ rowMeans(x) = apply(x, 1, mean)
◦ colSums(x) = apply(x, 2, sum)
◦ colMeans(x) = apply(x, 2, mean)
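The equivalences listed above can be verified numerically. This is a minimal check on an arbitrary random matrix; the seed and the matrix size are assumptions made only for reproducibility.

```r
set.seed(1)                        # arbitrary seed for reproducibility
x <- matrix(rnorm(12), nrow = 3)   # illustrative 3 x 4 matrix

all.equal(rowSums(x),  apply(x, 1, sum))
all.equal(rowMeans(x), apply(x, 1, mean))
all.equal(colSums(x),  apply(x, 2, sum))
all.equal(colMeans(x), apply(x, 2, mean))

# On a numeric data frame, margin 2 matches sapply()
all.equal(apply(iris[, -5], 2, mean), sapply(iris[, -5], mean))
```

The dedicated functions are usually preferred in practice because they are faster than the general apply() call.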
Special Functions
attach(iris)
x <- matrix(rnorm(200), 20, 10)
apply(x, 1, quantile, probs = c(0.25, 0.75))
apply(iris[,-5],2,mean)
tapply() function
library(MASS)
attach(cabbages)
attach(iris);
str(tapply)
## function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
tapply() function
## 1 2 3
## -0.2104579 0.5430863 1.1316653
tapply(iris$Sepal.Length,list(iris$Species), mean)
tapply(HeadWt,list(Cult, Date),mean)
Exploratory graphs
Stem-and-leaf
Textual graph that classifies data items according to their most significant
numeric digits.
Used to study the distribution of a continuous random variable.
Created using the function stem().
attach(faithful)
stem(faithful$waiting)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 4 | 3
## 4 | 55566666777788899999
## 5 | 00000111111222223333333444444444
## 5 | 555555666677788889999999
## 6 | 00000022223334444
## 6 | 555667899
## 7 | 00001111123333333444444
## 7 | 555555556666666667777777777778888888888888889999999999
## 8 | 000000001111111111111222222222222333333333333334444444444
## 8 | 55555566666677888888999
## 9 | 00000012334
Graphical Functions
Histogram or Ogives
Knowledge of a few options can help make your histogram look exactly how
you want it.
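A few of the most useful hist() options are sketched below. The BMI data behind the figures in these slides are not available here, so the built-in faithful$waiting variable is used instead; the titles, labels and colour are illustrative choices.

```r
w <- faithful$waiting   # built-in data, used here in place of the BMI data

par(mfrow = c(1, 2))
# Frequency histogram with a title, an axis label and roughly 20 bins
h <- hist(w, breaks = 20, main = "Breaks=20", xlab = "Waiting time",
          col = "grey")
# Density-scale histogram with a kernel density estimate overlaid
hist(w, freq = FALSE, main = "Distribution", xlab = "Waiting time")
lines(density(w), lwd = 2)
par(mfrow = c(1, 1))    # restore the default layout
```

The breaks argument is treated as a suggestion when a single number is given, so the number of bins actually drawn may differ slightly from 20.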
Histogram or Ogives
[Figure: histogram titled "Histogram of BMI"; x-axis BMI, y-axis Frequency]
Histogram or Ogives
[Figure: histogram titled "Breaks=20"; x-axis BMI, y-axis Frequency]
Histogram or Ogives
[Figure: "Density plot"; x-axis BMI, y-axis Density]
Histogram or Ogives
[Figure: "Distribution"; y-axis Density]
##QQPlot
qqnorm(faithful$waiting)
qqline(faithful$waiting)
[Figure: normal Q-Q plot of faithful$waiting; x-axis Theoretical Quantiles]
##Boxplot
boxplot(faithful$waiting)
[Figure: boxplot of faithful$waiting]
##Boxplot
boxplot(faithful$waiting, main="Boxplot of Time Waited",
xlab="Time waited", horizontal=TRUE)
[Figure: horizontal boxplot of faithful$waiting; x-axis Time waited]
##Histogram
hist(faithful$waiting,main="Histogram of Time Waited",
xlab="Time waited")
[Figure: histogram of faithful$waiting; x-axis Time waited]
Scatter Plot: There are many ways to create a scatter plot of two
quantitative variables.
Created with the basic plot function plot(x,y,...).
It displays the pairs of values of the vectors x and y in a Cartesian
diagram.
Used to reveal the relationship between the variables.
There are several options to change the default plot. For example,
◦ pch - change symbol of points
◦ cex - change size of points and text
◦ adj - shift title to left or right, etc.
Line plot: control the type of line connecting points with the option
type=.
Example: "p" for points, "l" for lines, "b" for both, "s" for stair
steps, "n" for no plotting.
##Scatter plots
par(mfrow=c(1,2))
plot(eruptions, waiting, # plot the variables
xlab="Eruption duration", # x-axis label
ylab="Time waited",main="Scatterplot1")
abline(lm(waiting~eruptions))
#change symbol of points and size of points.
plot(eruptions, waiting,main="Scatterplot2",adj=0,
xlab="Eruption duration", cex=2, ylab="Time waited",pch=2,col=3)
[Figure: two scatterplots of waiting against eruptions, "Scatterplot1" and "Scatterplot2"]
##Line plot
x<-rnorm(25,0,1)
x<-sort(x)
Fn<-order(x)/length(x)
#cbind(x,order(x),Fn)
par(mfrow=c(1,2))
plot(x,Fn)
plot(x,Fn,type="l")
[Figure: plots of Fn against x, drawn as points and as a line]
##Line plot
par(mfrow=c(1,2))
plot(x,Fn,type="b")
plot(x,Fn,type="s")
[Figure: plots of Fn against x with type="b" and type="s"]