R Module 7 - Data Classes
Andrew Jaffe
January 6, 2016
Data Classes:
class(c("Andrew", "Jaffe"))
## [1] "character"
class(c(1, 4, 7))
## [1] "numeric"
Integer
## [1] 1 2 3 4 5
class(x)
## [1] "integer"
Integer
1:5
## [1] 1 2 3 4 5
Logical
logical is a class that only has two possible elements: TRUE and
FALSE
## [1] "logical"
sum() and mean() work on logical vectors - they return the total
and proportion of TRUE elements, respectively.
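As a small illustration of the point above, here is a sketch using made-up vectors (the names z, ages and over_30 are hypothetical, not from the course data):

```r
# Hypothetical logical vector, for illustration only
z = c(TRUE, FALSE, TRUE, TRUE)
class(z)

# TRUE counts as 1 and FALSE as 0
n_true = sum(z)       # number of TRUE elements
prop_true = mean(z)   # proportion of TRUE elements
n_true
## [1] 3
prop_true
## [1] 0.75

# Comparisons return logical vectors, so they can be counted the same way
ages = c(12, 25, 31, 47)   # hypothetical data
over_30 = ages > 30
sum(over_30)
mean(over_30)
```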
Logical
## [1] "character"
Factor
Factors are special character vectors where the elements belong to
pre-defined groups or ‘levels’. You can think of these as qualitative
or categorical variables:
class(x)
## [1] "factor"
Factors are used to represent categorical data, and can also be used
for ordinal data (i.e., the categories have an intrinsic ordering).
Note that before R 4.0.0, R read in character strings as factors by
default in functions like read.table(); since R 4.0.0 the default is
stringsAsFactors = FALSE.
“The function factor is used to encode a vector as a factor (the
terms ‘category’ and ‘enumerated type’ are also used for factors). If
argument ordered is TRUE, the factor levels are assumed to be
ordered.”
cc = factor(c("case","case","case",
"control","control","control"))
cc
levels(cc) = c("control","case")
cc
factor(c("case","case","case","control",
"control","control"),
levels =c("control","case") )
factor(c("case","case","case","control",
"control","control"),
levels =c("control","case"), ordered=TRUE)
x = factor(c("case","case","case","control",
"control","control"),
levels =c("control","case") )
as.character(x)
as.numeric(x)
## [1] 2 2 2 1 1 1
Factors
However, you need to be careful when modifying the labels of existing
factors, as it's quite easy to alter the meaning of the underlying data.
xCopy = x
levels(xCopy) = c("case", "control") # wrong way
xCopy
as.numeric(xCopy)
## [1] 2 2 2 1 1 1
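One way to change the level order without changing the meaning of the data is to rebuild the factor with factor(..., levels = ...). The sketch below contrasts that with assigning to levels(); the name xSafe is hypothetical, and this is one possible approach rather than the only one.

```r
x = factor(c("case","case","case","control","control","control"),
           levels = c("control","case"))

# Rebuild the factor with a new level order:
# each observation keeps its label, only the integer codes change
xSafe = factor(x, levels = c("case", "control"))
as.character(xSafe)
as.numeric(xSafe)
## [1] 1 1 1 2 2 2

# Assigning to levels() keeps the codes and swaps the labels,
# so every "case" silently becomes "control" and vice versa
xCopy = x
levels(xCopy) = c("case", "control")
as.character(xCopy)
```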
Creating categorical variables
The rep() [“repeat”] function is useful for creating new variables
bg = rep(c("boy","girl"),each=50)
head(bg)
bg2 = rep(c("boy","girl"),times=50)
head(bg2)
length(bg)==length(bg2)
## [1] TRUE
Creating categorical variables
One frequently-used tool is creating categorical variables out of
continuous variables, like generating quantiles of a specific
continuously measured variable.
A general function for creating new variables based on existing
variables is the ifelse() function, which “returns a value with the
same shape as test which is filled with elements selected from either
yes or no depending on whether the element of test is TRUE or
FALSE.”
circ = read.csv("http://www.aejaffe.com/winterR_2016/data/C
header=TRUE,as.is=TRUE)
Creating categorical variables
For example, we can create a new variable that records whether
daily ridership on the Circulator was above 10,000.
table(hi_rider)
## hi_rider
## low high
## 740 282
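The code that produced this table is not shown here. A minimal sketch of the idea, using a made-up vector in place of the Circulator data (the values in daily are hypothetical), might look like this. Converting to a factor with explicit levels is an assumption made so that "low" prints before "high", as in the table above.

```r
# Hypothetical daily ridership values, for illustration only
daily = c(5200, 12050, 9800, NA, 15300, 10000)

# ifelse(test, yes, no) is evaluated element by element;
# an NA in the test gives an NA in the result
hi_rider = ifelse(daily > 10000, "high", "low")

# Explicit levels control the order in which table() prints the groups
hi_rider = factor(hi_rider, levels = c("low", "high"))
table(hi_rider)
```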
Creating categorical variables
You can also nest ifelse() within itself to create 3 levels of a
variable.
table(riderLevels)
## riderLevels
## low med high
## 740 280 2
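A sketch of nesting ifelse() is shown below. The data and the two cutoffs (10,000 and 20,000) are assumptions chosen for illustration; the cutoffs used to produce the table above are not shown here.

```r
# Hypothetical daily ridership values, for illustration only
daily = c(5200, 12050, 9800, 21000, 15300, NA)

# The inner ifelse() is only consulted where the outer test is FALSE
riderLevels = ifelse(daily < 10000, "low",
                     ifelse(daily > 20000, "high", "med"))
riderLevels = factor(riderLevels, levels = c("low", "med", "high"))
table(riderLevels)
```

With more than three groups, cut() is usually easier to read than deeply nested ifelse() calls.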
Creating categorical variables
table(cx)
## cx
## (0,10] (10,25] (25,50] (50,100]
## 10 15 25 50
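The table above is consistent with binning the integers 1 to 100 using cut(). The sketch below assumes that input, since the original call is not shown here. The second part (qx, a hypothetical name) shows how quantile() can supply the breaks.

```r
x = 1:100   # assumed input; reproduces the counts shown above

# Intervals are open on the left and closed on the right by default
cx = cut(x, breaks = c(0, 10, 25, 50, 100))
table(cx)

# Quartile groups: use quantile() for the breaks.
# include.lowest = TRUE keeps the minimum value in the first bin
qx = cut(x, breaks = quantile(x), include.lowest = TRUE)
table(qx)
```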
head(sort(circ$date))
range(circ$newDate)
range(circ$newDate2)
theTime = Sys.time()
theTime
class(theTime)
theTime + 5000
n = 1:9
n
## [1] 1 2 3 4 5 6 7 8 9
## [1] 1
## [1] 1 4 7
## [1] 1 2 3
Data Selection
class(mat[1, ])
## [1] "integer"
class(mat[, 1])
## [1] "integer"
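The outputs above are consistent with a 3 x 3 matrix filled column by column from n. The sketch below assumes that construction, since the code that created mat is not shown here.

```r
n = 1:9
mat = matrix(n, nrow = 3)   # assumed construction; filled column-wise

mat[1, 1]   # single element
## [1] 1
mat[1, ]    # first row
## [1] 1 4 7
mat[, 1]    # first column
## [1] 1 2 3

# Selecting a single row or column drops to a plain vector
class(mat[1, ])
## [1] "integer"

# drop = FALSE keeps the result as a 1 x 3 matrix
mat[1, , drop = FALSE]
```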
Data Frames
You will likely use the data.frame class for a lot of data cleaning and
analysis. However, some operations that rely on matrix
multiplication (like performing many linear regressions) are (much)
faster with matrices. Also, as we will touch on later, some functions
for iterating over data return the matrix class, or their results are
placed in empty matrices that can then be converted to data.frames.
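A minimal sketch of moving between the two classes, using small made-up objects (df, m, res and resDf are hypothetical names):

```r
# Hypothetical all-numeric data.frame
df = data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Convert to a matrix for matrix algebra
m = as.matrix(df)
xtx = t(m) %*% m   # matrix multiplication

# Fill an empty matrix inside a loop, then convert it to a data.frame
res = matrix(NA, nrow = 3, ncol = 2,
             dimnames = list(NULL, c("i", "square")))
for (i in 1:3) {
  res[i, ] = c(i, i^2)
}
resDf = as.data.frame(res)
resDf
```

Note that as.matrix() on a data.frame with mixed column types returns a character matrix, so it is usually best to select the numeric columns first.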
Data Frames versus Matrices
library(matrixStats,quietly = TRUE)
library(dplyr) # provides select() and ends_with()
avgs = select(circ, ends_with("Average"))
rowMins(as.matrix(avgs),na.rm=TRUE)[500:510]
- N-dimensional classes:
  - Arrays: any extension of matrices with more than 2 dimensions,
    e.g. a 3x3x3 cube
  - Lists: a more flexible container for R objects.
Arrays
Arrays are just matrices with more dimensions. You should be
aware of them because some functions return arrays, for example
cross-tabulating over more than 2 variables, or calling tapply()
with several grouping variables.
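As an illustration, the sketch below uses the built-in mtcars dataset (not part of the course data) to show both cases returning a 3-dimensional object.

```r
# Mean mpg for every combination of three grouping variables
res = tapply(mtcars$mpg,
             list(cyl = mtcars$cyl, gear = mtcars$gear, am = mtcars$am),
             mean)
dim(res)        # one dimension per grouping variable
is.array(res)

# Cross-tabulating three variables also gives three dimensions
tab = table(mtcars$cyl, mtcars$gear, mtcars$am)
dim(tab)

# Combinations with no observations are NA in the tapply() result
res[, , 1]
```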
Arrays
Selecting from arrays is similar to matrices, just with additional
commas for the additional slots.
ar = array(1:27, c(3,3,3))
ar[,,1]
ar[,1,]
- Lists are the most generic data type.
- Can be created using list()
- Can hold vectors, strings, matrices, models, lists of other lists,
  lists upon lists!
- Can reference data using $ (if the elements are named), or
  using [], or [[]]
$letters
[1] "A" "b" "c"
$numbers
[1] 1 2 3
[[3]]
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
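The printed list above is consistent with a call like the following. This is a reconstruction from the output, so the exact original code is an assumption; the name mylist is taken from the later slides.

```r
# Two named elements and one unnamed 5 x 5 matrix
mylist = list(letters = c("A", "b", "c"),
              numbers = 1:3,
              matrix(1:25, ncol = 5))
mylist

# The unnamed element has an empty name, so it prints as [[3]]
names(mylist)
length(mylist)
```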
List referencing
$letters
[1] "A" "b" "c"
$letters
[1] "A" "b" "c"
List referencing
You can also select multiple lists with the single brackets.
$letters
[1] "A" "b" "c"
$numbers
[1] 1 2 3
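The difference between single and double brackets can be sketched as follows. The list is rebuilt here so the example runs on its own; its construction is reconstructed from the printed output and is an assumption.

```r
mylist = list(letters = c("A", "b", "c"),
              numbers = 1:3,
              matrix(1:25, ncol = 5))

# Single brackets return a (shorter) list
class(mylist[1])
## [1] "list"
class(mylist["letters"])
## [1] "list"

# Double brackets and $ return the element itself
class(mylist[[1]])
## [1] "character"
mylist[["letters"]]
mylist$letters

# Single brackets can select several elements at once
sub = mylist[1:2]
names(sub)
```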
List referencing
You can also select down several levels of a list at once
> mylist$letters[1]
[1] "A"
> mylist[[2]][1]
[1] 1
> mylist[[3]][1:2,1:2]
     [,1] [,2]
[1,]    1    6
[2,]    2    7
Splitting Data Frames
> # head(dayList)
> lapply(dayList, head, n=2)
$Friday
day date orangeBoardings orangeAlightings orang
5 Friday 01/15/2010 1645 1643
12 Friday 01/22/2010 1401 1388
purpleBoardings purpleAlightings purpleAverage greenBoar
5 NA NA NA
12 NA NA NA
greenAlightings greenAverage bannerBoardings bannerAligh
5 NA NA NA
12 NA NA NA
bannerAverage daily newDate newDate2
5 NA 1644.0 2010-01-15 2010-01-15
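The call that created dayList is not shown here. A list like this is typically produced with split(), which breaks a data.frame into a list of data.frames, one per group. The sketch below uses a small made-up data.frame (rides and its values are hypothetical) rather than the Circulator data.

```r
# Hypothetical data, for illustration only
rides = data.frame(
  day   = c("Monday", "Tuesday", "Monday", "Friday", "Tuesday", "Monday"),
  daily = c(952, 796, 1211, 1645, 1298, 1460)
)

# One data.frame per value of day; list names are the sorted group labels
dayList = split(rides, rides$day)
names(dayList)
## [1] "Friday"  "Monday"  "Tuesday"

# Apply a function to every element of the list
lapply(dayList, dim)
sapply(dayList, function(d) mean(d$daily))
```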
> # head(dayList)
> lapply(dayList, dim)
$Friday
[1] 164 17
$Monday
[1] 164 17
$Saturday
[1] 163 17
$Sunday
[1] 163 17
$Thursday
[1] 164 17
$Tuesday
General Class Information