R Module 7 - Data Classes

The document provides an overview of various data classes in R, including one-dimensional classes like character, numeric, integer, factor, logical, and date/POSIXct, as well as two-dimensional classes such as data frames and matrices. It explains how to create and manipulate these classes, including the use of functions like seq(), factor(), and cut() for categorical variables. Additionally, it discusses the use of arrays and lists as more flexible data structures in R.

Data Classes

Andrew Jaffe

January 6, 2016
Data Classes:

- One dimensional classes (‘vectors’):
  - Character: strings or individual characters, quoted
  - Numeric: any real number(s)
  - Integer: any integer(s)/whole numbers
  - Factor: categorical/qualitative variables
  - Logical: variables composed of TRUE or FALSE
  - Date/POSIXct: represents calendar dates and times
Character and numeric

We have already covered character and numeric classes.

class(c("Andrew", "Jaffe"))

## [1] "character"

class(c(1, 4, 7))

## [1] "numeric"
Integer

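Integer is a special subset of numeric that contains only whole numbers.
A sequence of consecutive whole numbers made with seq(from, to) is an
example of the integer class. A sequence with fractional steps, such as
seq(1, 5, by = 0.5), is numeric instead.

x = seq(from = 1, to = 5) # seq() is a function
x

## [1] 1 2 3 4 5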

class(x)

## [1] "integer"
Integer

The colon : is a shortcut for making sequences of numbers.

num1:num2 makes a sequence of consecutive integers from num1 to num2, in steps of 1.

1:5

## [1] 1 2 3 4 5
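
To make the integer/numeric distinction concrete, here is a small sketch (our own illustration, not from the original slides; the variable names are made up) comparing a few ways of building similar vectors:

```r
a = 1:5                    # colon operator: integer
b = seq(from = 1, to = 5)  # from/to only: also integer
d = seq(1, 5, by = 0.5)    # fractional step: numeric
e = c(1, 2, 3)             # typed without the L suffix: numeric
f = c(1L, 2L, 3L)          # the L suffix asks for integers

class(a); class(b); class(d); class(e); class(f)

is.integer(a)   # TRUE
is.numeric(a)   # TRUE as well: integer is a kind of numeric
```

The practical consequence is small for most analyses, since R converts integers to numeric automatically when arithmetic requires it.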
Logical

logical is a class that only has two possible elements: TRUE and
FALSE

x = c(TRUE, FALSE, TRUE, TRUE, FALSE)

class(x)

## [1] "logical"

sum() and mean() work on logical vectors - they return the total
and proportion of TRUE elements, respectively.
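
As a brief illustration of sum() and mean() on logical vectors (the ages vector is made-up data for the example), TRUE counts as 1 and FALSE as 0:

```r
x = c(TRUE, FALSE, TRUE, TRUE, FALSE)
sum(x)    # 3: the number of TRUE elements
mean(x)   # 0.6: the proportion of TRUE elements

ages = c(12, 45, 31, 8, 67)   # hypothetical values
sum(ages > 30)    # 3: how many values are above 30
mean(ages > 30)   # 0.6: what fraction are above 30
```

This is why a comparison such as ages > 30 is often wrapped directly in sum() or mean() to count or summarize matches.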
Logical

Note that logical elements are NOT in quotes.

z = c("TRUE", "FALSE", "TRUE", "FALSE")

class(z)

## [1] "character"
Factor
Factors look like character vectors, but their elements are restricted
to a set of pre-defined groups or ‘levels’ (internally, R stores them as
integer codes that index the levels). You can think of these as
qualitative or categorical variables:

x = factor(c("boy", "girl", "girl", "boy", "girl"))

## [1] boy girl girl boy girl

## Levels: boy girl

class(x)

## [1] "factor"

Note that levels are, by default, in alphabetical or alphanumerical order.
Factors

Factors are used to represent categorical data, and can also be used
for ordinal data (i.e. categories that have an intrinsic ordering).
Note that before R 4.0.0, functions like read.table() read character
strings in as factors by default; since R 4.0.0 the default is
stringsAsFactors = FALSE, so strings stay character unless you ask
otherwise.
‘The function factor is used to encode a vector as a factor (the
terms ‘category’ and ‘enumerated type’ are also used for factors). If
argument ordered is TRUE, the factor levels are assumed to be
ordered.’

factor(x = character(), levels, labels = levels,
       exclude = NA, ordered = is.ordered(x))
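
A minimal sketch of what ordered = TRUE buys you (the dose variable and its values are hypothetical): the levels gain a ranking, so comparisons and max() become meaningful.

```r
dose = factor(c("low", "high", "med", "low"),
              levels = c("low", "med", "high"), ordered = TRUE)
dose

levels(dose)       # "low" "med" "high"
is.ordered(dose)   # TRUE
dose < "high"      # TRUE FALSE TRUE TRUE
max(dose)          # high
as.integer(dose)   # 1 3 2 1: position of each element's level
```

With an unordered factor, the comparison and max() calls above are not meaningful, so R returns NA with a warning or stops with an error.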
Factors
Suppose we have a vector of case-control status

cc = factor(c("case","case","case",
"control","control","control"))
cc

## [1] case case case control control control

## Levels: case control

levels(cc) = c("control","case") # relabels the levels: every "case" becomes "control"
cc

## [1] control control control case case case

## Levels: control case
Factors
Note that the levels are ordered alphabetically by default. We can
also specify the level order within the factor() call, which changes
the order of the levels without changing the data:

factor(c("case","case","case","control",
"control","control"),
levels =c("control","case") )

## [1] case case case control control control

## Levels: control case

factor(c("case","case","case","control",
"control","control"),
levels =c("control","case"), ordered=TRUE)

## [1] case case case control control control

## Levels: control < case
Factors

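Factors can be converted to character or numeric very easily:
as.character() returns the labels, and as.numeric() returns the
underlying integer codes (the position of each element's level).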

x = factor(c("case","case","case","control",
"control","control"),
levels =c("control","case") )
as.character(x)

## [1] "case" "case" "case" "control" "control" "c

as.numeric(x)

## [1] 2 2 2 1 1 1
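
Because as.numeric() returns level codes, a factor whose labels look like numbers needs an extra step. The sketch below uses a made-up factor f to show the difference:

```r
f = factor(c("10", "20", "10", "5"))
levels(f)                     # "10" "20" "5" (sorted as text, not as numbers)

as.numeric(f)                 # 1 2 1 3: the level codes, not the values
as.numeric(as.character(f))   # 10 20 10 5: the values themselves
```

Converting through as.character() first is the usual way to recover the original numbers.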
Factors
However, you need to be careful when modifying the labels of existing
factors, as it’s quite easy to alter the meaning of the underlying data.

xCopy = x
levels(xCopy) = c("case", "control") # wrong way
xCopy

## [1] control control control case case case

## Levels: case control

as.character(xCopy) # labels switched

## [1] "control" "control" "control" "case" "case" "c

as.numeric(xCopy)

## [1] 2 2 2 1 1 1
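
One way to change the level order without touching the data, sketched here with a hypothetical xSafe variable, is to pass the existing factor back through factor() with the new levels argument rather than assigning to levels():

```r
x = factor(c("case", "case", "case", "control", "control", "control"),
           levels = c("control", "case"))

xSafe = factor(x, levels = c("case", "control"))  # reorder, do not relabel

levels(xSafe)         # "case" "control"
as.character(xSafe)   # "case" "case" "case" "control" "control" "control"
as.numeric(xSafe)     # 1 1 1 2 2 2: codes change, labels stay attached
identical(as.character(x), as.character(xSafe))   # TRUE
```

Here the integer codes change to follow the new order, while each observation keeps its original label.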
Creating categorical variables
The rep() [“repeat”] function is useful for creating new variables:
each repeats every element in turn, while times repeats the whole vector.

bg = rep(c("boy","girl"),each=50)
head(bg)

## [1] "boy" "boy" "boy" "boy" "boy" "boy"

bg2 = rep(c("boy","girl"),times=50)
head(bg2)

## [1] "boy" "girl" "boy" "girl" "boy" "girl"

length(bg)==length(bg2)

## [1] TRUE
Creating categorical variables
One frequently-used tool is creating categorical variables out of
continuous variables, like generating quantiles of a specific
continuously measured variable.
A general function for creating new variables based on existing
variables is the ifelse() function, which “returns a value with the
same shape as test which is filled with elements selected from either
yes or no depending on whether the element of test is TRUE or
FALSE.”

ifelse(test, yes, no)

# test: an object which can be coerced to logical mode.
# yes: return values for true elements of test.
# no: return values for false elements of test.
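
A small self-contained sketch of ifelse() on made-up numbers (the daily vector below is hypothetical, not the Circulator data) shows that it works element by element and that a missing value in test stays missing in the result:

```r
daily = c(8500, 12000, NA, 10000, 23000)   # hypothetical ridership values

status = ifelse(daily > 10000, "high", "low")
status   # "low" "high" NA "low" "high"
```

Note that exactly 10000 is labelled "low" because the test uses a strict greater-than.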
Charm City Circulator data

Please download the Charm City Circulator data:

http://www.aejaffe.com/winterR_2016/data/Charm_City_Circulator_Ridership.csv

circ = read.csv("http://www.aejaffe.com/winterR_2016/data/Charm_City_Circulator_Ridership.csv",
                header=TRUE,as.is=TRUE)
Creating categorical variables
For example, we can create a new variable that records whether
daily ridership on the Circulator was above 10,000.

hi_rider = ifelse(circ$daily > 10000, "high", "low")

hi_rider = factor(hi_rider, levels = c("low","high"))
head(hi_rider)

## [1] low low low low low low

## Levels: low high

table(hi_rider)

## hi_rider
## low high
## 740 282
Creating categorical variables
You can also nest ifelse() within itself to create 3 levels of a
variable.

riderLevels = ifelse(circ$daily < 10000, "low",

ifelse(circ$daily > 20000,
"high", "med"))
riderLevels = factor(riderLevels,
levels = c("low","med","high"))
head(riderLevels)

## [1] low low low low low low

## Levels: low med high

table(riderLevels)

## riderLevels
## low med high
## 740 280 2
Creating categorical variables

However, it’s much easier to use cut() to create categorical
variables from continuous variables.
‘cut divides the range of x into intervals and codes the values in x
according to which interval they fall. The leftmost interval
corresponds to level one, the next leftmost to level two and so on.’

cut(x, breaks, labels = NULL, include.lowest = FALSE,
    right = TRUE, dig.lab = 3,
    ordered_result = FALSE, ...)
Creating categorical variables

x: a numeric vector which is to be converted to a factor by cutting.

breaks: either a numeric vector of two or more unique cut points
or a single number (greater than or equal to 2) giving the number of
intervals into which x is to be cut.
labels: labels for the levels of the resulting category. By default,
labels are constructed using “(a,b]” interval notation. If labels =
FALSE, simple integer codes are returned instead of a factor.
Cut
Now that we know more about factors, cut() will make more sense:
x = 1:100
cx = cut(x, breaks=c(0,10,25,50,100))
head(cx)

## [1] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10]

## Levels: (0,10] (10,25] (25,50] (50,100]

table(cx)

## cx
## (0,10] (10,25] (25,50] (50,100]
## 10 15 25 50

We can also leave off the labels: with labels=FALSE, cut() returns
integer codes instead of a factor

cx = cut(x, breaks=c(0,10,25,50,100), labels=FALSE)

head(cx)
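
The following sketch pulls together the cut() arguments described above. It is an illustration under our own assumptions: the label names are made up, and the quantile-based breaks are one possible way to build the quantile groups mentioned earlier.

```r
x = 1:100

# Custom labels instead of "(a,b]" interval notation
cx = cut(x, breaks = c(0, 10, 25, 50, 100),
         labels = c("very low", "low", "mid", "high"))
table(cx)   # 10 15 25 50

# Quartile groups: include.lowest = TRUE keeps the minimum value
# (otherwise it falls outside the first left-open interval and becomes NA)
qbreaks = quantile(x, probs = seq(0, 1, by = 0.25))
qx = cut(x, breaks = qbreaks, include.lowest = TRUE)
table(qx)   # 25 values in each group

# Integer codes instead of a factor
codes = cut(x, breaks = c(0, 10, 25, 50, 100), labels = FALSE)
head(codes)   # 1 1 1 1 1 1
```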
Date
You can convert date-like strings into the Date class
(see http://www.statmethods.net/input/dates.html for more info)

head(sort(circ$date))

## [1] "01/01/2011" "01/01/2012" "01/01/2013" "01/02/2011"

## [6] "01/02/2013"

circ$newDate <- as.Date(circ$date, "%m/%d/%Y") # creating a

head(circ$newDate)

## [1] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14"

## [6] "2010-01-16"

range(circ$newDate)

## [1] "2010-01-11" "2013-03-01"
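
As a self-contained sketch of as.Date() that does not need the Circulator file, the two strings below are typed in by hand to match the range shown above. The format string tells R how to read the text: %m is the month, %d the day, and %Y the four-digit year.

```r
d = as.Date(c("01/11/2010", "03/01/2013"), format = "%m/%d/%Y")
d
class(d)                  # "Date"

format(d, "%d %m %Y")     # "11 01 2010" "01 03 2013"
d[2] - d[1]               # a time difference in days
as.numeric(d[2] - d[1])   # 1145
d[1] + 30                 # "2010-02-10": adding a number adds days

as.Date("13/40/2010", format = "%m/%d/%Y")   # NA: not a valid date
```

Once strings are stored as Date objects, sorting and range() follow calendar order rather than the alphabetical order seen with sort(circ$date).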

Date
However, the lubridate package makes it much easier to parse dates:
mdy() reads month/day/year strings without a format string.

library(lubridate) # great for dates!

suppressPackageStartupMessages(library(dplyr))
circ = mutate(circ, newDate2 = mdy(date))
head(circ$newDate2)

## [1] "2010-01-11 UTC" "2010-01-12 UTC" "2010-01-13 UTC" "

## [5] "2010-01-15 UTC" "2010-01-16 UTC"

range(circ$newDate2)

## [1] "2010-01-11 UTC" "2013-03-01 UTC"

POSIXct
The POSIXct class can encode time information

theTime = Sys.time()
theTime

## [1] "2016-01-06 22:45:41 EST"

class(theTime)

## [1] "POSIXct" "POSIXt"

theTime + 5000

## [1] "2016-01-07 00:09:01 EST"

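Note that POSIXct is like a more general date format: it stores a
date-time as the number of seconds since 1970-01-01 UTC, which is why
adding 5000 moves the time forward by 5000 seconds.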
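
Here is a reproducible sketch of the same arithmetic. Since Sys.time() changes on every run, it assumes a fixed timestamp in the UTC time zone instead; t0 and t1 are illustrative names.

```r
t0 = as.POSIXct("2016-01-06 22:45:41", tz = "UTC")
class(t0)        # "POSIXct" "POSIXt"
as.numeric(t0)   # seconds since 1970-01-01 00:00:00 UTC

t1 = t0 + 5000   # 5000 seconds later
format(t1, "%Y-%m-%d %H:%M:%S")     # "2016-01-07 00:09:01"
difftime(t1, t0, units = "mins")    # about 83.3 minutes

as.Date(t0)      # "2016-01-06": drops the time of day
```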

Data Classes:

- Two dimensional classes:
  - data.frame: traditional ‘Excel’ spreadsheets
    - Each column can have a different class, from above
  - Matrix: two-dimensional data, composed of rows and columns.
    Unlike data frames, the entire matrix is composed of one R
    class, e.g. all numeric or all characters.
Matrices

n = 1:9
n

## [1] 1 2 3 4 5 6 7 8 9

mat = matrix(n, nrow = 3)

mat

## [,1] [,2] [,3]

## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
Matrix (and Data frame) Functions

These are in addition to the previous useful vector functions:

- nrow() returns the number of rows of a matrix or data frame
- ncol() returns the number of columns
- dim() returns a vector of length 2: # rows, # columns
- colnames() returns the column names (if any) and
  rownames() returns the row names (if any)
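
A short sketch of these functions on the 3x3 matrix from the previous slide (the row and column names are made up for the example):

```r
mat = matrix(1:9, nrow = 3)

nrow(mat)       # 3
ncol(mat)       # 3
dim(mat)        # 3 3
colnames(mat)   # NULL: no names have been set yet

# colnames() and rownames() can also be used to assign names
colnames(mat) = c("a", "b", "c")
rownames(mat) = c("r1", "r2", "r3")
mat["r2", "c"]  # 8: names can be used for selection
```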
Data Selection
Matrices have two “slots” you can use to select data, which
represent rows and columns, that are separated by a comma, so the
syntax is matrix[row,column]. Note you cannot use dplyr
functions on matrices.

mat[1, 1] # individual entry: row 1, column 1

## [1] 1

mat[1, ] # first row

## [1] 1 4 7

mat[, 1] # first column

## [1] 1 2 3
Data Selection

Note that the returned object is no longer a matrix: selecting a
single row or column drops it to a plain vector

class(mat[1, ])

## [1] "integer"

class(mat[, 1])

## [1] "integer"
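
If you need the result to stay a matrix, base R's drop = FALSE argument to [ prevents the dimension from being dropped. A minimal sketch (row1 and row1m are illustrative names):

```r
mat = matrix(1:9, nrow = 3)

row1 = mat[1, ]                  # default: dimensions are dropped
is.matrix(row1)                  # FALSE
dim(row1)                        # NULL

row1m = mat[1, , drop = FALSE]   # keep it as a 1 x 3 matrix
is.matrix(row1m)                 # TRUE
dim(row1m)                       # 1 3
```

This matters mostly inside functions and loops, where code written for matrices can fail when a selection happens to return a single row or column.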
Data Frames

To review, the data.frame is the other two dimensional variable class.
Again, data frames are like matrices, but each column is a vector
that can have its own class. So some columns might be character
and others might be numeric, while others may be factors.
Data Frames versus Matrices

You will likely use the data.frame class for a lot of data cleaning and
analysis. However, some operations that rely on matrix
multiplication (like performing many linear regressions) are (much)
faster with matrices. Also, as we will touch on later, some functions
for iterating over data return the matrix class, or their results are
placed in empty matrices that can then be converted to data.frames.
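
A brief sketch of moving between the two classes with as.matrix() and as.data.frame(), using a made-up data frame. Because a matrix holds a single class, mixed columns are all coerced to character:

```r
df = data.frame(id = 1:3, name = c("a", "b", "c"),
                stringsAsFactors = FALSE)
sapply(df, class)            # id: "integer", name: "character"

m = as.matrix(df)            # mixed columns: everything becomes character
typeof(m)                    # "character"

nums = as.matrix(df["id"])   # numeric-only columns keep their type
typeof(nums)                 # "integer"

back = as.data.frame(matrix(1:6, nrow = 2))
class(back)                  # "data.frame"
dim(back)                    # 2 3
```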
Data Frames versus Matrices

There are also additional summarization functions for matrices (and
not data.frames) in the matrixStats package, like rowMins(),
colMaxs(), etc.

library(matrixStats,quietly = TRUE)
avgs = select(circ, ends_with("Average"))
rowMins(as.matrix(avgs),na.rm=TRUE)[500:510]

## [1] 3538.5 3402.5 3862.5 3347.5 2837.5 2704.0 3138.5 32

## [11] 3046.0
Data Classes

Extensions of “normal” data classes:

- N-dimensional classes:
  - Arrays: any extension of matrices with more than 2 dimensions,
    e.g. 3x3x3 cube
  - Lists: more flexible container for R objects.
Arrays

Arrays generalize matrices to more than two dimensions. You should be
aware of them because some functions return objects of this class, for
example, cross tabulating over more than 2 variables and the
tapply function.
Arrays
Selecting from arrays is similar to matrices, just with additional
commas for the additional slots.

ar = array(1:27, c(3,3,3))
ar[,,1]

## [,1] [,2] [,3]

## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9

ar[,1,]

## [,1] [,2] [,3]

## [1,] 1 10 19
## [2,] 2 11 20
## [3,] 3 12 21
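
A few more selections on the same array, sketched to show how the three slots work together:

```r
ar = array(1:27, c(3, 3, 3))
dim(ar)             # 3 3 3

ar[2, 3, 1]         # 8: row 2, column 3, first slice
ar[1, 1, 3]         # 19: row 1, column 1, third slice

dim(ar[, , 1])      # 3 3: one slice is an ordinary matrix
dim(ar[, , 1:2])    # 3 3 2: two slices are still an array
```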
Lists

- Lists are the most generic data type.
- Can be created using list()
- Can hold vectors, strings, matrices, models, and even other lists
- Can reference data using $ (if the elements are named), or
  using [] or [[]]

> mylist <- list(letters=c("A", "b", "c"),
+         numbers=1:3, matrix(1:25, ncol=5))
List Structure
> head(mylist)

$letters
[1] "A" "b" "c"

$numbers
[1] 1 2 3

[[3]]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 2 7 12 17 22
[3,] 3 8 13 18 23
[4,] 4 9 14 19 24
[5,] 5 10 15 20 25
List referencing

> mylist[1] # returns a list

$letters
[1] "A" "b" "c"

> mylist["letters"] # returns a list

$letters
[1] "A" "b" "c"
List referencing

> mylist[[1]] # returns the vector 'letters'

[1] "A" "b" "c"

> mylist$letters # returns vector

[1] "A" "b" "c"

> mylist[["letters"]] # returns the vector 'letters'

[1] "A" "b" "c"

List referencing

You can also select multiple list elements with the single brackets.

> mylist[1:2] # returns a list

$letters
[1] "A" "b" "c"

$numbers
[1] 1 2 3
List referencing
You can also select down several levels of a list at once

> mylist$letters[1]

[1] "A"

> mylist[[2]][1]

[1] 1

> mylist[[3]][1:2,1:2]

[,1] [,2]
[1,] 1 6
[2,] 2 7
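
The sketch below rebuilds the same list and checks the difference between [ ] and [[ ]], then adds and removes an element. The total element is a made-up name for the example.

```r
mylist = list(letters = c("A", "b", "c"),
              numbers = 1:3, matrix(1:25, ncol = 5))

length(mylist)        # 3
names(mylist)         # "letters" "numbers" "" (the matrix is unnamed)

class(mylist[1])      # "list": single brackets keep the list wrapper
class(mylist[[1]])    # "character": double brackets extract the element

mylist$total = sum(mylist$numbers)   # add a new named element
mylist$numbers = NULL                # assigning NULL removes an element
names(mylist)         # "letters" "" "total"
```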
Splitting Data Frames

The split() function is useful for splitting data.frames

“split divides the data in the vector x into the groups defined by f.
The replacement forms replace values corresponding to such a
division. unsplit reverses the effect of split.”

> dayList = split(circ,circ$day)

Splitting Data Frames
Here is a good chance to introduce lapply(), which applies a
function to each element of a list and returns the results as a list:

> # head(dayList)
> lapply(dayList, head, n=2)

$Friday
day date orangeBoardings orangeAlightings orang
5 Friday 01/15/2010 1645 1643
12 Friday 01/22/2010 1401 1388
purpleBoardings purpleAlightings purpleAverage greenBoar
5 NA NA NA
12 NA NA NA
greenAlightings greenAverage bannerBoardings bannerAligh
5 NA NA NA
12 NA NA NA
bannerAverage daily newDate newDate2
5 NA 1644.0 2010-01-15 2010-01-15
> # head(dayList)
> lapply(dayList, dim)

$Friday
[1] 164 17

$Monday
[1] 164 17

$Saturday
[1] 163 17

$Sunday
[1] 163 17

$Thursday
[1] 164 17

$Tuesday
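
For a version of split() and lapply() that runs without the Circulator file, here is a sketch on a tiny made-up data frame (toy, byDay and the riders values are all hypothetical):

```r
toy = data.frame(day = c("Mon", "Tue", "Mon", "Tue", "Mon"),
                 riders = c(10, 20, 30, 40, 80),
                 stringsAsFactors = FALSE)

byDay = split(toy, toy$day)   # a list with one data frame per day
names(byDay)                  # "Mon" "Tue"

lapply(byDay, dim)            # Mon: 3 2, Tue: 2 2

# sapply() simplifies the result to a named vector when it can
sapply(byDay, function(d) mean(d$riders))   # Mon 40, Tue 30
```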
General Class Information

There are two useful families of functions associated with practically
all R classes: is.CLASS_() functions logically check the underlying
class, and as.CLASS_() functions coerce between classes.
We saw some examples of coercion earlier, like as.numeric()
and as.character() for the factor class, and also
as.Date() for the Date class.
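
A few base R examples of the is. and as. families, as a sketch on made-up values. Values that cannot be coerced become NA, and as.numeric() warns when that happens, so the warning is suppressed here.

```r
v = c("1", "2", "three")
is.character(v)    # TRUE
is.numeric(v)      # FALSE

n = suppressWarnings(as.numeric(v))   # 1 2 NA: "three" cannot be coerced
is.numeric(n)      # TRUE

as.integer(3.9)    # 3: the fractional part is dropped, not rounded
as.logical(c("TRUE", "false", "yes"))   # TRUE FALSE NA
as.character(1:3)  # "1" "2" "3"
```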
