
Module2 DAR

Module 2 covers the creation, access, and basic operations of matrices, arrays, lists, factors, strings, and date/time classes in R. It includes examples of how to manipulate these data types, including arithmetic operations and accessing elements. The module emphasizes the differences between atomic and recursive data structures in R.

Uploaded by

db8770632
Copyright
© © All Rights Reserved

Module 2

❖ OBJECTIVES

• know how to create, access and perform basic operations on matrices and arrays
in R

• know how to create, access and perform basic operations on the list data types

• know how to create, access and perform basic operations on the factor data
types in R

• know how to create, access and perform basic operations on strings in R

• understand the various date and time classes in R

• convert between various date formats

• set up various time zones

• perform calculations on dates and times

Matrices and Arrays


A matrix is a collection of data elements with the same basic type arranged in a two-
dimensional rectangular layout. An array consists of multidimensional rectangular
data. Matrices are special cases of two-dimensional arrays. To create an array the
array() function can be used and a vector of values and vector of dimensions are
passed to it.
> x <- array(1:24, dim = c(4, 3, 2),
dimnames = list(c("a", "b", "c", "d"), c("e", "f", "g"), c("h", "i")))
> x
, , h

  e f  g
a 1 5  9
b 2 6 10
c 3 7 11
d 4 8 12

, , i

   e  f  g
a 13 17 21
b 14 18 22
c 15 19 23
d 16 20 24

A matrix is created with the matrix() function, which takes an nrow or ncol
argument instead of the dim argument used by array(). A matrix can also be
created with the array() function by giving it two dimensions.
> m <- matrix(1:12, nrow = 3, dimnames = list(c("a", "b", "c"), c("d", "e", "f", "g")))
> m
  d e f  g
a 1 4 7 10
b 2 5 8 11
c 3 6 9 12
> m1 <- array(1:12, dim = c(3, 4),
dimnames = list(c("a", "b", "c"), c("d", "e", "f", "g")))
> m1
  d e f  g
a 1 4 7 10
b 2 5 8 11
c 3 6 9 12

The argument byrow = TRUE in the matrix() function fills the elements row-wise.
If this argument is not specified, the elements are filled column-wise by
default.
> m <- matrix(1:12, nrow = 3, byrow = TRUE,
dimnames = list(c("a", "b", "c"), c("d", "e", "f", "g")))
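
The difference between the two fill orders is easiest to see side by side. The following is a minimal sketch; the names m_col and m_row are illustrative and do not appear in the module.

```r
# Fill the same six values column-wise (the default) and row-wise
m_col <- matrix(1:6, nrow = 2)
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)

print(m_col)  # first column holds 1, 2
print(m_row)  # first row holds 1, 2, 3

# A row-wise fill equals the transpose of a column-wise fill
# with the dimensions swapped
same <- identical(m_row, t(matrix(1:6, nrow = 3)))
print(same)
```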

The dim() function returns the dimensions of an array or a matrix. The functions
nrow() and ncol() return the number of rows and the number of columns of a
matrix, respectively.
> dim(x)
[1] 4 3 2
> dim(m)
[1] 3 4
> nrow(m)
[1] 3
> ncol(m)
[1] 4

The length() function also works for matrices and arrays. It is also possible to
assign new dimensions to a matrix or an array using the dim() function, provided
the product of the new dimensions equals the number of elements.
> length(x)
[1] 24
> length(m)
[1] 12
> dim(m) <- c(6,2)

The functions rownames(), colnames() and dimnames() fetch the row names, column
names and dimension names of matrices and arrays, respectively.
> rownames(m1)
[1] "a" "b" "c"
> colnames(m1)
[1] "d" "e" "f" "g"
> dimnames(x)
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] "e" "f" "g"
[[3]]
[1] "h" "i"

The element at the nth row and mth column of a matrix M is extracted with the
expression M[n, m]. The entire nth row is extracted with M[n, ] and, similarly,
the mth column with M[, m]. It is also possible to extract more than one row or
column at a time. The examples below use a 3 x 3 matrix M filled row-wise.
> M <- matrix(1:9, nrow = 3, byrow = TRUE)
> M[2,3]
[1] 6
> M[2,]
[1] 4 5 6
> M[,3]
[1] 3 6 9
> M[,c(1,3)]
     [,1] [,2]
[1,]    1    3
[2,]    4    6
[3,]    7    9
> M[c(1,3),]
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    7    8    9

The matrix transpose, which interchanges the rows and columns, is obtained
using the function t().
> t(M)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

The columns of two matrices can be combined using the cbind() function and
similarly the rows of two matrices can be combined using the rbind() function.
> M1 = matrix(c(2,4,6,8,10,12), nrow=3, ncol=2)
> M1
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12
> M2 = matrix(c(3,6,9), nrow=3, ncol = 1)
> M2
[,1]
[1,] 3
[2,] 6
[3,] 9
> cbind(M1, M2)
[,1] [,2] [,3]
[1,] 2 8 3
[2,] 4 10 6
[3,] 6 12 9
> M3 = matrix(c(4,8), nrow=1, ncol=2)
> M3
[,1] [,2]
[1,] 4 8
> rbind(M1, M3)
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12
[4,] 4 8

A matrix can be deconstructed using the c() function which combines all
column vectors into one.
> c(M1)
[1] 2 4 6 8 10 12

The arithmetic operators "+", "-", "*" and "/" work element-wise on matrices
and arrays, provided the matrices or arrays are of conformable sizes. Matrix
multiplication is done using the operator "%*%".
> M1 = matrix(c(2,4,6,8,10,12), nrow=3, ncol=2)
> M1
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12
> M2 = matrix(c(3,6,9,11,1,5), nrow=3, ncol = 2)
> M2
[,1] [,2]
[1,] 3 11
[2,] 6 1
[3,] 9 5
> M1 + M2
[,1] [,2]
[1,] 5 19
[2,] 10 11
[3,] 15 17
> M1 * M2
[,1] [,2]
[1,] 6 88
[2,] 24 10
[3,] 54 60
> M2 = matrix(c(3,6,9,11), nrow=2, ncol = 2)
> M2
[,1] [,2]
[1,] 3 9
[2,] 6 11
> M1 %*% M2
[,1] [,2]
[1,] 54 106
[2,] 72 146
[3,] 90 186

The power operator "^" also works element-wise on matrices, so M2^-1 gives the
reciprocal of each element rather than the matrix inverse. To find the inverse
of a matrix, the function solve() can be used.
> M2
[,1] [,2]
[1,] 3 9
[2,] 6 11
> M2^-1
[,1] [,2]
[1,] 0.3333333 0.11111111
[2,] 0.1666667 0.09090909
> solve(M2)
[,1] [,2]
[1,] -0.5238095 0.4285714
[2,] 0.2857143 -0.1428571
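
One way to confirm that solve() returns the true inverse, while the element-wise power does not, is to multiply back and compare with the identity matrix. This is a small sketch using the same M2 as above; the names inv, is_inverse and reciprocal_is_inverse are illustrative.

```r
M2 <- matrix(c(3, 6, 9, 11), nrow = 2, ncol = 2)

# The matrix product with the inverse is the identity (up to rounding error)
inv <- solve(M2)
is_inverse <- isTRUE(all.equal(M2 %*% inv, diag(2)))
print(is_inverse)

# The element-wise reciprocal does not behave like an inverse
reciprocal_is_inverse <- isTRUE(all.equal(M2 %*% M2^-1, diag(2)))
print(reciprocal_is_inverse)
```

all.equal() is used instead of == because floating-point results are rarely exactly equal.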

Lists
Lists allow us to combine different data types in a single variable. Lists are
created using the list() function, which is used much like the c() function: the
contents of the list are passed as arguments separated by commas. A list element
can be a vector, a matrix or a function. The elements can be named when the list
is created, or later using the names() function.
> L <- list(c(9,1, 4, 7, 0), matrix(c(1,2,3,4,5,6), nrow = 3))
>L
[[1]]
[1] 9 1 4 7 0
[[2]]
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

> names(L) <- c("Num", "Mat")

>L
$Num
[1] 9 1 4 7 0
$Mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> L <- list(Num = c(9,1, 4, 7, 0), Mat = matrix(c(1,2,3,4,5,6), nrow = 3))
>L
$Num
[1] 9 1 4 7 0
$Mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

Accessing list elements: Once the list is created, its elements can be accessed
by index or by name, using either double square brackets ([[ ]]) or the dollar
sign ($) notation.

> L[[1]]

[1] 9 1 4 7 0
> L[[2]]

[,1] [,2]

[1,] 1 4

[2,] 2 5

[3,] 3 6

> L[["Num"]]

[1] 9 1 4 7 0

> L[["Mat"]]

[,1] [,2]

[1,] 1 4

[2,] 2 5

[3,] 3 6

> L$Num[1]

[1] 9

> L$Num[3:5]

[1] 4 7 0

> L$Mat[1,]

[1] 1 4

> L$Mat[,2]

[1] 4 5 6

> L$Mat[2,1]

[1] 2
The dollar sign ($) accesses a named element of a list directly. Single square
brackets [ ] select elements by position or name and always return a sublist.
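
The distinction between a sublist and the element itself can be checked with class() and length(). A minimal sketch, rebuilding a list like L above (the matrix contents are simplified here):

```r
L <- list(Num = c(9, 1, 4, 7, 0), Mat = matrix(1:6, nrow = 3))

# Single brackets keep the list wrapper: the result is a list of length 1
slice <- L["Num"]
print(class(slice))
print(length(slice))

# Double brackets and $ return the element itself: a numeric vector of length 5
member <- L[["Num"]]
print(class(member))
print(length(member))
print(identical(member, L$Num))
```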

Lists can be nested; that is, a list can be an element of another list. Vectors,
arrays and matrices, by contrast, are not recursive: they are atomic. The
functions is.recursive() and is.atomic() show whether a variable is recursive or
atomic, respectively.
> is.atomic(list())
[1] FALSE
> is.recursive(list())
[1] TRUE
> is.atomic(L)
[1] FALSE
> is.recursive(L)
[1] TRUE
> is.atomic(matrix())
[1] TRUE
> is.recursive(matrix())
[1] FALSE

The length() function works on lists as it does on vectors and matrices, but the
dim(), nrow() and ncol() functions return NULL.
> length(L)
[1] 2
> dim(L)
NULL
> nrow(L)
NULL
> ncol(L)
NULL

Arithmetic operators cannot be applied to a list as a whole; they work only on
the individual elements, and only when those elements are numeric. As in vectors,
the elements of a list can be accessed by indexing with square brackets. The
index can be a positive number, a negative number, an element name or a logical
value.
> L1 <- list(l1 = c(8, 9, 1), l2 = matrix(c(1,2,3,4), nrow = 2),
l3 = list(l31 = c("a", "b"), l32 = c(TRUE, FALSE)))
> L1
$l1
[1] 8 9 1
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4
$l3
$l3$l31
[1] "a" "b"
$l3$l32
[1] TRUE FALSE

> L1[1:2]
$l1
[1] 8 9 1
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4

> L1[-3]
$l1
[1] 8 9 1
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4

> L1[c("l1", "l2")]


$l1
[1] 8 9 1

$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4

> L1[c(TRUE, TRUE, FALSE)]


$l1
[1] 8 9 1
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4

A list is a generic vector containing other objects.


> a = c(4,8,12)
> b = c("abc", "def", "ghi", "jkl", "mno")
> d = c(TRUE, FALSE)
> t = list(a, b, d, 5)

The list t contains copies of the vectors a, b and d. A list slice is retrieved
using single square brackets []. Below, t[2] is a slice (a one-element list)
holding a copy of b. A slice can also be retrieved with multiple members.
> t[2]
[[1]]
[1] "abc" "def" "ghi" "jkl" "mno"
> t[c(2,4)]
[[1]]
[1] "abc" "def" "ghi" "jkl" "mno"
[[2]]
[1] 5

To reference a list member directly, double square brackets [[]] are used. Thus
t[[2]] retrieves the second member of the list t. The result is a copy of b
itself, not a slice containing it. The contents of a member can also be modified
directly, and the original vector b is unaffected.
> t[[2]]
[1] "abc" "def" "ghi" "jkl" "mno"
> t[[2]][1] = "qqq"
> t[[2]]
[1] "qqq" "def" "ghi" "jkl" "mno"
> b
[1] "abc" "def" "ghi" "jkl" "mno"

We can assign names to the list members and reference them by name instead of
by numeric index. A list of two members named "first" and "second" is given as
an example below. The list slice containing the member "first" is retrieved
using single square brackets [].
> l = list(first=c(1,2,3), second=c("a", "b", "c"))
> l
$first
[1] 1 2 3
$second
[1] "a" "b" "c"
> l["first"]
$first
[1] 1 2 3

A named list member can also be referenced directly with the $ operator or
double square brackets [[]], as below.
> l$first
[1] 1 2 3

> l[["first"]]
[1] 1 2 3

A vector can be converted to a list using the function as.list(). Similarly, a
list can be converted into a vector, provided the list contains scalar elements
of the same type, using conversion functions such as as.numeric(),
as.character() and so on. If a list consists of non-scalar elements of the same
type, it can be converted into a vector using the function unlist().
> v <- c(7, 3, 9, 2, 6)

> as.list(v)
[[1]]
[1] 7
[[2]]
[1] 3
[[3]]
[1] 9
[[4]]
[1] 2
[[5]]
[1] 6

> L <- list(3, 7, 8, 12, 14)


> as.numeric(L)
[1] 3 7 8 12 14
> L1 <- list("aaa", "bbb", "ccc")
> L1
[[1]]
[1] "aaa"
[[2]]
[1] "bbb"
[[3]]
[1] "ccc"
> as.character(L1)
[1] "aaa" "bbb" "ccc"
> L1 <- list(l1 = c(78, 90, 21), l2 = c(11,22,33,44,55))
> L1
$l1
[1] 78 90 21
$l2
[1] 11 22 33 44 55

> unlist(L1)
l11 l12 l13 l21 l22 l23 l24 l25
78 90 21 11 22 33 44 55
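
When the elements are not all of the same type, unlist() still returns a vector, but it coerces everything to a common type. The short sketch below illustrates this; the list mixed is a made-up example, not part of the module.

```r
# A list mixing integer and character elements
mixed <- list(a = 1:2, b = "x")

# unlist() coerces to the most general type, here character
flat <- unlist(mixed)
print(flat)
print(class(flat))
print(names(flat))
```

This is why the text above restricts unlist() to elements of the same type: with mixed types the numbers silently become strings.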

The c() function can also be used to combine lists as we do for vectors.
> L1 <- list(l1 = c(78, 90, 21), l2 = c(11,22,33,44,55))
> L2 <- list("aaa", "bbb", "ccc")
> c(L1, L2)
$l1
[1] 78 90 21
$l2
[1] 11 22 33 44 55
[[3]]
[1] "aaa"
[[4]]
[1] "bbb"
[[5]]
[1] "ccc"

Data Frames
A data frame is used for storing data tables. They store spread-sheet like data. It is a
list of vectors of equal length (not necessarily of the same basic data type). Consider
a data frame df1 consisting of three vectors a, b, and d.
> a = c(1, 2, 3)
> b = c("a", "b", "c")
> d = c(TRUE, FALSE, TRUE)
> df1 = data.frame(a, b, d)
> df1
a b d
1 1 a TRUE
2 2 b FALSE
3 3 c TRUE

By default the row names are automatically numbered from 1 to the number of
rows in the data frame. It is also possible to provide row names manually using the
row.names argument as below.
> df1 = data.frame(a, b, d, row.names = c("one", "two", "three"))
> df1
a b d
one 1 a TRUE
two 2 b FALSE
three 3 c TRUE

The functions rownames(), colnames(), dimnames(), nrow(), ncol() and dim()
can be applied to data frames as below. The length() and names() functions
return the same results as ncol() and colnames(), respectively.
> rownames(df1)
[1] "one"   "two"   "three"
> colnames(df1)
[1] "a" "b" "d"
> dimnames(df1)
[[1]]
[1] "one"   "two"   "three"
[[2]]
[1] "a" "b" "d"

> nrow(df1)
[1] 3
> ncol(df1)
[1] 3
> dim(df1)
[1] 3 3
> length(df1)
[1] 3
> names(df1)
[1] "a" "b" "d"

It is possible to create data frames with different length of vectors as long as the
shorter ones can be recycled to match that of the longer ones. Otherwise, an error
will be thrown.
> df2 <- data.frame(x = 1, y = 2:3, y = 4:7)
> df2
x y y.1
1 1 2 4
2 1 3 5
3 1 2 6
4 1 3 7
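
The recycling rule can be tried out directly. In this sketch (column names are illustrative) the first call succeeds because 1 and 2 both divide 4, while the second fails because 3 does not divide 4; tryCatch() is used only so that the error message can be captured and printed.

```r
# Lengths 1, 2 and 4: the shorter vectors are recycled to 4 rows
ok <- data.frame(x = 1, y = 1:2, z = 1:4)
print(dim(ok))

# Lengths 3 and 4 cannot be recycled to a common length, so an error is raised
res <- tryCatch(
  data.frame(y = 1:3, z = 1:4),
  error = function(e) conditionMessage(e)
)
print(res)
```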

By default, data.frame() converts column names into syntactically valid ones,
as in the example below. Setting the argument check.names = FALSE keeps the
names exactly as given.
> df3 <- data.frame("BaD col" = c(1:5), "!@#$%^&*" = c("aaa"))
> df3
  BaD.col X........
1       1       aaa
2       2       aaa
3       3       aaa
4       4       aaa
5       5       aaa

There are many built-in data frames available in R (example – mtcars). When
this data frame is invoked in R tool, it produces the below result.
> mtcars
mpg cyl disp hp drat wt ...
Mazda RX4 21.0 6 160 110 3.90 2.62 ...
Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 ...
Datsun 710 22.8 4 108 93 3.85 2.32 ...
............

The top line contains the header or the column names. Each row denotes a record
or a row in the table. A row begins with the name of the row. Each data member of a
row is called a cell. To retrieve a cell value, we enter the row and the column number
of the cell in square brackets [] separated by a comma. The cell value of the second
row and third column is retrieved as below. The row and the column names can also
be used inside the square brackets [] instead of the row and column numbers.
> mtcars[2, 3]
[1] 160
> mtcars["Mazda RX4 Wag", "disp"]
[1] 160

The nrow() function gives the number of rows in a data frame and the ncol()
function gives the number of columns in a data frame. To get the preview or the first
few records of a data frame along with the header the head() function can be used.
> nrow(mtcars)
[1] 32
> ncol(mtcars)
[1] 11
> head(mtcars)
mpg cyl disp hp drat wt ...
Mazda RX4 21.0 6 160 110 3.90 2.62 ...
......

To retrieve a column from a data frame we use double square brackets [[]] and
the column name or the column number inside the [[]]. The same can be achieved
by making use of the $ symbol as well. This same result can also be achieved by
using single brackets [] by mentioning a comma instead of the row name / number
and using the column name / number as the second index inside the [].
> mtcars[["hp"]]
[1] 110 110 93 110 175 105 262 95 123 123 180 180 180 ....
> mtcars[[4]]
[1] 110 110 93 110 175 105 262 95 123 123 180 180 180 ....
> mtcars$hp
[1] 110 110 93 110 175 105 262 95 123 123 180 180 180 ....
> mtcars[,"hp"]
[1] 110 110 93 110 175 105 262 95 123 123 180 180 180 ....
> mtcars[,4]
[1] 110 110 93 110 175 105 262 95 123 123 180 180 180 ....

Similarly, if we use the column name or the column number inside a single
square bracket [], we get the below result.
> mtcars[4]
hp
Mazda RX4 110
Mazda RX4 Wag 110
Datsun 710 93
....
> mtcars[c("mpg","hp")]
mpg hp
Mazda RX4 21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710 22.8 93
....

To retrieve a row from a data frame, we use single square brackets [] with the
row name or number as the first index, followed by a comma in place of the
column name or number.
> mtcars[6,]
mpg cyl disp hp drat wt ....
Valiant 18.1 6 225 105 2.76 3.46 ....

> mtcars[c(6,18),]
mpg cyl disp hp drat wt ....
Valiant 18.1 6 225 105 2.76 3.46 ....
Fiat 128 32.4 4 78.7 66 4.08 2.20 ....
> mtcars["Valiant",]
mpg cyl disp hp drat wt ....
Valiant 18.1 6 225 105 2.76 3.46 ....

> mtcars[c("Valiant","Fiat 128"),]
mpg cyl disp hp drat wt ....
Valiant 18.1 6 225 105 2.76 3.46 ....
Fiat 128 32.4 4 78.7 66 4.08 2.20 ....

If we need to fetch a subset of a data frame by selecting few columns and
specifying conditions on the rows, we can use the subset() function to do this. This
function takes the arguments, the data frame, the condition to be applied on the
rows and the columns to be fetched.
> x <- c("a", "b", "c", "d", "e", "f")
> y <- c(3, 4, 7, 8, 12, 15)
> z <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
> D <- data.frame(x, y, z)
>D

x y z
1 a 3 TRUE
2 b 4 TRUE
3 c 7 FALSE
4 d 8 TRUE
5 e 12 FALSE
6 f 15 TRUE

> subset(D, y<10 & z, select=x)


x
1 a
2 b
4 d
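
The same selection can also be written with ordinary bracket indexing, which makes explicit what subset() does with its arguments. A minimal sketch, assuming the data frame D from above; by_subset and by_index are illustrative names.

```r
D <- data.frame(
  x = c("a", "b", "c", "d", "e", "f"),
  y = c(3, 4, 7, 8, 12, 15),
  z = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

# subset(): row condition and column selection by bare names
by_subset <- subset(D, y < 10 & z, select = x)

# Bracket indexing: columns must be qualified with D$, and
# drop = FALSE keeps the result a data frame
by_index <- D[D$y < 10 & D$z, "x", drop = FALSE]

print(by_subset)
print(by_index)
```

subset() is convenient interactively; bracket indexing is more explicit inside functions, where bare column names can be ambiguous.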

As we have for matrices the transpose of a data frame can be obtained using the
t() function as below.
> t(D)
[,1] [,2] [,3] [,4] [,5] [,6]
x "a" "b" "c" "d" "e" "f"
y " 3" " 4" " 7" " 8" "12" "15"
z " TRUE" " TRUE" "FALSE" " TRUE" "FALSE" " TRUE"

The functions rbind() and cbind() can also be applied to data frames as they
are to matrices. The only condition for rbind() is that the column names must
match; cbind() performs no such check, even when column names are duplicated.
> x1 <- c("aaa", "bbb", "ccc", "ddd", "eee", "fff")
> y1 <- c(9, 12, 17, 18, 23, 32)
> z1 <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
> ff <- data.frame(x1, y1, z1)
> ff
x1 y1 z1
1 aaa 9 TRUE
2 bbb 12 FALSE
3 ccc 17 TRUE
4 ddd 18 FALSE
5 eee 23 TRUE
6 fff 32 FALSE
> cbind(D, ff)

x y z x1 y1 z1
1 a 3 TRUE aaa 9 TRUE
2 b 4 TRUE bbb 12 FALSE
3 c 7 FALSE ccc 17 TRUE
4 d 8 TRUE ddd 18 FALSE
5 e 12 FALSE eee 23 TRUE
6 f 15 TRUE fff 32 FALSE
> F <- data.frame(x, y, z)
>F
x y z
1 a 3 TRUE
2 b 4 TRUE
3 c 7 FALSE
4 d 8 TRUE
5 e 12 FALSE
6 f 15 TRUE

> rbind(D, F)
x y z
1 a 3 TRUE
2 b 4 TRUE
3 c 7 FALSE
4 d 8 TRUE
5 e 12 FALSE
6 f 15 TRUE
1 a 3 TRUE
2 b 4 TRUE
3 c 7 FALSE
4 d 8 TRUE
5 e 12 FALSE
6 f 15 TRUE

The merge() function merges two data frames that have column names in common.
By default, merge() joins on all the common columns; to join on particular
columns only, their names are given in the by argument.
> merge(D, F, all = TRUE)
x y z
1 a 3 TRUE
2 b 4 TRUE
3 c 7 FALSE
4 d 8 TRUE
5 e 12 FALSE
6 f 15 TRUE

> # Create the first data frame DD


> DD <- data.frame( S = c(1, 2, 3, 4, 5, 6), A = c(2, 4, 7, 8, 12, 15), B = c(TRUE, TRUE,
FALSE, TRUE, FALSE, TRUE))
>
> # Create the second data frame FF
> FF <- data.frame(S = c(1, 2, 3, 7, 8),A1 = c(9, 12, 17, 20, 25),B1 = c(FALSE, TRUE,
TRUE, FALSE, TRUE))
> DD
S A B
1 1 2 TRUE
2 2 4 TRUE
3 3 7 FALSE
4 4 8 TRUE
5 5 12 FALSE
6 6 15 TRUE

> FF
S A1 B1
1 1 9 FALSE
2 2 12 TRUE
3 3 17 TRUE
4 7 20 FALSE
5 8 25 TRUE

# Merge DD and FF
> merged_df <- merge(DD, FF, by = "S", all = TRUE)
> # Print the result
> print(merged_df)
S A B A1 B1
1 1 2 TRUE 9 FALSE
2 2 4 TRUE 12 TRUE
3 3 7 FALSE 17 TRUE
4 4 8 TRUE NA NA
5 5 12 FALSE NA NA
6 6 15 TRUE NA NA
7 7 NA NA 20 FALSE
8 8 NA NA 25 TRUE
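
The all argument controls which unmatched rows are kept. The sketch below uses cut-down versions of DD and FF (only the key and one value column each) to compare the four variants; the result names are illustrative.

```r
DD <- data.frame(S = c(1, 2, 3, 4, 5, 6), A = c(2, 4, 7, 8, 12, 15))
FF <- data.frame(S = c(1, 2, 3, 7, 8), A1 = c(9, 12, 17, 20, 25))

inner <- merge(DD, FF, by = "S")                # only keys present in both
left  <- merge(DD, FF, by = "S", all.x = TRUE)  # every row of DD
right <- merge(DD, FF, by = "S", all.y = TRUE)  # every row of FF
full  <- merge(DD, FF, by = "S", all = TRUE)    # every row of either

# Row counts of the four results
print(c(inner = nrow(inner), left = nrow(left),
        right = nrow(right), full = nrow(full)))
```

Rows without a partner are filled with NA in the columns that come from the other data frame, as in the printed result above.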
The functions colSums(), colMeans(), rowSums() and rowMeans() can be
applied to data frames that hold numeric values, as below.
> x <- c(5, 6, 7, 8)
> y <- c(15, 16, 17, 18)
> z <- c(15, 16, 17, 18)
> G <- data.frame(x, y, z)
> G
  x  y  z
1 5 15 15
2 6 16 16
3 7 17 17
4 8 18 18

> colSums(G[, 1:2])
 x  y
26 66

> colMeans(G[, 1:3])
   x    y    z
 6.5 16.5 16.5

> rowSums(G[1:4, ])
 1  2  3  4
35 38 41 44

> rowMeans(G[1:4, ])
       1        2        3        4
11.66667 12.66667 13.66667 14.66667

Factors
Factors are used to store categorical data such as gender ("Male" or "Female").
Depending on the context, they behave sometimes like character vectors and
sometimes like integer vectors. Consider a data frame that stores the weights of
a few males and females. The column that stores the gender is a factor, as it
holds categorical data. The choices "female" and "male" are called the levels of
the factor. The levels and their number can be viewed using the levels() and
nlevels() functions.

> weight <- data.frame(wt_kg = c(60, 82, 45, 49, 52, 75, 68),
gender = c("female", "male", "female", "female", "female", "male", "male"))

> weight$gender <- as.factor(weight$gender)

> print(weight)
  wt_kg gender
1    60 female
2    82   male
3    45 female
4    49 female
5    52 female
6    75   male
7    68   male

> print(weight$gender)
[1] female male female female female male male
Levels: female male
> print(levels(weight$gender))
[1] "female" "male"
> print(nlevels(weight$gender))
[1] 2

A factor can also be created directly with the factor() function, which takes a
character vector as its argument.
> gender <- factor(c("female", "male", "female", "female", "female", "male", "male"))
> gender
[1] female male female female female male male
Levels: female male

The levels argument of the factor() function specifies the levels of the factor
and their order. The levels can also be changed after the factor is created,
using the function levels() or the function relevel(). Assigning to levels()
renames the existing levels by position, so in the example below "male" becomes
"F" and "female" becomes "M"; the new labels must therefore be listed in the
same order as the current levels. The function relevel() only moves the named
level to the first position.
> gender <- factor(c("female", "male", "female", "female", "female",
"male", "male"), levels = c("male", "female"))
> gender
[1] female male female female female male male
Levels: male female
> levels(gender) <- c("F", "M")
> gender
[1] M F M M M F F
Levels: F M
> relevel(gender, "M")
[1] M F M M M F F
Levels: M F
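
Because a factor stores integer codes plus a table of level labels, it helps to look at both. The sketch below also shows one way to relabel levels with an explicit pairing of old and new names, using the levels and labels arguments of factor(); the variable names are illustrative.

```r
gender <- factor(c("female", "male", "female"), levels = c("male", "female"))

# The integer codes index into the levels: male = 1, female = 2
codes <- as.integer(gender)
print(codes)

# Relabel with an explicit old-to-new pairing: levels[i] becomes labels[i]
relabeled <- factor(gender, levels = c("male", "female"), labels = c("M", "F"))
print(relabeled)
print(as.character(relabeled))
```

Writing the old levels next to the new labels avoids the accidental swap that a purely positional levels() assignment can cause.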

A level that is no longer in use can be dropped from a factor using the function
droplevels(), as in the example below. [Note: the function is.na() is used to
filter out the row with the missing value.]
> diet <- data.frame(eat = c("fruit", "fruit", "vegetable", "fruit"),
type = c("apple", "mango", NA, "papaya"))
> diet$eat <- as.factor(diet$eat)
> diet
        eat   type
1     fruit  apple
2     fruit  mango
3 vegetable   <NA>
4     fruit papaya
> diet <- subset(diet, !is.na(type))
> diet
    eat   type
1 fruit  apple
2 fruit  mango
4 fruit papaya
> diet$eat
[1] fruit fruit fruit
Levels: fruit vegetable
> levels(diet)
NULL
> levels(diet$eat)
[1] "fruit"     "vegetable"
> unique(diet$eat)
[1] fruit
Levels: fruit vegetable
> diet$eat <- droplevels(diet$eat)
> levels(diet$eat)
[1] "fruit"

In some cases the levels need to be ordered, as when rating a product or a
course. The ratings can be "Outstanding", "Excellent", "Very Good", "Good" and
"Bad". A factor created with these levels is not ordered by default. To order
the levels of a factor, we can use either the function ordered() or the argument
ordered = TRUE in the factor() function. Such ordering is useful when analysing
survey data.
> ch <- c("Outstanding", "Excellent", "Very Good", "Good", "Bad")
> val <- sample(ch, 100, replace = TRUE)
> rating <- factor(val, ch)
> rating
[1] Outstanding Bad Outstanding Good Very Good Very Good
[7] Excellent Outstanding Bad Excellent Very Good Bad
...
Levels: Outstanding Excellent Very Good Good Bad
> is.factor(rating)
[1] TRUE
> is.ordered(rating)
[1] FALSE
> rating_ord <- ordered(val, ch)
> is.factor(rating_ord)
[1] TRUE
> is.ordered(rating_ord)
[1] TRUE
> rating_ord
[1] Outstanding Bad Outstanding Good Very Good Very Good
[7] Excellent Outstanding Bad Excellent Very Good Bad
...
Levels: Outstanding < Excellent < Very Good < Good < Bad

Numeric values can be summarized into factors using the cut() function, and
the result can be viewed using the table() function, which lists the count of
values in each category. For example, consider the variable age, which holds
numeric ages. These ages can be grouped using the cut() function into intervals
of width 5, and the result is the factor age_group. With right = FALSE, each
interval includes its lower bound and excludes its upper bound.
# Define the age vector
age <- c(18, 20, 21, 22, 22, 25, 41, 28, 45, 48, 51, 27, 29, 42, 29)

# Create age groups using cut


age_group <- cut(age, breaks = seq(15, 55, by = 5), right = FALSE)

# Display the age group assigned to each age
age_group

[1] [15,20) [20,25) [20,25) [20,25) [20,25) [25,30) [40,45) [25,30) [45,50) [45,50)
[50,55) [25,30) [25,30) [40,45) [25,30)
Levels: [15,20) [20,25) [25,30) [30,35) [35,40) [40,45) [45,50) [50,55)
> table(age_group)
age_group
[15,20) [20,25) [25,30) [30,35) [35,40) [40,45) [45,50) [50,55)
1 4 5 0 0 2 2 1
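
Two arguments of cut() are worth trying on a small vector: right, which decides which end of each interval is closed, and labels, which replaces the interval notation with readable names. The ages, break points and label strings below are made up for illustration.

```r
age <- c(18, 20, 21, 29, 30)

# Left-closed intervals [15,20), [20,30), [30,40) with custom labels
g_left <- cut(age, breaks = c(15, 20, 30, 40), right = FALSE,
              labels = c("15-19", "20-29", "30-39"))
print(as.character(g_left))

# Default right = TRUE gives (15,20], (20,30], (30,40]:
# the boundary values 20 and 30 fall into the lower interval
g_right <- cut(age, breaks = c(15, 20, 30, 40))
print(as.character(g_right))
```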

The function gl() can be used to create a factor, which takes the first argument
that tells how many levels the factor contains and the second argument that tells
how many times each level has to be repeated as value. This function can also take
the argument labels, which lists the names of the factor levels. The function can
also be made to list alternating values of the labels as below.
> gl(5,2)
[1] 1 1 2 2 3 3 4 4 5 5
Levels: 1 2 3 4 5

> # First example


> factor_one_to_five <- gl(5, 2, labels = c("one", "two", "three", "four", "five"))
> print(factor_one_to_five)
[1] one one two two three three four four five five
Levels: one two three four five

> # Second example with proper definitions


> t <- 2
> factor_with_labels <- gl(5, t, labels = c("t1", "t2", "t3", "t4", "t5"))
> print(factor_with_labels)
[1] t1 t1 t2 t2 t3 t3 t4 t4 t5 t5
Levels: t1 t2 t3 t4 t5
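
Alternating values are obtained by repeating each level once and asking for a longer result through the length argument of gl(). A minimal sketch with made-up labels:

```r
# Each level repeated once, pattern recycled to length 6: alternating values
alt <- gl(2, 1, length = 6, labels = c("low", "high"))
print(alt)

# Each level repeated three times: values come in blocks
blocks <- gl(2, 3, labels = c("low", "high"))
print(blocks)
```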

The factors thus generated can be combined using the function interaction() to
get a resultant combined factor.
> # Create the first factor
> fact <- gl(5, 2, labels = c("one", "two", "three", "four", "five"))
>
> # Create the second factor with more labels
> fac2 <- gl(15, 2, length = 10, labels = letters[1:15])
>
> # Display the factors
> print(fact)
[1] one one two two three three four four five five
Levels: one two three four five
> print(fac2)
[1] a a b b c c d d e e
Levels: a b c d e f g h i j k l m n o
> # Create the interaction of the two factors
> interaction_result <- interaction(fact, fac2)
>
> # Display the result
> print(interaction_result)
[1] one.a one.a two.b two.b three.c three.c four.d four.d five.e five.e

75 Levels: one.a two.a three.a four.a five.a one.b two.b three.b four.b five.b
one.c two.c three.c four.c five.c one.d two.d three.d four.d five.d one.e two.e three.e
four.e ... five.o

Strings
Strings are stored in character vectors, and most string manipulation functions
act on character vectors. Character vectors can be created using the c()
function by enclosing each string in double or single quotes (double quotes are
the usual convention). The paste() function concatenates strings with a space
in between. If no space is wanted, we use the function paste0(). To use a
specific separator between the concatenated strings, we use the sep argument of
the paste() function. The result can be collapsed into one string using the
collapse argument.
> c("String 1", 'String 2')
[1] "String 1" "String 2"
> paste(c("Pine", "Red"), "Apple")
[1] "Pine Apple" "Red Apple"
> paste0(c("Pine", "Red"), "Apple")
[1] "PineApple" "RedApple"
> paste(c("Pine", "Red"), "Apple", sep = "-")
[1] "Pine-Apple" "Red-Apple"
> paste(c("Pine", "Red"), "Apple", sep = "-", collapse = ", ")
[1] "Pine-Apple, Red-Apple"

The toString() function converts a numeric vector into a single character
string, with the elements separated by a comma and a space. The maximum width
of the resulting string can be specified in this function.
> s <- c(1:10)^3
> s
[1] 1 8 27 64 125 216 343 512 729 1000
> toString(s)
[1] "1, 8, 27, 64, 125, 216, 343, 512, 729, 1000"
> toString(s, 18)
[1] "1, 8, 27, 64, ...."

The cat() function is similar to the paste() function, but it concatenates all
of its arguments and prints the result to the console instead of returning a
character vector, as shown below.
> cat(c("Red", "Pine"), "Apple")
Red Pine Apple

The noquote() function forces the string outputs not to be displayed with
quotes.
> a <- c("I", "am", "a", "data", "scientist")
> a
[1] "I" "am" "a" "data" "scientist"
> noquote(a)
[1] I am a data scientist

The formatC() function formats numbers and returns them as strings. It has the
arguments digits, width, format, flag and so on, which can be used as below. A
slight variation of formatC() is the function format(), whose usage is also
shown below.
> h <- c(4.567, 8.981, 27.772)
> h
[1] 4.567 8.981 27.772
> formatC(h)
[1] "4.567" "8.981" "27.77"
> formatC(h, digits = 3)
[1] "4.57" "8.98" "27.8"
> formatC(h, digits = 3, width = 5)
[1] " 4.57" " 8.98" " 27.8"
> formatC(h, digits = 3, format = "e")
[1] "4.567e+00" "8.981e+00" "2.777e+01"
> formatC(h, digits = 3, flag = "+")
[1] "+4.57" "+8.98" "+27.8"

> format(h)
[1] " 4.567" " 8.981" "27.772"
> format(h, digits = 3)
[1] " 4.57" " 8.98" "27.77"
> format(h, digits = 3, trim = TRUE)
[1] "4.57" "8.98" "27.77"

The sprintf() function is also used for formatting strings and for inserting
number values into them. The placeholder %s in the format string stands for a
string to be passed. The placeholders %d and %f stand for an integer and a
floating-point number, respectively. The usage of this function can be
understood from the example below.

> s <- c(1, 2, 3)
> sprintf("The number %d in the list is = %f", s, h)
[1] "The number 1 in the list is = 4.567000"
[2] "The number 2 in the list is = 8.981000"
[3] "The number 3 in the list is = 27.772000"
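
The placeholders also accept a width and a precision, which is how padded or rounded output is usually produced. The following sketch uses made-up values to show a few common patterns.

```r
# Width 5 with 2 decimals: the result is right-aligned and padded with a space
padded <- sprintf("%5.2f", 3.14159)
print(padded)

# Zero-padded integer of width 3
zero_padded <- sprintf("%03d", 7L)
print(zero_padded)

# sprintf() is vectorised over its arguments
scores <- sprintf("%s scored %d", c("A", "B"), c(90L, 85L))
print(scores)
```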

To print a tab within text, we can use the cat() function with the special
character "\t" included in the text, as below. Similarly, to insert a new line
within the text, we use "\n". In the cat() function, the argument fill = TRUE
means that after the text is printed, the cursor is placed on the next line.
A backslash inside the text must be preceded by another backslash. If the text
is enclosed in double quotes and contains a double quote, that quote must be
preceded by a backslash. Similarly, if the text is enclosed in single quotes
and contains a single quote, that quote must be preceded by a backslash. If the
text is enclosed in double quotes and contains a single quote, or is enclosed
in single quotes and contains a double quote, no backslash is needed.
> cat("Black\tBerry", fill = TRUE)
Black   Berry
> cat("Black\nBerry", fill = TRUE)
Black
Berry
> cat("Black\\Berry", fill = TRUE)
Black\Berry
> cat("Black\"Berry", fill = TRUE)
Black"Berry
> cat('Black\'Berry', fill = TRUE)
Black'Berry
> cat('Black"Berry', fill = TRUE)
Black"Berry
> cat("Black'Berry", fill = TRUE)
Black'Berry

The functions toupper() and tolower() convert a string to upper case and lower
case, respectively. The substring() and substr() functions extract part of a
string from the given text. Their arguments are the text, the starting position
and the ending position, and both functions produce the same result.
> toupper("The cat is on the Wall")
[1] "THE CAT IS ON THE WALL"
> tolower("The cat is on the Wall")
[1] "the cat is on the wall"
> substring("The cat is on the wall", 3, 10)
[1] "e cat is"
> substr("The cat is on the wall", 3, 10)
[1] "e cat is"
> substr("The cat is on the wall", 5, 10)
[1] "cat is"

The function strsplit() splits a text into several strings at the splitting
character given as its argument. In the example below, the text is split
wherever a space is encountered. It is important to note that this function
returns a list, not a character vector.
> strsplit("I like Banana, Orange and Pineapple", " ")
[[1]]
[1] "I" "like" "Banana," "Orange" "and" "Pineapple"

In the same example, to split the text at a space that may be preceded by a
comma, the separator is written as ",? ". This means that the comma is optional
and the space is mandatory for splitting the given text.
> strsplit("I like Banana, Orange and Pineapple", ",? ")
[[1]]
[1] "I" "like" "Banana" "Orange" "and" "Pineapple"
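
The list result makes sense once strsplit() is given more than one string: each input string gets its own list element. A short sketch with illustrative names:

```r
# One input string: the words are in the first (and only) list element
parts <- strsplit("I like Banana, Orange and Pineapple", ",? ")
words <- parts[[1]]
print(is.list(parts))
print(length(words))

# Several input strings: one list element per string, possibly of different lengths
multi <- strsplit(c("a b", "c d e"), " ")
print(lengths(multi))
```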

R's current working directory can be obtained using the function getwd(), and
it can be changed using the function setwd(). The directory path given to
setwd() should use forward slashes instead of backslashes, as in the example
below.
> getwd()
[1] "C:/Users/admin/Documents"
> setwd("C:/Program Files/R")
> getwd()
[1] "C:/Program Files/R"

It is also possible to construct file paths using the file.path() function,
which automatically inserts the forward slash between the directory names. The
function R.home() returns the home directory where R is installed.
> file.path("C:", "Program Files", "R", "R-3.3.0")
[1] "C:/Program Files/R/R-3.3.0"
> R.home()
[1] "C:/PROGRA~1/R/R-33~1.0"

Paths can also be given in relative terms: "." denotes the current directory,
".." denotes the parent directory and "~" denotes the home directory. The
function path.expand() replaces a leading "~" with the home directory; "." and
".." are returned unchanged.
> path.expand(".")
[1] "."
> path.expand("..")
[1] ".."
> path.expand("~")
[1] "C:/Users/admin/Documents"

The function basename() returns only the file name, leaving out the directory
if one is specified. The function dirname(), on the other hand, returns only
the directory name, leaving out the file name.
> filename <- "C:/Program Files/R/R-3.3.0/bin/R.exe"
> basename(filename)
[1] "R.exe"
> dirname(filename)
[1] "C:/Program Files/R/R-3.3.0/bin"
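
These helpers compose naturally: a path built with file.path() can be taken apart again with basename() and dirname(). The sketch below uses a hypothetical relative path, and additionally calls tools::file_ext(), which is not mentioned in the text, to pull out the extension.

```r
# Build a path from its components (hypothetical directory and file names)
p <- file.path("project", "data", "input.csv")
print(p)

# Take it apart again
print(basename(p))        # file name only
print(dirname(p))         # directory part only
print(tools::file_ext(p)) # extension without the dot
```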

Dates and Times


Dates and Times are common in data analysis and R has a wide range of capabilities
for dealing with dates and times.

Date and Time Classes

R has three date and time base classes: POSIXct, POSIXlt and Date. POSIX is a
set of standards that defines how dates and times should be specified. In
POSIXct, "ct" stands for "calendar time"; the value is stored as a number of
seconds. In POSIXlt, "lt" stands for "local time"; the date is stored as a list
of seconds, minutes, hours, day of month and so on. For storing and calculating
with dates we can use POSIXct, and for extracting parts of dates we can use
POSIXlt.

The function Sys.time() returns the current date and time. The returned value
is in the POSIXct form by default, but it can be converted to the POSIXlt form
using the function as.POSIXlt(). When printed, both forms are displayed in the
same manner, but their internal storage differs. The individual components of a
POSIXlt date can be accessed using the dollar symbol or double brackets, as
shown below.
> Sys.time()
[1] "2017-05-11 14:21:29 IST"
> t1 <- Sys.time()
> t2 <- as.POSIXlt(t1)
> t1
[1] "2017-05-11 14:29:29 IST"
> t2
[1] "2017-05-11 14:29:29 IST"
> class(t1)
[1] "POSIXct" "POSIXt"
> class(t2)
[1] "POSIXlt" "POSIXt"
> t2$sec
[1] 29.20794
> t2[["min"]]
[1] 29
> t2$hour
[1] 14
> t2$mday
[1] 11
> t2$wday
[1] 4

The Date class stores dates as the number of days since the start of 1970. This
class is useful when the time of day is insignificant. The as.Date() function
converts a date held in another class to the Date class.
> t3 <- as.Date(t2)
> t3
[1] "2017-05-11"

Add-on packages for R provide further date and time classes, among them date,
dates, chron, yearmon, yearqtr, timeDate, ti and jul. The locale used for
formatting dates and times can be queried with Sys.getlocale().
> Sys.getlocale("LC_TIME")
[1] "English_India.1252"

A few of the time zones are UTC (Universal Time), IST (Indian Standard Time),
EST (Eastern Standard Time), PST (Pacific Standard Time), GMT (Greenwich
Mean Time), etc. It is also possible to give a manual offset from UTC as
"UTC+n" or "UTC-n" to denote the west and east parts of UTC, respectively. Even
though this throws a warning message, it gives the result correctly.
> strftime(Sys.time(), tz = "UTC")
[1] "2017-05-12 04:59:04"

> strftime(Sys.time(), tz = "UTC-5")
[1] "2017-05-12 09:59:09"
Warning message:
In as.POSIXlt.POSIXct(x, tz = tz) : unknown timezone 'UTC-5'
> strftime(Sys.time(), tz = "UTC+5")
[1] "2017-05-11 23:59:15"
Warning message:
In as.POSIXlt.POSIXct(x, tz = tz) : unknown timezone 'UTC+5'

The time zone change does not take effect in the strftime() function if the
date is of class POSIXlt. Hence, the date must first be converted to the
POSIXct class before the function is applied.
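
A warning-free alternative is to use named zones from the time zone database with format() on a POSIXct value. This is a sketch that assumes the zone names "Asia/Kolkata" and "Etc/GMT-5" are available on the system, which is usual but not guaranteed; the fixed timestamp is taken from the example output above.

```r
# A fixed instant in UTC, stored as POSIXct
t_utc <- as.POSIXct("2017-05-12 04:59:04", tz = "UTC")

# Display the same instant in Indian Standard Time (UTC + 5:30)
ist <- format(t_utc, format = "%Y-%m-%d %H:%M:%S", tz = "Asia/Kolkata")
print(ist)

# "Etc/GMT-5" follows the POSIX sign convention: it is 5 hours AHEAD of UTC
ahead <- format(t_utc, format = "%H:%M", tz = "Etc/GMT-5")
print(ahead)
```

The inverted sign in the Etc zones matches the behaviour seen above, where "UTC-5" moved the clock forward by five hours.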

Calculations with Dates and Times

If we add a number to the POSIXct or POSIXlt classes, it will shift to that many
seconds. If we add a number to the Date class, it will shift to that many days.
> ct <- as.POSIXct(Sys.time())
> lt <- as.POSIXlt(Sys.time())
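
The two shifting rules can be checked on fixed values rather than the current time, so that the results are reproducible. The following is a minimal sketch; the timestamp and the variable names are illustrative, and difftime() is used in addition to the operations described above.

```r
# Adding a number to a POSIXct value shifts it by that many seconds
ct <- as.POSIXct("2017-05-11 14:29:29", tz = "UTC")
one_minute_later <- ct + 60
print(format(one_minute_later, "%H:%M:%S"))

# Adding a number to a Date value shifts it by that many days
d <- as.Date("2017-05-11")
one_week_later <- d + 7
print(one_week_later)

# The difference between two dates is a difftime object
gap <- difftime(one_week_later, d, units = "days")
print(as.numeric(gap))
```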
