R Lesson-7
Data Structures in R
By
Prof.Dr. A.B.Chowdhury,HOD,CA
September 2, 2024
This lesson aims to provide detailed knowledge of the data structures
supported by R. These data structures equip R to handle the kinds of data
that we meet in practice in a simple, compact and lucid manner, and they
are a large part of what makes R a useful tool for data-oriented
programming.
Prerequisite: Knowledge of a programming language will be an added
advantage.
Data structures in R can be classified in two ways. According to the first
way of classification, the data may be either homogeneous, i.e. comprising
data of the same type, or heterogeneous, i.e. consisting of data of
different types. On the basis of the second way of classification, they are
organized in one-dimensional, two-dimensional or multidimensional form.
By Prof. Dr. A.B. Chowdhury, HOD, CA (TIU, W.B.)
The tools and techniques of R programming
Lesson-7: Data Structures in R, September 2, 2024
Table showing the data structures in R
The following table shows the data structures according to their
classifications.
Organization of data     Homogeneous data structure   Heterogeneous data structure
1-dimensional (1-d)      Atomic vector, factor        List
2-dimensional (2-d)      Matrix                       Data frame
Multidimensional (n-d)   Array                        -
paste(): This function concatenates two or more strings. It can also be used to
concatenate two or more vectors of strings, in which case the elements of the
shorter vectors are recycled to match the length of the longest one. The
syntax of the function is:
paste(..., sep = " ", collapse = NULL)
where
... is one or more R objects, to be converted to character vectors;
sep is a character string used to separate the terms;
collapse is an optional character string used to separate the results.
paste() converts its arguments to character strings via as.character() and
concatenates them, separating them by sep. If the arguments are vectors, they are
concatenated term by term to give a character vector result. Thus, paste() takes
one element from each of the vectors at a time and concatenates them into a single
element of the result.
paste0(..., collapse) is equivalent to paste(..., sep = "", collapse). Thus,
paste0() uses the empty string as its separator, whereas paste() uses a single
space by default.
If a value is given for collapse, the elements of the result are joined into a
single string, separated by that value.
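The behaviour of sep, collapse and recycling can be checked with a short sketch. The vectors first and second below are hypothetical values chosen only for illustration.

```r
# Hypothetical input vectors, used only for illustration
first  <- c("a", "b", "c")
second <- c("x", "y")                       # shorter vector, recycled to length 3

spaced    <- paste(first, second)           # default separator is a single space
joined    <- paste0(first, 1:3)             # same as paste(first, 1:3, sep = "")
collapsed <- paste(first, collapse = "-")   # all elements joined into one string

print(spaced)      # "a x" "b y" "c x"
print(joined)      # "a1" "b2" "c3"
print(collapsed)   # "a-b-c"
```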
Figure: The internal structure of a list referenced by two different names with common and uncommon values
An empty list can be created through any one of the following three methods:
1 Using the list() function
2 Using the vector() function
3 Storing the NULL value in an existing list
Some illustrative examples follow.
Example-1. A list of length zero.
>blank_list=list()
>blank_list # shows list()
>length(blank_list) # shows [1] 0
>length(blank_list2)
[1] 5
$Name
[1] "ABC"
>list<<-NULL
Error: cannot change value of locked binding for 'list'
>list=NULL
>list
NULL
>list=list(12,list(1,'A','B',"C",''),Name='ABC')
>list=NULL
>list
NULL
>length(list)
[1] 0
We observe that the components of a list can be broken down into smaller
components but the same thing cannot be done in case of a vector. So, the
normal vectors that have been defined earlier can be called atomic
vectors, whereas, the lists can be called recursive vectors.
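The distinction can be checked with the base functions is.atomic() and is.recursive(). The objects numbers and record are hypothetical examples, not taken from the lesson.

```r
# Hypothetical objects, used only for illustration
numbers <- c(1, 2, 3)                      # an atomic vector
record  <- list(1, "a", list(TRUE, 2))     # a list holding another list

is.atomic(numbers)      # TRUE: cannot be broken into smaller components
is.recursive(numbers)   # FALSE
is.atomic(record)       # FALSE
is.recursive(record)    # TRUE: its components can themselves be lists

inner <- record[[3]][[1]]   # reach into the nested list
print(inner)                # TRUE
```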
Manipulation of the elements of a list
We can extend a list by assigning a value to the desired location. The
location is specified by the name of the list followed by the subscript of
the desired location within square brackets ([ ]). If the subscript refers
to a location beyond the current end of the list, the skipped locations are
filled with the special value NULL, which means that no value is available
at that location. If the location already contains a value, that value is
replaced with the new one. These cases are illustrated in the sessions
below:
>lt[1]=5 # This replaces the current value at the first location of the list.
>lt # Let us now have a look at the elements of the list
$`Vector`
[1] 5
$Name
[1] "Arka Roy"
$Age
[1] 25
$`An Inner List`
$`An Inner List`[[1]]
[1] 6
$`An Inner List`[[2]]
[1] "r"
$`An Inner List`[[3]]
[1] TRUE
>lt[5]="Latest Value" # Extending the list with a new element.
>lt[6]='Next Value' # Extending the list with another new element.
>lt # Listing the current elements of the list.
[1] 6
$`An Inner List`[[2]]
[1] "r"
$`An Inner List`[[3]]
[1] TRUE
[[5]]
[1] "Latest Value"
[[6]]
[1] "Next Value"
>lt[8]="A New element" # Inserting a value at the 8th location when the list has elements only up to the 6th location.
>lt[7] # As there is no value at the 7th location, R shows NULL.
[[1]]
NULL
$`Vector`
[1] 5
$Name
[1] "Arka Roy"
$Age
[1] 25
$`An Inner List`
$`An Inner List`[[1]]
[1] 6
$`An Inner List`[[2]]
[1] "r"
$`An Inner List`[[3]]
[1] TRUE
[[5]]
[1] "Latest Value"
[[6]]
[1] "Next Value"
>lt[2]='Pranab Roy' # Changing an existing value
>lt # listing again
$`Vector`
[1] 5
$Name
[1] "Pranab Roy"
$Age
[1] 25
$`An Inner List`
$`An Inner List`[[1]]
We can also delete an element from a list by storing NULL in its location, as shown below:
>lt[1]=NULL
>lt[1] # This shows the value that was previously at location 2: the element at location 1 has been removed,
# every remaining element moves up by one location, and the size of the list is reduced by one.
$`Name`
[1] "Pranab Roy"
>lt[5]=NULL
>lt
$`Name`
[1] "Pranab Roy"
$Age
[1] 25
$`An Inner List`
$`An Inner List`[[1]]
[1] 6
$`An Inner List`[[2]]
[1] "r"
$`An Inner List`[[3]]
[1] TRUE
[[4]]
[1] "Latest Value"
[[5]]
NULL
[[6]]
[1] "A New element"
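The three operations on list elements (extending past the end, replacing, and deleting) can be reproduced in a compact, self-contained sketch. The list items is a hypothetical stand-in for the lesson's lt, whose full definition is not shown here.

```r
# Hypothetical list standing in for the lesson's 'lt'
items <- list(Name = "Arka Roy", Age = 25)

items[4] <- "Latest Value"      # location 3 is skipped and filled with NULL
length(items)                   # 4
is.null(items[[3]])             # TRUE

items[2] <- 30                  # replaces the existing value at location 2

items[1] <- NULL                # deletes location 1; later elements move up
length(items)                   # 3
print(items[[1]])               # 30, previously at location 2
```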
A list can be converted into a vector by using the unlist() function, in
which case all the values of the list are coerced to a common data type, as
mentioned earlier for vectors. This is illustrated in the following
commands and the resulting vectors.
>v1=unlist(lt)
>v2=unlist(v2)
>print(v1)
Age An Inner List1 An Inner List2 An Inner List3
"25" "6" "r" "TRUE" "Latest Value" "A New element"
>v2
[1] "A" "B" "C"
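The coercion performed by unlist() can be seen on a small list with mixed types. The list mixed is a hypothetical example: because one component is a character string, every value ends up as a character string.

```r
# Hypothetical list with numeric, logical and character components
mixed <- list(Age = 25, Flag = TRUE, Label = "r")

flat <- unlist(mixed)   # named vector; all values coerced to character
class(flat)             # "character"
print(flat)
```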
If we want to see the different labels of the categorical values, we use the
levels() function. To see the internal structure of an R object, we use the
str() function. Their effects are shown below:
>levels(Marital_Status)
[1] "Divorced" "Married" "single"
>str(Marital_Status)
Factor w/ 3 levels "Divorced","Married",..: 3 3 3 3 3 3 3 3 3 3
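A minimal sketch of how a factor stores its labels and integer codes follows. The vector status is hypothetical and only mirrors the labels used above.

```r
# Hypothetical categorical data, mirroring the labels used in the lesson
status <- factor(c("single", "Married", "Divorced", "single"))

lv    <- levels(status)       # the distinct labels, in sorted order
codes <- as.integer(status)   # position of each value within the levels

print(lv)      # "Divorced" "Married" "single"
print(codes)   # 3 2 1 3
str(status)    # compact view of levels and codes
```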
Data Frames in R
A data frame in R is a two-dimensional tabular representation of data in
rows and columns, where all columns have the same length but may have the
same or different data types (modes). It is a natural way of representing
data sets that come from relational database tables, Excel sheets, or
vectors and factors of equal length. The data may be qualitative or
quantitative, i.e. categorical or numerical. A data frame is like a list
whose components are the columns of a table. The columns are usually named
and are often referred to as variables. The rows of a data frame may also
be named, if the user likes.
The data.frame() function is used to create a data frame.
The following R statement shows the creation of a data frame.
>data.frame(Name=c("Ram","Ramesh","Mamata"),Marital_Status=
c("Married","Divorced","Single"),Age=c(35,37,62))
    Name Marital_Status Age
1    Ram        Married  35
2 Ramesh       Divorced  37
3 Mamata         Single  62
>df = data.frame(Date=as.Date(character()),File=character(),
User=character(),stringsAsFactors=FALSE)
>str(df)
'data.frame': 0 obs. of 3 variables:
 $ Date: 'Date' num(0)
 $ File: chr
 $ User: chr
In general, an empty data frame is created by initializing each desired
column with an empty vector of the appropriate type.
The following example illustrates the creation of a dataframe with five empty
vectors:
df = data.frame(Doubles=double(),Ints=integer(),Factors=factor(),
Logicals=logical(), Characters=character(),stringsAsFactors=FALSE)
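That such a data frame has columns but no rows can be confirmed as sketched below. The name empty_df and the choice of three columns are assumptions made for brevity.

```r
# A shortened, hypothetical variant of the empty data frame above
empty_df <- data.frame(Doubles = double(), Ints = integer(),
                       Characters = character(),
                       stringsAsFactors = FALSE)

dim(empty_df)                           # 0 rows, 3 columns
col_types <- sapply(empty_df, class)    # each column keeps its declared type
print(col_types)
```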
To see the dimensions of a data frame, we use the dim() function as illustrated
below:
>dim(social_status)
R shows the following for this statement:
[1] 3 4
Entries of a data frame can be selected by writing subscripts within square
brackets after the variable holding the data frame: the row number followed by
the column number, separated by a comma. This is illustrated below:
>social_status[1,3]
[1] Highly Educated
Levels: Highly Educated Lowly educated Moderately Educated
>social_status[3,2]
[1] Rich
Levels: Middle class Poor Rich
Insertion of Rows and Columns in an R Data Frame
New components (columns) can be inserted into a data frame, if required, as shown below:
>social_status$Bank_Balance=c(10000,80000,500000)
>social_status
  Persons Monetary_status  Educational_status Age Bank_Balance
1     Ram            Poor     Highly Educated  30        1e+04
2   shyam    Middle class Moderately Educated  40        8e+04
3    Jadu            Rich      Lowly educated  50        5e+05
We may also add further rows to an existing data frame. This can be done by
defining the additional row values and then using the rbind() function as
illustrated below.
>ss=data.frame(Persons=c("Madhu","Jidu","Sidhu"),Monetary_status=
c("Poor","poor","Rich"),Educational_status=c("Highly Educated",
"Lowly Educated","Moderately Educated"),Age=c(35,45,55),Bank_Balance=
c(25000,35000,90000))
>ss=rbind(social_status,ss)
>ss
  Persons Monetary_status  Educational_status Age Bank_Balance
1     Ram            Poor     Highly Educated  30        10000
2   shyam    Middle class Moderately Educated  40        80000
3    Jadu            Rich      Lowly educated  50       500000
4   madhu            Poor     Highly Educated  35        25000
5    Jidu            poor      Lowly Educated  45        35000
6   Sidhu            Rich Moderately Educated  55        90000
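A reduced, self-contained sketch of the same idea follows. The data frames base_rows and new_rows are hypothetical and have only two columns; rbind() requires both to have the same column names.

```r
# Hypothetical two-column data frames with matching column names
base_rows <- data.frame(Persons = c("Ram", "Jadu"), Age = c(30, 50),
                        stringsAsFactors = FALSE)
new_rows  <- data.frame(Persons = "Madhu", Age = 35,
                        stringsAsFactors = FALSE)

combined <- rbind(base_rows, new_rows)   # rows of new_rows are appended
print(combined)
nrow(combined)                           # 3
```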
The cbind() function can be used to add more than one column at a time to an
existing data frame, as illustrated below:
>x=c(1,2,3,4,5,6)
>y=c(7,8,9,10,11,12)
>cbind(ss,x,y)
  Persons Monetary_status  Educational_status Age Bank_Balance Marital_Status x  y
1     Ram            Poor     Highly Educated  30        10000           TRUE 1  7
2   shyam    Middle class Moderately Educated  40        80000           TRUE 2  8
3    Jadu            Rich      Lowly educated  50       500000          FALSE 3  9
4   madhu            Poor     Highly Educated  35        25000          FALSE 4 10
5    Jidu            poor      Lowly Educated  45        35000           TRUE 5 11
6   Sidhu            Rich Moderately Educated  55        90000           TRUE 6 12
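The column names produced by cbind() come from the names of the vectors supplied, as the short sketch below suggests. The data frame scores is a hypothetical one-column example.

```r
# Hypothetical one-column data frame and two vectors of matching length
scores <- data.frame(Persons = c("Ram", "Jadu", "Madhu"),
                     stringsAsFactors = FALSE)
x <- c(1, 2, 3)
y <- c(7, 8, 9)

wider <- cbind(scores, x, y)   # both vectors are appended as new columns
names(wider)                   # "Persons" "x" "y"
print(wider)
```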
4. Using a negative column number in the subscript also removes the column,
as shown below:
>ss=ss[,-3]
>ss
This removes the Age column.
The following form also removes the 3rd column:
>ss=ss[-3]
Deletion from a Data Frame–Contd.
5. To delete multiple columns at a time, we assign a list of NULL values to
them, as shown below:
>ss[2:4]=list(NULL)
>ss
This shows the following table:
  Persons Bank_Balance
1   Madhu        25000
2    Jidu        35000
3   Sidhu        90000
To remove multiple rows at a time, we can use the subset() function, which keeps only the rows that satisfy the
given condition and drops the rest, as shown below:
>subset(ss,Monetary_status!="Poor"&Age>50)
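The deletion forms described above can be tried on a small, self-contained data frame. The data frame records and its values are hypothetical; each result is stored under a new name so that the original is left intact.

```r
# Hypothetical data frame, used only for illustration
records <- data.frame(Persons = c("Ram", "Jadu", "Sidhu"),
                      Monetary_status = c("Poor", "Rich", "Rich"),
                      Age = c(30, 50, 55),
                      Bank_Balance = c(10000, 500000, 90000),
                      stringsAsFactors = FALSE)

no_age <- records[, -3]          # negative index drops the 3rd column (Age)

trimmed <- records
trimmed[2:3] <- list(NULL)       # drops columns 2 and 3 in one step

# subset() keeps the rows satisfying the condition; all other rows are dropped
kept <- subset(records, Monetary_status != "Poor" & Age > 50)

names(no_age)
names(trimmed)
print(kept)
```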
Insertion of columns at any desired position of a data frame
We have observed that the cbind() function always adds a column after the
existing columns of a data frame. If we want to insert a column at any desired
column position, we can do so with the add_column() function of the tibble
package, which has the following syntax:
add_column(name_of_existing_dataframe, new_column_definition, {.before | .after} = position)
Here, option-1 | option-2 implies that any one of the options is to be used; the
braces are not part of the actual statement, and position is the index or name
of an existing column.
The tibble package must be installed and loaded with library() before the
function is used. An illustrative example is shown below:
>library(tibble)
>df=data.frame(A = 10:14, B = 21:25, C=33:37)
This creates the following dataframe:
A B C
1 10 21 33
2 11 22 34
3 12 23 35
4 13 24 36
5 14 25 37
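Where the tibble package is not available, the same effect can be approximated in base R by splitting the columns around the insertion point. This is a sketch of an alternative technique, not of add_column() itself; the names new_col, position and with_new are assumptions, and position is assumed to be at least 1.

```r
# Same data frame as in the lesson
df <- data.frame(A = 10:14, B = 21:25, C = 33:37)

new_col  <- 1:5   # hypothetical values for the new column D
position <- 1     # insert D after this column (assumed to be >= 1)

# Columns up to 'position', then the new column, then the remaining columns
with_new <- cbind(df[seq_len(position)],
                  D = new_col,
                  df[-seq_len(position)])

names(with_new)   # "A" "D" "B" "C"
```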
We can also assign the resulting dataframe to the existing one to make the
changes permanent in the dataframe.
>df=insertRow(df,rw,r)
We may like to replace the NA values in the data frame with some specific value, say 999, for easy recognition. We can do this
with the following command:
>airquality[is.na(airquality)]=999
Let us check the effect on the NA values obtained in the output of the preceding query:
>airquality[c(1,5,9),]
The following result is generated:
Let us now retrieve the data of columns 3 and 5 for rows 5, 7 and 13. The statement written below serves the purpose.
>airquality[c(5,7,13),c(3,5)]
Here neither the row numbers nor the column numbers are consecutive; hence, we need to express them as vectors. The result
is shown below:
Wind Month
5 14.3 5
7 8.6 5
13 9.2 5
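The two steps can be combined into a sketch that works on a copy of the built-in airquality data set, so that the original is left unchanged. The names aq and picked are assumptions.

```r
# Work on a copy so that the built-in data set itself is not modified
aq <- airquality

aq[is.na(aq)] <- 999     # replace every NA with the marker value 999
anyNA(aq)                # FALSE: no missing values remain

# Rows 5, 7, 13 and columns 3, 5, each given as a vector
picked <- aq[c(5, 7, 13), c(3, 5)]
print(picked)
```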
  Persons Monetary_status  Educational_status Age Bank_Balance Marital_Status
2   shyam    Middle class Moderately Educated  40        8e+04           TRUE
3    Jadu            Rich      Lowly educated  50        5e+05          FALSE
6   Sidhu            Rich Moderately Educated  55        9e+04           TRUE
>ss[ss$Bank_Balance<50000,]
  Persons Monetary_status  Educational_status Age Bank_Balance Marital_Status
1     Ram            Poor     Highly Educated  30        10000           TRUE
4   madhu            Poor     Highly Educated  35        25000          FALSE
5    Jidu            poor      Lowly Educated  45        35000           TRUE
  Persons Monetary_status  Educational_status Age Bank_Balance Marital_Status
1     Ram            Poor     Highly Educated  30        10000           TRUE
4   madhu            Poor     Highly Educated  35        25000          FALSE
5    Jidu            poor      Lowly Educated  45        35000           TRUE
2   shyam    Middle class Moderately Educated  40        80000           TRUE
6   Sidhu            Rich Moderately Educated  55        90000           TRUE
3    Jadu            Rich      Lowly educated  50       500000          FALSE
>ss[rev(order(ss[,"Bank_Balance"])),]
  Persons Monetary_status  Educational_status Age Bank_Balance Marital_Status
3    Jadu            Rich      Lowly educated  50       500000          FALSE
6   Sidhu            Rich Moderately Educated  55        90000           TRUE
2   shyam    Middle class Moderately Educated  40        80000           TRUE
5    Jidu            poor      Lowly Educated  45        35000           TRUE
4   madhu            Poor     Highly Educated  35        25000          FALSE
1     Ram            Poor     Highly Educated  30        10000           TRUE
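Row filtering with a logical condition and reverse ordering with rev(order()) can be sketched on a small data frame. The data frame accounts is hypothetical and keeps only two of the columns used above.

```r
# Hypothetical two-column data frame
accounts <- data.frame(Persons = c("Ram", "Shyam", "Jadu"),
                       Bank_Balance = c(10000, 80000, 500000),
                       stringsAsFactors = FALSE)

# Keep the rows whose balance is below 50000
small <- accounts[accounts$Bank_Balance < 50000, ]

# order() gives the row positions in ascending order; rev() reverses them
descending <- accounts[rev(order(accounts$Bank_Balance)), ]

print(small)
print(descending)
```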
Merging data Frames: Merging of data frames implies combining two data
frames to generate a new data frame. This is equivalent to the Join
operation of relational algebra.
Merging of data Frames
Similar to the different forms of Join, we can merge two data frames to
obtain the results of an Equijoin, Right Outer Join, Left Outer Join, and
Cross Join or Cartesian product. The syntax is:
merge(df1, df2 [, by = {"common column name" | NULL}] [, all[.{x | y}] = TRUE])
Here, options within square brackets ([ ]) are optional. Options within a
pair of braces imply that one of them is to be used.
To illustrate the use of the function in different ways, we define two simple data frames as follows:
>df1=data.frame(EMPNO=c(1,5,7,8,9))
>df2=data.frame(EMPNO=c(3,5,7,4,2))
Now, we perform an equijoin as follows:
>merge(df1,df2)
EMPNO
1 5
2 7
We perform Left Outer Join as shown below:
>merge(df1,df2,by=”EMPNO”,all.x=TRUE)
The generated output is:
EMPNO
1 1
2 5
3 7
4 8
5 9
A Cross Join (Cartesian product), as obtained with by=NULL, gives the following output:
EMPNO.x EMPNO.y
1 1 3
2 5 3
3 7 3
4 8 3
5 9 3
6 1 5
7 5 5
8 7 5
9 8 5
10 9 5
11 1 7
12 5 7
13 7 7
14 8 7
15 9 7
16 1 4
17 5 4
18 7 4
19 8 4
20 9 4
21 1 2
22 5 2
23 7 2
24 8 2
25 9 2
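The join variants listed in the syntax can be compared side by side using the same two data frames. The result names (inner, left, right, full, cross) are assumptions introduced for this sketch.

```r
df1 <- data.frame(EMPNO = c(1, 5, 7, 8, 9))
df2 <- data.frame(EMPNO = c(3, 5, 7, 4, 2))

inner <- merge(df1, df2)                              # equijoin on the common column
left  <- merge(df1, df2, by = "EMPNO", all.x = TRUE)  # keep every row of df1
right <- merge(df1, df2, by = "EMPNO", all.y = TRUE)  # keep every row of df2
full  <- merge(df1, df2, by = "EMPNO", all = TRUE)    # keep rows of both
cross <- merge(df1, df2, by = NULL)                   # Cartesian product

nrow(inner)    # 2 (EMPNO 5 and 7)
nrow(left)     # 5
nrow(right)    # 5
nrow(full)     # 8
nrow(cross)    # 25 = 5 x 5
```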
The ranks printed imply that the first value in the order occurs in the first row,
the second value occurs in row number 4, and so on. The reverse ordering can be
obtained as shown below:
>ss[order(ss$Bank_Balance,decreasing=TRUE),]
  Persons Monetary_status  Educational_status Age Bank_Balance Marital_Status
3    Jadu            Rich      Lowly educated  50       500000          FALSE
6   Sidhu            Rich Moderately Educated  55        90000           TRUE
2   shyam    Middle class Moderately Educated  40        80000           TRUE
5    Jidu            poor      Lowly Educated  45        35000           TRUE
4   madhu            Poor     Highly Educated  35        25000          FALSE
1     Ram            Poor     Highly Educated  30        10000           TRUE
If we want to sort the data frame on the basis of multiple columns, each within
the previous one in the order mentioned, we can do it with the with() function,
as illustrated below for Bank_Balance within Persons.
>ss=ss[with(ss,order(Persons,Bank_Balance)),]
>ss[with(ss,order(Persons,Bank_Balance,decreasing=TRUE)),]
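Sorting on two keys is easiest to follow when the first key contains repeated values. The data frame staff below is hypothetical and was built with repeated names for that purpose.

```r
# Hypothetical data with repeated names so that the second key matters
staff <- data.frame(Persons = c("Ram", "Jadu", "Ram", "Jadu"),
                    Bank_Balance = c(500, 300, 100, 900),
                    stringsAsFactors = FALSE)

# Sort by Persons first; ties are broken by Bank_Balance
by_both <- staff[with(staff, order(Persons, Bank_Balance)), ]

# Single key, largest balance first
by_balance_desc <- staff[order(staff$Bank_Balance, decreasing = TRUE), ]

print(by_both)
print(by_balance_desc)
```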
Illustration of head() and tail() for a data frame df
Inspecting the rows of a dataframe
The first six rows of a data frame can be inspected with the head() function. For
the last six rows, the tail() function is used. The syntax of the functions is:
head(name_of_dataframe[, n]), where n is the number of rows to be
displayed from the beginning, 6 by default
tail(name_of_dataframe[, n]), where n is the number of last rows to be
displayed, 6 by default
>head(df)
term count
1 label 103
2 book 55
3 one 40
4 game 38
5 love 34
6 great 29
>head(df,3)
term count
1 label 103
2 book 55
3 one 40
>tail(df,2)
term count
1687 remast 1
1688 rerecord 1
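Since the data frame df used above is not defined in this lesson, the same calls can be tried on a small hypothetical data frame, here called counts.

```r
# Hypothetical ten-row data frame
counts <- data.frame(term = paste0("t", 1:10), count = 10:1,
                     stringsAsFactors = FALSE)

first_six   <- head(counts)      # n defaults to 6
first_three <- head(counts, 3)
last_two    <- tail(counts, 2)   # original row numbers are kept

nrow(first_six)       # 6
nrow(first_three)     # 3
rownames(last_two)    # "9" "10"
```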
The setdiff() and Conclusion on data frame
The setdiff() function
This function returns the elements that are in the first vector (minuend) but
not in the second vector (subtrahend). Its use is illustrated below:
>x=c(1:20)
>y=c(15:40)
>setdiff(x,y)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14
With the dplyr package loaded, this function can also be used to show the rows of
the first data frame which do not belong to the second one (base R's setdiff() is
intended for vectors).
Format: setdiff(df1,df2)
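Because the result depends on the order of the arguments, it helps to compute the difference in both directions. The names only_in_x and only_in_y are assumptions for this sketch, which uses vectors only.

```r
x <- 1:20
y <- 15:40

only_in_x <- setdiff(x, y)   # elements of x that are absent from y
only_in_y <- setdiff(y, x)   # elements of y that are absent from x

print(only_in_x)   # 1 to 14
print(only_in_y)   # 21 to 40
```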
Further important points about data frames to be borne in mind:
Before R 4.0.0, data.frame() turned strings into factors by default, so
stringsAsFactors = FALSE was needed to suppress this behaviour. From R 4.0.0
onwards, stringsAsFactors = FALSE is the default.
We can coerce an object to a data frame with as.data.frame():
1 A vector will create a one-column data frame.
2 A list will create one column for each element; it’s an error if
As soon as the entry of the inputs is over, R displays the values in the following matrix form.
Displaying the values in a matrix
[,1] [,2] [,3]
[1,] 10 14 18
[2,] 11 15 19
[3,] 12 16 20
[4,] 13 17 21
The function dim(x), where x is the variable to which a matrix has been assigned, returns an integer vector giving the number of
rows and columns of the matrix x; i.e. dim() gives the dimensions of the matrix. The following depicts its use.
>x=matrix(1:12,4)
>x
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
>dim(x)
[1] 4 3
>x[1,]
[1] 1 5 9
>x[,2]
[1] 5 6 7 8
>x[3,3]
[1] 11
>i=scan(n=1) # to accept the row number as input
1: 4
>j=scan(n=1,quiet=TRUE) # to accept the column number as input
1: 2
>x[i,j] # row and column numbers are given by the values of variables.
[1] 8
>class(x) # R 4.0.0 and later show [1] "matrix" "array"
[1] "matrix"
>class(x[1,])
[1] "integer"
>class(x[i,j])
[1] "integer"
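The indexing forms above can be gathered into one non-interactive sketch. It also shows that extracting a single row drops the matrix structure unless drop = FALSE is given, which is why class(x[1,]) reports "integer".

```r
x <- matrix(1:12, nrow = 4)       # filled column by column: 4 rows, 3 columns

dim(x)                            # 4 3
x[1, ]                            # first row: 1 5 9
x[, 2]                            # second column: 5 6 7 8
x[3, 3]                           # single element: 11

is.matrix(x)                      # TRUE
is.matrix(x[1, ])                 # FALSE: a single row becomes a plain vector
is.matrix(x[1, , drop = FALSE])   # TRUE: drop = FALSE keeps the 1 x 3 matrix
```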
R inputs into a matrix
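R script for generalized inputs
# matinput.r: reads a matrix row-wise from the keyboard and prints it
cat("How many rows? ")
nr=scan(n=1,quiet=TRUE)
cat("How many columns? ")
nc=scan(n=1,quiet=TRUE) # named nc so that the c() function is not masked
tot=nr*nc
cat("Enter the matrix elements row-wise for",nr,"rows and",nc,
"columns and press Enter after each value\n")
mat=matrix(scan(n=tot),nr,byrow=TRUE)
cat("The input matrix is shown below:\n")
for(i in 1:nr){
for(j in 1:nc){
cat(mat[i,j],' ')} # print the elements of one row on a single line
cat("\n") # move to the next line after each row
}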
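The effect of byrow=TRUE can also be seen without keyboard input. In this sketch the values that would be typed in are supplied as a hypothetical vector, chosen to reproduce the 4 x 3 matrix displayed earlier in this lesson.

```r
# Values as they would be entered row by row (hypothetical input)
values <- c(10, 14, 18,
            11, 15, 19,
            12, 16, 20,
            13, 17, 21)

mat <- matrix(values, nrow = 4, byrow = TRUE)   # fill one row at a time

# Print the matrix one row per line, as the script does
for (i in seq_len(nrow(mat))) {
  cat(mat[i, ], "\n")
}
```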
Matrix Multiplication
R supports multiplication of two matrices also. The multiplication
operator is %*%. So, if A is a matrix of order m x n and B is a matrix
of order n x p, the matrices are conformable for multiplication and we
can obtain the product by simply typing A%*%B at the R prompt. The
product, which can also be assigned to a variable, is another matrix of
order m x p, by the definition of matrix multiplication.
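The rule on orders can be verified directly with dim(). The matrices A and B below are hypothetical, with m = 2, n = 3 and p = 4.

```r
A <- matrix(1:6,  nrow = 2)   # order 2 x 3
B <- matrix(1:12, nrow = 3)   # order 3 x 4

P <- A %*% B                  # conformable: columns of A equal rows of B
dim(P)                        # 2 4

# Element [1,1] is the first row of A times the first column of B
P[1, 1]                       # 1*1 + 3*2 + 5*3 = 22
```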
Relevant functions for matrix manipulations
To obtain the vector of elements on the main diagonal of a matrix M, say,
we simply issue the command diag(M).
We can also obtain the transpose of a matrix M, which is the matrix obtained
from M by interchanging its rows and columns.
This can be achieved by simply typing t(M) at the R prompt and then
pressing ENTER.
We can also obtain the determinant of a matrix by simply entering
det(matrix-name). However, the matrix must be a square matrix.
Sometimes, we need to extract statistics from the rows or columns of a
matrix. Let f be a function that returns a number for any given vector
v. If M is a matrix, we can enter apply(M,1,f) to apply f to each row of
M; the result is a vector with one element for each row. To obtain the
corresponding result for each of the columns, we enter apply(M,2,f). We
can also find the eigenvalues and eigenvectors of a given matrix by
using the eigen() function. The above discussion is illustrated below:
Matrix manipulations illustrated
>x=matrix(1:6,3)
>y=matrix(3:8,2)
>x%*%y
[,1] [,2] [,3]
[1,] 19 29 39
[2,] 26 40 54
[3,] 33 51 69
>diag(x%*%y)
[1] 19 40 69
>t(x%*%y)
[,1] [,2] [,3]
[1,] 19 26 33
[2,] 29 40 51
[3,] 39 54 69
>f=function(v){return(sum(v))}
>apply(x,1,f)
[1] 5 7 9
>x=matrix(1:6,3)
>apply(x,1,f)
[1] 5 7 9
>x=matrix(1:4,2)
>v=c(1,2)
>f=function(v){return(sum(v))}
>apply(x,1,f)
[1] 4 6
>x
>x=matrix(1:4,2)
>det(x)
[1] -2
>det(m)
Error in determinant.matrix(x, logarithm = TRUE, ...) :
'x' must be a square matrix
>eigen(x)
eigen() decomposition
$values
[1] 5.3722813 -0.3722813
$vectors
Finally, we shall mention two more functions frequently used with matrices.
These are rownames( ) and colnames( ) functions. The former one is
used to name the rows of a matrix; whereas, the latter one is used to name
the columns. This is illustrated below:
>rownames(m)=c('r1','r2','r3')
>colnames(m)=c('c1','c2')
>m
c1 c2
r1 5 4
r2 2 5
r3 5 5
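Once rows and columns are named, the names can be used in place of numeric subscripts. The definition of m is not shown in this lesson, so the sketch below assumes a 3 x 2 matrix with the values displayed above.

```r
# Assumed definition of m, reproducing the values displayed above
m <- matrix(c(5, 2, 5, 4, 5, 5), nrow = 3)

rownames(m) <- c("r1", "r2", "r3")
colnames(m) <- c("c1", "c2")

m["r2", "c1"]   # same element as m[2, 1]
m[, "c2"]       # second column, with row names attached
```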