Data Analysis Using R - 3
3.1 Matrices
byrow is a logical flag. By default, matrix elements are arranged column-wise; by setting the parameter
byrow = TRUE we can arrange the elements row-wise.
Creating Matrices :
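As a minimal sketch of matrix creation, the call below builds a 4 x 3 matrix filled row-wise. The values 3:14 and the row and column names are assumptions, chosen to agree with the str() output shown for P further down; the original definition of P is not given in this section.

```r
# Row and column labels (assumed from the dimnames shown in the str() output).
row_names <- c("row1", "row2", "row3", "row4")
col_names <- c("col1", "col2", "col3")

# byrow = TRUE fills the matrix one row at a time.
P <- matrix(3:14, nrow = 4, byrow = TRUE,
            dimnames = list(row_names, col_names))
print(P)

# With the default (byrow = FALSE) the same values are filled column by column.
Q <- matrix(3:14, nrow = 4)
print(Q)
```

Comparing P and Q side by side shows the effect of byrow: P[1, 2] is 4, while Q[1, 2] is 7.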
str( ) : The str() function in R displays the internal structure of any R object in a compact way. It
is a compact alternative to displaying a full summary of the object, and is especially useful when the data
set is huge.
Syntax: str(object)
Let's find the structure of the above matrix P using str( ).
str(P)
Output: int [1:4, 1:3] 3 6 9 12 4 7 10 13 5 8 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:4] "row1" "row2" "row3" "row4"
..$ : chr [1:3] "col1" "col2" "col3"
dim( ) : The dim( ) function in R Language is used to get or set the dimension of the specified
matrix, array or data frame.
Syntax: dim(object)
Let's find the dimension of the above matrix P using dim( ).
dim(P)
Output:[1] 4 3
length( ) : In R, the length( ) function is used to get the length of the object. In simpler terms, it is
used to find out how many items are present in that object.
Syntax: length(object)
Let's find the length of the above matrix P using length( ).
length(P)
Output:[1] 12
rownames( ) : The rownames( ) function in R is used to get or set the names of the rows of a matrix.
Syntax: rownames(matrix_name) <- value
A = matrix(1:9, 3, 3, byrow = TRUE)
rownames(A) <- c("X","Y","Z")
print(A)
colnames( ) : The colnames( ) function in R is used to get or set the names of the columns of a matrix.
Syntax: colnames(matrix_name) <- value
A = matrix(1:9, 3, 3, byrow = TRUE)
colnames(A) <- c("X","Y","Z")
print(A)
Output: X Y Z
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
We can access the items by using [ ] brackets. The first number "1" in the bracket specifies the
The whole column can be accessed if we specify a comma before the number in the bracket:
m1[,2] #[1] "orange" "grape" "pineapple"
More than one row can be accessed if we use the c() function:
m1[c(1,2),] # [,1] [,2] [,3]
#[1,] "apple" "orange" "pear"
#[2,] "banana" "grape" "melon"
More than one column can be accessed if we use the c() function:
m1[,c(1,2)] # [,1] [,2]
#[1,] "apple" "orange"
#[2,] "banana" "grape"
#[3,] "cherry" "pineapple"
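The indexing forms above can be tried together in one short sketch. The matrix below is reconstructed from the outputs shown; the last element ("fig") does not appear in any of them and is an assumed placeholder.

```r
# Fruit matrix filled column by column; "fig" is an assumed value.
m1 <- matrix(c("apple", "banana", "cherry",
               "orange", "grape", "pineapple",
               "pear", "melon", "fig"),
             nrow = 3, ncol = 3)

m1[1, 2]        # single element: row 1, column 2
m1[2, ]         # whole second row (comma after the number)
m1[, 2]         # whole second column (comma before the number)
m1[c(1, 2), ]   # rows 1 and 2
m1[, c(1, 2)]   # columns 1 and 2
```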
rbind( ): We can use the rbind( ) function to add additional rows to a matrix:
Note: The new row must have as many elements as the existing matrix has columns.
Using this method we can add a row to an existing matrix:
m1<- matrix(letters[1:9], nrow = 3, ncol = 3)
m1
m2<- rbind(m1,letters[10:12])
m2
Output:
[,1] [,2] [,3] #m1
[1,] "a" "d" "g"
[2,] "b" "e" "h"
[3,] "c" "f" "i"
Using this method we can combine two matrices row-wise (concatenation of two matrices using rbind( )):
m1 <- matrix(1:9, nrow = 3, ncol = 3)
m1
m2<-matrix(10:12, nrow = 1, ncol = 3)
m2
m3<- rbind(m1,m2)
m3
Output:
[,1] [,2] [,3] #m1
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
cbind( ): We can use the cbind( ) function to add additional columns to a matrix:
Note: The new column must have as many elements as the existing matrix has rows.
Using this method we can add a column to an existing matrix:
m1<- matrix(letters[1:9], nrow = 3, ncol = 3)
m1
m2<- cbind(m1,letters[10:12])
m2
Output:
[,1] [,2] [,3] #m1
[1,] "a" "d" "g"
[2,] "b" "e" "h"
[3,] "c" "f" "i"
Using this method we can combine two matrices column-wise (concatenation of two matrices using
cbind( )):
m1<- matrix(1:4, nrow = 2, ncol = 2)
m2<- matrix(5:8,2,2)
m3<-cbind(m1,m2)
m3
Matrix Arithmetic:
Various arithmetic operations are performed on matrices using the R operators. The result of
the operation is also a matrix.
The dimensions (number of rows and columns) must be the same for the matrices involved in the
operation.
Arithmetic operations include addition (+), subtraction (-), multiplication (*), division (/) and
modulus (%%). These operators work element by element.
Creating two matrices to perform operations on:
# Create two 2x3 matrices.
matrix1 <- matrix(c(10, 9, 7, 6, 3, 6), nrow = 2)
print(matrix1)
matrix2 <- matrix(c(5, 2, 1, 4, 2, 4), nrow = 2)
print(matrix2)
Output:
Addition:
result <- matrix1 + matrix2
cat("Result of addition","\n")
print(result)
Output:
Result of addition
[,1] [,2] [,3]
[1,] 15 8 5
[2,] 11 10 10
Subtraction:
result <- matrix1 - matrix2
cat("Result of subtraction","\n")
print(result)
Output:
Result of subtraction
[,1] [,2] [,3]
[1,] 5 6 1
[2,] 7 2 2
Multiplication:
result <- matrix1 * matrix2
cat("Result of multiplication","\n")
print(result)
Output:
Result of multiplication
[,1] [,2] [,3]
[1,] 50 7 6
[2,] 18 24 24
Division:
result <- matrix1 / matrix2
cat("Result of division","\n")
print(result)
Output:
Result of division
[,1] [,2] [,3]
[1,] 2.0 7.0 1.5
[2,] 4.5 1.5 1.5
Scalar Arithmetic: We can do basic arithmetic operations between a matrix and a numeric scalar; the operation
is applied to every element. These operations won't work on character values.
Performing scalar arithmetic on matrix1 above:
result <- matrix1 + 1 # adding 1 into matrix
cat("Result of addition","\n")
print(result)
Output:
Result of addition
[,1] [,2] [,3]
[1,] 11 8 4
[2,] 10 7 7
Output:
Result of subtraction
[,1] [,2] [,3]
[1,] 8 5 1
[2,] 7 4 4
Output:
Result of multiplication
[,1] [,2] [,3]
[1,] 30 21 9
[2,] 27 18 18
Output:
Result of division
[,1] [,2] [,3]
[1,] 5.0 3.5 1.5
[2,] 4.5 3.0 3.0
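The code for the last three outputs is not shown above. The values are consistent with subtracting 2, multiplying by 3 and dividing by 2, so the sketch below assumes those scalars; they are inferred from the printed results, not stated in the text.

```r
matrix1 <- matrix(c(10, 9, 7, 6, 3, 6), nrow = 2)

sub_result <- matrix1 - 2   # subtract 2 from every element (assumed scalar)
mul_result <- matrix1 * 3   # multiply every element by 3 (assumed scalar)
div_result <- matrix1 / 2   # divide every element by 2 (assumed scalar)

cat("Result of subtraction", "\n")
print(sub_result)
cat("Result of multiplication", "\n")
print(mul_result)
cat("Result of division", "\n")
print(div_result)
```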
Extra:
Miscellaneous functions:
sum(matrix1) # [1] 41
rowSums(matrix1) # [1] 20 21
colSums(matrix1) # [1] 19 13 9
min(matrix1) # [1] 3
max(matrix1) # [1] 10
is.matrix(matrix1) # [1] TRUE
ncol(matrix1) # [1] 3
nrow(matrix1) # [1] 2
row=c("r1","r2","r3")
columns=c("c1","c2","c3")
M3<-matrix(1:9,nrow=3,dimnames=list(row,columns))
M4<-matrix(1:9,nrow=3, ncol=3)
rownames(M4)=c("r1","r2","r3")
colnames(M4)=c("c1","c2","c3")
M1[1,3]<-0
Transposition of Matrix:
Transpose of a matrix is an operation in which we convert the rows of the matrix into columns and the columns of the matrix into rows.
row=c("r1","r2")
columns=c("c1","c2")
M3<-matrix(1:4,nrow=2,dimnames=list(row,columns))
M3
M3<-t(M3)
M3
Output:
c1 c2
r1 1 3
r2 2 4
r1 r2
c1 1 2
c2 3 4
Deletion of Rows and Columns :
m<- matrix(c("apple", "banana", "cherry", "orange", "mango", "pineapple"), nrow = 3, ncol = 2)
m
#Remove the first row and the first column
m<- m[-c(1), -c(1)]
m
Output:
[,1] [,2]
[1,] "apple" "orange"
[2,] "banana" "mango"
[3,] "cherry" "pineapple"
[1] "mango" "pineapple"
3.2 Dataframes
Creating Dataframe :
Output:
friend_id friend_name
1 1 Sachin
2 2 Sourav
3 3 Dravid
4 4 Sehwag
5 5 Dhoni
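The code that produced the output above is missing. A minimal sketch that yields the same table follows; the variable name friend.data is an assumption, taken from the column header friend.data.friend_name that appears in a later output.

```r
# Build a data frame with one column per vector; all vectors must have equal length.
friend.data <- data.frame(
  friend_id = 1:5,
  friend_name = c("Sachin", "Sourav", "Dravid", "Sehwag", "Dhoni"),
  stringsAsFactors = FALSE   # keep names as character strings
)
print(friend.data)

# Compact view of column names, types and the first values.
str(friend.data)
```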
In R, we can inspect the structure of a data frame. R provides a built-in function called str() which
displays the data together with its complete structure.
To manipulate the data in a data frame, it is
essential to extract it first. We can extract the data in three ways:
1. We can extract specific columns from a data frame using the column name.
2. We can also extract specific rows from a data frame.
3. We can extract specific rows corresponding to specific columns.
Output:
friend.data.friend_name
1 Sachin
2 Sourav
3 Dravid
4 Sehwag
5 Dhoni
friend[[2]] #accesses the 2nd column and returns its data as a vector
friend[2] #accesses the 2nd column and returns it as a one-column data frame (a list)
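The three ways of extracting data listed earlier can be sketched as follows. The data frame is rebuilt here so the block runs on its own; the name friend.data and the particular rows chosen are assumptions for illustration.

```r
friend.data <- data.frame(
  friend_id = 1:5,
  friend_name = c("Sachin", "Sourav", "Dravid", "Sehwag", "Dhoni"),
  stringsAsFactors = FALSE
)

# 1. A specific column, selected by name.
name_col <- data.frame(friend.data$friend_name)

# 2. Specific rows (here the first two), keeping all columns.
first_two <- friend.data[1:2, ]

# 3. Specific rows of a specific column.
picked <- friend.data[c(2, 4), "friend_name"]

# [[ ]] gives the column contents as a vector; [ ] keeps a data frame.
as_vector <- friend.data[[2]]
as_frame  <- friend.data[2]

print(name_col)
print(first_two)
print(picked)
```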
Data Reshaping: Adding rows and columns, Merge Dataframes, Melting and Casting of
Dataframe.
Adding rows and columns:
R allows us to modify our data frame. As with matrices, we can modify
a data frame through re-assignment.
We can not only add rows and columns but also delete them. The data frame is
expanded by adding rows and columns.
We can
1. Add a column by supplying a column vector under a new column name using the cbind()
function.
2. Add rows by supplying new rows in the same structure as the existing data frame and using the rbind()
function.
3. Delete columns by assigning a NULL value to them.
4. Delete rows by re-assigning the data frame without them.
stud.data<- data.frame(
stud_id = c (1:5),
stud_name = c("Shubham","Arpita","Nishka","Gunjan","Sumit"),
age=c(17,18,18,17,18),
stringsAsFactors = FALSE)
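x<-data.frame(stud_id=6,stud_name="Vaishali",age=18,stringsAsFactors=FALSE) #new row; values taken from the output shown later in this section
stud.data<-rbind(stud.data,x)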
stud.data
stud.data<-cbind(stud.data,mob=c(1234,3456,5478,2354,5725,2354))
stud.data
Output:
Merging 2 dataframes:
Using cbind( ):
stud_info<-data.frame(stud_id=1:6,add=c("Pune","Mumbai","Nashik","Beed","Mumbai","Pune"))
stud_info
Output:
stud_id add
1 1 Pune
2 2 Mumbai
3 3 Nashik
4 4 Beed
5 5 Mumbai
6 6 Pune
sinfo<-cbind(stud.data,stud_info)
sinfo
Output:
Using rbind( ):
emp<-data.frame(eid=1:3,ename=c("jack","john","mickey"))
emp
ep<-data.frame(eid=4:6,ename=c("jacky","max","jerry"))
ep
Output:
eid ename
1 1 jack
2 2 john
3 3 mickey
eid ename
1 4 jacky
2 5 max
3 6 jerry
emp_info<-rbind(emp,ep)
emp_info
Output:
eid ename
1 1 jack
2 2 john
3 3 mickey
4 4 jacky
5 5 max
6 6 jerry
stud.data$mob[1]=8779
Merge Dataframes:
A dataframe is made up of three principal components: the data, rows, and columns.
In R we use the merge() function, which is part of base R, to merge two dataframes. The dplyr package
provides a separate family of join functions (such as inner_join() and left_join()) for the same purpose.
The most important condition for joining two dataframes is that the columns on which the merging
happens should be of the same type.
The merge() function works like a join in a DBMS. The types of merging available in R are:
1. Natural Join or Inner Join
2. Left Outer Join
3. Right Outer Join
4. Full Outer Join
5. Anti Join
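Basic Syntax of merge() function in R:
Syntax: merge(x = df1, y = df2, by, by.x, by.y, all, all.x, all.y, sort = TRUE)
Parameters:
df1: one dataframe df2: another dataframe
by, by.x, by.y: The names of the columns used for matching. Use by when the column has the same name in df1 and df2, and by.x and by.y when the names differ.
all, all.x, all.y: Logical values that specify the type of merge. all.x = TRUE keeps every row of df1 (left outer join), all.y = TRUE keeps every row of df2 (right outer join), and all = TRUE keeps both (full outer join).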
# Data frame 1
df1 = data.frame(StudentId = c(101:106),
Subject = c("Hindi", "English","Maths", "Science","History","Physics"))
df1
Output:
StudentId Subject
1 101 Hindi
2 102 English
3 103 Maths
4 104 Science
5 105 History
6 106 Physics
# Data frame 2
df2 = data.frame(StudentId = c(102, 104, 106,107, 108),
State = c("Mangalore", "Mysore","Pune", "Dehradun", "Delhi"))
df2
Output:
StudentId State
1 102 Mangalore
2 104 Mysore
3 106 Pune
4 107 Dehradun
5 108 Delhi
Output:
StudentId Subject State
1 101 Hindi NA
2 102 English Mangalore
3 103 Maths NA
4 104 Science Mysore
5 105 History NA
6 106 Physics Pune
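The output above has the form of a left outer join: every row of df1 is kept, and State is NA where df2 has no match. As a sketch, the four merge() variants can be written as below, using the same df1 and df2; the result names (inner, left, right, full) are chosen here for illustration.

```r
df1 <- data.frame(StudentId = c(101:106),
                  Subject = c("Hindi", "English", "Maths", "Science", "History", "Physics"))
df2 <- data.frame(StudentId = c(102, 104, 106, 107, 108),
                  State = c("Mangalore", "Mysore", "Pune", "Dehradun", "Delhi"))

inner <- merge(df1, df2, by = "StudentId")                 # only matching ids
left  <- merge(df1, df2, by = "StudentId", all.x = TRUE)   # all rows of df1
right <- merge(df1, df2, by = "StudentId", all.y = TRUE)   # all rows of df2
full  <- merge(df1, df2, by = "StudentId", all = TRUE)     # all rows of both

print(left)
```

The row counts differ by join type: 3 for the inner join, 6 for the left, 5 for the right and 8 for the full outer join.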
5. Anti Join
In terms of set theory, an anti-join is a set difference operation. For example, if A = (1, 2, 3, 4) and B
= (2, 3, 5), then A - B is the set (1, 4). This join is somewhat like df1 - df2, as it
selects all rows from df1 that are not present in df2.
# Import required library
library(dplyr)
StudentId Subject
1 101 Hindi
2 103 Maths
3 105 History
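The anti-join call itself is missing above. With dplyr loaded, anti_join(df1, df2, by = "StudentId") would be the usual form. The sketch below is a base R stand-in (not the dplyr function) that gives the same rows without any extra package; the name anti is chosen here for illustration.

```r
df1 <- data.frame(StudentId = c(101:106),
                  Subject = c("Hindi", "English", "Maths", "Science", "History", "Physics"))
df2 <- data.frame(StudentId = c(102, 104, 106, 107, 108),
                  State = c("Mangalore", "Mysore", "Pune", "Dehradun", "Delhi"))

# Keep the rows of df1 whose StudentId does not occur in df2.
anti <- df1[!(df1$StudentId %in% df2$StudentId), ]
rownames(anti) <- NULL   # renumber the rows from 1
print(anti)
```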
6. Cross Join
A cross join, also known as a cartesian join, results in every row of one dataframe being joined to
every row of another dataframe. In set theory, this type of join is known as the cartesian
product of two sets.
frame1 = data.frame(s1=c(2,50,71))
frame1
frame2 = data.frame(s2=c(11,38,90))
frame2
df = merge(x = frame1, y= frame2 , by=NULL)
df
Output:
s1
1 2
2 50
3 71
s2
1 11
2 38
3 90
s1 s2
1 2 11
2 50 11
3 71 11
4 2 38
5 50 38
6 71 38
7 2 90
8 50 90
9 71 90
1. Melting in R:
Melting in R programming is done to reorganize the data. The melt function takes data in wide
format and stacks a set of columns into a single column of data.
It is performed using the melt() function. To make use of this function we need to specify a data frame,
the constant variables, i.e. the id variables, and the measured variables (columns of data) to be
stacked.
By default, the measured variables are assumed to be all columns that are not specified as id
variables.
Using melt(), a dataframe in wide format is converted into long format.
Creating a dataframe:
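The example data frame is missing here. melt() is provided by add-on packages such as reshape2 rather than base R, so the sketch below uses base R only, as a stand-in that shows what melting does to the data. The data frame, its column names and the values are all hypothetical.

```r
# Hypothetical wide-format data: one row per student, one column per subject.
wide <- data.frame(id = c(1, 2, 3),
                   math = c(90, 75, 60),
                   science = c(85, 70, 65))

measure_vars <- c("math", "science")   # columns to stack; id is the id variable

# Stack the measured columns into variable/value pairs, repeating the id column.
long <- data.frame(
  id = rep(wide$id, times = length(measure_vars)),
  variable = rep(measure_vars, each = nrow(wide)),
  value = unlist(wide[measure_vars], use.names = FALSE)
)
print(long)
```

Each of the 3 students now contributes one row per subject, so the long form has 6 rows and a single value column.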
2. Casting in R:
Casting in R programming is used to reshape the molten data using the cast() function, which takes
an aggregate function and a formula to aggregate the data accordingly.
This function is used to convert long format data back into some aggregated (wide) format of data
based on the formula in the cast( ) function.
Applying cast( ) function to molten data:
Output:
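The cast example and its output are missing here. cast() comes from the reshape package (reshape2 offers dcast() and acast()), so the sketch below is a base R stand-in using tapply() to show the same idea: long data are spread back to one column per variable and aggregated with a function. The data and names are hypothetical.

```r
# Hypothetical long-format (molten) data.
long <- data.frame(
  id = c(1, 2, 3, 1, 2, 3),
  variable = c("math", "math", "math", "science", "science", "science"),
  value = c(90, 75, 60, 85, 70, 65)
)

# One row per id, one column per variable; mean is the aggregate function.
wide <- tapply(long$value, list(long$id, long$variable), mean)
print(wide)
```

The result is a matrix whose row names are the ids and whose column names are the variables.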
Deleting data :
stud.data<-stud.data[-1,-1] #removes 1st row and 1st col of dataframe
stud.data
Output:
stud_name age mob
2 Arpita 18 3456
3 Nishka 18 5478
4 Gunjan 17 2354
5 Sumit 18 5725
6 Vaishali 18 2354
stud.data$age<-NULL
Output:
stud_name mob
2 Arpita 3456
3 Nishka 5478
4 Gunjan 2354
5 Sumit 5725
6 Vaishali 2354
Sorting Dataframe:
Sorting a DataFrame allows us to reorder the rows based on the values in one or more columns.
This can be useful for various purposes, such as organizing data for analysis or presentation.
Methods to sort a dataframe:
1. Using order( ) function: This function is used to sort the dataframe based on a particular column
in the dataframe.
We can also use the order function with a character vector. Note that ordering a character variable
means ordering it alphabetically.
Syntax: dataframe[order(dataframe$column_name, decreasing = TRUE), ]
where
dataframe is the input dataframe
column_name is the column on which the dataframe is sorted
decreasing specifies the sorting order:
if it is TRUE the dataframe is sorted in descending order; otherwise, in increasing order
Example:
# create dataframe with roll no and
# subjects columns
data = data.frame(
rollno = c(1, 5, 4, 2, 3),
subjects = c("java", "python", "php", "sql", "c"))
print(data)
Output:
rollno subjects
1 1 java
2 5 python
3 4 php
4 2 sql
5 3 c
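The example above only creates the data frame. A minimal sketch of the sorting step follows, using order() to compute the row positions and indexing the data frame with them; the result names (asc, desc, alpha) are chosen here for illustration.

```r
data <- data.frame(
  rollno = c(1, 5, 4, 2, 3),
  subjects = c("java", "python", "php", "sql", "c"))

# order() returns row positions; indexing with them reorders the rows.
asc   <- data[order(data$rollno), ]                      # ascending by rollno
desc  <- data[order(data$rollno, decreasing = TRUE), ]   # descending by rollno
alpha <- data[order(data$subjects), ]                    # alphabetical by subject

print(asc)
print(desc)
print(alpha)
```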
2. The order( ) function returns the index of each element in sorted order.
By default we get the indices in ascending order. For sorting indices in descending order use
decreasing = TRUE.
3. The rank() function assigns a rank to each element, i.e. it tells us which position the element
will take after sorting.
For example, rank() tells us that the first value in the original vector is the smallest (rank = 1) and
the second value in the original vector is the largest (rank = 4).
The following code shows how to use the sort(), order(), and rank() functions:
x <- c(0, 20, 10, 15) # vector implied by the outputs below
sort(x) #[1] 0 10 15 20
sort(x,decreasing=TRUE) #[1] 20 15 10 0
order(x) #[1] 1 3 4 2
order(x,decreasing=TRUE) #[1] 2 4 3 1
rank(x) #[1] 1 4 2 3
rank(-x) #[1] 4 1 3 2
3.3 List
Lists are R objects which can contain elements of different types, such as numbers, vectors,
strings and other lists.
A list can also contain a function or a matrix as one of its elements.
A list is a generic vector which contains other objects.
Lists are one-dimensional, heterogeneous data structures.
A list is a data structure which has components of mixed data types.
Each component in a list can have a different length.
Creating List
list1 <- list(29, 32, 34) # list with similar type of data
list1
list2 <- list("Ranjy", 38, TRUE) # list with different type of data
list2
Output:
#list1 #list2
[[1]] [[1]]
[1] 29 [1] "Ranjy"
[[2]] [[2]]
[1] 32 [1] 38
[[3]] [[3]]
[1] 34 [1] TRUE
OR
student<-list(rno=1:3,name=c("A","B","C"))
student
Output:
$rno
[1] 1 2 3
$name
[1] "A" "B" "C"
Output:
#list1 #list2
List of 3 List of 3
$ : num 29 $ : chr "Ranjy"
$ : num 32 $ : num 38
$ : num 34 $ : logi TRUE
#student
List of 2
$ rno : int [1:3] 1 2 3
$ name: chr [1:3] "A" "B" "C"
Difference between [] and [[]]: when we use single square brackets [] we get the output as a
list, and when we use double square brackets [[]] we get only the element itself,
not a list, e.g. a vector.
Let's understand this with an example:
Using []
stud<-student["name"]#extracting name from student and storing it in stud
stud
class(stud)
Output:
$name
[1] "A" "B" "C"
[1] "list"
Using[[]]
studi<-student[["name"]]
studi
class(studi)
Output:
[1] "A" "B" "C"
[1] "character"
student$marks=c(50,67,84)
student[["city"]]<-c("M","P","N")
student
Output:
$rno
[1] 1 2 3
$name
[1] "A" "B" "C"
$marks
[1] 50 67 84
$city
[1] "M" "P" "N"
Inserting a specific value
student[[1]][[4]]<-4
student[[2]][[4]]<-"D"
student[["marks"]][[4]]<-45
student[["city"]][[4]]<-"P"
student
Output:
$rno
[1] 1 2 3 4
$name
[1] "A" "B" "C" "D"
$marks
[1] 50 67 84 45
$city
[1] "M" "P" "N" "P"
length( ): The length function applied to a list gives the length of the list, i.e. it shows how many elements
there are in the list.
length(student)
Output: [1] 4
Merging 2 lists:
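The most common way is to use the c() function, which combines two lists into one:
mob<-list(mobile=c(1234,4572,5849,9827)) #values taken from the $mobile output below
studinfo<-c(student,mob)
studinfo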
Output:
$rno
[1] 1 2 3 4
$name
[1] "A" "B" "C" "D"
$marks
[1] 50 78 84 45
$city
[1] "a" "b" "d" NA
$mobile
[1] 1234 4572 5849 9827
Output:
$rno
[1] 1 2 3 4
$name
[1] "A" "B" "C" "D"
$marks
[1] 50 78 84 45
$city
[1] "a" "b" "d" NA
$mobile
[1] 1234 4572 5849 9827
$shift
[1] "FS" "SS" "FS" "SS"
[[2]]
[1] "cherry"
Sys.time(): Sys.time() function is used to return the system's date and time.
>Sys.time() # Output:[1] "2023-10-10 17:36:23 UTC"
Sys.timezone(): Sys.timezone() function is used to return the current time zone.
>Sys.timezone() # Output:[1] "Etc/UTC"
install.packages("lubridate")
library("lubridate")
cat("Year: ",year(datetime)) #Year:2023
cat("Month: ",month(datetime)) #Month:9
cat("Day: ",day(datetime)) #Day:27
cat("Minute: ",minute(datetime)) #Minute:30
cat("Second: ",second(datetime)) #Second:0
# Date arithmetic
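The definition of datetime and the date arithmetic example are missing above. The sketch below uses base R only, as a stand-in for the lubridate accessors, and then shows simple date arithmetic. The hour (10) and the two dates used for the arithmetic are assumptions made for illustration.

```r
# A fixed date-time; the hour is assumed, the other parts follow the outputs above.
datetime <- as.POSIXct("2023-09-27 10:30:00", tz = "UTC")

yr  <- as.integer(format(datetime, "%Y"))   # year
mon <- as.integer(format(datetime, "%m"))   # month
dy  <- as.integer(format(datetime, "%d"))   # day
mnt <- as.integer(format(datetime, "%M"))   # minute
cat("Year:", yr, "Month:", mon, "Day:", dy, "Minute:", mnt, "\n")

# Date arithmetic: subtracting dates gives a difference in days,
# and adding a number to a date moves it forward by that many days.
d1 <- as.Date("2023-09-27")
d2 <- as.Date("2023-10-10")
gap <- d2 - d1
next_week <- d1 + 7
print(gap)
print(next_week)
```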
3.5 Strings in R:
Length of String
The length of a string is the number of characters present in the string. The built-in R function
nchar( ) can be used to determine the length of strings in R.
Syntax: nchar(x)
str <- "Hello World!"
nchar(str) # Output: [1] 12
Concatenating Strings
Strings in R are combined using the paste() function. It can take any number of
arguments to be combined together.
Syntax: paste(..., sep = " ", collapse = NULL)
str1 <- "Hello"
str2 <- "World"
paste(str1, str2) # Output: [1] "Hello World"
OR
paste("Hello","World")
Cat Function ( )
If you want the line breaks to be inserted at the same position as in the code, use the cat()
function:
str <- "Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua."
cat(str)
Output:
Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua.
Control statements are expressions used to control the execution and flow of the program based on
the conditions provided in the statements. The control statements covered here are:
- if-else
- For loop
- While loop
- Repeat loop
- Next, break
- Switch case
3.6.1 if-else :
The if statement contains a logical condition that needs to be evaluated and checked. There is a
block of code under if statement which gets executed and returns output once the logical
condition is satisfied or is TRUE. Otherwise, if the logical condition doesn’t get satisfied or is
FALSE, the block of code does not get executed.
Then comes the else part of this statement. This part allows us to execute a block of code when
the logical condition mapped under the if statement gives output as FALSE.
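Syntax:
if(expression){
statements
} else {
statements
}
Note: else must be on the same line as the closing brace of the if block; otherwise R reports a syntax error when the code is run at top level.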
Example:
x<- 5
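The example above is cut off after the first assignment. A minimal sketch of a complete if-else follows; the threshold of 3 and the messages are assumptions chosen only for illustration.

```r
x <- 5

# The condition is evaluated once; exactly one of the two blocks runs.
if (x > 3) {
  result <- "x is greater than 3"
} else {
  result <- "x is not greater than 3"
}
print(result)   # [1] "x is greater than 3"
```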
3.6.2 For Loop:
It is a type of loop in which a sequence of statements is executed repeatedly, once for each value in a vector, until the exit condition is reached.
Syntax:
for(value in vector){
statements
}
Example:
x <- letters[4:10]
for(i in x)
{
print(i)
}
#Output:[1] "d"
[1] "e"
[1] "f"
[1] "g"
[1] "h"
[1] "i"
[1] "j"
3.6.3 While Loop:
A while loop is another kind of loop; it keeps iterating as long as a condition is TRUE. The test expression is
checked first, before executing the body of the loop.
Syntax:
while(expression){
statement
}
Example:
x = 1
# Print 1 to 5
while(x <= 5)
{
print(x)
x = x + 1
}
#Output:[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Repeat Loop:
repeat is a loop which can iterate any number of times; it has no exit condition of its own.
So, a break statement is used to exit from the loop. The break statement can be used
in any type of loop to exit from it.
Syntax:
repeat {
statements
if(expression) {
break
}
}
Example:
x = 1
# Print 1 to 5
repeat{
print(x)
x = x + 1
if(x > 5)
{
break
}
}
#Output:[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Next Statement:
next statement is used to skip the current iteration without executing the further statements and
continues the next iteration cycle without terminating the loop.
Syntax:
if (test_condition) {
next}
Example:
x <- 1:10
# Print even numbers
for(i in x)
{
if(i%%2 != 0)
{
next #Jumps to next loop
}
print(i)
}
#Output:[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
Break Statement:
The break keyword is a jump statement that is used to terminate the loop at a particular iteration.
Syntax:
if (test_expression) {
break
}
Example:
a<-1
while (a < 10)
{
print(a)
if(a==5)
break
a = a + 1
}
#Output:[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Switch Case :
A switch statement is a selection control mechanism that selects one of several expressions or values
based on the value of a given expression.
Switch case statements are a substitute for long if statements that compare a variable to several
values.
Switch case in R is a multiway branch statement. It allows a variable to be tested for equality
against a list of values. In R, the expression may evaluate to a number (selecting a case by position) or to a character string (selecting a case by name).
The basic syntax of the switch function is as follows:
switch(EXPR, CASE1, CASE2, ..., CASEN, DEFAULT)
Example:
day <- "Monday"
message <- switch(
day,
"Monday" = "It's the start of the workweek.",
"Tuesday" = "You're in the middle of the workweek.",
"Wednesday" = "It's hump day!",
"Thursday" = "Almost there, one more day to the weekend.",
"Friday" = "Happy Friday! It's the weekend soon.",
"Saturday" = "Enjoy your weekend!",
"Sunday" = "It's still the weekend.",
"Unknown day"
)
print(message) #Output: [1] "It's the start of the workweek."
sapply() function
sapply() function takes a list, vector or data frame as input and gives output as a vector or matrix.
It is useful for operations on list objects: it applies the function to each element and simplifies the result where possible.
The sapply function in R does the same job as the lapply() function but returns a vector instead of a list.
Syntax: sapply(X, FUN)
- X: the input vector or object.
- FUN: the function that is to be applied to the input data.
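As a short sketch of the difference described above, the block below applies mean() to each element of a list with both sapply() and lapply(). The list contents and variable names are hypothetical.

```r
# Hypothetical input: a named list of numeric vectors.
nums <- list(a = 1:5, b = c(10, 20, 30))

means_vec  <- sapply(nums, mean)   # simplified to a named numeric vector
means_list <- lapply(nums, mean)   # same values, kept as a list

# sapply() also accepts an anonymous function and a plain vector.
squares <- sapply(1:3, function(v) v^2)

print(means_vec)
print(means_list)
print(squares)
```

sapply() is convenient when each call returns a single value, because the result can be used directly as a vector.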