R Intro


What is R?

• R is a programming language and software
environment for statistical analysis and
graphics representation.
• R is a dialect of the S language.
• R was created by Ross Ihaka and Robert
Gentleman at the University of Auckland, New
Zealand, and is currently developed by the R
Development Core Team.
History of R
• S is a language that was developed by John Chambers
and others at the old Bell Telephone Laboratories,
originally part of AT&T Corp., starting in 1976.
• In 2004, Insightful purchased the S language from
Lucent for $2 million and sold its implementation under
the product name S-PLUS, adding a number of features.
• In 1991, Ross Ihaka and Robert Gentleman developed R
as a free software environment for their teaching
classes when they were colleagues at the University of
Auckland in New Zealand.
• In addition, many other people have contributed new
code and bug fixes to the project.
• Early 1990s: The development of R began.
• August 1993: The software was announced on the
S-news mailing list.
• June 1995: The code was made available under
the Free Software Foundation’s GNU General
Public License (GPL), Version 2.
• February 2000: The first version of R, version 1.0.0,
was released.
• October 2004: Release of R version 2.0.0.
• April 2013: Release of R version 3.0.0.
• April 2015: Release of R version 3.2.0.
• July 2019: Release of R version 3.6.1.
Importance of R
• R is free, open-source code
• R runs anywhere
• R supports extensions
• R provides an engaged community
• R connects with other languages
• Running code without a compiler
• The Ultimate Statistical Analysis Kit
• Benefits of Charting
Limitation of R
• Lack of packages.
• R commands give little thought to memory
management, and so R can consume all
available memory.
• Memory management, speed, and efficiency
are probably the biggest challenges R faces.
• R isn't just for advanced programmers.
R Resources
• https://cran.r-project.org
R for Ubuntu
• There are two ways to install R in Ubuntu. One is
through the terminal, and the other is through the
Ubuntu Software Center.
1. Through Terminal -
• Press Ctrl+Alt+T to open Terminal
• Then execute sudo apt-get update
• After that, sudo apt-get install r-base

2. Through Ubuntu Software Center -
• Open Ubuntu Software Center
• Search for r-base
• Click Install
• Then run R by executing R in the Terminal
R console
RStudio
• RStudio is an integrated development
environment for R.
• It includes a console, syntax-highlighting
editor that supports direct code execution, as
well as tools for plotting, history, debugging
and workspace management.
Arithmetic Operations
• Every statistical analysis involves a lot of
calculations, and calculation is what R is
designed for — the work that R does best.
1. Basic arithmetic operators-
• These operators are used in just about every
programming language.
2. Mathematical functions-
3. Relational operators-
4. Logical operators-
5. Assignment operators-

6. Vector operations-
• Vector operations are functions that make
calculations on a complete vector, like sum().
• Each result depends on more than one value of the
vector.
7. Matrix operations-
• These functions are used for operations and
calculations on matrices.
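The operator families listed above can be tried directly at the R console. The following is a minimal sketch; the values and the vector name v are chosen only for illustration.

```r
# Basic arithmetic operators
7 + 2      # addition
7 - 2      # subtraction
7 * 2      # multiplication
7 / 2      # division: 3.5
7 ^ 2      # exponentiation: 49
7 %% 2     # remainder: 1
7 %/% 2    # integer division: 3

# Mathematical functions
sqrt(16)            # 4
abs(-3)             # 3
round(3.14159, 2)   # 3.14

# Relational operators return logical values
3 < 5      # TRUE
3 == 5     # FALSE

# Logical operators combine logical values
(3 < 5) & (2 > 4)   # FALSE
(3 < 5) | (2 > 4)   # TRUE
!TRUE               # FALSE

# Vector operations compute one result from a whole vector
v <- c(1, 2, 3, 4)
sum(v)     # 10
mean(v)    # 2.5
```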
Objects
• To create new variables, you will need to use the
assignment operator (<-).
• Instead of declaring data types, as is done in C++ and Java, in
R the user assigns R objects to variables. The most popular
objects are:
• Vectors
• Factors
• Lists
• Data Frames
• Matrices
• The data type of the object in R becomes the data type of
the variable by definition.
• R's basic data types are character, numeric, integer,
complex, and logical.
vector
• A vector is the simplest type of data structure in R. A
vector is a sequence of data elements of the same
basic type.
• There are six data types of the simplest object - vector:
1. Logical
2. Numeric
3. Integer
4. Character
5. Raw
6. Complex
• If you want to check the variable type, use class().
• A vector is a sequence of elements that share
the same data type. These elements are
known as components of a vector.
• R vectors come in two kinds: atomic
vectors and lists.
• All elements of an atomic vector must be of
the same type, whereas the elements of a list
can have different types.
Atomic Vectors in R

• There are four common types of R atomic
vectors:
1. Numeric Data Type
• Decimal values are referred to as numeric data
types in R. If we assign a decimal value for any
variable g, as given below then, g will become
a numeric type.
2. Integer Data Type
• A numeric value with no fractional part can be stored as an integer.
• -54 and 23 are two examples of integers. R stores an integer in
4 bytes (32 bits), so the largest integer is 2147483647. R has no separate long integer type.
• In order to assign an integer to a variable, there are two
ways:
1) The first way is to use the as.integer() function:
a <- as.integer(4)
• For checking data type:
• typeof(a)
2) The second way is the appending of L to the value:
b <- 4L
• For checking data type:
• typeof(b)
3. Character Data Type
• A character value is a string of text. There are
two ways to create a character value in R:
1. The first method is to type a string between double quotes " "
• x = "Rstudio"
• For determining the type of x:
• typeof(x)
2. To convert a number into a character value, make use
of the as.character() function as follows:
• y = as.character(42)
• For determining the type of y:
• typeof(y)
4. Logical Data Type
• A logical data type returns either of the two
values – TRUE or FALSE based on which
condition is satisfied.
• For example:
• a =3; b =6
• g = a>b
• g #print the logical value
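The four atomic types described above can be checked with typeof() and class(). This is a small sketch; the value 14.5 assigned to g is only an illustrative choice, since the original slide's example value is not shown.

```r
g <- 14.5                 # a decimal value
typeof(g)                 # "double"
class(g)                  # "numeric"

a <- as.integer(4)        # integer via conversion
b <- 4L                   # integer via the L suffix
typeof(b)                 # "integer"
.Machine$integer.max      # 2147483647, the largest 32-bit integer

y <- as.character(42)     # the string "42"
typeof(y)                 # "character"

z <- 3 > 6                # a comparison yields a logical value
typeof(z)                 # "logical"
```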
How to Create Vector in R?

• The c() function is used for creating a vector in R. This
function returns a one-dimensional array, also known
as a vector.
• For example:
• x <- c(1,2,3,4)
• There are several other ways of creating a vector:
1. Using the colon (:) operator
• x <- 1:5
• The colon operator can also count downwards:
• y <- 5:-5
• y
2. Create R vector using seq() function
• There are also two ways in this. The first way is to
set the step size and the second method is by
setting the length of the vector.
1) Setting the step size with the ‘by’ argument:
• seq(2,4, by = 0.4)
• [1] 2.0 2.4 2.8 3.2 3.6 4.0
2) Specifying the length of the vector with the ‘length.out’
argument:
• seq(1,4, length.out = 5)
• [1] 1.00 1.75 2.50 3.25 4.00
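The three ways of creating a vector can be compared side by side. A minimal sketch, with variable names (up, down, s1, s2) chosen only for illustration:

```r
x <- c(1, 2, 3, 4)                 # combine individual values
up <- 1:5                          # 1 2 3 4 5
down <- 5:-5                       # counts down from 5 to -5
length(down)                       # 11

s1 <- seq(2, 4, by = 0.4)          # fixed step size
s2 <- seq(1, 4, length.out = 5)    # fixed number of elements
length(s1)                         # 6
length(s2)                         # 5
```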
How to Access Elements of R Vectors?

• With the help of vector indexing, we can access
the elements of vectors. Indexing denotes the
position where the values in a vector are stored.
1. Indexing with Integer Vector
• Unlike many programming languages such as Python,
C++ and Java, where indexing starts from 0,
the indexing of vectors in R starts with 1.
• X <- c(1,2,3)
• X[1]
• O/p = ?
2. Indexing with Character Vector
• Character vector indexing can be done as
follows:
• x <- c("One" = 1, "Two" = 2, "Three" = 3)
• x["Two"]
3. Indexing with Logical Vector
• In logical indexing, the elements at the
positions where the logical vector is
TRUE are returned.
• a <- c(1,2,3,4)
• a[c(TRUE, FALSE, TRUE, FALSE)]
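The three indexing styles can be sketched together as follows. The vector named is a hypothetical name used here so that the integer and character examples do not overwrite each other.

```r
# Integer index: positions start at 1
X <- c(10, 20, 30)
X[1]                            # 10

# Character index: look elements up by name
named <- c("One" = 1, "Two" = 2, "Three" = 3)
named["Two"]                    # the element named "Two", value 2

# Logical index: keep positions marked TRUE
a <- c(1, 2, 3, 4)
a[c(TRUE, FALSE, TRUE, FALSE)]  # 1 3

# A comparison produces a logical vector, so it can be used as an index
a[a > 2]                        # 3 4
```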
Operations in R Vector
1. Combining Vector in R
• Functions are used to combine vectors. In order
to combine the two vectors in R, we will create
two new vectors ‘n’ and ‘s’. Then, we will create
another vector that will combine these two using
c(n,s) as follows:
• n = c(1, 2, 3, 4)
• s = c("Hadoop", "Spark", "HIVE", "Flink")
• c(n,s)
• Output: ?
2. Arithmetic Operations on Vectors in R
• Arithmetic operations on vectors can be
performed member-by-member.
• Suppose we have two vectors a and b:
• a = c (1, 3)
• b = c (1, 3)
• For Addition:
• a+b
• For subtraction:
• a-b
• For division:
• a/b
• For remainder operation:
• a %% b
3. Logical Index Vector in R -
• By using a logical index vector in R, we can
form a new vector from a given vector. The logical
vector has the same length as the original vector.
• A member of the logical vector is TRUE if the corresponding
member of the original vector is to be included in the slice,
and FALSE otherwise.
• S = c("bb", "cc")
• L = c(TRUE, TRUE)
• S[L]
• o/p = "bb" "cc"
4. Numeric Index
• For indexing a numerical value in R, we specify
the index between square braces [ ].
• If our index is negative, then R will return us
all the values except for the index that we
have specified.
• x <- c("aa", "bb", "cc", "dd", "ee")
• x[3] =?
• x[-2] = ?
5. Duplicate Index
• The index vector allows duplicate values. Hence, the following
retrieves a member twice in one operation.
• s = c("aa", "bb", "cc", "dd", "ee")
• s[c(2,3,3)] = ?
6. Range Indexes
• To produce a vector slice between two indexes, we can use
the colon operator “:“. It is convenient for situations involving
large vectors.
• s = c("aa", "bb", "cc", "dd", "ee")
• s[1:3]
7. Out-of-order Indexes
• The index vector can even be out-of-order. Here is a vector
slice with the order of first and second members reversed.
• For example:
• s[c(2, 1, 3)]
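The vector operations in this section can be run together as a short sketch. Vector b is given different values from a here (an illustrative change) so that the arithmetic results are easier to tell apart.

```r
# Combining a numeric and a character vector coerces everything to character
n <- c(1, 2, 3, 4)
s <- c("Hadoop", "Spark", "HIVE", "Flink")
combined <- c(n, s)
class(combined)        # "character"
length(combined)       # 8

# Arithmetic is performed member by member
a <- c(1, 3)
b <- c(2, 4)
a + b                  # 3 7
a * b                  # 2 12
a %% b                 # 1 3

# Numeric, negative, duplicate, range and out-of-order indexes
x <- c("aa", "bb", "cc", "dd", "ee")
x[3]                   # "cc"
x[-2]                  # "aa" "cc" "dd" "ee"
x[c(2, 3, 3)]          # "bb" "cc" "cc"
x[1:3]                 # "aa" "bb" "cc"
x[c(2, 1, 3)]          # "bb" "aa" "cc"
```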
variables
• A variable provides us with named storage
that our programs can manipulate.
• A variable in R can store an atomic vector,
group of atomic vectors or a combination of
many R objects.
• A valid variable name consists of letters,
numbers and the dot or underline characters.
The variable name starts with a letter or the
dot not followed by a number.
Variable Assignment

• The variables can be assigned values using the leftward (<-), rightward (->) and
equal to (=) operators.
• The values of the variables can be printed
using print() or cat() function.
• The cat() function combines multiple items into a continuous print
output.
• var.1 = c(0,1,2,3)
• var.2 <- c("learn","R")
• c(TRUE) -> var.3
• print(var.1)
• cat ("var.1 is ", var.1 ,"\n")
• cat ("var.2 is ", var.2 ,"\n")
• cat ("var.3 is ", var.3 ,"\n")
Data Type of a Variable

• In R, a variable itself is not declared of any data type;
rather, it gets the data type of the R object assigned to
it.
• So R is called a dynamically typed language, which
means that we can change the data type of the
same variable again and again when using it in a
program.
• var_x <- "Hello"
• class(var_x) = ?
• var_x <- 34.5
• var_x <- 27L
Finding Variables

• To know all the variables currently available in
the workspace we use the ls() function.
• Also the ls() function can use patterns to
match the variable names.
• print(ls())
• print(ls(pattern = "var"))
Deleting Variables

• Variables can be deleted by using
the rm() function.
• Below we delete the variable var.3. Printing
the variable afterwards throws an
error.
• rm(var.3)
• print(var.3)
Factor
• R factor is used to store categorical data as levels.
• It can store both character and integer types of
data.
• These factors are created with the help
of factor() functions, by taking a vector as input.
• R factors are variables. The factor is stored as
integers.
• They have labels associated with these unique
integers.
• A factor contains a predefined set of values called
levels. By default, R sorts the levels in
alphabetical order.
How to Create a Factor

• directions <- c("North", "North", "West", "South")
• factor(directions)
• o/p = Levels: North South West
• The level East is missing because it does not occur in the data.
To add this missing level to our factor, we use
the levels argument as follows:
• factor(directions, levels= c("North", "East", "South",
"West"))
• In order to provide abbreviations or ‘labels’ for our
levels, we make use of the labels argument as follows:
• factor(directions, levels= c("North", "East", "South",
"West"), labels=c("N", "E", "S", "W"))
• If you want to exclude any level from your factor, you
can make use of the exclude argument.
• factor(directions, levels= c("North", "East", "South",
"West"), exclude = "North")
• There are various ways to access the elements of a
factor in R. Some of the ways are as follows:
• data <- factor(c("East", "West", "East", "North"))
• data[4]
• data[c(2,3)]
• data[-1]
• data[c(TRUE, FALSE, TRUE, TRUE)]
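The effect of the levels and labels arguments is easiest to see by inspecting the resulting factors. A minimal sketch; the names f1, f2 and f3 are illustrative only.

```r
directions <- c("North", "North", "West", "South")

# Default: levels are the distinct values, sorted alphabetically
f1 <- factor(directions)
levels(f1)          # "North" "South" "West"

# Explicit levels: East is a level even though it never occurs
f2 <- factor(directions, levels = c("North", "East", "South", "West"))
table(f2)           # East has a count of 0
as.integer(f2)      # 1 1 4 3: the integer codes stored internally

# Labels rename the levels in the same order
f3 <- factor(directions,
             levels = c("North", "East", "South", "West"),
             labels = c("N", "E", "S", "W"))
as.character(f3)    # "N" "N" "W" "S"
```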
How to Modify an R Factor?

• When modifying a factor, we are limited to
values that are among its predefined
levels.
• data <- factor(c("East", "West", "East", "North"))
• print(data)
• data[2] <- "North"
• data[3] <- "South" # not a level: stores NA with a warning
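The restriction to predefined levels can be demonstrated as follows. This is a sketch under the assumption that data is a factor built with factor(); data2 is a hypothetical copy used to show the failing assignment without disturbing data.

```r
data <- factor(c("East", "West", "East", "North"))

# "North" is an existing level, so this assignment works
data[2] <- "North"

# "South" is not a level: R stores NA and issues a warning
data2 <- data
suppressWarnings(data2[3] <- "South")
is.na(data2[3])        # TRUE

# Extend the levels first, then the assignment succeeds
levels(data) <- c(levels(data), "South")
data[3] <- "South"
as.character(data)     # "East" "North" "South" "North"
```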
Factor Functions in R
• is.factor() checks if the input is present in the form of
factor and returns a Boolean value (TRUE or FALSE).
• as.factor() takes the input (usually a vector) and
converts it into a factor.
• is.ordered() checks if the factor is ordered and returns
boolean TRUE or FALSE.
• as.ordered() takes an unordered factor and
returns an ordered factor.
• f_directions <- factor(directions)
• is.factor(f_directions)
• as.factor(directions)
• is.ordered(f_directions)
• as.ordered(f_directions)
Reserved Words
• Reserved words in R programming are a set of
words that have a special meaning and cannot
be used as an identifier.
• The list of reserved words can be viewed by
typing ?reserved or help(reserved) at the R
command prompt.
Comments in R
• A single-line comment is written using # at the
beginning of the statement, as follows.
• #Test program in R.
• R does not support multi-line comments as
C does; every comment line must start with #.
Identifiers
• The unique name given to a variable, function
or other object is known as an identifier.
• Following are the rules for naming an identifier.
1. Identifiers can be a combination of letters,
digits, period(.) and underscore.
2. It must start with a letter or a period. If it starts
with a period, it can not be followed by a digit.
3. Reserved word in R can not be used as identifier.
Ex. Total1,sum,.date.of.birth,Sum_of_two etc.
Constants
• Constants, or literals, are entities whose value
cannot be altered. Basic types of constants are
numeric constants and character constants.
• All numbers are numeric constants.
They can be of type integer, double or complex.
• There are also built-in constants, such as pi, LETTERS,
letters, month.name and month.abb.
• It is not good to rely on the built-in constants, as they are
implemented as variables whose values can be
changed.
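The types of numeric constants and the fragility of built-in constants can be checked at the console. A minimal sketch; masked is a hypothetical helper variable used only to record the overwritten value.

```r
# Numeric constants
typeof(5L)        # "integer"
typeof(5)         # "double"
typeof(5i)        # "complex"

# Character constant
typeof("5")       # "character"

# Some built-in constants
pi                # 3.141593
LETTERS[1:3]      # "A" "B" "C"
month.abb[1]      # "Jan"

# A built-in constant is an ordinary variable and can be masked
pi <- 3
masked <- pi      # 3, not 3.141593
rm(pi)            # remove the mask; the built-in value is visible again
```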
Reading Strings
• We can read strings from the keyboard using the
readline() function.
• It lets the user enter a one-line string at the
terminal.
• value <- readline(prompt = "string")
• Ex. print(n <- readline(prompt = "Enter the subject: "))
• Enter the subject: R
• [1] "R"
Data Types
• The variables are assigned with R objects and the
data type of the R objects becomes the data type
of the variable. There are many types of R objects:
1. Vectors
2. Lists
3. Matrices
4. Arrays
5. Factors
6. Data Frames
Basic Data Types
• Numeric- Decimal values are called numeric in R.
• Integer- Whole numbers, written with the suffix L (for example 4L).
• Complex- A complex value in R is defined via the
pure imaginary value i. A complex number will be
in the form of a+bi.
• Logical- There are two logical values, TRUE and
FALSE.
• Character- Used to represent string values in R.
List
• A List is a generic vector containing other
objects. Lists are the R objects which contain
elements of different types like − numbers,
strings, vectors and another list inside it.
• A list can also contain a matrix or a function
as its elements.
• List is created using list() function.
Creating a List

• n <- list(c(2,3), c("a","b","c"), c(TRUE,FALSE,TRUE), 3)
• Output-
[[1]]
[1] 2 3
[[2]]
[1] "a" "b" "c"
[[3]]
[1]  TRUE FALSE  TRUE
[[4]]
[1] 3
Naming List Elements

• The list elements can be given names and they
can be accessed using these names.
• list_data <- list(c("Jan","Feb","Mar"),
matrix(c(3,9,5,1,-2,8), nrow = 2), list("green",12.3))
• names(list_data) <- c("1st Quarter",
"A_Matrix", "A Inner list")
• print(list_data)
Accessing List Elements
• Lists can be accessed in similar fashion to
vectors. Integer, logical or character vectors
can be used for indexing.
• Elements of the list can be accessed by the
index of the element in the list.
• # Create a list containing a vector, a matrix and a list.
• list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8),
nrow = 2), list("green",12.3))
• # Give names to the elements in the list.
• names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner
list")
• # Access the first element of the list.
• print(list_data[1])
• # Access the third element. As it is also a list, all its
elements will be printed.
• print(list_data[3])
• # Access the list element using the name of the element.
• print(list_data$A_Matrix)
Manipulating List Elements

• We can add, delete and update list elements
as shown below.
• New elements are most simply added at the
end of a list. Any element can be updated, or
deleted by assigning NULL to it.
• list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8),
nrow = 2), list("green",12.3))
• names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner
list")
# Add element at the end of the list.
• list_data[4] <- "New element"
• print(list_data[4])
# Remove the last element.
• list_data[4] <- NULL
# Update the 3rd Element.
• list_data[3] <- "updated element"
• print(list_data[3])
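Adding and deleting are not restricted to the end of a list, and the two bracket forms return different things. The sketch below illustrates both points; the element name first is hypothetical, and append() is a base R function.

```r
list_data <- list(c("Jan", "Feb", "Mar"),
                  matrix(c(3, 9, 5, 1, -2, 8), nrow = 2),
                  list("green", 12.3))
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# [ ] returns a sub-list, [[ ]] returns the element itself
class(list_data["1st Quarter"])      # "list"
class(list_data[["1st Quarter"]])    # "character"

# Add at the end, then remove it again
list_data[4] <- "New element"
length(list_data)                    # 4
list_data[4] <- NULL

# Delete an element that is not the last one, by name
list_data[["A_Matrix"]] <- NULL
names(list_data)                     # "1st Quarter" "A Inner list"

# Insert at the front with append()
list_data <- append(list_data, list(first = 0), after = 0)
names(list_data)[1]                  # "first"
```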
Merging Lists

• You can merge many lists into one list by
placing all the lists inside one c() function.
• list1 <- list(1,2,3)
• list2 <- list("Sun","Mon","Tue")
# Merge the two lists.
• merged.list <- c(list1,list2)
# Print the merged list.
• print(merged.list)
Converting List to Vector

• A list can be converted to a vector so that the
elements of the vector can be used for further
manipulation.
• All the arithmetic operations on vectors can
be applied after the list is converted into
vectors.
• To do this conversion, we use
the unlist() function.
• It takes the list as input and produces a vector.
# Create lists.
• list1 <- list(1:5)
• print(list1)
• list2 <-list(10:14)
• print(list2)
# Convert the lists to vectors.
• v1 <- unlist(list1)
• v2 <- unlist(list2)
• print(v1)
• print(v2)
# Now add the vectors
• result <- v1+v2
• print(result)
Matrices
• Matrices are the R objects in which the elements
are arranged in a two-dimensional rectangular
layout.
• They contain elements of the same atomic types.
• Though we can create a matrix containing only
characters or only logical values, they are not of
much use.
• We use matrices containing numeric elements to
be used in mathematical calculations.
• A Matrix is created using the matrix() function.
• Syntax-
matrix(data, nrow, ncol, byrow, dimnames)
• data is the input vector which becomes the
data elements of the matrix.
• nrow is the number of rows to be created.
• ncol is the number of columns to be created.
• byrow is a logical flag. If TRUE, the input
vector elements are arranged by row.
• dimnames is a list holding the names assigned to the rows
and columns.
• M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
• print(M)
     [,1] [,2] [,3]
[1,]    3    4    5
[2,]    6    7    8
[3,]    9   10   11
[4,]   12   13   14
• N <- matrix(c(3:14), nrow = 4, byrow = FALSE)
• print(N)
     [,1] [,2] [,3]
[1,]    3    7   11
[2,]    4    8   12
[3,]    5    9   13
[4,]    6   10   14
• rownames = c("row1", "row2", "row3", "row4")
• colnames = c("col1", "col2", "col3")
• P <- matrix(c(3:14), nrow = 4, byrow = TRUE,
dimnames = list(rownames, colnames))
• print(P)
     col1 col2 col3
row1    3    4    5
row2    6    7    8
row3    9   10   11
row4   12   13   14
Another way of creating a matrix
• By using cbind() and rbind() – If we are using cbind()
function, elements will be filled column-wise and
rbind() function fills the matrix elements row-wise.
• M=cbind(c(1,2,3),c(4,5,6))
• M
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
• M=rbind(c(1,2,3),c(4,5,6))
• M
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
• By using the dim() function- we can also create a matrix
from a vector by setting its dimensions using dim().
• M = c(1,2,3,4,5,6)
• dim(M) = c(2,3)
• M
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
Accessing Matrix Elements
• Matrix elements can be accessed in 3 different
ways-
1. Integer vector as index- An element at the mth
row and nth column of a matrix P can be
accessed by the expression P[m,n].
• We can use negative integers to specify rows or
columns to be excluded.
• If any field inside the bracket is left blank, it
selects all.
• For ex. the entire mth row of matrix P can be
extracted as P[m,] and for column P[,n].
• M= matrix(c(1:12), nrow =4, byrow= TRUE)
• M
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
1. M[2,3]
[1] 6
2. M[2, ]
[1] 4 5 6
3. M[ ,3]
[1]  3  6  9 12
4. M[ , ]
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
5. M[ ,c(1,3)]
     [,1] [,2]
[1,]    1    3
[2,]    4    6
[3,]    7    9
[4,]   10   12
6. M[c(3,2) , ]
     [,1] [,2] [,3]
[1,]    7    8    9
[2,]    4    5    6
7. M[c(1,2) ,c(2,3)]
     [,1] [,2]
[1,]    2    3
[2,]    5    6
8. M[-1, ]
     [,1] [,2] [,3]
[1,]    4    5    6
[2,]    7    8    9
[3,]   10   11   12
• Logical vector as index- Two logical vectors can
be used to index a matrix. In such a situation, the
rows and columns where the value is TRUE are
returned.
• These indexing vectors are recycled if
necessary and can be mixed with integer
vectors.
• M= matrix(c(1:12), nrow =4, byrow = TRUE)
• M[c(TRUE, FALSE,TRUE),c(TRUE,TRUE,FALSE)]
• The row index has 3 values for 4 rows, so it is recycled to
TRUE, FALSE, TRUE, TRUE and rows 1, 3 and 4 are returned:
     [,1] [,2]
[1,]    1    2
[2,]    7    8
[3,]   10   11
• Character vector as index – If we assign names
to the rows and columns of a matrix, then we
can access the elements by names.
• This can be mixed with integers or logical
indexing.
• M <- matrix(c(1:12), nrow = 4, byrow = TRUE,
dimnames = list(c("r1","r2","r3","r4"), c("c1","c2","c3")))
• M
   c1 c2 c3
r1  1  2  3
r2  4  5  6
r3  7  8  9
r4 10 11 12
• M["r2", "c3"] # element at 2nd row, 3rd column
[1] 6
• M[ , "c1"] # elements of the column named c1
r1 r2 r3 r4
 1  4  7 10
• M[TRUE, c("c1","c2")] # all rows and columns c1 & c2
   c1 c2
r1  1  2
r2  4  5
r3  7  8
r4 10 11
• M[2:3, c("c1","c3")] # 2nd & 3rd row, columns c1 & c3
   c1 c3
r2  4  6
r3  7  9
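Integer, logical and character indexing can be mixed on the same matrix. The following sketch uses a 4 x 3 matrix with hypothetical row and column names; rows and sub are illustrative variable names.

```r
M <- matrix(1:12, nrow = 4, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3", "r4"), c("c1", "c2", "c3")))

M[2, 3]                  # 6, by position
M["r2", "c3"]            # 6, by name
M[-1, ]                  # every row except the first

# A logical row index shorter than the number of rows is recycled:
# TRUE FALSE TRUE becomes TRUE FALSE TRUE TRUE
rows <- c(TRUE, FALSE, TRUE)
sub <- M[rows, c(TRUE, TRUE, FALSE)]
rownames(sub)            # "r1" "r3" "r4"

# drop = FALSE keeps a single column as a matrix instead of a vector
dim(M[, "c1", drop = FALSE])   # 4 1
```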
Matrix Arithmetic
• The dimensions ( no of rows and columns)
should be same for the matrices involved in the
operation.
• Matrix1 <- matrix(c(10,20,30,40,50,60), nrow=2)
• Matrix2 <- matrix(c(1,2,3,4,5,6), nrow=2)
• Sum <- Matrix1 + Matrix2
• Difference <- Matrix1 - Matrix2
• Product <- Matrix1 * Matrix2 # element-wise product
• Quotient <- Matrix1 / Matrix2
• Matrix1
     [,1] [,2] [,3]
[1,]   10   30   50
[2,]   20   40   60
• Matrix2
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
Matrix Manipulation
• We can modify a single element or elements
based on a condition.
• Matrix1 <- matrix(c(10,20,30,40,50,60), nrow=2)
• Matrix1[2,2] <- 100
• Matrix1[Matrix1 < 40] <- 0
• We can add a row or column using rbind() and cbind().
Similarly, rows or columns can be removed through reassignment.
• cbind(Matrix1, c(1,2)) # new column: one value per row
• rbind(Matrix1, c(1,2,3)) # new row: one value per column
• print(Matrix1 <- Matrix1[1:2, ])
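The modification steps can be traced value by value. A minimal sketch; wider and taller are hypothetical names for the results of cbind() and rbind().

```r
Matrix1 <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 2)
# Filled column by column:
#      [,1] [,2] [,3]
# [1,]   10   30   50
# [2,]   20   40   60

Matrix1[2, 2] <- 100          # change a single element
Matrix1[Matrix1 < 40] <- 0    # change every element below 40

# cbind() needs one value per row, rbind() one value per column
wider  <- cbind(Matrix1, c(1, 2))
taller <- rbind(Matrix1, c(1, 2, 3))
dim(wider)                    # 2 4
dim(taller)                   # 3 3

# Remove the added row again through reassignment
taller <- taller[1:2, ]
```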
Matrix Operations
1. Matrix Multiplication – Two matrices A of order M×N and B of
order P×Q are eligible for multiplication only if N is equal to P.
• The resultant matrix will be of the order M×Q.
• Matrix multiplication is performed using the operator A %*% B,
where A and B are matrices.
• Matrix1 <- matrix(c(10,20,30,40,50,60), nrow=2) # 2 x 3
• Matrix2 <- matrix(c(1,2,3,4,5,6), nrow=3) # 3 x 2
• Product <- Matrix1 %*% Matrix2 # 2 x 2

2. Transpose – The transpose of a matrix is an operation which
flips a matrix over its diagonal; that is, it switches the row and
column indices of the matrix.
• Matrix1 <- matrix(c(10,20,30,40,50,60), nrow=2)
• t(Matrix1)
     [,1] [,2]
[1,]   10   20
[2,]   30   40
[3,]   50   60
• Matrix1
     [,1] [,2] [,3]
[1,]   10   30   50
[2,]   20   40   60
3. Cross product-
• A <- matrix(c(10,20,30,40,50,60), nrow=2)
• B <- matrix(c(1,2,3,4,5,6), nrow=2)
• crossprod(A,B) # same as t(A) %*% B
• crossprod(A) # same as t(A) %*% A
4. Diagonal Matrix –
• A <- matrix (1:9 , nrow =3)
• diag(A) # prints the diagonal element
• diag(3) # create an identity matrix of order 3
• diag( c(1,2,3) ,3) # create a matrix of order 3
with diagonal elements 1,2,3.
• A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
• diag(A)
[1] 1 5 9
• diag(3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
• diag(c(1,2,3), 3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    2    0
[3,]    0    0    3
5. Row sum and column sum-
• A<- matrix(c(10,20,30,40,50,60), nrow=2)
• rowSums(A)
• colSums(A)
6. Row means and column means-
• rowMeans(A)
• colMeans(A)
7. Eigenvalues and eigenvectors (square matrices only)-
• S <- matrix(c(2,1,1,3), nrow=2)
• Y <- eigen(S)
8. Inverse (square, non-singular matrices only)-
• solve(S)
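The dimension requirements of these operations can be verified on small matrices. This is a sketch; the 3 x 2 matrix B and the square matrix S are illustrative values chosen so that every call is valid.

```r
A <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 2)   # 2 x 3
B <- matrix(1:6, nrow = 3)                         # 3 x 2

P <- A %*% B          # (2 x 3) times (3 x 2) gives 2 x 2
dim(t(A))             # 3 2

# crossprod(A) is a shortcut for t(A) %*% A
all.equal(crossprod(A), t(A) %*% A)   # TRUE

rowSums(A)            # 90 120
colSums(A)            # 30 70 110
rowMeans(A)           # 30 40

# eigen() and solve() require a square matrix
S <- matrix(c(2, 1, 1, 3), nrow = 2)
ev <- eigen(S)$values
S_inv <- solve(S)
S_inv %*% S           # identity matrix, up to rounding
```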
Arrays
• Arrays are R data objects which can store data in
more than two dimensions.
• Arrays can store only same data type.
• For ex. , if we create an array of dimension (2,4,5)
then it creates 5 rectangular matrices each with 2
rows and 4 columns.
• An array is created using the array() function.
• It takes vectors as input. The function dim()
defines the dimension of an array or use the
values in the dim parameter to create an array.
• V1 <- c(1,2,3)
• V2 <- c(10,20,30,40,50,60)
• A <- array(c(V1,V2), dim=c(3,3,2))
• The 9 input values are recycled to fill both matrices:
, , 1
     [,1] [,2] [,3]
[1,]    1   10   40
[2,]    2   20   50
[3,]    3   30   60
, , 2
     [,1] [,2] [,3]
[1,]    1   10   40
[2,]    2   20   50
[3,]    3   30   60
• We can give names to the rows, columns and
matrices in the array by using the dimnames
argument.
• V1 <- c(1,2,3)
• V2 <- c(10,20,30,40,50,60)
• column.names <- c("col1","col2","col3")
• row.names <- c("row1","row2","row3")
• matrix.names <- c("matrix1","matrix2")
• A <- array(c(V1,V2), dim = c(3,3,2), dimnames =
list(row.names, column.names, matrix.names))
Accessing Array Elements
• We can use the index position to access the
array elements. Using index we can alter each
and every individual element present in array.
• Syntax- array_name [row_position, col_position,matrix_level]
• A <- array( 1:24, dim= c(3,4,2))
• A[1,2,1] # 1st row, 2nd column of matrix 1
[1] 4
• A[3,4,2] # 3rd row, 4th column of matrix 2
[1] 24
• A[3, , 1] # only the 3rd row of matrix 1
[1]  3  6  9 12
• A[ , 4,2] # 4th column of matrix 2
[1] 22 23 24
• A[ , , 1] # the whole of matrix 1
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
• A[ , , 2] # the whole of matrix 2
     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24
Array Element Manipulation
• We can do calculations across the elements in an
array using the apply() function.
• Syntax- apply(x, margin, func)
• x is an array, margin gives the dimension over which the
function is applied (1 for rows, 2 for columns), and
func is the function to be applied.
• V1 <- c(1,2,3)
• V2 <- c(10,20,30,40,50,60)
• A <- array(c(V1,V2), dim=c(3,3,2))
• B <- apply(A, c(1), sum)
• C <- apply(A, c(2), sum)
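The meaning of the margin argument becomes clearer when the results are computed for each dimension in turn. A minimal sketch on the same array; the result names are illustrative.

```r
V1 <- c(1, 2, 3)
V2 <- c(10, 20, 30, 40, 50, 60)
A <- array(c(V1, V2), dim = c(3, 3, 2))   # two identical 3 x 3 matrices

# margin 1: one sum per row, taken across all columns and both matrices
row_sums <- apply(A, 1, sum)      # 102 144 186

# margin 2: one sum per column
col_sums <- apply(A, 2, sum)      # 12 120 300

# margin 3: one sum per matrix
layer_sums <- apply(A, 3, sum)    # 216 216
```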
Array Arithmetic
• To perform arithmetic operations on the matrices stored in an array, we
extract them as two-dimensional matrices. Matrices of the same
dimensions are then combined element by element.
• V1 <- c(1,2,3)
• V2 <- c(10,20,30,40,50,60)
• A<- array(c(V1,V2), dim=c(3,3,2))
• mat.a <- A[ , , 1]
• mat.b <- A[ , ,2]
• mat.a + mat.b
• mat.a - mat.b
• mat.a * mat.b
• mat.a / mat.b
Factors
• Factor is a data structure used for fields that takes
only predefined finite number of values or
categorical data.
• They are used to categorize the data and store it
as levels.
• They can store both string and integers.
• For ex., a data field such as marital status may
contain only values from single, married,
separated, divorced and widowed. In such a case,
the possible values are predefined and distinct, and are
called levels.
Creating factors
• Factors are created with the help
of the factor() function, by taking a vector as input.
• A factor contains a predefined set of values called
levels. By default, R sorts the levels in
alphabetical order.
• directions <- c("North", "North", "West", "South")
• factor(directions)
• o/p = Levels: North South West
Accessing Factor
• There are various ways to access the elements
of a factor in R. Some of the ways are as
follows:
• data <- factor(c("East", "West", "East", "North"))
• data[4]
• data[c(2,3)]
• data[-1]
• data[c(TRUE, FALSE, TRUE, TRUE)]
Modifying Factor
• When modifying a factor, we are limited to
values that are among its predefined
levels.
• print(data)
• data[2] <- "North"
• data[3] <- "South" # not a level: stores NA with a warning
Data Frames

• A data frame is used for storing data tables.
• It is a list of vectors of equal length.
• A data frame is a table or a two-dimensional
array-like structure in which each column
contains values of one variable and each row
contains one set of values from each column.
Characteristics of a data frame
1. The column names should be non-empty.
2. The row names should be unique.
3. The data stored in a data frame can be of
numeric, factor or character type.
4. Each column should contain same number of
data items.
Creating Data Frames
• We can create data frames using the function
data.frame().
• The top line of the table called the header
contains the column names.
• Each horizontal line afterward denotes a data
row, which begins with the name of the row,
and then followed by the actual data.
• Each data member of a row is called a cell.
• We can get the name of header using the
function names().
• No of rows using the function nrow().
• No of column using the function ncol().
• The length() function returns the length of the
list which is same as that of no of columns.
• The structure of a data frame can be retrieved
using the str() function.
• The statistical summary and nature of the data
can be obtained by applying summary()
function.
• X <- data.frame("roll"=1:2,"name"=c("jack","jill"),"age"=c(20,22))
• print(X)

• names(X)

• nrow(X)

• ncol(X)

• str(X)

• summary(X)
Accessing Data Frame Components
• Components of data frame can be accessed like a
list or like a matrix.
(a) Accessing like a list – we can use either [[ or $
operator to access columns of data frame.
• Accessing with [[ and $ is similar.
• X <- data.frame("roll"=1:2,"name"=c("jack","jill"),"age"=c(20,22))
• X$name
• X[["name"]]
• X[[3]] # retrieves the third column as a vector
(b) Accessing like a Matrix – Data frame can be
accessed like a matrix by providing index for
row and column.
• We can use [] for indexing; this returns
a data frame, whereas [[ and $
reduce the result to a vector.
• We can use the head() function to display first
n rows.
• Negative number for the index are also
allowed in data frames.
• X <-
data.frame("roll"=1:3,"name"=c("jack","jill","Tom"),"age"=c(20,22,23))

• X["name"]

• X[1:2,]

• X[, 2:3]

• X[c(1,2),c(2,3)]

• X[,-1]

• X[-1,]

• X[X$age>21,]

• head(X,2)
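The difference between list-style and matrix-style access shows up in the class of the result. A minimal sketch; stringsAsFactors = FALSE is set explicitly (an assumption added here) so that the name column is a character vector in every R version.

```r
X <- data.frame(roll = 1:3,
                name = c("jack", "jill", "Tom"),
                age = c(20, 22, 23),
                stringsAsFactors = FALSE)

# List-style access returns the column itself
class(X$name)          # "character"
class(X[["name"]])     # "character"

# Single-bracket access by name returns a one-column data frame
class(X["name"])       # "data.frame"

# Matrix-style access with row and column indexes
X[1:2, ]               # first two rows
X[, -1]                # all columns except the first
older <- X[X$age > 21, ]   # rows that satisfy a condition
older$name             # "jill" "Tom"
```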
Modifying Data Frames
• Data frames can be modified like we modified matrices
through reassignment.
• X <- data.frame("roll"=1:3,"name"=c("jack","jill","Tom"),"age"=c(20,22,23))
• X[1,"age"] <- 25
• A data frame can be expanded by adding columns and rows.
• We can add the column vector using a new column name.
• Columns can also be added using the cbind() function.
• Similarly rows can be added using the rbind() function.
• Data frame columns can be deleted by assigning NULL to it.
• Similarly, rows can be deleted through reassignment.
# adding a new column by name
• print(X$bloodgroup <- c("A+","B-","AB+"))
# adding new column using cbind()
• print(X <- cbind(X,city=c("delhi","mumbai","chennai")))
# adding new row using rbind(); a list keeps each column's data type
• print(X <- rbind(X,list(4,"Jack",24,"B+","Delhi")))
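The add and delete operations described above can be run end to end as follows. This sketch assumes R 4.0 or later, where strings in data frames stay character by default; the new row is passed as a one-row data frame so that each column keeps its type.

```r
X <- data.frame(roll = 1:3,
                name = c("jack", "jill", "Tom"),
                age = c(20, 22, 23),
                stringsAsFactors = FALSE)

X[1, "age"] <- 25                                   # modify one cell
X$bloodgroup <- c("A+", "B-", "AB+")                # add a column by name
X <- cbind(X, city = c("delhi", "mumbai", "chennai"))   # add a column with cbind()

# Add a row with rbind(); column names must match
new_row <- data.frame(roll = 4L, name = "Jack", age = 24,
                      bloodgroup = "B+", city = "Delhi")
X <- rbind(X, new_row)

X$bloodgroup <- NULL    # delete a column
X <- X[-1, ]            # delete the first row through reassignment

names(X)                # "roll" "name" "age" "city"
nrow(X)                 # 3
```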
Aggregating Data
• It is relatively easy to collapse data in R using
one or more by variables and a defined
function.
• When using the aggregate() function, the by
variables must be in a list, even if there is only
one column.
• The function can be built-in functions like
mean, max, min, sum etc. or user provided
function.
• X <- data.frame("roll"=1:11,
"name"=c("jack","jill","jeeva","smith","bob","smith","john",
"mathew","charle","zen","yug"),
"age"= c(20,20,30,21,19,21,19,18,22,25,21),
"marks" = c(100,98,99,75,80,90,88,43,87,43,89))
• print(X)

• aggdata <- aggregate(X$marks,list(m=X$age),mean)


• print(aggdata)

• aggdata <- aggregate(X$marks,list(m=X$age),max)


• print(aggdata)

• aggdata <- aggregate(X$marks,list(m=X$age),sum)


• print(aggdata)
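The grouped results can be checked against a hand calculation. The sketch below keeps only the age and marks columns from the example; the formula call at the end is an alternative interface of the same aggregate() function, shown for comparison.

```r
X <- data.frame(age   = c(20, 20, 30, 21, 19, 21, 19, 18, 22, 25, 21),
                marks = c(100, 98, 99, 75, 80, 90, 88, 43, 87, 43, 89))

# The by variable must be wrapped in a list; the result column is named x
by_age <- aggregate(X$marks, list(age = X$age), mean)
nrow(by_age)                        # 7 distinct ages
by_age$x[by_age$age == 19]          # 84, the mean of 80 and 88

best <- aggregate(X$marks, list(age = X$age), max)
best$x[best$age == 21]              # 90

# Formula interface: result columns keep the original names
by_age2 <- aggregate(marks ~ age, data = X, FUN = mean)
names(by_age2)                      # "age" "marks"
```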
Sorting Data
• To sort a data frame in R, use the order()
function.
• By default, sorting is ascending.
• We can sort in descending order by giving the
sorting variable a minus sign in front.
• X <- data.frame("roll"=1:11,
"name"=c("jack","jill","jeeva","smith","bob","smith","john","mathew","charle","zen","yug"),
"age"= c(20,20,30,21,19,21,19,18,22,25,21),
"marks" = c(100,98,99,75,80,90,88,43,87,43,89))

#sort by name
• newdata <- X[order(X$name),]

# sort by age and within sort by name


• newdata <- X[order(X$age,X$name),]

# sort by name ascending and, within name, by age descending
• newdata <- X[order(X$name,-X$age),]
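How order() handles several keys and the minus sign is easier to follow on a small data frame. A minimal sketch with five illustrative rows:

```r
X <- data.frame(name = c("jack", "jill", "bob", "smith", "smith"),
                age  = c(20, 20, 19, 21, 25),
                stringsAsFactors = FALSE)

# order() returns row positions; indexing with them sorts the data frame
by_name <- X[order(X$name), ]
by_name$name                  # "bob" "jack" "jill" "smith" "smith"

# Several keys: age first, name breaks ties
by_age_name <- X[order(X$age, X$name), ]
by_age_name$name              # "bob" "jack" "jill" "smith" "smith"

# A minus sign reverses a numeric key: name ascending, age descending
by_name_age_desc <- X[order(X$name, -X$age), ]
by_name_age_desc$age          # 19 20 20 25 21
```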
Merging Data
• We can merge two data frames(datasets)
horizontally, by using the merge() function.
• In most cases, we join two data frames by
one or more common key variables (i.e. an inner join).
• There are different types of join like inner join,
outer join, left outer join, right outer join and
cross join.
• Following are the points to be kept in mind while
performing join operations-
1. An inner join of two data frames df1 and df2 returns
only the rows in which the left table has matching
keys in the right table.
2. An outer join (full join) of two data frames df1 and df2
returns all rows from both tables, joining the rows of the
left table that have matching keys in the right table.
3. A left outer join (or simply left join) of two data frames
df1 and df2 returns all rows from the left table, and
any rows with matching keys from the right table.
4. A right outer join of two data frames df1 and df2
returns all rows from the right table, and any rows
with matching keys from the left table.
5. A cross join of two data frames df1 and df2 returns
every combination of rows, so the number of rows in the
result is the number of rows in the first table multiplied
by the number of rows in the second table.
• df1 =data.frame(CustomerId = c(1:6), product= c(rep("toaster",3),rep("radio",3)))

• df2 =data.frame(CustomerId = c(2,4,6), state= c(rep("alabama",2),rep("ohio",1)))

• print(df1)
• print(df2)

• # inner join
• merge(df1,df2, by= "CustomerId")

• # outer join
• merge(x=df1,y=df2, by= "CustomerId",all=TRUE)

• # left outer join


• merge(x=df1,y=df2, by= "CustomerId",all.x=TRUE)

• #Right outer join


• merge(x=df1,y=df2, by= "CustomerId",all.y=TRUE)

• #cross join
• merge(x=df1,y=df2, by= NULL)
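In the data above every key of df2 also occurs in df1, so the right join and the inner join give the same rows. The sketch below changes one key to 7 (an assumption made only to separate the cases) and compares the row counts of each join type.

```r
df1 <- data.frame(CustomerId = 1:6,
                  product = c(rep("toaster", 3), rep("radio", 3)))
# CustomerId 7 is a made-up key with no match in df1
df2 <- data.frame(CustomerId = c(2, 4, 7),
                  state = c("alabama", "alabama", "ohio"))

inner <- merge(df1, df2, by = "CustomerId")                # keys 2 and 4
outer <- merge(df1, df2, by = "CustomerId", all = TRUE)    # keys 1 to 7
left  <- merge(df1, df2, by = "CustomerId", all.x = TRUE)  # keys 1 to 6
right <- merge(df1, df2, by = "CustomerId", all.y = TRUE)  # keys 2, 4, 7
cross <- merge(df1, df2, by = NULL)                        # all combinations

sapply(list(inner = inner, outer = outer, left = left,
            right = right, cross = cross), nrow)
```

Rows without a partner are filled with NA in the columns coming from the other table.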
Reshaping Data
• R provides a variety of methods for reshaping
data prior to analysis.
• Two important functions for reshaping data are
the melt() and cast() functions.
• These functions are available in reshape package.
• Before using these functions, make sure that the
package is properly installed in your system.
• We can “melt” the data so that each row is a
unique id-variable combination. Then we can
“cast” the melted data into any shape we would
like.
• y <- data.frame("id"=c(1,2,1,2,1), "age"=c(20,20,21,21,19),

"marks1"=c(80,60,70,80,90),"marks2"=c(100,98,99,75,80))
• print(y)

• #melting data
• mdata= melt(y, id=c("id","age"))

• # cast( data, formula, function)


• # mean marks for each id
• markmeans <- cast(mdata,id~variable,mean)

• # mean mark for each group


• agemeans <- cast(mdata,age~variable,mean)
Subsetting Data
• The subset() function is the easier way to
select variables and observations.
• In the following ex., we select all rows that
have a value of age greater than or equal to 25
and less than 30.
• Similarly, we select all rows with
name="smith" or name="john".
• X <- data.frame("roll"=1:11,
"name"=c("jack","jill","jeeva","smith","bob","smith",
"john","mathew","charle","zen","yug"),
"age"= c(20,20,30,21,19,21,19,18,22,25,21))
• print(X)

• newdata <-
subset(X,age>=25&age<30,select=c(roll,name,age))
• print(newdata)

• newdata <-
subset(X,name=="smith"|name=="john",select=roll:age)
• print(newdata)
Data Type Conversion
• We can convert one data type to another data
type as in any programming language.
• We can convert any basic data type to
numeric using the function as.numeric().
• Similarly as.integer() converts to integer,
as.character() converts to character,
as.logical() converts to logical and
as.complex() converts to complex data types.
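A short sketch of the conversion functions listed above. The variable names are arbitrary and chosen only for this illustration.

```r
n <- as.numeric("42")      # character to numeric
i <- as.integer(3.9)       # the fractional part is dropped, giving 3L
s <- as.character(10.5)    # numeric to character
b <- as.logical(0)         # zero becomes FALSE, other numbers TRUE
z <- as.complex(2)         # 2+0i

# A value that cannot be converted becomes NA (with a warning)
bad <- suppressWarnings(as.numeric("abc"))

print(list(n, i, s, b, z, bad))
```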
Unit 3
Conditions and loops
• Decision making structures are used by the
programmer to specify one or more
conditions to be evaluated or tested by the
program.
• A statement or statements need to be
executed if the condition is TRUE and
optionally other statements to be executed if
the condition is FALSE.
Decision Making
• R provides the following types of decision
making statements which includes if
statement, if..else statement, nested if…else
statement, ifelse() function and switch
statement.
if Statement
• An if statement consists of a boolean
expression followed by one or more
statements. The syntax is-
• if (boolean_expression)
{
# statements that execute if the boolean
# expression is TRUE
}
• If the boolean_expression evaluates to TRUE,
then the block of code inside the if statement
will be executed.
• If boolean_expression evaluates to FALSE,
then the first set of code after the end of if
statement will be executed.
• Here the boolean expression must be a single logical
or numeric value. Since R 4.2.0, a condition of length
greater than one is an error; earlier versions used only
the first element and gave a warning.
• In the case of a numeric value, zero is taken as
FALSE, rest as TRUE.
• x <- 10
if (x > 0)
{
cat(x, "is a positive number\n")
}
if….else Statement
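• An if statement can be followed by an optional else
statement, which executes when the boolean expression is
FALSE.
• The syntax of if…else is-
if (boolean_expression) {
# executes if the expression is TRUE
} else {
# executes if the expression is FALSE
}
• The else keyword must be on the same line as the closing
brace of the if block; otherwise R reports a syntax error
at the top level.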
• If the boolean_expression evaluates to be
TRUE, then if block of code will be executed,
otherwise else block of code will be executed.
• x <- -5
if (x > 0) {
cat(x, "is a positive number\n")
} else {
cat(x, "is a negative number\n")
}
• We can write the if…else statement in a single
line if the if and else blocks contain only one
statement each, as follows.
• if (x > 0) cat(x, "is a positive no\n") else
cat(x, "is a negative no\n")
Nested if…else Statement
• An if statement can be followed by an optional
else if..else statement, which is very useful to
test various conditions using single if…else if
statement.
• We can nest as many if..else statement as we
want.
• Only one statement will get executed
depending upon the boolean_expression.
• if (boolean_expression 1) {
# execute when expression 1 is true.
} else if (boolean_expression 2) {
# execute when expression 2 is true.
} else if (boolean_expression 3) {
# execute when expression 3 is true.
} else {
# execute when none of the above conditions is true.
}
• x <- 19
if (x < 0)
{
cat(x, "is a negative number")
} else if (x > 0)
{
cat(x, "is a positive number")
} else
print("zero")
ifelse() function
• Most of the functions in R take a vector as input and
output a resultant vector.
• This vectorization of code is much faster
than applying the same function to each element
of the vector individually.
• There is an easier way to use the if..else statement
specifically for vectors in R.
• We can use the ifelse() function instead, which is the
vector equivalent form of the if..else statement.
• ifelse(boolean_expression, x, y)
• Here, boolean_expression must be a logical
vector.
• The return value is a vector with the same length
as boolean_expression.
• This returned vector has an element from x if the
corresponding value of boolean_expression is
TRUE or from y if the corresponding value of
boolean_expression is FALSE.
• For example, the ith element of result will be x[i],
if boolean_expression[i] is TRUE else it will take
the value of y[i].
• The vectors x and y are recycled whenever
necessary.
• a = c(5,7,2,9)
ifelse(a %% 2 == 0, "even", "odd")
• o/p = ?
• In the above example, the boolean_expression
is a %% 2 ==0 which will result into the
vector(FALSE, FALSE,TRUE,FALSE).
• Similarly, the other two vectors in the function
argument gets recycled to (“even”, ”even”,
”even”, ”even”) and (“odd”, “odd”, “odd”,
“odd”) respectively.
• Hence the result is evaluated accordingly.
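The evaluation described above can be traced step by step. The second call, which caps values at 6, is an extra illustration not taken from the slides; it shows a scalar being recycled while the other argument is used element by element.

```r
a <- c(5, 7, 2, 9)

test <- a %% 2 == 0          # FALSE FALSE TRUE FALSE
parity <- ifelse(test, "even", "odd")
print(parity)                # "odd" "odd" "even" "odd"

# The scalar 6 is recycled; a itself supplies the "no" values
capped <- ifelse(a > 6, 6, a)
print(capped)                # 5 6 2 6
```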
switch Statement
• A switch statement allows a variable to be tested
for equality against a list of values.
• Each value is called a case, and the variable being
switched on is checked for each case.
• switch(expression, case1, case2, case3, …)
• If the value of expression is not a character string,
it is coerced to integer.
• We can have any number of cases within a
switch.
• Unlike C, the cases are not written with a colon. In the
character form, each case is given as name = value.
• If the value of the integer is between 1 and
nargs()-1 (the maximum number of cases), then the
corresponding case is
evaluated and the result is returned.
• If expression evaluates to a character string
then the string is matched (exactly) to the
names of the elements.
• If there is more than one match, the first
matching element is returned.
• In the integer form no default is available: if the number
is out of range, switch() returns NULL invisibly. In the
character form, an unnamed final argument acts as the default.
• switch(2, "red", "green", "blue")
• switch("color", "color" = "red", "shape" = "square", "length" = 5)
• Output- [1] "green"
[1] "red"
• If the value evaluated is a number, that item of the list
is returned.
• In the above example, "red", "green", "blue" form a
three item list. The switch() function returns the
item corresponding to the numeric value evaluated.
• In the above example, green is returned.
• The result of the statement can be a string as well.
• In this case, the matching named item’s value is
returned.
• In the above example, “color” is the string that is
matched and its value “red” is returned.
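A sketch of the character form of switch(), assuming a hypothetical helper named describe. It shows two behaviours of switch(): an empty case falls through to the next value, and an unnamed last argument is used when nothing matches.

```r
# describe() is a made-up helper for illustration
describe <- function(x) {
  switch(x,
         "a" = ,              # empty case: falls through to the next value
         "e" = "vowel",
         "z" = "last letter",
         "other")             # unnamed argument: used when nothing matches
}

describe("a")   # "vowel"
describe("z")   # "last letter"
describe("q")   # "other"

# Integer form: an out-of-range index gives NULL
out_of_range <- switch(4, "red", "green", "blue")
is.null(out_of_range)   # TRUE
```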
Loops
• In General, statements are executed
sequentially.
• Loops are used in programming to repeat a
specific block of code.
• R provides various looping structures like for
loop, while loop and repeat loop.
for loop
• A for loop is a repetition control structure that allows us
to efficiently write a loop that needs to execute a
specific number of times.
• A for loop is used to iterate over a vector in R
programming.
for ( value in sequence)
{
statements
}
• Here sequence is a vector and value takes on each of
its value during the loop.
• In each iteration, statements are evaluated.
• X <- c(2,5,3,9,8,11,6)
count <- 0
for (val in X)
{
if (val %% 2 == 0)
count = count+1
}
cat("no of even numbers in", X, "is", count, "\n")
• o/p = ?
• The for loop in R is flexible: it is not
limited to integers in the input.
• We can pass character vector, logical vector,
lists or expressions.
• Ex-
• V <- c("a", "e", "i", "o", "u")
for (vowel in V)
{
print(vowel)
}
• o/p- ?
while loop
• A while loop is used to repeat a block of code until a
specific condition is met.
• Syntax-
while ( test_expression)
{ statement
}
• Here, test expression is evaluated and the body of the
loop is entered if the result is TRUE.
• The statements inside the loop are executed and the
flow returns to evaluate the test_expression again.
• This is repeated each time until test_expression
evaluates to FALSE, in which case the loop exits.
num = 5
sum = 0
while (num > 0)
{ sum = sum + num
num = num - 1
}
cat("the sum is", sum, "\n")
repeat loop
• A repeat loop is used to iterate over a block of
code multiple number of times.
• There is no condition check in repeat loop to
exit the loop. We must ourselves put a
condition explicitly inside the body of the loop
and use the break statement to exit the loop.
• Otherwise it will result in an infinite loop.
repeat {
statements
if (condition)
{
break
}
}
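The repeat syntax above can be made concrete with a small sketch that adds the numbers 1 to 5. The variable names i and total are chosen only for this illustration.

```r
i <- 1
total <- 0

repeat {
  total <- total + i
  i <- i + 1
  # Without this explicit exit condition the loop would never end
  if (i > 5) {
    break
  }
}

cat("the sum is", total, "\n")
```

The body always runs at least once, because the exit test sits inside the loop rather than before it.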
Loop Control Statements
• Loop control statements are also known as
jump statements.
• Loop control statements change execution
from its normal sequence.
• Unlike C++, a loop in R does not create a new scope,
so objects created inside the loop body still exist
after the loop ends.
• The loop control statements in R are break
statement and next statement.
break statement
• A break statement is used inside a loop
(repeat, for, while) to stop the iterations and
flow the control outside of the loop.
• In a nested looping situation, where there is a
loop inside another loop, this statement exits
from the innermost loop that is being
evaluated.
• x <- 1:10
for (val in x) {
if (val == 3) {
break
}
print(val)
}
• o/p = ?
• In the above example, we iterate over the vector
x, which has consecutive numbers from 1 to 10.
• Inside the for loop we have used an if condition
to break if the current value is equal to 3.
next statement
• A next statement is useful when we want to
skip the current iteration of a loop without
terminating it.
• On encountering next, R skips the rest of the
current iteration and starts the next iteration of
the loop.
• This is equivalent to the continue statement in
C, Java and Python.
• X <- 1:10
for( val in X) {
if ( val == 3) {
next
}
print( val)
}
• We use the next statement inside a condition to
check if the value is equal to 3.
• If the value is equal to 3, the current evaluation
stops( value is not printed) but the loop continues
with the next iteration.
Functions
• Functions are used to logically break our code
into simpler parts which becomes easy to
maintain and understand.
• A function is a set of statements organized
together to perform a specific task.
• R has a large no of built-in functions and the
user can create their own functions.
• A function is an object, with or without
arguments.
Function Definition
• The reserved word function is used to declare a
function in R.
• func_name <- function(argument)
{
Statement
}
• This function object is given a name by assigning it to a
variable, func_name.
• The statements within the curly braces form the body
of the function. These braces are optional if the body
contains only a single expression.
• Following are the components of a function in R-
1. Function Name – This is the actual name of the
function. It is stored in R environment as an
object with this name.
2. Arguments – When a function is invoked, we can
pass values to the arguments. Arguments are
optional. A function may or may not contain
arguments. The arguments can also have default
values.
3. Function Body – The function body contains a
collection of statements that defines what the
function does.
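The components listed above can be inspected directly, since a function is an ordinary R object. A minimal sketch using the base functions formals() and body(); the one-line power function here is a simplified stand-in for illustration.

```r
# Single-expression body, so the braces can be left out
power <- function(x, y = 2) x^y

arg_names <- names(formals(power))   # the argument names
default_y <- formals(power)$y        # the default value of y
fn_body   <- deparse(body(power))    # the body as text

print(arg_names)
print(default_y)
print(fn_body)
print(power(3))                      # y falls back to its default
```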
Function Calling
• We can create user-defined functions in R. They are
specific to what a user wants and once created they can
be used like built-in functions.
• power <- function(x,y)
{
result <- x^y
cat(x, "raised to the power", y, "is", result, "\n")
}
• power(2,3)
• Here, the arguments used in the function declaration x
and y are called formal arguments and those used while
calling the function are called actual argument.
Function without Arguments
• It is possible to create a function in R without
arguments.
• square <- function()
{
for (i in 1:5)
cat("square of", i, "is", (i*i), "\n")
}
• square()
Function with named Arguments
• When a function is called with named arguments, the
order of the actual arguments does not matter, so we can
pass the arguments in a shuffled order.
• For example, all the function calls given below are
equivalent.
• power <- function(x,y)
{
result <- x^y
cat(x, "raised to the power", y, "is", result, "\n")
}
• power(2,3)
• power(x=2,y=3)
• power(y=3,x=2)
• Further we can use named and unnamed
arguments in a single function call.
• In such case, all the named arguments are
matched first and then the remaining
unnamed arguments are matched in a
positional order
• power( x=2,3)
• power(2, y=3)
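The matching order can be checked with a variant of power that returns its result instead of printing it (a simplification made for this sketch).

```r
power <- function(x, y) x^y

power(2, 3)          # positional: x = 2, y = 3
power(y = 3, x = 2)  # named, shuffled order: same result

# Mixed calls: named arguments are matched first, then the
# remaining unnamed value fills the first free formal argument
r1 <- power(y = 2, 3)   # y = 2 by name, so 3 goes to x
r2 <- power(3, x = 2)   # x = 2 by name, so 3 goes to y

print(c(r1, r2))
```

Here r1 is 3^2 and r2 is 2^3, which shows that position only counts among the unnamed arguments.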
Function with default Arguments
• We can assign default values to arguments in a
function in R. This is done by providing an
appropriate value to the formal argument in the
function declaration.
• The function named power is defined with a
default value for y in the following example
program. If no value is passed for y, then the
default value is taken.
• If a value is passed for y, then the default value
will be overridden.
• power <- function(x,y=2)
{
result <- x^y
cat(x, "raised to the power", y, "is", result, "\n")
}
• power(2)
• power(2,3)
Built-in Functions
• There are several built-in functions available in
R. These functions can be directly used in user
written program.
• The built-in functions can be grouped into
mathematical functions, character functions,
statistical functions, probability functions,
date functions, time functions and other
useful functions.
Mathematical functions
1. abs()- this function computes the absolute value
of numeric data.
• The syntax is abs(x), where x is any numeric
value, array or vector.
• abs(-1)
• x <- c( -2,4,0,45,9,-4)
• abs(x)
• x <- matrix (c( -3,5,-7,1,-9,4), nrow= 3, ncol=2,
byrow=TRUE)
• abs(x[1, ])
• abs (x[, 1])
2. sin(), cos() and tan()- the function sin()
computes the sine value, cos() computes the
cosine value and tan() computes the tangent
value of numeric data given in radians.
• Syntax is sin(x), cos(x), tan(x), where x is any
numeric, array or vector.
• sin(10) , cos(90) , tan(50)
• x <- c( -2,4,0,45,9,-4)
• sin(x) , cos(x) , tan(x)
• x <- matrix (c( -3,5,-7,1,-9,4), nrow= 3, ncol=2,
byrow=TRUE)
• sin(x[1, ]) ,cos(x[,1 ]), tan(x[1,])
3. asin(), acos() and atan() – asin() computes the
inverse sine value, acos() computes the inverse cosine
value and atan() computes the inverse tangent value
of numeric data; the results are angles in radians.
• asin(1), acos(1), atan(50)
4.exp(x) – the function computes the exponential
value of a number or number vector, e^x.
• x=5 , exp(x)
5. ceiling()- This function returns the smallest integer
not less than the given number.
• x <- 2.5
• ceiling(x)
• 3
6. floor()- This function returns the largest integer
not greater than the given number.
• x <- 2.5
• floor(x)
7. round()- This function rounds a number to the
given number of decimal places.
• The syntax is round(x, digits=n), where x is a
numeric variable or a vector and digits specifies
the number of decimal places to be kept.
• x <- 2.587888
• round(x,3)
8. trunc()- This function returns the integer part of
a number, discarding the decimal part.
• x <- 2.99
• trunc(x)
9. signif(x, digits=n)- This function rounds the
values in its first argument to the specified
number of significant digits.
• x <- 2.587888
• signif(x,3)
• 2.59
10. log(), log10(), log2(), log(x,b)- log() computes the
natural logarithm of a number or vector, log10() and log2()
compute logarithms to base 10 and base 2, and log(x,b)
computes the logarithm of x to base b.
11. max() and min() – max() function computes the
max value of a vector and min() function
computes the minimum value of a vector.
• x <- c(10, 289, -100, 8000)
• max(x) , min(x)
12. beta() and lbeta()- beta() returns the value of the beta
function and lbeta() returns the natural logarithm of
the beta function.
• beta(4,9)
• lbeta(4,9)
o/p - 0.0005050, -7.590852
13. gamma()- this function returns the gamma
function Γ(x).
• x=5
• gamma(x)
• o/p – 24
14. factorial ()- this function computes factorial
of a number or a numeric vector.
• x=5
• factorial(x)
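The rounding functions differ mainly on negative numbers, and the logarithm functions differ only in their base. A short sketch with arbitrary sample values:

```r
x <- c(-2.5, 2.5, 2.587888)

ceiling(x)   # -2  3  3  (rounds up)
floor(x)     # -3  2  2  (rounds down)
trunc(x)     # -2  2  2  (rounds toward zero)

round(2.587888, 3)    # 2.588 (three decimal places)
signif(2.587888, 3)   # 2.59  (three significant digits)

# Logarithms to different bases
logs <- c(log(exp(1)), log10(1000), log2(8), log(81, base = 3))
print(logs)

# gamma(n) equals factorial(n - 1) for positive integers
gamma(5) == factorial(4)
```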
Character Function
• These functions are used for string handling
operations like extracting characters from a
string, extracting substrings from a string,
concatenation of strings, matching strings,
inserting strings, converting strings from one
case to another and so on.
1. agrep()- this function searches for
approximate matches to pattern within each
element of the string.
• agrep( pattern, x, ignore.case=FALSE, value=
FALSE, max.distance=0.1, useBytes= FALSE)
• x <- c("R language", "and", "SAND")
• agrep("an", x)
• agrep("an", x, ignore.case=TRUE)
• agrep("uag", x, ignore.case=TRUE, max=1)
• agrep("uag", x, ignore.case=TRUE, max=2)

• [1] 1 2
• [1] 1 2 3
• [1] 1
• [1] 1 2 3
2. char.expand()- This function seeks a unique
match of its first argument among the elements
of its second.
• If successful, it returns this element; otherwise, it
evaluates the expression given as the third
argument. The syntax is as follows-
char.expand(input, target, nomatch = stop("no match"))
• Where input is the character string to be
expanded, target is the character vector with the
values to be matched against, and nomatch is an R
expression to be evaluated in case expansion was
not possible, for example a call to warning() that
prints a warning message.
• The input is matched only against the beginning of
each target element.
• x <- c("sand", "and", "land")
• char.expand("an", x, warning("no expand"))
• char.expand("a", x, warning("no expand"))
3. charmatch()- This function finds matches
between two arguments and returns the index
position.
• charmatch( x, table, nomatch= NA_integer_)
• Where x gives the value to be matched, table
gives the value to be matched against and
nomatch gives the value to be returned at non
matching positions.
• charmatch("an", c("and", "sand"))
• charmatch("an", "sand")
• [1] 1
• [1] NA
4. charToRaw() – This function converts characters
to their ASCII codes as "raw" objects (shown in hexadecimal).
• x <- charToRaw("a")
• y <- charToRaw("AB")
• [1] 61
• [1] 41 42
5. chartr() – this function is used for character
substitutions.
• chartr(old, new, x)
• x <- "apples are red"
• chartr("a", "g", x)
6. dQuote()- this function is used for putting double
quotes on a text.
• x <- "2013-06-12"
• dQuote(x)
7. format()- numbers and strings can be formatted
to a specific style using format() function.
• Ex- format(x, digits, nsmall, scientific, width,
justify = c("left", "right", "centre", "none"))
8. gsub()- this function replaces all matches of a
pattern in a string. If the input is a string vector, it returns
a string vector of the same length and with the
same attributes.
• gsub(pattern, replacement, x, ignore.case=FALSE)
Ex- x <- "apples are red"
gsub("are", "were", x)
o/p- "apples were red"
9. nchar() & nzchar()- nchar() determines
the size of each element of a character
vector. nzchar() tests whether elements of a
character vector are non-empty strings.
Syn- nchar(x, type="chars", allowNA=FALSE)
Syn- nzchar(x)
10. noquote()- This function prints out strings
without quotes. The syntax is noquote(x)
where x is a character vector.
Ex- letters
noquote(letters)
11. paste()- Strings in R are combined using the
paste() function. It can take any number of string
arguments to be combined together.
Syn- paste(…, sep = " ", collapse = NULL)
• Where … represents any number of arguments
to be combined and sep represents the separator
placed between the arguments. It is optional.
• collapse is an optional string used to join the
elements of the result into a single string; it does not
affect the spaces within a string.
• Ex- a <- "hello"
• b <- "everyone"
• print(paste(a,b))
• print(paste(a,b, sep = "-"))
• print(paste(a,b, sep = "", collapse = ""))
12. replace()- This function replaces the values in x
at the indices given in list with those given in values.
If necessary, the values in 'values' are recycled.
Syn- replace(x, list, values)
Ex- x <- c("green", "red", "yellow")
y <- replace(x, 1, "black")
13. sQuote()- This function is used for putting single
quotes on a text.
X <- "2013-06-12 19:18:05"
sQuote(X)
14. strsplit()- This function splits the elements of a
character vector x into substrings according to
the matches to substring split within them.
Syn- strsplit( x, split)
15. substr()- This function extracts or replaces
substrings in a character vector.
Syn- substr(x, start, stop)
substr(x, start, stop) <- value
Ex- substr("programming", 2, 3)
x = c("red", "blue", "green", "yellow")
substr(x, 2, 3) <- "gh"
16. tolower() – This function converts a string to
lower case.
Ex- tolower("R Programming")
17. toString() – This function produces a single
character string describing an R object.
Syn- toString(x)
toString(x, width = NULL)
18. toupper() – This function converts a string to
upper case.
Ex- toupper("r programming")
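A few of the character functions above, applied to one small vector. The vector words is made up for this sketch; note in particular the difference between sep and collapse in paste().

```r
words <- c("red", "green", "blue")

# sep joins the arguments element by element
with_sep <- paste(words, "pen", sep = "-")

# collapse joins the elements of the result into one string
one_string <- paste(words, collapse = ", ")

# strsplit() returns a list, one element per input string
parts <- strsplit("2013-06-12", "-")[[1]]

sizes <- nchar(words)
swapped <- gsub("e", "E", words)
piece <- substr("programming", 2, 3)

print(with_sep)
print(one_string)
print(parts)
```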
Statistical Function
1. mean()- The function mean() is used to
calculate the average or mean in R.
Syn- mean(x, trim = 0, na.rm = FALSE)
trim is the fraction (0 to 0.5) of observations to
drop from each end of the sorted vector and na.rm is
used to remove the missing values from the
input vector.
2. median()- the middle most value in a data
series is called the median. The median() function
is used in R to calculate this value.
Syn- median(x, na.rm = FALSE)
3. var()- returns the estimated variance of the
population from which the numbers in vector x are
sampled.
Ex- x <- c(10,2,30,2,5,8)
var(x, na.rm = TRUE)
4. sd()- returns the estimated standard deviation of
the population from which the numbers in vector x are
sampled.
Syn- sd(x, na.rm = TRUE)
5. scale()- returns the standard scores (z-scores) for
the numbers in x. It is used to standardize the columns
of a matrix.
Ex- x <- matrix(1:9, 3, 3)
scale(x)
Syn- x<- matrix(1:9, 3,3)
scale(x)
6. sum()- adds up all elements of a vector.
Syn- sum(x)
sum(c(1:10))
7. diff(x, lag=1)- returns suitably lagged and iterated
differences.
Syn- diff(x, lag, differences)
Where x is a numeric vector or matrix containing the
values to be differenced, lag is an integer indicating
which lag to use and differences is an integer indicating
the order of the difference.
• For ex., if lag=2, the differences between the third and first
value, between the fourth and the second value, and so on are
calculated.
• With differences=2, the function returns the differences of
the differences.
8. range()- returns a vector of the minimum and
maximum values.
Syn- x<- c(10,2,14,67,86,54)
range(x)
o/p- 2 86
9. rank()- This function returns the rank of the
numbers( in increasing order) in vector x.
Syn- rank(x, na.last = TRUE)
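10. skewness()- measures how asymmetric a distribution is
compared with the normal distribution. This function is not
part of base R; it is provided by add-on packages such as
e1071 or moments.
Syn- skewness(x)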
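The trim, lag and differences arguments described above are easiest to follow on a short vector. The sketch reuses the sample vector from the var() example.

```r
x <- c(10, 2, 30, 2, 5, 8)

mean(x)               # 9.5
median(x)             # 6.5
mean(x, trim = 0.2)   # drops one value at each end: mean of 2, 5, 8, 10

d1 <- diff(x)                   # x[i+1] - x[i]
d2 <- diff(x, lag = 2)          # x[i+2] - x[i]
dd <- diff(x, differences = 2)  # differences of d1

r <- rank(x)   # tied values share the average of their ranks

print(d1)
print(d2)
print(dd)
print(r)
```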
Date and Time Functions
• R provides several options for dealing with date and
date/time.
• Three date/time classes commonly used in R are Date,
POSIXct and POSIXlt.
1. Date – The Date class stores a date without a time.
date() returns the current system date and time as a
character string. Sys.Date() and Sys.time() return the
system's date and date-time.
Syn- date()
Sys.Date()
Sys.time()
• We can create a date as follows-
• Dt <- as.Date("2012-07-22")
• While creating a date, a non-standard format must be
specified.
• Dt2 <- as.Date("04/20/2011", format = "%m/%d/%Y")
• Dt3 <- as.Date("October 6, 2010", format = "%B %d, %Y")
2. POSIXct- If we have times in our data, this is
usually the best class to use. In POSIXct, "ct"
stands for calendar time.
• We can create some POSIXct objects as follows.
Tm1 <- as.POSIXct("2013-07-24 23:55:26")
o/p – "2013-07-24 23:55:26 PDT" (the time zone shown
depends on the system)
Tm2 <- as.POSIXct("25072012 08:32:07", format =
"%d%m%Y %H:%M:%S")
• We can specify the time zone as follows.
Tm3 <- as.POSIXct("2010-12-01 11:42:03",
tz = "GMT")
• Times can be compared as follows.
• Tm2> Tm1
• We can add or subtract seconds as follows.
• Tm1 +30
• Tm1- 30
• Tm2 - Tm1
3. POSIXlt- This class enables easy extraction of
specific components of a time. In POSIXlt, "lt"
stands for local time.
• "lt" also helps one remember that POSIXlt objects
are lists.
• Tm1.lt <- as.POSIXlt("2013-07-24 23:55:26")
• o/p- "2013-07-24 23:55:26"
• We can extract the components in time as follows.
• unlist(Tm1.lt)
sec min hour mday mon year wday yday isdst
26  55  23   24   6   113  3    204  1
• mday, wday, yday stands for day of the month, day of
the week and day of year resp.
• A particular component of a time can be extracted as
follows.
• Tm1.lt$sec
• we can truncate or round off the times as given below.
• trunc(Tm1.lt, "days") o/p - "2013-07-24"
• trunc(Tm1.lt, "mins") o/p – "2013-07-24 23:55:00"
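The date and time operations above can be combined in a short sketch. The time zone is fixed to UTC here (an assumption made so the result does not depend on the system), and the variable names are chosen only for this illustration.

```r
d1 <- as.Date("2012-07-22")
d2 <- as.Date("04/20/2011", format = "%m/%d/%Y")

# Subtracting two dates gives a difference in days
days_between <- as.numeric(d1 - d2)

# format() turns a date back into text in any layout
as_text <- format(d1, "%d/%m/%Y")

t1 <- as.POSIXct("2013-07-24 23:55:26", tz = "UTC")
t2 <- t1 + 60                       # POSIXct arithmetic is in seconds
gap <- as.numeric(difftime(t2, t1, units = "secs"))

# POSIXlt stores the components; year counts from 1900, mon from 0
lt <- as.POSIXlt(t1)
parts <- c(year = lt$year + 1900, month = lt$mon + 1, hour = lt$hour)

print(days_between)
print(as_text)
print(parts)
```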
Other Functions
1. rep(x, ntimes) – This function repeats x n
times.
Ex.- rep(1:3, 4)

2. cut(x, n)- divides a continuous variable into a factor
with n levels.
X <- c(1,2,3,1,2,3,1)
cut(X, 2)
Recursive Function
• A function that calls itself is called a recursive function
and this technique is known as recursion.
• This special programming technique can be used to
solve problems by breaking them into smaller and
simpler sub- problems.
• Recursive functions call themselves. They break down
the problem into the smallest possible components.
• The function() calls itself within the original function()
on each of the smaller components. After this, the
results will be put together to solve the original
problem.
• recursive.factorial <- function(x)
{
if (x == 0)
return (1)
else
return (x * recursive.factorial(x-1))
}
• recursive.factorial(5)
Convert decimal number to binary-
• convert_to_binary <- function(n)
{
if (n > 1)
{
convert_to_binary(as.integer(n/2))
}
cat(n %% 2)
}
• convert_to_binary(5)
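The function above prints the digits as a side effect. A variant that returns the binary form as a string is sketched below; the name to_binary is made up for this illustration, and it assumes a non-negative whole number as input.

```r
# Hypothetical variant: returns the digits instead of printing them
to_binary <- function(n) {
  # Base case: 0 and 1 are their own binary form
  if (n < 2) {
    return(as.character(n))
  }
  # Recursive case: binary form of the quotient, then the last digit
  paste0(to_binary(n %/% 2), n %% 2)
}

to_binary(5)    # "101"
to_binary(10)   # "1010"
```

Returning the value makes the result reusable, for example to store it in a variable or compare it with another string.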
Classes and objects
• Everything in R is an object.
• An object is a data structure having some attributes and methods
which act on its attributes.
• Class is a blueprint for the object. We can think of class like a sketch
(prototype) of a house. It contains all the details about the floors,
doors, windows etc. Based on these descriptions we build the
house.
• House is the object. As, many houses can be made from a
description, we can create many objects from a class. An object is
also called an instance of a class and the process of creating this
object is called instantiation.
• While most programming languages have a single class system, R
has three class systems. Namely, S3, S4 and more recently
Reference class systems.
• They have their own features and peculiarities and choosing one
over the other is a matter of preference.
S3 Class
• S3 class is somewhat primitive in nature. It lacks a
formal definition and object of this class can be created
simply by adding a class attribute to it.
• This simplicity accounts for the fact that it is widely
used in R programming language. In fact most of the R
built-in classes are of this type.
• S3 is used to overload functions through generic
functions. The same function name can be called on
different kinds of objects, and the method that runs
depends on the class of the argument.
• The class of an object is determined by
its class attribute, a character vector of class names.
• S3 is different from conventional programming languages
like Java, C++, and C# that implement message passing OO:
in S3, methods belong to generic functions rather than to
objects or classes. This makes S3 easier to implement.
• The below ex. shows how to define a function that will
create and return an object of a given class. A list is created
with the relevant members, the list’s class is set, and a copy
of the list is being returned. The components of the list
become the member variables of the object.
• s <- list(name = "ABC", age = 29, GPA = 4.0)
• class(s) <- "student"
• s
• o/p- $name [1] "ABC"
$age [1] 29
$GPA [1] 4
attr(,"class")
[1] "student"
• How to use constructors to create objects?
• It is a good practice to use a function with the same name as
class (not a necessity) to create objects.
• This will bring some uniformity in the creation of objects and
make them look similar.
• We can also add some integrity check on the member attributes.
Here is an example. Note that in this example we use
the attr() function to set the class attribute of the object.
• # a constructor function for the "student" class
student <- function(n,a,g)
{
if(g>4 || g<0) stop("GPA must be between 0 and 4")
value <- list(name = n, age = a, GPA = g)
# class can be set using class() or attr() function
attr(value, "class") <- "student"
value
}
Methods and Generic Functions
• In the above example, when we simply write the name of
the object, its internals get printed. In interactive mode,
writing the name alone will print it using
the print() function.
• How does print() know how to print these variety of
dissimilar looking object?
• The answer is, print() is a generic function. Actually, it has a
collection of a number of methods. You can check all these
methods with methods(print).
• Printing our object of class "student" looks for a method of
the form print.student(), but there is no method of this
form.
• So, which method did our object of class "student" call? It
called print.default(). This is the fallback method which is
called if no other match is found. Generic functions have a
default method.
• Now, we will make a generic function – GPA
GPA <- function(obj)
{
UseMethod("GPA")
}
• Now, we will make a new method for the class
“student”
GPA.student <- function(obj)
{
cat("Total GPA is", obj$GPA, "\n")
}
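The pieces above can be put together in one sketch: the constructor, the GPA generic with its method, and two additions that are not in the original slides, a print.student method and a GPA.default fallback. Both additions are illustrative assumptions.

```r
# Constructor with an integrity check, as on the earlier slide
student <- function(n, a, g) {
  if (g > 4 || g < 0) stop("GPA must be between 0 and 4")
  structure(list(name = n, age = a, GPA = g), class = "student")
}

# Generic function and its method for class "student"
GPA <- function(obj) UseMethod("GPA")
GPA.student <- function(obj) cat("Total GPA is", obj$GPA, "\n")

# Assumed additions: a fallback method and a custom print method
GPA.default <- function(obj) cat("No GPA available for this object\n")
print.student <- function(x, ...) {
  cat("Student:", x$name, "| age:", x$age, "| GPA:", x$GPA, "\n")
  invisible(x)
}

s <- student("Paul", 26, 3.7)
print(s)   # dispatches to print.student
GPA(s)     # dispatches to GPA.student
GPA(42)    # no method for numeric, so GPA.default runs
```

Dispatch looks for a function named generic.class first and falls back to generic.default when no such method exists.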
S4 class
• S4 classes are an improvement over S3 classes. They
have a formally defined structure which helps in
making objects of the same class look more or less
similar.
• Class components are properly defined using
the setClass() function and objects are created using
the new() function.
• We specify a function to verify that the data is
consistent (validation) and also specify the default
values (the prototype).
• setClass ("student", slots=list(name="character",
age="numeric", GPA="numeric"))
• Two functions help to identify objects: is.object()
and isS4().
• We use is.object() to determine whether a
variable refers to an object or not.
• We use isS4() to determine whether a
variable is an S4 object or not.
• Both are needed because isS4() alone cannot show that a
variable is an S3 object. First, we need to determine
whether the variable is an object and then decide if it is
S4 or not.
• To get information about the data elements, or slots,
of an object, we can use the slotNames() function, which
takes either an object or the name of a class. It returns
the names of the slots of the class as strings.
• How to create S4 objects?
• S4 objects are created using the new() function.
• s <- new("student",name="John", age=21, GPA=3.5)
• s
• We can check if an object is an S4 object through the
function isS4().
• isS4(s)
• [1] TRUE
• The function setClass() returns a generator function.
• This generator function (usually having same name as the
class) can be used to create new objects. It acts as a
constructor.
• student <- setClass("student", slots=list(name="character",
age="numeric", GPA="numeric"))
• student
• How to access and modify slots?
• Just as components of a list are accessed using $, slots of an
object are accessed using @.
• Accessing slot
• s@name
• [1] "John"
• Modifying slot directly
• A slot can be modified through reassignment.
• s@GPA <- 3.7
• s
• Modifying slots using slot() function
• Similarly, slots can be accessed or modified using
the slot() function.
• slot(s,"name")
• [1] "John"
• slot(s,"name") <- "Paul"
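• The generator function and the two ways of accessing slots can be combined in one short sketch. The failing assignment at the end is an added illustration of the fact that slot types are checked:

```r
# setClass() returns a generator; calling it is equivalent to new("student", ...)
student <- setClass("student",
  slots = list(name = "character", age = "numeric", GPA = "numeric"))

s <- student(name = "John", age = 21, GPA = 3.5)

s@GPA <- 3.7                # direct modification with @
slot(s, "name") <- "Paul"   # modification with slot()

s@name            # "Paul"
slot(s, "GPA")    # 3.7

# Slot types are enforced: assigning a character to a numeric slot fails
res <- try(s@age <- "twenty-one", silent = TRUE)
inherits(res, "try-error")   # TRUE
s@age                        # still 21
```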
Reference Class
• Reference classes were introduced later than the
other two systems. They are more similar to the object-oriented
programming we are used to seeing in other major
programming languages.
• Reference classes are basically S4 classes with an
environment added to them.
• setRefClass("student")
• Member variables of a class, if defined, need to be
included in the class definition. Member variables of
reference class are called fields (analogous to slots in
S4 classes).
• setRefClass("student", fields = list(name = "character",
age = "numeric", GPA = "numeric"))
• How to create reference objects?
• The function setRefClass() returns a generator function
which is used to create objects of that class.
• student <- setRefClass("student", fields = list(name =
"character", age = "numeric", GPA = "numeric"))
• s <- student(name = "John", age = 21, GPA = 3.5)
• s
• How to access and modify fields?
• Fields of the object can be accessed using
the $ operator.
• s$name
• [1] "John"
• Similarly, it is modified by reassignment.
• s$name <- "Paul"
Reference Methods

• Methods are defined for a reference class and
do not belong to generic functions as in S3
and S4 classes.
• All reference classes have some methods
predefined because they all inherit from
the superclass envRefClass.
• We can create our own methods for the class.
• This can be done during the class definition by
passing a list of function definitions
to the methods argument of setRefClass().
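• A minimal sketch of a reference class with its own methods. The method names inc_age() and describe() are hypothetical and chosen only for this example; copy() is one of the methods inherited from envRefClass:

```r
# Reference class with fields and methods supplied at definition time
student <- setRefClass("student",
  fields = list(name = "character", age = "numeric", GPA = "numeric"),
  methods = list(
    inc_age = function(x) {
      age <<- age + x        # <<- modifies the field of this object in place
      invisible(.self)
    },
    describe = function() {
      paste(name, "is", age, "years old")
    }
  )
)

s <- student(name = "John", age = 21, GPA = 3.5)
s$inc_age(1)
s$describe()       # "John is 22 years old"

# Reference semantics: plain assignment does not copy the object
s2 <- s
s2$name <- "Paul"
s$name             # "Paul"

# copy() creates an independent object
u <- s$copy()
u$name <- "Anna"
s$name             # still "Paul"
```

• The difference from S3 and S4 is visible in the last two steps: modifying s2 also changes s, because both names refer to the same underlying environment.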
Debugging
• A syntactically correct program may still give us incorrect
results due to logical errors. When such errors (i.e.
bugs) occur, we need to find out why and where they
occur so that we can fix them. The procedure to
identify and fix bugs is called “debugging”.
• There are a number of R debug functions, such as:
• traceback()
• debug()
• browser()
• trace()
• recover()
Fundamental Principles of R
Debugging
1. The Essence of Debugging - The principle of
confirmation: Fixing a buggy program is a process of
confirming, one by one, that the many things we believe
to be true about the code are actually true. When we find
that one of our assumptions is not true, we have found a
clue to the location of a bug.
2. Start Small - Stick to small simple test cases, at least at
the beginning of the R debug process. Working with
large data objects may make it harder to think about
the problem. Of course, we should eventually test our
code in large, complicated cases, but start small.
3. Debug in a Modular, Top-Down Manner - Most professional
software developers agree that code should be written in a
modular manner. Our first-level code should not be long,
and much of it should consist of function calls.
Those functions should not be too lengthy and should call
other functions if necessary. This makes the code easier to
write and also easier for others to understand when
the time comes for the code to be extended.
• We should debug in a top-down manner.
4. Antibugging - If we have a section of code in which a
variable x should be positive, then we can insert this line:
stopifnot(x > 0)
• If a bug earlier in the code makes x equal
to, say, -3, the call to stopifnot() will stop execution right
there, with an error message like this: Error: x > 0 is not
TRUE
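• A minimal sketch of antibugging with stopifnot(). The function safe_log() is a hypothetical example, not taken from the slides:

```r
# Hypothetical function that assumes its input is positive
safe_log <- function(x) {
  stopifnot(x > 0)   # antibugging check: fail early if the assumption is violated
  log(x)
}

safe_log(10)         # works as usual

# With a bad input, the check stops execution at the point of failure
msg <- tryCatch(safe_log(-3), error = function(e) conditionMessage(e))
msg                  # the message names the violated condition
```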
R Debug Functions
• 1. traceback() - If our code has already crashed
and we want to know where the offending line is,
then try traceback(). This will (sometimes) show
where in the code the problem
occurred.
• When an R function fails, an error is printed to
the screen. Immediately after the error, we can
call traceback() to see in which function the error
occurred. The traceback() function prints the list
of functions that were called before the error
occurred. The functions are printed in reverse
order, with the most recent call first.
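• A small sketch of a failing call chain. The functions f(), g() and h() are hypothetical stand-ins. Since traceback() is meant for interactive use, the second half captures the call stack with sys.calls() inside an error handler, which is a substitute technique and not what traceback() does internally:

```r
# Hypothetical call chain: f() calls g(), which calls h(), which fails
h <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  x * 2
}
g <- function(x) h(x) + 1
f <- function(x) g(x)

# In an interactive session:
#   f("a")        # Error in h(x) : x must be numeric
#   traceback()   # lists the calls h(x), g(x), f("a"), most recent first

# Non-interactive substitute: record the call stack when the error is signalled
calls <- NULL
res <- try(
  withCallingHandlers(f("a"), error = function(e) calls <<- sys.calls()),
  silent = TRUE
)

# Names of the functions that were on the stack at the time of the error
stack <- vapply(calls, function(cl) deparse(cl[[1]])[1], character(1))
intersect(c("f", "g", "h"), stack)
```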
2. debug()- The function debug() in R allows the
user to step through the execution of a
function, line by line. At any point, we can
print out values of variables or produce a
graph of the results within the function. While
debugging, we can simply type “c” to continue
to the end of the current section of
code. traceback() does not tell us where the
error occurred in the function. In order to
know which line causes the error, we will have
to step through the function using debug().
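• A minimal sketch of switching debugging on and off for a function. The function col_means() is hypothetical; the stepping itself only happens in an interactive session, so it is shown as comments:

```r
# Hypothetical function to step through
col_means <- function(df) {
  nums <- df[sapply(df, is.numeric)]
  colMeans(nums)
}

debug(col_means)        # flag the function for line-by-line stepping
isdebugged(col_means)   # TRUE

# In an interactive session, calling col_means(iris) now opens the browser:
#   n  executes the next line
#   c  continues to the end of the current section of code
#   Q  quits the debugger
# Variables such as nums can be printed at the Browse prompt.

undebug(col_means)      # remove the flag
isdebugged(col_means)   # FALSE
col_means(iris)         # runs normally again
```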
3. browser() - The R debug function browser() stops the
execution of a function until the user allows it to
continue. This is useful if we don’t want to step
through the complete code, line-by-line, but we want it
to stop at a certain point so we can check out what is
going on. Inserting a call to the browser() in a function
will pause the execution of a function at the point
where the browser() is called. Similar to using debug()
except we can control where execution gets paused.
4. trace() - Calling trace() on a function allows the user to
insert bits of code into a function. The syntax of the R
debug function trace() is a bit strange for first-time
users, who might be better off using debug().
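• A minimal sketch of browser() and trace() on a hypothetical function scale_values(). The browser() call is commented out because it only makes sense interactively; the trace() call inserts a cat() statement at the start of the function without editing its source:

```r
# Hypothetical function under investigation
scale_values <- function(x) {
  centered <- x - mean(x)
  # browser()   # uncomment to pause here in an interactive session
  centered / sd(x)
}

# Insert a line of code at the start of the function
trace(scale_values,
      tracer = quote(cat("scale_values called with", length(x), "values\n")),
      print = FALSE)

# The inserted code runs on every call
out <- capture.output(result <- scale_values(c(1, 2, 3)))
out

untrace(scale_values)    # restore the original function
```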
5. recover() - When we are debugging a
function, recover() allows us to check variables in
upper-level functions.
• By typing a number in the selection, we are
navigated to the function on the call stack and
positioned in the browser environment.
• We can use recover() as an error handler, set
using options() (e.g. options(error = recover)).
• When a function throws an error, execution is
halted at the point of failure. We can browse the
function calls and examine the environment to
find the source of the problem.
• In recover, we use the previous f(), g() and h()
functions for debugging.
Error Handling & Recovery in R
• Exception or error handling is the process of
responding to anomalous occurrences in the
code that disrupt its normal flow. In general,
the scope of an exception handler begins with
try and ends with catch. R
provides the try() and tryCatch() functions for this
purpose.
• The try() function is a wrapper around
tryCatch() that prints the error and then
continues. On the other hand, tryCatch() gives
us control over the error-handling function and also,
optionally, lets the function continue.
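• A minimal sketch contrasting try() and tryCatch(). The helper functions parse_number() and safe_parse() are hypothetical and exist only for this example:

```r
# Hypothetical helper that may fail or warn
parse_number <- function(x) {
  if (!is.character(x)) stop("input must be a character string")
  as.numeric(x)      # emits a warning when x is not a valid number
}

# try(): the error is caught and execution continues
res <- try(parse_number(42), silent = TRUE)
inherits(res, "try-error")     # TRUE

# tryCatch(): separate handlers for warnings and errors, plus cleanup code
safe_parse <- function(x) {
  tryCatch(
    parse_number(x),
    warning = function(w) NA_real_,          # invalid number
    error = function(e) {
      message("Recovered from error: ", conditionMessage(e))
      NA_real_
    },
    finally = message("finished parsing")    # always runs
  )
}

safe_parse("3.14")   # 3.14
safe_parse("abc")    # NA (warning handled)
safe_parse(42)       # NA (error handled)
```

• The handlers decide what value the call returns, so the caller can carry on with a sensible default instead of stopping.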
Unit-4
• Functions for Reading Data into R:
• Usually we will be using data already in a file that we
need to read into R in order to work on it. R can read
data from a variety of file formats—for example, files
created as text, or in Excel.
• We will mainly be reading files in text format .txt or
.csv (comma-separated, usually created in Excel).
• To read an entire data frame directly, the external file
will normally have a special form -
1. The first line of the file should have a name for each
variable in the data frame.
2. Each additional line of the file has as its first item a row
label and the values for each variable.
1. CSV files – The csv file is a text file in which
the values in the columns are separated by a
comma. CSV stands for “comma-separated
values”.
• The file should be present in current working
directory so that R can read it.
• We can also set the directory from which
files are to be read or written.
# get and print current working directory
getwd()
# Set current working directory
setwd("D:/R programs")
1. Reading from a csv file - We can read from a csv
file using the read.csv() function.
• We can check for the type, number of rows and
columns in the data frame.
• Data <- read.csv("C:/Desktop/airquality.csv")
• The function read.table() can then be used to
read the data frame directly.
• air <- read.table("C:/Desktop/airquality.txt")
• We can use the colnames() command to assign
column names to the dataset.
• For comparison, reading CSV files in Python with pandas:
• result = pandas.read_csv(r'X:\data.csv')
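• Back in R, the csv steps above can be sketched end to end. To keep the example self-contained, the built-in airquality data set is first written to a temporary file; the path and the new column names are assumptions for illustration:

```r
# Write a small sample file so the example does not depend on C:/Desktop
csv_path <- file.path(tempdir(), "airquality.csv")
write.csv(airquality, csv_path, row.names = FALSE)

Data <- read.csv(csv_path)

class(Data)      # type of the result
nrow(Data)       # number of rows
ncol(Data)       # number of columns
str(Data)        # column types at a glance

# read.table() needs the header and separator spelled out for a csv file
air <- read.table(csv_path, header = TRUE, sep = ",")

# colnames() can assign new column names
colnames(air) <- c("ozone", "solar_r", "wind", "temp", "month", "day")
head(air, 3)
```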
2. Reading from excel file - Microsoft Excel is the
most widely used spreadsheet program, which
stores data in the .xls or .xlsx format.
• R can read directly from these files using some
Excel-specific packages. A few such packages are
XLConnect, xlsx and gdata. We will be using the “xlsx”
package.
• install.packages("xlsx")
• library(xlsx)
• mydata = read.xlsx("D:/myexcel.xlsx", sheetIndex
= 1)
• mydata = read.xlsx("D:/myexcel.xlsx",
sheetName = "mysheet1")
3. Reading data from JSON - The function fromJSON()
from the rjson package will return a list.
• install.packages("rjson")
• library(rjson)
• File1 <- fromJSON(file = "emp.json")
• We can then convert the list into a data frame
using the as.data.frame() function.
• Json_data <- as.data.frame(File1)
4. Reading data from XML file - The xml file is read
by R and converted to a data frame using the
function xmlToDataFrame() from the XML package.
• The result is returned as a data frame.
• library("XML")
• Result <- xmlToDataFrame("XMLREAD.xml")
5. Reading from binary files - We can read from binary files
using the function readBin().
• The syntax of the readBin() function is readBin(con, what, n),
where con is the connection object to read the
binary file from, what is the mode of the data to be read
(such as character, integer etc.) and n is the maximum
number of records to read from the file.
# creating connection object
• read.filename <- file("emp.dat", "rb")
# reading column names
• Col <- readBin(read.filename, character(), n = 3)
6. Reading data from HTML table - The function
readHTMLTable() from the XML package is used:
• readHTMLTable(doc, header = NA, colClasses = NULL,
skip.rows = integer(), trim = TRUE, elFun = xmlValue,
as.data.frame = TRUE, ...)
• There are a few very useful functions for reading data
into R.
• read.table() and read.csv() are two popular functions
used for reading tabular data into R.
• readLines() is used for reading lines from a text file.
• source() is a very useful function for reading in R code
files from another R program.
• dget() function is also used for reading in R code files
(R objects written out as text by dput()).
• load() function is used for reading in saved workspaces.
• unserialize() function is used for reading single R
objects in binary format.
1. Writing to a CSV file - We can create a csv file from an
existing data frame in R.
• The write.csv() function is used to create the csv
file.
• data <- read.csv("employee.csv")
• hr <- subset(data, dept == "HR")
• write.csv(hr, "hr.csv")
• The R base function write.table() can be used to
export a data frame or a matrix to a file.
• write.table(x, file, append = FALSE, sep = " ", dec
= ".", row.names = TRUE, col.names = TRUE)
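• A self-contained sketch of both writing functions. The employee data frame below is made up for illustration and stands in for employee.csv; files are written to a temporary directory:

```r
# Illustrative data; the column names mirror the employee example above
data <- data.frame(
  name   = c("Asha", "Ben", "Chen"),
  dept   = c("HR", "IT", "HR"),
  salary = c(520, 610, 580)
)

hr <- subset(data, dept == "HR")

out_csv <- file.path(tempdir(), "hr.csv")
write.csv(hr, out_csv, row.names = FALSE)   # omit the row-name column

out_txt <- file.path(tempdir(), "hr.txt")
write.table(hr, out_txt, sep = "\t", dec = ".",
            row.names = FALSE, col.names = TRUE, quote = FALSE)

# Reading the files back confirms the round trip
back_csv <- read.csv(out_csv)
back_txt <- read.table(out_txt, header = TRUE, sep = "\t")
back_csv
```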
2. Writing to an excel file-
write.xlsx(x, file, sheetName = "Sheet1", col.names =
TRUE, row.names = TRUE, append = FALSE)
• x is the data frame to be written into the workbook, file is
the path to the output file.
3. Writing to binary file - We can write to binary files
from R using the function writeBin().
• The syntax of writeBin() is writeBin(object, con), where
object is the R object to be written and con is the
connection object to write the binary file to.
• emp <- read.csv("emp.csv")
• write.file <- file("emp.dat", "wb")
• writeBin(colnames(emp), write.file)
• close(write.file)
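• A round-trip sketch combining writeBin() and readBin(). The column names and integer values are made up and stand in for colnames(emp); the file goes to a temporary directory:

```r
bin_path <- file.path(tempdir(), "emp.dat")

# Illustrative data standing in for colnames(emp)
cols <- c("id", "name", "salary")
vals <- c(1L, 2L, 3L)

# Write: three character strings first, then three integers
con <- file(bin_path, "wb")
writeBin(cols, con)
writeBin(vals, con)
close(con)

# Read back in the same order, with the same types
con <- file(bin_path, "rb")
cols_back <- readBin(con, character(), n = 3)   # n counts records, not bytes
vals_back <- readBin(con, integer(), n = 3)
close(con)

identical(cols, cols_back)   # TRUE
identical(vals, vals_back)   # TRUE
```

• A binary file carries no description of its own layout, so the reader has to know the order and types that the writer used.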
• Functions for Writing Data to Files:
• There are similar functions for writing data to files.
• write.table() is used for writing tabular data to text
files (e.g. CSV).
• writeLines() function is useful for writing character
data line-by-line to a file or connection.
• dump() is a function for dumping a textual
representation of multiple R objects.
• dput() function is used for outputting a textual
representation of an R object.
• save() is useful for saving an arbitrary number of R
objects in binary format to a file.
• serialize() is used for converting an R object into a
binary format for outputting to a connection (or
file).
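• Each writing function pairs with one of the reading functions listed earlier. A minimal round-trip sketch, with made-up objects and temporary file names:

```r
tmp <- tempdir()

# writeLines() / readLines(): character data, line by line
txt_path <- file.path(tmp, "notes.txt")
writeLines(c("first line", "second line"), txt_path)
lines_back <- readLines(txt_path)

# dput() / dget(): textual representation of a single R object
obj <- list(id = 1:3, label = "demo")
dput_path <- file.path(tmp, "obj.R")
dput(obj, file = dput_path)
obj_text <- dget(dput_path)

# dump() / source(): textual representation of several named objects
a <- 1:5
b <- c(x = 1.5, y = 2.5)
dump_path <- file.path(tmp, "objs.R")
dump(c("a", "b"), file = dump_path)
rm(a, b)
source(dump_path, local = TRUE)   # recreates a and b

# save() / load(): several objects in binary format
rda_path <- file.path(tmp, "objs.RData")
save(a, b, file = rda_path)
rm(a, b)
load(rda_path)                    # restores a and b

# serialize() / unserialize(): one object to a raw vector (connection = NULL)
raw_bytes <- serialize(obj, connection = NULL)
obj_bin <- unserialize(raw_bytes)
```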
Handling large data sets in R
• The problem with large data sets in R:
• R reads the entire data set into RAM all at once.
Other programs can read file sections on
demand.
• R objects live entirely in memory.
• R does not have an int64 data type.
• It is not possible to index objects with huge
numbers of rows and columns, even on 64-bit
systems (2 billion vector index limit). R hits a file
size limit at around 2-4 GB.
• How big is a large data set:
• We can categorize large data sets in R across two
broad categories:
• Medium sized files that can be loaded in R
(within the memory limit, but processing is
cumbersome; typically in the 1-2 GB range).
• Large files that cannot be loaded in R due to the R /
OS limitations discussed above. We can further
split this group into 2 subgroups:
– Large files (typically 2-10 GB) that can still be
processed locally using some workaround solutions.
– Very large files (> 10 GB) that need distributed
large-scale computing.
• Medium sized datasets (< 2 GB)
1. Try to reduce the size of the file before loading it into R
• If you are loading xls files, you can select the specific columns that are
required for analysis instead of selecting the entire data set.
• Selecting specific columns is less convenient if you are loading a csv or text
file. You might want to pre-process the data on the command line
using the cut or awk commands and filter the data required for analysis.
2. Pre-allocate number of rows and pre-define column classes
• Read optimization example:
• Read in a few records of the input file, identify the classes of its
columns and assign those column classes (colClasses) while
reading the entire data set.
• Calculate the approximate row count of the data set based on the size
of the file and the number of fields (or using wc on the
command line) and define the nrows= parameter.
• Define the comment.char parameter.
• Alternatively, use the fread() function from the data.table package.
• Described as a “fast and friendly file finagler”, fread() belongs to the popular
data.table package, which is extremely useful and easy to use. The function is meant
to import data from regular delimited files directly into R,
without any detours or nonsense.
• One of the great things about this function is that all controls,
expressed in arguments such as sep, colClasses and nrows, are
automatically detected.
• bit64::integer64 types are also detected and read directly,
without needing to be read as character before converting.
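• The read optimization steps above can be sketched with base R alone. The built-in airquality data set, written to a temporary file, stands in for a genuinely large input; the file name is an assumption:

```r
# Sample file standing in for a large input
big_path <- file.path(tempdir(), "big.csv")
write.csv(airquality, big_path, row.names = FALSE)

# Step 1: read a few records and identify the column classes
sample_rows <- read.csv(big_path, nrows = 5)
classes <- sapply(sample_rows, class)

# Step 2: determine the row count (counted exactly here; for a really large
# file an estimate from the file size or wc -l avoids a second full pass)
n_rows <- length(readLines(big_path)) - 1   # minus the header line

# Step 3: read the full file with classes, row count and comment.char fixed
full <- read.csv(big_path,
                 colClasses   = classes,
                 nrows        = n_rows,
                 comment.char = "")
dim(full)

# If the data.table package is installed, data.table::fread(big_path)
# detects these settings automatically.
```

• One caveat with this approach: classes inferred from the first few rows can be wrong for later rows (for example, an integer column that later contains decimals), in which case the full read fails and the classes need adjusting.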
• ff - ff is another package for dealing with large data sets, similar to
bigmemory. It also uses a pointer, but to a flat binary file
stored on disk, and it can be shared across different sessions.
• One advantage ff has over bigmemory is that it supports
multiple data class types in the data set.
• Parallel Processing - The parallelism approach runs several
computations at the same time and takes advantage of
multiple cores or CPUs on a single system or across
systems. The following R packages are used for parallel
processing in R.
• bigmemory - bigmemory is part of the “big” family,
which consists of several packages that perform
analysis on large data sets. bigmemory provides several
matrix objects, but we will only focus on big.matrix.
• big.matrix is an R object that uses a pointer to a C++
data structure. The location of the pointer to the C++
matrix can be saved to disk or RAM and shared
with other users in different sessions.
• By loading the pointer object, users can access the data
set without reading the entire set into R.
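• As an illustration of the parallel processing idea, here is a minimal sketch using the parallel package that ships with R. This package choice, the function boot_mean() and the worker count of 2 are assumptions for the example; the slides do not name a specific parallel package here:

```r
library(parallel)   # ships with R

# Hypothetical CPU-bound task: the mean of one bootstrap resample
boot_mean <- function(i, x) mean(sample(x, replace = TRUE))

x <- airquality$Temp

# Start a small cluster of worker processes
# (detectCores() reports how many cores are available)
cl <- makeCluster(2)
clusterSetRNGStream(cl, 123)     # reproducible random numbers on the workers
res <- parLapply(cl, 1:100, boot_mean, x = x)
stopCluster(cl)

boot_means <- unlist(res)
length(boot_means)   # 100
range(boot_means)
```

• Each worker is a separate R process, so the data passed to it (here x) is copied. For very large objects that copy is exactly the cost that shared-memory structures such as big.matrix are meant to avoid.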
• Very Large datasets -
• There are two options to process very large data
sets (> 10 GB) in R.
• Use integrated environment packages
like Rhipe to leverage the Hadoop MapReduce
framework.
• Use RHadoop directly on a Hadoop distributed
system.
• Storing large files in databases and connecting
through DBI/ODBC calls from R is also an option
worth considering.
