BigData_BCom-Unit-3
WITH R
INTRODUCTION
Statistical computing and large-scale data analysis called for a new
category of computer language, alongside the existing procedural and
object-oriented programming languages, that would support these tasks
directly rather than requiring new software to be developed for each of
them. Plenty of data is available today, and it can be analysed in
different ways to provide a wide range of useful insights for multiple
operations in various industries. Problems such as the lack of support,
tools and techniques for varied data analysis have been addressed by one
such language, called R.
What is R?
R is a scripting or programming language which provides an environment
for statistical computing, data science and graphics. It was inspired by,
and is mostly compatible with, the statistical language S developed at Bell
Laboratories (formerly AT&T, now Lucent Technologies). Although there
are some very important differences between R and S, much of the code
written for S runs unaltered on R. R has become one of the most widely
used tools for computational statistics, visualisation and data science.
Why R?
R has opened tremendous scope for statistical computing and data
analysis. It provides techniques for various statistical analyses like
classical tests and classification, time-series analysis, clustering, linear
and non-linear modelling and graphical operations. The techniques
supported by R are highly extensible.
S is the pioneer of statistical computing; however, it is a proprietary
solution and is not readily available to developers. In contrast, R is
available freely. Hence, it helps the developer community in research and
development.
Another reason behind the popularity and widespread use of R is its
superior support for graphics. It can provide well-developed and
high-quality plots from
data analysis. The plots can contain mathematical formulae and symbols,
if necessary, and users have full control over the selection and use of
symbols in the graphics. Hence, other than robustness, user-experience
and user-friendliness are two key aspects of R.
2. R has one fundamental data type, the vector, which can be organised
and aggregated in different ways even though the core is the same.
The vector data type imposes some limitations on the language because
it is a rigid type; however, it gives R a strong logical base. Building
on vectors, R provides data frames, which resemble a matrix with
attributes and have an internal structure similar to a spreadsheet or
a relational database table. Hence, R follows a column-wise data
structure based on the aggregation of vectors.
Data types in R
R is a programming language. Like other programming languages, R makes
use of variables to store varied information. This means that when
variables are created, locations are reserved in the computer’s memory to
hold the related values. The number of locations, or size of memory
reserved, is determined by the data type of the variables. Data type
essentially means the kind of value which can be stored, such as boolean,
numbers, characters, etc. In R, however, variables are not declared with
a data type. A variable stores an R object, and the data type of that
object becomes the data type of the variable. The most popular (based on
usage) R objects are:
Vector
List
Matrix
Array
Factor
Data Frames
A vector is the simplest of all R objects. Its elements can be of various
data types. The most commonly used data types are listed as follows:
Logical
Numeric
  o Integer
Character
Double
The class() function can be used to reveal the data type.
Logical
TRUE / T and FALSE / F are logical values.
> TRUE
[1] TRUE
> class(TRUE)
[1] "logical"
> FALSE
[1] FALSE
> class(FALSE)
[1] "logical"
> T
[1] TRUE
> class(T)
[1] "logical"
> F
[1] FALSE
> class(F)
[1] "logical"
Numeric
> 2
[1] 2
> class(2)
[1] "numeric"
> 76.25
[1] 76.25
> class(76.25)
[1] "numeric"
Integer
Integer data type is a sub class of numeric data type. Notice the use of
“L” as a suffix to a numeric value in order for it to be considered an
“integer”.
> 2L
[1] 2
> class(2L)
[1] "integer"
Functions such as “is.numeric()”, “is.integer()” can be used to test the
data type.
> is.numeric(2)
[1] TRUE
> is.integer(2)
[1] FALSE
> is.numeric(2L)
[1] TRUE
> is.integer(2L)
[1] TRUE
Note: Integers are numeric but NOT all numbers are integers.
Character
> "Data Science"
[1] "Data Science"
> class("Data Science")
[1] "character"
is.character() function can be used to ascertain if a value is a character.
> is.character ("Data Science")
[1] TRUE
Variables
(i) Assign a value of 50 to the variable called ‘Var’.
> Var <- 50
Or
> Var = 50
(ii) Print the value in the variable, ‘Var’.
> Var
[1] 50
(iii) Perform arithmetic operations on the variable, ‘Var’.
> Var + 10
[1] 60
> Var / 2
[1] 25
Variables can be reassigned values either of the same data type or of a
different data type.
(iv) Reassign a string value to the variable, ‘Var’.
> Var <- "R is a Statistical Programming Language"
Print the value in the variable, ‘Var’.
> Var
[1] "R is a Statistical Programming Language"
(v) Reassign a logical value to the variable, ‘Var’.
> Var <- TRUE
> Var
[1] TRUE
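Because a variable takes the data type of the object it currently holds,
class() reports a different type after each reassignment. The short sketch
below illustrates this; the names v, class_num, class_chr and class_lgl are
chosen only for this illustration.

```r
# The same variable holds a number, then a string, then a logical value.
v <- 50
class_num <- class(v)   # type while v holds a number

v <- "R is a Statistical Programming Language"
class_chr <- class(v)   # type while v holds a string

v <- TRUE
class_lgl <- class(v)   # type while v holds a logical value

print(c(class_num, class_chr, class_lgl))
```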
Functions
In this section we will try out a few functions such as sum(), min(), max()
and seq().
sum() function
sum() function returns the sum of all the values in its arguments.
Syntax: sum(..., na.rm = FALSE)
where ... stands for numeric, complex or logical vectors, and
na.rm accepts a logical value indicating whether missing values
(including NaN, Not a Number) should be removed.
Example: Sum the values 1, 2 and 3 provided as arguments to sum()
> sum(1, 2, 3)
[1] 6
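The effect of na.rm is easiest to see with a missing value in the data. In
the sketch below, the vector scores is made-up data for illustration only.

```r
# Hypothetical data containing one missing value (NA).
scores <- c(1, 2, NA, 3)

# With the default na.rm = FALSE, the missing value propagates to the result.
total_default <- sum(scores)

# With na.rm = TRUE, the missing value is dropped before summing.
total_clean <- sum(scores, na.rm = TRUE)

print(total_default)
print(total_clean)
```

The same na.rm argument is accepted by min() and max().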
min() function
min() function returns the minimum of all the values present in its
arguments.
Syntax: min(..., na.rm = FALSE)
where … implies numeric or character arguments and na.rm accepts a
logical value.
Example: > min(1, 2, 3)
[1] 1
max() function
max() function returns the maximum of all the values present in its
arguments.
Syntax: max(..., na.rm = FALSE)
where … implies numeric or character arguments and na.rm accepts a
logical value.
Example: > max(44, 78, 66)
[1] 78
seq() function
seq() function generates a regular sequence.
Syntax: seq(from, to, by, length.out)
where, from: It is the start value of the sequence.
to: It is the maximal or end value of the sequence.
by: It is the increment (interval) of the sequence.
length.out: It is the desired length of the sequence.
Example:
> seq(1, 10, 2)
[1] 1 3 5 7 9
> seq(18)
[1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
> seq(1, 10, length.out=10)
[1]  1  2  3  4  5  6  7  8  9 10
> seq(1, 6, by=3)
[1] 1 4
Control Structures
Control structures in R allow you to control the flow of execution of a
series of R expressions. Basically, control structures allow you to put
some “logic” into your R code, rather than just always executing the same
R code every time. Control structures allow you to respond to inputs or to
features of the data and execute different R expressions accordingly.
Most control structures are not used in interactive sessions, but rather
when writing functions or longer expressions. However, these constructs
do not have to be used in functions, and it’s a good idea to become
familiar with them before we delve into functions.
if-else
The if-else combination is probably the most commonly used control
structure in R (or perhaps any language). This structure allows you to test
a condition and act on it depending on whether it’s true or false.
if(<condition>) {
    ## do something
}
## Continue with rest of code
The above code does nothing if the condition is false. If you have an
action you want to execute when the condition is false, then you need an
else clause.
if(<condition>) {
    ## do something
} else {
    ## do something else
}
You can have a series of tests by following the initial if with any number of
else ifs.
if(<condition1>) {
    ## do something
} else if(<condition2>) {
    ## do something different
} else {
    ## do something different
}
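As a concrete sketch of such a chain, the example below assigns a grade from
a mark. The variable names and the cut-off values 75 and 40 are made up for
illustration.

```r
# Hypothetical mark and cut-offs, used only to illustrate an else-if chain.
marks <- 72

if (marks >= 75) {
    grade <- "Distinction"
} else if (marks >= 40) {
    grade <- "Pass"
} else {
    grade <- "Fail"
}

print(grade)
```

The conditions are tested from top to bottom and only the first branch whose
condition is true is executed.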
Here is an example of a valid if/else structure.
## Generate a uniform random number
x <- runif(1, 0, 10)
if(x > 3) {
    y <- 10
} else {
    y <- 0
}
The value of y is set depending on whether x > 3 or not. This expression
can also be written in a different, but equivalent, way in R.
y <- if(x > 3) {
    10
} else {
    0
}
for Loops
In R, for loops take an iterator variable and assign it successive values
from a sequence or vector. For loops are most commonly used for
iterating over the elements of an object (list, vector, etc.).
for(i in 1:10) {
    print(i)
}
This loop takes the i variable and in each iteration of the loop gives it
values 1, 2, 3, …, 10, executes the code within the curly braces, and then
the loop exits.
The following three loops all print each element of x in turn.
x <- c("a", "b", "c", "d")
for(i in 1:4) {
    ## Print out each element of 'x'
    print(x[i])
}
The seq_along() function is commonly used in conjunction with for loops in
order to generate an integer sequence based on the length of an object
(in this case, object x).
x <- c("z", "y", "x", "w")
## Generate a sequence based on length of 'x'
for(i in seq_along(x))
    print(x[i])
It is not necessary to use an index-type variable.
for(letter in x)
    print(letter)
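One practical reason to prefer seq_along(x) over 1:length(x) shows up when the
vector is empty. The sketch below uses a made-up empty vector named empty.

```r
# An empty character vector (hypothetical, for illustration).
empty <- character(0)

# 1:length(empty) is 1:0, which counts down: it yields 1 and then 0.
idx_colon <- 1:length(empty)

# seq_along(empty) yields a zero-length sequence.
idx_along <- seq_along(empty)

# A loop over seq_along() therefore runs zero times for an empty vector.
n_runs <- 0
for (i in seq_along(empty)) {
    n_runs <- n_runs + 1
}

print(idx_colon)
print(length(idx_along))
print(n_runs)
```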
while Loops
While loops begin by testing a condition. If it is true, then they execute
the loop body. Once the loop body is executed, the condition is tested
again, and so forth, until the condition is false, after which the loop exits.
count <- 0
while(count <= 10) {
    print(count)
    count <- count + 1
}
While loops can potentially result in infinite loops if not written properly.
repeat Loops
repeat initiates an infinite loop right from the start. The only way to exit a
repeat loop is to call break.
One possible paradigm might be in an iterative algorithm where you may
be searching for a solution and you don’t want to stop until you’re close
enough to the solution. In this kind of situation, you often don’t know in
advance how many iterations it’s going to take to get “close enough” to
the solution.
val = 5
repeat {
    print(val)
    val <- val + 1
    if (val == 10) break
}
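The "close enough" paradigm described above can be sketched with an iterative
square-root approximation (the averaging method often attributed to Newton).
The choice of algorithm, the tolerance tol and the cap of 100 iterations are
assumptions made for this illustration.

```r
# Approximate sqrt(target) by repeatedly averaging guess and target / guess.
target <- 2
guess <- 1
tol <- 1e-8        # assumed tolerance for "close enough"
iterations <- 0

repeat {
    new_guess <- (guess + target / guess) / 2
    iterations <- iterations + 1
    # Stop when successive guesses barely change, or after a safety cap
    # so that the loop cannot run forever.
    if (abs(new_guess - guess) < tol || iterations >= 100) break
    guess <- new_guess
}

print(new_guess)
print(iterations)
```

The number of iterations is not known in advance; the loop ends only when the
break condition is met, which is why a safety cap is a sensible precaution.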
next, break
next is used to skip an iteration of a loop.
for(i in 1:100) {
    if(i <= 20) next
    print(i)
}
break is used to exit a loop immediately, regardless of which iteration the
loop is on.
for(i in 1:100) {
    print(i)
    if(i >= 20) break
}
Vectors in R
The fundamental data type in R is the vector. A vector is a sequence of
data elements all of the same type.
Creating Vectors
There are various ways to create vectors but one of the most common is
the concatenation operator. This takes arguments and places them all in
a vector.
x <- c(1, 5, 2, 6)
x
## [1] 1 5 2 6
is.vector(x)
## [1] TRUE
Note that c() keeps the values in the vector in the order in which they
were entered.
Vector Arithmetic
We can do arithmetic with vectors in a similar manner as we have with
integers. When we use operators we are doing something element by
element or “elementwise.”
y <- c(1,6,4,8)
x+y
## [1] 2 11 6 14
Notice that we did not add all of the values together but we added both of
the first values from x and y, then the second values and so on.
> x*y
[1]  1 30  8 48
> x %% y
[1] 0 5 2 6
> x/y
[1] 1.0000000 0.8333333 0.5000000 0.7500000
Functions on Vectors
We considered functions on specific data values, but we can actually pass
vectors to most functions in R. One of the simplest is the length()
function, which is useful for checking vector lengths when dealing with
the recycling that we encountered before.
> length(x)
> length(y)
> length(z)
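To recall why lengths matter for recycling: when two vectors of different
lengths are combined elementwise, R reuses the shorter one. A minimal sketch,
with made-up vectors long_vec and short_vec:

```r
# Hypothetical vectors of different lengths.
long_vec <- c(1, 2, 3, 4)
short_vec <- c(10, 20)

# short_vec is recycled to 10, 20, 10, 20 before the elementwise addition.
recycled_sum <- long_vec + short_vec

print(length(long_vec))
print(length(short_vec))
print(recycled_sum)
```

Here the longer length is a multiple of the shorter one, so the recycling is
silent; checking length() first helps to spot cases where it is not intended.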
The length of a vector is very important when writing functions, which
we will get to in a later unit. We can use any() and all() to answer
logical questions about the elements of a vector:
any(x>3)
[1] TRUE
We see that there must be at least one x that is greater than 3.
all(x>3)
## [1] FALSE
However, not all values of x are larger than 3.
Indexing Vectors
We can call specific elements of a vector by using the following:
x[] is a way to call up a specific element of a vector.
x[1] is the first element.
x[3] is the third element.
x[-3] is a vector with everything but the third element.
We can start off by checking what we have stored so far:
ls()
## [1] "x" "y" "z"
Now that we see the vectors available, we can try indexing x:
x[3]
## [1] 2
x[-3]
## [1] 1 5 6
Note that x[3] returns the third element and x[-3] returns everything but
the third element.
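Indexing also accepts a vector of positions or a logical condition, which
extends the single-element forms shown above. A short sketch, using a vector
v with the same values as x (the name v is chosen for this illustration):

```r
v <- c(1, 5, 2, 6)

# Several positions at once: the first and third elements.
first_and_third <- v[c(1, 3)]

# Several negative positions: everything except the first two elements.
without_first_two <- v[-c(1, 2)]

# A logical condition: only the elements greater than 3.
above_three <- v[v > 3]

print(first_and_third)
print(without_first_two)
print(above_three)
```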
Arrays in R
Arrays are still vectors in R, but with extra options added to them.
We can essentially call them a “vector structure”. With a vector we have
a sequence of elements in one dimension. With an array we can have any
number of dimensions to our data.
We can consider a simple vector to start with
> x <- c(1,2,3,4)
This means that x is a vector with 4 elements. This simple vector can be
turned into an array by specifying some dimensions on it.
> x.array <- array(x, dim=c(2,2))
> x.array
     [,1] [,2]
[1,]    1    3
[2,]    2    4
A regular vector has a single dimension.
A matrix has 2 dimensions.
An array can have any number of dimensions.
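To illustrate an array with more than two dimensions, the sketch below builds
a 2 x 3 x 4 array. The name cube and the values 1 to 24 are chosen only for
this illustration.

```r
# A three-dimensional array: 2 rows, 3 columns, 4 layers.
cube <- array(1:24, dim = c(2, 3, 4))

cube_dim <- dim(cube)          # the three dimension sizes

# Values are filled column by column, layer by layer, so the last
# position holds the last value.
last_value <- cube[2, 3, 4]

# Leaving an index blank selects everything along that dimension;
# this picks the whole first layer, a 2 x 3 matrix.
first_slice <- cube[, , 1]

print(cube_dim)
print(last_value)
print(first_slice)
```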
We can learn about arrays with the following functions:
> dim(x.array)
[1] 2 2
We can see that our array is a 2x2 matrix.
> is.vector(x.array)
[1] FALSE
> is.array(x.array)
[1] TRUE
We can also see that R views these as different objects: once a vector
is given dimensions, is.vector() no longer reports it as a plain vector.
Properties of Arrays
We can also have R tell us:
The type of elements our array contains, with the typeof() function.
The structure of the array, with the str() function.
Other attributes, with the attributes() function.
> typeof(x.array)
[1] "double"
> str(x.array)
 num [1:2, 1:2] 1 2 3 4
> attributes(x.array)
$dim
[1] 2 2
The structure gives a lot of detail about the array, and attributes()
shows that the array carries a dim attribute, which here is 2 x 2.
Indexing Arrays
Previously we learned how to extract or remove information from vectors.
We can also index arrays, but our index takes into account all the
dimensions of our array.
For example, if we wish to take the element out of the first row and first
column, we can do that by:
> x.array[1,1]
[1] 1
Just like in vectors, we can replace values in an array by using indexing
and assigning of values.
> x.array[1,1] <- 5
> x.array
     [,1] [,2]
[1,]    5    3
[2,]    2    4
Many times we wish to have functions act on either just the row or the
column and there are many functions built into R for this. For example:
> rowSums(x.array)
[1] 8 6
> colSums(x.array)
[1] 7 7
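Other row-wise and column-wise helpers work the same way. The sketch below
uses rowMeans(), colMeans() and apply() from base R on a matrix m holding the
same values as x.array after the replacement above (the name m is chosen for
this illustration).

```r
# Same values as x.array after x.array[1,1] <- 5, filled column by column.
m <- matrix(c(5, 2, 3, 4), nrow = 2)

row_means <- rowMeans(m)       # mean of each row
col_means <- colMeans(m)       # mean of each column

# apply() runs any function over rows (margin 1) or columns (margin 2).
row_max <- apply(m, 1, max)    # largest value in each row

print(row_means)
print(col_means)
print(row_max)
```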
Matrices in R
A Matrix is a vector that also contains information on the number of rows
and number of columns. However, vectors are not matrices.
Creating Matrices
An important first step with matrices is to learn how to create them. One
of the easiest ways to do this is with the matrix() function.
x <- c(1,2,3,4)
x.mat <- matrix(x, nrow=2, ncol=2, byrow=TRUE)
x.mat
     [,1] [,2]
[1,]    1    2
[2,]    3    4
Note: byrow=TRUE means that we fill the matrix row by row; the result is
not the same as when we do not fill it by row:
> x.mat2 <- matrix(x, nrow=2, ncol=2, byrow=FALSE)
> x.mat2
     [,1] [,2]
[1,]    1    3
[2,]    2    4
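The two fill orders are related by a transpose: filling by row gives the
transpose of filling by column. A short check, using the names vals, by_row
and by_col chosen for this illustration:

```r
vals <- c(1, 2, 3, 4)

by_row <- matrix(vals, nrow = 2, ncol = 2, byrow = TRUE)
by_col <- matrix(vals, nrow = 2, ncol = 2)   # byrow defaults to FALSE

# t() transposes a matrix; compare the result element by element.
same_after_transpose <- all(t(by_row) == by_col)

print(same_after_transpose)
```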