BigData_BCom-Unit-3

The document introduces R, a programming language designed for statistical computing and data analysis, highlighting its compatibility with S and its extensive capabilities for various statistical techniques. It emphasizes R's advantages over other languages like Python, particularly in ease of use for statistical tasks and its extensive library support, while also noting some limitations regarding scalability for large datasets. Additionally, it covers basic concepts such as data types, variables, functions, and control structures in R.

Unit-III: INTRODUCTION TO R & GETTING STARTED WITH R
INTRODUCTION
Statistical computing and high-scale data analysis tasks needed a new
category of computer language besides the existing procedural and
object-oriented programming languages, which would support these tasks
instead of developing new software. There is plenty of data available
today which can be analysed in different ways to provide a wide range of
useful insights for multiple operations in various industries. Problems such
as the lack of support, tools and techniques for varied data analysis have
been solved with the introduction of one such language called R.

What is R?
R is a scripting or programming language which provides an environment
for statistical computing, data science and graphics. It was inspired by,
and is mostly compatible with, the statistical language S developed at Bell
Laboratories (formerly AT&T, now Lucent Technologies). Although there
are some important differences between R and S, much of the code
written for S runs unaltered on R. R has become so popular that it is now
one of the most widely used tools for computational statistics,
visualisation and data science.

Why R?
R has opened tremendous scope for statistical computing and data
analysis. It provides techniques for various statistical analyses like
classical tests and classification, time-series analysis, clustering, linear
and non-linear modelling and graphical operations. The techniques
supported by R are highly extensible.
S is the pioneer of statistical computing; however, it is a proprietary
solution and is not readily available to developers. In contrast, R is
available freely. Hence, it helps the developer community in research and
development.
Another reason behind the popularity and widespread use of R is its
superior support for graphics. It can provide well-developed and
high-quality plots from data analysis. The plots can contain mathematical
formulae and symbols, if necessary, and users have full control over the
selection and use of symbols in the graphics. Hence, other than robustness,
user-experience and user-friendliness are two key aspects of R.

The following points describe why the R language should be used:

• If you need to run statistical calculations in your application, learn and
deploy R. It integrates easily with programming languages such as Java,
C++, Python and Ruby.
• If you wish to perform a quick analysis to make sense of data.
• If you are working on an optimisation problem.
• If you need re-usable libraries to solve a complex problem, leverage the
2000+ free libraries provided by R.
• If you wish to create compelling charts.
• If you aspire to be a Data Scientist.
• If you want to have fun with statistics.
• R is free. It is available in source code form under the terms of the Free
Software Foundation’s GNU General Public License.
• It is available for Windows, Mac and a wide variety of Unix platforms
(including FreeBSD, Linux, etc.).
• In addition to enabling statistical operations, it is a general
programming language, so you can automate your analyses and create new
functions.
• R has excellent tools for creating graphics such as bar charts, scatter
plots, multipanel lattice charts, etc.
• It has an object-oriented and functional programming structure along
with support from a robust and vibrant community.
• R has a flexible analysis tool kit, which makes it easy to access data in
various formats, manipulate it (transform, merge, aggregate, etc.), and
subject it to traditional and modern statistical models (such as
regression, ANOVA, tree models, etc.).
• R can be extended easily via packages. It interfaces easily with other
programming languages.
• Existing software as well as emerging software can be integrated with
R packages to make them more productive.
• R can easily import data from MS Excel, MS Access, MySQL, SQLite,
Oracle, etc. It can connect to databases using ODBC (Open Database
Connectivity) and the ROracle package.

Advantages of R Over Other Programming Languages


Advanced programming languages like Python also support statistical
computing and data visualisation along with traditional computer
programming. However, for statistical work R has an edge over Python and
similar languages because of the following two advantages:
1. Python needs third-party extensions for data visualisation and
statistical computing, whereas R provides most of this functionality
out of the box. For example, R has the built-in lm function for linear
regression analysis. Data can be passed directly to the function, and
the function returns an object with detailed information about the
regression, including the coefficients, standard errors, residual
values and so on. Python has no built-in equivalent of lm, so the same
functionality has to be reproduced with third-party libraries such
as SciPy, NumPy and so on. Hence, R can do with a single line of code
what Python does with support from third-party libraries.

2. R has the fundamental data type, i.e., a vector that can be organised
and aggregated in different ways even though the core is the same.
Vector data type imposes some limitations on the language as this is a
rigid type. However, it gives a strong logical base to R. Based on the
vector data type, R uses the concept of data frames that are like a
matrix with attributes and internal data structure similar to
spreadsheets or relational database. Hence, R follows a column-wise
data structure based on the aggregation of vectors.
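
The two points above can be tried out directly. The following is only a small sketch: it assumes the built-in example data set cars (columns speed and dist), and the data frame df with its columns id and score is made up purely for illustration.

```r
# Point 1: a regression in one line, using the built-in 'cars' data (an assumed example)
fit <- lm(dist ~ speed, data = cars)

coef(fit)                  # estimated intercept and slope
summary(fit)$coefficients  # estimates with standard errors, t values and p values
head(residuals(fit))       # first few residual values

# Point 2: a data frame is a collection of equal-length column vectors
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))   # hypothetical data
is.vector(df$score)        # TRUE: each column is an ordinary vector
sapply(df, class)          # each column has its own single type
```

The object returned by lm() holds everything about the fit, so further details can be pulled out of it later without refitting the model.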

There are also some disadvantages of R. For example, R does not scale
efficiently to larger data sets. Hence, the use of R is often limited to
prototyping and sandboxing, and it is less commonly used for
enterprise-level solutions. By default, R uses a single-threaded execution
approach while working on data stored in RAM, which leads to
scalability issues as well. Developers from open source communities are
working on these issues to make R capable of multi-threaded execution
and parallelisation, which will help R to utilise more than one processor
core. There are also big data extensions from companies such as Revolution
Analytics (Revolution R) that aim to address these issues. Other languages
like S-PLUS can store objects permanently on disk, hence supporting better
memory management and analysis of massive datasets.

Data types in R
R is a programming language. Like other programming languages, R also
makes use of variables to store varied information. This means that when
variables are created, locations are reserved in the computer’s memory to
hold the related values. The number of locations or size of memory reserved
is determined by the data type of the variables. Data type essentially means
the kind of value which can be stored, such as boolean, numbers,
characters, etc. In R, however, variables are not declared with data types.
Variables in R are used to store R objects, and the data type of the R
object becomes the data type of the variable. The most popular (based on
usage) R objects are:
• Vector
• List
• Matrix
• Array
• Factor
• Data Frames
A vector is the simplest of all R objects. Its elements can be of various
data types. The most commonly used data types are listed as follows:
• Logical
• Numeric
  o Integer
• Character
• Double
The class() function can be used to reveal the data type.

Logical
TRUE / T and FALSE / F are logical values.
> TRUE
[1] TRUE
> class(TRUE)
[1] "logical"
> T
[1] TRUE
> class(T)
[1] "logical"
> FALSE
[1] FALSE
> class(FALSE)
[1] "logical"
> F
[1] FALSE
> class(F)
[1] "logical"

Numeric
> 2
[1] 2
> class(2)
[1] "numeric"
> 76.25
[1] 76.25
> class(76.25)
[1] "numeric"

Integer
Integer data type is a sub class of numeric data type. Notice the use of
“L” as a suffix to a numeric value in order for it to be considered an
“integer”.
> 2L
[1] 2
> class(2L)
[1] "integer"
Functions such as “is.numeric()”, “is.integer()” can be used to test the
data type.
> is.numeric(2)
[1] TRUE
> is.integer(2)
[1] FALSE
> is.numeric(2L)
[1] TRUE
> is.integer(2L)
[1] TRUE
Note: Integers are numeric but NOT all numbers are integers.

Character
> "Data Science"
[1] "Data Science"
> class("Data Science")
[1] "character"
The is.character() function can be used to ascertain if a value is a character.
> is.character("Data Science")
[1] TRUE

double (for double precision floating point numbers)


By default, numbers are of “double” type unless an L is suffixed to the
number, in which case it is considered an integer.
> typeof(76.25)
[1] "double"

Variables and ls() Function
R, like any other programming language, uses variables to store
information. Let us start by creating a variable “RectangleHeight” and
assign the value 2 to it. Note the use of the operator “<-” to assign a
value to the variable.
Likewise, the variable “RectangleWidth” is defined and assigned the value
4. The area of the rectangle is computed using the formula
“RectangleHeight * RectangleWidth”. The computed value for the area of
the rectangle is stored in the variable “RectangleArea”.
> RectangleHeight <- 2
> RectangleWidth <- 4
> RectangleArea <- RectangleHeight * RectangleWidth
> RectangleHeight
[1] 2
> RectangleWidth
[1] 4
> RectangleArea
[1] 8
Note: When a value is assigned to a variable, it does not display anything
on the console. To get the value, type the name of the variable at the
prompt.
Use the ls() function to list all the objects in the working environment.
> ls()
[1] "RectangleArea" "RectangleHeight" "RectangleWidth"
ls() is also useful for cleaning the environment before running code.
Pass its result to the rm() function as shown to remove all objects.
> rm(list=ls())
> ls()
character(0)

Variables
(i) Assign a value of 50 to the variable called ‘Var’.
> Var <- 50
Or
> Var = 50
(ii) Print the value in the variable, ‘Var’.
> Var
[1] 50
(iii) Perform arithmetic operations on the variable, ‘Var’.
> Var + 10
[1] 60
> Var / 2
[1] 25
Variables can be reassigned values either of the same data type or of a
different data type.
(iv) Reassign a string value to the variable, ‘Var’.
> Var <- "R is a Statistical Programming Language"
Print the value in the variable, ‘Var’.
> Var
[1] "R is a Statistical Programming Language"
(v) Reassign a logical value to the variable, ‘Var’.
> Var <- TRUE
> Var
[1] TRUE

Functions
In this section we will try out a few functions such as sum(), min(), max()
and seq().

sum() function
The sum() function returns the sum of all the values in its arguments.
Syntax: sum(..., na.rm = FALSE)
where ... stands for numeric, complex or logical vectors, and
na.rm accepts a logical value indicating whether missing values (including
NaN (Not a Number)) should be removed before summing.
Example: Sum the values ‘1’, ‘2’ and ‘3’ provided as arguments to sum()
> sum(1, 2, 3)
[1] 6

min() function
The min() function returns the minimum of all the values present in its
arguments.
Syntax: min(..., na.rm = FALSE)
where ... stands for numeric or character arguments and na.rm accepts a
logical value.
Example:
> min(1, 2, 3)
[1] 1

max() function
The max() function returns the maximum of all the values present in its
arguments.
Syntax: max(..., na.rm = FALSE)
where ... stands for numeric or character arguments and na.rm accepts a
logical value.
Example:
> max(44, 78, 66)
[1] 78
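
The effect of the na.rm argument in sum(), min() and max() is easiest to see with a missing value. The short sketch below uses a made-up vector called readings; the name and the values are assumptions for illustration only.

```r
# Hypothetical readings with one missing value
readings <- c(4, NA, 9, 1)

sum(readings)                 # NA: a missing value makes the total unknown
sum(readings, na.rm = TRUE)   # 14
min(readings, na.rm = TRUE)   # 1
max(readings, na.rm = TRUE)   # 9
```

With the default na.rm = FALSE, a single NA is enough to make the result NA, so it is worth checking data for missing values before summarising it.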

seq() function
The seq() function generates a regular sequence.
Syntax: seq(from, to, by, length.out)
where,
from: It is the start value of the sequence.
to: It is the maximal or end value of the sequence.
by: It is the increment of the sequence.
length.out: It is the desired length of the sequence.
Usually only one of by and length.out is given together with from and to.
Example:
> seq(1, 10, 2)
[1] 1 3 5 7 9
> seq(1, 10, length.out=10)
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(18)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
> seq(1, 6, by=3)
[1] 1 4

Control Structures
Control structures in R allow you to control the flow of execution of a
series of R expressions. Basically, control structures allow you to put
some “logic” into your R code, rather than just always executing the same
R code every time. Control structures allow you to respond to inputs or to
features of the data and execute different R expressions accordingly.

Commonly used control structures are:


if and else: testing a condition and acting on it
for: execute a loop a fixed number of times
while: execute a loop while a condition is true
repeat: execute an infinite loop (must break out of it to stop)
break: break the execution of a loop
next: skip an iteration of a loop

Most control structures are not used in interactive sessions, but rather
when writing functions or longer expressions. However, these constructs
do not have to be used in functions and it’s a good idea to become
familiar with them before we delve into functions.

if-else
The if-else combination is probably the most commonly used control
structure in R (or perhaps any language). This structure allows you to test
a condition and act on it depending on whether it’s true or false.
if(<condition>) {
        ## do something
}
## Continue with rest of code
The above code does nothing if the condition is false. If you have an
action you want to execute when the condition is false, then you need an
else clause. Note that else must follow the closing brace of the if block on
the same line; otherwise R treats the if statement as already complete.
if(<condition>) {
        ## do something
} else {
        ## do something else
}
You can have a series of tests by following the initial if with any number of
else ifs.
if(<condition1>) {
        ## do something
} else if(<condition2>) {
        ## do something different
} else {
        ## do something different
}
Here is an example of a valid if/else structure.
## Generate a uniform random number
x <- runif(1, 0, 10)
if(x > 3) {
        y <- 10
} else {
        y <- 0
}
The value of y is set depending on whether x > 3 or not. This expression
can also be written a different, but equivalent, way in R.
y <- if(x > 3) {
        10
} else {
        0
}

for Loops
In R, for loops take an iterator variable and assign it successive values
from a sequence or vector. For loops are most commonly used for
iterating over the elements of an object (list, vector, etc.)
for(i in 1:10) {
        print(i)
}
This loop takes the i variable and in each iteration of the loop gives it
values 1, 2, 3, …, 10, executes the code within the curly braces, and then
the loop exits.
The following three loops all have the same behavior: each prints the
elements of x one at a time.
x <- c("a", "b", "c", "d")
for(i in 1:4) {
        ## Print out each element of 'x'
        print(x[i])
}
The seq_along() function is commonly used in conjunction with for loops in
order to generate an integer sequence based on the length of an object
(in this case, object x).
x <- c("z", "y", "x", "w")
## Generate a sequence based on length of 'x'
for(i in seq_along(x)) {
        print(x[i])
}
It is not necessary to use an index-type variable.
for(letter in x) {
        print(letter)
}

Nested for loops


for loops can be nested inside of each other.
x <- matrix(1:6, 2, 3)
for(i in seq_len(nrow(x))) {
        for(j in seq_len(ncol(x))) {
                print(x[i, j])
        }
}

while Loops
While loops begin by testing a condition. If it is true, then they execute
the loop body. Once the loop body is executed, the condition is tested
again, and so forth, until the condition is false, after which the loop exits.
count <- 0
while(count <= 10) {
        print(count)
        count <- count + 1
}
While loops can potentially result in infinite loops if not written properly.

repeat Loops
repeat initiates an infinite loop right from the start. The only way to exit a
repeat loop is to call break.
One possible paradigm might be in an iterative algorithm where you may
be searching for a solution and you don’t want to stop until you’re close
enough to the solution. In this kind of situation, you often don’t know in
advance how many iterations it’s going to take to get “close enough” to
the solution.
val <- 5
repeat {
        print(val)
        val <- val + 1
        if (val == 10) break
}
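
The "close enough" situation described above can be sketched as follows. This is only an illustration: it uses Newton's method for a square root as the iterative algorithm, and the names target, guess and tol, the tolerance value and the cap of 100 iterations are all assumptions chosen for the example.

```r
# Approximate sqrt(2) by repeated refinement (Newton's method)
target <- 2
guess <- 1
tol <- 1e-8        # assumed tolerance for "close enough"
iterations <- 0

repeat {
        new_guess <- (guess + target / guess) / 2   # refine the current guess
        iterations <- iterations + 1
        close_enough <- abs(new_guess - guess) < tol
        guess <- new_guess
        ## Stop when successive guesses agree, or after 100 passes as a safety net
        if (close_enough || iterations >= 100) break
}

guess        # approximately sqrt(2)
iterations   # number of passes the loop needed
```

The iteration cap is a safeguard: without it, a repeat loop whose stopping condition is never met would run forever.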

next, break
next is used to skip an iteration of a loop.
for(i in 1:100) {
        ## Skip the first 20 iterations
        if(i <= 20) next
        print(i)
}
break is used to exit a loop immediately, regardless of what iteration the
loop may be on.
for(i in 1:100) {
        print(i)
        ## Stop the loop after 20 iterations
        if(i >= 20) break
}

Vectors in R
The fundamental data type in R is the vector. A vector is a sequence of
data elements all of the same type.

Creating Vectors
There are various ways to create vectors but one of the most common is
the concatenation function c(). This takes its arguments and places them
all in a vector.
x <- c(1, 5, 2, 6)
x
## [1] 1 5 2 6
is.vector(x)
## [1] TRUE
Note that c() keeps the values in the vector in the order in which they
were entered.

Vector Arithmetic
We can do arithmetic with vectors in a similar manner as we have with
integers. When we use operators we are doing something element by
element or “elementwise.”
y <- c(1,6,4,8)
x+y
## [1] 2 11 6 14
Notice that we did not add all of the values together but we added
the first values from x and y, then the second values and so on.
> x*y
[1] 1 30 8 48
> x/y
[1] 1.0000000 0.8333333 0.5000000 0.7500000
> x %% y
[1] 0 5 2 6

Functions on Vectors
We considered functions on specific data values but we can actually put
vectors into most functions in R. One of the simplest functions tells us
how many elements a vector has, which is useful to know when R recycles
a shorter vector to match a longer one. This is the length() function.
> length(x)
> length(y)
> length(z)
The length of a vector is very important when writing functions, which
we will get to in a later unit. We can use any() and all() in order to answer
logical questions about elements:
any(x>3)
[1] TRUE
We see that there must be at least one x that is greater than 3.
all(x>3)
## [1] FALSE
However, not all values of x are larger than 3.
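
To see how length() relates to recycling, consider the sketch below. The vector short is a made-up example; x has the same values as used above.

```r
x <- c(1, 5, 2, 6)
short <- c(10, 20)    # hypothetical shorter vector

length(x)             # 4
length(short)         # 2

## 'short' is recycled to 10, 20, 10, 20 before the addition
x + short             # 11 25 12 26

any(x > 3)            # TRUE: 5 and 6 exceed 3
all(x > 3)            # FALSE: 1 and 2 do not
```

Recycling happens silently here because the longer length is a multiple of the shorter one; when it is not, R still recycles but gives a warning.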

Other Functions for Vectors


There are various other functions that can be run on vectors, some of
which you may have seen before:
mean() finds the arithmetic mean of a vector.
median() finds the median of a vector.
sd() and var() find the standard deviation and variance of a vector
respectively.
min() and max() find the minimum and maximum of a vector
respectively.
sort() returns a vector that is sorted.
summary() returns a summary of the numbers in a vector (minimum, first
quartile, median, mean, third quartile and maximum).
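
As a quick illustration, the functions listed above can be tried on the vector x used throughout this section (redefined here so the sketch runs on its own):

```r
x <- c(1, 5, 2, 6)

mean(x)      # 3.5
median(x)    # 3.5
var(x)       # sample variance, 17/3
sd(x)        # square root of the variance
min(x)       # 1
max(x)       # 6
sort(x)      # 1 2 5 6
summary(x)   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
```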

The which() Function


Some functions help us work with the data further and return the values in
which we are interested. For example, above we asked if any elements in
vector x were greater than 3. The which() function will tell us the
positions of the elements that are.
> which(x>3)
[1] 2 4

Indexing Vectors
We can call specific elements of a vector by using the following:
x[] is a way to call up a specific element of a vector.
x[1] is the first element.
x[3] is the third element.
x[-3] is a vector with everything but the third element.
We can start off by checking what we have stored so far:
ls()
## [1] "x" "y" "z"
Now that we see the vectors available we can try indexing x:
x[3]
## [1] 2
x[-3]
## [1] 1 5 6
Note that x[3] returns the third element and x[-3] returns everything but
the third element.
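
Indexing also works with several positions at once and with logical conditions. The following short sketch combines it with which(), using the same values of x as above:

```r
x <- c(1, 5, 2, 6)

which(x > 3)       # positions 2 and 4
x[which(x > 3)]    # the values at those positions: 5 6
x[x > 3]           # a logical index gives the same result
x[c(1, 3)]         # first and third elements: 1 2
x[-c(1, 3)]        # everything except the first and third: 5 6
```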

Naming Vector Elements


With vectors it can be important to assign names to the values. Then
when doing plots or considering maximums and minimums, instead of
being given a numerical place within the vector we can be given a specific
name of what that value represents. For example, say that vector x
represents the number of medications of 4 unique patients. We could then
use the names() function to assign names to the values:
> x
[1] 1 5 2 6
> names(x)
NULL
> names(x) <- c("Patient A", "Patient B", "Patient C", "Patient D")
> x
Patient A Patient B Patient C Patient D
        1         5         2         6

Arrays in R
Arrays are still vectors in R but they have extra options added to them.
We can essentially call them “vector structures”. With a vector we have a
list of objects in one dimension. With an array we can have any number
of dimensions to our data.
We can consider a simple vector to start with
> x <- c(1,2,3,4)
This means that x is a vector with 4 elements. This simple vector can be
turned into an array by specifying some dimensions on it.
> x.array <- array(x, dim=c(2,2))
> x.array
     [,1] [,2]
[1,]    1    3
[2,]    2    4
• A regular vector has a single dimension.
• A matrix has 2 dimensions.
• An array can have any number of dimensions.
We can learn about arrays with the following functions:
> dim(x.array)
[1] 2 2
We can see that our array is a 2x2 matrix.
> is.vector(x.array)
[1] FALSE
> is.array(x.array)
[1] TRUE

We can also see that R does view these as different objects: it
distinguishes between arrays and plain vectors.
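
Since an array can have more than two dimensions, a small three-dimensional example may help. The array a below is a made-up illustration, not part of the earlier examples.

```r
# A 2 x 3 x 2 array: two "layers", each a 2 x 3 matrix, filled column by column
a <- array(1:12, dim = c(2, 3, 2))

dim(a)        # 2 3 2
a[2, 3, 1]    # row 2, column 3, layer 1: 6
a[1, 1, 2]    # row 1, column 1, layer 2: 7
a[, , 2]      # the whole second layer as a 2 x 3 matrix
```

Leaving an index position empty, as in a[, , 2], selects everything along that dimension.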

Properties of Arrays
We can also have R tell us:
• The type of elements our array contains with the typeof() function.
• The structure of the array with the str() function.
• Other attributes with the attributes() function.
> typeof(x.array)
[1] "double"
> str(x.array)
 num [1:2, 1:2] 1 2 3 4
> attributes(x.array)
$dim
[1] 2 2

The structure gives a lot of detail about the array, and the attributes show
that the array carries a dim attribute, which here is 2x2.

Working with Arrays


As statisticians it is important to know how to work with arrays. Much of
our data will be represented by vectors and arrays.

Indexing Arrays
Previously we learned how to extract or remove information from vectors.
We can also index arrays but our index takes into account all the
dimensions of our array.
For example, if we wish to take the element out of the first row and first
column we can do that by:
> x.array[1,1]
[1] 1
Just like in vectors, we can replace values in an array by using indexing
and assigning of values.
> x.array[1,1] <- 5
> x.array
     [,1] [,2]
[1,]    5    3
[2,]    2    4
Many times we wish to have functions act on either just the rows or the
columns and there are many functions built into R for this. For example:
> rowSums(x.array)
[1] 8 6
> colSums(x.array)
[1] 7 7

Matrices in R
A matrix is a vector that also contains information on the number of rows
and number of columns. However, vectors are not matrices.

Creating Matrices
An important first step with matrices is to learn how to create them. One
of the easiest ways to do this is with the matrix() function.
x <- c(1,2,3,4)
x.mat <- matrix(x, nrow=2, ncol=2, byrow=TRUE)
x.mat
     [,1] [,2]
[1,]    1    2
[2,]    3    4
Note: byrow=TRUE means that we fill the matrix by row; the result is not
the same as when we do not fill it by row:
> x.mat2 <- matrix(x, nrow=2, ncol=2, byrow=FALSE)
> x.mat2
     [,1] [,2]
[1,]    1    3
[2,]    2    4

We can also create matrices purely by expressing the number of columns
we wish to have. In larger forms of data we may not know the exact
number of rows and columns but certainly we can choose at least the
number of columns.
y <- c(1,2,3,4,5,6,7)
y.mat <- matrix(y, ncol=2)
y.mat
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    1
Note that y has only 7 elements, so R recycles the first value of y to fill
the last cell and issues a warning that the data length does not fit the
matrix.

Matrix Operations
R can be a great tool for working with matrices. Many operations we need
to do with linear algebra can be done in R. We can perform elementwise
multiplication just like in vectors:
> x.mat * x.mat2
     [,1] [,2]
[1,]    1    6
[2,]    6   16
R does have the ability to do matrix multiplication as well:
> x.mat %*% x.mat2
     [,1] [,2]
[1,]    5   11
[2,]   11   25
We can transpose matrices and extract the diagonals as well:
> t(x.mat)
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> diag(x.mat2)
[1] 1 4
Another common matrix calculation is the inverse. Many algorithms and
functions in statistics need to work with the inverse of matrices:
solve(x.mat)
     [,1] [,2]
[1,] -2.0  1.0
[2,]  1.5 -0.5
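
One way to convince ourselves that solve() really returns the inverse is to multiply it back. The sketch below rebuilds x.mat so that it runs on its own; the right-hand side rhs is a made-up example.

```r
x.mat <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
x.inv <- solve(x.mat)

## Multiplying a matrix by its inverse gives the identity matrix (up to rounding)
x.mat %*% x.inv
all.equal(x.mat %*% x.inv, diag(2))   # TRUE

## solve() can also solve the linear system x.mat %*% b = rhs directly
rhs <- c(5, 11)                       # hypothetical right-hand side
solve(x.mat, rhs)                     # 1 2
```

Comparing with all.equal() rather than == allows for the tiny rounding errors that floating point arithmetic can introduce.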

The apply() Function


Many times we wish to use our own function over the rows or columns of a
matrix. The apply() function allows someone to use an R function or
user-defined function with a matrix. This function is
apply(m, dimcode, f, arguments)
Where,
m: matrix you wish to use.
dimcode: 1 if you want to apply the function to rows, 2 if you want to
apply it to columns.
f: function you wish to use.
arguments: specific arguments for the function being used.
(In R's own documentation these arguments are named X, MARGIN, FUN
and ... respectively.)
apply() Example: We begin with our matrix y.mat. We can use the apply
function to get means of either the columns or the rows.
> apply(y.mat, 1, mean)
[1] 3.0 4.0 5.0 2.5
> apply(y.mat, 2, mean)
[1] 2.50 4.75
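
apply() is not limited to built-in functions such as mean(). The sketch below uses a user-defined function and also passes an extra argument; the matrix m holds the same values as y.mat, and the function name row_range is made up for this example.

```r
m <- matrix(c(1, 2, 3, 4, 5, 6, 7, 1), ncol = 2)   # same values as y.mat

## User-defined function: spread (max minus min) of a set of values
row_range <- function(v) max(v) - min(v)
apply(m, 1, row_range)            # 4 4 4 3

## Extra arguments after the function are passed on to it
m[1, 1] <- NA
apply(m, 2, mean)                 # first column mean is NA
apply(m, 2, mean, na.rm = TRUE)   # 3.00 4.75
```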

Naming Rows and Columns of Matrices


Just like in vectors we may want to name elements in a matrix. Now we
have more than one dimension so we can name both the rows and
columns. Consider the following matrix where we have recorded both
weight (lbs) and height (inches) of subjects at time point 1.
> time1 <- matrix(c(115, 63, 175, 69, 259, 57, 325, 70), ncol=2, byrow=TRUE)
> time1
     [,1] [,2]
[1,]  115   63
[2,]  175   69
[3,]  259   57
[4,]  325   70
Without the story behind these we do not know what kind of data we have
here or what is being measured. This is where it can be very important to
name both the columns and the rows of data.
> #Names for Time 1
> colnames(time1) <- c("weight1", "height1")
> rownames(time1) <- c("Subject 1", "Subject 2", "Subject 3", "Subject 4")
> time1
          weight1 height1
Subject 1     115      63
Subject 2     175      69
Subject 3     259      57
Subject 4     325      70
We can see that time1 is now much clearer as to what the data
contains.
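
Once rows and columns are named, the names can be used for indexing instead of numeric positions. A short sketch, rebuilding time1 so that it runs on its own:

```r
time1 <- matrix(c(115, 63, 175, 69, 259, 57, 325, 70), ncol = 2, byrow = TRUE)
colnames(time1) <- c("weight1", "height1")
rownames(time1) <- c("Subject 1", "Subject 2", "Subject 3", "Subject 4")

time1["Subject 2", "weight1"]          # 175
time1[, "height1"]                     # named vector of all heights
names(which.max(time1[, "weight1"]))   # "Subject 4", the heaviest subject
```

Indexing by name keeps the code readable and continues to work even if the order of rows or columns changes later.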
