DAUR UNIT 1 Part 2

UNIT 6

Uploaded by

jayanthroy555
Copyright
© © All Rights Reserved

Loading and Handling Data in R

OUTCOMES:
• Introduce the different data types supported in R, such as numbers, text, logical values and dates.
• Describe the various R objects, such as vector, matrix, list and dataset.
• Manipulate data using R functions such as sum(), min(), max() and rep(), and string functions
such as substr(), grep() and strsplit().
• Import data into R from .csv (comma-separated values) files, spreadsheets, XML
documents, JSON (JavaScript Object Notation) documents, web data, etc.
• Interface R with databases such as MySQL, PostgreSQL and SQLite.

INTRODUCTION:
Enterprise applications today generate a huge amount of data. This data is analyzed to draw useful
insights that can help decision makers make better and faster decisions.

CHALLENGES OF ANALYTICAL DATA PROCESSING


Analytical data processing is a part of business intelligence that includes relational databases, data
warehousing, data mining and report mining. It is a computer processing technique that supports
business practices such as sales, budgeting, financial reporting and management reporting. All of these
rely on large volumes of data, and business analytics combines that data with technology. Several
challenges arise during business data analytics. Most of them are associated with the data itself and
appear during the early stages of a project. Some of these challenges are explained below.

1. Data Formats: Data is the main element of business analytics, and it is usually stored in large
datasets. Selecting a data format is the first challenge in analytical data processing for
researchers or developers. Analytical data processing requires a complete set of data; in its
absence, developers can expect problems in further processing. R is a well-documented
programming language that stores data in the form of objects. It has a simple syntax that helps
in processing many types of data. R provides many packages and features, such as open
database connectivity (ODBC), for reading different data formats. For example, ODBC can be
used to read sources such as CSV files, MS Excel workbooks and SQL databases.
2. Data Quality: Business analysts are required to deliver accurate information, inferences and
output, free of missing or invalid values. Poor-quality input data is bound to give incorrect
results. With the help of R, business analysts can maintain data quality. Different tools in R
help business analysts remove invalid data, replace missing values and remove outliers.
3. Project Scope: Projects based on analytical data processing are costly and time consuming.
Hence, before starting a new project, business analysts should analyze the scope of the project.
They should identify the amount of data required from external sources, the time of delivery and
other parameters related to the project.
4. Output Result via Stakeholder Expectation Management: In analytical data processing,
analysts design projects that generate output with different types of values such as the p-value
and the degrees of freedom. However, users or stakeholders prefer to see only the final output.
They do not want to see the constraints used in data processing, the assumptions, the hypotheses,
p-values, chi-square values or any other intermediate value. Hence, an analytical project should
try to fulfil the expectations of the stakeholders.
KAMALA CHALLA,ASST.PROF,IT DEPT,VNR VJIET. Page 1
Business analysts should use transparent methods and processes. They should also validate the data using
cross validation. Analysts who follow the standard steps of analytical data processing are less likely to
encounter problems.

The sequence of steps that analysts should follow while conducting business analysis for their project is:

• Data input
• Processing
• Descriptive statistics
• Visualization of data
• Report generation
• Output

EXPRESSIONS:

Logical Values
Logical values are TRUE and FALSE, or T and F. These are case sensitive.

The equality operator is ==.

> 8 < 4
[1] FALSE
> 3 * 2 == 5
[1] FALSE
> 3 * 2 == 6
[1] TRUE
> F == FALSE
[1] TRUE
> T == TRUE
[1] TRUE

Guided Activity

Step 1: Create a vector, x consisting of 10 elements with values ranging from 1 to 10.

> x <- c(1:10)

Step 2: Display the contents of the vector, x.

>x

[1] 1 2 3 4 5 6 7 8 9 10

Step 3: Print the values of those elements whose values are either greater than 7 or less than 5. ‘|’ is the
element-wise OR operator.

> x[(x>7) | ( x<5)]

[1] 1 2 3 4 8 9 10

Step 4: Print the values of those elements whose values are greater than 7 and less than 10. ‘&’ is the
element-wise AND operator.

> x[(x>7) & (x<10)]


[1] 8 9

DATES:
The default format of date is YYYY-MM-DD.
(i) Print system’s date.

> Sys.Date()
[1] "2017-01-13"

(ii) Print system’s time.

> Sys.time()
[1] "2017-01-13 10:54:37 IST"

(iii) Print the time zone.

> Sys.timezone()
[1] "Asia/Calcutta"

(iv) Print today’s date.

> today <- Sys.Date()
> today
[1] "2017-01-13"

> format(today, format = "%B %d %Y")
[1] "January 13 2017"

(v) Store date as a text data type.

> CustomDate = "2016-01-13"
> CustomDate
[1] "2016-01-13"

> class(CustomDate)
[1] "character"

(vi) Convert the date stored as text data type into a date data type.

> CustDate = as.Date(CustomDate)
> class(CustDate)
[1] "Date"
> CustDate
[1] "2016-01-13"

(vii) Find the difference between the following two dates.

> strDates <- c("08/15/1947", "01/26/1950")

(viii) Convert strings into date format.

> dates = as.Date(strDates, "%m/%d/%Y")
> dates
[1] "1947-08-15" "1950-01-26"

(ix) Compute the difference between the two dates.

> dates[2] - dates[1]
Time difference of 895 days
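
The same difference can also be obtained with difftime(), which lets the unit be chosen explicitly. The following is a minimal sketch that reuses the two dates from the example above; the names gap_days and gap_weeks are illustrative and not part of the original notes.

```r
# Parse the two date strings with an explicit format
dates <- as.Date(c("08/15/1947", "01/26/1950"), format = "%m/%d/%Y")

# Difference in days, converted to a plain number
gap_days <- as.numeric(difftime(dates[2], dates[1], units = "days"))
gap_days

# The same gap expressed in weeks
gap_weeks <- as.numeric(difftime(dates[2], dates[1], units = "weeks"))
gap_weeks
```

Converting the result with as.numeric() is useful when the gap has to be used in further arithmetic.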

VARIABLES
(i) Assign a value of 50 to the variable called ‘Var’.

> Var <- 50
Or
> Var = 50

(ii) Print the value in the variable, ‘Var’.

> Var

[1] 50

(iii) Perform arithmetic operations on the variable, ‘Var’.

> Var + 10

[1] 60

> Var / 2

[1] 25

Variables can be reassigned values either of the same data type or of a different data type.

(iv) Reassign a string value to the variable, ‘Var’.

> Var <- "R is a Statistical Programming Language"

Print the value in the variable, ‘Var’.

> Var

[1] "R is a Statistical Programming Language"

(v) Reassign a logical value to the variable, ‘Var’.

> Var <- TRUE

> Var

[1] TRUE

FUNCTIONS
sum() function

sum() function returns the sum of all the values in its arguments.

Syntax

sum(..., na.rm = FALSE)

where ... implies numeric, complex or logical vectors, and na.rm accepts a logical value that indicates
whether missing values should be removed before the sum is computed.

Examples

(i) Sum the values ‘1’, ‘2’ and ‘3’ provided as arguments to sum()

> sum(1, 2, 3)

[1] 6

(ii) What will be the output if NA is used for one of the arguments to sum()?

> sum(1, 5, NA, na.rm=FALSE)

[1] NA

If na.rm is FALSE, an NA or NaN value in any of the arguments will cause NA or NaN to be returned.

(iii) What will be the output if NaN is used for one of the arguments to sum()?

> sum(1, 5, NaN, na.rm= FALSE)

[1] NaN

(iv) What will be the output if NA and NaN are used as arguments to sum()?

> sum(1, 5, NA, NaN, na.rm=FALSE)

[1] NA

(v) What will be the output if the option na.rm is set to TRUE? If na.rm is TRUE, any NA or NaN value
in the arguments will be ignored.
> sum(1, 5, NA, na.rm=TRUE)

[1] 6

> sum(1, 5, NA, NaN, na.rm=TRUE)

[1] 6

min() function

min() function returns the minimum of all the values present in its arguments.

Syntax

min(..., na.rm=FALSE)

where ... implies numeric or character arguments and na.rm accepts a logical value.

Example

> min(1, 2, 3)

[1] 1

If na.rm is FALSE, an NA or NaN value in any of the arguments will cause NA or NaN to be returned.

> min(1, 2, 3, NA, na.rm=FALSE)

[1] NA

> min(1, 2, 3, NaN, na.rm=FALSE)

[1] NaN

> min(1, 2, 3, NA, NaN, na.rm=FALSE)

[1] NA

If na.rm is TRUE, any NA or NaN value in the arguments will be ignored.

> min(1, 2, 3, NA, NaN, na.rm=TRUE)

[1] 1

max() function

max() function returns the maximum of all the values present in its arguments.

Syntax

max(..., na.rm=FALSE)

where ... implies numeric or character arguments and na.rm accepts a logical value.

Example

> max(44, 78, 66)

[1] 78

If na.rm is FALSE, an NA or NaN value in any of the arguments will cause NA or NaN to be returned.

> max(44, 78, 66, NA, na.rm=FALSE)

[1] NA

> max(44, 78, 66, NaN, na.rm=FALSE)

[1] NaN

> max(44, 78, 66, NA, NaN, na.rm=FALSE)

[1] NA

If na.rm is TRUE, any NA or NaN value in the arguments will be ignored.

> max(44, 78, 66, NA, NaN, na.rm=TRUE)

[1] 78
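
In practice sum(), min() and max() are usually applied to a whole vector rather than to individual numbers. The following is a minimal sketch, assuming a hypothetical vector named scores that contains one missing value; none of the variable names come from the original notes.

```r
# Hypothetical scores with one missing entry
scores <- c(44, 78, NA, 66)

total   <- sum(scores, na.rm = TRUE)   # NA is skipped before adding
lowest  <- min(scores, na.rm = TRUE)
highest <- max(scores, na.rm = TRUE)

# Without na.rm = TRUE the missing value propagates to the result
unsafe_total <- sum(scores)

# Number of missing entries: TRUE counts as 1 when summed
n_missing <- sum(is.na(scores))
```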

seq() function

seq() function generates a regular sequence.

Syntax

seq(from, to, by, length.out)

where, from is the start value of the sequence, to is the maximal or end value of the sequence, by is the
increment of the sequence and length.out is the desired length of the sequence.

Example

> seq(1, 10, 2)

[1] 1 3 5 7 9

> seq(1, 10, length.out=10)

[1] 1 2 3 4 5 6 7 8 9 10

> seq(18)

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Or

> seq_len(18)

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

> seq(1, 6, by=3)

[1] 1 4
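
When length.out is given instead of an increment, seq() works out the increment itself, and the increment need not be a whole number. The following is a small sketch of this; seq_along() does not appear in the text above and is included only for comparison, and all variable names are illustrative.

```r
# Five equally spaced values from 0 to 1; seq() computes the step (0.25)
fractions <- seq(0, 1, length.out = 5)

# seq_len(n) is a safe way to write 1:n, because seq_len(0) is empty
idx_none <- seq_len(0)

# seq_along(v) gives one index per element of v
letters3 <- c("a", "b", "c")
idx <- seq_along(letters3)
```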

MANIPULATING TEXT IN DATA
There are many inbuilt string functions available in R that manipulate text or strings. Finding a part of
a text string, searching for a string in a text, concatenating strings and other similar operations
come under text manipulation.

Table 3.2 explains some useful text manipulation operations. Let us take a look at how R treats strings.
String values have to be enclosed within double or single quotes.

> "R is a statistical programming language"

[1] "R is a statistical programming language"

rep() function

rep() function repeats a given argument for a specified number of times. In the example
below, the string, ‘statistics’ is repeated three times.
Example
> rep("statistics", 3)
[1] "statistics" "statistics" "statistics"

grep() function

In the example below, the function grep() returns the index position at which the string,
‘statistical’ is present in the character vector.
Example
> grep("statistical", c("R", "is", "a", "statistical", "language"), fixed=TRUE)
[1] 4

toupper() function

toupper() function converts a given character vector into upper case.

Syntax
toupper(x)
where x is a character vector
Example
> toupper("statistics")
[1] "STATISTICS"
Or
> casefold("r programming language", upper=TRUE)
[1] "R PROGRAMMING LANGUAGE"

tolower() function

tolower() function converts the given character vector into lower case.
Syntax
tolower(x)
where x is a character vector
Example
> tolower("STATISTICS")
[1] "statistics"

Or
> casefold("R PROGRAMMING LANGUAGE", upper=FALSE)
[1] "r programming language"

substr() function
substr() function extracts or replaces substrings in a character vector.
Syntax
substr(x, start, stop)
where x is a character vector, start is the start position of the extraction or replacement and
stop is the end position of the extraction or replacement
Example
Extract the string ‘tic’ from ‘statistics’. Begin the extraction at position 7 and continue the
extraction till position 9.
> substr("statistics", 7, 9)
[1] "tic"
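
The text above says that substr() can also replace a substring, and the outcomes list mentions strsplit(). The following is a minimal sketch of both, assuming the same word ‘statistics’; the variable names are illustrative only.

```r
word <- "statistics"

# Extraction: characters 1 to 4
first_part <- substr(word, 1, 4)

# Replacement: overwrite characters 1 to 4 in a copy of the string
changed <- word
substr(changed, 1, 4) <- "STAT"

# strsplit() returns a list with one character vector per input string,
# so [[1]] picks the pieces of the first (and only) string
parts <- strsplit("R is a statistical language", " ", fixed = TRUE)[[1]]
```

Note that the replacement form keeps the length of the string unchanged; it overwrites characters in place rather than inserting new ones.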

MISSING VALUES TREATMENT IN R


During analytical data processing, users come across problems caused by missing and infinite values. To
get an accurate output, users should remove or clean the missing values.

In R, NA (Not Available) represents missing values and Inf (Infinite) represents infinite values. R
provides different functions that identify the missing values during processing.

Consider a vector ‘A’ with a missing value, [10, 20, NA, 40]. is.na(A) returns TRUE for the
missing value. na.omit(A) and na.exclude(A) remove the missing value, and the results can be stored in
the vectors ‘B’ and ‘D’, respectively. na.fail(A) generates an error if A has any missing value. na.pass(A)
returns the vector A unchanged.
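
The functions described above can be tried out as follows. This is a sketch that uses the vector A = [10, 20, NA, 40] from the text; the tryCatch() wrapper and the name failed are additions for illustration, so that the error raised by na.fail() does not stop the script.

```r
A <- c(10, 20, NA, 40)

# Element-wise test for missing values
missing_flags <- is.na(A)

# na.omit() and na.exclude() drop the NA and record its position as an attribute
B <- na.omit(A)
D <- na.exclude(A)
clean_values <- as.numeric(B)   # strip the attribute, leaving 10 20 40

# na.fail() stops with an error when any value is missing
failed <- tryCatch({ na.fail(A); FALSE }, error = function(e) TRUE)

# na.pass() returns its input unchanged
unchanged <- identical(na.pass(A), A)

# Infinite values are detected separately from missing ones
finite_flags <- is.finite(c(1, Inf, NA))
```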

VECTORS

A vector holds a sequence of values. The values can be numbers, strings or logical values. All the values
in a vector should be of the same data type. A few points to remember about vectors in R are:
• Vectors are stored like arrays in C
• Vector indices begin at 1
• All vector elements must have the same mode such as integer, numeric (floating point number),
character (string), logical (Boolean), complex, etc.
Let us create a few vectors.
1. Create a vector of numbers
> c(4, 7, 8)
[1] 4 7 8
The c function (c is short for combine) creates a new vector consisting of three values, viz. 4, 7 and 8.
2. Create a vector of string values.
> c("R", "SAS", "SPSS")
[1] "R" "SAS" "SPSS"
3. Create a vector of logical values.
> c(TRUE, FALSE)
[1] TRUE FALSE
A vector cannot hold values of different data types. When integer, string and Boolean values are placed
together in a vector, R converts all of them to a common type, which is character in the example below.
> c(4, 8, "R", FALSE)
[1] "4" "8" "R" "FALSE"
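
The conversion to a common type can be checked with class(). The following is a small sketch; the variable names are illustrative and not from the original notes.

```r
# Numbers, a string and a logical together: everything becomes character
mixed <- c(4, 8, "R", FALSE)
mixed_class <- class(mixed)

# Numbers and a logical together: the logical becomes a number (FALSE is 0)
num_and_logical <- c(4, 8, FALSE)
num_class <- class(num_and_logical)
```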
4. Declare a vector by the name, ‘Project’ of length 3 and store values in it.
> Project <- vector(length = 3)
> Project[1] <- "Finance Project"
> Project[2] <- "Retail Project"
> Project[3] <- "Energy Project"
Outcome
> Project
[1] "Finance Project" "Retail Project" "Energy Project"
> length(Project)
[1] 3
➢ Sequence Vector
A sequence vector can be created with a start:end notation.
Create a sequence of numbers between 1 and 5 (both inclusive).
> 1:5
[1] 1 2 3 4 5
Or
> seq(1, 5)
[1] 1 2 3 4 5
The default increment with seq is 1. However, it also allows the use of increments other than 1.
> seq (1, 10, 2)
[1] 1 3 5 7 9
Or
> seq (from=1, to=10, by=2)
[1] 1 3 5 7 9
Or
> seq (1, 10, by=2)
[1] 1 3 5 7 9
Sequences can also be generated in descending order.
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1
> seq(10, 1, by=-2)
[1] 10 8 6 4 2

➢ rep function
The rep function is used to place the same constant into long vectors. The syntax is rep(z, k), which
creates a vector of k*length(z) elements by repeating z k times.
Demonstrate the rep function.
Act
> rep (3, 4)
[1] 3 3 3 3
Or
> x <-rep (3, 4)
>x
[1] 3 3 3 3
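
When z is itself a vector, rep(z, k) repeats the whole vector k times, which is why the result has k*length(z) elements. The following is a minimal sketch; the each argument is a standard argument of rep() that is not discussed above, and the variable names are illustrative.

```r
z <- c(1, 2)
k <- 3

# Repeat the whole vector k times: 1 2 1 2 1 2
whole <- rep(z, k)
length_matches <- length(whole) == k * length(z)

# Repeat each element k times before moving on: 1 1 1 2 2 2
by_element <- rep(z, each = k)
```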
➢ Vector Access
1. Let us create a variable, ‘VariableSeq’ and assign to it a vector consisting of string values.
> VariableSeq <- c("R", "is", "a", "programming", "language")
2. To access values in a vector, specify the indices at which the values are present in the vector.
Indices start at 1.
> VariableSeq[1]
[1] "R"
> VariableSeq[2]
[1] "is"
> VariableSeq[3]
[1] "a"
> VariableSeq[4]
[1] "programming"
> VariableSeq[5]
[1] "language"
3. Assign new values in an existing vector. For example, let us assign the value, ‘good
programming’ at index 4 in the existing vector, ‘VariableSeq’.
> VariableSeq[4] <- "good programming"
Outcome
> VariableSeq[4]
[1] "good programming"
4. To access more than one value from the vector.
(a) Access the first and the fifth element from the vector, ‘VariableSeq’.
> VariableSeq[c(1, 5)]
[1] "R" "language"
(b) Access the first to the fourth elements of the vector, ‘VariableSeq’.
> VariableSeq[1:4]
[1] "R" "is" "a" "good programming"
(c) Access the first, fourth and the fifth element from the vector, ‘VariableSeq’.
> VariableSeq[c(1, 4:5)]
[1] "R" "good programming" "language"
(d) Retrieve all the values from the variable, ‘VariableSeq’
> VariableSeq
[1] "R" "is" "a" "good programming"
[5] "language"

➢ Vector Names
The names() function helps to assign names to the vector elements. This is accomplished in two steps as
shown:
> placeholder <- 1:5
> names(placeholder) <- c("r", "is", "a", "programming", "language")
• The vector elements can then be retrieved by index position or by name.
> placeholder
          r          is           a programming    language
          1           2           3           4           5
> placeholder[3]
a
3
> placeholder[1]
r
1
> placeholder[4:5]
programming    language
          4           5
> placeholder["programming"]
programming
          4
• Objective:
Plot a bar graph using the barplot function. The barplot function uses a vector’s values
to plot a bar chart.
Act
The vector used is called BarVector.
> BarVector <- c(4, 7, 8)
> barplot(BarVector)

• Let us use the names function to assign names to the vector elements. These names will be used as
labels in the barplot.
> names(BarVector) <- c("India", "MiddleEast", "US")
> barplot(BarVector)

Vector Math
• Let us define a vector, ‘x’ with three values. Let us add a scalar value (single value) to the vector.
This value will get added to each vector element.
> x <- c(4, 7, 8)
> x +1
[1] 5 8 9
• However, the vector will retain its individual elements.
>x
[1] 4 7 8
• If the vector needs to be updated with the new values, type the statement given below.
> x <- x + 1
>x
[1] 5 8 9
• We can run other arithmetic operations on the vector as given:
> x - 1
[1] 4 7 8
> x * 2
[1] 10 16 18
> x / 2
[1] 2.5 4.0 4.5
• Let us practice these arithmetic operations on two vectors.
>x
[1] 5 8 9
> y <- c(1, 2, 3)
>y
[1] 1 2 3
>x+y
[1] 6 10 12
• Other arithmetic operations are:
> x - y
[1] 4 6 6
> x * y
[1] 5 16 27
• Check if the two vectors are equal. The comparison takes place element by element.
>x
[1] 5 8 9
>y
[1] 1 2 3
> x==y
[1] FALSE FALSE FALSE
>x<y
[1] FALSE FALSE FALSE
> sin(x)
[1] -0.9589243 0.9893582 0.4121185

Vector Recycling

If an operation is performed involving two vectors that requires them to be of the same length, the shorter
one is recycled, i.e. repeated until it is long enough to match the longer one.
Objective
• Add two vectors wherein one has length, 3 and the other has length, 6.
> c(1, 2, 3) + c(4, 5, 6, 7, 8, 9)
[1] 5 7 9 8 10 12
Objective
• Multiply the two vectors wherein one has length, 3 and the other has length, 6.
> c(1, 2, 3) * c(4, 5, 6, 7, 8, 9)
[1] 4 10 18 7 16 27
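
Recycling also happens when the longer length is not a multiple of the shorter one, but R then issues a warning. The following is a sketch of both cases; the handler that stores the warning text in warn_msg is an illustrative addition and not part of the original notes.

```r
# Length 6 is a multiple of length 3: the short vector is used exactly twice
full_cycle <- c(1, 2, 3) + c(4, 5, 6, 7, 8, 9)

# Length 3 is not a multiple of length 2: R still recycles, but warns
warn_msg <- NULL
partial <- withCallingHandlers(
  c(1, 2) + c(10, 20, 30),
  warning = function(w) {
    warn_msg <<- conditionMessage(w)   # keep the warning text for inspection
    invokeRestart("muffleWarning")     # carry on with the computation
  }
)

partial    # 11 22 31: the third element reuses the first value of c(1, 2)
warn_msg
```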
Objective
• Plot a Scatter Plot. The function to plot a scatter plot is ‘plot’. This function uses two vectors, i.e.
one for the x axis and another for the y axis. The objective is to understand the relationship
between numbers and their sines.
• We will use two vectors. Vector, x which will have a sequence of values between 1 and 25 at an
interval of 0.1 and vector, y which stores the sines of all values held in vector, x.
> x <-seq(1, 25, 0.1)
> y <-sin(x)
The plot function takes the values in the vector, x and plots it on the horizontal axis. It then takes the
values in the vector, y and places it on the vertical axis (Figure 3.4).
> plot(x, y)

MATRICES
Matrices are nothing but two-dimensional arrays.
Objective
Let us create a matrix which is 3 rows by 4 columns and set all its elements to 1.
> matrix(1, 3, 4)
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    1    1    1    1
[3,]    1    1    1    1
Objective
Use a vector to create an array, 3 rows high and 3 columns wide.
Step 1: Begin by creating a vector that has elements from 10 to 90 with an interval of 10.
> a <- seq(10, 90, by = 10)
Step 2: Validate by printing the value of vector a.
>a
[1] 10 20 30 40 50 60 70 80 90
Step 3: Call the matrix function with the vector, ‘a’, the number of rows and the number of
columns.
> matrix(a, 3, 3)
     [,1] [,2] [,3]
[1,]   10   40   70
[2,]   20   50   80
[3,]   30   60   90
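
The output above shows that matrix() fills the matrix column by column. The following is a minimal sketch of how the same vector can be laid out row by row instead; the byrow argument is a standard argument of matrix() that the text above does not mention, and the variable names are illustrative.

```r
a <- seq(10, 90, by = 10)

by_col <- matrix(a, 3, 3)                 # default: filled column by column
by_row <- matrix(a, 3, 3, byrow = TRUE)   # filled row by row

first_row <- by_row[1, ]                  # 10 20 30

# Filling by row gives the transpose of filling by column
is_transpose <- identical(by_row, t(by_col))
```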
Objective
Re-shape the vector itself into an array using the dim function.
Step 1: Begin by creating a vector that has elements from 10 to 90 with an interval of 10.
> a <- seq (10, 90, by = 10)
Step 2: Validate by printing the value of vector, a.
>a
[1] 10 20 30 40 50 60 70 80 90
Step 3: Assign new dimensions to vector, a by assigning c(3, 3), i.e. 3 rows and 3 columns, to dim(a).
> dim(a) <- c(3, 3)
Step 4: Print the values of vector, a. You will notice that the values have shifted to form 3 rows by 3
columns. The vector is no longer one dimensional. It has been converted into a two-dimensional matrix
that is 3 rows high and 3 columns wide.

> a
     [,1] [,2] [,3]
[1,]   10   40   70
[2,]   20   50   80
[3,]   30   60   90
➢ Matrix Access
• Objective-1
Access the elements of a 3 *4 matrix.
Step 1: Create a matrix, ‘mat’, 3 rows high and 4 columns wide using a vector.
> x <- 1:12
>x
[1] 1 2 3 4 5 6 7 8 9 10 11 12
> mat <- matrix(x, 3, 4)
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Step 2: Access the element present in the second row and third column of the matrix, ‘mat’.
> mat [2, 3]
[1] 8
• Objective-2
Access the third row of an existing matrix.
Step 1: Let us begin by printing the values of an existing matrix, ‘mat’
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Step 2: To access the third row of the matrix, simply provide the row number and omit the column
number.
> mat [3, ]
[1] 3 6 9 12
• Objective-3
Access the second column of an existing matrix.
Step 1: Let us begin by printing the values of an existing matrix, ‘mat’
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Step 2: To access the second column of the matrix, simply provide the column number
and omit the row number.
> mat[, 2]
[1] 4 5 6
• Objective-4
Access the second and third columns of an existing matrix.
Step 1: Let us begin by printing the values of an existing matrix, ‘mat’.
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Step 2: To access the second and third columns of the matrix, simply provide the column numbers and
omit the row number.
> mat[, 2:3]
     [,1] [,2]
[1,]    4    7
[2,]    5    8
[3,]    6    9
• Objective-5
Create a contour plot.
Create a matrix, ‘mat’ which is 9 rows high and 9 columns wide and assign the value ‘1’ to all its
elements.
> mat <- matrix(1, 9, 9)
Print all the values of the matrix, ‘mat’.
> mat
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    1    1    1    1    1    1    1    1    1
[2,]    1    1    1    1    1    1    1    1    1
[3,]    1    1    1    1    1    1    1    1    1
[4,]    1    1    1    1    1    1    1    1    1
[5,]    1    1    1    1    1    1    1    1    1
[6,]    1    1    1    1    1    1    1    1    1
[7,]    1    1    1    1    1    1    1    1    1
[8,]    1    1    1    1    1    1    1    1    1
[9,]    1    1    1    1    1    1    1    1    1
Assign ‘0’ as the value to the element present in the third row and third column of the matrix, ‘mat’.
> mat[3, 3] <- 0
> mat
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    1    1    1    1    1    1    1    1    1
[2,]    1    1    1    1    1    1    1    1    1
[3,]    1    1    0    1    1    1    1    1    1
[4,]    1    1    1    1    1    1    1    1    1
[5,]    1    1    1    1    1    1    1    1    1
[6,]    1    1    1    1    1    1    1    1    1
[7,]    1    1    1    1    1    1    1    1    1
[8,]    1    1    1    1    1    1    1    1    1
[9,]    1    1    1    1    1    1    1    1    1

Plot the contour chart using the contour() function (Figure 3.5). The contour() function creates a contour
plot or adds contour lines to an existing plot. Look up the R documentation for a complete description of
the contour() function.
> contour(mat)

Contour plot
Objective-6
Create a 3D perspective plot with the persp() function (Figure 3.6). It provides a 3D wireframe plot most
commonly used to display a surface.
>persp(mat)

We can add a title to our plot with the parameter ‘main’. Similarly, ‘xlab’, ‘ylab’ and ‘zlab’ can be used to
label the three axes. Coloring of the plot is done with parameter ‘col’. Similarly, we can add shading with
the parameter ‘shade’.
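
The parameters named above can be combined in a single call. The following is a sketch that reuses the 9 by 9 matrix from Objective-5; the title, axis labels, colour and shade value are arbitrary choices for illustration. The pdf(NULL) and dev.off() lines are an assumption made so that the sketch also runs where no display is available; remove them to see the plot on screen.

```r
# Open a graphics device that does not write a file (assumed headless run)
pdf(NULL)

mat <- matrix(1, 9, 9)
mat[3, 3] <- 0

# persp() draws the surface and invisibly returns a 4 x 4 viewing matrix
view <- persp(mat,
              main = "Perspective plot of mat",   # plot title
              xlab = "Row", ylab = "Column", zlab = "Value",
              col = "lightblue",                  # surface colour
              shade = 0.5)                        # amount of shading

dev.off()
```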

Objective-7
R includes some sample data sets. One of these is ‘volcano’, which is a 3D map of a dormant New
Zealand volcano. Create a contour map of the volcano dataset (Figure 3.7).
> contour(volcano)

Let us create a 3D perspective map of the sample data set, ‘volcano’ (Figure 3.8).
> persp(volcano)

Objective-8
Create a heat map of the sample dataset, ‘volcano’ (Figure 3.9).
> image(volcano)

FACTORS

➢ Creating Factors
School, ‘XYZ’ places students in groups, also called houses. Each group is assigned a unique color such
as ‘red’, ‘green’, ‘blue’ or ‘yellow’. HouseColor is a vector that stores the house colors of a group of
students.
> HouseColor <- c('red', 'green', 'blue', 'yellow', 'red', 'green', 'blue', 'blue')
> types <- factor(HouseColor)
> HouseColor
[1] "red" "green" "blue" "yellow" "red" "green" "blue" "blue"
> print(HouseColor)
[1] "red" "green" "blue" "yellow" "red" "green" "blue" "blue"
> print(types)
[1] red green blue yellow red green blue blue
Levels: blue green red yellow
Levels denote the unique values. The factor above has four distinct values, namely ‘blue’, ‘green’, ‘red’ and
‘yellow’.
> as.integer(types)
[1] 3 2 1 4 3 2 1 1
The above output is explained as given below.
1 is the number assigned to blue.
2 is the number assigned to green.
3 is the number assigned to red.
4 is the number assigned to yellow.
> levels(types)
[1] "blue" "green" "red" "yellow"
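
By default the levels are sorted alphabetically, which is why blue is numbered 1. The following is a minimal sketch of counting the members of each level with table() and of fixing the level order explicitly through the levels argument of factor(); both are standard R features that the text above does not cover, and ordered_types is an illustrative name.

```r
HouseColor <- c("red", "green", "blue", "yellow", "red", "green", "blue", "blue")
types <- factor(HouseColor)

# Number of students per house, in level order (blue, green, red, yellow)
counts <- table(types)

# Same data, but with the levels listed in a chosen order
ordered_types <- factor(HouseColor, levels = c("red", "green", "blue", "yellow"))
codes <- as.integer(ordered_types)   # red is now 1, green 2, blue 3, yellow 4
```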
The vector ‘NoofStudents’ stores the number of students in each house/group with 12 students in blue
house, 14 students in green house, 12 students in red house and 13 students in yellow house.
> NoofStudents <- c(12, 14, 12, 13)
> NoofStudents
[1] 12 14 12 13
The vector, ‘AverageScore’ stores the average score of the students of each house/group. 70 is the
average score for students of the blue house, 80 is the average score for students of the green house, 90 is
the average score for the students of the red house and 95 is the average score for the students of the
yellow house.
> AverageScore <- c(70, 80, 90, 95)
> AverageScore
[1] 70 80 90 95
Objective-1
Plot the relationship between NoofStudents and AverageScore (Figure 3.10).
> plot(NoofStudents, AverageScore)
The above graph in Figure 3.10 displays 4 dots. Let us improve the graph by using different
symbols to represent each house (Figure 3.11).
> plot(NoofStudents, AverageScore, pch=as.integer(types))

To add further meaning to the graph, let us place a legend on the top right corner (Figure 3.12).
> legend("topright", c("red", "green", "blue", "yellow"), pch=1:4)

LIST
A list is similar to a struct in C.
Objective-1
Create a list in R.
Create a list, ‘emp’ having three elements, ‘EmpName’, ‘EmpUnit’ and ‘EmpSal’.
> emp <- list(EmpName = "Alex", EmpUnit = "IT", EmpSal = 55000)
Outcome
To get the elements of the list, ‘emp’ use the command given below.
> emp

$EmpName
[1] "Alex"
$EmpUnit
[1] "IT"
$EmpSal
[1] 55000
Actually, the element names, e.g. ‘EmpName’, ‘EmpUnit’ and ‘EmpSal’ are optional. We could
alternatively do this as shown below.
> EmpList <- list("Alex", "IT", 55000)
> EmpList
[[1]]
[1] "Alex"
[[2]]
[1] "IT"
[[3]]
[1] 55000
Here the elements of EmpList are referred to as 1, 2 and 3.
➢ List Tags and Values
A list has elements. The elements in a list can have names, which are referred to as tags. Elements can also
have values.
For example, in the ‘emp’ list we have three elements, viz. EmpName, EmpUnit and EmpSal. The values
are as follows. The element ‘EmpName’ has the value ‘Alex’, the element ‘EmpUnit’ has the value ‘IT’
and the element ‘EmpSal’ has the value 55000.
Let us look at the command to retrieve the names and values of the elements in a list.
Objective-1
Retrieve the names of the elements in the list ‘emp’.
> names(emp)
[1] "EmpName" "EmpUnit" "EmpSal"
Objective-2
Retrieve the values of the elements in the list ‘emp’.
> unlist(emp)
EmpName EmpUnit  EmpSal
 "Alex"    "IT" "55000"
The command to retrieve the value of a single element in the list ‘emp’ is given below.
Objective-3
Retrieve the value of the element ‘EmpName’ in the list ‘emp’.
> unlist(emp["EmpName"])
EmpName
 "Alex"
The value of the other elements in the list can be checked in a similar manner.
> unlist(emp["EmpUnit"])
EmpUnit
   "IT"
> unlist(emp["EmpSal"])
EmpSal
 55000
Yet another way to retrieve the values of the elements in the list ‘emp’ is given as follows:
Objective-4
Retrieve the value of the element ‘EmpName’ in the list ‘emp’.
> emp[["EmpName"]]
[1] "Alex"
Or
> emp[[1]]
[1] "Alex"
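
The two styles of access used above differ in what they return: single brackets give back a list, while double brackets give back the stored value itself. The following is a minimal sketch of the difference, assuming the ‘emp’ list as first created; the $ shorthand and the names sub_list, value, same and raised are illustrative additions.

```r
emp <- list(EmpName = "Alex", EmpUnit = "IT", EmpSal = 55000)

sub_list <- emp["EmpSal"]     # still a list, of length 1
value    <- emp[["EmpSal"]]   # the stored number itself
same     <- emp$EmpSal        # shorthand for emp[["EmpSal"]]

class(sub_list)               # "list"
class(value)                  # "numeric"

# Arithmetic works on the extracted value, not on the one-element list
raised <- value * 1.1
```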
➢ Add/Delete Element to or from a List
Objective-1
Add an element with the name ‘EmpDesg’ and value ‘Software Engineer’ to the list, ‘emp’.
> emp$EmpDesg = "Software Engineer"
Outcome
> emp
$EmpName
[1] "Alex"
$EmpUnit
[1] "IT"
$EmpSal
[1] 55000
$EmpDesg
[1] "Software Engineer"
Objective-2
Delete an element with the name ‘EmpUnit’ and value ‘IT’ from the list, ‘emp’.
> emp$EmpUnit <- NULL
Outcome
> emp
$EmpName
[1] "Alex"
$EmpSal
[1] 55000
$EmpDesg
[1] "Software Engineer"
➢ Size of a List
The length() function can be used to determine the number of elements present in a list. The list, ‘emp’
now has three elements:
Objective-1
Determine the number of elements in the list, ‘emp’.
> length(emp)
[1] 3
➢ Recursive List
A recursive list means a list within a list.
Objective-2
Create a list within a list. Let us begin with two lists, ‘emp’ and ‘emp1’. The elements in both the lists are
as shown below.
> emp
$EmpName
[1] "Alex"
$EmpSal
[1] 55000
$EmpDesg
[1] "Software Engineer"
> emp1
$EmpUnit
[1] "IT"
$EmpCity
[1] "Los Angeles"
We would like to combine both the lists into a single list called ‘EmpList’.
> EmpList <- list(emp, emp1)
Outcome
> EmpList

[[1]]
[[1]]$EmpName
[1] "Alex"
[[1]]$EmpSal
[1] 55000
[[1]]$EmpDesg
[1] "Software Engineer"
[[2]]
[[2]]$EmpUnit
[1] "IT"
[[2]]$EmpCity
[1] "Los Angeles"
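
An element of the inner list is reached by indexing twice. The following is a sketch that rebuilds the two lists from the example, reads a nested value and contrasts list() with c(), which joins the elements into one flat list; the comparison with c() and the name flat are additions for illustration.

```r
emp  <- list(EmpName = "Alex", EmpSal = 55000, EmpDesg = "Software Engineer")
emp1 <- list(EmpUnit = "IT", EmpCity = "Los Angeles")

# Nested: a list whose two elements are themselves lists
EmpList <- list(emp, emp1)
n_outer <- length(EmpList)          # 2
city <- EmpList[[2]]$EmpCity        # second inner list, element EmpCity

# Flat: c() joins the elements of both lists into one list of five elements
flat <- c(emp, emp1)
n_flat <- length(flat)              # 5
```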

FEW COMMON ANALYTICAL TASKS


Like any other kind of processing, analytical data processing requires general operations such as
reading, writing, updating and merging data.
Exploring a Dataset
Exploring a dataset means displaying the data of the dataset in different forms. Datasets are the main part
of analytical data processing, which uses different forms or parts of a dataset. With the help of R commands,
analysts can easily explore a dataset in different ways.

[Table: Functions, Function Arguments, Description]

The dataset, ‘Orange’ can be loaded into the workspace and explored with functions such as
summary(), names() and str().
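
A minimal sketch of such an exploration is given below. ‘Orange’ is a sample data frame that ships with R; dim() and head() are standard functions added here for illustration alongside the three named in the text.

```r
# Load the sample dataset that ships with R
data(Orange)

col_names <- names(Orange)   # column names
size <- dim(Orange)          # number of rows and columns

head(Orange, 3)              # first three rows
str(Orange)                  # structure: type of each column
summary(Orange)              # per-column summary statistics
```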

• Conditional Manipulation of a Dataset
Analytical data processing may sometimes require only specific rows and columns of a dataset.

Consider a table, ‘Hardware.csv’ read into the object, ‘TD’ in the R workspace. The TD[1, ]
and TD[, 1] commands display the first row and the first column, respectively.
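
The following is a sketch of row, column and conditional selection. The contents of ‘Hardware.csv’ are not given in these notes, so the data frame below, with its columns Item, Price and Stock, is a hypothetical stand-in; with the real file, TD would instead be created with read.csv("Hardware.csv").

```r
# Hypothetical stand-in for the contents of 'Hardware.csv'
TD <- data.frame(
  Item  = c("Mouse", "Keyboard", "Monitor", "Printer"),
  Price = c(500, 1200, 8000, 6500),
  Stock = c(40, 25, 10, 5),
  stringsAsFactors = FALSE
)

first_row <- TD[1, ]     # first row, all columns
first_col <- TD[, 1]     # first column, returned as a vector

# Rows that satisfy a condition
costly <- TD[TD$Price > 1000, ]

# Two conditions combined with &, keeping only selected columns
costly_in_stock <- TD[TD$Price > 1000 & TD$Stock >= 10, c("Item", "Price")]

# subset() expresses the same kind of selection more compactly
low_stock <- subset(TD, Stock < 20, select = c(Item, Stock))
```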

• Merging Data

Merging different datasets or objects is another common task in most processing activities, and
analytical data processing may also require merging two or more data objects. R provides the
merge() function, which combines data frames by common columns or row names. It follows the
semantics of database join operations.

The syntax of the merge() function is given as follows:

merge(x, y, ...)
OR
merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y =
all, ...)
• where, x is an object or data frame, y is an object or data frame and the by, by.x and by.y arguments
define the common columns or rows for merging.
• The all argument contains a logical value, ‘TRUE’ or ‘FALSE’. If the value is TRUE then it returns
the full outer join by adding all rows of x and y into the result object.
• The all.x argument contains a logical value, ‘TRUE’ or ‘FALSE’. If the value is TRUE then it returns
the dataset as per left outer join: rows of x that have no matching row in y are also added to the
result, with NA in the columns that come from y. If the value is FALSE then only the rows with
data from both x and y are merged into the result object.
• The all.y argument contains a logical value, ‘TRUE’ or ‘FALSE’. If the value is TRUE then it returns
the dataset as per right outer join: rows of y that have no matching row in x are also added to the
result, with NA in the columns that come from x. If the value is FALSE then only the rows with
data from both x and y are merged into the result object.
• The dots ‘...’ define other optional arguments.
• Example-1: merging data
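The original example was a screenshot that is not reproduced here. A minimal sketch of a default merge, assuming two small hypothetical data frames (emp and dept) that share the column EmpId, could look like this:

```r
# Two hypothetical data frames that share the key column 'EmpId'.
emp <- data.frame(
  EmpId   = c(1, 2, 3),
  EmpName = c("Alex", "Ben", "Cara"),
  stringsAsFactors = FALSE
)
dept <- data.frame(
  EmpId   = c(2, 3, 4),
  EmpUnit = c("IT", "HR", "Finance"),
  stringsAsFactors = FALSE
)

# With no 'by' argument, merge() joins on the columns common to both
# data frames (here EmpId) and keeps only the matching rows.
inner <- merge(emp, dept)
print(inner)
```

Only EmpId 2 and 3 appear in both data frames, so the result has two rows.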



• Example-2: merging data using join condition
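The join variants can be sketched as below, again with hypothetical data frames (emp, dept, dept2) since the original screenshot is not available. The all, all.x and all.y arguments select the join type, and by.x and by.y handle key columns with different names.

```r
emp <- data.frame(
  EmpId   = c(1, 2, 3),
  EmpName = c("Alex", "Ben", "Cara"),
  stringsAsFactors = FALSE
)
dept <- data.frame(
  EmpId   = c(2, 3, 4),
  EmpUnit = c("IT", "HR", "Finance"),
  stringsAsFactors = FALSE
)

left  <- merge(emp, dept, by = "EmpId", all.x = TRUE)  # left outer join
right <- merge(emp, dept, by = "EmpId", all.y = TRUE)  # right outer join
full  <- merge(emp, dept, by = "EmpId", all = TRUE)    # full outer join

# When the key columns have different names, use by.x and by.y.
dept2 <- data.frame(
  Id      = c(2, 3, 4),
  EmpUnit = c("IT", "HR", "Finance"),
  stringsAsFactors = FALSE
)
renamed <- merge(emp, dept2, by.x = "EmpId", by.y = "Id")

print(left)
print(right)
print(full)
print(renamed)
```

Unmatched rows are filled with NA, for example EmpUnit for EmpId 1 in the left join and EmpName for EmpId 4 in the right join.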

AGGREGATING AND GROUP PROCESSING OF A VARIABLE


Aggregate and group operations first group the rows of a dataset by one or more variables and then summarise specific variables within each group. Like merging, aggregation and grouping are common requirements of analytical data processing, and R provides several functions for them.
➢ aggregate() Function
The aggregate() function is an inbuilt function of R that aggregates data values. It splits the data into groups and then applies the given statistical function to each group.

Syntax:
aggregate(x, …) or aggregate(x, by, FUN, …)

where x is an object, the by argument is a list of grouping elements for the chosen variable of the dataset, the FUN argument is the statistical function applied to each group (for example mean, which returns a numeric value) and the dots ‘…’ stand for other optional arguments.

The following example reads a table, ‘Fruit_data.csv’, into the object ‘S’. The aggregate() function computes the mean price of each type of fruit. Here the by argument is list(Fruit.Name = S$Fruit.Name), which groups the rows by the Fruit.Name column.
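A minimal sketch of this aggregation follows. The contents of ‘Fruit_data.csv’ are not shown in these notes, so S is built inline with made-up prices; only the column names Fruit.Name and Fruit.Price come from the text.

```r
# Hypothetical stand-in for 'Fruit_data.csv'; the prices are made up.
S <- data.frame(
  Fruit.Name  = c("Apple", "Banana", "Apple", "Mango", "Banana", "Mango"),
  Fruit.Price = c(120, 40, 100, 150, 60, 130),
  stringsAsFactors = FALSE
)

# Group the rows by Fruit.Name and compute the mean price per group.
# The summary column is named 'x' when a bare vector is passed.
mean_price <- aggregate(S$Fruit.Price,
                        by = list(Fruit.Name = S$Fruit.Name),
                        FUN = mean)
print(mean_price)

# The formula interface gives the same numbers with a friendlier column name.
mean_price_f <- aggregate(Fruit.Price ~ Fruit.Name, data = S, FUN = mean)
print(mean_price_f)
```

The formula form is often preferred because the result keeps the original column name instead of ‘x’.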



➢ tapply() Function
The tapply() function is also an inbuilt function of R and works in a manner similar to aggregate(). It splits the data values into groups and applies the given statistical function to each group.

Syntax:
tapply(x, …) or tapply(x, INDEX, FUN, …)

where x is an object that holds the summary variable, the INDEX argument defines the grouping elements (also called the group variable), the FUN argument is the statistical function applied to each group and the dots ‘…’ stand for other optional arguments.
The following example reads the table ‘Fruit_data.csv’ into the object ‘A’. The tapply() function computes the sum of the price for each type of fruit. Here Fruit.Price is the summary variable and Fruit.Name is the grouping variable. FUN is applied to the summary variable, Fruit.Price.
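The tapply() call described above can be sketched as follows, again with a hypothetical inline data frame in place of ‘Fruit_data.csv’ (the prices are made up).

```r
# Hypothetical stand-in for 'Fruit_data.csv'; the prices are made up.
A <- data.frame(
  Fruit.Name  = c("Apple", "Banana", "Apple", "Mango", "Banana", "Mango"),
  Fruit.Price = c(120, 40, 100, 150, 60, 130),
  stringsAsFactors = FALSE
)

# Summary variable: Fruit.Price; grouping variable: Fruit.Name; FUN: sum.
price_sum <- tapply(A$Fruit.Price, A$Fruit.Name, sum)
print(price_sum)
```

Unlike aggregate(), which returns a data frame, tapply() returns a named array with one element per group.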

SIMPLE ANALYSIS USING R

➢ Input
Input is the first step in any processing, including analytical data processing. Here, the input is the dataset ‘Fruit’. To read the dataset into R, use the read.table() or read.csv() function.



➢ Describe Data Structure
After reading the dataset into the R workspace, the dataset can be described using different
functions like names(), str(), summary(), head() and tail(). All these functions have been described in the
previous sections.
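The input and description steps can be sketched together. The ‘Fruit’ file itself is not included in these notes, so the sketch first writes a small hypothetical CSV to a temporary file (csv_path) and then reads it back.

```r
# Write a small hypothetical 'Fruit' dataset to a temporary CSV file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Fruit.Name,Fruit.Price",
             "Apple,120",
             "Banana,40",
             "Mango,150",
             "Apple,100"), csv_path)

# Input: read the dataset into the workspace.
fruits <- read.csv(csv_path, stringsAsFactors = FALSE)

# Describe the data structure.
print(names(fruits))     # column names
str(fruits)              # column types and first values
print(summary(fruits))   # summary statistics per column
print(head(fruits, 2))   # first two rows
print(tail(fruits, 2))   # last two rows
```

With a real file, csv_path would simply be replaced by the path to that file.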

➢ Describe Variable Structure


You can also describe the individual variables of the dataset. Many functions are available for describing variables and performing operations on them.



Many inbuilt distribution functions can be applied to the variables of a dataset to describe how the data are distributed.
A histogram is a graphical display of data that uses many bars of different heights.

The complete syntax for hist() function is:

hist(x, breaks = "Sturges", freq = NULL, probability = !freq, include.lowest = TRUE,
right = TRUE, density = NULL, angle = 45, col = NULL, border = NULL,
main = paste("Histogram of", xname), xlim = range(breaks), ylim = NULL, xlab = xname, ylab,
axes = TRUE, plot = TRUE, labels = FALSE, nclass = NULL, warn.unused = TRUE, ...)

where x is the vector for which a histogram is required. freq is a logical value: if TRUE, the histogram shows frequencies (the counts component of the result); if FALSE, it shows probability densities (the density component). main, xlab and ylab are the title and axis labels. plot is a logical value: if TRUE (the default), a histogram is plotted; otherwise a list of breaks and counts is returned.
> hist(fruits$Fruit.Price)
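To see what hist() computes without opening a graphics window, the plot = FALSE form can be used. The sketch below assumes a small made-up price vector (prices) in place of fruits$Fruit.Price.

```r
# Made-up prices standing in for fruits$Fruit.Price.
prices <- c(40, 60, 100, 120, 130, 150)

# plot = FALSE returns the computed histogram instead of drawing it.
h <- hist(prices, plot = FALSE)

print(h$breaks)    # bin boundaries chosen by the default "Sturges" rule
print(h$counts)    # number of values in each bin (used when freq = TRUE)
print(h$density)   # probability densities (used when freq = FALSE)
```

Dropping plot = FALSE draws the histogram from the same breaks and counts.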

The figure below shows the box-and-whisker plot of the ‘Fruit’ dataset drawn with the boxplot() function. A box-and-whisker plot summarises each group of values as a box spanning the quartiles, with whiskers extending towards the extremes.

The syntax for the boxplot() function is:

boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE, notch = FALSE, outline = TRUE,
names, plot = TRUE, border = par("fg"), col = NULL, log = "",
pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5), horizontal = FALSE, add = FALSE, at = NULL)

where x is a numeric vector or a single list containing such vectors.


outline - If outline is not true, the outliers are not drawn.
range - This determines how far the plot whiskers extend out from the box.
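The effect of the range argument can be checked numerically with plot = FALSE. The sketch below uses a made-up price vector (prices) that contains one unusually large value.

```r
# Made-up prices with one extreme value.
prices <- c(40, 60, 100, 120, 130, 150, 900)

# range = 1.5 (the default): whiskers reach at most 1.5 times the box
# length beyond the box; points further out are reported as outliers.
b <- boxplot(prices, range = 1.5, plot = FALSE)
print(b$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(b$out)     # values treated as outliers

# range = 0: whiskers extend to the data extremes, so nothing is an outlier.
b0 <- boxplot(prices, range = 0, plot = FALSE)
print(b0$stats)
print(b0$out)
```

When the plot is drawn, outline = FALSE hides the points listed in b$out but does not change the computed statistics.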

Figure below describes the plot of the ‘Fruit’ dataset using the plot() function.



METHODS FOR READING DATA
R supports many data formats, including those used by databases. With the import and export utilities of R, most common types of data can be read into R and written out of it.

➢ CSV and Spreadsheets

Comma separated value (CSV) files and spreadsheets are used for storing small datasets. R provides functions through which analysts can read both types of files.
Reading CSV Files
A CSV file uses the .csv extension and stores tabular data as plain text.
The following function reads data from a CSV file:

read.csv("filename")

where filename is the name of the CSV file that needs to be imported.

The read.table() function can also read data from CSV files. The syntax of the function is

read.table("filename", header = TRUE, sep = ",", …)

where the filename argument defines the path of the file to be read, the header argument is a logical value (TRUE or FALSE) stating whether the first line of the file holds the column names, the sep argument defines the character that separates the columns of the file and the dots ‘…’ stand for other optional arguments.

The following example reads a CSV file, ‘Hardware.csv’ using read.csv() and read.table() function.
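Since ‘Hardware.csv’ is not reproduced in these notes, the sketch below writes a small stand-in file with hypothetical columns (Item, Price) to a temporary path (hw_path) and reads it with both functions.

```r
# Write a small hypothetical stand-in for 'Hardware.csv'.
hw_path <- tempfile(fileext = ".csv")
writeLines(c("Item,Price",
             "Keyboard,700",
             "Mouse,350",
             "Monitor,8500"), hw_path)

# read.csv() assumes a header line and a comma separator by default.
hw1 <- read.csv(hw_path)

# read.table() needs both stated explicitly for a CSV file.
hw2 <- read.table(hw_path, header = TRUE, sep = ",")

print(hw1)
print(all.equal(hw1, hw2))   # compare the two results
```

For a simple file like this one, read.csv() is just a convenient shortcut for read.table() with CSV-friendly defaults.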



Reading Spreadsheets

A spreadsheet is a table that stores data in rows and columns. Many applications are available for creating spreadsheets; Microsoft Excel is the most popular. An Excel file uses the .xlsx extension and stores data in a spreadsheet.
In R, different packages such as gdata and xlsx provide functions for reading Excel files. A package must be installed and loaded before any of its functions can be used. read.xlsx() is a function of the ‘xlsx’ package for reading Excel files.

The syntax of the read.xlsx() function is


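read.xlsx("filename", …)

where the filename argument defines the path of the file to be read and the dots ‘…’ stand for other optional arguments. In the xlsx package, the sheet to read is selected through the sheetIndex or sheetName argument.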
In R, reading or writing (importing and exporting) data using packages may create some problems like
incompatibility of versions, additional packages not loaded and so on. In order to avoid these problems, it
is better to convert files into CSV files. After converting files into CSV files, the converted file can be
read using the read.csv() function.

The following example illustrates the creation of an Excel file, ‘Softdrink.xlsx’. The ‘Software.csv’ file is the converted form of the ‘Softdrink.xlsx’ file.

The read.csv() function then reads this file into R.

➢ Reading Data from Packages


A package is a collection of functions and datasets. In R, many packages are available for different types of operations. The functions for loading a package and reading the datasets it contains are explained next.
1. library() Function
The library() function loads a package into the R session. A package must be loaded before the datasets it provides can be read.

The syntax of the library() function is:


library(packagename)

where the packagename argument is the name of the package to be loaded.

2. data() Function
Called without arguments, the data() function lists all the datasets available in the loaded packages. To load a particular dataset into the R workspace, pass its name to the data() function.

The syntax of the data() function is:


data(datasetname)

where the datasetname argument is the name of the dataset to be read.


The following example illustrates the loading of a dataset. The data() function lists all the available datasets of the loaded packages. The ‘> Orange’ command displays the content of the ‘Orange’ dataset in the workspace.
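The two functions can be sketched together using the ‘datasets’ package that ships with R. The variable name available is chosen only for this illustration.

```r
# 'datasets' is attached by default; the call is shown for illustration.
library(datasets)

# data(package = ...) describes the datasets of a package; the 'Item'
# column of its results holds the dataset names.
available <- data(package = "datasets")$results[, "Item"]
print(head(available))

# Load one dataset by name and display its first rows.
data(Orange)
print(head(Orange))
```

At the console, typing data() on its own opens the full listing, and typing Orange prints the whole dataset.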

➢ Reading Data from Web/APIs

Nowadays most business organizations are using the Internet and cloud services for storing data. This
online dataset is directly accessible through packages and application programming interfaces (APIs).
Different packages are available in R for reading from online datasets.



Web scraping extracts data from the webpages of a website. Here the package ‘RCurl’ is used for web scraping (Figure 3.32). First, the ‘RCurl’ package is loaded into the workspace and its getURL() function fetches the required webpage. Then the htmlTreeParse() function, which belongs to the ‘XML’ package, parses the content of the webpage.

#installing the RCurl and XML packages
install.packages("RCurl")
install.packages("XML")

#loading both packages into the environment
library(RCurl)
library(XML)

#passing the URL for which web data is required
wd <- getURL("https://www.techopedia.com/definition/5212/web-scrapping", ssl.verifypeer = FALSE)

#parsing the web data (htmlTreeParse() comes from the XML package)
wd_parsed <- htmlTreeParse(wd, asText = TRUE)

#display the contents of the webpage
wd_parsed

➢ Reading a JSON (JavaScript Object Notation) Document


Step 1: Install and load the rjson package.
> install.packages("rjson")
> library(rjson)
Step 2: Input data.
Store the data given below in a text file (‘D:/Jsondoc.json’). Ensure that the file is saved
with an extension of .json
{
"EMPID": ["1001", "2001", "3001", "4001", "5001", "6001", "7001", "8001"],
"Name": ["Ricky", "Danny", "Mitchelle", "Ryan", "Gerry", "Nonita", "Simon", "Gallop"],
"Dept": ["IT", "Operations", "IT", "HR", "Finance", "IT", "Operations", "Finance"]
}
A JSON document begins and ends with curly braces ({}). It is a set of key:value pairs, and the pairs are separated by commas (‘,’). Strings in JSON must be enclosed in double quotes.
Step 3: Read the JSON file, ‘d:/Jsondoc.json’.
> output <- fromJSON(file = "d:/Jsondoc.json")
> output
$EMPID
[1] "1001" "2001" "3001" "4001" "5001" "6001" "7001" "8001"
$Name
[1] "Ricky" "Danny" "Mitchelle" "Ryan" "Gerry" "Nonita" "Simon" "Gallop"
$Dept
[1] "IT" "Operations" "IT" "HR" "Finance" "IT" "Operations" "Finance"
Step 4: Convert JSON to a data frame.
> JSONDataFrame <- as.data.frame(output)
Display the content of the data frame, ‘JSONDataFrame’.
> JSONDataFrame



➢ Reading an XML File

Step 1: Install an XML package.


> install.packages("XML")
Step 2: Input data.
Store the data below in a text file (XMLFile.xml in the D: drive). Ensure that the file is saved with an
extension of .xml.
<RECORDS>
<EMPLOYEE>
<EMPID>1001</EMPID>
<EMPNAME>Merrilyn</EMPNAME>
<SKILLS>MongoDB</SKILLS>
<DEPT>Computer Science</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<EMPID>1002</EMPID>
<EMPNAME>Ramya</EMPNAME>
<SKILLS>People Management</SKILLS>
<DEPT>Human Resources</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<EMPID>1003</EMPID>
<EMPNAME>Fedora</EMPNAME>
<SKILLS>Recruitment</SKILLS>
<DEPT>Human Resources</DEPT>
</EMPLOYEE>
</RECORDS>
The XML file is read into R using the xmlParse() function, which returns a parsed XML document object.

Step 1: Begin by loading the required packages.


> library("XML")
Warning message: package ‘XML’ was built under R version 3.2.3
> library("methods")
> output <- xmlParse(file = "d:/XMLFile.xml")
> print(output)

Step 2: Extract the root node from the XML file.


> rootnode <- xmlRoot(output)
Find the number of nodes in the root.
> rootsize <- xmlSize(rootnode)
> rootsize
[1] 3

Let us display the details of the first node.


> print (rootnode[1])

Let us display the details of the first element of the first node.
> print(rootnode[[1]][[1]])

Let us display the details of the third element of the first node.
> print(rootnode[[1]][[3]])
Next, display the details of the third element of the second node.
> print(rootnode[[2]][[3]])
We can also display the value of the second element of the first node.
> output <- xmlValue(rootnode[[1]][[2]])
> output
Step 3: Convert the input XML file to a data frame using the xmlToDataFrame() function.
> xmldataframe <- xmlToDataFrame("d:/XMLFile.xml")
Display the output of the data frame.
> xmldataframe

COMPARISON OF R GUIS FOR DATA INPUT

R is mainly used for statistical and analytical data processing, which typically works with large datasets stored in tabular form. Carrying out such operations with R functions at the R console can sometimes be difficult. Graphical user interfaces (GUIs) have been developed for R to overcome this problem.
A graphical user interface is a graphical medium through which users interact with the language and perform operations. Different GUIs are available for data input in R, each with its own features. The table below describes some of the most popular R GUIs.



USING R WITH DATABASES AND BUSINESS INTELLIGENCE
SYSTEMS

Business analytical processing uses databases for storing large volumes of information. Business intelligence systems or tools handle the analytical processing of a database and work with different types of database systems. These tools support relational database (RDBMS) processing, accessing part of a large database, summarising the database, concurrent access, security management, constraints, server connectivity and other functionality.
At present, many types of databases are available in the market. They come with inbuilt tools, GUIs and functions that make database processing easy.
For SQL databases such as MySQL, PostgreSQL and SQLite, R provides packages for access. With these packages, users can easily access a database, since all the packages follow the same steps for accessing data.

RODBC

• RODBC is an R package for interacting with databases.
• Michael Lapsley and Brian Ripley developed this package.
• RODBC helps in accessing databases such as MS Access and Microsoft SQL Server through an ODBC interface.
• The package has many inbuilt functions for performing operations on the database.



Here is a sample code where package RODBC is used for reading data from a database.

> # importing package
> library(RODBC)
> connect1 <- odbcConnect(dsn = 'servername', uid = '', pwd = '') # Open connection
> query1 <- 'Select * from lib.table where…'
> Demodb <- sqlQuery(connect1, query1, errors = TRUE)
> odbcClose(connect1) # Close the connection

Using MySQL and R

• MySQL is an open source SQL database system.
• It is a small-sized, popular database that is available for free download.
• To access a MySQL database, users need to install the MySQL database system on their computers.
• MySQL can be downloaded and installed from its official website.
• R provides a package, ‘RMySQL’, for accessing data in a MySQL database.
• Like other packages, RMySQL has many inbuilt functions for interacting with a database.

A sample code to illustrate the use of RMySQL for reading data from a database is given below.
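> # importing package
> library(RMySQL)
> connectm <- dbConnect(MySQL(), user = '', password = '', dbname = '', host = '') # Open connection 'connectm'
> querym <- 'Select * from lib.table where…'
> resm <- dbSendQuery(connectm, querym) # Send the query to the server
> Demom <- fetch(resm, n = -1) # Fetch all rows into a data frame
> dbClearResult(resm) # Release the result set
> dbDisconnect(connectm) # Close the connection 'connectm'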

Using PostgreSQL and R


• PostgreSQL is an open source and customisable SQL database system.
• Alongside MySQL, the PostgreSQL database is used for business analytical processing.
• To access a PostgreSQL database, users need to install the PostgreSQL database system on their computer system.
• Please note that it requires a server. Users can rent a server, or download and install the PostgreSQL database from its official website.
• R has a package, ‘RPostgreSQL’, for accessing data in a PostgreSQL database.
• Like other packages, RPostgreSQL has many inbuilt functions for interacting with its database.

Using SQLite and R

• SQLite is a server-less, self-contained, transactional and zero-configuration SQL database system.
• It is an embedded SQL database engine that does not require any server, which is why it is called a serverless database.
• The database also supports business analytical data processing.
• R has the RSQLite package for accessing data in an SQLite database.
• RSQLite has many inbuilt functions for working with the database.
• dbConnect() and dbDisconnect() open and close the connection to the SQLite database, respectively.
• The only difference from the other packages is that users have to pass the SQLite database driver object to the dbConnect() function.
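The connect, query and disconnect steps can be sketched with an in-memory SQLite database, which needs no server or file. This is an illustration under stated assumptions rather than the notes' own example: the table name fruit and its contents are hypothetical, and the guard simply skips the code when the DBI and RSQLite packages are not installed.

```r
# Run only if the DBI and RSQLite packages are installed.
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {

  # Pass the SQLite driver object to dbConnect(); ":memory:" creates a
  # temporary database that lives only for this session.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Hypothetical table used only for this illustration.
  fruit <- data.frame(name  = c("Apple", "Banana", "Mango"),
                      price = c(120, 40, 150),
                      stringsAsFactors = FALSE)
  DBI::dbWriteTable(con, "fruit", fruit)

  # Send a query and fetch the result as a data frame in one step.
  cheap <- DBI::dbGetQuery(con, "SELECT name, price FROM fruit WHERE price < 100")
  print(cheap)

  DBI::dbDisconnect(con)   # close the connection
}
```

The same DBI functions are used with the MySQL and PostgreSQL packages; only the driver object and the connection arguments differ.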

Using JasperDB and R

• JasperDB is another open source database system integrated with R.
• It was developed by the Jaspersoft community.
• It provides many business intelligence tools for analytical business processing.
• A Java library interface called ‘RevoConnectR for JasperReports Server’ is used between JasperDB and R.
• The dashboard of the JasperReports Server provides many features through which R charts, the output of RevoDeployR, etc., are easily accessible.



• JasperDB has a package or web service framework called ‘RevoDeployR’, developed by Revolution Analytics.
• RevoDeployR provides a set of web services with security features, scripts, APIs and libraries in a single server.
• It makes it easy to integrate dynamic R-based computations into web applications.

Using Pentaho and R

• Pentaho is one of the best-known companies in the data integration field. It develops products and provides services for big data deployment and business analytics.
• The company provides different open source-based and enterprise-class platforms.
• Pentaho Data Integration (PDI) is one of the products of Pentaho. It is used for accessing databases and for analytical data processing, and it prepares and integrates data to give a complete picture of a business.
• The tool provides accurate, analytics-ready data reports to end users, reduces coding complexity and brings big data together in one place.
• R Script Executor is one of the inbuilt tools of PDI for connecting R with Pentaho Data Integration. Through R Script Executor, users can access data and perform analytical data operations.
• If users already have R on their system, they just need to install PDI from its official website. They also need to configure the environment variables, Spoon, the DI Server and the cluster nodes.
• Although users can try PDI and transform a database using R Script Executor, PDI is a paid tool for analytical data integration operations.
• The complete installation process of the R Script Executor is available at
http://wiki.pentaho.com/display/EAI/R+script+executor
