DAUR UNIT 1 Part 2
OUTCOMES:
• Introduce the data types supported in R, such as numbers, text, logical values and dates.
• Describe R objects such as vectors, matrices, lists and datasets.
• Manipulate data using R functions such as sum(), min(), max() and rep(), and string functions
such as substr(), grep() and strsplit().
• Import data into R from .csv (comma-separated values) files, spreadsheets, XML
documents, JSON (JavaScript Object Notation) documents, web data, etc.
• Interface R with databases such as MySQL, PostgreSQL and SQLite.
INTRODUCTION:
Enterprise applications today generate a huge amount of data. This data is analyzed to draw useful
insights that can help decision makers make better and faster decisions.
1. Data Formats: Data is the main element of business analytics, which typically works with large
datasets. Selecting a data format is the first challenge in analytical data
processing for researchers or developers. Analytical data processing requires a complete set of
data; without it, developers can expect problems in further processing. R is a well-documented
programming language that stores data in the form of objects. It has a simple
syntax that helps in processing many types of data. R provides many packages and interfaces, such as
open database connectivity (ODBC), for working with different data formats. For example,
ODBC drivers exist for sources such as CSV files, MS Excel workbooks and SQL databases.
2. Data Quality: Business analysts are expected to deliver accurate information and inferences,
with outliers identified and without missing or invalid values. Poor-quality input data is bound to
give poor-quality results. With the help of R, business analysts can maintain data quality.
Different tools in R help business analysts remove invalid data, replace missing values and
remove outliers from the data.
3. Project Scope: Projects based on analytical data processing are costly and time consuming.
Hence, before starting a new project, business analysts should analyze the scope of the project.
They should identify the amount of data required from external sources, time of delivery and
other parameters related to the project.
4. Output Result via Stakeholder Expectation Management: In analytical data processing,
analysts design projects that generate output with different types of values such as the p-value, the
degrees of freedom, etc. However, users or stakeholders prefer to see only the final output. They
do not want to see the constraints used in data processing, the assumptions, hypotheses, p-values,
chi-square values or any other intermediate value. Hence, an analytical project should present its
results in a form that meets the expectations of the stakeholders.
KAMALA CHALLA,ASST.PROF,IT DEPT,VNR VJIET. Page 1
Business analysts should use transparent methods and processes. They should also validate the data using
cross validation. Following the standard steps of analytical data processing helps business analysts
avoid most problems and produce reliable output.
The sequence of analytical data processing steps that analysts should follow while conducting business
analysis for their project is:
• Data input
• Processing
• Descriptive statistics
• Visualization of data
• Report generation
• Output
EXPRESSIONS:
Logical Values
Logical values are TRUE and FALSE, which can be abbreviated as T and F. These are case sensitive.
Guided Activity
Step 1: Create a vector, x consisting of 10 elements with values ranging from 1 to 10.
[1] 1 2 3 4 5 6 7 8 9 10
Step 3: Print the values of those elements whose values are either greater than 7 or less than 5. ‘|’ is the
OR operator.
[1] 1 2 3 4 8 9 10
Step 4: Print the values of those elements whose values are greater than 7 and less than 10. ‘&’ is the
AND operator.
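The commands for this guided activity are not shown above. The sketch below is one way to reproduce the listed outputs; the use of 1:10 and of logical indexing with square brackets is an assumption, not necessarily what the original notes used.

```r
# Step 1: create a vector holding the integers 1 to 10
x <- 1:10
print(x)

# Step 3: keep elements that are greater than 7 OR less than 5
either <- x[x > 7 | x < 5]
print(either)

# Step 4: keep elements that are greater than 7 AND less than 10
both <- x[x > 7 & x < 10]
print(both)
```

The comparison x > 7 returns a logical vector of the same length as x, and indexing with it keeps only the positions that are TRUE.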
DATES:
The default format of date is YYYY-MM-DD.
(i) Print system’s date.
> Sys.Date()
[1] "2017-01-13"
> Sys.time()
> Sys.timezone()
[1] "Asia/Calcutta"
> today
[1] "2017-01-13"
> CustomDate
[1] "character"
(vi) Convert the date stored as text data type into a date data type.
> class(CustDate)
> dates
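Several commands in this section are missing. As a minimal sketch of step (vi), the code below converts a date stored as text into the Date type with as.Date(). The variable names follow the fragments above; the date string and the second format are assumed for illustration.

```r
# A date stored as text has class "character"
CustDate <- "2017-01-13"
print(class(CustDate))

# as.Date() understands the default YYYY-MM-DD format directly
dates <- as.Date(CustDate)
print(class(dates))

# For other layouts, describe the layout with a format string
other <- as.Date("13/01/2017", format = "%d/%m/%Y")

# format() turns a Date back into text in any layout
shown <- format(dates, "%d/%m/%Y")
print(shown)
```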
VARIABLES
(i) Assign a value of 50 to the variable called ‘Var’.
> Var <- 50
> Var
[1] 50
> Var + 10
[1] 60
> Var / 2
[1] 25
Variables can be reassigned values either of the same data type or of a different data type.
> Var
> Var
[1] TRUE
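The reassignment commands are not shown above. A small sketch, with assumed values, of how the same variable can hold values of different data types one after another:

```r
# Start with a numeric value
Var <- 50
first_class <- class(Var)

# Reassign a value of a different data type (text)
Var <- "fifty"
second_class <- class(Var)

# Reassign again, this time a logical value
Var <- TRUE
third_class <- class(Var)

print(c(first_class, second_class, third_class))
```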
FUNCTIONS
sum() function
sum() function returns the sum of all the values in its arguments.
Syntax
sum(…, na.rm=FALSE)
where … implies numeric or complex or logical vectors. na.rm accepts a logical value.
Examples
(i) Sum the values ‘1’, ‘2’ and ‘3’ provided as arguments to sum()
> sum(1, 2, 3)
[1] 6
(ii) What will be the output if NA is used for one of the arguments to sum()?
[1] NA
If na.rm is FALSE, an NA or NaN value in any of the arguments will cause NA or NaN to be returned.
(iii) What will be the output if NaN is used for one of the arguments to sum()?
[1] NaN
(iv) What will be the output if NA and NaN are used as arguments to sum()?
[1] NA
(v) What will be the output if the option na.rm is set to TRUE? If na.rm is TRUE, an NA or NaN value
in any of the arguments will be ignored.
> sum(1, 5, NA, na.rm=TRUE)
[1] 6
[1] 6
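Some of the commands behind the outputs above are missing. The sketch below shows how sum() behaves with NA and NaN; the argument values are assumed for illustration.

```r
# Plain sum of three numbers
s_plain <- sum(1, 2, 3)

# With na.rm = FALSE (the default) a missing value propagates
s_na  <- sum(1, 2, NA)
s_nan <- sum(1, 2, NaN)

# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN
print(c(is.na(s_na), is.nan(s_na), is.nan(s_nan)))

# With na.rm = TRUE both NA and NaN are dropped before summing
s_rm <- sum(1, 5, NA, NaN, na.rm = TRUE)
print(s_rm)
```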
min() function
min() function returns the minimum of all the values present in its arguments.
Syntax
min(…, na.rm=FALSE)
where … implies numeric or character arguments and na.rm accepts a logical value.
Example
> min(1, 2, 3)
[1] 1
If na.rm is FALSE, an NA or NaN value in any of the arguments will cause NA or NaN to be returned.
[1] NA
[1] NaN
[1] NA
[1] 1
max() function
max() function returns the maximum of all the values present in its arguments.
Syntax
max(…, na.rm=FALSE)
Example
If na.rm is FALSE, an NA or NaN value in any of the arguments will cause NA or NaN to be returned.
[1] NA
[1] NaN
[1] NA
[1] 78
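The commands for the min() and max() outputs are not shown. A short sketch with assumed values that illustrates the same behaviour, including the character arguments mentioned in the syntax description:

```r
# Hypothetical values containing one missing entry
values <- c(12, 78, NA, 5)

# Default na.rm = FALSE: the missing value propagates
min_default <- min(values)
max_default <- max(values)

# na.rm = TRUE: the missing value is ignored
min_clean <- min(values, na.rm = TRUE)
max_clean <- max(values, na.rm = TRUE)
print(c(min_clean, max_clean))

# min() and max() also accept character arguments
first_word <- min("sas", "r", "spss")
print(first_word)
```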
seq() function
Syntax
seq(from, to, by, length.out)
where from is the start value of the sequence, to is the maximal or end value of the
sequence, by is the increment of the sequence and length.out is the desired length of the sequence.
Example
[1] 1 3 5 7 9
[1] 1 2 3 4 5 6 7 8 9 10
> seq(18)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Or
> seq_len(18)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[1] 1 4
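Some of the seq() commands are missing above. The sketch below uses assumed arguments to show the by and length.out options, and why seq_len() is the safer choice when the length may be zero.

```r
# Fixed increment: start at 1, stop at or before 10, step by 2
by_two <- seq(from = 1, to = 10, by = 2)
print(by_two)

# Fixed length: four equally spaced values from 1 to 10
four_values <- seq(1, 10, length.out = 4)
print(four_values)

# seq_len(n) always returns exactly n values, even when n is 0
empty <- seq_len(0)
print(length(empty))

# seq(0) counts down from 1 to 0 instead of returning nothing
surprising <- seq(0)
print(surprising)
```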
Table 3.2 explains some useful text manipulation operations. Let us take a look at how R treats strings.
String values have to be enclosed within double quotes.
rep() function repeats a given argument for a specified number of times. In the example
below, the string, ‘statistics’ is repeated three times.
Example
> rep("statistics", 3)
[1] "statistics" "statistics" "statistics"
grep() function
In the example below, the function grep() finds the index position at which the string,
‘statistical’ is present.
Example
> grep("statistical", c("R", "is", "a", "statistical", "language"), fixed=TRUE)
[1] 4
toupper() function
tolower() function
tolower() function converts the given character vector into lower case.
Syntax
tolower(x)
where x is a character vector.
Example
> tolower("STATISTICS")
[1] "statistics"
Or
> casefold("R PROGRAMMING LANGUAGE", upper=FALSE)
[1] "r programming language"
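The toupper() heading above has no description or example. As a sketch, toupper() is the counterpart of tolower() and converts a character vector into upper case; casefold() with upper=TRUE does the same. The input strings are assumed.

```r
# Convert a single string to upper case
upper_one <- toupper("statistics")
print(upper_one)

# casefold() with upper = TRUE gives the same result
upper_two <- casefold("statistics", upper = TRUE)

# Both functions work element by element on a vector
upper_many <- toupper(c("r", "sas", "spss"))
print(upper_many)
```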
substr() function
substr() function extracts or replaces substrings in a character vector.
Syntax
substr(x, start, stop)
where x is a character vector, start is the start position of extraction or replacement and
stop is the stop or end position of extraction or replacement.
Example
Extract the string ‘tic’ from ‘statistics’. Begin the extraction at position 7 and continue the
extraction till position 9.
> substr("statistics", 7, 9)
[1] "tic"
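substr() can also replace a substring, and strsplit() (listed in the outcomes) splits a string into pieces. The sketch below illustrates both; the replacement text and the sentence being split are assumed.

```r
word <- "statistics"

# Extraction: characters 7 to 9
piece <- substr(word, 7, 9)
print(piece)

# Replacement: assign into the same positions
substr(word, 7, 9) <- "TIC"
print(word)

# strsplit() returns a list with one element per input string
tokens <- strsplit("R is a statistical language", " ")[[1]]
print(tokens)
```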
In R, NA (Not Available) represents missing values and Inf (Infinite) represents infinite values. R
provides different functions that identify and handle missing values during processing.
Consider a vector ‘A’ with a missing value, [10, 20, NA, 40]. is.na(A) returns TRUE for the
missing value. na.omit(A) and na.exclude(A) remove the missing value; their results can be stored in
vectors ‘B’ and ‘D’, respectively. na.fail(A) generates an error if A has any missing value. na.pass(A)
returns the vector A unchanged.
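The code for this example is missing. A minimal sketch of the functions described, using the vector values given in the text:

```r
A <- c(10, 20, NA, 40)

# TRUE marks the position of the missing value
missing_flags <- is.na(A)
print(missing_flags)

# Both functions drop the missing value
B <- na.omit(A)
D <- na.exclude(A)
print(as.vector(B))

# na.fail() stops with an error when a missing value is present
failed <- tryCatch({ na.fail(A); FALSE }, error = function(e) TRUE)
print(failed)

# na.pass() returns its input unchanged
P <- na.pass(A)
```

na.omit() and na.exclude() record which positions were dropped in an attribute of the result; as.vector() strips it when only the values are needed.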
A vector can have a list of values. The values can be numbers, strings or logical. All the values in a vector
should be of the same data type. A few points to remember about vectors in R are:
• Vectors are stored like arrays in C
• Vector indices begin at 1
• All vector elements must have the same mode such as integer, numeric (floating point number),
character (string), logical (Boolean), complex, object, etc.
Let us create a few vectors.
1. Create a vector of numbers
> c(4, 7, 8)
[1] 4 7 8
The c function (c is short for combine) creates a new vector consisting of three values, viz. 4, 7 and 8.
2. Create a vector of string values.
> c("R", "SAS", "SPSS")
[1] "R" "SAS" "SPSS"
3. Create a vector of logical values.
> c(TRUE, FALSE)
[1] TRUE FALSE
A vector cannot hold values of different data types. Consider the example below on placing integer, string
and Boolean values together in a vector. R converts all the values to a single type, here character.
> c(4, 8, "R", FALSE)
[1] "4" "8" "R" "FALSE"
4. Declare a vector by the name, ‘Project’ of length 3 and store values in it.
> Project <- vector(length = 3)
> Project[1] <- "Finance Project"
> Project[2] <- "Retail Project"
> Project[3] <- "Energy Project"
Outcome
> Project
[1] "Finance Project" "Retail Project" "Energy Project"
> length (Project)
[1] 3
➢ Sequence Vector
A sequence vector can be created with a start:end notation.
Create a sequence of numbers between 1 and 5 (both inclusive).
> 1:5
[1] 1 2 3 4 5
Or
> seq(1:5)
[1] 1 2 3 4 5
The default increment with seq is 1. However, it also allows the use of increments other than 1.
> seq (1, 10, 2)
[1] 1 3 5 7 9
Or
> seq (from=1, to=10, by=2)
[1] 1 3 5 7 9
Or
> seq (1, 10, by=2)
[1] 1 3 5 7 9
seq can also generate numbers in the descending order.
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1
> seq (10, 1, by=-2)
[1] 10 8 6 4 2
➢ Vector Names
The names() function helps to assign names to the vector elements. This is accomplished in two steps as
shown:
> placeholder <- 1:5
> names(placeholder) <- c("r", "is", "a", "programming", "language")
• The vector elements can then be retrieved by index position or by name. Printing the vector shows
each name above its value.
> placeholder
          r          is           a programming    language
          1           2           3           4           5
• Let us use the name function to assign names to the vector elements. These names will be used as
labels in the barplot.
> names(BarVector) <- c("India", "MiddleEast", "US")
> barplot(BarVector)
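The definition of BarVector is not shown above. The sketch below assumes three hypothetical values so that the naming and plotting steps can be run end to end.

```r
# Hypothetical values; the original definition of BarVector is not given
BarVector <- c(120, 80, 150)

# Attach one name per element; barplot() uses them as bar labels
names(BarVector) <- c("India", "MiddleEast", "US")
print(BarVector)

# Named elements can be retrieved by name as well as by position
us_value <- BarVector["US"]
print(us_value)

barplot(BarVector)
```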
Vector Math
• Let us define a vector, ‘x’ with three values. Let us add a scalar value (single value) to the vector.
This value will get added to each vector element.
> x <- c(4, 7, 8)
> x +1
[1] 5 8 9
• However, the vector will retain its individual elements.
> x
[1] 4 7 8
• If the vector needs to be updated with the new values, type the statement given below.
> x <- x + 1
> x
[1] 5 8 9
• We can run other arithmetic operations on the vector as given:
> x - 1
[1] 4 7 8
Vector Recycling
If an operation is performed involving two vectors that requires them to be of the same length, the shorter
one is recycled, i.e. repeated until it is long enough to match the longer one.
Objective
• Add two vectors wherein one has length, 3 and the other has length, 6.
> c(1, 2, 3) + c(4, 5, 6, 7, 8, 9)
[1] 5 7 9 8 10 12
Objective
• Multiply the two vectors wherein one has length, 3 and the other has length, 6.
> c(1, 2, 3) * c(4, 5, 6, 7, 8, 9)
[1] 4 10 18 7 16 27
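In both examples above the longer length is a multiple of the shorter one. When it is not, R still recycles but issues a warning. The sketch below, with assumed values, captures that warning so the behaviour can be seen.

```r
short <- c(1, 2, 3)
long  <- c(4, 5, 6, 7)   # length 4 is not a multiple of 3

# Record whether a warning was raised, then let the addition finish
warned <- FALSE
result <- withCallingHandlers(
  short + long,
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)

# Only the first element of 'short' is reused: 1+4, 2+5, 3+6, 1+7
print(result)
print(warned)
```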
Objective
• Plot a Scatter Plot. The function to plot a scatter plot is ‘plot’. This function uses two vectors, i.e.
one for the x axis and another for the y axis. The objective is to understand the relationship
between numbers and their sines.
• We will use two vectors. Vector, x which will have a sequence of values between 1 and 25 at an
interval of 0.1 and vector, y which stores the sines of all values held in vector, x.
> x <-seq(1, 25, 0.1)
> y <-sin(x)
The plot function takes the values in the vector, x and plots it on the horizontal axis. It then takes the
values in the vector, y and places it on the vertical axis (Figure 3.4).
> plot(x, y)
> a
     [,1] [,2] [,3]
[1,]   10   40   70
[2,]   20   50   80
[3,]   30   60   90
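The command that created the matrix ‘a’ is missing. One call that would produce the matrix printed above is sketched below; the use of seq() for the values is an assumption.

```r
# matrix() fills column by column by default
a <- matrix(seq(10, 90, by = 10), nrow = 3, ncol = 3)
print(a)
print(dim(a))

# byrow = TRUE fills row by row instead, giving the transpose here
a_by_row <- matrix(seq(10, 90, by = 10), nrow = 3, byrow = TRUE)
print(a_by_row)
```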
➢ Matrix Access
• Objective-1
Access the elements of a 3 × 4 matrix.
Step 1: Create a matrix, ‘mat’, 3 rows high and 4 columns wide using a vector.
> x <- 1:12
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12
> mat <- matrix (x, 3, 4)
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Step 2: Access the element present in the second row and third column of the matrix, ‘mat’.
> mat [2, 3]
[1] 8
• Objective-2
Access the third row of an existing matrix.
Step 1: Let us begin by printing the values of an existing matrix, ‘mat’
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Step 2: To access the third row of the matrix, simply provide the row number and omit the column
number.
> mat [3, ]
[1] 3 6 9 12
• Objective-3
Access the second column of an existing matrix.
Step 1: Let us begin by printing the values of an existing matrix, ‘mat’
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Step 2: To access the second column of the matrix, simply provide the column number
and omit the row number.
> mat[, 2]
[1] 4 5 6
• Objective-4
Access the second and third columns of an existing matrix.
Step 1: Let us begin by printing the values of an existing matrix, ‘mat’.
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Step 2: To access the second and third columns of the matrix, simply provide the column numbers and
omit the row number.
> mat[,2:3]
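The output of this command is not shown. The sketch below rebuilds ‘mat’ and shows the result, together with the drop argument, which keeps a single selected column as a matrix.

```r
mat <- matrix(1:12, 3, 4)

# Columns 2 and 3 come back as a 3 x 2 matrix
two_cols <- mat[, 2:3]
print(two_cols)

# A single column is simplified to a plain vector by default
one_col_vector <- mat[, 2]

# drop = FALSE keeps the matrix shape (3 rows, 1 column)
one_col_matrix <- mat[, 2, drop = FALSE]
print(dim(one_col_matrix))
```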
Plot the contour chart using the contour() function (Figure 3.5). The contour() function creates a contour
plot or adds contour lines to an existing plot. Look up the R documentation for a complete description of
the contour() function.
> contour(mat)
Contour plot
Objective-6
Create a 3D perspective plot with the persp() function (Figure 3.6). It provides a 3D wireframe plot most
commonly used to display a surface.
> persp(mat)
Objective-7
R includes some sample data sets. One of these is ‘volcano’, which is a 3D map of a dormant New
Zealand volcano. Create a contour map of the volcano dataset (Figure 3.7).
> contour(volcano)
Let us create a 3D perspective map of the sample data set, ‘volcano’ (Figure 3.8).
> persp(volcano)
Objective-8
Create a heat map of the sample dataset, ‘volcano’ (Figure 3.9).
> image(volcano)
➢ Creating Factors
School, ‘XYZ’ places students in groups, also called houses. Each group is assigned a unique color such
as ‘red’, ‘green’, ‘blue’ or ‘yellow’. HouseColor is a vector that stores the house colors of a group of
students.
> HouseColor <- c("red", "green", "blue", "yellow", "red", "green", "blue", "blue")
> types <- factor(HouseColor)
> HouseColor
[1] "red" "green" "blue" "yellow" "red" "green" "blue" "blue"
> print(HouseColor)
[1] "red" "green" "blue" "yellow" "red" "green" "blue" "blue"
> print(types)
[1] red green blue yellow red green blue blue
Levels: blue green red yellow
Levels denote the unique values. The factor above has four distinct values: ‘blue’, ‘green’, ‘red’ and
‘yellow’.
> as.integer(types)
[1] 3 2 1 4 3 2 1 1
The above output is explained as given below.
1 is the number assigned to blue.
2 is the number assigned to green.
3 is the number assigned to red.
4 is the number assigned to yellow.
> levels(types)
[1] "blue" "green" "red" "yellow"
The vector ‘NoofStudents’ stores the number of students in each house/group with 12 students in blue
house, 14 students in green house, 12 students in red house and 13 students in yellow house.
> NoofStudents <- c(12, 14, 12, 13)
> NoofStudents
[1] 12 14 12 13
The vector, ‘AverageScore’ stores the average score of the students of each house/group. 70 is the
average score for students of the blue house, 80 is the average score for students of the green house, 90 is
the average score for the students of the red house and 95 is the average score for the students of the
yellow house.
> AverageScore <- c(70, 80, 90, 95)
> AverageScore
[1] 70 80 90 95
Objective-1
Plot the relationship between NoofStudents and AverageScore (Figure 3.10).
> plot(NoofStudents, AverageScore)
The graph in Figure 3.10 displays 4 dots. Let us improve the graph by using a different
symbol to represent each house (Figure 3.11). NoofStudents and AverageScore hold one value per house in
the order of levels(types), so the symbols 1 to 4 are assigned in that same order.
> plot(NoofStudents, AverageScore, pch = seq_along(levels(types)))
To add further meaning to the graph, let us place a legend on the top right corner (Figure 3.12).
> legend("topright", levels(types), pch = seq_along(levels(types)))
LIST
A list is similar to a struct in C: its elements can be of different types.
Objective-1
Create a list in R.
Create a list, ‘emp’, having three elements: ‘EmpName’, ‘EmpUnit’ and ‘EmpSal’.
> emp <- list(EmpName = "Alex", EmpUnit = "IT", EmpSal = 55000)
Outcome
To get the elements of the list, ‘emp’ use the command given below.
> emp
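The output of the list example is not shown. The sketch below creates the same list and shows the common ways to reach its elements.

```r
emp <- list(EmpName = "Alex", EmpUnit = "IT", EmpSal = 55000)

# $ and [[ ]] both return the element itself
emp_name <- emp$EmpName
emp_sal  <- emp[["EmpSal"]]
print(emp_name)
print(emp_sal)

# Single brackets return a shorter list, not the element
unit_as_list <- emp["EmpUnit"]
print(class(unit_as_list))

# Number of elements and their names
print(length(emp))
print(names(emp))
```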
The following example loads a dataset into the workspace. The dataset ‘Orange’ is then inspected
using the summary(), names() and str() functions.
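The example itself is missing from the text. A minimal sketch using the ‘Orange’ dataset that ships with R:

```r
# Load the built-in dataset into the workspace
data(Orange)

# Column names of the dataset
orange_names <- names(Orange)
print(orange_names)

# Compact description of the structure (types and first values)
str(Orange)

# Per-column summary statistics
print(summary(Orange))

# Number of rows and columns
print(dim(Orange))
```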
The following example reads a table, ‘Hardware.csv’ into object, ‘TD’ on the R workspace. For a data
frame, TD[1] returns the first column as a data frame, TD[, 1] returns the first column as a vector and
TD[1, ] returns the first row.
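The ‘Hardware.csv’ file is not available here, so the sketch below builds a small stand-in data frame with hypothetical columns and values to show the indexing forms.

```r
# Hypothetical stand-in for the contents of 'Hardware.csv'
TD <- data.frame(
  Item  = c("Keyboard", "Mouse", "Monitor"),
  Price = c(500, 250, 7000),
  stringsAsFactors = FALSE
)

# Single index: first column, still a data frame
first_col_df <- TD[1]

# Empty row index: first column as a plain vector
first_col_vec <- TD[, 1]

# Empty column index: first row as a one-row data frame
first_row <- TD[1, ]

print(first_col_df)
print(first_col_vec)
print(first_row)
```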
Merging different datasets or objects is another common task used in most processing activities.
Analytical data processing may also require merging two or more data objects. R provides a function
merge() that merges data objects. The merge() function combines data frames by common columns or
row names. It also follows the database join operations.
merge(x, y,…)
OR
merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y =by, all = FALSE, all.x = all, all.y =
all, …)
• where x is an object or data frame, y is an object or data frame, and the by, by.x and by.y arguments
define the common columns or rows for merging.
• The all argument takes a logical value, ‘TRUE’ or ‘FALSE’. If the value is TRUE then merge() returns
the full outer join, keeping all rows of x and y in the result object. If the value is FALSE (the
default) then only rows with a match in both x and y are returned.
• The all.x argument takes a logical value, ‘TRUE’ or ‘FALSE’. If the value is TRUE then merge() returns
a left outer join: rows of x that have no matching row in y are also kept, with NA in the columns
that come from y. If the value is FALSE then only rows with data from both x and y appear in the
result object.
• The all.y argument takes a logical value, ‘TRUE’ or ‘FALSE’. If the value is TRUE then merge() returns
a right outer join: rows of y that have no matching row in x are also kept, with NA in the columns
that come from x. If the value is FALSE then only rows with data from both x and y appear in the
result object.
• The dots ‘…’ stand for other optional arguments.
• Example-1: merging data
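The merging example is missing from the text. The sketch below uses two small hypothetical data frames that share an ‘id’ column to show the four join types controlled by all, all.x and all.y.

```r
# Hypothetical data: ids 2 and 3 appear in both data frames
x <- data.frame(id = c(1, 2, 3), item = c("pen", "book", "bag"),
                stringsAsFactors = FALSE)
y <- data.frame(id = c(2, 3, 4), price = c(20, 35, 50))

# Inner join (default): only ids present in both
inner <- merge(x, y, by = "id")

# Full outer join: every id from either side
full <- merge(x, y, by = "id", all = TRUE)

# Left outer join: every row of x, NA where y has no match
left <- merge(x, y, by = "id", all.x = TRUE)

# Right outer join: every row of y, NA where x has no match
right <- merge(x, y, by = "id", all.y = TRUE)

print(inner)
print(full)
```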
Syntax:
aggregate(x, …) or aggregate(x, by, FUN, …)
where, x is an object, by argument defines the list of group elements of the specific variable
of the dataset, FUN argument is a statistic function that returns a numeric value after given statistic
operations and the dots ‘…’ define the other optional argument.
The following example reads a table, ‘Fruit_data.csv’ into object, ‘S’. The aggregate() function computes
the mean price of each type of fruit. Here by argument is list(Fruit.Name = S$Fruit.Name) that groups the
Fruit.Name columns.
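The ‘Fruit_data.csv’ file is not available here. The sketch below uses a hypothetical data frame with the column names mentioned in the text (Fruit.Name, Fruit.Price) to show the aggregate() call described.

```r
# Hypothetical stand-in for the contents of 'Fruit_data.csv'
S <- data.frame(
  Fruit.Name  = c("Apple", "Mango", "Apple", "Mango", "Banana"),
  Fruit.Price = c(100, 60, 120, 80, 40),
  stringsAsFactors = FALSE
)

# Group the prices by fruit name and take the mean of each group
mean_price <- aggregate(S$Fruit.Price,
                        by = list(Fruit.Name = S$Fruit.Name),
                        FUN = mean)
print(mean_price)
```

The result is a data frame with one row per group. The grouping column takes its name from the by list, and the computed value is placed in a column named x.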
Syntax:
tapply (x, …) or tapply(x, INDEX, FUN, …)
where, x is an object that defines the summary variable, INDEX argument defines the list of group
elements—also called group variable, FUN argument is a statistic function that returns a numeric value
after given statistic operations and the dots ‘…’ define the other optional argument.
The following example reads the table, ‘Fruit_data.csv’ into object, ‘A’. The tapply() function computes
the sum of the prices of each type of fruit. Here Fruit.Price is a summary variable and Fruit.Name is a
grouping variable. The FUN function is applied on the summary variable, Fruit.Price.
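The tapply() example is missing from the text. The sketch below uses a hypothetical data frame with the column names mentioned above, since ‘Fruit_data.csv’ is not available here.

```r
# Hypothetical stand-in for the contents of 'Fruit_data.csv'
A <- data.frame(
  Fruit.Name  = c("Apple", "Mango", "Apple", "Mango", "Banana"),
  Fruit.Price = c(100, 60, 120, 80, 40),
  stringsAsFactors = FALSE
)

# Summary variable first, grouping variable second, then the function
total_price <- tapply(A$Fruit.Price, A$Fruit.Name, sum)
print(total_price)

# The result is indexed by group name
apple_total <- total_price[["Apple"]]
print(apple_total)
```

Unlike aggregate(), which returns a data frame, tapply() returns a named array, which is convenient for lookups by group name.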
➢ Input
Input is the first step in any processing, including analytical data processing. Here, the input is dataset,
‘Fruit’. For reading the dataset into R, use read.table() or read.csv() function.
where x is the vector for which a histogram is required. freq is a logical value. If TRUE, the histogram
graphic is a representation of frequencies, the counts component of the result. If FALSE, probability
densities, the density component of the result, are plotted. main, xlab and ylab are arguments to title.
plot is a logical value. If TRUE (default), a histogram is plotted, else a list of breaks and counts is
returned.
> hist(fruits$Fruit.Price)
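The effect of the plot argument can be seen without drawing anything. The sketch below uses hypothetical prices, since the ‘Fruit’ dataset is not available here.

```r
# Hypothetical prices standing in for fruits$Fruit.Price
prices <- c(100, 60, 120, 80, 40, 90, 110)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(prices, plot = FALSE)

# Bin boundaries and the number of values in each bin
print(h$breaks)
print(h$counts)

# Every value falls in exactly one bin
total_counted <- sum(h$counts)
print(total_counted)
```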
Figure given below describes the box-and-whisker plot of the ‘Fruit’ dataset using the boxplot()
Function. A box and whisker plot summarises the group values into boxes.
boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE, notch = FALSE, outline = TRUE,
names, plot = TRUE, border = par("fg"), col = NULL, log = "",
pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5), horizontal = FALSE, add = FALSE,
at = NULL)
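To see the numbers behind a box-and-whisker plot, boxplot() can be called with plot = FALSE. The sketch below uses hypothetical prices with one extreme value.

```r
# Hypothetical prices; 500 is far above the rest
prices <- c(40, 60, 80, 90, 100, 110, 500)

# plot = FALSE returns the statistics instead of drawing the plot
b <- boxplot(prices, plot = FALSE)

# Rows of stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(b$stats)

# Values beyond range (1.5) times the box length are reported as outliers
print(b$out)
```

With range = 1.5, the default shown in the signature above, the value 500 lies beyond the upper whisker and is listed in the out component.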
Figure below describes the plot of the ‘Fruit’ dataset using the plot() function.
Comma-separated values (CSV) files and spreadsheets are used for storing small datasets. R has inbuilt
functions through which analysts can read CSV files, and packages for reading spreadsheets.
Reading CSV Files
A CSV file uses the .csv extension and stores tabular data as plain text.
The following function reads data from a CSV file:
read.csv("filename")
The read.table() function can also read data from CSV files. The syntax of the function is
read.table("filename", header, sep, …)
where the filename argument defines the path of the file to be read, the header argument takes a logical
value, TRUE or FALSE, that states whether the file has header names on the first line, the sep argument
defines the character used for separating the columns of the file and the dots ‘…’ stand for other
optional arguments.
The following example reads a CSV file, ‘Hardware.csv’ using read.csv() and read.table() function.
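The example is missing from the text and ‘Hardware.csv’ is not available here. The sketch below writes a small hypothetical CSV to a temporary file and reads it back with both functions.

```r
# Write a small hypothetical CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Item,Price", "Keyboard,500", "Mouse,250"), csv_path)

# read.csv() assumes a header line and a comma separator
via_csv <- read.csv(csv_path)

# read.table() needs both stated explicitly
via_table <- read.table(csv_path, header = TRUE, sep = ",")

print(via_csv)
print(via_table)

# Remove the temporary file
unlink(csv_path)
```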
A spreadsheet is a table that stores data in rows and columns. Many applications are available for creating
a spreadsheet. Microsoft Excel is the most popular for creating an Excel file. An Excel file uses .xlsx
extension and stores data in a spreadsheet.
In R, different packages such as gdata and xlsx provide functions for reading Excel
files. A package must be installed and loaded with library() before any of its functions can be used.
read.xlsx() is a function of the ‘xlsx’ package for reading Excel files.
read.xlsx("filename", …)
where the filename argument defines the path of the file to be read and the dots ‘…’ stand for other
optional arguments.
In R, reading or writing (importing and exporting) data using packages may create some problems like
incompatibility of versions, additional packages not loaded and so on. In order to avoid these problems, it
is better to convert files into CSV files. After converting files into CSV files, the converted file can be
read using the read.csv() function.
The following example illustrates the creation of an Excel file, ‘Softdrink.xlsx’. The ‘Software.csv’
file is the converted form of the ‘Softdrink.xlsx’ file.
2. data() Function
The data() function, called without arguments, lists all the datasets available in the loaded packages.
To load one of these datasets into the R workspace, users pass the name of the dataset to the data()
function.
Nowadays most business organizations are using the Internet and cloud services for storing data. This
online dataset is directly accessible through packages and application programming interfaces (APIs).
Different packages are available in R for reading from online datasets.
install.packages("XML") # the XML package provides htmlTreeParse()
Let us display the details of the first element of the first node.
> print(rootnode[[1]][[1]])
Let us display the details of the third element of the first node.
> print(rootnode[[1]][[3]])
Next, display the details of the third element of the second node.
> print(rootnode[[2]][[3]])
R is mainly used for statistical analytical data processing, which often needs large datasets
stored in a tabular form. Sometimes it is difficult to use the inbuilt functions of R for such
operations from the R console. To overcome this problem, graphical user interfaces (GUIs) have been
developed for R.
A graphical user interface is a graphical medium through which users interact with the language or
perform operations. Different GUIs are available for data input in R. Each GUI has its own features.
Table below describes some of the most popular R GUIs.
Business analytical processing uses database for storing large volume of information. Business
intelligence systems or business intelligence tools handle all the analytical processing of a database and
use different types of database systems. The tools support the relational database processing (RDBMS),
accessing a part of the large database, getting a summary of the database, accessing it concurrently,
managing security, constraints, server connectivity and other functionality.
At present, different types of databases are available in the market for processing. They have many inbuilt
tools, GUIs and other inbuilt functions through which database processing becomes easy.
R provides packages for accessing SQL databases such as MySQL, PostgreSQL and SQLite.
With the help of these packages, users can easily access a database since all the packages
follow the same steps for accessing data from the database.
RODBC
A sample code to illustrate the use of RMySQL for reading data from a database is given below.
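> # Import the package
> library(RMySQL)
> # Open the connection 'connectm' (fill in the credentials)
> connectm <- dbConnect(MySQL(), user = "", password = "", dbname = "", host = "")
> querym <- "Select * from lib.table where…"
> Demom <- dbSendQuery(connectm, querym)
> result <- dbFetch(Demom) # Retrieve the rows returned by the query
> dbClearResult(Demom) # Release the result set
> dbDisconnect(connectm) # Close the connection 'connectm'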
• Pentaho is one of the most famous companies in the data integration field that develops different
products and provides services for big data deployment and business analytics.
• The company provides different open source-based and enterprise-class platforms.
• Pentaho Data Integration (PDI) is one of the products of Pentaho used for accessing database and
analytical data processing. It prepares and integrates data for creating a perfect picture of any
business.
• The tool provides accurate and analytics-ready data reports to the end users, eliminates the coding
complexity and uses big data in one place.
• R Script Executor is one of the inbuilt tools of the PDI tool for establishing a relationship
between R and Pentaho Data Integration. Through R Script Executor, users can access data and
perform analytical data operations.
• If users have R in their system already, then they just need to install PDI from its official
website. The users need to configure environment variables, Spoon, DI Server, and Cluster nodes
as well.
• Although users can try PDI and transform a database using R Script Executor, PDI is a paid tool
for doing analytical data integration operation.
• The complete installation process of the R Script Executor is available at
http://wiki.pentaho.com/display/EAI/R+script+executor