SCSA4001 - R Program
Unit – I
Introduction to R - History and fundamentals of R, Installation and use of R / R Studio / R Shiny,
Installing R packages, R – Nuts and Bolts -Getting Data In and Out - Control Structures and Functions-
Loop Functions-Data Manipulation- String Operations- Matrix Operations.
BASICS OF R
History of R
Installation of R
2. Run the installer. Default settings are fine. If you do not have admin rights on your
laptop, then ask your local IT support. In that case, it is important that you also ask them
to give you full permissions to the R directories. Without this, you will not be able to
install additional packages later.
Installation of R Studio
2. Once the installation of R has completed successfully, run the RStudio installer.
3. If you do not have administrative rights on your laptop, step 2 may fail.
Ask your IT support, or download a pre-built zip archive of RStudio which
does not need installing.
4. The link for this is towards the bottom of the download page, highlighted in the image.
5. Download the appropriate archive for your system (Windows/Linux only –
the Mac version can be installed into your personal "Applications" folder
without admin rights).
6. Double clicking on the zip archive should automatically unpack it on most
Windows machines.
Installing R packages
Option 1:
Click on the tab 'Packages', then 'Install', as shown in Fig 1.3. Click
on Install to install R packages.
Fig 1.3: Install Packages
Option 2:
Tools -> Install Packages, as shown in Fig 1.4.
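Packages can also be installed directly from the R console. A minimal sketch (the package name ggplot2 is only an illustrative example):
# Install a package from CRAN (ggplot2 is just an example name)
install.packages("ggplot2")
# Load the installed package into the current session
library(ggplot2)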
R is free and open source software, allowing anyone to use and, importantly, to
modify it. R is licensed under the GNU General Public License, with copyright
held by The R Foundation for Statistical Computing.
R has no license restrictions (other than ensuring our freedom to use it at our own
discretion), and so we can run it anywhere and at any time.
Anyone can provide new packages, and the wealth of quality packages available
for R is a testament to this approach to software development and sharing.
R plays well with many other tools, importing data, for example, from CSV files,
SAS, and SPSS, or directly from Microsoft Excel, Microsoft Access, Oracle,
MySQL, and SQLite.
It can also produce graphics output in PDF, JPG, PNG, and SVG formats, and
table output for LaTeX and HTML.
write.table(): for writing tabular data to text files (e.g., CSV) or connections.
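A minimal sketch of writing a data frame to a text file and reading it back (the file name mydata.csv is only an illustration):
# Write the built-in data frame 'mtcars' to a comma-separated file
write.table(mtcars, file = "mydata.csv", sep = ",", col.names = NA)
# Read it back in as a data frame
dat <- read.table("mydata.csv", header = TRUE, sep = ",", row.names = 1)
head(dat)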
Control Structures in R
1. Conditional statements
2. Looping statements
3. Jump statements
Conditional Statements:
Examples: if, if-else, if-else if ladder, switch
If:
Syntax:
if (condition) {
Statement
}
Output:
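A minimal sketch of a simple if (the test value is only an illustration):
x <- 7
if (x > 0) {
  print("Positive number")
}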
If-Else
Syntax:
if (condition) {
Statement 1
} else {
Statement 2
}
Example: Check whether a number is odd or even.
Output:
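A minimal sketch of the odd/even check (the test value 5 is only an illustration):
x <- 5
if (x %% 2 == 0) {
  print("Even Number")
} else {
  print("Odd Number")
}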
If-Else if Ladder
Syntax:
if (condition1) {
Statement 1
} else if (condition2) {
Statement 2
} else {
Statement 3
}
Switch:
A switch statement allows a variable to be tested for equality against a list of values.
Each value is called a case, and the variable being switched on is checked for each case.
Syntax: switch(expression, "Option 1", "Option 2", "Option 3", ..., "Option N")
Looping Statements in R
There may be a situation when you need to execute a block of code several
times. A loop statement allows us to execute a statement or group of statements multiple
times.
Examples: for, while, repeat
For:
Description: A For loop is a repetition control structure that allows you to efficiently
write a loop that needs to execute a specific number of times.
Syntax:
for (value in sequence) {
Statements
}
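A minimal sketch of a for loop over a vector (the fruit names are only illustrations):
fruits <- c("apple", "banana", "cherry")
for (f in fruits) {
  print(f)
}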
While:
Description: The while loop executes the same code again and again until a stop
condition is met.
Syntax:
while (condition) {
Statement
}
Output:
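A minimal sketch of a while loop that counts from 1 to 5:
i <- 1
while (i <= 5) {
  print(i)
  i <- i + 1
}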
Repeat:
Description: The repeat loop executes the same code again and again until a stop
condition is met; the break statement inside the loop ends it.
Syntax:
repeat {
Commands
if (condition) {
break
}
}
Output:
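A minimal sketch of repeat with a break condition:
i <- 1
repeat {
  print(i)
  i <- i + 1
  if (i > 5) {
    break
  }
}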
Jump Statements:
Description: Loop control statements change execution from its normal sequence.
Examples: break, next
Break:
Terminates the loop statement and transfers execution to the statement immediately
following the loop.
Syntax: break
Output:
Next:
The next statement skips the rest of the current iteration and moves the loop on to its next iteration.
Syntax: next
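A minimal sketch showing next and break inside one for loop:
for (i in 1:10) {
  if (i == 3) {
    next    # skip the value 3 and continue with the next iteration
  }
  if (i == 6) {
    break   # stop the loop entirely when i reaches 6
  }
  print(i)  # prints 1, 2, 4, 5
}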
Functions in R
A function is a set of statements organized together to perform a specific task. R has a
large number of in-built functions and the user can create their own functions.
Syntax:
function_name <- function(arg1, arg2, ...) {
Function body
}
Function Components:
Return Value − The return value of a function is the last expression in the
function body to be evaluated.
Built-in Functions
R has many in-built functions which can be directly called in the program
without defining them first.
E.g.: sum(), seq(), abs(), round().
Example:
Output:
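A minimal sketch using the built-in functions named above:
print(sum(41:68))         # sum of a sequence of numbers
print(seq(32, 44))        # create a sequence
print(abs(-23.5))         # absolute value
print(round(3.14159, 2))  # round to 2 decimal places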
User Defined Function:
We can create user-defined functions in R. They are specific to what a user wants and,
once created, they can be used like the built-in functions. Below is an example of how
a function is created and used.
Output:
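A minimal sketch of creating and calling a user-defined function (the name new.function is only an illustration):
# Function that prints the squares of numbers from 1 to n
new.function <- function(n) {
  for (i in 1:n) {
    print(i^2)
  }
}
# Call the function
new.function(4)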
Function with Arguments:
The arguments to a function call can be supplied in the same sequence as defined in
the function or they can be supplied in a different sequence but assigned to the names
of the arguments.
Example:
Output:
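A minimal sketch of supplying arguments by position and by name (the function is only an illustration):
new.function <- function(a, b, c) {
  print(a * b + c)
}
new.function(5, 3, 11)              # by position: prints 26
new.function(a = 11, b = 5, c = 3)  # by name: prints 58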
We can define the value of the arguments in the function definition and call the
function without supplying any argument to get the default result. But we can also call
such functions by supplying new values of the argument and get non default result.
Example:
Output:
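A minimal sketch of a function with default argument values (the values are only illustrations):
new.function <- function(a = 3, b = 6) {
  print(a * b)
}
new.function()       # uses the defaults: prints 18
new.function(9, 5)   # supplies new values: prints 45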
LOOPING FUNCTIONS IN R
lapply:
Syntax: lapply(X, FUN, ...)
X = a list
FUN = a function
... = other arguments
Example:
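A minimal sketch of lapply over a list (the list is only an illustration):
data <- list(a = 1:5, b = c(10, 20, 30))
lapply(data, mean)   # returns a list containing the mean of each element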
sapply:
sapply works like lapply but simplifies the result to a vector or matrix where possible.
Syntax: sapply(X, FUN, ...)
X = a list
FUN = a function
... = other arguments
Example:
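A minimal sketch of sapply on the same list; the result is simplified to a named vector:
data <- list(a = 1:5, b = c(10, 20, 30))
sapply(data, mean)   # returns a named numeric vector instead of a list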
apply:
apply applies a function over the rows or columns of a matrix or array.
Syntax: apply(X, MARGIN, FUN, ...)
X = an array or matrix
MARGIN = 1 to apply over rows, 2 to apply over columns
FUN = a function
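A minimal sketch of apply over the rows and columns of a matrix:
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)   # row sums
apply(m, 2, sum)   # column sums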
mapply:
mapply is a multivariate version of sapply; it applies a function to the first elements of each argument, the second elements, and so on.
Example:
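A minimal sketch of mapply over two vectors:
# Add the corresponding elements of two vectors
mapply(function(a, b) a + b, 1:4, 5:8)   # returns 6 8 10 12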
Matrices in R
Matrices are the R objects in which the elements are arranged in a two-dimensional
rectangular layout. We use matrices containing numeric elements to be used in
mathematical calculations.
Syntax: matrix(data, nrow, ncol, byrow, dimnames)
data is the input vector which becomes the data elements of the matrix.
nrow is the number of rows and ncol is the number of columns to be created.
byrow is a logical value. If TRUE, the input vector elements are
arranged by row.
Matrix Creation:
(i) Arrange elements sequentially by row. Example:
(ii) Arrange elements sequentially by column.
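A minimal sketch of creating a matrix by row and by column:
# Elements arranged sequentially by row
M <- matrix(c(1:6), nrow = 2, byrow = TRUE)
print(M)
# Elements arranged sequentially by column (the default)
N <- matrix(c(1:6), nrow = 2, byrow = FALSE)
print(N)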
Matrix Operations:
Various mathematical operations are performed on matrices using the R operators.
The result of the operation is also a matrix. The dimensions (number of rows and
columns) should be the same for the matrices involved in the operation.
Matrix Addition:
Matrix Subtraction
Matrix Multiplication(Elementwise)
Matrix Multiplication(Real)
Matrix Division:
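A minimal sketch covering the operations listed above on two 2x2 matrices (the values are only illustrations):
A <- matrix(c(3, 9, -1, 4), nrow = 2)
B <- matrix(c(5, 2, 1, 9), nrow = 2)
A + B      # matrix addition
A - B      # matrix subtraction
A * B      # element-wise multiplication
A %*% B    # real (algebraic) matrix multiplication
A / B      # element-wise division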
String Operations in R
String:
Any value written within a pair of single quotes or double quotes in R is treated as a string.
Example: S = "Hello"
S1 = 'hai'
String Manipulation:
Concatenating Strings - paste() function:
Many strings in R are combined using the paste() function. It can take any number of
arguments to be combined together.
Syntax: paste(..., sep = " ", collapse = NULL)
Example
Counting the number of characters in a string - nchar() function:
Syntax:
nchar(x)
Example:
Changing the case - toupper() & tolower() functions
Syntax
toupper(x)
tolower(x)
Example:
Extracting parts of a string - substring() function:
Syntax
substring(x, first, last)
Replacing parts of a string - sub() and gsub() functions:
These are replacement functions, which replace the occurrence of a substring with another
substring.
Example:
Searching for a pattern - grep() function:
grep(pattern, x, value = FALSE) returns an integer vector of the indices of the elements of x that
yielded a match.
Example:
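A minimal sketch combining the string functions above (the sample strings are only illustrations):
a <- "Hello"
b <- "World"
paste(a, b, sep = " ")                  # "Hello World"
nchar("Programming")                    # 11
toupper(a)                              # "HELLO"
tolower(b)                              # "world"
substring("Extract", 1, 3)              # "Ext"
gsub("World", "R", "Hello World")       # "Hello R"
grep("o", c("Hello", "hi", "World"))    # 1 3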
Unit – II
R Data interfaces - CSV Files, XML files, Web Data - Data Preprocessing: Missing Values, Principal
Component Analysis - Data Visualization – Charts & Graphs - Pie Chart, Bar Chart, Box plot, Histogram,
Line graph, Scatter Plot.
R DATA INTERFACES
We can read data from files stored outside the R environment. We can also write data into files which will be
stored and accessed by the operating system. R can read and write various file formats like CSV, Excel, XML,
etc.
Example:
CSV Files
Input as CSV
The CSV file is a text file in which the values in the columns are separated by a comma. Let's consider the
following data present in the file named input.csv. Create this file using Windows Notepad by copying
and pasting this data. Save the file as input.csv using the Save As "All files (*.*)" option in Notepad.
Reading a CSV File
The read.csv() function is used to read a CSV file available in your current working directory.
Example:
R can create a CSV file from an existing data frame. The write.csv() function is used to create the CSV file. This
file gets created in the working directory.
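A minimal sketch of reading input.csv and writing a new CSV file (the file names are only illustrations):
# Read the CSV file into a data frame
data <- read.csv("input.csv")
print(data)
print(is.data.frame(data))
# Write the data frame out to a new CSV file
write.csv(data, "output.csv", row.names = FALSE)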
XML files
XML is a file format which shares both the file format and the data on the World Wide Web, intranets, and
elsewhere using standard ASCII text. It stands for Extensible Markup Language (XML). Similar to HTML,
it contains markup tags. But unlike HTML, where the markup tags describe the structure of the page, in XML the
markup tags describe the meaning of the data contained in the file.
Read an XML file in R using the "XML" package. This package can be installed using the following command:
install.packages("XML")
Input Data
Create an XML file by copying the below data into a text editor like Notepad. Save the file with a .xml extension,
choosing the file type as "All files (*.*)".
Reading an XML File
The XML file is read by R using the function xmlParse(). It is stored as a list in R.
XML to Data Frame
To handle the data effectively in large files, we read the data in the XML file as a data frame. Then we process
the data frame for data analysis.
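A minimal sketch, assuming a file named input.xml exists in the working directory:
library("XML")
library("methods")
# Parse the XML file into an internal document
result <- xmlParse(file = "input.xml")
print(result)
# Convert the XML data directly into a data frame
xmldataframe <- xmlToDataFrame("input.xml")
print(xmldataframe)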
WEB DATA
Many websites provide data for consumption by their users. For example, the World Health
Organization (WHO) provides reports on health and medical information in the form of CSV, TXT and XML
files. Using R programs, we can programmatically extract specific data from such websites. Some packages
in R which are used to scrape data from the web are "RCurl", "XML", and "stringr". They are used to
connect to the URLs, identify required links for the files and download them to the local environment.
Install R Packages
The following packages are required for processing the URLs and links to the files. If they are not
available in your R environment, you can install them using the following commands:
install.packages("RCurl")
install.packages("XML")
install.packages("stringr")
Input Data
We will visit the URL of the weather data and download the CSV files using R for the year 2015.
Example
We will use the function getHTMLLinks() to gather the URLs of the files. Then we will use the
function download.file() to save the files to the local system. As we will be applying the same code again
and again for multiple files, we will create a function to be called multiple times. The filenames are passed
as parameters in the form of an R list object to this function.
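A minimal sketch of the setup, assuming the JCMB weather-station page referenced below is used:
library(RCurl)
library(XML)
library(stringr)
# The page that lists the data files
url <- "http://www.geos.ed.ac.uk/~weather/jcmb_ws/"
# Gather all HTML links present on the page
links <- getHTMLLinks(url)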
# Identify only the links which point to the JCMB 2015 files
filenames <- links[str_detect(links, "JCMB_2015")]
# Create a function to download a file by combining the base URL and the filename
downloadcsv <- function(mainurl, filename) {
  filedetails <- str_c(mainurl, filename)
  download.file(filedetails, filename)
}
# Now apply the lapply function and save the files into the current R working directory
lapply(filenames, downloadcsv, mainurl = "http://www.geos.ed.ac.uk/~weather/jcmb_ws/")
Pie Chart
A pie chart is a representation of values as slices of a circle with different colors. The slices are labeled and
the numbers corresponding to each slice are also represented in the chart.
In R the pie chart is created using the pie() function which takes positive numbers as a vector input. The
additional parameters are used to control labels, color, title etc.
syntax
The basic syntax for creating a pie chart using R is −
pie(x, labels, radius, main, col, clockwise)
x is a vector containing the numeric values used in the pie chart.
labels is used to give the description to the slices.
radius indicates the radius of the circle of the pie chart (value between
−1 and +1).
main indicates the title of the chart, and col gives the color palette.
clockwise is a logical value indicating if the slices are drawn clockwise or
anti-clockwise.
Example:
Output:
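A minimal sketch (the city names and values are only illustrations):
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Plot the pie chart with a title and rainbow colours
pie(x, labels, main = "City pie chart", col = rainbow(length(x)))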
3D Pie Chart
A pie chart with 3 dimensions can be drawn using additional packages. The package plotrix has a function
called pie3D() that is used for this.
Example:
Output:
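A minimal sketch, assuming the plotrix package is installed (values are only illustrations):
library(plotrix)
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
pie3D(x, labels = labels, explode = 0.1, main = "Pie chart of cities")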
BAR CHART
A bar chart represents data in rectangular bars with the length of the bar proportional to the value of the
variable. R uses the function barplot() to create bar charts. R can draw both vertical and horizontal bars in
the bar chart. In a bar chart, each of the bars can be given a different color.
Syntax
The basic syntax to create a bar chart in R is −
barplot(H, xlab, ylab, main, names.arg, col)
The following script will create and save the bar chart in the current R working directory.
Output:
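A minimal sketch (the month names and values are only illustrations):
H <- c(7, 12, 28, 3, 41)
M <- c("Mar", "Apr", "May", "Jun", "Jul")
# Plot the bar chart with axis labels and a title
barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue",
        col = "blue", main = "Revenue chart")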
Group Bar Chart and Stacked Bar Chart
We can create a bar chart with groups of bars and stacks in each bar by using a matrix as input values.
More than two variables are represented as a matrix which is used to create the group bar chart and stacked bar
chart.
Example:
Output:
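A minimal sketch of a grouped bar chart built from a matrix (the regions, months and values are only illustrations; omit beside = TRUE for a stacked chart):
colors <- c("green", "orange", "brown")
months <- c("Mar", "Apr", "May", "Jun", "Jul")
regions <- c("East", "West", "North")
values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11),
                 nrow = 3, ncol = 5, byrow = TRUE)
barplot(values, names.arg = months, xlab = "Month", ylab = "Revenue",
        main = "Total revenue", col = colors, beside = TRUE)
legend("topleft", regions, fill = colors)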
HISTOGRAM
A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is similar to a
bar chart, but the difference is that it groups the values into continuous ranges. Each bar in a histogram represents
the number of values present in that range.
R creates a histogram using the hist() function. This function takes a vector as an input and uses some more
parameters to plot histograms.
Syntax
The basic syntax for creating a histogram using R is –
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Following is the description of the parameters used −
v is a vector containing the numeric values used in the histogram.
main indicates the title of the chart.
col is used to set the color of the bars, and border the border color of each bar.
xlab is used to describe the x-axis; xlim and ylim specify the range of values on the x and y axes.
breaks is used to mention the width of each bar.
Example
A simple histogram is created using the input vector, label, col and border parameters.
The script given below will create and save the histogram in the current R working directory.
Output:
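A minimal sketch (the input values are only illustrations):
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)
# Create the histogram with green bars and a red border
hist(v, xlab = "Weight", col = "green", border = "red")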
Example:
Output:
LINE GRAPH
A line chart is a graph that connects a series of points by drawing line segments between them. These points
are ordered by one of their coordinates (usually the x-coordinate). Line charts are usually used in
identifying trends in data.
The plot() function in R is used to create the line graph.
Syntax
The basic syntax to create a line chart in R is −
plot(v,type,col,xlab,ylab)
v is a vector containing the numeric values.
type takes the value "p" to draw only the points, "l" to draw only the lines and "o"
to draw both points and lines.
xlab is the label for the x axis, ylab is the label for the y axis, and col gives the colour.
Example
A simple line chart is created using the input vector and the type parameter as "o". The below
script will create and save a line chart in the current R working directory. The features of the
line chart can be expanded by using additional parameters: we add color to the points and
lines, give a title to the chart and add labels to the axes.
Output:
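A minimal sketch (the values are only illustrations):
v <- c(7, 12, 28, 3, 41)
# Draw both points and lines, in red, with a title and axis labels
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
     main = "Rain fall chart")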
Multiple Lines in a Line Chart
More than one line can be drawn on the same chart by using the lines() function.
After the first line is plotted, the lines() function can use an additional vector as input to draw the
second line in the chart.
Example:
Output:
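A minimal sketch of adding a second line with lines() (the second vector is only an illustration):
v <- c(7, 12, 28, 3, 41)
t <- c(14, 7, 6, 19, 3)
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
     main = "Rain fall chart")
lines(t, type = "o", col = "blue")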
Scatter Plots
Scatter plots show many points plotted in the Cartesian plane. Each point represents the values of two
variables. One variable is chosen for the horizontal axis and another for the vertical axis.
Syntax
The basic syntax for creating a scatter plot in R is −
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Example
We use the data set "iris" available in the R environment to create a basic scatter plot. Let's use the
columns "Sepal.Length" and "Sepal.Width" in iris.
Example:
Output:
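A minimal sketch using the iris columns mentioned above:
plot(x = iris$Sepal.Length, y = iris$Sepal.Width,
     xlab = "Sepal Length", ylab = "Sepal Width",
     main = "Sepal Length vs Sepal Width")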
Scatterplot Matrices
When we have more than two variables and we want to find the correlation between one variable and
the remaining ones, we use a scatterplot matrix. We use the pairs() function to create matrices of
scatterplots.
Syntax
The basic syntax for creating scatterplot matrices in R is −
pairs(formula, data)
formula represents the series of variables used in pairs.
data represents the data set from which the variables will be taken.
Example
Each variable is paired up with each of the remaining variables. A scatterplot is plotted for each pair.
Output:
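A minimal sketch using pairs() on the four numeric iris measurements:
pairs(~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
      data = iris, main = "Scatterplot Matrix")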
Box Plot
Boxplots are a measure of how well distributed the data in a data set is. It divides the data set into three
quartiles. This graph represents the minimum, maximum, median, first quartile and third quartile in the data
set. It is also useful in comparing the distribution of data across data sets by drawing boxplots for each of
them.
Syntax
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
x is a vector or a formula.
data is the data frame.
notch is a logical value. Set as TRUE to draw a notch.
varwidth is a logical value. Set as TRUE to draw the width of the box proportionate to the
sample size.
names are the group labels which will be printed under each boxplot.
main is used to give a title to the graph.
Example 1:
Output:
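A minimal sketch using the built-in mtcars data set:
# Boxplot of mileage (mpg) grouped by number of cylinders (cyl)
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of Cylinders", ylab = "Miles Per Gallon",
        main = "Mileage Data")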
Example 2: boxplot(iris$Petal.Length, iris$Petal.Width)
Output:
Unit – III
There are several standard statistical models to fit the data using R. The key to modeling in R is
the formula object, which provides a shorthand method to describe the exact model to be fit to
the data. Modeling functions in R typically require a formula object as an argument. The
modeling functions return a model object that contains all the information about the fit.
Statistical analysis in R is performed by using many in-built functions. Most of these functions are
part of the R base package. These functions take an R vector as an input along with the arguments
and give the result.
Descriptive Statistics
Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can
be either a representation of the entire population or a sample of it. Descriptive statistics are broken
down into measures of central tendency and measures of variability, or spread. Measures of central
tendency include the mean, median and mode, while measures of variability include the standard
deviation or variance, and the minimum and maximum values.
Measures of central tendency describe the center position of a distribution for a data set. A
person analyzes the frequency of each data point in the distribution and describes it using the
mean, median or mode, which measure the most common patterns of the data set being analyzed.
Measures of variability, or the measures of spread, aid in analyzing how spread-out the distribution
is for a set of data. For example, while the measures of central tendency may give a person the
average of a data set, it doesn't describe how the data is distributed within the set. So, while the
average of the data may be 65 out of 100, there can still be data points at both 1 and 100.
Measures of variability help communicate this by describing the shape and spread of the data set.
Range, quartiles, absolute deviation and variance are all examples of measures ofvariability.
Mean
It is calculated by taking the sum of the values and dividing by the number of values in a data
series. The basic syntax for calculating the mean in R is mean(x, trim = 0, na.rm = FALSE), where:
trim -> is used to drop some observations from both ends of the sorted vector.
na.rm ->is used to remove the missing values from the input vector.
Median
The middle-most value in a data series is called the median. The median() function is used in R
to calculate this value. The basic syntax is median(x, na.rm = FALSE), where:
na.rm ->is used to remove the missing values from the input vector.
Maximum:
It represents the maximum value in the given data set. The basic syntax is max(x, na.rm = FALSE), where:
na.rm ->is used to remove the missing values from the input vector.
Minimum:
It represents the minimum value in the given data set. The basic syntax for calculating the minimum in R is min(x, na.rm = FALSE), where:
na.rm ->is used to remove the missing values from the input vector.
Range:
The difference between the maximum and minimum data entries in the set:
Range = (Max. data entry) – (Min. data entry). The basic syntax for calculating the range in R is
range(x, na.rm = FALSE), which returns the minimum and maximum of the vector, where:
na.rm -> is used to remove the missing values from the input vector.
Example:
# Create a vector.
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)
# Find the mean.
print(paste("Mean:", mean(x)))
# Find the median.
print(paste("Median:", median(x)))
# Find the standard deviation.
print(paste("Standard Deviation:", sd(x)))
# Find the range (minimum and maximum).
print(paste("Range:", paste(range(x), collapse = " ")))
Output:
Regression
Linear Regression
Linear regression establishes a relationship model between two variables: a predictor and a response.
In linear regression these two variables are related through an equation, where the
exponent (power) of both these variables is 1. Mathematically, a linear
relationship represents a straight line when plotted as a graph. A non-linear
relationship, where the exponent of any variable is not equal to 1, creates a curve.
The general equation is y = ax + b, where y is the response variable, x is the predictor variable, and a and b are constants called coefficients.
Steps to Establish a Regression
A simple example of regression is predicting the weight of a person when his height
is known. To do this we need to have the relationship between height and weight of
a person. The steps to create the relationship are:
o Gather a sample of observed values of height and weight.
o Create a relationship model using the lm() function in R.
o Find the coefficients from the model created and create the
mathematical equation using these.
Input Data
Below is the sample data representing the observations −
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the
response variable.
Syntax
The basic syntax for lm() in linear regression is lm(formula, data), where formula is a symbol presenting the relation between x and y, and data is the vector on which the formula will be applied.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y ~ x)
print(relation)
Output:
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
   -38.4551       0.6746
Get the Summary of the Relationship
relation <- lm(y ~ x)
print(summary(relation))
Output:
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-6.3002 -1.6629  0.0412  1.8944  3.9775

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509    8.04901  -4.778  0.00139 **
x             0.67461    0.05191  12.997 1.16e-06 ***

Residual standard error: 3.253 on 8 degrees of freedom
Multiple R-squared: 0.9548,	Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF,  p-value: 1.164e-06
predict() Function
Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)
o object is the formula which is already created using the lm() function.
o newdata is the vector containing the new value for predictor variable.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
# Predict the weight for a new height of 170.
result <- predict(relation, data.frame(x = 170))
print(result)
When we execute the above code, it produces the following result −
1
76.22869
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
png(file = "linearregression.png")
plot(x, y, main = "Height & Weight Regression", xlab = "Height in cm", ylab = "Weight in kg", pch = 16, col = "blue")
abline(relation)
dev.off()
Multiple Regression
Multiple regression is an extension of linear regression to relationships involving more than two variables.
We create the regression model using the lm() function in R. The model determines
the value of the coefficients using the input data. Next we can predict the value of
the response variable for a given set of predictor variables using these coefficients.
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in multiple regression is −
lm(y ~ x1+x2+x3...,data)
# Use the columns mpg, disp, hp and wt from the mtcars data set.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)
print(model)
Output:
Call:
lm(formula = mpg ~ disp + hp + wt, data = input)
Coefficients:
(Intercept)         disp           hp           wt
      37.15    -0.000937    -0.031157      -3.8008
Create Equation for Regression Model
Based on the above intercept and coefficient values, we create the mathematical equation.
Y = a + b_disp*x1 + b_hp*x2 + b_wt*x3, i.e.
Y = 37.15+(-0.000937)*x1+(-0.0311)*x2+(-3.8008)*x3
For a car with disp = 221, hp = 102 and wt = 2.91 the predicted mileage is
Y = 37.15+(-0.000937)*221+(-0.0311)*102+(-3.8008)*2.91 = 22.7104
R - Logistic Regression
Logistic regression is a regression model in which the response variable (dependent variable)
has categorical values such as True/False or 0/1. It actually measures the probability of a binary
response as the value of the response variable, based on the mathematical equation relating it to the
predictor variables.
y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))
The function used to create the regression model is the glm() function.
The basic syntax for glm() function in logistic regression is −
glm(formula,data,family)
Example
The in-built data set "mtcars" describes different models of a car with their various
engine specifications. In "mtcars" data set, the transmission mode (automatic or
manual) is described by the column am which is a binary value (0 or 1). We can
create a logistic regression model between the columns "am" and 3 other columns -
hp, wt and cyl.
input <- mtcars[, c("am", "cyl", "hp", "wt")]
Create Regression Model
We use the glm() function to create the regression model and get its summary for analysis.
am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
print(summary(am.data))
When we execute the above code, it produces the following result −
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.17272  -0.14907  -0.01464   0.07878   1.27296

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288    8.11637   2.428   0.0152 *
cyl          0.48760    1.07162   0.455   0.6491
hp           0.03259    0.01886   1.728   0.0840 .
wt          -9.14947    4.15332  -2.203   0.0276 *
Conclusion
In the summary as the p-value in the last column is more than 0.05 for the variables
"cyl" and "hp", we consider them to be insignificant in contributing to the value
of the variable "am". Only weight (wt) impacts the "am" value in this regression
model.
DISTRIBUTION
Normal Distribution:
The normal distribution is the most widely known and used of all distributions.
Because the normal distribution approximates many natural phenomena so well, it
has developed into a standard of reference for many probability problems.
o The curve is symmetric at the center (i.e. around the mean, μ).
o Exactly half of the values are to the left of center and exactly half
the values are to theright.
heights of people
errors in measurements
blood pressure
marks on a test
R has four in-built functions to generate the normal distribution: dnorm(), pnorm(), qnorm() and rnorm(). They are described below.
x is a vector of numbers.
p is a vector of probabilities.
n is the number of observations (sample size).
mean is the mean value of the sample data. Its default value is zero.
sd is the standard deviation. Its default value is 1.
dnorm()
This function gives height of the probability distribution at each point for a given mean and
standard deviation.
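A minimal sketch of the four functions (mean 2.5 and sd 0.5 are only illustrations):
dnorm(c(2, 2.5, 3), mean = 2.5, sd = 0.5)  # density at a set of points
pnorm(2, mean = 2.5, sd = 0.5)             # cumulative probability below 2
qnorm(0.25, mean = 2.5, sd = 0.5)          # value with cumulative probability 0.25
rnorm(10, mean = 2.5, sd = 0.5)            # ten random draws from the distribution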
Binomial Distribution
The binomial distribution model deals with finding the probability of success of an event which has only two possible outcomes in a series of experiments. It assumes:
The number of observations or trials is fixed. In other words, you can only
figure out the probability of something happening if you do it a certain number of
times. This is common sense — if you toss a coin once, your probability of
getting a tail is 50%. If you toss a coin 20 times, your probability of getting at
least one tail is very, very close to 100%.
Each observation or trial is independent. In other words, none of your trials have
an effect on the probability of the next trial.
The probability of success (tails, heads, fail or pass) is exactly the same from one
trial to another.
The binomial probability is calculated as b(x; n, P) = nCx * P^x * (1 - P)^(n - x), where:
b = binomial probability
x = total number of "successes" (pass or fail, heads or tails, etc.)
P = probability of a success on an individual trial
n = number of trials
Example:
R has four in-built functions to generate the binomial distribution: dbinom(), pbinom(), qbinom() and rbinom(). They are described below.
x is a vector of numbers.
p is a vector of probabilities.
n is the number of observations.
size is the number of trials, and prob is the probability of success of each trial.
qbinom()
This function takes the probability value and gives a number whose cumulative
value matches the probability value.
Example and output:
rbinom()
This function generates the required number of random values of a given probability from a given sample.
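A minimal sketch of the four functions for 50 coin tosses with success probability 0.5:
dbinom(25, size = 50, prob = 0.5)    # probability of exactly 25 heads
pbinom(25, size = 50, prob = 0.5)    # probability of 25 or fewer heads
qbinom(0.25, size = 50, prob = 0.5)  # number of heads at cumulative probability 0.25
rbinom(8, size = 50, prob = 0.5)     # eight random counts of heads out of 50 tosses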
TIME SERIES ANALYSIS
A time series is a series of data points in which each data point is associated with a
timestamp. A simple example is the price of a stock in the stock market at different
points of time on a given day. Another example is the amount of rainfall in a
region in different months of the year. The R language uses many functions to create,
manipulate and plot time series data. The data for a time series is stored in an
R object called a time-series object. It is also an R data object, like a vector or data
frame. The time series object is created by using the ts() function.
Syntax
The basic syntax for the ts() function in time series analysis is −
timeseries.object.name <- ts(data, start, end, frequency)
data is a vector or matrix containing the values used in the time series.
start specifies the start time for the first observation in the time series.
end specifies the end time for the last observation in the time series.
frequency specifies the number of observations per unit time.
Example:
Consider the annual rainfall details at a place starting from January 2012. We create an R time
series object for a period of 12 months and plot it.
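A minimal sketch, assuming twelve monthly rainfall values starting in January 2012 (the values are only illustrations):
rainfall <- c(799, 1174.8, 865.1, 1334.6, 635.4, 918.5,
              685.5, 998.6, 784.2, 985, 882.8, 1071)
# Convert the vector into a monthly time series starting January 2012
rainfall.timeseries <- ts(rainfall, start = c(2012, 1), frequency = 12)
print(rainfall.timeseries)
plot(rainfall.timeseries)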
Multiple Time Series
We can plot multiple time series in one chart by combining both the series into a matrix.
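A minimal sketch, assuming a second rainfall series for the same months (the values are only illustrations):
rainfall1 <- c(799, 1174.8, 865.1, 1334.6, 635.4, 918.5, 685.5, 998.6, 784.2, 985, 882.8, 1071)
rainfall2 <- c(655, 1306.9, 1323.4, 1172.2, 562.2, 824, 822.4, 1265.5, 799.6, 1105.6, 1106.7, 1337.8)
# Combine both series into a matrix and build one multi-series ts object
combined <- matrix(c(rainfall1, rainfall2), nrow = 12)
rainfall.timeseries <- ts(combined, start = c(2012, 1), frequency = 12)
plot(rainfall.timeseries, main = "Multiple Time Series")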
ANOVA
We are often interested in determining whether the means from more than two
populations or groups are equal or not. To test whether the difference in means
is statistically significant we can perform analysis of variance (ANOVA) using the
R function aov(). If the ANOVA F-test shows there is a significant difference in
means between the groups we may want to perform multiple comparisons between
all pair-wise means to determine how they differ.
Analysis of Variance
The first step in our analysis is to graphically compare the means of the variable
of interest across groups. It is possible to create side-by-side boxplots of
measurements organized in groups using the function plot(), as shown below.
Example:
A drug company tested three formulations of a pain relief medicine for migraine
headache sufferers. For the experiment 27 volunteers were selected and 9 were
randomly assigned to one of three drug formulations. The subjects were instructed
to take the drug during their next migraine headache episode and to report their
pain on a scale of 1 to 10 (10 being most pain).
Drug A: 4 5 4 3 2 4 3 4 4
Drug B: 6 8 4 5 4 6 5 8 6
Drug C: 6 7 6 6 7 5 6 5 5
To make side-by-side boxplots of the variable pain grouped by the variable drug we must first read in the
data in the appropriate format.
> pain = c(4, 5, 4, 3, 2, 4, 3, 4, 4, 6, 8, 4, 5, 4, 6, 5, 8, 6, 6, 7, 6, 6, 7, 5, 6, 5, 5)
> drug = c(rep("A",9), rep("B",9), rep("C",9))
>migraine = data.frame(pain,drug)
Note the command rep("A",9) constructs a list of nine A's in a row. The variable drug is therefore a list of
length 27 consisting of nine A's followed by nine B's followed by nine C's. If we print the data frame
migraine we can see the format the data should be in in order to make side-by-side boxplots and perform
ANOVA (note the output is cut off between observations 6-25 for space purposes).
From the boxplots it appears that the mean pain for drug A is lower than that for drugs B and C. Next, the
R function aov() can be used for fitting ANOVA models. The general form is aov(response ~ factor, data = data.frame).
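A minimal sketch of the boxplots and the ANOVA fit for the migraine data created above:
# Side-by-side boxplots of pain by drug (drug treated as a factor)
plot(pain ~ factor(drug), data = migraine)
# Fit the one-way ANOVA model and print the ANOVA table
results <- aov(pain ~ drug, data = migraine)
summary(results)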
Studying the output of the ANOVA table above we see that the F-statistic is 11.91 with a p-
value equal to 0.0003. We clearly reject the null hypothesis of equal means for all three drug
groups.
Unit – IV
Machine Learning in R - Classification: Decision Trees, Random Forest, SVM – Clustering - Association Rule
Mining - Outlier Detection.
Machine learning
Machine learning is an application of artificial intelligence (AI) that provides systems the ability
to automatically learn and improve from experience without being explicitly programmed.
Machine learning focuses on the development of computer programs that can access data and
use it to learn for themselves.
The process of learning begins with observations or data, such as examples, direct experience,
or instruction, in order to look for patterns in data and make better decisions in the future based
on the examples that we provide. The primary aim is to allow computers to learn
automatically without human intervention or assistance and adjust actions accordingly.
Supervised machine learning algorithms can apply what has been learned in the past to new
data using labeled examples to predict future events. Starting from the analysis of a known
training dataset, the learning algorithm produces an inferred function to make predictions about
the output values. The system is able to provide targets for any new input after sufficient
training. The learning algorithm can also compare its output with the correct, intended output
and find errors in order to modify the model accordingly.
In contrast, unsupervised machine learning algorithms are used when the information used to
train is neither classified nor labeled. Unsupervised learning studies how systems can infer a
function to describe a hidden structure from unlabeled data. The system doesn’t figure out the
right output, but it explores the data and can draw inferences from datasets to describe hidden
structures from unlabeled data.
Reinforcement machine learning algorithms form a learning method that interacts with its
environment by producing actions and discovering errors or rewards. Trial-and-error search and
delayed reward are the most relevant characteristics of reinforcement learning. This method
allows machines and software agents to automatically determine the ideal behavior within a
specific context in order to maximize its performance. Simple reward feedback is required for
the agent to learn which action is best; this is known as the reinforcement signal.
Classification
Classification is a data mining function that assigns items in a collection to target categories or
classes. The goal of classification is to accurately predict the target class for each case in the
data. For example, a classification model could be used to identify loan applicants as low,
medium, or high credit risks.
Following are examples of cases where the data analysis task is
classification −
A bank loan officer wants to analyze the data in order to know which customers (loan
applicants) are risky and which are safe.
A marketing manager wants to analyze customer data to predict whether a customer with a given profile will buy a new product (yes or no).
In both of the above examples, a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for loan application data and yes or no for marketing data.
The classifier is built from the training set made up of database tuples and their
associated class labels.
Each tuple that constitutes the training set is associated with a category or class. These
tuples can also be referred to as samples, objects or data points.
Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to estimate the
accuracy of classification rules. The classification rules can be applied to the new data tuples if
the accuracy is considered acceptable.
Decision Tree
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf
node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a customer
at a company is likely to buy a computer or not. Each internal node represents a test on an
attribute. Each leaf node represents a class.
The benefits of having a decision tree are as follows −
Use the below command in R console to install the package. You also have to install the
dependent packages if any.
install.packages("party")
The package "party" has the function ctree() which is used to create and analyze decision tree.
Syntax
The basic syntax for creating a decision tree in R is −
ctree(formula, data)
Input Data
We will use the R in-built data set named readingSkills to create a decision tree. It describes
someone's readingSkills score if we know the variables "age", "shoeSize", "score" and
whether the person is a native speaker or not.
> print(head(readingSkills))
nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
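The construction step itself is not shown above; a minimal sketch, assuming a random 70/30 train/test split and the object names dtree, train and test used in the prediction output below:
library(party)
set.seed(123)
idx <- sample(1:nrow(readingSkills), 0.7 * nrow(readingSkills))
train <- readingSkills[idx, ]
test  <- readingSkills[-idx, ]
# Build the decision tree predicting nativeSpeaker and plot it
dtree <- ctree(nativeSpeaker ~ age + shoeSize + score, data = train)
plot(dtree)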
> pred=predict(dtree,test)
> pred
[1] yes yes yes yes no no no no yes no yes
[12] no no no no yes yes yes no no no no
[23] yes yes no no no no yes no yes yes no
[34] yes yes yes no yes yes no no yes no yes
[45] yes yes no yes yes yes no yes yes yes yes
[56] no yes yes no yes
Levels: no yes
> table(pred)
pred
no yes
27 33
> acc=addmargins(table(pred,test$nativeSpeaker))
> acc
pred   no yes Sum
  no   27   0  27
  yes   3  30  33
  Sum  30  30  60
> value=57/60
> value
[1] 0.95
Conclusion:
The constructed decision tree is about 95% accurate on the test data.
Random Forest
In the random forest approach, a large number of decision trees are created. Every observation
is fed into every decision tree. The most common outcome for each observation is used as the
final output. A new observation is fed into all the trees and a majority vote across the
trees decides its classification.
An error estimate is made for the cases which were not used while building the tree. That is
called an OOB (Out-of-bag) error estimate which is mentioned as a percentage.
Install R Package
Use the below command in R console to install the package. You also have to install the
dependent packages if any.
install.packages("randomForest)
The package "randomForest" has the function randomForest() which is used to create and
analyze random forests.
Syntax
The basic syntax for creating a random forest in R is −
randomForest(formula, data)
Input Data
We will use the R in-built data set named readingSkills to create a random forest. It describes
someone's readingSkills score if we know the variables "age", "shoeSize", "score" and
whether the person is a native speaker.
Construct Random Forest Model:
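The construction call itself is not shown above; a minimal sketch, assuming the same 70/30 split of readingSkills and the object name RForest used in the prediction output below:
library(party)
library(randomForest)
set.seed(123)
idx <- sample(1:nrow(readingSkills), 0.7 * nrow(readingSkills))
train <- readingSkills[idx, ]
test  <- readingSkills[-idx, ]
# Build the random forest and print its out-of-bag (OOB) error summary
RForest <- randomForest(nativeSpeaker ~ age + shoeSize + score,
                        data = train, ntree = 500, importance = TRUE)
print(RForest)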
> pred=predict(RForest,test)
> pred
1 2 4 5 8 9 10 23 27 30 35 39 42 43 44 46 47 48 57 60 61 62
yes yes yes yes yes yes no no yes yes yes yes no no no yes yes no yes no yes no 68 70
71 72 76 77 79 84 86 89 96 100 101 105 107 109 120 129 131 132 133 141
no yes yes no yes yes no no yes yes no yes yes no yes no no yes no no yes no 142 147
152 154 166 170 173 174 187 188 190 191 193 194 197 200
no yes yes yes no no yes no yes yes yes no no no yes no
Levels: no yes
> table(pred)
pred
no yes
27 33
> acc=addmargins(table(pred,test$nativeSpeaker))
> acc
pred   no yes Sum
  no   27   0  27
  yes   1  32  33
  Sum  28  32  60
> accvalue=(27+32)/60
> accvalue
[1] 0.9833333
Conclusion: The constructed random forest is about 98% accurate on the test data.
Support Vector Machine (SVM)
A "Support Vector Machine" (SVM) is a supervised machine learning algorithm which can be
used for both classification and regression challenges. However, it is mostly used in classification
problems. In this algorithm, we plot each data item as a point in n-dimensional space (where n
is the number of features you have) with the value of each feature being the value of a particular
coordinate. Then, we perform classification by finding the hyper-plane that differentiates the
two classes very well (look at the below snapshot).
Support Vectors are simply the co-ordinates of individual observation. Support Vector Machine
is a frontier which best segregates the two classes (hyper-plane/ line).
Identify the right hyper-plane (Scenario-1): Here, we have three hyper-planes (A, B and C).
Now, identify the right hyper-plane to classify star and circle.
“Select the hyper-plane which segregates the two classes better”. In this
scenario, hyper-plane “B” has excellently performed this job.
Identify the right hyper-plane (Scenario-2): Here, we have three hyper-planes (A, B and C)
and all are segregating the classes well. Now, How can we identify the right hyper-plane?
Here, maximizing the distances between nearest data point (either class) and hyper-plane will
help us to decide the right hyper-plane. This distance is called as Margin. Consider
the below snapshot:
In the above figure, the margin for hyper-plane C is high as compared to both A and B. Hence, we
name the right hyper-plane as C. Another compelling reason for selecting the hyper-plane with
the higher margin is robustness. If we select a hyper-plane having a low margin then there is a high
chance of misclassification.
Can we classify two classes (Scenario-3): In the below figure, it is not possible to segregate the two
classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.
Here one star at other end is like an outlier for star class. SVM has a feature to ignore outliers
and find the hyper-plane that has maximum margin. Hence, we can say, SVM is robust to
outliers.
> install.packages("e1071")
> library(e1071)
SVM Model Construction for the iris dataset and its output:
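The construction call itself is not shown above; a minimal sketch, assuming a random split of iris with 46 rows held out for testing and the object name svm_model used in the output below:
library(e1071)
set.seed(42)
idx <- sample(1:nrow(iris), 104)
train <- iris[idx, ]
test  <- iris[-idx, ]
# Fit an SVM classifier for Species and inspect it
svm_model <- svm(Species ~ ., data = train)
summary(svm_model)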
> pred=predict(svm_model,test)
> table(pred) pred
setosa versicolor virginica
15 19 12
Checking the accuracy of the constructed model:
> addmargins(table(pred,test$Species))
> AccValue=(15+16+12)/46
> AccValue [1]
0.9347826
Conclusion:
The constructed SVM model is about 93% accurate on the test data.
Clustering
Cluster is a group of objects that belongs to the same class. In other words, similar objects are
grouped in one cluster and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of making a group of abstract objects into classes of similar objects.
While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign the labels to the groups.
Clustering can also help marketers discover distinct groups in their customer base. And
they can characterize their customer groups based on the purchasing patterns.
In the field of biology, it can be used to derive plant and animal taxonomies, categorize
genes with similar functionalities and gain insight into structures inherent to populations.
Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to
house type, value, and geographic location.
Clustering also helps in classifying documents on the web for information discovery.
Clustering is also used in outlier detection applications such as detection of credit card
fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.
Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of 'n' objects and the partitioning method constructs 'k' partitions of
the data. Each partition will represent a cluster and k ≤ n. It means that it will classify the data into k groups, which satisfy the following
requirements −
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to one
another. It keeps on doing so until all of the groups are merged into one or until the termination
condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects in
the same cluster. In the continuous iteration, a cluster is split into smaller clusters. This is done
until each object is in one cluster or the termination condition holds. This method is rigid, i.e.,
once a merging or splitting is done, it can never be undone.
Grid-based Method
In this, the objects together form a grid. The object space is quantized into a finite number of cells
that form a grid structure.
Advantage
It is dependent only on the number of cells in each dimension in the quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a given
model. This method locates the clusters by clustering the density function. It reflects spatial
distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on
standard statistics, taking outlier or noise into account. It therefore yields robust clustering
methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-oriented
constraints. A constraint refers to the user expectation or the properties of desired clustering
results. Constraints provide us with an interactive way of communication with the clustering
process. Constraints can be specified by the user or the application requirement.
K-Means Clustering
We are given a data set of items, with certain features, and values for these features (like a
vector). The task is to categorize those items into groups. To achieve this, we will use the k-means
algorithm, an unsupervised learning algorithm.
(It will help if you think of items as points in an n-dimensional space.) The algorithm will
categorize the items into k groups of similarity. To calculate that similarity, we will use the
Euclidean distance as the measurement.
The cluster centers are called means, because they hold the mean values of the items
categorized in them. To initialize these means, we have a lot of options. An intuitive method is to
initialize the means at random items in the data set. Another method is to initialize the means at
random values between the boundaries of the data set.
K-Means Clustering Implementation in R.
One method to validate the number of clusters is the elbow method. The
Idea of the elbow method is to run k-means clustering on the dataset for a range of values of k
(say, k from 1 to 10 in the examples above), and for each value of k calculate the sum of squared
errors (SSE).
Then, plot a line chart of the SSE for each value of k. If the line chart looks like an arm, then the
"elbow" on the arm is the value of k that is the best. The idea is that we want a small SSE, but
that the SSE tends to decrease toward 0 as we increase k (the SSE is 0 when k is equal to the
number of data points in the dataset, because then each data point is its own cluster, and there is
no error between it and the center of its cluster). So our goal is to choose a small value of k that
still has a low SSE, and the elbow usually represents where we start to have diminishing returns
by increasing k.
Install the package cluster and use the function clusplot() to visualize clustering results.
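A minimal sketch of k-means on the iris measurements, with an elbow plot and a cluster plot (the choice of k = 3 is only an illustration):
library(cluster)
data <- iris[, 1:4]
# Elbow method: total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(data, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SSE")
# Fit k-means with k = 3 and visualize the clusters
fit <- kmeans(data, centers = 3, nstart = 10)
clusplot(data, fit$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)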
ASSOCIATION RULE MINING
Association rule learning is a popular and well-researched method for discovering interesting
relations between variables in large databases. It is a way of analyzing and presenting strong
rules discovered in databases using different measures of interestingness. Based on the concept
of strong rules, association rules are used to discover regularities between products in large-scale
transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the
rule {onions, potatoes} => {burger} found in the sales data of a
supermarket would indicate that if a customer buys onions and potatoes together, he or she is
likely to also buy hamburger meat. Such information can be used as the basis for decisions about
marketing activities such as, e.g., promotional pricing or product placements. In addition to the
above example from market basket analysis association rules are employed today in many
application areas including Web usage mining, intrusion detection and bioinformatics. As
opposed to sequence mining, association rule learning typically does not consider the order of
items either within a transaction or across transactions.
Apriori Algorithm
The most famous algorithm for association rule learning is Apriori. It was
proposed by Agrawal and Srikant in 1994. The input of the algorithm is a dataset of transactions
where each transaction is a set of items. The output is a collection of association rules for which
support and confidence are greater than some specified threshold. The name comes from the
Latin phrase a priori (literally, "from what is before") because of one smart observation behind
the algorithm: if the item set is infrequent, then we can be sure in advance that all its subsets are
also infrequent.
1. Count the support of all item sets of length 1, or calculate the frequency of every item
in the dataset.
2. Drop the item sets that have support lower than the threshold.
3. Store all the remaining (frequent) item sets.
4. Extend each stored item set by one element with all possible extensions. This step is
known as candidate generation.
5. Calculate the support value of each candidate.
6. Drop all candidates that have support lower than the threshold.
7. Drop all stored items from step 3 that have the same support as their extensions.
8. Store the remaining extensions as new item sets.
9. Repeat steps 4 to 8 until there are no more extensions with support greater than the
threshold.
This is not a very efficient algorithm if you have a lot of data, but mobile applications are not
recommended for use with big data anyway. This algorithm was influential in its time, and is
also elegant and easy to understand today.
Implementation of the Apriori Algorithm in R: install and load the package arules:
> install.packages("arules")
> library(arules)
The Market_Basket_Optimisation data set should be downloaded from the below website:
www.superdatascience.com/machine-learning
Convert the dataset into sparse Matrix:
> dataset = read.transactions('E:\\Market_Basket_Optimisation.csv', sep = ",", rm.duplicates = TRUE)
distribution of transactions with duplicates: 15
> dataset
transactions in sparse format with 7501 transactions (rows) and 119 items (columns)
> summary(dataset)
transactions as itemMatrix in sparse format with 7501 rows
(elements/itemsets/transactions) and 119 columns (items) and a density of 0.03288973
most frequent items:
mineral water eggs spaghetti french fries
1788 1348 1306 1282
chocolate (Other)
1229 22405
element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13
1754 1358 1044  816  667  493  391  324  259  139  102   67   40
  14   15   16   18   19   20
  22   17    4    1    2    1
> itemFrequencyPlot(dataset,topN=10)
> rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.8))
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support
0.8 0.1 1 none FALSE TRUE 5 0.003
minlen maxlen target ext
1 10 rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 22
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s]. sorting and
recoding items ... [115 item(s)] done [0.00s].
creating transaction tree ... done [0.00s]. checking subsets
of size 1 2 3 4 5 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
> rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.4))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support
0.4 0.1 1 none FALSE TRUE 5 0.003
minlen maxlen target ext
1 10 rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 22
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s]. sorting and
recoding items ... [115 item(s)] done [0.00s].
creating transaction tree ... done [0.00s]. checking subsets
of size 1 2 3 4 5 done [0.00s].
writing ... [281 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
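The 281 rules can then be examined; a minimal sketch of sorting them by lift and inspecting the strongest ones:
# Sort the rules by decreasing lift and inspect the ten strongest ones
inspect(sort(rules, by = "lift")[1:10])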
Outlier:
An outlier is a data object that deviates significantly from the rest of the objects, as if it were
generated by a different mechanism. Data objects that are not outliers are referred to as "normal" or expected
data. Similarly, we may refer to outliers as "abnormal" data.
Outlier Analysis:
The outliers may be of particular interest, such as in the case of fraud detection, where outliers
may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data
mining task, referred to as outlier mining or outlier analysis.
LOF (Local Outlier Factor) is an algorithm for identifying density-based local outliers [Breunig
et al., 2000]. With LOF, the local density of a point is compared with that of its neighbors. If the
former is significantly lower than the latter (with an LOF value greater than one), the point is in a
sparser region than its neighbors, which suggests it may be an outlier.
Function lofactor(data, k) in packages DMwR and dprep calculates local outlier factors using the
LOF algorithm, where k is the number of neighbors used in the calculation of the local outlier
factors.
Print the top 5 outliers:
Next, we show outliers with a biplot of the first two principal components.
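A minimal sketch, assuming the DMwR package is installed and the four numeric iris columns are used as the data:
library(DMwR)
iris2 <- iris[, 1:4]
# Compute local outlier factors using k = 5 neighbours
outlier.scores <- lofactor(iris2, k = 5)
# Print the indices of the top 5 outliers
outliers <- order(outlier.scores, decreasing = TRUE)[1:5]
print(outliers)
# Show the outliers with a biplot of the first two principal components
labels <- rep(".", nrow(iris2))
labels[outliers] <- "+"
biplot(prcomp(iris2), cex = 0.8, xlabs = labels)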
Unit – V
Overview of R Shiny - R Hadoop - Case Study - Hypothesis Generation, Importing Data set and Basic Data Exploration,
Feature Engineering, Model Building.
Overview of R Shiny
Writing code for plotting graphs in R time and again can get very tiring. Also, it is very
difficult to create an interactive visualization for story narration using standard plotting packages. These
problems can be resolved by dynamically creating interactive plots in R using Shiny with
minimal effort.
If you use R, chances are that you might have come across Shiny. It is an open-source package
from RStudio, used to build interactive web pages with R. It provides a very powerful way to
share your analysis in an interactive manner with the community.
Shiny is an open-source package from RStudio, which provides a web application framework to
create interactive web applications (visualizations) called "Shiny apps". The ease of working with
Shiny is what popularized it among R users. These web applications seamlessly display R
objects (like plots, tables, etc.) and can also be made live to allow access to anyone.
Shiny provides automatic reactive binding between inputs and outputs which we will be
discussing in the later parts of this article. It also provides extensive pre-built widgets which
make it possible to build elegant and powerful applications with minimal effort.
Components of R Shiny
1. UI.R: This file creates the user interface in a shiny application. It provides interactivity to the
shiny app by taking the input from the user and dynamically displaying the generated output on
the screen.
2. Server.R: This file contains the series of steps to convert the input given by user into the
desired output to be displayed.
Creating a Simple R Shiny Application
Writing “ui.R”
If you are creating a shiny application, the best way to ensure that the application interface runs
smoothly on different devices with different screen resolutions is to create it using fluid page.
This ensures that the page is laid out dynamically based on the resolution of each device.
Title Panel: The content in the title panel is displayed as metadata, as in top left corner of
above image which generally provides name of the application and some other relevant
information.
Sidebar Layout: Sidebar layout takes input from the user in various forms like text
input, checkbox input, radio button input, drop down input, etc. It is represented in dark
background in left section of the above image.
Main Panel: It is part of screen where the output(s) generated as a result of performing a
set of operations on input(s) at the server.R is / are displayed.
Let’s understand UI.R and Server.R with an example:
#UI.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("Iris Dataset"),
  sidebarLayout(
    sidebarPanel(
      # Input "p" selects which iris column to plot (read as input$p in server.R)
      selectInput("p", "Select column of iris dataset:",
                  choices = c("Sepal.Length" = "a", "Sepal.Width" = "b",
                              "Petal.Length" = "c", "Petal.Width" = "d"))
    ),
    mainPanel(plotOutput("distPlot"))
  )
))
Writing SERVER.R
This acts as the brain of web application. The server.R is written in the form of a function which
maps input(s) to the output(s) by some set of logical operations. The inputs taken in ui.R file are
accessed using $ operator (input$InputName). The outputs are also referred using the $ operator
(output$OutputName). We will be discussing a few examples of server.R in the coming sections
of the article for better understanding.
#SERVER.R
library(shiny)
shinyServer(function(input, output) {
  output$distPlot <- renderPlot({
    # Map the selected input to the corresponding iris column index
    i <- 1
    if (input$p == 'a') {
      i <- 1
    }
    if (input$p == 'b') {
      i <- 2
    }
    if (input$p == 'c') {
      i <- 3
    }
    if (input$p == 'd') {
      i <- 4
    }
    x <- iris[, i]
    # Draw a histogram of the selected column
    hist(x, col = "skyblue", main = "Histogram of the selected iris column",
         xlab = names(iris)[i])
  })
})
Output:
The Shiny apps which you have created can be accessed and used by anyone only if they are
deployed on the web. You can host your Shiny application on "shinyapps.io". It provides a free-of-cost
platform-as-a-service [PaaS] for deployment of Shiny apps, with some restrictions though,
like only 25 hours of usage in a month, limited memory space, etc. You can also use your own
server for deploying Shiny apps.
Steps for using shiny cloud:
Server.R
library(shiny)
library(datasets)
# Define server logic required to summarize and view the selected dataset
shinyServer(function(input, output) {
  # Return the requested dataset
  datasetInput <- reactive({
    switch(input$dataset,
           "rock" = rock,
           "pressure" = pressure,
           "cars" = cars)
  })
  # Generate a summary of the dataset
  output$summary <- renderPrint({ summary(datasetInput()) })
  # Show the first "n" observations
  output$view <- renderTable({ head(datasetInput(), n = input$obs) })
})
ui.R
library(shiny)
# Define the UI for the dataset viewer application
shinyUI(fluidPage(
  # Application title
  titlePanel("Shiny Text"),
  # Sidebar with controls to select a dataset and specify the number
  # of observations to view
  sidebarLayout(
    sidebarPanel(
      selectInput("dataset", "Choose a dataset:",
                  choices = c("rock", "pressure", "cars")),
      numericInput("obs", "Number of observations to view:", 10)
    ),
    # Show a summary of the dataset and an HTML table with the requested
    # number of observations
    mainPanel(
      verbatimTextOutput("summary"),
      tableOutput("view")
    )
  )
))
Output:
R HADOOP
R is an amazing data science programming tool to run statistical data analysis on models and
translating the results of analysis into colourful graphics. There is no doubt that R is the most
preferred programming tool for statisticians, data scientists, data analysts and data architects but
it falls short when working with large datasets. One major drawback with R programming
language is that all objects are loaded into the main memory of a single machine. Large datasets
of petabyte size cannot be loaded into RAM; this is when Hadoop, integrated with the R
language, is an ideal solution. To adapt to the in-memory, single-machine limitation of the R
programming language, data scientists have to limit their data analysis to a sample of data from
the large data set. This limitation of R programming language comes as a major hindrance when
dealing with big data. Since R is not very scalable, the core R engine can process only a limited
amount of data. In contrast, distributed processing frameworks like Hadoop are scalable for
complex operations and tasks on large datasets (petabyte range) but do not have strong statistical
analytical capabilities. As Hadoop is a popular framework for big data processing, integrating R
with Hadoop is the next logical step. Using R on Hadoop will provide highly scalable data
analytics platform which can be scaled depending on the size of the dataset. Integrating Hadoop
with R lets data scientists run R in parallel on large dataset as none of the data science libraries in
R language will work on a dataset that is larger than its memory. Big data analytics with R and
Hadoop provides the cost-value return of a commodity hardware cluster as an alternative to vertical
scaling.
Data analysts or data scientists working with Hadoop may already have R packages or R scripts that
they use for data processing. To use these with Hadoop, they would otherwise need to rewrite the
scripts in Java or another language that implements Hadoop MapReduce, which is a burdensome process
and can lead to unwanted errors. To integrate Hadoop with R, we instead need software that is
already written for the R language and works with data stored on Hadoop's distributed storage.
There are many solutions for using the R language to perform large computations, but most of them
require the data to be loaded into memory before it is distributed to the computing nodes, which is
not ideal for large datasets. Here are some commonly used methods to integrate Hadoop with R and
make the best use of R's analytical capabilities on large datasets:
1) RHadoop
The most commonly used open source analytics solution to integrate the R programming language with
Hadoop is RHadoop. RHadoop, developed by Revolution Analytics, lets users directly ingest data from
the HBase database subsystem and the HDFS file system. The RHadoop package is the 'go-to' solution
for using R on Hadoop because of its simplicity and cost advantage. RHadoop is a collection of five
different packages that allow Hadoop users to manage and analyse data using the R programming
language (a short usage sketch follows the package list below). RHadoop is compatible with open
source Hadoop as well as with the popular Hadoop distributions Cloudera, Hortonworks and MapR.
rhbase – rhbase package provides database management functions for HBase within R using
Thrift server. This package needs to be installed on the node that will run R client. Using rhbase,
data engineers and data scientists can read, write and modify data stored in HBase tables from
within R.
rhdfs – The rhdfs package provides R programmers with connectivity to the Hadoop Distributed File
System, so that they can read, write or modify the data stored in HDFS.
plyrmr – This package supports data manipulation operations on large datasets managed by
Hadoop. Plyrmr (plyr for MapReduce) provides data manipulation operations present in popular
packages like reshape2 and plyr. This package depends on Hadoop MapReduce to perform
operations but abstracts most of the MapReduce details.
ravro – This package lets users read and write Avro files from local and HDFS file systems.
rmr2 (execute R inside Hadoop MapReduce) – Using this package, R programmers can perform
statistical analysis on the data stored in a Hadoop cluster. Integrating R with Hadoop through rmr2
can be a little tedious, but many R programmers still find it much easier than writing Java-based
Hadoop mappers and reducers, and it eliminates data movement and helps parallelize computation over
large datasets.
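As a rough illustration of the RHadoop workflow, the short sketch below uses rhdfs to connect to
HDFS and rmr2 to run a trivial MapReduce job. It is a minimal sketch, assuming a working Hadoop
installation with the RHadoop packages installed; the path assigned to HADOOP_CMD is an assumption
and must match your own cluster.
# Minimal RHadoop sketch (the hadoop binary path is an assumed placeholder)
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
library(rhdfs)
library(rmr2)
hdfs.init()                                   # open the connection to HDFS
small.ints <- to.dfs(1:1000)                  # write a small vector into HDFS
# run a MapReduce job that squares every value in the map phase
result <- mapreduce(input = small.ints,
                    map = function(k, v) keyval(v, v^2))
head(from.dfs(result)$val)                    # pull the results back into R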
2) RHIPE
RHIPE (“R and Hadoop Integrated Programming Environment”) is an R library that allows users to run
Hadoop MapReduce jobs from within the R programming language. R programmers just have to write R
map and R reduce functions, and the RHIPE library will transfer them and invoke the corresponding
Hadoop Map and Reduce tasks. RHIPE uses a protocol buffer encoding scheme to transfer the map and
reduce inputs. The advantage of using RHIPE over other parallel R packages is that it integrates
well with Hadoop and provides a data distribution scheme using HDFS across a cluster of machines,
which provides fault tolerance and optimizes processor usage.
3) R and Hadoop Streaming
The Hadoop Streaming API allows users to run Hadoop MapReduce jobs with any executable script that
reads data from standard input and writes data to standard output as the mapper or reducer. Thus,
the Hadoop Streaming API can be used along with R scripts in the map or reduce phases. This method
of integrating R and Hadoop does not require any client-side integration, because streaming jobs
are launched through the Hadoop command line. Submitted MapReduce jobs undergo data transformation
through UNIX standard streams and serialization to ensure Java-compliant input to Hadoop,
irrespective of the language of the script provided by the programmer. A sketch of an R mapper
written for the Streaming API is shown below.
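To make the idea concrete, a Streaming mapper written in R only has to read lines from standard
input and emit key/value pairs on standard output. The word-count style mapper below is a minimal
sketch; the hadoop command line that launches it is not shown here.
#!/usr/bin/env Rscript
# Hedged sketch of an R mapper for Hadoop Streaming: reads lines from stdin
# and writes tab-separated word/count pairs to stdout
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  for (word in strsplit(line, "[[:space:]]+")[[1]]) {
    if (nchar(word) > 0) cat(word, "\t", 1, "\n", sep = "")
  }
}
close(con)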
4) RHIVE
If you want your Hive queries to be launched from the R interface, then RHIVE is the go-to package,
with functions for retrieving metadata such as database names, column names and table names from
Apache Hive. RHIVE makes the rich statistical libraries and algorithms available in R accessible to
the data stored in Hadoop by extending HiveQL with R language functions. RHIVE functions allow
users to apply R statistical learning models to data stored in a Hadoop cluster that has been
catalogued using Apache Hive. The advantage of using RHIVE for Hadoop–R integration is that it
parallelizes operations and avoids data movement, because data operations are pushed down into
Hadoop.
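A minimal sketch of querying Hive from R with the RHive package follows; it assumes a running Hive
server and a correctly configured RHive installation, and the host and the table name 'sales' are
hypothetical placeholders.
# Hedged RHive sketch (host and table name are assumed placeholders)
library(RHive)
rhive.connect(host = "localhost")                    # connect to the Hive server
rhive.query("SHOW TABLES")                           # run HiveQL from R
head(rhive.query("SELECT * FROM sales LIMIT 10"))    # 'sales' is a hypothetical table
rhive.close()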
5) ORCH (Oracle R Connector for Hadoop)
ORCH can be used on non-Oracle Hadoop clusters or on the Oracle Big Data Appliance. Mappers and
reducers are written in the R programming language, and MapReduce jobs are executed from the R
environment through a high-level interface. With ORCH for R–Hadoop integration, R programmers do
not have to learn a new language such as Java or get into the details of the Hadoop environment,
such as the cluster hardware or software. The ORCH connector also allows users to test MapReduce
programs locally, through the same function call, before they are deployed to the Hadoop cluster.
The number of open source options for performing big data analytics with R and Hadoop is
continuously expanding, but for simple Hadoop MapReduce jobs, R with Hadoop Streaming still proves
to be the best solution. The combination of R and Hadoop is a must-have toolkit for professionals
working with big data, enabling fast, predictive analytics combined with the performance,
scalability and flexibility you need.
Most Hadoop users claim that the advantage of the R programming language is its exhaustive list of
data science libraries for statistics and data visualization. However, the data science libraries
in R are non-distributed in nature, which makes data retrieval a time-consuming affair. This is an
in-built limitation of the R language, but once it is worked around, R and Hadoop together make big
data analytics a delight!
Case Study
Before exploring the data to understand the relationships between variables, I'd recommend focusing
on hypothesis generation first. Now, this might sound counter-intuitive for solving a data science
problem, but before exploring the data, think about the business problem and gain the domain
knowledge.
How does it help? This practice usually helps you form better features later on, which are not
biased by the data available in the dataset. At this stage, you are expected to possess structured
thinking, i.e. a thinking process that takes into consideration all the possible aspects of a
particular problem.
Here are some of the hypotheses which I thought could influence the demand for bikes:
Hourly trend: There must be high demand during office hours. Early morning and late evening may
show a different trend (cyclists), and demand should be low between 10:00 pm and 4:00 am.
Daily trend: Registered users should demand more bikes on weekdays as compared to weekends or
holidays.
Rain: The demand for bikes will be lower on a rainy day than on a sunny day. Similarly, higher
humidity will lower the demand, and vice versa.
Temperature: In India, temperature has a negative correlation with bike demand. But, after looking
at Washington's temperature graph, I presume it may have a positive correlation here.
Pollution: If the pollution level in a city starts soaring, people may start using bikes (this may
be influenced by government / company policies or increased awareness).
Time: Total demand should have a higher contribution from registered users than from casual users,
because the registered user base increases over time.
Traffic: Traffic can be positively correlated with bike demand; higher traffic may push people
towards bikes as compared to other road transport such as cars and taxis.
The dataset shows hourly rental data for two years (2011 and 2012). The training data set covers
the first 19 days of each month; the test data set runs from the 20th day to the month's end. We
are required to predict the total count of bikes rented during each hour covered by the test set.
In the training data set, bike demand is given separately for registered and casual users, and the
sum of both is given as count.
The training data set has 12 variables (see below) and the test set has 9 (excluding registered,
casual and count).
Independent Variables
weather: 2 -> Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3 -> Light Snow and Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
Dependent Variables
For this solution, R (RStudio 0.99.442) in a Windows environment has been used.
Below are the steps to import the data and perform data exploration.
1. Import the train and test data sets.
train=read.csv("train_bike.csv")
test=read.csv("test_bike.csv")
2. Combine both the train and test data sets (to understand the distribution of the
independent variables together).
test$registered=0
test$casual=0
test$count=0
data=rbind(train,test)
Before combining the test and train data sets, I have made their structures identical by adding the
registered, casual and count columns to test.
str(data)
'data.frame': 17379 obs. of 12 variables:
table(is.na(data))
FALSE
208548
From the above you can see that there are no missing values in the data frame.
par(mfrow=c(4,2))
hist(data$season)
hist(data$weather)
hist(data$humidity)
hist(data$holiday)
hist(data$workingday)
hist(data$temp)
hist(data$atemp)
hist(data$windspeed)
A few inferences can be drawn by looking at these histograms:
prop.table(table(data$weather))
1 2 3 4
As expected, most days are working days, and the holiday variable shows a similar picture. You can
use the code above to look at the distribution in detail. Here you can generate a weekday/weekend
variable using holiday and workingday: if both are zero, the day must be a weekend (a small sketch
follows below). The variables temp, atemp, humidity and windspeed look fairly normally distributed.
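As a quick illustration of that weekday/weekend flag (the variable name non_working is
hypothetical):
# Hedged sketch: flag days that are neither working days nor holidays (i.e. weekends)
data$non_working <- ifelse(data$holiday == 0 & data$workingday == 0, 1, 0)
table(data$non_working)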
data$season=as.factor(data$season)
data$weather=as.factor(data$weather)
data$holiday=as.factor(data$holiday)
data$workingday=as.factor(data$workingday)
4. Hypothesis Testing (using multivariate analysis)
Till now, we have got a fair understanding of the data set. Now, let's test the hypotheses we
generated earlier. Here I have also added some additional hypotheses based on the dataset. Let's
test them one by one:
Hourly trend: We don’t have the variable ‘hour’ with us right now. But we can extract it
using the datetime column.
data$hour=substr(data$datetime,12,13)
data$hour=as.factor(data$hour)
Let's plot the hourly trend of count over hours and check whether our hypothesis is correct. We
will first separate the train and test data sets from the combined one (a sketch of the plot code
is shown after the split below).
train=data[as.integer(substr(data$datetime,9,10))<20,]
test=data[as.integer(substr(data$datetime,9,10))>19,]
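The resulting figure is not reproduced here; a minimal sketch of the plotting call would be:
# Hedged sketch of the hourly trend of total demand
boxplot(train$count ~ train$hour, xlab = "hour", ylab = "count of users")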
Above, you can see the trend of bike demand over hours. I'll quickly segregate the bike demand into
three categories:
High: 7-9 and 17-19 hours
Here we have analyzed the distribution of total bike demand. Let's look at the distributions of
registered and casual users separately.
Above you can see that registered users follow a trend similar to count, whereas casual users
follow a different trend. Thus, we can say that 'hour' is a significant variable and our hypothesis
is true.
You might have noticed that there are a lot of outliers while plotting the counts of registered and
casual users. These values are not generated due to error, so we consider them natural outliers;
they might be the result of groups of people (who are not registered) taking up cycling. To treat
such outliers, we will use a logarithm transformation. Let's look at the same plot after the log
transformation.
boxplot(log(train$count)~train$hour,xlab="hour",ylab="log(count)")
Daily Trend: Like hour, we will generate a day variable from the datetime column and then plot it
(a sketch of the plot is shown after the code below).
date=substr(data$datetime,1,10)
days<-weekdays(as.Date(date))
data$day=days
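The day-wise figure is likewise not reproduced; a minimal sketch of the plot, after re-splitting
the train set so that it picks up the new day column, would be:
# Hedged sketch of the day-wise demand plot
train=data[as.integer(substr(data$datetime,9,10))<20,]   # re-split so train includes day
boxplot(train$registered ~ train$day, xlab = "day", ylab = "registered users")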
Rain: We don't have a 'rain' variable, but we do have 'weather', which is sufficient to test our
hypothesis. As per the variable description, weather 3 represents light rain and weather 4
represents heavy rain. Take a look at the plot: it clearly supports our hypothesis.
Temperature, Windspeed and Humidity: These are continuous variables so we can look at the
correlation factor to validate hypothesis.
sub=data.frame(train$registered, train$casual, train$count, train$temp, train$humidity,
               train$atemp, train$windspeed)
cor(sub)
Here are a few inferences you can draw from the correlation matrix above:
Variable temp is positively correlated with the dependent variables (the correlation is higher for
casual than for registered).
Variable atemp is highly correlated with temp.
Windspeed has a lower correlation compared to temp and humidity.
Time: Let's extract the year of each observation from the datetime column and see the trend of bike
demand over the years.
data$year=substr(data$datetime,1,4)
data$year=as.factor(data$year)
train=data[as.integer(substr(data$datetime,9,10))<20,]
test=data[as.integer(substr(data$datetime,9,10))>19,]
boxplot(train$count~train$year,xlab="year", ylab="count")
We can see that 2012 has higher bike demand as compared to 2011.
Pollution & Traffic: We don't have variables related to these metrics in our data set, so we cannot
test these hypotheses.
5. Feature Engineering
Hour Bins: Initially, we broadly categorized the hours into three categories. Let's now create bins
for the hour variable, separately for casual and registered users. Here we will use a decision tree
to find accurate bin boundaries.
library(rpart)
library(rattle)        # fancyRpartPlot() comes from the rattle package
library(rpart.plot)
library(RColorBrewer)
d=rpart(registered~hour,data=train)
fancyRpartPlot(d)
Now, looking at the nodes, we can create different hour buckets for registered users.
data=rbind(train,test)
data$hour=as.integer(as.character(data$hour))   # back to numeric so the range comparisons below work
data$dp_reg=0
data$dp_reg[data$hour<8]=1
data$dp_reg[data$hour>=22]=2
data$dp_reg[data$hour>9 & data$hour<18]=3
data$dp_reg[data$hour==8]=4
data$dp_reg[data$hour==9]=5
data$dp_reg[data$hour==20 | data$hour==21]=6   # the remaining bucket assignments are truncated in the source; hour 21 and bucket 6 are assumed
Temp Bins: Using similar methods, we have created bins for temperature for both registered and
casual users. The variables created are temp_reg and temp_cas.
Year Bins: We had a hypothesis that bike demand increases over time, and we have proved it. Here I
have created 8 quarterly bins for the two years: Jan-Mar 2011 as 1, up to Oct-Dec 2012 as 8.
data$month=as.integer(substr(data$datetime,6,7))   # month is needed for the quarterly bins below
data$year_part=0
data$year_part[data$year=='2011']=1
data$year_part[data$year=='2011' & data$month>3]=2
data$year_part[data$year=='2011' & data$month>6]=3
data$year_part[data$year=='2011' & data$month>9]=4
data$year_part[data$year=='2012']=5
data$year_part[data$year=='2012' & data$month>3]=6
data$year_part[data$year=='2012' & data$month>6]=7
data$year_part[data$year=='2012' & data$month>9]=8
table(data$year_part)
Day Type: Created a variable having the categories "working day", "weekend" and "holiday".
data$day_type=""
data$day_type[data$holiday==0 & data$workingday==0]="weekend"   # neither a holiday nor a working day
data$day_type[data$holiday==1]="holiday"
data$day_type[data$holiday==0 & data$workingday==1]="working day"
data$weekend=0
data$weekend[data$day=="Saturday" | data$day=="Sunday"]=1   # assumed completion; this assignment is cut off in the source
6. Model Building
Before executing the random forest model code, I followed the following steps:
train$hour=as.factor(train$hour)
test$hour=as.factor(test$hour)
As we know, the dependent variables have natural outliers, so we will predict the log of the
dependent variables.
We predict bike demand for registered and casual users separately:
y1=log(casual+1) and y2=log(registered+1)
Here we have added 1 to deal with zero values in the casual and registered columns. A hedged sketch
of the model-fitting step is given below.
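The code that fits the two random forest models is not shown in the source. The following is a
minimal sketch of that step: the predictor list (restricted to variables already present in train
at this point), the seed and the ntree value are all assumptions, and the author's actual models
very likely also used the engineered features (year_part, day_type, temp_reg, temp_cas, dp_reg).
# Hedged sketch of the model-fitting step; predictors, seed and ntree are assumptions
library(randomForest)
set.seed(415)
fit1 <- randomForest(log(casual+1) ~ hour + holiday + workingday + season + weather +
                       temp + atemp + humidity + windspeed + year,
                     data=train, importance=TRUE, ntree=250)
fit2 <- randomForest(log(registered+1) ~ hour + holiday + workingday + season + weather +
                       temp + atemp + humidity + windspeed + year,
                     data=train, importance=TRUE, ntree=250)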
pred1=predict(fit1,test)
test$logreg=pred1
pred2=predict(fit2,test)
test$logcas=pred2
Re-transforming the predicted variables and then writing the output of count to the file submit.csv
test$registered=exp(test$logreg)-1
test$casual=exp(test$logcas)-1
test$count=test$casual+test$registered
s<-data.frame(datetime=test$datetime,count=test$count)
write.csv(s,file="submit.csv",row.names=FALSE)