DataAnalyticsUsingR Dr.P.rajesh
Data Analytics
using R
Programming
Dr. P. Rajesh
Assistant Professor
PG Department of Computer Science
Government Arts College
C.Mutlur, Chidambaram.
Arignar Anna Government Arts College, Villupuram. Date: 22.07.2020, Time: 10.00 am to 12.00 pm
Data Analytics
Data Science and Data Analytics are two of the most widely used terms today.
Data is collected in raw form and processed according to the requirements of a company.
This data is then used for decision making.
This process helps businesses grow in the market.
But the main question arises: what is this process called?
Data Analytics is the answer, and Data Analysts and Data Scientists are the ones who perform it.
What is Data Analytics?
Data or information starts out in raw format.
The increase in the size of data has created a need for inspection, data cleaning, transformation and data modeling, in order to gain insights from the data and derive conclusions for better decision making.
This process is known as data analysis.
The analysis is an interactive process of a person tackling a
problem, finding the data required to get an answer, analyzing that
data, and interpreting the results in order to provide a
https://www.kdnuggets.com/2017/07/4-types-data-analytics.html
1. Descriptive:
What is happening?
2. Diagnostic:
Why is it happening?
Diagnostic analytics is a form of advanced analytics that examines data or content to answer the question, "Why is it happening?"
It is characterized by techniques such as data discovery, data mining and correlations.
3. Predictive:
What is likely to happen?
Predictive analysis is used to estimate future outcomes.
Based on the analysis of historical data, we are able to forecast the future.
With the help of data analytics, technological advancements and machine learning, we are able to obtain predictions about the future effectively.
Predictive analytics is a complex field that requires a large amount of data, skilled implementation of predictive models, and tuning of those models to obtain accurate predictions.
4. Prescriptive:
What do I need to do?
Prescriptive analytics combines an understanding of what has happened and why it has happened with a variety of "what-might-happen" analyses to help the user determine the best course of action to take.
Prescriptive analysis typically involves not just one individual action but a host of other actions.
Best route home and
considering the distance of
each route, the speed
What are the types of Data in Data Analytics?
History
R is a programming language and software environment for
statistical analysis, graphics representation and reporting.
R was created by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand.
R is freely available under the GNU General Public License.
R is available for various operating systems like Linux, Windows and Mac.
The language was named R after the first letter of the first names of its two authors (Robert Gentleman and Ross Ihaka).
The name is also partly a play on the name of the Bell Labs language S.
Since mid-1997 there has been a core group (the "R Core Team")
Why Learn R Programming Language
With R, you can perform statistical analysis, data analysis as well as machine
learning.
We can create objects, functions and packages in it.
R is platform-independent and can be used across multiple operating systems.
R is free owing to its open-source GNU licensing and can be installed by
anyone.
R consists of a robust collection of graphical libraries like ggplot2, plotly and
many more.
R is widely used in industries like health, finance, banking, manufacturing and many more.
There are about 2 million job openings for R programmers worldwide.
Companies hire R programmers for many roles like data analysts, business
Features of R
As stated earlier, R is a programming language and software
environment for statistical analysis, graphics representation and
reporting. The following are the important features of R −
R is a well-developed, simple and effective programming language
which includes conditionals, loops, user defined recursive
functions and input and output facilities.
R has an effective data handling and storage facility.
R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
How R is better than Other Technologies
Certain aspects of R programming make it compare well with other technologies:
•Graphical Libraries – Libraries like ggplot2 and plotly make it easy to produce appealing, well-defined plots.
•Availability / Cost – R is completely free.
•Advancement in Tool – R supports various advanced tools and features that allow you to build robust statistical models.
•Job Scenario – With the immense growth of Data Science and the rise in demand, R has become one of the most in-demand programming languages today.
•Customer Service Support and Community – With R, you can enjoy strong community support.
•Portability – R is highly portable. Many different programming languages and software frameworks can easily be combined with the R environment for the best results.
Sourcing of R Script
RStudio
• RStudio is an Integrated Development Environment for R.
• It supports extensive code editing and development, along with various other features.
Features of RStudio
• RStudio provides various tools and features that allow you to boost your
code productivity.
• It can also be accessed over the web and is cross-platform in nature.
• It facilitates automatic checking of updates.
• It provides support for recovery in case of file loss.
• With RStudio, you can manage data more efficiently.
Components of RStudio
• Source – In the top left corner of the screen is the text editor, where you work on source scripts. You can enter multiple lines in this editor.
• Console – This is in the bottom left corner of the main window.
• Workspace and History – In the top right corner are the R workspace and the history window. This will give you the list of all the variables and view
https://cran.r-project.org/bin/windows/base/
R Console Window
R Command Prompt
Once you have R environment setup, then it’s easy to start your R
This will launch R interpreter and you will get a prompt > where you
https://rstudio.com/products/rstudio/download/#download
R - Data Types
In contrast to other programming languages like C and Java, variables in R are not declared with a data type.
The variables are assigned with R-Objects and the data type of the R-
object becomes the data type of the variable.
There are many types of R-objects. The frequently used ones are −
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
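The point that a variable takes its type from the R-object assigned to it can be illustrated with a small sketch. The variable names and values below are illustrative assumptions, not taken from the slides.

```r
# Each variable takes the type of the R-object assigned to it;
# class() reports that type.
v <- c(1.5, 2.5, 3.5)                               # vector
l <- list(1, "a", TRUE)                             # list (mixed element types)
m <- matrix(1:6, nrow = 2)                          # matrix (2 rows, 3 columns)
a <- array(1:24, dim = c(2, 3, 4))                  # array (3 dimensions)
f <- factor(c("low", "high", "low"))                # factor
d <- data.frame(id = 1:2, name = c("Rick", "Dan"))  # data frame

types <- c(class(v), class(l), class(f), class(d))
print(types)
print(dim(m))
print(dim(a))

# The same variable can be reassigned to an object of another type.
x <- 10L
print(class(x))
x <- "ten"
print(class(x))
```

No declaration is needed before the reassignment of x; the variable simply refers to the new object.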
R - Functions
A function is a set of statements to perform a specific task.
R has a large number of in-built functions.
The user can also create their own functions.
Built-in Function
Simple examples of in-built functions are seq(), mean(), max(), sum() and paste().
They are called directly by user-written programs.
# Create a sequence of numbers from 32 to 44.
print(seq(32,44))
# Find mean of numbers from 25 to 82.
print(mean(25:82))
# Find sum of numbers from 41 to 68.
print(sum(41:68))
User-defined Function
They are specific to what a user wants and once created they can
be used like the built-in functions.
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
  for (i in 1:a) {
    b <- i^2
    print(b)
  }
}
Calling a Function
# Call the function new.function supplying 6 as an argument.
new.function(6)
Produces the following result −
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
R String Manipulation Functions
1. grep()
It is used for pattern matching: it returns the elements of a character vector that match a pattern (value=TRUE) or their indices (value=FALSE).
> grep("b+", c("abc", "bda", "ccaa", "abd"), perl=TRUE, value=TRUE)
[1] "abc" "bda" "abd"
> grep("b+", c("abc", "bda", "ccaa", "abd"), perl=TRUE, value=FALSE)
[1] 1 2 4
> grep("chid+", c("chidambaram", "Villupuram", "Srimushnam", "chidambaram"), perl=TRUE, value=FALSE)
[1] 1 4
> grep("அ+", c("அப்பா", "தாத்தா", "அம்மா"), perl=TRUE, value=FALSE)
[1] 1 3
2. nchar()
With the help of this function, we can count the characters.
> str <- "Big Data at DataFlair"
> nchar(str)
[1] 21
3. paste()
Concatenates any number of strings using the paste() function; by default the strings are separated by a single space.
> #Author DataFlair
> paste("Hadoop", "Spark", "and", "Flink")
[1] "Hadoop Spark and Flink"
4. sprintf()
This function makes use of formatting commands that are styled after C.
> sprintf("%s scored %.2f percent", "Matthew", 72.3)
[1] "Matthew scored 72.30 percent"
5. strsplit()
> #Author DataFlair
> str = "Splitting sentence into words"
> strsplit(str, " ")
> strsplit(str, "")
Output
[[1]]
[1] "Splitting" "sentence"  "into"      "words"

[[1]]
 [1] "S" "p" "l" "i" "t" "t" "i" "n" "g" " " "s" "e" "n" "t" "e" "n" "c" "e" " "
[20] "i" "n" "t" "o" " " "w" "o" "r" "d" "s"
Vector
Vectors are the most basic R data objects and there are six types of atomic
vectors. They are logical, integer, double, complex, character and raw.
The non-character values are coerced to character type if one of the elements is
a character.
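A short sketch of the six atomic vector types and of the coercion rule described above. The variable names and values are illustrative assumptions.

```r
# One vector of each atomic type.
lgl  <- c(TRUE, FALSE)
int  <- c(1L, 2L)
dbl  <- c(1.5, 2)
cplx <- c(2+3i)
chr  <- c("a", "b")
rw   <- charToRaw("A")

vector_types <- sapply(list(lgl, int, dbl, cplx, chr, rw), typeof)
print(vector_types)

# Mixing in a character element coerces every element to character.
mixed <- c(1, TRUE, "apple")
print(mixed)
print(typeof(mixed))
```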
Accessing Vector Elements
The [ ] brackets are used for indexing. Indexing starts with position 1.
Giving a negative value in the index drops that element from result.
TRUE, FALSE or 0 and 1 can also be used for indexing.
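The indexing rules above can be sketched as follows; the vector of day names is an illustrative assumption. Note that a logical index acts as a mask, whereas numeric 0 and 1 are treated as positions: 0 selects nothing and 1 selects the first element.

```r
days <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Positive positions select elements (indexing starts at 1).
by_position <- days[c(2, 3, 6)]

# Negative positions drop elements from the result.
dropped <- days[c(-2, -5)]

# A logical index keeps the elements where the mask is TRUE.
by_logical <- days[c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)]

# Numeric 0 and 1 are positions: zeros are ignored, 1 picks element 1.
by_zero_one <- days[c(0, 0, 0, 0, 0, 0, 1)]

print(by_position)
print(dropped)
print(by_logical)
print(by_zero_one)
```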
Syntax
matrix(data, nrow, ncol, byrow, dimnames)
data is the input vector which becomes the data elements of the matrix.
nrow is the number of rows to be created.
ncol is the number of columns to be created.
byrow is a logical value. If TRUE, the input vector elements are arranged by row.
dimnames is the names assigned to the rows and columns.
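A minimal sketch of these parameters in use with matrix(). The values and the row and column names are illustrative assumptions.

```r
row_names <- c("row1", "row2", "row3", "row4")
col_names <- c("col1", "col2", "col3")

# byrow = TRUE fills the matrix one row at a time.
M <- matrix(3:14, nrow = 4, ncol = 3, byrow = TRUE,
            dimnames = list(row_names, col_names))
print(M)

# byrow = FALSE (the default) fills it one column at a time.
N <- matrix(3:14, nrow = 4, ncol = 3, byrow = FALSE)
print(N)

# Elements are accessed as [row, column], by position or by name.
print(M["row2", "col1"])
print(N[2, 1])
```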
Matrix Example
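Data Frame Example
# Create the data frame.
emp.data <- data.frame( emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
"2014-05-11", "2015-03-27")),
stringsAsFactors = FALSE )
# Print the data frame.
print(emp.data)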
Summary of Data in Data Frame
The statistical summary and nature of the data can be obtained by
applying summary() function.
# Create the data frame.
emp.data <- data.frame( emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
"2014-05-11", "2015-03-27")), stringsAsFactors = FALSE )
# Print the summary.
print(summary(emp.data))
Working with CSV Files
• R can read data from files stored outside the R environment.
• R can also write data into files which will be stored and accessed by the operating system.
• R can read and write various file formats like csv, excel, xml etc.
Getting and Setting the Working Directory
getwd() - Find the working directory
setwd() - Set the working directory
# Get and print current working directory.
print(getwd())
# Set current working directory.
setwd("/web/com")
# Get and print current working directory.
print(getwd())
Input as CSV File
The csv file is a text file in which the values in the columns are separated by a comma.
The following data is present in the file named input.csv.
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
Reading a CSV File
The read.csv() function reads a CSV file available in your current working directory.
data <- read.csv("input.csv")
print(data)
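For readers who do not have input.csv at hand, the following sketch first writes the rows listed above to a temporary file and then reads them back. The use of tempdir() and the explicit conversion of start_date are assumptions made for the sake of a self-contained example.

```r
# Recreate the sample file from the rows listed above.
csv_lines <- c(
  "id,name,salary,start_date,dept",
  "1,Rick,623.3,2012-01-01,IT",
  "2,Dan,515.2,2013-09-23,Operations",
  "3,Michelle,611,2014-11-15,IT",
  "4,Ryan,729,2014-05-11,HR",
  "5,Gary,843.25,2015-03-27,Finance",
  "6,Nina,578,2013-05-21,IT",
  "7,Simon,632.8,2013-07-30,Operations",
  "8,Guru,722.5,2014-06-17,Finance"
)
csv_path <- file.path(tempdir(), "input.csv")
writeLines(csv_lines, csv_path)

# read.csv() returns a data frame; the header row supplies column names.
data <- read.csv(csv_path, stringsAsFactors = FALSE)

# Dates are read as text, so convert them explicitly.
data$start_date <- as.Date(data$start_date)

print(data)
str(data)
```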
Analysing the CSV File
read.csv() function gives the output as a data frame
data <- read.csv("input.csv")
print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
Output
[1] TRUE
[1] 5
[1] 8
Get the maximum salary
# Create a data frame.
data <- read.csv("input.csv")
# Get the max salary from data frame.
sal <- max(data$salary)
# Get the person detail having max salary.
retval <- subset(data, salary == max(salary))
Writing into a CSV File
write.csv(retval,"output.csv")
newdata <- read.csv("output.csv")  # read the written file back
print(newdata)
Linear Regression
Regression analysis is a very widely used statistical tool to establish a relationship
model between two variables.
One of these variables is called the predictor variable, whose value is gathered through experiments.
The other variable is called the response variable, whose value is derived from the predictor variable.
Mathematically a linear relationship represents a straight line when plotted as a graph.
The general mathematical equation for a linear regression is y = ax + b
Following is the description of the parameters used −
y is the response variable.
x is the predictor variable.
a and b are constants which are called the coefficients.
How much money should you allocate for gas?
You approach this problem with a science-oriented mindset, thinking that there must be a way to estimate the
amount of money needed, based on the distance you're travelling.
At this point these are just numbers. It's not very easy to get
any valuable information from this spreadsheet.
"If I drive for 1200 miles, how much will I pay for gas?"
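Sl.No.  Total Miles (x)  Total Paid (y)  x*x        x*y
1       390              36.66           152100     14297.4
2       403              37.05           162409     14931.15
3       396.5            34.71           157212.25  13762.52

The slope a is computed from the n data points as
$$a = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$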
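As a rough sketch, the coefficients of y = ax + b can be computed by hand from the sums in the table and checked against lm(). Only the three rows that survive in the table are used here, so the numbers are illustrative rather than the result for the full trip log, and extrapolating to 1200 miles from them is far outside the observed range. The intercept formula b = (Σy − aΣx) / n is the standard least-squares result and is an addition, not taken from the slides.

```r
# The three rows recoverable from the table (an incomplete sample).
x <- c(390, 403, 396.5)     # total miles
y <- c(36.66, 37.05, 34.71) # total paid
n <- length(x)

# Slope from the sums of x, y, x*x and x*y.
a <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)

# Intercept: the fitted line passes through the point of means.
b <- (sum(y) - a * sum(x)) / n

# The built-in lm() fits the same least-squares line.
fit <- lm(y ~ x)
print(c(slope = a, intercept = b))
print(coef(fit))

# Estimated cost of a 1200-mile trip under this illustrative fit.
estimate <- a * 1200 + b
print(estimate)
```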
Interest_Rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,
  1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
Unemployment_Rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,
  6.1,6.1,6.1,5.9,6.2,6.2,6.1)
Stock_Index_Price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,
  965,943,958,971,949,884,866,876,822,704,719)
# Check that each predictor has a roughly linear relationship with the response.
plot(x=Interest_Rate, y=Stock_Index_Price)
plot(x=Unemployment_Rate, y=Stock_Index_Price)
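Once the scatter plots suggest a linear relationship, one possible next step is to fit a multiple regression with lm() and use it for prediction. The sketch below repeats the data so that it runs on its own; the data frame name stock and the new input values are illustrative assumptions.

```r
stock <- data.frame(
  Interest_Rate = c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,
                    1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75),
  Unemployment_Rate = c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,
                        5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1),
  Stock_Index_Price = c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,
                        1130,1075,1047,965,943,958,971,949,884,866,876,822,
                        704,719)
)

# Fit the stock index price on both predictors.
model <- lm(Stock_Index_Price ~ Interest_Rate + Unemployment_Rate, data = stock)
print(coef(model))

# Predict the price for a hypothetical month within the observed range.
new_month <- data.frame(Interest_Rate = 2.75, Unemployment_Rate = 5.3)
predicted_price <- predict(model, new_month)
print(predicted_price)
```

Calling summary(model) on the fitted object additionally reports standard errors and R-squared, which help judge how far the prediction can be trusted.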
# Capture the data in R format
student <- c(1,2,3,4,5,6,7,8,9,10)
testscore <- c(100,95,92,90,85,80,78,75,72,65)
IQ <- c(125,104,110,105,100,100,95,95,85,90)
studyhrs <- c(30,40,25,20,20,20,15,10,0,5)
# Check whether the variables are roughly linearly related.
plot(x=testscore, y=IQ)
plot(x=IQ, y=studyhrs)
#==================================================
# Predict Test Score using IQ and Study Hrs
relation <- lm(testscore ~ IQ + studyhrs)
a <- data.frame(IQ=120,studyhrs=40)
result <- predict(relation,a)
print(result)
#==================================================
# Predict IQ using Test Score and Study Hrs
relation <- lm(IQ ~ testscore + studyhrs)
a <- data.frame(testscore=50,studyhrs=25)
result <- predict(relation,a)
print(result)
#==================================================
# Predict Study Hrs using IQ and Test Score
relation <- lm(studyhrs ~ IQ + testscore )
a <- data.frame(IQ=140, testscore=90)
result <- predict(relation,a)
print(result)
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name.
png(file = "city.png")
# Plot the chart.
pie(x,labels)
# Save the file.
dev.off()
# Get the library.
library(plotrix)
# Create data for the graph.
x <- c(21, 62, 10,53)
lbl <- c("Nashik","Aurangabad","Navi Mumbai","Nagpur")
png(file = "3d_pie_chart.png")
# Plot the chart.
pie3D(x,labels = lbl,explode = 0.1, main = "Pie Chart of Countries")
dev.off()
Bar Chart with Color with attributes
# Create the data for the chart.
H <- c(7,12,28,3,41)
M <- c("Mar","Apr","May","Jun","Jul")
# Give the chart file a name.
png(file = "barchart_months_revenue.png")
# Plot the bar chart.
barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue", col = "blue",
        main = "Revenue chart", border = "red")
dev.off()
Bar chart – Stacked
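A stacked bar chart can be sketched by passing a matrix to barplot(): each column becomes one bar and each row becomes one stacked segment. The revenue figures, region names and output file name below are hypothetical, and the file is written to tempdir() only to keep the example self-contained.

```r
# Hypothetical revenue per region (rows) and month (columns).
months  <- c("Mar", "Apr", "May", "Jun", "Jul")
regions <- c("East", "West", "North")
segment_colors <- c("green", "orange", "brown")
values <- matrix(c(2, 9, 3, 11, 9,
                   4, 8, 7, 3, 12,
                   5, 2, 8, 10, 11),
                 nrow = 3, ncol = 5, byrow = TRUE,
                 dimnames = list(regions, months))

# Give the chart file a name.
out_file <- file.path(tempdir(), "barchart_stacked.png")
png(file = out_file)

# A matrix input with the default beside = FALSE gives stacked bars.
bar_positions <- barplot(values, main = "Total revenue", xlab = "Month",
                         ylab = "Revenue", col = segment_colors)

# Add a legend mapping colors to regions.
legend("topleft", legend = regions, fill = segment_colors)

# Save the file.
dev.off()
```

Setting beside = TRUE in the same call would draw the segments side by side instead of stacked.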
Box Plot Graphs
# Give the chart file a name.
png(file = "boxplot.png")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
        ylab = "Miles Per Gallon", main = "Mileage Data")
# Save the file.
dev.off()
Histogram – Example
# Create data for the graph.
v <- c(9,13,21,8,36,22,12,41,31,33,19)
# Give the chart file a name.
png(file = "histogram.png")
# Create the histogram.
hist(v,xlab = "Weight",col = "yellow",border = "blue")
# Save the file.
dev.off()
barplot
X<-c(0,1,2,3)
Prob<-c(0.208,0.167,0.25,0.375)
N<-c('A','B','C','D')
barplot(Prob, names.arg=N, ylab="Probability", main="RNA Residue Analysis")