Text Book of Principal R Programming For Data Analytics - 05
Book
R for Fundamental Data Analysis in Market Research
Sujata Ramnarayan
2020-08-20
Chapter I
Introduction R & Application
1 Chapter I Introduction
1.1 Overview
R is a programming language and open-source software environment that is widely used for many
purposes, such as data analysis, graphing, and reporting. R is commonly used in statistical
analysis, scientific computing, machine learning, and data visualization. Because it is also a
full programming language, it is more powerful than some other statistical tools for data
processing and analysis.
R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New
Zealand. R made its first appearance in 1993. Since mid-1997 there has been a core group
(called the "R Core Team") who can modify the R source code archive.
At its core, R is an interpreted computer language that enables modular programming with
branches and loops and functions. R can integrate procedures written in C, C++, .Net, Python,
or FORTRAN languages for greater efficiency.
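As a minimal sketch of the modular style described above, the following combines a function, a
branch, and a loop. The function name classify_score, the threshold of 50, and the scores are
assumptions made only for illustration.

```r
# Hypothetical helper: classify a score with a simple branch
classify_score <- function(score, threshold = 50) {
  if (score >= threshold) {
    "pass"
  } else {
    "fail"
  }
}

# Loop over a few illustrative scores and collect the results
scores <- c(35, 50, 72)
results <- character(length(scores))
for (i in seq_along(scores)) {
  results[i] <- classify_score(scores[i])
}
print(results)
```

Keeping the branch inside a function means the loop body stays a single call, which is the kind
of modularity the paragraph refers to.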
R is freely available under the GNU General Public License, and pre-compiled binaries are
provided for various operating systems, including Linux, Windows, and Mac. R is an official
part of the GNU Project, where it is also referred to as GNU S.
1.2 Features of R
The following are the important features of R −
1. R is a well-developed, simple and effective programming language which includes
conditionals, loops, user defined recursive functions and input and output facilities.
2. R has an effective data handling and storage facility.
3. R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
4. R provides a large, coherent and integrated collection of tools for data analysis.
5. R provides graphical facilities for data analysis and display, either directly on the
computer screen or printed on paper.
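To illustrate the operators mentioned in feature 3, here is a small sketch of arithmetic on
vectors and matrices. The variable names v, w, and m and their values are illustrative
assumptions.

```r
# Element-wise arithmetic on vectors
v <- c(1, 2, 3)
w <- c(10, 20, 30)
print(v + w)   # element-wise sum
print(v * 2)   # the scalar 2 is recycled across all elements

# A 2x2 matrix, filled column by column (the default)
m <- matrix(1:4, nrow = 2)
print(m * m)    # element-wise product
print(m %*% m)  # matrix product
```

Note that * works element by element, while %*% performs matrix multiplication.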
R is one of the world's most widely used statistical programming languages. It is a popular
choice among data scientists and is backed by an active and talented community of contributors.
R is taught in universities and used in mission-critical business applications.
(1) Finance
Data science is heavily used in the financial industry, and R is one of the most popular tools
for this role. This is because R provides an advanced statistical suite that can carry out the
necessary financial tasks.
With the help of R, financial institutions are able to measure downside risk, adjust risk
performance, and use visualizations such as candlestick charts, density plots, and drawdown
plots.
R also provides tools for moving averages, autoregression, and time-series analysis, which
form the crux of financial applications. R is widely used for credit risk analysis and
portfolio management at firms like ANZ.
Financial firms also leverage R's time-series statistical methods to model stock-market
movements and predict share prices. R also provides facilities for financial data mining
through packages such as quantmod, pdfetch, TFX, and pwt, which make it easy to extract data
from online sources. With the help of R Shiny, you can also demonstrate your financial
products through vivid and engaging visualizations.
(2) Banking
Just like other financial institutions, banks use R for credit risk modeling and other forms
of risk analytics.
Banks make heavy use of the Mortgage Haircut Model, which allows them to take over the
property in case of a loan default. Mortgage haircut modelling involves the sales price
distribution, the volatility of the sales price, and the calculation of expected shortfall. For
these purposes, R is often used alongside proprietary tools like SAS.
R is also used in conjunction with Hadoop to facilitate the analysis of customer quality,
customer segmentation, and retention.
Bank of America (BOA) uses R for financial reporting. With the help of R, the data scientists
at BOA are able to analyze financial losses and make use of R’s visualization tools.
(3) Healthcare
Genetics, bioinformatics, drug discovery, and epidemiology are some of the fields in healthcare
that make heavy use of R. With the help of R, organizations in these fields are able to crunch
data and process information, providing an essential backdrop for further analysis.
In more advanced work such as drug discovery, R is widely used for analyzing data from
pre-clinical trials and drug-safety data. It also provides its users with a suite for
exploratory data analysis and vivid visualization tools.
R is also popular for the Bioconductor project, a collection of packages that provides various
functionalities for analyzing genomic data. R is also used for statistical modeling in the
field of epidemiology, where data scientists analyze and predict the spread of diseases.
Chapter II
R IDE (Integrated Development Environment)
2 IDE (Integrated Development Environment)
2.1 RStudio
2.1.1 Overview of RStudio
RStudio is an integrated development environment (IDE) for the R programming language.
Some of its features include:
• Customizable workbench with all of the tools required to work with R in one place
(console, source, plots, workspace, help, history, etc.).
• Syntax highlighting editor with code completion.
• Execute code directly from the source editor (line, selection, or file).
• Full support for authoring Sweave and TeX documents.
• Runs on Windows, Mac, and Linux, and has a community-maintained FreeBSD port.
• Can also be run as a server, enabling multiple users to access the RStudio IDE using a
web browser.
The RStudio window has four main areas:
• R Script: As the name suggests, this is where you write code. To run it, select the
line(s) of code and press Ctrl + Enter. Alternatively, click the small ‘Run’ button
located at the top right corner of the R Script pane.
• R Console: This area shows the output of the code you run. You can also type code
directly into the console, but code entered there cannot be traced later. This is
where the R script comes in useful.
• R Environment: This space displays the objects that have been created or loaded,
including data sets, variables, vectors, functions, etc. To check whether data has
been loaded properly in R, always look at this area.
• Graphical Output: This space displays the graphs created during exploratory data
analysis. Besides graphs, you can also select packages here and get help from R’s
embedded official documentation.
Figure 2.3 Adding the R extension to VS Code (see more at: https://code.visualstudio.com/docs/languages/r)
Figure 2.5 Run the R program codes with the OneCompiler IDE (at https://onecompiler.com/r)
• An online compiler and IDE service for 76+ languages and 2 databases, with
collaborative programming and code sharing features.
• A REST-based compiler API to integrate compilers into your applications.
• An IDE plugin solution to include IDEs in your web applications without using APIs.
• An online assessment and course platform for teaching and assessing
programming.
• Fullscreen: side-by-side code and output is available. Click the icon near the Execute
button to switch.
• Dark theme available. Click the icon near the Execute button and select the dark theme.
Figure 2.6 Run the R program codes with the JDoodle IDE (at https://www.jdoodle.com/execute-r-online/)
Figure 2.9 Run the R program codes with Paiza R Online (at https://paiza.io/en/projects/new?language=r)
We can use the class() function to check the data type of a variable:
sales_manager <- "Julian"    # type character/string
message1 <- "This is my first R code"
message2 <- "List my friends and colours"
house_temperature <- 30      # type numeric
print(message1)
cat("class of vars message is ", class(message1), "\n")
print(house_temperature)
cat("class of vars house temperature is ", class(house_temperature), "\n")
Variables can be assigned values using the leftward (<-), rightward (->), and equal-to (=)
operators. The values of variables can be printed using the print() or cat() function. The
cat() function combines multiple items into one continuous print output.
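The three assignment forms and the difference between print() and cat() can be sketched as
follows. The variable names x, y, and z are illustrative assumptions.

```r
# Three ways to assign a value to a variable
x <- 10   # leftward assignment
20 -> y   # rightward assignment
z = 30    # equal-to assignment

# print() shows one object at a time
print(x)

# cat() joins several items into a single line of output
cat("x, y and z are:", x, y, z, "\n")
```

The leftward form <- is the one most commonly seen in R code.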
print(list_colors)
print(list_bananas_prices)
print(list_students)
cat("class of vars students is ",class(list_students),"\n")
cat("class of vars young_boy is ",class(young_boy),"\n")
print(single_parent)
Chapter IV
R – Fundamental Data Structures
4 Chapter IV R – Fundamental Data Structures
Source: https://www.datamentor.io/r-programming/matrix/
In every programming language, we use variables to store many kinds of information, such as
data structures or an object's data. A variable is simply a reserved memory location for
storing a value. This implies that some memory is set aside whenever we create a variable.
We may want to store information of many data types, including character, wide character,
integer, floating point, double floating point, Boolean, etc. Based on the data type of a
variable, the operating system allocates memory and determines what can be kept in the
reserved memory.
Unlike other programming languages such as C and Java, R does not declare variables as being
of a particular data type. Instead, variables are assigned R-objects, and the data type of the
R-object becomes the data type of the variable. In the R programming language, data objects
can be categorized as follows:
• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames
The vector object is the most basic of these objects, and it comes in six atomic types
(logical, integer, double, complex, character, and raw), commonly known as the six classes of
atomic vectors. The atomic vectors serve as the foundation for the rest of the R-objects.
Vectors are therefore the most fundamental R-objects, and they hold items of the classes
listed above. Keep in mind that the number of classes in R is not limited to these six. For
example, we can combine atomic vectors to form an array, whose class is array.
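The six classes of atomic vectors, and the array class built on top of them, can be checked
with class(). This is a minimal sketch; the variable names and values are assumptions chosen
for illustration.

```r
# One example of each atomic vector type
v_logical   <- c(TRUE, FALSE)
v_integer   <- c(1L, 2L)
v_double    <- c(1.5, 2.5)
v_complex   <- c(1+2i, 3-1i)
v_character <- c("a", "b")
v_raw       <- charToRaw("R")

# Print the class of each vector
for (v in list(v_logical, v_integer, v_double, v_complex, v_character, v_raw)) {
  cat(class(v), "\n")
}

# Combining an atomic vector with dimensions gives an object of class "array"
arr <- array(1:8, dim = c(2, 2, 2))
print(class(arr))
```

Note that class() reports a double vector as "numeric".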
In the commonly accepted sense, a vector is a variable, and a factor is a categorical variable.
An array is a table with k dimensions; a matrix is the special case of an array with k = 2.
The elements of an array or a matrix are all of the same mode. A data frame is a composite
table made up of one or more vectors and/or factors that all have the same length but may have
different modes. A 'ts' object is a time series data set, so it carries additional attributes
such as frequency and dates.
Finally, a list can contain any sort of object, including another list. For a vector, its mode
and length are sufficient to describe the data. Other objects need additional information,
which is provided through non-intrinsic attributes. One such attribute is dim, which gives the
dimensions of an object.
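The role of mode, length, and the dim attribute can be seen in a short sketch. The names v and
s and the sample values are assumptions for illustration only.

```r
# A plain vector is fully described by its mode and length
v <- 1:6
print(mode(v))
print(length(v))
print(attributes(v))   # NULL: no extra attributes yet

# Adding a dim attribute turns the same data into a 2x3 matrix
dim(v) <- c(2, 3)
print(attributes(v))
print(is.matrix(v))

# A time series carries extra attributes such as its frequency
s <- ts(c(5, 7, 9, 11), start = c(2020, 1), frequency = 4)
print(frequency(s))
```

The data in v never changes; only the attribute attached to it does.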
4.1 Vectors
A vector is the most common data structure in R. It is a sequence of elements of the same basic
type. The vector() function can be used to create a vector. The default mode is logical, but we
can use constructors such as character(), numeric(), etc., to create a vector of a specific type.
Elements of a vector can be accessed using vector indexing as shown in example 2. The vector
used for indexing can be logical, integer or character vector.
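A brief sketch of creating vectors with vector() and the type constructors, followed by the
three kinds of indexing. The vector prices and its element names are hypothetical values used
only for illustration.

```r
# vector() creates a logical vector by default
empty_flags <- vector(length = 3)
print(empty_flags)

# Constructors create vectors of a specific type
empty_names <- character(2)
empty_nums  <- numeric(2)
print(class(empty_names))
print(class(empty_nums))

# A named numeric vector (values assumed for illustration)
prices <- c(apple = 1.2, banana = 0.5, cherry = 3.0)

print(prices[2])                       # integer index
print(prices[c(TRUE, FALSE, TRUE)])    # logical index
print(prices["cherry"])                # character index (by name)
```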
# Example 1: print three vectors and their classes
print(student_ID)
print(student_active_status)
print(student_city)
cat("type", class(student_ID), "\n")
cat("type", class(student_active_status), "\n")
cat("type", class(student_city), "\n")
4.2 Lists
A list is an R-object which can contain many different types of elements, such as vectors,
functions, and even another list. In the 2nd example, product name, Rate, Available Stock,
and order number are called tags, which make it easier to reference the components of the
list. However, tags are optional. We can create the same list without the tags. In that case,
numeric indices are used by default.
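The difference between a tagged and an untagged list can be sketched as follows. The list
names product and untagged, the tag names, and all values are assumptions for illustration and
are not taken from the original example.

```r
# A list with tags (names), mixing character, numeric and integer elements
product <- list(name = "Notebook", rate = 2.5, stock = 120L, order_no = "A-001")
print(product$name)        # access by tag with $
print(product[["rate"]])   # access by tag with [[ ]]

# The same list without tags: components are reached by numeric index
untagged <- list("Notebook", 2.5, 120L, "A-001")
print(untagged[[2]])

# Single brackets return a sub-list, double brackets return the element itself
str(product[1])
str(product[[1]])
```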
# Example 1: inspect the structure of several lists
str(student_accounting)
str(student_management)
cat("Management students:\n")
str(student_management[1])
str(student_management[2])
str(book_economy_list)
str(book_economy_list[1])
str(book_economy_list[2])
str(book_economy_list[8])
str(book_economy_list[9])
4.3 Matrices
A matrix is a two-dimensional rectangular data set whose elements are all of the same type:
numeric, character, or logical. It can be created by passing a vector to the matrix()
function. An array is the generalization of a matrix to 3 or more dimensions (commonly known
as stratified tables).
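Most of you should be familiar with matrices from mathematics; see the example matrices
A (3x4), B (5x4), and C (2x2). In general, each element of a matrix X is identified by its
row i and column j and is written as x_ij, as shown in matrix X.

$$
A(x) = \begin{pmatrix} 1 & 0 & 0 & 1 \\ 0 & \cos x & -\sin x & 1 \\ 0 & \sin x & \cos x & 1 \end{pmatrix}
\qquad
B = \begin{pmatrix} 10 & 12 & 15 & 18 \\ 40 & 43 & 44 & 45 \\ 60 & 62 & 64 & 66 \\ 30 & 34 & 36 & 38 \\ 50 & 51 & 52 & 53 \end{pmatrix}
$$

$$
C = \begin{pmatrix} 0.5 & 0.8 \\ 0.5 & 0.1 \end{pmatrix}
\qquad
X = \begin{pmatrix} x_{11} & x_{12} & x_{13} & x_{14} \\ x_{21} & x_{22} & x_{23} & x_{24} \\ x_{31} & x_{32} & x_{33} & x_{34} \end{pmatrix}
$$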
Matrices can be created using the matrix() function. According to the R documentation, its
usage is:
Var_data <- matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
Example 1 in R code:
# Two 2x4 matrices filled row by row
M1 <- matrix(c(10, 12, 15, 18, 40, 43, 44, 45), nrow = 2, ncol = 4, byrow = TRUE)
M2 <- matrix(c(100, 100, 100, 100, 200, 200, 200, 200), nrow = 2, ncol = 4, byrow = TRUE)
M3 <- M1 + M2         # element-wise sum
M4 <- as.vector(M3)   # flatten the matrix column by column
# The same data (20 down to 9) filled by row and by column
M5 <- matrix(data = 20:9, nrow = 2, byrow = TRUE)
M6 <- matrix(data = 20:9, nrow = 2, byrow = FALSE)
print(M1); print(dim(M1)); print(nrow(M1)); print(ncol(M1))
print(M2); print(length(M2)); print(M3); print(M4)
print(M5); print(M6)
Matrices, like vectors, can only contain data of one type. We can create numeric matrices,
integer matrices, character matrices, and logical matrices by supplying values of the
corresponding type in the data argument when creating a matrix. Example 2 shows matrices
built from vectors of different types (double, integer, character, and logical).
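As a hedged sketch of matrices of different types, the code below builds one matrix per type
and checks it with typeof(). The matrix names and values are assumptions for illustration and
are not the original Example 2.

```r
# One matrix for each basic type
m_double    <- matrix(c(1.5, 2.5, 3.5, 4.5), nrow = 2)
m_integer   <- matrix(1:4, nrow = 2)
m_character <- matrix(c("a", "b", "c", "d"), nrow = 2)
m_logical   <- matrix(c(TRUE, FALSE, FALSE, TRUE), nrow = 2)

cat(typeof(m_double), typeof(m_integer),
    typeof(m_character), typeof(m_logical), "\n")

# Mixing types forces every element to a single common type (here: character)
m_mixed <- matrix(c(1, "a", TRUE, 2), nrow = 2)
print(m_mixed)
print(typeof(m_mixed))
```

The last matrix shows why a matrix cannot hold mixed data: c() coerces the values before
matrix() ever sees them.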
Matrices have many practical applications:
• Matrices can represent real-world data, such as the traits of a population, and are a
convenient way to tabulate and plot common survey results.
• In robotics and automation, matrices are the basic elements used to describe robot
movements.
• In economics, matrices are used in calculating gross domestic product, which helps in
measuring the efficiency of goods and products.
• In computer-based applications, matrices play a vital role in projecting a
three-dimensional image onto a two-dimensional screen, creating realistic-seeming
motion.
• In physics-related applications, matrices can be applied in the study of electrical
circuits.
4.4 Arrays
While matrices are confined to two dimensions, arrays can have any number of dimensions.
The array() function takes a dim argument that sets the size of each dimension. In the
example below, we create an array made up of two 3x3 matrices.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
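To show how the dimensions of such an array are addressed, here is a small sketch that uses
numbers instead of colors so that each position is easy to trace. The name num_array and its
contents are assumptions for illustration.

```r
# A 3x3x2 array filled with the numbers 1 to 18, column by column
num_array <- array(1:18, dim = c(3, 3, 2))
print(dim(num_array))

# Element in row 2, column 3 of the first 3x3 matrix
print(num_array[2, 3, 1])

# The whole second 3x3 matrix
print(num_array[, , 2])
```

Leaving an index empty, as in num_array[, , 2], selects everything along that dimension.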
4.5 Factors
Factors are R-objects created from a vector. A factor stores the vector along with the
distinct values of its elements as labels. The labels are always character strings,
irrespective of whether the input vector is numeric, character, Boolean, etc. Factors are
useful in statistical modeling.
Factors are created using the factor() function. The nlevels() function gives the count of
levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
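Continuing from the vector above, a minimal sketch of turning it into a factor and counting
its levels might look like this. The name factor_apple is an assumption; the vector is
repeated so that the block runs on its own.

```r
# Create a vector.
apple_colors <- c('green', 'green', 'yellow', 'red', 'red', 'red', 'green')

# Create a factor object from the vector.
factor_apple <- factor(apple_colors)
print(factor_apple)

# The distinct values are stored as levels (labels)
print(levels(factor_apple))

# Count the levels
print(nlevels(factor_apple))

# Frequency of each level
print(table(factor_apple))
```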
4.7 Vectors
To combine a list of items into a vector, use the c() function and separate
the items with commas.
In the example below, we create a vector variable called fruits that combines
strings:
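A minimal sketch of such a vector is given here. The specific fruit names and the second
vector numbers are assumptions chosen for illustration.

```r
# Vector of strings
fruits <- c("banana", "apple", "orange")

# Print fruits
print(fruits)
print(length(fruits))

# The same function combines numbers
numbers <- c(1, 2, 3)
print(numbers)
```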
print(super_sleepers)
4.7.2 Syntax
The basic syntax for creating a matrix in R is −
matrix(data, nrow, ncol, byrow, dimnames)
Following is the description of the parameters used −
• data is the input vector which becomes the data elements of the matrix.
• nrow is the number of rows to be created.
• ncol is the number of columns to be created.
• byrow is a logical flag. If TRUE, the input vector elements are
arranged by row; otherwise (the default) they are arranged by column.
• dimnames is the names assigned to the rows and columns.
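The parameters described above can be seen together in one short sketch. The matrix name
scores, the row and column names, and the values are all hypothetical.

```r
# data, nrow, ncol, byrow and dimnames used together
scores <- matrix(
  c(70, 80, 90, 60, 75, 85),
  nrow = 2,
  ncol = 3,
  byrow = TRUE,
  dimnames = list(c("test1", "test2"), c("Ana", "Budi", "Cici"))
)
print(scores)

# dimnames allow access by row and column name
print(scores["test2", "Budi"])

# With the default byrow = FALSE the same data is arranged by column
by_col <- matrix(c(70, 80, 90, 60, 75, 85), nrow = 2, ncol = 3)
print(by_col)
```

Comparing scores with by_col shows how byrow changes where each input value lands.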
4.7.3 Example
# W4-00
writeLines("\n# Experiment w4-00")
# Create the 3x3 matrix used throughout this example, filled row by row
A <- matrix(
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  nrow = 3,
  ncol = 3,
  byrow = TRUE
)
cat("The 3x3 matrix:\n")
print(A)
cat("Number of rows:\n")
print(nrow(A))
cat("Number of columns:\n")
print(ncol(A))
cat("Number of elements:\n")
print(length(A))
# OR
print(prod(dim(A)))
# Access a submatrix of the matrix
cat("Accessing the first three rows and the first two columns\n")
print(A[1:3, 1:2])
# Concatenate a row to the matrix (the row values are assumed for illustration)
cat("After adding a 4th row\n")
print(rbind(A, c(10, 11, 12)))
# Delete a column from the matrix
cat("Before deleting the 2nd column\n")
print(A)
# 2nd-column deletion
A <- A[, -2]
cat("After deleting the 2nd column\n")
print(A)
Chapter V
Operators
5 Chapter V Operators
Source: https://www.tutorialspoint.com/r/r_operators.htm
Chapter VI
R – Loops & Control Structures
6 R - Loops
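# Two sets of test scores for six students (values assumed for illustration)
students_score_test1 <- c(70, 65, 80, 90, 55, 75)
students_score_test2 <- c(60, 85, 75, 95, 50, 80)
# Example 2: print each student's index, both scores, and their sum
for (i in 1:6) {
  cat("\n", i, students_score_test1[i], students_score_test2[i],
      students_score_test1[i] + students_score_test2[i])
}
# Example 3: print 30 random normal values and accumulate their total
randomdata1 <- rnorm(30)  # Create a vector filled with random normal values
tot <- 0
for (i in seq_along(randomdata1)) {
  cat("\n", format(i, width = 2, justify = "right"),
      format(randomdata1[i], width = 8, justify = "right", digits = 2))
  tot <- tot + randomdata1[i]
}
cat("\ntotal =", tot)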
Source: https://www.geeksforgeeks.org/r-programming-for-data-science/
• RCrawler
• Leaflet
• Janitor
• Plotly