R Programming Language E Notes_B.tech
ON
R PROGRAMMING
B.Tech (IT)
Semester-VII
Syllabus
CODE: OEC-CS-701(III)
SUBJECT NAME: R PROGRAMMING
MODULE-1: INTRODUCTION
MODULE-2: R DATA
Reading Data into R: Reading CSVs, Excel Data, Reading from Databases, Data from
Other Statistical Tools, R Binary Files, Data Included with R, Extracting Data from Web
Sites. Statistical Graphics: Base Graphics, ggplot2.
Group Manipulation: Apply Family, aggregate, plyr, data.table. Data Reshaping: cbind
and rbind, Joins, reshape2. Manipulating Strings: paste, sprintf, Extracting Text, Regular Expressions.
MODULE-5: R STATISTICS & LINEAR MODELING
Index
R Programming Language – An Introduction
Module 1
• R programming is used as a leading tool for machine learning, statistics, and data analysis.
Objects, functions, and packages can easily be created by R.
• It’s a platform-independent language. This means it can be applied to all operating systems.
• It’s an open-source free language. That means anyone can install it in any organization
without purchasing a license.
• R programming language is not only a statistics package but also allows us to integrate with
other languages (C, C++). Thus, you can easily interact with many data sources and statistical
packages.
• The R programming language has a vast community of users and it’s growing day by day.
• R is currently one of the most requested programming languages in the Data Science job
market, which makes it one of the hottest trends nowadays.
Advantages of R:
• R is the most comprehensive statistical analysis package; new technologies and concepts
often appear first in R.
• R is open source, so you can run R anywhere and at any time.
• R programming language is suitable for GNU/Linux and Windows operating system.
• R programming is cross-platform which runs on any operating system.
• In R, everyone is welcome to provide new packages, bug fixes, and code enhancements.
Disadvantages of R:
• In the R programming language, the standard of some packages is less than perfect.
• R gives the programmer little control over memory management, so an R program may
consume all available memory.
• Because R is community-maintained, there is no single vendor to complain to if something
doesn’t work.
Applications of R:
• We use R for Data Science. It gives us a broad variety of libraries related to statistics. It also
provides the environment for statistical computing and design.
• R is used by many quantitative analysts as its programming tool. Thus, it helps in data
importing and cleaning.
• R is a very prevalent language, so many data analysts and research programmers use it.
Hence, it is used as a fundamental tool in finance.
• Tech giants like Google, Facebook, Bing, Accenture, Wipro and many more use R
nowadays.
History of R Programming
R was first implemented in the early 1990s by Robert Gentleman and Ross Ihaka, both
faculty members at the University of Auckland. Robert and Ross established R as an open
source project in 1995. Since 1997, the R project has been managed by the R Core Group, and
in February 2000, R 1.0.0 was released.
R vs Python
R programming and Python are both used extensively for Data Sciences. Both are very
useful and open source languages as well.
R Language is used for machine learning algorithms, linear regression, time series, statistical
inference, etc. It was designed by Ross Ihaka and Robert Gentleman in 1993.
Feature: Objective
R: It has many features which are useful for statistical analysis and representation.
Python: It can be used to develop GUI applications and web applications, as well as for embedded systems.

Feature: Integrated development environment
R: Various popular R IDEs are RStudio, RKward, R Commander, etc.
Python: Various popular Python IDEs are Spyder, Eclipse+PyDev, Atom, etc.
Introduction to R Studio
R Studio is an integrated development environment (IDE) for R. An IDE is a GUI where you can
write your code, see the results, and also see the variables that are generated during the
course of programming.
• R Studio is available as both Open source and Commercial software.
• R Studio is also available as both Desktop and Server version.
• R Studio is also available for various platforms such as Windows, Linux, and macOS.
R Studio can be downloaded from its official website rstudio.com, where installation
instructions for Windows are also available.
After the Installation process is over, the R Studio Interface looks like:
• The console panel(left panel) is the place where R is waiting for you to tell it what to do, and
see the results that are generated when you type in the commands.
• To the top right, you have Environmental/History panel. It contains 2 tabs:
• Environment tab: It shows the variables that are generated during the course of
programming in a workspace which is temporary.
• History tab: In this tab, you’ll see all the commands that are used till now from the start of
usage of R Studio.
• To the right bottom, you have another panel, which contains multiple tabs, such as files,
plots, packages, help, and viewer.
• The Files tab shows the files and directories that are available within the default workspace
of R.
• The Plots tab shows the plots that are generated during the course of programming.
• The Packages tab helps you to look at what are the packages that are already installed in the
R Studio and it also gives a user interface to install new packages.
• The Help tab is the most important one where you can get help from the R Documentation
on the functions that are in built-in R.
• The final and last tab is that the Viewer tab which can be used to see the local web content
that’s generated using R.
Installation of R
Installing R to the local computer is very easy. First, we must know which
operating system we are using so that we can download it accordingly.
Install R in Windows
Step 1:
Step 2:
Once the downloading is finished, we have to run the setup of R in the following way:
1) Select the path where we want to install R and proceed to Next.
2) Select all components which we want to install, and then we will proceed to Next.
4) When we proceed to Next, the installation of R on our system will get started:
RStudio IDE
Installation of RStudio
RStudio Desktop is available for both Windows and Linux. The open-source
RStudio Desktop edition is very simple to install on both operating systems.
The licensed version of RStudio has some more features than the open-source one.
Before installing RStudio, let's see what additional features the licensed version
of RStudio provides.
Installation on Windows/Linux
Step 1:
In the first step, we visit the RStudio official site and click on Download RStudio.
Step 2:
In the next step, we will select the RStudio Desktop open-source license and
click on Download.
Step 3:
In the next step, we will select the appropriate installer. When we select
the installer, the downloading of the RStudio setup will start.
Step 4:
In the next step, we will run our setup in the following way:
1) Click on Next.
2) Click on Install.
3) Click on finish.
4) RStudio is ready to work.
R Packages
R packages are collections of R functions, sample data, and compiled code.
In the R environment, these packages are stored under a directory called
"library." During installation, R installs a set of packages. We can add packages
later when they are needed for some specific purpose. Only the default packages
will be available when we start the R console. Other packages which are already
installed have to be loaded explicitly before they can be used by the R program.
There is the following list of commands to be used to check, verify, and use the R packages.
To check the available R packages, we have to find the library location in which
R packages are contained. R provides the .libPaths() function to find the library
locations.
.libPaths()
When the above code executes, it produces the following output, which may
vary depending on the local settings of our PCs and laptops.
[1] "C:/Users/ajeet/OneDrive/Documents/R/win-library/3.6"
[2] "C:/Program Files/R/R-3.6.1/library"
R provides the library() function, which allows us to get the list of all the installed packages.
library()
When we execute the above function, it produces the following result, which
may vary depending on the local settings of our PCs or laptops.
Like the library() function, R provides the search() function to get all packages
currently loaded in the R environment.
search()
When we execute the above code, it will produce the following result, which
may vary depending on the local settings of our PCs and laptops:
In R, there are two techniques to add new R packages. The first technique is
installing package directly from the CRAN directory, and the second one is to
install it manually after downloading the package to our local system.
The following command is used to get a package directly from the CRAN
webpage and install it in the R environment. We may be prompted
to choose the nearest mirror; choose the one appropriate to our location.
install.packages("Package Name")
install.packages("XML")
Output
Once the downloading has finished, we will use the following command to install the
package manually (note that forward slashes are used in the file path):
install.packages("C:/Users/ajeet/OneDrive/Desktop/graphics/xml2_1.2.2.zip",
repos = NULL, type = "source")
Load Package to Library
We cannot use a package in our code until it is loaded into the current
R environment. We also need to load a package which was installed
previously but is not available in the current environment. A package is loaded
with the library() function, for example:
library("xml2")
Syntax of R Programming
R Command Prompt
The code of "Hello World!" in R programming can be written as:
In the above code, the first statement defines a string variable string, where we
assign a string "Hello World!". The next statement print() is used to print the
value which is stored in the variable string.
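The listing itself is not reproduced above; a minimal sketch matching that description is:
# Assign the string "Hello World!" to a variable and print it
string <- "Hello World!"
print(string)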
R Script File
The R script file is another way in which we can write our programs, and then
we execute those scripts at our command prompt with the help of the R interpreter
known as Rscript. We make a text file and write the following code. We will
save this file with the .R extension as:
Demo.R
To execute this file in Windows and other operating systems, the process will
remain the same, as mentioned below.
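The original listing of Demo.R and the command line are not shown; a sketch (the contents of Demo.R are assumed here) could be:
# Demo.R
string <- "Hello World!"
print(string)
Then, from the command prompt, the script is executed with:
Rscript Demo.R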
Comments
Comments are text that the R interpreter ignores when the program executes. R has no
separate syntax for multi-line comments, but several lines can be skipped together by
putting the code in a false block such as if(FALSE) { ... }.
Single-line comment
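A single-line comment starts with the # symbol and runs to the end of the line:
# This is a single-line comment
x <- 5   # everything after the # on this line is also a comment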
R-Objects
Generally, while doing programming in any programming
language, you need to use various variables to store various
information. Variables are nothing but reserved memory
locations to store values. This means that, when you create a
variable you reserve some space in memory.
You may like to store information of various data types like
character, wide character, integer, floating point, double floating
point, Boolean etc. Based on the data type of a variable, the
operating system allocates memory and decides what can be
stored in the reserved memory.
In contrast to other programming languages like C and Java, in R
the variables are not declared with a data type. The variables
are assigned R-objects, and the data type of the R-object
becomes the data type of the variable. There are many types of
R-objects. The frequently used ones are:
• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames
The simplest of these objects is the vector object and there are
six data types of these atomic vectors, also termed as six classes
of vectors. The other R-Objects are built upon the atomic vectors.
Data Type: Logical
Example: TRUE, FALSE
Verify:
v <- TRUE
print(class(v))
it produces the following result:
[1] "logical"

Data Type: Numeric
Example: 12.3, 5, 999
Verify:
v <- 23.5
print(class(v))
it produces the following result:
[1] "numeric"

Data Type: Integer
Example: 2L, 34L, 0L
Verify:
v <- 2L
print(class(v))
it produces the following result:
[1] "integer"

Data Type: Complex
Example: 3 + 2i
Verify:
v <- 2+5i
print(class(v))
it produces the following result:
[1] "complex"

Data Type: Character
Example: 'a', "good", "TRUE", '23.4'
Verify:
v <- "TRUE"
print(class(v))
it produces the following result:
[1] "character"

Data Type: Raw
Example: "Hello" is stored as 48 65 6c 6c 6f
Verify:
v <- charToRaw("Hello")
print(class(v))
it produces the following result:
[1] "raw"
Vectors
When you want to create a vector with more than one element, you should use the c()
function, which combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
Lists
A list is an R-object which can contain many different types of
elements inside it, like vectors, functions, and even another list
inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
print(list1)
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x)  .Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set. It can be
created using a vector input to the matrix() function.
# Create a matrix.
M <- matrix(c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
Arrays
While matrices are confined to two dimensions, arrays can be of
any number of dimensions. The array() function takes a dim
attribute which creates the required number of dimensions. In the
example below we create an array with two elements, each of which is a
3x3 matrix.
# Create an array.
a <- array(c('green','yellow'),dim=c(3,3,2))
print(a)
Factors
Factors are the R-objects which are created using a vector. It
stores the vector along with the distinct values of the elements in
the vector as labels. The labels are always character, irrespective
of whether the input vector is numeric, character, or Boolean. They are
useful in statistical modeling.
Factors are created using the factor() function. The nlevels()
function gives the count of levels.
# Create a vector.
Data Frames
Data frames are tabular data objects. Unlike a matrix, in a data
frame each column can contain different modes of data. The first
column can be numeric while the second column can be
character and the third column can be logical. A data frame is a list of
vectors of equal length.
Data Frames are created using the data.frame() function.
print(BMI)
  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26
To make the best of the R language, we’ll need a strong understanding of the basic data types
and data structures and how to operate on them.
Data structures are very important to understand because these are the objects you will
manipulate on a day-to-day basis in R.
Everything in R is an object.
R has six basic data types. In addition to the five listed below, there is also raw, which is not
discussed further here.
• character
• numeric (real or decimal)
• integer
• logical
• complex
Elements of these data types may be combined to form data structures, such as atomic vectors.
When we call a vector atomic, we mean that the vector only holds data of a single data type.
Below are examples of atomic character vectors, numeric vectors, integer vectors, etc.
R provides many functions to examine features of vectors and other objects, for example
# Example
x <- "dataset"
typeof(x)
[1] "character"
attributes(x)
NULL
y <- 1:10
y
[1] 1 2 3 4 5 6 7 8 9 10
typeof(y)
[1] "integer"
length(y)
[1] 10
z <- as.numeric(y)
z
[1] 1 2 3 4 5 6 7 8 9 10
typeof(z)
[1] "double"
R has many data structures. These include
• atomic vector
• list
• matrix
• data frame
• factors
Vectors
A vector is the most common and basic data structure in R and is pretty much the workhorse
of R. Technically, vectors can be one of two types:
• atomic vectors
• lists
although the term “vector” most commonly refers to the atomic types not to lists.
Examining Vectors
The functions typeof(), length(), class() and str() provide useful information about your
vectors and R objects in general.
z <- c("Sarah", "Tracy", "Jon")
typeof(z)
[1] "character"
length(z)
[1] 3
class(z)
[1] "character"
str(z)
chr [1:3] "Sarah" "Tracy" "Jon"
Adding Elements
The function c() (for combine) can also be used to add elements to a vector.
z <- c(z, "Annette")
z
[1] "Sarah" "Tracy" "Jon" "Annette"
z <- c("Greg", z)
z
[1] "Greg" "Sarah" "Tracy" "Jon" "Annette"
Missing Data
R supports missing data in vectors. They are represented as NA (Not Available) and can be
used for all the vector types covered in this lesson:
x <- c(0.5, NA, 0.7)
x <- c(TRUE, FALSE, NA)
x <- c("a", NA, "c", "d", "e")
R will create a resulting vector with a mode that can most easily accommodate all the elements
it contains. This conversion between modes of storage is called “coercion”. When R converts
the mode of storage based on its content, it is referred to as “implicit coercion”. For instance,
can you guess what the following do (without running them first)?
xx <- c(1.7, "a")
xx <- c(TRUE, 2)
xx <- c("a", TRUE)
You can also control how vectors are coerced explicitly using the as.<class_name>() functions:
as.numeric("1")
[1] 1
as.character(1:2)
[1] "1" "2"
Objects Attributes
Objects can have attributes. Attributes are part of the object. These include:
• names
• dimnames
• dim
• class
• attributes (contain metadata)
You can also glean other attribute-like information such as length (works on vectors and lists)
or number of characters (for character strings).
length(1:10)
[1] 10
nchar("Software Carpentry")
[1] 18
Matrix
In R matrices are an extension of the numeric or character vectors. They are not a separate type
of object but simply an atomic vector with dimensions; the number of rows and columns. As
with atomic vectors, the elements of a matrix must be of the same data type.
m <- matrix(nrow = 2, ncol = 2)
m
[,1] [,2]
[1,] NA NA
[2,] NA NA
dim(m)
[1] 2 2
You can check that matrices are vectors with a class attribute of matrix by
using class() and typeof().
m <- matrix(c(1:3))
class(m)
[1] "matrix" "array"
typeof(m)
[1] "integer"
While class() shows that m is a matrix, typeof() shows that fundamentally the matrix is an
integer vector.
Data types of matrix elements
Suppose FOURS is a matrix whose every element is the number 4 (for example,
FOURS <- matrix(4, nrow = 2, ncol = 2)). What does typeof(FOURS) return?
Solution
We know that typeof(FOURS) will also return "double" since matrices are made of elements
of the same data type. Note that you could do something like as.character(FOURS) if you
needed the elements of FOURS as characters.
Matrices in R are filled column-wise.
m <- matrix(1:6, nrow = 2, ncol = 3)
Other ways to construct a matrix
m <- 1:10
dim(m) <- c(2, 5)
This takes a vector and transforms it into a matrix with 2 rows and 5 columns.
Another way is to bind columns or rows using rbind() and cbind() (“row bind” and “column
bind”, respectively).
x <- 1:3
y <- 10:12
cbind(x, y)
x y
[1,] 1 10
[2,] 2 11
[3,] 3 12
rbind(x, y)
[,1] [,2] [,3]
x 1 2 3
y 10 11 12
You can also use the byrow argument to specify how the matrix is filled. From R’s own
documentation:
mdat <- matrix(c(1, 2, 3, 11, 12, 13),
nrow = 2,
ncol = 3,
byrow = TRUE)
mdat
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 11 12 13
Elements of a matrix can be referenced by specifying the index along each dimension (e.g.
“row” and “column”) in single square brackets.
mdat[2, 3]
[1] 13
List
In R lists act as containers. Unlike atomic vectors, the contents of a list are not restricted to a
single mode and can encompass any mixture of data types. Lists are sometimes called generic
vectors, because the elements of a list can be of any type of R object, even lists containing
further lists. This property makes them fundamentally different from atomic vectors.
A list is a special type of vector. Each element can be a different type.
Create lists using list() or coerce other objects using as.list(). An empty list of the required
length can be created using vector()
x <- list(1, "a", TRUE, 1+4i)
x
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[1] 1+4i
x <- vector("list", length = 5) # empty list
length(x)
[1] 5
The content of elements of a list can be retrieved by using double square brackets.
x[[1]]
NULL
Vectors can be coerced to lists as follows:
x <- 1:10
x <- as.list(x)
length(x)
[1] 10
Elements of a list can be named (i.e. lists can have the names attribute)
xlist <- list(a = "Karthik Ram", b = 1:10, data = head(mtcars))
xlist
$a
[1] "Karthik Ram"
$b
[1] 1 2 3 4 5 6 7 8 9 10
$data
Lists can be extremely useful inside functions. Because the functions in R are able to return
only a single object, you can “staple” together lots of different kinds of results into a single
object that a function can return.
A list does not print to the console like a vector. Instead, each element of the list starts on a
new line.
Elements are indexed by double brackets. Single brackets will still return a(nother) list. If the
elements of a list are named, they can be referenced by the $ notation (i.e. xlist$data).
Data Frame
A data frame is a very important data type in R. It’s pretty much the de facto data structure for
most tabular data and what we use for statistics.
A data frame is a special type of list where every element of the list has same length (i.e. data
frame is a “rectangular” list).
Data frames can have additional attributes such as rownames(), which can be useful for
annotating data, like subject_id or sample_id. But most of the time they are not used.
Some additional information on data frames:
• Usually created by read.csv() and read.table(), i.e. when importing the data into R.
• Assuming all columns in a data frame are of same type, data frame can be converted to a matrix
with data.matrix() (preferred) or as.matrix(). Otherwise type coercion will be enforced and the
results may not always be what you expect.
• Can also create a new data frame with data.frame() function.
• Find the number of rows and columns with nrow(dat) and ncol(dat), respectively.
• Rownames are often automatically generated and look like 1, 2, …, n. Consistency in
numbering of rownames may not be honored when rows are reshuffled or subset.
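A small sketch tying these points together (the column names and values are only illustrative):
# Create a data frame and inspect its dimensions
dat <- data.frame(id = 1:3, score = c(8.5, 9.0, 7.5))
nrow(dat)          # [1] 3
ncol(dat)          # [1] 2
data.matrix(dat)   # converts the data frame to a numeric matrix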
Lists can contain elements that are themselves multi-dimensional (e.g. a list can contain data
frames or other types of objects). Lists can also contain elements of any length, therefore lists
do not necessarily have to be “rectangular”. However, in order for the list to qualify as a data
frame, the length of each element has to be the same.
Column Types in Data Frames
Knowing that data frames are lists, can columns be of different type?
What type of structure do you expect to see when you explore the structure of
the PlantGrowth data frame? Hint: Use str().
Solution
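A sketch of the exploration (PlantGrowth is a dataset shipped with R; the abbreviated output below summarises its structure):
str(PlantGrowth)
# 'data.frame': 30 obs. of 2 variables:
#  $ weight: num  ...
#  $ group : Factor w/ 3 levels "ctrl","trt1","trt2": ...
# So we expect a data frame with a numeric column and a factor column.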
Key Points
• R’s basic data types are character, numeric, integer, complex, and logical.
• R’s basic data structures include the vector, list, matrix, data frame, and factors. Some of these
structures require that all members be of the same data type (e.g. vectors, matrices) while
others permit multiple data types (e.g. lists, data frames).
• Objects may have attributes, such as name, dimension, and class.
Vectors
When you want to create vector with more than one element, you should use c() function
which means to combine the elements into a vector.
Live Demo
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
Lists
A list is an R-object which can contain many different types of elements inside it like vectors,
functions and even another list inside it.
Live Demo
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
[[2]]
[1] 21.3
[[3]]
function (x) .Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to
the matrix function.
Live Demo
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result −
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"
Arrays
While matrices are confined to two dimensions, arrays can be of any number of dimensions.
The array function takes a dim attribute which creates the required number of dimension. In
the below example we create an array with two elements which are 3x3 matrices each.
Live Demo
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
When we execute the above code, it produces the following result −
, , 1
     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"

, , 2
     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
# Create a factor object and count the levels.
factor_apple <- factor(apple_colors)
print(factor_apple)
print(nlevels(factor_apple))

# Create the data frame.
BMI <- data.frame(
   gender = c("Male", "Male", "Female"),
   height = c(152, 171.5, 165),
   weight = c(81, 93, 78),
   Age = c(42, 38, 26)
)
print(BMI)
When we execute the above code, it produces the following result −
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
Module 2
R CSV Files
Storing data in Excel spreadsheets is the most common way of storing data
used by data scientists. There are lots of packages in R designed for
accessing data from Excel spreadsheets. Users often find it easier, however, to save their
spreadsheets as CSV files and then use R's built-in functionality to read and manipulate the data.
R allows us to read data from files which are stored outside the R environment.
Let's start understanding how we can read and write data into CSV files. The
file should be present in the current working directory so that R can read it. We
can also set our directory and read file from there.
In R, getwd() and setwd() are the two useful functions. The getwd() function is
used to check on which directory the R workspace is pointing. And the setwd()
function is used to set a new working directory to read and write files from that
directory.
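For example (the path below is only an illustration):
getwd()                             # prints the current working directory
setwd("C:/Users/ajeet/Documents")   # changes the working directory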
CSV files are basically text files wherein the values of each row are separated by a delimiter,
such as a comma or a tab. In this section, we will use the following sample CSV file:
sample.csv
id, name, department, salary, projects
1, A, IT, 60754, 4
2, B, Tech, 59640, 2
3, C, Marketing, 69040, 8
4, D, Marketing, 65043, 5
5, E, Tech, 59943, 2
6, F, IT, 65000, 5
7, G, HR, 69000, 7
csv_data <- read.csv("sample.csv")
print(csv_data)
print(ncol(csv_data))
print(nrow(csv_data))
Output:
  id name department salary projects
1  1    A         IT  60754        4
2  2    B       Tech  59640        2
3  3    C  Marketing  69040        8
4  4    D  Marketing  65043        5
5  5    E       Tech  59943        2
6  6    F         IT  65000        5
7  7    G         HR  69000        7
[1] 5
[1] 7
The header is by default set to TRUE in the function. The header row is not included in the
count of rows, therefore this CSV has 7 rows and 5 columns.
min_pro <- min(csv_data$projects)
print(min_pro)
Output:
[1] 2
Aggregator functions (min, max, count, etc.) can be applied to the CSV data. Here
the min() function is applied to the projects column using the $ symbol. The minimum number of
projects, which is 2, is returned.
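The code that produces new_csv is not shown in the notes; one plausible sketch, filtering the data frame with subset() (the filter condition is an assumption used only for illustration), is:
# Keep only the rows with salary above 60000 (assumed condition)
new_csv <- subset(csv_data, salary > 60000)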
print (new_csv)
Output:
write.csv(new_csv, "new_sample.csv")
new_data <- read.csv("new_sample.csv")
print(new_data)
Output:
Excel files are of extension .xls, .xlsx and .csv (comma-separated values). To start working
with Excel files in R, we need to first import them into RStudio or any other R-supporting
IDE (Integrated Development Environment).
First install the readxl package in R to load Excel files. Various methods, including their
sub-parts, are demonstrated further below.
xlsx is a file extension of a spreadsheet file format which was created by
Microsoft to work with Microsoft Excel. In the present era, Microsoft Excel is
a widely used spreadsheet program that stores data in the .xls or .xlsx format. R
allows us to read data directly from these files by providing some Excel-specific
packages. There are lots of packages such as XLConnect, xlsx, gdata, etc. We
will use the xlsx package, which not only allows us to read data from an Excel file
but also allows us to write data to it.
Our primary task is to install the "xlsx" package with the help of the install.packages()
command. When we install the xlsx package, it will ask us to install some
additional packages on which this package depends. For installing the
additional packages, the same command is used with the required package
name. The syntax of the install command is:
install.packages("package name")
Example
1. install.packages("xlsx")
Output
In R, the grepl() and any() functions can be used together to verify a package: if the package
is installed, the combination returns TRUE, otherwise FALSE.
For loading purposes, we use the library() function with the appropriate package
name. This function loads any additional required packages as well.
Example
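A short sketch of verifying and loading the package (checking installed.packages() with grepl()/any() is one common way to do it):
# Verify that the package is installed
any(grepl("xlsx", installed.packages()))
# Load the package and its dependencies
library("xlsx")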
Sample_data1.xlsx
Sample_data2.xlsx
Reading Files
The two Excel files Sample_data1.xlsx and Sample_data2.xlsx are read from the working
directory.
install.packages("readxl")
library(readxl)
head(Data1)
head(Data2)
The excel files are loaded into variable Data_1 and Data_2 as a dataframes and then variable
Data_1 and Data_2 is called that prints the dataset.
Modifying Files
The Sample_data1.xlsx and Sample_data2.xlsx files are modified.
Data1$Pclass <- 0
Data2$Embarked <- "S"
head(Data1)
head(Data2)
The value of the Pclass attribute (variable) of Data1 is modified to 0. The value of the
Embarked attribute (variable) of Data2 is modified to "S".
Deleting Content from files
The variable or attribute is deleted from Data1 and Data2 datasets containing
Sample_data1.xlsx and Sample_data2.xlsx files.
Data1 <- Data1[, -2]
Data2 <- Data2[, -3]
Data1
Data2
The - sign is used to delete columns (attributes) from a dataset. Column 2 is deleted from the
Data1 dataset and column 3 is deleted from the Data2 dataset.
Merging Files
The two Excel datasets Data1 and Data2 are merged using the merge() function, which is in the
base package and comes pre-installed in R.
# Merging Files
Data3 <- merge(Data1, Data2)   # merges on the columns the two data frames have in common
head(Data3)
Data1 and Data2 are merged with each other and the resultant data frame is stored in the Data3
variable.
Creating new columns
New columns or features can be easily created in Data1 and Data2 datasets.
Data1$Num <- 0
Data2$Code <- "mission"
head(Data1)
head(Data2)
Num is a new feature that is created with 0 as the default value in the Data1 dataset. Code is a
new feature that is created with "mission" as the default string in the Data2 dataset.
Writing Files
After performing all operations, Data1 and Data2 are written into new files
using the write_xlsx() function from the writexl package.
install.packages("writexl")
# Loading package
library(writexl)
# Writing Data1
write_xlsx(Data1, "New_Data1.xlsx")
# Writing Data2
write_xlsx(Data2, "New_Data2.xlsx")
In the computer science world, text files contain data that can easily be understood by humans:
letters, numbers, and other characters. Binary files, on the other hand, contain 1s
and 0s that only computers can interpret; the information stored in a binary file can't be read
directly by humans, because its bytes translate to characters and symbols that include
non-printable characters.
It sometimes happens that data produced by other programs has to be processed by R as a
binary file, and R may likewise have to create binary files that can be shared with other
programs. The most important operations on a binary file are reading from it and writing
to it, which R supports through the readBin() and writeBin() functions.
Example:
# R program to illustrate
# working with binary file
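The body of this example is missing above; a minimal sketch of writing and reading a binary file with writeBin() and readBin() (the file name and values are assumptions) is:
# Open a connection in binary write mode and write some integers
con <- file("myfile.bin", "wb")
writeBin(c(1L, 2L, 3L, 4L), con)
close(con)

# Re-open the file in binary read mode and read the integers back
con <- file("myfile.bin", "rb")
values <- readBin(con, what = integer(), n = 4)
close(con)
print(values)   # [1] 1 2 3 4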
Example:
# R program to illustrate
# working with binary file
Output:
ID Name Age Pin
[1, ] 0 0 0 0
[2, ] 1072693248 1074266112 1074790400 1073741824
[3, ] 0 0 0 0
[4, ] 1073741824 1074790400 1074266112 1072693248
The Data1 dataset is written to the New_Data1.xlsx file and the Data2 dataset is written
to the New_Data2.xlsx file. Both files are saved in the present working directory.
Visualization in R
In R, we can create visually appealing data visualizations by writing a few lines of code.
For this purpose, we use the diverse functionalities of R. Data visualization is an
efficient technique for gaining insight into data through a visual medium. With the
help of visualization techniques, a human can easily obtain information about hidden
patterns in data that might otherwise be overlooked.
By using data visualization techniques, we can work with large datasets to
efficiently obtain key insights about them.
R Visualization Packages
R provides a series of packages for data visualization. These packages are as follows:
1) plotly
The plotly package provides online interactive, publication-quality graphs. This package builds
upon the JavaScript library plotly.js.
2) ggplot2
R allows us to create graphics declaratively. R provides the ggplot2 package for this
purpose. This package is famous for its elegant and quality graphs, which set it apart
from other visualization packages.
3) tidyquant
tidyquant is a financial package that is used for carrying out quantitative financial
analysis. It belongs to the tidyverse universe as a financial package used
for importing, analyzing, and visualizing data.
4) taucharts
Data plays an important role in taucharts. The library provides a declarative interface
for rapid mapping of data fields to visual properties.
5) ggiraph
It is a tool that allows us to create dynamic ggplot graphs. This package allows
us to add tooltips, JavaScript actions, and animations to the graphics.
6) geofacets
The geofacet package provides geofaceting for ggplot2: a grid of sub-plots is arranged
according to the geographic layout of the corresponding regions.
7) googleVis
googleVis provides an interface between R and Google's chart tools. With the help of
this package, we can create web pages with interactive charts based on R data frames.
8) RColorBrewer
This package provides color schemes for maps and other graphics, which were
designed by Cynthia Brewer.
9) dygraphs
The dygraphs package provides an R interface to the dygraphs JavaScript charting library,
mainly for plotting time-series data.
10) shiny
R allows us to develop interactive and aesthetically pleasing web apps with the shiny
package. This package provides various extensions with HTML widgets, CSS,
and JavaScript.
R Graphics
Graphics play an important role in carrying out the important features of the data.
Graphics are used to examine marginal distributions, relationships between variables,
and summary of very large data. It is a very important complement for many statistical
and computational techniques.
Standard Graphics
Scatterplots
Piecharts
Boxplots
Barplots etc.
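A few base-graphics sketches using the built-in mtcars dataset (the dataset and variables are chosen here only for illustration):
# Scatterplot
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
# Boxplot of mpg by number of cylinders
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")
# Barplot and pie chart of cylinder counts
barplot(table(mtcars$cyl), xlab = "Cylinders", ylab = "Count")
pie(table(mtcars$cyl), main = "Cars by cylinder count")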
There are some points which are essential to understand:
The functions of graphics devices produce output which depends on the active
graphics device. A screen is the default and most frequently used device.
Other R graphics devices, such as the PDF device, the JPEG device, etc., are also used.
We just need to open the graphics output device which we want; R then takes care of
producing the type of output required by the device.
For producing a certain plot on the screen or as a GIF graphics file, the R code
should be exactly the same; we only need to open the target output device beforehand.
Several devices can be open at the same time, but there will be only one active device.
There are some key elements of a statistical graphic. These elements are the basics
of the grammar of graphics. Let's discuss each of the elements one by one to gain
a basic knowledge of graphics.
1) Data
Data is the most crucial thing which is processed and generates an output.
2) Aesthetic Mappings
Aesthetic mappings are one of the most important elements of a statistical graphic.
It controls the relation between graphics variables and data variables. In a scatter
plot, it also helps to map the temperature variable of a data set into the X variable.
In graphics, it helps to map the species of a plant into the color of dots.
3) Geometric Objects
Geometric objects are used to express each observation by a point using the
aesthetic mappings. It maps two variables in the data set into the x,y variables of the
plot.
4) Statistical Transformations
Statistical transformations allow us to calculate a statistical analysis of the data in the plot.
A statistical transformation uses the data and approximates it, for example with a regression
line through the x,y coordinates, or by counting occurrences of certain values.
5) Scales
It is used to map the data values into values present in the coordinate system of the graphics
device.
6) Coordinate system
The coordinate system determines how the x and y aesthetics are mapped onto the plotting surface.
7) Faceting
Faceting is used to split the data into subgroups and draw a sub-plot for each subgroup.
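A compact ggplot2 sketch that touches most of these elements (data, aesthetic mappings, a geometric object, a statistical transformation, and faceting); the mtcars dataset and variables are assumptions used only for illustration:
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +   # data and aesthetic mappings
  geom_point() +                         # geometric object
  geom_smooth(method = "lm") +           # statistical transformation (regression line)
  facet_wrap(~ cyl)                      # faceting by number of cylinders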
Benefits of Data Visualization
1. Understanding
Because visualized information is easier to interpret than raw numbers, it can attract a wider
range of audiences. It also promotes the widespread use of business insights, which helps in
making better decisions.
2. Efficiency
Visualization lets large amounts of information be absorbed quickly, which makes analysis
more efficient.
3. Location
Apps utilizing features such as geographic maps and GIS can be particularly
relevant to wider business questions when location is a relevant factor. We can use
maps to show business insights from various locations, and also consider the seriousness
of the issues, the reasons behind them, and the working groups needed to address them.
Disadvantages of Data Visualization
1. Cost
Data visualization software, and the skills needed to use it well, can be expensive to acquire.
2. Distraction
At times, data visualization apps create highly complex and fancy graphics-rich reports and
charts, which may entice users to focus more on the form than the function. If visual appeal
is added first, the overall value of the graphic representation will be minimal. In a
resource-constrained setting, it is necessary to understand how resources can best be used,
and not to get caught in the graphics trend without a clear purpose.
Module 3
Function Creation:
A function is a set of statements organized together to perform a
specific task. R has a large number of built-in functions, and the
user can create their own functions.
In R, a function is an object, so the R interpreter is able to pass
control to the function, along with arguments that may be necessary for the
function to accomplish its actions.
Function Definition
An R function is created by using the keyword function. The
basic syntax of an R function definition is as follows:
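The syntax block is not reproduced above; the usual general form is:
function_name <- function(arg_1, arg_2, ...) {
   # Function body
}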
Function Components
The different parts of a function are:
• Function Name: This is the actual name of the function. It is
stored in the R environment as an object with this name.
• Arguments: Placeholders for the values passed to the function when it is invoked;
arguments are optional and can have default values.
• Function Body: The collection of statements that defines what the function does.
• Return Value: The last expression evaluated in the function body.
Built-in Functions
Simple examples of built-in functions are seq(), mean(), max(), sum(x) and
paste(...), etc. They are directly called by user-written programs. You can refer to the most
widely used R functions.
User-defined Function
We can create user-defined functions in R. They are specific to
what a user wants, and once created they can be used like the
built-in functions. Below is an example of how a function is
created and used.
# Create a function to print the squares of numbers in sequence.
new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}
Calling a Function
# Call the function new.function supplying 6 as an argument.
new.function(6)
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36

Calling a Function without an Argument
new.function <- function() {
   for(i in 1:5) {
      print(i^2)
   }
}
new.function()
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25

Calling a Function with Argument Values (by position and by name)
new.function <- function(a, b, c) {
   result <- a * b + c
   print(result)
}
# Call the function by position of the arguments.
new.function(5, 3, 11)
[1] 26
# Call the function by names of the arguments.
new.function(a = 11, b = 5, c = 3)
[1] 58

Calling a Function with Default Arguments
new.function <- function(a = 3, b = 6) {
   result <- a * b
   print(result)
}
# Call the function without giving any argument.
new.function()
[1] 18
# Call the function with new values for the arguments.
new.function(9, 5)
[1] 45

Lazy Evaluation of Functions
Arguments are evaluated lazily: they are evaluated only when the function body needs them.
new.function <- function(a, b) {
   print(a^2)
   print(a)
   print(b)
}
# Call the function with only one argument: a is used, so no error occurs until b is needed.
new.function(6)
[1] 36
[1] 6
Error in print(b) : argument "b" is missing, with no default
R scripts
While entering and running your code at the R command line is effective and simple, this
technique has its limitations. Each time you want to execute a set
of commands, you have to re-enter them at the command line. Complex
commands are potentially subject to typographical errors, necessitating that
they be re-entered correctly. Repeating a set of operations requires re-entering
the entire command stream. Scripts provide a way around these limitations.
A script is simply a text file containing a set of commands and comments. The
script can be saved and used later to re-execute the saved commands. The script
can also be edited so you canexecute a modified version of the commands.
Creating an R script
It is easy to create a new script in RStudio. You can open a new empty
script by clicking the New File icon in the upper left of the main RStudio
toolbar. This icon looks like a white square with a white plus sign in a green
circle. Clicking the icon opens the New File menu. Click the R Script
menu option and the script editor will open with an empty script.
Once the new script opens in the Script Editor panel, the script is ready for
text entry, and your RStudio session will look like this.
Here is an easy example to familiarize you with the Script Editor interface.
Type the following code into your new script [later topics will explain what the
specific code components do].
# this is my first R script
# do some things
x = 34
y = 16
z = x + y # addition
w = y/x   # division
# display the results
x
y
z
w
# change x
x = "some text"
# display the results
x
y
z
w
There, you now have your first R script. Notice how the editor places a number
in front of each line of code. The line numbers can be helpful as you work with
your code. Before proceeding onto executing this code, it would be a good idea
to learn how to save your script.
Saving an R script
You can save your script by clicking on the Save icon at the top of the
Script Editor panel. When you do that, a Save File dialog will open.
The default script name is Untitled.R. The Untitled part is highlighted. You will
save this script as First script.R. Start typing First script; RStudio overwrites
the highlighted default name with your new name, leaving the .R file extension.
The Save File dialog should now look like this.
Notice that RStudio will save your script to your current working folder. An
earlier topic in this learning infrastructure explained how to set your default
working folder, so that will not be addressed here. Press the Save button and
your script is saved to your working folder. Notice that the name in the file tab
at the top of the Script Editor panel now shows your saved script file name.
Be aware that, while it is not necessary to use an .R file extension for your
R scripts, it does make it easier for RStudio to work with them if you use
this file extension.
That is how you save your script files to your working folder.
Opening an R script
Opening a saved R script is easy to do. Click on the Open an existing file
icon in the RStudio toolbar. A Choose File dialog will open.
Select the R script you want to open [this is one place where the .R file extension
comes in handy] and click the Open button. Your script will open in the Script
Editor panel with the script name in an editor tab.
Working through an example may be helpful. We will use the script you
created above [First script.R] for this exercise. First, you will need to close
the script. You can close this script by clicking the X in the right side of the
editor tab where the script name appears. Since you only had one script open,
when you close First script.R, the Script Editor panel disappears.
Now, click on the Open an existing file icon in the RStudio toolbar. The
Choose File dialog will open. Select First script.R and then press the Open
button in the dialog. Your script is now open in the Script Editor panel and
ready to use.
You can run the code in your R script easily. The Run button in the Script
Editor panel toolbar will run either the current line of code or any block of
selected code. You can use your First script.R code to gain familiarity with
this functionality.
Place the cursor anywhere in line 3 of your script [x = 34]. Now press the
Run button in the Script Editor panel toolbar. Three things happen: 1) the
code is transferred to the command console, 2) the code is executed, and 3)
the cursor moves to the next line in your script. Press the Run button three
more times. RStudio executes lines 4, 5, and 6 of your script.
Now you will run a set of code commands all at once. Highlight lines 8,
9, 10, and 11 in the script.
Before finishing this topic, there is one final concept you should understand. It
is always a good idea to place comments in your code. They will help you
understand what your code is meant to do. This will become helpful when you
reopen code you wrote weeks ago and are trying to work with again. The saying
"Real programmers do not document their code. If it was hard to write, it should
be hard to understand" is meant to be a dark joke, not a coding style guide.
A comment in R code begins with the # symbol. Your code in First script.R
contains several examples of comments. Lines 1, 2, 7, 12, and 14 in the image
above are all comment lines. Any line of text that starts with # will be treated
as a comment and will be ignored during code execution. Lines 5 and 6 in this
image contain comments at the end. All text after the # is treated as a comment
and is ignored during execution.
Notice how the RStudio editor shows these comments colored green. The
green color helps you focus on the code and not get confused by the comments.
Besides using comments to help make your R code more easily understood, you can use
the # symbol to ignore lines of code while you are developing your code
stream. Simply place a # in front of any line that you want to ignore. R will
treat those lines as comments and ignore them. When you want to include
those lines again in the code execution, remove the # symbols and the code is
executable again. This technique allows you to change what code you execute
without having to retype deleted code.
Logical Operators
The following table shows the logical operators supported by the R
language. They are applicable only to vectors of type logical, numeric,
or complex. All non-zero numbers are treated as the logical value TRUE and
zero as FALSE.
Each element of the first vector is compared with the
corresponding element of the second vector. The result of the
comparison is a Boolean value.
! (Logical NOT operator): takes each element of the vector and gives the opposite logical value.
v <- c(3,0,TRUE,2+2i)
print(!v)
it produces the following result:
[1] FALSE  TRUE FALSE FALSE

The logical operators && and || consider only the first element of the vectors
and give a vector of a single element as output.

&& (Logical AND operator): takes the first element of both vectors and gives TRUE only if both are TRUE.
v <- c(3,0,TRUE,2+2i)
t <- c(1,3,TRUE,2+3i)
print(v&&t)
it produces the following result:
[1] TRUE

|| (Logical OR operator): takes the first element of both vectors and gives TRUE if at least one of them is TRUE.
v <- c(0,0,TRUE,2+2i)
t <- c(0,3,TRUE,2+3i)
print(v||t)
it produces the following result:
[1] FALSE
Decision Making
Decision making structures require the programmer to specify one or more
conditions to be evaluated or tested by the program, along with a statement or
statements to be executed if the condition is determined to be true, and
optionally, other statements to be executed if the condition is determined to be
false.
Following is the general form of a typical decision making structure found in
most of the programming languages:
R provides the following types of decision making statements.
Click thefollowing links to check their detail.
Statement Description
R - If Statement
An if statement consists of a Boolean expression followed by one or more statements.
Syntax
The basic syntax for creating an if statement in R is:
if(boolean_expression) {
   # statement(s) will execute if the boolean expression is true.
}
If the Boolean expression evaluates to true, then the block of code inside the
if statement will be executed. If the Boolean expression evaluates to false, then
the first set of code after the end of the if statement (after the closing curly brace)
will be executed.
Flow Diagram
Example
x <- 30L
if(is.integer(x)){
print("X is an Integer")
}
When the above code is compiled and executed, it produces the following result:
[1] "X is an Integer"
R - If...Else Statement
An if statement can be followed by an optional else statement which executes
when the Boolean expression is false.
Syntax
The basic syntax for creating an if...else statement in R is:
if(boolean_expression) {
   # statement(s) will execute if the boolean expression is true.
} else {
   # statement(s) will execute if the boolean expression is false.
}
Flow Diagram
Example
x <- c("what","is","truth")
if("Truth" %in% x){
print("Truth is found")
} else {
When the above code is compiled and executed, it produces the following result:
The if...else if...else Statement
An if statement can be followed by an optional else if...else statement, which
is very useful to test various conditions using a single if...else if statement.
When using if, else if, else statements there are few points to keep in mind.
• An if can have zero or one else and it must come after any else if's.
• An if can have zero to many else if's and they must come before the else.
• Once an else if succeeds, none of the remaining else if's or else's will be tested.
Syntax
The basic syntax for creating an if...else if...else statement in R is:
if(boolean_expression 1) {
   # Executes when boolean expression 1 is true.
} else if(boolean_expression 2) {
   # Executes when boolean expression 2 is true.
} else {
   # Executes when none of the above conditions is true.
}
Example
x <- c("what","is","truth")
if("Truth" %in% x){
} else {
When the above code is compiled and executed, it produces the following result:
R - Switch Statement
A switch statement allows a variable to be tested for equality against a list of
values. Each value is called a case, and the variable being switched on is checked
for each case.
Syntax
The basic syntax for creating a switch statement in R is :
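The syntax block is not reproduced above; the general form is:
switch(expression, case1, case2, case3, ...)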
• You can have any number of case statements within a switch. Each case is
followed by the value to be compared to and a colon.
• If the value of the integer is between 1 and nargs()-1 (the maximum number of
arguments), then the corresponding element of the case condition is evaluated and
the result returned.
• If there is more than one match, the first matching element is returned.
• In the case of no match, if there is an unnamed element of ... its value is returned.
(If there is more than one such argument an error is returned.)
Flow Diagram
Example
x <- switch(
   3,
   "first",
   "second",
   "third",
   "fourth"
)
print(x)
When the above code is compiled and executed, it produces the following result:
[1] "third"
There may be a situation when you need to execute a block of code several
times. In general, statements are executed sequentially: the first
statement in a function is executed first, followed by the second, and so on.
Programming languages provide various control structures that allow for
more complicated execution paths.
Loops
A loop statement allows us to execute a statement or group of statements
multiple times and the following is the general form of a loop statement in most
of the programming languages:
R - Repeat Loop
The Repeat loop executes the same code again and again until a stop condition is met.
Syntax
The basic syntax for creating a repeat loop in R is:
repeat {
commands
if(condition){
break
}
Flow Diagram
Example
v <- c("Hello","loop")
cnt <- 2
repeat{
print(v)
cnt <- cnt+1
if(cnt > 5){
break
}
}
When the above code is compiled and executed, it produces the following result:
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
R - While Loop
The While loop executes the same code again and again until a stop condition is met.
Syntax
The basic syntax for creating a while loop in R is :
while (test_expression) {
statement
Flow Diagram
Here the key point of the while loop is that the loop might not ever run. When the
condition is tested and the result is false, the loop body will be skipped and the
first statement after the while loop will be executed.
Example
v <- c("Hello","while loop")
cnt <- 2
while (cnt < 7) {
   print(v)
   cnt = cnt + 1
}
R - For Loop
A for loop is a repetition control structure that allows you to
efficiently write a loop that needs to execute a specific number of
times.
Syntax
The basic syntax for creating a for loop statement in R is:
Flow Diagram
R’s for loops are particularly flexible in that they are not limited
to integers, or even numbers in the input. We can pass character
vectors, logical vectors, lists or expressions.
Example
v <- LETTERS[1:4]
for ( i in v) {
   print(i)
}
When the above code is compiled and executed, it produces the following result:
[1] "A"
[1] "B"
[1] "C"
[1] "D"
R - Break Statement
The break statement in R programming language has the following two usages:
• When the break statement is encountered inside a loop, the
loop is immediately terminated and program control resumes at
the next statement following the loop.
• It can be used to terminate a case in the switch statement (covered in the next chapter).
Syntax
The basic syntax for creating a break statement in R is:
break
Flow Diagram
Example
v <- c("Hello","loop")
cnt <- 2
repeat {
   print(v)
   cnt <- cnt + 1
   if(cnt > 5) {
      break
   }
}
When the above code is compiled and executed, it produces the following result:
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
R – Next Statement
The next statement in the R programming language is useful when
we want to skip the current iteration of a loop without
terminating it. On encountering next, the R parser skips further
evaluation and starts the next iteration of the loop.
Syntax
The basic syntax for creating a next statement in R is:
next
Flow Diagram
Example
v <- LETTERS[1:6]
for ( i in v){
   if (i == "D"){
      next
   }
   print(i)
}
When the above code is compiled and executed, it produces the following result:
[1] "A"
[1] "B"
List
Lists are the R objects which contain elements of different types, like
numbers, strings, vectors, and another list inside them. A list can also contain a
matrix or a function as its elements. A list is created using the list() function.
Creating a List
Following is an example to create a list containing strings, numbers, vectors
and logical values.
Live Demo
# Access the third element. As it is also a list, all its elements will be printed.
print(list_data[3])
Merging Lists
You can merge many lists into one list by placing all the lists inside one list() function.
# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")
Converting a List to a Vector
A list can be converted to a vector using the unlist() function, so that the elements of the
vector can be used for further manipulation.
# Create lists.
list1 <- list(1:5)
print(list1)
list2 <- list(10:14)
print(list2)
# Convert the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)
print(v1)
print(v2)
Data Frame
start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
Extract 3rd and 5th row with 2nd and 4th column
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)
Add Column
Just add the column vector using a new column name.
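A sketch using the emp.data data frame created earlier (the dept values below are assumptions used only for illustration):
# Add the "dept" column to the existing data frame.
emp.data$dept <- c("IT", "Operations", "IT", "HR", "Finance")
print(emp.data)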
Add Row
To add more rows permanently to an existing data frame, we need to bring in
the new rows in the same structure as the existing data frame and use the
rbind() function.
In the example below we create a data frame with new rows and merge it with
the existing dataframe to create the final data frame.
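The listing is missing here; a sketch that reuses the emp.data structure from above (the new employees' values are assumptions) is:
# Create the second data frame with the same structure.
emp.newdata <- data.frame(
   emp_id = c(6:8),
   emp_name = c("Rasmi", "Pranab", "Tusar"),
   salary = c(578.0, 722.5, 632.8),
   start_date = as.Date(c("2013-05-21", "2013-07-30", "2014-06-17")),
   stringsAsFactors = FALSE
)
# Bind the two data frames row-wise to create the final data frame.
emp.finaldata <- rbind(emp.data, emp.newdata)
print(emp.finaldata)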
lapply() function
lapply(X, FUN)
Arguments:
-X: A vector or an object
-FUN: Function applied to each element of x
The l in lapply() stands for list. The difference between lapply() and apply() lies
in the output returned: the output of lapply() is a list. lapply() can be used
for other objects like data frames and lists.
A very easy example is to change the string values of a matrix to lower
case with the tolower function. We construct a matrix with the names of famous
movies. The names are in upper-case format.
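The construction of the movie names and the lapply() call are missing above; a sketch consistent with the output that follows is:
# Movie names in upper case
movies <- c("SPYDERMAN", "BATMAN", "VERTIGO", "CHINATOWN")
movies_lower <- lapply(movies, tolower)
str(movies_lower)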
Output:
## List of 4
## $:chr"spyderman"
## $:chr"batman"
## $:chr"vertigo"
## $:chr"chinatown"
movies_lower <-unlist(lapply(movies,tolower))
str(movies_lower)
Output:
##  chr [1:4] "spyderman" "batman" "vertigo" "chinatown"
sapply() function
The sapply() function takes a list, vector, or data frame as input and gives output in
a vector or matrix. It is useful for operations on list objects and returns a list
object of the same length as the original set. The sapply() function does the same job as
the lapply() function but returns a vector.
sapply(X, FUN)
Arguments:
-X: A vector or an object
-FUN: Function applied to each element of x
We can measure the minimum speed and stopping distances of cars from the cars dataset.
dt <- cars
lmn_cars <- lapply(dt, min)
smn_cars <- sapply(dt, min)
lmn_cars
Output
## $speed
## [1] 4
##
## $dist
## [1] 2
We can summarize the difference between sapply() and lapply() in the following table:

Function: lapply
Arguments: lapply(X, FUN)
Objective: Apply a function to all the elements of the input
Input: List, vector or data frame
Output: list

Function: sapply
Arguments: sapply(X, FUN)
Objective: Apply a function to all the elements of the input
Input: List, vector or data frame
Output: vector or matrix
S3
In OOP, S3 is used to overload functions, so that we can call functions
with different names depending on the type of input parameter or the number
of parameters.
S4
S4 is a more formal object-oriented system than S3: classes are defined explicitly with
setClass() and objects are created with new().
In R, classes are the outline or design for the object. Classes encapsulate the
data members, along with the functions. In R, there are two most important
classes, i.e., S3 and S4, which play an important role in performing OOPs
concepts.
Let's discuss both the classes one by one with their examples for better understanding.
1) S3 Class
With the help of the S3 class, we can take advantage of the ability to implement
generic-function OO. Furthermore, S3 dispatches using only the first argument. S3 differs
from traditional programming languages such as Java, C++, and C#, which implement
message-passing OO. This makes S3 easy to implement. In the S3 class, the generic function
calls the method. S3 is very casual and has no formal definition of classes.
Creating an S3 class
In R, we define a function which will create a class and return an object of the
created class. A list is made with the relevant members, the class of the list is
set, and a copy of the list is returned. There is the following syntax to
create a class.
Example
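The listing is missing here; a sketch consistent with the object s used later in this section (the member names come from the surrounding text, the values are assumptions):
# Constructor: build a list, set its class attribute, and return it
student <- function(n, a, g) {
   value <- list(name = n, age = a, GPA = g)
   attr(value, "class") <- "student"
   value
}
s <- student("Shubham", 22, 3.5)
s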
Output
There is the following way in which the generic function print is defined:
print
function (x, ...)
UseMethod("print")
When we execute or run the above code, it will give us the following output:
Like the print function, we will make a generic function GPA to assign a new
value to our GPA member. In the following way we will make the generic
function GPA.
Once our generic function GPA is created, we will implement a default function for it.
After that we will make a new method for our GPA function in the following way:
GPA.student <- function(obj1) {
   cat("Final GPA is ", obj1$GPA, "\n")
}
GPA(s)
Output
Inheritance in S3
Inheritance means extracting the features of one class into another class. In the S3 system of R, inheritance is achieved by assigning a vector of class names to the class attribute.
faculty <- function(n, a, g) {
  value <- list(name = n, age = a, GPA = g)
  attr(value, "class") <- "faculty"
  value
}
The general pattern is:
class(object) <- c(child, parent)
So, for example:
# create a list
fac <- list(name = "Shubham", age = 22, GPA = 3.5, country = "India")
# make it of the class InternationalFaculty which is derived from the class Faculty
class(fac) <- c("InternationalFaculty", "Faculty")
# print it out
fac
When we run the above code, it prints fac with the default method, because we have not yet defined any method of the form print.InternationalFaculty() or print.Faculty(). We can now define such a method:
print.InternationalFaculty <- function(obj1) {
  cat(obj1$name, "is from", obj1$country, "\n")
}
The above function means that calls of the form print(fac) will now dispatch to this method rather than the default, which we can see by printing the object again:
fac
Output
Shubham is from India
There are two common and popular helper functions for working with S3 methods in R: the first is getS3method() and the second is getAnywhere().
S3 finds the appropriate method associated with a class, and it is useful to see
how a method is implemented. Sometimes, the methods are non-visible,
because they are hidden in a namespace. We use getS3method or getAnywhere
to solve this problem.
getS3method
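For example, to look at how the print method for data frames is implemented (print.data.frame is a real method in base R; the particular choice is just an illustration):
getS3method("print", "data.frame")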
getAnywhere function
getAnywhere("simpleLoess")
2) S4 Class
The S4 class is similar to the S3 but is more formal than the latter one. It differs
from S3 in two different ways. First, in S4, there are formal class definitions
which provide a description and representation of classes. In addition, it has
special auxiliary functions for defining methods and generics. S4 also offers multiple dispatch: a generic function can select the method to call based on the classes of several arguments, not just the first.
Creating an S4 class
To create an S4 class, we have to define the class and its slots. There are the following steps to create an S4 class:
Step 1:
In the first step, we will create a new class called faculty with three slots name, age, and GPA.
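A minimal sketch of that step (the slot types mirror those shown in the inheritance example later in this section):
setClass("faculty",
  slots = list(name = "character", age = "numeric", GPA = "numeric")
)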
There are many other optional arguments of the setClass() function which we can explore by using the ?setClass command.
Step 2:
In the next step, we will create the object of S4 class. R provides new() function
to create an object of S4 class. In this new function we pass the class name and
the values for the slots in the following way:
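A sketch of that call (the slot values are illustrative, reusing the earlier example):
s <- new("faculty", name = "Shubham", age = 22, GPA = 3.5)
s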
setClass() also returns a generator function, which we can use as a constructor to create new objects. The constructor in turn uses the new() function to create objects; it is just a wrapper around it. Let's see an example to understand how an S4 object is created with the help of the generator function.
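A minimal sketch (the name faculty_generator and the slot values are illustrative):
faculty_generator <- setClass("faculty",
  slots = list(name = "character", age = "numeric", GPA = "numeric"))
s <- faculty_generator(name = "Shubham", age = 22, GPA = 3.5)
s
Output:
An object of class "faculty"
Slot "name":
[1] "Shubham"

Slot "age":
[1] 22

Slot "GPA":
[1] 3.5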
Inheritance in S4 class
Like the S3 class, we can perform inheritance in the S4 class as well. The derived class will inherit both attributes and methods of the parent class. Let's see how we can perform inheritance in the S4 class. There are the following steps:
Step 1:
In the first step, we will create or define class with appropriate slots in the following way:
setClass("faculty",
  slots = list(name = "character", age = "numeric", GPA = "numeric")
)
Step 2:
After defining the class, our next step is to define a class method for the show() generic function (the S4 analogue of the print method used in S3). This will be done in the following manner:
setMethod("show",
  "faculty",
  function(object) {   # the argument must be named "object" to match the show() generic
    cat(object@name, "\n")
    cat(object@age, "years old\n")
    cat("GPA:", object@GPA, "\n")
  }
)
Step 3:
In the next step, we will define the derived class with the argument contains.
The derived class is defined in the following way:
setClass("Internationalfaculty",
  slots = list(country = "character"),
  contains = "faculty"
)
In our derived class we have defined only one attribute, i.e. country. The other attributes will be inherited from its parent class.
When we call show(s) on such an object, the method defined for the class faculty gets called. We can also define methods for the derived class of the base class, as in the case of the S3 system.
DATA MANIPULATION
Module 4
We can easily perform data manipulation using R software. We’ll cover the following data
manipulation techniques:
▪ filtering and ordering rows,
▪ renaming and adding columns,
▪ computing summary statistics
We’ll use mainly the popular dplyr R package, which contains important R functions to carry
out easily your data manipulation. In the final section, we’ll show you how to group your data
by a grouping variable, and then compute some summary statistics on each subset. You will
also learn how to chain your data manipulation operations.
At the end of this course, you will be familiar with data manipulation tools and approaches that
will allow you to efficiently manipulate data.
Required R packages
We recommend installing the tidyverse packages, which include the dplyr package (for data
manipulation) and additional R packages for easily reading (readr), transforming (tidyr) and
visualizing (ggplot2) datasets.
Install:
install.packages("tidyverse")
Load the tidyverse packages, which also include the dplyr package:
library("tidyverse")
Demo datasets
We’ll use mainly the R built-in iris data set, which we start by converting into a tibble data
frame (tbl_df) for easier data analysis. tbl_df data object is a data frame providing a nicer
printing method, useful when working with large data sets.
library("tidyverse")
my_data <- as_tibble(iris)
my_data
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## # ... with 144 more rows
Note that, the type of data in each column is specified. Common types include:
int: integers
dbl: double (real numbers),
chr: character vectors, strings, texts
fctr: factor,
dttm: date-times (date + time)
lgl: logical (TRUE or FALSE)
date: dates
Main data manipulation functions
There are 8 fundamental data manipulation verbs that you will use to do most of your data
manipulations. These functions are included in the dplyr package:
filter(): Pick rows (observations/samples) based on their values.
distinct(): Remove duplicate rows.
arrange(): Reorder the rows.
select(): Select columns (variables) by their names.
rename(): Rename columns.
mutate() and transmute(): Add/create new variables.
summarise(): Compute statistical summaries (e.g., computing the mean or the sum)
It’s also possible to combine each of these verbs with the function group_by() to operate on
subsets of the data set (group-by-group).
All these functions work similarly, as follows:
The first argument is a data frame
The subsequent arguments are comma separated list of unquoted variable names and the
specification of what you want to do
The result is a new data frame
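As an illustration of the pattern, a minimal sketch using the my_data tibble created above (the particular filter threshold is arbitrary):
my_data %>%
  filter(Sepal.Length > 6) %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal.Length))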
Probability Distributions
A probability distribution describes how the values of a random variable are distributed. For example, the collection of all possible outcomes of a sequence of coin tosses is known to follow the binomial distribution, whereas the means of sufficiently large samples of a data population are known to resemble the normal distribution. Since the characteristics of these theoretical distributions are well understood, they can be used to make statistical inferences about the entire data population as a whole.
In the following tutorials, we demonstrate how to compute a few well-known probability distributions that occur frequently in statistical study. We reference them quite often in other sections.
sections.
Binomial Distribution
Problem
Suppose there are twelve multiple choice questions in an English class quiz. Each question has
five possible answers, and only one of them is correct. Find the probability of having four or
less correct answers if a student attempts to answer every question at random.
Solution
Since only one out of five possible answers is correct, the probability of answering a question
correctly by random is 1/5=0.2. We can find the probability of having exactly 4 correct answers
by random attempts as follows.
> dbinom(4, size=12, prob=0.2)
[1] 0.1329
To find the probability of having four or less correct answers by random attempts, we apply
the function dbinom with x = 0,…,4.
> dbinom(0, size=12, prob=0.2) +
+ dbinom(1, size=12, prob=0.2) +
+ dbinom(2, size=12, prob=0.2) +
+ dbinom(3, size=12, prob=0.2) +
+ dbinom(4, size=12, prob=0.2)
[1] 0.9274
Alternatively, we can use the cumulative probability function for binomial
distribution pbinom.
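A sketch of that alternative (the value matches the sum above, shown rounded to four decimals in the style of this section):
> pbinom(4, size=12, prob=0.2)
[1] 0.9274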
Poisson Distribution
Problem
If there are twelve cars crossing a bridge per minute on average, find the probability of having
seventeen or more cars crossing the bridge in a particular minute.
Solution
The probability of having sixteen or less cars crossing the bridge in a particular minute is given
by the function ppois.
> ppois(16, lambda=12) # lower tail
[1] 0.89871
Hence the probability of having seventeen or more cars crossing the bridge in a minute is in
the upper tail of the probability density function.
> ppois(16, lambda=12, lower=FALSE) # upper tail
[1] 0.10129
Answer
If there are twelve cars crossing a bridge per minute on average, the probability of having
seventeen or more cars crossing the bridge in a particular minute is 10.1%.
Continuous Uniform Distribution
The continuous uniform distribution is the probability distribution of random number selection
from the continuous interval between a and b. Its density function is defined by the following:
f(x) = 1 / (b − a)  for a ≤ x ≤ b, and f(x) = 0 otherwise.
Problem
Select ten random numbers between one and three.
Solution
We apply the generation function runif of the uniform distribution to generate ten random
numbers between one and three.
> runif(10, min=1, max=3)
[1] 1.6121 1.2028 1.9306 2.4233 1.6874 1.1502 2.7068
[8] 1.4455 2.4122 2.2171
Exponential Distribution
The exponential distribution describes the arrival time of a randomly recurring independent
event sequence. If μ is the mean waiting time for the next event recurrence, its probability
density function is:
f(x) = (1/μ)·e^(−x/μ)  for x ≥ 0.
Problem
Suppose the mean checkout time of a supermarket cashier is three minutes. Find the
probability of a customer checkout being completed by the cashier in less than two minutes.
Solution
The checkout processing rate is equal to one divided by the mean checkout completion time.
Hence the processing rate is 1/3 checkouts per minute. We then apply the function pexp of
the exponential distribution with rate=1/3.
> pexp(2, rate=1/3)
[1] 0.48658
Answer
The probability of finishing a checkout in under two minutes by the cashier is 48.7%
Normal Distribution
The normal distribution is defined by the following probability density function, where μ is the population mean and σ² is the variance:
f(x) = 1/(σ·√(2π)) · e^(−(x − μ)² / (2σ²))
In particular, the normal distribution with μ = 0 and σ = 1 is called the standard normal
distribution, and is denoted as N(0,1). It can be graphed as follows.
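A quick way to draw it, as a minimal sketch using base graphics:
curve(dnorm(x), from = -4, to = 4, ylab = "density", main = "Standard normal N(0,1)")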
The normal distribution is important because of the Central Limit Theorem, which states that the distribution of the means of all possible samples of size n from a population with mean μ and variance σ² approaches a normal distribution with mean μ and variance σ²/n as n approaches infinity.
Problem
Assume that the test scores of a college entrance exam fits a normal distribution.
Furthermore, the mean test score is 72, and the standard deviation is 15.2. What is the
percentage of students scoring 84 or more in the exam?
Solution
We apply the function pnorm of the normal distribution with mean 72 and standard deviation
15.2. Since we are looking for the percentage of students scoring higher than 84, we are
interested in the upper tail of the normal distribution.
> pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
[1] 0.21492
Answer
The percentage of students scoring 84 or more in the college entrance exam is 21.5%.
Linear Regression
Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables. One of these variables is called the predictor variable, whose value is gathered through experiments. The other is called the response variable, whose value is derived from the predictor variable.
In Linear Regression these two variables are related through an equation, where exponent
(power) of both these variables is 1. Mathematically a linear relationship represents a straight
line when plotted as a graph. A non-linear relationship where the exponent of any variable is
not equal to 1 creates a curve.
# Values of the predictor variable x
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of the response variable (weight)
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
Following is the description of the parameters used −
• formula is a symbol presenting the relation between x and y.
• data is the data (vectors or data frame) on which the formula will be applied.
Create Relationship Model & get the Coefficients
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
print(relation)
When we execute the above code, it produces the following result −
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-38.4551 0.6746
Get the Summary of the Relationship
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
print(summary(relation))
When we execute the above code, it produces the following result −
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
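To use the fitted model for prediction, the generic predict() can be applied to new data. A minimal sketch (the value 170 is just an illustrative new x value):
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)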
Covariance and Correlation are terms used in statistics to measure relationships between two
random variables. Both of these terms measure linear dependency between a pair of random
variables or bivariate data.
In this article, we are going to discuss cov(), cor() and cov2cor() functions in R which use
covariance and correlation methods of statistics and probability theory.
Covariance
In R programming, covariance can be measured using the cov() function. Covariance is a statistical term used to measure the direction of the linear relationship between two data vectors. Mathematically,
cov(x, y) = Σ (x_i − x̄)(y_i − ȳ) / (n − 1)
where x̄ and ȳ are the means of the x and y data vectors and n is the number of observations.
Syntax:
cov(x, y, method)
where,
x and y represents the data vectors
method defines the type of method to be used to compute covariance. Default is "pearson".
Example:
# Data vectors
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)
# covariance using the different available methods
print(cov(x, y))
print(cov(x, y, method = "pearson"))
print(cov(x, y, method = "kendall"))
print(cov(x, y, method = "spearman"))
Output:
[1] 30.66667
[1] 30.66667
[1] 12
[1] 1.666667
Correlation
The cor() function in R programming measures the correlation coefficient. Correlation is a statistical measure that uses the covariance to express how strongly two vectors are related. Mathematically,
cor(x, y) = cov(x, y) / (σ_x · σ_y) = Σ (x_i − x̄)(y_i − ȳ) / sqrt( Σ (x_i − x̄)² · Σ (y_i − ȳ)² )
where
x represents the x data vector,
y represents the y data vector,
x̄ represents the mean of the x data vector,
ȳ represents the mean of the y data vector.
Syntax:
cor(x, y, method)
where,
x and y represents the data vectors
method defines the type of method to be used to compute the correlation. Default is "pearson".
Example:
# Data vectors
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)
# correlation using the different available methods
print(cor(x, y))
print(cor(x, y, method = "pearson"))
print(cor(x, y, method = "kendall"))
print(cor(x, y, method = "spearman"))
Output:
[1] 0.9724702
[1] 0.9724702
[1] 1
[1] 1
Conversion of Covariance to Correlation
cov2cor() function in R programming converts a covariance matrix into corresponding
correlation matrix.
Syntax:
cov2cor(X)
where,
X represents a square covariance matrix
Example:
# Data vectors
x <- rnorm(2)
y <- rnorm(2)
mat <- cbind(x, y)
X <- cov(mat)
print(X)
print(cor(mat))
print(cov2cor(X))
Output:
x y
x 0.0742700 -0.1268199
y -0.1268199 0.2165516
x y
x 1 -1
y -1 1
x y
x 1 -1
y -1 1
t-tests
One of the most common tests in statistics is the t-test, used to determine
whether the means of two groups are equal to each other. The assumption for
the test is that both groups are sampled from normal distributions with equal
variances. The null hypothesis is that the two means are equal, and the
alternative is that they are not. It is known that under the null hypothesis, we
can calculate a t-statistic that will follow a t-distribution with n1 + n2 - 2 degrees
of freedom. There is also a widely used modification of the t-test, known as
Welch's t-test that adjusts the number of degrees of freedom when the variances
are thought not to be equal to each other. Before we can explore the test much
further, we need to find an easy way to calculate the t-statistic.
The function t.test is available in R for performing t-tests. Let's test it out on a
simple example, using data simulated from a normal distribution.
> x = rnorm(10)
> y = rnorm(10)
> t.test(x,y)
data: x and y
t = 1.4896, df = 15.481, p-value = 0.1564
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3221869 1.8310421
sample estimates:
mean of x mean of y
0.1944866 -0.5599410
Before we can use this function in a simulation, we need to find out how to
extract the t-statistic (or some other quantity of interest) from the output of the
t.test function. For this function, the R help page has a detailed list of what the
object returned by the function contains. A general method for a situation like
this is to use the class and names functions to find where the quantity of interest
is. In addition, for some hypothesis tests, you may need to pass the object from
the hypothesis test to the summary function and examine its contents. For t.test
it's easy to figure out what we want:
> ttest = t.test(x,y)
> ttest$statistic
t
1.489560
> ttest[['statistic']]
t
1.489560
Of course, just one value doesn't let us do very much - we need to generate many
such statistics before we can look at their properties. In R, the replicate function
makes this very simple. The first argument to replicate is the number of samples
you want, and the second argument is an expression (not a function name or
definition!) that will generate one of the samples you want. To generate 1000 t-
statistics from testing two groups of 10 standard random normal numbers, we
can use:
> ts = replicate(1000,t.test(rnorm(10),rnorm(10))$statistic)
Under the assumptions of normality and equal variance, we're assuming that the
statistic will have a t-distribution with 10 + 10 - 2 = 18 degrees of freedom.
(Each observation contributes a degree of freedom, but we lose two because we
have to estimate the mean of each group.) How can we test if that is true?
One way is to plot the theoretical density of the t-statistic we should be seeing,
and superimposing the density of our sample on top of it. To get an idea of what
range of x values we should use for the theoretical density, we can view the
range of our simulated data:
> range(ts)
[1] -4.564359 4.111245
Since the distribution is supposed to be symmetric, we'll use a range from -4.5
to 4.5. We can generate equally spaced x-values in this range with seq:
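A minimal sketch of those two steps (assuming 18 degrees of freedom for the theoretical density, as above):
> pts = seq(-4.5, 4.5, length = 100)
> plot(pts, dt(pts, df = 18), col = 'red', type = 'l')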
Now we can add a line to the plot showing the density for our simulated sample:
> lines(density(ts))
The plot appears below. Another way to compare the simulated statistics with the theoretical distribution is a quantile-quantile plot against random draws from a t-distribution with 18 degrees of freedom:
> qqplot(ts,rt(1000,df=18))
> abline(0,1)
We can see that the central points of the graph seem to agree fairly well, but
there are some discrepancies near the tails (the extreme values on either end of
the distribution). The tails of a distribution are the most difficult part to
accurately measure, which is unfortunate, since those are often the values that
interest us most, that is, the ones which will provide us with enough evidence
to reject a null hypothesis. Because the tails of a distribution are so important,
another way to test to see if a distribution of a sample follows some
hypothesized distribution is to calculate the quantiles of some tail probabilities
(using the quantile function) and compare them to the theoretical probabilities
from the distribution (obtained from the function for that distribution whose first
letter is "q"). Here's such a comparison for our simulated data:
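A sketch of such a comparison (the choice of tail probabilities is illustrative):
> probs = c(.9, .95, .99)
> quantile(ts, probs)
> qt(probs, df = 18)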
Performing more simulations, or using a larger sample size for the two groups, would probably result in values even closer to what we have theoretically predicted. To see how Welch's adjustment behaves in practice, we can run t.test on the same two samples with and without the var.equal=TRUE argument:
> t.test(x,y)
data: x and y
t = -0.8103, df = 17.277, p-value = 0.4288
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.0012220 0.4450895
sample estimates:
mean of x mean of y
0.2216045 0.4996707
> t.test(x,y,var.equal=TRUE)
data: x and y
t = -0.8103, df = 18, p-value = 0.4284
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.9990520 0.4429196
sample estimates:
mean of x mean of y
0.2216045 0.4996707
Since the statistic is the same in both cases, it doesn't matter whether we use the correction or not; either way we'll see identical results when we compare the two methods using the techniques we've already described. Since the degrees of freedom correction changes depending on the data, we can't simply perform the same comparison on the t-statistics; instead, we can work with the p-values, which should follow a uniform distribution between 0 and 1 if the null hypothesis is true:
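A sketch of generating those p-values (mirroring the earlier replicate() call, but extracting $p.value):
> tps = replicate(1000, t.test(rnorm(10), rnorm(10))$p.value)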
> qqplot(tps,runif(1000))
> abline(0,1)
The graph appears below.
The idea that the probabilities follow a uniform distribution seems reasonable.
Now, let's look at some of the quantiles of the p-values when we force the t.test
function to use var.equal=TRUE:
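Two pieces are assumed by the quantile() call below; a sketch (the probability points are inferred from the output headings shown):
> probs = c(.5, .7, .9, .95, .99)
> tps = replicate(1000, t.test(rnorm(10), rnorm(10), var.equal=TRUE)$p.value)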
> quantile(tps,probs)
50% 70% 90% 95% 99%
0.4932319 0.7084562 0.9036533 0.9518775 0.9889234
There's not that much of a difference, but, of course, the variances in this
example were equal. How does the correction work when the variances are not
equal?
> tps =
replicate(1000,t.test(rnorm(10),rnorm(10,sd=5),var.equal=TRUE)$p.value)
> quantile(tps,probs)
50% 70% 90% 95% 99%
0.5221698 0.6926466 0.8859266 0.9490947 0.9935562
> tps = replicate(1000,t.test(rnorm(10),rnorm(10,sd=5))$p.value)
> quantile(tps,probs)
50% 70% 90% 95% 99%
0.4880855 0.7049834 0.8973062 0.9494358 0.9907219
There is an improvement, but not so dramatic.
t.power = function(nsamp=c(10,10), nsim=1000, means=c(0,0), sds=c(1,1)){
   lower = qt(.025, df=sum(nsamp) - 2)
   upper = qt(.975, df=sum(nsamp) - 2)
   ts = replicate(nsim,
        t.test(rnorm(nsamp[1], mean=means[1], sd=sds[1]),
               rnorm(nsamp[2], mean=means[2], sd=sds[2]))$statistic)
   # estimate the power as the fraction of simulated statistics in the rejection region
   sum(ts < lower | ts > upper) / nsim
}
> t.power(means=c(0,1))
[1] 0.555
Not bad for a sample size of 10!
Of course, if the differences in means are smaller, it's going to be harder to reject
the null hypothesis:
> t.power(means=c(0,.3))
[1] 0.104
How large a sample size would we need to detect that difference of .3 with 95%
power?
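One way to explore that question, as a sketch (the candidate sample sizes are arbitrary):
> samps = c(100, 200, 300, 400, 500)
> res = sapply(samps, function(n) t.power(nsamp = c(n, n), means = c(0, .3)))
> names(res) = samps
> res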
Now we can return to the issue of unequal variances. We saw that Welch's
adjustment to the degrees of freedom helped a little bit under the null
hypothesis. Now let's see if the power of the test is improved using Welch's test
when the variances are unequal. To do this, we'll need to modify our t.power
function a little:
t.power1 = function(nsamp=c(10,10), nsim=1000, means=c(0,0), sds=c(1,1), var.equal=TRUE){
   tps = replicate(nsim,
         t.test(rnorm(nsamp[1], mean=means[1], sd=sds[1]),
                rnorm(nsamp[2], mean=means[2], sd=sds[2]),
                var.equal=var.equal)$p.value)
   # estimate the power as the fraction of p-values below 0.05
   sum(tps < .05) / nsim
}
> t.power1(nsim=10000,sds=c(1,2),mean=c(1,2))
[1] 0.1767
> t.power1(nsim=10000,sds=c(1,2),mean=c(1,2),var.equal=FALSE)
[1] 0.1833
There does seem to be an improvement, but not so dramatic.
The simple linear model relates a target variable y to a single explanatory variable x:
y = β_0 + β_1·x + ε
where β_0 is the intercept (i.e. the value of the line at zero) and β_1 is the slope for the variable x, which indicates the change in y as a function of changes in x. For example, if the slope is +0.5, we can say that for each unit increment in x, y increases by 0.5. Please note that the slope can also be negative.
This equation can be expanded to accommodate more than one explanatory variable x:
y = β_0 + β_1·x_1 + β_2·x_2 + … + ε
In this case the interpretation is a bit more complex because for example the coefficient β_2
provides the slope for the explanatory variable x_2. This means that for a unit variation of x_2
the target variable y changes by the value of β_2, if the other explanatory variables are kept
constant.
In case our model includes interactions, the linear equation would be changed as follows:
y = β_0 + β_1·x_1 + β_2·x_2 + β_3·(x_1·x_2) + ε
Notice the interaction term between x_1 and x_2. In this case the interpretation becomes extremely difficult just by looking at the model.
If we collect the terms that multiply x_1, we can see that its slope becomes affected by the value of x_2 (Yan & Su, 2009). For this reason, the only way we can actually determine how x_1 changes y, when the other terms are kept constant, is to use the equation with new values of x_1.
This linear model can be applied to continuous target variables, in this case we would talk about
an ANCOVA for exploratory analysis, or a linear regression if the objective was to create a
predictive model.
ANOVA
The Analysis of variance is based on the linear model presented above, the only difference is
that its reference point is the mean of the dataset. When we described the equations above we
said that to interpret the results of the linear model we would look at the slope term; this
indicates the rate of changes in Y if we change one variable and keep the rest constant. The
ANOVA calculates the effects of each treatment based on the grand mean, which is the mean
of the variable of interest.
y_j = μ + τ_j + ε
where y_j is the response in group j, τ_j is the effect of treatment j, and μ is the grand mean (i.e. the mean of the whole dataset). From this equation it is clear that the effects calculated by the ANOVA are not referred to unit changes in the explanatory variables, but are all related to changes from the grand mean.
install.packages("agridat")
We also need to include other packages for the examples below. If some of these are not
installed in your system please use again the function install.packages (replacing the name
within quotation marks according to your needs) to install them.
library(agridat)
library(ggplot2)
library(plotrix)
library(moments)
library(car)
library(fitdistrplus)
library(nlme)
library(multcomp)
library(epade)
library(lme4)
Now we can load the dataset lasrosas.corn, which has more than 3400 observations of corn
yield in a field in Argentina, plus several explanatory variables both factorial (or categorical)
and continuous.
> dat <- lasrosas.corn
> str(dat)
$ year : int 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
Important for the purpose of this tutorial is the target variable yield, which is what we are trying
to model, and the explanatory variables: topo (topographic factor), bv (brightness value, which
is a proxy for low organic matter content) and nf (factorial nitrogen levels). In addition we have
rep, which is the blocking factor.
Checking Assumptions
Since we are planning to use an ANOVA, we first need to check that our data fit its assumptions. ANOVA relies on the following assumptions:
▪ Data independence
▪ Normality
▪ Equality of variances between groups
▪ Balance design (i.e. all groups have the same number of samples)
Let’s see how we can test for them in R. Clearly we are talking about environmental data so
the assumption of independence is not met, because data are autocorrelated with distance.
Theoretically speaking, for spatial data ANOVA cannot be employed and more robust methods
should be employed (e.g. REML); however, over the years it has been widely used for analysis
of environmental data and it is accepted by the community. That does not mean that it is the
correct method though, and later on in this tutorial we will see the function to perform linear
modelling with REML.
The third assumption (equality of variances) is the easiest to assess, using the function tapply:
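A sketch of that call (computing the variance of yield within each nitrogen level):
> tapply(dat$yield, dat$nf, var)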
N0 N1 N2 N3 N4 N5
In this case we used tapply to calculate the variance of yield for each subgroup (i.e. level of
nitrogen). There is some variation between groups but in my opinion it is not substantial. Now
we can shift our focus on normality. There are tests to check for normality, but again the
ANOVA is flexible (particularly where our dataset is big) and can still produce correct results
even when its assumptions are violated up to a certain degree. For this reason, it is good practice
to check normality with descriptive analysis alone, without any statistical test. For example,
we could start by plotting the histogram of yield:
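A one-line sketch of that plot:
> hist(dat$yield, main = "Histogram of yield", xlab = "yield")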
By looking at this image it seems that our data are more or less normally distributed. Another
plot we could create is the QQplot
(http://www.itl.nist.gov/div898/handbook/eda/section3/qqplot.htm):
qqnorm(dat$yield)
qqline(dat$yield)
For normally distributed data the points should all be on the line. This is clearly not the case
but again the deviation is not substantial. The final element we can calculate is the skewness
of the distribution, with the function skewness in the package moments:
> skewness(dat$yield)
[1] 0.3875977
According to Webster and Oliver (2007), if the skewness is below 0.5, we can consider the
deviation from normality not big enough to transform the data. Moreover, according to Witte
and Witte (2009) if we have more than 10 samples per group we should not worry too much
about violating the assumption of normality or equality of variances.
To see how many samples we have for each level of nitrogen we can use once again the
function tapply:
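A sketch of that call (counting observations per nitrogen level):
> tapply(dat$yield, dat$nf, length)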
N0 N1 N2 N3 N4 N5
As you can see we have definitely more than 10 samples per group, but our design is not
balanced (i.e. some groups have more samples than others). This implies that the normal ANOVA cannot be used, because the standard way of calculating the sums of squares is not appropriate for unbalanced designs (look here for more info: http://goanna.cs.rmit.edu.au/~fscholer/anova.php).
In summary, even though from the descriptive analysis it appears that our data are close to being normal and have equal variance, our design is unbalanced, and therefore the normal way of doing ANOVA cannot be used. In other words, we cannot use the function aov for this dataset.
However, since this is a tutorial we are still going to start by applying the normal ANOVA
with aov.
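The barplot call below assumes that the group means (means.nf) have already been computed; a minimal sketch of the missing steps (std.error() is assumed to come from the plotrix package loaded earlier):
means.nf <- tapply(dat$yield, dat$nf, mean)      # group means
StdErr.nf <- tapply(dat$yield, dat$nf, std.error) # standard errors of the means (plotrix)
# after the barplot below, the error bars described in the text can be drawn with:
# segments(BP, means.nf - 2*StdErr.nf, BP, means.nf + 2*StdErr.nf, lwd = 2)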
BP = barplot(means.nf, ylim=c(0,max(means.nf)+10))
This code first uses the function tapply to compute mean and standard error of the mean for
yield in each nitrogen group. Then it plots the means as bars and creates error bars using the
standard error (please remember that with a normal distribution ± twice the standard error
provides a 95% confidence interval around the mean value). The result is the following image:
By plotting our data we can start figuring out the relation between nitrogen levels and yield. In particular, there is an increase in yield with higher levels of nitrogen. However, some of the error bars are overlapping, and this may suggest that their values are not significantly different. For example, by looking at this plot, N0 and N1 have error bars that are very close to overlapping, but probably not overlapping, so it may be that N1 yields a significantly different mean from N0. The rest are all probably significantly different from N0. Among themselves, however, their intervals overlap most of the time, so their differences would probably not be significant.
We could formulate the hypothesis that nitrogen significantly affects yield and that the mean
of each subgroup are significantly different. Now we just need to test this hypothesis with a
one-way ANOVA:
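A sketch of that call (mod1 is the object name assumed by the summary shown below):
mod1 <- aov(yield ~ nf, data = dat)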
The code above uses the function aov to perform an ANOVA; we can specify to perform a one-
way ANOVA simply by including only one factorial term after the tilde (~) sign. We can plot
the ANOVA table with the function summary:
> summary(mod1)
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
It is clear from this output that nitrogen significantly affects yield, so we tested our first
hypothesis. To test the significance for individual levels of nitrogen we can use the Tukey’s
test:
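A sketch of the call, applied to the aov model above:
TukeyHSD(mod1, conf.level = 0.95)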
$nf
There are significant differences between the control and the rest of the levels of nitrogen, plus
other differences between N4 and N5 compared to N1, but nothing else. If you look back at the
bar chart we produced before, and look carefully at the overlaps between error bars, you will
see that for example N1, N2, and N3 have overlapping error bars, thus they are not significantly
different. On the contrary, N1 has no overlap with either N4 or N5, which is what we
demonstrated in the ANOVA.
The function model.tables provides a quick way to print the table of effects and the table of
means:
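A sketch of the call for the table of effects:
model.tables(mod1, type = "effects")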
Tables of effects
nf
N0 N1 N2 N3 N4 N5
These values are all referred to the grand mean, which we can simply calculate with the function mean(dat$yield) and which is equal to 69.83. This means that the mean for N0 would be 69.83 − 4.855 = 64.97. We can verify that with another call to the function model.tables, this time with the option type="means":
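A sketch of that call:
model.tables(mod1, type = "means")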
Tables of means
Grand mean
69.82831
nf
N0 N1 N2 N3 N4 N5
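The line being referred to below appears to be an lm() fit of the same model; a sketch (mod2 is the name assumed by the summary that follows):
mod2 <- lm(yield ~ nf, data = dat)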
This line fits the same model but with the standard linear-model function lm(). This becomes clearer by looking at the summary table:
> summary(mod2)
Call:
Residuals:
Coefficients:
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
There are several pieces of information in this table that we should clarify. First of all, it already provides some descriptive measures for the residuals, from which we can see that their distribution is relatively normal (the first and last quartiles have similar but opposite values, and the same is true for the minimum and maximum). Then we have the table of the coefficients, with the intercept and all the slopes. As you can see, the level N0 is not shown in the list; this is called the reference level, which means that all the others are referenced back to it. In other words, the value of the intercept is the mean of nitrogen level 0 (in fact it is the same value we calculated above, 64.97). To calculate the means for the other groups we need to sum the value of the reference level with the slopes. For example, N1 is 64.97 + 3.64 = 68.61 (the same value calculated from the ANOVA).
The p-value and the significance are again in relation to the reference level, meaning for
example that N1 is significantly different from N0 (reference level) and the p-value is 0.0017.
This is similar to the Tukey’s test we performed above, but it is only valid in relation to N0.
We need to change the reference level, and fit another model, to get the same information for
other nitrogen levels:
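A sketch of those two steps (releveling nf so that N1 becomes the reference, then refitting; mod3 is the name used below):
dat$nf <- relevel(dat$nf, ref = "N1")
mod3 <- lm(yield ~ nf, data = dat)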
summary(mod3)
Now the reference level is N1, so all the results will tell us the effects of nitrogen in relation to
N1.
> summary(mod3)
Call:
Residuals:
Linear Regression :
It is a commonly used type of predictive analysis. It is a statistical approach for modelling
relationship between a dependent variable and a given set of independent variables.
There are two types of linear regression.
• Simple Linear Regression
• Multiple Linear Regression
Let’s discuss Simple Linear regression using R.
Simple Linear Regression:
It is a statistical method that allows us to summarize and study relationships between two
continuous (quantitative) variables. One variable denoted x is regarded as an independent
variable and other one denoted y is regarded as a dependent variable. It is assumed that the
two variables are linearly related. Hence, we try to find a linear function that predicts the
response value(y) as accurately as possible as a function of the feature or independent
variable(x).
For understanding the concept let’s consider a salary dataset where it is given the value of
the dependent variable(salary) for every independent variable(years experienced).
Salary dataset-
Years experienced Salary
1.1 39343.00
1.3 46205.00
1.5 37731.00
2.0 43525.00
2.2 39891.00
2.9 56642.00
3.0 60150.00
3.2 54445.00
3.2 64445.00
3.7 57189.00
For general purpose, we define:
x as a feature vector, i.e x = [x_1, x_2, …., x_n],
y as a response vector, i.e y = [y_1, y_2, …., y_n]
for n observations (in above example, n=10).
Now, we have to find a line which fits the above scatter plot through which we can predict
any value of y or response for any value of x
The line which best fits is called the regression line.
The equation of regression line is given by:
y = a + bx
Where y is predicted response value, a is y intercept, x is feature value and b is slope.
To create the model, let’s evaluate the values of regression coefficient a and b. And as soon
as the estimation of these coefficients is done, the response model can be predicted. Here we
are going to use Least Square Technique.
The principle of least squares is one of the popular methods for finding a curve fitting a
given data. Say (x1, y1), (x2, y2)….(xn, yn) be n observations from an experiment. We are
interested in finding a curve
y = f(x), closely fitting the given data of size n. Now, at x = x1, while the observed value of y is y1, the expected value of y from the curve is f(x1). The residual can then be defined as
e_i = y_i − f(x_i) for each observation.
While evaluating the residuals we will find that some residuals are positive and some are negative. We are looking for the curve that fits the given data such that the residual at any x_i is as small as possible. Since some of the residuals are positive and others are negative, and since we would like to give equal importance to all the residuals, it is desirable to consider the sum of the squares of these residuals. Thus we consider:
E = Σ e_i² = Σ [y_i − (a + b·x_i)]²
Note: E is a function of the parameters a and b, and we need to find a and b such that E is minimum. The necessary condition for E to be minimum is:
∂E/∂a = 0 and ∂E/∂b = 0
These conditions give the normal equations, which are solved to get the values of a and b:
Σ y_i = n·a + b·Σ x_i
Σ x_i·y_i = a·Σ x_i + b·Σ x_i²
In R, we do not need to solve these equations by hand; the lm() function does this for us:
lm(Y ~ model)
where Y is the object containing the dependent variable to be predicted and model is the
formula for the chosen mathematical model.
The command lm( ) provides the model’s coefficients but no further statistical information.
The following R code is used to implement simple linear regression:
dataset = read.csv('salary.csv')
install.packages('caTools')
library(caTools)

# split the data into a training set and a test set
# (the 0.7 split ratio is an illustrative choice)
split = sample.split(dataset$Salary, SplitRatio = 0.7)
trainingset = subset(dataset, split == TRUE)
testset = subset(dataset, split == FALSE)

# fit the simple linear regression model on the training set
lm.r = lm(formula = Salary ~ YearsExperience, data = trainingset)
coef(lm.r)

# visualising the training set results
install.packages("ggplot2")
library(ggplot2)
ggplot() +
  geom_point(aes(x = trainingset$YearsExperience, y = trainingset$Salary),
             colour = 'red') +
  geom_line(aes(x = trainingset$YearsExperience,
                y = predict(lm.r, newdata = trainingset)),
            colour = 'blue') +
  xlab('Years of experience') +
  ylab('Salary')

# visualising the test set results
ggplot() +
  geom_point(aes(x = testset$YearsExperience, y = testset$Salary),
             colour = 'red') +
  geom_line(aes(x = trainingset$YearsExperience,
                y = predict(lm.r, newdata = trainingset)),
            colour = 'blue') +
  xlab('Years of experience') +
  ylab('Salary')
Output of coef(lm.r):
Modelling strategies
Frank Harrell's Regression Modelling Strategies is a must-read for anyone who ever fits a regression model, although be prepared - depending on your background, you might get 30 pages in and suddenly become convinced you've been doing nearly everything wrong before, which can be disturbing.
I wanted to evaluate three simple modelling strategies in dealing with data with many variables.
Using data with 54 variables on 1,785 area units from New Zealand’s 2013 census, I’m looking
to predict median income on the basis of the other 53 variables. The features are all continuous
and are variables like “mean number of bedrooms”, “proportion of individuals with no
religion” and “proportion of individuals who are smokers”. Restricting myself to traditional
linear regression with a normally distributed response, my three alternative strategies were:
1. use all 53 variables in the model ("use all variables");
2. use stepwise selection based on AIC to choose a subset of variables ("AIC stepwise selection");
3. remove collinear variables first and fit the model on the rest ("remove collinear variables").
Validating models
The main purpose of the exercise was actually to ensure I had my head around different ways
of estimating the validity of a model, loosely definable as how well it would perform at
predicting new data. As there is no possibility of new areas in New Zealand from 2013 that
need to have their income predicted, the “prediction” is a thought-exercise which we need to
find a plausible way of simulating. Confidence in hypothetical predictions gives us confidence
in the insights the model gives into relationships between variables.
There are many methods of validating models, although I think k-fold cross-validation has
market dominance (not with Harrell though, who prefers varieties of the bootstrap). The three
validation methods I’ve used for this post are:
1. ‘simple’ bootstrap. This involves creating resamples with replacement from the original
data, of the same size; applying the modelling strategy to the resample; using the model
to predict the values of the full set of original data and calculating a goodness of fit
statistic (eg either R-squared or root mean squared error) comparing the predicted value
to the actual value. Note - Following Efron, Harrell calls this the “simple bootstrap”,
but other authors and the useful caret package use “simple bootstrap” to mean the
resample model is used to predict the out-of-bag values at each resample point, rather
than the full original sample.
2. ‘enhanced’ bootstrap. This is a little more involved and is basically a method of
estimating the ‘optimism’ of the goodness of fit statistic. There’s a nice step by step
explanation by thestatsgeek which I won’t try to improve on.
3. repeated 10-fold cross-validation. 10-fold cross-validation involves dividing your data
into ten parts, then taking turns to fit the model on 90% of the data and using that model
to predict the remaining 10%. The average of the 10 goodness of fit statistics becomes
your estimate of the actual goodness of fit. One of the problems with k-fold cross-
validation is that it has a high variance, i.e. doing it different times you get different results
based on the luck of your k-way split; so repeated k-fold cross-validation addresses this
by performing the whole process a number of times and taking the average.
As the sample sizes get bigger relative to the number of variables in the model the methods
should converge. The bootstrap methods can give over-optimistic estimates of model validity
compared to cross-validation; there are various other methods available to address this issue
although none seem to me to provide an all-purpose solution.
It’s critical that the re-sampling in the process envelopes the entire model-building strategy,
not just the final fit. In particular, if the strategy involves variable selection (as two of my
candidate strategies do), you have to automate that selection process and run it on each different
resample. That’s because one of the highest risk parts of the modelling process is that variable
selection. Running cross-validation or the bootstrap on a final model after you’ve eliminated a
bunch of variables is missing the point, and will give materially misleading statistics (biased
towards things being more “significant” than there really is evidence for). Of course, that
doesn’t stop this being common misguided practice.
Results
One nice feature of statistics since the revolution of the 1980s is that the bootstrap helps you
conceptualise what might have happened but didn’t. Here’s the root mean squared error from
the 100 different bootstrap resamples when the three different modelling strategies (including
variable selection) were applied:
Notice anything? Not only does it seem to be generally a bad idea to drop variables just because
they are collinear with others, but occasionally it turns out to be a really bad idea - like in
resamples #4, #6 and around thirty others. Those thirty or so spikes are in resamples where
random chance led to one of the more important variables being dumped before it had a chance
to contribute to the model.
The thing that surprised me here was that the generally maligned step-wise selection strategy
performed nearly as well as the full model, judged by the simple bootstrap. That result comes
through for the other two validation methods as well:
In all three validation methods there’s really nothing substantive to choose between the “full
model” and “stepwise” strategies, based purely on results.
Reflections
The full model is much easier to fit, interpret, estimate confidence intervals and perform tests
on than stepwise. All the standard statistics for a final model chosen by stepwise methods are
misleading and careful recalculations are needed based on elaborate bootstrapping. So the full
model wins hands-down as a general strategy in this case.
With this data, we have a bit of freedom from the generous sample size. If approaching this for
real I wouldn’t eliminate any variables unless there were theoretical / subject matter reasons to
do so. I have made the mistake of eliminating the co-linear variables before from this dataset
but will try not to do it again. The rule of thumb is to have 20 observations for each parameter
(this is one of the most asked and most dodged questions in statistics education; see Table 4.1
of Regression Modelling Strategies for this particular answer), which suggests we can have up
to 80 parameters with a bit to spare. This gives us 30 parameters to use for non-linear
relationships and/or interactions, which is the direction I might go in a subsequent post. Bearing
that in mind, I’m not bothering to report here the actual substantive results (eg which factors
are related to income and how); that can wait for a better model down the track.
#===================setup=======================
library(ggplot2)
library(scales)
library(MASS)
library(boot)
library(caret)
library(dplyr)
library(tidyr)
library(directlabels)
set.seed(123)
# restrict to areas with no missing data. If this was any more complicated (eg
# imputation),it would need to be part of the validation resampling too; but
# just dropping them all at the beginning doesn't need to be resampled; the only
# implication would be sample size which would be small impact and complicating.
au <- au[complete.cases(au), ]
# The third strategy, full model, is only a one-liner with standard functions
# so I don't need to define a function separately for it.
# perform bootstrap
Repeats <- 100
res <- boot(au, statistic = compare, R = Repeats)
# restructure results for a graphic showing root mean square error, and for
# later combination with the other results. I chose just to focus on RMSE;
# the messages are similar if R squared is used.
RMSE_res <- as.data.frame(res$t[ , 4:6])
names(RMSE_res) <- c("AIC stepwise selection", "Remove collinear variables", "Use all variables")
RMSE_res %>%
mutate(trial = 1:Repeats) %>%
# create a function suitable for boot that will return the optimism estimates for
# statistics testing models against the full original sample.
compare_opt <- function(orig_data, i){
# create the resampled data
train_data <- orig_data[i, ]
# perform bootstrap
res_opt <- boot(au, statistic = compare_opt, R = Repeats)
#------------------repeated cross-validation------------------
# The number of cross validation repeats is the number of bootstrap repeats / 10:
cv_repeat_num <- Repeats / 10
#===============reporting results===============
# combine the three cross-validation results together and combined with
# the bootstrap results from earlier
summary_results <- data.frame(rbind(
simple,
enhanced,
c(mean(cv_step$resample$RMSE),
cv_vif,
mean(cv_full$resample$RMSE)
)
), check.names = FALSE) %>%
mutate(method = c("Simple bootstrap", "Enhanced bootstrap",
paste(cv_repeat_num, "repeats 10-fold\ncross-validation"))) %>%
gather(variable, value, -method)
# Draw a plot summarising the results
direct.label(
summary_results %>%
mutate(variable = factor(variable, levels = c(
"Use all variables", "AIC stepwise selection", "Remove collinear variables"
))) %>%
ggplot(aes(y = method, x = value, colour = variable)) +
geom_point(size = 3) +
labs(x = "Estimated Root Mean Square Error (higher is worse)\n",
colour = "Modelling\nstrategy",
y = "Method of estimating model fit\n",
caption = "Data from New Zealand Census 2013") +
ggtitle("Three different validation methods of three different regression strategies",
NON-LINEAR MODELING
Module 6
Consider a nonlinear least squares model in R, for example a scaled logistic of the form
y ~ theta / (1 + exp(-(alpha + beta * x)))
fitted with nls(). I'd like to replace the term "alpha + beta * x" with (say) a natural cubic spline.
here's some code to create some example data with a nonlinear function inside the logistic:
set.seed(438572L)
x <- seq(1,10,by=.25)
y <- 8.6/(1+exp( -(-3+x/4.4+sqrt(x*1.1)*(1.-sin(1.+x/2.9))) )) + rnorm(x, s=0.2 )
Without the need for a logistic around it, if I was in lm, I could replace a linear term with a
spline term easily; so a linear model something like this:
lm( y ~ x )
then becomes
library("splines")
lm( y ~ ns( x, df = 5 ) )
generating fitted values is simple and getting predicted values with the aid of (for example) the
rms package seems simple enough.
Indeed, fitting the original data with that lm-based spline fit isn't too bad, but there's a reason I
need it inside the logistic function (or rather, the equivalent in my problem).
The problem with nls is that I need to provide names for all the parameters. I'm quite happy with calling them, say, b1, ..., b5 for one spline fit (and say c1, ..., c6 for another variable - I'll need to be able to make several of them).
Is there a reasonably neat way to generate the corresponding formula for nls so that I can
replace the linear term inside the nonlinear function with a spline?
The only ways I can figure that there could be to do it are a bit awkward and clunky and don't
nicely generalize without writing a whole bunch of code.
Many data in the environmental sciences do not fit simple linear models and are best described
by “wiggly models”, also known as Generalised Additive Models (GAMs).
Let’s start with a famous tweet by one Gavin Simpson, which amounts to:
1. GAMs are just GLMs
2. GAMs fit wiggly terms
3. use + s(x) not x in your syntax
4. use method = "REML"
5. always look at gam.check()
This is basically all there is to it – an extension of generalised linear models (GLMs) with a
smoothing function. Of course, there may be many sophisticated things going on when you fit
a model with smooth terms, but you only need to understand the rationale and some basic
theory. There are also lots of what would be apparently magic things happening when we try
to understand what is under the hood of say lmer or glmer, but we use them all the time without
reservation!
GAMs in a nutshell
Let’s start with an equation for a Gaussian linear model:
y = β0 + x1·β1 + ε,   ε ∼ N(0, σ²)
What changes in a GAM is the presence of a smoothing term:
y = β0 + f(x1) + ε,   ε ∼ N(0, σ²)
This simply means that the contribution to the linear predictor is now some function f. This is not that dissimilar conceptually to using a quadratic (x1²) or cubic term (x1³) as your predictor.
The function f can be something more funky or kinky – here, we're going to focus on splines. In the old days, it might have been something like piecewise linear functions.
You can have combinations of linear and smooth terms in your model, for example
y = β0 + x1·β1 + f(x2) + ε,   ε ∼ N(0, σ²)
or we can fit generalised distributions and random effects, for example
ln(y) = β0 + f(x1) + ε,   ε ∼ Poisson(λ)
ln(y) = β0 + f(x1) + z1·γ + ε,   ε ∼ Poisson(λ), γ ∼ N(0, Σ)
A simple example
Let's try a simple example. First, let's create a data frame and fill it with some simulated data
with an obvious non-linear trend and compare how well some models fit to that data.
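A sketch of that setup (the exact simulation is an assumption: a sine curve plus noise gives the obvious non-linear trend described, and lm_y is the straight-line fit referred to below):
x <- seq(0, pi * 2, 0.1)
sin_x <- sin(x)
y <- sin_x + rnorm(n = length(x), mean = 0, sd = sd(sin_x / 2))
Sample_data <- data.frame(y, x)
# a straight-line fit for comparison
lm_y <- lm(y ~ x, data = Sample_data)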
Looking at the plot or summary(lm_y), you might think the model fits nicely, but look at the
residual plot – eek!
plot(lm_y, which = 1)
Clearly, the residuals are not evenly spread across values of x, and we need to consider a
better model.
Running the analysis
Before we consider a GAM, we need to load the package mgcv – the choice for running GAMs
in R.
library(mgcv)
To run a GAM, we use:
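A sketch of that call (the object name gam_y matches the summary and plots shown below; method = "REML" follows the advice in the tweet above):
gam_y <- gam(y ~ s(x), method = "REML")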
You can see the model is a better fit to the data, but always check the diagnostics; gam.check() produces output like the following:
##
## Method: REML Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [-2.37327e-09,1.17425e-09]
## (score 44.14634 & scale 0.174973).
## Hessian positive definite, eigenvalue range [1.75327,30.69703].
## Model rank = 10 / 10
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(x) 9.00 5.76 1.19 0.9
Using summary with the model object will give you the significance of the smooth term (along
with any parametric terms, if you’ve included them), along with the variance explained. In this
example, a pretty decent fit. The ‘edf’ is the estimated degrees of freedom – essentially, the
larger the number, the more wiggly the fitted model. Values of around 1 tend to be close to a
linear term. You can read about penalisation and shrinkage for more on what the edf reflects.
summary(gam_y)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01608 0.05270 -0.305 0.761
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(x) 5.76 6.915 23.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.722 Deviance explained = 74.8%
## -REML = 44.146 Scale est. = 0.17497 n = 63
Smooth terms
As mentioned above, we’ll focus on splines, as they are the smooth functions that are most
commonly implemented (and are pretty quick and stable). So what was actually going on when
we specified s(x)?
Well, this is where we say we want to fit y as a linear function of some set of functions of x.
The default in mgcv is a thin plate regression spline – the two common ones you’ll probably
see are these, and cubic regression splines. Cubic regression splines have the
traditional knots that we think of when we talk about splines – they’re evenly spread across the
covariate range in this case. We’ll just stick to thin plate regression splines, since I figure Simon
made them the default for a reason.
Basis functions
OK, so here's where we see what the wiggle bit is really made of. We'll start with the fitted model, then we'll look at it from first principles (not really). Remembering that the smooth term is the sum of some number of functions (I'm not sure how well this equation really represents the smooth term, but you get the point),
f(x1) = Σ_{j=1}^{k} b_j(x1)·β_j
First we extract the set of basis functions (that is, the b_j(x1) part of the smooth term). Then we can plot, say, the first and second basis functions.
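A sketch of that step (model_matrix is the name the later matplot() call expects; predict(..., type = "lpmatrix") returns the basis functions evaluated at the data, with the intercept in the first column):
model_matrix <- predict(gam_y, type = "lpmatrix")
plot(y ~ x)
abline(h = 0)
lines(x, model_matrix[, 2], lty = 2)  # first basis function of s(x)
lines(x, model_matrix[, 3], lty = 2)  # second basis function of s(x)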
Let’s plot all of the basis functions now, and then add that to the predictions from the GAM
(y_pred) on top again.
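y_pred has not been defined yet at this point; a sketch of the missing step (an evenly spaced grid of new x values, reused again at the end of this section):
x_new <- seq(0, max(x), length.out = 100)
y_pred <- predict(gam_y, data.frame(x = x_new))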
plot(y ~ x)
abline(h = 0)
matplot(x, model_matrix[, -1], type = "l", lty = 2, add = TRUE)  # all basis functions
lines(y_pred ~ x_new, col = "red", lwd = 2)                      # GAM predictions on top
Now, it's difficult at first to see what has happened, but it's easiest to think about it like this – each of those dotted lines represents a function (b_j) for which gam estimates a coefficient (β_j), and when you sum them you get the contribution for the corresponding f(x) (i.e. the previous equation). It's nice and simple for this example, because we model y only as a function of the smooth term, so it's fairly relatable. As an aside, you can also just use plot.gam to plot the smooth terms.
plot(gam_y)
OK, now let’s look a little closer at how the basis functions are constructed. You’ll see that the
construction of the functions is separate to the response data. Just to prove it, we’ll
use smoothCon.
x_sin_smooth <- smoothCon(s(x), data = data.frame(x), absorb.cons = TRUE)
X <- x_sin_smooth[[1]]$X
par(mfrow = c(1,2))
matplot(x, X, type = "l", main = "smoothCon()")
matplot(x, model_matrix[,-1], type = "l", main = "predict(gam_y)")
And now to prove that you can go from the basis functions and the estimated coefficients to
the fitted smooth term. Again note that this is simplified here because the model is just one
smooth term. If you had more terms, we would need to add up all of the terms in the linear
predictor.
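linear_pred below has not been defined in this excerpt; a sketch of how it can be built from the basis matrix and the fitted coefficients:
betas <- gam_y$coefficients
linear_pred <- model_matrix %*% betas  # basis functions times coefficients = fitted smooth (plus intercept)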
par(mfrow = c(1,2))
plot(y ~ x, main = "manual from basis/coefs")
lines(linear_pred ~ x, col = "red", lwd = 2)
plot(y ~ x, main = "predict(gam_y)")
lines(y_pred ~ x_new, col = "red", lwd = 2)
Out of interest, take a look at the following plot, remembering that X is the matrix of basis
functions.
par(mfrow = c(1,2))
plot(y ~ x)
plot(y ~ rowSums(X))
Decision Tree
A decision tree is a graph that represents choices and their results in the form of a tree. The nodes in
the graph represent an event or choice, and the edges of the graph represent the decision rules
or conditions. It is mostly used in Machine Learning and Data Mining applications using R.
Examples of the use of decision trees are: predicting whether an email is spam or not spam,
predicting whether a tumor is cancerous, or predicting whether a loan is a good or bad credit
risk based on the factors in each of these cases. Generally, a model is created with observed
data, also called training data. Then a set of validation data is used to verify and improve the
model. R has packages which are used to create and visualize decision trees. For a new set of
predictor variables, we use this model to arrive at a decision on the category (yes/no,
spam/not spam) of the data.
The R package "party" is used to create decision trees.
Install R Package
Use the below command in R console to install the package. You also have to install the
dependent packages if any.
install.packages("party")
The package "party" has the function ctree() which is used to create and analyze decision trees.
Syntax
The basic syntax for creating a decision tree in R is −
ctree(formula, data)
Following is the description of the parameters used −
• formula is a formula describing the predictor and response variables.
• data is the name of the data set used.
Input Data
We will use the R built-in data set named readingSkills to create a decision tree. It records,
for each person, the variables "age", "shoeSize" and "score", along with whether the person is
a native speaker or not.
Here is the sample data.
# Load the party package. It will automatically load other
# dependent packages.
library(party)
print(head(readingSkills))
When we execute the above code, it produces the following result −
nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
Example
We will use the ctree() function to create the decision tree and see its graph.
# Load the party package. It will automatically load other
# dependent packages.
library(party)
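The remainder of this example is missing from these notes. A hedged reconstruction of the usual ctree() workflow on readingSkills follows; input.dat and output.tree are assumed names:
# Create the input data frame from the first 105 rows.
input.dat <- readingSkills[c(1:105), ]
# Create the tree: nativeSpeaker as the response, the other columns as predictors.
output.tree <- ctree(nativeSpeaker ~ age + shoeSize + score, data = input.dat)
# Plot the tree.
plot(output.tree)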
Conclusion
From the decision tree shown above we can conclude that anyone whose readingSkills score
is less than 38.3 and whose age is more than 6 is not a native speaker.
K-means Clustering in R
Clustering analysis is not too difficult to implement and is meaningful as well as actionable for
business.
The most striking difference between supervised and unsupervised learning lies in the results.
Unsupervised learning creates a new variable, the label, while supervised learning predicts an
outcome. The machine helps the practitioner in the quest to label the data based on close
relatedness. It is up to the analyst to make use of the groups and give a name to them.
Let’s make an example to understand the concept of clustering. For simplicity, we work in two
dimensions. You have data on the total spend of customers and their ages. To improve
advertising, the marketing team wants to send more targeted emails to their customers.
In the following graph, you plot the total spend and the age of the customers.
library(ggplot2)
df <- data.frame(age = c(18, 21, 22, 24, 26, 26, 27, 30, 31, 35, 39, 40, 41, 42, 44, 46, 47, 48,
49, 54),
spend = c(10, 11, 22, 15, 12, 13, 14, 33, 39, 37, 44, 27, 29, 20, 28, 21, 30, 31, 23, 24)
)
ggplot(df, aes(x = age, y = spend)) +
geom_point()
A pattern is visible at this point:
1. At the bottom-left, you can see young people with a lower purchasing power.
2. The upper-middle area shows people with a job who can afford to spend more.
3. Finally, older people with a lower budget.
In the figure above, you cluster the observations by hand and define each of the three groups.
This example is somewhat straightforward and highly visual. If new observations are appended
to the data set, you can label them within the circles. You define the circles based on your own
judgment. Instead, you can use machine learning to group the data objectively, as sketched below.
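As a hedged illustration of that idea on the toy data above (the variable names come from the df defined earlier; the three-group split and the seed are assumptions for this simulated example):
# Standardize both variables so age and spend are on the same scale,
# then ask k-means for 3 groups
set.seed(123)
km <- kmeans(scale(df), centers = 3, nstart = 25)
# Cluster assignment for each customer
km$cluster
# Colour the scatter plot by the assigned cluster
ggplot(df, aes(x = age, y = spend, colour = factor(km$cluster))) +
  geom_point()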
In this tutorial, you will learn how to use the k-means algorithm.
K-means algorithm
K-means is, without doubt, the most popular clustering method. Researchers released the
algorithm decades ago, and lots of improvements have been made to k-means since.
The algorithm tries to find groups by minimizing the distance between the observations,
called local optimal solutions. The distances are measured based on the coordinates of the
observations. For instance, in a two-dimensional space, the coordinates of an observation are
simply its x and y values.
K-means usually takes the Euclidean distance between two observations x and y:
d(x, y) = √( Σ_i (x_i − y_i)² )
Different measures are available, such as the Manhattan distance or the Minkowski distance.
Note that k-means returns different groups each time you run the algorithm. Recall that the
first initial guesses are random; the algorithm then recomputes the distances until it reaches
homogeneity within groups. That is, k-means is very sensitive to the first choice, and unless
the number of observations and groups is small, it is almost impossible to get the same
clustering twice.
The number of clusters depends on the nature of the data set, the industry, the business question
and so on. However, a common rule of thumb for the number of clusters k is k ≈ √(n/2), where
n is the number of observations.
Generally speaking, it is worth spending time searching for the value of k that best fits the
business need.
We will use the Prices of Personal Computers dataset to perform our clustering analysis. This
dataset contains 6259 observations and 10 features. The dataset observes the prices, from 1993
to 1995, of 486 personal computers in the US. The variables are price, speed, ram, screen and
cd, among others.
• Import data
• Train the model
• Evaluate the model
Import data
K-means is not suitable for factor variables because it is based on distances, and discrete values
do not return meaningful results. You can delete the three categorical variables in the
dataset. Besides, there are no missing values in this dataset.
library(dplyr)
PATH <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/computers.csv"
df <- read.csv(PATH) %>%
select(-c(X, cd, multi, premium))
glimpse(df)
Output
## Observations: 6,259
## Variables: 7
## $ price <int> 1499, 1795, 1595, 1849, 3295, 3695, 1720, 1995, 2225, 2...
## $ speed <int> 25, 33, 25, 25, 33, 66, 25, 50, 50, 50, 33, 66, 50, 25, ...
## $ hd <int> 80, 85, 170, 170, 340, 340, 170, 85, 210, 210, 170, 210...
## $ ram <int> 4, 2, 4, 8, 16, 16, 4, 2, 8, 4, 8, 8, 4, 8, 8, 4, 2, 4, ...
## $ screen <int> 14, 14, 15, 14, 14, 14, 14, 14, 14, 15, 15, 14, 14, 14, ...
## $ ads <int> 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, ...
## $ trend <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
From the summary statistics, you can see the data has large values. A good practice with
k-means and distance calculation is to rescale the data so that the mean is equal to zero and
the standard deviation is equal to one.
summary(df)
Output:
kmeans(df, k)
arguments:
-df: dataset used to run the algorithm
-k: Number of clusters
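The rescaling step itself is not shown in these notes; a minimal sketch, assuming the goal is simply to standardize every column (rescale_df is the name the training step below expects, and scale() centres each column to mean 0 and standard deviation 1):
# Standardize every numeric column of df and keep the result as a data frame
rescale_df <- as.data.frame(scale(df))
summary(rescale_df)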
Train the model
In figure three, you saw in detail how the algorithm works. You can watch each step graphically
with the animation package built by Yihui Xie (also the creator of knitr for R Markdown). The
package is not available in the conda library, so you install it the other way, with
install.packages("animation"), and you can then check whether it is installed in your Anaconda
folder.
install.packages("animation")
After you load the library, you add .ani after kmeans and R will plot all the steps. For illustration
purposes, you only run the algorithm with the rescaled variables hd and ram, with three clusters.
set.seed(2345)
library(animation)
kmeans.ani(rescale_df[2:3], 3)
Code Explanation
• kmeans.ani(rescale_df[2:3], 3): Select columns 2 and 3 of the rescale_df data set and
run the algorithm with k set to 3. Plot the animation.
The algorithm converged after seven iterations. You can run the k-means algorithm on the
dataset with five clusters and call the result pc_cluster.
pc_cluster <- kmeans(rescale_df, 5)
You will use the total within-cluster sum of squares (i.e. tot.withinss) to choose the optimal
number of clusters k. Finding k is indeed a substantial task.
Optimal k
One technique to choose the best k is called the elbow method. This method uses within-group
homogeneity or within-group heterogeneity to evaluate the variability. In other words, you are
interested in the percentage of the variance explained by each cluster. You can expect the
explained variability to increase with the number of clusters; equivalently, heterogeneity
decreases. Our challenge is to find the k that lies beyond the point of diminishing returns:
adding a new cluster beyond it does not improve the variability explained, because very little
information is left to explain.
In this tutorial, we find this point using the heterogeneity measure. The total within-cluster
sum of squares is the tot.withinss element in the list returned by kmeans().
You can construct the elbow graph and find the optimal k as follows:
• Step 1: Construct a function to compute the total within-cluster sum of squares
• Step 2: Run the algorithm over a range of values of k
• Step 3: Create a data frame with the results of the algorithm
• Step 4: Plot the results
Step 1) Construct a function to compute the total within-cluster sum of squares
You create a function that runs the k-means algorithm and stores the total within-cluster sum
of squares, as sketched below.
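A hedged sketch of that helper and of the remaining steps (kmean_withinss, max_k and elbow are assumed names; ggplot2 is assumed to be loaded):
# Step 1: total within-cluster sum of squares for a given k
kmean_withinss <- function(k) {
  cluster <- kmeans(rescale_df, k)
  return(cluster$tot.withinss)
}
# Step 2: run the algorithm for k = 2 to 20
max_k <- 20
wss <- sapply(2:max_k, kmean_withinss)
# Step 3: collect the results in a data frame
elbow <- data.frame(k = 2:max_k, wss = wss)
# Step 4: plot wss against k and look for the "elbow"
library(ggplot2)
ggplot(elbow, aes(x = k, y = wss)) +
  geom_point() +
  geom_line()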
DATA
The remaining sections work through a separate clustering case study on a pizza dataset of 300
observations, consisting of a brand label and numeric measurements (moisture, protein, fat,
ash, sodium, carbohydrates and calories).
Descriptive Statistics
This section includes descriptive statistics for the unscaled form of the data.
summary(pizza)
## brand id mois prot
## Length:300 Min. :14003 Min. :25.00 Min. : 6.98
## Class :character 1st Qu.:14094 1st Qu.:30.90 1st Qu.: 8.06
## Mode :character Median :24021 Median :43.30 Median :10.44
## Mean :20841 Mean :40.90 Mean :13.37
## 3rd Qu.:24110 3rd Qu.:49.12 3rd Qu.:20.02
## Max. :34045 Max. :57.22 Max. :28.48
## fat ash sodium carb
## Min. : 4.38 Min. :1.170 Min. :0.2500 Min. : 0.510
## 1st Qu.:14.77 1st Qu.:1.450 1st Qu.:0.4500 1st Qu.: 3.467
## Median :17.14 Median :2.225 Median :0.4900 Median :23.245
## Mean :20.23 Mean :2.633 Mean :0.6694 Mean :22.865
## 3rd Qu.:21.43 3rd Qu.:3.592 3rd Qu.:0.7025 3rd Qu.:41.337
Data Preparation
The data preparation is shown step by step below:
1. At the beginning of the analysis, the imported data is checked for cluster tendency using the
Hopkins statistic;
hopkins(pizza[,2:9], n=nrow(pizza[,2:9])-1)  # hopkins() presumably from the clustertend package (returns a list with $H)
## $H
## [1] 0.002383577
1-0.002373702
## [1] 0.9976263
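Since 1 − H ≈ 0.998 is close to 1, the data shows a strong cluster tendency. The standardized data frame pizs used from here on is not constructed anywhere in the notes; a hedged sketch, scaling the numeric columns of pizza:
# Standardize the numeric columns (id through cal); the character brand column is left out
pizs <- scale(pizza[, 2:9])
head(pizs)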
head(pizs)
## id mois prot fat ash sodium carb
## [1,] -0.9725866 -1.369526 1.252089 2.745255 1.950635 2.971721 -1.225463
## [2,] -0.9748845 -1.299391 1.225669 2.636070 2.131776 3.025723 -1.211598
## [3,] -0.9789058 -1.314046 1.028292 2.846640 1.927007 2.593708 -1.223800
## [4,] -0.9801984 -1.083752 1.053158 2.551397 1.698611 2.539707 -1.191630
## [5,] -0.9817782 -1.090033 1.228777 2.386506 1.722238 2.620709 -1.170554
## [6,] -0.9717249 -1.021991 1.065591 2.460039 1.800996 2.647710 -1.190521
## cal
## [1,] 2.675659
## [2,] 2.530505
## [3,] 2.707915
## [4,] 2.369224
## [5,] 2.256327
## [6,] 2.256327
CLUSTERING ANALYSIS
K-means
In this project, k-means clustering analysis is done using the Euclidean distance metric. First,
the optimal number of clusters is detected using the Elbow method.
As the plots show, the best option is 3 clusters; k-means clustering with Euclidean distance
therefore continues with 3 clusters below.
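The fitted object wcke summarized below is presumably created with eclust() from the factoextra package (the clust_plot, silinfo and nbclust components in the summary match that function's return value); a hedged sketch:
library(factoextra)
# k-means via eclust(): Euclidean distance, k = 3 clusters
wcke <- eclust(pizs, FUNcluster = "kmeans", k = 3, hc_metric = "euclidean", graph = FALSE)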
The summary of k-means clustering with the Euclidean distance metric is shown below.
summary(wcke)
## Length Class Mode
## cluster 300 -none- numeric
## centers 24 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
## clust_plot 9 gg list
## silinfo 3 -none- list
## nbclust 1 -none- numeric
Silhouette
The silhouette value is used to check the quality of the clusters. It measures how similar an
object is to its own cluster compared with how dissimilar it is to the other clusters, and it takes
values between -1 and 1. A value close to 1 means that the observations in a cluster are well
matched; the closer the average silhouette width is to 1, the higher the clustering quality. The
average silhouette width for this experiment is 0.48, which suggests the dataset is suitable for
cluster analysis.
sile <- silhouette(wcke$cluster, dist(pizs))  # silhouette() comes from the cluster package
fviz_silhouette(sile)
## cluster size ave.sil.width
## 1 1 151 0.32
## 2 2 29 0.76
## 3 3 120 0.61
Here the best clustering result is obtained with the 2nd cluster (green).
Although 10 clusters are suggested by the method, the number of clusters is set to 4 in order
to create a more understandable clustering:
pizs.pam <- pam(pizs, 3)  # pam() comes from the cluster package
pizs.pam
## Medoids:
## ID id mois prot fat ash sodium
## [1,] 23 0.4732154 -1.0492077 1.1199866 2.45446807 1.8324986 2.5937084
## [2,] 107 0.4579919 0.7439488 1.2334394 0.09141019 1.0843043 0.1636256
## [3,] 139 0.4679016 -0.4410209 -0.8662149 -0.60825993 -0.9240069 -0.5653993
## carb cal
## [1,] -1.1949583 2.27245510
## [2,] -0.9564632 -0.48545705
## [3,] 0.9104540 -0.09838166
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3
## [260] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [297] 3 3 3 3
## Objective function:
## build swap
## 1.754077 1.620609
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
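The object pam.res examined below has four medoids, so it is presumably a separate PAM fit with k = 4; a hedged sketch:
# PAM with 4 clusters on the standardized data
pam.res <- pam(pizs, 4)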
pam.res$medoids
## id mois prot fat ash sodium
## [1,] 0.4732154 -1.0492077 1.1199866 2.45446807 1.8324986 2.5937084
## [2,] 0.4579919 0.7439488 1.2334394 0.09141019 1.0843043 0.1636256
## [3,] 0.4559813 -0.7330761 -0.8724315 -0.48904862 -1.0500186 -0.6734030
## [4,] 0.4625877 0.8036160 -0.5616019 -0.47010851 -0.2624456 -0.1873864
## carb cal
## [1,] -1.19495831 2.2724551
## [2,] -0.95646323 -0.4854571
## [3,] 1.01694485 0.1757967
## [4,] 0.02691297 -0.8080199
pam.res$clusinfo
## size max_diss av_diss diameter separation
## [1,] 29 1.710058 1.124369 3.076119 3.362456
## [2,] 90 2.457287 1.619764 4.343469 1.725482
## [3,] 120 2.417467 1.235228 4.066620 1.215986
## [4,] 61 2.069652 1.227803 3.479344 1.215986
When we look at the structure of the 4 clusters, cluster 3 has the most observations and cluster
1 has the fewest.
Silhouette
The average silhouette width is 0.47 in the PAM clustering analysis with the Euclidean
distance metric, so there is only a slight difference between k-means and PAM in terms of the
silhouette result.
sile<-silhouette(pam.res$cluster, dist(pizs))
fviz_silhouette(sile)
## cluster size ave.sil.width
## 1 1 29 0.73
## 2 2 90 0.36
## 3 3 120 0.50
## 4 4 61 0.45
Here the best clustering result is obtained with the 1st cluster, which is 0.73, while the 2nd
cluster has the worst quality among the clusters and its silhouette width value is 0.36.
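The map_dbl() call below relies on a vector of linkage methods m and a helper ac that are not shown in the notes; a hedged sketch using purrr and cluster::agnes to compute the agglomerative coefficient for each method:
library(purrr)
library(cluster)
# Linkage methods to compare
m <- c("average", "single", "complete", "ward")
names(m) <- m
# Agglomerative coefficient of an agnes fit for a given linkage method
ac <- function(x) {
  agnes(pizs, method = x)$ac
}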
map_dbl(m, ac)
## average single complete ward
## 0.9609417 0.9365627 0.9704849 0.9938937
Ward gives the largest value here (0.9938937), so ward.D2 is chosen as the method for
hierarchical clustering.
After analyzing the dendrogram, the optimal cut point is determined at a height of 20, which
corresponds to 3 clusters in this experiment; 2 of these clusters mainly contain outliers.
CONCLUSION
In conclusion, 3 clustering methods (k-means, PAM and hierarchical clustering) with the
Euclidean distance metric are analyzed in this report. The analysis starts with descriptive
statistics and an examination of data scaling. Cluster tendency is then measured using Hopkins'
statistic, and the optimal number of clusters is found using the Elbow method for both the
k-means and PAM algorithms. For this dataset, k-means and PAM show no dramatic difference
with the Euclidean distance metric; they give only slightly different results for silhouette width
and clustering, although the k-means algorithm has the better average silhouette width (0.48
versus 0.37). Nevertheless, the 3 clustering algorithms give different clustering results, and in
this study the k-means algorithm gives a more accurate clustering solution than the PAM
algorithm. In addition, 3 clusters are obtained in hierarchical clustering using the Euclidean
distance metric; the cluster proportions are good and the dendrogram shape also supports
cutting the tree into 3 clusters.