
Lecture Notes

ON

R PROGRAMMING

B.Tech (IT)
Semester-VII
Syllabus
CODE: OEC-CS-701(III)
SUBJECT NAME: R PROGRAMMING

MODULE-1: INTRODUCTION

Getting R, R Version, 32-bit versus 64-bit, The R Environment, Command Line Interface, RStudio, Revolution Analytics RPE. Packages: Installing Packages, Loading Packages, Building a Package. R Basics: Basic Math, Variables, Data Types, Vectors, Calling Functions, Function Documentation, Missing Data. Advanced Data Structures: Data Frames, Lists, Matrices, Arrays.

MODULE-2: R DATA

Reading Data into R: Reading CSVs, Excel Data, Reading from Databases, Data from Other Statistical Tools, R Binary Files, Data Included with R, Extract Data from Web Sites. Statistical Graphics: Base Graphics, ggplot2.

MODULE-3: R FUNCTIONS & STATEMENTS

Writing R Functions: Hello, World!, Function Arguments, Return Values, do.call. Control Statements: if and else, switch, ifelse, Compound Tests. Loops: for Loops, while Loops, Controlling Loops.

MODULE-4: DATA MANIPULATION

Group Manipulation: Apply Family, aggregate, plyr, data.table. Data Reshaping: cbind and rbind, Joins, reshape2. Manipulating Strings: paste, sprintf, Extracting Text, Regular Expressions.

MODULE-5: R STATISTICS & LINEAR MODELING

Probability Distributions: Normal Distribution, Binomial Distribution, Poisson Distribution. Basic Statistics: Summary Statistics, Correlation and Covariance, t-Tests, ANOVA. Linear Models: Simple Linear Regression, Multiple Regression. Generalized Linear Models: Logistic Regression, Poisson Regression. Model Diagnostics: Residuals, Comparing Models, Cross-Validation, Bootstrap, Stepwise Variable Selection.

MODULE-6: NON-LINEAR MODELING

Nonlinear Models: Nonlinear Least Squares, Splines, Generalized Additive Models, Decision Trees, Random Forests. Clustering: K-means, PAM, Hierarchical Clustering.

Index

S. No.  Module
1.      R Programming Language – An Introduction
2.      Working with CSV files in R Programming
3.      R Functions & Statements
4.      Data Manipulation
5.      R Statistics & Linear Modeling
6.      Non-Linear Modeling

R Programming Language – An Introduction

Module 1

R is an open-source programming language that is widely used as statistical software and as a data analysis tool. R generally comes with a command-line interface and is available across widely used platforms such as Windows, Linux, and macOS. It remains a cutting-edge tool for statistical computing.

It was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. The R programming language is an implementation of the S programming language, combined with lexical scoping semantics inspired by Scheme. The project was conceived in 1992, with an initial version released in 1995 and a stable beta version in 2000.

Why R Programming Language?

• R programming is used as a leading tool for machine learning, statistics, and data analysis. Objects, functions, and packages can easily be created in R.
• It is a platform-independent language, which means it can be used on any operating system.
• It is free and open source, so anyone can install it in any organization without purchasing a license.
• R is not only a statistics package; it also integrates with other languages (C, C++), so you can easily interact with many data sources and statistical packages.
• The R programming language has a vast community of users, and it is growing day by day.
• R is currently one of the most requested programming languages in the Data Science job market.

Features of R Programming Language


Statistical Features of R:
• Basic Statistics: The most common basic statistics are the mean, median, and mode, collectively known as "measures of central tendency." Using the R language we can compute these very easily, as the short example after this list shows.
• Static graphics: R is rich with facilities for creating and developing interesting static
graphics. R contains functionality for many plot types including graphic maps, mosaic plots,
biplots, and the list goes on.
• Probability distributions: Probability distributions play a vital role in statistics and by using
R we can easily handle various types of probability distribution such as Binomial
Distribution, Normal Distribution, Chi-squared Distribution and many more.
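As a quick, hedged illustration of the basic statistics mentioned in the first bullet (the marks vector is made up for the example):

marks <- c(70, 85, 62, 85, 90, 74)   # a hypothetical vector of marks
mean(marks)      # arithmetic mean
median(marks)    # middle value
# Base R has no built-in function for the statistical mode; one common
# approach is to take the most frequent value from a frequency table:
names(sort(table(marks), decreasing = TRUE))[1]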
Programming Features of R:
• R Packages: One of the major features of R is its wide availability of libraries. R has CRAN (the Comprehensive R Archive Network), a repository holding more than 10,000 packages.
• Distributed Computing: Distributed computing is a model in which components of a
software system are shared among multiple computers to improve efficiency and
performance. Two new packages ddR and multidplyr used for distributed programming in
R were released in November 2015.
Programming in R:
Since R is syntactically similar to other widely used languages, it is easy to code and learn in R. Programs can be written in any of the widely used IDEs such as RStudio, Rattle, Tinn-R, etc. After writing the program, save the file with the extension .r (or .R). To run the program, use the following command on the command line:
Rscript file_name.r
# R program to print Welcome to GFG!

# Below line will print "Welcome to GFG!"


cat("Welcome to GFG!")

Advantages of R:
• R is one of the most comprehensive statistical analysis packages; new technologies and concepts often appear first in R.
• Because R is open source, you can run R anywhere and at any time.
• R is suitable for GNU/Linux and Windows operating systems.
• R programming is cross-platform and runs on any operating system.
• In R, everyone is welcome to provide new packages, bug fixes, and code enhancements.
Disadvantages of R:
• In the R programming language, the standard of some packages is less than perfect.
• R puts little emphasis on memory management, so an R program may consume all available memory.
• In R, there is basically nobody to complain to if something doesn't work.
Applications of R:
• We use R for Data Science. It gives us a broad variety of libraries related to statistics and provides an environment for statistical computing and design.
• R is used by many quantitative analysts as their programming tool; it helps in data importing and cleaning.
• R is a prevalent language among data analysts and research programmers, and it is used as a fundamental tool in finance.
• Tech giants like Google, Facebook, Bing, Accenture, Wipro and many more are using R nowadays.

History of R Programming

R was first implemented in the early 1990s by Robert Gentleman and Ross Ihaka, both faculty members at the University of Auckland. Robert and Ross established R as an open-source project in 1995. Since 1997, the R project has been managed by the R Core Group, and in February 2000, R 1.0.0 was released.

R vs Python

R programming and Python are both used extensively for Data Sciences. Both are very
useful and open source languages as well.

R Language is used for machine learning algorithms, linear regression, time series, statistical
inference, etc. It was designed by Ross Ihaka and Robert Gentleman in 1993.

R is an open-source programming language that is widely used as a statistical software and


data analysis tool. R generally comes with the Command-line interface. R is available across
widely used platforms like Windows, Linux, and macOS. Also, the R programming language
is the latest cutting-edge tool.

Python is a widely-used general-purpose, high level programming language. It was created


by Guido van Rossum in 1991 and further developed by the Python Software Foundation. It
was designed with an emphasis on code readability, and its syntax allows programmers to
express their concepts in fewer lines of code.

Below are some major differences between R and Python:

Feature: Introduction
R: R is a language and environment for statistical programming which includes statistical computing and graphics.
Python: Python is a general-purpose programming language for data analysis and scientific computing.

Feature: Objective
R: It has many features which are useful for statistical analysis and representation.
Python: It can be used to develop GUI applications and web applications, as well as for embedded systems.

Feature: Workability
R: It has many easy-to-use packages for performing tasks.
Python: It can easily perform matrix computation as well as optimization.

Feature: Integrated development environment
R: Various popular R IDEs are RStudio, RKWard, R Commander, etc.
Python: Various popular Python IDEs are Spyder, Eclipse+Pydev, Atom, etc.

Feature: Libraries and packages
R: There are many packages and libraries like ggplot2, caret, etc.
Python: Some essential packages and libraries are Pandas, NumPy, SciPy, etc.

Feature: Scope
R: It is mainly used for complex data analysis in data science.
Python: It takes a more streamlined approach for data science projects.

Introduction to R Studio

R Studio is an integrated development environment (IDE) for R. An IDE is a GUI where you can write your code, see the results, and also see the variables that are generated during the course of programming.
• R Studio is available as both Open source and Commercial software.
• R Studio is also available as both Desktop and Server version.
• R Studio is also available for various platforms such as Windows, Linux, and macOS.
R Studio can be downloaded from its official website, rstudio.com; installation instructions are covered in the "Installation of RStudio" section below.
After the Installation process is over, the R Studio Interface looks like:

• The console panel(left panel) is the place where R is waiting for you to tell it what to do, and
see the results that are generated when you type in the commands.
• To the top right, you have the Environment/History panel. It contains 2 tabs:
• Environment tab: It shows the variables that are generated during the course of
programming in a workspace which is temporary.
• History tab: In this tab, you’ll see all the commands that are used till now from the start of
usage of R Studio.
• To the right bottom, you have another panel, which contains multiple tabs, such as files,
plots, packages, help, and viewer.
• The Files tab shows the files and directories that are available within the default workspace
of R.
• The Plots tab shows the plots that are generated during the course of programming.
• The Packages tab helps you to look at what are the packages that are already installed in the
R Studio and it also gives a user interface to install new packages.
• The Help tab is the most important one where you can get help from the R Documentation
on the functions that are in built-in R.
• The final tab is the Viewer tab, which can be used to see local web content that's generated using R.

Installation of R

R programming is a very popular language, and to work with it we have to install two things: R and RStudio. R and RStudio work together to create an R project.

Installing R to the local computer is very easy. First, we must know which
operating system we are using so that we can download it accordingly.

The official site https://cloud.r-project.org provides binary files for major


operating systems including Windows, Linux, and Mac OS. In some Linux
distributions, R is installed by default, which we can verify from the console by
entering R.

To install R, either we can get it from the site https://cloud.r-project.org or can


use commands from the terminal.

Install R in Windows

There are following steps used to install the R in Windows:

Step 1:

First, we have to download the R setup from https://cloud.r-project.org/bin/windows/base/.

Step 2:

When we click on "Download R 3.6.1 for Windows", the download of the R setup will start. Once the downloading is finished, we have to run the R setup in the following way:

1) Select the path where we want to install R and proceed to Next.

2) Select all components which we want to install, and then proceed to Next.

3) In the next step, we have to select either a customized startup or accept the defaults, and then proceed to Next.

4) When we proceed to Next, the installation of R on our system will start.

5) Finally, we click on Finish to complete the installation of R on our system.

RStudio IDE

RStudio is an integrated development environment which allows us to interact with R more readily. RStudio is similar to the standard RGui, but it is considered more user-friendly. This IDE has various drop-down menus, windows with multiple tabs, and many customization options. The first time we open RStudio, we will see three windows; the fourth window is hidden by default. We can open this hidden window by clicking the File drop-down menu, then New File, and then R Script.

RStudio Window/Tab | Location | Description
Console Window | Lower-left | The location where commands are entered and output is printed.
Source Tabs | Upper-left | Built-in text editor.
Environment Tab | Upper-right | An interactive list of loaded R objects.
History Tab | Upper-right | List of keystrokes entered into the console.
Files Tab | Lower-right | File explorer to navigate folders on the local drive.
Plots Tab | Lower-right | Output location for plots.
Packages Tab | Lower-right | List of installed packages.
Help Tab | Lower-right | Output location for help commands and the help search window.
Viewer Tab | Lower-right | Advanced tab for local web content.

Installation of RStudio

RStudio Desktop is available for both Windows and Linux. The open-source RStudio Desktop is very simple to install on both operating systems. The licensed version of RStudio has some more features than the open-source one. Before installing RStudio, let's see what the additional features in the licensed version of RStudio are.

Factor: Overview
Open-Source:
1) Access RStudio locally.
2) Code completion, syntax highlighting, and smart indentation.
3) Can execute R code directly from the source editor.
4) Quickly jump to function definitions.
5) Easily manage multiple working directories using projects.
6) Integrated R help and documentation.
7) Provides an interactive debugger to diagnose and fix errors quickly.
8) Extensive package deployment tools.
Commercial License: All of the features of open-source are included, with
1) a commercial license for organizations which are not able to use AGPL software, and
2) access to priority support.

Factor: Support
Open-Source: Community forums only.
Commercial License: Priority email support, with an 8-hour response time during business hours.

Factor: License
Open-Source: AGPL v3.
Commercial License: RStudio License Agreement.

Factor: Pricing
Open-Source: Free.
Commercial License: $995/year.

Installation on Windows/Linux

On Windows and Linux, it is quite simple to install RStudio. The process of installing RStudio is the same on both operating systems. The following steps install RStudio on Windows/Linux:

Step 1:

In the first step, we visit the RStudio official site and click on Download RStudio.

Step 2:

In the next step, we will select RStudio Desktop with the open-source license and click on Download.

Step 3:

In the next step, we will select the appropriate installer. When we select the installer, the download of the RStudio setup will start.

Step 4:

In the next step, we will run our setup in the following way:

1) Click on Next.

2) Click on Install.

3) Click on finish.

4) RStudio is ready to work.

R Packages

R packages are collections of R functions, sample data, and compiled code. In the R environment, these packages are stored under a directory called "library". During installation, R installs a set of packages. We can add packages later when they are needed for some specific purpose. Only the default packages will be available when we start the R console; other packages which are already installed must be loaded explicitly to be used by the R program.

The following commands are used to check, verify, and use R packages.

Check Available R Packages

To check the available R packages, we have to find the library locations in which R packages are stored. R provides the .libPaths() function to find the library locations.

.libPaths()

When the above code executes, it produces output like the following, which may vary depending on the local settings of our PCs and laptops:

[1] "C:/Users/ajeet/OneDrive/Documents/R/win-library/3.6"
[2] "C:/Program Files/R/R-3.6.1/library"

Getting the list of all the packages installed

R provides the library() function, which, when called with no arguments, lists all the installed packages.

library()

When we execute the above function, it produces a result like the following, which may vary depending on the local settings of our PCs or laptops:

Packages in library 'C:/Program Files/R/R-3.6.1/library':

Like the library() function, R provides the search() function to get all packages currently loaded in the R environment.

search()

When we execute the above code, it will produce a result like the following, which may vary depending on the local settings of our PCs and laptops:

[1] ".GlobalEnv"        "package:stats"     "package:graphics"
[4] "package:grDevices" "package:utils"     "package:datasets"

Install a New Package

In R, there are two techniques to add new R packages. The first technique is installing the package directly from CRAN, and the second one is to install it manually after downloading the package to our local system.

Install directly from CRAN

The following command is used to get a package directly from the CRAN webpage and install it in the R environment. We may be prompted to choose the nearest mirror; choose the one appropriate to our location.

install.packages("Package Name")

The syntax for installing the XML package is as follows:

install.packages("XML")


Install package manually

To install a package manually, we first have to download it from https://cran.r-project.org/web/packages/available_packages_by_name.html. The required package will be saved as a .zip file in a suitable location on the local system.

Once the downloading has finished, we will use the following command (note that backslashes in Windows paths must be doubled, or forward slashes used, inside R strings):

install.packages(file_name_with_path, repos = NULL, type = "source")

Install the package named "XML":

install.packages("C:\\Users\\ajeet\\OneDrive\\Desktop\\graphics\\xml2_1.2.2.zip",
                 repos = NULL, type = "source")
Load Package to Library

We cannot use a package in our code until it has been loaded into the current R environment. We also need to load a package which was installed previously but is not available in the current environment.

The following command loads a package:

library("package Name", lib.loc = "path to library")

Command to load the XML package:

library("XML")

Syntax of R Programming

R is a very popular programming language which is broadly used in data analysis, and the way we write its code is quite simple. "Hello World!" is the basic first program in every language, and we will now understand the syntax of R programming through a "Hello World!" program. We can write our code either at the command prompt, or we can use an R script file.

R Command Prompt

To work at the R command prompt, the R environment must already be installed on our system. After the R environment is set up, we can start the R command prompt by typing R in our Windows command prompt. When we press enter after typing R, it launches the interpreter, and we get a prompt at which we can write our program.

"Hello, World!" Program

The code of "Hello World!" in R programming can be written as:
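A minimal version of that code (it matches the Demo.R script shown later):

string <- "Hello World!"
print(string)

[1] "Hello World!"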

In the above code, the first statement defines a string variable string, where we
assign a string "Hello World!". The next statement print() is used to print the
value which is stored in the variable string.

R Script File

The R script file is another way in which we can write our programs; we then execute those scripts at the command prompt with the help of the R interpreter known as Rscript. We make a text file, write the following code, and save the file with the .R extension:

Demo.R

string <- "Hello World!"
print(string)

To execute this file on Windows and other operating systems, the process remains the same: we invoke the Rscript interpreter on the script file from the command line, as sketched below (assuming the file is saved as Demo.R in the current working directory). When we press enter, it gives us the following output:
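Rscript Demo.R

[1] "Hello World!"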

Comments

In R programming, comments are programmer-readable explanations in the source code of an R program. The purpose of adding these comments is to make the source code easier to understand. These comments are generally ignored by compilers and interpreters.

R supports only single-line comments; it does not support multi-line comments. But if we want to simulate a multi-line comment, we can place the text inside a block that is never executed (a "false block").

Single-line comment

# My First program in R programming
string <- "Hello World!"
print(string)

The trick for multi-line comment

# Trick for multi-line comment
if(FALSE) {
  "R is an interpreted computer programming language which was created by
  Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand"
}
# My First program in R programming
string <- "Hello World!"
print(string)

R-Objects
Generally, while doing programming in any programming
language, you need to use various variables to store various
information. Variables are nothing but reserved memory
locations to store values. This means that, when you create a
variable you reserve some space in memory.
You may like to store information of various data types like
character, wide character, integer, floating point, double floating
point, Boolean etc. Based on the data type of a variable, the
operating system allocates memory and decides what can be
stored in the reserved memory.

In contrast to other programming languages like C and Java, in R the variables are not declared as some data type. The variables are assigned R-objects, and the data type of the R-object becomes the data type of the variable. There are many types of R-objects; the frequently used ones are:
• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames
The simplest of these objects is the vector object and there are
six data types of these atomic vectors, also termed as six classes
of vectors. The other R-Objects are built upon the atomic vectors.
Data Type: Logical | Example: TRUE, FALSE
Verify:
v <- TRUE
print(class(v))
[1] "logical"

Data Type: Numeric | Example: 12.3, 5, 999
Verify:
v <- 23.5
print(class(v))
[1] "numeric"

Data Type: Integer | Example: 2L, 34L, 0L
Verify:
v <- 2L
print(class(v))
[1] "integer"

Data Type: Complex | Example: 3 + 2i
Verify:
v <- 2+5i
print(class(v))
[1] "complex"

Data Type: Character | Example: 'a', "good", "TRUE", '23.4'
Verify:
v <- "TRUE"
print(class(v))
[1] "character"

Data Type: Raw | Example: "Hello" is stored as 48 65 6c 6c 6f
Verify:
v <- charToRaw("Hello")
print(class(v))
[1] "raw"

In R programming, the very basic data types are the R-objects


called vectors which hold elements of different classes as shown
above. Please note in R the number of classes is not confined to
only the above six types. For example, we can use many atomic
vectors and create an array whose class will become array.

Vectors

When you want to create vector with more than one element, you should use c()
function which means to combine the elements into a vector.

# Create a vector.

apple <- c('red','green',"yellow")


print(apple)

# Get the class of the vector.


print(class(apple))

When we execute the above code, it produces the following result:

[1] "red" "green" "yellow"


[1] "character"

Lists
A list is an R-object which can contain many different types of
elements inside it like vectors, functions and even another list
inside it.

# Create a list.

list1 <- list(c(2,5,3),21.3,sin)

# Print the list.


print(list1)


When we execute the above code, it produces the following result:

[[1]]

[1] 2 5 3

[[2]]

[1] 21.3

[[3]]

function (x) .Primitive("sin")

Matrices
A matrix is a two-dimensional rectangular data set. It can be
created using a vectorinput to the matrix function.

# Create a matrix.

M = matrix( c('a','a','b','c','b','a'), nrow=2,ncol=3,byrow = TRUE)


print(M)

When we execute the above code, it produces the following result:

[,1] [,2] [,3]

[1,] "a" "a" "b"

[2,] "c" "b" "a"

Arrays
While matrices are confined to two dimensions, arrays can be of
any number of dimensions. The array function takes a dim
attribute which creates the required number of dimension. In the
below example we create an array with two elements which are
3x3 matrices each.


# Create an array.

a <- array(c('green','yellow'),dim=c(3,3,2))
print(a)

When we execute the above code, it produces the following result:

, , 1

     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"

, , 2

     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"

Factors
Factors are the r-objects which are created using a vector. It
stores the vector alongwith the distinct values of the elements in
the vector as labels. The labels are always character irrespective
of whether it is numeric or character or Boolean etc. in the input
vector. They are useful in statistical modeling.
Factors are created using the factor() function.The nlevels
functions gives the count of levels.

# Create a vector.

apple_colors <- c('green','green','yellow','red','red','red','green')

# Create a factor object.


factor_apple <- factor(apple_colors)

# Print the factor.


print(factor_apple)
print(nlevels(factor_apple))

When we execute the above code, it produces the following result:


[1] green green yellow red red red green
Levels: green red yellow
# applying the nlevels function we can know the number of distinct values
[1] 3

Data Frames
Data frames are tabular data objects. Unlike a matrix in data
frame each column can contain different modes of data. The first
column can be numeric while the second column can be
character and third column can be logical. It is a list of vectors
of equal length.
Data Frames are created using the data.frame() function.

# Create the data frame.
BMI <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  Age = c(42, 38, 26)
)
print(BMI)

When we execute the above code, it produces the following result:

  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26

Understanding Basic Data Types and Data Structures in R

To make the best of the R language, we’ll need a strong understanding of the basic data types
and data structures and how to operate on them.


Data structures are very important to understand because these are the objects you will
manipulate on a day-to-day basis in R.
Everything in R is an object.
R has 6 basic data types.
• character
• numeric (real or decimal)
• integer
• logical
• complex
• raw (shown in the table above; not discussed further here)

Elements of these data types may be combined to form data structures, such as atomic vectors.
When we call a vector atomic, we mean that the vector only holds data of a single data type.
Below are examples of atomic character vectors, numeric vectors, integer vectors, etc.

• character: "a", "swc"


• numeric: 2, 15.5
• integer: 2L (the L tells R to store this as an integer)
• logical: TRUE, FALSE
• complex: 1+4i (complex numbers with real and imaginary parts)

R provides many functions to examine features of vectors and other objects, for example

• class() - what kind of object is it (high-level)?


• typeof() - what is the object’s data type (low-level)?
• length() - how long is it? What about two dimensional objects?
• attributes() - does it have any metadata?

# Example
x <- "dataset"
typeof(x)
[1] "character"
attributes(x)
NULL
y <- 1:10
y
[1] 1 2 3 4 5 6 7 8 9 10
typeof(y)
[1] "integer"
length(y)
[1] 10
z <- as.numeric(y)
z
[1] 1 2 3 4 5 6 7 8 9 10
typeof(z)
[1] "double"
R has many data structures. These include

• atomic vector


• list
• matrix
• data frame
• factors

Vectors

A vector is the most common and basic data structure in R and is pretty much the workhorse
of R. Technically, vectors can be one of two types:

• atomic vectors
• lists

although the term “vector” most commonly refers to the atomic types not to lists.

The Different Vector Modes

A vector is a collection of elements that are most commonly of


mode character, logical, integer or numeric.
You can create an empty vector with vector(). (By default the mode is logical. You can be more
explicit as shown in the examples below.) It is more common to use direct constructors such
as character(), numeric(), etc.
vector() # an empty 'logical' (the default) vector
logical(0)
vector("character", length = 5) # a vector of mode 'character' with 5 elements
[1] "" "" "" "" ""
character(5) # the same thing, but using the constructor directly
[1] "" "" "" "" ""
numeric(5) # a numeric vector with 5 elements
[1] 0 0 0 0 0
logical(5) # a logical vector with 5 elements
[1] FALSE FALSE FALSE FALSE FALSE
You can also create vectors by directly specifying their content. R will then guess the
appropriate mode of storage for the vector. For instance:
x <- c(1, 2, 3)
will create a vector x of mode numeric. These are the most common kind, and are treated as
double precision real numbers. If you wanted to explicitly create integers, you need to add
an L to each element (or coerce to the integer type using as.integer()).
x1 <- c(1L, 2L, 3L)
Using TRUE and FALSE will create a vector of mode logical:
y <- c(TRUE, TRUE, FALSE, FALSE)
While using quoted text will create a vector of mode character:
z <- c("Sarah", "Tracy", "Jon")


Examining Vectors

The functions typeof(), length(), class() and str() provide useful information about your
vectors and R objects in general.
typeof(z)
[1] "character"
length(z)
[1] 3
class(z)
[1] "character"
str(z)
chr [1:3] "Sarah" "Tracy" "Jon"

Adding Elements

The function c() (for combine) can also be used to add elements to a vector.
z <- c(z, "Annette")
z
[1] "Sarah" "Tracy" "Jon" "Annette"
z <- c("Greg", z)
z
[1] "Greg" "Sarah" "Tracy" "Jon" "Annette"

Vectors from a Sequence of Numbers

You can create vectors as a sequence of numbers.


series <- 1:10
seq(10)
[1] 1 2 3 4 5 6 7 8 9 10
seq(from = 1, to = 10, by = 0.1)
[1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4
[16] 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
[31] 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4
[46] 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9
[61] 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4
[76] 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9
[91] 10.0

Missing Data

R supports missing data in vectors. They are represented as NA (Not Available) and can be
used for all the vector types covered in this lesson:
x <- c(0.5, NA, 0.7)
x <- c(TRUE, FALSE, NA)
x <- c("a", NA, "c", "d", "e")


x <- c(1+5i, 2-3i, NA)


The function is.na() indicates the elements of the vectors that represent missing data, and the
function anyNA() returns TRUE if the vector contains any missing values:
x <- c("a", NA, "c", "d", NA)
y <- c("a", "b", "c", "d", "e")
is.na(x)
[1] FALSE TRUE FALSE FALSE TRUE
is.na(y)
[1] FALSE FALSE FALSE FALSE FALSE
anyNA(x)
[1] TRUE
anyNA(y)
[1] FALSE

Other Special Values

Inf is infinity. You can have either positive or negative infinity.


1/0
[1] Inf
NaN means Not a Number. It’s an undefined value.
0/0
[1] NaN

What Happens When You Mix Types Inside a Vector?

R will create a resulting vector with a mode that can most easily accommodate all the elements
it contains. This conversion between modes of storage is called “coercion”. When R converts
the mode of storage based on its content, it is referred to as “implicit coercion”. For instance,
can you guess what the following do (without running them first)?
xx <- c(1.7, "a")
xx <- c(TRUE, 2)
xx <- c("a", TRUE)
You can also control how vectors are coerced explicitly using the as.<class_name>() functions:
as.numeric("1")
[1] 1
as.character(1:2)
[1] "1" "2"

Objects Attributes

Objects can have attributes. Attributes are part of the object. These include:


• names
• dimnames
• dim
• class
• attributes (contain metadata)

You can also glean other attribute-like information such as length (works on vectors and lists)
or number of characters (for character strings).
length(1:10)
[1] 10
nchar("Software Carpentry")
[1] 18

Matrix

In R matrices are an extension of the numeric or character vectors. They are not a separate type
of object but simply an atomic vector with dimensions; the number of rows and columns. As
with atomic vectors, the elements of a matrix must be of the same data type.
m <- matrix(nrow = 2, ncol = 2)
m
[,1] [,2]
[1,] NA NA
[2,] NA NA
dim(m)
[1] 2 2
You can check that matrices are vectors with a class attribute of matrix by
using class() and typeof().
m <- matrix(c(1:3))
class(m)
[1] "matrix" "array"
typeof(m)
[1] "integer"
While class() shows that m is a matrix, typeof() shows that fundamentally the matrix is an
integer vector.
Data types of matrix elements

Consider the following matrix:


FOURS <- matrix(
c(4, 4, 4, 4),
nrow = 2,
ncol = 2)
Given that typeof(FOURS[1]) returns "double", what would you expect typeof(FOURS) to
return? How do you know this is the case even without running this code?
Hint Can matrices be composed of elements of different data types?


Solution

We know that typeof(FOURS) will also return "double" since matrices are made of elements
of the same data type. Note that you could do something like as.character(FOURS) if you
needed the elements of FOURS as characters.
Matrices in R are filled column-wise.
m <- matrix(1:6, nrow = 2, ncol = 3)
Other ways to construct a matrix
m <- 1:10
dim(m) <- c(2, 5)
This takes a vector and transforms it into a matrix with 2 rows and 5 columns.
Another way is to bind columns or rows using rbind() and cbind() (“row bind” and “column
bind”, respectively).
x <- 1:3
y <- 10:12
cbind(x, y)
x y
[1,] 1 10
[2,] 2 11
[3,] 3 12
rbind(x, y)
[,1] [,2] [,3]
x 1 2 3
y 10 11 12
You can also use the byrow argument to specify how the matrix is filled. From R’s own
documentation:
mdat <- matrix(c(1, 2, 3, 11, 12, 13),
nrow = 2,
ncol = 3,
byrow = TRUE)
mdat
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 11 12 13
Elements of a matrix can be referenced by specifying the index along each dimension (e.g.
“row” and “column”) in single square brackets.
mdat[2, 3]
[1] 13

List

In R lists act as containers. Unlike atomic vectors, the contents of a list are not restricted to a
single mode and can encompass any mixture of data types. Lists are sometimes called generic


vectors, because the elements of a list can by of any type of R object, even lists containing
further lists. This property makes them fundamentally different from atomic vectors.
A list is a special type of vector. Each element can be a different type.
Create lists using list() or coerce other objects using as.list(). An empty list of the required
length can be created using vector()
x <- list(1, "a", TRUE, 1+4i)
x
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i
x <- vector("list", length = 5) # empty list
length(x)
[1] 5
The content of elements of a list can be retrieved by using double square brackets.
x[[1]]
NULL
Vectors can be coerced to lists as follows:
x <- 1:10
x <- as.list(x)
length(x)
[1] 10
Examining Lists

1. What is the class of x[1]?


2. What is the class of x[[1]]?

Solution
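With x created above via as.list(1:10), single brackets return a (sub)list while double brackets return the element itself; a quick check:

class(x[1])
[1] "list"
class(x[[1]])
[1] "integer"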

Elements of a list can be named (i.e. lists can have the names attribute)
xlist <- list(a = "Karthik Ram", b = 1:10, data = head(mtcars))
xlist
$a
[1] "Karthik Ram"

$b
[1] 1 2 3 4 5 6 7 8 9 10

$data


mpg cyl disp hp drat wt qsec vs am gear carb


Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
names(xlist)
[1] "a" "b" "data"
Examining Named Lists

1. What is the length of this object?


2. What is its structure?

Solution
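For the xlist object above, length() and str() answer both questions (output summarized):

length(xlist)
[1] 3
str(xlist)   # a list of 3 named elements: a character string, an integer vector, and a data frame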

Lists can be extremely useful inside functions. Because the functions in R are able to return
only a single object, you can “staple” together lots of different kinds of results into a single
object that a function can return.
A list does not print to the console like a vector. Instead, each element of the list starts on a
new line.
Elements are indexed by double brackets. Single brackets will still return a(nother) list. If the
elements of a list are named, they can be referenced by the $ notation (i.e. xlist$data).

Data Frame

A data frame is a very important data type in R. It’s pretty much the de facto data structure for
most tabular data and what we use for statistics.
A data frame is a special type of list where every element of the list has same length (i.e. data
frame is a “rectangular” list).
Data frames can have additional attributes such as rownames(), which can be useful for
annotating data, like subject_id or sample_id. But most of the time they are not used.
Some additional information on data frames:

• Usually created by read.csv() and read.table(), i.e. when importing the data into R.
• Assuming all columns in a data frame are of same type, data frame can be converted to a matrix
with data.matrix() (preferred) or as.matrix(). Otherwise type coercion will be enforced and the
results may not always be what you expect.
• Can also create a new data frame with data.frame() function.
• Find the number of rows and columns with nrow(dat) and ncol(dat), respectively.
• Rownames are often automatically generated and look like 1, 2, …, n. Consistency in
numbering of rownames may not be honored when rows are reshuffled or subset.


Creating Data Frames by Hand

To create data frames by hand:


dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)
dat
id x y
1 a 1 11
2 b 2 12
3 c 3 13
4 d 4 14
5 e 5 15
6 f 6 16
7 g 7 17
8 h 8 18
9 i 9 19
10 j 10 20
Useful Data Frame Functions

• head() - shows first 6 rows


• tail() - shows last 6 rows
• dim() - returns the dimensions of data frame (i.e. number of rows and number of columns)
• nrow() - number of rows
• ncol() - number of columns
• str() - structure of data frame - name, type and preview of data in each column
• names() or colnames() - both show the names attribute for a data frame
• sapply(dataframe, class) - shows the class of each column in the data frame
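A quick sketch applying a few of these helpers to the dat data frame created above:

dim(dat)       # number of rows and columns
[1] 10  3
nrow(dat)
[1] 10
names(dat)
[1] "id" "x"  "y"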

See that it is actually a special list:


is.list(dat)
[1] TRUE
class(dat)
[1] "data.frame"
Because data frames are rectangular, elements of data frame can be referenced by specifying
the row and the column index in single square brackets (similar to matrix).
dat[1, 3]
[1] 11
As data frames are also lists, it is possible to refer to columns (which are elements of such list)
using the list notation, i.e. either double square brackets or a $.
dat[["y"]]
[1] 11 12 13 14 15 16 17 18 19 20
dat$y
[1] 11 12 13 14 15 16 17 18 19 20
The following table summarizes the one-dimensional and two-dimensional data structures in
R in relation to diversity of data types they can contain.


Dimensions Homogenous Heterogeneous

1-D atomic vector list

2-D matrix data frame

Lists can contain elements that are themselves multi-dimensional (e.g. a list can contain data frames or other types of objects). Lists can also contain elements of any length, therefore lists do not necessarily have to be "rectangular". However, in order for a list to qualify as a data frame, the length of each element has to be the same.
Column Types in Data Frames

Knowing that data frames are lists, can columns be of different type?
What type of structure do you expect to see when you explore the structure of
the PlantGrowth data frame? Hint: Use str().
Solution
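Yes: because data frames are lists, each column can have its own type. For the built-in PlantGrowth data set, str() shows a data frame with one numeric column and one factor column (output abbreviated):

str(PlantGrowth)
'data.frame': 30 obs. of 2 variables:
 $ weight: num  ...
 $ group : Factor w/ 3 levels "ctrl","trt1","trt2": ...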

Key Points

• R’s basic data types are character, numeric, integer, complex, and logical.
• R’s basic data structures include the vector, list, matrix, data frame, and factors. Some of these
structures require that all members be of the same data type (e.g. vectors, matrices) while
others permit multiple data types (e.g. lists, data frames).
• Objects may have attributes, such as name, dimension, and class.


Working with CSV files in R Programming

Module 2

R CSV Files

A Comma-Separated Values (CSV) file is a plain text file which contains a list of data. These files are often used for the exchange of data between different applications; for example, databases and contact managers mostly support CSV files.

These files are sometimes called character-separated values or comma-delimited files. They often use the comma character to separate data, but sometimes use other characters such as semicolons. The idea is that we can export complex data from one application to a CSV file, and then import the data in that CSV file into another application.

Storing data in spreadsheets is one of the most common ways of storing data used by data scientists, and there are lots of packages in R designed for accessing data from spreadsheets. Users often find it easier to save their spreadsheets as comma-separated value files and then use R's built-in functionality to read and manipulate the data.

R allows us to read data from files which are stored outside the R environment. Let's start understanding how we can read and write data in CSV files. The file should be present in the current working directory so that R can read it; we can also set our working directory and read the file from there.

Getting and setting the working directory

In R, getwd() and setwd() are two useful functions. The getwd() function reports the directory the R workspace is currently pointing at, and the setwd() function sets a new working directory from which files are read and written.
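A short sketch (the path is hypothetical; substitute the folder that actually holds your files):

getwd()                      # print the current working directory
setwd("C:/Users/me/data")    # point R at the folder containing sample.csv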

CSV files are basically text files wherein the values of each row are separated by a delimiter, such as a comma or a tab. Here we will use the following sample CSV file:
sample.csv
id, name, department, salary, projects
1, A, HR, 60754, 14
2, B, Tech, 59640, 3
3, C, Marketing, 69040, 8
4, D, HR, 65043, 5
5, E, Tech, 59943, 2
6, F, IT, 65000, 5
7, G, HR, 69000, 7


Reading a CSV file


The contents of a CSV file can be read as a data frame in R using the read.csv(…) function.
The CSV file to be read should be either present in the current working directory or the
directory should be set accordingly using the setwd(…) command in R. The CSV file can also
be read from a URL using read.csv() function.
Examples:

csv_data <- read.csv(file = 'sample.csv')

print(csv_data)

# print number of columns

print (ncol(csv_data))

# print number of rows

print(nrow(csv_data))

Output:
  id name department salary projects
1  1    A         HR  60754       14
2  2    B       Tech  59640        3
3  3    C  Marketing  69040        8
4  4    D         HR  65043        5
5  5    E       Tech  59943        2
6  6    F         IT  65000        5
7  7    G         HR  69000        7
[1] 5
[1] 7
The header argument is set to TRUE by default in the function. The header row is not included in the count of rows, therefore this CSV has 7 rows and 5 columns.


Querying with CSV files


SQL-like queries can be performed on the CSV content, and the corresponding result can be retrieved using the subset(csv_data, ...) function in R. Multiple conditions can be applied in the function at a time, with each condition separated by a logical operator. The result is stored as a data frame in R.
Examples:

csv_data <- read.csv(file ='sample.csv')

min_pro <- min(csv_data$projects)

print (min_pro)

Output:
2
Aggregator functions (min, max, count etc.) can be applied on the CSV data. Here
the min() function is applied on projects column using $ symbol. The minimum number of
projects which is 2 is returned.

csv_data <- read.csv(file ='sample.csv')

new_csv <- subset(csv_data, department == "HR" & projects <10)

print (new_csv)

Output:

  id name department salary projects
4  4    D         HR  65043        5
7  7    G         HR  69000        7
The subset of the data that is created is stored as a data frame satisfying the conditions
specified as the arguments of the function. The employees D and G are HR and have the
number of projects<10. The row numbers are retained in the resultant data frame.
Writing into a CSV file
The contents of the data frame can be written into a CSV file. The CSV file is stored in the
current working directory with the name specified in the function write.csv(data frame,
output CSV name) in R.
Examples:


csv_data <- read.csv(file ='sample.csv')

new_csv <- subset(csv_data, department == "HR" & projects <10)

write.csv(new_csv, "new_sample.csv")

new_data <-read.csv(file ='new_sample.csv')

print(new_data)

Output:

  X id name department salary projects
1 4  4    D         HR  65043        5
2 7  7    G         HR  69000        7
The column X contains the row numbers of the original CSV file. In order to remove it, we
can specify an additional argument in the write.csv() function that set row names to FALSE.

csv_data <- read.csv(file ='sample.csv')

new_csv <- subset(csv_data, department == "HR" & projects <10)

write.csv(new_csv, "new_sample.csv", row.names = FALSE)

new_data <-read.csv(file ='new_sample.csv')

print(new_data)

Output:

  id name department salary projects
1  4    D         HR  65043        5
2  7    G         HR  69000        7
The original row numbers are removed from the new CSV.
Working with Excel Files in R Programming


Excel-related files come with the extensions .xls, .xlsx and .csv (comma-separated values). To start working with Excel files in R, we first need to import them into RStudio or any other R-supporting IDE (integrated development environment). First install the readxl package in R to load Excel files. Various methods, including their subparts, are demonstrated further below.

The xlsx extension belongs to a spreadsheet file format created by Microsoft for Microsoft Excel. Microsoft Excel is a widely used spreadsheet program that stores data in the .xls or .xlsx format. R allows us to read data directly from these files through several Excel-specific packages, such as XLConnect, xlsx, gdata, etc. We will use the xlsx package, which not only allows us to read data from an Excel file but also allows us to write data to it.

Install xlsx Package

Our primary task is to install the "xlsx" package with the help of the install.packages() command. When we install the xlsx package, it will ask us to install some additional packages on which this package depends. For installing the additional packages, the same command is used with the required package name. The syntax of the install command is:

install.packages("package name")

Example

install.packages("xlsx")

Verifying and Loading of the "xlsx" Package

In R, the grepl() and any() functions are used to verify that a package is installed. If the package is installed, the combination returns TRUE, otherwise FALSE. For verification, both functions are used together.

For loading purposes, we use the library() function with the appropriate package name. This function loads all the additional (dependent) packages as well.

Example

# Installing xlsx package
install.packages("xlsx")

# Verifying the package is installed.
any(grepl("xlsx", installed.packages()))

# Loading the library into R workspace.
library("xlsx")


Sample_data1.xlsx

Sample_data2.xlsx

Reading Files
The two excel files Sample_data1.xlsx and Sample_data2.xlsx and read from working
directory.

# Working with Excel Files


# Installing required package

install.packages("readxl")

# Loading the package

library(readxl)

# Importing excel file

Data1 <- read_excel("Sample_data1.xlsx")

Data2 <- read_excel("Sample_data2.xlsx")

# Printing the data

head(Data1)

head(Data2)

The Excel files are loaded into the variables Data1 and Data2 as data frames; head(Data1) and head(Data2) then print the first rows of each dataset.


Modifying Files
The Sample_data1.xlsx file and Sample_file2.xlsx are modified.

# Modifying the files

Data1$Pclass <- 0

Data2$Embarked <- "S"

# Printing the data

head(Data1)

head(Data2)

The value of the Pclass attribute (column) of the Data1 data is set to 0, and the value of the Embarked attribute (column) of Data2 is set to "S".
Deleting Content from files
The variable or attribute is deleted from Data1 and Data2 datasets containing
Sample_data1.xlsx and Sample_data2.xlsx files.

# Deleting from files


Data1 <- Data1[-2]

Data2 <- Data2[-3]

# Printing the data

Data1

Data2

The - sign is used to delete column or attributes from dataset. Column 2 is deleted from Data1
dataset and Column 3 is deleted from Data2 dataset.

Merging Files
The two excel datasets Data1 and Data2 are merged using merge() function which is in base
package and comes pre installed in R.

# Merging Files

Data3 <- merge(Data1, Data2, all.x = TRUE, all.y = TRUE)


# Displaying the data

head(Data3)

Data1 and Data2 are merged with each other and the resultant file is stored in the Data3
variable.
Creating new columns
New columns or features can be easily created in Data1 and Data2 datasets.

# Creating feature in Data1 dataset

Data1$Num <- 0

# Creating feature in Data2 dataset

Data2$Code <- "Mission"

# Printing the data

head(Data1)

head(Data2)


Num is a new feature created with a default value of 0 in the Data1 dataset. Code is a new feature created with "Mission" as the default string in the Data2 dataset.
Writing Files
After performing all operations, Data1 and Data2 are written into new files using the write_xlsx() function from the writexl package.

# Installing the package

install.packages("writexl")

# Loading package

library(writexl)

# Writing Data1

write_xlsx(Data1, "New_Data1.xlsx")

# Writing Data2

write_xlsx(Data2, "New_Data2.xlsx")


Working with Binary Files in R Programming

In the computer science world, text files contain data that can easily be understood by humans.
It includes letters, numbers, and other characters. On the other hand, binary files contain 1s
and 0s that only computers can interpret. The information stored in a binary file can’t be read
by humans as the bytes in it translate to characters and symbols that contain various other non-
printable characters.
It sometimes happens that data produced by other programs has to be processed by R as a binary file, and R may also need to create binary files that can be shared with other programs. The four most important operations that can be performed on a binary file are:

• Creating and writing to a binary file


• Reading from the binary file
• Append to the binary file
• Deleting the binary file

Creating and writing to a binary file


Both creating and writing to a binary file can be performed by a single function, writeBin(), after opening the file in "wb" mode, where w indicates write and b indicates binary mode.

Syntax: writeBin(object, con)


Parameters:
object: an R object to be written to the connection
con: a connection object or a character string naming a file or a raw vector.

Example:

# R program to illustrate
# working with binary file

# Creating a data frame


df = data.frame(
"ID" = c(1, 2, 3, 4),
"Name" = c("Tony", "Thor", "Loki", "Hulk"),
"Age" = c(20, 34, 24, 40),
"Pin" = c(756083, 756001, 751003, 110011)
)

# Creating a connection object


# to write the binary file using mode "wb"
con = file("myfile.dat", "wb")

# Write the column names of the data frame


# to the connection object
writeBin(colnames(df), con)


# Write the records in each of the columns to the file


writeBin(c(df$ID, df$Name, df$Age, df$Pin), con)

# Close the connection object


close(con)

Output:

Reading from the binary file


Reading from the binary file can be performed by the function readBin(), opening the file
in "rb" mode, where r indicates read and b indicates binary mode.

Syntax: readBin(con, what, n )


Parameters:
con: a connection object or a character string naming a file or a raw vector
what: either an object whose mode will give the mode of the vector to be read or a character
vector of length one describing the mode: one of “numeric”, “double”, “integer”, “int”, “lo
gical”, “complex”, “character”, “raw”
n: the (maximal) number of records to be read

Example:
# R program to illustrate
# working with binary file

# Creating a connection object


# to read the file in binary mode using "rb".
con = file("myfile.dat", "rb")


# Read the column names


# n = 4 as here 4 column
colname = readBin(con, character(), n = 4)

# Read column values


# n = 20 as here 16 values and 4 column names
con = file("myfile.dat", "rb")
bindata = readBin(con, integer(), n = 20)

# Read the ID values


# as first 1:4 byte for col name
# then values of ID col is within 5 to 8
ID = bindata[5:8]

# Similarly 9 to 12 byte for values of name column


Name = bindata[9:12]

# 13 to 16 byte for values of the age column


Age = bindata[13:16]

# 17 to 20 byte for values of Pincode column


PinCode = bindata[17:20]

# Combining all the values and make it a data frame


finaldata = cbind(ID, Name, Age, PinCode)
colnames(finaldata)= colname
print(finaldata)


Output:
ID Name Age Pin
[1, ] 0 0 0 0
[2, ] 1072693248 1074266112 1074790400 1073741824
[3, ] 0 0 0 0
[4, ] 1073741824 1074790400 1074266112 1072693248
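
The odd-looking numbers above arise largely because the columns were combined with c(), which coerces everything to a single type, and the bytes were then read back as integers. A minimal sketch of a safer round trip, assuming the data frame df from the example above (the file name myfile2.dat is made up for illustration):

# Write each column separately so that its storage type is preserved
con <- file("myfile2.dat", "wb")
writeBin(df$ID, con)       # numeric column written as doubles
writeBin(df$Name, con)     # character column written as null-terminated strings
close(con)

# Read the columns back with a matching 'what' argument
con <- file("myfile2.dat", "rb")
ids <- readBin(con, numeric(), n = 4)
nms <- readBin(con, character(), n = 4)
close(con)
print(ids)
print(nms)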


The Data1 dataset is written to the New_Data1.xlsx file and the Data2 dataset is written
to the New_Data2.xlsx file. Both files are saved in the present working directory.

Visualization in R

In R, we can create visually appealing data visualizations by writing few lines of code.
For this purpose, we use the diverse functionalities of R. Data visualization is an
efficient technique for gaining insight about data through a visual medium. With the
help of visualization techniques, a human can easily obtain information about hidden
patterns in data that might be neglected.

By using the data visualization technique, we can work with large datasets to
efficiently obtain key insights about it.

R Visualization Packages

R provides a series of packages for data visualization. These packages are as follows:

1) plotly

The plotly package provides online interactive and quality graphs. This package extends
upon the JavaScript library plotly.js.

2) ggplot2

R allows us to create graphics declaratively. R provides the ggplot2 package for this
purpose. This package is famous for its elegant and quality graphs, which sets it apart
from other visualization packages.
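
As a minimal sketch of this declarative style (using the built-in iris data set; the particular aesthetics are chosen here only for illustration):

library(ggplot2)

# Map Sepal.Length and Petal.Length to the axes and Species to the colour of the points
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point() +
  labs(title = "Sepal length versus petal length")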

3) tidyquant

The tidyquant package is a financial package that is used for carrying out quantitative financial
analysis. This package fits into the tidyverse as a financial package used for importing,
analyzing, and visualizing data.

4) taucharts

Data plays an important role in taucharts. The library provides a declarative interface
for rapid mapping of data fields to visual properties.

5) ggiraph

It is a tool that allows us to create dynamic ggplot graphs. This package allows
us to add tooltips, JavaScript actions, and animations to the graphics.

6) geofacets

This package provides geofaceting functionality for ggplot2. Geofaceting arranges a
sequence of plots for different geographical entities into a grid that preserves some of
the geographical orientation.

7) googleVis

googleVis provides an interface between R and Google's charts tools. With the help of
this package, we can create web pages with interactive charts based on R data frames.

8) RColorBrewer

This package provides color schemes for maps and other graphics, which are
designed by Cynthia Brewer.

9) dygraphs

The dygraphs package is an R interface to the dygraphs JavaScript charting library.
It provides rich features for charting time-series data in R.

10) shiny

R allows us to develop interactive and aesthetically pleasing web apps by providing
the shiny package. This package provides various extensions with HTML widgets, CSS,
and JavaScript.

R Graphics

Graphics play an important role in conveying the important features of the data.
Graphics are used to examine marginal distributions, relationships between variables,
and summaries of very large data sets. They are a very important complement to many
statistical and computational techniques.

Standard Graphics

R standard graphics are available through the graphics package, which includes several
functions that provide statistical plots, such as:

Scatterplots
Piecharts
Boxplots
Barplots etc.
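
For instance, a minimal sketch using the built-in mtcars data set (chosen here only for illustration):

# Scatterplot of weight against miles per gallon
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")

# Histogram and boxplot
hist(mtcars$mpg)
boxplot(mpg ~ cyl, data = mtcars)

# Barplot of cylinder counts
barplot(table(mtcars$cyl))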

As the sketch above shows, each of these graphs is typically produced with a single function call.

Graphics Devices


A graphics device is somewhere we can make a plot appear: a window on your computer
(screen device), a PDF file (file device), a Scalable Vector Graphics (SVG) file (file device),
or a PNG or JPEG file (file device).

There are some points which are essential to understand:

The functions of graphics devices produce output which depends on the active graphics device. A screen is the default and most frequently used device.
Other R graphics devices, such as the PDF device and the JPEG device, can also be used.
We just need to open the graphics output device we want; R then takes care of producing the type of output required by that device.
For producing a certain plot on the screen or as a GIF graphics file, the R code is exactly the same. We only need to open the target output device beforehand.
Several devices can be open at the same time, but there will be only one active device.

The basics of the grammar of graphics

There are some key elements of a statistical graphic. These elements are the basics
of the grammar of graphics. Let's discuss each of the elements one by one to gain
a basic knowledge of graphics.

1) Data


Data is the most crucial thing which is processed and generates an output.

2) Aesthetic Mappings

Aesthetic mappings are one of the most important elements of a statistical graphic.
It controls the relation between graphics variables and data variables. In a scatter
plot, it also helps to map the temperature variable of a data set into the X variable.

In graphics, it helps to map the species of a plant into the color of dots.

3) Geometric Objects

Geometric objects are used to express each observation by a point using the
aesthetic mappings. It maps two variables in the data set into the x,y variables of the
plot.

4) Statistical Transformations

Statistical transformations allow us to calculate a statistical summary of the data in the plot.
For example, a statistical transformation may approximate the data with a regression line in
x,y coordinates, or count occurrences of certain values.

5) Scales

It is used to map the data values into values present in the coordinate system of the graphics
device.

6) Coordinate system

The coordinate system plays an important role in the plotting of the data. The Cartesian
coordinate system is the most commonly used.

7) Faceting

Faceting is used to split the data into subgroups and draw sub-graphs for each group.

Advantages of Data Visualization in R

1. Understanding

It can make the business more attractive to look at, and it is easier to understand information
through graphics and charts than through a written document with text and numbers. Thus,
it can attract a wider range of audiences. It also promotes the widespread use of business
insights, which helps make better decisions.

2. Efficiency

Its applications allow us to display a lot of information in a small space. Although the
decision-making process in business is inherently complex and multifunctional, displaying
evaluation findings in a graph allows companies to organize a lot of interrelated information
in useful ways.

3. Location

Applications utilizing features such as geographic maps and GIS can be particularly
relevant to the wider business when location is a very relevant factor. Maps can show
business insights from various locations, along with the seriousness of the issues,
the reasons behind them, and the working groups addressing them.

Disadvantages of Data Visualization in R

1. Cost

Developing R visualization applications can cost a good amount of money. It may not be possible,
especially for small companies, to spend many resources on purchasing them. To generate reports,
many companies may employ professionals to create charts, which can increase costs. Small
enterprises often operate in resource-limited settings, while receiving timely evaluation
results can be of high importance.

2. Distraction

At times, data visualization apps create highly complex and fancy graphics-rich reports and
charts, which may entice users to focus more on the form than the function. If visual appeal
comes first, the overall value of the graphic representation is diminished. In resource-limited
settings, it is important to understand how resources can best be used and not be caught up in
the graphics trend without a clear purpose.


R FUNCTIONS & STATEMENTS

Module 3
Function Creation:
A function is a set of statements organized together to perform a
specific task. R has a large number of in-built functions and the
user can create their own functions.
In R, a function is an object, so the R interpreter is able to pass
control to the function, along with arguments that may be
necessary for the function to accomplish its actions.
The function in turn performs its task and returns control to the
interpreter as well as any result which may be stored in other
objects.

Function Definition
An R function is created by using the keyword function. The
basic syntax of an R function definition is as follows:

function_name <- function(arg_1, arg_2, ...) {
   # Function body
}

Function Components
The different parts of a function are:
• Function Name: This is the actual name of the function. It is
stored in R environment as an object with this name.

• Arguments: An argument is a placeholder. When a function is


invoked, you pass a value to the argument. Arguments are
optional; that is, a function may contain no arguments. Also
arguments can have default values.

• Function Body: The function body contains a collection of
statements that define what the function does.

• Return Value: The return value of a function is the last
expression in the function body to be evaluated (see the sketch below).
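
A minimal sketch of a return value (the function name multiply and its arguments are made up for illustration; the last evaluated expression is returned, and an explicit return() call would also work):

# Create a function whose last expression is its return value.
multiply <- function(a, b) {
   a * b
}

result <- multiply(4, 5)
print(result)   # [1] 20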
R has many in-built functions which can be directly called in a
program without defining them first. We can also create and use
our own functions, referred to as user-defined functions.


Built-in Function
Simple examples of in-built functions are seq(), mean(), max(), sum(x) and
paste(...). They are directly called by user-written programs.

# Create a sequence of numbers from 32 to 44.


print(seq(32,44))

# Find mean of numbers from 25 to 82.


print(mean(25:82))

# Find sum of numbers from 41 to 68.

print(sum(41:68))

User-defined Function
We can create user-defined functions in R. They are specific to
what a user wants and once created they can be used like the
built-in functions. Below is an example of how a functionis
created and used.

# Create a function to print squares of numbers in sequence.


new.function <- function(a) {

for(i in 1:a) {
b <- i^2
print(b)

}
}


Calling a Function

# Create a function to print squares of numbers in sequence.

new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}

# Call the function new.function supplying 6 as an argument.

new.function(6)

Calling a Function without an Argument

# Create a function without an argument.

new.function <- function() {
   for(i in 1:5) {
      print(i^2)
   }
}

# Call the function without supplying an argument.

new.function()

Calling a Function with Argument Values (by position and by name)


The arguments to a function call can be supplied in the same
sequence as defined in the function or they can be supplied in a
different sequence but assigned to the names of the arguments.


# Create a function with arguments.

new.function <- function(a, b, c) {
   result <- a*b + c
   print(result)
}

# Call the function by position of arguments.

new.function(5, 3, 11)

# Call the function by names of the arguments.

new.function(a = 11, b = 5, c = 3)
When we execute the above code, it produces the following result:

[1] 26

[1] 58

Calling a Function with Default Argument


We can define the value of the arguments in the function
definition and call the function without supplying any argument
to get the default result. But we can also call such functions by
supplying new values for the arguments to get a non-default result.

# Create a function with arguments.

new.function <- function(a = 3, b = 6) {
   result <- a*b
   print(result)
}


# Call the function without giving any argument.


new.function()

# Call the function with giving new values of the argument.


new.function(9,5)
When we execute the above code, it produces the following result:

[1] 18

[1] 45

Lazy Evaluation of Function


Arguments to functions are evaluated lazily, which means
they are evaluated only when needed by the function body.

# Create a function with arguments.


new.function <- function(a, b) {
   print(a^2)
   print(a)
   print(b)
}

# Evaluate the function without supplying one of the arguments.

new.function(6)

[1] 36

[1] 6

Error in print(b) : argument "b" is missing, with no default

R scripts
While entering and running your code at the R command line is effective and
simple, this technique has its limitations. Each time you want to execute a set
of commands, you have to re-enter them at the command line. Complex
commands are potentially subject to typographical errors, necessitating that
they be re-entered correctly. Repeating a set of operations requires re-entering
the code stream. Fortunately, R and RStudio provide a method to mitigate
these issues: R scripts are that solution.

A script is simply a text file containing a set of commands and comments. The
script can be saved and used later to re-execute the saved commands. The script
can also be edited so you can execute a modified version of the commands.

Creating an R script

It is easy to create a new script in RStudio. You can open a new empty
script by clicking the New File icon in the upper left of the main RStudio
toolbar. This icon looks like a white square with a white plus sign in a green
circle. Clicking the icon opens the New File Menu. Click the R Script
menu option and the script editor will open with an empty script.


Figure 1 - RStudio New Script Menu

Once the new script opens in the Script Editor panel, the script is ready for
text entry, and your RStudio session will look like this.

Figure 2 - RStudio with Script Editor Panel

Here is an easy example to familiarize you with the Script Editor interface.
Type the following code into your new script [later topics will explain what the
specific code components do].

# this is my first R script
# do some things
x = 34
y = 16
z = x + y  # addition
w = y/x    # division
# display the results
x
y
z
w
# change x
x = "some text"
# display the results
x
y
z
w


Figure 3 - R Script Example

There, you now have your first R script. Notice how the editor places a number
in front of each line of code. The line numbers can be helpful as you work with
your code. Before proceeding onto executing this code, it would be a good idea
to learn how to save your script.

Saving an R script

You can save your script by clicking on the Save icon at the top of the
Script Editor panel. When you do that, a Save File dialog will open.


Figure 4 - Save File Dialog

The default script name is Untitled.R. The Untitled part is highlighted. You will
save this script as First script.R. Start typing First script. RStudio overwrites
the highlighted default name with your new name, leaving the .R file extension.
The Save File dialog should now look like this.

Figure 5 - Save First script.R

Notice that RStudio will save your script to your current working folder. An
earlier topic in this learning infrastructure explained how to set your default
working folder, so that will not be addressed here. Press the Save button and
your script is saved to your working folder. Notice that the name in the file tab
at the top of the Script Editor panel now shows your saved script file name.

Be aware that, while it is not necessary to use an .R file extension for your
R scripts, it does make it easier for RStudio to work with them if you use
this file extension.

That is how you save your script files to your working folder.

Opening an R script


Opening a saved R script is easy to do. Click on the Open an existing file
icon in the RStudio toolbar. A Choose file dialog will open.

Figure 6 - RStudio Open Script Dialog

Select the R script you want to open [this is one place where the .R file extension
comes in handy] and click the Open button. Your script will open in the Script
Editor panel with the script name in an editor tab.

Working through an example may be helpful. We will use the script you
created above [First script.R] for this exercise. First, you will need to close
the script. You can close this script by clicking the X in the right side of the
editor tab where the script name appears. Since you only had one script open,
when you close First script.R, the Script Editor panel disappears.

Now, click on the Open an existing file icon in the RStudio toolbar. The
Choose file dialog will open. Select First script.R and then press the Open
button in the dialog. Your script is now open in the Script Editor panel and
ready to use.

Executing code in an R script

You can run the code in your R script easily. The Run button in the Script
Editor panel toolbar will run either the current line of code or any block of

selected code. You can use your First script.R code to gain familiarity with
this functionality.

Place the cursor anywhere in line 3 of your script [x = 34]. Now press the
Run button in the Script Editor panel toolbar. Three things happen: 1) the
code is transferred to the command console, 2) the code is executed, and 3)
the cursor moves to the next line in your script. Press the Run button three
more times. RStudio executes lines 4, 5, and 6 of your script.

Now you will run a set of code commands all at once. Highlight lines 8,
9, 10, and 11 in the script.

Figure 7 - Highlighted Script Code

Highlighting is accomplished similar to what you may be familiar with in
word processor applications. You click your left mouse button at the
beginning of the text you want to highlight, hold the mouse button,
drag the cursor to the end of the text and release the button. With those four
lines of code highlighted, click the editor Run button. All four lines of code
are executed in the command console. That is all it takes to run script code in
RStudio.

Comments in an R script [documenting your code]

Before finishing this topic, there is one final concept you should understand. It
is always a good idea to place comments in your code. They will help you
understand what your code is meant to do. This will become helpful when you
reopen code you wrote weeks ago and are trying to work with again. The saying,
"Real programmers do not document their code. If it was hard to write, it should
be hard to understand" is meant to be a dark joke, not a coding style guide.

Figure 8 - R Script Example [with comments]

A comment in R code begins with the # symbol. Your code in First script.R
contains several examples of comments. Lines 1, 2, 7, 12, and 14 in the image
above are all comment lines. Any line of text that starts with # will be treated
as a comment and will be ignored during code execution. Lines 5 and 6 in this
image contain comments at the end. All text after the # is treatedas a comment
and is ignored during execution.

Notice how the RStudio editor shows these comments colored green. The
green color helps you focus on the code and not get confused by the comments.

Besides using comments to help make your R code more easily understood, you can use
the # symbol to ignore lines of code while you are developing your code
stream. Simply place a # in front of any line that you want to ignore. R will
treat those lines as comments and ignore them. When you want to include
those lines again in the code execution, remove the # symbols and the code is
executable again. This technique allows you to change what code you execute
without having to retype deleted code.


Logical Operators
The following table shows the logical operators supported by the R
language. They are applicable only to vectors of type logical, numeric
or complex. All non-zero numbers are treated as the logical value
TRUE and zero as FALSE.
Each element of the first vector is compared with the
corresponding element of the second vector. The result of
comparison is a Boolean value.

Operator Description Example

v <- c(3,1,TRUE,2+3i) t <-


It is called Element-wise Logical AND c(4,1,FALSE,2+3i)
operator. It combines each element of the print(v&t)
& first vector with the corresponding element
of the second vector and gives a output it produces the following result:
TRUE if both the elements are TRUE.
[1] TRUE TRUE FALSETRUE

v <- c(3,0,TRUE,2+2i) t <-


It is called Element-wise Logical OR c(4,0,FALSE,2+3i)
| operator. It combines each element of the print(v|t)
first vector with the corresponding element
of the second vector and gives a output it produces the following result:
TRUE if one the elements is TRUE.
[1] TRUE FALSETRUETRUE

v <- c(3,0,TRUE,2+2i)
It is called Logical NOT operator. Takes print(!v)
each element of the vector and gives the
! opposite logical value. it produces the following result:

[1] FALSE TRUE


FALSE FALSE

The logical operators && and || consider only the first element
of the vectors and give a vector of a single element as output.

Operator Description Example

v <- c(3,0,TRUE,2+2i)
t <- c(1,3,TRUE,2+3i)
Called Logical AND operator. Takes first print(v&&t)
&& element of both the vectors and gives the
TRUE only if both are TRUE. it produces the following result:

[1] TRUE

v <- c(0,0,TRUE,2+2i)
t <- c(0,3,TRUE,2+3i)
Called Logical OR operator. Takes first print(v||t)
|| element of both the vectors and gives the
TRUE only if both are TRUE. it produces the following result:

[1] FALSE
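
Note that in recent versions of R (an assumption worth checking against your installed version), && and || signal an error when given operands of length greater than one, so it is safer to reduce each side to a single logical value explicitly:

a <- c(3, 0, 5)
b <- c(1, 3, 0)

# Compare only the first elements, each reduced to a single logical value
print((a[1] != 0) && (b[1] != 0))   # [1] TRUE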

Decision Making
Decision making structures require the programmer to specify one or more
conditions to be evaluated or tested by the program, along with a statement or
statements to be executed if the condition is determined to be true, and
optionally, other statements to be executed if the condition is determined to be
false.
Following is the general form of a typical decision making structure found in
most of the programming languages:

R provides the following types of decision-making statements.

Statement             Description

if statement          An if statement consists of a Boolean expression followed by one or more statements.

if...else statement   An if statement can be followed by an optional else statement, which executes when the Boolean expression is false.

switch statement      A switch statement allows a variable to be tested for equality against a list of values.

R - If Statement
An if statement consists of a Boolean expression followed by one or more statements.

Syntax
The basic syntax for creating an if statement in R is:

if(boolean_expression) {
   # statement(s) will execute if the boolean expression is true.
}

If the Boolean expression evaluates to true, then the block of code inside the
if statement will be executed. If the Boolean expression evaluates to false, then
the first set of code after the end of the if statement (after the closing curly brace)
will be executed.

Flow Diagram

Example

x <- 30L

if(is.integer(x)){
print("X is an Integer")

}
When the above code is compiled and executed, it produces the following result:


[1] "X is an Integer"

R – If...Else Statement
An if statement can be followed by an optional else statement which executes
when the Boolean expression is false.

Syntax
The basic syntax for creating an if...else statement in R is:

if(boolean_expression) {
   # statement(s) will execute if the boolean expression is true.
} else {
   # statement(s) will execute if the boolean expression is false.
}

If the Boolean expression evaluates to true, then the if block of code
will be executed, otherwise the else block of code will be executed.

Flow Diagram

Example


x <- c("what","is","truth")

if("Truth" %in% x){
   print("Truth is found")
} else {
   print("Truth is not found")
}
When the above code is compiled and executed, it produces the following result:

[1] "Truth is not found"

Here "Truth" and "truth" are two different strings.

The if...else if...else Statement
An if statement can be followed by an optional else if...else statement, which
is very useful to test various conditions using a single if...else if statement.
When using if, else if, else statements there are few points to keep in mind.
• An if can have zero or one else and it must come after any else if's.
• An if can have zero to many else if's and they must come before the else.

• Once an else if succeeds, none of the remaining else if's or else's will be tested.

Syntax
The basic syntax for creating an if...else if...else statement in R is:


if(boolean_expression 1) {
   # Executes when the boolean expression 1 is true.
} else if(boolean_expression 2) {
   # Executes when the boolean expression 2 is true.
} else if(boolean_expression 3) {
   # Executes when the boolean expression 3 is true.
} else {
   # Executes when none of the above conditions is true.
}

Example

x <- c("what","is","truth")

if("Truth" %in% x){
   print("Truth is found the first time")
} else if ("truth" %in% x) {
   print("truth is found the second time")
} else {
   print("No truth found")
}

When the above code is compiled and executed, it produces the following result:

[1] "truth is found the second time"

R – Switch Statement
A switch statement allows a variable to be tested for equality against a list of
values. Each value is called a case, and the variable being switched on is checked
for each case.


Syntax
The basic syntax for creating a switch statement in R is :

switch(expression, case1, case2, case3 ... )

The following rules apply to a switch statement:


• If the value of expression is not a character string it is coerced to integer.

• You can have any number of cases within a switch. Each case is supplied as an
argument to switch(), optionally with a name to match against.

• If the value of the integer is between 1 and nargs()-1 (the number of cases), then the
corresponding case is evaluated and the result returned.

• If expression evaluates to a character string then that string is matched (exactly)


to the names ofthe elements.

• If there is more than one match, the first matching element is returned.

• No Default argument is available.

• In the case of no match, if there is an unnamed element of ..., its value is returned.
(If there is more than one such argument an error is returned; see the sketch below.)
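
A small sketch of matching on a character string (the variable grade and its values are made up for illustration):

grade <- "B"

result <- switch(grade,
   "A" = "Excellent",
   "B" = "Good",
   "Bad"              # unnamed element: returned when nothing matches
)

print(result)   # [1] "Good"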

Flow Diagram


Example

x <- switch(
   3,
   "first",
   "second",
   "third",
   "fourth"
)

print(x)
When the above code is compiled and executed, it produces the following result:

[1] "third"

There may be a situation when you need to execute a block of code several
times. In general, statements are executed sequentially. The first
statement in a function is executed first, followed by the second, and so on.
Programming languages provide various control structures that allow for
more complicated execution paths.

Loops
A loop statement allows us to execute a statement or group of statements
multiple times and the following is the general form of a loop statement in most
of the programming languages:


The R programming language provides the following kinds of loops to handle looping
requirements.

Loop Type     Description

repeat loop   Executes a sequence of statements repeatedly until a stop (break) condition is met; the condition is tested inside the loop body.

while loop    Repeats a statement or group of statements while a given condition is true. It tests the condition before executing the loop body.

for loop      Executes a sequence of statements once for each element of a vector or list, and abbreviates the code that manages the loop variable.

R - Repeat Loop
The Repeat loop executes the same code again and again until a stop condition is met.

Syntax
The basic syntax for creating a repeat loop in R is:


repeat {
   commands
   if(condition){
      break
   }
}

Flow Diagram

Example

v <- c("Hello","loop")


cnt <- 2
repeat{

print(v)
cnt <- cnt+1
if(cnt > 5){

break
}

}
When the above code is compiled and executed, it produces the following result:

[1] "Hello" "loop"


[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"

R - While Loop
The while loop executes the same code again and again as long as a given condition remains true.

Syntax
The basic syntax for creating a while loop in R is :

while (test_expression) {
   statement
}


Flow Diagram

Here the key point of the while loop is that the loop might not ever run. When the
condition is tested and the result is false, the loop body will be skipped and the
first statement after the while loop will be executed.

Example

v <- c("Hello","while loop")


cnt <- 2

while (cnt < 7){


print(v)

cnt = cnt + 1
}
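
When the above code is compiled and executed, the condition is true for cnt equal to 2 through 6, so the vector is printed five times:

[1] "Hello"      "while loop"
[1] "Hello"      "while loop"
[1] "Hello"      "while loop"
[1] "Hello"      "while loop"
[1] "Hello"      "while loop"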

R – For Loop
A for loop is a repetition control structure that allows you to
efficiently write a loop that needs to execute a specific number of
times.

Syntax
The basic syntax for creating a for loop statement in R is:


for (value in vector) {
   statements
}

Flow Diagram

R’s for loops are particularly flexible in that they are not limited
to integers, or even numbers in the input. We can pass character
vectors, logical vectors, lists or expressions.

Example

v <- LETTERS[1:4]

for ( i in v) {
   print(i)
}

When the above code is compiled and executed, it produces the following result:


[1] "A"

[1] "B"

[1] "C"

[1] "D"

Loop Control Statements


Loop control statements change execution from its normal
sequence. When execution leaves a scope, all automatic objects
that were created in that scope are destroyed.
R supports the following control statements.

Control Statement Description

break statement   Terminates the loop statement and transfers execution to the statement immediately following the loop.

next statement    Skips the remainder of the current loop iteration and moves on to the next iteration.

R – Break Statement

The break statement in R programming language has the following two usages:
• When the break statement is encountered inside a loop, the
loop is immediately terminated and program control resumes at
the next statement following the loop.

• It can be used to terminate a case in the switch statement (covered in the next chapter).

Syntax
The basic syntax for creating a break statement in R is:


break

Flow Diagram

Example
v <- c("Hello","loop")
cnt <- 2

repeat{
   print(v)
   cnt <- cnt + 1
   if(cnt > 5){
      break
   }
}

When the above code is compiled and executed, it produces the following result:

[1] "Hello" "loop"


[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"


R – Next Statement
The next statement in R programming language is useful when
we want to skip the current iteration of a loop without
terminating it. On encountering next, the R parser skips further
evaluation and starts next iteration of the loop.

Syntax
The basic syntax for creating a next statement in R is:

next

Flow Diagram

Example


v <- LETTERS[1:6]

for ( i in v){
   if (i == "D"){
      next
   }
   print(i)
}
When the above code is compiled and executed, it produces the following result:

[1] "A"

[1] "B"

List
Lists are the R objects which contain elements of different types like −
numbers, strings, vectors and another list inside it. A list can also contain a
matrix or a function as its elements. List is created using list() function.
Creating a List
Following is an example to create a list containing strings, numbers, vectors
and logical values.

# Create a list containing strings, numbers, vectors and logical values.
list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
print(list_data)

Naming List Elements


The list elements can be given names and they can be accessed using these names.


# Create a list containing a vector, a matrix and a list.


list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))

# Give names to the elements in the list.


names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# Show the list.


print(list_data)

Accessing List Elements


Elements of the list can be accessed by the index of the element in the list.
In the case of named lists, they can also be accessed using the names.
We continue to use the list in the above example −
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))

# Give names to the elements in the list.


names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# Access the first element of the list.


print(list_data[1])

# Access the third element. As it is also a list, all its elements will be printed.
print(list_data[3])

# Access the list element using the name of the element.


print(list_data$A_Matrix)

Manipulating List Elements


We can add, delete and update list elements as shown below. We can add and
delete elements only at the end of a list. But we can update any element.


# Create a list containing a vector, a matrix and a list.


list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))

# Give names to the elements in the list.


names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# Add element at the end of the list.


list_data[4] <- "New element"
print(list_data[4])

# Remove the last element.


list_data[4] <- NULL

# Print the 4th Element.


print(list_data[4])

# Update the 3rd Element.


list_data[3] <- "updated element"
print(list_data[3])

Merging Lists
You can merge many lists into one list by placing all the lists inside one list() function.
# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")

# Merge the two lists.


merged.list <- c(list1,list2)

# Print the merged list.


print(merged.list)

Converting List to Vector


A list can be converted to a vector so that the elements of the vector can be used
for further manipulation. All the arithmetic operations on vectors can be applied
after the list is converted into vectors. To do this conversion, we use the unlist()


function. It takes the list as input and produces a vector.


# Create lists.
list1 <- list(1:5)
print(list1)

list2 <-list(10:14)
print(list2)

# Convert the lists to vectors.


v1 <- unlist(list1)
v2 <- unlist(list2)

print(v1)
print(v2)

# Now add the vectors


result <- v1+v2
print(result)

Data Frame

A data frame is a table or a two-dimensional array-like structure in which each
column contains values of one variable and each row contains one set of values
from each column.
Following are the characteristics of a data frame.

• The column names should be non-empty.


• The row names should be unique.
• The data stored in a data frame can be of numeric, factor or character type.
• Each column should contain same number of data items.


Create Data Frame


# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)

Get the Structure of the Data Frame


The structure of the data frame can be seen by using str() function.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Get the structure of the data frame.
str(emp.data)

Summary of Data in Data Frame


The statistical summary and nature of the data
can be obtained by applying
summary() function.


# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the summary.
print(summary(emp.data))

Extract Data from Data Frame


Extract specific column from a data frame using
column name.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)

Extract the first two rows and then all columns


# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract first two rows.
result <- emp.data[1:2,]
print(result)

Extract 3rd and 5th row with 2nd and 4th column
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)

# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)

Expand Data Frame


A data frame can be expanded by adding columns and rows.

Add Column
Just add the column vector using a new column name.


# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)

# Add the "dept" coulmn.


emp.data$dept <- c("IT","Operations","IT","HR","Finance")
v <- emp.data
print(v)

Add Row
To add more rows permanently to an existing data frame, we need to bring in
the new rows in the same structure as the existing data frame and use the
rbind() function.
In the example below we create a data frame with new rows and merge it with
the existing data frame to create the final data frame.


# Create the first data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
dept = c("IT","Operations","IT","HR","Finance"),
stringsAsFactors = FALSE
)

# Create the second data frame


emp.newdata <- data.frame(
emp_id = c (6:8),
emp_name = c("Rasmi","Pranab","Tusar"),
salary = c(578.0,722.5,632.8),
start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
dept = c("IT","Operations","Fianance"),
stringsAsFactors = FALSE
)

# Bind the two data frames.


emp.finaldata <- rbind(emp.data,emp.newdata)
print(emp.finaldata)

lapply() function

The lapply() function is useful for performing operations on list objects and
returns a list object of the same length as the original set. lapply() returns a list of the
same length as the input list object, each element of which is the result of
applying FUN to the corresponding element of the list. lapply() takes a list, vector
or data frame as input and gives output as a list.

lapply(X, FUN)
Arguments:
-X: A vector or an object
-FUN: Function applied to each element of x

The l in lapply() stands for list. The difference between lapply() and apply() lies
in the output returned. The output of lapply() is a list. lapply() can be used
for other objects like data frames and lists.

lapply() function does not need MARGIN.

A very easy example is to change string values to lower case with the tolower()
function. We construct a character vector with the names of famous movies. The
names are in upper case format.

movies <- c("SPYDERMAN","BATMAN","VERTIGO","CHINATOWN")


movies_lower <-lapply(movies, tolower)
str(movies_lower)

Output:

## List of 4
## $:chr"spyderman"
## $:chr"batman"
## $:chr"vertigo"
## $:chr"chinatown"

We can use unlist() to convert the list into a vector.

movies_lower <-unlist(lapply(movies,tolower))
str(movies_lower)

Output:

## chr [1:4] "spyderman" "batman" "vertigo" "chinatown"

sapply() function

The sapply() function takes a list, vector or data frame as input and gives output as a
vector or matrix. It is useful for operations on list objects and returns an object of the
same length as the original set. sapply() does the same job as lapply() but simplifies
the result to a vector or matrix.


sapply(X, FUN)
Arguments:
-X: A vector or an object
-FUN: Function applied to each element of x

We can measure the minimum speed and stopping distances of cars from the cars dataset.

dt <- cars
lmn_cars <- lapply(dt, min)
smn_cars <- sapply(dt, min)
lmn_cars

We can also use a user-defined function with lapply() or sapply(). We create a
function named avg to compute the average of the minimum and maximum of
the vector.

avg <- function(x) {


( min(x) + max(x) ) / 2}
fcars <- sapply(dt, avg)
fcars

Output

## speed dist
## 14.5 61.0

sapply() is more efficient than lapply() in the output returned
because sapply() stores values directly into a vector. In the next example, we
will see this is not always the case.
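
One such case: when the function applied to each element returns a vector of length greater than one, sapply() simplifies the result to a matrix rather than a plain vector. A small sketch, still using the cars data stored in dt above:

# range() returns c(min, max) for each column, so sapply() returns a 2 x 2 matrix
rng_cars <- sapply(dt, range)
rng_cars

##      speed dist
## [1,]     4    2
## [2,]    25  120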

We can summarize the difference between sapply() and lapply() in the following table:

Function   Arguments        Objective                                           Input                        Output
lapply     lapply(X, FUN)   Apply a function to all the elements of the input   List, vector or data frame   list
sapply     sapply(X, FUN)   Apply a function to all the elements of the input   List, vector or data frame   vector or matrix

What is Object-Oriented Programming in R?

Object-Oriented Programming (OOP) is a very popular programming
paradigm. With the help of OOP concepts, we can construct modular pieces
of code which are used as building blocks for large systems. R is a functional
language, but we can also program in an OOP style. In R, OOP is a great tool
to manage the complexity of larger programs.

In Object-Oriented Programming, S3 and S4 are the two important systems.

S3

In OOP terms, S3 is used to overload functions, so that a call is dispatched to
different methods depending on the class of the input parameter.

S4
S4 is a more formal and rigorous OOP system. One limitation is that
it can be quite difficult to debug. There is also an optional reference class system related to S4.

Objects and Classes in R

In R, everything is an object. Therefore, programmers apply OOP concepts
when they write code in R. An object is a data structure which has some methods
that can act upon its attributes.

In R, classes are the outline or design for the object. Classes encapsulate the
data members, along with the functions. In R, there are two most important
classes, i.e., S3 and S4, which play an important role in performing OOPs
concepts.

Let's discuss both the classes one by one with their examples for better understanding.

1) S3 Class

With the help of the S3 class, we can take advantage of generic-function OO.
Furthermore, S3 dispatches using only the first argument. S3 differs from
traditional OO languages such as Java, C++, and C#, which implement
message-passing OO. This makes S3 easy to implement. In the S3 class, the
generic function calls the method. S3 is very casual and has no formal
definition of classes.

S3 requires very little knowledge from the programmer.

Creating an S3 class

In R, we define a function which will create a class and return an object of the
created class. A list is made with the relevant members, the class of the list is
set, and the list is returned. There is the following syntax to create a class:

variable_name <- list(member1, member2, ..., memberN)

Example

s <- list(name = "Ram", age = 29, GPA = 4.0)
class(s) <- "Faculty"
s

Output

There is the following way in which we define our generic function print.

print <- function(x, ...) UseMethod("print")

When we execute or run the above code, it will give us the following output:

Like the print function, we will make a generic function GPA to assign a new
value to our GPA member. We make the generic function GPA in the following way:

GPA <- function(obj1){
   UseMethod("GPA")
}

Once our generic function GPA is created, we will implement a default function for it

GPA.default <- function(obj){
   cat("We are entering in generic function\n")
}

After that we will make a new method for our GPA function in the following way

GPA.faculty <- function(obj1){
   cat("Final GPA is ", obj1$GPA, "\n")
}

And at last we will run the method GPA as

GPA(s)

Output

Inheritance in S3

Inheritance means extracting the features of one class into another class. In
the S3 class of R, inheritance is achieved by assigning a vector to the class
attribute.

For inheritance, we first create a function which creates a new object of
class faculty in the following way:

faculty <- function(n, a, g) {
   value <- list(name = n, age = a, GPA = g)
   attr(value, "class") <- "faculty"
   value
}

After that we will define a method for the generic function print() as:

print.faculty <- function(obj1) {
   cat(obj1$name, "\n")
   cat(obj1$age, "years old\n")
   cat("GPA:", obj1$GPA, "\n")
}

Now, we will create an object of class InternationalFaculty which will inherit
from the faculty class. This is done by assigning a character vector of
class names as:

class(object) <- c(child, parent)

so,

# create a list
fac <- list(name="Shubham", age=22, GPA=3.5, country="India")
# make it of the class InternationalFaculty which is derived from the class Faculty
class(fac) <- c("InternationalFaculty","Faculty")
# print it out
fac

When we run the above code which we have discussed, it will generate the following output:

We can see above that we have not defined any method of the form
print.InternationalFaculty(), so the method print.Faculty() was called. This
method was inherited from the class Faculty.

So our next step is to define print.InternationalFaculty() in the following way:

print.InternationalFaculty <- function(obj1) {
   cat(obj1$name, "is from", obj1$country, "\n")
}

The above function will override the method defined for class Faculty, so printing the object now uses it:

fac
getS3method and getAnywhere function

There are the two most common and popular S3 method functions which are
used in R. The first method is getS3method() and the second one is
getAnywhere().

S3 finds the appropriate method associated with a class, and it is useful to see
how a method is implemented. Sometimes, the methods are non-visible,
because they are hidden in a namespace. We use getS3method or getAnywhere
to solve this problem.

getS3method

getAnywhere function

getAnywhere("simpleLoess")

2) S4 Class

The S4 class is similar to the S3 but is more formal than the latter one. It differs
from S3 in two different ways. First, in S4, there are formal class definitions
which provide a description and representation of classes. In addition, it has
special auxiliary functions for defining methods and generics. S4 also offers
multiple dispatch, which means that generic functions can select methods based
on the classes of multiple arguments.

Creating an S4 class

In R, we use the setClass() command for creating an S4 class. In an S4 class, we
can specify a function for verifying data consistency and also specify default
values. In R, member variables are called slots.

To create an S4 class, we have to define the class and its slots. There are the
following steps to create an S4 class:

Step 1:

In the first step, we will create a new class called faculty with three slots name, age, and GPA.

setClass("faculty", slots=list(name="character", age="numeric", GPA="numeric"))

There are many other optional arguments of the setClass() function which we can
explore by using the ?setClass command.

Step 2:

In the next step, we will create the object of S4 class. R provides new() function
to create an object of S4 class. In this new function we pass the class name and
the values for the slots in the following way:

1. setClass("faculty", slots=list(name="character", age="numeric", GPA="numeric"))


2. # creating an object using new()
3. # providing the class name and value for slots
4. s <- new("faculty",name="Shubham", age=22, GPA=3.5)
5. s

It will generate the following output

Creating S4 objects using a generator function

The setClass() function returns a generator function. This generator function
helps in creating new objects and acts as a constructor.

A <- setClass("faculty", slots=list(name="character", age="numeric", GPA="numeric"))
A
2. A
It will generate the following output:

Now we can use the above constructor function to create new objects. The
constructor in turn uses the new() function to create objects. It is just a wrap
around. Let's see an example to understand how S4 object is created with the
help of generator function.


Example

faculty <- setClass("faculty", slots=list(name="character", age="numeric", GPA="numeric"))

# creating an object using the generator function
# providing the class name and value for slots
faculty(name="Shubham", age=22, GPA=3.5)
4. faculty(name="Shubham", age=22, GPA=3.5)

Output

Inheritance in S4 class

Like S3 class, we can perform inheritance in S4 class also. The derived class
will inherit both attributes and methods of the parent class. Let's start
understanding how we can perform inheritance in the S4 class. There are the
following steps to perform inheritance in the S4 class:

Step 1:

In the first step, we will create or define class with appropriate slots in the following way:

1. setClass("faculty",
2. slots=list(name="character",
age="numeric", GPA="numeric")3. )

Step 2:


After defining the class, our next step is to define a class method for the show()
generic function. This will be done in the following manner:

setMethod("show",
   "faculty",
   function(object) {
      cat(object@name, "\n")
      cat(object@age, "years old\n")
      cat("GPA:", object@GPA, "\n")
   }
)

Step 3:

In the next step, we will define the derived class with the argument contains.
The derived class is defined in the following way:

1. setClass("Internationalfaculty",
2. slots=list(country="character"),
3. c
o
n
t
a
i
n
s
=
"
f
a
c
u
l
t
y
"
4
.
)


In our derived class we have defined only one slot, country. The other slots
will be inherited from its parent class.

s <- new("Internationalfaculty", name="John", age=21, GPA=3.5, country="India")

show(s)

When we did show(s), the method defined for class faculty gets called. We
can also define methods for the derived class of the base class, as in the case of
the S3 system.

DATA MANIPULATION
Module 4
We can easily perform data manipulation using R software. We’ll cover the following data
manipulation techniques:
▪ filtering and ordering rows,
▪ renaming and adding columns,
▪ computing summary statistics
We’ll use mainly the popular dplyr R package, which contains important R functions to carry
out your data manipulation easily. In the final section, we’ll show you how to group your data
by a grouping variable, and then compute some summary statistics on each subset. You will
also learn how to chain your data manipulation operations.
At the end of this course, you will be familiar with data manipulation tools and approaches that
will allow you to efficiently manipulate data.
Required R packages

We recommend to install the tidyverse packages, which include the dplyr package (for data
manipulation) and additional R packages for easily reading (readr), transforming (tidyr) and
visualizing (ggplot2) datasets.
Install:
install.packages("tidyverse")
Load the tidyverse packages, which also include the dplyr package:
library("tidyverse")
Demo datasets

We’ll use mainly the R built-in iris data set, which we start by converting into a tibble data
frame (tbl_df) for easier data analysis. tbl_df data object is a data frame providing a nicer
printing method, useful when working with large data sets.
library("tidyverse")
my_data <- as_tibble(iris)
my_data
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## # ... with 144 more rows
Note that, the type of data in each column is specified. Common types include:
int: integers
dbl: double (real numbers),
chr: character vectors, strings, texts

121
R Programming

fctr: factor,
dttm: date-times (date + time)
lgl: logical (TRUE or FALSE)
date: dates
Main data manipulation functions

There are 8 fundamental data manipulation verbs that you will use to do most of your data
manipulations. These functions are included in the dplyr package:
filter(): Pick rows (observations/samples) based on their values.
distinct(): Remove duplicate rows.
arrange(): Reorder the rows.
select(): Select columns (variables) by their names.
rename(): Rename columns.
mutate() and transmute(): Add/create new variables.
summarise(): Compute statistical summaries (e.g., computing the mean or the sum)
It’s also possible to combine each of these verbs with the function group_by() to operate on
subsets of the data set (group-by-group).
All these functions work similarly as follow:
The first argument is a data frame
The subsequent arguments are comma separated list of unquoted variable names and the
specification of what you want to do
The result is a new data frame
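
A minimal sketch that chains several of these verbs with the pipe operator %>% (using the my_data tibble created above; the column choices are only for illustration):

my_data %>%
  filter(Sepal.Length > 5) %>%                   # pick rows
  select(Sepal.Length, Sepal.Width, Species) %>% # pick columns
  group_by(Species) %>%                          # operate group by group
  summarise(mean_length = mean(Sepal.Length),    # compute summary statistics
            n = n())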


R STATISTICS & LINEAR MODELING


Module 5

Probability Distributions

A probability distribution describes how the values of a random variable are distributed. For
example, the collection of all possible outcomes of a sequence of coin tosses is known to
follow the binomial distribution, whereas the means of sufficiently large samples of a data
population are known to resemble the normal distribution. Since the characteristics of these
theoretical distributions are well understood, they can be used to make statistical inferences
on the entire data population as a whole.
In the following tutorials, we demonstrate how to compute a few well-known probability
distributions that occur frequently in statistical study. We reference them quite often in other
sections.
Binomial Distribution

The binomial distribution is a discrete probability distribution. It describes the outcome


of n independent trials in an experiment. Each trial is assumed to have only two outcomes,
either success or failure. If the probability of a successful trial is p, then the probability of
having x successful outcomes in an experiment of n independent trials is:

P(X = x) = choose(n, x) * p^x * (1 - p)^(n - x),   for x = 0, 1, 2, ..., n

Problem
Suppose there are twelve multiple choice questions in an English class quiz. Each question has
five possible answers, and only one of them is correct. Find the probability of having four or
less correct answers if a student attempts to answer every question at random.
Solution
Since only one out of five possible answers is correct, the probability of answering a question
correctly by random is 1/5=0.2. We can find the probability of having exactly 4 correct answers
by random attempts as follows.
> dbinom(4, size=12, prob=0.2)
[1] 0.1329
To find the probability of having four or less correct answers by random attempts, we apply
the function dbinom with x = 0,…,4.
> dbinom(0, size=12, prob=0.2) +
+ dbinom(1, size=12, prob=0.2) +
+ dbinom(2, size=12, prob=0.2) +
+ dbinom(3, size=12, prob=0.2) +
+ dbinom(4, size=12, prob=0.2)
[1] 0.9274
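Equivalently, since dbinom is vectorised over its first argument, the five terms can be added in a single call:
> sum(dbinom(0:4, size=12, prob=0.2))
[1] 0.9274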
Alternatively, we can use the cumulative probability function for binomial
distribution pbinom.


> pbinom(4, size=12, prob=0.2)


[1] 0.92744
Answer
The probability of four or less questions answered correctly by random in a twelve question
multiple choice quiz is 92.7%.

Poisson Distribution

The Poisson distribution is the probability distribution of independent event occurrences in an
interval. If λ is the mean number of occurrences per interval, then the probability of having x
occurrences within a given interval is:

P(X = x) = (λ^x * e^(-λ)) / x!,  for x = 0, 1, 2, …
Problem
If there are twelve cars crossing a bridge per minute on average, find the probability of having
seventeen or more cars crossing the bridge in a particular minute.
Solution
The probability of having sixteen or less cars crossing the bridge in a particular minute is given
by the function ppois.
> ppois(16, lambda=12) # lower tail
[1] 0.89871
Hence the probability of having seventeen or more cars crossing the bridge in a minute is in
the upper tail of the probability density function.
> ppois(16, lambda=12, lower=FALSE) # upper tail
[1] 0.10129
Answer
If there are twelve cars crossing a bridge per minute on average, the probability of having
seventeen or more cars crossing the bridge in a particular minute is 10.1%.
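As a quick check, the upper-tail probability is simply the complement of the lower tail:
> 1 - ppois(16, lambda=12)
[1] 0.10129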
Continuous Uniform Distribution

The continuous uniform distribution is the probability distribution of random number selection
from the continuous interval between a and b. Its density function is:

f(x) = 1 / (b - a)  for a ≤ x ≤ b,  and f(x) = 0 otherwise.
Here is a graph of the continuous uniform distribution with a = 1, b = 3.


Problem
Select ten random numbers between one and three.
Solution
We apply the generation function runif of the uniform distribution to generate ten random
numbers between one and three.
> runif(10, min=1, max=3)
[1] 1.6121 1.2028 1.9306 2.4233 1.6874 1.1502 2.7068
[8] 1.4455 2.4122 2.2171
Exponential Distribution

The exponential distribution describes the arrival time of a randomly recurring independent
event sequence. If μ is the mean waiting time for the next event recurrence, its probability
density function is:

f(x) = (1/μ) * e^(-x/μ)  for x ≥ 0,  and f(x) = 0 otherwise.
Here is a graph of the exponential distribution with μ = 1.


Problem
Suppose the mean checkout time of a supermarket cashier is three minutes. Find the
probability of a customer checkout being completed by the cashier in less than two minutes.
Solution
The checkout processing rate is equal to one divided by the mean checkout completion time.
Hence the processing rate is 1/3 checkouts per minute. We then apply the function pexp of
the exponential distribution with rate=1/3.
> pexp(2, rate=1/3)
[1] 0.48658
Answer
The probability of finishing a checkout in under two minutes by the cashier is 48.7%
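A simulation-based check of the same quantity (an illustrative sketch; the exact value varies with the random seed): draw a large number of checkout times with rexp and compute the proportion completed within two minutes.
> set.seed(1)
> mean(rexp(100000, rate=1/3) < 2)   # should be close to the theoretical 0.487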
Normal Distribution

The normal distribution is defined by the following probability density function, where μ is
the population mean and σ^2 is the variance:

f(x) = (1 / (σ * sqrt(2π))) * e^(-(x - μ)^2 / (2σ^2))

If a random variable X follows the normal distribution, then we write:

X ~ N(μ, σ^2)

In particular, the normal distribution with μ = 0 and σ = 1 is called the standard normal
distribution, and is denoted N(0, 1). It can be graphed as follows.


The normal distribution is important because of the Central Limit Theorem, which states
that the distribution of the means of all possible samples of size n drawn from a population
with mean μ and variance σ^2 approaches a normal distribution with mean μ and variance
σ^2/n as n approaches infinity.
Problem
Assume that the test scores of a college entrance exam fits a normal distribution.
Furthermore, the mean test score is 72, and the standard deviation is 15.2. What is the
percentage of students scoring 84 or more in the exam?
Solution
We apply the function pnorm of the normal distribution with mean 72 and standard deviation
15.2. Since we are looking for the percentage of students scoring higher than 84, we are
interested in the upper tail of the normal distribution.
> pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
[1] 0.21492
Answer
The percentage of students scoring 84 or more in the college entrance exam is 21.5%.
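The companion quantile function qnorm answers the reverse question. As an additional illustration (not part of the original problem), the score that separates the top 10% of students is:
> qnorm(0.90, mean=72, sd=15.2)
[1] 91.48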

Linear Regression
Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables. One of these variables is called the predictor variable, whose values are
gathered through experiments. The other is called the response variable, whose values are
derived from the predictor variable.
In linear regression these two variables are related through an equation in which the exponent
(power) of both variables is 1. Mathematically, a linear relationship represents a straight
line when plotted as a graph. A non-linear relationship, where the exponent of a variable is
not equal to 1, creates a curve.


The general mathematical equation for a linear regression is −


y = ax + b
Following is the description of the parameters used −
• y is the response variable.
• x is the predictor variable.
• a and b are constants which are called the coefficients.

Steps to Establish a Regression


A simple example of regression is predicting weight of a person when his height is known. To
do this we need to have the relationship between height and weight of a person.
The steps to create the relationship are −
• Carry out the experiment of gathering a sample of observed values of height and corresponding
weight.
• Create a relationship model using the lm() function in R.
• Find the coefficients from the model created and build the mathematical equation using these
coefficients.
• Get a summary of the relationship model to know the average error in prediction (the
residuals).
• To predict the weight of new persons, use the predict() function in R.
Input Data
Below is the sample data representing the observations −
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131

# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
Following is the description of the parameters used −
• formula is a formula object describing the relation between x and y.
• data is the data frame (or environment) containing the variables used in the formula.
Create Relationship Model & get the Coefficients
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)


# Apply the lm() function.


relation <- lm(y~x)

print(relation)
When we execute the above code, it produces the following result −
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x
-38.4551 0.6746
Get the Summary of the Relationship
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.


relation <- lm(y~x)

print(summary(relation))
When we execute the above code, it produces the following result −
Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.253 on 8 degrees of freedom


Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06
predict() Function
Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)
Following is the description of the parameters used −
• object is the model which has already been created using the lm() function.
• newdata is the data frame containing the new value(s) for the predictor variable.


Predict the weight of new persons


# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# The response vector.


y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.


relation <- lm(y~x)

# Find weight of a person with height 170.


a <- data.frame(x = 170)
result <- predict(relation,a)
print(result)
When we execute the above code, it produces the following result −
1
76.22869
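predict() can also return a confidence interval around the fitted value; here is a small optional extension of the example above:

# Predict with a 95% confidence interval around the fitted value
predict(relation, a, interval = "confidence")

This returns the fitted weight together with lower and upper confidence bounds.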
Visualize the Regression Graphically
# Create the predictor and response variable.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)

# Give the chart file a name.


png(file = "linearregression.png")

# Plot the chart.


plot(y,x,col = "blue",main = "Height & Weight Regression",
abline(lm(x~y)),cex = 1.3,pch = 16,xlab = "Weight in Kg",ylab = "Height in cm")

# Save the file.


dev.off()
When we execute the above code, it produces the following result −


Covariance and Correlation in R Programming

Covariance and Correlation are terms used in statistics to measure relationships between two
random variables. Both of these terms measure linear dependency between a pair of random
variables or bivariate data.
In this article, we are going to discuss cov(), cor() and cov2cor() functions in R which use
covariance and correlation methods of statistics and probability theory.
Covariance
In R programming, covariance can be measured using the cov() function. Covariance is a
statistical term used to measure the direction of the linear relationship between two data
vectors. Mathematically,

cov(x, y) = Σ (x_i - x̄)(y_i - ȳ) / (N - 1)

where,

x represents the x data vector
y represents the y data vector
x̄ represents the mean of the x data vector
ȳ represents the mean of the y data vector
N represents the total number of observations

Syntax:
cov(x, y, method)
where,
x and y represent the data vectors
method defines the method to be used to compute the covariance: "pearson" (default), "kendall" or "spearman".
Example:

# Data vectors

x <- c(1, 3, 5, 10)

y <- c(2, 4, 6, 20)

# Print covariance using different methods

print(cov(x, y))

print(cov(x, y, method = "pearson"))

print(cov(x, y, method = "kendall"))

print(cov(x, y, method = "spearman"))


Output:
[1] 30.66667
[1] 30.66667
[1] 12
[1] 1.666667
Correlation
The cor() function in R measures the correlation coefficient. Correlation is a relationship term
in statistics that uses the covariance to measure how strongly two vectors are related.
Mathematically,

cor(x, y) = Σ (x_i - x̄)(y_i - ȳ) / sqrt( Σ (x_i - x̄)^2 * Σ (y_i - ȳ)^2 )

where,

x represents the x data vector
y represents the y data vector
x̄ represents the mean of the x data vector
ȳ represents the mean of the y data vector

Syntax:
cor(x, y, method)
where,
x and y represent the data vectors
method defines the method to be used to compute the correlation: "pearson" (default), "kendall" or "spearman".
Example:

# Data vectors

x <- c(1, 3, 5, 10)

y <- c(2, 4, 6, 20)

# Print correlation using different methods

print(cor(x, y))


print(cor(x, y, method = "pearson"))

print(cor(x, y, method = "kendall"))

print(cor(x, y, method = "spearman"))

Output:
[1] 0.9724702
[1] 0.9724702
[1] 1
[1] 1
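As a cross-check connecting the two functions, the Pearson correlation can be reproduced from cov() and sd():

# Pearson correlation is the covariance scaled by the two standard deviations
print(cov(x, y) / (sd(x) * sd(y)))

Output:
[1] 0.9724702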
Conversion of Covariance to Correlation
The cov2cor() function in R converts a covariance matrix into the corresponding
correlation matrix.
Syntax:
cov2cor(X)
where,
X represents a square covariance matrix
Example:

# Data vectors

x <- rnorm(2)

y <- rnorm(2)

# Binding into square matrix

mat <- cbind(x, y)


# Defining X as the covariance matrix

X <- cov(mat)

# Print covariance matrix

print(X)

# Print correlation matrix of data vector

print(cor(mat))

# Using function cov2cor()

# To convert covariance matrix to correlation matrix

print(cov2cor(X))

Output:
x y
x 0.0742700 -0.1268199
y -0.1268199 0.2165516

x y
x 1 -1
y -1 1

x y
x 1 -1
y -1 1


t-tests
One of the most common tests in statistics is the t-test, used to determine
whether the means of two groups are equal to each other. The assumption for
the test is that both groups are sampled from normal distributions with equal
variances. The null hypothesis is that the two means are equal, and the
alternative is that they are not. It is known that under the null hypothesis, we
can calculate a t-statistic that will follow a t-distribution with n1 + n2 - 2 degrees
of freedom. There is also a widely used modification of the t-test, known as
Welch's t-test that adjusts the number of degrees of freedom when the variances
are thought not to be equal to each other. Before we can explore the test much
further, we need to find an easy way to calculate the t-statistic.

The function t.test is available in R for performing t-tests. Let's test it out on a
simple example, using data simulated from a normal distribution.

> x = rnorm(10)
> y = rnorm(10)
> t.test(x,y)

Welch Two Sample t-test

data: x and y
t = 1.4896, df = 15.481, p-value = 0.1564
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3221869 1.8310421
sample estimates:
mean of x mean of y
0.1944866 -0.5599410
Before we can use this function in a simulation, we need to find out how to
extract the t-statistic (or some other quantity of interest) from the output of the
t.test function. For this function, the R help page has a detailed list of what the
object returned by the function contains. A general method for a situation like
this is to use the class and names functions to find where the quantity of interest
is. In addition, for some hypothesis tests, you may need to pass the object from
the hypothesis test to the summary function and examine its contents. For t.test
it's easy to figure out what we want:

> ttest = t.test(x,y)


> names(ttest)
[1] "statistic" "parameter" "p.value" "conf.int" "estimate"


[6] "null.value" "alternative" "method" "data.name"


The value we want is named "statistic". To extract it, we can use the dollar sign
notation, or double square brackets:

> ttest$statistic
t
1.489560
> ttest[['statistic']]
t
1.489560
Of course, just one value doesn't let us do very much - we need to generate many
such statistics before we can look at their properties. In R, the replicate function
makes this very simple. The first argument to replicate is the number of samples
you want, and the second argument is an expression (not a function name or
definition!) that will generate one of the samples you want. To generate 1000 t-
statistics from testing two groups of 10 standard random normal numbers, we
can use:

> ts = replicate(1000,t.test(rnorm(10),rnorm(10))$statistic)
Under the assumptions of normality and equal variance, we're assuming that the
statistic will have a t-distribution with 10 + 10 - 2 = 18 degrees of freedom.
(Each observation contributes a degree of freedom, but we lose two because we
have to estimate the mean of each group.) How can we test if that is true?

One way is to plot the theoretical density of the t-statistic we should be seeing,
and superimposing the density of our sample on top of it. To get an idea of what
range of x values we should use for the theoretical density, we can view the
range of our simulated data:

> range(ts)
[1] -4.564359 4.111245
Since the distribution is supposed to be symmetric, we'll use a range from -4.5
to 4.5. We can generate equally spaced x-values in this range with seq:

> pts = seq(-4.5,4.5,length=100)


> plot(pts,dt(pts,df=18),col='red',type='l')

Now we can add a line to the plot showing the density for our simulated sample:


> lines(density(ts))
The plot appears below.

Another way to compare two densities is with a quantile-quantile plot. In this


type of plot, the quantiles of two samples are calculated at a variety of points in
the range of 0 to 1, and then are plotted against each other. If the two samples
came from the same distribution with the same parameters, we'd see a straight
line through the origin with a slope of 1; in other words, we're testing to see if
various quantiles of the data are identical in the two samples. If the two samples
came from similar distributions, but their parameters were different, we'd still
see a straight line, but not through the origin. For this reason, it's very common
to draw a straight line through the origin with a slope of 1 on plots like this. We
can produce a quantile-quantile plot (or QQ plot as they are commonly known),
using the qqplot function. To use qqplot, pass it two vectors that contain the
samples that you want to compare. When comparing to a theoretical
distribution, you can pass a random sample from that distribution. Here's a QQ
plot for the simulated t-test data:

> qqplot(ts,rt(1000,df=18))
> abline(0,1)

We can see that the central points of the graph seems to agree fairly well, but
there are some discrepancies near the tails (the extreme values on either end of
the distribution). The tails of a distribution are the most difficult part to
accurately measure, which is unfortunate, since those are often the values that
interest us most, that is, the ones which will provide us with enough evidence
to reject a null hypothesis. Because the tails of a distribution are so important,
another way to test to see if a distribution of a sample follows some
hypothesized distribution is to calculate the quantiles of some tail probabilities
(using the quantile function) and compare them to the theoretical probabilities
from the distribution (obtained from the function for that distribution whose first
letter is "q"). Here's such a comparison for our simulated data:

> probs = c(.9,.95,.99)


> quantile(ts,probs)
90% 95% 99%
1.427233 1.704769 2.513755
> qt(probs,df=18)
[1] 1.330391 1.734064 2.552380
The quantiles agree fairly well, especially at the .95 and .99 quantiles.


Performing more simulations, or using a large sample size for the two groups
would probably result in values even closer to what we have theoretically
predicted.

One final method for comparing distributions is worth mentioning. We noted


previously that one of the assumptions for the t-test is that the variances of the
two samples are equal. However, a modification of the t-test known as Welch's
test is said to correct for this problem by estimating the variances, and adjusting
the degrees of freedom to use in the test. This correction is performed by default,
but can be shut off by using the var.equal=TRUE argument. Let's see how it
works:

> t.test(x,y)

Welch Two Sample t-test

data: x and y
t = -0.8103, df = 17.277, p-value = 0.4288
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.0012220 0.4450895
sample estimates:
mean of x mean of y
0.2216045 0.4996707

> t.test(x,y,var.equal=TRUE)

Two Sample t-test

data: x and y
t = -0.8103, df = 18, p-value = 0.4284
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.9990520 0.4429196
sample estimates:
mean of x mean of y
0.2216045 0.4996707
Since the statistic is the same in both cases, it doesn't matter whether we use the
correction or not; either way we'll see identical results when we compare the
two methods using the techniques we've already described. Since the degree of
freedom correction changes depending on the data, we can't simply perform the


simulation and compare it to a different number of degrees of freedom. The


other thing that changes when we apply the correction is the p-value that we
would use to decide if there's enough evidence to reject the null hypothesis.
What is the behaviour of the p-values? While not necessarily immediately
obvious, under the null hypothesis, the p-values for any statistical test should
form a uniform distribution between 0 and 1; that is, any value in the interval 0
to 1 is just as likely to occur as any other value. For a uniform distribution, the
quantile function is just the identity function. A value of .5 is greater than 50%
of the data; a value of .95 is greater than 95% of the data. As a quick check of
this notion, let's look at the density of probability values when the null
hypothesis is true:

> tps = replicate(1000,t.test(rnorm(10),rnorm(10))$p.value)


> plot(density(tps))
The graph appears below.

Another way to check to see if the probabilities follow a uniform distribution is


with a QQ plot:

> qqplot(tps,runif(1000))
> abline(0,1)
The graph appears below.

The idea that the probabilities follow a uniform distribution seems reasonable.

Now, let's look at some of the quantiles of the p-values when we force the t.test
function to use var.equal=TRUE:

> tps = replicate(1000,t.test(rnorm(10),rnorm(10),var.equal=TRUE)$p.value)


> probs = c(.5,.7,.9,.95,.99)
> quantile(tps,probs)
50% 70% 90% 95% 99%
0.4873799 0.7094591 0.9043601 0.9501658 0.9927435
The agreement actually looks very good. What about when we let t.test decide
whether to make the correction or not?

> tps = replicate(1000,t.test(rnorm(10),rnorm(10))$p.value)


> quantile(tps,probs)
50% 70% 90% 95% 99%
0.4932319 0.7084562 0.9036533 0.9518775 0.9889234
There's not that much of a difference, but, of course, the variances in this
example were equal. How does the correction work when the variances are not
equal?

> tps =
replicate(1000,t.test(rnorm(10),rnorm(10,sd=5),var.equal=TRUE)$p.value)
> quantile(tps,probs)
50% 70% 90% 95% 99%
0.5221698 0.6926466 0.8859266 0.9490947 0.9935562
> tps = replicate(1000,t.test(rnorm(10),rnorm(10,sd=5))$p.value)
> quantile(tps,probs)
50% 70% 90% 95% 99%
0.4880855 0.7049834 0.8973062 0.9494358 0.9907219
There is an improvement, but not so dramatic.

Power of the t-test


Of course, all of this is concerned with the null hypothesis. Now let's start to
investigate the power of the t-test. With a sample size of 10, we obviously aren't
going to expect truly great performance, so let's consider a case that's not too
subtle. When we don't specify a standard deviation for rnorm it uses a standard
deviation of 1. That means about 68% of the data will fall in the range of -1 to
1. Suppose we have a difference in means equal to just one standard deviation,
and we want to calculate the power for detecting that difference. We can follow
the same procedure as the coin tossing experiment: specify an alpha level,
calculate the rejection region, simulate data under the alternative hypothesis,
and see how many times we'd reject the null hypothesis. As in the coin toss
example, a function will make things much easier:

t.power = function(nsamp=c(10,10),nsim=1000,means=c(0,0),sds=c(1,1)){
lower = qt(.025,df=sum(nsamp) - 2)
upper = qt(.975,df=sum(nsamp) - 2)
ts = replicate(nsim,
t.test(rnorm(nsamp[1],mean=means[1],sd=sds[1]),
rnorm(nsamp[2],mean=means[2],sd=sds[2]))$statistic)

sum(ts < lower | ts > upper) / nsim


}
Let's try it with our simple example:


> t.power(means=c(0,1))
[1] 0.555
Not bad for a sample size of 10!

Of course, if the differences in means are smaller, it's going to be harder to reject
the null hypothesis:

> t.power(means=c(0,.3))
[1] 0.104
How large a sample size would we need to detect that difference of .3 with 95%
power?

> samps = c(100,200,300,400,500)


> res = sapply(samps,function(n)t.power(means=c(0,.3),nsamp=c(n,n)))
> names(res) = samps
> res
100 200 300 400 500
0.567 0.841 0.947 0.992 0.999
It would take over 300 samples in each group to be able to detect such a
difference.
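For the equal-variance case there is also a closed-form answer in base R, which can be used as a sanity check on the simulation; it reports a required sample size of roughly 290 per group, consistent with the simulated power of 0.947 at n = 300:

> power.t.test(delta=.3, sd=1, sig.level=.05, power=.95)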

Now we can return to the issue of unequal variances. We saw that Welch's
adjustment to the degrees of freedom helped a little bit under the null
hypothesis. Now let's see if the power of the test is improved using Welch's test
when the variances are unequal. To do this, we'll need to modify our t.power
function a little:

t.power1 = function(nsamp=c(10,10),nsim=1000,means=c(0,0),sds=c(1,1),var.equal=TRUE){
    tps = replicate(nsim,
        t.test(rnorm(nsamp[1],mean=means[1],sd=sds[1]),
               rnorm(nsamp[2],mean=means[2],sd=sds[2]),
               var.equal=var.equal)$p.value)   # var.equal is passed through to t.test

    sum(tps < .025 | tps > .975) / nsim
}
Since I set var.equal=TRUE by default, Welch's adjustment will not be used
unless we specify var.equal=FALSE. Let's see what the power is for a sample
of size 10, assuming the two group means are 1 and 2 (a difference of one),
and that the second group has standard deviation 2 while the first keeps sd=1:


> t.power1(nsim=10000,sds=c(1,2),mean=c(1,2))
[1] 0.1767
> t.power1(nsim=10000,sds=c(1,2),mean=c(1,2),var.equal=FALSE)
[1] 0.1833
There does seem to be an improvement, but not so dramatic.

We can look at the same thing for a variety of sample sizes:

> sizes <- c(10, 20, 50, 100)


> res1 = sapply(sizes,function(n)t.power1(nsim=10000,sds=c(1,2),
+ mean=c(1,2),nsamp=c(n,n)))
> names(res1) = sizes
> res1
10 20 50 100
0.1792 0.3723 0.8044 0.9830
> res2 = sapply(sizes,function(n)t.power1(nsim=10000,sds=c(1,2),
+ mean=c(1,2),nsamp=c(n,n),var.equal=FALSE))
> names(res2) = sizes
> res2
10 20 50 100
0.1853 0.3741 0.8188 0.9868

Linear Model and ANOVA


Linear Model
The classic linear model forms the basis for ANOVA (with categorical treatments) and
ANCOVA (which deals with continuous explanatory variables). Its basic equation is the
following:

y = β_0 + β_1*x + ε

where β_0 is the intercept (i.e. the value of the line at zero), β_1 is the slope for the variable x,
which indicates the change in y as a function of changes in x, and ε is the error term. For
example, if the slope is +0.5 we can say that for each unit increment in x, y increases by 0.5.
Please note that the slope can also be negative.

This equation can be expanded to accommodate more than one explanatory variable x:

y = β_0 + β_1*x_1 + β_2*x_2 + … + β_n*x_n + ε

In this case the interpretation is a bit more complex because for example the coefficient β_2
provides the slope for the explanatory variable x_2. This means that for a unit variation of x_2
the target variable y changes by the value of β_2, if the other explanatory variables are kept
constant.

In case our model includes interactions, the linear equation would be changed as follows:

y = β_0 + β_1*x_1 + β_2*x_2 + β_3*(x_1*x_2) + ε

Notice the interaction term between x_1 and x_2. In this case the interpretation becomes
extremely difficult just by looking at the model.

In fact, if we rewrite the equation focusing for example on x_1:

y = β_0 + (β_1 + β_3*x_2)*x_1 + β_2*x_2 + ε

we can see that its slope becomes affected by the value of x_2 (Yan & Su, 2009). For this reason,
the only way we can actually determine how x_1 changes y, when the other terms are kept
constant, is to use the equation with new values of x_1.

This linear model can be applied to continuous target variables; in this case we would talk about
an ANCOVA for exploratory analysis, or a linear regression if the objective is to create a
predictive model.

ANOVA
The analysis of variance is based on the linear model presented above; the only difference is
that its reference point is the mean of the dataset. When we described the equations above we
said that to interpret the results of the linear model we would look at the slope term; this
indicates the rate of change in y if we change one variable and keep the rest constant. The
ANOVA calculates the effects of each treatment based on the grand mean, which is the mean
of the variable of interest.

In mathematical terms, ANOVA solves the following equation (Williams, 2004):

y_ij = μ + τ_j + ε_ij

where y_ij is observation i in group j, τ_j is the effect of treatment j, μ is the grand mean (i.e.
the mean of the whole dataset) and ε_ij is the error term. From this equation it is clear that the
effects calculated by the ANOVA are not referred to unit changes in the explanatory variables,
but are all related to changes on the grand mean.

Examples of ANOVA and ANCOVA in R


For this example we are going to use one of the datasets available in the
package agridat, available on CRAN:

install.packages("agridat")

We also need to include other packages for the examples below. If some of these are not
installed in your system please use again the function install.packages (replacing the name
within quotation marks according to your needs) to install them.

library(agridat)

library(ggplot2)

library(plotrix)

library(moments)

library(car)

library(fitdistrplus)

library(nlme)

library(multcomp)

library(epade)

library(lme4)

Now we can load the dataset lasrosas.corn, which has more than 3400 observations of corn
yield in a field in Argentina, plus several explanatory variables both factorial (or categorical)
and continuous.

> dat = lasrosas.corn

> str(dat)

'data.frame': 3443 obs. of 9 variables:


$ year : int 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...

$ lat : num -33.1 -33.1 -33.1 -33.1 -33.1 ...

$ long : num -63.8 -63.8 -63.8 -63.8 -63.8 ...

$ yield: num 72.1 73.8 77.2 76.3 75.5 ...

$ nitro: num 132 132 132 132 132 ...

$ topo : Factor w/ 4 levels "E","HT","LO",..: 4 4 4 4 4 4 4 4 4 4 ...

$ bv : num 163 170 168 177 171 ...

$ rep : Factor w/ 3 levels "R1","R2","R3": 1 1 1 1 1 1 1 1 1 1 ...

$ nf : Factor w/ 6 levels "N0","N1","N2",..: 6 6 6 6 6 6 6 6 6 6 ...

Important for the purpose of this tutorial is the target variable yield, which is what we are trying
to model, and the explanatory variables: topo (topographic factor), bv (brightness value, which
is a proxy for low organic matter content) and nf (factorial nitrogen levels). In addition we have
rep, which is the blocking factor.

Checking Assumptions
Since we are planning to use an ANOVA we first need to check that our data fits with its
assumptions. ANOVA is based on the following assumptions:
▪ Data independence
▪ Normality
▪ Equality of variances between groups
▪ Balanced design (i.e. all groups have the same number of samples)
Let’s see how we can test for them in R. Clearly we are talking about environmental data so
the assumption of independence is not met, because data are autocorrelated with distance.
Theoretically speaking, for spatial data ANOVA cannot be employed and more robust methods
should be used (e.g. REML); however, over the years it has been widely used for the analysis
of environmental data and it is accepted by the community. That does not mean that it is the
correct method though, and later on in this tutorial we will see the function to perform linear
modelling with REML.
The third assumption is the easiest to assess, using the function tapply:

> tapply(dat$yield, INDEX=dat$nf, FUN=var)

N0 N1 N2 N3 N4 N5

438.5448 368.8136 372.8698 369.6582 366.5705 405.5653

In this case we used tapply to calculate the variance of yield for each subgroup (i.e. level of


nitrogen). There is some variation between groups but in my opinion it is not substantial. Now
we can shift our focus on normality. There are tests to check for normality, but again the
ANOVA is flexible (particularly where our dataset is big) and can still produce correct results
even when its assumptions are violated up to a certain degree. For this reason, it is good practice
to check normality with descriptive analysis alone, without any statistical test. For example,
we could start by plotting the histogram of yield:

hist(dat$yield, main="Histogram of Yield", xlab="Yield (quintals/ha)")

By looking at this image it seems that our data are more or less normally distributed. Another
plot we could create is the QQplot
(http://www.itl.nist.gov/div898/handbook/eda/section3/qqplot.htm):

qqnorm(dat$yield, main="QQplot of Yield")

qqline(dat$yield)


For normally distributed data the points should all be on the line. This is clearly not the case
but again the deviation is not substantial. The final element we can calculate is the skewness
of the distribution, with the function skewness in the package moments:

> skewness(dat$yield)

[1] 0.3875977

According to Webster and Oliver (2007), if the skewness is below 0.5, we can consider the
deviation from normality not big enough to transform the data. Moreover, according to Witte
and Witte (2009) if we have more than 10 samples per group we should not worry too much
about violating the assumption of normality or equality of variances.
To see how many samples we have for each level of nitrogen we can use once again the
function tapply:

> tapply(dat$yield, INDEX=dat$nf, FUN=length)

N0 N1 N2 N3 N4 N5


573 577 571 575 572 575

As you can see we have definitely more than 10 samples per group, but our design is not
balanced (i.e. some groups have more samples). This implies that the normal ANOVA cannot
be used, because the standard way of calculating the sum of squares is not appropriate
for unbalanced designs (see http://goanna.cs.rmit.edu.au/~fscholer/anova.php for more info).

In summary, even though from the descriptive analysis it appears that our data are close to
being normal and have equal variance, our design is unbalanced, therefore the normal way of
doing ANOVA cannot be used. In other words, we cannot use the function aov for this dataset.
However, since this is a tutorial we are still going to start by applying the normal ANOVA
with aov.

ANOVA with aov


The first thing we need to do is think about the hypothesis we would like to test. For example,
we could be interested in looking at nitrogen levels and their impact on yield. Let’s start with
some plotting to better understand our data:

means.nf = tapply(dat$yield, INDEX=dat$nf, FUN=mean)

StdErr.nf = tapply(dat$yield, INDEX=dat$nf, FUN= std.error)

BP = barplot(means.nf, ylim=c(0,max(means.nf)+10))

segments(BP, means.nf - (2*StdErr.nf), BP,

means.nf + (2*StdErr.nf), lwd = 1.5)

arrows(BP, means.nf - (2*StdErr.nf), BP,

means.nf + (2*StdErr.nf), lwd = 1.5, angle = 90,

code = 3, length = 0.05)

This code first uses the function tapply to compute mean and standard error of the mean for
yield in each nitrogen group. Then it plots the means as bars and creates error bars using the
standard error (please remember that with a normal distribution ± twice the standard error
provides a 95% confidence interval around the mean value). The result is the following image:


By plotting our data we can start figuring out what the relation is between nitrogen levels
and yield. In particular, there is an increase in yield with higher levels of nitrogen. However,
some of the error bars are overlapping, and this may suggest that their values are not
significantly different. For example, by looking at this plot, N0 and N1 have error bars very
close to overlapping, but probably not overlapping, so it may be that N1 provides a significant
difference from N0. The rest are all probably significantly different from N0. Among
themselves, however, their intervals overlap most of the time, so their differences would
probably not be significant.

We could formulate the hypothesis that nitrogen significantly affects yield and that the mean
of each subgroup are significantly different. Now we just need to test this hypothesis with a
one-way ANOVA:

mod1 = aov(yield ~ nf, data=dat)

The code above uses the function aov to perform an ANOVA; we can specify to perform a one-
way ANOVA simply by including only one factorial term after the tilde (~) sign. We can plot
the ANOVA table with the function summary:

> summary(mod1)


Df Sum Sq Mean Sq F value Pr(>F)

nf 5 23987 4797 12.4 6.08e-12 ***

Residuals 3437 1330110 387

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

It is clear from this output that nitrogen significantly affects yield, so we tested our first
hypothesis. To test the significance for individual levels of nitrogen we can use the Tukey’s
test:

> TukeyHSD(mod1, conf.level=0.95)

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = yield ~ nf, data = dat)

$nf

diff lwr upr p adj

N1-N0 3.6434635 0.3353282 6.951599 0.0210713

N2-N0 4.6774357 1.3606516 7.994220 0.0008383

N3-N0 5.3629638 2.0519632 8.673964 0.0000588

N4-N0 7.5901274 4.2747959 10.905459 0.0000000

N5-N0 7.8588595 4.5478589 11.169860 0.0000000

N2-N1 1.0339723 -2.2770686 4.345013 0.9489077

N3-N1 1.7195004 -1.5857469 5.024748 0.6750283

N4-N1 3.9466640 0.6370782 7.256250 0.0089057

N5-N1 4.2153960 0.9101487 7.520643 0.0038074

N3-N2 0.6855281 -2.6283756 3.999432 0.9917341

N4-N2 2.9126917 -0.4055391 6.230923 0.1234409

N5-N2 3.1814238 -0.1324799 6.495327 0.0683500

N4-N3 2.2271636 -1.0852863 5.539614 0.3916824


N5-N3 2.4958957 -0.8122196 5.804011 0.2613027

N5-N4 0.2687320 -3.0437179 3.581182 0.9999099

There are significant differences between the control and the rest of the levels of nitrogen, plus
other differences between N4 and N5 compared to N1, but nothing else. If you look back at the
bar chart we produced before, and look carefully at the overlaps between error bars, you will
see that for example N1, N2, and N3 have overlapping error bars, thus they are not significantly
different. On the contrary, N1 has no overlaps with either N4 or N5, which is what we
demonstrated in the ANOVA.
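A quick way to visualise these pairwise comparisons (an optional extra step; intervals that cross zero correspond to non-significant differences) is to plot the TukeyHSD object directly:

plot(TukeyHSD(mod1, conf.level=0.95))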

The function model.tables provides a quick way to print the table of effects and the table of
means:

> model.tables(mod1, type="effects")

Tables of effects

nf

N0 N1 N2 N3 N4 N5

-4.855 -1.212 -0.178 0.5075 2.735 3.003

rep 573.000 577.000 571.000 575.0000 572.000 575.000

These values are all referred to the grand mean, which we can simply calculate with
mean(dat$yield) and is equal to 69.83. This means that the mean for N0 would be
69.83 - 4.855 = 64.97. We can verify that with another call to the function model.tables, this time
with the option type="means":

> model.tables(mod1, type="means")

Tables of means

Grand mean

69.82831

nf

N0 N1 N2 N3 N4 N5

64.97 68.62 69.65 70.34 72.56 72.83

rep 573.00 577.00 571.00 575.00 572.00 575.00


Linear Model with 1 factor


The same results can be obtained by fitting a linear model with the function lm; only their
interpretation would be different. The assumptions for fitting a linear model are again
independence (which is always violated with environmental data) and normality.

Let’s look at the code:

mod2 = lm(yield ~ nf, data=dat)

This line fits the same model but with the standard linear equation. This becomes clearer by
looking at the summary table:

> summary(mod2)

Call:

lm(formula = yield ~ nf, data = dat)

Residuals:

Min 1Q Median 3Q Max

-52.313 -15.344 -3.126 13.563 45.337

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 64.9729 0.8218 79.060 < 2e-16 ***

nfN1 3.6435 1.1602 3.140 0.0017 **

nfN2 4.6774 1.1632 4.021 5.92e-05 ***

nfN3 5.3630 1.1612 4.618 4.01e-06 ***

nfN4 7.5901 1.1627 6.528 7.65e-11 ***

nfN5 7.8589 1.1612 6.768 1.53e-11 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Residual standard error: 19.67 on 3437 degrees of freedom

Multiple R-squared: 0.01771, Adjusted R-squared: 0.01629

F-statistic: 12.4 on 5 and 3437 DF, p-value: 6.075e-12

There is quite a lot of information in this table that we should clarify. First of all, it already
provides some descriptive measures for the residuals, from which we can see that their
distribution is relatively normal (the first and last quartiles have similar but opposite values,
and the same is true for the minimum and maximum). Then we have the table of coefficients,
with the intercept and all the slopes. As you can see the level N0 is not shown in the list; this is
called the reference level, which means that all the others are referenced back to it. In other
words, the value of the intercept is the mean of nitrogen level 0 (in fact it is the same value we
calculated above, 64.97). To calculate the means for the other groups we need to add the slopes
to the value of the reference level. For example, N1 is 64.97 + 3.64 = 68.61 (the same as
calculated from the ANOVA).
The p-value and the significance are again in relation to the reference level, meaning for
example that N1 is significantly different from N0 (reference level) and the p-value is 0.0017.
This is similar to the Tukey’s test we performed above, but it is only valid in relation to N0.
We need to change the reference level, and fit another model, to get the same information for
other nitrogen levels:

dat$nf = relevel(dat$nf, ref="N1")

mod3 = lm(yield ~ nf, data=dat)

summary(mod3)

Now the reference level is N1, so all the results will tell us the effects of nitrogen in relation to
N1.

> summary(mod3)

Call:

lm(formula = yield ~ nf, data = dat)

Residuals:

Min 1Q Median 3Q Max

-52.313 -15.344 -3.126 13.563 45.337

Simple Linear Regression


Linear Regression :
It is a commonly used type of predictive analysis. It is a statistical approach for modelling the
relationship between a dependent variable and a given set of independent variables.
There are two types of linear regression.
• Simple Linear Regression
• Multiple Linear Regression
Let’s discuss Simple Linear regression using R.
Simple Linear Regression:
It is a statistical method that allows us to summarize and study relationships between two
continuous (quantitative) variables. One variable, denoted x, is regarded as the independent
variable and the other, denoted y, is regarded as the dependent variable. It is assumed that the
two variables are linearly related. Hence, we try to find a linear function that predicts the
response value (y) as accurately as possible as a function of the feature or independent
variable (x).
To understand the concept, let's consider a salary dataset that gives the value of
the dependent variable (salary) for every value of the independent variable (years experienced).

Salary dataset-
Years experienced Salary

1.1 39343.00
1.3 46205.00
1.5 37731.00
2.0 43525.00
2.2 39891.00
2.9 56642.00
3.0 60150.00
3.2 54445.00
3.2 64445.00
3.7 57189.00
For general purpose, we define:
x as a feature vector, i.e x = [x_1, x_2, …., x_n],
y as a response vector, i.e y = [y_1, y_2, …., y_n]
for n observations (in above example, n=10).


Scatter plot of given dataset:

Now, we have to find a line which fits the above scatter plot and through which we can predict
any value of y (the response) for any value of x.
The line which fits best is called the regression line.
The equation of the regression line is given by:
y = a + bx
where y is the predicted response value, a is the y-intercept, x is the feature value and b is the slope.
To create the model, we have to estimate the values of the regression coefficients a and b. Once
these coefficients are estimated, the response can be predicted. Here we are going to use the
least squares technique.
The principle of least squares is one of the popular methods for finding a curve fitting given
data. Say (x1, y1), (x2, y2), …, (xn, yn) are n observations from an experiment. We are
interested in finding a curve

y = f(x)     ……(1)

closely fitting the given data of size n. Now at x = x1, while the observed value of y is y1, the
expected value of y from curve (1) is f(x1). Then the residual can be defined by

e1 = y1 - f(x1)

Similarly, the residuals for x2, x3, …, xn are given by

e2 = y2 - f(x2), …, en = yn - f(xn)

While evaluating the residuals we will find that some residuals are positive and some are
negative. We are looking for the curve fitting the given data such that the residual at any xi is
minimum. Since some of the residuals are positive and others are negative, and as we would
like to give equal importance to all the residuals, it is desirable to consider the sum of the
squares of these residuals. Thus we consider:

E = e1^2 + e2^2 + … + en^2 = Σ (yi - f(xi))^2

and find the best representative curve.


Least Square Fit of a Straight Line
Suppose we are given a dataset (x1, y1), (x2, y2), …, (xn, yn) of n observations from an
experiment, and we are interested in fitting a straight line

y = a + bx

to the given data.
Now consider the residuals:

ei = yi - (a + b*xi),  i = 1, …, n

Now consider the sum of the squares of the ei:

E = Σ (yi - a - b*xi)^2

Note: E is a function of the parameters a and b, and we need to find a and b such that E is
minimum. The necessary condition for E to be minimum is as follows:

∂E/∂a = 0  and  ∂E/∂b = 0

This condition yields:

Σ yi = n*a + b*Σ xi
Σ xi*yi = a*Σ xi + b*Σ xi^2

The above two equations are called the normal equations, which are solved to get the values
of a and b: b = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)^2 and a = ȳ - b*x̄. With these values, the
minimised expression for E can be rewritten as:

E = Σ (yi - ȳ)^2 - b * Σ (xi - x̄)(yi - ȳ)
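To make the derivation concrete, here is a small illustrative sketch (using the ten salary observations listed above) that computes a and b directly from the closed-form least-squares solution and compares them with lm():

# Closed-form least-squares coefficients
x <- c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7)   # years experienced
y <- c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189)   # salary

b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope
a <- mean(y) - b * mean(x)                                       # intercept
print(c(intercept = a, slope = b))

# The same coefficients obtained with lm()
print(coef(lm(y ~ x)))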

The basic syntax for a regression analysis in R is


lm(Y ~ model)
where Y is the object containing the dependent variable to be predicted and model is the
formula for the chosen mathematical model.
The command lm( ) provides the model’s coefficients but no further statistical information.
Following R code is used to implement SIMPLE LINEAR REGRESSION:

# Simple Linear Regression

# Importing the dataset

dataset = read.csv('salary.csv')

# Splitting the dataset into the

# Training set and Test set

install.packages('caTools')

library(caTools)

split = sample.split(dataset$Salary, SplitRatio = 0.7)

trainingset = subset(dataset, split == TRUE)

testset = subset(dataset, split == FALSE)

# Fitting Simple Linear Regression to the Training set

lm.r= lm(formula = Salary ~ YearsExperience,

data = trainingset)

coef(lm.r)


# Predicting the Test set results

ypred = predict(lm.r, newdata = testset)

install.packages("ggplot2")

library(ggplot2)

# Visualising the Training set results

ggplot() + geom_point(aes(x = trainingset$YearsExperience,

y = trainingset$Salary), colour = 'red') +

geom_line(aes(x = trainingset$YearsExperience,

y = predict(lm.r, newdata = trainingset)), colour = 'blue') +

ggtitle('Salary vs Experience (Training set)') +

xlab('Years of experience') +

ylab('Salary')

# Visualising the Test set results

ggplot() +

geom_point(aes(x = testset$YearsExperience, y = testset$Salary),

colour = 'red') +

geom_line(aes(x = trainingset$YearsExperience,

y = predict(lm.r, newdata = trainingset)),


colour = 'blue') +

ggtitle('Salary vs Experience (Test set)') +

xlab('Years of experience') +

ylab('Salary')

Output of coef(lm.r):


Visualising the Training set results:


Visualising the Testing set results:


Modelling strategies
Frank Harrell’s Regression Modelling Strategies is a must-read for anyone who ever fits a
regression model, although be prepared: depending on your background, you might get 30
pages in and suddenly become convinced you’ve been doing nearly everything wrong before,
which can be disturbing.
I wanted to evaluate three simple modelling strategies in dealing with data with many variables.
Using data with 54 variables on 1,785 area units from New Zealand’s 2013 census, I’m looking
to predict median income on the basis of the other 53 variables. The features are all continuous
and are variables like “mean number of bedrooms”, “proportion of individuals with no
religion” and “proportion of individuals who are smokers”. Restricting myself to traditional
linear regression with a normally distributed response, my three alternative strategies were:

• use all 53 variables;


• eliminate the variables that can be predicted easily from the other variables (defined by
having a variance inflation factor greater than ten), one by one until the main
collinearity problems are gone; or
• eliminate variables one at a time from the full model on the basis of comparing Akaike’s
Information Criterion of models with and without each variable.
None of these is exactly what I would use for real, but they serve the purpose of setting up a
competition of strategies that I can test with a variety of model validation techniques.

Validating models
The main purpose of the exercise was actually to ensure I had my head around different ways
of estimating the validity of a model, loosely definable as how well it would perform at
predicting new data. As there is no possibility of new areas in New Zealand from 2013 that
need to have their income predicted, the “prediction” is a thought-exercise which we need to
find a plausible way of simulating. Confidence in hypothetical predictions gives us confidence
in the insights the model gives into relationships between variables.
There are many methods of validating models, although I think k-fold cross-validation has
market dominance (not with Harrell though, who prefers varieties of the bootstrap). The three
validation methods I’ve used for this post are:

1. ‘simple’ bootstrap. This involves creating resamples with replacement from the original
data, of the same size; applying the modelling strategy to the resample; using the model
to predict the values of the full set of original data and calculating a goodness of fit
statistic (eg either R-squared or root mean squared error) comparing the predicted value
to the actual value. Note - Following Efron, Harrell calls this the “simple bootstrap”,
but other authors and the useful caret package use “simple bootstrap” to mean the


resample model is used to predict the out-of-bag values at each resample point, rather
than the full original sample.
2. ‘enhanced’ bootstrap. This is a little more involved and is basically a method of
estimating the ‘optimism’ of the goodness of fit statistic. There’s a nice step by step
explanation by thestatsgeek which I won’t try to improve on.
3. repeated 10-fold cross-validation. 10-fold cross-validation involves dividing your data
into ten parts, then taking turns to fit the model on 90% of the data and using that model
to predict the remaining 10%. The average of the 10 goodness of fit statistics becomes
your estimate of the actual goodness of fit. One of the problems with k-fold cross-
validation is that it has a high variance, i.e. doing it different times you get different results
based on the luck of your k-way split; repeated k-fold cross-validation addresses this
by performing the whole process a number of times and taking the average (a minimal
caret sketch of this follows this list).
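Here is that minimal caret sketch of repeated 10-fold cross-validation, shown only for the "use all variables" strategy and using the au data frame constructed in the code at the end of this section; the settings are illustrative, not the exact ones used for the results below:

library(caret)

# 10-fold cross-validation, repeated 5 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# Cross-validate the full linear model
cv_full <- train(MedianIncome ~ ., data = au, method = "lm", trControl = ctrl)
cv_full$results   # RMSE and R-squared averaged over all folds and repeats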
As the sample sizes get bigger relative to the number of variables in the model the methods
should converge. The bootstrap methods can give over-optimistic estimates of model validity
compared to cross-validation; there are various other methods available to address this issue
although none seem to me to provide an all-purpose solution.
It’s critical that the re-sampling in the process envelopes the entire model-building strategy,
not just the final fit. In particular, if the strategy involves variable selection (as two of my
candidate strategies do), you have to automate that selection process and run it on each different
resample. That’s because one of the highest risk parts of the modelling process is that variable
selection. Running cross-validation or the bootstrap on a final model after you’ve eliminated a
bunch of variables is missing the point, and will give materially misleading statistics (biased
towards things being more “significant” than there really is evidence for). Of course, that
doesn’t stop this being common misguided practice.

Results
One nice feature of statistics since the revolution of the 1980s is that the bootstrap helps you
conceptualise what might have happened but didn’t. Here’s the root mean squared error from
the 100 different bootstrap resamples when the three different modelling strategies (including
variable selection) were applied:

Notice anything? Not only does it seem to be generally a bad idea to drop variables just because
they are collinear with others, but occasionally it turns out to be a really bad idea - like in
resamples #4, #6 and around thirty others. Those thirty or so spikes are in resamples where
random chance led to one of the more important variables being dumped before it had a chance
to contribute to the model.
The thing that surprised me here was that the generally maligned step-wise selection strategy
performed nearly as well as the full model, judged by the simple bootstrap. That result comes
through for the other two validation methods as well:

In all three validation methods there’s really nothing substantive to choose between the “full
model” and “stepwise” strategies, based purely on results.


Reflections
The full model is much easier to fit, interpret, estimate confidence intervals and perform tests
on than stepwise. All the standard statistics for a final model chosen by stepwise methods are
misleading and careful recalculations are needed based on elaborate bootstrapping. So the full
model wins hands-down as a general strategy in this case.
With this data, we have a bit of freedom from the generous sample size. If approaching this for
real I wouldn’t eliminate any variables unless there were theoretical / subject matter reasons to
do so. I have made the mistake of eliminating the co-linear variables before from this dataset
but will try not to do it again. The rule of thumb is to have 20 observations for each parameter
(this is one of the most asked and most dodged questions in statistics education; see Table 4.1
of Regression Modelling Strategies for this particular answer), which suggests we can have up
to 80 parameters with a bit to spare. This gives us 30 parameters to use for non-linear
relationships and/or interactions, which is the direction I might go in a subsequent post. Bearing
that in mind, I’m not bothering to report here the actual substantive results (eg which factors
are related to income and how); that can wait for a better model down the track.

Data and computing


The census data are ultimately from Statistics New Zealand of course, but are tidied up and
available in my nzelect R package, which is still very much under development and may
change without notice. It’s only available from GitHub at the moment (installation code below).
I do the bootstrapping with the aid of the boot package, which is generally the recommended
approach in R. For repeated cross-validation of the two straightforward strategies (full model
and stepwise variable selection) I use the caret package, in combination with stepAIC which is
in the Venables and Ripley MASS package. For the more complex strategy that involved
dropping variables with high variance inflation factors I found it easiest to do the repeated
cross-validation old-school with my own for loops.
This exercise was a bit complex and I won’t be astonished if someone points out an error. If
you see a problem, or have any suggestions or questions, please leave a comment.
Here’s the code:

#===================setup=======================
library(ggplot2)
library(scales)
library(MASS)
library(boot)
library(caret)
library(dplyr)
library(tidyr)
library(directlabels)


set.seed(123)

# install nzelect package that has census data


devtools::install_github("ellisp/nzelect/pkg")
library(nzelect)

# drop the columns with areas' code and name


au <- AreaUnits2013 %>%
select(-AU2014, -Area_Code_and_Description)

# give meaningful rownames, helpful for some diagnostic plots later


row.names(au) <- AreaUnits2013$Area_Code_and_Description

# remove some repetition from the variable names


names(au) <- gsub("2013", "", names(au))

# restrict to areas with no missing data. If this was any more complicated (eg
# imputation),it would need to be part of the validation resampling too; but
# just dropping them all at the beginning doesn't need to be resampled; the only
# implication would be sample size which would be small impact and complicating.
au <- au[complete.cases(au), ]

#==================functions for two of the modelling strategies=====================
# The stepwise variable selection:
model_process_step <- function(the_data){
model_full <- lm(MedianIncome ~ ., data = the_data)
model_final <- stepAIC(model_full, direction = "both", trace = 0)
return(model_final)
}


# The dropping of highly collinear variables, based on Variance Inflation Factor:


model_process_vif <- function(the_data){
# remove the collinear variables based on vif
x <- 20

while(max(x) > 10){


mod1 <- lm(MedianIncome ~ . , data = the_data)
x <- sort(car::vif(mod1) , decreasing = TRUE)
the_data <- the_data[ , names(the_data) != names(x)[1]]
# message(paste("dropping", names(x)[1]))
}

model_vif <- lm(MedianIncome ~ ., data = the_data)


return(model_vif)
}

# The third strategy, full model, is only a one-liner with standard functions
# so I don't need to define a function separately for it.

#==================Different validation methods=================

#------------------simple bootstrap comparison-------------------------


# create a function suitable for boot that will return the goodness of fit
# statistics testing models against the full original sample.
compare <- function(orig_data, i){
# create the resampled data
train_data <- orig_data[i, ]
test_data <- orig_data # ie the full original sample

# fit the three modelling processes


model_step <- model_process_step(train_data)
model_vif <- model_process_vif(train_data)


model_full <- lm(MedianIncome ~ ., data = train_data)

# predict the values on the original, unresampled data


predict_step <- predict(model_step, newdata = test_data)
predict_vif <- predict(model_vif, newdata = test_data)
predict_full <- predict(model_full, newdata = test_data)

# return a vector of 6 summary results


results <- c(
step_R2 = R2(predict_step, test_data$MedianIncome),
vif_R2 = R2(predict_vif, test_data$MedianIncome),
full_R2 = R2(predict_full, test_data$MedianIncome),
step_RMSE = RMSE(predict_step, test_data$MedianIncome),
vif_RMSE = RMSE(predict_vif, test_data$MedianIncome),
full_RMSE = RMSE(predict_full, test_data$MedianIncome)
)
return(results)
}

# perform bootstrap
Repeats <- 100
res <- boot(au, statistic = compare, R = Repeats)

# restructure results for a graphic showing root mean square error, and for
# later combination with the other results. I chose just to focus on RMSE;
# the messages are similar if R squared is used.
RMSE_res <- as.data.frame(res$t[ , 4:6])
names(RMSE_res) <- c("AIC stepwise selection", "Remove collinear variables", "Use all variables")

RMSE_res %>%
mutate(trial = 1:Repeats) %>%


gather(variable, value, -trial) %>%


# re-order levels:
mutate(variable = factor(variable, levels = c(
"Remove collinear variables", "AIC stepwise selection", "Use all variables"
))) %>%
ggplot(aes(x = trial, y = value, colour = variable)) +
geom_line() +
geom_point() +
ggtitle("'Simple' bootstrap of model fit of three different regression strategies",
subtitle = "Predicting areas' median income based on census variables") +
labs(x = "Resample id (there no meaning in the order of resamples)\n",
y = "Root Mean Square Error (higher is worse)\n",
colour = "Strategy",
caption = "Data from New Zealand Census 2013")

# store the three "simple bootstrap" RMSE results for later


simple <- apply(RMSE_res, 2, mean)

#-----------------------enhanced (optimism) bootstrap comparison-------------------


# for convenience, estimate the models on the original sample of data
orig_step <- model_process_step(au)
orig_vif <- model_process_vif(au)
orig_full <- lm(MedianIncome ~ ., data = au)

# create a function suitable for boot that will return the optimism estimates for
# statistics testing models against the full original sample.
compare_opt <- function(orig_data, i){
# create the resampled data
train_data <- orig_data[i, ]

# fit the three modelling processes


model_step <- model_process_step(train_data)


model_vif <- model_process_vif(train_data)


model_full <- lm(MedianIncome ~ ., data = train_data)

# predict the values on the original, unresampled data


predict_step <- predict(model_step, newdata = orig_data)
predict_vif <- predict(model_vif, newdata = orig_data)
predict_full <- predict(model_full, newdata = orig_data)

# return a vector of 6 summary optimism results


results <- c(
  step_R2 = R2(fitted(model_step), train_data$MedianIncome) - R2(predict_step, orig_data$MedianIncome),
  vif_R2 = R2(fitted(model_vif), train_data$MedianIncome) - R2(predict_vif, orig_data$MedianIncome),
  full_R2 = R2(fitted(model_full), train_data$MedianIncome) - R2(predict_full, orig_data$MedianIncome),
  step_RMSE = RMSE(fitted(model_step), train_data$MedianIncome) - RMSE(predict_step, orig_data$MedianIncome),
  vif_RMSE = RMSE(fitted(model_vif), train_data$MedianIncome) - RMSE(predict_vif, orig_data$MedianIncome),
  full_RMSE = RMSE(fitted(model_full), train_data$MedianIncome) - RMSE(predict_full, orig_data$MedianIncome)
)
return(results)
}

# perform bootstrap
res_opt <- boot(au, statistic = compare_opt, R = Repeats)

# calculate and store the results for later


original <- c(
  RMSE(fitted(orig_step), au$MedianIncome),
  RMSE(fitted(orig_vif), au$MedianIncome),
  RMSE(fitted(orig_full), au$MedianIncome)
)

optimism <- apply(res_opt$t[ , 4:6], 2, mean)


enhanced <- original - optimism

#------------------repeated cross-validation------------------
# The number of cross validation repeats is the number of bootstrap repeats / 10:
cv_repeat_num <- Repeats / 10

# use caret::train for the two standard models:


the_control <- trainControl(method = "repeatedcv", number = 10, repeats = cv_repeat_num)
cv_full <- train(MedianIncome ~ ., data = au, method = "lm", trControl = the_control)
cv_step <- train(MedianIncome ~ ., data = au, method = "lmStepAIC", trControl = the_control, trace = 0)

# do it by hand for the VIF model:


results <- numeric(10 * cv_repeat_num)
for(j in 0:(cv_repeat_num - 1)){
cv_group <- sample(1:10, nrow(au), replace = TRUE)
for(i in 1:10){
train_data <- au[cv_group != i, ]
test_data <- au[cv_group == i, ]
results[j * 10 + i] <- RMSE(
predict(model_process_vif(train_data), newdata = test_data),
test_data$MedianIncome)
}
}
cv_vif <- mean(results)

cv_vif_results <- data.frame(


results = results,


trial = rep(1:10, cv_repeat_num),


cv_repeat = rep(1:cv_repeat_num, each = 10)
)

#===============reporting results===============
# combine the three cross-validation results together and combined with
# the bootstrap results from earlier
summary_results <- data.frame(rbind(
simple,
enhanced,
c(mean(cv_step$resample$RMSE),
cv_vif,
mean(cv_full$resample$RMSE)
)
), check.names = FALSE) %>%
mutate(method = c("Simple bootstrap", "Enhanced bootstrap",
paste(cv_repeat_num, "repeats 10-fold\ncross-validation"))) %>%
gather(variable, value, -method)
# Draw a plot summarising the results
direct.label(
summary_results %>%
mutate(variable = factor(variable, levels = c(
"Use all variables", "AIC stepwise selection", "Remove collinear variables"
))) %>%
ggplot(aes(y = method, x = value, colour = variable)) +
geom_point(size = 3) +
labs(x = "Estimated Root Mean Square Error (higher is worse)\n",
colour = "Modelling\nstrategy",
y = "Method of estimating model fit\n",
caption = "Data from New Zealand Census 2013") +
ggtitle("Three different validation methods of three different regression strategies",

172
R Programming

subtitle = "Predicting areas' median income based on census variables")

NON-LINEAR MODELING

Module 6

Nonlinear Models: Nonlinear Least Squares, Splines

Consider a nonlinear least squares model in R, for example of the following form:

y ~ theta / ( 1 + exp( -( alpha + beta * x) ) )


(my real problem has several variables and the outer function is not logistic but a bit more
involved; this one is simpler but I think if I can do this my case should follow almost
immediately)

I'd like to replace the term "alpha + beta * x" with (say) a natural cubic spline.

Here's some code to create some example data with a nonlinear function inside the logistic:

set.seed(438572L)
x <- seq(1,10,by=.25)
y <- 8.6/(1+exp( -(-3+x/4.4+sqrt(x*1.1)*(1.-sin(1.+x/2.9))) )) + rnorm(x, s=0.2 )
Without the logistic around it, if I were working in lm, I could replace a linear term with a
spline term easily; so a linear model like this:

lm( y ~ x )
then becomes

library("splines")
lm( y ~ ns( x, df = 5 ) )
Generating fitted values is simple, and getting predicted values with the aid of (for example) the
rms package seems simple enough.

Indeed, fitting the original data with that lm-based spline fit isn't too bad, but there's a reason I
need it inside the logistic function (or rather, the equivalent in my problem).

The problem with nls is that I need to provide names for all the parameters. I'm quite happy to
call them, say, b1, ..., b5 for one spline fit (and say c1, ..., c6 for another variable; I'll need to
be able to make several of them).

Is there a reasonably neat way to generate the corresponding formula for nls so that I can
replace the linear term inside the nonlinear function with a spline?


The only approaches I can think of are a bit awkward and clunky, and they don't generalize
nicely without writing a whole bunch of code.
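Setting the spline question aside for a moment, the plain parametric version of the model above can be fitted directly with nls(). The snippet below is only a minimal sketch: it uses the self-starting logistic SSlogis() from base R so that starting values do not have to be guessed; it is equivalent to the theta/alpha/beta form up to a re-parameterisation.

# Fit the simple logistic form of the model to the simulated data above.
# SSlogis(x, Asym, xmid, scal) = Asym / (1 + exp((xmid - x) / scal)),
# so theta = Asym, alpha = -xmid / scal and beta = 1 / scal.
fit <- nls(y ~ SSlogis(x, Asym, xmid, scal), data = data.frame(x, y))
summary(fit)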

Generalised additive models (GAMs): an introduction

Many data in the environmental sciences do not fit simple linear models and are best described
by “wiggly models”, also known as Generalised Additive Models (GAMs).

Let’s start with a famous tweet by one Gavin Simpson, which amounts to:
1. GAMs are just GLMs
2. GAMs fit wiggly terms
3. use + s(x) not x in your syntax
4. use method = "REML"
5. always look at gam.check()


This is basically all there is to it – an extension of generalised linear models (GLMs) with a
smoothing function. Of course, there may be many sophisticated things going on when you fit
a model with smooth terms, but you only need to understand the rationale and some basic
theory. There is also plenty of apparently magic machinery under the hood of, say, lmer or
glmer, yet we use those functions all the time without reservation!
GAMs in a nutshell
Let’s start with an equation for a Gaussian linear model:

y = β₀ + x₁β₁ + ε,   ε ∼ N(0, σ²)

What changes in a GAM is the presence of a smoothing term:

y = β₀ + f(x₁) + ε,   ε ∼ N(0, σ²)

This simply means that the contribution to the linear predictor is now some function f. This is
not that dissimilar conceptually to using a quadratic (x₁²) or cubic (x₁³) term as your predictor.
The function f can be something more funky or kinky – here, we're going to focus on splines.
In the old days, it might have been something like piecewise linear functions.
You can have combinations of linear and smooth terms in your model, for example

y = β₀ + x₁β₁ + f(x₂) + ε,   ε ∼ N(0, σ²)

or we can fit generalised distributions and random effects, for example

ln(y) = β₀ + f(x₁) + ε,   ε ∼ Poisson(λ)

ln(y) = β₀ + f(x₁) + z₁γ + ε,   ε ∼ Poisson(λ),   γ ∼ N(0, Σ)
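As a rough illustration of what these extensions look like in mgcv syntax, here is a small self-contained sketch; the data frame and variable names (dat, x1, x2, g) are invented purely for the example and are not part of the original post.

library(mgcv)
set.seed(1)

# Toy data: one linear predictor, one wiggly predictor and a grouping factor
dat <- data.frame(x1 = runif(200),
                  x2 = runif(200),
                  g  = factor(sample(letters[1:5], 200, replace = TRUE)))
dat$y <- rpois(200, exp(0.3 + dat$x1 + sin(2 * pi * dat$x2)))

# A linear term plus a smooth term, with a Poisson response:
m1 <- gam(y ~ x1 + s(x2), family = poisson, data = dat, method = "REML")

# Adding a simple random intercept for the grouping factor via a random-effect smooth:
m2 <- gam(y ~ x1 + s(x2) + s(g, bs = "re"), family = poisson, data = dat, method = "REML")
summary(m2)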
A simple example
Let's try a simple example. First, let's create a data frame and fill it with some simulated data
with an obvious non-linear trend, then compare how well some models fit that data.

x <- seq(0, pi * 2, 0.1)


sin_x <- sin(x)
y <- sin_x + rnorm(n = length(x), mean = 0, sd = sd(sin_x / 2))
Sample_data <- data.frame(y,x)
library(ggplot2)
ggplot(Sample_data, aes(x, y)) + geom_point()


Try fitting a normal linear model:

lm_y <- lm(y ~ x, data = Sample_data)


and plotting the fitted line with data using geom_smooth in ggplot
ggplot(Sample_data, aes(x, y)) + geom_point() + geom_smooth(method = lm)


Looking at the plot or summary(lm_y), you might think the model fits nicely, but look at the
residual plot – eek!
plot(lm_y, which = 1)


Clearly, the residuals are not evenly spread across values of x, and we need to consider a
better model.
Running the analysis
Before we consider a GAM, we need to load the package mgcv – the choice for running GAMs
in R.

library(mgcv)
To run a GAM, we use:

gam_y <- gam(y ~ s(x), method = "REML")


To extract the fitted values, we can use predict just like normal:
x_new <- seq(0, max(x), length.out = 100)
y_pred <- predict(gam_y, data.frame(x = x_new))
But for simple models, we can also utilise the method = argument in geom_smooth, specifying
the model formula.
ggplot(Sample_data, aes(x, y)) + geom_point() + geom_smooth(method = "gam", formula = y ~ s(x))
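A small extension that is not in the original post: predict() on a gam object can also return standard errors, which gives an approximate 95% confidence band around the smooth.

pred <- predict(gam_y, data.frame(x = x_new), se.fit = TRUE)
ci <- data.frame(x = x_new,
                 fit = pred$fit,
                 lower = pred$fit - 1.96 * pred$se.fit,
                 upper = pred$fit + 1.96 * pred$se.fit)
ggplot(Sample_data, aes(x, y)) +
  geom_point() +
  geom_ribbon(data = ci, aes(x = x, ymin = lower, ymax = upper),
              inherit.aes = FALSE, alpha = 0.3) +
  geom_line(data = ci, aes(x = x, y = fit), colour = "red")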


You can see the model is a better fit to the data, but always check the diagnostics.

gam.check() gives a quick and easy view of the residual plots.


par(mfrow = c(2,2))
gam.check(gam_y)


##
## Method: REML Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [-2.37327e-09,1.17425e-09]
## (score 44.14634 & scale 0.174973).
## Hessian positive definite, eigenvalue range [1.75327,30.69703].
## Model rank = 10 / 10
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(x) 9.00 5.76 1.19 0.9
Using summary with the model object will give you the significance of the smooth term (along
with any parametric terms, if you’ve included them), along with the variance explained. In this
example, a pretty decent fit. The ‘edf’ is the estimated degrees of freedom – essentially, the
larger the number, the more wiggly the fitted model. Values of around 1 tend to be close to a
linear term. You can read about penalisation and shrinkage for more on what the edf reflects.
summary(gam_y)
##


## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01608 0.05270 -0.305 0.761
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(x) 5.76 6.915 23.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.722 Deviance explained = 74.8%
## -REML = 44.146 Scale est. = 0.17497 n = 63
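As an extra check (not part of the original post), the linear model and the GAM can be compared directly on AIC; for this simulated data the GAM should be clearly preferred.

AIC(lm_y, gam_y)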

Smooth terms
As mentioned above, we’ll focus on splines, as they are the smooth functions that are most
commonly implemented (and are pretty quick and stable). So what was actually going on when
we specified s(x)?
Well, this is where we say we want to fit yy as a linear function of some set of functions of xx.
The default in mgcv is a thin plate regression spline – the two common ones you’ll probably
see are these, and cubic regression splines. Cubic regression splines have the
traditional knots that we think of when we talk about splines – they’re evenly spread across the
covariate range in this case. We’ll just stick to thin plate regression splines, since I figure Simon
made them the default for a reason.
Basis functions

OK, so here's where we see what the wiggle bit is really made of. We'll start with the fitted
model, then we'll look at it from first principles (not really). Remembering that the smooth
term is the sum of some number of functions (I'm not sure how well this equation really
represents the smooth term, but you get the point),

f(x₁) = Σⱼ bⱼ(x₁) βⱼ,   j = 1, …, k

First we extract the set of basis functions (that is, the bⱼ(x₁) part of the smooth term). Then
we can plot, say, the first and second basis functions.

model_matrix <- predict(gam_y, type = "lpmatrix")


plot(y ~ x)
abline(h = 0)
lines(x, model_matrix[, "s(x).1"], type = "l", lty = 2)
lines(x, model_matrix[, "s(x).2"], type = "l", lty = 2)


Let’s plot all of the basis functions now, and then add that to the predictions from the GAM
(y_pred) on top again.
plot(y ~ x)
abline(h = 0)

x_new <- seq(0, max(x), length.out = 100)


y_pred <- predict(gam_y, data.frame(x = x_new))

matplot(x, model_matrix[,-1], type = "l", lty = 2, add = T)


lines(y_pred ~ x_new, col = "red", lwd = 2)


Now, it's difficult at first to see what has happened, but it's easiest to think about it like this –
each of those dotted lines represents a function (bⱼ) for which gam estimates a coefficient
(βⱼ), and when you sum them you get the contribution for the corresponding f(x) (i.e. the
previous equation). It's nice and simple for this example, because we model y only as a
function of the smooth term, so it's fairly relatable. As an aside, you can also just
use plot.gam to plot the smooth terms.
plot(gam_y)


OK, now let’s look a little closer at how the basis functions are constructed. You’ll see that the
construction of the functions is separate to the response data. Just to prove it, we’ll
use smoothCon.
x_sin_smooth <- smoothCon(s(x), data = data.frame(x), absorb.cons = TRUE)
X <- x_sin_smooth[[1]]$X

par(mfrow = c(1,2))
matplot(x, X, type = "l", main = "smoothCon()")
matplot(x, model_matrix[,-1], type = "l", main = "predict(gam_y)")


And now to prove that you can go from the basis functions and the estimated coefficients to
the fitted smooth term. Again note that this is simplified here because the model is just one
smooth term. If you had more terms, we would need to add up all of the terms in the linear
predictor.

betas <- gam_y$coefficients


linear_pred <- model_matrix %*% betas

par(mfrow = c(1,2))
plot(y ~ x, main = "manual from basis/coefs")
lines(linear_pred ~ x, col = "red", lwd = 2)
plot(y ~ x, main = "predict(gam_y)")
lines(y_pred ~ x_new, col = "red", lwd = 2)


Out of interest, take a look at the following plot, remembering that X is the matrix of basis
functions.
par(mfrow = c(1,2))
plot(y ~ x)
plot(y ~ rowSums(X))


Decision Tree

A decision tree is a graph that represents choices and their results in the form of a tree. The nodes
in the graph represent an event or choice, and the edges of the graph represent the decision rules
or conditions. Decision trees are widely used in Machine Learning and Data Mining applications
using R. Examples of their use include predicting whether an email is spam or not, whether a
tumor is cancerous, or whether a loan is a good or bad credit risk, based on the relevant factors in
each case. Generally, a model is created with observed data, also called training data. A set of
validation data is then used to verify and improve the model. R has packages which are used to
create and visualize decision trees. For a new set of predictor variables, we use this model to
arrive at a decision on the category (yes/no, spam/not spam) of the data.
The R package "party" is used to create decision trees.
Install R Package
Use the below command in R console to install the package. You also have to install the
dependent packages if any.
install.packages("party")
The package "party" has the function ctree(), which is used to create and analyse decision trees.
Syntax
The basic syntax for creating a decision tree in R is −
ctree(formula, data)
Following is the description of the parameters used −
• formula is a formula describing the predictor and response variables.
• data is the name of the data set used.
Input Data
We will use the in-built R data set named readingSkills to create a decision tree. It records, for
each person, the variables "age", "shoeSize", "score" and whether or not the person is a native
speaker; we will predict nativeSpeaker from the other three variables.
Here is the sample data.
# Load the party package. It will automatically load other
# dependent packages.
library(party)

# Print some records from data set readingSkills.


print(head(readingSkills))
When we execute the above code, it produces the following result and chart −
nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................
...............................
Example
We will use the ctree() function to create the decision tree and see its graph.
# Load the party package. It will automatically load other
# dependent packages.
library(party)

# Create the input data frame.


input.dat <- readingSkills[c(1:105),]

# Give the chart file a name.


png(file = "decision_tree.png")

# Create the tree.


output.tree <- ctree(
nativeSpeaker ~ age + shoeSize + score,
data = input.dat)

# Plot the tree.


plot(output.tree)

# Save the file.


dev.off()
When we execute the above code, it produces the following result −
null device
1
Loading required package: methods
Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo


Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

as.Date, as.Date.numeric

Loading required package: sandwich

Conclusion
From the decision tree shown above, we can conclude that anyone whose readingSkills score
is less than 38.3 and whose age is more than 6 is not a native speaker.
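As a short follow-up that is not part of the original example, the fitted tree can be used to classify the rows of readingSkills that were held out of the training set:

# Predict nativeSpeaker for the held-out rows and cross-tabulate against the truth.
test.dat <- readingSkills[106:200, ]
pred <- predict(output.tree, newdata = test.dat)
table(predicted = pred, actual = test.dat$nativeSpeaker)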

K-means Clustering in R

What is Cluster analysis?


Cluster analysis is part of unsupervised learning. A cluster is a group of data points that share
similar features. We can say that clustering analysis is more about discovery than prediction. The
machine searches for similarity in the data. For instance, you can use cluster analysis for the
following applications:

• Customer segmentation: Looks for similarity between groups of customers


• Stock Market clustering: Group stock based on performances


• Reduce dimensionality of a dataset by grouping observations with similar values

Clustering analysis is not too difficult to implement and is meaningful as well as actionable for
business.

The most striking difference between supervised and unsupervised learning lies in the results.
Unsupervised learning creates a new variable, the label, while supervised learning predicts an
outcome. The machine helps the practitioner in the quest to label the data based on close
relatedness. It is up to the analyst to make use of the groups and give a name to them.

Let’s make an example to understand the concept of clustering. For simplicity, we work in two
dimensions. You have data on the total spend of customers and their ages. To improve
advertising, the marketing team wants to send more targeted emails to their customers.

In the following graph, you plot the total spend and the age of the customers.

library(ggplot2)
df <- data.frame(age = c(18, 21, 22, 24, 26, 26, 27, 30, 31, 35, 39, 40, 41, 42, 44, 46, 47, 48,
49, 54),
spend = c(10, 11, 22, 15, 12, 13, 14, 33, 39, 37, 44, 27, 29, 20, 28, 21, 30, 31, 23, 24)
)
ggplot(df, aes(x = age, y = spend)) +
geom_point()

A pattern is visible at this point


1. At the bottom-left, you can see young people with lower purchasing power
2. The upper-middle reflects people with a job, who can afford to spend more
3. Finally, older people with a lower budget.

In the figure above, you cluster the observations by hand and define each of the three groups.
This example is somewhat straightforward and highly visual. If new observations are appended
to the data set, you can label them within the circles. You define the circle based on our
judgment. Instead, you can use Machine Learning to group the data objectively.

In this tutorial, you will learn how to use the k-means algorithm.

K-means algorithm
K-means is, without doubt, the most popular clustering method. Researchers released the
algorithm decades ago, and many improvements have been made to k-means since then.

The algorithm tries to find groups by minimizing the distance between the observations and
their cluster centres, settling on locally optimal solutions. The distances are measured based on
the coordinates of the observations. For instance, in a two-dimensional space the coordinates are
simply x and y.


The algorithm works as follows:

• Step 1: Choose k group centres in the feature plane randomly
• Step 2: Assign each observation to its closest cluster centre (centroid); this results in k
groups of observations
• Step 3: Shift each initial centroid to the mean of the coordinates within its group
• Step 4: Reassign observations according to the new centroids. New boundaries are
created, so observations may move from one group to another
• Repeat until no observation changes groups

K-means usually uses the Euclidean distance between an observation x and a centroid c:

d(x, c) = √( Σⱼ (xⱼ − cⱼ)² )

Different measures are available, such as the Manhattan distance or the Minkowski distance. Note
that k-means can return different groups each time you run the algorithm: the initial guesses are
random, and the distances are recomputed until the algorithm reaches homogeneity within groups.
That is, k-means is very sensitive to the initial choice, and unless the number of observations and
groups is small, it is almost impossible to get exactly the same clustering twice.
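One common way to reduce this sensitivity, not mentioned above, is the nstart argument of kmeans(), which runs the algorithm from several random starting configurations and keeps the best solution. A tiny illustration on the age/spend data:

set.seed(42)
km <- kmeans(df[, c("age", "spend")], centers = 3, nstart = 25)
km$centers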


Select the number of clusters

Another difficulty with k-means is the choice of the number of clusters. You can set a high value
of k, i.e. a large number of groups, to improve stability, but you might end up overfitting the
data. Overfitting means that the performance of the model decreases substantially on new data:
the machine has learnt the little details of the data set and struggles to generalize the overall
pattern.

The number of clusters depends on the nature of the data set, the industry, the business and so on.
However, a common rule of thumb for selecting the number of clusters is

k ≈ √(n / 2)

with n equal to the number of observations in the dataset.

Generally speaking, it is worth spending time searching for the best value of k to fit the business
need.
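As a worked illustration of this rule of thumb (added here, not part of the original text): the computers dataset used below has 6,259 observations, so the rule would suggest roughly k ≈ √(6259 / 2) ≈ 56 clusters, far too many to interpret in practice, which is why data-driven methods such as the elbow plot are used instead.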

We will use the Prices of Personal Computers dataset to perform our clustering analysis. This
dataset contains 6259 observations and 10 features. The dataset observes the price from 1993
to 1995 of 486 personal computers in the US. The variables are price, speed, ram, screen and cd,
among others.

You will proceed as follow:

• Import data
• Train the model
• Evaluate the model

Import data
K means is not suitable for factor variables because it is based on the distance and discrete
values do not return meaningful values. You can delete the three categorical variables in our
dataset. Besides, there are no missing values in this dataset.

library(dplyr)
PATH <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/computers.csv"
df <- read.csv(PATH) %>%
  select(-c(X, cd, multi, premium))
glimpse(df)
Output

## Observations: 6, 259
## Variables: 7
## $ price < int > 1499, 1795, 1595, 1849, 3295, 3695, 1720, 1995, 2225, 2...
##$ speed < int > 25, 33, 25, 25, 33, 66, 25, 50, 50, 50, 33, 66, 50, 25, ...
##$ hd < int > 80, 85, 170, 170, 340, 340, 170, 85, 210, 210, 170, 210...
##$ ram < int > 4, 2, 4, 8, 16, 16, 4, 2, 8, 4, 8, 8, 4, 8, 8, 4, 2, 4, ...


##$ screen < int > 14, 14, 15, 14, 14, 14, 14, 14, 14, 15, 15, 14, 14, 14, ...
##$ ads < int > 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, ...
## $ trend <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
From the summary statistics, you can see the data has large values. A good practice with k-means
and distance calculations is to rescale the data so that the mean is equal to zero and the standard
deviation is equal to one.

summary(df)
Output:

## price speed hd ram


## Min. : 949 Min. : 25.00 Min. : 80.0 Min. : 2.000
## 1st Qu.:1794 1st Qu.: 33.00 1st Qu.: 214.0 1st Qu.: 4.000 `
## Median :2144 Median : 50.00 Median : 340.0 Median : 8.000
## Mean :2220 Mean : 52.01 Mean : 416.6 Mean : 8.287
## 3rd Qu.:2595 3rd Qu.: 66.00 3rd Qu.: 528.0 3rd Qu.: 8.000
## Max. :5399 Max. :100.00 Max. :2100.0 Max. :32.000
## screen ads trend
## Min. :14.00 Min. : 39.0 Min. : 1.00
## 1st Qu.:14.00 1st Qu.:162.5 1st Qu.:10.00
## Median :14.00 Median :246.0 Median :16.00
## Mean :14.61 Mean :221.3 Mean :15.93
## 3rd Qu.:15.00 3rd Qu.:275.0 3rd Qu.:21.50
## Max. :17.00 Max. :339.0 Max. :35.00
You rescale the variables with the base R scale() function. The transformation reduces the
impact of outliers and allows a single observation to be compared against the mean. If a
standardized value (or z-score) is high, you can be confident that this observation is indeed above
the mean: a large z-score implies that the point is far away from the mean in terms of standard
deviations, and a z-score of two indicates the value is 2 standard deviations away from the mean.
Note that the z-score is symmetrical around the mean when the data follow a Gaussian distribution.

rescale_df <- df %>%
  mutate(price_scal = scale(price),
         hd_scal = scale(hd),
         ram_scal = scale(ram),
         screen_scal = scale(screen),
         ads_scal = scale(ads),
         trend_scal = scale(trend)) %>%
  select(-c(price, speed, hd, ram, screen, ads, trend))
Base R has a function to run the k-means algorithm. Its basic form is:

kmeans(df, k)
arguments:
-df: dataset used to run the algorithm
-k: number of clusters
Train the model
The steps of the algorithm can be watched graphically with the animation package built by
Yihui Xie (also the creator of knitr for R Markdown). If the package is not already installed on
your system, you can install it from CRAN:

install.packages("animation")
After you load the library, you add .ani after kmeans and R will plot all the steps. For illustration
purposes, you run the algorithm only with the rescaled variables hd and ram, with three clusters.

set.seed(2345)
library(animation)
kmeans.ani(rescale_df[2:3], 3)
Code Explanation

• kmeans.ani(rescale_df[2:3], 3): Select the columns 2 and 3 of rescale_df data set and
run the algorithm with k sets to 3. Plot the animation.


You can interpret the animation as follows:

• Step 1: R randomly chooses three points


• Step 2: Compute the Euclidean distance and draw the clusters. You have one cluster in
green at the bottom left, one large cluster colored in black at the right and a red one
between them.
• Step 3: Compute the centroid, i.e. the mean of the clusters
• Repeat until no data changes cluster

The algorithm converged after seven iterations. You can run the k-means algorithm on our
dataset with five clusters and call the result pc_cluster.

pc_cluster <- kmeans(rescale_df, 5)

The list pc_cluster contains seven interesting elements:

• pc_cluster$cluster: Indicates the cluster of each observation
• pc_cluster$centers: The cluster centres
• pc_cluster$totss: The total sum of squares
• pc_cluster$withinss: Within-cluster sum of squares; the number of components returned is equal to `k`
• pc_cluster$tot.withinss: Sum of withinss
• pc_cluster$betweenss: Total sum of squares minus the within sum of squares
• pc_cluster$size: Number of observations within each cluster

You will use the total within-cluster sum of squares (i.e. tot.withinss) to compute the optimal
number of clusters k. Finding k is indeed a substantial task.


Optimal k
One technique to choose the best k is called the elbow method. This method uses within-group
homogeneity or within-group heterogeneity to evaluate the variability. In other words, you are
interested in the percentage of the variance explained by each cluster. You can expect the
variability to increase with the number of clusters; equivalently, heterogeneity decreases. Our
challenge is to find the k that lies beyond the point of diminishing returns, where adding a new
cluster does not improve the variability in the data because very little information is left to explain.

In this tutorial, we find this point using the heterogeneity measure. The total within-clusters sum
of squares is the tot.withinss element in the list returned by kmeans().

You can construct the elbow graph and find the optimal k as follows:

• Step 1: Construct a function to compute the total within-clusters sum of squares
• Step 2: Run the algorithm for a range of values of k
• Step 3: Create a data frame with the results of the algorithm
• Step 4: Plot the results

Step 1) Construct a function to compute the total within clusters sum of squares

You create a function that runs the k-means algorithm and returns the total within-clusters sum
of squares:

kmean_withinss <- function(k) {


cluster <- kmeans(rescale_df, k)
return (cluster$tot.withinss)
}
Code Explanation

• function(k): The function takes the number of clusters k as its argument
• kmeans(rescale_df, k): Run the algorithm with k clusters
• return(cluster$tot.withinss): Return the total within-clusters sum of squares
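The original tutorial breaks off here; a plausible sketch of steps 2 to 4 of the elbow method, written to match the outline above (the choice of a maximum of 20 clusters is an arbitrary assumption), is:

# Steps 2-4: run kmean_withinss() over a range of k, collect and plot the results
max_k <- 20
wss <- sapply(2:max_k, kmean_withinss)
elbow <- data.frame(k = 2:max_k, wss = wss)

ggplot(elbow, aes(x = k, y = wss)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of clusters k",
       y = "Total within-clusters sum of squares")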

PAM, Hierarchical Clustering


This section illustrates the use of R for clustering methods, an unsupervised learning approach,
on a real dataset. It mainly covers preprocessing of the data, a comparison of three different
clustering algorithms, and an evaluation metric (the silhouette) for two of the clustering
algorithms. The clustering methods are, respectively, k-means, partitioning around medoids
(PAM), in other words k-medoids, and hierarchical clustering.

DATA

Description of the Dataset


This dataset mainly contains pizza nutrients per 100 grams of sample. Every id number represents
a different pizza and every letter (A, B, C, D, E, F, G, H, I, J) represents a pizza producer brand.


• Brand: Pizza brand


• Id: Sample analysed.
• Mois: Amount of water per 100 grams in the sample.
• Prot: Amount of protein per 100 grams in the sample.
• Fat: Amount of fat per 100 grams in the sample.
• Ash: Amount of ash per 100 grams in the sample.
• Sodium: Amount of sodium per 100 grams in the sample.
• Carb: Amount of carbohydrates per 100 grams in the sample.
• Cal: Amount of calories per 100 grams in the sample.
Number of Instances: 300
Attribute Characteristics: Integer
Number of Attributes: 9
Missing Values: No missing values
In this study, the variables analysed are id, mois, prot, fat, ash, sodium, carb and cal. The dataset
is imported into R as pizza and a scaled version of the dataset is saved as pizs. The data file is
tab-delimited text (*.txt).

Descriptive Statistics
This section includes descriptive statistics for the unscaled form of the data.

summary(pizza)
## brand id mois prot
## Length:300 Min. :14003 Min. :25.00 Min. : 6.98
## Class :character 1st Qu.:14094 1st Qu.:30.90 1st Qu.: 8.06
## Mode :character Median :24021 Median :43.30 Median :10.44
## Mean :20841 Mean :40.90 Mean :13.37
## 3rd Qu.:24110 3rd Qu.:49.12 3rd Qu.:20.02
## Max. :34045 Max. :57.22 Max. :28.48
## fat ash sodium carb
## Min. : 4.38 Min. :1.170 Min. :0.2500 Min. : 0.510
## 1st Qu.:14.77 1st Qu.:1.450 1st Qu.:0.4500 1st Qu.: 3.467
## Median :17.14 Median :2.225 Median :0.4900 Median :23.245
## Mean :20.23 Mean :2.633 Mean :0.6694 Mean :22.865
## 3rd Qu.:21.43 3rd Qu.:3.592 3rd Qu.:0.7025 3rd Qu.:41.337


## Max. :47.20 Max. :5.430 Max. :1.7900 Max. :48.640


## cal
## Min. :2.180
## 1st Qu.:2.910
## Median :3.215
## Mean :3.271
## 3rd Qu.:3.520
## Max. :5.080

Data Preparation
The data preparation steps are shown below:
1. At the beginning of the analysis, the data is imported as follows:

pizza <- read.delim("pizza.txt", stringsAsFactors = FALSE)


head(pizza)
## brand id mois prot fat ash sodium carb cal
## 1 A 14069 27.82 21.43 44.87 5.11 1.77 0.77 4.93
## 2 A 14053 28.49 21.26 43.89 5.34 1.79 1.02 4.84
## 3 A 14025 28.35 19.99 45.78 5.08 1.63 0.80 4.95
## 4 A 14016 30.55 20.15 43.13 4.79 1.61 1.38 4.74
## 5 A 14005 30.49 21.28 41.65 4.82 1.64 1.76 4.67
## 6 A 14075 31.14 20.23 42.31 4.92 1.65 1.40 4.67

2. Cluster tendency has been calculated by using Hopkins’ statistic

library(clustertend)   # hopkins() here is assumed to come from the clustertend package
hopkins(pizza[,2:9], n = nrow(pizza[,2:9]) - 1)
## $H
## [1] 0.002383577
1-0.002373702
## [1] 0.9976263

The value of 0.9976263 shows that the dataset is highly suitable for clustering analysis.


3. To obtain scaled data, the scale() function is used, and the scaled dataset is named pizs.


pizs <- scale(pizza[,2:9])

The scaled dataset now looks as shown below:

head(pizs)
## id mois prot fat ash sodium carb
## [1,] -0.9725866 -1.369526 1.252089 2.745255 1.950635 2.971721 -1.225463
## [2,] -0.9748845 -1.299391 1.225669 2.636070 2.131776 3.025723 -1.211598
## [3,] -0.9789058 -1.314046 1.028292 2.846640 1.927007 2.593708 -1.223800
## [4,] -0.9801984 -1.083752 1.053158 2.551397 1.698611 2.539707 -1.191630
## [5,] -0.9817782 -1.090033 1.228777 2.386506 1.722238 2.620709 -1.170554
## [6,] -0.9717249 -1.021991 1.065591 2.460039 1.800996 2.647710 -1.190521
## cal
## [1,] 2.675659
## [2,] 2.530505
## [3,] 2.707915
## [4,] 2.369224
## [5,] 2.256327
## [6,] 2.256327

CLUSTERING ANALYSIS

K-means
In this project, k-means clustering analysis is done using the Euclidean distance metric. First, the
optimal number of clusters is detected with fviz_nbclust() using the average silhouette method.

library(factoextra)   # for fviz_nbclust(), eclust(), fviz_cluster() and fviz_silhouette()
fviz_nbclust(pizs, kmeans, method = "silhouette") + theme_classic()


As the plot shows, the best option is 3 clusters; k-means clustering with the Euclidean distance
therefore continues with 3 clusters below.

wcke<-eclust(pizs, "kmeans", hc_metric="euclidean",k=3)


fviz_cluster(wcke, geom = "point", ellipse.type = "norm", ggtheme = theme_minimal())


The summary of k-means clustering with the Euclidean distance metric is shown below.

summary(wcke)
## Length Class Mode
## cluster 300 -none- numeric
## centers 24 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
## clust_plot 9 gg list
## silinfo 3 -none- list
## nbclust 1 -none- numeric


## data 2400 -none- numeric

Silhouette
The 'silhouette value' is used to check the quality of the clusters. It is a measure of how similar an
object is to its own cluster and how dissimilar it is to the other clusters. It takes values between -1
and 1; a value close to 1 means that the observations in a cluster are well fitted. In other words,
the average silhouette width lies between -1 and 1, and a result closer to 1 implies higher
clustering quality. The average silhouette width for this experiment is 0.48, which means the
dataset is suitable for cluster analysis.

library(cluster)   # provides silhouette()
sile <- silhouette(wcke$cluster, dist(pizs))

fviz_silhouette(sile)
## cluster size ave.sil.width
## 1 1 151 0.32
## 2 2 29 0.76
## 3 3 120 0.61
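For reference, the overall average silhouette width quoted above can also be computed directly from the silhouette object with a one-liner (added here for convenience):

mean(sile[, "sil_width"])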


Here the best clustering result is obtained for the 2nd cluster (green).

Partitioning Around Medoids (PAM)


Another clustering method used in this study is PAM. It is an adaptation of the k-means algorithm;
however, PAM is more robust and less sensitive to outliers. In PAM, the cluster centres are actual
selected observations, called medoids. The distance metric is again Euclidean in this analysis.
First, the optimal number of clusters is detected with fviz_nbclust() using the average silhouette
method.

fviz_nbclust(pizs, pam, method = "silhouette") + theme_classic()

10 clusters are suggested by the method; however, the number of clusters is set to 4 in order to
produce a more understandable clustering:

pam.res <- eclust(pizs, "pam", k = 4, hc_metric="euclidean") #plotting of clusters


fviz_cluster(pam.res, geom = "point", ellipse.type = "norm", ggtheme = theme_minimal())


pizs.pam = pam(pizs,3)
pizs.pam
## Medoids:
## ID id mois prot fat ash sodium
## [1,] 23 0.4732154 -1.0492077 1.1199866 2.45446807 1.8324986 2.5937084
## [2,] 107 0.4579919 0.7439488 1.2334394 0.09141019 1.0843043 0.1636256
## [3,] 139 0.4679016 -0.4410209 -0.8662149 -0.60825993 -0.9240069 -0.5653993
## carb cal
## [1,] -1.1949583 2.27245510
## [2,] -0.9564632 -0.48545705
## [3,] 0.9104540 -0.09838166
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2


## [112] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3
## [260] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [297] 3 3 3 3
## Objective function:
## build swap
## 1.754077 1.620609
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
pam.res$medoids
## id mois prot fat ash sodium
## [1,] 0.4732154 -1.0492077 1.1199866 2.45446807 1.8324986 2.5937084
## [2,] 0.4579919 0.7439488 1.2334394 0.09141019 1.0843043 0.1636256
## [3,] 0.4559813 -0.7330761 -0.8724315 -0.48904862 -1.0500186 -0.6734030
## [4,] 0.4625877 0.8036160 -0.5616019 -0.47010851 -0.2624456 -0.1873864
## carb cal
## [1,] -1.19495831 2.2724551
## [2,] -0.95646323 -0.4854571
## [3,] 1.01694485 0.1757967
## [4,] 0.02691297 -0.8080199
pam.res$clusinfo
## size max_diss av_diss diameter separation
## [1,] 29 1.710058 1.124369 3.076119 3.362456
## [2,] 90 2.457287 1.619764 4.343469 1.725482
## [3,] 120 2.417467 1.235228 4.066620 1.215986
## [4,] 61 2.069652 1.227803 3.479344 1.215986

When we look at the structure of the 4 clusters, cluster 3 has the most observations and cluster 1
has the fewest.


Silhouette
The value of the average silhouette width is 0.47 in the PAM clustering analysis with the Euclidean
distance metric, so there is only a slight difference between k-means and PAM in terms of the
silhouette results.

sile<-silhouette(pam.res$cluster, dist(pizs))
fviz_silhouette(sile)
## cluster size ave.sil.width
## 1 1 29 0.73
## 2 2 90 0.36
## 3 3 120 0.50
## 4 4 61 0.45

Here the best clustering result is obtained with the 1st cluster, which is 0.73, while the 2nd
cluster has the worst quality among the clusters and its silhouette width value is 0.36.


Hierarchical Clustering Analysis


In the first step, the distance matrix is computed with the Euclidean distance metric and
hierarchical clustering analysis is initiated with 3 clusters as follows:

library(purrr)   # provides map_dbl()

m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")

# agglomerative coefficient for each linkage method
ac <- function(x) {
  agnes(pizs, method = x)$ac
}

map_dbl(m, ac)
## average single complete ward
## 0.9609417 0.9365627 0.9704849 0.9938937

Ward gives the largest agglomerative coefficient here (0.9938937), so ward.D2 is chosen as the
method for hierarchical clustering.

d <- dist(pizs, method = "euclidean")

res.hc <- hclust(d, method = "ward.D2")

grp <- cutree(res.hc, k = 3)

plot(res.hc, cex = 0.6,labels = pizza$id)


plot(res.hc,labels = pizza$id, main = 'Hclust Dendrogram')

rect.hclust(res.hc, k = 3, border = 2:5)


After analysing the dendrogram, the optimal cut point is determined at a dendrogram height of
about 20, which corresponds to 3 clusters in this experiment; 2 of these clusters mainly contain
outliers.
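As a quick cross-check that is not part of the original report, the hierarchical cluster assignments can be cross-tabulated against the earlier k-means assignments:

table(hierarchical = grp, kmeans = wcke$cluster)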

CONCLUSION
In conclusion, 3 clustering methods (k-means, PAM and hierarchical clustering) with the
Euclidean distance metric are analysed in this report. The analysis starts with descriptive statistics
and scaling of the data. Cluster tendency is measured using Hopkins' statistic, and the optimal
number of clusters is found using the average silhouette criterion for the k-means and PAM
algorithms. For this dataset, k-means and PAM show no dramatic difference with the Euclidean
distance metric, giving only slightly different results for silhouette width and clustering, although
the k-means algorithm has the better average silhouette width (0.48 versus 0.47 for PAM).
Nevertheless, the 3 clustering algorithms give different clustering results, and for future study the
k-means algorithm appears to give a more accurate clustering solution than the PAM algorithm.
In addition, 3 clusters are reached in hierarchical clustering using the Euclidean distance metric;
the proportions are reasonable and the dendrogram shape supports cutting the tree into 3 clusters.
