STAT630 RSlide

This document provides an introduction to a course on R taught by Dr. S. Iddi at the University of Ghana. The course objectives are to provide an introduction to the R environment, teach how to create and manipulate R objects, import and export data, and perform simulations, bootstrapping, and linear modeling. The textbook is "Data Analysis and Graphics Using R" and lectures will cover basics of R including objects, attributes, expressions, and getting help functions. Examples will be drawn from various fields like medicine, biology, economics and finance.

Introduction to R

For STAT 605/703

Instructor
Samuel Iddi (PhD)

Department of Statistics and Actuarial Science


University of Ghana
[email protected]

February 10, 2020

Dr. S. Iddi (UG) R Training February 10, 2020 1 / 130


Course Information

Learning Objectives:
• Provide an introduction to the R environment.
• Create R objects: lists, factors and data frames.
• Subset and index objects.
• Import and export data.
• Create new columns; rename, sort and subset datasets.
• Merge two or more datasets and aggregate data.
• Perform simulations and bootstrapping.
• Use R to fit linear models, interpret results and draw conclusions.


Textbook:
1 Maindonald, J. and Braun, J. (2003). Data Analysis and Graphics Using R. New York: Cambridge University Press.
References:
1 Chambers, J. M. (2008). Software for Data Analysis. Springer.
2 Venables, W. N. and Ripley, B. D. (1997). Modern Applied Statistics with S-PLUS. 2nd Ed. New York: Springer.


Introduction

• In statistics, we study data for some purpose.
• We learn statistical methods to analyze data and draw conclusions.
• Preparing data, applying methods to data, and interpreting and presenting results receive the least attention in the teaching and study of statistics.
• Examples will be taken from various fields: medicine, biology, economics, finance, etc.
• Applications are carried out through R scripts and packages.
• R integrates the tasks of preparing, analyzing and presenting data in a powerful and flexible way.



Basic R

Introduction to R

• Learning a new language does not come easily, and the learning curve of R is steep.

• Knowledge of techniques, from basic to more advanced, is an advantage.

• The objective of this section is to demystify some things about R.



Basic R Preliminaries

History of R

• R is a dialect of the S language, which was developed by John Chambers and others at Bell Labs.
• The S system provides a very flexible and powerful environment for implementing new statistical ideas.
• The software is also used for data analysis and graphical display.
• Ross Ihaka and Robert Gentleman created R in New Zealand in 1991.
• It was made free software (under the GNU General Public License) in 1995.
• The R Core Group was formed in 1997 (involving some people associated with S-PLUS, an enhanced version of S).
• Version 1.0.0 of R was released in 2000.


Overview of R

• R syntax is very similar to that of S, so it is easy for S-PLUS users to switch over.
• The base R installation is quite lean because functionality is divided into modular packages.
• R has sophisticated graphics capabilities, better than those of most statistical packages.
• It contains a powerful programming language for developing and implementing new ideas.
• It has a very active and vibrant user community: the R-help and R-devel mailing lists, and Stack Overflow.
• It's free (who hates a free lunch?).


Advantages and Drawbacks

Advantages
• Run the program for any purpose.
• Study how the program works and adapt it to your own needs (free access to the source code).
• Redistribute copies to anyone.
• Improve the program and share the improvements with the public.
Drawbacks
• Based on old technology.
• Functionality is driven by consumer demand and user contributions. If your favorite method is not implemented, you have to do the job yourself.
• Not ideal for all possible situations (a drawback of all software packages).


Design of the R System

There are two conceptual parts:
• The 'base' R system, which contains the base packages.
• Everything else.
R functionality is divided into a number of packages:
• The base package is required to run R and contains the most fundamental functions.
• Example packages in the 'base' system: utils, stats, datasets, graphics, grid, tools, parallel, compiler, splines, stats4.
• Recommended and other popular packages: tidyverse, dplyr, Rcmdr, Hmisc, plotly, ggplot2, boot, class, cluster, foreign, KernSmooth, lattice, mgcv, nlme, lme4, rpart, survival, MASS, spatial, nnet, Matrix, randomForest.
• More than 10,000 contributed packages are available on CRAN.


How to install packages?

• How are R packages installed? We use the R function install.packages().

• Example: install.packages('ggplot2')

• You can also install multiple packages by passing a character vector.

• Example: install.packages(c('ggplot2', 'dplyr'))

• Note that install.packages('ggplot2', 'dplyr') will not work as intended: the second argument is interpreted as the library directory (lib), not as another package.
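
A common pattern is to install only the packages that are not yet present. The sketch below is one possible way to do this; the helper name missing_packages() is hypothetical (not part of R), and the actual install call is left commented out because it needs internet access.

```r
# Hypothetical helper: return the requested packages that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

wanted <- c("stats", "ggplot2", "dplyr")
to_install <- missing_packages(wanted)
to_install  # character vector of packages still to be installed

# Pass the whole vector as the first argument (pkgs):
# if (length(to_install) > 0) install.packages(to_install)
```

Because "stats" ships with R, it should never appear in to_install; the other two names depend on the local library.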


R Resources and Getting Help

The following resources can be found on CRAN (visit http://cran.r-project.org):
• An Introduction to R.
• Writing R Extensions.
• R Data Import/Export.
• etc.
Find help by
• searching the web.
• reading the manual.
• reading the FAQ.
• asking experienced and skilled friends.
• reading the source code.
• asking questions via mailing lists.

Getting Help with R Functions

Within R
• Access the help file: ?rnorm.

• Search the help files: help.search("rnorm").

• Get the arguments: args(rnorm).

• View the source code: simply type rnorm.



Basic R Getting Started with R

Expressions and Assignment

• Elementary commands consist of either expressions or assignments.
• Assignments are indicated by the assignment operator <-.
• The # character is used to indicate a comment. Anything from the # to the end of the line is ignored.
5+10
## [1] 15

sqrt(4^2+2/3)*pi^3

## [1] 126.5826

sum(4,2,5,1) # sum of all elements


## [1] 12

prod(4,2,5,1) # product of all elements

## [1] 40


x<-mean(c(3,5,8,1)) #nothing is printed


x #auto-printing occurs

## [1] 4.25

print(x) #explicit printing

## [1] 4.25

y<-exp(1) # exponential function


y

## [1] 2.718282

(m<-max(4,2,5,1)) # maximum of all elements

## [1] 5


(y<-sqrt(x)) #another way to print results

## [1] 2.061553

msg<-"hello"
print(msg)

## [1] "hello"

z<-1:6 #The operator : is used to create integer sequences.


z

## [1] 1 2 3 4 5 6

(seq(from=1, to=2, by=0.2)) # create a sequence from 1 to 2 with 0.2 increment

## [1] 1.0 1.2 1.4 1.6 1.8 2.0

(seq(from=1, to=2, length=6)) # specify length of sequence

## [1] 1.0 1.2 1.4 1.6 1.8 2.0

Objects

Objects

• R works with objects.
• Understanding objects is key to using R effectively.
• Objects include vectors, lists, matrices, data frames, arrays and functions.
• The most basic object is a vector, which consists of elements of the same class or mode.
• Exception: a list, which is represented as a vector, can contain objects of different classes.
• Five basic (atomic) classes of objects: character, numeric (real numbers), integer, complex, logical (TRUE/FALSE).



Objects Object Attributes

Object Attributes

• View a list of currently defined objects with ls() or objects().
• Remove objects from the workspace with the function rm().
Object attributes can be examined and set using various functions:
◦ names: names(), dimnames().
◦ dimensions (e.g. matrices, arrays): dim().
◦ class: class(), typeof(), mode().
◦ length: length().
◦ other user-defined attributes.


Object Attributes in R

x<- c(1:5) #vector


names(x)<-c("a","c","c","d","e")
names(x)

## [1] "a" "c" "c" "d" "e"

length(x)#length

## [1] 5

z<-c("Male","Female")#character vector
mode(x)

## [1] "numeric"

class(z)

## [1] "character"



Objects Coercion

Coercion

• When objects of different classes are mixed in a vector, coercion occurs so that every element in the vector is of the same mode.
• An object can be explicitly coerced from one class to another using the as.*() functions, if available, e.g. as.numeric, as.logical, as.character, as.matrix, as.factor.
• Nonsensical coercion results in NA, e.g. coercing a non-numeric character string to a numeric or logical object.


Examples: Coercion

z <- c(1.7, "a") ## character


class(z)

## [1] "character"

z <- c(TRUE, 2) ## numeric


class(z)

## [1] "numeric"

z <- c("a", TRUE) ## character


class(z)

## [1] "character"

x<-0:5
class(x)

## [1] "integer"

x<-as.numeric(x)
class(x)

## [1] "numeric"


as.logical(x)

## [1] FALSE TRUE TRUE TRUE TRUE TRUE

##Nonsensical Coercion
y<-c("NPP","NDC","CPP")
as.numeric(y)

## Warning: NAs introduced by coercion

## [1] NA NA NA

as.logical(y)

## [1] NA NA NA

as.complex(y)

## Warning: NAs introduced by coercion

## [1] NA NA NA
Operators and Special Values

Arithmetic and Logic Operators

• Arithmetic operators: + (addition), - (subtraction), * (multiplication), / (division) and ^ (power).
• They operate on numbers, vectors, matrices, etc.
• Logical operators: "and" and "or", denoted by & and |.
• Comparison operators: > (greater than), >= (greater than or equal to), < (less than), <= (less than or equal to), == (equal to) and != (not equal to).
• Upon evaluation, logical and comparison operators return the logical values TRUE or FALSE.
• If an operation cannot be evaluated (for example, because an operand is missing), NA is returned.


Special Values

• Special values:
◦ Logical values: TRUE/FALSE or T/F.
◦ Missing values: NA (not available), NaN (not a number).
◦ Inf is a special number which represents infinity, e.g. 1/0.
◦ NaN represents the value of an undefined mathematical operation, e.g. 0/0.
• is.na() is used to test whether objects are NA.
• is.nan() is used to test for NaN.
• NA values have a class too, so there are integer NA, character NA, etc.
• A NaN value is also treated as NA (is.na(NaN) is TRUE), but the converse is not true.
• If an operation cannot be evaluated, NA is returned.


Examples in R

x <- c(TRUE, FALSE, TRUE, FALSE)


class(x)

## [1] "logical"

(y<-seq(from=-5, to=10, by=1))

## [1] -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10

z<-y<=0
z

## [1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE

(p<-rnorm(10,0,1))

## [1] 1.01953555 -0.61850147 -0.70058543 -1.66749774 -1.48370835


## [6] -0.05437555 0.46999880 -2.01273952 0.66474585 0.26280083

(q<-p>= -1 & p<=1)

## [1] FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE

0/0

## [1] NaN

1/0

## [1] Inf

(r<-c(1,2,4, NA, NaN, 5, Inf))

## [1] 1 2 4 NA NaN 5 Inf

is.na(r)

## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE

is.nan(r)

## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE



Numbers and Vectors

Numbers and Vectors

• Numbers are treated as numeric objects.
• For an integer, specify the L suffix, e.g. entering 1 gives you a numeric object, entering 1L gives you an integer.
• We can obtain whole numbers from floating point numbers by going to
◦ the next larger integer: ceiling().
◦ the next smaller integer: floor().
◦ the next integer closer to zero: trunc().
• Round a number to a given number of decimal places: round().
• R considers a number to be a vector of length one.
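
The points above can be checked with a short sketch. The negative value is chosen only for illustration, since it is where floor() and trunc() differ.

```r
class(1)     # "numeric"
class(1L)    # "integer"

x <- -2.7
floor(x)     # -3: next smaller whole number
ceiling(x)   # -2: next larger whole number
trunc(x)     # -2: drops the fractional part, moving towards zero

# floor(), ceiling() and trunc() return whole numbers stored as doubles;
# as.integer() truncates towards zero and also changes the class.
class(floor(x))       # "numeric"
class(as.integer(x))  # "integer"

length(5)    # 1: a single number is a vector of length one
```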


R Examples on Numbers

(x<-20/sqrt(2))

## [1] 14.14214

floor(x)

## [1] 14

ceiling(x)

## [1] 15

trunc(x)

## [1] 14

round(x,digit=2)

## [1] 14.14

round(x,4)

## [1] 14.1421

Vectors

• A vector consists of an ordered collection of elements.
• All elements must have the same class (or mode), e.g. logical, numeric, character.
• Exception: the special element NA (but not NaN) can be mixed with elements of any mode.
• The length of a vector is the number of its elements.
• Construct a vector: c() concatenates elements or vectors.
• Vector of length zero (empty vector): vector(). The default mode is logical.
• To create a vector of a specific mode, simply call the mode name as a function, e.g. v=integer(), w=character().
• Generate regular sequences and repeated values: seq(), rep(), etc.


R Examples on Vectors

(x<-c(1,3,4, sqrt(5), 10, -2))

## [1] 1.000000 3.000000 4.000000 2.236068 10.000000 -2.000000

class(x)

## [1] "numeric"

(y<-integer(5))

## [1] 0 0 0 0 0

class(y)

## [1] "integer"

(z<-character(4))

## [1] "" "" "" ""


class(z)

## [1] "character"

(a=rep(3,9))

## [1] 3 3 3 3 3 3 3 3 3

(b=seq(3,8,0.5))

## [1] 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0

(c=1:5)

## [1] 1 2 3 4 5

(v=c(x,c))

## [1] 1.000000 3.000000 4.000000 2.236068 10.000000 -2.000000 1.000000
## [8] 2.000000 3.000000 4.000000 5.000000

Useful Vector Functions

Working with data frequently requires manipulating vectors. Some functions that help to manipulate vectors in R:
• sort() - returns a sorted version of the input vector.
• order() - returns an integer vector containing the permutation that will sort the input into ascending order.
• rank() - returns the ranks of the values in the input vector.
• unique() - returns the unique values of the input vector.
• diff() - creates a vector of differences, x_i − x_{i−k} for lag k.
• length() - returns the length of the input vector.
• mean() - returns the mean of the input vector.
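
The relationship between sort(), order() and rank() is easy to confuse, so here is a small sketch on a made-up vector (the values are illustrative only).

```r
x <- c(30, 10, 20, 10)

idx <- order(x)   # positions that sort x: 2 4 3 1
x[idx]            # 10 10 20 30, the same as sort(x)
rank(x)           # 4.0 1.5 3.0 1.5 (tied values share the average rank)
diff(x)           # -20 10 -10 (lag 1 by default)
unique(x)         # 30 10 20
```

Indexing with order() is the usual way to sort one vector by the values of another.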


Examples

x<-c(1,1,7,3,3,4,3,2,3,2,1,4,1,1,6,4)
sort(x)

## [1] 1 1 1 1 1 2 2 3 3 3 3 4 4 4 6 7

order(x)

## [1] 1 2 11 13 14 8 10 4 5 7 9 6 12 16 15 3

rank(x)

## [1] 3.0 3.0 16.0 9.5 9.5 13.0 9.5 6.5 9.5 6.5 3.0 13.0 3.0 3.0
## [15] 15.0 13.0


unique(x)

## [1] 1 7 3 4 2 6

length(x)

## [1] 16

diff(x,lag=2)

## [1] 6 2 -4 1 0 -2 0 0 -2 2 0 -3 5 3


Vector arithmetics

• The arithmetic operations of addition, subtraction, multiplication and division (+, -, * and / respectively) can be applied to vectors.
• To avoid printing many digits, especially when dividing, set options(digits= ).
• Square root function: sqrt().
(x=seq(2,10,by=2))
## [1] 2 4 6 8 10
(y=c(2.2:6.4))
## [1] 2.2 3.2 4.2 5.2 6.2
(z=c(2,4))
## [1] 2 4
x+y
## [1] 4.2 7.2 10.2 13.2 16.2


x*y

## [1] 4.4 12.8 25.2 41.6 62.0

x-y

## [1] -0.2 0.8 1.8 2.8 3.8

2*x+y-5*sqrt(x)

## [1] -0.8710678 1.2000000 3.9525513 7.0578644 10.3886117

y/x

## [1] 1.10 0.80 0.70 0.65 0.62

x/z

## Warning in x/z: longer object length is not a multiple of shorter object length

## [1] 1 1 3 2 5
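
The warning above comes from R's recycling rule: when two vectors of different lengths are combined, the shorter one is repeated until it matches the longer one. A minimal sketch, reusing x and z from the slides (suppressWarnings() is used here only to keep the output clean):

```r
x <- seq(2, 10, by = 2)  # 2 4 6 8 10 (length 5)
z <- c(2, 4)             # length 2

# z is recycled to 2 4 2 4 2; R warns because 5 is not a multiple of 2.
r1 <- suppressWarnings(x / z)
r1                       # 1 1 3 2 5

# The same computation with the recycling written out explicitly:
r2 <- x / rep(z, length.out = length(x))
r2                       # 1 1 3 2 5

# No warning when the longer length is a multiple of the shorter one:
w <- 1:6 / c(1, 2)
w                        # 1 1 3 2 5 3
```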

Character Vectors

• Data, reports and figures require frequent manipulation of character strings.
• Character strings are delimited by double or single quotes.
• Concatenate strings into a single string: paste().
(x<-c('NPP',"NDC","CPP"))
## [1] "NPP" "NDC" "CPP"
(y <- c("Volta Region is NPP's",'nightmare'))
## [1] "Volta Region is NPP's" "nightmare"
paste(y[1],y[2])
## [1] "Volta Region is NPP's nightmare"
(z<-letters[1:5])
## [1] "a" "b" "c" "d" "e"
(Z<- LETTERS[20:25])
## [1] "T" "U" "V" "W" "X" "Y"
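
paste() has two arguments that are worth knowing beyond the basic call shown above: sep (the separator between the pieces) and collapse (which joins a whole vector into one string). A brief sketch; the variable name parties is chosen for illustration:

```r
parties <- c("NPP", "NDC", "CPP")

labelled <- paste("Party", parties)         # element-wise: "Party NPP" "Party NDC" "Party CPP"
with_sep <- paste("Party", parties, sep = ":")  # "Party:NPP" "Party:NDC" "Party:CPP"
joined   <- paste(parties, collapse = ", ") # one string: "NPP, NDC, CPP"
ids      <- paste0("x", 1:3)                # no separator: "x1" "x2" "x3"

nchar(parties)                              # number of characters: 3 3 3
```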


Subsets and Index Vectors

• Extracting subsets of vectors is frequently required.
• To extract a subset, we specify the indices of the elements we wish to extract or exclude.
• The index vector specifies the elements to return.
x=34:45 #create a vector x
c(x[2],x[6]) #create a new vector with the second and sixth elements of x
## [1] 35 39
x[c(2,6)] #an alternative way to extract the second and sixth elements
## [1] 35 39
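
Excluding elements is done with negative indices. The sketch below continues with the same vector and also shows a range and a logical condition used as an index.

```r
x <- 34:45

dropped_first <- x[-1]         # everything except the first element
dropped_two   <- x[-c(2, 6)]   # everything except the second and sixth elements
middle        <- x[3:5]        # a range of positions: 36 37 38
large         <- x[x > 42]     # logical index: 43 44 45
last          <- x[length(x)]  # last element: 45
```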


Index vector of logical values and missing data

• Index vectors can also be logical vectors.

• Some numerical data contain missing cases represented by NA.

• Arithmetic operations involving NA result in NA.

• So we need a way to extract values that are not NA.

• Identify NA values: is.na().


R Examples: Index vector of logical values and missing data

x <- c(10, 20, NA, 4, NA, 2)


sum(x)/length(x)

## [1] NA

mean(x)

## [1] NA

(i=is.na(x))#identify NA's

## [1] FALSE FALSE TRUE FALSE TRUE FALSE

y=x[!i] #obtain values of x that are not NA


j=complete.cases(x) #identify values of x that are not NA
yy=x[j] #subset with no NA
sum(y)/length(y) #compute mean

## [1] 9

mean(x, na.rm = TRUE) #yields the same result

## [1] 9

airquality[1:6, ] #using a built-in dataset

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

good <- complete.cases(airquality) #flag rows with no NA

airquality[good, ][1:6, ] #subset data without NAs

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 7    23     299  8.6   65     5   7
## 8    19      99 13.8   59     5   8



Matrices

Creating Matrices

• A matrix is a two-dimensional object.
• Matrices are vectors with a dimension attribute.
• The dimension attribute is itself an integer vector of length 2 (nrow, ncol).
• The function matrix() creates a matrix from a vector.
• The general call has the form: matrix(vector, number of rows, number of columns, byrow=TRUE (or FALSE)).
• By default, matrices are filled column-wise. To fill row-wise, use the option byrow=TRUE.
• A matrix can also be created from a vector by adding a dimension attribute.
• Matrices can also be built with cbind() and rbind().
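
The difference between column-wise and row-wise filling is easiest to see side by side. A minimal sketch with an illustrative vector:

```r
v <- 1:6

by_col <- matrix(v, nrow = 2, ncol = 3)                # filled column-wise (default)
by_row <- matrix(v, nrow = 2, ncol = 3, byrow = TRUE)  # filled row-wise

by_col
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

by_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

dim(by_col)   # 2 3: the dimension attribute (nrow, ncol)
```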


Matrices in R

(x <- matrix(1:4, nrow = 2, ncol = 2))

## [,1] [,2]
## [1,] 1 3
## [2,] 2 4

dim(x)

## [1] 2 2

(y <- matrix(c(2, 4, 5, -1,0,-4), nrow = 3, ncol = 2))

## [,1] [,2]
## [1,] 2 -1
## [2,] 4 0
## [3,] 5 -4


rbind(x,y)

## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## [3,] 2 -1
## [4,] 4 0
## [5,] 5 -4

(z<-c(2, 4, 5, -1,0,-4))

## [1] 2 4 5 -1 0 -4

dim(z)=c(3,2)#create matrix from vector using dim attribute


z

## [,1] [,2]
## [1,] 2 -1
## [2,] 4 0
## [3,] 5 -4


(p <- 1:3)

## [1] 1 2 3

(q <- 10:12)

## [1] 10 11 12

cbind(p, q)

## p q
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12

(r <- cbind(letters[1 : 4], LETTERS[1 : 4]))

## [,1] [,2]
## [1,] "a" "A"
## [2,] "b" "B"
## [3,] "c" "C"
## [4,] "d" "D"

Matrix operations

• R contains many functions and operators for matrices.
• Functions:
◦ t() - transposes the input matrix.
◦ nrow(), ncol() - return the number of rows and columns of a matrix respectively.
◦ solve() - returns the inverse of a square matrix.
◦ solve(A,b) - gives the solution of the system of equations Ax = b.
◦ diag(n) - for a positive integer n, generates an n × n identity matrix.
◦ det() - calculates the determinant of a square matrix.
◦ sum(diag()) - returns the trace of an input matrix.
• Arithmetic operators also work with matrices.
◦ A * B - multiplies corresponding elements of two matrices A and B.
◦ A %*% B - does proper matrix multiplication.
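
As an illustration of solve(A, b), the sketch below solves a small made-up system (2x1 + x2 = 3, x1 + 3x2 = 5) and checks the answer by multiplying back. The matrix and right-hand side are illustrative values, not taken from the slides.

```r
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
b <- c(3, 5)

x <- solve(A, b)            # solves A x = b
x                           # 0.8 1.4

check <- as.vector(A %*% x) # proper matrix multiplication reproduces b
x_inv <- as.vector(solve(A) %*% b)  # same solution via the explicit inverse

det(A)                      # 5
sum(diag(A))                # trace: 5
```

Calling solve(A, b) directly is generally preferable to forming the inverse first, since it does less work and is numerically more stable.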

Functions and Operators for matrices in R

(x<-c(10,2,5,7,125,3,0,1,1))

## [1] 10 2 5 7 125 3 0 1 1

(y<-matrix(x,3,3))

## [,1] [,2] [,3]


## [1,] 10 7 0
## [2,] 2 125 1
## [3,] 5 3 1

det(y) #calculate determinant of y

## [1] 1241

sum(diag(y))#return trace of y

## [1] 136


t(y)#transpose of y

## [,1] [,2] [,3]


## [1,] 10 2 5
## [2,] 7 125 3
## [3,] 0 1 1

solve(y)#inverse of y

## [,1] [,2] [,3]


## [1,] 0.098307816 -0.005640612 0.005640612
## [2,] 0.002417405 0.008058018 -0.008058018
## [3,] -0.498791297 0.004029009 0.995970991


eigen(y) #eigenvalues and eigenvectors of y

## eigen() decomposition
## $values
## [1] 125.148195 9.844519 1.007286
##
## $vectors
## [,1] [,2] [,3]
## [1,] -0.06065781 0.87316411 -0.006357337
## [2,] -0.99780533 -0.01939437 0.008167102
## [3,] -0.02655459 0.48704034 -0.999946440

eigen(y)$values

## [1] 125.148195 9.844519 1.007286

eigen(y)$vectors

## [,1] [,2] [,3]


## [1,] -0.06065781 0.87316411 -0.006357337
## [2,] -0.99780533 -0.01939437 0.008167102
## [3,] -0.02655459 0.48704034 -0.999946440

y*t(y) #elementwise multiplication

## [,1] [,2] [,3]


## [1,] 100 14 0
## [2,] 14 15625 3
## [3,] 0 3 1

solve(y)%*%y #proper matrix multiplication

## [,1] [,2] [,3]


## [1,] 1.000000e+00 4.163336e-17 0
## [2,] -1.387779e-17 1.000000e+00 0
## [3,] 8.881784e-16 4.440892e-16 1


diag(3) #returns 3x3 identity matrix

## [,1] [,2] [,3]


## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1

5+y

## [,1] [,2] [,3]


## [1,] 15 12 5
## [2,] 7 130 6
## [3,] 10 8 6


Subsetting a matrix

• A matrix has subscripts mat[i,j].
• By default, when a single element of a matrix is retrieved, it is returned as a vector of length 1 rather than a 1 × 1 matrix.
• This behaviour can be turned off by setting drop=FALSE.
(m <- matrix(c(2, 4, 5, -1,0,-4), nrow = 2, ncol = 3))
## [,1] [,2] [,3]
## [1,] 2 5 0
## [2,] 4 -1 -4
m[1,2] #subsetting a single element of a matrix
## [1] 5


m[1,2,drop=F] #return 1x1 matrix

## [,1]
## [1,] 5

m[,2] #subsetting column 2, returns a vector

## [1] 5 -1

m[,2 ,drop=FALSE] #subsetting column 2, returns a matrix

## [,1]
## [1,] 5
## [2,] -1



Arrays

Arrays

• A matrix is a two-dimensional array, but larger arrays can be defined as well.
• Arrays have k dimensions.
• Each element of an array is accessed with k indices, x[i1,...,ik].
• E.g. an array of three 2 × 3 matrices is defined by dim=c(2,3,3).


(x<-array(c(1:18),dim=c(2,3,3)))

## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 13 15 17
## [2,] 14 16 18


x[,,1] #subset the first matrix in the array

## [,1] [,2] [,3]


## [1,] 1 3 5
## [2,] 2 4 6

x[,1,2] #subset first column in the second matrix

## [1] 7 8

dim(x)

## [1] 2 3 3



List

Creating, subsetting a list and R "output" as a List

• A list is a special type of vector that can contain different objects.
• Each component can be an object of a different type (e.g. vectors and matrices in the same list) and length.
• In contrast, matrices and data frames contain vectors of the same length.
• To construct a list, use the function list().
• Components of a list can be accessed by name or by index.
• R 'output' is usually a list.
• For example, the result of eigen(x) is a list that contains the eigenvalues and the eigenvectors.
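
One detail that often trips up newcomers is the difference between single and double brackets when indexing a list. A short sketch on an illustrative named list:

```r
z <- list(a = 1, b = c("Male", "Female"), d = matrix(1:4, 2, 2))

sub  <- z["b"]     # single brackets return a list of length one
class(sub)         # "list"
comp <- z[["b"]]   # double brackets return the component itself
class(comp)        # "character"
z$b[2]             # "Female": index into the extracted component
names(z)           # "a" "b" "d"

e <- eigen(z$d)    # function output stored as a list-like object
names(e)           # "values" "vectors"
e$values           # the eigenvalues
```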


Examples on lists

(x <- list(1, "a", TRUE, 1 + 4i))

## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 1+4i


(y <- list(1, c("Male","Female"), matrix(1:4, 2,2)))

## [[1]]
## [1] 1
##
## [[2]]
## [1] "Male" "Female"
##
## [[3]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4

z <- list(a=1, b=c("Male","Female"), d=matrix(1:4, 2,2)) #list with names


z[2] #subsetting a list

## $b
## [1] "Male" "Female"

z$d #subset with name

## [,1] [,2]
## [1,] 1 3
## [2,] 2 4


z[["d"]] #alternatively

## [,1] [,2]
## [1,] 1 3
## [2,] 2 4

z[c(2,3)] #subset the last two components of the list

## $b
## [1] "Male" "Female"
##
## $d
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4

(ol<-eigen(z$d))

## eigen() decomposition
## $values
## [1] 5.3722813 -0.3722813
##
## $vectors
## [,1] [,2]
## [1,] -0.5657675 -0.9093767
## [2,] -0.8245648 0.4159736

summary(ol)
Factors

Factors

• Factors are used to define groups in vectors.
• Each level of a factor defines a group.
• Many statistical model functions make use of factors, e.g. ANOVA and regression.
• Factors are stored as integer codes with values 1, 2, . . . , k, where k is the number of levels.
• Vectors that hold character strings can be converted to factors with the function factor().
• The levels of a factor can be accessed using levels().


• When a vector of strings is included as a column of a data frame, R versions before 4.0.0 by default turn the vector into a factor in which the distinct strings are the level names (since R 4.0.0 the default is stringsAsFactors = FALSE).
• There are some contexts in which factors become numeric vectors. To obtain the vector of strings, use as.character().
• To extract the codes 1, 2, . . . , use as.numeric().
• For a factor whose levels are character string representations of numeric values, e.g. "10", "2", "3.2", use as.numeric(as.character()) to extract the numerical values.
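• When a for loop iterates over a factor, R coerces the factor to a character vector, so the loop variable takes the level labels; loop over as.integer() of the factor to obtain the integer codes.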
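
The pitfall with numeric-looking factor levels can be seen in a small sketch (the factor f is made up for illustration):

```r
f <- factor(c("10", "2", "3.2", "10"))

levels(f)                             # "10" "2" "3.2" (sorted as strings)
codes  <- as.numeric(f)               # 1 2 3 1: the integer codes, not the values
values <- as.numeric(as.character(f)) # 10.0 2.0 3.2 10.0: the actual numbers

# An equivalent idiom that converts the levels once and then indexes by code:
values2 <- as.numeric(levels(f))[f]
```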


Example: Factors

(gender<-c(rep("F",3),rep("M",5)))

## [1] "F" "F" "F" "M" "M" "M" "M" "M"

levels(gender)

## NULL

(gender<-factor(gender))

## [1] F F F M M M M M
## Levels: F M

levels(gender)

## [1] "F" "M"


as.numeric(gender)

## [1] 1 1 1 2 2 2 2 2

x<-data.frame(sex=c(rep("M",3),rep("F",4)),
height=ceiling(rnorm(7,10,2)),
stringsAsFactors=TRUE) #needed in R >= 4.0.0 to store sex as a factor
levels(x$sex)

## [1] "F" "M"



Data Frame

Data frame

• Used to store tabular data.
• These objects fit somewhere in between matrices and lists.
• They are not as rigid as matrices: they can contain columns of different classes.
• A data frame can contain both numerical and character vectors.
• They are not as loose as lists: they have a rectangular structure.
• Many functions in R used for analysis require a data frame.
• They are constructed with the function data.frame().
• Appropriate objects can be coerced into a data frame with as.data.frame().



Data Frame Creating a Data Frame

Create Data frame

• Can be constructed by reading saved data from a text file with read.table() or read.csv().
• Columns of a data frame can be referenced by index or name.
• Use the $ sign to extract a column (vector) from the data frame.
• names() can be used to see the column names of a data frame.
• You can also attach the data frame (with attach()) and then refer to columns by name alone.
• Detach the data frame with detach().
• Data frames can be converted into a matrix with data.matrix() or as.matrix().


R examples: data frame

#From Vectors#
scores<-c(50,45,90)
exams<-c("maths","english","science")
(dat1<-data.frame(exams,scores))

## exams scores
## 1 maths 50
## 2 english 45
## 3 science 90

#From Matrix#
(dat2<-data.frame(matrix(1 : 24, nrow = 4, ncol = 6)))

## X1 X2 X3 X4 X5 X6
## 1 1 5 9 13 17 21
## 2 2 6 10 14 18 22
## 3 3 7 11 15 19 23
## 4 4 8 12 16 20 24


(dat3<-as.data.frame(matrix(1 : 24, nrow = 4, ncol = 6)))

## V1 V2 V3 V4 V5 V6
## 1 1 5 9 13 17 21
## 2 2 6 10 14 18 22
## 3 3 7 11 15 19 23
## 4 4 8 12 16 20 24

mat<-data.matrix(dat1) #convert data frame to matrix


mat

## exams scores
## [1,] 2 50
## [2,] 1 45
## [3,] 3 90


##Subset Dataframe##
dat1$exams #subset by name

## [1] maths english science


## Levels: english maths science

dat1[,1] #subset by index

## [1] maths english science


## Levels: english maths science

dat1[,'scores'] #subset by name-index

## [1] 50 45 90


attach(dat1) #attach data frame

## The following objects are masked _by_ .GlobalEnv:


##
## exams, scores

scores #access 'scores' from dat1

## [1] 50 45 90



Data Frame Reading and Exporting Data

Import data

• Data can be entered directly or imported into R.
• Data from a database management system (DBMS) can be imported directly into R without first exporting it from the system.
• To import text data or data from foreign systems:
◦ Text data: read.table(), read.csv().
◦ Excel data: load library(xlsx) and use read.xlsx().
◦ SAS data: load library(Hmisc) and use sasxport.get().
◦ SPSS data: load library(Hmisc) and use spss.get().
◦ STATA data: load library(foreign) and use read.dta().
◦ SYSTAT data: load library(foreign) and use read.systat().


Important arguments in read.table()

The read.table function is one of the most commonly used functions for
reading data. It has a few important arguments:
 file, the name of a file, or a connection.
 header, logical indicating if the file has a header line.
 sep, a string indicating how the columns are separated.
 colClasses, a character vector indicating the class of each column in
the dataset.
 nrows, the maximum number of rows to read.
 comment.char, a character string indicating the comment character.
 skip, the number of lines to skip from the beginning.
 stringsAsFactors, should character variables be coded as factors?
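
The arguments above can be tried without any file on disk by passing the contents through the text argument of read.table(). This is a small sketch; the data lines are a hypothetical excerpt in the style of the airquality data, and the object names txt and air are illustrative.

```r
# Hypothetical file contents, supplied inline through the 'text' argument
txt <- "Ozone Solar.R Wind Temp Month Day
41 190 7.4 67 5 1
36 118 8.0 72 5 2
# a comment line that read.table() skips
NA NA 14.3 56 5 5"

air <- read.table(text = txt,
                  header = TRUE,            # first line holds the column names
                  sep = "",                 # columns separated by white space
                  comment.char = "#",       # ignore text after a '#'
                  colClasses = "numeric",   # read every column as numeric
                  stringsAsFactors = FALSE)
str(air)
```

Missing values are written as NA in the input, which read.table() recognises by default.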


Export data

 There are several ways to export R objects into other formats.
 For SPSS, SAS and STATA, first load the package foreign.
 For Excel, load the xlsx (or openxlsx) package, which provides write.xlsx().
 To export:
◦ Tab-delimited text file: write.table(mydata,
"C:/dat.txt", sep="\t")
◦ Excel spreadsheet: write.xlsx(mydata, "C:/dat.xlsx").
◦ SAS: write.foreign(mydata, "C:/dat.txt",
"C:/dat.sas", package="SAS").
◦ SPSS: write.foreign(mydata, "C:/dat.txt",
"C:/dat.sps", package="SPSS").
◦ STATA: write.dta(mydata, "C:/dat.dta").
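
The text exports can be checked with a quick round trip using base R only. In this sketch the data frame and the temporary paths are illustrative assumptions; tempfile() is used so that no particular folder has to exist.

```r
mydata <- data.frame(id = 1:3, group = c("a", "b", "c"))

# tempfile() gives a writable path, so no particular folder has to exist
txt_path <- tempfile(fileext = ".txt")
csv_path <- tempfile(fileext = ".csv")

# Tab-delimited text; row.names = FALSE avoids an extra unnamed column
write.table(mydata, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)

# Comma-separated values, readable by Excel
write.csv(mydata, csv_path, row.names = FALSE)

# Read the files back to check the round trip
back_txt <- read.table(txt_path, header = TRUE, sep = "\t")
back_csv <- read.csv(csv_path)
```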


Activity

Enter the following data in a .txt file.

Ozone Solar.R Wind Temp Month Day
41    190     7.4  67   5     1
36    118     8.0  72   5     2
12    149     12.6 74   5     3
18    313     11.5 62   5     4
NA    NA      14.3 56   5     5
28    NA      14.9 66   5     6


R examples: import data into R

#mydata<- read.table("C:/Datasets/mydata.txt",header=TRUE)
mydata<-head(airquality)
head(mydata,n=3) #print first 3 rows of mydata

## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3

tail(mydata,n=3) #print last 3 rows of mydata

## Ozone Solar.R Wind Temp Month Day
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6


R examples: export data from R

##Export complete cases
#the 'Datasets' folder must already exist (e.g. dir.create("Datasets"));
#otherwise the calls below fail as shown
expdat<-mydata[complete.cases(mydata),]
write.table(expdat, "./Datasets/expdat.txt",sep="," )

## Warning in file(file, ifelse(append, "a", "w")): cannot open file './Datasets/expdat.txt': No such file or directory
## Error in file(file, ifelse(append, "a", "w")): cannot open the connection

#install.packages('xlsx')
#install.packages('openxlsx')
library(openxlsx)
write.xlsx(expdat,"./Datasets/expdat.xlsx" )

## Note: zip::zip() is deprecated, please use zip::zipr() instead
## Warning in file.create(to[okay]): cannot create file './Datasets/expdat.xlsx', reason 'No such file or directory'


Activity

 Enter the following data into Excel.
 Save the data as a .csv file.
 Import the data into R.

Names Ages Sex
Ben 14 M
Jullie 17 F
Fred 13 M
Ama 14 F
Vic 13 F
Joe 16 M
Sam 9 M
Ellen 11 F



Data Manipulation Creating and Renaming Datasets

Creating new variable

 A new variable is created in a dataset with the assignment operator <-.

##Create new variable##
hospital <- c("Kolebu", "37 Military", "Police", "Legon")
patients <- c(150, 350, 200,500)
costs <- c(3.1, 2.5, 2.9,2.0)
(HosDat <- data.frame(hospital, patients, costs))
## hospital patients costs
## 1 Kolebu 150 3.1
## 2 37 Military 350 2.5
## 3 Police 200 2.9
## 4 Legon 500 2.0


Creating new variable

##Create new variable##


HosDat$totcosts <- HosDat$patients *HosDat$costs
HosDat
## hospital patients costs totcosts
## 1 Kolebu 150 3.1 465
## 2 37 Military 350 2.5 875
## 3 Police 200 2.9 580
## 4 Legon 500 2.0 1000

##Alternatively, using the 'transform()' function


(mydata <- transform(HosDat, totalcost2 = patients*costs))
## hospital patients costs totcosts totalcost2
## 1 Kolebu 150 3.1 465 465
## 2 37 Military 350 2.5 875 875
## 3 Police 200 2.9 580 580
## 4 Legon 500 2.0 1000 1000


Recoding variables

 We can also recode a variable.

## For 2 categories
HosDat$costs.cat<-ifelse(HosDat$costs <= 2.5,
"Cheap","Expensive")

##For more than 2 categories
HosDat$costs.cat2<-NA #initialise the new column
HosDat$costs.cat2[HosDat$costs==2.5]<-"Normal"
HosDat$costs.cat2[HosDat$costs<2.5]<-"Cheap"
HosDat$costs.cat2[HosDat$costs>2.5]<-"Expensive"
HosDat
## hospital patients costs totcosts costs.cat costs.cat2
## 1 Kolebu 150 3.1 465 Expensive Expensive
## 2 37 Military 350 2.5 875 Cheap Normal
## 3 Police 200 2.9 580 Expensive Expensive
## 4 Legon 500 2.0 1000 Cheap Cheap
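
When a numeric variable is recoded into several ordered bands, cut() is a compact alternative to repeated indexed assignments. The sketch below uses the same costs values; the break points and the labels Low, Medium and High are hypothetical and differ from the categories above.

```r
costs <- c(3.1, 2.5, 2.9, 2.0)

# Hypothetical bands: (0, 2.5], (2.5, 3], (3, Inf]
band <- cut(costs,
            breaks = c(0, 2.5, 3, Inf),
            labels = c("Low", "Medium", "High"))
data.frame(costs, band)
table(band)
```

By default the intervals are closed on the right, so a cost of exactly 2.5 falls in the first band.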


Renaming variables

 A variable can be renamed interactively with the function fix() or
programmatically with the rename() function from the reshape
package.
##Rename variable##
#fix(HosDat) #rename interactively (change hospital to Hospital,
#patients to Patient)

#install.packages('reshape')
library(reshape)
HosDat<-rename(HosDat, c(costs="Costs",totcosts="TotalCosts"))
HosDat
## hospital patients Costs TotalCosts costs.cat costs.cat2
## 1 Kolebu 150 3.1 465 Expensive Expensive
## 2 37 Military 350 2.5 875 Cheap Normal
## 3 Police 200 2.9 580 Expensive Expensive
## 4 Legon 500 2.0 1000 Cheap Cheap
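
Renaming can also be done in base R by assigning to names(), which avoids the extra package. This is a minimal sketch on a cut-down version of HosDat; the lookup vector is an illustrative helper, not part of the original example.

```r
HosDat <- data.frame(hospital = c("Kolebu", "Legon"),
                     patients = c(150, 500),
                     costs = c(3.1, 2.0))

# Rename one column by matching its current name
names(HosDat)[names(HosDat) == "costs"] <- "Costs"

# Rename several columns at once via a lookup vector (old name -> new name)
lookup <- c(hospital = "Hospital", patients = "Patients")
hit <- names(HosDat) %in% names(lookup)
names(HosDat)[hit] <- lookup[names(HosDat)[hit]]
names(HosDat)
```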


Activity

Assume that we have registered the height and weight of seven people:
heights in cm are 180, 165, 160, 193, 163, 145, 200; weights in kg are 87, 58,
65, 100, 150, 100, 75. Make two vectors, height and weight, with the data. The
body mass index (BMI) is defined as
$$\mathrm{BMI} = \frac{\text{weight in kg}}{(\text{height in m})^2}$$
Create a data frame. Make a column with the BMI values for the seven people,
and a column with the natural logarithm of the BMI values.

Make a column that classifies the BMI values as follows.

BMI Classification
< 18.5 Underweight
18.5 − 24.9 Normal weight
25.0 − 29.9 Overweight
≥ 30 Obese

Export your data as a csv file.


Data Manipulation Sorting and Subsetting Datasets

sorting and subsetting data sets

 To sort a dataset by a variable, use the order() function (optionally inside with()).
Sorting is ASCENDING by default. To sort in DESCENDING order, prepend
the sorting variable with a minus sign (this does not work for factors).
#Using the 'iris' dataset
irisdata<- iris

##Sort ascending by Sepal.Length


irisdata<-irisdata[order(irisdata$Sepal.Length),]
head(irisdata)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 14 4.3 3.0 1.1 0.1 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 4 4.6 3.1 1.5 0.2 setosa


sorting and subsetting data sets

#Sort ascending by Species


irisdata<-irisdata[order(irisdata$Species),]
head(irisdata)

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species


## 14 4.3 3.0 1.1 0.1 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 4 4.6 3.1 1.5 0.2 setosa


sorting and subsetting data sets

##Sort descending
irisdata<-irisdata[order(-irisdata$Sepal.Length),]
head(irisdata)

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species


## 132 7.9 3.8 6.4 2.0 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 106 7.6 3.0 6.6 2.1 virginica


sorting and subsetting data sets

##Sort with two variables


irisdata<-irisdata[order(irisdata$Species, irisdata$Petal.Length),]
head(irisdata)

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species


## 23 4.6 3.6 1.0 0.2 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 17 5.4 3.9 1.3 0.4 setosa
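
Since the minus sign cannot be applied to a factor, a descending sort on a factor needs a different approach. One option, sketched here on the built-in iris data, is the decreasing argument of order(); the object names by_species and mixed are illustrative.

```r
# Descending sort on a factor: the minus sign fails, decreasing = TRUE works
by_species <- iris[order(iris$Species, decreasing = TRUE), ]
head(by_species$Species, 3)

# Mixed directions: Species ascending, Sepal.Length descending within species
mixed <- iris[order(iris$Species, -iris$Sepal.Length), ]
head(mixed[, c("Species", "Sepal.Length")], 3)
```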


Activity

Enter the data below in notepad and import in R.

id h1 h2 h3 w1 w2 w3 sex
1 1 11 101 5 25 35 male
2 2 12 102 6 26 36 male
3 3 13 103 7 27 37 male
4 4 14 104 8 28 38 female
5 5 15 105 9 29 39 female
The following code can be used to import the data.
wide<-read.table("./Datasets/wide.txt", header=TRUE)

## Warning in file(file, "rt"): cannot open file './Datasets/wide.txt': No such file or directory
## Error in file(file, "rt"): cannot open the connection

##OR wide<-read.table(file.choose(), header=TRUE)
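
The error above only means that the file was not found at that path. As a sketch that does not depend on any file, the same table can be supplied inline through the text argument of read.table(), which makes the later examples on wide reproducible.

```r
# The activity data supplied inline, so no external file is needed
wide <- read.table(header = TRUE, text = "
id h1 h2 h3 w1 w2 w3 sex
1 1 11 101 5 25 35 male
2 2 12 102 6 26 36 male
3 3 13 103 7 27 37 male
4 4 14 104 8 28 38 female
5 5 15 105 9 29 39 female
")
str(wide)
```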


subsetting data sets

##Keep and drop variables


#select with variable name
(newWide<-wide[c('h1','h2', 'w1','w2', 'sex', 'id')])

## Error in eval(expr, envir, enclos): object ’wide’ not found

(newWide1<-wide[c(1:2,7:8)]) #select with variable position

## Error in eval(expr, envir, enclos): object ’wide’ not found


subsetting data sets

##Keep and drop variables


(newWide2<-wide[-c(1:2,7)]) #exclude with variable position

## Error in eval(expr, envir, enclos): object ’wide’ not found


subset() function

 We can delete or keep variables and observations.


 An easier way to select variables and observations is by using the
subset() function.
attach(wide)
## Error in attach(wide): object ’wide’ not found
#selecting observations
wide[(sex=="male" & w3 > 35),]
## Error in eval(expr, envir, enclos): object ’wide’ not
found
#selecting variables and observations
subset(wide, w3 > 35 & sex=='male', select=id:sex)
## Error in subset(wide, w3 > 35 & sex == "male", select =
id:sex): object ’wide’ not found
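
Both selections can be written without attach() by qualifying the columns, which avoids the masking issues seen earlier. The sketch below uses a cut-down, inline version of wide (only id, w3 and sex) so that it runs on its own; sel1 and sel2 are illustrative names.

```r
wide <- data.frame(id = 1:5,
                   w3 = 35:39,
                   sex = c("male", "male", "male", "female", "female"))

# Logical indexing: qualify the columns with wide$ instead of attaching
sel1 <- wide[wide$sex == "male" & wide$w3 > 35, ]

# subset() evaluates the condition inside the data frame
sel2 <- subset(wide, w3 > 35 & sex == "male", select = c(id, w3))
sel1
sel2
```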


Activity

For the BMI data above make a vector with the weights for those people who
have a BMI larger than 25.



Data Manipulation Merging and Aggregating

Merging and Aggregating

 Two datasets can be merged on one or more common keys with the
command merge(data.frameA, data.frameB,
by=c("common key variable")).
 By default this returns only the rows in which the left table has matching keys in the
right table; this is called an inner join.
 Other types of merge:
◦ An outer join of data.frameA and data.frameB (all=TRUE):
returns all rows from both tables, joining records from the left that have matching keys in the
right table.
◦ A left outer join (or simply left join) of data.frameA and
data.frameB (all.x=TRUE):
returns all rows from the left table, and any rows with matching keys from the right table.
◦ A right outer join of data.frameA and data.frameB (all.y=TRUE):
returns all rows from the right table, and any rows with matching keys from the left table.



Data Manipulation Merging and Appending

Merging and Appending

 Append datasets with the rbind(data.frameA, data.frameB)
command.
 Examples
##Merging data###
df1<-data.frame(CustomerId = c(1:6),
Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2<-data.frame(CustomerId = c(2, 4, 6),
State = c(rep("Alabama",2), rep("Ohio", 1)))

#Inner join
merge(df1, df2,by="CustomerId")
## CustomerId Product State
## 1 2 Toaster Alabama
## 2 4 Radio Alabama
## 3 6 Radio Ohio


Merging and Aggregating

#Outer join
merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
## CustomerId Product State
## 1 1 Toaster <NA>
## 2 2 Toaster Alabama
## 3 3 Toaster <NA>
## 4 4 Radio Alabama
## 5 5 Radio <NA>
## 6 6 Radio Ohio


Merging

#Left outer
merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

## CustomerId Product State


## 1 1 Toaster <NA>
## 2 2 Toaster Alabama
## 3 3 Toaster <NA>
## 4 4 Radio Alabama
## 5 5 Radio <NA>
## 6 6 Radio Ohio

#Right outer
merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
## CustomerId Product State
## 1 2 Toaster Alabama
## 2 4 Radio Alabama
## 3 6 Radio Ohio

#Cross join
#merge(x = df1, y = df2, by = NULL)
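
Appending with rbind() and the cross join are mentioned but not run above. A small sketch of both follows; df3 and states are hypothetical data frames introduced only for illustration.

```r
df1 <- data.frame(CustomerId = 1:3, Product = c("Toaster", "Toaster", "Radio"))
df3 <- data.frame(CustomerId = 4:5, Product = c("Radio", "Kettle"))  # hypothetical extra rows

# Appending: both data frames must have the same column names
appended <- rbind(df1, df3)
nrow(appended)

# Cross join: by = NULL pairs every row of x with every row of y
states <- data.frame(State = c("Alabama", "Ohio"))
crossed <- merge(x = df1, y = states, by = NULL)
dim(crossed)
```

The cross join has nrow(df1) * nrow(states) rows, so it grows quickly with larger inputs.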

Activity

Create and import the following datasets into R.


Table: Books
Name      Title                          Other Author
Tukey     Exploratory Data Analysis
Venables  Modern Applied Statistics ...  Ripley
Tierney   LISP-STAT
Ripley    Spatial Statistics
Ripley    Stochastic Simulation
McNeil    Interactive Data Analysis
R Core    An Introduction to R           Venables & Smith

Table: Authors
Surname   Nationality  Deceased
Tukey     US           yes
Venables  Australia    no
Tierney   US           no
Ripley    UK           no
McNeil    Australia    no

Merge the two datasets using the various merge types.


Data Manipulation Aggregating

Aggregating

 It is relatively easy to collapse data in R using one or more BY variables
and a defined function.
 This is demonstrated in the examples below using the aggregate()
function.
(aggdata<-aggregate(irisdata[,c(-5)],by=list(irisdata$Species),
FUN=mean, na.rm=TRUE))
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.006 3.428 1.462 0.246
## 2 versicolor 5.936 2.770 4.260 1.326
## 3 virginica 6.588 2.974 5.552 2.026


Aggregating

attach(mtcars)
# ?mtcars
(agdata <-aggregate(mtcars, by=list(cyl,vs),FUN=mean, na.rm=TRUE))

## Group.1 Group.2 mpg cyl disp hp drat wt qsec


## 1 4 0 26.00000 4 120.30 91.0000 4.430000 2.140000 16.70000
## 2 6 0 20.56667 6 155.00 131.6667 3.806667 2.755000 16.32667
## 3 8 0 15.10000 8 353.10 209.2143 3.229286 3.999214 16.77214
## 4 4 1 26.73000 4 103.62 81.8000 4.035000 2.300300 19.38100
## 5 6 1 19.12500 6 204.55 115.2500 3.420000 3.388750 19.21500
## vs am gear carb
## 1 0 1.0000000 5.000000 2.000000
## 2 0 1.0000000 4.333333 4.666667
## 3 0 0.1428571 3.285714 3.500000
## 4 1 0.7000000 4.000000 1.500000
## 5 1 0.0000000 3.500000 2.500000
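
The output above also averages the grouping variables themselves and labels the groups Group.1 and Group.2. As an alternative sketch, the formula interface of aggregate() keeps the original names and needs no attach(); the choice of mpg and hp as summarised variables is illustrative.

```r
# Mean mpg and hp for each combination of cyl and vs
agg <- aggregate(cbind(mpg, hp) ~ cyl + vs, data = mtcars, FUN = mean)
agg
```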



Data Manipulation Reshaping data sets

R Examples on reshaping data sets from Wide to Long

 Reshaping a dataset from wide to long, or from long to wide, is easily done with
the base R reshape() function, or with the melt() and cast() functions from
library(reshape).
str(reshape)

## function (data, varying = NULL, v.names = NULL, timevar = "time",


## idvar = "id", ids = 1L:NROW(data), times = seq_along(varying[[1L]]),
## drop = NULL, direction, new.row.names = NULL, sep = ".", split = if (sep ==
## "") {
## list(regexp = "[A-Za-z][0-9]", include = TRUE)
## } else {
## list(regexp = sep, include = FALSE, fixed = TRUE)
## })


Simple R Example on reshape from Wide to Long

#Simple Example: Wide to Long#


(wide<-read.table("./Datasets/wide.txt", header=T))

## Warning in file(file, "rt"): cannot open file


’./Datasets/wide.txt’: No such file or directory
## Error in file(file, "rt"): cannot open the connection


Simple R Example on reshape from Wide to Long

#Simple Example: Wide to Long#


long<-reshape(wide,varying=list(c("h1","h2","h3"), c("w1","w2","w3")),
v.names=c("h","w"), times=1:3,direction="long")

## Error in idvar %in% names(data): object ’wide’ not found

head(long)

## Error in head(long): object ’long’ not found


Simple R Example on reshape from Long to Wide

#Long to Wide#
long<-long[order(long$id),]

## Error in eval(expr, envir, enclos): object ’long’ not found

head(long)

## Error in head(long): object ’long’ not found


Simple R Example on reshape from Long to Wide

#Long to Wide#
w<-reshape(long, timevar = "time", idvar = c("id", "sex"),
direction = "wide")

## Error in reshape(long, timevar = "time", idvar = c("id",


"sex"), direction = "wide"): object ’long’ not found

## Error in eval(expr, envir, enclos): object ’w’ not found
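
The errors on these slides arise only because wide.txt was not available when the code was run. The sketch below repeats the same reshape() calls on an inline copy of the activity data, so the round trip from wide to long and back can be run as is; only the explicit idvar = "id" in the first call is added here for clarity.

```r
wide <- data.frame(id = 1:5,
                   h1 = 1:5, h2 = 11:15, h3 = 101:105,
                   w1 = 5:9, w2 = 25:29, w3 = 35:39,
                   sex = c("male", "male", "male", "female", "female"))

# Wide to long: h1..h3 stack into 'h', w1..w3 into 'w', indexed by 'time'
long <- reshape(wide,
                varying = list(c("h1", "h2", "h3"), c("w1", "w2", "w3")),
                v.names = c("h", "w"), times = 1:3,
                idvar = "id", direction = "long")
dim(long)

# Long back to wide: one row per id, with h.<time> and w.<time> columns
w <- reshape(long, timevar = "time", idvar = c("id", "sex"),
             direction = "wide")
names(w)
```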


Activity

Convert the following data into the long format.


Code Country 1950 1951 1952 1953 1954
1 GH Ghana 20,249 21,352 22,532 23,557 24,555
2 NGR Nigeria 8,097 8,986 10,058 11,123 12,246
3 SA South Africa 12,004 23,024 30,345 32,100 44,456

Convert back to the wide format.


Exercise

Below is a dataset in wide format. Students have been measured using five
metrics: read, write, math, science, and socst.
id female race ses schtyp prog read write math science socst
70 0 4 1 1 1 57 52 41 47 57
121 1 4 2 1 3 68 59 53 63 61
86 0 4 3 1 1 44 33 54 58 31
141 0 4 3 1 3 63 44 47 53 56
172 0 4 2 1 2 47 52 57 53 61
113 0 4 2 1 2 44 52 51 63 61
50 0 3 2 1 1 50 59 42 53 61
11 0 1 2 1 2 34 46 45 39 36
84 0 4 2 1 1 63 57 54 58 51
48 0 3 2 1 2 57 55 52 50 51

Reformat this dataset into long form, using the reshape function.



Special Functions

The apply family of functions

 Loops are useful in programming but can be awkward to write when
working interactively on the command line.
 The apply family of functions implements looping in a compact way.
 Examples of these functions are lapply(), sapply(), apply(), tapply() and
mapply().
 They work as follows:
◦ lapply: loop over a list and evaluate a function on each element.
◦ sapply: same as lapply, but try to simplify the result.
◦ apply: apply a function over the margins of an array (rows or
columns).
◦ tapply: apply a function over subsets of a vector.
◦ mapply: multivariate version of lapply.
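
Of the functions listed, mapply() is the only one without a worked example later in this section, so a brief sketch is given here. The anonymous function and the inputs are illustrative.

```r
# mapply() walks over several vectors in parallel:
# here f(1, 4), f(2, 5), f(3, 6)
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
sums

# When the results have different lengths, a list is returned
reps <- mapply(rep, 1:3, 3:1)
reps
```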


lapply() and sapply()

 Applies to a list or data frame (since data frame has the structure of a list
of columns).
 Used to apply a function to each column of a data frame in turn.
 The result of lapply() is always a list.
 sapply() tries to simplify the result into a vector or matrix.
◦ if the result is a list where every element is of length 1, then a vector
is returned.
◦ if the result is a list where every element is a vector of the same
length (>1), a matrix is returned.
◦ If it can’t figure things out, a list is returned.


R Examples on lapply() and sapply()

x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1),


d = rnorm(100, 5))
str(lapply)

## function (X, FUN, ...)

lapply(x, mean)

## $a
## [1] 2.5
##
## $b
## [1] -0.6856285
##
## $c
## [1] 0.9688946
##
## $d
## [1] 5.045283


R Examples on lapply() and sapply()

str(sapply)

## function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)

sapply(x,mean)

## a b c d
## 2.5000000 -0.6856285 0.9688946 5.0452832


R Examples on lapply() and sapply()

lapply(iris[,-5],mean,na.rm=TRUE)

## $Sepal.Length
## [1] 5.843333
##
## $Sepal.Width
## [1] 3.057333
##
## $Petal.Length
## [1] 3.758
##
## $Petal.Width
## [1] 1.199333

sapply(iris[,-5],mean,na.rm=TRUE)

## Sepal.Length Sepal.Width Petal.Length Petal.Width


## 5.843333 3.057333 3.758000 1.199333


R Examples on lapply() and sapply()

x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))


lapply(x, function(col) col[,1])

## $a
## [1] 1 2
##
## $b
## [1] 1 2 3

sapply(x, function(col) col[,1])

## $a
## [1] 1 2
##
## $b
## [1] 1 2 3


apply() function

 Used to evaluate a function over the margins of an array.

 Can also be applied to a data frame; over columns (MARGIN=2) it gives the same results as
sapply for numeric columns.

 Its first argument is an array or data frame, the second specifies the
margin.

 Specify MARGIN=1 to apply a function to each row in turn, 2 when the
function is to be applied to each column in turn, and a number greater
than 2 if the argument is an array of more than two dimensions.

 There are shortcuts for the sums and means of matrix dimensions:
◦ rowSums(x) is equivalent to apply(x, 1, sum)
◦ rowMeans(x) is equivalent to apply(x, 1, mean)
◦ colSums(x) is equivalent to apply(x, 2, sum)
◦ colMeans(x) is equivalent to apply(x, 2, mean)
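
The equivalence between the shortcuts and the corresponding apply() calls can be checked on a small matrix. This is a minimal sketch; the matrix m is an arbitrary illustration.

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns

# Each shortcut gives the same values as the corresponding apply() call
all.equal(rowSums(m),  apply(m, 1, sum))
all.equal(colMeans(m), apply(m, 2, mean))

rowSums(m)
colMeans(m)
```

The shortcut functions are generally preferred on large matrices because they are implemented in compiled code.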

R Examples on apply() function

attach(iris)
x <- matrix(rnorm(200), 20, 10)
apply(x, 1, quantile, probs = c(0.25, 0.75))

## [,1] [,2] [,3] [,4] [,5] [,6]


## 25% -0.4084939 -0.1802337 -0.18559466 -0.3762959 0.03392191 -0.8586708
## 75% 0.9235477 0.5131000 0.09425262 0.5606906 0.51882212 0.5261400
## [,7] [,8] [,9] [,10] [,11] [,12]
## 25% -0.4185235 -0.95517398 -0.4549109 -0.6282301 -0.5888163 -1.3137031
## 75% 1.0384544 -0.09248019 0.9809233 0.1271469 0.7732049 0.4801687
## [,13] [,14] [,15] [,16] [,17] [,18]
## 25% 0.2723333 -0.5674557 -0.7895098 -1.056187 -0.4533377 -0.9828001
## 75% 0.8751587 0.1775895 1.0717022 1.563150 1.3608930 0.1409344
## [,19] [,20]
## 25% -0.4863923 -0.06651604
## 75% 0.4251499 0.82322584

apply(iris[,-5],2,mean)

## Sepal.Length Sepal.Width Petal.Length Petal.Width


## 5.843333 3.057333 3.758000 1.199333


tapply() function

 Arguments are a variable, a list of factors, and a function that operates on a
vector to return a single value.
 The output is an array with as many dimensions as there are factors.

library(MASS)
attach(cabbages)
attach(iris);

## The following objects are masked from iris (pos = 5):


##
## Petal.Length, Petal.Width, Sepal.Length, Sepal.Width, Species

str(tapply)

## function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)

x <- c(rnorm(10), runif(10), rnorm(10, 1))


tapply() function

f <- gl(3, 10)


tapply(x, f, mean)

## 1 2 3
## -0.2104579 0.5430863 1.1316653

tapply(iris$Sepal.Length,list(iris$Species), mean)

## setosa versicolor virginica


## 5.006 5.936 6.588

tapply(HeadWt,list(Cult, Date),mean)

## d16 d20 d21


## c39 3.18 2.80 2.74
## c52 2.26 3.11 1.47



Graphical Functions

Exploratory graphs

 Graphs are very useful in data analysis.
 They are used to
◦ understand data properties,
◦ find underlying patterns in the data,
◦ suggest modeling strategies,
◦ communicate results, etc.
 Univariate plots: stem-and-leaf plots, histograms, boxplots, barplots,
density plots, qqplots.
 Bivariate plots: scatterplots, line plots.
 Several figures can be plotted on one page with the function
par(mfrow=c(n,k)), which lays them out in n rows and k columns.
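
A short sketch of the multi-panel layout follows, using the built-in faithful data. Saving the value returned by par() and restoring it afterwards is a common habit rather than a requirement; the plot titles are illustrative.

```r
# par() returns the previous settings, which makes them easy to restore
op <- par(mfrow = c(1, 2))        # 1 row, 2 columns of plots

hist(faithful$waiting, main = "Histogram", xlab = "Time waited")
boxplot(faithful$waiting, main = "Boxplot", ylab = "Time waited")

par(op)                           # back to one plot per page
```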


Stem-and-leaf

 Textual graph that classifies data items according to their most significant
numeric digits.
 Used to study the distribution of a continuous random variable.
 Created using the function stem().
attach(faithful)
stem(faithful$waiting)

##
## The decimal point is 1 digit(s) to the right of the |
##
## 4 | 3
## 4 | 55566666777788899999
## 5 | 00000111111222223333333444444444
## 5 | 555555666677788889999999
## 6 | 00000022223334444
## 6 | 555667899
## 7 | 00001111123333333444444
## 7 | 555555556666666667777777777778888888888888889999999999
## 8 | 000000001111111111111222222222222333333333333334444444444
## 8 | 55555566666677888888999
## 9 | 00000012334

Histogram or Ogives

 Consists of vertical bars that show graphically the frequency distribution
of a quantitative variable.

 A standard histogram is created with hist(x,...).

 Knowing a few options helps make your histogram look exactly how
you want it.


Histogram or Ogives

BMI<-rnorm(n=1000, mean=24.2, sd=2.2)
hist(BMI)

[Figure: "Histogram of BMI"; x-axis: BMI, y-axis: Frequency]

Histogram or Ogives

hist(BMI, breaks=20, main="Breaks=20")

[Figure: histogram titled "Breaks=20"; x-axis: BMI, y-axis: Frequency]

Histogram or Ogives

#plotting with densities instead of frequencies


hist(BMI, freq=FALSE, main="Density plot")

[Figure: histogram titled "Density plot"; x-axis: BMI, y-axis: Density]

Histogram or Ogives

hist(BMI, freq=FALSE, xlab="Body Mass Index",


main="Distribution",
col="lightgreen", xlim=c(15,35), ylim=c(0, .20))
# Add a normal curve
curve(dnorm(x, mean=mean(BMI), sd=sd(BMI)),
add=TRUE, col="darkblue", lwd=2)

[Figure: histogram titled "Distribution" with a normal curve added; x-axis: Body Mass Index, y-axis: Density]

QQ-Plot and Box-Plot

 QQ-Plot: used to check whether univariate data are close to normally distributed.
 The R command qqnorm() is used to generate it and qqline() adds a
reference line through the first and third quartiles.
 It plots the sample quantiles against the theoretical quantiles.
 Boxplot: a graphical way to summarize data.
 Automatically computes the median, the first and third quartiles and (shown as
notches with notch=TRUE) an approximate 95% CI of the median.
 Can detect outliers and compare distributions.
 Generated with the function boxplot().
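
The numbers behind a boxplot can be inspected directly with boxplot.stats(), which computes the summary without drawing anything. A minimal sketch on the faithful data; the object name bs is illustrative.

```r
bs <- boxplot.stats(faithful$waiting)

bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$conf    # lower and upper notch limits around the median
bs$out     # points flagged as outliers (beyond the whiskers)
```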


Example: QQ-Plot and Box-Plot

##QQPlot
qqnorm(faithful$waiting)
qqline(faithful$waiting)

[Figure: "Normal Q-Q Plot"; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

Example: QQ-Plot and Box-Plot

##Boxplot
boxplot(faithful$waiting)

[Figure: boxplot of faithful$waiting]

Example: QQ-Plot and Box-Plot

##Boxplot
boxplot(faithful$waiting, main="Boxplot of Time Waited",
xlab="Time waited", horizontal=TRUE)

[Figure: horizontal boxplot titled "Boxplot of Time Waited"; x-axis: Time waited]

Example: QQ-Plot and Box-Plot

##Histogram
hist(faithful$waiting,main="Histogram of Time Waited",
xlab="Time waited")

[Figure: histogram of faithful$waiting; x-axis: Time waited, y-axis: Frequency]

Scatter plots and line plots

 Scatter plot: there are many ways to create a scatter plot of two
quantitative variables.
 Created with the basic plot function plot(x,y,...).
 It displays the pairs of values of the vectors x and y in a Cartesian
diagram.
 Used to reveal the relationship between the variables.
 There are several options to change the default plot. For example,
◦ pch - change the symbol of points
◦ cex - change the size of symbols and text
◦ adj - shift the title to the left or right, etc.
 Line plot: control the type of line connecting points with the option
type=.
 Examples: "p" for points, "l" for lines, "b" for both, "s" for stair
steps, "n" for no plotting.

Codes: Scatter plots and line plots

##Scatter plots
par(mfrow=c(1,2))
plot(eruptions, waiting, # plot the variables
xlab="Eruption duration", # x-axis label
ylab="Time waited",main="Scatterplot1")
abline(lm(waiting~eruptions))
#change symbol of points and size of points.
plot(eruptions, waiting,main="Scatterplot2",adj=0,
xlab="Eruption duration", cex=2, ylab="Time waited",pch=2,col=3)

[Figure: two panels, "Scatterplot1" and "Scatterplot2"; x-axis: Eruption duration, y-axis: Time waited]

Codes: Scatter plots and line plots

##Line plot
x<-rnorm(25,0,1)
x<-sort(x)
Fn<-order(x)/length(x)
#cbind(x,order(x),Fn)
par(mfrow=c(1,2))
plot(x,Fn)
plot(x,Fn,type="l")

[Figure: two panels plotting Fn against x, as points (left) and as a line (right)]

Codes: Scatter plots and line plots

##Line plot
par(mfrow=c(1,2))
plot(x,Fn,type="b")
plot(x,Fn,type="s")

[Figure: two panels plotting Fn against x, with type="b" (left) and type="s" (right)]