Exploratory Data Analysis and Visualization Using R

Draft version 2.1

Materials in this course are protected by the United States copyright law
[Title 17, U.S. Code]. No parts of this draft handout may be reproduced,
shared or distributed in print or digitally without permission from the author.
Contents

1 R Environment, Language and Framework 1


1.1 Basics of R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Mathematical Expressions in R . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Simple Objects in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.3 Calling Functions in R . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Composite Objects in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.1 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.1.1 Arithmetic and Relational Operators on Vectors . . . . . . . 19
1.2.1.2 Subsetting Vectors . . . . . . . . . . . . . . . . . . . . . . . . 20
1.2.1.3 Modifying Vectors . . . . . . . . . . . . . . . . . . . . . . . . 24
1.2.1.4 Essential Vector Functions and Operators . . . . . . . . . . . 25
1.2.1.5 Vectors in Linear Algebra . . . . . . . . . . . . . . . . . . . . 31
1.2.2 Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.2.3 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.2.3.1 Subsetting Matrices . . . . . . . . . . . . . . . . . . . . . . . 40
1.2.3.2 Arithmetic Operators on Matrices . . . . . . . . . . . . . . . 42
1.2.3.3 Matrices in Linear Algebra . . . . . . . . . . . . . . . . . . . 43
1.2.3.4 Fundamental Types of Matrices in Linear Algebra . . . . . . 57
1.2.4 Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
1.2.5 Data Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
1.2.5.1 Subsetting Data Frames . . . . . . . . . . . . . . . . . . . . . 63
1.2.5.2 Modifying Data Frames . . . . . . . . . . . . . . . . . . . . . 63
1.2.5.3 Essential Data Frame Functions . . . . . . . . . . . . . . . . 64
1.2.6 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
1.2.6.1 Subsetting Lists . . . . . . . . . . . . . . . . . . . . . . . . . 68
1.2.6.2 Modifying Lists . . . . . . . . . . . . . . . . . . . . . . . . . 71
1.2.7 Symbolic Expressions in R . . . . . . . . . . . . . . . . . . . . . . . . . 71
1.3 Apply Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
1.4 Installing and Loading Packages in R . . . . . . . . . . . . . . . . . . . . . . . 74
1.4.1 Loading Data in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
1.4.2 Importing/Exporting Data in R . . . . . . . . . . . . . . . . . . . . . 76
1.5 R Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
1.6 R Control Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
1.6.1 Conditionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
1.6.2 Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
1.7 Implementing Functions in R . . . . . . . . . . . . . . . . . . . . . . . . . . . 81


1.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

2 Exploratory Data Analysis 89


2.1 Data and Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
2.1.1 Levels of Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.2 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
2.2.1 Empirical Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 92
2.2.2 Theoretical Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 97
2.3 Univariate Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
2.3.1 Descriptive Statistics to Measure the Central or Positional Tendency . 103
2.3.2 Descriptive Statistics to Measure Dispersion . . . . . . . . . . . . . . . 106
2.3.3 Descriptive Statistics to Measure Shapes . . . . . . . . . . . . . . . . . 110
2.3.4 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
2.3.5 Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
2.4 Bivariate Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
2.4.1 Covariance and Correlation . . . . . . . . . . . . . . . . . . . . . . . . 117
2.4.2 Contingency Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
2.4.3 Correspondence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 121
2.4.4 One-way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
2.5 Multivariate Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
2.5.1 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
2.5.1.1 k-means Clustering . . . . . . . . . . . . . . . . . . . . . . . 124
2.5.2 Singular Value Decomposition (SVD) . . . . . . . . . . . . . . . . . . 129
2.5.3 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . . . 129
2.5.4 Factor Analysis (FA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
2.5.5 Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
2.6 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
2.6.1 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
2.6.1.1 Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
2.6.1.2 Deletion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
2.6.1.3 Imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
2.6.1.4 Smoothing Noisy Data . . . . . . . . . . . . . . . . . . . . . 134
2.6.1.5 Managing Outliers . . . . . . . . . . . . . . . . . . . . . . . . 135
2.6.1.6 Resolving Inconsistencies . . . . . . . . . . . . . . . . . . . . 135
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

3 Data Visualization 137


3.1 Visual Patterns to Observe in Data . . . . . . . . . . . . . . . . . . . . . . . . 137
3.1.1 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
3.1.2 Asymmetry of the Distribution . . . . . . . . . . . . . . . . . . . . . . 137
3.1.3 Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
3.1.4 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.1.5 Main Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.1.6 Seasonal Effect and Anomalies . . . . . . . . . . . . . . . . . . . . . . 138
3.2 ggplot2 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.2.1 Fundamentals of ggplot2 . . . . . . . . . . . . . . . . . . . . . . . . . 139
3.2.2 Case Study: Constructing Graphics Using ggplot2 . . . . . . . . . . . 139
3.2.3 Aesthetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144


3.2.3.1 Setting and Mapping Aesthetic Parameters . . . . . . . . . . 144


3.2.3.2 Inheritance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
3.2.4 Geometric Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
3.2.4.1 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
3.2.4.2 Density Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
3.2.4.3 Box and Jitter Plots . . . . . . . . . . . . . . . . . . . . . . . 151
3.2.4.4 Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
3.2.4.5 Positioning Overlapping Geometric Objects . . . . . . . . . . 160
3.2.4.6 Pie Charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
3.2.5 Statistics Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
3.2.5.1 Smoothing Functions . . . . . . . . . . . . . . . . . . . . . . 168
3.2.5.2 Functions without Data . . . . . . . . . . . . . . . . . . . . . 168
3.2.5.3 Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . 171
3.2.6 Bivariate Density Estimation . . . . . . . . . . . . . . . . . . . . . . . 171
3.2.6.1 QQ Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
3.2.6.2 Accessing Computed Statistic Values . . . . . . . . . . . . . 174
3.2.7 Scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
3.2.7.1 Controlling x and y Axes . . . . . . . . . . . . . . . . . . . . 181
3.2.7.2 Controlling Color Legends and their Properties . . . . . . . . 184
3.2.7.3 Controlling Shape/Line Legends and their Properties . . . . 189
3.2.8 Coordinate Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
3.2.9 Faceting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
3.2.10 Themes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
3.2.11 Spatial Data Visualization . . . . . . . . . . . . . . . . . . . . . . . . . 194
3.2.11.1 Visualizing Physical and Thematic Maps . . . . . . . . . . . 196
3.2.11.2 Visualizing Administrative Maps . . . . . . . . . . . . . . . . 197
3.2.11.3 Visualizing Data on top of Maps . . . . . . . . . . . . . . . . 203
3.2.11.4 ggmap Utility Functions . . . . . . . . . . . . . . . . . . . . . 209
3.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

Appendices 212

A R Code Examples 213


A.1 Vector Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

Chapter 1

R Environment, Language and


Framework

R is a software environment, language and a framework for statistical computing and graph-
ics. It is available as free software under the terms of the GNU General Public License.
The R software environment is an integrated suite of tools for data manipulation, calcula-
tion, analysis and graphics. The R language is a true programming language for extending
the capabilities of the software environment through additional functions. Hence, all ele-
ments of a full featured procedural programming language such as variables, conditionals,
loops and functions are supported by the R language. Furthermore, R supports binding C,
C++ and Fortran code especially for computationally heavy tasks. The software is available
in different formats for multiple platforms at the official R project web site1 . Finally, R is
a software framework consisting of thousands of packages. These packages are libraries (or
directories) containing groups of functions and/or data sets. The R software environment
comes with a small set of core packages which includes the package base. The base package
contains functions for the basic operators and data types and support for R as a program-
ming language. Thousands of packages for various tasks are available at Comprehensive R
Archive Network (CRAN)2 .

1.1 Basics of R
Once R is installed on your computer you need to launch the program by either double
clicking the executable or typing “R” in the console. Launching the program loads the R
software environment along with a small set of packages (including the base package) and
brings the R interactive shell or R terminal. The terminal allows a user to interact with
the R environment by entering R instructions and displaying the results of the instructions.
The terminal prompts a “>” symbol which simply indicates that R is waiting for the user’s
next instruction.
Although the interactive R terminal and a simple text editor are enough to develop and ex-
ecute R code, programmers often depend on Integrated Development Environments (IDEs)
for code development, execution and deployment. IDEs are software applications which
1 http://www.r-project.org/
2 http://cran.r-project.org/


come with a suite of development tools to support editing, compiling, interpreting, debug-
ging, code completion, refactoring, execution and version control. Typically,
IDEs support one or more languages. RStudio is a professional IDE tailored for R and
widely used in the industry. It has free and paid versions available on the official web site3 .
Since RStudio depends on R, it is recommended to first install R and then RStudio.
This book contains more than a hundred R console figures presenting R code in action.
It is highly recommended to experiment with the code by actively typing it on your own
console to reinforce your learning. Another way to reinforce your learning is to learn from
your errors. Similar to other languages, the R interpreter prints an error message for syntax
errors. Often, googling the error message will help you obtain more information about the
error as well as learn how to fix it in your code. In addition to error messages, R
generates warning messages which signal noteworthy conditions. Unlike error messages, warning
messages do not halt code execution.
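
For example, the following console session shows an error halting evaluation and a warning
accompanying a result (the exact wording of the messages may differ slightly between R
versions):

> 1 +* 2
Error: unexpected '*' in "1 +*"
> as.numeric("three")
[1] NA
Warning message:
NAs introduced by coercion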

1.1.1 Mathematical Expressions in R


R is an expression language and a mathematical expression is an elementary command in
R. In this section we first introduce arithmetic expressions and then introduce logical or
boolean expressions.
An arithmetic expression is a mathematical expression built from numeric values and
arithmetic operators which itself evaluates into a numeric value. When an arithmetic ex-
pression is given as an instruction to R, R evaluates the expression and displays the result.

Figure 1.1: Computing an arithmetic expression in R

> 1004 + 23
[1] 1027

Figure 1.1 simply shows that R evaluates a given expression and displays the result on
the next line. Note that the term “[1]” preceding the result in Figure 1.1 simply tells that
“1027” is the first element displayed on its line in the shell. In case there are multiple
elements displayed on multiple lines, R starts each line with the index of the first element
displayed on that line. This scheme makes it easy to track the elements of large vectors or
matrices. We will revisit this scheme in Section AAA.

Table 1.1: Arithmetic operators in R

Addition +
Subtraction -
Multiplication *
Division /
Power ^
Modulo %%
Integer Division %/%

3 https://www.rstudio.com/


Table 1.1 shows the arithmetic operators supported by R. Figure 1.2 shows arithmetic
expressions evaluated by R. Note that R follows the conventional order of operators where
the exponent operator has the highest precedence; followed by the multiplication and di-
vision operators; and followed by the addition and subtraction operators. The operators
having the same level of precedence are evaluated from left to right. Finally, R allows the
use of parentheses in expressions to override operator precedence and avoid ambiguity.

Figure 1.2: Arithmetic expressions in R

> 17/4
[1] 4.25
> 6*5+14
[1] 44
> 6*(5+14)
[1] 114
> 2^9
[1] 512
> (2^3+5*9)/(4+1)
[1] 10.6
> 2^3+5*9/4+1
[1] 20.25
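
The modulo and integer division operators listed in Table 1.1 can be used in the same way.
For example:

> 17 %% 4
[1] 1
> 17 %/% 4
[1] 4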

A logical (boolean) value represents the truth of a logical statement using two levels,
namely true and false. One can use keywords TRUE and FALSE or T and F to
represent logical values true and false, respectively. A logical expression is a mathematical
expression built from logical values and logical operators which itself evaluates into a logical
value. Table 1.2 shows the list of the logical operators supported by R4 .

Table 1.2: Logical operators in R

Negation !
Conjunction (And) &&
Elementwise Conjunction (And) &
Disjunction (Or) ||
Elementwise Disjunction (Or) |
Exclusive Disjunction (Xor) xor(...)

Figure 1.3 demonstrates examples of logical expressions in R.


Relational operators test the existence of a particular relation between two values. Since
the test is about the existence or lack of a relation, relational operators return a logical value.
R supports the relational operators listed in Table 1.3.
Figure 1.4 demonstrates examples of relational operators in R.
4 R facilitates the exclusive disjunction as a function rather than an operator.


Figure 1.3: Example logical expressions in R

> T & F
[1] FALSE
> TRUE & FALSE
[1] FALSE
> TRUE | FALSE
[1] TRUE
> !FALSE
[1] TRUE
> (TRUE | FALSE) & (TRUE | TRUE)
[1] TRUE
> xor(TRUE, TRUE)
[1] FALSE

Figure 1.4: Example relational operations in R

> 5.6 == 5.6


[1] TRUE
> 4 != 7
[1] TRUE
> 101 <= 101
[1] TRUE
> 101 < 101
[1] FALSE
> 101 > 102
[1] FALSE
> (101 > 102) | TRUE
[1] TRUE


Table 1.3: Relational operators in R

Equal to (=) ==
Not equal to (≠) !=
Less than (<) <
Greater than (>) >
Less than or equal (≤) <=
Greater than or equal (≥) >=

1.1.2 Simple Objects in R


In the previous section we entered mathematical expressions in R’s interactive shell. R envi-
ronment evaluated each expression and displayed its result on the next line. The evaluated
results have not been saved in any way and they just vanish after they are displayed.
Sometimes, we need to store the results of expressions in memory in order to use them
later. To store a value in the memory we need to (i) allocate a piece of memory; (ii) store
a value in the allocated memory; and (iii) use a symbolic name (an identifier) to refer to
the allocated memory. More importantly, we need R to facilitate the implementation of all
these three steps. R facilitates all these three steps via a single assignment operator, <-,
which is an arrow consisting of a “less than” and a “dash” symbol5 .

Figure 1.5: Simple assignment operator

> myage <- 21+2*2-6/2


>

Figure 1.5 demonstrates the use of the assignment operator. After the execution of the
assignment instruction nothing is displayed on R’s console. Although it seems nothing has
happened, many things happened behind the scenes. First of all, the R environment parsed
the entire instruction and divided it into left-hand side and right-hand side terms according
to the assignment operator, <-6 . Secondly, it evaluated the expression on the right-hand
side, 21 + 2 ∗ 2 − 6/2, until the entire expression is reduced to a single value, 22. Thirdly,
it allocated a piece of memory and stored the evaluated value, 22, in the allocated memory.
Finally, it used the symbolic name on the left-hand side, myage, to label the allocated
memory for future reference.
In R, a location in memory having a value and referenced by an identifier is called an
object. As a matter of fact, any simple or composite information stored in the memory and
referenced by an identifier is called an object in R.
When we create an object we associate it with an identifier. We use the identifier to
access to the value of the object later. Figure 1.6 shows that R environment displays the
value of an object when the object’s identifier (or symbolic name) is entered as an instruction.
One needs to create an object by using the assignment operator before accessing the object’s
5 Alternatively, R supports the equal sign = as an assignment operator.
6 R supports both forward and backward arrows as the assignment operator as long as the arrow points to
the identifier. However, the forward arrow is rarely used in practice and we prefer the backward arrow in this text
as well.


Figure 1.6: Printing the value of an object

> myage
[1] 22

value via its identifier.


Usually we use meaningful words, combinations of words or letters as identifiers for
objects. Yet, the identifier of an object, in R, should start with a letter [a-z A-Z] followed
by zero or more letters [a-z A-Z], digits [0-9], dots [.] or underscores [_]. Note that R is a
case sensitive language. That is, R recognizes lowercase and uppercase letters as different
symbols.
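
For instance, using a hypothetical object name, the following shows that an identifier written
with different capitalization refers to a different (here, nonexistent) object:

> total.sales <- 1500
> total.sales
[1] 1500
> Total.Sales
Error: object 'Total.Sales' not found
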
Once an object is created its value can be overridden by assigning a new value through
its identifier. Figure 1.7 shows how the value of a previously created object is overridden by
assigning a new value to it.

Figure 1.7: Overriding the value of an object

> myage <- 21+2*2-6/2


> myage
[1] 22
>
> myage <- 33
> myage
[1] 33
>
> myage <- myage + 2
> myage
[1] 35

Furthermore, one can use previously created objects in mathematical expressions solely
for computations or for creating new objects. In case R encounters identifiers representing
objects in the memory within an expression, it replaces the identifiers by retrieving the
values stored in the object. Once all objects’ values in an expression have been retrieved R
evaluates the expression as usual.
Figure 1.8 demonstrates examples of creating objects and using them in various ex-
pressions to create new objects or to do calculations. In the first example block, we created
two numeric objects, n1 and n2, set to 6.2 and 145, respectively. Then, we used these
numeric objects to do some calculations and even to create a new numeric object n3.
In the second example block of Figure 1.8 we created an object (celc) representing a
daily temperature in Celsius and set its value to 11. Then, we used the object celc in an
expression (celc*9/5+32) to create another object (fahr) for the Fahrenheit equivalent of
the same temperature. Finally, we printed the same temperature in Celsius and Fahrenheit.
In the third example block of the same Figure we built a logical expression consisting of
logical values, logical operators and a relational operator between objects celc and fahr
and stored the result in an object labeled switch.
So far we have created objects having numeric values. R fully supports character strings
to be stored in the memory and retrieved later as well. Character strings in R have to be


Figure 1.8: Using existing objects in expressions

> n1 <- 6.2


> n2 <- 145
> n1 + n2
[1] 151.2
> n2 / n1
[1] 23.3871
> n3 <- n1 - n2
> n3
[1] -138.8
>
> celc <- 11
> fahr <- celc*9/5+32
> celc
[1] 11
> fahr
[1] 51.8
>
> switch <- (FALSE | TRUE | FALSE) & (FALSE) | (fahr>=celc)
> switch
[1] TRUE

enclosed with either double quotations "..." or single quotations ’...’.

Figure 1.9: Example character string object in R

> prompt <- "hello world"


> prompt
[1] "hello world"
>
> statement <- "The set has two elements, namely \"a\" and \"b\""
> statement
[1] "The set has two elements, namely \"a\" and \"b\""

Figure 1.9 shows an example character string object in R. Usually, using double quota-
tions is preferred over single quotations in enclosing character strings. However, if a double
quotation appears as a symbol in the character string one can use the single quotations to
enclose the character string. Alternatively, one can escape the double quotation symbol(s)
appearing in the character string by preceding it (them) with the backslash character, \",
as shown in Figure 1.9.
In addition to the numeric, character string and logical data types, R allows special
values such as Inf for infinity, NA for not available (missing values), NaN for not a number
and NULL for the null object (missing objects) to be used. Figure 1.10 shows a use case for
the special values allowed in R.
Finally, R fully supports complex numbers as well. However, we will not discuss complex
numbers in this text.
Please note that not all operations are meaningful on all data types. For example, the


Figure 1.10: Special Values in R

> n1 <- 5/0


> n1
[1] Inf
> n2 <- NA
> n2
[1] NA
> n3 <- Inf -Inf
> n3
[1] NaN

“division” operator is applied to the numeric data type; however, it is not meaningful on
the logical data type or character strings. The “logical or” operator is applied to the logical data
type; however, it is not meaningful on the numeric data type or character strings. The “less
than” operator is applied to character strings (strings have lexicographical or dictionary
order) and numeric data types; however, it is not meaningful on the logical data type.
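
As a small illustration of these points, character strings can be compared with the “less
than” operator but cannot be used with arithmetic operators (the exact error message may
vary by R version):

> "apple" < "banana"
[1] TRUE
> "apple" + 1
Error in "apple" + 1 : non-numeric argument to binary operator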

1.1.3 Calling Functions in R


In this section our focus is on executing the functions that come with various R packages.
Implementing custom functions is discussed in Section AAA.
A function in R is a sequence of instructions that are encapsulated as a single unit to
perform a specific task. One can think of a function as a block of instructions with a name
(identifier). Once implemented, the entire block can be executed multiple times by using its
name. Calling or invoking a function is a command to the R environment to execute the
instructions encapsulated as a block. In order to call a function one needs to know the header
of a function which consists of a name (identifier) and zero or more parameters. The name
of the function is the identifier used to call the function in R environment. The parameters
of a function serve as placeholders to pass data to the function. Each parameter has its
own name (or identifier) within the definition of the function. The actual data passed to a
function through the parameters during the function call are called arguments. Arguments
can be literal values, objects or mathematical expressions.

Figure 1.11: R function call

functionName([param1=argument1, param2=argument2, ...])

Figure 1.11 shows the generic format for calling a function in R. To call or execute a
function, it is enough to know the name of the function and the list of the parameters that
a function expects. In general, functions, when called, take in some data; process the data;
and return the result of the computation. However, not all functions take in data nor do
they return a result. Some functions do not need any data to be provided to perform their
tasks when called. Some other functions perform their tasks and exit after without returning
any result.
Figure 1.12 demonstrates the call of sqrt function which calculates and returns the
square root of a non-negative number. The function sqrt has only one parameter with


Figure 1.12: Calling a simple function in R

> sqrt(x=16)
[1] 4
>
> side <- sqrt(x=16)
> side
[1] 4
>
> pow = 4
> exp(x=pow+2)
[1] 403.4288

name x. The parameter x acts as a placeholder for an actual value to be provided during
the function call. In the example the parameter is set to the literal argument 16. In the first code
example (Figure 1.12) the return value of sqrt is not stored in any object hence, R simply
prints the result on the terminal. In the second code example (Figure 1.12) the return value
of sqrt function is stored in the object side to be used later. In the third code example
(Figure 1.12) the exp (natural exponent) function is called where an expression, “pow+2”,
is used to set the parameter x. Note that in case a parameter of a function is set to an
expression, R evaluates the expression and sets the parameter to the evaluated result before
executing the function. The third example first evaluates pow + 2, which is 4 + 2 = 6; then
it calls the function exp(x=6), which is equivalent to e^6. Notice the difference/similarity
between the “^” power operator and the exp function.
Not all functions are required to have parameters. A function may execute its task
without requiring the caller to pass any data. For example, the getwd function returns the
working directory, i.e., the current directory, of the R environment as shown in Figure 1.13.
Notice that calling a function always requires the parentheses even though the function
does not have any parameters in its definition. As a matter of fact, the parentheses after a
function name simply imply a function call. Entering a function name without parentheses
either displays the implementation (source code) of the function or prints information about
the function. The same Figure also shows how to call the corresponding function setwd to
change the working directory by setting the parameter dir to the new working directory.
So far we presented functions that do not require a parameter or functions that require
a single parameter. However, the majority of functions that come with R require more than one
parameter to be set. In case a function requires more than one parameter, the parameters
are separated by commas “,”.
In Figure 1.14 the call to the log function requires two parameters to be set, namely x and
base. The function log(x=24, base=3) evaluates log_3(24) and returns the result.
In R, functions work with two types of parameters: mandatory parameters and optional
parameters. As the names imply, when a function is called all mandatory parameters have to
be explicitly set. On the other hand, optional parameters (default parameters) of a function
are already set to default values in the definition of the function. The optional
parameters of a function are allowed to be set to explicit values other than their default
values. That is, the default value of an optional parameter can be overwritten during the
function call.
Figure 1.15 provides three examples for the log function. In the first example only the


Figure 1.13: getwd function call

> getwd()
[1] "/home/mehmet"
>
> setwd(dir="/home/mehmet/Desktop/test")
> getwd()
[1] "/home/mehmet/Desktop/test"
>
> getwd
function ()
.Internal(getwd())
<bytecode: 0x3953858>
<environment: namespace:base>

Figure 1.14: log function call

> log(x=24, base=3)


[1] 2.892789

Figure 1.15: Mandatory and optional parameters of the log function

> log(x=16)
[1] 2.772589
>
> log(x=16, base=2)
[1] 4
>
> log(x=exp(5))
[1] 5


parameter x is explicitly set to a value while the parameter base is left untouched with
its default (implicit) value, e. Hence, the first example evaluates log_e(16), the natural
logarithm of 16. In the second example parameter x is explicitly set to 16 and parameter
base is explicitly set to 2. That is, the default value of base, e, is overwritten by 2. The
function simply evaluates log_2(16). In the last example again only the parameter x is set
to a value while keeping the parameter base at its default value, e. However, the argument
that the parameter x is set to is another function call, exp(5). In this case, R first evaluates
exp(5), then sets x to e^5 and then calls the log function.
So far we have always used the parameter names of a function when we set them to
arguments. In fact, R does not require parameter names to be used while passing
arguments. In case no parameter names are given in a function call, R sets the arguments to
parameters using the order of the parameters in the function definition.

Figure 1.16: Passing arguments by parameter order

> log(16,2)
[1] 4
>
> log(16)
[1] 2.772589

Figure 1.16 shows how the arguments 16 and 2 are mapped to the first parameter, x, and
the second parameter, base, respectively in the first example. In the second example, only
the first parameter, x, is set to an argument, i.e., 16, and the parameter base is kept with
its default value. We strongly suggest using parameter names along with the arguments
for mandatory and overwritten parameters to improve code clarity. In case there is no
ambiguity, passing arguments according to parameter order is also fine.
As a side note, R has a print function which displays the passed arguments on the
terminal. So far we have just been typing the identifier (name) of an object to print its
content on the terminal. As a matter of fact, typing the name of an object to print its
content is just a shorthand notation for calling the print function. R provides shorthand
notations for some of the frequently used functions. Figure 1.17 shows that the print
function is equivalent to its shorthand notation. Additionally, the same figure shows that
we can write expressions involving literal values, objects and functions.
One important question at this point is how an R user would know the mandatory and
optional parameters of a function as well as the order of these parameters. R provides
a function named help which has a mandatory character string parameter named topic.
Setting the topic parameter to a function name simply opens the help documentation of
the function. The help documentation provides detailed explanation of the function, its
expected parameters in order, the default values for optional parameters, the data type of
each parameter as well as example code. A shortcut for the help function in R is to prepend
the “?” symbol to the name of a function, without parentheses.
Figure 1.18 demonstrates the use of the help function and its shorthand notation. The
figure requests documentation for the absolute value function abs, the round function round,
the trigonometric cosine function cos, the summation function sum and the sample function
sample. Figure 1.19 shows the function signatures excerpted from the respective docu-
mentation files of the functions. Note that the documentation files have more detailed
descriptions of the functions as well as long explanations for each parameter. The abs func-


Figure 1.17: print function

> age <- 18


> age
[1] 18
> print(x=age)
[1] 18
>
> a <- 12
> b <- 7.6
> c <- log(a + 36.8) + 2 * sqrt(b/3)
> print(c)
[1] 7.07102

Figure 1.18: Help function in R

> help(topic="abs")
> help("round")
> ?cos
> ?sum

tion has only one mandatory parameter x. The round function has a mandatory parameter
x and an optional parameter digits which is set to the default value 0. According to the
documentation of the function the digits parameter is used to specify the number of
decimal places to be preserved while rounding a number. Similar to the abs function, the
trigonometric cos function has only one mandatory parameter x. The sum function adds up
all the objects passed as arguments and returns the summation. Its signature is different from the
other functions because the first parameter is just an ellipsis, “...”. An ellipsis appearing
as a parameter in a function signature simply implies zero or more arguments. That is,
the sum function is designed to add up an arbitrary number of objects. You can try the
function yourself with zero, one, two or more arguments without using any parameter
names. The second parameter of the sum function, na.rm, is an optional parameter set to
the default value FALSE. This parameter specifies whether to remove values that are NA
(not available) before the summation.

Figure 1.19: Function signatures excerpted from the documentation files

abs(x)
round(x, digits = 0)
cos(x)
sum (..., na.rm = FALSE)
sample(x, size , replace = FALSE, prob = NULL)
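
As a quick check of the ellipsis and the na.rm parameter described above, one may try the
sum function as follows:

> sum(2, 5, 9)
[1] 16
> sum(c(4, NA, 6))
[1] NA
> sum(c(4, NA, 6), na.rm=TRUE)
[1] 10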

Below we present three widely used functions in R: ls, str and rm. As we interact with
the R environment we create many objects and from time to time we need to list the objects
in the memory. The ls function returns the list of the objects in the current session when
called without any parameters. The rm function is used to remove (delete) existing objects

12
The copyright law [Title 17, U.S. Code] prohibits sharing and distributing any parts of this document

from the memory. The str function returns the internal structure information of an object.
Please use the help function to open the documentation of these functions. Figure 1.20
shows the use of ls, str and rm functions.

Figure 1.20: Functions ls, str and rm

> name <- "buddy"


> age <- 3
> breed <- "retriever"
> weight <- 60
>
> ls()
[1] "age" "breed" "name" "weight"
>
> str(breed)
chr "retriever"
> str(age)
num 3
>
> rm(age)
> ls()
[1] "breed" "name" "weight"
>
> rm(list = ls()) # removes all objects

We conclude this section by presenting functions for saving data in files and loading
them later. The function save is used to save one or more objects into a file on the system.
The save function supports many parameters for controlling almost all aspects of the save
operation. Among these are an ellipsis, “...”, representing the names of the objects to be
saved, separated by commas, and file, denoting the path to a destination file where the data
will be saved. In order to easily save the entire session, i.e., all objects in the memory, into
a file, R provides the function save.image. The save.image function also supports the parameter
file for specifying the destination file. However, it sets this parameter to the default value “.RData”.
Hence, if no destination file path is given to the save.image function it saves
the entire session into the file “.RData” located in the current working directory. Note that,
on Unix-like systems a period, “.”, preceding a file name implies that the file is a hidden file.
Hence, the default file might not be visible in the working directory.
The load function is used to load the data back into the memory from a file on the
system. The parameter file of the function load denotes the file to load the data from.
On some systems the content of the default file “.RData” in the default working directory
is automatically loaded when R is launched. One can simply rename the file to circumvent
this behavior.
Figure 1.21 demonstrates save, save.image and load functions for saving data on the
disk and loading them later. The file parameter of the save and load functions in the figure
is the relative or absolute file path used to save and load data, and it depends on your operating
system. The tilde (~) at the beginning of a file path is a shortcut to the user's home directory
on Unix-like systems. On Windows systems tilde expansion may not behave as expected,
hence it is recommended to use absolute file paths. Also, the forward slash “/” as a path
separator is native to Unix-like systems, and the Windows path separator “\” is a string


Figure 1.21: Saving and loading data

> name <- "buddy"


> age <- 3
> breed <- "retriever"
> weight <- 60
>
> save(name, breed, file="~/Desktop/.RData")
> q() # quit R and restart it
> load(file="~/Desktop/.RData")
> save.image()
> q() # quit R and restart it

escape character. To facilitate working with file paths, R supports forward slash even on
Windows systems, hence it is recommended to always use forward slash as path separator,
e.g., C:/Users/YourUserName/Desktop. Lastly, the function q in Figure 1.21 is used to
quit the R environment.

1.2 Composite Objects in R

Figure 1.22: Scalars and vectors in R

> n <- 5 # A scalar


> v <- c(5) # A vector
> str(n)
num 5
> str(v)
num 5
> class(n)
[1] "numeric"
> class(v)
[1] "numeric"
> length(n)
[1] 1
> length(v)
[1] 1
> is.vector(n)
[1] TRUE
> is.vector(v)
[1] TRUE

We have already seen that R fully supports simple data types (numeric values, character
strings, logical values and complex numbers) as objects. As a matter of fact, R does not
define a separate scalar or atomic object type. Scalar numbers,
character strings and logical values are considered to be vector objects of length one. Fig-
ure 1.22 demonstrates that R scalars truly behave like vectors of length one. In addition to
vectors, R supports composite objects including factors, matrices, arrays, data frames, lists
and time series which are constructed on top of the simple data types.

1.2.1 Vectors
A vector is an object that is capable of holding a sequence of values of the same data type. We
use the function c, standing for concatenate, to create vectors in R. Figure 1.23 shows how to
use the concatenation function, c, to create vectors of numbers, character strings and logical
values. We introduce two important functions, namely the mode and length functions, in the
same figure as well. The mode function gives information about the format in which an object is
stored in the memory. The length function returns the number of individual elements in an
object. In the first example (Figure 1.23) we create a vector of numbers, print the content
of the vector, get information about its mode and print its length. In the second example
(Figure 1.23) we create a vector of character strings, print the content of the vector, get
information about its mode and print its length. Note that the period “.” appearing in the
name of the object forrest.gump.cast is perfectly valid and it has no special meaning
other than being a symbol in an identifier. Also note that it took two lines for R to display
the content of the forrest.gump.cast object, and the “[5]” on the second line simply tells
that the character string immediately following it is the fifth element of the printed vector object. In the
third example (Figure 1.23) we create a vector of logical values, print the content of the
vector, get information about its mode and print its length.
All examples in Figure 1.23 demonstrate how to create a vector by using the concate-
nate function c. The concatenate function takes a list of individual values (numeric values,
character strings or logical values). Considering that R does not specify scalars or atomic
values at a higher level, these individual values are in fact vectors of length one and the
concatenate function, c, actually concatenates vectors of length one. As a natural extension,
the concatenate function can also concatenate vectors of varying lengths as shown in Fig-
ure 1.24. Please notice how the concatenate function joined the two already defined vector
objects, v1 and v2, with another vector object created on the fly, c(21, 22, 23), to form
the vector object v3.
The concatenate function is used to form vectors, and the elements of a vector are supposed
to be of the same type. In case the arguments to the concatenate function have different data
types, the function coerces logical values to the numbers 1 and 0, and numbers to their string
forms, in order to keep all elements of the vector the same data type. This behavior is demonstrated in
Figure 1.25.
In addition to the concatenation function, c, R provides many functions for forming vec-
tors. In the following we will study three functions for generating vectors of sequences (seq,
rep and sample) and three functions for generating empty vectors (numeric, character
and logical).
The seq function is used to generate a sequence of values starting from the parameter
from up to the parameter to incremented by the parameter by. All these parameters
are optional parameters set to the default value 1. Figure 1.26 shows examples of the seq
function where numbers are generated from a value up to another value with certain amount
of increments. Instead of setting the to parameter to an upper limit, one may generate a
certain number of values starting from an initial value, incremented by a certain amount.
Parameter length.out of the seq function is used to specify the number of values that
need to be generated. Example two in Figure 1.26 demonstrates the use of the length.out
parameter to generate 20 values starting from 4, incremented by 0.1. R provides a shorthand


Figure 1.23: Vectors of different data types

> # Example 1
> grades <- c(75, 36, 96, 84, 65, 51, 90)
> grades
[1] 75 36 96 84 65 51 90
> mode(grades)
[1] "numeric"
> length(grades)
[1] 7
>
> # Example 2
> forrest.gump.cast <- c("Tom Hanks", "Rebecca Williams", "Sally Field", "George
Kelly", "Margo Moorer")
> forrest.gump.cast
[1] "Tom Hanks" "Rebecca Williams" "Sally Field" "George Kelly"
[5] "Margo Moorer"
> mode(forrest.gump.cast)
[1] "character"
> length(forrest.gump.cast)
[1] 5
>
> # Example 3
> topRanked <- c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
> topRanked
[1] FALSE FALSE TRUE FALSE FALSE TRUE
> mode(topRanked)
[1] "logical"
> length(topRanked)
[1] 6

Figure 1.24: Concatenating vectors

> v1 <- c(1, 2, 3, 4, 5, 6)


> v1
[1] 1 2 3 4 5 6
> v2 <- c(11, 12, 13, 14)
> v2
[1] 11 12 13 14
> v3 <- c(v1, v2, c(21, 22, 23))
> v3
[1] 1 2 3 4 5 6 11 12 13 14 21 22 23


Figure 1.25: Concatenating vectors of different types

> v1 <- c(1, 10, 100, 1000, "Louisiana", "Texas")


> v1
[1] "1" "10" "100" "1000" "Louisiana" "Texas"
>
> v2 <- c(1, 10, 100, 1000, FALSE, TRUE, TRUE, FALSE)
> v2
[1] 1 10 100 1000 0 1 1 0

notation for the seq function where the increment amount is 1. To generate numbers starting
from k up to l with increment 1, it is enough to type k : l in the terminal. The last example
in Figure 1.26 demonstrates the shorthand notation.

Figure 1.26: seq function for generating regular sequences

> # Example 1


> seq()
[1] 1
> seq(from=1, to=10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(from=-5, to=5, by=0.5)
[1] -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5
[9] -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5
[17] 3.0 3.5 4.0 4.5 5.0
> seq(from=10, to=0, by=-1)
[1] 10 9 8 7 6 5 4 3 2 1 0
>
> # Example 2
> seq(from=4, length.out=20, by=0.1)
[1] 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9
[11] 5.0 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9
>
> # Example 3
> 5:13
[1] 5 6 7 8 9 10 11 12 13

The rep function has the signature rep(x, times = 1, length.out = NA, each = 1). The
parameter x is the vector object to be repeated. The parameter times specifies how many
times to repeat the vector object. In case times is a vector of multiple elements, each
element in times specifies the number of times the corresponding element in x is repeated.
The parameter length.out is used to set the length of the generated vector. The parameter
each is used to specify how many times to repeat each value in x. Figure 1.27 shows several
examples of the rep function.
The sample function is used to randomly select a number of elements from a vector and
it has the signature sample(x, size, replace = FALSE, prob = NULL). Parameter x specifies
the vector from which one wants to sample. Parameter size denotes the number of samples
one wants to draw. Parameter replace controls whether the sampling is with-replacement


Figure 1.27: rep function for generating repeated sequences

> rep(1:4, each = 2)


[1] 1 1 2 2 3 3 4 4
> rep(1:4, times=2)
[1] 1 2 3 4 1 2 3 4
> rep(1:4, each = 2, len = 10)
[1] 1 1 2 2 3 3 4 4 1 1
> rep(1:4, times=c(2,1,2,1))
[1] 1 1 2 3 3 4

or without-replacement. Parameter prob denotes a vector of probability weights that control
the probability of drawing each element of the vector given in x. In case prob is NULL, each
element is sampled with equal probability.

Figure 1.28: sample function for generating random sequences

> sample(x=c(1, 10, 7, 2, 6), size=10, replace=TRUE)


[1] 1 10 10 2 10 6 10 6 10 6
> sample(x=1:10, size=10, replace=FALSE)
[1] 6 4 3 8 9 5 10 1 7 2
> sample(x=1:100, size=10, replace=FALSE)
[1] 36 72 71 44 95 100 92 66 75 87
> country.codes <- c("TR", "US", "DE", "BR", "JP", "CH", "UK", "HK", "FR", "AU", "SA
")
> sample(x=country.codes, size=4)
[1] "JP" "TR" "BR" "DE"
> sample(x=country.codes, size=4)
[1] "SA" "JP" "HK" "AU"

Figure 1.28 shows several examples for generating random sequences using the sample
function. Note that the random samples in Figure 1.28 are not reproducible. That is, each
run of the examples may result in different random samples. Later in this text
we will introduce the set.seed function for generating reproducible random samples.
Please note that seq, rep and sample functions are used to generate vectors, i.e., they
return vectors.
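
As a brief preview, setting the same seed before two identical calls to sample makes them
return the same values; the exact numbers drawn depend on your R version, so the outputs
are omitted here:

> set.seed(2023)
> sample(x=1:100, size=5)
> set.seed(2023)
> sample(x=1:100, size=5) # prints the same five numbers as the previous call
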
So far we have created vectors using the assignment operator and the functions to gen-
erate a sequence of predetermined values. R provides three functions (numeric, character
and logical) for generating empty vectors. All three functions have a parameter length,
set to 0 by default, for setting the length of the vector to be generated. If the length pa-
rameter is zero they generate a vector of length zero; otherwise they generate a vector filled
with zeros (0), empty strings ("") or logical false (FALSE).
The functions numeric, character and logical also have corresponding functions
is.numeric, is.character and is.logical to check whether a vector is of type numeric,
character or logical, as well as functions as.numeric, as.character and as.logical
to coerce a vector object into a numeric, character or logical vector, respectively.
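
For example, the following console session illustrates an empty vector of a given length and
the checking and coercion functions mentioned above:

> v <- numeric(length=3)
> v
[1] 0 0 0
> character(length=2)
[1] "" ""
> is.numeric(v)
[1] TRUE
> as.character(v)
[1] "0" "0" "0"
> as.logical(c(0, 1, 5))
[1] FALSE  TRUE  TRUE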


1.2.1.1 Arithmetic and Relational Operators on Vectors


Arithmetic operators on real values are addition, subtraction, multiplication and division.
The same operators are also defined on vectors where each operator works on two vectors
instead of two real values. There are two important concepts that not only apply to vector
arithmetics but also apply to almost any function defined on vectors.

Element-wise Evaluation Rule


Arithmetic operators on two vectors are evaluated sequentially and element-by-element.
Recycling Rule
In case an operator is applied to two vectors of different lengths, the shorter vec-
tor is recycled to match the longer vector. That is, the elements of the shorter vector
are sequentially repeated and appended to it one-by-one until it matches the
longer vector in terms of length. In fact, R calls rep(x=shorterVector, each=1,
len=length(longerVector)) to recycle the shorter vector.
Figure 1.29 shows three examples for arithmetic vector addition. In the first example,
two vectors of the same length are added together. The result is evaluated by adding up the
two vectors element by element. In the second example two vectors of different lengths are
added together. The shorter vector is recycled by R to match the length and the addition
is performed as usual. Please note that in addition to evaluating the expression a warning
message is printed to inform the user about the recycling operation that took place behind
the scenes. The third example shows that the recycling step can be done manually by
using the rep function.

Figure 1.29: Arithmetic vector addition

> v1 <- c(1, 10, 100, 1000)


> v2 <- c(1, 2, 3, 4)
> v3 <- c(1, 2, 3, 4, 5, 6, 7)
>
> # Example 1
> v1 + v2
[1] 2 12 103 1004
>
> # Example 2
> v1 + v3
[1] 2 12 103 1004 6 16 107
Warning message:
In v1 + v3 :
longer object length is not a multiple of shorter object length
>
> # Example 3
> rep(x=v1, each=1, len=length(v3)) + v3
[1] 2 12 103 1004 6 16 107

The arithmetic subtraction, multiplication and division operators work exactly like the arithmetic
addition operator. That is, if the lengths of the vectors are different, R recycles the shorter one
to match the length of the longer one and then performs the operation element-by-element.


Figure 1.30 shows different examples for arithmetic operators on vectors. Notice that, the
operation v1*3 is a multiplication on a vector of length five and a vector of length one. The
vector with a single value is recycled five times before the multiplication is performed.

Figure 1.30: Arithmetic operators on vectors

> v1 <- 1:5


> v1
[1] 1 2 3 4 5
> v1 * 3
[1] 3 6 9 12 15
> v1 * rep(x=3, each=1, len=length(v1))
[1] 3 6 9 12 15
> (1:10) - 4
[1] -3 -2 -1 0 1 2 3 4 5 6
> (1:10) / 2
[1] 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> (1:10)^2
[1] 1 4 9 16 25 36 49 64 81 100

In addition to the basic arithmetic operators, the dot product and cross product are widely
used in linear algebra. These operators are covered in Section AAAA, Matrices.
In the beginning of this text we defined relational operators (“<”, “>”, “==”, “!=”, “<=”
and “>=”) as the operators which test the existence of a particular relation between two
values. The same concept can be extended to vectors where the test for a relation between
two vectors is performed element-wise. As a result applying a relational operator to two
vectors returns a vector of logical values representing the existence of the relation between
two elements located at each position. Similar to arithmetic operators, if the lengths of the
two vectors are different the shorter one is recycled until the lengths match. Figure 1.31
shows several examples of using relational operators on vectors. Notice how vectors of length
one and length two are recycled to match the lengths.

1.2.1.2 Subsetting Vectors


Retrieving individual elements or parts of vectors is called subsetting. R supports multiple
ways to subset a vector. In this section we cover three approaches to build new vectors by subsetting
a vector, i.e., retrieving parts of the vector. The first approach is called subsetting by integer
indices, the second approach is called subsetting by logical indices and the third approach
is called subsetting by names.
Elements of a vector in R are indexed using integers starting from 1 up to the length of
the vector. These indices are used to retrieve individual elements of a vector by declaring
the indices in a pair of square brackets following a vector object to be subsetted. Figure 1.32
demonstrates how to retrieve individual elements of a vector using integer indices. In the
last two examples, the index values are expressions. In case the index value is an expression
involving arithmetic operators and/or functions, the expressions inside the brackets are
evaluated first and then the related value(s) is (are) retrieved. In case the index value is
out of bounds, i.e., greater than the length of the vector, R returns NA. When the subsetting
index is 0, R returns an empty vector of the same data type as the subsetted vector.
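
For instance, using the states vector defined in Figure 1.32, an out-of-range index returns
NA and index 0 returns an empty vector of the same type:

> states[15]
[1] NA
> states[0]
character(0)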


Figure 1.31: Relational operators on vectors

> v1 <- 1:5


> v2 <- c(1, 3, 5, 2, 2)
> v1
[1] 1 2 3 4 5
> v2
[1] 1 3 5 2 2
>
> v1 < v2
[1] FALSE TRUE TRUE FALSE FALSE
> v1 == v2
[1] TRUE FALSE FALSE FALSE FALSE
> v1 >= v2
[1] TRUE FALSE FALSE TRUE TRUE
> v1 < 2
[1] TRUE FALSE FALSE FALSE FALSE
> v1 < c(2,4)
[1] TRUE TRUE FALSE FALSE FALSE
> v1 + 10 > 10
[1] TRUE TRUE TRUE TRUE TRUE

Figure 1.32: Subsetting individual elements of a vector using integer indices

> states <- c("Louisiana", "Texas", "California", "Montana", "New Hampshire", "South
Dakota", "Maine", "Ohio", "Utah", "Arizona")
> states[3]
[1] "California"
> states[1]
[1] "Louisiana"
> states[2+5] #states[7]
[1] "Maine"
> states[length(states)] #states[10]
[1] "Arizona"
> states[length(states)-6] #states[4]
[1] "Montana"


All examples in Figure 1.32 involve a single integer index. Considering that the atomic
unit in R is the vector, a single integer index is nothing but a vector of length one. As a natural
extension, one can use vectors having multiple integer indices to retrieve multiple values of
a vector.

Figure 1.33: Subsetting multiple elements of a vector using integer indices

> states <- c("Louisiana", "Texas", "California", "Montana", "New Hampshire", "South
Dakota", "Maine", "Ohio", "Utah", "Arizona")
> indexV <- c(1, 3, 9)
> states[indexV]
[1] "Louisiana" "California" "Utah"
> states[1:4]
[1] "Louisiana" "Texas" "California" "Montana"
> states[c(4, 7, 1)]
[1] "Montana" "Maine" "Louisiana"
> states[c(1:2, 9:10)]
[1] "Louisiana" "Texas" "Utah" "Arizona"
> states[rep(x=1:2, times=3)] #states[c(1, 2, 1, 2, 1, 2)]
[1] "Louisiana" "Texas" "Louisiana" "Texas" "Louisiana" "Texas"
> states[1:3 * 2] #states[c(2, 4, 6)]
[1] "Texas" "Montana" "South Dakota"

Figure 1.33 shows how to subset multiple elements of a vector using integer indices. Note
that we have used positive integers as indices of elements to be included in the retrieved
vector. An alternative way to retrieve elements of a vector is to use negative integers as
indices of elements to be excluded in the retrieved vector as shown in Figure 1.34.

Figure 1.34: Subsetting multiple elements of a vector using negative integer indices

> states <- c("Louisiana", "Texas", "California", "Montana", "New Hampshire", "South
Dakota", "Maine", "Ohio", "Utah", "Arizona")
> states[-4]
[1] "Louisiana" "Texas" "California" "New Hampshire" "South Dakota"
[6] "Maine" "Ohio" "Utah" "Arizona"
> states[-5:-1]
[1] "South Dakota" "Maine" "Ohio" "Utah" "Arizona"
> states[c(-1, -2, -5, -7, -9, -10)]
[1] "California" "Montana" "South Dakota" "Ohio"

In addition to integer indices R supports logical indices (consisting of a sequence of


TRUE and FALSE values) to subset a vector. Similar to an integer index vector, a logical
index vector is provided within the square brackets following the name of the vector object.
Ideally, the length of the logical index vector should be the same as the length of the vector
from which one wants to retrieve elements. If the logical index vector is shorter, R recycles
the logical index vector to match the lengths. Once the lengths are matched, R retrieves each
element of the vector for which the logical vector has the value TRUE and skips elements for which
the logical vector has value FALSE. Figure 1.35 demonstrates several examples about the use


of logical indices to subset elements of a vector. In the last example, R implicitly recycles
the logical index vector of length three.

Figure 1.35: Subsetting elements of a vector using logical indices

> v1 <- 1:10


> v1
[1] 1 2 3 4 5 6 7 8 9 10
> indexV <- c(FALSE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,FALSE,TRUE)
> v1[indexV]
[1] 2 6 7 10
> v1[c(TRUE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE)]
[1] 1 3 4
> v1[rep(x=c(TRUE,FALSE), length.out=length(v1))]
[1] 1 3 5 7 9
> v1[c(TRUE, FALSE, FALSE)]
[1] 1 4 7 10

Remember that relational operators on vectors generate a logical vector. This logi-
cal vector can be used to subset the elements of a vector satisfying a particular relation.
Figure 1.36 shows several examples where relational operators are used within the square
brackets to generate a logical vector satisfying the relational condition.

Figure 1.36: Using relational operators to generate logical indices

> v1 <- sample(x=1:100, size=10, replace=TRUE)


> v1
[1] 73 62 9 25 40 38 74 25 20 64
> indexV <- v1 >= 50
> indexV
[1] TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
> v1[indexV]
[1] 73 62 74 64
> v1[v1>50]
[1] 73 62 74 64
> v1[(v1 %% 2) == 0] # Only even numbers
[1] 62 40 38 74 20 64
> v1[sqrt(v1) > 8.0]
[1] 73 74
> v1[v1>10 & v1<30]
[1] 25 25 20

R supports subsetting vectors using the names (character strings) of individual elements
appearing in a vector. So far, we have not named individual elements of vectors. However, R
provides the names function to set names to the individual elements of a vector or get the
names of the elements of a vector. The function sets names to the individual elements of a
vector if it is used on the left hand side of an assignment operator and retrieves the names
of the elements otherwise. In case no names have been assigned the function returns NULL.
If only a subset of the elements are assigned a name the function returns NA for elements


without a name. If the same character string is used to name multiple elements, subsetting
with that name returns only the first matching element. Figure 1.37 demonstrates the use of the names
function via several examples.

Figure 1.37: Subsetting using names

> v1 <- c(67, 74, 90)


> names(v1)
NULL
> names(v1) <- c("Assignment1", "Assignment2", "Assignment3")
> names(v1)
[1] "Assignment1" "Assignment2" "Assignment3"
> v1["Assignment1"]
Assignment1
67
> v1[c("Assignment1", "Assignment3")]
Assignment1 Assignment3
67 90

1.2.1.3 Modifying Vectors


R allows users to modify vectors as well. The concatenation function, c, can be used to create
a new vector by joining two or more vectors. Inserting a new element at the beginning or
the end of a vector can be performed using the concatenation function. To insert a new
element at an arbitrary position in a vector one needs to split the original vector into two
and concatenate them with the new element in the middle. Figure 1.38 shows how
to use integer indices and concatenation to insert an element at the end, at the beginning
and at an arbitrary position of a vector. Note that these operations do not modify the
allocated memory blocks of the vectors; instead, they create a new vector each time and
copy the elements of the operand vectors into the new vector. Hence, such operations might
be computationally expensive for large vectors.
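As an aside (not covered in Figure 1.38), base R also provides the append function, which performs the split-and-concatenate steps internally; a minimal sketch:

> v1 <- 91:95
> append(x=v1, values=100, after=2)   # insert 100 after the second element
[1]  91  92 100  93  94  95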
The operations demonstrated in Figure 1.38 create new vectors each time because they
do not alter the memory blocks of fixed length vectors. Overwriting one or more elements of
a vector does not change the length of the vector; hence, the operation is performed directly
in the memory block of the vector. R uses subsetting along with the assignment operator to
modify one or more elements of a vector. The left hand side of the assignment operator is
the subset of the vector to be overwritten. The right hand side of the assignment operator
is the vector with the values to overwrite. Note that the recycling rule applies if the left hand side
vector is larger than the right hand side vector. That is, the vector with the values to overwrite
is recycled until it matches the length of the subset vector. On the other hand, if the left
hand side vector is the smaller one, then the required number of values from the right hand
side vector are used to overwrite the subset and a warning message is displayed by R.
Figure 1.39 demonstrates several examples of vector modification. In the last example the
is.na function is used to obtain a logical vector marking the NA values using logical TRUE.
Then, the unit length vector 0 is recycled to overwrite the subsetted elements. Note that we
also introduce the set.seed function in this example. The sample function randomly picks
a different sequence of values each time it is called. Calling set.seed with a fixed value


Figure 1.38: Inserting an element in a vector

> v1 <- 91:95


> v1
[1] 91 92 93 94 95
> # inserting an element at the end of a vector
> v1 <- c(v1, 96)
> v1
[1] 91 92 93 94 95 96
> # inserting an element in the beginning of a vector
> v1 <- c(90, v1)
> v1
[1] 90 91 92 93 94 95 96
> # inserting an element at an arbitrary position, e.g., 6
> v1 <- c(v1[1:(6-1)], 100, v1[6:length(v1)])
> v1
[1] 90 91 92 93 94 100 95 96

just before the sample function makes it pick the same sequence, determined by the value
given to set.seed. Hence, one can generate reproducible “random” values.
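A brief sketch (not part of Figure 1.39) confirming that a fixed seed makes sample reproducible:

> set.seed(123)
> s1 <- sample(x=1:100, size=5)
> set.seed(123)
> s2 <- sample(x=1:100, size=5)
> identical(s1, s2)   # the same seed yields the same "random" draw
[1] TRUE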

1.2.1.4 Essential Vector Functions and Operators


We have already covered exponential function exp, square root function sqrt, logarithm
function log as well as trigonometric functions sin, cos, tan and their inverses. These
functions expect an argument x and return a transformed value accordingly. When these
functions are applied to vectors of length greater than one, they return a vector of the same
length where each value in the original vector is transformed accordingly. This provides us
a very powerful tool to express various mathematical functions evaluated at different points.
Figure 1.40 shows several examples of evaluating functions at different points.
The paste function in R is used to concatenate a number of character strings. Its
parameter sep denotes the separator character between the concatenated strings and its
default value is a space character “ ”. When applied to two vectors, the paste function
coerces the vectors to character strings and concatenates the vectors element by element.
Again, if the lengths of the vectors differ R recycles the shorter one. Figure 1.41 shows two
examples for the paste function.
The in operator, %in%, is a binary operator that applies to two vectors. The %in%
operator returns a logical vector denoting if each element in the left hand side vector also
appears in the right hand side vector. Figure 1.42 demonstrates the use of the %in% operator.
The functions sum and prod return the sum and product of the values of a numeric
vector, respectively. The functions cumsum and cumprod return a vector of cumulative
sum and cumulative products of a numeric vector, respectively. The functions max and
min return the maximum and minimum values of a vector, respectively. The functions
which.max and which.min return the indices of the maximum and minimum values of a
vector, respectively. The functions union, intersect, setdiff and setequal accept
two vectors as arguments and return the set union, intersection, difference and equality
(logical) of the two vectors. Note that these functions are set functions and they discard
any duplicates in the results. The rev function returns the reversed version of the argument


Figure 1.39: Modifying elements of a vector

> # Example set 1


> states <- c("Louisiana", "Texas", "California", "Maine", "Ohio")
> states
[1] "Louisiana" "Texas" "California" "Maine" "Ohio"
> states[3] <- "Oregon"
> states
[1] "Louisiana" "Texas" "Oregon" "Maine" "Ohio"
>
> states[c(1,5)] <- c("Arkansas", "Kansas")
> states
[1] "Arkansas" "Texas" "Oregon" "Maine" "Kansas"
>
> states[1:5] <- states[c(4:5,1:3)] # rotate by two positions
> states
[1] "Maine" "Kansas" "Arkansas" "Texas" "Oregon"
>
> # Example set 2
> set.seed(1001)
> v1 <- sample(x=c(1:100, rep(x=NA, times=10)), size=12, replace=FALSE)
> v1
[1] NA 45 47 NA 46 94 1 9 30 78 NA 14
> is.na(v1)
[1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
> v1[is.na(v1)] <- 0
> v1
[1] 0 45 47 0 46 94 1 9 30 78 0 14

Figure 1.40: Essential functions in R

> x <- seq(from=0, to=2, by=0.25)


> x
[1] 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00
> exp(x)
[1] 1.00 1.28 1.64 2.11 2.71 3.49 4.48 5.75 7.38
> log(x)
[1] -Inf -1.38 -0.69 -0.28 0.00 0.22 0.40 0.55 0.69
> exp(x) * 4
[1] 4.00 5.13 6.59 8.46 10.87 13.96 17.92 23.01 29.55
> 2^(1:10)
[1] 2 4 8 16 32 64 128 256 512 1024

Figure 1.41: The paste function

> paste("Hello", "World", sep="----")


[1] "Hello----World"
> paste("Assignment", 1:5, sep="")
[1] "Assignment1" "Assignment2" "Assignment3" "Assignment4" "Assignment5"


Figure 1.42: %in% operator

> v1 <- c(46, 41, 43, 49, 42, 45, 48, 41)
> v2 <- c(49, 58, 41, 57)
> v1 %in% v2
[1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
> v2 %in% v1
[1] TRUE FALSE TRUE FALSE

vector, i.e., the order of its elements is reversed. The unique function
returns a version of the argument vector where duplicate elements are discarded. The sort
function returns the sorted version of the argument vector. The parameter decreasing
which is set to FALSE by default controls whether the sort is in ascending or descending
order. In case the vector given to the sort function is a character string vector, the function
sorts lexicographically (dictionary order). Figure 1.43 shows examples of these essential
functions.
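The set functions that do not appear in Figure 1.43 behave as follows (a short illustrative session):

> u <- c(1, 2, 2, 3)
> w <- c(3, 4)
> union(u, w)       # duplicates are discarded
[1] 1 2 3 4
> setdiff(u, w)
[1] 1 2
> setequal(u, w)
[1] FALSE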
The order function returns an integer index vector representing the correct order of the
elements of a vector in ascending order. Subsetting a vector using its “order” is equivalent
to sorting the vector. The rank function returns a vector representing the rank of each
element in the argument vector in ascending order. In case there are elements appearing
multiple times, their rank is the average of their ranks by default. Since an average value is a real
number, the rank function returns a vector of real values to represent the ranks. In fact the
rank function has a parameter named ties.method which takes a value among average,
first, random, max and min to control how to rank the elements appearing multiple times.
You may try setting the parameter to different values to see the effect. Figure 1.44 shows
examples of the order and rank functions.
The which function works on logical vectors in its simplest form. It takes a logical vector
and returns a vector consisting of integer indices for which the value of the logical vector
is TRUE. The first example in Figure 1.45 shows this simple behavior. However, the true
strength of the which function comes when its argument is the result of a relational or logical
operator on vectors. Then, the function returns a vector of integer indices for which the
relational or logical operator evaluates TRUE. In other words, it returns the integer indices of
the elements which satisfy the relational or logical operator. The resulting index vector can
be used to subset the original vector to obtain a vector satisfying some logical conditions, as
shown in the last example of Figure 1.45.
Another versatile and very useful function in R is the table function. Called on a vector,
it displays the frequency of each element in the vector. Figure 1.46 shows an example for
the table function.
The table function simply returns the frequency distribution of a vector. By default it
calculates the frequency of each unique element in the vector. Sometimes one may need to
calculate the frequencies based on custom intervals rather than individual values. That is
one may need to explicitly divide the range of the data into hypothetical bins and calculate
the frequencies falling into each bin. R provides the cut function which allows us to explicitly
define intervals and obtain a collection of intervals where each value in the vector is replaced
by its interval in the new collection.
To use the cut function for evaluating the interval frequency distribution of a vector, the
following four steps should be followed: (i) determine the range of the vector; (ii) create a


Figure 1.43: Essential vector functions

> v1 <- c(46, 41, 43, 49, 42, 45, 48, 41)
> v1
[1] 46 41 43 49 42 45 48 41
> sum(v1)
[1] 355
> cumsum(v1)
[1] 46 87 130 179 221 266 314 355
> prod(v1)
[1] 1.478064e+13
> max(v1)
[1] 49
> min(v1)
[1] 41
> which.max(v1)
[1] 4
> which.min(v1)
[1] 2
> intersect(v1, c(57, 41, 89, 90, 45))
[1] 41 45
> rev(v1)
[1] 41 48 45 42 49 43 41 46
> unique(v1)
[1] 46 41 43 49 42 45 48
> sort(v1)
[1] 41 41 42 43 45 46 48 49
> sort(v1, decreasing=TRUE)
[1] 49 48 46 45 43 42 41 41

Figure 1.44: The order and rank functions

> v1 <- c(46, 41, 43, 49, 42, 45, 48, 41)
> v1
[1] 46 41 43 49 42 45 48 41
> order(v1)
[1] 2 8 5 3 6 1 7 4
> v1[order(v1)] # Equivalent to sort
[1] 41 41 42 43 45 46 48 49
> rank(v1)
[1] 6.0 1.5 4.0 8.0 3.0 5.0 7.0 1.5
> rank(v1, ties.method="first")
[1] 6 1 4 8 3 5 7 2
> rank(v1, ties.method="min")
[1] 6 1 4 8 3 5 7 1


Figure 1.45: The which function

> which(c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE))


[1] 1 3 4 8
>
> v1 <- c(46, 41, 43, 49, 42, 45, 48, 41)
> v1
[1] 46 41 43 49 42 45 48 41
> which(v1==41)
[1] 2 8
> which(v1==41 | v1>46)
[1] 2 4 7 8
> v1[which(v1==41 | v1>46)]
[1] 41 49 48 41

Figure 1.46: The table function

> v1 <- c(46, 41, 43, 49, 42, 45, 48, 41)
> v1
[1] 46 41 43 49 42 45 48 41
> table(v1)
v1
41 42 43 45 46 48 49
2 1 1 1 1 1 1

vector defining intervals or specify the number of breaks; (iii) use the cut function to obtain
a collection of intervals and (iv) use the table function for the frequency distribution of
the collection of intervals. Figure 1.47 shows how the four steps are followed to evaluate
the frequency distribution of a large vector based on five intervals. The cut function can
be fine-tuned through several parameters. It is strongly recommended to browse the
documentation of the cut function.

Figure 1.47: Frequency distributions of custom intervals

> help("cut")
> v1 <- sample(x=seq(from=0, to=1, by=0.001), size=5000, replace=TRUE)
> breaks.v1 <- seq(from=min(v1), to=max(v1), by =0.2)
> interval.v1 <- cut(x=v1, breaks=breaks.v1, include.lowest=TRUE)
> table(interval.v1)
interval.v1
[0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] (0.8,1]
1054 984 959 1028 975

Two other functions we present here are any and all, which only work on logical
vectors. As the name implies, any checks whether any of the elements of the argument
vector is a logical TRUE. Similarly, all checks whether all of the elements of the argument
vector are logical TRUEs. Figure 1.48 gives examples of the any and all functions.


Figure 1.48: The any and all functions

> v1 <- c(46, 41, 43, 49, 42, 45, 48, 41)
> any(v1>50)
[1] FALSE
> any(v1<45 | v1>50)
[1] TRUE
> all(v1>=40)
[1] TRUE

It is a common task in R to compare whether two vectors are equal or not. We have already
covered the relational operator “==” which compares the matching elements of two vectors
and returns a vector of logicals with TRUE for the elements that are equal. It is possible
to use the equality operator along with the function all to check whether all elements of two vectors
are equal or not.
R provides two more comparison functions that behave slightly differently. The identical
function checks if two vectors are exactly identical in terms of both values and data types
used to hold the values in the memory. The all.equal function is used to check if the
matching elements of two vectors are nearly equal, i.e., equal within a tolerance boundary.
Both identical and all.equal functions return a single logical value representing whether
two vectors are identical and equal within a tolerance, respectively. Figure 1.49 shows the
difference among alternative methods for comparing two vectors. We suggest using the
equality operator “==” for comparing vectors unless you know what you are doing.

Figure 1.49: Equality in R

> v1 <- c(as.integer(3),as.integer(9))


> v2 <- c(as.numeric(3), as.numeric(9))
> v3 <- c(as.numeric(3.001), as.numeric(9.001))
> class(v1)
[1] "integer"
> class(v2)
[1] "numeric
> v1 == v2
[1] TRUE TRUE
> all(v1==v2)
[1] TRUE
> identical(v1,v2)
[1] FALSE
> v2==v3
[1] FALSE FALSE
> all.equal(v2, v3, tolerance=0.002)
[1] TRUE


1.2.1.5 Vectors in Linear Algebra


In algebra we often deal with scalars and the operations defined on scalars such as addition,
subtraction and multiplication. In linear algebra we often deal with collections of scalars
called vectors and the operations defined on vectors. More formally, vectors are multidimen-
sional quantities that are elements of a vector space along with a set of real numbers called
scalars (in general any Field) such that the vectors can be added together and multiplied by
scalars while holding closure, associativity, distributivity, commutativity, additive identity,
additive inverse and unitary properties.
Algebraically, a vector of length p has p dimensions and each value represents the amount
of the vector contained in a particular dimension. The common typographic notation for
vectors is bold, lower case letters. Vectors consisting of all zeros and all ones are denoted as 0
and 1, respectively. For example, v = [3.0, 3.5] is a vector of GPAs of a double major student
majoring in Mathematics and Informatics. Each major is considered to be a dimension and
each GPA is the amount of the vector that is contained in that dimension. More precisely,
each dimension is represented by a standard basis vector and each entry is the amount of
the vector along the direction of the standard basis vector. These amounts are also called
the coordinates of v with respect to the standard basis and they can be expressed as

$$\mathbf{v} = 3.0\begin{bmatrix}1\\0\end{bmatrix} + 3.5\begin{bmatrix}0\\1\end{bmatrix} \qquad (1.1)$$

or in a more compact form

$$\mathbf{v} = \begin{bmatrix}1 & 0\\0 & 1\end{bmatrix}\begin{bmatrix}3.0\\3.5\end{bmatrix} \qquad (1.2)$$

Figure 1.50: An example vector

[Plot: the vector v = [3.0, 3.5]^T drawn as an arrow in the Mathematics–Informatics plane, with its amounts along the directions of the standard basis vectors [1, 0]^T and [0, 1]^T marked.]

Figure 1.50 shows vector v = [3.0, 3.5] by a black arrow. In the figure the directions of
the dimensions, i.e., Mathematics and Informatics, are shown by dashed lines; the standard


basis vectors of the dimensions are shown by red and blue arrows; the amount of v along
the direction of each standard basis vector is shown by dotted lines.
Vectors in linear algebra can be defined as row vectors, e.g., v = [3.0, 3.5], or column
vectors, e.g., v = [3.0, 3.5]^T. The transpose operator, T, implemented by the function t in R, expresses a row
vector as a column vector and a column vector as a row vector.

$$[3.0, 3.5]^T = \begin{bmatrix}3.0\\3.5\end{bmatrix} \qquad (1.3)$$

$$\begin{bmatrix}3.0\\3.5\end{bmatrix}^T = [3.0, 3.5] \qquad (1.4)$$

In general, the column vector notation is preferred over the row vector notation. Hence, we
adopt the column vector notation in the rest of the manuscript, e.g., v = [3.0, 3.5]T .
A vector object in R is a data structure holding a collection of items of the same mode.
Unfortunately, the R vector object does not correspond to a vector object in linear algebra.
One needs to convert it into a linear algebra vector using the as.matrix function.

Figure 1.51: R vectors in the sense of linear algebra

> u <- c(3.0, 3.5) # A vector object in R


> u
[1] 3.0 3.5
> v <- as.matrix(c(3.0, 3.5)) # A vector object in linear algebra
> v
[,1]
[1,] 3.0
[2,] 3.5
> t(v) # Transpose of v
[,1] [,2]
[1,] 3 3.5

Several operators are defined on vectors in linear algebra, including transpose, addition,
subtraction, scalar-vector multiplication, dot product, norm (length) and projection.

Addition and subtraction. Addition and subtraction operators work element-wise.

$$\begin{bmatrix}v_1\\v_2\\\vdots\\v_p\end{bmatrix} + \begin{bmatrix}u_1\\u_2\\\vdots\\u_p\end{bmatrix} = \begin{bmatrix}v_1+u_1\\v_2+u_2\\\vdots\\v_p+u_p\end{bmatrix} \qquad (1.5)$$

Similarly,

$$\begin{bmatrix}v_1\\v_2\\\vdots\\v_p\end{bmatrix} - \begin{bmatrix}u_1\\u_2\\\vdots\\u_p\end{bmatrix} = \begin{bmatrix}v_1-u_1\\v_2-u_2\\\vdots\\v_p-u_p\end{bmatrix} \qquad (1.6)$$


Scalar-vector multiplication. The scalar-vector multiplication operation, av = va, is
the multiplication of a scalar and a vector and it also works element-wise:

$$a\begin{bmatrix}v_1\\v_2\\\vdots\\v_p\end{bmatrix} = \begin{bmatrix}av_1\\av_2\\\vdots\\av_p\end{bmatrix} \qquad (1.7)$$

The scalar-vector multiplication operation implicitly indicates the scalar-vector division op-
eration, because dividing a vector by a scalar is equal to multiplying it by the reciprocal of
the scalar, i.e., v/a = (1/a)v. The scalar-vector multiplication operation holds the associative,
commutative and distributive properties.

Figure 1.52: Basic vector operations in linear algebra

> u <- as.matrix(c(1,3,5,7))


> v <- as.matrix(c(2,4,6,8))
> u + v # Vector addition
[,1]
[1,] 3
[2,] 7
[3,] 11
[4,] 15
> u - v # Vector subtraction
[,1]
[1,] -1
[2,] -1
[3,] -1
[4,] -1
> 3*u # Scalar-vector multiplication
[,1]
[1,] 3
[2,] 9
[3,] 15
[4,] 21

Dot product. By far, the most important operation on vectors is the dot product opera-
tion. The dot product of two vectors, u · v = u^T v, generates a scalar value
as defined in Equation 1.8.

$$\mathbf{u}\cdot\mathbf{v} = \begin{bmatrix}u_1\\u_2\\\vdots\\u_p\end{bmatrix} \cdot \begin{bmatrix}v_1\\v_2\\\vdots\\v_p\end{bmatrix} = \begin{bmatrix}u_1 & u_2 & \ldots & u_p\end{bmatrix}\begin{bmatrix}v_1\\v_2\\\vdots\\v_p\end{bmatrix} = \sum_{i=1}^{p} u_i v_i \qquad (1.8)$$

Since the dot product operation generates a scalar value, it is also called the scalar product.
In some texts, the dot product operation is also called the inner product. In fact, the inner product is
the generalization of the dot product to vector spaces beyond the finite dimensional
Euclidean space over the real numbers, R^p.


The dot product has the following properties:

• Dot product is commutative, i.e., u^T v = v^T u
• Dot product is distributive, i.e., u^T (v + z) = u^T v + u^T z
• Dot product is not associative under vector-vector product
• Dot product is associative under scalar-vector product, i.e., u^T (av) = (au^T)v = a(u^T v)

One interpretation of the dot product operation is that it represents weighted linear
combinations in a compact form. A linear combination is a mathematical expression denoted
by the sum of variables weighted by constants. For example, a_1 x_1 + a_2 x_2 + . . . + a_p x_p is the
linear combination where each variable x_i is weighted by a constant a_i ∈ R. This expression,
also called a weighted sum, is represented compactly as a dot product in Equation 1.9.

$$a_1x_1 + a_2x_2 + \ldots + a_px_p = \begin{bmatrix}x_1 & x_2 & \ldots & x_p\end{bmatrix}\begin{bmatrix}a_1\\a_2\\\vdots\\a_p\end{bmatrix} = \mathbf{x}^T\mathbf{a} \qquad (1.9)$$
where a and x are the vectors consisting of the constants and the variables, respectively.
In R one can either use the matrix multiplication operator %*% or the dot function in the
geometry package for the dot product. The magnitude of a vector is computed either as the
square root of the dot product of the vector with itself or via the norm function. Figure 1.53
presents the dot product and norm operations.

Figure 1.53: Dot product and norm operations

> u <- as.matrix(c(1,3,5,7))


> v <- as.matrix(c(2,4,6,8))
> t(u) %*% v # Dot product of u and v
[,1]
[1,] 100
> library(geometry)
> dot(x=u, y=v) # Dot product of u and v
[1] 100
> sqrt(t(u) %*% u) # Magnitude or length of u
[,1]
[1,] 9.165151
> norm(u, type="2") # Magnitude or length of u
[1] 9.165151
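To make the weighted-sum interpretation in Equation 1.9 concrete, a short illustrative session (not part of Figure 1.53) evaluates the same weighted sum both as a dot product and element-wise:

> a <- as.matrix(c(0.3, 0.3, 0.4))    # weights
> x <- as.matrix(c(85, 92, 78))       # variable values
> t(x) %*% a                          # the weighted sum as the dot product x^T a
     [,1]
[1,] 84.3
> sum(x * a)                          # the same weighted sum computed element-wise
[1] 84.3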

Norm of a vector. The coordinates constituting a vector define a direction. In addition
to a direction, a vector has a magnitude or length. The length, also called the magnitude or
norm, of a vector v is defined as $||\mathbf{v}|| = \sqrt{\sum_i v_i^2} = \sqrt{\mathbf{v}^T\mathbf{v}}$. A vector is said to have unit length
when its length is one. In some cases only the direction of the vector is important and the
vector can be normalized to unit length, i.e., v̂ = v/||v||. Lengths of
vectors lead to a very nice geometrical interpretation of the dot product operation.


(a) 0 ≤ θ < 90◦ (b) 90◦ < θ ≤ 180◦ (c) θ = 90◦

Figure 1.54: Geometric interpretation of the dot product of two vectors.

The dot product of two vectors is the product of their lengths and the cosine of the angle
between them, i.e., u · v = u^T v = ||u|| ||v|| cos(θ) where θ is the angle between u and v.
Figure 1.54 shows the geometric interpretation of the dot product of two vectors. The length of a
vector is always positive, hence the sign of the dot product of two vectors is determined by
the angle between them. If the angle is acute, the dot product is positive. If the angle is
obtuse, the dot product is negative. Lastly, and most importantly, if the angle is right, the
dot product is zero. That is, if two vectors are perpendicular (orthogonal) to each other, their
dot product is zero. In addition, the geometric interpretation of the dot product operation
leads to the famous similarity measure of two vectors named the cosine similarity. The
cosine similarity score is defined as $\cos_{u,v}(\theta) = \frac{\mathbf{u}^T\mathbf{v}}{||\mathbf{u}||\,||\mathbf{v}||}$ and it reflects whether two vectors
point in the same or opposite directions “in general” regardless of their magnitudes.
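A minimal illustrative session (not from the text) normalizes a vector to unit length and confirms that two perpendicular vectors have a zero dot product:

> v <- as.matrix(c(3, 4))
> v.hat <- v / norm(v, type="2")   # unit-length version of v
> norm(v.hat, type="2")
[1] 1
> u <- as.matrix(c(-4, 3))         # u is perpendicular to v
> t(v) %*% u                       # right angle, so the dot product is zero
     [,1]
[1,]    0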

Scalar and vector projection. The scalar projection of one vector onto another vector
is also an important vector operation. The scalar projection u_v of vector u onto vector v is
the magnitude of the component of u along the direction of vector v. Because the definition
refers only to the magnitude or length of the vector along the direction of the other
vector, it is called scalar projection.

(a) Scalar Projection on v (b) Scalar Projection on e_i (c) Vector Projection on v

Figure 1.55: Scalar and vector projections present an alternative interpretation of the dot
product of two vectors.

Figure 1.55a shows the scalar projection of u onto v in red. Applying basic trigonometry,
the length of the double green line is u_v = ||u|| cos(θ) where θ is the angle between u and
v. One can simplify the scalar projection u_v further as


$$u_v = ||\mathbf{u}||\cos(\theta) = ||\mathbf{u}||\,\frac{\mathbf{u}^T\mathbf{v}}{||\mathbf{u}||\,||\mathbf{v}||} = \frac{\mathbf{u}^T\mathbf{v}}{||\mathbf{v}||} \qquad (1.10)$$

Equation 1.10 allows us to have an alternative interpretation of the dot product operation
when the length of v is one, i.e., ||v|| = 1. The dot product u^T v gives the amount of u along
the direction of v when v is a unit vector. If v is not a unit vector, it can be normalized as
v/||v||. The most important conclusion of the projection operation is that the coordinates of
a vector simply denote the amounts of the vector along the directions of the standard basis
vectors, as shown in Figure 1.55b.
The vector projection $\mathbf{u}_v$ of vector u onto vector v is the magnitude of the component
of u along the direction of vector v multiplied by the normalized v, as shown in Figure 1.55c
and Equation 1.11.

$$\mathbf{u}_v = u_v\,\frac{\mathbf{v}}{||\mathbf{v}||} = ||\mathbf{u}||\cos(\theta)\,\frac{\mathbf{v}}{||\mathbf{v}||} = \frac{\mathbf{u}^T\mathbf{v}}{||\mathbf{v}||}\,\frac{\mathbf{v}}{||\mathbf{v}||} = \frac{\mathbf{u}^T\mathbf{v}}{||\mathbf{v}||^2}\,\mathbf{v} \qquad (1.11)$$

Scalar and vector projections play an important role in principal component analysis, a
technique to change the basis of a vector space along the directions of the largest variances,
i.e., eigenvector basis.
Although there are libraries providing functions for the cosine similarity, scalar pro-
jection and vector projection operations, their implementations are not very difficult. In
Appendix A.1, Figure A.1 demonstrates self-reliant, custom implementations of these func-
tions.
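For reference, a minimal self-contained sketch of two such helpers is shown below; the function names cosine.similarity and scalar.projection are illustrative and are not necessarily those used in Appendix A.1. Both expect column vectors created with as.matrix.

> cosine.similarity <- function(u, v) {
+   as.numeric(t(u) %*% v) / (norm(u, type="2") * norm(v, type="2"))
+ }
> scalar.projection <- function(u, v) {
+   as.numeric(t(u) %*% v) / norm(v, type="2")   # amount of u along the direction of v
+ }
> u <- as.matrix(c(3, 4)); v <- as.matrix(c(1, 0))
> cosine.similarity(u, v)
[1] 0.6
> scalar.projection(u, v)     # the first coordinate of u, as expected
[1] 3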

1.2.2 Factors
A factor object in R is a vector which can only take values from a finite number of distinct
values. These objects are also called categorical variables. Factors (or categorical variables)
take qualitative values that do not support all arithmetic operators. Examples of categorical
variables are gender (Male, Female), opinion (Good, Mediocre, Bad) and rank (1, 2, 3, 4,
5). Note that even though the last example seems to take numeric values, these values are
not quantitative because the distance between consecutive values is not the same. The
values of categorical variables can either be integers or strings. In any case, it is important
to declare them as factors because categorical variables are treated differently in statistics.
The function factor creates a factor object from a vector of integers or character strings.
The signature of the function is factor(x, levels = sort(unique(x), na.last=TRUE), labels = levels, exclude = NA, ordered = is.ordered(x)).


The only mandatory parameter of the function is x, which is an integer or character string vector
to be used for generating a factor object. Other important parameters are levels, denoting all possible
levels (or categories) of the categorical variable; labels, providing a string to represent each possible
level (or category); and ordered, representing whether there is an order within the levels (or categories)
of the categorical variable. Although R displays factors using the labels corresponding to
levels, the levels are encoded as integers to save memory. Hence, one can think of a factor
object as a composite object consisting of a set of integers as levels, an integer vector of levels
representing the data, a set of strings or numbers that label each level and a logical variable
denoting if the levels are ordered.

Figure 1.56: Factor objects and factor function

> factor(x=c(1, 2, 2, 3, 1, 3))


[1] 1 2 2 3 1 3
Levels: 1 2 3
> factor(x=c(1, 2, 2, 3, 1, 3), levels=1:5)
[1] 1 2 2 3 1 3
Levels: 1 2 3 4 5
> factor(x=c(1, 2, 2, 3, 1, 3), levels=1:5, labels=c("Business", "Health", "
DigitalArts", "SystemAdmin", "Personalized"))
[1] Business Health Health DigitalArts Business DigitalArts
Levels: Business Health DigitalArts SystemAdmin Personalized
>
> set.seed(1001)
> v1 <- factor(x=sample(x=1:3, size=24, replace=TRUE), labels=c("bad", "mediocre", "
good"), ordered=TRUE)
> v1
[1] good mediocre mediocre mediocre mediocre good bad bad
[9] bad good mediocre bad good mediocre bad good bad mediocre
[19] bad good mediocre mediocre good good
Levels: bad < mediocre < good
>
> v1[10]
[1] good
Levels: bad < mediocre < good

Figure 1.56 demonstrates the use of factor function for creating factor objects. R also
provides utility functions is.factor to check if an object is a factor and levels to obtain
the levels of a factor object.
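A brief illustrative session (not part of Figure 1.56) showing is.factor, levels and the integer codes behind the labels:

> f <- factor(x=c("low", "high", "low"), levels=c("low", "high"))
> is.factor(f)
[1] TRUE
> levels(f)
[1] "low"  "high"
> as.integer(f)   # the integer codes used to store the factor
[1] 1 2 1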

1.2.3 Matrices
A matrix is a collection of data arranged into a two dimensional tabular layout. A matrix
has a fixed number of columns and rows, and the elements of a matrix must be of
the same mode (basic data type). Linear algebra provides a rich set of mathematical tools
that work on matrices, e.g., arithmetic operators, transformations and various functions
of matrix algebra. Hence, matrices are very convenient data structures for holding and
manipulating data.


R provides the function matrix for creating matrices. The matrix function has the sig-
nature matrix(data=NA, nrow=1, ncol=1, byrow=FALSE, dimnames=NULL). Parameter
data denotes a vector of data to be arranged in tabular form. Parameters nrow and ncol
specify the number of rows and columns, respectively. There are two ways to fill an empty
matrix of a fixed number of columns and rows with data: either by rows or by columns.
Parameter byrow controls whether the matrix is to be filled by columns or by rows. Fi-
nally, one can name the two dimensions (columns and rows) of a matrix using the dimnames
parameter.

Figure 1.57: Matrix objects and matrix function

> A <- matrix(data=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K","L"),
nrow=3, ncol=4, byrow=TRUE)
> A
[,1] [,2] [,3] [,4]
[1,] "A" "B" "C" "D"
[2,] "E" "F" "G" "H"
[3,] "I" "J" "K" "L"
> B <- matrix(data=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L"),
nrow=3, ncol=4, byrow=FALSE)
> B
[,1] [,2] [,3] [,4]
[1,] "A" "D" "G" "J"
[2,] "B" "E" "H" "K"
[3,] "C" "F" "I" "L"
> C <- matrix(data=c("A", "B", "C", "D"), nrow=3, ncol=4)
> C
[,1] [,2] [,3] [,4]
[1,] "A" "D" "C" "B"
[2,] "B" "A" "D" "C"
[3,] "C" "B" "A" "D"

Figure 1.57 demonstrates the use of the matrix function for creating matrices by row and by
column. Notice that in the last example the data vector contains fewer elements than
required to fill a 3x4 matrix, so the data is recycled until the matrix is filled.
An alternative way of creating matrices using other matrices or vectors is the bind
functions rbind and cbind. The function rbind takes multiple vectors of the same length or
matrices of the same column numbers and binds them row by row in the order of arguments.
Similarly, the function cbind takes multiple vectors of the same length or matrices of the
same row numbers and binds them column by column in the order of the arguments. Figure 1.58
demonstrates the use of rbind and cbind via several examples. Notice how R recycles the
data argument of the matrix function to generate enough data to fill the matrix specified
by its dimensions.
Sometimes one needs to flatten the matrix back into an R vector. One can use the
concatenation function, c, with the matrix to be flattened being the argument to obtain an
R vector. The concatenation function, c, concatenates all columns of a matrix into an R
vector as shown in Figure 1.59. In the same figure we demonstrate the use of is.matrix
function to check if an object is a matrix or not as well as the dim function which returns
the dimensions of a matrix in the form of number of rows and number of columns.


Figure 1.58: Functions rbind and cbind

> rbind(c(1:3), c(5,7,9))


[,1] [,2] [,3]
[1,] 1 2 3
[2,] 5 7 9
> rbind(matrix(data=1:6, nrow=2, ncol=3), matrix(data=0, nrow=1, ncol=3))
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
[3,] 0 0 0
> cbind(c(1:3), c(5,7,9))
[,1] [,2]
[1,] 1 5
[2,] 2 7
[3,] 3 9
> cbind(matrix(data=1, nrow=2, ncol=3), matrix(data=0, nrow=2, ncol=5))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 1 1 0 0 0 0 0
[2,] 1 1 1 0 0 0 0 0

Figure 1.59: Flattening matrices using the c function

> M <- rbind(matrix(data=1:4, nrow=2, ncol=2), c(0,0))


> M
[,1] [,2]
[1,] 1 3
[2,] 2 4
[3,] 0 0
>
> is.matrix(M)
[1] TRUE
> dim(M)
[1] 3 2
> m <- c(M)
> m
[1] 1 2 0 3 4 0
> is.matrix(m)
[1] FALSE
> is.vector(m)
[1] TRUE


1.2.3.1 Subsetting Matrices


By default, the rows and the columns of a matrix are indexed by integer indices starting
from 1. R allows subsetting an individual element of a matrix using both the row and
column indices separated by comma “,” within the subset operator square brackets“[]”. To
access an element located at the intersection of the k th row and lth column one needs to use
the subset expression “[k,l]” following the matrix object. To access an entire row the column
index after the comma is left blank. To access an entire column the row index before the
comma is left blank. To access arbitrary portions of a matrix, vectors of row and column
indices are used.

Figure 1.60: Subsetting matrices using row and column indices

> set.seed(1001)
> M <- matrix(data=sample(x=0:99, size=16, replace=FALSE), nrow=4, ncol=4)
> M
[,1] [,2] [,3] [,4]
[1,] 98 96 26 75
[2,] 40 84 69 38
[3,] 42 0 39 22
[4,] 99 7 12 74
> M[4,1] # subset the element located at [4,1]
[1] 99
> M[2,3] # subset the element located at [2,3]
[1] 69
> M[2, ] # subset the entire second row
[1] 40 84 69 38
> M[ ,4] # subset the entire fourth column
[1] 75 38 22 74
> M[c(1,4), ] # subset the first and fourth rows
[,1] [,2] [,3] [,4]
[1,] 98 96 26 75
[2,] 99 7 12 74
> M[c(2,3),c(2,3)] # subset the elements intersecting at rows (2,3) and columns
(2,3)
[,1] [,2]
[1,] 84 69
[2,] 0 39

In Figure 1.60 we first generate a matrix of sixteen elements arranged in four rows and
four columns. Secondly, we subset the element located at the fourth row and first column
and then the element located at the second row and third column. Thirdly, we subsetted the
entire second row and then we subsetted the entire fourth column. Fourthly, we subsetted
first and fourth rows. Lastly we subsetted the partition formed by the intersection of second
and third rows and second and third columns.
In linear algebra a matrix consisting of a single row or a column is called a row or a
column vector, respectively. In R however, matrices and vectors are data structures rather
than mathematical concepts and they are entirely two different types of compound objects.
Hence, a matrix of a single column or a row is not equivalent to a vector in R. Although a
matrix of a single column or row and a vector are displayed differently on the R terminal,


one can use is.matrix and is.vector functions to check their types.
In Figure 1.61 we again first generate a 4x4 dimensional matrix object, M, consisting of
sixteen elements. We subset M to obtain another 2x2 dimensional matrix object named A.
The dim function on A returns its dimensions, 2x2, and the is.matrix function on A returns
true. Then, we subset the first row of M to obtain v. The dim function on v returns NULL
and the is.vector function on v verifies that it is a vector, not a matrix; hence dim returns NULL.
In Example set 2 (Figure 1.61) we subset the first row of the matrix M again, this time to obtain
w, but we set the drop parameter of the subset operation to FALSE. Displaying w and v
shows that the two objects are different although they have exactly the same content. The
is.matrix function on w returns true. The reason for this behavior is the object of the
lowest dimension rule in subsetting, as explained below.

Figure 1.61: Subsetting matrices with drop parameter

> # Example set 1


> set.seed(1001)
> M <- matrix(data=sample(x=0:99, size=16, replace=FALSE), nrow=4, ncol=4)
> M
[,1] [,2] [,3] [,4]
[1,] 98 96 26 75
[2,] 40 84 69 38
[3,] 42 0 39 22
[4,] 99 7 12 74
> A <- M[c(-2,-3), c(1,2)]
> is.matrix(A)
[1] TRUE
> dim(A)
[1] 2 2
> v <- M[1, ]
> dim(v)
NULL
> is.matrix(v)
[1] FALSE
> is.vector(v)
[1] TRUE
>
> # Example set 2
> w <- M[1, , drop=FALSE]
> w
[,1] [,2] [,3] [,4]
[1,] 98 96 26 75
> v
[1] 98 96 26 75
> is.vector(w)
[1] FALSE
> is.matrix(w)
[1] TRUE
> dim(w)
[1] 1 4


We have already covered the two important rules that R implicitly uses whenever nec-
essary. The rules were the element-wise evaluation rule and the recycling rule. Now it is time to
present a third very important rule that is implicitly used in subsetting by R, namely the object
of the lowest dimension rule.

Object of the Lowest Dimension Rule


When subsetting a higher dimensional object, R coerces the result into the lowest
dimension possible. To illustrate, a matrix object in R is a two dimensional object
and a vector object is one dimensional. When we subset a row of a matrix the result
is coerced into a vector object rather than a matrix object consisting of a single row.
To keep the dimensions of the subset object in accordance with the subsetted object
one needs to set the drop parameter of the subset operator to FALSE as we did in the
Example set 2 in Figure 1.61.

An alternative way of subsetting a matrix is naming its dimensions (rows and columns)
and using the dimension names to subset the matrix. A simple example is shown in Fig-
ure 1.62. In the figure we first create a numeric matrix, E to represent the number of students
enrolled in a class in years “2014” and “2015” and semesters “Fall” and “Spring”. We
use the dimnames function, similar to the names function, to assign character string names
to rows and columns, respectively. R would coerce the names to be character strings in case
the arguments were numeric or boolean. Note that we could have named the dimensions of
the matrix using the dimnames parameter of the matrix function as well. Lastly, R provides
two functions namely rownames and colnames to assign character string vectors as names to
the rows and columns of a matrix, respectively. In Figure 1.62 we used the subset operator
along with dimension names to access an element specified by “2014” and “Spring”.

Figure 1.62: Subsetting matrices using dimension names

> E <- matrix(data=c(21, 34, 18, 27), nrow=2, ncol=2, byrow=FALSE )


> dimnames(E) <- list(c("2014","2015"),c("Fall","Spring"))
> E
Fall Spring
2014 21 18
2015 34 27
>
> E["2014","Spring"]
[1] 18
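As a side note, the rownames and colnames functions can get or set the dimension names individually; a short sketch continuing with the matrix E from Figure 1.62:

> rownames(E)
[1] "2014" "2015"
> colnames(E) <- c("FA", "SP")   # rename only the columns
> E["2015", "SP"]
[1] 27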

Finally, R allows using logical indices to subset matrices. When we subset a vector using
logical indices, each logical value in the index vector tells whether to include or exclude the
corresponding element in the vector to be subsetted. When we subset a matrix using logical
indices however, each logical value in the row (column) index vector tells whether to include
or exclude the corresponding row (column) in the matrix to be subsetted. Although using
logical indices to subset matrices is not a prevalent approach, Figure 1.63 shows an example.

1.2.3.2 Arithmetic Operators on Matrices


Arithmetic operators on matrices are addition, subtraction, multiplication and division.
These operators are performed element-wise on matrix objects as long as operand matrix


Figure 1.63: Subsetting matrices using logical indices

> set.seed(1001)
> M <- matrix(data=sample(x=0:99, size=16, replace=FALSE), nrow=4, ncol=4)
> M
[,1] [,2] [,3] [,4]
[1,] 98 96 26 75
[2,] 40 84 69 38
[3,] 42 0 39 22
[4,] 99 7 12 74
> M[c(FALSE,TRUE,TRUE,FALSE), c(TRUE,FALSE,FALSE,TRUE)]
[,1] [,2]
[1,] 40 38
[2,] 42 22

objects have the exact same dimensions. That is both matrix objects should have the same
row and column numbers. On the other hand, if one operand is a vector object (of length
one or longer) and the other one is a matrix object then, the vector gets recycled by column
to obtain a matrix of equal dimensions and the operation gets performed element-wise on
both matrices. Figure 1.64 shows several examples of arithmetic operations on matrices.
The outer function takes two vectors and a function (or operator) as arguments. It
applies the function to every ordered pair of the vectors where the ordered pair is formed
by an element from the first vector and an element from the second vector. The parameters
X and Y are placeholders for the first and second vectors, respectively. The parameter FUN
is a character string representing the function (or operator) to be applied. The outer
function returns a matrix where the rows represent the elements of the vector X, the columns
represent the elements of the vector Y and the entries in the matrix are the values obtained by
applying the function (or operator) FUN to the corresponding row and column. Figure 1.65
demonstrates the use of the outer function.

1.2.3.3 Matrices in Linear Algebra


In linear algebra a matrix is a rectangular table of numbers arranged in rows and columns.
The common typographic notation for matrices is bold, upper case letters. Matrices consist-
ing of all zeros and all ones are denoted as 0 and 1, respectively. There are many operations
defined on matrices and the basic ones include addition, subtraction, scalar-matrix multi-
plication, transposition, matrix-vector multiplication and matrix-matrix multiplication.

Matrix addition and subtraction. Addition and subtraction operations are computed
element-wise, hence two matrices should have the same dimensions (row and column num-
bers) in order to be added or subtracted, as shown in Equations 1.12 and 1.13. Matrix addition supports
commutativity, A + B = B + A; associativity, A + (B + C) = (A + B) + C; additive iden-
tity, A + 0 = A; and additive inverse, A + (−A) = 0 properties. R naturally supports
element-wise matrix addition and subtraction via the + and - operators as shown in the
console output in Figure 1.66.


Figure 1.64: Arithmetic operations on matrices

> set.seed(1001)
> M1 <- matrix(data=sample(x=c(-6:-1, 1:6), size=12, replace=FALSE), nrow=3, ncol=4)
> M2 <- matrix(data=sample(x=c(-6:-1, 1:6), size=12, replace=FALSE), nrow=3, ncol=4)
> M1
[,1] [,2] [,3] [,4]
[1,] 6 -3 -6 -4
[2,] -2 3 -1 4
[3,] 5 1 -5 2
> M2
[,1] [,2] [,3] [,4]
[1,] 5 2 -6 3
[2,] -2 -5 -3 6
[3,] -4 4 1 -1
>
> M1 + M2 # Matrices having the same dimensions
[,1] [,2] [,3] [,4]
[1,] 11 -1 -12 -1
[2,] -4 -2 -4 10
[3,] 1 5 -4 1
>
> c(0, 1, 10, 100) * M1 # A vector of length four and a matrix
[,1] [,2] [,3] [,4]
[1,] 0 -300 -60 -4
[2,] -2 0 -100 40
[3,] 50 1 0 200
>
> M1 / 2 # A matrix and a vector of length one
[,1] [,2] [,3] [,4]
[1,] 3.0 -1.5 -3.0 -2
[2,] -1.0 1.5 -0.5 2
[3,] 2.5 0.5 -2.5 1


Figure 1.65: The outer function

> outer(X=100:105, Y=0:2, FUN="*")


[,1] [,2] [,3]
[1,] 0 100 200
[2,] 0 101 202
[3,] 0 102 204
[4,] 0 103 206
[5,] 0 104 208
[6,] 0 105 210
>
> outer(X=100:105, Y=0:2, FUN="paste")
[,1] [,2] [,3]
[1,] "100 0" "100 1" "100 2"
[2,] "101 0" "101 1" "101 2"
[3,] "102 0" "102 1" "102 2"
[4,] "103 0" "103 1" "103 2"
[5,] "104 0" "104 1" "104 2"
[6,] "105 0" "105 1" "105 2"
>
> f <- function(a,b){a*b/2}
> outer(X=1:5, Y=c(10,100), FUN="f")
[,1] [,2]
[1,] 5 50
[2,] 10 100
[3,] 15 150
[4,] 20 200
[5,] 25 250


$$\begin{bmatrix}a_{1,1} & a_{1,2} & \cdots & a_{1,p}\\ a_{2,1} & a_{2,2} & \cdots & a_{2,p}\\ \vdots & & & \vdots\\ a_{n,1} & a_{n,2} & \cdots & a_{n,p}\end{bmatrix} + \begin{bmatrix}b_{1,1} & b_{1,2} & \cdots & b_{1,p}\\ b_{2,1} & b_{2,2} & \cdots & b_{2,p}\\ \vdots & & & \vdots\\ b_{n,1} & b_{n,2} & \cdots & b_{n,p}\end{bmatrix} = \begin{bmatrix}a_{1,1}+b_{1,1} & a_{1,2}+b_{1,2} & \cdots & a_{1,p}+b_{1,p}\\ a_{2,1}+b_{2,1} & a_{2,2}+b_{2,2} & \cdots & a_{2,p}+b_{2,p}\\ \vdots & & & \vdots\\ a_{n,1}+b_{n,1} & a_{n,2}+b_{n,2} & \cdots & a_{n,p}+b_{n,p}\end{bmatrix} \qquad (1.12)$$

$$\begin{bmatrix}a_{1,1} & a_{1,2} & \cdots & a_{1,p}\\ a_{2,1} & a_{2,2} & \cdots & a_{2,p}\\ \vdots & & & \vdots\\ a_{n,1} & a_{n,2} & \cdots & a_{n,p}\end{bmatrix} - \begin{bmatrix}b_{1,1} & b_{1,2} & \cdots & b_{1,p}\\ b_{2,1} & b_{2,2} & \cdots & b_{2,p}\\ \vdots & & & \vdots\\ b_{n,1} & b_{n,2} & \cdots & b_{n,p}\end{bmatrix} = \begin{bmatrix}a_{1,1}-b_{1,1} & a_{1,2}-b_{1,2} & \cdots & a_{1,p}-b_{1,p}\\ a_{2,1}-b_{2,1} & a_{2,2}-b_{2,2} & \cdots & a_{2,p}-b_{2,p}\\ \vdots & & & \vdots\\ a_{n,1}-b_{n,1} & a_{n,2}-b_{n,2} & \cdots & a_{n,p}-b_{n,p}\end{bmatrix} \qquad (1.13)$$

Scalar-matrix multiplication. Scalar-matrix multiplication is also carried out element-
wise. That is, the scalar is multiplied by all entries of the matrix as shown in Equation 1.14. Scalar-
matrix multiplication supports commutativity, cA = Ac; associativity, c(dA) = (cd)A;
distributivity, c(A + B) = cA + cB; and multiplicative identity, 1A = A, properties. R
naturally supports scalar-matrix multiplication via the * operator as shown in the console output in
Figure 1.66.

$$c\begin{bmatrix}a_{1,1} & a_{1,2} & \cdots & a_{1,p}\\ a_{2,1} & a_{2,2} & \cdots & a_{2,p}\\ \vdots & & & \vdots\\ a_{n,1} & a_{n,2} & \cdots & a_{n,p}\end{bmatrix} = \begin{bmatrix}ca_{1,1} & ca_{1,2} & \cdots & ca_{1,p}\\ ca_{2,1} & ca_{2,2} & \cdots & ca_{2,p}\\ \vdots & & & \vdots\\ ca_{n,1} & ca_{n,2} & \cdots & ca_{n,p}\end{bmatrix} \qquad (1.14)$$

Matrix-vector multiplication. Matrix-vector multiplication is defined as the left product
of a matrix A by a vector x, i.e., Ax. A requirement of matrix-vector multiplication is
that the number of columns of the matrix must be equal to the number of rows of the
vector. Note that the vectors are defined as column vectors. Given an n × p matrix A and
a vector x ∈ R^p, Ax = b generates a vector b ∈ R^n. In fact, matrix-vector multiplication
is defined as the dot product of x by every row of A as shown below.

$$\begin{bmatrix}a_{1,1} & a_{1,2} & \cdots & a_{1,p}\\ a_{2,1} & a_{2,2} & \cdots & a_{2,p}\\ \vdots & & & \vdots\\ a_{n,1} & a_{n,2} & \cdots & a_{n,p}\end{bmatrix}\begin{bmatrix}x_1\\x_2\\\vdots\\x_p\end{bmatrix} = \begin{bmatrix}a_{1,1}x_1 + a_{1,2}x_2 + \cdots + a_{1,p}x_p\\ a_{2,1}x_1 + a_{2,2}x_2 + \cdots + a_{2,p}x_p\\ \vdots\\ a_{n,1}x_1 + a_{n,2}x_2 + \cdots + a_{n,p}x_p\end{bmatrix} \qquad (1.15)$$
The matrix-vector multiplication operator in R is "%*%", as demonstrated in Figure 1.67.
There are multiple, but related, ways to interpret matrix-vector multiplications (Equa-
tion 1.15). One can think of matrix-vector multiplication as the compact form of the sum of
scalar-vector multiplications. Let u, v and z be three vectors in R^3 and x_1, x_2 and x_3 be three
scalars. Then, the sum of the scalar-vector multiplications, x_1 u + x_2 v + x_3 z, can be represented
as a matrix-vector multiplication, Ax = b, such that A is the matrix consisting of the col-
umn vectors u, v and z, x is the column vector consisting of the scalars x_1, x_2 and x_3, and
b is the column vector consisting of the results b_1, b_2 and b_3, as shown in Equation 1.16.


Figure 1.66: Matrix addition, subtraction and scalar multiplication

> A <- cbind(c(6,-3), c(2,9))


> B <- cbind(c(-2,7), c(4,-1))
> A
[,1] [,2]
[1,] 6 2
[2,] -3 9
> B
[,1] [,2]
[1,] -2 4
[2,] 7 -1
> A + B
[,1] [,2]
[1,] 4 6
[2,] 4 8
> A - B
[,1] [,2]
[1,] 8 -2
[2,] -10 10
> 3*A
[,1] [,2]
[1,] 18 6
[2,] -9 27

Figure 1.67: Matrix-vector multiplication

> A <- cbind(c(6,-3), c(2,9), c(4, 0))


> A
[,1] [,2] [,3]
[1,] 6 2 4
[2,] -3 9 0
> x <- as.matrix(c(2, -1, 5))
> A %*% x
[,1]
[1,] 30
[2,] -15


          
$$\begin{bmatrix}u_1\\u_2\\u_3\end{bmatrix}x_1 + \begin{bmatrix}v_1\\v_2\\v_3\end{bmatrix}x_2 + \begin{bmatrix}z_1\\z_2\\z_3\end{bmatrix}x_3 = \begin{bmatrix}u_1 & v_1 & z_1\\ u_2 & v_2 & z_2\\ u_3 & v_3 & z_3\end{bmatrix}\begin{bmatrix}x_1\\x_2\\x_3\end{bmatrix} = \begin{bmatrix}b_1\\b_2\\b_3\end{bmatrix} \qquad (1.16)$$
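A quick sketch in R (not from the text) verifying this column view of Equation 1.16:

> u <- c(1, 0, 2); v <- c(0, 1, 1); z <- c(3, 1, 0)
> A <- cbind(u, v, z)                     # the columns of A are u, v and z
> x <- c(2, -1, 4)
> all(A %*% x == 2*u - 1*v + 4*z)         # Ax equals the sum of the scaled columns
[1] TRUE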
Matrix-vector multiplication, Ax, can be interpreted as a collection of weighted sums,
i.e., linear combinations of variables and constants, as well. In Equation 1.9, the dot product
operator compactly represents weighted sums of a row vector and a column vector. In
Equation 1.15 the weighted sum is computed for a collection of row vectors and a column
vector. That is, Equation 1.15 is a natural extension of Equation 1.9.
Matrix-vector multiplication, Ax, can also be interpreted as the scalar projections of
multiple row vectors in A onto a unit column vector x. Remember that dot product of two
vectors uT v represents the amount of u along the direction of v when v is a unit vector.
A matrix-vector multiplication is simply the dot product of each row vector in A by the
column vector x. Hence, Ax = b presents the scalar projections in a compact form when x
is a unit vector.
Another interpretation of Equation 1.15 involves systems of linear equations. A linear
equation is an equation in the form of a1 x1 + a2 x2 + . . . + an xn + b = 0 such that xi ’s are the
unknown variables, ai ’s are the coefficients and b is the constant term. A system of linear
equations is a collection of linear equations defined over the same variables but different
coefficients. A solution of a system of linear equations requires finding the values for the
variables which simultaneously satisfy all the equations in the system. Problems of systems
of linear equations involving several variables frequently appear in science, engineering,
economics and daily life.
For example, Alice, Bob and Carol went to a restaurant for lunch. Alice ordered two
slices of pizza, two cookies and a soda and she paid $10 in total. Bob ordered four slices
of pizza, three cookies and one soda and he paid $16.5 in total. Carol ordered a slice of
pizza with two sodas and she paid $6.5 in total. While leaving the restaurant they had a
discussion on how much a cookie costs at the restaurant. This problem can be defined as
a system of linear equations. Let x1 , x2 and x3 be variables denoting the cost of a slice of
pizza, a cookie and a soda respectively. The bill for Alice can be mathematically expressed
as 2x1 + 2x2 + x3 = 10. Bob’s bill is expressed as 4x1 + 3x2 + x3 = 16.5. Lastly, Carol’s bill
is x1 + 2x3 = 6.5, which is equivalent to x1 + 0x2 + 2x3 = 6.5.
Together, the equations are expressed as the following system of linear equations.

$$\begin{aligned}2x_1 + 2x_2 + x_3 &= 10\\ 4x_1 + 3x_2 + x_3 &= 16.5\\ x_1 + 0x_2 + 2x_3 &= 6.5\end{aligned} \qquad (1.17)$$
Equations in (1.17) can be compactly expressed as follows.

$$\begin{bmatrix}2\\4\\1\end{bmatrix}x_1 + \begin{bmatrix}2\\3\\0\end{bmatrix}x_2 + \begin{bmatrix}1\\1\\2\end{bmatrix}x_3 = \begin{bmatrix}10\\16.5\\6.5\end{bmatrix}$$

$$\begin{bmatrix}2 & 2 & 1\\ 4 & 3 & 1\\ 1 & 0 & 2\end{bmatrix}\begin{bmatrix}x_1\\x_2\\x_3\end{bmatrix} = \begin{bmatrix}10\\16.5\\6.5\end{bmatrix} \qquad (1.18)$$

$$\mathbf{A}\mathbf{x} = \mathbf{b}$$


where A is the matrix of coefficients, x is the vector of variables and b is the vector of
constant terms. R provides a function named solve to solve systems of linear equations as
shown in Figure 1.68. The solve function returns a vector such that the first entry of the
vector is the value of the first variable, x1 , second entry is the value of the second variable,
x2 and so on.

Figure 1.68: Solving systems of linear equations

> A <- matrix(data=c(2,4,1,2,3,0,1,1,2), nrow=3, ncol=3)


> A
[,1] [,2] [,3]
[1,] 2 2 1
[2,] 4 3 1
[3,] 1 0 2
> b <- c(10,16.5,6.5)
> solve(A,b)
[1] 2.5 1.5 2.0

Note that a system of linear equations may have a single solution, infinitely many solu-
tions or no solutions at all.
Alternatively, matrix-vector multiplication can be interpreted as a linear transformation
(also called linear map) of x from a finite dimensional vector space over the real numbers
Rp to another finite dimensional vector space over the real numbers Rn . A transformation
is similar to functions, f : X → Y, in algebra which map an input in X to an output in
Y . Typically, functions are defined as y = f (x). Similarly, a “matrix transformation”,
T : Rp → Rn , associated with an n × p matrix A, maps an input vector in Rp to an output
vector in Rn . Typically, a matrix transformation is defined as b = Ax (or Ax = b) such
that x ∈ Rp and b ∈ Rn .
A “linear transformation”, T , is a mapping of vectors from a vector space Rp to another
space Rn while preserving vector addition and scalar-vector multiplication operations. That
is, T : Rp → Rn such that T (u + v) = T (u) + T (v) and T (cv) = cT (v). It turns out
that every linear transformation T can be described as a matrix-vector product, Ax, for
x = [x_1, x_2, . . . , x_p]^T as shown in Equation 1.19.


$$\begin{aligned}
T\left(\begin{bmatrix}x_1\\x_2\\\vdots\\x_p\end{bmatrix}\right) &= T\left(\begin{bmatrix}x_1\\0\\\vdots\\0\end{bmatrix}\right) + T\left(\begin{bmatrix}0\\x_2\\\vdots\\0\end{bmatrix}\right) + \ldots + T\left(\begin{bmatrix}0\\0\\\vdots\\x_p\end{bmatrix}\right)\\
&= T\left(x_1\begin{bmatrix}1\\0\\\vdots\\0\end{bmatrix}\right) + T\left(x_2\begin{bmatrix}0\\1\\\vdots\\0\end{bmatrix}\right) + \ldots + T\left(x_p\begin{bmatrix}0\\0\\\vdots\\1\end{bmatrix}\right)\\
&= x_1 T\left(\begin{bmatrix}1\\0\\\vdots\\0\end{bmatrix}\right) + x_2 T\left(\begin{bmatrix}0\\1\\\vdots\\0\end{bmatrix}\right) + \ldots + x_p T\left(\begin{bmatrix}0\\0\\\vdots\\1\end{bmatrix}\right)\\
&= x_1 T(\mathbf{e}_1) + x_2 T(\mathbf{e}_2) + \ldots + x_p T(\mathbf{e}_p)\\
&= \mathbf{A}\mathbf{x}
\end{aligned} \qquad (1.19)$$

where matrix A consists of the column vectors of the ordered transformations of the standard
basis vectors, A = [T (e1 ) T (e2 ) . . . T (ep )]. Moreover, T (u + v) = A(u + v) = Au + Av =
T (u)+T (v) and T (cv) = A(cv) = cAv = cT (v). The matrix A is called the transformation
matrix for T . Equation 1.19 has an important conclusion: to obtain the n×p transformation
matrix, A, of a linear transformation T : Rp → Rn , one needs to apply T to the standard
basis of Rp and arrange them as columns of the matrix A. Alternatively, one can interpret
the columns of a transformation matrix as how a transformation T : Rp → Rn changes the
ordered standard basis vectors of the space Rp. Linear transformations preserve the zero vector
and the negative of a vector. That is, T(0) = 0 and T(−v) = −T(v). Note that a linear
transformation preserves only vector addition and scalar-vector multiplication operations;
it does not support the addition of a constant. A linear function in calculus, f(x) = ax + b,
supports the addition of a constant. In this sense, linearity in linear transformations is
stricter than linearity in functions, and a linear function is a linear map only when the intercept
of the function is zero.
Linear transformations are heavily used in image processing, game development and
video processing. They change the shape of a space along with all vectors in the space and
of course all objects represented by those vectors. Because they are linear transformations,
they keep parallel lines parallel; they preserve equal spacing between parallel
lines; and they keep the origin fixed at 0. Figure 1.69 presents various linear transformations
of an image consisting of four colors along with their transformation matrices. The first
five transformations are from R2 to R2 and the last transformation is from R2 to R. The
transformation matrices are constructed by applying the transformations to the standard
basis of R2 , i.e., e1 = [1, 0]T and e2 = [0, 1]T , and arranging them column-wise. The original
image is shown in Figure 1.69a with identity transformation which does not change e1 and
e2. In Figure 1.69b the original image is dilated by a factor of 1.5 by multiplying both e1 and e2
by 1.5. In Figure 1.69c the original image is reflected along the vertical axis by reflecting
e1 along the vertical axis and preserving e2. In Figure 1.69d the original image is rotated
counter-clockwise by 60 degrees (π/3 radians) by rotating both e1 and e2 by 60 degrees.
Note that cos(π/3) = 1/2 and sin(π/3) = √3/2. In Figure 1.69e the original image is


sheared by fixing the horizontal axis and displacing each point by the amount of its signed
distance to the horizontal axis. Lastly, in Figure 1.69f the original two dimensional image
is projected onto the horizontal axis by preserving e1 but mapping e2 to 0.

     
$$
\text{(a) Identity: } \begin{bmatrix}1 & 0\\0 & 1\end{bmatrix}
\qquad
\text{(b) Dilation: } \begin{bmatrix}1.5 & 0\\0 & 1.5\end{bmatrix}
\qquad
\text{(c) Reflection: } \begin{bmatrix}-1 & 0\\0 & 1\end{bmatrix}
$$
$$
\text{(d) Rotation: } \begin{bmatrix}1/2 & -\sqrt{3}/2\\ \sqrt{3}/2 & 1/2\end{bmatrix}
\qquad
\text{(e) Shear: } \begin{bmatrix}1 & 1\\0 & 1\end{bmatrix}
\qquad
\text{(f) Projection: } \begin{bmatrix}1 & 0\\0 & 0\end{bmatrix}
$$
[The transformed images themselves are not reproduced here; only the six transformation matrices of panels (a)-(f) are listed.]

Figure 1.69: Various linear transformations
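Following Equation 1.19, a transformation matrix can be built in R by binding the images of the standard basis vectors together as columns. The short sketch below (object names are illustrative, not from the original figures) does this for the shear in Figure 1.69e and applies it to an arbitrary point:

T.e1 <- c(1, 0)              # the shear leaves e1 unchanged
T.e2 <- c(1, 1)              # ... and maps e2 to (1, 1)
A.shear <- cbind(T.e1, T.e2) # transformation matrix: columns are T(e1) and T(e2)
A.shear %*% c(2, 3)          # the point (2, 3) is sheared to (5, 3)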

Another interpretation of matrix-vector multiplication, Ax, is as a change of basis of x ∈ Rp,
given that A is a square matrix, i.e., n = p, consisting of p column vectors that are linearly
independent and span Rp. In fact, vectors in Euclidean space are abstract objects having
independent and span Rp . In fact, vectors in Euclidean space are abstract objects having
a magnitude and direction. However, we write them as lists of numbers representing the
coordinates of the vector with respect to the ordered standard basis, E = {e1 , e2 , . . . , ep }
where ei is the vector consisting of all zeros, except a one at the ith position, i.e., ei =
[0, . . . , 0, 1, 0, . . . , 0]T . Then, [x1 , . . . , xp ]T is the coordinates of x such that x = x1 e1 +
x2 e2 + . . . + xp ep . This sum of scalar-vector multiplications can be written as

$$
x_1\begin{bmatrix}1\\0\\\vdots\\0\end{bmatrix}
+ x_2\begin{bmatrix}0\\1\\\vdots\\0\end{bmatrix}
+ \cdots
+ x_p\begin{bmatrix}0\\0\\\vdots\\1\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & \cdots & 0\\
0 & 1 & \cdots & 0\\
\vdots & & \ddots & \vdots\\
0 & 0 & \cdots & 1
\end{bmatrix}
\begin{bmatrix}x_1\\x_2\\\vdots\\x_p\end{bmatrix}
= Ix
\tag{1.20}
$$
Although the standard basis is commonly used to express vector coordinates in Rp, other
bases are also possible. Any set B = {b1, b2, . . . , bp} can serve as a basis for Rp as long as


the vectors in B are linearly independent, i.e., a1 b1 + a2 b2 + . . . + ap bp = 0 ⇐⇒ a1 = a2 =
· · · = ap = 0, and they span Rp, i.e., any v ∈ Rp can be expressed as a linear combination
of the vectors in B. When a basis other than the standard basis E is involved, it is better to
explicitly specify the basis of the coordinates of a vector such as [x]E or [x]B . Any vector x
can be represented in basis B as

$$
\begin{aligned}
x &= x_1 b_1 + x_2 b_2 + \ldots + x_p b_p\\
  &= B[x]_B
\end{aligned}
\tag{1.21}
$$
where bi ∈ B is the ith basis vector denoted in standard basis and matrix B consists of the
ordered column vectors of basis B. One can easily change the basis of a vector x ∈ Rp from
standard basis E = {e1 , e2 , . . . , ep } to another basis B = {b1 , b2 , . . . , bp } as shown in

$$
\begin{aligned}
B[x]_B &= x = I[x]_E\\
\therefore\ B[x]_B &= [x]_E\\
[x]_B &= B^{-1}[x]_E
\end{aligned}
\tag{1.22}
$$
where ∴ means "therefore" and the matrix B^{-1} is the inverse of the matrix consisting of the
ordered column vectors of basis B. Note that matrix B is always invertible, because the
vectors constituting basis B are linearly independent. Equation 1.22 says that, given a new
basis, inverting its basis matrix and left multiplying the coordinates of a vector in the
standard basis by this inverse simply gives the coordinates of the same vector in terms of the new basis.

[Plots not reproduced. Panels: (a) x w.r.t. bases E and B; (b) x w.r.t. basis E = {e1, e2}; (c) x w.r.t. basis B = {b1, b2}.]

Figure 1.70: Change of Basis of x with respect to bases E and B

Figure 1.70a shows vector x along with bases E and B. Both bases consist of linearly
independent vectors and span R2 . Moreover, they present two different coordinate systems
shown in red and blue colors. Figure 1.70b shows the coordinates of x = [2, 2]T with respect
to basis E = {e1, e2}. That is, x = 2e1 + 2e2. Similarly, Figure 1.70c shows the coordinates
of x = [1, 1]T with respect to basis B = {b1 , b2 }. That is, x = 1b1 + 1b2 . Figure 1.71
presents the R code changing the basis of vector x in Figure 1.70 from basis E to basis B.
The vector space Rp can have many bases, such as C = {c1, c2, . . . , cp} or D = {d1, d2, . . . , dp}.
Given a vector x ∈ Rp expressed with respect to basis B, the change of basis of x to basis C
can be formulated as a matrix-vector multiplication, [x]C = M[x]B, as shown in Equation 1.23.


Figure 1.71: Change of basis from E to B

> B <- cbind(c(3,1),c(-1,1)) #create change of basis matrix B


> B
[,1] [,2]
[1,] 3 -1
[2,] 1 1
> B.inv <- solve(B) # get the inverse of B
> B.inv
[,1] [,2]
[1,] 0.25 0.25
[2,] -0.25 0.75
> x <- as.matrix(c(2,2)) # create vector x
> B.inv %*% x # compute the coordinates of x in basis B
[,1]
[1,] 1
[2,] 1

$$
\begin{aligned}
C[x]_C &= x = B[x]_B\\
\therefore\ C[x]_C &= B[x]_B\\
[x]_C &= C^{-1}B[x]_B\\
[x]_C &= M[x]_B,\ \text{such that } M = C^{-1}B
\end{aligned}
\tag{1.23}
$$

where ∴ means "therefore" and M is the matrix consisting of the column vectors of basis
B expressed in basis C, i.e., [[b1]C, [b2]C, . . . , [bp]C]. Note that B^{-1}[x]E in Equation 1.22
changes the coordinates of x into basis B. Similarly, C^{-1}B in Equation 1.23 changes the
coordinates of every column vector of B, i.e., the basis vectors of B, from the standard basis to
basis C. Therefore, M in Equation 1.23 is nothing but the basis vectors of B expressed in
basis C.
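Equation 1.23 can be carried out in R with a call to solve and a matrix product. The sketch below reuses the basis B of Figure 1.71 and introduces a second basis C chosen purely for illustration:

B <- cbind(c(3, 1), c(-1, 1))   # basis B from Figure 1.71
C <- cbind(c(1, 1), c(-1, 1))   # another basis C (made up for this example)
M <- solve(C) %*% B             # change-of-basis matrix from B to C (Equation 1.23)
x.B <- c(1, 1)                  # coordinates of x in basis B (Figure 1.70c)
M %*% x.B                       # coordinates of the same x in basis C: (2, 0)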

Matrix-matrix multiplication. Matrix-matrix multiplication is defined as the product


of a matrix A by another matrix B, i.e., AB. A requirement of matrix-matrix multiplication
is that the number of columns of the left matrix must be equal to the number of the rows of
the right matrix. Given an n × p matrix A and another p × m matrix B, AB = C generates
an n × m matrix C. Therefore, BA is not always defined and even if it is defined for the
square matrices or for the matrices having the opposite dimensions, BA is not necessarily
equal to AB.
In fact, matrix-matrix multiplication is defined as the matrix-vector multiplication of A
by every column vector of B, as shown below.


  
$$
\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,p}\\
a_{2,1} & a_{2,2} & \cdots & a_{2,p}\\
\vdots & & \ddots & \vdots\\
a_{n,1} & a_{n,2} & \cdots & a_{n,p}
\end{bmatrix}
\begin{bmatrix}
b_{1,1} & b_{1,2} & \cdots & b_{1,m}\\
b_{2,1} & b_{2,2} & \cdots & b_{2,m}\\
\vdots & & \ddots & \vdots\\
b_{p,1} & b_{p,2} & \cdots & b_{p,m}
\end{bmatrix}
=
\begin{bmatrix}
a_{1,1}b_{1,1} + \cdots + a_{1,p}b_{p,1} & \cdots & a_{1,1}b_{1,m} + \cdots + a_{1,p}b_{p,m}\\
a_{2,1}b_{1,1} + \cdots + a_{2,p}b_{p,1} & \cdots & a_{2,1}b_{1,m} + \cdots + a_{2,p}b_{p,m}\\
\vdots & & \vdots\\
a_{n,1}b_{1,1} + \cdots + a_{n,p}b_{p,1} & \cdots & a_{n,1}b_{1,m} + \cdots + a_{n,p}b_{p,m}
\end{bmatrix}
\tag{1.24}
$$

Like the dot product and matrix-vector multiplication, the matrix-matrix multiplication operator in R is "%*%", as demonstrated in Figure 1.72.

Figure 1.72: Matrix-matrix multiplication

> A <- matrix(data=c(4,7,2,6), nrow=2, ncol=2)


> A
[,1] [,2]
[1,] 4 2
[2,] 7 6
> B <- matrix(data=-2:3, nrow=2, ncol=3)
> B
[,1] [,2] [,3]
[1,] -2 0 2
[2,] -1 1 3
> A %*% B # the number of columns of A is equal to the number of rows of B
[,1] [,2] [,3]
[1,] -10 2 14
[2,] -20 6 32
> B %*% A # the number of columns of B is not equal to the number of rows of A
Error in B %*% A : non-conformable arguments
> A %*% A %*% A # matrix A is raised to the power 3
[,1] [,2]
[1,] 260 180
[2,] 630 440

One can interpret matrix-matrix multiplication AB as a series of matrix-vector mul-


tiplications Ab1 , Ab2 , . . ., Abm where bi is the ith column vector of matrix B. That
is, C = [Ab1 Ab2 . . . Abm]. Therefore, all the different interpretations of matrix-vector
multiplication are applicable to matrix-matrix multiplication, such that the matrix-vector
multiplication is applied to a collection of column vectors expressed in the right matrix and
the result is a collection of column vectors, instead of a single column vector.
A new interpretation of matrix-matrix multiplication comes into the picture when considering
the composition of linear transformations. Let T1 : Rp → Rn, associated with an n × p
matrix A, map an input vector in Rp to an output vector in Rn. Let T2 : Rm → Rp,
associated with a p × m matrix B, map an input vector in Rm to an output vector in Rp.
Then, the composition T1 ◦ T2 : Rm → Rn is a transformation associated with the n × m
matrix C = AB and it maps an input vector in Rm to an output vector in Rn. Note that


T1 ◦ T2 preserves vector addition, as T1 ◦ T2(u + v) = T1(T2(u + v)) = T1(T2(u) + T2(v)) =
T1(T2(u)) + T1(T2(v)) = T1 ◦ T2(u) + T1 ◦ T2(v). It also preserves scalar multiplication, as
T1 ◦ T2(cv) = T1(T2(cv)) = T1(cT2(v)) = cT1(T2(v)) = cT1 ◦ T2(v).
A square matrix, I, with ones on the main diagonal and zeros elsewhere is called the
identity matrix for matrix multiplication. Similar to identity element in algebraic multipli-
cation, IA = AI = A.
Similar to algebraic multiplication, matrix-matrix multiplication allows us to define powers
of square matrices, A^k = AA···A, such that A is multiplied by itself k times. Note
that (A^k)^s = A^{ks} and A^k A^s = A^{k+s}.
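Base R has no dedicated operator for raising a matrix to an integer power (add-on packages such as expm provide one), so repeated multiplication or a small helper is typically used. A minimal sketch of such a helper, written here only for illustration:

mat.pow <- function(A, k) {
  # multiply k copies of A together: A %*% A %*% ... %*% A
  Reduce(`%*%`, replicate(k, A, simplify = FALSE))
}
A <- matrix(c(4, 7, 2, 6), nrow = 2, ncol = 2)  # the matrix from Figure 1.72
mat.pow(A, 3)                                   # same result as A %*% A %*% A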
Matrix-matrix multiplication has several interesting properties such as

• Matrix-matrix multiplication is not commutative, i.e., AB ≠ BA


• If AB = 0, then A and/or B are not necessarily 0

• A0 = 0
• Matrix-matrix multiplication is commutative and associative for scalar multiplication,
i.e., A(cB) = (Ac)B = (cA)B = c(AB) = A(Bc)
• Matrix multiplication is distributive over scalar addition, i.e., (c + d)A = cA + dA

• Matrix-matrix multiplication is distributive over matrix addition, i.e., A(B + C) =


AB + AC
• Matrix-matrix multiplication is associative, i.e., ABC = (AB)C = A(BC), when
the number of columns of A is equal to the number of rows of B and the number of
columns of B is equal to the number of rows of C.

Transpose of matrices. The transpose operator flips a matrix over its diagonal by re-
placing its rows by its columns or equivalently, replacing its columns by its rows. The
transpose of a matrix A is denoted by AT . In R, the function t is used to compute the
transpose of a matrix as shown in Figure 1.73.

Figure 1.73: Transpose of a matrix

> A <- matrix(data=-2:3, nrow=2, ncol=3)


> A
[,1] [,2] [,3]
[1,] -2 0 2
[2,] -1 1 3
> t(A)
[,1] [,2]
[1,] -2 -1
[2,] 0 1
[3,] 2 3

Transpose of a matrix has several interesting properties including

• (A^T)^T = A


• (A + B)^T = A^T + B^T
• (AB)^T = B^T A^T (note the order)
• (cA)^T = cA^T
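The order reversal in (AB)^T = B^T A^T is easy to check numerically with the matrices of Figure 1.72 (a quick sketch, not part of the original figures):

A <- matrix(data = c(4, 7, 2, 6), nrow = 2, ncol = 2)
B <- matrix(data = -2:3, nrow = 2, ncol = 3)
all.equal(t(A %*% B), t(B) %*% t(A))   # TRUE: transposing a product reverses the order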

Inverse of square matrices. In algebraic multiplication the reciprocal of a number k is


denoted by k −1 = 1/k and it represents the multiplicative inverse of k such that kk −1 =
k −1 k = 1. The same concept is extended for square matrices in linear algebra. The multi-
plicative inverse of a square matrix A is denoted by A−1 such that AA−1 = A−1 A = I.
Not all square matrices are invertible. An invertible square matrix is also called nonsingular
or nondegenerate. Similarly, a square matrix that does not have an inverse is called singular
or degenerate. In R, the solve function is used to compute the inverse of a square matrix if one
exists, as shown in Figure 1.74.
Figure 1.74: Inverse of a matrix

> A <- matrix(data=c(1,2,3,4), nrow=2, ncol=2) # a nonsingular (invertible) matrix


> A
[,1] [,2]
[1,] 1 3
[2,] 2 4
> A.inv <- solve(A)
> A.inv
[,1] [,2]
[1,] -2 1.5
[2,] 1 -0.5
> A %*% A.inv
[,1] [,2]
[1,] 1 0
[2,] 0 1
> B <- matrix(data=c(2,2,3,3), nrow=2, ncol=2) # a singular matrix
> B
[,1] [,2]
[1,] 2 3
[2,] 2 3
> solve(B)
Error in solve.default(B) :
Lapack routine dgesv: system is exactly singular: U[2,2] = 0

Invertible square matrices support many useful features such as

• If AB = BA = I, then A and B are inverses of each other.


• If A is invertible, then (A^{-1})^{-1} = A
• If A and B are both invertible, then their product AB is also invertible. Moreover,
(AB)^{-1} = B^{-1} A^{-1} (note the order)
• If A is invertible, then its transpose is also invertible. Moreover, (A^T)^{-1} = (A^{-1})^T
• If A is invertible, then (cA)^{-1} = (1/c) A^{-1}


• If A is invertible, then (A^n)^{-1} = A^{-n} = (A^{-1})^n
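Likewise, the order reversal in (AB)^{-1} = B^{-1} A^{-1} can be verified numerically (a sketch; the second matrix is made up for this check):

A <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)   # the invertible matrix from Figure 1.74
B <- matrix(c(2, 0, 1, 3), nrow = 2, ncol = 2)   # another invertible matrix
all.equal(solve(A %*% B), solve(B) %*% solve(A)) # TRUE: inverting a product reverses the order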

Determinant of square matrices. TO BE WRITTEN LATER

Eigenvalues and eigenvectors of square matrices. TO BE WRITTEN LATER

1.2.3.4 Fundamental Types of Matrices in Linear Algebra


There are some important or fundamental matrix types that one encounters frequently in
linear algebra. These matrices often have nice properties that reduce complexity, make
calculations easier and/or make computations efficient. Therefore, there are many processes
in linear algebra to factorize common matrices into these fundamental matrices.

Square matrices. A square matrix is a matrix that has the same number of rows and
columns.

Diagonal matrices. A diagonal matrix is a square matrix whose entries are all zero except
those on the principal diagonal. A diagonal matrix acts as a scaling transformation: in matrix-vector
multiplication it scales the elements of a vector by the amounts specified on its main diagonal.
The transpose of a diagonal matrix is equal to itself. The inverse of a
diagonal matrix is another diagonal matrix with the reciprocals of the entries on the main
diagonal. The determinant of a diagonal matrix is the product of the elements on
the main diagonal. Due to these nice properties, it is often desirable to have a diagonal
matrix term in a matrix factorization. R uses the diag function to create diagonal matrices.
Figure 1.75 presents operations on diagonal matrices.

Figure 1.75: Diagonal Matrices

> D <- diag(c(1,5,10)) # Creating a diagonal matrix


> D
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 5 0
[3,] 0 0 10
> t(D) # Transpose of a diagonal matrix
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 5 0
[3,] 0 0 10
> solve(D) # Inverse of a diagonal matrix
[,1] [,2] [,3]
[1,] 1 0.0 0.0
[2,] 0 0.2 0.0
[3,] 0 0.0 0.1
> det(D) # Determinant of a diagonal matrix
[1] 50


Identity matrices. An identity matrix I is a diagonal matrix with ones on the main
diagonal and zeros elsewhere. The identity element in algebraic multiplication is 1: when it is
multiplied by a real value, it leaves the real value unchanged, e.g., 3 × 1 = 1 × 3 = 3. Similarly,
when the identity matrix is multiplied by another matrix, it leaves the matrix unchanged, i.e.,
IA = AI = A. Note that the size of the identity matrix is implicitly inferred based on
whether it is left or right multiplied. If it is left multiplied by a matrix, i.e., IA, then its size
is the number of rows of A. If it is right multiplied by a matrix, i.e., AI, then its size
is the number of columns of A. In summary, for an n × p matrix A, I_n A = A I_p = A.
R uses the diag function with the size as a parameter to create identity matrices as shown
in Figure 1.76.

Figure 1.76: Identity Matrices

> diag(2)
[,1] [,2]
[1,] 1 0
[2,] 0 1
> diag(3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1

Symmetric matrices. A symmetric matrix is a square matrix that is equal to its transpose,
i.e., S is symmetric ⇐⇒ S = S^T. The non-principal-diagonal entries of a symmetric
matrix are symmetric with respect to the principal diagonal. By definition, diagonal matrices,
including the identity matrix, are symmetric. Symmetric matrices appear naturally
when representing relations such as "married to" and "works with" or measures such as "distance" and "similarity".
In addition, for any matrix X, both XX^T and X^T X are always symmetric, and XX^T is not
necessarily equal to X^T X.
Covariance, correlation and cosine similarity matrices are well-known symmetric matrices
in data science. Given a centered dataset matrix X, the matrix S = (1/(n-1)) X^T X is called
the covariance matrix. In some cases, the relationships between variables are more important
than the exact variance/covariance values and the scalar term 1/(n-1) is skipped for a centered
dataset matrix X, i.e., S = X^T X. Although some texts call this matrix a covariance
matrix, others call it the sum of squares and cross products matrix, where the entries on the
principal diagonal are the sums of squares and the non-principal-diagonal entries are the
cross products. Given a standardized dataset matrix X, the matrix S = (1/(n-1)) X^T X is called
the correlation matrix. Given a column-wise unit-scaled dataset matrix X, the matrix X^T X
is called the cosine similarity matrix.
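As a short illustration of these formulas (a sketch using the built-in mtcars data set, which is not one of the course data sets), the covariance matrix computed from a centered data matrix matches R's cov function:

X <- as.matrix(mtcars[, c("mpg", "hp", "wt")])  # three numeric variables
Xc <- scale(X, center = TRUE, scale = FALSE)    # center each column
S <- t(Xc) %*% Xc / (nrow(Xc) - 1)              # covariance matrix, (1/(n-1)) X^T X
all.equal(S, cov(X))                            # TRUE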
Real symmetric matrices have several important properties. First, all eigenvalues of a real
symmetric matrix are real. Second, the eigenvectors of different eigenvalues are orthogonal.
Third, for an eigenvalue repeated k times there are k linearly independent eigenvectors, and one can
use the Gram–Schmidt process to find an orthonormal basis for the eigenspace related to the
repeated eigenvalue. As a result, a real symmetric matrix X_{p×p} always has p orthonormal
eigenvectors that span Rp . The determinant of a square matrix (a symmetric matrix is


also square) is the product of its eigenvalues. A square matrix (a symmetric matrix is also
square) is invertible when its determinant is not zero, i.e., none of its eigenvalues is zero. The
inverse of an invertible (non-singular) symmetric matrix is also symmetric. If S and S′ are
symmetric matrices, so are S + S′ and S − S′. More importantly, a symmetric matrix S can
be factorized into S = QΛQ^T where Q is the column matrix consisting of the orthonormal
eigenvectors of S and Λ is the diagonal matrix consisting of the eigenvalues of S on the main
diagonal. To put it in other words, any symmetric matrix is diagonalizable, i.e., non-defective.
In fact, a square matrix M_{p×p} is diagonalizable (non-defective) when M has p linearly
independent eigenvectors. Then, M = QΛQ^{-1} where Q is the column matrix consisting of
the eigenvectors of M and Λ is the diagonal matrix consisting of the eigenvalues of M. Note
that when Q is an orthogonal matrix, Q^T = Q^{-1}.

Figure 1.77: Symmetric Matrices

> S <- cbind(c(8,9,2), c(9,4,3), c(2,3,5))


> S
[,1] [,2] [,3]
[1,] 8 9 2
[2,] 9 4 3
[3,] 2 3 5
> t(S)
[,1] [,2] [,3]
[1,] 8 9 2
[2,] 9 4 3
[3,] 2 3 5
> Q <- eigen(S)$vectors
> Q
[,1] [,2] [,3]
[1,] -0.7354438 -0.31610738 0.5993317
[2,] -0.6109345 -0.07320017 -0.7882898
[3,] -0.2930554 0.94589527 0.1392863
> D <- diag(eigen(S)$values)
> Q %*% D %*% t(Q)
[,1] [,2] [,3]
[1,] 8 9 2
[2,] 9 4 3
[3,] 2 3 5

Orthogonal matrices. Two vectors are called orthogonal, if their dot product is zero,
i.e., they are perpendicular to each other in Euclidean space. These vectors are called
orthonormal when they are also unit vectors, i.e., their lengths are one. A real square matrix
is called an orthogonal (or orthonormal) matrix when its columns and rows are orthonormal
vectors. The columns of an orthogonal matrix Q_{p×p} naturally form an orthonormal basis
of the Euclidean space Rp. Let Q_{p×p} be an orthogonal matrix with columns {q1, q2, . . . , qp}.
Then, qiT qi = 1, because qi is a unit vector and qiT qj = 0, because qi and qj are orthogonal.
As a result, QQT = QT Q = I which also defines an orthogonal matrix. Moreover, the
inverse of an orthogonal matrix is equal to its transpose, i.e., Q−1 = QT . The determinant
of an orthogonal matrix is either 1 or -1 which implies orthogonal matrices represent rotation


and reflection transformations. Orthogonal matrices preserve the dot products of vectors,
i.e., u^T v = (Qu)^T(Qv) = u^T Q^T Q v = u^T Q^{-1} Q v = u^T I v, which also implies they preserve
vector lengths, because the length of a vector is ||u|| = √(u^T u). The identity matrix I is
an orthogonal matrix. The inverse, hence the transpose, of an orthogonal matrix is also
orthogonal. Products of orthogonal matrices are also orthogonal. The eigenvalues of an orthogonal
matrix all have absolute value 1 (its real eigenvalues are ±1) and eigenvectors belonging to different eigenvalues are orthogonal. Due to these nice properties, many
matrix factorizations, such as QR decomposition, symmetric matrix eigen decomposition and
singular value decomposition, involve orthogonal matrices. Figure 1.78 presents operations
on orthogonal matrices.

Figure 1.78: Orthogonal Matrices

> Q <- cbind(c(cos(60*pi/180),-1*sin(60*pi/180)), c(sin(60*pi/180),cos(60*pi/180)))


> Q # An orthogonal matrix
[,1] [,2]
[1,] 0.5000000 0.8660254
[2,] -0.8660254 0.5000000
> t(Q) # Transpose of Q
[,1] [,2]
[1,] 0.5000000 -0.8660254
[2,] 0.8660254 0.5000000
> solve(Q) # Inverse of Q
[,1] [,2]
[1,] 0.5000000 -0.8660254
[2,] 0.8660254 0.5000000
> t(Q) %*% Q # t(Q)Q = Qt(Q) = I
[,1] [,2]
[1,] 1 0
[2,] 0 1

Positive definite matrices. A real symmetric matrix S is called a positive definite ma-
trix if v T Sv is positive for any nonzero vector v. If one considers S as a linear transformation
matrix T : Rp → Rp where T(v) = Sv, then Sv will always point in the same "general"
direction as v. That is, the angle between v and its transformation Sv will always
be less than 90°. The dot product of two column vectors is related to the cosine of the angle
between them through u^T z = ||u|| ||z|| cos(θ), where θ is the angle between u and z and ||u|| and
||z|| are their corresponding lengths. One can easily conclude that the vector v and its
transformation Sv point in the same general direction only if, in v^T(Sv) = ||v|| ||Sv|| cos(θ), the angle θ
is less than 90°.

When v is an eigenvector of S, then Sv = λv. Therefore, v^T(Sv) = v^T(λv) = λ v^T v =
λ||v||^2. Since ||v||^2 is always positive, λ has to be positive to make v^T Sv positive. This leads us to
an equivalent definition of positive definite matrices: a real symmetric matrix is
positive definite if and only if all of its eigenvalues are positive.
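In R, the eigenvalue characterization gives a convenient numerical check for positive definiteness (a sketch with a matrix made up for this example):

S <- cbind(c(2, -1, 0), c(-1, 2, -1), c(0, -1, 2))  # a symmetric matrix
eigen(S)$values                 # approximately 3.41, 2.00, 0.59 (all positive)
all(eigen(S)$values > 0)        # TRUE, so S is positive definite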

Positive semi-definite matrices. A real symmetric matrix S is called a positive semi-


definite matrix if v T Sv is positive or zero (non-negative) for any nonzero vector v. That
is, the angle between v and its transformation Sv will always be less than or equal to


90◦ for a positive semi-definite matrix S. Equivalently, a real symmetric matrix is positive
semi-definite if and only if all of its eigenvalues are non-negative. The inverse of a positive
definite matrix is also positive definite. Any positive definite matrix S has a unique Cholesky
factorization S = LL^T, where L is a real lower triangular matrix with positive entries
on the principal diagonal. Any symmetric matrix generated by S = MM^T or S = M^T M
is not always positive definite but it is always positive semi-definite. Therefore, a covariance
matrix is always positive semi-definite.
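R's chol function computes the Cholesky factor of a positive definite matrix. Note that chol returns the upper triangular factor R with S = R^T R, so a transpose is needed to obtain the lower triangular L (a minimal sketch with an illustrative matrix):

S <- cbind(c(2, -1, 0), c(-1, 2, -1), c(0, -1, 2))  # a positive definite matrix
R <- chol(S)               # upper triangular factor
L <- t(R)                  # lower triangular Cholesky factor
all.equal(L %*% t(L), S)   # TRUE: S = L L^T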

1.2.4 Arrays
So far we have discussed vector and matrix objects used for storing values of the same mode
(data type) in one dimensional and two dimensional data structures, respectively. R allows
us to go beyond two dimensions by providing array objects. An array object in R is a data
structure for storing values of the same mode in any number of dimensions. The function
array has the signature array(data = NA, dim = length(data), dimnames = NULL). The
parameter data denotes the vector of values that we want to arrange in an array. The
parameter dim denotes the vector of dimensions. The parameter dimnames allows us to give
names to the dimensions.
To illustrate, let us create a synthetic data set representing the cumulative GPAs of
students based on grade (freshman, sophomore, junior, senior), scholarship status (scholarship,
no-scholarship) and gender (female, male).

Figure 1.79: Arrays in R

> gpa.arr <- array(data=c(3.96, 2.53, 2.95, 1.65, 2.56, 3.71, 3.01, 3.65, 1.70,
3.41, 2.60, 1.84, 3.22, 2.59, 2.95, 3.70), dim=c(4,2,2))
> gpa.arr
, , 1

[,1] [,2]
[1,] 3.96 2.56
[2,] 2.53 3.71
[3,] 2.95 3.01
[4,] 1.65 3.65

, , 2

[,1] [,2]
[1,] 1.70 3.22
[2,] 3.41 2.59
[3,] 2.60 2.95
[4,] 1.84 3.70

Figure 1.79 shows how to create a three dimensional array where the first dimension
represents the grade, the second dimension represents scholarship status and the third di-
mension represents the gender. R displays three arrays by subsetting the array starting from
the last dimension using the integer indices. The last dimension in our example is gender
where one represents female and two represents male students. Additionally one can use
dimnames to provide names to the dimensions as shown in Figure 1.80.


Figure 1.80: Arrays in R

> gpa.arr <- array(data=c(3.96, 2.53, 2.95, 1.65, 2.56, 3.71, 3.01, 3.65, 1.70,
3.41, 2.60, 1.84, 3.22, 2.59, 2.95, 3.70), dim=c(4,2,2))
> dimnames(gpa.arr) <- list(c("freshman", "sophomore", "junior", "senior"), c("
scholarship", "no-scholarship"), c("female", "male"))
> gpa.arr
, , female

scholarship no-scholarship
freshman 3.96 2.56
sophomore 2.53 3.71
junior 2.95 3.01
senior 1.65 3.65

, , male

scholarship no-scholarship
freshman 1.70 3.22
sophomore 3.41 2.59
junior 2.60 2.95
senior 1.84 3.70

Subsetting an array is not different from subsetting matrices. One can use integer indices
or names to subset an individual element or a portion in an array.
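For instance, with the named array of Figure 1.80, elements and slices can be retrieved by name or by position (a brief sketch):

gpa.arr["freshman", "scholarship", "female"]  # a single element: 3.96
gpa.arr[, , "male"]                           # the 4 x 2 slice for male students
gpa.arr["senior", , ]                         # a 2 x 2 slice for seniors
gpa.arr[1, 2, 2]                              # same element as gpa.arr["freshman", "no-scholarship", "male"]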

1.2.5 Data Frames


Vectors, factors, matrices and arrays in R are objects holding values of the same mode
(data type). In real life however, we have collections of data consisting of measurements
or observations on several features of instances where each feature has a different mode.
These data sets are arranged in tabular fashion where each row corresponds to an instance
(record) and each column represents a feature (variable). A particular value in the table
denotes an observation or measurement on an instance (row) for a feature (column). For
example, a health data set may consist of names, weights, heights, blood types and various
test results for patients arranged in tabular fashion.
R supports tabular data sets consisting of multiple records and multiple variables through
data frame objects. Basically, a data frame is nothing but a collection of equal-length vectors
combined column-wise, each vector becoming a column, to create a table-like data structure. The vectors forming a data frame
must have the same length; otherwise, the shorter vectors are recycled to match the length
of the longest vector in the data frame (recycling only succeeds when the longer length is a whole multiple of the shorter).
For example we can create a data set for analyzing various features of universities in
Louisiana including the school type, number of students, Carnegie classification and accep-
tance rate as shown in Figure 1.81. In Figure 1.81 we first create five vectors holding the
names, types, enrollments, Carnegie classifications and acceptance rates of five universities.
Then we create a data frame by combining these vectors as columns. Each row in the data
frame is a higher education instance, each column is a higher education feature and each
cell denotes the feature value observed on the instance. The data.frame function uses the


identifiers (names) of vector objects to name the columns. The rows are named by inte-
ger numbers as shown in Figure 1.81. The data.frame function supports the parameter
row.names to explicitly name the rows using a vector.

Figure 1.81: Data frames in R

> name <- c("UL Lafayette", "LSU", "Tulane", "LA Tech", "Xavier")
> type <- as.factor(c("Public", "Public", "Private", "Public","Private"))
> enroll <- c(17195, 30451, 13531, 11015, 2926)
> carnegie <- as.factor(c("RU/H", "RU/VH", "RU/VH", "RU/H", NA))
> acceptance <- c(59.3, 75.0, 26.4, 70.8, 54.4) # Acceptance percentages
> univ <- data.frame(name, type, enroll, carnegie, acceptance)
> univ
name type enroll carnegie acceptance
1 UL Lafayette Public 17195 RU/H 59.3
2 LSU Public 30451 RU/VH 75.0
3 Tulane Private 13531 RU/VH 26.4
4 LA Tech Public 11015 RU/H 70.8
5 Xavier Private 2926 <NA> 54.4

Remember that matrix objects also arrange data in tabular fashion. However, matrices
require the mode of all columns to be the same. On the other hand, there is no such
requirement for data frames. In Figure 1.81, the variable name is a character string vector;
the variable type is a factor with two levels; the variable enroll is a numeric vector; the
variable carnegie is again a factor with two levels; and the variable acceptance is a numeric
vector.
The data.frame function coerces character string vectors into factors by default in R versions
before 4.0.0 (later versions keep them as character vectors). In case one wants to keep character
string vectors as they are, he/she can set the stringsAsFactors parameter of the data.frame function to FALSE.
In addition to data.frame function, R provides the expand.grid function to quickly
create a data frame from all combinations of argumented vectors and/or factors. Figure 1.82
demonstrates the expand.grid function.

1.2.5.1 Subsetting Data Frames


Similar to matrices, data frames can be subsetted by positive/negative integer index vectors,
logical index vectors and name index vectors. Figure 1.83 provides several examples to
subsetting data frames using different types of index vectors. In Figure 1.83 the vector name
is used to name rows instead of being another column as demonstrated in Figure 1.81. Notice
that subsetting rows of a data frame returns another data frame whereas subsetting a
single column returns a vector. Additionally, R supports accessing or subsetting columns of a data
frame through the "$" operator: the data frame object identifier followed by a dollar sign "$" and
a column name subsets that column. In Figure 1.83 the example univ$acceptance
subsets the acceptance column of the univ object.

1.2.5.2 Modifying Data Frames


Similar to other objects in R, using the subset of a data frame on the left hand side of the
assignment operator allows us to modify cells, columns and rows of the data frame in place.


Figure 1.82: The expand.grid function

> expand.grid(col1=c("A", "B", "C"), col2=c(TRUE, FALSE), col3=c(10,20))


col1 col2 col3
1 A TRUE 10
2 B TRUE 10
3 C TRUE 10
4 A FALSE 10
5 B FALSE 10
6 C FALSE 10
7 A TRUE 20
8 B TRUE 20
9 C TRUE 20
10 A FALSE 20
11 B FALSE 20
12 C FALSE 20

Furthermore, the function cbind allows us to add a new column to a data frame. Adding
a new column is a trivial operation because a column is a vector object consisting of values
of the same mode (data type). On the other hand, adding a new row into a data frame
is a more complex operation because a row instance consists of multiple values of different
modes. The simplest way to add a new row is to represent the new row as a data frame of
a single instance and use the rbind function to append it to the existing data frame object.
To delete a column one can subset the entire column and set it to NULL using the
assignment operator. To delete a row one can subset the data frame using a negative index
vector and assign it to itself. This method also works for removing columns.
Figure 1.84 shows several examples of data frame modification. Note that adding a new
row with a new factor level results in an update in the levels of the corresponding column
of the data frame.

1.2.5.3 Essential Data Frame Functions


Data frames are located at the heart of data analysis using R. Hence, R provides several built-
in functions to work with data frames. In the following we will cover the most frequently
used ones. Note that almost all of these functions are overloaded for other types of objects
as well.
The names and colnames functions both return the column names of a data frame
object. The rownames function returns the row names of a data frame object.
The dimnames function returns both the row and column names of a data frame object.
The dim function returns the dimensions, i.e., the number of rows by the number of
columns, of a data frame object. The nrow and ncol functions return the number of rows
and columns in a data frame object, respectively.
The View function displays a data frame object in a spreadsheet. The actual behavior
of the View function depends on the system that R is installed on. Most of the time a data
frame consists of thousands of records and displaying the entire data set is not feasible. R
provides the head and tail functions to display only the first and last six rows of a data
frame for a quick peek. Both functions support the parameter n to control the number of
displayed rows.
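Applied to the univ data frame of Figure 1.83, these functions behave as follows (a short sketch; the results are indicated in the comments):

dim(univ)         # 5 4 : five rows and four columns
nrow(univ)        # 5
ncol(univ)        # 4
colnames(univ)    # "type" "enroll" "carnegie" "acceptance"
head(univ, n = 2) # the first two rows
tail(univ, n = 2) # the last two rows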


Figure 1.83: Subsetting data frames

> name <- c("UL Lafayette", "LSU", "Tulane", "LA Tech", "Xavier")
> type <- as.factor(c("Public", "Public", "Private", "Public","Private"))
> enroll <- c(17195, 30451, 13531, 11015, 2926)
> carnegie <- as.factor(c("RU/H", "RU/VH", "RU/VH", "RU/H", NA))
> acceptance <- c(59.3, 75.0, 26.4, 70.8, 54.4) # Acceptance percentages
> univ <- data.frame(type, enroll, carnegie, acceptance, row.names=name)
> univ
type enroll carnegie acceptance
UL Lafayette Public 17195 RU/H 59.3
LSU Public 30451 RU/VH 75.0
Tulane Private 13531 RU/VH 26.4
LA Tech Public 11015 RU/H 70.8
Xavier Private 2926 <NA> 54.4
>
> univ[3,2] # Subsetting individual elements by integer indices
[1] 13531
> univ[1, ] # Subsetting rows by integer indices
type enroll carnegie acceptance
UL Lafayette Public 17195 RU/H 59.3
> univ[, 4] # Subsetting columns by integer indices
[1] 59.3 75.0 26.4 70.8 54.4
> univ[c(2,4), c(1,2,3)] # Subsetting parts by integer indices
type enroll carnegie
LSU Public 30451 RU/VH
LA Tech Public 11015 RU/H
> univ[c("LSU", "LA Tech"),"type"] # Subsetting elements by names
[1] Public Public
Levels: Private Public
> univ$acceptance # Subsetting columns by the $ operator
[1] 59.3 75.0 26.4 70.8 54.4
> univ[univ$acceptance<50, ] # Subsetting rows by logical indices
type enroll carnegie acceptance
Tulane Private 13531 RU/VH 26.4


Figure 1.84: Modifying data frames

> name <- c("UL Lafayette", "LSU", "Tulane", "LA Tech", "Xavier")
> type <- as.factor(c("Public", "Public", "Private", "Public","Private"))
> enroll <- c(17195, 30451, 13531, 11015, 2926)
> carnegie <- as.factor(c("RU/H", "RU/VH", "RU/VH", "RU/H", NA))
> acceptance <- c(59.3, 75.0, 26.4, 70.8, 54.4) # Acceptance percentages
> univ <- data.frame(type, enroll, carnegie, acceptance, row.names=name)
> univ
type enroll carnegie acceptance
UL Lafayette Public 17195 RU/H 59.3
LSU Public 30451 RU/VH 75.0
Tulane Private 13531 RU/VH 26.4
LA Tech Public 11015 RU/H 70.8
Xavier Private 2926 <NA> 54.4
> univ[1,2] <- 18796
> mascot <- c("Cayenne", "Tiger", "Pelican", "Bulldog", "Gold")
> univ <- cbind(univ, mascot)
> ul.monroe <- data.frame(type="Public", enroll=8811, carnegie=NA, acceptance=92.0,
mascot="Warhawk", row.names="UL Monroe")
> univ <- rbind(univ,ul.monroe)
> univ$carnegie <- NULL
> univ
type enroll acceptance mascot
UL Lafayette Public 18796 59.3 Cayenne
LSU Public 30451 75.0 Tiger
Tulane Private 13531 26.4 Pelican
LA Tech Public 11015 70.8 Bulldog
Xavier Private 2926 54.4 Gold
UL Monroe Public 8811 92.0 Warhawk


The two very important functions especially used on data frames are the str and summary
functions. The str function gives a brief description about the structure of the data frame.
The summary function provides summary statistics for each column in a data frame. Fig-
ure 1.85 shows the use of str and summary functions.

Figure 1.85: The str and summary functions

> name <- c("UL Lafayette", "LSU", "Tulane", "LA Tech", "Xavier")
> type <- as.factor(c("Public", "Public", "Private", "Public","Private"))
> enroll <- c(17195, 30451, 13531, 11015, 2926)
> carnegie <- as.factor(c("RU/H", "RU/VH", "RU/VH", "RU/H", NA))
> acceptance <- c(59.3, 75.0, 26.4, 70.8, 54.4) # Acceptance percentages
> univ <- data.frame(type, enroll, carnegie, acceptance, row.names=name)
> univ
type enroll carnegie acceptance
UL Lafayette Public 17195 RU/H 59.3
LSU Public 30451 RU/VH 75.0
Tulane Private 13531 RU/VH 26.4
LA Tech Public 11015 RU/H 70.8
Xavier Private 2926 <NA> 54.4
> str(univ)
’data.frame’: 5 obs. of 4 variables:
$ type : Factor w/ 2 levels "Private","Public": 2 2 1 2 1
$ enroll : num 18796 30451 13531 11015 2926
$ carnegie : Factor w/ 2 levels "RU/H","RU/VH": 1 2 2 1 NA
$ acceptance: num 59.3 75 26.4 70.8 54.4
> summary(univ)
type enroll carnegie acceptance
Private:2 Min. : 2926 RU/H :2 Min. :26.40
Public :3 1st Qu.:11015 RU/VH:2 1st Qu.:54.40
Median :13531 NA’s :1 Median :59.30
Mean :15344 Mean :57.18
3rd Qu.:18796 3rd Qu.:70.80
Max. :30451 Max. :75.00

TALK ABOUT THE BEST WAYS TO MERGE DATA FRAMES, DESPITE IT IS


PACKAGE DEPENDENT. SUGGEST USING EXCEL FOR MERGING DATASETS AND
SAVING THE FINAL DATASET AS .csv BEFORE LOADING INTO R FOR ANALYSES.

1.2.6 Lists
Vector, matrix and array objects in R require the mode of the data to be stored the same.
Data frame objects allow storing data in different modes together in tabular fashion however,
the columns should have the same length. A list in R is a data structure that can hold
multiple objects of different modes or lengths. That is, one can put together vectors, factors,
matrices, data frames, functions and even lists into a list. This flexibility allows us to
combine the data that are loosely related to each other into a single object.
The function list is used to create lists of objects in R. It expects one or more objects
to be provided as arguments. In Figure 1.86 we first create a two by two matrix object
aMatrix, a vector of character strings consisting of four elements aVector and a factor of


five levels consisting of six elements aFactor. We then, create a list of objects consisting
of the objects aMatrix, aVector, aFactor along with another string vector of length one
created on-the-fly.

Figure 1.86: Creating lists in R

> aMatrix <- matrix(data=1:4, nrow=2, ncol=2)


> aVector <- c("Louisiana", "Texas", "California", "Maine")
> aFactor <- factor(x=c(1, 2, 2, 3, 1, 3), levels=1:5)
> theList <- list(aMatrix, aVector, aFactor, "Hello world")
> theList
[[1]]
[,1] [,2]
[1,] 1 3
[2,] 2 4

[[2]]
[1] "Louisiana" "Texas" "California" "Maine"

[[3]]
[1] 1 2 2 3 1 3
Levels: 1 2 3 4 5

[[4]]
[1] "Hello world"

1.2.6.1 Subsetting Lists


The list function assigns an integer index to its objects based on the order that they
are argumented. These integer indices displayed within double square brackets “[[ ]]” in
Figure 1.86 allow us to access individual objects of the list as well as multiple objects. R
allows accessing the elements of a list using the single square bracket subsetting operator
“[ ]” as well as double square brackets “[[ ]]” and the difference in their behavior is subtle.
As shown in Figure 1.87, when the single square bracket subsetting operator is used to retrieve a
single object in the list, R puts the returned object into a list object and returns a list of
length one. That is, the returned object is still of type list rather than the element's actual
object type. On the other hand, when the double square bracket subsetting operator is used to
retrieve a single object in the list, R returns the actual object without putting it into a list
object. Note that a list is nothing but a data structure for holding objects of different types.
Hence, many of the mathematical operators and functions do not work on lists as expected.
However, one can always subset the original object using the double square brackets “[[ ]]”
subsetting operator and perform calculations on the object. In Figure 1.87 we used the
det function to calculate the determinant of the matrix object in the list by retrieving the
matrix using the double square brackets "[[ ]]" operator. If we had used the single square brackets
"[ ]" subsetting operator, we would have obtained a list holding a matrix object, and the det
function is not defined on lists.
An alternative way to subset lists is naming the objects and using the names to access
objects in a list. Note that the data.frame function automatically uses the object identifiers


Figure 1.87: Subsetting lists using double and single square brackets operators

> aMatrix <- matrix(data=1:4, nrow=2, ncol=2)


> aVector <- c("Louisiana", "Texas", "California", "Maine")
> aFactor <- factor(x=c(1, 2, 2, 3, 1, 3), levels=1:5)
> theList <- list(aMatrix, aVector, aFactor, "Hello world")
> theList[1] # returns a list
[[1]]
[,1] [,2]
[1,] 1 3
[2,] 2 4

> theList[2] # returns a list


[[1]]
[1] "Louisiana" "Texas" "California" "Maine"

> theList[[1]] # returns a matrix


[,1] [,2]
[1,] 1 3
[2,] 2 4
> theList[[2]] # returns a vector
[1] "Louisiana" "Texas" "California" "Maine"
> det(theList[[1]])
[1] -2

to name the columns in a data frame. However, the list function does not implicitly name
the objects. There are two ways to name the objects in a list. One approach is using the
names function on the left hand side of the assignment operator, "<-", and providing a
character string vector having a name for each element in the list (a brief sketch of this
approach is given after this paragraph). The second and preferred approach is to name the
objects while calling the list function, using identifier-object pairs combined by the parameter
assignment operator, "=". In Figure 1.88 we use the second approach to name the objects in
the list. Please notice how the identifier-object pairs are provided to the list function while
creating the list in the figure. In Figure 1.88 we also display the list on the terminal. One
difference is that R uses "$" followed by the name of the element, instead of the integer index
of the element within "[[ ]]", while displaying the list on the terminal. In fact, the "$" operator
is the same operator that we used to retrieve the columns of a data frame. The "$" operator
can also be used to retrieve individual elements of a list and it is equivalent to the double
square brackets "[[ ]]" in terms of behavior. Note that integer indices still work for subsetting
the elements of a list even if the elements have their associated names.
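The first approach, naming the elements after the list has been created, looks as follows (a brief sketch reusing the objects from Figure 1.86):

theList <- list(aMatrix, aVector, aFactor, "Hello world")
names(theList) <- c("mat", "vec", "fac", "greet")  # assign a name to each element
theList$greet                                      # "Hello world"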
R also allows subsetting the objects within a list by using multiple subsetting operators
in a row. For example, theList[[2]][3] accesses the third element of the object
located at index position two in the list theList. In case the objects have associated names
an equivalent instruction would be theList$vec[3] as shown in Figure 1.89. Notice that
in both examples the first subsetting operator retrieves an object within the list and the
second subsetting operator retrieves an element of the object. Figure 1.89 demonstrates
several examples to subsetting objects in lists.


Figure 1.88: Subsetting lists using names and $ operator

> aMatrix <- matrix(data=1:4, nrow=2, ncol=2)


> aVector <- c("Louisiana", "Texas", "California", "Maine")
> aFactor <- factor(x=c(1, 2, 2, 3, 1, 3), levels=1:5)
>
> theList <- list("mat"=aMatrix, "vec"=aVector, "fac"=aFactor, "greet"="Hello world"
)
> theList
$mat
[,1] [,2]
[1,] 1 3
[2,] 2 4

$vec
[1] "Louisiana" "Texas" "California" "Maine"

$fac
[1] 1 2 2 3 1 3
Levels: 1 2 3 4 5

$greet
[1] "Hello world"
>
> theList["vec"]
$vec
[1] "Louisiana" "Texas" "California" "Maine"

> theList[["vec"]]
[1] "Louisiana" "Texas" "California" "Maine"
> theList$vec
[1] "Louisiana" "Texas" "California" "Maine"

Figure 1.89: Subsetting the objects in a list

> aMatrix <- matrix(data=1:4, nrow=2, ncol=2)


> aVector <- c("Louisiana", "Texas", "California", "Maine")
> aFactor <- factor(x=c(1, 2, 2, 3, 1, 3), levels=1:5)
>
> theList <- list("mat"=aMatrix, "vec"=aVector, "fac"=aFactor, "greet"="Hello world"
)
> theList[[2]][3]
[1] "California"
> theList$vec[3]
[1] "California"
> theList$mat[2,2]
[1] 4
> theList$vec[c(-2,-3)]
[1] "Louisiana" "Maine"


1.2.6.2 Modifying Lists


As we have already discussed, subsetting objects on the left hand side of the assignment
operator allows us to overwrite the objects in a list. Adding a new object into a list is as
simple as assigning the object to the list with a new name or integer index. Removing an
object from the list is done by assigning the NULL value to the object in the list.
Figure 1.90 demonstrates several examples to modifying lists.

Figure 1.90: Modifying lists

> aMatrix <- matrix(data=1:4, nrow=2, ncol=2)


> aVector <- c("Louisiana", "Texas", "California", "Maine")
> aFactor <- factor(x=c(1, 2, 2, 3, 1, 3), levels=1:5)
> theList <- list("mat"=aMatrix, "vec"=aVector, "fac"=aFactor, "greet"="Hello world"
)
>
> theList$vec[1] <- "New York"
> theList$vec2 <- seq(from=0.0, to=0.5, by=0.1)
> theList$mat <- NULL
> theList
$vec
[1] "New York" "Texas" "California" "Maine"

$fac
[1] 1 2 2 3 1 3
Levels: 1 2 3 4 5

$greet
[1] "Hello world"

$vec2
[1] 0.0 0.1 0.2 0.3 0.4 0.5

1.2.7 Symbolic Expressions in R


R is a statistical language and framework for numeric computation. However, R also provides
limited support for symbolic computation. As a matter of fact, any instruction that is typed
on R terminal is considered to be a symbolic expression. When one inputs the expression
to the R environment by pressing the carriage return key, R first validates the symbolic
expression and then evaluates it. It is possible to create expressions in the form of symbols
using the expression function and then evaluating them using the eval function.
In Figure 1.91 we create a symbolic expression, myexpr, representing 2x^3 + exp(y)/(5z).
Notice that the mode of the myexpr object is expression. Then, we create x, y and z
objects and evaluate the expression using the eval function.
Many graphics functions for plotting data or machine learning functions for inference and
estimation have parameters of mode expression. These expressions are used in the bodies
of these functions. The function D in R is defined for computing the partial derivatives of
simple expressions in symbolic notation. The function has two parameters namely expr for
the expression and name for the variable to perform the derivation. Figure 1.92 shows how


Figure 1.91: Symbolic expressions in R

> myexpr <- expression(2*x^3 + exp(y)/(5*z))


> myexpr
expression(2 * x^3 + exp(y)/(5 * z))
> mode(myexpr)
[1] "expression"
> x <- 4
> y <- 10
> z <- 3
> eval(myexpr)
[1] 1596.431

to calculate the derivative of an expression in symbolic notation. R provides various
packages for symbolic calculation. However, we do not cover this topic in the text because
our use of symbolic notations is limited to providing simple symbolic expressions to some
functions.
Figure 1.92: Symbolic partial derivative

> myexpr <- expression(2*x^3 + exp(y)/(5*z))


> D(expr=myexpr, name="x")
2 * (3 * x^2)
> D(expr=myexpr, name="y")
exp(y)/(5 * z)

1.3 Apply Functions


The base package in R defines a family of functions, including apply, lapply, sapply and
tapply. All these functions are used to apply a function to a collection of objects or subsets
of an object without using a loop structure. The simplest of these functions is the apply
function which applies a function to the subsets of a composite object and requires three
mandatory parameters. The FUN parameter denotes the function name that will be applied
to the subsets of a composite object. The X parameter indicates the composite object. The
MARGIN parameter specifies whether the function will be applied to the rows, “MARGIN=1”,
or to the columns “MARGIN=2” or to each element “MARGIN=c(1,2)”.
In Figure 1.93 we first create a matrix M consisting of five rows and three columns. Next,
we use the apply function to compute the sums of rows and then we use the apply function
again to compute the means of the columns. Note that the parameter na.rm belongs to the
mean function and one can pass arguments to the function specified by the FUN parameter
by listing them in the apply function call. The apply function always returns a vector or
a matrix and it applies a function to multidimensional objects, including matrices, arrays
and data frames. If the composite object is a data frame, one must ensure that the rows or
columns of the data frame are of the same primitive type by subsetting.
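For example, restricting the univ data frame of Figure 1.83 to its numeric columns, apply computes column-wise summaries (a brief sketch):

num.cols <- univ[, c("enroll", "acceptance")]  # keep only the numeric columns
apply(X = num.cols, MARGIN = 2, FUN = mean)    # mean enrollment and mean acceptance rate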
The function lapply applies a function, specified via parameter FUN, to a single dimen-
sional object such as vectors or lists. Note that while a list is single dimensional, it may


Figure 1.93: The apply function

> M <- matrix(c(1:5, 11:15, 101:105), nrow = 5, ncol = 3)


> M
[,1] [,2] [,3]
[1,] 1 11 101
[2,] 2 12 102
[3,] 3 13 103
[4,] 4 14 104
[5,] 5 15 105
> apply(X=M, MARGIN=1, FUN=sum)
[1] 113 116 119 122 125
> apply(X=M, MARGIN=2, FUN=mean, na.rm=TRUE)
[1] 3 13 103

contain multidimensional objects. In that case, if the function specified via the FUN parameter
is applicable to the enclosed multidimensional objects, the lapply function will work
without any problems. The lapply function always returns a list.

Figure 1.94: The lapply and sapply functions

> L <- list(M=matrix(c(1:5, 11:15, 101:105), nrow = 5, ncol = 3), V1=21:25, V2


=31:35, V3=41:45)
> lapply(X=L, FUN=sum) # lapply function
$M
[1] 595
$V1
[1] 115
$V2
[1] 165
$V3
[1] 215
> sapply(X=L, FUN=sum) # sapply function
M V1 V2 V3
595 115 165 215

In Figure 1.94 we first create a list consisting of a matrix and three vectors and then use
the lapply function to compute the sums of each object in the list. Note that the lapply
function returns a list. The function sapply is a wrapper function around the lapply
function. Different from lapply, sapply simplifies the output if possible. In Figure 1.94
the sapply function applies the same function; however, it simplifies the output into a named
vector or matrix.
The tapply function applies a function to a ragged array, i.e., to subsets of values defined by a
grouping factor. Typically it is used on data frames to apply a function to groups split by a factor.
The INDEX parameter is used to specify the factor column for grouping.

In Figure 1.95 we first create a data frame consisting of a numeric column and a factor
column. Then we use the tapply function to group the numeric column by the factor
column and apply the sum function.


Figure 1.95: The tapply function

> df <- data.frame("C1" = 1:5, "C2" = as.factor(c("A", "B", "B", "A", "B")))
> df
C1 C2
1 1 A
2 2 B
3 3 B
4 4 A
5 5 B
> tapply(X=df$C1, INDEX=df$C2, FUN=sum)
A B
5 10

There are other functions in the family. For example, vapply is similar to sapply
but it allows one to explicitly declare the type and length of the output. The mapply function
is the multivariate version of the sapply function.
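Minimal sketches of these two variants (the objects are made up for illustration):

L <- list(V1 = 21:25, V2 = 31:35, V3 = 41:45)
vapply(X = L, FUN = sum, FUN.VALUE = numeric(1))  # like sapply, but the output type/length is declared
mapply(FUN = rep, 1:3, 3:1)                       # rep(1, 3), rep(2, 2), rep(3, 1) applied element-wise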

1.4 Installing and Loading Packages in R


A package in R is a collection of compiled or interpreted functions and/or named data
grouped together. Packages are used to allow developers to extend the core R functionality
with new data sets and statistical methods. Package developers implement new pieces
of software with well defined function headers and share them with the R community.
Thousands of packages for various tasks are available at Comprehensive R Archive Network
(CRAN)7 .
When R is launched it loads only the core packages and makes them available for the
user. In order to call a function located in a package other than the core packages one has
to (i) install the package on his/her system, if not already installed and (ii) load the package
to make it available for use.
There are two methods to install packages in R. The first and most frequently used
method is installing the binary package directly from the CRAN repository. The install.packages
function which comes with the core R allows us to install packages. The install.packages
function supports numerous parameters, and using them with their default values works most
of the time. The parameter pkgs is used to install one or more packages, defined as a vector
of character strings representing the names of the packages to be installed. The parameter
lib is another character vector denoting the destination library directories in which to install the
packages. The parameter repos is a character vector denoting the URL’s of the package
repositories. The parameter type is a character string vector indicating the formats, e.g.
“source” or “win.binary”, of the packages to be downloaded and installed. Usually using
only pkgs parameter is enough to download and install a package from CRAN repository
as shown in Figure 1.96.
The second method is installing a package manually from the source. To install a pack-
age manually one has to download the source of the package first. Once the package is
downloaded the install.packages function can be used to install it on the system using
7 https://cran.r-project.org/


the instruction install.packages (path_to_file, repos = NULL, type="source").


After installing a package, the functions and named data in that package do not become
available to the user unless he/she loads the package. The function library is used to load
a package in R. The library function supports the parameter package, which denotes the name
of the package to be loaded. In case the library function is called without any arguments, it
lists the installed packages. The function search lists the packages loaded in the current R session.
In Figure 1.96 we demonstrate the use of these functions without explicitly specifying the
parameter names.

Figure 1.96: Installing and loading packages in R

> install.packages(pkgs="ggplot2") # Install the package ggplot2


...
...
> library() # List the installed packages
...
...
> library(package="ggplot2") # Load the package ggplot2
> search() # List loaded packages in current session
...
...

The require function is an alternative to the library function for loading packages in R. The installed.packages function may also be used for listing the packages that are already installed on the system. Different from the library function, installed.packages provides more details about the installed packages.
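As a minimal sketch (the package name ggplot2 is only an example), require can be used to load a package and to trigger its installation when it is missing:

# require returns TRUE/FALSE instead of raising an error when the package is missing
if (!require("ggplot2")) {
  install.packages(pkgs = "ggplot2")
  library(package = "ggplot2")
}
# installed.packages returns a matrix with one row per installed package
head(installed.packages()[, c("Package", "Version")])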

1.4.1 Loading Data in R


Loading a package makes not only the functions defined in the package available to the user but also any data that come with the package. The data in a package are nothing but named R objects. Usually one or more data sets in the form of data frames are put into R packages for experimenting with the functionality of the package. However, there are some packages, e.g., datasets and Ecdat, which consist only of data sets.
Once a package is loaded its named objects, if any, including the data sets become available to the user. However, availability of a data set does not mean that it is accessible to the user. In order to access a data set the user has to load the data set which resides in an already loaded package. R provides the data function to load a specified data set. Data sets are specified by their names and the data function expects the name of the data set to be loaded as an argument. Note that loading a particular data set does not actually load the data set into memory immediately. Instead, most package developers use a design pattern called lazy loading. In lazy loading the data are not actually loaded into memory until the user accesses the data.
Figure 1.97 demonstrates how to load a data set named diamonds. Since the diamonds data set is part of the ggplot2 package, the package has to be loaded before the data set. Another data-only package is the Ecdat package, which contains several data sets for econometrics. Ecdat is not a core package of R; hence, it is necessary to install the package before accessing the data sets that come with it.
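A minimal sketch of this workflow is shown below; data(package = "Ecdat") only lists the data sets shipped with the package without loading any of them.

install.packages(pkgs = "Ecdat")   # install once, if not already installed
library(package = "Ecdat")         # load the package
data(package = "Ecdat")            # list the data sets that come with Ecdat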


Figure 1.97: Listing and loading data in R

> library("ggplot2")
> data()
...
...
> data(diamonds)
> head(diamonds)
> str(diamonds)
> summary(diamonds)

1.4.2 Importing/Exporting Data in R


Usually one has to deal with external data sets generated by various applications in different
formats. R provides several functions to import external data into the R environment for
analysis and manipulation. These functions allow us to import the data stored in text files,
native R files, MS Excel files, Open document spreadsheets, SPSS data files, STATA data
files, SAS export files and relational databases.
Since the majority of applications support exporting data in text format, R intrinsically provides many functions for importing data from text files. The central function for importing data stored in text files is read.table. The read.table function expects the data in the file to be arranged in a tabular fashion and it returns a data frame object holding the data. Since the read.table function supports many parameters to control almost every aspect of importing table-like data from a text file, we strongly suggest our readers browse the documentation of the function. However, we cover the most frequently used parameters of the read.table function in the following. The parameter file is a mandatory parameter denoting the absolute or relative path of the data file. It is a good practice to check the working directory with the function getwd in case relative paths are used. The parameter header=FALSE is an optional parameter indicating whether the first line in the file contains the column headers. The parameter sep="" is an optional parameter denoting the field separator used to separate values on each line. The parameter dec="." is an optional parameter specifying the character used as the decimal point. The parameter comment.char="#" is an optional parameter specifying the character that starts a comment. The parameter na.strings="NA" is an optional parameter giving the vector of strings that represent missing values in the data set. The parameter strip.white=FALSE is an optional parameter denoting whether to strip leading and trailing white space characters. The parameter fill=FALSE is an optional parameter denoting whether to implicitly add blank fields if rows have unequal lengths. Finally, the optional parameter quote="\"'" denotes the set of quoting characters.
Figure 1.98 illustrates how to use the read.table function to import data stored in a file
and inspect the main features. Note that the data set worldbook.csv contains information
about 42 different features of 179 countries in the world. The file is compiled from many
different sources and it does not reflect the current figures of the countries. However, it
serves very well for demonstrating various visual analytics concepts. The data set is publicly
available on the web site of this textbook.
Note that R provides several wrapper functions for the read.table function. These
wrapper functions call the read.table function in their bodies by setting its parameters


Figure 1.98: Importing data by read.table function and inspecting the features

> worldbook <- read.table(file="~/Desktop/worldbook.csv", header=TRUE, sep=",", fill=TRUE, quote="\"")
> head(worldbook)
...
...
> str(worldbook)
...
...
> summary(worldbook)
...
...
> View(worldbook)

to specific values. The read.csv and read.csv2 functions are two wrappers which are frequently used for importing data sets formatted as comma-separated values in a tabular fashion. The read.csv function assumes that the fields of the data are separated by commas. The read.csv2 function assumes that the fields of the data are separated by semicolons. Figure 1.99 illustrates how to use the read.csv function to import the same data set imported in Figure 1.98. In addition to read.table and its wrappers, R provides the functions read.fwf and scan for importing data in fixed-width format and for importing data as vectors or lists, respectively.
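As a minimal sketch, scan can read free-form values directly into a vector; the literal values below are arbitrary:

# scan reads whitespace-separated values; here they are supplied inline via text=
x <- scan(text = "3.1 2.7 5.4 1.9")
x   # a numeric vector of length four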

Figure 1.99: Importing data by read.csv function and inspecting the features

> worldbook <- read.csv(file="~/Desktop/worldbook.csv", comment.char = "#")


> head(worldbook)
...
...
> str(worldbook)
...
...
> summary(worldbook)
...
...
> View(worldbook)

Loading data in native R format (.RData and .rda) is easier. The load function allows us to load a data file in native R format by setting its mandatory file parameter to the path of the file.
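The sketch below illustrates the round trip with save and load, assuming the Desktop path is writable (the object and file name are arbitrary):

x <- 1:10
save(x, file = "~/Desktop/mydata.rda")   # write x to a file in native R format
rm(x)                                    # remove x from the workspace
load(file = "~/Desktop/mydata.rda")      # restores the object named x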
Formats of other statistical software are called foreign formats. R supports foreign formats through its foreign package. The package foreign provides several functions for importing SPSS data files, STATA data files and SAS export files. The package gdata provides functions for importing data in Excel spreadsheets. Although R supports importing files in foreign formats, we strongly suggest using text files for migrating data sets. Almost all statistical software and spreadsheet applications allow a user to export data in comma separated values (csv) format, which is a prevalently used text format.
The function write.table is used to export tabular data into a file. Similar to read.table,


the write.table function supports several parameters. The parameter file is a mandatory parameter denoting the absolute or relative path of the destination file. It is a good practice to check the working directory with the function getwd in case relative paths are used. The parameter append=FALSE controls whether the new data are to be appended to an existing file. In case the parameter is set to logical false, the existing file is overwritten. The parameter eol="\n" denotes the end-of-line character to be used while writing the data. The parameter na="NA" is used to control how NA (Not Available) values are written into the file. Figure 1.100 shows how to export data with the write.table function. In the example, a semicolon is used as the separator character instead of a comma, and it is specified by the sep parameter of the write.table function.

Figure 1.100: Exporting data by write.table function

> worldbook <- read.csv(file="~/Desktop/worldbook.csv", comment.char = "#")


> write.table(x=worldbook, file="~/Desktop/newfile.csv2", sep=";")
>

Finally, the function write is used for writing out matrix-like data into files and the
function save is used to write any type of data into a file in native R format. Data exported
in native R format can be imported later by the load function.

1.5 R Scripts
So far we have been interacting with the R terminal by entering and running our commands. One drawback of this approach is that you need to re-enter and execute the commands of repetitive tasks. R, however, can be used as a scripting language. That is, a sequence of R instructions can be saved into a text file as a script and later the entire script can be executed. One can use his/her favorite text editor to populate a file with R instructions and save it to be executed later. RStudio also has a built-in text editor for creating R script files. Although it is not a requirement, the .R extension is typically used for R script file names. There are two ways to run your script files: (i) through the OS terminal and (ii) through the R terminal. To run a script via the OS terminal, one calls the program named Rscript with the path to the R script file as an argument. Note that the script file needs to be marked as executable only if it is run directly as a program rather than passed to Rscript. To run a script via the R terminal one needs to call the source function. The file parameter is required and it denotes the path to a local script file or a URL to a remote script address.
Figure 1.101a shows the content of a script file located in the Desktop folder. Figures 1.101b and 1.101c show the script run via the OS terminal and the R terminal, respectively.

1.6 R Control Structures


R is a full featured programming language and similar to other languages it supports the
standard control structures, including if, if-else, else if, for, while, repeat, break,
next and return. Although the best way to use the control structures is in functions, one
can run the control structures in an R terminal as well. Note that implementing functions is


Figure 1.101: An R script file run via both R and OS terminals

# script file example.R


a <- 5
b <- 4
c <- a + b
print(c)

(a) R script in ∼/Desktop/example.R

% Rscript ~/Desktop/example.R
[1] 9
%

(b) Running the script via the OS terminal

> source(file="~/Desktop/example.R")
[1] 9
>

(c) Running the script via the R terminal

introduced in the next section. Control structures typically start with a statement followed by the instructions placed within the body of the statement. The body of the statement is enclosed by two curly braces, i.e., { and }. Some control statements, such as break, do not have bodies.

1.6.1 Conditionals

Figure 1.102: if-else and else if conditional structures

> n <- 0
> if(n < 0){
+ print("Negative Number")
+ } else if(n > 0){
+ print("Positive Number")
+ } else {
+ print("Neither Positive nor Negative")
+ }
[1] "Neither Positive nor Negative"
>

Figure 1.102 shows an if-else and else if conditional structure executed in an R terminal. The plus sign in the console indicates that R is expecting additional input to finalize the command. if and else if structures require a condition. If the condition evaluates to true, the body of the if or else if structure is executed; otherwise the body is simply skipped. The conditions of an if and else if structure are evaluated in order and, when a condition evaluates to true, the related body is executed while the remaining conditions and their bodies are skipped. When none of the conditions of an if and else if structure evaluates to true, the body of the last else structure is executed, if one exists. In Figure 1.102, R first evaluates the condition of the if structure; since n is zero, the condition evaluates to false. Next, it evaluates the condition of the else if structure; again n is zero and the condition evaluates to false. Since there are no more conditions, R executes the


body of the else structure, which prints “Neither Positive nor Negative”.

1.6.2 Loops

Figure 1.103: for loop

> v <- c(1, 2, 4, 8, 16, 32, 64)


> cum.sum <- 0
> for (i in 1:length(v)) {
+ cum.sum <- cum.sum + v[i]
+ cat(cum.sum, " ")
+ }
1 3 7 15 31 63 127

R supports various loop types, including for, while and repeat. Figure 1.103 initializes a vector, v, and a variable named cum.sum which represents the cumulative sum. The for loop initializes the variable i to 1 and repeatedly executes its body while updating i to the next value in the sequence 1:length(v). The loop automatically terminates after i assumes the last value in the sequence, i.e., the length of v. Inside the body of the loop, the current element of v is added to cum.sum and its value is printed by the cat function followed by a space. Note that one can use the print function instead of cat to print each value on a new line.

Figure 1.104: while loop

> v <- c(1, 2, 4, 8, 16, 32, 64)


> cum.sum <- 0
> i <- 1
> while (i <= length(v)) {
+ cum.sum <- cum.sum + v[i]
+ cat(cum.sum, " ")
+ i <- i + 1
+ }
1 3 7 15 31 63 127

The console in Figure 1.104 is equivalent to the one in Figure 1.103; however, it uses a while loop instead of a for loop. Different from a for loop, a while loop requires a condition at the beginning and it repeatedly evaluates the condition before executing its body. When the condition evaluates to false, the while loop automatically ends. In the figure, the variable i is initialized to 1 and is incremented by 1 in the body of the loop. When the value of i exceeds the length of v, the condition of the while loop evaluates to false and the loop ends automatically.
A repeat loop is similar to a while loop, except it does not have a condition specified at the beginning. The lack of an explicit condition at the beginning requires the programmer to explicitly break the repeat loop to prevent it from running infinitely. To break the repeat loop R provides the break statement, which can also be combined with other types of loops to prematurely end their repetitive executions.
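The following sketch (our own illustration, not taken from a figure) uses break inside a for loop to stop as soon as the cumulative sum exceeds 50:

v <- c(1, 2, 4, 8, 16, 32, 64)
cum.sum <- 0
for (x in v) {
  cum.sum <- cum.sum + x
  if (cum.sum > 50) {
    break            # leave the loop prematurely
  }
}
print(cum.sum)       # prints 63, the first cumulative sum above 50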


Figure 1.105: repeat loop

> cum.sum <- 0


> i <- 1
> repeat{
+ cum.sum <- cum.sum + v[i]
+ cat(cum.sum, " ")
+ i <- i + 1
+ if(i > length(v)){
+ break
+ }
+ }
1 3 7 15 31 63 127

The console in Figure 1.105 is equivalent to the one in Figure 1.104; however, it uses a repeat loop instead of a while loop. Different from a while loop, the repeat loop does not require a condition at the beginning and it repeatedly executes its body until a break statement is executed. The body of the repeat loop is similar to the body of the while loop presented in Figure 1.104. The only difference is that the repeat loop has a conditional statement that explicitly breaks the loop when i is greater than the length of v.
The next statement is another control structure used with loops. When a next statement is executed it simply skips all instructions following the next in the body of the loop and takes control back to the beginning of the loop.
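A small illustrative sketch of next (again our own example) skips the even numbers in a sequence:

for (i in 1:6) {
  if (i %% 2 == 0) {
    next             # skip the remaining statements for even values of i
  }
  cat(i, " ")
}
# prints: 1  3  5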
Lastly, the return statement, which is used to return objects from functions, is covered in the next section.

1.7 Implementing Functions in R


An R function is a script consisting of a sequence of R statements and control structures
that perform a specific task. Although R comes with numerous built-in functions, it also
allows implementing and running custom functions. In R a function is an object of mode
function. Different from other programming languages, one has to assign the definition of
the function to an R object in order to call it later using the name of the object.

Figure 1.106: A simple R function definition and call

> greeting <- function(){


print("Hello World")
}
> greeting()
[1] "Hello World"

An R function definition starts with the keyword function followed by parentheses for parameter declarations. Note that no parameters are declared in the greeting function defined in Figure 1.106. The body of the function is delimited by curly braces. Simply, the body of a function encloses a sequence of R statements and control structures that perform a task

all together. In the previous example the body of the function consists of a single state-
ment calling the built-in print function. Finally, the definition of the function needs to be
assigned to an R object in order to be called later. In the previous example the function is
assigned to the variable greeting which also serves as the name of the function.

Figure 1.107: R function structure. Brackets are used to refer to optional elements

functionName <- function([param1, param2, ...]){


statement1
statement2
...
statementk
[return(argument)]
}

Figure 1.107 shows the general structure of an R function. Note that the brackets are used to refer to the optional elements of a function. Although the previous example (Figure 1.106) does not require any arguments from the caller to perform its task, some functions may require one or more arguments to be passed. For example, a function calculating the mean and standard deviation of a vector needs the vector to be passed as an argument. Parameters are lists of placeholders for arguments. Parameters are defined between the parentheses in the definition of a function using the form param1, param2, ... . R supports C++ style parameters with default values. One needs to use the format param = value in the function definition to set a default value for the parameter. Mixing parameters with and without default values can create ambiguity when arguments are matched by position. As a rule of thumb, parameters with default values should be placed at the end of the parameter list in the function definition. Note that programmers use meaningful names for parameters rather than param1.
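A minimal sketch of a function with a default parameter value (the function name and behavior are our own illustration):

power <- function(x, exponent = 2) {
  return(x^exponent)
}
power(3)      # exponent defaults to 2, returns 9
power(3, 3)   # the default is overridden, returns 27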
Arguments are passed to R functions by value. That is, a copy of each argument is created and passed to the function to keep the objects in the parent environment intact. This poses a problem in terms of efficiency and memory space, especially when it comes to passing large data sets to functions. R reduces the severity of the problem by not creating a copy of the passed object until it is modified.
Some functions perform calculations over their arguments and need to return the results of those calculations to the caller environment. This is done by using a special R function called return. return takes a single argument and returns it to the caller environment. Multiple values are returned to the caller environment by packing them into an existing multivariate data type such as a list, array, or data frame.
In the following we implement a function called center_spread that computes and returns either the mean and standard deviation or the median and the interquartile range of a vector. Note that R by default returns the last evaluated value even if there is no explicit return statement. However, explicitly using return in functions that return a value is suggested for code clarity.
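Once center_spread is defined as in Figure 1.108 below, it could be called as sketched here (the sample vector is arbitrary):

v <- c(2, 4, 4, 4, 5, 5, 7, 9)
center_spread(vec = v)                # mean and standard deviation of v
center_spread(vec = v, mean = FALSE)  # median and interquartile range of v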
Sometimes one needs to define a function and pass it to another function as an argument. In case such a function is needed only once it is better to define it on the fly as an anonymous function. Anonymous functions are not assigned to an object and they are mostly defined on a single line, where individual statements are separated by semicolons. Another unusual practical aspect of anonymous functions in R is that mostly the body of the function


Figure 1.108: center_spread function definition

center_spread <- function(vec, mean=TRUE){


if(mean==TRUE){
center <- mean(x=vec)
spread <- sd(x=vec)
}else{
center <- median(x=vec)
spread <- quantile(x=vec, probs=0.75) - quantile(x=vec, probs=0.25)
}

return(list(center=center, spread=spread))
}

is not enclosed with curly braces and the return function is not explicitly used. Yet, it is
recommended to use curly braces around the body of the function and to explicitly add the
return function (if needed) in order to improve code readability.
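As a minimal sketch, an anonymous function can be passed directly to sapply; the vector and the computed expression are only illustrative:

# the anonymous function is defined inline and never assigned to a name
sapply(X = 1:5, FUN = function(x) { return(x^2 + 1) })
# [1]  2  5 10 17 26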
R also supports passing an arbitrary number of arguments to a function through the ellipsis (...) parameter. The ellipsis parameter is especially useful for passing arguments on to the functions that are called inside the body of a function. The ellipsis parameter is mostly combined with regular parameters by placing it at the end of the parameter list. Ellipsis arguments can be processed within the body of a function as a list data type.
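A minimal sketch of the ellipsis parameter (the function name summarize_vec is our own, hypothetical example); the extra na.rm argument is simply forwarded to mean and sd:

summarize_vec <- function(vec, ...) {
  # any additional arguments, e.g. na.rm, are forwarded to mean and sd
  return(c(center = mean(vec, ...), spread = sd(vec, ...)))
}
summarize_vec(c(1, 2, NA, 4), na.rm = TRUE)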

1.8 Exercises
Vectors and Factors
1. Create and print the following vectors
(a) [100, 99, 98, . . . , 3, 2, 1] .
(b) [1, 3, 5, . . . , 95, 97, 99] .
(c) [1, 3, 5, . . . , 45, 47, 49, 2, 4, 6, . . . , 46, 48, 50] .
(d) [1, 3, 5, 1, 3, 5, . . . , 1, 3, 5, 1, 3, 5] such that there are 20 occurrences of 1, 20 occur-
rences of 3 and 20 occurrences of 5.
(e) [7, 7, . . . , 7, 9, 9, . . . , 9, 11, 11, . . . , 11] where there are 20 occurrences of 7, 40 oc-
currences 9 and 80 occurrences of 11.
2. First, create the vector x = [1.0, 1.1, 1.2, . . . , 1.8, 1.9, 2.0]

(a) Create and print a vector of the function $e^x \sqrt{|\cos(x)|}$, where $|\cos(x)|$ is the absolute value of the cosine function.


(b) Create and print a vector of the function $x^2 e^{\ln(x)} + e^{x^3}$, where $\ln(x)$ is the natural logarithm function.
3. First, create the vectors x = [45, 3, 8, 27, 35, 13] and y = [29, 12, 11, 9, 21, 19]

(a) Combine vectors x and y to create a new vector z.


(b) Print the maximum and minimum numbers in vector z.


(c) Print the indices of the maximum and minimum numbers in vector z.
(d) Print the second maximum number in vector z.
(e) Print the second minimum number in vector z.
4. Create vectors v1 = [1, 4, 8, 2, 4, 6] and v2 = [3, 4, 7, 2, 4, 5]
(a) Print the indices of the elements satisfying the rule (v1 < v2)
(b) Print the total number of the elements satisfying the rule (v1 == v2)
5. Create and print the vector $[2^1 \cdot 0.1^{26},\ 2^2 \cdot 0.1^{25},\ 2^3 \cdot 0.1^{24},\ \ldots,\ 2^{25} \cdot 0.1^{2},\ 2^{26} \cdot 0.1^{1}]$ consisting of the products of the powers of 2 and 0.1.
6. Create and print the vector $[5,\ \frac{5^2}{2},\ \frac{5^3}{3},\ \frac{5^4}{4},\ \ldots,\ \frac{5^{20}}{20}]$
7. Calculate and print the final result of the following summation
$$\sum_{x=12}^{x=99} \left(x^2 + \ln(x)\right)$$

8. Calculate and print the final result of the following product
$$\prod_{x=3}^{x=9} \left(x^2 + \sqrt{x}\right)$$

9. Create a vector x = [1, 2, 3, 4, . . . , 46, 47, 48, 49, 49, 48, 47, 46, . . . , 4, 3, 2, 1]
(a) Print the indices of the elements of x that are divisible by 3.
(b) Print the elements of x that are divisible by 3.
(c) Print the indices of the elements of x that are divisible by 2 but not by 5.
(d) Print the elements of x that are divisible by 2 but not by 5.
10. Use the paste function to create the character vector ["Player-1", "Player-2", "Player-3",
. . ., "Player-19"] and print it.
11. Generate the vector of characters rating = ["Good", "Mediocre", "Bad", "Bad", "Good",
"Good", "Mediocre", "Bad", "Good", "Good", "Good", "Bad", "Bad", "Mediocre", "Bad",
"Good"] representing the food ratings given by customers at a restaurant. Generate
another vector of integers gender = [1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2] representing
the genders of customers who rated.
(a) Convert and print the rating vector into an ordered factor with four levels, Bad < Mediocre < Good < Excellent, even though none of the customers rated the food “Excellent”.
(b) Convert and print the gender vector into a labeled factor with two levels: Male and Female.
(c) Use the table function to print a contingency table of the rating and gender
objects together. Explain the pattern you see in the contingency table.


Matrices
12. Considering the following matrix
$$M = \begin{pmatrix} 5 & -3 & 2 \\ 15 & -9 & 6 \\ 10 & -6 & 4 \end{pmatrix}$$
(a) Show that M is a nilpotent matrix. That is, $M^k = 0$ for some k, where 0 is a matrix with all zero entries.
(b) Update M by replacing its second column with the sum of the first and the third
columns and print the updated M .
(c) Update M by replacing its element located at (2,3) with 128 and print the second
row of the updated matrix.
(d) Prove that the updated M does not have an inverse, i.e., it is singular, by showing that its determinant is zero. Please use the det function to compute determinants of matrices.
(e) Print $M^T$, i.e., the transpose of the updated M.
13. First, set your seed value to 1001 by using the set.seed function. Please do research
about the set.seed function before using it, as it is sensitive to the seed value and
the calls afterwards. Then, create two 3 × 3 square matrices P and R by randomly
populating them with the numbers between 0 and 10, including zero and ten, i.e.,
[0, 10], with replacement.
(a) Compute and print P+R
(b) Compute and print matrix multiplication PR
(c) Compute and print matrix multiplication RP
(d) Compute and print matrix multiplication PT R
(e) Compute and print matrix multiplication PRT
14. First, set your seed value to 1001 by using the set.seed function. Then, create a 6 × 10 matrix by randomly selecting 60 numbers from the interval [1, 10] with replacement.
(a) Find and print the number of the elements that are greater than six.
(b) Find and print the number of the elements that are greater than six but less than
ten.
(c) Find and print the number of occurrence (frequency) of each unique value in the
matrix.
15. Considering the following square matrices
$$P = \begin{pmatrix} 5 & -2 \\ 1 & 4 \end{pmatrix}, \quad Q = \begin{pmatrix} 0 & 7 \\ -4 & 9 \end{pmatrix}, \quad R = \begin{pmatrix} 3 & 8 \\ 8 & -6 \end{pmatrix}$$


(a) Show that matrix multiplication satisfies the associativity rule, i.e., (PQ)R =
P(QR) .
(b) Show that matrix multiplication over addition satisfies the distributivity rule,
i.e., (P + Q)R = PR + QR .
(c) Show that matrix multiplication does not satisfy the commutativity rule in general, i.e., $PQ \neq QP$ .
(d) Generate a 2 × 2 identity matrix, I. Note that the 2 × 2 identity matrix is a
square matrix in which the elements on the main diagonal are 1 and all other
elements are 0. Show that for a square matrix, matrix multiplication satisfies the
rules PI = IP = P .

16. Solve the following system of linear equations using matrix algebra and print the
results for unknowns.

x+y+z =6
2y + 5z = −4
2x + 5y − z = 27

17. Use the outer function to generate and print the following matrix
$$M = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 \\ 1 & 2 & 3 & 4 & 5 \\ 2 & 3 & 4 & 5 & 6 \\ 3 & 4 & 5 & 6 & 7 \\ 4 & 5 & 6 & 7 & 8 \end{pmatrix}$$

18. Compute and print the final result of the following series. Hint: You may use the
outer function to generate a matrix representation of the partial results.

$$\sum_{x=1}^{x=24} \sum_{y=1}^{y=12} \frac{x^2}{y+1}$$

19. Compute and print the final result of the following series. Hint: You may use the
outer function to generate a matrix representation of the partial results.

$$\sum_{x=1}^{x=24} \sum_{y=1}^{y=12} \frac{x^2}{xy+1}$$

Data Frames
20. Use the following tabular data consisting of 6 records to create a data frame object
named cars and print it. The data consists of 7 different features of 6 cars. You may
use vectors along with the data.frame function to create the cars object.
21. Print the output of the str function for the cars data frame.


model mpg cyl disp hp drat wt


Mazda RX4 21.00 6.00 160.00 110.00 3.90 2.62
Mazda RX4 Wag 21.00 6.00 160.00 110.00 3.90 2.88
Datsun 710 22.80 4.00 108.00 93.00 3.85 2.32
Hornet 4 Drive 21.40 6.00 258.00 110.00 3.08 3.21
Hornet Sportabout 18.70 8.00 360.00 175.00 3.15 3.44
Valiant 18.10 6.00 225.00 105.00 2.76 3.46

22. Print the number of rows and the number of columns of cars.
23. Update the cars data frame by adding (“Toyota Corolla” 33.9 4 71.1 65 4.22 1.835).
Print the updated data frame. In the following questions always use the updated cars
object.

24. Sort and print cars by mpg variable.


25. Update the cars data frame by using the disp and hp variables to create a new variable named dhp representing disp + hp. Print the updated data frame. In the following questions always use the updated cars object.
26. Print the car record with the minimum dhp.

27. Print all car models that have 20 or higher mpg.


28. Print the car record with the third largest disp.
29. Print all cars with less than 3.7 drat and exactly 6.00 cyl.

30. Use the table function to print the frequency distribution of the cars in your dataset
based on the carb values.
31. Use the table and cut functions to print the frequency distribution of the cars falling into bins of 2 mpg starting from 16. That is, bin the mpg into intervals of two starting from 16.

32. Create and print a new data frame object called fast.cars by subsetting only qsec
and gear columns and including the cars only with 110 hp or higher.

Packages
33. Install the package AER on your computer, if you have not done so before. Next, load the package and print the output of the package loading process.
34. Load the dataset named NMES1988 in the AER package and print the output of the
loading process.
35. In your own words, explain what this dataset is about after doing some research about
the dataset.

36. Print the output of the str function on NMES1988. Also, use the help function to
explain the variables of NMES1988 dataset in your own words.


37. Use the table function to display the frequency distribution of the school variable in
NMES1988. Explain the pattern that you see in the distribution.
38. Create a new data frame object called NMES1988.SUB by subsetting all numeric vari-
ables. Print the first few records of NMES1988.SUB using the head function.

39. Use one of the apply family functions, to compute and print a vector of averages of
all columns of NMES1988.SUB. Note that you can use the mean function for computing
the averages.

Chapter 2

Exploratory Data Analysis

2.1 Data and Measurement


This section provides an informal introduction to the concepts related to data and measurement.
A unit in statistics is an entity to be studied. A unit can be a single entity; however, a set of entities can collectively form a single unit as well. For example, a patient may be a unit to study; however, a collection of patients may also be another unit to study for a doctor. Furthermore, a unit could be a tangible or intangible entity. A tree, a forest, a person, a population, an animal, a mathematical function, a geometric shape, etc. are examples of units.
Data (plural of datum) are the facts observed or measured on units. Most of the time data are collected on one or more attributes of a unit. Observation involves acquiring data through the senses whereas measurement involves mapping the observation into a number or symbol. Measurement allows us to compare, contrast or transform data. Measured data can broadly be categorized as qualitative and quantitative. Qualitative data refers to the descriptive values measured on units, e.g., gender of a person, major of a student, type of a cup of coffee. Quantitative data refers to the numerical values measured on units, e.g., height of a person, age of a student, amount of coffee in a cup. Quantitative data is further classified as discrete and continuous. Discrete data assume countable (finite or countably infinite) values whereas continuous data assume real values. For example, the credits earned by a student and the outcome of a die assume discrete values. On the other hand, the GPA of a student or the length of a fish assume continuous values.

Counting
Counting is defined as the determination or estimation of a quantity. In its simplest form, determining the number of units is counting. An indirect example would be measuring the height of a person. In fact, while measuring the height of a person we are counting the number of inches (centimeters). Similarly, weighing is nothing but counting the number of pounds or ounces (kilograms or grams).
Ranking
Ranking is defined as the assignment of a position to a unit. The assigned positions
can be represented by symbols, e.g., good-mediocre-bad, or numbers, e.g., 1-2-3. For


example one may order fruits starting from his most favorite to the least favorite.
Similarly, one may rank universities according to their research expenditures.
Classifying
Classifying is defined as sorting units into predefined categories. For example a person can be categorized according to his blood type: A, B, AB and O. Units can also be cross classified. For example universities can be categorized as national or regional universities and public or private.

The term variable has different but related meanings in mathematics, computer science
and statistics. In mathematics a variable is a symbol representing qualitative or quantitative
data. In computer science a variable is a storage location associated with a symbolic name
(identifier) that contains qualitative or quantitative data. In data analysis or statistics
context, a variable is an attribute that is being measured on one or more units. For example
height or weight of a person, the rank of a university, blood type of a person are all variables.
The values of variables may change in time, under different conditions or from one unit to
another.
Variables are mathematical tools allowing us to modify, transform or control measured attributes. In experimental research there are two types of variables: independent and dependent variables. Independent variables are variables that can be controlled, i.e., manipulated or modified, in a model or equation. Dependent variables are variables whose values are observed or measured in response to changes in the independent variables.

2.1.1 Levels of Measurement


Although variables are assigned values via counting, ranking, or classifying, the nature of
the values that they take on allows us to compare, contrast and transform them as well as
apply fundamental mathematical operations. Levels of measurement or scale of measure is
an effort to develop a taxonomy of variables based on the nature of the values that they
take on. There are five levels of measurement: nominal scale, ordinal scale, rating scale,
interval scale and ratio scale.

Nominal Scale
Nominal scale variables take descriptive (qualitative) values. Examples of nominal
scale variables are: gender (male, female); marital status (single, married, divorced);
major (Biology, Economics, Informatics etc); shirt numbers of sports players (1,2,3 ...).
Nominal scale variables having only two possible values such as gender (male, female)
or coin toss (tails, heads) are also called dichotomous variables. Notice that nominal
scales are purely descriptive. We cannot apply relational operators such as less than
(<) or greater than (>) nor can we apply arithmetic operators such as addition (+),
subtraction (−), multiplication (∗) or division (/). That is, a player with shirt number
8 being greater than another player with shirt number 4 does not make sense nor does
adding Biology to Economics. Sometimes, nominal scales are represented by numbers.
Although we can use 0 and 1 to represent the tails and heads outcomes of a coin toss, respectively, the numbers merely serve as labels on a nominal scale. Finally, a count of nominal scale values is different from the nominal scale itself. For example, a variable denoting the number of female students is different from a variable denoting the gender of a student.
The mode is a common statistic applied to nominal scale variables to measure central tendency.


Ordinal Scale
Ordinal scale variables take descriptive (qualitative) or numerical (quantitative) val-
ues. The values that an ordinal scale variable takes can be ordered from high to
low. Examples of ordinal scale variables are: ranks at Olympic games (1st , 2nd , 3rd );
consumer satisfaction level (very satisfied, satisfied, somewhat satisfied, somewhat
dissatisfied, dissatisfied, very dissatisfied); statement agreement level (strongly agree,
agree, neutral, disagree, strongly disagree). Although the values of an ordinal scale
variable can be ranked, consecutive values do not hold equal intervals. Hence, the
amount of distance between any two values is not meaningful or interpretable. That
is, the performance difference between the athletes who came in the 1st and 2nd places is not equal to the performance difference between the athletes who came in the 2nd and 3rd places. Similarly, the differences among the consecutive levels of consumer satisfaction are not equal. Although relational operators such as less than (<) or greater than (>) make sense for ordinal scale variables, arithmetic operators do not.
Mode and median are common statistics applied to ordinal scale variables to measure
central tendency.
Rating Scale
Rating scale variables are an extension to ordinal scale variables where the possible
values are represented by numbers. For example statement agreement level (strongly
agree, agree, disagree, strongly disagree) can be represented by numbers such that
values strongly agree, agree, neutral, disagree, and strongly disagree correspond to
5, 4, 3, 2, 1, respectively. Another example is letter scale grades A, B, C, D, and F
mapped to numbers 4.0, 3.0, 2.0, 1.0, and 0.0, respectively. Although the consecutive numbers in both examples seem to be equally spaced, the resemblance is illusory when one thinks about the ordinal scales that these numbers represent in reality.
Mode and median are common statistics applied to rating scale variables to measure central tendency. The average as a measure of central tendency should be avoided or, at best, applied cautiously.
Interval Scale
Interval scale variables take numerical (quantitative) values such that the intervals
between consecutive values are equal. Not only can interval scale variables be ordered, but the distances between their values are also interpretable. Put differently, interval scale variables not only tell which value is greater than the other but also tell the magnitude of the difference between them. Examples of interval scale measurements are: temperature degree in Fahrenheit (Celsius); IQ (Intelligence Quotient) score; latitude degree (from −90◦ to +90◦ ); date. The 5◦ temperature difference between 15◦ and 20◦ is the same as the 5◦ temperature difference between 40◦ and 45◦ . Relational operators are applicable to interval scale variables. Although addition and subtraction are applicable to interval scale variables, multiplication and division are not. That is, 30◦ is not twice as hot as 15◦ , and a person with an IQ score of 150 scores 75 points higher than a person with a score of 75 but is not twice as smart.
Mode, median, arithmetic average are common statistics applied to interval scale vari-
ables to measure central tendency. Range, variance, standard deviation are statistics
applied to interval scale variables to measure dispersion.
Ratio Scale
Similar to interval scale variables ratio scale variables take numerical (quantitative)


values such that the intervals between consecutive values are equal. Furthermore,
ratio scale variables have a true zero point denoting the absence of the quantity being
measured. That is, 0 simply denotes the lack of quantity in ratio scale variables.
Examples of ratio scale measurements are: height, weight, age, number of students in
a classroom. Ratio scale variables allow us to compare values with respect to each other. For example, a classroom with 40 students is twice as crowded as a classroom with 20 students. Similarly, a person who is 25 years old is half as old as a person who is 50 years old. Both relational and arithmetic operators are applicable to ratio scale variables. The reason that temperature in degrees Fahrenheit or Celsius is not a ratio scale is that it does not have a true zero value. That is, 0◦ does not imply the absence of thermal motion. Instead, −459◦ and −273◦ correspond to the lack of thermal motion on the Fahrenheit and Celsius scales, respectively. On the other hand, the Kelvin scale for temperature has an absolute zero at which all thermal motion stops. That is, 0K is a true zero on the Kelvin scale. Similarly, date in its common usage is an interval scale because date zero does not refer to the absence of time, whereas age is a ratio scale because age zero denotes the absence of elapsed lifetime.
Mode, median, arithmetic/geometric/harmonic mean are common statistics applied
to ratio scale variables to measure central tendency. Range, variance, standard devi-
ation, coefficient of variation are statistics applied to ratio scale variables to measure
dispersion.
Finally, two important characteristics of measurement are reliability and validity. Re-
liability implies that the measurement of a property of a unit results in the same
value with an acceptable amount of error under the same conditions. For example
repeatedly measuring the height of a person should return the same result (with an
acceptable error) under the same conditions. Validity implies that a measurement
procedure measures what it is supposed to measure. For example one cannot weigh a
person to measure his blood type.

2.2 Distributions
2.2.1 Empirical Distributions
A collection of data observed or measured on a set of units is simply called a distribution.
Figure 2.1a shows the distribution of the days it takes to service a car at a busy mechanic
shop for 256 cars. Figure 2.1b shows the distribution of the March electricity bills (in
dollars) of a neighborhood consisting of 400 houses. Note that due to the large size of
the collection we only display the lowest and highest twelve values. Basically, Figure 2.1a and Figure 2.1b demonstrate how the service times (in days) are distributed over 256 cars and how the electricity bill amounts are distributed over 400 houses, respectively. We use
distributions to understand, analyze and infer various characteristics of the set of units on
which the measurements are collected. Although the concept of a distribution is simple, it
becomes a powerful tool when visualization techniques and statistical methods are used to
summarize, model and interpret distributions.
A common method for visualizing distributions is to (i) divide the range of measurements
into intervals; (ii) count the number of measurements falling into each interval and (iii)
show a graphic demonstrating the intervals, the number of measurements in each interval
(frequency) and the relationship between intervals. Figure 2.2 visualizes the car service time


Figure 2.1: Example distributions

(a) Service time distribution (days):
 1  0  3 12  3  3
 4 12  2  3  5  0
...
 9  4  7  1  0  7
 1  1 16  5  6  1

(b) Electricity bill distribution ($):
115.64  82.51  82.41  49.91  77.20  82.99
100.28  76.28  72.30  62.69  89.24 107.88
...
 86.43  54.62  82.65  82.33  87.01  65.38
 58.98 101.62 104.31  75.55  89.49 110.94

distribution and the electricity bill distribution given in Figure 2.1a and Figure 2.1b using
different types of graphics. Figures 2.2a through 2.2f divide the domain of service time into equal intervals of one day. Figures 2.2g through 2.2l divide the domain of electricity bill amounts into equal intervals of eight dollars.
Figure 2.2a and Figure 2.2g show the distributions as frequency histograms where each
vertical rectangle represents an interval and the number of measurements falling into that
interval. The rectangles are positioned at the middle of each interval on the x axis and the
heights of the rectangles on the y axis represent the number of the measurements falling
into the interval. The sum of the frequencies (y axis) of all intervals is the total number
of measurements in the data set for each histogram. Figure 2.2b and Figure 2.2h show the
distributions as frequency polygons where the middle-top points of the frequency histograms
are joined together using lines.
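As a minimal sketch, a frequency histogram like the one in Figure 2.2a could be produced with the built-in hist function; the vector below is only a small illustrative subset of the service time values shown in Figure 2.1a:

service.days <- c(1, 0, 3, 12, 3, 3, 4, 12, 2, 3, 5, 0, 9, 4, 7, 1, 0, 7)
hist(service.days,
     breaks = seq(0, 15, by = 1),   # intervals of one day
     freq = TRUE,                   # heights show absolute frequencies
     xlab = "service time (days)", main = "Frequency Histogram")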
Figure 2.2c and Figure 2.2i show the distributions as relative frequency histograms where the y axis represents the relative frequencies (fractions) of the measurements falling into each interval. The relative frequency of an interval is calculated by dividing the number of measurements falling into the interval by the total number of measurements. As a result, the sum of the relative frequencies (y axis) of all intervals is 1 for each relative frequency histogram. Figure 2.2d and Figure 2.2j show the distributions as relative frequency polygons. A relative frequency polygon uses relative frequencies on the y axis instead of absolute frequencies.
Figure 2.2e and Figure 2.2k show the distribution as density histograms where the y axis
represents the densities of intervals. The density of an interval is calculated by dividing the
relative frequency of the interval by the width of the interval. As a result, the total area of
all rectangles is 1 for each density histogram. Note that, specific to these cases, the heights of the rectangles (y axis) in Figure 2.2e and Figure 2.2a give the relative information about different intervals because the intervals are of equal width. The areas of the rectangles of density histograms, on the other hand, always give relative information about different intervals even if the intervals had different widths. We strongly suggest developing the habit of thinking in terms of areas rather than heights while analyzing density histograms. Finally, Figure 2.2f and Figure 2.2l show a smooth density curve estimate of the car service time distribution and the electricity bill distribution, respectively. The density curve is an approximation function of the density histogram where the total area under the curve is always 1.
Density histograms suffer from the selection of interval widths in case the data do not have natural bins or classes. That is, the shape of the density histogram is quite sensitive to the interval widths, and if the data do not have natural classes the histogram shape might

Figure 2.2: Visualizing service time (days) and electricity bill distributions ($)

[Figure: twelve panels. Service time distribution (x axis: service time (days)): (a) Frequency Histogram, (b) Relative Frequency Histogram, (c) Density Histogram, (d) Frequency Polygon, (e) Relative Frequency Polygon, (f) Smooth Density Curve Estimate. Electricity bill distribution (x axis: electricity bill amount): (g) Frequency Histogram, (h) Relative Frequency Histogram, (i) Density Histogram, (j) Frequency Polygon, (k) Relative Frequency Polygon, (l) Smooth Density Curve Estimate. The y axes show frequency, relative frequency or density, respectively.]


be misleading. One day is a natural interval length for the car service time distribution; on the other hand, one may pick a different interval length than eight dollars to plot the density histogram of the electricity bill distribution. Density curve estimates, on the other hand, are approximation functions and they may reflect the true empirical density of a collection of data. However, density curve estimates are susceptible to noise in the data and require a fair amount of data in order to make a proper approximation. Hence, visually superimposing density histograms and density curve estimates together might reveal more information about a distribution, as shown in Figure 2.3.

Figure 2.3: Estimated density curves and density histograms superimposed

[Figure: two panels with density on the y axis. (a) Service time distribution, x axis: service time (days). (b) Electricity bill distribution, x axis: electricity bill amount.]
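A minimal sketch of such a superimposed plot in base R (our own illustration, assuming service.days is the illustrative vector defined in the earlier sketch):

hist(service.days,
     breaks = seq(0, 15, by = 1),
     freq = FALSE,                  # freq = FALSE turns the bar heights into densities
     xlab = "service time (days)", main = "Density Histogram")
lines(density(service.days))        # superimpose a smooth density curve estimate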

Using density histograms and density curves may seem counterintuitive at first because one has to think in terms of areas under rectangles and curves rather than their heights. However, density histograms and curves have several advantages. First of all, frequency and relative frequency histograms distort the shape of the distribution when the intervals are not of equal width. Secondly, areas under densities allow us to interpret distributions in terms of probabilities. Thirdly, theoretical distributions appearing in the fields of probability and statistics are defined in terms of densities.
By calculating the areas of density histograms through summation or the areas under
density curves through integration we investigate
• the fraction of data falling into a particular interval or region of the density histogram
or curve
• the fraction of data that are smaller than or equal to a particular value

• the fraction of data that are greater than or equal to a particular value
In Figure 2.4 we show various fractions of the car service time distribution represented
by the areas under density histograms and curves. Figure 2.4a and Figure 2.4d demonstrate
the fraction of data that are less than or equal to three, 0.46. That is, 46% of the cars left
at the mechanic shop are serviced within three days. Figure 2.4b and Figure 2.4e present
the fraction of data that are inclusively between ten and twenty, 0.14. That is, it takes ten
to twenty days to fix 14% of the cars left at the mechanic shop. Lastly, Figure 2.4c and


Figure 2.4f show the fraction of data that are greater than or equal to fifteen, 0.07. That is, 7% of the cars require 15 or more days to be fixed.

Figure 2.4: Various fractions of the car service time distribution represented by the areas under density histograms and curves

[Figure: six panels, x axis: service time (days), y axis: density, with shaded areas marking the fractions (a) 0.46 (days ≤ 3), (b) 0.18 (10 ≤ days ≤ 20), (c) 0.09 (days ≥ 15), (d) 0.46 (days ≤ 3), (e) 0.14 (10 ≤ days ≤ 20), (f) 0.07 (days ≥ 15).]

To investigate data, people sometimes use an alternative but equivalent way of describing a distribution called the empirical cumulative distribution function (ECDF) or empirical distribution function. An empirical cumulative distribution function shows the fraction of data that are smaller than or equal to a particular value. Both density histograms and ECDFs convey the same information, where the former use area and the latter use height to represent the fraction of data that are smaller than or equal to a particular value. People tend to perceive height more quickly and accurately than area when interpreting a fraction of data; hence an ECDF might be preferable in these cases. On the other hand, density histograms reveal more information about the shape of distributions, e.g., central tendency, dispersion, skewness, gaps. Finally, by the Glivenko-Cantelli theorem, ECDFs converge uniformly to the true underlying CDFs as the data size grows.
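As a minimal sketch, base R builds an ECDF with the ecdf function (again assuming the illustrative service.days vector from the earlier sketch):

F.hat <- ecdf(service.days)           # F.hat is a step function
plot(F.hat, xlab = "service time (days)", ylab = "cumulative probability")
F.hat(3)                              # fraction of cars serviced within three days
1 - F.hat(14)                         # fraction of cars requiring 15 or more days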
Figure 2.5 shows the ECDFs of the distributions given in Figure 2.1a and Figure 2.1b. ECDFs are step functions that jump by 1/n at each data point in a data set of size n. Notice that the steps in the service time ECDF (Figure 2.5a) are more obvious because the data size is smaller compared to the data size of the electricity bill distribution.
To demonstrate that ECDFs and density histograms or curves convey the same informa-
tion represented as height and area, respectively, we plotted the fractions shown in Figure 2.4
as the heights of ECDFs in Figure 2.6. Figure 2.6a demonstrates the fraction of data that
are less than or equal to three, 0.46. That is, 46% of the cars left at the mechanic shop
are serviced within three days. Figure 2.6b presents the fraction of data that are inclusively
between ten and twenty, 0.14. That is, it takes ten to twenty days to fix 14% of the cars left


Figure 2.5: ECDFs representing the service time (days) and electricity bill ($) distributions

[Figure: two panels, y axis: cumulative probability. (a) Service time ECDF, x axis: service time (days). (b) Electricity bill ECDF, x axis: electricity bill (dollars).]

at the mechanic shop. Notice that 0.14 is the difference between the ECDF values at twenty and at ten. Lastly, Figure 2.6c shows the fraction of data that are greater than or equal to fifteen, 0.07. That is, 7% of the cars require 15 or more days to be fixed. Notice that 0.07 is the difference between 1.0 and the ECDF value at 15.

Figure 2.6: Various fractions of the car service time distribution represented by the heights of ECDFs

[Figure: three panels, x axis: service time (days), y axis: cumulative probability, with the relevant ECDF heights (0.46, 0.82, 0.93, 0.97) marked. (a) 0.46 (days ≤ 3), (b) 0.14 (10 ≤ days ≤ 20), (c) 0.07 (days ≥ 15).]

2.2.2 Theoretical Distributions


So far our discussion has involved only distributions that are measured or observed on a collection of units. These distributions are called empirical distributions because the data constituting the distributions are derived via observation or experimentation rather than theory or pure logic. Theoretical distributions, on the other hand, are derived from mathematical reasoning and/or logical facts, assumptions and principles. Theoretical distributions allow us to model or develop an understanding of the underlying processes that generate empirical data. Since theoretical distributions are directly related to probability they are also called probability distributions. Since they are fundamental to probability and statistics, many probability distributions have their own names, e.g., the normal (Gaussian)

distribution, binomial distribution, exponential distribution and Poisson distribution.


Probability distributions are tightly coupled with random variables. A random variable,
usually represented by capital letters, is a variable which takes numerical values that describe
the outcomes of a random phenomenon. The set of all possible outcomes of a random
phenomenon is called a sample space and the associated random variable takes only one of
these possible outcomes. To illustrate, the outcome of a coin flip is a random phenomenon
with sample space S = {Tails, Heads}. An associated random variable X takes a numerical
value 0 for Tails and 1 for Heads. At the end of a coin flip experiment X takes only one of
the values in the sample space.
The number of students passing a course with 11 enrollments is a random phenomenon
with sample space S = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}. The sample space S represents all
possible outcomes of the random phenomenon. An associated random variable Y denoting
the number of students who passed the course takes only one of the values in the sample
space at the end of the semester.
The number of free throws until getting a successful shot from 30 meters is a random
phenomenon with sample space S = {0, 1, 2, 3, . . . , +∞} or equivalently S = {s : s ∈
{0}∪Z+ }. The sample space S represents all possible outcomes of the random phenomenon.
Although a very large or infinite number of free shots may not be possible in practice, it is
possible hypothetically. An associated random variable Z denoting
the number of failed shots takes only one of the values in the sample space at the end of the
experiment.
The amount of time by which one is off (late or early) to his two o’clock meeting is a
random phenomenon with a hypothetical sample space S = {−∞, . . . , +∞} or S = {s :
s ∈ R}. The sample space S represents all possible outcomes of the random phenomenon
including time amounts for both being early and late. An associated random variable T
denoting the off time amount takes only one of the values in the sample space.
The amount of time that a light bulb lasts is a random phenomenon with sample space
S = {0.0, . . . , +∞} or equivalently S = {s : s ∈ {0}∪R+ }. The sample space S represents all
possible outcomes of the random phenomenon. An associated random variable W denoting
the time that a light bulb lasts takes only one of the values in the sample space after the
experiment.
The height of the next person walking into the classroom is a random phenomenon with a
hypothetical sample space S = {0.0, . . . , +∞} or S = {s : s ∈ {0} ∪ R+ }. The sample space
S represents all possible outcomes of the random phenomenon including the hypothetical
cases. An associated random variable V denoting the height of the next person walking in
the classroom takes only one of the values in the sample space after his height is measured.
There are two fundamental types of random variables namely, discrete random variables
and continuous random variables. A random variable that is associated with a random
phenomenon which has a countable (finite or countably infinite) sample space is called a
discrete random variable. In the previous examples the random variables X, Y, Z are discrete
random variables because the outcome of a flip coin, the number of students who passed a
course and the number of free throws until getting a successful shot have countable sample
spaces. A random variable that is associated with a random phenomenon that has a sample
space with uncountably infinite number of possible outcomes is called a continuous random
variable. Continuous random variables may take any value in a range. In the previous
examples the random variables T, W, V are continuous random variables because the off
time amount to a meeting, the time a light bulb lasts and the height of a person have
uncountably infinite sample spaces.


A random variable without a probability distribution is not interesting at all because


the only information it conveys is its sample space, i.e., the set of possible values that it
can take on. A probability distribution coupled with a random variable on the other hand,
is very interesting because it assigns probabilities to the values or intervals in the sample
space of the random variable. These probabilities reflect the occurrence probability of the
outcomes or the outcome intervals of sample spaces.
The probability distribution of a discrete random variable is called probability mass
function (p.m.f.). A probability mass function assigns a probability value to each outcome
in the sample space of a discrete random variable. That is, the whole probability amount, 1,
is distributed over all possible outcomes in the sample space of a discrete random variable.
Let Y be a random variable denoting the number of students who passed a course with 11
enrollments. Figure 2.7 shows a p.m.f. for the random variable Y which assigns a probability
to each possible value that Y may take on. Notice that the probability values are expressed
in scientific notation and the sum of all probability values is 1. As a matter of fact, a p.m.f.
is a function that defines \(p(y) \triangleq P(Y = y)\), where P(Y = y) is the probability of Y taking
the value y. The p.m.f. of the random variable Y is given in Equation 2.1.

\[ p(y) = \binom{11}{y}\, 0.7^{y}\, 0.3^{11-y} \tag{2.1} \]

Figure 2.7: A p.m.f. for random variable Y

(a) Sample space and probabilities:
      0            1            2            3            4            5
      1.771470e-06 4.546773e-05 5.304569e-04 3.713198e-03 1.732826e-02 5.660564e-02
      6            7            8            9            10           11
      1.320798e-01 2.201330e-01 2.568219e-01 1.997504e-01 9.321683e-02 1.977327e-02
(b) p.m.f. plot [x-axis: Y, y-axis: density]
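The values listed in Figure 2.7a follow from Equation 2.1 and can be reproduced with R's dbinom function (the d functions are introduced later in this section); a minimal sketch:

> y <- 0:11
> p <- dbinom(y, size=11, prob=0.7)                   # Equation 2.1 evaluated over the sample space
> sum(p)                                              # the probabilities sum to 1
> barplot(p, names.arg=y, xlab="Y", ylab="density")   # a rough version of Figure 2.7b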

In order for a function p(x) to be a p.m.f. on a discrete sample space S it has to satisfy
the following two conditions:

\[ p(x) \ge 0 \quad \text{where } x \in S \tag{2.2} \]

\[ \sum_{x \in S} p(x) = 1 \tag{2.3} \]

The probability distribution of a continuous random variable is called a probability


density function (p.d.f.). A probability density function assigns a probability value to any
possible interval in the sample space of a continuous random variable. That is, the whole


probability amount, 1, is distributed over infinitely many, infinitely small, disjoint intervals
covering the entire domain of the sample space of a continuous random variable. Figure 2.8a
mimics infinitely many, infinitely small, disjoint rectangles for all intervals where the area
of the rectangle is the probability of the random variable taking on a value within the
interval. Note that to evaluate probabilities for larger intervals one can integrate (sum
up) the density function over the interval of interest as shown in Figure 2.8b. Finally, the
probability of a continuous variable taking on an exact value is zero because at an exact
value the width of the rectangle (shown in Figure 2.8a) becomes zero; in turn, the area
reflecting the probability becomes zero. Let W be a random variable denoting the time
(in minutes) that a light bulb lasts. Figure 2.8a shows a p.d.f. for the random variable W
which assigns a probability to infinitely many, infinitely small, disjoint intervals covering
the entire domain of the sample space of W. Notice that the probabilities are expressed as
the areas under the curve and the total area is 1. As a matter of fact, a p.d.f. is a function
that defines \(P(w_1 \le W \le w_2) \triangleq \int_{w_1}^{w_2} f(w)\,dw\), where \(P(w_1 \le W \le w_2)\) is the probability of
W taking on a value between \(w_1\) and \(w_2\). The p.d.f. of the random variable W is given in
Equation 2.4.

\[ f(w) = \frac{1}{2 \cdot 10^6}\, e^{-\frac{w}{2 \cdot 10^6}} \tag{2.4} \]

Figure 2.8: A p.d.f. for random variable W

[Two panels, y-axis: density, x-axis: W (0 to 1e7). (a) p.d.f. plot of W. (b) shaded area representing P(4·10⁶ ≤ W ≤ 6·10⁶).]

In order for a function f (x) to be a p.d.f. on a continuous sample space S it has to


satisfy the following two conditions:

\[ f(x) \ge 0 \ \text{ where } x \in S, \qquad \int_{S} f(x)\,dx = 1 \tag{2.5} \]
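The p.d.f. in Equation 2.4 is an exponential density with rate 1/(2·10⁶), so both conditions and the probability shaded in Figure 2.8b can be checked numerically; a minimal sketch:

> f <- function(w) dexp(w, rate=1/(2e6))   # density of Equation 2.4
> integrate(f, lower=0, upper=Inf)         # total area under the curve, should be 1
> integrate(f, lower=4e6, upper=6e6)       # P(4e6 <= W <= 6e6), the shaded area in Figure 2.8b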
Similar to empirical distributions, cumulative distribution functions (c.d.f.) are al-
ternative ways of defining and plotting theoretical probability distributions. Cumulative


distribution of a random variable X is defined as \(F(x) \triangleq P(X \le x)\). Given that p(x) is the
p.m.f. of a discrete random variable, its c.d.f. is computed by Equation 2.6.

\[ F(x) = \sum_{t \le x} p(t) \tag{2.6} \]

Given that f(x) is the p.d.f. of a continuous random variable, its c.d.f. is computed by Equation 2.7.

\[ F(x) = \int_{-\infty}^{x} f(t)\,dt \tag{2.7} \]

Figures 2.9a and 2.9b show the cumulative distributions of random variables Y and W
which were shown in Figures 2.7 and 2.8, respectively.

Figure 2.9: Cumulative distribution plots for random variables Y and W

[Two panels, y-axis: cumulative probability. (a) c.d.f. of Y, x-axis: Y (0 to 12). (b) c.d.f. of W, x-axis: W (0 to 3e7).]
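Both c.d.f.s can be evaluated directly with the p functions described next; a minimal sketch:

> pbinom(0:11, size=11, prob=0.7)        # c.d.f. of Y over its sample space (Figure 2.9a)
> pexp(c(1e6, 5e6, 1e7), rate=1/(2e6))   # c.d.f. of W at a few points (Figure 2.9b)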

R has built-in support for more than twenty discrete and continuous theoretical probabil-
ity distributions. Moreover, it provides a uniform function naming scheme for dealing with
probability distributions. Each probability distribution has an abbreviated name consisting
of a few letters in R, e.g., norm for normal, binom for binomial, unif for uniform, geom
for geometric and t for student’s t distribution. Furthermore, R provides four functions for
each distribution represented by a single letter:
p Returns the cumulative distributions of a probability distribution.
q Returns the inverse cumulative distributions (quantiles) of a probability distribution.

d Returns the densities (or masses) of a continuous (or discrete) probability distribution.
r Returns randomly generated values belonging to a probability distribution.


Table 2.1: Theoretical probability distributions and their p/q/d/r functions.

Distribution Probability Quantile Density Random


Beta pbeta qbeta dbeta rbeta
Binomial pbinom qbinom dbinom rbinom
Cauchy pcauchy qcauchy dcauchy rcauchy
Chi-Square pchisq qchisq dchisq rchisq
Exponential pexp qexp dexp rexp
F pf qf df rf
Gamma pgamma qgamma dgamma rgamma
Geometric pgeom qgeom dgeom rgeom
Hypergeometric phyper qhyper dhyper rhyper
Logistic plogis qlogis dlogis rlogis
Log Normal plnorm qlnorm dlnorm rlnorm
Negative Binomial pnbinom qnbinom dnbinom rnbinom
Normal pnorm qnorm dnorm rnorm
Poisson ppois qpois dpois rpois
Student t pt qt dt rt
Uniform punif qunif dunif runif
Weibull pweibull qweibull dweibull rweibull

Figure 2.10: Generating random values from binomial and normal distributions.

# Example 1
> pnorm(4, mean=5, sd=2) # P(X <= 4)
[1] 0.3085375
> 1 - pnorm(4, mean=5, sd=2) # P(X >= 4)
[1] 0.6914625
> rnorm(10, mean=5, sd=2)
[1] 6.806524 1.943950 4.909100 4.125945 3.296505 2.178318 4.334462 2.564031
3.076485 1.302218
# Example 2
> pbinom(8, size=20, prob=0.5) # P(Y <= 8)
[1] 0.2517223
> dbinom(8, size=20, prob=0.5) # P(Y == 8)
[1] 0.1201344
> rbinom(10, size=20, prob=0.5)
[1] 12 8 7 8 13 7 9 10 8 9


R prefixes the abbreviated distribution names with the single-letter function representation
to form a function name denoting a functionality.
Table 2.1 shows some theoretical distributions and their p/q/d/r functions.
Figure 2.10 first shows the c.d.f. of a normal random variable, X ∼ N(µ = 5, σ = 2),
evaluating F(4) = P(X ≤ 4) and 1 − F(4) = P(X ≥ 4) using the pnorm function. The rnorm
function in the figure synthetically generates ten values from a normally distributed random
variable with µ = 5 and σ = 2.
The second example in Figure 2.10 shows the c.d.f. of a binomial random variable,
Y ∼ Binom(n = 20, p = 0.5), where F(8) = P(Y ≤ 8). This is equivalent to repeating the
experiment of tossing a fair coin 20 times infinitely many times and computing the fraction
of experiments with eight or fewer heads. In the figure, the dbinom function is also used to
obtain the probability of having exactly eight heads. Note that dnorm for the normal
distribution does not correspond to an exact-value probability, because exact-value
probabilities are zero for continuous random variables.
Lastly, rbinom generates ten synthetic values from a binomial distribution with n = 20 and
p = 0.5.
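The q functions, which are not shown in Figure 2.10, invert the corresponding p functions; a minimal sketch using the same parameters:

> qnorm(0.3085375, mean=5, sd=2)         # inverts pnorm(4, mean=5, sd=2), returns 4
> qnorm(0.975, mean=0, sd=1)             # 0.975 quantile of the standard normal, about 1.96
> qbinom(0.2517223, size=20, prob=0.5)   # smallest y with P(Y <= y) >= 0.2517223, returns 8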

2.3 Univariate Data Analysis


Univariate data analysis refers to analyzing only a single variable without its interaction
with other variables.
Descriptive statistics is the discipline of statistics that deals with quantitatively describ-
ing the main features of a collection of data using statistics and basic graphics. Summary
statistics, as part of the descriptive statistics, are powerful tools for summarizing the main
features of a collection of data including the central tendency, the dispersion, and the shape
of the distribution. Arithmetic mean, median, quartiles and mode are the most widely
used summary statistics for measuring the central tendency of a collection of data. Range,
interquartile range, variance, standard deviation, median absolute deviation (MAD) and co-
efficient of variation are the most common summary statistics used to measure the dispersion
of a collection of data. Skewness and kurtosis are the most common summary statistics used
to describe the shape of the distribution of a collection of data.

2.3.1 Descriptive Statistics to Measure the Central or Positional


Tendency
Arithmetic mean or average is defined as the sum of a collection of numeric values divided
by the total number of the values. Let {x1 , x2 , . . . , xn } be a collection of numbers.
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{2.8} \]
The quantity given in Equation 2.8 is referred to as the sample mean and is used as a summary
measure of central tendency. One way to interpret the arithmetic mean is to say that the
numbers on the left side of the mean are balanced by the numbers on the right side in
terms of the distance to the mean. It is also used as a representative value for a collection
of numbers. However, it is not a robust statistic, i.e., it is significantly influenced by outliers.
The mean function in R has the signature mean(x, trim=0, na.rm=FALSE), which returns
the mean of the object denoted by parameter x. The parameter trim takes a value in the interval


[0, 0.5] and denotes the fraction of the observations to be trimmed from each end of x. The
trim parameter may be used to trim outliers appearing on both sides.

Figure 2.11: The mean function

> # Example 1
> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> mean(v1)
[1] 43.6875
> # Example 2
> v1 <- c(v1, 971)
> mean(v1)
[1] 98.23529

In Figure 2.11 we first create a vector of sixteen integers and evaluate its mean using
the mean function. Obviously, the mean value, 43.6875, shows the central tendency of all
numbers. In the second example we add the number 971 to our vector and recalculate the
mean value. Notice that adding an outlier changed the mean value to 98.23529, which is
higher than all of the first sixteen values of the vector.
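The trim parameter mentioned above reduces the influence of such an outlier; a minimal sketch continuing with the extended vector:

> mean(v1)              # strongly influenced by the outlier 971
> mean(v1, trim=0.1)    # drops the smallest and largest 10% of the values before averaging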
As we noted previously, if the numbers are evenly distributed around the mean, then the
mean is a good measure of central tendency. On the other hand, if the values are skewed
towards the left or right of the mean, then one needs to look at other central tendency
measures such as the median and quartiles.
Median is a value that separates the lower half of an ordered collection of numbers from
the higher half. The median function in R has the signature median(x, na.rm = FALSE)
which returns the median of the object denoted by parameter x. Regardless of whether the
elements of the object x are sorted or not, R evaluates the median value on a sorted copy
of the object.

Figure 2.12: The median function

> # Example 1
> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> sort(v1)
[1] 2 11 11 14 25 27 33 42 44 56 60 60 60 73 86 95
> median(v1)
[1] 43
> # Example 2
> v1 <- c(v1, 971)
> median(v1)
[1] 44

Figure 2.12 shows how to calculate the median value of a vector. Since the length of the
vector is an even number, 16, in Example 1 (Figure 2.12) the median value is the average of
the values located at indexes eight and nine in the sorted copy, which leaves seven numbers on
each side. Notice that adding an outlier did not affect the median value much, and the
new value 44 can still serve as a measure of central tendency for all numbers.
Median value divides a sorted collection of numbers into two approximately equal frac-


tions where the lower 0.5 fraction of the numbers is on the left hand side of the median value
and the higher 0.5 fraction of the numbers is on the right hand side of the median value.
One might be interested in other values of fractions or positional tendencies. For example
the 0.25 fraction (position), where approximately one quarter of the numbers are on the
left of a particular value and three quarters are on the right; 0.35 fraction (position) where
approximately 35% of the numbers are on the left of a particular value and 65% are on the
right; or 0.75 fraction (position) where approximately three quarters of the numbers are on
the left of a particular value and three quarters are on the right. Each of these fractions
is called a percentile, i.e., 0.25 percentile, 0.35 percentile or 0.75 percentile. Figure 2.28
demonstrates how different percentile separate a sorted collection of data into two halves.

Figure 2.13: A percentile separates the sorted data into two halves

[Diagram: a sorted collection of data split at the Q-0.25, Q-0.50 and Q-0.75 positions.]

There is an ambiguity between the terms quantile and percentile. Typically, a quantile
divides the sorted data into roughly equal intervals. Therefore, the quantiles dividing the sorted
data into 2, 3, 4, 5, 10 and 16 equal pieces are called the median, tertiles, quartiles, quintiles,
deciles, and hexadeciles, respectively. On the other hand, the quantiles dividing the data into
100 equal pieces are called percentiles.
There are multiple methods of calculating quantiles. One of the most common methods
involves first sorting a collection of numbers and then associating each value with equally spaced
fractions (or quantiles) from 0 to 1. If the fraction of interest is already associated with a
value, then that particular value is the quantile. On the other hand, if the fraction of interest
is between two already associated fractions, then linear interpolation is used to calculate
the quantile value.
Figure 2.14 shows a collection of numbers {2, 60, 33, 11, 27, 56} on which we want to
calculate the 0.35-quantile. The numbers are first ordered and equal interval fractions are
associated with the numbers. Since 0.35 is between 0.2 and 0.4, one might think that the
mean of the two numbers at the respective fractions might be a good quantile value.
However, 0.35 is not halfway between 0.2 and 0.4; instead, it is three times farther from
0.2 than from 0.4. To calculate the linear interpolation, let the fractions be the independent
and the sorted numbers be the dependent variables of multiple line segments. Then the line
between points (0.2, 11) and (0.4, 27) is y = 80x − 5, where y is the quantile and x is the
fraction. Setting x to 0.35 in the linear interpolation equation returns the 0.35-quantile, which is 23.
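The same interpolation can be reproduced in R with the approx function (a sketch of the calculation, not of how the quantile function is implemented internally):

> x <- c(2, 60, 33, 11, 27, 56)
> sorted <- sort(x)                              # 2 11 27 33 56 60
> fractions <- seq(0, 1, length.out=length(x))   # 0.0 0.2 0.4 0.6 0.8 1.0
> approx(x=fractions, y=sorted, xout=0.35)$y     # linear interpolation, returns 23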
The quantile function in R is used to calculate quantiles at one or more fractions (posi-
tions). The function expects the mandatory parameter x for the object denoting the collec-


Figure 2.14: Calculating a quantile using linear interpolation

# A collection of numbers
2 60 33 11 27 56
# Ordered numbers with associated equally spaced fractions
  2    11   27   33   56   60
  0.0  0.2  0.4  0.6  0.8  1.0
[Plot: quantile (y-axis) against fraction (x-axis), showing the piecewise-linear interpolation.]

tion of numbers. One can use parameter probs=seq(0,1,0.25) to specify one or more frac-
tions. The default value of parameter probs generates the sequence {0, 0.25, 0.50, 0.75, 1}.
Figure 2.15 demonstrates the use of the quantile function via multiple examples.

Figure 2.15: The quantile function

> v1 <- c(2, 60, 33, 11, 27, 56)


> quantile(x=v1, probs=0.35)
35%
23
> quantile(x=v1, probs=c(0.35, 0.80))
35% 80%
23 56
> quantile(x=v1)
0% 25% 50% 75% 100%
2.00 15.00 30.00 50.25 60.00

In descriptive statistics quartiles are three quantiles that divide the data into four more
or less equal groups. A quartile is a special case of quantiles calculated at (0.25, 0.50, 0.75)
such that the 0.25 quantile is called the lower or first quartile, the 0.50 quantile is called the median
or second quartile and the 0.75 quantile is called the upper or third quartile. The quantile
function, by default, returns the quartiles in addition to the minimum (0 quantile) and
maximum (1 quantile).
Mode is another central tendency statistic defined as the most frequent value appearing
in a data set. R does not have a direct function for calculating the mode of a vector.
However, with the help of the table function the mode can be calculated easily as shown in
Figure 2.16.

2.3.2 Descriptive Statistics to Measure Dispersion


In descriptive statistics dispersion is a measure denoting the amount of spread in a
collection of numeric values. Using central tendency without dispersion might be misleading
because a collection of numbers might be tightly clustered around the central tendency or


Figure 2.16: Calculating mode of a vector

> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> frequency.table <- table(v1)
> frequency.table
v1
2 11 14 25 27 33 42 44 56 60 73 86 95
1 2 1 1 1 1 1 1 1 3 1 1 1
> names(frequency.table)[which.max(frequency.table)]
[1] "60"

might spread out around the central tendency. There are many summary statistics used for
measuring dispersion and we are going to present the most prevalent ones here.
Range is the size of the smallest interval that contains all the values of a collection
of numbers. That is, it is the difference between the maximum and minimum values of
the collection. R has the function range which returns the minimum and maximum of
a vector object rather than the difference between them. One can evaluate the difference
between the maximum and minimum values of a vector by using the max and min functions as shown
in Figure 2.17. Range might be useful for measuring the dispersion of a small collection of
numbers. However, it has its own disadvantages, especially on large collections. Firstly,
range is very sensitive to outliers. Secondly, it provides information about the maximum
and minimum values in a collection, but how the in-between values are scattered is not
represented by the range.

Figure 2.17: The range function

> # Example 1
> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> range(v1)
[1] 2 95
> max(v1) - min(v1)
[1] 93

Variance is another summary statistic for measuring the dispersion of a collection of


numbers around the mean value of the collection. It is defined as the average amount of
the squared distance (spread) between each value and the mean of the collection. Squaring
the distance between each value and the mean makes each distance positive so that the
distances on the right and left hand side of the mean value do not cancel each other out.
Taking the absolute distances would have worked as well; however, the absolute value function
is not differentiable at zero and the power function (square) is mathematically more manageable. Taking
the average of the squared distances in the definition allows the variance to be independent
of the size of the collection. Hence, variance is the amount of dispersion per observation, on
the average.
Let {x1 , x2 , . . . , xn } be a collection of numbers.
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{2.9} \]


The quantity given in Equation 2.9 is referred to as the sample variance, s², and is used as a
measure of dispersion around the sample mean value¹. Variance is always positive and the
higher the variance the greater the dispersion in the data set.
Since the distances are squared in the definition of variance, the unit of variance (unit²)
is not the same as the unit of the data in the collection. As a result, interpreting variance
with respect to the data may not always make sense. An alternative statistic is called the standard
deviation, σ, and is defined as the square root of the variance.

Figure 2.18: The var and sd functions

> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> # Sample Variance and sd
> var(v1)
[1] 782.2292
> sd(v1)
[1] 27.96836
>
> # Population Variance and sd
> pvar <- sum((v1-mean(v1))^2)/length(v1)
> pvar
[1] 733.3398
> psd <- sqrt(pvar)
> psd
[1] 27.08025

R provides the var and sd functions for the sample variance, s², and the sample standard deviation,
s, respectively, as shown in Figure 2.18. The same figure also shows how to calculate
the population variance, σ², and the population standard deviation, σ.
Although variance and standard deviation are prevalently used in the analysis of disper-
sion, both are non-robust summary statistics. The calculation of both statistics depends on
the mean, and the mean is not a robust measure of central tendency, i.e., it is significantly affected by
outliers. Additionally, squaring the distance of a data point to the mean gives more weight
to the outliers in the data set because quadratic functions grow quickly.
Two robust statistics for measuring dispersion in a collection of numeric data are the
interquartile range (IQR) and the median absolute deviation (MAD).
Interquartile range (IQR) is defined as the difference between the third quartile and first
quartile of a collection of numbers, IQR = Q3 − Q1 . IQR represents the spread of the
middle 50% portion of the data. IQR is a robust statistic for dispersion because it is not
significantly influenced by outliers, which is due to trimming the lower and upper quarters.
Usually, IQR is used in cases where the median is used as the measure of central tendency. The IQR
function in R returns the interquartile range of a numeric vector as shown in Figure 2.19. A
similar measure is called interdecile range which is defined as the difference between 0.9-
quantile and 0.1-quantile. Interdecile range simply trims the upper and lower 10% of the
data and provides the range of the remaining data.
¹ Equation 2.9 is called the sample variance. To estimate the population variance, \(\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2\) is used.


Figure 2.19: The IQR function

> IQR(v1)
[1] 37.75

Median absolute deviation (MAD) is another summary statistic for measuring dispersion
in the presence of outliers. As the name implies, MAD is the median of the absolute
deviations of data points from their median. Let x = {x1 , x2 , . . . , xn } be a collection
of numbers and median(x) be a function that returns the median of the collection. Let
d be the absolute values of the distances of the elements in x to the median of x, i.e.,
d = {di : di = |xi − median(x)|}. The MAD of a dataset is simply defined as the median of
d, i.e., median(d). Figure 2.20 shows how to evaluate MAD manually as well as using the
mad function.

Figure 2.20: Calculating MAD manually and using the mad function

> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> med.v1 <- median(v1)
> med.v1
[1] 43
> distance.med.v1 <- abs(v1 - med.v1)
> distance.med.v1
[1] 41 17 10 32 16 13 18 43 17 52 1 29 32 30 1 17
> median(distance.med.v1)
[1] 17.5
> mad(x=v1, constant=1)
[1] 17.5

MAD is a robust statistic denoting the dispersion in data in the presence of outliers. However,
MAD assumes symmetry in the process that generates the data by dividing the data into
two halves around the median. If the data is not symmetric around the median, alternative
measures can be used, as suggested in the paper “Alternatives to the Median Absolute
Deviation” by Peter J. Rousseeuw and Christophe Croux, 1993.
Coefficient of variation (CV) is an alternative statistic to measure the dispersion around
the mean. Note that standard deviation also measures the dispersion around the mean. Coefficient
of variation is defined as the ratio of standard deviation to the mean, CV = σ/µ,
and it is unit-less. CV measures the dispersion of a collection of data in a way that the
dispersion is independent of the unit of measurement of the data. It is useful for comparing
the relative magnitudes of dispersion of two or more data sets, especially if the units
of measurement of these data sets are different. Let 10.2 be the average number of hours
spent studying for the Chemistry final exam in a class, with standard deviation 4.6. Let
72 be the average score of the same final exam, with standard deviation 27. One
cannot say that the dispersion in the scores, 27, is more than the dispersion of the study
hours, 4.6, based on standard deviations because the measurement units are different. On
the other hand, comparing the CV of study hours, 0.45, with that of scores, 0.37, tells us that
the dispersion in study hours is more than the dispersion in scores.


Note that CV is meaningful only for non-negative data on a ratio scale, i.e., data with
an absolute zero that represents the lack of quantity. A mixture of negative and
positive values in the data may result in a mean of zero, for which CV is infinite.
Furthermore, a mean that is close to zero will let the CV approach infinity. Hence, CV is
very sensitive to small changes when the mean is close to zero.
R does not provide a function for calculating the coefficient of variation; however, one can
use the functions mean and sd to evaluate CV.
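A minimal sketch of computing CV from the sample mean and standard deviation, using the vector from the earlier examples:

> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> sd(v1) / mean(v1)    # coefficient of variation based on the sample statistics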

2.3.3 Descriptive Statistics to Measure Shapes


Skewness and kurtosis are two descriptive statistics that help us to measure the shape of a
distribution.

Figure 2.21: Example right, zero and left skewness


[Three density panels over x in [0, 1], y-axis from 0.0 to 2.5. (a) Right skewed distribution, (b) Symmetric distribution, (c) Left skewed distribution.]

Skewness is a statistic that measures the asymmetry of a distribution around its mean.
Figure 2.21 shows example distributions with right, zero and left skewness. Theoretically,
skewness is the third “standardized” moment of a distribution. Sample skewness², \(\tilde{m}_3\), is
defined as

\[ \tilde{m}_3 = \frac{m_3}{s^3} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)^{3}} \tag{2.10} \]

The numerator of Equation 2.10 cubes the distances around the sample mean value.
Therefore, if a distribution is symmetric around the mean value, it will have values, say, 3
units on the right of the mean (a distance of 3 units) as well as 3 units on the left of the mean
(a distance of −3 units). Hence, their cubes cancel each other out and make the numerator get
closer to zero. Ideally, a unimodal distribution is left skewed (has a left tail) when \(\tilde{m}_3\) is
negative, is symmetric when \(\tilde{m}_3\) is close to zero and is right skewed (has a right tail) when
\(\tilde{m}_3\) is positive. The normal distribution is a symmetric distribution with zero theoretical
skewness and close to zero sample skewness. It is always better to visualize the distribution
while measuring skewness, because a fat and short tail may balance a thin and long tail
toward zero skewness.
Since \(\tilde{m}_3\) does not account for sample size, another commonly used measure of skewness
is defined as

\[ \tilde{m}_3 = \frac{n^2}{(n-1)(n-2)}\,\frac{m_3}{s^3} = \frac{n^2}{(n-1)(n-2)} \cdot \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)^{3}} \tag{2.11} \]

² Note that the leading terms of the sample moment and the sample standard deviation are 1/n rather than 1/(n − 1).

where \(m_3\), \(\bar{x}\) and s are the third moment, sample mean and the standard deviation of the
distribution, respectively. Many software packages implement Equation 2.11. As a rule of
thumb, one may consider a distribution roughly symmetric when the skewness is in [−0.5, 0.5],
moderately skewed when the skewness is in [−1, −0.5) or (0.5, 1] and highly skewed when the
skewness is less than −1 or greater than 1.
Lastly, left skewed distributions’ medians are greater than their means and right skewed
distributions’ medians are less than their means, while symmetric distributions’ medians
and means are the same or close to each other.

Figure 2.22: Calculating skewness

> library(moments)
> v <- rbeta(n=256, shape1=2, shape2=8)
> skewness(x=v)
[1] 0.7013203

The moments package in R implements the skewness function. The console in Figure 2.22
first creates a vector of randomly generated data from a beta distribution with α = 2 and
β = 8. The skewness of the vector is 0.70 which indicates a moderate right skewness.
Kurtosis is a statistic that measures how heavy the tails of a distribution are around its
mean. Theoretically, kurtosis is the fourth “standardized” moment of a distribution. Sample
kurtosis³, \(\tilde{m}_4\), is defined as

\[ \tilde{m}_4 = \frac{m_4}{s^4} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)^{4}} \tag{2.12} \]

where \(m_4\), \(\bar{x}\) and s are the fourth moment, sample mean and the standard deviation of
the distribution, respectively. The numerator of Equation 2.12 takes the fourth power
of the distances around the mean value. Therefore, the samples closer to the mean do not
contribute much to the kurtosis, while the samples on the tails, i.e., farther from the mean,
contribute much more to the kurtosis. Note that kurtosis accounts for both tails together.
Equation 2.12 calculates the kurtosis of a normal distribution as 3. To standardize, many
software packages implement a version that subtracts 3 from the kurtosis to fix the kurtosis
of a normal distribution at zero. This version is called excess kurtosis and it also accounts
for the sample size. Excess kurtosis is defined as

\[ \tilde{m}_4 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \cdot \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)^{4}} - \frac{3(n-1)^2}{(n-2)(n-3)} \tag{2.13} \]

³ Note that the leading terms of the sample moment and the sample standard deviation are 1/n rather than 1/(n − 1).


According to Equation 2.13, a positive value indicates a heavier tail than the normal
distribution and a negative value indicates a lighter tail than the normal distribution. A
distribution with positive excess kurtosis is called leptokurtic, which means it has more outliers
than a normal distribution. On the other hand, a distribution with negative excess kurtosis is
called platykurtic, which means it has fewer outliers than a normal distribution.

Figure 2.23: Calculating kurtosis

> library(moments)
> v <- rt(n=256, df=3)
> kurtosis(x=v)
[1] 7.496812

The moments package in R implements the kurtosis function without subtracting 3. The
console in Figure 2.23 first creates a vector of randomly generated data from a Student's t
distribution with 3 degrees of freedom. The kurtosis of the vector is 7.49, which indicates
that the tails are heavier than the normal distribution’s, i.e., kurtosis 3.
Lastly, the moments package implements the Jarque–Bera test, which tests whether some
sample data have skewness and kurtosis matching those of a normal distribution or not.
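A minimal sketch, assuming the jarque.test function provided by the moments package:

> library(moments)
> v <- rnorm(n=256)    # data generated from a normal distribution
> jarque.test(v)       # H0: the data have the skewness and kurtosis of a normal distribution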

2.3.4 Normalization
Considering that a numeric vector is a collection of measurements or observations, directly
comparing the data points in two vectors might not always make sense. In order to do
a proper comparison one has to take into account the unit of measurement, the scale of
measurement and the process generating the data. If two vectors have the same measurement
but in different units one has to rescale one of the vectors into the units of the other vector.
Rescaling is done by adding, subtracting, multiplying or dividing the elements of a vector
by one or more constants. For example, in order to compare two vectors of distance values
measured in Miles and Kilometers, respectively, one has to rescale one of the vectors into
the units of the other vector.
Similarly, if the data in two vectors are measured using different scales it is necessary to
normalize the data into a common scale. A frequently used common scale is scaling the data
into the [0,1] interval. For example, different educational institutions use different scales to
grade the performance of their students in a course such as [0,10], [0,100] or [0,150]. In
order to fairly compare data points having different scales, it is necessary to normalize them
into the [0,1] interval using Equation 2.14, assuming that x = {x₁, x₂, . . . , xₙ} denotes a
collection of numeric values.

\[ x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{2.14} \]
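A minimal sketch of Equation 2.14 in R, using a hypothetical vector of grades:

> grades <- c(55, 72, 90, 40, 68, 100, 83)                       # hypothetical grades
> normalized <- (grades - min(grades)) / (max(grades) - min(grades))
> range(normalized)                                              # now within [0, 1]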
Finally, if the data points in two vectors are generated via the same process (have the
same distribution) but with different parameters we use standardization. Standardization
is usually defined as subtracting a measure of location (central tendency), µ or x̄ from the
data point and dividing it by the scale (dispersion), σ or s.
\[ x_i' = \frac{x_i - \mu}{\sigma} \tag{2.15} \]

112
The copyright law [Title 17, U.S. Code] prohibits sharing and distributing any parts of this document

To illustrate, assume that Alice and Carol are two students who took a very crowded
course with two different professors. Furthermore, let Alice’s overall average grade be 78 and
Carol’s overall average grade be 65. Naturally, one may think that Alice performed better
than Carol in that particular course. Assuming that the students were randomly assigned
to courses, a real comparison should have been made by taking the group of students in
each class into consideration. Let the mean and median grades in Alice’s class be 84 and
88 and in Carol’s class be 60 and 59, respectively. Even if we assume that the standard
deviations in the grades were the same for both classes, Carol seems to be more successful
when the grades are standardized. Standardized data is called z-scores, standard scores or
scaled scores in different domains. Figure 2.24 shows how to calculate the distance of a
value from the mean of the collection in terms of the standard deviation.

Figure 2.24: Data Standardization

> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> # Population Variance and sd
> pvar <- sum((v1-mean(v1))^2)/length(v1)
> pvar
[1] 733.3398
> psd <- sqrt(pvar)
> psd
[1] 27.08025
>
> # The distance of v[4] to its mean in terms of the standard deviations
> (v1[4] - mean(v1))/psd
[1] -1.207061

In case of outliers a more robust method for standardization is

\[ x_i' = \frac{x_i - \mathrm{median}(x)}{\mathrm{MAD}} \tag{2.16} \]
Equation 2.16 simply calculates the distance of a point to its median in terms of median
absolute deviation (MAD).
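A minimal sketch of Equation 2.16, reusing the median and the mad function with constant=1 as in Figure 2.20:

> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> (v1 - median(v1)) / mad(v1, constant=1)   # robust standardized scores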

2.3.5 Outlier Detection


An outlier is a data point in a data set that appears to significantly deviate from the majority
of the data. An outlier might occur due to errors in measurement, natural deviations in data
sets or anomalous conditions. To illustrate, a device measuring the atmospheric pressure
may temporarily malfunction and cause erroneous readings. A small group of extremely rich
or poor people may appear as outliers in an income dataset as natural deviations. Abnormal
health symptoms or test results in a data set may indicate an anomalous condition.
Outliers should not be omitted from a data set unless they are due to measurement errors.
Instead they should be investigated further and more robust statistics should be used to
summarize the data. There is no common and accepted method for detecting outliers in any
data set. In fact, what an outlier is completely depends on the process that generates the
data set and most of the time this process is not fully known in advance. On the other hand,
interquartile range is used to detect outliers for simple cases. Any point falling outside of


the interval [Q1 − 1.5(IQR), Q3 + 1.5(IQR)] may be considered to be an outlier. One can
calculate the limits of the non-outlier interval and list the outliers appearing in a vector as
shown in Figure 2.25.

Figure 2.25: Simple outlier detection based on IQR

> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60, 971)
> lowerOutlierLimit <- quantile(v1, probs=0.25, names=FALSE)-1.5*IQR(v1)
> upperOutlierLimit <- quantile(v1, probs=0.75, names=FALSE)+1.5*IQR(v1)
> lowerOutlierLimit
[1] -27.5
> upperOutlierLimit
[1] 112.5
> v1[v1<lowerOutlierLimit | v1>upperOutlierLimit]
[1] 971

An alternative method to detect outliers is suggested by Boris Iglewicz and David Hoaglin
in their paper "How to Detect and Handle Outliers", 1993. The authors suggest calculating
a modified z-score for each element in the data set using Equation 2.17.

\[ x_i' = \frac{0.6745\,(x_i - \mathrm{median}(x))}{\mathrm{MAD}} \tag{2.17} \]
The authors recommend that modified z-scores that fall outside the interval [−3.5, 3.5] can be
labeled as potential outliers.
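A minimal sketch of Equation 2.17 applied to the vector with the outlier 971:

> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60, 971)
> modified.z <- 0.6745 * (v1 - median(v1)) / mad(v1, constant=1)
> v1[abs(modified.z) > 3.5]   # potential outliers according to the [-3.5, 3.5] rule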
The Hampel filter is another outlier detection approach that is based on the MAD and the median.
The Hampel filter assumes that any value outside the interval

\[ I = [\,\mathrm{median} - 3\,\mathrm{MAD},\; \mathrm{median} + 3\,\mathrm{MAD}\,] \tag{2.18} \]


is an outlier.
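A minimal sketch of the Hampel filter in Equation 2.18, using the same vector v1 as in the previous sketch:

> med <- median(v1)
> mad1 <- mad(v1, constant=1)
> v1[v1 < med - 3*mad1 | v1 > med + 3*mad1]   # values outside the interval I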

2.4 Bivariate Data Analysis


So far our statistics have involved single variable distributions. Bivariate data analysis deals
with revealing the relationships among pairs of variables.
Figure 2.26 and Figure 2.27 demonstrate the code and the bar chart of debt per capita
for 12 countries. According to Figure 2.27, Vietnam, Ukraine and Thailand are the countries
which have the lowest debt per capita. Therefore, one may think that the economies of
these three countries are better compared to the other countries in the list. Especially, their
economies are much better compared to Switzerland, Finland and Denmark, which sounds
counter-intuitive.
As a matter of fact, often looking at a single variable may be misleading, as shown in
the previous example. Figure 2.28 shows the R code displaying additional variables in our
dataset. The figure shows that “gdp”, “debt”, “population (pop)” and “gdp per capita
(gdpPcapita)” are also provided in the dataset.
A better way to look at the economies of these countries is considering GDP per capita
along with debt per capita. Figure 2.29 shows that there is a very strong and positive
relationship between debt per capita and GDP per capita, meaning that as the gross domestic

Figure 2.26: countryEcon2014.csv data set (2014)

> country <- read.csv(file="~/Desktop/countryEcon2014.csv", comment.char = "#")


> row.names(country) <- country$name #assign row names using the name column
> country$name <- NULL #delete the names column since they are row names now
> head(country)
debt gdp pop debtPcapita gdpPcapita
Mexico 639.6 1395.6 119.4 5357 11688
Argentina 228.6 497.2 42.0 5444 11838
Switzerland 313.0 671.9 8.1 38639 82951
Denmark 161.2 338.1 5.6 28778 60375
Ukraine 87.9 182.3 45.5 1931 4007
Thailand 203.9 422.3 68.5 2977 6165

Figure 2.27: Debt per Capita

[Horizontal bar chart of Debt per Capita (dollars, 0 to 40000) for Vietnam, Venezuela, Ukraine, Thailand, Switzerland, Slovakia, Mexico, Malaysia, Finland, Denmark, Czech Republic and Argentina.]


Figure 2.28: countryEcon2014.csv covariance and correlation

> cov(x=country$debtPcapita, y=country$gdpPcapita)


[1] 323754524
> cor(x=country$debtPcapita, y=country$gdpPcapita)
[1] 0.9894238
> cor(country)
debt gdp pop debtPcapita gdpPcapita
debt 1.00000000 0.99537344 0.5664217 0.05924537 0.09675728
gdp 0.99537344 1.00000000 0.5839996 0.04898310 0.09534947
pop 0.56642171 0.58399965 1.0000000 -0.62041014 -0.58526726
debtPcapita 0.05924537 0.04898310 -0.6204101 1.00000000 0.98942378
gdpPcapita 0.09675728 0.09534947 -0.5852673 0.98942378 1.00000000

Figure 2.29: Scatter plot for debt per capita and gdp per capita

[Two scatter plots of Debt per Capita (dollars) against GDP per Capita (dollars); Switzerland, Finland and Denmark are labeled in the first panel.]


product of countries increases, their debt also increases. When we look at the two variables,
the picture is not as dark for countries such as Switzerland, Finland and Denmark. In
fact a better measure here is to look at debt-to-GDP ratio to assess the default risk of a
country. Obviously, there are more variables involved in evaluating countries’ economies
and the example above is to illustrate how looking at a single variable might be misleading.

2.4.1 Covariance and Correlation


Covariance between two variables X and Y is expressed in Equation 2.19. Covariance
measures how two variables “co-vary” together or how two variables jointly vary. To put it in
other words, it measures whether higher values of one variable are associated with the higher or the
lower values of the other variable. Similarly, it measures whether lower values of one variable
are associated with the higher or lower values of the other variable.

\[
\begin{aligned}
\mathrm{cov}(X, Y) &= E[(X - E[X])(Y - E[Y])] \\
&= E[XY - X E[Y] - Y E[X] + E[X]E[Y]] \\
&= E[XY] - E[Y]E[X] - E[X]E[Y] + E[X]E[Y] \\
&= E[XY] - E[Y]E[X]
\end{aligned} \tag{2.19}
\]
An equivalent and computationally more feasible equation for the sample covariance, \(s^2_{X,Y}\),
is given in Equation 2.20.

\[ s^2_{X,Y} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y}) \tag{2.20} \]

where \(\bar{X}\) and \(\bar{Y}\) denote the means of variables X and Y, respectively. Please note that
one needs to replace 1/(n − 1) by 1/n in Equation 2.20 to compute the covariance for a
population rather than a sample.
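As a quick check, Equation 2.20 can be computed directly and compared with R's cov function; a minimal sketch with hypothetical vectors:

> x <- c(2, 60, 33, 11, 27, 56)
> y <- c(5, 90, 51, 20, 46, 83)                            # hypothetical second variable
> sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)     # Equation 2.20
> cov(x, y)                                                # same value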
A positive covariance indicates that the two variables get larger and smaller together.
As a typical example, house size and house price usually have positive covariance. On the
other hand, a negative covariance indicates that when one variable gets larger the other one
gets smaller. For example, mortgage interest rate and house price usually have negative
covariance. A close to zero covariance indicates that the variables do not change together,
i.e., do not co-vary.
Covariance depends on the scales of the involved variables. Moreover, it ranges between
negative infinity and positive infinity. Therefore, interpreting and comparing the magnitudes
of covariances is difficult.
A related statistic which is easier to interpret is called correlation coefficient and given
in Equation 2.21.
\[ r_{X,Y} = \frac{s^2_{X,Y}}{s_X\, s_Y}, \qquad -1 \le r \le 1 \tag{2.21} \]
The quantity \(r_{X,Y}\) is called the linear correlation coefficient or Pearson product moment
correlation coefficient. It measures the strength (how the points spread or cluster around a
line) and the direction of a linear relationship between two variables. The correlation coefficient
is dimensionless and ranges between −1 and 1. A value close to 1 indicates a very strong
positive correlation between two variables, whereas a value close to −1 indicates a very
strong negative correlation between the two variables. Lastly, a value close to 0 indicates
the lack of correlation between the two variables.


Figure 2.30 presents several correlation values indicating the direction and strength of
linear relationships between two variables. Note that correlation is not causation. That
is, the change in one variable does not necessarily cause a positive or negative change in
the other variable. For example, there is a strong correlation between shoe size of children
and their vocabulary size. The correlation does not indicate that larger feet cause a larger
vocabulary. In fact, both shoe size and vocabulary size are related to a confounding factor,
namely age.

Figure 2.30: Example linear correlations

[Six scatter plots of Y against X illustrating different linear correlations: (a) r = 0.09, (b) r = 0.89, (c) r = 0.44, (d) r = −0.74, (e) r = −0.30, (f) r = 0.82, where the relationship is NOT linear.]

The coefficient of determination, \(r^2_{X,Y}\), is another useful statistic. It presents the proportion
of the variance (fluctuation) of one variable that is predictable from the other variable.
In other words, it is the percentage of variability in X that could be explained by the variability
in Y.
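For example, assuming the country data frame from Figure 2.26 is loaded, the coefficient of determination between debt per capita and GDP per capita is the square of the correlation reported in Figure 2.28 (a sketch):

> r <- cor(x=country$debtPcapita, y=country$gdpPcapita)   # 0.9894238 in Figure 2.28
> r^2                                                     # about 0.979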

2.4.2 Contingency Tables


In statistics contingency tables, also called cross tabulation, are used to explore the rela-
tionship between two categorical variables. It is important to note that the “relationship
between two categorical variables” actually means the relationship between the values of
the two categorical variables. In the following examples we are going to use the “Unem-
ployment” dataframe from the “Ecdat” package. To load the dataset the “Ecdat” package
must have been installed on your system. The “Unemployment” dataset has unemployment
information about 452 individuals in the US in 1993. Two important variables in the dataset
are sex and reason which list the biological gender of the individuals and the reason for
unemployment. The sex variable is a factor with two levels: male and female. The reason
variable is a factor with four levels: new (new entrant), lose (job loser), leave (job leaver)
and reentr (labor force reentrant). In its simplest form, the table function presents the


frequencies of the levels of a factor. For example in Console 2.31 the table function displays
the frequency distribution of the sex column indicating that there are 242 males and 210
females in the dataset. The function prop.table gets a table as input and returns the pro-
portions instead of the frequencies. In Console 2.31 the proportions of males and females
are very close to each other.

Figure 2.31: Unemployment Dataset and table function

> library(Ecdat)
> data("Unemployment")
> ?Unemployment
> str(Unemployment)
...
...
> # single variable
> table(Unemployment$sex)
male female
242 210

> prop.table(table(Unemployment$sex))
male female
0.5353982 0.4646018

> # two variables


> table(Unemployment$sex, Unemployment$reason)
new lose leave reentr
male 14 111 60 57
female 27 60 32 91

> addmargins(table(Unemployment$sex, Unemployment$reason))


new lose leave reentr Sum
male 14 111 60 57 242
female 27 60 32 91 210
Sum 41 171 92 148 452

> addmargins(prop.table(table(Unemployment$sex, Unemployment$reason)))


new lose leave reentr Sum
male 0.03097345 0.24557522 0.13274336 0.12610619 0.53539823
female 0.05973451 0.13274336 0.07079646 0.20132743 0.46460177
Sum 0.09070796 0.37831858 0.20353982 0.32743363 1.00000000

The table and prop.table functions not only present the frequencies and proportions of
single variables but also the joint frequency distributions and proportions of two variables.
Console 2.31 also shows the cross tabulation of sex and reason variables. In the console we
also introduce the addmargins function which displays the column and row sums as well as
the total instances.
When we look at the sex and reason variables, a natural question is if there is a relation-
ship between these two variables, i.e., if they are statistically independent or not. To put it in
other words, we say two variables are statistically independent when, no matter which value one
variable assumes, the probability distribution of the other variable remains roughly the same. If two


categorical variables are not independent, then the probability distribution of one variable
changes depending on the value of the other variable.
When two variables are (statistically) independent their respective conditional and marginal
distributions are expected to be (roughly) the same. For example in Console 2.31 the
marginal distribution of sex is 0.54 male and 0.46 female. However, the conditional distri-
bution when the unemployment reason is new entrant is 0.34 male and 0.66 female. The
question is whether the difference between the marginal and conditional distributions
happened by chance and is insignificant, or whether the difference implies a relation between the two
variables and the evidence is statistically significant.
We use Pearson’s χ2 (chi-squared) test of independence to determine if two variables are
independent or not. The χ² statistic is computed as

\[ \chi^2 = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i} \tag{2.22} \]
where n is the number of cells in the contingency table, oi is the number of observations
in cell i and ei is the number of expected observations in cell i. Two events, A and B, are
considered to be independent when P (A) = P (A|B) ⇐⇒ P (A, B) = P (A)P (B). The last
table in Console 2.31 shows the joint and marginal probabilities of sex and reason. If these
two variables were independent then their expected joint probability in each cell would be
the product of their marginal probabilities. Hence, the expected frequency of each cell, ei ,
would be the product of the total number of instances and the expected joint probability.

\[
\begin{aligned}
H_0 &: \text{The variables are independent} \\
H_1 &: \text{The variables are dependent, i.e., there is an association between them}
\end{aligned} \tag{2.23}
\]

The null hypothesis (H0 ) of the χ2 test indicates that the variables are independent
and the alternative hypothesis (H1 ) indicates that they are dependent. Under the null
hypothesis the χ² statistic has a χ² distribution with (k − 1)(l − 1) degrees of freedom, where
k is the number of rows and l is the number of columns of the contingency table. Typically,
if the p-value of the χ2 statistic is lower than 0.05, we have strong evidence to reject the null
hypothesis, i.e., the evidence to reject is statistically significant. That is, we found strong
evidence that the marginal and the conditional distributions are different. Otherwise, there
is not enough evidence to reject the null hypothesis.

Figure 2.32: Pearson’s χ2 test of independence

> chisq.test(Unemployment$sex, Unemployment$reason)


Pearson’s Chi-squared test
data: Unemployment$sex and Unemployment$reason
X-squared = 33.568, df = 3, p-value = 2.444e-07

Console 2.32 introduces the chisq.test function for Pearson’s χ2 test of independence. In
the console the p-value is much smaller than 0.05, hence we have enough evidence to reject
the null hypothesis and assume that there is a relation between variables sex and reason.
For example, unemployment due to entering into the workforce is higher for females, while
unemployment due to losing their jobs is much higher for males. Therefore, there is strong
evidence for a relation between variables sex and reason in our dataset.
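As a sketch of Equation 2.22, the X-squared value reported in Figure 2.32 can also be assembled manually from the observed and expected counts:

> library(Ecdat)
> data("Unemployment")
> observed <- table(Unemployment$sex, Unemployment$reason)
> expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
> sum((observed - expected)^2 / expected)   # should match X-squared in Figure 2.32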


Please note that to apply the χ² test, at least 80% of the cells must have more than 5
observations. If this is not the case, one may consider merging cells, i.e., combining the levels of the
categorical variable(s). If the cell values are still very small, one may consider Fisher's Exact
test. Also, it is assumed that the instances in your dataset are independent from each other.
That is, the values of a row are not related to and do not affect the values in another row of
your dataset. If this is not the case, one may consider McNemar's test or Cochran's Q test.
Pearson’s χ2 test indicates if the two categorical variables, i.e., the levels of the categorical
variables, are related to each other. However, if there is a relation, it does not measure the
strength of the relation. Note that the magnitude of the χ² statistic in Equation 2.22
depends on the total number of instances in the dataset as well as the number of
rows and columns of the contingency table. To measure the strength of the relation we use
the corrected contingency coefficient, C′. The original Pearson's contingency coefficient, C, is
computed as
\[ C = \sqrt{\frac{\chi^2}{\chi^2 + N}} \tag{2.24} \]
where χ2 is the chi-squared statistic and N is the total number of the observations. Although
0 ≤ C ≤ 1, it can take values less than 1 even for perfect relations. To fix the problem, the
corrected contingency coefficient, C′, is computed as
\[ C' = \sqrt{\frac{\min(k, l)}{\min(k, l) - 1}}\, \sqrt{\frac{\chi^2}{\chi^2 + N}} \tag{2.25} \]

where k and l are the row and column numbers of the contingency table, respectively, and
0 ≤ C′ ≤ 1. The function ContCoef, for both the corrected and the non-corrected contingency
coefficients, is in the descriptive statistics tools package named “DescTools”.

Figure 2.33: Corrected and non-corrected contingency coefficient

> library(DescTools)
> ContCoef(x=Unemployment$sex, y=Unemployment$reason, correct=FALSE)
[1] 0.2629277
> ContCoef(x=Unemployment$sex, y=Unemployment$reason, correct=TRUE)
[1] 0.371836

Figure 2.33 shows that the strength of the relation between variables sex and reason is 0.371836, which is not very high, yet notable.
There are other statistics that measure the strength of the relation between two categorical variables, such as Cramér's V and Goodman and Kruskal's λ (lambda).
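As an illustration of one such alternative, Cramér's V can be computed directly from the χ2 statistic. The sketch below assumes the Unemployment data is still loaded; the DescTools package also provides a CramerV function.

> # Cramér's V = sqrt(chi^2 / (N * (min(k, l) - 1))); a sketch, not the only approach
> tab <- table(Unemployment$sex, Unemployment$reason)
> chi.sq <- chisq.test(tab)$statistic
> N <- sum(tab)
> sqrt(chi.sq / (N * (min(dim(tab)) - 1)))
> # alternatively: DescTools::CramerV(Unemployment$sex, Unemployment$reason)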

2.4.3 Correspondence Analysis


Figure 2.34 shows a balloon plot representing the frequencies of the levels of the sex and reason variables. As we cross-analyze the sizes of the circles in the figure, it is clear that level reentr is closer to level female while level lose is closer to male. Moreover, level leave is close to both the female and male levels.


Figure 2.34: Unemployment sex and reason balloon plot

[Balloon plot of variable sex (levels female, male) versus variable reason (levels new, lose, leave, reentr); circle sizes represent cell frequencies]


Balloon plots help us to visually cross-analyze the levels of the categorical variables when the variables do not have too many levels. On the other hand, some variables have many levels, which makes cross-analysis through balloon plots difficult.
Correspondence analysis allows us to translate the levels of the categorical variables into a Euclidean space with new coordinates. The coordinates of the levels of the categorical variables help us to compute the distances between the levels as well as to visualize the levels in 2D with possible information loss. Before translating the levels into a Euclidean space, we need to compute the standardized residual proportions based on the contingency table.

Figure 2.35: Correspondence analysis, step-by-step

> library(Ecdat)
> data("Unemployment")
> # original proportions matrix
> op <- as.matrix(prop.table(table(Unemployment$sex, Unemployment$reason)))
> # all proportions, including the marginal proportions
> m <- addmargins(prop.table(table(Unemployment$sex, Unemployment$reason)))
> # row and column marginal proportions
> rmp <- as.vector(m[-3, 5])
> cmp <- as.vector(m[3, -5])
> # expected proportions matrix
> ep <- outer(rmp, cmp, FUN = "*")
> # residual of proportions matrix
> rp <- op - ep
> # standardized residual of proportions, (o - e)/sqrt(e)
> st.rp <- rp/sqrt(ep)
> # use SVD (Singular Value Decomposition) to generate singular values d,
> # a left singular matrix u and a right singular matrix v
> w <- svd(st.rp)

Console 2.35 shows the steps taken to compute the standardized residual proportions as well as the correspondence analysis. In the console, first we compute the original proportions matrix, next we compute the marginal proportions matrix and extract the marginal proportions of the rows and columns. Note that we extract the marginal proportions only for the levels of our categorical variables. Then, we compute the expected proportions matrix using the “outer” function and compute the residual of proportions. Next, we standardize the residual of proportions matrix and use Singular Value Decomposition (SVD) to decompose the standardized residual of proportions into a vector of singular values d, a left singular matrix u and a right singular matrix v. The left and right singular matrices correspond to the rows and columns of the contingency table along with their new coordinates. SVD decomposes a matrix Mk×l into Uk×k, Dk×l and Vl×l, where Dk×l is a diagonal matrix, i.e., M = U DV^T. The diagonal entries of D are typically ordered and are called singular values, U is called the left singular matrix and V is called the right singular matrix. When M is real, U and V are orthogonal matrices.
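Although the full treatment is deferred, the following rough sketch shows one way the SVD output might be turned into approximate 2D coordinates for the levels. It continues from the objects op and w computed in Console 2.35; exact scaling conventions differ between correspondence analysis implementations, so treat this only as an illustration.

> # a rough sketch: scale the singular vectors by the singular values to get coordinates
> row.coords <- w$u %*% diag(w$d)        # coordinates for the levels of sex (rows)
> col.coords <- w$v %*% diag(w$d)        # coordinates for the levels of reason (columns)
> rownames(row.coords) <- rownames(op)
> rownames(col.coords) <- colnames(op)
> row.coords[, 1:2]
> col.coords[, 1:2]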
TO BE COMPLETED LATER


2.4.4 One-way ANOVA


TO BE WRITTEN LATER

2.5 Multivariate Data Analysis


Multivariate data analysis involves datasets having more than one variable. Technically, bivariate data is also multivariate.

2.5.1 Cluster Analysis


Cluster analysis or clustering is the task of grouping a set of objects (data instances) together
such that the objects in the same cluster are similar to each other while the objects in dif-
ferent clusters are dissimilar. The terms “similar” and “dissimilar” depend on the variables
in the dataset as well as the application domain. The distance between the corresponding
features of two objects can be used to quantify how similar or dissimilar the two objects
are. Since the precise definition of the distance depends on the application domain, many distance metrics have been introduced in the literature, including Chebyshev distance, Manhattan distance, Euclidean distance, Minkowski distance (a generalization of Euclidean and Manhattan), Cosine distance and Shortest path. Euclidean distance is one of the most commonly used distance metrics. The distance between two points in a Euclidean space is defined as the length of the straight line connecting those two points. Let x and y be two points in a p-dimensional Euclidean space. The distance between x and y is

\[
\lVert x - y \rVert = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2} \tag{2.26}
\]

where double vertical bars denote the distance.
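As a quick sketch of Equation 2.26 in R, the distance between two made-up points can be computed directly or with the built-in dist function.

> x <- c(1, 2, 3)
> y <- c(4, 6, 3)
> sqrt(sum((x - y)^2))     # Equation 2.26 applied directly
[1] 5
> dist(rbind(x, y))        # the same Euclidean distance via the dist function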

2.5.1.1 k-means Clustering


Theoretically, clustering can be defined as an optimization problem where either the sum
of the distances of the objects within a cluster is minimized or the sum of distances of the
objects between the clusters is maximized or both within-cluster minimization and between-
cluster maximization is achieved simultaneously.
k-means is a simple yet effective method of clustering a set of data instances into multiple groups. k-means has many variations including k-medians, spherical k-means, k-
medoids and fuzzy c-means. Given a dataset consisting of m variables and n instances, the
dataset can be divided into k clusters by minimizing the average within-cluster distances.
Let W Cl be the average within-cluster distance of cluster Cl . Let |Cl | be the number of
data instances in cluster Cl .
\[
\begin{aligned}
WC_l &= \frac{1}{|C_l|} \sum_{x_i \in C_l} \sum_{x_j \in C_l} \lVert x_i - x_j \rVert^2 \\
     &= 2 \sum_{x_i \in C_l} \lVert x_i - \mu_l \rVert^2
\end{aligned}
\tag{2.27}
\]


where ||xi −xj ||2 is the square of the distance between xi and xj and µl is the centroid of the
data instances in cluster Cl. The centroid is a generalization of the mean value to multivariate analysis or multidimensional spaces; it applies to vectors rather than scalar values. Let V = {v1, v2, . . . , vn} be a set of vectors in an m-dimensional space, i.e., vi ∈ Rm. Given that each vector has a mass value 0 ≤ ti ≤ 1 where Σi ti = 1, the centroid, µ, of V is

\[
\mu = \sum_{i=1}^{n} t_i v_i \tag{2.28}
\]

In case the mass values of all the vectors are the same, 1/n, Equation 2.28 reduces to the sum of the vectors divided by the number of vectors.
Equation 2.27 is also called within-cluster sum of squares or within-cluster variation
because the distances of data instances are squared or the distances of data instances are
calculated with respect to the mean of all data instances, respectively. Note that one way
to define the mean of multiple data instances consisting of many features is calculating a
data point that consists of the means of the individual features.
Given that there are k clusters, the total within-cluster sum of squares or total within-cluster variation, W, is

\[
\begin{aligned}
W &= \sum_{l=1}^{k} WC_l \\
  &= \sum_{l=1}^{k} \sum_{x_i \in C_l} \lVert x_i - \mu_l \rVert^2
\end{aligned}
\tag{2.29}
\]

Since the total within-cluster sum of squares in Equation 2.29 denotes how compact the
clusters are, a smaller value is better in general.
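To make Equations 2.27-2.29 concrete, the following sketch computes the total within-cluster sum of squares for a tiny made-up dataset with a fixed cluster assignment; the object names pts and labels are chosen only for illustration.

> # toy 2D dataset and an assumed assignment into two clusters
> pts <- rbind(c(1, 1), c(2, 1), c(9, 8), c(10, 9))
> labels <- c(1, 1, 2, 2)
> W <- 0
> for (l in unique(labels)) {
    cl <- pts[labels == l, , drop = FALSE]
    mu <- colMeans(cl)                     # centroid of cluster l (Equation 2.28)
    W <- W + sum(sweep(cl, 2, mu)^2)       # add the cluster's sum of ||xi - mu_l||^2
  }
> W                                        # total within-cluster sum of squares (Equation 2.29)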
The k-means algorithm expects a dataset and the number of clusters, k, as input and
returns k centroids denoting the center of each cluster. A data instance is assigned to the
cluster based on the smallest distance to the cluster’s centroid. Most implementations return
a list of labels representing the cluster of each data instance in the dataset as well.
Algorithm 1 presents the pseudo-code for the k-means algorithm. The algorithm expects a dataset to be clustered, X, and the number of clusters, k. Lines 1-3 generate random centroids for each cluster 1-to-k. Line 5 starts an infinite loop that is executed as long as there is a change in the computed cluster labels at every iteration. Lines 6-8 assign a cluster label to each instance in the dataset by computing the distances of the instance to all centroids and picking the cluster with the minimum distance. Lines 10-12 check if the previous for-loop caused any changes in instance labels. If not, the while-loop is terminated at line 11. Lines 14-16 update the centroids when there is a change in instance labels, i.e., at least one instance's cluster label has changed. Finally, line 19 returns the k centroids representing the cluster centers. Many implementations also return the cluster labels for the instances in the dataset.
Figure 2.36 presents a dataset to be clustered into five clusters using Algorithm 1. Before the algorithm starts, at Iteration 0, the instances are not assigned to any clusters. Random centroids are created, and Iteration 1 shows the assignments of the instances to these cluster centroids. The centroids are updated at Iteration 1 and the instances are reassigned to their clusters at Iteration 2. At Iteration 3 the algorithm stabilizes, i.e., there is no instance whose cluster label changes.


Algorithm 1 k-means Algorithm


Input: X                      ▷ dataset of n instances and m features, xi ∈ Rm
Input: k                      ▷ the number of clusters
Output: M                     ▷ list of centroids of k clusters where µl ∈ Rm
1:  for l from 1 to k do      ▷ Randomly generate k centroids representing the center of each cluster
2:      µl ← random centroid
3:  end for
4:
5:  while True do
6:      for i from 1 to n do  ▷ Label each xi ∈ X by the closest cluster w.r.t. its centroid
7:          C(xi) ← argmin_l ||xi − µl||²
8:      end for
9:
10:     if no change in cluster labels in the previous step then
11:         break
12:     end if
13:
14:     for l from 1 to k do  ▷ Recalculate the centroids by averaging each feature
15:         µl ← compute by Equation 2.28
16:     end for
17: end while
18:
19: return M
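A compact R translation of Algorithm 1 might look like the sketch below. It is written for illustration only: the function name my_kmeans is made up, and practical concerns such as random restarts, empty clusters and iteration limits are ignored.

> my_kmeans <- function(X, k) {
    X <- as.matrix(X)
    n <- nrow(X)
    centroids <- X[sample(n, k), , drop = FALSE]   # lines 1-3: random initial centroids
    labels <- rep(0, n)
    repeat {                                       # line 5
      # lines 6-8: assign each instance to the closest centroid
      new.labels <- apply(X, 1, function(x)
        which.min(colSums((t(centroids) - x)^2)))
      if (all(new.labels == labels)) break         # lines 10-12: stop when labels are stable
      labels <- new.labels
      for (l in 1:k)                               # lines 14-16: recompute the centroids
        centroids[l, ] <- colMeans(X[labels == l, , drop = FALSE])
    }
    list(centers = centroids, cluster = labels)    # line 19
  }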

[Four scatter plots of v2 versus v1 showing the cluster assignments (colors) and centroids (markers) at (a) Iteration 0, (b) Iteration 1, (c) Iteration 2 and (d) Iteration 3]

Figure 2.36: Iterations of k-means clustering algorithm, k = 5

The console output shown in Figure 2.37 presents a synthetically generated dataset consisting of two variables, v1 and v2, and 250 instances. The dataset is generated from multivariate normal distributions with various centroids and the same covariance matrix. Next, the dataset is converted into a data frame and plotted via ggplot. Note that data visualization via ggplot2 is covered in the next chapter.
R supports k-means clustering via the kmeans function. The kmeans function expects
at least a matrix or a dataframe object consisting of only numeric variables and the num-
ber of clusters provided via parameter centers. It is a common practice to scale the data
via the scale function before providing it to kmeans. Scaling typically centers the data
by subtracting the column mean from the values in the column and dividing them by the
column’s standard deviation for each column. The scaling step is to make sure that the
variables contribute fairly without larger range variables affecting the computations signifi-
cantly compared to smaller range variables. Once kmeans computes the clusters it returns
a list holding multiple objects of different types. Among the most important objects that
the returned list holds are cluster, centers, size, withinss, betweenss and iter. cluster is a
numeric vector representing the cluster of each and every data instance in the clustered
dataset. centers is a matrix representing the cluster centroids. size is a numeric vector
denoting the number of data instances falling into each cluster. withinss is within-cluster
sum of squares of the clusters computed by Equation 2.27. betweenss is between-cluster


Figure 2.37: Generate some random data

> library(MASS)
> set.seed(1001)
> mydata <- rbind(
mvrnorm(50, mu = c(10,10), Sigma = matrix(c(80,0.16,0.16,80), ncol = 2),
empirical = TRUE),
mvrnorm(50, mu = c(50,50), Sigma = matrix(c(80,0.16,0.16,80), ncol = 2),
empirical = TRUE),
mvrnorm(50, mu = c(10,50), Sigma = matrix(c(80,0.16,0.16,80), ncol = 2),
empirical = TRUE),
mvrnorm(50, mu = c(50,10), Sigma = matrix(c(80,0.16,0.16,80), ncol = 2),
empirical = TRUE),
mvrnorm(50, mu = c(30,30), Sigma = matrix(c(80,0.16,0.16,80), ncol = 2),
empirical = TRUE)
)
> colnames(mydata) <- c("v1", "v2")
> mydata <- as.data.frame(scale(mydata))
> head(mydata)
v1 v2
1 3.712771 -6.7042562
2 21.952162 -0.3210547
3 20.873552 0.9499241
4 18.608042 29.3532378
5 11.524498 15.3720296
6 7.236498 15.4625999
> library(ggplot2)
> ggplot(data=mydata, mapping=aes(x=v1, y=v2)) + geom_point()

[Scatter plot of v2 versus v1 for the generated dataset]


sum of squares of the clusters, which kmeans computes as the total sum of squares minus the total within-cluster sum of squares; a larger value indicates better separated clusters. Finally, iter denotes how many iterations it took the k-means algorithm to stabilize.

Figure 2.38: k-means clustering

> # Compute k-means clusters based on the cluster count found in scree plot
> set.seed(1001)
> fit <- kmeans(x=mydata, centers=5)
> fit
> ...
> ...
> # Bind the cluster labels to the dataset
> mydata$cluster <- factor(fit$cluster)
>
> # Visualize the clusters
> ggplot() + geom_point(data=mydata, mapping=aes(x=v1, y=v2, color=cluster)) +
    geom_point(data=data.frame(fit$centers, centroid=as.factor(1:nrow(fit$centers))),
               mapping=aes(x=v1, y=v2, fill=centroid), shape=23, color="black")

[Scatter plot of v2 versus v1 with points colored by cluster and diamond markers showing the five centroids]

The console output in Figure 2.38 runs the kmeans function over the dataset that was synthetically generated in Figure 2.37. The kmeans function returns a list which is referenced by an object named fit. The following line creates a new column in mydata named cluster and populates it with the cluster label for each row. Finally, we use ggplot to visualize the cluster centroids and the instance clusters by different colors. As the figure shows, the rows that are closer to each other are clustered in the same group.
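Although most of the printed output is omitted in Figure 2.38, the individual components of the returned list can be inspected directly. The following sketch assumes the fit object created above.

> fit$centers       # matrix of cluster centroids
> fit$size          # number of instances in each cluster
> fit$withinss      # within-cluster sum of squares per cluster
> fit$tot.withinss  # total within-cluster sum of squares
> fit$betweenss     # between-cluster sum of squares
> fit$iter          # number of iterations until the algorithm stabilized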
One problem we have not discussed is how to determine the optimal number of clusters, k, which was 5 in the previous example. Since the synthetically generated dataset in Figure 2.37 has two variables, one can plot the dataset and visually decide on the optimal number of clusters. However, visual inspection is more difficult for three variables and impossible for more than three variables without an effective dimensionality reduction technique. There are many methods to determine the optimal or suboptimal number of clusters in a dataset. One


prevalently used method is the elbow method, which is based on the total within-sum-of-squares. The elbow method simply runs the k-means algorithm for 1, 2, 3, . . . , t candidate cluster numbers and extracts the total within-sum-of-squares for each candidate. Note that t ≤ n, where n is the total number of instances, because in the extreme case each instance can exist in its own cluster. However, a good value for t is √n, in general. The total within-
sum-of-squares is equal to the total squared distances when all data is assumed to exist in
a single cluster. Once the total within-sum-of-squares is computed, they are visualized via
a line graph. The total within-sum-of-squares typically decreases as the number of clusters
increases. Therefore, the line graph looks like an arm and roughly the elbow point of the
arm is considered to be a good value for the number of clusters, because the decrease in
total within-sum-of-squares is not significant after the elbow point.
Figure 2.39 presents the elbow method in practice. The dataset is scaled first to achieve fair variable contributions to the clustering process. Next, the total sum-of-squares is computed; it denotes the total within-sum-of-squares when we have only one cluster. Then, the total within-sum-of-squares is computed in a for-loop for cluster counts from 2 up to 15. Lastly, the within-sum-of-squares values are converted into a data frame and plotted using ggplot. In the figure, the elbow point is roughly at 5, which is consistent with the synthetic data generation process in Figure 2.37.

2.5.2 Singular Value Decomposition (SVD)


TO BE WRITTEN LATER

2.5.3 Principal Component Analysis (PCA)


High dimensional data exhibits difficulties in terms of visualization and predictive analysis.
Exploratory data analysis often requires visualizing data and a dataset with more than three
continuous variables is impossible to visualize.
Principal Component Analysis (PCA) is a feature extraction technique employed in
dimensionality reduction for visualization or predictive analysis. Let An×p be a dataset
consisting of n instances and p continuous variables. The goal of PCA is to transform
the dataset into a new coordinate system represented by p orthonormal vectors where the
vectors represent the new dimensions. Two vectors are considered to be orthogonal when
they are perpendicular to each other, i.e., their dot product is zero. If two orthogonal vectors
are also unit vectors, i.e., their magnitudes are one, then they are called orthonormal. The
new dimensions that are extracted by PCA are sorted by the amount of variance they capture, which represents the amount of information they carry, and one can reduce the dimensionality by selecting the top k of them where k ≤ p. Usually, the top k dimensions are expected to account for at least 80% of the total variability, yet a lower value may be used for visualization.
PCA first finds the direction (dimension) along which the data variance is the largest.
Once the first dimension is found, it finds the direction which is perpendicular to the first one
and has the second largest data variance. Then, it finds the direction which is perpendicular
to the previous ones and has the third largest data variance and so on.
Let v be an unknown unit vector representing the direction of the data with the largest
variance. The scalar projection, xl , of a data point, x, in our dataset onto v corresponds to


Figure 2.39: Estimate the number of clusters

> # standardize variables by subtracting each value from the mean and dividing by
> # the standard deviation
> # this step makes the variables contribute fairly into computations
> # otherwise a variable of range [1000-2000] will have more impact than a variable
> # of range [10-20]
> mydata <- as.data.frame(scale(mydata))
>
> # compute within-sum of squares wss, initially set to total of the variances of
the columns
> wss <- (nrow(mydata)-1)*sum(apply(X=mydata, MARGIN=2, FUN=var))
>
> # compute within-sum of squares wss for clusters of size 2 to 15
> # note that the first element is always the total
> # in case the data size is large wss can be computed using a smaller sample set
> set.seed(1001)
> for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
>
> # convert the wss into a dataframe
> wss.df <- data.frame(cluster = 1:15, wss = wss)
>
> # visualize the scree-plot
> ggplot(data=wss.df, mapping=aes(x=cluster, y=wss)) + geom_point() + geom_line() +
scale_x_continuous(breaks = seq(from=1, to=15, by=1)) + labs(x="Number of
Clusters", y="Within-Clusters Sum of Squares")

[Scree plot: Within-Clusters Sum of Squares versus Number of Clusters (1 to 15), with an elbow around 5 clusters]


the magnitude or length of x along the direction of v.⁴ ⁵

⁴ The scalar projection of a onto b is a_l = (a · b)/||b||, where a · b = aᵀb is the dot product of a and b and ||b|| = √(b · b) is the length of b. ||b|| is equal to one when b is a unit vector by definition.
⁵ The vector projection of a onto b is ((a · b)/||b||)(b/||b||).
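Even though the derivation is completed later, R already provides PCA through the built-in prcomp function. The following minimal sketch applies it to the built-in USArrests dataset, chosen here only for illustration, and reports how much variance each new dimension carries.

> # a minimal PCA sketch with prcomp; center and scale the variables first
> data(USArrests)
> pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
> summary(pca)        # proportion of variance carried by each principal component
> head(pca$x[, 1:2])  # coordinates of the instances in the first two new dimensions
> pca$rotation        # the orthonormal direction vectors (loadings)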


TO BE COMPLETED LATER

2.5.4 Factor Analysis (FA)


TO BE WRITTEN LATER

2.5.5 Outlier Detection


TO BE WRITTEN LATER

2.6 Data Preprocessing


Real world raw data are usually incomplete, noisy and inconsistent. Real world raw data might be incomplete simply because it might lack some analysis-relevant features or some values of particular features, or it might contain only aggregated or summarized data. Real world raw data might be noisy due to errors and outliers in the data. Real world raw data might be inconsistent because of discrepant labels, feature names or units of measurement. Most of the time inconsistent data is the artifact of merging similar data obtained from multiple sources. Hence, any decision or inference drawn from such data might be misleading.
Data preprocessing is the process of transforming raw data into a consistent, complete,
accurate and interpretable format. Usually preprocessing data to enhance the quality of the
data for analysis requires five steps:

Data cleaning: The process of handling missing data, smoothing noisy data, managing
outliers, and resolving inconsistencies in the data.
Data integration: The process of integrating data from multiple sources with different
representations by resolving conflicts among representations.

Data transformation: The process of normalizing, aggregating and generalizing data.


Data reduction: Reducing the sheer amount of data in a way that the analysis produces
the same or similar results.
Data discretization: Reducing continuous numerical variables into discrete variables to
improve the performance and accuracy of statistical data analysis techniques.

In the following we are going to discuss data cleaning which is a vital step in preprocess-
ing.

2.6.1 Data Cleaning


Data cleaning is the process of handling missing data, smoothing noisy data, managing
outliers, and resolving inconsistencies in the data.


2.6.1.1 Missing Data


Missing data is the lack of a value for a particular feature of an instance in the data. Missing
data occurs in data due to a non-responsive instance or an unobservable/unmeasurable
feature on an instance. Missing data needs to be carefully handled before data analysis
because it may significantly affect the conclusions drawn from the analysis.
The most important step in handling missing data is understanding the reasons or distribution patterns (if any) of the missing data. There are three different types of missing data:
Missing Completely at Random (MCAR) : The probability of a missing value is related neither to that value nor to any other variable (feature) in the
data set. That is, the probability of a missing value of a variable is the same as the
probability of another value for the same variable or another value for another variable
in the data set. For example, some health screening test results might be missing in
the data set regardless of the type of the test. Although MCAR reduces the accuracy
of an analysis, it does not introduce a systematic bias into the analysis.

Missing at Random (MAR) : Also called Missing at Random Conditionally (MARC).


The probability of a missing value is not related to that value; however, it is related
to other variable(s) (features) in the data set. That is, the missingness of a value is
correlated to other variable(s) in the data set and can be explained by accounting for
those other variables. For example, a health screening test result might be missing for
patients of a particular disease because the test is not applicable for those patients.
Although MAR reduces the accuracy of an analysis about the variable having missing
values, it does not introduce a bias into the analysis because missingness is random
within that particular variable.
Missing Not at Random (MNAR) : The probability of a missing value is related to
the value itself or it is related to some other variable(s) that does not exist in the
data set or cannot be measured/observed. For example, a health screening test result
might be missing if it exceeds a certain threshold value, or a health screening test
result might be missing because the patient finds it intrusive and intrusiveness is not
a variable in the data set. The bias in MNAR is systematic and severely affects the
accuracy of analysis unless the amount of introduced systematic bias is small.

MCAR and MAR missingness can potentially be ignored assuming that the data set
is large enough for the analysis. However, MNAR type of missingness should be handled
carefully. The literature on handling missing data is vast; however, there are two broad categories, namely deletion and imputation.

2.6.1.2 Deletion
Deleting a record with a missing value or the entry of the missing value is the simplest
approach to handle missing data.
In listwise deletion (complete-case analysis) a record is entirely excluded from analysis
if it has any missing data. Although this method does not introduce bias if the missingness
is MCAR or MAR, it affects the accuracy of the analysis unless the size of the data set is
large. On the other hand, it may introduce significant bias if the missingness is MNAR and lead to incorrect conclusions.


Pairwise deletion (available-case analysis) is an alternative method to listwise deletion.


In pairwise deletion a record with missing value is used in the analysis as long as the
variable (feature) with the missing value is not part of the particular analysis. A common
example is computing the correlation matrix of a data set. Since the variables are handled
in pairs to compute a correlation matrix, a record is excluded only if it has a missing
value regarding the variables in the pair. Pairwise deletion aims to reduce the data loss
due to deletion. However, the analysis will be based on different subsets of the data set
with different sizes which may mislead the overall analysis. Additionally, it may distort the
analysis by generating mathematically inconsistent results.
Deletion as a method of handling missing data must be avoided if the missingness type
is MNAR. In cases of MCAR and MAR it should generally be avoided as well, unless the deletion does not significantly affect the analysis.
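As a small sketch of both approaches in R, listwise deletion can be performed with na.omit, and pairwise deletion arises naturally when computing a correlation matrix; the data frame df below is made up for illustration.

> df <- data.frame(x = c(1, 2, NA, 4, 5, 6),
                   y = c(2, 4, 6, NA, 10, 11),
                   z = c(1, 1, 2, 2, NA, 3))
> nrow(na.omit(df))                       # listwise deletion keeps only complete rows
> cor(df, use = "complete.obs")           # correlations after listwise deletion
> cor(df, use = "pairwise.complete.obs")  # pairwise deletion: each pair uses its own subset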

2.6.1.3 Imputation
Imputation is the process of replacing missing values in a data set by their substitutes. The
aim of imputation is keeping all records in a data set by replacing any missing value by its
meaningful substitute. There are multiple ways to determine the substitutes for a missing
value and we introduce the most common methods in this text.
Mean imputation is a technique that replaces the missing values appearing in a variable
(feature) by the mean of the available observations/measurements of the variable. Although
the technique is pretty straightforward, it may distort the distribution of a variable in the
data set by artificially pulling the data points towards the mean and resulting in underes-
timated standard deviations for the variable especially, if the number of missing values is
large in comparison to the size of the data set. Similarly, the data points artificially pulled
towards the mean may reduce the amount of correlation between two variables by estimating
the statistics towards zero.
Median imputation is a technique that replaces the missing values appearing in a variable
(feature) by the median of the available observations/measurements of the variable. This
method is used as an alternative to mean imputation in case the distribution of the variable
is skewed. Similar to mean imputation it may distort the distribution of a variable in the
data set by artificially pulling the data points towards the median.
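A minimal sketch of mean and median imputation in base R follows. It uses the built-in airquality dataset, whose Ozone column contains missing values, and is only an illustration of the idea.

> data(airquality)
> oz <- airquality$Ozone
> sum(is.na(oz))                                               # number of missing values
> oz.mean <- ifelse(is.na(oz), mean(oz, na.rm = TRUE), oz)     # mean imputation
> oz.median <- ifelse(is.na(oz), median(oz, na.rm = TRUE), oz) # median imputation
> c(sd(oz, na.rm = TRUE), sd(oz.mean), sd(oz.median))          # note the shrunken standard deviations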
Logical imputation is a technique that replaces the missing values appearing in a variable
(feature) by a logical value that makes sense due to domain knowledge or expertise. For
example in a survey the number of years served in prison might be missing for the subjects
who have never been sentenced. Replacing such missing values with zero makes sense for
the cases who have never been sentenced.
Replacing all missing values in a variable by a single value may distort the distribution
towards the replacement value in mean and median imputation. Simple random impu-
tation eliminates such distortion by replacing the missing data with a randomly selected
observed/measured value in the same variable. This kind of imputation has the implicit
assumption that the missingness in a variable has no systematic bias and it is not correlated
to any other variable in the data set.
Indicator variable imputation is especially common in regression analysis. In this method
an indicator variable is created for each variable which has missing value. The indicator
variables take on value zero if the corresponding value is missing and one otherwise. Then,
each missing value per variable is replaced by a single value (such as zero or mean of the
variable). Afterwards, the regression analysis is done by using the introduced indicator


variables as predictors as well. This method applies well if the missingness in a variable is
independent of the response variable and the predictor variables are uncorrelated.
Prediction model imputation is a technique that replaces the missing values appearing
in a variable (feature) by a value obtained through a prediction model which is based
on the other variables in the data set. KNN regression and linear regression for numeric
variables and logistic regression and linear discriminant analysis for categorical variables are
common models used in prediction model imputation. In prediction model imputation the
variable with the missing values is considered to be the response variable and one or more
other variables in the data set are considered to be explanatory (predictor) variables. The
estimated value by the model is used to replace the missing value for each record. Although
prediction model imputation is a stronger alternative to other imputation techniques it is
computationally expensive and requires carefully selecting the predictor variables. Moreover, in prediction model imputation the estimated values are simply the most likely values for the missing data; however, they do not carry the uncertainty through residual variance. This
may cause an overestimated model fit in later analysis. Stochastic regression adds average
regression variance to the estimated values for missing data to reduce the error term in
prediction model imputation. Finally, predictor variables with missing values need to be
handled properly as well. Iterative regression models can be used to resolve missing values in different columns iteratively. For details please see (Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman).
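The following is a rough sketch of prediction model imputation with a simple linear regression. It again uses the built-in airquality data and treats Ozone as the response with two fully observed predictors; the object names are made up and the residual-variance issue discussed above is ignored.

> # impute missing Ozone values from Wind and Temp with a linear model
> data(airquality)
> fit.lm <- lm(Ozone ~ Wind + Temp, data = airquality)  # rows with missing Ozone are dropped
> missing <- is.na(airquality$Ozone)
> imputed <- airquality$Ozone
> imputed[missing] <- predict(fit.lm, newdata = airquality[missing, ])
> summary(imputed)                                       # no missing values remain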
Imputation by nature introduces noise into the data set. Multiple imputation is a tech-
nique to reduce the noise due to imputation. In multiple imputation the missing values are
replaced by an existing imputation method such as simple random imputation or prediction
model imputation. However, the process is repeated multiple times to obtain more than one
imputed data sets. The analysis is done on each imputed data set and the result is pooled
by averaging or by some more advanced method. For details please see (Statistical Analysis with Missing Data by Roderick J. A. Little and Donald B. Rubin).

2.6.1.4 Smoothing Noisy Data


Noise in data is defined as the random error or variance appearing in observations or mea-
surements. The random error or variance in measurement may be due to faulty data collec-
tion instruments, unknown factors affecting the accuracy of the measurement or errors in
recording data. Binning, regression and outlier analysis are common methods for smoothing
noisy data.
Binning involves first sorting the data and then binning it into equal-frequency (same size or equi-depth) buckets. Then, each value in a particular bin or bucket is replaced by a value common to that bin to perform local smoothing. The local smoothing techniques based on binning get different names depending on the common value picked to smooth
out the values in the bin. Smoothing by bin means uses the mean value of the bin to replace
each value in the bin. Smoothing by bin medians uses the median value of the bin to replace
each value in the bin. Smoothing by bin boundaries replaces each value in the bin by the nearer of the bin's minimum and maximum values. As an alternative to equal-frequency buckets, equal-width intervals (or buckets) can be used for binning.
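The following sketch illustrates smoothing by bin means with equal-frequency bins on a small made-up vector; the values and helper names are chosen only for illustration.

> x <- sort(c(4, 8, 9, 15, 21, 21, 24, 25, 28, 29, 34, 36))  # sort the data first
> bins <- rep(1:3, each = length(x) / 3)                     # equal-frequency bin labels
> bin.means <- tapply(x, bins, mean)                         # common value per bin
> smoothed <- bin.means[bins]                                # replace each value by its bin mean
> data.frame(x, bin = bins, smoothed = as.numeric(smoothed))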
Using regression to replace each value of a variable in a data set by its predicted value
is another technique for smoothing noisy variables. Finally, outliers in a variable might be
detected by clustering and removed to reduce the amount of noise in the data set. However,
removing outliers is not a suggested method unless one is sure that the outliers are garbage


data. Please see the following part for a discussion on managing outliers in a data set.

2.6.1.5 Managing Outliers


MOVE THE OUTLIERS DISCUSSION TO HERE

2.6.1.6 Resolving Inconsistencies


Different people enter data in different ways. For example, some people may prefer full country names while others may prefer country codes while entering data. Similarly, different people
may prefer using different measurement units such as miles vs. kilometers or different date
formats such as MM/DD/YYYY vs. M/D/YY. Finally, data might have some erroneous
recordings such as a typo in a name or contradictory recording such as values out of pos-
sible range, e.g., a negative distance. One has to carefully screen the data set and fix the
inconsistencies before the analysis. Obviously, it is impossible to fix all the inconsistencies
especially those that are due to wrong recordings. However, it is always a good practice to
examine the summary statistics of variables in the data, to check the range of variables, to
check the levels of categorical variables, to check the measurement units and to ensure consistent date, phone number and zip code formats.

2.7 Exercises
Univariate and Bivariate Descriptive Statistics
1. Generate a vector of 10 numbers sampled from a normal distribution with mean zero,
µ = 0, and standard deviation eight, σ = 8. Compute and print the mean and
standard deviations of the elements in the vector. Discuss whether they are close to
the theoretical mean and standard deviation. Repeat the same experiment with 100,
1000 and 10000 samples, discuss whether the means and the standard deviations are
getting closer to the theoretical values.
2. Install package AER on your computer, if you have not done so before. Next, load the
dataset named NMES1988 in the AER package.
(a) In your own words, explain what this dataset is about after doing some research
about the dataset.
(b) Print the output of the str function on NMES1988. Also, use the help function
to explain the variables of NMES1988 dataset in your own words.
(c) Compute and print both the mean and the median of the emergency variable. Is the median much smaller than the mean? How does this affect the skewness of the distribution? Please explain. You may use the table function to have a look at the distribution of the emergency variable.
(d) Compute and print the mean and standard deviation of the age variable. How do you interpret the standard deviation of age?
(e) Compute and print the quartiles of the age variable.
(f) Using the 1.5 IQR rule, find and print how many outliers there are in age.
(g) Use coefficient of variation (CV) to compare and interpret the dispersion in
visits and income.


(h) Create a new dataset called NMES1988.SUB by excluding all variables of mode factor. That is, NMES1988.SUB has all rows, but only numeric and integer columns.
Note that the variables at column indices (7, 9, 10, 12, 13, 14, 17, 18, 19) are factors.
Then, print the correlation matrix of NMES1988.SUB.
(i) Regarding question 2h, what are the two distinct variables that have the least amount of correlation? How do you interpret it?
(j) Regarding question 2h, what are the two distinct variables that have the most amount of correlation? How do you interpret it?
(k) Regarding question 2h, explain why the correlations on the main diagonal of the
correlation matrix are one.


Chapter 3

Data Visualization

3.1 Visual Patterns to Observe in Data


While visualizing your data the main question is what to look for in the graphics. You
can observe various patterns in the graphics depending on the plot without applying any
statistical test. This part covers some of the very first patterns to observe in the graphics.

3.1.1 Outliers
In statistics, an outlier is an observation which appears to be isolated from other observations.
Outliers may appear genuinely in the dataset or may appear because of some error in mea-
surement or recording. In both cases you need to locate them and analyze them individually
or as a group to find the reasoning behind their existence. Box plots are good to locate out-
liers in one dimensional data. Scatter plots are good for locating outliers in two dimensional
data.

3.1.2 Asymmetry of the Distribution


Most of the time asymmetry in a distribution occurs in the form of positive or negative
skewness. Positive skewness means there is a long tail in the distribution towards the right. That is, most of the observations accumulate around the smaller values and there are fewer observations towards the higher values. For example, income distribution is positively skewed in general. Such a distribution simply tells us that there are many people in the population who have small incomes while there are a small number of people who have high incomes.
Density plots and histograms are good tools to visualize skewness in one dimensional data.
Heat map or 3D graphics are good to visualize two dimensional skewness.

3.1.3 Variability
In statistics, variability is a measure of dispersion denoting how a distribution is stretched
or squeezed. Most of the time there is a continuous variable in the data along with a
categorical variable. It is always good to analyze the variability of the continuous data per
the categorical data. Box plots per category help us analyze the variability in the continuous variable. One can additionally use jitter plots to see the number of observations


contributing to the variability per category as well. Other graphics for visually observing variability are density plots and histograms.
As for two dimensional data, covariance and correlation coefficient are good statistical
tools to see linear variability among two variables. Typically a scatter plot of two variables
might suggest that they do not covary together, they positively covary or they negatively
covary. Positive correlation (covariance) simply implies that as the values of one variable increase, the values of the other variable increase with them. Negative correlation (covariance) simply implies that as the values of one variable increase, the values of the other variable decrease. Scatter plots are the main tools to visualize the co-variation of two variables.

3.1.4 Clustering
In statistics cluster analysis is the process of determining different groups of observations in
the dataset. Having isolated or semi-isolated groups of observations in the dataset suggests the existence of clusters. One needs to analyze the factor causing the clustered behavior in the dataset. Scatter plots are good for visually seeing the existence of clusters in two dimensional data. Heat maps and 3D graphics are good for seeing clusters in three dimensional data.

3.1.5 Main Pattern


It is always good to try to come up with a model to represent the pattern of the data. A model is a mathematical function that we already know of and that resembles the pattern that the data exhibits. You can use histograms and density plots to see if the pattern of a single variable resembles any distribution that you know of. You can use scatter plots along with smoothing functions to see if the relation between two variables looks like a function that you already know of, e.g., linear, polynomial, or exponential.

3.1.6 Seasonal Effect and Anomalies


A time series is a set of measurements collected over time. In general, time series possess a seasonal effect that repeats itself throughout the interval over which the data is collected. The seasonal effect might not be perfect and might have some random noise, yet it would be observable most of the time. Line plots are good to see the pattern of the seasonal effect,
the length of the effect, the frequency of the effect as well as the anomalies that occur at
individual seasons.

3.2 ggplot2 Visualization


Grammar of Graphics is an abstract scheme introduced by Leland Wilkinson for data visual-
ization. It simply states that a graphics is composed of several smaller semantic components
and by controlling, i.e., adding, removing, altering, those components one can build a diverse
set of graphics.
ggplot2 is a data visualization package for statistical computing language / framework
R implemented by Hadley Wickham. ggplot2 is based on Grammar of Graphics hence, is
a very flexible tool to visualize data. The flexibility of ggplot2 makes it a good candidate
to replace base R plotting. On the other hand, the flexibility and abstraction provided by
ggplot2 makes it computationally more expensive compared to other plotting tools in R.
Yet, since its introduction ggplot2 has widely been adopted in R community.


Figure 3.1: How to install and load ggplot2

> install.packages("ggplot2")
> require(ggplot2)

3.2.1 Fundamentals of ggplot2


The concept of “layers” is at the core of ggplot2 because a graphics is nothing but a number of
layers displayed on top of each other. Any component of a graphics including the underlying
data, the geometric objects used to represent the data in the graphics, the axis of the
graphics, the labels of the graphics, and the statistical summaries of the displayed data
is considered to be a layer. As a result, the final graphics consists of the visualization of
multiple layers. The user generates the final graphics by adding, removing, and altering
layers.
The layers comprising a ggplot2 graphics consist of the following elements:

Geometries
A Geometry (geom) is the graphical element used to represent data or the statistics
of the data. Common geometries are points, lines, bars, densities, and text.
Aesthetics
Aesthetics are the attributes of geometries which control the aesthetical properties
of the displayed geometries. Common aesthetics attributes are x-position, y-position,
size of the geometry, shape of the geometry, and color of the geometry.

Statistics
A statistic is a function that transforms or summarizes the data in a different form.
Most of the time one is interested in displaying a statistic of the raw data rather
than or along with the raw data itself. Some statistics are smoothers, regression lines,
mean, median, quantile, and bins.

Scales
A scale controls the display of the coordinate system and can transform the coordinate system into a different form. Common scales are logarithmic axes, axis labels, axis limits, and axis colors.

3.2.2 Case Study: Constructing Graphics Using ggplot2

Figure 3.2: Mapping an aesthetic property to the data


A graphics in ggplot2 is constructed iteratively by adding and altering semantic layers.


Each layer can be seen as a modular component of the graphics. Data analysis requires a
dataset to be available in advance. In the following I am going to use the “diamonds” dataset
that comes with ggplot2. Assuming that ggplot2 is loaded you can type the following to
load the “diamonds” dataset. The diamonds dataset consists of around 54 thousand records
with ten variables including price, color, cut, clarity, depth, table, x, y and z. In some cases
we sample only 128 instances for illustration purposes.

Figure 3.3: Loading and sampling the diamonds dataset

> library(ggplot2)
> data(diamonds)

A graphics representing a dataset starts with the data layer. The data layer is created
using “ggplot(data)”. Parameter data is the R dataframe that we want to visualize. The
other layers are going to be added onto the ggplot object incrementally. Each layer, including
the data layer, may have one or more aesthetic parameters. Specific to the ggplot(...) function, any aesthetics defined in this function are inherited by the subsequent layers. To exploit this functionality we are going to use another form of the function defined as “ggplot(data, aes(x, y, <other aesthetics>))” and define the x and y parameters in order not to define them at each layer that we are going to add later. Note that aes(...) is itself passed as a function argument.
In the following we are going to analyze the relation between carat and price variables
of the diamonds dataset.

Figure 3.4: Creating the data layer ggplot object

> p <- ggplot(data=diamonds, mapping=aes(x=carat, y=price))

p is the ggplot2 object having only the data layer. Printing p at this step results in an
error because we haven’t added any geometric object (geom) representing the data points to
be plotted. Scatter plots are commonly used to display the relationship between two vari-
ables. The ggplot2 geom used to generate scatter-plots is “geom_point(mapping=aes(...),
...)”. A new layer is added to the existing layers using the + operator. Each layer that is
added to the graphics is technically a function used to create a graphics element.
In the following we are going to add a geometry layer to our data layer to have a
displayable graphics.
You can set values to shape and color parameters of geom_point() to control the shape
used to display the data and its color.
A very important characteristic of ggplot2 is that we did not explicitly set the x and y
variables for the geom_point(). In fact, the data, x variable, and y variable are inherited by geom_point() from the data layer, ggplot(). Note that you can override the inherited
parameters if you need to.
We can further improve the plot by adding a smoother statistic to see the pattern in
the data. We use function stat_smooth() to add the statistic layer to our data. The band
around the smoothing function is the 95% confidence interval. Note that one can display
only the smoother without the actual data points by omitting the geom_point() but keeping
the stat_smooth() element.


Figure 3.5: Adding a geom layer to the data layer

> p <- p + geom_point()


> p

[Scatter plot of price versus carat]


Figure 3.6: Adding a geom layer to the data layer

> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) +
    geom_point(color="green", shape=2)

[Scatter plot of price versus carat drawn with green points]


Figure 3.7: Adding a new layer using a statistical element

> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) + geom_point() +
    stat_smooth()

[Scatter plot of price versus carat with a smoothing curve and its confidence band]


3.2.3 Aesthetics
Aesthetics are used to define the mapping between the variables of the data and the proper-
ties of visual elements. Although there is a relation between aesthetics and geometric objects
most of the aesthetics are applicable to the majority of the geometric objects. Aesthetics
are defined using the aes() function.

Position Related Aesthetics


x and y define the x and y variables of the dataset to be plotted or the beginning co-
ordinates of a function/object to be plotted. xend and yend define ending coordinates
of a function/object to be plotted. ymin and ymax (xmin and xmax) are used to
delimit the beginning and the end of the range related visual elements such as error
bars, box plots, or confidence intervals.
Color Related Aesthetics
color and fill are used to set the colors of visual elements. color is used for single dimensional visual elements or for coloring the borders of two dimensional elements. fill is used to color the area of two dimensional visual elements. alpha controls the
opacity of visual elements.
Differentiation Related Aesthetics
linetype is used to define the type of the lines for line and path visual elements and
for the borders of two dimensional visual elements. size is used to control the sizes
of visual elements. shape is applicable to point geometric objects and used to set the
shape to be used to display the data.
Other Aesthetics
By default the color, shape, and size parameters of the aes() function group the data based on the given discrete variable. The group parameter can be used to explicitly group the data. The order parameter is used to explicitly re-order the appearance of visual elements.

3.2.3.1 Setting and Mapping Aesthetic Parameters


Aesthetic parameters can be set or mapped. You can set the parameters by using parameter = "value" in any layer. On the other hand, mapping an aesthetic parameter
to data involves mapping a variable in the dataset to an aesthetic parameter using the aes()
function, i.e., aes(parameter = variable). Depending on the parameter ggplot2 groups the
data according to the parameter and processes each group individually. In the following we
map the color parameter to the variable clarity in our dataset.
ggplot2 groups the data according to clarity and uses a different color to represent each group. Additionally, it shows the color legend on the right of the graphics. We mapped the color parameter of the ggplot aesthetics to the clarity variable of the diamonds dataset. Note that the diamonds dataset also has a variable named color representing the color of the diamonds in the dataset, and the color parameter of the ggplot2 aesthetics has nothing to do with the color variable of the dataset. To make the distinction clearer, in the following we map the color parameter of the aesthetics to the color variable of the diamonds dataset.
ggplot2 not only supports mapping categorical variables to the color parameter but also
continuous variables. In case the variable mapped to color parameter is continuous ggplot2
uses different shades of color to map the values. In the following we map the depth variable
of our sample diamonds dataset to color.


Figure 3.8: Mapping an aesthetic property to the data

> ggplot(data=diamonds, mapping=aes(x=carat, y=price, color=clarity)) +
    geom_point()

[Scatter plot of price versus carat with points colored by clarity and a clarity legend]


Figure 3.9: Mapping an aesthetic property to the data

> ggplot(data=diamonds, mapping=aes(x=carat, y=price, color=color)) +
    geom_point()

[Scatter plot of price versus carat with points colored by the color variable and a color legend]


Figure 3.10: Mapping color aesthetic property to a continuous variable

> ggplot(data=diamonds, mapping=aes(x=carat, y=price, color=depth)) +
    geom_point()

[Scatter plot of price versus carat with points shaded by the continuous variable depth]


On the other hand, not all aesthetic parameter mappings work for continuous variables. For example, the shape parameter groups the data and represents each group by a shape for a categorical variable; however, it does not do the same for a continuous variable.

3.2.3.2 Inheritance
In the following we are going to add one more layer to the graphics that we created in
Terminal 3.8. However, instead of employing the entire dataset, we will work with a sample
consisting of only 128 instances for illustration purposes. geom_line() is a geometric object
used to create a layer that connects the data points through lines.

Figure 3.11: Inheriting an aesthetic mapping from the data layer

> set.seed(2020)
> dsample <- diamonds[sample(nrow(diamonds), 128), ]
> ggplot(data=dsample, mapping=aes(x=carat, y=price, color=clarity)) +
geom_point() + geom_line()

[Scatter plot of price versus carat for the 128-instance sample with points and connecting lines colored by clarity]

Similar to geom_point(), geom_line() inherited the color aesthetic mapping from the data layer created by the ggplot() function. Remember that the color mapping groups the data according to the given variable and plots each group individually on the graphics. Hence, each sub-group is displayed using a different color.
ggplot2 geometric objects inherit the aesthetic mappings only from the data layer de-
fined by the ggplot() function. Moving the data mapping from the data layer to a geometric
object invalidates the inheritance. In the following we move the color mapping from gg-
plot() to the geom_point(). Although the dataset is grouped based on clarity and different
clarities are represented by different point colors, grouping and its representation is not


inherited by the geom_line() layer. As a result geom_line() treats the dataset as a single
unified group.

Figure 3.12: Mapping an aesthetic in non-data layers

> set.seed(2020)
> dsample <- diamonds[sample(nrow(diamonds), 128), ]
> ggplot(data=dsample, mapping=aes(x=carat, y=price)) +
geom_point(aes(color=clarity)) + geom_line()

[Scatter plot of price versus carat where the points are colored by clarity but the connecting line treats the data as a single group]

A really useful feature of ggplot2 is overriding inheritance at different layers of the


graphics based on the final graphics that we want to obtain. In the following we add a
smoothing function to our graphics as a layer. However, we override the grouping by setting
the group parameter to 1. Hence we obtain a single smoother for the entire dataset rather
than a smoother for each level of clarity. Setting the se parameter to FALSE disables the confidence interval of the smoother in the graphics.

3.2.4 Geometric Objects


A geometric object such as point, bar, box plot, and line is the visualization form used
to display data. As listed in Table 3.1 ggplot2 comes with a very diverse set of geometric
objects that display data in various visual forms. Some of those geometric objects are
convenient wrappers of other geometric or statistic objects. For example, geom_jitter is
a special version of geom_point with position parameter set to jitter and geom_smooth
provides the same functionality as stat_smooth.
In the rest of this part we are going to show examples of the most commonly used
geometric objects.


Figure 3.13: Mapping an aesthetic in non-data layers

> ggplot(data=diamonds, mapping=aes(x=carat, y=price, color=clarity)) +


geom_point(aes(color=clarity)) + stat_smooth(se=FALSE, aes(group=1))


Table 3.1: ggplot2 geometric objects

geom_abline() geom_area() geom_bar()
geom_bin2d() geom_blank() geom_boxplot()
geom_contour() geom_crossbar() geom_density()
geom_density2d() geom_dotplot() geom_errorbar()
geom_errorbarh() geom_freqpoly() geom_hex()
geom_histogram() geom_hline() geom_jitter()
geom_line() geom_linerange() geom_map()
geom_path() geom_point() geom_pointrange()
geom_polygon() geom_quantile() geom_raster()
geom_rect() geom_ribbon() geom_rug()
geom_segment() geom_smooth() geom_step()
geom_text() geom_tile() geom_violin()
geom_vline()

3.2.4.1 Histogram
A histogram is a visual representation of the distribution of single variable data (the y-axis
is count by default). A histogram is constructed by counting the number of observations
falling into a particular category (bin). For categorical variables each category serves as a
bin. On the other hand, the range of a continuous variable is divided into disjoint intervals
that serve as bins. Although equal-interval bins are common, one can use bins of different
lengths to construct a histogram. The ideal number of bins, on the other hand, depends
on the distribution of the data. Given that there are n observations, √n can be used as the
number of bins, in general. Note that histograms are highly sensitive to the number of bins
or bin intervals.
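As a minimal sketch of this rule of thumb (the bin count below follows the √n heuristic and is our choice, not a ggplot2 default; the bins parameter is available in recent ggplot2 versions):

> # a sketch: using roughly sqrt(n) bins for a histogram of price
> ggplot(data=diamonds, mapping=aes(x=price)) +
    geom_histogram(bins=round(sqrt(nrow(diamonds))))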

3.2.4.2 Density Plots


Density functions are single variable geometric objects describing the relative likelihood of
a (continuous) variable falling within an interval. geom_density visualizes a smooth density
of the observed data estimated by a kernel (Gaussian by default). The area under the density
function is 1. The following shows the estimated density function of the price variable of
our dataset.
Furthermore, you can map a categorical variable in your dataset and plot multiple
densities on the same graphics.
Note that the densities per category in Terminal 3.17 overlap in the final graphics.
Obviously this default behavior adversely affects the readability of the graphics. Later, we
are going to adjust the positioning of overlapping densities to obtain a better graphics.

3.2.4.3 Box and Jitter Plots


Box plots are single variable geometric objects depicting groups of numerical data
using quartiles. A box plot is depicted by a rectangle having a line inside and two whiskers.
The lower and upper boundaries of the rectangle represent the first (Q1) and third (Q3)
quartiles of the data. The line inside the rectangle shows the second (Q2) quartile of the

Figure 3.14: Bar graphics of a categorical variable

> ggplot(data=diamonds, mapping=aes(x=cut)) + geom_bar()


Figure 3.15: Histogram graphics of a continuous variable

> ggplot(data=diamonds, mapping=aes(x=price)) + geom_histogram()


Figure 3.16: Density graphics of a continuous variable

> ggplot(data=diamonds, mapping=aes(x=price)) + geom_density()


Figure 3.17: Density graphics of a continuous variable grouped by a categorical variable

> ggplot(data=diamonds, mapping=aes(x=price, fill=cut)) + geom_density(alpha=0.3)


data which is the median. The top and bottom whiskers extend to the maximum and
minimum values that are not considered to be outliers. The points (if any) shown beyond
the top and bottom whiskers are the outliers. The whisker limits are calculated either
as 1.5 IQR or 1.58 IQR/√n beyond the quartiles, where IQR is the inter-quartile range
calculated by Q3 − Q1 and n is the number of observations in the dataset.
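As a small hand computation of the 1.5 IQR rule (the variable names below are ours, introduced only for illustration), the upper whisker limit of price for Ideal-cut diamonds can be derived as follows:

> # a sketch: price values above this threshold are drawn as outlier points
> ideal <- diamonds$price[diamonds$cut == "Ideal"]
> q <- quantile(ideal, probs=c(0.25, 0.75))   # Q1 and Q3
> unname(q[2] + 1.5 * (q[2] - q[1]))          # Q3 + 1.5 IQR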

Figure 3.18: Box plots

> ggplot(data=diamonds, mapping=aes(x=cut, y=price)) + geom_boxplot()


Although box plots are useful to study the ranges of different quartiles, they do not
reflect the number of points falling into each x-axis category. To develop further intuition
it might be useful to plot the points along with the box plots.
One problem with Terminal 3.19 is that the points overlap. To remedy the problem we
can use geom_jitter instead of geom_point. geom_jitter adds random noise to the positions
of the points to improve the readability of possibly overlapping points.
Terminal 3.20 is better than Terminal 3.19; however, the figure still exhibits the over-
lapping problem due to the very high number of instances. There are two visualization
approaches to remedy the problem: (i) randomly sampling the dataset and visualizing the
smaller sample, or (ii) using the alpha parameter to control the transparency of the points.
In Terminal 3.21, we take the second approach.

3.2.4.4 Text
geom_text is the ggplot2 geometric object used for adding text labels to graphics. The required
parameters x, y, and label define the x and y positions of the text and the text label,


Figure 3.19: Box plots with points

> ggplot(data=diamonds, mapping=aes(x=cut, y=price)) + geom_boxplot() +
    geom_point()


Figure 3.20: Box plots with jittered points

> ggplot(data=diamonds, mapping=aes(x=cut, y=price)) + geom_boxplot() +
    geom_jitter()


Figure 3.21: Box plots with semi-transparent jittered points

> ggplot(data=diamonds, mapping=aes(x=cut, y=price)) + geom_boxplot() +
    geom_jitter(alpha=0.02)


respectively. The x, y positions are defined according to the axes of the coordinate system
and correspond to the center of the text label. In case one of the axes is categorical (a factor),
it is mapped to the integer levels of the factor while still supporting decimal coordinates.
The additional parameters family, fontface, angle, size, alpha, and color allow setting the font family,
font face (plain, bold, italic), angle, size, transparency, and color of the text label.
Note that one may need to set the data parameter of geom_text to NULL to prevent all
row names of a dataset from being displayed. Lastly, geom_label is similar to geom_text with
an additional rectangle drawn behind the text.
geom_text is demonstrated in Terminal 3.22. However, instead of employing the entire
dataset, we work with a sample consisting of only 128 instances for illustration purposes.

Figure 3.22: A text highlighting an outlier

> set.seed(2020)
> dsample <- diamonds[sample(nrow(diamonds), 128), ]
> ggplot(data=dsample, mapping=aes(x=cut, y=price)) + geom_boxplot()+
geom_text(x=2.2, y=12600, label="Outlier", family="Times New Roman",
fontface="plain", color="red")

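The same annotation can also be drawn with geom_label, which adds a filled rectangle behind the text; a minimal sketch mirroring Terminal 3.22 (the position and label text are illustrative):

> # a sketch: geom_label draws the text inside a rectangle
> set.seed(2020)
> dsample <- diamonds[sample(nrow(diamonds), 128), ]
> ggplot(data=dsample, mapping=aes(x=cut, y=price)) + geom_boxplot() +
    geom_label(x=2.2, y=12600, label="Outlier", color="red")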

3.2.4.5 Positioning Overlapping Geometric Objects


Depending on the data, geometric objects, e.g., points, lines, and bars, may overlap on top of each
other in the graphics. One can adjust the overlapping of geometric objects by setting the
position parameter. The position parameter can also be used to visualize data in customized
forms. Below are the possible values to adjust the positioning of overlapping visual forms.


identity
Simply uses x and/or y values of data while plotting geometric objects.
jitter
Adds random noise to x and/or y values of data while plotting geometric objects.
stack
Stacks the geometric objects on top of each other.
dodge
Dodges the geometric objects side by side.
fill
Expands or shrinks the geometric objects to fill the space in the final graphics.

Many geometric objects have identity as the default position value. stack and dodge can
be used with geom_bar (the default is stack). jitter can be used with geom_point. fill can be
used with geom_density and geom_bar to show proportions. Note the difference between
the fill value of the parameter position and the fill parameter of the aesthetic mapping function aes().
Similar to geom_histogram, geom_bar is a single variable geometric object and the y-axis
is always the number of observations, i.e., count. geom_bar is used for factors (categorical
variables) while geom_histogram is used for continuous numeric variables. The following
three examples show the number of cut observations per clarity in the dataset using the
positions stack (the default), dodge, and fill, respectively.
The following shows price densities stacked up per cut using the position values stack and
fill, respectively.
Again, note the difference between the fill value of the parameter position and the fill parameter
of the aesthetic mapping function aes().

3.2.4.6 Pie Charts


Pie charts are circular charts divided into sectors representing proportions. Pie charts are
widely used in mass media. However, it is hard to read pie charts because visually examining
sector areas is difficult compared to examining lengths. Hence, it is suggested to use bar
charts, instead.
ggplot2 does not have a geometric object for pie charts. Yet, geom_bar can be used
with polar coordinates to generate pie charts.
In Figure 3.28 we map the x axis to a factor with the single value one and fill to the variable cut.
Setting position to 'fill' expands the bar to cover the entire space. coord_polar changes
the coordinate system to the polar coordinate system. The scale_* objects set the labels of the
x and y coordinates to the empty string.

3.2.5 Statistics Objects


ggplot2 naturally supports adding a variety of statistical summaries of the data to graphics.
Table 3.2 shows the different statistic objects defined in the ggplot2 package. Similar to geometric
objects, you need to add a statistic object to the graphics in order to plot the statistical
summaries of the data. All statistic objects are associated with a default geometric object
used to visualize the required statistical summaries or transformations. Similarly all

Figure 3.23: Bar graph of variables cut per clarity with default position stack

> ggplot(data=diamonds, mapping=aes(x=cut, fill=clarity)) +
    geom_bar(color="black")


Figure 3.24: Bar graph of variables cut per clarity with position dodge

> ggplot(data=diamonds, mapping=aes(x=cut, fill=clarity)) +
    geom_bar(color="black", position="dodge")


Figure 3.25: Bar graph of variables cut per clarity with position fill

> ggplot(data=diamonds, mapping=aes(x=cut, fill=clarity)) +
    geom_bar(color="black", position="fill")


Figure 3.26: price density grouped by cut using position stack

> ggplot(data=diamonds, mapping=aes(x=price, fill=cut)) +
    geom_density(position="stack")


Figure 3.27: price density grouped by cut using position fill

> ggplot(data=diamonds, mapping=aes(x=price, fill=cut)) +
    geom_density(position="fill")


Figure 3.28: cut pie chart

> ggplot(data=diamonds) + geom_bar(mapping=aes(x=factor(1), fill=cut), width=1,
    position="fill", color="black") + coord_polar(theta="y") +
    scale_y_continuous(name="") + scale_x_discrete(name="")


geometric objects are associated with a default statistical object used to transform or sum-
marize the data before visualizing them on the graphics. For example, the default statistical
object for geom_smooth is stat_smooth and the default geometric object for stat_smooth
is geom_smooth. That is why adding stat_smooth to a graphics not only computes a
smoothing function for the data but also visualizes it using geom_smooth. Similarly, adding
geom_smooth to a graphics uses stat_smooth to calculate the smoothing function before
plotting the function. This design of ggplot2 results in many geometric and statistic objects
sharing the same name. On the other hand, one may override the default geometric object
of a statistic object to change its default visualization or may override the default statistic
object of a geometric object to change its default statistical transformation or summary.
Note that the default statistic object for many geometric objects is stat_identity meaning
not to transform data or compute any statistical summary.
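As a minimal sketch of such an override (the rendering choice below is ours, for illustration only), a statistic object can be drawn with a geometric object other than its default:

> # a sketch: stat_smooth is normally drawn by geom_smooth;
> # here the fitted curve is rendered as a small set of points instead
> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) + geom_point(alpha=0.1) +
    stat_smooth(geom="point", color="red", n=25, se=FALSE)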

Table 3.2: ggplot2 statistic objects

stat_abline() stat_bin() stat_bin2d()
stat_bindot() stat_binhex() stat_boxplot()
stat_contour() stat_density() stat_density2d()
stat_ecdf() stat_function() stat_hline()
stat_identity() stat_qq() stat_quantile()
stat_smooth() stat_spoke() stat_sum()
stat_summary() stat_summary2d() stat_summary_hex()
stat_unique() stat_vline() stat_ydensity()

3.2.5.1 Smoothing Functions


Smoothing is a non-parametric statistical technique used to approximate a real-valued
function that fits the data. stat_smooth is the ggplot2 statistical object for estimating smoothing
functions. There are multiple techniques in the literature to compute a smoothing func-
tion. By default stat_smooth uses local polynomial regression fitting (loess) for datasets
with at most 1000 observations and a generalized additive model with penalized cubic regres-
sion splines (gam) for datasets having more than 1000 observations. The following terminal
shows the smoothing function estimated to fit the relation between the carat and price variables
in our dataset.
stat_smooth supports many techniques to compute a smoothing function by allowing
the user to set the method and formula parameters. method defines the technique used
to compute the smoothing function and formula specifies the model used to define the
smoothing function.
In the following terminal we use ordinary linear regression to model the relation between
carat and price as a 7th order polynomial function.

3.2.5.2 Functions without Data


The stat_function statistic object is used to plot various functions without data. The essential
parameters of stat_function are fun to set the function, n to set the number of points to interpolate,
args to pass additional arguments to the function, and geom to set the geometric object
(a line by default) used to visualize the function.


Figure 3.29: Smoothing function for variables carat and price

> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) + geom_point() +
    stat_smooth()


Figure 3.30: Smoothing function for variables carat and price using polynomial regression

> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) + geom_point() +
    stat_smooth(method = lm, formula = y ~ poly(x, 7))


The following terminal output shows how to plot the function x^3 + 2x^2 - 5x + 7.

Figure 3.31: Graphics of x^3 + 2x^2 - 5x + 7

> ggplot(data.frame(domain=c(-5,5)), aes(x=domain)) +
    stat_function(fun=function(x) {x^3 + 2*x^2 - 5*x + 7}, n=1000)
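The args parameter can be illustrated with a built-in density function; a minimal sketch (the mean and sd values below are arbitrary choices):

> # a sketch: passing additional arguments to the plotted function through args
> ggplot(data.frame(domain=c(-5, 5)), aes(x=domain)) +
    stat_function(fun=dnorm, args=list(mean=1, sd=2), n=500)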


3.2.5.3 Summary Statistics


stat_summary is the ggplot2 statistic object used to plot different statistical summaries
of the data including the mean, median, variance, and standard deviation. The parameter fun of
stat_summary (named fun.y in older ggplot2 versions) is used to pass the function computing
a summary statistic. The following terminal plots the mean values of price per cut in our
dataset along with the box plot.
The parameter fun.data of stat_summary is used to pass a function computing ranges
of data summaries such as confidence intervals. For example, mean_sdl(mult=k), backed by the
Hmisc package, returns the sample mean and its lower and upper k standard deviation values. The
following terminal output shows the jitter plot of price per cut in our dataset along with
the mean values and 1 standard deviation ranges.
Furthermore, you can use mean_cl_boot(), mean_cl_normal(), and median_hilow()
functions to compute mean and 95% bootstrap confidence interval, mean and 95% Normal
confidence interval, and median and middle 95% quantile, respectively.
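A minimal sketch using one of these helpers (the geometry and color below are our own choices):

> # a sketch: means with 95% normal-theory confidence intervals per cut
> library(Hmisc)
> ggplot(data=diamonds, mapping=aes(x=cut, y=price)) +
    stat_summary(fun.data=mean_cl_normal, geom="pointrange", color="blue")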

3.2.6 Bivariate Density Estimation


stat_density2d is used to estimate the joint density of two variables. It uses two dimensional
kernel density estimation with an axis-aligned bivariate normal kernel, evaluated on a square


Figure 3.32: Mean values along with box plot

> ggplot(data=diamonds, mapping=aes(x=cut, y=price)) + geom_boxplot() +
    stat_summary(fun=mean, geom="point", color="red", size=3)


Figure 3.33: Mean values along with jitter plot and 1 standard deviation ranges

> library(Hmisc)
> ggplot(data=diamonds, mapping=aes(x=cut, y=price)) + geom_jitter(alpha=0.02) +
+ stat_summary(fun.data=mean_sdl, geom="pointrange", color="blue") +
+ stat_summary(fun=mean, geom="point", color="red", size=2)


grid. It bins the two dimensional scatter plot and maps the three dimensional density
function onto topographic contours. The following terminal output shows joint density
estimation of variables carat and price in our dataset.

Figure 3.34: Bivariate density estimation

> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) + stat_density2d()


3.2.6.1 QQ Plots
QQ (Quantile-Quantile) plots are probability plots used to visually compare two distribu-
tions by plotting their quantiles against each other. If the final graphics looks like the y = x
line then the distributions on the x and y axes might be the same. Any other line suggests a
possible linear relation between the distributions.
stat_qq is the ggplot2 statistic object for visualizing QQ plots of two distributions. In
the following we are going to show the QQ plot of 256 randomly generated synthetic values
(N(µ = 0, σ = 1)) against the theoretical normal distribution.
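In more recent ggplot2 versions (an assumption about the installed version), stat_qq_line can add the reference line directly; a minimal sketch:

> # a sketch: a QQ plot with a reference line
> # (requires a ggplot2 version that provides stat_qq_line)
> ggplot(data=data.frame(values=rnorm(256)), mapping=aes(sample=values)) +
    stat_qq() + stat_qq_line(color="red")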

3.2.6.2 Accessing Computed Statistic Values


Statistic objects of ggplot2 generate values to represent statistical transformations or sum-
maries. Some of those intermediate or final variables are available to ggplot2 objects. Two
consecutive periods, .., at the beginning and end of a variable name imply that the variable is
generated by a statistic object. For example, the default behavior of a histogram is to use the
frequency, i.e. the number of observations, as the y-axis. However, one can prefer to plot relative

Figure 3.35: QQ Plot

> ggplot(data=data.frame(values=rnorm(256)), mapping=aes(sample=values)) +
    stat_qq(distribution=qnorm)


frequency (..density..) histogram. That is the number of observations falling into a bin di-
vided by the total number of observations. The following terminal output shows the relative
frequency histogram of the price variable.

Figure 3.36: Relative frequency histogram

> ggplot(data=diamonds, mapping=aes(x=price)) +
    geom_histogram(aes(y=..density..))


The following adds a density plot to the relative frequency histogram shown in Termi-
nal 3.36.
Similarly, one can use ..count.. to refer to the exact frequency of observations. The
following terminal output shows the same plot shown in Terminal 3.27 using the exact
frequencies rather than relative frequencies.
The topographic contours of stat_density2d can be accessed through the ..level.. variable.
The following terminal output shows the joint density estimation of the variables carat and price
with the contours replaced by filled polygons.
Alternatively, one can disable the contours by setting the contour parameter to FALSE and
generate a heat map based on the joint density.

3.2.7 Scales
Remember that aesthetic parameters control the way we map the data to geometric objects
in our graphics. Scales, however, control the visual representations of the aesthetic properties
of geometric objects. The essential aesthetic parameters of ggplot2 are x, y, size, shape,
linetype, color, fill, alpha, group, and order. These parameters are visually represented as


Figure 3.37: Relative frequency histogram along with density plot

> ggplot(data=diamonds, mapping=aes(x=price)) +
    geom_histogram(aes(y=..density..)) + geom_density(color="red")


Figure 3.38: price density grouped by cut using position fill with ..count..

> ggplot(data=diamonds, mapping=aes(x=price, fill=cut)) +
    geom_density(aes(y=..count..), position="fill")


Figure 3.39: Bivariate density with filled levels

> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) +
    stat_density2d(geom = "polygon", aes(fill = ..level..))


Figure 3.40: Bivariate density as a heat map

> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) +
    stat_density2d(geom = "tile", contour=FALSE, aes(fill = ..density..))


Table 3.3: Aesthetics Parameters and Guides

Aesthetic Parameter Discrete Guide Continuous Guide

x scale_x_discrete() scale_x_continuous()
scale_x_date()
scale_x_datetime()
y scale_y_discrete() scale_y_continuous()
scale_y_date()
scale_y_datetime()
size scale_size_discrete() scale_size_continuous()
scale_size_identity() scale_area()
scale_size_manual()
shape scale_shape_discrete()
scale_shape_identity()
scale_shape_manual()
linetype scale_linetype_discrete()
scale_linetype_identity()
scale_linetype_manual()
color scale_color_hue() scale_color_gradient()
scale_color_brewer() scale_color_gradient2()
scale_color_grey() scale_color_gradientn()
scale_color_identity()
scale_color_manual()
fill scale_fill_hue() scale_fill_gradient()
scale_fill_brewer() scale_fill_gradient2()
scale_fill_grey() scale_fill_gradientn()
scale_fill_identity()
scale_fill_manual()
alpha scale_alpha_continuous()

“guides”. The guides for x and y aesthetic parameters are x and y axes of the graphics.
The guides for other aesthetic parameters are simply legends. These guides not only vi-
sually represent the aesthetic properties but also allow readers to interpret the meanings
of aesthetic mappings in the graphics. Note that most of the time ggplot2 automatically
picks the proper scales according to the aesthetic parameters that we use to map the data
to geometric objects.
Table 3.3 shows the different scales used to control the guides (visual representations)
of various aesthetic parameters.
Although ggplot2 provides scale_shape_continuous and scale_linetype_continuous, these
scales are not applicable to continuous variables.

3.2.7.1 Controlling x and y Axes


Scales in ggplot2 have their own parameters. scale_x_discrete, scale_x_continuous, scale_y_discrete,
and scale_y_continuous are used to control the x and y axes of a graphics. The most im-
portant parameters are name, breaks, labels, and limits. name is used to assign names to the x
and y axes. breaks is used to determine the tick positions of the x and y axes. limits is used to set
the lower and upper boundaries of the x and y axes.
The following terminal output shows the same graphics presented in Terminal 3.5 with


a finer-grained x axis and a limited y axis.

Figure 3.41: Scatter plot with altered x and y axes names and tics

> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) + geom_point() +
    scale_x_continuous(name="Diamond Cut", breaks=seq(from=0, to=3, by=0.25),
                       limits=c(0,3)) +
    scale_y_continuous(name="Diamond Price", limits=c(0, 15000))


labels is used to re-label the x and y axis ticks. The best practice is to use it together with
breaks and map each tick generated by breaks to a label. Another important parameter
of scale_x_continuous and scale_y_continuous is the trans parameter, which is used to
transform the scales of the x and y axes. Note that in practice trans is applicable to continuous
axes rather than categorical axes. Table 3.4 shows the possible axis transformations.

Table 3.4: ggplot2 axes transformations

asn atanh boxcox
coord date exp
identity log10 log1p
log2 log logit
probability probit reciprocal
reverse sqrt time

The following terminal output shows the same graphics presented in Terminal 3.5 with
log10 axes transformations.


Figure 3.42: Scatter plot with axes transformations

> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) + geom_point() +
    scale_x_continuous(trans="log10") +
    scale_y_continuous(trans="log10")

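The labels parameter described earlier can be paired with breaks; a minimal sketch (the tick labels below are made up for illustration):

> # a sketch: relabeling the y axis ticks produced by breaks
> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) + geom_point() +
    scale_y_continuous(breaks=c(5000, 10000, 15000),
                       labels=c("$5K", "$10K", "$15K"))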


3.2.7.2 Controlling Color Legends and their Properties


Aesthetic parameters shape and linetype are applicable to only discrete variables. Aesthetic
parameter alpha is applicable to only continuous variables. On the other hand aesthetic
parameters color, fill, and size are applicable to both discrete and continuous variables.
So far we have used these aesthetic parameters to map different variables in our data.
However, we have depended on the defaults of ggplot2 to pick various shapes and colors to
represent different groups of our dataset. The default color palette of ggplot2 consists of
equidistant colors with equal luminance so that none of the colors is visually emphasized.
However, the default color palette is hard to read for colorblind readers and will print to
the same shade of gray. In this part we are going to use different scales to set the properties of
legends which in turn control the visual properties of geometric objects.
scale_fill_brewer and scale_color_brewer allow us to select different color palettes that
come with the RColorBrewer package. The type and palette parameters are used to set
different color types and palettes to visualize the data. You can check the available palettes
at http://colorbrewer2.org/ .
The following terminal output shows the same graphics presented in Terminal 3.23 with
the palette named “Set2”.

Figure 3.43: Using palette “Set2” to fill bar plots

> ggplot(data=diamonds, mapping=aes(x=cut, fill=clarity)) +
    geom_bar(color="black") + scale_fill_brewer(palette = "Set2")


Furthermore, one can manually assign colors using the values parameter of scale_fill_manual.
You can use color names, RGB values, or HEX colors to set the values parameter.


The following terminal output shows the same graphics presented in Terminal 3.43 with
manual colors.

Figure 3.44: Using manual colors to fill bar plots

> ggplot(data=diamonds, mapping=aes(x=cut, fill=clarity)) +
    geom_bar(color="black") +
    scale_fill_manual(values=c("#b2182b", "#d6604d", "#f4a582", "#fddbc7",
                               "#e0e0e0", "#bababa", "#878787", "#4d4d4d"))


You can use scale_fill_grey to set the legend colors to gray scale as shown in the following
terminal.
Similar to colors, ggplot2 allows setting gradient colors for continuous data. scale_fill_gradient
uses parameters low and high to explicitly set the color gradient gradually changing from
low to high.
The following terminal output shows the same graphics presented in Terminal 3.40 with
manual colors.
scale_fill_gradient2 is similar to scale_fill_gradient with the additional parameters mid, to
set the middle color between the low and high colors, and midpoint, to set the middle point
value.
scale_fill_gradientn is a generalized version of scale_fill_gradient2 for n colors that
are given manually or using R's built-in color palettes, like rainbow(), terrain.colors(), and
topo.colors(), through the colors parameter.
The following terminal output shows the same graphics presented in Terminal 3.46 with
rainbow colors.


Figure 3.45: Using gray scale to fill bar plots

> ggplot(data=diamonds, mapping=aes(x=cut, fill=clarity)) +
    geom_bar(color="black") + scale_fill_grey()


Figure 3.46: Setting gradient values manually

> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) +
    stat_density2d(geom="tile", contour=FALSE, aes(fill=..density..)) +
    scale_fill_gradient(low="yellow", high="red")


Figure 3.47: Setting gradient values using rainbow palette

> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) +
    stat_density2d(geom="tile", contour=FALSE, aes(fill=..density..)) +
    scale_fill_gradientn(colors=rainbow(6))

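scale_fill_gradient2, described above, adds a middle color; a minimal sketch applied to the same density tile plot (the colors and midpoint value are arbitrary choices):

> # a sketch: a diverging gradient centered at an arbitrary midpoint
> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) +
    stat_density2d(geom="tile", contour=FALSE, aes(fill=..density..)) +
    scale_fill_gradient2(low="blue", mid="white", high="red", midpoint=5e-04)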


3.2.7.3 Controlling Shape/Line Legends and their Properties


scale_linetype_manual and scale_shape_manual are the main scales to control the types of
lines and the shapes of points used to represent different categories of data. Note that both shape
and line type are applicable to categorical variables. The following terminal output shows
how to manually set line types using scale_linetype_manual. You can use the same scheme
to manually assign shapes to points using scale_shape_manual.
scale_linetype_manual is demonstrated in Terminal 3.48. However, instead of employing
the entire dataset, we work with a sample consisting of only 128 instances for illustration
purposes.

Figure 3.48: Manually setting line types

> set.seed(2020)
> dsample <- diamonds[sample(nrow(diamonds), 128), ]
> ggplot(data=dsample, mapping=aes(x=carat, y=price, linetype=cut)) + geom_line() +
    scale_linetype_manual(values=c("dashed", "dotted", "solid", "dotdash", "twodash"))

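Following the same scheme, a minimal sketch of scale_shape_manual (the shape codes are standard R plotting symbols chosen for illustration):

> # a sketch: manually assigning point shapes to the levels of cut
> set.seed(2020)
> dsample <- diamonds[sample(nrow(diamonds), 128), ]
> ggplot(data=dsample, mapping=aes(x=carat, y=price, shape=cut)) + geom_point() +
    scale_shape_manual(values=c(0, 1, 2, 15, 16))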

3.2.8 Coordinate Systems


A coordinate system in ggplot2 defines the locations of geometric objects. coord_cartesian
is the default coordinate system in ggplot2 and represents a horizontal x axis and a vertical
y axis. coord_flip is the flipped Cartesian coordinate system where the x axis is vertical and
the y axis is horizontal. coord_trans is used to define x or y axis transformations through
the parameters xtrans and ytrans. The valid transformations are covered in the previous
section (Table 3.4). coord_polar is the polar coordinate system where a point is repre-


sented by its distance from the center (pole) and an angle. A point (x, y) is represented as
(√(x^2 + y^2), arctan(y/x)). Finally, coord_map is used for map projections.
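As a minimal sketch (our own example, not taken from the original figures), coord_flip turns the box plot of Terminal 3.18 on its side:

> # a sketch: flipping the Cartesian coordinates so cut runs along the y axis
> ggplot(data=diamonds, mapping=aes(x=cut, y=price)) + geom_boxplot() + coord_flip()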

3.2.9 Faceting
Aesthetic mappings such as color, shape, or fill subgroup the data and plot the subgroups on the
same graphics. Although this approach allows us to compare different subgroups, it might be
difficult to read from time to time. An alternative approach supported by ggplot2 is faceting, which
creates a panel for each data subgroup mapped by an aesthetic and plots the subgroup on
a different panel. This way one can visually compare and contrast the plots of different
subgroups better in some cases. facet_wrap and facet_grid are two ggplot2 faceting objects
that are good for univariate and bivariate faceting, respectively. Both objects support the facets
parameter to set the variable to be used for subgrouping. The facets parameter of
facet_wrap is set using the notation facets=~variable. As for facet_grid, it is set using the
notation facets=rowVariable~columnVariable. Additionally, facet_wrap supports the nrow and
ncol parameters to set the number of rows and columns of the panel table, respectively.
The following terminals show examples of faceting using facet_wrap and facet_grid.

Figure 3.49: Faceting data per cut using facet_wrap

> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) + geom_point() +
    facet_wrap(facets=~cut)

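The nrow and ncol parameters mentioned above control the panel layout; a minimal sketch (the single-row layout is an arbitrary choice):

> # a sketch: laying out the cut facets in a single row
> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) + geom_point() +
    facet_wrap(facets=~cut, nrow=1)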

One may map the color parameter to either the row or column variable of the facet_grid
function to use different colors for the levels of the categorical variable, as shown in Fig-
ures 3.51 and 3.52. Note that this type of visualization often requires hiding the legend


Figure 3.50: Faceting data per clarity and cut using facet_grid

> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) + geom_point() +
    facet_grid(facets=clarity~cut)


using the legend.position parameter of the function theme.

Figure 3.51: Faceting data per clarity and cut using facet_grid

> ggplot(data=diamonds, mapping=aes(x=carat, y=price, color=clarity)) + geom_point() +
    facet_grid(facets=clarity~cut) + theme(legend.position="none")


Sometimes it is necessary to plot the variables in a dataset as pairs and analyze them
side by side in a matrix. ggplot2 does not provide a direct function to plot a matrix of
paired variables. However, GGally, an extension package to ggplot2, provides a function
named ggpairs to plot matrices of paired variables. By default the diagonal of the paired
variables matrix holds the variable names. The upper and lower triangles of the matrix
show graphics of paired variables where the row is the x axis and the column is the y axis.
Depending on the data type each panel shows the default ggplot2 graphics object. If both
row and column are numeric it shows a scatter plot in the lower triangle panel and the
correlation coefficient value in the upper triangle panel. If the row is numeric but the column
is a factor it shows a bar plot. If the row is a factor but the column is numeric it shows a box plot.
If both row and column are factors it shows a bar plot.
The most important parameters of the function ggpairs are data, columns, upper, lower,
and diag. data denotes the data set to be used. columns specifies the columns to be cross
paired. It accepts either a vector of column numbers or a vector of column names as strings. upper
and lower are used to set the plot types for the upper and lower panels of the matrix. Possi-
ble variable pair types are continuous-continuous (continuous), discrete-discrete (discrete),
and continuous-discrete (combo). Please see the documentation for possible plot values.
Furthermore, it allows plotting on the diagonal by setting the diag parameter to a plot


Figure 3.52: Faceting data per clarity and cut using facet_grid

> ggplot(data=diamonds, mapping=aes(x=carat, y=price, color=clarity)) + geom_point() +
    facet_grid(facets=clarity~cut) + theme(legend.position="none") +
    scale_color_manual(values=rep(x=c("red","blue"), times=4))


value.
Figure 3.53: Generating matrix plots

> library(GGally)
> ggpairs(data=diamonds, columns=c("carat","cut","price"))

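A minimal sketch of the upper and lower parameters (the plot type names below follow GGally's conventions; treat the specific values as an assumption and consult the package documentation for the full list):

> # a sketch: smoothed scatter plots below the diagonal and correlation
> # coefficients above it for the continuous pairs (slow on the full dataset)
> library(GGally)
> ggpairs(data=diamonds, columns=c("carat", "depth", "price"),
         upper=list(continuous="cor"), lower=list(continuous="smooth"))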

3.2.10 Themes
The ggplot2 theme object allows modifying the appearance of almost all non-data elements
plotted on a graphics. Since it supports a huge number of parameters we cannot include all
of them in this section. Interested readers are directed to the documentation of the theme
function at the ggplot2 website. ggplot2 provides two built-in themes, namely theme_grey,
used by default, and theme_bw, a theme with a white background.
The following figure shows how to change the theme to black and white as well as change
the font size and family.

3.2.11 Spatial Data Visualization


ggmap is a package for integrating the visualization of spatial data, i.e. static maps, with the
grammar of graphics of ggplot. The basic mechanism is to download a map image; plot it as
a ggplot2 context layer; and use ggplot2 geometric objects and statistic transformations
to add new layers on top of the context layer. In addition to map visualization, ggmap
provides several functions for acquiring map data from different resources (Google Map,
OpenStreetMaps, Stamen Maps, or Cloudmade Maps) or geocoding an address, zip or point
of interest into their map coordinates.


Figure 3.54: Customizing the theme of a graphics

> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) + geom_point() +
    theme_bw(base_size = 16, base_family = "mono")
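Individual non-data elements can also be adjusted through the theme function itself; a minimal sketch (the specific elements tweaked below are our own choices):

> # a sketch: bold axis titles, no legend, and no minor grid lines
> ggplot(data=diamonds, mapping=aes(x=carat, y=price)) + geom_point() + theme_bw() +
    theme(axis.title=element_text(size=14, face="bold"),
          legend.position="none",
          panel.grid.minor=element_blank())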


ggmap divides the spatial visualization process into three steps: (i) downloading and
formatting the map images using the get_map function; (ii) plotting the map as a context layer
using the ggmap function; and (iii) adding additional layers using geom and/or stat objects
of ggplot2. Note that ggmap also provides the qmap function for quick plotting, similar to
ggplot2's qplot.

3.2.11.1 Visualizing Physical and Thematic Maps


A physical map shows the physical landscape properties of a geographic place. A thematic
map on the other hand, focuses on a particular theme such as rivers, roads, or water re-
sources. ggmap provides several functions for downloading physical and thematic maps as
images from different sources. get_googlemap, get_openstreetmap, get_stamenmap, and
get_cloudmademap are used to download maps from the sources Google Map, OpenStreetMaps,
Stamen Maps, or Cloudmade Maps, respectively. Moreover, ggmap provides a wrapper
function get_map around those specific functions to facilitate map downloading. The most
important parameters of the function get_map are source, location, zoom, and maptype. Pa-
rameter source is used to set the source to download the map. Possible values are Google
Maps (’google’), OpenStreetMap (’osm’), Stamen Maps (’stamen’), and CloudMade Maps
(’cloudmade’). Parameter location is the most important parameter of the function along
with the parameter zoom. Using location one can set the longitude and latitude of the center
of the map and using zoom one can implicitly specify the spatial extent around the center.
Parameter zoom takes a value between 0 and 20 where roughly 3 is continent scale, 10 is city
scale and 20 is building scale. Although providing a longitude and latitude pair as a value to
set the location parameter is ideal, it is impractical for most cases. Therefore get_map func-
tion allows using a character string address, zip code, or point of interest to set the location
parameter. In this case get_map uses geocode function to convert the character string into
its longitude and latitude values and uses Google Map services to determine its bounding
box. Parameter maptype is used to set the type of the map. However, maptype depends
on the source used. Possible values for Google Maps are ’terrain’, ’satellite’, ’roadmap’,
and 'hybrid'. OpenStreetMap supports only a single default map type. Although Cloudmade
Maps requires registration, it provides several thematic maps focusing on different themes.
Since 2018, Google Maps as well as any ggmap function that uses a Google Maps service
requires a registered API key. Everyone needs to obtain his/her API key and enable Google
Maps services. Given that you have a google account, you can read the description of the
register_google function of package ggmap to learn about the details of the registration
process.

Figure 3.55: Registering your Google Maps API key

> library(ggmap)
> ?register_google
>
> # After reading the description and obtaining your API key through your Google account
> register_google(key="PUT YOUR API KEY HERE")

The register_google function registers your key for your current session. That is, each


time you start a new session, you need to register your key to be able to use the Google Maps
services. Alternatively, you can set the write parameter of the register_google function to
TRUE to make key registration permanent. Since your key is a private key associated with
your account, do not share your key when you share your code for any purposes, including
assignments and projects.
The get_map function simply returns a raster object representing the map as a matrix
of hexadecimal colors. The following terminal partially shows the raster object returned by
the get_map function.

Figure 3.56: Acquiring the map of the city Lafayette, LA

> laff <- get_map(location="Lafayette, LA", zoom=12, source="osm")


> head(laff)
[1] "#C1C1C0" "#C2C2C1" "#C2C2C1" "#C2C2C1" "#C2C2C1" "#C2C1C1"

The function used to visualize maps as a ggplot2 context layer is the ggmap function of
the ggmap package. The map image used by ggmap is possibly obtained using the get_map
function. The following terminal demonstrates the use of the ggmap function to visualize
the raster object obtained in Terminal 3.56. ggmap, by default, fixes the x axis of the graphics
to longitude, the y axis of the graphics to latitude, and the coordinate system to the Mercator
projection.
Another parameter of the ggmap function is extent which takes values ’normal’, ’device’,
and ’panel’. ’normal’ shows the blank ggplot panel behind the map. ’panel’ eliminates the
blank background panel and only shows the longitude and latitude axes. Finally, ’device’
removes both x and y axes leaving only the map itself. The following terminal shows the
hybrid map of the University of Louisiana campus obtained from Google Maps with extent
set to ’device’.

3.2.11.2 Visualizing Administrative Maps


Administrative maps focus on the administrative/political boundaries (regional or na-
tional) of geographic places rather than their topographic features. R allows working with
administrative region maps at various levels including country, state, county, city, and zip
code. ESRI shape file (shapefile) is a common format for storing geometric locations and
attributes which is also supported by GIS. As a matter of fact a shapefile is a collection of
files rather than a single file. Before visualizing a map involving administrative regions, one
needs to find and download the shapefile of interest. United States Census Bureau, TIGER
project1 provides shapefiles for US administrative regions at various levels.
It requires two steps to visualize a shapefile using ggmap: (i) converting the shapefile
into a Spatial*DataFrame object and (ii) converting the Spatial*DataFrame object into
a data frame.
R requires the shapefile to be converted into a Spatial*DataFrame object. The Spatial*DataFrame
classes are SpatialPointsDataFrame, SpatialLinesDataFrame, and SpatialPolygonsDataFrame.
The readShapeSpatial function of the maptools package automatically converts the shapefile
into the proper Spatial*DataFrame object based on its shape type. However, the readShapeSpatial
function requires two important parameters: fn denoting the path to the shapefile and proj4string specifying
1 http://www.census.gov/geo/maps-data/data/tiger-line.html


Figure 3.57: Visualizing the Lafayette map obtained in Terminal 3.56

> ggmap(ggmap=laff)


Figure 3.58: University of Louisiana hybrid campus map

> UL <- get_map(location="University of Louisiana", zoom=17, source="google",
    maptype="hybrid")
> ggmap(ggmap=UL, extent="device")


Coordinate Reference System (CRS) arguments. Furthermore, the notation for the proj4string
parameter requires special attention.

• proj4string arguments are provided using CRS, an interface class to the PROJ.4 pro-
jection system.
• proj4string arguments are provided in the form of +arg1=val1 +arg2=val2 ...

The resulting Spatial*DataFrame object is a complex data structure consisting of mul-
tiple parts called slots. Table 3.5 shows the slots of a Spatial*DataFrame object. Slots
are accessed using the @ operator in the form of 'object@slot'. One of the most useful slots is
the data slot. It is strongly suggested to use R's str function to see the structure of the
data slot of a Spatial*DataFrame object before processing it further. Remember that the
data slot is of type data frame.

Table 3.5: Spatial*DataFrame slots

Slot Data Type Description

data data frame Collection of attributes
coords matrix Collection of point coordinates (only for SpatialPointsDataFrame)
lines list Collection of line objects (only for SpatialLinesDataFrame)
polygons list Collection of polygon objects (only for SpatialPolygonsDataFrame)
plotOrder integer Integer array giving the order in which objects should be plotted
bbox matrix Bounding box coordinates
proj4string CRS Coordinate reference system arguments

As a case study we downloaded US states shapefile from the website of the United
States Census Bureau, TIGER project. As demonstrated in Figure 3.59 we converted the
shape file using the readShapeSpatial function of the package maptools. For our pur-
poses we set proj4string as ’proj4string=CRS("+proj=longlat +datum=WGS84")’ Argu-
ment +proj=longlat specifies that the projection to be used is longitude/latitude projec-
tion. Argument +datum=WGS84 specifies that the datum to be used is World Geodetic
System 1984 standard which is a global coordinate system for cartography, geodesy, and
navigation used by GPS (Global Positioning System). Checking the class of the resulting
object (’US.shape’) shows that it is a SpatialPolygonsDataFrame object. We also used R’s
str function to compactly view the structure of the ’US.shape’ object. Please notice that
variables NAME and STUSPS hold the state names and abbreviations, respectively.
The second step to visualize a shapefile using ggmap is converting the Spatial*DataFrame
object into a data frame. Remember that ggmap requires the data being plotted to be a data
frame. The ggplot2 function fortify, a generic function used to convert various objects
into R data frames, is used to convert Spatial*DataFrames as well.
In Figure 3.60 we convert the Spatial*DataFrame object obtained in Figure 3.59 into a
data frame. The figure also shows the first 6 lines of the ’US.shape.df’ data frame using the
head function. Next, we plot it as a polygon geom on top of a blank ggplot2 layer obtained
by calling the ggplot function without parameters. The aes mapping sets the x axis to
longitude, y axis to latitude and group parameter to the group variable of the data frame.


Figure 3.59: Reading in US states shapefile

> US.shape <- readShapeSpatial(fn="./tl_2013_us_state/tl_2013_us_state.shp",
    proj4string = CRS("+proj=longlat +datum=WGS84"))
> class(US.shape)
[1] "SpatialPolygonsDataFrame"
> str(US.shape@data)
’data.frame’: 56 obs. of 14 variables:
$ REGION : Factor w/ 5 levels "1","2","3","4",..: 3 3 2 2 3 1 4 1 3 1 ...
$ DIVISION: Factor w/ 10 levels "0","1","2","3",..: 6 6 4 5 6 2 9 2 6 2 ...
$ STATEFP : Factor w/ 56 levels "01","02","04",..: 49 10 14 24 21 40 13 30 34 46
...
$ STATENS : Factor w/ 56 levels "00068085","00294478",..: 47 2 28 6 19 13 27 37 9
44 ...
$ GEOID : Factor w/ 56 levels "01","02","04",..: 49 10 14 24 21 40 13 30 34 46
...
$ STUSPS : Factor w/ 56 levels "AK","AL","AR",..: 55 11 17 26 23 44 16 34 31 52
...
$ NAME : Factor w/ 56 levels "Alabama","Alaska",..: 54 12 17 27 24 44 16 33 37
51 ...
$ LSAD : Factor w/ 1 level "00": 1 1 1 1 1 1 1 1 1 1 ...
$ MTFCC : Factor w/ 1 level "G4000": 1 1 1 1 1 1 1 1 1 1 ...
$ FUNCSTAT: Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
$ ALAND : num 6.23e+10 1.39e+11 1.44e+11 2.06e+11 2.51e+10 ...
$ AWATER : num 4.89e+08 3.14e+10 6.20e+09 1.89e+10 6.99e+09 ...
$ INTPTLAT: Factor w/ 56 levels "+13.4382961",..: 25 7 33 52 27 39 47 44 17 46 ...
$ INTPTLON: Factor w/ 56 levels "-064.9712501",..: 17 19 27 34 12 5 47 6 16 7 ...
- attr(*, "data_types")= chr "C" "C" "C" "C" ...


Figure 3.60: US map of states

> US.shape.df <- fortify(US.shape)


> head(US.shape.df)
long lat order hole piece group id
1 -81.04429 39.53661 1 FALSE 1 0.1 0
2 -81.04381 39.53694 2 FALSE 1 0.1 0
3 -81.04323 39.53725 3 FALSE 1 0.1 0
4 -81.04249 39.53764 4 FALSE 1 0.1 0
5 -81.04102 39.53842 5 FALSE 1 0.1 0
6 -81.03939 39.53945 6 FALSE 1 0.1 0
>
> ggplot() + geom_polygon(data=US.shape.df, mapping=aes(x=long, y=lat, group=group),
    color="black", fill="white", size=0.2) +
    scale_x_continuous(limits=c(-130,-65)) + scale_y_continuous(limits=c(24,50))


We set the x and y axes limits to (-130,-65) and (24,50) to have a better looking mainland
map. However, limiting the axes caused some states and regions, e.g., Alaska and Puerto
Rico, to be invisible on the map.

3.2.11.3 Visualizing Data on top of Maps


The ggmap function returns a ggplot object. Hence, the returned object can be used as a ggplot
layer along with other ggplot objects. In this part we generated a lightly
cleaned example dataset using the data provided by the United States Census Bureau2.
The original dataset consists of the 2012 estimated population, income per capita (in dollars),
and median house price (in dollars) per zip code that is partially or completely located in
the state of Louisiana. The dataset contains 516 zip code entries. Furthermore, we used
the ggmap geocode function to obtain the longitude and latitude values of the zip codes in our
dataset. The final dataset (LAzipLL.csv) is augmented with the longitude and latitude
values. The dataset does not reflect the estimated values for the current year. Yet, it is
perfectly valid to use it for illustration purposes; besides, one can always obtain the up-to-
date data from the United States Census Bureau web site.
Figure 3.61 shows the procedure to read-in the dataset, download the map of Louisiana
for visualization, and glimpse at the dataset.

Figure 3.61: Reading in the file LAzipLL.csv and obtain the Louisiana map

> LAdata <- read.csv(file="LAzipLL.csv", header=TRUE, comment.char="#")
> LAdata$zip <- as.factor(LAdata$zip)
> LAmap <- ggmap(ggmap=get_map("Louisiana, USA", source="google", zoom=7, color="bw"))
> head(x=LAdata, n=5)
zip population income house lon lat
1 70001 38019 29413 208400 -90.16354 29.97973
2 70002 18657 36131 260400 -90.16354 30.01442
3 70003 39449 27885 178300 -90.21509 30.00856
4 70005 24552 41271 251100 -90.13546 30.00289
5 70006 16355 29195 236500 -90.19165 30.04905

Figure 3.62 shows the population scatter plot of Louisiana per zip code location. LAmap,
obtained in Figure 3.61, is a ggplot layer representing the map of Louisiana as an image.
geom_point simply adds a new layer on top of the LAmap layer. The aesthetic mappings for
the x and y axes should be fixed to longitude (lon) and latitude (lat), respectively. Furthermore,
the size aesthetic is used to map the population of each zip code to a point size in the scatter plot.
In Figure 3.63 we show a heat map of the population density of Louisiana. Note that
the data is divided into two-dimensional bins rather than the administrative boundaries of the zip
codes. stat_summary2d allows using a value as the third dimension (z dimension) rather
than the count of the observations falling into a bin. Furthermore, the fun parameter of
stat_summary2d accepts any function to process the values given as the z dimension. In
the example we use the built-in function sum to tell stat_summary2d to sum the z values
falling into each bin.
2 http://factfinder2.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t


Figure 3.62: Population map of Louisiana

> LAmap + geom_point(data=LAdata, mapping=aes(x=lon, y=lat, size=population), color=
"red")

[Plot output of Figure 3.62: red points at zip code locations on the Louisiana map; point size legend: population; x-axis: lon, y-axis: lat]


Figure 3.63: Population heat map of Louisiana

> LAmap + stat_summary2d(data=LAdata, mapping=aes(x=lon, y=lat, z=population), alpha
=0.5, fun=sum) + scale_fill_gradient(low="yellow", high="red")

[Plot output of Figure 3.63: binned population heat map over the Louisiana map; fill legend: value (yellow to red); x-axis: lon, y-axis: lat]


Next, we want to use stat_density2d to estimate and plot the density graphics of
the population. However, the kde2d kernel density estimator does not support a z axis for
the population values, nor does it accept a weight aesthetic as geom_density does. That is, it
simply counts the observations falling into a particular two-dimensional bin. Hence, we are
going to use a trick that works for our purposes. Specifically, we create a
new dataset that repeats each row of the original dataset as many times as its population value.
Next, we plot the new dataset using stat_density2d. Setting the guide parameter of scale_alpha
to FALSE removes the transparency legend from the final graphics. The process is demonstrated
in Figure 3.64. Note that, due to the number of points in the repeated dataset, density
estimation may take several minutes on your system.

Figure 3.64: Population density map of Louisiana

> LAdataR <- LAdata[rep(1:nrow(LAdata), LAdata$population), ]


> LAmap + stat_density2d(data=LAdataR, mapping=aes(x=lon, y=lat, alpha=..level..,
fill=..level..), geom="polygon", bins=150) + scale_fill_gradient(low = "yellow"
, high = "red") + scale_alpha(guide=FALSE)

[Plot output of Figure 3.64: population density contours over the Louisiana map; legend: level; x-axis: lon, y-axis: lat]

Although the zip codes correspond to administrative regions, we used the geocode function
to represent them as longitude and latitude points and used these points in the previous
graphics. Alternatively, we can obtain the boundaries as a zip code tabulation area shapefile
from the US Census Bureau and depict them as administrative regions. Figure 3.65 shows
how to read in the shapefile after downloading it, together with the structured summary of the
resulting SpatialPolygonsDataFrame object. The documentation of the shapefile states that
the ZCTA5CE10 variable represents the zip codes.


Figure 3.65: Reading in the zip codes shapefile

> shape <- readShapeSpatial(fn="./tl_2013_us_zcta510/tl_2013_us_zcta510.shp",
proj4string = CRS("+proj=longlat +datum=WGS84"))
> str(shape@data)
’data.frame’: 33144 obs. of 9 variables:
$ ZCTA5CE10 : Factor w/ 33144 levels "00601","00602",..: 13960 13961 13962 13963
13964 13965 13966 13967 13968 13969 ...
$ GEOID10 : Factor w/ 33144 levels "00601","00602",..: 13960 13961 13962 13963
13964 13965 13966 13967 13968 13969 ...
$ CLASSFP10 : Factor w/ 1 level "B5": 1 1 1 1 1 1 1 1 1 1 ...
$ MTFCC10 : Factor w/ 1 level "G6350": 1 1 1 1 1 1 1 1 1 1 ...
$ FUNCSTAT10: Factor w/ 1 level "S": 1 1 1 1 1 1 1 1 1 1 ...
$ ALAND10 : num 6.34e+07 1.22e+08 9.39e+06 4.80e+07 2.57e+06 ...
$ AWATER10 : num 157689 13437379 999166 0 39915 ...
$ INTPTLAT10: Factor w/ 33141 levels "+13.2603724",..: 22511 23066 23453 22366
23113 23330 22401 23041 22750 23215 ...
$ INTPTLON10: Factor w/ 33144 levels "-064.6829328",..: 12183 11482 11324 11978
11755 12112 12293 12059 11417 12055 ...
- attr(*, "data_types")= chr "C" "C" "C" "C" ...

In Figure 3.66 we first read the ’LAzipLL.csv’ dataset containing the zip codes, popula-
tion, income per capita, and median housing prices along with the longitude and latitude
values computed by the geocode function. Since we need to work with the zip codes be-
longing to the state of Louisiana, we subsetted the ’shape’ object to only those zip codes
(’LA.shape’) using the %in% operator. Note that the class of ’LA.shape’ is
SpatialPolygonsDataFrame. Next, we used the fortify function to convert ’LA.shape’ into
a data frame. At this point the resulting data frame (’LA.shape.df’) does not include all the
fields in the data slot of ’LA.shape’. We merged the data frame (’LA.shape.df’) with the data
slot of ’LA.shape’ to include the variables in the data slot. Finally, we merged the first four
columns of our dataset (’LAdata’) with the ’LA.shape.df’ data frame. Here we omitted the
geocode-computed longitude and latitude values of ’LAdata’ because ’LA.shape.df’ already
has those values.

Figure 3.66: Preprocessing US zip codes shapefile

> LAdata <- read.csv(file="LAzipLL.csv", header=TRUE, comment.char="#")


> LAdata$zip <- as.factor(LAdata$zip)
>
> LA.shape <- shape[shape@data$ZCTA5CE10 %in% LAdata$zip, ]
> LA.shape.df <-fortify(LA.shape)
> LA.shape.df <- merge(x=LA.shape.df, y=LA.shape@data, by.x="id",by.y="row.names")
> LA.shape.df <- merge(x=LA.shape.df, y=LAdata[,1:4], by.x="ZCTA5CE10", by.y="zip")

After preparing the data, we used the geom_polygon geometric object to visualize the me-
dian house prices on top of the map image of Louisiana obtained by the get_map function,
as shown in Figure 3.67.


Figure 3.67: Median house price per zip code in Louisiana

> LAmap <- ggmap(ggmap=get_map("Louisiana, USA", source="google", zoom=7, color="bw"
))
> LAmap + geom_polygon(data=LA.shape.df, mapping=aes(x=long, y=lat, group=group,
fill=house), color="black", size=0.2) + scale_fill_gradient(low = "yellow",
high = "red")

[Plot output of Figure 3.67: zip code polygons over the Louisiana map filled by median house price; fill legend: house (yellow to red); x-axis: lon, y-axis: lat]


3.2.11.4 ggmap Utility Functions


ggmap comes with a set of utility functions for analyzing and transforming spatial data; a brief usage sketch follows the descriptions below.

geocode
Essentially, it pulls the approximate longitude and latitude values of an address, place,
or zip code using the Google Geocoding API. The location parameter accepts a location
as a character string, and the function returns a data frame consisting of the longitude and
latitude values of the location. Note that more information can be pulled by setting the output
parameter to more.
revgeocode
The function reverse geocodes, i.e., it pulls the approximate address of a given longitude and
latitude pair supplied in numeric format. It uses the Google Geocoding API to conduct the
transformation.

mapdist
The function uses the Google Distance Matrix API to calculate the distances and durations
between any two locations based on Google-calculated routes for driving, bicycling, or
walking.

route
The function uses the Google Maps API to determine the routes between any two locations.
A route consists of a sequence of legs, where each leg has beginning/ending longitude
and latitude pairs as well as distance and duration values. One can use geom_leg to
show the route between two locations on a map.
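The lines below sketch typical calls to these four functions. The place names and coordinates
are arbitrary examples rather than values taken from the text, and all four calls contact Google
web services, so they need an internet connection and are subject to the relevant API query
limits.

> # geocode: place name to approximate lon/lat (set output="more" for extra fields)
> geocode(location="Baton Rouge, LA, USA")
> # revgeocode: numeric c(lon, lat) pair to an approximate address
> revgeocode(location=c(-91.18, 30.45))
> # mapdist: distance and duration between two locations for a given travel mode
> mapdist(from="New Orleans, LA", to="Baton Rouge, LA", mode="driving")
> # route: legs of a Google-computed route, ready to be drawn with geom_leg
> route(from="New Orleans, LA", to="Baton Rouge, LA", mode="driving")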

3.3 Exercises
1. In the following you are going to analyze the “Caschool (The California Test Score)”
dataset that comes with the R package Ecdat. In order to access the data set you
first need to install the package Ecdat. Once you install the package and load it, the
“Caschool” dataset will be available to you and you can load the data set using the
data function.

(a) Briefly describe what the dataset is about.


(b) How many variables are there in the data set?
(c) How many observations/instances are there in the data set?
(d) For each variable in the data set, provide brief information about it and indicate
the type of the variable, e.g., numeric, factor, character string etc. Hint: You
may use the str function.
(e) The variable district denotes the school names. Show that the number of
unique school names is not equal to the number of rows.
(f) Pertaining to question 1e, determine and print the school names that appear
more than once. Do you think there is a duplication of records in the dataset,
or is there some other explanation for the school names that appear more than
once?


(g) The type of the variable district, denoting the school names, should be a char-
acter string. If the type is not a character string, convert the variable type in the
data frame, and verify and print that your conversion took place using the class
function.
(h) Determine and print how many observations there are in the data set
per county.
(i) Determine and print how many observations there are in the data set
per grade span (grspan).
(j) Use ggplot2 to show a barplot of the variable county. Briefly explain what you
see in the plot. Use the theme function to rotate and space the x-axis labels
properly.
(k) What are the mean, median, maximum, minimum, and quartile values for the
variable expnstu?
(l) Use ggplot2 to show a histogram of the variable expnstu. Does the histogram
look like a normal distribution? Use the Shapiro–Wilk test to test for normality
and interpret the p-value.
(m) Use ggplot2 to show a box-whisker plot of the variable expnstu with respect to
the variable grspan. In addition to the median value show the mean value on
the same plot. Compare the box-whisker plots and interpret the differences.
(n) Pertaining to question 1m, use ggplot2 to print the name of the outlier school
in the KK-6 category next to its point representation in the graphics.
(o) Use ggplot2 to plot the scatter plot of the variables mathscr and readscr and
interpret the results.
(p) Use ggplot2 to plot the scatter plot of the variables avginc and testscr along
with their contour plot and interpret the results.
(q) Use ggplot2 to plot a heatmap of the variables avginc and expnstu employing
the rainbow colors and interpret the results.
2. In the following you are going to continue to analyze the “Caschool (The California
Test Score)” dataset that comes with the R package Ecdat. In order to access the
data set you first need to install the package Ecdat. Once you install the package and
load it, the “Caschool” dataset will be available to you and you can load the data set
using the data function.
(a) Carefully read the documentation of the quantile function and use it to print
the first (0.25), second (0.50) and third (0.75) quartiles of the variable mathscr.
Please set the names parameter to FALSE to strip the names of the quartiles.
(b) Carefully read the documentation of the quantile function and use it to print the
maximum and minimum values along with the first, second and third quartiles
of the variable mathscr. Please set the names parameter to FALSE to strip the
names.
(c) First, carefully read the documentation of the cut function. Then, use the cut
function to create a categorical vector named “mathscrL” from the continuous
variable mathscr of the dataset such that all values between the minimum and the
first quartile are labeled as M4 ; between the first and second quartile are labeled


as M3 ; between the second and third quartile are labeled as M2 ; and between the
third quartile and maximum are labeled as M1. It is important to note that you
need to use the quantile function while setting the breaks for the cut function.
Use the labels parameter for setting the category labels (M4, M3, M2 and M1).
Moreover, set the right parameter to FALSE and ordered_result parameter to
TRUE to control the range boundaries and factor ordering, respectively. Also
note that you may need to assign the category label of the maximum mathscr
value manually, if it violates the range and assumes an NA value. You may use
is.na, any and which functions to locate the NA values, if any exist. Lastly, append
the “mathscrL” vector to the Caschool dataset as a new column. Use the str
function on your data frame to show that the new column is appended.
(d) First, carefully read the documentation of the cut function. Then, use the cut
function to create a categorical vector named “readscrL” from the continuous
variable readscr of the dataset such that all values between the minimum and the
first quartile are labeled as R4 ; between the first and second quartile are labeled
as R3 ; between the second and third quartile are labeled as R2 ; and between the
third quartile and maximum are labeled as R1. It is important to note that you
need to use the quantile function while setting the breaks for the cut function.
Use the labels parameter for setting the category labels (R4, R3, R2 and R1).
Moreover, set the right parameter to FALSE and ordered_result parameter to
TRUE to control the range boundaries and factor ordering, respectively. Also
note that you may need to assign the category label of the maximum readscr
value manually, if it violates the range and assumes an NA value. You may use
is.na, any and which functions to locate the NA values, if any exist. Lastly, append
the “readscrL” vector to the Caschool dataset as a new column. Use the str
function on your data frame to show that the new column is appended.
(e) Use facet_wrap to visualize and interpret how variables expnstu and elpct
change for each category of the variable mathscrL created in question 2c.
(f) Use facet_grid to visualize and interpret how variables expnstu and elpct
change for the pairs of categories of the variables mathscrL and readscrL, to-
gether. Note that the variables mathscrL and readscrL are created in ques-
tions 2c and 2d.
(g) Use ggmap to visualize the density map of the enrollment totals of the counties in
the dataset on a physical map of California. Please use Figure 3.64 as your refer-
ence guide. Note that you may need to create a new dataset that aggregates the
total enrollments of each county in California, because the dataset has multiple
schools per county. There are multiple ways to aggregate data in R; one approach
is to use the aggregate function along with the sum function as an argument.
(h) Use ggmap to visualize the map of the number of the total students qualifying for
the reduced-price lunch per county on an administrative map of California. Please
use Figure 3.67 as your reference guide. Note that you may need to create a new
dataset that aggregates the number of students qualifying for the reduced-price
lunch using the percentage in mealpct and the number of students in enrltot
variables per district for each county. There are multiple ways to aggregate data
in R; one approach is to use the aggregate function along with the sum function
as an argument.

Appendices

Appendix A

R Code Examples

A.1 Vector Operations


Figure A.1: Basic vector operations in linear algebra

> # Cosine similarity between two column vectors: (x . y) / (||x|| * ||y||)
> cosine.sim <- function(x,y){
if(is.matrix(x)==FALSE){
x <- as.matrix(x)
}
if(is.matrix(y)==FALSE){
y <- as.matrix(y)
}
x.dot.y <- as.numeric(t(x) %*% y)
x.norm <- as.numeric(sqrt(t(x) %*% x))
y.norm <- as.numeric(sqrt(t(y) %*% y))

cs <- x.dot.y / (x.norm * y.norm)


return (cs)
}
>
> # Scalar projection of x onto y: (x . y) / ||y||
> x.scalar.proj.y <- function(x,y){
if(is.matrix(x)==FALSE){
x <- as.matrix(x)
}
if(is.matrix(y)==FALSE){
y <- as.matrix(y)
}

x.dot.y <- as.numeric(t(x) %*% y)


y.norm <- as.numeric(sqrt(t(y) %*% y))
sp <- as.numeric(x.dot.y / y.norm)
return (sp)
}
>
> # Vector projection of x onto y: the scalar projection times the unit vector y/||y||
> x.vector.proj.y <- function(x,y){
if(is.matrix(x)==FALSE){
x <- as.matrix(x)
}
if(is.matrix(y)==FALSE){
y <- as.matrix(y)
}
x.dot.y <- as.numeric(t(x) %*% y)
y.norm <- as.numeric(sqrt(t(y) %*% y))
sp <- x.dot.y / y.norm
y.unit <- y/y.norm
vp <- sp*y.unit
return(vp)
}
>
> u <- as.matrix(c(1,3,5,7))
> v <- as.matrix(c(2,4,6,8))
> cosine.sim(u,v)
[1] 0.9960238
> x.scalar.proj.y(u,v)
[1] 9.128709
> x.vector.proj.y(u,v)
[,1]
[1,] 1.666667
[2,] 3.333333
[3,] 5.000000
[4,] 6.666667
