EDAV
Materials in this course are protected by the United States copyright law
[Title 17, U.S. Code]. No parts of this draft handout may be reproduced,
shared or distributed in print or digitally without permission from the author.
Chapter 1
R is a software environment, language and a framework for statistical computing and graph-
ics. It is available as a free software under the terms of the GNU General Public License.
The R software environment is an integrated suite of tools for data manipulation, calcula-
tion, analysis and graphics. The R language is a true programming language for extending
the capabilities of the software environment through additional functions. Hence, all ele-
ments of a full featured procedural programming language such as variables, conditionals,
loops and functions are supported by the R language. Furthermore, R supports binding C,
C++ and Fortran code especially for computationally heavy tasks. The software is available
in different formats for multiple platforms at the official R project web site (http://www.r-project.org/). Finally, R is
a software framework consisting of thousands of packages. These packages are libraries (or
directories) containing groups of functions and/or data sets. The R software environment
comes with a small set of core packages which includes the package base. The base package
contains functions for the basic operators and data types and support for R as a program-
ming language. Thousands of packages for various tasks are available at Comprehensive R
Archive Network (CRAN, http://cran.r-project.org/).
1.1 Basics of R
Once R is installed on your computer you need to launch the program by either double
clicking the executable or typing “R” in the console. Launching the program loads the R
software environment along with a small set of packages (including the base package) and
brings up the R interactive shell, or R terminal. The terminal allows a user to interact with
the R environment by entering R instructions and displaying the results of those instructions.
The terminal displays a “>” prompt which simply indicates that R is waiting for the user’s
next instruction.
Although the interactive R terminal and a simple text editor are enough to develop and ex-
ecute R code, programmers often depend on Integrated Development Environments (IDEs)
for code development, execution and deployment. IDEs are software applications which
1 http://www.r-project.org/
2 http://cran.r-project.org/
come with a suite of development tools to support editing, compiling, interpreting, debug-
ging, code completion, code refactoring, code execution and version control. Typically,
IDEs support one or more languages. RStudio is a professional IDE tailored for R and
widely used in the industry. It has free and paid versions available on the official web site (https://www.rstudio.com/).
Since RStudio depends on R, it is recommended to first install R and then RStudio.
This book contains more than a hundred R console figures presenting R code in action.
It is highly recommended to experiment with the code by actively typing it on your own
console to reinforce your learning. Another way to reinforce your learning is to learn from
your errors. As in other languages, the R interpreter prints an error message for syntax
errors. Often, googling the error message will help you obtain more information about the
error as well as learn how to fix it in your code. In addition to error messages, R
generates warning messages which notify the user of a condition. Unlike error messages,
warning messages do not halt the code execution.
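For instance (a minimal sketch; the object name below is hypothetical), referring to an undefined object triggers an error that stops the instruction, while sqrt applied to a negative number only triggers a warning:
> myUndefinedObject
Error: object 'myUndefinedObject' not found
> sqrt(-1)
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced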
> 1004 + 23
[1] 1027
Figure 1.1 simply shows that R evaluates a given expression and displays the result on
the next line. Note that the term “[1]” preceding the result in Figure 1.1 simply tells that
“1027” is the first element displayed on its line in the shell. In case there are multiple
elements displayed on multiple lines, R starts each line with the index of the first element
displayed on that line. This scheme makes it easy to track the elements of large vectors or
matrices. We will revisit this scheme in Section AAA again.
Table 1.1: Arithmetic operators supported by R
Addition +
Subtraction -
Multiplication *
Division /
Power ^
Modulo %%
Integer Division %/%
3 https://www.rstudio.com/
Table 1.1 shows the arithmetic operators supported by R. Figure 1.2 shows arithmetic
expressions evaluated by R. Note that R follows the conventional order of operators where
the exponent operator has the highest precedence; followed by the multiplication and di-
vision operators; and followed by the addition and subtraction operators. The operators
having the same level of precedence are evaluated from left to right. Finally, R allows the
use of parentheses in expressions to rearrange operator precedence and avoid ambiguity.
> 17/4
[1] 4.25
> 6*5+14
[1] 44
> 6*(5+14)
[1] 114
> 2^9
[1] 512
> (2^3+5*9)/(4+1)
[1] 10.6
> 2^3+5*9/4+1
[1] 20.25
A logical (boolean) value is a quality intended to represent the truth of a logic statement
using two levels namely, true and false. One can use keywords TRUE and FALSE or T and F to
represent logical values true and false, respectively. A logical expression is a mathematical
expression built from logical values and logical operators which itself evaluates into a logical
value. Table 1.2 shows the list of the logical operators supported by R.
Table 1.2: Logical operators supported by R
Negation !
Conjunction (And) &&
Elementwise Conjunction (And) &
Disjunction (Or) ||
Elementwise Disjunction (Or) |
Exclusive Disjunction (Xor) xor(...)
> T & F
[1] FALSE
> TRUE & FALSE
[1] FALSE
> TRUE | FALSE
[1] TRUE
> !FALSE
[1] TRUE
> (TRUE | FALSE) & (TRUE | TRUE)
[1] TRUE
> xor(TRUE, TRUE)
[1] FALSE
Equal to (=) ==
Not equal to (≠) !=
Less than (<) <
Greater than (>) >
Less than or equal (≤) <=
Greater than or equal (≥) >=
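As a quick, hedged illustration of these relational operators on simple values (not taken from a figure in the text):
> 3 < 5
[1] TRUE
> 10 != 10
[1] FALSE
> 7 >= 7
[1] TRUE
> "apple" < "banana"    # character strings are compared in dictionary order
[1] TRUE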
Figure 1.5 demonstrates the use of the assignment operator. After the execution of the
assignment instruction nothing is displayed on R’s console. Although it seems nothing has
happened, many things happened behind the scenes. First of all, the R environment parsed
the entire instruction and divided it into left-hand side and right-hand side terms according
to the assignment operator, <-. Secondly, it evaluated the expression on the right-hand
side, 21 + 2*2 - 6/2, until the entire expression was reduced to a single value, 22. Thirdly,
it allocated a piece of memory and stored the evaluated value, 22, in the allocated memory.
Finally, it used the symbolic name on the left-hand side, myage, to label the allocated
memory for future reference.
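Based on the description above, the assignment instruction in Figure 1.5 is along the lines of the following sketch:
> myage <- 21 + 2*2 - 6/2    # the right-hand side evaluates to 22; nothing is printed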
In R, a location in memory having a value and referenced by an identifier is called an
object. As a matter of fact, any simple or composite information stored in the memory and
referenced by an identifier is called an object in R.
When we create an object we associate it with an identifier. We use the identifier to
access the value of the object later. Figure 1.6 shows that the R environment displays the
value of an object when the object’s identifier (or symbolic name) is entered as an instruction.
One needs to create an object by using the assignment operator before accessing the object’s value.
5 Alternatively, R supports the equal sign = as an assignment operator.
6 R supports both forward (->) and backward (<-) arrows as the assignment operator, as long as the arrow
points to the identifier. However, the forward arrow is rarely used in practice and we prefer the backward
arrow in this text.
> myage
[1] 22
Furthermore, one can use previously created objects in mathematical expressions solely
for computations or for creating new objects. In case R encounters identifiers representing
objects in the memory within an expression, it replaces the identifiers by retrieving the
values stored in the objects. Once all objects’ values in an expression have been retrieved, R
evaluates the expression as usual.
Figure 1.8 demonstrates examples of creating objects and using them in various ex-
pressions to create new objects or to do calculations. In the first example block, we created
two numeric objects, n1 and n2, set to 6.2 and 145, respectively. Then, we used these
numeric objects to do some calculations and even to create a new numeric object n3.
In the second example block of Figure 1.8 we created an object (celc) representing a
daily temperature in Celsius and set its value to 11. Then, we used the object celc in an
expression (celc*9/5+32) to create another object (fahr) for the Fahrenheit equivalent of
the same temperature. Finally, we printed the same temperature in Celsius and Fahrenheit.
In the third example block of the same Figure we built a logical expression consisting of
logical values, logical operators and a relational operator between objects celc and fahr
and stored the result in an object labeled switch.
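Based on the description above, the three example blocks of Figure 1.8 look roughly like the following sketch (the exact form of the logical expression in the third block is an assumption):
> n1 <- 6.2
> n2 <- 145
> n1 + n2
[1] 151.2
> n3 <- n1 * n2
> n3
[1] 899
>
> celc <- 11
> fahr <- celc*9/5 + 32
> fahr
[1] 51.8
>
> switch <- (celc < fahr) & TRUE
> switch
[1] TRUE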
So far we have created objects having numeric values. R fully supports character strings
to be stored in the memory and retrieved later as well. Character strings in R have to be
enclosed in either single or double quotation marks.
Figure 1.9 shows an example character string object in R. Usually, using double quota-
tions is preferred over single quotations in enclosing character strings. However, if a double
quotation appears as a symbol in the character string one can use the single quotations to
enclose the character string. Alternatively, one can escape the double quotation symbol(s)
appearing in the character string by preceding them with the backslash character, as in \",
as shown in Figure 1.9.
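A minimal sketch of the quoting and escaping rules described above (the example strings are hypothetical, not necessarily those in Figure 1.9):
> movie <- "Forrest Gump"
> quote1 <- 'She said "thank you"'     # single quotes enclose a double quote
> quote2 <- "She said \"thank you\""   # escaping the double quotes instead
> quote2
[1] "She said \"thank you\""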
In addition to the numeric data type, character strings and the logical data type, R allows special
values such as Inf for infinity, NA for not available (for missing values), NaN for not a number
and NULL for the null object (for missing objects). Figure 1.10 shows a use case for the
special values allowed in R.
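A short sketch of these special values in action (not necessarily the same use case as Figure 1.10):
> 1/0
[1] Inf
> -1/0
[1] -Inf
> 0/0
[1] NaN
> NA + 1            # operations on a missing value stay missing
[1] NA
> is.null(NULL)
[1] TRUE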
Finally, R fully supports complex numbers as well. However, we will not discuss complex
numbers in this text.
Please note that, not all operations are meaningful on all data types. For example, the
“division” operator is applied to the numeric data type; however, it is not meaningful on the
logical data type or character strings. The “logical or” operator is applied to the logical data
type; however, it is not meaningful on the numeric data type or character strings. The “less
than” operator is applied to character strings (strings have lexicographical or dictionary
order) and numeric data types; however, it is not meaningful on the logical data type.
Figure 1.11 shows the generic format for calling a function in R. To call or execute a
function, it is enough to know the name of the function and the list of the parameters that
a function expects. In general, functions, when called, take in some data; process the data;
and return the result of the computation. However, not all functions take in data nor do
they return a result. Some functions do not need any data to be provided to perform their
tasks when called. Some other functions perform their tasks and exit without returning
any result.
Figure 1.12 demonstrates the call of sqrt function which calculates and returns the
square root of a non-negative number. The function sqrt has only one parameter with
> sqrt(x=16)
[1] 4
>
> side <- sqrt(x=16)
> side
[1] 4
>
> pow = 4
> exp(x=pow+2)
[1] 403.4288
name x. The parameter x acts as a place holder for an actual value to be provided during
the function call. In the example the parameter is set to literal argument 16. In the first code
example (Figure 1.12) the return value of sqrt is not stored in any object hence, R simply
prints the result on the terminal. In the second code example (Figure 1.12) the return value
of sqrt function is stored in the object side to be used later. In the third code example
(Figure 1.12) the exp (natural exponent) function is called where an expression, “pow+2”,
is used to set the parameter x. Note that in case a parameter of a function is set to an
expression, R evaluates the expression and sets the parameter to the evaluated result before
executing the function. The third example first evaluates pow + 2, which is 4 + 2 = 6; then
it calls the function exp(x=6), which is equivalent to e^6. Notice the difference/similarity
between the “^” power operator and the exp function.
Not all functions are required to have parameters. A function may execute its task
without requiring the caller to pass any data. For example, the getwd function returns the
working directory, i.e., the current directory, of the R environment as shown in Figure 1.13.
Notice that calling a function always requires the parentheses even though the function
does not have any parameters in its definition. As a matter of fact, the parentheses after a
function name simply imply a function call. Entering a function name without parentheses
either displays the implementation (source code) of the function or prints information about
the function. The same Figure also shows how to call the corresponding function setwd to
change the working directory by setting the parameter dir to the new working directory.
So far we have presented functions that do not require a parameter or functions that require
a single parameter. However, the majority of functions that come with R require more than one
parameter to be set. In case a function requires more than one parameter, the parameters
are separated by a comma “,”.
In Figure 1.14 the call to the log function requires two parameters to be set, namely x and
base. The function log(x=24, base=3) evaluates log_3(24) and returns the result.
In R, functions work with two types of parameters: mandatory parameters and optional
parameters. As the names imply, when a function is called all mandatory parameters have to
be explicitly set. On the other hand, optional parameters (default parameters) of a function
are already set to default values in the definition of the function. The optional
parameters of a function are allowed to be set to explicit values other than their default
values. That is, the default value of an optional parameter can be overwritten during the
function call.
Figure 1.14 provides three examples for the log function. In the first example only the
> getwd()
[1] "/home/mehmet"
>
> setwd(dir="/home/mehmet/Desktop/test")
> getwd()
[1] "/home/mehmet/Desktop/test"
>
> getwd
function ()
.Internal(getwd())
<bytecode: 0x3953858>
<environment: namespace:base>
> log(x=16)
[1] 2.772589
>
> log(x=16, base=2)
[1] 4
>
> log(x=exp(5))
[1] 5
parameter x is explicitly set to a value while the parameter base is left untouched with
its default (implicit) value, e. Hence, the first example evaluates log_e(16). In the second example
parameter x is explicitly set to 16 and parameter base is explicitly set to 2. That is, the
default value of base, e, is overwritten by 2. The function simply evaluates log_2(16). In the
last example again only the parameter x is set to a value while keeping the parameter base
at its default value, e. However, the argument that the parameter x is set to is another
function call, exp(5). In this case, R first evaluates exp(5), then sets x to e^5 and then calls
the log function.
So far we have always used the parameter names of a function when we set them to
arguments. In fact, R does not require parameter names to be used while passing
arguments. In case there are no parameter names in a function call, R assigns the arguments to
parameters using the order of the parameters in the function definition.
> log(16,2)
[1] 4
>
> log(16)
[1] 2.772589
Figure 1.16 shows how the arguments 16 and 2 are mapped to the first parameter, x, and
the second parameter, base, respectively in the first example. In the second example, only
the first parameter, x, is set to an argument, i.e., 16, and the parameter base is kept with
its default value. We strongly suggest using parameter names along with the arguments
for mandatory and overwritten parameters to improve code clarity. In case there is no
ambiguity passing arguments according to parameter order is also fine.
As a side note, R has a print function which displays the passed arguments on the
terminal. So far we have just been typing the identifier (name) of an object to print its
content on the terminal. As a matter of fact, typing the name of an object to print its
content is just a shorthand notation for calling the print function. R provides shorthand
notations for some of the frequently used functions. Figure 1.17 shows that the print
function is equivalent to its shorthand notation. Additionally, the same figure shows that
we can write expressions involving literal values, objects and functions.
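For example (a sketch reusing the myage object from above; not necessarily the content of Figure 1.17):
> myage <- 22
> print(myage)
[1] 22
> myage                          # shorthand for print(myage)
[1] 22
> print(sqrt(x=myage + 3) * 2)   # an expression mixing literals, objects and functions
[1] 10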
One important question at this point is how an R user would know the mandatory and
optional parameters of a function as well as the order of these parameters. R provides
a function named help which has a mandatory character string parameter named topic.
Setting the topic parameter to a function name simply opens the help documentation of
the function. The help documentation provides detailed explanation of the function, its
expected parameters in order, the default values for optional parameters, the data type of
each parameter as well as example code. A shortcut for the help function in R is to prepend
the “?” symbol to the name of a function, without parentheses.
Figure 1.18 demonstrates the use of the help function and its shorthand notation. The
figure requests documentation for the absolute value function abs, the rounding function round,
the trigonometric cosine function cos, the summation function sum and the sampling function
sample. Figure 1.19 shows the function signatures excerpted from the respective docu-
mentation files of the functions. Note that the documentation files have more detailed
descriptions of the functions as well as long explanations for each parameter.
> help(topic="abs")
> help("round")
> ?cos
> ?sum
The abs function has only one mandatory parameter x. The round function has a mandatory parameter
x and an optional parameter digits which is set to the default value 0. According to the
documentation of the function, the digits parameter is used to specify the number of
decimal places to be preserved while rounding a number. Similar to the abs function, the
trigonometric cos function has only one mandatory parameter x. The sum function adds up
all the argument objects and returns the summation. Its signature is different from the
other functions because the first parameter is just an ellipsis, “...”. An ellipsis appearing
as a parameter in a function signature simply implies zero or more arguments. That is,
the sum function is designed to add up an arbitrary number of objects. You can try the
function by yourself with zero, one, two or more arguments without using any parameter
names. The second parameter of the sum function, na.rm, is an optional parameter set to
the default value FALSE. This parameter specifies whether to remove objects having the
value NA, not available, from the summation or not.
abs(x)
round(x, digits = 0)
cos(x)
sum (..., na.rm = FALSE)
sample(x, size , replace = FALSE, prob = NULL)
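As an illustration of the ellipsis and the na.rm parameter of sum (a sketch, not excerpted from the documentation):
> sum(4, 8, 15)
[1] 27
> sum()                        # zero arguments are allowed
[1] 0
> sum(c(1, NA, 3))
[1] NA
> sum(c(1, NA, 3), na.rm=TRUE)
[1] 4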
Below we present three widely used functions in R: ls, str and rm. As we interact with
the R environment we create many objects and from time to time we need to list the objects
in the memory. The ls function returns the list of the objects in the current session when
called without any parameters. The rm function is used to remove (delete) existing objects
from the memory. The str function returns the internal structure information of an object.
Please use the help function to open the documentation of these functions. Figure 1.20
shows the use of ls, str and rm functions.
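A minimal sketch of these three functions, assuming a session that contains only the objects created below:
> n1 <- 6.2
> movie <- "Forrest Gump"
> ls()
[1] "movie" "n1"
> str(n1)
 num 6.2
> str(movie)
 chr "Forrest Gump"
> rm(n1)
> ls()
[1] "movie"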
We conclude this section by presenting functions for saving data in files and loading
them later. The function save is used to save one or more objects into a file on the system.
The save function supports many parameters for controlling almost all aspects of the save
operation. Among these are an ellipsis, “...”, representing the objects to be
saved (separated by commas) and file denoting the path to a destination file where the data
will be saved. In order to easily save the entire session, i.e., all objects in the memory, into
a file R provides the function save.image. The save.image also supports the parameter
file for specifying the destination file. However, it also sets the default value “.RData” to
the parameter. Hence if no destination file path is given to the save.image function it saves
the entire session into the file “.RData” located in the current working directory. Note that,
on Unix-like systems a period, “.”, preceding a file name implies that the file is a hidden file.
Hence, the default file might not be visible in the working directory.
The load function is used to load the data back into the memory from a file on the
system. The parameter file of the function load denotes the file to load the data from.
On some systems the content of the default file “.RData” in the default working directory
is automatically loaded when R is launched. One can simply rename the file to circumvent
this behavior.
Figure 1.21 demonstrates save, save.image and load functions for saving data on the
disk and loading them later. The file parameter of the save and load functions in the figure
is the relative or absolute file path to save and load data and it depends on your operating
system. The tilde (~) at the beginning of a file path is specific to Unix-like systems (such as
Linux and macOS) and is a shortcut to the user’s home directory. On Windows systems the tilde is
not supported, hence it is recommended to use absolute file paths. Also, the forward slash “/” as a
path separator is specific to Unix-like systems, and the Windows path separator “\” is a string
escape character. To facilitate working with file paths, R supports forward slash even on
Windows systems, hence it is recommended to always use forward slash as path separator,
e.g., C:/Users/YourUserName/Desktop. Lastly, the function q in Figure 1.21 is used to
quit the R environment.
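A hedged sketch of a typical save/load workflow (the file name below is hypothetical):
> n1 <- 6.2
> n2 <- 145
> save(n1, n2, file="backup.RData")   # save two objects to a file
> rm(n1, n2)
> load(file="backup.RData")           # restore them from the file
> n1
[1] 6.2
> save.image()                        # save the entire session to .RData
> q()                                 # quit the R environment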
We have already seen that R fully supports simple data types (numeric values, character
strings, logical values and complex numbers) as objects. As a matter of fact, R does not
specify scalar or atomic objects at a higher level. Scalar objects holding numbers,
character strings and logical values are considered to be vector objects of length one. Fig-
ure 1.22 demonstrates that R scalars truly behave like vectors of length one. In addition to
vectors, R supports composite objects including factors, matrices, arrays, data frames, lists
and time series which are constructed on top of the simple data types.
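The point that scalars are simply vectors of length one can be checked directly (a small sketch in the spirit of Figure 1.22):
> x <- 42
> is.vector(x)
[1] TRUE
> length(x)
[1] 1
> x[1]        # a scalar can be indexed like any other vector
[1] 42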
1.2.1 Vectors
A vector is an object that is capable of holding a sequence of values of the same data type. We
use the function c, standing for concatenate, to create vectors in R. Figure 1.23 shows how to
use the concatenation function, c, to create vectors of numbers, character strings and logical
values. We introduce two important functions namely mode and length functions in the
same figure as well. The mode function gives information about the format that an object is
stored in the memory. The length function returns the number of individual elements in an
object. In the first example (Figure 1.23) we create a vector of numbers, print the content
of the vector, get information about its mode and print its length. In the second example
(Figure 1.23) we create a vector of character strings, print the content of the vector, get
information about its mode and print its length. Note that the period “.” appearing in the
name of the object forrest.gump.cast is perfectly valid and it has no special meaning
other than being a symbol in an identifier. Also note that it took two lines for R to display
the content of the forrest.gump.cast object and the “[5]” on the second line simply tells
that the immediate character string is the fifth element of the printed vector object. In the
third example (Figure 1.23) we create a vector of logical values, print the content of the
vector, get information about its mode and print its length.
All examples in Figure 1.23 demonstrate how to create a vector by using the concate-
nate function c. The concatenate function takes a list of individual values (numeric values,
character strings or logical values). Considering that R does not specify scalars or atomic
values at a higher level, these individual values are in fact vectors of length one and the
concatenate function, c, actually concatenates vectors of length one. As a natural extension,
the concatenate function can also concatenate vectors of varying lengths as shown in Fig-
ure 1.24. Please notice how the concatenate function joined the two already defined vector
objects, v1 and v2, with another vector object created on the fly, c(21, 22, 23), to form
the vector object v3.
The concatenate function is used to form vectors, and the elements of a vector are supposed
to be of the same type. In case the arguments to the concatenate function have different data
types, the function coerces logical values to numbers 1 and 0 and numbers to their string
forms in order to keep the vector of the same data type. This behavior is demonstrated in
Figure 1.25.
In addition to the concatenation function, c, R provides many functions for forming vec-
tors. In the following we will study three functions for generating vectors of sequences (seq,
rep and sample) and three functions for generating empty vectors (numeric, character
and logical).
The seq function is used to generate a sequence of values starting from the parameter
from up to the parameter to incremented by the parameter by. All these parameters
are optional parameters set to the default value 1. Figure 1.26 shows examples of the seq
function where numbers are generated from a value up to another value with certain amount
of increments. Instead of setting the to parameter to an upper limit, one may generate a
certain number of values starting from an initial value, incremented by a certain amount.
Parameter length.out of the seq function is used to specify the number of values that
need to be generated. Example two in Figure 1.26 demonstrates the use of the length.out
parameter to generate 20 values starting from 4, incremented by 0.1. R provides a shorthand
> # Example 1
> grades <- c(75, 36, 96, 84, 65, 51, 90)
> grades
[1] 75 36 96 84 65 51 90
> mode(grades)
[1] "numeric"
> length(grades)
[1] 7
>
> # Example 2
> forrest.gump.cast <- c("Tom Hanks", "Rebecca Williams", "Sally Field", "George
Kelly", "Margo Moorer")
> forrest.gump.cast
[1] "Tom Hanks" "Rebecca Williams" "Sally Field" "George Kelly"
[5] "Margo Moorer"
> mode(forrest.gump.cast)
[1] "character"
> length(forrest.gump.cast)
[1] 5
>
> # Example 3
> topRanked <- c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
> topRanked
[1] FALSE FALSE TRUE FALSE FALSE TRUE
> mode(topRanked)
[1] "logical"
> length(topRanked)
[1] 6
notation for the seq function where the increment amount is 1. To generate numbers starting
from k up to l with increment 1, it is enough to type k : l in the terminal. The last example
in Figure 1.26 demonstrates the shorthand notation.
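A short sketch of the seq function and its shorthand (not necessarily the same values as Figure 1.26):
> seq(from=2, to=10, by=2)
[1]  2  4  6  8 10
> seq(from=4, by=0.1, length.out=5)
[1] 4.0 4.1 4.2 4.3 4.4
> 3:8                      # shorthand for seq(from=3, to=8, by=1)
[1] 3 4 5 6 7 8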
The rep function has the signature rep(x, times = 1, length.out = NA, each = 1). The
parameter x is the vector object to be repeated. The parameter times specifies how many
times to repeat the vector object. In case times is a vector of multiple elements, each
element in times specifies the number of times the corresponding element of x is repeated.
The parameter length.out is used to set the length of the generated vector. The parameter
each is used to specify how many times to repeat each value in x. Figure 1.27 shows several
examples of the rep function.
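A short sketch of the rep function and its times, each and length.out parameters (the values are illustrative):
> rep(x=c(1, 2), times=3)
[1] 1 2 1 2 1 2
> rep(x=c(1, 2), each=3)
[1] 1 1 1 2 2 2
> rep(x=c(1, 2), times=c(2, 4))   # repeat 1 twice and 2 four times
[1] 1 1 2 2 2 2
> rep(x=c(1, 2), length.out=5)
[1] 1 2 1 2 1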
The sample function is used to randomly select a number of elements from a vector and
it has the signature sample(x, size, replace = FALSE, prob = NULL). Parameter x specifies
the vector from which one wants to sample. Parameter size denotes the number of samples
one wants to draw. Parameter replace controls whether the sampling is done with replacement or without replacement.
Figure 1.28 shows several examples for generating random sequences using the sample
function. Note that the random samples in Figure 1.28 are not reproducible. That is, each
time you try the examples you may obtain different random samples. Later in this
section we will introduce the set.seed function for generating reproducible random samples.
Please note that seq, rep and sample functions are used to generate vectors, i.e., they
return vectors.
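A hedged sketch of the sample function (the output is omitted because it changes from run to run):
> sample(x=1:10, size=4)                          # four values without replacement
> sample(x=c("H", "T"), size=5, replace=TRUE)     # five coin flips with replacement
> sample(x=1:3, size=5, replace=TRUE, prob=c(0.6, 0.3, 0.1))  # weighted sampling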
So far we have created vectors using the assignment operator and the functions to gen-
erate a sequence of predetermined values. R provides three functions (numeric, character
and logical) for generating empty vectors. All three functions have a parameter length,
set to 0 by default, for setting the length of the vector to be generated. If the length pa-
rameter is zero they generate a vector of length zero; otherwise they generate a vector filled
with zeros (0), empty strings ("") or logical FALSE values, respectively.
The functions numeric, character and logical also have corresponding functions
is.numeric, is.character and is.logical to check whether a vector is of type numeric, char-
acter or logical, as well as the functions as.numeric, as.character and as.logical
to coerce a vector object into a numeric, character or logical vector, respectively.
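A brief sketch of these generator, checker and coercion functions:
> numeric(length=3)
[1] 0 0 0
> character(length=2)
[1] "" ""
> logical(length=2)
[1] FALSE FALSE
> is.numeric(c(1, 2, 3))
[1] TRUE
> as.numeric(c("3.14", "10"))
[1]  3.14 10.00
> as.logical(c(0, 1, 5))      # zero is FALSE, non-zero is TRUE
[1] FALSE  TRUE  TRUE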
The arithmetic subtraction, multiplication and division operators work exactly like the arithmetic
addition operator. That is, if the lengths of the vectors are different, R recycles the shorter one
to match the length of the longer one and then performs the operation element by element.
Figure 1.30 shows different examples for arithmetic operators on vectors. Notice that, the
operation v1*3 is a multiplication on a vector of length five and a vector of length one. The
vector with a single value is recycled five times before the multiplication is performed.
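A brief sketch of element-wise arithmetic and recycling on vectors (values chosen for illustration, not taken from Figure 1.30):
> v1 <- c(10, 20, 30, 40, 50, 60)
> v2 <- c(1, 2)
> v1 + v2          # v2 is recycled to c(1, 2, 1, 2, 1, 2)
[1] 11 22 31 42 51 62
> v1 * 3           # the length-one vector 3 is recycled
[1]  30  60  90 120 150 180
> v1 / c(10, 20, 30, 40, 50, 60)
[1] 1 1 1 1 1 1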
In addition to the basic arithmetic operators dot product and cross product are widely
used in linear algebra. These operators are covered in Section AAAA, Matrices.
In the beginning of this text we defined relational operators (“<”, “>”, “==”, “!=”, “<=”
and “>=”) as the operators which test the existence of a particular relation between two
values. The same concept can be extended to vectors where the test for a relation between
two vectors is performed element-wise. As a result applying a relational operator to two
vectors returns a vector of logical values representing the existence of the relation between
two elements located at each position. Similar to arithmetic operators, if the lengths of the
two vectors are different the shorter one is recycled until the lengths match. Figure 1.31
shows several examples of using relational operators on vectors. Notice how vectors of length
one and length two are recycled to match the lengths.
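A brief sketch of relational operators and recycling on vectors (illustrative values):
> v1 <- c(10, 20, 30, 40)
> v1 > 25                 # the length-one vector 25 is recycled
[1] FALSE FALSE  TRUE  TRUE
> v1 == c(10, 40)         # the length-two vector is recycled
[1]  TRUE FALSE FALSE  TRUE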
> states <- c("Louisiana", "Texas", "California", "Montana", "New Hampshire", "South
Dakota", "Maine", "Ohio", "Utah", "Arizona")
> states[3]
[1] "California"
> states[1]
[1] "Louisiana"
> states[2+5] #states[7]
[1] "Maine"
> states[length(states)] #states[10]
[1] "Arizona"
> states[length(states)-6] #states[4]
[1] "Montana"
All examples in Figure 1.32 involve a single integer index. Considering that the atomic
unit in R is vector, a single integer index is nothing but a vector of length one. As a natural
extension, one can use vectors having multiple integer indices to retrieve multiple values of
a vector.
> states <- c("Louisiana", "Texas", "California", "Montana", "New Hampshire", "South
Dakota", "Maine", "Ohio", "Utah", "Arizona")
> indexV <- c(1, 3, 9)
> states[indexV]
[1] "Louisiana" "California" "Utah"
> states[1:4]
[1] "Louisiana" "Texas" "California" "Montana"
> states[c(4, 7, 1)]
[1] "Montana" "Maine" "Louisiana"
> states[c(1:2, 9:10)]
[1] "Louisiana" "Texas" "Utah" "Arizona"
> states[rep(x=1:2, times=3)] #states[c(1, 2, 1, 2, 1, 2)]
[1] "Louisiana" "Texas" "Louisiana" "Texas" "Louisiana" "Texas"
> states[1:3 * 2] #states[c(2, 4, 6)]
[1] "Texas" "Montana" "South Dakota"
Figure 1.33 shows how to subset multiple elements of a vector using integer indices. Note
that we have used positive integers as indices of elements to be included in the retrieved
vector. An alternative way to retrieve elements of a vector is to use negative integers as
indices of elements to be excluded from the retrieved vector, as shown in Figure 1.34.
Figure 1.34: Subsetting multiple elements of a vector using negative integer indices
> states <- c("Louisiana", "Texas", "California", "Montana", "New Hampshire", "South
Dakota", "Maine", "Ohio", "Utah", "Arizona")
> states[-4]
[1] "Louisiana" "Texas" "California" "New Hampshire" "South Dakota"
[6] "Maine" "Ohio" "Utah" "Arizona"
> states[-5:-1]
[1] "South Dakota" "Maine" "Ohio" "Utah" "Arizona"
> states[c(-1, -2, -5, -7, -9, -10)]
[1] "California" "Montana" "South Dakota" "Ohio"
R also supports the use of logical indices to subset elements of a vector. In the last examples
R implicitly recycles the logical index vector of length three.
Remember that relational operators on vectors generate a logical vector. This logi-
cal vector can be used to subset the elements of a vector satisfying a particular relation.
Figure 1.36 shows several examples where relational operators are used within the square
brackets to generate a logical vector satisfying the relational condition.
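A short sketch of this idea using the grades vector from Figure 1.23:
> grades <- c(75, 36, 96, 84, 65, 51, 90)
> grades > 70
[1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE
> grades[grades > 70]
[1] 75 96 84 90
> grades[grades >= 50 & grades < 90]
[1] 75 84 65 51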
R supports subsetting vectors using the names (character strings) of individual elements
appearing in a vector. So far, we have not named individual elements of vectors. However, R
provides the names function to set names to the individual elements of a vector or get the
names of the elements of a vector. The function sets names to the individual elements of a
vector if it is used on the left hand side of an assignment operator and retrieves the names
of the elements otherwise. In case no names have been assigned the function returns NULL.
If only a subset of the elements are assigned a name the function returns NA for elements
without a name. If the same character string is used to name multiple elements, subsetting
by that name returns only the first matching element. Figure 1.37 demonstrates the use of the names
function via several examples.
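A short sketch of the names function (the element names are hypothetical):
> ages <- c(22, 35, 41)
> names(ages)                  # no names assigned yet
NULL
> names(ages) <- c("amy", "bob", "cal")
> ages
amy bob cal 
 22  35  41 
> ages["bob"]
bob 
 35 
> names(ages)
[1] "amy" "bob" "cal"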
Calling the set.seed function just before the sample function makes it pick the same sequence,
determined by the argument of the set.seed function. Hence, one can generate reproducible “random” values.
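A minimal sketch of reproducible sampling (the seed value is arbitrary; the output is omitted because it depends on the R version, but the two calls below print identical samples):
> set.seed(42)                 # any integer seed works
> sample(x=1:100, size=5)      # some five values
> set.seed(42)
> sample(x=1:100, size=5)      # exactly the same five values again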
> v1 <- c(46, 41, 43, 49, 42, 45, 48, 41)
> v2 <- c(49, 58, 41, 57)
> v1 %in% v2
[1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
> v2 %in% v1
[1] TRUE FALSE TRUE FALSE
The rev function returns the reversed version of the argument vector. The unique function
returns a version of the argument vector where duplicate elements are discarded. The sort
function returns the sorted version of the argument vector. The parameter decreasing
which is set to FALSE by default controls whether the sort is in ascending or descending
order. In case the vector given to the sort function is a character string vector, the function
sorts lexicographically (dictionary order). Figure 1.43 shows examples of these essential
functions.
The order function returns an integer index vector representing the correct order of the
elements of a vector in ascending order. Subsetting a vector using its “order” is equivalent
to sorting the vector. The rank function returns a vector representing the rank of each
element in the argument vector in ascending order. In case there are elements appearing
multiple times, their rank is the average of their ranks by default. Since an average value is a real
number, the rank function returns a vector of real values to represent the ranks. In fact the
rank function has a parameter named ties.method which takes a value among average,
first, random, max and min to control how to rank the elements appearing multiple times.
You may try setting the parameter to different values to see the effect. Figure 1.44 shows
examples of the order and rank functions.
The which function works on logical vectors in its simplest form. It takes a logical vector
and returns a vector consisting of integer indices for which the value of the logical vector
is TRUE. The first example in Figure 1.45 shows this simple behavior. However, the true
strength of the which function comes when its argument is the result of a relational or logical
operator on vectors. Then, the function returns a vector of integer indices for which the
relational or logical operator evaluates TRUE. In other words, it returns the integer indices of
the elements which satisfy the relational or logical operator. The resulting index vector can
be used to subset the original vector to obtain a vector satisfying some logical conditions as
shown in the last example of Figure 1.43.
Another versatile and very useful function in R is the table function. Called on a vector,
it displays the frequency of each element in the vector. Figure 1.46 shows an example for
the table function.
The table function simply returns the frequency distribution of a vector. By default it
calculates the frequency of each unique element in the vector. Sometimes one may need to
calculate the frequencies based on custom intervals rather than individual values. That is
one may need to explicitly divide the range of the data into hypothetical bins and calculate
the frequencies falling into each bin. R provides the cut function which allows us to explicitly
define intervals and obtain a collection of intervals where each value in the vector is replaced
by its interval in the new collection.
To use the cut function for evaluating the interval frequency distribution of a vector, the
following four steps should be followed: (i) determine the range of the vector; (ii) create a
> v1 <- c(46, 41, 43, 49, 42, 45, 48, 41)
> v1
[1] 46 41 43 49 42 45 48 41
> sum(v1)
[1] 355
> cumsum(v1)
[1] 46 87 130 179 221 266 314 355
> prod(v1)
[1] 1.478064e+13
> max(v1)
[1] 49
> min(v1)
[1] 41
> which.max(v1)
[1] 4
> which.min(v1)
[1] 2
> intersect(v1, c(57, 41, 89, 90, 45))
[1] 41 45
> rev(v1)
[1] 41 48 45 42 49 43 41 46
> unique(v1)
[1] 46 41 43 49 42 45 48
> sort(v1)
[1] 41 41 42 43 45 46 48 49
> sort(v1, decreasing=TRUE)
[1] 49 48 46 45 43 42 41 41
> v1 <- c(46, 41, 43, 49, 42, 45, 48, 41)
> v1
[1] 46 41 43 49 42 45 48 41
> order(v1)
[1] 2 8 5 3 6 1 7 4
> v1[order(v1)] # Equivalent to sort
[1] 41 41 42 43 45 46 48 49
> rank(v1)
[1] 6.0 1.5 4.0 8.0 3.0 5.0 7.0 1.5
> rank(v1, ties.method="first")
[1] 6 1 4 8 3 5 7 2
> rank(v1, ties.method="min")
[1] 6 1 4 8 3 5 7 1
> v1 <- c(46, 41, 43, 49, 42, 45, 48, 41)
> v1
[1] 46 41 43 49 42 45 48 41
> table(v1)
v1
41 42 43 45 46 48 49
2 1 1 1 1 1 1
vector defining intervals or specify the number of breaks; (iii) use the cut function to obtain
a collection of intervals and (iv) use the table function for the frequency distribution of
the collection of intervals. Figure 1.47 shows how the four steps are followed to evaluate
the frequency distribution of a large vector based on five intervals. The cut function can
be fine-tuned through several parameters. It is strongly recommended to browse the
documentation of the cut function.
> help("cut")
> v1 <- sample(x=seq(from=0, to=1, by=0.001), size=5000, replace=TRUE)
> breaks.v1 <- seq(from=min(v1), to=max(v1), by =0.2)
> interval.v1 <- cut(x=v1, breaks=breaks.v1, include.lowest=TRUE)
> table(interval.v1)
interval.v1
[0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] (0.8,1]
1054 984 959 1028 975
Two other functions we present here are the any and all functions, which only work on logical
vectors. As the name implies, any checks whether any of the elements of the argument
vector is TRUE. Similarly, all checks whether all of the elements of the argument
vector are TRUE. Figure 1.48 gives examples of the any and all functions.
> v1 <- c(46, 41, 43, 49, 42, 45, 48, 41)
> any(v1>50)
[1] FALSE
> any(v1<45 | v1>50)
[1] TRUE
> all(v1>=40)
[1] TRUE
It is a common task in R to compare whether two vectors are equal or not. We have already
covered the relational operator “==” which compares the matching elements of two vectors
and returns a vector of logicals which has TRUE for the elements that are equal. It is possible
to use the equality operator along with the function all to check whether all elements of two
vectors are equal or not.
R provides two more comparison functions that behave slightly differently. The identical
function checks if two vectors are exactly identical in terms of both values and data types
used to hold the values in the memory. The all.equal function is used to check if the
matching elements of two vectors are nearly equal, i.e., equal within a tolerance boundary.
Both identical and all.equal functions return a single logical value representing whether
two vectors are identical and equal within a tolerance, respectively. Figure 1.49 shows the
difference among alternative methods for comparing two vectors. We suggest using the
equality operator “==” for comparing vectors unless you know what you are doing.
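A small sketch contrasting these approaches (illustrative values, not those of Figure 1.49):
> v1 <- c(1, 2, 3)
> v2 <- c(1, 2, 3)
> all(v1 == v2)
[1] TRUE
> identical(c(1, 2), c(1L, 2L))   # same values but different storage types
[1] FALSE
> 0.1 + 0.2 == 0.3                # floating point representation error
[1] FALSE
> all.equal(0.1 + 0.2, 0.3)       # equal within a tolerance
[1] TRUE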
A two-dimensional vector v = [3.0, 3.5] can be written as a weighted sum of the standard basis vectors,

    v = 3.0 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + 3.5 \begin{bmatrix} 0 \\ 1 \end{bmatrix}                     (1.1)

or in a more compact form

    v = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 3.0 \\ 3.5 \end{bmatrix}                   (1.2)
Figure 1.50 shows vector v = [3.0, 3.5] by a black arrow. In the figure the directions of
the dimensions, i.e., Mathematics and Informatics, are shown by dashed lines; the standard
basis vectors of the dimensions are shown by red and blue arrows; the amount of v along
the direction of each standard basis vector is shown by dotted lines.
Vectors in linear algebra can be defined as row vectors, e.g., v = [3.0, 3.5], or as column
vectors, e.g., v = \begin{bmatrix} 3.0 \\ 3.5 \end{bmatrix}. The transpose operator, T, provided by the function t in R,
expresses a row vector as a column vector and a column vector as a row vector:

    [3.0, 3.5]^T = \begin{bmatrix} 3.0 \\ 3.5 \end{bmatrix}                     (1.3)

    \begin{bmatrix} 3.0 \\ 3.5 \end{bmatrix}^T = [3.0, 3.5]                     (1.4)
In general, the column vector notation is preferred over the row vector notation. Hence, we
adopt the column vector notation in the rest of the manuscript, e.g., v = [3.0, 3.5]^T.
A vector object in R is a data structure holding a collection of items of the same mode.
Unfortunately, the R vector object does not correspond to a vector object in linear algebra.
One needs to convert it into a linear algebra vector using the as.matrix function.
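A brief sketch of turning an R vector into a linear algebra (column) vector and transposing it:
> v <- c(3.0, 3.5)
> v.col <- as.matrix(v)    # a 2x1 column vector
> v.col
     [,1]
[1,]  3.0
[2,]  3.5
> t(v.col)                 # its transpose, a 1x2 row vector
     [,1] [,2]
[1,]    3  3.5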
Several operators are defined on vectors in linear algebra, including transpose, addition,
subtraction, scalar-vector multiplication, dot product, length and projection.
Vector addition and subtraction are defined element-wise:

    \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_p \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_p \end{bmatrix} = \begin{bmatrix} v_1 + u_1 \\ v_2 + u_2 \\ \vdots \\ v_p + u_p \end{bmatrix}                     (1.5)

Similarly,

    \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_p \end{bmatrix} - \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_p \end{bmatrix} = \begin{bmatrix} v_1 - u_1 \\ v_2 - u_2 \\ \vdots \\ v_p - u_p \end{bmatrix}                     (1.6)
Dot product. By far, the most important operation on vectors is the dot product oper-
ation. The dot product of two vectors, u \cdot v = u^T v, generates a scalar value
as defined in Equation 1.8,
    u \cdot v = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_p \end{bmatrix} \cdot \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_p \end{bmatrix} = \begin{bmatrix} u_1 & u_2 & \dots & u_p \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_p \end{bmatrix} = \sum_{i=1}^{p} u_i v_i                     (1.8)
Since the dot product operation generates a scalar value, it is also called scalar product.
In some texts, the dot product operation is also called inner product. In fact, inner product is
the generalization of the dot product for vector spaces that are beyond the finite dimensional
Euclidean space over the real numbers, \mathbb{R}^p.
One interpretation of the dot product operation is that it represents weighted linear
combinations in a compact form. A linear combination is a mathematical expression denoted
by the sum of variables weighted by constants. For example, a_1 x_1 + a_2 x_2 + \dots + a_p x_p is the
linear combination where each variable x_i is weighted by a constant a_i \in \mathbb{R}. This expression,
also called a weighted sum, is represented compactly as a dot product in Equation 1.9,

    a_1 x_1 + a_2 x_2 + \dots + a_p x_p = \begin{bmatrix} x_1 & x_2 & \dots & x_p \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = x^T a                     (1.9)
where a and x are the vectors consisting of the constants and the variables, respectively.
In R one can use either the matrix multiplication operator %*% or the dot function in the
geometry package for the dot product. The magnitude of a vector can be computed either as the
square root of the dot product of the vector with itself or via the norm function. Figure 1.53 presents the dot product
and norm operations.
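A small sketch of these operations in base R (the norm call is one common way to obtain the Euclidean length):
> u <- c(1, 2, 3)
> v <- c(4, 5, 6)
> u %*% v                       # dot product via matrix multiplication (a 1x1 matrix)
     [,1]
[1,]   32
> sum(u * v)                    # the same value as a plain number
[1] 32
> sqrt(sum(u * u))              # length (magnitude) of u
[1] 3.741657
> norm(as.matrix(u), type="2")  # the same length via the norm function
[1] 3.741657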
A vector is called a unit vector when its length is one. In some cases only the direction of
the vector is important, and the vector can be normalized to unit length, i.e., \hat{v} = v/\|v\|.
Lengths of vectors lead to a very nice geometrical interpretation of the dot product operation. Dot
product of two vectors is the multiplication of their lengths and the cosine of the angle
between them, i.e., u \cdot v = u^T v = \|u\| \, \|v\| \cos(\theta), where \theta is the angle between u and v.
Figure 1.54 shows the geometric interpretation of the dot product of two vectors. The length of a
vector is always positive, hence the sign of the dot product of two vectors is determined by
the angle between them. If the angle is acute, the dot product is positive. If the angle is
obtuse, the dot product is negative. Lastly, and most importantly, if the angle is right, the
dot product is zero. That is, if two vectors are perpendicular (orthogonal) to each other, their
dot product is zero. In addition, the geometric interpretation of the dot product operation
leads to the famous similarity measure of two vectors named the cosine similarity. The
cosine similarity score is defined as \cos_{u,v}(\theta) = \frac{u^T v}{\|u\| \, \|v\|} and it reflects whether two vectors
point in the same or the opposite direction “in general”, regardless of their magnitudes.
Scalar and vector projection. The scalar projection of one vector onto another vector
is also an important vector operation. The scalar projection u_v of vector u onto vector v is
the magnitude of the component of u along the direction of vector v. Because the definition
refers only to the magnitude or length of the component along the direction of the other
vector, it is called a scalar projection.
Figure 1.55 (scalar and vector projections) presents an alternative interpretation of the dot
product of two vectors.
Figure 1.55a shows the scalar projection of u onto v in red. Applying basic trigonometry,
the length of the double green line is u_v = \|u\| \cos(\theta), where \theta is the angle between u and
v. One can simplify the scalar projection u_v further as
    u_v = \|u\| \cos(\theta) = \|u\| \, \frac{u^T v}{\|u\| \, \|v\|} = \frac{u^T v}{\|v\|}                     (1.10)
Equation 1.10 allows us to have an alternative interpretation of the dot product operation
when the length of v is one, i.e., \|v\| = 1. The dot product u^T v gives the amount of u along
the direction of v when v is a unit vector. If v is not a unit vector, it can be normalized as
v/\|v\|. The most important conclusion of the projection operation is that the coordinates of
a vector simply denote the amount of the vector along the directions of the standard basis
vectors, as shown in Figure 1.55b.
The vector projection u_v of vector u onto vector v is the magnitude of the component
of u along the direction of vector v multiplied by the normalized v, as shown in Figure 1.55c
and Equation 1.11,
    u_v = u_v \, \frac{v}{\|v\|} = \|u\| \cos(\theta) \, \frac{v}{\|v\|} = \frac{u^T v}{\|v\|} \, \frac{v}{\|v\|} = \frac{u^T v}{\|v\|^2} \, v                     (1.11)
Scalar and vector projections play an important role in principal component analysis, a
technique to change the basis of a vector space along the directions of the largest variances,
i.e., eigenvector basis.
Although there are libraries providing functions for the cosine similarity, scalar pro-
jection and vector projection operations, their implementations are not very difficult. In
Appendix A.1, Figure A.1 demonstrates self-reliant, custom implementations of these func-
tions.
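For reference, a self-contained sketch of such implementations (these are illustrative and not necessarily identical to the code in Figure A.1):
> cosine.similarity <- function(u, v) sum(u*v) / (sqrt(sum(u*u)) * sqrt(sum(v*v)))
> scalar.projection <- function(u, v) sum(u*v) / sqrt(sum(v*v))
> vector.projection <- function(u, v) (sum(u*v) / sum(v*v)) * v
> u <- c(1, 2)
> v <- c(3, 0)
> cosine.similarity(u, v)
[1] 0.4472136
> scalar.projection(u, v)    # amount of u along the direction of v
[1] 1
> vector.projection(u, v)    # that amount expressed as a vector along v
[1] 1 0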
1.2.2 Factors
A factor object in R is a vector which can only take values from a finite number of distinct
values. These objects are also called categorical variables. Factors (or categorical variables)
take qualitative values that do not support all arithmetic operators. Examples of categorical
variables are gender (Male, Female), opinion (Good, Mediocre, Bad) and rank (1, 2, 3, 4,
5). Note that even though the last example seems to take numeric values, these values are
not quantitative because the distances between consecutive values are not necessarily the same.
The values of categorical variables can either be integers or strings. In any case, it is important
to declare them as factors because categorical variables are treated differently in statistics.
The function factor creates a factor object from a vector of integers or character strings.
The signature of the function is factor(x, levels = sort(unique(x), na.last=TRUE), labels =
levels, exclude = NA, ordered = is.ordered(x)). The only mandatory parameter of the func-
tion is x which is an integer or character string vector to be used for generating a factor
object. Other important parameters are levels denoting all possible levels (or categories)
of the categorical variable; labels providing a string to represent each possible level (or cat-
egory) and ordered representing whether there is an order within the levels (or categories)
of the categorical variable. Although R displays factors using the labels corresponding to
levels, the levels are encoded as integers to save memory. Hence, one can think of a factor
object as a composite object consisting of a set of integers as levels, an integer vector of levels
representing the data, a set of strings or numbers that label each level, and a logical variable
denoting whether the levels are ordered.
Figure 1.56 demonstrates the use of factor function for creating factor objects. R also
provides utility functions is.factor to check if an object is a factor and levels to obtain
the levels of a factor object.
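A brief sketch of creating an ordered factor (the data values are hypothetical):
> opinion <- factor(x=c("Good", "Bad", "Good", "Mediocre"), levels=c("Bad", "Mediocre", "Good"), ordered=TRUE)
> opinion
[1] Good     Bad      Good     Mediocre
Levels: Bad < Mediocre < Good
> levels(opinion)
[1] "Bad"      "Mediocre" "Good"
> is.factor(opinion)
[1] TRUE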
1.2.3 Matrices
A matrix is a collection of data arranged into a two dimensional tabular layout. A matrix
has a fixed number of columns and rows, and the elements of a matrix must be of
the same mode (basic data type). Linear algebra provides a rich set of mathematical tools
that work on matrices, e.g., arithmetic operators, transformations and various functions
of matrix algebra. Hence, matrices are very convenient data structures for holding and
manipulating data.
R provides the function matrix for creating matrices. The matrix function has the sig-
nature matrix(data=NA, nrow=1, ncol=1, byrow=FALSE, dimnames=NULL). Parameter
data denotes a vector of data to be arranged in tabular form. Parameters nrow and ncol
specify the number of rows and columns, respectively. There are two ways to fill an empty
matrix of a fixed number of columns and rows with data: either by rows or by columns.
Parameter byrow controls whether the matrix is to be filled by columns or by rows. Fi-
nally, one can name the two dimensions (columns and rows) of a matrix using the dimnames
parameter.
> A <- matrix(data=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K","L"),
nrow=3, ncol=4, byrow=TRUE)
> A
[,1] [,2] [,3] [,4]
[1,] "A" "B" "C" "D"
[2,] "E" "F" "G" "H"
[3,] "I" "J" "K" "L"
> B <- matrix(data=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L"),
nrow=3, ncol=4, byrow=FALSE)
> B
[,1] [,2] [,3] [,4]
[1,] "A" "D" "G" "J"
[2,] "B" "E" "H" "K"
[3,] "C" "F" "I" "L"
> C <- matrix(data=c("A", "B", "C", "D"), nrow=3, ncol=4)
> C
[,1] [,2] [,3] [,4]
[1,] "A" "D" "C" "B"
[2,] "B" "A" "D" "C"
[3,] "C" "B" "A" "D"
Figure 2.37 demonstrates the use of matrix function for creating matrices by row and by
column. Notice that in the last example the data vector contains fewer elements than
required to fill a 3x4 matrix, so the data is recycled until the matrix is filled.
An alternative way of creating matrices using other matrices or vectors is the bind
functions rbind and cbind. The function rbind takes multiple vectors of the same length or
matrices of the same column numbers and binds them row by row in the order of arguments.
Similarly, the function cbind takes multiple vectors of the same length or matrices of the
same row numbers and binds them column by column in the order of arguments. Figure 2.39
demonstrates the use of rbind and cbind via several examples. Notice how R recycles the
data argument of the matrix function to generate enough data to fill the matrix specified
by its dimensions.
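A brief sketch of the two bind functions (illustrative vectors):
> r1 <- c(1, 2, 3)
> r2 <- c(4, 5, 6)
> rbind(r1, r2)        # bind as rows
   [,1] [,2] [,3]
r1    1    2    3
r2    4    5    6
> cbind(r1, r2)        # bind as columns
     r1 r2
[1,]  1  4
[2,]  2  5
[3,]  3  6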
Sometimes one needs to flatten the matrix back into an R vector. One can use the
concatenation function, c, with the matrix to be flattened being the argument to obtain an
R vector. The concatenation function, c, concatenates all columns of a matrix into an R
vector as shown in Figure 1.59. In the same figure we demonstrate the use of is.matrix
function to check if an object is a matrix or not as well as the dim function which returns
the dimensions of a matrix in the form of number of rows and number of columns.
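A brief sketch of flattening a matrix and querying its structure:
> M <- matrix(data=1:6, nrow=2, ncol=3)
> M
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> c(M)             # columns are concatenated into a plain vector
[1] 1 2 3 4 5 6
> is.matrix(M)
[1] TRUE
> dim(M)           # number of rows and number of columns
[1] 2 3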
> set.seed(1001)
> M <- matrix(data=sample(x=0:99, size=16, replace=FALSE), nrow=4, ncol=4)
> M
[,1] [,2] [,3] [,4]
[1,] 98 96 26 75
[2,] 40 84 69 38
[3,] 42 0 39 22
[4,] 99 7 12 74
> M[4,1] # subset the element located at [4,1]
[1] 99
> M[2,3] # subset the element located at [2,3]
[1] 69
> M[2, ] # subset the entire second row
[1] 40 84 69 38
> M[ ,4] # subset the entire fourth column
[1] 75 38 22 74
> M[c(1,4), ] # subset the first and fourth rows
[,1] [,2] [,3] [,4]
[1,] 98 96 26 75
[2,] 99 7 12 74
> M[c(2,3),c(2,3)] # subset the elements intersecting at rows (2,3) and columns (2,3)
[,1] [,2]
[1,] 84 69
[2,] 0 39
In Figure 1.60 we first generate a matrix of sixteen elements arranged in four rows and
four columns. Second, we subset the element located at the fourth row and first column
and then the element located at the second row and third column. Third, we subset the
entire second row and then the entire fourth column. Fourth, we subset the first and
fourth rows. Lastly, we subset the partition formed by the intersection of the second
and third rows and the second and third columns.
In linear algebra a matrix consisting of a single row or a column is called a row or a
column vector, respectively. In R however, matrices and vectors are data structures rather
than mathematical concepts and they are entirely two different types of compound objects.
Hence, a matrix of a single column or a row is not equivalent to a vector in R. Although a
matrix of a single column or row and a vector are displayed differently on the R terminal,
one can use is.matrix and is.vector functions to check their types.
In Figure 1.61 we again first generate a 4x4 dimensional matrix object, M, consisting of
sixteen elements. We subset M to obtain another 2x2 dimensional matrix object named A.
The dim function on A returns its dimensions, 2x2, and the is.matrix function on A returns
true. Then, we subset the first row of M to obtain v. The dim function on v returns NULL
and is.vector verifies that v is a vector, not a matrix.
In Example set 2 (Figure 1.61) we first display the vector v. Then, similar to v, we subset
the first row of the matrix M again to obtain w. However, we set a parameter of the subset
operation named drop to FALSE. Displaying w and v already shows that the objects are
different although they have the exact same content. The is.matrix function on w returns
true. The reason for this behavior is the object of the lowest dimension rule in
subsetting, as explained below.
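A minimal sketch of this behavior, using a small matrix defined here for illustration (not the M of Figure 1.61):
> M <- matrix(data=1:12, nrow=4, ncol=3)
> v <- M[1, ]              # drops to the object of the lowest dimension, a vector
> v
[1] 1 5 9
> dim(v)
NULL
> is.vector(v)
[1] TRUE
> w <- M[1, , drop=FALSE]  # keeps the result as a 1x3 matrix
> w
     [,1] [,2] [,3]
[1,]    1    5    9
> is.matrix(w)
[1] TRUE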
We have already covered two important rules that R implicitly applies whenever nec-
essary: the element-wise evaluation rule and the recycling rule. Now it is time to
present a third very important rule that R implicitly uses in subsetting, namely the object
of the lowest dimension rule. According to this rule, R drops any dimension of length one
from the result of a subsetting operation, so that a single row or column of a matrix is
returned as a vector rather than a matrix, unless the drop parameter is set to FALSE.
An alternative way of subsetting a matrix is naming its dimensions (rows and columns)
and using the dimension names to subset the matrix. A simple example is shown in Fig-
ure 1.62. In the figure we first create a numeric matrix, E, to represent the number of students
enrolled in a class in years “2014” and “2015” and semesters “Fall” and “Spring”. We
use the dimnames function, similar to the names function, to assign character string names
to the rows and columns, respectively. R would coerce the names to character strings in case
the arguments were numeric or logical. Note that we could have named the dimensions of
the matrix using the dimnames parameter of the matrix function as well. Lastly, R provides
two functions, namely rownames and colnames, to assign character string vectors as names to
the rows and columns of a matrix, respectively. In Figure 1.62 we use the subset operator
along with dimension names to access the element specified by “2014” and “Spring”.
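A minimal sketch along the lines of Figure 1.62 (the enrollment numbers below are made up for illustration):
> E <- matrix(data=c(120, 95, 130, 101), nrow=2, ncol=2)
> dimnames(E) <- list(c("2014", "2015"), c("Fall", "Spring"))
> E
     Fall Spring
2014  120    130
2015   95    101
> E["2014", "Spring"]   # subset by dimension names
[1] 130
> rownames(E)
[1] "2014" "2015"
> colnames(E)
[1] "Fall"   "Spring"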
Finally, R allows using logical indices to subset matrices. When we subset a vector using
logical indices, each logical value in the index vector tells whether to include or exclude the
corresponding element of the vector being subsetted. When we subset a matrix using logical
indices, however, each logical value in the row (column) index vector tells whether to include
or exclude the corresponding row (column) of the matrix being subsetted. Although using
logical indices to subset matrices is not a prevalent approach, Figure 1.63 demonstrates how it works.
> set.seed(1001)
> M <- matrix(data=sample(x=0:99, size=16, replace=FALSE), nrow=4, ncol=4)
> M
[,1] [,2] [,3] [,4]
[1,] 98 96 26 75
[2,] 40 84 69 38
[3,] 42 0 39 22
[4,] 99 7 12 74
> M[c(FALSE,TRUE,TRUE,FALSE), c(TRUE,FALSE,FALSE,TRUE)]
[,1] [,2]
[1,] 40 38
[2,] 42 22
Arithmetic operators such as +, -, * and / work element-wise on matrices, provided that both matrix
objects have the exact same dimensions. That is, both matrix objects should have the same
row and column numbers. On the other hand, if one operand is a vector object (of length
one or longer) and the other one is a matrix object, then the vector gets recycled by column
to obtain a matrix of equal dimensions and the operation gets performed element-wise on
both matrices. Figure 1.64 shows several examples of arithmetic operations on matrices.
The outer function takes two vectors and a function (or operator) as arguments. It
applies the function to every ordered pair formed by an element from the first vector and
an element from the second vector. The parameters X and Y are placeholders for the first
and second vectors, respectively. The parameter FUN is a character string representing the
function (or operator) to be applied. The outer function returns a matrix where the rows
represent the elements of the vector X, the columns represent the elements of the vector Y
and the entries are the values obtained by applying the function (or operator) FUN to the
corresponding row and column elements. Figure 1.65 demonstrates the use of the outer function.
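A minimal sketch of the outer function (not necessarily the same calls as in Figure 1.65):
> outer(X=1:3, Y=1:4, FUN="*")   # multiplication table of 1:3 by 1:4
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    2    4    6    8
[3,]    3    6    9   12
> outer(X=1:3, Y=1:4, FUN="+")   # sums of every ordered pair
     [,1] [,2] [,3] [,4]
[1,]    2    3    4    5
[2,]    3    4    5    6
[3,]    4    5    6    7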
Matrix addition and subtraction. Addition and subtraction operations are computed
element-wise, hence two matrices should have the same dimensions (row and column num-
bers) in order to be added or subtracted as shown in Equations 1.12 and 1.13. Matrix addition supports
commutativity, A + B = B + A; associativity, A + (B + C) = (A + B) + C; additive iden-
tity, A + 0 = A; and additive inverse, A + (−A) = 0 properties. R naturally supports
element-wise matrix addition and subtraction via the + and - operators as shown in the
console output in Figure 1.66.
> set.seed(1001)
> M1 <- matrix(data=sample(x=c(-6:-1, 1:6), size=12, replace=FALSE), nrow=3, ncol=4)
> M2 <- matrix(data=sample(x=c(-6:-1, 1:6), size=12, replace=FALSE), nrow=3, ncol=4)
> M1
[,1] [,2] [,3] [,4]
[1,] 6 -3 -6 -4
[2,] -2 3 -1 4
[3,] 5 1 -5 2
> M2
[,1] [,2] [,3] [,4]
[1,] 5 2 -6 3
[2,] -2 -5 -3 6
[3,] -4 4 1 -1
>
> M1 + M2 # Matrices having the same dimensions
[,1] [,2] [,3] [,4]
[1,] 11 -1 -12 -1
[2,] -4 -2 -4 10
[3,] 1 5 -4 1
>
> c(0, 1, 10, 100) * M1 # A vector of length four and a matrix
[,1] [,2] [,3] [,4]
[1,] 0 -300 -60 -4
[2,] -2 0 -100 40
[3,] 50 1 0 200
>
> M1 / 2 # A matrix and a vector of length one
[,1] [,2] [,3] [,4]
[1,] 3.0 -1.5 -3.0 -2
[2,] -1.0 1.5 -0.5 2
[3,] 2.5 0.5 -2.5 1
$$
\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,p} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,p} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n,1} & a_{n,2} & \cdots & a_{n,p}
\end{bmatrix}
-
\begin{bmatrix}
b_{1,1} & b_{1,2} & \cdots & b_{1,p} \\
b_{2,1} & b_{2,2} & \cdots & b_{2,p} \\
\vdots & \vdots & \ddots & \vdots \\
b_{n,1} & b_{n,2} & \cdots & b_{n,p}
\end{bmatrix}
=
\begin{bmatrix}
a_{1,1}-b_{1,1} & a_{1,2}-b_{1,2} & \cdots & a_{1,p}-b_{1,p} \\
a_{2,1}-b_{2,1} & a_{2,2}-b_{2,2} & \cdots & a_{2,p}-b_{2,p} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n,1}-b_{n,1} & a_{n,2}-b_{n,2} & \cdots & a_{n,p}-b_{n,p}
\end{bmatrix}
\tag{1.13}
$$
$$
\begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix} x_1 +
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix} x_2 +
\begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix} x_3 =
\begin{bmatrix} u_1 & v_1 & z_1 \\ u_2 & v_2 & z_2 \\ u_3 & v_3 & z_3 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} =
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}
\tag{1.16}
$$
Matrix-vector multiplication, Ax, can be interpreted as a collection of weighted sums,
i.e., linear combinations of variables and constants, as well. In Equation 1.9, the dot product
operator compactly represents weighted sums of a row vector and a column vector. In
Equation 1.15 the weighted sum is computed for a collection of row vectors and a column
vector. That is, Equation 1.15 is a natural extension of Equation 1.9.
Matrix-vector multiplication, Ax, can also be interpreted as the scalar projections of
multiple row vectors in A onto a unit column vector x. Remember that dot product of two
vectors uT v represents the amount of u along the direction of v when v is a unit vector.
A matrix-vector multiplication is simply the dot product of each row vector in A by the
column vector x. Hence, Ax = b presents the scalar projections in a compact form when x
is a unit vector.
Another interpretation of Equation 1.15 involves systems of linear equations. A linear
equation is an equation of the form a1 x1 + a2 x2 + . . . + an xn + b = 0 such that the xi ’s are the
unknown variables, the ai ’s are the coefficients and b is the constant term. A system of linear
equations is a collection of linear equations defined over the same variables but with different
coefficients. A solution of a system of linear equations requires finding values for the
variables which simultaneously satisfy all the equations in the system. Problems involving systems
of linear equations in several variables frequently appear in science, engineering,
economics and daily life.
For example, Alice, Bob and Carol went to a restaurant for lunch. Alice ordered two
slices of pizza, two cookies and a soda and she paid $10 in total. Bob ordered four slices
of pizza, three cookies and one soda and he paid $16.5 in total. Carol ordered a slice of
pizza with two sodas and she paid $6.5 in total. While leaving the restaurant they had a
discussion on how much a cookie costs at the restaurant. This problem can be defined as
a system of linear equations. Let x1 , x2 and x3 be variables denoting the cost of a slice of
pizza, a cookie and a soda respectively. The bill for Alice can be mathematically expressed
as 2x1 + 2x2 + x3 = 10. Bob’s bill is expressed as 4x1 + 3x2 + x3 = 16.5. Lastly, Carol’s bill
is x1 + 2x3 = 6.5, which is equivalent to x1 + 0x2 + 2x3 = 6.5.
Together, the equations are expressed as the following system of linear equations.
$$
\begin{aligned}
2x_1 + 2x_2 + x_3 &= 10 \\
4x_1 + 3x_2 + x_3 &= 16.5 \\
x_1 + 0x_2 + 2x_3 &= 6.5
\end{aligned}
\tag{1.17}
$$
Equations in (1.17) can be compactly expressed as follows.
$$
\begin{bmatrix} 2 \\ 4 \\ 1 \end{bmatrix} x_1 +
\begin{bmatrix} 2 \\ 3 \\ 0 \end{bmatrix} x_2 +
\begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix} x_3 =
\begin{bmatrix} 10 \\ 16.5 \\ 6.5 \end{bmatrix}
$$
$$
\begin{bmatrix} 2 & 2 & 1 \\ 4 & 3 & 1 \\ 1 & 0 & 2 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} =
\begin{bmatrix} 10 \\ 16.5 \\ 6.5 \end{bmatrix}
\tag{1.18}
$$
$$
Ax = b
$$
where A is the matrix of coefficients, x is the vector of variables and b is the vector of
constant terms. R provides a function named solve to solve systems of linear equations as
shown in Figure 1.68. The solve function returns a vector such that the first entry of the
vector is the value of the first variable, x1 , second entry is the value of the second variable,
x2 and so on.
Note that a system of linear equations may have a single solution, infinitely many solu-
tions or no solutions at all.
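A minimal sketch of solving the restaurant system above with solve (not necessarily identical to Figure 1.68):
> A <- matrix(data=c(2, 4, 1, 2, 3, 0, 1, 1, 2), nrow=3, ncol=3)  # coefficient matrix
> b <- c(10, 16.5, 6.5)                                           # constant terms
> solve(A, b)   # prices of a slice of pizza, a cookie and a soda
[1] 2.5 1.5 2.0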
Alternatively, matrix-vector multiplication can be interpreted as a linear transformation
(also called linear map) of x from a finite dimensional vector space over the real numbers
Rp to another finite dimensional vector space over the real numbers Rn . A transformation
is similar to functions, f : X → Y, in algebra which map an input in X to an output in
Y . Typically, functions are defined as y = f (x). Similarly, a “matrix transformation”,
T : Rp → Rn , associated with an n × p matrix A, maps an input vector in Rp to an output
vector in Rn . Typically, a matrix transformation is defined as b = Ax (or Ax = b) such
that x ∈ Rp and b ∈ Rn .
A “linear transformation”, T , is a mapping of vectors from a vector space Rp to another
space Rn while preserving vector addition and scalar-vector multiplication operations. That
is, T : Rp → Rn such that T (u + v) = T (u) + T (v) and T (cv) = cT (v). It turns out
that every linear transformation T can be described as a matrix-vector product, Ax, for
x = [x1 , x2 , . . . , xp ]T as shown in Equation 1.19.
$$
\begin{aligned}
T\!\left(\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix}\right)
&= T\!\left(\begin{bmatrix} x_1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}\right)
 + T\!\left(\begin{bmatrix} 0 \\ x_2 \\ \vdots \\ 0 \end{bmatrix}\right)
 + \ldots
 + T\!\left(\begin{bmatrix} 0 \\ 0 \\ \vdots \\ x_p \end{bmatrix}\right) \\
&= T\!\left(x_1\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}\right)
 + T\!\left(x_2\begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}\right)
 + \ldots
 + T\!\left(x_p\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}\right) \\
&= x_1 T\!\left(\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}\right)
 + x_2 T\!\left(\begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}\right)
 + \ldots
 + x_p T\!\left(\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}\right) \\
&= x_1 T(e_1) + x_2 T(e_2) + \ldots + x_p T(e_p) \\
&= Ax
\end{aligned}
\tag{1.19}
$$
where the matrix A consists of the ordered column vectors obtained by transforming the standard
basis vectors, A = [T (e1 ) T (e2 ) . . . T (ep )]. Moreover, T (u + v) = A(u + v) = Au + Av =
T (u)+T (v) and T (cv) = A(cv) = cAv = cT (v). The matrix A is called the transformation
matrix for T . Equation 1.19 has an important conclusion: to obtain the n×p transformation
matrix, A, of a linear transformation T : Rp → Rn , one needs to apply T to the standard
basis of Rp and arrange the results as the columns of the matrix A. Alternatively, one can interpret
the columns of a transformation matrix as how a transformation T : Rp → Rn changes the
ordered standard basis vectors of the space Rp . Linear transformations preserve the zero vector
and the negative of a vector. That is, T (0) = 0 and T (−v) = −T (v). Note that a linear
transformation preserves only vector addition and scalar-vector multiplication operations;
it does not support the addition of a constant. A linear function in calculus, f (x) = ax + b,
supports the addition of a constant. In this sense, linearity in linear transformations is
stricter than linearity in functions, and a linear function is a linear map only when the intercept
of the function is zero.
Linear transformations are heavily used in image processing, game development and
video processing. They change the shape of a space along with all vectors in the space and
of course all objects represented by those vectors. Because they are linear transformations,
they preserve the parallel lines as parallel; preserve equal distance between the parallel
lines; and preserve the origin at 0. Figure 1.69 presents various linear transformations
of an image consisting of four colors along with their transformation matrices. The first
five transformations are from R2 to R2 and the last transformation is from R2 to R. The
transformation matrices are constructed by applying the transformations to the standard
basis of R2 , i.e., e1 = [1, 0]T and e2 = [0, 1]T , and arranging them column-wise. The original
image is shown in Figure 1.69a with identity transformation which does not change e1 and
e2 . In Figure 1.69b the original image is dilated by 1.5 times by multiplying both e1 and e2
by 1.5 . In Figure 1.69c the original image is reflected along the vertical axis by reflecting
e1 along the vertical axis and preserving e2 . In Figure 1.69d the original image is rotated
counter-clockwise by 60 degrees (π/3 radians) by rotating both e1 and e2 by 60 degrees.
Note that cos(π/3) = 1/2 and sin(π/3) = √3/2. In Figure 1.69e the original image is
sheared by fixing the horizontal axis and displacing each point by the amount of its signed
distance to the horizontal axis. Lastly, in Figure 1.69f the original two dimensional image
is projected onto the horizontal axis by preserving e1 but mapping e2 to 0.
Figure 1.69: Linear transformations of an image and their transformation matrices:
(a) Identity $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$,
(b) Dilation $\begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}$,
(c) Reflection $\begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix}$,
(d) Rotation $\begin{bmatrix} 1/2 & -\sqrt{3}/2 \\ \sqrt{3}/2 & 1/2 \end{bmatrix}$,
(e) Shear $\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$,
(f) Projection $\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$
$$
x_1\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} +
x_2\begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix} + \cdots +
x_p\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix} =
\begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix} = Ix
\tag{1.20}
$$
Although the standard basis is commonly used to express vector coordinates in Rp , other
bases are also possible. Any set B = {b1 , b2 , . . . , bp } can serve as a basis for Rp as long as
its vectors are linearly independent and span Rp . A vector x ∈ Rp can then be expressed in
terms of the basis B as
$$
x = x_1 b_1 + x_2 b_2 + \ldots + x_p b_p = B[x]_B
\tag{1.21}
$$
where bi ∈ B is the ith basis vector denoted in standard basis and matrix B consists of the
ordered column vectors of basis B. One can easily change the basis of a vector x ∈ Rp from
standard basis E = {e1 , e2 , . . . , ep } to another basis B = {b1 , b2 , . . . , bp } as shown in
$$
\begin{aligned}
B[x]_B &= x = I[x]_E \\
\therefore\; B[x]_B &= [x]_E \\
[x]_B &= B^{-1}[x]_E
\end{aligned}
\tag{1.22}
$$
where ∴ means “therefore” and the matrix B −1 is the inverse of the matrix consisting of the
ordered column vectors of basis B. Note that the matrix B is always invertible, because the
vectors constituting basis B are linearly independent. Equation 1.22 says that given a new
basis, inverting its basis matrix and left multiplying the coordinates of a vector in the standard
basis by this inverse simply gives the coordinates of the same vector in terms of the new basis.
Figure 1.70: (a) x with respect to bases E and B; (b) x with respect to basis E = {e1 , e2 }; (c) x with respect to basis B = {b1 , b2 }
Figure 1.70a shows vector x along with bases E and B. Both bases consist of linearly
independent vectors and span R2 . Moreover, they present two different coordinate systems
shown in red and blue colors. Figure 1.70b shows the coordinates of x = [2, 2]T with respect
to basis E = {e1 , e2 }. That is, x = 2e1 + 2e2 . Similarly, Figure 1.70c shows the coordinates
of x = [1, 1]T with respect to basis B = {b1 , b2 }. That is, x = 1b1 + 1b2 . Figure 1.71
presents the R code changing the basis of vector x in Figure 1.70 from basis E to basis B.
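Since Figure 1.71 is not reproduced here, a minimal sketch of such a change of basis is shown below. The basis vectors b1 = [2, 1]T and b2 = [0, 1]T are chosen only for illustration; they are picked so that x = [2, 2]T in the standard basis corresponds to [1, 1]T in basis B, consistent with Figure 1.70.
> x <- c(2, 2)                        # coordinates of x in the standard basis E
> B <- matrix(c(2, 1, 0, 1), nrow=2)  # columns are the basis vectors b1 and b2
> solve(B) %*% x                      # coordinates of x with respect to basis B
     [,1]
[1,]    1
[2,]    1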
The vector space Rp can have many bases such as C = {c1 , c2 , . . . , cp } or D = {d1 , d2 , . . . , dp }.
Given a vector x ∈ Rp with respect to basis B, it is possible to formulate the change of ba-
sis of x to basis C as the matrix-vector multiplication [x]C = M [x]B , as shown in
Equation 1.23.
$$
\begin{aligned}
C[x]_C &= x = B[x]_B \\
\therefore\; C[x]_C &= B[x]_B \\
[x]_C &= C^{-1}B[x]_B \\
[x]_C &= M[x]_B, \quad \text{such that } M = C^{-1}B
\end{aligned}
\tag{1.23}
$$
where ∴ means “therefore” and M is the matrix consisting of the column vectors of basis
B expressed in the basis C, i.e., [[b1 ]C , [b2 ]C , . . . , [bp ]C ]. Note that B −1 [x]E in Equation 1.22
changes the coordinates of x into basis B. Similarly, C −1 B in Equation 1.23 changes the
coordinates of every column vector of B, i.e., the basis vectors of B, from the standard basis to
basis C. Therefore, M in Equation 1.23 is nothing but the basis vectors of B expressed in
basis C.
$$
\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,p} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,p} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n,1} & a_{n,2} & \cdots & a_{n,p}
\end{bmatrix}
\begin{bmatrix}
b_{1,1} & b_{1,2} & \cdots & b_{1,m} \\
b_{2,1} & b_{2,2} & \cdots & b_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
b_{p,1} & b_{p,2} & \cdots & b_{p,m}
\end{bmatrix}
=
\begin{bmatrix}
a_{1,1}b_{1,1} + a_{1,2}b_{2,1} + \cdots + a_{1,p}b_{p,1} & \ldots & a_{1,1}b_{1,m} + a_{1,2}b_{2,m} + \cdots + a_{1,p}b_{p,m} \\
a_{2,1}b_{1,1} + a_{2,2}b_{2,1} + \cdots + a_{2,p}b_{p,1} & \ldots & a_{2,1}b_{1,m} + a_{2,2}b_{2,m} + \cdots + a_{2,p}b_{p,m} \\
\vdots & \ddots & \vdots \\
a_{n,1}b_{1,1} + a_{n,2}b_{2,1} + \cdots + a_{n,p}b_{p,1} & \ldots & a_{n,1}b_{1,m} + a_{n,2}b_{2,m} + \cdots + a_{n,p}b_{p,m}
\end{bmatrix}
\tag{1.24}
$$
• A0 = 0
• Matrix-matrix multiplication is commutative and associative for scalar multiplication,
i.e., A(cB) = (Ac)B = (cA)B = c(AB) = A(Bc)
• Scalar-matrix multiplication is distributive over scalar addition, i.e., (c + d)A = cA + dA
Transpose of matrices. The transpose operator flips a matrix over its diagonal by re-
placing its rows by its columns or equivalently, replacing its columns by its rows. The
transpose of a matrix A is denoted by AT . In R, the function t is used to compute the
transpose of a matrix as shown in Figure 1.73.
• (AT )T = A
• (A + B)T = AT + B T
• (AB)T = B T AT (note the order)
• (cA)T = cAT
Square matrices. A square matrix is a matrix which has an equal number of rows and
columns.
Diagonal matrices. A diagonal matrix is a square matrix which has zeros in all of its
entries except the principal diagonal. These matrices act as scaling matrices, because
they scale the elements of a vector by the amounts specified on their main diagonal in matrix-
vector multiplication. The transpose of a diagonal matrix is equal to itself. The inverse of a
diagonal matrix is another diagonal matrix with the reciprocals of the entries on the main
diagonal. The determinant of a diagonal matrix is the product of the elements on
the main diagonal. Due to these nice properties, it is often desirable to have a diagonal
matrix term in a matrix factorization. R uses the diag function to create diagonal matrices.
Figure 1.75 presents operations on diagonal matrices.
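As a minimal sketch of these properties (the diagonal entries below are illustrative and not necessarily those of Figure 1.75):
> D <- diag(c(2, 5, 10))   # diagonal matrix with 2, 5 and 10 on the main diagonal
> D
     [,1] [,2] [,3]
[1,]    2    0    0
[2,]    0    5    0
[3,]    0    0   10
> solve(D)                 # inverse: reciprocals on the main diagonal
     [,1] [,2] [,3]
[1,]  0.5  0.0  0.0
[2,]  0.0  0.2  0.0
[3,]  0.0  0.0  0.1
> det(D)                   # determinant: product of the diagonal entries
[1] 100
> D %*% c(1, 1, 1)         # scales the elements of the vector
     [,1]
[1,]    2
[2,]    5
[3,]   10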
Identity matrices. An identity matrix I is a diagonal matrix with ones on the main
diagonal and zeros elsewhere. The identity element of scalar multiplication is 1: when it is
multiplied by a real value, it leaves the real value unchanged, e.g., 3×1 = 1×3 = 3. Similarly,
when the identity matrix is multiplied by another matrix, it leaves that matrix unchanged, i.e.,
IA = AI = A. Note that the size of the identity matrix is implicitly inferred based on
whether it is left or right multiplied. If it is left multiplied by a matrix, i.e., IA, then its size
is the number of rows of A. If it is right multiplied by a matrix, i.e., AI, then its size
is the number of columns of A. In summary, for an n × p matrix A, I n A = AI p = A.
R uses the diag function with the size as a parameter to create identity matrices as shown
in Figure 1.76.
> diag(2)
[,1] [,2]
[1,] 1 0
[2,] 0 1
> diag(3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
Symmetric matrices. A symmetric matrix is a square matrix which is equal to its trans-
pose, i.e., S is symmetric ⇐⇒ S = S T . The non-principal-diagonal entries of a symmetric
matrix are symmetric with respect to the principal diagonal. By definition, diagonal ma-
trices, including the identity matrix, are symmetric. Symmetric matrices appear naturally
when representing relations such as “married to” and “works with” or “distance” and “similarity”.
In addition, for any matrix X, both XX T and X T X are always symmetric, although XX T is not
necessarily equal to X T X.
Covariance, correlation and cosine similarity matrices are well-known symmetric matri-
ces in data science. Given a centered dataset matrix X, the matrix S = (1/(n−1)) X T X is called
the covariance matrix. In some cases the relationships between variables are more important
than the exact variance/covariance values and the scalar term 1/(n−1) is skipped for a centered
dataset matrix X, i.e., S = X T X. Some texts call this matrix the covariance
matrix while others call it the sum of squares and cross products matrix, where the entries on the
principal diagonal are the sums of squares and the non-principal-diagonal entries are the
cross products. Given a standardized dataset matrix X, the matrix S = (1/(n−1)) X T X is called
the correlation matrix. Given a column-wise unit-scaled dataset matrix X, the matrix X T X
is called the cosine similarity matrix.
Real symmetric matrices have several important properties. First, all eigenvalues of a real
symmetric matrix are real. Second, the eigenvectors of different eigenvalues are orthogonal.
Third, for repeated eigenvalues there will be many independent eigenvectors and one can
use the Gram–Schmidt process to find an orthonormal basis for the eigenspace related to a
repeated eigenvalue. As a result, a real symmetric matrix, X p×p always has p orthonormal
eigenvectors that span Rp . The determinant of a square matrix (a symmetric matrix is
also square) is the product of its eigenvalues. A square matrix (a symmetric matrix is also
square) is invertible when its determinant is not zero, i.e., none of its eigenvalues is zero. The
inverse of an invertible (non-singular) symmetric matrix is also symmetric. If S and S 0 are
symmetric matrices so are S + S 0 and S − S 0 . More importantly, a symmetric matrix S can
be factorized into S = QΛQT where Q is the column matrix consisting of the orthonormal
eigenvectors of S and Λ is the diagonal matrix consisting of the eigenvalues of S on the main
diagonal. In other words, any symmetric matrix is diagonalizable, i.e., non-defective.
In fact a square matrix M p×p is diagonalizable (non-defective) when M has p linearly
independent eigenvectors. Then, M = QΛQ−1 where Q is the column matrix consisting of
the eigenvectors of M and Λ is the diagonal matrix consisting of the eigenvalues of M . Note
that when Q is an orthogonal matrix, then QT = Q−1 .
Orthogonal matrices. Two vectors are called orthogonal, if their dot product is zero,
i.e., they are perpendicular to each other in Euclidean space. These vectors are called
orthonormal when they are also unit vectors, i.e., their lengths are one. A real square matrix
is called orthogonal or orthonormal matrix, when its columns and rows are orthonormal
vectors. The columns of an orthogonal matrix, Qp×p , naturally form an orthonormal basis
of the Euclidean space Rp . Let Qp×p be an orthogonal matrix with columns {q1 , q2 , . . . , qp }.
Then, qiT qi = 1, because qi is a unit vector and qiT qj = 0, because qi and qj are orthogonal.
As a result, QQT = QT Q = I which also defines an orthogonal matrix. Moreover, the
inverse of an orthogonal matrix is equal to its transpose, i.e., Q−1 = QT . The determinant
of an orthogonal matrix is either 1 or -1 which implies orthogonal matrices represent rotation
and reflection transformations. Orthogonal matrices preserve the dot products of vectors,
i.e., uT v = (Qu)T (Qv) = uT QT Qv = uT Q−1 Qv = uT Iv, which also implies they preserve
vector lengths because the length of a vector is ||u|| = √(uT u). The identity matrix I is
an orthogonal matrix. The inverse, hence the transpose, of an orthogonal matrix is also
orthogonal. Products of orthogonal matrices are also orthogonal. Eigenvalues of orthogonal
matrices are ±1 and their eigenvectors are orthogonal. Due to these nice properties, many
matrix factorizations, such as QR decomposition, symmetric matrix eigen decomposition and
singular value decomposition, involve orthogonal matrices. Figure 1.78 presents operations
on orthogonal matrices.
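As an illustration (not necessarily the same operations as in Figure 1.78), a rotation matrix can be used to check these properties.
> theta <- pi/3
> Q <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow=2)  # rotation by 60 degrees
> round(t(Q) %*% Q, digits=10)   # Q^T Q = I for an orthogonal matrix
     [,1] [,2]
[1,]    1    0
[2,]    0    1
> all.equal(solve(Q), t(Q))      # the inverse equals the transpose
[1] TRUE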
Positive definite matrices. A real symmetric matrix S is called a positive definite ma-
trix if v T Sv is positive for any nonzero vector v. If one considers S as a linear transformation
matrix T : Rp → Rp where T (v) = Sv, then Sv will always point in the same
“general” direction as v. That is, the angle between v and its transformation Sv will always
be less than 90◦ . The dot product of two column vectors is related to the cosine of the angle
between them by uT z = ||u|| ||z|| cos(θ), where θ is the angle between u and z and ||u|| and
||z|| are their corresponding lengths. One can easily conclude that the vector v and its
transformation Sv point in the same general direction only if the angle θ in
v T (Sv) = ||v|| ||Sv|| cos(θ) is less than 90◦ .
When v is an eigenvector of S, then Sv = λv. Therefore, v T (Sv) = v T (λv) = λv T v =
λ||v||2 . Since ||v||2 is always positive, λ has to be positive to make v T Sv positive. This leads us to
an equivalent definition of positive definite matrices: a real symmetric matrix is
positive definite if and only if all of its eigenvalues are positive.
A real symmetric matrix S is called a positive semi-definite matrix if v T Sv is non-negative for any
vector v; that is, the angle between v and Sv can be at most 90◦ for a positive semi-definite
matrix S. Equivalently, a real symmetric matrix is positive semi-definite if and only if all of
its eigenvalues are non-negative. The inverse of a positive
definite matrix is also positive definite. Any positive definite matrix S has a unique Cholesky
factorization such that S = LLT where L is a real lower triangular matrix with positive entries
on the principal diagonal. Any symmetric matrix generated by S = M M T or S = M T M
is not always positive definite but it is always positive semi-definite. Therefore, a covariance
matrix is always positive semi-definite.
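A minimal sketch of checking positive definiteness through the eigenvalues (the matrix S below is made up for illustration):
> S <- matrix(c(2, -1, -1, 2), nrow=2)   # a small symmetric matrix
> eigen(S)$values                        # all eigenvalues positive, hence positive definite
[1] 3 1
> v <- c(1, 5)
> t(v) %*% S %*% v                       # v^T S v is positive for this nonzero v
     [,1]
[1,]   42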
1.2.4 Arrays
So far we have discussed vector and matrix objects used for storing values of the same mode
(data type) in one dimensional and two dimensional data structures, respectively. R allows
us to go beyond two dimensions by providing array objects. An array object in R is a data
structure for storing values of the same mode in any number of dimensions. The function
array has the signature array(data = NA, dim = length(data), dimnames = NULL). The
parameter data denotes the vector of values that we want to arrange in an array. The
parameter dim denotes the vector of dimensions. The parameter dimnames allows us to give
names to the dimensions.
To illustrate let us create a synthetic data set representing the cumulative GPA’s of
students based on grade (freshman,sophomore,junior,senior), scholarship status (scholarship,
no-scholarship) and gender (female, male).
> gpa.arr <- array(data=c(3.96, 2.53, 2.95, 1.65, 2.56, 3.71, 3.01, 3.65, 1.70,
3.41, 2.60, 1.84, 3.22, 2.59, 2.95, 3.70), dim=c(4,2,2))
> gpa.arr
, , 1
[,1] [,2]
[1,] 3.96 2.56
[2,] 2.53 3.71
[3,] 2.95 3.01
[4,] 1.65 3.65
, , 2
[,1] [,2]
[1,] 1.70 3.22
[2,] 3.41 2.59
[3,] 2.60 2.95
[4,] 1.84 3.70
Figure 1.79 shows how to create a three dimensional array where the first dimension
represents the grade, the second dimension represents scholarship status and the third di-
mension represents the gender. R displays the array as a series of matrices obtained by
subsetting the array along the last dimension using the integer indices. The last dimension
in our example is gender, where one represents female and two represents male students.
Additionally, one can use dimnames to provide names to the dimensions as shown in Figure 1.80.
> gpa.arr <- array(data=c(3.96, 2.53, 2.95, 1.65, 2.56, 3.71, 3.01, 3.65, 1.70,
3.41, 2.60, 1.84, 3.22, 2.59, 2.95, 3.70), dim=c(4,2,2))
> dimnames(gpa.arr) <- list(c("freshman", "sophomore", "junior", "senior"), c("
scholarship", "no-scholarship"), c("female", "male"))
> gpa.arr
, , female
scholarship no-scholarship
freshman 3.96 2.56
sophomore 2.53 3.71
junior 2.95 3.01
senior 1.65 3.65
, , male
scholarship no-scholarship
freshman 1.70 3.22
sophomore 3.41 2.59
junior 2.60 2.95
senior 1.84 3.70
Subsetting an array is not different from subsetting matrices. One can use integer indices
or names to subset an individual element or a portion in an array.
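Continuing with the gpa.arr object created above, a few illustrative subsetting operations are shown below.
> gpa.arr["junior", "scholarship", "female"]   # a single element by names
[1] 2.95
> gpa.arr[, , "male"]                          # the entire matrix for male students
          scholarship no-scholarship
freshman         1.70           3.22
sophomore        3.41           2.59
junior           2.60           2.95
senior           1.84           3.70
> gpa.arr[1, 2, 2]                             # the same array subset by integer indices
[1] 3.22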
1.2.5 Data Frames
A data frame is a data structure for storing data in a tabular layout where different columns
may have different modes. The function data.frame creates a data frame from a collection of
vectors and/or factors of the same length and uses the identifiers (names) of the vector objects
to name the columns. The rows are named by integer numbers as shown in Figure 1.81. The
data.frame function supports the parameter row.names to explicitly name the rows using a vector.
> name <- c("UL Lafayette", "LSU", "Tulane", "LA Tech", "Xavier")
> type <- as.factor(c("Public", "Public", "Private", "Public","Private"))
> enroll <- c(17195, 30451, 13531, 11015, 2926)
> carnegie <- as.factor(c("RU/H", "RU/VH", "RU/VH", "RU/H", NA))
> acceptance <- c(59.3, 75.0, 26.4, 70.8, 54.4) # Acceptance percentages
> univ <- data.frame(name, type, enroll, carnegie, acceptance)
> univ
name type enroll carnegie acceptance
1 UL Lafayette Public 17195 RU/H 59.3
2 LSU Public 30451 RU/VH 75.0
3 Tulane Private 13531 RU/VH 26.4
4 LA Tech Public 11015 RU/H 70.8
5 Xavier Private 2926 <NA> 54.4
Remember that matrix objects also arrange data in tabular fashion. However, matrices
require the mode of all columns to be the same. On the other hand, there is no such
requirement for data frames. In Figure 1.81 the variable name is a character string vector;
the variable type is a factor with two levels; the variable enroll is a numeric vector; the
variable carnegie is again a factor with two levels; and the variable acceptance is a numeric
vector.
The data.frame function coerces character string vectors into factors by default. In case one
wants to keep character string vectors as they are, he/she can set the stringsAsFactors
parameter of the data.frame function to logical false.
In addition to the data.frame function, R provides the expand.grid function to quickly
create a data frame from all combinations of the supplied vectors and/or factors. Figure 1.82
demonstrates the expand.grid function.
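As a minimal sketch (the size and topping vectors below are made up for illustration):
> expand.grid(size=c("S", "M", "L"), topping=c("veggie", "cheese"))
  size topping
1    S  veggie
2    M  veggie
3    L  veggie
4    S  cheese
5    M  cheese
6    L  cheese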
Furthermore, the function cbind allows us to add a new column to a data frame. Adding
a new column is a trivial operation because a column is a vector object consisting of values
of the same mode (data type). On the other hand, adding a new row into a data frame
is a more complex operation because a row instance consists of multiple values of different
modes. The simplest way to add a new row is to represent the new row as a data frame of
a single instance and use the rbind function to append it to the existing data frame object.
To delete a column one can subset the entire column and set it to NULL using the
assignment operator. To delete a row one can subset the data frame using a negative index
vector and assign it to itself. This method also works for removing columns.
Figure 1.84 shows several examples of data frame modification. Note that adding a new
row with a new factor level results in an update in the levels of the corresponding column
of the data frame.
> name <- c("UL Lafayette", "LSU", "Tulane", "LA Tech", "Xavier")
> type <- as.factor(c("Public", "Public", "Private", "Public","Private"))
> enroll <- c(17195, 30451, 13531, 11015, 2926)
> carnegie <- as.factor(c("RU/H", "RU/VH", "RU/VH", "RU/H", NA))
> acceptance <- c(59.3, 75.0, 26.4, 70.8, 54.4) # Acceptance percentages
> univ <- data.frame(type, enroll, carnegie, acceptance, row.names=name)
> univ
type enroll carnegie acceptance
UL Lafayette Public 17195 RU/H 59.3
LSU Public 30451 RU/VH 75.0
Tulane Private 13531 RU/VH 26.4
LA Tech Public 11015 RU/H 70.8
Xavier Private 2926 <NA> 54.4
>
> univ[3,2] # Subsetting individual elements by integer indices
[1] 13531
> univ[1, ] # Subsetting rows by integer indices
type enroll carnegie acceptance
UL Lafayette Public 17195 RU/H 59.3
> univ[, 4] # Subsetting columns by integer indices
[1] 59.3 75.0 26.4 70.8 54.4
> univ[c(2,4), c(1,2,3)] # Subsetting parts by integer indices
type enroll carnegie
LSU Public 30451 RU/VH
LA Tech Public 11015 RU/H
> univ[c("LSU", "LA Tech"),"type"] # Subsetting elements by names
[1] Public Public
Levels: Private Public
> univ$acceptance # Subsetting columns by the $ operator
[1] 59.3 75.0 26.4 70.8 54.4
> univ[univ$acceptance<50, ] # Subsetting rows by logical indices
type enroll carnegie acceptance
Tulane Private 13531 RU/VH 26.4
> name <- c("UL Lafayette", "LSU", "Tulane", "LA Tech", "Xavier")
> type <- as.factor(c("Public", "Public", "Private", "Public","Private"))
> enroll <- c(17195, 30451, 13531, 11015, 2926)
> carnegie <- as.factor(c("RU/H", "RU/VH", "RU/VH", "RU/H", NA))
> acceptance <- c(59.3, 75.0, 26.4, 70.8, 54.4) # Acceptance percentages
> univ <- data.frame(type, enroll, carnegie, acceptance, row.names=name)
> univ
type enroll carnegie acceptance
UL Lafayette Public 17195 RU/H 59.3
LSU Public 30451 RU/VH 75.0
Tulane Private 13531 RU/VH 26.4
LA Tech Public 11015 RU/H 70.8
Xavier Private 2926 <NA> 54.4
> univ[1,2] <- 18796
> mascot <- c("Cayenne", "Tiger", "Pelican", "Bulldog", "Gold")
> univ <- cbind(univ, mascot)
> ul.monroe <- data.frame(type="Public", enroll=8811, carnegie=NA, acceptance=92.0,
mascot="Warhawk", row.names="UL Monroe")
> univ <- rbind(univ,ul.monroe)
> univ$carnegie <- NULL
> univ
type enroll acceptance mascot
UL Lafayette Public 18796 59.3 Cayenne
LSU Public 30451 75.0 Tiger
Tulane Private 13531 26.4 Pelican
LA Tech Public 11015 70.8 Bulldog
Xavier Private 2926 54.4 Gold
UL Monroe Public 8811 92.0 Warhawk
Two very important functions, especially useful on data frames, are the str and summary
functions. The str function gives a brief description of the structure of the data frame.
The summary function provides summary statistics for each column in a data frame. Fig-
ure 1.85 shows the use of the str and summary functions.
> name <- c("UL Lafayette", "LSU", "Tulane", "LA Tech", "Xavier")
> type <- as.factor(c("Public", "Public", "Private", "Public","Private"))
> enroll <- c(17195, 30451, 13531, 11015, 2926)
> carnegie <- as.factor(c("RU/H", "RU/VH", "RU/VH", "RU/H", NA))
> acceptance <- c(59.3, 75.0, 26.4, 70.8, 54.4) # Acceptance percentages
> univ <- data.frame(type, enroll, carnegie, acceptance, row.names=name)
> univ
type enroll carnegie acceptance
UL Lafayette Public 17195 RU/H 59.3
LSU Public 30451 RU/VH 75.0
Tulane Private 13531 RU/VH 26.4
LA Tech Public 11015 RU/H 70.8
Xavier Private 2926 <NA> 54.4
> str(univ)
’data.frame’: 5 obs. of 4 variables:
$ type : Factor w/ 2 levels "Private","Public": 2 2 1 2 1
$ enroll : num 18796 30451 13531 11015 2926
$ carnegie : Factor w/ 2 levels "RU/H","RU/VH": 1 2 2 1 NA
$ acceptance: num 59.3 75 26.4 70.8 54.4
> summary(univ)
type enroll carnegie acceptance
Private:2 Min. : 2926 RU/H :2 Min. :26.40
Public :3 1st Qu.:11015 RU/VH:2 1st Qu.:54.40
Median :13531 NA’s :1 Median :59.30
Mean :15344 Mean :57.18
3rd Qu.:18796 3rd Qu.:70.80
Max. :30451 Max. :75.00
1.2.6 Lists
Vector, matrix and array objects in R require all of the stored data to be of the same mode
(data type). Data frame objects allow storing data of different modes together in tabular fashion;
however, the columns must have the same length. A list in R is a data structure that can hold
multiple objects of different modes or lengths. That is, one can put together vectors, factors,
matrices, data frames, functions and even other lists into a list. This flexibility allows us to
combine data that are loosely related to each other into a single object.
The function list is used to create lists of objects in R. It expects one or more objects
to be provided as arguments. In Figure 1.86 we first create a two by two matrix object
aMatrix, a vector of character strings consisting of four elements aVector and a factor of
five levels consisting of six elements aFactor. We then, create a list of objects consisting
of the objects aMatrix, aVector, aFactor along with another string vector of length one
created on-the-fly.
[[2]]
[1] "Louisiana" "Texas" "California" "Maine"
[[3]]
[1] 1 2 2 3 1 3
Levels: 1 2 3 4 5
[[4]]
[1] "Hello world"
Figure 1.87: Subsetting lists using double and single square brackets operators
Remember that the data.frame function implicitly uses the identifiers of the supplied vector objects
to name the columns in a data frame. The list function, however, does not implicitly name
the objects it contains. There are two ways to name the objects in a list. One approach is using the
names function on the left hand side of the assignment operator, “<-”, and providing a
character string vector having a name for each element in the list. The second and preferred
approach is to name the objects while calling the list function using identifier-object pairs
combined by the parameter assignment operator, “=”. In Figure 1.88 we use the second
approach to name the objects in the list. Please notice how the identifier-object pairs are
provided to the list function while creating a list in the figure. In Figure 1.88 we also
display the list on the terminal. One difference is that R uses “$” followed by the name of
the element instead of the integer index of the element within “[[ ]]” while displaying the list
on the terminal. In fact the “$” operator is the same operator that we used to retrieve the
columns in a data frame. The “$” operator can also be used to retrieve individual elements
of a list and it is equivalent to the double square brackets “[[ ]]” in terms of behavior. Note
that integer indices still work for subsetting the elements of a list even if the elements have
associated names.
R also allows subsetting the objects within a list by using multiple subsetting operators
in a row. For example, theList[[2]][3] accesses the third element of the object
located at index position two in the list theList. In case the objects have associated names,
an equivalent instruction would be theList$vec[3] as shown in Figure 1.89. Notice that
in both examples the first subsetting operator retrieves an object within the list and the
second subsetting operator retrieves an element of that object. Figure 1.89 demonstrates
several examples of subsetting objects in lists.
$vec
[1] "Louisiana" "Texas" "California" "Maine"
$fac
[1] 1 2 2 3 1 3
Levels: 1 2 3 4 5
$greet
[1] "Hello world"
>
> theList["vec"]
$vec
[1] "Louisiana" "Texas" "California" "Maine"
> theList[["vec"]]
[1] "Louisiana" "Texas" "California" "Maine"
> theList$vec
[1] "Louisiana" "Texas" "California" "Maine"
$fac
[1] 1 2 2 3 1 3
Levels: 1 2 3 4 5
$greet
[1] "Hello world"
$vec2
[1] 0.0 0.1 0.2 0.3 0.4 0.5
The lapply function applies a function, specified via the FUN parameter, to each element of a
list, and the elements of a list may themselves contain multidimensional objects. In that case, if the
function specified via the FUN parameter is applicable to the enclosed multidimensional object, the
lapply function will work without any problems. The lapply function always returns a list.
In Console 1.94 we first create a list consisting of a matrix and three vectors and then use
the lapply function to compute the sum of each object in the list. Note that the lapply
function returns a list. The function sapply is a wrapper around the lapply
function. Different from lapply, sapply simplifies the output if possible. In Console 1.94
the sapply function applies the same function, however it simplifies the output into a named
vector or table.
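A minimal sketch in the spirit of Console 1.94 is shown below; the list contents are made up and may differ from the figure.
> theList <- list(a=1:10, b=c(2.5, 3.5), d=matrix(1, nrow=2, ncol=3))
> lapply(theList, FUN=sum)   # returns a list of sums
$a
[1] 55

$b
[1] 6

$d
[1] 6

> sapply(theList, FUN=sum)   # simplifies the result into a named vector
 a  b  d
55  6  6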
The tapply function applies a function to a ragged multidimensional object. Typically it
is used on data frames to apply a function to groups split by a factor. The INDEX parameter
is used to specify the factor column for grouping.
In Console 1.95 we first create a data frame consisting of a numeric column and a factor
column. Then we use the tapply function to group the numeric column by the factor
column and apply the sum function.
> df <- data.frame("C1" = 1:5, "C2" = as.factor(c("A", "B", "B", "A", "B")))
> df
C1 C2
1 1 A
2 2 B
3 3 B
4 4 A
5 5 B
> tapply(X=df$C1, INDEX=df$C2, FUN=sum)
A B
5 10
There are other functions in the family. For example, vapply is similar to sapply
but it allows one to explicitly specify the type of the output. The mapply function is the
multivariate version of the sapply function.
The require function is an alternative to the library function for loading libraries in
R. The installed.packages function may also be used for listing the packages that are
already installed on the system. Different from the library function, installed.packages
provides more details about the installed packages.
> library("ggplot2")
> data()
...
...
> data(diamonds)
> head(diamonds)
> str(diamonds)
> summary(diamonds)
Figure 1.98: Importing data by read.table function and inspecting the features
The read.table function supports many parameters that can be set
to specific values. The read.csv and read.csv2 are two wrapper functions which are fre-
quently used for importing data sets formatted as comma separated values in tabular fashion.
The read.csv function assumes that the fields of the data are separated by commas. The
read.csv2 function assumes that the fields of the data are separated by semicolons. Fig-
ure 1.99 illustrates how to use the read.csv function to import the same data set imported
in Figure 1.98. In addition to read.table and its wrappers, R provides functions read.fwf
and scan for importing data in fixed format width and importing data as vectors or lists.
Figure 1.99: Importing data by read.csv function and inspecting the features
Loading data in native R format (.RData and .rda files) is easier. The load function allows us to
load a data file in native R format by setting its mandatory file parameter to the path of
the file.
Formats of other statistical software are called foreign formats. R supports foreign
formats through its foreign package. The package foreign provides several functions
for importing SPSS data files, STATA data files and SAS export files. The package gdata
provides functions for importing data from Excel spreadsheets. Although R supports importing
files in foreign formats, we strongly suggest using text files for migrating data sets. Almost
all statistical software and spreadsheet applications allow a user to export data in the comma
separated values (csv) format, which is a prevalently used text format.
The function write.table is used to export tabular data into a file. Similar to read.table,
the write.table function supports several parameters. The parameter file is a mandatory
parameter denoting the absolute or relative path of the destination file. It is a good practice
to ensure the working directory by the function getwd in case relative paths are used. The
parameter append=FALSE controls whether the new data is to be appended to an existing
file or not. In case the parameter is set to logical false, the existing file is overwritten.
The parameter eol="\n" denotes the end of line character to be used while writing the
data. The parameter na="NA" is used to control how to write NA (Not Available) values into
the file. Figure 1.99 shows how to export data with the write.table function. In the example,
a semicolon is used as the separator character instead of a comma, and it is specified by the sep
parameter of the write.table function.
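A minimal sketch of exporting a data frame and reading it back (the file name univ.txt is illustrative) could be:
> write.table(univ, file="univ.txt", sep=";", na="NA")        # semicolon-separated text file
> univ2 <- read.table(file="univ.txt", sep=";", header=TRUE)  # re-import the exported data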
Finally, the function write is used for writing out matrix-like data into files and the
function save is used to write any type of data into a file in native R format. Data exported
in native R format can be imported later by the load function.
1.5 R Scripts
So far we have been interacting with the R terminal by entering and running our commands.
One drawback of this approach is that you need to re-enter and execute the commands of
repetitive tasks. R, however, can also be used as a scripting language. That is, a sequence of
R instructions can be saved into a text file as a script and later the entire script can be
executed. One can use his/her favorite text editor to populate a file with R instructions
and save it to be executed later. RStudio also has a built-in text editor for creating R
script files. Although it is not a requirement, the .R extension is typically used for R script
file names. There are two ways to run your script files: (i) through the OS terminal and (ii)
through the R terminal. To run a script via the OS terminal, one needs to call the program
named Rscript with the path to the R script file. Note that the script file should
be marked as executable before running it via the OS terminal. To run a script via the
R terminal one needs to call the source function. The file parameter is required and it
denotes the path to a local script file or a URL of a remote script.
Figure 1.101a shows the content of a script file located in the Desktop folder. Fig-
ures 1.101b and 1.101c show the script being run via the OS terminal and the R terminal, re-
spectively.
(b) Running the script via the OS terminal (c) Running the script via the R terminal
1.6 Control Structures
Control structures alter the sequential flow of execution of R instructions in scripts and in
the functions introduced in the next section. Control structures typically start with a statement followed
by the instructions placed within the body of the statement. The body of the statement is
enclosed by two curly braces, i.e., { and }. Some control statements, such as break, do
not have bodies.
1.6.1 Conditionals
> n <- 0
> if(n < 0){
+ print("Negative Number")
+ } else if(n > 0){
+ print("Positive Number")
+ } else {
+ print("Neither Positive nor Negative")
+ }
[1] "Neither Positive nor Negative"
>
Since n is 0, the conditions of both the if and the else if structures evaluate to false and R executes the
body of the else structure, which prints “Neither Positive nor Negative”.
1.6.2 Loops
R supports various loop types, including for, while and repeat. Figure 1.103 initializes
a vector, v, and a variable named cum.sum which represents the cumulative sum. The for loop
initializes the variable i to 1 and repeatedly executes its body while updating i to the next value
in the sequence 1:length(v). The loop automatically terminates after i assumes the last value
in the sequence, i.e., the length of v. Inside the body of the loop, the current value of v is
added to cum.sum and its value is printed by the cat function followed by a space. Note
that one can use the print function instead of cat to print each value on a new line.
The console in Figure 1.104 is equivalent to the one in Figure 1.103, however it uses
a while loop instead of a for loop. Different from a for loop, a while loop requires a
condition at the beginning and it repeatedly evaluates the condition before executing its
body. When the condition evaluates to false, the while loop automatically ends. In the figure,
the variable i is initialized to 1 and it is incremented by 1 in the body of the loop. When the
value of i exceeds the length of v, the condition of the while loop evaluates to false and the loop
ends automatically.
A repeat loop is similar to a while loop, except it does not have a condition specified at
the beginning. The lack of an explicit condition at the beginning requires the programmer
to explicitly break the repeat loop to prevent it from running infinitely. To break the repeat
loop R provides the break statement, which can also be combined with other types of
loops to prematurely end their repetitive executions.
The console in Figure 1.105 is equivalent to the one in Figure 1.104, however it uses a
repeat loop instead of a while loop. Different from a while loop, the repeat loop does
not require a condition at the beginning and it repetitively executes its body until a break
statement is executed. The body of the repeat loop is similar to the body of the while loop
presented in Figure 1.104. The only difference is that the repeat loop has a conditional
statement that explicitly breaks the loop when i is greater than the length of v.
The next statement is another control structure used with loops. When a next statement
is executed it simply skips all instructions following the next in the body of the loop and
takes the control back to the beginning of the loop.
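A minimal sketch of a for loop computing a cumulative sum, together with the next statement, is shown below; the vector v is illustrative and may differ from the one used in Figures 1.103 through 1.105.
> v <- c(2, 4, 6, 8, 10)
> cum.sum <- 0
> for(i in 1:length(v)){
+   cum.sum <- cum.sum + v[i]
+   cat(cum.sum, " ")        # print the running total followed by a space
+ }
2  6  12  20  30
> for(i in 1:length(v)){
+   if(v[i] == 6) next       # skip the value 6 and continue with the next iteration
+   cat(v[i], " ")
+ }
2  4  8  10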
Lastly, the return statement which is used to return objects from functions is covered
in the next section.
1.7 Functions
An R function definition starts with the keyword function followed by parentheses for pa-
rameter declarations. Note that no parameters are declared in the greeting function defined
in Terminal 1.106. The body of the function is delimited by curly braces. Simply, the body
of a function encloses a sequence of R statements and control structures that perform a task
all together. In the previous example the body of the function consists of a single state-
ment calling the built-in print function. Finally, the definition of the function needs to be
assigned to an R object in order to be called later. In the previous example the function is
assigned to the variable greeting which also serves as the name of the function.
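A minimal sketch of such a parameterless function (the exact body shown in Terminal 1.106 may differ) is the following.
> greeting <- function(){
+   print("Hello, R!")   # the body consists of a single statement
+ }
> greeting()
[1] "Hello, R!"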
Figure 1.107: R function structure. Brackets are used to refer to optional elements
Terminal 1.107 shows the general structure of an R function. Note that the brackets
are used to refer to the optional elements of a function. Although the previous example
(Terminal 1.106) does not require any arguments provided by the caller to perform
its task, some functions may require one or more arguments to be passed. For example,
a function calculating the mean and standard deviation of a vector needs the vector to be
passed as an argument. Parameters are lists of placeholders for arguments. Parameters
are defined between the parentheses in the definition of a function using the form param1,
param2, ... . R supports C++ style parameters with default values. One needs to use the
format param = value in the function definition to set a default value for the parameter.
Placing parameters with default values arbitrarily among the other parameters can result
in ambiguity for the R interpreter when arguments are matched by position. As a rule of
thumb, parameters with default values should be placed at the end of the list of parameters
in the function definition. Note that programmers use meaningful names for
parameters rather than param1.
Arguments are passed to R functions by value. That is, a copy of each argument is created and
passed to the function to keep the objects in the parent environment intact. This poses
a problem in terms of efficiency and memory space, especially when it comes to passing large
datasets to functions. R minimizes the severity of the problem by not creating a copy of
the passed object until it is modified.
Some functions perform calculations over arguments and need to return the results of
their calculations to the caller environment. This is done by using a special R function
called return. return takes a single argument and returns it to
the caller environment. Multiple values are returned to the caller environment by using
existing compound data types such as a list, array or data frame.
In the following we implement a function called center_spread that computes and returns
the mean and standard deviation or the median and the inter quartile range of a vector.
Note that R by default returns the last evaluated value even if there is no explicit return
statement. However, explicitly using return for functions that return a value is
suggested for code clarity.
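A minimal sketch of center_spread along these lines is shown below. The robust parameter used to switch between the two summaries is an assumption made here; the actual definition in the accompanying figure may differ.
> center_spread <- function(x, robust=FALSE){
+   if(robust){
+     center <- median(x)
+     spread <- IQR(x)    # inter quartile range
+   } else {
+     center <- mean(x)
+     spread <- sd(x)     # standard deviation
+   }
+   return(list(center=center, spread=spread))
+ }
> center_spread(c(1, 2, 3, 4, 100))$center
[1] 22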
Sometimes one needs to define a function and pass it to another function as an argu-
ment. In case those functions are needed only once it is better to define them on the fly as
anonymous functions. Anonymous functions are not assigned to an object and mostly they
are defined on a single line where individual statements are separated by space. Another
unusual practical aspect of anonymous functions in R is that mostly the body of the function
is not enclosed with curly braces and the return function is not explicitly used. Yet, it is
recommended to use curly braces around the body of the function and to explicitly add the
return function (if needed) in order to improve code readability.
R also supports passing an arbitrary number of arguments to a function through the ellipsis (...) parameter. The ellipsis parameter is especially useful for passing arguments on to functions that are called inside the body of a function. The ellipsis parameter is mostly combined with regular parameters by placing it at the end of the parameter list. Ellipsis arguments can be processed within the body of a function by converting them to a list.
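As a small sketch (the function name summarize_all is only illustrative), the ellipsis arguments can be captured with list(...) and processed one by one:
> summarize_all <- function(...) {
+   args <- list(...)      # the ellipsis arguments as a list
+   sapply(args, mean)     # the last evaluated value is returned
+ }
> summarize_all(c(1, 2, 3), c(10, 20))
[1]  2 15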
1.8 Exercises
Vectors and Factors
1. Create and print the following vectors
(a) [100, 99, 98, . . . , 3, 2, 1] .
(b) [1, 3, 5, . . . , 95, 97, 99] .
(c) [1, 3, 5, . . . , 45, 47, 49, 2, 4, 6, . . . , 46, 48, 50] .
(d) [1, 3, 5, 1, 3, 5, . . . , 1, 3, 5, 1, 3, 5] such that there are 20 occurrences of 1, 20 occur-
rences of 3 and 20 occurrences of 5.
(e) [7, 7, . . . , 7, 9, 9, . . . , 9, 11, 11, . . . , 11] where there are 20 occurrences of 7, 40 occurrences of 9 and 80 occurrences of 11.
2. First, create the vector x = [1.0, 1.1, 1.2, . . . , 1.8, 1.9, 2.0]
(a) Create and print a vector of the function $e^x \sqrt{|\cos(x)|}$, where $|\cos(x)|$ is the absolute value of $\cos(x)$.
9. Create a vector x = [1, 2, 3, 4, . . . , 46, 47, 48, 49, 49, 48, 47, 46, . . . , 4, 3, 2, 1]
(a) Print the indices of the elements of x that are divisible by 3.
(b) Print the elements of x that are divisible by 3.
(c) Print the indices of the elements of x that are divisible by 2 but not by 5.
(d) Print the elements of x that are divisible by 2 but not by 5.
10. Use the paste function to create the character vector ["Player-1", "Player-2", "Player-3",
. . ., "Player-19"] and print it.
11. Generate the vector of characters rating = ["Good", "Mediocre", "Bad", "Bad", "Good",
"Good", "Mediocre", "Bad", "Good", "Good", "Good", "Bad", "Bad", "Mediocre", "Bad",
"Good"] representing the food ratings given by customers at a restaurant. Generate
another vector of integers gender = [1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2] representing
the genders of customers who rated.
(a) Convert and print the rating vector into an ordered factor with four levels, Bad < Mediocre < Good < Excellent, even though none of the customers rated the food “Excellent”.
(b) Convert and print the gender vector into a labeled factor with two levels: Male and Female.
(c) Use the table function to print a contingency table of the rating and gender
objects together. Explain the pattern you see in the contingency table.
Matrices
12. Considering the following matrix
$$M = \begin{bmatrix} 5 & -3 & 2 \\ 15 & -9 & 6 \\ 10 & -6 & 4 \end{bmatrix}$$

$$Q = \begin{bmatrix} 0 & 7 \\ -4 & 9 \end{bmatrix} \qquad R = \begin{bmatrix} 3 & 8 \\ 8 & -6 \end{bmatrix}$$
(a) Show that matrix multiplication satisfies the associativity rule, i.e., (PQ)R =
P(QR) .
(b) Show that matrix multiplication over addition satisfies the distributivity rule,
i.e., (P + Q)R = PR + QR .
(c) Show that matrix multiplication does not satisfy the commutativity rule in general, i.e., PQ ≠ QP.
(d) Generate a 2 × 2 identity matrix, I. Note that the 2 × 2 identity matrix is a
square matrix in which the elements on the main diagonal are 1 and all other
elements are 0. Show that for a square matrix, matrix multiplication satisfies the
rules PI = IP = P .
16. Solve the following system of linear equations using matrix algebra and print the
results for unknowns.
x + y + z = 6
2y + 5z = −4
2x + 5y − z = 27
17. Use the outer function to generate and print the following matrix
$$M = \begin{bmatrix} 0 & 1 & 2 & 3 & 4 \\ 1 & 2 & 3 & 4 & 5 \\ 2 & 3 & 4 & 5 & 6 \\ 3 & 4 & 5 & 6 & 7 \\ 4 & 5 & 6 & 7 & 8 \end{bmatrix}$$
18. Compute and print the final result of the following series. Hint: You may use the
outer function to generate a matrix representation of the partial results.
$$\sum_{x=1}^{24} \sum_{y=1}^{12} \frac{x^2}{y+1}$$
19. Compute and print the final result of the following series. Hint: You may use the
outer function to generate a matrix representation of the partial results.
$$\sum_{x=1}^{24} \sum_{y=1}^{12} \frac{x^2}{xy+1}$$
Data Frames
20. Use the following tabular data consisting of 6 records to create a data frame object
named cars and print it. The data consists of 7 different features of 6 cars. You may
use vectors along with the data.frame function to create the cars object.
21. Print the output of the str function for the cars data frame.
22. Print the number of rows and the number of columns of cars.
23. Update the cars data frame by adding (“Toyota Corolla” 33.9 4 71.1 65 4.22 1.835).
Print the updated data frame. In the following questions always use the updated cars
object.
30. Use the table function to print the frequency distribution of the cars in your dataset
based on the carb values.
31. Use the table and cut functions to print the frequency distribution of the cars falling into bins of 2 mpg starting from 16. That is, bin the mpg values into intervals of two starting from 16.
32. Create and print a new data frame object called fast.cars by subsetting only qsec
and gear columns and including the cars only with 110 hp or higher.
Packages
33. Install the package AER on your computer, if you have not done so before. Next, load the package and print the output of the package loading process.
34. Load the dataset named NMES1988 in the AER package and print the output of the
loading process.
35. In your own words, explain what this dataset is about after doing some research about
the dataset.
36. Print the output of the str function on NMES1988. Also, use the help function to
explain the variables of NMES1988 dataset in your own words.
37. Use the table function to display the frequency distribution of the school variable in
NMES1988. Explain the pattern that you see in the distribution.
38. Create a new data frame object called NMES1988.SUB by subsetting all numeric vari-
ables. Print the first few records of NMES1988.SUB using the head function.
39. Use one of the apply family functions, to compute and print a vector of averages of
all columns of NMES1988.SUB. Note that you can use the mean function for computing
the averages.
Chapter 2
Counting
Counting is defined as the determination or estimation of a quantity. In its simplest form, counting is determining the number of units. An indirect example would be measuring the height of a person: while measuring the height of a person we are in fact counting the number of inches (centimeters). Similarly, weighing is nothing but counting the number of pounds or ounces (kilograms or grams).
Ranking
Ranking is defined as the assignment of a position to a unit. The assigned positions
can be represented by symbols, e.g., good-mediocre-bad, or numbers, e.g., 1-2-3. For
example one may order fruits starting from his most favorite to the least favorite.
Similarly, one may rank universities according to their research expenditures.
Classifying
Classifying is defined as sorting units into predefined categories. For example a person can be categorized according to his blood type: A, B, AB and O. Units can also be
cross classified. For example universities can be categorized as national or regional
universities and public or private.
The term variable has different but related meanings in mathematics, computer science
and statistics. In mathematics a variable is a symbol representing qualitative or quantitative
data. In computer science a variable is a storage location associated with a symbolic name
(identifier) that contains qualitative or quantitative data. In data analysis or statistics
context, a variable is an attribute that is being measured on one or more units. For example
height or weight of a person, the rank of a university, blood type of a person are all variables.
The values of variables may change in time, under different conditions or from one unit to
another.
Variables are mathematical tools allowing us to modify, transform or control measured
attributes. In experimental research there are two types of variables: independent and depen-
dent variables. Independent variables are variables that can be controlled, i.e., manipulated
or modified, in a model or equation.
Nominal Scale
Nominal scale variables take descriptive (qualitative) values. Examples of nominal
scale variables are: gender (male, female); marital status (single, married, divorced);
major (Biology, Economics, Informatics etc); shirt numbers of sports players (1,2,3 ...).
Nominal scale variables having only two possible values such as gender (male, female)
or coin toss (tails, heads) are also called dichotomous variables. Notice that nominal
scales are purely descriptive. We cannot apply relational operators such as less than
(<) or greater than (>) nor can we apply arithmetic operators such as addition (+),
subtraction (−), multiplication (∗) or division (/). That is, a player with shirt number
8 being greater than another player with shirt number 4 does not make sense nor does
adding Biology to Economics. Sometimes, nominal scales are represented by numbers. Although we can use 0 and 1 to represent the tails and heads outcomes of a coin toss, respectively, the numbers merely serve as labels; the variable remains nominal. Finally, counting the occurrences of a nominal value is different from the nominal variable itself. For example a variable denoting the number of female students is different from a variable denoting the gender of a student.
The mode is a common statistic applied to nominal scale variables to measure central tendency.
Ordinal Scale
Ordinal scale variables take descriptive (qualitative) or numerical (quantitative) val-
ues. The values that an ordinal scale variable takes can be ordered from high to
low. Examples of ordinal scale variables are: ranks at Olympic games (1st , 2nd , 3rd );
consumer satisfaction level (very satisfied, satisfied, somewhat satisfied, somewhat
dissatisfied, dissatisfied, very dissatisfied); statement agreement level (strongly agree,
agree, neutral, disagree, strongly disagree). Although the values of an ordinal scale
variable can be ranked, consecutive values do not hold equal intervals. Hence, the
amount of distance between any two values is not meaningful or interpretable. That
is, the performance difference between the athletes who came in 1st and 2nd place is not equal to the performance difference between the athletes who came in 2nd and 3rd place. Similarly, the differences among the consecutive levels of consumer satisfaction are not equal. Although relational operators such as less than (<) or greater than (>) make sense for ordinal scale variables, arithmetic operators do not.
Mode and median are common statistics applied to ordinal scale variables to measure
central tendency.
Rating Scale
Rating scale variables are an extension of ordinal scale variables where the possible values are represented by numbers. For example, statement agreement level (strongly agree, agree, neutral, disagree, strongly disagree) can be represented by numbers such that the values strongly agree, agree, neutral, disagree, and strongly disagree correspond to 5, 4, 3, 2, 1, respectively. Another example is the letter grades A, B, C, D, and F mapped to the numbers 4.0, 3.0, 2.0, 1.0, and 0.0, respectively. Although the consecutive numbers in both examples seem to be equally spaced, the resemblance to an interval scale is illusory when one considers the ordinal scales that these numbers represent in reality.
Mode and median are common statistics applied to rating scale variables to measure central tendency. The average as a measure of central tendency should be avoided or, at best, applied cautiously.
Interval Scale
Interval scale variables take numerical (quantitative) values such that the intervals
between consecutive values are equal. Interval scale variables can not only be ordered, but the distances between their values are also interpretable. Put in other words, interval scale variables not only tell which value is greater than the other but also tell the magnitude of the difference between them. Examples of interval scale measurements
are: temperature degree in Fahrenheit (Celsius); IQ (Intelligence Quotient) score;
latitude degree (from −90◦ to +90◦ ); date. The 5◦ temperature difference between 15◦
and 20◦ is the same as the 5◦ temperature difference between 40◦ and 45◦ . Relational
operators are applicable to interval scale variables. Although addition and subtraction
are applicable to interval scale variables, multiplication and division are not applicable.
That is, 30◦ is not twice as hot as 15◦; similarly, a person with an IQ score of 150 is 75 points above a person with a score of 75 but is not twice as smart.
Mode, median, arithmetic average are common statistics applied to interval scale vari-
ables to measure central tendency. Range, variance, standard deviation are statistics
applied to interval scale variables to measure dispersion.
Ratio Scale
Similar to interval scale variables ratio scale variables take numerical (quantitative)
values such that the intervals between consecutive values are equal. Furthermore,
ratio scale variables have a true zero point denoting the absence of the quantity being
measured. That is, 0 simply denotes the lack of quantity in ratio scale variables.
Examples of ratio scale measurements are: height, weight, age, number of students in
a classroom. Ratio scale variables allow us to compare values with respect to each other. For example a classroom with 40 students is twice as crowded as a classroom with 20 students. Similarly, a person who is 25 years old is half as old as a person who is 50 years old. Both relational and arithmetic operators are applicable to ratio
who is 50 years old. Both relational and arithmetic operators are applicable to ratio
scale variables. The reason that temperature degree in Fahrenheit or Celsius is not
ratio scale is because it does not have a true zero value. That is, 0◦ does not imply
the absence of thermal motion. Instead, −459◦ and −273◦ correspond to the lack
of thermal motion in Fahrenheit and Celsius scales respectively. On the other hand
Kelvin degree for temperature measurement has absolute zero at which all thermal
motion stops. That is, 0 K is a true zero on the Kelvin scale. Similarly, a date in its common usage is an interval scale because date zero does not refer to the absence of time, whereas age is a ratio scale because age zero implies that no time has been lived.
Mode, median, arithmetic/geometric/harmonic mean are common statistics applied
to ratio scale variables to measure central tendency. Range, variance, standard devi-
ation, coefficient of variation are statistics applied to ratio scale variables to measure
dispersion.
Finally, two important characteristics of measurement are reliability and validity. Re-
liability implies that the measurement of a property of a unit results in the same
value with an acceptable amount of error under the same conditions. For example
repeatedly measuring the height of a person should return the same result (with an
acceptable error) under the same conditions. Validity implies that a measurement
procedure measures what it is supposed to measure. For example one cannot weigh a
person to measure his blood type.
2.2 Distributions
2.2.1 Empirical Distributions
A collection of data observed or measured on a set of units is simply called a distribution.
Figure 2.1a shows the distribution of the days it takes to service a car at a busy mechanic
shop for 256 cars. Figure 2.1b shows the distribution of the March electricity bills (in
dollars) of a neighborhood consisting of 400 houses. Note that due to the large size of
the collection we only display the lowest and highest twelve values. Basically, Figure 2.1a and Figure 2.1b demonstrate how the service times (in days) are distributed over 256 cars and how the electricity bill amounts are distributed over 400 houses, respectively. We use
distributions to understand, analyze and infer various characteristics of the set of units on
which the measurements are collected. Although the concept of a distribution is simple, it
becomes a powerful tool when visualization techniques and statistical methods are used to
summarize, model and interpret distributions.
A common method for visualizing distributions is to (i) divide the range of measurements
into intervals; (ii) count the number of measurements falling into each interval and (iii)
show a graphic demonstrating the intervals, the number of measurements in each interval
(frequency) and the relationship between intervals. Figure 2.2 visualizes the car service time
Figure 2.1: (a) Service time distribution (days); (b) Electricity bill distribution ($)
distribution and the electricity bill distribution given in Figure 2.1a and Figure 2.1b using
different types of graphics. Figures 2.2a through 2.2f divide the domain of service time into equal intervals of one day. Figures 2.2g through 2.2l divide the domain of electricity bill amounts into equal intervals of eight dollars.
Figure 2.2a and Figure 2.2g show the distributions as frequency histograms where each
vertical rectangle represents an interval and the number of measurements falling into that
interval. The rectangles are positioned at the middle of each interval on the x axis and the
heights of the rectangles on the y axis represent the number of the measurements falling
into the interval. The sum of the frequencies (y axis) of all intervals is the total number
of measurements in the data set for each histogram. Figure 2.2b and Figure 2.2h show the
distributions as frequency polygons where the middle-top points of the frequency histograms
are joined together using lines.
Figure 2.2c and Figure 2.2i show the distributions as relative frequency histograms where the y axis represents the relative frequencies (fractions) of the number of measurements falling into each interval. The relative frequency of an interval is calculated by dividing the number of measurements falling into the interval by the total number of measurements. As a result, the sum of the relative frequencies (y axis) of all intervals is 1 for each relative frequency histogram. Figure 2.2d and Figure 2.2j show the distributions as relative frequency polygons. A relative frequency polygon uses relative frequencies on the y axis instead of absolute frequencies.
Figure 2.2e and Figure 2.2k show the distributions as density histograms where the y axis represents the densities of the intervals. The density of an interval is calculated by dividing the relative frequency of the interval by the width of the interval. As a result, the total area of all rectangles is 1 for each density histogram. Note that, specific to these cases, the heights of the rectangles (y axis) in Figure 2.2e and Figure 2.2k give relative information about the different intervals only because the intervals have equal width. The areas of the rectangles of density histograms, on the other hand, always give relative information about the different intervals even if the intervals had different widths. We strongly suggest developing the habit of thinking in terms of areas rather than heights while analyzing density histograms. Finally, Figure 2.2f and Figure 2.2l show a smooth density curve estimate of the car service time distribution and
electricity bill distribution, respectively. The density curve is an approximation function of
the density histogram where the total area under the curve is always 1.
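Such graphics can be produced with base R functions; a minimal sketch, assuming a hypothetical vector service.time holding the 256 service-time measurements (days):
# frequency histogram with one-day intervals (assumes all values fall in [0, 35])
hist(service.time, breaks = seq(0, 35, by = 1), freq = TRUE)
# density histogram with the same intervals
hist(service.time, breaks = seq(0, 35, by = 1), freq = FALSE)
# smooth density curve estimate superimposed on the density histogram
lines(density(service.time))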
Density histograms suffer from the selection of interval widths in case the data do not
have natural bins or classes. That is the shape of the density histogram is quite sensitive to
the interval widths and if the data does not have natural classes the histogram shape might
Figure 2.2: Visualizing the service time (days) and electricity bill ($) distributions. Panels (a)–(f) show the service time distribution and panels (g)–(l) show the electricity bill distribution, each as a frequency histogram, a relative frequency histogram, a density histogram, a frequency polygon, a relative frequency polygon, and a smooth density curve estimate (x axes: service time in days or electricity bill amount in dollars; y axes: frequency, relative frequency, or density).
be misleading. One day is a natural interval length for the car service time distribution; on the other hand, one may pick an interval length other than eight dollars to plot the density histogram of the electricity bill distribution. Density curve estimates, however, are approximation functions and they may better reflect the true empirical density of a collection of data. At the same time, density curve estimates are susceptible to noise in the data and require a fair amount of data in order to make a proper approximation. Hence, visually superimposing density histograms and density curve estimates might reveal more information about a distribution, as shown in Figure 2.3.
Figure 2.3: Density histograms superimposed with smooth density curve estimates. (a) Service time distribution (days); (b) Electricity bill distribution ($)
Using density histograms and density curves may seem counterintuitive at first because one has to think in terms of areas under rectangles and curves rather than their heights. However, density histograms and curves have several advantages. First of all,
frequency and relative frequency distort the shape of the distribution when the intervals are
not equal width. Secondly, areas under densities allow us to interpret distributions in terms
of probabilities. Thirdly, theoretical distributions appearing in the fields of probability and
statistics are defined in terms of densities.
By calculating the areas of density histograms through summation or the areas under
density curves through integration we investigate
• the fraction of data falling into a particular interval or region of the density histogram
or curve
• the fraction of data that are smaller than or equal to a particular value
• the fraction of data that are greater than or equal to a particular value
In Figure 2.4 we show various fractions of the car service time distribution represented
by the areas under density histograms and curves. Figure 2.4a and Figure 2.4d demonstrate
the fraction of data that are less than or equal to three, 0.46. That is, 46% of the cars left
at the mechanic shop are serviced within three days. Figure 2.4b and Figure 2.4e present
the fraction of data that are inclusively between ten and twenty, 0.14. That is, it takes ten
to twenty days to fix 14% of the cars left at the mechanic shop. Lastly, Figure 2.4c and
Figure 2.4f show the fraction of data that are greater than or equal to fifteen, 0.07. That is, 7% of the cars require fifteen or more days to be fixed.
Figure 2.4: Various fractions of the car service time distribution represented by the areas
under density histograms and curves
To investigate data sometimes people use an alternative but equivalent way of describing
a distribution called empirical cumulative distribution function (ECDF ) or empirical distri-
bution function. An empirical cumulative distribution function shows the fraction of data
that are smaller than or equal to a particular value. Both density histograms and ECDFs convey the same information, where the former uses area and the latter uses height to represent the fraction of data that are smaller than or equal to a particular value. People tend to perceive height more quickly and accurately than area when interpreting a fraction of data, hence an ECDF might be preferable in such cases. On the other hand, density histograms reveal more information about the shape of a distribution, e.g., central tendency, dispersion, skewness, gaps. Finally, by the Glivenko-Cantelli theorem, ECDFs approximate the true underlying CDFs well when the data size is large, converging to them as the sample size grows.
Figure 2.5 shows the ECDFs of the distributions given in Figure 2.1a and Figure 2.1b. ECDFs are step functions that jump by 1/n at each data point in a data set of size n. Notice that the steps in the service time ECDF (Figure 2.5a) are more obvious because the data size is smaller compared to the data size of the electricity bill distribution.
To demonstrate that ECDFs and density histograms or curves convey the same informa-
tion represented as height and area, respectively, we plotted the fractions shown in Figure 2.4
as the heights of ECDFs in Figure 2.6. Figure 2.6a demonstrates the fraction of data that
are less than or equal to three, 0.46. That is, 46% of the cars left at the mechanic shop
are serviced within three days. Figure 2.6b presents the fraction of data that are inclusively
between ten and twenty, 0.14. That is, it takes ten to twenty days to fix 14% of the cars left
Figure 2.5: ECDFs representing the service time (days) and electricity bill ($) distributions
at the mechanic shop. Notice that 0.14 is the difference between the ECDF values at twenty and ten. Lastly, Figure 2.6c shows the fraction of data that are greater than or equal to fifteen, 0.07. That is, 7% of the cars require fifteen or more days to be fixed. Notice that 0.07 is the difference between 1.0 and the ECDF value at 15.
Figure 2.6: Various fractions of the car service time distribution represented by the heights
of ECDFs
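A hedged sketch of how such fractions could be read off an ECDF in R, assuming a hypothetical vector service.time holding the service-time data (the 0.46, 0.14 and 0.07 values are the ones quoted in the text):
F <- ecdf(service.time)   # empirical cumulative distribution function
F(3)                      # fraction of cars serviced within three days (0.46)
F(20) - F(10)             # approximate fraction between ten and twenty days (0.14)
1 - F(15)                 # approximate fraction needing more than fifteen days (0.07)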
$$p(y) = \binom{11}{y}\, 0.7^{y}\, 0.3^{11-y} \qquad (2.1)$$

Figure 2.7: (a) Sample space and probabilities of Y; (b) p.m.f. graphics

 y:    0            1            2            3            4            5
 p(y): 1.771470e-06 4.546773e-05 5.304569e-04 3.713198e-03 1.732826e-02 5.660564e-02
 y:    6            7            8            9            10           11
 p(y): 1.320798e-01 2.201330e-01 2.568219e-01 1.997504e-01 9.321683e-02 1.977327e-02
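The probabilities listed in Figure 2.7a follow Equation 2.1 and can presumably be reproduced with R's dbinom function (the d functions are introduced later in this section):
dbinom(0:11, size = 11, prob = 0.7)   # the twelve probabilities of Figure 2.7a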
In order for a function p(x) to be a p.m.f. on a discrete sample space S it has to satisfy
the following two conditions:
probability amount, 1, is distributed over infinitely many, infinitely small, disjoint intervals
covering the entire domain of the sample space of a continuous random variable. Figure 2.8a
mimics infinitely many, infinitely small, disjoint rectangles for all intervals where the area
of the rectangle is the probability of the random variable taking on a value within the
interval. Note that to evaluate probabilities for larger intervals one can integrate (sum
up) the density function over the interval of interest as shown in Figure 2.8b. Finally, the
probability of a continuous variable taking on an exact value is zero because at an exact
value the width of the rectangle (shown in Figure 2.8a) becomes zero and, in turn, the area reflecting the probability becomes zero. Let W be a random variable denoting the time
(in minutes) that a light bulb lasts. Figure 2.8a shows a p.d.f. for the random variable W
which assigns a probability to infinitely many, infinitely small, disjoint intervals covering
the entire domain of the sample space of W. Notice that the probabilities are expressed as areas under the curve and the total area is 1. As a matter of fact, a p.d.f. is a function that defines $P(w_1 \le W \le w_2) = \int_{w_1}^{w_2} f(w)\,dw$, where $P(w_1 \le W \le w_2)$ is the probability of W taking on a value between $w_1$ and $w_2$. The p.d.f. of the random variable W is given in Equation 2.4.

$$f(w) = \frac{1}{2 \cdot 10^6}\, e^{-\frac{w}{2 \cdot 10^6}} \qquad (2.4)$$
Figure 2.8: (a) p.d.f. plot of W; (b) the probability $P(4 \cdot 10^6 \le W \le 6 \cdot 10^6)$ as an area under the p.d.f.
$$f(x) \ge 0 \text{ where } x \in S, \qquad \int_S f(x)\,dx = 1 \qquad (2.5)$$
Similar to empirical distributions, cumulative distribution functions (c.d.f.) are alternative ways of defining and plotting theoretical probability distributions.
Given that f (x) is the p.d.f. of a continuous random variable, its c.d.f. is computed by
Equation 2.7
$$F(x) = \int_{-\infty}^{x} f(t)\,dt \qquad (2.7)$$
Figures 2.9a and 2.9b show the cumulative distributions of random variables Y and W
which were shown in Figures 2.7 and 2.8, respectively.
Figure 2.9: (a) c.d.f. of Y; (b) c.d.f. of W
R has built-in support for more than twenty discrete and continuous theoretical probabil-
ity distributions. Moreover, it provides a uniform function naming scheme for dealing with
probability distributions. Each probability distribution has an abbreviated name consisting
of a few letters in R, e.g., norm for normal, binom for binomial, unif for uniform, geom
for geometric and t for student’s t distribution. Furthermore, R provides four functions for
each distribution represented by a single letter:
p Returns the cumulative probabilities (c.d.f. values) of a probability distribution.
q Returns the inverse cumulative distributions (quantiles) of a probability distribution.
d Returns the densities (or masses) of a continuous (or discrete) probability distribution.
r Returns randomly generated values belonging to a probability distribution.
Figure 2.10: Generating random values from binomial and normal distributions.
# Example 1
> pnorm(4, mean=5, sd=2) # P(X <= 4)
[1] 0.3085375
> 1 - pnorm(4, mean=5, sd=2) # P(X >= 4)
[1] 0.6914625
> rnorm(10, mean=5, sd=2)
[1] 6.806524 1.943950 4.909100 4.125945 3.296505 2.178318 4.334462 2.564031
[9] 3.076485 1.302218
# Example 2
> pbinom(8, size=20, prob=0.5) # P(Y <= 8)
[1] 0.2517223
> dbinom(8, size=20, prob=0.5) # P(Y == 8)
[1] 0.1201344
> rbinom(10, size=20, prob=0.5)
[1] 12 8 7 8 13 7 9 10 8 9
R prefixes the abbreviated distribution names with the single-letter function representation to form the function name denoting the desired functionality (e.g., pnorm, dbinom).
Table 2.1 shows some theoretical distributions and their p/q/d/r functions.
Figure 2.10 first evaluates the c.d.f. of a normal random variable, X ∼ N(µ = 5, σ = 2), obtaining P(X ≤ 4) and its complement P(X ≥ 4) with the pnorm function. The rnorm function in the figure synthetically generates ten values from a normally distributed random variable with µ = 5 and σ = 2.
The second example in Figure 2.10 shows the c.d.f. of a binomial random variable, Y ∼ Binom(n = 20, p = 0.5), evaluating P(Y ≤ 8). This is equivalent to tossing a fair coin 20 times, repeating the experiment infinitely many times, and counting the fraction of experiments with eight or fewer heads. In the figure, the dbinom function is also used to obtain the probability of having exactly eight heads. Note that dnorm for the normal distribution does not correspond to an exact-value probability, because exact-value probabilities are zero for continuous random variables. Lastly, rbinom generates ten synthetic values from a binomial distribution with n = 20 and p = 0.5.
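The q functions, not shown in Figure 2.10, invert the corresponding p functions; a small sketch (the comments describe the expected behavior rather than captured console output):
qnorm(0.3085375, mean = 5, sd = 2)        # inverts pnorm(4, mean=5, sd=2), returns approximately 4
qbinom(0.2517223, size = 20, prob = 0.5)  # smallest y with P(Y <= y) >= 0.2517223, returns 8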
[0, 0.5] and denotes the fraction of observations to be trimmed from each end of x. The trim parameter may be used to trim the outliers appearing on both ends.
> # Example 1
> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> mean(v1)
[1] 43.6875
> # Example 2
> v1 <- c(v1, 971)
> mean(v1)
[1] 98.23529
In Figure 2.26 we first create a collection of sixteen integers and evaluate its mean using the mean function. The mean value, 43.6875, shows the central tendency of the numbers. In the second example we add the number 971 to our vector and recalculate the mean value. Notice that adding an outlier changed the mean value to 98.23529, which is higher than all of the first sixteen values of the vector.
As we noted previously, if the numbers are evenly distributed around the mean, then the mean is a good measure of central tendency. On the other hand, if the values are skewed towards the left or right of the mean, then one needs to look at other central tendency measures such as the median and quartiles.
Median is a value that separates the lower half of an ordered collection of numbers from
the higher half. The median function in R has the signature median(x, na.rm = FALSE)
which returns the median of the object denoted by parameter x. Regardless of whether the
elements of the object x are sorted or not, R evaluates the median value on a sorted copy
of the object.
> # Example 1
> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> sort(v1)
[1] 2 11 11 14 25 27 33 42 44 56 60 60 60 73 86 95
> median(v1)
[1] 43
> # Example 2
> v1 <- c(v1, 971)
> median(v1)
[1] 44
Figure 2.27 shows how to calculate the median value of a vector. Since the length of the vector in Example 1 (Figure 2.27) is an even number, 16, the median value is the average of the values located at indices eight and nine of the sorted copy, which leaves seven numbers on each side. Notice that adding an outlier did not affect the median value much, and the new value, 44, can still serve as a measure of central tendency for all the numbers.
Median value divides a sorted collection of numbers into two approximately equal frac-
tions where the lower 0.5 fraction of the numbers is on the left hand side of the median value
and the higher 0.5 fraction of the numbers is on the right hand side of the median value.
One might be interested in other values of fractions or positional tendencies. For example
the 0.25 fraction (position), where approximately one quarter of the numbers are on the
left of a particular value and three quarters are on the right; 0.35 fraction (position) where
approximately 35% of the numbers are on the left of a particular value and 65% are on the
right; or the 0.75 fraction (position) where approximately three quarters of the numbers are on the left of a particular value and one quarter are on the right. Each of these fractions is called a percentile, i.e., the 0.25 percentile, 0.35 percentile or 0.75 percentile. Figure 2.28 demonstrates how different percentiles separate a sorted collection of data into two halves.
Figure 2.13: A percentile separates the sorted data into two halves (illustrated for Q-0.25, Q-0.50 and Q-0.75)
There is an ambiguity between the terms quantile and percentile. Typically, a quantile divides the sorted data into roughly equal intervals. Thus, quantiles dividing the sorted data into 2, 3, 4, 5, 10 and 16 equal pieces are called the median, tertiles, quartiles, quintiles, deciles, and hexadeciles, respectively. On the other hand, quantiles dividing the data into 100 equal pieces are called percentiles.
There are multiple methods of calculating quantiles. One of the most common method
involves first sorting a collection of numbers then, associating each value with equally spaced
fractions (or quantiles) from 0 to 1. If the interested fraction is already associated with a
value then that particular value is the quantile. On the other hand, if the interested fraction
is between two already associated fractions then, linear interpolation is used to calculate
the quantile value.
Figure 2.30 shows a collection of numbers {2, 60, 33, 11, 27, 56} on which we want to
calculate 0.35-quantile. The numbers are first ordered and equal interval fractions are asso-
ciated with the numbers. Since 0.35 is between 0.2 and 0.4, one might think that the mean of the two numbers at the respective fractions would be a good quantile value. However, 0.35 is not halfway between 0.2 and 0.4; instead, it is three times farther from 0.2 than from 0.4. To calculate the linear interpolation let the fractions be the independent and the sorted
numbers be the dependent variables of multiple line segments. Then the line between points
(0.2, 11) and (0.4, 27) is y = 80x − 5 where y is the quantile and x is the fraction. Setting x
to 0.35 in the linear interpolation equation returns the 0.35-quantile which is 23.
Figure 2.30: Calculating the 0.35-quantile of the collection {2, 60, 33, 11, 27, 56}. The ordered numbers {2, 11, 27, 33, 56, 60} are associated with the equally spaced fractions {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}.

The quantile function in R is used to calculate quantiles at one or more fractions (positions). The function expects the mandatory parameter x for the object denoting the collection of numbers. One can use the parameter probs=seq(0,1,0.25) to specify one or more fractions. The default value of the probs parameter generates the sequence {0, 0.25, 0.50, 0.75, 1}. Figure 2.31 demonstrates the use of the quantile function via multiple examples.
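A minimal sketch of the quantile function applied to the collection discussed above:
quantile(c(2, 60, 33, 11, 27, 56), probs = 0.35)   # the 0.35-quantile, 23
quantile(c(2, 60, 33, 11, 27, 56))                 # default probs: 0%, 25%, 50%, 75%, 100%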
In descriptive statistics quartiles are three quantiles that divide the data into four more or less equal groups. A quartile is a special case of quantiles calculated at (0.25, 0.50, 0.75) such that the 0.25 quantile is called the lower or first quartile, the 0.50 quantile is called the median or second quartile, and the 0.75 quantile is called the upper or third quartile. The quantile function, by default, returns the quartiles in addition to the minimum (the 0 quantile) and maximum (the 1 quantile).
Mode is another central tendency statistic, defined as the most frequently appearing value in a data set. R does not have a direct function for calculating the mode of a vector. However, with the help of the table function the mode can be calculated easily as shown in Figure 2.32.
> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> frequency.table <- table(v1)
> frequency.table
v1
2 11 14 25 27 33 42 44 56 60 73 86 95
1 2 1 1 1 1 1 1 1 3 1 1 1
> names(frequency.table)[which.max(frequency.table)]
[1] "60"
might spread out around the central tendency. There are many summary statistics used for
measuring dispersion and we are going to present the most prevalent ones here.
Range is the size of the smallest interval that contains all the values of a collection
of numbers. That is, it is the difference between the maximum and minimum values of
the collection. R has the function range which returns the minimum and maximum of
a vector object rather than the difference between them. One can evaluate the difference
between maximum and minimum values of a vector by using max and min functions as shown
in Figure 2.33. Range might be useful for measuring the dispersion of a small collection of numbers. However, it has its own disadvantages, especially on large collections. Firstly, range is very sensitive to outliers. Secondly, it provides information about the maximum and minimum values in a collection, but how the in-between values are scattered is not represented by the range.
> # Example 1
> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> range(v1)
[1] 2 95
> max(v1) - min(v1)
[1] 93
The quantity given in Equation 2.9 is referred to as the sample variance, s², and is used as a measure of dispersion around the sample mean value¹. Variance is always non-negative, and the higher the variance the greater the dispersion in the data set.
Since the distances are squared in the definition of variance, the unit of variance (unit²) is not the same as the unit of the data in the collection. As a result, interpreting variance with respect to the data may not always make sense. An alternative statistic, called the standard deviation, σ, is defined as the square root of the variance.
> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> # Sample Variance and sd
> var(v1)
[1] 782.2292
> sd(v1)
[1] 27.96836
>
> # Population Variance and sd
> pvar <- sum((v1-mean(v1))^2)/length(v1)
> pvar
[1] 733.3398
> psd <- sqrt(pvar)
> psd
[1] 27.08025
R provides var and sd functions for sample variance, s2 , and sample standard devia-
tion, s, respectively as shown in Figure 2.34. The same figure also shows how to calculate
population variance, σ 2 , and population standard deviation, σ.
Although variance and standard deviation are prevalently used in the analysis of disper-
sion, both are non-robust summary statistics. The calculation of both statistics depends on the mean, and the mean is not a robust measure of central tendency, i.e., it is significantly affected by outliers. Additionally, squaring the distance of a data point to the mean gives more weight to the outliers in the data set because quadratic functions grow quickly.
Two robust statistics for measuring dispersion in a collection of numeric data are the interquartile range (IQR) and the median absolute deviation (MAD).
The interquartile range (IQR) is defined as the difference between the third quartile and the first quartile of a collection of numbers, IQR = Q3 − Q1. IQR represents the spread of the middle 50% portion of the data. IQR is a robust statistic for dispersion because it is not significantly influenced by outliers, which is due to trimming the lower and upper quarters. Usually, IQR is used in cases where the median is used as a measure of central tendency. The IQR function in R returns the interquartile range of a numeric vector as shown in Figure 2.35. A
similar measure is called interdecile range which is defined as the difference between 0.9-
quantile and 0.1-quantile. Interdecile range simply trims the upper and lower 10% of the
data and provides the range of the remaining data.
¹ Equation 2.9 is called the sample variance. The variance of a population is computed as $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$.
> IQR(v1)
[1] 37.75
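R has no dedicated function for the interdecile range, but it can be evaluated with quantile; a minimal sketch using the same vector v1:
diff(quantile(v1, probs = c(0.1, 0.9)))   # interdecile range: 0.9-quantile minus 0.1-quantile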
Median absolute deviation (MAD) is another summary statistic for measuring dispersion in the presence of outliers. As the name implies, MAD is the median of the absolute deviations of the data points from their median. Let x = {x1, x2, . . . , xn} be a collection of numbers and median(x) be a function that returns the median of the collection. Let d be the absolute values of the distances of the elements in x to the median of x, i.e., d = {di : di = |xi − median(x)|}. The MAD of a data set is simply defined as the median of d, i.e., median(d). Figure 2.20 shows how to evaluate MAD manually as well as by using the mad function.
Figure 2.20: Calculating MAD manually and using the mad function
> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> med.v1 <- median(v1)
> med.v1
[1] 43
> distance.med.v1 <- abs(v1 - med.v1)
> distance.med.v1
[1] 41 17 10 32 16 13 18 43 17 52 1 29 32 30 1 17
> median(distance.med.v1)
[1] 17.5
> mad(x=v1, constant=1)
[1] 17.5
MAD is a robust statistic denoting the dispersion in data in the presence of outliers. However, MAD assumes symmetry in the process that generates the data, because it divides the data into two halves around the median. If the data are not symmetric around the median, alternative measures can be used, as suggested in the paper “Alternatives to the Median Absolute Deviation” by Peter J. Rousseeuw and Christophe Croux, 1993.
The coefficient of variation (CV) is an alternative statistic for measuring the dispersion around the mean. Note that the standard deviation also measures the dispersion around the mean. The coefficient of variation is defined as the ratio of the standard deviation to the mean, CV = σ/µ, and it is unitless. CV measures the dispersion of a collection of data in a way that is independent of the unit of measurement of the data. It is useful for comparing the relative magnitudes of the dispersions of two or more data sets, especially if the units of measurement of these data sets are different. Let 10.2 be the average number of hours spent studying for a Chemistry final exam in a class, with standard deviation 4.6. Let 72 be the average score on the same final exam, with standard deviation 27. One cannot say that the dispersion in the scores, 27, is greater than the dispersion in the study hours, 4.6, based on the standard deviations, because the measurement units are different. On the other hand, comparing the CV of the study hours, 0.45, and of the scores, 0.37, tells us that the dispersion in study hours is greater than the dispersion in scores.
Note that CV is meaningful only for non-negative data that are also on a ratio scale, i.e., have an absolute zero which represents the lack of quantity. A mixture of negative and positive values in the data may result in a mean of zero, for which CV is infinite. Furthermore, a mean that is close to zero makes CV approach infinity, so CV is very sensitive to small changes when the mean is close to zero.
R does not provide a function for calculating the coefficient of variation; however, one can use the functions mean and sd to evaluate CV.
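A minimal sketch using the vector v1 from the previous figures:
cv <- sd(v1) / mean(v1)   # coefficient of variation: standard deviation over mean
cv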
Figure 2.21: (a) Right skewed distribution; (b) Symmetric distribution; (c) Left skewed distribution
Skewness is a statistic that measures the asymmetry of a distribution around its mean.
Figure 2.21 shows example distributions with left, zero and right skewness. Theoretically, skewness is the third “standardized” moment of a distribution. Sample skewness², $\tilde{m}_3$, is defined as

$$\tilde{m}_3 = \frac{m_3}{s^3} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)^3} \qquad (2.10)$$

² Note that the leading terms of the sample moment and the sample standard deviation are 1/n rather than 1/(n − 1).
The numerator of Equation 2.10 cubes the distances around the sample mean value. Therefore, if a distribution is symmetric around the mean value, it will have values, say, 3 units to the right of the mean (a distance of 3 units) as well as 3 units to the left of the mean (a distance of −3 units). Hence, their cubes will cancel each other and bring the numerator closer to zero. Ideally, a unimodal distribution is left skewed (has a left tail) when $\tilde{m}_3$ is negative, is symmetric when $\tilde{m}_3$ is close to zero, and is right skewed (has a right tail) when $\tilde{m}_3$ is positive. The normal distribution is a symmetric distribution with zero theoretical skewness and close to zero sample skewness. It is always better to visualize the distribution while measuring skewness, because a fat and short tail may balance a thin and long tail toward zero skewness.
Since $\tilde{m}_3$ does not account for sample size, another commonly used measure of skewness
is defined as

$$\tilde{m}_3 = \frac{n^2}{(n-1)(n-2)}\,\frac{m_3}{s^3} = \frac{n^2}{(n-1)(n-2)}\,\frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)^3} \qquad (2.11)$$
where m3 , x̄ and s are the third moment, sample mean and the standard deviation of the
distribution, respectively. Many software packages implement Equation 2.11. As a rule of thumb, one may consider a distribution roughly symmetric when the skewness is in [−0.5, 0.5], moderately skewed when the skewness is in [−1, −0.5) or (0.5, 1], and highly skewed when the skewness is less than −1 or greater than 1.
Lastly, left skewed distributions’ medians are greater than their means and right skewed
distributions’ medians are less than their means, while symmetric distributions’ medians
and means are the same or closer to each other.
> library(moments)
> v <- rbeta(n=256, shape1=2, shape2=8)
> skewness(x=v)
[1] 0.7013203
The moments package in R implements the skewness function. The console in Figure 2.22 first creates a vector of randomly generated data from a beta distribution with α = 2 and β = 8. The skewness of the vector is 0.70, which indicates a moderate right skewness.
Kurtosis is a statistic that measures how heavy the tails of a distribution are around its mean. Theoretically, kurtosis is the fourth “standardized” moment of a distribution. Sample kurtosis³, $\tilde{m}_4$, is defined as

$$\tilde{m}_4 = \frac{m_4}{s^4} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)^4} \qquad (2.12)$$
where m4 , x̄ and s are the fourth moment, sample mean and the standard deviation of
the distribution, respectively. The numerator of Equation 2.12 takes the fourth power of the distances around the mean value. Therefore, the samples closer to the mean do not contribute much to the kurtosis, while the samples on the tails, i.e., farther from the mean, contribute much more to the kurtosis. Note that kurtosis accounts for both tails together.
Equation 2.12 calculates the kurtosis of a normal distribution as 3. To standardize, many software packages implement a version that subtracts 3 from the kurtosis in order to fix the kurtosis of a normal distribution at zero. This version is called excess kurtosis and it also accounts for the sample size. Excess kurtosis is defined as

$$\tilde{m}_4 = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\,\frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)^4} - \frac{3(n-1)^2}{(n-2)(n-3)} \qquad (2.13)$$

³ Note that the leading terms of the sample moment and the sample standard deviation are 1/n rather than 1/(n − 1).
According to Equation 2.13, a positive value indicates heavier tails than the normal distribution and a negative value indicates lighter tails than the normal distribution. A distribution with positive excess kurtosis is called leptokurtic, which means it has more outliers than a normal distribution. On the other hand, a distribution with negative excess kurtosis is called platykurtic, which means it has fewer outliers than a normal distribution.
> library(moments)
> v <- rt(n=256, df=3)
> kurtosis(x=v)
[1] 7.496812
The moments package in R implements the kurtosis function without subtracting 3. The console in Figure 2.23 first creates a vector of randomly generated data from a Student's t distribution with 3 degrees of freedom. The kurtosis of the vector is 7.49, which indicates that the tails are heavier than the normal distribution's, i.e., kurtosis 3.
Lastly, the moments package implements the Jarque–Bera test, which tests whether some sample data have skewness and kurtosis matching those of a normal distribution.
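A minimal sketch of the test (jarque.test is the function name provided by the moments package; the data here are arbitrary):
library(moments)
v <- rnorm(256)    # synthetic data drawn from a normal distribution
jarque.test(v)     # tests skewness and kurtosis against those of a normal distribution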
2.3.4 Normalization
Considering that a numeric vector is a collection of measurements or observations, directly
comparing the data points in two vectors might not always make sense. In order to do
a proper comparison one has to take into account the unit of measurement, the scale of
measurement and the process generating the data. If two vectors have the same measurement
but in different units one has to rescale one of the vectors into the units of the other vector.
Rescaling is done by adding, subtracting, multiplying or dividing the elements of a vector
by one or more constants. For example, in order to compare two vectors of distance values
measured in Miles and Kilometers, respectively, one has to rescale one of the vectors into
the units of the other vector.
Similarly, if the data in two vectors are measured using different scales it is necessary to
normalize the data into a common scale. A frequently used common scale is scaling the data
into the [0,1] interval. For example, different educational institutions use different scales to
grade the performance of their students in a course such as [0,10], [0,100] or [0,150]. In
order to fairly compare data points having different scales it is necessary to normalize them
into [0,1] interval using the Equation 2.14 assuming that x = {x1 , x2 , . . . , xn } denotes a
collection of numeric values.
$$x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (2.14)$$
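A minimal sketch of Equation 2.14 in R, using a hypothetical vector of grades on a [0, 150] scale:
> grades <- c(45, 72, 90, 120, 150)                    # hypothetical grades
> (grades - min(grades)) / (max(grades) - min(grades)) # rescaled into [0, 1]
[1] 0.0000000 0.2571429 0.4285714 0.7142857 1.0000000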
Finally, if the data points in two vectors are generated via the same process (have the
same distribution) but with different parameters we use standardization. Standardization
is usually defined as subtracting a measure of location (central tendency), µ or x̄ from the
data point and dividing it by the scale (dispersion), σ or s.
$$x'_i = \frac{x_i - \mu}{\sigma} \qquad (2.15)$$
To illustrate, assume that Alice and Carol are two students who took a very crowded
course with two different professors. Furthermore, let Alice’s overall average grade be 78 and
Carol’s overall average grade be 65. Naturally, one may think that Alice performed better
than Carol in that particular course. Assuming that the students were randomly assigned
to courses, a real comparison should have been made by taking the group of students in
each class into consideration. Let the mean and median grades in Alice’s class be 84 and
88 and in Carol’s class be 60 and 59, respectively. Even if we assume that the standard
deviations in the grades were the same for both classes, Carol seems to be more successful
when the grades are standardized. Standardized data is called z-scores, standard scores or
scaled scores in different domains. Figure 2.24 shows how to calculate the distance of a
value from the mean of the collection in terms of the standard deviation.
> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60)
> # Population Variance and sd
> pvar <- sum((v1-mean(v1))^2)/length(v1)
> pvar
[1] 733.3398
> psd <- sqrt(pvar)
> psd
[1] 27.08025
>
> # The distance of v[4] to its mean in terms of the standard deviations
> (v1[4] - mean(v1))/psd
[1] -1.207061
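To standardize a whole vector at once, the same arithmetic can be vectorized; base R's scale function gives an equivalent result when the sample standard deviation is used:
z <- (v1 - mean(v1)) / sd(v1)   # z-scores using the sample standard deviation
# scale(v1) produces the same standardization, returned as a one-column matrix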
$$x'_i = \frac{x_i - \text{median}(x)}{MAD} \qquad (2.16)$$
Equation 2.16 simply calculates the distance of a point to its median in terms of median
absolute deviation (MAD).
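A minimal sketch of Equation 2.16 using the mad function with constant=1, as in Figure 2.20:
robust.z <- (v1 - median(v1)) / mad(v1, constant = 1)   # robust scores based on median and MAD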
the interval [Q1 − 1.5(IQR), Q3 + 1.5(IQR)] may be considered to be an outlier. One can
calculate the limits of the non-outlier interval and list the outliers appearing in a vector as
shown in Figure 2.25.
> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60, 971)
> lowerOutlierLimit <- quantile(v1, probs=0.25, names=FALSE)-1.5*IQR(v1)
> upperOutlierLimit <- quantile(v1, probs=0.75, names=FALSE)+1.5*IQR(v1)
> lowerOutlierLimit
[1] -27.5
> upperOutlierLimit
[1] 112.5
> v1[v1<lowerOutlierLimit | v1>upperOutlierLimit]
[1] 971
An alternative method to detect outliers is suggested by Boris Iglewicz and David Hoaglin in their paper “How to Detect and Handle Outliers”, 1993. The authors suggest calculating a modified z-score for each element in the data set using Equation 2.17:

$$x'_i = \frac{0.6745\,(x_i - \text{median}(x))}{MAD} \qquad (2.17)$$

The authors recommend that modified z-scores falling outside the interval [−3.5, 3.5] be labeled as potential outliers.
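A minimal sketch of Equation 2.17 applied to the vector used in Figure 2.25:
> v1 <- c(2, 60, 33, 11, 27, 56, 25, 86, 60, 95, 44, 14, 11, 73, 42, 60, 971)
> modified.z <- 0.6745 * (v1 - median(v1)) / mad(v1, constant = 1)
> v1[abs(modified.z) > 3.5]   # elements whose modified z-scores fall outside [-3.5, 3.5]
[1] 971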
Hampel filter is another outlier detection approach that is based on MAD and median.
Hampel filter assumes that any value out of range
(Figure: countries shown — Vietnam, Venezuela, Ukraine, Thailand, Switzerland, Slovakia, Mexico, Malaysia, Finland, Denmark, Czech Republic, Argentina)
115
The copyright law [Title 17, U.S. Code] prohibits sharing and distributing any parts of this document
Figure 2.29: Scatter plot for debt per capita and GDP per capita (x-axis: GDP per Capita in dollars, y-axis: Debt per Capita in dollars; Switzerland, Finland and Denmark are labeled)
As the gross domestic product of a country increases, its debt also tends to increase. When we look at the two variables together, the picture is not as dark for countries such as Switzerland, Finland and Denmark. In fact, a better measure here is the debt-to-GDP ratio when assessing the default risk of a country. Obviously, there are more variables involved in evaluating countries' economies; the example above illustrates how looking at a single variable might be misleading.
Equation 2.20 defines the sample covariance of two variables X and Y as
s^2_{X,Y} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y}) \qquad (2.20)
where X̄ and Ȳ denote the means of variables X and Y, respectively. Please note that 1/(n − 1) needs to be replaced by 1/n in Equation 2.20 to compute the covariance of a population rather than a sample.
A positive covariance indicates that the two variables get larger and smaller together. As a typical example, house size and house price usually have positive covariance. On the other hand, a negative covariance indicates that when one variable gets larger the other one gets smaller. For example, mortgage interest rate and house price usually have negative covariance. A covariance close to zero indicates that the variables do not change together, i.e., they do not co-vary.
Covariance depends on the scales of the involved variables. Moreover, it ranges between negative infinity and positive infinity. Therefore, interpreting and comparing the magnitudes of covariances is difficult.
A related statistic which is easier to interpret is the correlation coefficient, given in Equation 2.21.
r_{X,Y} = \frac{s^2_{X,Y}}{s_X s_Y}, \qquad -1 \le r \le 1 \qquad (2.21)
The quantity r_{X,Y} is called the linear correlation coefficient or Pearson product-moment correlation coefficient. It measures the strength (how tightly the points cluster around a line) and the direction of a linear relationship between two variables. The correlation coefficient is dimensionless and ranges between −1 and 1. A value close to 1 indicates a very strong positive correlation between two variables, whereas a value close to −1 indicates a very strong negative correlation. Lastly, a value close to 0 indicates the lack of a linear correlation between the two variables.
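As a small illustration (the two vectors below are made-up values, not from any dataset used in this book), covariance and correlation are available through the cov and cor functions, and Equation 2.21 can be verified directly:
# Sample covariance (Equation 2.20) and Pearson correlation (Equation 2.21)
size  <- c(75, 90, 120, 150, 200, 230)     # hypothetical house sizes
price <- c(110, 130, 180, 210, 300, 340)   # hypothetical house prices
cov(size, price)                           # positive: the variables grow together
cor(size, price)                           # close to 1: strong positive linear relation
cov(size, price) / (sd(size) * sd(price))  # equals cor(size, price), as in Equation 2.21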
Figure 2.30 presents several correlation values indicating the direction and strength of linear relationships between two variables. Note that correlation is not causation. That is, a change in one variable does not necessarily cause a positive or negative change in the other variable. For example, there is a strong correlation between the shoe size of children and their vocabulary size. The correlation does not indicate that larger feet cause a larger vocabulary. In fact, both shoe size and vocabulary size are related to a confounding factor, namely age.
[Figure 2.30: scatter plots of Y versus X illustrating different correlation values, including panels (d) r = −0.74, (e) r = −0.30, and (f) r = 0.82, where the relationship is NOT linear]
The table function displays the frequencies of the levels of a factor. For example, in Console 2.31 the table function displays the frequency distribution of the sex column, indicating that there are 242 males and 210 females in the dataset. The function prop.table takes a table as input and returns the proportions instead of the frequencies. In Console 2.31 the proportions of males and females are very close to each other.
> library(Ecdat)
> data("Unemployment")
> ?Unemployment
> str(Unemployment)
...
...
> # single variable
> table(Unemployment$sex)
male female
242 210
> prop.table(table(Unemployment$sex))
male female
0.5353982 0.4646018
The table and prop.table functions not only present the frequencies and proportions of
single variables but also the joint frequency distributions and proportions of two variables.
Console 2.31 also shows the cross tabulation of sex and reason variables. In the console we
also introduce the addmargins function which displays the column and row sums as well as
the total instances.
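The cross tabulation steps described here can be sketched as follows (Console 2.31 itself is not reproduced; the commands assume the Unemployment data loaded earlier):
# Joint frequencies and proportions of sex and reason
tab <- table(Unemployment$sex, Unemployment$reason)
addmargins(tab)                  # joint frequencies with row and column totals
addmargins(prop.table(tab))      # joint and marginal proportions
prop.table(tab, margin = 2)      # conditional distribution of sex within each reason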
When we look at the sex and reason variables, a natural question is whether there is a relationship between these two variables, i.e., whether they are statistically independent or not. In other words, two variables are statistically independent when, no matter which value one variable assumes, the probability distribution of the other variable roughly remains the same. If two
categorical variables are not independent, then the probability distribution of one variable
changes depending on the value of the other variable.
When two variables are (statistically) independent, their respective conditional and marginal distributions are expected to be (roughly) the same. For example, in Console 2.31 the marginal distribution of sex is 0.54 male and 0.46 female. However, the conditional distribution when the unemployment reason is new entrant is 0.34 male and 0.66 female. The question is whether the difference between the marginal and conditional distributions happened by chance and is insignificant, or whether the difference implies a relation between the two variables and the evidence is statistically significant.
We use Pearson's χ² (chi-squared) test of independence to determine whether two variables are independent. The χ² statistic is computed as
\chi^2 = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i} \qquad (2.22)
where n is the number of cells in the contingency table, o_i is the number of observations in cell i and e_i is the expected number of observations in cell i. Two events, A and B, are considered to be independent when P(A) = P(A|B) ⟺ P(A, B) = P(A)P(B). The last table in Console 2.31 shows the joint and marginal probabilities of sex and reason. If these two variables were independent then their expected joint probability in each cell would be the product of their marginal probabilities. Hence, the expected frequency of each cell, e_i, would be the product of the total number of instances and the expected joint probability.
The null hypothesis (H0 ) of the χ2 test indicates that the variables are independent
and the alternative hypothesis (H1 ) indicates that they are dependent. Under the null
hypothesis the χ2 statistic has χ2 distribution with (k − 1)(l − 1) degrees of freedom, where
k is the number of rows and l is the number of columns of the contingency table. Typically,
if the p-value of the χ2 statistic is lower than 0.05, we have strong evidence to reject the null
hypothesis, i.e., the evidence to reject is statistically significant. That is, we found strong
evidence that the marginal and the conditional distributions are different. Otherwise, there
is not enough evidence to reject the null hypothesis.
Console 2.32 introduces the chisq.test function for Pearson's χ² test of independence. In the console the p-value is much smaller than 0.05; hence we have enough evidence to reject the null hypothesis and assume that there is a relation between the variables sex and reason. For example, unemployment due to entering the workforce is higher for females, while unemployment due to losing a job is much higher for males. Therefore, there is strong evidence for a relation between the variables sex and reason in our dataset.
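A minimal sketch of the test (output omitted; Console 2.32 shows the actual run):
# Pearson's chi-squared test of independence between sex and reason
test <- chisq.test(table(Unemployment$sex, Unemployment$reason))
test$p.value     # reject H0 (independence) when the p-value is below 0.05
test$expected    # expected cell counts e_i under independence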
Please note that to apply the χ² test, at least 80% of the cells should have expected counts of at least 5. If that is not the case, one may consider merging cells, i.e., combining the levels of the categorical variable(s). If the cell values are still very small, one may consider Fisher's exact test. The test also assumes that the instances in your dataset are independent of each other. That is, the values in one row are not related to and do not affect the values in another row of your dataset. If this is not the case, one may consider McNemar's test or Cochran's Q test.
Pearson's χ² test indicates whether the two categorical variables, i.e., the levels of the categorical variables, are related to each other. However, if there is a relation, it does not measure the strength of the relation. Note that the magnitude of the χ² statistic in Equation 2.22 depends on the total number of instances in the dataset as well as the number of rows and columns of the contingency table. To measure the strength of the relation we use the corrected contingency coefficient, C′. The original Pearson's contingency coefficient, C, is computed as
C = \sqrt{\frac{\chi^2}{\chi^2 + N}} \qquad (2.24)
where χ² is the chi-squared statistic and N is the total number of observations. Although 0 ≤ C ≤ 1, C stays below 1 even for perfect relations. To fix this problem, the corrected contingency coefficient, C′, is computed as
C' = \sqrt{\frac{\min(k, l)}{\min(k, l) - 1}} \, \sqrt{\frac{\chi^2}{\chi^2 + N}} \qquad (2.25)
where k and l are the numbers of rows and columns of the contingency table, respectively, and 0 ≤ C′ ≤ 1. The function ContCoef, which computes both the corrected and the non-corrected contingency coefficients, is in the descriptive statistics package named "DescTools".
> library(DescTools)
> ContCoef(x=Unemployment$sex, y=Unemployment$reason, correct=FALSE)
[1] 0.2629277
> ContCoef(x=Unemployment$sex, y=Unemployment$reason, correct=TRUE)
[1] 0.371836
Figure 2.33 shows that the strength of the relation between variables sex and reason is
0.371836, which is not very high, yet important.
There are other statistics to measure the strength of the relations between two categorical
variables such as Cramer’s V and Goodman and Kruskal’s λ (lambda).
Balloon plots help us visually cross-analyze the levels of categorical variables when the levels of the variables are not too many. On the other hand, some variables have several levels, which makes cross analysis through balloon plots difficult.
Correspondence analysis allows us to translate the levels of the categorical variables into a Euclidean space with new coordinates. The coordinates of the levels of the categorical variables help us compute the distances between the levels as well as visualize the levels in 2D with possible information loss. Before translating the levels into a Euclidean space, we need to compute the standardized residual proportions based on the contingency table.
> library(Ecdat)
> data("Unemployment")
> # original proportions matrix
> op <- as.matrix(prop.table(table(Unemployment$sex, Unemployment$reason)))
> # all proportions, including the marginal proportions
> m <- addmargins(prop.table(table(Unemployment$sex, Unemployment$reason)))
> # row and column marginal proportions
> rmp <- as.vector(m[-3,5])
> cmp <- as.vector(m[3,-5])
> # expected proportions matrix
> ep <- outer(rmp, cmp, FUN=’*’)
> # residual of proportions matrix
> rp <- op-ep
> # standardize residual of proportions, (o - e)/sqrt(e)
> # standardized residual of proportions
> st.rp <- rp/sqrt(ep)
> # use SVD (Singular Value Decomposition) to generate singular values d, ...
> # left singular matrix u and right matrix vector v
> w <- svd(st.rp)
Console 2.35 shows the steps taken to compute the standardized residual proportions as well as the correspondence analysis. In the console we first compute the original proportions matrix, next we compute the marginal proportions matrix and extract the marginal proportions of the rows and columns. Note that we extract the marginal proportions only for the levels of our categorical variables. Then, we compute the expected proportions matrix using the outer function and compute the residuals of the proportions. Next, we standardize the residual proportions matrix and use Singular Value Decomposition (SVD) to decompose the standardized residual proportions into a vector of singular values d, a left singular matrix u and a right singular matrix v. The left and right singular matrices correspond to the rows and columns of the contingency table along with their new coordinates. SVD decomposes a matrix M_{k×l} into U_{k×k}, D_{k×l} and V_{l×l}, where D_{k×l} is a diagonal matrix, i.e., M = U D V^T. The diagonal entries of D are typically ordered and are called the singular values; U is called the left singular matrix and V is called the right singular matrix. When M is real, U and V are orthogonal matrices.
TO BE COMPLETED LATER
where ||x_i − x_j||² is the squared distance between x_i and x_j and µ_l is the centroid of the data instances in cluster C_l. The centroid is a generalization of the mean value to multivariate analysis or multidimensional spaces: it applies to vectors rather than scalar values. Let V = {v_1, v_2, . . . , v_n} be a set of vectors in m-dimensional space, i.e., v_i ∈ R^m. Given that each vector has a mass value 0 ≤ t_i ≤ 1, where Σ_i t_i = 1, the centroid µ of V is
\mu = \sum_{i=1}^{n} t_i v_i \qquad (2.28)
In case the mass values of the vectors are all the same, 1/n, Equation 2.28 reduces to the sum of the vectors divided by the number of vectors.
Equation 2.27 is also called the within-cluster sum of squares or the within-cluster variation, because the distances of the data instances are squared, and because the distances are calculated with respect to the mean of the data instances, respectively. Note that one way to define the mean of multiple data instances consisting of many features is to calculate a data point that consists of the means of the individual features.
Given that there are k clusters, the total within-cluster sum of squares or total within-cluster variation, W, is
W = \sum_{l=1}^{k} W_{C_l} = \sum_{l=1}^{k} \sum_{x_i \in C_l} ||x_i - \mu_l||^2 \qquad (2.29)
Since the total within-cluster sum of squares in Equation 2.29 denotes how compact the
clusters are, a smaller value is better in general.
The k-means algorithm expects a dataset and the number of clusters, k, as input and
returns k centroids denoting the center of each cluster. A data instance is assigned to the
cluster based on the smallest distance to the cluster’s centroid. Most implementations return
a list of labels representing the cluster of each data instance in the dataset as well.
Algorithm 1 presents the pseudo-code for the k-means algorithm. The algorithm expects a dataset to be clustered, X, and the number of clusters, k. Lines 1-3 generate random centroids for each cluster 1-to-k. Line 5 starts an infinite loop that executes as long as there is a change in the computed cluster labels at every iteration. Lines 6-8 assign a cluster label to each instance in the dataset by computing the distances of the instance to all centroids and picking the cluster with the minimum distance. Lines 10-12 check whether the previous for-loop caused any changes in instance labels. If not, the while-loop is terminated at line 11. Lines 14-16 update the centroids when there is a change in instance labels, i.e., when at least one instance's cluster label has changed. Finally, line 19 returns the k centroids representing the cluster centers. Many implementations also return cluster labels for the instances in the dataset.
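The following is a bare-bones R sketch of Algorithm 1; the function name simple_kmeans and the random-uniform initialization are illustrative choices rather than the book's reference implementation, and production code should use the built-in kmeans function instead.
# A minimal k-means following Algorithm 1 (illustrative, not optimized)
simple_kmeans <- function(X, k) {
  X <- as.matrix(X)
  # Lines 1-3: random initial centroids drawn from the range of each variable
  centroids <- apply(X, 2, function(col) runif(k, min(col), max(col)))
  labels <- rep(0, nrow(X))
  repeat {
    # Lines 6-8: assign each instance to the closest centroid
    newLabels <- apply(X, 1, function(x) {
      which.min(colSums((t(centroids) - x)^2))
    })
    # Lines 10-12: stop when no label has changed
    if (all(newLabels == labels)) break
    labels <- newLabels
    # Lines 14-16: recompute each centroid as the mean of its cluster
    for (l in unique(labels)) {
      centroids[l, ] <- colMeans(X[labels == l, , drop = FALSE])
    }
  }
  list(centers = centroids, cluster = labels)
}
Calling simple_kmeans(X, k = 5) on a purely numeric matrix or data frame returns the centroids and one cluster label per row, mirroring the centers and cluster components returned by kmeans.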
Figure 2.36 presents a dataset that is to be clustered into five clusters using Algorithm 1. Before the algorithm starts, at Iteration 0, the instances are not assigned to any clusters. Random centroids are created and Iteration 1 shows the assignments of the instances to these cluster centroids. The centroids are updated at Iteration 1 and the instances are reassigned to their clusters at Iteration 2. At Iteration 3 the algorithm stabilizes, i.e., there is no instance whose cluster label has changed.
[Figure 2.36: panels for Iterations 0-3 of the synthetic dataset plotted as v2 versus v1, with points colored by cluster assignment 1-5]
The console output shown in Figure 2.37 presents a synthetically generated dataset consisting of two variables, v1 and v2, and 250 instances. The dataset is generated from multivariate normal distributions with various centroids and the same covariance matrix. Next, the dataset is converted into a dataframe and plotted via ggplot. Note that data visualization via ggplot2 is covered in the next chapter.
R supports k-means clustering via the kmeans function. The kmeans function expects
at least a matrix or a dataframe object consisting of only numeric variables and the num-
ber of clusters provided via parameter centers. It is a common practice to scale the data
via the scale function before providing it to kmeans. Scaling typically centers the data
by subtracting the column mean from the values in the column and dividing them by the
column’s standard deviation for each column. The scaling step is to make sure that the
variables contribute fairly without larger range variables affecting the computations signifi-
cantly compared to smaller range variables. Once kmeans computes the clusters it returns
a list holding multiple objects of different types. Among the most important objects that
the returned list holds are cluster, centers, size, withinss, betweenss and iter. cluster is a
numeric vector representing the cluster of each and every data instance in the clustered
dataset. centers is a matrix representing the cluster centroids. size is a numeric vector
denoting the number of data instances falling into each cluster. withinss is within-cluster
sum of squares of the clusters computed by Equation 2.27. betweenss is between-cluster
126
The copyright law [Title 17, U.S. Code] prohibits sharing and distributing any parts of this document
> library(MASS)
> set.seed(1001)
> mydata <- rbind(
mvrnorm(50, mu = c(10,10), Sigma = matrix(c(80,0.16,0.16,80), ncol = 2),
empirical = TRUE),
mvrnorm(50, mu = c(50,50), Sigma = matrix(c(80,0.16,0.16,80), ncol = 2),
empirical = TRUE),
mvrnorm(50, mu = c(10,50), Sigma = matrix(c(80,0.16,0.16,80), ncol = 2),
empirical = TRUE),
mvrnorm(50, mu = c(50,10), Sigma = matrix(c(80,0.16,0.16,80), ncol = 2),
empirical = TRUE),
mvrnorm(50, mu = c(30,30), Sigma = matrix(c(80,0.16,0.16,80), ncol = 2),
empirical = TRUE)
)
> colnames(mydata) <- c("v1", "v2")
> mydata <- as.data.frame(mydata)
> head(mydata)
v1 v2
1 3.712771 -6.7042562
2 21.952162 -0.3210547
3 20.873552 0.9499241
4 18.608042 29.3532378
5 11.524498 15.3720296
6 7.236498 15.4625999
> library(ggplot2)
> ggplot(data=mydata, mapping=aes(x=v1, y=v2)) + geom_point()
[Plot: scatter plot of v2 versus v1 for the 250 synthetic instances]
sum of squares, computed as the total sum of squares minus the total within-cluster sum of squares. Finally, iter denotes how many iterations it took the k-means algorithm to stabilize.
> # Compute k-means clusters based on the cluster count found in scree plot
> set.seed(1001)
> fit <- kmeans(x=mydata, centers=5)
> fit
> ...
> ...
> # Bind the cluster labels to the dataset
> mydata$cluster <- factor(fit$cluster)
>
> # Visualize the clusters
> ggplot() + geom_point(data=mydata, mapping=aes(x=v1, y=v2, color=cluster)) +
    geom_point(data=data.frame(fit$centers, centroid=as.factor(1:nrow(fit$centers))),
               mapping=aes(x=v1, y=v2, fill=centroid), shape=23, color="black")
[Plot: the five clusters plotted as v2 versus v1, with points colored by cluster label and the centroids shown as diamond markers]
The console output in Figure 2.38 runs the kmeans function over the dataset that was synthetically generated in Figure 2.37. The kmeans function returns a list which is referenced by an object named fit. The following line creates a new column in mydata named cluster and populates it with the cluster labels for each row. Finally, we use ggplot to visualize the cluster centroids and the instance clusters in different colors. As the figure shows, the rows that are closer to each other are clustered in the same group.
One problem we have not discussed is how to determine the optimal number of clusters, k, which was 5 in the previous example. Since the synthetically generated dataset in Figure 2.37 has two variables, one can plot the dataset and visually decide the optimal number of clusters. However, visual inspection is more difficult for three variables and impossible for more than three variables without an effective dimensionality reduction technique. There are many methods to determine the optimal or suboptimal number of clusters in a dataset. One
prevalently used method is the elbow method, which is based on the total within-sum-of-squares. The elbow method simply runs the k-means algorithm for 1, 2, 3, . . . , t candidate cluster numbers and extracts the total within-sum-of-squares for each candidate. Note that t ≤ n, where n is the total number of instances, because in the extreme case each instance can exist in its own cluster. However, √n is a good value for t in general. The total within-sum-of-squares is equal to the total squared distances when all data is assumed to exist in a single cluster. Once the total within-sum-of-squares values are computed, they are visualized via a line graph. The total within-sum-of-squares typically decreases as the number of clusters increases. Therefore, the line graph looks like an arm, and roughly the elbow point of the arm is considered to be a good value for the number of clusters, because the decrease in total within-sum-of-squares is not significant after the elbow point.
Figure 2.39 presents the elbow method in practice. The dataset is scaled first to achieve fair variable contributions to the clustering process. Next, the total sum-of-squares is computed. It denotes the total within-sum-of-squares when we have only one cluster. Then, the total within-sum-of-squares values are computed in a for-loop from 2 up to 15 clusters. Lastly, the within-sum-of-squares is converted into a data frame and plotted using ggplot. In the figure, the elbow point is roughly at 5, which is consistent with the synthetic data generation process in Figure 2.37.
> # standardize variables by subtracting the column mean from each value and dividing by the column standard deviation
> # this step makes the variables contribute fairly to the computations
> # otherwise a variable of range [1000-2000] will have more impact than a variable of range [10-20]
> mydata <- as.data.frame(scale(mydata))
>
> # compute the within-sum of squares wss, initially set to the total of the variances of the columns
> wss <- (nrow(mydata)-1)*sum(apply(X=mydata, MARGIN=2, FUN=var))
>
> # compute the within-sum of squares wss for clusters of size 2 to 15
> # note that the first element is always the total
> # in case the data size is large, wss can be computed using a smaller sample set
> set.seed(1001)
> for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
>
> # convert the wss into a dataframe
> wss.df <- data.frame(cluster = 1:15, wss = wss)
>
> # visualize the scree-plot
> ggplot(data=wss.df, mapping=aes(x=cluster, y=wss)) + geom_point() + geom_line() +
    scale_x_continuous(breaks = seq(from=1, to=15, by=1)) +
    labs(x="Number of Clusters", y="Within-Clusters Sum of Squares")
[Plot: Within-Clusters Sum of Squares versus Number of Clusters (1-15); the curve bends at roughly 5 clusters]
Data cleaning: The process of handling missing data, smoothing noisy data, managing
outliers, and resolving inconsistencies in the data.
Data integration: The process of integrating data from multiple sources with different
representations by resolving conflicts among representations.
In the following we are going to discuss data cleaning, which is a vital step in preprocessing.
MCAR and MAR missingness can potentially be ignored, assuming that the data set is large enough for the analysis. However, MNAR missingness should be handled carefully. The literature on handling missing data is vast; however, there are two broad categories of methods, namely deletion and imputation.
2.6.1.2 Deletion
Deleting a record with a missing value or the entry of the missing value is the simplest
approach to handle missing data.
In listwise deletion (complete-case analysis) a record is entirely excluded from the analysis if it has any missing data. Although this method does not introduce bias if the missingness is MCAR or MAR, it affects the accuracy of the analysis unless the data set is large. On the other hand, it may introduce significant bias and lead to wrong conclusions if the missingness is MNAR.
2.6.1.3 Imputation
Imputation is the process of replacing missing values in a data set by their substitutes. The
aim of imputation is keeping all records in a data set by replacing any missing value by its
meaningful substitute. There are multiple ways to determine the substitutes for a missing
value and we introduce the most common methods in this text.
Mean imputation is a technique that replaces the missing values of a variable (feature) by the mean of the available observations/measurements of the variable. Although the technique is pretty straightforward, it may distort the distribution of the variable in the data set by artificially pulling the data points towards the mean, resulting in underestimated standard deviations for the variable, especially if the number of missing values is large in comparison to the size of the data set. Similarly, the data points artificially pulled towards the mean may reduce the estimated correlation between two variables, biasing the statistic towards zero.
Median imputation is a technique that replaces the missing values appearing in a variable
(feature) by the median of the available observations/measurements of the variable. This
method is used as an alternative to mean imputation in case the distribution of the variable
is skewed. Similar to mean imputation it may distort the distribution of a variable in the
data set by artificially pulling the data points towards the median.
Logical imputation is a technique that replaces the missing values appearing in a variable
(feature) by a logical value that makes sense due to domain knowledge or expertise. For
example in a survey the number of years served in prison might be missing for the subjects
who have never been sentenced. Replacing such missing values with zero makes sense for
the cases who have never been sentenced.
Replacing all missing values in a variable by a single value may distort the distribution towards the replacement value in mean and median imputation. Simple random imputation reduces such distortion by replacing each missing value with a randomly selected observed/measured value of the same variable. This kind of imputation carries the implicit assumption that the missingness in the variable has no systematic bias and is not correlated with any other variable in the data set.
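A small sketch of mean, median and simple random imputation on a hypothetical vector with missing values:
# Three simple imputation strategies for a numeric variable with NAs
x <- c(12, NA, 15, 22, NA, 19, 140, 17)
meanImputed   <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
medianImputed <- ifelse(is.na(x), median(x, na.rm = TRUE), x)
set.seed(1001)
randomImputed <- x
randomImputed[is.na(x)] <- sample(x[!is.na(x)], sum(is.na(x)), replace = TRUE)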
Indicator variable imputation is especially common in regression analysis. In this method
an indicator variable is created for each variable which has missing value. The indicator
variables take on value zero if the corresponding value is missing and one otherwise. Then,
each missing value per variable is replaced by a single value (such as zero or mean of the
variable). Afterwards, the regression analysis is done by using the introduced indicator
variables as predictors as well. This method applies well if the missingness in a variable is
independent of the response variable and the predictor variables are uncorrelated.
Prediction model imputation is a technique that replaces the missing values appearing
in a variable (feature) by a value obtained through a prediction model which is based
on the other variables in the data set. KNN regression and linear regression for numeric
variables and logistic regression and linear discriminant analysis for categorical variables are
common models used in prediction model imputation. In prediction model imputation the
variable with the missing values is considered to be the response variable and one or more
other variables in the data set are considered to be explanatory (predictor) variables. The
value estimated by the model is used to replace the missing value for each record. Although prediction model imputation is a stronger alternative to the other imputation techniques, it is computationally expensive and requires carefully selecting the predictor variables. Moreover, in prediction model imputation the estimated values are simply the most likely values for the missing data; they do not carry the uncertainty captured by the residual variance. This may cause an overestimated model fit in later analysis. Stochastic regression imputation addresses this by adding random noise, based on the residual variance of the regression, to the estimated values for the missing data. Finally, predictor variables with missing values need to be handled properly as well. Iterative regression models can be used to resolve missing values in different columns iteratively. For details please see Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill.
Imputation by nature introduces noise into the data set. Multiple imputation is a technique to reduce the noise due to imputation. In multiple imputation the missing values are replaced by an existing imputation method, such as simple random imputation or prediction model imputation. However, the process is repeated multiple times to obtain more than one imputed data set. The analysis is done on each imputed data set and the results are pooled by averaging or by a more advanced method. For details please see Statistical Analysis with Missing Data by Roderick J. A. Little and Donald B. Rubin.
Please see the following part for a discussion on managing outliers in a data set.
2.7 Exercises
Univariate and Bivariate Descriptive Statistics
1. Generate a vector of 10 numbers sampled from a normal distribution with mean zero,
µ = 0, and standard deviation eight, σ = 8. Compute and print the mean and
standard deviations of the elements in the vector. Discuss whether they are close to
the theoretical mean and standard deviation. Repeat the same experiment with 100,
1000 and 10000 samples, discuss whether the means and the standard deviations are
getting closer to the theoretical values.
2. Install the package AER on your computer, if you have not done so before. Next, load the dataset named NMES1988 from the AER package.
(a) In your own words, explain what this dataset is about after doing some research
about the dataset.
(b) Print the output of the str function on NMES1988. Also, use the help function
to explain the variables of NMES1988 dataset in your own words.
(c) Compute and print both the mean and the median of the emergency variable. Is the median much smaller than the mean? How does this affect the skewness of the distribution? Please explain. You may use the table function to have a look at the distribution of the emergency variable.
(d) Compute and print the mean and standard deviation of the age variable. How do you interpret the standard deviation of age?
(e) Compute and print the quartiles of the age variable.
(f) Using the 1.5 IQR rule, find and print how many outliers there are in age.
(g) Use coefficient of variation (CV) to compare and interpret the dispersion in
visits and income.
135
The copyright law [Title 17, U.S. Code] prohibits sharing and distributing any parts of this document
(h) Create a new dataset called NMES1988.SUB by excluding all variables of mode factor. That is, NMES1988.SUB has all rows, but only the numeric and integer columns. Note that the variables at column indices (7, 9, 10, 12, 13, 14, 17, 18, 19) are factors. Then, print the correlation matrix of NMES1988.SUB.
(i) Regarding question 2h, which two distinct variables have the least amount of correlation? How do you interpret it?
(j) Regarding question 2h, which two distinct variables have the most amount of correlation? How do you interpret it?
(k) Regarding question 2h, explain why the correlations on the main diagonal of the
correlation matrix are one.
Chapter 3
Data Visualization
3.1.1 Outliers
In statistics, an outlier is an observation which appears isolated from other observations. Outliers may appear genuinely in the dataset or may appear because of some error in measurement or recording. In both cases you need to locate them and analyze them individually or as a group to find the reason behind their existence. Box plots are good for locating outliers in one dimensional data. Scatter plots are good for locating outliers in two dimensional data.
3.1.3 Variability
In statistics, variability is a measure of dispersion denoting how a distribution is stretched
or squeezed. Most of the time there is a continuous variable in the data along with a
categorical variable. It is always good to analyze the variability of the continuous data per
the categorical data. Box plots per categorical data helps us analyze the variability in the
continuous variable. One can additionally use jitter plots to see the number of observations
contributing to the variability per category as well. Other graphics for visually observing variability are density plots and histograms.
As for two dimensional data, covariance and the correlation coefficient are good statistical tools for assessing the linear variability between two variables. Typically, a scatter plot of two variables might suggest that they do not covary, that they covary positively, or that they covary negatively. Positive correlation (covariance) simply implies that as the values of one variable increase, the values of the other variable increase with them. Negative correlation (covariance) simply implies that as the values of one variable increase, the values of the other variable decrease. Scatter plots are the main tools to visualize the co-variation of two variables.
3.1.4 Clustering
In statistics, cluster analysis is the process of determining different groups of observations in the dataset. Having isolated or semi-isolated groups of observations in the dataset suggests the existence of clusters. One needs to analyze the factors causing the clustered behavior in the dataset. Scatter plots are good for visually detecting the existence of clusters in two dimensional data. Heat maps and 3D graphics are good for seeing clusters in three dimensional data.
> install.packages("ggplot2")
> require(ggplot2)
Geometries
A Geometry (geom) is the graphical element used to represent data or the statistics
of the data. Common geometries are points, lines, bars, densities, and text.
Aesthetics
Aesthetics are the attributes of geometries that control the visual properties of the displayed geometries. Common aesthetic attributes are x-position, y-position, size of the geometry, shape of the geometry, and color of the geometry.
Statistics
A statistic is a function that transforms or summarizes the data in a different form.
Most of the time one is interested in displaying a statistic of the raw data rather
than or along with the raw data itself. Some statistics are smoothers, regression lines,
mean, median, quantile, and bins.
Scales
A scale controls the display of the coordinate system as well as transformations of the coordinate system into a different form. Common scales control logarithmic axes, axis labels, axis limits, and axis colors.
> library(ggplot2)
> data(diamonds)
A graphics representing a dataset starts with the data layer. The data layer is created using "ggplot(data)". The parameter data is the R dataframe that we want to visualize. The other layers are then added onto the ggplot object incrementally. Each layer, including the data layer, may have one or more aesthetics parameters. Special to the ggplot(...) function, any aesthetics defined in this function are inherited by the subsequent layers. To exploit this functionality we are going to use another form of the function, defined as "ggplot(data, aes(x, y, <other aesthetics>))", and define the x and y parameters so that we do not have to define them at each layer that we add later. Note that aes(...) is a function parameter.
In the following we are going to analyze the relation between carat and price variables
of the diamonds dataset.
p is the ggplot2 object having only the data layer. Printing p at this step results in an error because we haven't added any geometric object (geom) representing the data points to be plotted. Scatter plots are commonly used to display the relationship between two variables. The ggplot2 geom used to generate scatter plots is "geom_point(mapping=aes(...), ...)". A new layer is added to the existing layers using the + operator. Each layer added to the graphics is technically a function used to create a graphics element.
In the following we are going to add a geometry layer to our data layer to obtain a displayable graphics.
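The commands behind these steps are roughly as follows (a sketch; the exact terminals are shown as figures):
library(ggplot2)
data(diamonds)
# Data layer only: p stores the data and the inherited x and y aesthetics
p <- ggplot(data = diamonds, mapping = aes(x = carat, y = price))
# Adding a point geometry yields a displayable scatter plot
p + geom_point()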
You can set values for the shape and color parameters of geom_point() to control the shape used to display the data and its color.
A very important characteristic of ggplot2 is that we did not explicitly set the x and y variables for geom_point(). In fact, the data, the x variable, and the y variable are inherited by geom_point() from the data layer, ggplot(). Note that you can override the inherited parameters if you need to.
We can further improve the plot by adding a smoother statistic to see the pattern in the data. We use the function stat_smooth() to add the statistic layer to our data. The band around the smoothing function is the 95% confidence interval. Note that one can display only the smoother without the actual data points by omitting the geom_point() element but keeping the stat_smooth() element.
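A sketch of the smoother layer (the shape and color values here are arbitrary illustrations):
# Scatter plot with a smoother; the ribbon is the 95% confidence band
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point(shape = 1, color = "darkblue") +
  stat_smooth()
# The smoother alone, without the raw data points
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) + stat_smooth()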
[Plot: scatter plot of price versus carat]
[Plot: scatter plot of price versus carat with customized point shape and color]
[Plot: scatter plot of price versus carat with a smoothing function and its confidence band]
3.2.3 Aesthetics
Aesthetics are used to define the mapping between the variables of the data and the properties of visual elements. Although there is a relation between aesthetics and geometric objects, most aesthetics are applicable to the majority of geometric objects. Aesthetics are defined using the aes() function.
[Plot: scatter plot of price versus carat colored by clarity (legend levels I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)]
[Plot: scatter plot of price versus carat colored by color (legend levels D-J)]
[Plot: scatter plot of price versus carat colored by the continuous variable depth (gradient legend)]
On the other hand, not all aesthetic parameter mappings work for continuous variables. For example, the shape parameter groups the data and represents each group by a shape for a categorical variable; however, it does not do the same for a continuous variable.
3.2.3.2 Inheritance
In the following we are going to add one more layer to the graphics that we created in
Terminal 3.8. However, instead of employing the entire dataset, we will work with a sample
consisting of only 128 instances for illustration purposes. geom_line() is a geometric object
used to create a layer that connects the data points through lines.
> set.seed(2020)
> dsample <- diamonds[sample(nrow(diamonds), 128), ]
> ggplot(data=dsample, mapping=aes(x=carat, y=price, color=clarity)) +
geom_point() + geom_line()
[Plot: price versus carat for the 128 sampled diamonds, with points and connecting lines colored by clarity]
Similar to geom_point(), geom_line() inherited the color aesthetic mapping from the data layer created by the ggplot() function. Remember that the color mapping groups the data according to the given variable and plots each group individually on the graphics. Hence, each sub-group is displayed using a different color.
ggplot2 geometric objects inherit aesthetic mappings only from the data layer defined by the ggplot() function. Moving an aesthetic mapping from the data layer to a geometric object invalidates the inheritance. In the following we move the color mapping from ggplot() to geom_point(). Although the dataset is grouped based on clarity and different clarities are represented by different point colors, the grouping and its representation are not
clarities are represented by different point colors, grouping and its representation is not
148
The copyright law [Title 17, U.S. Code] prohibits sharing and distributing any parts of this document
inherited by the geom_line() layer. As a result geom_line() treats the dataset as a single
unified group.
> set.seed(2020)
> dsample <- diamonds[sample(nrow(diamonds), 128), ]
> ggplot(data=dsample, mapping=aes(x=carat, y=price)) +
geom_point(aes(color=clarity)) + geom_line()
[Plot: price versus carat with points colored by clarity but a single line connecting all 128 sampled points]
[Plot: scatter plot of price versus carat colored by clarity]
3.2.4.1 Histogram
A histogram is a visual representation of the distribution of single variable data (the y-axis is the count by default). A histogram is constructed by counting the number of observations falling into a particular category (bin). For categorical variables each category serves as a bin. On the other hand, the range of a continuous variable is divided into disjoint intervals that serve as bins. Although equal-interval bins are common, one can use bins with different interval lengths to construct a histogram. The ideal number of bins, on the other hand, depends on the distribution of the data. Given that there are n observations, √n can be used as the number of bins, in general. Note that histograms are highly sensitive to the number of bins or bin intervals.
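A sketch of the histogram geometry (the bin counts below are illustrative choices):
# Histogram of price; the bins parameter controls the number of bins
ggplot(data = diamonds, mapping = aes(x = price)) + geom_histogram(bins = 30)
# Roughly sqrt(n) bins, following the rule of thumb above
ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_histogram(bins = round(sqrt(nrow(diamonds))))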
[Plot: histogram of price (y-axis: count)]
[Plot: histogram of price with a different binning (y-axis: count)]
[Plot: density plot of price]
[Plot: density plots of price grouped by cut (Fair, Good, Very Good, Premium, Ideal)]
The line inside the box marks the 50th percentile of the data, which is the median. The top and bottom whiskers extend to the maximum and minimum values that are not considered to be outliers. The points (if any) shown beyond the bottom and top whiskers are the outliers. The outlier boundaries are calculated either by 1.5 IQR or by 1.58 IQR/√n, where IQR is the inter-quartile range calculated by Q3 − Q1 and n is the number of observations in the dataset.
[Plot: box plots of price per cut (Fair, Good, Very Good, Premium, Ideal)]
Although box plots are useful to study the ranges of the different quartiles, they do not reflect the number of points falling into each x-axis category. To develop further intuition it might be useful to plot the points along with the box plots.
One problem with Terminal 3.19 is that the points overlap. To remedy the problem we can use geom_jitter instead of geom_point. geom_jitter adds random noise to the positions of the points to improve the readability of possibly overlapping points.
Terminal 3.20 is better than Terminal 3.19; however, the figure still exhibits the overlapping problem due to the very high number of instances. There are two visualization approaches to remedy the problem: (i) randomly sampling the dataset and visualizing the smaller sample, or (ii) using the alpha parameter to control the transparency of the points. In Terminal 3.21, we take the second approach.
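A sketch of the progression from Terminal 3.19 to Terminal 3.21 (the alpha value is an illustrative choice):
# Box plots of price per cut with the raw points overlaid
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot() + geom_point()
# Jittered, semi-transparent points reduce the overplotting
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot() + geom_jitter(alpha = 0.02)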
3.2.4.4 Text
geom_text is the ggplot2 geometric object used for adding text labels to graphics. The required parameters x, y, and label define the x and y positions of the text and the text label,
[Plot: box plots of price per cut with the individual points overlaid]
[Plot: box plots of price per cut with jittered points]
[Plot: box plots of price per cut with jittered, semi-transparent points]
respectively. The x and y positions are defined according to the axes of the coordinate system, where they correspond to the center of the text label. In case one of the axes is categorical (factor), it is mapped to the integer levels of the factor while still supporting decimal coordinates. Additional parameters family, fontface, angle, size, alpha, and color allow setting the font family, font face (plain, bold, italic), angle, size, transparency, and color of the text label.
Note that one may need to set the data parameter of geom_text to NULL to prevent all row names of a dataset from being displayed. Lastly, geom_label is similar to geom_text with an additional rectangle drawn behind the text.
geom_text is demonstrated in Terminal 3.22. However, instead of employing the entire
dataset, we work with a sample consisting of only 128 instances for illustration purposes.
> set.seed(2020)
> dsample <- diamonds[sample(nrow(diamonds), 128), ]
> ggplot(data=dsample, mapping=aes(x=cut, y=price)) + geom_boxplot()+
geom_text(x=2.2, y=12600, label="Outlier", family="Times New Roman",
fontface="plain", color="red")
[Plot: box plots of price per cut with the text label "Outlier" placed near an outlying point]
identity
Simply uses x and/or y values of data while plotting geometric objects.
jitter
Adds random noise to x and/or y values of data while plotting geometric objects.
stack
Stacks the geometric objects on top of each other.
dodge
Dodges the geometric objects side by side.
fill
Expands or shrinks the geometric objects to fill the space in the final graphics.
Many geometric objects have identity as the default position value. stack and dodge can be used with geom_bar (the default is stack). jitter can be used with geom_point. fill can be used with geom_density and geom_bar to show proportions. Note the difference between the fill value of the position parameter and the fill parameter of the aesthetics mapping function aes().
Similar to geom_histogram, geom_bar is a single variable geometric object and the y-axis is always the number of observations, i.e., the count. geom_bar is used for factors (categorical variables) while geom_histogram is used for continuous numeric variables. The following examples show the number of cut observations per clarity in the dataset using the default position stack, then dodge, and then fill; a sketch follows.
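A sketch of the three position values:
# Number of observations per cut, subdivided by clarity
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar()                     # position = "stack" is the default
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(position = "dodge")   # bars placed side by side
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(position = "fill")    # bars rescaled to show proportions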
The following shows price densities stacked up per cut using the position values stack and fill, respectively. Again, note the difference between the fill value of the position parameter and the fill parameter of the aesthetics mapping function aes().
161
The copyright law [Title 17, U.S. Code] prohibits sharing and distributing any parts of this document
Figure 3.23: Bar graph of variables cut per clarity with default position stack
Figure 3.24: Bar graph of variables cut per clarity with position dodge
Figure 3.25: Bar graph of variables cut per clarity with position fill
[Plot: price densities per cut with position stack]
[Plot: price densities per cut with position fill]
[Plot: proportions of the cut levels]
Geometric objects are associated with a default statistical object used to transform or summarize the data before visualizing them on the graphics. For example, the default statistical object for geom_smooth is stat_smooth, and the default geometric object for stat_smooth is geom_smooth. That is why adding stat_smooth to a graphics not only computes a smoothing function for the data but also visualizes it using geom_smooth. Similarly, adding geom_smooth to a graphics uses stat_smooth to calculate the smoothing function before plotting it. This design of ggplot2 results in many geometric and statistic objects sharing the same name. On the other hand, one may override the default geometric object of a statistic object to change its default visualization, or override the default statistic object of a geometric object to change its default statistical transformation or summary. Note that the default statistic object for many geometric objects is stat_identity, meaning that the data are neither transformed nor summarized.
[Plot: scatter plot of price versus carat with a smoothing function]
Figure 3.30: Smoothing function for variables carat and price using polynomial regression
The following terminal output shows how to plot the function x^3 + 2x^2 − 5x + 7 (a sketch follows).
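A sketch of this plot (the column name domain is taken from the axis label in the figure; it is otherwise an arbitrary choice):
# Plot f(x) = x^3 + 2x^2 - 5x + 7 over the interval [-5, 5]
f <- function(x) x^3 + 2 * x^2 - 5 * x + 7
ggplot(data = data.frame(domain = c(-5, 5)), mapping = aes(x = domain)) +
  stat_function(fun = f) + labs(y = "y")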
[Plot: the function plotted over the interval [−5, 5] (x-axis: domain, y-axis: y)]
[Plot: price per cut]
Figure 3.33: Mean values along with jitter plot and 1 standard deviation ranges
> library(Hmisc)
> ggplot(data=diamonds, mapping=aes(x=cut, y=price)) + geom_jitter(alpha=0.02) +
+ stat_summary(fun.data=mean_sdl, geom="pointrange", color="blue") +
+ stat_summary(fun=mean, geom="point", color="red", size=2)
[Plot: jittered points of price per cut with the mean (red point) and a one-standard-deviation range (blue point range) per cut]
stat_density2d estimates the joint density of two variables over a two dimensional grid. It bins the two dimensional scatter plot and maps the three dimensional density function onto topographic contours. The following terminal output shows the joint density estimation of the variables carat and price in our dataset.
[Plot: topographic density contours of price versus carat]
3.2.6.1 QQ Plots
QQ (Quantile-Quantile) plots are probability plots used to visually compare two distributions by plotting their quantiles against each other. If the final graphics looks like the y = x line, then the distributions on the x and y axes might be the same. Any other straight line suggests a possible linear relation between the distributions.
stat_qq is the ggplot2 statistic object for visualizing QQ plots of two distributions. In the following we are going to show the QQ plot of 256 randomly generated synthetic values (N(µ = 0, σ = 1)) against the theoretical normal distribution.
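A sketch of the commands (the seed value is arbitrary):
# QQ plot of 256 standard normal samples against the theoretical normal quantiles
set.seed(2020)
qqdata <- data.frame(sample = rnorm(256, mean = 0, sd = 1))
ggplot(data = qqdata, mapping = aes(sample = sample)) + stat_qq()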
174
The copyright law [Title 17, U.S. Code] prohibits sharing and distributing any parts of this document
[Plot: QQ plot of the 256 sample quantiles against the theoretical normal quantiles]
Mapping the computed variable ..density.. to the y-axis produces a relative frequency (..density..) histogram, that is, the number of observations falling into a bin divided by the total number of observations. The following terminal output shows the relative frequency histogram of the price variable.
[Plot: relative frequency (density) histogram of price]
The following adds a density plot to the relative frequency histogram shown in Terminal 3.36.
Similarly, one can use ..count.. to refer to the exact frequency of observations. The following terminal output shows the same plot shown in Terminal 3.27 using the exact frequencies rather than relative frequencies.
The topographic contours of stat_density2d can be accessed through the ..level.. variable. The following terminal output shows the joint density estimation of the variables carat and price with the contours replaced by filled polygons.
Alternatively, one can disable the contours by setting the contour parameter to FALSE and generate a heat map based on the joint density.
3.2.7 Scales
Remember that aesthetic parameters control the way we map the data to the geometric objects in our graphics. Scales, however, control the visual representations of the aesthetic properties of geometric objects. The essential aesthetic parameters of ggplot2 are x, y, size, shape, linetype, color, fill, alpha, group, and order. These parameters are visually represented as
[Plot: relative frequency histogram of price with an overlaid density plot]
Figure 3.38: price density grouped by cut using position fill with ..count..
[Plot: density contours of price versus carat filled by ..level..]
[Plot: heat map of the joint density of price and carat]
“guides”. The guides for x and y aesthetic parameters are x and y axes of the graphics.
The guides for other aesthetic parameters are simply legends. These guides not only vi-
sually represent the aesthetic properties but also allow readers to interpret the meanings
of aesthetic mappings in the graphics. Note that most of the time ggplot2 automatically
picks the proper scales according to the aesthetic parameters that we use to map the data
to geometric objects.
Table 3.3 shows the different scales used to control the guides (visual representations)
of various aesthetic parameters.
Although ggplot2 provides scale_shape_continuous and scale_linetype_continuous, these scales cannot be applied to continuous variables because shapes and line types are inherently discrete.
Figure 3.41: Scatter plot with altered x and y axis names and ticks
[Figure: scatter plot with the y axis relabeled "Diamond Price" and the x axis relabeled "Diamond Cut" with ticks from 0.00 to 3.00 in steps of 0.25.]
labels are used to re-label the x and y axis ticks. Best practice is to use labels together with breaks and to map each tick generated by breaks to a label. Another important parameter of scale_x_continuous and scale_y_continuous is trans, which is used to transform the scales of the x and y axes. Note that in practice trans applies to continuous rather than categorical axes. Table 3.4 shows the possible axis transformations. The following terminal output shows the same graphics presented in Terminal 3.5 with log10 axis transformations.
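The following is a hedged sketch of both ideas; the breaks, labels, and axis names are illustrative values rather than the book's exact settings:

library(ggplot2)
set.seed(2020)
dsample <- diamonds[sample(nrow(diamonds), 128), ]

# custom axis names, breaks, and labels
ggplot(data = dsample, mapping = aes(x = carat, y = price)) + geom_point() +
  scale_x_continuous(name = "Diamond Carat",
                     breaks = seq(0, 3, by = 0.25),
                     labels = format(seq(0, 3, by = 0.25), nsmall = 2)) +
  scale_y_continuous(name = "Diamond Price")

# log10 transformation of both axes
ggplot(data = dsample, mapping = aes(x = carat, y = price)) + geom_point() +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10")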
[Figure: scatter plot of price versus carat with log10-transformed x and y axes.]
[Figure: stacked bar plot with counts up to 20000, filled by clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF).]
Furthermore, one can manually assign colors using the values parameter of scale_fill_manual. You can use color names, RGB values, or hex color codes to set the values parameter.
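A sketch with an arbitrary palette (one value per clarity level, mixing color names and hex codes):

library(ggplot2)
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + geom_bar() +
  scale_fill_manual(values = c("firebrick", "#E69F00", "#56B4E9", "forestgreen",
                               "#F0E442", "#0072B2", "orchid", "#999999"))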
The following terminal output shows the same graphics presented in Terminal 3.43 with
manual colors.
[Figure: the same stacked bar plot with manually assigned fill colors for clarity.]
You can use scale_fill_grey to set the fill colors to grayscale as shown in the following terminal.
Similar to discrete colors, ggplot2 allows setting gradient colors for continuous data. scale_fill_gradient uses the parameters low and high to explicitly set a color gradient that gradually changes from low to high.
The following terminal output shows the same graphics presented in Terminal 3.40 with a manually set color gradient.
scale_fill_gradient2 is similar to scale_fill_gradient, with the additional parameters mid, to set the middle color between the low and high colors, and midpoint, to set the middle point value.
scale_fill_gradientn is a generalized version of scale_fill_gradient2 for n colors, which are given manually or via R's built-in color palettes, such as rainbow(), terrain.colors(), and topo.colors(), using the colours parameter.
The following terminal output shows the same graphics presented in Terminal 3.46 with
rainbow colors.
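A sketch of the gradient family applied to the joint density heat map; the colors and the midpoint value are arbitrary choices, not the book's:

library(ggplot2)
p <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  stat_density2d(mapping = aes(fill = ..density..), geom = "tile", contour = FALSE)

p + scale_fill_gradient(low = "white", high = "darkblue")            # two-color gradient
p + scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                         midpoint = 5e-04)                           # diverging gradient around a midpoint
p + scale_fill_gradientn(colours = rainbow(7))                       # n-color gradient from a palette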
[Figure: the same stacked bar plot with an alternative fill color scale for clarity.]
[Figure: carat versus price density heat map with a gradient fill scale (legend: density).]
[Figure: carat versus price density heat map with another gradient fill scale (legend: density).]
> set.seed(2020)
> dsample <- diamonds[sample(nrow(diamonds), 128), ]
> ggplot(data=dsample, mapping=aes(x=carat, y=price, linetype=cut)) + geom_line() +
    scale_linetype_manual(values=c("dashed", "dotted", "solid", "dotdash", "twodash"))
[Figure: price versus carat line plot with line types mapped to cut (Fair, Good, Very Good, Premium, Ideal).]
sented by its distance from the center (pole) and an angle. A point $(x, y)$ is represented as $\left(\sqrt{x^2 + y^2},\, \arctan(y/x)\right)$. Finally, coord_map is used for map projections.
3.2.9 Faceting
Aesthetic mappings such as color, shape, or fill subgroup the data and plot all subgroups on the same graphic. Although this approach allows us to compare different subgroups, it can be difficult to read at times. An alternative approach supported by ggplot2 is faceting, which creates a panel for each data subgroup mapped by an aesthetic and plots each subgroup on a different panel. This way one can visually compare and contrast the plots of different subgroups better in some cases. facet_wrap and facet_grid are two ggplot2 faceting objects that are suited for univariate and bivariate faceting, respectively. Both objects support the facets parameter to set the variable to be used for subgrouping. The facets parameter of facet_wrap is set using the notation facets=∼variable. As for facet_grid, it is set using the notation facets=rowVariable∼columnVariable. Additionally, facet_wrap supports the nrow and ncol parameters to set the number of rows and columns of the panel table, respectively.
The following terminals show examples of faceting using facet_wrap and facet_grid.
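As a rough sketch of such terminals (using the 128-row sample drawn earlier; the panel counts and faceting variables are illustrative):

library(ggplot2)
set.seed(2020)
dsample <- diamonds[sample(nrow(diamonds), 128), ]

# univariate faceting: one panel per cut level
ggplot(data = dsample, mapping = aes(x = carat, y = price)) + geom_point() +
  facet_wrap(facets = ~cut, ncol = 2)

# bivariate faceting: clarity rows by cut columns
ggplot(data = dsample, mapping = aes(x = carat, y = price)) + geom_point() +
  facet_grid(clarity ~ cut)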
[Figure: price versus carat scatter plots faceted with facet_wrap; visible panels include Premium and Ideal.]
One may map the color parameter to either the row or column variable of the facet_grid function to use different colors for the levels of the categorical variable, as shown in Figures 3.51 and 3.52. Note that this type of visualization often requires hiding the legend.
Figure 3.50: Faceting data per clarity and cut using facet_grid
Figure 3.51: Faceting data per clarity and cut using facet_grid
Sometimes it is necessary to plot the variables in a dataset as pairs and analyze them side by side in a matrix. ggplot2 does not provide a direct function to plot a matrix of paired variables. However, GGally, an extension package to ggplot2, provides a function named ggpairs to plot matrices of paired variables. By default the diagonal of the paired variables matrix holds the variable names. The upper and lower triangles of the matrix show graphics of the paired variables, where the row is the x axis and the column is the y axis. Depending on the data type, each panel shows the default ggplot2 graphics object. If both row and column are numeric, it shows a scatter plot in the lower triangle panel and the correlation coefficient value in the upper triangle panel. If the row is numeric but the column is a factor, it shows a bar plot. If the row is a factor but the column is numeric, it shows a box plot. If both row and column are factors, it shows a bar plot.
The most important parameters of the ggpairs function are data, columns, upper, lower, and diag. data denotes the data set to be used. columns specifies the columns to be cross paired; it accepts either a vector of column numbers or a vector of column names as strings. upper and lower are used to set the plot types for the upper and lower panels of the matrix. Possible variable pair types are continuous-continuous (continuous), discrete-discrete (discrete), and continuous-discrete (combo). Please see the documentation for the possible plot values. Furthermore, ggpairs allows plotting on the diagonal by setting the diag parameter to a plot value.
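A hedged sketch of setting these parameters; the panel type strings below are examples taken from the GGally documentation rather than the book's choices:

library(GGally)
library(ggplot2)
ggpairs(data = diamonds, columns = c("carat", "cut", "price"),
        upper = list(continuous = "cor", combo = "box_no_facet"),
        lower = list(continuous = "points", combo = "facethist"),
        diag  = list(continuous = "densityDiag", discrete = "barDiag"))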
Figure 3.52: Faceting data per clarity and cut using facet_grid
Figure 3.53: Generating matrix plots
> library(GGally)
> ggpairs(data=diamonds, columns=c("carat","cut","price"))
[Figure: ggpairs matrix of carat, cut, and price; the upper panel reports Corr(carat, price) = 0.922.]
3.2.10 Themes
The ggplot2 theme object allows modifying the appearance of almost all non-data elements plotted on a graphic. Since it supports a huge number of parameters, we cannot include all of them in this section. Interested readers are directed to the documentation of the theme function at the ggplot2 website. ggplot2 provides two built-in themes, namely theme_grey, which is used by default, and theme_bw, a theme with a white background.
The following figure shows how to change the theme to black and white as well as change the font size and family.
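A minimal sketch, with an arbitrary font size and family:

library(ggplot2)
set.seed(2020)
dsample <- diamonds[sample(nrow(diamonds), 128), ]
ggplot(data = dsample, mapping = aes(x = carat, y = price)) + geom_point() +
  theme_bw(base_size = 16, base_family = "serif")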
ggmap divides the spatial visualization process into three steps: (i) downloading and formatting the map images using the get_map function; (ii) plotting the map as a context layer using the ggmap function; and (iii) adding additional layers using the geom and/or stat objects of ggplot2. Note that ggmap also provides the qmap function for quick plotting, similar to ggplot2's qplot.
> library(ggmap)
> ?register_google
>
> # After reading the description and obtaining your API key through your Google account
> register_google(key="PUT YOUR API KEY HERE")
The register_google function registers your key for your current session. That is, each
time you start a new session, you need to register your key to be able to use the Google Maps
services. Alternatively, you can set the write parameter of the register_google function to
TRUE to make key registration permanent. Since your key is a private key associated with
your account, do not share your key when you share your code for any purposes, including
assignments and projects.
The get_map function simply returns a raster object representing the map as a matrix
of hexadecimal colors. The following terminal partially shows the raster object returned by
the get_map function.
The function used to visualize maps as a ggplot2 context layer is the ggmap function of the ggmap package. The map image used by ggmap is typically obtained using the get_map function. The following terminal demonstrates the use of the ggmap function to visualize the raster object obtained in Terminal 3.56. By default, ggmap fixes the x axis of a graphic to longitude, the y axis to latitude, and the coordinate system to the Mercator projection.
Another parameter of the ggmap function is extent, which takes the values 'normal', 'device', and 'panel'. 'normal' shows the blank ggplot panel behind the map. 'panel' eliminates the blank background panel and shows only the longitude and latitude axes. Finally, 'device' removes both the x and y axes, leaving only the map itself. The following terminal shows the hybrid map of the University of Louisiana campus obtained from Google Maps with extent set to 'device'.
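A sketch of these steps with extent set to 'device'; the location string and zoom level are assumptions, and a registered API key is required as described above:

library(ggmap)
laff <- get_map(location = "University of Louisiana at Lafayette",
                zoom = 14, maptype = "hybrid", source = "google")
ggmap(ggmap = laff, extent = "device")   # drop the axes and background, keep only the map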
> ggmap(ggmap=laff)
[Figure: map rendered by ggmap; the latitude (lat) axis spans roughly 30.15 to 30.30.]
Coordinate Reference System (CRS) arguments. Furthermore, the notation for the proj4string parameter requires special attention.
• proj4string arguments are provided using CRS, an interface class to the PROJ.4 projection system.
• proj4string arguments are provided in the form +arg1=val1 +arg2=val2 ...
As a case study we downloaded the US states shapefile from the website of the United States Census Bureau, TIGER project. As demonstrated in Figure 3.59, we converted the shapefile using the readShapeSpatial function of the package maptools. For our purposes we set proj4string as 'proj4string=CRS("+proj=longlat +datum=WGS84")'. The argument +proj=longlat specifies that the projection to be used is the longitude/latitude projection. The argument +datum=WGS84 specifies that the datum to be used is the World Geodetic System 1984 standard, a global coordinate system for cartography, geodesy, and navigation used by GPS (Global Positioning System). Checking the class of the resulting object ('US.shape') shows that it is a SpatialPolygonsDataFrame object. We also used R's str function to compactly view the structure of the 'US.shape' object. Notice that the variables NAME and STUSPS hold the state names and abbreviations, respectively.
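A sketch of this step; the shapefile name below is only a placeholder for whatever the downloaded TIGER file is called, and readShapeSpatial has since been retired in favor of newer tools such as sf::st_read:

library(maptools)   # attaches sp, which provides the CRS class
US.shape <- readShapeSpatial("tl_us_state.shp",   # placeholder file name
                             proj4string = CRS("+proj=longlat +datum=WGS84"))
class(US.shape)     # SpatialPolygonsDataFrame
str(US.shape@data)  # the data slot holds NAME, STUSPS, and other attributes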
The second step to visualize a shapefile using ggmap is converting the Spatial*DataFrame object into a data frame. Remember that ggmap requires the data being plotted to be a data frame. The ggplot2 function fortify, a generic function used to convert various objects into R data frames, is used to convert Spatial*DataFrames as well.
In Figure 3.60 we convert the Spatial*DataFrame object obtained in Figure 3.59 into a data frame. The figure also shows the first 6 lines of the 'US.shape.df' data frame using the head function. Next, we plot it as a polygon geom on top of a blank ggplot2 layer obtained by calling the ggplot function without parameters. The aes mapping sets the x axis to longitude, the y axis to latitude, and the group parameter to the group variable of the data frame.
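A sketch of the conversion and the polygon layer (the fill and border colors are arbitrary):

library(ggplot2)
US.shape.df <- fortify(US.shape)   # columns include long, lat, order, hole, piece, id, group
head(US.shape.df)

ggplot() +
  geom_polygon(data = US.shape.df,
               mapping = aes(x = long, y = lat, group = group),
               fill = "white", colour = "black")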
[Figure: polygon map of the US states; the latitude (lat) axis spans roughly 30 to 50.]
we set the x and y axes limits to (-130, -65) and (24, 50) to have a better looking mainland map. However, limiting the axes caused some states and regions, e.g., Alaska and Puerto Rico, to be invisible on the map.
Figure 3.61: Reading in the file LAzipLL.csv and obtaining the Louisiana map
Figure 3.62 shows the population scatter plot of Louisiana per zip code location. LAmap, obtained in Figure 3.61, is a ggplot layer representing the map of Louisiana as an image. geom_point simply adds a new layer on top of the LAmap layer. The aesthetic mappings for the x and y axes should be fixed to longitude (lon) and latitude (lat), respectively. Furthermore, the size aesthetic is used to map the population of each zip code to a point size in the scatter plot.
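A sketch of this layer; LAmap and LAdata (with columns lon, lat, and population) are assumed to exist as in Figure 3.61, and the point color and transparency are arbitrary:

library(ggmap)
LAmap +
  geom_point(data = LAdata,
             mapping = aes(x = lon, y = lat, size = population),
             colour = "red", alpha = 0.5)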
In Figure 3.63 we show a heat map of the population density of Louisiana. Note that the data is divided into two-dimensional bins rather than the administrative boundaries of zip codes. stat_summary2d allows using a value as the third dimension (z dimension) rather than the count of the observations falling into a bin. Furthermore, the fun parameter of stat_summary2d allows using any function to process the values given as the z dimension. In the example we use the built-in function sum to tell stat_summary2d to sum the values of the z dimension falling into a bin.
2 http://factfinder2.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t
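A sketch of the binned summary just described; recent ggplot2 releases spell the layer stat_summary_2d, and LAmap and LAdata are assumed as above with the bin count and transparency chosen arbitrarily:

library(ggplot2)
library(ggmap)
LAmap +
  stat_summary_2d(data = LAdata,
                  mapping = aes(x = lon, y = lat, z = population),
                  fun = sum, bins = 30, alpha = 0.7)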
[Figure: scatter plot of population per zip code over the Louisiana map; point size keyed to population (0 to 50000); axes lon (about −94 to −90) and lat (about 28 to 32).]
[Figure: binned heat map of summed population over the Louisiana map; the legend value ranges from about 1e+05 to 3e+05.]
Next, we want to use stat_density2d to estimate and plot the density graphics of the population. However, the kde2d kernel density estimator does not support a z axis for population values, nor does it accept a weight aesthetic as geom_density does. That is, it simply counts the observations falling into a particular two-dimensional bin. Hence, we are going to use a trick that works for our purposes. Specifically, we are going to create a new dataset that repeats each row in the original dataset as many times as its population value. Next, we plot the dataset using stat_density2d. Setting the guide parameter of scale_alpha removes the legend for transparency from the final graphics. The process is demonstrated in Figure 3.64. Note that, due to the number of points in the repeated dataset, density estimation may take several minutes on your system.
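A sketch of the row-repetition trick, again assuming LAmap and LAdata with a population column:

library(ggplot2)
library(ggmap)
# repeat each zip code row as many times as its population value
LAdata.rep <- LAdata[rep(seq_len(nrow(LAdata)), times = LAdata$population), ]

LAmap +
  stat_density2d(data = LAdata.rep,
                 mapping = aes(x = lon, y = lat, fill = ..level.., alpha = ..level..),
                 geom = "polygon") +
  scale_alpha(guide = "none")   # drop the transparency legend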
[Figure: population density contours (legend: level) over the Louisiana map.]
Although the zip codes correspond to administrative regions, we used the geocode function to represent them as longitude and latitude points and used these points in the previous graphics. Alternatively, we can obtain the boundaries as a zip code tabulation area shapefile from the US Census Bureau and depict them as administrative regions. Figure 3.65 shows how to read in the shapefile after downloading it and the structured summary of the resulting SpatialPolygonsDataFrame object. The documentation of the shapefile states that the ZCTA5CE10 variable represents the zip codes.
In Figure 3.66 we first read the 'LAzipLL.csv' dataset containing the zip codes, population, income per capita, and median housing prices along with the longitude and latitude values computed by the geocode function. Since we need to work with the zip codes belonging to the state of Louisiana, we subsetted the 'shape' object to include only the zip codes belonging to Louisiana ('LA.shape') using the %in% operator. Note that the class of 'LA.shape' is SpatialPolygonsDataFrame. Next, we used the fortify function to convert the 'shape' into a data frame. At this point the resulting data frame ('LA.shape.df') does not include all the fields in the data slot of 'LA.shape'. We merged the data frame ('LA.shape.df') with the data slot of 'LA.shape' to include the variables in the data slot. Finally, we merged the first four columns of our data set ('LAdata') with the 'LA.shape.df' data frame. Here we omitted the geocode-computed longitude and latitude values of 'LAdata' because 'LA.shape.df' already has those values.
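The following is only a sketch of that preparation; the column names of 'LAzipLL.csv' (zip, population, income, house) are assumptions made for illustration, and 'shape' is the ZCTA object read in Figure 3.65:

library(ggplot2)
LAdata <- read.csv("LAzipLL.csv")
LAdata$zip <- as.character(LAdata$zip)   # compare zip codes as character strings

# keep only the Louisiana zip code areas
LA.shape <- shape[as.character(shape$ZCTA5CE10) %in% LAdata$zip, ]

# convert to a data frame and re-attach the variables stored in the data slot
LA.shape@data$id <- rownames(LA.shape@data)
LA.shape.df <- fortify(LA.shape)
LA.shape.df <- merge(LA.shape.df, LA.shape@data, by = "id")

# merge in the first four columns of LAdata (assumed: zip, population, income, house)
LA.shape.df <- merge(LA.shape.df, LAdata[, 1:4],
                     by.x = "ZCTA5CE10", by.y = "zip")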
After preparing the data, we used the geom_polygon geometric object to visualize the median house prices on top of the map image of Louisiana obtained by the get_map function in Figure 3.67.
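A sketch of that final layer; LAmap and the merged LA.shape.df (with an assumed house column) come from the previous steps, and the transparency and gradient colors are arbitrary:

library(ggplot2)
library(ggmap)
LAmap +
  geom_polygon(data = LA.shape.df,
               mapping = aes(x = long, y = lat, group = group, fill = house),
               alpha = 0.8) +
  scale_fill_gradient(low = "yellow", high = "red")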
[Figure: median house prices per zip code area (legend: house, roughly 1e+05 to 3e+05) over the Louisiana map.]
geocode
Essentially it pulls the approximate longitude and latitude values of an address, place, or zip code using the Google Geocoding API. The location parameter accepts a location as a character string, and the function returns a data frame consisting of the longitude and latitude values of the location. Note that more information can be pulled by setting the output parameter to 'more'.
revgeocode
The revgeocode function pulls the approximate address of a given longitude and latitude pair supplied in numeric format. It uses the Google Geocoding API to conduct the transformation.
mapdist
The function uses the Google Distance Matrix API to calculate the distances and durations between any two locations based on Google-calculated routes for driving, bicycling, or walking.
route
The function uses the Google Maps API to determine the routes between any two locations. A route consists of a sequence of legs, where each leg has beginning and ending longitude and latitude pairs as well as distance and duration values. One can use geom_leg to show the route between two locations on a map. A short usage sketch of these four helpers follows.
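The sketch below shows one hedged call per helper; the addresses and coordinates are arbitrary examples, and each call requires a registered Google API key:

library(ggmap)
geocode(location = "Lafayette, Louisiana")            # longitude/latitude of a place
revgeocode(location = c(-92.0199, 30.2241))           # approximate address of a lon/lat pair
mapdist(from = "Lafayette, LA", to = "New Orleans, LA",
        mode = "driving")                             # distance and duration by road
route(from = "Lafayette, LA", to = "New Orleans, LA",
      mode = "driving")                               # legs of the Google-computed route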
3.3 Exercises
1. In the following you are going to analyze the “Caschool (The California Test Score)” dataset that comes with the R package Ecdat. In order to access the data set you first need to install the package Ecdat. Once you install the package and load it, the “Caschool” dataset will be available to you and you can load the data set using the data function.
(g) The type of the variable district, denoting the school names, should be a character string. If it is not, convert the variable type in the data frame and verify and print that your conversion took place using the class function.
(h) Determine and print how many different observations there are in the data set per county.
(i) Determine and print how many different observations there are in the data set per grade span (grspan).
(j) Use ggplot2 to show a bar plot of the variable county. Briefly explain what you see in the plot. Use the theme function to rotate and space the x-axis labels properly.
(k) What are the mean, median, maximum, minimum, and quartile values for the variable expnstu?
(l) Use ggplot2 to show a histogram of the variable expnstu. Does the histogram
look like a normal distribution? Use the Shapiro–Wilk test to test for normality
and interpret the p-value.
(m) Use ggplot2 to show a box-whisker plot of the variable expnstu with respect to
the variable grspan. In addition to the median value show the mean value on
the same plot. Compare the box-whisker plots and interpret the differences.
(n) Pertaining to question 1m, use ggplot2 to print the name of the outlier school
in KK-6 category next to its point representation in the graphics.
(o) Use ggplot2 to plot the scatter plot of the variables mathscr and readscr and
interpret the results.
(p) Use ggplot2 to plot the scatter plot of the variables avginc and testscr along
with their contour plot and interpret the results.
(q) Use ggplot2 to plot a heatmap of the variables avginc and expnstu employing
the rainbow colors and interpret the results.
2. In the following you are going to continue to analyze the “Caschool (The California Test Score)” dataset that comes with the R package Ecdat. In order to access the data set you first need to install the package Ecdat. Once you install the package and load it, the “Caschool” dataset will be available to you and you can load the data set using the data function.
(a) Carefully read the documentation of the quantile function and use it to print
the first (0.25), second (0.50) and third (0.75) quartiles of the variable mathscr.
Please set the names parameter to FALSE to strip the names of the quartiles.
(b) Carefully read the documentation of the quantile function and use it to print the
maximum and minimum values along with the first, second and third quartiles
of the variable mathscr. Please set the names parameter to FALSE to strip the
names.
(c) First, carefully read the documentation of the cut function. Then, use the cut
function to create a categorical vector named “mathscrL” from the continuous
variable mathscr of the dataset such that all values between the minimum and the
first quartile are labeled as M4 ; between the first and second quartile are labeled
as M3 ; between the second and third quartile are labeled as M2 ; and between the
third quartile and maximum are labeled as M1. It is important to note that you
need to use the quantile function while setting the breaks for the cut function.
Use the labels parameter for setting the category labels (M4, M3, M2 and M1).
Moreover, set the right parameter to FALSE and ordered_result parameter to
TRUE to control the range boundaries and factor ordering, respectively. Also
note that you may need to assign the category label of the maximum mathscr
value manually, if it violates the range and assumes an NA value. You may use
the is.na, any, and which functions to locate the NA values, if they exist. Lastly, append
the “mathscrL” vector to the Caschool dataset as a new column. Use the str
function on your data frame to show that the new column is appended.
(d) First, carefully read the documentation of the cut function. Then, use the cut
function to create a categorical vector named “readscrL” from the continuous
variable readscr of the dataset such that all values between the minimum and the
first quartile are labeled as R4 ; between the first and second quartile are labeled
as R3 ; between the second and third quartile are labeled as R2 ; and between the
third quartile and maximum are labeled as R1. It is important to note that you
need to use the quantile function while setting the breaks for the cut function.
Use the labels parameter for setting the category labels (R4, R3, R2 and R1).
Moreover, set the right parameter to FALSE and ordered_result parameter to
TRUE to control the range boundaries and factor ordering, respectively. Also
note that you may need to assign the category label of the maximum readscr
value manually, if it violates the range and assumes an NA value. You may use
the is.na, any, and which functions to locate the NA values, if they exist. Lastly, append
the “readscrL” vector to the Caschool dataset as a new column. Use the str
function on your data frame to show that the new column is appended.
(e) Use facet_wrap to visualize and interpret how variables expnstu and elpct
change for each category of the variable mathscrL created in question 2c.
(f) Use facet_grid to visualize and interpret how variables expnstu and elpct change for the pairs of categories of the variables mathscrL and readscrL, together. Note that the variables mathscrL and readscrL are created in questions 2c and 2d.
(g) Use ggmap to visualize the density map of the enrollment totals of the counties in the dataset on a physical map of California. Please use Figure 3.64 as your reference guide. Note that you may need to create a new dataset that aggregates the total enrollments of each county in California, because the dataset has multiple schools per county. There are multiple ways to aggregate data in R; one approach is to use the aggregate function along with the sum function as an argument.
(h) Use ggmap to visualize the map of the number of the total students qualifying for
the reduced-price lunch per county on an administrative map of California. Please
use Figure 3.67 as your reference guide. Note that you may need to create a new
dataset that aggregates the number of students qualifying for the reduced-price
lunch using the percentage in mealpct and the number of students in enrltot
variables per district for each county. There are multiple ways to aggregate data in R; one approach is to use the aggregate function along with the sum function as an argument.
Appendices
Appendix A
R Code Examples