Mathematical Foundations of Data Science Using R
2nd edition
Mathematics Subject Classification 2010
35-02, 65-02, 65C30, 65C05, 65N35, 65N75, 65N80
Authors
Prof. Dr. Frank Emmert-Streib
Tampere University
Faculty of Information Technology and Communication Sciences
Tampere 33100
Finland
[email protected]

Prof. Dr. Matthias Dehmer
Schweizer Fernfachhochschule
Department für Informatik
Schinerstrasse 18
3900 Brig
Switzerland
and
UMIT Private Universität für Gesundheitswissenschaften, Medizinische Informatik und Technik
Eduard Wallnöfer-Zentrum 1
6060 Hall in Tirol
Austria
[email protected]

Dr. Salissou Moutari
Queen's University Belfast
School of Mathematics and Physics
University Road
Belfast BT7 1NN
United Kingdom
[email protected]
ISBN 978-3-11-079588-2
e-ISBN (PDF) 978-3-11-079606-3
e-ISBN (EPUB) 978-3-11-079617-9
www.degruyter.com
Preface to the second edition
In recent years, data science has gained considerable popularity and established itself
as a multidisciplinary field. The goal of data science is to extract information from
data and use this information for decision making. One reason for the popularity
of the field is the availability of mass data in nearly all fields of science, industry,
and society. This has allowed a shift away from basing the analysis of a problem on theoretical assumptions toward data-driven approaches that are centered around these big data. However, to master data science and to tackle real-world data-based problems, a high level of mathematical understanding is required.
Furthermore, for a practical application, proficiency in programming is indispens-
able. The purpose of this book is to provide an introduction to the mathematical
foundations of data science using R.
The motivation for writing this book arose out of our teaching and supervising
experience over many years. We realized that many students are struggling to under-
stand methods from machine learning, statistics, and data science due to their lack
of a thorough understanding of mathematics. Unfortunately, without such a mathe-
matical understanding, data analysis methods, which are based on mathematics, can
only be understood superficially. For this reason, we present in this book mathemat-
ical methods needed for understanding data science. That means we are not aiming
for a comprehensive coverage of, e. g., analysis or probability theory, but we provide
selected topics from such subjects that are needed in every data scientist’s mathe-
matical toolbox. Furthermore, we combine this with the algorithmic realization of
mathematical methods by using the widely used programming language R.
The present book is intended for undergraduate and graduate students in the
interdisciplinary field of data science with a major in computer science, statistics,
applied mathematics, information technology or engineering. The book is organized
in three main parts. Part I: Introduction to R. Part II: Graphics in R. Part III:
Mathematical basics of data science. Each part consists of chapters containing many
practical examples and theoretical basics that can be practiced side-by-side. This way, one can seamlessly put the learned theory into practice.
Many colleagues, both directly and indirectly, have provided us with input, help,
and support before and during the preparation of the present book. In particular, we
would like to thank Danail Bonchev, Jiansheng Cai, Zengqiang Chen, Galina Glazko,
Andreas Holzinger, Des Higgins, Bo Hu, Boris Furtula, Ivan Gutman, Markus Geuss,
Lihua Feng, Oliver Ittig, Juho Kanniainen, Urs-Martin Künzi, James McCann, Abbe
Mowshowitz, Aliyu Musa, Beatrice Paoli, Ricardo de Matos Simoes, Arno Schmid-
hauser, Yongtang Shi, John Storey, Simon Tavaré, Kurt Varmuza, Ari Visa, Olli
Yli-Harja, Shu-Dong Zhang, Yusen Zhang, and Chengyi Xia, and we apologize to all whom we may have mistakenly failed to name. For proofreading and help with various chapters,
we would like to express our special thanks to Shailesh Tripathi, Kalifa Manjan, and
Nadeesha Perera. We are particularly grateful to Shailesh Tripathi for helping us
prepare the R code. We would also like to thank our editor Damiano Sacco from De Gruyter, who has always been available and helpful.
Finally, we hope this book helps to spread the enthusiasm and joy we have for
this field, and inspires students and scientists in their studies and research.
1 Introduction | 1
1.1 Relationships between mathematical subjects and data science | 2
1.2 Structure of the book | 4
1.2.1 Part one | 4
1.2.2 Part two | 4
1.2.3 Part three | 5
1.3 Our motivation for writing this book | 5
1.4 Examples and listings | 6
1.5 How to use this book | 7
Part I: Introduction to R
4 Installation of R packages | 26
4.1 Installing packages from CRAN | 26
4.2 Installing packages from Bioconductor | 26
4.3 Installing packages from GitHub | 27
4.4 Installing packages manually | 27
5 Introduction to programming in R | 30
5.1 Basic elements of R | 30
5.1.1 Navigating directories | 31
5.1.2 System functions | 31
5.1.3 Getting help | 32
5.2 Basic programming | 33
5.2.1 If-clause | 33
5.2.2 Switch | 34
5.2.3 Loops | 35
5.2.4 For-loop | 35
5.2.5 While-loop | 35
5.2.6 Logic behind a For-loop | 36
5.2.7 Break | 39
5.2.8 Repeat-loop | 39
5.3 Data structures | 39
5.3.1 Vector | 39
5.3.2 Matrix | 42
5.3.3 List | 45
5.3.4 Array | 46
5.3.5 Data frame | 47
5.3.6 Environment | 48
5.3.7 Removing variables from the workspace | 49
5.3.8 Factor | 49
5.3.9 Date and Time | 50
5.3.10 Information about R objects | 50
5.4 Handling character strings | 51
5.4.1 The function nchar() | 51
5.4.2 The function paste() | 52
5.4.3 The function substr() | 52
5.4.4 The function strsplit() | 53
5.4.5 Regular expressions | 53
5.5 Sorting vectors | 56
5.6 Writing functions | 57
5.6.1 One input argument and one output value | 57
5.6.2 Scope of variables | 59
5.6.3 One input argument, many output values | 60
5.6.4 Many input arguments, many output values | 61
6 Creating R packages | 76
6.1 Requirements | 76
6.1.1 R base packages | 76
6.1.2 R repositories | 77
6.1.3 Rtools | 77
6.2 R code optimization | 77
6.2.1 Profiling an R script | 78
6.2.2 Byte code compilation | 78
6.2.3 GPU library, code, and others | 79
6.2.4 Exception handling | 79
6.3 S3, S4, and RC object-oriented systems | 80
6.3.1 The S3 class | 80
6.3.2 The S4 class | 82
6.3.3 Reference class (RC) system | 83
6.4 Creating an R package based on the S3 class system | 84
6.4.1 R program file | 84
6.4.2 Building an R package | 86
6.5 Checking the package | 87
6.6 Installation and usage of the package | 87
6.7 Loading and using a package | 88
6.7.1 Content of the files edited when generating the package | 88
6.8 Summary | 91
13 Analysis | 225
13.1 Introduction | 225
13.2 Limiting values | 225
13.3 Differentiation | 228
13.4 Extrema of a function | 233
13.5 Taylor series expansion | 235
13.6 Integrals | 239
13.6.1 Properties of definite integrals | 240
13.6.2 Numerical integration | 240
13.7 Polynomial interpolation | 241
13.8 Root finding methods | 243
13.9 Further reading | 247
13.10 Exercises | 247
18 Optimization | 369
18.1 Introduction | 369
18.2 Formulation of an optimization problem | 370
Bibliography | 397
Index | 405
1 Introduction
We live in a world surrounded by data. Whether a patient is visiting a hospital
for treatment, a stockbroker is looking for an investment on the stock market, or
an individual is buying a house or apartment, data are involved in any of these
decision processes. The availability of such data results from technological progress
during the last three decades, which enabled the development of novel means of data
generation, data measurement, data storage, and data analysis. Despite the variety
of data types stemming from different application areas, for which specific data
generating devices have been developed, there is a common underlying framework
that unites the corresponding methodologies for analyzing them. In recent years, the
main toolbox or the process of analyzing such data has come to be referred to as
data science [73].
Despite its novelty, data science is not really a new field on its own as it draws
heavily on traditional disciplines [61]. For instance, machine learning, statistics, and
pattern recognition are playing key roles when dealing with data analysis problems
of any kind. For this reason, it is important for a data scientist to gain a basic under-
standing of these traditional fields and how they fuel the various analysis processes
used in data science. Here, it is important to realize that harnessing the aforemen-
tioned methods requires a thorough understanding of mathematics and probability
theory. Without such insight, the application of any method is likely to be done in a blindfolded way, lacking deeper understanding. This deficiency hinders the adequate
usage of the methods and the ability to develop new ones. For this reason, this book
is dedicated to introducing the mathematical foundations of data science.
Furthermore, in order to exploit machine learning and statistics practically, a
computational realization needs to be found. This necessitates the writing of algo-
rithms that can be executed by computers. The advantage of such a computational
approach is that large amounts of data can be analyzed using many methods in an ef-
ficient way. However, this requires proficiency in a programming language. There are
many programming languages, but one of the most suited programming languages
for data science is R [154]. Hence, this book presents the mathematical foundations
of data science using the programming language R.
That means we will start with the basics that are needed to become a data
scientist with some understanding of the used methods, but also with the ability to
develop novel methods. Due to the difficulty of some of the mathematics involved,
this journey may take some time since there is a wide range of different subjects to
be learned. In the following sections, we will briefly summarize the main subjects
covered in this book as well as their relationship to and importance for advanced
topics in data science.
Figure 1.1: Visualization of the relationship between several mathematical subjects and data
science subjects. The mathematical subjects shown in green are essentially needed for every topic
in data science, whereas graph theory is only used when the data have an additional structural
property. In contrast, differential equations and dynamical systems assume a special role used to
gain insights into the data generation process itself.
types. To emphasize this, we drew the links with dashed lines. Dynamical systems
are used to gain insights into the data-generation process itself rather than to analyze
the data. In this way, it is possible to gain a deeper understanding of the system to
be analyzed. For instance, one can simulate the regulation between genes that leads
to the expression of proteins in biological cells, or the trading behavior of investors
to learn about the evolution of the stock market. Therefore, dynamical systems can
also be used to generate benchmark data to test analysis methods, which is very
important when either developing a new method or testing the influence of different
characteristics of data.
The diagram in Figure 1.1 shows the theoretical connection between some math-
ematical subjects and data science. However, it does not show how this connection
is realized practically. This is visualized in Figure 1.2. This figure shows that pro-
gramming is needed to utilize mathematical methods practically for data science.
That means programming, or computer science more generally, is a glue skill/field
that (1) enables the practical application of methods from statistics, machine learn-
ing, and mathematics, (2) allows the combination of different methods from different
fields, and (3) provides practical means for the development of novel computer-based
methods (using, e. g., Monte Carlo or resampling methods). All of these points are
of major importance in data science, and without programming skills one cannot
unlock the full potential that data science offers. For the sake of clarity, we want to
emphasize that we mean scientific and statistical programming rather than general
purpose programming skills when we speak about programming skills.
Due to these connections, we present in this book the mathematical foundations
for data science along with an introduction to programming. A pedagogical side-
effect of presenting programming and mathematics side-by-side is that one learns
Figure 1.2: The practical connection between mathematics and methods from data science is
obtained by means of algorithms, which require programming skills. Only in this way can mathematical methods be utilized for specific data analysis problems.
In Part two, we focus on the visualization of data by utilizing the graphical capa-
bilities of R. First, we discuss the basic plotting functions provided by R and then
present advanced plotting functionalities that are based on external packages, e. g.,
ggplot. These sections are intended for data of an arbitrary structure. In addition,
we also present visualization functions that can be used for network data.
The visualization of data is very important and it can be viewed as a form of
data analysis. In the statistics community, such an approach is termed exploratory
data analysis (EDA). In the 1950s, John Tukey widely advocated the idea of data visualization as a means to generate novel hypotheses about the underlying data structure, which would otherwise not come to the mind of the analyst [97, 189].
These hypotheses can then be further analyzed by means of quantitative analysis
methods. EDA uses data visualization techniques, e. g., box plots, scatter plots, and
also summary statistics, e. g., mean, variance, and quartiles, to get either an overview of the characteristics of the data or to generate new insights. Therefore, a first step in formulating a question that can be addressed by a quantitative data analysis method often consists in the visualization of the data. Chapters 7, 8, and 9 present various visualization methods that can be utilized for the purpose of an exploratory data analysis.
Figure 1.3: A typical approach in data science in order to analyze a ‘big question’ is to reformu-
late this question in a way that makes it practically analyzable. This requires an understanding of
the mathematical foundations and mathematical thinking (orange links).
has more far-reaching consequences. From analyzing this situation, we realized that
there is no shortcut to learning data science other than to first gain a deep understanding of its mathematical foundations.
This problem is visualized in Figure 1.3. In order to answer a ‘big question’ of
interest for an underlying problem, frequently, one needs to reformulate the question
in a way that the data available can be analyzed in a problem-oriented manner.
This also requires either adapting existing methods to the data or developing a new method, which can then be used to analyze the data and obtain results. The adaptation of methods requires a technical understanding of the mathematical foundations of the
methods, whereas the reformulation of the ‘big question’ requires some mathematical
thinking skills. This should clarify our approach, which consists of starting to learn
data science from its mathematical foundations.
Lastly, our approach has the additional side-effect that the learner has an im-
mediate answer to the question, ‘what is this method good for?’. Everything we
discuss in this book aims to prepare the reader for learning the purpose and appli-
cation of data science.
Louden [122] pointed out that the above definition evokes some important con-
cepts, which merit brief explanation here. Computation is usually described using
the concept of Turing machines, where such a machine must be powerful enough to perform any computation a real computer can do. This has been proven true and,
moreover, Church’s thesis claims that it is impossible to construct machines which
are more powerful than a Turing machine.
In this chapter, we examine the most widely used programming paradigms, namely, imperative programming, object-oriented programming, functional pro-
gramming, and logic programming. Note that so-called “declarative” programming
languages are also often considered to be a programming paradigm. The defining
characteristic of an imperative program is that it expresses how commands should
be executed in the source code. In contrast, a declarative program expresses what
the program should do. In the following, we describe the most important features
of these programming paradigms and provide examples, as an understanding of
Figure 2.1: Programming paradigms and some examples of typical programming languages. R is
a multiparadigm language because it contains aspects of several pure paradigms.
these paradigms will assist program designers. Figure 2.1 shows the classification of
programming languages into the aforementioned paradigms.
We emphasize that the term “imperative” stems from the fact that a sequence of
commands can modify the actual state when executed. A typical elementary opera-
tion performed by an imperative program is the allocation of values. To explain the
above-mentioned variable concept in greater detail, let us consider the command
x:=x+1.
Now, one must distinguish two aspects of x, see [136]: the l-value relates to the memory location, and the r-value relates to its value in the memory. Those variables can then be used in other commands, such as control-flow structures. The
most important control-flow structures and other commands in the composition of
imperative programs are [136, 168]:
– Command sequences (C1; C2; ...; Ck).
– Conditional statements (if, else, switch, etc.).
The following examples further illustrate the main features of functional pro-
gramming in comparison with equivalent imperative programs. The first example
shows two programs for adding two integer numbers.
Using (pseudo-)imperative programming, this may be expressed as follows:
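# a sketch in pseudocode; the concrete values are taken from the Scheme example below
int a, b, sum;
a := 4;
b := 5;
sum := a + b;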
In LISP or Scheme, the program is simply (+ a b), but a and b need to be predefined, e. g., as
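(define a 4)
(define b 5)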
The first program declares two variables a and b, and stores the result of the com-
putation in a new variable sum. The second program first defines two constants, a
and b, and binds them to certain values. Next, we call the function (+ ) (a func-
tion call is always indicated by an open and closed bracket) and provide two input
parameters for this function, namely a and b. Note also that define is already a
function, as we write (define ... ). In summary, the functional character of the
program is reflected by calling the function (+ ) instead of storing the sum of the
two integer numbers in a new variable using the elementary operation “+”. In this
case, the result of the purely functional program is (+ 4 5)=9.
Another example is the square function for real values expressed by
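# a sketch in pseudocode
real a, square;
a := 4;
square := a * a;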
and
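(define (square x)
  (* x x))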
The imperative program square works similarly to sum. We declare a real variable a,
and store the result of the calculation in the variable square. In contrast to this, the
functional program written using Scheme defines the function (square ) with a
parameter x using the elementary function (* ). If we define x as (define x 4), we obtain (square x)=16.
A more advanced example to illustrate the distinction between imperative and
functional programming is the calculation of the factorial, 𝑛! := 𝑛 · (𝑛 − 1) · · · 2 · 1.
The corresponding imperative program in pseudocode is given by
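# a sketch in pseudocode; n is the input
b := 1;
while n > 1 do
    b := b * n;
    n := n - 1;
end
# afterwards, b contains n!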
This program is typically imperative, as we use a loop structure (while ... do) and
the variables b and n change their values to finally compute 𝑛! (state change). As
loop structures do not exist in functional programming, the corresponding program
must be recursive. In purely mathematical terms, this can be expressed as follows:
n! = f(n) = n · f(n − 1) if n > 1, and f(n) = 1 if n = 1 (note that also 0! = 1). The implementation of n! using Scheme can be written as follows [1]:
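(define (factorial n)
  (if (= n 1)
      1
      (* n (factorial (- n 1)))))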
Calling the function (factorial n) (see 𝑓 (𝑛)) can be interpreted as a process of ex-
pansion followed by contraction (see [1]). If the expansion is being executed, we then
observe the creation of a sequence of so-called deferred operations [1]. In this case,
the deferred operations are multiplications. This process is called a linear recursion
as it is characterized by a sequence of deferred operations. Here, the resulting se-
quence grows linearly with 𝑛. Therefore, this recursive version is relatively inefficient
when calling a function with large 𝑛 values.
and can react to events [122]. This programming style has been developed to model
real-world processes, as real-world objects must interact with one another. This is
exemplified in Figure 2.2. Important properties of object-oriented programs include
the reusability of software components and their independence during the design
process. Classical and purely object-oriented programming languages that realize
the above-mentioned ideas include, for example, Simula67, Smalltalk, and Eiffel
(see [122]). Other examples include the programming languages C++ and Modula-2. We emphasize that programs in these languages can be written in a purely imperative or a purely object-oriented style and, hence, these languages also support multiple paradigms. As mentioned above (in Section 2.3),
LISP and Scheme also support multiple paradigms.
– This programming paradigm offers high reusability, as the objects are self-contained.
naturalnumber(1)
for all n, naturalnumber(n) → naturalnumber(successor(n))
to be true (see also the Peano axioms [12]). Here, the symbol → stands for the logical
implication. Informally speaking, this means that 1 is a natural number and that, for every natural number n, its successor is also a natural number; from this it follows, for example, that 3 is a natural number. To prove the statement, we apply these two
logical statements as so-called axioms [35, 175], and obtain
naturalnumber(1) → naturalnumber(successor(1))
→ naturalnumber(successor(successor(1)))
Logic programming languages often use so-called Horn clauses to implement and
evaluate logical statements [35, 175]. Using Prolog, the evaluation of these state-
ments is given by the following:
a <- 5
a
[1] 5
prove the existence of variables (see Section 2.2). That means, this part is imperative.
Another way to demonstrate this is the procedure in Listing 2.9.
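A minimal sketch of such a listing (the variable names are assumptions) is:

fun_sum_imperative <- function(n) {
    t <- 0              # declared variable whose state changes
    for (i in 1:n) {
        t <- t + i      # state change in every iteration
    }
    return(t)
}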
fun_sum_imperative (4)
10
The function fun_sum_imperative computes the sum of the first n natural numbers,
and is here written in a procedural way. By inspecting the code, one sees that there is
a state change of the declared variable, again underpinning its imperative character.
In contrast, the functional approach (see Section 2.3) to express this problem is
shown in Listing 2.10.
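A corresponding recursive sketch is:

fun_sum_functional <- function(n) {
    if (n == 1) {
        return(1)
    } else {
        return(n + fun_sum_functional(n - 1))
    }
}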
fun_sum_functional (4)
10
This version uses the concept of recursion for representing the following formula:
sum(n)=n+sum(n-1) (see also Section 2.3). As mentioned in Section 2.3, recursive solutions may be less efficient than iterative ones that use variables in the sense of imperative programming, especially when calling a function with large argument values.
To conclude this section, we demonstrate the object-oriented programming
paradigm in R. For this, we employ the object-oriented programming system S4
[129] and implement the same problem as above. The result is shown in Listing 2.11.
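A sketch of the class definition and the prototype (generic) of the method, where the slot name n is taken from the code below, is:

setClass("series_operation", representation(n = "numeric"))
setGeneric("fun_sum_object_oriented",
           function(n) standardGeneric("fun_sum_object_oriented"))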
setMethod("fun_sum_object_oriented",
          signature = c(n = "series_operation"),
          function(n) { t <- 0; for (i in 1:n@n) { t <- t + i }; return(t) })
k <- new("series_operation", n=4)
fun_sum_object_oriented(k)
10
First, we define the class series_operation with a slot of a predefined data-type.
Then, we define a prototype of the method fun_sum_object_oriented using the
standard class series_operation. Using the setMethod command, we define the
method fun_sum_object_oriented concretely, and also create a new object from
series_operation with a concrete value. Finally, calling the method gives the de-
sired result.
Figure 2.3: The basic principle of an interpreter (left) and compiler (right) [122].
efficient compared to that of a compiler, as the code is only executed at runtime. The frequent inefficiency of interpreted programs may be identified as a weakness, because all fragments of the program, such as loops, must be translated anew each time the program is executed.
Next, we sketch the compiler approach to translate computer programs. A com-
piler translates an input program as a preprocessing step into another form, which
can then be executed more efficiently. This preprocessing step can be understood as
follows: A program written in a programming language (source language) is trans-
lated into machine language (target language) [202]. In mathematical terms, this
equals a mapping 𝐶 : 𝐿1 −→ 𝐿2 that maps programs of a programming language
𝐿1 to other programs of programming language 𝐿2 . After this process, a target pro-
gram can be then executed directly. Typical compiler languages include C, Pascal,
and Fortran (see [122, 168]). We emphasize that compiler languages are extremely
efficient compared to interpreter languages. However, when changing the source code,
the program must be compiled again, which can be time consuming and resource
intensive. Figure 2.3 shows the principle of a compiler schematically.
2.10 Summary
The study of programming paradigms has a long history and is relatively complex.
Nevertheless, we considered it important to introduce this fundamental aspect to
show that programming is much more than writing code. Indeed, although pro-
gramming is generally perceived as practical, it has a well-defined mathematical
foundation. As such, programming is less practical than it may initially appear,
and this knowledge can be utilized by programmers in their efforts to enhance their
coding skills.
3 Setting up and installing the R program
In this chapter, we show how to install R on three major operating systems that are
widely used: Linux, MAC OS X, and Windows. As a note, we would like to remark
that this order reflects our personal preference of the operating systems based on
the experience we gained over the years making maximum use of computers.
From our experience, Linux is the most stable and reliable operating system of
these three and is also freely available. An example of such a Linux-operating system
is Ubuntu, which can be obtained from the web page http://www.ubuntu.com/. We
have been using Ubuntu for many years and can recommend it to anyone, whether for professional or private usage. Linux is in many ways similar to the famous operating system Unix, developed by the AT&T Bell Laboratories and released in 1969, however, without the need to acquire a license. Typically,
a research environment of professional laboratories has a computer infrastructure
consisting of Linux computers, because of the above-mentioned advantages in ad-
dition to the free availability of all major programming languages (e. g., C/C++,
python, perl, and Java) and development tools. This makes Linux an optimal tool
for developers.
Interestingly, the Mac OS X system is Unix-based like Linux and, hence, shares
some of the same features with Linux. However, a crucial difference is that one
requires a license for many programs because it is a commercial operating system.
Fortunately, R is freely available for all operating systems.
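Under Ubuntu, R can be installed from a terminal via the package manager; a minimal sketch (assuming the package name r-base) is:

sudo apt-get install r-base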
Alternatively, one can install R by using the Ubuntu software center, which is
similar to an App store. For other Linux distributions the installation is similar,
but details change. For instance, for Fedora, the installation via terminal uses the
command:
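sudo dnf install R   # assuming a current Fedora with the dnf package manager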
3.4 Using R
The above installation, regardless of the operating system, allows you to execute
R in a terminal. This is the most basic way to use the programming language.
That means one needs, in addition, an editor for writing the code. For Linux, we recommend emacs, and for Mac OS X, Sublime (which is similar to emacs). Both are freely available. However, there are many other editors that can be used. Just try to find the editor best suited to your needs (e. g., nice command highlighting or additional tools for writing or debugging the code) that allows you to write code comfortably.
Some people like this vi-feeling1 of programming; however, others prefer to have
a graphical-user interface that offers some utilities. In this case, RStudio (https:
//www.rstudio.com/) might be the right choice for you. In Figure 3.1, we show
an example of how an RStudio session looks. Essentially, the window is split into four
parts. A terminal for executing commands (bottom-left), an editor (top-left) to write
scripts, a help window showing information about R (bottom-right) or for displaying
plots, and a part displaying variables available in the working space (top-right).
3.5 Summary
For using the base functionality of R, the installation shown in this chapter is suffi-
cient. That means essentially everything we will discuss in Chapter 5 regarding the
1 Vi is a very simple yet powerful and fast editor used on Unix or Linux computers.
introduction to programming can be done with this installation. For this reason, we
suggest skipping the next chapter, which discusses the installation of external packages, and coming back to it when such packages need to be installed.
4 Installation of R packages
After installing the base version of R, the program is fully functional. However,
one of the advantages of using R is that we are not limited to the functionality
that comes with the base installation, but we can extend it easily by installing
additional packages. There are two major sources from which such packages are
available. One is the Comprehensive R Archive Network (CRAN) and the other
is Bioconductor. Recently, GitHub has been emerging as a third major repository.
In what follows, we explain how to install packages from these and other sources.
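4.1 Installing packages from CRAN
Packages from CRAN can be installed from within an R session using the function install.packages():

install.packages("package.name")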
Here, package.name is the name of the package of interest. In order to find the name of a package we want to install, one can go to the CRAN web page (http://cran.r-project.org/) and browse or search the list of available packages. If such a package
is found, then we just need to execute the above command within an R session and
the package will be automatically installed. It is clear that in order for this to work
properly, we need to have a web connection.
As an example, we install the bc3net package that enables inferring networks
from gene expression data [45].
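install.packages("bc3net")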
At the time of writing this book, CRAN provided 14,435 available packages. This
is an astonishing number, and one of the reasons for the widespread use of R since
all of these packages are freely available.
4.2 Installing packages from Bioconductor
In order to install a package from Bioconductor, for example, the package graph, which provides functions to manipulate networks, one needs to execute the following commands:
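install.packages("BiocManager")   # presumably the first of the two commands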
BiocManager::install("graph")
The first command installs the BiocManager package, and the second command downloads and installs the package of interest from Bioconductor.
4.3 Installing packages from GitHub
That means, first, the package devtools from CRAN needs to be installed and
then a package from GitHub with the name ID/packagename can be installed. For
instance, in order to install ggplot2 one uses the command
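install.packages("devtools")
devtools::install_github("tidyverse/ggplot2")   # the repository identifier is an assumption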
Table 4.1: Essential unix commands that can be entered via a terminal.
computer processor for execution. The problem is that, for what we describe below,
there are no buttons available that could be clicked with your mouse. The good
thing is that this is not really a problem as long as we have a terminal that allows
us to enter the required commands directly.
Before we proceed, we would like to encourage the reader to get at least a basic understanding of unix commands, because this gives you a much better understanding of the internal organization of a computer and its directory structure. In
Table 4.1, we provide a list of the most basic and essential unix commands that can
be entered via a terminal.
4.4 Installing packages manually
The most basic method to install packages is to download a package to your local
hard drive and then install it. Suppose that you downloaded such a package to the
directory “home/new.files”. Then you need to execute the following command within
a terminal (and not within an R session!) from the home directory:
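R CMD INSTALL home/new.files/package.name.tar.gz   # package.name.tar.gz is a placeholder for the downloaded file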
Only after loading a package with the function library() is its content available within an R session. For instance, we activate the package bc3net as follows:
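library("bc3net")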
To see what functions are provided by a package we can use the function help:
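help(package = "bc3net")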
4.6 Summary
In this chapter, we showed how to install external packages from different package
repositories. Such packages are optional and are not needed for utilizing the base
functionality of R. However, there are many useful packages available that make pro-
gramming more convenient and efficient. For instance, in an academic environment
it is common to provide an R package when publishing a scientific article that allows
reproducing the conducted analysis. This makes the replication of such an analysis
very easy because one does not need to rewrite such scripts.
5 Introduction to programming in R
This chapter will provide an introduction to programming in R. We will discuss key
commands, data structures, and basic functionalities for writing scripts. R has been
specifically developed for the statistical analysis of data. However, here we focus on
its general purpose functionalities that are common to many other programming
languages. Knowledge of these functionalities is necessary for utilizing its advanced
capabilities discussed in later chapters.
5.1 Basic elements of R
In principle, the symbol “=” can also be used for an assignment, but there are cases where this leads to problems, and for this reason we suggest always using the “<-” operator, because it can be used in all cases.
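For instance, inside a function call, “=” is interpreted as naming an argument rather than as an assignment; a minimal illustration (using the base function system.time()) is:

system.time(x = 5)    # error: system.time() has no argument named x
system.time(x <- 5)   # works: assigns 5 to x and measures the evaluation time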
The basic elements of R, to which different values can be assigned, are called
objects. There are different types of objects and some of them are listed in Table 5.1.
The command typeof() provides information about the type of an object. An inter-
esting type is NULL, which is not an actual object-type, but serves more as a place
holder allowing an empty initialization. In Section 5.3.1, we will show how this can
be useful. Another interesting type is NA, indicating missing values.
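For example:

typeof(3.7)
[1] "double"
typeof("hello")
[1] "character"
typeof(NULL)
[1] "NULL"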
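5.1.1 Navigating directories
The current working directory of an R session can be determined with the function getwd():

getwd()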
This will result in a character string showing the full path to the current working
directory of the R session. In case one would like to change the directory, one can
use the set working directory function setwd():
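setwd(new.dir)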
Here, new.dir is a character string containing a valid name of a directory you would
like to set as your current working directory.
Let us assume that we know the name of the function, but not of its arguments. In
R, there are two ways to find this out. First, one can use the function args() and, as
an argument for this function, the name of the function with unknown arguments:
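For example, for the base function sd(), which computes the standard deviation, we obtain:

args(sd)
function (x, na.rm = FALSE)
NULL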
The information resulting from args() is usually only informative if one is already
familiar with the function of interest, but just forgot details about its arguments. For
more information, we need to use the function help(), which is described in detail in
the next section.
In the following, we use the term “function” and “command” interchangeably,
although a command has a more general meaning than a function.
5.1.3 Getting help
When there is a command that we want to use, but we are unfamiliar with its syntax,
e. g., sqrt(), R provides a help function, which is evoked by either help(sqrt) or ?sqrt:
At this early stage in the book, we would like to highlight the fact that R provides helpful information about functions, but this does not necessarily mean that this information will be to the extent you expect or would wish for. Instead, the provided help information is usually rather short and not sufficient (or intended) to fully convey the details and complexity of the function of interest.
However, most help information comes with R examples at the end of the help
file. This allows you to reproduce at least parts of the capabilities of the described
functions by using the provided example code. It is not necessary to type these
examples manually but there is a useful function available, called example(), that
executes the provided example code automatically:
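example(sqrt)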
That means you do not need to manually copy-and-paste (or type) the example
code, but just apply the example() command to the function you wish to learn more
about.
5.2 Basic programming
5.2.1 If-clause
The general form of an if-clause is given by the structure below. Here, a general logical statement is the argument of the if-clause. Depending on whether this statement is true or false, the commands in part 1 or part 2 are executed. That means, the outcome of the test of the provided logical statement selects the commands to be executed:
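if (logical statement) {
    # commands part 1
} else {
    # commands part 2
}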
The usage of an if-clause is very flexible, allowing the removal of the else clause, but also the inclusion of further conditional statements by means of the else if command.
We would like to note that, e. g., the statement “a=4” is not a logical statement, but
an assignment, and will for this reason not work as an argument for an if-clause.
5.2.2 Switch
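The function switch() executes one of several alternatives, depending on the value of its first argument. A minimal sketch (the concrete commands are assumptions) is:

a <- 1
switch(a,
       print("one"),
       print("two"),
       print("three"))
[1] "one"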
For all other values of “a”, there will be no true condition and, hence, none of the
above commands will be executed.
For clarity, we just want to mention that, for reasons of better readability, we split the switch() command in the above example over several lines. This way, one can see that it consists of three executable components. When you write your own programs, you will see that such formatting is in general very helpful for getting a quick overview of a program, because it increases the readability of the code.
5.2.3 Loops
In R, there are two different ways to realize a looping behavior. The first is by using
a for-loop, and the second by using a while-loop. A looping behavior means the
consecutive execution of the same procedure for a number of steps. The number of
steps can be fixed, or variable.
5.2.4 For-loop
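A minimal example, which executes its body once for each value of the loop variable, is:

for (i in 1:3) {
    print(i)
}
[1] 1
[1] 2
[1] 3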
5.2.5 While-loop
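A while-loop executes its body as long as its logical argument remains true; a minimal sketch is:

i <- 1
while (i <= 3) {
    print(i)
    i <- i + 1
}
[1] 1
[1] 2
[1] 3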
One needs to make sure that the argument of the while() function becomes at some
time during the looping process logically false, because otherwise the function is
iterated infinitely. This is a frequent programming bug.
5.2.6 Logic behind a For-loop
Before we continue, we want to present a look behind the curtains at the logic of a for-loop. The reader who already has some familiarity with programming can skip this section, but in our experience, the following information is helpful for beginners to see in detail how a for-loop works.
The general form of a for-loop consists of a for() function that executes the body
of the for-loop comprised of individual R commands, depending on the argument of
the for-loop.
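for (i in 1:N) {
    # body: individual R commands
}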
In the following, we will discuss each of the three components of the for-loop and
their connections.
This argument contains one variable and one parameter. Here i is the variable,
because its value changes systematically with every loop that is executed, and N is
a parameter, because its value is fixed throughout the whole loop. The values that
can be assumed to i are determined by 1:N, because the argument says i in 1:N.
If you define N=4 and execute 1:N in an R session, you get
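[1] 1 2 3 4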
That means 1:N is a vector of integers of length N. To see this, you can define a <-
1:N and access the components of vector a by a[1], e. g., for the first component.
The values that i can assume are systematically assigned according to the order
of the vector 1:N, i. e., in loop 1, i is equal to 1; in loop 2, i is equal to 2; until
finally in loop N, i is equal to N. For this reason, the “argument” of a for-loop is—in
our example—dependent on the variable i, i. e.,
argument(i), (5.2)
Note that this is just a symbolic writing to emphasize that the argument of a loop
is connected to the step of the loop. Here, it is important to realize that the variable
of the “argument” changes its value in every loop step.
body(argument). (5.4)
Due to the fact that the argument itself is a function of the loop step, we have the
following dependency chain:
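body(argument(loop step)). (5.5)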
5.2.6.3 For-function
The third part is an actual R function. In R, you can always recognize a function
by its name, followed by round brackets “()” containing, optionally, an argument.
In the case of the for-function, it contains an argument, as discussed above. The
purpose of the for-function is to execute the body consecutively.
To make this clear, especially with respect to the argument of the body, which
depends on the number of the loop step, let us consider the following example:
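for (i in 1:3) {
    a <- i * 2   # the body; a sketch, its commands are assumptions
    print(a)
}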
The for-function converts this into the following consecutive execution of the body,
as a function of the argument:
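a <- 1 * 2; print(a)   # loop step 1: i = 1
a <- 2 * 2; print(a)   # loop step 2: i = 2
a <- 3 * 2; print(a)   # loop step 3: i = 3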
First, the value of the variable i changes with every loop, according to the argument.
In our case “i” just assumes the values 1, 2, 3. Then the concrete value of i is
used in every loop, leading to different values of a. From a more general point
of view, this means that the for-function does not only execute the body of the
function consecutively, but it changes also the content of the workspace, which is
the memory of an R session, with every loop step. This is the exact meaning of
body(argument(loop step)).
In Figure 5.2, we visualize the general working mechanism of a for-loop by
unrolling it in time. Overall, if one wants to understand what a particular For-
loop does, one just needs to unroll its “body” in the way depicted in Figure 5.2,
considering the influence of the “argument” on it.
This discussion demonstrates that a for-loop, or any other R function, can be
quite complicated if we want to understand in more detail how it works. However,
once we understand its principle working mechanism, we can fade-out these details
focusing on key factors only. For the for-loop this is the systematic modification
of the variable in the argument of the loop, and the consecutive execution of its
body.
5.2.7 Break
Both loop functions can be interrupted at any time during the execution of the loop
using the break() command. Frequently, this is used in combination with an if-clause
within a loop to test for a specific decision that shall lead to the interruption of the
loop.
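A minimal sketch of this combination is:

for (i in 1:10) {
    if (i > 3) {
        break   # interrupts the loop as soon as the condition is true
    }
    print(i)
}
[1] 1
[1] 2
[1] 3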
Combining loops with if-clauses and the break() function allows creating very flexible
constructs that can exhibit a rich behavior.
5.2.8 Repeat-loop
For completeness, we want to mention that there is actually a third type of loop
in R, the repeat-loop. However, in contrast with a for-loop and a while-loop, this does not come with an interruption condition, but is in fact an infinite loop that never stops on its own. For this reason, the repeat() command always needs to be used in combination with the break() statement:
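i <- 1
repeat {
    print(i)
    i <- i + 1
    if (i > 3) {
        break
    }
}
[1] 1
[1] 2
[1] 3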
5.3 Data structures
5.3.1 Vector
There are many functions to obtain properties of a vector, e. g., its length or
the sum of its elements. In order to make sure that the sum of its elements can
be computed, the function mode() or typeof() allows determining the data-type
of the elements. Examples of different types are character, double, logical, or
NULL.
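Consider, for example, the vector

a <- c(3, 4, 1, 7, 12)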
a[3]
[1] 1
a[c(2,5)]
[1] 4 12
a[2:4]
[1] 4 1 7
length(a)
[1] 5
sum(a)
[1] 27
mode(a)
[1] "numeric"
Accessing elements of a vector can be done either individually (a[3] gives the
third element of vector a) or collectively by specifying the indices of the elements
(a[c(2,5)] gives the second and fifth element).
One can also assign names to elements of a vector using the command names(). When accessing an element via its name, one needs to make sure to quote the name. In the below example, one needs to use "C" and not C, because the latter indicates a variable rather than the capital letter itself.
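a <- c(3, 4, 10, 7, 12)                  # example values
names(a) <- c("A", "B", "C", "D", "E")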
a["C"]
C
10
There are also several functions available to generate vectors by using predefined
functions, e. g., the sequence (seq()) of numbers or letters (letters()). A general char-
acteristic of a vector is that, whatever the type of its elements, they all need to be of the same type. This is in contrast with lists, discussed in Section 5.3.3.
b <- letters[1:4]
b
[1] "a" "b" "c" "d"
typeof(b)
[1] "character"
It is also possible to define a vector of a given length and mode initiated by zeros. For
example, vector(mode = "numeric", length = 10) results in a numeric vector of
length 10, whereas each element is initialized with a 0.
It is also possible to apply a function element-wise to a vector without the need
to access its elements, e. g., in a for-loop.
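For example, for the vector

a <- c(4, 9, 81, 64)

we obtain: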
sqrt(a)
[1] 2 3 9 8
a * a
[1] 16 81 6561 4096
Other useful functions that can be either applied to a vector or used to generate
vectors are provided in Table 5.2.
If we want to add an element to a vector, we can use the command append():
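a <- c(1, 2, 3)
a <- append(a, 10, after = 2)
a
[1]  1  2 10  3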
Here the option after allows specifying a subscript, after which the values are
to be appended.
This command allows us also to demonstrate the usefulness of the NULL object
type, introduced in Section 5.1.
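a <- NULL
a <- append(a, 5)
a
[1] 5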
Table 5.2: Examples of functions that can be applied to vectors or can be used to generate vec-
tors.
Although the variable a does not contain an element with a value, it contains one
initialized element as a place holder of type NULL. Repeating the above example with
an uninitialized object would result in an error message.
A simplified form of the above can be written as follows:
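a <- c(a, 5)   # c() appends the value directly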
5.3.2 Matrix
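A matrix can be defined using the function matrix():

a <- matrix(1:6, nrow = 3, byrow = TRUE)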
a[1,2]
[1] 2
a[1,]
[1] 1 2
a[c(1,3),]
[ ,1] [ ,2]
[1,] 1 2
[2,] 5 6
Here, the option byrow allows controlling how a matrix is filled. Specifically, by
setting it to “FALSE” (default), the matrix is filled by columns, otherwise the ma-
trix is filled by rows. Accessing the elements of a matrix is similar to a vector by
using the squared brackets. Again, this can be done either individually (a[1,2] giv-
ing the element in row 1 and column 2) or collectively by specifying the indices
of the elements (a[c(1,3),] gives all the elements of row 1 and 3). It is inter-
esting to note that by not specifying an index before or after the “,”, all corresponding rows or columns are selected.
There are several commands available to obtain the properties of a matrix. Some
of these commands are provided in Table 5.3.
Sometimes, it is useful to assign names to rows and columns. This can be achieved
by the commands rownames() and colnames().
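rownames(a) <- c("row1", "row2", "row3")   # the concrete names are arbitrary
colnames(a) <- c("col1", "col2")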
Once these attributes are set, they can be retrieved by using the same command,
e. g., rownames(a) would give you a vector of length nrow, including the names of
the rows. The names of the rows or columns can also be used to access the rows and
columns:
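a["row1", ]
col1 col2
   1    2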
There are alternative ways to create a matrix. For instance, by using the commands
cbind(), rbind(), or dim():
a <- c(2,5,6,2)
dim(a) <- c(2,2)
a
[ ,1] [ ,2]
[1,] 2 6
[2,] 5 2
Also, functions can be applied to a matrix element-wise. For example, sqrt() calcu-
lates the square root of each element.
All basic operations known from linear algebra can be performed with a matrix,
e. g., addition, multiplication, or matrix multiplication.
a * a # a^2
[ ,1] [ ,2]
[1,] 4 36
[2,] 25 4
a %*% a
[ ,1] [ ,2]
[1,] 34 24
[2,] 20 34
5.3.3 List
A list is a more complex data structure than the previous ones, because it can
contain elements of different types. This was not allowed for either of the previous
data structures. Formally, a list is defined using the function list(), see Listing 5.33.
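b <- list(matrix(1:4, nrow = 2, byrow = TRUE), "hello", 3, c(4, 2))   # the first element is an example choice
b
[[1]]
     [,1] [,2]
[1,]    1    2
[2,]    3    4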
[[2]]
[1] "hello"
[[3]]
[1] 3
[[4]]
[1] 4 2
In the above example, the list b consists of 4 elements, which are different data
structures. In order to access an element of a list, the double-squared brackets can
be used, e. g., b[[2]], to access the second element. This appears similar to a vector,
discussed in Section 5.3.1. In fact, there are many commands for vectors that can
also be applied to lists, e. g., length() or names(). If name attributes are assigned to the elements of a list, then these can be accessed by the “$” operator.
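names(b) <- c("mat", "ex", "num", "vec")   # assumed names
b$ex
[1] "hello"
b[[2]]
[1] "hello"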
In this case, the usage of double-squared brackets or the “$” operator provide the
same results. It is also possible to assign names to the elements when defining a list.
The following example shows that even a partial assignment is possible. In this case,
the first two elements could be accessed by their name, whereas the latter two can
only be accessed by using indices, e. g., b[[3]] for the third element.
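b <- list(ind1 = matrix(1:4, nrow = 2, byrow = TRUE), ex = "hello", 3, c(4, 2))
b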
$ind1
[ ,1] [ ,2]
[1,] 1 2
[2,] 3 4
$ex
[1] "hello"
[[3]]
[1] 3
[[4]]
[1] 4 2
5.3.4 Array
Arrays are a generalization of vectors, matrices, and lists in the sense that they can
be of arbitrary dimension and type.
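a <- array(1:8, dim = c(2, 2, 2))
a
, , 1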
[ ,1] [ ,2]
[1,] 1 3
[2,] 2 4
, , 2
[ ,1] [ ,2]
[1,] 5 7
[2,] 6 8
Elements of an array can be accessed using squared brackets, and the number of
indices corresponds to the number of dimensions.
a[ ,,1]
[ ,1] [ ,2]
[1,] 1 3
[2,] 2 4
[ ,1] [ ,2]
[1,] "one" Numeric ,4
[2,] Integer ,3 "one"
, , 2
[ ,1] [ ,2]
[1,] Integer ,3 "one"
[2,] Numeric ,4 Integer ,3
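5.3.5 Data frame
A data frame can be seen as a matrix whose columns may have different types; a minimal sketch (the values of column y are assumptions) is:

df <- data.frame(x = 1:3, y = c("a", "b", "c"))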
df$x
[1] 1 2 3
dim(df)
[1] 3 2
Again the command names() can be used to identify the names of the elements in a
data frame. Interestingly, the “$” operator can be used with "x" as well as x to access
elements. Table 5.4 provides an overview of further commands for data frames.
Table 5.4: Some examples of commands that can be used with data frames.
5.3.6 Environment
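An environment is created with the function new.env(), and elements can be assigned to it:

a <- new.env()
a$x <- c(3, 4, 2)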
a$x
[1] 3 4 2
a$"x"
[1] 3 4 2
a[["x"]]
[1] 3 4 2
ls(a)
[1] "x"
Alternatively, one can use the function assign() to assign a new element to an envi-
ronment:
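assign("y", "hello", envir = a)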
a$y
[1] "hello"
When we are unsure about the names of elements, we can use the function exists()
to perform a logical test, resulting in either a true or a false depending on whether
the element exists in the environment or not:
exists("y", envir=a)
[1] TRUE
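5.3.7 Removing variables from the workspace
A single variable, e. g., x, can be deleted from the workspace with the command rm(x).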
If we want to delete many variables, we need to specify the “list” argument of the
command rm() providing a character vector naming the objects to be removed. We
can also delete all variables in the current working space in the following way:
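rm(list = ls())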
5.3.8 Factor
For analyzing data containing categorical variables, a data structure called factor is frequently encountered. A factor is like a label or a tag that is assigned to a certain
category to represent it. In principle, one could define a list containing the same
information, however, the R implementation of the data structure factor is more
efficient. An example for defining a factor is given below.
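height <- factor(c("small", "tall", "medium", "tall"))   # example values
height
[1] small  tall   medium tall
Levels: medium small tall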
Here, we assign 4 different factors to height, but only three “values” are different.
In the case of a factor, different values are actually called levels. The different levels
of a factor can also be obtained with the command levels().
In the above example, the factors were categorical variables, meaning that the
levels have no particular ordering. An extension to this is to define such an ordering
between the levels. This can be done implicitly or explicitly.
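For instance, using the same example values as above:

height <- factor(c("small", "tall", "medium", "tall"), ordered = T)
height
[1] small  tall   medium tall
Levels: medium < small < tall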
The first example above defines ordered factors by setting “ordered=T”. As a result, there is an ordering between the three levels, despite the fact that we did not specify this order explicitly. However, this order is not due to any semantic meaning of these words; it is just the alphabetic ordering of the words.
If we would like to define a different order between the levels, we can include the levels option. Then, the resulting order will follow the order of the levels specified by this option:
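height <- factor(c("small", "tall", "medium", "tall"),
                 levels = c("small", "medium", "tall"), ordered = T)
height
[1] small  tall   medium tall
Levels: small < medium < tall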
5.3.9 Date and Time
For obtaining the date and time of the computer system, we can use the following
functions:
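Sys.time()
Sys.Date()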
Each of these functions results in an R object of a specific type. The first function
returns an object of class POSIXct and the second of class Date. The reason for this
is that objects of the same type can be manipulated in a convenient way, e. g., using
subtraction, we can get the time difference between two time points or dates.
5.3.10 Information about R objects
In the above sections, we showed how to define basic R objects of different types.
In all these cases, we knew the type of these objects, because we defined them explicitly. However, when using packages, we may not always have this information.
For such cases, R provides various commands to get information about the types of
objects.
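For example:

x <- 1:3
class(x)
[1] "integer"
typeof(x)
[1] "integer"
is.numeric(x)
[1] TRUE

5.4 Handling character strings
5.4.1 The function nchar()
The function nchar() returns the number of characters of a string:

x <- "hello"
nchar(x)
[1] 5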
If we use the function length() instead, it would not return the number of characters,
but count the whole string as 1.
5.4.2 The function paste()
For concatenating strings together, one can use the function paste():
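paste("data", "science")
[1] "data science"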
The sep option allows specifying what separator is used for concatenating the strings;
the default introduces a blank between two strings.
It is also possible to include a variable to form a new string.
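i <- 5
paste("file", i, ".txt", sep = "")
[1] "file5.txt"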
This is useful if we want to read many files from a directory within a loop and their
names vary in a systematic way, e. g., by an enumeration. It can also be used to
create names for an environment (see Section 5.3.6), because an environment needs
strings as indices for elements.
Furthermore, the function paste() can be used to connect more than just two
strings:
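paste("a", "b", "c", "d", sep = "-")
[1] "a-b-c-d"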
5.4.3 The function substr()
If we want to overwrite parts of a string s with another string, we need to use the
function substring() with start, specifying where to start overwriting. In case we
just want to insert a new string without overwriting parts of the string s, we need
to use the function substr():
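For instance, a sketch of extracting part of a string with substr():

substr("programming", start = 4, stop = 7)
[1] "gram"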
5.4.4 The function strsplit()
Splitting a string into one or more substrings can be done using the function strsplit():
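strsplit("3.141", split = "[.]")
[[1]]
[1] "3"   "141"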
The reason why a ''.'' does not work as a split symbol, but ''[.]'' does, is due
to the fact that the argument split is a regular expression (see Section 5.4.5).
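5.4.5 Regular expressions
For finding patterns within strings, R provides, e. g., the functions regexpr() and gregexpr(), which are called in the form regexpr(pattern, text).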
The first argument of the above function characterizes the pattern we try to find
and text is the string to be searched.
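txt <- c("This", "is", "an", "R", "book")   # example vector; the concrete words are assumptions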
gregexpr("is", txt)
[[1]]
[1] 3
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE
[[2]]
[1] 1
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE
[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
[[4]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
[[5]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
Both functions result in similar outputs, but displayed in different ways. While
regexpr() returns an integer vector of the same length as text, whose components
provide information about the position of a match or no match, resulting in −1,
gregexpr() returns a list of this information. Furthermore, both functions have the attribute match.length that indicates the number of elements that are actually matched. One may wonder how it is possible that the length of a match does not correspond to the length of the pattern. This is where (nontrivial) regular expressions come into play.
Table 5.5: Some special symbols that can be used in regular expressions.
For example, the regular expression x+ matches any of the following within a string:
"x", "xx", "xxx", etc. This means that the length of the regular expression is not
equal to the length of the matched pattern. By using special symbols, it is possible to
generate quite flexible search patterns, and the resulting patterns are not necessarily
easy to recognize from the regular expression.
To demonstrate the complexity of regular expressions, let us consider the fol-
lowing example. Suppose that we want to identify a pattern in a string, of which
we do not know the exact composition. However, we know certain components. For
example, we know that it starts with a “G” and is followed either by none or several
letters or numbers, but we do not know by how many. After this, there is a sequence of numbers, which is between 1 and 4 elements long:
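txt <- "xyzGab1234rest"                       # example string; its content is an assumption
m <- regexpr("G[[:alnum:]]*[0-9]{1,4}", txt)
m
[1] 4
attr(,"match.length")
[1] 7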
The above code realizes such a search and it finds at position 4 of txt a match that
is 7 elements long.
In order to extract the matched substring of txt, the function regmatches() can
be used. It expects as arguments the original string used to match a pattern and the
result from the function regexpr():
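Continuing the sketch from above:

regmatches(txt, m)   # extracts the matched substring "Gene123"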
This example demonstrates that with regular expressions it is not only possible to
match substrings that are exactly known, but also to match substrings that are only
partially known. This flexibility is very powerful.
The function sort() arranges the elements of a vector. The option decreasing enables
specifying whether the sorting should be in decreasing (TRUE) or nondecreasing
(FALSE, the default) order. It is important to note that sort(x) does not modify the
input vector x itself. For this reason, we need to assign the result of sort(x) back to
x if we want to overwrite the input vector.
If we are interested in the positions of the sorted elements in the original vector x,
we can get these indices by using the function order().
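For example:

x <- c(3, 1, 2)
sort(x)             # 1 2 3
x <- sort(x)        # overwrite x with the sorted vector
order(c(3, 1, 2))   # 2 3 1, the positions of the sorted elements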
A somewhat related function to order() is rank(). However, rank(x) gives the rank
numbers (in increasing order) of the elements of the input vector x:
In the case of ties, there are several options available to handle the situation, and
one of them is ties.method.
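For example:

x <- c(10, 20, 20, 30)
rank(x)                          # 1.0 2.5 2.5 4.0, ties are averaged by default
rank(x, ties.method="first")     # 1 2 3 4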
To write a new function, one needs to define the name, the argument, and the content
of the function. The general syntax is shown in Listing 5.67.
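A sketch of this syntax (the listing itself is not reproduced here):

fct.name <- function(argument){
    body
}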
Here, fct.name is the name of the new function you want to define, argument
is the argument you submit to this function, and body is a list of commands that
are executed, applied to argument.
Defining a new function itself uses the built-in R function called function.
If the body of the new function consists merely of one command, one can use the
simplified syntax:
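fct.name <- function(argument) body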
However, for reasons of clarity and readability of the code, we recommend always
defining the body of the function, starting with a “{” and ending with a “}”.
Let us consider an example defining a new function that adds 1 to a real number
given by the argument x:
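add.one <- function(x){
    y <- x + 1
    return(y)
}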
In this example, the name of the new function is add.one(). One should always
pay attention to the name, to not accidentally mask an existing function. For
instance, if we would call the new function sqrt(), then the square root function,
part of the R base package, would be masked by our definition.
It is good practice to finish the body with the command return() that contains
as its argument the variable we would like to get as a result from the application of
the new function. However, the following will result in the exact same behavior as
the function add.one():
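add.one <- function(x){
    x + 1   # no assignment; the value of the last expression is returned
}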
Here, it is important not to write y <- x + 1, but instead x+1, without assignment
to a variable. We do not recommend this syntax, especially not for beginners, because
it is less explicit in its meaning.
We would like to note that the above-defined function is just a simple example
that does not include checks in order to avoid errors. For instance, one would like to
ensure that the argument of the function, x, is actually a number, because otherwise
operations in the body of the function may result in errors. This can be done, for
instance, using the command is.numeric().
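A sketch of such a check:

add.one <- function(x){
    if(!is.numeric(x)){
        stop("the argument x must be numeric")
    }
    y <- x + 1
    return(y)
}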
The usage of such a self-defined function is the same as for a system function,
namely fct.name(x). The following is an example:
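add.one(5)   # returns 6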
The last point is very important and shall be visualized with the following example.
Start a new R session (this is important!) and copy the following code into the R
workspace:
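# the definition of fct.test() is not shown in the original; a minimal
# sketch consistent with the discussion below:
fct.test <- function(x){
    yzxv <- x + 1   # a local variable, not visible outside the function
    return(yzxv)
}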
x <- 1
fct.test(x)
print(yzxv)
What will be the output of print(yzxv)? It will result in an error message, because
the variable yzxv is defined within the scope of the function fct.test(), and as such,
it is not directly accessible from outside the function. This is actually the reason
why we need to specify with the return() function the variables we want to return
from the function. If we could just access all variables defined within the body of a
function, there would be no need to do this.
The rationale behind our recommendation to start a new R session is to clear the
workspace of any previously defined variable named yzxv; if such a variable existed,
print(yzxv) would output that existing variable rather than demonstrating the error
described above. For the specific choice of our variable name, this may be unlikely
(that is why we used yzxv), but for more common variable names, such as a, i, or m,
there is a real possibility that this could happen.
In general, functions allow us to separate our R workspace into different parts,
each containing their own variables. For this reason, it is also possible to reuse the
same variable name in different functions without the danger of collisions.
This point addresses the so-called scope of a variable, which is an important
issue, because it is the source of common bugs in programs.
In order to understand the full complexity of the scope of variables, let us consider
the following situation. Suppose that we have just one function; then we can have
three different scopes of variables, depending on where and how they have been
defined.
First, all variables that are defined outside the function are global variables.
This means that the value of these variables is accessible inside the function and
outside the function. Second, all variables that are defined inside the function are
local variables, because they are only accessible inside the function, but not outside.
Finally, all variables that are defined inside a function by using the super-assignment
operator “<<-” are also global variables.
The following script provides an example:
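# the definition of fct.test() is again not shown; a sketch consistent
# with the three variable scopes discussed above:
fct.test <- function(x){
    yzxv <- x + y   # local variable; the global variable y is readable here
    z <<- 10        # super-assignment: creates the global variable z
    return(yzxv)
}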
x <- 1
y <- 5
fct.test(x)
print(yzxv)
print(z)
In order to return more than one output variable, we need to apply a little trick,
because an R function does not directly permit returning more than one variable with
the return command. Instead, we need to define a single variable, which contains
all the variables we want to return. The script below shows an example, utilizing a
list.
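# a sketch of such a function (the names are illustrative):
fct.multi <- function(x){
    a <- x + 1
    b <- x^2
    d <- sqrt(x)
    y <- list(a, b, d)   # the list y serves as a container for all outputs
    return(y)
}
y <- fct.multi(4)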
In this case, the list variable y serves as a container to transmit all desired variables.
That means, formally, one has just one output variable, but this variable contains
additional output variables that can be accessed via the components of the list. For
example, we can access its third component by y[[3]].
R provides the useful command args(), which gives some information on the input
arguments of a function. Try, for example, args(matrix).
The easiest way to save one or more R objects from the workspace to a file is to use
the function save():
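x <- c(1, 2, 3)
save(x, file="mydata.RData")   # the file name is hypothetical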
Here, the option file defines the name of the file in which we want to save the data.
In principle, any name is allowed, with or without extension. However, it is helpful
to name this file filename.RData, where the extension RData indicates that it is a
binary R data file. Here, binary file means that if we open this file in a text editor,
its content is not human-readable because of its encoding. Hence, in order to view
its content, we need to load this file again into an R workspace.
If we want to save more than one R object, two different syntax variations can be
used. The first way to save more than one R object is to just name these objects,
separated by commas:
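save(x, y, file="mydata.RData")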
The second way is to define a list that contains the variable names as character
elements:
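save(list=c("x", "y"), file="mydata.RData")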
If we want to save all the variables in the current workspace and not just the selected
ones, we can use the function save.image():
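save.image(file="myworkspace.RData")   # the file name is hypothetical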
This function is a short cut for the following script, which accomplishes the same
task:
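save(list = ls(all.names = TRUE), file = ".RData")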
For the above examples, we did not need to care about the formatting of the file
to which we save the data, because R essentially makes a copy of the workspace,
either for the selected variables or for all variables. This is a very convenient and fast way to
save variables to a file. One disadvantage of this way is that these files can only be
loaded with R itself, but not with other programs or programming languages. This
is a problem if we plan to exchange data with other people, friends, or collaborators
and we are unsure whether they either have access to R or do not want to use it,
for some reason. Therefore, R provides additional functions that are more generic in
this respect. In the following, we discuss three of them in detail.
There are three functions in the base package, namely write.table(), write.csv(), and
write.csv2(), that allow saving tables as text files. All of these functions have the
following syntax:
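write.table(M, file="mydata.txt", sep=",")   # the file name is hypothetical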
Here, M is a matrix or a data frame, and the option sep specifies the symbol used to
separate elements in M from each other. As a result, the data saved in file can be
viewed by any text editor, because the information is saved as a text rather than a
binary file as the one generated, for example, by the function save(). The functions
write.csv() and write.csv2() provide a convenient interface to Microsoft EXCEL,
because the resulting file format is directly recognized by this program. This means
we can load these files directly in EXCEL.
A potential disadvantage of these three functions appears when the output of an R
program is not just one table, but several tables of different size and additional data
structures in the form of, e. g., lists, environments, or scalar variables. In such cases,
a function like write.table() would not suffice, because you can only save one table.
On the other hand, the functions save() or save.image() can be used without the
need to combine all data structures into just one table.
At the beginning of this section, we said that, in general, it is more difficult to read
data than to save them. This is true with the exception of binary files saved with
the functions save() or save.image(), because in this case the counterpart for reading
the data back from a file is the function load():
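load(file="mydata.RData")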
Since the function save() essentially makes a copy of the workspace, or parts of it,
and saves it to a file, the function load() just pastes it back into the workspace.
Hence, there are no formatting problems that we need to take care of.
In contrast, if tabular data are provided in a text file, we need to read this file
differently. R provides five functions to read such data, namely read.table(), read.csv(),
read.csv2(), read.delim(), and read.delim2(). For example, the function read.table()
has the following syntax:
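dat <- read.table(file, header=FALSE, sep="", skip=0)   # shown with default values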
The option header is a logical value that indicates whether the file contains the
names of the variables as its first line. The option skip is an integer value indicating
the number of lines that should be skipped when we start reading the file. This is
useful when the file contains at the beginning some explanations about its content
or general comments.
Let us consider an example:
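dat <- read.table(file="infile", header=TRUE, skip=1, sep=",")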
The content of the file infile is shown in Figure 5.3. This file contains a comment
at its beginning spanning one row. For this reason, we skip this line with the op-
tion skip=1. Furthermore, this file contains a header giving information about the
columns it contains. By using “header=TRUE” this information is converted into the
column names of the table we are creating using the function read.table(). Using
colnames(dat) will give us this information. Most importantly, we need to specify
the symbol that is used to separate the numbers in the input file. This is accom-
plished by setting “sep=","”.
Figure 5.3: File content of infile and the effect the options in the command read.table() have
on its content.
As a result, the variable dat will be a data frame containing the tabular data in the
input file having the information about the corresponding columns as column names.
We can access the information in the individual columns by using either dat[[1]],
e. g., for the first column or dat$names. Try to access the information in the second
column. Is there a problem?
All the functions discussed so far, for reading data from a file, can be considered as
high-level functions, because they assume a certain structural organization of a file
that makes it relatively easy for a user to read these data into an R session. That
means, the structural organization can be captured by the supplied options of these
functions, e. g., by setting sep, skip etc. appropriately.
In case there is a text file that has a more complex format that cannot be
read with one of the above functions, R provides a very powerful, low-level reading
function called readLines(). This function allows reading a specified number of lines,
n, from a given file:
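dat <- readLines(con="infile", n=-1)   # n=-1 reads the whole file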
If n is a negative value, the whole file will be read; otherwise, exactly n lines will be
read. The advantage of this way of reading a file is that the formatting of the file
may vary and does not need to be fixed.
For text files with a complex, irregular formatting it is necessary to read these
files line-by-line in order to adopt the formatting separately for each line. This can
be done in the following way:
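# a sketch of such a line-by-line reading loop:
con <- file(description="infile", open="r")   # open a reading connection
while(length(line <- readLines(con, n=1)) > 0){
    # process the current line stored in line
}
close(con)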
The function file() opens a connection to the file specified by the option description,
with the option open indicating that we want to read information from it. Then
calling readLines() with n=1 reads exactly one line from the file. That means, if
called repeatedly, for example within a for-loop, it returns one line after the other,
and these can be processed individually. In this way, arbitrarily formatted files can
be read and stored in variables, so that the information provided by the file can be
used in an R session. If we want to restart reading from this file, we just need to
apply the function file() again.
Figure 5.4: File content of infile with irregular rows, read line-by-line using the function
readLines().
To demonstrate the usage of the function readLines(), let us consider the following
example reading data from the file shown in Figure 5.4. In this case, our file contains
some irregular rows, and we would either like to entirely omit some of them, such
as row 5, or only use them partially, e. g., row 4. The following code reads the file
and accomplishes this task:
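# the original listing is not reproduced here; a sketch, assuming a
# comma-separated file, could look as follows:
con <- file(description="infile", open="r")
dat <- list()
i <- 1   # current row number in the file
k <- 1   # next free position in dat
repeat{
    line <- readLines(con, n=1)
    if(length(line) == 0) break           # end of file reached
    if(i != 5){                           # skip row 5 entirely
        fields <- strsplit(line, split=",")[[1]]
        if(i == 4) fields <- fields[-2]   # omit the second element of row 4
        dat[[k]] <- fields
        k <- k + 1
    }
    i <- i + 1
}
close(con)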
This corresponds to the information in the input file, skipping row 5 and omitting
the second element in row 4. From this example, we can see that low-level functions
offer a high degree of flexibility, which, however, translates into a considerable
amount of additional coding that we need to do to process an input file.
There is another function similar to readLines(), called scan(). The function
scan() does not result in a data frame, but in a list or vector object. Another difference
from readLines() is that it allows specifying the data types to be read by setting the
what option. Possible values of this option are, e. g., double, integer, numeric,
character, or raw. The following code shows an example of its usage:
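dat <- scan(file="infile", what=double(), sep=",")   # assumes comma-separated numbers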
As one can see, the object dat is a vector and the components of the input file,
separated according to sep, form the components of this vector. In our experience,
the function readLines() is the better choice for complex data files.
We just would like to mention, without further discussion, that the function writeLines()
provides a similar functionality and flexibility for writing data to a file in a line-by-line
manner.
In Table 5.6, we provide a brief overview of some R functions discussed in the previous
sections. Column three indicates the difficulty level in using these functions, which
is directly proportional to the flexibility of the corresponding functions.
In addition to the R functions discussed above, which are included in the base pack-
age, there are some additional packages available that allow importing data files
from other programs. In Table 5.7, we list some of the most common formats provided
by other software and the corresponding package names, where one can find the
respective import functions.
For identifying the indices, in a vector or a matrix, whose components have certain
values, we can use the function which(). This function expects a logical vector or
matrix and returns the indices of the TRUE elements. A logical vector can be
obtained from a numerical vector v by, e. g., an expression like v==3. This results
in a logical vector that has the same length as the vector v, but its components
are either TRUE or FALSE, depending on whether the corresponding component
equals 3 or not.
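For example:

v <- c(1, 3, 2, 3)
v == 3          # FALSE TRUE FALSE TRUE
which(v == 3)   # 2 4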
When we are interested in identifying the indices of a matrix that have a certain
value, one can use the option arr.ind=TRUE to get the matrix indices:
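For example:

A <- matrix(c(1, 3, 3, 2), nrow=2)
which(A == 3, arr.ind=TRUE)   # row and column indices of both matching elements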
If we set this option to FALSE (which is the default value), the result is just the
linear (vector-style) indices of the TRUE elements, rather than their row and column
indices.
The function apply() enables applying a certain function to a matrix or array along
the provided dimension. Its syntax is:
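apply(X, MARGIN, FUN)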
Here, X corresponds to a matrix or array, FUN is the function that should be applied
to X, and MARGIN indicates the dimension of X to which FUN will be applied. The
following example calculates the sum of the rows of a matrix:
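A <- matrix(1:6, nrow=2)
apply(A, 1, sum)   # 9 12, the sums of the two rows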
A similar result could be obtained by using a for-loop over the rows of the matrix A.
In the case where the variable X is a vector, there exists a similar function called
sapply(). This function has the following syntax:
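sapply(X, FUN, simplify=TRUE)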
There are two differences compared to apply(). First, no MARGIN argument is needed,
because the function FUN will be applied to each component of the vector X. Second,
there is an option called simplify resulting in a simplified output of the function
sapply(). If set to TRUE, the result will have the form of a vector, whereas if set to
FALSE the result will be a list. Which form one prefers depends on the intended
usage, but a vector is usually most suitable for visual inspection. A list result can
also be obtained with the command lapply().
The next example results in a vector, where each element is the third power of
the components of the vector X.
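A sketch of such a call:

X <- 1:5
sapply(X, function(x) x^3)   # 1 8 27 64 125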
The function union() returns a set containing all elements, without duplication, from
the two sets provided as its arguments.
Other commands for sets include intersect(), which returns only the elements that
are in both sets, and setdiff(), which gives only the elements that are in the first, but
not in the second set; i. e., if X = setdiff(Y, Z), then all the elements in the set X are
also in the set Y, but not in the set Z. Table 5.8 provides an overview of set operations.
Table 5.8: Each of these commands will discard any duplicated values in its arguments.
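For example:

Y <- c(1, 2, 3, 4)
Z <- c(3, 4, 5)
union(Y, Z)       # 1 2 3 4 5
intersect(Y, Z)   # 3 4
setdiff(Y, Z)     # 1 2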
When we have a vector x that may contain multiple duplications of several elements,
we can use the function unique() to remove all such duplications:
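x <- c(1, 2, 2, 3, 3, 3)
unique(x)   # 1 2 3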
This can be useful if we want to use the values in the vector x as indices, and we
want to use each index only once.
When discussing the definition of functions in Section 5.6, we mentioned the importance
of making sure that the provided arguments are of the required type. In
general, R provides several useful commands for testing the nature of arguments. In
Table 5.9, we give an overview of the most useful ones.
Table 5.9: Each of these commands allows testing its argument and returns a logical value.
In order to sample elements from a given vector x, we can use the function sample().
Sampling simply means that we draw a certain number of elements from the
components of the vector x according to some rule.
Table 5.10: Each of these commands allows converting its argument to a specific type.
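The syntax of the function sample() is:

sample(x, size, replace=FALSE, prob=NULL)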
Here, x is a vector, from which elements will be sampled. The option size indicates
the number of elements that will be sampled, and replace indicates if the sampling
is with (TRUE), or without (FALSE) replacement. In the case replace = FALSE,
the option size needs to be smaller than the number of elements (length of the
vector) in vector x.
In Figure 5.5, we visualize the two different sampling strategies. The column x
(before) indicates the possible values that can be sampled, and the column x (after)
contains the elements that are “left” after drawing a certain number of elements
from it. In the case of sampling with replacement, there is no difference since each
element that is “removed” from x is replaced with the same element. However, for
sampling without replacement, the number of elements in x decreases. It is important
to note that in the case of sampling with replacement, we can sample the same
element multiple times (see green ball in Figure 5.5). This is not possible without
replacement.
The option prob allows assigning a probability distribution to the elements of the
vector x. By default, a uniform distribution is assumed, i. e., selecting all elements
in x with the same probability.
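A sketch:

x <- 1:3
sample(x, size=5, replace=TRUE, prob=c(0.6, 0.3, 0.1))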
In the case where we just want to sample all integer values from 1 to n, the following
version of the function sample() can be used:
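sample(10)   # a random permutation of the integers from 1 to 10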
In some cases, it may be possible that there is a command, whose execution might
cause an error leading to the interruption of a program. If such a command is used
within a larger program, this will of course result in the crash of the whole program.
To prevent this, there is the command try(), which is a wrapper function to run an
expression in a protected manner. That means, an expression will be evaluated and
in case it would result in an error, it will capture this error, but without leading to
a formal error causing the crash of a program. For example, executing sqrt("two")
results in an error, because the function sqrt() expects a numeric argument and
not a character string. However, using the following, by setting silent=T does not
generate a formal error, but captures it in the object ms:
In order to get the error message, one can execute either of the following commands:
print(ms)         # prints the error message captured in ms
geterrmessage()   # returns the last error message of the current session
The difference between the two commands is that the function geterrmessage() gives
only the last error message in the current R session. That means, if you execute
further commands that also result in an error, you cannot go back in the history of
crashed functions.
In order to use the functionality of the function try() within a program, one can
test if the output of try() is as expected or not. For our example above, this can be
done as follows:
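ms <- try(sqrt("two"), silent=TRUE)
if(is.numeric(ms)){
    print(ms)                    # use the numeric result
} else {
    print("an error occurred")   # handle the captured error differently
}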
In this way, a numeric output can be used in some way, whereas an error message,
resulting in a FALSE for this test, can be handled in a different manner.
One may wonder how it could be possible that a command within a “functional”
program results in an error. The answer is that before a program is functional, it
needs to be tested, and during the testing stage there may be some irregularities,
which the function try() may help to find. Aside from this, an R program may use
external input, e. g., provided by an input file, containing information that is
outside the definition of the program. Hence, it may contain information that is not
as expected in a certain context.
In addition to the function try(), R provides the function tryCatch(), which is a
more advanced version for handling errors and warning events.
There is an easy way to invoke operating system (OS) specific commands by using
the command system(). This command allows the execution of OS commands, like
pwd or ls, as if they were executed from a terminal. However, the real utility of
the function system() is that it can also be used to execute scripts.
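For example (on Linux or macOS):

system("ls")                    # list the files in the current directory
system("Rscript myprogram.R")   # run an R script; the file name is hypothetical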
The input of the function source() is a character string containing the name of the
file.
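For example, with a hypothetical file name:

source("myprogram.R")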
The advantage of writing an R program in a file and then executing it is that
the results are easily reproducible in the future. This is particularly important if we
are writing a scientific paper or a report and we would like to make sure that no
detail about the generation of the results is lost. In this respect, it can be considered
a good practice to store all of our programs in files.
Aside from this, it is also very helpful since we do not need to remember every
detail of a program, which is anyway hardly possible if a program is getting more
and more complex and lengthy. In this way, we can create over time our own library
of programs, which we can use to look up how we solved certain problems, in case
we cannot remember.
5.10 Summary
In this chapter, we provided an introduction to programming with R that covered all
base elements of programming. This is sufficient for the remainder of the book and
should also allow you to write your own programs for a large number of different
problems. A very good free online resource for getting more details about func-
tions, options, and packages is STHDA http://www.sthda.com/english developed
by Alboukadel Kassambara. For unlocking advanced features of R, we recommend
the book by [46]. This is not a cookbook, but provides in-depth explanations and
discussions.
6 Creating R packages
R is an open-source interpreted language for conducting statistical analyses. Nowadays,
it is widely used for statistical software development, data analysis, and machine-learning
applications in multiple scientific areas. R is easy to use and has a large number of
packages available, which allow users to extend their code easily and efficiently.
In this chapter, we show how you can create your own R package for functions
you implemented. This makes your code reusable and portable. An R package is not
only the most appropriate way to achieve this, but it also enables a convenient use of
these functions and ensures the reproducibility of results. Furthermore, R packages
enable us to easily integrate our code with other R packages.
6.1 Requirements
6.1.1 R base packages
Installation of the R environment: the first step for programming in R and developing
R packages is the installation of the R software environment itself. R is an open-source
programming environment, which can be downloaded free from the following address
https://cran.r-project.org/.
The basic R environment provides the following core packages: base, stats,
utils, and graphics.
– base: This package contains all the basic functions (including instructions and
syntax), which allow a user to write code in R. It contains functions, e. g., for
basic arithmetic operations, matrix operations, data structure, input/output,
and for programming instructions.
– utils: This package contains all utility functions for creating, installing, and
maintaining packages and many other useful functions.
– stats: This package contains the most basic functions for statistical analysis.
– graphics: This package provides functions to visualize different types of data
in R.
A user can utilize the functions available in the R-based environment to create their
packages rather than creating functions or objects from scratch. Below are examples
to get a list of all functions in these packages.
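A sketch of such calls:

ls("package:base")    # all objects in the base package
ls("package:stats")   # all objects in the stats package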
6.1.2 R repositories
6.1.3 Rtools
Rtools is required for building R packages. The necessary build tools come with the
R-base installation on Linux and macOS, but on Windows they need to be installed
separately. The ".exe" file of Rtools for installation can be obtained at the following
address: http://cran.r-project.org/bin/windows/Rtools/.
Profiling a program provides information about the memory taken by the code, the
execution time, and the performance of each instruction in the code, which can guide
performance improvements. The function debug() in R-base allows a user to step
through the code execution, line by line. Furthermore, the function traceback() helps
a user to find the line where the code crashed.
Since version 2.13.0, R includes a byte code compiler, which allows users to speed up
their code. In order to use the byte code compiler, the user needs to load the
package compiler, which is part of the standard R distribution. For the byte
compilation of the whole package during installation, a user must add
ByteCompile: true to the DESCRIPTION file of the package. This avoids having
to call cmpfun() for each function individually.
The following listing provides an illustration of the byte code compilation in R:
library(rbenchmark)
library(compiler)
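# the remainder of the original listing was lost; a sketch completing it,
# using a hypothetical function f:
f <- function(n){
    s <- 0
    for(i in 1:n) s <- s + i
    s
}
fc <- cmpfun(f)   # byte-compiled version of f
benchmark(f(1e5), fc(1e5), replications=100)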
Using GPU libraries, such as gpuR, h2o4gpu, or gmatrix, provides an R interface to use
GPU devices for computationally expensive analyses. Users can also parallelize their
R code using any of the following packages: parallel, foreach, and doParallel.
Furthermore, users can write scripts in C or C++ and run them in R using the package
Rcpp for a faster execution of their overall code.
This makes the debugging and profiling of complex code and packages easy. R provides
two types of error-handling mechanisms; the first one is try() or tryCatch(),
and the second is withCallingHandlers(). The command tryCatch() registers exiting
handlers: when the condition is handled, control returns to the context where
tryCatch() was called, thus causing the code to exit the enclosing block when a
condition is signaled. The tryCatch() command is suitable for handling error
conditions. The command withCallingHandlers() defines local handlers, which are
called in the same context where the condition is signaled, and control returns to
the point where the condition was signaled. Hence, it resumes the execution of the
code after handling the condition, and it maintains a full call stack to the code line
or segment that signals the condition. The command withCallingHandlers() is
specifically useful to handle non-error conditions [201].
An example of exception handling is shown in the following Listing.
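The original listing is not shown; a minimal sketch of exception handling with
tryCatch() could be:

res <- tryCatch({
    sqrt("two")                              # this signals an error
}, error = function(e){
    message("caught: ", conditionMessage(e))
    NA                                       # value returned in the error case
}, finally = {
    message("cleanup actions go here")
})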
In the S3 system, a generic function examines the class of its first argument and then
dispatches a relevant method for that class. For example, plot() is a generic function
to visualize data. Different packages inherit the plot function and develop their own
methods. For instance, plot.hclust() is a member function, which provides the
visualization of dendrograms for hclust class objects. In the example given below,
we create a generic trigonometric value function, trgval(), with a default definition
described by trgval.default(), used when the class of the object is unknown. To
create a new method trgval() for the class cosx, we can leverage this generic
function.
# parts of this listing were lost during extraction; a plausible reconstruction:
trgval <- function(x, ...) UseMethod("trgval")

trgval.default <- function(x, inv=FALSE, ...){
    if(inv){
        tmp <- 1/sin(x)   # assumed content of the lost branch
    }
    else{
        tmp <- sin(x)
        names(tmp) <- "sin"
    }
    tmp
}

trgval.cosx <- function(x, inv=FALSE, ...){
    if(inv){
        tmp <- 1/cos(unclass(x))
    }
    else{
        tmp <- cos(unclass(x))
        names(tmp) <- "cos"
    }
    tmp
}
# Example 1
x <- 90
trgval(x)
# Example 2
x <- 90
class(x) <- "cosx"
trgval(x)
# create a new generic function
setGeneric("trgval")
# class definition and return a generator
# function to create objects from the "trg" class
setClass("trg", representation(ag = "numeric",
inv="logical", rd="logical"))
# class definition to create objects from the
#"sinx" class which inherits the class "trg".
setClass("sinx", contains = "trg")
# class definition to create objects from the
#"cosx" class which inherits the class "trg"
setClass("cosx", contains = "trg")
# the beginning of this S4 method was lost; a plausible reconstruction:
setMethod("trgval", "cosx", function(x){
    if(x@inv){
        tmp <- 1/cos(x@ag)
    }
    else{
        tmp <- cos(x@ag)
    }
    tmp
})
The classes in the RC system provide reference semantics and support public and
private methods, active bindings, and inheritance. In this system, methods belong
to objects, not to generic functions. The objects are mutable. Creating RC objects
is similar to creating S4 objects. The methods library available in R implements
RC-based OOP. Also, the library R6 provides functionalities to implement RC-based
OOP. An example of object mutability in the RC system is shown below.
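The original listing is not shown; a minimal sketch with a hypothetical RC class
could be:

Account <- setRefClass("Account", fields = list(balance = "numeric"))
S1 <- Account$new(balance = 100)
S2 <- S1       # S2 is a reference to S1, not a copy
S2$balance <- 50
S1$balance     # 50: the change made through S2 is visible via S1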
In the above example, S2 is not a copy of S1, but a reference to S1, so any changes
made to S2 will be reflected in S1, and vice versa.
We create an R program file named trg.R with the S3 system, as shown in the
example below. The function package.skeleton() can then be used to create a basic
package skeleton with all the necessary folders and files required for building a
package. In our example, we create a method for values of the sin() function. We
start with our main function with the generic name trgval(), which calls the function
UseMethod(). In the next step, we create a default function trgval.default() as well
as the function plot.trg().
# the beginning of this listing was lost; a plausible reconstruction,
# converting degrees to radians unless rd indicates radians already:
trgval.default <- function(x, inv=FALSE, rd=FALSE, ...){
    xrd <- if(rd) x else x*pi/180
    if(inv){
        tmp <- 1/sin(xrd)
    }
    else{
        tmp <- sin(xrd)
        names(tmp) <- "sin"
    }
    res <- list(x=x, xrd=xrd , inv=inv , val=tmp)
    class(res) <- "trg"
    res
}
if(! wave){
rin <- .1
for(i in theta){
x1 <- rin*cos(c(0:i)*pi/180)
y1 <- rin*sin(c(0:i)*pi/180)
lb <- parse(text=(paste0("theta[",i,"]")))
points(x1 , y1 , pch=20, col=20, cex=.1)
text(mn1 , mn2 , lb)
if(rin!=1){
rin = rin+.1
}
}
}
else{
plot(sin(c(minang:maxang)*pi/180) , type="l",
lwd=2, xaxt="n", ylab="sin (x)", xlab="")
abline(h=0, lwd=2)
tmp <- c(minang:maxang)
k <- which(tmp==0)
points(theta+k, sin(theta*pi/180) , col="blue", pch=20, cex=2)
abline(v=theta+k)
tt <- which(abs(tmp)%%90==0)
aa <- tmp[tt]/90
lbn <- sapply(1:length(tt),
function(x) parse(text=paste0("pi")))
axis(1, at=which(abs(tmp)%%90==0), labels=lbn , tick=TRUE)
axis(1, at=which(abs(tmp)%%90==0)-15, labels=aa/2, cex=.5,
tick=FALSE)
}
Now we need to check, compile, and build the package. First, we go to the command
prompt, change to the directory where the package is kept, and run the command
R CMD build [package name] to build the package tarball. We can also run the
same command from within R by calling it inside the system() function. Below we
provide an example.
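A sketch of such calls (the package name follows this chapter's example; the version
number is hypothetical):

system("R CMD build trgpkg")
system("R CMD check trgpkg_1.0.tar.gz")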
The check command creates a folder with the package name and .Rcheck extension.
All the error logs and warning files are created inside this folder. The user can check
all these files to evaluate the package.
library("trgpkg")
data(angls)
zz <- trgpkg::trgval.default(c(30, 60))
\author{
Shailesh Tripathi and Frank Emmert-Streib}
\seealso{
\code{\link{plot}}
}
\examples{
zz <- trgval(90)
plot(zz)
plot(zz, wave=FALSE)
}
\value{
returns a "trg" class object
}
\author{Shailesh Tripathi and Frank Emmert-Streib}
\seealso{
plot.trg, trgval.default}
\examples{
zz <- trgval(c(30, 60, 90))
plot(zz)
plot(zz, wave=FALSE)
}
6.8 Summary
In this chapter, we provided a brief introduction to creating an R package. This
topic can be considered advanced, and it is not required for the remainder of this
book. However, in a professional context, the creation of R packages is necessary
for simplifying the usage and exchange of a large number of individually created
functions.
Nowadays, many published scientific articles provide accompanying R packages
to ensure that all obtained results can be reproduced. Despite the intuitive clarity of
this goal, the reproducibility of results has recently sparked heated discussions,
especially regarding the provisioning of the underlying data [70].
Part II: Graphics in R
7 Basic plotting functions
In this chapter, we introduce plotting capabilities of R that are part of the base
installation. We will see that there is a large number of different plotting functions
that allow a multitude of different visualizations.
7.1 Plot
The most basic plotting tool in R is provided by the plot() function, which allows
visualizing 𝑦 as a function of 𝑥. The following script gives two simple examples (see
Figure 7.1 (A) and (B)):
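# A (the original code for panel A is not shown; a plausible sketch)
x <- seq(from=0, to=2*pi , length.out = 50)
y <- sin(x)
plot(x, y)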
# B
x <- seq(from=0, to=2*pi , length.out = 50)
y <- sin(x)
plot(x, y, type="l")
The two examples shown in Figure 7.1 (A) and (B) demonstrate just a quick
visualization of the functional relation between 𝑥 and 𝑦. However, in order to improve
the visual appearance of these plots, it is usually advisable to utilize additional
options. Below, we show two further examples, see Figure 7.1 (C) and (D), that use
different options.
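# C (the original code for panel C is not shown; a plausible sketch)
x <- seq(from=0, to=2*pi , length.out = 50)
y <- sin(x)
par(mar=c(5,5,1,1))
plot(x, y, type="p", cex.axis=1.6, cex.lab=2.2, font.lab=2,
lwd=8.0, col="blue")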
# D
x <- seq(from=0, to=2*pi , length.out = 50)
y <- sin(x)
par(mar=c(5,5,1,1))
plot(x, y, type="l", cex.axis=1.6, cex.lab=2.2, font.lab=2,
lwd=8.0, col="blue")
The type option allows specifying whether points (p), lines (l), or both (b) should
be drawn. The option cex allows changing the font size of the axis annotation
(cex.axis) and the labels (cex.lab), and font.lab leads to a boldface font for the
labels. Finally, the line width can be adjusted by setting lwd to a positive numerical
value, and col specifies the color of the lines or points.
There is one further command in the above examples (par) that appears unimpressive
at first. However, it allows adjusting the margins of the figure by setting mar.
Specifically, we need to provide a four-dimensional vector containing the margin
values for (bottom, left, top, right) (in this order). This command is important,
because when setting the font size of the labels larger than a certain value, it can
happen that the labels are cut off. To prevent this, the mar option needs to be set
appropriately.
In the following, we will always modify a basic plot by setting additional options
to improve its visual appearance.
There are two functions available that enable adding multiple lines or points to
the same figure, namely lines() and points(). Two examples, produced by the scripts
below, are shown in Figure 7.2 (A) and (B). In order to distinguish different lines or
points from each other, we can specify the line-type (lty) or point-type (pch) option.
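# A (the original code for panel A is not shown; a plausible sketch)
x <- seq(from=0, to=2*pi , length.out = 50)
y1 <- sin(x)
y2 <- sin(x+0.4)
par(mar=c(5,5,1,1))
plot(x, y1 , type="l", cex.axis=1.6, cex.lab=2.2, font.lab=2,
lwd=4.0, lty=1)
lines(x, y2 , lwd=4.0, lty=2)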
# B
x <- seq(from=0, to=2*pi , length.out = 50)
y1 <- sin(x)
y2 <- sin(x+0.4)
par(mar=c(5,5,1,1))
plot(x, y1 , type="p", cex.axis=1.6, cex.lab=2.2, font.lab=2,
lwd=4.0, pch=1)
points(x, y2 , lwd=4.0, pch=2)
This can be extended to multiple lines or points commands, as shown in the
next examples; see Figure 7.2 (C) and (D). Here, we add a legend to the figures
that allows a better identification of the different lines with the parameters that
have been used. The legend command allows specifying the position of the legend
within the figure. Here, we used bottomright so that the legend does not overlap
with the lines or points. Further, we need to specify the text that shall appear in
the legend and the symbol, e. g., lty or pch. There are more options available, but
these provide the basic functionality of a legend.
For the above example, we used the xlab and ylab option to change the ap-
pearance of the labels of the 𝑥- and 𝑦-axis. In the previous examples, we did not
specify them explicitly. For this reason, R uses as default names for the labels the
names of the variables that have been used in the function plot().
# D
n <- 10
x <- seq(from=0, to=2*pi , length.out = 20)
y <- matrix (0, nrow = n, ncol = 20)
for(i in 1:n){
y[i,] <- x + (i-1)/2
}
par(mar=c(5,5,1,1))
plot(x, y[1,], type="p", cex.axis=1.6, cex.lab=2.2,
font.lab=2, lwd=2, pch=1, ylim=c(0,10.5), ylab="y", xlab="x")
for(i in 2:n){
points(x, y[i,], lwd=2.0, pch=i)
}
legend("bottomright", paste("x + (", 1:n, "- 1 )/2"), pch=1:n)
We can also add straight horizontal and vertical lines on the graph using the function
abline(). Depending on the option used, i. e., h or v, horizontal or vertical lines are
added to a figure at the provided values. Also this command allows changing the
line-type (lty) or color (col). In Figure 7.3, we show an example that includes one
horizontal and one vertical line.
Listing 7.5: Add horizontal and vertical lines to a plot, see Figure 7.3
x <- seq(from=0, to=2*pi , length.out = 50)
y <- sin(x)
par(mar=c(5,5,1,1))
plot(x, y, type="l", cex.axis=1.6, cex.lab=2.2, font.lab=2,
lwd=4.0)
abline(v=pi/2)
abline(h=0.5, lty=2)
In order to plot a function in a new figure by keeping a figure that is already created,
one needs to open a new plotting window using one of the following commands:
– X11(), for Linux and Mac if using R within a terminal
– quartz(), for a macOS operating system
– windows(), for a Windows operating system
If these commands are not executed, then every new plot() command executed will
overwrite the old figure created so far.
7.2 Histograms
An important graphical function to visualize the distribution of data is hist(). The
command hist() shows the histogram of a data set. For instance, we are drawing 𝑛 =
200 samples from a normal distribution with a mean of zero, a standard deviation of
one, and saving the resulting values in a vector called x, see the code below. The left
Figure 7.4 shows a histogram of the data with 25 bars of an equal width, set by the
option breaks. Here, it is important to realize that the data in x are raw data. That
means, the vector x does not provide directly the information displayed in Figure 7.4
(Left), but indirectly. For this reason, the number of occurrences of values in x, e. g.,
in the interval 0.5 ≤ 𝑥 ≤ 0.6, need to be calculated by the hist() function. However,
in order to do that, one needs to specify the boundaries of the intervals used for
these calculations. The function hist() supports two different ways to do that.
The first one is to just set the total number of bars the histogram should contain.
The second one is by providing a vector containing the boundary values explicitly.
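# Left (the original code for the left panel is not shown; a plausible sketch)
x <- rnorm(200, mean=0, sd=1)
par(mar=c(5,5,1,1))
hist(x, breaks=25, col="lavender", main="", cex.lab=2.0,
cex.axis=1.4)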
# Right
b <- c(-5,-2,-1.5,-1,-0.5,seq(-0.4, 0.4, 0.2) ,0.5,1,1.5,2,5)
hist(x, breaks=b, col="lavender", main="", cex.lab=2.0,
cex.axis=1.4, freq=F)
An example of the second way is shown in Figure 7.4 (Right). Here, the boundary
values are provided by the vector b. Also, in this case we set the option freq to
FALSE, which results in a histogram of densities. In contrast, Figure 7.4 (Left) shows
the frequencies for each bar, corresponding to the number of 𝑥-values that fall within
the boundaries of one bar.
Figure 7.4: Examples for histograms. Left: Providing the total number of bars. Right: Providing
the boundary values.
In Figure 7.5 (Right), we show a second example for a bar plot that splits
each bar into individual contributing components. Such plots are called stacked bar
charts. For instance, suppose we have 3 factors that contribute to the outcome of
a variable that is measured for 5 different conditions indexed by the letters A–E.
Then, for each condition, the outcome of a variable can be broken down into its
constituting 3 values.
In the example below, we use two new options. The first one is space, allowing
to adjust the spatial distance between adjacent bars. The second one is names.arg,
which allows specifying the labels that appear below each bar. For specifying the
labels, we use the built-in constant LETTERS to conveniently assign the first 5
capital letters as labels.
Figure 7.5: Examples for a normal bar chart (Left) and a stacked bar chart (Right).
As usual, there is more than one way to visualize a data set in a meaningful
way, depending on the perspective. In the following, we present just one alternative
representation of the same data set by sorting the mortality values of the countries.
In Figure 7.8, we ordered the mortality values and grouped the countries into
three categories. Each of these categories is highlighted in a different color by spec-
ifying the option color. The category of a country is specified with the groups
option by providing a vector of factors. If visualized in this way, subgroups within
the data set can be highlighted additionally. Of course, there are further modifications
one could conduct, e. g., the alphabetical organization of the countries within
the subgroups, or subdividing the subgroups, e. g., highlighted by specifying different
symbols using the gpch option.
Figure 7.7: Information about cancer mortality in Europe taken from the World Health Organi-
zation (WHO) data for the year 2013.
dotchart(dat.who.ave$deaths[ind[ind2]],
labels=dat.who.ave[ind[ind2] ,1], cex=0.5, cex.lab=2.0,
cex.main=2.0, font.lab=2,
main="Mortality by european country - ordered",
xlab="Cancer deaths per 100 ,000",
groups=country.g, color=country.col)
Figure 7.8: Ordered information about the cancer mortality in Europe. Same data as in Fig-
ure 7.7.
Finally, we would like to note that a dot plot is also called a Cleveland dot plot,
because William Cleveland pioneered this kind of visualization.
For these examples, we generated 100 integer values from the interval 1 to 100.
The example shown by the red data points corresponds to the base form of this
plot function, which overplots the data points, as specified by the option method.
Alternatively, the data points can be stacked leading to a kind of histogram (in
blue), although the height does not exactly reflect the count values, but is merely
proportional to the relative density of the x values within a certain region. Finally,
the data points can be jittered by applying a small offset between data points of the
same x-value.
Listing 7.12: Example of strip and rug plots, see Figure 7.9
n <- 100
x <- round(runif(n, 1, 100))
par(mar=c(5,5,1,1))
stripchart(x, method="stack", at=0.2, cex=2, offset=0.5,
xlab="x values", ylab="'at ' option", font.lab=2, cex.lab=2.0,
cex.axis=1.4, col = "blue", ylim=c(0,1.8))
stripchart(c, method="overplot", at=0.75, col="red", pch=1,
add=T)
stripchart(c, method="jitter", at=1.5, col = "green3", pch=2,
add=T)
rug(c, lwd=2.5)
$$ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad -\infty \le x \le \infty, \tag{7.1} $$
This is the density of the normal distribution, which is discussed in detail in
Chapter 17. The mean value 𝑚.𝑘 is highlighted by the red vertical line to indicate
the current position of the averaging. The averaging itself involves all data points;
however, the weight is proportional to the density of the normal distribution for the
given values of 𝑚.𝑘 and 𝑠𝑑.𝑘.
Figure 7.10: Density plots for cancer mortality worldwide. Data are from the WHO. Top row:
Averaged over overlapping regions. Bottom row: Averaged and weighted over overlapping re-
gions.
$$ v(m.k) = \sum_{i}^{L} f\big(\mathrm{num}(i);\, m.k,\, sd.k\big). \tag{7.2} $$
Here, 𝐿 is the total number of data points and 𝑛𝑢𝑚(𝑖) is the number of data points
in window 𝑖. In this way, for each position along the 𝑥-axis, the value 𝑣(𝑚.𝑘) is
evaluated by changing the values of 𝑚.𝑘.
The corresponding results are shown in the Figure 7.10 (bottom row). Again,
depending on the value of the bandwidth (bw), the obtained density plots can
be smoothed. For the normal distribution, the parameter bw changes the standard
deviation, making the normal distribution broader for larger values.
We would like to finish this section by mentioning that the above discussion
focused on the graphical meaning of density plots and their underlying idea. However,
it is important to note that the quantitative estimation of the probability density of
a given data set is an important statistical problem in its own right.
In this way, we can create very complex figures that carry a lot of information.
# Right
par(mar=c(5,5,1,1))
image(x, y, z2 , col=terrain.colors(10), xlab = "X",
ylab = "Y", cex.lab=1.9, font.lab=2)
Figure 7.14: Examples for a contour (Left) and an image plot (Right) of a normal distribution.
7.11 Summary
Despite the fact that all of the commands discussed in this chapter are part of the
base installation of R, they provide a vast variety of options for the visualization
of data, as we have seen in the last sections. Extension packages either address
specific problems, e. g., the visualization of networks, or provide different visual
aesthetics.
8 Advanced plotting functions: ggplot2
8.1 Introduction
The package ggplot2 was introduced by Hadley Wickham [200]. The difference
between this package and many others is that it does not only provide a set of
commands for the visualization of data, but it implements Leland Wilkinson's idea
of the Grammar of Graphics [204]. This makes it more flexible, allowing the creation
of many different kinds of visualizations that can be tailored in a problem-specific
manner. In addition, its aesthetic realizations are superb.
The ggplot2 package is available from the CRAN repository and can be installed
and loaded into an R session by
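install.packages("ggplot2")
library("ggplot2")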
There are two main plotting functions provided by the ggplot2 package:
– qplot(): for quick plots
– ggplot(): allows the control of everything (grammar of graphics)
8.2 qplot()
The function qplot() is similar to the basic plot function in R. The q in front of
plot stands for “quick”, in the way that it does not allow getting access to the
full potential provided by the package ggplot2. The full potential is accessible via
ggplot, discussed in Section 8.3.
To demonstrate the functionality of qplot(), we use the penguin data provided
in the package FlexParamCurve.
Listing 8.2: Installing the package FlexParamCurve and loading the data
install.packages("FlexParamCurve")
library("FlexParamCurve")
data(penguin.data)
The penguin.data data frame has 2244 rows and 11 columns of the measured
masses for little penguin chicks between 13 and 74 days of age (see [33]).
In Figure 8.1, we show the basic functionality of qplot(), generating a scatter
plot using the following script:
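The original listing is not shown; a sketch consistent with the description below
could be:

qplot(ckage , weight , data=penguin.data , geom=c("point"),
color=factor(year))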
Figure 8.1: Example of a scatter plot for multiple data sets using qplot().
Similar to the base plot function in R, we first specify the values for the x and y
coordinates using the column names in the data file penguin.data. The geom option
defines the geometry of the object. Here, we chose “point” to produce a scatter plot
for all data points (𝑥𝑖 , 𝑦𝑖 ). Other options are given in Table 8.1. The last option we
Table 8.1: Values for the geom option of qplot().

point        scatter plot
line         connects ordered data points by a line
smooth       smoothed line between data points
path         connects data points by a line in the order provided by the data
step         step function
histogram    histogram
boxplot      boxplot
use is color, to allow different colors for the observation points depending on the
year the observation was made. We use the function factor to indicate that the
values of the variable year are only used as a categorical variable. The aesthetics
command I() can be used to set the color of the data points manually.
In order to further distinguish data points from each other, one can use the
shape option using the factor ck for the hatching order. Because this can lead to a
crowded visualization, qplot() offers the additional option facets. The effect of this
option is shown in Figure 8.2.
Listing 8.4: An example for the usage of facets, see Figure 8.2
qplot(ckage , weight , data =penguin.data , geom=c("point"),
color=factor(year), facets = ~ck)
The two columns in Figure 8.2 are labeled A and B, corresponding to the factors of
the ck variable indicating first hatched (A), and second hatched (B).
Next, we visualize the effect of the option value smooth for geom.
For this, we use only the first 10 observation points. As we can see in Figure 8.3, in
addition to these 10 data points, there is a smooth curve added as a result from the
smoothing function. We would like to note that here, we used a vector to define the
option geom, because we wanted to show the data points in addition to the smooth
curve.
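A sketch of such a call:

qplot(ckage , weight , data=penguin.data [1:10,],
geom=c("point", "smooth"))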
Similar to the base plot function in R, there are options available to enhance the
visual appearance of a plot. Table 8.2 shows some additional options to enhance plots.
8.3 ggplot()
The underlying idea of the function ggplot() is to construct a figure according to a
certain grammar that allows adding the desired components, features, and aspects
Table 8.2: Further options to improve the visual appearance of a plot for qplot().
Option Description
to a figure and then generate the final plot. Each of such components is added as a
layer to the plot.
The base function ggplot() requires two input arguments:
– data: a data frame of the data set to be visualized
– aes(): a function containing aesthetic settings of the plot
In the following, we study some simple examples by using the Orange data set
containing data about the growth of orange trees. To get an overview of these data,
we show the first lines.
> head(Orange)
Grouped Data: circumference ~ age | Tree
Tree age circumference
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
The data set contains only three variables (tree, age, and circumference), where
the variable “Tree” is an indicator variable for a particular tree.
Listing 8.6: Example of a data plot with ggplot, see Figure 8.4 (Left)
data(Orange)
ggplot(data=Orange , aes(age , circumference)) + geom_point ()
In order to plot any figure, we need to use the ggplot() command, and specify
how we want to plot these data by providing information about the geometry. In
the above case, we just want to plot the circumference of the trees as a function of
their age by means of points, see Figure 8.4 (Left). The same result can be obtained
by splitting the whole command into separate parts as follows:
Listing 8.7: A simple point plot with ggplot(), see Figure 8.4 (Left)
p <- ggplot(data=Orange , aes(age , circumference))
p + geom_point ()
Figure 8.4: Examples for point plots. Left: Base functionality without setting options. Right:
Modified point size.
Listing 8.8: Improved point plot with ggplot, see Figure 8.4 (Right)
p <- ggplot(Orange , aes(age , circumference))
p <- p + geom_point(size=3) + scale_x_continuous(name="age of
trees")
p <- p + theme(axis.text.x=element_text(size=12),
axis.title.x=element_text(size=15, face="bold"))
p <- p + theme(axis.text.y=element_text(size=12),
axis.title.y=element_text(size=15, face="bold"))
p
The corresponding result is shown in Figure 8.4 (Right). For this plot we involved
three different layers, namely
– geoms: controls the geometrical objects
– scales: controls the mapping between data and aesthetics
– themes: controls nondata components of the plot
The meaning of the available options is rather intuitive, if we know all options
available. This information can be acquired from the manual of ggplot(), which is
quite extensive.
Beyond the simple usage of ggplot() demonstrated above, the combination of many
options within different layers becomes quickly involved. In Figure 8.5, we show two
additional examples that highlight the presence of multiple data sets.
Listing 8.9: A plot with multiple points, see Figure 8.5 (Left)
p <- ggplot(Orange , aes(age , circumference , color=Tree))
p <- p + geom_point(size=3, aes(shape=Tree)) +
scale_x_continuous(name="age of trees")
p <- p + theme(axis.text.x=element_text(size=12),
axis.title.x=element_text(size=15, face="bold"))
p <- p + theme(axis.text.y=element_text(size=12),
axis.title.y=element_text(size=15, face="bold"))
p
By adding the option color to the function aes() in ggplot(), different colors will be
assigned to different factors, as given by Orange$Tree, and a figure legend will be
automatically generated. Specifying the types of the shape for the data points will,
in addition, assign different point shapes corresponding to the different trees.
Listing 8.10: A plot with multiple lines, see Figure 8.5 (Right)
ind <- order(as.numeric(levels(Orange$Tree)))
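# the remainder of the original listing was lost; a plausible completion:
Orange$Tree2 <- factor(Orange$Tree , levels=levels(Orange$Tree)[ind])
p <- ggplot(Orange , aes(age , circumference , color=Tree2))
p <- p + geom_line(size=2) + scale_x_continuous(name="age of trees")
p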
Furthermore, we rearranged the 5 trees in the legend in their numerical order. This
is a bit tricky, because there is no option available that would allow doing this
directly. Instead, it is necessary to provide this information for the different factors
by generating a new factor, Tree2, that contains this information.
Before we continue, we would like to comment on the logic behind ggplot()
used to add either multiple points or lines to a plot. In contrast to the basic plotting
function plot() discussed in Chapter 7.1.1, which adds multiple data sets successively
by, e. g., using the lines() command, ggplot() can accomplish this by setting an option
(shape). However, this requires the data frame to contain information about this in
the form of an indicator variable (in our case “Tree”). Hence, the simplification in
the commands for multiple lines needs to be compensated by a more complex data
frame. This can be in fact nontrivial.
The good news is that it is possible to use ggplot() in the same logical way as
the basic plotting function. An example for this is shown in Listing 8.11.
# df1 and df2 are not defined in the original listing; two hypothetical
# data frames, each holding one pair of data vectors, could be:
df1 <- data.frame(X=Orange$age , Y=Orange$circumference)
df2 <- data.frame(X=Orange$age , Y=0.9*Orange$circumference)
gg <- ggplot()
gg <- gg + geom_line(data=df1 , aes(x=X,y=Y), size=1.5,
color='purple')
gg <- gg + geom_line(data=df2 , aes(x=X,y=Y), size=1.5,
color='green')
print(gg)
For the shown example in Listing 8.11, there is certainly no advantage in using
ggplot() in this way, because a data frame with the required information exists
already. However, if one has two separate pairs of data in the form 𝐷𝑖 = {(𝑥𝑖 , 𝑦𝑖 )}
available, the advantage becomes apparent.
8.3.3 geoms()
Table 8.3: Functions associated with ggplot() and geom() and their corresponding
counterparts in the R base package.

geom_point()                  points()
geom_line()                   lines()
geom_curve()                  curve()
geom_hline()                  hline()
geom_vline()                  vline()
geom_rug()                    rug()
geom_text()                   text()
geom_smooth(method = "lm")    abline(lm(y ~ x))
geom_density()                lines(density(x))
geom_smooth()                 lines(loess(x, y))
geom_boxplot()                boxplot()
For adding straight lines to a plot, we can use the functions geom_abline() and
geom_vline(), see Figure 8.6 (Left). Because a straight line is fully specified by an
intercept and a slope, these two options need to be set for geom_abline(). If we
use a zero slope, we obtain a horizontal line. For adding vertical lines, the function
geom_vline() can be used, specifying the option xintercept. In addition, both
functions allow setting a variety of additional options to change the visual appearance
of the lines. For example, valid linetype values include solid, dashed, dotted,
dotdash, longdash, and twodash.
Figure 8.6: Examples using geom_abline() and geom_vline() (Left) and geom_step() (Right).
Listing 8.12: Additional modifications of a multiline plot, see Figure 8.6 (Left)
p <- ggplot(Orange, aes(age, circumference, color=Tree))
p <- p + geom_line(size=3) + scale_x_continuous(name="age of trees")
p <- p + theme(axis.text.x=element_text(size=12),
     axis.title.x=element_text(size=15, face="bold"))
p <- p + theme(axis.text.y=element_text(size=12),
     axis.title.y=element_text(size=15, face="bold"))
p <- p + geom_abline(intercept=70, slope=0.1, size=1,
     linetype="dotted")
p <- p + geom_abline(intercept=50, slope=0, size=1,
     linetype="dashed")
p <- p + geom_vline(xintercept=1200, size=1, linetype="longdash")
p
In Figure 8.6 (Right), we show an example of the geom_step() function. This
function connects the data points by horizontal and vertical lines, making it easier
to recognize jumps in the data.
Listing 8.13: An example for step functions, see Figure 8.6 (Right)
p <- ggplot(Orange, aes(age, circumference, color=Tree))
p <- p + geom_step(size=3) + scale_x_continuous(name="age of trees")
p <- p + theme(axis.text.x=element_text(size=12),
     axis.title.x=element_text(size=15, face="bold"))
p <- p + theme(axis.text.y=element_text(size=12),
     axis.title.y=element_text(size=15, face="bold"))
p
In Figure 8.7, we show examples for boxplots using the function geom_boxplot().
For these examples, we do not distinguish the different trees, but we are rather
interested in the distribution of the circumferences of the trees at the 7 different
time points of their measurement.
In Figure 8.7 (Top row, Right), we added the original data points from which the
boxplots are computed. In order to avoid a potential overlap between the data points,
the function geom_jitter() can be used to introduce a slight horizontal shift to the
data points. These shifts are randomly generated and, hence, different executions of
this function lead to different visual arrangements of the data points.
Listing 8.14: Examples for boxplots shown in Figure 8.7 (Top row)
# Top-Left
p <- ggplot(Orange, aes(factor(age), circumference))
p <- p + geom_boxplot() + scale_x_discrete(name="age of trees")
p <- p + theme(axis.text.x=element_text(size=12),
     axis.title.x=element_text(size=15, face="bold"))
p <- p + theme(axis.text.y=element_text(size=12),
     axis.title.y=element_text(size=15, face="bold"))
p
# Top-Right
p <- p + geom_jitter()
p
Listing 8.15: Examples for boxplots shown in Figure 8.7 (Bottom row)
# Bottom-Left
p <- ggplot(Orange, aes(factor(age), circumference))
p <- p + geom_boxplot(aes(fill=factor(age))) +
     scale_x_discrete(name="age of trees")
p <- p + geom_jitter()
p <- p + geom_rug(sides="l")
p <- p + theme(axis.text.x=element_text(size=12),
     axis.title.x=element_text(size=15, face="bold"))
p <- p + theme(axis.text.y=element_text(size=12),
     axis.title.y=element_text(size=15, face="bold"))
p
# Bottom-Right
p <- ggplot(Orange, aes(factor(age), circumference,
     color=factor(age)))
p <- p + geom_boxplot() + scale_x_discrete(name="age of trees")
p <- p + geom_jitter(color="black")
p <- p + geom_rug(sides="bl", position="jitter")
p <- p + theme(axis.text.x=element_text(size=12),
     axis.title.x=element_text(size=15, face="bold"))
p <- p + theme(axis.text.y=element_text(size=12),
     axis.title.y=element_text(size=15, face="bold"))
p
8.3.4 Smoothing
In Figure 8.8 (Top row, Right), we add the information of the standard error in the
form of a gray band underlying the loess curve. Furthermore, we set the color of this
band with the fill option. We want to point out that if this option is not specified,
the default is a transparent background. Unfortunately, in our experience this
can cause problems, depending on the operating system. For this reason, setting this
option explicitly is a trick to circumvent potential problems.
Listing 8.16: Some examples for data smoothing, see Figure 8.8 (Top row)
# Top-Left
p <- ggplot(Orange, aes(age, circumference))
p <- p + geom_point(size=2) + scale_x_continuous(name="age of trees")
p <- p + stat_smooth(method = "loess", se=F, size=2)
p <- p + theme(axis.text.x=element_text(size=12),
axis.title.x=element_text(size=15, face="bold"))
p <- p + theme(axis.text.y=element_text(size=12),
axis.title.y=element_text(size=15, face="bold"))
p
# Top-Right
p <- ggplot(Orange, aes(age, circumference))
p <- p + geom_point(size=2) + scale_x_continuous(name="age of trees")
p <- p + stat_smooth(method = "loess", se=T, fill="grey60", size=2)
p <- p + theme(axis.text.x=element_text(size=12),
axis.title.x=element_text(size=15, face="bold"))
p <- p + theme(axis.text.y=element_text(size=12),
axis.title.y=element_text(size=15, face="bold"))
p
Next, in Figure 8.8 (Bottom-left), we add explicit error bars to the color band
by using the geom option of stat_smooth(). Since we have a continuous 𝑥-axis,
we need to specify the number n of error bars we want to add to the smoothed
curve.
Finally, in Figure 8.8 (Bottom-right), we show an example of a different smoothing
function. In ggplot2, the available options are lm (linear model), glm (generalized
linear model), gam (generalized additive model), loess, and rlm (robust linear
model); in this figure, we use glm, which with its default settings yields the same
result as lm. A linear model means that the resulting curve will be restricted to a
straight line obtained from a least-squares fit.
Listing 8.17: Some examples for data smoothing, see Figure 8.8 (Bottom row)
# Bottom-Left
p <- ggplot(Orange, aes(age, circumference))
p <- p + geom_point(size=2) + scale_x_continuous(name="age of trees")
p <- p + stat_smooth(method = "loess", fill="grey60", size=2)
p <- p + stat_smooth(method="loess", geom = "errorbar", size=0.75,
     n = 10)
p <- p + theme(axis.text.x=element_text(size=12),
axis.title.x=element_text(size=15, face="bold"))
p <- p + theme(axis.text.y=element_text(size=12),
axis.title.y=element_text(size=15, face="bold"))
p
# Bottom-Right
p <- ggplot(Orange, aes(age, circumference))
p <- p + geom_point(size=2) + scale_x_continuous(name="age of trees")
p <- p + stat_smooth(method = "glm", fill="grey60", size=2)
p <- p + stat_smooth(method="glm", geom = "errorbar", size=0.75, n = 10)
p <- p + theme(axis.text.x=element_text(size=12),
axis.title.x=element_text(size=15, face="bold"))
p <- p + theme(axis.text.y=element_text(size=12),
axis.title.y=element_text(size=15, face="bold"))
p
8.4 Summary
The purpose of this chapter was to introduce the base capabilities offered by ggplot2
and to highlight some aesthetic extensions it offers over the basic R plotting functions.
It is clear that the Grammar of Graphics offers a very rich framework with incredibly
many aspects that is continuously evolving. For this reason, the best way to learn
further capabilities is by following online resources, e. g., https://ggplot2.tidyverse.
org/ or http://moderngraphics11.pbworks.com/f/ggplot2-Book09hWickham.pdf.
9 Visualization of networks
9.1 Introduction
In this chapter, we discuss two R packages, igraph and NetBioV [42, 187]. Both have
been specifically designed to visualize networks. Nowadays, network visualization
plays an important role in many fields, as networks can be used to visualize complex
relationships between a large number of entities. For instance, in the life sciences,
various types of biological, medical, and gene networks, e. g., ecological networks,
food networks, protein networks, or metabolic networks, serve as mathematical
representations of ecological, molecular, and disease processes [9, 71]. Furthermore, in
the social sciences and economics, networks are used to represent, e. g., acquaintance
networks, consumer networks, transportation networks, or financial networks [74].
Finally, in chemistry and physics, networks are used to encode molecules, rational
drugs, and complex systems [20, 55].
All these fields, and many more, benefit from a sensible visualization of networks,
which enables gaining an intuitive understanding of the meaning of structural re-
lationships between the entities within the network. Generally, such a visualization
precedes a quantitative analysis and informs further research hypotheses.
9.2 igraph
In Chapter 16, we will provide a detailed introduction to networks, their definition,
and their analysis. Here, we will only restate that a network consists of two basic
elements, nodes and edges, and the structure of a network can be defined in two
ways, by means of:
– an edge list, or
– an adjacency matrix.
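The graph g used in the following can be constructed as sketched below; the concrete values of the edge list el are an assumption, chosen to be consistent with the output shown next:

library(igraph)
# edge list with the two edges 1--2 and 2--3
el <- matrix(c(1, 2,
               2, 3), ncol=2, byrow=TRUE)
g <- graph.edgelist(el, directed=FALSE)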
V(g)
Vertex sequence:
[1] 1 2 3
E(g)
Edge sequence:
[1] 2 -- 1
[2] 3 -- 2
The second command defines an edge list (el). The function graph.edgelist()
converts the matrix el into an igraph object g representing a graph. Calling the
functions V and E, with g as argument, provides information about the vertices and
the edges of the graph g. This is an example of a simple graph consisting of merely
three nodes, labeled 1, 2, and 3. The graph contains only two edges: one between
nodes 1 and 2, and one between nodes 2 and 3. By using the function plot(), the
igraph object g can be visualized.
In Figure 9.1 (Top-left), the output of the above plot function is shown. In order
to understand the effect of the option directed in the function graph.edgelist(), we
show in Figure 9.1 (Top-right) an example where this option is set to TRUE.
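A sketch of the corresponding command, reusing the edge list el from above:

g <- graph.edgelist(el, directed=TRUE)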
E(g)
Edge sequence:
[1] 1 -> 2
[2] 2 -> 3
The result is that the edges now have arrows, pointing from one node to another.
Specifically, for directed=T, the first column of the edge list contains the nodes from
which an edge points toward the nodes contained in the second column.
An alternative definition of a graph can be obtained by an adjacency matrix.
The following script produces exactly the same result as in Figure 9.1 (Top-left):
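A minimal sketch of such a script, where the 3 × 3 adjacency matrix encodes the two edges of the graph above, might look as follows:

A <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0), nrow=3, byrow=TRUE)
g <- graph.adjacency(A, mode="undirected")
# for the directed graph in Figure 9.1 (Top-right), use a matrix with
# only the entries A[1,2] and A[2,3] set to 1, and mode="directed"
plot(g)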
By setting the option mode="directed", we obtain the graph in Figure 9.1 (Top-
right).
Here, an adjacency matrix is a binary matrix, i. e., it contains only zeros and ones. If node
𝑖 is connected with node 𝑗, then the corresponding element (𝑖, 𝑗) is one; otherwise,
it is zero. The adjacency matrix is a square matrix, meaning that the number of
rows is the same as the number of columns, and the number of rows corresponds to
the number of nodes of the graph.
For the examples above, we defined the structure of a graph manually, either by
defining an edge list or its adjacency matrix. However, the igraph package also
provides a large number of functions to generate networks with certain structural
properties. In Tables 9.1 and 9.2, we list some of these.
In Figure 9.1 (Bottom), we show two such examples.
This functionality for generating such networks is very convenient because imple-
menting network generation algorithms can be tedious.
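For instance, two such generator functions can be called as follows; the parameter values are chosen purely for illustration:

g1 <- erdos.renyi.game(100, 0.05)  # random graph with edge probability 0.05
g2 <- barabasi.game(100)           # scale-free network via preferential attachment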
Figure 9.2 illustrates the outputs from network visualization when modifying the
vertex and edge attributes.
Although, in principle, the attributes for each vertex and edge can be set inde-
pendently by specifying a numeric or character vector, this is not necessary in the
case where the attributes have to be identical for all vertices or edges. Then, it is
sufficient to provide a scalar numeric or character value to set the option throughout
the network.
# Top-left
vs <- round(seq(1, 2*L, length.out=L))  # L: number of vertices of graph g (defined earlier)
vc <- heat.colors(L)
plot(g, vertex.size=vs, vertex.label=as.character(vs),
     vertex.label.dist=1.3, vertex.color=vc)
# Top-right
vsh <- c(rep("circle", L-2), rep("pie", 2))
v.pie <- as.pairlist(rep(0, L))
v.pie[[L-1]] <- c(3,2,6); v.pie[[L]] <- c(5,3,5,2,5)
v.pie.col <- list(c("black", "blue", "red", "green", "yellow"))
plot(g, vertex.size=vs, vertex.label=as.character(vs),
     vertex.label.dist=1.3, vertex.color=vc, vertex.shape=vsh,
     vertex.pie=v.pie, vertex.pie.color=v.pie.col)
# Bottom-left
v.shapes <- vertex.shapes()
plot(g, vertex.shape=v.shapes, vertex.label=v.shapes,
     vertex.label.dist=1.2,
     vertex.size=20, vertex.color="green",
     vertex.pie=lapply(shapes(), function(x) if (x=="pie") c(1,4,2)
                       else 0), vertex.pie.color=list(heat.colors(5)))
# Bottom-right
el <- as.character(1:L)
elc <- seq(1, 5, length.out=L)
elcol <- terrain.colors(L)
ec <- sample(c(0, -0.5, 0.5), replace=T, L)
eam <- sample(c(0,1,2,3), replace=T, L)
plot(g, vertex.label="", edge.label=el, edge.label.cex=elc,
     edge.label.color=elcol, edge.curved=ec, edge.arrow.mode=eam)
Figure 9.3: Effect of different layout functions used to generate the cartesian coordinates of the
nodes of a network. Importantly, the same network is used in all cases.
Specifically, we generate a scale-free network with 𝑛 = 500 nodes and use four
different layout functions to generate the cartesian coordinates for the nodes of this
network.
It is clear from Figure 9.3 that, depending on the algorithm used to generate
the cartesian coordinates, the same network “looks” quite different. The coordinates,
contained in la1 to la4, are represented in the form of a matrix with 𝑛 rows (the
number of nodes) and 2 columns (corresponding to the x- and y-coordinates of a
node). That means the 𝑥- and 𝑦-coordinates are used to place the nodes of the
network onto a 2-dimensional plane, as shown in Figure 9.3.
The reason why all four layout styles result in different coordinates is that each
layout style is in fact an optimization algorithm, and each of these algorithms
uses a different optimization function. For example, layout.fruchterman.
reingold and layout.kamada.kawai are two force-based algorithms, proposed by
Fruchterman & Reingold and Kamada & Kawai [82, 106], that optimize the distance
between the nodes in a way similar to spring forces. In contrast, layout.random
chooses random positions for the 𝑥- and 𝑦-coordinates. Hence, it is the only layout
style among those illustrated that is not based on an optimization algorithm.
An important lesson from the above examples is that, for a given network, the
graphical visualization is not trivial, but requires additional work to select a layout
style that best matches the intentions of the user.
There are two possible ways to plot an igraph object g representing a graph. The
first option is to use the function plot(). This option has been used in the previous
examples. The second option is to use the function tkplot(). In contrast to the
function plot(), the function tkplot() allows the user to change the positions of the
vertices interactively via a graphical user interface (GUI), i. e., by means of the
computer mouse.
At first glance, the function tkplot() may appear superior because of
its interactive capability. However, for large networks, i. e., networks with more than
50 vertices, it is hardly possible to adjust the position of each vertex manually.
That means, practically, the utility of tkplot() is rather limited, because only small
networks can be adjusted. A second argument against the usage of tkplot() is that,
due to the involvement of a graphical user interface, there may be operating-system-specific
problems caused by the usage of TK libraries. Such TK libraries
are freely available for all common operating systems; however, some systems may
require these libraries to be installed when they are not available.
9.3 NetBioV
NetBioV is another package for visualizing networks. It provides three main layout
architectures, namely global, modular, and layered layouts [187]. These layouts
can be used either separately or in combination with each other. The rationale
behind this functionality is that a network should be visualized not only using one
layout, but through many perspectives. Furthermore, since many real-world networks
are generally acknowledged to have a scale-free, modular, and hierarchical structure,
these three categories of layouts enable the highlighting of, e. g., specific biological
aspects of the network. Moreover, NetBioV includes an additional layout category,
which enables a spiral view of the network. In the spiral view, the nodes can be
placed using a force-based algorithm or according to network measures for the nodes.
Overall, this provides a more abstract view of networks.
Real-world networks, e. g., biological or social networks, are usually not planar.
That means they have edges crossing each other when the graph is displayed in a
two-dimensional plane. However, for a more effective visualization, the crossing of
edges should be minimized. The global network layouts of NetBioV aim to minimize
such crossings.
The most important features that can be highlighted via a global layout include
the backbone of the network, the spread of information within the network, and
the properties of the nodes, e. g., using various network measures. For instance,
for highlighting the backbone structure of a network, NetBioV applies the following
strategy. In the first step, we define the backbone of a network. For this, we use
the minimum spanning tree (MST) algorithm to extract a subnetwork from a given
network. In the second step, we obtain the coordinates for the nodes by applying a
force-based algorithm to the subnetwork consisting of the MST. In the third step,
we assign a unique color to the MST edges, whereas the remaining edges are colored
according to the distance between the nodes.
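A minimal sketch of this strategy, using the mst.plot() function of NetBioV and the A. thaliana network g1 that is also used later in this chapter:

library(netbiov)
data("PPI_Athalina")            # provides the network g1
id <- mst.plot(g1, v.size=1.5)  # MST backbone highlighted; remaining edges colored by distance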
Most networks have a modular structure; that means there are groups of nodes
that are more strongly connected with each other than with the rest of the network.
Depending on the origin of the network, such modules serve different purposes. For
instance, in biology, these modules can be thought of as performing a specific biological
function for the organism.
The modular network layouts in NetBioV allow highlighting the individual modules
by using standard graph-layout algorithms. The principal approach works
as follows. In the first step, we determine the relative coordinates of the nodes within
each module. In the second step, we optimize the coordinates for each module using
standard layout algorithms, and then we place the modules according to these
positions. In general, modules can be identified with module detection algorithms.
However, in specific application areas, other approaches are also possible. For instance,
in biology, modules can be defined by gene sets defined via biological databases, such
as Gene Ontology [6] or KEGG [107].
light colors in a module. The nodes in a graph can also be colored individually in two
ways. The first coloring option is based on the global rank in the network, whereas
the second coloring option is based on local ranks in the modules. The ranks are de-
termined by the different properties of nodes, such as the degree or expression value
observed from, e. g., experimental data. The global rank describes the rank of an indi-
vidual node with respect to all other nodes in the network, whereas the local rank of a
node in a module is obtained with respect to the nodes from the same module. Edges
for different modules can be colored differently so that the connectivity of individual
modules can be highlighted. Additionally, the node size can be used to highlight the
rank of the nodes in the network. Moreover, for each module, an individual graph
layout can be defined by passing a vector of layout parameters as an argument to a
modular layout function.
For the layered network layout, the color scheme is defined as follows. For a
directed network, the levels of the network are divided into three sections, namely
the lower, the initial, and the upper section. Importantly, only the initial section
and the upper section are used for undirected networks. A user can assign different
colors to different levels. For a directed network, if edges connect nodes with a level
difference greater than one, then edges are colored using two colors for two opposite
directions (up and down). If edges connect nodes on the same level, then the edges
are shown in a curved shape and in a unique color.
data("artificial2.graph")
# Right
mec <- "green"
vc <- rgb(r=1, g=0, b=0, alpha=.7)
ecls <- rgb(r=.5, g=.5, b=1, alpha=.3)
id <- mst.plot.mod(g1, layout.function=layout.fruchterman.reingold,
      mst.edge.col=mec, vertex.color=vc, colors=ecls, v.size=1.5)
# Right
data("PPI_Athalina")
data("modules_PPI_Athalina")
clx <- rgb(red=0, green=.6, blue=.6, alpha=0.4)
cl <- rep(clx, 28)
lb <- names(lm)
lb[c(1:24)[-c(1,5,17,21)]] <- ""
Figure 9.4: Global network layouts using different options available in NetBioV. Left: Coloring
vertices of the B-cell lymphoma network based on external information, such as expression value
(red to blue: smaller to higher expression value). Right: Edges of the MST are shown in "green"
and the remaining edges in "blue".
Figure 9.5: Modular layouts using different options available in NetBioV. Left: Abstract modular
view of A. thaliana; each module is labeled with the most significantly enriched GO-pathway. Edge
width is proportional to the number of connections between modules. Right: Information flow in
the A. thaliana network, highlighting shortest paths between nodes of modules 1, 5, 17, and 21.
names(lm) <- lb
id <- plot.modules(g1, mod.list=lm,
     layout.function=layout.fruchterman.reingold,
     modules.color=cl, mod.edge.col=c(clx), ed.color=c(clx), sf=-20,
     nodeset=c(1,5,17,21), col.s1="blue", col.s2="purple",
     nodes.on.path="red", mod.lab=TRUE, lab.color="white",
     v.size.path=1.5, v.size=1.2)
Figure 9.6: Layered network layouts. Left: The B-cell lymphoma network is shown. Right: The
protein-protein interaction (PPI) network of Arabidopsis thaliana is shown.
# Right
data(PPI_Athalina)
clx <- rgb(red=.3, green=.3, blue=1, alpha=0.2)
id <- level.plot(g1, layout.function=layout.reingold.tilford,
     vertex.colors=c(clx, clx, clx), edge.col=c(clx, clx, clx, clx),
     e.size=.3, e.curve=.4, initial_nodes=c(1,5,7,12,101,125),
     nodeset=list(c(1,5,7,101), c(501,701,801,901,1001)),
     order_degree=NULL)
9.4 Summary
Networks from biology, chemistry, economics, or the social sciences can be seen as
a data type. For the visualization of such networks, we provided in this chapter
an introduction to igraph and NetBioV. Overall, igraph provides many helpful
base commands for the generation, manipulation, and also visualization of graphs,
whereas NetBioV focuses on high-level visualizations from a global, modular, and
layered perspective.
10 Mathematics as a language for science

Thinking in abstract mathematical terms makes you a better programmer and, hence, a
better data scientist.
This is also the reason why mathematics is sometimes called the language of science
[185, 188], as noted by Galileo.
Before we proceed, we would like to add a couple of notes for clarification. First,
by a programmer we actually mean a scientific programmer, i. e., someone who is
concerned with the conversion of statistical and machine learning ideas into a computer
program, rather than a general programmer who implements graphical user
interfaces (GUIs) or websites. The crucial difference is that the level of mathematics
needed for, e. g., the implementation of a GUI is minimal compared to that needed
for the implementation of a data analysis method. Also, such a way of programming
is usually purely deterministic and not probabilistic. On the other hand, the nature
of a data analysis is to deal with measurement errors and other imperfections of the
data. Hence, probabilistic and statistical methods cannot be avoided in data science,
since they are integral pillars of the topic.
Second, although it is certainly not necessary to implement every method for
conducting a data analysis, a good data scientist should be capable of implementing
Figure 10.1: Generic visualization of any data analysis problem. Data analysis is conducted via
a computer program that has been written based on statistical- and machine-learning methods
informed with domain-specific knowledge, e. g., from biology, medicine, or the social sciences.
any method required for the analysis. Third, the natural language we are speaking,
e. g., English, does not translate equally well into a computer language like R, but
there are certain terms and structures that translate better. For instance, when we
speak about a “vector” and its components, we will have no problem capturing
this in R, as shown in Chapter 5. Furthermore, in Chapter 12, we will learn much
more about vectors in the context of linear algebra. This is not a coincidence, but
the meaning of a vector is informed by its mathematical concept. Hence, whenever
we use this term in our natural language, we have an immediate correspondence to
its mathematical concept. This implies that the more we know about mathemat-
ics, the more we become familiar with terms that are well defined mathematically,
and such terms can be swiftly translated into a computer program for data analy-
sis.
We would like to finish this section by adding one more example that demon-
strates the importance of “language” and its influence on the way humans think.
Suppose you have a twin sibling and you both are separated right after birth. You
grow up the way you did, and your twin grows up on a deserted island without
civilization. Then, let's say after 20 years, you both are independently asked a series
of questions and given tasks to solve. Given that you both share the same DNA,
one would expect that both of you have the same potential in answering these
questions. However, in practice it is unlikely that your twin will perform well, because
of basic communication problems in the first place. In our opinion, the language of
“mathematics” plays a similar role with respect to “questions” and “tasks” from a
data analysis perspective.
In the remainder of this chapter, we provide a discussion of some basic abstract
mathematical symbols and operations we consider very important to (A) help formulate
concise mathematical statements, and (B) shape the way of thinking.
10.2 Numbers and number operations
Each of the above symbols represents a set of all numbers that belong to the corresponding
number system. For instance, N represents all natural numbers, i. e.,
1, 2, 3, . . . ; Z represents all integer numbers, i. e., . . . , −2, −1, 0, +1, +2, . . . ; Q represents
all rational numbers 𝑎/𝑏 with 𝑎 and 𝑏 ≠ 0 being integer numbers; and R is the
set of all real numbers, e. g., 1.4271.
There is a natural connection between these number systems in the way that
N ⊂ Z ⊂ Q ⊂ R ⊂ C. (10.1)
That means, e. g., every integer number is also a real number, but not every integer
number is a natural number. Furthermore, the special sets Z+ and R+ denote the
sets of all positive integers and positive reals, respectively.
Intervals
When defining functions, it is common to limit the values of numbers to specific
intervals. One distinguishes finite from infinite intervals. Specifically, finite intervals
can be defined in four different ways:
– closed: [𝑎, 𝑏] = {𝑥 ∈ R : 𝑎 ≤ 𝑥 ≤ 𝑏};
– open: (𝑎, 𝑏) = {𝑥 ∈ R : 𝑎 < 𝑥 < 𝑏};
– half-open: [𝑎, 𝑏) = {𝑥 ∈ R : 𝑎 ≤ 𝑥 < 𝑏};
– half-open: (𝑎, 𝑏] = {𝑥 ∈ R : 𝑎 < 𝑥 ≤ 𝑏}.
The difference between a closed interval and an open interval is that for a closed
interval the end point(s) belong to the interval, whereas this is not the case for an
open interval.
Modulo operation
The modulo operation gives the remainder of a division of two positive numbers 𝑎
and 𝑏. It is defined for 𝑎 ∈ R+ and 𝑏 ∈ R+ ∖ {0} by
$$\operatorname{modulo}(a, b) = a - b \left\lfloor \frac{a}{b} \right\rfloor.$$
For instance, modulo(𝑁 + 1, 𝑁 ) = 1 for 𝑁 > 1. (10.12)
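In R, the modulo operation is available via the operator %%; a minimal sketch:

7.5 %% 2   # 1.5
11 %% 10   # 1, i.e., modulo(N + 1, N) for N = 10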
Rounding operations
The floor and ceiling operations round a real number down or up to the nearest
integer value. The corresponding functions are denoted by ⌊𝑥⌋ (floor) and ⌈𝑥⌉
(ceiling); for instance, ⌊3.91⌋ = 3 and ⌈3.91⌉ = 4. Finally, the truncation function,
trunc(𝑥), of a real number 𝑥 is just the integer digits of the number 𝑥 without the
fractional digits, i. e., the digits after the decimal point. For instance, trunc(3.91) = 3.
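In R, these operations are available as the functions floor(), ceiling(), and trunc():

floor(3.91)    # 3
ceiling(3.91)  # 4
trunc(3.91)    # 3
trunc(-3.91)   # -3; for negative numbers, trunc() differs from floor()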
Sign function
For any real number 𝑥 ∈ R, the sign function, sign(𝑥), is given by
$$\operatorname{sign}(x) = \begin{cases} +1 & \text{if } x > 0; \\ \phantom{+}0 & \text{if } x = 0; \\ -1 & \text{if } x < 0. \end{cases} \tag{10.15}$$
Absolute value
The absolute value or the modulus of a real number 𝑥 ∈ R is given by
$$\operatorname{abs}(x) = |x| = \begin{cases} +x & \text{if } x \geq 0; \\ -x & \text{if } x < 0. \end{cases} \tag{10.16}$$
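Both functions are directly available in R as sign() and abs():

sign(-2.5)  # -1
abs(-2.5)   # 2.5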
$$\mathcal{C} = \{K, Q, N, B, p\} \tag{10.17}$$
is the set containing chess pieces, and 𝒟 = {𝒜, 𝒞} is a set of sets. From these
examples, one can see that an object is something very generic, and a set is just a
container for objects. Usually, the objects of a set are enclosed by the curly brackets
“{” and “}”.
The symbol ∈ denotes the membership relation, to indicate that an object is
contained in a set. For instance, 2 ∈ 𝒜, and ∘ ∈ ℬ. Here, the objects 2 and ∘ are also
called elements of the corresponding sets.
Figure 10.2: Visualization of set operations. Left: The union of two sets. Right: The intersection
of 𝐴1 and 𝐴2 .
An alphabet Σ is a finite set of atomic symbols, e. g., Σ = {𝑎, 𝑏, 𝑐}. That means Σ
contains all elements for a given setting; no other elements can exist.
Σ⋆ is the set of all words over Σ. For example, if Σ = {𝑏}, then Σ⋆ = {𝜖, 𝑏, 𝑏𝑏, 𝑏𝑏𝑏,
𝑏𝑏𝑏𝑏, . . .}. Here, 𝜖 denotes the empty word.
There are three quantifiers from predicate logic that allow a concise description
of properties of elements of sets.
Definition 10.3.1. The expression ∀ means for all. For example, if 𝐴 = {𝑎1 , 𝑎2 , 𝑎3 },
then by ∀ 𝑥 ∈ 𝐴, we mean all elements in the set 𝐴, i. e., 𝑎1 , 𝑎2 , 𝑎3 .

Definition 10.3.2. The expression ∃ means there exists. For example, ∃ 𝑥 ∈ 𝐵 : 𝑥 < 2
means that in the set 𝐵 there exists at least one element that is less than 2.

Definition 10.3.3. The expression ∃! means there exists exactly one. For example, ∃! 𝑥 ∈
𝐵 : 𝑥 < 2 means that in the set 𝐵 there exists exactly one element that is less than 2.
10.4 Boolean logic

Given the set
$$\mathcal{O} := \{\neg, \wedge, \vee, ()\} \tag{10.18}$$
of logical operators, we can easily construct logical formulas. For instance, the formulas
$$\neg v, \quad v \wedge q, \quad (v \vee q) \tag{10.19}$$
represent valid logical formulas, as they are derived by using the operators in (10.18).
However, according to this definition, the formulas
$$(v \vee q)qq, \quad (v\;q) \tag{10.20}$$
do not represent valid logical formulas.
Theorem 10.4.1 (Commutativity). For logical statements $S_1$ and $S_2$, it holds that
$$S_1 \wedge S_2 \Longleftrightarrow S_2 \wedge S_1, \tag{10.21}$$
$$S_1 \vee S_2 \Longleftrightarrow S_2 \vee S_1. \tag{10.22}$$
Theorem 10.4.1 says that the logical arguments can be switched for the logical
operators “and” and “or”. Theorem 10.4.2 says that we may successively shift the brack-
ets to the right. Similarly, when expanding expressions over the reals, for instance
𝑥(𝑥 + 1) = 𝑥² + 𝑥, Theorem 10.4.3 gives a rule for expanding logical expressions.
The rules of de Morgan given by Theorem 10.4.4 state that a negation applied
to the single expressions flips the logical operator. Note that these rules can be
formulated for sets accordingly.
The resulting statements (or forms) are called normal forms, and important
examples thereof are the disjunctive normal form and conjunctive normal form of
logical expressions, see [98].
$$S = S_1 \vee S_2 \vee \cdots \vee S_k, \tag{10.31}$$
where
$$S_i = S_i^1 \wedge S_i^2 \wedge \cdots \wedge S_i^{n_i}. \tag{10.32}$$
The terms $S_i^j$ are literals, i. e., logical variables or the negation thereof.
Two examples of logical formulas given in disjunctive normal form are
$$(v \wedge q) \vee (\neg v \wedge q) \tag{10.33}$$
or
$$v \vee (v \wedge q). \tag{10.34}$$
Here, we denote the literals by using the notations $v$ and $q$ for logical variables.
$$S = S_1 \wedge S_2 \wedge \cdots \wedge S_k, \tag{10.35}$$
where
$$S_i = S_i^1 \vee S_i^2 \vee \cdots \vee S_i^{n_i}. \tag{10.36}$$
Two examples of logical formulas given in conjunctive normal form are
$$(v \vee q) \wedge (\neg v \vee q) \tag{10.37}$$
or
$$v \wedge (v \vee q). \tag{10.38}$$
In practice, the application of Boolean functions [98] has been important for the
development of electronic chips for computers, mobile phones, etc. A logic gate [98]
represents an electronic component that realizes (computes) a Boolean function
𝑓 (𝑣1 , . . . , 𝑣𝑛 ) ∈ {0, 1}, where the 𝑣𝑖 are logical variables. These logic gates use the
logical operators ∧, ∨, ¬ and transform input signals into output signals. Figure 10.3
shows the elementary logic gates and their corresponding truth tables.
We see in Figure 10.3 that the OR-gate is based on the functionality of the
operator ∨. That means, the output signal of the OR-gate equals 1 as soon as one
of its input signals is 1.
The output signal of the AND-gate equals 1 if and only if all input signals
equal 1. As soon as one input signal equals 0, the value of the Boolean function
computed by this gate is 0.
The NOT-gate computes the logical negation of the input signal. If the input
signal is 1, the NOT-gate gives 0, and vice versa.
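In R, the elementary gates correspond to the operators ! (NOT), & (AND), and | (OR); a minimal sketch reproducing the truth tables in Figure 10.3:

v <- c(FALSE, FALSE, TRUE, TRUE)
q <- c(FALSE, TRUE, FALSE, TRUE)
data.frame(v, q, AND=v & q, OR=v | q, NOT.v=!v)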
Figure 10.3: Elementary logic gates of Boolean functions and their corresponding truth table.
The top symbol corresponds to the IEC, and the bottom to the US standard symbols.
10.5 Sum, product, and Binomial coefficients

Sum
The sum, $\sum$, for the numbers $a_i$ involving the integer indices $i_l, i_l + 1, \ldots, i_u \in \mathbb{N}$ is
defined by
$$\sum_{i=i_l}^{i_u} a_i = a_{i_l} + a_{i_l+1} + \cdots + a_{i_u}. \tag{10.39}$$
Here, the index “l” indicates “lower”, whereas “u” indicates “upper”; they are
used to denote the start and end of the sequence of indices. For $i_l = 1$ and $i_u = n$,
we obtain the sum over all elements of $A$, i. e., $\sum_{i=1}^{n} a_i = a_1 + \cdots + a_n$. Alternatively, the
sum can also be written using a different notation for the index of the sum symbol,
namely
$$\sum_{i \in \{i_l, i_l+1, \ldots, i_u\}} a_i = a_{i_l} + a_{i_l+1} + \cdots + a_{i_u}. \tag{10.40}$$
The latter form is more convenient when the summation concerns a subset of the
indices. For instance, suppose that $I = \{2, 4, 5\}$ is a subset containing the desired
indices for the summation; then
$$\sum_{i \in I} a_i = \sum_{i \in \{2,4,5\}} a_i = a_2 + a_4 + a_5. \tag{10.41}$$
Product
Similar to the sum, the product, $\prod$, for the numbers $a_i$ involving the integer indices
$i_l, i_l + 1, \ldots, i_u \in \mathbb{N}$ is defined as follows:
$$\prod_{i=i_l}^{i_u} a_i = a_{i_l} \cdot a_{i_l+1} \cdot \cdots \cdot a_{i_u}; \tag{10.42}$$
$$\prod_{i \in \{i_l, i_l+1, \ldots, i_u\}} a_i = a_{i_l} \cdot a_{i_l+1} \cdot \cdots \cdot a_{i_u}. \tag{10.43}$$
Remark 10.5.1. In the above discussions of the sum and product, we assumed integer
indices for the identification of the numbers $a_i$, i. e., $i \in \mathbb{N}$. However, we would
like to remark that, in principle, this can be generalized to arbitrary “labels”. For
instance, for the set $A = \{a_\triangle, a_\circ, a_\otimes\}$, we can define the sum and product over its
elements as
$$\sum_{i \in \{\triangle, \circ, \otimes\}} a_i = a_\triangle + a_\circ + a_\otimes; \tag{10.44}$$
$$\prod_{i \in \{\triangle, \circ, \otimes\}} a_i = a_\triangle \cdot a_\circ \cdot a_\otimes. \tag{10.45}$$
Hence, from a mathematical point of view, the nature of the indices is flexible.
However, whenever we implement a sum or a product with a programming language,
integer values for the indices are advantageous, because, e. g., the indexing of vectors
or matrices is accomplished via integer indices.
In R, the most flexible way to realize sums and products is via loops. However,
if one just wants a sum or a product over all elements in a vector 𝐴, from 𝑖𝑙 = 1 to
𝑖𝑢 = 𝑁 , one can use the following commands:
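A minimal sketch for a numeric vector A:

A <- c(2, 4, 5)
sum(A)    # sum over all elements: 11
prod(A)   # product over all elements: 40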
Binomial coefficients
For all natural numbers 𝑘, 𝑛 ∈ N with 0 ≤ 𝑘 ≤ 𝑛, the binomial coefficient, denoted
𝐶(𝑛, 𝑘), is defined by
$$C(n, k) = \binom{n}{k} = \frac{n!}{k!\,(n-k)!}. \tag{10.46}$$
Figure 10.4: Visualization of the meaning of the Binomial coefficient 𝐶(4, 2).
For the definition of a binomial coefficient, the factorial of a natural number, denoted
by the symbol “!”, is used. The factorial of a natural number $n$ is just the product
of the numbers from 1 to $n$, i. e.,
$$n! = \prod_{i=1}^{n} i = 1 \cdot 2 \cdot \cdots \cdot n. \tag{10.47}$$
The binomial coefficient has the combinatorial meaning that from 𝑛 objects, there
are 𝐶(𝑛, 𝑘) ways to select 𝑘 objects without considering the order in which the
objects have been selected. In Figure 10.4, we show an urn with 𝑛 = 4 objects. From
this urn, we can draw 𝑘 = 2 objects in 6 different ways.
Also, the factorial 𝑛! has a combinatorial meaning. It gives the number of different
arrangements of 𝑛 objects by considering the order. For instance, the objects
{1, 2, 3} can be arranged in 3! = 6 different ways:
(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1).
For instance, 𝐶(𝑛, 𝑘) = 𝐶(𝑛, 𝑛 − 𝑘) holds ∀𝑛 ∈ N and 0 ≤ 𝑘 ≤ 𝑛.
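In R, binomial coefficients and factorials are provided by the functions choose() and factorial(); for the examples above:

choose(4, 2)   # 6, the number of ways to draw 2 from 4 objects
factorial(3)   # 6, the number of arrangements of 3 objects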
Figure 10.5: Pascal’s triangle for Binomial coefficients. Visualized is the recurrence relation for
Binomial coefficients in equation (10.52).
The following recurrence relation for binomial coefficients is called Pascal’s rule:
$$\binom{n+1}{k+1} = \binom{n}{k} + \binom{n}{k+1}. \tag{10.52}$$
In Figure 10.5, we visualize the result of Pascal’s rule for 𝑛 ∈ {0, . . . , 6}. The resulting
object is called Pascal’s triangle.
If there is more than one element that is minimum or maximum, then the corresponding
sets $a^*_{\min}$ and $a^*_{\max}$ contain more than one element.
Logical statements
A logical statement may be defined verbally or mathematically, and take one of the
values true or false. For simplicity, we define the Boolean value 1 for true, and 0 for
false. One can show that the set {true, false} is isomorphic to the set {0, 1}.
The Boolean value of the statement “The next autumn comes for sure” equals
1 and, hence, the statement is true. From a probabilistic point of view, this event is
certain and its probability equals one. Therefore, we may conclude that this state-
ment does not contain any information, see also [169]. The following inequalities and
equations
$$i = -5, \tag{10.57}$$
$$100 = 50 + 20 + 30, \tag{10.58}$$
$$-1 \geq 5, \tag{10.59}$$
$$1 < 2, \tag{10.60}$$
$$\sum_{j=1}^{n} j = \frac{n(n+1)}{2}, \quad n \in \mathbb{N}, \tag{10.61}$$
are mathematical statements, which are true or false. The first equation is false, as
$i = \sqrt{-1}$, where $i$ is the imaginary unit of a complex number $z = a + ib$. The second
equation is obviously true, as 50 + 20 + 30 equals 100. The third statement is false,
since a negative number cannot be greater than or equal to a positive number. The
fourth statement represents an inequality too, and is true. Strictly speaking, the fifth
equation is a statement form (Sf) over the natural numbers, as it contains the
variable $n \in \mathbb{N}$.
In general, statement forms contain variables and are true or false. In the case of
equation (10.61), we can write $\langle\mathrm{Sf}(n)\rangle = \langle \sum_{j=1}^{n} j = \frac{n(n+1)}{2} \rangle$. This statement form is
true for all $n \in \mathbb{N}$ and can be proven by induction. Another example of a statement
form is
$$\langle\mathrm{Sf}(x)\rangle = \langle x + 5 = 15 \rangle. \tag{10.62}$$
Generally, we can see that the statement changes if the variable of the statement
form (Sf) changes. Once we define statements (or statement forms), they can be
combined by using logical operations. We demonstrate these operations by first
assuming that 𝑆1 and 𝑆2 are logical statements. The statement 𝑆1 ∧ 𝑆2 means that
𝑆1 and 𝑆2 hold. This statement may have the value true or false, see Figure 10.3.
For instance, 𝑆1 := 2 + 2 = 4 ∧ 𝑆2 := 3 + 3 = 6 is true, but 𝑆1 := 2 + 2 =
4 ∧ 𝑆3 := 3 + 3 = 9 is false. Similarly, 𝑆1 ∨ 𝑆2 means that 𝑆1 or 𝑆2 holds. Here,
𝑆1 := 2 + 2 = 4 ∨ 𝑆2 = 3 + 3 = 6 is true, but 𝑆1 := 2 + 2 = 4 ∨ 𝑆3 := 3 + 3 = 9 is
true as well. The logical negation of the statement 𝑆 is usually denoted by ¬𝑆. The
well-known triangle equation,
This means
is generally false.
Statement ⇒
The logical implication 𝑆1 ⟹ 𝑆2 means that 𝑆1 implies 𝑆2 . Verbally, one can say
𝑆1 “logically implies” 𝑆2 , or: if 𝑆1 holds, then 𝑆2 follows.
Statement ⇔
The statement 𝑆1 ⇐⇒ 𝑆2 is stronger, because 𝑆1 holds if and only if 𝑆2 holds.
For the above statements, it is important to note that to go from the left state-
ment to the right one, or vice versa, one needs to apply logical operators (¬, ∧, ∨)
or algebraic operations (+, −, /, etc.). For instance, by assuming the true statement
$n^2 \geq 2n$, $n > 1$, we obtain the implications
$$n^2 \geq 2n \implies n^2 - 2n \geq 0 \implies n^2 - 2n + 1 = (n-1)^2 \geq 0. \tag{10.66}$$
Finally, we want to remark that a false statement may imply a true statement; $i^2 = 1$
(false, as $i^2 = -1$) implies $0 \cdot i^2 = 0 \cdot 1$ (true).
Definition 10.7.1. Let 𝑎, 𝑏 ∈ R. The sum of these two real numbers is defined by
$$\operatorname{sum}(a, b) := a + b. \tag{10.67}$$
Definition 10.7.1 defines the sum of two real numbers based on the trivial definition
of the symbol “+”.
Theorem 10.7.1. Let
$$f_L(x) := ax + b, \quad a \neq 0, \tag{10.68}$$
be a linear function. Then, the solution of the equation
$$f_L(x) = 0 \tag{10.69}$$
is given by $x = -\frac{b}{a}$.
The proof of Theorem 10.7.1 is very simple, as $f_L(x) := ax + b = 0$ leads directly
to $x = -\frac{b}{a}$ by performing elementary calculations. Specifically, the first elementary
calculation is subtracting $b$ from both sides of $ax + b = 0$. Second, we divide the
resulting equation by $a$ and obtain the result.
Another example is the famous binomial theorem.
10.8 Summary
In general, the mathematical language is meant to help with the precise formulation
of problems. If one is new to the field, such formulations can be intimidating at first,
and verbal formulations may appear as sufficient. However, with a bit of practice one
realizes quickly that this is not the case, and one starts to appreciate and to benefit
from the power of mathematical symbols. Importantly, the mathematical language
has a profound implication on the general mathematical thinking capabilities, which
translate directly to analytical problem-solving strategies. The latter skills are key
for working successfully on data science projects, e. g., in business analytics, because
the process of analyzing data requires a full comprehension of all involved aspects,
and the often abstract relationships.
11 Computability and complexity
This chapter provides a theoretical underpinning for the programming in R that we
introduced in the first two parts of this book. Specifically, we introduced R practi-
cally by discussing various commands for computing solutions to certain problems.
However, computability can be defined mathematically in a generic way that is in-
dependent of a programming language. This paves the way for determining the
complexity of algorithms. Furthermore, we provide a mathematical definition of a
Turing machine, which is a mathematical model for an electronic computer. To place
this in its wider context, this chapter also provides a brief overview of several major
milestones in the history of computer science.
11.1 Introduction
Nowadays, the use of information technologies and the application of computers
are ubiquitous. Almost everyone uses computer applications to store, retrieve, and
process data from various sources. A simple example is a relational database system
for querying financial data from stock markets, or finding companies’ telephone
numbers. More advanced examples include programs that facilitate risk management
in life insurance companies or the identification of chemical molecules that share
similar structural properties in pharmaceutical databases [170, 54].
The foundation of computer science is based on theoretical computer science
[163, 164]. Theoretical computer science is a relatively young discipline that, put
simply, deals with the development and analysis of abstract models for information
processing. Core topics in theoretical computer science include formal language the-
ory and compilers [121, 160], computability [22], complexity [37], and semantics of
programming languages [126, 167, 122] (see also Section 2.8). More recent topics
include the analysis of algorithms [37], the theory of information and communica-
tion [40], and database theory [124]. In particular, the mathematical foundations of
theoretical computer science have influenced modern applications tremendously. For
example, results from formal language theory [160] have influenced the construction
of modern compilers [121]. Formal languages have been used for the analysis of au-
tomata. The automata model of a Turing machine has been used to formalize the
term algorithm, which plays a central role in computer science. When dealing with
algorithms, an important question is whether they are computable (see Section 2.2).
Another crucial issue relates to the analysis of algorithms’ complexity, which pro-
vides upper and lower bounds on their time complexity (see Section 11.5.1). Both
topics will be addressed in this chapter.
Given an alphabet Σ, one can print only one character 𝑐 ∈ Γ in each field. A special
character (e. g., $) is used to fill the empty fields (blank symbol).
The transition function 𝛿 is crucial for the control unit, and encodes the program of
the Turing machine (see Figure 11.1). The Turing table conveys information about
the current and subsequent states of the machine after it reads a character 𝑐 ∈ Γ.
This initiates certain actions of the read/write head, namely
– 𝑙: moving the head exactly one field to the left.
– 𝑟: moving the head exactly one field to the right.
– 𝑥: overwriting the content of a field with 𝑥 ∈ Γ ∪ {$} without moving the head.
11.4 Computability
We now turn to a fundamental problem in theoretical computer science: the deter-
mination as to whether or not a function is computable [164]. This problem can be
discussed intuitively as well as mathematically. We begin with the intuitive discus-
sion, and then provide its mathematical formulation. It is generally accepted that
function 𝑓 : N −→ N is computable if an algorithm to compute 𝑓 exists. There-
fore, assuming an arbitrary 𝑛 ∈ N as input, the algorithm should stop after a finite
number of computation steps with output 𝑓 (𝑛). When discussing this simple model,
we did not take into account any considerations regarding a particular processor
or memory. Evidently, however, it is necessary to specify such steps to implement
an algorithm. In practical terms, this is complex, and can only be accomplished
by a general mathematical definition to decide whether a function 𝑓 : N −→ N is
computable.
A related problem is whether any arbitrary problem can be solved using an
algorithm, and, if not, whether the algorithm can identify the problem as noncom-
putable. This is known as the decision problem formulated by Hilbert, which turned
out to be invalid [36]. A counter-example is Gödel’s well-known incompleteness the-
orem [36]. Put simply, it states that no algorithm exists that can verify whether an
arbitrary statement over N is true or false. To explore Gödel’s statement in depth,
We wish to note that a similar definition can be given for functions defined on words
(e. g., 𝑓 : Σ⋆ −→ Σ⋆ , see [36, 164]). Examples of computable functions include the
following:
– The functions 𝑓1 : N² −→ N, 𝑓1 := 𝑎 · 𝑏, and 𝑓2 : N² −→ N, 𝑓2 := 𝑎 + 𝑏.
– The (successor) function 𝑓 : N −→ N, 𝑓 (𝑛) := 𝑛 + 1.
– The recursive function sum : N −→ N defined by sum(𝑛) := 𝑛 + sum(𝑛 − 1),
sum(0) := 0.
Iterative algorithms have been used to solve problems efficiently. The imperative
implementation (see Section 2.2) of the shortest path problem, proposed by Dijkstra
[58], is a standard case study in computer science, which is frequently used to illus-
trate iterative algorithms. Examples of typical algorithms in mathematics include
the GCD-algorithm proposed by Euclid [122] and the Gaussian elimination method
for solving linear equation systems [27].
It seems plausible that many algorithms exist to address a particular problem.
For example, the square of a real number can be computed using either a func-
tional or an imperative algorithm (see Sections 2.2 and 2.3). However, this raises the
question as to what type of algorithm is most suited to solving a given problem.
Listing important properties/questions in the context of algorithm design offers
insight into the complexity of the latter problem. Such properties and questions
include
– What level of effort is required to implement a particular algorithm?
– How can the algorithm be simplified as far as possible?
– How much memory is required?
– What is the time complexity (i. e., execution time) of an algorithm?
– Is the algorithm correct?
– Does the algorithm terminate?
11.5.1 Bounds
Let 𝑛 be the input size of an algorithm (i. e., the number of data elements to be
processed). The time complexity of an algorithm is determined by the maximal
number of steps (e. g., value assignments, arithmetic operations, memory allocations,
etc.) required, in relation to the input size, to obtain a specific result.
In the following, we describe how to measure the time complexity of an algorithm
asymptotically, and describe several forms thereof. First, we state an upper bound
for the time complexity that will be attained in the worst case (𝑂-notation). To
begin, we provide a definition of real polynomials, as they play a crucial role in the
asymptotic measurement of algorithms’ time complexity.
By definition, the input variable 𝑥 and the value of the function 𝑓 (𝑥) are real num-
bers. In terms of the coefficients 𝑐𝑘 , a polynomial is called real if its coefficients are
real; similarly, complex polynomials possess complex-valued coefficients.
To define an asymptotic upper bound for the time complexity of an algorithm,
the 𝑂-notation is required.
Definition 11.5.2. Let $f, g : \mathbb{N} \longrightarrow \mathbb{R}^+$ be functions. Then,
$$O(g(n)) = \{\, f(n) \mid \exists c > 0,\ \exists n_0 > 0,\ \forall n \geq n_0 : f(n) \leq c \cdot g(n) \,\}. \tag{11.9}$$
Definition 11.5.2 means that 𝑔(𝑛) is an asymptotic upper bound of 𝑓 (𝑛) if there exist a
constant 𝑐 > 0 and a natural number 𝑛0 such that 𝑓 (𝑛) is less than or equal to 𝑐 · 𝑔(𝑛)
for all 𝑛 ≥ 𝑛0 .
In contrast to the worst case, described by the 𝑂-notation, we now define an
asymptotic lower bound that describes the “least” complexity. This is provided by
the Ω-notation.
$$\Omega(g(n)) = \{\, f(n) \mid \exists c > 0,\ \exists n_0 > 0,\ \forall n \geq n_0 : c \cdot g(n) \leq f(n) \,\}. \tag{11.10}$$
11.5.2 Examples
In this section, some examples are given to illustrate the definitions of the asymptotic
bounds. In practice, the 𝑂-notation is the most important and widely used. Hence,
the following examples will focus on it.
$$n^2 + 3n \leq c \cdot n^2. \tag{11.13}$$
$$f(n) := c_5 n^5 + c_4 n^4 + c_3 n^3 + c_2 n^2 + c_1 n + c_0. \tag{11.14}$$
and obtain
$$f(n) \leq |a_k| n^k + |a_{k-1}| n^{k-1} + \cdots + |a_0| \leq \big(|a_k| + |a_{k-1}| + \cdots + |a_0|\big)\, n^k, \quad n \geq 1. \tag{11.16}$$
Inequality (11.16) has been obtained using the triangle inequality [178]. By setting
$c := |a_k| + |a_{k-1}| + |a_{k-2}| + \cdots + |a_0|$, inequality (11.16) is satisfied for $n \geq 1$. That
means $f(n) \leq c n^j$ for $j \geq k$ and $n \in \mathbb{N}$. Finally, we obtain $f(n) \in O(n^j)$ for $j \geq k$.
In the final example, we use a simple imperative program (see Section 2.2) to
calculate the sum of the first 𝑛 natural numbers (sum = 1 + 2 + · · · + 𝑛). Basically, the
pseudocode of this program consists of the initialization step, sum = 0, and a for-loop
with variable 𝑖 and body sum = sum + 𝑖 for 1 ≤ 𝑖 ≤ 𝑛. The first value assignment
requires constant costs, say, 𝑐1 . In each step of the for-loop to increment the value of
the variable sum, constant costs 𝑐2 are required. Then, we obtain the upper bound
$$f(n) = c_1 + c_2 \cdot n, \tag{11.17}$$
i. e., $f(n) \in O(n)$.
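A minimal R sketch of this program (the variable names are illustrative):

n <- 10
s <- 0               # initialization: constant cost c1
for (i in 1:n) {     # n iterations, each with constant cost c2
    s <- s + i
}
s                    # 55; overall cost f(n) = c1 + c2*n, i.e., O(n)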
In view of the importance of the 𝑂-notation for practical use, several of its properties
are listed below:
$$c = O(1), \tag{11.18}$$
$$c \cdot O(f(n)) = O(f(n)), \tag{11.19}$$
$$O(f(n)) + O(f(n)) = O(f(n)), \tag{11.20}$$
$$O(\log_b(n)) = O(\log(n)), \tag{11.21}$$
$$O(f(n) + g(n)) = O(\max\{f(n), g(n)\}), \tag{11.22}$$
$$O(f(n)) \cdot O(g(n)) = O(f(n) \cdot g(n)). \tag{11.23}$$
An algorithm with a constant number of steps has time complexity 𝑂(1) (see equa-
tion (11.18)). The second rule given by equation (11.19) means that constant factors
can be neglected. If we execute a program with time complexity 𝑂(𝑓 (𝑛)) sequentially,
the final program will have the same complexity (see equation (11.20)). According to
equation (11.21), the logarithmic complexity does not depend on the base 𝑏. More-
over, the sequential execution of two programs with different time complexities has
the complexity of the program with higher time complexity (see equation (11.22)).
Finally, the overall complexity of a nested program (for example, two nested loops)
is the product of the individual complexities (see equation (11.23)).
Algorithms with complexity 𝑂(1) are highly desirable in practice. Logarithmic and
linear time complexity are also favorable for practical applications as long as log(𝑛) <
𝑛, 𝑛 > 1. Quadratic and cubic time complexity remain sufficient when 𝑛 is relatively
small. Algorithms with complexity $O(2^n)$ can only be used under certain constraints,
since $2^n$ grows significantly faster than $n^k$. Such algorithms could possibly be used
when searching for graph isomorphisms or cycles in graphs with bounded vertex
degrees, for example (see [130]).
11.6 Summary
At this juncture, it is worth reiterating that, despite the apparent novelty of the
term data science, the fields on which it is based have long histories, among them
theoretical computer science [61]. The purpose of this chapter has been to show
that computability, complexity, and the computer, in the form of a Turing machine,
are mathematically defined. This aspect can easily be overlooked in these terms’
practical usage.
The salient point is that data scientists should recognize that all these concepts
possess mathematical definitions which are neither heuristic nor ad-hoc. As such,
they may be revisited if necessary (e. g., to analyze an algorithm’s runtime). Our
second point is that not every detail about these entities must be known. Given
the intellectual complexity of these topics, this is encouraging, because acquiring
an in-depth understanding of these is a long-term endeavor. However, even a basic
understanding is valuable and helps in improving practical programming and data
analysis skills.
12 Linear algebra
One of the most important and widely used subjects of mathematics is linear algebra
[27]. For this reason, we begin this part of the book with this topic. Furthermore,
linear algebra plays a pivotal role for the mathematical basics of data science.
This chapter opens with a brief introduction to some basic elements of lin-
ear algebra, e. g., vectors and matrices, before discussing advanced operations,
transformations, and matrix decompositions, including Cholesky factorization, QR
factorization, and singular value decomposition [27].
12.1 Vectors and matrices

12.1.1 Vectors
Vectors define quantities that require both a magnitude, i. e., a length, and a
direction to be fully characterized. Examples of vectors in physics are velocity or
force. Hence, a vector extends a scalar, which defines a quantity fully described by
its magnitude alone. From an algebraic point of view, a vector, in an 𝑛-dimensional
real space, is defined by an ordered list of 𝑛 real scalars, 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 , arranged in
an array. A vector is called a row vector when its array is arranged horizontally, i. e.,
$$(x_1, x_2, \ldots, x_n),$$
whereas a vector is called a column vector when its array is arranged vertically,
i. e.,
$$\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}.$$
Definition 12.1.2. Let $\vec{V} = (x_1, x_2, \ldots, x_n)$ be an $n$-dimensional real vector. Then,
the p-norm of $\vec{V}$, denoted $\|\vec{V}\|_p$, is defined by the following quantity:
$$\|\vec{V}\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}. \tag{12.1}$$
In particular:
1. the 1-norm of the vector $\vec{V}$ is defined by
$$\|\vec{V}\|_1 = \sum_{i=1}^{n} |x_i|;$$
2. the 2-norm of the vector $\vec{V}$ is defined by
$$\|\vec{V}\|_2 = \left( \sum_{i=1}^{n} |x_i|^2 \right)^{1/2}.$$
$$d : E^n \times E^n \longrightarrow \mathbb{R}.$$
Such a function, $d$, is called a metric, and the pair $(E^n, d)$ is called a metric space.
When $E^n = \mathbb{R}^n$ and $d(x, y) = \left( \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2}$, the pair $(E^n, d) = (\mathbb{R}^n, d)$ is
called an $n$-dimensional Euclidean space.
From Definition 12.1.3, it is clear that R, the set of real numbers (or the real line),
is a 1-dimensional Euclidean space, whereas R2 , the real plane, is a 2-dimensional
Euclidean space.
Let $A = \begin{pmatrix} x_A \\ y_A \end{pmatrix}$ and $B = \begin{pmatrix} x_B \\ y_B \end{pmatrix}$ be two points in a 2-dimensional Euclidean space.
Then, the vector $\overrightarrow{AB} = \vec{V}$ is the displacement from the point $A$ to the point $B$,
which can be specified in a cartesian coordinate system by
$$\overrightarrow{AB} = \vec{V} = \begin{pmatrix} x_B - x_A \\ y_B - y_A \end{pmatrix}.$$

Figure 12.1: (a) Vector representation in a two-dimensional space. (b) Decomposition of a standard
vector in a 2-dimensional space.
Definition 12.1.4. The magnitude of a vector $\overrightarrow{AB}$ is defined by the non-negative
scalar given by its Euclidean norm, denoted $\|\overrightarrow{AB}\|_2$ or simply $\|\overrightarrow{AB}\|$.
Specifically, the magnitude of a 2-dimensional vector $\overrightarrow{AB}$ is given by
$$\|\overrightarrow{AB}\| = \sqrt{(x_B - x_A)^2 + (y_B - y_A)^2}.$$
# Defining a 3-dimensional row vector
V <- matrix(c(3, 2, -5), nrow=1)
V
     [,1] [,2] [,3]
[1,]    3    2   -5
# Defining a 3-dimensional column vector from a given list of 3 numbers
listnb <- c(12, 1, -3)
W <- matrix(listnb, ncol=1)
W
     [,1]
[1,]   12
[2,]    1
[3,]   -3
# Norm of the vector V
NormV <- sqrt(sum(V^2))
NormV
[1] 6.164414
# Norm of the vector W
NormW <- sqrt(sum(W^2))
NormW
[1] 12.40967
Definition 12.1.5. Two $n$-dimensional vectors $\vec{V}$ and $\vec{W}$ are said to be parallel if
they have the same direction.

Definition 12.1.6. Two $n$-dimensional vectors $\vec{V}$ and $\vec{W}$ are said to be equal if they
have the same direction and the same magnitude.
Example 12.1.1. For supervised learning, 𝑘-NN (𝑘 nearest neighbors) [96] is a simple
yet efficient way to classify data. Suppose that we have a high-dimensional data set
with two classes, whose data points represent vectors. Let 𝑥 be a point that we wish
to assign to one of these two classes. To predict the class label of a point 𝑥, we
calculate the Euclidean distance, introduced above (see Remark 12.1.1), between 𝑥
and all other points $x_i$, i.e., $d_i = \|x - x_i\|$. Then, we order these distances $d_i$ in increasing order. The $k$-NN classifier now uses the nearest $k$ distances to obtain
a majority vote for the prediction of the label for the point 𝑥. For instance, in
Figure 12.2, a two-dimensional example is shown for 𝑘 = 4. Among the four nearest
neighbors of 𝑥 are three red points and one blue point. This means the predicted
class label of 𝑥 would be “red”. In the extreme case 𝑘 = 1, the point 𝑥 would be
assigned to the class with the single nearest neighbor.
The 𝑘-NN method is an example of an instance-based learning algorithm. There
are many variations of the 𝑘-NN approach presented here, e. g., considering weighted
voting to overcome the limitations of majority voting in case of ties.
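As a complement to this example, the following minimal R sketch implements the distance computation and the majority vote described above; the data points, class labels, and query point are illustrative values, not taken from the book.

# A minimal k-NN sketch with illustrative 2-dimensional data
X <- rbind(c(1, 1), c(1.2, 0.8), c(0.9, 1.1), c(3, 3), c(3.1, 2.9))
labels <- c("red", "red", "red", "blue", "blue")
x <- c(1.5, 1.4)   # query point to be classified
k <- 3
# Euclidean distances d_i = ||x - x_i|| between x and all points x_i
d <- sqrt(rowSums((X - matrix(x, nrow(X), 2, byrow=TRUE))^2))
# Majority vote among the k nearest neighbors
nn <- order(d)[1:k]
names(which.max(table(labels[nn])))   # predicted label: "red"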
In a two-dimensional space, the vector $\overrightarrow{AB}$ and its translation $\overrightarrow{A'B'}$ form opposite sides of a parallelogram, as illustrated in Figure 12.3 (a). Thus,
$$A' = \begin{pmatrix} x_{A'} \\ y_{A'} \end{pmatrix} = \begin{pmatrix} x_A + k_x \\ y_A + k_y \end{pmatrix} \quad\text{and}\quad B' = \begin{pmatrix} x_{B'} \\ y_{B'} \end{pmatrix} = \begin{pmatrix} x_B + k_x \\ y_B + k_y \end{pmatrix}.$$
Figure 12.3: Vector transformation in a 2-dimensional space: (a) Translation of a vector. (b) Ro-
tation of a vector.
Various operations can be carried out on vectors, including the product of a vector
by a scalar, the sum, the difference, the scalar or dot product, the cross prod-
uct, and the mixed product. In the following sections, we will discuss such opera-
tions.
Figure 12.5: Vector operations in a two-dimensional space: (a) Sum of two vectors. (b) Differ-
ence between two vectors.
Definition 12.1.9. For any scalar $k$, the vectors $\vec{V}$ and $\vec{U} = k \times \vec{V}$ are said to be collinear.
In a two-dimensional space, the vector sum $\vec{S}$ can be obtained geometrically, as illustrated in Figure 12.5 (a); i.e., we translate the vector $\vec{W}$ until its initial point coincides with the terminal point of $\vec{V}$. Since translation does not change a vector, the translated vector is identical to $\vec{W}$. Then, the vector $\vec{S}$ is given by the displacement from the initial point of $\vec{V}$ to the terminal point of the translation of $\vec{W}$. Note that the sum of vectors is commutative, i.e.,
$$\vec{V} + \vec{W} = \vec{W} + \vec{V}.$$
This means that if $\vec{V}$ had been translated instead of $\vec{W}$, the result would be the same sum vector $\vec{S}$. This is illustrated in Figure 12.5 (a).
In this case, the vector components of $\vec{V}$ are given by
$$\vec{V}_{\parallel} = \frac{\vec{V}\cdot\vec{W}}{\vec{W}\cdot\vec{W}}\,\vec{W}, \qquad \vec{V}_{\perp} = \vec{V} - \vec{V}_{\parallel}.$$
If $\alpha$ denotes the angle between the vector $\vec{V}$ and the $x$-axis, as illustrated in Figure 12.1 (b), then we have the following relationships:
$$x_A = \cos(\alpha) \times \|\vec{V}\|, \qquad y_A = \sin(\alpha) \times \|\vec{V}\|, \qquad \frac{y_A}{x_A} = \tan(\alpha). \tag{12.3}$$
In this case, the projection of $\vec{V}$ onto the direction of $\vec{W}$ is said to be orthogonal.
Geometrically, the dot product can be defined through the orthogonal projection of a vector onto another. Let $\alpha$ be the angle between two vectors $\vec{V}$ and $\vec{W}$. Then,
$$\vec{V}\cdot\vec{W} = \cos(\alpha) \times \|\vec{V}\| \times \|\vec{W}\|, \quad\text{with}\quad \cos(\alpha) = \frac{\vec{V}\cdot\vec{W}}{\|\vec{V}\| \times \|\vec{W}\|}.$$
In Figure 12.4 (a), the norm of the projected vector $\vec{P}$ can be interpreted in terms of the dot product between the vectors $\vec{V}$ and $\vec{W}$.
Definition 12.1.11. When the angle $\alpha$ between two vectors, $\vec{V}$ and $\vec{W}$, is $\frac{\pi}{2} + k\pi$, where $k$ is an integer, then the two vectors are said to be perpendicular or orthogonal to each other, and their dot product is given by
$$\cos\left(\frac{\pi}{2} + k\pi\right) \times \|\vec{V}\| \times \|\vec{W}\| = 0 \times \|\vec{V}\| \times \|\vec{W}\| = 0.$$
4. For any two vectors $\vec{V}$ and $\vec{W}$ and a scalar $k$: $k \times (\vec{V}\cdot\vec{W}) = (k \times \vec{V})\cdot\vec{W}$;
5. For any three vectors $\vec{U}$, $\vec{V}$, and $\vec{W}$: $(\vec{U} + \vec{V})\cdot\vec{W} = (\vec{U}\cdot\vec{W}) + (\vec{V}\cdot\vec{W})$.
Then, the cross product of the vector $\vec{V}$ by the vector $\vec{W}$, denoted $\vec{V} \times \vec{W}$, is a vector $\vec{C}$ perpendicular to both $\vec{V}$ and $\vec{W}$, defined by
$$\vec{C} = \begin{pmatrix} x_C \\ y_C \\ z_C \end{pmatrix} = \begin{pmatrix} y_A \times z_B - y_B \times z_A \\ -x_A \times z_B + x_B \times z_A \\ x_A \times y_B - x_B \times y_A \end{pmatrix},$$
or
$$\vec{C} = \|\vec{V}\| \times \|\vec{W}\| \times \sin(\theta) \times \vec{u},$$
where $\vec{u}$ is the unit vector¹ normal to both $\vec{V}$ and $\vec{W}$, and $\theta$ is the angle between $\vec{V}$ and $\vec{W}$. Thus,
$$\|\vec{C}\| = \|\vec{V}\| \times \|\vec{W}\| \times \sin(\theta) = \mathcal{A},$$
where $\mathcal{A}$ denotes the area of the parallelogram spanned by $\vec{V}$ and $\vec{W}$, as illustrated in Figure 12.7.
The cross product has the following properties:
1. For any vector $\vec{V}$, we have: $\vec{V} \times \vec{V} = \vec{0}$;
2. For any two vectors $\vec{V}$ and $\vec{W}$, we have: $\vec{V} \times \vec{W} = -(\vec{W} \times \vec{V})$;
3. For any two vectors $\vec{V}$ and $\vec{W}$ and a scalar $k$, we have: $(k \times \vec{V}) \times \vec{W} = \vec{V} \times (k \times \vec{W}) = k \times (\vec{V} \times \vec{W})$;
4. $\vec{V} \times (\vec{U} + \vec{W}) = \vec{V} \times \vec{U} + \vec{V} \times \vec{W}$;
5. $\vec{V} \times (\vec{U} \times \vec{W}) \neq (\vec{V} \times \vec{U}) \times \vec{W}$.
1 Note that the direction of the vector $\vec{u}$ for the cross product $\vec{V} \times \vec{W}$ is determined by the right-hand rule, i.e., it is given by the direction of the right-hand thumb when the other four fingers are rotated from $\vec{V}$ to $\vec{W}$.
# Defining two 2-dimensional row vectors V and W
V <- matrix(c(2, -5), nrow=1)
W <- matrix(c(12, 1), nrow=1)
# Norm of the vector V
NormV <- sqrt(sum(V^2))
NormV
[1] 5.385165
# Product of a vector by a scalar k
k <- 5
U<-k*V
U
[ ,1] [ ,2]
[1,] 10 -25
#Sum of the vectors V and W: S=V+W
S<-V+W
S
[ ,1] [ ,2]
[1,] 14 -4
# Difference of the vectors V and W: D=V-W
D<-V-W
D
[ ,1] [ ,2]
[1,] -10 -6
#Dot product of V and W
p<- sum(V*W)
p
[1] 19
# Finding the angle theta between V and W
NormV<-sqrt(sum(V^2))
NormW<-sqrt(sum(W^2))
theta<-acos(p/(NormV*NormW)) # p is the dot product of V and W
theta
[1] 1.273431 # This value of theta is in radians
theta <- theta*180/pi
theta
[1] 72.96223 # This is the value of theta in degrees
# Orthogonal projection of V onto the direction of W
P<-(p/sum(W^2))*W # p is the dot product of V and W
P
[ ,1] [ ,2]
[1,] 1.572414 0.1310345
NormP <- sqrt(sum(P^2))
NormP
[1] 1.577864
p/NormW # Norm of the vector P using the dot product of V and W
[1] 1.577864
# Defining two 3- dimensional row vectors V and W
V <- matrix(c(3, 1, 0), nrow=1)
W <- matrix(c(2, 4, 0), nrow=1)
xC<-V[1, 2]*W[1,3] - W[1,2]*V[1,3]
yC<- -(V[1, 1]*W[1,3] - W[1,1]*V[1,3])
zC<-V[1, 1]*W[1,2] - W[1,1]*V[1,2]
C<-matrix(c(xC , yC , zC), nrow=1)
C
[ ,1] [ ,2] [ ,3]
[1,] 0 0 10
sum(C*V)
[1] 0 #C and V are orthogonal since their dot product is zero
sum(C*W)
[1] 0 #C and W are orthogonal since their dot product is zero
#Norm of the vector C
NormC<-sqrt(sum(C^2))
NormC
[1] 10
# Computing the area , A, of the parallelogram spanned by V and W
NormV<-sqrt(sum(V^2))
NormW<-sqrt(sum(W^2))
p<- sum(V*W)
theta<-acos(p/(NormV*NormW))
A<-NormV*NormW*sin(theta)
A
[1] 10 # A equals the norm of the vector C
The polar coordinates can be recovered from cartesian coordinates, and vice versa.
Let $\vec{V} = \overrightarrow{OA} = \begin{pmatrix} x_A \\ y_A \end{pmatrix}$ be a standard vector in a two-dimensional cartesian space, as depicted in Figure 12.8. Then, the polar coordinates of $\vec{V}$ can be obtained as follows:
$$r = \sqrt{x_A^2 + y_A^2}, \qquad \theta = \tan^{-1}\left(\frac{y_A}{x_A}\right). \tag{12.5}$$
Conversely, the cartesian coordinates can be recovered from the polar coordinates via
$$x_A = r\cos(\theta), \qquad y_A = r\sin(\theta). \tag{12.6}$$
In R, the above coordinate transformations can be carried out using the com-
mands in Listing 12.3.
# Defining a 2-dimensional row vector V
V <- matrix(c(2, 5), nrow=1)
# Computing the polar coordinates of V
rV <- sqrt(sum(V^2))
thetaV <- acos(V[1,1]/rV)*180/pi
rV
[1] 5.385165
thetaV
[1] 68.19859
# Recovering cartesian coordinates
xV<-rV*cos(thetaV*pi/180)
yV<-rV*sin(thetaV*pi/180)
xV
[1] 2
yV
[1] 5
In a three-dimensional space, a standard vector $\vec{V} = \overrightarrow{OA}$, where $O$ denotes the origin point, can be specified by one of the following:
1. The triplet $(x_A, y_A, z_A)$, where $x_A$, $y_A$ and $z_A$ denote the coordinates of the point $A$, the terminal point of $\vec{V}$, in a three-dimensional Euclidean space. The triplet $(x_A, y_A, z_A)$ defines the representation of the vector $\vec{V}$ in cartesian coordinates (see Figure 12.9 (a) for illustration).
2. The triplet $(\rho, \theta, z_A)$, where $\rho$ is the magnitude of the projection of $\vec{V}$ on the $x$–$y$ plane, $\theta$ is the angle between the projection of the vector $\vec{V}$ on the $x$–$y$ plane and the $x$-axis, and $z_A$ is the third coordinate of $A$ in a cartesian system. The triplet $(\rho, \theta, z_A)$ defines the representation of the vector $\vec{V}$ in cylindrical coordinates (see Figure 12.9 (b) for illustration).
3. The triplet $(r, \theta, \phi)$, where $r = \|\vec{V}\|$ is the magnitude of $\vec{V}$, $\theta$ is the angle between the projection of the vector $\vec{V}$ on the $x$–$y$ plane and the $x$-axis, and $\phi$ is the angle between the vector $\vec{V}$ and the $x$–$z$ plane. The triplet $(r, \theta, \phi)$ defines the representation of the vector $\vec{V}$ in spherical coordinates (see Figure 12.9 (c) for illustration).
1. The cylindrical coordinates of $\vec{V}$ can be obtained as follows:
$$\rho = \sqrt{x_A^2 + y_A^2}, \qquad \theta = \tan^{-1}\left(\frac{y_A}{x_A}\right), \qquad z_A = z_A. \tag{12.7}$$
Conversely,
$$x_A = \rho\cos(\theta), \qquad y_A = \rho\sin(\theta), \qquad z_A = z_A. \tag{12.8}$$
2. The spherical coordinates of $\vec{V}$ can be obtained as follows:
$$r = \sqrt{x_A^2 + y_A^2 + z_A^2}, \qquad \theta = \tan^{-1}\left(\frac{y_A}{x_A}\right), \qquad \phi = \cos^{-1}\left(\frac{z_A}{r}\right). \tag{12.9}$$
Conversely,
$$x_A = r\sin(\phi)\cos(\theta), \qquad y_A = r\sin(\phi)\sin(\theta), \qquad z_A = r\cos(\phi). \tag{12.10}$$
Relationships between cylindrical and spherical coordinates also exist. From cylindrical coordinates, spherical coordinates can be obtained as follows:
$$r = \sqrt{\rho^2 + z_A^2}, \qquad \theta = \theta, \qquad \phi = \tan^{-1}\left(\frac{\rho}{z_A}\right). \tag{12.11}$$
In R, the above coordinate system transformations can be carried out using the
scripts in Listing 12.4.
# Defining a 3-dimensional row vector W
W <- matrix(c(3, -2, 7), nrow=1)
# Cylindrical coordinates of W
rhoW <- sqrt(W[1,1]^2 + W[1,2]^2)
thetaW <- atan(W[1,2]/W[1,1])*180/pi
zW <- W[1,3]
rhoW
[1] 3.605551
thetaW
[1] -33.69007
zW
[1] 7
Example 12.1.2. Classification methods are used extensively in data science [41, 64].
An important classification technique for high-dimensional data is referred to as
support vector machine (SVM) classification [41, 64] (see also Section 18.5.2).
For high-dimensional data, the problem of interest is to classify the (labeled)
data by determining a separating hyperplane. When using linear classifiers, it is
necessary to construct a hyperplane for optimal separation of the data points. To
this end, it is necessary to determine the distance between a point representing a
vector and the hyperplane.
Let $H : \delta_1\cdot x + \delta_2\cdot y + \delta_3\cdot z - a = 0$ be a three-dimensional hyperplane, and let
$$\delta = \begin{pmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \end{pmatrix} \tag{12.13}$$
be its normal vector, whose norm is given by
$$\|\delta\| = \sqrt{(\delta_1)^2 + (\delta_2)^2 + (\delta_3)^2}.$$
Then, the distance between a point $(x, y, z)$ and the hyperplane $H$ is given by
$$d_{a,H} = \frac{\delta_1\cdot x + \delta_2\cdot y + \delta_3\cdot z - a}{\|\delta\|}. \tag{12.14}$$
subset of $\mathbb{C}$, since any real number can be viewed as a complex number for which the imaginary part is zero, i.e., $y = 0$. Any complex number $z = x_z + iy_z$ can be
represented by the pair of reals (𝑥𝑧 , 𝑦𝑧 ); thus, a complex number can be viewed as
a particular two-dimensional standard real vector. Let 𝜃𝑧 denote the angle between
the vector 𝑧 = (𝑥𝑧 , 𝑦𝑧 ) and the 𝑥-axis. Then, using the vector decomposition in a
two-dimensional space, we have
$$x_z = r_z\cos(\theta_z), \qquad y_z = r_z\sin(\theta_z), \qquad\text{where } r_z = \sqrt{x_z^2 + y_z^2}. \tag{12.15}$$
The number $r_z$ is called the modulus or the absolute value of $z$, whereas $\theta_z$ is called the argument of $z$.
From (12.15), we can deduce the following alternative description of a complex number $z = x_z + iy_z$:
$$z = x_z + iy_z = r_z\cos(\theta_z) + ir_z\sin(\theta_z) = r_z\left[\cos(\theta_z) + i\sin(\theta_z)\right] = r_z e^{i\theta_z}. \tag{12.16}$$
– Complex exponentiation: $z^w = (x_z + iy_z)^{x_w + iy_w} = \left[(x_z^2 + y_z^2)^{\frac{x_w + iy_w}{2}}\right]e^{i\theta_z(x_w + iy_w)}$.
In R, the above basic operations on complex numbers can be performed using the
script in Listing 12.5.
# Defining the complex numbers z and w
z <- complex(real=-3, imaginary=8)
w <- complex(real=-5, imaginary=-2)
# Imaginary part of z
Im(z)
[1] 8
# Modulus of the complex number z
rz<-Mod(z)
rz
[1] 8.544004
# Argument of the complex number z in degree
thetaz<-Arg(z)*180/pi
thetaz
[1] 110.556
# Power of a complex number
n<-3
zn<-z^n
zn
[1] 549-296i
rz^n*(cos(n*thetaz*pi/180) + 1i*sin(n*thetaz*pi/180))
[1] 549-296i # This is equivalent to zn
#Sum of the complex numbers z and w
s<-z+w
s
[1] -8+6i
# Difference of the complex numbers z and w
d<-z-w
d
[1] 2+10i
# Product of the complex numbers z and w:
p<-z*w
p
[1] 31-34i
# Division of the complex numbers z and w
q<-z/w
q
[1] -0.034483-1.586207i
# Exponentiation t=z^w
t<-z^w
t
[1] 0.000205779-0.001021052i
An $n$-dimensional complex vector is a vector of the form $\vec{V} = (x_1, x_2, \ldots, x_n)$, whose components $x_1, x_2, \ldots, x_n$ can be complex numbers. The concepts of vector operations and vector transformations, previously introduced in relation to real vectors, can be generalized to complex vectors using the elementary operations on complex numbers.
12.1.3 Matrices
In the two foregoing sections, we have presented some basic concepts of vector anal-
ysis. In this section, we will discuss a generalization of vectors also known as matri-
ces.
Let 𝑚 and 𝑛 be two positive integers. We call 𝐴 an 𝑚 × 𝑛 real matrix if it
consists of an ordered set of 𝑚 vectors in an 𝑛-dimensional space. In other words,
$A$ is defined by a set of $m \times n$ scalars $a_{ij} \in \mathbb{R}$, with $i = 1, \ldots, m$ and $j = 1, \ldots, n$, arranged as follows:
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}. \tag{12.17}$$
The matrix 𝐴, defined by (12.17), has 𝑚 rows and 𝑛 columns. For any entry 𝑎𝑖𝑗 ,
with 𝑖 = 1, . . . , 𝑚 and 𝑗 = 1, . . . , 𝑛, of the matrix 𝐴, the index 𝑖 is called the row
index, whereas the index 𝑗 is the column index. The set of entries (𝑎𝑖1 , 𝑎𝑖2 , . . . , 𝑎𝑖𝑛 )
is called the 𝑖th row of 𝐴, and the set (𝑎1𝑗 , 𝑎2𝑗 , . . . , 𝑎𝑚𝑗 ) is the 𝑗 th column of 𝐴.
When $n = m$, $A$ is called a square matrix, and the set of entries $(a_{11}, a_{22}, \ldots, a_{nn})$ is called its main diagonal.
In the case 𝑚 = 1 and 𝑛 > 1, the matrix 𝐴 is reduced to one row, and it is
called a row vector; likewise, when 𝑚 > 1 and 𝑛 = 1, the matrix 𝐴 is reduced to a
single column, and it is called a column vector. In the case 𝑚 = 𝑛 = 1, the matrix 𝐴
is reduced to a single value, i. e., a real scalar. An 𝑚 × 𝑛 matrix 𝐴 can be viewed as
a list of 𝑚 𝑛-dimensional row vectors or a list of 𝑛 𝑚-dimensional column vectors.
If the entries 𝑎𝑖𝑗 are complex numbers, then 𝐴 is called a complex matrix.
The R programming environment provides a wide range of functions for matrix
manipulation and matrix operations. Several of these are illustrated in the three
listings below.
# Defining a 3x3 matrix A
A <- matrix(c(3, 2, -5, 1, -3, 2, 5, -1, 4), nrow=3, byrow=TRUE)
# Extracting a submatrix: rows 1 to 2 and columns 1 to 2 of A
SubA1 <- A[1:2, 1:2]
# Extracting a submatrix: columns 2 to 3 of A
SubA2 <- A[ ,2:3]
SubA2
[ ,1] [ ,2]
[1,] 2 -5
[2,] -3 2
[3,] -1 4
SubA3<-A[ ,1]
SubA3
[1] 3 1 5
# Binding matrices , with the same number of rows , by columns
cbind(SubA2 , SubA3)
SubA3
[1,] 2 -5 3
[2,] -3 2 1
[3,] -1 4 5
# Binding matrices , with the same number of columns , by rows
rbind(SubA2, SubA1)
[ ,1] [ ,2]
[1,] 2 -5
[2,] -3 2
[3,] -1 4
[4,] 3 2
[5,] 1 -3
The power to which the adjacency matrix (12.18) is raised (here 2) gives the length of the walks being counted. The entry $a_{ij}$ of $A^2(G)$ gives the number of walks of length 2 from $v_i$ to $v_j$. For instance, $a_{11} = 2$ means there exist two walks of length 2 from vertex 1 to vertex 1. Moreover, $a_{14} = 1$ means there exists only one walk of length 2 from vertex 1 to vertex 4.
When $m = 1$, the result is a product between a row vector and a matrix.
Note that, even if both products 𝐴 × 𝐵 and 𝐵 × 𝐴 are defined, i. e., if 𝑙 = 𝑚,
𝐴 × 𝐵 generally differs from 𝐵 × 𝐴.
$\frac{r}{nm}$ %, where $r$ is the number of nonzero entries in $A$.
Definition 12.3.1. Let $A$ be an $n \times n$ square matrix, and let $I_n$ be the $n \times n$ identity matrix. If there exists an $n \times n$ matrix $B$ such that
$$AB = I_n = BA, \tag{12.21}$$
then $B$ is called the inverse of $A$, denoted $A^{-1}$, and $A$ is said to be invertible or nonsingular.
The inverse of the identity matrix is the identity matrix, whereas the inverse of
a lower (respectively, upper) triangular matrix is also a lower (respectively, upper)
triangular matrix.
Note that for a matrix to be invertible, it must be a square matrix.
Using R, the inverse of a square matrix, $A$, can be computed as follows.
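The original listing is not reproduced here; the following minimal sketch uses base R's solve() on the 3 × 3 matrix used elsewhere in this chapter.

# Computing the inverse of a square matrix A with solve()
A <- matrix(c(3, 2, -5, 1, -3, 2, 5, -1, 4), nrow=3, byrow=TRUE)
InvA <- solve(A)
# Check: A %*% InvA should be the identity matrix (up to rounding)
round(A %*% InvA, 10)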
$$\det(A) = \begin{cases} a_{11}, & \text{if } n = 1, \\ \sum_{i=1}^{n} (-1)^{i+j} a_{ij}\det(M_{ij}), & \text{if } n > 1, \end{cases}$$
where $M_{ij}$ is the $(n-1) \times (n-1)$ matrix obtained by removing the $i$th row and the $j$th column of $A$.
Let $A$ and $B$ be two $n \times n$ matrices and $k$ a real scalar. Some useful properties of the determinant for $A$ and $B$ include the following:
1. $\det(AB) = \det(A)\det(B)$;
2. $\det(A^T) = \det(A)$;
3. $\det(kA) = k^n\det(A)$;
4. $\det(A) \neq 0$ if and only if $A$ is nonsingular.
For a triangular matrix $T$, $\det(T) = \prod_{i=1}^{n} t_{ii}$; i.e., the determinant of a triangular matrix is the product of its diagonal entries. Therefore, the most practical means of computing the determinant of a matrix is to decompose it into a product of lower and upper triangular matrices.
Using R, the trace and the determinant of a matrix are computed as follows:
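The corresponding listing is not reproduced here; as a minimal sketch, the trace can be obtained by summing the diagonal entries, and the determinant via the base R function det().

# Trace and determinant of a square matrix A
A <- matrix(c(3, 2, -5, 1, -3, 2, 5, -1, 4), nrow=3, byrow=TRUE)
sum(diag(A))   # trace of A
[1] 4
det(A)         # determinant of A
[1] -88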
Definition 12.5.3. The set of all linear combinations of a set of $m$ vectors $\{A_i, i = 1, 2, \ldots, m\}$ in $\mathbb{R}^n$ is a subspace called the span of $\{A_i, i = 1, 2, \ldots, m\}$, defined as follows:
$$\text{span}\{A_i, i = 1, 2, \ldots, m\} = \left\{\sum_{i=1}^{m} k_iA_i : k_i \in \mathbb{R}\right\}.$$
All the bases of a subspace $S$ have the same number of elements, and this number is called the dimension of $S$, denoted $\dim(S)$.
Two key subspaces are associated with any $m \times n$ matrix $A$, i.e., $A \in \mathbb{R}^{m \times n}$:
1. the subspace $\ker(A) = \{x \in \mathbb{R}^n : Ax = 0\}$, called the kernel (or null space) of $A$;
2. the subspace $\text{im}(A) = \{y \in \mathbb{R}^m : y = Ax \text{ for some } x \in \mathbb{R}^n\}$, called the image (or range) of $A$.
Definition 12.5.5. The rank of a matrix, $A$, denoted $\text{rank}(A)$, is the maximum number of linearly independent rows or columns of the matrix $A$, and it is defined as follows:
$$\text{rank}(A) = \dim(\text{im}(A)).$$
$$B = A + uv^T \tag{12.22}$$
Then,
$$\left(A + UV^T\right)^{-1} = A^{-1} - A^{-1}U\left(I_n + V^TA^{-1}U\right)^{-1}V^TA^{-1}. \tag{12.23}$$
$$\det(A - \lambda I) = 0.$$
A square matrix, $A$, is nonsingular if and only if all its eigenvalues are nonzero.
Definition 12.6.1. The spectral radius of a square matrix $A$, denoted $\rho(A)$, is given by
$$\rho(A) = \max_{i=1,2,\ldots,n}\left|\lambda_i(A)\right|.$$
𝐴𝑥 = 𝜆𝑖 (𝐴)𝑥, (12.25)
For each eigenvalue 𝜆𝑖 (𝐴), its right eigenvector 𝑥 is found by solving the system
(𝐴 − 𝜆𝑖 (𝐴)𝐼)𝑥 = 0.
Let $A$ be an $n \times n$ real matrix. The following properties hold:
– If $A$ is diagonal, upper triangular or lower triangular, then its eigenvalues are given by its diagonal entries, i.e., $\lambda_i(A) = a_{ii}$ for $i = 1, \ldots, n$.
𝑄𝑇 𝐴𝑄 = 𝐷,
where 𝐷 is an 𝑛×𝑛 diagonal matrix, whose diagonal entries are 𝜆1 (𝐴), 𝜆2 (𝐴), . . . ,
𝜆𝑛 (𝐴).
$$\lambda_i\left(A^{-1}\right) = \frac{1}{\lambda_i(A)}, \quad\text{for } i = 1, \ldots, n.$$
Definition 12.7.2. Let $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$. Then, the subordinate matrix $p$-norm of $A$, denoted $\|A\|_p$, is defined in terms of vector norms as follows:
$$\|A\|_p = \max_{x \neq 0_n}\frac{\|Ax\|_p}{\|x\|_p}.$$
In particular,
1. the subordinate matrix 1-norm of $A$ is defined by
$$\|A\|_1 = \max_{x \neq 0_n}\frac{\|Ax\|_1}{\|x\|_1} = \max_{j=1,2,\ldots,n}\sum_{i=1}^{m}|a_{ij}|;$$
2. the subordinate matrix 2-norm of $A$ is defined by
$$\|A\|_2 = \max_{x \neq 0_n}\frac{\|Ax\|_2}{\|x\|_2};$$
if $m = n$ and $A$ is symmetric, then
$$\|A\|_2 = \rho(A) = \max_{i=1,2,\ldots,n}\left|\lambda_i(A)\right|,$$
Furthermore, if $m = n$, we have
$$\|A\|_2 \leq \sqrt{\|A\|_1 \times \|A\|_\infty} \leq \sqrt{n} \times \|A\|_2.$$
Remark 12.7.1. The subordinate matrix $p$-norm is consistent, i.e., for any $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$,
$$\|Ax\|_p \leq \|A\|_p \times \|x\|_p.$$
12.8.1 LU factorization
An LU factorization of an $n \times n$ matrix $A$ writes it as the product
$$A = LU,$$
where
$$L = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ l_{21} & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & l_{n3} & \cdots & 1 \end{bmatrix} \quad\text{and}\quad U = \begin{bmatrix} u_{11} & u_{12} & u_{13} & \cdots & u_{1n} \\ 0 & u_{22} & u_{23} & \cdots & u_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & u_{nn} \end{bmatrix}.$$
Therefore, the determinants of $L$ and $U$ are $\det(L) = 1$ and $\det(U) = \prod_{i=1}^{n} u_{ii}$, respectively. Consequently,
$$\det(A) = \det(LU) = \det(L) \times \det(U) = \prod_{i=1}^{n} u_{ii}.$$
$$PA = LU. \tag{12.26}$$
Alternatively, one can write
$$PA = LD\hat{U},$$
where $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix, whose diagonal entries are $u_{ii}$, and $\hat{U} \in \mathbb{R}^{n \times n}$ is a unit upper triangular matrix; i.e., $U = D\hat{U}$.
Computing the LU factorization of $A$ is formally equivalent to solving the following nonlinear system of $n^2$ equations, where the unknowns are the $n^2 + n$ coefficients of the triangular matrices $L$ and $U$:
$$a_{ij} = \sum_{k=1}^{\min(i,j)} l_{ik}u_{kj}.$$
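In practice, one does not solve this system directly. A minimal sketch of an LU factorization in R uses the lu() function from the Matrix package (the book's own listing may differ):

# LU factorization with partial pivoting using the Matrix package
library(Matrix)
A <- Matrix(c(3, 2, -5, 1, -3, 2, 5, -1, 4), nrow=3, byrow=TRUE)
luA <- expand(lu(A))   # factors such that A = P %*% L %*% U
luA$L   # unit lower triangular factor
luA$U   # upper triangular factor
luA$P   # permutation matrix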
$$A = LDL^T. \tag{12.27}$$
$$PAP^T = LDL^T. \tag{12.28}$$
If an $n \times n$ real matrix, $A$, is positive definite, then there exists a unit lower triangular matrix $L \in \mathbb{R}^{n \times n}$ and a diagonal matrix $D \in \mathbb{R}^{n \times n}$ with $d_{ii} > 0$ for $i = 1, 2, \ldots, n$ such that
$$A = LDL^T = \tilde{L}\tilde{L}^T, \tag{12.29}$$
where $\tilde{L} = LD^{\frac{1}{2}}$, with $D^{\frac{1}{2}}$ a diagonal matrix, whose diagonal entries are $\sqrt{d_{ii}}$ for $i = 1, 2, \ldots, n$. The factorization (12.29) is referred to as the Cholesky factorization. An illustration of Cholesky factorization, using R, is provided in Listing 12.18.
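Since Listing 12.18 is not reproduced here, the following minimal sketch shows the idea with base R's chol(), which returns the upper triangular factor $R$ with $A = R^TR$, so that $\tilde{L} = R^T$; the matrix is an illustrative positive definite example.

# Cholesky factorization of a symmetric positive definite matrix
A <- matrix(c(4, 2, 2, 3), nrow=2)
R <- chol(A)           # upper triangular factor
Ltilde <- t(R)         # lower triangular factor L-tilde
Ltilde %*% t(Ltilde)   # reproduces A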
12.8.3 QR factorization
$$A = QR \iff Q^TA = R,$$
Let $V \in \mathbb{R}^{m \times n}$ and $W \in \mathbb{R}^{m \times (m-n)}$ denote the $n$ first columns and $(m-n)$ last columns of the orthogonal matrix $Q \in \mathbb{R}^{m \times m}$, respectively; that is, $Q = [V, W]$. Then, the submatrices $V$ and $W$ are also orthogonal. Indeed,
$$Q^TQ = \begin{bmatrix} V^T \\ W^T \end{bmatrix}\begin{bmatrix} V & W \end{bmatrix} = \begin{bmatrix} V^TV & V^TW \\ W^TV & W^TW \end{bmatrix} = \begin{bmatrix} I_n & 0 \\ 0 & I_{m-n} \end{bmatrix}.$$
Moreover,
$$Q^TA = \begin{bmatrix} V^T \\ W^T \end{bmatrix}A = \begin{bmatrix} R \\ 0_{m-n,n} \end{bmatrix} \iff A = Q\begin{bmatrix} R \\ 0_{m-n,n} \end{bmatrix},$$
and therefore
$$V^TA = R \iff A = VR, \tag{12.31}$$
$$W^TA = 0_{m-n,n}. \tag{12.32}$$
Equations (12.31) and (12.32) yield several important results, which link the QR factorization of a matrix, $A$, to its subspaces $\text{im}(A)$ (i.e., the range of $A$) and $\ker(A)$ (i.e., the kernel or the null space of $A$). In particular, since $V$ is an orthogonal matrix, then, thanks to (12.31), the columns of $V$ form an orthogonal basis for the subspace $\text{im}(A)$; that is, $A$ is uniquely determined by linear combinations of the columns of $V$ through $A = VR$.
The singular value decomposition (SVD) is another type of matrix factorization; it generalizes the eigendecomposition of a square normal matrix to any matrix. It is a popular method because it has widespread applications, e.g., for recommender systems and the determination of a pseudoinverse. Let $A \in \mathbb{R}^{m \times n}$. Then, there exist orthogonal matrices $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$, and a diagonal matrix $D \in \mathbb{R}^{m \times n}$, such that $A = UDV^T$, with diagonal entries
$$d_1 \geq d_2 \geq \cdots \geq 0.$$
Definition 12.8.1. The singular values of a matrix 𝐴 ∈ R𝑚×𝑛 , denoted 𝜎𝑖 (𝐴), are
given by the diagonal entries of the matrix 𝐷, that is,
$\sigma_i(A) = d_i$, for $i = 1, 2, \ldots, p$, where $p = \min(m, n)$; the rank of the matrix is the number of its nonzero singular values.
However, due to rounding errors, this approach to determining the rank is not straightforward in practice, as it is unclear how small a singular value should be in order to be considered zero.
Furthermore, if $A$ has full rank, i.e., $r = \min(m, n)$, then the condition number of $A$, denoted $\kappa(A)$, is given by
$$\kappa(A) = \frac{\sigma_1(A)}{\sigma_r(A)}.$$
$$\sum_{j=1}^{n} a_{ij}x_j = b_i, \quad i = 1, \ldots, m, \tag{12.33}$$
where $x_j$ are the unknowns, whereas $a_{ij}$, the coefficients of the system, and $b_i$, the entries of the right-hand side, are assumed to be known constants. The system (12.33) can be rewritten in the following matrix form:
$$Ax = b, \tag{12.34}$$
$$x_j = \frac{\det(M_j)}{\det(A)}, \tag{12.35}$$
where $M_j$ denotes the matrix $A$ with its $j$th column replaced by the right-hand side vector $b$ (Cramer's rule).
$$x_1 = \frac{b_1}{l_{11}}, \tag{12.36}$$
$$x_i = \frac{b_i - \sum_{j=1}^{i-1} l_{ij}x_j}{l_{ii}} \quad\text{for } i = 2, 3, \ldots, n. \tag{12.37}$$
$$x = A^{-1}b.$$
$$P^{-1}LUx = b \iff LUx = Pb, \tag{12.40}$$
$$A^TAx = A^Tb.$$
Thus,
$$A = \begin{pmatrix} 3 & 2 & -5 \\ 1 & -3 & 2 \\ 5 & -1 & 4 \end{pmatrix} \quad\text{and}\quad b = \begin{pmatrix} 12 \\ -13 \\ 10 \end{pmatrix}.$$
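This system can be solved in R, e.g., with the base function solve(); a minimal sketch:

# Solving the linear system A x = b
A <- matrix(c(3, 2, -5, 1, -3, 2, 5, -1, 4), nrow=3, byrow=TRUE)
b <- c(12, -13, 10)
x <- solve(A, b)
x   # solution vector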
12.10 Exercises
1. Let $\vec{U} = (-1, -2, 4)$ and $\vec{V} = (2, 3, 5)$ be two 3-dimensional vectors:
(a) Compute the Euclidean norms of $\vec{U}$ and $\vec{V}$.
(b) Compute the dot product of $\vec{U}$ and $\vec{V}$.
(c) Compute the orthogonal projection of $\vec{U}$ onto the direction of $\vec{V}$.
(d) Compute the reflection of $\vec{V}$ with respect to the direction of $\vec{U}$.
(e) Compute the cross product of $\vec{U}$ and $\vec{V}$.
(f) Compute the components of $\vec{U}$ and $\vec{V}$ in cylindrical and spherical coordinates.
2. Let $z = 5 + 2i$ and $w = 7 + 4i$ be two complex numbers.
(a) Compute the conjugates of $z$ and $w$, the sum $z + w$, and the difference $z - w$.
(b) Compute the product $z \times w$ and the divisions $\frac{z}{w}$ and $\frac{w}{z}$.
3. Let
$$A = \begin{pmatrix} 1 & 3 & 0 \\ 2 & -2 & 1 \\ -4 & 1 & -1 \end{pmatrix} \quad\text{and}\quad B = \begin{pmatrix} 7 & 2 & 1 \\ 0 & 3 & -1 \\ -3 & 4 & -2 \end{pmatrix}.$$
13 Analysis

13.1 Introduction
Differentiation and integration are fundamental mathematical concepts, having a
wide range of applications in many areas of science, particularly in physics, chemistry,
and engineering [158]. Both concepts are intimately connected, as integration is
the inverse process of differentiation, and vice versa. These concepts are especially
important for descriptive models, e. g., providing information about the position of
an object in space and time (physics) or the temporal evolution of the price of a stock
(finance). Such models require the precise definition of the functional, describing the
system of interest, and related mathematical objects defining the dynamics of the
system.
From the above sequences, it can be observed that $a_n$ and $b_n$ have the closed forms $a_n = \frac{1}{2n}$ and $b_n = \frac{3}{n}$, respectively. Now, we are ready to define the limiting value of a real sequence.
Definition 13.2.2. A number $l$ is called the limiting value or limes of a given real sequence $(a_n)_{n\in\mathbb{N}}$ if, for all $\varepsilon > 0$, there exists $N_0(\varepsilon)$ such that $|a_n - l| < \varepsilon$ for all $n > N_0(\varepsilon)$. In this case, the sequence $(a_n)_{n\in\mathbb{N}}$ is said to converge to $l$, and the following shorthand notation is used to summarize the previous statement: $\lim_{n\to\infty} a_n = l$.
Thus, we find $n > \frac{1}{\varepsilon} =: N_0(\varepsilon)$. In summary, for all $\varepsilon > 0$, there exists a number $N_0(\varepsilon) := \frac{1}{\varepsilon}$ such that $|a_n - 1| < \varepsilon$ for all $n > N_0(\varepsilon)$. For example, if we set $\varepsilon = \frac{1}{10}$, then $n \geq 11$. This means that for all elements of the given sequence $a_n = 1 - \frac{1}{n}$, starting from $n = 11$, $|a_n - 1| < \frac{1}{10}$ holds.
Before we give some examples of basic limiting values of sequences, we provide the following proposition, which is necessary for the calculations that follow [158].

Proposition 13.2.1. Let $(a_n)_{n\in\mathbb{N}}$ and $(b_n)_{n\in\mathbb{N}}$ be two convergent sequences with $\lim_{n\to\infty} a_n = a$ and $\lim_{n\to\infty} b_n = b$. Then, the following relationships hold:
$$\lim_{n\to\infty}(a_n \pm b_n) = a \pm b, \qquad \lim_{n\to\infty}(a_nb_n) = ab, \qquad \lim_{n\to\infty}\frac{a_n}{b_n} = \frac{a}{b} \quad (b \neq 0).$$
Example 13.2.2. Let us examine the convergence of the following two sequences:
$$a_n = \frac{3n+1}{n+5}, \tag{13.7}$$
$$b_n = (-1)^n. \tag{13.8}$$
For $a_n$, we have
$$\lim_{n\to\infty}\frac{3n+1}{n+5} = \lim_{n\to\infty}\frac{n(3+\frac{1}{n})}{n(1+\frac{5}{n})} = \frac{\lim_{n\to\infty}(3+\frac{1}{n})}{\lim_{n\to\infty}(1+\frac{5}{n})} = \frac{3+\lim_{n\to\infty}(\frac{1}{n})}{1+\lim_{n\to\infty}(\frac{5}{n})} = \frac{3+0}{1+0} = 3. \tag{13.9}$$
Definition 13.2.3. Let $f(x)$ be a real function and $x_n$ a sequence that belongs to the domain of $f(x)$. If all sequences of the values $f(x_n)$ converge to $l$, then $l$ is called the limiting value for $x \to \pm\infty$, and we write $\lim_{x\to\pm\infty} f(x) = l$. For a general function, $f : X \to Y$, we call the set $X$ the domain and $Y$ the co-domain of the function $f$.
Example 13.2.3. Let us determine the limiting value of the function $f(x) = \frac{2x-1}{x}$ for large and positive $x$. This means, we examine $\lim_{x\to\infty}\frac{2x-1}{x}$ and find
$$\lim_{x\to\infty}\frac{2x-1}{x} = \lim_{x\to\infty}\left(2 - \frac{1}{x}\right) = 2 - \lim_{x\to\infty}\frac{1}{x} = 2. \tag{13.10}$$
The limiting value of $f(x) = \frac{2x-1}{x}$ for large $x$ can be seen in Figure 13.2.
We conclude this section by stating the definition for the convergence of a func-
tion, 𝑓 (𝑥), if 𝑥 tends to a finite value 𝑥0 .
The following sections will utilize the concepts introduced here to define differenti-
ation and integration.
13.3 Differentiation
Let $f : \mathbb{R} \longrightarrow \mathbb{R}$ be a given continuous function. Then, $f$ is called differentiable at the point $x_0$ if the following limit exists:
$$\lim_{h\to 0}\frac{f(x_0+h) - f(x_0)}{h}. \tag{13.12}$$
If $f$ is differentiable at the point $x_0$, the derivative of $f$, denoted $\frac{df(x)}{dx}$ or $f'(x)$, is finite at $x_0$, and can be approximated by
$$f'(x_0) \approx \frac{f(x_0+h) - f(x_0)}{h} \quad\text{for small } h.$$
For a function of several variables, $f(x_1, x_2, \ldots, x_n)$, the total differential is given by
$$df = \frac{\partial f}{\partial x_1}dx_1 + \frac{\partial f}{\partial x_2}dx_2 + \cdots + \frac{\partial f}{\partial x_i}dx_i + \cdots + \frac{\partial f}{\partial x_n}dx_n.$$
The gradient of $f$ is then defined by
$$\nabla f = \frac{\partial f}{\partial x_1}e_1 + \frac{\partial f}{\partial x_2}e_2 + \cdots + \frac{\partial f}{\partial x_i}e_i + \cdots + \frac{\partial f}{\partial x_n}e_n,$$
where the $e_k$, $k = 1, \ldots, n$, are the orthogonal unit vectors pointing in the coordinate directions. Here, we work in $(\mathbb{R}^n, \|\cdot\|_2)$, where $\|x\|_2 = \sqrt{\langle x, x\rangle}$ is the Euclidean norm.
Thus, the Hessian matrix describes the local curvature of the function 𝑓 .
For example, for $f(x) = x_1^3 + x_2^2 + \ln(x_3)$, we have
$$\nabla f(x) = \left(3x_1^2,\ 2x_2,\ \frac{1}{x_3}\right)^T \quad\text{and}\quad \nabla^2 f(x) = \begin{pmatrix} 6x_1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & -\frac{1}{x_3^2} \end{pmatrix}.$$
At the point $\bar{x} = (1, 1/2, 1)$, we have $\nabla f(\bar{x}) = (3, 1, 1)$ and
$$\nabla^2 f(\bar{x}) = \begin{pmatrix} 6 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & -1 \end{pmatrix}.$$
Then,
$$J_f(x) = \begin{bmatrix} x_2^2 & 2x_1x_2 & 0 \\ 2x_2 & 2x_1 & 2x_3 \end{bmatrix}.$$
At the point $\bar{x} = (1, 1/2, 1)$, we have
$$J_f(\bar{x}) = \begin{bmatrix} 1/4 & 1 & 0 \\ 1 & 2 & 2 \end{bmatrix}.$$
Let us consider the following example, from economics, for determining extreme
values of economic functions [182] (see also Example 13.8.1). In this example, the
economic functions of interest are real polynomials [135].
$$C(x) = \frac{1}{3}x^3 - \frac{3}{2}x^2 + 7. \tag{13.16}$$
By solving
$$C'(x) = x^2 - 3x = 0, \tag{13.18}$$
we obtain the candidate extrema $x = 0$ and $x = 3$. The second derivative is
$$C''(x) = 2x - 3. \tag{13.19}$$
This yields $C''(3) = 3 > 0$. Hence, we found a minimum of $C(x)$ at $x = 3$. This can also be observed, graphically, in Figure 13.4.
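The location of this minimum can also be verified numerically, e.g., with base R's optimize(); a minimal sketch:

# Numerical minimization of the cost function C(x)
C <- function(x) x^3/3 - 3*x^2/2 + 7
optimize(C, interval=c(0, 10))   # minimum near x = 3 with C(3) = 2.5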
In the following section, we provide some formal definitions of extrema of a
function.
Figure 13.4: An example of an economic cost function 𝐶(𝑥) with its minimum located at 𝑥 = 3.
Definition 13.4.3. Let 𝒟 denote the domain of a function 𝑓 , and let ℐ ⊂ 𝒟. A point
𝑥* ∈ ℐ is called a local maximum of 𝑓 if 𝑓 (𝑥* ) ≥ 𝑓 (𝑥) for all 𝑥 ∈ ℐ. Then 𝑓 (𝑥* ) is
referred to as the maximum value of 𝑓 in ℐ.
Definition 13.4.4. Let 𝒟 denote the domain of a function 𝑓 , and let ℐ ⊂ 𝒟. A point
𝑥* ∈ ℐ is called a local minimum of 𝑓 if 𝑓 (𝑥* ) ≤ 𝑓 (𝑥) for all 𝑥 ∈ ℐ. Then 𝑓 (𝑥* ) is
referred to as the minimum value of 𝑓 in ℐ.
Theorem 13.4.1 (Weierstrass extreme value theorem). Let 𝒟 denote the domain of a
function 𝑓 , and let ℐ = [𝑎, 𝑏] ⊂ 𝒟. If 𝑓 is continuous on ℐ, then 𝑓 achieves both its
maximum value, denoted 𝑀 , and its minimum value, denoted 𝑚. In other words,
there exist 𝑥*𝑀 and 𝑥*𝑚 in ℐ such that
– 𝑓 (𝑥*𝑀 ) = 𝑀 and 𝑓 (𝑥*𝑚 ) = 𝑚,
– $m \leq f(x) \leq M$ for all $x \in \mathcal{I}$.
For a numerical solution to this problem, the package ggpmisc in R can be used to
find extrema of a function, as illustrated in Listing 13.2. The plot from the output of
the script is shown in Figure 13.6, where the colored dots correspond to the different
extrema of the function.
Below, we provide examples of Taylor series expansions for some common functions, at a point $x = x_0$:
$$\exp(x) = \exp(x_0)\left[1 + (x-x_0) + \frac{1}{2}(x-x_0)^2 + \frac{1}{6}(x-x_0)^3 + \cdots\right]$$
$$\ln(x) = \ln(x_0) + \frac{x-x_0}{x_0} - \frac{(x-x_0)^2}{2x_0^2} + \frac{(x-x_0)^3}{3x_0^3} - \cdots$$
$$\cos(x) = \cos(x_0) - \sin(x_0)(x-x_0) - \frac{1}{2}\cos(x_0)(x-x_0)^2 + \frac{1}{6}\sin(x_0)(x-x_0)^3 + \cdots$$
$$\sin(x) = \sin(x_0) + \cos(x_0)(x-x_0) - \frac{1}{2}\sin(x_0)(x-x_0)^2 - \frac{1}{6}\cos(x_0)(x-x_0)^3 + \cdots$$
The accuracy of a Taylor series expansion depends on the function to be approximated, the point at which the approximation is made, and the number of terms used in the approximation, as illustrated in Figure 13.7 and Figure 13.8.
Several packages in R can be used to obtain the Taylor series expansion of a
function. For instance, the library Ryacas can be used to obtain the expression of
the Taylor series expansion of a function, which can then be evaluated. The library
pracma, on the other hand, provides an approximation of the function at a given
point using its corresponding Taylor series expansion. The scripts below illustrate
the usage of these two packages.
Listing 13.4: Taylor Series Approximation with pracma, see Figure 13.7
# Taylor Series Expansion Using pracma
library(pracma)
library(ggplot2)
fx <- function(x) {
  e <- exp(1)
  return(e^x)
}
fxts <- taylor(fx, 0, 5)
Figure 13.7, produced using Listing 13.4, shows the graph of the function $f(x) = \exp(x)$ alongside its corresponding Taylor approximation of order $n = 5$, for $x \in [-1, 1]$. It is clear that this Taylor series approximation of the function $f(x)$ is quite accurate for $x \in [-1, 1]$, since the graphs of both functions match in this interval.
Figure 13.7: Taylor series approximation of the function 𝑓 (𝑥) = exp(𝑥). The approximation
order is 𝑛 = 5.
On the other hand, Figure 13.8, produced using Listing 13.5, shows the graph of the function $f(x) = \frac{1}{1-x}$ alongside its corresponding Taylor approximation of order $n = 5$, for $x \in [-1, 1]$. The Taylor series approximation of the function $f(x)$ is accurate on most of the interval $x \in [-1, 1]$, except near 1, where the function $f(x)$ and its Taylor approximation diverge. In fact, when $x$ tends to 1, $f(x)$ tends to $\infty$, and the corresponding Taylor series approximation cannot keep pace with the growth of the function $f(x)$.
13.6 Integrals
The integral of a function $f(x)$ over the interval $[a, b]$, denoted $\int_a^b f(x)\,dx$, is given by the area between the graph of $f(x)$ and the line $f(x) = 0$, where $a \leq x \leq b$.
Definition 13.6.1 (Definite integral). The definite integral of a function $f(x)$ from $a$ to $b$ is denoted
$$\int_a^b f(x)\,dx.$$
$$F(x) = G(x) + C.$$
This result justifies the integration constant $C$ for the indefinite integral.
$$F(x) = \int_a^x f(z)\,dz, \quad a \leq x \leq b,$$
$$\int_a^b f(x)\,dx = F(b) - F(a).$$
The results from the above theorems demonstrate that differentiation is simply the inverse of integration.
The antiderivative, $F(x)$, is not always easy to obtain analytically. Therefore, the integral is often approximated numerically. The numerical estimation is generally carried out as follows: The interval $[a, b]$ is subdivided into $n \in \mathbb{N}$ subintervals. Let $\Delta x_i = x_{i+1} - x_i$ denote the length of the $i$th subinterval, $i = 1, 2, 3, \ldots, n$, and let $\tilde{x}_i$ be a value in the subinterval $[x_i, x_{i+1}]$. Then,
$$\int_a^b f(x)\,dx \approx \sum_{i=1}^{n} f(\tilde{x}_i)\Delta x_i. \tag{13.20}$$
The last term in equation (13.20) is known as the Riemann sum. When 𝑛 tends to
∞, then Δ𝑥𝑖 tends to 0, for all 𝑖 = 1, 2, 3 . . . , 𝑛, and, consequently, the Riemann
sum tends toward the real value of the integral of 𝑓 (𝑥) over the interval [𝑎, 𝑏], as
illustrated in Figure 13.9.
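Equation (13.20) is easily turned into a short R function; the sketch below uses the midpoints of the subintervals as the values $\tilde{x}_i$:

# Approximating an integral by a Riemann sum (midpoint rule)
riemann <- function(f, a, b, n) {
  dx <- (b - a)/n                      # subinterval length
  xt <- a + (seq_len(n) - 0.5)*dx      # midpoints x~_i
  sum(f(xt)*dx)                        # Riemann sum
}
riemann(sin, 0, pi, 1000)   # close to the exact value 2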
Using R, a one-dimensional integral over a finite or infinite interval is computed
using the command integrate(f, lowerLimit, upperLimit), where f is the func-
tion to be integrated, lowerLimit and upperLimit are the lower and upper limits
of the integral, respectively.
The integral $\int_{-\infty}^{+\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{(x-5)^2}{2}}\,dx$ can be computed as follows.
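A minimal sketch of this computation (the book's listing is not reproduced here):

# Integrating the normal density with mean 5 over the real line
f <- function(x) exp(-(x - 5)^2/2)/sqrt(2*pi)
integrate(f, lower=-Inf, upper=Inf)   # result is 1 (up to a small tolerance)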
Using R, an $n$-fold integral over a finite or infinite interval is computed using the command adaptIntegrate(f, lowerLimit, upperLimit) from the cubature package. The integral $\int_0^3\int_1^5\int_{-2}^{-1}\frac{5}{2}\sin(x)\cos(yz)\,dx\,dy\,dz$ can be computed as follows.
Figure 13.9: Geometric interpretation of the integral. Left: Exact form of an integral. Right:
Numerical approximation.
$$P_n(x_i) = a_nx_i^n + a_{n-1}x_i^{n-1} + \cdots + a_1x_i + a_0 = y_i, \quad i = 0, \ldots, m. \tag{13.21}$$
Note that this approach was developed by Lagrange, and the resulting interpola-
tion polynomial is referred to as the Lagrange polynomial [99, 131]. When 𝑛 = 1
and 𝑛 = 2, the process is called a linear interpolation and quadratic interpolation,
respectively.
Let us consider the following data points:
𝑥𝑖 1 2 3 4 5 6 7 8 9 10
𝑦𝑖 −1.05 0.25 1.08 −0.02 −0.27 0.79 −1.02 −0.17 0.97 2.06
Using R, the Lagrange polynomial interpolation for the above pairs of data points
(𝑥, 𝑦) can be carried out using Listing 13.8. In Figure 13.10 (left), which is an output
of Listing 13.8, the interpolation points are shown as dots, whereas the corresponding
Lagrange polynomial is represented by the solid line.
Figure 13.10: Left: Polynomial interpolation of the data points in blue. Right: Roots of the in-
terpolation polynomial.
Let
$$f(x) = a_nx^n + a_{n-1}x^{n-1} + \cdots + a_1x + a_0 \tag{13.22}$$
be a real polynomial, i.e., its coefficients are real numbers, and $n$ is the degree of this polynomial. Then we write $\deg(f(x)) = n$.
If $n = 2$, the equation
$$f(x) = a_2x^2 + a_1x + a_0 = 0 \tag{13.23}$$
has the solutions
$$x_{1,2} = \frac{-a_1 \pm \sqrt{a_1^2 - 4a_2a_0}}{2a_2}. \tag{13.24}$$
For $n = 3$,
$$f(x) = a_3x^3 + a_2x^2 + a_1x + a_0 = 0 \tag{13.25}$$
leads to the formulas of Cardano [135]. For some special cases where $n = 4$, analytical expressions are also known. In general, the well-known theorem due to Abel and Ruffini [184] states that general polynomials with $\deg(f(x)) \geq 5$ are not solvable by radicals. Radicals are $n$th root expressions that depend on the polynomial coefficients. Another classical theorem proves the existence of a zero of a continuous function.
$$f(x) = a_0 + a_1x + a_2x^2 + a_3x^3 + a_4x^4 + a_5x^5 + a_6x^6 + a_7x^7 + a_8x^8 + a_9x^9, \tag{13.26}$$
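For polynomials of this form, the roots can be computed numerically in R with the base function polyroot(), which takes the coefficients in increasing order of powers; the coefficients below are illustrative values:

# Numerical roots of a degree-9 polynomial (illustrative coefficients)
coefs <- c(1, -2, 0, 3, 0, 0, 1, 0, -1, 2)   # a_0 first, a_9 last
polyroot(coefs)   # returns the 9 complex roots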
Example 13.8.1. In economics, for example, root-finding methods and basic deriva-
tives find frequent application [182]. For instance, these are used to explore profit and
revenue functions (see [182]). Generally, the revenue function $R(x)$ and the profit function $P(x)$ are defined by
$$R(x) = p \cdot x \tag{13.27}$$
and
$$P(x) = R(x) - C(x), \tag{13.28}$$
respectively [182]. Here, $x$ is a unit of quantity, $p$ is the sales price per unit of quantity, and $C(x)$ is a cost function.
is a parameter, i. e., a fixed number. Suppose that we have a specific profit function
defined by
$$P(x) = -\frac{1}{10}x^2 + 50x - 480. \tag{13.29}$$
This profit function $P(x)$ is shown in Figure 13.11, and to find its maximum, we need to find the zeros of its derivative:
$$P'(x) = -\frac{2}{10}x + 50 = 0. \tag{13.30}$$
From this, we find 𝑥 = 250. Using this value, we obtain the maximizing unit of
quantity for 𝑃 (𝑥), i. e., 𝑃 (250) = 5770. To find the so-called break-even points, it is
necessary to identify the zeros of 𝑃 (𝑥), i. e.,
$$P(x) = -\frac{1}{10}x^2 + 50x - 480 = 0. \tag{13.31}$$
Between the two zeros of $P(x)$, we make a profit. Outside this interval, we make a loss. Therefore, solving
$$x^2 - 500x + 4800 = 0 \tag{13.32}$$
yields
$$x_{1,2} = \frac{500}{2} \pm \sqrt{\left(\frac{500}{2}\right)^2 - 4800}. \tag{13.33}$$
Specifically, $x_1 = 490.21$, and $x_2 = 9.79$. This means that the profit interval of $P(x)$ corresponds to $[9.79, 490.21]$. Graphically, this is also evident in Figure 13.11.
Figure 13.11: An example of a profit function. The profit interval of the profit function is its positive part between the zeros 9.79 and 490.21.
When the zeros of polynomials cannot be calculated explicitly, techniques for estimating bounds are required. For example, various bounds have been proven for real zeros (positive and negative), as well as for complex zeros [155, 50, 135]. For the latter case, bounds for the moduli of the zeros of a given polynomial are sought. Zero bounds have proven useful in cases where the degree of a given polynomial is large and, therefore, numerical techniques may fail.
The following well-known theorems are attributed to Cauchy [155]:
Theorem 13.8.2 (Cauchy). Let
$$f(z) = a_nz^n + a_{n-1}z^{n-1} + \cdots + a_1z + a_0, \quad a_n \neq 0, \tag{13.34}$$
be a complex polynomial. All the zeros of $f(z)$ lie in the closed disk $|z| \leq \rho$. Here, $\rho$ is the positive root of another equation, namely
$$|a_n|z^n - |a_{n-1}|z^{n-1} - \cdots - |a_1|z - |a_0| = 0. \tag{13.35}$$
Theorem 13.8.3. Let $f(z)$ be a complex polynomial given by equation (13.34). All the zeros of $f(z)$ lie in the closed disk
$$|z| \leq 1 + \max_{0 \leq j \leq n-1}\left|\frac{a_j}{a_n}\right|. \tag{13.36}$$
Below, we provide some examples, which illustrate the results from these two
theorems.
$$z_1 = -0.099, \tag{13.37}$$
$$z_2 = -1.950 - 31.556i, \tag{13.38}$$
$$z_3 = -1.950 + 31.556i, \tag{13.39}$$
with the moduli
$$|z_1| = 0.099, \tag{13.40}$$
$$|z_2| = |z_3| = 31.616. \tag{13.41}$$
Using Theorem 13.8.2 and Theorem 13.8.3 gives the bounds $\rho = 33.78$ and $1001$, respectively. Considering that the largest modulus of a zero of $f(z)$ is $\max_i |z_i| = 31.616$, Theorem 13.8.2 gives a good result, whereas the bound given by Theorem 13.8.3 is useless for $f(z)$. This example demonstrates the complexity of the problem of determining zero bounds efficiently (see [155, 50, 135]).
13.10 Exercises
1. Evaluate the gradient, the Hessian, and the Jacobian matrix of the following functions using R:
$$f(x, y) = y\cos(x^2 + xy^2) \quad\text{at the point } (x = \pi,\ y = \sqrt{5}),$$
$$g(x, y, z) = e^{\sin(xy)} + z\sqrt{y}\cos(xy) + x^2y^3 + z \quad\text{at the point } (x = 5,\ y = 3,\ z = 21).$$
2. Use R to find the extremum of the function $f(x) = 3x^x$, and determine whether it is a minimum or a maximum.
3. Use R to find the global extrema of the function $g(y) = y^3 - 6y^2 - 15y + 100$ in the interval $[-3, 6]$.
4. Use R to find the points where the function $f(y) = 2y^3 - 3y^2$ achieves its global minimum and global maximum, and the corresponding extreme values.
5. Use R to find the global maximum and minimum of the function $f(x) = 2x^2 - 4x + 6$ in the interval $[-3, 6]$. Calculate the difference between the maximal and minimal values of $f(x)$.
6. Use R to find the extrema of the function $f(y) = y^{2/3}(y+1)^2$ on the interval $[-1, 1]$. Find the critical numbers of the function $f$.
7. Use R to find the Taylor series expansion of the function $f(x) = \ln(1+x)$. Plot the graph of the function $f$ and the corresponding Taylor series approximation for $x \in [-1, 1]$.
8. Evaluate the following integral using R:
$$I_1 = \int_{-\pi}^{\pi}\frac{\sin^3 x}{\cos^2 x + 1}\,dx.$$
9. Use R to find the polynomial interpolation of the following pairs of data points:
𝑥𝑖 1 2 3 4 5 6 7 8 9 10
𝑦𝑖 −2.05 0.75 1.8 −0.02 −0.75 1.71 −2.12 −0.25 1.70 3.55
10. Use R to find the zeros of the following polynomials:
$$f(x) = 2x - 1, \qquad g(x) = 23x^2 - 3x - 1, \qquad h(x) = 23x^8 - 3x^7 + x^4 - x^2 - 20.$$
14 Differential equations
Differential equations can be seen as applications of the methods from analysis,
discussed in the previous chapter. The general aim of differential equations is to
describe the dynamical behavior of functions [78]. This dynamical behavior is the
result of an equation that contains derivatives of a function. In this chapter, we
introduce ordinary differential equations and partial differential equations [3]. We
discuss the general properties of such equations and demonstrate specific solution
techniques for selected differential equations, including the heat equation and the
wave equation. Due to the descriptive nature of differential equations, physical laws
as well as biological and economical models are often formulated in terms of such equations.
Initial value ODE problems govern the evolution of a system from its initial state
𝑦(𝑡0 ) = 𝐶 at 𝑡0 onward, and we are seeking a function 𝑦(𝑡), which describes the
state of the system as a function of 𝑡. Thus, a general formulation of a first-order
initial value ODE problem can be written as follows:
$$\frac{dy(t)}{dt} = y'(t) = f(y(t), t, k) \quad\text{for } t > t_0, \tag{14.2}$$
$$y(t_0) = C, \tag{14.3}$$
where $C$ is given.
Figure 14.1: Examples of initial value ODE problems. Left: Solutions of the ODE $y'(t) = \frac{2t}{1+t^2}(y(t)+1)$. Right: Solutions of the ODE $y'(t) = (y(t))^2 + t^2 - 1$.
Some examples of initial value ODE problems, depicted in Figure 14.1, illustrate the
evolution of the ODE’s solution, depending on its initial condition.
Equation (14.2) may represent a system of ODEs, where
$$y(t) = (y_1(t), \ldots, y_n(t))^T \quad\text{and}\quad f(y(t), t, k) = (f_1(y(t), t, k), \ldots, f_n(y(t), t, k)),$$
and each entry of $f(y(t), t, k)$ can be a nonlinear function of all the entries of $y$. The system (14.2) is called linear if the function $f(y(t), t, k)$ can be written as follows:
$$f(y(t), t, k) = G(t, k)y + h(t, k). \tag{14.4}$$
As a first example, consider the ODE
$$y' = kty. \tag{14.7}$$
As a second example, consider the following system of ODEs:
$$\frac{dy_1}{dt} = k_1y_2y_3, \tag{14.8}$$
$$\frac{dy_2}{dt} = k_2y_1y_3, \tag{14.9}$$
$$\frac{dy_3}{dt} = k_3y_1y_2, \tag{14.10}$$
with the initial conditions $y_1(0) = -1$, $y_2(0) = 0$, $y_3(0) = 1$, and where $k_1$, $k_2$ and $k_3$ are parameters, with values of $1$, $-1$, and $-1/2$, respectively.
The system (14.8)–(14.10), known as the Euler equations, can be solved in R
using Listing 14.2. Figure 14.2 (center), which is an output of Listing 14.2, shows
the evolution of the solution (𝑦1 (𝑡), 𝑦2 (𝑡), 𝑦3 (𝑡)) to the problem (14.8)–(14.10), for
𝑡 ∈ [0, 15].
Figure 14.2: Left: Numerical solution of the ODE (14.7) with the initial condition $y(0) = 10$ and the parameter $k = 1/5$. Center: Numerical solution of the Euler equations (14.8)–(14.10) with the initial conditions $y_1(0) = -1$, $y_2(0) = 0$, $y_3(0) = 1$. Right: Numerical solution of the BV ODE system (14.13) with the boundary conditions $y(-1) = 1/4$, $y(1) = 1/3$.
$$\frac{dy(t)}{dt} = y'(t) = f(y(t), t, k) \quad\text{for } t > t_0, \tag{14.11}$$
$$y(t_0) = C_1, \qquad y(t_{\max}) = C_2, \tag{14.12}$$
$$y''(t) - 2y^2(t) - 4ty(t)y'(t) = 0, \quad\text{with } y(-1) = 1/4,\ y(1) = 1/3. \tag{14.13}$$
Listing 14.3: Solving a system of Boundary Value ODE, see Figure 14.2 (right)
# The packages "rootSolve" and "bvpSolve" are required here
library(rootSolve)
library(bvpSolve)
# Specifying the BV ODEs to be solved
SystBVODEs <- function(t,y,k)
{ list(c(
y[2],
k*y[1]*y[1] +2*k*t*y[1]*y[2]
))
}
# Defining the boundary values
yb1<-c(1/4,NA)
yb2<-c(NA , 1/3)
# Defining the time limits and steps
t <- seq(-1, 1, by = 0.01)
# Solving the BV ODEs
SolBVODEs<-bvptwp(yini=yb1 , yend=yb2 , x=t, parms=2, func=SystBVODEs)
# Snapshot of the solution
x 1 2
[1,] -1.00 0.2500000 -0.08075594
[2,] -0.99 0.2492027 -0.07871783
[3,] -0.98 0.2484255 -0.07671780
# Changing the names of the columns of SolBVODEs
colnames(SolBVODEs)[1] <- "t"
colnames(SolBVODEs)[2] <- "y1"
colnames(SolBVODEs)[3] <- "y2"
dSolBVODEs <- data.frame(SolBVODEs)
ggplot(dSolBVODEs , aes(t)) + geom_line(aes(y=y1 , colour="y1")) +
geom_line(aes(y=y2 , colour="y2")) + scale_colour_manual(" ",
breaks=c("y1","y2"), labels=c(expression(y[1]),
expression(y[2])), values=c("darkmagenta","orange2")) +
ylab("y") + xlab("t") + theme_bw()
To achieve accurate modeling of physical processes, engineers and scientists are increasingly required to solve the actual PDEs that govern the physical problem being investigated.
A PDE is an equation stating a relationship between a function of two or more inde-
pendent variables, and the partial derivatives of this function with respect to these
independent variables. For most problems in engineering and science, the indepen-
dent variables are either space (𝑥, 𝑦, 𝑧) or space and time (𝑥, 𝑦, 𝑧, 𝑡). The dependent
variable, i. e., the function 𝑓 , depends on the physical problem being modeled.
$$F(x, u(x), \nabla u(x)) = 0, \tag{14.16}$$
$$f(x, y, u(x,y))\frac{\partial}{\partial x}u(x,y) + g(x, y, u(x,y))\frac{\partial}{\partial y}u(x,y) = h(x, y, u(x,y)), \tag{14.17}$$
There are three types of boundary conditions for PDEs. Let $R$ denote a domain and $\partial R$ its boundary. Furthermore, let $n$ and $s$ denote the coordinates normal (outward) and along the boundary $\partial R$, respectively, and let $f$, $g$ be some functions on the boundary $\partial R$. Then, the three boundary conditions for PDEs are:
– Dirichlet conditions, when $u = f$ on the boundary $\partial R$;
– Neumann conditions, when $\frac{\partial u}{\partial n} = f$ or $\frac{\partial u}{\partial s} = g$ on the boundary $\partial R$;
– Mixed (Robin) conditions, when $\frac{\partial u}{\partial n} + ku = f$, $k > 0$, on the boundary $\partial R$.
Dirichlet conditions can only be applied if the solution is known on the boundary
and if the function 𝑓 is analytic. These are frequently used for the flow (velocity)
into a domain. Neumann conditions occur more frequently [102].
Parabolic PDE
In this section, we will illustrate the solution to the heat equation, which is a pro-
totype parabolic PDE. The heat equation, in a one-dimensional space with zero
production and consumption, can be written as follows:
$$\frac{\partial u(x,t)}{\partial t} - D\frac{\partial^2 u(x,t)}{\partial x^2} = 0, \quad x \in (a, b). \tag{14.19}$$
Let us use R to solve the equation (14.19) with $a = 0$, $b = 1$, i.e., $x \in [0, 1]$, and the following boundary and initial conditions:
$$u(x, 0) = \cos\left(\frac{\pi}{2}x\right), \qquad u(0, t) = \sin(t), \qquad u(1, t) = 0. \tag{14.20}$$
The heat equation (14.19)–(14.20) can be solved using Listing 14.4. The correspond-
ing solution, 𝑢(𝑥, 𝑡), is depicted in Figure 14.3 for color levels (left) and a contour
plot (right).
Hyperbolic PDE
A prototype of hyperbolic PDEs is the wave equation, defined as follows:
$$\frac{\partial^2 u}{\partial t^2} = \nabla\cdot\left(c^2\nabla u\right). \tag{14.21}$$
Figure 14.3: Solution to the heat equation in equation (14.19) with the boundary and initial
conditions provided in equation (14.20).
The wave equation (14.22)–(14.23) can be solved using Listing 14.5. The correspond-
ing solution, 𝑢(𝑡, 𝑥, 𝑦), is depicted in Figure 14.4, for 𝑡 = 0, 𝑡 = 1, 𝑡 = 2 and 𝑡 = 3,
respectively.
# Initial condition
peak <- function (x, y, x0 , y0) exp(-((x-x0)^2 + (y-y0)^2))
uinitial <- outer(x, y, FUN = function(x, y) peak(x, y, 0,0))
vinitial <- rep(0, Nx*Ny)
# Solving the PDE
SolWEq <- ode.2D (y = c(uinitial , vinitial), times = t, parms =
NULL , func = WaveEq2D , names = c("u", "v"), dimens = c(Nx , Ny),
method = "ode45")
# Plotting the solution
mr <- par(mar = c(0, 0, 1, 0))
image(SolWEq , main = paste("t =", t), which = "u", grid = list(x =
x, y = y), method = "persp", border = NA , box = FALSE ,
legend=TRUE , shade = 0.5, theta = 30, phi = 60, mfrow = c(2,
2), ask = FALSE)
Figure 14.4: Solution to the wave equation in equation (14.22) with the boundary and initial
conditions provided in equation (14.23).
Elliptic PDE
A prototype of elliptic PDEs is Poisson's equation. Let us use R to solve the following Poisson's equation in a two-dimensional space:
$$\gamma_1\frac{\partial^2 u(x,y)}{\partial x^2} + \gamma_2\frac{\partial^2 u(x,y)}{\partial y^2} = x^2 + y^2, \quad x \in (a, b),\ y \in (c, d), \tag{14.24}$$
The Poisson’s equation (14.24)–(14.25) can be solved using Listing 14.6. The cor-
responding solution, 𝑢(𝑥, 𝑦), is depicted in Figure 14.5 for color levels (left) and a
contour plot (right).
14.3 Exercises
Use R to solve the following differential equations:
1. Solve the heat equation (14.19) with 𝑎 = 0, 𝑏 = 1, i. e., 𝑥 ∈ [0, 1], and the
following boundary and initial conditions:
(a) $u(x, 0) = 6\sin(\frac{\pi x}{2})$, $u(0, t) = \cos(t)$, $u(1, t) = 0$.
(b) $u(x, 0) = 12\sin(\frac{9\pi x}{5}) - 7\sin(\frac{4\pi x}{3})$, $u(0, t) = \cos(\pi t)$, $u(1, t) = 0$.
2. Solve the wave equation (14.22) with $a = c = -4$, $b = d = 4$, $\gamma_1 = \gamma_2 = 1$, and the following boundary and initial conditions:
2. Solve the wave equation (14.22) with 𝑎 = 𝑐 = −4, 𝑏 = 𝑑 = 4, 𝛾1 = 𝛾2 = 1, and
the following boundary and initial conditions:
Figure 14.5: Solution to the Poisson’s equation in equation (14.24) with the boundary and initial
conditions provided in equation (14.25).
15 Dynamical systems

15.1 Introduction
The theory of dynamical systems can be viewed as the most natural way of describing
the behavior of an integrated system over time [109, 56]. In other words, a dynamical
system can be cast as the process by which a sequence of states is generated on the
basis of certain dynamical laws. Generally, this behavior is described through a
system of differential equations describing the rate of change of each variable as
a function of the current values of the other variables influencing the one under
consideration. Thus, the system states form a continuous sequence, which can be
formulated as follows. Let 𝑥 = (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) be a point in C𝑛 that defines a curve
through time, i. e.,
$$x = x(t) = (x_1(t), x_2(t), \ldots, x_n(t)), \quad -\infty < t < \infty.$$
Suppose that the laws, which describe the rate and direction of the change of $x(t)$, are known and defined by the following equations:
$$\frac{dx(t)}{dt} = f(x(t)), \quad t \in \mathbb{R},\ x \in \mathbb{C}^n,\ x(t_0) = x_0, \tag{15.1}$$
Definition 15.1.2. A curve 𝒞 = {𝑥(𝑡)}, which satisfies the equations (15.1) (respec-
tively (15.2)), is called the orbit of the dynamical system 𝑥(𝑡).
In the subsequent sections, we will illustrate the use of R to simulate and visualize
some basic dynamical systems, including population growth models, cellular au-
tomata, Boolean networks, and other “abstract” dynamical systems, such as strange
attractors and fractal geometries. These dynamical systems are well known for their
sensitivity to initial conditions, which is the defining feature of chaotic systems.
The exponential growth model describes the evolution of a population or the concen-
tration (number of organisms per area unit) of an organism living in an environment,
whose resources and conditions allow them to grow indefinitely. Supposing that the
growth rate of the organism is 𝑟, then the evolution of the population number of
organisms 𝑥(𝑡) over time is governed by the following equation:
$$\frac{dx}{dt} = rx. \tag{15.3}$$
The solution of this equation is given by
$$x(t) = x(0)e^{rt}. \tag{15.4}$$
The solution (15.4) can be plotted in R using Listing 15.1, and the corresponding
output, which shows the evolution of the population for 𝑥(0) = 2, 𝑟 = 0.03 and
𝑡 ∈ [0, 100], is depicted in Figure 15.1 (left).
In contrast with the exponential growth model, the logistic population model as-
sumes that the availability of resources restricts the population growth. Let 𝐾 be
the “carrying capacity” of the living environment of the population, i. e., the popu-
lation number or the concentration (number of organisms per area unit) such that the growth rate of the organism population is zero.

Figure 15.1: Left: Exponential population growth for $r = 0.03$ and $x_0 = 2$. Center: Logistic population growth model for $r = 0.1$, $x_0 = 0.1 < K = 10$. Right: Logistic population growth model for $r = 0.1$, $x_0 = 20 > K = 10$.

In this situation, a larger popu-
lation results in fewer resources, and this leads to a smaller growth rate. Hence, the
growth rate is no longer constant. When the growth rate is assumed to be a linearly
decreasing function of 𝑥 of the form
$$r\left(1 - \frac{x(t)}{K}\right),$$
then the population dynamics are governed by the logistic equation
$$\frac{dx}{dt} = rx\left(1 - \frac{x}{K}\right), \tag{15.5}$$
whose solution is given by
$$x(t) = \frac{Kx(0)e^{rt}}{K + x(0)(e^{rt} - 1)}. \tag{15.6}$$
The solution (15.6) can be plotted in R using Listing 15.2. The corresponding
output, which shows the evolution of the population over the time interval [0, 100], is
depicted in Figure 15.1 (center) for 𝑥(0) = 0.1, 𝑟 = 0.1, 𝐾 = 10, and in Figure 15.1
(right) for 𝑥(0) = 20, 𝑟 = 0.1, 𝐾 = 10.
The logistic map is a variant of the logistic population growth model (15.5) with
nonoverlapping generations. Let 𝑦𝑛 denote the population number of the current
generation and 𝑦𝑛+1 denote the population number of the next generation. When
the growth rate is assumed to be a linearly decreasing function of 𝑦𝑛 , then we get
the following logistic equation:
$$y_{n+1} = ry_n\left(1 - \frac{y_n}{K}\right). \tag{15.7}$$
Substituting $y_n = Kx_n$ and $y_{n+1} = Kx_{n+1}$ in Equation (15.7) gives the following recurrence relationship, also known as the logistic map:
$$x_{n+1} = rx_n(1 - x_n), \tag{15.8}$$
where $x_{n+1}$ denotes the population size of the next generation, whereas $x_n$ is the population size of the current generation, and $r$ is a positive constant denoting the growth rate of the population between generations.
The graph 𝑥𝑛 versus 𝑥𝑛+1 is called the cobweb graph of the logistic map.
For any initial condition, over time, the population 𝑥𝑛 will settle into one of the
following types of behavior:
1. fixed, i. e., the population approaches a stable value
2. periodic, i. e., the population alternates between two or more fixed values
3. chaotic, i. e., the population will eventually visit any neighborhood in a subin-
terval of (0, 1).
When $0 < r < 1$, the system converges to the fixed point $x = 0$. However, when $r > 1$, the graph of the function $f(x)$ is a parabola achieving its maximum at $x = 1/2$, with $f(0) = f(1) = 0$. The intersection between the graph of $f$ and the straight line of equation $y = x$ defines a point $S$, whose abscissa, $x^*$, satisfies $x^* = rx^*(1 - x^*)$. Hence, the point $x^* = \frac{r-1}{r}$ is another fixed point of the system.
When 1 < 𝑟 < 3, the point 𝑥* is asymptotically stable, i. e., for any 𝑥 in
the neighborhood of 𝑥* , the sequence generated by the map (15.9)—the orbit of
𝑥—remains close to or converges to 𝑥* . In R, such a dynamics of the system can be
illustrated using the scripts provided in Listing 15.3 and Listing 15.4.
Figure 15.2 (left), produced using Listing 15.4, shows the cobweb graph of the
logistic map for 𝑟 = 2.5, which corresponds to a stable fixed point. When 𝑟 = 3 the
logistic map has an asymptotically stable fixed point, and the corresponding cobweb
graph and the graph of the population dynamics are depicted in Figure 15.2 (center)
(produced using Listing 15.4) and Figure 15.2 (right) (produced using Listing 15.3),
respectively.
Listing 15.4: Cobweb of the logistic map, see Figure 15.2 (left & center)
Cobweb<-function(r, x0 , N)
{
xn<-seq(0,1, length.out=N)
xn1<-r*xn*(1-xn)
plot(xn ,xn1 , xaxt="n", yaxt="n", type='l', xlab=expression(x[n]),
ylab=expression(x[n+1]), col="orangered", lwd=0.8)
lines(x=c(0,1), y=c(0,1), col="orangered",lwd=0.8)
xn<-x0
xn1<-r*x0*(1-x0)
for (i in 1:N)
{
s<-r*xn1*(1 - xn1)
lines(x=c(xn , xn), y=c(xn , xn1), col="purple4", lwd=0.08)
lines(x=c(xn , xn1), y=c(xn1 , xn1), col="purple4", lwd=0.08)
lines(x=c(xn1 , xn1), y=c(xn1 , s), col="purple4", lwd=0.08)
lines(x=c(xn1 , s), y=c(s, s), col="purple4", lwd=0.2)
xn<-xn1
xn1<-s
}
}
# Plotting the cobweb graphs
Cobweb(r=3, x0=0.2,N=100)
Figure 15.2: Left: Cobweb graph of a stable fixed point for 𝑟 = 2.5. Center: Cobweb graph of an
asymptotically stable fixed point for 𝑟 = 3; Right: Population number dynamics over time.
Figure 15.3: Left: Cobweb graph of periodic fixed points for 𝑟 = 3.2. Center: Cobweb graph of
periodic fixed points for 𝑟 = 3.4; Right: Dynamics of the population number over time.
15.2 Population growth models | 269
to periodic fixed points. Figure 15.3 (right), produced using Listing 15.3, illustrates
the dynamics of the populations over time for both cases.
Figure 15.4: Left: Cobweb graph of a chaotic motion for 𝑟 = 3.8. Center: Cobweb graph of a
chaotic motion for 𝑟 = 3.9; Right: Dynamics of the population number over time.
Figure 15.5 (left), (center), and (right), produced using Listing 15.5, illustrates the
bifurcation phenomenon, which can be visualized through the graph of the growth
rate, 𝑟, versus the population size, 𝑥. Such a graph is also known as the bifurcation
diagram of a logistic map model. Figure 15.5 (left) depicts the bifurcation diagram
for 0 ≤ 𝑟 ≤ 4, whereas Figure 15.5 (center) and Figure 15.5 (right) show the zoom
corresponding to the ranges 3 ≤ 𝑟 ≤ 4 and 3.52 ≤ 𝑟 ≤ 3.92, respectively.
# Bifurcation diagram of the logistic map f(x) = r*x*(1-x)
fxn <- function(x, r) r*x*(1-x)
bifurcation <- function(x0, N, rmin, rmax, MaxIter, fxn)
{
  rvect <- seq(rmin, rmax, length.out=N)   # grid of growth rates
  xmat <- matrix(0, N, MaxIter)            # storage for the orbits
  for (i in 1:N)
  {
    r <- rvect[i]
for (j in 1:MaxIter)
{
if (j == 1)
{
xn = x0
for (k in 1:400)
{
xn1 = fxn(xn ,r)
xn = xn1
}
}
xn1 = fxn(xn ,r)
xmat[i,j] = xn1
xn = xn1
}
}
return(xmat)
}
x0<-0.2; rmin<-3; rmax<-4; N<-500; MaxIter<-1000;
Matx<-bifurcation(x0 ,N, rmin , rmax , MaxIter , fxn)
Lab.palette <- colorRampPalette(c("firebrick","red","orange"),
space = "Lab")
matplot(Matx, pch = ".", col = Lab.palette(256), cex = 0.035, axes = FALSE, ann = FALSE)
Figure 15.5: Bifurcation diagram for the logistic map model—growth rate 𝑟 versus population
size 𝑥: Left 0 ≤ 𝑟 ≤ 4. Center: zoom for 3 ≤ 𝑟 ≤ 4. Right: zoom for 3.52 ≤ 𝑟 ≤ 3.92.
2. The rate of predation upon the prey species is proportional to the rate at which
the predator species and the prey meet.
The model describes the evolution of the population numbers 𝑥1 and 𝑥2 over time
through the following relationships:
$$\frac{dx_1}{dt} = x_1(\alpha - \beta x_2),$$
$$\frac{dx_2}{dt} = -x_2(\gamma - \delta x_1), \tag{15.10}$$
where $\frac{dx_1}{dt}$ and $\frac{dx_2}{dt}$ denote the growth rates of the two populations over time; $\alpha$ is the growth rate of the prey population in the absence of interaction with the predator species; $\beta$ is the death rate of the prey species caused by the predator species; $\gamma$ is the death (or emigration) rate of the predator species in the absence of interaction with the prey species; and $\delta$ is the growth rate of the predator population.
The predator–prey model (15.10) is a system of ODEs. Thus, it can be solved
using the function ode() in R. When the parameters 𝛼, 𝛽, 𝛾, and 𝛿 are set to 0.2,
0.002, 0.1, and 0.001, respectively, the system (15.10) can be solved in R, using the
scripts provided in Listing 15.6 and Listing 15.7.
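Since Listings 15.6 and 15.7 are not reproduced here, the following minimal sketch shows how the system (15.10) can be set up with the deSolve package, using the parameter values from the text; the function and variable names are illustrative, and the plotting steps are omitted.

# Minimal deSolve sketch of the predator-prey system (15.10)
library(deSolve)
LotkaVolterra <- function(t, y, p) {
  with(as.list(c(y, p)), {
    dx1 <- x1*(alpha - beta*x2)     # prey dynamics
    dx2 <- -x2*(gamma - delta*x1)   # predator dynamics
    list(c(dx1, dx2))
  })
}
p <- c(alpha=0.2, beta=0.002, gamma=0.1, delta=0.001)
y0 <- c(x1=100, x2=25)
t <- seq(0, 200, by=0.1)
SolLV <- ode(y=y0, times=t, func=LotkaVolterra, parms=p)
head(SolLV)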
The corresponding outputs are shown in Figure 15.6, where the solution in
the phase plane (𝑥1 , 𝑥2 ) for 𝑥2 (0) = 25, the evolution of the population of the
species over time for 𝑥2 (0) = 25, and the solution in the phase plane (𝑥1 , 𝑥2 ) for
10 ≤ 𝑥1 (0) ≤ 150 are depicted in Figure 15.6 (left), Figure 15.6 (center), and
Figure 15.6 (right), respectively.
Figure 15.6: Solutions of the system (15.10) with 𝛼 = 0.2, 𝛽 = 0.002, 𝛾 = 0.1, 𝛿 = 0.001,
and the initial conditions 𝑥1 (0) = 100. Left: Solution in the phase plane (𝑥1 , 𝑥2 ) for 𝑥2 (0) = 25.
Center: evolution of the population of the species over time for 𝑥2 (0) = 25. Right: solution in
the phase plane (𝑥1 , 𝑥2 ) for 10 ≤ 𝑥1 (0) ≤ 150.
The most elementary and yet interesting cellular automaton consists of a one-
dimensional grid of cells, where the set of states for the cells is 0 or 1, and the
neighborhood of a cell is the cell itself, as well as its immediate successor and pre-
decessor, as illustrated below:
A one-dimensional CA:  0 1 1 0 1 0 0 0 0 1
At each time point, the state of each cell of the grid is updated according to a
specified rule, so that the new state of a given cell depends on the state of its neigh-
borhood, namely the current state of the cell under consideration and its adjacent
cells, as illustrated below:
A cell (in red) and its neighborhood:  000 001 010 011 100 101 110 111
Rule for updating the cell in red:       1   0   0   0   1   1   0   1
The cells at the boundaries do not have two neighbors, and thus require special
treatments. These cells are called the boundary conditions, and they can be handled
in different ways:
– The cells can be kept with their initial condition, i. e., they will not be updated
at all during the simulation process.
– The cells can be updated in a periodic way, i. e., the first cell on the left is a
neighbor of the last cell on the right, and vice versa.
– The cells can be updated using a desired rule.
Depending on the rule specified for updating the cell and the initial conditions, the
evolution of elementary cellular automata can lead to the following system states:
– Steady state: The system will remain in its initial configuration, i. e., the initial
spatiotemporal pattern can be a final configuration of the system elements.
– Periodic cycle: The system will alternate between coherent periodic stable pat-
terns.
– Self-organization: The system will always converge towards a coherent stable
pattern.
– Chaos: The system will exhibit some chaotic patterns.
For a finite number of cells 𝑁 , the number of possible configurations for the system is
also finite and is given by 2^𝑁 . Hence, after at most 2^𝑁 steps, some configuration must be
revisited, and the CA will enter a periodic cycle by repeating itself indefinitely. Such a
cycle corresponds to an attractor of the system for the given initial conditions. When
a cellular automaton models an orderly system, then the corresponding attractor is
generally small, i. e., it has a cycle with a small period.
Using the R Listing 15.8, we illustrate some spatiotemporal evolutions of an ele-
mentary cellular automaton using both deterministic and random initial conditions,
whereby the cells at the boundaries are kept to their initial conditions during the
simulation process.
Figure 15.7 shows the spatiotemporal patterns of an elementary cellular au-
tomaton with a simple deterministic initial condition, i. e., all the cells are set to 0,
except the middle one, which is set to 1. Complex localized stable structures (us-
ing Rule 182), self-organization (using Rule 210) and chaotic patterns (using Rule
89) are depicted in Figure 15.7 (left), Figure 15.7 (center), and Figure 15.7 (right),
respectively.
Figure 15.8 shows spatiotemporal patterns of an elementary cellular automaton
with a random initial condition, i. e., the states of the cells are allocated randomly.
Complex localized stable structures (using Rule 182), self-organization (using Rule
210) and chaotic patterns (using Rule 89) are depicted in Figure 15.8 (left), Fig-
ure 15.8 (center), and Figure 15.8 (right), respectively.
# This function returns the cell update for a given rule number;
# dec2bin() (decimal-to-binary conversion with npos bits) is assumed to be
# defined earlier in the chapter
RuleCA <- function(l, m, r, RuleNb)
{
  RuleBin <- dec2bin(RuleNb, npos = 8)
  InvRuleBin <- rev(RuleBin)
  n <- paste(c(l, m, r), collapse = "")
  index <- strtoi(n, 2)   # neighborhood pattern read as a binary number
  newval <- InvRuleBin[index + 1]
  return(newval)
}
# Simulation function; the opening lines (function header and initialization
# of cellMat) are missing in this excerpt and are reconstructed here
CA <- function(Ncells, MaxIter, RuleNb, init)
{
  cellMat <- matrix(0, nrow = MaxIter, ncol = Ncells)
  cellMat[1, ] <- init   # first row holds the initial condition
  for (i in 2:MaxIter)
  {
    for (j in 2:(Ncells - 1))   # boundary cells keep their initial state
    {
      l <- cellMat[i - 1, j - 1]
      m <- cellMat[i - 1, j]
      r <- cellMat[i - 1, j + 1]
      cellMat[i, j] <- RuleCA(l, m, r, RuleNb)
    }
  }
  cellMat <- apply(cellMat, 1, rev)
  return(cellMat)
}
Figure 15.7: Spatiotemporal patterns of an elementary cellular automaton with a simple deter-
ministic initial condition, i. e., all the cells are set to 0 except the middle one, which is set to 1.
Left: complex localized stable structures (Rule 182). Center: self-organization (Rule 210). Right:
chaotic patterns (Rule 89).
Figure 15.8: Spatiotemporal patterns of an elementary cellular automaton with a random ini-
tial condition, i. e., the states of the cells are allocated randomly. Left: complex localized stable
structures (Rule 182). Center: self-organization (Rule 210). Right: chaotic patterns (Rule 89).
where ∨, ∧, and ¬ are the logical disjunction (OR), conjunction (AND), and negation
(NOT), respectively.
At a given time point 𝑡, the state-vector is 𝑥(𝑡) = (𝑥1 (𝑡), 𝑥2 (𝑡), 𝑥3 (𝑡)) and the
state evolution at the time point 𝑡 + 1 is given by
𝑥1 (𝑡 + 1) = 𝑓1 (𝑥1 (𝑡), 𝑥3 (𝑡)) = 𝑥1 (𝑡) ∨ 𝑥3 (𝑡),
𝑥2 (𝑡 + 1) = 𝑓2 (𝑥1 (𝑡), 𝑥3 (𝑡)) = 𝑥1 (𝑡) ∧ 𝑥3 (𝑡), (15.11)
𝑥3 (𝑡 + 1) = 𝑓3 (𝑥1 (𝑡), 𝑥2 (𝑡)) = ¬𝑥1 (𝑡) ∨ 𝑥2 (𝑡).
The corresponding truth table, i. e., the nodes-state at time 𝑡 + 1 for any given
configuration of the state vector 𝑥 at time 𝑡, is as follows:
𝑥(𝑡) = (𝑥1 (𝑡), 𝑥2 (𝑡), 𝑥3 (𝑡)) 000 001 010 011 100 101 110 111
𝑥(𝑡 + 1) = (𝑥1 (𝑡 + 1), 𝑥2 (𝑡 + 1), 𝑥3 (𝑡 + 1)) 001 101 001 101 100 110 101 111
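This truth table can be verified directly in R by applying the three update rules of (15.11) to all eight input configurations (a minimal sketch using R's logical operators):

# Enumerate all states and apply the update rules of (15.11)
for (x1 in 0:1) for (x2 in 0:1) for (x3 in 0:1)
{
  nxt <- c(x1 | x3, x1 & x3, (!x1) | x2)   # f1, f2, f3
  cat(x1, x2, x3, "->", as.integer(nxt), "\n")
}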
The corresponding adjacency matrix of the Boolean network is:

      A  B  C
  A   1  1  1
  B   0  0  1
  C   1  1  0
To draw the corresponding network using the package igraph in R, we can save the
adjacency matrix as a csv (comma separated values) or a text file and then load
the file in R. The corresponding text or csv file, which we will call here
“ExampleBN1.txt”, will be in the following format:
Nodes, A, B, C
A, 1, 1, 1
B, 0, 0, 1
C, 1, 1, 0
vertex.frame.color="seashell1", edge.arrow.size=0.7,
edge.curved=TRUE)
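The two lines above are the tail of the plotting command; a minimal sketch of the omitted steps (reading the file and building the graph, with abbreviated plotting options) could look as follows:

library(igraph)
# Read the adjacency matrix saved in "ExampleBN1.txt" and build the network
Adj <- as.matrix(read.csv("ExampleBN1.txt", row.names = 1))
net <- graph_from_adjacency_matrix(Adj, mode = "directed")
plot(net, vertex.color = "oldlace",
     vertex.frame.color = "seashell1", edge.arrow.size = 0.7,
     edge.curved = TRUE)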
Using the R package BoolNet [140], we can also draw a given Boolean network,
generate an RBN and analyze it, e. g., find the associated attractors and plot them.
However, the dependency relations of the network must be written into a text file
using an appropriate format. For instance, the dependency relations (15.11) can be
written in a textual format as follows:
targets, factors
A, A | C
B, A & C
C, ! A | B
Here, the symbols |, & and ! respectively denote the logical disjunction (OR), con-
junction (AND), and negation (NOT). Let us call the corresponding text file
“ExampleBN1p.txt”; this file must be in the current working R directory.
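A minimal sketch of loading this file with BoolNet (the object name BNet1 matches its use in the listings below):

library(BoolNet)
BNet1 <- loadNetwork("ExampleBN1p.txt")   # parse the dependency relations
plotNetworkWiring(BNet1)                  # draw the network graph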
Figure 15.9, produced using Listing 15.10, shows the visualization and analysis
of the Boolean network represented in the text file “ExampleBN1p.txt”. The network
graph, the state transition graph as well as attractor basins, and the state transition
table when the initial state is (010) i. e., (𝐴 = 0, 𝐵 = 1, 𝐶 = 0), are depicted
in Figure 15.9 (top), Figure 15.9 (bottom left) and Figure 15.9 (bottom right),
respectively.
# Identification of attractors
AttractorsBNet1 <- getAttractors(BNet1)
Figure 15.9: Visualization and analysis of a Boolean network—Example 1. Top: network graph.
Bottom left: state transition graph and attractor basins. Bottom right: state transition table
when the initial state is (010), i. e., (𝐴 = 0, 𝐵 = 1, 𝐶 = 0).
plotSequence(network = BNet1, startState = c(0, 1, 0),
             includeAttractorStates = "all", mode = "graph", drawLabels = TRUE,
             vertex.size = 7, edge.arrow.size = 0.7, vertex.label.cex = 1,
             vertex.frame.color = c("moccasin"),
             vertex.label.color = "firebrick4", edge.color = "slateblue",
             vertex.color = "oldlace")
Figure 15.10, produced using Listing 15.11, shows the visualization and analysis
of an RBN generated within the listing. The network graph, the state transition
graph, as well as attractor basins, and the state transition table when the initial
state is (11111111) are depicted in Figure 15.10 (top), Figure 15.10 (bottom left),
and Figure 15.10 (bottom right), respectively.
# Identification of attractors
AttractorsBNet2 <- getAttractors(BNet2)
Tmax=200
STransK2=array(dim=c(Tmax , N))
STransK7=array(dim=c(Tmax , N))
Figure 15.10: Visualization and analysis of a Boolean network—Example 2. Top: network graph.
Bottom left: state transition graph and attractor basins. Bottom right: state transition table
when the initial state is (11111111).
Figure 15.11: Spatiotemporal patterns of RBNs with 𝑁 = 1000. Left: critical dynamics (𝐾 = 2).
Right: chaotic patterns (𝐾 = 7).
The Lorenz attractor is a seminal dynamical system model due to Edward Lorenz
[119], a meteorologist who was interested in modeling the weather and the motion of
air as it heats up. The state variable of the system, 𝑥(𝑡), is in ℝ³, i. e., 𝑥(𝑡) =
(𝑥1 (𝑡), 𝑥2 (𝑡), 𝑥3 (𝑡)), and the system is written as:
𝑑𝑥1 /𝑑𝑡 = 𝑎(𝑥2 − 𝑥1 ),
𝑑𝑥2 /𝑑𝑡 = 𝑟𝑥1 − 𝑥2 − 𝑥1 𝑥3 ,
𝑑𝑥3 /𝑑𝑡 = 𝑥1 𝑥2 − 𝑏𝑥3 , (15.12)

where 𝑎, 𝑟, and 𝑏 are constants.
The chaotic behavior of the Lorenz system (15.12) is often termed the Lorenz
butterfly. In R, the Lorenz attractor can be simulated using Listing 15.13.
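As a minimal sketch (not a reproduction of Listing 15.13), system (15.12) can be iterated with explicit Euler steps; all variable names below are our own:

# Euler iteration of the Lorenz system (15.12)
a <- 10; r <- 28; b <- 8/3; dt <- 0.02; N <- 1e5
x <- y <- z <- numeric(N)
x[1] <- y[1] <- z[1] <- 0.01
for (i in 1:(N - 1))
{
  x[i + 1] <- x[i] + dt*a*(y[i] - x[i])
  y[i + 1] <- y[i] + dt*(r*x[i] - y[i] - x[i]*z[i])
  z[i + 1] <- z[i] + dt*(x[i]*y[i] - b*z[i])
}
plot(x, z, pch = ".", cex = 0.2, axes = FALSE, ann = FALSE)  # (x, z) plane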
Figure 15.12, produced using Listing 15.13, shows some visualizations of the
Lorenz attractor for 𝑎 = 10, 𝑟 = 28, 𝑏 = 8/3, (𝑥0 , 𝑦0 , 𝑧0 ) = (0.01, 0.01, 0.01), and 𝑑𝑡 =
0.02 after 10⁶ iterations. Representations of the attractor in the plane (𝑥, 𝑦), in the
space (𝑥, 𝑦, 𝑧), and in the plane (𝑥, 𝑧) are given in Figure 15.12 (left), Figure 15.12
(center), and Figure 15.12 (right), respectively.
Figure 15.12: Lorenz attractor for 𝑎 = 10, 𝑟 = 28, 𝑏 = 8/3, (𝑥0 , 𝑦0 , 𝑧0 ) = (0.01, 0.01, 0.01),
and 𝑑𝑡 = 0.02 after 10⁶ iterations. Left: in the plane (𝑥, 𝑦). Center: in the space (𝑥, 𝑦, 𝑧).
Right: in the plane (𝑥, 𝑧).
Figure 15.13: Clifford attractor. Left: 𝑎 = −1.4, 𝑏 = 1.6, 𝑐 = 1, 𝑑 = 0.3, (𝑥0 , 𝑦0 ) = (𝜋/2, 𝜋/2)
after 1.5 × 10⁶ iterations. Center: 𝑎 = −1.4, 𝑏 = 1.6, 𝑐 = 1, 𝑑 = 0.7, (𝑥0 , 𝑦0 ) = (𝜋/2, 𝜋/2)
after 1.5 × 10⁶ iterations. Right: 𝑎 = −1.4, 𝑏 = 1.6, 𝑐 = 1, 𝑑 = −0.1, (𝑥0 , 𝑦0 ) = (𝜋/2, 𝜋/2)
after 2 × 10⁶ iterations.
The Ikeda attractor is a dynamical system model that is used to describe a mapping
in the complex plane, corresponding to the plane-wave interactivity in an optical
ring laser. Its discrete-time version is defined by the following complex map:
𝑧𝑛+1 = 𝑎 + 𝑏𝑧𝑛 exp(𝑖(𝑘 − 𝑝/(1 + |𝑧𝑛 |²))), (15.14)

where 𝑧𝑛 = 𝑥𝑛 + 𝑖𝑦𝑛 .
The resulting orbit of the map (15.14) is generally visualized by plotting 𝑧 in the
real-imaginary plane (𝑥, 𝑦), also called the phase-plot. In R, the orbit of the Ikeda
attractor can be obtained using Listing 15.15. Figure 15.14 (left), produced using
Listing 15.15, shows a representation of the Ikeda’s attractor in the plane (𝑥, 𝑦).
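A compact sketch of iterating (15.14) directly with R's complex arithmetic, using the parameter values of Figure 15.14 (left):

a <- 0.85; b <- 0.9; k <- 0.4; p <- 7.7; N <- 1e5
z <- 0 + 0i
xy <- matrix(NA_real_, N, 2)
for (n in 1:N)
{
  z <- a + b*z*exp(1i*(k - p/(1 + Mod(z)^2)))   # the Ikeda map (15.14)
  xy[n, ] <- c(Re(z), Im(z))
}
plot(xy, pch = ".", cex = 0.5, axes = FALSE, ann = FALSE)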
The Peter de Jong attractor is a well-known strange attractor, and its time-discrete
version is defined by the following system:
Figure 15.14: Left: Ikeda attractor for 𝑎 = 0.85, 𝑏 = 0.9, 𝑘 = 0.4, 𝑝 = 7.7, 𝑧0 = 0 after
1.5 × 10⁶ iterations. Center: de Jong attractor (15.16) for 𝑎 = 1.4, 𝑏 = 1.56, 𝑐 = 1.4, 𝑑 = −6.56,
(𝑥0 , 𝑦0 ) = (0, 0) after 1.5 × 10⁶ iterations. Right: de Jong attractor (15.15) for 𝑎 = 2.01,
𝑏 = −2, 𝑐 = 2, 𝑑 = −2, (𝑥0 , 𝑦0 ) = (0, 0) after 1.5 × 10⁶ iterations.
In R, the orbit of the Peter de Jong attractor can be obtained using Listing 15.16.
Figure 15.14 (center), produced using Listing 15.16, shows a representation of the de Jong
attractor (15.16) in the plane (𝑥, 𝑦) for 𝑎 = 1.4, 𝑏 = 1.56, 𝑐 = 1.4, 𝑑 = −6.56,
(𝑥0 , 𝑦0 ) = (0, 0) after 1.5 × 10⁶ iterations. Figure 15.14 (right), also produced using
Listing 15.16, shows a representation of the de Jong attractor (15.15) for 𝑎 = 2.01,
𝑏 = −2, 𝑐 = 2, 𝑑 = −2, (𝑥0 , 𝑦0 ) = (0, 0) after 1.5 × 10⁶ iterations.
# de Jong map variant (cf. system (15.16))
deJong2 <- function(a, b, c, d, x0, y0, MaxIter)
{
  x <- array(dim = MaxIter); y <- array(dim = MaxIter)
  x[1] <- x0; y[1] <- y0
  for (k in 2:MaxIter)
  {
    x[k] <- d*sin(a*x[k - 1]) - sin(b*y[k - 1])
    y[k] <- c*cos(a*x[k - 1]) + cos(b*y[k - 1])
  }
  xy <- cbind(x, y)
  return(xy)
}
# deJong1(), the variant for system (15.15), is assumed to be defined
# analogously to deJong2() in the full listing
a <- 1.4; b <- -2.3; c <- 2.4; d <- -2.1; x0 <- 0; y0 <- 0; MaxIter <- 1.5e6
xy <- deJong1(a, b, c, d, x0, y0, MaxIter)
plot(xy[, 1], xy[, 2], pch = 17, cex = .015, col = "royalblue1", axes = FALSE,
     ann = FALSE)
The Rössler attractor [157] is a dynamical system that has some applications in the
field of electrical engineering [113]. It is defined by the following equations:
𝑑𝑥/𝑑𝑡 = −𝑦 − 𝑧,
𝑑𝑦/𝑑𝑡 = 𝑥 + 𝑎𝑦, (15.17)
𝑑𝑧/𝑑𝑡 = 𝑏 + 𝑧(𝑥 − 𝑐),
where, 𝑎, 𝑏, and 𝑐 are the parameters of the attractor. This attractor is known to
have some chaotic behavior for certain values of the parameters.
In R, the system (15.17) can be solved, and its results visualized, using
Listing 15.17. Figure 15.15, produced using Listing 15.17, shows some visualizations of
the Rössler attractor for different values of its parameters and initial conditions.
Figure 15.15 (left) shows the output of Listing 15.17 for 𝑎 = 0.5, 𝑏 = 2, 𝑐 = 4,
(𝑥0 , 𝑦0 , 𝑧0 ) = (0.3, 0.4, 0.5), and 𝑑𝑡 = 0.03 after 2 × 10⁶ iterations. Figure 15.15 (center)
shows the output of Listing 15.17 for 𝑎 = 0.5, 𝑏 = 2, 𝑐 = 4, (𝑥0 , 𝑦0 , 𝑧0 ) =
(0.03, 0.04, 0.04), and 𝑑𝑡 = 0.03 after 2 × 10⁶ iterations. Figure 15.15 (right) shows the
output of Listing 15.17 for 𝑎 = 0.2, 𝑏 = 0.2, 𝑐 = 5.7, (𝑥0 , 𝑦0 , 𝑧0 ) = (0.03, 0.04, 0.04),
and 𝑑𝑡 = 0.08 after 2 × 10⁶ iterations.
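A minimal sketch of such a computation with deSolve (the parameter 𝑐 is renamed cpar in the code to avoid confusion with R's c() function; all names are our own):

library(deSolve)
roessler <- function(t, state, parms)
{
  with(as.list(c(state, parms)), {
    list(c(-y - z,              # dx/dt
           x + a*y,             # dy/dt
           b + z*(x - cpar)))   # dz/dt
  })
}
out <- ode(y = c(x = 0.3, y = 0.4, z = 0.5), times = seq(0, 500, by = 0.03),
           func = roessler, parms = c(a = 0.5, b = 2, cpar = 4))
plot(out[, "x"], out[, "y"], type = "l")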
Figure 15.15: Rössler attractor. Left: 𝑎 = 0.5, 𝑏 = 2, 𝑐 = 4, (𝑥0 , 𝑦0 , 𝑧0 ) = (0.3, 0.4, 0.5),
𝑑𝑡 = 0.03, after 2 × 10⁶ iterations. Center: 𝑎 = 0.5, 𝑏 = 2, 𝑐 = 4, (𝑥0 , 𝑦0 , 𝑧0 ) = (0.03, 0.04, 0.04),
𝑑𝑡 = 0.03, after 2 × 10⁶ iterations. Right: 𝑎 = 0.2, 𝑏 = 0.2, 𝑐 = 5.7, (𝑥0 , 𝑦0 , 𝑧0 ) = (0.03, 0.04, 0.04),
𝑑𝑡 = 0.08, after 2 × 10⁶ iterations.
15.7 Fractals
There exist various definitions of the word “fractal”; the simplest is the one
suggested by Benoit Mandelbrot [125], who refers to a “fractal” as an object that
possesses self-similarity. In this section, we provide examples of implementations
of some classical fractal objects, using R.
The Sierpiński carpet and triangle are geometric fractals named after Wacław
Sierpiński, who introduced them in the early twentieth century [172]. The Sierpiński
carpet can be constructed using the following iterative steps:
Step 1: Set 𝑥0 = 0, 𝑛 = 1, and choose the number of iterations 𝑁 .
Step 2: If 𝑛 ≤ 𝑁 , then the following applies:
– Set
  𝑥𝑛 = [ 𝑥𝑛−1 𝑥𝑛−1 𝑥𝑛−1 ; 𝑥𝑛−1 𝐼𝑛−1 𝑥𝑛−1 ; 𝑥𝑛−1 𝑥𝑛−1 𝑥𝑛−1 ]
  (a 3 × 3 block matrix, written row by row);
– Set 𝑛 = 𝑛 + 1 and repeat Step 2.
Otherwise, go to Step 3.
Step 3: Plot the points in the final matrix 𝑥𝑁 .
The construction and visualization of the Sierpiński carpet can be carried out, in R,
using Listing 15.18. Figure 15.16 (left), which is an output of Listing 15.18, shows
the visualization of the Sierpiński carpet after six iterations.
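One possible compact implementation of the block construction in Step 2 is the following sketch (the function name is our own; cells belonging to the carpet are marked with 1, removed cells with 0):

sierpinskycarpet <- function(N)
{
  x <- matrix(1, 1, 1)
  for (n in 1:N)
  {
    H <- matrix(0, nrow(x), ncol(x))   # the removed middle block
    x <- rbind(cbind(x, x, x), cbind(x, H, x), cbind(x, x, x))
  }
  x
}
M <- sierpinskycarpet(5)               # a 3^5 x 3^5 matrix
image(M, col = c("white", "firebrick"), axes = FALSE)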
The Sierpiński triangle can be constructed using the following iterative steps:
Step 1: Select three points (vertices of the triangle) in a two-dimensional plane. Let
us call them 𝑥𝑎 , 𝑥𝑏 , 𝑥𝑐 ;
Plot the points 𝑥𝑎 , 𝑥𝑏 , 𝑥𝑐 ;
Choose the number of iterations 𝑁 ;
Step 2: Select an initial point 𝑥0 . Set 𝑛 = 1;
Step 3: If 𝑛 ≤ 𝑁 , then do the following:
– Select one of the three vertices {𝑥𝑎 , 𝑥𝑏 , 𝑥𝑐 } at random, and let us call this
point 𝑝𝑛 ;
– Calculate the point 𝑥𝑛 = (𝑥𝑛−1 + 𝑝𝑛 )/2, and plot 𝑥𝑛 ;
– Set 𝑛 = 𝑛 + 1 and go to Step 3.
In R, the construction and the visualization of the Sierpiński triangle can be achieved
using Listing 15.19. Figure 15.16 (center), which is an output of Listing 15.19, shows
the visualization of the Sierpiński triangle after 5e+5 iterations.
# The function header is missing in this excerpt of Listing 15.19 and is
# reconstructed from the call below
sierpinskytriangle <- function(MaxIter)
{
x<-array(0,dim=MaxIter); y<-x
for (i in 2:MaxIter)
{
c=sample (1:3,1)
if (c==1)
{
x[i]<-0.5*x[i-1]
y[i]<-0.5*y[i-1]
}
if (c==2)
{
x[i]<-0.5*x[i-1]+.25
y[i]<-0.5*y[i-1]+sqrt (3)/4
}
if (c==3)
{
x[i]<-0.5*x[i-1]+.5
y[i]<-0.5*y[i-1]
}
}
xy<-cbind(x,y)
return(xy)
}
MaxIter<-0.5e6
xy<-sierpinskytriangle (MaxIter)
plot(xy[ ,1], xy[ ,2], pch=17, cex=.04, col="firebrick", axes=F,
ann=F)
Named after the mathematician who introduced it, the Barnsley fern [11] is a fractal,
which can be constructed using the following iterative process:
Step 1: Set 𝑎 = (0, 0.85, 0.2, −0.15), 𝑏 = (0, 0.04, −0.26, 0.28), 𝑐 = (0, −0.04, 0.23,
0.26), 𝑑 = (0.16, 0.85, 0.22, 0.24), 𝑒 = (0, 0, 0, 0), 𝑓 = (0, 1.6, 1.6, 0.44);
Choose the number of iterations 𝑁 .
Step 2: Set 𝑥0 = 0, 𝑦0 = 0, and 𝑛 = 1;
Step 3: If 𝑛 ≤ 𝑁 , then do the following:
– Select at random a value 𝑟 ∈ (0, 1),
– If 𝑟 < 0.01, then set 𝑗 = 1 and go to Step 4,
– If 0.01 ≤ 𝑟 < 0.86, then set 𝑗 = 2 and go to Step 4,
– If 0.86 ≤ 𝑟 < 0.93, then set 𝑗 = 3 and go to Step 4,
– If 0.93 ≤ 𝑟, then set 𝑗 = 4 and go to Step 4.
Step 4: Set 𝑥𝑛 = 𝑎𝑗 × 𝑥𝑛−1 + 𝑏𝑗 × 𝑦𝑛−1 + 𝑒𝑗 and 𝑦𝑛 = 𝑐𝑗 × 𝑥𝑛−1 + 𝑑𝑗 × 𝑦𝑛−1 + 𝑓𝑗 , where
𝑎𝑗 , 𝑏𝑗 , 𝑐𝑗 , 𝑑𝑗 , 𝑒𝑗 , 𝑓𝑗 denote the 𝑗th components of the vectors 𝑎, 𝑏, 𝑐, 𝑑, 𝑒, and 𝑓 ,
respectively.
Set 𝑛 = 𝑛 + 1 and go to Step 3;
Step 5: Plot the points of the sequence (𝑥0 , 𝑦0), (𝑥1 , 𝑦1 ), . . . , (𝑥𝑁 , 𝑦𝑁 ).
In R, the construction and the visualization of the Barnsley fern can be achieved
using Listing 15.20. Figure 15.16 (right), which is an output of Listing 15.20, shows
the visualization of the Barnsley fern after 1e+6 iterations.
# Excerpt from Listing 15.20: the update of Step 4 inside the main loop,
# applied after the index j has been selected according to Step 3
x[i+1]=a[j]*x[i]+b[j]*y[i]+e[j];
y[i+1]=c[j]*x[i]+d[j]*y[i]+f[j];
}
xy<-cbind(x,y)
return(xy)
}
MaxIt<-1e6
xy<-barnsleyfern(MaxIt)
Lab.palette <- colorRampPalette(c("firebrick","red","orange"),
space = "Lab")
plot(xy[ ,1], xy[ ,2], pch=17, cex=.09, col=Lab.palette (256) ,
axes=F, ann=F)
Figure 15.16: Left: The Sierpiński carpet. Center: The Sierpiński triangle. Right: The Barnsley
fern.
# fzn() is the iteration map used below; its definition is not shown in this
# excerpt, but for the quadratic Julia sets of Figures 15.17-15.19 it is
# assumed to be fzn(z, c, n) = z^n + c with n = 2
fzn <- function(z, c, n) { z^n + c }
JuliaSet <- function(n, c, Nx, Ny, MaxIter)
{
Matz<-array(0, dim=c(Nx ,Ny ,2)); MatzColor<-array(0,
dim=c(Nx ,Ny ,3))
scalexy<-5/4; xmin <- -scalexy*4/3; xmax <- scalexy*4/3
ymin <- -scalexy; ymax <- scalexy;
for (k in 1:MaxIter)
{
for (j in 1:Ny)
{
y <- ymin + j*(ymax - ymin)/(Ny - 1)
for (i in 1:Nx)
{
x <- xmin + i*(xmax - xmin)/(Nx - 1)
if (k==1)
z<- complex(real=x,imaginary=y)
else
z<- complex(real=Matz[i,j,1], imaginary=Matz[i,j,2])
z<-fzn(z, c, n)
Matz[i,j,1]=Re(z);
Matz[i,j,2]=Im(z);
# Examples of coloring patterns
scalecolor<-20;
MatzColor[i,j,1] <- abs(cos(scalecolor*abs(z)));
MatzColor[i,j,2] <- abs(cos(scalecolor*Arg(z)));
MatzColor[i,j,3] <- abs(cos(scalecolor*sqrt(abs(z))));
}
}
}
return(MatzColor)
}
Figure 15.17: Quadratic Julia sets. Left: 𝑐 = 0.7; Center: 𝑐 = −0.074543 + 0.11301𝑖; Right:
𝑐 = 0.770978 + 0.08545𝑖.
Figure 15.18: Quadratic Julia sets. Left: 𝑐 = 0.7. Center: 𝑐 = −0.74543 + 0.11301𝑖. Right:
𝑐 = 0.770978 + 0.08545𝑖.
The Mandelbrot set [125] is the set of all 𝑐 ∈ C, such that the sequence 𝑧𝑛 defined
by the recurrence relationship (15.19) is bounded.
𝑧0 = 0,
𝑧𝑛+1 = 𝑧𝑛^𝑚 + 𝑐, 𝑚 ∈ ℕ. (15.19)
Figure 15.19: Quadratic Julia sets. Left: 𝑐 = −1.75. Center: 𝑐 = −𝑖. Right: 𝑐 = −0.835 −
0.2321𝑖.
# Core loop of the Mandelbrot listing: the enclosing function header, the
# grid bounds (xmin, xmax, ymin, ymax), and the initialization of Matz and
# MatzColor are omitted in this excerpt (they parallel the JuliaSet() listing
# above); fz(z, z0) realizes the recurrence (15.19), i. e., fz(z, z0) = z^m + z0
for (k in 1:MaxIter)
{
for (j in 1:Ny)
{
y <- ymin + j*(ymax - ymin)/(Ny - 1)
for (i in 1:Nx)
{
x <- xmin + i*(xmax - xmin)/(Nx - 1)
if (k==1)
z<- complex(real=x,imaginary=y)
else
z<- complex(real=Matz[i,j,1], imaginary=Matz[i,j,2])
scalecolor<-20;
MatzColor[i,j,1] <- abs(cos(scalecolor*abs(z+1/sqrt (3))));
MatzColor[i,j,2] <- abs(cos(scalecolor*sqrt(abs(z))));
z0<- complex(real=x,imaginary=y)
z<-fz(z, z0)
Matz[i,j,1]=Re(z);
Matz[i,j,2]=Im(z);
}
}
}
return(MatzColor)
}
15.8 Exercises
1. Consider the following dynamical system:
𝑧𝑡+1 − 𝑧𝑡 = 𝑧𝑡 (1 − 𝑧𝑡 ) for 𝑡 = 0, 1, 2, 3, . . .
Use R to simulate the dynamics of 𝑧𝑡 using the initial conditions 𝑧0 = 0.2 and
𝑧0 = 5 for 𝑡 = 1, . . . , 500.
Plot the corresponding cobweb graph, as well as the graph of the evolution of
𝑧𝑡 over time.
3. Let 𝑥𝑛 be the number of fish in generation 𝑛 in a lake. The evolution of the fish
population can be modeled using the following model:
Use R to simulate the dynamics of the fish population using the initial conditions
𝑥0 = 1 and 𝑥0 = log(8) for 𝑛 = 1, . . . , 500.
Plot the corresponding cobweb graph, as well as the graph of the dynamics of
the population number, over time.
4. Consider the following predator–prey model for the populations 𝑥 and 𝑦:
𝑑𝑥/𝑑𝑡 = 𝐴𝑥 − 𝐵𝑥𝑦,
𝑑𝑦/𝑑𝑡 = −𝐶𝑦 + 𝐷𝑥𝑦. (15.22)
Use R to solve the system (15.22) using the following initial conditions and values
of the parameters for 𝑡 ∈ [0, 200]:
(a) 𝑥(0) = 81, 𝑦(0) = 18, 𝐴 = 1.5, 𝐵 = 1.1, 𝐶 = 2.9, 𝐷 = 1.2;
(b) 𝑥(0) = 150, 𝑦(0) = 81, 𝐴 = 5, 𝐵 = 3.1, 𝐶 = 1.9, 𝐷 = 2.1.
Plot the corresponding solutions in the phase plane (𝑥, 𝑦), and the evolution of
the population of both species over time.
5. Use R to plot, in 3D, the Lorenz system (15.12) using the parameters
𝑎 = 15, 𝑟 = 32, 𝑏 = 3, and the following initial conditions: 𝑥1 (0) = 0.03,
𝑥2 (0) = 0.03, 𝑥3 (0) = 0.03; and 𝑥1 (0) = 0.5, 𝑥2 (0) = 0.21, 𝑥3 (0) = 0.55.
16 Graph theory and network analysis
This chapter provides a mathematical introduction to networks and graphs. To facilitate
this introduction, we focus on basic definitions and highlight basic properties
of the defining components of networks. In addition to quantifying network measures for
complex networks, e. g., distance- and degree-based measures, we also survey some
important graph algorithms, including breadth-first search and depth-first search.
Furthermore, we discuss different classes of networks and graphs that find widespread
applications in biology, economics, and the social sciences [23, 10, 53].
16.1 Introduction
A network 𝐺 = (𝑉, 𝐸) consists of nodes 𝑣 ∈ 𝑉 and edges 𝑒 ∈ 𝐸, see [94]. Often,
an undirected network is called a graph, but in this chapter we will not distinguish
between a network and a graph and use both terms interchangeably. In Figure 16.1,
we show some examples for undirected and directed networks. The networks shown
on the left-hand side are called undirected networks, whereas those on the right-
hand side are called directed networks since each edge has a direction pointing from
one node to another. Furthermore, all four networks depicted in Figure 16.1 are
connected [94], i. e., every pair of vertices is joined by a path. For example, removing the
edge between the two nodes of an undirected network with only two vertices leaves
merely two isolated nodes, and the resulting network is disconnected.
Weighted networks are obtained by assigning weights to each edge. Figure 16.2
depicts two weighted, undirected networks (left) and two weighted, directed networks
(right). A weight between two vertices, 𝑤𝐴𝐵 , is usually a real number. The range
of these weights depends on the application context. For example, 𝑤𝐴𝐵 could be a
positive real number indicating the distance between two cities, or two goods in a
warehouse [156].
From the examples above, it becomes clear that there exist many different
graphs with a given number of vertices. We call two graphs isomorphic if they have
the same structure, even though they may look different [94].
In general, graphs or networks can be analyzed by using quantitative and qual-
itative methods [52]. For instance, a quantitative method to analyze graphs is a
graph measure to quantify structural information [52]. In this chapter, we focus on
quantitative techniques and in Section 16.3 we present important examples thereof.
To define a network formally, we specify its set of vertices or nodes, 𝑉 , and its set
of edges, 𝐸. That means, any vertex 𝑖 ∈ 𝑉 is a node of the network. Similarly,
any element 𝐸𝑖𝑗 ∈ 𝐸 is an edge of the network, which means that the vertices 𝑖
and 𝑗 are connected with each other. Figure 16.3 shows an example of a network
with 𝑉 = {1, 2, 3, 4, 5} and 𝐸 = {𝐸12 , 𝐸23 , 𝐸34 , 𝐸14 , 𝐸35 }. For example, node 3 ∈ 𝑉
and edge 𝐸34 are part of the network shown by Figure 16.3. From Figure 16.3, we
further see that node 3 is connected with node 4, but also, node 4 is connected with
node 3. For this reason, we call such an edge undirected. In fact, the graph shown
by Figure 16.3 is an undirected network. It is evident that in an undirected network
the symbol 𝐸𝑖𝑗 has the same meaning as 𝐸𝑗𝑖 , because the order of the nodes in this
network is not important.
In an undirected network, each edge 𝐸𝑖𝑗 can thus be regarded as a subset of 𝑉
with 2 elements. The size of 𝐺 is the cardinality of the node set 𝑉 , and is often
denoted by |𝑉 |. The notation |𝐸| stands for the number of edges in the network.
From Figure 16.3, we see that this network has 5 vertices (|𝑉 | = 5) and 5 edges
(|𝐸| = 5).
In order to encode a network by a mathematical representation, we use a
matrix. The adjacency matrix 𝐴 is a square matrix with |𝑉 | rows and |𝑉 | columns.
The matrix elements 𝐴𝑖𝑗 of the adjacency matrix provide the connectivity of a
network:
𝐴𝑖𝑗 = 1 if 𝑖 is connected with 𝑗 in 𝐺, and 𝐴𝑖𝑗 = 0 otherwise, (16.1)

for 𝑖, 𝑗 ∈ 𝑉 .
Since this network is undirected, its adjacency matrix is symmetric, that means
𝐴𝑖𝑗 = 𝐴𝑗𝑖 holds for all 𝑖 and 𝑗.
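For illustration, the network of Figure 16.3 can be created with the igraph package, and the symmetry of its adjacency matrix checked directly (a sketch; the edge list follows the edge set given above):

library(igraph)
g <- graph(c(1,2, 2,3, 3,4, 1,4, 3,5), n = 5, directed = FALSE)
A <- as_adjacency_matrix(g, sparse = FALSE)
isSymmetric(A)   # TRUE, since the network is undirected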
From the previous discussions, we see that the graphical visualization of a network
is not determined by its definition. This is illustrated in Figure 16.4, where we show
the same network as in Figure 16.3, but with different positions of the vertices.
When comparing their adjacency matrix (16.2), one can see that these networks are
identical. In general, a network represents a topological object instead of a geometrical
one. This means that we can arbitrarily deform the network visually, as long as 𝑉
and 𝐸 remain unchanged, as shown in Figure 16.4. Therefore, the formal definition
of a given network does not include any geometric information about the coordinates
at which the vertices are positioned in a plane, nor features such as edge length
and bendiness. In order to highlight this issue, we included a Cartesian coordinate
system in the right panel of Figure 16.4 when drawing the graph. The good news
is that, as long as we do not require a visualization of a network, the topological
information about it is sufficient to conduct any possible analysis.
In contrast, from Figure 16.3 and Figure 16.4, we can see that the visualization of
a network is not unique and for a specific visualization often additional information
is utilized. This information could either be motivated by certain structural aspects
of the network we are trying to visualize, e. g., properties of vertices or edges (see
Section 16.3.1), or even by domain-specific information (e. g., from biology or
economics). An important consequence of the “arbitrariness” of a network visualization is
that there is no formal mapping from 𝐺 to its visualization.
We will start this section with some basic definitions for directed networks.
Definition 16.2.3. A directed network, 𝐺 = (𝑉, 𝐸), is defined by a vertex set 𝑉 and
an edge set 𝐸 ⊆ 𝑉 × 𝑉 .
𝐸 ⊆ 𝑉 × 𝑉 means that the directed edges of 𝐺 form a subset of all possible
ordered pairs of vertices. The expression 𝑉 × 𝑉 is a Cartesian product, and the
corresponding result is the set of all possible directed edges. If 𝑢, 𝑣 ∈ 𝑉 , then we write (𝑢, 𝑣) to
express that there exists a directed edge from 𝑢 to 𝑣.
The definition of the adjacency matrix of a directed graph is very similar to that
of an undirected graph: 𝐴𝑖𝑗 = 1 if there exists a directed edge from 𝑖 to 𝑗 in 𝐺, and
𝐴𝑖𝑗 = 0 otherwise, for 𝑖, 𝑗 ∈ 𝑉 .
In contrast with equation (16.1), here, we choose the start vertex (𝑖) and the
end vertex (𝑗) of a directed edge. Figure 16.5 presents a directed network with the
Here, we can see that 𝐴ᵀ ≠ 𝐴. Therefore, the transpose of the adjacency matrix 𝐴
of a directed graph is not always equal to 𝐴.
For example, the edge set of the directed network, depicted in Figure 16.5, is 𝐸 =
{(2, 1), (2, 3), (4, 1), (3, 4), (3, 5)}.
Now, we define a weighted, directed network. Its adjacency matrix is given by
𝑊𝑖𝑗 = 𝑤𝑖𝑗 if there exists a directed edge from 𝑖 to 𝑗 in 𝐺, and 𝑊𝑖𝑗 = 0 otherwise, (16.5)
for 𝑖, 𝑗 ∈ 𝑉 .
In equation (16.5), 𝑤𝑖𝑗 ∈ R denotes the weight associated with an edge from
vertex 𝑖 to vertex 𝑗.
Figure 16.6 depicts the weighted, directed network with the following adjacency
matrix:

      ( 0 0 0 0 0 )
      ( 2 0 1 0 0 )
  𝑊 = ( 0 0 0 3 3 ) .   (16.6)
      ( 1 0 0 0 0 )
      ( 0 0 0 0 0 )
From the adjacency matrix 𝑊 , we can identify the following (real) weights: 𝑤21 = 2,
𝑤23 = 1, 𝑤34 = 3, 𝑤35 = 3, 𝑤41 = 1.
Definition 16.2.7. A path 𝑃 is a special walk, where all the edges and all the vertices
are different.
In a directed graph, a closed path is also called a cycle.
In the lower graph on the right-hand side of Figure 16.7, the path 23, 34, 41 has length 3,
but does not represent a cycle, as its start and end vertices are not the same.
Now, we define the term distance between vertices in a network.
Definition 16.2.8. A shortest path is a path of minimum length connecting two vertices.
Definition 16.2.9. The number of edges in the shortest path connecting the vertices
𝑢 and 𝑣 is the topological distance 𝑑(𝑢, 𝑣).
Again, we consider the upper graph on the right hand side of Figure 16.7. For
instance, the path 12, 23, 34, for going from vertex 1 to vertex 4 has length 3 and is
obviously not the shortest one. Calculating the shortest path yields 𝑑(1, 4) = 1.
𝑃 (𝑘) := 𝛿𝑘 /𝑁 , (16.7)

where 𝛿𝑘 denotes the number of vertices in the network with degree 𝑘.
The clustering coefficient, 𝐶𝑖 , is a local measure [198] defined, for a particular vertex
𝑣𝑖 , as follows:
𝐶𝑖 = 2𝑒𝑖 /(𝑛𝑖 (𝑛𝑖 − 1)) = 𝑒𝑖 /𝑡𝑖 . (16.9)
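In R, the local clustering coefficients 𝐶𝑖 can be computed, e. g., with igraph's transitivity() (a sketch on a small example graph of our own):

library(igraph)
g <- graph(c(1,2, 2,3, 3,1, 3,4), directed = FALSE)
transitivity(g, type = "local")   # one value C_i per vertex (NaN if k_i < 2)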
Path- and distance-based measures have been proven useful, especially when charac-
terizing networks [64, 104]. For example, the average path length and the diameter
of a network have been used to characterize classes of biological and technical net-
works, see [64, 104, 196]. An important finding is that the average path lengths and
diameters of certain biological networks are rather small compared to the size of a
network, see [115, 128, 143].
In the following, we briefly survey important path and distance-based network
measures, see [29, 31, 93, 94, 174]. Starting from a network 𝐺 = (𝑉, 𝐸), we define
the distance matrix as follows:
(𝑑(𝑣𝑖 , 𝑣𝑗 ))𝑣𝑖 ,𝑣𝑗 ∈𝑉 , (16.10)
Definition 16.3.4. The average path length of 𝐺 is

𝑑̄(𝐺) := (2/(𝑁 (𝑁 − 1))) ∑_{1≤𝑖<𝑗≤𝑁} 𝑑(𝑣𝑖 , 𝑣𝑗 ). (16.11)
We also define other well-known distance-based graph measures [94] that have
been used extensively in various disciplines [57, 197].
These graph measures have been investigated extensively by social scientists for
analyzing the communication within groups of people [80, 81, 197]. For instance, it
could be interesting to know how important or distinct vertices, e. g., representing
persons, in social networks are [197]. In the context of social networks, importance
can be seen as centrality. Following this idea, numerous centrality measures [92,
197] have been developed to determine whether vertices, e. g., representing persons,
may act distinctly with respect to the communication ability in these networks. In
this section, we briefly review the most important centrality measures, see [80, 81,
197].
𝐶𝐷 (𝑣) = 𝑘𝑣 , (16.15)
When analyzing directed networks, the degree centrality can be defined straight-
forwardly by utilizing the definition of the in-degree and out-degree [94]. Now, let
us define the well-known betweenness centrality measure [80, 81, 159, 197].
𝐶𝐵 (𝑣𝑘 ) = ∑_{𝑣𝑖 ,𝑣𝑗 ∈𝑉, 𝑣𝑖 ≠𝑣𝑗 } 𝜎𝑣𝑖 𝑣𝑗 (𝑣𝑘 )/𝜎𝑣𝑖 𝑣𝑗 , (16.16)
where, 𝜎𝑣𝑖 𝑣𝑗 stands for the number of shortest paths from 𝑣𝑖 to 𝑣𝑗 , and 𝜎𝑣𝑖 𝑣𝑗 (𝑣𝑘 )
for the number of shortest paths from 𝑣𝑖 to 𝑣𝑗 that include 𝑣𝑘 .
The ratio

𝜎𝑣𝑖 𝑣𝑗 (𝑣𝑘 )/𝜎𝑣𝑖 𝑣𝑗 (16.17)

can be seen as the probability that 𝑣𝑘 lies on a shortest path connecting 𝑣𝑖 with 𝑣𝑗 .
A further well-known measure of centrality is called closeness centrality.
𝐶𝐶 (𝑣𝑘 ) = 1 / ∑_{𝑖=1}^{𝑁} 𝑑(𝑣𝑘 , 𝑣𝑖 ). (16.18)
The measure 𝐶𝐶 (𝑣𝑘 ) has often been used to determine how close a vertex is to
other vertices in a given network [197].
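All three centrality measures are readily available in igraph (a sketch on a random example graph):

library(igraph)
g <- sample_gnp(20, 0.2)
degree(g)        # degree centrality, (16.15)
betweenness(g)   # betweenness centrality, (16.16)
closeness(g)     # closeness centrality, (16.18)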
Many graph problems require finding or visiting certain distinct vertices. An example
thereof is to visit all vertices of an input graph in a tree-like hierarchy, obtained by
selecting an arbitrary root vertex in the input graph.
algorithms for performing graph-based searches are the breadth-first and depth-first
algorithms, see [38].
Breadth-first search (BFS) is a well-known and simple graph algorithm [38]. The
underlying principle of this algorithm relates to discovering all reachable vertices
and touching all edges systematically, starting from a given vertex 𝑠. After selecting
𝑠, all neighbors of 𝑠 are discovered, and so forth. Here, discovering vertices in a graph
involves determining the topological distance (see Definition 16.2.9) between 𝑠 and
all other reachable vertices.
Starting with a graph 𝐺 = (𝑉, 𝐸), the algorithm BFS uses colors in order to
symbolize the state of the vertices as follows:
– white: the unseen vertices are white; initially, all vertices are white;
– grey: the vertex is seen, but it needs to be determined whether it has white
neighbors;
– black: the vertex is processed, i. e., this vertex and all of its neighbors were seen.
Figure 16.9 shows an example by using a stack approach, where the colors are omit-
ted. The first graph in Figure 16.9 is the input graph. The start vertex is vertex 2.
The two stacks on the left hand side in each situation show the vertices, which have
already been visited along with their parents. For instance, we see that after four
steps of the algorithm (the fifth graph in the first row of Figure 16.9), we have
discovered 3 vertices, whose topological distance equals 1. Also, we see in the fifth
graph in the first row of Figure 16.9 that the vertices 1, 4, and 5 have been visited
together with their parent relations. Finally, all vertices have been visited in the last
graph in Figure 16.10 and, hence, BFS ends.
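In igraph, BFS, together with the distances and parent relations discussed above, can be run as follows (a sketch on a small example graph of our own):

library(igraph)
g <- graph(c(2,1, 2,4, 2,5, 1,3, 4,5), directed = FALSE)
res <- bfs(g, root = 2, father = TRUE, dist = TRUE)
res$dist     # topological distance of each vertex from the start vertex
res$father   # parent of each vertex in the BFS tree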
Depth-first search (DFS) is another graph algorithm for searching graphs [38]. Sup-
pose we start at a certain vertex. In case a vertex we visit has a still unexplored
neighbor, we visit this neighbor and pursue going in the depth to find another un-
explored neighbor, if it exists. We continue recursively with this procedure until we
cannot go deeper. Then, we perform backtracking to find another edge that goes
into the depth.
Figure 16.9: The first graph is the input graph to run BFS. The start vertex is vertex 2. The
steps are shown together with a stack showing the visited and parent vertices.
We explain the basic steps as follows: To start, we highlight all vertices as not found
(white). The basic strategy of DFS is as follows:
– Highlight the current vertex 𝑣 as found (grey)
– While there exists an edge {𝑣, 𝑢} with a not yet found neighbor 𝑢:
– Perform the search recursively from 𝑢, that is:
– Explore an edge {𝑢, 𝑤} and visit 𝑤; explore from 𝑤 in the depth until the search ends
– Highlight 𝑢 as finished (black)
– Perform backtracking from 𝑢 to 𝑣
– Highlight 𝑣 as finished (black)
Figure 16.10: The last two graphs when running BFS on the input graph shown in Figure 16.9.
Finally, we obtain all vertices, starting from the start vertex. Figure 16.11 shows an
input graph to run DFS. Then, Figure 16.11 shows the steps to explore the vertices
in the depth, starting from vertex 0. Figure 16.12 shows the last five graphs before
DFS ends.
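Analogously, DFS is available through igraph's dfs() (a sketch on a small example graph of our own):

library(igraph)
g <- graph(c(1,2, 1,3, 2,4, 3,5), directed = FALSE)
res <- dfs(g, root = 1, father = TRUE)
res$order    # the vertices in the order in which DFS discovers them
res$father   # parent of each vertex in the DFS tree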
Figure 16.11: The first graph is the input graph to run DFS. The start vertex is vertex 0. The
steps are shown together with a stack showing the visited and parent vertices.
– We create the set of shortest path trees (SPTS), containing the vertices that
are in a shortest path tree. These vertices have the property that they have
minimum distance from the starting vertex. Before starting, it holds SPTS = ∅.
– We assign initial distance values ∞ in the input graph. Also, we set the distance
value for the starting vertex equal to zero.
– While the vertex set of SPTS does not contain all vertices of the input graph,
the following applies:
– Select a vertex 𝑣 ∈ 𝑉 that is not contained in the vertex set of SPTS and has
minimum distance value;
– Add 𝑣 to SPTS, and update the distance values of all neighbors of 𝑣 that are
not yet contained in SPTS.
Figure 16.12: The last five graphs when running DFS on the input graph shown in Figure 16.11.
Now we demonstrate the application of this algorithm with an example. The input
graph is given in Figure 16.13 (A). The vertex set of SPTS is initially
empty, and we choose vertex 1 as the start node. The initial distance values can be seen
in Figure 16.13 (B). We perform some steps to see how the set of shortest paths
emerges; see Figure 16.14. The vertices highlighted in red are the ones in the shortest
path tree. The graph shown in situation D of Figure 16.14 is the final shortest path
tree, containing all vertices of the input graph in Figure 16.13. That means, the set
of shortest path trees gives all shortest paths from vertex 1 to all other vertices.
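In igraph, Dijkstra's algorithm is exposed through distances(); a sketch on a small weighted graph (the edge weights are our own):

library(igraph)
g <- graph(c(1,2, 1,3, 2,4, 3,4, 4,5), directed = FALSE)
E(g)$weight <- c(2, 1, 4, 3, 1)               # hypothetical edge weights
distances(g, v = 1, algorithm = "dijkstra")   # weighted distances from vertex 1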
As a remark, we would like to note that the graph shown in Figure 16.13 is a
weighted, undirected graph (see Section 16.2.3). So, using the algorithm of Dijkstra
[58] makes sense for edge-weighted graphs, as the shortest path between two vertices
of a graph depends on these weights. Interestingly, the shortest path problem be-
comes more simple if we consider unweighted networks. If all edges in a network are
unweighted, we may set all edge weights to 1. Then, Dijkstra’s algorithm reduces to
the search of the topological distances between vertices, see Definition 16.2.9.
Figure 16.13: (A) The input graph. (B) The graph with initial vertex weights.
Let us consider the graph A in Figure 16.15. In case we determine all shortest paths
from vertex 1 to vertex 4, we see that there exists more than one shortest path between
these two vertices. We find the shortest paths 1-3-4 and 1-2-4. So, the shortest path
problem does not possess a unique solution. The same holds when considering the
shortest paths between vertex 1 and vertex 5. The calculations yield the two shortest
paths 1-3-4-5 and 1-2-4-5.
Another example of calculating shortest paths in unweighted networks is given by
graph B in Figure 16.15. The shown graph 𝑃𝑛 is referred to as the path
graph [186] with 𝑛 vertices. We observe that there exist 𝑛 − 1 pairs of vertices with
𝑑(𝑢, 𝑣) = 1, 𝑛 − 2 pairs of vertices with 𝑑(𝑢, 𝑣) = 2, and so forth. Finally, we see that
there exists only 𝑛 − (𝑛 − 1) = 1 pair with 𝑑(𝑢, 𝑣) = 𝑛 − 1. Here, 𝑑(𝑢, 𝑣) = 𝑛 − 1 is
just the diameter of 𝑃𝑛 .
In Listing 16.1, we show an example of how shortest paths can be found using R.
For this example, we use a small-world network with 𝑛 = 25 nodes. The command
distances() gives only the length of paths, whereas the command shortest_paths()
provides one shortest path. In contrast, all_shortest_paths() returns all shortest
paths.
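A sketch of these three commands (the small-world generator, the seed, and the chosen vertices are our own):

library(igraph)
set.seed(1)
g <- sample_smallworld(dim = 1, size = 25, nei = 2, p = 0.1)
distances(g)                                    # lengths of all shortest paths
shortest_paths(g, from = 1, to = 25)$vpath      # one shortest path
all_shortest_paths(g, from = 1, to = 25)$res    # all shortest paths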
In Section 16.5, we will provide Definition 16.5.1, formally introducing what a tree
is. Informally, it is an acyclic and connected graph [94]. In this section, we discuss
spanning trees and the minimum spanning tree problem [15, 38].
Suppose we start with an undirected input graph 𝐺 = (𝑉𝐺 , 𝐸𝐺 ). A spanning
tree 𝑇 = (𝑉𝑇 , 𝐸𝑇 ) of 𝐺 is a tree, where 𝑉𝑇 = 𝑉𝐺 . In this case, we say that the tree
𝑇 spans 𝐺, as the vertex sets of the two graphs are the same and every edge of 𝑇
belongs to 𝐺.
Figure 16.16 shows an input graph 𝐺 with a possible spanning tree 𝑇 . It is
obvious, by definition, that there often exists more than one spanning tree of a
given graph 𝐺. The problem of determining spanning trees gets more complex if we
consider weighted networks. In case we start with an edge-labeled graph, one could
determine the so-called minimum spanning tree [38]. This can be achieved by adding
up the costs of all edge weights and, finally, searching for the tree with minimum cost
among all existing spanning trees. Again, the minimum spanning tree for a given
network is not unique. For instance, well-known algorithms to determine the mini-
mum spanning tree are due to Prim and Kruskal, see, e. g., [38]. We emphasize that
the application of those methods may result in different minimum spanning trees.
Here, we just demonstrate Kruskal’s algorithm [38] representing a greedy approach.
Let 𝐺 = (𝑉, 𝐸) be a connected graph with real edge weights. The main steps are
the following:
– We arrange the edges according to their weights in ascending order
– We add edges to the resulting minimum spanning tree as follows: we start with
the smallest weight and end with the largest weight by consecutively adding
edges according to their weight costs
– We only add the described edges if the process does not create a cycle
Figure 16.17 shows the sequence of steps when applying Kruskal’s algorithm to
the shown input graph 𝐺. We choose any subgraph with the smallest weight as
depicted in situation A. In B, we choose the next smallest edge, and so on. We
repeat this procedure according to the algorithmic steps above, skipping every edge
that would create a cycle. Note that intermediate trees can be disconnected (see C). One possible
minimum spanning tree is shown in situation E. Differences between the algorithms
due to Kruskal and Prim are explained in, e. g., [38].
In Listing 16.2, we show an example of how the minimum spanning tree can be
found using R. For this example, we use a small-world network with 𝑛 = 25 nodes.
The command mst() computes the underlying minimum spanning tree.
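A sketch (random edge weights are added for illustration):

library(igraph)
set.seed(1)
g <- sample_smallworld(dim = 1, size = 25, nei = 2, p = 0.1)
E(g)$weight <- runif(ecount(g))   # random edge weights
T1 <- mst(g)                      # a minimum spanning tree of g
plot(T1, vertex.size = 4, vertex.label = "")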
16.5.1 Trees
We start with the formal definition of a tree [94], already briefly introduced in Section
16.4.4.
In fact, there exist several characterizations for trees which are equivalent [100].
Special types of trees are rooted trees [94]. Rooted trees often appear in graph
algorithms, e. g., when performing a search or sorting [38].
Definition 16.5.2. A rooted tree is a tree containing one designated root vertex.
There is a unique path from the root vertex to all other vertices in the tree, and all
other vertices are directed away from the root.
Figure 16.18 presents a rooted tree, in which the root is at the very top of a tree,
whereas all other vertices are placed on some lower levels. The tree in Figure 16.18
is an unordered tree, that means, the order of the vertices is arbitrary. For instance,
the order of the green and orange vertex can be swapped.
Classes of rooted trees include ordered and binary-rooted trees [94].
Definition 16.5.3. An ordered tree is a rooted tree assigning a fixed order to the
children of each vertex.
Definition 16.5.4. A binary tree is an ordered tree, where each vertex has at most
two children.
Definition 16.5.7. A generalized tree as defined by Definition 16.5.5 has three edge
types [63]:
– Edges with |ℒ(𝑚) − ℒ(𝑛)| = 1 are called kernel edges (𝐸1 ).
– Edges with |ℒ(𝑚) − ℒ(𝑛)| = 0 are called cross edges (𝐸2 ).
– Edges with |ℒ(𝑚) − ℒ(𝑛)| > 1 are called up edges (𝐸3 ).
Note that for an ordinary rooted tree as defined by Definition 16.5.2, we always
obtain |ℒ(𝑚) − ℒ(𝑛)| = 1 for all pairs (𝑚, 𝑛). From the above given definitions and
the visualization in Figure 16.19, it is clear that a generalized tree is a tree-like graph
with a hierarchy, and may contain cycles.
Random networks have also been studied in many fields, including computer science
and network physics [183]. This class of networks is based on the seminal work of
Erdős and Rényi, see [76, 77].
By definition, a random graph with 𝑁 vertices can be obtained by connecting
every pair of vertices with probability 𝑝. Then, the expected number of edges for an
undirected random graph is given by
𝐸(𝑛) = 𝑝 𝑁 (𝑁 − 1)/2. (16.19)
In what follows, we survey important properties of random networks [59]. For
instance, the degree distribution of a vertex 𝑣𝑖 follows a binomial distribution,
𝑃 (𝑘𝑖 = 𝑘) = C(𝑁 − 1, 𝑘) 𝑝^𝑘 (1 − 𝑝)^(𝑁−1−𝑘) , (16.20)

where C(𝑁 − 1, 𝑘) denotes the binomial coefficient. This holds since the maximum
degree of the vertex 𝑣𝑖 is at most 𝑁 − 1; in fact, the probability that the vertex has 𝑘
given edges equals 𝑝^𝑘 (1 − 𝑝)^(𝑁−1−𝑘) , and there exist C(𝑁 − 1, 𝑘) possibilities to
choose these 𝑘 edges. This binomial distribution can be approximated by

𝑃 (𝑘𝑖 = 𝑘) ∼ 𝑧^𝑘 exp(−𝑧)/𝑘! . (16.21)
We emphasize that 𝑧 = 𝑝(𝑁 − 1) is the expected number of edges for a vertex. This
implies that if 𝑁 goes to infinity, the degree distribution of a vertex in a random
network can be approximated by the Poisson distribution. For this reason, random
networks are often referred to as Poisson random networks [142].
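This approximation can be checked empirically; the following sketch samples a random graph with igraph and compares its degree distribution with the Poisson distribution (16.21):

library(igraph)
set.seed(1)
N <- 5000; p <- 0.002
g <- sample_gnp(N, p)
z <- p*(N - 1)                          # expected degree
h <- table(factor(degree(g), levels = 0:25))/N
plot(0:25, as.numeric(h), type = "h")   # empirical degree distribution
points(0:25, dpois(0:25, z), col = "red", pch = 16)   # Poisson (16.21)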
In addition, one can demonstrate that the degree distribution of the whole random
network also approximately follows a Poisson distribution:

𝑃 (𝑋𝑘 = 𝑟) ∼ 𝑧^𝑟 exp(−𝑧)/𝑟! . (16.22)
This means that there exist 𝑋𝑘 = 𝑟 vertices in the network that possess degree 𝑘 [4].
As an application, we recall the already introduced clustering coefficient 𝐶𝑖 for
a vertex 𝑣𝑖 , represented by equation (16.9). In general, this quantity has been defined
as the number |𝐸𝑖 | of existing connections among the 𝑘𝑖 nearest neighbors of 𝑣𝑖 ,
divided by the total number of possible connections among them. This consideration
yields the following:

𝐶𝑖 = 2|𝐸𝑖 |/(𝑘𝑖 (𝑘𝑖 − 1)). (16.23)
Therefore, in a random graph, 𝐶𝑖 is the probability that two neighbors of 𝑣𝑖 are
connected with each other, i. e., 𝐶𝑖 = 𝑝. This can be approximated by

𝐶𝑖 ∼ 𝑧/𝑁 , (16.24)

since 𝑧 = 𝑝(𝑁 − 1) ≈ 𝑝𝑁 .
# Right graph
n <- 50
pc <- 0.1
g <- erdos.renyi.game(n, pc, type = "gnp", directed = FALSE,
                      loops = FALSE)
la <- layout.circle(g)   # the graph must exist before computing its layout
plot(g, layout = la, vertex.color = "blue", vertex.size = 4,
     vertex.label = "")
Small-world networks were introduced by Watts and Strogatz [198]. These networks
possess two interesting structural properties. Watts and Strogatz [198] found that
small-world networks have a high clustering coefficient and also a short (average)
distance among vertices. Small-world networks have been explored in several disci-
plines, such as network science, network biology, and web mining [190, 195, 203].
In the following, we present a procedure developed by Watts and Strogatz [198]
in order to generate small-world networks.
– To start, all vertices of the graph are arranged on a ring, and each vertex is connected
with its 𝑘/2 nearest neighbors on either side. Figure 16.21 (left) shows an example using 𝑘 = 4.
For each vertex, the connection to its next neighbor (1st neighbor) is highlighted
in blue and the connection to its second next neighbor (2nd neighbor) in red.
– Second, start with an arbitrary vertex 𝑖 and rewire its connection to its nearest
neighbor on, e. g., the right side with probability 𝑝𝑟𝑤 to any other vertex 𝑗 in
the network. Then, choose the next vertex in the ring in a clockwise direction
and repeat this procedure.
– Third, after all first-neighbor connections have been checked, repeat this proce-
dure for the second and all higher-order neighbors, if present, successively.
This algorithm guarantees that each connection occurring in the network is chosen
exactly once and rewired with probability 𝑝𝑟𝑤 . Hence, the rewiring probability, 𝑝𝑟𝑤 ,
controls the disorder of the resulting network topology. For 𝑝𝑟𝑤 = 0, the regular
topology is conserved, whereas 𝑝𝑟𝑤 = 1 results in a random network. Intermediate
values 0 < 𝑝𝑟𝑤 < 1 give a topological structure that is between these two extremes.
Figure 16.21 (right) shows an example of a small-world network generated with
the following R code:
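The code itself uses igraph's Watts–Strogatz generator; a sketch (parameter values are our own choice):

library(igraph)
g <- watts.strogatz.game(dim = 1, size = 50, nei = 2, p = 0.1)
la <- layout.circle(g)
plot(g, layout = la, vertex.color = "blue", vertex.size = 4,
     vertex.label = "")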
Figure 16.21: Small-world networks with 𝑝𝑟𝑤 = 0.0 (left) and 𝑝𝑟𝑤 = 0.10 (right). The two
rewired edges are shown in light blue and red.
– First, the adjacency matrix is initialized in a way that only the nearest 𝑘/2
neighbor vertices are connected. The order of the vertices is arbitrarily induced
by the labeling of the vertices from 1 to 𝑁 . This allows identifying, e. g., 𝑖 + 𝑓
as the 𝑓 th neighbor of vertex 𝑖 with 𝑓 ∈ N. For instance, 𝑓 = 1 corresponds to
the next neighbor of 𝑖. The modulo function is used to ensure that the neighbor
indices remain in the range {1, . . . , 𝑁 }. Due to this fact, the vertices can be
seen as organized on a ring. We would like to emphasize that for the algorithm
to work, the number of neighbors 𝑘 needs to be an even number.
– Second, each connection in the network is tested once if it should be rewired with
probability 𝑝𝑟𝑤 . To do this, a random number, 𝑐, between 0 and 1 is uniformly
sampled and tested in an if-clause. Then, if 𝑐 ≤ 𝑝𝑟𝑤 , a connection between vertex
𝑖 and 𝑖 + 𝑓 is rewired. In this case, we need first to remove the old connection
between these vertices and then draw a random integer, 𝑑, from {1, . . . , 𝑁 } ∖ {𝑖}
to select a new vertex to connect with 𝑖. We would like to note that in order to
avoid a self-connection of vertex 𝑖, we need to remove the index 𝑖 from the set
{1, . . . , 𝑁 }.
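The following compact sketch implements the two steps just described directly on the adjacency matrix (the function name and all variable names are our own):

smallworld <- function(N, k, prw)
{
  A <- matrix(0, N, N)
  for (i in 1:N)            # step 1: ring lattice with k/2 neighbors per side
    for (f in 1:(k/2))
    {
      j <- ((i + f - 1) %% N) + 1
      A[i, j] <- A[j, i] <- 1
    }
  for (i in 1:N)            # step 2: rewire each connection with probability prw
    for (f in 1:(k/2))
    {
      j <- ((i + f - 1) %% N) + 1
      if (A[i, j] == 1 && runif(1) <= prw)
      {
        A[i, j] <- A[j, i] <- 0
        d <- sample(setdiff(1:N, i), 1)   # new endpoint, avoiding a self-loop
        A[i, d] <- A[d, i] <- 1
      }
    }
  A
}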
Neither random nor small-world networks have a property frequently observed in
real-world networks, namely a scale-free behavior of the degrees [4],
𝑃 (𝑘) ∼ 𝑘 −𝛾 . (16.25)
To explain this common feature, Barabási and Albert introduced a model [8], now
known as the Barabási–Albert (BA) or preferential attachment model [142]. This model
results in so-called scale-free networks, which have a degree distribution following a
power law [8]. A major difference between the preferential attachment model and the
other algorithms, described above, for generating random or small-world networks is
that the BA model does not assume a fixed number of vertices, 𝑁 , and then rewires
them iteratively with a fixed probability, but in this model 𝑁 grows. Each newly
added vertex is connected with a certain probability (which is not constant) to other
vertices already present in the network. The attachment probability defined by
𝑝𝑖 = 𝑘𝑖 / ∑_𝑗 𝑘𝑗 (16.26)

is proportional to the degree 𝑘𝑖 of these existing vertices, explaining the name of the model.
This way, each new vertex is added to 𝑒 ∈ N existing vertices in the network.
Figure 16.22 presents two examples of scale-free networks generated using the
following R code:
# Right
n <- 1000
g <- barabasi.game(n, m = 1, directed = FALSE)
la <- layout.fruchterman.reingold(g)
plot(g, layout = la , vertex.color = "blue", vertex.size = 4,
vertex.label = "")
16.7 Summary
Despite the fact that graph theory is a mathematical subject, similar to linear algebra
and analysis, it has a close connection to practical applications. For this reason,
many real-world networks have been studied in many disciplines, such as chemistry,
computer science, and economics [64, 65, 143]. A possible explanation for this is provided
by the intuitive representation of many natural networks, e. g., transportation networks
of trains and planes, acquaintance networks between friends, or social networks
on Twitter or Facebook. Also, many attributes of graphs, e. g., paths or the degrees of
nodes, have a rather intuitive meaning. This motivates the widespread application
of graphs and networks in nearly all application areas. However, we have also seen in
this chapter that the analysis of graphs can be quite intricate, requiring a thorough
understanding of the previous chapters.
16.8 Exercises
1. Let 𝐺 = (𝑉, 𝐸) be a graph with 𝑉 = {1, 2, 3, 4, 5} and 𝐸 = {{1, 2}, {2, 4}, {1, 3},
{3, 4}, {4, 5}}. Use R to obtain the following results:
– Calculate all vertex degrees of 𝐺.
– Calculate all shortest paths of 𝐺.
– Calculate diam(𝐺).
– Calculate the number of cycles of 𝐺.
2. Generate 5 arbitrary trees with 10 vertices. Calculate their number of edges by
using R, and confirm |𝐸| = 10 − 1 = 9 for all 5 generated trees.
3. Let 𝐺 = (𝑉, 𝐸) be a graph with 𝑉 = {1, 2, 3, 4, 5, 6} and 𝐸 = {{1, 2}, {2, 4}, {1, 3},
{3, 4}, {4, 5}, {5, 6}}. Calculate the number of spanning trees for the given
graph, 𝐺.
4. Generate scale-free networks with the BA algorithm. Specifically, generate two
different networks, one for 𝑛 = 1000 and 𝑚 = 1 and one for 𝑛 = 1000 and 𝑚 = 3.
Determine for each network the degree distribution of the resulting network and
compare them with each other.
5. Generate small-world networks for 𝑛 = 2500. Determine the rewiring probability
𝑝𝑟𝑤 which separates small-world networks from random networks. Hint: Inves-
tigate the behavior of the clustering coefficient and the average shortest paths
graphically.
6. Identify practical examples of generalized trees by mapping real-world observa-
tions to this graph structure. Are the directories in a computer organized as a
tree or a generalized tree? Starting from your desktop and considering shortcuts,
does this change this answer?
17 Probability theory
Probability theory is a mathematical subject that is concerned with probabilistic
behavior of random variables. In contrast, all topics of the previous chapters in
Part III were concerned with the deterministic behavior of variables. Specifically, a
probability is a measure quantifying the likelihood that events will occur.
This significant difference between a deterministic and a probabilistic behavior
of variables indicates the importance of this field for statistics, machine learning, and
data science in general, as they all deal with the practical measurement or estimation
of probabilities and related entities from data.
This chapter introduces some basic concepts and key characteristics of proba-
bility theory, discrete and continuous distributions, and concentration inequalities.
Furthermore, we discuss the convergence of random variables, e. g., the law of large
numbers or the central limit theorem.
Example 17.1.1. If we toss a coin once, there are two possible outcomes. Either we
obtain a “head” (𝐻) or a “tail” (𝑇 ). Each of these outcomes is called an elementary
event, 𝜔𝑖 (or a sample point). In this case, the sample space is Ω = {𝐻, 𝑇 } =
{(𝐻), (𝑇 )}, or abstractly {𝜔1 , 𝜔2 }. Points in the sample space 𝜔 ∈ Ω correspond to
an outcome of a random experiment, and subsets of the sample space, 𝐴 ⊂ Ω, e. g.,
𝐴 = {𝑇 }, are events.
Example 17.1.2. If we toss a coin three times, the sample space is Ω = {(𝐻, 𝐻, 𝐻),
(𝑇, 𝐻, 𝐻), (𝐻, 𝑇, 𝐻), . . . , (𝑇, 𝑇, 𝑇 )}, and the elementary outcomes are triplets com-
posed of elements in {𝐻, 𝑇 }. It is important to note that the number of triplets in
Ω is the total number of different combinations. In this case the number of different
elements in Ω is 23 = 8.
From the second example, it is clear that although there are only two elementary
outcomes, i. e., 𝐻 and 𝑇 , the size of the sample space can grow by repeating such
base experiments.
Definition 17.2.3. The complement of a set 𝐴 with respect to the entire space Ω,
denoted 𝐴𝑐 or 𝐴̄, is such that if 𝑎 ∈ 𝐴𝑐 , then 𝑎 ∈ Ω but 𝑎 ∉ 𝐴.
There is a helpful graphical visualization of sets, called Venn diagram, that allows
an insightful representation of set operations. In Figure 17.1 (left), we visualize the
complement of a set 𝐴. In this figure, the entire space Ω is represented by the large
square, and the set 𝐴 is the inner circle (blue), whereas its complement 𝐴𝑐 is the area
around it (white). In contrast, in Figure 17.1 (right), the complement 𝐴𝑐 is the outer
shaded area, and 𝐴 is the inner circle (white).
Definition 17.2.5. The intersection of two sets 𝐴 and 𝐵 consists only of the points
that are in 𝐴 and in 𝐵, and such a relationship is denoted by 𝐴 ∩ 𝐵, i. e., 𝐴 ∩ 𝐵 =
{𝑥 | 𝑥 ∈ 𝐴 and 𝑥 ∈ 𝐵}.
Definition 17.2.6. The union of two sets 𝐴 and 𝐵 consists of all points that are
either in 𝐴 or in 𝐵, or in 𝐴 and 𝐵, and this relationship is denoted by 𝐴 ∪ 𝐵, i. e.,
𝐴 ∪ 𝐵 = {𝑥 | 𝑥 ∈ 𝐴 or 𝑥 ∈ 𝐵}.
Figure 17.2 provides a visualization of the intersection (left) and the union (right)
of two sets 𝐴 and 𝐵.
Figure 17.2: Venn diagrams of two sets. Left: Intersection of 𝐴 and 𝐵, 𝐴 ∩ 𝐵. Right: Union of
𝐴 and 𝐵, 𝐴 ∪ 𝐵.
Definition 17.2.7. The set difference between two sets, 𝐴 and 𝐵, consists of the
points that are only in 𝐴, but not in 𝐵, and this relationship is denoted by 𝐴 ∖ 𝐵,
i. e., 𝐴 ∖ 𝐵 = {𝑥 | 𝑥 ∈ 𝐴 and 𝑥 ̸∈ 𝐵}.
Using R, the four aforementioned set operations can be carried out as follows:
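(The original listing is not shown at this point; a minimal sketch using base R's set functions, with example sets of our own, follows.)

A <- c(1, 2, 3, 4); B <- c(3, 4, 5, 6); Omega <- 1:10
union(A, B)            # union A ∪ B
intersect(A, B)        # intersection A ∩ B
setdiff(A, B)          # set difference A \ B
setdiff(Omega, A)      # complement of A with respect to Omega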
Theorem 17.2.1. For three given sets 𝐴, 𝐵, and 𝐶, the following relations hold:
1. Commutativity: 𝐴 ∪ 𝐵 = 𝐵 ∪ 𝐴, and 𝐴 ∩ 𝐵 = 𝐵 ∩ 𝐴.
2. Associativity: 𝐴 ∪ (𝐵 ∪ 𝐶) = (𝐴 ∪ 𝐵) ∪ 𝐶, and 𝐴 ∩ (𝐵 ∩ 𝐶) = (𝐴 ∩ 𝐵) ∩ 𝐶.
3. Distributivity: 𝐴∪(𝐵∩𝐶) = (𝐴∪𝐵)∩(𝐴∪𝐶), and 𝐴∩(𝐵∪𝐶) = (𝐴∩𝐵)∪(𝐴∩𝐶).
4. (𝐴𝑐 )𝑐 = 𝐴.
For the complement of a set, a bar over the symbol is frequently used instead of
the superscript “𝑐”, i. e., 𝐴̄ = 𝐴𝑐 .
Definition 17.2.8. Two sets 𝐴1 and 𝐴2 are called mutually exclusive if the following
holds: 𝐴1 ∩ 𝐴2 = ∅.
If 𝑛 sets 𝐴𝑖 with 𝑖 ∈ {1, . . . , 𝑛} are mutually exclusive, then 𝐴𝑖 ∩ 𝐴𝑗 = ∅ holds
for all 𝑖 and 𝑗 with 𝑖 ̸= 𝑗.
328 | 17 Probability theory
Theorem 17.2.2 (De Morgan’s Laws). For two given sets, 𝐴 and 𝐵, the following
relations hold:
(𝐴 ∪ 𝐵)𝑐 = 𝐴𝑐 ∩ 𝐵𝑐 , (17.1)
(𝐴 ∩ 𝐵)𝑐 = 𝐴𝑐 ∪ 𝐵𝑐 . (17.2)
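Both identities can be checked numerically for finite sets, e. g.:

Omega <- 1:10; A <- c(1, 2, 3); B <- c(3, 4, 5)
identical(sort(setdiff(Omega, union(A, B))),
          sort(intersect(setdiff(Omega, A), setdiff(Omega, B))))   # TRUE, (17.1)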
Pr(A) ≥ 0. (17.3)
Pr(Ω) = 1. (17.4)
For mutually exclusive events A_1, …, A_k,
P(A_1 ∪ A_2 ∪ ⋯ ∪ A_k) = ∑_{i=1}^{k} P(A_i). (17.6)

Definition 17.3.1. We call Pr(A) a probability of event A if it fulfills all three axioms above.

Probabilities are called coherent if they obey the rules from the three axioms above. Examples to the contrary will be given below.
We would like to note that the above definition of probability does not give a de-
scription about how to quantify it. Classically, Laplace provided such a quantification
for equiprobable elementary outcomes, i. e., for 𝑝(𝜔𝑖 ) = 1/𝑚 for Ω = {𝜔1 , . . . , 𝜔𝑚 }.
In this case, the probability of an event 𝐴 is given by the number of elements in
𝐴 divided by the total number of possible events, i. e., 𝑝(𝐴) = |𝐴|/𝑚. In practice,
not all problems can be captured by this approach, because usually the probabil-
ities, 𝑝(𝜔𝑖 ), are not equiprobable. For this reason a frequentist quantification or a
Bayesians quantification of probability, which hold for general probability values, is
used [91, 161].
The conditional probability of an event A given an event B is defined by

P(A|B) = P(A ∩ B)/P(B). (17.8)
Definition 17.4.2 (Partition of the sample space). Suppose that the events {𝐴1 , . . . ,
𝐴𝑘 } are disjoint, i. e., 𝐴𝑖 ∩ 𝐴𝑗 = ∅ for all 𝑖, and 𝑗 ∈ {1, . . . , 𝑘} and Ω = 𝐴1 ∪ · · · ∪ 𝐴𝑘 .
Then, the sets {𝐴1 , . . . , 𝐴𝑘 } form a partition of the sample space Ω.
Theorem 17.4.1 (Law of total probability). Suppose that the events {𝐵1 , . . . , 𝐵𝑘 } are
disjoint and form a partition of the sample space Ω and 𝑃 (𝐵𝑖 ) > 0. Then, for an
event 𝐴 ∈ Ω,
P(A) = ∑_{i=1}^{k} P(A|B_i)P(B_i). (17.9)
𝐴=𝐴∩Ω (17.10)
we have
𝐴 = 𝐴 ∩ (𝐵1 ∪ · · · ∪ 𝐵𝑘 ), (17.11)
𝐴 = (𝐴 ∩ 𝐵1 ) ∪ · · · ∪ (𝐴 ∩ 𝐵𝑘 ). (17.12)
Since the events A ∩ B_i are mutually exclusive, it follows that

P(A) = P((A ∩ B_1) ∪ ⋯ ∪ (A ∩ B_k)) (17.13)
     = P(A ∩ B_1) + ⋯ + P(A ∩ B_k) (17.14)
     = P(A|B_1)P(B_1) + ⋯ + P(A|B_k)P(B_k) (17.15)
     = ∑_{i=1}^{k} P(A|B_i)P(B_i). (17.16)
Definition 17.5.1. Two events A and B are called independent, or statistically independent, if one of the following conditions holds:
1. 𝑃 (𝐴𝐵) = 𝑃 (𝐴)𝑃 (𝐵)
2. 𝑃 (𝐴|𝐵) = 𝑃 (𝐴) if 𝑃 (𝐵) > 0
3. 𝑃 (𝐵|𝐴) = 𝑃 (𝐵) if 𝑃 (𝐴) > 0
Theorem 17.5.1. If two events A and B are independent, then the following statements hold:
1. A and B^c are independent;
2. A^c and B are independent;
3. A^c and B^c are independent.
The extension to more than two events deserves attention, because it requires independence among all subsets of the events: the events A_1, …, A_n are called mutually independent if, for every subset of indices I ⊆ {1, …, n},

P(⋂_{i∈I} A_i) = ∏_{i∈I} P(A_i). (17.17)
P(X ∈ S) = P({a ∈ Ω | X(a) ∈ S}), (17.18)
since {𝑎 ∈ Ω | 𝑋(𝑎) ∈ 𝑆} ⊂ Ω.
Similarly, for a single element 𝑆 = 𝑥, we obtain
P(X = x) = P({a ∈ Ω | X(a) = x}). (17.19)
In this way, the probability values for events are clearly defined.
P(X ≤ x) = P({a ∈ Ω | X(a) ≤ x}). (17.21)
Example 17.6.1. Suppose that we have a fair coin and define a random variable by
𝑋(𝐻) = 1 and 𝑋(𝑇 ) = 0 for a probability space with Ω = {𝐻, 𝑇 }. We can find a
piecewise definition of the corresponding distribution function as follows:
                                            ⎧ P(∅) = 0         for x < 0;
F_X(x) = P({a ∈ Ω | X(a) ≤ x}) =  ⎨ P({T}) = 1/2     for 0 ≤ x < 1;     (17.22)
                                            ⎩ P({T, H}) = 1    for x ≥ 1.
The circle at the end of the steps in Figure 17.3 means that the end points are not
included, but all points up to the end points themselves are. Mathematically, this
corresponds to an open interval indicated by “)”, e. g., [0, 1) for the second step in
Figure 17.3.
Theorem 17.6.1. The cumulative distribution function, 𝐹 (𝑥), has the following
properties:
1. 𝐹 (−∞) = lim𝑥→−∞ 𝐹 (𝑥) = 0 and 𝐹 (∞) = lim𝑥→∞ 𝐹 (𝑥) = 1;
2. 𝐹 (𝑥+) = 𝐹 (𝑥) is continuous from the right;
3. F(x) is monotone nondecreasing, i. e., x_1 ≤ x_2 ⇒ F(x_1) ≤ F(x_2);
4. 𝑃 (𝑋 > 𝑥) = 1 − 𝐹 (𝑥);
5. 𝑃 (𝑥1 < 𝑥 ≤ 𝑥2 ) = 𝐹 (𝑥2 ) − 𝐹 (𝑥1 );
6. 𝑃 (𝑋 = 𝑥) = 𝐹 (𝑥) − 𝐹 (𝑥−);
7. 𝑃 (𝑥1 ≤ 𝑥 ≤ 𝑥2 ) = 𝐹 (𝑥2 ) − 𝐹 (𝑥1 −).
Given these two definitions and the properties of probability values, it can be
shown that the following conditions hold:
1. f(x) = 0, if x is not a possible value of the random variable X;
2. ∑_{i=1}^{n} f(x_i) = 1, if the x_i are all the possible values for the random variable X.
P(a ≤ X ≤ b) = ∫_a^b f(x) dx. (17.24)
Here, the nonnegative function 𝑓 (𝑥) is called the probability density function of 𝑋.
f(x) = { 1/(b − a)   if x ∈ [a, b];
       { 0           otherwise.        (17.27)
The notation Unif([𝑎, 𝑏]) is often used to denote a uniform distribution in the interval
[𝑎, 𝑏].
From the definition of the expectation value of a random variable, several important properties follow that hold for discrete and continuous random variables.
Theorem 17.8.1. Suppose that X and X_1, …, X_n are random variables. If X_1, …, X_n are independent random variables and E[X_i] is finite for every i, then
E[∏_{i=1}^{n} X_i] = ∏_{i=1}^{n} E[X_i]. (17.34)
17.8.2 Variance

The expectation value of the squared deviation from the mean, E[(X − E[X])²], is of such importance that it has its own name: it is called the variance of X, denoted Var(X). If the mean of X, μ, is not finite, or if it does not exist, then Var(X) does not exist.
There is a related measure, called the standard deviation, which is just the square root of the variance of X, denoted sd(X) = √Var(X). Frequently, the Greek symbol σ is used for the standard deviation, i. e., Var(X) = σ². In this case, the standard deviation assumes the form sd(X) = √Var(X) = σ.
Property (3) has important practical implications, because it says that the mean of a sample of size n of random variables that all have the same variance has a variance that is reduced by the factor 1/n. If we take the square root of Var(X̄) = Var(X)/n, we get the standard deviation of X̄, given by

SE = sd(X̄) = sd(X)/√n. (17.38)
17.8.3 Moments
Along the same principle as for the definition of the variance of a random variable X, one can define further expectation values. Choosing

g(X) = X, (17.41)

the k-th moment of X is given by

m_k = E[g(X)^k] = E[X^k]. (17.42)

Similarly, the covariance of two random variables X and Y is defined by

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]. (17.43)
The correlation of two random variables X and Y is defined by

Cor(X, Y) = E[(X − E[X])(Y − E[Y])]/√(Var(X) Var(Y)) (17.44)
          = Cov(X, Y)/√(Var(X) Var(Y)). (17.45)
Theorem 17.9.1. Let 𝑋 and 𝑌 be two discrete random variables with joint prob-
ability function 𝑓 (𝑥, 𝑦). If (𝑥𝑎 , 𝑦𝑎 ) is not in the definition range of (𝑋, 𝑌 ), then
𝑓 (𝑥𝑎 , 𝑦𝑎 ) = 0. Furthermore
∑_{∀i} f(x_i, y_i) = 1, (17.48)
and
P((X, Y) ∈ Z) = ∑_{(x,y)∈Z} f(x, y). (17.49)
For evaluating such a discrete joint probability function, the corresponding probabilities can be presented in the form of a table. In Table 17.1, we present an example of a discrete joint probability function f(x, y) with X ∈ {x_1, x_2} and Y ∈ {y_1, y_2, y_3}.
Table 17.1: An example of a discrete joint probability function 𝑓 (𝑥, 𝑦) with 𝑋 ∈ {𝑥1 , 𝑥2 } and
𝑌 ∈ {𝑦1 , 𝑦2 , 𝑦3 }.
               Y
          y1           y2           y3
X   x1    f(x1, y1)    f(x1, y2)    f(x1, y3)
    x2    f(x2, y1)    f(x2, y2)    f(x2, y3)
P(X_1, …, X_n) = ∏_{i=1}^{n} p(X_i | pa(X_i)). (17.50)
Here, pa(𝑋𝑖 ) denotes the “parents” of variable 𝑋𝑖 . In Figure 17.4 (left), we show an
example for n = 5. The joint probability distribution P(X_1, …, X_5) factorizes as
𝑃 (𝑋1 , . . . , 𝑋5 ) = 𝑝(𝑋1 )𝑝(𝑋2 )𝑝(𝑋3 |𝑋1 )𝑝(𝑋4 |𝑋1 , 𝑋2 )𝑝(𝑋5 |𝑋1 , 𝑋2 ). (17.51)
The joint probability distribution for Figure 17.4 (right) factorizes analogously, according to the parent structure of its DAG. The DAGs shown in Figure 17.4, together with the factorizations of their joint probability distributions, are examples of so-called Bayesian networks [149, 114]. Bayesian networks are special examples of probabilistic models called graphical models [116].
One of the simplest, and yet most important, discrete distributions is the Bernoulli distribution. For this distribution, the sample space consists of only two outcomes, {0, 1}. The probabilities for these events are defined by
𝑃 (𝑋 = 1) = 𝑝, (17.53)
𝑃 (𝑋 = 0) = 1 − 𝑝. (17.54)
With the help of the command rbern, we can draw 10 random samples from a Bernoulli distribution with p = 0.5.
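A minimal sketch; rbern is not part of base R, and we assume here that it is provided by the Rlab package:

# rbern() is assumed to come from the Rlab package
library(Rlab)
set.seed(1)
rbern(10, prob=0.5)            # 10 Bernoulli(p = 0.5) samples, each 0 or 1
# base R equivalent, since a Bernoulli variable is a Binomial with size = 1:
rbinom(10, size=1, prob=0.5)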
For large N, a Binomial distribution can be approximated by the normal distribution. The advantage of such an approximation is that the normal distribution is computationally easier to handle than the Binomial distribution.
Figure 17.5: Binomial distribution, Binom(𝑁 = 6, 𝑝 = 0.3) (left) and Binom(𝑁 = 6, 𝑝 = 0.1)
(right).
Listing 17.5: Plot of a Binomial distribution, see Figure 17.5 (left)
n <- 0:6
d <- dbinom(n, size=6, prob=0.3)
plot(n, d, type="h", lwd=2, xlab="n", ylab="P(X=n)")  # stem plot of the pmf
In the following, we do not provide the scripts for the visualizations of simi-
lar figures, but only for the values of the distributions. However, by following the
example in Listing 17.5, such visualizations can be generated easily.
So far we have seen that R provides for each available distribution a function
to sample random variables from this distribution, and a function to obtain the
corresponding probability density. For the Binomial distribution, these functions
Figure 17.6: Binomial distribution, pbinom(n, size=6, prob=0.6) (left) and qbinom(p, size=6,
prob=0.6) (right).
are called rbinom and dbinom. For other distributions, the following naming pattern applies:
– r'name-of-the-distribution': draw random samples from the distribu-
tion;
– d'name-of-the-distribution': density of the distribution.
There are two more standard functions available that provide useful information
about a distribution. The first one is the distribution function, also called cumulative
distribution function, because it provides 𝑃 (𝑋 ≤ 𝑛), i. e., the probability up to a
certain value of 𝑛, which is given by
P(X ≤ n) = ∑_{m=0}^{n} P(X = m). (17.56)
The second function is the quantile function, which provides information about the
value of 𝑛, for which 𝑃 (𝑋 ≤ 𝑛) = 𝑝 holds. In R, the names of these functions follow
the pattern:
– p'name-of-the-distribution': distribution function;
– q'name-of-the-distribution': quantile function.
The geometric distribution describes the number of failures before the first success in a sequence of Bernoulli trials, and it is defined by

P(X = n) = (1 − p)^n p. (17.57)
For example, if we observe 0001 . . . then the first 𝑛 = 3 observations show consecu-
tively tail, and the probability for this to happen is given by 𝑃 (𝑋 = 3) = (1 − 𝑝)3 𝑝.
Using R, sampling from 𝑋 ∼ Geom(𝑝 = 0.4) is obtained as shown in Listing 17.6.
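A minimal sketch using the base R function rgeom, which counts the number of failures before the first success:

set.seed(1)
rgeom(10, prob=0.4)   # 10 samples from Geom(p = 0.4)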
The Poisson distribution with parameter λ > 0 is defined by

P(X = n) = λ^n exp(−λ)/n!. (17.59)
For example, sampling from X ∼ Pois(λ = 3) using R can be done as follows:
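A minimal sketch using the base R function rpois:

set.seed(1)
rpois(10, lambda=3)   # 10 samples from Pois(lambda = 3)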
It is worth noting that the Poisson distribution can be obtained from a Binomial
distribution for 𝑁 → ∞ and 𝑝 → 0, assuming that 𝜆 = 𝑁 𝑝 remains constant.
This means that for large 𝑁 and small 𝑝 we can use the Poisson distribution with
𝜆 = 𝑁 𝑝 to approximate a Binomial distribution, because the former is easier to
handle computationally. Two rules of thumb say that this approximation is good if
𝑁 ≥ 20 and 𝑝 ≤ 0.05, or if 𝑁 ≥ 100 and 𝑁 𝑝 ≤ 10.
This approximation also explains why the Poisson distribution is used to describe rare events that have a small probability of occurring, e. g., the radioactive decay of chemical elements. Other examples of rare events include spelling errors on a book page, the number of visitors to a certain website, or the number of infections due to a virus.
The density of the exponential distribution is given by

f(x) = { λ exp(−λx)   if x ≥ 0;
       { 0            otherwise.      (17.60)
The parameter 𝜆 of the exponential distribution must be strictly positive, i. e., 𝜆 > 0.
Figure 17.8: Exponential distribution. dexp(rate = 1) (left) and pexp(rate = 1) (right).
In the denominator of the definition of the Beta distribution appears the Beta func-
tion, which is defined by
B(α, β) = ∫_0^1 x^(α−1) (1 − x)^(β−1) dx. (17.62)
Figure 17.9: Beta distribution. dbeta(α = 2, β = 2) (left) and pbeta(α = 2, β = 2) (right).
The density of the gamma distribution is given by

f(x) = { (1/(Γ(α) β^α)) x^(α−1) exp(−x/β)   if x ≥ 0;
       { 0                                  otherwise.     (17.63)
The parameters 𝛼 and 𝛽 must be strictly positive. In the denominator of the density
appears the gamma function, Γ, which is defined as follows:
Γ(α) = ∫_0^∞ t^(α−1) exp(−t) dt. (17.64)
The density of the one-dimensional normal distribution with mean μ and standard deviation σ is given by

f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)), −∞ < x < ∞. (17.65)
Figure 17.11: One-dimensional normal distribution. Left: Different values of 𝜎 ∈ {0.5, 1, 3} for
a constant mean of 𝜇 = 0. Right: Different values of 𝜇 ∈ {−1, 1, 3} for a constant standard
deviation of 𝜎 = 2.
f(x) = c exp(−(1/(2(1 − ρ²))) [ (x_1 − μ_1)²/σ_1² + (x_2 − μ_2)²/σ_2² − 2ρ (x_1 − μ_1)(x_2 − μ_2)/(σ_1 σ_2) ]),
x = (x_1, x_2) ∈ ℝ², (17.67)

with

c = 1/(2π σ_1 σ_2 √(1 − ρ²)). (17.68)
projections. In contrast, Figure 17.13 shows a contour plot of this distribution. Such
a plot shows parallel slices of the 𝑥1 –𝑥2 plane.
f(x) = (1/√((2π)^n |Σ|)) exp(−(x − μ) Σ⁻¹ (x − μ)ᵗ / 2), x ∈ ℝⁿ. (17.69)
Listing 17.14: Chi-square distribution – for one parameter pair, see Figure 17.14
x <- seq(0, 30, 0.1)
d1 <- dchisq(x, df=2)  # Figure 17.14, left
Figure 17.14: Chi-square distribution. Left: Different values of the degree of freedom
𝑘 ∈ {2, 7, 20}. Right: Cumulative distribution function.
A Student's t-distribution with ν degrees of freedom arises as the distribution of

X = Z/√(Y/ν), (17.72)

where Z is a standard normal random variable and Y is chi-square distributed with ν degrees of freedom.
Listing 17.15: Student's t-distribution – for one parameter pair, see Figure 17.15
x <- seq(-5, 5, 0.1)
d <- dt(x, df=2)  # Figure 17.15, left
Figure 17.15: Student's t-distribution. Left: different values of the degrees of freedom k ∈ {2, 7, 20}. Right: QQ-normal plot for a t-distribution with k = 100.
The t-distribution is widely used for statistical tests concerning the mean of one or two populations, i. e., groups of measurements, each with a certain number of samples [171].
The density of the log-normal distribution is given by

f(x) = (1/(√(2π) σ x)) exp(−(ln x − μ)²/(2σ²)), 0 < x < ∞. (17.73)
Figure 17.17: Weibull distribution. Left: Constant value of 𝜆 = 1.0 and varying 𝛽 ∈
{1.0, 2.0, 3.5}. Right: Constant value of 𝛽 = 2.0 and varying 𝜆 ∈ {0.9, 2.0, 4.0}.
The log-normal distribution, shown in Figure 17.16, has the following location measures:

mean: exp(μ + σ²/2), (17.74)
variance: exp(2μ + σ²)(exp(σ²) − 1), (17.75)
mode: exp(μ − σ²). (17.76)
The Weibull distribution, shown in Figure 17.17, has the following location measures: mean λΓ(1 + 1/β), variance λ²[Γ(1 + 2/β) − (Γ(1 + 1/β))²], and, for β > 1, mode λ((β − 1)/β)^(1/β). The Weibull distribution is frequently used in survival analysis, e. g., as a parametric model for the baseline hazard function of a Cox proportional hazard model, which can be used to model time-to-event processes by considering covariates.
Bayes' theorem connects the conditional probabilities of two events as follows:

P(H|D) = P(D|H)P(H)/P(D). (17.81)
Its proof follows directly from the definition of conditional probabilities and the
commutativity of the intersection.
The terms in the above equation have the following names:
– 𝑃 (𝐻) is called the prior probability, or prior.
– 𝑃 (𝐷|𝐻) is called the likelihood.
– 𝑃 (𝐷) is just a normalizing constant, sometimes called marginal likelihood.
– 𝑃 (𝐻|𝐷) is called the posterior probability or posterior.
The letters denoting the above variables, i. e., D and H, are arbitrary, but by using D for "data" and H for "hypothesis", one can interpret equation (17.81) as the change of the probability of a hypothesis from before considering new data (given by the prior) to after considering these data (given by the posterior).
Bayes’ theorem can be generalized to more variables.
P(B_i|A) = P(A|B_i)P(B_i) / ∑_{j=1}^{k} P(A|B_j)P(B_j). (17.82)
To understand the utility of Bayes' theorem, let us consider the following example: Suppose that a medical test for a disease is performed on a patient, and
this test has a reliability of 90 %. That means, if a patient has this disease, the test
will be positive with a probability of 90 %. Furthermore, assume that if the patient
does not have the disease, the test will be positive with a probability of 10 %. Let
us assume that a patient tests positive for this disease. What is the probability that
this patient has this disease? The answer to this question can be obtained using
Bayes’ theorem.
In order to make the usage of Bayes’ theorem more intuitive, we adopt the
formulation in equation (17.82). Specifically, let us denote a positive test by 𝐴 = 𝑇 + ,
a sick patient that has the disease (D) by 𝐵1 = 𝐷+ , and a healthy patient that does
not have the disease by 𝐵2 = 𝐷− . Then, equation (17.82) becomes
P(D⁺|T⁺) = P(T⁺|D⁺)P(D⁺) / (P(T⁺|D⁺)P(D⁺) + P(T⁺|D⁻)P(D⁻)). (17.83)
Note that 𝐷+ and 𝐷− provide a partition of the sample space, because 𝑃 (𝐷+ ) +
𝑃 (𝐷− ) = 1 (either the patient is sick or healthy). From the provided information
about the medical test, see above, we can identify the following entities:

P(T⁺|D⁺) = 0.9, (17.84)
P(T⁺|D⁻) = 0.1. (17.85)
At this point, the following observation can be made: the knowledge about the
medical test is not enough to calculate the probability 𝑃 (𝐷+ |𝑇 + ), because we also
need information about 𝑃 (𝐷+ ) and 𝑃 (𝐷− ).
These probabilities correspond to the prevalence of the disease in the population
and are independent from the characteristics of the performed medical test. Let us
consider two different diseases: one is a common disease and one is a rare disease. For
the common (c) disease, we assume P_c(D⁺) = 1/1000, and for the rare (r) disease P_r(D⁺) = 1/1000000. That means, for the common disease, on average one person in 1000 is sick, whereas, for the rare disease, only one person in 1000000 is sick. This gives us P_c(D⁺|T⁺) ≈ 0.0089 and P_r(D⁺|T⁺) ≈ 9.0 · 10⁻⁶.
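These numbers can be reproduced with a short R sketch of equation (17.83); the helper name posterior is our own:

# posterior probability of disease given a positive test, equation (17.83)
posterior <- function(sens, fpr, prev) {
  sens * prev / (sens * prev + fpr * (1 - prev))
}
posterior(0.9, 0.1, 1/1000)      # common disease: approx. 0.0089
posterior(0.9, 0.1, 1/1000000)   # rare disease: approx. 9e-06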
It is worth noting that although the medical test has exactly the same characteristics in both cases, given by P(T⁺|D⁺) and P(T⁺|D⁻) (see equations (17.84) and (17.85)), the resulting probabilities are quite different from each other. More precisely,

P_c(D⁺|T⁺) = 991.1 · P_r(D⁺|T⁺), (17.90)
which makes it almost 1000 times more likely to suffer from the common disease
than the rare disease, if tested positive.
The above example demonstrates that the context, as provided by 𝑃 (𝐷+ ) and
𝑃 (𝐷− ), is crucial in order to obtain a sensible result.
Finally, in Figure 17.18, we present some results for a repeated analysis of the above example, using values of P(D⁺) from the full range of possible prevalence probabilities, i. e., from [0, 1]. We can see that for any prevalence P(D⁺) below about 0.6 %, the probability to have the disease, if tested positive, is below 5 %. Furthermore, we can see that the functional relation between P(D⁺) and P(D⁺|T⁺) is strongly nonlinear. Such a functional behavior makes it difficult to make good guesses for the values of P(D⁺|T⁺) without doing the underlying mathematics properly.
After this example, demonstrating the use of the Bayes’ theorem, we will now
provide the proof of the theorem.
Proof. From the definition of a conditional probability for two events A and B,

P(A|B) = P(A ∩ B)/P(B), (17.91)

and the commutativity of the intersection, A ∩ B = B ∩ A, it follows that

P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A), (17.92)

and hence

P(B_i|A) = P(A|B_i)P(B_i)/P(A). (17.93)
Using the law of total probability and assuming that {𝐵1 , . . . , 𝐵𝑘 } is a partition of
the sample space, we can write
P(A) = ∑_{j=1}^{k} P(A|B_j)P(B_j). (17.94)

Substituting this into equation (17.93) yields

P(B_i|A) = P(A|B_i)P(B_i) / ∑_{j=1}^{k} P(A|B_j)P(B_j), (17.95)

which completes the proof.
It is because of the simplicity of this "proof" that Bayes' theorem is sometimes also referred to as Bayes' rule.
17.14.1 Entropy
Shannon defined the entropy of a discrete random variable X, assuming values in {X_1, …, X_n} with probability density p_i = P(X_i), as follows:

H(X) = −∑_{i=1}^{n} p_i log(p_i). (17.96)
Usually, the logarithm is base 2, because the entropy is expressed in bits (that
means its unit is a bit). However, sometimes, other bases are used, hence, attention
to this is required.
The entropy is a measure of the uncertainty of a random variable. Specifically,
it quantifies the average amount of information needed to describe the random
variable.
– Invariance: the entropy does not change if the values of X are merely relabeled, i. e., for any permutation X′ of the values of X,
  H(X) = H(X′). (17.97)
– Maximum: the maximum of the entropy is attained for the uniform distribution, P(X_i) = 1/n = const. ∀i, for {X_1, …, X_n}.
Consider, as an example, a binary random variable

X = { 0   with probability 1 − p,
    { 1   with probability p.        (17.99)
Clearly, the entropy is positive for all values of 𝑝, and assumes its maximum for
𝑝 = 0.5 with 𝐻(𝑝 = 0.5) = 1 bit. In order to plot the entropy, we used 𝑛 =
50 different values for 𝑝 obtained with the R command p <- seq(from=0, to=1,
length.out=n).
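A minimal sketch reproducing the curve in Figure 17.19, assuming a base-2 logarithm and the convention 0 · log(0) = 0:

n <- 50
p <- seq(from=0, to=1, length.out=n)
H <- function(p) ifelse(p == 0 | p == 1, 0, -p*log2(p) - (1 - p)*log2(1 - p))
plot(p, H(p), type="l", xlab="p", ylab="H(p)")
abline(v=0.5, lty=2, col="red")   # maximum at p = 0.5 with H = 1 bit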
Similar to the joint probability and the conditional probability, there are also
extensions of the entropy along these lines.
Definition 17.14.2 (Joint entropy). Let 𝑋 and 𝑌 be two random variables assuming
values in 𝑋1 , . . . , 𝑋𝑛 and 𝑌1 , . . . , 𝑌𝑚 . Furthermore, let 𝑝𝑖𝑗 = 𝑃 (𝑋𝑖 , 𝑌𝑗 ) be their joint
probability distribution. Then, the joint entropy of 𝑋 and 𝑌 , denoted 𝐻(𝑋, 𝑌 ), is
given by
H(X, Y) = −∑_{i=1}^{n} ∑_{j=1}^{m} p_ij log(p_ij). (17.100)
Figure 17.19: Visualization of the entropy 𝐻(𝑝) for different values of 𝑝. The vertical dashed
line (red) indicates the maximum of 𝐻(𝑝).
Furthermore, let 𝑝𝑗𝑖 = 𝑃 (𝑌𝑗 |𝑋𝑖 ) be their conditional probability distribution and
𝐻(𝑌 |𝑋 = 𝑥𝑖 ) the entropy of 𝑌 , conditioned on 𝑋 = 𝑥𝑖 . Then, the conditional
entropy of 𝑌 given 𝑋, denoted 𝐻(𝑌 |𝑋), is given by
H(Y|X) = ∑_{i=1}^{n} p_i H(Y|X = x_i) = −∑_{i=1}^{n} ∑_{j=1}^{m} p_i p_ji log(p_ji). (17.101)
Another measure, called mutual information, follows from the definition of the Kullback–Leibler divergence by the transformation p(x) → p(x, y) and q(x) → p(x)p(y). It measures the amount of information that one random variable, X, contains about another random variable, Y.
Figure 17.20: An example for the Kullback–Leibler divergence. On the left-hand side, we show the probability distributions p(x) (a gamma distribution) and q(x) (a normal distribution). On the right-hand side, we show the logarithm of their ratio, log(p/q).
Furthermore, let 𝑝𝑖𝑗 = 𝑃 (𝑋𝑖 , 𝑌𝑗 ) be their joint probability distribution. Then, the
mutual information of 𝑋 and 𝑌 , denoted 𝐼(𝑋, 𝑌 ), is given by
I(X, Y) = ∑_{i=1}^{n} ∑_{j=1}^{m} p_ij log(p_ij/(p_i q_j)). (17.103)
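For illustration, a minimal sketch that evaluates equation (17.103) for a joint probability table; the helper name mutual_information and the example table are our own:

mutual_information <- function(P) {     # P: joint probability matrix
  px <- rowSums(P)                      # marginal of X
  py <- colSums(P)                      # marginal of Y
  terms <- P * log2(P / outer(px, py))  # p_ij * log2(p_ij / (p_i * q_j))
  sum(terms[P > 0])                     # convention: 0 * log(0) = 0
}
P <- matrix(c(0.2, 0.1, 0.2,
              0.1, 0.3, 0.1), nrow=2, byrow=TRUE)
mutual_information(P)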
From the last relationship above follows a further property of the conditional entropy: H(Y|X) = H(Y) − I(X, Y). (17.104)
In Figure 17.21, we visualize the relationships between entropies and mutual in-
formation. This graphical representation of the abstract relationships helps in sum-
marizing these nontrivial dependencies and in gaining an intuitive understanding.
Theorem 17.15.1. For a given random variable X with P(X ≥ 0) = 1 and every real t ∈ ℝ with t > 0, the following inequality holds:

P(X ≥ t) ≤ E[X]/t. (17.105)

This inequality is called the Markov inequality.
Theorem 17.15.2. For a given random variable 𝑋 with finite Var(𝑋) and every real
𝑡 ∈ R with 𝑡 > 0, the following inequality holds:
P(|X − E[X]| ≥ t) ≤ Var(X)/t². (17.106)

This inequality is called the Chebyshev inequality.
The Chebyshev inequality follows from the Markov inequality applied to the random variable Y = (X − E[X])² with s = t², since

P(|X − E[X]| ≥ t) = P(Y ≥ s) ≤ E[Y]/s (17.108)
= Var(X)/s. (17.109)
It is important to emphasize that the two above inequalities hold for every
probability distribution with the required conditions. Despite this generality, it is
possible to make a specific statement about the distance of a random sample from
the mean of the distribution. For example, for 𝑡 = 4𝜎, we obtain
P(|X − E[X]| ≥ 4σ) ≤ 1/16 ≈ 0.063. (17.110)

That means, for every distribution, the probability that the distance between a random sample X and E[X] is larger than four standard deviations is less than 6.3 %.
At the beginning of this chapter, we stated briefly the result of the law of large
numbers. Before we formulate it formally, we have one last point that requires some
clarification. This point relates to the mean of a sample. Suppose that we have a
random sample of size 𝑛, given by 𝑋1 , . . . , 𝑋𝑛 , and each 𝑋𝑖 is drawn from the same
distribution with mean 𝜇 and variance 𝜎 2 . Furthermore, each 𝑋𝑖 is drawn indepen-
dently from the other samples. We call such samples independent and identically
distributed (iid) random variables. Then,

E[X_i] = μ, (17.111)

and

Var(X_i) = σ². (17.112)
The question of interest here is the following: what is the expectation value of the
sample mean?
The sample mean of the sample 𝑋1 , . . . , 𝑋𝑛 is given by
X̄_n = (1/n) ∑_{i=1}^{n} X_i. (17.113)
Here, we emphasize the dependence on 𝑛 by the subscript of the mean value. From
this, we can obtain the expectation value of 𝑋¯ 𝑛 by applying the rules for the expec-
tation values discussed in Section 17.8, giving
E[X̄_n] = E[(1/n) ∑_{i=1}^{n} X_i] = (1/n) ∑_{i=1}^{n} E[X_i] = μ. (17.114)
Similarly, we can obtain the variance of the sample mean, i. e., Var(X̄_n), by

Var(X̄_n) = Var((1/n) ∑_{i=1}^{n} X_i) (17.115)
          = (1/n²) Var(∑_{i=1}^{n} X_i) (17.116)
          = (1/n²) ∑_{i=1}^{n} Var(X_i) (17.117)
          = (1/n²) n σ² = σ²/n. (17.118)
These results are interesting, because they demonstrate that the expectation value
of the sample mean is identical to the mean of the distribution, but the sample
variance is reduced by a factor of 1/𝑛 compared to the variance of the distribution.
Hence, the sampling distribution of 𝑋 ¯ 𝑛 becomes more and more peaked around 𝜇
with increasing values of 𝑛, and also having a smaller variance than the distribution
of 𝑋 for all 𝑛 > 1.
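A minimal simulation sketch of this variance reduction; the distribution N(0, σ² = 4) and the sample size n = 25 are our own choices:

set.seed(123)
sigma2 <- 4; n <- 25
xbar <- replicate(10000, mean(rnorm(n, mean=0, sd=sqrt(sigma2))))
var(xbar)     # close to the theoretical value sigma2/n
sigma2 / n    # 0.16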
Furthermore, application of the Chebyshev inequality for X = X̄_n gives

P(|X̄_n − E[X̄_n]| ≥ t) ≤ Var(X̄_n)/t². (17.119)
Hence,
P(|X̄_n − μ| ≥ t) ≤ σ²/(n t²). (17.120)
This is a precise probabilistic relationship between the distance of the sample mean
𝑋¯ 𝑛 from the mean 𝜇 as a function of the sample size 𝑛. Hence, this relationship
can be used to get an estimate for the number of samples required in order for the
sample mean to be “close” to the population mean 𝜇.
We are now in a position to finally present the result known as the law of
large numbers, which adds a further component to the above considerations for the
sample mean. Specifically, so far, we know that the expectation of the sample mean
is the mean of the distribution (see equation (17.114)) and that the probability of
the minimal distance between 𝑋 ¯ 𝑛 and 𝜇, given by 𝑡, decreases systematically for
increasing 𝑛 (see equation (17.120)). However, so far, we did not assess the opposite
behavior of equation (17.120), namely what is 𝑃 (|𝑋 ¯ 𝑛 − 𝜇| < 𝑡)?
Using the previous results, we obtain
P(|X̄_n − μ| < t) = 1 − P(|X̄_n − μ| ≥ t) ≥ 1 − σ²/(n t²). (17.121)
Taking the limit 𝑛 → ∞, the above equation yields
lim_{n→∞} P(|X̄_n − μ| < t) = 1. (17.122)
This last expression is the result of the law of large numbers. That means, the law of large numbers guarantees that, with certainty, i. e., with a probability of 1, the distance between X̄_n and μ stays below any arbitrarily small value t > 0 as n → ∞.
Formally, in statistics there is a special symbol that is reserved for the type of
convergence presented in equation (17.122), which is written as
X̄_n →ᵖ μ. (17.123)
The “𝑝” over the arrow means that the sample mean converges in probability to 𝜇.
Theorem 17.15.3 (Law of large numbers). Suppose that we have an iid sample of
size 𝑛, 𝑋1 , . . . , 𝑋𝑛 , where each 𝑋𝑖 is drawn from the same distribution with mean
𝜇 and variance 𝜎 2 . Then, the sample mean 𝑋 ¯ 𝑛 converges in probability to 𝜇,
X̄_n →ᵖ μ. (17.124)
Furthermore, the central limit theorem makes a statement about the distribution of the standardized sample mean:

lim_{n→∞} P((X̄_n − μ)/√(σ²/n) ≤ x) = F(x), (17.125)

where F denotes the cumulative distribution function of the standard normal distribution.
For independent random variables X_1, …, X_n with a_i ≤ X_i ≤ b_i and their sum S = ∑_{i=1}^{n} X_i, the Hoeffding inequality provides the bounds

P(S − E[S] ≥ ε′) ≤ exp(−2ε′² / ∑_{i=1}^{n} (b_i − a_i)²), (17.128)
P(|S − E[S]| ≥ ε′) ≤ 2 exp(−2ε′² / ∑_{i=1}^{n} (b_i − a_i)²). (17.129)
Let us define the scalar product for two random variables 𝑋 and 𝑌 by
𝑋 · 𝑌 = E[𝑋𝑌 ]. (17.131)
Then, the Cauchy–Schwartz inequality states that

E[XY]² ≤ E[X²] E[Y²]. (17.132)
Using the Cauchy–Schwartz inequality, we can show that the correlation between two linearly dependent random variables X and Y, i. e., Y = aX + b, satisfies |Cor(X, Y)| = 1.
Chernoff bounds are typically tighter than Markov’s inequality and Chebyshev
bounds, but they require stronger assumptions [137].
In a general form, Chernoff bounds are defined by
P(X ≥ a) ≤ E[exp(tX)]/exp(ta) for t > 0, (17.134)
P(X ≤ a) ≤ E[exp(tX)]/exp(ta) for t < 0. (17.135)
Here, E[exp(tX)] is the moment-generating function of X. There are many different Chernoff bounds for different probability distributions and different values of the parameter t. Here, we provide a bound for Poisson trials, i. e., a sum of independent Bernoulli random variables that are allowed to have different expectation values, P(X_i = 1) = p_i.
17.19 Summary
Probability theory plays a pivotal role when dealing with data, because essentially
every measurement contains errors. Hence, there is an accompanying uncertainty that needs to be quantified probabilistically when dealing with data. In this sense, prob-
ability theory is an important extension of deterministic mathematical fields, e. g.,
linear algebra, graph theory and analysis, which cannot account for such uncer-
tainties. Unfortunately, such methods are usually more difficult to understand and
require, for this reason, much more practice. However, once mastered, they add con-
siderably to the analysis and the understanding of real-world problems, which is
essential for any method in data science.
17.20 Exercises
1. In Section 17.11.2, we discussed that under certain conditions a Binomial distri-
bution can be approximated by a Poisson distribution. Show this result numer-
ically, using R. Use different approximation conditions and evaluate these. How
can this be quantified?
Hint: See Section 17.14.2 about the Kullback–Leibler divergence.
2. Calculate the mutual information for the discrete joint distribution 𝑃 (𝑋, 𝑌 )
given in Table 17.2.
Table 17.2: Numerical values of a discrete joint distribution 𝑃 (𝑋, 𝑌 ) with 𝑋 ∈ {𝑥1 , 𝑥2 } and
𝑌 ∈ {𝑦1 , 𝑦2 , 𝑦3 }.
𝑌
𝑦1 𝑦2 𝑦3
3. Use R to calculate the mutual information for the discrete joint distribution
𝑃 (𝑋, 𝑌 ) given in Table 17.3, for 𝑧 ∈ 𝑆 = [0, 0.5), and plot the mutual informa-
tion as a function of 𝑧. What happens for 𝑧 values outside the interval 𝑆?
Table 17.3: Numerical values of a discrete joint distribution 𝑃 (𝑋, 𝑌 ) with 𝑋 ∈ {𝑥1 , 𝑥2 } and
𝑌 ∈ {𝑦1 , 𝑦2 , 𝑦3 } in dependence on the parameter 𝑧.
               Y
          y1      y2         y3
X   x1    z       0.5 − z    0.1
    x2    0.1     0.1        0.2
4. Use Bayes' theorem for doping tests in sports. Specifically, suppose that we have a doping test that correctly identifies someone who is using doping with a probability of 99 %, i. e., P(+|doping) = 0.99, and has a false positive probability of 1 %, i. e., P(+|no doping) = 0.01. Furthermore, assume that the percentage of people who are doping is 1 %. What is the probability that someone who tests positive is doping?
18 Optimization
Optimization problems consistently arise when we try to select the best element from
a set of available alternatives. Frequently, this consists of finding the best parameters
of a function with respect to an optimization criterion. This is especially difficult if
we have a high-dimensional problem, meaning that there are many such parameters
that must be optimized. Since most models used in data science have many parameters, optimization (or optimization theory) is necessary for devising these models.
In this chapter, we will introduce some techniques used to address unconstrained
and constrained, as well as deterministic and probabilistic optimization problems, in-
cluding Newton’s method, simulated annealing and the Lagrange multiplier method.
We will discuss examples and available packages in R that can be used to solve the
aforementioned optimization problems.
18.1 Introduction
In general, an optimization problem is characterized by the following:
– a set of alternative choices called decision variables;
– a set of parameters called uncontrollable variables;
– a set of requirements to be satisfied by both decision and uncontrollable vari-
ables, called constraints;
– some measure(s) of effectiveness expressed in term of both decision and uncon-
trollable variables, called objective-function(s).
Definition 18.1.1. A set of decision variables that satisfy the constraints is called a
solution to the problem.
The aim of an optimization problem is to find, among all solutions to the prob-
lem, a solution that corresponds to either
– the maximal value of the objective function, in which case the problem is referred
to as a maximization problem, e. g. maximizing the profit;
– the minimal value of the objective-function, in which case the problem is referred
to as a minimization problem, e. g. minimizing the cost; or
– a trade-off value of many and generally conflicting objective-functions, in which
case the problem is referred to as a multicriteria optimization problem.
Optimize f(x) subject to x ∈ S ⊆ ℝⁿ, (18.1)
i. e., the problem is to find the solution 𝑥* ∈ 𝑆, if it exists, such that for all 𝑥 ∈ 𝑆,
we have
– 𝑓 (𝑥* ) ≤ 𝑓 (𝑥), if “Optimize” stands for “minimize”;
– 𝑓 (𝑥* ) ≥ 𝑓 (𝑥), if “Optimize” stands for “maximize”.
The function 𝑓 denotes the objective function or the cost-function, whereas 𝑆 is the
feasible set, and any 𝑥 ∈ 𝑆 is called a feasible solution to the problem.
Remark 18.2.1. If all the decision variables, 𝑥𝑖 , in the problem (18.1) take only
discrete values (e. g. 0, 1, 2, . . .), then the problem is called a discrete optimization
problem, otherwise it is called a continuous optimization problem. When there is
a combination of discrete and continuous variables, the problem is called a mixed
optimization problem.
Minimize f(x), x ∈ ℝⁿ. (18.2)
The main difference between the various gradient-based methods lies in the com-
putation of the descent direction (Step 3) and the computation of the step-size
(Step 4).
In R, various gradient-based methods have been implemented either as stand-alone packages or as part of a general-purpose optimization package.
Consider the problem (18.3) of minimizing f(x1, x2) = x1² + x2² over ℝ². The contour plot of the function f(x1, x2), depicted in Figure 18.1 (left), is obtained using the following script:
Listing 18.1: Contour plot of f(x1, x2) in (18.3) (see Figure 18.1 (left))
require(grDevices)
fxy <- function(x, y) x^2 + y^2
x1 <- x2 <- seq(-1.5, 1.5, length=200)
f <- outer(x1, x2, fxy)
rgb.palette <- colorRampPalette(c("firebrick2", "lightsalmon1", "oldlace"), space="rgb")
image(x1, x2, f, xlab=expression(x[1]), ylab=expression(x[2]), col=rgb.palette(256))
contour(x1, x2, f, levels=seq(-2, 2, by=0.1), add=TRUE)
Using the steepest descent method, the problem (18.3) can be solved in R as
follows:
Listing 18.2: Solving the problem (18.3) using the steepest descent method
#The package pracma is required here
library(pracma)
# Defining the function f(x1, x2); its only minimum is reached at x1=0 and x2=0
fx <- function(x) x[1]^2 + x[2]^2
# Defining an initial solution
x0 <- c(-1.2, 1)
# Calling the steepest descent method
sol <- steep_descent(x0, fx)
sol$xmin
[1] 2.220446e-16 0.000000e+00 # Obtained optimal values of x1 and x2
sol$fmin
[1] 4.930381e-32 # Obtained optimal value of f
# Using a new initial solution
x0 <- c(10, 10)
sol <- steep_descent(x0, fx)
sol$xmin
[1] 0 0 # Obtained optimal values of x1 and x2
sol$fmin
[1] 0 # Obtained optimal value of f
# Thus, for any initial solution the method converges towards the only minimum of f
max_{(x1,x2)∈ℝ²} g(x1, x2) = exp(x1 − 2x1² − x2²) sin(6(x1 + x2 + x1x2²)). (18.4)
The contour plot of the function g(x1, x2), depicted in Figure 18.1 (right), is obtained using the following script:
Listing 18.3: Contour plot of g(x1, x2) in (18.4) (see Figure 18.1 (right))
require(grDevices)
gxy <- function(x, y) exp(x - 2*x^2 - y^2)*sin(6*(x + y + x*y^2))
x1 <- x2 <- seq(-1.5, 1.5, length=200)
g <- outer(x1, x2, gxy)
rgb.palette <- colorRampPalette(c("firebrick2", "lightsalmon1", "oldlace"), space="rgb")
image(x1, x2, g, xlab=expression(x[1]), ylab=expression(x[2]), col=rgb.palette(256))
contour(x1, x2, g, levels=seq(-2, 2, by=0.25), add=TRUE)
Figure 18.1: Left: contour plot of the function 𝑓 (𝑥1 , 𝑥2 ) in (18.3) in the (𝑥1 , 𝑥2 ) plane; right:
contour plot of the function 𝑔(𝑥1 , 𝑥2 ) in (18.4) in the (𝑥1 , 𝑥2 ) plane.
The maximization problem (18.4) is equivalent to the following minimization problem:

min_{(x1,x2)∈ℝ²} −g(x1, x2) = −exp(x1 − 2x1² − x2²) sin(6(x1 + x2 + x1x2²)). (18.5)
Using the steepest descent method, implemented in R, the problem (18.5) can be
solved as follows:
Listing 18.4: Solving the problem (18.5) using the steepest descent method
#The package pracma is required here
library(pracma)
# Defining the function -g(x1, x2); its global minimum is reached at x1=0.2538, x2=0.0076
mgx <- function(x) -exp(x[1]-2*x[1]^2-x[2]^2)*sin(6*(x[1]+x[2]+x[1]*x[2]^2))
# Defining an initial solution
x0 <- c(1, 1)
# Calling the steepest descent method
sol <- steep_descent(x0, mgx)
Warning message:
In steep_descent(x0, mgx) :
  Maximum number of iterations reached -- not converged.
# Using a new initial solution
x0 <- c(-1.2, 1)
sol <- steep_descent(x0, mgx)
sol$xmin
[1] 0.5046936 0.6012042 # Obtained optimal values of x1 and x2
sol$fmin
[1] -0.6880447 # Obtained optimal value of -g
# Using a new initial solution
x0 <- c(0, 0)
sol <- steep_descent(x0, mgx)
sol$xmin
[1] 0.253778741 0.007586204 # Obtained optimal values of x1 and x2
sol$fmin
[1] -1.133047 # Optimal value of -g, hence g = 1.133047
# Thus, depending on the initial solution, the method either does not converge
# at all, or it converges towards a local or a global minimum of -g
Note that the convergence and solution given by the steepest descent method depend
on both the form of the function to be minimized and the initial solution.
where several types of formulas for β_k have been proposed. The most well-known formulas are those proposed by Fletcher–Reeves (FR), Polak–Ribière–Polyak (PRP), and Hestenes–Stiefel (HS).
In R, the implementation of the conjugate gradient method can be found in the gen-
eral multipurpose package optimx. This implementation of the conjugate gradient
method can be used to solve the problem (18.3) as follows:
Listing 18.5: Solving the problem (18.3) using the conjugate gradient method
#The package optimx is required here
library(optimx)
# Defining the function f(x1, x2); its only minimum is reached at x1=0 and x2=0
fx <- function(x) x[1]^2 + x[2]^2
# Defining an initial solution
x0 <- c(-1.2, 1)
# Calling the conjugate gradient method (CG)
sol <- optimx(x0, fx, method = "CG")
sol$par
$par
[1] 4.924372e-07 -4.103643e-07 # Obtained optimal values of x1 and x2
sol$fvalues
$fvalues
[1] 4.108933e-13 # Obtained optimal value of f
# Using a new initial solution
x0 <- c(10, 10)
sol <- optimx(x0, fx, method = "CG")
sol$par
$par
[1] 2.166724e-07 2.166724e-07 # Obtained optimal values of x1 and x2
sol$fvalues
$fvalues
[1] 9.389383e-14 # Obtained optimal value of f
# Thus, for any initial solution the method converges towards the only minimum of f
Now, let us use the conjugate gradient method, implemented in the package optimx
to solve the problem (18.4).
Listing 18.6: Solving the problem (18.4) using the conjugate gradient method
#The package optimx is required here
library(optimx)
# Defining the function -g(x1, x2); its global minimum is reached at x1=0.2538, x2=0.0076
mgx <- function(x) -exp(x[1]-2*x[1]^2-x[2]^2)*sin(6*(x[1]+x[2]+x[1]*x[2]^2))
# Defining an initial solution
x0 <- c(1, 1)
# Calling the conjugate gradient method (CG)
sol <- optimx(x0, mgx, method = "CG")
sol$par
$par
[1] 1.511231 2.016832 # Obtained optimal values of x1 and x2
sol$fvalues
$fvalues
[1] -0.0008079518 # Obtained optimal value of -g
# Using a new initial solution
x0 <- c(-1.2, 1)
sol <- optimx(x0, mgx, method = "CG")
sol$par
$par
[1] -1.018596 1.491097 # Obtained optimal values of x1 and x2
sol$fvalues
$fvalues
[1] -0.004766019 # Obtained optimal value of -g
# Using a new initial solution
x0 <- c(0, 0)
sol <- optimx(x0, mgx, method = "CG")
sol$par
$par
[1] 0.253778535 0.007586373 # Obtained optimal values of x1 and x2
sol$fvalues
$fvalues
[1] -1.133047 # Obtained optimal value of -g
# Thus, depending on the initial solution, the method converges towards either
# a local or a global minimum of -g
The solution to the problem (18.4) with the initial solution x⁽⁰⁾ = (x1⁽⁰⁾, x2⁽⁰⁾) = (1, 1) is x̄ = (1.5112, 2.0168), with −g(x̄) = −0.0008079518, which is a local minimum. However, in contrast with the steepest descent method, the conjugate gradient method converges with the initial solution x⁽⁰⁾ = (1, 1).
Since the computation of the Hessian matrix is generally expensive, several modifi-
cations of Newton’s method have been suggested in order to improve its computa-
tional efficiency. One variant of Newton’s method is the Broyden–Fletcher–Goldfarb–
Shanno (BFGS) method, which uses the gradient to iteratively approximate the in-
verse of the Hessian matrix 𝐻𝑘−1 = [∇2 𝑓 (𝑥(𝑘) )]−1 , as follows:
H_k⁻¹ = (I − (s_k y_kᵀ)/(s_kᵀ y_k)) H_{k−1}⁻¹ (I − (y_k s_kᵀ)/(s_kᵀ y_k)) + (s_k s_kᵀ)/(s_kᵀ y_k),

where s_k = x⁽ᵏ⁾ − x⁽ᵏ⁻¹⁾ and y_k = ∇f(x⁽ᵏ⁾) − ∇f(x⁽ᵏ⁻¹⁾).
Now, let us use the BFGS method, implemented in the package optimx, to solve the
problem (18.4).
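A minimal sketch, assuming the function mgx and the package optimx as defined above, and starting from x0 = (0, 0):

library(optimx)
mgx <- function(x) -exp(x[1]-2*x[1]^2-x[2]^2)*sin(6*(x[1]+x[2]+x[1]*x[2]^2))
x0 <- c(0, 0)
sol <- optimx(x0, mgx, method="BFGS")
sol   # the minimizer should be close to (0.2538, 0.0076) with -g = -1.133047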
Gradient-based methods rely upon information about at least the gradient of the
objective-function to estimate the direction of search and the step size. Therefore,
if the derivative of the function cannot be computed, because, for example, the
objective-function is discontinuous, these methods often fail. Furthermore, although
these methods can perform well on functions with only one extremum (unimodal functions), such as (18.3), their efficiency in solving problems with multimodal functions depends upon how far the initial solution is from the global minimum, i. e., gradient-
based methods are more or less efficient in finding the global minimum only if they
start from an initial solution sufficiently close to it. Therefore, the solution obtained
using these methods may be one of several local minima, and we often cannot be
sure that the solution is the global minimum. In this section, we will present some
commonly used derivative-free methods, which aim to reduce the limitations of the gradient-based methods.

The Nelder–Mead method maintains a simplex of n + 1 vertices x_1, …, x_{n+1}, ordered such that

f(x_1) ≤ f(x_2) ≤ ⋯ ≤ f(x_{n+1}). (18.12)

Thus, x_1 and x_{n+1} correspond to the best and worst vertices, respectively.
At each iteration, the Nelder–Mead method consists of four possible operations:
reflection, expansion, contraction, and shrinking. Each of these operations has a
scalar parameter associated with it. Let us denote by 𝛼, 𝛽, 𝛾, and 𝛿 the parameters
associated with the aforementioned operations, respectively. These parameters are
chosen such that 𝛼 > 0, 𝛽 > 1, 0 < 𝛾 < 1, and 0 < 𝛿 < 1.
Then, the Nelder–Mead simplex algorithm, as described in Lagarias et al. [207],
can be summarized as follows:
– Step 0: Generate a simplex with 𝑛 + 1 vertices, and choose a convergence crite-
rion;
– Step 1: Sort the n + 1 vertices according to their objective-function values, i. e., so that (18.12) holds. Then, evaluate the centroid of the points in the simplex, excluding x_{n+1}, given by x̄ = (1/n) ∑_{i=1}^{n} x_i;
– Step 2:
  – Calculate the reflection point x_r = x̄ + α(x̄ − x_{n+1});
  – If f(x_1) ≤ f(x_r) ≤ f(x_n), then perform a reflection by replacing x_{n+1} with x_r;
– Step 3:
  – If f(x_r) < f(x_1), then calculate the expansion point x_e = x̄ + β(x_r − x̄);
  – If f(x_e) < f(x_r), then perform an expansion by replacing x_{n+1} with x_e;
  – otherwise (i. e. f(x_e) ≥ f(x_r)), then perform a reflection by replacing x_{n+1} with x_r;
– Step 4:
  – If f(x_n) ≤ f(x_r) < f(x_{n+1}), then calculate the outside contraction point x_oc = x̄ + γ(x_r − x̄);
  – If f(x_oc) ≤ f(x_r), then perform an outside contraction by replacing x_{n+1} with x_oc;
  – otherwise (i. e. if f(x_oc) > f(x_r)), then go to Step 6;
– Step 5:
  – If f(x_r) ≥ f(x_{n+1}), then calculate the inside contraction point x_ic = x̄ − γ(x_r − x̄);
  – If f(x_ic) < f(x_{n+1}), then perform an inside contraction by replacing x_{n+1} with x_ic;
  – otherwise (i. e. if f(x_ic) ≥ f(x_{n+1})), then go to Step 6;
– Step 6: Perform a shrink by updating x_i, 2 ≤ i ≤ n + 1, as follows: x_i = x_1 + δ(x_i − x_1);
Listing 18.9: Solving the problem (18.3) using the Nelder–Mead method
#The package optimx is required here
library(optimx)
# Defining the function f(x1, x2); its only minimum is reached at x1=0 and x2=0
fx <- function(x) x[1]^2 + x[2]^2
# Defining an initial solution
x0 <- c(-1.2, 1)
# Calling the Nelder-Mead method
sol <- optimx(x0, fx, method = "Nelder-Mead")
sol$par
$par
[1] 1.992835e-04 3.659277e-05 # Obtained optimal values of x1 and x2
sol$fvalues
$fvalues
[1] 4.105294e-08 # Obtained optimal value of f
Now, let us use the Nelder–Mead method, implemented in the package optimx, to
solve the problem (18.4).
Listing 18.10: Solving the problem (18.4) using the Nelder–Mead method
#The package optimx is required here
library(optimx)
# Defining the function -g(x1, x2); its global minimum is reached at x1=0.2538 and x2=0.0076
mgx <- function(x) -exp(x[1]-2*x[1]^2-x[2]^2)*sin(6*(x[1]+x[2]+x[1]*x[2]^2))
# Defining an initial solution
x0 <- c(1, 1)
sol <- optimx(x0, mgx, method = "Nelder-Mead")
sol$par
$par
[1] 0.8007997 1.2758681 # Obtained optimal values of x1 and x2
sol$fvalues
$fvalues
[1] -0.1201183 # Obtained optimal value of -g
# Using a new initial solution
x0 <- c(-1.2, 1)
sol <- optimx(x0, mgx, method = "Nelder-Mead")
sol$par
$par
[1] -0.9960316 1.5271877 # Obtained optimal values of x1 and x2
sol$fvalues
$fvalues
[1] -0.004783335 # Obtained optimal value of -g
# Using a new initial solution
x0 <- c(0, 0)
sol <- optimx(x0, mgx, method = "Nelder-Mead")
sol$par
$par
[1] 0.25377876 0.00758619 # Obtained optimal values of x1 and x2
sol$fvalues
$fvalues
[1] -1.133047 # Obtained optimal value of -g
# Thus, depending on the initial solution, the method converges towards either
# a local or a global minimum of -g
– Step 0:
– Construct an initial solution 𝑥0 ; set 𝑥 = 𝑥0 ;
– Set the number of Monte Carlo steps 𝑁MC = 0;
– Set the temperature, 𝑇 , to some high value, 𝑇0 .
– Step 1: Choose a transition Δx at random.
– Step 2: Evaluate Δf = f(x + Δx) − f(x).
– Step 3:
  – If Δf ≤ 0, then accept the state by updating x as follows: x ← x + Δx;
  – Otherwise (i. e. Δf > 0), draw a uniform random number r ∈ [0, 1]; if r < exp(−Δf/T), then accept the state by updating x ← x + Δx;
– Step 4:
  – Update the temperature value as follows: T ← T − ε_T, where ε_T ≪ T is a specified positive real value.
  – Update the number of Monte Carlo steps: N_MC ← N_MC + 1.
– Step 5:
  – If T ≤ 0, then stop, and return x;
  – Otherwise (i. e. T > 0), go to Step 1.
Now, let us use the simulated annealing method, implemented in the package GenSA,
to solve the problem (18.4).
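A minimal sketch, assuming the function mgx as defined above; the search bounds are our own choice:

library(GenSA)
mgx <- function(x) -exp(x[1]-2*x[1]^2-x[2]^2)*sin(6*(x[1]+x[2]+x[1]*x[2]^2))
set.seed(1)
sol <- GenSA(par=c(1, 1), fn=mgx, lower=c(-1.5, -1.5), upper=c(1.5, 1.5))
sol$par    # close to the global minimizer (0.2538, 0.0076)
sol$value  # close to -1.133047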
𝑙𝑗 ≤ 𝑥𝑗 ≤ 𝑢𝑗 , 𝑗 = 1, . . . , 𝑛.
Minimize 𝑓 (𝑥1 , 𝑥2 ) = 𝑥2 − 𝑥1
subject to 2𝑥1 − 𝑥2 ≥ −2
(𝑃2 ) 𝑥1 − 𝑥2 ≤ 2
𝑥1 + 𝑥2 ≤ 5
𝑥1 , 𝑥2 ≥ 0
Minimize 𝑓 (𝑥1 , 𝑥2 ) = 𝑥2 − 𝑥1
subject to 2𝑥1 − 𝑥2 ≥ −2
(𝑃4 ) 𝑥1 − 2𝑥2 ≤ −8
𝑥1 + 𝑥2 ≤ 5
𝑥1 , 𝑥2 ≥ 0
The problem (𝑃1 ) can be solved using the lpSolveAPI package as follows:
The problem (𝑃2 ) can be solved using the lpSolveAPI package as follows:
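A minimal sketch of such a solution (the model object name lp2 and the reported optimum are our own; the bounds x1, x2 ≥ 0 are the lpSolveAPI defaults):

library(lpSolveAPI)
lp2 <- make.lp(0, 2)                      # 0 constraints (added below), 2 variables
set.objfn(lp2, c(-1, 1))                  # minimize x2 - x1
add.constraint(lp2, c(2, -1), ">=", -2)   # 2*x1 - x2 >= -2
add.constraint(lp2, c(1, -1), "<=", 2)    # x1 - x2 <= 2
add.constraint(lp2, c(1, 1), "<=", 5)     # x1 + x2 <= 5
solve(lp2)                                # status 0 indicates an optimal solution
get.objective(lp2)                        # optimal value: -2
get.variables(lp2)                        # attained, e.g., at x1 = 2, x2 = 0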
The problem (𝑃3 ) can be solved using the lpSolveAPI package as follows:
The problem (𝑃4 ) can be solved using the lpSolveAPI package as follows:
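A minimal sketch along the same lines (the model name lp4 is our own); here solve() returns a nonzero status code (2 in lp_solve's convention), since, as can be checked from the constraints, (P4) has no feasible solution:

library(lpSolveAPI)
lp4 <- make.lp(0, 2)
set.objfn(lp4, c(-1, 1))                  # minimize x2 - x1
add.constraint(lp4, c(2, -1), ">=", -2)   # 2*x1 - x2 >= -2
add.constraint(lp4, c(1, -2), "<=", -8)   # x1 - 2*x2 <= -8
add.constraint(lp4, c(1, 1), "<=", 5)     # x1 + x2 <= 5
solve(lp4)                                # returns 2: the problem is infeasible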
Suppose that, in the problem (P1), x1 is a binary variable (i. e., it takes only the value 0 or 1), and x2 is an integer variable; then it is necessary to set them to the appropriate type before solving the problem (P1). This can be done as follows:
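A minimal sketch, assuming lp1 is the lpSolveAPI model built for (P1):

library(lpSolveAPI)
# lp1 is assumed to be the lpSolveAPI model for problem (P1)
set.type(lp1, columns=1, type="binary")    # x1 in {0, 1}
set.type(lp1, columns=2, type="integer")   # x2 integer
solve(lp1)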
Optimize 𝑓 (𝑥)
subject to 𝑔𝑖 (𝑥) = 𝑏𝑖 , 𝑖 ∈ 𝐼 ⊆ {1, . . . , 𝑚};
ℎ𝑗 (𝑥) ≤ 𝑑𝑗 , 𝑗 ∈ 𝐽 ⊆ {1, . . . , 𝑚};
(18.13)
ℎ𝑘 (𝑥) ≥ 𝑑𝑘 , 𝑘 ∈ 𝐾 ⊆ {1, . . . , 𝑚};
𝑥 ≤ 𝑢,
𝑥 ≥ 𝑙,
where
– Optimize = Minimize or Maximize;
– 𝑓 : R𝑛 −→ R, 𝑔𝑖 : R𝑛 −→ R, ∀ 𝑖 ∈ 𝐼, ℎ𝑟 : R𝑛 −→ R, ∀ 𝑟 ∈ 𝐽 ∪ 𝐾, with at least
one of these functions being nonlinear;
– 𝐼, 𝐽, and 𝐾 are disjunct and 𝐼 ∪ 𝐽 ∪ 𝐾 = {1, . . . , 𝑚};
– 𝑏𝑖 , 𝑑𝑗 , 𝑑𝑘 ∈ R ∪ {±∞}, ∀ 𝑖, 𝑗, 𝑘;
– 𝑙, 𝑢 ∈ (R ∪ {±∞})𝑛 .
Without loss of the generality, assume that the problem (18.13) is a minimization
problem, and let us multiply the inequality constraints of type “≥” by −1; then the
problem can be rewritten as:
Minimize 𝑧 = 𝑓 (𝑥)
subject to 𝑔𝑖 (𝑥) = 𝑏𝑖 , 𝑖 = 1, . . . , 𝑝, with 𝑝 ≤ 𝑚; (18.14)
ℎ𝑗 (𝑥) ≤ 𝑑𝑗 , 𝑗 = 1, . . . , 𝑚 − 𝑝.
The Lagrangian associated with the problem (18.14) is defined by

L(x, λ, μ) = f(x) + ∑_{i=1}^{p} λ_i (g_i(x) − b_i) + ∑_{j=1}^{m−p} μ_j (h_j(x) − d_j), (18.15)

where λ_i and μ_j are the Lagrangian multipliers associated with the constraints g_i(x) = b_i and h_j(x) ≤ d_j, respectively.
The fundamental result behind the Lagrangian formulation (18.15) can be sum-
marized as follows: suppose that a solution 𝑥* = (𝑥*1 , 𝑥*2 , . . . , 𝑥*𝑛 ) minimizes the
function 𝑓 (𝑥) subject to the constraints 𝑔𝑖 (𝑥) = 𝑏𝑖 , for 𝑖 = 1, . . . , 𝑝 and ℎ𝑗 (𝑥) ≤ 𝑑𝑗 ,
for 𝑗 = 1, . . . , 𝑚 − 𝑝. Then we have one of the following:
1. Either there exist vectors λ* = (λ*_1, …, λ*_p) and μ* = (μ*_1, …, μ*_{m−p}) such that

∇f(x*) + ∑_{i=1}^{p} λ*_i ∇g_i(x*) + ∑_{j=1}^{m−p} μ*_j ∇h_j(x*) = 0; (18.16)
μ*_j (h_j(x*) − d_j) = 0, j = 1, …, m − p; (18.17)
μ*_j ≥ 0, j = 1, …, m − p; (18.18)

2. Or the vectors ∇g_i(x*), for i = 1, …, p, and ∇h_j(x*), for j = 1, …, m − p, are linearly dependent.
The result that is of greatest interest is the first one, i. e., case 1. From the equation
(18.17), either 𝜇𝑗 is zero or ℎ𝑗 (𝑥* ) − 𝑑𝑗 = 0. This provides various possible solutions
and the optimal solution is one of these. For an optimal solution, x*, some of the inequality constraints will be satisfied at equality, and others will not. The latter can be ignored, whereas the former form the second equation above. Thus, the
constraints 𝜇*𝑗 (ℎ𝑗 (𝑥* ) − 𝑑𝑗 ) = 0 mean that either an inequality constraint is satisfied
at equality, or the Lagrangian multiplier 𝜇𝑗 is zero.
The conditions (18.16)–(18.18) are referred to as the Karush–Kuhn–Tucker
(KKT) conditions, and they are necessary conditions for a solution to a non-
linear constrained optimization problem to be optimal. For a maximization-type
problem, the conditions (KKT) remain unchanged with the exception of the first
condition (18.16), which is written as
∇f(x*) − ∑_{i=1}^{p} λ*_i ∇g_i(x*) − ∑_{j=1}^{m−p} μ*_j ∇h_j(x*) = 0.
Note that the KKT conditions (18.16)–(18.18) represent the stationarity, the
complementary slackness and the dual feasibility, respectively. Other supplementary
KKT conditions are the primal feasibility conditions defined by constraints of the
problem (18.14).
In R, an implementation of the Lagrange multiplier method, for solving nonlinear
constrained optimization problems, can be found in the package Rsolnp.
The function solnp from the R package Rsolnp can be used to solve such constrained nonlinear minimization problems.
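As an illustration, a minimal sketch with our own example problem (minimize x1² + x2² subject to x1 + x2 = 1, whose true solution is x1 = x2 = 0.5):

library(Rsolnp)
fn <- function(x) x[1]^2 + x[2]^2            # objective to minimize
eq <- function(x) x[1] + x[2]                # equality constraint function
sol <- solnp(pars=c(2, -1), fun=fn, eqfun=eq, eqB=1)  # subject to x1 + x2 = 1
sol$pars   # approx. (0.5, 0.5)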
The likelihood function of a parameter ω for a given observed data set, 𝒟, denoted ℒ, is defined via the probability of the observed data:

P(𝒟, ω) ≈ ∏_{i=1}^{n} Δx_i f(x_i|ω) = (∏_{i=1}^{n} Δx_i)(∏_{i=1}^{n} f(x_i|ω)). (18.22)

Since the bin widths Δx_i do not depend on ω, the likelihood is defined as

ℒ(ω) = ∏_{i=1}^{n} f(x_i|ω). (18.23)
Definition 18.5.2. The value of the parameter 𝜔, which maximizes the likelihood
ℒ(𝜔), hence the probability of the observed dataset 𝑃 (𝒟, 𝜔), is known as the maxi-
mum likelihood estimator (MLE) of 𝜔 and is denoted 𝜔 ^.
Note that the MLE 𝜔 ^ is a function of the data sample 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 . The like-
lihood function (18.23) is often complex to manipulate and, in practice, it is more
convenient to work with the logarithm of ℒ(𝜔) (log ℒ(𝜔)), which also yields the same
optimal parameter 𝜔^.
The MLE problem can then be formulated as the following optimization problem, which can be solved using the numerical methods, implemented in R, presented above:

maximize_ω log ℒ(ω) = ∑_{i=1}^{n} log f(x_i|ω).
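As an illustration, a minimal sketch with our own example, the MLE of the rate λ of an exponential distribution, whose analytical MLE is 1/mean(x):

set.seed(42)
x <- rexp(100, rate=2)    # observed data (hypothetical sample)
negloglik <- function(lambda) -sum(dexp(x, rate=lambda, log=TRUE))
sol <- optimize(negloglik, interval=c(0.001, 10))
sol$minimum   # numerical MLE of the rate
1/mean(x)     # analytical MLE for comparison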
Suppose that we are given the following data points: (𝑥1 , 𝑦1 ), . . . , (𝑥𝑛 , 𝑦𝑛 ), where
𝑥𝑖 ∈ R𝑚 and 𝑦 ∈ {−1, +1}. The fundamental idea, behind the concept of support
vector machine (SVM) classification [191], is to find a pair (𝑤, 𝑏) ∈ R𝑚 ×R such that
the hyperplane defined by ⟨𝑤, 𝑥⟩ + 𝑏 = 0 separates the data points labeled 𝑦𝑖 = +1
from those labeled 𝑦𝑖 = −1, and maximizes the distance to the closest points from
either class. If the points (𝑥𝑖 , 𝑦𝑖 ), 𝑖 = 1, . . . , 𝑛 are linearly separable, then such a
pair exists.
Let (𝑥1 , 𝑦1 ) and (𝑥2 , 𝑦2 ), with 𝑦1 = +1 and 𝑦2 = −1, be the closest points on
either sides of the optimal hyperplane defined by ⟨𝑤, 𝑥⟩ + 𝑏 = 0. Then, we have
{ ⟨w, x_1⟩ + b = +1,
{ ⟨w, x_2⟩ + b = −1.     (18.25)
minimize_{w∈ℝᵐ} (1/2)‖w‖². (18.26)
{ ⟨w, x_i⟩ + b ≥ +1, if y_i = +1
{ ⟨w, x_i⟩ + b ≤ −1, if y_i = −1      ⟹  y_i(⟨w, x_i⟩ + b) ≥ 1. (18.27)
minimize_{w∈ℝᵐ, b∈ℝ} z(w) = (1/2)‖w‖²
subject to y_i(⟨w, x_i⟩ + b) ≥ 1, for i = 1, …, n. (18.28)
The Lagrangian associated with the problem (18.28) can be defined as follows:

L(w, b, λ) = (1/2)‖w‖² − ∑_{i=1}^{n} λ_i (y_i(⟨w, x_i⟩ + b) − 1), (18.29)

where the λ_i ≥ 0 are the Lagrangian multipliers.
Setting the derivatives of L(w, b, λ) with respect to b and w to zero yields

∑_{i=1}^{n} λ_i y_i = 0, (18.31)

and

w = ∑_{i=1}^{n} λ_i y_i x_i. (18.32)
Substituting w into the Lagrangian (18.29) leads to the following optimization problem, also known as the dual formulation of the support vector classifier:

maximize_{λ∈ℝⁿ} Z(λ) = ∑_{i=1}^{n} λ_i − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} λ_i λ_j y_i y_j ⟨x_i, x_j⟩
subject to ∑_{i=1}^{n} λ_i y_i = 0,
           λ_i ≥ 0, for i = 1, …, n. (18.33)
Both problems (18.28) and (18.33) can be solved in R using the package Rsolnp, as
illustrated in Listing (18.18). However, since the constraints of (18.28) are relatively
complex, it is computationally easier to solve the problem (18.33) and then recover
the vector 𝑤 through (18.32).
18.7 Summary
Optimization is a broad and complex topic. One of the major challenges in opti-
mization is the determination of global optima for nonlinear and high-dimensional
problems. Generally, optimization methods find applications in attempts to optimize
a parametric decision-making process, such as classification, clustering, or regression
of data. The corresponding optimization problems either involve complex nonlinear
functions or are based on data points, i. e., the problems include discontinuities.
Knowledge about optimization methods can be helpful in designing analysis methods, since these usually involve difficult optimization problems. Hence, a parsimonious approach to designing such analysis methods will also help to keep the optimization problems tractable.
18.8 Exercises
1. Consider the following unconstrained problem:
Using R, provide the contour plot of the function 𝑓 (𝑥1 , 𝑥2) and solve the prob-
lem (18.34) using
– the steepest descent method;
– the conjugate gradient method;
– Newton’s method;
– the Nelder–Mead method;
– Simulated annealing;
2. Consider the following unconstrained problem:
min_{(x1,x2)∈ℝ²} z(x1, x2) = −2x1 − 3x2 + (1/5)x1² + 2x2² − 3x1x2. (18.35)
Using R, provide the contour plot of the function 𝑧(𝑥1 , 𝑥2) and solve the prob-
lem (18.35) using
– the steepest descent method;
– the conjugate gradient method;
– Newton’s method;
– the Nelder–Mead method;
– Simulated annealing;
3. Using the R package lpSolveAPI, solve the following linear programming prob-
lems:
4. Using the function solnp from the R package Rsolnp, solve the following non-
linear constrained optimization problems:
[102] J. Jost. Partial Differential Equations. Springer, New York, NY, USA, 2007.
[103] G. Julia. Mémoire sur l’itération des fonctions rationnelles. J. Math. Pures Appl.,
8:47–245, 1918.
[104] B. Junker, D. Koschützki, and F. Schreiber. Exploration of biological network centralities
with centibin. BMC Bioinform., 7(1):219, 2006.
[105] Joseph B. Kadane. Principles of Uncertainty. Chapman and Hall/CRC, 2011.
[106] Tomihisa Kamada, Satoru Kawai, et al. An algorithm for drawing general undirected
graphs. Inf. Process. Lett., 31(1):7–15, 1989.
[107] M. Kanehisa and S. Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic
Acids Res., 28:27–30, 2000.
[108] Daniel Kaplan and Leon Glass. Understanding Nonlinear Dynamics. Springer Science &
Business Media, 2012.
[109] S. A. Kauffman. The Origin of Order: Self Organization and Selection in Evolution. Oxford
University Press, USA, 1993.
[110] S. V. Kedar. Programming Paradigms and Methodology. Technical Publications, 2008.
[111] U. Kirch-Prinz and P. Prinz. C++. Lernen und professionell anwenden. mitp Verlag, 2005.
[112] D. G. Kleinbaum and M. Klein. Survival Analysis: A Self-Learning Text. Statistics for
Biology and Health. Springer, 2005.
[113] V. Kontorovich, L. A. Beltrán, J. Aguilar, Z. Lovtchikova, and K. R. Tinsley. Cumulant analysis of Rössler attractor and its applications. Open Cybern. Syst. J., 3:29–39, 2009.
[114] Kevin B. Korb and Ann E. Nicholson. Bayesian Artificial Intelligence. CRC Press, 2010.
[115] R. C. Laubenbacher. Modeling and Simulation of Biological Networks. Proceedings of
Symposia in Applied Mathematics. American Mathematical Society, 2007.
[116] S. L. Lauritzen. Graphical Models. Oxford Statistical Science Series. Oxford University
Press, 1996.
[117] M. Z. Li, M. S. Ryerson, and H. Balakrishnan. Topological data analysis for aviation
applications. Transp. Res., Part E, Logist. Transp. Rev., 128:149–174, 2019.
[118] Dennis V. Lindley. Understanding Uncertainty. John Wiley & Sons, 2013.
[119] E. N. Lorenz. Deterministic nonperiodic flow. J. Atmos. Sci., 20:130–141, 1963.
[120] A. J. Lotka. Elements of Physical Biology. Williams and Wilkins, 1925.
[121] K. C. Louden. Compiler Construction: Principles and Practice. Course Technology, 1997.
[122] K. C. Louden and K. A. Lambert. Programming Languages: Principles and Practice.
Advanced Topics Series. Cengage Learning, 2011.
[123] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge
University Press, 2003.
[124] D. Maier. Theory of Relational Databases, 1st edition. Computer Science Press, 1983.
[125] B. B. Mandelbrot. The Fractal Geometry of Nature. W. H. Freeman and Company, San
Francisco, 1983.
[126] E. G. Manes and A. A. Arbib. Algebraic Approaches to Program Semantics. Monographs in
Computer Science. Springer, 1986.
[127] M. Marden. Geometry of Polynomials. Mathematical Surveys of the American
Mathematical Society, Vol. 3. Rhode Island, USA, 1966.
[128] O. Mason and M. Verwoerd. Graph theory and networks in biology. IET Syst. Biol.,
1(2):89–119, 2007.
[129] N. Matloff. The Art of R Programming: A Tour of Statistical Software Design. No Starch
Press, 2011.
[130] B. D. McKay. Graph isomorphisms. Congr. Numer., 730:45–87, 1981.
[131] J. M. McNamee. Numerical Methods for Roots of Polynomials. Part I. Elsevier, 2007.
[132] A. Mehler, M. Dehmer, and R. Gleim. Towards logical hypertext structure. A
402 | Bibliography
[190] V. van Noort, B. Snel, and M. A. Huymen. The yeast coexpression network has a
small-world, scale-free architecture and can be explained by a simple model. EMBO Rep.,
5(3):280–284, 2004.
[191] V. Vapnik. Statistical Learning Theory. J. Willey, 1998.
[192] Vladimir Naumovich Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[193] V. Volterra. Variations and fluctuations of the number of individuals in animal species
living together. In R. N. Chapman, editor, Animal Ecology. McGraw-Hill, 1931.
[194] J. von Neumann. The Theory of Self-Reproducing Automata. University of Illinois Press,
Urbana, 1966.
[195] Andreas Wagner and David A. Fell. The small world inside large metabolic networks. Proc.
R. Soc. Lond. B, Biol. Sci., 268(1478):1803–1810, 2001.
[196] J. Wang and G. Provan. Characterizing the structural complexity of real-world complex
networks. In J. Zhou, editor, Complex Sciences, volume 4 of Lecture Notes of the Institute
for Computer Sciences, Social Informatics and Telecommunications Engineering, pages
1178–1189. Springer, Berlin/Heidelberg, Germany, 2009.
[197] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications.
Structural Analysis in the Social Sciences. Cambridge University Press, 1994.
[198] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature,
393:440–442, 1998.
[199] A. Weil. Basic Number Theory. Springer, 2005.
[200] Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer, 2016.
[201] Hadley Wickham. Advanced R, 2nd edition. Chapman and Hall/CRC, 2019.
[202] R. Wilhelm and D. Maurer. Übersetzerbau: Theorie, Konstruktion, Generierung. Springer,
1997.
[203] Thomas Wilhelm, Heinz-Peter Nasheuer, and Sui Huang. Physical and functional
modularity of the protein network in yeast. Mol. Cell. Proteomics, 2(5):292–298, 2003.
[204] Leland Wilkinson. The grammar of graphics. In Handbook of Computational Statistics,
pages 375–414. Springer, 2012.
[205] S. Wolfram. Statistical mechanics of cellular automata. Phys. Rev. E, 55(3):601–644,
1983.
[206] S. Wolfram. A New Kind of Science. Wolfram Media, 2002.
[207] J. A. Wright, M. H. Wright, P. Lagarias, and J. C. Reeds. Convergence properties of the
Nelder-Mead simplex algorithm in low dimensions. SIAM J. Optim., 9:112–147, 1998.
Index
absolute value 153
adjacency matrix 302
algorithm 167
analysis 225
antiderivative 239
argmax 162
argmin 162
Asynchronous Random Boolean Networks 276
attractor 263
attractors 262
aviation network 311

bar plot 101
basic programming 33
basin of the attractor 263
Bayes' theorem 353
Bayesian networks 339
Bernoulli distribution 339
Beta distribution 345
betweenness centrality 308
bifurcation 268
bifurcation point 268
binomial coefficients 159
Binomial distribution 340
binomial theorem 164
bivariate distribution 337
boolean functions 158
boolean logic 155
boolean value 162
Boundary Value ODE 249
Boundary Value ODE problem 253
breadth-first search 309
byte code compilation 78

Cartesian space 177
Cauchy–Schwarz inequality 365
ceiling function 152
cellular automata 262
central limit theorem 325, 364
centrality 307
chaotic behavior 269
character string 51
Chebyshev inequality 361
Chernoff bounds 366
Chi-square distribution 349
Cholesky factorization 215
Classical Random Boolean Networks 276
closed interval 151
closeness centrality 308
clustering coefficient 306
cobweb graph 266
codomain of a function 227
complex number 194
complex numbers 151
complex vectors 194
complexity 166
complexity classes 174
complexity of algorithms 170
computability 166, 169
computable 170
concentration inequalities 364
conditional entropy 358
conditional probability 329, 330
conjugate gradient 375
conjunctive normal form 157
constrained optimization 384
constraints 369
continuous distributions 344
contour plot 112
coordinate systems 189
correlation 337
covariance 336
Cramer's method 221
critical point 263
cross product 186
cumulative distribution function 331
curvature 230
cylindrical coordinates 192

data science 1
data structures 39
De Morgan's laws 328
decision variables 369
definite integral 239
degree 305
degree centrality 308
degree distribution 305
density plot 107
dependency structure 338
depth-first search 309
derivative 228
derivative-free methods 379
determinant 205