A Tour of Data Science: Learn R and Python in Parallel
Nailong Zhang
First edition published 2021
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.
Title: A tour of data science : learn R and Python in parallel / Nailong Zhang.
(Computer program language)
Contents
Preface ix
1.1 CALCULATOR 1
1.3 FUNCTIONS 4
1.8 MISCELLANEOUS 35
2.3 BENCHMARKING 47
2.4 VECTORIZATION 49
2.9 MISCELLANEOUS 69
3.1 SQL 71
3.4 ADD/REMOVE/UPDATE 86
3.5 GROUP BY 91
3.6 JOIN 93
Bibliography 203
Index 205
Preface
It is hard to give a clear definition of data science because it is not clear where its border is. Actually, many data scientists are working in different industries with different skill sets. Generally speaking, data science requires broad knowledge in statistics, machine learning, optimization and programming. It is impossible to cover even one of these four areas in depth in a short book, and doing so is also beyond the scope of this book. However, I think it might be useful to have a book that talks about data science with a broad range of topics and a moderate amount of technical detail. And that is one of the motivations for writing this book. I chose to keep the book short, with a minimum of mathematical theory introduced, and I hope that reading through the book helps the readers get a sense of what data science is about.
I assume the readers have basic knowledge of statistics, linear algebra and calculus, such as the normal distribution, sample size, gradients, matrix inversion, etc. Previous programming knowledge is not required. However, as the title of the book says, we will learn R and Python in parallel; if you are already familiar with either R or Python, the side-by-side comparison will help you learn the other.
Besides the comparison of the two popular languages used in data science, this book also focuses on the translation from mathematical models to code. In the book, the audience will find applications/implementations of some important algorithms from scratch, such as maximum likelihood estimation, inversion sampling, copula simulation, simulated annealing, bootstrapping, linear regression (lasso/ridge regression), logistic regression, gradient boosting trees, etc.
programming. Mastering these topics will greatly help with coding skills. Like the first
chapter, in this chapter, I try to emphasize the differences between R and Python
with coding examples.
Chapter 3: data.table and pandas - In the first two chapters, we focus on general-purpose programming techniques. In this chapter, we introduce the very basics of data science, i.e., data manipulation. For the audience with little experience in data science, we start from a brief introduction to SQL. The major part of this chapter focuses on the two widely used data.frame packages, i.e., data.table in R and pandas in Python. Side-by-side examples using the two packages not only enable the audience to learn the basic usage of these tools but also serve as a quick reference manual.
Chapter 4: Random Variables, Distributions & Linear Regression - In this chapter, we focus on statistics and linear regression, which form the foundation of data science. To better follow this chapter, I recommend any introductory-level statistics course as a prerequisite. The topics of this chapter include random variable sampling methods, distribution fitting, joint distribution/copula simulation, confidence interval calculation, and hypothesis testing. In later sections, we also talk about linear regression models from scratch. Many textbooks introduce the theories behind linear regression but still don't help much with the implementation. We will see how linear regression is implemented as a toy example in both R and Python with the help of linear algebra. I will also show how the basic linear regression model can be used for L2-penalized linear regression, i.e., ridge regression.
Chapter 5: Optimization in Practice - Most machine learning models rely on optimization algorithms. In this chapter, we give a brief introduction to optimization. Specifically, we will talk about convexity, gradient descent, general-purpose optimization tools in R and Python, linear programming and metaheuristic algorithms, etc. Based on these techniques, we will see coding examples about maximum likelihood estimation, linear regression, logistic regression, portfolio construction, and the traveling salesman problem.
Chapter 6: Machine Learning – A gentle introduction - Machine learning is a huge topic. In this chapter, I try to give a very short and gentle introduction to machine learning. It starts with a brief introduction to supervised learning, unsupervised learning and reinforcement learning, respectively. For supervised learning, we
will see the gradient boosting regression with a pure Python implementation from
scratch, from which the audience could learn the translation from the mathematical
models to object-oriented programs. For unsupervised learning, the finite Gaussian
mixture model and PCA are discussed. And for reinforcement learning, we will also
use a simple game as an example to show the usage of deep Q-networks. The goal
of this chapter is two-fold. First, I would like to give the audience an impression of
what machine learning looks like. Second, reading the code snippets in this chapter
could help the audience review/recap the topics in previous chapters.
CHAPTER 1
Introduction to R/Python
Programming
1.1 CALCULATOR
R and Python are general-purpose programming languages that can be used for writing software in a variety of domains. But for now, let us start with using them as basic calculators. The first thing is to have them installed. R and Python can be downloaded from their official websites (https://www.r-project.org and https://www.python.org). In this book, I will be using R 3.5 and Python 3.7.
To use R/Python as basic calculators, let’s get familiar with the interactive mode.
After the installation, we can type R or Python (it is case insensitive so we can also
type r/python) to invoke the interactive mode. Since Python 2 is installed by default
on many machines, in order to avoid invoking Python 2 we type python3.7 instead.
~ $ R

R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)
>
Python
~ $ python3.7
Python 3.7.1 (default, Nov  6 2018, 18:45:35)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
The messages displayed by invoking the interactive mode depend on both the version of R/Python installed and the machine. Thus, you may see different messages on your local machine. As the messages say, to quit R we can type q(). We are then prompted with three options asking whether the workspace should be saved. Since we just want to use R as a basic calculator, we quit without saving the workspace. To quit Python, we can simply type exit().
> q()
Save workspace image? [y/n/c]: n
~ $
Once we are inside the interactive mode, we can use R/Python as a calculator.
R
> 1+1
[1] 2
> 2*3+5
[1] 11
> log(2)
[1] 0.6931472
> exp(0)
[1] 1
Python
>>> 1+1
2
>>> 2*3+5
11
>>> log(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'log' is not defined
>>> exp(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'exp' is not defined
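In Python, the elementary mathematical functions live in modules rather than being available as built-in names. A minimal sketch using the standard math module (one option; numpy provides these functions as well):

Python
>>> import math
>>> math.log(2)
0.6931471805599453
>>> math.exp(0)
1.0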
In the previous section we have seen how to use R/Python as calculators. Now, let’s
see how to write real programs. First, let’s define some variables.
3
https://docs.python.org/3/tutorial/modules.html
R Python
R Python
1.3 FUNCTIONS
We have seen two functions log and exp when we use R/Python as calculators.
A function is a block of code which performs a specific task. A major purpose of
wrapping a block of code into a function is to reuse the code.
It is simple to define functions in R/Python.
R Python
Here, we defined a function fun1 in R/Python. This function takes x as input and
returns the square of x. When we call a function, we simply type the function name
followed by the input argument inside a pair of parentheses. It is worth noting that
input or output are not required to define a function. For example, we can define a
function fun2 to print Hello World! without input and output.
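As a minimal sketch of these two functions in Python (the R definitions follow the same idea with function(){}):

Python
def fun1(x):
    return x * x              # return the square of x

def fun2():
    print('Hello World!')     # neither input nor output is required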
One major difference between R and Python codes is that Python codes are
structured with indentation. Each logical line of R/Python code belongs to a certain
group. In R, we use {} to determine the grouping of statements. However, in Python
we use leading whitespace (spaces and tabs) at the beginning of a logical line to
compute the indentation level of the line, which is used to determine the statements’
grouping. Let’s see what happens if we remove the leading whitespace in the Python
function above.
Python
R
> fun2 = function(){print('Hello World!')}
> fun2()
[1] "Hello World!"

Python
>>> def fun2(): print('Hello World!')
...
>>> fun2()
Hello World!
Let's go back to fun1 and have a closer look at the return. In Python, if we want to return something we have to use the keyword return explicitly. return is a function in R, but it is not a function in Python, and that is why no parentheses follow return in Python. In R, return is not required even when the function returns a value. Instead, we can just put the value to return on the last line of the function defined in R. That being said, we can define fun1 as follows.
R Python
In Python we have to put the arguments with default values at the end of the argument list, which is not required in R. However, from a readability perspective, it is always better to put them at the end. You may have noticed the error message above about positional arguments. In Python there are two types of arguments, i.e., positional arguments and keyword arguments. Simply speaking, a keyword argument must be preceded by an identifier, e.g., base in the example above, while positional arguments are the non-keyword arguments.
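For example, a minimal Python sketch (the function name power and its default value are illustrative, not the book's original example):

Python
def power(x, p=2):
    # p has a default value, so it can be passed as a keyword argument
    return x ** p

power(3)         # 9, using the default value of p
power(3, 3)      # 27, p passed as a positional argument
power(3, p=3)    # 27, p passed as a keyword argument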
1.4.1 If/else
Let’s define a function to return the absolute value of input.
R Python
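A minimal Python sketch of such a function (the name abs_val is illustrative):

Python
def abs_val(x):
    if x >= 0:
        return x
    else:
        return -x

abs_val(-1.5)    # 1.5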
The code snippet above shows how to use if/else in R/Python. A subtle difference between R and Python is that the condition after if must be enclosed in parentheses in R, while in Python the parentheses are optional.
We can also put if after else. But in Python, we use elif as a shortcut.
R Python
Similar to the usage of if in R, we also have to use parentheses after the keyword for in R. But in Python there should be no parentheses after for.
R Python
There is something more interesting than the for loop itself in the snippets above.
In the R code, the expression 1:3 creates a vector with elements 1, 2 and 3. In the
Python code, we use the range() function for the first time. Let’s have a look at
them.
R Python
The range() function returns a sequence of numbers. It can take three arguments, i.e., range(start, stop, step), where start and step are both optional. It's critical to keep in mind that the stop argument, which defines the upper limit of the sequence, is exclusive. That is why, in order to loop through 1 to 3, we have to pass 4 as the stop argument to range(). The step argument specifies how much to increase from one number to the next. The default values of start and step are 0 and 1, respectively.
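For example (a short sketch in the Python interactive mode):

Python
>>> list(range(1, 4))        # stop=4 is exclusive, so we get 1, 2, 3
[1, 2, 3]
>>> list(range(0, 10, 2))    # step=2
[0, 2, 4, 6, 8]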
R Python
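A minimal Python sketch of a while loop (illustrative, not the book's original listing):

Python
i = 1
while i <= 3:
    print(i)
    i += 1    # add 1 to i in each iteration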
You may have noticed that in Python we can do i+=1 to add 1 to i, which is not
feasible in R by default. Both for loop and while loop can be nested.
1.4.4 Break/continue
Break/continue helps if we want to break the for/while loop earlier, or to skip a
specific iteration. In R, the keyword for continue is called next, in contrast to continue
in Python. The difference between break and continue is that calling break would
exit the innermost loop (when there are nested loops, only the innermost loop is
affected); while calling continue would just skip the current iteration and continue
the loop if not finished.
R Python
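For example, a minimal Python sketch (the loop bounds are illustrative):

Python
for i in range(1, 6):
    if i == 3:
        continue   # skip the rest of this iteration when i is 3
    if i == 5:
        break      # exit the innermost loop when i reaches 5
    print(i)
# prints 1, 2 and 4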
In the previous sections, we haven’t seen much difference between R and Python.
However, regarding the built-in data structures, there are some significant differences
we will see in this section.
R Python
In the code snippet above, the first element in the variable z in R is coerced from
1 (numeric) to "1" (character) since the elements must have the same type.
To access a specific element of a vector or list, we can use []. In R, sequence types are indexed beginning with one. In contrast, sequence types in Python are indexed beginning with zero.
R Python
R
> x[1]
[1] 1

Python
>>> x[1]
2
>>> x[0]
1
R Python
In Python, negative index number means indexing from the end of the list. Thus,
x[−1] points to the last element and x[−2] points to the second-last element of
the list. But R doesn’t support indexing with negative numbers in the same way as
Python. Specifically, in R x[−index] returns a new vector with x[index] excluded.
When we try to access with an index out of boundary, Python would throw an
IndexError. The behavior of R when indexing out of boundary is more interesting.
First, when we try to access x[0] in R we get a numeric(0) whose length is also
0. Since its length is 0, numeric(0) can be interpreted as an empty numeric vector.
When we try to access x[length(x)+1] we get an NA. In R, there are also NaN and
NULL.
NaN means "Not A Number", and it can be verified by checking its type, which is "double"; 0/0 results in NaN in R. NA in R generally represents missing values. And NULL represents a NULL (empty) object. To check if a value is NA, NaN or NULL, we can use is.na(), is.nan() or is.null(), respectively.
R Python
R Python
R Python
When we use the ∗ operator to make replicates of a list, there is one caveat - if
the element inside the list is mutable then the replicated elements point to the same
memory address. As a consequence, if one element is mutated other elements are also
affected.
Python
>>> x = [0]
>>> y = [x]*2 + [2] + [x]*2   # replicate the inner list with * (the construction of y is reconstructed)
>>> x[0] = -1                 # mutate x; every replicated reference sees the change
>>> y
[[-1], [-1], 2, [-1], [-1]]
>>> x
[-1]
How do we get a list with replicated elements that point to different memory addresses?

Python
>>> x=[0]
>>> y=[x[:] for _ in range(5)]  # [:] makes a copy of the list x, so each element is a distinct list
Python
The code snippets above use the hash character # for comments in both R and Python. Everything after # on the same line is treated as a comment (not executable). In the R code, we also used the function seq() to create a vector. When I see a function that I haven't seen before, I might either google it or use the built-in help mechanism. Specifically, in R use ? and in Python use help().
R Python
• Condition-based
Condition-based slicing means to select a subset of the elements which satisfy
certain conditions. In R, it is quite straightforward by using a boolean vector
whose length is the same as the vector to be sliced.
Python
We can also use if statement with list comprehension to filter a list to achieve
list slicing.
Python
Python
The example above shows the power of list comprehension. To use if with list comprehension, the if statement should be placed at the end, after the for clause; but to use if/else with list comprehension, the if/else statement should be placed before the for clause.
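A minimal Python sketch of both forms (the example list is illustrative):

Python
>>> x = [1, 2, 5, 10]
>>> [e for e in x if e % 2 == 0]           # if placed after the for clause: filtering
[2, 10]
>>> [e if e % 2 == 0 else 0 for e in x]    # if/else placed before the for clause
[0, 2, 0, 10]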
R Python
R Python
5 [1] 1 2 3 4 5 6 7 8 5 [1, 2, 3, 4, 5, 6, 7, 8]
As the list structure in Python is mutable, there are many things we can do with
list.
Python
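For instance, a few of the commonly used list methods (a minimal sketch; the chosen methods are illustrative):

Python
>>> x = [1, 2, 5]
>>> x.append(10)     # add an element to the end
>>> x.insert(0, 0)   # insert an element at a given position
>>> x.pop()          # remove and return the last element
10
>>> x.reverse()      # reverse the list in place
>>> x
[5, 2, 1, 0]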
I like the list structure in Python much more than the vector structure in R. list in Python has a lot more useful features, which can be found in the official Python documentation4.
1.5.2 Array
Array is one of the most important data structures in scientific programming. In R,
there is also an object type "matrix", but according to my own experience, we can
almost ignore its existence and use array instead. We can definitely use list as array
in Python, but lots of linear algebra operations are not supported for the list type.
Fortunately, there is a Python package numpy off the shelf.
R
> x=1:12
> array1=array(x, c(4,3))  # convert vector x to a 4 rows * 3 cols array
> array1
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> y=1:6
> array2=array(y, c(3,2))  # (this line is reconstructed)
> array2
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> array3=array1 %*% array2  # %*% is the matrix multiplication operator
> array3
     [,1] [,2]
[1,]   38   83
[2,]   44   98
[3,]   50  113
[4,]   56  128
> dim(array3)  # get the dimension of array3
[1] 4 2
Python
>>> import numpy as np
>>> array1=np.reshape(list(range(1,13)), (4,3))  # (this line is reconstructed; no order specified)
>>> array1
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
You may have noticed that the results of the R code snippet and the Python code snippet are different. The reason is that in R the conversion from a vector to an array fills the array column by column (column-major order), while numpy's reshape fills it row by row (row-major, C-style order) by default. To get the same layout as in R, we can pass order='F' (Fortran-style, column-major) to np.reshape.
Python
>>> array1=np.reshape(list(range(1,13)), (4,3), order='F')  # use order='F'
>>> array1
array([[ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11],
       [ 4,  8, 12]])
>>> array2=np.reshape(list(range(1,7)), (2,3)).T  # .T can transpose an array (this line is reconstructed)
>>> array2
array([[1, 4],
       [2, 5],
       [3, 6]])
>>> np.dot(array1, array2)  # now we get the same result as using R
array([[ 38,  83],
       [ 44,  98],
       [ 50, 113],
       [ 56, 128]])
To learn more about numpy, the official website5 has great documentation and tutorials.
The term broadcasting describes how arrays with different shapes are handled
during arithmetic operations. A simple example of broadcasting is given below.
R Python
3 [1] 2 3 4 3 >>> x + 1
However, the broadcasting rules in R and Python are not exactly the same.
R
> x = array(1:6, c(3, 2))   # (the creation of x, y, z is reconstructed)
> y = 1:3
> z = 1:2
> x * y
     [,1] [,2]
[1,]    1    4
[2,]    4   10
[3,]    9   18
> x * z
     [,1] [,2]
[1,]    1    8
[2,]    4    5
[3,]    3   12

Python
>>> x = np.array([[1, 2], [3, 4], [5, 6]])   # (the creation of x, y, z is reconstructed)
>>> y = np.array([1, 2, 3])
>>> z = np.array([1, 2])
>>> x * y        # element-wise multiplication
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (3,2) (3,)
>>> x * z
array([[ 1,  4],
       [ 3,  8],
       [ 5, 12]])
From the R code, we see that broadcasting in R works like recycling along the columns. In Python, when the two arrays have different numbers of dimensions, the one with fewer dimensions is padded with ones on its leading side. According to this rule, when we compute x * y, the shape of x is (3, 2) but the shape of y is (3,); the shape of y would be padded to (1, 3), which is not compatible with (3, 2), and that explains the error raised by x * y.
2 > x
3 [[1]]
4 [1] 1
6 [[2]]
9 > x[[1]]
10 [1] 1
11 > x[[2]]
12 [1] " hello world !"
13 > length (x)
14 [1] 2
A list in R can be named, and its elements can then be accessed by name via either the [[]] or the $ operator. A vector in R can also be named and accessed by name.
1 > x=c(’a’=1,’b’=2)
4 > x[’b’]
5 b
6 2
7 > l=list(’a’=1,’b’=2)
8 > l[[’b’]]
9 [1] 2
10 > l$b
11 [1] 2
12 > names (l)
13 [1] "a" "b"
Python
2 >>> x
3 {’a’: 1, ’b’: 2}
4 >>> x[’a’]
5 1
6 >>> x[’b’]
7 2
9 2
10 >>> x.pop(’a’) # remove the key ’a’ and we get its value 1
11 1
12 >>> x
13 {’b’: 2}
Unlike dictionary in Python, list in R doesn’t support the pop() operation. Thus,
in order to modify a list in R, a new one must be created explicitly or implicitly.
R Python
There are quite a few ways to create a data.frame. The most commonly used one is to create a data.frame object from an array/matrix. We may also need to convert a numeric data.frame to an array/matrix.
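As a minimal sketch with pandas (the column names are illustrative):

Python
>>> import numpy as np
>>> import pandas as pd
>>> arr = np.array([[1, 2], [3, 4]])
>>> df = pd.DataFrame(arr, columns=['x', 'y'])   # create a DataFrame from an array
>>> df
   x  y
0  1  2
1  3  4
>>> df.values    # convert the numeric DataFrame back to an array
array([[1, 2],
       [3, 4]])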
Python
8
https://dplyr.tidyverse.org/
not care whether we conceptualize a variable in our mind, or write it down on paper.
However, in programming a variable is not only a symbol. We have to understand
that a variable is a name given to a memory location in computer systems. When we
run x = 2 in R or Python, somewhere in memory holds the value 2, and the variable (name) points to this memory address. If we further run y = x, the variable y points to the same memory location pointed to by x. What if we then run x = 3? It doesn't modify the memory which stores the value 2. Instead, somewhere else in memory now holds the value 3 and this memory location gets the name x. The variable y is not affected at all, and neither is the memory location it points to.
1.6.1 Mutability
Almost everything in R or Python is an object, including these data structures we
introduced in previous sections. Mutability is a property of objects, not variables,
because a variable is just a name.
A list in Python is mutable, meaning that we could change the elements stored in
the list object without copying the list object from one memory location to another.
We can use the id function in Python to check the memory location for a variable.
In the code below, we modified the first element of the list object with name x. And
since Python list is mutable, the memory address of the list doesn’t change.
Python
>>> x = [1, 2, 3]    # (the creation and modification lines are reconstructed)
>>> hex(id(x))
'0x10592d908'
>>> x[0] = -1        # modify the first element in place
>>> hex(id(x))       # the address does not change: the list is mutable
'0x10592d908'
Is there any immutable data structure in Python? Yes; for example, tuple is immutable. A tuple contains a sequence of elements, and element accessing and subset slicing of a tuple follow the same rules as for list in Python.
Python
>>> x = (1, 2, 5)    # (the creation line is reconstructed)
>>> len(x)
3
>>> x[0]
1
>>> x[0] = -1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
If we have two Python variables pointed to the same memory, when we modify
the memory via one variable the other is also affected as we expect (see the example
below).
Python
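A minimal Python sketch of this behavior (the variable names are illustrative):

Python
>>> x = [1, 2, 5]
>>> y = x        # y is just another name for the same list object
>>> y[0] = -1
>>> x            # the modification is visible through x as well
[-1, 2, 5]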
4 > a[1]=0
It is clear in this case the vector object is mutable since the memory address
doesn’t change after the modification. What if there is an additional name given to
the memory?
Before the modification, both variable a and b point to the same vector object in
the memory. But surprisingly, after the modification the memory address of variable
a also changed, which is called "copy on modify" in R. And because of this unique
behavior, the modification of a doesn’t affect the object stored in the old memory
and thus the vector object is immutable in this case. The mutability of R list is
similar to that of R vector.
Python Python
We see that the object is passed into the function by its name. If the object is immutable, a new copy is created in memory when any modification is made, and the original object is left untouched. When the object is mutable, no new copy is made, and thus the change persists outside of the function.
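For example, a minimal Python sketch (the function name modify_first is illustrative):

Python
>>> def modify_first(v):
...     v[0] = -1    # mutate the passed-in (mutable) list in place
...
>>> x = [1, 2, 5]
>>> modify_first(x)
>>> x                # the change persists outside the function
[-1, 2, 5]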
In R, the passed object is always copied on a modification inside the function,
and thus no modification can be made to the original object in memory.
3 + x[1]=x[1]−1
5 + print (x)
6 + }
7 >
People may argue that R functions are not as flexible as Python functions. However, it makes more sense to do functional programming in R since we usually can't modify objects passed into a function.
R Python
The results of the code above seem strange before knowing the concept of variable
scope. Inside a function, a variable may refer to a function argument/parameter or
it could be formally declared inside the function which is called a local variable. But
in the code above, x is neither a function argument nor a local variable. How does
the print() function know what the identifier x points to?
The scope of a variable determines where the variable is available/accessible (can be referenced). Both R and Python apply lexical/static scoping for variables, which sets the scope of a variable based on the structure of the program. In static scoping, when an 'unknown' variable is referenced, the function will try to find it in the most closely enclosing block. That explains how the print() function could find the variable x.
In the R code above, in x=x+1 the first x is a local variable created by the = operator; the second x is referenced inside the function, so the static scoping rule applies. As a result, a local variable x which is equal to 2 is created, and it is independent of the x outside of the function var_func_2(). However, in Python, when a variable is assigned a value in a statement, the variable is treated as a local variable, and that explains the UnboundLocalError.
Is it possible to change a variable inside a function which is declared outside the
function without passing it as an argument? Based on the static scoping rule only,
it’s impossible. But there are workarounds in both R/Python. In R, we need the help
of environment; and in Python we can use the keyword global.
So what is an environment in R? An environment is a place where objects are
stored. When we invoke the interactive R session, an environment named .GlobalEnv
is created automatically. We can also use the function environment() to get the
present environment. The ls() function can take an environment as the argument to
list all objects inside the environment.
1 $r
4 > environment ()
6 > x=1
8 [1] "x"
10 + y=x+1
11 + print ( environment ())
12 + ls( environment ())
13 + }
14 > env_func_1 (2)
15 <environment : 0 x7fc59d165a20 >
16 [1] "x" "y"
17 > env_func_2 = function (){print( environment ())}
18 > env_func_2 ()
19 <environment : 0 x7fc59d16f520 >
The above code shows that each function has its own environment containing all function arguments and local variables declared inside the function. In order to change a variable declared outside of a function, we need access to the environment enclosing the variable we want to change. The function parent.env(e) returns the parent environment of the given environment e in R. Using this function, we are able to change the value of x declared in .GlobalEnv inside a function which is also declared in .GlobalEnv. The global keyword in Python works in a totally different way, which is simple but less flexible.
R Python
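A minimal Python sketch of the global keyword (the variable and function names are illustrative):

Python
counter = 0

def increase():
    global counter        # refer to the module-level variable instead of creating a local one
    counter = counter + 1

increase()
print(counter)            # prints 1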
I seldom use the global keyword in Python, if ever. But the environment in R could be very handy on some occasions. In R, an environment can be used as a purely mutable version of the list data structure. Let's use the R function tracemem to trace the copying of an object. It is worth noting that tracemem can't trace R functions.
R
> x = list()       # (the creation of x is reconstructed)
> tracemem(x)
[1] "<0x7f829183f6f8>"
> x$a=2
> tracemem(x)      # the address changed: the list was copied on modification
[1] "<0x7f828f4d05c8>"

R
> x = new.env()    # (the creation of x is reconstructed)
> x
<environment: 0x7f8290aee7e8>
> x$a=2
> x                # the address is unchanged: the environment was modified in place
<environment: 0x7f8290aee7e8>
Python Python
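A minimal Python sketch of such unpacking (illustrative values):

Python
>>> x, y = 1, 2    # a tuple (1, 2) is created and then unpacked into x and y
>>> x, y = y, x    # a common idiom to swap two values
>>> x, y
(2, 1)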
Even though there are no parentheses embracing 1, 2 after the = operator, a tuple is created first and then the tuple is unpacked and assigned to x, y. Such a mechanism doesn't exist in R, but we can define our own multiple assignment operator with the help of environment.
R chapter1/multi_assignment.R
1 ‘%=%‘ = function (left , right ) {
variables on LHS
5 dest_env = parent .env( environment ())
9 if ( length (left) == 1) {
Before going more deeply into the script, first let’s see the usage of the multiple
assignment operator we defined.
13 [1] "2019−01−01"
In the %=% operator defined above, we used two functions substitute, deparse
which are very powerful but less known by R novices. To better understand these
functions as well as some other less known R functions, the Rchaeology9 tutorial is
worth reading.
It is also interesting to see that we defined the function recursive_assign inside the %=% function. Both R and Python support the concept of first-class functions. More specifically, a function in R/Python is an object, which can be: stored as a variable, passed as an argument to another function, and returned from another function.
5 + real = NULL ,
6 + imag = NULL ,
7 + # the initialize function would be called automatically when
we create an object of the class
8 + initialize = function (real , imag){
9 + # call functions to change real and imag values
10 + self$ set_real (real)
11 + self$ set_imag (imag)
12 + },
13 + # define a function to change the real value
14 + set_real = function (real){
15 + self$real=real
16 + },
17 + # define a function to change the imag value
18 + set_imag = function (imag){
19 + self$imag = imag
20 + },
21 + # override print function
22 + print = function (){
23 + cat( paste0 (as. character (self$real),’+’,as. character (self$
imag),’j’),’\n’)
24 + }
25 + )
26 + )
27 > # let ’s create a complex number object based on the Complex
class we defined above using the new function
28 > x = Complex $new (1 ,2)
29 > x
30 1+2j
31 > x$real # the public attributes of x could be accessed by $
operator
32 [1] 1
Python
>>> class Complex:
...     # the class body is reconstructed from the surrounding description
...     def __init__(self, real, imag):
...         self.real = real
...         self.imag = imag
...     def __repr__(self):
...         return '{0}+{1}j'.format(self.real, self.imag)
...
>>> x = Complex(1,2)
>>> x
1+2j
>>> x.real  # different from the $ operator in R, here we use . to access the attribute of an object
1
By overriding the print function in the R6 class, we can have the object printed
in the format of real+imag j. To achieve the same effect in Python, we override the
method __repr__. In Python, we refer to the functions defined in classes as methods.
And overriding a method means changing the implementation of a method provided
by one of its ancestors. To understand the concept of ancestors in OOP, one needs
to understand the concept of inheritance13 .
You may be curious about the double underscore surrounding the methods, such
as __init__ and __repr__. These methods are well-known as special methods14 . In
the definition of the special method __repr__ in the Python code, the format method
of str object15 is used.
Special methods can be very handy if we use them in suitable cases. For example,
we can use the special method __add__ to implement the + operator for the Complex
class we defined above.
Python
>>> class Complex:
...     def __init__(self, real, imag):
...         self.real, self.imag = real, imag
...     def __repr__(self):
...         return '{0}+{1}j'.format(self.real, self.imag)
...     def __add__(self, another):
...         return Complex(self.real + another.real, self.imag + another.imag)
...
>>> x = Complex(1,2)
>>> y = Complex(2,4)
>>> x+y  # + operator works now
3+6j
We can also implement the + operator for Complex class in R as we have done for
Python.
13
https://en.wikipedia.org/wiki/Inheritance_(object-oriented_programming)
14
https://docs.python.org/3/reference/datamodel.html#specialnames
15
https://docs.python.org/3.7/library/string.html
3 + }
6 > x+y
7 3+6j
The most interesting part of the code above is `+.Complex`. First, why do we use backquotes `` to quote the function name? Before getting into this question, let's have a look at Python 3's variable naming rules16. According to the rules, we can't declare a variable with the name 2x. Compared with Python, in R we can also use . in variable names17. However, there is a workaround to use otherwise invalid variable names in R with the help of ``.
R
> 2x = 5
Error: unexpected symbol in "2x"
> .x = 3
> .x
[1] 3
> `+2x%` = 0
> `+2x%`
[1] 0

Python
>>> 2x = 5
  File "<stdin>", line 1
    2x = 5
     ^
SyntaxError: invalid syntax
>>> .x = 3
  File "<stdin>", line 1
    .x = 3
     ^
SyntaxError: invalid syntax
17
https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Identifiers
Python
2 ... """
4 ... __y and __func2 are private and the names would be
mangled
5 ... """
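A minimal sketch of such a class (the class name Demo and the attribute x are illustrative; __y and __func2 follow the description):

Python
>>> class Demo:
...     def __init__(self):
...         self.x = 1
...         self.__y = 2            # mangled to _Demo__y outside the class
...     def __func2(self):          # mangled to _Demo__func2 outside the class
...         return self.__y
...     def get_y(self):
...         return self.__func2()   # reachable within the class definition
...
>>> d = Demo()
>>> d.x
1
>>> d.get_y()
2
>>> d.__y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Demo' object has no attribute '__y'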
In this example, an error is thrown when we try to access __y or __func2 outside
the class definition. But they are reachable within the class definition and these fields
are usually called private fields.
1.8 MISCELLANEOUS
There are some items that I haven’t discussed so far, which are also important in
order to master R/Python.
R R
3 > a 3 > a
4 [[1]] 4 $x
5 [1] 2 5 [1] 2
6 6
7 > x 7 > x
8 [1] 2 8 [1] 1
CHAPTER 2
More on R/Python
Programming
There are a few ways to run an R script. For example, we can run the script from the console with the r -f filename command. We can also use the Rscript filename command from the console. If we use r -f filename to run a script, the content of the script is also printed. With Rscript filename, the script content would not be
1
https://www.rstudio.com/products/rstudio/
2
https://code.visualstudio.com
3
https://www.jetbrains.com/pycharm/
printed. Also, we can open the R interactive session and use the source() function.
As for the Python script, we can run it from the console.
R
print_triangle = function(n){
  for (i in 1:n){
    # print a row of i stars; strrep is one way to do it (the original R body is not fully shown)
    print(strrep("*", i))
  }
}
# call the function for printing
print_triangle(5)

Python
def print_triangle(n):
    for i in range(1, n + 1):
        print("*" * i)

# call the function for printing
print_triangle(5)

Running either script prints the triangle row by row, ending with:
*****
The major reason for using scripts is that we don't want to limit ourselves to the interactive sessions; scripts can be organized, reused and shared. For example, a toy project may be organized with the directory structure below.
toy_project/
|-- src
| |-- data_processing
| | |-- data_clean.R
| | |__ data_clean.py
| |-- model_building
| | |-- linear_model.R
| | |-- linear_model.py
| | |-- tree_model.R
| | |__ tree_model.py
| |__ model_validation
|__ test
We could put scripts into these directories accordingly. Please note that the toy
project is just for illustration and your projects don’t need to follow this toy project
structure.
An immediate problem of putting functions into different scripts is working out how R/Python can find/use a function defined in another script. For example, in the linear_model.R script we need to call a function impute defined in data_clean.R. The solution is the source() function in R and the import statement in Python. In R, sourcing a file is straightforward by using the file path. However, there are some caveats to importing in Python, which are out of this book's scope.
When we use an IDE (RStudio or Visual Studio Code), not only can we work on
scripts, but also we can work on projects. But a project has nothing to do with the
programming language itself. A project is IDE-specific. For example, with RStudio we
can create an RStudio project to organize the scripts as well as some configurations4 .
To work with an IDE-specific project, the IDE should be installed. But running the
scripts doesn’t require any IDE since we can even run the scripts with a command
line interface.
4
https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects
2.2.1 Print
Most programming languages provide the functionality of printing, which is a natural way of debugging. By placing print statements at different positions we may finally catch the bugs. When I use print to debug, it feels like playing the game of Minesweeper. In Python, there is a module called logging5 which can be used for debugging like the print function, but in a more elegant fashion.
5
https://docs.python.org/3/library/logging.html
R Python
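A minimal Python sketch of such a linear-scan find_pos, consistent with the behavior described below (the exact comparison used in the book's version may differ):

Python
def find_pos(v, x):
    # return the position of the first element of v that is >= x,
    # or None if no such element exists
    for i in range(len(v)):
        if v[i] >= x:
            return i
    return None

v = [1, 2, 5, 10]
print(find_pos(v, -1))    # 0
print(find_pos(v, 5))     # 2
print(find_pos(v, 11))    # None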
When x=11, the function returns NULL in R and None in Python because there is no
such element in v larger than x. The implementation above is trivial, but not efficient.
If you have some background in data structures and algorithms, you probably know
this question can be solved by binary search. The essential idea of binary search
comes from the concept of divide-and-conquer6 . Since v is already sorted, we may
divide it into two partitions by cutting it from the middle, and then we get the left
partition and the right partition. v is sorted implying that both the left partition and
the right partition are also sorted. If the target value x is larger than the rightmost
element in the left partition, we can just discard the left partition and search x within
the right partition. Otherwise, we can discard the right partition and search x within
the left partition. Once we have determined which partition to search, we may apply
the idea recursively so that in each step we reduce the size of v by half. If the length
of v is denoted as n, in terms of big O notation7 , the run time complexity of binary
search is O(log n), compared with O(n) of the for-loop implementation.
The code below implements the binary search solution to our question. (It is more
intuitive to do it with recursion but here I write it with iteration since tail recursion
optimization8 in R/Python is not supported.)
R chapter2/find_binary_search_buggy.R
1 binary_search_buggy = function (v, x){
2 start = 1
6 if (v[mid ]>=x){
7 end = mid
8 }else{
9 start = mid +1
10 }
11 }
6
https://en.wikipedia.org/wiki/Divide-and-conquer_algorithm
7
https://en.wikipedia.org/wiki/Big_O_notation
8
https://en.wikipedia.org/wiki/Tail_call
12 return ( start )
13 }
14 v=c(1 ,2 ,5 ,10)
15 print ( binary_search_buggy (v,−1))
16 print ( binary_search_buggy (v ,5))
17 print ( binary_search_buggy (v ,11))
Python chapter2/find_binary_search_buggy.py
11 v=[1 ,2 ,5 ,10]
12 print ( binary_search_buggy (v,−1))
13 print ( binary_search_buggy (v ,5))
14 print ( binary_search_buggy (v ,11))
R Python
The binary search solutions don’t work as expected when x=11. We write two new
scripts.
R chapter2/find_binary_search_buggy_debug.R
2 browser ()
3 start = 1
Python chapter2/find_binary_search_buggy_debug.py
3 set_trace ()
6 mid = (start+end)//2
7 if v[mid ]>=x:
8 end = mid
9 else:
10 start = mid +1
11 return start
12
13 v=[1 ,2 ,5 ,10]
14 print ( binary_search_buggy (v ,11))
Let’s try to debug the programs with the help of browser() and set_trace().
12 if (v[mid] >= x) {
13 end = mid
14 }
15 else {
16 start = mid + 1
17 }
18 }
19 Browse [2] > n
20 debug at binary_search_buggy_debug .R#6: mid = (start + end)%/%2
21 Browse [2] > n
22 debug at binary_search_buggy_debug .R#7: if (v[mid] >= x) {
23 end = mid
24 } else {
25 start = mid + 1
26 }
27 Browse [2] > n
28 debug at binary_search_buggy_debug .R#10: start = mid + 1
29 Browse [2] > n
30 debug at binary_search_buggy_debug .R#5: (while) start < end
31 Browse [2] > n
32 debug at binary_search_buggy_debug .R#6: mid = (start + end)%/%2
33 Browse [2] > n
34 debug at binary_search_buggy_debug .R#7: if (v[mid] >= x) {
35 end = mid
36 } else {
37 start = mid + 1
38 }
39 Browse [2] > n
40 debug at binary_search_buggy_debug .R#10: start = mid + 1
41 Browse [2] > n
42 debug at binary_search_buggy_debug .R#5: (while) start < end
43 Browse [2] > start
44 [1] 4
45 Browse [2] > n
46 debug at binary_search_buggy_debug .R#13: return (start)
47 Browse [2] > n
48 [1] 4
In the R code snippet above, we placed the browser() function on the top of
the function binary_search_buggy. Then when we call the function we enter into the
debugging environment. By calling ls() we see all variables in the current debugging
scope, i.e., v, x. Typing n will evaluate the next statement. After typing n a few
times, we finally exit from the while loop because start = 4 such that start < end
is FALSE. As a result, the function just returns the value of start, i.e., 4. To exit from
the debugging environment, we can type Q; to continue the execution we can type c.
The root cause is that we didn’t deal with the corner case when the target value
x is larger than the last/largest element in v correctly.
Let’s debug the Python function using pdb module.
Python
Similar to R, the command n evaluates the next statement in pdb. Typing the command l shows the current line of execution. The command b line_number sets the corresponding line as a breakpoint; and c continues the execution until the next breakpoint (if it exists).
In R, besides the browser() function there are a pair of functions, debug() and undebug(), which are also very handy when we try to debug a function, especially when the function is wrapped in a package. More specifically, the debug function will invoke the debugging environment whenever we call the function to debug. See the example below, which shows how we invoke the debugging environment for the sd function (standard deviation calculation).
3 > sd(x)
5 debug : sqrt(var(if (is. vector (x) || is. factor (x)) x else as.
double (x),
6 na.rm = na.rm))
R chapter2/find_binary_search.R
3 start = 1
7 if (v[mid ]>=x){
8 end = mid
9 }else{
10 start = mid +1
11 }
12 }
13 return ( start )
14 }
Python chapter2/find_binary_search.py
2.3 BENCHMARKING
R chapter2/benchmark.R
1 library ( microbenchmark )
2 source (’binary_search .R’)
3 source (’find_pos .R’)
4
5 v =1:10000
6
10 # for−loop solution
11 set.seed (2019)
12 print( microbenchmark ( find_pos (v, sample (10000 ,1)),times =1000) )
13 # binary−search solution
14 set.seed (2019)
15 print( microbenchmark ( binary_search (v, sample (10000 ,1)),times
=1000) )
In the R code above, times=1000 means we want to call the function 1000 times in the benchmarking process. The sample() function is used to draw samples from a set of elements. Specifically, we pass the argument 1 to sample() to draw a single element. It's the first time we have used the set.seed() function in this book. In R/Python, we draw random numbers based on a pseudorandom number generator (PRNG) algorithm9. The sequence of numbers generated by a PRNG is completely determined by an initial value, i.e., the seed. Whenever a program involves the usage of a PRNG, it is better to set the seed in order to get replicable results (see the example below).
R R
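For instance, a minimal Python sketch of the same idea with the built-in random module:

Python
import random

random.seed(2019)
a = random.random()
random.seed(2019)
b = random.random()
print(a == b)    # True: the same seed reproduces the same draw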
9
https://en.wikipedia.org/wiki/Pseudorandom_number_generator
Python chapter2/benchmark.py
from binary_search import binary_search
from find_pos import find_pos
import timeit
import random

v=list(range(1,10001))

# the two test functions below are reconstructed: each one calls the
# corresponding search function n times with random targets
def test_for_loop(n):
    random.seed(2019)
    for _ in range(n):
        find_pos(v, random.randint(1, 10000))

def test_bs(n):
    random.seed(2019)
    for _ in range(n):
        binary_search(v, random.randint(1, 10000))

# for-loop solution
print(timeit.timeit('test_for_loop(1000)', setup='from __main__ import test_for_loop', number=1))
# binary_search solution
print(timeit.timeit('test_bs(1000)', setup='from __main__ import test_bs', number=1))
The most interesting part of the Python code above is from __main__ import.
Let’s ignore it for now; we will revisit it later.
Below is the benchmarking result in Python (the unit is second).
Python
2.4 VECTORIZATION
int x[4], y[4], z[4];  /* (declarations reconstructed from the fragment) */
for (int i = 0; i < 4; i++){
    z[i] = x[i] + y[i];
}
The C code above might be vectorized by the compiler so that the actual number
of iterations performed could be less than 4. If 4 pairs of operands are processed
at once, there would be only 1 iteration. Automatic vectorization may make the
program run much faster in some languages like C. However, when we talk about
vectorization in R/Python, it is different from automatic vectorization. Vectorization
in R/Python usually refers to the human effort paid to avoid for-loops. First, let’s
see some examples of how for-loops may slow your programs in R/Python.
R
chapter2/vectorization_1.R
1 library ( microbenchmark )
2
13 n=100
14 # for loop
15 print ( microbenchmark ( rnorm_loop (n),times =1000) )
16 # vectorize
17 print( microbenchmark ( rnorm_vec (n),times =1000) )
2 Unit: microseconds
3 expr min lq mean median uq max
neval
4 rnorm_loop (n) 131.622 142.699 248.7603 145.3995 270.212 16355.6
1000
5 Unit: microseconds
6 expr min lq mean median uq max neval
7 rnorm_vec (n) 6.696 7.128 10.87463 7.515 8.291 2422.338 1000
Python
import timeit
import numpy as np

# the two functions below are reconstructed: one draws samples in a loop,
# the other draws them in a single vectorized call
def rnorm_for_loop(n):
    return [np.random.normal() for _ in range(n)]

def rnorm_vec(n):
    return np.random.normal(size=n)

print("for loop")
print(f'{timeit.timeit("rnorm_for_loop(100)", setup="from __main__ import rnorm_for_loop", number=1000):.6f} seconds')
print("vectorized")
print(f'{timeit.timeit("rnorm_vec(100)", setup="from __main__ import rnorm_vec", number=1000):.6f} seconds')
Please note that in this Python example we are using the random submodule of
numpy module instead of the built-in random module since random module doesn’t
provide the vectorized version of random number generation function. Running the
Python code results in the following on my local machine.
Python
Also, the timeit.timeit measures the total time to run the main statements
number times, but not the average time per run.
In either R or Python, the vectorized version of random normal random variable
(r.v.) is significantly faster than the scalar version. It is worth noting the usage of
the print(f’’) statement in the Python code, which is different from the way that
we print the object of Complex class in Chapter 1. In the code above, we use the
f−string11 which is a literal string prefaced with ’f’ containing expressions inside
{} which would be replaced with their values. f−string is a feature introduced since
Python 3.6.
It’s also worth noting that lots of built-in functions in R are already vectorized,
such as the basic arithmetic operators, comparison operators, ifelse(), element-wise
logical operators &,|. But the logical operators &&, || are not vectorized.
In addition to vectorization, there are also some built-in functions which may help
to avoid the usages of for-loops. For example, in R we might be able to use the apply
family of functions to replace for-loops; and in Python the map() function can also
be useful. In the Python pandas module, there are also many usages of map/apply
methods. But in general the usage of apply/map functions has little or nothing to
do with performance improvement. However, appropriate usages of such functions
may help with the readability of the program. Compared with the apply family of
functions in R, I think the do.call() function is more useful in practice. We will
spend some time in do.call() later.
Let's get more familiar with vectorization through the Biham–Middleton–Levine (BML) traffic model12. The BML model is very important in modern studies of traffic flow since it exhibits a sharp phase transition from a free-flowing status to a fully jammed status. A simplified BML model can be characterized as follows:
• Initialized on a 2-D lattice, each site of which is either empty or occupied by a colored particle (blue
or red);
• Particles are distributed randomly through the initialization according to a uniform distribution; the
two colors of particles are equally distributed.
• On even time steps, all blue particles attempt to move one site up and an attempt fails if the site to
occupy is not empty;
• On odd time steps, all red particles attempt to move one site right and an attempt fails if the site to occupy is not empty;
• The lattice is assumed periodic which means when a particle moves out of the lattice, it will move
into the lattice from the opposite side.
11
https://www.python.org/dev/peps/pep-0498/
12
https://en.wikipedia.org/wiki/Biham-Middleton-Levine_traffic_model
R chapter2/BML.R
1 library (R6)
2 BML = R6Class (
3 "BML",
4 public = list(
7 alpha = NULL ,
8 m = NULL ,
9 n = NULL ,
10 lattice = NULL ,
11 initialize = function (alpha , m, n) {
12 self$ alpha = alpha
13 self$m = m
14 self$n = n
15 self$ initialize_lattice ()
16 },
17 initialize_lattice = function () {
18 # 0 −> empty site
19 # 1 −> blue particle
20 # 2 −> red particle
21 u = runif (self$m ∗ self$n)
22 # the usage of L is to make sure the elements in particles
are of type integer ;
23 # otherwise they would be created as double
24 particles = rep (0L, self$m ∗ self$n)
25 # doing inverse transform sampling
26 particles [(u > self$alpha) &
27 (u <= (self$alpha + 1.0) / 2)] = 1L
28 particles [u > (self$alpha + 1.0) / 2] = 2L
29 self$ lattice = array (particles , c(self$m, self$n))
30 },
31 odd_step = function () {
32 blue. index = which (self$ lattice == 1L, arr.ind = TRUE)
33 # make a copy of the index
34 blue.up. index = blue. index
35 # blue particles move 1 site up
36 blue.up. index [, 1] = blue. index [, 1] − 1L
37 # periodic boundary condition
38 blue.up. index [blue.up. index [, 1] == 0L, 1] = self$m
Now we can create a simple BML system on a 5 × 5 lattice using the R code
above.
6 [1,] 2 0 2 1 1
7 [2,] 2 2 1 0 1
8 [3,] 0 0 0 2 2
9 [4,] 1 0 0 0 0
10 [5,] 0 1 1 1 0
11 > bml$ odd_step ()
12 > bml$ lattice
Python
1 import numpy as np
3 class BML:
5 self.alpha = alpha
6 self.shape = (m, n)
7 self. initialize_lattice ()
21 [1, 2, 2, 2, 1],
22 [0, 0, 1, 0, 2],
23 [2, 1, 0, 0, 2],
24 [1, 0, 1, 2, 1]])
Please note that although we have imported numpy in BML.py, we import it again
in the code above in order to set the random seed. If we change the line to from
BML import ∗, we don’t need to import numpy again. But it is not recommended to
import ∗ from a module.
14
https://en.wikipedia.org/wiki/Thread_(computing)
15
https://en.wikipedia.org/wiki/Process_(computing)
16
https://docs.python.org/3.7/library/threading.html
17
https://en.wikipedia.org/wiki/Embarrassingly_parallel
18
https://en.wikipedia.org/wiki/Monte_Carlo_method
Figure 2.1: Generate points within a square and count how many times these points
fall into the inscribed circle
If we uniformly generate n points within the square and m of these points fall into the inscribed circle (see Figure 2.1), then m/n estimates the ratio of the circle's area to the square's area, which equals π/4. As a result, a natural estimate of π is 4m/n. This problem is an embarrassingly parallel problem by its nature. Let's see how we implement the idea in R/Python.
R chapter2/pi.R
1 library ( parallel )
3 m = 0
4 for (i in 1:n){
8 m = m+1
9 }
10 }
11 m
12 }
13
In the above R code snippet, we use the function mclapply, which is not currently available on some operating systems19. When it is not available, we may consider using parLapply instead.
Python chapter2/pi.py
4 pool = mp.Pool ()
cpu_count ())
7
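A minimal Python sketch of the whole idea (the function and variable names are illustrative, not the book's original pi.py):

Python
import multiprocessing as mp
import random

def count_inside(n):
    # count how many of n uniformly generated points fall inside the inscribed circle
    m = 0
    for _ in range(n):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            m += 1
    return m

if __name__ == '__main__':
    jobs = mp.cpu_count()
    n_per_job = 250000
    with mp.Pool(jobs) as pool:
        counts = pool.map(count_inside, [n_per_job] * jobs)
    print(4 * sum(counts) / (n_per_job * jobs))    # estimate of pi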
3 naive: pi − 3.144592
7 parallel : pi − 3.1415
Python
We see the winner in this example is vectorization, and the parallel solution is
better than the naive solution. However, when the problem cannot be vectorized we
may use parallelization to achieve better performance.
R
> divide = function(x, y){stopifnot(y!=0); return(x/y)}
> power = function(x, p){if (p==0) 1 else x^p}
> power(divide(1,0), 2)
Error in divide(1, 0) : y != 0 is not TRUE
> power(divide(1,0), 0)
[1] 1

Python
>>> def divide(x, y): return x/y
...
>>> def power(x, p): return 1 if p==0 else x**p
...
>>> power(divide(1,0), 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in divide
ZeroDivisionError: division by zero
Because of the outermost reduction order, the R code snippet evaluates the function power first, and since the second argument is 0, the first argument never needs to be evaluated. Thus, the function call returns 1. But the Python code snippet first evaluates the call to divide, and an exception is raised because of the division by zero.
Although Python is an eager language, it is possible to simulate lazy evaluation behavior. For example, the code inside a generator20 is evaluated when the
generator is consumed but not evaluated when it is created. We can create a sequence
of Fibonacci numbers with the help of generator.
Python
>>> def fib():
...     '''
...     a generator to produce Fibonacci numbers (the docstring is reconstructed)
...     '''
...     a, b = 0, 1
...     while a < 10:
...         a, b = b, a+b
...         yield a
...
10 >>> f = fib ()
11 >>> print (f)
12 <generator object fib at 0x116dfc570 >
13 >>> for e in f: print (e)
14 ...
15 1
16 1
17 2
18 3
19 5
20 8
21 13
22 >>>
In the code snippet above, we create a generator which generates the sequence of
Fibonacci numbers less than 10. When the generator is created, the sequence is not
generated immediately. When we consume the generator in a loop, each element is
then generated as needed. It is also common to consume the generator with the next
function.
Python
1 >>> f = fib ()
2 >>> next(f)
3 1
4 >>> next(f)
5 1
6 >>> next(f)
7 2
20
https://www.python.org/dev/peps/pep-0255
8 >>> next(f)
9 3
10 >>> next(f)
11 5
The purpose of this section is not to teach C/C++, but if you know how to write C/C++, it may help to improve the performance of your R/Python programs. It is recommended to use the vectorization technique to speed up your program whenever possible. But what if it's infeasible to vectorize the algorithm? One potential
option is to use C/C++ in the R/Python code. According to my limited experience,
it is much easier to use C/C++ in R with the help of Rcpp package than in Python.
In Python, more options are available but not as straightforward as the solution of
Rcpp. In this section, let’s use Cython21 .
Actually, Cython itself is a programming language written in Python and C. In
Rcpp we can write C++ code directly and use it in native R code. With the help of
Cython, we can write python-like code which is able to be compiled to C or C++
and automatically wrapped into python-importable modules. Cython could also wrap
independent C or C++ code into python-importable modules.
Let’s see how to calculate Fibonacci numbers with these tools. The first step is
to install Rcpp package in R and the Cython module (use pip) in Python. Once they
are installed, we can write the code in C++ directly for R. As for Cython, we write
the python-like code which will be compiled to C/C++ later.
21
https://en.wikipedia.org/wiki/Cython
It's worth noting that the extension of the Cython file above is pyx, not py. In Fibonacci.pyx, the function we defined follows the native Python syntax. But in Cython we can add static typing declarations, which are often helpful to improve the performance, although they are not mandatory. The keyword cdef makes the function Fibonacci_c invisible to Python, and thus we have to define the Fibonacci_static function as a wrapper which can be imported into Python. The function Fibonacci is not statically typed, for benchmarking purposes.
Let’s also implement the same functions in native R/Python for benchmarking
purposes.
7 Unit: milliseconds
uq max
9 Fibonacci_native (20) 3.917138 4.056468 4.456824 4.190078
4.462815 26.30846
10 neval
11 1000
12 > the C++ implementation
13 > microbenchmark ( Fibonacci (20) , times =1000)
14 Unit: microseconds
Python
The results show the static typed implementation is the fastest, and the native
implementation in pure Python is the slowest. Again the time measured by timeit.
timeit is the total time to run the main statement 1000 times. The average time per
run of Fibonacci_static function is close to the average time per run of Fibonacci
in R.
If you ever wondered why the return keyword can be ignored when returning a value from an R function, now you have the explanation: a function should return its output automatically from an FP perspective.
I don’t think it makes any sense to debate whether we should choose OOP or
FP using R/Python. And both languages support multiple programming paradigms
(FP, OOP, etc.). If you want to do purist FP, neither R nor Python would be the
best choice. For example, purist FP should avoid loop because loop always involves
an iteration counter whose value changes over iterations, which is obviously a side
effect. Thus, recursion is very important in FP. But tail recursion optimization is not
supported in R and Python although it can be implemented via trampoline23 .
Let’s introduce a few tools that could help to get a first impression of FP in R
and Python.
R Python
R Python
2.8.2 Map
Map could be used to avoid loops. The first argument to map is a function.
In R, we use Map which is shipped with base R; and in Python we use map which
is also a built-in function.
23
https://en.wikipedia.org/wiki/Tail_call#Through_trampolining
R
> Map(function(x, y) x * y, c(1:3), c(4:6))
[[1]]
[1] 4

[[2]]
[1] 10

[[3]]
[1] 18

Python
>>> list(map(lambda x, y: x * y, [1, 2, 3], [4, 5, 6]))
[4, 10, 18]
There are a few things to note from the code snippets above. First, the returned value from Map in R is a list rather than a vector. That is because Map is just a wrapper of the mapply function in R, with the SIMPLIFY argument set to FALSE. If we want a vector returned, just use mapply instead.
The returned value from map in Python is a map object which is an iterator. The
generator function we have seen before is also an iterator. To use an iterator, we
could convert it to a list, or get the next value using the next function, or just use a
for loop.
Python
>>> o = map(lambda x, y: x * y, [1, 2, 3], [4, 5, 6])
>>> next(o)
4
>>> next(o)
10
>>> # use a for loop on a fresh iterator
>>> o = map(lambda x, y: x * y, [1, 2, 3], [4, 5, 6])
>>> for e in o: print(e)
...
4
10
18
>>> # convert an iterator to a list
>>> o = map(lambda x, y: x * y, [1, 2, 3], [4, 5, 6])
>>> list(o)
[4, 10, 18]
2.8.3 Filter
Similar to Map, the first argument to Filter in R and filter in Python is also a function. This function encodes a condition, and only the elements that satisfy the condition are kept. For example, we can use it to get the even numbers.
R Python
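# The original side-by-side listing is not reproduced here; only the Python side is sketched:
>>> list(filter(lambda e: e % 2 == 0, range(1, 11)))
[2, 4, 6, 8, 10]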
2.8.4 Reduce
Reduce behaves a bit differently from Map and Filter. The first argument is a function with exactly two arguments. The second argument is a sequence whose elements are combined from left to right by repeatedly applying the function. There is also an optional argument called the initializer; when it is provided, it is used as the initial accumulated value. The examples below depict how it works.
R Python
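# The original side-by-side listing is not reproduced here; only the Python side is sketched:
>>> from functools import reduce
>>> reduce(lambda x, y: x + y, [1, 2, 3, 4])
10
>>> reduce(lambda x, y: x + y, [1, 2, 3, 4], 100)   # 100 is the initializer
110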
Please note that in order to use reduce in Python, we have to import it from the
functools module.
Utilizing these functions flexibly may help to make the code more concise, at the
cost of readability.
2.9 MISCELLANEOUS
We have introduced the basics of R/Python programming so far. There is much more to learn to become an advanced user of R/Python. For example, the appropriate usage of iterators, generators and decorators could improve both the conciseness and readability of your Python code. Generators24 are commonly seen in machine learning programs to prepare training/testing samples. A decorator is a kind of syntactic sugar that allows modifying a function's behavior in a simple way. In R there are no built-in iterators, generators or decorators, but you may find third-party libraries that mimic these features, or you may try to implement your own.
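To make the two ideas concrete, here is a minimal sketch (the names and use cases are illustrative, not from this book): a decorator that times a function call, and a generator that yields mini-batches of indices.

Python
import functools
import time

def timed(func):
    # a minimal decorator: report how long a function call takes
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.time() - start:.6f} seconds")
        return result
    return wrapper

@timed
def total(n):
    return sum(range(n))

total(1_000_000)

def batches(n, batch_size):
    # a minimal generator: lazily yield mini-batches of indices
    for start in range(0, n, batch_size):
        yield list(range(start, min(start + batch_size, n)))

for batch in batches(10, 4):
    print(batch)   # [0, 1, 2, 3], then [4, 5, 6, 7], then [8, 9]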
One advantage of Python over R is that some built-in modules contain high-performance data structures and commonly used algorithms implemented efficiently. For example, I enjoy using the deque structure in the Python collections module25, but there is no built-in counterpart in R. We wrote our own binary search algorithm earlier in this chapter; it could also be replaced by the functions in the built-in bisect26 module in Python.
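A minimal illustration of both modules (not from the book):

Python
from collections import deque
from bisect import bisect_left

d = deque([1, 2, 3])
d.appendleft(0)    # O(1) insertion at the left end
d.append(4)        # O(1) insertion at the right end
print(d)           # deque([0, 1, 2, 3, 4])

sorted_list = [1, 3, 5, 7, 9]
print(bisect_left(sorted_list, 7))   # 3, the position of 7 in the sorted list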
Another important aspect of programming is testing. Unit testing is a typical software testing method that is commonly adopted in practice. In R there are two popular third-party packages, testthat and RUnit. In Python, the built-in unittest module is quite powerful, and the third-party module pytest27 is also very popular.
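A minimal sketch of unittest (the binary_search function below is illustrative, not the implementation from earlier in the chapter):

Python
import unittest

def binary_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

class TestBinarySearch(unittest.TestCase):
    def test_found(self):
        self.assertEqual(binary_search([1, 3, 5, 7], 5), 2)

    def test_not_found(self):
        self.assertEqual(binary_search([1, 3, 5, 7], 4), -1)

if __name__ == '__main__':
    unittest.main()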
24
https://docs.python.org/3/howto/functional.html
25
https://docs.python.org/3/library/collections.html
26
https://docs.python.org/3.7/library/bisect.html
27
https://docs.pytest.org/en/latest/
CHAPTER 3
data.table and pandas
Upon receiving feedback and requests from readers on the first few chapters, I decided to devote a whole chapter to the introduction of data.table and pandas. Of course, there are many other great data analysis tools in both R and Python; for example, many R users like using dplyr to build up data analysis pipelines. The performance of data.table is superior, and that is the main reason I feel there is no need to use other tools for the same tasks in R. But if you are a big fan of the pipe operator %>%, you may use data.table and dplyr together. Regarding the big data ecosystem, Apache Spark has APIs in both R and Python. Recently, there are also some emerging projects aiming at better usability and performance, such as Apache Arrow1 and Modin2.
3.1 SQL
Similar to the previous chapters, I will introduce the tools side by side. However, before diving into the world of data.table and pandas, it is better to talk a little bit about SQL3. SQL is a query language designed for managing data in relational database management systems (RDBMS). Some of the most popular RDBMSs include MS SQL Server, MySQL, PostgreSQL, etc. Different RDBMSs may implement SQL with major or subtle differences.
If you have never used an RDBMS, you may wonder why we need one.
In this chapter, we will use the public mtcars dataset as an example. This dataset
was available in R4 and originally it was reported in [11].
Let’s assume there is a table mtcars in a database (I’m using sqlite3 in this book)
and see some simple tasks we can do with SQL queries.
1
https://arrow.apache.org
2
https://github.com/modin-project/modin
3
https://en.wikipedia.org/wiki/SQL
4
https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html
SQL
In the example above (a query of the form select * from mtcars limit 2; the original listing is not reproduced here), two rows are selected from the table with the select ... from syntax. The keyword limit in sqlite specifies the number of rows to return; in other RDBMSs we may need to use top instead. It is straightforward to select on conditions with the where keyword.
SQL
1 sqlite > select mpg ,cyl from mtcars where name = ’Mazda RX4 Wag ’;
2 mpg ,cyl
3 21,6
SQL
1 sqlite > .mode column −− make the output aligned ; and yes we use
’−−’ to start comment in many SQL languages
2 sqlite > select name , mpg , cyl ,vs ,am from mtcars where vs =1 and
am =1;
3 name mpg cyl vs am
4 −−−−−−−−−− −−−−−−−−−− −−−−−−−−−− −−−−−−−−−− −−−−−−−−−−
5 Datsun 710 22.8 4 1 1
6 Fiat 128 32.4 4 1 1
7 Honda Civi 30.4 4 1 1
8 Toyota Cor 33.9 4 1 1
9 Fiat X1−9 27.3 4 1 1
We are just accessing specific rows and columns from the table in database with
select from where. We can also do something a bit more fancy, for example, to get
the maximum, the minimum and the average of mpg for all vehicles grouped by the
number of cylinders.
SQL
In the above example, there are a few things worth noting. We use as to create an
alias for a variable; we group the original rows by the number of cylinders with the
keyword group by; and we sort the output rows with the keyword order by. max, min
and avg are all built-in functions that we can use directly.
It is also possible to have user-defined functions in SQL as what we usually do in
other programming languages.
SQL queries could be very complex and long. Usually a long query might con
tain subqueries or Common Table Expressions (CTE). First, let’s see an example of
subquery to get the average mpg for cars with 6 cylinders as below.
SQL
The subquery in the parentheses specifies the fact that we are only interested in
the rows with cyl==6. The same results could be obtained by CTE as below.
SQL
1 sqlite > with temp_table as ( select ∗ from mtcars where cyl ==6)
select am , avg(mpg) avg_mpg from temp_table group by am;
2 am avg_mpg
3 −−−−−−−−−− −−−−−−−−−−
4 0 19.125
5 1 20.5666666
Basically, a CTE can be thought of as a temporary result set which can be referenced multiple times in one query, and multiple CTEs can be used in the same query. Using CTEs improves the readability of the query, with performance similar to subqueries in complex queries. In some cases a CTE could become a performance bottleneck, and you may consider using a temporary table instead, which will not be discussed in this chapter.
R Python
Python
11 [5 rows x 12 columns ]
The type of mtcars_dt is data.table, not data.frame. Here we use the fread function from data.table to read a file, and the output is a data.table directly. Regarding reading csv files in R, the readr package is also very good for large files, but its output has a data.frame type. In practice, it is very common to convert a data.frame to a data.table with the function as.data.table.
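For completeness, a minimal sketch of the pandas side (the file name is an assumption; the original listing is not reproduced in this extract):

Python
import pandas as pd

mtcars_df = pd.read_csv('mtcars.csv')   # returns a pandas DataFrame
print(type(mtcars_df))                  # <class 'pandas.core.frame.DataFrame'>
print(mtcars_df.head())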
> setkey(mtcars_dt, name)
> key(mtcars_dt)
[1] "name"
> head(mtcars_dt, 5)
                 name  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1:        AMC Javelin 15.2   8  304 150 3.15 3.435 17.30  0  0    3    2
2: Cadillac Fleetwood 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
3:         Camaro Z28 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
4:  Chrysler Imperial 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
5:         Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Python
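# a sketch of the pandas counterpart (the original listing is not reproduced here)
>>> mtcars_df.set_index('name', inplace=True)   # set the index in place
>>> mtcars_df.index.name
'name'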
There are quite a few things worth noting from the above code snippets. When we use the setkey function, the quotes around the column name are optional, so setkey(mtcars_dt, name) is equivalent to setkey(mtcars_dt, 'name'). But in pandas, quotes are required. The effect of setkey is in place, which means no copy of the data is made at all. In pandas, by default set_index sets the index on a copy of the data and returns the modified copy; thus, in order to make the effect in place, we have to set the argument inplace=True explicitly. Another difference is that setkey sorts the original data in place automatically, but set_index does not. It's also worth noting that every pandas DataFrame has an index, by default numpy.arange(n) where n is the number of rows, but there is no default key in a data.table.
In the above example, we only use a single column as the key/index. It is possible
to use multiple columns as well.
R Python
To use multiple columns as the key in data.table, we use the function setkeyv. It
is also interesting that we have to use index.names rather than index.name to get the
multiple column names of the index (which is called MultiIndex) in pandas. There
are duplicated combinations of (cyl, gear) in the data, which implies key or index
could be duplicated.
Once the key/index set, we can access rows with given indices fast.
9 > mtcars_dt [.(6 ,4)] # work with key vector using .()
Here is a bit of explanation for the code above. We can simply use [] to access the rows with the specified key values if the key has a character type. But if the key has a numeric type, list() is required to enclose the key values. In data.table, .() is just an alias of list(), which means we would get the same results with mtcars_dt[list(6,4)]. Of course, we can also do mtcars_dt[list('Merc 230')], which is equivalent to mtcars_dt[.('Merc 230')].
Python
2 mpg 22.80
3 cyl 4.00
4 disp 140.80
5 hp 95.00
6 drat 3.92
7 wt 3.15
8 qsec 22.90
9 vs 1.00
10 am 0.00
11 gear 4.00
12 carb 2.00
13 Name: Merc 230, dtype: float64
14 >>> mtcars_df .loc [[ ’Merc 230 ’,’Camaro Z28 ’]] # multiple values
of a single index
15 mpg cyl disp hp drat wt qsec vs am
gear carb
16 name
17 Merc 230 22.8 4 140.8 95 3.92 3.15 22.90 1 0
4 2
18 Camaro Z28 13.3 8 350.0 245 3.73 3.84 15.41 0 0
3 4
19
Compared to data.table, we need to use the loc method when accessing rows
based on index. The loc method also takes boolean conditions.
Python
1 >>> mtcars_df .loc[ mtcars_df .mpg >30] # select the vehicles with
mpg >30
2 mpg disp hp drat wt qsec vs am carb
3 cyl gear
4 4 4 32.4 78.7 66 4.08 2.200 19.47 1 1 1
5 4 30.4 75.7 52 4.93 1.615 18.52 1 1 2
6 4 33.9 71.1 65 4.22 1.835 19.90 1 1 1
7 5 30.4 95.1 113 3.77 1.513 16.90 1 1 2
Python
If the key/index is not needed, we can remove the key or reset the index. For
data.table we can set a new key to override the existing one which then becomes
a column. But in pandas, set_index method removes the existing index which also
disappears from the data.frame.
5 NULL
8 [1] "gear"
Python
index
4 >>> mtcars_df .index.names
5 FrozenList ([ None ])
8 ’gear ’
Python
So far we have seen how to access specific rows. What about columns? Accessing
columns in data.table and pandas is quite straightforward. For data.table, we can
use $ sign to access a single column or a vector to specify multiple columns inside [].
For data.frame in pandas, we can use . to access a single column or a list to specify
multiple columns inside [].
R Python
1 > head( mtcars_dt $mpg ,5) # 1 >>> mtcars_df .iloc [0:5]. mpg.
access a single column values
2 [1] 21.0 21.0 22.8 21.4
2 array ([21. , 21. , 22.8 ,
18.7}
21.4 , 18.7])
3 > mtcars_dt [1:5 ,c(’mpg ’,’
gear ’)] 3 Name: mpg , dtype : float64
4 mpg gear 4 >>> mtcars_df [[ ’mpg ’,’gear
’]]. head (5)
1 > mtcars_dt [1:5 ,.( mpg ,cyl ,hp)] # without quotes for variables
2 mpg cyl hp
3 1: 21.5 4 97
4 2: 22.8 4 93
5 3: 24.4 4 62
6 4: 22.8 4 95
7 5: 32.4 4 66
R Python
To access specific rows and specific columns, there are two strategies:
4 mpg cyl hp
5 1: 21.4 6 110
6 2: 18.1 6 105
7 3: 21.0 6 110
8 4: 21.0 6 110
9 5: 19.2 6 123
10 6: 17.8 6 123
11 7: 19.7 6 175
12 > mtcars_dt [.(6) ][,c(’mpg ’,’cyl ’,’hp’)] # use strategy 1;
13 mpg cyl hp
14 1: 21.4 6 110
15 2: 18.1 6 105
16 3: 21.0 6 110
17 4: 21.0 6 110
18 5: 19.2 6 123
19 6: 17.8 6 123
20 7: 19.7 6 175
Python
22 6 19.7 175
As we have seen, using the setkey function on a data.table sorts it automatically. However, sorting is not always desired. In data.table, there is another pair of functions, setindex/setindexv, which have similar effects to setkey/setkeyv but do not sort the data.table. In addition, one data.table can have multiple indices, but it cannot have multiple keys.
24 > mtcars_dt [.(4) ,on=’cyl ’] # the index ’cyl ’ still works after
set c(’cyl ’,’gear ’) as indexv
25 name mpg cyl disp hp drat wt qsec vs am
gear carb
26 1: Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1
4 1
27 2: Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0
4 2
28 3: Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0
4 2
29 4: Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1
4 1
30 5: Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1
4 2
31 6: Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1
4 1
32 7: Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0
3 1
33 8: Fiat X1−9 27.3 4 79.0 66 4.08 1.935 18.90 1 1
4 1
34 9: Porsche 914−2 26.0 4 120.3 91 4.43 2.140 16.70 0 1
5 2
35 10: Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1
5 2
36 11: Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1
4 2
3.4 ADD/REMOVE/UPDATE
R Python
R Python
An interesting fact about the code above is that in R the %in% operator is vectorized, but the in operator in Python is not.
Adding a single column to an existing data.table or DataFrame is as straightforward as removing one.
Python
Adding multiple columns is a bit tricky compared with adding a single column.
Python
7 [2 rows x 14 columns ]
In the R code, we use `:=` to create multiple columns. In the Python code, we put the new columns into a dictionary and use the assign method with the dictionary unpacking operator **. To learn about the dictionary unpacking operator, please refer to the official document6. The assign method of a DataFrame doesn't have an inplace argument, so we need to assign the modified DataFrame back to the original one explicitly.
Now let’s see how to update values. We can update the entire column or just the
column on specific rows.
4 10 2
Python
nc1 nc2
6 0 Mazda RX4 21.0 6 160.0 110 ... 1 4 4
10 2
7 1 Mazda RX4 Wag 21.0 6 160.0 110 ... 1 4 4
10 2
8
9 [2 rows x 14 columns ]
10 >>> mtcars_df .loc[ mtcars_df .cyl ==6 ,’nc1 ’]=3
11 >>> mtcars_df .head (2)
6
https://www.python.org/dev/peps/pep-0448
16 [2 rows x 14 columns ]
We can also combine the technique of rows indexing with column update.
Python
In addition to :=, we can also use the set function to modify values in a data.table. When used properly, the performance gain can be significant. Let's look at a fictional case.
data. table
7 user system elapsed
8 1.499 0.017 0.383
We see that updating the values with the set function in this example is as fast
as updating the values in an array.
3.5 GROUP BY
At the beginning of this chapter, we saw an example with group by in SQL query.
group by is a very powerful operation, and it is also available in data.table and
pandas. Let’s try to get the average mpg grouped by cyl.
R Python
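# Only the Python side is sketched here (the original side-by-side listing is not
# reproduced); it assumes cyl and mpg are regular columns of mtcars_df:
>>> mtcars_df.groupby('cyl')['mpg'].mean()
cyl
4    26.663636
6    19.742857
8    15.100000
Name: mpg, dtype: float64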
R Python
We can also create multiple columns with group by. My feeling is data.table is
more expressive.
Python
1 >>> mtcars_df . groupby ([’cyl ’,’gear ’]). apply( lambda e:pd. Series ({
’mean_mpg ’:e.mpg.mean (), ’max_hp ’:e.hp.max () }))
2 mean_mpg max_hp
3 cyl gear
4 4 3 21.500 97.0
5 4 26.925 109.0
6 5 28.200 113.0
7 6 3 19.750 110.0
8 4 19.750 123.0
9 5 19.700 175.0
10 8 3 15.050 245.0
11 5 15.400 335.0
In data.table, there is also a keyword called keyby which enables group by and
sort operations together.
3.6 JOIN
Join7 combines columns from one or more tables in an RDBMS. The join operation is also available in data.table and pandas. We only talk about three types of joins here, i.e., inner join, left join and right join. Left join and right join are also referred to as outer joins.
Let’s make two tables to join.
7
https://en.wikipedia.org/wiki/Join_(SQL)
Figure 3.1: Inner join, left (outer) join and right (outer) join
Python
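>>> # reconstructed sketch of the missing construction code; the values are
>>> # inferred from the printed outputs below and the join results
>>> import pandas as pd
>>> employee_df = pd.DataFrame({'employee_id': [1, 2, 3, 4, 5, 6],
...                             'department_id': [1, 2, 2, 3, 1, 4]})
>>> department_df = pd.DataFrame({'department_id': [1, 2, 3],
...                               'department_name': ['Engineering', 'Operations', 'Sales']})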
8 >>> employee_df
9 employee_id department_id
10 0 1 1
11 1 2 2
12 2 3 2
13 3 4 3
14 4 5 1
15 5 6 4
We can join tables with or without the help of an index/key. In general, joining on an index/key is faster; thus, I recommend always setting the key/index for join operations.
> department_dt[employee_dt]  # employee_dt left join department_dt
4 department_id department_name employee_id
5 1: 1 Engineering 1
6 2: 1 Engineering 5
7 3: 2 Operations 2
8 4: 2 Operations 3
9 5: 3 Sales 4
10 6: 4 <NA > 6
> department_dt[employee_dt, nomatch=0]  # employee_dt inner join department_dt
12 department_id department_name employee_id
13 1: 1 Engineering 1
14 2: 1 Engineering 5
15 3: 2 Operations 2
16 4: 2 Operations 3
17 5: 3 Sales 4
18 > employee_dt [ department_dt ] # employee_dt right join
department_dt
19 employee_id department_id department_name
20 1: 1 1 Engineering
21 2: 5 1 Engineering
22 3: 2 2 Operations
23 4: 3 2 Operations
24 5: 4 3 Sales
To join data.tables A and B, the syntax A[B] and B[A] only works when the keys of A and B are set. The function merge from the data.table package works with or without keys.
Python
4 employee_id department_name
5 department_id
6 1 1 Engineering
7 1 5 Engineering
8 2 2 Operations
9 2 3 Operations
10 3 4 Sales
11 >>> pd. merge ( employee_df , department_df , how=’left ’, left_index =
True , right_index =True) # left join on index
12 employee_id department_name
13 department_id
14 1 1 Engineering
15 1 5 Engineering
16 2 2 Operations
17 2 3 Operations
18 3 4 Sales
19 4 6 NaN
20 >>> pd. merge ( employee_df , department_df , how=’right ’, left_index
=True , right_index =True) # right join on index
21 employee_id department_name
22 department_id
23 1 1 Engineering
24 1 5 Engineering
25 2 2 Operations
26 2 3 Operations
27 3 4 Sales
We have learned the very basics of data.table and pandas. In fact, there are lots of other useful features in both tools which are not covered in this chapter, for example, the .I/.N symbols in data.table and the stack/unstack methods in pandas.
CHAPTER 4
Random Variables,
Distributions & Linear
Regression
We will see a few topics related to random variables (r.v.) and distributions
in this chapter. Based on the concepts of random variables, we will further
introduce linear regression models.
random variable is equal to some value, but the PDF does not represent probabilities directly.
The quantile function is the inverse of the CDF, i.e., Q(p) = F^{-1}(p) = \inf\{x : F(x) \ge p\} for p \in (0, 1).
PDF, CDF and quantile functions are all heavily used in quantitative analysis. In R and Python, we can find a number of functions to evaluate them. The best-known distribution is probably the univariate normal/Gaussian distribution. Let's use Gaussian random variables for illustration.
5 > print(x)
8 > print(mean(x))
9 [1] 1.094594
10 > print(sd(x))
11 [1] 1.670898
12 > # evaluate PDF
13 > d = dnorm (x, mean =0, sd =2)
14 > print(d)
15 [1] 0.07793741 0.17007297 0.18674395 0.16327110 0.18381928
16 [6] 0.19835103 0.06364496 0.19857948 0.02601446 0.19907926
17 > # evalute CDF
18 > p = pnorm (x, mean =0, sd =2)
19 > print(p)
20 [1] 0.9148060 0.2861395 0.6417455 0.7365883 0.6569923 0.4577418
21 [7] 0.9346722 0.4622928 0.9782264 0.4749971
22 > # evaluate quantile
23 > q = qnorm (p, mean =0, sd =2)
24 > print(q)
25 [1] 2.7419169 − 1.1293963 0.7262568 1.2657252 0.8085366
26 [6] − 0.2122490 3.0230440 − 0.1893181 4.0368474 − 0.1254282
Python
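# Only a sketch of the Python side (the original listing is not reproduced in this extract).
import numpy as np
from scipy import stats

np.random.seed(42)
x = stats.norm.rvs(loc=0, scale=2, size=10)   # draw 10 samples from N(0, 2^2)
print(x.mean(), x.std())
d = stats.norm.pdf(x, loc=0, scale=2)         # evaluate the PDF
p = stats.norm.cdf(x, loc=0, scale=2)         # evaluate the CDF
q = stats.norm.ppf(p, loc=0, scale=2)         # evaluate the quantile; q recovers x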
The PDF of an m-dimensional multivariate Gaussian distribution is

p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right),    (4.3)
where μ is the mean and Σ is the covariance matrix of the random variable x.
Sampling from distributions is involved in many algorithms, such as Monte Carlo simulation. First, let's see a simple example in which we draw samples from a 3-dimensional normal distribution.
R Python
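# Only the Python side is sketched here; the mean vector and covariance matrix are
# illustrative assumptions, not the values used in the original listing.
import numpy as np

np.random.seed(42)
mean = np.array([0.0, 1.0, 2.0])
cov = np.array([[1.0, 0.2, 0.1],
                [0.2, 1.0, 0.3],
                [0.1, 0.3, 1.0]])
x = np.random.multivariate_normal(mean, cov, size=1000)
print(x.mean(axis=0))    # close to the mean vector
print(np.cov(x.T))       # close to the covariance matrix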
Please note in the example above, we do not calculate the quantiles. For multivariate distributions, the quantiles are not necessarily fixed points.
The inversion sampling technique works in two steps:
1. draw independent samples u1, ..., un from the uniform distribution on (0, 1);
2. calculate the quantiles q1, ..., qn for u1, ..., un based on the CDF, and return q1, ..., qn as the desired samples.
Let's see how to use the inversion sampling technique to sample from the exponential distribution, whose CDF is F_X(x; \lambda) = 1 - e^{-\lambda x}.
R Python
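# Only the Python side is sketched here (the original listing is not reproduced).
import numpy as np

np.random.seed(42)
lamb, n = 1.0, 1000
u = np.random.uniform(low=0, high=1, size=n)
# quantile function (inverse CDF) of the exponential distribution
x = -np.log(1.0 - u) / lamb
print(x.mean())    # should be close to 1/lamb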
R
n = 1000
set.seed(42)
lambda = 1.0
b = 2.0
x = rep(0, n)
i = 0
while (i < n) {
  y = rexp(n = 1, rate = lambda)
  # keep the draw only if it falls inside [0, b]
  if (y <= b) {
    i = i + 1
    x[i] = y
  }
}

Python
import numpy as np
np.random.seed(42)
lamb, b = 1.0, 2.0
n = 1000
x = []
i = 0
while i < n:
    y = np.random.exponential(scale=1.0 / lamb)
    # keep the draw only if it falls inside [0, b]
    if y <= b:
        i += 1
        x.append(y)
After running the code snippets above, we have the samples stored in x from the
truncated exponential distribution.
Now let’s use the rejection sampling technique for this task. Since we want to
sample the random variable between 0 and b, one natural choice of the proposal
distribution fY is a uniform distribution between 0 and b and we choose M = bλ/(1−
e−λb ). As a result, the acceptance probability fX (x)/(M fY (x)) becomes e−λx .
R
n = 1000
set.seed(42)
lambda = 1.0
b = 2.0
x = rep(0, n)
i = 0
while (i < n) {
  # sample from the proposal distribution
  y = runif(1, min = 0, max = b)
  # accept with probability exp(-lambda * y)
  if (runif(1) <= exp(-lambda * y)) {
    i = i + 1
    x[i] = y
  }
}

Python
import numpy as np
np.random.seed(42)
lamb, b = 1.0, 2.0
n = 1000
x = []
i = 0
while i < n:
    # sample from the proposal distribution
    y = np.random.uniform(low=0, high=b)
    # accept with probability exp(-lamb * y)
    if np.random.uniform() <= np.exp(-lamb * y):
        i += 1
        x.append(y)
We have seen basic examples of how to draw samples with inversion sampling and with sampling from a truncated distribution. Now let's work on a more challenging problem.
(Figure: a point P on the sphere, specified by the azimuthal angle φ and the polar angle θ in spherical coordinates.)
The problem appears simple at first glance - we could utilize the spherical coordinates sys
tem and draw samples for φ and θ . Now the question is how to sample for φ and θ . A
straightforward idea is to draw independent and uniform samples φ from 0 to 2π and θ from
0 to π , respectively. However, this idea is incorrect which will be analyzed below.
Let's use f_P(\phi, \theta) to denote the PDF of the joint distribution of (\phi, \theta). Integrating this PDF, we have

1 = \int_0^{2\pi} \int_0^{\pi} f_P(\phi, \theta)\, d\theta\, d\phi = \int_0^{2\pi} \int_0^{\pi} f_\Phi(\phi)\, f_{\Theta|\Phi}(\theta|\phi)\, d\theta\, d\phi.    (4.4)

If we enforce that \Phi has a uniform distribution between 0 and 2\pi, then f_\Phi(\phi) = 1/(2\pi), and

1 = \int_0^{\pi} f_{\Theta|\Phi}(\theta|\phi)\, d\theta.    (4.5)
For points uniformly distributed on the sphere, the probability over a surface patch is proportional to its area, and the area element in spherical coordinates is proportional to \sin(\theta)\, d\theta\, d\phi; to satisfy (4.5), the conditional PDF must therefore be f_{\Theta|\Phi}(\theta|\phi) = \sin(\theta)/2. Thus, we can generate the samples of \Phi from the uniform distribution and the samples of \Theta from the distribution whose PDF is \sin(\theta)/2. Sampling \Phi is trivial, but how about \Theta?
R
n = 2000
set.seed(42)
# sample phi
phi = runif(n, min = 0, max = 2 * pi)
# sample theta by inverting the CDF (1 - cos(theta)) / 2
u = runif(n, min = 0, max = 1)
theta = acos(1 - 2 * u)

Python
import numpy as np
np.random.seed(42)
n = 2000
# sample phi
phi = np.random.uniform(low=0, high=2 * np.pi, size=n)
# sample theta by inverting the CDF (1 - cos(theta)) / 2
u = np.random.uniform(low=0, high=1, size=n)
theta = np.arccos(1 - 2 * u)
There are also other solutions to this problem, which won’t be discussed in this
book. A related problem is to draw samples inside a sphere. We could solve the inside
sphere sampling problem with a similar approach, or by using the rejection sampling
approach, i.e., sampling from a cube with acceptance ratio π/6.
in (4.3). A multivariate distribution is also called joint distribution, since the mul
tivariate random variable can be viewed as a joint of multiple univariate random
variables. Joint PDF gives the probability density of a set of random variables. Some
times we may only be interested in the probability distribution of a single random
variable from a set. And that distribution is called marginal distribution. The PDF
of a marginal distribution can be obtained by integrating the joint PDF over all
the other random variables. For example, the integral of (4.3) gives the PDF of a
univariate normal distribution.
The joint distribution is the distribution about the whole population. In the con
text of a bivariate Gaussian random variable (X1 , X2 ), the joint PDF fX1 ,X2 (x1 , x2 )
specifies the probability density for all pairs of (X1 , X2 ) in the 2-dimension plane.
The marginal distribution of X1 is still about the whole population because we are
not ruling out any points from the support of the distribution function. Sometimes
we are interested in a subpopulation only, for example, the subset of (X1 , X2 ) where
X2 = 2 or X2 > 5. We can use conditional distribution to describe the probability
distribution of a subpopulation. To denote conditional distribution, the symbol | is fre
quently used. We use fX1 |X2 =0 (x1 |x2 ) to represent the distribution of X1 conditional
on X2 = 0. By the rule of conditional probability P (A|B) = P (A, B)/P (B), the cal
culation fX1 |X2 (x1 |x2 ) is straightforward, i.e., fX1 |X2 (x1 |x2 ) = fX1 ,X2 (x1 , x2 )/fX2 (x2 ).
The most well-known joint distribution is the multivariate Gaussian distribution.
Multivariate Gaussian distribution has many important and useful properties. For
example, given the observation of (X1 , ..., Xk ) from (X1 , ..., Xm ), (Xk + 1, ..., Xm ) is
still following a multivariate Gaussian distribution, which is essential to Gaussian
process regression3 .
We have seen the extension from univariate Gaussian distribution to multivari
ate Gaussian distribution, but how about other distributions? For example, what
is the joint distribution for two univariate exponential distribution? We could use
copula4 for such a purpose. For the random variable (X1 , ..., Xm ), let (U1 , ..., Um ) =
(FX1 (X1 ), ..., FXm (Xm )) where FXk is the CDF of Xk . We know Uk is following a
uniform distribution. Let C(U1 , ..., Um ) denote the joint CDF of (U1 , ..., Um ) and the
CDF is called copula.
There are different copula functions, and one commonly used is the Gaussian
copula. The standard Gaussian copula is specified as below.
C^{Gauss}_\Sigma(u_1, ..., u_m) = \Phi_\Sigma(\Phi^{-1}(u_1), ..., \Phi^{-1}(u_m)),    (4.6)
where Φ denotes the CDF of the standard Gaussian distribution, and ΦΣ denotes the
CDF of a multivariate Gaussian distribution with mean 0 and correlation matrix Σ.
Let’s see an example to draw samples from a bivariate exponential distribution
constructed via Gaussian copula. The basic idea of sampling multivariate random
variables via copula is to sample U1 , ..., Um first and then transform it to the desired
random variables.
3
http://www.gaussianprocess.org/gpml/chapters/RW2.pdf
4
https://en.wikipedia.org/wiki/Copula_(probability_theory)
2 > n = 10000
5 > mu = c (0, 0)
10 > # generate U
11 > u = pnorm (r)
12 > # calculate the quantile for X
13 > x1 = qexp(u[, 1], rate = rates [1])
14 > x2 = qexp(u[, 2], rate = rates [2])
15 > x = cbind (x1 , x2)
16 > cor(x)
17 x1 x2
18 x1 1.0000000 0.5476137
19 x2 0.5476137 1.0000000
20 > apply (x, mean , MARGIN = 2)
21 x1 x2
22 0.9934398 0.4990758
Python
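# Only a sketch of the Python side (the original listing is not reproduced here);
# the copula correlation 0.6 and the rates 1.0 and 2.0 are assumptions consistent
# with the R output above.
import numpy as np
from scipy import stats

np.random.seed(42)
n = 10000
rho = 0.6
r = np.random.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
u = stats.norm.cdf(r)                          # generate U
x1 = stats.expon.ppf(u[:, 0], scale=1 / 1.0)   # quantile of Exp(rate=1)
x2 = stats.expon.ppf(u[:, 1], scale=1 / 2.0)   # quantile of Exp(rate=2)
print(np.corrcoef(x1, x2))
print(x1.mean(), x2.mean())                    # close to 1.0 and 0.5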
Figure 4.2: Samples from a bivariate exponential distribution constructed via Gaus
sian copula
24 [0.57359023 , 1. ]])
We plot 2000 samples generated from the bivariate exponential distribution con
structed via copula in Figure 4.2.
With the help of copula, we can even construct joint distribution with marginals
from different distributions. For example, let’s make a joint distribution of a uniform
distributed random variable and an exponential distributed random variable.
2 > n = 10000
4 > mu = c(0, 0)
9 > # generate U
Figure 4.3: Samples from a joint distribution of a uniform marginal and an exponential
marginal
16 [1] 0.5220363
Statistics is used to solve real-world problems with data. In many cases we may have
a collection of observations for a random variable and want to know the distribution
which the observations follow. In fact, there are two questions involved in the process
of fitting a distribution. First, which distribution to fit? And second, given the distri
bution, how to estimate the parameters. These two questions are essentially the same
questions that we have to answer in supervised learning. In supervised learning, we
need to choose a model and estimate the parameters (if the model has parameters).
We can also refer to these two questions as model selection and model fitting. Usually,
model selection is done based on the model fitting.
Two widely used methods in distribution fitting are the method of moments and the maximum likelihood method. In this section we will see the method of moments; the maximum likelihood method will be introduced in Chapter 6. The k-th moment of a random variable is defined as \mu_k = E(x^k). If there are m parameters, usually we derive the first m theoretical moments in terms of the parameters, and by equating these theoretical moments to the sample moments \hat\mu_k = \frac{1}{n}\sum_{i=1}^{n} x_i^k we get the estimates.
Let's take the univariate Gaussian distribution as an example. We want to estimate the mean \mu and the variance \sigma^2. The first and second theoretical moments are \mu and \mu^2 + \sigma^2. Thus, the estimates \hat\mu and \hat\sigma^2 are \frac{1}{n}\sum_{i=1}^{n} x_i and \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat\mu)^2. The code snippets below show the implementation.
R Python
2 > n = 1000
2 >>> np. random .seed (42)
3 > mu = 2.5
3 >>> n = 1000
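# a fuller sketch of the Python implementation (the listing above is fragmentary;
# sigma = 1.5 is an assumption for illustration)
import numpy as np

np.random.seed(42)
n = 1000
mu, sigma = 2.5, 1.5
x = np.random.normal(mu, sigma, size=n)
mu_hat = x.mean()                             # first sample moment
sigma2_hat = np.mean(x ** 2) - mu_hat ** 2    # second moment minus squared first moment
print(mu_hat, sigma2_hat)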
We could also fit another distribution to the data generated from a normal distri
bution. But which one is better? One answer is to compare the likelihood functions
evaluated at the fitted parameters and choose the one that gives the larger likelihood
value.
Please note that different methods of fitting a distribution may lead to different parameter estimates. For example, for some distributions the estimate of the population variance using the maximum likelihood method is different from that using the method of moments. Actually, the estimator of the population variance given above (by the method of moments) is biased, while the estimator of the population mean is unbiased.
2 > n = 1000
3 > mu = 2.5
sqrt(n))
12 > print(CI)
13 [1] 2.388738 2.545931
Python
4 >>> n = 1000
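# a sketch of constructing the 95% CI for the sample mean (the listing above is
# fragmentary; sigma = 1.5 is an assumption for illustration)
import numpy as np
from scipy import stats

np.random.seed(42)
n, mu, sigma = 1000, 2.5, 1.5
x = np.random.normal(mu, sigma, size=n)
se = x.std(ddof=1) / np.sqrt(n)    # standard error of the sample mean
lower, upper = stats.norm.interval(0.95, loc=x.mean(), scale=se)
print(lower, upper)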
The interpretation of CI is tricky. A 95% CI does not mean the probability that
the constructed CI contains the true population mean is 0.95. Actually, a constructed
CI again is a random variable because the CI is created based on each random
sample collected. Following the classic explanation from textbooks, when we repeat
the procedures to create CI multiple times, the probability that the true parameter
falls into the CI is equal to α. Let’s do an example to see that point.
2 > B = 1000
3 > n = 1000
4 > mu = 2.5
4.5.2 Bootstrap
So far we have seen how to create the CI for sample mean. What if we are inter
ested in quantifying the uncertainty of other parameters, for example, the variance
of a random variable? If we estimate these parameters with the maximum likelihood
method, we can still construct the CI in a similar approach with the large sample
theory. However, we will not discuss it in this book. Alternatively, we could use the
bootstrap technique.
Bootstrap is simple yet powerful. It is a simulation-based technique. If we want
to estimate a quantity θ, first we write the estimator for θ as a function of a random
sample i.e., θ̂ = g(X1 , ..., Xn ). Next, we just draw a random sample and calculate
θˆ and repeat this process B times to get a collection of θ̂ denoted as θˆ(1) , ..., θ̂(B) .
From these simulated θ̂, we could simply use the percentile θ̂(1−α)/2 and θ̂(1+α)/2 to
construct the α CI. There are also other variants of the bootstrapping method with
similar ideas.
Let’s try to use bootstrap to construct a 95% CI for the population variance of a
Gaussian distributed random variable.
2 > B = 1000
3 > n = 1000
4 > mu = 2.5
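The R listing above is fragmentary; a sketch of the same idea in Python, resampling the observed data with replacement (sigma = 1.5 is an assumption for illustration):

Python
import numpy as np

np.random.seed(42)
B, n, mu, sigma = 1000, 1000, 2.5, 1.5
x = np.random.normal(mu, sigma, size=n)       # the observed sample
boot_var = np.empty(B)
for b in range(B):
    resample = np.random.choice(x, size=n, replace=True)
    boot_var[b] = resample.var()
ci = np.percentile(boot_var, [2.5, 97.5])     # 95% bootstrap CI for the variance
print(ci)                                     # usually covers sigma ** 2 = 2.25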
We have talked about confidence interval, which is used to quantify the uncertainty
in parameter estimation. The root cause of uncertainty in parameter estimation is
that we do the inference based on random samples. Hypothesis testing is another
technique related to confidence interval calculation.
When we perform a hypothesis testing, there are two possible outcomes, i.e., a)
reject H0 if the evidence is likely to support the alternative hypothesis, and b) do
not reject H0 because of insufficient evidence.
The key point to understand in hypothesis testing is the significant level, which is
usually denoted as α. When the null hypothesis is true, the rejection of null hypothesis
is called type I error. And the significance level is the probability of committing a type
I error. When the alternative hypothesis is true, the acceptance of null hypothesis is
called type II error. And the probability of committing a type II error is denoted as
β. 1 − β is called the power of a test.
To conduct a hypothesis testing, there are a few steps to follow. First we have
to specify the null and alternative hypotheses, and the significance level. Next, we
calculate the test statistic based on the data collected. Finally, we calculate the p-
value. If the p-value is smaller than the significance level, we reject the null hypothesis;
otherwise we accept it. Some books may describe a procedure to compare the test statistic with a critical region, which is essentially the same as the p-value approach.
The real challenge to conduct a hypothesis testing is to calculate the p-value, whose
calculation depends on which hypothesis test to use. Please note that the p-value
itself is a random variable since it is calculated from the random sample. And when
the null hypothesis is true, the distribution of p-value is uniform from 0 to 1.
The p-value is also a conditional probability. A major misinterpretation of the p-value is that it is the conditional probability that, given the observed data, the null hypothesis is true. Actually, the p-value is the probability of observing data at least as extreme as the observed data, given that the null hypothesis is true.
For many reasons, we will not go in-depth into the calculation of p-values in this
book. But the basic idea is to figure out the statistical distribution of the test statistic.
Let’s skip all the theories behind and go to the tools in R/Python.
A two-sided H1 does not specify if the population mean is greater or smaller than the
hypothesized population mean. In contrast, a one-sided H1 specifies the direction.
Now let’s see how to perform one-sample t test in R/Python.
2 > n = 50
3 > mu = 2.5
distribution
8 > x = rnorm (n, mu , sigma )
9 > mu_test = 2.5
10 >
11 > # first , let ’s do two−sided t test
12 > # H0: the population mean is equal to mu_test
13 > # H1: the population mean is not equal to mu_test
14 > t1 = t.test(x, alternative = "two.sided", mu = mu_test )
15 > print(t1)
16
19 data: x
20 t = −0.21906, df = 49, p−value = 0.8275
21 alternative hypothesis : true mean is not equal to 2.5
22 95 percent confidence interval :
23 2.137082 2.791574
24 sample estimates :
25 mean of x
26 2.464328
27
40 data: x
41 t = −0.21906, df = 49, p−value = 0.5862
42 alternative hypothesis : true mean is greater than 2.5
43 95 percent confidence interval :
44 2.191313 Inf
45 sample estimates :
46 mean of x
47 2.464328
48
60 data: x
61 t = 2.8514 , df = 49, p−value = 0.003177
62 alternative hypothesis : true mean is greater than 2
63 95 percent confidence interval :
64 2.191313 Inf
65 sample estimates :
66 mean of x
67 2.464328
68
Python
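# a sketch of the two-sided one-sample t test with scipy (the original listing is not
# reproduced here; sigma = 1.0 is an assumption)
import numpy as np
from scipy import stats

np.random.seed(42)
x = np.random.normal(2.5, 1.0, size=50)
t_stat, p_value = stats.ttest_1samp(x, popmean=2.5)
print(t_stat, p_value)    # under H0 the p-value is most likely large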
In the R code snippet we show both one-sided and two-sided one-sample t tests.
However, we only show a two-sided test in the Python program. It is feasible to
perform a one-sided test in an indirect manner with the same function, but I don’t
think it’s worth discussing here. For hypothesis testing, it seems R is a better choice
than Python.
3 > n = 50
distribution
18 Paired t−test
19
36 Paired t−test
37
Paired t test can also be done in Python, but we will not show the examples.
Unpaired t test is also about two population means’ difference, but the samples
are not paired. For example, we may want to study if the average blood pressure of
men is higher than that of women. In the unpaired t test, we also have to specify if
we assume the two populations have equal variance or not.
2 > n = 50
20 data: x1 and x2
21 t = 1.2286 , df = 98, p−value = 0.2222
22 alternative hypothesis : true difference in means is not equal to
0.5
23 95 percent confidence interval :
24 0.3072577 1.3192945
25 sample estimates :
26 mean of x mean of y
27 1.964328 1.151052
28
33
36 data: x1 and x2
37 t = 1.2286 , df = 94.78 , p−value = 0.2223
38 alternative hypothesis : true difference in means is not equal to
0.5
39 95 percent confidence interval :
40 0.3070428 1.3195094
41 sample estimates :
42 mean of x mean of y
43 1.964328 1.151052
44
45 > # there is no big change for p−value , we also accept the null
hypothesis since p−value >alpha
46 >
47 > # let ’s try a one sided test without equal variance
48 > # H0: the population means ’ difference is equal to mu_diff
49 > # H1: the population means ’ difference larger than mu_diff
50 > t.test(x1 , x2 , alternative = "less", mu = mu_diff , paired =
FALSE , var.equal = FALSE )
51
54 data: x1 and x2
55 t = 1.2286 , df = 94.78 , p−value = 0.8889
56 alternative hypothesis : true difference in means is less than
0.5
57 95 percent confidence interval :
58 −Inf 1.236837
59 sample estimates :
60 mean of x mean of y
61 1.964328 1.151052
62
There are many other important hypothesis tests, such as the chi-squared test5 ,
likelihood-ratio test6 .
5
https://en.wikipedia.org/wiki/Chi-squared_test
6
https://en.wikipedia.org/wiki/Likelihood-ratio_test
The linear regression model can be written in matrix form as

y = X\beta + \epsilon,    (4.7)

where the column vector y contains n observations on the dependent variable, X is an n \times (p + 1) matrix (n > p) of independent variables whose first column is the constant vector 1, \beta is a column vector of unknown population parameters to estimate from the data, and \epsilon is the error term (or noise). For the sake of illustration, (4.7) can be expanded as
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix}
1 & X_{11} & X_{21} & \cdots & X_{p1} \\
1 & X_{12} & X_{22} & \cdots & X_{p2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & X_{1n} & X_{2n} & \cdots & X_{pn}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}
+
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}    (4.8)
We apply the ordinary least squares (OLS)7 approach to estimate the model parameter \beta since it requires fewer assumptions than other estimation methods such as maximum likelihood estimation8. Suppose the estimated model parameter is denoted as \hat\beta; we define the residual vector of the system as e = y - X\hat\beta. The idea of OLS is to find the \hat\beta which minimizes the sum of squared residuals (SSR), i.e.,

\min_{\hat\beta} \; e^T e.    (4.9)
Now the question is how to solve the optimization problem (4.9). First, let's expand the SSR:

e^T e = (y - X\hat\beta)^T (y - X\hat\beta) = y^T y - 2\hat\beta^T X^T y + \hat\beta^T X^T X \hat\beta.    (4.10)
7
https://en.wikipedia.org/wiki/Ordinary_least_squares
8
https://en.wikipedia.org/wiki/Maximum_likelihood_estimation
Taking derivatives of (4.10) with respect to \hat\beta gives the gradient and the Hessian:

\frac{\partial\, e^T e}{\partial \hat\beta} = -2X^T y + 2X^T X \hat\beta, \qquad \frac{\partial^2\, e^T e}{\partial \hat\beta\, \partial \hat\beta^T} = 2X^T X.    (4.11)

We see that the second-order derivative is positive semidefinite, which implies the SSR in OLS is a convex function (see Section 3.1.4 in [3]); and for an unconstrained convex optimization problem, the necessary as well as sufficient condition for optimality is that the first-order derivative equals 0 (see Section 4.2.3 in [3]). Optimization of convex functions is very important in machine learning. Actually, the parameter estimations of many machine learning models are convex optimization problems.
Based on the analysis above, the solution of (4.9) is given in (4.12):

\hat\beta = (X^T X)^{-1} X^T y.    (4.12)
Now it seems we are ready to write our own linear regression model in R/Python. The solution in (4.12) involves matrix transposition, multiplication and inversion, all of which are supported in both R and Python. In Python, we can use the numpy module for the matrix operations.
However, in practice we don't solve linear regression with (4.12) directly. Why? Let's see an example with

X = \begin{bmatrix} 10^{6} & -1 \\ -1 & 10^{-6} \end{bmatrix}.
R Python
The R code above throws an error because of the singularity of X^T X. It's interesting that the corresponding Python code doesn't behave in the same way as R, which has been reported as an issue on github9.
When the matrix X^T X is singular, how do we solve the OLS problem? In this book, we will focus on the QR decomposition based solution. Singular value decomposition (SVD) can also be used to solve OLS, but it will not be covered in this book.
In linear algebra, the QR decomposition10 of a matrix X factorizes X into a product X = QR, where Q is an orthogonal matrix and R is an upper triangular matrix. Since the matrix Q is orthogonal (Q^T = Q^{-1}), we have
9
https://github.com/numpy/numpy/issues/10471
10
https://en.wikipedia.org/wiki/QR_decomposition
\hat\beta = (X^T X)^{-1} X^T y
         = (R^T Q^T Q R)^{-1} R^T Q^T y
         = (R^T R)^{-1} R^T Q^T y
         = R^{-1} Q^T y.    (4.13)
Now we are ready to write our simple R/Python functions for linear regression
with the help of QR decomposition according to (4.13).
R
qr_solver = function(x, y) {
  qr.coef(qr(x), y)
}

Python
def qr_solver(x, y):
    q, r = np.linalg.qr(x)
    p = np.dot(q.T, y)
    return np.dot(np.linalg.inv(r), p)
R chapter4/linear_regression.R
library(R6)
LR = R6Class(
  "LR",
  public = list(
    coef = NULL,
    initialize = function() {
    },
    fit = function(x, y) {
      # prepend a column of 1s for the intercept
      self$qr_solver(cbind(1, x), y)
    },
    qr_solver = function(x, y) {
      self$coef = qr.coef(qr(x), y)
    },
    predict = function(new_x) {
      cbind(1, new_x) %*% self$coef
    }
  )
)
Python chapter4/linear_regression.py
import numpy as np

class LR:
    def __init__(self):
        self.coef = None
    def qr_solver(self, x, y):
        q, r = np.linalg.qr(x)
        p = np.dot(q.T, y)
        return np.dot(np.linalg.inv(r), p)
    def fit(self, x, y):
        # prepend a column of 1s for the intercept
        self.coef = self.qr_solver(np.hstack((np.ones((x.shape[0], 1)), x)), y)
    def predict(self, new_x):
        return np.dot(np.hstack((np.ones((new_x.shape[0], 1)), new_x)), self.coef)
Now let’s try to use our linear regression model to solve a real regression problem
with the Boston dataset11 , and check the results.
2 >
4 >
5 > lr = LR$new ()
6 > # −i means excluding the ith column from the data. frame
8 > print(lr$coef)
11
https://www.cs.toronto.edu/ delve/data/boston/bostonDetail.html
9 crim zn indus
chas
10 3.645949 e+01 − 1.080114e−01 4.642046e−02 2.055863e−02
2.686734 e+00
11 nox rm age dis
rad
12 −1.776661e+01 3.809865 e+00 6.922246e−04 −1.475567e+00
3.060495e−01
13 tax ptratio black lstat
14 −1.233459e−02 −9.527472e−01 9.311683e−03 −5.247584e−01
15 > # let ’s make prediction on the same data
16 > pred=lr$ predict (data. matrix ( Boston[,− ncol( Boston )]))
17 > print(pred [1:5])
18 [1] 30.00384 25.02556 30.56760 28.60704 27.94352
19 >
20 > # compare it with the R built−in linear regression model
21 > rlm = lm(medv ~ ., data= Boston )
22 > print(rlm$coef)
23 ( Intercept ) crim zn indus
chas
24 3.645949 e+01 − 1.080114e−01 4.642046e−02 2.055863e−02
2.686734 e+00
25 nox rm age dis
rad
26 −1.776661e+01 3.809865 e+00 6.922246e−04 −1.475567e+00
3.060495e−01
27 tax ptratio black lstat
28 −1.233459e−02 −9.527472e−01 9.311683e−03 −5.247584e−01
29 > print(rlm$ fitted [1:5])
30 1 2 3 4 5
31 30.00384 25.02556 30.56760 28.60704 27.94352
Python
15 >>>
19 >>> print(reg.coef_)
23 − 5.24758378e −01]
The results from our own linear regression models are almost identical to the
results from lm() function or the sklearn.linear_model module, which means we
have done a great job so far.
\min_{\hat\beta} \; e^T e + \lambda \beta^T \beta.    (4.14)

Equivalently, (4.14) can be written in a constrained form:

\min_{\hat\beta} \; e^T e \quad \text{subject to} \quad \beta^T \beta \le t.    (4.15)
12
https://en.wikipedia.org/wiki/All_models_are_wrong
The theory behind ridge regression can be found in The Elements of Statistical Learning [9]. Let's turn our attention to the implementation of ridge regression. The solution to (4.14) can be obtained in the same way as the solution to (4.9), i.e.,

\hat\beta = (X^T X + \lambda I)^{-1} X^T y.    (4.16)

Again, in practice we don't use (4.16) to implement ridge regression, for the same reasons that we don't use (4.12) to solve linear regression without penalty.
Actually, we don't need new techniques to solve (4.14). Let's make a transformation of the objective function in (4.14):

e^T e + \lambda \beta^T \beta = \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \sum_{i=1}^{p} (0 - \sqrt{\lambda}\, \beta_i)^2.    (4.17)

In other words, ridge regression can be solved as an ordinary least squares problem on augmented data: \sqrt{\lambda} I is appended to X and p zeros are appended to y, which is exactly what the implementation below does.
R chapter4/linear_regression_ridge.R
13
https://en.wikipedia.org/wiki/Normalization_(statistics)
1 library (R6)
2 LR_Ridge = R6Class (
3 " LR_Ridge ",
4 public = list(
5 coef = NULL ,
6 mu = NULL ,
7 sd = NULL ,
8 lambda = NULL ,
9 initialize = function ( lambda ) {
10 self$ lambda = lambda
11 },
12 scale = function (x) {
13 self$mu = apply (x, 2, mean)
14 self$sd = apply (x, 2, function (e) {
15 sqrt (( length (e) − 1) / length (e)) ∗ sd(e)
16 })
17 },
18 transform = function (x) {
19 return (t((t(x) − self$mu) / self$sd))
20 },
21 fit = function (x, y) {
22 self$ scale(x)
23 x_transformed = self$ transform (x)
24 x_lambda = rbind ( x_transformed , diag(rep(sqrt(self$ lambda )
, ncol(x))))
25 y_lambda = c(y, rep (0, ncol(x)))
26 self$ qr_solver ( cbind (c(rep (1, nrow(
27 x
28 )), rep (0, ncol(
29 x
30 ))), x_lambda ), y_lambda )
31 },
32 qr_solver = function (x, y) {
33 self$coef = qr.coef(qr(x), y)
34 },
35 predict = function ( new_x ) {
36 new_x_transformed = self$ transform (new_x)
37 cbind (rep (1, nrow( new_x )), new_x_transformed ) % ∗ % self$
coef
38 }
39 )
40 )
Python chapter4/linear_regression_ridge.py
1 import numpy as np
5 class LR_Ridge :
7 self.l = l
8 self.coef = None
10
2 >
4 >
nox
10 22.53280632 − 0.92396151 1.07393055 0.12895159 0.68346136
− 2.04275750
11 rm age dis rad tax
ptratio
12 2.67854971 0.01627328 − 3.09063352 2.62636926 − 2.04312573
− 2.05646414
13 black lstat
14 0.84905910 − 3.73711409
15 > # let ’s make prediction on the same data
16 > pred= ridge$ predict (data. matrix ( Boston[,− ncol( Boston )]))
17 > print(pred [1:5])
18 [1] 30.01652 25.02429 30.56839 28.61521 27.95385
Python
3 >>>
7 ...
9 >>> ridge.fit(X, y)
10 >>> print(ridge.coef)
11 [ 2.25328063 e+01 − 9.23961511e−01 1.07393055 e+00 1.28951591e−01
12 6.83461360e−01 − 2.04275750e+00 2.67854971 e+00 1.62732755e−02
13 − 3.09063352e+00 2.62636926 e+00 − 2.04312573e+00 − 2.05646414e+00
14 8.49059103e−01 − 3.73711409e+00]
It’s exciting to see the outputs from R and Python are quite consistent.
Linear regression is not as simple as it seems. To learn more about it, I recommend
reading [10].
CHAPTER 5
Optimization in Practice
5.1 CONVEXITY
The importance of convexity cannot be overestimated. A convex optimization problem [3] has the following form:

\min_{x} \; f_0(x) \quad \text{subject to} \quad f_i(x) \le b_i, \; i = 1, \ldots, m,    (5.1)

where the vector x \in R^n represents the decision variable and f_0, \ldots, f_m are convex functions (R^n \to R). A function f_i is convex if

f_i(\alpha x + (1 - \alpha) y) \le \alpha f_i(x) + (1 - \alpha) f_i(y)    (5.2)

for all x, y \in R^n and all \alpha \in [0, 1].
Figure 5.1: A non-convex function f . Its local optimum f (x2 ) is a global optimum,
but the local optimum f (x1 ) is not
The gradient descent method updates the decision variable iteratively:

x^{(k+1)} = x^{(k)} - \gamma \nabla f(x^{(k)}),    (5.3)

where f is the objective function and \gamma is called the step size or learning rate. Starting from an initial value x^{(0)} and following (5.3), we obtain a monotonic sequence f(x^{(0)}), f(x^{(1)}), ..., f(x^{(n)}) in the sense that f(x^{(k)}) \ge f(x^{(k+1)}). When the problem is convex, f(x^{(k)}) converges to the global minimum (see Figure 5.2).
Many machine learning models can be solved by the gradient descent algorithm,
for example, the linear regression (with or without penalty) introduced in Chapter
4. Let’s see how to use the vanilla (standard) gradient descent method in the linear
regression that we introduced in Chapter 4.
According to the gradient derived in (4.11), the update for the parameters becomes

\beta^{(k+1)} = \beta^{(k)} - \gamma \left( -2 X^T y + 2 X^T X \beta^{(k)} \right).    (5.4)
R chapter5/linear_regression_gradient_descent.R
1 library (R6)
2 LR_GD = R6Class (
4 public = list(
5 coef = NULL ,
6 learning_rate = NULL ,
7 x = NULL ,
8 y = NULL ,
9 seed = NULL ,
35 new_x % ∗ % self$coef
36 }
37 )
38 )
Python chapter5/linear_regression_gradient_descent.py
1 class LR_GD:
2 """
4 """
10 self.seed = seed
11 np. random .seed(self.seed)
12 self.coef = np. random . uniform (size =( self.x.shape [1], 1))
13 self. learning_rate = learning_rate
14
R Python
The results show that our implementation of gradient descent update works well
on the simulated dataset for linear regression.
However, when the loss function is non-differentiable, the vanilla gradient descent update (5.3) cannot be used. In Chapter 4, we added the L2 norm (also referred to as the Euclidean norm2) of \beta to the loss function to get ridge regression. What if we change the L2 norm to the L1 norm in (4.14)?

\min_{\hat\beta} \; e^T e + \lambda \sum_{i=1}^{p} |\beta_i|,    (5.5)
where λ > 0.
Solving the optimization problem specified in (5.5), we will get the Lasso solution
of linear regression. Lasso is not a specific type of machine learning model. Actually,
Lasso refers to least absolute shrinkage and selection operator. It is a method that
performs both variable selection and regularization. What does variable selection
mean? When we build up a machine learning model, we may collect as many data
points (also called features, or independent variables) as possible. However, if only a
subset of these data points are relevant for the task, we may need to select the subset
explicitly for some reasons. First, it’s possible to reduce the cost (time or money)
to collect/process these irrelevant variables. Second, adding irrelevant variables into
some machine learning models may impact the model performance. But why? Let’s
take linear regression as an example. Including an irrelevant variable into a linear
regression model is usually referred to as model misspecification. (Of course, omitting
a relevant variable also results in a misspecified model.) In theory, the estimator of
2
https://en.wikipedia.org/wiki/Norm_(mathematics)
Now we are ready to implement the Lasso solution of linear regression via (5.7). Similar to ridge regression, we don't want to penalize the intercept, and a natural estimate of the intercept is the mean of the response, i.e., \bar{y}. The learning rate \gamma in each update can be fixed or adaptive (changing over iterations). In the implementation below, we use the reciprocal of the largest eigenvalue of X^T X as the fixed learning rate.
R chapter5/lasso.R
1 library (R6)
# soft-thresholding operator
sto = function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}
8 Lasso = R6Class (
10 public = list(
3
https://en.wikipedia.org/wiki/Efficient_estimator
11 intercept = 0,
12 beta = NULL ,
13 lambda = NULL ,
14 mu = NULL ,
15 sd = NULL ,
16 initialize = function ( lambda ) {
17 self$ lambda = lambda
18 },
19 scale = function (x) {
20 self$mu = apply (x, 2, mean)
21 self$sd = apply (x, 2, function (e) {
22 sqrt (( length (e)−1) / length (e)) ∗ sd(e)
23 })
24 },
25 transform = function (x) t((t(x) − self$mu) / self$sd),
26 fit = function (x, y, max_iter =100) {
27 if (!is. matrix (x)) x = data. matrix (x)
28 self$ scale(x)
29 x_transformed = self$ transform (x)
30 self$ intercept = mean(y)
31 y_centered = y − self$ intercept
32 gamma = 1/( eigen (t( x_transformed ) % ∗ % x_transformed , only.
values =TRUE)$ values [1])
33 beta = rep (0, ncol(x))
34 for (i in 1: max_iter ){
35 nabla = − t( x_transformed ) % ∗ % ( y_centered −
x_transformed % ∗ % beta)
36 z = beta − 2 ∗ gamma ∗ nabla
37 print(z)
38 beta = sto(z, self$ lambda ∗ gamma )
39 }
40 self$beta = beta
41 },
42 predict = function ( new_x ) {
43 if (!is. matrix (new_x)) new_x = data. matrix (new_x)
44 self$ transform (new_x) % ∗ % self$beta + self$ intercept
45 }
46 )
47 )
Python chapter5/lasso.py
2 import numpy as np
def sto(z, l):
    # soft-thresholding operator
    return np.sign(z) * np.maximum(np.abs(z) - l, 0.0)
7
8 class Lasso:
10 self.l = l # l is lambda
11 self. intercept = 0.0
12 self.beta = None
13 self. scaler = StandardScaler ()
14
Now, let’s see the application of our own Lasso solution of linear regression on
the Boston house-prices dataset.
4 > lr$fit(data. matrix ( Boston[,− ncol( Boston )]) , Boston $medv , 100)
5 > print(lr$beta)
6 [ ,1]
7 crim − 0.34604592
8 zn 0.38537301
9 indus − 0.03155016
10 chas 0.61871441
11 nox − 1.08927715
12 rm 2.96073304
13 age 0.00000000
14 dis − 1.74602943
15 rad 0.02111245
16 tax 0.00000000
17 ptratio − 1.77697602
18 black 0.67323937
19 lstat − 3.71677103
Python
In the example above, we set \lambda = 200.0 arbitrarily. The results show that the coefficients of age and tax are zero. In practice, the selection of \lambda is usually done by cross-validation, which will be introduced later in this book. If you have used the off-the-shelf Lasso solvers in R/Python, you may wonder why the \lambda used in this example is so big. One major reason is that in our implementation the first item in the loss function, i.e., e^T e, is not scaled by the number of observations.
In practice, one challenge in applying Lasso regression is the selection of the parameter \lambda. We will talk about that in Chapter 6 in more detail. The basic idea is to select the \lambda that minimizes the prediction error on unseen data.
5.3 ROOT-FINDING
There is a close relation between optimization and root-finding. To find the roots of f(x) = 0, we may try to minimize f^2(x) instead. Under specific conditions, to minimize f(x) we may try to solve the root-finding problem f'(x) = 0 (e.g., see linear regression in Chapter 5). Various root-finding methods are available in both R and Python.
Let's see an application of root-finding in finance.
The net present value of a series of cash flows is NPV = \sum_{i} C_i / (1 + r)^{t_i}, where C_i denotes the i-th cash flow, t_i denotes its duration from time 0, and r is the discount rate. The internal rate of return (IRR) is the rate r that makes the NPV equal to zero; thus, we can solve for the IRR by finding the root of the NPV.
R chapter5/xirr.R
Python chapter5/xirr.py
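# This is not the book's chapter5/xirr.py; it is a sketch of the same idea with
# assumed cash flows: solve NPV(r) = 0 with a root-finding routine.
from scipy import optimize

def npv(rate, cashflows, times):
    return sum(c / (1.0 + rate) ** t for c, t in zip(cashflows, times))

cashflows = [-1000.0, 300.0, 400.0, 500.0]   # assumed cash flows
times = [0.0, 1.0, 2.0, 3.0]                 # years from time 0
irr = optimize.brentq(lambda r: npv(r, cashflows, times), -0.5, 1.0)
print(irr)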
R Python
5
https://en.wikipedia.org/wiki/Nelder-Mead_method
very brief introduction of what MLE is. To understand MLE, first we need to understand the likelihood function. Simply speaking, the likelihood function L(\theta|x) is a function of the unknown parameters \theta that describes the probability or odds of obtaining the observed data x (x is a realization of the random variable X). For example, when the random variable X follows a continuous probability distribution with probability density function (PDF) f_\theta, L(\theta|x) = f_\theta(x).
Given the observed data and a model from which the data are generated, we may maximize the corresponding likelihood function to estimate the model parameters. There are some nice properties of MLE, such as its consistency and efficiency. To better understand MLE and its
the log-likelihood function. When X ∼ N (μ, σ 2 ), the pdf of X is given as below
1 2 2
f (x|(μ, σ 2 )) = √ e−(x−μ) /2σ . (5.10)
σ 2π
Taking the logarithm, the log-likelihood function is equal to
n
n 1
L(θ|x) = − log(2π) − n log σ − 2 (xi − μ)2 . (5.11)
2 2σ i=1
Since the first item in (5.11) is a constant, we can simply set the log-likelihood
function to
n
1
L(θ|x) = −n log σ − (xi − μ)2 . (5.12)
2σ 2 i=1
It’s worth noting (5.12) is convex. Let’s implement the MLE for normal distribu
tion in R/Python.
R chapter5/normal_mle.R
Python chapter5/normal_mle.py
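# This is not the book's chapter5/normal_mle.py; it is a sketch of the idea described
# below: minimize the negative of (5.12) with L-BFGS-B, bounding sigma from below.
import numpy as np
from scipy.optimize import minimize

def negative_log_likelihood(theta, x):
    mu, sigma = theta
    return len(x) * np.log(sigma) + np.sum((x - mu) ** 2) / (2.0 * sigma ** 2)

np.random.seed(42)
x = np.random.normal(2.5, 1.5, size=1000)    # simulated data; true mu = 2.5, sigma = 1.5
res = minimize(negative_log_likelihood, x0=np.array([0.0, 1.0]), args=(x,),
               method='L-BFGS-B', bounds=[(None, None), (1e-6, None)])
print(res.x)    # estimates of mu and sigma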
There is no bound set for the mean parameter \mu, while we need to set a lower bound for the standard deviation \sigma. We chose 'L-BFGS-B' as the optimization algorithm in this example; it makes use of the gradient of the objective function. When the gradient function is not provided, the gradient is estimated numerically. In general, providing the gradient function may speed up the work. 'L-BFGS-B' is a quasi-Newton method. The Newton method requires the Hessian matrix for optimization, but the calculation of the Hessian matrix may be expensive or even unavailable in some cases. The quasi-Newton methods approximate the Hessian matrix of the objective function instead of calculating it. Now let's use these functions to estimate the parameters of the normal distribution.
R Python
Let's see another application of general-purpose minimization - logistic regression. In logistic regression, we assume

Pr(y_i = 1|x_i, \beta) = \frac{1}{1 + e^{-x_i\beta}},   (5.13)

where x_i = [1, X_{1i}, X_{2i}, ..., X_{pi}], and β denotes the parameter vector of the logistic model.
What does (5.13) mean? It implies that the response variable y_i given x_i and β follows a
Bernoulli distribution. More specifically, y_i ∼ Bern(1/(1 + exp(-x_iβ))). Based on the
assumption that the observations are independent, the log-likelihood function is
L(\beta|X, y) = \sum_{i=1}^{n} \log\Big( Pr(y_i = 1|x_i, \beta)^{y_i} \, Pr(y_i = 0|x_i, \beta)^{1-y_i} \Big)
             = \sum_{i=1}^{n} \Big( y_i \log Pr(y_i = 1|x_i, \beta) + (1 - y_i) \log Pr(y_i = 0|x_i, \beta) \Big)   (5.14)
             = \sum_{i=1}^{n} \Big( y_i x_i\beta - \log(1 + e^{x_i\beta}) \Big).
Given the log-likelihood function, we can get the maximum likelihood estimate
of logistic regression by minimizing the negative of (5.14), which is convex. The minimization
can be done similarly to linear regression via the iteratively re-weighted least squares
(IRLS) method [9]. However, in this section we will not use the IRLS method and
instead we use the optim function and the scipy.optimize.minimize function in R and
Python, respectively.
R chapter5/logistic_regression.R
1 library(R6)
2
3 LogisticRegression = R6Class(
4   "LogisticRegression",
5   public = list(
6     coef = NULL,
7     initialize = function() {
8     },
9     sigmoid = function(x) {
10       1.0/(1.0 + exp(-x))
11     },
12     log_lik = function(beta, x, y) {
13       linear = x %*% beta
14       ll = sum(linear*y) - sum(log(1 + exp(linear)))
15       return(-ll) # return negative log-likelihood
16     },
17     fit = function(x, y) {
18       if (!is.matrix(x)) x = data.matrix(x)
19       self$coef = optim(par=rep(0, 1+ncol(x)), fn=self$log_lik,
         x=cbind(1, x), y=y, method="L-BFGS-B")$par
20     },
21     predict = function(new_x) {
22       if (!is.matrix(new_x)) new_x = data.matrix(new_x)
23       linear = cbind(1, new_x) %*% self$coef
24       self$sigmoid(linear)
25     }
26   )
27 )
Python chapter5/logistic_regression.py
1 import numpy as np
2 from scipy . optimize import minimize
3
4 class LogisticRegression :
5 def __init__ (self):
6 self.coef = None
7
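The Python listing above is truncated after the constructor. Below is a sketch of how the full class might look, mirroring the R version; it is my reconstruction for illustration, not the book's exact chapter5/logistic_regression.py code.

# A sketch mirroring the R class above (my reconstruction, not the book's exact code).
import numpy as np
from scipy.optimize import minimize

class LogisticRegression:
    def __init__(self):
        self.coef = None

    def sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def log_lik(self, beta, x, y):
        linear = x @ beta
        ll = np.sum(linear * y) - np.sum(np.log(1.0 + np.exp(linear)))
        return -ll  # negative log-likelihood, to be minimized

    def fit(self, x, y):
        x = np.hstack((np.ones((x.shape[0], 1)), x))  # add the intercept column
        self.coef = minimize(self.log_lik, np.zeros(x.shape[1]),
                             args=(x, y), method="L-BFGS-B").x

    def predict(self, new_x):
        linear = np.hstack((np.ones((new_x.shape[0], 1)), new_x)) @ self.coef
        return self.sigmoid(linear)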
Now let's see how to use our own logistic regression. We use the banknote dataset7
in this example. In the banknote dataset, there are four different predictors extracted
from the image of a banknote, which are used to predict whether a banknote is genuine or
forged.
Python
7 https://archive.ics.uci.edu/ml/datasets/banknote+authentication
The above implementation of logistic regression works well. But it is only a pedagogical
toy to illustrate what a logistic regression is and how it can be solved with
a general-purpose optimization tool. In practice, there are fast and stable off-the-shelf
software tools to choose from. It's also worth noting that most iterative optimization
algorithms allow specifying the maximum number of iterations as well as the tolerance.
Sometimes it's helpful to tune these parameters.
In practice, the penalized (L1, L2) versions of logistic regression are more commonly
used than the vanilla logistic regression we introduced above.
A linear programming (LP) problem can be written in the following form:

\max_{x} \; c'x
subject to \; Ax \le b,   (5.15)
where the vector x represents the decision variables; A is a matrix, and b and c are vectors.
We can do minimization instead of maximization in practice. Also, it's completely
valid to have equality constraints in LP problems. All LP problems can be converted
into the form of (5.15). For example, the equality constraint Cx = d can be
transformed into the inequality constraints Cx ≤ d and −Cx ≤ −d.
Every LP problem falls into one of three categories:
1. infeasible - there is no solution that satisfies all the constraints;
2. unbounded - for every feasible solution, there exists another feasible solution
that improves the objective function;
3. having an optimal solution - there exists a feasible solution attaining the best objective value.
There is no intermediate payment between the vintage year and the maturity year. Payments
received on maturity can be reinvested. We also assume the risk-free rate is 0. The goal is
to construct a portfolio using these investment opportunities with $10,000 at year 1 so that
the net asset value at year 5 is maximized.
Let x_1, x_2, x_3 and x_4 denote the amounts to invest in each investment opportunity, and let
x_0 denote the cash not invested in year 1. The problem can then be formulated as an LP problem in the form of (5.15).
In Python, there are quite a few tools that can be used to solve LP, such as
scipy10 and cvxpy11; in this example we use the ortools package. In R, we use the lpSolve12 package.
R chapter5/portfolio_construction.R
1 library ( lpSolve )
3 total = 10000.0
Python chapter5/portfolio_construction.py
1 from ortools . linear_solver import pywraplp
2
3 total = 10000.0
4 rate_of_return = [0.0 , 0.08 , 0.12 , 0.16 , 0.14]
5 solver = pywraplp.Solver('PortfolioConstruction', pywraplp.Solver.
GLOP_LINEAR_PROGRAMMING)
6
8 x = [None] * 5
10 https://scipy.org/
11 https://www.cvxpy.org/
12 https://cran.r-project.org/web/packages/lpSolve/lpSolve.pdf
R Python
We see that the optimal portfolio allocation results in a net asset value of 15173.22
in year 5. You may notice that the Python solution is lengthy compared to the R
solution. That is because the ortools interface in Python is in an OOP fashion and
we add the constraints one by one. But for lpSolve, we utilize the compact matrix
form (5.15) to specify an LP problem. In ortools, we specify the bounds of a decision
variable during its declaration. In lpSolve, all decision variables are non-negative by default.
LP also has applications in machine learning; for example, least absolute deviations
(LAD) regression13 can be solved by LP. LAD regression is just another type
of linear regression. The major difference between LAD regression and the linear regression
we introduced in Chapter 4 is that the loss function to minimize in LAD is
the sum of the absolute values of the residuals rather than the SSR loss.
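To make the connection concrete, below is a minimal sketch of LAD regression solved as an LP with scipy.optimize.linprog. The formulation (introducing auxiliary variables t_i ≥ |y_i − x_iβ|) is mine and is not taken from the book; the simulated data are arbitrary.

# A sketch of LAD regression as an LP; the formulation and data are mine.
import numpy as np
from scipy.optimize import linprog

def lad_regression(X, y):
    n, p = X.shape
    X1 = np.hstack((np.ones((n, 1)), X))      # add an intercept column
    # decision variables: [beta (p+1), t (n)]; minimize the sum of t
    c = np.concatenate((np.zeros(p + 1), np.ones(n)))
    # |y - X1 beta| <= t  <=>  X1 beta - t <= y  and  -X1 beta - t <= -y
    A_ub = np.vstack((np.hstack((X1, -np.eye(n))),
                      np.hstack((-X1, -np.eye(n)))))
    b_ub = np.concatenate((y, -y))
    bounds = [(None, None)] * (p + 1) + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:p + 1]   # the fitted coefficients

np.random.seed(0)
X = np.random.randn(100, 2)
y = 1.0 + X @ np.array([2.0, -1.0]) + np.random.randn(100)
print(lad_regression(X, y))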
5.6 MISCELLANEOUS
5.6.1 Stochasticity
The update schedule of gradient descent is straightforward. But what should we do if the
dataset is too big to fit into memory? That is a valid question, especially in the era
of big data. There is a naive but brilliant solution - we can take a random sample
from all the observations and only evaluate the loss function on the random sample in
each iteration. This variant of the gradient descent algorithm is called stochastic
gradient descent, and a minimal sketch is given below.
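The sketch below illustrates stochastic gradient descent for linear regression; it is my own example, and the batch size, step size and number of iterations are arbitrary choices.

# A minimal sketch of stochastic gradient descent for linear regression;
# the hyperparameters are arbitrary choices for illustration.
import numpy as np

def sgd_linear_regression(X, y, batch_size=32, learning_rate=0.01, max_iter=2000, seed=42):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        # evaluate the gradient on a random sample instead of the full data
        idx = rng.choice(n, size=batch_size, replace=False)
        grad = -2.0 * X[idx].T @ (y[idx] - X[idx] @ beta) / batch_size
        beta -= learning_rate * grad
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=10000)
print(sgd_linear_regression(X, y))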
In Newton's method, the second-order derivatives (the Hessian matrix H) are used in addition to the gradient, and the update schedule becomes

x^{(k+1)} = x^{(k)} - \gamma H(x^{(k)})^{-1} \nabla f(x^{(k)}).   (5.17)
We have seen the proximal gradient descent method for Lasso. Actually, the
second-order derivatives can also be incorporated into the proximal gradient descent
method, which leads to the proximal Newton method.
A general constrained optimization problem can be written as

\max_{x} \; f(x)
subject to \; g_i(x) \le b_i, \; i = 1, ..., m,   (5.18)
           \; h_j(x) = c_j, \; j = 1, ..., k,
13 https://en.wikipedia.org/wiki/Least_absolute_deviations
where the g_i are the inequality constraints and the h_j are the equality constraints, both of which
could be linear or nonlinear. A constrained optimization problem may or may not
be convex. Although there are some tools in R/Python for constrained optimization,
they may fail if you just throw the problem at them. It is better
to understand the basic theories for constrained optimization problems, such as
the Lagrangian and the Karush-Kuhn-Tucker (KKT) conditions [3].
For example, the Lasso problem can also be written as a constrained optimization problem:

\min_{\hat\beta} \; e'e
subject to \; \lambda \sum_{i=1}^{p} |\beta_i| \le t.   (5.19)

The constraint with absolute values can be expanded into 2^p linear constraints, and thus (5.19) becomes

\min_{\hat\beta} \; e'e
subject to
\lambda\beta_1 + \lambda\beta_2 + \cdots + \lambda\beta_p \le t
-\lambda\beta_1 + \lambda\beta_2 + \cdots + \lambda\beta_p \le t
\cdots
-\lambda\beta_1 - \lambda\beta_2 - \cdots - \lambda\beta_p \le t.   (5.20)
15 https://cran.r-project.org/web/views/Optimization.html, https://cvxopt.org
Metaheuristics are high-level strategies to guide the search process for solutions. There is a wide variety of metaheuristic
algorithms in the literature, and one of my favorites is simulated annealing
(SA) [16]. The SA algorithm can be used for both continuous and discrete optimization
problems (or hybrid problems, such as mixed integer programming16). The optim function in R
has the SA algorithm implemented (set method = 'SANN') for general-purpose minimization
with continuous variables. For discrete optimization problems, a customized implementation is usually
required. The pseudocode of SA is given as follows.
• generate an initial solution s_0 and set the initial temperature T_0
• for k = 1, ..., m
   • propose a new solution s' from the neighborhood of s_{k-1}
   • compute Δ = E(s') − E(s_{k-1}), where E denotes the energy (objective value) of a solution; accept s' with probability min(1, exp(−Δ/T_k)), i.e., set s_k = s' if accepted and s_k = s_{k-1} otherwise
   • cool down the temperature, e.g., T_{k+1} = αT_k
• return s_m.
Let's see how we can apply SA to the famous traveling salesman problem (TSP).
R chapter5/TSP.R
1 library (R6)
2
3 TSP = R6Class (
16 https://en.wikipedia.org/wiki/Integer_programming
4 "TSP",
5 public = list(
6 cities = NULL ,
7 distance = NULL ,
8 initialize = function (cities , seed = 42) {
9 set.seed(seed)
10 self$ cities = cities
11 self$distance = as.matrix(dist(cities))
12 },
13 calculate_length = function (path) {
14 l = 0.0
15 for (i in 2:length(path)) l = l + self$distance[path[i-1], path[i]]
16 l + self$distance[path[1], path[length(path)]]
17 },
18 accept = function(T_k, energy_old, energy_new) {
19   delta = energy_new - energy_old
20   p = exp(-delta/T_k)
21   p >= runif(1)
22 },
23 solve = function (T_0 , alpha , max_iter ) {
24 T_k = T_0
25 # create the initial solution s0
26 s = sample (nrow(self$ distance ), nrow(self$ distance ),
replace = FALSE )
27 length_old = self$ calculate_length (s)
28 for (i in 1: max_iter ){
Python chapter5/TSP.py
1 import numpy as np
4 class TSP:
10
12 l = 0.0
16
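The Python listing above is heavily truncated. Below is a compact, self-contained sketch of simulated annealing for the TSP that mirrors the R class; it is my own reconstruction, not the book's chapter5/TSP.py code, and the neighborhood move (a simple city swap) is an assumption.

# A compact sketch of SA for the TSP, mirroring the R class above (not the book's exact code).
import numpy as np

def solve_tsp(cities, T_0=2000.0, alpha=0.99, max_iter=2000, seed=42):
    rng = np.random.default_rng(seed)
    n = len(cities)
    # pairwise distance matrix between cities
    dist = np.sqrt(((cities[:, None, :] - cities[None, :, :]) ** 2).sum(-1))

    def length(path):
        # total length of the closed route
        return dist[path, np.roll(path, -1)].sum()

    s = rng.permutation(n)               # initial solution
    energy_old, T_k = length(s), T_0
    for _ in range(max_iter):
        # propose a neighbor by swapping two cities
        i, j = rng.choice(n, size=2, replace=False)
        s_new = s.copy()
        s_new[i], s_new[j] = s_new[j], s_new[i]
        energy_new = length(s_new)
        # accept with probability exp(-delta / T_k)
        if np.exp(-(energy_new - energy_old) / T_k) >= rng.uniform():
            s, energy_old = s_new, energy_new
        T_k *= alpha                     # cool down
    return s, energy_old

cities = np.random.default_rng(0).uniform(size=(10, 2))
print(solve_tsp(cities))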
R Python
7 5, 4, 9]) ,
8 $ length 47.372551972154646)
9 [1] 47.37255
In the implementation above, we set the initial temperature to 2000 and the cooling
rate α to 0.99. After 2000 iterations, we get the same route from both implementations.
In practice, it's common to see metaheuristic algorithms get trapped in local
optima.
CHAPTER 6
Machine Learning - A gentle introduction
2 https://en.wikipedia.org/wiki/Transfer_learning
3 https://en.wikipedia.org/wiki/Semi-supervised_learning
4 https://en.wikipedia.org/wiki/Support-vector_machine
5 https://en.wikipedia.org/wiki/Linear_discriminant_analysis
Sometimes deep neural network models cannot match the prediction accuracy of other models
for a specific problem.
Figure 6.1: A random sample from a population that follows a linear model (the dashed
line) is overfit by the solid curve
• split the data into k non-overlapping folds
• for i = 1, ..., k
   • train the model on all folds except fold i
   • evaluate the model on fold i and record the score
• report the average of the k scores.
But for pedagogical purposes, let’s do it step by step. We do the cross validation
using the Lasso regression we built in a previous chapter on the Boston dataset.
3 library (caret)
4 library (MASS)
5 library ( Metrics ) # we use the rmse function from this package
6 k = 5
7
8 set.seed (42)
9 # if we set returnTrain = TRUE , we get the indices for train
partition
10 test_indices = createFolds ( Boston$medv , k = k, list = TRUE ,
returnTrain = FALSE )
11 scores = rep(NA , k)
12
13 for (i in 1:k){
14 lr = Lasso$new (200)
15 # we exclude the indices for test partition and train the
model
16   lr$fit(data.matrix(Boston[-test_indices[[i]], -ncol(Boston)]),
       Boston$medv[-test_indices[[i]]], 100)
17   y_hat = lr$predict(data.matrix(Boston[test_indices[[i]], -ncol(Boston)]))
18   scores[i] = rmse(Boston$medv[test_indices[[i]]], y_hat)
19 }
20 print(mean( scores ))
Python
1 import sys
2 sys.path. append ("..")
3
10 boston = load_boston ()
11 X, y = boston .data , boston . target
12
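The Python listing above is truncated. Below is a self-contained alternative sketch of 5-fold cross-validation using scikit-learn's KFold; scikit-learn's Lasso is used here as a stand-in for the Lasso class built earlier, so the exact numbers will differ from the book's.

# A minimal sketch of 5-fold cross-validation (scikit-learn's Lasso as a stand-in).
import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

boston = load_boston()
X, y = boston.data, boston.target

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    model = Lasso(alpha=1.0)               # the penalty strength is arbitrary here
    model.fit(X[train_idx], y[train_idx])  # train on k-1 folds
    y_hat = model.predict(X[test_idx])     # evaluate on the held-out fold
    scores.append(np.sqrt(mean_squared_error(y[test_idx], y_hat)))
print(np.mean(scores))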
Run the code snippets; we have the 5-fold cross-validation accuracy as follows.
R Python
metric   formula
RMSE     \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}
MAE      \frac{1}{n}\sum_{i=1}^{n}|\hat{y}_i - y_i|
MAPE     \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|
R² does not really evaluate the predictive power of the model since its calculation is based on the
training data. But what we are actually interested in is the model performance
on the unseen data. In statistics, goodness of fit is a term to describe how well
a model fits the observations, and R² is one of these measures of goodness of
fit. In predictive modelling, we care more about the error of the model on the
unseen data, which is called the generalization error. But of course, it is possible
to calculate the counterpart of R² on testing data.
In classification, we may label an instance with the class whose predicted probability is the largest.
But actually we don't always care about the labels of an instance. For example,
a classification model for mortgage default built in a bank may only be used to
produce default probabilities; another example is a recommendation
system that predicts the probabilities which are used for ranking of items. In
that case, the model performance could be evaluated by logloss, AUC, etc., using
the output probabilities directly. We have seen in Chapter 5 the loss function
of logistic regression, i.e., the logloss, which is calculated from the predicted probabilities and the observed
data, and thus it can also be used for classification models with more than 2 classes9.
In practice, AUC (Area Under the ROC Curve) is a very popular evaluation
metric for binary classification problems. AUC is bounded between 0 and
1. A perfect model leads to an AUC equal to 1, while a model whose predictions are
purely random is expected to have an AUC of 0.5.
9 https://en.wikipedia.org/wiki/Categorical_distribution
R Python
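The listings computing these metrics are not reproduced above. A minimal sketch of computing logloss and AUC from predicted probabilities with scikit-learn is shown below; the labels and predictions are made up for illustration.

# A minimal sketch of computing logloss and AUC from predicted probabilities;
# the labels and predictions below are made up.
from sklearn.metrics import log_loss, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.6, 0.9, 0.3]   # predicted probabilities of class 1

print(log_loss(y_true, y_prob))
print(roc_auc_score(y_true, y_prob))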
As of today, machine learning algorithms are not that intelligent, so it is
worth trying feature engineering, especially when domain knowledge is available.
If you are familiar with dimension reduction, embedding can be considered as
something similar. Dimension reduction aims at reducing the dimension of X. It
sounds interesting and promising if we can transform the high-dimensional dataset
into a low-dimensional dataset and feed the dataset in a low dimension space to the
machine learning model. However, this may not be a good idea in general. For example,
in supervised learning it is not guaranteed that the low-dimensional predictors
will keep all the information related to the response variable. Actually, many machine
learning models are capable of handling the high-dimensional predictors directly.
11 https://en.wikipedia.org/wiki/Feature_engineering
12 https://en.wikipedia.org/wiki/Kolmogorov-Arnold_representation_theorem
Embedding transforms the features into a new space, which usually has a lower
dimension. But generally embedding is not done by the traditional dimension
reduction techniques (for example, principal component analysis). In natural language
processing, a word can be embedded into a vector space by word2vec [12] (or other
techniques). When an instance is associated with an image, we may consider using
the penultimate layer's output from a (deep) neural network-based image classification
model to encode/embed the image into a space with lower dimension. It is also
popular to use embedding layers for embedding.
6.1.6 Collinearity
Collinearity is one of the clichés in machine learning. For non-linear models, collinearity
is usually not a problem. For linear models, I recommend reading this discussion13
to see when it is not a problem.
14 https://en.wikipedia.org/wiki/Automated_machine_learning
[Figure: an example binary decision tree - node 0 is the root, nodes 1 and 2 are its children, and nodes such as 7 and 8 are leaves.]
The number of levels from the root node to the deepest leaf node is called the depth of the tree. For example, the depth of the tree above is 4.
Each leaf node has a label. In regression tasks, the label is a real number, and in
classification tasks the label could be a real number which is used to get the class
indirectly (for example, fed into a sigmoid function to get the probability), or an
integer representing the predicted class directly.
Each node except the leaves in a decision tree is associated with a splitting rule.
These splitting rules determine to which leaf an instance belongs. A rule is just a
function that takes a feature as input and returns true or false as output. For example, a
rule on the root could be x1 < 0; if it is true, we go to the left node, otherwise we
go to the right node. Once we arrive at a leaf, we can get the predicted value based
on the label of the leaf.
To get a closer look, let’s try to implement a binary tree structure for regression
tasks in R/Python from scratch.
Let's implement the binary tree as a recursive data structure, which is partially composed
of smaller instances of the same data structure. More specifically, a binary
tree can be decomposed into three components, i.e., its root node, the left subtree
under the root, and the right subtree of the root. To define a binary (decision) tree, we
only need to define these three components. And to define the left and right subtrees,
this decomposition is applied recursively until the leaves.
Now we have the big picture on how to define a binary tree. However, to make the
binary tree a decision tree, we also need to define the splitting rules. For simplicity,
we assume there is no missing value in our data and all variables are numeric. Then
a splitting rule of a node is composed of two components, i.e., the variable to split
on, and the corresponding breakpoint for splitting.
There is one more component we need to define in a decision tree; that is, the
prediction method which takes an instance as input and returns the prediction.
Now we are ready to define our binary decision tree.
R chapter6/tree.R
1 library (R6)
2 Tree = R6Class (
3 "Tree",
4 public = list(
5 left = NULL ,
6 right = NULL ,
7 variable_id = NULL ,
8 break_point = NULL ,
9 val = NULL ,
10 initialize = function (left , right , variable_id , break_point ,
val) {
11 self$left = left
12 self$ right = right
13 self$ variable_id = variable_id
14 self$ break_point = break_point
15 self$val = val
16 },
17 is_leaf = function () {
18 is.null(self$left) && is.null(self$right)
19 },
20 depth = function () {
21 if (self$ is_leaf ()) {
22 1
23 } else if (is.null(self$left)) {
24 1 + self$right $depth ()
25 } else if (is.null(self$right)) {
26 1 + self$left$ depth ()
27 } else{
28 1 + max(self$left$depth (), self$right $depth ())
29 }
30 },
31 predict_single = function (x) {
32 # if x is a vector
33 if (self$ is_leaf ()) {
34 self$val
35 } else{
36 if (x[self$ variable_id ] < self$ break_point ) {
37 self$left$ predict_single (x)
38 } else{
39 self$right $ predict_single (x)
40 }
41 }
42 },
43 predict = function (x) {
44 # if x is an array
45 preds = rep (0.0 , nrow(x))
46 for (i in 1: nrow(x)) {
Python chapter6/tree.py
1 class Tree:
2     def __init__(self, left, right, variable_id, break_point, val):
3         self.left = left
4         self.right = right
5         self.variable_id = variable_id
6         self.break_point = break_point
7         self.val = val
9 @property
24 @property
25 def depth(self):
26 if self. is_leaf :
27 return 1
28 elif self.left is None:
29 return 1 + self. right . depth
30 elif self.right is None:
31 return 1 + self.left. depth
32 return 1 + max(self.left.depth , self.right.depth)
33
You may have noticed the usage of @property in our Python implementation. It
is one of the built-in decorators in Python. We won't talk much about decorators here.
Basically, adding this decorator makes the method depth behave like a property, in
the sense that we can call self.depth instead of self.depth() to get the depth.
In the R implementation, invisible(self) is returned in the print method,
which may seem strange. It is an issue of R6 classes due to the S3 dispatch mechanism, which
is not introduced in this book15.
The above implementation doesn’t involve the training or fitting of the decision
tree. In this book, we won’t talk about how to fit a traditional decision tree model
due to its limited usage in the context of modern machine learning. Let’s see how to
use the decision tree structures we defined above by creating a pseudo decision tree
illustrated below.
[Figure: a pseudo decision tree - the splitting rule of node 0 (the root) is "if x2 < 3.2 go left (to node 1), otherwise go right (to node 2)"; the splitting rule of node 1 is "if x1 < 10.5 go left, otherwise go right".]
R Python
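The usage listings are not reproduced above. Below is a sketch of how the pseudo tree in the figure might be constructed with the Python Tree class; the leaf values, the 0-based feature indices, and the module name tree are my own assumptions for illustration.

# A sketch of building the pseudo decision tree above with the Tree class;
# leaf values and 0-based feature indices are my own choices.
from tree import Tree   # assumes the Tree class above lives in tree.py

# leaves: their values are arbitrary here
leaf_3 = Tree(None, None, None, None, 1.5)
leaf_4 = Tree(None, None, None, None, -0.4)
leaf_2 = Tree(None, None, None, None, 0.7)

# node 1: if x1 < 10.5 go left, otherwise go right (feature index 0)
node_1 = Tree(leaf_3, leaf_4, 0, 10.5, None)
# node 0 (root): if x2 < 3.2 go left, otherwise go right (feature index 1)
root = Tree(node_1, leaf_2, 1, 3.2, None)

print(root.depth)   # 3: the root, node 1, and the deepest leaves
# an instance like [12.0, 1.0] would be routed to node 1 (since 1.0 < 3.2)
# and then to node 1's right leaf (since 12.0 >= 10.5)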
One advantage of fitting trees independently is that the fitting can be done in parallel. But accuracy-wise,
GBM usually performs better according to my limited experience.
We have seen the structure of a single decision tree in GBM. Now it's time to see
how these trees are fitted in GBM. Let's start from the first tree.
To grow a tree, we start from its root node. In GBM fitting, usually we predetermine
a maximum depth d for each tree to grow, and the final tree's depth may
be equal to or less than the maximum depth d. At a high level, the tree is grown in
a recursive fashion. Specifically, first we attempt to split the current node, and if the
splitting improves the performance we grow the left subtree and the right subtree
under the root node. When we grow the left subtree, its maximum depth is d − 1,
and the same applies to the right subtree. We can define a tree grow function for this
purpose which takes a root node Node_root (it is also a leaf initially) as input. The pseudocode
of the tree grow function is illustrated below.
• try to split Node_root with the split function
• if the split is not adopted: return (Node_root remains a leaf)
• if d > 1:
   • attach a left child and a right child to Node_root
   • call the tree grow function on each child with maximum depth d − 1
• return
where

\hat{y}_i = \sum_{t=1}^{K} f_t(x_i),   (6.3)

and f_t denotes the prediction of the tth tree in the forest. As we mentioned previously, the
fitting is done sequentially. When we fit the tth tree, all the previous t − 1 trees are
fixed. The loss function for fitting the tth tree is given below.
n t−1
(t)
L = (yi − fl (xi ) − ft (xi ))2 (6.4)
i=1 l=1
Let’s follow the paper [6] and use the number of leaves as well as the L2 penalty
of the values (also called weights) of the leaves for regularization. The loss function
then becomes
L^{(t)} = \sum_{i=1}^{n} \Big(y_i - \sum_{l=1}^{t-1} f_l(x_i) - f_t(x_i)\Big)^2 + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2,   (6.6)
where ωj is the value associated with the jth leaf of the current tree.
Again, we get an optimization problem, i.e., to minimize the loss function (6.6).
The minimization problem can also be viewed as a quadratic programming problem.
However, it seems different from the other optimization problems we have seen before,
in the sense that the decision tree ft is a non-parametric model. A model is non-
parametric if the model structure is learned from the data rather than pre-determined.
A common approach used in GBM is the second-order approximation. By second-
order approximation, the loss function becomes
L^{(t)} \approx \sum_{i=1}^{n} \Big(y_i - \sum_{l=1}^{t-1} f_l(x_i)\Big)^2 + \sum_{i=1}^{n} \Big(g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2\Big) + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2,   (6.7)

where g_i and h_i denote the first- and second-order derivatives of the squared-error loss of instance i with respect to its prediction from the previous t − 1 trees.
Python chapter6/util.py
1 import numpy as np
2
Since the first term in (6.7) is a constant when fitting the tth tree, the loss function to minimize reduces to

L^{(t)} \approx \sum_{i=1}^{n} \Big(g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2\Big) + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2.   (6.8)
Let's think of the prediction of an instance from the current tree. The training
instances fall under the leaves of the tree, and the prediction of an instance is the
value ω associated with the leaf that the instance belongs to. Based
on this fact, the loss function can be further rewritten as follows, where I_j denotes the set of training instances falling into the jth leaf:

L^{(t)} \approx \sum_{j=1}^{T} \Big( \omega_j \sum_{i\in I_j} g_i + \frac{1}{2}\omega_j^2 \big(\sum_{i\in I_j} h_i + \lambda\big) \Big) + \gamma T.   (6.9)
When the structure of the tree is fixed the loss function (6.9) is a quadratic convex
function of ωj , and the optimal solution can be obtained by setting the derivative to
zero.
\omega_j = -\frac{\sum_{i\in I_j} g_i}{\sum_{i\in I_j} h_i + \lambda}   (6.10)
Plugging (6.10) into the loss function results in the minimal loss of the current
tree structure
-\frac{1}{2}\sum_{j=1}^{T} \frac{\big(\sum_{i\in I_j} g_i\big)^2}{\sum_{i\in I_j} h_i + \lambda} + \gamma T.   (6.11)
Now let's go back to the split function required in the tree grow function discussed
previously. How do we determine whether a leaf should be split? (6.11) gives the solution
- we can calculate the loss reduction achieved by the split, which is given below:

\frac{1}{2}\Big( \frac{\big(\sum_{i\in I_{left}} g_i\big)^2}{\sum_{i\in I_{left}} h_i + \lambda} + \frac{\big(\sum_{i\in I_{right}} g_i\big)^2}{\sum_{i\in I_{right}} h_i + \lambda} - \frac{\big(\sum_{i\in I} g_i\big)^2}{\sum_{i\in I} h_i + \lambda} \Big) - \gamma.   (6.12)
If the loss reduction is positive, the split function returns true; otherwise it returns
false.
So far, we have a few ingredients ready to implement our own GBM, which are
listed below:
• the structure of a single decision tree in the forest, i.e., the Tree class defined;
• the tree growing mechanism, i.e., the pseudo algorithm with the leaf value
calculation (6.10).
However, there are a few additional items we need to go through before the
implementation.
In Chapter 5, we have seen how stochasticity works in iterative optimization algorithms.
The stochasticity technique is also very important in the optimization of GBM.
Python chapter6/grow.py
2 import numpy as np
3 import pdb
67 best_break = best_break_k
68 best_location = best_location_k
69 best_w_left = w_left_k
70 best_w_right = w_right_k
73
split_node function
81 ’’’
82 nrow = len(y)
83 if max_depth == 0:
84 return
w_right , _ = split_node (
88
89 if do_split :
96
w_left_scaled )
99 # initialize the right subtree
100 current_tree . right = Tree(None , None , None , None ,
w_right_scaled )
101 # update if an instance is in left or right
102 x_in_left_node = [False] * len(x_in_node)
Python chapter6/gbm.py
4 import numpy as np
7 class GBM:
56 i = 0
57 while i < max_tree :
58 # let ’s fit tree i
59 # instance−wise stochasticity
60 x_in_node = np.random.choice([True, False], self.n, p=[
61     self.sub_sample, 1 - self.sub_sample])
62 # feature−wise stochasticity
63 f_in_tree_ = np. random . choice (
64 range (self.m), self.nf , replace = False )
65 f_in_tree = np.array ([ False ] ∗ self.m)
66 for e in f_in_tree_ :
67 f_in_tree [e] = True
68 del f_in_tree_
69 # initialize the root of this tree
70 root = Tree(None , None , None , None , None)
71 # grow the tree from root
72 grow_tree(root, f_in_tree, x_in_node, self.depth - 1,
self. x_val_sorted ,
73 self. x_index_sorted , self.y_train , self.
g_tilde , self.h_tilde , self.eta , self.
lam , self.gamma , self. min_instances )
74 if root is not None:
75 i += 1
76 self. forest . append (root)
77 else:
78 next
79 for j in range (self.n):
80 self.y_tilde[j] += self.forest[-1]._predict_single(
81     self.x_train[j])
82 self. g_tilde [j], self. h_tilde [j] = gh_lm(
83 self. y_train [j], self. y_tilde [j])
84 if self. x_test is not None:
85 # test on the testing instances
86 y_hat = self. predict (self. x_test )
87 print("iter: {0: >4} rmse: {1:1.6 f}". format (
88 i, rmse(self.y_test , y_hat )))
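The helper module chapter6/util.py referenced above is not reproduced in full. Under the squared-error loss used here, its gradient/Hessian helper and rmse function might look like the following sketch; this is my reconstruction, not the book's exact code.

# A sketch of the helpers in chapter6/util.py (my reconstruction).
import numpy as np

def gh_lm(actual, predicted):
    # first- and second-order derivatives of (actual - predicted)^2
    # with respect to the prediction
    g = -2.0 * (actual - predicted)
    h = 2.0
    return g, h

def rmse(actual, predicted):
    return np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2))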
Now let’s see the performance of our GBM implemented from scratch.
Python chapter6/test_gbm.py
2 import numpy as np
3 from sklearn import datasets
4 from sklearn . utils import shuffle
5
Python
We have talked about cross-validation, but what is early stopping? It is a very useful technique in GBM. Usually,
the cross-validated loss decreases when we add new trees at the beginning, and at
a certain point the loss may increase when more trees are fitted (due to overfitting).
Thus, we may select the best number of trees based on the cross-validated loss.
Specifically, we stop the fitting process when the cross-validated loss doesn't decrease.
In practice, we don't want to stop the fitting immediately when the cross-validated
loss starts increasing. Instead, we specify a number of trees, e.g., 50, as a buffer, after
which the fitting process should stop if the cross-validated loss doesn't decrease.
Early stopping is also used in other machine learning models, for example, neural
networks. Ideally, we would like to have early stopping based on the cross-validated
loss. But when the training process is time-consuming, it's fine to use the loss on a
testing dataset16.
The commonly used GBM packages include XGBoost17, LightGBM18 and CatBoost19.
Let's see how to use XGBoost for the same regression task on the Boston dataset.
R chapter6/xgb.R
1 library ( xgboost )
2 library (MASS)
3 library ( Metrics )
4 set.seed (42)
replace = FALSE )
17 https://github.com/dmlc/xgboost
18 https://github.com/microsoft/LightGBM
19 https://github.com/catboost/catboost
38 cat(
39 "rmse on testing instances is",
40 rmse(y_test , pred),
41 "with",
42 hist$ best_iteration ,
43 " trees "
44 )
Python chapter6/xgb.py
5 import numpy as np
7 seed = 42
17
R Python
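The full listings are not reproduced above. A minimal sketch of fitting XGBoost with early stopping on the Boston data is shown below; the hyperparameters are arbitrary and not the book's exact settings.

# A minimal sketch of XGBoost with early stopping (hyperparameters are arbitrary).
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.3, random_state=42)

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {"objective": "reg:squarederror", "max_depth": 3, "eta": 0.1}
# stop fitting if the test rmse has not improved for 50 rounds
booster = xgb.train(params, dtrain, num_boost_round=1000,
                    evals=[(dtest, "test")], early_stopping_rounds=50,
                    verbose_eval=False)
print(booster.best_iteration)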
In XGBoost, we could also use linear regression models as the booster (or base
learner) instead of decision trees. However, when 'booster':'gblinear' is used, the
sum of the predictions from all boosters in the model is equivalent to the prediction
from a single (combined) linear model. In that sense, what we get is just a Lasso
solution of a linear regression model.
GBM can be used for different tasks, such as classification, ranking, survival analysis,
etc. When we use GBM for predictive modelling, missing value imputation is not
required, which is one big advantage over linear models. But in our own implementation
we don't consider missing values, for simplicity. In GBM, if a feature is categorical
we could do label-encoding20, i.e., map the feature to integers directly without
creating dummy variables (such as one-hot encoding). Of course one-hot encoding21
can also be used. But when there are too many new columns created by one-hot
encoding, the probability that the original categorical feature (via one of its dummy columns) is selected becomes higher
than that of the numerical variables. In other words, we are assigning a prior weight to the
categorical feature regarding the feature-wise stochasticity.
For quite a few real-world prediction problems, monotonic constraints are
desired. Monotonic constraints are either increasing or decreasing. The increasing
constraint for feature x_k refers to the relationship f(x_1, ..., x_k, ..., x_m) ≤
f(x_1, ..., x'_k, ..., x_m) if x_k ≤ x'_k. For example, an increasing constraint for the number
of bedrooms in a house price prediction model makes lots of sense. Using gradient
boosting tree regression models we can enforce such monotonic constraints in a
straightforward manner. Simply, after we get the best split for the current node, we
may check if the monotonic constraint is violated by the split. The split won't be
adopted if the constraint is broken.
the transformation, and we have to provide guidance for such a transformation. The
key to our guidance is to make the data points projected onto the first transformed
20 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
21 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
coordinate have the largest variance, and w_1'x_i is called the first principal component.
And we can find the remaining transformations and principal components iteratively.
So now we see how PCA is formulated as an optimization problem. However,
under the above setting, there are infinitely many solutions for w_k. We usually add a
constraint on w_k in PCA, i.e., |w_k| = 1. The solution to this optimization problem is
surprisingly elegant - the optimal w_k are the eigenvectors of the covariance matrix of
x. Now let's try to conduct a PCA with eigen decomposition (it can also be done
with other decompositions).
2 > n = 1000
4 > x1 = rexp(n)
8 > sum(diag(cov(x)))
9 [1] 15.54203
In fact, it does not matter whether the raw data points are centered or not if the eigen
decomposition is performed on the covariance matrix. If you prefer to decompose x'x directly,
centering is a necessary step. Many times, we don't want to perform PCA in this way
since there are a lot of functions/packages available in both R and Python.
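A Python counterpart of the eigen-decomposition approach sketched in the R session above is given below; it is my own sketch, and the simulated data are arbitrary.

# A Python sketch of PCA via eigen decomposition of the covariance matrix.
import numpy as np

rng = np.random.default_rng(42)
x = np.column_stack((rng.exponential(size=1000),
                     rng.normal(size=1000),
                     rng.uniform(size=1000)))

cov = np.cov(x, rowvar=False)                 # covariance matrix of the columns
eig_values, eig_vectors = np.linalg.eigh(cov)
# sort the components by decreasing variance
order = np.argsort(eig_values)[::-1]
eig_values, eig_vectors = eig_values[order], eig_vectors[:, order]

scores = (x - x.mean(axis=0)) @ eig_vectors   # the principal components
print(eig_values)                             # variance explained by each component
print(eig_values.sum(), np.trace(cov))        # the total variance is preserved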
where πk represents the probability that a randomly selected data point belongs to
distribution k. And thus, the log-likelihood function becomes:
L = \sum_{i=1}^{n} \log\Big(\sum_{k=1}^{K} \pi_k f(x_i|\mu_k, \Sigma_k)\Big).   (6.14)
Let’s try to implement the above idea in R with the optim function.
3 > n = 1000
4 > # 60% samples are from distribution 1 and the remainings are
from distribution 2
5 > p = 0.6
6 > n1 = n * p
7 > n2 = n - n1
The estimates of the parameters are not satisfactory - why? Remember we have emphasized
the importance of convexity in Chapter 5. Actually, the log-likelihood function
given in (6.14) is not convex. For non-convex problems, the optim function may not
converge. In practice, the EM algorithm22 is frequently applied for mixture models.
The theory of the EM algorithm is beyond the scope of this book. Let's have a look
at the implementation for this specific problem. Basically, there are two steps, i.e., the
E-step and the M-step, which run in an iterative fashion. In the tth E-step, we update
the membership probability w_{i,k}, which represents the probability that x_i belongs to
distribution k, based on the current parameter estimates as follows.
w_{i,k}^{(t)} = \frac{\alpha_k^{(t)} f(x_i|\mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{k'=1}^{K} \alpha_{k'}^{(t)} f(x_i|\mu_{k'}^{(t)}, \Sigma_{k'}^{(t)})}.   (6.15)
And in the tth M-step, for each k we update the parameters as follows.
\alpha_k^{(t+1)} = \frac{\sum_{i=1}^{n} w_{i,k}^{(t)}}{n}

\mu_k^{(t+1)} = \frac{\sum_{i=1}^{n} w_{i,k}^{(t)} x_i}{\sum_{i=1}^{n} w_{i,k}^{(t)}}   (6.16)

\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^{n} w_{i,k}^{(t)} (x_i - \mu_k^{(t)})(x_i - \mu_k^{(t)})'}{\sum_{i=1}^{n} w_{i,k}^{(t)}}
The Python code below implements the above EM update schedule for a Gaussian
mixture model.
Python chapter6/gmm.py
1 import numpy as np
2
3 class GaussianMixture :
22 https://en.wikipedia.org/wiki/Expectation-maximization_algorithm
4 """
5     X - n * m array
6     K - the number of distributions/clusters
7     seed - random seed for reproducibility
8 """
9
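The listing above is truncated. Below is a compact sketch of the EM updates in (6.15) and (6.16) for a Gaussian mixture; it is my reconstruction for illustration, not the book's chapter6/gmm.py code, and it uses the updated means when re-estimating the covariance matrices.

# A compact sketch of EM for a Gaussian mixture (my reconstruction).
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm(X, K, max_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    alpha = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, K, replace=False)]            # initialize means from the data
    sigma = np.array([np.cov(X, rowvar=False) for _ in range(K)])
    for _ in range(max_iter):
        # E-step: membership probabilities, equation (6.15)
        dens = np.column_stack([alpha[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                                for k in range(K)])
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update alpha, mu and sigma, equation (6.16)
        nk = w.sum(axis=0)
        alpha = nk / n
        mu = (w.T @ X) / nk[:, None]
        for k in range(K):
            d = X - mu[k]
            sigma[k] = (w[:, k, None] * d).T @ d / nk[k]
        # (a convergence check on the log-likelihood could be added here)
    return alpha, mu, sigma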
17 [[2]]
18 [,1] [,2]
19 [1,]  0.9821523 -0.3748227
20 [2,] -0.3748227  1.0194300
21
23 z = as.integer(
24   fit$lambda[1] * dmvnorm(x_permuted, fit$mu[[1]], fit$sigma[[1]]) >
25     fit$lambda[2] * dmvnorm(x_permuted, fit$mu[[2]], fit$sigma[[2]])
26 ) + 1
In the above code, z_i represents the estimated cluster membership of x_i, i.e., the
component with the larger weighted density. We plot the data
points with their clusters in Figure 6.5.
6.3.3 Clustering
Clustering is the task of grouping similar objects together. Actually in the example
above we have used the Gaussian mixture model for clustering. There are many
clustering algorithms, and one of the most famous is
K-means23; a minimal usage sketch is given below.
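The sketch below clusters some simulated data with scikit-learn's K-means; the data and the number of clusters are arbitrary choices of mine.

# A minimal sketch of K-means clustering with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack((rng.normal(loc=0.0, size=(100, 2)),
               rng.normal(loc=5.0, size=(100, 2))))

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)
print(kmeans.labels_[:10])       # cluster assignment of the first 10 points
print(kmeans.cluster_centers_)   # the estimated cluster centers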
24 https://deepmind.com/research/case-studies/alphago-the-story-so-far
25 https://openai.com/projects/five
There are many resources for learning RL (for example, see [15]), and it is impossible to give an in-depth
introduction to RL in this short section. But we could learn from a specific
example to get a first impression.
The game seems very simple and the best strategy is to choose number 1 in
each step. This problem cannot be solved by supervised or unsupervised learning
techniques. And one obvious distinction in this problem is that we don’t have any
training data to learn from. This problem falls into the agent-environment interaction
paradigm.
[Figure: the agent-environment interaction loop - the agent takes an action on the environment, and the environment returns a new state and a reward to the agent.]
In step t, the agent takes an action, and the action a^{(t)} in turn has an effect
on the environment. As a result, the state of the environment goes from s^{(t)} to s^{(t+1)}
and a reward r^{(t+1)} is returned to the agent. Usually, the goal in RL is to pick actions so
that the cumulative reward \sum_{t=0}^{m} r^{(t+1)} is maximized. When m → ∞, the cumulative
reward may not be bounded. In that case, the future reward can be discounted by
λ ∈ [0, 1] and we want to maximize \sum_{t=0}^{\infty} \lambda^t r^{(t+1)}. In the following step, the agent
picks the next action a^{(t+1)} based on the information collected. In general,
the environment state and reward at step t are random variables. If the probability
distribution of the next state and reward only depends on the current environment
state and the action picked, the Markov property holds, which results in a Markov
decision process (MDP). A majority of RL studies focus on MDPs.
With the advance of deep learning, deep learning-based RL has made significant
progress recently. Many DRL algorithms have been developed, such as the deep
Q-network (DQN) algorithm, which we will discuss in more detail in the next section.
As we mentioned, usually there are no data to learn from in RL. Both supervised and
unsupervised learning need data to perform optimization on some objective function.
So does RL, but in a different way. RL can also be viewed as a simulation-based
optimization approach, because RL runs simulations for the agent-environment
interaction.
There are many RL algorithms available online. In the next section we will have
a look at the DQN algorithm.
27 https://stable-baselines.readthedocs.io/en/master
When the current environment time is greater than the time horizon, the game is finished. We
could define the environment class in the following code snippet.
Python chapter6/game_env.py
1 import numpy as np
5 """
7 """
9 self.H = H
16 def reset(self):
17 self. current_time = 0.0
18 return np. array ([ self. current_time ])
19
And the code below shows how to use a DQN agent to play the game with the
help of the stable_baselines package.
Python chapter6/game_env_run.py
1 import numpy as np
2 from game_env import GameEnv
3 from stable_baselines import DQN
4 from stable_baselines . common . vec_env import DummyVecEnv
Python
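The listing above is truncated after the imports. Below is a sketch of how the environment and the DQN agent might be wired together with stable_baselines; it is my reconstruction, the constructor argument H and the hyperparameters are assumptions, and it presumes GameEnv follows the gym interface.

# A sketch of wiring GameEnv and a DQN agent together (my reconstruction;
# the GameEnv constructor argument and hyperparameters are assumptions).
from game_env import GameEnv
from stable_baselines import DQN
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: GameEnv(H=10)])   # assumes GameEnv takes the horizon H
model = DQN("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=20000)           # train the agent by simulation

obs = env.reset()
action, _ = model.predict(obs)               # the learned action for the current state
print(action)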
In the game we discussed above, the environment has no stochasticity
and the optimal policy does not depend on the environment's state at all, as it is
a simplified example to illustrate the usage of the tools. Real-world problems
are usually much more complicated than this example, but we could follow the same
procedure, i.e., define the environment for the problem and then build the agent for
learning.
R Python
It’s worth noting there are many other symbolic operations we could do in
R/Python, for example, symbolic integration.
Automatic differentiation is extremely important in modern machine learning.
Many deep learning models can be viewed as function compositions. Let’s take a
two-hidden layer neural network illustrated below as an example. Let x denote the
28 https://www.tensorflow.org/
29 https://pytorch.org/
30 https://www.cs.utexas.edu/users/novak/asg-symdif.html
input and f_u, g_v denote the first and second hidden layers, respectively. The output
y is written as y = g_v(f_u(x)). The learning task is to estimate the parameters u, v
in these hidden layers (see Figure 6.6). If we use the gradient descent approach for
the parameter estimation, the gradients for both u and v need to be
evaluated. However, there might be overlapping operations if we simply perform two
numerical differentiation operations independently. By utilizing the automatic differentiation
technique, it is possible to reduce the computational complexity.
Let’s see how we could utilize the automatic differentiation to simplify our linear
regression implementation.
Python chapter6/linear_regression_ad.py
4 class LR_AD:
5 """
6 linear regression using automatic differentiation from
autograd
7 """
8 def __init__ (self , x, y, learning_rate =0.005 , seed =42):
9 self.seed = seed
10 self. learning_rate = learning_rate
11 self.x = np. hstack ((np.ones ((x.shape [0], 1)), x))
12 self.y = y
13 self.coef = None
14
15 def loss(self):
16 def loss_with_coef (coef):
17 y_hat = self.x @ coef
18 err = self.y - y_hat
19 return err @ err.T / self.x.shape [0]
20 # return the loss function with respect to the coef
21 return loss_with_coef
22
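The listing above is truncated before the fitting code. Below is a compact, self-contained sketch of the same idea - fitting linear regression by gradient descent where the gradient comes from autograd's automatic differentiation; it is my reconstruction, not the book's remaining code, and the learning rate and iteration count are arbitrary.

# A compact sketch of linear regression fitted with autograd (my reconstruction).
import autograd.numpy as np
from autograd import grad

def fit_lr_ad(x, y, learning_rate=0.005, max_iter=2000):
    x = np.hstack((np.ones((x.shape[0], 1)), x))   # add the intercept column

    def loss(coef):
        err = y - np.dot(x, coef)
        return np.dot(err, err) / x.shape[0]       # mean squared error

    grad_fn = grad(loss)                           # gradient via automatic differentiation
    coef = np.zeros(x.shape[1])
    for _ in range(max_iter):
        coef -= learning_rate * grad_fn(coef)      # plain gradient descent update
    return coef

np.random.seed(42)
X = np.random.normal(size=(500, 2))
y = 1.0 + np.dot(X, np.array([2.0, -3.0])) + np.random.normal(size=500)
print(fit_lr_ad(X, y))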
6.7 MISCELLANEOUS
In this chapter, we have talked about machine learning in a somewhat superficial manner. We
have seen implementations of machine learning algorithms from scratch throughout
this book, which is mainly for teaching and illustration purposes. When we work
on real-world data science projects, we usually prefer to use robust off-the-shelf
machine learning libraries, unless there is a need to customize existing models or
even to invent new models. In both R and Python, there are many such tools ready
to use. In particular, the CRAN task view32 provides an up-to-date review of machine
learning tools in R.
Many popular machine learning tools have APIs available in both languages. There
also exist tools for interoperability between Python and R. These tools allow
us to use R in Python and vice versa, for example, the reticulate33 library in R.
It is worth noting that machine learning is not only about learning theories;
it is also an engineering problem.
31 https://en.wikipedia.org/wiki/Closure_(computer_programming)
32 https://cran.r-project.org/web/views/MachineLearning.html
33 https://rstudio.github.io/reticulate
Bibliography
[14] Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and Trends
in Optimization, 1(3):127–239, 2014.
[16] Peter JM Van Laarhoven and Emile HL Aarts. Simulated annealing. In Simulated
annealing: Theory and applications, pages 7–15. Springer, 1987.
Index
t test, 118
Gaussian mixture model, 190
anonymous function, 66
gradient descent, 134
Array, 15
group by, 91
AUC, 166
benchmark, 47
indexing, 76
binary search, 41
internal rate of return, 142
bootstrap, 114
broadcasting, 17
join, 93
browser(), 40
CDF, 99
linear programming, 149
closure, 202
logistic regression, 146
clustering, 195
logloss, 166
collinearity, 169
MAE, 165
map, 66
control flow, 6
MAPE, 165
convexity, 133
member accessibility, 34
copula, 106
cross-validation, 163
Cython, 64
mutability of object, 22
data.frame, 20
NA, 10
data.table, 74
NaN, 10
debug, 40
None, 11
NULL, 10
EM algorithm, 192
numpy, 15
evaluation strategy, 61
object-oriented programming, 30
functional programming, 65
pandas, 74
functions, 4
parallelism, 57
PCA, 188
pdb, 40
PDF, 99
QR decomposition, 123
quantile, 100
Rcpp, 63
reduce, 68
RMSE, 165
root-finding, 141
scope of variable, 25
script, 37
slicing, 12
SQL, 71
stochasticity, 153
timeit, 49
type, 3
variable, 3, 21
vector, 9
vectorization, 49
XGBoost, 185