
Contents

1 Introduction to R/Python Programming 7

2 More on R/Python Programming 33

3 Linear Regression 57
1 Introduction to R/Python Programming

There has been considerable debate over choosing R vs. Python for Data Science. I started to learn Python when I was an undergraduate in 2006; at that time I had never heard of Data Science. Five years later I read an R script for the first time. In my opinion, both R and Python are great languages and are worth learning; so why not learn them together?
In this chapter, I give an introduction to general R and Python programming, in a parallel fashion.

1.1 Calculator

R and Python are general-purpose programming languages that can be used for writing software in a variety of domains. But for now, let us start by using them as basic calculators. The first thing is to have them installed. R 1 and Python 2 can be downloaded from their official websites. In this book, I use R 3.5 and Python 3.7 throughout.
To use R/Python as basic calculators, let's get familiar with the interactive mode. After the installation, we can type R or python in a terminal to invoke the interactive mode. Since Python 2 is installed by default on many machines, in order to avoid invoking Python 2 we type python3.7 instead.

~ $ R

R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
1 https://www.r-project.org
2 https://www.python.org
8 another book on data science

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

Python

~ $ python3.7
Python 3.7.1 (default, Nov  6 2018, 18:45:35)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

The messages displayed when invoking the interactive mode depend on both the version of R/Python installed and the machine; thus, you may see different messages on your local machine. As the messages say, to quit R we can type q(). The user is then prompted with three options, asking whether the workspace should be saved. Since we just want to use R as a basic calculator, we quit without saving the workspace.
To quit Python, we can simply type exit().

> q()
Save workspace image? [y/n/c]: n
~ $

Once we are inside the interactive mode, we can use R/Python as a calculator.

> 1+1
[1] 2
> 2*3+5
[1] 11
> log(2)
[1] 0.6931472
> exp(0)
[1] 1

Python

>>> 1+1
2
>>> 2*3+5
11
>>> log(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'log' is not defined
>>> exp(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'exp' is not defined

From the code snippet above, R works as a calculator perfectly. However, errors are raised when we call log(2) and exp(0) in Python. The error messages are self-explanatory: the log and exp functions don't exist in the current Python environment. In fact, log and exp are defined in the math module in Python. A module 3 is a file consisting of Python code. When we invoke the interactive mode of Python, a few built-in modules are loaded into the current environment by default, but the math module is not among them. That explains why we get the NameError when we try to use the functions defined in the math module. To resolve the issue, we first load the functions with the import statement as follows.

Python

>>> from math import log, exp
>>> log(2)
0.6931471805599453
>>> exp(0)
1.0
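Alternatively, we can import the math module itself and call its functions through the module name, which keeps the global namespace clean; a minimal sketch using only the standard library:

```python
import math

# functions are accessed via the module namespace
print(math.log(2))     # natural logarithm
print(math.exp(0))
print(math.sqrt(16))   # many other functions live in the same module
```

Both styles are common; the namespaced form avoids clashes when two modules define functions with the same name.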

1.2 Variable and Type

In the previous section we have seen how to use R/Python as calculators. Now, let’s see how to write real
programs. First, let’s define some variables.

3 https://docs.python.org/3/tutorial/modules.html

R

> a=2
> b=5.0
> x='hello world'
> a
[1] 2
> b
[1] 5
> x
[1] "hello world"
> e=a*2+b
> e
[1] 9

Python

>>> a=2
>>> b=5.0
>>> x='hello world'
>>> a
2
>>> b
5.0
>>> x
'hello world'
>>> e=a*2+b
>>> e
9.0

Here, we defined 4 different variables a, b, x, e. To get the type of each variable, we can utilize the
function typeof() in R and type() in Python, respectively.

R

> typeof(x)
[1] "character"
> typeof(e)
[1] "double"

Python

>>> type(x)
<class 'str'>
>>> type(e)
<class 'float'>

The type of x in R is called character, and in Python is called str.
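Both languages are dynamically typed: a name can be rebound to a value of a different type at any time, and the type travels with the value, not the variable. A small Python sketch:

```python
a = 2
print(type(a))   # <class 'int'>
a = 2.0          # rebind the same name to a float
print(type(a))   # <class 'float'>
a = 'two'        # ...and then to a string
print(type(a))   # <class 'str'>
```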

1.3 Functions

We have seen two functions log and exp when we use R/Python as calculators. A function is a block of code
which performs a specific task. A major purpose of wrapping a block of code into a function is to reuse the
code.
It is simple to define functions in R/Python.

R

> fun1=function(x){return(x*x)}
> fun1
function(x){return(x*x)}
> fun1(2)
[1] 4

Python

>>> def fun1(x):
...     return x*x  # note the indentation
...
>>> fun1(2)
4

Here, we defined a function fun1 in R/Python. This function takes x as input and returns the square of
x. When we call a function, we simply type the function name followed by the input argument inside a pair
of parentheses. It is worth noting that input or output are not required to define a function. For example,
we can define a function fun2 to print Hello World! without input and output.
One major difference between R and Python code is that Python code is structured by indentation. Each logical line of R/Python code belongs to a certain group. In R, we use {} to determine the grouping of statements, whereas in Python the leading whitespace (spaces and tabs) at the beginning of a logical line determines the indentation level of the line, which in turn determines the statements' grouping. Let's see what happens if we remove the leading whitespace in the Python function above.

Python

>>> def fun1(x):
... return x*x  # note the indentation
  File "<stdin>", line 2
    return x*x  # note the indentation
         ^
IndentationError: expected an indented block

We got an IndentationError because of missing indentation.

R

> fun2=function(){print('Hello World!')}
> fun2()
[1] "Hello World!"

Python

>>> def fun2(): print('Hello World!')
...
>>> fun2()
Hello World!

Let's go back to fun1 and have a closer look at return. In Python, if we want to return something we have to use the keyword return explicitly. return is a function in R but not in Python, which is why no parentheses follow return in Python. In R, return is not required even when we need to return something from the function; instead, we can just put the value to return in the last line of the function. That being said, we can define fun1 as follows.

> fun1=function(x){x*x}
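In contrast, a Python function without an explicit return statement returns None; the last expression is evaluated but not returned implicitly. A minimal sketch:

```python
def square(x):
    return x * x      # explicit return gives the value back


def square_no_return(x):
    x * x             # evaluated, then discarded; the function returns None


print(square(2))            # 4
print(square_no_return(2))  # None
```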

1.4 Control flows

To implement complex logic in R/Python, we may need control flows.



If/else
Let’s define a function to return the absolute value of input.

R

> fun3=function(x){
+   if (x>=0){
+     return(x)}
+   else{
+     return(-x)}
+ }
> fun3(2.5)
[1] 2.5
> fun3(-2.5)
[1] 2.5

Python

>>> def fun3(x):
...     if x>=0:
...         return x
...     else:
...         return -x
...
>>> fun3(2.5)
2.5
>>> fun3(-2.5)
2.5

The code snippet above shows how to use if/else in R/Python. The subtle difference between R and Python is that the condition after if must be wrapped in parentheses in R, while parentheses are optional in Python.
We can also put an if after an else; in Python, we use the keyword elif as a shortcut for else if.

R

> fun4=function(x){
+   if (x==0){
+     print('zero')}
+   else if (x>0){
+     print('positive')}
+   else{
+     print('negative')}
+ }
> fun4(0)
[1] "zero"
> fun4(1)
[1] "positive"
> fun4(-1)
[1] "negative"

Python

>>> def fun4(x):
...     if x==0:
...         print('zero')
...     elif x>0:
...         print('positive')
...     else:
...         print('negative')
...
>>> fun4(0)
zero
>>> fun4(1)
positive
>>> fun4(-1)
negative

For loop
Similar to the usage of if in R, we also have to use parentheses after the keyword for in R; but in Python there should be no parentheses after for.

R

> for (i in 1:3){print(i)}
[1] 1
[1] 2
[1] 3

Python

>>> for i in range(1,4): print(i)
...
1
2
3

There is something more interesting than the for loop itself in the snippets above. In the R code, the
expression 1:3 creates a vector with elements 1,2 and 3. In the Python code, we use the range() function
for the first time. Let’s have a look at the type of them.

R

> typeof(1:3)
[1] "integer"

Python

>>> type(range(1,4))
<class 'range'>

The range() function returns a range object, which represents an immutable sequence of numbers. range() can take three arguments, i.e., range(start, stop, step), where start and step are both optional. It's critical to keep in mind that the stop argument, which defines the upper limit of the sequence, is exclusive; that is why, in order to loop through 1 to 3, we have to pass 4 as the stop argument to range(). The step argument specifies how much to increase from one number to the next. The default values of start and step are 0 and 1, respectively.
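A quick sketch of how start, stop and step interact, wrapping range() in list() to materialize the sequence:

```python
print(list(range(1, 4)))      # stop=4 is excluded -> [1, 2, 3]
print(list(range(4)))         # start defaults to 0 -> [0, 1, 2, 3]
print(list(range(1, 10, 3)))  # step=3 -> [1, 4, 7]
```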

While loop

R

> i=1
> while (i<=3){
+   print(i)
+   i=i+1
+ }
[1] 1
[1] 2
[1] 3

Python

>>> i=1
>>> while i<=3:
...     print(i)
...     i+=1
...
1
2
3

You may have noticed that in Python we can do i+=1 to add 1 to i, which is not feasible in R by default.
Both for loop and while loop can be nested.

Break/continue
break/continue helps if we want to exit the for/while loop early, or to skip a specific iteration. In R, the keyword for continue is called next, in contrast to continue in Python. The difference between break and continue is that calling break exits the innermost loop (when there are nested loops, only the innermost loop is affected), while calling continue just skips the rest of the current iteration and continues the loop if it is not finished.

R

> for (i in 1:3){
+   print(i)
+   if (i==1) break
+ }
[1] 1
> for (i in 1:3){
+   if (i==2){next}
+   print(i)
+ }
[1] 1
[1] 3

Python

>>> for i in range(1,4):
...     print(i)
...     if i==1: break
...
1
>>> for i in range(1,4):
...     if i==2: continue
...     print(i)
...
1
3

1.5 Some built-in data structures

In the previous sections, we haven't seen much difference between R and Python. However, regarding the built-in data structures, there are significant differences, as we will see in this section.

vector in R and list in Python


In R, we can use the function c() to create a vector; a vector is a sequence of elements with the same type. In Python, we can use [] to create a list, which is also a sequence of elements, but the elements in a list don't need to have the same type. To get the number of elements in a vector in R, we use the function length(); to get the number of elements in a list in Python, we use the function len().

R

> x=c(1,2,5,6)
> y=c('hello','world','!')
> x
[1] 1 2 5 6
> y
[1] "hello" "world" "!"
> length(x)
[1] 4
> z=c(1,'hello')
> z
[1] "1"     "hello"

Python

>>> x=[1,2,5,6]
>>> y=['hello','world','!']
>>> x
[1, 2, 5, 6]
>>> y
['hello', 'world', '!']
>>> len(x)
4
>>> z=[1,'hello']
>>> z
[1, 'hello']

In the code snippet above, the first element in the variable z in R is coerced from 1 (numeric) to "1"
(character) since the elements must have the same type.
To access a specific element of a vector or list, we can use []. In R, sequence types are indexed beginning with one; in contrast, sequence types in Python are indexed beginning with zero.

R

> x=c(1,2,5,6)
> x[1]
[1] 1

Python

>>> x=[1,2,5,6]
>>> x[1]
2
>>> x[0]
1

What if the index to access is out of boundary?

R

> x=c(1,2,5,6)
> x[-1]
[1] 2 5 6
> x[0]
numeric(0)
> x[length(x)+1]
[1] NA
> length(numeric(0))
[1] 0
> length(NA)
[1] 1

Python

>>> x=[1,2,5,6]
>>> x[-1]
6
>>> x[len(x)+1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
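A minimal Python sketch of these indexing behaviors — negative indices count from the end, while any index at or beyond the length raises an IndexError:

```python
x = [1, 2, 5, 6]
print(x[-1])       # 6: the last element
print(x[-2])       # 5: the second-last element
try:
    x[len(x)]      # indices >= len(x) are out of range
except IndexError as err:
    print('IndexError:', err)
```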

In Python, a negative index means indexing from the end of the list. Thus, x[-1] points to the last element and x[-2] points to the second-last element of the list. But R doesn't support indexing with negative numbers in the same way as Python. Specifically, in R x[-index] returns a new vector with x[index] excluded.
When we try to access an index out of boundary, Python throws an IndexError. The behavior of R when indexing out of boundary is more interesting. First, when we try to access x[0] in R we get numeric(0), whose length is 0; since its length is 0, numeric(0) can be interpreted as an empty numeric vector. When we try to access x[length(x)+1] we get NA. In R, there are also NaN and NULL.
NaN means "Not A Number", which can be verified by checking its type, "double"; 0/0 results in NaN in R. NA in R generally represents a missing value, and NULL represents a NULL (empty) object. To check whether a value is NA, NaN or NULL, we can use is.na(), is.nan() or is.null(), respectively.

R

> typeof(NA)
[1] "logical"
> typeof(NaN)
[1] "double"
> typeof(NULL)
[1] "NULL"
> is.na(NA)
[1] TRUE
> is.null(NULL)
[1] TRUE
> is.nan(NaN)
[1] TRUE

Python

>>> type(None)
<class 'NoneType'>
>>> None is None
True
>>> 1 == None
False

In Python, there is no built-in NA or NaN. The counterpart of NULL in Python is None. In Python, we can use the is keyword or == to check whether a value is None.
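A short sketch of the two checks; PEP 8 recommends the identity test is None, since == can be overridden by user-defined classes:

```python
x = None
y = 1
print(x is None)   # True: identity comparison, the recommended style
print(x == None)   # also True here, but == can be overridden by classes
print(y == None)   # False
print(y is None)   # False
```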
From the code snippet above, we also notice that boolean values are written as TRUE/FALSE in R, compared with True/False in Python. Although TRUE/FALSE can be abbreviated as T/F in R, I don't recommend using the abbreviation.
One interesting fact is that we can't add a NULL to a vector in R, but it is feasible to add a None to a list in Python.

R

> x=c(1,NA,NaN,NULL)
> x
[1]   1  NA NaN
> length(x)
[1] 3

Python

>>> x=[1, None]
>>> x
[1, None]
>>> len(x)
2

Besides accessing a specific element of a vector/list, we may also need to do slicing, i.e., to select a subset of the vector/list. There are two basic approaches to slicing:
• Integer-based

> x=c(1,2,3,4,5,6)
> x[2:4]
[1] 2 3 4
> x[c(1,2,5)]  # a vector of indices
[1] 1 2 5
> x[seq(1,5,2)]  # seq creates a vector to be used as indices
[1] 1 3 5

Python

>>> x=[1,2,3,4,5,6]
>>> x[1:4]  # x[start:end] - start is inclusive but end is exclusive
[2, 3, 4]
>>> x[0:5:2]  # x[start:end:step]
[1, 3, 5]

The code snippets above use the hash character # for comments in both R and Python; everything after # on the same line is treated as a comment (not executed). In the R code, we also used the function seq() to create a vector. When I see a function that I haven't seen before, I might either google it or use the built-in help mechanism. Specifically, in R use ? and in Python use help().

R

> ?seq

Python

>>> help(print)

• Condition-based
Condition-based slicing means selecting the subset of elements that satisfy certain conditions. In R, it is quite straightforward: use a boolean vector whose length is the same as the vector to slice.

> x=c(1,2,5,5,6,6)
> x[x %% 2==1]  # %% is the modulo operator in R; we select the odd elements
[1] 1 5 5
> x %% 2==1  # results in a boolean vector with the same length as x
[1]  TRUE FALSE  TRUE  TRUE FALSE FALSE

The condition-based slicing in Python is quite different from that in R. The prerequisite is list comprehension, which provides a concise way to create new lists in Python. For example, let's create a list of the squares of another list.

Python

>>> x=[1,2,5,5,6,6]
>>> [e**2 for e in x]  # ** is the exponent operator, i.e., x**y means x to the power of y
[1, 4, 25, 25, 36, 36]

We can also use an if statement with list comprehension to filter a list, which achieves list slicing.

Python

>>> x=[1,2,5,5,6,6]
>>> [e for e in x if e%2==1]  # % is the modulo operator in Python
[1, 5, 5]

It is also common to use if/else with list comprehension to achieve more complex operations. For example, given a list x, let's create a new list y such that the non-negative elements of x are squared and the negative elements are replaced by 0s.

Python

>>> x=[1,-1,0,2,5,-3]
>>> [e**2 if e>=0 else 0 for e in x]
[1, 0, 0, 4, 25, 0]

The example above shows the power of list comprehension. To filter with if, the if clause should be placed at the end, after the for clause; but to use if/else, the if/else expression should be placed before the for clause.
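Both placements can even be combined in a single comprehension; a sketch, where the trailing if filters and the leading if/else transforms:

```python
x = [1, -1, 0, 2, 5, -3]
# drop zeros (filtering if, after the for clause), then square
# non-negatives and zero-out negatives (if/else, before the for clause)
y = [e ** 2 if e >= 0 else 0 for e in x if e != 0]
print(y)   # [1, 0, 4, 25, 0]
```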
We can also modify the value of an element in a vector/list variable.

R

> x=c(1,2,3)
> x[1]=-1
> x
[1] -1  2  3

Python

>>> x=[1,2,3]
>>> x[0]=-1
>>> x
[-1, 2, 3]

Although the vector structure in R and the list structure in Python look similar in usage and purpose, the implementations of the two structures are essentially different. The list structure in Python is mutable: a mutable object can be changed after it is created, while an immutable object can't. However, the mutability of vector in R is a bit complicated: if we change the value of an element without changing the type of the element, the vector is mutated in place; but if we change an element to another type, the vector behaves as immutable and a copy is made. A variable itself is a reference or pointer to an object (usually stored in the machine's memory). To check the mutability of a variable, we can trace its memory address.
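On the Python side, this check can be scripted directly; a minimal sketch using id(), which returns an identifier tied to the object's identity (its memory address in CPython):

```python
x = [1, 2, 3]
before = id(x)
x[0] = 1.0       # replace an int with a float, in place
x.append('a')    # lists may even mix element types
after = id(x)
print(before == after)   # True: the list object itself is never copied
```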

R

> x=c(1:3)
> tracemem(x)  # print the memory address of x whenever the address changes
[1] "<0x7ff360c95c08>"
> x[1]=x[1]  # type not changed, i.e., from integer to integer
> tracemem(x)
[1] "<0x7ff360c95c08>"
> x[1]=1.0
tracemem[0x7ff360c95c08 -> 0x7ff3604692d8]:

Python

>>> x=list(range(1,1001))  # list() converts a range object to a list
>>> hex(id(x))  # print the memory address of x
'0x10592d908'
>>> x[0]=1.0  # from integer to float
>>> hex(id(x))
'0x10592d908'

From the code snippet above, in Python the memory address doesn't change after we change the value of the first element, because list in Python is mutable. In R, when we modify the value of x[1] without changing its type, the memory address of x doesn't change either (you would probably see different addresses on your machine). But when we change the value of x[1] from the integer type to a double type, the memory address changes.
It's worth noting that since R 3.5, arithmetic sequences created by 1:n, seq_along, and the like use compact internal representations via the ALTREP framework 4. Let's see the example below.

R

> x=1:3
> tracemem(x)
[1] "<0x7f828e84c110>"
> .Internal(inspect(x))
@7f828e84c110 13 INTSXP g0c0 [NAM(3),TR] 1 : 3 (compact)
> x[1]=2L
tracemem[0x7f828e84c110 -> 0x7f828fe49848]:

R

> x=c(1:3)
> tracemem(x)
[1] "<0x7f828fe498c8>"
> .Internal(inspect(x))
@7f828fe498c8 13 INTSXP g0c2 [NAM(1),TR] (len=3, tl=0) 1,2,3
> x[1]=2L
> tracemem(x)
[1] "<0x7f828fe498c8>"

4 https://cran.r-project.org/doc/manuals/r-devel/NEWS.html

Two or multiple vectors/lists can be concatenated easily.

R

> x=c(1,2)
> y=c(3,4)
> z=c(5,6,7,8)
> c(x,y,z)
[1] 1 2 3 4 5 6 7 8

Python

>>> x=[1,2]
>>> y=[3,4]
>>> z=[5,6,7,8]
>>> x+y+z
[1, 2, 3, 4, 5, 6, 7, 8]

As the list structure in Python is mutable, there are many things we can do with list.

Python

>>> x=[1,2,3]
>>> x.append(4)  # append a single value to the list x
>>> x
[1, 2, 3, 4]
>>> y=[5,6]
>>> x.extend(y)  # extend list y to x
>>> x
[1, 2, 3, 4, 5, 6]
>>> last=x.pop()  # pop the last element from x
>>> last
6
>>> x
[1, 2, 3, 4, 5]

Is there any immutable data structure in Python? Yes; for example, tuple is immutable. A tuple contains a sequence of elements, and element accessing and subset slicing of a tuple follow the same rules as list in Python.

Python

>>> x=(1,2,3,)  # use () to create a tuple in Python; it is better to always put a comma at the end
>>> type(x)
<class 'tuple'>
>>> len(x)
3
>>> x[0]
1
>>> x[0]=-1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment

I like the list structure in Python much more than the vector structure in R. list in Python has many more useful features, which can be found in the Python official documentation 5.
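For instance, a sketch of a few more commonly used list methods beyond append/extend/pop:

```python
x = [3, 1, 2]
x.insert(0, 9)     # insert 9 at position 0
print(x)           # [9, 3, 1, 2]
x.sort()           # sort in place
print(x)           # [1, 2, 3, 9]
print(x.index(3))  # position of the first 3 -> 2
x.remove(9)        # remove the first occurrence of 9
print(x)           # [1, 2, 3]
```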

Array
Array is one of the most important data structures in scientific programming. In R, there is also an object type "matrix", but in my own experience we can almost ignore its existence and use array instead. We could use list as an array in Python, but many linear algebra operations are not supported for the list type. Fortunately, the Python package numpy is available off the shelf.

> x=1:12
> array1=array(x,c(4,3))  # convert vector x to a 4 rows * 3 cols array
> array1
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> y=1:6
> array2=array(y,c(3,2))  # convert vector y to a 3 rows * 2 cols array
> array2
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> array3=array1 %*% array2  # %*% is the matrix multiplication operator
> array3
     [,1] [,2]
[1,]   38   83
[2,]   44   98
[3,]   50  113
[4,]   56  128
> dim(array3)  # get the dimension of array3
[1] 4 2

5 https://docs.python.org/3/tutorial/datastructures.html

Python

>>> import numpy as np  # we import the numpy module and alias it as np
>>> array1=np.reshape(list(range(1,13)),(4,3))  # convert a list to a 2d np.array
>>> array1
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
>>> type(array1)
<class 'numpy.ndarray'>
>>> array2=np.reshape(list(range(1,7)),(3,2))
>>> array2
array([[1, 2],
       [3, 4],
       [5, 6]])
>>> array3=np.dot(array1,array2)  # matrix multiplication using np.dot()
>>> array3
array([[ 22,  28],
       [ 49,  64],
       [ 76, 100],
       [103, 136]])
>>> array3.shape  # get the shape (dimension) of array3
(4, 2)

You may have noticed that the results of the R and Python code snippets are different. The reason is that in R the conversion from a vector to an array is column-major, while in numpy the reshape from a list to a 2D numpy.array is row-major by default. There are two ways to reshape a list to a 2D numpy.array by column.

Python

>>> array1=np.reshape(list(range(1,13)),(4,3),order='F')  # use order='F'
>>> array1
array([[ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11],
       [ 4,  8, 12]])
>>> array2=np.reshape(list(range(1,7)),(2,3)).T  # use .T to transpose an array
>>> array2
array([[1, 4],
       [2, 5],
       [3, 6]])
>>> np.dot(array1,array2)  # now we get the same result as in R
array([[ 38,  83],
       [ 44,  98],
       [ 50, 113],
       [ 56, 128]])

To learn more about numpy, the official website 6 has great documentation and tutorials.

list in R and dictionary in Python


Yes, in R there is also an object type called list. The major difference between a vector and a list in R is that a list can contain different types of elements. list in R supports integer-based accessing using [[]] (compared to [] for vector).

> x=list(1,'hello world!')
> x
[[1]]
[1] 1

[[2]]
[1] "hello world!"

> x[[1]]
[1] 1
> x[[2]]
[1] "hello world!"
> length(x)
[1] 2

The mutability of the list structure in R is similar to that of the vector structure; the difference is that when we change the type of an element in a list, the memory address doesn't change in general.

> x=list(c(1:3),'Hello World!')
> tracemem(x)
[1] "<0x7f828fe497c8>"
> x[[1]]=1.0
> x[[2]]=2.0
> x
[[1]]
[1] 1

[[2]]
[1] 2

> tracemem(x)
[1] "<0x7f828fe497c8>"

list in R can be named, supporting access by name via either [[]] or the $ operator. vector in R can also be named and accessed by name.

> x=c('a'=1,'b'=2)
> names(x)
[1] "a" "b"
> x['b']
b
2
> l=list('a'=1,'b'=2)
> l[['b']]
[1] 2
> l$b
[1] 2
> names(l)
[1] "a" "b"

However, elements in a list in Python can't be named as in R. If we need the access-by-name feature in Python, we can use the dictionary structure. If you have used Java before, you may consider a dictionary in Python as the counterpart of HashMap in Java. Essentially, a dictionary in Python is a collection of key:value pairs.

Python

>>> x={'a':1,'b':2}  # {key:value} pairs
>>> x
{'a': 1, 'b': 2}
>>> x['a']
1
>>> x['b']
2
>>> len(x)  # number of key:value pairs
2
>>> x.pop('a')  # remove the key 'a' and we get its value 1
1
>>> x
{'b': 2}

Unlike dictionary in Python, list in R doesn't support the pop() operation. Thus, in order to modify a list in R, a new one is created explicitly or implicitly.
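In the same spirit, a sketch of a few more dictionary operations in Python; get() returns a default instead of raising a KeyError for a missing key, and items() iterates over key:value pairs:

```python
x = {'a': 1, 'b': 2}
print(x.get('c', 0))    # 0: default value, no KeyError raised
x['c'] = 3              # insert a new key:value pair
print(sorted(x.keys())) # ['a', 'b', 'c']
print([(k, v) for k, v in x.items()])
```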

data.frame, data.table and pandas


data.frame is a built-in type in R for data manipulation. In Python, there is no such built-in data structure, since Python is a more general-purpose programming language; the solution for data.frame in Python is the pandas 7 module.
Before we dive into data.frame, you may be curious why we need it. In other words, why can't we just use vector, list, array/matrix and dictionary for all data manipulation tasks? I would say yes, we could: data.frame is not a must-have feature for most ETL (extract, transform and load) operations. But data.frame provides a very intuitive way to understand a structured data set. A data.frame is usually flat, with 2 dimensions: row and column. The row dimension goes across multiple observations, and the column dimension goes across multiple attributes/features. If you are familiar with relational databases, a data.frame can be viewed as a table.
Let's see an example of using data.frame to represent employees' information in a company.

R

> employee_df=data.frame(name=c("A","B","C"),department=c("Engineering","Operations","Sales"))
> employee_df
  name  department
1    A Engineering
2    B  Operations
3    C       Sales

Python

>>> import pandas as pd
>>> employee_df=pd.DataFrame({'name':['A','B','C'],'department':["Engineering","Operations","Sales"]})
>>> employee_df
  name   department
0    A  Engineering
1    B   Operations
2    C        Sales

There are quite a few ways to create a data.frame. The most common is to create a data.frame object from an array/matrix. We may also need to convert a numeric data.frame back to an array/matrix.

> x=array(rnorm(12),c(3,4))
> x
          [,1]      [,2]     [,3]      [,4]
[1,] 0.8101246 0.8594136 2.260810 0.5727590
[2,] 0.9175476 0.1345982 1.067628 0.7643533
[3,] 0.7865971 1.9046711 0.154928 0.6807527
> random_df=as.data.frame(x)
> random_df
         V1        V2       V3        V4
1 0.8101246 0.8594136 2.260810 0.5727590
2 0.9175476 0.1345982 1.067628 0.7643533
3 0.7865971 1.9046711 0.154928 0.6807527
> data.matrix(random_df)
            V1        V2       V3        V4
[1,] 0.8101246 0.8594136 2.260810 0.5727590
[2,] 0.9175476 0.1345982 1.067628 0.7643533
[3,] 0.7865971 1.9046711 0.154928 0.6807527

7 https://pandas.pydata.org/

Python

>>> import numpy as np
>>> import pandas as pd
>>> x=np.random.normal(size=(3,4))
>>> x
array([[ 0.54164878,  0.14285267,  0.39835535,  0.81522719],
       [ 0.01540508,  0.63556266,  0.16800583,  0.17594448],
       [ 1.21598262,  0.52860817,  0.61757696,  0.18445057]])
>>> random_df=pd.DataFrame(x)
>>> random_df
          0         1         2         3
0  0.541649  0.142853  0.398355  0.815227
1  0.015405  0.635563  0.168006  0.175944
2  1.215983  0.528608  0.617577  0.184451
>>> np.asarray(random_df)
array([[ 0.54164878,  0.14285267,  0.39835535,  0.81522719],
       [ 0.01540508,  0.63556266,  0.16800583,  0.17594448],
       [ 1.21598262,  0.52860817,  0.61757696,  0.18445057]])

In general, operations on an array/matrix are much faster than those on a data frame. In R, we may use the built-in function data.matrix to convert a data.frame to an array/matrix; in Python, we can use the function asarray in the numpy module.
Although data.frame is a built-in type in R, it is not very efficient for many operations. I would suggest using data.table 8 whenever possible. dplyr 9 is also a very popular R package for data manipulation. Many good resources are available online to learn data.table and pandas, so I will not cover the usage of these tools here.

Object-oriented programming (OOP) in R/Python


All the code we wrote above follows the procedural programming paradigm 10. We can also do functional programming (FP) and OOP in R/Python. In this section, let's focus on OOP in R/Python.
Class is the key concept in OOP. In R there are two commonly used built-in systems to define classes,
8 https://cran.r-project.org/web/packages/data.table/index.html
9 https://dplyr.tidyverse.org/
10 https://en.wikipedia.org/wiki/Comparison_of_programming_paradigms

i.e., S3 and S4. In addition, there is an external package R6 11, which defines R6 classes. S3 is a light-weight system, but its style is quite different from OOP in many other programming languages. The S4 system follows the principles of modern object-oriented programming much better than S3; however, the usage of S4 classes is quite tedious. I will skip S3/S4 and introduce R6, which is closer to the class in Python.
Let’s build a class in R/Python to represent complex numbers.

> library(R6)  # load the R6 package
>
> Complex = R6Class("Complex",
+   public = list(  # only elements declared in this list are accessible by the object of this class
+     real = NULL,
+     imag = NULL,
+     # the initialize function is called automatically when we create an object of the class
+     initialize = function(real,imag){
+       # call functions to change real and imag values
+       self$set_real(real)
+       self$set_imag(imag)
+     },
+     # define a function to change the real value
+     set_real = function(real){
+       self$real=real
+     },
+     # define a function to change the imag value
+     set_imag = function(imag){
+       self$imag=imag
+     },
+     # override the print function
+     print = function(){
+       cat(paste0(as.character(self$real),'+',as.character(self$imag),'j'),'\n')
+     }
+   )
+ )
> # let's create a complex number object based on the Complex class defined above, using the new function
> x = Complex$new(1,2)
> x
1+2j
> x$real  # the public attributes of x can be accessed with the $ operator
[1] 1
11 https://cran.r-project.org/web/packages/R6/index.html
28 another book on data science

Python

>>> class Complex:
...     # the __init__ function would be called automatically when we create an object of the class
...     def __init__(self, real, imag):
...         self.real = None
...         self.imag = None
...         self.set_real(real)
...         self.set_imag(imag)
...     # define a function to change the real value
...     def set_real(self, real):
...         self.real = real
...     # define a function to change the imag value
...     def set_imag(self, imag):
...         self.imag = imag
...     def __repr__(self):
...         return "{0}+{1}j".format(self.real, self.imag)
...
>>> x = Complex(1, 2)
>>> x
1+2j
>>> x.real # different from the $ operator in R, here we use . to access the attribute of an object
1

By overriding the print function in the R6 class, we can have the object printed in the format real+imagj. To achieve the same effect in Python, we override the method __repr__. In Python, the functions defined in classes are called methods; overriding a method means changing the implementation of a method provided by one of its ancestors. To understand the concept of ancestors in OOP, one needs to understand the concept of inheritance 12 .
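To make the idea of inheritance concrete, here is a minimal Python sketch (the class names Animal and Dog are made up for this illustration): a child class inherits the methods of its parent and may override some of them.

```python
class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        # generic implementation, inherited by all subclasses
        return "..."


class Dog(Animal):
    # Dog inherits __init__ from Animal and overrides speak
    def speak(self):
        return "woof"


a, d = Animal("generic"), Dog("rex")
print(a.speak())              # ...
print(d.speak())              # woof; the override is dispatched
print(isinstance(d, Animal))  # True: Dog is a subclass of Animal
```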
You may be curious about the double underscores surrounding the method names, such as __init__ and __repr__. These methods are well known as magic methods 13 . Magic methods can be very handy when used in suitable cases. For example, we can use the magic method __add__ to implement the + operator for the Complex class we defined above.
In the definition of the magic method __repr__ in the Python code, the format method of str object 14 is
used.
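As a small illustration of how that format call works, the numbered placeholders {0} and {1} are replaced by the positional arguments:

```python
# {0} and {1} refer to the first and second positional arguments
print("{0}+{1}j".format(1, 2))           # 1+2j
# placeholders may be reordered or repeated
print("{1}, {0}, {1}".format("a", "b"))  # b, a, b
```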

12 https://en.wikipedia.org/wiki/Inheritance_(object-oriented_programming)
13 https://rszalski.github.io/magicmethods/
14 https://docs.python.org/3.7/library/string.html

Python

>>> class Complex:
...     def __init__(self, real, imag):
...         self.real = None
...         self.imag = None
...         self.set_real(real)
...         self.set_imag(imag)
...     def set_real(self, real):
...         self.real = real
...     def set_imag(self, imag):
...         self.imag = imag
...     def __repr__(self):
...         return "{0}+{1}j".format(self.real, self.imag)
...     def __add__(self, another):
...         return Complex(self.real + another.real, self.imag + another.imag)
...
>>> x = Complex(1, 2)
>>> y = Complex(2, 4)
>>> x + y # + operator works now
3+6j

We can also implement the + operator for the Complex class in R, similar to what we have done in Python.

> `+.Complex` = function(x, y){
+   Complex$new(x$real + y$real, x$imag + y$imag)
+ }
> x = Complex$new(1, 2)
> y = Complex$new(2, 4)
> x + y
3+6j

The most interesting part of the code above is `+.Complex`. First, why do we use `` to quote the function name? Before getting into this question, let's have a look at Python 3's variable naming rules 15 .

Within the ASCII range (U+0001..U+007F), the valid characters for identifiers (also referred to as names) are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.

According to the rule, we can’t declare a variable with name 2x. Compared with Python, in R we can also
15 https://docs.python.org/3.3/reference/lexical_analysis.html

use . in variable names 16 . However, there is a workaround to use otherwise invalid variable names in R with the help of ``.

R

> 2x = 5
Error: unexpected symbol in "2x"
> .x = 3
> .x
[1] 3
> `+2x%` = 0
> `+2x%`
[1] 0

Python

>>> 2x = 5
  File "<stdin>", line 1
    2x = 5
     ^
SyntaxError: invalid syntax
>>> .x = 3
  File "<stdin>", line 1
    .x = 3
    ^
SyntaxError: invalid syntax

Now it is clear that the usage of `` in `+.Complex` is to define a function with an otherwise invalid name. Placing .Complex after + is related to S3 method dispatch, which is not discussed here.

1.6 Miscellaneous

There are some items that I haven’t discussed so far, which are also important in order to master R/Python.

Package/module installation
• Use install.packages() function in R
• Use R IDE to install packages
• Use pip 17 to install modules in Python

Virtual Environment
A virtual environment is a tool to manage dependencies in Python. There are different ways to create virtual environments in Python, but I suggest using the venv module shipped with Python 3. Unfortunately, there is nothing quite like a real virtual environment in R as far as I know, although there are quite a few package management tools/packages.

<- vs. =
If you have used R before, you have probably heard the advice 18 to use <- rather than = for value assignment. I always use = for value assignment. Let's see an example where <- makes a difference when we do value assignment.

16 https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Identifiers
17 https://packaging.python.org/tutorials/installing-packages/
18 https://google.github.io/styleguide/Rguide.xml

R (using <-)

> x=1
> a=list(x <- 2)
> a
[[1]]
[1] 2

> x
[1] 2

R (using =)

> x=1
> a=list(x = 2)
> a
$x
[1] 2

> x
[1] 1
2
More on R/Python Programming

We learned quite a bit of basic R/Python programming in the previous chapter. I hope this chapter can be used as an intermediate-level R/Python programming tutorial. There are a few topics to cover, including debugging, vectorization, and some other useful features of R/Python.

2.1 Write/run R/Python scripts

In chapter 1 we were coding within the interactive mode of R/Python. When working on real-world projects, using an integrated development environment (IDE) is a more pragmatic choice. There are not many choices of R IDEs; among them, RStudio 19 is the best one I have used so far. As for Python, I would recommend either Visual Studio Code 20 or PyCharm 21 . But of course, you could use any text editor to write R/Python scripts.
Let's write a simple script to print Hello World! in R/Python. I have made a directory chapter2 on my disk; inside it, the R script is saved as hello_world.R and the Python script as hello_world.py.

R

chapter2/hello_world.R

print("Hello World!")

Python

chapter2/hello_world.py

print("Hello World!")

There are a few ways to run an R script. For example, we can run the script from the console with the r -f filename command. Also, we can open an R interactive session and use the source() function. I would recommend the second approach with the source() function. As for the Python script, we can run it from the console.

19 https://www.rstudio.com/products/rstudio/
20 https://code.visualstudio.com/
21 https://www.jetbrains.com/pycharm/

R

chapter2 $ls
hello_world.R hello_world.py
chapter2 $r -f hello_world.R
> print("Hello World!")
[1] "Hello World!"

chapter2 $r
> source('hello_world.R')
[1] "Hello World!"

Python

chapter2 $ls
hello_world.R hello_world.py
chapter2 $python3.7 hello_world.py
Hello World!

2.2 Debugging in R/Python

Debugging is one of the most important aspects of programming. What is debugging in programming? The programs we write might include errors/bugs, and debugging is a step-by-step process to find and remove those errors/bugs in order to get the desired results.
If you are smart enough, or the bugs are evident enough, you can debug the program in your mind without using a computer at all. But in general we need some tools/techniques to help us with debugging.

Print
Most programming languages provide the functionality of printing, which is a natural way of debugging. By placing print statements at different positions we may finally catch the bugs. When I use print to debug, it feels like playing the game of minesweeper. In Python, there is a module called logging 22 which can be used for debugging like the print function, but in a more elegant fashion.
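A minimal sketch of using logging in place of print (the function below is invented for illustration); the debug messages can later be silenced by raising the level, without deleting any code:

```python
import logging

# configure the root logger once; level=DEBUG shows all messages
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s: %(message)s")

def find_max(v):
    best = v[0]
    for e in v[1:]:
        # goes to the log (stderr by default) instead of stdout;
        # switch level to INFO above to hide these lines
        logging.debug("comparing %s with current best %s", e, best)
        if e > best:
            best = e
    return best

print(find_max([3, 1, 4, 1, 5]))  # 5
```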

Browser in R and pdb in Python


In R, there is a function browser() which interrupts the execution and allows the inspection of the current environment. Similarly, the pdb module in Python provides more debugging features. We will only focus on the basic usage of browser() and the set_trace() function in the pdb module. The essential difference between debugging with print() and with browser()/set_trace() is that the latter allow us to debug in an interactive mode.
Let’s write a function which takes a sorted vector/list v and a target value x as input and returns the
leftmost index pos of the sorted vector/list so that v[pos]>=x. Since v is already sorted, we may simply loop
through it from left to right to find pos.

22 https://docs.python.org/3/library/logging.html

R

chapter2/find_pos.R

find_pos = function(v, x){
  for (i in 1:length(v)){
    if (v[i] >= x){
      return(i)
    }
  }
}

v = c(1, 2, 5, 10)
print(find_pos(v, 1))
print(find_pos(v, 4))
print(find_pos(v, 11))

Python

chapter2/find_pos.py

def find_pos(v, x):
    for i in range(len(v)):
        if v[i] >= x:
            return i

v = [1, 2, 5, 10]
print(find_pos(v, 1))
print(find_pos(v, 4))
print(find_pos(v, 11))

Now let’s run these two scripts.

R

chapter2 $r
> source('find_pos.R')
[1] 1
[1] 3
NULL

Python

chapter2 $python3.7 find_pos.py
0
2
None

When x=11, the function returns NULL in R and None in Python because no element of v is larger than or equal to x. The implementation above is trivial, but not efficient. If you have some background in data
structures and algorithms, you probably know this question can be solved by binary search. The essential
idea of binary search comes from Divide-and-conquer 23 . Since v is already sorted, we may divide it into
two partitions by cutting it from the middle, and then we get the left partition and the right partition. v is
sorted implies that both the left partition and the right partition are also sorted. If the target value x is larger
than the rightmost element in the left partition, we can just discard the left partition and search x within the
right partition. Otherwise, we can discard the right partition and search x within the left partition. Once we
have determined which partition to search, we may apply the idea recursively so that in each step we reduce
the size of v by half. If the length of v is denoted as n, in terms of big O notation 24 , the run time complexity
of binary search is O(log n), compared with O(n) of the for-loop implementation.
The code below implements the binary search solution to our question (it is more intuitive to do it with recursion, but here I write it with iteration since tail recursion optimization 25 is not supported in R/Python).
23 https://en.wikipedia.org/wiki/Divide-and-conquer_algorithm
24 https://en.wikipedia.org/wiki/Big_O_notation
25 https://en.wikipedia.org/wiki/Tail_call

chapter2/find_binary_search_buggy.R

binary_search_buggy = function(v, x){
  start = 1
  end = length(v)
  while (start < end){
    mid = (start + end) %/% 2 # %/% is the floor division operator
    if (v[mid] >= x){
      end = mid
    }else{
      start = mid + 1
    }
  }
  return(start)
}
v = c(1, 2, 5, 10)
print(binary_search_buggy(v, 1))
print(binary_search_buggy(v, 5))
print(binary_search_buggy(v, 11))

Python

chapter2/find_binary_search_buggy.py

def binary_search_buggy(v, x):
    start, end = 0, len(v) - 1
    while start < end:
        mid = (start + end) // 2 # // is the floor division operator
        if v[mid] >= x:
            end = mid
        else:
            start = mid + 1
    return start

v = [1, 2, 5, 10]
print(binary_search_buggy(v, 1))
print(binary_search_buggy(v, 5))
print(binary_search_buggy(v, 11))

Now let’s run these two binary_search scripts.

R

chapter2 $r
> source('binary_search_buggy.R')
[1] 1
[1] 3
[1] 4

Python

chapter2 $python3.7 binary_search_buggy.py
0
2
3

The binary search solutions don’t work as expected when x=11. We write two new scripts.

chapter2/find_binary_search_buggy_debug.R

1 binary_search_buggy = function(v, x){
2   browser()
3   start = 1
4   end = length(v)
5   while (start < end){
6     mid = (start + end) %/% 2
7     if (v[mid] >= x){
8       end = mid
9     }else{
10      start = mid + 1
11    }
12  }
13  return(start)
14 }
15 v = c(1, 2, 5, 10)
16 print(binary_search_buggy(v, 11))

Python

chapter2/find_binary_search_buggy_debug.py

1 from pdb import set_trace
2 def binary_search_buggy(v, x):
3     set_trace()
4     start, end = 0, len(v) - 1
5     while start < end:
6         mid = (start + end) // 2
7         if v[mid] >= x:
8             end = mid
9         else:
10            start = mid + 1
11    return start
12
13 v = [1, 2, 5, 10]
14 print(binary_search_buggy(v, 11))

Let’s try to debug the programs with the help of browser() and set_trace().

> source('binary_search_buggy_debug.R')
Called from: binary_search_buggy(v, 11)
Browse[1]> ls()
[1] "v" "x"
Browse[1]> n
debug at binary_search_buggy_debug.R#3: start = 1
Browse[2]> n
debug at binary_search_buggy_debug.R#4: end = length(v)
Browse[2]> n
debug at binary_search_buggy_debug.R#5: while (start < end) {
    mid = (start + end)%/%2
    if (v[mid] >= x) {
        end = mid
    }
    else {
        start = mid + 1
    }
}
Browse[2]> n
debug at binary_search_buggy_debug.R#6: mid = (start + end)%/%2
Browse[2]> n
debug at binary_search_buggy_debug.R#7: if (v[mid] >= x) {
    end = mid
} else {
    start = mid + 1
}
Browse[2]> n
debug at binary_search_buggy_debug.R#10: start = mid + 1
Browse[2]> n
debug at binary_search_buggy_debug.R#5: (while) start < end
Browse[2]> n
debug at binary_search_buggy_debug.R#6: mid = (start + end)%/%2
Browse[2]> n
debug at binary_search_buggy_debug.R#7: if (v[mid] >= x) {
    end = mid
} else {
    start = mid + 1
}
Browse[2]> n
debug at binary_search_buggy_debug.R#10: start = mid + 1
Browse[2]> n
debug at binary_search_buggy_debug.R#5: (while) start < end
Browse[2]> start
[1] 4
Browse[2]> n
debug at binary_search_buggy_debug.R#13: return(start)
Browse[2]> n
[1] 4

In the R code snippet above, we placed the browser() function at the top of the function binary_search_buggy. When we call the function, we then enter the debugging environment. By calling ls() we see all variables in the current debugging scope, i.e., v, x. Typing n will evaluate the next statement. After typing n a few times, we finally exit from the while loop because start = 4 makes start < end FALSE. As a result, the function just returns the value of start, i.e., 4. To exit from the debugging environment, we can type Q; to continue the execution we can type c.
The root cause is that we didn't correctly handle the corner case where the target value x is larger than the last/largest element in v.
Let's debug the Python function using the pdb module.

Python

chapter2 $python3.7 binary_search_buggy_debug.py
> chapter2/binary_search_buggy_debug.py(4)binary_search_buggy()
-> start, end = 0, len(v) - 1
(Pdb) n
> chapter2/binary_search_buggy_debug.py(5)binary_search_buggy()
-> while start < end:
(Pdb) l
  1     from pdb import set_trace
  2     def binary_search_buggy(v, x):
  3         set_trace()
  4         start, end = 0, len(v) - 1
  5  ->     while start < end:
  6             mid = (start + end)//2
  7             if v[mid] >= x:
  8                 end = mid
  9             else:
 10                 start = mid + 1
 11         return start
(Pdb) b 7
Breakpoint 1 at chapter2/binary_search_buggy_debug.py:7
(Pdb) c
> chapter2/binary_search_buggy_debug.py(7)binary_search_buggy()
-> if v[mid] >= x:
(Pdb) c
> chapter2/binary_search_buggy_debug.py(7)binary_search_buggy()
-> if v[mid] >= x:
(Pdb) mid
2
(Pdb) n
> chapter2/binary_search_buggy_debug.py(10)binary_search_buggy()
-> start = mid + 1
(Pdb) n
> chapter2/binary_search_buggy_debug.py(5)binary_search_buggy()
-> while start < end:
(Pdb) start
3
(Pdb) n
> chapter2/binary_search_buggy_debug.py(11)binary_search_buggy()
-> return start
Similar to R, the command n evaluates the next statement in pdb. The command l shows the current line of execution. The command b line_number sets the corresponding line as a breakpoint, and c continues the execution until the next breakpoint (if one exists).
In R, besides the browser() function there is a pair of functions debug() and undebug() which are also very handy when we try to debug a function, especially when the function is wrapped in a package. More specifically, the debug function invokes the debugging environment whenever we call the function being debugged. See the example below of how we invoke the debugging environment for the sd function (standard deviation calculation).

> x = c(1, 1, 2)
> debug(sd)
> sd(x)
debugging in: sd(x)
debug: sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
    na.rm = na.rm))
Browse[2]> ls()
[1] "na.rm" "x"
Browse[2]> Q
> undebug(sd)
> sd(x)
[1] 0.5773503

The binary_search solutions are fixed below.

chapter2/find_binary_search.R

binary_search = function(v, x){
  if (x > v[length(v)]){ return(NULL) }
  start = 1
  end = length(v)
  while (start < end){
    mid = (start + end) %/% 2
    if (v[mid] >= x){
      end = mid
    }else{
      start = mid + 1
    }
  }
  return(start)
}

Python

chapter2/find_binary_search.py

def binary_search(v, x):
    if x > v[-1]: return
    start, end = 0, len(v) - 1
    while start < end:
        mid = (start + end) // 2
        if v[mid] >= x:
            end = mid
        else:
            start = mid + 1
    return start
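As a side note, Python's standard library already provides this leftmost-insertion-point search in the bisect module, so in practice one rarely hand-rolls it. A sketch that mimics the convention above of returning None when no element qualifies (the wrapper name find_pos_bisect is my own):

```python
from bisect import bisect_left

def find_pos_bisect(v, x):
    # bisect_left returns the leftmost index at which x could be
    # inserted while keeping v sorted, i.e. the smallest pos with
    # v[pos] >= x; a result of len(v) means no such element exists
    pos = bisect_left(v, x)
    return None if pos == len(v) else pos

v = [1, 2, 5, 10]
print(find_pos_bisect(v, 5))   # 2
print(find_pos_bisect(v, 11))  # None
```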

2.3 Benchmarking

By benchmarking, I mean measuring the entire running time of a piece of a program. There is another related term, profiling. But profiling is more complex since it commonly aims at

understanding the behavior of the program and optimizing the program in terms of time elapsed during the
operation.
In R, I like using the microbenchmark package. In Python, the timeit module is a good tool when we want to benchmark small bits of Python code.
As mentioned before, the run time complexity of binary search is better than that of a for-loop search. We can do benchmarking to compare the two algorithms.

chapter2/benchmark.R

library(microbenchmark)
source('binary_search.R')
source('find_pos.R')

v = 1:10000

# call each function 1000 times;
# each time we randomly select an integer as the target value

# for loop solution
set.seed(2019)
print(microbenchmark(find_pos(v, sample(10000, 1)), times = 1000))
# binary search solution
set.seed(2019)
print(microbenchmark(binary_search(v, sample(10000, 1)), times = 1000))

In the R code above, times=1000 means we want to call the function 1000 times in the benchmarking process. The sample() function is used to draw samples from a set of elements. Specifically, we pass the argument 1 to sample() to draw a single element. It's the first time we use the set.seed() function in this book. In R/Python, we draw random numbers based on a pseudorandom number generator (PRNG) algorithm 26 . The sequence of numbers generated by a PRNG is completely determined by an initial value, i.e., the seed. Whenever a program involves the usage of a PRNG, it is better to set the seed in order to get replicable results (see the example below).

R (without re-seeding)

> set.seed(2019)
> rnorm(1)
[1] 0.7385227
> rnorm(1)
[1] 0.5147605

R (re-seeding)

> set.seed(2019)
> rnorm(1)
[1] 0.7385227
> set.seed(2019)
> rnorm(1)
[1] 0.7385227
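The same reproducibility applies in Python; a small sketch with the built-in random module:

```python
import random

random.seed(2019)
a = random.random()  # first draw after seeding
b = random.random()  # second draw, different from the first

random.seed(2019)    # re-seeding restarts the sequence
c = random.random()

print(a == c)  # True: same seed gives the same first draw
print(a == b)  # False
```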

26 https://en.wikipedia.org/wiki/Pseudorandom_number_generator

Now let’s run the R script to see the benchmarking result. llr
1 > source ( ' benchmark .R ' )
2 Unit: microseconds
3 expr min lq mean median uq
max
4 find_pos(v, sample (10000 , 1)) 3.96 109.5385 207.6627 207.5565 307.8875
536.171
5 neval
6 1000
7 Unit: microseconds
8 expr min lq mean median uq max
9 binary _ search (v, sample (10000 , 1)) 5.898 6.3325 14.2159 6.6115 7.3635 6435.57
10 neval
11 1000

The binary_search solution is much more efficient based on the benchmarking result. Doing the same benchmarking in Python is a bit more complicated.

Python

chapter2/benchmark.py

from binary_search import binary_search
from find_pos import find_pos
import timeit
import random

v = list(range(1, 10001))

def test_for_loop(n):
    random.seed(2019)
    for _ in range(n):
        find_pos(v, random.randint(1, 10000))

def test_bs(n):
    random.seed(2019)
    for _ in range(n):
        binary_search(v, random.randint(1, 10000))

# for loop solution
print(timeit.timeit('test_for_loop(1000)', setup='from __main__ import test_for_loop', number=1))
# binary_search solution
print(timeit.timeit('test_bs(1000)', setup='from __main__ import test_bs', number=1))

The most interesting part of the Python code above is from __main__ import. Let's ignore it for now; we will revisit it later.
Below is the benchmarking result in Python (the unit is seconds).

Python

chapter2 $python3 benchmark.py
0.284618441
0.00396658900000002

2.4 Vectorization

In parallel computing, automatic vectorization 27 means that a program in a scalar implementation is converted to a vector implementation, which processes multiple pairs of operands simultaneously, by compilers that feature auto-vectorization. For example, let's calculate the element-wise sum of two arrays x and y of the same length in the C programming language.
int x[4] = {1, 2, 3, 4};
int y[4] = {0, 1, 2, 3};
int z[4];
for (int i = 0; i < 4; i++){
    z[i] = x[i] + y[i];
}

The C code above might be vectorized by the compiler so that the actual number of iterations performed could be less than 4. If 4 pairs of operands are processed at once, there would be only 1 iteration. Automatic vectorization may make programs run much faster in languages like C. However, when we talk about vectorization in R/Python, it is different from automatic vectorization. Vectorization in R/Python usually refers to the human effort paid to avoid for-loops. First, let's see some examples of how for-loops may slow down your programs in R/Python.

chapter2/vectorization_1.R
library(microbenchmark)

# generate n standard normal r.v.
rnorm_loop = function(n){
  x = rep(0, n)
  for (i in 1:n) {x[i] = rnorm(1)}
}

rnorm_vec = function(n){
  x = rnorm(n)
}

n = 100
# for loop
print(microbenchmark(rnorm_loop(n), times = 1000))
# vectorize
print(microbenchmark(rnorm_vec(n), times = 1000))

27 https://en.wikipedia.org/wiki/Automatic_vectorization

Running the R code gives the following result on my local machine.

> source('vectorization_1.R')
Unit: microseconds
          expr     min      lq     mean   median      uq     max neval
 rnorm_loop(n) 131.622 142.699 248.7603 145.3995 270.212 16355.6  1000
Unit: microseconds
         expr   min    lq     mean median    uq      max neval
 rnorm_vec(n) 6.696 7.128 10.87463  7.515 8.291 2422.338  1000

Python

import timeit
import numpy as np

def rnorm_for_loop(n):
    x = [0] * n # create a list with n 0s
    np.random.seed(2019)
    for _ in range(n):
        np.random.normal(0, 1, 1)

def rnorm_vec(n):
    np.random.seed(2019)
    x = np.random.normal(0, 1, n)

print("for loop")
print(f'{timeit.timeit("rnorm_for_loop(100)", setup="from __main__ import rnorm_for_loop", number=1000):.6f} seconds')
print("vectorized")
print(f'{timeit.timeit("rnorm_vec(100)", setup="from __main__ import rnorm_vec", number=1000):.6f} seconds')

Please note that in this Python example we are using the random submodule of the numpy module instead of the built-in random module, since the built-in random module doesn't provide vectorized random number generation functions. Running the Python code gives the following result on my local machine.

Python

chapter2 $python3.7 vectorization_1.py
for loop
0.258466 seconds
vectorized
0.008213 seconds

In either R or Python, the vectorized version of normal random variable (r.v.) generation is significantly faster than the scalar version. It is worth noting the usage of the print(f'') statement in the Python code, which is different from the way we printed the object of the Complex class in chapter 1. In the code above, we use an f-string 28 , which is a literal string prefixed with 'f' containing expressions inside {} that are replaced with their values. The f-string was introduced in Python 3.6. If you are familiar with Scala, you may find this feature quite similar to the string interpolation mechanism introduced in Scala 2.10.
It's also worth noting that lots of built-in functions in R are already vectorized, such as the basic arithmetic operators, comparison operators, ifelse(), and the element-wise logical operators &, |. But the logical operators &&, || are not vectorized.
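On the Python side, numpy offers vectorized counterparts for most of these: arithmetic and comparison operators work element-wise on arrays, and np.where plays a role similar to R's ifelse. A small sketch:

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2])
mask = x > 0               # element-wise comparison -> boolean array
y = np.where(mask, 1, -1)  # like ifelse(x > 0, 1, -1) in R
print(mask.tolist())  # [False, False, False, True, True]
print(y.tolist())     # [-1, -1, -1, 1, 1]
```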
In addition to vectorization, there are also some built-in functions which may help to avoid for-loops. For example, in R we might be able to use the apply family of functions to replace for-loops; and in Python the map() function can also be useful. In the Python pandas module, there are also many usages of map/apply methods. But in general the usage of apply/map functions has little or nothing to do with performance improvement. However, appropriate usage of such functions may help with the readability of the program. Compared with the apply family of functions in R, I think the do.call() function is more useful in practice. We will spend some time on do.call() later.
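For reference, a tiny sketch of Python's map, which applies a function to each element lazily; wrapping it in list materializes the result:

```python
# square each element
squares = list(map(lambda t: t * t, [1, 2, 3, 4]))
print(squares)  # [1, 4, 9, 16]

# map can also walk several sequences in parallel
sums = list(map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30]))
print(sums)  # [11, 22, 33]
```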
Considering the importance of vectorization in scientific programming, let's try to get more familiar with vectorization through the Biham–Middleton–Levine (BML) traffic model 29 . The BML model is very important in modern studies of traffic flow since it exhibits a sharp phase transition from a free-flowing state to a fully jammed state. A simplified BML model can be characterized as follows:
• Initialized on a 2-D lattice, each site of which is either empty or occupied by a colored particle (blue or
red);
• Particles are distributed randomly through the initialization according to a uniform distribution; the two
colors of particles are equally distributed.
• On even time steps, all blue particles attempt to move one site up and an attempt fails if the site to occupy
is not empty;
• On odd time steps, all red particles attempt to move one site right and an attempt fails if the site to occupy is not empty;
• The lattice is assumed periodic which means when a particle moves out of the lattice, it would move into
the lattice from the opposite side.
The BML model specified above is implemented in both R/Python as follows to illustrate the usage of
vectorization.
28 https://www.python.org/dev/peps/pep-0498/
29 https://en.wikipedia.org/wiki/Biham-Middleton-Levine_traffic_model

chapter2/BML.R

library(R6)
BML = R6Class(
  "BML",
  public = list(
    # alpha is the parameter of the uniform distribution to control particle distribution's density
    # m * n is the dimension of the lattice
    alpha = NULL,
    m = NULL,
    n = NULL,
    lattice = NULL,
    initialize = function(alpha, m, n) {
      self$alpha = alpha
      self$m = m
      self$n = n
      self$initialize_lattice()
    },
    initialize_lattice = function() {
      # 0 -> empty site
      # 1 -> blue particle
      # 2 -> red particle
      u = runif(self$m * self$n)
      # the usage of L is to make sure the elements in particles are of type integer;
      # otherwise they would be created as double
      particles = rep(0L, self$m * self$n)
      # doing inverse transform sampling
      particles[(u > self$alpha) &
                  (u <= (self$alpha + 1.0) / 2)] = 1L
      particles[u > (self$alpha + 1.0) / 2] = 2L
      self$lattice = array(particles, c(self$m, self$n))
    },
    odd_step = function() {
      blue.index = which(self$lattice == 1L, arr.ind = TRUE)
      # make a copy of the index
      blue.up.index = blue.index
      # blue particles move 1 site up
      blue.up.index[, 1] = blue.index[, 1] - 1L
      # periodic boundary condition
      blue.up.index[blue.up.index[, 1] == 0L, 1] = self$m
      # find which moves are feasible
      blue.movable = self$lattice[blue.up.index] == 0L
      # move blue particles one site up
      # drop=FALSE prevents the 2D array degenerating to a 1D array
      self$lattice[blue.up.index[blue.movable, , drop = FALSE]] = 1L
      self$lattice[blue.index[blue.movable, , drop = FALSE]] = 0L
    },
    even_step = function() {
      red.index = which(self$lattice == 2L, arr.ind = TRUE)
      # make a copy of the index
      red.right.index = red.index
      # red particles move 1 site right
      red.right.index[, 2] = red.index[, 2] + 1L
      # periodic boundary condition
      red.right.index[red.right.index[, 2] == (self$n + 1L), 2] = 1
      # find which moves are feasible
      red.movable = self$lattice[red.right.index] == 0L
      # move red particles one site right
      self$lattice[red.right.index[red.movable, , drop = FALSE]] = 2L
      self$lattice[red.index[red.movable, , drop = FALSE]] = 0L
    }
  )
)

Now we can create a simple BML system on a 5 × 5 lattice using the R code above.

> source('BML.R')
> set.seed(2019)
> bml = BML$new(0.4, 5, 5)
> bml$lattice
     [,1] [,2] [,3] [,4] [,5]
[1,]    2    0    2    1    1
[2,]    2    2    1    0    1
[3,]    0    0    0    2    2
[4,]    1    0    0    0    0
[5,]    0    1    1    1    0
> bml$odd_step()
> bml$lattice
     [,1] [,2] [,3] [,4] [,5]
[1,]    2    0    2    1    0
[2,]    2    2    1    0    1
[3,]    1    0    0    2    2
[4,]    0    1    1    1    0
[5,]    0    0    0    0    1
> bml$even_step()
> bml$lattice
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    2    2    1    0
[2,]    2    2    1    0    1
[3,]    1    0    0    2    2
[4,]    0    1    1    1    0
[5,]    0    0    0    0    1

In the initialization step, we used the inverse transform sampling approach 30 to generate the status of each site. Inverse transform sampling is a basic but powerful method to generate random variables from any probability distribution given its cumulative distribution function (CDF). Reading the wiki page is enough to master this sampling method.
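As a quick illustration (a sketch of the idea, not code from the book), the same approach can generate exponential random variables with numpy:

```python
import numpy as np

# Inverse transform sampling sketch: for an Exponential(rate) distribution,
# the CDF is F(x) = 1 - exp(-rate * x), so its inverse is
# F^{-1}(u) = -log(1 - u) / rate. Feeding Uniform(0, 1) draws through
# F^{-1} yields Exponential(rate) draws.
def sample_exponential(rate, size, rng):
    u = rng.uniform(0.0, 1.0, size)
    return -np.log(1.0 - u) / rate

rng = np.random.default_rng(2019)
samples = sample_exponential(rate=2.0, size=100_000, rng=rng)
# the sample mean should be close to the true mean 1 / rate = 0.5
```

The same recipe works for any distribution whose CDF can be inverted, which is exactly how the three-way split of the BML lattice status is generated from a single uniform draw.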

Python

import numpy as np

class BML:
    def __init__(self, alpha, m, n):
        self.alpha = alpha
        self.shape = (m, n)
        self.initialize_lattice()

    def initialize_lattice(self):
        u = np.random.uniform(0.0, 1.0, self.shape)
        # instead of using a default list, we use np.array to create the lattice
        self.lattice = np.zeros_like(u, dtype=int)
        # the parentheses below can't be omitted
        self.lattice[(u > self.alpha) & (u <= (1.0 + self.alpha) / 2.0)] = 1
        self.lattice[u > (self.alpha + 1.0) / 2.0] = 2

    def odd_step(self):
        # please note that np.where returns a tuple, which is immutable
        blue_index = np.where(self.lattice == 1)
        blue_index_i = blue_index[0] - 1
        blue_index_i[blue_index_i < 0] = self.shape[0] - 1
        blue_movable = self.lattice[(blue_index_i, blue_index[1])] == 0
        self.lattice[(blue_index_i[blue_movable],
                      blue_index[1][blue_movable])] = 1
        self.lattice[(blue_index[0][blue_movable],
                      blue_index[1][blue_movable])] = 0
30 https://en.wikipedia.org/wiki/Inverse_transform_sampling

    def even_step(self):
        red_index = np.where(self.lattice == 2)
        red_index_j = red_index[1] + 1
        red_index_j[red_index_j == self.shape[1]] = 0
        red_movable = self.lattice[(red_index[0], red_index_j)] == 0
        self.lattice[(red_index[0][red_movable],
                      red_index_j[red_movable])] = 2
        self.lattice[(red_index[0][red_movable],
                      red_index[1][red_movable])] = 0

Let's create the same BML system with the Python implementation.

>>> import numpy as np
>>> np.random.seed(2019)
>>> from BML import BML
>>> bml = BML(0.4, 5, 5)
>>> bml.lattice
array([[2, 0, 1, 1, 2],
       [0, 2, 2, 2, 1],
       [1, 0, 0, 2, 0],
       [2, 0, 1, 0, 2],
       [1, 1, 0, 2, 1]])
>>> bml.odd_step()
>>> bml.lattice
array([[2, 0, 0, 1, 2],
       [1, 2, 2, 2, 1],
       [0, 0, 1, 2, 0],
       [2, 1, 0, 0, 2],
       [1, 0, 1, 2, 1]])
>>> bml.even_step()
>>> bml.lattice
array([[0, 2, 0, 1, 2],
       [1, 2, 2, 2, 1],
       [0, 0, 1, 0, 2],
       [2, 1, 0, 0, 2],
       [1, 0, 1, 2, 1]])

Please note that although we have imported numpy in BML.py, we import it again in the code above in order to set the random seed. If we changed the import line to from BML import *, we wouldn't need to import numpy again; but importing * from a module is not recommended.

2.5 Scope of Variables

We saw how to define variables in R/Python in Chapter 1, and we know that a variable is an identifier for a location in memory. What is the scope of a variable, and why does it matter? Let's first have a look at the code snippets below.

R

> x = 1
> var_func_1 = function() { print(x) }
> var_func_1()
[1] 1
> var_func_2 = function() { x = x + 1; print(x) }
> var_func_2()
[1] 2
> x
[1] 1

Python

>>> x = 1
>>> def var_func_1(): print(x)
...
>>> var_func_1()
1
>>> def var_func_2(): x += 1
...
>>> var_func_2()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in var_func_2
UnboundLocalError: local variable 'x' referenced before assignment

The results of the code above may seem strange before we introduce the concept of variable scope. Inside a function, a variable may refer to a function argument/parameter, or it may be formally declared inside the function, in which case it is called a local variable. But in the code above, x is neither a function argument nor a local variable. How does the print() function know where the identifier x points to?
The scope of a variable determines where the variable is available/accessible (can be referenced). Both R and Python apply lexical/static scoping for variables, which sets the scope of a variable based on the structure of the program. In static scoping, when an 'unknown' variable is referenced, the function tries to find it in the most closely enclosing block. That explains how the print() function could find the variable x.
In the R code above, in x=x+1 the first x is a local variable created by the = operator, while the second x is referenced inside the function, so the static scoping rule applies. As a result, a local variable x equal to 2 is created, which is independent of the x outside of the function var_func_2(). In Python, however, when a variable is assigned a value in a statement it is treated as a local variable, and that explains the UnboundLocalError.
Is it possible to change a variable declared outside a function from inside that function? Based on the static scoping rule alone, it's impossible. But there are workarounds in both R and Python. In R, we need the help of environment; in Python we can use the keyword global.
So what is an environment in R? An environment is a place where objects are stored. When we invoke an interactive R session, an environment named .GlobalEnv is created automatically. We can also use

the function environment() to get the present environment. The ls() function can take an environment as
the argument to list all objects inside the environment.

$ r
> typeof(.GlobalEnv)
[1] "environment"
> environment()
<environment: R_GlobalEnv>
> x = 1
> ls(environment())
[1] "x"
> env_func_1 = function(x) {
+   y = x + 1
+   print(environment())
+   ls(environment())
+ }
> env_func_1(2)
<environment: 0x7fc59d165a20>
[1] "x" "y"
> env_func_2 = function() { print(environment()) }
> env_func_2()
<environment: 0x7fc59d16f520>

The above code shows that each function has its own environment containing all function arguments and local variables declared inside the function. In order to change a variable declared outside of a function, we need access to the environment enclosing that variable. The function parent.env(e) returns the parent environment of the given environment e in R. Using this function, we are able to change the value of x declared in .GlobalEnv inside a function which is also declared in .GlobalEnv. The global keyword in Python works in a totally different way, which is simple but less flexible.

R

> x = 1
> env_func_3 = function() {
+   cur_env = environment()
+   par_env = parent.env(cur_env)
+   par_env$x = 2
+ }
> env_func_3()
> x
[1] 2

Python

>>> def env_func_3():
...     global x
...     x = 2
...
>>> x = 1
>>> env_func_3()
>>> x
2

I seldom use the global keyword in Python, if ever. But the environment in R can be very handy on some occasions. In R, an environment can be used as a purely mutable version of the list data structure.

R

# list is not purely mutable
> x = list(1)
> tracemem(x)
[1] "<0x7f829183f6f8>"
> x$a = 2
> tracemem(x)
[1] "<0x7f828f4d05c8>"

# environment is purely mutable
> x = new.env()
> x
<environment: 0x7f8290aee7e8>
> x$a = 2
> x
<environment: 0x7f8290aee7e8>

Actually, the object of an R6 class type is also an environment.

> # load the Complex class that we defined in chapter 1
> x = Complex$new(1, 2)
> typeof(x)
[1] "environment"

In Python, we can assign values to multiple variables in one line.

Python

>>> x, y = 1, 2
>>> x
1
>>> y
2

Python

>>> x, y = (1, 2)
>>> print(x, y)
1 2
>>> (x, y) = (1, 2)
>>> print(x, y)
1 2
>>> [x, y] = (1, 2)
>>> print(x, y)
1 2

Even though in the left snippet above there are no parentheses enclosing 1,2 after the = operator, a tuple is created first and then the tuple is unpacked and assigned to x, y. Such a mechanism doesn't exist in R, but we can define our own multiple-assignment operator with the help of environment.

chapter2/multi_assignment.R

`%=%` = function(left, right) {
  # we require the RHS to be a list strictly
  stopifnot(is.list(right))
  # dest_env is the destination environment enclosing the variables on LHS
  dest_env = parent.env(environment())
  left = substitute(left)

  recursive_assign = function(left, right, dest_env) {
    if (length(left) == 1) {
      assign(x = deparse(left),
             value = right,
             envir = dest_env)
      return()
    }
    if (length(left) != length(right) + 1) {
      stop("LHS and RHS must have the same shapes")
    }
    for (i in 2:length(left)) {
      recursive_assign(left[[i]], right[[i - 1]], dest_env)
    }
  }

  recursive_assign(left, right, dest_env)
}

Before going deeper into the script, let's first see the usage of the multiple-assignment operator we defined.

> source('multi_assignment.R')
> c(x, y, z) %=% list(1, "Hello World!", c(2, 3))
> x
[1] 1
> y
[1] "Hello World!"
> z
[1] 2 3
> list(a, b) %=% list(1, as.Date('2019-01-01'))
> a
[1] 1
> b
[1] "2019-01-01"

In the %=% operator defined above, we used two functions, substitute and deparse, which are very powerful but less known to R novices. To better understand these functions, as well as some other less-known R functions, the Rchaeology 31 tutorial is worth reading.
It is also interesting to see that we defined the function recursive_assign inside the %=% function. Both R and Python support the concept of first-class functions. More specifically, a function in R/Python is an object, which can be
1. stored as a variable;
2. passed as a function argument;
3. returned from a function.
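A minimal Python illustration of the three properties above (the function names are made up for this example):

```python
def square(x):
    return x * x

# 1. stored as a variable
f = square

# 2. passed as a function argument
def apply_twice(g, x):
    return g(g(x))

# 3. returned from a function (a closure over n)
def make_adder(n):
    def add(x):
        return x + n
    return add

r1 = f(3)                    # 9
r2 = apply_twice(square, 2)  # (2^2)^2 = 16
r3 = make_adder(10)(5)       # 15
```

The same three properties hold in R, which is what makes defining recursive_assign inside %=% and passing it around natural.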
The essential idea behind the recursive_assign function is a depth-first search (DFS), which is a fundamental graph-traversal algorithm 32. In the context of the recursive_assign function, we use DFS to traverse the parse tree of the left argument created by calling substitute(left).

2.6 Miscellaneous

We have introduced the basics of R/Python programming so far. There is much more to learn to become an advanced user of R/Python. For example, the appropriate usage of iterator, generator, and decorator could improve both the conciseness and readability of your Python code. The generator 33 is commonly seen in machine learning programs to prepare training/testing samples. A decorator is a kind of syntactic sugar that allows modifying a function's behavior in a simple way. In R there are no built-in iterator, generator, or decorator, but you may find some third-party libraries to mimic these features; or you may try to implement your own.
One advantage of Python over R is that it ships with built-in modules containing high-performance data structures and efficiently implemented, commonly-used algorithms. For example, I enjoy using the deque structure in the Python collections module 34, but there is no built-in counterpart in R. We have written our own binary search algorithm earlier in this Chapter, which can also be replaced by the functions in the built-in module bisect 35 in Python.
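For instance, here is a small sketch of deque and bisect (the values are arbitrary, chosen just to show the calls):

```python
from collections import deque
import bisect

# deque supports O(1) appends and pops at both ends
d = deque([2, 3, 4])
d.appendleft(1)      # deque([1, 2, 3, 4])
d.append(5)          # deque([1, 2, 3, 4, 5])
first = d.popleft()  # 1
last = d.pop()       # 5

# bisect performs binary search on an already-sorted list
sorted_list = [1, 3, 5, 7]
pos = bisect.bisect_left(sorted_list, 5)  # index where 5 is found: 2
bisect.insort(sorted_list, 4)             # insert 4, keeping the list sorted
```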
Another important aspect of programming is testing. Unit testing is a typical software testing method that is commonly adopted in practice. In R there are two third-party packages, testthat and RUnit. In Python, the built-in unittest module is quite powerful, and the third-party module pytest 36 is also very popular.
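A minimal unittest sketch (the function under test is made up for illustration):

```python
import unittest

def add(a, b):
    return a + b

class TestAdd(unittest.TestCase):
    def test_add_integers(self):
        self.assertEqual(add(1, 2), 3)

    def test_add_strings(self):
        self.assertEqual(add("a", "b"), "ab")

# in a standalone test script one would simply call unittest.main();
# here we load and run the suite programmatically instead
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAdd)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```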

31 https://cran.r-project.org/web/packages/rockchalk/vignettes/Rchaeology.pdf
32 https://en.wikipedia.org/wiki/Depth-first_search
33 https://docs.python.org/3/howto/functional.html
34 https://docs.python.org/3/library/collections.html
35 https://docs.python.org/3.7/library/bisect.html
36 https://docs.pytest.org/en/latest/
3
Linear Regression

After finishing the first two Chapters, I was thinking about the topic of Chapter 3 and finally chose linear regression, because of the importance of these models in machine learning and data science. There are numerous books/online courses available on the theory of linear regression, among which my favorite is The Elements of Statistical Learning [1]. So what is the purpose of writing a chapter about these models in another book on data science? Many readers would be interested in how to implement their own regression models rather than using off-the-shelf software packages. By the end of this chapter, we will build up our own linear regression models in R/Python. We will also see how to reuse some functions in different regressions, such as linear regression with and without L2 penalties.

3.1 Basics of linear regression

We start from the matrix form of linear regression.

$$y = X\beta + \epsilon, \qquad (3.1)$$

where the column vector $y$ contains $n$ observations of the dependent variable, $X$ is an $n \times (p+1)$ matrix ($n > p$) of independent variables with a constant vector $\mathbf{1}$ in the first column, $\beta$ is a column vector of unknown population parameters to estimate based on the data, and $\epsilon$ is the error term (or noise). For the sake of illustration, (3.1) can be extended as in
$$
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix}
1 & X_{11} & X_{21} & \cdots & X_{p1} \\
1 & X_{12} & X_{22} & \cdots & X_{p2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_{1n} & X_{2n} & \cdots & X_{pn}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}
+
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}
\qquad (3.2)
$$

We apply the ordinary least squares (OLS) 37 approach to estimate the model parameter $\beta$, since it requires fewer assumptions than other estimation methods such as maximum likelihood estimation 38. Suppose the estimated model parameter is denoted as $\hat{\beta}$; we define the residual vector of the system as $e = y - X\hat{\beta}$. The

37 https://en.wikipedia.org/wiki/Ordinary_least_squares
38 https://en.wikipedia.org/wiki/Maximum_likelihood_estimation

idea of OLS is to find the $\hat{\beta}$ which minimizes the sum of squared residuals (SSR), i.e.,

$$\min_{\hat{\beta}} \; e'e \qquad (3.3)$$

Now the question is how to solve the optimization problem (3.3). First, let's expand the SSR.

$$
e'e = (y - X\hat{\beta})'(y - X\hat{\beta})
    = y'y - 2\hat{\beta}'X'y + \hat{\beta}'X'X\hat{\beta}
\qquad (3.4)
$$

The first and second order derivatives are calculated as follows


$$
\frac{\partial e'e}{\partial \hat{\beta}} = -2X'y + 2X'X\hat{\beta}, \qquad
\frac{\partial^2 e'e}{\partial \hat{\beta}\,\partial \hat{\beta}'} = 2X'X.
\qquad (3.5)
$$

We see that the second-order derivative $2X'X$ is positive semidefinite, which implies the SSR in OLS is a convex function (see section 3.1.4 in [2]); and for an unconstrained convex optimization problem, the necessary as well as sufficient condition for optimality is that the first-order derivative equals 0 (see section 4.2.3 in [2]). Optimization of convex functions is very important in machine learning; actually, the parameter estimation of many machine learning models is formulated as a convex optimization problem.
Based on the analysis above, the solution of (3.3) is given in (3.6).

$$\hat{\beta} = (X'X)^{-1}X'y \qquad (3.6)$$
Now it seems we are ready to write our own linear regression model in R/Python. The solution in (3.6) involves matrix transposition, multiplication and inversion, all of which are supported in both R and Python. In Python, we can use the numpy module for the matrix operations.
However, in practice we don't solve linear regression with (3.6) directly. Why? Let's see an example with

$$x = \begin{bmatrix} 10^{6} & -1 \\ -1 & 10^{-6} \end{bmatrix}.$$
R

> x = array(c(10^6, -1, -1, 10^-6), c(2, 2))
> solve(t(x) %*% x)  # solve() would return the inverse matrix
Error in solve.default(t(x) %*% x) :
  system is computationally singular: reciprocal condition number = 2.22044e-28

Python

>>> import numpy as np
>>> x = np.array([[1e6, -1], [-1, 1e-6]])
>>> np.linalg.inv(np.dot(x.transpose(), x))
array([[4.50359963e+03, 4.50359963e+09],
       [4.50359963e+09, 4.50359963e+15]])

The R code above throws an error because of the singularity of $X'X$. It's interesting that the corresponding Python code doesn't behave in the same way as R, which has been reported as an issue on GitHub 39.
When the matrix $X'X$ is singular, how do we solve the OLS problem? In this book, we focus on the QR decomposition based solution. Singular value decomposition (SVD) can also be used to solve OLS, but it is not covered in this book.
In linear algebra, the QR decomposition 40 of a matrix $X$ factorizes $X$ into a product, i.e., $X = QR$, where $Q$ is an orthogonal matrix and $R$ is an upper triangular matrix. Since the matrix $Q$ is orthogonal ($Q' = Q^{-1}$), we have

$$
\hat{\beta} = (X'X)^{-1}X'y
            = (R'Q'QR)^{-1}R'Q'y
            = (R'R)^{-1}R'Q'y
            = R^{-1}Q'y
\qquad (3.7)
$$

Now we are ready to write our simple R/Python functions for linear regression with the help of QR
decomposition according to (3.7).

R

chapter3/qr_solver.R

qr_solver = function(x, y) {
  qr.coef(qr(x), y)
}

Python

chapter3/qr_solver.py

import numpy as np

def qr_solver(x, y):
    q, r = np.linalg.qr(x)
    p = np.dot(q.T, y)
    return np.dot(np.linalg.inv(r), p)

Of course, we don't need to implement our own OLS solvers in a production environment; and even if we did, well-written and well-tested functions such as np.linalg.lstsq could save us the time and effort of doing it from scratch.
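As a quick sanity check on some made-up data, the QR-based steps agree with np.linalg.lstsq:

```python
import numpy as np

rng = np.random.default_rng(42)
# a random, well-conditioned design matrix with an intercept column
x = np.hstack((np.ones((50, 1)), rng.normal(size=(50, 2))))
y = rng.normal(size=50)

# QR-based solution, following the same steps as qr_solver above
q, r = np.linalg.qr(x)
beta_qr = np.dot(np.linalg.inv(r), np.dot(q.T, y))

# library routine for comparison
beta_lstsq, *_ = np.linalg.lstsq(x, y, rcond=None)
```

The two estimates match to floating-point precision on a well-conditioned design; lstsq additionally handles rank-deficient cases gracefully.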
Ok, now we have finished the training part of a linear regression model in both R and Python. After we train a model we want to use it, i.e., to make predictions based on the model. For most machine learning models, training is much more complex than prediction (exceptions include lazy-learning models such as KNN). Let's continue developing our own linear regression model by including the prediction function and enclosing everything in an object.

chapter3/linear_regression.R

39 https://github.com/numpy/numpy/issues/10471
40 https://en.wikipedia.org/wiki/QR_decomposition

library(R6)
LR = R6Class(
  "LR",
  public = list(
    coef = NULL,
    initialize = function() {
    },
    fit = function(x, y) {
      self$qr_solver(cbind(1, x), y)
    },
    qr_solver = function(x, y) {
      self$coef = qr.coef(qr(x), y)
    },
    predict = function(new_x) {
      cbind(1, new_x) %*% self$coef
    }
  )
)

Python

chapter3/linear_regression.py

import numpy as np

class LR:
    def __init__(self):
        self.coef = None

    def qr_solver(self, x, y):
        q, r = np.linalg.qr(x)
        p = np.dot(q.T, y)
        return np.dot(np.linalg.inv(r), p)

    def fit(self, x, y):
        self.coef = self.qr_solver(np.hstack((np.ones((x.shape[0], 1)), x)), y)

    def predict(self, x):
        return np.dot(np.hstack((np.ones((x.shape[0], 1)), x)), self.coef)

41 https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

Now let’s try to use our linear regression model to solve a real regression problem with the Boston dataset
41 , and check the results.

> source('linear_regression.R')
>
> library(MASS)  # load Boston data from this package
>
> lr = LR$new()
> # -i means excluding the ith column from the data.frame
> lr$fit(data.matrix(Boston[, -ncol(Boston)]), Boston$medv)
> print(lr$coef)
                       crim            zn         indus          chas
 3.645949e+01 -1.080114e-01  4.642046e-02  2.055863e-02  2.686734e+00
          nox            rm           age           dis           rad
-1.776661e+01  3.809865e+00  6.922246e-04 -1.475567e+00  3.060495e-01
          tax       ptratio         black         lstat
-1.233459e-02 -9.527472e-01  9.311683e-03 -5.247584e-01
> # let's make prediction on the same data
> pred = lr$predict(data.matrix(Boston[, -ncol(Boston)]))
> print(pred[1:5])
[1] 30.00384 25.02556 30.56760 28.60704 27.94352
>
> # compare it with the R built-in linear regression model
> rlm = lm(medv ~ ., data = Boston)
> print(rlm$coef)
  (Intercept)          crim            zn         indus          chas
 3.645949e+01 -1.080114e-01  4.642046e-02  2.055863e-02  2.686734e+00
          nox            rm           age           dis           rad
-1.776661e+01  3.809865e+00  6.922246e-04 -1.475567e+00  3.060495e-01
          tax       ptratio         black         lstat
-1.233459e-02 -9.527472e-01  9.311683e-03 -5.247584e-01
> print(rlm$fitted[1:5])
       1        2        3        4        5
30.00384 25.02556 30.56760 28.60704 27.94352

Python

>>> from sklearn.datasets import load_boston
>>> from linear_regression import LR
>>> boston = load_boston()
>>> X, y = boston.data, boston.target
>>> # first, let's run our own linear regression
... lr = LR()
>>> lr.fit(X, y)
>>> print(lr.coef)
[ 3.64594884e+01 -1.08011358e-01  4.64204584e-02  2.05586264e-02
  2.68673382e+00 -1.77666112e+01  3.80986521e+00  6.92224640e-04
 -1.47556685e+00  3.06049479e-01 -1.23345939e-02 -9.52747232e-01
  9.31168327e-03 -5.24758378e-01]
>>> print(lr.predict(X)[:5])
[30.00384338 25.02556238 30.56759672 28.60703649 27.94352423]
>>>
>>> # now, use sklearn's linear regression model
... from sklearn.linear_model import LinearRegression
>>> reg = LinearRegression().fit(X, y)
>>> print(reg.coef_)
[-1.08011358e-01  4.64204584e-02  2.05586264e-02  2.68673382e+00
 -1.77666112e+01  3.80986521e+00  6.92224640e-04 -1.47556685e+00
  3.06049479e-01 -1.23345939e-02 -9.52747232e-01  9.31168327e-03
 -5.24758378e-01]
>>> print(reg.predict(X)[:5])
[30.00384338 25.02556238 30.56759672 28.60703649 27.94352423]

The results from our own linear regression model are almost identical to the results from the lm() function or the sklearn.linear_model module, which means we have done a great job so far.

3.2 Linear hypothesis testing

I'm not a big fan of applying hypothesis testing in data science or machine learning. But sometimes it is irreplaceable, or at least useful, for example in the Design of Experiments (DOE).
Most applications of hypothesis testing in machine learning are about feature/variable selection. There are lots of debates on whether hypothesis-testing-based feature selection is good or bad. The major criticism of this approach is that the entire process is built on the data used for model training, and the model performance on testing data is not considered at all.
I think it is still worth giving a brief introduction to hypothesis testing in linear regression, as it is still popular among data scientists with a statistician's mindset. I assume the readers already have basic ideas of hypothesis testing, p-values, and significance levels.
If you have done linear regressions using statistical software (R, Stata, SPSS, Minitab, etc.), you may have noticed that the output contains the p-value 42 and t-statistic of the coefficient of each variable. If the p-value is less than a pre-determined significance level (usually 0.1 or 0.05 in practice), the null hypothesis (always denoted as H0) should be rejected. The hypothesis against the null hypothesis is called the alternative hypothesis (denoted as H1). An example of H0 and H1 regarding model (3.2)

42 https://www.statsdirect.com/help/basics/p_values.htm

could be stated as:

$$
H_0: \beta_1 = 0 \qquad
H_1: \beta_1 \neq 0.
\qquad (3.8)
$$

If the p-value of $\beta_1$ suggests not rejecting H0, we may exclude the corresponding feature and re-fit the linear model.
The theory behind the hypothesis testing in (3.8) is not complex, but we have to make an additional assumption for our linear regression model. More specifically, we assume the error term $\epsilon$ follows a multivariate normal distribution, i.e., $\epsilon \sim N(0, \sigma^2 I)$. Based on this assumption, we can conclude that $\hat{\beta}$ also follows a multivariate normal distribution. Furthermore, we can construct a test statistic 43 containing $\hat{\beta}_j$ which follows a t-distribution 44. The additional assumption made for hypothesis testing is not required for least squares estimation; it is also not one of the assumptions for the Gauss-Markov theorem 45.
The example given in (3.8) focuses on a single coefficient. Actually, it is a special case of a more general hypothesis test, which is called a linear hypothesis test. In linear hypothesis testing, people are interested in a linear combination of the coefficients, i.e.,

H0 : Ab + b = 0
(3.9)
H1 : Ab + b 6= 0.

Linear hypothesis testing is usually conducted by constructing an F-distributed test statistic. The general hypothesis test is very powerful. For example, we may start from a full model with a larger list of variables to train a linear regression model; we may also exclude some variables from the full list to train a reduced model. By setting proper values of A and b in (3.9) we can conduct a linear hypothesis test to accept or reject the reduced model. For example, if the full model involves three variables and in the reduced model we only keep the first variable, we will set

$$A = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}.$$
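To make the full-vs-reduced comparison concrete, here is a numpy-only sketch of the F statistic on simulated data (the data-generating process and all names are made up for illustration; a statistics package would normally compute this for you):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# design: intercept plus three variables; only the first variable matters
X = np.hstack((np.ones((n, 1)), rng.normal(size=(n, 3))))
y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=n)

def ssr(design, y):
    """Sum of squared residuals of the OLS fit on the given design."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return resid @ resid

ssr_full = ssr(X, y)            # full model: all three variables
ssr_reduced = ssr(X[:, :2], y)  # reduced model: intercept + first variable
q = 2                           # number of restrictions (two dropped variables)
df = n - X.shape[1]             # residual degrees of freedom of the full model
F = ((ssr_reduced - ssr_full) / q) / (ssr_full / df)
# under H0 (the dropped variables are irrelevant), F follows an F(q, df)
# distribution; a large F rejects the reduced model
```

Since the models are nested, the reduced SSR can never be below the full SSR, so F is always non-negative.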

3.3 Ridge regression

What is the problem with solving the linear regression model specified by (3.3)? There is nothing wrong at all with that approach. But I would like to cite my favorite quote - "Essentially, all models are wrong, but some are useful" 46. (3.3) provides one solution to model (3.1); there are alternative solutions, such as lasso regression and ridge regression. In this section, let's focus on ridge regression.
What is ridge regression? Ridge regression doesn't change the model itself. Instead, it changes the way to estimate model parameters. In naive OLS, we minimize the SSR directly. In ridge regression, we change the objective function (commonly called the loss function in machine learning) by adding an
43 https://en.wikipedia.org/wiki/Test_statistic
44 https://en.wikipedia.org/wiki/Student%27s_t-distribution
45 https://en.wikipedia.org/wiki/Gauss-Markov_theorem
46 https://en.wikipedia.org/wiki/All_models_are_wrong

additional penalty

$$\min_{\hat{\beta}} \; e'e + \lambda\hat{\beta}'\hat{\beta}. \qquad (3.10)$$


The optimization problem (3.10) is still an unconstrained optimization problem and it is convex. It can
also be formulated as a constrained convex optimization problem as

$$
\min_{\hat{\beta}} \; e'e \quad \text{subject to} \quad \hat{\beta}'\hat{\beta} \le t.
\qquad (3.11)
$$

The theory behind ridge regression can be found in The Elements of Statistical Learning [1]. Let's turn our attention to the implementation of ridge regression. The solution to (3.11) can be obtained in the same way as the solution to (3.3), i.e.,

$$\hat{\beta} = (X'X + \lambda I)^{-1}X'y. \qquad (3.12)$$

Again, in practice we don’t use (3.12) to implement ridge regression for the same reasons that we don’t
use (3.6) to solve linear regression without penalty.
Actually, we don't need new techniques to solve (3.10). Let's transform the objective function in (3.10):

$$
e'e + \lambda\beta'\beta = \sum_{i=1}^{n}(y_i - x_i'\beta)^2 + \sum_{i=1}^{p}(0 - \sqrt{\lambda}\,\beta_i)^2, \qquad (3.13)
$$

where $x_i = [1, X_{1i}, X_{2i}, ..., X_{pi}]'$.


Let’s define an augmented data set:

$$
X_\lambda = \begin{bmatrix}
1 & X_{11} & X_{21} & \cdots & X_{p1} \\
1 & X_{12} & X_{22} & \cdots & X_{p2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_{1n} & X_{2n} & \cdots & X_{pn} \\
\sqrt{\lambda} & 0 & 0 & \cdots & 0 \\
0 & \sqrt{\lambda} & 0 & \cdots & 0 \\
0 & 0 & \sqrt{\lambda} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \sqrt{\lambda}
\end{bmatrix}, \quad
y_\lambda = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
\qquad (3.14)
$$

If we regress $y_\lambda$ on $X_\lambda$, the OLS solution is just what we are looking for. However, usually the penalty is not applied to the intercept. Thus, we modify $y_\lambda$ and $X_\lambda$ to


$$
X_\lambda = \begin{bmatrix}
1 & X_{11} & X_{21} & \cdots & X_{p1} \\
1 & X_{12} & X_{22} & \cdots & X_{p2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_{1n} & X_{2n} & \cdots & X_{pn} \\
0 & \sqrt{\lambda} & 0 & \cdots & 0 \\
0 & 0 & \sqrt{\lambda} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \sqrt{\lambda}
\end{bmatrix}, \quad
y_\lambda = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
\qquad (3.15)
$$
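The augmented-data trick can be verified numerically. The snippet below (made-up data, with the penalty on all coefficients and no intercept, for brevity) checks that OLS on the augmented data reproduces the closed-form ridge solution (3.12):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
lam = 0.5

# closed-form ridge solution, as in (3.12)
beta_direct = np.linalg.inv(X.T @ X + lam * np.eye(3)) @ X.T @ y

# augmented data: stack sqrt(lam) * I below X and zeros below y
X_aug = np.vstack((X, np.sqrt(lam) * np.eye(3)))
y_aug = np.concatenate((y, np.zeros(3)))
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
```

The two estimates agree to floating-point precision, which is why the implementation below only needs an OLS solver.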
Now we are ready to implement our own ridge regression model based on the description above. It is also common to normalize 47 the independent variables before applying ridge regression, to bring the different variables to the same order of magnitude.

chapter3/linear_regression_ridge.R
library(R6)
LR_Ridge = R6Class(
  "LR_Ridge",
  public = list(
    coef = NULL,
    mu = NULL,
    sd = NULL,
    lambda = NULL,
    initialize = function(lambda) {
      self$lambda = lambda
    },
    scale = function(x) {
      self$mu = apply(x, 2, mean)
      # population standard deviation, to stay consistent with StandardScaler
      self$sd = apply(x, 2, function(e) {
        sqrt((length(e) - 1) / length(e)) * sd(e)
      })
    },
    transform = function(x) {
      return(t((t(x) - self$mu) / self$sd))
    },
    fit = function(x, y) {
      self$scale(x)
      x_transformed = self$transform(x)
      x_lambda = rbind(x_transformed, diag(rep(sqrt(self$lambda), ncol(x))))
      y_lambda = c(y, rep(0, ncol(x)))
      self$qr_solver(cbind(c(rep(1, nrow(x)), rep(0, ncol(x))), x_lambda),
                     y_lambda)
    },
    qr_solver = function(x, y) {
      self$coef = qr.coef(qr(x), y)
    },
    predict = function(new_x) {
      new_x_transformed = self$transform(new_x)
      cbind(rep(1, nrow(new_x)), new_x_transformed) %*% self$coef
    }
  )
)

47 https://en.wikipedia.org/wiki/Normalization_(statistics)

Python

chapter3/linear_regression_ridge.py

import numpy as np
from sklearn.preprocessing import StandardScaler

class LR_Ridge:
    def __init__(self, l):
        self.l = l
        self.coef = None
        self.scaler = StandardScaler()

    def qr_solver(self, x, y):
        q, r = np.linalg.qr(x)
        p = np.dot(q.T, y)
        return np.dot(np.linalg.inv(r), p)

    def fit(self, x, y):
        x_transformed = self.scaler.fit_transform(x)
        x_lambda = np.vstack(
            (x_transformed, np.diag([self.l ** 0.5] * x.shape[1])))
        x_lambda = np.hstack(
            (np.vstack((np.ones((x.shape[0], 1)), np.zeros((x.shape[1], 1)))),
             x_lambda))
        y_lambda = np.hstack((y, np.zeros((x.shape[1],))))
        self.coef = self.qr_solver(x_lambda, y_lambda)

    def predict(self, x):
        new_x_transformed = self.scaler.transform(x)
        new_x_transformed = np.hstack(
            (np.ones((x.shape[0], 1)), new_x_transformed))
        return np.dot(new_x_transformed, self.coef)

In R, we implement our own scaler; in the Python implementation, we use StandardScaler from the sklearn.preprocessing module, which is very handy. Please pay attention to how we calculate the standard deviation 48 in R - we use the population standard deviation formula rather than the sample standard deviation formula. Which formula is used doesn't really matter in practice; I chose the population formula in order to generate results consistent with StandardScaler, since StandardScaler uses the population standard deviation formula.
The selection of the best $\lambda$ requires solving the OLS problem repeatedly with different values of $\lambda$, which implies the QR decomposition procedure would be called multiple times. Using the SVD decomposition could be more efficient for selecting the best $\lambda$, but that is not covered in the current version of this book.
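The population-vs-sample distinction above can be checked with numpy's ddof argument (a small sketch; StandardScaler's behavior corresponds to ddof=0, while R's sd() corresponds to ddof=1):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
pop_sd = x.std(ddof=0)     # population formula: divide by n
sample_sd = x.std(ddof=1)  # sample formula: divide by n - 1
# pop_sd is sqrt(1.25) and sample_sd is sqrt(5/3), so they differ
```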
We are ready to run our own ridge regression on the Boston dataset.

> source('linear_regression_ridge.R')
>
> library(MASS)  # load Boston data from this package
>
> # let's try lambda = 0.5
> ridge = LR_Ridge$new(0.5)
> ridge$fit(data.matrix(Boston[, -ncol(Boston)]), Boston$medv)
> print(ridge$coef)
                    crim          zn       indus        chas         nox
22.53280632 -0.92396151  1.07393055  0.12895159  0.68346136 -2.04275750
         rm         age         dis         rad         tax     ptratio
 2.67854971  0.01627328 -3.09063352  2.62636926 -2.04312573 -2.05646414
      black       lstat
 0.84905910 -3.73711409
> # let's make prediction on the same data
> pred = ridge$predict(data.matrix(Boston[, -ncol(Boston)]))
> print(pred[1:5])
[1] 30.01652 25.02429 30.56839 28.61521 27.95385

48 https://en.wikipedia.org/wiki/Standard_deviation

Python

>>> from sklearn.datasets import load_boston
>>> from linear_regression_ridge import LR_Ridge
>>>
>>> boston = load_boston()
>>> X, y = boston.data, boston.target
>>> # now, let's run our own ridge regression
...
>>> ridge = LR_Ridge(0.5)
>>> ridge.fit(X, y)
>>> print(ridge.coef)
[ 2.25328063e+01 -9.23961511e-01  1.07393055e+00  1.28951591e-01
  6.83461360e-01 -2.04275750e+00  2.67854971e+00  1.62732755e-02
 -3.09063352e+00  2.62636926e+00 -2.04312573e+00 -2.05646414e+00
  8.49059103e-01 -3.73711409e+00]
>>> print(ridge.predict(X)[:5])
[30.01652397 25.02429359 30.56839459 28.61520864 27.95385422]

It's exciting to see that the outputs from R and Python are quite consistent.
Bibliography

[1] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA, 2001.

[2] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
