Key Features:
- Teaches R and Python in a “side-by-side” way
- Examples are tailored to aspiring data scientists and statisticians, not software engineers
- Designed for introductory graduate students
- Does not assume any mathematical background
An Introduction to R and
Python For Data Analysis
A Side-By-Side Approach
Taylor R. Brown
Designed cover image: © Taylor R. Brown
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the
accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® software or related products
does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular
use of the MATLAB® software.
First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot as-
sume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders
if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please
write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including pho-
tocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not
available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.
DOI: 10.1201/9781003263241
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
Contents

Welcome
Preface
2 Basic Types
2.1 Basic Types in Python
2.1.1 Type Conversions in Python
2.2 Basic Types in R
2.2.1 Type Conversions in R
2.2.2 R’s Simplification
2.3 Exercises
2.3.1 R Questions
2.3.2 Python Questions
6 Functions
6.1 Defining R Functions
6.2 Defining Python Functions
6.3 More Details on R’s User-Defined Functions
6.4 More Details on Python’s User-Defined Functions
6.5 Function Scope in R
6.6 Function Scope in Python
6.7 Modifying a Function’s Arguments
6.7.1 Passing by Value in R
6.7.2 Passing by Assignment in Python
6.8 Accessing and Modifying Captured Variables
6.8.1 Accessing Captured Variables in R
6.8.2 Accessing Captured Variables in Python
6.8.3 Modifying Captured Variables in R
6.8.4 Modifying Captured Variables in Python
6.9 Exercises
6.9.1 R Questions
6.9.2 Python Questions
7 Categorical Data
7.1 factors in R
7.2 Two Options for Categorical Data in Pandas
7.3 Exercises
7.3.1 R Questions
7.3.2 Python Questions
8 Data Frames
8.1 Data Frames in R
8.2 Data Frames in Python
8.3 Exercises
8.3.1 R Questions
8.3.2 Python Questions
13 Visualization
13.1 Base R Plotting
13.2 Plotting with ggplot2
13.3 Plotting with Matplotlib
13.4 Plotting with Pandas
13.5 Exercises
13.5.1 R Questions
13.5.2 Python Questions
Bibliography
Index
List of Figures

1.1 RStudio.
1.2 Anaconda navigator.
1.3 Spyder.
Welcome

License(s)

The textbook is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The code used to generate the text is licensed under a Creative Commons Zero v1.0 Universal license.
Preface
Conventions
Sometimes R and Python code look very similar, or even identical. This is why I usually keep R code and Python code in separate sections. However, sometimes I do not, so whenever it is necessary to prevent confusion, I will remind you which language is being used in comments (more about comments in Section 1.2).
# in python
print('hello world')
## hello world
# in R
print('hello world')
## [1] "hello world"
1 https://cran.r-project.org/
2 https://www.rstudio.com/products/rstudio/download/#download
3 https://docs.anaconda.com/anaconda/install/#
Part I
1 Introduction

Now that you have both R and Python installed, we can get started by taking
a tour of our two different integrated development environments (IDEs),
RStudio and Spyder.
In addition, I will discuss a few topics superficially, so that we can get our feet wet:
• printing,
• creating variables, and
• calling functions.
1.1 Hello World in R

print('hello R world')
## [1] "hello R world"
During the semester, we will write more complicated code. Complicated code
is usually written incrementally and stored in a text file called a script. Click
File -> New File -> R Script to create a new script. It should appear at the
top left of the RStudio window (see Figure 1.1). After that, copy and paste
the following code into your script window:
print('hello world')
print("this program")
This script will run five print statements and then create a variable called
myName. The print statements are of no use to the computer and will not affect
how the program runs. They just display messages to the human running the
code.
The variable created on the last line is more important because it is used by
the computer, and so it can affect how the program runs. The operator <- is
the assignment operator1 . It takes the character constant "Taylor", which
is on the right, and stores it under the name myName. If we added lines to this
program, we could refer to the variable myName in subsequent calculations.
Save this file wherever you want on your hard drive. Call it awesomeScript.R.
Personally, I saved it to my desktop.
After we have a saved script, we can run it by sending all the lines of code over
to the console. One way to do that is by clicking the Source button at the top
right of the script window (see Figure 1.1). Another way is that we can use
R’s source() function2 . We can run the following code in the console:
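setwd("Desktop/")  # assuming the script was saved to the desktop, as above
source("awesomeScript.R")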
The first line changes the working directory3 to Desktop/. The working
directory is the first place your program looks for files. You, dear reader,
should change this line by replacing Desktop/ with whichever folder you chose
to save awesomeScript.R in. If you would like to find out what your working
directory is currently set to, you can use getwd().
1 https://stat.ethz.ch/R-manual/R-devel/library/base/html/assignOps.html
2 A third way is to tell R to run awesomeScript.R from the command line, but unfortunately, this will not be discussed in this text.
3 https://stat.ethz.ch/R-manual/R-devel/library/base/html/getwd.html
The second line calls source(). This function finds the script file and executes
all the commands found in that file sequentially.
Deleting all saved variables, and then source()ing your script can be a very good debugging strategy. You can remove all saved variables by running rm(list=ls()). Don’t worry—the variables will come back as soon as you source() your entire script again!
Recall that we will exclusively assume the use of Spyder (see Figure 1.3) in
this textbook. Open that up now; it should look something like Figure 1.3.
It looks a lot like RStudio, right? The script window is still on the left hand
side, but it takes up the whole height of the window this time. However, you
will notice that the console window has moved. It’s over on the bottom right
now.
Again, you might notice a lot more white when you open this for the first time.
Just like last time, I changed my color scheme. You can change yours by going
to Tools -> Preferences and then exploring the options available under the
Appearances tab.
Already we have many similarities between our two languages. Both R and
Python have a print() function, and they both use the same symbol to start a
comment: #. Finally, they both define character/string constants with quotation
marks. In both languages, you can use either single or double quotes.
We will also show below that both languages share the same three ways to run
scripts. Nice!
Let’s try writing our first Python script. R scripts end in .r or .R, while Python
scripts end in .py. Call this file awesomeScript.py.
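A minimal awesomeScript.py might look like this (the exact contents are illustrative):

# in python
print('hello world')
print('this is my first python script')
my_name = "Taylor"  # my_name mirrors the R script's myName; an assumption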
Just like RStudio, Spyder has a button that runs the entire script from start
to finish. It’s the green triangle button (see Figure 1.3).
You can also write code to run awesomeScript.py. There are a few ways to do
this, but here’s the easiest.
import os
os.chdir('/home/taylor/Desktop')
runfile("awesomeScript.py")
This is also pretty similar to the R code from before. os.chdir() sets our
working directory to the Desktop. Then runfile() runs all of the lines in our
program, sequentially, from start to finish5 .
The first line is new, though. We did not mention anything like this in R, yet.
We will talk more about importing modules in section 10.4. Suffice it to say
that we imported the os module to make the chdir() function available to us.
4 You can use this symbol in R, too, but it is less common.
5 Python, like R, allows you to run scripts from the command line, but this will not be discussed in this text.
On macOS, these file paths will look very similar. The folder home/ will most likely be replaced with Users/.
On Windows, things are a bit different. For one, a full path starts with a
drive (e.g. C:). Second, there are backslashes (not forward slashes) to separate
directory names (e.g. C:\Users\taylor\Desktop).
Unfortunately, backslashes are a special character in both R and Python (read
Section 3.9 to find out more about this). Whenever you type a \, it will change
the meaning of whatever comes after it. In other words, \ is known as an
escape character.
The recommended way of handling this is to just use forward slashes instead. For example, if you are running Windows, C:/Users/taylor/Desktop/myScript.R will work in R, and C:/Users/taylor/Desktop/myScript.py will work in Python.
You may also use “raw string constants” (e.g. r'C:\Users\taylor\my_file.txt'). “Raw” means that \ will be treated as a literal character instead of an escape character. Alternatively, you can “escape” the backslashes by replacing each single backslash with a double backslash. Please read Section 3.9 for more details about these choices.
2 Basic Types
Strings are useful for processing text data such as names of people/places/things and messages such as texts, tweets and emails (Beazley and Jones, 2014). If you are dealing with numbers, you need floating points if you have a number that might have a fractional part after its decimal; otherwise, you’ll need an integer. Booleans are useful for situations where you need to record whether something is true or false.
1 https://docs.python.org/3/library/stdtypes.html
my_int = 1
my_float = 3.2
my_sum = my_int + my_float
print("my_int's type", type(my_int))
## my_int's type <class 'int'>
print("my_float's type", type(my_float))
## my_float's type <class 'float'>
print(my_sum)
## 4.2
print("my_sum's type", type(my_sum))
## my_sum's type <class 'float'>
3.2 + "3.2"
2 https://numpy.org/doc/stable/user/basics.types.html
my_date = "5/2/2021"
month_day_year = my_date.split('/')
my_year = int(month_day_year[-1])
print('my_year is', my_year, 'and its type is', type(my_year))
## my_year is 2021 and its type is <class 'int'>
2.2 Basic Types in R

myInt = 1
myDouble = 3.2
mySum = myInt + myDouble
print(paste0("my_int's type is ", typeof(myInt)))
## [1] "my_int's type is double"
print(paste0("my_float's type is ", typeof(myDouble)))
3 “double” is short for “double precision floating point.” In other programming languages, the programmer might choose how many decimal points of precision he or she wants.
print(typeof(1))
## [1] "double"
print(typeof(as.logical(1)))
## [1] "logical"
2.3 Exercises
2.3.1 R Questions
1.
Which R base type is ideal for each piece of data? Assign your answers to a
character vector of length four called questionOne.
a) An individual’s IP address
b) whether or not an individual attended a study
c) the number of seeds found in a plant
d) the amount of time it takes for a car to race around a track
2.
Floating points are weird. What gets printed is not the same as what is stored!
In R, you can control how many digits get printed by using the options
function.
a) Assign a to 2/3
b) print a, and copy/paste what you see into the variable aPrint. Make
sure it is a character.
c) Take a look at the documentation for options. Assign the value of
options()$digits to numDigitsStart
d) Change the number of digits to 22
e) Again, print a, and copy/paste what you see into the variable
aPrintv2. Make sure it is a character.
f) Assign the output of options()$digits to numDigitsEnd
3.
Floating points are weird. What gets stored might not be what you want. “The
only numbers that can be represented exactly in R’s numeric type are integers
and fractions whose denominator is a power of 2.”4 As a consequence, you
should never test strict equality (i.e. using ==) between two floating points.
2.3.2 Python Questions

1.

Which Python type is ideal for each piece of data? Assign your answers to a list of strings called question_one.
4 https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f
a) An individual’s IP address
b) whether or not an individual attended a study
c) the number of seeds found in a plant
d) the amount of time it takes for a car to race around a track
2.
Floating points are weird. What gets printed is not the same as what is stored!
In Python, you need to edit a class’s __str__ method if you want to control
how many digits get printed for a user-defined type/class, but we won’t do
that. Instead, we’ll use str.format()5 to return a string directly (instead of
copy/paste-ing it).
a) Assign a to 2/3
b) print a, and copy/paste what you see into the variable a_print
c) Create a str that displays 22 digits of 2/3. Call it a_printv2
d) print the above string
3.
Floating points are weird. What gets stored might not be what you want. The
Python documentation has an excellent discussion of how storage behavior can
be surprising. Click here6 to read it.
5 https://docs.python.org/3/library/stdtypes.html#str.format
6 https://docs.python.org/3/tutorial/floatingpoint.html
3 R Vectors versus Numpy Arrays and Pandas’ Series
This section describes the data types that let us store collections of
elements that all share the same type. Data is very commonly stored in this
fashion, so this section is quite important. Once we have one of these collection
objects in a program, we will be interested in learning how to extract and
modify different elements in the collection, as well as how to use the entire
collection in an efficient calculation.
3.1 Overview of R
In the previous section, I mentioned that R does not have scalar types—it
just has vectors1 . So, whether you want to store one number (or logical, or
character, or …), or many numbers, you will need a vector.
For many, the word “vector” evokes an impression that these objects are designed to be used for performing matrix arithmetic (e.g. inner products, transposes, etc.). You can perform these operations on vectors, but in my opinion, this preconception can be misleading, and I recommend avoiding it. Most of the things you can do with vectors in R have little to do with linear algebra!
How do we create one of these? There are many ways. One common way is to
read in elements from an external data set. Another way is to generate vectors
from code.
1 https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Vector-objects
# replicate a number five times
rep(2, 5)
## [1] 2 2 2 2 2
# combine elements without relying on a pattern
c("5/2/2021", "5/3/2021", "5/4/2021")
## [1] "5/2/2021" "5/3/2021" "5/4/2021"
# generate Gaussian random variables
rnorm(5)
## [1] 0.6403955 0.2524081 1.3740108 2.2544786 -1.2988103
c() is short for “combine”. seq() and rep() are short for “sequence” and “repli-
cate”, respectively. rnorm() samples normal (or Gaussian) random variables.
There is plenty more to learn about these functions, so I encourage you to take
a look at their documentation.
There are five ways to create numpy arrays (source2 ). Here are some examples
that complement the examples from above.
import numpy as np
np.array([1,2,3])
## array([1, 2, 3])
np.arange(1,12,2)
## array([ 1, 3, 5, 7, 9, 11])
np.random.normal(size=3)
## array([0.64255822, 0.01151642, 0.32897288])
import pandas as pd
first = pd.Series([2, 4, 6])
second = pd.Series([2, 4, 6], index = ['a','b','c'])
print(first[0])
## 2
print(second['c'])
## 6
3.3 Vectorization in R
An operation in R is vectorized if it applies to all of the elements of a vector
at once. An operator that is not vectorized can only be applied to individual
elements. In that case, the programmer would need to write more code to
instruct the function to be applied to all of the elements of a vector. You should
prefer writing vectorized code because it is usually easier to read. Moreover,
many of these vectorized functions are written in compiled code, so they can
often be much faster.
2 https://numpy.org/doc/stable/user/basics.creation.html
3 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas-series
4 https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html#Series-as-generalized-NumPy-array
Arithmetic (e.g. +, -, *, /, ^, %%, %/%, etc.) and logical (e.g. !, |, &, >, >=, <, <=,
==, etc.) operators are commonly applied to one or two vectors. Arithmetic
is usually performed element-by-element. Numeric vectors are converted to
logical vectors if they need to be. Be careful of operator precedence if you seek
to minimize your use of parentheses.
Note that there are an extraordinary number of named functions (e.g. sum(),
length(), cumsum(), etc.) that operate on entire vectors, as well. Here are
some examples:
(1:3) * (1:3)
## [1] 1 4 9
(1:3) == rev(1:3)
## [1] FALSE TRUE FALSE
sin( (2*pi/3)*(1:4))
## [1] 8.660254e-01 -8.660254e-01 -2.449294e-16 8.660254e-01
(1:3) * (1:4)
## Warning in (1:3) * (1:4): longer object length is not a multiple of shorter
## object length
## [1] 1 4 9 4
5 https://numpy.org/doc/stable/reference/ufuncs.html
3.4 Vectorization in Python

Ufuncs are called unary if they take in one array, and binary if they take
in two. At the moment, there are fewer than 100 available6 , all performing
either mathematical operations, boolean-emitting comparisons, or bit-twiddling
operations. For an exhaustive list of Numpy’s universal functions, click here.7
Here are some code examples:
np.arange(1,4)*np.arange(1,4)
## array([1, 4, 9])
np.zeros(5) > np.arange(-3,2)
## array([ True, True, True, False, False])
np.exp( -.5 * np.linspace(-3, 3, 10)**2) / np.sqrt( 2 * np.pi)
## array([0.00443185, 0.02622189, 0.09947714, 0.24197072, 0.37738323,
## 0.37738323, 0.24197072, 0.09947714, 0.02622189, 0.00443185])
np.arange(1,3)*np.arange(1,4) # raises ValueError: operands could not be broadcast together
If you are working with string arrays, Numpy has a np.char module with many
useful functions9 .
a = np.array(['a','b','c'])
np.char.upper(a)
## array(['A', 'B', 'C'], dtype='<U1')
Then there are the Series objects from Pandas. Ufuncs continue to work in
the same way on Series objects, and they respect common index values10 .
s1 = pd.Series(np.repeat(100,3))
s2 = pd.Series(np.repeat(10,3))
6 https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs
7 https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs
8 https://numpy.org/devdocs/user/theory.broadcasting.html
9 https://numpy.org/doc/stable/reference/routines.char.html#module-numpy.char
10 https://jakevdp.github.io/PythonDataScienceHandbook/03.03-operations-in-pandas.html
s1 + s2
## 0 110
## 1 110
## 2 110
## dtype: int64
If you feel more comfortable, and you want to coerce these Series objects to
Numpy arrays before working with them, you can do that. For example, the
following works.
s = pd.Series(np.linspace(-1,1,5))
np.exp(s.to_numpy())
## array([0.36787944, 0.60653066, 1. , 1.64872127, 2.71828183])
ints = pd.Series(np.arange(10))
ints.abs()
## 0 0
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## dtype: int64
ints.mean()
## 4.5
ints.floordiv(2)
## 0 0
## 1 0
## 2 1
## 3 1
## 4 2
11 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas-series
## 5 2
## 6 3
## 7 3
## 8 4
## 9 4
## dtype: int64
Series objects that have text data12 are a little bit different. For one, you
have to access the .str attribute of the Series before calling any vectorized
methods13 . Here are some examples.
s = pd.Series(['a','b','c','33'])
s.dtype
## dtype('O')
s.str.isdigit()
## 0 False
## 1 False
## 2 False
## 3 True
## dtype: bool
s.str.replace('a', 'z')
## 0 z
## 1 b
## 2 c
## 3 33
## dtype: object
String operations can be a big game changer, and we discuss text processing
strategies in more detail in Section 3.9.
12 https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#working-with-text-data
13 https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html
To access the first element, we use the index 1. To access the second, we
use 2, and so on. Also, the - sign tells R to remove elements. Both of these
functionalities are very different from Python, as we will see shortly.
We can use names to access elements, too, but only if the elements are named.
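For example (a sketch mirroring the Python examples below; the vector's name is an assumption):

# in R
oneThroughTen <- 1:10
oneThroughTen[1] # first element
## [1] 1
oneThroughTen[-1] # remove the first element
## [1] 2 3 4 5 6 7 8 9 10
names(oneThroughTen) <- letters[1:10]
oneThroughTen["c"] # access by name
## c
## 3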
import pandas as pd
one_through_ten = pd.Series(np.arange(1, 11))
one_through_ten[np.array([2,3])]
## 2 3
## 3 4
## dtype: int64
one_through_ten[1:10:2] # evens
## 1 2
## 3 4
## 5 6
## 7 8
## 9 10
## dtype: int64
one_through_ten[::-1] # reversed
## 9 10
15 https://pandas.pydata.org/docs/reference/api/pandas.Series.html
## 8 9
## 7 8
## 6 7
## 5 6
## 4 5
## 3 4
## 2 3
## 1 2
## 0 1
## dtype: int64
one_through_ten[-2] = 99 # careful: this adds a new element labeled -2; it is not "second to last"
one_through_ten
## 0 1
## 1 2
## 2 3
## 3 4
## 4 5
## 5 6
## 6 7
## 7 8
## 8 9
## 9 10
## -2 99
## dtype: int64
one_through_ten[one_through_ten > 3] # bigger than three
## 3 4
## 4 5
## 5 6
## 6 7
## 7 8
## 8 9
## 9 10
## -2 99
## dtype: int64
one_through_ten.sum()
## 154
However, Pandas’ Series have .loc and .iloc methods16 . We won’t talk much
about these two methods now, but they will become very important when we
start to discuss Pandas’ data frames in Section 8.2.
one_through_ten.iloc[2]
## 3
one_through_ten.loc[2]
## 3
3.8 Some Gotchas

# in R
a <- c(1,2,3)
b <- a
b[1] <- 999
a # still the same!
## [1] 1 2 3
# in python
a = np.array([1,2,3])
b = a # b is an alias
c = a.view() # c is a view
d = a[:]
16 https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#different-choices-for-indexing
17 https://numpy.org/devdocs/user/quickstart.html#copies-and-views
b[0] = 999
a # two names for the same object in memory
## array([999, 2, 3])
b
## array([999, 2, 3])
c
## array([999, 2, 3])
d
## array([999, 2, 3])
It’s the same story with Pandas’ Series objects. You’re usually making a
“shallow” copy.
# in python
import pandas as pd
s1 = pd.Series(np.array([100.0,200.0,300.0]))
s2 = s1
s3 = s1.view()
s4 = s1[:]
s1[0] = 999
s1
## 0 999.0
## 1 200.0
## 2 300.0
## dtype: float64
s2
## 0 999.0
## 1 200.0
## 2 300.0
## dtype: float64
s3
## 0 999.0
## 1 200.0
## 2 300.0
## dtype: float64
s4
## 0 999.0
## 1 200.0
## 2 300.0
## dtype: float64
If you want a “deep copy” in Python, you usually want a function or method
called copy(). Use np.copy or np.ndarray.copy18 when you have a Numpy
array.
# in python
a = np.array([1,2,3])
b = np.copy(a)
c = a.copy()
b[0] = 999
a
## array([1, 2, 3])
b
## array([999, 2, 3])
c
## array([1, 2, 3])
Use pandas.Series.copy19 with Pandas’ Series objects. Make sure not to set
the deep argument to False. Otherwise you’ll get a shallow copy.
# in python
s1 = pd.Series(np.array([1,2,3]))
s2 = s1.copy()
s3 = s1.copy(deep=False)
s1[0] = 999
s1
## 0 999
## 1 2
## 2 3
## dtype: int64
s2
## 0 1
## 1 2
## 2 3
## dtype: int64
s3
## 0 999
## 1 2
18 https://numpy.org/doc/stable/reference/generated/numpy.ndarray.copy.html#numpy-ndarray-copy
19 https://pandas.pydata.org/docs/reference/api/pandas.Series.copy.html#pandas-series-copy
## 2 3
## dtype: int64
NULL == FALSE
## logical(0)
NULL == NULL
## logical(0)
# create a function that doesn't return anything
# more information on this later
doNothingFunc <- function(a){}
thing <- doNothingFunc() # call our new function
is.null(thing)
## [1] TRUE
typeof(NULL)
## [1] "NULL"
None == False
## False
None == None
## True
# create a function that doesn't return anything
# more information on this later
def do_nothing_func():
20 https://cran.r-project.org/doc/manuals/r-release/R-lang.html#NULL-object
    pass
thing = do_nothing_func()
if thing is None:
print("thing is None!")
## thing is None!
type(None)
## <class 'NoneType'>
“NaN” stands for “not a number.” NaN is an object of type double in R, and
np.nan is of type float in Python. It can come in handy when you (deliberately
or accidentally) perform undefined calculations such as 0/0 or ∞/−∞.
# in R
0/0
## [1] NaN
Inf/Inf
## [1] NaN
is.na(0/0)
## [1] TRUE
# in Python
# 0/0
# the above yields a ZeroDivisionError
import numpy as np
np.inf/np.inf
## nan
np.isnan(np.nan)
## True
“NA” is short for “not available.” Missing data is a fact of life in data science.
Observations are often missing in data sets, introduced after joining/merging
data sets together (more on this in Section 12.3), or arise from calculations
involving underflow and overflow. There are many techniques designed to
estimate quantities in the presence of missing data. When you code them up,
you’ll need to make sure you deal with NAs properly.
# in R
babyData <- c(0,-1,9,NA,21)
NA == TRUE
## [1] NA
is.na(babyData)
## [1] FALSE FALSE FALSE TRUE FALSE
typeof(NA)
## [1] "logical"
import numpy as np
import numpy.ma as ma
baby_data = ma.array([0,-1,9,-9999, 21]) # -9999 "stands for" missing
baby_data[3] = ma.masked
np.average(baby_data)
## 7.25
3.9 An Introduction to Regular Expressions

Pandas’ Series objects provide many tools for processing string data. The same tools can be used whether or not these Series objects are contained in a Pandas DataFrame.
Regarding R, character vectors were first mentioned in Section 3.1. There
are many functions that operate on these, too, regardless of whether they are
held in a data.frame. The functions might be a little harder to find because
they aren’t methods, so pressing <Tab> and using your GUI’s autocomplete
feature doesn’t reveal them as easily.
Suppose you’re interested in replacing lowercase letters with uppercase ones,
removing certain characters from text, or counting the number of times a
certain expression appears. Up until now, as long as you can find a function
or method that performs the task, you were doing just fine. If you need to do
something with text data, there’s probably a function for it.
Notice what all of these tasks have in common—they all require the ability to
find patterns. When your patterns are easy to describe (e.g. find all lowercase
“a”s), then all is well. What can make matters more complicated, however,
is when the patterns are more difficult to describe (e.g. find all valid email
addresses). That is why this section is primarily concerned with discussing
regular expressions, which are a tool that helps you describe the patterns in text (Wickham and Grolemund, 2017; López, 2014).
Each character in a regular expression is interpreted either as a

1. literal character, or as a
2. metacharacter.
If it is a literal character, then the character is the literal pattern. For example,
in the regular expression “e”, the character “e” has a literal interpretation. If
you seek to capitalize all instances of “e” in the following phrase, you can do it
pretty easily. As long as you know which function performs find-and-replace,
you’re good. The pattern is trivial to specify.
On the other hand, if I asked you to remove $s from price or salary data, you
might have a little more difficulty. This is because $ is a metacharacter in
regular expressions, and so it has a special meaning.25 In the examples below,
25 The dollar sign is useful if you only want to find certain patterns that finish a line. It takes the characters preceding it, and says, only look for that pattern if it comes at the end of a string.
# in R
gsub(pattern = "e", replacement = "E",
x = "I don't need a regex for this!")
## [1] "I don't nEEd a rEgEx for this!"
# in Python
import pandas as pd
s = pd.Series(["I don't need a regex for this!"])
s.str.replace(pat="e",repl="E")
## 0 I don't nEEd a rEgEx for this!
## dtype: object
On the other hand, here are a few examples that remove dollar signs. We gener-
ally have two options to recognize symbols that happen to be metacharacters.
1. We can escape the dollar sign. That means you need to put a backslash (i.e. \) before the dollar sign. The backslash is a metacharacter that looks at the character coming after it, and it either removes the special meaning from a metacharacter, or adds special meaning to a literal character.
2. Alternatively, we can tell the function to ignore regular expres-
sions. gsub() can take fixed=TRUE, and .str.replace() can take
regex=False.
# in Python
pd.Series(["$100, $200"]).str.replace(pat="$",repl="",regex=False)
## 0 100, 200
## dtype: object
pd.Series(["$100, $200"]).str.replace(pat="\$",repl="")
## 0 100, 200
## dtype: object
26 https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html
# in R
gsub(pattern = "$", replacement = "", x = c("$100, $200"), fixed=TRUE)
## [1] "100, 200"
stringr::str_remove_all(c("$100, $200"), pattern = "\\$")
## [1] "100, 200"
nchar('\n') #in R
## [1] 1
strs in Python and character vectors in R will look for these combinations
by default. When we specify regular expressions with strings, the backslashes
will be used first for this purpose. Their regular expression purpose is a second
priority.
The reason we used \\$ in the above example is that the first backslash escapes the second. \$ is not a recognized escape sequence, and Python and R will handle it differently. Python will not recognize it, and it won't complain that it didn't. On the other hand, R will throw an error that it can't recognize it.
There is another way to deal with this issue—raw strings! Raw strings make
life easier because they do not interpret backslashes as the beginning of escape
sequences. You can make them in R and Python by putting an “r” in front of
the quotation marks. However, it is slightly more complicated in R because
you need a delimiter pair inside the quotation marks—for more information
type ?Quotes in your R console.
len(r'\n') # in Python
## 2
nchar(r'{\$}') # in R
## [1] 2
Many character classes are written with an opening and a closing square bracket. For instance, [1-5] matches any digit between 1 and 5 (inclusive), [aeiouy] matches any lowercase vowel, and [\^\-] matches either ^ or - (we had to escape these two metacharacters because we are only interested in the literal pattern).
# remove vowels in R
gsub(pattern = "[aeiouy]", replacement = "",
x = "Can you still read this?")
## [1] "Cn stll rd ths?"
Concatenating two patterns, one after another, forms a more specific pattern
to be matched.
If you would like one pattern or another to appear, you can use the alternation
operator |.
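For instance (a sketch; the strings are illustrative):

# in Python
import pandas as pd
pd.Series(['catfish', 'dogwood', 'bird']).str.contains('cat|dog')
## 0 True
## 1 True
## 2 False
## dtype: bool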
Searching for a double "o" shows a subtlety:

# in Python
s = pd.Series(['hellooo', 'hello'])
s.str.contains('oo')
## 0 True
## 1 False
## dtype: bool
Notice in the double “o” example, the word with three matched. To describe
that not being desirable requires the ability to look ahead of the match, to the
next character, and evaluate that. You can look ahead, or behind, and make
assertions about what patterns are required or disallowed.
However, this does not successfully remove "hellooo" because it will match
on the last two “o”s of the word. To prevent this, we can prepend a (?<!o),
which disallows a leading “o”, as well. In R, we also have to specify perl=TRUE
to use Perl-compatible regular expressions.
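Here is a sketch of both attempts (the strings are illustrative):

# in Python
s = pd.Series(['helloo', 'hellooo'])
s.str.replace('oo(?!o)', '', regex=True) # still matches the last two "o"s of 'hellooo'
## 0 hell
## 1 hello
## dtype: object
s.str.replace('(?<!o)oo(?!o)', '', regex=True) # leaves 'hellooo' alone
## 0 hell
## 1 hellooo
## dtype: object
# in R
gsub('(?<!o)oo(?!o)', '', c('helloo', 'hellooo'), perl = TRUE)
## [1] "hell" "hellooo"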
We also mention anchoring. If you only want to find a pattern at the beginning
of text, use ^. If you only want to find a pattern at the end of text, use $. Below
we use .str.extract()27 , whose documentation makes reference to capture
27 https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html
groups. Capture groups are just regular expressions grouped inside parentheses
(e.g. (this)).
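For instance, a sketch (the strings are illustrative):

# in Python
s = pd.Series(['file123', '123file'])
s.str.extract(r'^([a-z]+)') # letters, but only at the start
## 0
## 0 file
## 1 NaN
s.str.extract(r'(\d+)$') # digits, but only at the end
## 0
## 0 123
## 1 NaN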
3.10 Exercises
3.10.1 R Questions
1.
Let’s flip some coins! Generate a thousand flips of a fair coin. Use rbinom, and
let heads be coded as 1 and tails coded as 0.
a) Assign the thousand raw coin flips to a variable flips. Make sure the
elements are integers, and make sure you flip a “fair” coin (𝑝 = .5).
b) Create a length 1000 logical vector called isHeads. Whenever you
get a heads, make sure the corresponding element is TRUE and FALSE
otherwise.
c) Create a variable called numHeads by tallying up the number of heads.
d) Calculate the percent of time that the number changes in flips.
Assign your number to acceptanceRate. Try to write only one line of
code to do this.
2.
Compute the elements of the tenth order Taylor approximation to exp(3) and
store them in taylorElems. Do not sum them. Use only one expression, and
do not use any loop. The approximation is

$$\exp(3) \approx \sum_{k=0}^{10} \frac{3^k}{k!}.$$
You want to store each of those eleven numbers separately in a numeric vector.
3.
Do the following.
a) Create a vector called nums that contains all consecutive integers from
−100 to 100.
b) Create a logical vector that has the same length as the above, and
contains TRUE whenever an element of the above is even, and FALSE
otherwise. Call it myLogicals.
c) Assign the total number of TRUEs to totalEven.
d) Create a vector called evens of all even numbers from the above
vector.
e) Create a vector called reverse that holds the reversed elements of
nums.
4.
or most of the 𝑥𝑖 s are large, then we might bump up against the largest
allowable number. This is the problem of overflow. The biggest integer and
biggest floating point can be recovered by typing .Machine$integer.max and
.Machine$double.xmax, respectively.
$$\log\left(\sum_{i=1}^{10} x_i\right) = \log\left(\sum_{i=1}^{10} \exp[\log(x_i) - m]\right) + m$$
$m$ is usually chosen to be $\max_i \log x_i$. This is the same formula as above, which is nice. You can use the same code to combat both overflow and underflow.

e) If you’re writing code, and you have a bunch of very large numbers, is it better to store those numbers, or store the logarithm of those numbers? Assign your answer to whichBetter. Use either the phrase "logs" or "nologs".
5.
a) Use the natural logarithm and convert this vector into a vector of log
returns. Call the variable logReturns. If 𝑝𝑡 is the price at time 𝑡, the
log return ending at time 𝑡 is
$$r_t = \log\left(\frac{p_t}{p_{t-1}}\right) = \log p_t - \log p_{t-1} \tag{3.1}$$
b) Do the same for arithmetic returns. These are regular percent changes
if you scale by 100. Call the variable arithReturns. The mathematical
formula you need is
$$a_t = \left(\frac{p_t - p_{t-1}}{p_{t-1}}\right) \times 100 \tag{3.2}$$
6.
$$Y \mid X = x \sim \text{Normal}(0, x^2) \tag{3.3}$$
and
The following steps will demonstrate how you can use the Monte-Carlo
(Robert and Casella, 2005) method to approximate this probability.
3.10.2 Python Questions

1.
Let’s flip some coins (again)! Generate a thousand flips of a fair coin. Use
np.random.binomial, and let heads be coded as 1 and tails coded as 0.
a) Assign the thousand raw coin flips to a variable flips. Make sure the
elements are integers, and make sure you flip a “fair” coin (𝑝 = .5).
2.
Create a Numpy array containing the numbers 1/2, 1/4, 1/8, …, 1/1024. Make sure to call it my_array.
3.
Do the following:
4.
$$\log\left(\sum_{i=1}^{10} x_i\right) = \log\left(\sum_{i=1}^{10} \exp[\log(x_i) - m]\right) + m$$
$m$ is usually chosen to be $\max_i \log x_i$.

e) If you’re writing code, and you have a bunch of very small positive numbers (e.g. probabilities, densities, etc.), is it better to store those small numbers, or store the logarithm of those numbers? Assign your answer to which_better. Use either the phrase "logs" or "nologs".
5.
6.
$$\mathbb{P}(X > 6) \approx \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(X_i > 6). \tag{3.6}$$
If you haven’t seen an indicator function before (see Figure 3.1), it is defined
as
$$\mathbf{1}(X_i > 6) = \begin{cases} 1 & X_i > 6 \\ 0 & X_i \le 6 \end{cases} \tag{3.7}$$
So, the sum in this expression is just a count of the number of elements that
are greater than 6.
7.
$$\mathbb{E}[g(W)] \approx \frac{1}{n}\sum_{i=1}^{n} g(W_i). \tag{3.8}$$
$$X \sim \text{Bernoulli}(.5). \tag{3.10}$$
Both $f(y \mid X = 0)$ and $f(y \mid X = 1)$ are bell-curved, and $f(y)$ is a mixture of the two.
b) Simulate 1e3 times from the Bernoulli distribution. Call the samples bernoulli_flips.
4 Numpy ndarrays versus R’s Matrix and Array Types

Sometimes you want a collection of elements that are all the same type, but
you want to store them in a two- or three-dimensional structure. For instance,
say you need to use matrix multiplication for some linear regression software
you’re writing, or that you need to use tensors for a computer vision project
you’re working on.
import numpy as np
a = np.array([[1,2],[3,4]], dtype=float) # np.float is deprecated in newer versions of NumPy
a
## array([[1., 2.],
## [3., 4.]])
a.shape
## (2, 2)
a.ndim
## 2
a.dtype
## dtype('float64')
a.max()
## 4.0
a.resize((1,4)) # modification is **in place**
1 https://numpy.org/doc/stable/reference/arrays.ndarray.html
2 https://numpy.org/doc/stable/reference/arrays.ndarray.html#array-methods
3 https://numpy.org/doc/stable/reference/arrays.ndarray.html#array-attributes
a
## array([[1., 2., 3., 4.]])
b = np.ones(4).reshape((4,1))
np.dot(b,a) # matrix mult.
## array([[1., 2., 3., 4.],
## [1., 2., 3., 4.],
## [1., 2., 3., 4.],
## [1., 2., 3., 4.]])
b @ a # infix matrix mult. from PEP 465
## array([[1., 2., 3., 4.],
## [1., 2., 3., 4.],
## [1., 2., 3., 4.],
## [1., 2., 3., 4.]])
a * np.arange(4) # elementwise mult.
## array([[ 0., 2., 6., 12.]])
I should mention that there is also a matrix type in Numpy; however, this is
not described in this text because it is preferable to work with Numpy arrays
(Albon, 2018).
In both R and Python, there are matrix types and array types. In R, it is more common to work with matrixs than arrays, and the opposite is true in Python!
4.2 The Matrix and Array Classes in R

This section provides a quick introduction to using these two classes. For more information, see Chapter 3 of (Matloff, 2011).
I usually create matrix objects with the matrix() function or the as.matrix()
function. matrix() is to be preferred in my opinion. The first argument is
explicitly a vector of all the flattened data that you want in your matrix.
On the other hand, as.matrix() is more flexible; it takes in a variety of R
objects (e.g. data.frames), and tries to figure out what to do with them on
a case-by-case basis. In other words, as.matrix() is a generic function. More
information about generic functions is provided in Section 14.2.2.
Some other things to remember with matrix(): byrow= is FALSE by default, and
you will also need to specify either ncol= and/or nrow= if you want anything
that isn’t a 1-column matrix.
A <- matrix(1:4)
A
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
matrix(1:4, ncol = 2)
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
matrix(1:4, ncol = 2, byrow = T)
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
as.matrix(
data.frame(
firstCol = c(1,2,3),
secondCol = c("a","b","c"))) # coerces numbers to characters!
## firstCol secondCol
## [1,] "1" "a"
## [2,] "2" "b"
## [3,] "3" "c"
array() is used to create array objects. This type is used less than the matrix
type, but this doesn’t mean you should avoid learning about it. This is mostly
a reflection of what kind of data sets people prefer to work with, and the fact
that matrix algebra is generally better understood than tensor algebra. You
won’t be able to avoid 3-d data sets (3-dimensions, not a 3-column matrix)
forever, though, particularly if you’re working in an area such as neuroimaging
or computer vision.
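For instance, a short sketch:

# in R
a3d <- array(1:24, dim = c(2,3,4)) # a 2-by-3-by-4 array
dim(a3d)
## [1] 2 3 4
a3d[1,2,3] # row 1, column 2, slice 3
## [1] 15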
You can matrix-multiply matrix objects together with the %*% operator. If
you’re working on this, then the transpose operator (i.e. t()) comes in handy,
too. You can still use element-wise (Hadamard) multiplication. This is defined
with the more familiar multiplication operator *.
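For example (a sketch; B is an illustrative matrix, and the definition of Q sets up the examples that follow):

# in R
B <- matrix(1:4, ncol = 2)
t(B) %*% B # matrix multiplication with a transpose
## [,1] [,2]
## [1,] 5 11
## [2,] 11 25
B * B # element-wise (Hadamard) multiplication
## [,1] [,2]
## [1,] 1 9
## [2,] 4 16
Q <- diag(3) # the 3-by-3 identity matrix used below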
Qcopy <- Q
Qcopy[1,1] <- 3
Qcopy[2,2] <- 4
Qcopy
## [,1] [,2] [,3]
## [1,] 3 0 0
## [2,] 0 4 0
## [3,] 0 0 1
Here are some extraction examples. Notice that, if it can, [ will coerce a matrix
to vector. If you wish to avoid this, you can specify drop=FALSE.
Q
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
Q[1,1]
## [1] 1
Q[2,]
## [1] 0 1 0
Q[2,,drop=FALSE]
## [,1] [,2] [,3]
## [1,] 0 1 0
class(Q)
## [1] "matrix" "array"
class(Q[2,])
## [1] "numeric"
class(Q[2,,drop=FALSE])
## [1] "matrix" "array"
There are other functions that operate on one or more matrix objects in more
interesting ways, but much of this will be covered in future sections. For
instance, we will describe how apply() works with matrixs in Section 15, and
we will discuss combining matrix objects in different ways in Section 12.
4.3 Exercises
4.3.1 R Questions
1.
Consider the following data set. Let $N = 20$ be the number of rows. For $i = 1, \ldots, N$, define $\mathbf{x}_i \in \mathbb{R}^4$ as the data in row $i$.
d <- matrix(c(
-1.1585476, 0.06059602, -1.854421163, 1.62855626,
0.5619835, 0.74857327, -0.830973409, 0.38432716,
-1.6949202, 1.24726626, 0.068601035, -0.32505127,
2.8260260, -0.68567999, -0.109012111, -0.59738648,
-0.3128249, -0.21192009, -0.317923437, -1.60813901,
0.3830597, 0.68000706, 0.787044622, 0.13872087,
-0.2381630, 1.02531172, -0.606091651, 1.80442260,
1.5429671, -0.05174198, -1.950780046, -0.87716787,
-0.5927925, -0.40566883, -0.309193162, 1.25575250,
-0.8970403, -0.10111751, 1.555160257, -0.54434356,
2.4060504, -0.08199934, -0.472715155, 0.25254794,
-1.0145770, -0.83132666, -0.009597552, -1.71378699,
-0.3590219, 0.84127504, 0.062052945, -1.00587841,
-0.1335952, -0.02769315, -0.102229046, -1.08526057,
0.1641571, -0.08308289, -0.711009361, 0.06809487,
For the following problems, make sure to only use the transpose function t(),
matrix multiplication (i.e. %*%), and scalar multiplication/division. You may
use other functions in interactive mode to check your work, but please do not
use them in your submission.
2.
Create a matrix called P that has 100 rows, 100 columns, all of its elements
nonnegative, 1/10 on every diagonal element, and all rows summing to one.
This matrix is called stochastic and it describes how a Markov chain moves
randomly through time.
3.
Create a matrix called X that has one thousand rows, four columns, has
every element set to either 0 or 1, has its first column set to all 1s, has the
second column set to 1 in the second 250 elements and 0 elsewhere, has
the third column set to 1 in the third 250 spots and 0 elsewhere, and has the
fourth column set to 1 in the last 250 spots and 0 elsewhere. In other words,
it looks something like

$$X = \begin{bmatrix} 1 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 0 & 1 \end{bmatrix}.$$
$$f(\mathbf{x}; \mathbf{m}, \mathbf{C}) = (2\pi)^{-n/2}\det(\mathbf{C})^{-1/2}\exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{m})^\intercal \mathbf{C}^{-1}(\mathbf{x} - \mathbf{m})\right] \tag{4.3}$$
Evaluating this density should be done with care. There is no one function
that is optimal for all situations. Here are a couple quick things to consider.
• Inverting very large matrices with either np.linalg.solve5 or np.linalg.inv6 becomes very slow if the covariance matrix is high-dimensional. If you have special assumptions about the structure of the covariance matrix, use it! Also, it’s a good idea to be aware of what happens when you try to invert noninvertible matrices. For instance, can you rely on errors to be thrown, or will it return a bogus answer?
• Recall from the last lab that exponentiating numbers close to −∞ risks
numerical underflow. It’s better to prefer evaluating log densities (base 𝑒,
the natural logarithm). There are also special functions that evaluate log
determinants7 that are less likely to underflow/overflow, too!
Complete the following problems. Do not use pre-made functions such as
scipy.stats.norm8 and scipy.stats.multivariate_normal9 in your sub-
mission, but you may use them to check your work. Use only “standard” functions and Numpy n-dimensional arrays. Use the following definitions for x and m:
import numpy as np
x = np.array([1.1, .9, 1.0]).reshape((3,1))
m = np.ones(3).reshape((3,1))
a) Let $C = \begin{bmatrix} 10 & 0 & 0 \\ 0 & 10 & 0 \\ 0 & 0 & 10 \end{bmatrix}$. Evaluate and assign the log density to a float-like called log_dens1. Can you do this without defining a numpy array for C?
b) Let $C = \begin{bmatrix} 10 & 0 & 0 \\ 0 & 11 & 0 \\ 0 & 0 & 12 \end{bmatrix}$. Evaluate and assign the log density to a float-like called log_dens2. Can you do this without defining a numpy array for C?
5 https://numpy.org/doc/stable/reference/generated/numpy.linalg.solve.html
6 https://numpy.org/doc/stable/reference/generated/numpy.linalg.inv.html
7 https://numpy.org/doc/stable/reference/generated/numpy.linalg.slogdet.html
8 https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html
9 https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multivariate_normal.html
c) Let $C = \begin{bmatrix} 10 & -.9 & -.9 \\ -.9 & 11 & -.9 \\ -.9 & -.9 & 12 \end{bmatrix}$. Evaluate and assign the log density to a float-like called log_dens3. Can you do this without defining a numpy array for C?
2.
Consider this wine data set10 from (Cortez et al., 2009) hosted by (Dua and
Graff, 2017). Read it in with the following code. Note that you might need to
use os.chdir() first.
import pandas as pd
d = pd.read_csv("winequality-red.csv", sep = ";")
d.head()
5 R’s lists versus Python’s lists and dicts

When you need to store elements in a container, but you can’t guarantee that
these elements all have the same type, or you can’t guarantee that they all
have the same size, then you need a list in R. In Python, you might need a
list or dict (short for dictionary) (Lutz, 2013).
5.1 lists in R
lists are one of the most flexible data types in R. You can access individual
elements in many different ways, each element can be of different size, and
each element can be of a different type.
If you want to extract an element, you need to decide between using single
square brackets or double square brackets. The former returns a list, while
the second returns the type of the individual element.
You can also name the elements of a list. This can lead to more readable code.
To see why, examine the example below that makes use of some data about
cars (sas, 2021). The lm() function estimates a linear regression model. It
returns a list with plenty of components.
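A sketch along those lines (the built-in cars data set stands in for the one used here):

# in R
myLm <- lm(dist ~ speed, data = cars)
names(myLm)[1:4] # a few of the list's named components
## [1] "coefficients" "residuals" "effects" "rank"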
1 https://docs.python.org/3/library/stdtypes.html#lists
2 https://docs.python.org/3/library/stdtypes.html#mapping-types-dict
import numpy as np
another_list = [np.array([1,2,3]), "May 5th, 2021", True, [42,42]]
another_list[2]
## True
another_list[2] = 100
another_list
## [array([1, 2, 3]), 'May 5th, 2021', 100, [42, 42]]
Python lists have methods attached to them3 , which can come in handy.
another_list
## [array([1, 2, 3]), 'May 5th, 2021', 100, [42, 42]]
another_list.append('new element')
another_list
## [array([1, 2, 3]), 'May 5th, 2021', 100, [42, 42], 'new element']
Creating lists can be done as above, with the square bracket operators. They can
also be created with the list() function, and by creating a list comprehension.
List comprehensions are discussed more in Section 11.2.
The code above makes reference to a type that is not extensively discussed in
this text: tuples4 .
Here is an example of creating a dict with curly braces (i.e. {}). This dict stores
the current price of a few popular cryptocurrencies. Accessing an individual
element’s value using its key is done with the square bracket operator (i.e. []),
and deleting elements is done with the del keyword.
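# in python (the prices are illustrative)
crypto_prices = {'BTC': 34000.0, 'ETH': 2200.0, 'DOGE': 0.25}
crypto_prices['ETH'] # access a value by its key
## 2200.0
del crypto_prices['DOGE'] # delete an element by its key
crypto_prices
## {'BTC': 34000.0, 'ETH': 2200.0}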
You can also create dicts using dictionary comprehensions. Just like list
comprehensions, these are discussed more in Section 11.2.
import pandas as pd
a_dict = { 'col1': [1,2,3], 'col2' : ['a','b','c']}
df_from_dict = pd.DataFrame(a_dict)
df_from_dict
## col1 col2
## 0 1 a
## 1 2 b
## 2 3 c
5.4 Exercises
5.4.1 R Questions
1.
2.
a) Make a new list that is these two lists above “squished together.” It
has to be length 4, and each element is one of the elements of l1 and
l2. Call this list l3. Make sure to delete all the “tags” or “names” of
these four elements.
b) Extract the third element of l3 as a length one list and assign it to
the name l4.
c) Extract the third element of l3 as a vector and assign it to the name
v1.
5.4.2 Python Questions

2.

a) Make a new list that is these two dicts above “squished together” (why can’t it be another dict?) It has to be length 4, and each value is one of the values of d1 and d2. Call this list my_list.
b) Use a list comprehension to create a list called special_list of all
numbers starting from zero, up to (and including) one million, but
don’t include numbers that are divisible by any prime number less
than seven.
c) Assign the average of all elements in the above list to the variable
special_ave.
6 Functions
This text has already covered how to use functions that come to us pre-made.
At least we have discussed how to use them in a one-off way—just write the
name of the function, write some parentheses after that name, and then plug in
any requisite arguments by writing them in a comma-separated way between
those two parentheses. This is how it works in both R and Python.
In this section we take a look at how to define our own functions. This will
not only help us to understand pre-made functions, but it will also be useful if
we need some extra functionality that isn’t already provided to us.
Writing our own functions is also useful for “packaging up” computations. The
utility of this will become apparent very soon. Consider the task of estimating
a regression model. If you have a function that performs all of the required
calculations, then
• you can estimate models without having to think about lower-level details or
write any code yourself, and
• you can re-use this function every time you fit any model on any data set for
any project.
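6.1 Defining R Functions

Here is a first user-defined R function; the name addOne mirrors Python's add_one below and is an assumption.

addOne <- function(myInput){
  myOutput <- myInput + 1
  return(myOutput)
}
addOne(41) # call/invoke/use the function
## [1] 42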
Below the definition, the function is called with an input of 41. When this
happens, the following sequence of events occurs
• The value 41 is assigned to myInput
• myOutput is given the value 42
• myOutput, which is 42, is returned from the function
• the temporary variables myInput and myOutput are destroyed.
We get the desired answer, and all the unnecessary intermediate variables are
cleaned up and thrown away after they are no longer needed.
6.2 Defining Python Functions

def add_one(my_input):
    my_output = my_input + 1
    return my_output
add_one(41) # call/invoke/use the function
## 42
Below the definition, the function is called with an input of 41. When this
happens, the following sequence of events occurs
• The value 41 is assigned to my_input
• my_output is given the value 42
• my_output, which is 42, is returned from the function
• the temporary variables my_input and my_output are destroyed.
We get the desired answer, and all the unnecessary intermediate variables are
cleaned up and thrown away after they are no longer needed.
6.3 More Details on R’s User-Defined Functions

The formal argument list is exactly what it sounds like. It is the list of arguments
a function takes. You can access a function’s formal argument list using the
formals() function. Note that it is not the actual arguments a user will plug
in—that isn’t knowable at the time the function is created in the first place.
Here is another function that takes a parameter called whichNumber that comes
with a default argument of 1. If the caller of the function does not specify
what she wants to add to myInput, addNumber() will use 1 as the default. This
default value shows up in the output of formals(addNumber).
1 https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Function-objects
addNumber <- function(myInput, whichNumber = 1){
  return(myInput + whichNumber)
}
addNumber(3)
## [1] 4
formals(addNumber)
## $myInput
##
##
## $whichNumber
## [1] 1
The function’s body is also exactly what it sounds like. It is the work that a
function performs. You can access a function’s body using the body() function.
Every function you create also has a parent environment2. You can get/set this
using the environment() function. Environments help a function know which
variables it is allowed to use and how to use them. The parent environment of
a function is where the function was created, and it contains variables outside
of the body that the function can also use. The rules of which variables a
function can use are called scoping. When you create functions in R, you are
primarily using lexical scoping. This is discussed in more detail in Section
6.5.
2 Primitive functions are functions that contain no R code and are internally implemented in C. These are the only type of function in R that don’t have a parent environment.
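6.4 More Details on Python's User-Defined Functions

Python functions also carry information about their parameters and their code. Here is a sketch, assuming an add_number() function defined analogously to the R version above:

def add_number(my_input, which_number = 1):
    return my_input + which_number

add_number.__code__.co_argcount # how many parameters?
## 2
add_number.__code__.co_varnames # their names
## ('my_input', 'which_number')
add_number.__defaults__ # default argument values
## (1,)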
The __code__ attribute has much more to offer. To see a list of names of all
its contents, you can use dir(add_number.__code__).
3 https://docs.python.org/3/tutorial/classes.html#python-scopes-and-namespaces
4 https://docs.python.org/3/reference/datamodel.html#objects-values-and-types
5 You might have noticed that Python uses two different words to prevent confusion. Unlike R, Python uses the word "parameter" to refer to the inputs a function takes, and "argument" to refer to the specific values a user plugs in.
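6.5 Function Scope in R

To summarize how scoping works in R: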
1. functions can use local variables that are defined inside themselves,
2. functions can use global variables defined in the environment where the function itself was defined,
3. functions cannot necessarily use global variables defined in the environment where the function was called, and
4. functions will prefer local variables to global variables if there is a name clash.
The first characteristic is obvious. The second and third are important to distinguish between. Consider the code below. sillyFunction() can access a because sillyFunction() and a are defined in the same place.
a <- 3
sillyFunction <- function(){
return(a + 20)
}
environment(sillyFunction) # the env. it was defined in contains a
## <environment: R_GlobalEnv>
sillyFunction()
## [1] 23
On the other hand, a function cannot necessarily use variables that merely exist where it is called; calling a function is not the same as defining a function. Regarding the fourth characteristic, the following example shows that a function prefers its own local a over the global a, and that the global a is left untouched.
a <- 3
sillyFunction <- function(){
a <- 20
return(a + 20)
}
sillyFunction()
## [1] 40
print(a)
## [1] 3
The same concept applies if you create functions within functions. The inner
function innerFunc() looks “inside-out” for variables, but only in the place it
was defined.
Below we call outerFunc(), which then calls innerFunc(). innerFunc() can
refer to the variable b, because it lies in the same environment in which
innerFunc() was created. Interestingly, innerFunc() can also refer to the
variable a, because that variable was captured by outerFunc(), which provides
access to innerFunc().
a <- "outside both"
outerFunc <- function(){
  b <- "inside one"
  innerFunc <- function(){
    print(a)
    print(b)
  }
  return(innerFunc())
}
outerFunc()
## [1] "outside both"
## [1] "inside one"
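A small variation returns the inner function itself instead of calling it. Here is a sketch (note the lack of parentheses after innerFunc in the return() call):

a <- "outside both"
outerFuncV2 <- function(){
  b <- "inside one"
  innerFunc <- function(){
    print(a)
    print(b)
  }
  return(innerFunc) # no parentheses: return the function itself
}
myNewFunc <- outerFuncV2()
myNewFunc()
## [1] "outside both"
## [1] "inside one"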
We use this property all the time when we create functions that return other
functions. This is discussed in more detail in Chapter 15. In the above example,
outerFuncV2(), the function that returned another function, is called a function
factory.
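6.6 Function Scope in Python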
Python's scoping rules are similar:
1. functions can use local variables that are defined inside themselves,
2. functions have an order of precedence for deciding which variable to use in the case of a name clash, and
3. functions can sometimes use variables defined outside themselves, but that ability depends on where the function and variable were defined, not on where the function was called.
Regarding characteristics (2) and (3), there is a famous acronym that describes
the rules Python follows when finding and choosing variables: LEGB.
• L: Local,
• E: Enclosing,
• G: Global, and
• B: Built-in.
A Python function will search for a variable in these namespaces in this order.6
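As a quick illustration of the last two levels, here is a small sketch (not one of the original examples) in which a global name temporarily shadows a built-in one:

len = 10          # a global len now shadows the built-in len
print(len)        # the Global namespace is searched before Built-in
## 10
del len           # remove the global binding
print(len("abc")) # the Built-in len is found again
## 3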
“Local” refers to variables that are defined inside of the function’s block. The
function below uses the local a over the global one.
a = 3
def silly_function():
a = 22 # local a
print("local variables are ", locals())
return a + 20
silly_function()
## local variables are {'a': 22}
## 42
silly_function.__code__.co_nlocals # number of local variables
## 1
silly_function.__code__.co_varnames # names of local variables
## ('a',)
6 Functions aren't the only things that get their own namespace. Classes do, too.7 More information on classes is provided in Chapter 14.
7 https://docs.python.org/3/tutorial/classes.html#a-first-look-at-classes
a = "outside both"
def outer_func():
a = "inside one"
def inner_func():
print(a)
return inner_func
my_new_func = outer_func()
my_new_func()
## inside one
my_new_func.__code__.co_freevars
## ('a',)
a = "outside both"
def outer_func():
b = "inside one"
def inner_func():
print(a)
inner_func()
outer_func()
## outside both
Just like in R, Python functions cannot necessarily find variables where the
function was called. For example, here is some code that mimics the above R
example. Both a and b are accessible from within inner_func(). That is due
to LEGB.
However, if we call outer_func() inside another function, when outer_func() was defined somewhere else, then it doesn't have access to variables at the call site. You might be surprised at how the following code behaves. Does it print the string "this is the a I want to use now!"? No!
a = "outside both"
def outer_func():
b = "inside one"
def inner_func():
print(a)
print(b)
return inner_func()
def third_func():
a = "this is the a I want to use now!"
outer_func()
third_func()
## outside both
## inside one
If you feel like you understand lexical scoping, great! You should be ready
to take on Chapter 15, then. If not, keep playing around with examples.
Without understanding the scoping rules R and Python share, writing your
own functions will persistently feel more difficult than it really is.
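6.7 Modifying a Function's Arguments

Can a function modify the arguments that get passed in to it? In R, the answer is, for the most part, no8. Consider the following example.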
a <- 1
f <- function(arg){
  arg <- 2 # modifying a temporary variable, not a
  return(arg)
}
print(f(a))
## [1] 2
print(a)
## [1] 1

8 There are some exceptions to this, but it's generally true.
The function f has an argument called arg. When f(a) is performed, changes
are made to a copy of a. When a function constructs a copy of all input
variables inside its body, this is called pass-by-value semantics. This copy
is a temporary intermediate value that only serves as a starting point for the
function to produce a return value of 2.
arg could have been called a, and the same behavior would take place. However, giving these two things different names is helpful to remind you and others that R copies its arguments.
It is still possible to modify a, but I don't recommend doing this; I will discuss this more in Section 6.8. Here is the same experiment in Python.
a = 1
def f(arg):
arg = 2
return arg
print(f(a))
## 2
print(a)
## 1
In this case, a is not modified. That is because a is an int. ints are immutable
in Python, which means that their value9 cannot be changed after they are
created, either inside or outside of the function’s scope. However, consider the
case when a is a list, which is a mutable type. A mutable type is one that
can have its value changed after it's created.
a = [999]
def f(arg):
arg[0] = 2
return arg
print(f(a))
## [2]
print(a) # not [999] anymore!
## [2]
In this case a is modified. Changing the value of the argument inside the
function effects changes to that variable outside of the function.
Ready to be confused? Here is a tricky third example. What happens if we take in a list, but try to do something else with it?
a = [999]
def f(arg):
arg = [2]
return arg
print(f(a))
## [2]
print(a) # didn't change this time :(
## [999]
That time a did not permanently change in the global scope. Why does this
happen? I thought lists were mutable!
The reason behind all of this doesn't even have anything to do with functions, per se. Rather, it has to do with how Python manages objects, values, and types10. It also has to do with what happens during assignment11.
9 https://docs.python.org/3/reference/datamodel.html#objects-values-and-types
10 https://docs.python.org/3/reference/datamodel.html#objects-values-and-types
11 https://docs.python.org/3/reference/executionmodel.html#naming-and-binding
Let’s revisit the above code, but bring everything out of a function. Python is
pass-by-assignment, so all we have to do is understand how assignment works.
Starting with the immutable int example, we have the following.
# old code:
# a = 1
# def f(arg):
# arg = 2
# return arg
a = 1 # still done in global scope
arg = a # arg is a name that is bound to the object a refers to
arg = 2 # arg is a name that is bound to the object 2
print(arg is a)
## False
print(id(a), id(arg)) # different!
## 139835665388896 139835665388928
print(a)
## 1
In the first line, the name a is bound to the object 1. In the second line, the
name arg is bound to the object that is referred to by the name a. After the
second line finishes, arg and a are two names for the same object (a fact that
you can confirm by inserting arg is a immediately after this line).
In the third line, arg is bound to 2. The variable arg can be changed, but only
by re-binding it with a separate object. Re-binding arg does not change the
value referred to by a because a still refers to 1, an object separate from 2.
There is no reason to re-bind a because it wasn’t mentioned at all in the third
line.
If we go back to the first function example, it’s basically the same idea. The only
difference, however, is that arg is in its own scope. Let’s look at a simplified
version of our second code chunk that uses a mutable list.
a = [999]
# old code:
# def f(arg):
# arg[0] = 2
# return arg
arg = a
arg[0] = 2
print(arg)
## [2]
print(a)
## [2]
print(arg is a)
## True
In this example, when we run arg = a, the name arg is bound to the same
object that is bound to a. This much is the same. The only difference here,
though, is that because lists are mutable, changing the first element of arg is
done “in place”, and all variables can access the mutated object.
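You can verify this with the id() function, which returns an identifier for the object a name is currently bound to (a small sketch; the particular numbers will differ on your machine):

a = [999]
old_id = id(a)
a[0] = 2           # mutation in place: still the same object
id(a) == old_id
## True
a = [2]            # re-binding: a brand new object
id(a) == old_id
## False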
Why did the third example produce unexpected results? The difference is in the line arg = [2]. This re-binds the name arg to a different object. lists are still mutable, but mutability has nothing to do with re-binding; re-binding a name works no matter what type of object you're binding it to. In this case we are re-binding arg to a completely different list.
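6.8 Accessing and Modifying Captured Variables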
Recall that functions can refer to variables that are not local to them. A function will only try to access such a referred-to variable when the function is running, not when it is defined.
Consider the R code below. The dataReadyForModeling() function is created
in the global environment, and the global environment contains a Boolean
variable called dataAreClean.
# R
dataAreClean <- TRUE
dataReadyForModeling <- function(){
return(dataAreClean)
}
dataAreClean <- FALSE
# dataReadyForModeling() # what happens if we call it now?
Now imagine sharing some code with a collaborator. Imagine, further, that your collaborator is the subject-matter expert and knows little about R programming. Suppose that he changes dataAreClean, a global variable in the script, after he is done. Shouldn't this require only a relatively trivial change to the overall program?
Let’s explore this hypothetical further. Consider what could happen if any of
the following (very typical) conditions are true:
• you or your collaborators aren’t sure what dataReadyForModeling() will
return because you don’t understand dynamic lookup, or
• it’s difficult to visually keep track of all assignments to dataAreClean (e.g. your
script is quite long or it changes often), or
• you are not running code sequentially (e.g. you are repeatedly testing chunks
at a time instead of clearing out your memory and source()ing from scratch,
over and over again).
In each of these situations, understanding of the program would be compromised.
However, if you follow the above principle of never referring to non-local
variables in function code, all members of the group could do their own work
separately, minimizing the dependence on one another.
Violating this advice can also be troublesome if you define a function that refers to a nonexistent variable. Defining the function will never throw an error, because R will assume that variable is defined in the global environment. Calling the function might throw an error, unless you accidentally defined the variable, or you forgot to delete a variable whose name you no longer want to use. Defining myFunc() with the code below will not throw an error, even if you think it should!
# R
myFunc <- function(){
return(varigbleNameWithTypo) #varigble?
}
# python
missile_launch_codes_set = True
def everything_is_safe():
return not missile_launch_codes_set
missile_launch_codes_set = False
everything_is_safe()
## True
# python
def my_func():
return varigble_name_with_typo
So stay away from referring to variables outside the body of your function!
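One way to follow this advice is to pass everything a function needs in as an argument. Here is a sketch that rewrites the earlier example (everything_is_safe_v2() is a made-up name):

# python
def everything_is_safe_v2(codes_set):
    return not codes_set

everything_is_safe_v2(missile_launch_codes_set)
## True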
a <- 1
makeATwo <- function(arg){
arg <- 2
a <<- arg
}
print(makeATwo(a))
## [1] 2
print(a)
## [1] 2
In the program above, makeATwo() copies a into arg. It then assigns 2 to that copy. Then it takes that 2 and writes it to the a variable in the parent environment. It does this using R's super assignment operator <<-. Regardless of the input passed into this function, it will always assign exactly 2 to the global a.
This is problematic because you are pre-occupying your mind with one function:
makeATwo(). Whenever you write code that depends on a (or on things that
depend on a, or on things that depended on things that depend on a, or …),
you’ll have to repeatedly interrupt your train of thought to try and remember if
what you’re doing is going to be okay with the current and future makeATwo()
call sites.
Python's global keyword plays a similar role: it lets a function re-bind a variable in the global namespace. The upside to the global keyword is that it makes hunting for side effects relatively easy (a function's side effects are changes it makes to non-local variables). Yes, this keyword should be used sparingly, even more sparingly than merely referring to global variables, but if you are ever debugging, and you want to hunt down places where variables are surprisingly being changed, you can hit Ctrl-F and search for the phrase "global."
a = 1
def increment_a():
global a
a += 1
[increment_a() for _ in range(10)]
## [None, None, None, None, None, None, None, None, None, None]
print(a)
## 11
6.9 Exercises
6.9.1 R Questions
1.
Once this $p$-dimensional vector is found, you can also obtain the predicted (or fitted) values
$$\hat{y} := X\hat{\beta}, \tag{6.3}$$
and the residuals (or errors)
$$y - \hat{y}. \tag{6.4}$$
2.
3.
Write a function called myDFT() that computes the Discrete Fourier Trans-
form of a vector and returns another vector. Feel free to check your work
against spec.pgram(), fft(), or astsa::mvspec(), but do not include calls to
those functions in your submission. Also, you should be aware that different
functions transform and scale the answer differently, so be sure to read the
documentation of any function you use to test against.
Given data $x_1, x_2, \ldots, x_n$, $i = \sqrt{-1}$, and the Fourier/fundamental frequencies $\omega_j = j/n$ for $j = 0, 1, \ldots, n-1$, we define the discrete Fourier transform (DFT) as
$$d(\omega_j) = n^{-1/2} \sum_{t=1}^{n} x_t e^{-2\pi i \omega_j t}. \tag{6.5}$$
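6.9.2 Python Questions

1.

This question concerns Newton's method for minimizing a twice-differentiable function $f$. Starting from an initial guess, the algorithm repeatedly applies the recursion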
$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}. \tag{6.6}$$
Under appropriate regularity conditions for $f$, after many iterations of the above recursion, when $\tilde{n}$ is very large, $x_{\tilde{n}}$ will be nearly the same as $x_{\tilde{n}-1}$, and $x_{\tilde{n}}$ will be pretty close to $\operatorname{argmin}_x f(x)$. In other words, $x_{\tilde{n}}$ is approximately the minimizer of $f$, and a root of $f'$.
a) Write a function called f that takes a float x and returns (𝑥−42)2 −33.
b) Write a function called f_prime that takes a float and returns the
derivative of the above.
c) Write a function called f_dub_prime that takes a float and returns
an evaluation of the second derivative of 𝑓.
d) Theoretically, what is the minimizer of 𝑓? Assign your answer to the
variable best_x.
e) Write a function called minimize() that takes three arguments, and
performs ten iterations of Newton’s algorithm, after which it returns
𝑥10 . Don’t be afraid of copy/pasting ten or so lines of code. We haven’t
learned loops yet, so that’s fine. The ordered arguments are:
• the function that evaluates the derivative of the function you're interested in,
• the function that evaluates the second derivative of your objective function,
• an initial guess of the minimizer.
f) Test your function by plugging in the above functions, and use a
starting point of 10. Assign the output to a variable called x_ten.
2.
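7 Categorical Data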
7.1 factors in R
Categorical data in R is often stored in a factor1 variable. factors are more
special than vectors of integers because
• they have a levels attribute, which is comprised of all the possible values
that each response could be;
• they may or may not be ordered, which will also control how they are used
in mathematical functions;
• they might have a contrasts attribute, which will control how they are used
in statistical modeling functions.
Here is a first example. Say we asked three people what their favorite season
was. The data might look something like this.
1 https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Factors
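Here is a sketch of how such data might be created, specifying the levels directly:

responses <- c("autumn", "summer", "summer")
responses <- factor(responses,
                    levels = c("autumn", "spring", "summer", "winter"))
responses
## [1] autumn summer summer
## Levels: autumn spring summer winter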
is.ordered(responses)
## [1] FALSE
#contrasts(responses)
# ^ controls how factor is used in different functions
factors always have levels, which is the collection of all possible unique values
each observation can take.
You should be careful if you are not specifying them directly. What happens when you use the default option and replace the second assignment in the above code with responses <- factor(c("autumn", "summer", "summer"))? The documentation of factor() will tell you that, by default, factor() will just take the unique values found in the data. In this case, nobody prefers winter or spring, and so neither will show up in levels(responses). This may or may not be what you want.
factors can be ordered or unordered. Ordered factors are for ordinal data. Ordinal data is a particular type of categorical data that recognizes the categories have a natural order (e.g. low/medium/high and not red/green/blue).
As another example, say we asked ten people how much they liked statistical
computing, and they could only respond “love it”, “it’s okay” or “hate it”. The
data might look something like this.
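Here is a sketch with made-up responses; setting ordered = TRUE is what marks the factor as ordinal:

opinions <- factor(
  c("love it", "love it", "it's okay", "hate it", "it's okay",
    "love it", "it's okay", "love it", "hate it", "love it"),
  levels = c("hate it", "it's okay", "love it"),
  ordered = TRUE)
is.ordered(opinions)
## [1] TRUE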
Last, factors may or may not have a contrast attribute. You can get or set
this with the contrasts() function. This will influence some of the functions
you use on your data that estimate statistical models.
I will not discuss specifics of contrasts in this text, but the overall motivation
is important. In short, the primary reason for using factors is that they are
designed to allow control over how you model categorical data. To be more
specific, changing attributes of a factor could control the parameterization of
a model you’re estimating. If you’re using a particular function for modeling
with categorical data, you need to know how it treats factors. On the other
hand, if you’re writing a function that performs modeling of categorical data,
you should know how to treat factors.
Here are two examples that you might come across in your studies.
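1. An unordered factor used in a regression model is typically encoded as a set of dummy variables, with the contrasts controlling the encoding.
2. An ordered factor response might call for ordinal logistic regression, whereas an unordered one might call for multinomial logistic regression.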
The mathematical details of these examples are outside of the scope of this text.
If you have not learned about dummy variables in a regression course, or if
you have not considered the difference between multinomial logistic regression
and ordinal logistic regression, or if you have but you’re just a little rusty, that
is totally fine. I only mention these as examples for how the factor type can
trigger special behavior.
In addition to creating one with factor(), there are two other common ways
that you can end up with factors:
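1. converting a numeric vector into categorical data with a function such as cut(), and
2. reading in an external data set with a function that stores text columns as factors.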
Here is an example of (1). We can take non-categorical data, and cut() it into
something categorical.
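Here is a sketch, mirroring the Python example that appears later in this chapter:

stockReturns <- rnorm(10) # not categorical
typeOfDay <- cut(stockReturns,
                 breaks = c(-Inf, 0, Inf),
                 labels = c("bad day", "good day"))
levels(typeOfDay)
## [1] "bad day"  "good day"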
Finally, be mindful of how different functions read in external data sets. When
reading in an external file, if a particular function comes across a column that
has characters in it, it will need to decide whether to store that column as a
character vector, or as a factor. For example, read.csv() and read.table()
have a stringsAsFactors= argument that you should be mindful of.
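7.2 Two Options for Categorical Data in Pandas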
Pandas’ Series were discussed earlier in Sections 3.2 and 3.4. These were
containers that forced every element to share the same dtype. Here, we specify
dtype="category" in pd.Series().
import pandas as pd
szn_s = pd.Series(["autumn", "summer", "summer"], dtype = "category")
szn_s.cat.categories
## Index(['autumn', 'summer'], dtype='object')
szn_s.cat.ordered
## False
szn_s.dtype
## CategoricalDtype(categories=['autumn', 'summer'], ordered=False)
type(szn_s)
## <class 'pandas.core.series.Series'>
The second option is to use Pandas' Categorical containers. They are quite similar, so the choice is subtle. Like Series containers, they also force all of their elements to share the same dtype.
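Here is a sketch of the same seasons example using the pd.Categorical() constructor:

szn_c = pd.Categorical(["autumn", "summer", "summer"],
                       categories = ["autumn", "summer", "spring", "winter"],
                       ordered = False)
szn_c.categories
## Index(['autumn', 'summer', 'spring', 'winter'], dtype='object')
szn_c.ordered
## False
type(szn_c)
## <class 'pandas.core.arrays.categorical.Categorical'>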
You might have noticed that, with the Categorical container, methods and data
members were not accessed through the .cat accessor. It is also more similar
to R’s factors because you can specify more arguments in the constructor.
With Pandas' Series it's more difficult to specify a nondefault dtype. One option is to change the categories after the object has been created.
szn_s = szn_s.cat.set_categories(
["autumn", "summer","spring","winter"])
szn_s.cat.categories
## Index(['autumn', 'summer', 'spring', 'winter'], dtype='object')
szn_s = szn_s.cat.remove_categories(['spring','winter'])
szn_s.cat.categories
## Index(['autumn', 'summer'], dtype='object')
szn_s = szn_s.cat.add_categories(["fall", "winter"])
szn_s.cat.categories
## Index(['autumn', 'summer', 'fall', 'winter'], dtype='object')
Another option is to create the dtype before you create the Series, and pass
it into pd.Series().
cat_type = pd.CategoricalDtype(
categories=["autumn", "summer", "spring", "winter"],
ordered=True)
responses = pd.Series(
["autumn", "summer", "summer"],
dtype = cat_type)
responses
## 0 autumn
## 1 summer
## 2 summer
## dtype: category
## Categories (4, object): ['autumn' < 'summer' < 'spring' < 'winter']
Just like in R, you can convert numerical data into categorical. The function
even has the same name as in R: pd.cut()2 . Depending on the type of the
input, it will return either a Series or a Categorical3 .
import numpy as np
stock_returns = np.random.normal(size=10) # not categorical
# array input means Categorical output
type_of_day = pd.cut(stock_returns,
                     bins = [-np.inf, 0, np.inf],
                     labels = ['bad day', 'good day'])
type(type_of_day)
## <class 'pandas.core.arrays.categorical.Categorical'>
type_of_day2 = pd.cut(pd.Series(stock_returns),
                      bins = [-np.inf, 0, np.inf],
                      labels = ['bad day', 'good day'])
type(type_of_day2) # Series in means Series out
## <class 'pandas.core.series.Series'>

2 https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#series-creation
3 https://pandas.pydata.org/docs/reference/api/pandas.cut.html
Finally, when reading in data from an external source, choose carefully whether
you want character data to be stored as a string type, or as a categorical type.
Here we use pd.read_csv()4 to read in Fisher’s Iris data set (Fisher, 1988)
hosted by (Dua and Graff, 2017). More information on Pandas’ DataFrames
can be found in the next chapter.
import numpy as np
# make 5th col categorical
my_data = pd.read_csv("data/iris.csv", header=None,
dtype = {4:"category"})
my_data.head(1)
## 0 1 2 3 4
## 0 5.1 3.5 1.4 0.2 Iris-setosa
my_data.dtypes
## 0 float64
## 1 float64
## 2 float64
## 3 float64
## 4 category
## dtype: object
np.unique(my_data[4]).tolist()
## ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
4 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
7.3 Exercises
7.3.1 R Questions
1.
Read in this chess data set (mis, 1989), hosted by (Dua and Graff, 2017), with
the following code. You will probably have to change your working directory,
but if you do, make sure to comment out that code before you submit your
script to me.
2.
Suppose you have the following vector. Please make sure to include this code
in your script.
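7.3.2 Python Questions

1.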
Consider the following simulated letter grade data for two students:
import pandas as pd
import numpy as np
poss_grades = ['A+','A','A-','B+','B','B-',
'C+','C','C-','D+','D','D-',
'F']
grade_values = {'A+':4.0,'A':4.0,'A-':3.7,'B+':3.3,'B':3.0,'B-':2.7,
'C+':2.3,'C':2.0,'C-':1.7,'D+':1.3,'D':1.0,'D-':.67,
'F':0.0}
student1 = np.random.choice(poss_grades, size = 10, replace = True)
student2 = np.random.choice(poss_grades, size = 12, replace = True)
a) Convert the two Numpy arrays to one of the Pandas types for cate-
gorical data that the textbook discussed. Call these two variables s1
and s2.
b) These data are categorical. Are they ordinal? Make sure to adjust s1
and s2 accordingly.
c) Calculate the two student GPAs. Assign the floating point numbers
to variables named s1_gpa and s2_gpa. Use grade_values to convert
each letter grade to a number, and then average all the numbers for
each student together using equal weights.
d) Is each category equally-spaced? If yes, then these are said to be
interval data. Does your answer to this question affect the legitimacy
of averaging together any ordinal data? Assign a str response to the
variable ave_ord_data_response. Hint: consider (any) two different
data sets that happen to produce the same GPA. Is the equality of
these two GPAs misleading?
e) Compute the mode grade for each student. Assign your answers as strs to the variables s1_mode and s2_mode. If there is more than one mode, then assign the one that comes first alphabetically.
2.
Suppose you are creating a classifier whose job it is to predict labels. Consider
the following DataFrame of predicted labels next to their corresponding actual
labels. Please make sure to include this code in your script.
import pandas as pd
import numpy as np
d = pd.DataFrame({'predicted label' : [1,2,2,1,2,2,1,2,3,2,2,3],
'actual label': [1,2,3,1,2,3,1,2,3,1,2,3]},
dtype='category')
d.dtypes[0]
## CategoricalDtype(categories=[1, 2, 3], ordered=False)
d.dtypes[1]
## CategoricalDtype(categories=[1, 2, 3], ordered=False)
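8 Data Frames

8.1 Data Frames in R

R's data.frame1 is one of the most common ways to store a two-dimensional data set in R. Below, Fisher's iris data is read in and examined; the reading step is a sketch, assuming the same "data/iris.csv" file used elsewhere in this book:

irisData <- read.csv("data/iris.csv", header = FALSE,
                     col.names = c("sepal.length", "sepal.width",
                                   "petal.length", "petal.width",
                                   "species"))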
typeof(irisData)
## [1] "list"
class(irisData) # we'll talk more about classes later
## [1] "data.frame"
dim(irisData)
## [1] 150 5
nrow(irisData)
## [1] 150
ncol(irisData)
## [1] 5
There are some exceptions, but most data sets can be stored as a data.frame.
These kinds of two-dimensional data sets are quite common. Any particular
row is often an observation on one experimental unit (e.g. person, place or
thing). Looking at a particular column gives you one kind of measurement
stored for all observations.
1 https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Data-frame-objects
Often times you will need to extract pieces of information from a data.frame.
This can be done in many ways. If the columns have names, you can use the $
operator to access a single column. Accessing a single column might be followed
up by creating a new vector. You can also use the [ operator to access multiple
columns by name.
The [ operator is also useful for selecting rows and columns by index numbers,
or by some logical criteria.
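Here is a sketch of each of these approaches:

sepLen <- irisData$sepal.length # one column, by name
twoCols <- irisData[c("sepal.length", "species")] # columns, by name
someRows <- irisData[1:3, c(1, 5)] # rows and columns, by number
setosas <- irisData[irisData$species == "Iris-setosa", ] # logical criteria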
In R, data.frames might have row names. You can get and set this character
vector with the rownames() function. You can access rows by name using the
square bracket operator.
head(rownames(irisData))
## [1] "1" "2" "3" "4" "5" "6"
rownames(irisData) <- as.numeric(rownames(irisData)) + 1000
head(rownames(irisData))
## [1] "1001" "1002" "1003" "1004" "1005" "1006"
irisData["1002",]
## sepal.length sepal.width petal.length petal.width species
## 1002 4.9 3 1.4 0.2 Iris-setosa
Code that modifies data usually looks quite similar to code extracting data.
You’ll notice a lot of the same symbols (e.g. $, [, etc.), but the (<-) will point
in the other direction.
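Here is a sketch:

irisData$colOnes <- rep(1, nrow(irisData)) # create a new column
irisData[1, "sepal.length"] <- 99 # overwrite one element

8.2 Data Frames in Python

Pandas provides the DataFrame type, the analogue of R's data.frame. Below, the same iris data set is read in with pd.read_csv().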
import pandas as pd
iris_data = pd.read_csv("data/iris.csv", header = None)
iris_data.head(3)
## 0 1 2 3 4
## 0 5.1 3.5 1.4 0.2 Iris-setosa
## 1 4.9 3.0 1.4 0.2 Iris-setosa
## 2 4.7 3.2 1.3 0.2 Iris-setosa
iris_data.shape
## (150, 5)
len(iris_data) # num rows
## 150
len(iris_data.columns) # num columns
## 5
list(iris_data.dtypes)[:3]
## [dtype('float64'), dtype('float64'), dtype('float64')]
list(iris_data.dtypes)[3:]
## [dtype('float64'), dtype('O')]
The structure is very similar to that of R's data frame. It's two-dimensional, and you can access columns and rows by name or number.2 Each column is a Series object, and each column can have a different dtype, which is analogous to R's situation. Because elements only need to share a type within a column, this is a big difference between 2-d Numpy ndarrays and DataFrames (c.f. R's matrix versus R's data.frame).
2 https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
Square brackets are a little different in Python than they are in R. Just like
in R, you can access columns by name with square brackets, and you can
also access rows. Unlike R, though, you don’t have to specify both rows and
columns every time you use the square brackets.
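Here is a sketch (recall that the columns of this DataFrame are labeled with integers, because header=None was specified):

first_col = iris_data[0] # a single column, by label
some_rows = iris_data[0:2] # a slice of rows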
You can select columns and rows by number with the .iloc method3 . iloc is
(probably) short for “integer location.”
3 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
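Here is a sketch of selecting by integer position:

iris_data.iloc[:3, :2] # first three rows, first two columns
iris_data.iloc[0, 4] # a single element: row 0, column 4
## 'Iris-setosa'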
Selecting columns by anything besides integer number can be done with the
.loc() method4 . You should generally prefer this method to access columns
because accessing things by name instead of number is more readable. Here
are some examples.
sepal_w_to_pedal_w = iris_data.loc[:,'sepal.width':'petal.width']
sepal_w_to_pedal_w.head()
## sepal.width petal.length petal.width
## 0 3.5 1.4 0.2
## 1 3.0 1.4 0.2
## 2 3.2 1.3 0.2
## 3 3.1 1.5 0.2
## 4 3.6 1.4 0.2
setosa_only = iris_data.loc[iris_data['species'] == "Iris-setosa",]
# don't need the redundant column anymore
del setosa_only['species']
setosa_only.head(3)
## sepal.length sepal.width petal.length petal.width
## 0 5.1 3.5 1.4 0.2
## 1 4.9 3.0 1.4 0.2
## 2 4.7 3.2 1.3 0.2
Be careful not to confuse .loc with .iloc. Recall that .loc is label-based selection. Labels don't necessarily have to be strings. Consider the following example.
iris_data.index
## RangeIndex(start=0, stop=150, step=1)
# reverse the index
iris_data = iris_data.set_index(iris_data.index[::-1])
iris_data.iloc[-2:,:3] # top is now bottom
## sepal.length sepal.width petal.length
## 1 6.2 3.4 5.4
## 0 5.9 3.0 5.1
iris_data.loc[0] # last row has 0 index
## sepal.length 5.9
## sepal.width 3
## petal.length 5.1
## petal.width 1.8
## species Iris-virginica
## Name: 0, dtype: object
iris_data.iloc[0] # first row with big index
## sepal.length 5.1
## sepal.width 3.5
## petal.length 1.4
## petal.width 0.2
## species Iris-setosa
## Name: 149, dtype: object
iris_data.loc[0] selects the 0th index. The second line reversed the indexes,
so this is actually the last row. If you want the first row, use iris_data.iloc[0].
Modifying data inside a data frame looks quite similar to extracting data.
You’ll recognize a lot of the methods mentioned earlier.
import numpy as np
n_rows = iris_data.shape[0]
iris_data['col_ones'] = np.repeat(1.0, n_rows)
iris_data.iloc[:2,0] = np.random.normal(loc=999, size=2)
rand_nums = np.random.normal(loc=-999, size=n_rows)
iris_data.loc[:,'sepal.width'] = rand_nums
setosa_rows = iris_data['species'] == "Iris-setosa"
iris_data.loc[setosa_rows, 'species'] = "SETOSA!"
del iris_data['petal.length']
iris_data.head(3)
## sepal.length sepal.width petal.width species col_ones
## 149 998.388556 -1000.146000 0.2 SETOSA! 1.0
## 148 997.790911 -998.745409 0.2 SETOSA! 1.0
## 147 4.700000 -996.467804 0.2 SETOSA! 1.0
You can also use the .assign() method6 to create a new column. This method
does not modify the data frame in place. It returns a new DataFrame with the
additional column.
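Here is a sketch; the column name sepal_halved is made up:

iris_data2 = iris_data.assign(sepal_halved = iris_data['sepal.length'] / 2)
'sepal_halved' in iris_data.columns # the original is unchanged
## False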
8.3 Exercises
8.3.1 R Questions
1.
2.
mtcars is a data set that is built into R, so you don’t need to read it in. You
can read more about it by typing ?datasets::mtcars.
3.
This question investigates the Zillow Home Value Index (ZHVI)7 for single
family homes.
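8.3.2 Python Questions

1.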
This question deals with looking at historical prices of the S&P500 Index.
This data was downloaded from https://finance.yahoo.com (gsp, 2021). It
contains prices starting from “2007-01-03” and going until “2021-10-01”.
a) Read in the data file "gspc.csv" as a data.frame and call the variable
gspc.
b) Use .set_index()8 to change the index of gspc to its "Index" column.
Store the new DataFrame as gspc_good_idx.
7 https://www.zillow.com/research/data/
8 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html
2.
In this question we’ll look at some data on radon measurements9 (Gelman and
Hill, 2007). Instead of reading in a text file, we will load the data into Python
using the tensorflow_datasets module10 (TFD, 2021).
Please include the following code in your submission.
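9 Input and Output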
The third reason does not imply data created by code is unimportant.
For example, it is the most common approach to create data used
in simulation studies. Authors writing statistical papers need
to demonstrate that their techniques work on “nice” data: data
simulated from a known data-generating process. In a simulation
study, unlike in the “real-world,” you have access to the parameters
generating your data, and you can examine data that might otherwise
be unobserved or hidden. Further, with data from the real-world,
there is no guarantee your model correctly matches the true model.
Can your code/technique/algorithm, at the very least, obtain parameter estimates that are "in-line" with the parameters your code
is using to simulate data? Are forecasts or predictions obtained by
your method accurate? These kinds of questions can often only be
answered by simulating fake data. Programmatically, simulating data
like this largely involves calling functions that we have seen before
(e.g. rnorm() in R or np.random.choice() in Python). This may or
may not involve setting a pseudorandom number seed, first, for re-
producibility.
Also, benchmark data sets are often readily available through special-
ized function calls.
Even though this chapter is written to teach you how to read files into R and Python, you should not expect to know how to read in all data sets after reading this section. For both R and Python, there are an enormous number of functions: different functions have different return types, different functions are suited to different file types, many functions are spread across a plethora of third-party libraries, and many of these functions have an enormous number of arguments. You will probably not be able to memorize everything. In my very humble opinion, I doubt you should want to.
Instead, focus on developing your ability to identify and diagnose data input problems. Reading in a data set correctly is often a process of trial-and-error. After attempting to read in a data set, always check the following items. Many of these points were previously mentioned in Section 8.1. Some apply to reading in text data more than reading in structured data from a database, and vice versa.
1. Check that the columns were separated correctly. Problems often arise when separators are found inside data elements or column names. For example, sometimes it's unclear whether people's names in the "last, first" format can be stored in one or two columns. Also, whitespace is a common separator, and text data might surprise you with unexpected spaces.
2. Check that the column names were parsed and stored correctly.
Column names should not be stored as data in R/Python. Functions
that read in data should not expect column names when they don’t
exist in the actual file.
3. Check that empty space and metadata was ignored correctly.
Data descriptions are sometimes stored in the same file as the data
itself, and that should be skipped over when it’s being read in. Empty
space between column names and data shouldn’t be stored. This can
occur at the beginning of the file, and even at the end of the file.
4. Check that type choice and recognition of special charac-
ters are performed correctly. Are letters stored as strings or as
something else such as an R factor? Are dates and times stored as a
special date/time type, or as strings? Is missing data correctly identified? Sometimes data providers use outrageous numbers like −9999
to represent missing data—don’t store that as a float or integer!
5. Be ready to prompt R or Python to recognize a specific
character encoding if you are reading in text data written in
another language. All text data has a character encoding, which is a
mapping of numbers to characters. Any specific encoding will dictate
what characters are recognizable in a program. If you try to read in
data written in another language, the function you are using will likely
complain about unrecognized characters. Fortunately, these errors and
warnings are easily fixed by specifying a nondefault argument such as
encoding= or fileEncoding=.
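9.2 Reading in Text Files with R

As a running example, consider the Challenger O-Ring data set1, stored in the file "data/o-ring-erosion-only.data". Opening the file in a text editor2 shows that its columns are separated by spaces. Reading it in with read.csv()'s default arguments does not go well.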
d <- read.csv("data/o-ring-erosion-only.data")
dim(d) # one row short, only 1 col
## [1] 22 1
typeof(d[,1])
## [1] "character"
Specifying header=FALSE fixes the column name issue, but sep = " " does not
fix the separator issue.
d <- read.csv("data/o-ring-erosion-only.data",
header=FALSE, sep = " ")
1 https://archive.ics.uci.edu/ml/datasets/Challenger+USA+Space+Shuttle+O-Ring
2 Open raw text files with text editor programs, not with programs that perform any kind of processing. For instance, if you open one with Microsoft Excel, the appearance of the data will change, and important information helping you to read your data into R or Python will not be available to you.
str(d)
## 'data.frame': 23 obs. of 7 variables:
## $ V1: int 6 6 6 6 6 6 6 6 6 6 ...
## $ V2: int 0 1 0 0 0 0 0 0 1 1 ...
## $ V3: int 66 70 69 68 67 72 73 70 57 63 ...
## $ V4: int NA NA NA NA NA NA 100 100 200 200 ...
## $ V5: int 50 50 50 50 50 50 NA NA NA 10 ...
## $ V6: int NA NA NA NA NA NA 7 8 9 NA ...
## $ V7: int 1 2 3 4 5 6 NA NA NA NA ...
One space is strictly one space. Some rows have two, though. This causes there
to be two too many columns filled with NAs.
After digging into the documentation a bit further, you will notice that ""
works for “one or more spaces, tabs, newlines or carriage returns.” This is why
read.table(), with its default arguments, works well.
d <- read.table("data/o-ring-erosion-only.data")
str(d)
## 'data.frame': 23 obs. of 5 variables:
## $ V1: int 6 6 6 6 6 6 6 6 6 6 ...
## $ V2: int 0 1 0 0 0 0 0 0 1 1 ...
## $ V3: int 66 70 69 68 67 72 73 70 57 63 ...
## $ V4: int 50 50 50 50 50 50 100 100 200 200 ...
## $ V5: int 1 2 3 4 5 6 7 8 9 10 ...
This data set has columns whose widths are “fixed”, too. It is in “fixed width
format” because any given column has all its elements take up a constant
amount of characters. The third column has integers with two or three digits,
but no matter what, each row has the same number of characters.
You may choose to exploit this and use a specialized function that reads in
data in a fixed width format (e.g. read.fwf()). The frustrating thing about
this approach, though, is that you have to specify what those widths are. This
can be quite tedious, particularly if your data set has many columns and/or
many rows. The upside though, is that the files can be a little bit smaller,
because the data provider does not have to waste characters on separators.
In the example below, we specify widths that include blank spaces to the left
of the digits. On the other hand, if we specified widths=c(2,2,4,4,1), which
includes spaces to the right of digits, then columns would have been recognized
as characters.
d <- read.fwf("data/o-ring-erosion-only.data",
widths = c(1,2,3,4,3)) # or try c(2,2,4,4,1)
str(d)
## 'data.frame': 23 obs. of 5 variables:
## $ V1: int 6 6 6 6 6 6 6 6 6 6 ...
## $ V2: int 0 1 0 0 0 0 0 0 1 1 ...
## $ V3: int 66 70 69 68 67 72 73 70 57 63 ...
## $ V4: int 50 50 50 50 50 50 100 100 200 200 ...
## $ V5: int 1 2 3 4 5 6 7 8 9 10 ...
If you need to read in some text data that does not possess a tabular structure,
then you may need readLines(). This function will read in all of the text,
separate each line into an element of a character vector, and will not make any
attempt to parse lines into columns. Further processing can be accomplished
using the techniques from Section 3.9.
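Here is a sketch, using an HTML file that appears again in this chapter's exercises:

d <- readLines("data/Google.html")
d[1]
## [1] "<!DOCTYPE html>"
typeof(d)
## [1] "character"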
Some of you may have had difficulty reading in the above data. This can
happen if your machine’s default character encoding is different than mine. For
instance, if your character encoding is “GBK”3 , then you might get a warning
message like “invalid input found on input connection.” This message means
that your machine didn’t recognize some of the characters in the data set.
These errors are easy to fix, though, so don’t worry. Just specify an encoding
argument in your function that reads in data.
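For instance, if a file is known to be UTF-8 encoded, something like the following should work (the file name here is made up):

d <- read.csv("data/your_file.csv", fileEncoding = "UTF-8")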
Recall R has read.table() and read.csv(), and that they are very similar. In Pandas, pd.read_csv()5 and pd.read_table()6 have a lot in common, too. Their primary difference is also the default column separator.
Recall the O-Ring data from above. The columns are not separated by commas,
so if we treat it as a comma-separated file, the resulting Pandas DataFrame is
going to be missing all but one of its columns.
import pandas as pd
d = pd.read_csv("data/o-ring-erosion-only.data")
d.shape # one column and missing a row
## (22, 1)
d.columns # column labels are data
## Index(['6 0 66 50 1'], dtype='object')
pd.read_csv("data/o-ring-erosion-only.data",
header=None, sep = " ").head(2) # 1 space: no
## 0 1 2 3 4 5 6
## 0 6 0 66 NaN 50.0 NaN 1.0
## 1 6 1 70 NaN 50.0 NaN 2.0
pd.read_csv("data/o-ring-erosion-only.data",
header=None, sep = "\t").head(2) # tabs: no
## 0
## 0 6 0 66 50 1
## 1 6 1 70 50 2
pd.read_table("data/o-ring-erosion-only.data",
header=None).head(2) # default sep is tabs, so no
## 0
## 0 6 0 66 50 1
## 1 6 1 70 50 2
pd.read_csv("data/o-ring-erosion-only.data",
header=None, sep = "\s+").head(2) # 1 or more spaces: yes
## 0 1 2 3 4
## 0 6 0 66 50 1
## 1 6 1 70 50 2
5 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
6 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html
Reading in fixed width files can be done in a way that is nearly identical to
the way we did it in R. Here is an example.
d = pd.read_fwf("data/o-ring-erosion-only.data",
widths = [1,2,3,4,3], header=None) # try [2,2,4,4,1]
d.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 23 entries, 0 to 22
## Data columns (total 5 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 0 23 non-null int64
## 1 1 23 non-null int64
## 2 2 23 non-null int64
## 3 3 23 non-null int64
## 4 4 23 non-null int64
## dtypes: int64(5)
## memory usage: 1.0 KB
d = pd.read_fwf("data/o-ring-erosion-only.data",
widths = [2,2,4,4,1], header=None)
list(d.dtypes)[:4]
## [dtype('int64'), dtype('int64'), dtype('O'), dtype('O')]
d = d.astype({2:'string', 3:'string'})
list(d.dtypes)[:4]
## [dtype('int64'), dtype('int64'), StringDtype, StringDtype]
Just like in R, you may run into an encoding issue with a file. For instance, the following will not work because the file contains Chinese characters. If you mostly work with UTF-8 files, you will receive a UnicodeDecodeError if you try to run the following code.

pd.read_csv("data/message.txt")

7 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
8 https://pandas.pydata.org/docs/reference/api/pandas.StringDtype.html#pandas.StringDtype
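Specifying a suitable encoding fixes the problem. Assuming the file happens to be GBK-encoded (an assumption in this sketch), the following should work.

d = pd.read_csv("data/message.txt", encoding = "gbk")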
You may also read in unstructured, nontabular data with Python. Use the built-
in open()11 function to open up a file in read mode, and then use f.readlines()
to return a list of strings.
f = open("data/Google.html", "r")
d = f.readlines()
d[:1]
## ['<!DOCTYPE html>\n']
print(type(d), type(d[0]))
## <class 'list'> <class 'str'>
9 A list of more encodings that are built into Python is available here.10
10 https://docs.python.org/3/library/codecs.html#standard-encodings
11 https://docs.python.org/3/library/functions.html#open
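9.4 Saving Data in R

Writing a data.frame out as a plain text file can be done with functions such as write.csv(), write.csv2(), or write.table(). Here is a sketch that is consistent with the output shown below (write.csv2() uses semicolons as separators):

d <- read.table("data/o-ring-erosion-only.data")
write.csv2(d, "data/oring_out.csv", row.names = FALSE)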
The above will not print anything to the R console, but we can use a text
editor to take a look at the raw text file on our hard drive. Here are the first
three rows.
"V1";"V2";"V3";"V4";"V5"
6;0;66;50;1
6;1;70;50;2
9.4.2 Serialization in R
Alternatively, you may choose to store your data in a serialized form. With this approach, you are still saving your data in a more permanent way to your hard drive, but it is stored in a format that's usually more memory efficient.
Recall that a common reason for writing out data is to save your
progress. When you want to save your progress, it is important to
ask yourself: “is it better to save my progress as a serialized object,
or as a raw text file?”
When making this decision, consider versatility. On the one hand, raw text files are more versatile and can be used in more places. On the other hand, versatility is often bug prone.
For example, suppose you want to save a cleaned up data.frame.
Are you sure you will remember to store that column of strings as
character and not a factor? Does any code that uses this data.frame
require that this column be in this format?
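Serializing a single object is done with saveRDS(); here is a sketch:

saveRDS(d, file = "data/oring.rds")
rm(d)
d <- readRDS("data/oring.rds")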
After it is saved with saveRDS(), we are free to delete the variable with rm(), because it can be read back in later on. To do this, call readRDS(). This file has a special format that is recognized by R, so you will not need to worry about any of the usual struggles that occur when reading in data from a plain text file. Additionally, .rds files are typically smaller—oring.rds is only 248 bytes, while "oring_out.csv" is 332 bytes.
You can serialize multiple objects at once, too! Convention dictates that these
files end with the .RData suffix. Save your entire global environment with
save() or save.image(), and bring it back with load() or attach().
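Here is a sketch; the variable and file names are made up:

x <- 1
y <- "two"
save(x, y, file = "data/myStuff.RData") # or save.image() to save everything
rm(x, y)
load("data/myStuff.RData")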
import pandas as pd
d = pd.read_csv("data/o-ring-erosion-only.data",
header=None, sep = "\s+")
d.to_csv("data/oring_out2.csv",
header=True, index=False, sep = ",")
Here is how the first few rows of that file look in a text editor.
0,1,2,3,4
6,0,66,50,1
6,1,70,50,2
12 https://pandas.pydata.org/pandas-docs/stable/reference/io.html#input-output
13 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv
14 https://docs.python.org/3/library/pickle.html
15 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_pickle.html
16 https://pandas.pydata.org/docs/reference/api/pandas.read_pickle.html#pandas.read_pickle
d.to_pickle("data/oring.pickle")
del d
d_is_back = pd.read_pickle("data/oring.pickle")
d_is_back.head(2)
## 0 1 2 3 4
## 0 6 0 66 50 1
## 1 6 1 70 50 2
9.6 Exercises
9.6.1 R Questions
1.
Consider again the data set called "gspc.csv", which contains daily open, high,
low and close values for the S&P500 Index.
17 The documentation for pickle18 mentions that the library is "not secure against erroneous or maliciously constructed data" and recommends that you "[n]ever unpickle data received from an untrusted or unauthenticated source."
18 https://docs.python.org/2/library/pickle.html
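9.6.2 Python Questions

1.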
a) Use open() to open the "Google.html" file. Store the output of the
function as my_file.
b) Use the .readlines() method of the file to store the contents of the file as a list called html_data.
c) Coerce the list to a DataFrame with one column called html.
d) Create a Series called nchars_ineach that stores the number of char-
acters in each line of text. Hint: the Series.str attribute has a lot of
helpful methods19 .
e) Create an int-like variable called num_div_tags that holds the total
number of times the phrase “<div>” appears in the file.
2.
Consider the data set called "gspc.csv", which contains daily open, high, low
and close values for the S&P500 Index.
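10 Using Third-Party Code

Installing a package in R looks like this.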
install.packages("thePackage")
There are some packages that will not be available using this method. For
more information on that situation, see here.5
library(thePackage)
You can also use the require() function, which has slightly different behavior
when the requested package is not found.
To understand this more deeply, we need to talk about environments again. We discussed these before in Section 6.3, but only in the context of user-defined functions. When we load in a package with library(), we make its contents available by putting it all in an environment for that package.
An environment6 holds the names of objects. There are usually several environments, and each holds a different set of functions and variables. All the variables you define are in an environment, every package you load in gets its own environment, and all the functions that come pre-loaded in R have their own environment.
5 https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-pkgs.html#install-non-conda-packages
6 https://cran.r-project.org/doc/manuals/R-lang.html#Environment-objects
When you refer to a variable, R will first search the global environment, and then it will traverse the environments in the search path, in order. This has a few important implications.
• First, don’t define variables in the global environment that are
already named in another environment. There are many variables that
come pre-loaded in the base package (to see them, type ls("package:base")),
and if you like using a lot of packages, you’re increasing the number of names
you should avoid using.
• Second, don't library() in a package unless you need it, and if you do, be aware of all the names it will mask in packages you loaded in before. The good news is that library() will often print warnings letting you know which names have been masked. The bad news is that it's somewhat out of your control—if you need two packages, then they might have a shared name, and the only thing you can do about it is watch the order in which you load them.
• Third, don’t use library() inside code that is source()’d in other files. For
example, if you attach a package to the search path from within a function
you defined, anybody that uses your function loses control over the order of
packages that get attached.
All is not lost if there is a name conflict. The variables haven't disappeared; it's just slightly more difficult to refer to them. For instance, if I load in Hmisc (Harrell Jr et al., 2021), I get the warning that format.pval and units are now masked because those names were already in "package:base". I can still refer to these masked variables with the double colon operator (::).
library(Hmisc)
# this now refers to Hmisc's format.pval
# because it was loaded more recently
format.pval
Hmisc::format.pval # in this case is the same as above
# the below code is the only way
# you can get base's format.pval now
base::format.pval
Python's importing mechanism is also more flexible. To make the contents of a package called, say, the_package
available, type one of the following inside a Python session.
import the_package
import the_package as tp
from the_package import *
15 https://numpy.org/doc/stable/reference/random/index.html?highlight=random#module-numpy.random
This is one use of the dot operator (.). It is also used to access attributes and methods of objects (more information on that will come later in Chapter 14). normal is inside of random, which is itself inside of np.
Finally, we can import the function directly, and refer to it with only one letter. This is highly discouraged, though. We are much more likely to accidentally use the name n twice. Further, n is not a very descriptive name, which means it could be difficult to understand what your program is doing later.
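Here is a sketch of what that looks like:

from numpy.random import normal as n
n(size = 3) # works, but n is easy to clobber and not descriptive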
Keep in mind, you’re always at risk of accidentally re-using names, even if you
aren’t importing anything. For example, consider the following code.
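n = 3 # oops: the name n is re-used for an int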
This is very bad, because now you cannot use the n() function that was imported from the numpy.random sub-module earlier. In other words, it is no longer callable. The error message from the above code will be something like TypeError: 'int' object is not callable.

16 https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html
Use the dir() function to see what is available inside a module. Here are a
few examples. Type them into your own machine to see what they output.
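import numpy as np
dir(np) # everything available at numpy's top level
dir(np.random) # everything inside the random sub-module
dir([1, 2, 3]) # methods and attributes of a list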
10.5 Exercises
1.
2.
What are important similarities and differences in the package loading proce-
dures of R and Python? Select all that apply.
3.
In Python, which of the following is, generally speaking, the best way to
import?
• import the_package
• from the_package import *
• import the_package as tp
4.
In Python, which of the following is, generally speaking, the worst way to
import?
• import the_package
• from the_package import *
• import the_package as tp
5.
In R, if you want to use a function func() from package, do you always have
to use library(package) or require(package) first?
• Yes, otherwise func() won’t be available.
• No, you can just use package::func() without calling any function that
performs pre-loading.
11 Control Flow
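In R1, the body of an if statement is wrapped in curly braces, and the Boolean condition is wrapped in parentheses. Here is a sketch that mirrors the Python example below.

myName <- "Taylor"
if(myName == "Taylor"){
  print("hi Taylor")
}
## [1] "hi Taylor"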
In Python2 , you don’t need curly braces, but the indentation needs to be just
right, and you need a colon (Lutz, 2013).
my_name = "Taylor"
if my_name == "Taylor":
print("hi Taylor")
## hi Taylor
There can be more than one test of truth. To test alternative conditions, you can add one or more else if (in R) or elif (in Python) blocks. The first block whose Boolean condition is found to be true will execute, and none of the remaining conditions will be checked.
1 https://cran.r-project.org/doc/manuals/r-release/R-lang.html#if
2 https://docs.python.org/3/tutorial/controlflow.html#if-statements
# in R
food <- "muffin"
if(food == "apple"){
print("an apple a day keeps the doctor away")
}else if(food == "muffin"){
print("muffins have a lot of sugar in them")
}else{
print("neither an apple nor a muffin")
}
## [1] "muffins have a lot of sugar in them"
# in Python
my_num = 42.999
if my_num % 2 == 0:
print("my_num is even")
elif my_num % 2 == 1:
my_num += 1
print("my_num was made even")
else:
print("you're cheating by not using integers!")
## you're cheating by not using integers!
11.2 Loops
One line of code generally does one “thing,” unless you’re using loops. Code
written inside a loop will execute many times.
The most common loop for us will be a for loop. A simple for loop in R3
might look like this
3 https://cran.r-project.org/doc/manuals/r-release/R-lang.html#for
#in R
myLength <- 9
r <- vector(mode = "numeric", length = myLength)
for(i in seq_len(myLength)){
r[i] <- i
}
r
## [1] 1 2 3 4 5 6 7 8 9
#in Python
my_length = 9
r = []
for i in range(my_length):
r.append(i)
r
## [0, 1, 2, 3, 4, 5, 6, 7, 8]
Loops are for repeatedly executing code. for loops are great when you know the number of iterations needed ahead of time. If the number of iterations is not known, then you'll need a while loop. A while loop keeps iterating as long as its condition is found to be true. Here are some examples in R7 and in Python8.
# in R
keepGoing <- TRUE
while(keepGoing){
oneOrZero <- rbinom(1, 1, .5)
print(paste("oneOrZero:", oneOrZero))
if(oneOrZero == 1)
keepGoing <- FALSE
}
## [1] "oneOrZero: 0"
## [1] "oneOrZero: 0"
## [1] "oneOrZero: 1"
# in Python
keep_going = True
while keep_going:
one_or_zero = np.random.binomial(1, .5)
print("one_or_zero: ", one_or_zero)
if one_or_zero == 1:
keep_going = False
## one_or_zero: 0
## one_or_zero: 1
7 https://cran.r-project.org/doc/manuals/r-release/R-lang.html#while
8 https://docs.python.org/3/reference/compound_stmts.html#while
Python also provides an alternative way to construct lists similar to the one
we constructed in the above example. They are called list comprehensions9 .
These are convenient because you can incorporate iteration and conditional
logic in one line of code.
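Here is a sketch that triples each even number between 0 and 9:

[3*i for i in range(10) if i % 2 == 0]
## [0, 6, 12, 18, 24]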
You might also have a look at generator expressions10 and dictionary comprehensions11.
R can come close to replicating the above behavior with vectorization, but the
conditional part is hard to achieve without subsetting.
3*seq(0,9)[seq(0,9)%%2 == 0]
## [1] 0 6 12 18 24
11.3 Exercises
11.3.1 R Questions
1.
2.
$$p(x) = \begin{cases} \dfrac{x^2(1-x)}{\int_0^1 y^2(1-y)\,dy} & 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases} \tag{11.2}$$
Note that this algorithm allows for other proposal distributions. The only
requirement of a proposal distribution is that its range of possible values must
subsume the range of possible values of the target.
a) Write a function called arSamp(n) that samples from 𝑝(𝑥) using accept-
reject sampling. It should take a single argument that is equal to the
number of samples desired. Below is one step of the accept-reject
algorithm. You will need to do many iterations of this. The number
of iterations will be random, because some of these proposals will not
be accepted.
3.
4.
Suppose you are trying to predict a value of $Y$ given some information about a corresponding independent variable $x$. Suppose further that you have a historical data set of observations $(x_1, y_1), \ldots, (x_n, y_n)$. One approach for coming up with predictions is to use Nadaraya–Watson Kernel Regression (Nadaraya, 1964) (Watson, 1964). The prediction this approach provides is simply a weighted average of all of the historically observed data points $y_1, \ldots, y_n$. The weight for a given $y_i$ will be larger if $x_i$ is "close" to the value $x$ that you are obtaining predictions for. On the other hand, if $x_j$ is far away from $x$, then the weight for $y_j$ will be relatively small, and so this data point won't influence the prediction much.
Write a function called kernReg(xPred,xData,yData,kernFunc) that computes the Nadaraya–Watson estimate of the prediction of $Y$ given $X = x$. Do not use a for loop in your function definition. The formula is
$$\sum_{i=1}^{n} \frac{K(x - x_i)}{\sum_{j=1}^{n} K(x - x_j)} \, y_i, \tag{11.3}$$

where $x$ is the point you're trying to get a prediction for.
• Your function should return one floating point number.
• The input xPred will be a floating point number.
• The input xData is a one-dimensional vector of numerical data of independent
variables.
• The input yData is a one-dimensional vector of numerical data of dependent
variables.
• kernFunc is a function that accepts a numeric vector and returns a floating point number. It should be vectorized.
Below is some code that will help you test your predictions. The kernel function, gaussKernel(), implements the Gaussian kernel function $K(z) = \exp[-z^2/2]/\sqrt{2\pi}$. Notice the creation of preds was commented out. Use a for loop to generate predictions for all elements of xTest and store them in the vector preds.
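The helper code itself did not survive here; a minimal sketch of what it presumably contained (the xTest values are an assumption):

gaussKernel <- function(z) exp(-z^2/2) / sqrt(2*pi)
xTest <- seq(-2, 2, length.out = 10) # hypothetical test inputs
# preds <- ... # fill this in with a for loop over xTest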
Suppose you go to the casino with 10 dollars. You decide that your policy is
to play until you go broke, or until you triple your money. The only game
you play costs $1 to play. If you lose, you lose that dollar. If you win, you get
another $1 in addition to getting your money back.
b) Use a for loop to call your function 5000 times with probability p=.5.
Each time, store the number of games played. Store them all in a
Numpy array or Pandas Series called simulated_durations.
c) Take the average of simulated_durations. This is your Monte Carlo
estimate of the expected duration. How does it compare with what
you think it should be theoretically?
d) Perform the same analysis to estimate the expected duration when
𝑝 = .7. Store your answer as a float called expec_duration.
2.
Suppose you have the following data set. Please include the following snippet
in your submission.
import numpy as np
import pandas as pd
my_data = pd.read_csv("sim_data.csv", header=None).values.flatten()
a) Calculate the mean of this data set and store it as a floating point
number called sample_mean.
b) Calculate 5,000 bootstrap sample means. Store them in a Numpy array called bootstrapped_means. Use a for loop, and inside the loop, sample with replacement 1000 times from the length 1000 data set. You can use the function np.random.choice() to accomplish this.
c) Calculate the sample mean of these bootstrapped means. This is a
good estimate of the theoretical mean/expectation of the sample mean.
Call it mean_of_means.
d) Calculate the sample variance of these bootstrapped means. This is a
good estimate of the theoretical variance of the sample mean. Call it
var_of_means.
3.
Write a function called ar_samp(n) that samples from 𝑝(𝑥) using accept-reject
sampling. Use any proposal distribution that you’d like. It should take a single
argument that is equal to the number of samples desired. Sample from the
following target:
12
Reshaping and Combining Data Sets

In R, it all starts with vectors. There are two common functions you should know: sort() and order(). sort() returns the sorted data, while order() returns the indices that would sort the data.
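Here is a quick sketch in R (these particular numbers are made up for illustration):

sillyData <- c(0.87, -1.07, -0.53)
sort(sillyData)
## [1] -1.07 -0.53 0.87
order(sillyData)
## [1] 2 3 1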
import numpy as np
silly_data = np.random.normal(size=5)
print(silly_data)
## [-0.52817175 -1.07296862 0.86540763 -2.3015387 1.74481176]
print( np.sort(silly_data) )
## [-2.3015387 -1.07296862 -0.52817175 0.86540763 1.74481176]
np.argsort(silly_data)
## array([3, 1, 0, 2, 4])
For Pandas’ DataFrames, most of the functions I find useful are methods
attached to the DataFrame class. That means that, as long as something is
inside a DataFrame, you can use dot notation.
import pandas as pd
car_data = pd.read_csv("data/cars.csv")
car_data['no_dlr_msrp'] = car_data['MSRP'].str.replace("$", "",
                                                       regex = False)
no_commas = car_data['no_dlr_msrp'].str.replace(",","")
car_data['clean_MSRP'] = no_commas.astype(float)
car_data = car_data.sort_values(by='clean_MSRP', ascending = False)
car_data[["Make", "Model", "MSRP", "clean_MSRP"]].head(5)
## Make Model MSRP clean_MSRP
## 334 Porsche 911 GT2 2dr $192,465 192465.0
## 262 Mercedes-Benz CL600 2dr $128,420 128420.0
## 271 Mercedes-Benz SL600 convertible 2dr $126,670 126670.0
1 https://numpy.org/doc/stable/reference/generated/numpy.argsort.html
2 https://numpy.org/doc/stable/reference/generated/numpy.sort.html
1. you need to add a new row (or many rows) to a data frame,
2. you need to recombine data sets (e.g. recombine a train/test split), or
3. you’re creating a matrix in a step-by-step way.
In R, this can be done with rbind() (short for “row bind”). Consider the
following example that makes use of GIS data queried from (Albemarle County
Geographic Data Services Office, 2021) and cleaned with code from (Ford,
2016).
3 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html
4 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
The above example was with data.frames. This example of rbind() is with
matrix objects.
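That example is also missing here; a small sketch with matrices (values are made up):

m1 <- matrix(1:4, nrow = 2)
m2 <- matrix(5:8, nrow = 2)
rbind(m1, m2) # stacks the two 2-by-2 matrices into a 4-by-2 matrix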
In Python, you can stack data frames with pd.concat()⁵. It has a lot of options, so feel free to peruse them. You can also replace the call to pd.concat() below with test.append(train)⁶. Consider the example below that uses the Albemarle County real estate data (Albemarle County Geographic Data Services Office, 2021) (Ford, 2016).
import pandas as pd
real_estate = pd.read_csv("data/albemarle_real_estate.csv")
train = real_estate.iloc[1:,]
test = real_estate.iloc[[0],] # need the extra brackets!
stacked = pd.concat([test,train], axis=0)
stacked.iloc[:3,:3]
## YearBuilt YearRemodeled Condition
5 https://www.google.com/search?client=safari&rls=en&q=pandas+concat&ie=UTF-8&oe=UTF-8
6 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html
Take note of the extra square brackets when we create test. If you use real_estate.iloc[0,] instead, it will return a Series with all the elements coerced to the same type, and this won't concatenate properly with the rest of the data when passed to pd.concat()!
# in R
baby1 <- read.csv("data/baby1.csv", stringsAsFactors = FALSE)
baby2 <- read.csv("data/baby2.csv", stringsAsFactors = FALSE)
head(baby1)
## idnum height.inches. email_address
## 1 1 74 [email protected]
## 2 3 66 [email protected]
7 https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/merge
8 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html#pandas-dataframe-merge
## 3 4 62 [email protected]
## 4 23 62 [email protected]
head(baby2)
## idnum phone email
## 1 3901283 5051234567 [email protected]
## 2 41823 5051234568 [email protected]
## 3 7198273 5051234568 [email protected]
The first thing you need to ask yourself is "which column is the unique identifier that is shared between these two data sets?" In our case, they both have an "identification number" column. However, these two data sets come from different online platforms, and these two places use different schemes to number their users.

In this case, it is better to merge on the email addresses. Users might be using different email addresses on these two platforms, but there is a stronger guarantee that matched email addresses mean you're matching the right accounts. The columns are named differently in each data set, so we must specify them by name.
# in R
merge(baby1, baby2, by.x = "email_address", by.y = "email")
## email_address idnum.x height.inches. idnum.y phone
## 1 [email protected] 3 66 7198273 5051234568
## 2 [email protected] 4 62 3901283 5051234567
## 3 [email protected] 23 62 3901283 5051234567
# in Python
baby1.merge(baby2, left_on = "email_address", right_on = "email")
## idnum_x height(inches) email_address idnum_y phone email
## 0 3 66 [email protected] 7198273 5051234568 [email protected]
## 1 4 62 [email protected] 3901283 5051234567 [email protected]
## 2 23 62 [email protected] 3901283 5051234567 [email protected]
with another person’s email address. In the case of duplicates, both rows will
match with the same rows in the other data frame.
Also, in this case, all email addresses that weren't found in both data sets were thrown away. This is not necessarily the behavior you want. For instance, if we wanted to make sure no rows were thrown away, that would be possible. In that case, though, for email addresses that weren't found in both data sets, some information will be missing. Recall that Python and R handle missing data differently (Section 3.8.2).
# in R
merge(baby1, baby2,
by.x = "email_address", by.y = "email",
all.x = TRUE, all.y = TRUE)
## email_address idnum.x height.inches. idnum.y phone
## 1 [email protected] 3 66 7198273 5051234568
## 2 [email protected] 1 74 NA NA
## 3 [email protected] 4 62 3901283 5051234567
## 4 [email protected] 23 62 3901283 5051234567
## 5 [email protected] NA NA 41823 5051234568
# in Python
le_merge = baby1.merge(baby2,
left_on = "email_address", right_on = "email",
how = "outer")
le_merge.iloc[:5,3:]
## idnum_y phone email
## 0 NaN NaN NaN
## 1 7198273.0 5.051235e+09 [email protected]
## 2 3901283.0 5.051235e+09 [email protected]
## 3 3901283.0 5.051235e+09 [email protected]
## 4 41823.0 5.051235e+09 [email protected]
You can see it’s slightly more concise in Python. If you are familiar with SQL,
you might have heard of inner and outer joins. This is where Pandas takes
some of its argument names from9 .
9
https://pandas.pydata.org/pandas-docs/version/0.15/merging.html#database-
style-dataframe-joining-merging
Finally, if both data sets have multiple values in the column you’re joining on,
the result can have more rows than either table. This is because every possible
match shows up.
# in R
first <- data.frame(category = c('a','a'), measurement = c(1,2))
merge(first, first, by.x = "category", by.y = "category")
## category measurement.x measurement.y
## 1 a 1 1
## 2 a 1 2
## 3 a 2 1
## 4 a 2 2
# in Python
first = pd.DataFrame({'category' : ['a','a'], 'measurement' : [1,2]})
first.merge(first, left_on = "category", right_on = "category")
## category measurement_x measurement_y
## 0 a 1 1
## 1 a 1 2
## 2 a 2 1
## 3 a 2 2
12.4 Long versus Wide Data

# (this data.frame's creation was reconstructed from the output below)
fakeLongData1 <- data.frame(person = c("Taylor", "Taylor", "Charlie", "Charlie"),
                            timeObserved = c(1, 2, 1, 2),
                            nums = c(100, 101, 300, 301))
fakeLongData1
## person timeObserved nums
## 1 Taylor 1 100
## 2 Taylor 2 101
## 3 Charlie 1 300
## 4 Charlie 2 301
A long format can also be used if you have multiple observations (at a single
time point) on an experimental unit. Here is another example.
If you would like to reshape the long data sets into a wide format, you can use the reshape() function. You will need to specify which columns correspond with the experimental unit, and which column is the "factor" variable. reshape() will also go in the other direction: it can take wide data and convert it into long data.
reshape(fakeWideData1,
        direction = "long",
        idvar = "person",
        varying = list(c("before","after")),
        v.names = "nums")
## person time nums
## Taylor.1 Taylor 1 100
## Charlie.1 Charlie 1 300
## Taylor.2 Taylor 2 101
## Charlie.2 Charlie 2 301
fakeLongData1
## person timeObserved nums
## 1 Taylor 1 100
## 2 Taylor 2 101
## 3 Charlie 1 300
## 4 Charlie 2 301
reshape(fakeWideData2,
        direction = "long",
        idvar = "person",
        varying = list(c("attribute A","attribute B")),
        v.names = "nums")
## person time nums
## Taylor.1 Taylor 1 100
## Charlie.1 Charlie 1 300
## Taylor.2 Taylor 2 101
## Charlie.2 Charlie 2 301
fakeLongData2
## person attributeName nums
## 1 Taylor attrA 100
## 2 Taylor attrB 101
## 3 Charlie attrA 300
## 4 Charlie attrB 301
import pandas as pd
fake_long_data1 = pd.DataFrame(
    {'person' : ["Taylor","Taylor","Charlie","Charlie"],
     'time_observed' : [1, 2, 1, 2],
     'nums' : [100,101,300,301]})
fake_long_data1
## person time_observed nums
## 0 Taylor 1 100
## 1 Taylor 2 101
## 2 Charlie 1 300
## 3 Charlie 2 301
pivot_data1 = fake_long_data1.pivot(index='person',
                                    columns='time_observed',
                                    values='nums')
fake_wide_data1 = pivot_data1.reset_index()
fake_wide_data1
## time_observed person 1 2
## 0 Charlie 300 301
## 1 Taylor 100 101
10 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html
11 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html?highlight=melt
12 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
Here’s one more example showing the same functionality—going from long to
wide format.
people_names = ["Taylor","Taylor","Charlie","Charlie"]
attribute_list = ['attrA', 'attrB', 'attrA', 'attrB']
fake_long_data2 = pd.DataFrame({'person' : people_names,
                                'attribute_name' : attribute_list,
                                'nums' : [100,101,300,301]})
fake_wide_data2 = fake_long_data2.pivot(index='person',
                                        columns='attribute_name',
                                        values='nums').reset_index()
fake_wide_data2
## attribute_name person attrA attrB
## 0 Charlie 300 301
## 1 Taylor 100 101
Here are some examples of going in the other direction: from wide to long with
pd.DataFrame.melt()13 . The first example specifies value columns by integers.
fake_wide_data1
## time_observed person 1 2
## 0 Charlie 300 301
## 1 Taylor 100 101
fake_wide_data1.melt(id_vars = "person", value_vars = [1,2])
## person time_observed value
## 0 Charlie 1 300
## 1 Taylor 1 100
## 2 Charlie 2 301
## 3 Taylor 2 101
13 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html?highlight=melt
fake_wide_data2
## attribute_name person attrA attrB
## 0 Charlie 300 301
## 1 Taylor 100 101
fake_wide_data2.melt(id_vars = "person",
                     value_vars = ['attrA','attrB'])
## person attribute_name value
## 0 Charlie attrA 300
## 1 Taylor attrA 100
## 2 Charlie attrB 301
## 3 Taylor attrB 101
12.5 Exercises
12.5.1 R Questions
1.
Recall the car.data data set (mis, 1997), which is hosted by (Dua and Graff,
2017).
2.
a) Pretend day1Data and day2Data are two separate data sets that possess
the same type of measures but on different experimental units. Stack
day1Data on top of day2Data and call the result stackedData.
b) Pretend day1Data and day2Data are different measurements on the
same experimental units. Place them shoulder to shoulder and call
the result sideBySide. Put day1Data first, and day2Data second.
3.
If you are dealing with random matrices, you might need to vectorize a matrix
object. This is not the same as “vectorization” in programming. Instead, it
means you write the matrix as a big column vector by stacking the columns on
top of each other. Specifically, if you have an $n \times p$ real-valued matrix $\mathbf{X}$, then

$$\text{vec}(\mathbf{X}) = \begin{bmatrix} \mathbf{X}_1 \\ \vdots \\ \mathbf{X}_p \end{bmatrix} \tag{12.1}$$
where $\mathbf{X}_i$ is the $i$th column as an $n \times 1$ column vector. There is another operator that we will use, the Kronecker product:

$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11}\mathbf{B} & \cdots & a_{1n}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{m1}\mathbf{B} & \cdots & a_{mn}\mathbf{B} \end{bmatrix}. \tag{12.2}$$
your work with %x%, but do not use this in your function.
4.
This problem uses the Militarized Interstate Disputes (v5.0) (Palmer et al.,
0) data set from The Correlates of War Project14 . There are four .csv files
we use for this problem. MIDA 5.0.csv contains the essential attributes of
each militarized interstate dispute from 1/1/1816 through 12/31/2014. MIDB
5.0.csv describes the participants in each of those disputes. MIDI 5.0.csv
contains the essential elements of each militarized interstate incident, and
MIDIP 5.0.csv describes the participants in each of those incidents.
a) Read in the four data sets and give them the names mida, midb, midi,
and midp. Take care to convert all instances of -9 to NA.
b) Examine all rows of midb where its dispnum column equals 2. Do not
change midb permanently. Are these two rows corresponding to the
same conflict? If so, assign TRUE to sameConflict. Otherwise, assign
FALSE.
c) Join the first two data sets together on the dispute number column
(dispnum). Call the resulting data.frame join1. Do not address any
concerns about duplicate columns.
d) Is there any difference between doing an inner join and an outer join
in the previous question? If there was a difference, assign TRUE to
theyAreNotTheSame. Otherwise, assign FALSE to it.
e) Join the last two data sets together by incidnum and call the result
join2. Is there any difference between an inner and an outer join for
this problem? Why or why not? Do not address any concerns about
duplicate columns.
f) The codebook mentions that the last two data sets don’t go as far
back in time as the first two. Suppose then that we only care about the
events in join2. Merge join2 and join1 in a way where all undesired
rows from join1 are discarded, and all rows from join2 are kept. Call
the resulting data.frame midData. Do not address any concerns about
duplicate columns.
g) Use a scatterplot to display the relationship between the maximum
duration and the end year. Plot each country as a different color.
h) Create a data.frame called longData that has the following three columns from midp: incidnum (incident identification number), stabb (state abbreviation of participant), and fatalpre (precise number of fatalities). Convert this to "wide" format. Make the new table called wideData. Use the incident number column as a unique row-identifying variable.
14 https://correlatesofwar.org/
2.
indexes = np.random.choice(np.arange(20),size=20,replace=False)
d1 = pd.DataFrame({'a' : indexes,
                   'b' : np.random.normal(size=20)})
d2 = pd.DataFrame({'a' : indexes + 20,
                   'b' : np.random.normal(size=20)})
a) Pretend d1 and d2 are two separate data sets that possess the same type of measures but on different experimental units. Stack d1 on top of d2 and call the result stacked_data_sets. Make sure the index of the result is the numbers 0 through 39.
b) Pretend d1 and d2 are different measurements on the same exper-
imental units. Place them shoulder to shoulder and call the result
side_by_side_data_sets. Put d1 first, and d2 second.
3.
import numpy as np
import pandas as pd
dog_names1 = ['Charlie','Gus', 'Stubby', 'Toni','Pearl']
a) Join/merge the two data sets together in such a way that there is a
row for every dog, whether or not both tables have information for
that dog. Call the result merged1.
b) Join/merge the two data sets together in such a way that there are only
rows for every dog in dataset1, whether or not there is information
about these dogs’ breeds. Call the result merged2.
c) Join/merge the two data sets together in such a way that there are only
rows for every dog in dataset2, whether or not there is information
about the dogs’ nicknames. Call the result merged3.
d) Join/merge the two data sets together in such a way that all rows
possess complete information. Call the result merged4.
4.
a) Read in iris.csv and store the DataFrame with the name iris. Let
it have the column names 'a','b','c', 'd' and 'e'.
b) Create a DataFrame called name_key that stores correspondences be-
tween long names and short names. It should have three rows and
two columns. The long names are the unique values of column five
of iris. The short names are either 's', 'vers' or 'virg'. Use the
column names 'long name' and 'short name'.
c) Merge/join the two data sets together to give iris a new column
with information about short names. Do not overwrite iris. Rather,
give the DataFrame a new name: iris_with_short_names. Remove any
columns with duplicate information.
d) Change the first four column names of iris_with_short_names to
s_len, s_wid, p_len, and p_wid. Use Matplotlib to create a figure
with 4 subplots arranged into a 2 × 2 grid. On each subplot, plot a
13
Visualization

I describe a few plotting paradigms in R and Python below. Note that these descriptions are brief. More details could easily turn any of these subsections into an entire textbook.
13.1 Base R Plotting

df <- read.csv("data/albemarle_real_estate.csv")
str(df, strict.width = "cut")
## 'data.frame': 30381 obs. of 12 variables:
## $ YearBuilt : int 1769 1818 2004 2006 2004 1995 1900 1960 ..
## $ YearRemodeled: int 1988 1991 NA NA NA NA NA NA NA NA ...
## $ Condition : chr "Average" "Average" "Average" "Average" ..
## $ NumStories : num 1.7 2 1 1 1.5 2.3 2 1 1 1 ...
## $ FinSqFt : int 5216 5160 1512 2019 1950 2579 1530 800 9..
## $ Bedroom : int 4 6 3 3 3 3 4 2 2 2 ...
## $ FullBath : int 3 4 2 3 3 2 1 1 1 1 ...
## $ HalfBath : int 0 1 1 0 0 1 0 0 0 0 ...
hist(log(df$TotalValue),
     xlab = "natural logarithm of home price",
     main = "Super-Duper Plot!")
I specified the xlab= and main= arguments, but there are many more that could
be tweaked. Make sure to skim the options in the documentation (?hist).
plot() is useful for plotting two univariate numerical variables. This can be
done in time series plots (variable versus time) and scatter plots (one variable
versus another). For an example of two scatter plots, see Figure 13.2.
par(mfrow=c(1,2))
plot(df$TotalValue, df$LotSize,
     xlab = "total value ($)", ylab = "lot size (sq. ft.)",
     pch = 3, col = "red", type = "b")
plot(log(df$TotalValue), log(df$LotSize),
     xlab = "log. total value", ylab = "log. lot size",
     pch = 2, col = "blue", type = "p")
abline(h = log(mean(df$LotSize)), col = "green")
par(mfrow=c(1,1))
I use some of the many arguments available (type ?plot). xlab= and ylab=
specify the x- and y-axis labels, respectively. col= is short for “color.” pch= is
short for “point character.” Changing this will change the symbol shapes used
for each point. type= is more general than that, but it is related. I typically
use it to specify whether or not I want the points connected with lines.
I use a couple of other functions in the above code. abline() is used to superimpose lines over the top of a plot. The lines can be horizontal or vertical, you can specify them in slope-intercept form, or you can provide a linear model object. I also used par() to set a graphical parameter. The graphical parameter par()$mfrow sets the layout of a multiple-plot visualization. I then set it back to the standard 1 × 1 layout afterwards.
13.2 Plotting with ggplot2

library(ggplot2)
ggplot(mpg, aes(x = displ, y = hwy))
1 https://ggplot2.tidyverse.org/index.html
2 https://plotnine.readthedocs.io/en/stable/#
3 Personally, I find its syntax more confusing, and so I tend to prefer base graphics. However, it is very popular, and so I do believe that it is important to mention it here in this text.
4 https://www.rdocumentation.org/packages/ggplot2/versions/3.3.5/topics/ggplot
You’ll notice a few things about the code and the result produced:
library(ggplot2)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
5 https://ggplot2-book.org/toolbox.html#toolbox
6 https://www.rdocumentation.org/packages/ggplot2/versions/3.3.5/topics/geom_point
7 https://ggplot2-book.org/getting-started.html#basic-use
Additionally, notice that the same layer will behave much differently if we change the aesthetic mapping. The result after adding a color= aesthetic is displayed in Figure 13.4.
8 https://ggplot2-book.org/scales.html#scales
We can also change plot colors with scale layers. Let’s add an aesthetic called
fill= so we can use colors to denote the value of a numerical (not categorical)
column. This data set doesn’t have any more unused numerical columns, so
let’s create a new one called score. We also use a new geom layer from a
function called geom_tile() (see Figure 13.6).
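The snippet that produces Figure 13.6 is not reproduced here; a plausible sketch (the construction of score is an assumption):

mpg$score <- mpg$cty + mpg$hwy # a made-up numerical column
ggplot(mpg, aes(x = drv, y = class, fill = score)) +
  geom_tile()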
If we didn’t like these colors, we could change them with a scale layer. Personally,
I like this one (see Figure 13.7).
There are many to choose from, though. Try to run the following code on your
own to see what it produces.
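That snippet is also missing; one stand-in that swaps palettes with a scale layer (the particular scale function is an assumption):

ggplot(mpg, aes(x = drv, y = class, fill = score)) +
  geom_tile() +
  scale_fill_viridis_c()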
13.3 Plotting with Matplotlib

9 https://matplotlib.org/stable/tutorials/index.html
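The histogram code being described was lost at the page break; reconstructing it from the description that follows gives something like this sketch (the plotted data are an assumption):

import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots()
_ = ax.hist(np.random.normal(size=1000))
plt.show()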
Figure 13.8 displays a histogram generated from the above code. In the first
line, we import the pyplot submodule of matplotlib. We rename it to plt,
which is short, and will save us some typing. Calling it plt follows the most
popular naming convention.
Second, we import Numpy in the same way we always have. Matplotlib is
written to work with Numpy arrays. If you want to plot some data, and it isn’t
in a Numpy array, you should convert it first.
Third, we call the subplots() function, and use sequence unpacking to unpack the returned container into individual objects without storing the overall container. "Subplots" sounds like it will make many different plots all on one figure, but if you look at the documentation¹⁰ the number of rows and columns defaults to one and one, respectively.

plt.subplots() returns a tuple¹¹ ¹² of two things: a Figure object and one or more Axes object(s). These two classes will require some explanation.
In line four, we call the hist() method¹⁶ of the Axes object called ax. We assign
the output of .hist() to a variable _. This is done to suppress the printing of
the method’s output, and because this variable name is a Python convention
that signals the object is temporary and will not be used later in the program.
There are many more plots available than plain histograms. Each one has its own method, and you can peruse the options in the documentation¹⁷.
If you want to make figures that are more elaborate, just keep calling different
methods of ax. If you want to fit more subplots to the same figure, add more
Axes objects. Here is an example using some code from one of the official
10 https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html#matplotlib-pyplot-subplots
11 https://docs.python.org/3.3/library/stdtypes.html?highlight=tuple#tuple
12 We didn't talk about tuples in Chapter 2, but you can think of them as being similar to lists. They are containers that can hold elements of different types. There are a few key differences, though: they are made with parentheses (e.g. ('a',)) instead of square brackets, and they are immutable instead of mutable.
13 https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure
14 https://matplotlib.org/stable/api/axes_api.html#the-axes-class
15 https://matplotlib.org/stable/tutorials/introductory/usage.html#axes
16 https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.hist.html#matplotlib.axes.Axes.hist
17 https://matplotlib.org/stable/api/axes_api.html#plotting
Matplotlib tutorials18 . The plot generated from the code is displayed in Figure
13.9.
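The lines defining x and myAxes were lost to the page break; a reconstruction consistent with the calls below (the exact grid of x values is an assumption):

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 2, 100) # hypothetical grid of points
fig, myAxes = plt.subplots(1, 2)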
# first subplot
myAxes[0].plot(x, x, label='linear') # Plot some data on the axes.
## [<matplotlib.lines.Line2D object at 0x7f2e0412bef0>]
myAxes[0].plot(x, x**2, label='quadratic') # Plot more data
## [<matplotlib.lines.Line2D object at 0x7f2e0413a048>]
myAxes[0].plot(x, x**3, label='cubic') # ... and some more.
## [<matplotlib.lines.Line2D object at 0x7f2e0413a390>]
myAxes[0].set_xlabel('x label') # Add an x-label to the axes.
## Text(0.5, 0, 'x label')
myAxes[0].set_ylabel('y label') # Add a y-label to the axes.
## Text(0, 0.5, 'y label')
18 https://matplotlib.org/stable/tutorials/introductory/usage.html#the-object-oriented-interface-and-the-pyplot-interface
myAxes[0].legend() # Add a legend (this call was lost at the page break)
## <matplotlib.legend.Legend object at 0x7f2e042a4ac8>
# second subplot
myAxes[1].plot(x,np.sin(x), label='sine wave')
## [<matplotlib.lines.Line2D object at 0x7f2e0413aba8>]
myAxes[1].legend()
## <matplotlib.legend.Legend object at 0x7f2e0415df28>
plt.show()
13.4 Plotting with Pandas

import pandas as pd
df = pd.read_csv("data/gspc.csv")
df.head()
## Index GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
## 0 2007-01-03 1418.030029 1429.420044 1407.859985 1416.599976 3.429160e+09 1416.599976
## 1 2007-01-04 1416.599976 1421.839966 1408.430054 1418.339966 3.004460e+09 1418.339966
## 2 2007-01-05 1418.339966 1418.339966 1405.750000 1409.709961 2.919400e+09 1409.709961
19 https://pandas.pydata.org/docs/user_guide/visualization.html#chart-visualization
20 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
Choosing among nondefault plot types can be done in a variety of ways. First, you can use the .plot accessor data member of a DataFrame. Second, you can pass different strings to .plot()'s kind= parameter. Third, some plot types (e.g. boxplots and histograms) have their own dedicated methods.
df['returns'] = df['GSPC.Adjusted'].pct_change()
df['returns'].plot(kind='hist')
# same as df['returns'].plot.hist()
# same as df['returns'].hist()
## <matplotlib.axes._subplots.AxesSubplot object at 0x7f2e04117da0>
There are also several freestanding plotting functions21 (not methods) that
take in DataFrames and Series objects. Each of these functions is typically
imported individually from the pandas.plotting submodule.
The following code is an example of creating a “lag plot,” which is simply a
scatterplot between a time series’ lagged and nonlagged values. The primary
benefit of this function over .plot() is that this function does not require you
to construct an additional column of lagged values, and it comes up with good
default axis labels.
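The lag plot snippet itself is not shown here; a minimal sketch (it assumes the df['returns'] column created above):

from pandas.plotting import lag_plot
lag_plot(df['returns'])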
13.5 Exercises
13.5.1 R Questions
1.
$$f(x, y) = \frac{1}{2\pi} \exp\left[ -\frac{x^2 + y^2}{2} \right]. \tag{1}$$
21 https://pandas.pydata.org/docs/user_guide/visualization.html#visualization-tools
The random elements $X$ and $Y$, in this particular case, are independent, each with zero mean and unit variance. In this case, the marginal for $X$ is a mean 0, unit variance normal distribution:

$$g(x) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{x^2}{2} \right]. \tag{2}$$
a) Generate two plots of the bivariate density. For one, use persp(). For
the other, use contour().
b) Generate a third plot of the univariate density.
2.
Reproduce Figure 15.1 in Section 15.2.1.1 that displays the simple "spline" function. Feel free to use any of the visible code in the text.
2.
22 https://correlatesofwar.org/
Part III
Programming Styles
14
An Introduction to Object-Oriented Programming
Here are a few abstract concepts that will help thinking about OOP. They are
not mutually exclusive, and they aren’t unique to OOP, but understanding
these words will help you understand the purpose of OOP. Later on, when we
start looking at code examples, I will alert you to when these concepts are
coming into play.
• Composition refers to the idea of one type of object containing an object of another type. For example, a linear model object could hold on to estimated regression coefficients, residuals, etc.
• Inheritance takes place when an object can be considered to be of another
type(s). For example, an analysis of variance linear regression model might
be a special case of a general linear model.
• Polymorphism is the idea that the programmer can use the same code
on objects of different types. For example, built-in functions in both R and
Python can work on arguments of a wide variety of different types.
In Python, we can usually calculate this one number very easily using np.average. However, this function requires that we pass into it all of the data at once. What if we don't have all the data at any given time? In other words, suppose that the data arrive intermittently. We might consider taking advantage of a recursive formula for the sample mean:
$$\bar{x}_n = \frac{(n-1)\bar{x}_{n-1} + x_n}{n} \tag{14.2}$$
How would we program this in Python? A first option: we might create a
variable my_running_ave, and after every data point arrives, we could
my_running_ave = 1.0
my_running_ave
## 1.0
my_running_ave = ((2-1)*my_running_ave + 3.0)/2
my_running_ave
## 2.0
my_running_ave = ((3-1)*my_running_ave + 2.0)/3
my_running_ave
## 2.0
There are a few problems with this. Every time we add a data point, the
formula slightly changes. Every time we update the average, we have to write
a different line of code. This opens up the possibility for more bugs, and it
makes your code less likely to be used by other people and more difficult to
understand. And if we were trying to code up something more complicated
than a running average? That would make matters even worse.
A second option: write a class that holds onto the running average, and that
has
1. an update method that updates the running average every time a new
data point is received, and
2. a method that returns the current running average.
After seeing these new words that are unfamiliar and long, it's tempting to dismiss these new ideas as superfluous. After all, if you are confident that you can get your program working, why stress about all these new concepts? If it ain't broke, don't fix it, right?

I urge you to try to keep an open mind, particularly if you are already confident that you understand the basics of programming in R and Python. The topics in this chapter are more centered around design choices. This material won't help you write a first draft of a script even faster, but it will make your code much better. Even though you will have to slow down a bit before you start typing, thinking about your program more deeply will prevent bugs and allow more people to use your code.
Classes (obviously) need to be defined before they are used, so here is the
definition of our class.
class RunningMean:
    """Updates a running average"""
    def __init__(self):
        self.current_xbar = 0.0
        self.n = 0
    def update(self, new_x):
        self.n += 1
        self.current_xbar *= (self.n-1)
        self.current_xbar += new_x
        self.current_xbar /= self.n
    def get_current_xbar(self):
        if self.n == 0:
            return None
        else:
            return self.current_xbar
Methods that look like __init__, or that possess names that begin and end with two underscores, are called dunder (double underscore) methods, special methods, or magic methods. There are many that you can take advantage of! For more information, see https://docs.python.org/3/reference/datamodel.html#special-method-names.
5. The update method provides the core functionality using the recursive
formula displayed above.
6. get_current_xbar simply returns the current average. In the case that
this function is called before any data has been seen, it returns None.
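As a quick sketch of the class in action (this demonstration is an illustration, using the same three numbers as the manual version earlier):

my_ave = RunningMean()
for x in [1.0, 3.0, 2.0]:
    my_ave.update(x)
my_ave.get_current_xbar()
## 2.0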
$$\left(\bar{x} - 1.96\sqrt{\frac{\sigma^2}{n}},\ \bar{x} + 1.96\sqrt{\frac{\sigma^2}{n}}\right). \tag{14.3}$$
The width of the interval shrinks as we get more data (as $n \to \infty$). We can write another class that not only calculates the center of this interval, $\bar{x}$, but also returns the interval endpoints.
4 Otherwise known as an independent and identically distributed sample.
If we wrote another class from scratch, then we would need to rewrite a lot of the code that we already have in the definition of RunningMean. Instead, we'll use the idea of inheritance.
import numpy as np

class RunningCI(RunningMean):
    """Updates a running average and
    gives you a known-variance confidence interval"""
    # (the class header and __init__ were lost at a page break; they are
    # reconstructed here from the discussion that follows)
    def __init__(self, known_var):
        super().__init__()
        self.known_var = known_var
    def get_current_interval(self):
        if self.n == 0:
            return None
        else:
            half_width = 1.96 * np.sqrt(self.known_var / self.n)
            left_num = self.current_xbar - half_width
            right_num = self.current_xbar + half_width
            return np.array([left_num, right_num])
The parentheses in the first line of the class definition signal that this new
class definition is inheriting from RunningMean. Inside the definition of this new
class, when I refer to self.current_xbar Python knows what I’m referring to
because it is defined in the base class. Last, I am using super() to access the
base class’s methods, such as __init__.
Inside the inner for loop, there is no need to include conditional logic that tests what type each thing is. We can iterate through time more succinctly.
If, in the future, you add a new class called class7, then you need to change
this inner for loop, as well as provide new code for the class.
from the base class (the sample mean class). This decoupling will have a few
implications. In general, composition is more flexible, but can lead to longer,
uglier code.
class RunningCI2:
    """Updates a running average and
    gives you a known-variance confidence interval"""
    # (the __init__ header and an update method were lost in extraction;
    # they are reconstructed here, and self.n is replaced by self.mean.n
    # because composition keeps the count inside the contained object)
    def __init__(self, known_var):
        self.mean = RunningMean()
        self.known_var = known_var
    def update(self, new_x):
        self.mean.update(new_x)
    def get_current_interval(self):
        if self.mean.n == 0:
            return None
        else:
            half_width = 1.96 * np.sqrt(self.known_var / self.mean.n)
            left = self.mean.get_current_xbar() - half_width
            right = self.mean.get_current_xbar() + half_width
            return np.array([left, right])
14.2 OOP in R
R, unlike Python, has many different kinds of classes. In R, there is not only
one way to make a class. There are many! I will discuss
• S3 classes,
• S4 classes,
• Reference classes, and
• R6 classes.
If you like how Python does OOP, you will like reference classes and R6 classes,
while S3 and S4 classes will feel strange to you.
It’s best to learn about them chronologically, in my opinion. S3 classes came
first, S4 classes sought to improve upon those. Reference classes rely on S4
classes, and R6 classes are an improved version of Reference classes (Wickham,
2014).
This works because these "high-level" functions (such as print()) will look at their input and choose the most appropriate function to call, based on what kind of type the input has. print() is the high-level function. When you run some of the above code, it might not be obvious which specific function print() chooses for each input. You can't see that happening, yet.
Last, recall that this discussion only applies to S3 objects. Not all objects are
S3 objects, though. To find out if an object x is an S3 object, use is.object(x).
plot
length(methods(plot))
## [1] 39
All of these S3 class methods share the same naming convention. Their name
has the generic function’s name as a prefix, then a dot (.), then the name of
the class that they are specifically written to be used with.
Also, these methods are not encapsulated inside a class definition like they are
in Python, either. They just look like loose functions—the method definition
for a particular class is not defined inside the class. These class methods can
be defined just as ordinary functions, out on their own, in whatever file you
think is appropriate to define functions in.
As an example, let’s try to plot() some specific objects.
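The code that built aDF is not shown here; a stand-in consistent with the discussion (the columns are made up):

aDF <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))
plot(aDF)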
Because aDF has its class set to data.frame, this causes plot() to try to find
a plot.data.frame() method. If this method was not found, R would attempt
to find/use a plot.default() method. If no default method existed, an error
would be thrown. In this case, plot() produces a matrix of scatterplots (see
Figure 14.1).
As another example, we can play around with objects created with the ecdf() function. This function computes an empirical cumulative distribution function, which takes a real number as an input, and outputs the proportion of observations that are less than or equal to that input⁹ (see Figure 14.2).
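The construction of myECDF is missing here; a minimal sketch (the input data are an assumption):

myECDF <- ecdf(rnorm(100))
class(myECDF)
## [1] "ecdf" "stepfun" "function"
plot(myECDF)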
This is how inheritance works in S3. The ecdf class inherits from the
stepfun class, which in turn inherits from the function class. When you
call plot(myECDF), ultimately plot.ecdf() is used on this object. However, if
plot.ecdf() did not exist, plot.stepfun() would be tried. S3 inheritance in
R is much simpler than Python’s inheritance!
9 It's defined as $\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} 1(X_i \le x)$.
summary(myThing)
## [1] "No summary available!"
## [1] "Cool Classes are too cool for summaries!"
## [1] ":)"
library(Matrix)
M <- Matrix(10 + 1:28, 4, 7)
isS4(M)
## [1] TRUE
M
## 4 x 7 Matrix of class "dgeMatrix"
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 11 15 19 23 27 31 35
## [2,] 12 16 20 24 28 32 36
## [3,] 13 17 21 25 29 33 37
## [4,] 14 18 22 26 30 34 38
M@Dim
## [1] 4 7
10 https://adv-r.hadley.nz/s4.html
11 https://cran.r-project.org/web/packages/Matrix/vignettes/Intro2Matrix.pdf
Inside an S4 object, data members are called slots, and they are accessed with
the @ operator (instead of the $ operator). Objects can be tested if they are S4
with the function isS4(). Otherwise, they look and feel just like S3 objects.
setClass("RunningMean",
         slots = list(n = "integer",
                      currentXbar = "numeric"))
setClass("RunningCI",
         slots = list(knownVar = "numeric"),
         contains = "RunningMean")
Next we want to define an update() generic function that will work on objects of both types. This is what gives us polymorphism. The generic update() will call specialized methods for objects of class RunningMean and RunningCI.
Recall that in the Python example, each class had its own update method.
Here, we still have a specialized method for each class, but S4 methods don’t
have to be defined inside the class definition, as we can see below.
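The setGeneric() call that produced the output below was lost; it presumably looked something like this (the signature is an assumption):

setGeneric("update", function(object, ...) standardGeneric("update"))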
## Creating a new generic function for 'update' in the global environment
## [1] "update"
Here's a demonstration of using these two classes that mirrors the example in subsection 14.1.3.
So it will feel much more like Python's class system. Some might say that using reference classes will lead to code that is not very R-ish, but they can be useful for certain types of programs (e.g. long-running code, code that performs many/high-dimensional/complicated simulations, or code that circumvents storing a large data set in your computer's memory all at once).
This tells us a few things. First, data members are called fields now. Second, changing class variables is done with the <<- operator. We can use it just as before.
12 https://www.rdocumentation.org/packages/methods/versions/3.6.2/topics/ReferenceClasses
my_ave$current_xbar
## [1] 1
my_ave$n
## [1] 1
my_ave$update(3.)
my_ave$current_xbar
## [1] 2
my_ave$n
## [1] 2
Compare how similar this code looks to the code in 14.1.2! Note the paucity of
assignment operators, and plenty of side effects.
library(R6)
13 https://r6.r-lib.org/articles/Introduction.html
14 https://r6.r-lib.org/articles/Performance.html
14.3 Exercises
14.3.1 Python Questions
1.
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i. \tag{14.4}$$
The coefficients 𝛽0 and 𝛽1 are unknown, and so must be estimated with the data.
Estimating the variance of the noise terms 𝜖𝑖 may also be of interest, but we do not
concern ourselves with that here.
15 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
The formulas for the estimated slope (i.e. $\hat{\beta}_1$) and the estimated intercept (i.e. $\hat{\beta}_0$) are as follows:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{j=1}^{n}(x_j - \bar{x})^2} \tag{14.5}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{14.6}$$
• Name it SimpleLinReg.
• Its .__init__() method should not take any additional parameters. It should set
two data attributes/members est_intercept and est_slope, to np.nan.
• Give it a .fit() method that takes in two 1-dimensional Numpy arrays. The first
should be the array of independent values, and the second should be the set of
dependent values.
• .fit(x,y) should not return anything, but it should store two data attributes/mem-
bers: est_intercept and est_slope. Every time .fit() is called, it will re-calculate
the coefficient/parameter estimates.
• Give it a .get_coeffs() method. It should not make any changes to the data
attributes/members of the class. It should simply return a Numpy array with the
parameter estimates inside. Make the first element the estimated intercept, and
the second element the estimated slope. If no such coefficients have been estimated
at the time of its calling, it should return the same size array but with the initial
np.nans inside.
After you’ve finished writing your first class, you can bask in the glory and run the
following test code:
mod = SimpleLinReg()
mod.get_coeffs()
x = np.arange(10)
y = 1 + .2 * x + np.random.normal(size=10)
mod.fit(x,y)
mod.get_coeffs()
2.
Reconsider the above question that asked you to write a class called SimpleLinReg.
• Write a new class called LinearRegression2 that preserves all of the existing func-
tionality of SimpleLinReg. Do this in a way that does not excessively copy and paste
code from SimpleLinReg.
• Give your new class a method called .visualize() that takes no arguments and plots
the most recent data, the data most recently provided to .fit(), in a scatterplot
with the estimated regression line superimposed.
• Unfortunately, SimpleLinReg().fit(x,y).get_coeffs() will not return estimated
regression coefficients. Give your new class this functionality. In other words, make
LinearRegression2().fit(x,y).get_coeffs() spit out regression coefficients. Hint:
the solution should only require one extra line of code, and it should involve the
self keyword.
3.
Consider the following time series model (West and Harrison, 1989)
$$y_t = \beta_t + \epsilon_t, \qquad \beta_t = \beta_{t-1} + w_t, \qquad \beta_1 = w_1 \tag{14.7}$$
Here $y_t$ is observed time series data, each $\epsilon_t$ is measurement noise with variance $V$, and each $w_t$ is also noise but with variance $W$. Think of $\beta_t$ as a time-varying regression coefficient.

Imagine our data are arriving sequentially. The Kalman Filter (Kalman, 1960) provides an "optimal" estimate of each $\beta_t$ given all of the information we have up to time $t$. What's better is that the algorithm is recursive. Future estimates of $\beta_t$ will be easy to calculate given our estimates of $\beta_{t-1}$.
Let's call the mean of $\beta_{t-1}$ (given all the information up to time $t-1$) $M_{t-1}$, and the variance of $\beta_{t-1}$ (given all the information up to time $t-1$) $P_{t-1}$. Then the Kalman recursions for this particular model are

$$M_t = M_{t-1} + \left(\frac{P_{t-1} + W}{P_{t-1} + W + V}\right)(y_t - M_{t-1}) \tag{14.8}$$

$$P_t = \left(1 - \frac{P_{t-1} + W}{P_{t-1} + W + V}\right)(P_{t-1} + W) \tag{14.9}$$
for $t \ge 1$.

of that array should be $M_t$ plus and minus two standard deviations—a standard deviation at time $t$ is $\sqrt{P_t}$.
• Create a DataFrame called results with the three columns called yt, lower, and
upper. The last two columns should be a sequence of confidence intervals given to
you by the method you wrote. The first column should contain the following data:
[-1.7037539, -0.5966818, -0.7061919, -0.1226606, -0.5431923]. Plot all three
columns in a single line plot. Initialize your Kalman Filter object with both V and W
set equal to .5.
14.3.2 R Questions
1.
Which of the following classes in R produce objects that are mutable? Select all that
apply: S3, S4, reference classes, and R6.
2.
Which of the following classes in R produce objects that can have methods? Select all
that apply: S3, S4, reference classes, and R6.
3.
Which of the following classes in R produce objects that can store data? Select all
that apply: S3, S4, reference classes, and R6.
4.
Which of the following classes in R have encapsulated definitions? Select all that
apply: S3, S4, reference classes, and R6.
5.
Which of the following classes in R have “slots”? Select all that apply: S3, S4, reference
classes, and R6.
6.
Which of the following class systems in R is the newest? S3, S4, reference classes, or
R6?
7.
Which of the following class systems in R is the oldest? S3, S4, reference classes, or
R6?
8.
Which of the following classes in R requires you to library() in something? Select all
that apply: S3, S4, reference classes, and R6.
9.
Suppose you have the following data set: $X_1, \ldots, X_n$. You assume it is a random sample from a Normal distribution with unknown mean and variance parameters, denoted by $\mu$ and $\sigma^2$, respectively. Consider testing the null hypothesis that $\mu = 0$ at a significance level of $\alpha$. To carry out this test, you calculate

$$t = \frac{\bar{X}}{S/\sqrt{n}} \tag{14.10}$$

and you reject the null hypothesis if $|t| > t_{n-1,\alpha/2}$. This is Student's T-Test (Student, 1908). Here $S^2 = \sum_i (X_i - \bar{X})^2/(n-1)$ is the sample variance, and $t_{n-1,\alpha/2}$ is the $1 - \alpha/2$ quantile of a t-distribution with $n-1$ degrees of freedom.
• Write a function called doTTest() that performs the above hypothesis test. It should
accept two parameters: dataVec (a vector of data) and significanceLevel (which
is 𝛼). Have the second parameter default to .05.
• Have it return an S3 object created from a list. The class of this list should be
"TwoSidedTTest". The elements in the list should be named decision and testStat.
The decision object should be either "reject" or "fail to reject". The test stat
should be equal to the calculation you made above for 𝑡.
• Create a summary method for this new class you created: TwoSidedTTest.
10.
Suppose you have a target density $p(x)$ that you are only able to evaluate up to a normalizing constant. In other words, suppose that for some $c > 0$, $p(x) = f(x)/c$, and you are only able to evaluate $f(\cdot)$. Your goal is that you would like to be able to approximate the expected value of $p(x)$ (i.e. $\int x p(x)\,dx$) using some proposal distribution $q(x)$. $q(x)$ is flexible in that you can sample from it, and you can evaluate it. We will use importance sampling (Kahn, 1950a) (Kahn, 1950b) to achieve this.

Algorithm 1: Importance Sampling. i) Sample $X^1, \ldots, X^n$ from $q(x)$; ii) for each sample $x^i$, calculate an unnormalized weight $\tilde{w}^i := \frac{f(x^i)}{q(x^i)}$; iii) calculate the normalized weights $w^i = \tilde{w}^i / \sum_{j=1}^n \tilde{w}^j$, and use $\sum_{i=1}^n w^i x^i$ as the approximation to $\int x p(x)\,dx$.
After you evaluate each $\log \tilde{w}^i$, before you exponentiate them, subtract a number $m$ from all the values. A good choice for $m$ is $\max_i (\log \tilde{w}^i)$. These new values will produce the same normalized weights because

$$\frac{\exp(\log \tilde{w}^i - m)}{\sum_{j=1}^n \exp(\log \tilde{w}^j - m)} = \frac{\tilde{w}^i e^{-m}}{e^{-m} \sum_{j=1}^n \tilde{w}^j} = w^i.$$
15
An Introduction to Functional Programming

• ideally functions will not (refer to and) modify non-local variables; and
Unfortunately, violating the first of these three criteria is very easy to do in both
of our languages. Recall our conversation about dynamic lookup in subsection 6.8.
Both R and Python use dynamic lookup, which means you can’t reliably control
when functions look for variables. Typos in variable names easily go undiscovered,
and modified global variables can potentially wreak havoc on your overall program.
Fortunately it is difficult to modify global variables inside functions in both R and
Python. This was also discussed in subsection 6.8. In Python, you need to make use of
the global keyword (mentioned in Section 6.7.2), and in R, you need to use the rare
super assignment operator (it looks like <<-, and it was mentioned in Section 6.7.1).
Because these two symbols are so rare, they can serve as signals to viewers of your
code about when and where (in which functions) global variables are being modified.
Last, violating the third criterion is easy in Python and difficult in R. This was discussed earlier in Section 6.7. Python can mutate/change arguments that have a mutable type because it has pass-by-assignment semantics (mentioned in Section 6.7.2), and R generally can't modify its arguments at all because it has pass-by-value semantics (Section 6.7.1).
This chapter avoids the philosophical discussion of FP. Instead, it takes the applied
approach, and provides instructions on how to use FP in your own programs. I try to
give examples of how you can use FP, and when these tools are especially suitable.
One of the biggest tip-offs that you should be using functional programming is if you
need to evaluate a single function many times, or in many different ways. This happens
quite frequently in statistical computing. Instead of copy/pasting similar-looking lines
of code, you might consider higher-order functions that take your function as an input,
and intelligently call it in all the many ways you want it to. A third option you might
also consider is to use a loop (c.f. 11.2). However, that approach is not very functional,
and so it will not be heavily-discussed in this section.
Another tip-off that you need FP is if you need many different functions that are all "related" to one another. Should you define each function separately, using excessive copy/pasting? Or should you write a function that can elegantly generate any function you need?
Not repeating yourself and re-using code is a primary motivation, but it is not the only one. Another motivation for functional programming is clearly explained in Advanced R¹ ²:
1 https://adv-r.hadley.nz/fp.html
2 Even though this book only discusses one of our languages of interest, this quote applies to both languages.
All of these sound like good things to have in our code, so let's get started with some examples!

15.1 Functions as Function Inputs in R

Suppose we have a data.frame that has 10 rows and 100 columns. What if we want to take the mean of each column?
An amateurish way to do this would be something like the following.
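The original snippet did not survive here; a sketch of the amateurish approach it describes (myDF is the hypothetical 10-by-100 data frame):

mean1 <- mean(myDF[,1])
mean2 <- mean(myDF[,2])
mean3 <- mean(myDF[,3])
# ...and so on, one line for each of the 100 columns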
You will need one line of code for each column in the data frame! For data frames
with a lot of columns, this becomes quite tedious. You should also ask yourself what
happens to you and your collaborators when the data frame changes even slightly, or
if you want to apply a different function to its columns. Third, the results are not
stored in a single container. You are making it difficult on yourself if you want to use
these variables in subsequent pieces of code.
"Don't repeat yourself" (DRY) is an idea that's been around for a while and is widely accepted (Hunt and Thomas, 2000). DRY is the opposite of WETᵃ.

a https://en.wikipedia.org/wiki/Don%27t_repeat_yourself#WET
Instead, prefer the use of sapply() in this situation. The "s" in sapply() stands for "simplified." In this bit of code, mean() is called on each column of the data frame. sapply() applies the function over columns, instead of rows, because data frames are internally a list of columns.
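The snippet itself is missing here; a sketch of it (reusing the hypothetical myDF):

results <- sapply(myDF, mean)
length(results)
## [1] 100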
Each call to mean() returns a double vector of length 1. This is necessary if you
want to collect all the results into a vector—remember, all elements of a vector
have to have the same type. To get the same behavior, you might also consider using
vapply(myDF, mean, numeric(1)).
In the above case, “simplify” referred to how one-hundred length-1 vectors were
simplified into one length-100 vector. However, “simplified” does not necessarily
imply that all elements will be stored in a vector. Consider the summary function,
which returns a double vector of length 6. In this case, one-hundred length-6 vectors
were simplified into one 6 × 100 matrix.
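For instance (a sketch, again with the hypothetical myDF):

dim(sapply(myDF, summary))
## [1] 6 100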
15.1.2 lapply()

For functions that do not return amenable types that fit into a vector, matrix, or array, the results might need to be stored in a list. In this situation, you would need lapply(). The "l" in lapply() stands for "list". lapply() always returns a list of the same length as the input.
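The call that created myRegs below was lost; judging from the printed summary, it was something like this sketch (the simulation details are assumptions):

myRegs <- lapply(1:100, function(i) { y <- rnorm(10); lm(y ~ 1) })
length(myRegs)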
## [1] 100
class(myRegs[[1]])
## [1] "lm"
summary(myRegs[[12]])
##
## Call:
## lm(formula = y ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6149 -0.8692 -0.2541 0.7596 2.5718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2104 0.4139 0.508 0.623
##
## Residual standard error: 1.309 on 9 degrees of freedom
15.1.3 apply()
I use sapply() and lapply() the most, personally. The next most common function I use is apply(). You can use it to apply functions to rows of rectangular arrays instead of columns. However, it can also apply functions over columns, just as the other functions we discussed can.³
dim(myDF)
## [1] 10 100
results <- apply(myDF, 1, mean)
results[1:4]
## [1] 0.18971263 -0.07595286 0.18400138 0.08895979
³ apply() is everyone's favorite whipping boy whenever it comes to comparing apply() against the other *apply() functions. This is because it is generally a little slower—it is written in R and doesn't call out to compiled C code. However, in my humble opinion, it doesn't matter all that much, because the fractions of a second saved don't always add up in practice.
Complicated filtering criteria can become quite wide, so I prefer to break such code into three steps, as in the sketch below.
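Here is a hypothetical illustration with the Albemarle real estate data (this is not the original example, and the cutoffs are arbitrary):

# step 1: compute each criterion separately
bigHome <- albRealEstate$FinSqFt > 2500
bigLot <- albRealEstate$LotSize > 2
# step 2: combine the criteria
keepRow <- bigHome & bigLot
# step 3: subset the rows
bigProps <- albRealEstate[keepRow, ]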
15.1.4 tapply()
tapply() can be very handy when you need it. We alluded to the definition before in subsection 8.1: a ragged array is a collection of arrays that all have potentially different lengths. I don't typically construct such an object and then pass it to tapply(). Rather, I let tapply() construct the ragged array for me. The first argument it expects is, to quote the documentation, "typically vector-like," while the second tells us how to break that vector into chunks. The third argument is a function that gets applied to each vector chunk.
If I wanted the average home price for each city, I could use something like this.
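A sketch of that call, assuming TotalValue is the home price column:

tapply(albRealEstate$TotalValue, list(albRealEstate$City), mean)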
You might be wondering why we put albRealEstate$City into a list. That seems kind of unnecessary. This is because tapply() can be used with multiple factors—this will break down the vector input into a finer partition. The second argument must be one object, though, so all of these factors must be collected into a list. The following code produces a "pivot table."
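A sketch with two factors (using the Condition column as the assumed second factor):

tapply(albRealEstate$TotalValue,
       list(albRealEstate$City, albRealEstate$Condition),
       mean)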
For functions that return higher-dimensional output, you will have to use something like by() or aggregate() in place of tapply().
15.1.5 mapply()
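mapply() is a multivariate cousin of sapply(): it applies a function to the first elements of each of its container arguments, then to the second elements, and so on⁴. A minimal sketch (the inputs here are assumptions); with SIMPLIFY = FALSE, the results are collected into a list:

mapply(function(m, s) rnorm(3, mean = m, sd = s),
       c(0, 10),
       c(.01, 1),
       SIMPLIFY = FALSE)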
⁴ https://stat.ethz.ch/R-manual/R-devel/library/base/html/mapply.html
15.1.6 Reduce() and do.call()
Unlike the other examples of functions that take other functions as inputs, Reduce() and do.call() do not produce many outputs. Instead of collecting many outputs into a container, they output just one thing.
Let’s start with an example: “combining” data sets. In section 12 we talked about
several different ways of combining data sets. We discussed stacking data sets on top
of one another with rbind() (c.f. subsection 12.2), stacking them side-by-side with
cbind() (also in 12.2), and intelligently joining them together with merge() (c.f. 12.3).
Now consider the task of combining many data sets. How can we combine three or more data sets into one? And how do we do so while abiding by the DRY principle? As the name of the subsection suggests, we can use either Reduce() or do.call() as a higher-order function. Just like the aforementioned *apply() functions, they take in either cbind(), rbind(), or merge() as a function input. Which one do we pick, though? The answer to that question depends on how many arguments our lower-order function takes.
Take a look at the documentation of rbind(). Its first argument is ..., which is the dot-dot-dot⁵ symbol. This means rbind() can take a varying number of data.frames to stack on top of each other. In other words, rbind() is variadic.
On the other hand, take a look at the documentation of merge(). It only takes two data.frames at a time⁶. If we want to combine many data sets, merge() needs a helper function.
This is the difference between Reduce() and do.call(). do.call() calls a function
once on many arguments, so its function must be able to handle many arguments. On
the other hand, Reduce() calls a binary function many times on pairs of arguments.
Reduce()’s function argument gets called on the first two elements, then on the first
output and the third element, then on the second output and fourth element, and so
on.
Here is an initial example that makes use of four data sets: d1.csv, d2.csv, d3.csv, and d4.csv. To start, ask yourself how we would read all of these in. There is a temptation to copy and paste read.csv() calls, but that would violate the DRY principle. Instead,
⁵ https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Dot_002ddot_002ddot
⁶ Although, it is still variadic. The difference is that the dot-dot-dot symbol does not refer to a varying number of data.frames, just a varying number of other things we don't care about in this example.
let's use lapply() with an anonymous function that constructs a file path string and then uses it to read in the data set the string refers to.
numDataSets <- 4
dataSets <- paste0("d",1:numDataSets)
dfs <- lapply(dataSets,
function(name) read.csv(paste0("data/", name, ".csv")))
head(dfs[[3]])
## id obs3
## 1 a 7
## 2 b 8
## 3 c 9
Notice how the above code would only need to be changed by one character if we wanted to increase the number of data sets being read in!⁷
Next, cbind()ing them all together can be done as follows. do.call() will call the
function only once. cbind() takes many arguments at once, so this works. This code
is even better than the above code in that if dfs becomes longer, or changes at all,
nothing will need to be changed.
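A sketch of that one call (the result name is an assumption):

wideDF <- do.call(cbind, dfs)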
What if we wanted to merge() all these data sets together? After all, the id column
appears to be repeating itself, and some data from d2 isn’t lining up.
Reduce(merge, dfs)
## id obs1 obs2 obs3 obs4
## 1 a 1 4 7 10
## 2 b 2 5 8 11
## 3 c 3 6 9 12
Again, this is very DRY code. Nothing would need to be changed if dfs grew. Further-
more, trying to do.call() the merge() function wouldn’t work because it can only
take two data sets at a time.
⁷ To make it even more flexible, we could write code that doesn't assume the files are all named the same way, or are all in the same directory together.
15.2 Functions as Function Inputs in Python

Python also provides several built-in higher-order functions that will accept any callable⁸ as an input.
15.2.1.1 map()
map()⁹ can call a function repeatedly using elements of a container as inputs. Here is an example of calculating outputs of a spline function, which can be useful for coming up with predictors in regression models. This particular spline function is f(x) = (x − k)1(x ≥ k), where k is some chosen "knot point" (see Figure 15.1).
import numpy as np

my_inputs = np.linspace(start = 0, stop = 2*np.pi)

def spline(x):
    knot = 3.0
    if x >= knot:
        return x - knot
    else:
        return 0.0

output = list(map(spline, my_inputs))
We can visualize the mathematical function by plotting its outputs against its inputs. More information on visualization was given in Chapter 13.
map() can also be used like mapply(). In other words, you can apply it to two containers:
import numpy as np

x = np.linspace(start = -1., stop = 1.0)
y = np.linspace(start = -1., stop = 1.0)

def f(x, y):
    return np.log(x**2 + y**2)

list(map(f, x, y))[:3]
## [0.6931471805599453, 0.6098017877588092, 0.5228315638793316]
⁸ https://docs.python.org/3/glossary.html
⁹ https://docs.python.org/3/library/functions.html#map
15.2.1.2 filter()
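The built-in filter() takes a predicate function and a container, and it keeps only the elements for which the predicate returns a truthy value. Here, for instance, we keep only the elements whose squares exceed 2.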
import numpy as np

raw_data = np.arange(0, 1.45, .01)
for elem in filter(lambda x: x**2 > 2, raw_data):
    print(elem)
## 1.42
## 1.43
## 1.44
R's apply() had a MARGIN= input (1 sums rows, 2 sums columns), whereas NumPy's np.apply_along_axis() has an axis= input (0 sums columns, 1 sums rows).
import numpy as np
my_array = np.arange(6).reshape((2,3))
my_array
## array([[0, 1, 2],
## [3, 4, 5]])
np.apply_along_axis(sum, 0, my_array) # summing columns
## array([3, 5, 7])
np.apply_along_axis(sum, 1, my_array) # summing rows
## array([ 3, 12])
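Pandas DataFrames have their own .apply() method¹⁴, which applies a function along one axis of the DataFrame¹⁵. For instance, here we compute the length of every column of the Albemarle real estate data.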
import pandas as pd
alb_real_est = pd.read_csv("data/albemarle_real_estate.csv")
alb_real_est.shape
## (30381, 12)
alb_real_est.apply(len, axis=0) # length of columns
## YearBuilt 30381
## YearRemodeled 30381
## Condition 30381
## NumStories 30381
## FinSqFt 30381
## Bedroom 30381
## FullBath 30381
## HalfBath 30381
## TotalRooms 30381
## LotSize 30381
## TotalValue 30381
## City 30381
## dtype: int64
¹⁴ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
¹⁵ You should know that a lot of special-case functions that you typically apply to rows or columns come built-in as DataFrame methods. For instance, .mean() would allow you to do something like my_df.mean().
Another thing to keep in mind is that DataFrames, unlike ndarrays, don’t have to
have the same type for all elements. If you have mixed column types, then summing
rows, for instance, might not make sense. This just requires subsetting columns before
.apply()ing a function to rows. Here is an example of computing each property’s
“score”.
import pandas as pd

# alb_real_est.apply(sum, axis=1) # can't add letters to numbers!
def get_prop_score(row):
    return 2*row[0] + 3*row[1]

two_col_df = alb_real_est[['FinSqFt','LotSize']]
alb_real_est['Score'] = two_col_df.apply(get_prop_score, 1)
alb_real_est[['FinSqFt','LotSize','Score']].head(2)
## FinSqFt LotSize Score
## 0 5216 5.102 10447.306
## 1 5160 453.893 11681.679
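.apply() will even accept a list of functions; each one is applied to every column.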
alb_real_est[['FinSqFt','LotSize']].apply([sum, len])
## FinSqFt LotSize
## sum 61730306 105063.1892
## len 30381 30381.0000
If you do not want to waste two lines defining a function with def, you can use an
anonymous lambda function. Be careful, though—if your function is complex enough,
then your lines will get quite wide. For instance, this example is pushing it.
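The wide version of the row-score example might look something like this (a sketch; the new column name is an assumption):

alb_real_est['Score2'] = two_col_df.apply(lambda row: 2*row[0] + 3*row[1], axis=1)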
The previous example .apply()s a binary function to each row. The function is binary because it takes two elements at a time. If you want to apply a unary function (i.e., one that takes a single element at a time) to each row, and to each column, then you can use .applymap()¹⁶.
alb_real_est[['FinSqFt','LotSize']].applymap(lambda e : e + 1).head(3)
## FinSqFt LotSize
## 0 5217 6.102
## 1 5161 454.893
## 2 1513 43.590
Last, we have a .groupby()¹⁷ method, which can be used to mirror the behavior of R's tapply(), aggregate(), or by(). It takes the DataFrame it belongs to and groups its rows into multiple sub-DataFrames. The collection of sub-DataFrames has a lot of the same methods that an individual DataFrame has (e.g. the subsetting operators and the .apply() method), which can all be used in a second step of calculating things on each sub-DataFrame.
type(alb_real_est.groupby(['City']))
## pandas.core.groupby.generic.DataFrameGroupBy
type(alb_real_est.groupby(['City'])['TotalValue'])
## pandas.core.groupby.generic.SeriesGroupBy
Here is an example that models some pretty typical functionality. It shows two ways to get the average home price by city. The first line groups the rows by which City they are in, extracts the TotalValue column in each sub-DataFrame, and then .apply()s the np.average() function on the sole column found in each sub-DataFrame. The second .apply()s a lambda function to each sub-DataFrame directly. More details on this "split-apply-combine" strategy can be found in the Pandas documentation¹⁸.
grouped = alb_real_est.groupby(['City'])
grouped['TotalValue'].apply(np.average)
## City
## CHARLOTTESVILLE 429926.502708
## CROZET 436090.502541
## EARLYSVILLE 482711.437566
## KESWICK 565985.092025
## NORTH GARDEN 399430.221519
## SCOTTSVILLE 293666.758242
## Name: TotalValue, dtype: float64
¹⁶ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html#pandas.DataFrame.applymap
¹⁷ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
¹⁸ https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
grouped.apply(lambda df : np.average(df['TotalValue']))
## City
## CHARLOTTESVILLE 429926.502708
## CROZET 436090.502541
## EARLYSVILLE 482711.437566
## KESWICK 565985.092025
## NORTH GARDEN 399430.221519
## SCOTTSVILLE 293666.758242
## dtype: float64
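15.3 Functions as Function Outputs in R

Functions that create and return other functions are often called function factories. A sketch of what the funcFactory() discussed below might look like (the exact definition is an assumption):

funcFactory <- function(greetingMessage) {
  function(name) {
    print(paste(greetingMessage, name))
  }
}
greetWithHello <- funcFactory("Hello")
greetWithHello("Taylor")
## [1] "Hello Taylor"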
Notice that the greetingMessage= argument that is passed in, "Hello", isn’t
temporary anymore. It lives on so it can be used by all the functions created by
funcFactory(). This is the most surprising aspect of writing function factories.
Let’s now consider a more complicated and realistic example. Let’s implement a
variance reduction technique called common random numbers.
Suppose X ∼ Normal(μ, σ²), and we are interested in approximating an expectation of a function of this random variable. Suppose that we don't know that

𝔼[sin(X)] = sin(μ) exp(−σ²/2)    (15.1)
for any particular choice of μ and σ², and instead, we choose to use the Monte Carlo method:

𝔼̂[sin(X)] = (1/n) ∑ᵢ₌₁ⁿ sin(Xᵢ)    (15.2)
where X₁, …, Xₙ ~ Normal(μ, σ²), independent and identically distributed (iid), is a large collection of draws from the appropriate normal distribution, probably coming from a call to rnorm(). In more realistic
situations, the theoretical expectation might not be tractable, either because the
random variable has a complicated distribution, or maybe because the functional is
very complicated. In these cases, a tool like Monte Carlo might be the only available
approach.
Here are two functions that calculate the above quantities for 𝑛 = 1000.
actualExpectSin() is a function that computes the theoretical expectation for any
particular parameter pair. monteCarloSin() is a function that implements the Monte
Carlo approximate expectation.
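Sketches of these two functions, under the assumption that each takes a single length-2 parameter vector c(mu, sigma):

n <- 1000
actualExpectSin <- function(params) {
  # equation (15.1); params = c(mu, sigma)
  sin(params[1]) * exp(-params[2]^2 / 2)
}
monteCarloSin <- function(params) {
  # equation (15.2): average sin() over n draws from Normal(mu, sigma^2)
  mean(sin(rnorm(n, mean = params[1], sd = params[2])))
}

The contour plots below also need grids of parameter values and matrices of evaluations; a sketch of that setup (the grid ranges are assumptions):

muGrid <- seq(0, 2*pi, length.out = 20)
sigmaGrid <- seq(0.1, 2, length.out = 20)
paramGrid <- expand.grid(muGrid, sigmaGrid)
actuals <- matrix(apply(paramGrid, 1, actualExpectSin), nrow = length(muGrid))
mcApprox <- matrix(apply(paramGrid, 1, monteCarloSin), nrow = length(muGrid))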
par(mfrow=c(1,2))
contour(muGrid, sigmaGrid, actuals,
xlab = "mu", ylab = "sigma", main = "actual expects")
contour(muGrid, sigmaGrid, mcApprox,
xlab = "mu", ylab = "sigma", main = "mc without crn")
If we wanted to use common random numbers, we could generate Z₁, …, Zₙ iid from a Normal(0, 1) distribution, and use the fact that

Xᵢ = μ + σZᵢ    (15.3)

to rewrite the estimator as

𝔼̃[sin(X)] = (1/n) ∑ᵢ₌₁ⁿ sin(μ + σZᵢ).    (15.4)
Here is one function that naively implements Monte Carlo with common random
numbers. We generate the collection of standard normal random variables once,
globally. Each time you call monteCarloSinCRNv1(c(10,1)), you get the same answer.
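A sketch of this naive version, re-using n from above:

commonZs <- rnorm(n) # generated once, globally
monteCarloSinCRNv1 <- function(params) {
  # the same standard normal draws are re-used on every call
  mean(sin(params[1] + params[2] * commonZs))
}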
Let’s compare using common random numbers to going without. As you can see in
Figure 15.3, common random numbers make the plot look “smoother.” In other words,
we increase our sampling accuracy without spending more computational time.
FIGURE 15.3: Monte Carlo: With and without common random numbers.
par(mfrow=c(1,1))
This version works, but it has some downsides:
• we have another global variable—a bunch of samples called commonZs—floating around, and
• the dependence on the global variable for sample size is even further obscured.
We can fix these two problems very nicely by using a function factory.
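A sketch of such a factory (the names and the default are assumptions, chosen to match the properties listed below):

monteCarloSinCRNFactory <- function(numSamps = 1000) {
  commonZs <- rnorm(numSamps) # lives in the enclosing scope, not the global one
  function(params) {
    mean(sin(params[1] + params[2] * commonZs))
  }
}
monteCarloSinCRNv2 <- monteCarloSinCRNFactory(1000)

With this version: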
• the desired sample size must be passed in as a function argument instead of being
captured,
• the re-used standard normal variates are not in the global environment anymore,
and
• a sensible default number of samples is provided in the event that the programmer
forgets to specify one.
The inner function did in fact capture commonZs, but it captured from the enclosing
scope, not the global scope. Capturing isn’t always a terrible idea. It would be difficult
to modify these variables, so we don’t need to worry about function behavior changing
in unpredictable ways. Actually capturing a variable instead of passing it in is an
intelligent design choice—now the end-users of functions created by this factory don’t
need to worry about plugging in extra parameters.
Let’s use 1000 samples again and make sure this function works by comparing its
output to the known true function. Run the following code on your own machine.
Note the new Greek letters in the axis labels.
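A sketch of that check, re-using the grids from before (expression() is what produces the Greek letters in the labels):

mcCRN <- matrix(apply(paramGrid, 1, monteCarloSinCRNv2), nrow = length(muGrid))
par(mfrow = c(1, 2))
contour(muGrid, sigmaGrid, actuals,
        xlab = expression(mu), ylab = expression(sigma), main = "actual expects")
contour(muGrid, sigmaGrid, mcCRN,
        xlab = expression(mu), ylab = expression(sigma), main = "mc with crn")

15.4 Functions as Function Outputs in Python

Writing functions that return functions looks much the same in Python. Here is an analogue of the greeting-message factory from the beginning of Section 15.3.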
def func_factory(greeting_message):
    def func(name):
        print(greeting_message + ' ' + name)
    return func
greet_with_hello = func_factory("Hello")
greet_with_hello("Taylor")
## Hello Taylor
greet_with_hello("Charlie")
## Hello Charlie
Let’s consider another less trivial example. Recall the spline function from earlier in
the chapter:
import numpy as np

def spline(x):
    knot = 3.0
    if x >= knot:
        return x - knot
    else:
        return 0.0
This function is limited in that it takes in only one element at a time. Unfortunately, we would not be able to provide an entire NumPy array as an argument (e.g. spline(np.arange(3))). Many functions can accept entire arrays, and it is generally advantageous to take advantage of that. If you recall our discussion about universal functions in section 3.4, you might have grown accustomed to writing vectorized code.
Fortunately, there's a way to automatically vectorize functions like the one above: np.vectorize()¹⁹. np.vectorize() takes in a unary function and outputs a vectorized version of it that is able to take entire arrays as an input. Here's an example—compare it to our use of map() before.
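A sketch of it:

vec_spline = np.vectorize(spline)
vec_spline(np.arange(6)) # entire arrays work now
## array([0., 0., 0., 0., 1., 2.])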
The above code doesn't just demonstrate how to return functions from a function. It is also an example of using functions as function inputs. When a function takes in and spits out functions, there is an alternative way to use it that is unique to Python: you can use function decorators²⁰. You can decorate a function by using the @ operator (Lutz, 2013).
If you decorate a function, it is equivalent to passing that function in to a function factory (aka outer function). That function will take the function you defined, alter it, and then give it back to you with the same name that you chose in the first place. The same idea could be applied to the example from Section 15.3 that implements Monte Carlo sampling using common random numbers.
Before we get too ahead of ourselves, let’s describe the basics. Here is our first decorator
function add_greeting().
def add_greeting(func):
    def wrapper(name):
        print('Salutations, ')
        func(name)
    return wrapper

@add_greeting
def print_name(first_name):
    print(first_name)
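Calling the decorated function now prints the greeting before the name:

print_name("Taylor")
## Salutations,
## Taylor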
You could get the same behavior by typing the following. They are equivalent!
def print_name(first_name):
    print(first_name)

print_name = add_greeting(print_name)
Things can get a little more complicated when your decorators take additional
arguments.
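For instance, we might want to choose the greeting at decoration time, like this (a sketch consistent with the discussion that follows):

@add_greeting("How you doin'")
def print_name(first_name):
    print(first_name)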
So how do we write decorators that accomplish this? The important thing to remember
is that @add_greeting("How you doin'") in the previous code block is equivalent
to writing this after the function definition: print_name = add_greeting("How you
doin'")(print_name). This is a function returning a function returning a function!
The definition of add_greeting() could look something like this.
def add_greeting(greet):
    def decorator(func):
        def wrapper(name):
            print(greet)
            func(name)
        return wrapper
    return decorator
Now that you know how decorators work, you can feel comfortable using third-party ones. You might come across, for example, the @jit decorator from Numba²¹, which will translate your Python function into faster machine code; the @lru_cache decorator from the functools module²²—this can make your code faster by saving some of its outputs—or decorators that perform application-specific tasks like @tf.function²³ from TensorFlow.
15.5 Exercises
15.5.1 Python Questions
1.
2.
a) Import the data "winequality-red.csv", call it wine, and remove all columns
except for fixed acidity, volatile acidity, and quality.
b) Write a function called generate_pred_func(fixed_cutoff, vol_cutoff,
dataset).
• The dataset argument should be a Pandas DataFrame that has three columns called fixed acidity, volatile acidity, and quality.
• The fixed_cutoff argument should be a floating point number that separates fixed acidity into two regions.
²¹ https://numba.pydata.org/
²² https://docs.python.org/3/library/functools.html
²³ https://www.tensorflow.org/api_docs/python/tf/function
After you finish the problem, you should have a definition of a generate_pred_func()
that could be used as follows:
3.
Let's predict what type of activity someone is doing based on measurements taken from
their cell phone. We will begin implementing a K-Nearest Neighbors (KNN)
classifier (Fix and Hodges, 1989) (Cover and Hart, 1967).
Consider the data files "X_train.txt" and "y_train.txt" from (Anguita et al., 2013), which are available from the UCI Machine Learning Repository (Dua and Graff, 2017).
The first data set consists of recorded movements from a cell phone, and the second
data set consists of activity labels of people. Labels 1 through 6 correspond to walking,
walking upstairs, walking downstairs, sitting, standing and laying, respectively.
15.5.2 R Questions
1.
Consider the bivariate density

f(x, y) = (1/(2π)) exp[−(x² + y²)/2].    (1)
The random elements X and Y, in this particular case, are independent, and each has zero mean and unit variance. In this case, the marginal for X is a mean-0, unit-variance normal distribution:
g(x) = (1/√(2π)) exp[−x²/2].    (2)
a) Write a function called fTwoArgs(x,y) that takes two arguments, and returns
the value of the above density in equation (1) at those two points.
b) Write a function called fOneArg(vec) that takes one argument: a length-two vector. It should return the density in equation (1) evaluated at that point.
c) Write a function called gOneArg(x) that evaluates the density in (2).
d) Generate two sequences called xPoints and yPoints. Have them contain the
twenty equally-spaced numbers going from −3 to 3, inclusive.
e) Use expand.grid() to create a data.frame called myGrid. It should have two
columns, and it should contain in its rows every possible pair of two points
from the above sequences. The “x” coordinates should be in the first column.
f) Use mapply() to evaluate the bivariate density on every grid point. Store
your results in a vector mEvals.
g) Use apply() to evaluate the bivariate density on every grid point. Store your
results in a vector aEvals.
h) Use sapply() to evaluate the univariate density on every element of xPoints.
Store your results in a vector sEvals.
i) Use vapply() to evaluate the univariate density on every element of xPoints. Store your results in a vector vEvals.
j) Use lapply() to evaluate the univariate density on every element of xPoints. Store your results in a list lEvals.
k) Generate two plots of the bivariate density. For one, use persp(). For the
other, use contour(). Feel free to revive the code you used in Chapter 13’s
Exercise 1.
l) Generate a third plot of the univariate density. Feel free to revive the code
you used in Chapter 13’s Exercises.
2.
Write a function that reads in all of the data sets contained in any given folder.
3.
Consider the Militarized Interstate Disputes (v5.0) (Palmer et al., 0) data sets again:
"MIDA 5.0.csv", "MIDB 5.0.csv", "MIDI 5.0.csv", and "MIDIP 5.0.csv".
Bibliography
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553.
Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27.
Dua, D. and Graff, C. (2017). UCI machine learning repository.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of
Statistics, 7(1):1–26.
Fisher, R. A. (1988). Iris. UCI Machine Learning Repository.
Fix, E. and Hodges, J. L. (1989). Discriminatory analysis. Nonparametric discrimination: Consistency properties. International Statistical Review / Revue Internationale de Statistique, 57(3):238–247.
Ford, C. (2016). ggplot: Files for UVA StatLab workshop, Fall 2016. https://github.com/clayford/ggplot2.
Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge University Press.
Grolemund, G. (2014). Hands-On Programming with R: Write Your Own Functions
and Simulations. O’Reilly Media.
Guttman, L. (1946). Enlargement Methods for Computing the Inverse Matrix. The
Annals of Mathematical Statistics, 17(3):336–343.
Harrell Jr., F. E., with contributions from Charles Dupont and many others (2021). Hmisc: Harrell Miscellaneous. R package version 4.5-0.
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P.,
Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus,
M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe,
M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W.,
Abbasi, H., Gohlke, C., and Oliphant, T. E. (2020). Array programming with
NumPy. Nature, 585(7825):357–362.
Hunt, A. and Thomas, D. (2000). The Pragmatic Programmer: From Journeyman to Master. Addison-Wesley, Boston.
Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science
& Engineering, 9(3):90–95.
Janosi, A., Steinbrunn, W., Pfisterer, M., and Detrano, R. (1988). Heart Disease.
UCI Machine Learning Repository.
Jones, K. S. (1972). A statistical interpretation of term specificity and its application
in retrieval. Journal of Documentation, 28:11–21.